NECO_a_00813-Nguyen

neco.cls

December 28, 2015

16:36

Communicated by Aapo Hyvarinen

NOTE


A Block Successive Lower-Bound Maximization Algorithm for the Maximum Pseudo-Likelihood Estimation of Fully Visible Boltzmann Machines

Hien D. Nguyen


[email protected]

Ian A. Wood

[email protected]
School of Mathematics and Physics, University of Queensland, St. Lucia, Brisbane, Queensland 4072, Australia


Maximum pseudo-likelihood estimation (MPLE) is an attractive method for training fully visible Boltzmann machines (FVBMs) due to its computational scalability and the desirable statistical properties of the MPLE. No published algorithms for MPLE have been proven to be convergent or monotonic. In this note, we present an algorithm for the MPLE of FVBMs based on the block successive lower-bound maximization (BSLM) principle. We show that the BSLM algorithm monotonically increases the pseudo-likelihood values and that the sequence of BSLM estimates converges to the unique global maximizer of the pseudo-likelihood function. The relationship between the BSLM algorithm and the gradient ascent (GA) algorithm for MPLE of FVBMs is also discussed, and a convergence criterion for the GA algorithm is given.

1 Introduction


Let \(X = (X_1, \ldots, X_d)^T \in \{-1, 1\}^d\) be a d-dimensional random vector, with realization x and probability mass function

\[
P(X = x; \theta) = \frac{1}{Z(\theta)} \exp\left( \frac{1}{2} x^T M x + b^T x \right),
\tag{1.1}
\]

where

\[
Z(\theta) = \sum_{\xi \in \{-1, 1\}^d} \exp\left( \frac{1}{2} \xi^T M \xi + b^T \xi \right),
\]

\(b = (b_1, \ldots, b_d)^T \in \mathbb{R}^d\), and \(M = [m_{jk}]_{j,k=1,\ldots,d} \in \mathbb{R}^{d \times d}\) is a symmetric matrix with \(\mathrm{diag}(M) = 0\). We put the elements of b and the upper triangular elements of M (i.e., \(m_{jk}\), where \(j, k = 1, \ldots, d\) and \(j < k\)) into the parameter vector θ.

Neural Computation 28, 1–8 (2016) doi:10.1162/NECO_a_00813

© Massachusetts Institute of Technology
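To make the model concrete, equation 1.1 can be evaluated by brute-force enumeration of Z(θ) when d is small. The sketch below is our own illustration (the function names are ours, not from the note); it is exponential in d and only meant to pin down the notation:

```python
import itertools
import numpy as np

def unnormalized_mass(x, M, b):
    # exp( (1/2) x^T M x + b^T x ), the unnormalized term in equation 1.1
    return np.exp(0.5 * x @ M @ x + b @ x)

def fvbm_pmf(x, M, b):
    # P(X = x; theta): normalize by Z(theta), the sum over all 2^d vectors in {-1, 1}^d
    d = len(b)
    Z = sum(unnormalized_mass(np.array(xi, dtype=float), M, b)
            for xi in itertools.product([-1, 1], repeat=d))
    return unnormalized_mass(x, M, b) / Z

# Sanity check: with M = 0 and b = 0, all 2^d outcomes are equally likely.
d = 3
M = np.zeros((d, d))   # symmetric with zero diagonal, as required
b = np.zeros(d)
print(fvbm_pmf(np.array([1.0, -1.0, 1.0]), M, b))  # 0.125 = 1/8
```

For realistic d, the 2^d-term sum makes Z(θ) intractable, which is precisely the motivation for pseudo-likelihood methods that avoid the normalizing constant.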


Mass functions of form 1.1 are known as fully visible Boltzmann machines (FVBMs), which are special cases of the Boltzmann machines of Ackley, Hinton, and Sejnowski (1985), with no latent variables. Recently there has been interest in training FVBMs via maximum pseudo-likelihood estimation (MPLE) due to the probabilistic consistency and asymptotic normality of the MPLE (see Hyvarinen, 2006, and Nguyen & Wood, in press, respectively; see Arnold & Strauss, 1991, for a general treatment of MPLE). The statistical properties of MPLEs allow for the construction of hypothesis tests and confidence intervals, such as those in Nguyen and Wood (in press).

There are currently no published algorithms for MPLE that are proven to be convergent or monotonic. In their work, Hyvarinen (2006) and Nguyen and Wood (in press) used gradient ascent (GA) and the Nelder-Mead algorithm (Nelder & Mead, 1965), respectively, neither of which has known convergence results for this problem.

In this note, we present a block successive lower-bound maximization (BSLM) algorithm based on the principles of Razaviyayn, Hong, and Luo (2013). We show that the BSLM algorithm increases the pseudo-likelihood in each iteration and converges to the global maximum of the pseudo-likelihood function. Furthermore, we discuss the relationship between the BSLM and the GA algorithm of Hyvarinen (2006), and we provide simulation results that show the monotonicity of the log-pseudo-likelihood sequences generated by the BSLM algorithm.

2 Maximum Pseudo-Likelihood Estimation and the BSLM Algorithm


Let \(X_1, \ldots, X_n\) be a random sample from an FVBM with some unknown parameter vector \(\theta_0\) (i.e., the parameter components \(b_0\) and \(M_0\) are unknown), and let \(X_i = (X_{i1}, \ldots, X_{id})^T\) for each \(i = 1, \ldots, n\). Following Nguyen and Wood (in press), the log-pseudolikelihood function for the FVBM can be given as

\[
P_n(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{d} \left[ x_{ij}\left( m_j^T x_i + b_j \right) - \log\cosh\left( m_j^T x_i + b_j \right) - \log 2 \right],
\tag{2.1}
\]

where \(m_j\) is the jth column of M. MPLE is conducted by maximizing equation 2.1 to obtain the MPLE, \(\hat{\theta}_n = \arg\max_{\theta} P_n(\theta)\).

Under the BSLM paradigm, we construct an iterative algorithm whereupon, at each iteration and for each coordinate of the parameter vector, we maximize a lower-bounding approximation of the objective function (i.e., equation 2.1) that is simple and has desirable properties. The maximization occurs over blocks or subsets of the parameter vector (e.g., each coordinate) noncontemporaneously. In each iteration, all blocks are updated successively, taking into account previous updates.

The BSLM algorithm for computing \(\hat{\theta}_n\) first requires an initialization \(\theta^{(0)}\). Next, we define the sth iterate of the parameter vector as \(\theta^{(s)}\), and we compute the (s + 1)th iterate in two steps: the b step and the M step. In the b step, we compute the updates

\[
b_j^{(s+1)} = \arg\max_{b_j} u_j^b\left( b_j; b_{[j]}^{(s)}, M^{(s)} \right),
\tag{2.2}
\]

in the order \(j = 1, \ldots, d\). Here,

\[
u_j^b\left( b_j; b_{[j]}^{(s)}, M^{(s)} \right) = P_n\left( b_{[j]}^{(s)}, M^{(s)} \right) + \frac{\partial P_n}{\partial b_j}\left( b_{[j]}^{(s)}, M^{(s)} \right)\left( b_j - b_j^{(s)} \right) - \frac{n}{2}\left( b_j - b_j^{(s)} \right)^2,
\tag{2.3}
\]

and \(b_{[j]}^{(s)} = \left( b_1^{(s+1)}, \ldots, b_{j-1}^{(s+1)}, b_j^{(s)}, \ldots, b_d^{(s)} \right)^T\). Definition 2.2 yields the updates

\[
b_j^{(s+1)} = \frac{1}{n} \frac{\partial P_n}{\partial b_j}\left( b_{[j]}^{(s)}, M^{(s)} \right) + b_j^{(s)},
\tag{2.4}
\]

where

\[
\frac{\partial P_n}{\partial b_j} = \sum_{i=1}^{n} x_{ij} - \sum_{i=1}^{n} \tanh\left( m_j^T x_i + b_j \right).
\]

Next, in the M step, in lexicographic order of the upper triangular elements of M (i.e., \(m_{12}, m_{13}, \ldots, m_{d-2,d}, m_{d-1,d}\)), we compute the updates

\[
m_{jk}^{(s+1)} = \arg\max_{m_{jk}} u_{jk}^m\left( m_{jk}; b^{(s+1)}, M_{[jk]}^{(s)} \right),
\tag{2.5}
\]

where

\[
u_{jk}^m\left( m_{jk}; b^{(s+1)}, M_{[jk]}^{(s)} \right) = P_n\left( b^{(s+1)}, M_{[jk]}^{(s)} \right) + \frac{\partial P_n}{\partial m_{jk}}\left( b^{(s+1)}, M_{[jk]}^{(s)} \right)\left( m_{jk} - m_{jk}^{(s)} \right) - n\left( m_{jk} - m_{jk}^{(s)} \right)^2,
\tag{2.6}
\]


    = 0 and nondiand M [(s) = m[(s) is symmetric with diag M [(s) jk] jk] j k j ,k =1,...,d jk] agonal elements, =

if j < j, or ( j = j and k < k), m(s+1) j  k m(s) otherwise, j  k

of

m[(s) jk] j k

for j , k = 1, . . ., d. Definition 2.5 yields the updates  1 ∂ Pn  (s+1) b j , M [(s) + m(s) , jk] jk 2n ∂m jk

where

Pro

m(s+1) = jk

(2.7)

    ∂ Pn =2 xi j xik − xik tanh mTj xi + b j ∂m jk n

i=1

i=1

n 

ed



n

  xi j tanh mTj xi + b j .

orr ect

i=1
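Updates 2.4 and 2.7 translate directly into code. The following is our own minimal Python sketch of one BSLM iteration (the function names and toy data are ours, not the authors' code); updating the parameters in place means each coordinate update uses the freshest values of all other coordinates, as the block-successive scheme requires:

```python
import numpy as np

def log_pseudolikelihood(X, M, b):
    # Equation 2.1: sum_{i,j} [ x_ij (m_j^T x_i + b_j) - log cosh(m_j^T x_i + b_j) - log 2 ]
    A = X @ M + b                      # A[i, j] = m_j^T x_i + b_j (M is symmetric)
    return float(np.sum(X * A - np.log(np.cosh(A)) - np.log(2.0)))

def bslm_iteration(X, M, b):
    # One BSLM iteration: the b step (update 2.4) followed by the M step (update 2.7).
    n, d = X.shape
    for j in range(d):                               # b step, j = 1, ..., d
        grad = np.sum(X[:, j] - np.tanh(X @ M[:, j] + b[j]))
        b[j] += grad / n                             # update 2.4
    for j in range(d):                               # M step, lexicographic order
        for k in range(j + 1, d):
            grad = (2.0 * np.sum(X[:, j] * X[:, k])
                    - np.sum(X[:, k] * np.tanh(X @ M[:, j] + b[j]))
                    - np.sum(X[:, j] * np.tanh(X @ M[:, k] + b[k])))
            M[j, k] += grad / (2.0 * n)              # update 2.7
            M[k, j] = M[j, k]                        # keep M symmetric
    return M, b

# Toy run on random spin data: the log-pseudo-likelihood must increase monotonically.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(200, 4))
M, b = np.zeros((4, 4)), np.zeros(4)
values = []
for _ in range(20):
    M, b = bslm_iteration(X, M, b)
    values.append(log_pseudolikelihood(X, M, b))
assert all(v2 >= v1 - 1e-9 for v1, v2 in zip(values, values[1:]))
```

Scaling the two gradient steps by a factor μ turns this same loop into the GA algorithm discussed in section 4, with μ = 1 recovering the updates above.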

The b and M steps are iterated until the algorithm converges, whereupon the final iterate is declared the MPLE \(\hat{\theta}_n\). Here, we define convergence in the sense that \(P_n( \theta^{(s+1)} ) - P_n( \theta^{(s)} ) < \tau\) for some sufficiently small tolerance \(\tau > 0\).

3 Convergence Results


For some initialization \(\theta^{(0)}\), if we let \(\tau \to 0\) (or, equivalently, \(s \to \infty\)), then the sequence \(\theta^{(s)}\) goes to \(\theta^*\), where \(\theta^*\) is a limit point of the BSLM algorithm. Using theorem 2 of Razaviyayn et al. (2013), we can state the following convergence result.

Theorem 1. Every limit point \(\theta^*\) of the BSLM algorithm, defined by updates 2.4 and 2.7, is a stationary point of equation 2.1, and the sequence of log-pseudolikelihood values \(P_n(\theta^{(s)})\) is increasing.

Proof. By theorem 2 of Razaviyayn et al. (2013), we obtain the result by checking that \(u_j^b\) and \(u_{jk}^m\) satisfy the following assumptions.

A1: For each \(j = 1, \ldots, d\), \(u_j^b( b_j; b_{[j]}^{(s)}, M^{(s)} ) \le P_n( b_{[j]}, M^{(s)} )\), where \(b_{[j]}\) denotes \(b_{[j]}^{(s)}\) with its jth element replaced by the free variable \(b_j\), with equality if and only if \(b_j = b_j^{(s)}\).

A2: For each \(j, k = 1, \ldots, d\), \(u_{jk}^m( m_{jk}; b^{(s+1)}, M_{[jk]}^{(s)} ) \le P_n( b^{(s+1)}, M_{[jk]} )\), where \(M_{[jk]}\) denotes \(M_{[jk]}^{(s)}\) with its (j, k)th and (k, j)th elements replaced by the free variable \(m_{jk}\), with equality if and only if \(m_{jk} = m_{jk}^{(s)}\).


A3: For each \(j = 1, \ldots, d\), \(u_j^b( b_j; b_{[j]}^{(s)}, M^{(s)} )\) is quasi-concave and continuous in \(b_j\), with a unique global maximizer.

A4: For each \(j, k = 1, \ldots, d\), \(u_{jk}^m( m_{jk}; b^{(s+1)}, M_{[jk]}^{(s)} )\) is quasi-concave and continuous in \(m_{jk}\), with a unique global maximizer.


To show that assumptions A1 and A2 are satisfied, we use the quadratic bound principle (QBP; see equation 8.8 of Lange, 2013), which states that for any real function \(f(v)\), if \(f''(v) \ge \gamma\) for some \(\gamma < 0\), then

\[
f(v) \ge f(w) + f'(w)(v - w) + \frac{\gamma}{2}(v - w)^2,
\]

where \(v, w \in \mathbb{R}\). By the QBP, assumption A1 is satisfied if \(\partial^2 P_n / \partial b_j^2 \ge -n\) for each j. Since

\[
\frac{\partial^2 P_n}{\partial b_j^2} = -n + \sum_{i=1}^{n} \left[ \tanh\left( m_j^T x_i + b_j \right) \right]^2,
\]

the result holds by noting that \([\tanh(v)]^2 \ge 0\) for \(v \in \mathbb{R}\). Similarly, by the QBP, assumption A2 is satisfied if \(\partial^2 P_n / \partial m_{jk}^2 \ge -2n\) for each j and k, which can be confirmed by observing that

\[
\frac{\partial^2 P_n}{\partial m_{jk}^2} = -2n + \sum_{i=1}^{n} \left[ \tanh\left( m_j^T x_i + b_j \right) \right]^2 + \sum_{i=1}^{n} \left[ \tanh\left( m_k^T x_i + b_k \right) \right]^2.
\]
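As a quick numerical illustration of the QBP (our own sanity check, not part of the proof): take \(f(v) = -\log\cosh(v)\), for which \(f''(v) = -[\operatorname{sech}(v)]^2 \ge -1\), so the quadratic lower bound with \(\gamma = -1\) must hold for arbitrary v and w:

```python
import numpy as np

f = lambda v: -np.log(np.cosh(v))   # f''(v) = -sech(v)^2, bounded below by -1
fprime = lambda v: -np.tanh(v)

rng = np.random.default_rng(1)
v = rng.normal(scale=2.0, size=1000)
w = rng.normal(scale=2.0, size=1000)
# QBP with gamma = -1: f(v) >= f(w) + f'(w)(v - w) - (1/2)(v - w)^2
lower = f(w) + fprime(w) * (v - w) - 0.5 * (v - w) ** 2
assert np.all(f(v) >= lower - 1e-12)
```

This is exactly the curvature control exploited above: the −log cosh terms in equation 2.1 are what bound the second derivatives of P_n from below.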


Next, consider that \(u_j^b\) and \(u_{jk}^m\) are concave quadratic functions of \(b_j\) and \(m_{jk}\), respectively, which implies their continuity and the uniqueness of their maximizers. Furthermore, all concave functions are quasi-concave; hence, assumptions A3 and A4 are satisfied.

As Nguyen and Wood (in press) noted, equation 2.1 is strictly concave with respect to θ. This fact can be established by observing that \(d^2 \log\cosh(v)/dv^2 = [\operatorname{sech}(v)]^2 > 0\) for \(v \in \mathbb{R}\); hence, \(P_n(\theta)\) must be strictly concave by composition. By elementary calculus, we obtain the following corollary:

Corollary 1. The limit point \(\theta^*\) of the BSLM algorithm, defined by updates 2.4 and 2.7, is the unique global maximizer of equation 2.1.

4 Relation to Gradient Ascent

In Hyvarinen (2006), a GA algorithm was considered, where updates 2.4 and 2.7 were replaced with


\[
b_j^{(s+1)} = \frac{\mu}{n} \frac{\partial P_n}{\partial b_j}\left( b_{[j]}^{(s)}, M^{(s)} \right) + b_j^{(s)}
\tag{4.1}
\]

and

\[
m_{jk}^{(s+1)} = \frac{\mu}{2n} \frac{\partial P_n}{\partial m_{jk}}\left( b^{(s+1)}, M_{[jk]}^{(s)} \right) + m_{jk}^{(s)},
\tag{4.2}
\]

respectively, for some \(\mu > 0\). Thus, the BSLM algorithm is the μ = 1 case of the GA algorithm. From our convergence results and by the QBP, we can deduce that the GA algorithm with μ ≤ 1 will yield an increasing sequence of pseudo-likelihood values that converges to the global maximum, whereas no such guarantees can be made when μ > 1. Using the same argument as in theorem 1, we note that \(\partial^2 P_n / \partial b_j^2 \ge -n/\mu\) and \(\partial^2 P_n / \partial m_{jk}^2 \ge -2n/\mu\) for any μ ≤ 1. To obtain equation 4.1, it suffices to substitute −n/μ in place of −n in equation 2.3 and to solve the first-order condition (FOC). Similarly, to obtain equation 4.2, it suffices to substitute −2n/μ in place of −2n in equation 2.6 and to solve the FOC.

5 Simulation Results


To demonstrate the increasing property of the BSLM sequence of log-pseudo-likelihood values, we performed a simulation following the design of Hyvarinen (2006). In each of our four simulation scenarios, we simulated a single instance of n = 16,000 observations from a d-dimensional FVBM with parameters \(M_0\) and \(b_0\), for d = 5, 10, 15, 20. For all of the scenarios, the upper triangular values of \(M_0\) and the values of \(b_0\) are each generated from a normal distribution with mean zero and variance 0.5. The initialization of the BSLM algorithm \(\theta^{(0)}\) is simulated in the same manner, and the tolerance is set at \(\tau = 10^{-5}\).

Using the BSLM algorithm, we obtained five sequences of log-pseudo-likelihood values for each scenario, with the results shown in Figure 1. We observed that the log-pseudo-likelihood values are increasing in each simulation, as expected. Furthermore, most of the increase in log-pseudo-likelihood values occurs in early iterations, and the algorithm appears to converge rapidly.

We also calculated the average mean squared error (MSE) over the five repetitions of each scenario to be \(2.09 \times 10^{-6}\), \(5.96 \times 10^{-7}\), \(4.64 \times 10^{-7}\), and \(2.89 \times 10^{-7}\) for d = 5, 10, 15, 20, respectively. Here, the average MSE is computed as

\[
(qR)^{-1} \sum_{j=1}^{R} \left( \hat{\theta}_{nj} - \theta_{0j} \right)^T \left( \hat{\theta}_{nj} - \theta_{0j} \right),
\]

where \(\theta_{0j}\) and \(\hat{\theta}_{nj}\) are the true parameter and MPL estimate, respectively, for repetitions \(j = 1, \ldots, R\), R = 5, and q is the number of elements of the parameter vectors. The average MSE values found were small and conformed to the theoretical results of Nguyen and Wood (in press).
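The average MSE above can be written as a short function; this sketch is our own, and the placeholder parameter vectors below are illustrative values, not the paper's simulation output:

```python
import numpy as np

def average_mse(estimates, truths):
    # (qR)^{-1} sum_{j=1}^{R} (theta_hat_nj - theta_0j)^T (theta_hat_nj - theta_0j),
    # where R is the number of repetitions and q the length of the parameter vector
    R, q = len(estimates), len(truths[0])
    return sum(float((e - t) @ (e - t)) for e, t in zip(estimates, truths)) / (q * R)

theta0 = np.array([0.1, -0.2, 0.3])          # placeholder true parameter (q = 3)
truths = [theta0, theta0]                    # same truth for both repetitions (R = 2)
estimates = [theta0 + 0.01, theta0 - 0.01]   # placeholder MPL estimates
print(average_mse(estimates, truths))        # each error is 0.01 per element -> approx 1e-4
```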


Figure 1: BSLM-obtained sequences of log-pseudo-likelihood values for the five repetitions of each of the four simulation scenarios. The solid lines are the log-pseudo-likelihood values (left axis), and the dashed lines are the increases in values (right axis) in each iteration (in log base 10). Each point shape indicates a different repetition.

6 Conclusion


In this note, we have presented a BSLM algorithm for the MPLE of the FVBM. Furthermore, we have shown that the pseudo-likelihood sequence generated by the algorithm is monotonically convergent to the unique global maximum. Using the convergence results for the BSLM algorithm, we have also deduced a convergence criterion for the GA algorithm of Hyvarinen (2006).

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Arnold, B. C., & Strauss, D. (1991). Pseudolikelihood estimation: Some examples. Sankhya B, 53, 233–243.

Hyvarinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18, 2283–2292.

Lange, K. (2013). Optimization. New York: Springer.

Nelder, J. A., & Mead, R. (1965). A simplex algorithm for functional minimization. Computer Journal, 7, 308–313.


Nguyen, H. D., & Wood, I. A. (in press). Asymptotic normality of the maximum pseudolikelihood estimator for fully visible Boltzmann machines. IEEE Transactions on Neural Networks and Learning Systems.

Razaviyayn, M., Hong, M., & Luo, Z.-Q. (2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23, 1126–1153.


Received June 26, 2015; accepted November 6, 2015.
