Scalable Nonparametric Low-Rank Kernel Learning Using Block Coordinate Descent

En-Liang Hu and James T. Kwok

Manuscript received April 3, 2014; revised July 28, 2014; accepted September 27, 2014. Date of publication October 17, 2014; date of current version August 17, 2015. This work was supported in part by the Research Grants Council, Hong Kong, under Grant 614513, in part by the National Natural Science Foundation of China under Grant 61165012, in part by the Natural Science Foundations Fund of Yunnan Province under Grant 2011FZ074, and in part by the Department of Education Fund of Yunnan Province under Grant 2011Y305. E.-L. Hu is with the Department of Mathematics, Yunnan Normal University, Kunming 650092, China, and was with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong (e-mail: [email protected]). J. T. Kwok is with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2014.2361159

Abstract—Nonparametric kernel learning (NPKL) is a flexible approach to learn the kernel matrix directly without assuming any parametric form. It can be naturally formulated as a semidefinite program (SDP), which, however, is not very scalable. To address this problem, we propose the combined use of low-rank approximation and block coordinate descent (BCD). Low-rank approximation avoids the expensive positive semidefinite constraint in the SDP by replacing the kernel matrix variable with $\mathbf{V}^\top\mathbf{V}$, where $\mathbf{V}$ is a low-rank matrix. The resultant nonlinear optimization problem is then solved by BCD, which optimizes each column of $\mathbf{V}$ sequentially. It can be shown that the proposed algorithm has nice convergence properties and low computational complexities. Experiments on a number of real-world data sets show that the proposed algorithm outperforms state-of-the-art NPKL solvers.

Index Terms—Block coordinate descent (BCD), clustering analysis, kernel learning, low-rank approximation, semidefinite programming (SDP).

I. INTRODUCTION

KERNEL methods have been highly successful in various machine learning problems, including classification, regression, clustering, ranking, and dimensionality reduction [1]. Because of the central role of the kernel, it is important to identify an appropriate kernel function/matrix for the task at hand. In the past decade, there has been a large body of literature on this kernel learning problem. Many of these works assume that the kernel has a predefined parametric form [2] or is a combination of multiple kernels [3]–[5]. On the other hand, nonparametric kernel learning (NPKL) makes no such assumptions and directly learns a kernel matrix from the data. It is thus more flexible, and has received significant interest in recent years [6]–[13].

In NPKL, the supervisory information can be in the form of class labels [6], [7], or weaker side information that is

often more readily available [8]–[13]. A popular form of such side information is the pairwise must-link and cannot-link constraints [14]. A must-link constraint on two instances specifies that they should belong to the same class, while a cannot-link constraint specifies that they should belong to different classes. Obviously, these pairwise constraints are less informative than class labels, in the sense that the pairwise constraints can be generated from class labels, but not vice versa.

Since NPKL aims to learn a kernel matrix [which has to be positive semidefinite (psd) [1]], it is naturally formulated as a semidefinite program (SDP) [6], [15]. However, standard SDP solvers, which are often based on interior-point methods, can take $O(n^{6.5})$ time on a data set with $n$ samples [13]. This is computationally expensive even on medium-sized data sets. To address this problem, Hoi et al. [8] proposed to first transform the SDP to its dual, and then solve it with a strategy similar to the sequential minimal optimization algorithm [16]. Recently, this has been further improved by the SimpleNPKL algorithm [12], [13]. In particular, when a linear loss is used between the target kernel matrix and the given pairwise constraints, the solution can be obtained efficiently in closed form. However, for other loss functions, a closed-form solution is no longer available. SimpleNPKL reformulates the SDP as a saddle-point optimization problem, which is then solved iteratively by alternating the optimization of the kernel matrix (which still has a closed-form solution) and the dual variables. However, obtaining this closed-form kernel matrix solution requires eigenvalue decomposition, which can be computationally expensive for large $n$.

In many machine learning problems, low-rank approximation is a prominent tool that often drastically reduces the computational effort [17], [18]. In the context of kernel learning, one can replace the $n \times n$ matrix $\mathbf{K}$ by $\mathbf{V}^\top\mathbf{V}$, where $\mathbf{V} \in \mathbb{R}^{r \times n}$ and $r \ll n$ is the rank. The psd constraint in the original SDP, which is expensive to enforce, can now be dropped. To handle the resultant low-rank SDP, the low-rank semidefinite programming (SDPLR) solver in [19] can be used. It solves the resultant nonlinear optimization problem using the method of generalized augmented Lagrangian and the limited-memory BFGS algorithm [20]. However, the procedure is iterative, and each iteration requires computing the augmented Lagrangian objective and its gradient, both of which can be computationally expensive.

In this paper, we combine the use of low-rank approximation and block coordinate descent (BCD) [21] to solve the optimization problem in NPKL. Because of its simplicity


and efficiency, BCD has recently witnessed a resurgence of interest in machine learning [22]–[29]. More specifically, we first replace the matrix variable $\mathbf{K}$ in NPKL by $\mathbf{V}^\top\mathbf{V}$ as described above. The resultant nonlinear program is then solved by a block-coordinate-based Gauss–Seidel iteration, with each block corresponding to a column of $\mathbf{V}$. It will be shown that for a variety of loss functions, the resultant optimization subproblems (each involving just one block) can be solved efficiently. Moreover, the proposed procedure has nice theoretical and computational properties. For example, it converges to a stationary point of the original SDP. Its space and per-iteration time complexities are also linear in $n$, and it is thus more scalable than existing NPKL algorithms. Moreover, this allows for an efficient out-of-sample extension. In other words, new unseen samples can be easily incorporated into a learned kernel matrix without complete retraining.

The rest of this paper is organized as follows. Section II first reviews the NPKL problem formulation and its state-of-the-art solver SimpleNPKL. Section III describes the proposed NPKL solver based on low-rank approximation and BCD. A discussion of its theoretical and computational properties is also provided. Experimental results are presented in Section IV, and the last section gives some concluding remarks.

A preliminary version has been reported in [11]. However, as will be discussed in more detail in Section III-F, the approach in [11] involves the inversion of a large matrix, and thus is not computationally attractive. In contrast, the revised algorithm proposed here avoids this matrix inversion problem. In addition to providing a more efficient NPKL algorithm and a more thorough literature review, this paper: 1) extends the use of BCD to the linear loss (as will be seen from the experimental results, this also leads to good accuracy and scalability); 2) provides a detailed complexity and convergence analysis; and 3) provides an efficient extension for out-of-sample prediction, while the NPKL formulation in [11] is only transductive.

Notation: In the sequel, matrices and vectors are denoted in bold, with upper-case letters for matrices and lower-case letters for vectors. The transpose of a vector/matrix is denoted by the superscript $^\top$, $\mathbf{0}$ is the zero vector, $\mathbf{1}$ is the vector of all ones, and $\mathbf{I}$ is the identity matrix. For a square matrix $\mathbf{A}$, $\operatorname{tr}(\mathbf{A})$ is its trace, $\|\mathbf{A}\|_F$ its Frobenius norm, $\lambda_{\min}(\mathbf{A})$ its smallest eigenvalue, and $\mathbf{A} \succeq 0$ means that $\mathbf{A}$ is symmetric and psd.

II. RELATED WORK

A. Nonparametric Kernel Learning

Let $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ be a given set of $n$ samples. The learning performance can often be improved by exploiting the data's intrinsic geometric structure. A popular approach is to represent the underlying data manifold by a weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each sample in $\mathcal{X}$ becomes a vertex in $\mathcal{V}$. For the edge $e_{ij} \in \mathcal{E}$ connecting samples $\mathbf{x}_i$ and $\mathbf{x}_j$, the weight $S_{ij}$ represents their similarity. A common choice is the Gaussian similarity $S_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)$, where $\sigma$ is a user-defined parameter. The data geometry can then be captured by the Laplacian matrix of $\mathcal{G}$, or its normalized

variant

$$\mathbf{L} = (1+\delta)\mathbf{I} - \mathbf{D}^{-\frac{1}{2}}\mathbf{S}\mathbf{D}^{-\frac{1}{2}} \tag{1}$$

where $\mathbf{S} = [S_{ij}] \in \mathbb{R}^{n \times n}$, $\mathbf{D} = \operatorname{diag}(d_1, d_2, \ldots, d_n)$ and $d_i = \sum_{j=1}^{n} S_{ij}$. Here, an additional $\delta\mathbf{I}$ (with $\delta > 0$) is introduced to prevent $\mathbf{L}$ from being singular. Manifolds have been used successfully in various graph-based semisupervised learning algorithms [30]–[32], and in learning problems such as manifold learning, dimension reduction, and clustering [33], [34].

In NPKL [8], [12], [13], we are also given a set $\mathcal{T} = \{(i, j)\}$ of $m$ pairwise must-link and cannot-link constraints. A matrix $\mathbf{T} \in \{\pm 1\}^{n \times n}$, with element

$$T_{ij} = \begin{cases} +1 & \text{if } (\mathbf{x}_i, \mathbf{x}_j) \text{ is a must-link pair in } \mathcal{T} \\ -1 & \text{if } (\mathbf{x}_i, \mathbf{x}_j) \text{ is a cannot-link pair in } \mathcal{T} \end{cases} \tag{2}$$

can be used to represent $\mathcal{T}$. For $(\mathbf{x}_i, \mathbf{x}_j)$ pairs that are not associated with any must-link or cannot-link constraint, the corresponding $T_{ij}$ entries are missing and will not be accessed by the algorithm.

The goal of NPKL is to learn a kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ that is consistent with both $\mathbf{L}$ and $\mathbf{T}$. Following [8], it can be formulated as the following SDP:

$$\min_{\mathbf{K}} \; \operatorname{tr}(\mathbf{L}\mathbf{K}) + \gamma \sum_{(i,j) \in \mathcal{T}} \ell(T_{ij} K_{ij}) \tag{3}$$

$$\text{s.t.} \quad \mathbf{K} \succeq 0 \tag{4}$$

where $\ell(\cdot)$ is the loss function between $\mathbf{K}$ and $\mathbf{T}$ (e.g., Hoi et al. [8] use the hinge loss; Zhuang et al. [13] also consider the linear loss, square loss, and squared hinge loss), and $\gamma$ is a tradeoff parameter. The first term in (3) is similar to manifold regularization [30] and enforces smoothness of $\mathbf{K}$ over the manifold, while the second term penalizes inconsistency between $\mathbf{K}$ and the given constraints in $\mathcal{T}$. After solving this SDP, the learned $\mathbf{K}$ can be directly plugged into a kernel classifier (such as the SVM) for classification, or the kernel k-means algorithm for clustering. Recently, NPKL has also been extended for learning the kernel matrix in data embedding algorithms (such as colored maximum variance unfolding [35] and minimum volume embedding [36]). The optimization problem in (3) has to be slightly modified so as to incorporate the centering and distance-preserving constraints in data embedding. Interested readers are referred to [13] for details.
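To make the inputs of (1)–(3) concrete, the sketch below builds the normalized Laplacian from Gaussian k-nearest-neighbor similarities and assembles the constraint matrix of (2). This is a minimal dense NumPy illustration rather than the authors' code; the function names, the symmetrization of the k-NN graph, and the default values of k, sigma, and delta are our own assumptions.

```python
import numpy as np

def normalized_laplacian(X, k=5, sigma=1.0, delta=0.01):
    """L = (1 + delta) I - D^{-1/2} S D^{-1/2} from a Gaussian k-NN graph (eq. (1))."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    S = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)                        # no self-loops, so S_ii = 0
    # keep only the k most similar neighbors of each sample (symmetrized)
    idx = np.argsort(-S, axis=1)[:, :k]
    mask = np.zeros_like(S, dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    S = np.where(mask | mask.T, S, 0.0)
    d = S.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return (1.0 + delta) * np.eye(n) - Dinv_sqrt @ S @ Dinv_sqrt

def constraint_matrix(n, must_link, cannot_link):
    """T_ij = +1 for must-link pairs, -1 for cannot-link pairs, 0 (treated as missing) otherwise (eq. (2))."""
    T = np.zeros((n, n))
    for i, j in must_link:
        T[i, j] = T[j, i] = 1.0
    for i, j in cannot_link:
        T[i, j] = T[j, i] = -1.0
    return T
```

In practice, S and L would be stored as sparse matrices so that the per-iteration costs discussed in Section III-C apply.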

B. SimpleNPKL

The time complexity of solving the SDP in (3) directly can be as high as $O(n^{6.5})$ [13]. A more efficient approach is the recently proposed SimpleNPKL [12], [13]. For the optimization problem in (3), it first adds the following constraint that further regularizes the complexity of $\mathbf{K}$:

$$\operatorname{tr}(\mathbf{K}^p) \le B. \tag{5}$$

Here, $p$ is an integer (typically set to 2), and $B > 0$ is a user-defined parameter.

When the linear loss $\ell(T_{ij}K_{ij}) = -T_{ij}K_{ij}$ is used, it can be seen that the modified problem (3) reduces to

$$\min_{\mathbf{K}} \; \operatorname{tr}(\mathbf{A}\mathbf{K}) \quad \text{s.t.} \quad \mathbf{K} \succeq 0, \; \operatorname{tr}(\mathbf{K}^p) \le B \tag{6}$$

where

$$\mathbf{A} = \mathbf{L} - \gamma\mathbf{T} \tag{7}$$

and $\mathbf{T}$ is the $n \times n$ matrix with its $(i, j)$ entry set to $T_{ij}$ for $(i, j) \in \mathcal{T}$, and 0 otherwise. A key observation is that with the additional constraint (5), problem (6) has the following closed-form solution:

$$\mathbf{K} = \left(\frac{B}{\operatorname{tr}\bigl(\mathbf{A}_+^{\frac{p}{p-1}}\bigr)}\right)^{\frac{1}{p}} \mathbf{A}_+^{\frac{1}{p-1}} \tag{8}$$

where $\mathbf{A}_+$ is the projection of $\mathbf{A}$ onto the psd cone. Moreover, because the graph $\mathcal{G}$ is often sparse (and so $\mathbf{L}$ is a sparse matrix) and the number of pairwise constraints is small, $\mathbf{A}$ in (7) is also sparse. Hence, computing the psd projection $\mathbf{A}_+$, which requires eigenvalue decomposition of $\mathbf{A}$, can be sped up by the Lanczos algorithm [37]. However, in general, this eigenvalue decomposition can still be computationally expensive, especially when $n$ is large.

When other loss functions are used in (3), a closed-form solution for $\mathbf{K}$ is no longer available. SimpleNPKL then reformulates (3) [together with the model complexity constraint in (5)] as a saddle-point optimization problem. Using an update strategy proposed in [38], this is solved by alternately minimizing $\mathbf{K}$ [which has a closed-form solution similar to (8)] and the dual variables based on gradient ascent.

III. LOW-RANK NPKL USING COORDINATE DESCENT

As the target kernel matrix $\mathbf{K}$ is symmetric and psd, it can be factorized as

$$\mathbf{K} = \mathbf{V}^\top\mathbf{V} \tag{9}$$

where $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_n]$, $\mathbf{v}_i \in \mathbb{R}^r$ and $r$ is the rank of $\mathbf{K}$. Typically, $r$ is much smaller than $n$. Each $\mathbf{v}_i$ can be viewed as a new data representation for the corresponding sample $\mathbf{x}_i$. Substituting (9) into (3), we obtain

$$\min_{\mathbf{V}} \; \Phi(\mathbf{V}) \equiv \operatorname{tr}(\mathbf{V}\mathbf{L}\mathbf{V}^\top) + \gamma \sum_{(i,j)\in\mathcal{T}} \ell(T_{ij}\mathbf{v}_i^\top\mathbf{v}_j). \tag{10}$$

Note that the psd constraint (4), which is expensive to enforce, is no longer needed. However, the price to pay is that the optimization problem now becomes nonconvex. Note also that the trick of replacing a psd matrix variable by its low-rank factorization has been similarly used for solving general SDPs [19]. Their algorithm, called SDPLR, solves the resultant nonlinear optimization problem using the method of generalized augmented Lagrangian and the limited-memory BFGS algorithm [20]. However, the procedure is iterative, and each iteration requires computing the augmented Lagrangian objective and its gradient, both of which can be computationally expensive. In addition, though SimpleNPKL does not explicitly use the decomposition in (9), in practice, it assumes that $\mathbf{K}$ is low-rank so that a full eigenvalue decomposition of $\mathbf{A}$ in (7) can be avoided.

In the following, we propose a more efficient approach based on BCD [21]. In general, BCD divides the set of variables into multiple blocks, and then minimizes them one by one. Here, we consider each $\mathbf{v}_i$ in (10) as a block. Starting from an initial estimate $\mathbf{V}^{(0)}$ of $\mathbf{V}$, we update the estimate $\mathbf{V}^{(t)}$ in the $t$th iteration column-by-column, with the $i$th subproblem being

$$\mathbf{v}_i^{(t+1)} = \arg\min_{\mathbf{v}_i} \; \Phi\bigl(\mathbf{v}_1^{(t+1)}, \ldots, \mathbf{v}_{i-1}^{(t+1)}, \mathbf{v}_i, \mathbf{v}_{i+1}^{(t)}, \ldots, \mathbf{v}_n^{(t)}\bigr). \tag{11}$$

The following sections show that (11) can be easily solved.

A. Samples Without Pairwise Constraint

Let $\mathcal{U}$ be the set of samples that are not associated with any pairwise constraint, and $\mathcal{L}$ the set of samples that are associated with some pairwise constraints (without loss of generality, assume the samples are ordered so that $\mathcal{L} = \{1, \ldots, l\}$ and $\mathcal{U} = \{l+1, \ldots, n\}$). When sample $i \in \mathcal{U}$, updating $\mathbf{v}_i^{(t)}$ in (11) is straightforward. Specifically, the second term in (10) disappears and (11) reduces to

$$\min_{\mathbf{v}_i} \; \Phi_i(\mathbf{v}_i) \equiv L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + 2\mathbf{v}_i^\top \sum_{k: k \ne i} L_{ik}\mathbf{v}_k. \tag{12}$$

Setting the derivative $\partial\Phi_i/\partial\mathbf{v}_i = 2L_{ii}\mathbf{v}_i + 2\sum_{k: k \ne i} L_{ik}\mathbf{v}_k$ to zero, we obtain

$$\mathbf{v}_i = -\frac{1}{L_{ii}} \sum_{k: k \ne i} L_{ik}\mathbf{v}_k. \tag{13}$$

Instead of updating the $\mathbf{v}_i$'s one by one sequentially, we can alternatively update all of them together as a single block. Let $\bar{\mathbf{D}} = \operatorname{diag}(\mathbf{L})$ and

$$\bar{\mathbf{L}} = \mathbf{L} - \bar{\mathbf{D}}. \tag{14}$$

Then, $\bar{L}_{ii} = 0$, and (13) can be rewritten as

$$\mathbf{v}_i = -\frac{1}{\bar{D}_{ii}} \sum_{k=1}^{n} \bar{L}_{ik}\mathbf{v}_k = -\frac{1}{\bar{D}_{ii}} \mathbf{V}\bar{\mathbf{L}}_{\cdot i}$$

where $\bar{\mathbf{L}}_{\cdot i}$ is the $i$th column of $\bar{\mathbf{L}}$. Hence

$$\mathbf{V}_{\mathcal{U}} \equiv [\mathbf{v}_{l+1}, \ldots, \mathbf{v}_n] = \mathbf{V}\mathbf{G}\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1} \tag{15}$$

where

$$\mathbf{G} = -\bar{\mathbf{L}}(:, \mathcal{U}), \quad \bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}} = \bar{\mathbf{D}}(\mathcal{U}, \mathcal{U}). \tag{16}$$

Here, we use the MATLAB-like notation: $\bar{\mathbf{L}}(:, \mathcal{U})$ denotes the submatrix of $\bar{\mathbf{L}}$ that contains all the columns indexed by $\mathcal{U}$ (and similarly for $\bar{\mathbf{D}}(\mathcal{U}, \mathcal{U})$). Rearranging the columns of $\mathbf{V}$ as $[\mathbf{V}_{\mathcal{L}}, \mathbf{V}_{\mathcal{U}}]$ and the rows of $\mathbf{G}$ as $\begin{bmatrix}\mathbf{G}_{\mathcal{L}} \\ \mathbf{G}_{\mathcal{U}}\end{bmatrix}$, (15) can be rewritten as

$$\mathbf{V}_{\mathcal{U}} = (\mathbf{V}_{\mathcal{L}}\mathbf{G}_{\mathcal{L}} + \mathbf{V}_{\mathcal{U}}\mathbf{G}_{\mathcal{U}})\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1} = \mathbf{V}_{\mathcal{L}}\mathbf{G}_{\mathcal{L}}(\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}} - \mathbf{G}_{\mathcal{U}})^{-1}.$$

When $|\mathcal{U}|$ is large, computing $(\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}} - \mathbf{G}_{\mathcal{U}})^{-1}$ can be expensive. Hence, with an initial $\mathbf{V}_{\mathcal{U}}^{(0)}$, we instead update $\mathbf{V}_{\mathcal{U}}$ recursively as

$$\mathbf{V}_{\mathcal{U}}^{(t+1)} = \bigl(\mathbf{V}_{\mathcal{L}}\mathbf{G}_{\mathcal{L}} + \mathbf{V}_{\mathcal{U}}^{(t)}\mathbf{G}_{\mathcal{U}}\bigr)\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1}, \quad t = 0, 1, 2, \ldots \tag{17}$$
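As a concrete illustration of this block update, the sketch below runs a few passes of the recursion (17), keeping the constrained columns $\mathbf{V}_{\mathcal{L}}$ fixed. It assumes the samples are ordered with the constrained ones first; the function name, the number of inner passes, and the dense-matrix treatment are our own simplifications (Proposition 1 and Corollary 1 below justify the convergence of this recursion).

```python
import numpy as np

def update_unconstrained_block(V, L, n_labeled, n_inner=5):
    """Block update of V_U by the fixed-point recursion (17).

    V: r x n matrix of current column representations.
    L: n x n Laplacian from (1); columns 0..n_labeled-1 are the constrained samples.
    """
    d_bar = np.diag(L).copy()                 # \bar{D} = diag(L)
    L_bar = L - np.diag(d_bar)                # \bar{L} = L - \bar{D}, eq. (14)
    G = -L_bar[:, n_labeled:]                 # G = -\bar{L}(:, U), eq. (16)
    G_L, G_U = G[:n_labeled, :], G[n_labeled:, :]
    d_UU = d_bar[n_labeled:]                  # diagonal of \bar{D}_{UU}
    V_L = V[:, :n_labeled]
    V_U = V[:, n_labeled:].copy()
    for _ in range(n_inner):                  # recursion (17)
        V_U = (V_L @ G_L + V_U @ G_U) / d_UU
    V_new = V.copy()
    V_new[:, n_labeled:] = V_U
    return V_new
```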

Proposition 1: The spectral radius of $\mathbf{G}_{\mathcal{U}}\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1}$ is smaller than 1.

Proof: Let $\rho(\mathbf{A})$ be the spectral radius of matrix $\mathbf{A}$. As there is no self-loop in the graph $\mathcal{G}$, $S_{ii} = 0$ and so $\operatorname{diag}(\mathbf{D}^{-\frac{1}{2}}\mathbf{S}\mathbf{D}^{-\frac{1}{2}}) = \mathbf{0}$. Thus, from (14), $\bar{\mathbf{D}} = (1+\delta)\mathbf{I}$ and $\bar{\mathbf{L}} = -\mathbf{D}^{-\frac{1}{2}}\mathbf{S}\mathbf{D}^{-\frac{1}{2}}$. Moreover, from the definitions in (16), $\mathbf{G}_{\mathcal{U}} = -\bar{\mathbf{L}}_{\mathcal{U}\mathcal{U}}$ and $\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1} = \frac{1}{1+\delta}\mathbf{I}$. As the eigenvalues of $\bar{\mathbf{L}}$ are in $[-1, 1]$ [31], we have $\rho(\bar{\mathbf{L}}) \le 1$. Using the interlacing property for symmetric matrices [39], we have $\rho(\mathbf{G}_{\mathcal{U}}) \le \rho(\bar{\mathbf{L}}) \le 1$. Thus, $\rho(\mathbf{G}_{\mathcal{U}}\bar{\mathbf{D}}_{\mathcal{U}\mathcal{U}}^{-1}) = \frac{1}{1+\delta}\rho(\mathbf{G}_{\mathcal{U}}) \le \frac{1}{1+\delta} < 1$.

Corollary 1: The iteration in (17) converges to a fixed point.

Proof: The result follows immediately from the fact that (17) is a contraction mapping.

Fig. 1. Loss functions considered in this paper. (a) Hinge loss: $\max(1-z, 0)$. (b) Squared hinge loss: $(\max(1-z, 0))^2$. (c) Square loss: $(1-z)^2$. (d) Linear loss: $-z$.

B. Sample $i \in \mathcal{L}$

When sample $i$ is associated with some pairwise constraint (i.e., $i \in \mathcal{L}$), the optimization problem in (11) becomes

$$\min_{\mathbf{v}_i} \; \tilde{\Phi}_i(\mathbf{v}_i) \equiv \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top \sum_{k: k\ne i} L_{ik}\mathbf{v}_k + \frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} \ell(T_{ij}\mathbf{v}_i^\top\mathbf{v}_j) \tag{18}$$

where $\mathcal{T}(i) = \{j \mid (i,j) \in \mathcal{T}\}$. The following sections show that its solution can be efficiently obtained for a number of commonly used loss functions (Fig. 1).

1) Hinge Loss $\ell(z) = \max(1-z, 0)$: Problem (18) can then be rewritten as

$$\min_{\mathbf{v}_i, \xi_j} \; \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + \frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} \xi_j$$
$$\text{s.t.} \quad T_{ij}\mathbf{v}_i^\top\mathbf{v}_j \ge 1 - \xi_j, \quad \xi_j \ge 0 \quad \forall j \in \mathcal{T}(i). \tag{19}$$

Using the method of Lagrange multipliers, the Lagrangian function is

$$\mathcal{L} = \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + \sum_{j\in\mathcal{T}(i)} \Bigl(\frac{\gamma}{2} - \beta_{\pi(j)}\Bigr)\xi_j - \sum_{j\in\mathcal{T}(i)} \alpha_{\pi(j)}\bigl(T_{ij}\mathbf{v}_i^\top\mathbf{v}_j - 1 + \xi_j\bigr) \tag{20}$$

where $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^{|\mathcal{T}(i)|}$ are the dual variables, and $\pi$ is a permutation that maps $j \in \mathcal{T}(i)$ to $\pi(j) \in \{1, \ldots, |\mathcal{T}(i)|\}$. By setting the derivatives of $\mathcal{L}$ with respect to the $\xi_j$'s and $\mathbf{v}_i$ to zero, we have

$$\frac{\partial\mathcal{L}}{\partial\xi_j} = \frac{\gamma}{2} - \alpha_{\pi(j)} - \beta_{\pi(j)} = 0 \;\Rightarrow\; 0 \le \alpha_{\pi(j)} \le \frac{\gamma}{2}$$

and

$$\frac{\partial\mathcal{L}}{\partial\mathbf{v}_i} = L_{ii}\mathbf{v}_i + \sum_{k:k\ne i} L_{ik}\mathbf{v}_k - \sum_{j\in\mathcal{T}(i)} \alpha_{\pi(j)} T_{ij}\mathbf{v}_j = \mathbf{0} \;\Rightarrow\; \mathbf{v}_i = \frac{1}{L_{ii}}\biggl(\sum_{j\in\mathcal{T}(i)} \alpha_{\pi(j)} T_{ij}\mathbf{v}_j - \sum_{k:k\ne i} L_{ik}\mathbf{v}_k\biggr). \tag{21}$$

Substituting these back into (20), we obtain the dual

$$\max_{0 \le \boldsymbol{\alpha} \le \frac{\gamma}{2}\mathbf{1}} \; \sum_{s\in\mathcal{T}(i)} \alpha_{\pi(s)}\biggl(T_{is}\mathbf{v}_s^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + L_{ii}\biggr) - \frac{1}{2}\sum_{s,t\in\mathcal{T}(i)} \alpha_{\pi(s)}\alpha_{\pi(t)} T_{is}T_{it}\mathbf{v}_s^\top\mathbf{v}_t$$

which is a quadratic program (QP). This can be written more compactly as

$$\min_{0 \le \boldsymbol{\alpha} \le \frac{\gamma}{2}\mathbf{1}} \; \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{P}\boldsymbol{\alpha} - \boldsymbol{\alpha}^\top\mathbf{q} \tag{22}$$

where

$$\mathbf{P} = [p_{\pi(s),\pi(t)}] \succeq 0, \quad \mathbf{q} = [q_1, \ldots, q_{|\mathcal{T}(i)|}]^\top$$
$$p_{\pi(s),\pi(t)} = T_{is}T_{it}\mathbf{v}_s^\top\mathbf{v}_t, \quad q_{\pi(s)} = T_{is}\mathbf{v}_s^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + L_{ii}. \tag{23}$$

Recall that $\boldsymbol{\alpha} \in \mathbb{R}^{|\mathcal{T}(i)|}$, and $\mathcal{T}(i)$ only contains samples involved in the pairwise constraints with sample $i$. Typically, $|\mathcal{T}(i)|$ is small. Hence, (22) can be solved efficiently with off-the-shelf QP algorithms. In particular, when $|\mathcal{T}(i)| = 1$, $\alpha$ becomes a scalar and can be obtained in closed form as $\min(\max(q_1/p_{11}, 0), \gamma/2)$.

2) Squared Hinge Loss $\ell(z) = (\max(1-z, 0))^2$: Problem (18) can then be rewritten as

$$\min_{\mathbf{v}_i, \xi_j} \; \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + \frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} \xi_j^2$$
$$\text{s.t.} \quad T_{ij}\mathbf{v}_i^\top\mathbf{v}_j \ge 1 - \xi_j \quad \forall j \in \mathcal{T}(i). \tag{24}$$

As shown in Section III-B1, it is easy to show that its dual is

$$\min_{\boldsymbol{\alpha} \ge 0} \; \frac{1}{2}\boldsymbol{\alpha}^\top\mathbf{H}\boldsymbol{\alpha} - \boldsymbol{\alpha}^\top\mathbf{q} \tag{25}$$

where $\mathbf{H} = \mathbf{P} + (L_{ii}/\gamma)\mathbf{I}$, and $\mathbf{P}$, $\mathbf{q}$ are as defined in (23). As $\boldsymbol{\alpha} \in \mathbb{R}^{|\mathcal{T}(i)|}$, this is again a small QP. In particular, when $|\mathcal{T}(i)| = 1$, $\alpha$ becomes a scalar and can be simply obtained as $\max(q_1/H_{11}, 0)$. Moreover, it can be easily shown that $\mathbf{v}_i$ can again be recovered from $\boldsymbol{\alpha}$ using (21).

3) Square Loss $\ell(z) = (1-z)^2$: In this case, the objective in (18) can be rewritten as

$$\frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + \frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} (\mathbf{v}_i^\top\mathbf{v}_j - T_{ij})^2$$
$$= \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\sum_{k:k\ne i} L_{ik}\mathbf{v}_k + \frac{\gamma}{2}\mathbf{v}_i^\top\biggl(\sum_{j\in\mathcal{T}(i)}\mathbf{v}_j\mathbf{v}_j^\top\biggr)\mathbf{v}_i - \gamma\mathbf{v}_i^\top\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j + \text{constant}$$
$$= \frac{1}{2}\mathbf{v}_i^\top\mathbf{M}\mathbf{v}_i + \mathbf{v}_i^\top\biggl(\sum_{k:k\ne i} L_{ik}\mathbf{v}_k - \gamma\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j\biggr) + \text{constant} \tag{26}$$

where

$$\mathbf{M} = L_{ii}\mathbf{I} + \gamma\sum_{j\in\mathcal{T}(i)} \mathbf{v}_j\mathbf{v}_j^\top. \tag{27}$$

Minimizing (26) leads to the closed-form solution

$$\mathbf{v}_i = \mathbf{M}^{-1}\biggl(\gamma\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j - \sum_{k:k\ne i} L_{ik}\mathbf{v}_k\biggr). \tag{28}$$

Recall that $\mathbf{M} \in \mathbb{R}^{r\times r}$, where $r$ is the rank of the desired kernel matrix. Typically, $r$ is small, and so computing $\mathbf{M}^{-1}$ is computationally efficient. In case $r$ is large, and in particular when $r \gg |\mathcal{T}(i)|$, one can compute $\mathbf{M}^{-1}$ more efficiently using the generalized Sherman–Morrison–Woodbury (SMW) formula [40]

$$\biggl(\mathbf{W} + \sum_{k=1}^{m}\mathbf{A}_k\mathbf{B}_k^\top\biggr)^{-1} = \mathbf{W}^{-1} - \mathbf{W}^{-1}\mathbf{A}\mathbf{Y}^{-1}\mathbf{B}^\top\mathbf{W}^{-1}$$

where $\mathbf{A} = [\mathbf{A}_1, \ldots, \mathbf{A}_m]$, $\mathbf{B} = [\mathbf{B}_1, \ldots, \mathbf{B}_m]$ and

$$\mathbf{Y} = \begin{bmatrix} \mathbf{I} + \mathbf{B}_1^\top\mathbf{W}^{-1}\mathbf{A}_1 & \cdots & \mathbf{B}_1^\top\mathbf{W}^{-1}\mathbf{A}_m \\ \mathbf{B}_2^\top\mathbf{W}^{-1}\mathbf{A}_1 & \cdots & \mathbf{B}_2^\top\mathbf{W}^{-1}\mathbf{A}_m \\ \vdots & \ddots & \vdots \\ \mathbf{B}_m^\top\mathbf{W}^{-1}\mathbf{A}_1 & \cdots & \mathbf{I} + \mathbf{B}_m^\top\mathbf{W}^{-1}\mathbf{A}_m \end{bmatrix}.$$

From (27), we can set $m = |\mathcal{T}(i)|$, $\mathbf{W} = L_{ii}\mathbf{I}$, $\mathbf{A}_j = \mathbf{v}_j$, and $\mathbf{B}_j = \gamma\mathbf{v}_j$ for $j \in \mathcal{T}(i)$. Noting that $\mathbf{W}^{-1} = (L_{ii}\mathbf{I})^{-1} = (1/L_{ii})\mathbf{I}$, the generalized SMW formula leads to

$$\mathbf{M}^{-1} = \frac{1}{L_{ii}}\biggl(\mathbf{I} - \mathbf{A}\Bigl(\frac{L_{ii}}{\gamma}\mathbf{I} + \mathbf{A}^\top\mathbf{A}\Bigr)^{-1}\mathbf{A}^\top\biggr)$$

where $\mathbf{A} = [\mathbf{v}_{j_1}, \ldots, \mathbf{v}_{j_{|\mathcal{T}(i)|}}] \in \mathbb{R}^{r\times|\mathcal{T}(i)|}$. Hence, the inverse operation on the $r\times r$ matrix $\mathbf{M}$ is reduced to an inverse operation on the much smaller $|\mathcal{T}(i)|\times|\mathcal{T}(i)|$ matrix $(L_{ii}/\gamma)\mathbf{I} + \mathbf{A}^\top\mathbf{A}$.

4) Linear Loss $\ell(z) = -z$: In this case, the objective in (10) can be rewritten as

$$\Phi(\mathbf{V}) = \operatorname{tr}\bigl(\mathbf{V}(\mathbf{L} - \gamma\mathbf{T})\mathbf{V}^\top\bigr).$$

This is lower-bounded by $\lambda_{\min}(\mathbf{L} - \gamma\mathbf{T})\|\mathbf{V}\|_F^2$, and so will be unbounded when $\lambda_{\min}(\mathbf{L} - \gamma\mathbf{T}) < 0$. To avoid this problem, we add the constraint

$$\|\mathbf{v}_i\| \le b \tag{29}$$

(where $b > 0$) for each $i$. The optimization subproblem in (18) can then be rewritten as $\min_{\|\mathbf{v}_i\| \le b} \tilde{\Phi}_i(\mathbf{v}_i)$, where

$$\tilde{\Phi}_i(\mathbf{v}_i) = \frac{1}{2}L_{ii}\mathbf{v}_i^\top\mathbf{v}_i + \mathbf{v}_i^\top\biggl(\sum_{k:k\ne i} L_{ik}\mathbf{v}_k - \frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j\biggr). \tag{30}$$

Similar to the square loss, its optimal solution can also be obtained in closed form, as

$$\mathbf{v}_i = \begin{cases} \tilde{\mathbf{v}}_i & \|\tilde{\mathbf{v}}_i\| \le b \\ \dfrac{b}{\|\tilde{\mathbf{v}}_i\|}\tilde{\mathbf{v}}_i & \|\tilde{\mathbf{v}}_i\| > b \end{cases} \tag{31}$$

where

$$\tilde{\mathbf{v}}_i = \frac{1}{L_{ii}}\biggl(\frac{\gamma}{2}\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j - \sum_{k:k\ne i} L_{ik}\mathbf{v}_k\biggr).$$

The whole procedure will be called BCD for NPKL (BCDKL), and is shown in Algorithm 1. In general, the performance of BCD can be affected by the samples' presentation order. To alleviate this problem, we randomize the order of the $\mathbf{v}_i$'s at the beginning of each iteration as in [41].

Algorithm 1 Block Coordinate Descent for NPKL
1: Input: $\mathbf{T}$, $\mathbf{L}$, $r$, $\gamma$, $\varepsilon$ and $\text{iter}_{\max}$.
2: Output: Target kernel matrix $\mathbf{K}$.
3: Initialization: $\mathbf{V}^{(0)} \in \mathbb{R}^{r\times n}$, $t = 0$.
4: repeat
5:   randomly reorder the samples;
6:   for $i \in \mathcal{L}$ do
7:     update $\mathbf{v}_i^{(t)} \rightarrow \mathbf{v}_i^{(t+1)}$ by (21), (28) or (31) according to the loss function;
8:   end for
9:   for $i \in \mathcal{U}$ do
10:    update $\mathbf{v}_i^{(t)} \rightarrow \mathbf{v}_i^{(t+1)}$ by (13) or (17);
11:   end for
12:   $\mathbf{V}^{(t+1)} \leftarrow [\mathbf{v}_1^{(t+1)}, \mathbf{v}_2^{(t+1)}, \ldots, \mathbf{v}_n^{(t+1)}]$;
13:   $t \leftarrow t + 1$;
14: until $\|\mathbf{V}^{(t)} - \mathbf{V}^{(t-1)}\| / \|\mathbf{V}^{(t)}\| < \varepsilon$ or $t > \text{iter}_{\max}$.
15: return $\mathbf{K} = \mathbf{V}^{(t)\top}\mathbf{V}^{(t)}$.
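To show how the pieces of Algorithm 1 fit together, here is a minimal NumPy sketch of the BCD sweep for the square loss, using the closed-form update (28) for constrained samples and (13) for the rest. The variable names, the dictionary format for the constraints, and the use of a direct solve instead of the SMW trick are assumptions of this illustration, not the authors' implementation.

```python
import numpy as np

def bcdkl_square_loss(L, T_pairs, r, gamma=1.0, eps=1e-5, iter_max=1000, seed=0):
    """Minimal BCDKL sketch (Algorithm 1) for the square loss.

    L: n x n graph Laplacian from (1).
    T_pairs: dict mapping sample i -> list of (j, T_ij) constraints involving i.
    Returns the learned low-rank kernel K = V^T V.
    """
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    V = rng.standard_normal((r, n))
    for t in range(iter_max):
        V_old = V.copy()
        for i in rng.permutation(n):               # random presentation order
            c = V @ L[:, i] - L[i, i] * V[:, i]    # sum_{k != i} L_ik v_k
            cons = T_pairs.get(int(i), [])
            if cons:                               # constrained sample: eq. (28)
                Vj = V[:, [j for j, _ in cons]]
                t_ij = np.array([t_val for _, t_val in cons])
                M = L[i, i] * np.eye(r) + gamma * Vj @ Vj.T   # eq. (27)
                V[:, i] = np.linalg.solve(M, gamma * Vj @ t_ij - c)
            else:                                  # unconstrained sample: eq. (13)
                V[:, i] = -c / L[i, i]
        if np.linalg.norm(V - V_old) < eps * np.linalg.norm(V):
            break
    return V.T @ V
```

Replacing the constrained-sample branch with the small QP of (22) or the projection of (31) would give the corresponding hinge-loss and linear-loss variants.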

C. Computational Complexities

1) Time Complexity: Let the degree of node $i$ in the manifold graph be $d_i$. Typically, the graph Laplacian is sparse, and so $d_i$ is small. In the following, we consider the time complexities for updating $\mathbf{V}_{\mathcal{U}}$ and $\mathbf{V}_{\mathcal{L}}$ in each iteration.

TABLE I
PER-ITERATION TIME COMPLEXITIES OF BCDKL FOR VARIOUS LOSS FUNCTIONS

For updating $\mathbf{V}_{\mathcal{U}}$, obtaining one $\mathbf{v}_i$ from (13) takes $O(rd_i)$ time. Hence, obtaining $\mathbf{V}_{\mathcal{U}} = [\mathbf{v}_{l+1}, \ldots, \mathbf{v}_n]$ takes a total of $O(r\sum_{i=l+1}^{n} d_i)$ time. For updating $\mathbf{V}_{\mathcal{L}}$, the per-iteration complexity for a sample $i \in \mathcal{L}$ depends on the loss function.

1) Hinge loss or squared hinge loss:
   a) obtaining $\mathbf{P}$ in (23) takes $O(r|\mathcal{T}(i)|^2)$ time;
   b) obtaining $\mathbf{q}$ in (23) takes $O(rd_i|\mathcal{T}(i)|)$ time;
   c) solving for $\boldsymbol{\alpha}$ in (22) or (25) takes $O(|\mathcal{T}(i)|^3)$ time for a general-purpose QP solver;
   d) using (21) to recover one $\mathbf{v}_i$ from $\boldsymbol{\alpha}$ takes $O(r(|\mathcal{T}(i)| + d_i))$ time.
   Hence, the total time to obtain $\mathbf{V}_{\mathcal{L}}$ is $O\bigl(\sum_{i=1}^{l}(|\mathcal{T}(i)|^3 + r|\mathcal{T}(i)|^2 + rd_i|\mathcal{T}(i)| + rd_i)\bigr)$.
2) Square loss:
   a) obtaining $\mathbf{M}$ in (27) takes $O(r^2|\mathcal{T}(i)|)$ time;
   b) obtaining $\mathbf{M}^{-1}$ by direct inversion takes $O(r^3)$ time, whereas using the generalized SMW formula takes $O(|\mathcal{T}(i)|^3 + r^2|\mathcal{T}(i)|)$ time;
   c) obtaining $\gamma\sum_{j\in\mathcal{T}(i)} T_{ij}\mathbf{v}_j - \sum_{k:k\ne i} L_{ik}\mathbf{v}_k$ takes $O(r(|\mathcal{T}(i)| + d_i))$ time.
   Thus, the total time to obtain $\mathbf{V}_{\mathcal{L}}$ is $O\bigl(r^3 + \sum_{i=1}^{l}(r^2|\mathcal{T}(i)| + rd_i)\bigr)$, or $O\bigl(\sum_{i=1}^{l}(|\mathcal{T}(i)|^3 + r^2|\mathcal{T}(i)| + rd_i)\bigr)$ when the generalized SMW formula is used.
3) Linear loss: It takes $O(r(|\mathcal{T}(i)| + d_i))$ time to obtain $\tilde{\mathbf{v}}_i$, and $O(r)$ time for its norm in (31). Thus, the total time to obtain $\mathbf{V}_{\mathcal{L}}$ is $O\bigl(r\sum_{i=1}^{l}(|\mathcal{T}(i)| + d_i)\bigr)$.

The total per-iteration time complexities (for obtaining both $\mathbf{V}_{\mathcal{U}}$ and $\mathbf{V}_{\mathcal{L}}$) are summarized in Table I. In practice, both $r$ and $|\mathcal{T}(i)|$ are much smaller than $n$. Hence, the time complexities are mainly dominated by $O(r\sum_{i=1}^{n} d_i)$. When the data manifold is represented by a $k$-nearest-neighbor graph, $d_i = k$ for all $i$. The time complexities then reduce to $O(rkn)$, which only scales linearly with the total number of samples $n$.

2) Space Complexity: In Algorithm 1, only $\mathbf{V}$ has to be resident in memory. Thus, the space complexity of BCDKL is only $O(rn)$. On the other hand, SimpleNPKL needs to store the whole $\mathbf{K}$, which takes $O(n^2)$ space.

D. Convergence

The convergence analysis of BCD has been studied extensively [21], [42]. Let $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_m]$ be the variable to be optimized. We first introduce the following definition and proposition from [21].

Definition 1: A function $f$ is strictly quasi-convex with respect to $\mathbf{x}_i$ if, for every $\mathbf{x}$,
$$f(\mathbf{x}_1, \ldots, \alpha\mathbf{x}_i + (1-\alpha)\mathbf{y}_i, \ldots, \mathbf{x}_m) < \max\{f(\mathbf{x}), f(\mathbf{x}_1, \ldots, \mathbf{y}_i, \ldots, \mathbf{x}_m)\}$$
for any $\mathbf{y}_i \ne \mathbf{x}_i$ and any $\alpha \in (0, 1)$.

Note that strict convexity implies strict quasi-convexity, but not vice versa. Moreover, as can be observed from (12), (19), (24), (26), and (30), for all the loss functions considered here, the optimization subproblem with respect to $\mathbf{v}_i$ ($i = 1, \ldots, n$) is always strictly convex.

Proposition 2 (Proposition 5 in [21]): Assume that the function $f(\mathbf{x})$ is strictly quasi-convex with respect to $\mathbf{x}_i$ for $i = 1, 2, \ldots, m-2$, and that the sequence $\{\mathbf{x}^{(k)}\}$ generated by BCD has limit points. Then, every limit point of $\{\mathbf{x}^{(k)}\}$ is a local minimum.

From (11), obviously, we have $\Phi(\ldots, \mathbf{v}_i^{(t+1)}, \ldots) \le \Phi(\ldots, \mathbf{v}_i^{(t)}, \ldots)$ for $i = 1, \ldots, n$. Hence, the objective in (10) is monotonically decreasing (i.e., $\Phi(\mathbf{V}^{(0)}) \ge \Phi(\mathbf{V}^{(1)}) \ge \cdots \ge \Phi(\mathbf{V}^{(t)})$). The following theorem shows that this $\{\mathbf{V}^{(t)}\}$ sequence converges to a local minimum.

Theorem 1: Let $\{\mathbf{V}^{(t)}\}$ be a sequence generated from Algorithm 1. Then, $\mathbf{V}^{(t)}$ converges to a local minimum.

Proof: As $\mathbf{L}$ in (1) is $\succ 0$, its smallest eigenvalue $\lambda_{\min}(\mathbf{L})$ is greater than zero. Moreover,

$$0 \le \lambda_{\min}(\mathbf{L})\|\mathbf{V}\|_F^2 \le \operatorname{tr}(\mathbf{V}\mathbf{L}\mathbf{V}^\top). \tag{32}$$

When the hinge loss, squared hinge loss, or square loss is used, $\ell(\cdot) \ge 0$. Hence, from (10) and (32),

$$\Phi(\mathbf{V}) \ge \operatorname{tr}(\mathbf{V}\mathbf{L}\mathbf{V}^\top) \ge \lambda_{\min}(\mathbf{L})\|\mathbf{V}\|_F^2 \ge 0. \tag{33}$$

When the linear loss is used, the objective can be rewritten as

$$\Phi(\mathbf{V}) \equiv \operatorname{tr}(\mathbf{V}\mathbf{L}\mathbf{V}^\top) + 2\gamma\biggl(\sum_{(i,j):\,T_{ij}<0} \mathbf{v}_i^\top\mathbf{v}_j - \sum_{(i,j):\,T_{ij}>0} \mathbf{v}_i^\top\mathbf{v}_j\biggr).$$

From (31), $\|\mathbf{v}_i\| \le b$ and so $\mathbf{v}_i^\top\mathbf{v}_j \le \|\mathbf{v}_i\|\|\mathbf{v}_j\| \le b^2$. Thus, from (10) and (32),

$$\Phi(\mathbf{V}) = \operatorname{tr}(\mathbf{V}\mathbf{L}\mathbf{V}^\top) - \gamma\sum_{(i,j)\in\mathcal{T}} T_{ij}\mathbf{v}_i^\top\mathbf{v}_j \ge \lambda_{\min}(\mathbf{L})\|\mathbf{V}\|_F^2 - 2\gamma|\mathcal{T}|b^2 \ge -2\gamma|\mathcal{T}|b^2. \tag{34}$$

In both cases, (33) and (34) imply that $\Phi(\cdot)$ is bounded below. Hence, using the monotone convergence theorem, $\{\Phi(\mathbf{V}^{(t)})\}$ converges to a limit point, and so does $\{\mathbf{V}^{(t)}\}$ as $\Phi$ is continuous. As the subproblem with respect to $\mathbf{v}_i$ ($i = 1, \ldots, n$) is strictly convex, every limit point is a local minimum by Proposition 2.
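Because Theorem 1 hinges on the monotone decrease of (10), it is useful to evaluate the objective after each BCD sweep. The small helper below does this; the square loss is an assumed default and the triple-list constraint format is our own choice, so this is a sanity-check sketch rather than part of the original algorithm.

```python
import numpy as np

def npkl_objective(V, L, constraints, gamma=1.0, loss=lambda z: (1.0 - z) ** 2):
    """Phi(V) in (10): tr(V L V^T) + gamma * sum_{(i,j) in T} loss(T_ij v_i^T v_j).

    `constraints` is a list of (i, j, T_ij) triples, each constraint listed once.
    The square loss is used by default; pass another callable for other losses.
    """
    smooth = float(np.trace(V @ L @ V.T))
    penalty = sum(loss(t_ij * (V[:, i] @ V[:, j])) for i, j, t_ij in constraints)
    return smooth + gamma * penalty
```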

E. Out-of-Sample Extension

Suppose that a kernel matrix $\mathbf{K}$ has been learned on a set of $n$ samples. With the arrival of some new unseen samples, it is obviously desirable to extend $\mathbf{K}$ to these samples. Note that as NPKL learns a kernel matrix instead of a kernel function, existing methods for handling this out-of-sample problem, such as the Nyström algorithm [43], cannot be used here. A straightforward approach is to apply Algorithm 1 on the original and new samples together. However, this involves complete retraining, and can be computationally expensive.

As the new samples typically do not come with any must-link or cannot-link constraint, we propose a simplified version of Algorithm 1 by considering the new samples as unlabeled data. Specifically, let the original in-sample data be $\mathcal{X}_{\mathcal{I}}$, and the out-of-sample data be $\mathcal{X}_{\mathcal{O}}$. Suppose that we have already learned a low-rank kernel matrix $\mathbf{K}_{\mathcal{I}\mathcal{I}} = \mathbf{V}_{\mathcal{I}}^\top\mathbf{V}_{\mathcal{I}}$ on $\mathcal{X}_{\mathcal{I}}$. Recall that $\mathbf{V}_{\mathcal{I}}$ can be viewed as the data representation for the samples in $\mathcal{X}_{\mathcal{I}}$. With the arrival of out-of-sample data, assume that the representation of $\mathcal{X}_{\mathcal{I}}$ does not change. Hence, we aim to find the low-rank kernel matrix $[\mathbf{V}_{\mathcal{I}}, \mathbf{V}_{\mathcal{O}}]^\top[\mathbf{V}_{\mathcal{I}}, \mathbf{V}_{\mathcal{O}}]$, where $\mathbf{V}_{\mathcal{O}}$ is the data representation for samples in $\mathcal{X}_{\mathcal{O}}$. As the new samples are unlabeled, we have the following simplified version of (10):

$$\min_{\mathbf{V}_{\mathcal{O}}} \; \operatorname{tr}\bigl([\mathbf{V}_{\mathcal{I}}, \mathbf{V}_{\mathcal{O}}]\,\mathbf{L}\,[\mathbf{V}_{\mathcal{I}}, \mathbf{V}_{\mathcal{O}}]^\top\bigr).$$

Following the same derivation as in Section III-A, we can obtain the columns of $\mathbf{V}_{\mathcal{O}}$ sequentially as in (13):

$$\mathbf{v}_i = -\frac{1}{L_{ii}}\sum_{k:k\ne i} L_{ik}\mathbf{v}_k, \quad i \in \mathcal{O}.$$

Alternatively, we can update $\mathbf{V}_{\mathcal{O}}$ as a single block as in (17):

$$\mathbf{V}_{\mathcal{O}}^{(t+1)} = \bigl(\mathbf{V}_{\mathcal{I}}\mathbf{G}_{\mathcal{I}} + \mathbf{V}_{\mathcal{O}}^{(t)}\mathbf{G}_{\mathcal{O}}\bigr)\bar{\mathbf{D}}_{\mathcal{O}\mathcal{O}}^{-1}$$

where $\mathbf{G}_{\mathcal{O}}$, $\mathbf{G}_{\mathcal{I}}$, and $\bar{\mathbf{D}}_{\mathcal{O}\mathcal{O}}$ are analogously defined, as shown in Section III-A, on $\mathcal{X}_{\mathcal{O}}$.
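A minimal sketch of this out-of-sample step, assuming the Laplacian has been rebuilt on the combined data with the in-sample columns first: V_I is held fixed and V_O is obtained by the same recursion as in (17). The function name and the fixed number of iterations are our own choices.

```python
import numpy as np

def extend_out_of_sample(V_I, L_full, n_in, n_iter=20):
    """Out-of-sample extension (Section III-E): solve for V_O with V_I held fixed."""
    r = V_I.shape[0]
    d_bar = np.diag(L_full).copy()
    L_bar = L_full - np.diag(d_bar)
    G = -L_bar[:, n_in:]                      # columns of -\bar{L} indexed by O
    G_I, G_O = G[:n_in, :], G[n_in:, :]
    d_OO = d_bar[n_in:]                       # diagonal of \bar{D}_{OO}
    V_O = np.zeros((r, L_full.shape[0] - n_in))
    for _ in range(n_iter):                   # block recursion analogous to (17)
        V_O = (V_I @ G_I + V_O @ G_O) / d_OO
    return np.hstack([V_I, V_O])              # K extends as [V_I, V_O]^T [V_I, V_O]
```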

F. Discussion

In [11], the NPKL problem in (3) is reformulated as the following optimization problem:

$$\min_{\mathbf{K}_{\mathcal{L}\mathcal{L}} \succeq 0} \; \operatorname{tr}(\tilde{\mathbf{L}}\mathbf{K}_{\mathcal{L}\mathcal{L}}) + \gamma\sum_{(i,j)\in\mathcal{T}} \ell\bigl(T_{ij}[\mathbf{K}_{\mathcal{L}\mathcal{L}}]_{ij}\bigr) \tag{35}$$

where

$$\tilde{\mathbf{L}} = \mathbf{L}_{\mathcal{L}\mathcal{L}} - \mathbf{L}_{\mathcal{L}\mathcal{U}}\mathbf{L}_{\mathcal{U}\mathcal{U}}^{-1}\mathbf{L}_{\mathcal{L}\mathcal{U}}^\top \tag{36}$$

$\mathbf{L} = \begin{bmatrix}\mathbf{L}_{\mathcal{L}\mathcal{L}} & \mathbf{L}_{\mathcal{L}\mathcal{U}} \\ \mathbf{L}_{\mathcal{L}\mathcal{U}}^\top & \mathbf{L}_{\mathcal{U}\mathcal{U}}\end{bmatrix}$, and $\mathbf{K}$ is also similarly decomposed. Hence, while (3) optimizes over the whole kernel matrix $\mathbf{K}$, (35) only needs to optimize the submatrix of $\mathbf{K}$ involving samples associated with pairwise constraints. With the low-rank factorization in (9), $\mathbf{K}_{\mathcal{L}\mathcal{L}}$ can be further written as $\mathbf{V}_{\mathcal{L}}^\top\mathbf{V}_{\mathcal{L}}$. Plugging back into (35), $\mathbf{V}_{\mathcal{L}}$ is obtained using BCD in [11]. Finally, $\mathbf{K}$ is recovered as $\mathbf{Q}\mathbf{K}_{\mathcal{L}\mathcal{L}}\mathbf{Q}^\top$, where

$$\mathbf{Q} = \begin{bmatrix}\mathbf{I} \\ -\mathbf{L}_{\mathcal{U}\mathcal{U}}^{-1}\mathbf{L}_{\mathcal{L}\mathcal{U}}^\top\end{bmatrix}. \tag{37}$$

While (35) involves a smaller (and thus cheaper) optimization problem than (3), computing the matrices in (35) can be expensive. In particular, one often has an abundance of unlabeled samples in semisupervised learning. Typically, they are not associated with any pairwise constraints, and $|\mathcal{U}|$ is potentially very large. However, computing $\tilde{\mathbf{L}}$ in (36) and $\mathbf{Q}$ in (37) requires the inversion of $\mathbf{L}_{\mathcal{U}\mathcal{U}} \in \mathbb{R}^{|\mathcal{U}|\times|\mathcal{U}|}$, which takes $O(|\mathcal{U}|^3)$ time. In contrast, Algorithm 1 avoids such matrix inversion in the update of $\mathbf{V}_{\mathcal{U}}$ (steps 9–11).

IV. EXPERIMENTS

In this section, we demonstrate the performance of the proposed NPKL approach in the context of clustering. As in [8], a kernel matrix is learned from a given set of pairwise constraints, which is then used to define the kernel in the kernelized k-means clustering algorithm.

A. Setup

TABLE II
DATA SETS USED IN THE EXPERIMENTS

Experiments are performed on a number of standard benchmark data sets¹ commonly used in the NPKL literature [8], [13] (Table II). Following [13], the data manifold is constructed using the $k$-nearest neighbors (where $k = 50$ for the adult data sets, and $k = 5$ for the others). For each data set, we randomly sample $0.6n$ pairs of patterns that belong to the same class to form the must-link pairs, and another $0.6n$ pattern pairs belonging to different classes to form the cannot-link constraints.

The following NPKL solvers (implemented in MATLAB) will be compared in the experiments.

1) The Proposed BCDKL: The initial estimate $\mathbf{V}^{(0)}$ is generated randomly. When the hinge loss or squared hinge loss is used, the small QPs in (22) and (25) are solved using MATLAB's quadprog package. For the linear loss, $b$ in (29) is simply set to 1.

2) SimpleNPKL [13]: As the graph Laplacian matrix is often sparse and the number of pairwise constraints is small, the Lanczos algorithm [37] is used for solving the underlying sparse eigendecomposition, as suggested in [13]. Moreover, in (5), $p$ is set to 2, while $B$ is set to $100n$ for the adult data sets, and 100 for the others.

¹Downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.

Fig. 2. Optimization objective (3) versus CPU time (in seconds) on the soybean (top) and a1a (bottom) data sets. (a) and (e) Hinge loss. (b) and (f) Squared hinge loss. (c) and (g) Square loss. (d) and (h) Linear loss.

Fig. 3. Optimization objective (3) versus the number of iterations for BCDKL on the soybean (top) and a1a (bottom) data sets. (a) and (e) Hinge loss. (b) and (f) Squared hinge loss. (c) and (g) Square loss. (d) and (h) Linear loss.

3) SDPLR² [19]: As for SimpleNPKL, the sparse input format is used for more efficient processing.

We do not compare with the approaches based on interior-point optimization [8] and multiple kernel learning [6], as they have been shown to be outperformed by SimpleNPKL [8], [12], [13].

For BCDKL, the maximum number of iterations ($\text{iter}_{\max}$ in Algorithm 1) is set to 1000 (except for the larger adult data sets, for which we set it to 800), and $\varepsilon$ is $10^{-5}$. The same stopping criterion is used for SimpleNPKL. All three NPKL solvers assume that the target kernel matrix is low rank. As in [13], we set the rank $r$ to the largest integer satisfying $r(r+1)/2 \le m$, where $m$ is the total number of must-link and cannot-link constraints. The learned kernel matrix is then used as input to the kernelized k-means clustering algorithm. As shown in [8] and [13], the number of clusters $k$ is set to the true number of classes. Each clustering experiment is repeated 20 times with random restarts. Experiments are performed on a PC with an i7-3370 3.4-GHz CPU and 32-GB RAM.

As shown in [13], clustering accuracy is measured by the RAND index [44], which is defined as follows. Given $n$ samples in the data set, there are $0.5n(n-1)$ pattern pairs. For each pair, the clustering algorithm either assigns them to the same cluster or to different clusters. Let $a$ be the number of pairs belonging to the same class that are placed in the same cluster, and $b$ the number of pairs belonging to different classes that are placed in different clusters. The RAND index is then defined as

$$\frac{a+b}{0.5n(n-1)} \times 100\%.$$

The higher the RAND index, the better the performance.

²Downloaded from http://dollar.biz.uiowa.edu/~sburer/.
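The RAND index is simple to compute directly from its definition; the helper below is a plain O(n²) transcription of that definition (not the authors' evaluation script).

```python
import numpy as np
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RAND index [44] in percent: (a + b) / (0.5 * n * (n - 1)) * 100."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_class and same_cluster:
            a += 1                 # same class, placed in the same cluster
        elif (not same_class) and (not same_cluster):
            b += 1                 # different classes, placed in different clusters
    return 100.0 * (a + b) / (0.5 * n * (n - 1))
```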

TABLE III
CLUSTERING ACCURACIES (%) ON THE SMALLER DATA SETS. THE BEST AND COMPARABLE RESULTS (ACCORDING TO THE PAIRWISE t-TEST WITH 95% CONFIDENCE) ARE HIGHLIGHTED

TABLE IV
CPU TIME (IN SECONDS) ON THE SMALLER DATA SETS. THE BEST AND COMPARABLE RESULTS (ACCORDING TO THE PAIRWISE t-TEST WITH 95% CONFIDENCE) ARE HIGHLIGHTED

B. Convergence

In this section, we first compare the different methods' convergence behaviors on two representative data sets: soybean and the larger a1a. Fig. 2 shows the convergence of the optimization objective in (3) with time. Since the SDPLR package does not output intermediate objective values produced during the optimization iterations, only the final objective value attained is shown. Moreover, when the linear loss is used, SimpleNPKL has a closed-form solution and so only a single point is shown in this case.

As can be seen, SDPLR is often slow. Indeed, on the a1a data set, it cannot converge after 800 s, and so is not shown in Fig. 2(g) and (h). BCDKL converges much faster than both

SDPLR and SimpleNPKL when the hinge loss, squared hinge loss, or square loss is used. With the linear loss, SimpleNPKL is the fastest on the soybean data set, as its solution can be obtained in closed form and its running time is dominated by the (sparse) eigenvalue decomposition. However, the final objective value obtained is much inferior to that of the other two solvers (we have also experimented with different settings of $B$, but still with little improvement). On the large a1a data set, the eigenvalue decomposition performed by SimpleNPKL becomes computationally expensive, and BCDKL is as fast as SimpleNPKL. Moreover, note that the BCDKL objective decreases monotonically on both data sets.

TABLE V
CLUSTERING ACCURACIES (%) ON THE ADULT DATA SETS. THE BEST AND COMPARABLE RESULTS (ACCORDING TO THE PAIRWISE t-TEST WITH 95% CONFIDENCE) ARE HIGHLIGHTED. NOTE THAT SIMPLENPKL CANNOT HANDLE THE TWO LARGEST DATA SETS (a6a AND a7a) WHEN THE HINGE LOSS, SQUARED HINGE LOSS, OR SQUARE LOSS IS USED

TABLE VI
CPU TIME (IN SECONDS) ON THE ADULT DATA SETS. THE BEST AND COMPARABLE RESULTS (ACCORDING TO THE PAIRWISE t-TEST WITH 95% CONFIDENCE) ARE HIGHLIGHTED

This agrees with the convergence property of BCD, as described in Section III-D. On the other hand, SimpleNPKL is based on gradient descent (except for the linear loss), and its objective does not decrease monotonically.

Fig. 3 shows the convergence of BCDKL with respect to the number of iterations. As can be seen, the algorithm converges in around 10 iterations.

C. Clustering Accuracies and CPU Time

Tables III and IV compare the clustering accuracies and CPU time, respectively, of the three NPKL solvers on the smaller data sets. As can be seen, all of them have comparable clustering accuracies when the hinge loss, squared hinge loss, or square loss is used. However, BCDKL is consistently faster, which agrees with the results in the previous section. When the linear loss is used, SimpleNPKL is the fastest, but its clustering accuracy is outperformed by the other two methods. This also agrees with the observations in Section IV-B.

Tables V and VI show the clustering accuracies and CPU time, respectively, on the larger adult data sets. SDPLR does not scale well (e.g., with the hinge loss, SDPLR does not converge after 4 h on the a3a data set), and so will not be reported here. As can be seen, BCDKL obtains good accuracies, and can be used on the large data sets efficiently.

D. Varying the Number of Samples

In this section, we perform experiments on the largest data set in Table II (i.e., a7a), with a varying number of samples (500, 1000, 5000, and 10 000). The number of pairwise constraints is fixed at 1000. As SDPLR does not scale well, it will not be compared here.

Fig. 4. CPU time versus the number of samples on the a7a data set. Note that both the abscissa and ordinate are in log scale. (a) Hinge loss. (b) Squared hinge loss. (c) Square loss. (d) Linear loss.

Results are shown in Fig. 4. As can be seen, BCDKL is faster than SimpleNPKL on the square loss, hinge loss, and squared hinge loss. As the data manifold is represented by a k-nearest-neighbor graph, recall from Section III-C1 that the periteration time complexity of BCDKL is mainly dominated by O(r kn), which only scales linearly with the total number of samples n. Table VII shows how the total time scales with n. As can be seen, BCDKL scales much better than SimpleNPKL. Its scaling is linear or even sublinear with n, while SimpleNPKL often scales quadratically. Thus, BCDKL is more suitable than SimpleNPKL for NPKL on big data sets.

TABLE VII
SCALING OF THE EMPIRICAL TIME COMPLEXITY WITH THE NUMBER OF SAMPLES ON THE a7a DATA SET

E. Out-of-Sample Extension

In this section, we perform out-of-sample experiments on the iris data set. We randomly pick half of the samples as in-sample data, and the rest as out-of-sample data. Pairwise constraints are then generated from the in-sample data only. The number of pairwise constraints is varied from $0.05n$, $0.1n$, $0.5n$, $n$ to $1.5n$ (where $n$ is the total number of in-sample and out-of-sample data). The experiment is repeated 20 times.

Fig. 5. Clustering accuracies on the out-of-sample data with varying numbers of pairwise constraints on the in-sample data for the iris data set. (a) Hinge loss. (b) Squared hinge loss. (c) Square loss. (d) Linear loss.

The average clustering accuracy on the out-of-sample data is shown in Fig. 5. As can be seen, as the number of constraints in the in-sample data increases, the accuracy on the out-of-sample data gradually improves. This indicates that supervised information in the form of pairwise constraints can be transferred from the in-sample data to the out-of-sample data.

V. CONCLUSION

In this paper, we proposed the use of low-rank approximation and BCD to improve the scalability of NPKL solvers. Using low-rank approximation, the expensive psd constraint in the SDP can be dropped, while BCD decomposes the optimization problem into small subproblems that can be efficiently solved with various loss functions. Both the time and space complexities of the resultant procedure are linear in the number of samples. In addition, we developed an efficient out-of-sample extension, which is highly useful when new samples arrive continually. Experimental results demonstrated that the proposed algorithm outperforms the state-of-the-art algorithms in both scalability and accuracy.

This paper can be extended in several directions. For example, SimpleNPKL has been used for speeding up the data embedding tasks in colored maximum variance unfolding [35] and minimum volume embedding [36]. The proposed solver can also be extended in a similar manner. Moreover, the proposed method can be easily parallelized. Specifically, using the reordering technique in [45], one can first color the graph formed by both $\mathcal{G}$ and the constraints $\mathcal{T}$, and then apply parallel BCD to coordinates sharing the same color.

REFERENCES

[1] B. Schölkopf and A. J. Smola, Learning with Kernels. Cambridge, MA, USA: MIT Press, 2002. [2] O. Chapelle, J. Weston, and B. Schölkopf, “Cluster kernels for semisupervised learning,” in Advances in Neural Information Processing Systems 15. Cambridge, MA, USA: MIT Press, 2003, pp. 585–592. [3] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, Jul. 2004, pp. 6–13. [4] M. Hu, Y. Chen, and J. T.-Y. Kwok, “Building sparse multiple-kernel SVM classifiers,” IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 827–839, May 2009. [5] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, Jul. 2006. [6] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, “Learning the kernel matrix with semidefinite programming,” J. Mach. Learn. Res., vol. 5, no. 1, pp. 27–72, 2004. [7] B. Kulis, M. Sustik, and I. Dhillon, “Learning low-rank kernel matrices,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 505–512. [8] S. C. H. Hoi, R. Jin, and M. R. Lyu, “Learning nonparametric kernel matrices from pairwise constraints,” in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, USA, Jun. 2007, pp. 361–368. [9] Z. Li, J. Liu, and X. Tang, “Pairwise constraint propagation by semidefinite programming for semi-supervised classification,” in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 2008, pp. 576–583. [10] E. Hu, S. Chen, D. Zhang, and X. Yin, “Semisupervised kernel matrix learning by kernel propagation,” IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1831–1841, Nov. 2010. [11] E.-L. Hu, B. Wang, and S. Chen, “BCDNPKL: Scalable non-parametric kernel learning using block coordinate descent,” in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, Jun. 2011, pp. 209–216. [12] J. Zhuang, I. W. Tsang, and S. C. H. Hoi, “SimpleNPKL: Simple non-parametric kernel learning,” in Proc. 26th Int. Conf. Mach. Learn., Montreal, QC, Canada, Jun. 2009, pp. 1273–1280. [13] J. Zhuang, I. W. Tsang, and S. C. H. Hoi, “A family of simple nonparametric kernel learning algorithms,” J. Mach. Learn. Res., vol. 12, pp. 1313–1347, Apr. 2011. [14] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, “Constrained K-means clustering with background knowledge,” in Proc. 18th Int. Conf. Mach. Learn., Williamstown, MA, USA, Jun. 2001, pp. 577–584. [15] L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996. [16] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208. [17] C. K. I. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems 13. Cambridge, MA, USA: MIT Press, 2001, pp. 682–688. [18] K. Zhang and J. T. Kwok, “Clustered Nyström method for large scale manifold learning and dimension reduction,” IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1576–1587, Oct. 2010. [19] S. Burer and R. D. C. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Math. Program., vol. 95, no. 2, pp. 329–357, 2003. [20] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA, USA: Athena Scientific, 1999.

[21] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear Gauss–Seidel method under convex constraints,” Oper. Res. Lett., vol. 26, no. 3, pp. 127–136, 2000. [22] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan, “A dual coordinate descent method for large-scale linear SVM,” in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jul. 2008, pp. 408–415. [23] C.-J. Hsieh and I. S. Dhillon, “Fast coordinate descent methods with variable selection for non-negative matrix factorization,” in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, San Diego, CA, USA, Aug. 2011, pp. 1064–1072. [24] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent approaches to parallel matrix factorization for recommender systems,” in Proc. IEEE 12th Int. Conf. Data Mining, Brussels, Belgium, Dec. 2012, pp. 765–774. [25] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, “Feature clustering for accelerating parallel coordinate descent,” in Advances in Neural Information Processing Systems 25. Red Hook, NY, USA: Curran Associates, 2012, pp. 28–36. [26] C. Scherrer, M. Halappanavar, A. Tewari, and D. Haglin, “Scaling up coordinate descent algorithms for large 1 regularization problems,” in Proc. 29th Int. Conf. Mach. Learn., Edinburgh, U.K., Jun. 2012. [27] Y. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM J. Optim., vol. 22, no. 2, pp. 341–362, 2012. [28] P. Richtarik and M. Takac, “Iteration complexity of randomized blockcoordinate descent methods for minimizing a composite function,” Math. Program., vol. 144, nos. 1–2, pp. 1–38, Apr. 2014. [29] R. Mazumder, J. H. Friedman, and T. Hastie, “SparseNet: Coordinate descent with nonconvex penalties,” J. Amer. Statist. Assoc., vol. 106, no. 495, pp. 1125–1138, 2011. [30] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, no. 11, pp. 2399–2434, Dec. 2006. [31] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16. Cambridge, MA, USA: MIT Press, 2004, pp. 321–328. [32] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proc. 20th Int. Conf. Mach. Learn., Washington, DC, USA, Aug. 2003, pp. 912–919. [33] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems 14. Cambridge, MA, USA: MIT Press, 2002, pp. 585–591. [34] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in Neural Information Processing Systems 14. Cambridge, MA, USA: MIT Press, 2000. [35] L. Song, A. Gretton, K. M. Borgwardt, and A. J. Smola, “Colored maximum variance unfolding,” in Advances in Neural Information Processing Systems 20. Red Hook, NY, USA: Curran Associates, 2008. [36] B. Shaw and T. Jebara, “Minimum volume embedding,” in Proc. 11th Int. Conf. Artif. Intell. Statist., 2008, pp. 460–467. [37] Y. Saad, Numerical Methods for Large Eigenvalue Problems, 2nd ed. Philadelphia, PA, USA: 2011. [38] S. Boyd and L. Xiao, “Least-squares covariance matrix adjustment,” SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 532–546, 2005.

[39] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MA, USA: The Johns Hopkins Univ. Press, 1989. [40] M. Batista, “A note on a generalization of Sherman–Morrison– Woodbury formula,” Univ. Ljubljana, Ljubljana, Slovenia, Tech. Rep. arXiv:0807.3860, 2008. [41] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “Coordinate descent method for large-scale L2-loss linear support vector machines,” J. Mach. Learn. Res., vol. 9, pp. 1369–1398, Jun. 2008. [42] P. Tseng and S. Yun, “Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization,” J. Optim. Theory Appl., vol. 140, no. 3, pp. 513–535, 2009. [43] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 2, pp. 214–225, Feb. 2004. [44] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” J. Amer. Statist. Assoc., vol. 66, no. 336, pp. 846–850, 1971. [45] J. G. Lewis, B. W. Peyton, and A. Pothen, “A fast algorithm for reordering sparse matrices for parallel factorization,” SIAM J. Sci. Statist. Comput., vol. 10, no. 6, pp. 1146–1173, 1989.

En-Liang Hu received the Ph.D. degree in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. He was a Research Assistant with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, from 2012 to 2013. He is currently an Associate Professor with the Department of Mathematics, Yunnan Normal University, Kunming, China. His current research interests include machine learning, pattern recognition, and optimization.

James T. Kwok received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Hong Kong, in 1996. He was with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, as an Assistant Professor. He is currently a Professor with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His current research interests include kernel methods, machine learning, pattern recognition, and artificial neural networks. Dr. Kwok has been a Program Co-Chair for a number of international conferences, and served as an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS from 2006 to 2012. He is currently an Associate Editor of the Neurocomputing journal. He was a recipient of the IEEE Outstanding 2004 Paper Award, and the Second Class Award in Natural Sciences by the Ministry of Education, China, in 2008.
