
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 11, NOVEMBER 2014

Learning Locality Preserving Graph from Data

Yan-Ming Zhang, Kaizhu Huang, Member, IEEE, Xinwen Hou, and Cheng-Lin Liu, Senior Member, IEEE

Abstract—Machine learning based on graph representation, or manifold learning, has attracted great interest in recent years. As the discrete approximation of the data manifold, the graph plays a crucial role in these kinds of learning approaches. In this paper, we propose a novel learning method for graph construction, which is distinct from previous methods in that it solves an optimization problem with the aim of directly preserving the local information of the original data set. We show that the proposed objective has close connections with the popular Laplacian Eigenmap problem and is hence well justified. The optimization turns out to be a quadratic programming problem with n(n − 1)/2 variables (n is the number of data points). Exploiting the sparsity of the graph, we further propose a more efficient cutting plane algorithm to solve the problem, making the method more scalable in practice. In the context of clustering and semi-supervised learning, we demonstrate the advantages of the proposed method by experiments.

Index Terms—Graph construction, graph-based learning, manifold learning, semi-supervised learning, spectral clustering.

Manuscript received May 7, 2012; revised May 2, 2013; accepted January 7, 2014. Date of publication July 8, 2014; date of current version October 13, 2014. This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB316301, and in part by the National Natural Science Foundation of China under Grant 61203296 and Grant 61075052. This paper was recommended by Associate Editor F. Hoffmann. Y.-M. Zhang, X. Hou, and C.-L. Liu are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 200240, China (e-mail: [email protected]; [email protected]; [email protected]). K. Huang is with the Electrical and Electronic Engineering Department, Xi'an Jiaotong-Liverpool University, Suzhou 215123, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TCYB.2014.2300489

I. Introduction

In the past decade, manifold learning has attracted great interest in the machine learning community. Many classical learning methods of this type have been proposed and have achieved great success both theoretically and empirically. Generally, manifold learning methods represent the data set by a graph in which each data point is treated as a node and the weighted edge between nodes characterizes the similarity between data points. Various machine learning tasks, such as dimensionality reduction [2], [7], [23], [24], [26], [31], [35], clustering [17], [22], [25], [27], or semi-supervised learning [3], [5], [29], [30], [33], [39], [41], are then performed on this graph.

Formally, given a set of data points x_1, . . . , x_n ∈ R^p, we construct a weighted, undirected graph G = (V, E), where V (|V| = n) is the set of n nodes forming the graph vertices, w_ij = w_ji ≥ 0 denotes the weight of the edge between nodes i and j, and d_i = Σ_{j=1}^{n} w_ij denotes the weighted degree of node i (also known as node strength). When there is no edge between nodes i and j, we have w_ij = 0. There are no self-loops, so w_ii = 0 for all i. Since a graph G can be fully specified by its weight matrix W, graph construction is equivalent to defining a non-negative symmetric matrix W.

As indicated by many researchers, the graph, as a discrete approximation of the data manifold, plays a crucial role in the success of graph-based methods [11], [13], [18], [40]. Different graph construction methods may yield significantly different learning results, even when the sample size tends to infinity [21]. However, compared with the great advances in graph-based machine learning methods, research on graph construction is rather limited. In particular, previous methods usually adopt very straightforward approaches to construct a graph. In more detail, these construction methods involve two steps. First, the edge set E is defined; this can be called the connectivity determination step. Then, each edge e ∈ E is weighted by some weighting function to obtain the weight matrix W. For example [2], to define E, the k nearest neighbor (kNN) method can be used to simply connect each node with its k nearest neighbors, or the ε-ball method can be employed to link any two nodes whose distance is within ε. After E is determined, a weighting function, e.g., the radial basis function (RBF) [36], is adopted to set the weight of each edge in E. Recently, Cheng et al. [11] and Daitch et al. [13] proposed methods that learn the edges and weights simultaneously.

One major shortcoming of the aforementioned approaches is that they often produce serious non-local edges, that is, edges connecting nodes that are far apart in the feature space. Fig. 1 illustrates the graph construction results of various methods. As clearly observed, kNN usually generates non-local edges connecting the inner cluster and the outer circle even when we choose different k [Fig. 1(a) and (b)]. The ε-ball method also leads to non-local edges [Fig. 1(c)]. Note that no single k or ε can give a perfect solution, since a larger k or ε produces more non-local edges while a smaller one breaks the outer circle into pieces. The same holds for the methods proposed in [11] and [13] [Fig. 1(d) and (e)]. Non-local edges make the graph a bad approximation of the manifold underlying the original data set, which hurts the subsequent machine learning tasks.



Fig. 1. Graphs constructed by different methods on the circle data set. (a) 5NN graph. (b) 6NN graph. (c) ε-ball graph, ε = 0.7. (d) Graph given by [11]. (e) Graph given by [13]. (f) Proposed method.

The problem with these construction methods is rooted in the fact that they fail to appropriately preserve locality. Due to this intrinsic problem, they may link nodes that are very dissimilar in the feature space. Locality is a concept closely related to data density. Intuitively, data points in a low-density area should have fewer neighbors than those in a high-density area, since the neighbors of a point in a low-density area are often far away and are more likely to lead to non-local edges. This observation suggests that a good graph construction method should be able to adapt the scale of each data point's neighborhood to the data density automatically. However, both the kNN and the ε-ball method use a single global parameter to determine locality, which is doomed to fail when the data are unevenly distributed.

Different from the previous work [2], [11], [13], [18], in this paper we propose a method that constructs a graph by directly preserving the local information, with the aim of alleviating the problem of non-local edges. In particular, using the Euclidean distance as the similarity measure, we construct graphs such that the weighted sum of distances between every two nodes is minimized under some constraints. Fig. 1(f) shows one illustration of our proposed method, where the locality is perfectly preserved. We show that the proposed framework has a close relationship with the popular Laplacian Eigenmap method [2] and is hence well justified. One important feature of the proposed method is that we only need to solve a quadratic programming (QP) problem; the connectivity and the weights are thus learned simultaneously during the learning procedure. However, since the number of candidate edges is n(n − 1)/2, directly solving the QP is only feasible for very small data sets. Fortunately, under reasonable conditions, we demonstrate that the solution graph is sparse, which enables us to propose an efficient cutting plane algorithm to solve the problem on large data sets.

In the context of clustering and semi-supervised learning, we evaluate our method on several data sets against other competitive methods. Experimental results demonstrate the advantages of the proposed approach.

The main contributions of this paper are as follows.
1) We propose a graph that directly aims to preserve the locality of the data set. Because of the great importance of local information in graph representations, this makes our graph a better approximation of the data distribution.
2) In the proposed graph, each data point may have its own data-adaptive neighborhood. Unlike kNN or ε-ball, which use a fixed global parameter to determine the neighborhoods, our method has the flexibility to adjust the scale of each neighborhood to the data density.
3) We design an efficient cutting plane algorithm to solve the optimization problem, which greatly reduces the computational complexity. We show theoretically and empirically that the constructed graph is reasonably sparse.

The rest of this paper is organized as follows. In Section II, we give a review of existing graph construction methods. Section III describes the proposed locality preserving graph and its properties. Section IV introduces an efficient cutting plane algorithm to learn the locality preserving graph. Experimental results are presented in Section V, and we conclude in Section VI.

II. Related Work

As introduced before, most popular graph construction methods involve two steps. First, the edge set E is defined by the kNN or ε-ball method. Then, a similarity measure is chosen to give each edge a weight.


The most popular similarity measure is the RBF, defined as w_ij = exp{−||x_i − x_j||^2 / (2σ^2)}. Since these methods have serious limitations, researchers have recently explored different ways to construct a graph.

In the traditional kNN graph, a node i is connected not only with its k nearest neighbors but also with those nodes whose k nearest neighbors include i, so the number of neighbors of a node is equal to or larger than k. To address this problem, Jebara et al. [18] proposed a connectivity determination method with an objective similar to kNN that results in a k-regular graph, in which each node has exactly k neighbors. Instead of choosing one weighting function, Wang and Zhang [28] proposed to learn the weights of each node by minimizing the local reconstruction error given its predefined neighborhood. A direct method to compute the kNN graph takes O(pn^2) time; Bentley [4] and Chen et al. [10] explored methods to speed up the construction of kNN graphs.

More recently, Daitch et al. [13] proposed a method that simultaneously learns the connectivity and the weights of the graph. The objective function is formulated as

    min_{w_ij ≥ 0}  Σ_i || d_i x_i − Σ_j w_ij x_j ||^2                    (1)
    s.t.  Σ_{i=1}^{n} (max(0, 1 − d_i))^2 ≤ αn

where α is a hyperparameter, and the other notation is defined in Section I. It can be shown that the resulting graph is highly sparse, with at most (p + 1) × n edges. However, one serious limitation of this approach is that the resulting graph often contains non-local edges, as shown in Fig. 1(e), due to the global nature of its objective function. As this method is related to locally linear embedding (LLE) [24], we refer to it as the LLE-based graph construction (LLEGC) method for simplicity.

Exploiting the idea of sparse representation, Cheng et al. [11] and Yan and Wang [34] proposed the L1 graph, which approximates each sample with a small number of data points and then uses the reconstruction coefficients as edge weights. This is achieved by solving the following optimization problem for each data point [11]:

    min_{α^i}  ||α^i||_1                                                  (2)
    s.t.  x_i = B^i α^i,   α^i_j ≥ 0

where B^i = [x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n, I]. The weight matrix is given by W_ij = α^i_j if i > j, and W_ij = α^i_{j−1} if i < j. Therefore, the L1 graph can provide data-adaptive neighborhoods. However, the L1 graph is not invariant to transformations of the data, which means that a shift of the data set will result in a different graph.

Typical graph-based learning methods involve computing the inverse or the eigenvectors of the graph Laplacian, which makes the computational cost very high. This motivates researchers to approximate the original graph with a simpler graph. Cesa-Bianchi et al. [8] and Herbster et al. [16] used spanning trees, such as minimum spanning trees, shortest path trees, or random spanning trees, to approximate a graph. By exploiting the special structure of trees, the authors derived efficient algorithms to infer the labels of tree nodes. In [20], a subset of data points is selected as prototype points. Each x_i is then reconstructed as a convex combination of its closest prototype points, and the combination coefficients are interpreted as transition probabilities from x_i to the prototype points. After the transition matrix Z is obtained, the weight matrix W is defined as W = ZΛ^{−1}Z^T, where Λ is a diagonal matrix with Λ_kk = Σ_{i=1}^{n} Z_ik. Since W is of low rank, the inverse of its Laplacian can be computed efficiently with the Woodbury identity.

Zhang et al. [37] explored an idea similar to ours, namely constructing a graph by preserving the locality structure. However, with their method the resulting graph is directed and non-sparse, which increases the memory requirements and computational costs of the subsequent learning tasks. Zhang et al. [38] explored the idea of using label information to improve graph construction. Specifically, the method encourages points with the same label to become more similar while points with different labels become more dissimilar. However, this method cannot be used in unsupervised learning.

To summarize this section, we note that most of the newly developed methods are based on minimizing a reconstruction error. Although they improve upon kNN and ε-ball in different respects, they do not directly aim to preserve the local information, which makes them vulnerable to non-local edges. For a more detailed survey of graph construction methods, readers are referred to [18].

III. Locality Preserving Graph

Unlike previous methods, in this section we propose a graph that directly preserves the local information of the original data set. We first introduce the objective function and then discuss its relationship with the Laplacian Eigenmap. We call this method the locality preserving graph construction (LPGC) method.

A. Objective Function

Intuitively, to preserve the locality of the data, we would like to give a zero or small weight to dissimilar points and a large weight to similar points. Adopting the Euclidean distance as the similarity measurement, we build a graph such that the weighted sum of distances between every two nodes is minimized. Thus, we propose to minimize the following objective function with respect to w_ij:

    Σ_{ij} w_ij ||x_i − x_j||^2 = a^T w                                   (3)

where w ∈ R^{n(n−1)/2} is the weight vector of all edges and a ∈ R^{n(n−1)/2} is the vector of squared distances between all pairs of points. This objective function can be justified intuitively as follows. When x_i and x_j are dissimilar, namely ||x_i − x_j||^2 is large, the weight w_ij should be small in order to minimize the objective. Assume that we are to construct an l-edge graph by minimizing (3) with w_ij ∈ {0, 1}; then the l edges with the smallest distances among all potential edges will be selected. Hence, compared with kNN, this objective function tends to preserve the local information globally across the whole data set.


However, directly minimizing (3) with respect to w_ij ≥ 0 will in general lead to w_ij = 0 for all i and j. To avoid this, we constrain the degree d_i of each node not to be too small, while at the same time allowing different nodes to have different degrees. We also penalize the norm of the weight vector to prevent a few edges from dominating the degrees of the nodes. This leads to the final objective function for LPGC

    min_{w ≥ 0}  (1/p) a^T w + (μ/2) ||d − 1||^2 + (λ/2) ||w||^2          (4)

where d ∈ R^n is the vector of node degrees, 1 ∈ R^n is the vector of ones, and μ, λ ≥ 0 are regularization factors. Recall that p is the dimension of the input space.

We define U to be the n × n(n−1)/2 matrix such that the column of U corresponding to the edge e between nodes i and j has exactly two nonzero entries: U_ie = 1 and U_je = 1. Then we can express the degree vector as d = Uw and transform the optimization problem (4) into

    min_{w ≥ 0}  (1/p) a^T w + (μ/2) ||Uw − 1||^2 + (λ/2) ||w||^2  ≜  f(w)        (5)

Since the Hessian matrix of f(w), which equals μU^T U + λI, is positive definite, f(w) is convex. As both the objective function and the constraints are convex, a unique global optimum always exists [6].

Sparsity Analysis: Defining B = μU^T U + λI and b = (1/p)a − 2μ1, where 1 ∈ R^{n(n−1)/2}, the LPGC problem (5) can be written as

    min_{w ≥ 0}  (1/2) w^T B w + b^T w + (1/2) μn.

Through the standard routine, we can obtain its Lagrange dual problem as

    max_{α ≥ 0}  −(1/2) (α − b)^T B^{−1} (α − b) + (1/2) μn

where α are the Lagrange multipliers. Suppose w* and α* are the optimal solutions of the primal and dual problems. From the Karush-Kuhn-Tucker (KKT) conditions α*_i w*_i = 0, we have w*_i = 0 for each non-zero α*_i. Roughly speaking, in most cases the solution α* has many non-zero entries, which leads to a sparse w*. In the extreme case, when μ < (1/(2p)) min{a_i, i = 1, . . . , n(n−1)/2}, we have b = (1/p)a − 2μ1 > 0. In this case, the optimum α* is exactly equal to b and has all entries greater than zero, leading to a graph with no edges.

B. Relationship with Laplacian Eigenmap

It turns out that the objective (3) has a close relationship with the Laplacian Eigenmap [2], a well-known graph embedding algorithm. Moreover, this relationship provides another interpretation of our method. To see this, we reformulate (3) as

    Σ_{ij} w_ij ||x_i − x_j||^2 = Σ_{m=1}^{p} Σ_{i,j} w_ij (x_i^(m) − x_j^(m))^2          (6)
                                = 2 Σ_{m=1}^{p} (x^(m))^T L x^(m)                          (7)
                                = 2 tr(X^T L X)                                             (8)

where X ∈ R^{n×p} is the data matrix whose ith row equals x_i^T, x^(m) is the mth column of X, and L is the graph Laplacian, defined as

    L_ij = −w_ij  if i ≠ j,      L_ij = d_i  if i = j.                    (9)

On the other hand, the Laplacian Eigenmap method aims to solve the following problem:

    min_X  tr(X^T L X)                                                     (10)
    s.t.   X^T D X = I

where D is a diagonal matrix with D_ii = d_i. Therefore, LPGC and Laplacian Eigenmap share the same objective function.

Remarks: Two important points deserve attention. First, Laplacian Eigenmap finds an embedding X using a given graph G, while LPGC uses the given data vectors X to construct a graph G. Second, (7) supplies an interesting interpretation of our method. According to spectral graph theory [2], [12], (x^(m))^T L x^(m) can be viewed as a measure of the smoothness of the mth feature with respect to the graph. Therefore, minimizing (7) can be explained as finding a graph on which each feature is smooth.
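To make the quantities in (3)-(5) and the relationship (6)-(8) concrete, here is a minimal NumPy sketch. It is our own illustration, not the authors' MATLAB implementation: it builds the squared-distance vector a and the incidence matrix U, minimizes f(w) by simple projected gradient (feasible only for very small n, as noted above), and checks that a^T w equals tr(X^T L X) when each pair is counted once.

import numpy as np
from itertools import combinations

def lpgc_parts(X):
    # Candidate edges, squared-distance vector a, and incidence matrix U from (5).
    n = X.shape[0]
    edges = list(combinations(range(n), 2))             # all n(n-1)/2 candidate edges
    a = np.array([np.sum((X[i] - X[j]) ** 2) for i, j in edges])
    U = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        U[i, e] = U[j, e] = 1.0                          # each column has exactly two ones
    return edges, a, U

def lpgc_dense(X, mu=16.0, lam=2.0, n_iter=5000):
    # Minimize f(w) = a^T w / p + (mu/2)||Uw - 1||^2 + (lam/2)||w||^2 over w >= 0
    # by projected gradient descent -- a plain sketch, not the paper's cutting plane solver.
    n, p = X.shape
    edges, a, U = lpgc_parts(X)
    w = np.zeros(len(edges))
    ones = np.ones(n)
    step = 1.0 / (2.0 * mu * n + lam)                    # 1 / (loose bound on the Hessian norm)
    for _ in range(n_iter):
        grad = a / p + mu * (U.T @ (U @ w - ones)) + lam * w
        w = np.maximum(0.0, w - step * grad)
    return edges, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 2))
    edges, w = lpgc_dense(X)
    n = X.shape[0]
    W = np.zeros((n, n))
    for (i, j), wij in zip(edges, w):
        W[i, j] = W[j, i] = wij
    L = np.diag(W.sum(axis=1)) - W                       # graph Laplacian as in (9)
    _, a, _ = lpgc_parts(X)
    print(np.allclose(a @ w, np.trace(X.T @ L @ X)))     # locality objective equals tr(X^T L X)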

IV. LPGC Algorithm

The optimization problem (5) is a standard QP problem, which can be solved by standard tools. However, as the length of the vector w is n(n−1)/2, this is only feasible for very small data sets. In this section, we first design an algorithm that efficiently optimizes problem (5) by exploiting the sparsity of the solution, and then analyze its computational complexity.

A. Cutting Plane Algorithm

We propose a cutting plane style algorithm [19] to optimize (5). It is an iterative method, and in each iteration we solve a small QP problem. The underlying idea is quite simple: first, we choose a set of edges according to some rule and reject all the other edges; this can be viewed as a connectivity determination step, as in kNN. Then, we optimize the objective function (5) over the selected edges, which can be viewed as an edge weighting step. These two steps are repeated until the global optimum is reached. Specifically, given the active edge set E, we compose the following QP problem:

    min_w  (1/p) a^T w + (μ/2) ||Uw − 1||^2 + (λ/2) ||w||^2               (11)
    s.t.   w_e ≥ 0  for e ∈ E,    w_e = 0  for e ∉ E.

This is a QP problem with only |E| variables to optimize. Suppose w* is the optimum of problem (11). To verify whether w* is also the optimum of problem (5), we only need to check whether w* meets all the KKT conditions of problem (5), which can be written as


    ∇f = (1/p) a + μ(U^T U w − U^T 1) + λw ≥ 0                            (12)
    w ≥ 0                                                                  (13)
    w_i ∇_i f = 0,   ∀ i = 1, . . . , n(n−1)/2.                            (14)

The conditions (13) and (14) are always satisfied by w*; we only need to check condition (12). If condition (12) is also satisfied, then w* is the unique global optimum of problem (5). Otherwise, there must be some i such that ∇_i f < 0. Among all the edges satisfying ∇_i f < 0, we add the one with the smallest ∇_i f to the active edge set E and repeat the iteration. Since each iteration results in a decrease of the objective function (5), convergence is guaranteed. In practice, an approximate solution is always sufficient, so we only require ∇f ≥ −r1, where r > 0 is a precision parameter. This condition not only helps to reduce the computational time but also results in a sparser graph. The whole procedure is summarized in Algorithm 1.

Algorithm 1 Locality Preserving Graph Construction
1: Input: x_1, . . . , x_n ∈ R^p, r > 0
2: Init: construct a and U, w_0 ← 0, E ← ∅, t ← 0
3: while NOT ∇f + r1 ≥ 0 do
4:     E ← E ∪ {e = argmin_i ∇_i f}
5:     Starting from w_t, optimize problem (11) and get w_{t+1}
6:     t ← t + 1
7: end while
8: Output: w_t

B. Complexity Analysis of the LPGC Algorithm

The complexity of Algorithm 1 is dominated by solving a sequence of sub-QP problems. In this part, we show that the number of iterations until the algorithm terminates is bounded. The key idea is to show that the improvement made by each iteration is lower bounded by a constant.

Lemma 1: Let J be a symmetric, positive semi-definite matrix, and define the convex objective in x

    g(x) = (1/2) x^T J x + h^T x.

Assume that a solution x_0 and an optimization direction e_i are given such that ∇_i g(x_0) < 0. Then optimizing g starting from x_0 along e_i decreases the objective by

    max_{β > 0} {g(x_0) − g(x_0 + β e_i)} = (∇_i g(x_0))^2 / (2 J_ii).

Proof: Assume we are minimizing g(x) along a direction d. Thus

    β* = argmin_{β ≥ 0} g(x_0 + βd)
    ⟺  [d/dβ g(x_0 + βd)]_{β=β*} = 0
    ⟺  β* = − d^T ∇g(x_0) / (d^T J d) ≥ 0.

Therefore, the improvement is given by

    g(x_0) − g(x_0 + β* d) = (d^T ∇g(x_0))^2 / (2 d^T J d).

The lemma follows by setting d = e_i.

Proposition 1: In each iteration of Algorithm 1, the improvement of the objective is lower bounded by r^2 / (2(2μ + λ)).

Proof: Obviously, the improvement f(w_t) − f(w_{t+1}) made by one iteration of Algorithm 1 is lower bounded by f(w_t) − min_{β > 0} f(w_t + β e_i), where i is the index of the newly added edge. To apply Lemma 1, we note that J = μU^T U + λI and h = (1/p)a − 2μ1. Since J_ii = 2μ + λ and ∇_i f < −r, we have

    f(w_t) − f(w_{t+1}) ≥ f(w_t) − min_{β > 0} f(w_t + β e_i) ≥ r^2 / (2(2μ + λ)).
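As a quick numerical illustration of Lemma 1 (with randomly generated J, h, and x_0 of our own choosing, not anything from the paper), the improvement from an exact line search along a single coordinate matches the closed form (∇_i g(x_0))^2 / (2 J_ii):

import numpy as np

rng = np.random.default_rng(0)
m = 6
A = rng.standard_normal((m, m))
J = A @ A.T + 1e-3 * np.eye(m)           # symmetric positive definite
h = rng.standard_normal(m)
h[0] = -abs(h[0]) - 0.1                   # ensure at least one descent coordinate at x0 = 0
x0 = np.zeros(m)

g = lambda x: 0.5 * x @ J @ x + h @ x
grad = J @ x0 + h
i = int(np.argmin(grad))                  # coordinate with the most negative gradient
beta_star = -grad[i] / J[i, i]            # exact minimizer along e_i (nonnegative since grad[i] < 0)

x1 = x0.copy()
x1[i] += beta_star
print(g(x0) - g(x1), grad[i] ** 2 / (2.0 * J[i, i]))   # the two values coincide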

Theorem 1: Algorithm 1 terminates after at most 4μ^2 (2μ + λ) 1^T B^{−1} 1 / r^2 iterations, where B = μU^T U + λI; equivalently, Algorithm 1 generates a graph with at most 4μ^2 (2μ + λ) 1^T B^{−1} 1 / r^2 edges.

Proof: From Lagrange duality, the dual function −(1/2)(α − b)^T B^{−1} (α − b) + (1/2)μn yields lower bounds on the optimal value of the primal problem (5). By setting α = (1/p)a ≥ 0, we get a lower bound equal to −2μ^2 1^T B^{−1} 1 + (1/2)μn. Since f(w_0) = μn/2, we have

    f(w_0) − f(w_t) ≤ μn/2 − (−2μ^2 1^T B^{−1} 1 + μn/2) = 2μ^2 1^T B^{−1} 1.

Using this fact and Proposition 1, the algorithm terminates after at most 4μ^2 (2μ + λ) 1^T B^{−1} 1 / r^2 iterations.

This bound is reasonable since, by the Woodbury identity, 1^T B^{−1} 1 = 1^T (μU^T U + λI)^{−1} 1 = 1^T ((1/λ)I − (1/λ^2) U^T ((1/μ)I + (1/λ)UU^T)^{−1} U) 1 < n(n−1)/(2λ). In general, the bound may not be very tight, but it conveys useful information: to make the solution sparser, we should use a large r and small μ and λ.
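A compact sketch of Algorithm 1 in the spirit of the description above. This is our own simplification, not the authors' code: the restricted sub-problem (11) is re-solved by projected gradient over the active coordinates rather than by a QP solver such as MATLAB's quadprog, and the construction of a and U repeats the assumed helper from Section III.

import numpy as np
from itertools import combinations

def lpgc_cutting_plane(X, mu=16.0, lam=2.0, r=0.01, inner_iters=2000):
    n, p = X.shape
    edges = list(combinations(range(n), 2))
    a = np.array([np.sum((X[i] - X[j]) ** 2) for i, j in edges])
    U = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        U[i, e] = U[j, e] = 1.0

    w = np.zeros(len(edges))
    ones = np.ones(n)
    step = 1.0 / (2.0 * mu * n + lam)                   # conservative step size for the inner solver

    def grad(w):
        return a / p + mu * (U.T @ (U @ w - ones)) + lam * w

    active = np.zeros(len(edges), dtype=bool)
    while True:
        g = grad(w)
        if np.all(g >= -r):                             # approximate KKT check: grad f >= -r * 1
            break
        active[int(np.argmin(g))] = True                # add the edge with the most negative gradient
        for _ in range(inner_iters):                    # re-solve (11) restricted to the active edges
            g = grad(w)
            w[active] = np.maximum(0.0, w[active] - step * g[active])
    return edges, w

For example, calling edges, w = lpgc_cutting_plane(X) on a small data matrix X returns the candidate edge list and the learned weights; only the active edges carry non-zero weight, mirroring the sparsity argued for above.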

V. Experiments

In this section, we build graphs with our LPGC method and with other construction methods, and then evaluate their performance by employing these graphs in clustering and semi-supervised classification tasks. Additionally, we design experiments to investigate the sparseness of the LPGC graph. All code was implemented in MATLAB, and the experiments were carried out on a PC with an Intel Core 3.00 GHz processor and 2 GB of RAM.

A. Graphs on a Toy Data Set

In addition to the toy example presented in Section I, in this section we use a further toy data set to show that different construction methods result in different graphs. The experiment is performed on the two-moon data set, in which each moon contains 25 randomly sampled points and corresponds to one potential class. We therefore hope that points in one moon are connected, while points in different moons are disconnected. The graphs constructed by kNN, ε-ball, LLEGC, and LPGC are shown in Fig. 2.


We show the kNN graphs for two different values of k, but neither is satisfactory. When k is small, the upper moon fragments into two pieces. When k is large, the two moons are connected by edges starting from the corner points. A similar problem exists for the ε-ball graph; because of space limitations, we show only the best graph constructed by this method. Both the L1-graph and the LLEGC method result in graphs with severe non-local edges. We use the soft version of the LLEGC method and set α = 0.1, as suggested by the authors [13]. Our method gives the best graph, in which the local information is perfectly preserved.

B. Data Sets

In the following experiments, eight data sets are used, collected from the UCI machine learning repository [1] and from [9]. Each data set is preprocessed to have zero mean and unit standard deviation in each dimension. Table I gives a brief description of these data sets.

C. Comparison Methods

To evaluate the performance of the proposed method, we compare LPGC with the following methods: 1) kNN with RBF weighting; 2) ε-ball with RBF weighting; 3) LLEGC proposed in [13]; 4) L1-graph proposed in [11]; and 5) the b-regular graph proposed in [18]. The particular parameter settings for each method are specified as follows.
1) kNN with RBF weighting: k is chosen from {3, 5, 10, 15, 25, 40}. The kernel bandwidth σ of the RBF function is chosen from σ_0 × {2^−4, 2^−3, 2^−2, 2^−1, 1, 2^1, 2^2, 2^3, 2^4}, where σ_0 is the average distance between two points in the data set.
2) ε-ball with RBF weighting: Both the radius ε and the kernel bandwidth σ are chosen from σ_0 × {2^−4, 2^−3, 2^−2, 2^−1, 1, 2^1, 2^2, 2^3, 2^4}.
3) LLEGC: α is fixed to 0.1, as suggested by the authors of [13].
4) LPGC: λ is chosen from {2^−2, 1, 2^2, 2^4} and μ is fixed to 2^4 (see the next subsection for a discussion). The precision r is fixed to 0.01.
5) b-regular graph with RBF weighting: the settings for k and σ are the same as for the kNN graph.

For LPGC, we use the quadprog routine provided by MATLAB to solve the sub-QP problems (11). The code for LLEGC and the b-regular graph was provided by the authors. To solve the optimization problems (2) of the L1-graph, we use CVX, a package for specifying and solving convex programs [14], [15].


D. Clustering

In this part of the experiments, we compare different graphs by employing them in spectral clustering [22]. The algorithm proceeds as follows.
1) Construct a graph by computing the weight matrix W. Then normalize the graph by L = D^{−1/2} W D^{−1/2}, where D is a diagonal matrix with D_ii = Σ_j W_ij.
2) Find c_1, . . . , c_k, the eigenvectors of L corresponding to the k largest eigenvalues, and form the matrix C = [c_1, . . . , c_k].
3) Treat each row of C as a point in R^k and cluster the rows into k clusters via k-means. (Since k-means is sensitive to initialization, we repeat it 20 times with different initial centers and choose the run achieving the minimum k-means objective as the clustering result.)
4) Finally, assign x_i to cluster j if the ith row of the matrix C is assigned to cluster j.

A minimal code sketch of this procedure is given at the end of this subsection. To evaluate the clustering accuracy, we follow the strategy of [32]: we apply the different graphs to classification problems in which the labels are hidden from the learning algorithm, and the labels are used afterward to report a classification accuracy by measuring how well the clustering agrees with the true labeling. More specifically, the clustering algorithm runs on a data set, each resulting cluster is labeled with its majority class according to the labels, and the number of misclassifications is then calculated for each cluster. For kNN and ε-ball, the best results among all possible parameter settings are reported. For LLEGC and L1-graph, no parameters need to be tuned. For LPGC, we compare graphs built with different λ, choose the one with the minimum k-means objective in step 3 of the spectral clustering algorithm, and report its clustering accuracy.

The quantitative comparison results are listed in Table II. LLEGC reported a memory error on the Digit1 data set, so no result is reported for it. It can be observed that the LPGC-based spectral clustering algorithm gives the best results on most of the data sets. Another observation is that the ε-ball method generally shows weak performance, which is consistent with the results reported previously in [13] and [18].
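A minimal sketch of the evaluation pipeline in steps 1)-4), written in plain NumPy (our own illustration; a small k-means with restarts is inlined to keep the sketch self-contained):

import numpy as np

def spectral_clustering(W, k, n_restarts=20, seed=0):
    # Steps 1)-2): normalized graph matrix and its top-k eigenvectors.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]        # L = D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(L)
    C = vecs[:, np.argsort(vals)[::-1][:k]]                   # eigenvectors of the k largest eigenvalues

    # Steps 3)-4): k-means on the rows of C, repeated with random restarts.
    rng = np.random.default_rng(seed)
    best_labels, best_obj = None, np.inf
    for _ in range(n_restarts):
        centers = C[rng.choice(len(C), size=k, replace=False)]
        for _ in range(100):
            labels = np.argmin(((C[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            new_centers = np.array([C[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        obj = ((C - centers[labels]) ** 2).sum()
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels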

E. Semi-Supervised Classification

In this part of the experiments, we evaluate the different construction methods by employing the built graphs in a semi-supervised classification algorithm. The setting is: 1) we are given both the training data (labeled data) and the testing data (unlabeled data) before training, and 2) the goal is to predict the labels of the unlabeled data. The particular semi-supervised learning method we adopt is Gaussian random fields (GRF) [41], which we briefly introduce as follows. First, a graph G is constructed based on both the labeled and unlabeled data points; then, for a k-class problem, GRF solves the following optimization problem:

    min_F  tr(F^T L F)                                                     (15)
    s.t.   F_i = y_i,  i ∈ S

where L is the Laplacian matrix of the graph G defined in (9), F ∈ R^{n×k} is the prediction matrix, S is the index set of the labeled data, and y_i ∈ R^k is the label vector of a labeled point in one-in-k representation (y_ij = 1 if x_i belongs to the jth class and y_ij = 0 otherwise). As a closed-form solution of problem (15) exists, the algorithm is fairly simple. After obtaining the optimal F, we classify an unlabeled point x_i into the jth class if F_ij is the largest entry in the ith row of F.

Our experiments proceed as follows: we randomly split each data set into labeled and unlabeled sets 100 times (the splits are the same for all methods).
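The closed-form solution mentioned above is the standard harmonic-function solution of [41]; the following is a dense NumPy sketch (assuming a connected graph so that the unlabeled block of the Laplacian is invertible, with our own function name and interface):

import numpy as np

def grf_predict(W, labeled_idx, y_labeled, num_classes):
    # Solve (15): clamp F on the labeled nodes and take the harmonic solution
    # F_u = -L_uu^{-1} L_ul Y_l on the unlabeled block.
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                        # graph Laplacian as in (9)
    labeled = np.zeros(n, dtype=bool)
    labeled[np.asarray(labeled_idx)] = True
    unlabeled = ~labeled

    Y = np.zeros((n, num_classes))                        # one-in-k encoding of the given labels
    Y[np.asarray(labeled_idx), np.asarray(y_labeled)] = 1.0

    L_uu = L[np.ix_(unlabeled, unlabeled)]
    L_ul = L[np.ix_(unlabeled, labeled)]
    F_u = np.linalg.solve(L_uu, -L_ul @ Y[labeled])       # harmonic solution on the unlabeled nodes

    pred = np.empty(n, dtype=int)
    pred[np.asarray(labeled_idx)] = np.asarray(y_labeled)
    pred[unlabeled] = F_u.argmax(axis=1)                  # assign the class with the largest entry
    return pred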


Fig. 2. Graphs constructed by different methods on the two-moons data set. (a) 6NN graph. (b) 7NN graph. (c) ε-ball graph, ε = 1.8. (d) L1 graph. (e) Soft LLEGC graph, α = 0.1. (f) LPGC graph, μ = 16, λ = 3.

TABLE I. Data Sets Description

TABLE II Averaged Clustering Accuracy for Different Graph Construction Methods. The Best Performance on Each Data Set is Bolded

For kNN, ε-ball, and LPGC, in each split we apply five-fold cross-validation on the labeled set to select the hyperparameters using GRF, and then use the hyperparameters with the lowest validation error to construct the graph. For L1-graph and LLEGC, the graphs are constructed directly. Finally, we learn the labels of the unlabeled data by applying the GRF approach. The average classification accuracy and standard deviation over the 100 labeled/unlabeled splits are recorded. We perform the experiments with the number of labeled data points varying from 10 to 100, and show the final results in Fig. 3.

Similar conclusions as in clustering can be made: LPGC consistently provides comparable or better results than the other methods. In particular, it performs exceptionally well on the ionosphere, breast, and vehicle data sets. Another observation is that LLEGC and the b-regular graph can also generate good results, especially when the number of labeled data points is large, while the performance of the L1 graph and ε-ball is weak.

It can also be seen that as the number of labeled data points increases, the performance of the different construction methods becomes similar. This is because, when a large proportion of the data is labeled, unlabeled points can receive label information from many different sources, which makes the predictions rather robust to noisy edges.

F. Robustness to Hyperparameters

In this section, we design experiments to evaluate the robustness of LPGC to variations of λ and μ. In the first experiment, for each data set we fix μ = 16 and construct graphs with λ ∈ {2^−4, 2^−3, 2^−2, 2^−1, 1, 2, 2^2, 2^3, 2^4}. For each graph, we randomly select 100 nodes as labeled data, run the GRF algorithm to predict the remaining nodes, and record the classification accuracy. The average accuracy over 100 labeled/unlabeled splits is shown in Fig. 4(a).


Fig. 3. Semi-supervised classification accuracy using GRF algorithm for different graph construction methods under different data sets. The horizontal axis represents the number of the randomly labeled data, and the vertical axis is the classification accuracy averaged over 100 independent runs. (a) Pima. (b) Ionosphere. (c) Sonar. (d) Wine. (e) Breast. (f) Vehicle.

TABLE III Running Time of Different Graph Construction Methods (in Seconds)

TABLE IV Sparseness of Graphs Constructed by Different Methods

In the second experiment, for each data set we fix λ = 2 and construct graphs with μ ∈ {2^−1, 1, 2, 2^2, 2^3, 2^4, 2^5, 2^6}. The same procedure as in the first experiment is then performed, and the results are shown in Fig. 4(b). As the results show, the performance of LPGC is very robust to μ, especially when μ is large. On the other hand, LPGC is also robust to λ on the Pima, Ionosphere, Wine, and Breast data sets, but is sensitive to it on Sonar and Vehicle. Thus, in practice, a technique such as cross-validation should be used to tune this parameter.

Fig. 4. Robustness of LPGC graph to hyperparameters. (a) Robustness of LPGC graph to λ (μ = 16). (b) Robustness of LPGC graph to μ (λ = 2).

G. Speed

In this experiment, we compare the speed of the different graph construction methods.


The experiments are conducted on a PC with a 3.10 GHz four-core CPU and 8 GB of RAM. All methods are implemented in MATLAB; parallelization is disabled, and each method is restricted to one thread. We fix k = 10 for the kNN graph and the b-regular graph, ε = σ_0 for the ε-ball graph, and λ = 16, μ = 2 for the LPGC graph. The running times on the different data sets are shown in Table III. We can see that the kNN and ε-ball graphs have a great advantage in speed. Among the more sophisticated methods, LPGC is the fastest, while the b-regular graph method is the slowest.

H. Sparseness

In this section, we define the quantity r = (No. of edges in G) / (No. of possible edges in G) (hence 0 ≤ r ≤ 1) and use it to evaluate the sparseness of a graph. In the first experiment, for each graph construction method and each data set, we randomly select 100 data points as labeled data and perform GRF on the graphs constructed with different hyperparameters (see Section V-C) to predict the unlabeled points. The sparseness of the graph that corresponds to the highest accuracy is reported in Table IV. As the results show, the b-regular graphs exhibit good sparseness, while the other methods perform similarly. Although the proposed method does not lead to the sparsest graph, its graph construction time is significantly less than that of recent advanced techniques such as the b-regular graph, LLEGC, and L1-graph. In addition, as we will see in the following, the sparsity of the proposed method can also be adapted by varying λ and μ.

In the second experiment, we investigate how the values of λ and μ affect the sparseness of LPGC graphs. Fig. 5(a) shows the sparseness of the LPGC graph against λ with μ = 8 for several data sets, while Fig. 5(b) shows the sparseness against μ with λ = 8. From this experiment, we can clearly see that the number of edges with non-zero weights increases as λ and μ increase. Furthermore, the number of non-zero edges increases exponentially with log_2(λ), i.e., linearly with λ, and sublinearly with μ. In fact, as our method is rather robust to μ, we simply fixed it to 2^4 in the previous classification experiments. It may seem counter-intuitive, but it is actually reasonable, that the solution becomes less sparse as λ increases: since we require each node to have a degree of approximately 1, increasing λ forces each edge to have a weight of about 1/(n − 1), which makes the graph less sparse. Interestingly, in Fig. 5(a) and (b), the curves from top to bottom correspond to data sets of increasing size, which implies that the sparseness increases with the data size. This observation is consistent with the sparseness of the kNN graph, for which r ≈ k/(n − 1). Moreover, data sets of similar size have similarly shaped sparseness curves, which implies that the sparseness is independent of the data dimension. Another very favorable property is that the best graphs for the classification tasks always correspond to sparse graphs. This phenomenon deserves future exploration.

Fig. 5. Sparseness of the LPGC graph. (a) Sparseness of the LPGC graph for different λ (μ = 8). (b) Sparseness of the LPGC graph for different μ (λ = 8).

VI. Conclusion

In this paper, we proposed a novel learning method for graph construction that is capable of preserving the local information of the data. Formulating this graph learning task as a QP problem, we solve it with a cutting plane algorithm, which enables us to handle data sets of moderate size. Compared with kNN or ε-ball, which have to choose a single parameter to define the neighborhood of every node, our method is more flexible in the sense that different nodes can have different numbers of neighbors or different sizes of neighborhood. These characteristics make our method a good choice for graph-based machine learning methods. Experimental results also verify the advantages of our method.

In the future, we will apply this method to real applications, such as face recognition and text classification, to further evaluate it. Currently, the computational cost of LPGC is still too high to handle large data sets, e.g., about 230 s for the 1500 points of the g241c data set; we will consider other optimization techniques to further speed up our method. Another interesting direction, inspired by the work of [38], is to investigate ways to incorporate label information to improve the graph.

References

[1] A. Asuncion and D. J. Newman. (2007). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml/
[2] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neur. Comput., vol. 15, no. 6, pp. 1373–1396, 2003.


[3] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Dec. 2006.
[4] J. L. Bentley, "Multidimensional divide-and-conquer," Commun. ACM, vol. 23, no. 4, pp. 214–229, 1980.
[5] A. Blum and S. Chawla, "Learning from labeled and unlabeled data using graph mincuts," in Proc. Int. Conf. Mach. Learn., pp. 19–26, 2001.
[6] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[7] C. J. C. Burges, "Dimension reduction: A guided tour," Foundat. Trends Mach. Learn., vol. 2, no. 4, pp. 275–365, 2009.
[8] N. Cesa-Bianchi, C. Gentile, and F. Vitale, "Fast and optimal prediction of a labeled tree," in Proc. Annu. Conf. Learn. Theory, 2009.
[9] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Boston, MA, USA: MIT Press, 2006.
[10] J. Chen, H. Fang, and Y. Saad, "Fast approximate k-NN graph construction for high dimensional data via recursive Lanczos bisection," J. Mach. Learn. Res., vol. 10, pp. 1989–2012, Sep. 2009.
[11] B. Cheng, J. C. Yang, S. C. Yan, Y. Fu, and T. Huang, "Learning with l1-graph for image analysis," IEEE Trans. Image Process., vol. 19, no. 4, pp. 858–866, Apr. 2010.
[12] F. R. K. Chung, Spectral Graph Theory. Providence, RI, USA: Amer. Math. Soc., 1997.
[13] S. I. Daitch, J. A. Kelner, and D. A. Spielman, "Fitting a graph to vector data," in Proc. Int. Conf. Mach. Learn., pp. 201–208, 2009.
[14] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control. Berlin, Germany: Springer, 2008, pp. 95–110.
[15] M. Grant and S. Boyd. (2010, Sep.). CVX: MATLAB Software for Disciplined Convex Programming, Version 1.21 [Online]. Available: http://cvxr.com/cvx
[16] M. Herbster, M. Pontil, and S. R. Galeano, "Fast prediction on a tree," Adv. Neur. Inf. Process. Syst., pp. 657–664, 2008.
[17] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, 2010.
[18] T. Jebara, J. Wang, and S. F. Chang, "Graph construction and b-matching for semi-supervised learning," in Proc. Int. Conf. Mach. Learn., pp. 441–448, 2009.
[19] J. E. Kelley, Jr., "The cutting-plane method for solving convex programs," J. Soc. Ind. Appl. Math., vol. 8, no. 4, pp. 703–712, 1960.
[20] W. Liu, J. F. He, and S. F. Chang, "Large graph construction for scalable semi-supervised learning," in Proc. Int. Conf. Mach. Learn., pp. 679–686, 2010.
[21] M. Maier, U. Von Luxburg, and M. Hein, "Influence of graph construction on graph-based clustering measures," Adv. Neur. Inf. Process. Syst., pp. 1025–1032, 2009.
[22] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Adv. Neur. Inf. Process. Syst., pp. 849–856, 2002.
[23] F. Nie, H. Wang, H. Huang, and C. Ding, "Unsupervised and semi-supervised learning via l1-norm graph," in Proc. IEEE Int. Conf. Comput. Vision, pp. 2268–2273, 2011.
[24] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[25] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[26] J. B. Tenenbaum, V. Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[27] U. Von Luxburg, "A tutorial on spectral clustering," Statist. Comput., vol. 17, no. 4, pp. 395–416, 2007.
[28] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, pp. 55–67, Jan. 2008.
[29] G. Wang, F. Wang, T. Chen, D. Y. Yeung, and F. H. Lochovsky, "Solution path for manifold regularized semisupervised classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 308–319, Apr. 2012.
[30] S. Xiang, F. Nie, and C. Zhang, "Semi-supervised classification via local spline regression," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 11, pp. 2039–2053, Nov. 2010.
[31] S. Xiang, F. Nie, C. Zhang, and C. Zhang, "Nonlinear dimensionality reduction with local spline embedding," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1285–1298, Sep. 2009.
[32] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, "Maximum margin clustering," Adv. Neur. Inf. Process. Syst., pp. 1537–1544, 2005.


[33] Z. Xu, I. King, M. R. Lyu, and R. Jin, "Discriminative semi-supervised feature selection via manifold regularization," IEEE Trans. Neur. Netw., vol. 21, no. 7, pp. 1033–1047, Jul. 2010.
[34] S. Yan and H. Wang, "Semi-supervised learning by sparse representation," in Proc. SIAM Int. Conf. Data Min., 2009.
[35] S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin, "Graph embedding and extensions: A general framework for dimensionality reduction," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, p. 40, Jan. 2007.
[36] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," Adv. Neur. Inf. Process. Syst., pp. 1601–1608, 2004.
[37] L. Zhang, L. Qiao, and S. Chen, "Graph-optimized locality preserving projections," Pattern Recognit., vol. 43, no. 6, pp. 1993–2002, 2010.
[38] Y. M. Zhang, Y. Zhang, D. Y. Yeung, C. L. Liu, and X. W. Hou, "Transductive learning on adaptive graph," in Proc. AAAI Conf. Artif. Intell., pp. 661–666, 2010.
[39] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," Adv. Neur. Inf. Process. Syst., pp. 321–328, 2004.
[40] X. Zhu, "Semi-supervised learning literature survey," Dept. Comput. Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. TR 1530, 2008.
[41] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. Int. Conf. Mach. Learn., pp. 912–919, 2003.

Yan-Ming Zhang received the bachelor's degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2004, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2011. He is currently an Assistant Professor at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include machine learning and pattern recognition.

Kaizhu Huang (M'09) received the B.S. degree in automation from Xi'an Jiaotong University, Shaanxi, China, the M.S. degree in pattern recognition from the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, and the Ph.D. degree from the Chinese University of Hong Kong (CUHK), Hong Kong, in 2004. He was a Researcher with Fujitsu Research and Development Centre, Beijing, CUHK, and the University of Bristol, Bristol, U.K., until 2009. From 2009 to 2012, he was an Associate Professor at CASIA. Since 2012, he has been an Associate Professor and the Director of the Multimedia Telecommunications Programme at Xi'an Jiaotong-Liverpool University, Jiangsu, China. He works on machine learning and its applications to pattern recognition. He has developed many promising approaches covering semi-supervised learning, metric learning, kernel methods, and sparse learning, and has applied learning algorithms to various applications, for example, web information processing and character recognition. He has published over 80 international papers, including 26 in SCI-indexed journals and many tier-one conference papers at venues such as NIPS, IJCAI, CVPR, UAI, and ICDM.

Xinwen Hou received the B.S. degree from Zhengzhou University, Henan, China, in 1995, the M.S. degree from the University of Science and Technology of China, Anhui, China, in 1998, and the Ph.D. degree from the Department of Mathematics, Beijing University, Beijing, China, in 2001. From 2001 to 2003, he did post-doctoral research with the Department of Mathematics, Nankai University, Tianjin, China. Currently, he is an Associate Professor at the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, which he joined in 2003. He has published more than ten papers in international journals and conferences. His current research interests include machine learning, video semantics understanding, and face recognition.


Cheng-Lin Liu (SM’04) received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, the Ph.D. degree in pattern recognition and intelligent control from the Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a Post-Doctoral Fellow at Korea Advanced Institute of Science and Technology, Daejeon, South Korea, and later at Tokyo University of Agriculture and Technology, Tokyo, Japan, from 1996 to 1999. From 1999 to 2004, he was a Research Staff Member, and later a Senior Researcher at the Central Research Laboratory, Hitachi, Tokyo. Currently, he is a Professor at the National Laboratory of Pattern Recognition, Institute of Automation of Chinese Academy of Sciences, and is also the Director of the laboratory. He has published over 180 technical papers in prestigious international journals and conferences. His current research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis.
