IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 3, MARCH 2013

383

Low-Rank Structure Learning via Nonconvex Heuristic Recovery

Yue Deng, Qionghai Dai, Senior Member, IEEE, Risheng Liu, Zengke Zhang, and Sanqing Hu, Senior Member, IEEE

Abstract— In this paper, we propose a nonconvex framework to learn the essential low-rank structure from corrupted data. Different from traditional approaches, which directly utilize convex norms to measure sparseness, our method introduces more reasonable nonconvex measurements to enhance the sparsity of both the intrinsic low-rank structure and the sparse corruptions. We introduce, respectively, how to combine the widely used $\ell_p$ norm ($0 < p < 1$) and the log-sum term into the framework of low-rank structure learning. Although the proposed optimization is no longer convex, it can still be effectively solved by a majorization-minimization (MM)-type algorithm, in which the nonconvex objective function is iteratively replaced by its convex surrogate, so that the nonconvex problem finally falls into the general framework of reweighted approaches. We prove that the MM-type algorithm converges to a stationary point after successive iterations. The proposed model is applied to solve two typical problems: robust principal component analysis and low-rank representation. Experimental results on low-rank structure learning demonstrate that our nonconvex heuristic methods, especially the log-sum heuristic recovery algorithm, generally perform much better than the convex-norm-based method for both data with higher rank and with denser corruptions.

Index Terms— Compressive sensing, log-sum heuristic, matrix learning, nuclear norm minimization, sparse optimization.

NOMENCLATURE

ADM     Alternating direction method.
MM      Majorization minimization.
GPCA    Generalized principal component analysis.
PCP     Principal component pursuit.
LHR     Log-sum heuristic recovery.
LLMC    Locally linear manifold clustering.
LRMR    Low-rank matrix recovery.
LRR     Low-rank representation.
LRSL    Low-rank structure learning.

Manuscript received February 12, 2012; revised December 5, 2012; accepted December 7, 2012. Date of publication January 4, 2013; date of current version January 30, 2013. This work was supported in part by the National Basic Research under Project 2010CB731800 and the Key Project of National Science Foundation of China under Grant 61120106003, Grant 61035002, and Grant 61021063. Y. Deng, Q. Dai and Z. Zhang are with the Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]; [email protected]). R. Liu is with the Electronic Information and Electrical Engineering Department, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]). S. Hu is with the College of Computer Science, Hangzhou Dianzi University, Hangzhou 310000, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2235082

LSA     Local subspace affinity.
MoG     Mixture of Gaussian.
RANSAC  Random sample consensus.
RPCA    Robust principal component analysis.
SC      Subspace clustering.
SSC     Sparse subspace clustering.

I. INTRODUCTION

LEARNING the intrinsic data structure via matrix analysis [1], [2] has received wide attention in many fields, e.g., neural networks [3], learning systems [4], [5], control theory [6], computer vision [7], [8], and pattern recognition [9]-[11]. There are quite a number of efficient mathematical tools for rank analysis, e.g., principal component analysis (PCA) and singular value decomposition (SVD). However, these typical approaches can only handle preliminary and simple problems. With the recent progress of compressive sensing [12], the concept of nuclear norm optimization has emerged in the field of rank minimization [13] and has led to a number of interesting applications, e.g., low-rank structure learning (LRSL) from corruptions. LRSL is a general model for many practical problems in the machine learning and signal processing communities; it considers learning the low-rank structure of data in the presence of sparse errors [14]-[17]. Such a problem can be formulated as $P = f(A) + g(E)$, where $P$ is the corrupted matrix observed in the real world, $A$ and $E$ are the low-rank matrix and the sparse corruption, respectively, and the functions $f(\cdot)$ and $g(\cdot)$ are both linear mappings. Recovering two variables (i.e., $A$ and $E$) from just one equation is an ill-posed problem, but it can still be addressed by optimizing (P0)

$$\min_{(A,E)} \ \mathrm{rank}(A) + \lambda \|E\|_0 \quad \text{s.t.} \quad P = f(A) + g(E). \tag{1}$$

In (P0), $\mathrm{rank}(A)$ describes the low-rank structure of matrix $A$, and the sparse errors are penalized via $\|E\|_0$, where the $\ell_0$ norm counts the number of nonzero entries in a matrix. (P0) is often referred to as a sparse optimization, since the rank term and the $\ell_0$ norm are sparse measurements for matrices and vectors, respectively. However, such sparse optimization is of little practical use: owing to the discrete nature of (P0), its exact solution requires an intractable combinatorial search. A common approach that makes (P0) tractable is to minimize its convex envelope, where the rank of a matrix is replaced by the nuclear norm and the sparse errors are
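To make the two sparsity measures in (P0) concrete, the following minimal numpy sketch (an illustration, not part of the paper's experiments) builds a rank-1 matrix, adds a single corruption, and evaluates $\mathrm{rank}(\cdot)$ and the $\ell_0$ count:

```python
import numpy as np

# Rank-1 ground truth A, a single sparse corruption E, and the observation P.
A = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 2.0])  # rank(A) = 1
E = np.zeros((3, 3))
E[0, 2] = 5.0                                    # one corrupted entry
P = A + E

rank_A = np.linalg.matrix_rank(A)   # the low-rank measure in (P0)
l0_E = np.count_nonzero(E)          # the l0 "norm": number of nonzeros
rank_P = np.linalg.matrix_rank(P)   # even one error raises the observed rank
```

Even a single nonzero entry of $E$ raises the rank of the observation, which is why the decomposition has to be solved jointly rather than by truncating $P$.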

2162–237X/$31.00 © 2013 IEEE


penalized via the $\ell_1$ norm; the nuclear norm and the $\ell_1$ norm are the convex envelopes of $\mathrm{rank}(\cdot)$ and $\|\cdot\|_0$, respectively. In practical applications, LRSL via the $\ell_1$ heuristic is powerful enough for many learning tasks with relatively low-rank structure and sparse corruptions. However, when the desired matrix becomes complicated, e.g., when it has high intrinsic rank or the corrupting errors become dense, the convex approach may not achieve promising performance. In order to handle such tough tasks via LRSL, in this paper we take advantage of nonconvex approximations to better enhance the sparseness of signals. The prominent reason that we adopt nonconvex terms is their enhanced sparsity: geometrically, nonconvex terms generally lie much closer to the essential $\ell_0$ norm than the convex $\ell_1$ norm does [18]. To fully exploit the merits of nonconvex terms for LRSL, we formulate (P0) as a semidefinite programming (SDP) problem, so that LRSL with two different norms is eventually converted into an optimization with only the $\ell_1$ norm. Thanks to the SDP formulation, nonconvex terms can be explicitly combined into the LRSL paradigm, and we investigate two widely used nonconvex terms in this paper, i.e., the $\ell_p$ norm ($0 < p < 1$) and the log-sum term. Accordingly, two nonconvex models, i.e., $\ell_p$-norm heuristic recovery (pHR) and log-sum heuristic recovery (LHR), are proposed for corrupted matrix learning. Theoretically, we analyze the relationship between these two models and reveal that the proposed LHR exhibits the same objective as pHR when $p$ approaches $0^+$. Therefore, LHR has more powerful sparseness-enhancement capability than pHR. To obtain accurate solutions, the majorization-minimization (MM) algorithm is applied to solve the nonconvex heuristic model.
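The claim that nonconvex terms lie closer to the $\ell_0$ norm can be illustrated on two vectors with equal $\ell_1$ norm. This small numpy sketch (illustrative only; the $\delta$ value is an assumption) shows that the $\ell_p$ and log-sum measures strongly favor the sparser vector, while the $\ell_1$ norm cannot distinguish them:

```python
import numpy as np

def l1(x):
    return np.sum(np.abs(x))

def lp(x, p):
    return np.sum(np.abs(x) ** p)        # concave l_p measure, 0 < p < 1

def log_sum(x, delta=1e-2):
    return np.sum(np.log(np.abs(x) + delta))

sparse = np.array([4.0, 0.0, 0.0, 0.0])  # one large entry
spread = np.array([1.0, 1.0, 1.0, 1.0])  # same l1 mass, no sparsity

tie = l1(sparse) == l1(spread)               # l1 cannot tell them apart
lp_gap = lp(spread, 0.5) - lp(sparse, 0.5)   # > 0: l_p prefers the sparse one
ls_gap = log_sum(spread) - log_sum(sparse)   # > 0: log-sum prefers it even more
```

The $\ell_1$ penalties tie at 4, whereas $\ell_{1/2}$ gives 2 versus 4, and the log-sum term gives a large negative value for the sparse vector; this is the sparsity-enhancement behavior the paper exploits.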
The MM algorithm is implemented iteratively: it first replaces the nonconvex component of the objective with a convex upper bound, and then it minimizes the constructed surrogate, which makes the nonconvex problem fall exactly into the general paradigm of reweighted schemes. Accordingly, it is possible to solve the nonconvex optimization through a sequence of convex optimizations, and we prove that, within the MM framework, the nonconvex models finally converge to a stationary point after successive iterations. We adapt the nonconvex models to two specific models for practical applications. In a nutshell, they are used to solve the problems of low-rank matrix recovery (LRMR) and low-rank representation (LRR). In LRMR, the nonconvex heuristic models are used to recover a low-rank matrix from sparse corruptions, and their performance is compared with the benchmark principal component pursuit (PCP) [14] method. In practice, our approach often performs very well in spite of its simplicity. In numerical simulations, the nonconvex models can handle many tough tasks on which the typical algorithm fails. For example, the feasible region of LHR is much larger than that of PCP, which implies that it can deal with much denser corruptions and exhibits much higher rank tolerance. The feasible region of PCP is subject to the boundary $\eta_{\mathrm{PCP}} + \xi_{\mathrm{PCP}} = 0.35$, where $\eta$ and $\xi$ are the rank rate and the error rate, respectively. With the proposed LHR model, the feasible boundary can be extended to $\eta_{\mathrm{LHR}} + \xi_{\mathrm{LHR}} = 0.58$. These advancements are also verified on two practical applications: shadow removal on face images and video background modeling. In the second task, LRR, the power of the nonconvex heuristic model is generalized to LRR for subspace clustering (SC), whose goal is to recover the underlying low-rank correlation of subspaces in spite of noisy disturbances. In order to judge the performance, we first apply it to motion segmentation in video sequences, which is a benchmark test for SC algorithms. Besides, in order to highlight the robustness of the proposed framework to noise and disturbances, we apply LHR and pHR to stock clustering, i.e., determining a stock's industrial category given its historical price record. In both experiments, the nonconvex heuristic models gain higher clustering accuracy than other state-of-the-art algorithms, and the improvements are especially noticeable on the stock data, which include significant disturbances. The contribution of this paper is threefold.

1) This paper presents a nonconvex heuristic framework to handle the typical LRSL problem with enhanced sparsity terms. We introduce an MM algorithm to solve the nonconvex LRSL optimization with reweighted schemes, and theoretical justifications are provided to prove that the proposed algorithm converges to a stationary point.

2) The proposed nonconvex heuristic model, especially the LHR method, extends the feasible region of the existing $\ell_1$-norm-based LRSL algorithm, which implies that it can successfully handle more learning tasks with denser corruptions and higher rank.

3) We apply the LHR model to a new task of stock clustering, which serves to demonstrate that LRSL is not only a powerful tool in image and vision analysis, but can also be applied to profitable financial problems.

The remainder of this paper is organized as follows. We review previous works in Section II.
The background of the typical convex LRSL model and its specific SDP form is provided in Section III. Section IV introduces the general nonconvex heuristic models and discusses how to solve the nonconvex problem by the MM algorithm. We address the LRMR problem and compare the proposed LHR and pHR models with PCP on both simulations and practical applications in Section V. Nonconvex models for LRR and subspace segmentation are discussed in Section VI. Section VII concludes this paper.

II. PREVIOUS WORKS

In this part, we review some related works from the following perspectives. First, we discuss two famous models in LRSL, i.e., LRMR from corruptions and LRR. Then, some previous works on the MM algorithm and reweighted approaches are presented.

A. Low-Rank Structure Learning

1) Low-Rank Matrix Recovery: Sparse learning and rank analysis are now drawing more and more attention in both


the fields of machine learning and signal processing. In [19], sparse learning is incorporated into a nonnegative matrix factorization framework for blind source separation. Besides, it has also been introduced into the typical subspace-based learning framework for face recognition [20]. In addition to the widely used $\ell_1$-norm-based sparse learning, some other surrogates have been proposed for signal and matrix learning. In [21], the $\ell_{1/2}$ norm and its theoretical properties are discussed for sparse signal analysis. In [22], elastic-net-based matrix completion algorithms extend the extensively used elastic-net penalty to matrix cases. A fast algorithm for matrix completion using an $\ell_1$ filter was introduced in [23]. Corrupted matrix recovery considers decomposing a low-rank matrix from sparse corruptions, which can be formulated as $P = A + E$, where $A$ is a low-rank matrix, $E$ is the sparse error, and $P$ is the data observed from real-world devices, e.g., cameras, sensors, and other equipment. In most scenarios, the rank of $P$ is not low, due to the disturbance of $E$. How can we recover the low-rank structure of the matrix from gross errors? This interesting topic has been discussed in a number of works, e.g., [14]-[16]. Wright et al. proposed PCP (a.k.a. RPCA), which minimizes the nuclear norm of a matrix while penalizing the $\ell_1$ norm of the errors [15]. PCP can exactly recover the low-rank matrix from sparse corruptions. In some recent works, Ganesh et al. investigated the parameter-choosing strategy for PCP from both theoretical justifications and simulations [24]. In this paper, we introduce reweighted schemes to further improve the performance of PCP. Our algorithm can exactly recover a corrupted matrix from much denser errors and higher rank.

2) LRR: LRR [5] is a robust tool for SC [25], whose desired task is to classify mixed data into their corresponding subspaces/clusters.
The general model of LRR can be formulated as $P = PA + E$, where $P$ is the original mixed data, $A$ is the affine matrix that reveals the correlations between different pairs of data, and $E$ is the residual of such a representation. In LRR, the affine matrix $A$ is assumed to be of low rank, and $E$ is regarded as sparse corruption. Compared with existing SC algorithms, LRR is much more robust to noise and achieves promising clustering results on public datasets. Fast implementations of LRR solutions were introduced in [26] via iterative linearization approaches. In this paper, inspired by LRR, we introduce the log-sum recovery paradigm into LRR and show that, with the log-sum heuristic, its robustness to corruptions can be further improved.

B. MM Algorithm and Reweighted Approaches

The MM algorithm is widely used in machine learning and signal processing. It is an effective strategy for nonconvex problems, in which the hard problem is solved by optimizing a series of easy surrogates. Therefore, most optimizations via the MM algorithm fall into the framework of reweighted approaches. In the field of machine learning, the MM algorithm has been applied to parameter selection for Bayesian classification [27]. In the area of signal processing, the MM algorithm has led to a number of interesting applications, including wavelet-based processing [28] and total variation minimization [29].


For compressive sensing, the reweighted method was used in the $\ell_1$ heuristic and led to a number of practical applications, including portfolio management [30] and image processing [18]. The reweighted nuclear norm was first discussed in [31], and the convergence of such an approach was proven in [32]. Although there are some previous works on reweighted approaches for rank minimization, our approach is quite different. First, this paper considers the new problem of LRSL from corruptions, rather than the single task of sparse signal or nuclear norm minimization. Besides, the existing works on reweighted nuclear norm minimization in [31] and [32] are solved by SDP, which can only handle matrices of relatively small size. In this paper, we use a first-order numerical algorithm [e.g., the alternating direction method (ADM)] to solve the reweighted problem, which significantly improves the numerical performance. Thanks to the distributed optimization strategy, it is possible to generalize the learning capability to large-scale matrices.

III. $\ell_1$-BASED LRSL

In this part, we formulate LRSL as an SDP. With the SDP formulation, it becomes apparent that typical LRSL is a kind of general $\ell_1$-heuristic optimization. This section serves as background material for the discussions in Section IV. As stated previously, the basic optimization (P0) is nonconvex and generally impossible to solve, as its solution usually requires an intractable combinatorial search. In order to make (1) tractable, convex alternatives are widely used in a number of works, e.g., [14] and [15]. Among these approaches, one prevalent method replaces the rank of a matrix by its convex envelope, i.e., the nuclear norm, and the $\ell_0$ sparsity is penalized via the $\ell_1$ norm. Accordingly, by convex relaxation, the problem in (1) can be recast as the semidefinite program

$$\min_{(A,E)} \ \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad P = f(A) + g(E) \tag{2}$$

where $\|A\|_* = \sum_{i=1}^{r} \sigma_i(A)$ is the nuclear norm of the matrix, defined as the sum of the singular values of $A$, and $\|E\|_1 = \sum_{ij} |E_{ij}|$ is the $\ell_1$ norm of a matrix. Although the objective in (2) involves two norms, the nuclear norm and the $\ell_1$ norm, its essence is the $\ell_1$ heuristic. We verify this point with the following lemma.

Lemma 3.1: For a matrix $X \in \mathbb{R}^{m \times n}$, its nuclear norm is equivalent to the following optimization:

$$\|X\|_* = \min_{(Y,Z)} \ \frac{1}{2}\left[\mathrm{tr}(Y) + \mathrm{tr}(Z)\right] \quad \text{s.t.} \quad \begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \succeq 0 \tag{3}$$

where $Y \in \mathbb{R}^{m \times m}$ and $Z \in \mathbb{R}^{n \times n}$ are both symmetric and positive definite, the operator $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and $\succeq$ denotes positive semidefiniteness. For the proof of Lemma 3.1, refer to [13] and [30]. According to this lemma, we can replace the nuclear norm
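As a quick numerical sanity check of Lemma 3.1 (an illustration using numpy, not part of the paper), the choice $Y = U\Sigma U^T$, $Z = V\Sigma V^T$ built from the SVD $X = U\Sigma V^T$ attains $\frac{1}{2}[\mathrm{tr}(Y)+\mathrm{tr}(Z)] = \|X\|_*$ and makes the block matrix in (3) positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Feasible point of the SDP in (3), built from the SVD of X.
Y = U @ np.diag(s) @ U.T           # m x m, symmetric PSD
Z = Vt.T @ np.diag(s) @ Vt         # n x n, symmetric PSD
half_trace = 0.5 * (np.trace(Y) + np.trace(Z))
nuclear = s.sum()                  # ||X||_* = sum of singular values

# The block matrix [[Y, X], [X^T, Z]] is positive semidefinite.
M = np.block([[Y, X], [X.T, Z]])
min_eig = np.linalg.eigvalsh(M).min()
```

Since $\mathrm{tr}(U\Sigma U^T) = \mathrm{tr}(\Sigma)$, the half-trace equals the nuclear norm exactly, which is the certificate behind the lemma.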


in (2) and reformulate (2) as

$$\min_{(Y,Z,A,E)} \ \frac{1}{2}\left[\mathrm{tr}(Y) + \mathrm{tr}(Z)\right] + \lambda \|E\|_1 \quad \text{s.t.} \quad \begin{bmatrix} Y & A \\ A^T & Z \end{bmatrix} \succeq 0, \quad P = f(A) + g(E). \tag{4}$$

From Lemma 3.1, we know that both $Y$ and $Z$ are symmetric and positive definite. Therefore, the traces of $Y$ and $Z$ can be expressed in a specific $\ell_1$-norm form, i.e., $\mathrm{tr}(Y) = \|\mathrm{diag}(Y)\|_1$, where $\mathrm{diag}(Y)$ is an operator that collects the diagonal entries of $Y$ into a vector. Therefore, the optimization in (4) can be expressed as

$$\min_{\hat{X} \in \hat{D}} \ \frac{1}{2}\left(\|\mathrm{diag}(Y)\|_1 + \|\mathrm{diag}(Z)\|_1\right) + \lambda \|E\|_1 \tag{5}$$

where $\hat{X} = \{Y, Z, A, E\}$ and

$$\hat{D} = \left\{ (Y, Z, A, E) : \begin{bmatrix} Y & A \\ A^T & Z \end{bmatrix} \succeq 0, \ (A, E) \in C \right\}.$$

Here, $(A, E) \in C$ stands for the convex constraint.

IV. CORRUPTED MATRIX RECOVERY VIA NONCONVEX HEURISTIC

By Lemma 3.1, the convex problem with two norms in (2) has been successfully converted into an optimization with only the $\ell_1$ norm; it is therefore called the $\ell_1$ heuristic. The $\ell_1$ norm is the convex envelope of the concave $\ell_0$ norm, but a number of previous works have indicated the limitations of approximating $\ell_0$ sparsity with the $\ell_1$ norm, e.g., [18], [33], and [34]. It is natural to ask, for example, whether a different alternative might not only find a correct solution, but also outperform the $\ell_1$ norm. One natural inspiration is to use nonconvex terms lying much closer to the $\ell_0$ norm than the convex $\ell_1$ norm. However, using nonconvex heuristic terms inevitably raises two problems: 1) which nonconvex function is ideal, and 2) how to efficiently solve the nonconvex optimization. In the following two subsections, we address these two problems, respectively, by introducing the LHR and its reweighted solution.

A. Nonconvex Heuristic Recovery

In this paper, we introduce two nonconvex terms to enhance the sparsity of the model in (5). The first one is the widely used $\ell_p$ norm with $0 < p < 1$. Intuitively, it lies between the $\ell_0$ norm and the $\ell_1$ norm; therefore, it is believed to have better sparse-representation ability than the convex $\ell_1$ norm. We define the general concave $\ell_p$ penalty by $f_p(X) = \sum_{ij} |X_{ij}|^p$, $0 < p < 1$. By substituting it into (5), the following $\ell_p$-norm heuristic recovery (pHR) optimization is obtained:

$$(\text{pHR}) \quad H_p(\hat{X}) = \min_{\hat{X} \in \hat{D}} \ \frac{1}{2}\left[ f_p(\mathrm{diag}(Y)) + f_p(\mathrm{diag}(Z)) \right] + \lambda f_p(E). \tag{6}$$

In the formulation of pHR, obviously, it differs from (5) only in the selection of the sparse norm, where the latter uses

the concave $\ell_p$ norm instead of the typical $\ell_1$ norm. Starting from pHR, another nonconvex heuristic model with much sparser penalization can be derived. Obviously, for any $p > 0$, minimizing the aforementioned pHR is equivalent to

$$\min_{\hat{X} \in \hat{D}} F(\hat{X}) = \frac{1}{p}\left[ H_p(\hat{X}) - \left(\frac{1}{2}m + \frac{1}{2}n + \lambda mn\right) \right] = \frac{1}{2}\sum_{i=1}^{m} \frac{|Y_{ii}|^p - 1}{p} + \frac{1}{2}\sum_{i=1}^{n} \frac{|Z_{ii}|^p - 1}{p} + \lambda \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{|E_{ij}|^p - 1}{p}. \tag{7}$$

The optimization in (6) is the same as the problem in (7) because the multiplier $1/p$ is a positive constant and $\frac{1}{2}m + \frac{1}{2}n + \lambda mn$ is a constant. By L'Hôpital's rule, $\lim_{p \to 0} \frac{x^p - 1}{p} = \frac{\partial_p(x^p - 1)}{\partial_p(p)} = \log x$, where $\partial_p(f(p))$ denotes the derivative of $f(p)$ with respect to $p$. Accordingly, by taking the limit $\lim_{p \to 0^+} F(\hat{X})$ in (7), we get the LHR model $H_L(\hat{X})$:

$$(\text{LHR}) \quad H_L(\hat{X}) = \min_{\hat{X} \in \hat{D}} \ \frac{1}{2}\left[ f_L(\mathrm{diag}(Y)) + f_L(\mathrm{diag}(Z)) \right] + \lambda f_L(E). \tag{8}$$

For any matrix $X \in \mathbb{R}^{m \times n}$, the log-sum term is defined as $f_L(X) = \sum_{ij} \log(|X_{ij}| + \delta)$, where $\delta > 0$ is a small regularization constant. From (6) and (8), we see that LHR is a limiting case of pHR as $p \to 0^+$. It is known that for $0 < p < 1$, the closer $p$ is to zero, the stronger the sparsity enhancement that $\ell_p$-based optimization exhibits. We also note that when $p$ equals zero, pHR corresponds exactly to the intractable discrete problem in (1). For $p = 0$ and $p \to 0^+$, pHR gives two different objectives. This finding does not contradict our derivation, since when $p = 0$ (or $p < 0$) the equivalence between (6) and (7) no longer holds. Therefore, LHR only uses a limit to approximate the intractable objective of $\ell_0$-based recovery; this is also why we place a plus in the superscript of zero in the limit $p \to 0^+$. LHR exploits the limit of the $\ell_0$ norm in the objective and is regarded as having much stronger sparsity-enhancement capability than general pHR. Owing to the more powerful sparseness of LHR, in this paper we advocate the use of LHR for nonconvex LRSL. Therefore, in the remainder of this paper, we discuss the formulations of LHR for low-rank optimization in detail. Fortunately, thanks to the natural connection between pHR and LHR, the optimization rule of LHR also applies to pHR; we state their slight variations when necessary. Experimental comparisons of pHR and LHR on numerical simulations and practical applications are given in Sections V and VI.
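The limit underlying the LHR derivation, $\lim_{p \to 0^+} (x^p - 1)/p = \log x$, is easy to confirm numerically (a standalone check, not from the paper):

```python
import math

x = 2.0
for p in (1e-2, 1e-4, 1e-6):
    approx = (x ** p - 1.0) / p   # the scaled l_p penalty from (7)
    print(p, approx, math.log(x))  # approx tends to log(x) as p -> 0+

err = abs((x ** 1e-6 - 1.0) / 1e-6 - math.log(x))
```

The error shrinks linearly in $p$ (it behaves like $p(\log x)^2/2$), which is why the objective of pHR smoothly deforms into that of LHR.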

B. Solving LHR via Reweighted Approaches

Although we have introduced a powerful term to enhance sparsity, it unfortunately also brings nonconvexity into the objective function. For example, the LHR model is not convex


since the log term $\log(x + \delta)$ is concave on $\mathbb{R}_{++}$. In most cases, a nonconvex problem can be extremely hard to solve. Fortunately, a convex upper bound of $f_L(\cdot)$ is easily found and is given by its first-order Taylor expansion. Therefore, we introduce the MM algorithm to solve the LHR optimization. The MM algorithm replaces the hard problem by a sequence of easier ones. It proceeds in an expectation-maximization-like fashion by repeating two steps, majorization and minimization, in an iterative way. During the majorization step, it constructs a convex upper bound of the nonconvex objective; in the minimization step, it minimizes this upper bound. According to Appendix A, the first-order Taylor expansion of each component in (8) is well defined. Therefore, we can construct the upper bound of LHR and instead optimize the following problem:

$$\min_{\hat{X} \in \hat{D}} T(\hat{X} \,|\, \hat{\Phi}) = \frac{1}{2}\mathrm{tr}\!\left[ (\Phi_Y + \delta I_m)^{-1} Y \right] + \frac{1}{2}\mathrm{tr}\!\left[ (\Phi_Z + \delta I_n)^{-1} Z \right] + \lambda \sum_{ij} \left( (\Phi_E)_{ij} + \delta \right)^{-1} E_{ij} + \text{const}. \tag{9}$$

In (9), the set $\hat{X} = \{Y, Z, A, E\}$ contains all the variables to be optimized, and the set $\hat{\Phi} = \{\Phi_Y, \Phi_Z, \Phi_E\}$ contains all the parameter matrices. The parameter matrices define the points at which the concave function is linearized via Taylor expansion; see Appendix A for details. At the end of (9), const stands for the constants that are independent of $\{Y, Z, A, E\}$. In some previous works on MM algorithms [18], [27], [32], the parameter $\hat{\Phi}$ in the $t$-th iteration is set to the optimal value $\hat{X}^*$ of the last iteration, i.e., $\hat{\Phi} = \hat{X}^t$. To numerically solve the LHR optimization, we remove the constants that are independent of $Y$, $Z$, and $E$ in $T(\hat{X}|\hat{\Phi})$ and get the new convex objective

$$\min \ \frac{1}{2}\left[ \mathrm{tr}(W_Y^2 Y) + \mathrm{tr}(W_Z^2 Z) \right] + \lambda \sum_{ij} (W_E)_{ij} E_{ij}$$

where $W_{Y(Z)} = (\Phi_{Y(Z)} + \delta I_{m(n)})^{-1/2}$ and $(W_E)_{ij} = ((\Phi_E)_{ij} + \delta)^{-1}$, $\forall i, j$. It is worth noting that $\mathrm{tr}(W_Y^2 Y) = \mathrm{tr}(W_Y Y W_Y)$. Besides, since both $W_Y$ and $W_Z$ are positive definite, the first constraint in (8) is equivalent to

$$\begin{bmatrix} W_Y & 0 \\ 0 & W_Z \end{bmatrix} \begin{bmatrix} Y & A \\ A^T & Z \end{bmatrix} \begin{bmatrix} W_Y & 0 \\ 0 & W_Z \end{bmatrix} \succeq 0.$$

Therefore, after convex relaxation, the optimization in (8) becomes

$$\min \ \frac{1}{2}\left[ \mathrm{tr}(W_Y Y W_Y) + \mathrm{tr}(W_Z Z W_Z) \right] + \lambda \|W_E \circ E\|_1 \quad \text{s.t.} \quad \begin{bmatrix} W_Y Y W_Y & W_Y A W_Z \\ (W_Y A W_Z)^T & W_Z Z W_Z \end{bmatrix} \succeq 0, \quad P = f(A) + g(E). \tag{10}$$

Here, we apply Lemma 3.1 to (10) once again and rewrite the optimization in (10) as the sum of a nuclear norm and an $\ell_1$ norm

$$\min_{(A,E)} \ \|W_Y A W_Z\|_* + \lambda \|W_E \circ E\|_1 \quad \text{s.t.} \quad P = f(A) + g(E). \tag{11}$$


In (11), the operator $\circ$ in the error term denotes the componentwise product of two matrices, i.e., for $W_E$ and $E$: $(W_E \circ E)_{ij} = (W_E)_{ij} E_{ij}$. According to [30], if we take the SVD $A^* = U \Sigma V^T$, then $Y^* = U \Sigma U^T$ and $Z^* = V \Sigma V^T$. Accordingly, the weight matrices are $W_Y = (U \Sigma U^T + \delta I_m)^{-1/2}$ and $W_Z = (V \Sigma V^T + \delta I_n)^{-1/2}$.¹ We should note that (11) also applies to the pHR problem; it merely uses the different weighting matrices $W_Y = (U \Sigma U^T + \delta I_m)^{(p-1)/2}$, $W_Z = (V \Sigma V^T + \delta I_n)^{(p-1)/2}$, and $(W_E)_{ij} = ((\Phi_E)_{ij} + \delta)^{p-1}$. The derivations of these parameter matrices closely follow the LHR formulations in Appendix A, so we omit them here.

Based on the MM algorithm, we have thus converted the nonconvex LHR optimization into a sequence of convex reweighted problems. We call (11) a reweighted method since, in each iteration, we recompute the weight matrix set $\hat{W}$ and use the updated weights to construct the convex surrogate. Besides, the objective in (11), a sum of a nuclear norm and an $\ell_1$ norm, is convex and can be solved by convex optimization. In the next two sections, the general LHR model will be adapted to two specific models, and we will provide the optimization strategies for those two models, respectively. But before that, we first pause to present some theoretical discussion of the LHR model.

C. Theoretical Justifications

In this part, we investigate some theoretical properties of the proposed nonconvex heuristic algorithm under MM optimization and prove its convergence. For simplicity, we denote the objective function in (8) by $H(\hat{X})$ and the surrogate function in (9) by $T(\hat{X}|\hat{\Phi})$, where $\hat{X}$ is the set containing all the variables and $\hat{\Phi}$ records the parameter matrices. Before discussing the convergence property of LHR, we first provide two lemmas.

Lemma 4.1: If we set $\hat{\Phi}^t := \hat{X}^t$, the MM algorithm monotonically decreases the nonconvex objective function $H(\hat{X})$, i.e., $H(\hat{X}^{t+1}) \leq H(\hat{X}^t)$.

The proof of this lemma is given in Appendix B.1. Based on Lemma 4.1, Lemma 4.2 establishes the convergence of the LHR iterations.

Lemma 4.2: Let $\hat{X} = \{\hat{X}^0, \hat{X}^1, \ldots, \hat{X}^t, \ldots\}$ be a sequence generated by the MM framework; after successive iterations, the sequence converges to a limit point.

The proof of Lemma 4.2 includes two steps. First, we prove that there exists a convergent subsequence $\hat{X}^{t_k} \in \hat{X}$; see the discussion in Appendix B.3 for details. Then, we prove the whole of Lemma 4.2 by contradiction in Appendix B.2. Based on the two lemmas proved previously, we can now state the convergence theorem of the proposed LHR model.

Theorem 4.3: Under the MM framework, the LHR model finally converges to a stationary point.

See Appendix B.4 for the proof. In this part, we have shown that, with the MM algorithm, the LHR model converges to a stationary point. However, it is impossible to

¹ In some cases, the weighting matrices may contain complex entries due to the inverse operation. In such conditions, we use the approximating matrices $W_Y = U(\Sigma + \delta I_m)^{-1/2} U^T$ and $W_Z = V(\Sigma + \delta I_n)^{-1/2} V^T$ in LHR.
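The monotone decrease guaranteed by Lemma 4.1 is easy to observe on a toy separable problem. The sketch below (an illustration under assumed values of $\lambda$ and $\delta$; it addresses a scalar denoising objective, not the matrix model itself) applies exactly the MM recipe of Section IV-B: linearize the log-sum term at the current iterate, then minimize the resulting weighted-$\ell_1$ surrogate in closed form:

```python
import numpy as np

def objective(x, y, lam, delta):
    # H(x) = 0.5 ||x - y||^2 + lam * sum_i log(|x_i| + delta)
    return 0.5 * np.sum((x - y) ** 2) + lam * np.sum(np.log(np.abs(x) + delta))

def soft(v, t):
    # Soft-thresholding: closed-form minimizer of 0.5 (x - v)^2 + t |x|
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

y = np.array([3.0, -0.2, 0.05, 1.5, -2.0])   # noisy observation
lam, delta = 0.5, 1e-2                        # assumed illustrative values
x = y.copy()                                  # initialization
objs = [objective(x, y, lam, delta)]
for _ in range(30):
    w = 1.0 / (np.abs(x) + delta)   # majorization: slope of log at x^t
    x = soft(y, lam * w)            # minimization of the weighted-l1 surrogate
    objs.append(objective(x, y, lam, delta))

monotone = all(b <= a + 1e-10 for a, b in zip(objs, objs[1:]))
```

Every outer iteration is a reweighted soft-threshold, and the objective sequence never increases, mirroring Lemma 4.1; small entries of $y$ are driven exactly to zero, mirroring the enhanced sparsity of the log-sum penalty.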


claim that the converged point is the global minimum, since the objective function of LHR is not convex. Fortunately, with a good starting point, we can always find a desirable solution by iterative approaches. In this paper, the solution of the $\ell_1$ heuristic model is used as the starting point, and it always leads to a satisfactory result.

V. LHR FOR LRMR FROM CORRUPTIONS

In this part, we first apply the LHR model to recover a low-rank matrix from corruptions, and its performance is compared with the widely used PCP.

A. Proposed Algorithm

Based on the LHR derivations, the corrupted LRMR problem can be formulated as the reweighted problem

$$\min_{(A,E)} \ \|W_Y A W_Z\|_* + \lambda \|W_E \circ E\|_1 \quad \text{s.t.} \quad P = A + E. \tag{12}$$

Because the reweighting matrices are placed inside the nuclear norm, it is impossible to directly obtain the closed-form solution of the nuclear norm minimization. Therefore, inspired by [5], we introduce an auxiliary variable $J$ into (12) through an additional equality constraint and instead solve

$$\min \ \|J\|_* + \lambda \|W_E \circ E\|_1 \quad \text{s.t.} \quad h_1 = P - A - E = 0, \quad h_2 = J - W_Y A W_Z = 0. \tag{13}$$

Algorithm 1 Optimization Strategy of LHR for Corrupted Matrix Recovery

Input: corrupted matrix $P$ and parameter $\lambda$.
Initialization: $t := 1$; $E_{ij}^{(0)} := 1$, $\forall i, j$; $W_{Y(Z)} = I_{m(n)}$.
1: repeat
2:   Update the weighting matrices $W_E^{(t)}$, $W_Y^{(t)}$, and $W_Z^{(t)}$ according to the current estimates $A^{(t)}$ and $E^{(t)}$;
3:   Reset $C^0 > 0$; $\mu_0 > 0$; $\rho > 1$; $k = 1$; $A_0 = E_0 = 0$;
4:   while not converged do
5:     // Variable updating.
6:     $E_{ij}^k = s_{\lambda \mu^{-1} (W_E^{(t)})_{ij}}\big( (P - A_{k-1} - \mu^{-1} C_1^k)_{ij} \big)$, $\forall i, j$;
7:     $J^k = d_{\mu^{-1}}\big( W_Y^{(t)} A_{k-1} W_Z^{(t)} + \mu^{-1} C_2^k \big)$;
8:     $A^k = A_{k-1} + \gamma \big[ -W_Y^{(t)} (h_1^k + \mu^{-1} C_2^k) W_Z^{(t)} + (h_2^k + \mu^{-1} C_1^k) \big]$;
9:     // Dual ascending.
10:    $C_1^{k+1} = C_1^k + \mu_k h_1^k$;
11:    $C_2^{k+1} = C_2^k + \mu_k h_2^k$;
12:    $k := k + 1$; $\mu_{k+1} = \rho \mu_k$;
13:  end while
14:  $(A^{(t)}, E^{(t)}) = (A_k, E_k)$; $t := t + 1$;
15: until convergence
Output: $(A^{(t)}, E^{(t)})$.

Based on this transformation, only the single variable $J$ appears in the nuclear norm of the objective, so we can directly obtain its closed-form update rule by [35]. There are quite a number of methods that can be used to solve (13), e.g., the proximal gradient (PG) algorithm [36] or alternating direction methods (ADMs) [37]. In this paper, we adopt the ADM, since it is more effective and efficient. Using the ALM method [38], it is computationally expedient to relax the equalities in (13) and instead minimize

$$L = \|J\|_* + \lambda \|W_E \circ E\|_1 + \langle C_1, h_1 \rangle + \langle C_2, h_2 \rangle + \frac{\mu}{2}\left( \|h_1\|_F^2 + \|h_2\|_F^2 \right) \tag{14}$$

where $\langle \cdot, \cdot \rangle$ is the inner product, and $C_1$ and $C_2$ are the Lagrange multipliers, which can be updated via dual ascent. Equation (14) contains three variables, i.e., $J$, $E$, and $A$. Accordingly, it is possible to solve the problem via a distributed optimization strategy, the ADM. The convergence of the ADM for convex problems has been widely discussed in a number of works [37], [39]. By ADM, the joint optimization is minimized in four steps: E-minimization, J-minimization, A-minimization, and dual ascent. The detailed derivations are similar to the previous works in [8] and [38], and we omit them here. The whole framework for solving the LHR model for LRMR via reweighted schemes is given in Algorithm 1.² In lines 6 and 7 of the algorithm, $s_\alpha(\cdot)$ and $d_\alpha(\cdot)$ denote the signal shrinkage operator and the matrix shrinkage operator, respectively [38].

² The optimization for pHR is very similar; only the weight matrices change.

B. Simulation Results and Validation

We have explained how to recover a low-rank matrix via LHR in the preceding sections. In this section, we conduct experiments to test its performance in comparison with robust PCP on numerical simulations.

1) Numerical Simulations: We demonstrate the accuracy of the proposed nonconvex algorithms on randomly generated matrices. For a fair comparison, we adopt the same data-generating method as [14]: all the algorithms are performed on square matrices; the ground-truth low-rank matrix (of rank $r$) with $m \times n$ entries, denoted $A^*$, is generated by an independent random orthogonal model [14]; the sparse error $E^*$ is generated by uniformly sampling the matrix support, with error values drawn randomly from the range $[-100, 100]$. The corrupted matrix is generated as $P = A^* + E^*$, where $A^*$ and $E^*$ are the ground truth. For simplicity, we denote the rank rate by $\eta = \mathrm{rank}(A^*)/\max\{m, n\}$ and the error rate by $\xi = \|E^*\|_0/(mn)$. For a fair comparison, we use the code in [40] to solve the PCP problem.³ Reference [14] indicated that the PCP method can exactly recover a low-rank matrix from corruptions within the region $\eta + \xi < 0.35$. Here, in order to highlight the effectiveness of our LHR model, we directly consider much more difficult tasks by setting $\eta + \xi = 0.5$. We compare the PCP ($p = 1$) model with the proposed pHR (with $p = 1/3$ and $p = 2/3$) and LHR (which can be regarded as $p \to 0^+$). Each experiment is repeated ten times

³ In [38], Lin et al. provided two solvers, an exact and an inexact solver, for the PCP problem. In this paper, we use the exact solver for PCP because it performs better than the inexact one.
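The random test-matrix protocol above can be sketched as follows (a simplified stand-in: a Gaussian factor product replaces the random orthogonal model of [14], and the function name is illustrative):

```python
import numpy as np

def make_test_matrix(m, r, err_rate, rng):
    # Low-rank ground truth A* (rank r with probability one).
    L = rng.standard_normal((m, r))
    R = rng.standard_normal((m, r))
    A = L @ R.T
    # Sparse errors on a uniformly sampled support, values in [-100, 100].
    n_err = int(err_rate * m * m)
    E = np.zeros((m, m))
    support = rng.choice(m * m, size=n_err, replace=False)
    E.flat[support] = rng.uniform(-100.0, 100.0, size=n_err)
    return A, E, A + E

rng = np.random.default_rng(0)
A, E, P = make_test_matrix(100, 10, 0.1, rng)
eta = np.linalg.matrix_rank(A) / 100      # rank rate
xi = np.count_nonzero(E) / 100 ** 2       # error rate
```

Sampling the support without replacement keeps $\xi$ exact, so a target such as $\eta + \xi = 0.5$ can be hit precisely when sweeping the feasible region.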


TABLE I
EVALUATIONS OF LRMR OF ROBUST PCA AND NONCONVEX HEURISTIC RECOVERY (MEAN ± STD)

rank(A∗) = 0.1m, ‖E∗‖_0 = 0.4m²:

m     method   ‖A−A∗‖F/‖A∗‖F   rank(A)   ‖E‖_0          time (s)
200   PCP      (1.2±0.5)e−1    107±11    23 098±−       5.9±1.3
      pHR2/3   (9.3±0.9)e−3    20±0      16 011±17      16.4±3.1
      pHR1/3   (3.6±0.6)e−3    20±0      16 000±0       13.1±3.7
      LHR      (1.3±0.7)e−3    20±0      16 031±21      12.7±2.5
400   PCP      (6.4±2.3)e−1    217±52    89 370±−       27.4±3.3
      pHR2/3   (5.0±1.1)e−3    71±9      64 000±0       73.8±12.3
      pHR1/3   (4.0±0.7)e−4    41±3      64 000±0       64.2±13.4
      LHR      (1.7±0.5)e−4    40±0      64 000±0       53.4±11.8
800   PCP      (9.1±1.7)e−2    348±42    355 878±−      36.2±8.4
      pHR2/3   (6.2±0.9)e−3    80±3      257 762±1268   103.6±21.5
      pHR1/3   (5.3±1.1)e−3    80±0      256 097±72     96.2±27.3
      LHR      (4.1±0.5)e−3    80±0      255 746±133    89.3±21.2

rank(A∗) = 0.4m, ‖E∗‖_0 = 0.1m²:

m     method   ‖A−A∗‖F/‖A∗‖F   rank(A)   ‖E‖_0          time (s)
200   PCP      (4.6±1.1)e−1    102±12    21 132±−       7.4±1.5
      pHR2/3   (3.7±0.3)e−2    88±4      4378±211       16.3±2.6
      pHR1/3   (1.8±0.3)e−2    83±2      4113±77        13.4±3.5
      LHR      (8.1±1.2)e−4    80±0      4000±13        14.1±2.7
400   PCP      (4.5±1.2)e−1    205±37    82 149±−       33.2±4.6
      pHR2/3   (2.3±0.9)e−2    193±23    15 782±123     63.2±15.7
      pHR1/3   (1.2±0.5)e−2    160±0     15 873±115     63.2±12.4
      LHR      (2.3±0.7)e−3    160±2     16 038±24      54.3±10.7
800   PCP      (4.7±1.1)e−1    435±97    336 188±−      50.1±10.6
      pHR2/3   (2.3±0.6)e−2    361±32    63 901±115     129.2±25.2
      pHR1/3   (8.7±2.3)e−3    320±0     63 962±36      119.2±23.6
      LHR      (1.7±0.3)e−3    320±0     63 999±12      107.6±25.3

Fig. 1. Feasible region and the convergence verifications. (a) Feasible region verification (rank rate η versus error rate ξ, each from 0 to 0.5, with the feasible boundaries of PCP, pHR(2/3), pHR(1/3), and LHR separating the feasible region from the infeasible one). (b) Convergence verification.

and the mean values and their standard deviations (std) are tabulated in Table I. In the table, ‖A − A∗‖_F/‖A∗‖_F denotes the recovery accuracy, rank(A) denotes the rank of the recovered matrix A, ‖E‖_0 is the cardinality of the recovered errors, and time records the computational cost in seconds. We do not report the std of the PCP method on the recovered errors because PCP consistently diverges on these tasks and its std carries no significant statistical meaning. From the results, compared with PCP, the LHR model can clearly recover the matrix exactly from higher ranks and denser errors. The pHR model recovers the matrix correctly in most cases, but its recovery accuracy is a bit lower than that of LHR. We also report the processing time in Table I. The computer used for the experiments is equipped with a 2.3 GHz CPU and 4 GB of RAM.

2) Feasible Region: Since the basic optimization involves two terms, i.e., the low-rank matrix and the sparse error, in this part we vary these two variables to test the feasible boundaries of PCP, pHR, and LHR, respectively. The experiments are conducted on 400 × 400 matrices with sparse errors

uniformly distributed in [−100, 100]. In the feasible-region verification, the algorithm is considered to diverge when the recovery accuracy exceeds 1% (i.e., ‖A − A∗‖_F/‖A∗‖_F > 0.01). The two rates η and ξ are varied from 0 to 1 with a step of 0.025. At each test point, all the algorithms are repeated ten times; if the median recovery accuracy is less than 1%, the point is regarded as feasible. The feasible regions of the algorithms are shown in Fig. 1(a). From Fig. 1(a), the feasible region of LHR is much larger than that of PCP. We reach the same conclusion as [14] that the feasible boundary of PCP roughly fits the curve η_PCP + ξ_PCP = 0.35. The boundary of LHR lies around the curve η_LHR + ξ_LHR = 0.575. Moreover, on the two sides of the red curve in Fig. 1(a), the boundary equation can even be extended to η_LHR + ξ_LHR = 0.6. Although the performance of pHR is not as good as that of LHR, it still greatly outperforms PCP: for p = 1/3 and p = 2/3, the boundaries are approximately η_pHR + ξ_pHR = 0.52 and η_pHR + ξ_pHR = 0.48, respectively. These improvements are reasonable since pHR and LHR


use penalty functions that are much closer to the ℓ_0 norm. Accordingly, the proposed nonconvex heuristic methods cover a larger feasible region; from this test, it is apparent that the proposed LHR algorithm covers the largest feasible region, which implies that LHR can handle more difficult tasks on which robust PCA fails.

3) Convergence Verification: Finally, we experimentally verify the convergence of LHR. The experiments are conducted on 400 × 400 matrices with rank equal to 40 and with the portion of gross errors set to 15%, 30%, and 45%, respectively. The experimental results are reported in Fig. 1(b), where the horizontal axis denotes the iteration index. The top subfigure in Fig. 1(b) reports the time cost of each iteration. It is interesting to note that the denser the errors, the more time one iteration requires. Besides, the most time-consuming part is the first iteration: during the first iteration, (11) reduces to the typical PCP problem, whereas in the second and third iterations the weight matrix is assigned different values, which makes (11) converge in fewer inner iterations. Therefore, the time cost per iteration differs in LHR; the first iteration requires substantial computation, while the later ones are accelerated owing to the penalty of the weight matrix. The middle subfigure records the stopping criterion, defined as ‖W^(t+1) − W^(t)‖_F/‖W^(t)‖_F; LHR is considered converged when this criterion falls below 1e−5. It is apparent from Fig. 1(b) that LHR converges in just three iterations with 15% and 30% gross errors, while for the harder case with 45% errors it converges in four steps. The bottom subfigure shows the recovery accuracy after each iteration; the accuracy increases significantly from the first iteration to the second. This increase verifies the advantage of the reweighted approach derived from LHR.

4) Robustness to Initialization: We now discuss the performance of the proposed algorithm under different initialization strategies. In the previous experiments, following Algorithm 1, both matrices A and E are initialized to zero. In this part, to further verify the effectiveness of LHR, we adopt a random initialization strategy in which all the entries of A and E are initialized at random. With this initialization, LHR is run on a 400 × 400 matrix with rank = 10 and with 15%, 30%, and 45% corruptions, respectively. Since the algorithm is randomly initialized, we repeat LHR ten times with different initializations on the same matrix. The final recovery accuracy and time costs are reported, with standard deviations, in Table II. From the table, it is apparent that the initialization method makes no significant difference to the recovery accuracy. Random initialization returns highly consistent accuracy with tiny standard deviation. Meanwhile, the accuracy of random-initialization-based recovery is very similar to that obtained with the zero initialization advocated in Algorithm 1. The robustness of LHR to different initializations is not surprising, since the inner loops of LHR depend on a convex program.

TABLE II
PERFORMANCE COMPARISONS WITH DIFFERENT INITIALIZATION METHODS

              Random Initialization            Zero Initialization
Corruptions   ‖A−A∗‖F/‖A∗‖F    time (s)       ‖A−A∗‖F/‖A∗‖F   time (s)
15%           (1.63±0.07)e−5   25.7±3.2       1.58e−5         19.3
30%           (5.33±0.06)e−5   41.2±4.1       5.17e−5         39.6
45%           (2.61±0.01)e−4   64.7±7.2       2.37e−4         51.2

The convex problem has a unique global minimum and is therefore very robust to the initial point. The time costs, however, differ with the initialization. In general, zero initialization requires less computation than random initialization, because a bad random initialization forces each inner program through many loops to reach the optimum. The weighting matrices cannot be initialized arbitrarily, since there is no prior about the two sparse components; an arbitrary random initialization may make the whole program diverge. A good choice is therefore to initialize all the weighting matrices with identity matrices. With this setting, LHR exactly solves a PCP-like optimization in the first inner program. After this initial estimate of the two matrices is obtained, in the second and subsequent inner loops the weighting matrices are initialized according to the current estimates of A and E.

C. Practical Applications

PCP is a powerful tool for many practical applications. Here, we conduct two practical applications to verify the effectiveness of PCP and LHR on real-world data.

1) Shadow and Specularity Removal From Faces: In computer vision, a diverse set of computational models has been adopted for many face-analysis tasks [9], [41]. Among these mathematical models, matrix recovery and sparse learning have proven to be very robust tools for shadow removal on faces. Following the framework suggested in [14], we stack the faces of the same subject under different lighting conditions as the columns of a matrix P. The experiments are conducted on the extended Yale-B dataset, where each face image has a resolution of 192 × 168. The corrupted matrix P is then recovered by PCP and LHR, respectively. After recovery, the shadows, specularities, and other reflectance artifacts are removed into the error matrix (E), and the clean faces are accumulated in the low-rank matrix (A).
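The construction of P can be sketched as follows; `faces` is a hypothetical list of equally sized grayscale images of one subject (the function name is ours):

```python
import numpy as np

def stack_faces(faces):
    # Vectorize each face image and stack the vectors as the columns of P;
    # shadows and specularities then appear as sparse errors on a low-rank P.
    return np.column_stack([np.asarray(f, float).reshape(-1) for f in faces])
```

For the extended Yale-B setting, n images of resolution 192 × 168 give a 32 256 × n matrix P.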
The experimental results are provided in Fig. 2, where in each subfigure, from left to right, are the original faces in Yale-B, the faces recovered by PCP, the faces recovered by pHR (p = 1/3),⁴ and the faces recovered by LHR, respectively. We recommend enlarging the faces in Fig. 2 to view the details. In Fig. 2(a), where dense shadows cover the face image, the effectiveness of LHR becomes apparent in removing the dense shadows distributed on

⁴ We only report the result for p = 1/3 here since, in the preceding numerical simulations, pHR (p = 1/3) achieved higher recovery accuracy than pHR with p = 2/3.


Fig. 2. Shadow and specularities removal from faces (best viewed on screen). (a) Dense shadow. (b) Shadow texture.

the left face. In Fig. 2(a), however, there are no significant differences between the two nonconvex models; both achieve sound results. The dense-texture removal ability is especially highlighted in Fig. 2(b), where there are significant visual contrasts between the faces recovered by PCP, pHR, and LHR: the face recovered by LHR is much cleaner.

2) Video Surveillance: Background modeling can also be cast as an LRMR problem, where the backgrounds correspond to the low-rank matrix A and the foregrounds are removed into the error matrix E. We use the videos and ground truth in [42] for quantitative evaluations. The three videos used in this experiment are listed in Fig. 3. For computational efficiency, we normalize each image to a resolution of 120 × 160 and convert all frames to grayscale. The benchmark videos contain many frames, which leads to a very large matrix. Although it is theoretically feasible to use the two methods on a matrix of any size, in practical implementations large matrices exceed the memory limits of MATLAB. Therefore, for each video, we uniformly divide the large matrix into submatrices with fewer than 200 columns and recover these submatrices by setting λ = 1/√m, respectively. The segmented foregrounds and the ground truth are shown in Fig. 3. From the results, we see that LHR can remove much denser errors from the corrupted matrix than PCP; this is verified in the three sequences in Fig. 3, where LHR recovers the objects much more completely from the video. Besides, in Fig. 3(c), it is also apparent that LHR keeps only dense errors in the sparse error term. In the Seam sequence, there are obvious illumination changes between frames. PCP is sensitive to these small variations and thus produces many small isolated noise regions in the foreground. LHR, on the other hand, is much more robust to these local variations and keeps only the dense corruptions in the sparse term. Although there are many advanced techniques for video background modeling, they are not the main concern of this

Fig. 3. Benchmark videos for background modeling. In each subfigure, from left to right are the original video frames, the foreground ground truth, the LHR result, and the PCP result, respectively. (a) HW (439 frames). (b) Lab (886 frames). (c) Seam (459 frames).

TABLE III
QUANTITATIVE EVALUATION OF PCP AND NONCONVEX HEURISTIC RECOVERY FOR VIDEO SURVEILLANCE

        False Negative Rate (%)     False Positive Rate (%)    Time (m)
Data    MoG   PCP   pHR   LHR       MoG   PCP   pHR   LHR      PCP   pHR   LHR
HW      22.2  18.7  16.2  14.3      8.8   8.2   8.3   8.4      13.2  24.7  23.5
Lab     15.1  10.1  9.4   7.8       6.7   6.4   6.4   6.1      25.4  45.3  43.7
Seam    23.5  11.3  10.1  9.2       9.7   6.1   6.5   6.3      11.4  23.2  19.9

paper. Therefore, without loss of generality, we use the mixture of Gaussians (MoG) [43] as the comparison baseline; in the MoG, five Gaussian components are used to model each pixel of the image. For evaluation, both the false-negative rate (FNR) and the false-positive rate (FPR) are calculated with respect to foreground detection. These two scores correspond exactly to the Type I and Type II errors in machine learning, whose definitions can be found in [44]. FNR indicates the ability of a method to correctly recover the foreground, and FPR represents its ability to distinguish the background; for both rates, lower is better. The experimental results are tabulated in Table III. We also report the time cost (in minutes) of PCP, pHR, and LHR on these videos, but omit the time cost of MoG since it finishes in almost real time. From the results, PCP and LHR greatly outperform MoG. Moreover, LHR has lower FNRs than PCP and pHR, which implies that LHR detects the foreground better than they do. However, on the videos HW and Seam, the FPR score of LHR is a little worse than those of PCP and pHR. One possible reason is that there are many moving shadows in these two videos; both the objects and the shadows are recovered as errors, whereas in the ground-truth frames the shadows are regarded as background. LHR recovers much denser errors from a low-rank matrix and thus attains a relatively low FNR score at the cost of a slightly higher FPR.
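With binary foreground masks (1 = foreground), the two rates can be computed as below (a minimal sketch; the mask interface and function name are our assumptions):

```python
import numpy as np

def fnr_fpr(pred, truth):
    # FNR: fraction of true foreground pixels the detector misses.
    # FPR: fraction of true background pixels flagged as foreground.
    pred = np.asarray(pred, bool)
    truth = np.asarray(truth, bool)
    fnr = np.count_nonzero(~pred & truth) / np.count_nonzero(truth)
    fpr = np.count_nonzero(pred & ~truth) / np.count_nonzero(~truth)
    return 100.0 * fnr, 100.0 * fpr
```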


Algorithm 2 Update Rule for the Variables in (16)
1: E^k_{ij} = s_{λμ^{−1}(W_E)_{ij}} ( (P − PA^{k−1} − μ^{−1}C_1^{k−1})_{ij} ), ∀ij;
2: J^k = d_{μ^{−1}} ( W_Y A^{k−1} W_Z + μ^{−1}C_2^{k−1} );
3: A^k = A^{k−1} + γ[ W_Y (b_2^k + μ^{−1}C_2^{k−1}) W_Z + P^T (b_1^k + μ^{−1}C_1^{k−1}) ];
4: // Dual ascending.
5: C_1^k = C_1^{k−1} + μ^k b_1^k;
6: C_2^k = C_2^{k−1} + μ^k b_2^k;
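The dual-ascending step (lines 4–6 of Algorithm 2) is plain gradient ascent on the multipliers along the constraint residuals; a minimal sketch with generic array shapes and our own naming:

```python
import numpy as np

def dual_ascend(C1, C2, b1, b2, mu):
    # Move each Lagrange multiplier along its constraint residual:
    # b1 = P - PA - E and b2 = J - WY A WZ at the current iterate.
    return C1 + mu * b1, C2 + mu * b2
```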

TABLE IV
MOTION SEGMENTATION ERRORS (MEAN) OF SEVERAL ALGORITHMS ON THE HOPKINS155 MOTION SEGMENTATION DATABASE

Category     Method    Two    Three   All
Algebraic    GPCA      11.2   27.7    14.2
Statistic    RANSAC    8.9    24.1    12.5
Manifold     LSA       8.7    21.4    11.6
             LLMC      8.1    20.8    10.9
Sparse       SSC       5.4    15.3    7.6
             LRR       4.7    15.1    6.9
             pHR       4.2    14.4    6.1
             LHR       3.1    13.9    5.6

VI. LHR FOR LRR

In this part, LHR is applied to the task of LRR [5], [17] by formulating the constraint as P = PA + E, where the affinity matrix A is of low rank and the noise in E is sparse. In the remainder of this section, we first show how to use the joint optimization strategy to solve the LRR problem with the LHR model; then, two practical applications, motion segmentation and stock clustering, are presented and discussed.

A. Proposed Algorithm

When applying LHR to LRR, we should solve a sequence of convex optimizations of the form

min ‖W_Y A W_Z‖_∗ + λ‖W_E ⊙ E‖_1   s.t. P = PA + E.   (15)

To make the nuclear norm tractable, we introduce an auxiliary variable J through an additional equality and solve

min ‖J‖_∗ + λ‖W_E ⊙ E‖_1
s.t. b_1 = P − PA − E = 0
     b_2 = J − W_Y A W_Z = 0.   (16)

Using the ADM strategy and following derivations similar to those introduced in Section V-A, we can solve the optimization in (16); we directly provide the update rules for each variable in Algorithm 2. To show that LHR ideally represents low-rank structures in data, experiments on SC are conducted on two datasets. First, we test LHR on a slightly corrupted dataset, the Hopkins155 motion database. Since the effectiveness of LHR is especially emphasized on data with heavy corruptions, we also consider a practical application of LHR to stock clustering.

B. Results and Performance Evaluation

1) Motion Segmentation in Video Sequences: In this part, we apply LHR to the task of motion segmentation on the Hopkins155 dataset [25]. The Hopkins155 database is a benchmark platform for evaluating general SC algorithms; it contains 156 video sequences, each summarized as a matrix recording 39–50 data vectors. The primary task of SC is to categorize each motion into its corresponding subspace; each video corresponds to a single clustering task, leading to 156 clustering tasks in total.
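The segmentation error reported below is the misclassification rate under the best one-to-one relabeling of the predicted clusters; with only two or three motions per sequence, a brute-force search over label permutations suffices (a sketch with our naming, assuming both labelings draw from the same label set):

```python
import numpy as np
from itertools import permutations

def segmentation_error(pred, truth):
    # Percentage of points misassigned under the best cluster relabeling.
    pred, truth = np.asarray(pred), np.asarray(truth)
    labels = np.unique(truth)
    best = len(truth)
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        wrong = sum(relabel[p] != t for p, t in zip(pred, truth))
        best = min(best, wrong)
    return 100.0 * best / len(truth)
```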

For comparison, we evaluate LHR against LRR as well as other benchmark SC algorithms: random sample consensus (RANSAC) [45], generalized principal component analysis (GPCA) [46], local subspace affinity (LSA), locally linear manifold clustering (LLMC), and sparse subspace clustering (SSC). RANSAC is a statistical method that clusters data by iteratively separating them into inliers and outliers. GPCA presents an algebraic method to cluster mixed data by the normal vectors of the data points. Manifold-based algorithms, e.g., LSA and LLMC, assume that a point and its neighbors span a linear subspace and cluster them via spectral embedding. SSC assumes that the affinity matrix of the data is sparse and segments the data via normalized cut [47]. In LRR [5], Liu et al. introduced two models that use the ℓ_1 norm and the ℓ_{2,1} norm, respectively, to penalize sparse corruptions. In this paper, we only report comparisons with the ℓ_{2,1} norm since it always performs better than the ℓ_1 penalty in LRR. In order to provide a thorough comparison with LRR, we strictly follow the steps and default parameter settings suggested in [5]. For the LHR model, we choose the parameter λ = 0.4. In the experiments of LRR for motion segmentation, some postprocessing is performed on the learned low-rank structure to seek the best clustering accuracy; for example, in LRR, after obtaining the representation matrix A, an extra PCP step is applied to A to enhance its low-rankness, and such postprocessing definitely increases the SC accuracy. However, the main contribution of this paper is the LHR model for LRSL, not the single task of SC. Therefore, we exclude all postprocessing steps to emphasize the effectiveness of the LRSL model itself; in our results, all methods are implemented under the same criterion to avoid biased treatment.
Hopkins155 covers two subspace conditions per video sequence, i.e., two motions or three motions; we therefore report the segmentation errors for two subspaces (Two), for three subspaces (Three), and for both conditions (All) in Table IV. From the results, we see that sparsity-based methods generally outperform the other algorithms for motion segmentation. Among the three sparse methods, LHR attains the best clustering accuracy, although the improvement over LRR is only slight. As indicated in [5], motion data contain only small corruptions, and LRR already achieves promising performance with an accuracy higher than 90%. With some postprocessing


implementations, the accuracy can be improved even further. Therefore, in order to highlight the effectiveness of LHR for LRR with corrupted data, we consider a more complicated problem.

2) Stock Clustering: It is worthwhile to apply the LHR model to more complicated practical data, where its effectiveness on corrupted data is further emphasized. In the real world, one of the most difficult data structures to analyze is the stock price, which can be greatly affected by company news, rumors, and the global economic atmosphere. Data-mining approaches to financial signals have accordingly proven very difficult but, on the other hand, potentially very profitable. In this paper, we discuss how to use the LRR and LHR models for the interesting, albeit not very lucrative, task of clustering stocks by their industrial categories. In many stock exchanges around the world, stocks are divided into industrial categories; for example, on the New York Stock Exchange, IBM and J. P. Morgan are categorized into the computer-based systems category and the money-center banks category, respectively. It is generally assumed that stocks in the same category have similar market performance, a basic assumption widely used by many hedge funds for statistical arbitrage. In this paper, we consider that stocks in the same industrial category span a subspace, so the goal of stock clustering, a.k.a. stock categorization, is to identify a stock's industrial label from its historical prices. The experiments are conducted on stocks from two global stock exchange markets, in New York and Hong Kong. In each market, we choose the ten industrial categories with the largest market capitalizations. The categories assigned by the exchanges are used as the ground-truth labels. In each category, we only choose stocks whose market capitalizations rank within the top ten of that category.
The stock prices for the New York market are obtained from [48], and those for the Hong Kong market are obtained from [49]. Unfortunately, the historical prices of some stocks in [48] are not provided.⁵ Therefore, for the U.S. market we accumulated 76 stocks divided into 10 classes, with each class containing 7 to 9 stocks; for the Hong Kong market, we obtained 96 stocks spanning 10 classes. For classification, the weekly closing prices from January 7, 2008 to October 31, 2011, covering 200 weeks, are used, because financial experts always examine weekly closing prices to judge the long-term trend of a stock. As stated previously, stock prices may have extreme highs and lows, which cause outliers in the raw data. Besides, the prices of different stocks vary over ranges that cannot be compared on the same scale. For the ease of pattern mining, we use the time-based normalization strategy suggested in [50] and [51] to preprocess the stock prices

p̃(t) = (p(t) − μ_α(t)) / σ_α(t)
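This rolling normalization can be sketched as follows (a sketch under our own assumptions: a trailing window of α + 1 samples, population standard deviation, and a small guard against flat windows):

```python
import numpy as np

def normalize_prices(p, alpha):
    # p~(t) = (p(t) - mu_alpha(t)) / sigma_alpha(t), with mu and sigma
    # computed over the trailing window [t - alpha, t] of the price series.
    p = np.asarray(p, float)
    out = np.empty_like(p)
    for t in range(p.size):
        window = p[max(0, t - alpha): t + 1]
        out[t] = (p[t] - window.mean()) / (window.std() + 1e-12)
    return out
```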

5 For example, in the industrial category of drug manufactures, it is not possible to get the historical data of CIPILA.LTD from [48] which is the only interface for us to get the stock prices in the U.S.



Fig. 4. Normalized stock prices in NY for the categories aerospace and defense, banks, and wireless communication. In each category, lines in different colors represent different stocks (best viewed on screen).

TABLE V
CLUSTERING ERRORS OF THE STOCKS IN TEN CATEGORIES FROM THE NEW YORK AND HONG KONG MARKETS

Markets      GPCA   RANSAC   LSA    LLMC
New York     60.1   59.3     51.7   54.3
Hong Kong    57.3   55.8     54.7   53.7

Markets      SSC    LRR      pHR    LHR
New York     48.6   44.3     39.3   36.2
Hong Kong    49.1   47.2     42.7   38.3

where p(t) is the price of a stock at time t, and μ_α(t) and σ_α(t) are, respectively, the mean and standard deviation of the stock prices over the interval [t − α, t]. We plot the normalized stock prices of three categories in Fig. 4. After normalization, we further adopt PCA to reduce the dimension of the stocks from R^200 to R^5. Theoretically, the rank of the subspaces after PCA should be 10 − 1 = 9, because there are ten subspaces and the rank is reduced by 1 during the PCA implementation; in the simulations, however, we find that the maximal clustering accuracy for both markets is achieved with a PCA dimension of 5. The clustering errors of the different SC methods on the stocks from the two markets are summarized in Table V. From the results, it is obvious that LHR significantly outperforms the other methods: it improves on the statistic- and graph-based methods by about 20% and, among the sparse methods, improves on LRR by about 8%. Although LHR performs the best among all the methods, its clustering accuracy is only about 63% and 61% on the U.S. and Hong Kong markets, respectively, which is not as high as on the motion data. This may be ascribed to the fact that the raw data and the ground-truth labels themselves contain many uncertainties; see the bottom subfigure of Fig. 4 for the stocks in the wireless communication category, where the normalized stock marked in green performs quite differently from the other stocks in the same category. The experimental results reported here are nevertheless sufficient to verify the effectiveness of SC for ten-class categorization. If no intelligent learning approach were imposed, the expected accuracy may be only


10%. Even with such noisy raw data, the proposed LHR achieves an accuracy as high as 62% in a completely unsupervised way.

VII. CONCLUSION

This paper presented an LHR algorithm to learn the essential low-rank structures of corrupted matrices. We introduced an MM algorithm that converts the nonconvex objective function into a series of convex optimizations via reweighted approaches, and proved that the solution converges to a stationary point. The general model was then applied to two practical tasks: LRMR and SC. For both models, we gave the solution/update rules for each variable of the joint optimization via ADM. For the general PCP problem, LHR extends the feasible region to the boundary η + ξ = 0.58. For the SC problem, LHR achieved state-of-the-art results on motion segmentation and promising results on stock clustering, whose data contain many outliers and uncertainties. A limitation of the proposed LHR model is that its reweighted scheme requires solving convex optimizations multiple times, so the implementation of LHR is somewhat more time consuming than PCP and LRR. The LHR model is therefore especially recommended for learning the low-rank structure of data with denser corruptions and higher ranks.

APPENDIX

A. Constructing the Upper Bound of LHR

To see how MM works for LHR, recall the objective function in (8) and perform some simple algebraic operations

(1/2)(‖diag(Y)‖_L + ‖diag(Z)‖_L) + λ‖E‖_L
  = (1/2)( Σ_i log(Y_ii + δ) + Σ_k log(Z_kk + δ) ) + λ Σ_{ij} log(|E_ij| + δ)
  = (1/2)( log det(Y + δI_m) + log det(Z + δI_n) ) + λ Σ_{ij} log(|E_ij| + δ).   (17)

Since log det(·) is concave on the positive semidefinite cone, its first-order expansion at the current estimate Σ_Y yields the upper bound

log det(Y + δI_m) ≤ log det(Σ_Y + δI_m) + tr[ (Σ_Y + δI_m)^{−1} (Y − Σ_Y) ]   (18)

and the analogous bound holds for Z at Σ_Z. For the log-sum term of the variable E, the concavity of the logarithm likewise gives, for any δ > 0,

Σ_{ij} log(|E_ij| + δ) ≤ Σ_{ij} [ log((Σ_E)_ij + δ) + (|E_ij| − (Σ_E)_ij)/((Σ_E)_ij + δ) ].   (19)

We replace each term in (17) with its convex upper bound and define T(X̂ | Σ̂) as the surrogate function after convex relaxation.

B. Theoretic Proof of the Convergence of LHR

1) Proof of Lemma 4.1: To prove the monotonically decreasing property, we can instead prove

H(X̂^{t+1}) ≤ T(X̂^{t+1} | Σ̂^t) ≤ T(X̂^t | Σ̂^t) = H(X̂^t).   (20)

We prove (20) in the following three steps.
1) The first inequality follows from the argument that T(X̂ | Σ̂) is an upper bound of H(X̂).
2) The second inequality holds since the MM algorithm computes X̂^{t+1} = arg min_{X̂} T(X̂ | Σ̂^t). The function T(·) is convex; therefore, X̂^{t+1} is the unique global minimum. This property guarantees that T(X̂^{t+1} | Σ̂^t) < T(X̂ | Σ̂^t) for any X̂ ≠ X̂^{t+1}, with equality if and only if X̂ = X̂^{t+1}.
3) The last equality can be easily verified by expanding T(X̂^t | Σ̂^t) and performing some simple algebra. The transformation is straightforward and omitted here.

2) Proof of Lemma 4.2: We give a proof by contradiction. Assume that the sequence X̂ diverges, i.e., lim_{t→∞} ‖X̂^{t+1} − X̂^t‖_F ≠ 0. According to the discussion in Appendix B.3, there exists a convergent subsequence X̂^{t_k} converging to φ, i.e., lim_{k→∞} X̂^{t_k} = φ, and meanwhile we can construct another convergent subsequence X̂^{t_k+1} with lim_{k→∞} X̂^{t_k+1} = ϕ. Under the divergence assumption, φ ≠ ϕ. Since the convex upper bound T(· | Σ̂) is continuous, we get

lim_{k→∞} T(X̂^{t_k+1} | Σ̂^{t_k}) = T(ϕ | Σ̂^{t_k}) < T(φ | Σ̂^{t_k}) = lim_{k→∞} T(X̂^{t_k} | Σ̂^{t_k}).   (21)

The strict inequality "<" holds since φ ≠ ϕ and X̂^{t_k+1} is the unique minimizer of T(· | Σ̂^{t_k}). Besides, it is obvious that the function H(·) in (8) is bounded below, i.e., H(X̂) > (mn + m + n) log δ. Moreover, as proved in Lemma 4.1, H(X̂) is monotonically decreasing, which guarantees that lim_{t→∞} H(X̂^t) exists and

lim_{k→∞} H(X̂^{t_k}) = lim_{t→∞} H(X̂^t) = lim_{t→∞} H(X̂^{t+1}) = lim_{k→∞} H(X̂^{t_k+1}).   (22)

Obviously, (22) contradicts (21). Therefore, φ = ϕ, and we conclude that lim_{t→∞} ‖X̂^{t+1} − X̂^t‖_F = 0.

3) Convergence of Subsequences in the Proof of Lemma 4.2: In this part, we discuss the properties of the convergent subsequences used in the proof of Lemma 4.2. Since the sequence X̂^t = {Y^t, Z^t, A^t, E^t} is generated via (8), X̂^t ∈ D̂ holds strictly. Therefore, all the variables (i.e., Y^t, Z^t, A^t, E^t) in the set X̂^t are bounded. This claim is easily verified: if any variable in the set X̂ went to infinity, the constraints in the domain D̂


will not be satisfied. Accordingly, we know that sequence Xˆ t is bounded. According to the Bolzano–Welestrass theorem [52], we know that every bounded sequence has a convergent subsequence. Since Xˆ t is bounded, it is apparent that there exists a convergent subsequence Xˆ tk . Without the loss of generality, we can construct another subsequence Xˆ tk +1 which is also convergent. The proof of the convergence of Xˆ tk +1 relies on the monotonically decreasing property proved in Lemma 4.1. Since H (·) is monotonically decreasing, it is easy to check that H ( Xˆ tk ) ≥ H ( Xˆ tk +1 ) ≥ H ( Xˆ tk+1 ) ≥ H ( Xˆ tk+1 +1 ) ≥ H ( Xˆ tk+1+1 ). According to the aforementioned inequalities, we get that       lim H Xˆ tk ≥ lim H Xˆ tk+1 ≥ lim H Xˆ tk+2 . (23) k→∞

k→∞

k→∞

Since subsequence Xˆtk converges, it is obvious that limk→∞ H ( Xˆ tk ) = limk→∞ H ( Xˆ tk+2 ) = β. According to the famous Squeeze theorem [53], from (23), we get the limk→∞ H ( Xˆ tk +1 ) = H (limk→∞ Xˆ tk +1 ) = β. Since the function H (·) is monotonically decreasing and Xˆ is bounded, the convergence of H ( Xˆ tk +1 ) can be obtained if and only if the subsequence Xˆ tk +1 is convergent. 4) Proof of Theorem 4.3: As stated in Lemma 4.2, the sequences generated by the MM algorithm converges to a limitation and here we will first prove that the convergence is a fixed point. We define the mapping from Xˆ k to Xˆ k+1 as M(·), and it is straightforward to get lim t →∞ Xˆ t = limt →∞ Xˆ t +1 = limt →∞ M( Xˆ t ), which implies that limt →∞ Xˆ t = φ is a fixed point. In the MM algorithm, when constructing the upper bound, we use the first-order Taylor expansion. It is ˆ ) ˆ is tangent to well known that the convex surrogate T ( X| ˆ at Xˆ by the property of Taylor expansion. Accordingly, H ( X) ˆ and H ( X) ˆ are equal when the gradient vectors of T ( Xˆ | ) ˆ evaluating at Xˆ . Besides, we know that 0 ∈ ∇ Xˆ =φ T ( Xˆ | ) ˆ and because it is tangent to H ( X), we can directly get that ˆ which proves that the convergent fixed point 0 ∈ ∇ Xˆ =φ H ( X) φ is also a stationary point of H (·). ACKNOWLEDGMENT The authors would like to thank Q. Zhang for his helpful discussions on the proof of Theorem 4.3. R EFERENCES [1] X. Li and Y. Pang, “Deterministic column-based matrix decomposition,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 1, pp. 145–149, Jan. 2010. [2] Y. Yuan, X. Li, Y. Pang, X. Lu, and D. Tao, “Binary sparse nonnegative matrix factorization,” IEEE Trans.Circuits Syst. Video Technol. , vol. 19, no. 5, pp. 772–777, May 2009. [3] S. Hu and J. Wang, “Absolute exponential stability of a class of continuous-time recurrent neural networks,” IEEE Trans. Neural Netw., vol. 14, no. 1, pp. 35–45, Jan. 2003. [4] A. Goldberg, X. J. Zhu, B. Recht, J. Sui, and R. 
Nowak, “Transduction with matrix completion: Three birds with one stone,” in Proc. Neural Inform. Process. Syst., 2010, pp. 1–9.
[5] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, Jan. 2013.
[6] S. Hu and J. Wang, “Quadratic stabilizability of a new class of linear systems with structural independent time-varying uncertainty,” Automatica, vol. 37, no. 1, pp. 51–59, 2001.


[7] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, “Robust principal component analysis: Exact recovery of corrupted low-rank matrices,” in Proc. Neural Inform. Process. Syst., 2009, pp. 1–9.
[8] Y. Deng, Y. Liu, Q. Dai, Z. Zhang, and Y. Wang, “Noisy depth maps fusion for multiview stereo via matrix completion,” IEEE J. Select. Topics Signal Process., vol. 6, no. 5, pp. 566–582, Sep. 2012.
[9] Y. Deng, Q. Dai, and Z. Zhang, “Graph Laplace for occluded face completion and recognition,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2329–2338, Aug. 2011.
[10] Y. Deng, Q. Dai, R. Wang, and Z. Zhang, “Commute time guided transformation for feature extraction,” Comput. Vis. Image Understand., vol. 116, no. 4, pp. 473–483, 2012.
[11] R. Liu, Z. Lin, S. Wei, and Z. Su, “Feature extraction by learning Lorentzian metric tensor and its extensions,” Pattern Recognit., vol. 43, no. 10, pp. 3298–3306, 2010.
[12] D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[13] B. Recht, M. Fazel, and P. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[14] E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” J. ACM, vol. 59, no. 3, pp. 1–37, May 2011.
[15] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-sparsity incoherence for matrix decomposition,” Tech. Rep. 2220, Elect. Eng. Res. Lab., Univ. Texas, Austin, Jun. 2009.
[16] D. Hsu, S. M. Kakade, and T. Zhang, “Robust matrix decomposition with outliers,” Tech. Rep. 1011.1518, Dept. Stat., Univ. Pennsylvania, Philadelphia, PA, Dec. 2010.
[17] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 663–670.
[18] E. J. Candes, M. Wakin, and S. Boyd, “Enhancing sparsity by reweighted ℓ1 minimization,” J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2007.
[19] Z. Yang, Y. Xiang, S. Xie, S. Ding, and Y. Rong, “Nonnegative blind source separation by sparse component analysis based on determinant measure,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1601–1610, Oct. 2012.
[20] Z. Lai, W. K. Wong, Z. Jin, J. Yang, and Y. Xu, “Sparse approximation to the eigensubspace for discrimination,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 12, pp. 1948–1960, Dec. 2012.
[21] Z. Xu, X. Chang, F. Xu, and H. Zhang, “L1/2 regularization: A thresholding representation theory and a fast solver,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1013–1027, Jul. 2012.
[22] H. Li, N. Chen, and L. Li, “Error analysis for matrix elastic-net regularization algorithms,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 737–748, May 2012.
[23] R. Liu, Z. Lin, S. Wei, and Z. Su, “Solving principal component pursuit in linear time via l1 filtering,” Tech. Rep. 11085359, Elect. Eng. Res. Lab., Univ. Texas, Austin, Aug. 2011.
[24] A. Ganesh, J. Wright, X. Li, E. J. Candes, and Y. Ma, “Dense error correction for low-rank matrices via principal component pursuit,” in Proc. Int. Symp. Inform. Theory, Jun. 2010, pp. 1–5.
[25] R. Vidal, “Subspace clustering,” IEEE Signal Process. Mag., vol. 28, no. 2, pp. 52–68, Mar. 2011.
[26] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proc. Neural Inform. Process. Syst., Sep. 2011, pp. 1–7.
[27] C.-S. Foo, C. B. Do, and A. Y. Ng, “A majorization-minimization algorithm for (multiple) hyperparameter learning,” in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 321–328.
[28] M. Figueiredo, J. Bioucas-Dias, and R. Nowak, “Majorization minimization algorithms for wavelet-based image restoration,” IEEE Trans. Image Process., vol. 16, no. 12, pp. 2980–2991, Dec. 2007.
[29] J. Bioucas-Dias, M. Figueiredo, and J.
Oliveira, “Total variation-based image deconvolution: A majorization-minimization approach,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., May 2006, p. 2.
[30] M. Fazel, “Matrix rank minimization with applications,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 2002.
[31] M. Fazel, H. Hindi, and S. Boyd, “Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices,” in Proc. Amer. Control Conf., vol. 3, Jun. 2003, pp. 2156–2162.
[32] K. Mohan and M. Fazel, “Reweighted nuclear norm minimization with application to system identification,” in Proc. Amer. Control Conf., 2010, pp. 2953–2959.



[33] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Royal Statist. Soc. Series B, vol. 67, no. 2, pp. 301–320, 2005.
[34] R. Chartrand and W. Yin, “Iteratively reweighted algorithms for compressive sensing,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Apr. 2008, pp. 3869–3872.
[35] J. Cai, E. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” preprint, vol. 20, no. 4, pp. 1–27, 2008.
[36] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[37] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–123, 2010.
[38] Z. Lin, M. Chen, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” Aerospace Corp., Los Angeles, CA, Tech. Rep. 10095055, Mar. 2011.
[39] M. Lees, “A note on the convergence of alternating direction methods,” Math. Comput., vol. 16, no. 77, pp. 70–75, 1963.
[40] “Low-rank matrix recovery and completion via convex optimization.” (2012) [Online]. Available: http://perception.csl.uiuc.edu/matrixrank/sample_code.html
[41] J. Suo, S. Zhu, S. Shan, and X. Chen, “A compositional and dynamic model for face aging,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 385–401, Mar. 2010.
[42] C. Benedek and T. Sziranyi, “Bayesian foreground and shadow detection in uncertain frame rate surveillance videos,” IEEE Trans. Image Process., vol. 17, no. 4, pp. 608–621, Apr. 2008.
[43] J. K. Suhr, H. G. Jung, G. Li, and J. Kim, “Mixture of Gaussians-based background subtraction for Bayer-pattern image sequences,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 3, pp. 365–370, Mar. 2011.
[44] “Type I and Type II errors.” (2009) [Online].
Available: http://en.wikipedia.org/wiki/type_i_and_type_ii_errors
[45] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM, vol. 24, pp. 381–395, Jun. 1981.
[46] R. Vidal, Y. Ma, and S. Sastry, “Generalized principal component analysis (GPCA),” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2003, pp. 621–628.
[47] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.
[48] “Yahoo! Finance.” (2012) [Online]. Available: http://finance.yahoo.com
[49] “Google Finance.” (2012) [Online]. Available: http://www.google.com.hk/finance?q=
[50] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani, “Mining the stock market (extended abstract): Which measure is best?” in Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2000, pp. 487–496.
[51] T. Wittman, “Time-series clustering and association analysis of financial data,” Elect. Eng. Res. Lab., Univ. Texas, Austin, Tech. Rep. CS 8980, 2002.
[52] “Bolzano–Weierstrass theorem.” (2012) [Online]. Available: http://en.wikipedia.org/wiki/Bolzano%E2%80%93Weierstrass_theorem
[53] “Squeeze theorem.” (2012) [Online]. Available: http://en.wikipedia.org/wiki/Squeezetheorem

Yue Deng received the B.E. degree (Hons.) in automatic control from Southeast University, Nanjing, China, in 2008. He is currently pursuing the Ph.D. degree with the Department of Automation, Tsinghua University, Beijing, China. He was a Visiting Scholar with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, from 2010 to 2011. His current research interests include machine learning, signal processing, and computer vision. Mr. Deng was a recipient of the Microsoft fellowship in 2010.

Qionghai Dai (SM’05) received the B.S. degree in mathematics from Shanxi Normal University, Xi’an, China, in 1987, and the M.E. and Ph.D. degrees in computer science and automation from Northeastern University, Shenyang, China, in 1994 and 1996, respectively. He has been a member of the Faculty of Tsinghua University, Beijing, China, since 1997. He is currently a Cheung Kong Professor with Tsinghua University and is the Director of the Broadband Networks and Digital Media Laboratory. His current research interests include signal processing and computer vision and graphics.

Risheng Liu received the B.Sc. degree in mathematics and the Ph.D. degree in computational mathematics from the Dalian University of Technology, Dalian, China, in 2007 and 2012, respectively. He was a joint Ph.D. student with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, from 2010 to 2012. He is currently a Postdoctoral Researcher with the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology. His current research interests include machine learning and computer vision.

Zengke Zhang received the B.S. degree in industrial electrification and automation from Tsinghua University, Beijing, China, in 1970. He is a Professor with the Department of Automation, Tsinghua University. His current research interests include intelligent control, motion control, system integration, and image processing.

Sanqing Hu (M’05–SM’06) received the B.S. degree from the Department of Mathematics, Hunan Normal University, Hunan, China, the M.S. degree from the Department of Automatic Control, Northeastern University, Shenyang, China, and the Ph.D. degrees from the Department of Automation and Computer-Aided Engineering, The Chinese University of Hong Kong, Kowloon, Hong Kong, and the Department of Electrical and Computer Engineering, University of Illinois, Chicago, in 1992, 1996, 2001, and 2006, respectively. He was a Research Fellow with the Department of Neurology, Mayo Clinic, Rochester, MN, from 2006 to 2009. From 2009 to 2010, he was a Research Assistant Professor with the School of Biomedical Engineering, Science & Health Systems, Drexel University, Philadelphia, PA. He is currently a Chair Professor with the College of Computer Science, Hangzhou Dianzi University, Hangzhou, China. His current research interests include biomedical signal processing, cognitive and computational neuroscience, neural networks, and dynamical systems. He is the co-author of more than 60 international journal and conference papers. Dr. Hu is an Associate Editor of four journals, including the IEEE Transactions on Biomedical Circuits and Systems, the IEEE Transactions on SMC, Part B, the IEEE Transactions on Neural Networks, and Neurocomputing. He was a Guest Editor of Neurocomputing’s special issue on Neural Networks 2007 and Cognitive Neurodynamics’ special issue on cognitive computational algorithms in 2011. He was an Organizing Committee Co-Chair for ICAST2011 and a Program Chair for ICIST2011, ISNN2011, and IWACI2010. He was the Special Sessions Chair for ICNSC2008, ISNN2009, and IWACI2010. He has served or is serving on the program committees of 20 international conferences.
