
Forward Basis Selection for Pursuing Sparse Representations over a Dictionary

Xiao-Tong Yuan and Shuicheng Yan, Senior Member, IEEE

Abstract—The forward greedy selection algorithm of Frank and Wolfe [9] has recently been applied with success to coordinate-wise sparse learning problems, characterized by a tradeoff between sparsity and accuracy. In this paper, we generalize this method to the setup of pursuing sparse representations over a prefixed dictionary. Our proposed algorithm iteratively selects an atom from the dictionary and minimizes the objective function over the linear combinations of all the selected atoms. The rate of convergence of this greedy selection procedure is analyzed. Furthermore, we extend the algorithm to the setup of learning nonnegative and convex sparse representations over a dictionary. Applications of the proposed algorithms to sparse precision matrix estimation and low-rank subspace segmentation are investigated, with efficiency and effectiveness validated on benchmark datasets.

Index Terms—Greedy selection, sparse representation, optimization, Gaussian graphical models, subspace segmentation

1 INTRODUCTION

The past decade has witnessed a tremendous surge of research activities in high-dimensional data analysis. One major driving force is the rapid development of data collection technology in application domains such as social network analysis, natural language processing, genomics, and computer vision. In these applications, data samples often have thousands or even millions of features, by the use of which an underlying parameter must be inferred or predicted. In many circumstances, the number of collected samples is substantially smaller than the dimensionality of the features, meaning that consistent estimators cannot be obtained unless additional constraints are imposed on the model. It is widely acknowledged that the datasets that need to be processed usually exhibit low-dimensional structure, which can often be captured by imposing a sparsity constraint on the model space. Developing techniques that are robust and computationally tractable for solving these sparsity-constrained problems, even only approximately, is therefore critical.

In this paper, we are particularly interested in a class of sparse optimization problems in which the target solution admits a sparse representation over a fixed dictionary. Among others, several examples falling inside this model include: 1) coordinate-wise sparse learning, where the optimal solution is expected to be a sparse combination of canonical bases [32], [35]; 2) low-rank matrix approximation, where the target solution is expected to be the weighted sum of a few rank-1 matrices in the form of outer products of

. X.-T. Yuan is with the School of Information and Control, Nanjing University of Information Science and Technology, Nanjing 210044, China. E-mail: [email protected].
. S. Yan is with the Vision and Machine Learning Lab, Department of Electrical and Computer Engineering, National University of Singapore, Block E4, #08-27, 4 Engineering Drive 3, Singapore 117583. E-mail: [email protected].

Manuscript received 14 Apr. 2012; revised 14 Jan. 2013; accepted 15 Apr. 2013; published online 1 May 2013. Recommended for acceptance by F. Bach. Digital Object Identifier no. 10.1109/TPAMI.2013.85.

unit-norm vectors [18], [31]; and 3) boosting classification, where a strong classifier is a linear combination of several weak classifiers [13], [29]. Formally, this class of problems can be unified inside the following framework of sparse representation over a dictionary V in a euclidean space E:

\min_{x \in E} f(x), \quad \text{s.t. } x \in L_K(V),   (1)

where f is a real-valued differentiable convex function and

L_K(V) := \bigcup_{U \subseteq V, |U| \le K} \Big\{ \sum_{u \in U} \alpha_u u : \alpha_u \in \mathbb{R} \Big\}

is the union of the linear hulls spanned by those subsets U ⊆ V with cardinality |U| ≤ K. Here, we allow the dictionary V to be finite or countable. In the aforementioned examples, V is the set of canonical bases in coordinate-wise sparse learning (finite), a certain family of rank-1 matrices in low-rank matrix approximation (countable), and a set of weak classifiers in boosting (finite or countable). Due to the cardinality constraint, problem (1) is not convex, and thus we resort to approximation algorithms for its solution.

In recent years, forward greedy selection algorithms have received wide interest for pursuing sparse representations. A category of algorithms called coreset methods [2] has been successfully applied in functional approximation [40] and coordinate-wise sparse learning [20]. This body of work dates back to the Frank-Wolfe algorithm (also known as the conditional gradient method) [9] for polytope-constrained optimization. Some variants of the coreset method were investigated in the scenarios of positive semidefinite programming [14] and low-rank matrix completion/approximation [18], [31], [4], which only require a dominant singular value decomposition at each iteration step. In the study of boosting classification, the restricted gradient projection algorithm [13] is essentially a forward greedy selection method over an L2-functional space. More recently, a Frank-Wolfe-type algorithm was developed in [33] to minimize a convex objective over the (scaled) convex hull of a collection of atoms.


1.1 Fully Corrective Forward Greedy Selection (FCFGS)

The method presented in this paper is motivated by the FCFGS method [32], which was originally proposed for solving the following coordinate-wise sparse learning problem:

\min_{x \in \mathbb{R}^p} f(x), \quad \text{s.t. } \|x\|_0 \le K,   (2)

where ‖x‖_0 denotes the number of nonzero entries in x. Let [[1, p]] := {1, ..., p} and denote by supp(x) the index set of nonzero entries of x. The FCFGS algorithm is outlined in Algorithm 1. At each iteration, FCFGS first selects a coordinate at which the gradient has the largest magnitude, and then minimizes f over the coordinates selected so far. The algorithm has been demonstrated to be an efficient and accurate sparse approximation method both in theory [32], [41] and in practice [19], [31]. More precisely, FCFGS has two appealing properties:

. Orthogonal coordinate selection: Provided that the gradient ∇f(x^(k-1)) is nonzero, the algorithm always selects a novel coordinate at each iteration.
. Geometric rate of convergence: It is shown in [32, Theorem 2.8] that FCFGS can achieve a geometric rate of convergence under restricted strong convexity/smoothness assumptions on the objective f.

It is noteworthy that FCFGS is identical to the well-recognized orthogonal matching pursuit algorithm [26], [35] in the signal processing community. Theoretical analysis [32], [41] and strong numerical evidence [19], [31] show that FCFGS is more appealing, both in sparsity and accuracy, than traditional forward selection algorithms such as sequential greedy approximation [40] and gradient boosting [10].

Algorithm 1. Fully Corrective Forward Greedy Selection (FCFGS) [32].
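The Algorithm 1 listing itself did not survive the text extraction. As a rough aid, the following is a minimal Python sketch of the FCFGS loop as described above (largest-magnitude gradient coordinate selection followed by a fully corrective refit). The solver choice (scipy's L-BFGS-B) and the function/gradient handles are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fcfgs(f, grad_f, p, K):
    """Sketch of fully corrective forward greedy selection (Algorithm 1).

    f, grad_f : callables for the objective on R^p and its gradient (assumed interface).
    p         : ambient dimension.
    K         : target sparsity level.
    """
    x = np.zeros(p)
    support = []
    for _ in range(K):
        g = grad_f(x)
        j = int(np.argmax(np.abs(g)))        # coordinate with largest gradient magnitude
        if g[j] == 0:                        # gradient vanishes: stop
            break
        if j not in support:
            support.append(j)
        # fully corrective step: minimize f restricted to the selected coordinates
        def f_restricted(z):
            y = np.zeros(p)
            y[support] = z
            return f(y)
        res = minimize(f_restricted, x[support], method="L-BFGS-B")
        x = np.zeros(p)
        x[support] = res.x
    return x
```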

We note that FCFGS can be directly applied to solve (1) when the cardinality of the dictionary V is finite. Indeed, suppose the dictionary V = {v_1, ..., v_N} is finite with cardinality N. For any x ∈ L(V) with representation x = \sum_{i=1}^N \alpha_i v_i, define g(\alpha) := f(\sum_{i=1}^N \alpha_i v_i). Since f(x) is convex, it is easy to verify that g(\alpha) is convex in \mathbb{R}^N. By such a reparameterization, problem (1) is converted to the following standard coordinate-wise sparse learning problem:

\min_{\alpha \in \mathbb{R}^N} g(\alpha), \quad \text{s.t. } \|\alpha\|_0 \le K,   (3)

which can be readily solved by FCFGS.
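This conversion is just the composition of f with the linear map α ↦ Σ_i α_i v_i. A minimal sketch, assuming the finite dictionary is stored as the columns of a matrix V (an illustrative representation, not prescribed by the paper):

```python
import numpy as np

def make_g(f, grad_f, V):
    """Reparameterize f over a finite dictionary whose atoms are the columns of V:
    g(alpha) = f(V @ alpha), and by the chain rule grad_g(alpha) = V.T @ grad_f(V @ alpha)."""
    def g(alpha):
        return f(V @ alpha)
    def grad_g(alpha):
        return V.T @ grad_f(V @ alpha)
    return g, grad_g
```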


The key challenge in generalizing FCFGS to solve problem (1) lies in the setup where the cardinality of V is infinite. This paper addresses this challenge. Our goal is to present a generic forward selection algorithm to solve problem (1) over an arbitrary dictionary. We note that there have already been several recent attempts at developing such an algorithm in low-rank matrix learning [31] and decision tree boosting [19]. While progress has been made on these special problems, a unified treatment of forward greedy selection methods for the sparse representation problem (1) remains open.

1.2 Our Contribution

In this paper, we propose a generic forward selection algorithm, namely forward basis selection (FBS), which generalizes FCFGS to approximately solve problem (1) with an arbitrary dictionary. One important property of FBS is that it automatically selects a group of bases from the dictionary for the sparse representation. An O(1/ε) sublinear rate of convergence is established for FBS under mild assumptions on the dictionary and the objective function. When the dictionary is finite, a better O(ln(1/ε)) geometric rate bound can be obtained under proper conditions. We then extend FBS to the settings of nonnegative sparse representation and convex sparse representation. Such extensions facilitate the applications of FBS to positive semidefinite programming and general convex constrained sparse learning problems. Convergence properties are analyzed for these extensions. Regarding iteration complexity, the per-iteration computational overhead of FBS is dominated by a linear gradient projection and a subproblem that optimizes the aggregation weights of the bases. The former can be efficiently solved in several applications, including sparse precision matrix estimation [3] and low-rank matrix estimation [22], as addressed in this paper. The latter is typically of limited scale and thus relatively easy to solve.

The proposed algorithm unifies existing results and leads to new applications. We study the applications of the proposed method and its variants in two concrete sparse learning problems: sparse precision matrix estimation and low-rank subspace segmentation. We formulate both problems as learning sparse representations over dictionaries composed of atom matrices with specific structures. We demonstrate the performance of the proposed method on several benchmark datasets.

1.3 Notation

Before proceeding, we establish the notation used in the remainder of this paper. We denote by ⟨·,·⟩ the inner product and by ‖·‖ = √⟨·,·⟩ the euclidean norm. The ith entry of a vector x ∈ \mathbb{R}^p is denoted by [x]_i. We denote by ‖x‖_1 its ℓ1-norm, by ‖x‖_0 the number of its nonzero entries, and by supp(x) the indices of its nonzero entries. We say a vector x ≥ 0 if each entry of x is nonnegative. The linear hull of the dictionary V is given by

L(V) := \Big\{ x : x = \sum_{v \in V} \alpha_v v, \ \alpha_v \in \mathbb{R} \Big\}.

For any x ∈ L(V), the representation x = \sum_{v \in V} \alpha_v v need not be unique. In the analysis to follow, we are interested in


the following gauge function [27], [16] of x, which is defined as the smallest sum of absolute weights:

C(x) := \min \Big\{ \sum_{v \in V} |\alpha_v| : x = \sum_{v \in V} \alpha_v v \Big\}.

We say f has restricted strong smoothness and restricted strong convexity over L(V) at sparsity level k if there exist positive constants ρ_+(k) and ρ_-(k) such that for any x, x' ∈ L(V) with x − x' ∈ L_k(V),

f(x') - f(x) - \langle \nabla f(x), x' - x \rangle \le \frac{\rho_+(k)}{2} \|x - x'\|^2   (6)

and

f(x') - f(x) - \langle \nabla f(x), x' - x \rangle \ge \frac{\rho_-(k)}{2} \|x - x'\|^2.   (7)

We say the dictionary V is bounded with radius A if ∀v ∈ V, ‖v‖ ≤ A. We say V is symmetric if v ∈ V implies −v ∈ V. As already used in Algorithm 1, we denote [[1, p]] = {1, ..., p}. X ⪰ 0 (X ≻ 0) means that X is a positive semidefinite (positive definite) matrix.

This paper proceeds as follows: We present in Section 2 the FBS algorithm along with convergence analysis and extensions. The applications of FBS to sparse and low-rank matrix learning tasks are studied in Section 3. We conclude this work in Section 4.

2 FORWARD BASIS SELECTION

The forward basis selection method is outlined in Algorithm 2. At each time instance k, we first search for a steepest descent direction u^(k) ∈ V that solves the linear subproblem (8), and then we update the iterate x^(k) by solving the subproblem (9) over the linear hull of the set U^(k) that contains the descent directions selected so far. It is easy to check that subproblem (9) is still convex. In the extreme case where U^(k) becomes V, one recovers the original convex problem. For relatively small k (e.g., k ≤ 100 for the low-rank representation problem studied in Section 3.2), subproblem (9) can be efficiently optimized via a quasi-Newton approach [30]. In particular, by choosing V = {±e_i, i = 1, ..., p}, where {e_i}_{i=1}^p are the canonical bases in \mathbb{R}^p, FBS reduces to FCFGS.

Algorithm 2. Forward Basis Selection (FBS).
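The Algorithm 2 listing is likewise missing from the extracted text. The following Python sketch captures the two steps described above, assuming a user-supplied linear-minimization oracle lmo(g) for subproblem (8) and a generic unconstrained solver for the weight-refitting subproblem (9); both the interface and the solver are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fbs(f, grad_f, lmo, x0, K):
    """Sketch of forward basis selection (Algorithm 2).

    lmo(g) is assumed to return the atom u in V minimizing <g, u> (subproblem (8)).
    Atoms and iterates are represented as flat numpy arrays of the same shape.
    """
    x = x0.copy()
    atoms = []
    for _ in range(K):
        g = grad_f(x)
        u = lmo(g)
        if abs(np.dot(g, u)) < 1e-12:        # x is already optimal over the linear hull
            break
        atoms.append(u)
        U = np.stack(atoms, axis=1)          # selected atoms as columns
        # fully corrective step (9): minimize f over the linear hull of the selected atoms
        def f_weights(w):
            return f(U @ w)
        w = minimize(f_weights, np.zeros(len(atoms)), method="L-BFGS-B").x
        x = U @ w
    return x
```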

Since FBS is a generalization of FCFGS, a natural question is whether the appealing aspects of FCFGS, such as orthogonal coordinate selection and fast convergence, can be similarly established for FBS. We answer this question in the following analysis. The lemma below shows that, before reaching the optimal solution, Algorithm 2 always introduces a new basis atom as the descent direction at each iteration.

Lemma 1. Assume that V is symmetric, and run Algorithm 2 until time instance k:

1. If ⟨∇f(x^(k-1)), u^(k)⟩ ≠ 0, then the elements in U^(k) = {u^(1), ..., u^(k)} are linearly independent.
2. If ⟨∇f(x^(k-1)), u^(k)⟩ = 0, then x^(k-1) is the optimal solution over the linear hull L(V).

Proof. Part 1: We prove the claim by induction. Obviously, the claim holds for k = 1 (since u^(1) ≠ 0). Suppose the claim holds up to time instance k − 1. Assume that at time instance k, {u^(1), ..., u^(k)} are linearly dependent. Since {u^(1), ..., u^(k-1)} are linearly independent, u^(k) can be expressed as a linear combination of {u^(1), ..., u^(k-1)}. Due to the optimality of x^(k-1) for solving (9) at time instance k − 1, we have ⟨∇f(x^(k-1)), u^(i)⟩ = 0 for i ≤ k − 1. Therefore, ⟨∇f(x^(k-1)), u^(k)⟩ = 0, which leads to a contradiction. Thus, the claim holds for k. This proves the desired result.

Part 2: Given that ⟨∇f(x^(k-1)), u^(k)⟩ = 0, we have ∀v ∈ V, ⟨∇f(x^(k-1)), v⟩ ≥ 0, which implies ⟨∇f(x^(k-1)), v⟩ = 0 since V is symmetric. Therefore, x^(k-1) is optimal over L(V). □

This lemma indicates that if we run Algorithm 2 until time instance k with ⟨∇f(x^(k-1)), u^(k)⟩ ≠ 0, then the atom set U^(k) forms k bases in V. This is analogous to the orthogonal coordinate selection property of FCFGS, which justifies calling Algorithm 2 a forward basis selection. Regarding the approximation performance of FBS, we are interested in the rate of convergence of the output x^(K) toward a fixed S-sparse competitor x̄ ∈ L_S(V). We first discuss the special case where V is finite and then address the general case where V is bounded and symmetric.

2.1 V Is a Finite Set

Let us consider the special case where the dictionary V = {v_1, ..., v_N} is finite with cardinality N. Without loss of generality, we assume that the elements in V are linearly independent (otherwise, we can replace V with its bases without affecting the feasible set). Thus, for any x ∈ L(V), the representation x = \sum_{i=1}^N \alpha_i v_i is unique. Let g(\alpha) := f(\sum_{i=1}^N \alpha_i v_i). Since f(x) is convex, it is easy to verify that g(\alpha) is convex in \mathbb{R}^N. We may convert problem (1) to the following standard coordinate-wise sparse learning problem:

\min_{\alpha \in \mathbb{R}^N} g(\alpha), \quad \text{s.t. } \|\alpha\|_0 \le K.   (10)

In light of this conversion, we can straightforwardly apply FCFGS (Algorithm 1) to solve problem (10). By making restricted strong convexity assumptions on g(\alpha), it has been shown in [32, Theorem 2.8] that the rate of convergence of FCFGS toward any sparse competitive solution is geometric.
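Purely for illustration, the hypothetical make_g and fcfgs sketches given earlier can be composed to realize this conversion; all names here are illustrative and not part of the paper's implementation.

```python
import numpy as np

# Example: f(x) = ||x - b||^2 over a finite dictionary whose atoms are the columns of V.
rng = np.random.default_rng(0)
V = rng.standard_normal((50, 20))             # 20 atoms in R^50
b = V[:, :3] @ np.array([1.0, -2.0, 0.5])     # target that is 3-sparse over V

f = lambda x: np.sum((x - b) ** 2)
grad_f = lambda x: 2.0 * (x - b)

g, grad_g = make_g(f, grad_f, V)              # reparameterize: g(alpha) = f(V @ alpha)
alpha = fcfgs(g, grad_g, p=V.shape[1], K=3)   # run FCFGS on the coordinate-wise problem (10)
x_hat = V @ alpha                             # recovered sparse representation
```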


2.2 General Case

The following theorem is our main result on the approximation performance of FBS over a bounded and symmetric dictionary V.


Theorem 1. Let us run FBS (Algorithm 2) for K iterations. Assume that V is symmetric and bounded with radius A. Assume that f is ρ_+(1)-restricted-strongly smooth over L(V). Let x̄ ∈ L_S(V) be an S-sparse vector over V. Given ε > 0, if ∀k ≤ K, f(x^(k)) > f(x̄) and

K \ge \frac{2 \rho_+(1) A^2 C(\bar{x})^2}{\epsilon} - 1,   (11)

then FBS will output x^(K) satisfying f(x^(K)) ≤ f(x̄) + ε.

Proof. From the update of x^(k) in (9) and the definition of restricted strong smoothness in (6), we get that for all η ≥ 0,

f(x^{(k)}) \le f(x^{(k-1)} + \eta u^{(k)})
 \le f(x^{(k-1)}) + \eta \langle \nabla f(x^{(k-1)}), u^{(k)} \rangle + \frac{\rho_+(1) A^2}{2} \eta^2
 \le f(x^{(k-1)}) + \frac{\eta}{C(\bar{x})} \langle \nabla f(x^{(k-1)}), \bar{x} \rangle + \frac{\rho_+(1) A^2}{2} \eta^2
 = f(x^{(k-1)}) + \frac{\eta}{C(\bar{x})} \langle \nabla f(x^{(k-1)}), \bar{x} - x^{(k-1)} \rangle + \frac{\rho_+(1) A^2}{2} \eta^2
 \le f(x^{(k-1)}) + \frac{\eta}{C(\bar{x})} \big( f(\bar{x}) - f(x^{(k-1)}) \big) + \frac{\rho_+(1) A^2}{2} \eta^2,

where the second inequality follows from the restricted strong smoothness and the boundedness assumption on V, the third inequality follows from (8) and the assumption that V is symmetric, the equality follows from the optimality condition ⟨∇f(x^(k-1)), x^(k-1)⟩ = 0 of the iterate x^(k-1), and the last inequality follows from the convexity of f. In particular, the preceding inequality holds for

\eta = \frac{f(x^{(k-1)}) - f(\bar{x})}{\rho_+(1) A^2 C(\bar{x})} > 0,

and consequently,

f(x^{(k)}) \le f(x^{(k-1)}) - \frac{\big( f(x^{(k-1)}) - f(\bar{x}) \big)^2}{2 \rho_+(1) A^2 C(\bar{x})^2}.

Denote ε_k := f(x^(k)) − f(x̄). The preceding inequality implies

\epsilon_k \le \epsilon_{k-1} - \frac{\epsilon_{k-1}^2}{2 \rho_+(1) A^2 C(\bar{x})^2}.

Invoking [32, Lemma B.2] shows that

\epsilon_k \le \frac{2 \rho_+(1) A^2 C(\bar{x})^2}{k+1}.

If K satisfies (11), then it is guaranteed that ε_K ≤ ε. □

Note that the bound on the right-hand side of (11) is proportional to the square of C(x̄), which can be taken as a measure of the sparsity of x̄ over the dictionary V.

2.3 Nonnegative Sparse Approximation

In certain sparse learning problems, for example, nonnegative sparse regression and positive semidefinite matrix learning, the target solution is expected to admit a nonnegative sparse representation over a dictionary. This motivates us to consider the following problem of nonnegative sparse representation in V:

\min_{x \in E} f(x), \quad \text{s.t. } x \in L_K^+(V),   (12)

where

L_K^+(V) := \bigcup_{U \subseteq V, |U| \le K} \Big\{ \sum_{u \in U} \alpha_u u : \alpha_u \in \mathbb{R}^+ \Big\}.

To apply FBS to this problem, we have to modify the update (9) to adapt to the nonnegative constraint:

x^{(k)} = \arg\min_{x \in L^+(U^{(k)})} f(x),   (13)

where L^+(U^(k)) denotes the nonnegative hull of U^(k), given by

L^+(U^{(k)}) := \Big\{ \sum_{v \in U^{(k)}} \alpha_v v : \alpha_v \in \mathbb{R}^+ \Big\}.

The subproblem (13) is essentially a smooth optimization over a half-space with scale dominated by the time instance k. When k is relatively small, it can be efficiently solved via quasi-Newton methods such as PQN [30]. Following a similar argument as in Section 2.2, it can be proven that Theorem 1 still holds for this extension when V is bounded and symmetric. In particular, when V is finite with cardinality N, as discussed in Section 2.1, we may convert problem (12) to the following nonnegative coordinate-wise sparse learning problem:

\min_{\alpha \in \mathbb{R}^N} g(\alpha), \quad \text{s.t. } \|\alpha\|_0 \le K, \ \alpha \ge 0.   (14)

To apply FCFGS (Algorithm 1) to this problem, we have to make the following slight modifications of (4) and (5) to adapt to the nonnegative constraint:

j^{(k)} = \arg\min_{i \in [[1, N]]} \big[ \nabla g(\alpha^{(k-1)}) \big]_i,   (15)

\alpha^{(k)} = \arg\min_{\mathrm{supp}(\alpha) \subseteq F^{(k)}, \ \alpha \ge 0} g(\alpha).   (16)

By making restricted strong convexity assumptions on g(α) and using arguments similar to those of [32, Theorem 2.8], it can be proven that the geometric rate of convergence of FCFGS remains valid with the preceding updating schemes (15) and (16).
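A minimal sketch of the nonnegativity-constrained weight-refitting step (13) (equivalently (16) in the finite case), using box constraints in scipy's L-BFGS-B as a stand-in for the PQN solver used in the paper; the interface is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import minimize

def nonnegative_refit(f, atoms):
    """Weight refitting over the nonnegative hull of the selected atoms (update (13)):
    minimize f(sum_i w_i * atoms[i]) subject to w >= 0."""
    U = np.stack(atoms, axis=1)               # selected atoms as columns
    obj = lambda w: f(U @ w)
    w0 = np.zeros(U.shape[1])
    bounds = [(0.0, None)] * U.shape[1]       # nonnegativity of each weight
    w = minimize(obj, w0, method="L-BFGS-B", bounds=bounds).x
    return U @ w, w
```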

2.4 Convex Sparse Approximation

In many sparse learning problems, the feasible set is the convex hull L^△(V) of a dictionary V, given by

L^\triangle(V) := \Big\{ \sum_{v \in V} \alpha_v v : \alpha_v \in \mathbb{R}^+, \ \sum_{v \in V} \alpha_v = 1 \Big\}.

For example, in the Lasso [34], the solution is restricted to the ℓ1-norm ball, which is a convex hull of the canonical bases. Generally, for any convex dictionary V, we have V = L^△(V). Therefore, the feasible set of any convex optimization problem is the convex hull of itself.


Let us consider the following problem of convex sparse approximation over V:

\min_{x \in E} f(x), \quad \text{s.t. } x \in L_K^\triangle(V),   (17)

where

L_K^\triangle(V) := \bigcup_{U \subseteq V, |U| \le K} \Big\{ \sum_{u \in U} \alpha_u u : \alpha_u \in \mathbb{R}^+, \ \sum_{u \in U} \alpha_u = 1 \Big\}.

To apply FBS to solve problem (17), we modify the update (9) to adapt to the convex constraint:

x^{(k)} = \arg\min_{x \in L^\triangle(U^{(k)})} f(x).   (18)

The preceding subproblem is essentially a smooth optimization over a simplex with scale dominated by k. When k is relatively small, such a subproblem can be efficiently solved via off-the-shelf methods. We next establish convergence rates of FBS (with update (18)) for convex sparse approximation.

2.4.1 V Is Bounded

We begin with the general case where V is a bounded set. The following theorem establishes the sublinear rate of convergence of FBS.

Theorem 2. Let us run K iterations of FBS (Algorithm 2) with x^(k) updated by (18). Assume that V is bounded with radius A. Assume that f is ρ_+(K+1)-strongly smooth over L(V). Given ε > 0 and x̄ ∈ L^△(V), if ∀k ≤ K, f(x^(k)) > f(x̄) and

K \ge \log_2 \frac{f(x^{(0)}) - f(\bar{x})}{4 \rho_+(K+1) A^2} + \frac{8 \rho_+(K+1) A^2}{\epsilon} - 1,

then FBS will output x^(K) satisfying f(x^(K)) ≤ f(x̄) + ε.

Note that in this result we do not require V to be symmetric. To prove this result, we need the following simple lemma.

Lemma 2. Denote by f : [0,1] → \mathbb{R} a quadratic function f(x) = ax^2 + bx + c with a > 0 and b ≤ 0. Then,

\min_{x \in [0,1]} f(x) \le c + \frac{b}{2} \min\Big\{ 1, -\frac{b}{2a} \Big\}.

Proof of Theorem 2. By the definition of restricted strong smoothness (see (6)) and the definition of x^(k) in (18), it holds that

f(x^{(k)}) \le \min_{\eta \in [0,1]} f\big( (1-\eta) x^{(k-1)} + \eta u^{(k)} \big)
 \le \min_{\eta \in [0,1]} f(x^{(k-1)}) + \eta \langle \nabla f(x^{(k-1)}), u^{(k)} - x^{(k-1)} \rangle + \frac{\rho_+(K+1)}{2} \eta^2 \| u^{(k)} - x^{(k-1)} \|^2
 \le \min_{\eta \in [0,1]} f(x^{(k-1)}) + \eta \langle \nabla f(x^{(k-1)}), u^{(k)} - x^{(k-1)} \rangle + 2 \rho_+(K+1) A^2 \eta^2
 \le \min_{\eta \in [0,1]} f(x^{(k-1)}) + \eta \langle \nabla f(x^{(k-1)}), \bar{x} - x^{(k-1)} \rangle + 2 \rho_+(K+1) A^2 \eta^2
 \le \min_{\eta \in [0,1]} f(x^{(k-1)}) + \eta \big( f(\bar{x}) - f(x^{(k-1)}) \big) + 2 \rho_+(K+1) A^2 \eta^2,

where the third inequality follows from the boundedness of the set V, the fourth inequality follows from the update rule (8), and the last inequality follows from the convexity of f.

Denote ε_k := f(x^(k)) − f(x̄). By invoking Lemma 2 on the preceding inequality, we get

f(x^{(k)}) \le f(x^{(k-1)}) - \frac{\epsilon_{k-1}}{2} \min\Big\{ 1, \frac{\epsilon_{k-1}}{4 \rho_+(K+1) A^2} \Big\},

which implies

\epsilon_k \le \epsilon_{k-1} - \frac{\epsilon_{k-1}}{2} \min\Big\{ 1, \frac{\epsilon_{k-1}}{4 \rho_+(K+1) A^2} \Big\}.

When ε_{k-1} ≥ 4ρ_+(K+1)A², we obtain ε_k ≤ ε_{k-1}/2; that is, ε_k converges geometrically toward 4ρ_+(K+1)A². Hence, we need at most log_2 [ε_0 / (4ρ_+(K+1)A²)] steps to achieve this level of precision. Subsequently, we have

\epsilon_k \le \epsilon_{k-1} - \frac{\epsilon_{k-1}^2}{8 \rho_+(K+1) A^2}.

By invoking [32, Lemma B.2], we have

\epsilon_k \le \frac{8 \rho_+(K+1) A^2}{k+1},

and ε_k ≤ ε after at most 8ρ_+(K+1)A²/ε − 1 more steps. Altogether, FBS converges to the desired precision ε if

K \ge \log_2 \frac{\epsilon_0}{4 \rho_+(K+1) A^2} + \frac{8 \rho_+(K+1) A^2}{\epsilon} - 1.

This proves the desired result. □

Remark 1. The study of sparse approximation over a simplex/convex set dates back to [9], [5] and has recently regained wide interest in machine learning and optimization [40], [32], [17], [21]. The O(1/ε) rate of convergence shown in Theorem 2 is standard in the literature [9], [40], [17]. In fact, such a rate also achieves the lower bound when the objective function is an arbitrary convex function [17]. When an inaccurate solution is allowed for the subproblem (18), it is known that the same convergence guarantee still holds up to the error level [17], [5].

Remark 2. Compared to the optimal first-order methods [36], [24], which converge with rate O(1/√ε), and the quasi-Newton methods [30] with near-superlinear convergence rate, a moderately increased number of O(1/ε) steps is needed in total by FBS for arbitrary convex objectives. As demonstrated shortly in Section 3, for some relatively complex constraints, such as the positive semidefinite constraint, the linear subproblem (8) in FBS is significantly cheaper than the euclidean projection operator used in most projected gradient methods. Therefore, the O(1/ε) rate in Theorem 2 represents the price for the simplification of each individual step as well as the natural sparsity over V.
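As a small illustrative building block for the simplex-constrained refitting step (18): one common way to handle the simplex constraint is projected gradient descent with a euclidean projection onto the simplex, computed by the standard sort-based routine below. This is an assumption about one possible off-the-shelf solver, not the method used in the paper.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection of w onto the probability simplex {v : v >= 0, sum(v) = 1},
    via the standard sort-and-threshold construction."""
    u = np.sort(w)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    cond = u + (1.0 - css) / np.arange(1, len(w) + 1) > 0
    rho = np.nonzero(cond)[0][-1]             # largest index satisfying the condition
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(w + theta, 0.0)
```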

2.4.2 V Is Finite

When V is finite with cardinality N, based on the arguments in Section 2.1, we may convert problem (17) to the following convex sparse learning problem:


\min_{\alpha \in \mathbb{R}^N} g(\alpha), \quad \text{s.t. } \|\alpha\|_0 \le K, \ \alpha \in \triangle_N,   (19)

where \triangle_N := \{\alpha \in \mathbb{R}^N : \alpha \ge 0, \|\alpha\|_1 = 1\} is the N-dimensional simplex. To apply FCFGS (Algorithm 1) to solve problem (19), we have to make the following modifications of (4) and (5) to adapt to the convexity constraint:

j^{(k)} = \arg\min_{i \in [[1, N]]} \big[ \nabla g(\alpha^{(k-1)}) \big]_i,   (20)

\alpha^{(k)} = \arg\min_{\mathrm{supp}(\alpha) \subseteq F^{(k)}, \ \alpha \in \triangle_N} g(\alpha).   (21)

The following result shows that, by making restricted strong convexity/smoothness assumptions on g, the geometric rate of convergence of FCFGS [32, Theorem 2.8] is still valid with the preceding modifications. This result is a nontrivial extension of [32, Theorem 2.8] to the setting of convex sparse approximation. In the context of this section, the restricted strong smoothness and restricted strong convexity are both defined over the canonical bases.

Theorem 3. Let g(α) be a differentiable convex function with domain \mathbb{R}^N, and let ᾱ be an S-sparse vector in the simplex. Let us run K iterations of FCFGS (Algorithm 1) to solve problem (19) with updating schemes (20) and (21). Assume that g is ρ_+(K+1)-strongly smooth and ρ_-(K+S)-strongly convex. Assume that g is L-Lipschitz continuous, i.e., |g(α) − g(α')| ≤ L‖α − α'‖. Given ε > 0, if ∀k ≤ K, g(α^(k)) > g(ᾱ) and

K \ge s(K, S) \ln \frac{g(\alpha^{(0)}) - g(\bar{\alpha})}{\epsilon},   (22)

where

s(K, S) := \max\Big\{ \frac{L}{\rho_+(K+1)}, \ \frac{4 S \rho_+(K+1)}{\rho_-(K+S)} \Big\},

then FCFGS will output α^(K) satisfying g(α^(K)) ≤ g(ᾱ) + ε.

We first prove the following lemma, which is key to the proof of this theorem.

Lemma 3. Given α̃ ∈ △_N with supp(α̃) = F̃, let F be an index set such that F̃ \ F ≠ ∅. Let

\alpha = \arg\min_{\mathrm{supp}(\alpha) \subseteq F, \ \alpha \in \triangle_N} g(\alpha).

Assume that g is L-Lipschitz continuous, ρ_+(|F|+1)-restricted-strongly smooth, and ρ_-(|F ∪ F̃|)-restricted-strongly convex. Assume that g(α) > g(α̃). Let j = arg min_i [∇g(α)]_i. Then, there exists η ∈ [0,1] such that

g(\alpha) - g\big( (1-\eta)\alpha + \eta e_j \big) \ge s \big( g(\alpha) - g(\tilde{\alpha}) \big),

where the constant s is given by

s := \min\Big\{ \frac{\rho_+(|F|+1)}{L}, \ \frac{\rho_-(|F \cup \tilde{F}|)}{4 \rho_+(|F|+1) |\tilde{F}|} \Big\}.   (23)

Proof. The strong smoothness of g(α) and α ∈ △_N imply that for η ∈ [0,1] the following inequality holds:

g\big( (1-\eta)\alpha + \eta e_j \big) \le h_j(\eta) := g(\alpha) + \eta \langle \nabla g(\alpha), e_j - \alpha \rangle + 2 \eta^2 \rho_+(|F|+1).

The definition of j implies h_j(η) ≤ h_i(η), i = 1, ..., N. The lemma is a direct consequence of the following stronger statement:

g(\alpha) - h_j(\eta) \ge s \big( g(\alpha) - g(\tilde{\alpha}) \big),   (24)

for an appropriate choice of η ∈ [0,1] and s given by (23). We next show the validity of inequality (24).

Denote F^c = F̃ \ F and τ = \sum_{i \in F^c} \tilde{\alpha}_i. It holds that (recall α̃_i ≥ 0)

h_j(\eta) \le \frac{1}{\tau} \sum_{i \in F^c} \tilde{\alpha}_i h_i(\eta)
 = g(\alpha) + \frac{\eta}{\tau} \Big( \sum_{i \in F^c} \tilde{\alpha}_i [\nabla g(\alpha)]_i - \tau \langle \nabla g(\alpha), \alpha \rangle \Big) + 2 \eta^2 \rho_+(|F|+1).   (25)

From the optimality of α, we get that

\Big\langle \nabla g(\alpha), \ \sum_{i \in F} \tilde{\alpha}_i e_i / (1-\tau) - \alpha \Big\rangle \ge 0.   (26)

Indeed, \sum_{i \in F} \tilde{\alpha}_i e_i / (1-\tau) ∈ △_N and is supported on F. Additionally, α_i = 0 for i ∉ F and α̃_i = 0 for i ∉ F̃. Therefore,

\sum_{i \in F^c} \tilde{\alpha}_i [\nabla g(\alpha)]_i = \sum_{i \in F^c} \big( \tilde{\alpha}_i - (1-\tau)\alpha_i \big) [\nabla g(\alpha)]_i
 \le \sum_{i \in F \cup \tilde{F}} \big( \tilde{\alpha}_i - (1-\tau)\alpha_i \big) [\nabla g(\alpha)]_i
 = \langle \nabla g(\alpha), \tilde{\alpha} - (1-\tau)\alpha \rangle
 = \langle \nabla g(\alpha), \tilde{\alpha} - \alpha \rangle + \tau \langle \nabla g(\alpha), \alpha \rangle,

where the inequality follows from (26). Combining the preceding inequality with (7), we obtain that

\sum_{i \in F^c} \tilde{\alpha}_i [\nabla g(\alpha)]_i - \tau \langle \nabla g(\alpha), \alpha \rangle \le \langle \nabla g(\alpha), \tilde{\alpha} - \alpha \rangle \le g(\tilde{\alpha}) - g(\alpha) - \frac{\rho_-(|F \cup \tilde{F}|)}{2} \|\alpha - \tilde{\alpha}\|^2.

Combining the above with (25),

h_j(\eta) \le g(\alpha) + \frac{\eta}{\tau} \Big( g(\tilde{\alpha}) - g(\alpha) - \frac{\rho_-(|F \cup \tilde{F}|)}{2} \|\alpha - \tilde{\alpha}\|^2 \Big) + 2 \eta^2 \rho_+(|F|+1).

By invoking Lemma 2 on the right-hand side of the preceding inequality, we get that there exists η̂ ∈ [0,1] such that

g(\alpha) - h_j(\hat{\eta}) \ge \frac{\Delta}{2\tau} \min\Big\{ 1, \ \frac{\Delta}{4 \tau \rho_+(|F|+1)} \Big\},

where Δ := g(α) − g(α̃) + (ρ_-(|F ∪ F̃|)/2) ‖α − α̃‖². We next distinguish the following two cases:

1. If Δ ≥ 4τρ_+(|F|+1). In this case,

g(\alpha) - h_j(\hat{\eta}) \ge \frac{\Delta}{2\tau} \ge 2 \rho_+(|F|+1) \ge \frac{\rho_+(|F|+1) \big( g(\alpha) - g(\tilde{\alpha}) \big)}{L},

where the last inequality follows from the Lipschitz continuity g(α) − g(α̃) ≤ L‖α − α̃‖ ≤ 2L.

2. If Δ < 4τρ_+(|F|+1). In this case,

g(\alpha) - h_j(\hat{\eta}) \ge \frac{\Delta^2}{8 \tau^2 \rho_+(|F|+1)}
 \ge \frac{2 \cdot \frac{\rho_-(|F \cup \tilde{F}|)}{2} \big( g(\alpha) - g(\tilde{\alpha}) \big) \|\alpha - \tilde{\alpha}\|^2}{8 \tau^2 \rho_+(|F|+1)}
 \ge \frac{\rho_-(|F \cup \tilde{F}|) \big( g(\alpha) - g(\tilde{\alpha}) \big) \sum_{i \in F^c} \tilde{\alpha}_i^2}{4 \tau^2 \rho_+(|F|+1)}
 \ge \frac{\rho_-(|F \cup \tilde{F}|) \big( g(\alpha) - g(\tilde{\alpha}) \big)}{4 \rho_+(|F|+1) \|\tilde{\alpha}\|_0}
 \ge \frac{\rho_-(|F \cup \tilde{F}|) \big( g(\alpha) - g(\tilde{\alpha}) \big)}{4 \rho_+(|F|+1) |\tilde{F}|}.

Combining both cases, we prove the claim (24). □

We are now in the position to prove Theorem 3.

Proof of Theorem 3. Denote ε_k := g(α^(k)) − g(ᾱ). The definition of update (21) implies that g(α^(k)) ≤ min_{η∈[0,1]} g((1−η)α^(k-1) + η e_{j^(k)}). The conditions of Lemma 3 are satisfied, and therefore we obtain (with F = F^(k) and F̃ = supp(ᾱ))

\epsilon_{k-1} - \epsilon_k = g(\alpha^{(k-1)}) - g(\alpha^{(k)}) \ge \big( s(K,S) \big)^{-1} \epsilon_{k-1},

where s is given by

s(K, S) := \max\Big\{ \frac{L}{\rho_+(K+1)}, \ \frac{4 S \rho_+(K+1)}{\rho_-(K+S)} \Big\}.

Therefore, ε_k ≤ ε_{k-1}(1 − (s(K,S))^{-1}). Applying this inequality recursively, we obtain ε_k ≤ ε_0(1 − (s(K,S))^{-1})^k. Using the inequality 1 − s ≤ exp(−s) and rearranging, we get ε_k ≤ ε_0 exp(−k(s(K,S))^{-1}). When K satisfies (22), it is guaranteed that ε_K ≤ ε. □

Remark 3. To the best of our knowledge, Theorem 3 establishes for the first time a geometric rate of convergence for convex sparse approximation over a finite dictionary V. Such an improvement is attributable to the fully corrective adjustment at each iteration. Note that both sides of (22) depend on K. For sufficiently large K such that ρ_+(K+1) ≥ 0.5(Lρ_-(K+S))^{1/2}, we have s(K,S) = 4Sρ_+(K+1)/ρ_-(K+S), which is nondecreasing with respect to K. If the ratio s(K,S)/K vanishes, i.e., s(K,S) = o(K), then condition (22) can be satisfied when K is sufficiently large.


3 APPLICATIONS

In this section, we demonstrate the performance of FBS when applied to two matrix learning problems: sparse precision matrix estimation and low-rank subspace segmentation. Our algorithms are implemented in Matlab (Version 7.7) on Windows Vista. All runs are performed on a desktop with an Intel Core2 Quad 2.80 GHz CPU and 8 GB of RAM.

3.1 FBS for Sparse Precision Matrix Estimation

We consider the problem of estimating the precision (inverse covariance) matrix of high-dimensional random vectors under the assumption that the true precision matrix is sparse. This problem arises in a variety of applications, among them computational biology, natural language processing, and document analysis, where the model dimension may be comparable to or substantially larger than the sample size. Specifically, for the multivariate Gaussian distribution, precision matrix estimation is equivalent to learning the structure of a Gaussian Markov random field (GMRF) [6].

3.1.1 Problem Setup

Let x be a p-variate random vector following a zero-mean Gaussian distribution N(0, Σ_0). Its density is parameterized by the precision matrix Ω_0 = Σ_0^{-1} ≻ 0 as follows:

f(x; \Omega_0) = \frac{1}{\sqrt{(2\pi)^p (\det \Omega_0)^{-1}}} \exp\Big( -\frac{1}{2} x^\top \Omega_0 x \Big).

Let G = (V, E) be a graph representing the conditional independence relations between the components of x. The vertex set V has p elements corresponding to [x]_1, ..., [x]_p, and the edge set E consists of ordered pairs (i, j), where (i, j) ∈ E if there is an edge between [x]_i and [x]_j. The edge between [x]_i and [x]_j is excluded from E if and only if [x]_i and [x]_j are independent given {[x]_k, k ≠ i, j}. It is well known that when x ∼ N(0, Σ_0), the conditional independence between [x]_i and [x]_j given the other variables is equivalent to [Ω_0]_{ij} = 0. Thus, for Gaussian distributions, learning the structure of G is equivalent to estimating the support of the precision matrix Ω_0.

Given i.i.d. samples {x_i}_{i=1}^n following N(0, Σ_0), the negative log-likelihood, up to a constant, can be written in terms of the precision matrix as

L(x_1, \ldots, x_n; \Omega_0) := -\log \det \Omega_0 + \langle \Sigma_n, \Omega_0 \rangle,

where Σ_n is the sample covariance matrix given by

\Sigma_n = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top.
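For concreteness, the sample covariance and the (constant-free) negative log-likelihood L(Ω) = −log det Ω + ⟨Σ_n, Ω⟩ can be evaluated directly; a minimal numpy sketch.

```python
import numpy as np

def sample_covariance(X):
    """X: n x p data matrix; returns the sample covariance Sigma_n."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def neg_log_likelihood(Omega, Sigma_n):
    """L(Omega) = -log det(Omega) + <Sigma_n, Omega>, up to an additive constant."""
    _, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.sum(Sigma_n * Omega)
```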

We are interested in the problem of estimating a sparse precision matrix Ω_0 with no more than a prespecified number of off-diagonal nonzero entries. For this purpose, we consider the following cardinality-constrained log-determinant program:

\min_{\Omega \succ 0} L(\Omega), \quad \text{s.t. } \mathrm{Card}(\Omega^-) \le 2K,   (27)

where L(Ω) := −log det Ω + ⟨Σ_n, Ω⟩, Ω^- is the restriction of Ω to the off-diagonal entries, Card(Ω^-) = |supp(Ω^-)| is the cardinality of the support set of Ω^-, and the integer K > 0 bounds the number of edges |E| in the GMRF. If K is unknown, one may regard K as a tuning parameter, which can be selected through validation (as done in our experiments). Note that problem (27) is essentially a subset selection problem that is NP-hard.

To solve this problem, we consider applying FBS to pursue a sparse representation over the dictionary V_ggm given by (recall that e_i is the ith canonical basis vector in \mathbb{R}^p)

V_{\mathrm{ggm}} := \{ u u^\top : u = e_i, \ \text{or} \ u = e_i \pm e_j, \ i \ne j \}.


It is easy to verify that problem (27) is equivalent to

\min_{\Omega} L(\Omega), \quad \text{s.t. } \Omega \in L^+_{2K+p}(V_{\mathrm{ggm}}).   (28)

Algorithm 3 is a specialization of FBS to the above sparse precision matrix estimation problem. In this algorithm, we initialize the iteration with Ω^(0) = I. At each iteration k, the linear subproblem is solved by (29), which selects the two (symmetric) off-diagonal entries with the largest magnitude in the gradient matrix ∇L(Ω^(k-1)). The coefficient-aggregation subproblem (30) minimizes L restricted to the entries selected so far. Since problem (30) is convex, any off-the-shelf convex optimization method can be applied to solve it. In our implementation, we resort to the alternating direction method (ADM) to solve this subproblem, given its reported efficiency on log-determinant programs [39]. The computational accuracy for solving the subproblem is set to 10^{-4}. Note that V_ggm is essentially a vector set; thus, the formulation (28) reduces sparse precision matrix estimation to a convex optimization problem over vectors, which has also been considered in [17]. Algorithm 3 can thus be taken as a variant of FCFGS [32] for solving this problem.

Algorithm 3. FBS for Sparse Precision Matrix Estimation.
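The Algorithm 3 listing did not survive the extraction either. The sketch below follows the textual description above: the gradient of L at Ω is Σ_n − Ω^{-1}, the largest-magnitude off-diagonal entry pair is added to the support, and L is re-minimized over the selected support. The refit callable is a placeholder for the ADM subproblem solver mentioned above; all names are illustrative assumptions.

```python
import numpy as np

def fbs_precision(Sigma_n, K, refit):
    """Sketch of FBS for sparse precision matrix estimation (in the spirit of Algorithm 3).

    Sigma_n : p x p sample covariance matrix.
    K       : maximum number of symmetric off-diagonal entry pairs to select.
    refit   : assumed callable minimizing L(Omega) = -log det(Omega) + <Sigma_n, Omega>
              over matrices supported on the given index set (any convex solver could stand in).
    """
    p = Sigma_n.shape[0]
    Omega = np.eye(p)
    support = {(i, i) for i in range(p)}        # diagonal entries are always free
    for _ in range(K):
        grad = Sigma_n - np.linalg.inv(Omega)   # gradient of L at the current iterate
        off = np.abs(grad)
        np.fill_diagonal(off, 0.0)
        for (a, b) in support:                  # exclude entries already selected
            off[a, b] = 0.0
        i, j = np.unravel_index(np.argmax(off), off.shape)
        if off[i, j] == 0:
            break
        support |= {(i, j), (j, i)}             # add the symmetric entry pair
        Omega = refit(Sigma_n, support, Omega)  # fully corrective step over the support
    return Omega
```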

3.1.2 Experiment

We evaluate the performance of FBS for sparse precision matrix estimation on several synthetic and real datasets. In particular, we investigate parameter estimation and support recovery accuracy in terms of several different criteria using simulation data (for which we know the ground truth). In addition, we apply our method to the analysis of a gene dataset and report results in terms of binary classification performance. We compare the performance of our estimator to the following two representative log-determinant-program-based models for sparse precision matrix estimation:

. ℓ1-norm regularized log-determinant program: Given a sample covariance matrix Σ_n, Yuan and Lin [38] formulated sparse inverse covariance estimation as the following ℓ1-norm penalized log-determinant program:

\min_{\Omega \in \mathbb{S}^p_+} -\log \det \Omega + \langle \Sigma_n, \Omega \rangle + \lambda |\Omega^-|_1.   (31)

There exists a variety of optimization algorithms addressing this convex formulation [3], [11], [30], [28], [23], [37], [39]. Among these, we compare our method with three representative solvers: a block coordinate descent method, GLasso [11]; a smoothing and proximal gradient method, VSM [23]; and a quasi-Newton method, PQN [30]. For these three methods, the algorithm stops when the duality gap is below 10^{-3}.

. Smoothly clipped absolute deviation (SCAD) penalty regularized log-determinant program: Fan et al. [8] considered the following nonconvex extension of (31) with the SCAD penalty:

\min_{\Omega \succ 0} -\log \det \Omega + \langle \Sigma_n, \Omega \rangle + \sum_{i=1}^p \sum_{j=1}^p \mathrm{SCAD}_{\lambda, a}\big( |[\Omega]_{ij}| \big),

where the SCAD function SCAD_{λ,a} is that proposed in [7]. This model can be solved by an iteratively reweighted procedure that solves, at each iteration, an ℓ1-norm penalized log-determinant program in the form of (31).

Simulation. Our simulation study employs the following three sparse precision matrix models, with varying levels of sparsity:

. Model 1: [Ω_0]_{ij} = 0.6^{|i-j|}.
. Model 2: [Ω_0]_{ij} = I(|i-j| = 0) + 0.4 I(|i-j| = 1) + 0.2 I(|i-j| = 2) + 0.2 I(|i-j| = 3) + 0.1 I(|i-j| = 4), where I(·) is the indicator function.
. Model 3: Ω_0 = B + δI, where each off-diagonal entry in B is generated independently and equals 1 with probability P = 0.1 or 0 with probability 1 − P = 0.9. B has zeros on the diagonal, and δ is chosen so that the condition number of Ω_0 is p.

Models 1 and 2 have banded structures, and the values of the entries decay as they move away from the diagonal. Model 3 is an example of a sparse matrix without any special sparsity pattern. For each model, we generate a training sample of size n = 100 from N(0, Σ_0), and an independent sample of size 100 from the same distribution for validating the tuning parameter K. Using the training data, we compute a series of estimators with different values of K and use the one with the smallest log-likelihood loss on the validation sample, where the loss is defined by

L(\hat{\Omega} \mid \Sigma_n^{\mathrm{val}}) = -\log \det(\hat{\Omega}) + \langle \Sigma_n^{\mathrm{val}}, \hat{\Omega} \rangle.

This cross-validation scheme is applied to tune the parameters for all methods in our simulation study. We compare performance for different values of p = 30, 60, 120, 200, replicated 100 times each. The quality of precision matrix estimation is measured by its distance to the truth in spectral norm, matrix ℓ1-norm, and Frobenius norm. To evaluate the support recovery performance, we use the standard F-score from the information retrieval literature, defined as

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \text{F-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},


TABLE 1 Comparison of Average (SE) of Different Matrix Loss and F-Score for Synthetic Data Model 1 over 100 Replications

For matrix loss, the smaller the better, while for F-score, the larger the better.

where TP stands for true positives (for nonzero entries), and FP and FN stand for false positives/negatives. Since one can generally trade off precision and recall by increasing one and decreasing the other, it is common practice to use the F-score as a single metric to evaluate different methods. The larger the F-score, the better the support recovery performance. Numerical values above 10^{-3} in magnitude are considered to be nonzero. The results are listed in Tables 1, 2, and 3. The key observations include:

. On Model 1: In most cases, FBS achieves considerably smaller average matrix ℓ1-norm and Frobenius norm losses, and higher F-scores, than the other methods. In terms of spectral norm loss, FBS is superior to GLasso, VSM, and PQN, but inferior to SCAD. The solution quality of GLasso, VSM, and PQN is identical since these three optimize the same objective to the same accuracy level.
. On Model 2: In most cases, FBS outperforms the other two in terms of all four criteria.
. On Model 3: In most cases, FBS works favorably in spectral norm loss, Frobenius norm loss, and F-score. In terms of spectral norm loss, FBS is inferior to SCAD and GLasso when p = 120 and p = 200.

We note that the standard errors of FBS are relatively larger compared to the other algorithms. This is to be expected since the estimator (27) is not convex, and FBS is a sparse approximation method that is typically less stable than the convex-optimization-based methods.

TABLE 2
Comparison of Average (SE) of Different Matrix Loss and F-Score for Synthetic Data Model 2 over 100 Replications
For matrix loss, the smaller the better, while for F-score, the larger the better.

TABLE 3
Comparison of Average (SE) of Different Matrix Loss and F-Score for Synthetic Data Model 3 over 100 Replications
For matrix loss, the smaller the better, while for F-score, the larger the better.

Real data. We have also compared the considered methods for linear discriminant analysis (LDA) classification of tumors using the breast cancer data from [15] (publicly available at http://bioinformatics.mdanderson.org/). The dataset consists of 22,283 gene expression levels of 133 subjects, including 34 subjects with pathological complete response (pCR) and 99 subjects with residual disease (RD). The pCR subjects are considered to have a high chance of cancer-free survival in the long term, and thus it is of great interest to study the response states of the patients (pCR or RD) to neoadjuvant (preoperative) chemotherapy. Based on the estimated inverse covariance matrix of the gene expression levels, we apply LDA to predict whether a subject can achieve the pCR state. We follow the same experimental protocol used in [1] and the references therein. For completeness, we briefly describe the steps here. The data are randomly divided into training and test sets. A stratified sampling approach is applied to divide the data, with five pCR subjects and 16 RD subjects randomly selected to constitute the test data (roughly 1/6 of the subjects in each group). The remaining subjects form the training set. For the training set, a two-sample t-test is performed between the


TABLE 4
Breast Cancer Data: Comparison of Average (SE) pCR Classification Errors over 100 Replications

two groups for each gene, and the 113 most significant genes are retained as the covariates for prediction. Note that the size of the training sample is 112, one less than the variable size, which allows us to examine the performance when p > n. The gene data are then standardized by the standard deviation estimated from the training data. Finally, following the LDA framework, the normalized gene expression data are assumed to be normally distributed as N(μ_l, Σ_0), where the two groups are assumed to have the same covariance matrix Σ_0 but different means μ_l, with l = 1 for pCR and l = 2 for RD. The estimated inverse covariance Ω̂ produced by the different methods under consideration is used in the LDA scores

\delta_l(x) = x^\top \hat{\Omega} \hat{\mu}_l - \frac{1}{2} \hat{\mu}_l^\top \hat{\Omega} \hat{\mu}_l + \log \hat{\pi}_l,

where \hat{\mu}_l = (1/n_l) \sum_{i \in \mathrm{group}\ l} x_i is the within-group average in the training set and \hat{\pi}_l = n_l/n is the proportion of group-l subjects in the training set. The classification rule is taken to be \hat{l}(x) = \arg\max_{l=1,2} \delta_l(x).

The classification performance is clearly associated with the estimation accuracy of Ω̂. We use the test data to assess the estimation performance and to compare the considered methods. For parameter tuning, we use six-fold cross-validation on the training data to choose K. We repeat the experiment 100 times. To compare the classification performance, we use the specificity, sensitivity (or recall), and Matthews correlation coefficient (MCC) criteria given by

\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \quad \mathrm{Sensitivity\ (Recall)} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},
\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},

where TP and TN stand for true positives (pCR) and true negatives (RD), respectively, and FP and FN stand for false positives/negatives. For these metrics, the larger the value, the better the classification performance. Since one can adjust the decision threshold in any specific algorithm to trade off specificity and sensitivity (increasing one while reducing the other), the MCC is more meaningful as a single performance metric. Table 4 reports the means and standard deviations of pCR classification errors over the 100 replications of the experiment. We note that FBS performs favorably in terms of the overall performance.

3.2 FBS for Subspace Segmentation

Consider the following problem of low-rank representation [22], [25] for subspace segmentation:

\min_{X \in \mathbb{R}^{n \times n}} \mathrm{rank}(X), \quad \text{s.t. } D = DX, \ X \succeq 0,   (32)

where D = [d_1, d_2, ..., d_n] ∈ \mathbb{R}^{p \times n} are n observed data vectors drawn from l unknown linear subspaces {S_i}_{i=1}^l. The analysis in [22] shows that the optimal representation obtained from problem (32) captures the global structure of the data and thus naturally forms an affinity matrix for spectral clustering. Furthermore, it has been justified in [25] that the positive semidefinite (PSD) constraint is effective to enforce the representation to be a valid kernel matrix. To apply FBS to the low-rank representation problem, we alternatively solve its penalized variant:

\min_{X \in \mathbb{R}^{n \times n}} \| D - DX \|_{\mathrm{Frob}}^2, \quad \text{s.t. } \mathrm{rank}(X) \le K, \ X \succeq 0,   (33)

where ‖·‖_Frob is the Frobenius norm. Indeed, this problem is a special case of the following problem of convex optimization over the cone of PSD matrices with a low-rank constraint:

\min_{X \in \mathbb{R}^{n \times n}} f(X), \quad \text{s.t. } X \succeq 0, \ \mathrm{rank}(X) \le K.   (34)

To solve this problem, we consider applying FBS to perform sparse approximation over L_K^+(V_sdp), where V_sdp is given by

V_{\mathrm{sdp}} := \{ u u^\top : u \in \mathbb{R}^n, \ \|u\| = 1 \}.

This is motivated by the spectral theorem: any PSD matrix X ∈ \mathbb{R}^{n \times n} with rank at most K can be written as X = \sum_{i=1}^K \sigma_i u_i u_i^\top with σ_i ≥ 0. In this case, at time instance k, the subproblem (8) in FBS becomes

Y^{(k)} = \arg\max_{Y \in V_{\mathrm{sdp}}} \langle -\nabla f(X^{(k-1)}), Y \rangle.   (35)

The following result shows that we can find a closed-form solution to the preceding linear subproblem.

Proposition 1. For any matrix X ∈ \mathbb{R}^{n \times n}, one solution of Y^* = \arg\max_{Y \in V_{\mathrm{sdp}}} \langle X, Y \rangle is given by Y^* = u u^\top, where u is the leading eigenvector of X.

Proof. Rewrite the matrices in terms of the eigendecomposition of the positive semidefinite matrix Y = U \Lambda U^\top = \sum_{i=1}^n \lambda_i u_i u_i^\top, where λ is the vector containing the diagonal entries of Λ. From the constraint on Y, we have λ ∈ S := {λ : λ_i ≥ 0 and \sum_i \lambda_i \le 1}. Inserting this expression into the objective function gives

\max_{Y \in V_{\mathrm{sdp}}} \langle X, Y \rangle = \max_{\lambda \in S, U \in \mathcal{O}} \sum_{i=1}^n \lambda_i u_i^\top X u_i,   (36)

where \mathcal{O} is the set of orthonormal matrices, also known as the Stiefel manifold. Let v be the leading eigenvector of X. Then \sum_{i=1}^n \lambda_i u_i^\top X u_i \le \sum_{i=1}^n \lambda_i v^\top X v \le v^\top X v. Obviously, the equality holds for Y = v v^\top. This proves the desired result. □

By invoking this proposition on (35), we immediately get that Y^(k) = u u^\top, where u is the leading eigenvector of −∇f(X^(k-1)).


TABLE 5 Results on the EYD-B Dataset

Fig. 1. Sample images from the EYD-B dataset.

The Lanczos algorithm [12] can be used to estimate the leading eigenvector. It has been shown in [14], [17] that, under certain conditions, the Lanczos algorithm is sufficient to maintain the convergence guarantee for the compact case. Algorithm 4 summarizes the FBS method for low-rank representation.

Algorithm 4. FBS for Low Rank Representation.
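As a sketch of the atom-selection step in Algorithm 4 for objective (33): since f(X) = ‖D − DX‖²_Frob, we have −∇f(X) = 2Dᵀ(D − DX), and a Lanczos-based eigensolver (scipy's eigsh) returns its leading eigenvector. The explicit symmetrization and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def leading_atom(D, X):
    """Atom-selection step for problem (33): leading eigenvector of -grad f(X),
    where f(X) = ||D - D X||_Frob^2 and hence -grad f(X) = 2 D^T (D - D X)."""
    G = 2.0 * D.T @ (D - D @ X)
    G = 0.5 * (G + G.T)               # symmetrize before calling the eigensolver
    _, u = eigsh(G, k=1, which="LA")  # Lanczos: largest algebraic eigenvalue
    u = u[:, 0]
    return np.outer(u, u)             # rank-one atom u u^T
```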

3.2.1 Experiment

We conduct the experiment on the Extended Yale Face Database B (EYD-B), publicly available at http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html. The EYD-B contains 16,128 images of 38 human subjects under nine poses and 64 illumination conditions. Fig. 1 shows several sample images from this dataset. Following the experimental setup in [22], we use the first 10 individuals, with 64 near-frontal face images for each individual, in our experiment. The size of each cropped grayscale image is 42 × 48 pixels, and we use the raw pixel values to form data vectors of dimension 2,016. Each image vector is then normalized to unit length.

We compare FBS with LRR [22], which solves problem (32) via the augmented Lagrange multiplier (ALM) method. For clustering, the representations X respectively learned by FBS and LRR are fed into the same spectral clustering routine. The performance is measured by the recognition accuracy rate. In this experiment, we initialize X^(0) = 0 and set K = 70 in FBS. In our implementation, we utilize the quasi-Newton method to optimize the subproblem (37) with precision parameter 10^{-4}. Our numerical experience shows that the quasi-Newton method scales well when K ≤ 100. The parameters for LRR are set as suggested by the original authors [22].

The quantitative results on EYD-B are tabulated in Table 5. It can be observed that FBS and LRR achieve comparable clustering accuracies, while the former needs much less CPU time. Meanwhile, it can be seen from the column "Rank" that FBS outputs a representation matrix with lower rank than that of LRR. This experiment validates that FBS is an efficient and effective greedy method for the low-rank representation problem.

4 CONCLUSION

The proposed FBS algorithm generalizes FCFGS from coordinate-wise sparse approximation to the relaxed setting of pursuing a sparse representation over a fixed dictionary. At each iteration, FBS automatically selects a new basis atom in the dictionary, achieving the minimum inner product with the current gradient, and then optimally adjusts the combination weights of the bases selected so far. We further extend FBS to the setups of pursuing nonnegative and convex sparse representations. Convergence analysis shows that FBS and its extensions generally converge sublinearly, while a geometric rate of convergence can be achieved under more restrictive conditions. The per-iteration computational overhead of FBS is dominated by a linear subproblem that can be efficiently solved by Newton-type methods when K is relatively small. The subproblem of combination-weight optimization can be efficiently solved via off-the-shelf methods. The proposed method has been applied to sparse precision matrix estimation and subspace segmentation, with effectiveness and efficiency validated by experiments on benchmark datasets. To conclude, FBS is a generic yet efficient method for learning sparse representations over a finite or countable dictionary.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive comments on this paper. This research was partially supported by the Singapore National Research Foundation under its International Research Center @ Singapore Funding Initiative and administered by the IDM Programme Office, and partially supported by Grant BK2012045, Jiangsu Province, China. This work was undertaken while Xiao-Tong Yuan was affiliated with the Department of Electrical and Computer Engineering, National University of Singapore.

REFERENCES

[1] T. Cai, W. Liu, and X. Luo, "A Constrained ℓ1 Minimization Approach to Sparse Precision Matrix Estimation," J. Am. Statistical Assoc., vol. 106, no. 494, pp. 594-607, 2011.
[2] K. Clarkson, "Coresets, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm," Proc. ACM-SIAM Symp. Discrete Algorithms, pp. 922-931, 2008.
[3] A. d'Aspremont, O. Banerjee, and L. Ghaoui, "First-Order Methods for Sparse Covariance Selection," SIAM J. Matrix Analysis and Its Applications, vol. 30, no. 1, pp. 56-66, 2008.
[4] M. Dudik, Z. Harchaoui, and J. Malick, "Lifted Coordinate Descent for Learning with Trace-Norm Regularization," Proc. 15th Int'l Conf. Artificial Intelligence and Statistics, pp. 327-336, 2012.
[5] J.C. Dunn and S. Harshbarger, "Conditional Gradient Algorithms with Open Loop Step Size Rules," J. Math. Analysis and Applications, vol. 62, no. 2, pp. 432-444, 1978.
[6] D.M. Edwards, Introduction to Graphical Modelling. Springer, 2000.
[7] J. Fan, "Comment on 'Wavelets in Statistics: A Review' by A. Antoniadis," J. Italian Statistical Assoc., vol. 6, pp. 131-138, 1997.
[8] J. Fan, Y. Feng, and Y. Wu, "Network Exploration via the Adaptive Lasso and SCAD Penalties," Annals of Applied Statistics, vol. 3, no. 2, pp. 521-541, 2009.
[9] M. Frank and P. Wolfe, "An Algorithm for Quadratic Programming," Naval Research Logistics Quarterly, vol. 5, pp. 95-110, 1956.
[10] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, vol. 29, pp. 1189-1232, 2001.
[11] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse Inverse Covariance Estimation with the Graphical Lasso," Biostatistics, vol. 9, no. 3, pp. 432-441, 2008.
[12] G. Golub and C. Loan, Matrix Computations. Johns Hopkins Univ. Press, 1996.
[13] A. Grubb and J. Bagnell, "Generalized Boosting Algorithms for Convex Optimization," Proc. 28th Int'l Conf. Machine Learning, pp. 1209-1216, 2011.
[14] E. Hazan, "Sparse Approximate Solutions to Semidefinite Programs," Proc. Eighth Latin Am. Conf. Theoretical Informatics, pp. 306-316, 2008.
[15] K. Hess, K. Anderson, W. Symmans, V. Valero, N. Ibrahim, J. Mejia, D. Booser, R. Theriault, A. Buzdar, P. Dempsey, R. Rouzier, N. Sneige, J. Ross, T. Vidaurre, H. Gómez, G. Hortobagyi, and L. Pusztai, "Pharmacogenomic Predictor of Sensitivity to Preoperative Chemotherapy with Paclitaxel and Fluorouracil, Doxorubicin, and Cyclophosphamide in Breast Cancer," J. Clinical Oncology, vol. 24, pp. 4236-4244, 2006.
[16] J.-B. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms. Springer-Verlag, 1993.
[17] M. Jaggi, "Sparse Convex Optimization Methods for Machine Learning," PhD thesis, ETH Zurich, 2011.
[18] M. Jaggi and M. Sulovský, "A Simple Algorithm for Nuclear Norm Regularized Problems," Proc. 27th Int'l Conf. Machine Learning, pp. 471-478, 2010.
[19] R. Johnson and T. Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest," technical report, 2011.
[20] Y. Kim and J. Kim, "Gradient Lasso for Feature Selection," Proc. 21st Int'l Conf. Machine Learning, pp. 60-67, 2004.
[21] R. Luss and M. Teboulle, "Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint," technical report, 2011.
[22] G. Liu, Z. Lin, and Y. Yu, "Robust Subspace Segmentation by Low-Rank Representation," Proc. 27th Int'l Conf. Machine Learning, pp. 663-670, 2010.
[23] Z. Lu, "Smooth Optimization Approach for Sparse Covariance Selection," SIAM J. Optimization, vol. 19, no. 4, pp. 1807-1827, 2009.
[24] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
[25] Y. Ni, J. Sun, X. Yuan, S. Yan, and L. Cheong, "Robust Low-Rank Subspace Segmentation with Semidefinite Guarantees," Proc. Workshop Optimization Based Methods for Emerging Data Mining Problems, 2010.
[26] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, "Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition," Proc. 27th Ann. Asilomar Conf. Signals, Systems, and Computers, vol. 1, pp. 40-41, 1993.
[27] R. Rockafellar, Convex Analysis. Princeton Press, 1970.
[28] A.J. Rothman, P.J. Bickel, E. Levina, and J. Zhu, "Sparse Permutation Invariant Covariance Estimation," Electronic J. Statistics, vol. 2, pp. 494-515, 2008.
[29] R. Schapire, "The Boosting Approach to Machine Learning: An Overview," Proc. MSRI Workshop Non-Linear Estimation and Classification, 2002.
[30] M. Schmidt, E. Berg, M. Friedlander, and K. Murphy, "Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm," Proc. 12th Int'l Conf. Artificial Intelligence and Statistics, pp. 456-463, 2009.
[31] S. Shalev-Shwartz, A. Gonen, and O. Shamir, "Large-Scale Convex Minimization with a Low-Rank Constraint," Proc. 28th Int'l Conf. Machine Learning, 2011.
[32] S. Shalev-Shwartz, N. Srebro, and T. Zhang, "Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints," SIAM J. Optimization, vol. 20, pp. 2807-2832, 2010.
[33] A. Tewari, P. Ravikumar, and I.S. Dhillon, "Greedy Algorithms for Structurally Constrained High Dimensional Problems," Proc. Advances in Neural Information Processing Systems, 2011.
[34] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," J. Royal Statistical Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[35] J. Tropp and A. Gilbert, "Signal Recovery from Random Measurements via Orthogonal Matching Pursuit," IEEE Trans. Information Theory, vol. 53, no. 12, pp. 4655-4666, 2007.
[36] P. Tseng, "On Accelerated Proximal Gradient Methods for Convex-Concave Optimization," submitted to SIAM J. Optimization, 2008.
[37] C. Wang, D. Sun, and K.-C. Toh, "Solving Log-Determinant Optimization Problems by a Newton-CG Primal Proximal Point Algorithm," SIAM J. Optimization, vol. 20, pp. 2994-3013, 2010.
[38] M. Yuan and Y. Lin, "Model Selection and Estimation in the Gaussian Graphical Model," Biometrika, vol. 94, no. 1, pp. 19-35, 2007.
[39] X. Yuan, "Alternating Direction Method of Multipliers for Covariance Selection Models," J. Scientific Computing, vol. 62, no. 2, pp. 432-444, 2012.
[40] T. Zhang, "Sequential Greedy Approximation for Certain Convex Optimization Problems," IEEE Trans. Information Theory, vol. 49, no. 3, pp. 682-691, Mar. 2003.
[41] T. Zhang, "Sparse Recovery with Orthogonal Matching Pursuit under RIP," IEEE Trans. Information Theory, vol. 57, no. 9, pp. 6215-6221, Sept. 2011.

Xiao-Tong Yuan received the PhD degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009. He is currently a professor in the School of Information and Control, Nanjing University of Information Science and Technology. His main research interests include machine learning, data mining, and computer vision. He has authored or coauthored more than 40 technical papers over a wide range of research topics. He received the winner prize of the classification task in PASCAL VOC '10.

Shuicheng Yan is currently an associate professor in the Department of Electrical and Computer Engineering, National University of Singapore, and the founding lead of the Learning and Vision Research Group (http://www.lvnus.org). His research areas include computer vision, multimedia, and machine learning, and he has authored/coauthored more than 300 technical papers over a wide range of research topics, with a Google Scholar citation count above 8,500 and an H-index of 42. He is an associate editor of the IEEE Transactions on Circuits and Systems for Video Technology and the ACM Transactions on Intelligent Systems and Technology. He received the Best Paper Awards from ACM MM '12 (demo), PCM '11, ACM MM '10, ICME '10, and ICIMCS '09; the winner prizes of the classification task in PASCAL VOC 2010-2012; the winner prize of the segmentation task in PASCAL VOC 2012; the honorable mention prize of the detection task in PASCAL VOC '10; the 2010 TCSVT Best Associate Editor (BAE) Award; the 2010 Young Faculty Research Award; the 2011 Singapore Young Scientist Award; and the 2012 NUS Young Researcher Award. He is a coauthor of the best student paper awards of PREMIA '09, PREMIA '11, and PREMIA '12. He is a senior member of the IEEE.
