
Stochastic Learning via Optimizing the Variational Inequalities

Qing Tao, Qian-Kun Gao, De-Jun Chu, and Gao-Wei Wu

Abstract— A wide variety of learning problems can be posed in the framework of convex optimization, and many efficient algorithms have been developed based on solving the induced optimization problems. However, there exists a gap between the theoretically unbeatable convergence rate and the practically efficient learning speed. In this paper, we use the variational inequality (VI) convergence to describe the learning speed. To this end, we avoid the hard concept of regret in online learning and directly discuss stochastic learning algorithms. We first cast the regularized learning problem as a VI. Then, we present a stochastic version of the alternating direction method of multipliers (ADMM) to solve the induced VI. We define a new VI-criterion to measure the convergence of stochastic algorithms. While the rate of convergence of any iterative algorithm for solving nonsmooth convex optimization problems cannot be better than O(1/√t), the proposed stochastic ADMM (SADMM) is proved to have an O(1/t) VI-convergence rate for the l1-regularized hinge loss problems without strong convexity and smoothness. The derived VI-convergence results also support the viewpoint that the standard online analysis is too loose to analyze the stochastic setting properly. The experiments demonstrate that SADMM has almost the same performance as the state-of-the-art stochastic learning algorithms, but its O(1/t) VI-convergence rate is capable of tightly characterizing the real learning speed.

Index Terms— Alternating direction method of multipliers (ADMM), machine learning, online learning, optimization, regret, stochastic learning, variational inequality (VI).

I. INTRODUCTION

A WIDE variety of problems of recent interest in machine learning can be posed in the framework of convex optimization. Among them, regularized loss optimization is a dominant learning paradigm in which one jointly minimizes an empirical loss plus a regularization term. In particular, given a training set S = {(x_1, y_1), ..., (x_m, y_m)}, where (x_i, y_i) ∈ R^N × Y, Y = {−1, 1}, each x_i ≠ 0 is drawn independently and identically distributed, and y_i is the label of x_i, the regularized learning yields a convex optimization problem of the form

F(w) = \lambda r(w) + l(w)    (1)

Manuscript received May 3, 2013; revised October 5, 2013; accepted December 8, 2013. Date of publication January 2, 2014; date of current version September 15, 2014. This work was supported in part by the National Science Foundation of China under Grant 61273296 and Grant 61175050, and in part by the National Science Foundation of Anhui Province under Grant 1308085QF121. Q. Tao, Q.-K. Gao, and D.-J. Chu are with the New Star Institute of Applied Technology, Hefei 230031, China (e-mail: [email protected]; [email protected]; [email protected]). G.-W. Wu is with the Institute of Automation, Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2294741

where λ is a tradeoff parameter, r(w) is a regularizer, l(w) = \frac{1}{m}\sum_{i=1}^{m} l_i(w), and each l_i(w) is the loss caused by (x_i, y_i). Then, the learning model is typically trained to find an ε-accurate solution ŵ such that F(ŵ) ≤ min_w F(w) + ε.

A wide range of efficient algorithms have been developed based on solving the optimization problem (1). For large-scale learning problems, even the cheapest first-order batch approach cannot work well, simply because evaluating a single gradient of the objective function requires going through the whole data set. One popular alternative is online learning, which operates by repetitively drawing examples in a sequence of consecutive rounds, one at a time, and providing an answer to the chosen sample using simple calculations. As argued in [4], the combined low complexity and low accuracy, together with other tradeoffs in statistical learning theory, still make online algorithms favorite choices. Since the samples are independent and identically distributed (i.i.d.), one of the most important facts behind the efficiency of online learning is that classification accuracy becomes stable after only a small part of the samples in the whole training set has been learned [19]. So far, many well-known optimization methods have been extended from the batch to the online setting. Typical examples include the projected subgradient method [10], [19], [24], regularized dual averaging (RDA) [23], and composite objective mirror descent (Comid) [6], for which various regret bounds have been established to theoretically guarantee the effectiveness of online optimization. Regularly, online learning is able to yield an alternative algorithm for solving stochastic learning problems, the goal of which is to minimize a regularized expectation of the loss function. The rate of the associated stochastic algorithms can be obtained by analyzing the algorithm in the harder setting of online learning combined with a standard online-to-batch conversion [6], [23]. However, [11] and [18] indicated that the standard online analysis is too loose to analyze the stochastic setting properly. In this paper, we directly focus on stochastic learning and point out why our analysis fails to recover the equivalent regret bound of online learning.

Research in machine learning and research in optimization theory have become increasingly coupled. The optimization framework of learning ensures that the learning accuracy of a solution of the induced optimization problem is sufficiently good when the objective value converges, i.e., the output classifier is decided by an approximate minimum of the objective function. For general convex optimization problems, RDA and Comid only have an O(1/√t) convergence rate, where t is the number of iterations performed by the algorithm, and this



limited rate is in a sense unbeatable according to Nemirovski and Yudin's analysis [14]. However, in practice, RDA and Comid often exhibit empirical performance that their convergence rate does not match. As generalization is the bottom line in machine learning, achieving a very high accuracy in solving the learning optimization problem does not translate into a significant increase in generalization ability. Therefore, we have enough reason to argue that the objective-convergence may not be the best criterion for learning, because there indeed exists a gap between the witnessed empirical efficiency and the lack of matching theoretical results. In fact, the learning optimization problems being solved are only rough approximations of the real problem of finding a model that generalizes well [3]. Solving the optimization problem is only regarded as a proxy that helps us obtain good prediction accuracy. In real online learning, what one cares about most is the number of samples with which the optimization algorithm attains stable test accuracy. This number reveals the generalization ability of a learner as a function of the sample size as well as the real rate of learning. Therefore, it is an interesting question whether we can employ some other criterion of convergence that tightly describes the learning speed while keeping the same test accuracy.

The key tools here are the variational inequality (VI) approach and the alternating direction method of multipliers (ADMM). This paper is mainly motivated by [12], the focus of which is on settling the convergence of ADMM in the batch setting, a question that had remained a theoretical challenge for a few decades. In contrast, the intent of this paper is to provide a stochastic learning framework for large-scale learning problems that can be phrased as a VI, which goes somewhat beyond pure convex optimization techniques and may shed light on new tight bounds characterizing the learning speed. Following the recent developments in stochastic learning [11], [18], we directly deal with ADMM in the stochastic setting. This enables us to avoid the hard concept of VI-regret as well as the weak online analysis with an online-to-batch conversion. Within the VI framework, we first equivalently cast the regularized learning optimization problem as a VI. Then we introduce a stochastic VI-convergence to measure the convergence of a class of stochastic algorithms. We present a stochastic ADMM (SADMM), at each round of which the sample to optimize is chosen uniformly at random. Specifically, for the l1-regularized hinge loss problems, we prove that SADMM obtains an O(1/t) VI-convergence rate. Similar to the optimization framework, the use of the equivalent VI means that the learning accuracy of a VI solution should be sufficiently good when its VI-associated algorithm converges, i.e., the output classifier is decided by an approximate solution in terms of the VI. Therefore, the derived O(1/t) VI-convergence rate implies that we achieve a tight learning bound for a general convex problem without strong convexity and smoothness. The derived VI-convergence results also support the viewpoint in [11] and [18] that the standard online analysis is too loose to analyze the stochastic setting properly. The experiments demonstrate that our SADMM has almost the same performance as state-of-the-art stochastic algorithms such as RDA and Comid, but its O(1/t) VI-convergence rate

is more capable of bounding the real learning speed than the usual unbeatable O(1/√t) rate in terms of the objective-convergence.

The rest of this paper is organized as follows. Section II discusses ADMM and the VI. A framework for stochastic learning via optimizing the VI is established in Sections III and IV. Experimental results are reported in Section V, and the conclusion is drawn in Section VI.

II. ADMM AND VI

ADMM is a first-order dual optimization algorithm originally proposed in [9]. The last two decades have witnessed impressive development of ADMM in the area of convex programming. Recently, ADMM in the batch setting has received rapidly increasing interest in machine learning [5], [20]. To introduce a general form of ADMM, we first consider the following optimization problem:

\min \; f(w) + g(z) \quad \text{s.t.} \quad Aw + Bz = c    (2)

where w, z, c ∈ R^N, A, B ∈ R^{N×N}, and f and g are convex. The augmented Lagrangian function of this problem with a penalty parameter ρ > 0 is given by

L_\rho(w, z, \mu) = f(w) + g(z) + \langle \mu, Aw + Bz - c \rangle + \frac{\rho}{2}\|Aw + Bz - c\|^2.    (3)

The well-known ADMM [5] iterates as follows:

w^{t+1} = \arg\min_w \left\{ f(w) + \langle \mu^t, Aw + Bz^t - c \rangle + \frac{\rho}{2}\|Aw + Bz^t - c\|^2 \right\}
z^{t+1} = \arg\min_z \left\{ g(z) + \langle \mu^t, Aw^{t+1} + Bz - c \rangle + \frac{\rho}{2}\|Aw^{t+1} + Bz - c\|^2 \right\}
\mu^{t+1} = \mu^t + \rho(Aw^{t+1} + Bz^{t+1} - c).    (4)
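For concreteness, iteration (4) can be sketched in code as follows. This is a minimal illustration only, assuming the two subproblem minimizations are available as user-supplied callables; the names `argmin_w` and `argmin_z` are hypothetical placeholders, not part of any library.

```python
import numpy as np

def admm(argmin_w, argmin_z, A, B, c, rho, num_iters, w0, z0):
    """Minimal sketch of the batch ADMM iteration (4).

    argmin_w(mu, z): returns argmin_w f(w) + <mu, Aw + Bz - c> + (rho/2)||Aw + Bz - c||^2
    argmin_z(mu, w): returns argmin_z g(z) + <mu, Aw + Bz - c> + (rho/2)||Aw + Bz - c||^2
    Both are problem-specific solvers supplied by the user.
    """
    w, z = w0, z0
    mu = np.zeros_like(c)
    for _ in range(num_iters):
        w = argmin_w(mu, z)                   # w-update with z fixed
        z = argmin_z(mu, w)                   # z-update with the new w fixed
        mu = mu + rho * (A @ w + B @ z - c)   # dual (multiplier) update
    return w, z, mu
```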

In ADMM, the two primal variables w and z are updated in an alternating fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers in which a single Gauss–Seidel pass over w and z is used instead of the usual joint minimization of L_ρ(w, z, μ). In fact, minimizing L_ρ(w, z, μ) with respect to w and z jointly is not any easier than solving the original problem (2). For many applications, the optimization subproblems in (4) often have closed-form solutions and ADMM is especially efficient, which is the main reason behind the recent burst of ADMM's wide applications in statistical machine learning. In 2011, a simple unified proof of the convergence rate of ADMM (4) was provided using the VI [12]. This result settles the convergence-rate question for ADMM, which had remained a theoretical challenge for a few decades. The general VI problem can be defined as follows.

Definition 2.1 [8]: Given a subset S of R^N and a mapping G: S → R^N, the VI, denoted by VI(S, G), is to find a vector w* ∈ S such that ⟨w − w*, G(w*)⟩ ≥ 0 ∀ w ∈ S.

Generally speaking, each constrained optimization problem can be equivalently formulated as a VI. Specifically, when F


is differentiable and S is convex, solving the VI(S, ∇F) is equivalent to solving the constrained convex problem

\min \; F(w) \quad \text{s.t.} \quad w \in S.

In addition to providing a unified mathematical model for a variety of optimization problems, the VI theory includes many special cases that are important in their own right [8]. Using the VI, a remarkable O(1/t) convergence rate was proved for the batch ADMM (4) when solving the optimization problem (2) [12]. We briefly review some issues in the following. Let u = (w, z), F(u) = f(w) + g(z), v = (w, z, μ), and Q(v) = (A^T μ, B^T μ, −Aw − Bz + c). It is easy to see that a saddle point of the Lagrangian associated with the constrained problem (2) can be reformulated as a VI [8], [12], i.e., find v* = (w*, z*, μ*) ∈ R^{3N} such that ∀ v ∈ R^{3N}

\mathrm{VI}(Q, F): \quad F(u) - F(u^*) + \langle v - v^*, Q(v^*) \rangle \ge 0.    (5)

According to the Lagrangian duality in optimization theory [2], solving the constrained optimization problem (2) is equivalent to solving the VI(Q, F) (5). However, there exists a key difference between the algorithms for deriving an approximate VI-solution and an approximate minimal solution, i.e., the former can offer an O(1/t) convergence rate while the latter only has an O(1/√t) convergence rate [12]. Note that ⟨v − v*, Q(v*)⟩ = ⟨v − v*, Q(v)⟩. Therefore, (5) can be equivalently characterized as [12]

F(u) - F(u^*) + \langle v - v^*, Q(v) \rangle \ge 0 \quad \forall v \in R^{3N}.    (6)

Based on this equivalent relationship, an approximate solution of VI(Q, F) is defined as follows.

Definition 2.2 [12]: v* is an approximate solution of VI(Q, F) with accuracy ε > 0 if it satisfies

F(u^*) - F(u) + \langle v^* - v, Q(v) \rangle \le \varepsilon \quad \forall v \in R^{3N}.    (7)

In [12], it was proved that ADMM (4) derives an approximate solution of VI(Q, F) with ε = O(1/t) when v is forced to lie in any compact set.
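To make Definition 2.2 concrete, the quantity on the left-hand side of (7) can be evaluated numerically for a candidate point v* against any probe point v. A minimal sketch is given below, assuming f and g are supplied as callables; all function and variable names here are illustrative only.

```python
import numpy as np

def vi_gap(f, g, A, B, c, v_star, v):
    """Left-hand side of (7): F(u*) - F(u) + <v* - v, Q(v)>,
    where u = (w, z), v = (w, z, mu), F(u) = f(w) + g(z) and
    Q(v) = (A^T mu, B^T mu, -Aw - Bz + c).
    v_star is an epsilon-approximate VI solution if this value stays
    below epsilon for all probe points v in the compact set of interest."""
    w_s, z_s, mu_s = v_star
    w, z, mu = v
    F_star = f(w_s) + g(z_s)
    F_v = f(w) + g(z)
    Qv = np.concatenate([A.T @ mu, B.T @ mu, -(A @ w) - (B @ z) + c])
    diff = np.concatenate([w_s - w, z_s - z, mu_s - mu])
    return F_star - F_v + float(diff @ Qv)
```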

III. STOCHASTIC ADMM

In this section, we concentrate our discussion on the learning problem

F(w) = \lambda \|w\|_1 + \frac{1}{m} \sum_{i=1}^{m} \max\{0, 1 - y_i \langle w, x_i \rangle\}.    (8)

As both \|w\|_1 and the hinge loss are only subdifferentiable, each iterative algorithm for solving (8), such as RDA or Comid, only has an O(1/√t) rate in terms of the objective-convergence. Problem (8) can be equivalently reformulated as

\min \; \lambda \|z\|_1 + \frac{1}{m} \sum_{i=1}^{m} l_i(w) \quad \text{s.t.} \quad w - z = 0    (9)

where l_i(w) = \max\{0, 1 - y_i \langle w, x_i \rangle\}. In stochastic learning, at the tth step, we only need to solve the instance of (9) caused by the sample (x_t, y_t)

\min \; \lambda \|z\|_1 + l_t(w) \quad \text{s.t.} \quad w - z = 0    (10)

where the sample (x_t, y_t) is chosen uniformly at random from the training set. Its augmented Lagrangian is

L_\rho(w, z, \mu) = \lambda \|z\|_1 + l_t(w) + \langle \mu, w - z \rangle + \frac{\rho}{2} \|w - z\|^2.    (11)

Based on the above augmented Lagrangian, we present an SADMM for solving problem (9), in which the key operation is

w^{t+1} = \arg\min_w \left\{ l_t(w) + \langle \mu^t, w - z^t \rangle + \frac{\rho}{2} \|w - z^t\|^2 \right\}
z^{t+1} = \arg\min_z \left\{ \lambda \|z\|_1 + \langle \mu^t, w^{t+1} - z \rangle + \frac{\rho}{2} \|w^{t+1} - z\|^2 \right\}
\mu^{t+1} = \mu^t + \rho (w^{t+1} - z^{t+1}).    (12)

In the experimental section, we will compare SADMM with the state-of-the-art algorithms. To make this paper self-contained, we briefly review the key operations of RDA and Comid here. In [16], a dual averaging method was presented for different types of nonsmooth optimization problems requiring only subgradient information. At each iteration, the learning variables are adjusted by solving a simple minimization problem that involves a running average of all past subgradients, emphasizing more recent gradients. RDA is an extension of this method for solving regularized learning problems [23] in an online setting. More specifically, the key iteration of RDA takes the form

w^{t+1} = \arg\min_w \left\{ \langle \bar{g}_t, w \rangle + \lambda \|w\|_1 + \frac{\beta_t}{t} h(w) \right\}    (13)

where \bar{g}_t = \frac{1}{t} \sum_{j=1}^{t} \nabla l_j(w^j), \nabla l_j(w) is a subgradient of l_j at w, h(w) is an auxiliary strongly convex function, and \{\beta_t\}_{t \ge 1} is a nonnegative and nondecreasing input sequence. In [6], mirror descent [1] is generalized to the case when the objective function is composite, which is named Comid. The key iteration of Comid is

w^{t+1} = \arg\min_w \left\{ \eta_t \langle \nabla l_t(w^t), w \rangle + \eta_t \lambda \|w\|_1 + B_\psi(w, w^t) \right\}    (14)

where B_ψ is a Bregman divergence induced by a convex function ψ. Intuitively, Comid works by minimizing a first-order approximation of the loss at the current iterate w^t while forcing the next iterate w^{t+1} to lie close to w^t, where the step-size η_t controls the tradeoff between these two terms. It can be observed that the regularization term is not linearized in either RDA or Comid: they treat the regularizer and the loss separately rather than regarding their sum as a black-box objective. This is why RDA and Comid can effectively keep the regularization structure in online learning. For SADMM, from the subproblem for z^{t+1}, it is easy to see that the sparsity imposed by the l1-regularizer is preserved. In fact, SADMM is naturally suited to exploit the regularization structure because it copes with the variables of the loss and the regularizer individually. In the following, we discuss how to cope with the subproblems in SADMM (12).
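Before turning to those subproblems, it may help to see one concrete instance of (14): when B_ψ(w, w^t) = ½‖w − w^t‖² (the Euclidean case), the Comid step reduces to a gradient step on the loss followed by entrywise soft thresholding. A minimal sketch, with illustrative names only:

```python
import numpy as np

def comid_l1_euclidean_step(w, grad_loss, lam, eta):
    """Comid step (14) with B_psi(w, w^t) = 0.5 * ||w - w^t||^2, i.e., a
    proximal-gradient (soft-thresholded gradient) update for the l1 regularizer."""
    v = w - eta * grad_loss                                        # first-order step on the loss
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)     # prox of eta*lam*||.||_1
```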


To obtain the closed-form solution of w^{t+1}, we first reformulate the w-subproblem as

w^{t+1} = \arg\min_w \; l_t(w) + \frac{\rho}{2}\left\| w - z^t + \frac{1}{\rho}\mu^t \right\|^2.

By introducing a slack variable ξ_t ≥ 0, we rewrite it as

\min \; \frac{\rho}{2}\left\| w - z^t + \frac{1}{\rho}\mu^t \right\|^2 + \xi_t \quad \text{s.t.} \quad y_t \langle w, x_t \rangle \ge 1 - \xi_t, \; \xi_t \ge 0.    (15)

Then, we obtain the following dual form:

\min_\beta \; \frac{y_t^2 \|x_t\|^2}{2\rho}\beta^2 + \left( y_t \left\langle z^t - \frac{1}{\rho}\mu^t, x_t \right\rangle - 1 \right)\beta \quad \text{s.t.} \quad 0 \le \beta \le 1.

This is a quadratic optimization problem in a single variable, and its closed-form solution is

\beta_t = \min\left\{ \max\left\{ 0, \; \frac{\rho\left( 1 - y_t \left\langle z^t - \frac{1}{\rho}\mu^t, x_t \right\rangle \right)}{\|x_t\|^2} \right\}, \; 1 \right\}.

Therefore, the closed-form solution of w^{t+1} in (15) is

w^{t+1} = \frac{1}{\rho}\beta_t y_t x_t + z^t - \frac{1}{\rho}\mu^t.
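Putting the pieces together, one full SADMM iteration (12) for the l1-regularized hinge loss might be sketched as follows. The w-update uses the closed form just derived; the soft-thresholding form of the z-update is derived immediately below. This is a minimal illustration, and the function and variable names are chosen here only for exposition.

```python
import numpy as np

def sadmm_step(w, z, mu, x_t, y_t, lam, rho):
    """One SADMM iteration (12) for the l1-regularized hinge loss.

    w, z, mu : current iterates (numpy arrays of length N)
    x_t, y_t : randomly drawn sample (nonzero feature vector) and its label in {-1, +1}
    lam, rho : regularization tradeoff and ADMM penalty parameters
    """
    # w-update: closed form via the single-variable dual derived above.
    v = z - mu / rho
    beta = min(max(0.0, rho * (1.0 - y_t * np.dot(v, x_t)) / np.dot(x_t, x_t)), 1.0)
    w_new = beta * y_t * x_t / rho + v

    # z-update: entrywise soft thresholding (derived below in the text).
    u = rho * w_new + mu
    z_new = np.sign(u) * np.maximum(np.abs(u) - lam, 0.0) / rho

    # Dual (multiplier) update.
    mu_new = mu + rho * (w_new - z_new)
    return w_new, z_new, mu_new
```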

The closed-form solution of z^{t+1} can be derived using the well-known soft-thresholding method (j = 1, ..., N):

z_j^{t+1} =
\begin{cases}
0, & \text{if } |\rho w_j^{t+1} + \mu_j^t| \le \lambda \\
\frac{1}{\rho}\left( \rho w_j^{t+1} + \mu_j^t - \lambda \, \mathrm{sgn}(\rho w_j^{t+1} + \mu_j^t) \right), & \text{otherwise.}
\end{cases}

IV. CONVERGENCE ANALYSIS

Regret is a de-facto standard for measuring the convergence performance of online learning algorithms [10], [24]. More precisely, it measures the difference in payoff between the online player and the best fixed point in hindsight. For the optimization problem (1), the regret can be defined as follows.

Definition 4.1: Let F^t(w) = \lambda r(w) + l_t(w). The regret of an online learning sequence \{w^t\} in terms of solving (1) with respect to a fixed w is defined as

R_F(T) = \sum_{t=1}^{T} \left[ F^t(w^t) - F^t(w) \right].

Under fair conditions, online algorithms such as the projected gradient method, RDA, and Comid derive an O(√T) regret for general convex problems [6], [19], [23], which is the best bound currently known for these problems. When r(w) = ‖w‖²/2, they all have an O(log T) regret. Generally speaking, online learning can yield an alternative algorithm for solving stochastic optimization problems, which we call stochastic learning. In appearance, the goal of a stochastic learning algorithm such as RDA or Comid at the tth step is only to solve the induced optimization problem caused by the sample (x_i, y_i), where i is chosen uniformly at random from [m]. However, from the viewpoint of optimization, there is an important difference between online and stochastic learning, i.e., the former is to minimize the regret while the latter is to optimize the expected regularized loss \lambda\|w\|_1 + E_{x_i} l_i(w). Moreover, recent studies show that stochastic learning can derive better convergence results [11], [18]. Following the paradigm of stochastic gradient descent in [11] and [18], we directly focus on the stochastic version of ADMM (12). We obtain an O(1/t) convergence rate, but the method used here fails to analyze ADMM in the online setting.

Problem (9) is a special case of (2), i.e., A = I, B = −I, and c = 0. According to Definition 2.1, v* is an approximate solution of the VI associated with (9) if and only if

F(u^*) - F(u) + \langle v^* - v, Q(v) \rangle \le \varepsilon \quad \forall v \in R^{3N}    (16)

where u = (w, z), v = (w, z, μ), F(u) = \lambda\|z\|_1 + l(w), and Q(v) = (μ, −μ, −w + z). Let ξ[T] denote the collection of i.i.d. random variables (ξ_0, ξ_1, ..., ξ_T). In the stochastic setting, each vector v^T in SADMM (12) is a random variable that depends on ξ[T]. Similar to stochastic optimization, it is now natural for us to use the expectation of the VI to analyze the convergence of stochastic learning algorithms.

Definition 4.2: Let F^t(u) = \lambda\|z\|_1 + l_t(w). The expectation of the VI of a stochastic learning sequence \{v^t\} in terms of solving the VI (6) with respect to a fixed v is defined as

E_{\xi[T]}\left[ F(u) - F(u^t) + \langle v - v^t, Q(v) \rangle \right].

According to (6) and Definition 4.2, the goal of a stochastic algorithm in terms of optimizing the VI is to generate a sequence \{v^t\}_{t=1}^{\infty} such that ∀ v ∈ R^{3N}

\lim_{t \to \infty} E_{\xi[T]}\left[ F(u) - F(u^t) + \langle v - v^t, Q(v) \rangle \right] \ge 0.    (17)

Let \{v^t = (w^t, z^t, \mu^t)\} be the sequence generated by SADMM (12). Let

\tilde{u}^t = (\tilde{w}^t, \tilde{z}^t) = (w^{t+1}, z^{t+1}),
\tilde{v}^t = (\tilde{w}^t, \tilde{z}^t, \tilde{\mu}^t) = (w^{t+1}, z^{t+1}, \mu^t + \rho(w^{t+1} - z^t)).

As shown in the Appendix, our analysis is mainly based on the sequence \tilde{v}^t.

Theorem 4.1: ∀ v = (w, z, μ) ∈ R^{3N}, we have

\sum_{t=0}^{T}\left[ F^t(\tilde{u}^t) - F^t(u) \right] + \sum_{t=0}^{T} \langle \tilde{v}^t - v, Q(v) \rangle \le \frac{\rho}{2}\|z - z^0\|^2 + \frac{1}{2\rho}\|\mu - \mu^0\|^2.

Theorem 4.2: For any integer T > 0, let

\bar{u}^T = \frac{1}{T+1}\sum_{t=0}^{T} \tilde{u}^t, \qquad \bar{v}^T = \frac{1}{T+1}\sum_{t=0}^{T} \tilde{v}^t.

Then, ∀ v ∈ R^{3N}, we have

E_{\xi[T]}\left[ F(\bar{u}^T) - F(u) + \langle \bar{v}^T - v, Q(v) \rangle \right] \le \frac{1}{T+1}\left( \frac{\rho}{2}\|z - z^0\|^2 + \frac{1}{2\rho}\|\mu - \mu^0\|^2 \right).    (18)
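Theorem 4.2 bounds the VI gap at the averaged (ergodic) iterates. In an implementation these averages can be maintained incrementally rather than stored; the following minimal sketch (not part of the original algorithm description) reuses the hypothetical sadmm_step helper sketched in Section III, and the same recursion can be applied to maintain the averaged v̄^T.

```python
import numpy as np

def run_sadmm_with_averaging(samples, lam, rho, T, N):
    """Run SADMM for T+1 steps and return the ergodic average of
    u_tilde^t = (w^{t+1}, z^{t+1}) used in Theorem 4.2.

    samples: sequence of (x_t, y_t) pairs drawn uniformly at random
             from the training set.
    """
    w, z, mu = np.zeros(N), np.zeros(N), np.zeros(N)
    u_bar = np.zeros(2 * N)                      # running average of (w^{t+1}, z^{t+1})
    for t in range(T + 1):
        x_t, y_t = samples[t]
        w, z, mu = sadmm_step(w, z, mu, x_t, y_t, lam, rho)
        u_tilde = np.concatenate([w, z])
        u_bar = (t * u_bar + u_tilde) / (t + 1)  # incremental ergodic average
    return u_bar
```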


According to Theorem 4.2, it is easy to see that {v̄^T}_{T=1}^∞ satisfies (17). Based upon the inequality (18) and the approximate solution (16) in the stochastic setting, we say that SADMM (12) has an O(1/t) VI-convergence rate. In addition to the convergence in expectation, we can also derive the same order of VI-convergence rate with high probability. For general convex problems, this bound guarantees that fewer samples are required for SADMM to reach an approximate VI solution, and our experiments demonstrate that it is capable of describing the real learning speed (see Section V-A).

To highlight the contribution of this paper, we give two important remarks.

1) According to Definition 4.1, we can define a VI-regret as

\sum_{t=1}^{T}\left[ F^t(u^t) - F^t(u) \right] + \sum_{t=1}^{T} \langle v^t - v, Q(v) \rangle

to analyze ADMM in the online setting, in which each sample is drawn in a sequence of consecutive rounds. Unfortunately, we can only bound \sum_{t=0}^{T}[F^t(\tilde{u}^t) - F^t(u)] + \sum_{t=0}^{T} \langle \tilde{v}^t - v, Q(v) \rangle in Theorem 4.1. Note that \tilde{u}^t = (\tilde{w}^t, \tilde{z}^t) = (w^{t+1}, z^{t+1}). This means that in Theorem 4.1 we only consider the loss of the prediction for time t + 1 suffered at time t, which is not the real online loss. It implies that the analysis method in this paper cannot ensure an O(1) VI-regret bound for the online ADMM. Thus, directly discussing SADMM is beneficial, which supports the viewpoint in [11] and [18].

2) As the idea underlying SADMM is quite simple, it is not surprising that similar algorithms have been proposed. In [21], an online ADMM is presented, in which the following proximal augmented Lagrangian is employed:

L_\rho(w, z, \mu) = \lambda r(z) + l_t(w) + \langle \mu, Aw + Bz - c \rangle + \frac{\rho}{2}\|Aw + Bz - c\|^2 + \eta B_\phi(w, w^t)    (19)

where η is a learning rate and the Bregman divergence satisfies B_\phi(w, w^t) ≥ (α/2)‖w − w^t‖². Using an additional Bregman divergence with a specifically increasing learning rate η and increasing step-size rules for ρ, the online ADMM associated with (19) derives an O(√T) regret bound for general convex optimization problems and an O(log T) regret bound for strongly convex optimization problems [21]. When η = 0, they consider two regrets, based on whether the solution needs to lie in the feasible set or not. By using specifically increasing step-size rules for ρ, they obtain regret bounds similar to those in the proximal case. Naturally, a SADMM with an O(1/√t) convergence rate can be obtained using the standard online-to-batch conversion. However, our SADMM (12) is different from the ADMM in [21] in several aspects. For example, we do not use the proximal Bregman term, as the augmented Lagrangian already contains a strongly convex regularizer. We discuss the convergence of SADMM with a constant ρ. Even when η = 0, different methods are used to solve the subproblems, i.e., we use the dual formulation to get the closed-form solution of w while the first-order optimality conditions are employed in [21]. Therefore, we can easily deal with the nonsmooth hinge loss. What is more, we achieve an O(1/t) VI-convergence rate.

The l1-regularized stochastic learning algorithm (12) in Section III can easily be extended to r(w) = ‖w‖² by simply replacing ‖z‖_1 with ‖z‖² in (12). In the l2-regularized case, the closed-form solutions of w^{t+1} and z^{t+1} in (12) can still easily be derived. Since only convexity is required in the convergence analysis, an O(1/t) VI-convergence rate can still be achieved.

TABLE I: REAL DATA SETS, WHERE THE SPLIT DESCRIBES THE SIZE OF A TRAIN/TEST SET.

V. EXPERIMENT

In this section, we present experiments to validate the theoretical analysis and demonstrate the performance of our algorithms. Our goal here is mainly to show the basic analysis and characteristics of the method, rather than a comprehensive performance evaluation on a wide range of data sets. Typically, we only consider six benchmark data sets, CCAT, Astro-physics, Covertype, A9a, Gisette, and Mnist (Table I), which are publicly available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Obviously, the tradeoff parameter λ affects both convergence and sparsity. In all the comparison experiments, λ is chosen to have reasonable performance for all involved algorithms. As the scalability of ADMM in the batch setting has been demonstrated to be competitive with many state-of-the-art optimization methods [7], [22], we only focus on comparison with the state-of-the-art algorithms in the stochastic setting. All algorithms are implemented on a Sun Fire X4170 M2 Server with a 2.40-GHz Intel(R) Xeon(R) CPU and 12 GB of main memory under Solaris. As usual, we do not include the bias term for any of the solvers. Each stochastic algorithm is run 10 times on identical random example sequences, and the reported results are averages.

A. O(1/t) Bound and Test Error Difference

We first consider the l1-regularized hinge loss problems. We empirically investigate the test error difference of SADMM with respect to the sample size using three different tradeoff parameters, where the medium λ yields nearly the best accuracy. The relationships between the O(1/t) bound, the test error difference, and the number of training samples on the six data sets are shown in Fig. 1, where the solid curve is the relative test error difference and the dashed curve is the O(1/t) bound derived in Theorem 4.2. Based on these observations,

Fig. 1. Test error differences on the real data sets. (a) CCAT. (b) Astro-physics. (c) Covertype. (d) A9a. (e) Gisette. (f) Mnist.

we conclude that the derived O(1/t) VI-convergence rate is indeed able to characterize the real learning speed.

B. Comparison Experiments on the l1-Regularized Problems

Since few papers cope with this nonsmooth problem, we choose to compare SADMM with RDA and Comid [6], [23]. The relationships between the test error and the number of iterations on the six data sets are shown in Fig. 2.

Besides obtaining the same level of sparsity, SADMM has similar scalability to RDA and Comid, as shown in Fig. 2. This figure shows that SADMM, RDA, and Comid require almost the same number of samples before their accuracy becomes stable when dealing with the l1-regularized problems, but SADMM has a matched VI-convergence rate. Combined with the O(1/t) bound experiments, we indeed observe that the O(1/√t) convergence rate of RDA and Comid in terms of the function value is too loose to describe

Fig. 2. Comparison results of the l1-regularized problems. (a) CCAT. (b) Astro-physics. (c) Covertype. (d) A9a. (e) Gisette. (f) Mnist.

the real learning speed. To some extent, these experiments demonstrate that the objective-convergence may not be the best criterion for learning.

C. Comparison Experiments on the l2-Regularized Problems

We consider standard support vector machines (SVMs). For this problem with strong convexity, one of the most

efficient stochastic algorithms is Pegasos, which outperforms many state-of-the-art SVM solvers, as shown in [19]. Therefore, we choose to compare with RDA, Comid, and Pegasos, all of which can derive an O(log t/t) convergence rate. The relationships between the test error and the number of iterations on the six data sets are shown in Fig. 3. We observe that SADMM has competitive performance in solving the l2-regularized problems. This fact demonstrates that SADMM

Fig. 3. Comparison results of the l2-regularized problems. (a) CCAT. (b) Astro-physics. (c) Covertype. (d) A9a. (e) Gisette. (f) Mnist.

is also a practical algorithm for SVMs, but with a matched VI-convergence rate.

VI. CONCLUSION

In this paper, we construct a new framework for stochastic learning via optimizing the VI. Within this framework, we present a SADMM with an O(1/t) VI-convergence rate for the l1-regularized hinge loss problems, which are convex optimization problems without strong convexity and smoothness. The proposed SADMM has the observed empirical efficiency together with a matched VI-convergence rate. Recently, there has been a growing body of work discussing the generalization property


of online strongly convex programming algorithms [13], [17]. Obviously, there are several possible extensions of this paper. For example, can we derive some tighter generalization bounds via optimizing the VI? This issue will be investigated in our future work.

APPENDIX A

To prove Theorem 4.1, we first prove Lemmas A.1 and A.2.

Lemma A.1: Let \{v^t = (w^t, z^t, \mu^t)\} be the sequence generated by the ADMM (12). Let

\tilde{u}^t = (\tilde{w}^t, \tilde{z}^t) = (w^{t+1}, z^{t+1}),
\tilde{v}^t = (\tilde{w}^t, \tilde{z}^t, \tilde{\mu}^t) = (w^{t+1}, z^{t+1}, \mu^t + \rho(w^{t+1} - z^t)).

Then, ∀ v ∈ R^{3N}, we have

F^t(u) - F^t(\tilde{u}^t) + \langle v - \tilde{v}^t, Q(\tilde{v}^t) \rangle \ge \rho \langle z - z^{t+1}, z^t - z^{t+1} \rangle + \frac{1}{\rho}\langle \mu - \tilde{\mu}^t, \mu^t - \mu^{t+1} \rangle.

Proof: By the relationship between the VI and the optimization subproblems in SADMM (12),

l_t(w) - l_t(w^{t+1}) + \langle w - w^{t+1}, \mu^t + \rho(w^{t+1} - z^t) \rangle \ge 0,
\lambda\|z\|_1 - \lambda\|z^{t+1}\|_1 + \langle z - z^{t+1}, \rho(-w^{t+1} + z^{t+1}) - \mu^t \rangle \ge 0.

Using the notation in \tilde{u}^t and \tilde{v}^t, the above two inequalities can be rewritten as

l_t(w) - l_t(\tilde{w}^t) + \langle w - \tilde{w}^t, \tilde{\mu}^t \rangle \ge 0    (20)


and

\lambda\|z\|_1 - \lambda\|\tilde{z}^t\|_1 + \langle z - \tilde{z}^t, -\tilde{\mu}^t \rangle \ge -\langle z - \tilde{z}^t, \rho(\tilde{z}^t - z^t) \rangle.    (21)

Using the notation in \tilde{v}^t and the update rule for \mu^{t+1} in SADMM (12), we have

\langle v - \tilde{v}^t, Q(\tilde{v}^t) \rangle = \langle w - \tilde{w}^t, \tilde{\mu}^t \rangle + \langle z - \tilde{z}^t, -\tilde{\mu}^t \rangle + \langle \mu - \tilde{\mu}^t, \tilde{z}^t - \tilde{w}^t \rangle
= \langle w - \tilde{w}^t, \tilde{\mu}^t \rangle + \langle z - \tilde{z}^t, -\tilde{\mu}^t \rangle + \left\langle \mu - \tilde{\mu}^t, \frac{1}{\rho}(\mu^t - \mu^{t+1}) \right\rangle.

Summing the inequalities (20) and (21), we have

l_t(w) - l_t(\tilde{w}^t) + \lambda\|z\|_1 - \lambda\|\tilde{z}^t\|_1 + \langle v - \tilde{v}^t, Q(\tilde{v}^t) \rangle \ge -\langle z - \tilde{z}^t, \rho(\tilde{z}^t - z^t) \rangle + \left\langle \mu - \tilde{\mu}^t, \frac{1}{\rho}(\mu^t - \mu^{t+1}) \right\rangle.

Thus, Lemma A.1 is proved.

Lemma A.2: Let \{v^t = (w^t, z^t, \mu^t)\} be the sequence generated by the ADMM (12). Then, we have

\rho\langle z - z^{t+1}, z^t - z^{t+1} \rangle + \frac{1}{\rho}\langle \mu - \tilde{\mu}^t, \mu^t - \mu^{t+1} \rangle \ge \frac{\rho}{2}\left( \|z - z^{t+1}\|^2 - \|z - z^t\|^2 \right) + \frac{1}{2\rho}\left( \|\mu - \mu^{t+1}\|^2 - \|\mu - \mu^t\|^2 \right).

Proof: Due to the identity \langle a, b \rangle = \frac{1}{2}\left( \|a\|^2 + \|b\|^2 - \|a - b\|^2 \right), we have

\langle z - z^{t+1}, z^t - z^{t+1} \rangle = \frac{1}{2}\left( \|z - z^{t+1}\|^2 + \|z^t - z^{t+1}\|^2 - \|z - z^t\|^2 \right).

Similarly,

\langle \mu - \mu^{t+1}, \mu^t - \mu^{t+1} \rangle = \frac{1}{2}\left( \|\mu - \mu^{t+1}\|^2 + \|\mu^t - \mu^{t+1}\|^2 - \|\mu - \mu^t\|^2 \right)    (22)

and

\langle -(\tilde{\mu}^t - \mu^{t+1}), \mu^t - \mu^{t+1} \rangle = -\frac{1}{2}\left( \|\tilde{\mu}^t - \mu^{t+1}\|^2 + \|\mu^t - \mu^{t+1}\|^2 - \|\tilde{\mu}^t - \mu^t\|^2 \right).    (23)

Summing (22) and (23), we have

\langle \mu - \tilde{\mu}^t, \mu^t - \mu^{t+1} \rangle = \frac{1}{2}\left( \|\mu - \mu^{t+1}\|^2 - \|\mu - \mu^t\|^2 \right) - \frac{1}{2}\left( \|\tilde{\mu}^t - \mu^{t+1}\|^2 - \|\tilde{\mu}^t - \mu^t\|^2 \right).

Note that \tilde{\mu}^t - \mu^{t+1} = -\rho(z^t - z^{t+1}). So

\langle \mu - \tilde{\mu}^t, \mu^t - \mu^{t+1} \rangle \ge \frac{1}{2}\left( \|\mu - \mu^{t+1}\|^2 - \|\mu - \mu^t\|^2 - \|\rho(z^t - z^{t+1})\|^2 \right).

Thus, Lemma A.2 is proved.

Proof of Theorem 4.1: Note that \langle \tilde{v}^t - v, Q(v) \rangle = \langle \tilde{v}^t - v, Q(\tilde{v}^t) \rangle. Combining Lemma A.1 with Lemma A.2, we have

F^t(\tilde{u}^t) - F^t(u) + \langle \tilde{v}^t - v, Q(v) \rangle \le F^t(\tilde{u}^t) - F^t(u) + \langle \tilde{v}^t - v, Q(\tilde{v}^t) \rangle \le \frac{\rho}{2}\left( \|z - z^t\|^2 - \|z - z^{t+1}\|^2 \right) + \frac{1}{2\rho}\left( \|\mu - \mu^t\|^2 - \|\mu - \mu^{t+1}\|^2 \right).

Summing the above inequality over t = 0, 1, ..., T, we obtain

\sum_{t=0}^{T}\left[ F^t(\tilde{u}^t) - F^t(u) \right] + \sum_{t=0}^{T} \langle \tilde{v}^t - v, Q(v) \rangle \le \frac{\rho}{2}\|z - z^0\|^2 + \frac{1}{2\rho}\|\mu - \mu^0\|^2.

Thus, Theorem 4.1 is proved.

Proof of Theorem 4.2: Note that the random variable \tilde{v}^t, where 1 ≤ t ≤ T, is a function of ξ[t − 1] = (ξ_0, ξ_1, ..., ξ_{t−1}) and is independent of (ξ_t, ξ_{t+1}, ..., ξ_T). We substitute each F^t(u^t) by F(u^t, ξ_t). Therefore

E_{\xi[T]}\left[ F(u^t, \xi_t) - F(u) + \langle v^t - v, Q(v) \rangle \right]
= E_{\xi[t-1]} E_{\xi_t}\left[ F(u^t, \xi_t) - F(u) + \langle v^t - v, Q(v) \rangle \right]
= E_{\xi[t-1]}\left[ F(u^t) - F(u) + \langle v^t - v, Q(v) \rangle \right]
= E_{\xi[T]}\left[ F(u^t) - F(u) + \langle v^t - v, Q(v) \rangle \right].

Since F is convex,

F(\bar{u}^T) = F\left( \frac{1}{T+1}\sum_{t=0}^{T} \tilde{u}^t \right) \le \frac{1}{T+1}\sum_{t=0}^{T} F(\tilde{u}^t).


Taking expectation with respect to ξ[T], we have

E_{\xi[T]}\left[ F(\bar{u}^T) - F(u) + \langle \bar{v}^T - v, Q(v) \rangle \right]
\le \frac{1}{T+1} E_{\xi[T]} \sum_{t=0}^{T}\left[ F(\tilde{u}^t) - F(u) + \langle \tilde{v}^t - v, Q(v) \rangle \right]
= \frac{1}{T+1} E_{\xi[T]} \sum_{t=0}^{T}\left[ F(\tilde{u}^t, \xi_t) - F(u) + \langle \tilde{v}^t - v, Q(v) \rangle \right]
= \frac{1}{T+1} E_{\xi[T]} \sum_{t=0}^{T}\left[ F^t(\tilde{u}^t) - F(u) + \langle \tilde{v}^t - v, Q(v) \rangle \right].

According to Theorem 4.1, Theorem 4.2 is proved.

REFERENCES

[1] A. Beck and M. Teboulle, "Mirror descent and nonlinear projected subgradient methods for convex optimization," Oper. Res. Lett., vol. 31, no. 3, pp. 167–175, 2003.
[2] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization. Belmont, MA, USA: Athena Scientific, 2003.
[3] K. P. Bennett and E. Parrado-Hernandez, "The interplay of optimization and machine learning research," J. Mach. Learn. Res., vol. 7, pp. 1265–1281, Jul. 2006.
[4] L. Bottou and O. Bousquet, "The trade-off of large scale learning," in Advances in Neural Information Processing Systems, vol. 20. Cambridge, MA, USA: MIT Press, 2008, pp. 161–168.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.
[6] J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari, "Composite objective mirror descent," in Proc. Annu. Conf. Learn. Theory, 2010, pp. 1–13.
[7] A. Chambolle and T. Pock, "A first-order primal-dual algorithm for convex problems with applications to imaging," J. Math. Imag. Vis., vol. 40, no. 1, pp. 120–145, 2011.
[8] F. Facchinei and J. S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, vol. 1. New York, NY, USA: Springer-Verlag, 2003.
[9] D. Gabay and B. Mercier, "A dual algorithm for the solution of nonlinear variational problems via finite element approximation," Comput. Math. Appl., vol. 2, no. 1, pp. 17–40, 1976.
[10] E. Hazan, A. Agarwal, and S. Kale, "Logarithmic regret algorithms for online convex optimization," Mach. Learn., vol. 69, no. 2, pp. 169–192, 2007.
[11] E. Hazan and S. Kale, "Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization," in Proc. Annu. Conf. Learn. Theory, 2011, pp. 421–436.
[12] B. S. He and X. M. Yuan, "On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method," SIAM J. Numer. Anal., vol. 50, no. 2, pp. 700–709, 2012.
[13] S. M. Kakade and A. Tewari, "On the generalization ability of online strongly convex programming algorithms," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2009.
[14] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization. New York, NY, USA: Wiley, 1983.
[15] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Math. Doklady, vol. 27, no. 2, pp. 372–376, 1983.
[16] Y. Nesterov, "Primal-dual subgradient methods for convex problems," Math. Program., vol. 120, no. 1, pp. 221–259, 2009.
[17] A. Rakhlin, K. Sridharan, and A. Tewari, "Online learning: Beyond regret," in Proc. Annu. Conf. Learn. Theory, 2010, pp. 1–50.
[18] A. Rakhlin, O. Shamir, and K. Sridharan, "Making gradient descent optimal for strongly convex stochastic optimization," in Proc. Int. Conf. Mach. Learn., 2012, pp. 1–8.
[19] S. Shalev-Shwartz, Y. Singer, and N. Srebro, "Pegasos: Primal estimated sub-gradient solver for SVM," in Proc. Int. Conf. Mach. Learn., 2007, pp. 807–814.
[20] R. Tomioka, T. Suzuki, and M. Sugiyama, "Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation," J. Mach. Learn. Res., vol. 12, pp. 1537–1586, Jan. 2011.
[21] H. Wang and A. Banerjee, "Online alternating direction method," in Proc. Int. Conf. Mach. Learn., 2012, pp. 1–8.
[22] A. Y. Yang, S. S. Sastry, A. Ganesh, and Y. Ma, "Fast l1-minimization algorithms and an application in robust face recognition: A review," in Proc. 17th IEEE ICIP, Sep. 2010, pp. 1849–1852.
[23] L. Xiao, "Dual averaging methods for regularized stochastic learning and online optimization," J. Mach. Learn. Res., vol. 11, pp. 2543–2596, Jan. 2010.
[24] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proc. Int. Conf. Mach. Learn., 2003, pp. 1–8.

Qing Tao received the M.S. degree in mathematics from Southwest University, Chongqing, China, in 1989, and the Ph.D. degree from the University of Science and Technology of China, Chengdu, China, in 1999. He was a Post-Doctoral Fellow with the University of Science and Technology of China from 1999 to 2001. From 2001 to 2003, he was a Post-Doctoral Fellow with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. From 2004 to 2012, he was a Professor with the Institute of Automation, Chinese Academy of Sciences. Currently, he is a Professor with the New Star Research Institute of Applied Technology, Hefei, China. He has authored or co-authored more than 30 international journal papers. His current research interests include applied mathematics, neural networks, SVM, boosting, and statistical learning theory.

Qian-Kun Gao is currently pursuing the Postgraduate degree with the New Star Research Institute of Applied Technology, Hefei, China. His current research interests include statistical machine learning and large-scale optimization algorithms.

De-Jun Chu is a Lecturer with the New Star Research Institute of Applied Technology, Hefei, China. His current research interests include online learning and optimization algorithms.

Gao-Wei Wu received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2003. He was a Post-Doctoral Fellow with the Institute of Computing Technology, Chinese Academy of Sciences, from 2003 to 2006. Currently, he is an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. His current research interests include big data, neural networks, SVM, boosting, and statistical learning theory.
