LETTER

Communicated by Ding-Xuan Zhou

Refined Generalization Bounds of Gradient Learning over Reproducing Kernel Hilbert Spaces

Shao-Gao Lv
[email protected]
Statistics School, Southwestern University of Finance and Economics, Chengdu, China, and Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan

Neural Computation 27, 1294–1320 (2015), doi:10.1162/NECO_a_00739

Gradient learning (GL), initially proposed by Mukherjee and Zhou (2006), has proved to be a powerful tool for conducting variable selection and dimension reduction simultaneously. The approach provides a nonparametric gradient estimator based on positive definite kernels, without estimating the true function itself, so it has wide applicability and allows for complex effects among predictors. In terms of theory, however, existing generalization bounds for GL rely on capacity-independent techniques, and the capacity of the kernel class is not characterized completely. This letter therefore considers GL estimators that minimize the empirical convex risk. We prove generalization bounds for such estimators with rates that are faster than previous results. Moreover, we provide a novel upper bound for the Rademacher chaos complexity of order two, which also plays an important role in general pairwise-type estimation problems, including ranking and scoring problems.

1 Introduction

The importance of variable selection in practice has grown in recent years as data collection technology and data storage devices have become more powerful, for example, in bioinformatics and computer vision, where the dimensionality of the available data is of the order of thousands and even tens of thousands. There is an enormous literature on the selection of predictors, and within this literature, variable selection for multivariate regression problems is the most thoroughly investigated case. Given the regression model below (see equation 2.1), a standard assumption is that the regression function lies in a low-dimensional manifold. In this case, the d-dimensional regression function is assumed to be influenced by only a few of the predictors and to be independent of the others; the nonzero components indicate the important variables. With sparsity, variable selection can improve estimation accuracy by efficiently identifying the "active" subset of the predictors and can enhance model interpretability with a parsimonious representation.


Moreover, the computational cost is reduced significantly when sparsity is very high. Approaches in this literature are mostly limited to structural assumptions on the regression function. For instance, one assumes the regression function to be linear (Breiman, 1995; Efron, Hastie, Johnstone, & Tibshirani, 2004; Fan & Li, 2001), nonparametric additive (Meier, Geer, & Buhlmann, 2009; Ravikumar, Liu, Lafferty, & Wasserman, 2009), or expressed as a sum of functions depending on couples or triples of variables (Lin & Zhang, 2006). Although these approaches offer theoretical guarantees and interpretability, their functional complexity increases exponentially in the number of original variables. Moreover, when interaction components of the regression function exist, most previous work, such as that of Lin and Zhang (2006), attempts to select the important components rather than the original variables.

In most of the predictor selection literature, the terms variable selection and model selection are used interchangeably. This usage appears reasonable since most selection approaches are based on models, which are usually required to be parametric or semiparametric; the selection of variables is therefore done after model selection. However, if the model is misspecified, variable selection can be seriously distorted. Indeed, variable selection does not always have to be part of model selection, and some significant differences between variable selection and model selection have been discussed explicitly in Cox and Snell (1974) and Li, Cook, and Nachtsheim (2005).

In essence, the importance of a variable can be reflected by the corresponding partial derivative: the larger the norm of the partial derivative with respect to a variable, the more important the corresponding variable is likely to be for prediction. The gradient of the true function also provides a natural interpretation of the geometric structure of the data (Mukherjee, Wu, & Zhou, 2010; Mukherjee & Wu, 2006). Based on this observation, the method of learning gradients over reproducing kernel Hilbert spaces (RKHSs) was proposed for variable selection in regression and binary classification in high-dimensional settings (Mukherjee & Zhou, 2006). This approach is essentially a supervised version of Hessian eigenmaps (Donoho & Grimes, 2003) in machine learning terminology and can also be viewed as a nonparametric extension of minimum average variance estimation (Xia, Tong, Li, & Zhu, 2002). Additionally, gradient learning (GL) can also be used for dimension reduction (Ye & Xie, 2012), which is significantly different from most existing variable selection approaches.

GL has been applied, extended, and further analyzed by several researchers since its initial publication. Mukherjee and Zhou (2006) proposed a general gradient-based algorithm in a supervised learning framework, which simultaneously estimates a classification function as well as its gradient. Wu, Guinney, Maggioni, and Mukherjee (2010) proposed two gradient-induced quadratic programs to integrate the information encoded by the gradient, so as to allow for inference of the geometric structure of the relevant variables.


Ying and Campbell (2008) formulated a unifying framework for coordinate gradient learning from the perspective of multitask learning, in which multitask kernels are used to flexibly model the intrinsic structure of GL. In a similar spirit, Wu et al. (2010) used GL to estimate correlations between predictors. Mukherjee et al. (2010) extended GL to the manifold setting for dimension reduction with few observations; their generalization bounds show that the convergence rate depends on the intrinsic dimension of the manifold rather than on the dimension of the ambient space. In addition, other forms of GL have been proposed in recent years, such as GL with a Lasso-type penalty (Ye & Xie, 2012), robust GL (Feng, Yang, & Suykens, 2014), and an early stopping algorithm for GL (Guo, 2010).

On the theoretical side, most of the papers mentioned derive generalization error bounds over RKHSs for the corresponding algorithms; however, their rates are based on capacity-independent techniques, which cannot reflect the variance information and the learning ability of the RKHS. It is known that many kernel-based algorithms, such as support vector machines (SVMs), often have powerful learning abilities, mainly due to the functional capacity of the RKHS. This key feature of RKHSs is not fully reflected in previous theoretical results, which may lead to an insufficient understanding of GL, especially in high-dimensional frameworks. In this letter, we aim to provide better generalization bounds than existing results for GL estimators that minimize the empirical convex risk. This in turn clarifies the role of RKHSs in general GL settings. Most of the work already mentioned belongs to regularized methods, where each regularization term reflects a certain model structure and also controls the functional complexity. Nonetheless, this letter focuses on empirical risk minimization for GL over a bounded domain, because we are primarily interested in its generalization ability in a more general framework.

Since the natural estimate of the risk takes the form of a U-statistic, results from U-process theory are required to investigate the generalization performance of empirical risk minimizers. It is worth noting that similar studies have been carried out for ranking algorithms (Clemencon, Lugosi, & Vayatis, 2008; Rejchel, 2012) and preference learning (Li, Ren, & Li, 2014), as well as entropy-based minimization (Hu, Fan, Wu, & Zhou, 2013). However, GL algorithms are local methods, which makes it impossible to conduct an error analysis parallel to those for global estimation problems, including ranking and other pairwise-type problems. Moreover, most existing generalization bounds for ranking are based on a classical chaining lemma for U-processes (Arcones & Gine, 1993), which shows that a U-process can be controlled by the entropy integral of its covering number. The upper bound on the covering number is required to decay rapidly to ensure the finiteness of the integral. This excludes some commonly used kernels, such as Sobolev spaces with low smoothness and convex hulls of base classes with a large pseudodimension.


In view of this problem, Lei and Ding (2014) introduced an adjustable parameter to control the divergence of the entropy integral. While this approach enriches the admissible function spaces, an additional parameter is involved, and whether it attains the lower bound of the U-process in general is unclear. To this end, this letter provides novel upper bounds for the U-process, and almost all kernels of interest are covered by our analysis. Such tight bounds for the U-process are also quite useful for studying other pairwise problems, including scoring problems, ranking algorithms, and preference learning (Li et al., 2014).

The letter is organized as follows. Section 2 provides an overview of the background of GL and gives some classic examples from the literature. To derive our error bounds for GL, we first apply Hoeffding's decomposition in section 3, so that the problem can be divided into two parts. Under mild conditions, we state some satisfactory generalization bounds and compare them with previous results. We devote section 4 to the U-process and provide a novel upper bound for the Rademacher chaos complexity of order two. Section 5 presents a sharper bound for the sample error in GL. Some proofs are deferred to the appendix.

2 Gradient Learning with Positive Definite Kernels

There is an enormous literature on the selection of predictors, and within this literature, variable selection for multivariate conditional mean regression models is the most thoroughly investigated topic. More precisely, let $\mathcal{X}$ be the input space and $\mathcal{Y}$ be the output space. One assumes that
$$ y = E(y|x) + \epsilon, \qquad (2.1) $$
where $\epsilon$ is the noise term with $E[\epsilon \,|\, x] = 0$, and $f_\rho(x) = E(y|x)$ is called the regression function. Suppose the function $f$ is smooth; the gradient of $f$ with respect to the predictors is defined as
$$ \nabla f := \Big(\frac{\partial f}{\partial x^1}, \ldots, \frac{\partial f}{\partial x^d}\Big). $$
The goal of GL is to estimate the components of $\nabla f_\rho$ without specifying any form of the regression function, imposing only some smoothness assumptions. Note that the first-order Taylor expansion of any smooth function $f$ is
$$ f(u) \approx f(x) + \nabla f(x)\cdot(u-x), \quad \text{for } x \approx u. $$


Define a vector-valued function $f := (f^1, \ldots, f^d)$ and $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$. When a set of observations $D = \{(x_i, y_i)\}_{i=1}^n$ is drawn from the underlying joint distribution $\rho$ on $\mathcal{Z}$, the empirical risk with the least squares loss in GL is defined as
$$ \mathcal{E}_z(f) = \frac{1}{n(n-1)}\sum_{i\neq j}\omega^{(s)}_{i,j}\Big(y_j - y_i - f\Big(\frac{x_i+x_j}{2}\Big)\cdot(x_j - x_i)\Big)^2, $$
where the weight function is defined as $\omega^{(s)}_{i,j} = w^s(x_j - x_i) = \exp\{-\frac{\|x_j - x_i\|^2}{2s^2}\}$ with bandwidth $s$. We slightly modify the original GL algorithm in Mukherjee and Zhou (2006) by replacing $f(x_i)$ with $f(\frac{x_i+x_j}{2})$. Thus, the empirical risk is symmetric with respect to $(x_i, y_i)$ and $(x_j, y_j)$, which is needed in U-statistics theory. In this letter, we are interested in the excess risk of an estimator $f_z$, which is obtained by minimizing the empirical risk, that is,
$$ f_z = \arg\min_{f\in\mathcal{F}}\{\mathcal{E}_z(f)\}, \qquad (2.2) $$
where $\mathcal{F}$ is called the hypothesis space, which we specify to be a ball of an RKHS. Recall that an RKHS $\mathcal{H}_K$ associated with a positive definite kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x := K(\cdot, x), x\in\mathcal{X}\}$ with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x, K_u\rangle_K = K(x,u)$. The reproducing property of $\mathcal{H}_K$ is
$$ \langle K_x, f\rangle_K = f(x), \quad \forall x\in\mathcal{X},\ f\in\mathcal{H}_K. $$
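To make the empirical risk concrete, the following Python sketch evaluates $\mathcal{E}_z(f)$ for a candidate gradient estimate. It is only an illustration: the function name `gl_empirical_risk`, the toy data, and the chosen bandwidth are our own assumptions, not part of the original algorithm.

    import numpy as np

    def gl_empirical_risk(grad_f, X, y, s):
        """Symmetrized GL empirical risk; grad_f maps a point in R^d to a
        candidate gradient vector in R^d (a sketch, not the paper's solver)."""
        n = X.shape[0]
        total = 0.0
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                diff = X[j] - X[i]
                w = np.exp(-np.dot(diff, diff) / (2.0 * s ** 2))  # Gaussian weight w^s
                mid = 0.5 * (X[i] + X[j])                          # symmetrized evaluation point
                resid = y[j] - y[i] - np.dot(grad_f(mid), diff)    # first-order Taylor residual
                total += w * resid ** 2
        return total / (n * (n - 1))

    # Toy usage: f(x) = x_0^2, so the true gradient is (2 x_0, 0, 0).
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(50, 3))
    y = X[:, 0] ** 2 + 0.01 * rng.standard_normal(50)
    risk = gl_empirical_risk(lambda x: np.array([2.0 * x[0], 0.0, 0.0]), X, y, s=0.5)

The double loop mirrors the U-statistic structure of $\mathcal{E}_z$; in practice the pairwise weights and differences would be vectorized.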

For notational simplicity, we use $\mathcal{H}$ instead of $\mathcal{H}_K$ in the following and denote $\mathcal{H}^d = \mathcal{H}_1\otimes\cdots\otimes\mathcal{H}_d$. Some previous examples of GL algorithms are listed as follows.

Example 1. $\ell_2$-GL is the first motivating example in the GL framework (Mukherjee & Zhou, 2006); it employs $\|\cdot\|_K^2$ as the regularizer. The regularized optimization problem of GL can be expressed as
$$ \min_{f\in\mathcal{H}^d}\Big\{\mathcal{E}_z(f) + \lambda\sum_{j=1}^d\|f^j\|_{K_j}^2\Big\}, $$
where $\lambda$ is a regularization parameter. Note that the solution of $\ell_2$-GL has a closed form.
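Although $\ell_2$-GL admits a closed-form solution, the same objective can also be minimized directly. Below is a minimal sketch in which each component is represented as a kernel expansion over the sample points, $f^j(\cdot) = \sum_i c_{ij}K(\cdot, x_i)$, and the coefficients are fitted by plain gradient descent; the Gaussian kernel, step size, and helper names are our own illustrative assumptions, not the closed form derived by Mukherjee and Zhou (2006).

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def l2_gl_fit(X, y, s=0.5, lam=1e-2, sigma=1.0, lr=1e-2, iters=500):
        """Gradient-descent sketch of l2-GL with f^j(.) = sum_i C[i, j] K(., x_i)."""
        n, d = X.shape
        diffs = X[None, :, :] - X[:, None, :]                    # x_j - x_i
        W = np.exp(-(diffs ** 2).sum(-1) / (2.0 * s ** 2))       # Gaussian weights
        np.fill_diagonal(W, 0.0)
        mids = (0.5 * (X[None, :, :] + X[:, None, :])).reshape(-1, d)
        Kmid = gaussian_kernel(mids, X, sigma).reshape(n, n, n)  # K((x_i+x_j)/2, x_l)
        Kxx = gaussian_kernel(X, X, sigma)
        dy = y[None, :] - y[:, None]                             # y_j - y_i
        C = np.zeros((n, d))
        for _ in range(iters):
            F = np.einsum('ijl,lk->ijk', Kmid, C)                # f((x_i+x_j)/2)
            resid = dy - (F * diffs).sum(-1)
            # gradient of the weighted squared residuals plus the ridge penalty
            G = -2.0 * np.einsum('ij,ij,ijk,ijl->lk', W, resid, diffs, Kmid) / (n * (n - 1))
            G += 2.0 * lam * Kxx @ C
            C -= lr * G                                          # step size chosen ad hoc
        return C

Note the $O(n^3)$ memory of the midpoint kernel tensor; the sketch is intended only for small $n$.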


Example 2. $\ell_1$-GL, proposed in Ye and Xie (2012) and motivated by Lasso-type approaches, generates sparse solutions so as to select variables efficiently. It employs $\sum_{j=1}^d\|f^j\|_{K_j}$ as the regularizer, that is,
$$ \min_{f\in\mathcal{H}^d}\Big\{\mathcal{E}_z(f) + \lambda\sum_{j=1}^d\|f^j\|_{K_j}\Big\}. $$

Example 3. An extended GL program is proposed in Ying and Campbell (2008) from the perspective of multitask learning. A vector-valued RKHS is used to simulate the target function and its gradient $\vec f = (\vec f^1, \ldots, \vec f^d)$ simultaneously:
$$ \min_{(f, \vec f)\in\mathcal{H}\times\mathcal{H}^d}\Big\{\frac{1}{n^2}\sum_{i,j}\omega^{(s)}_{i,j}\big(y_j - f(x_i) - \vec f(x_i)\cdot(x_j - x_i)\big)^2 + \lambda_1\sum_{j=1}^d\|\vec f^j\|_{K_j}^2 + \lambda_2\|f\|_K^2\Big\}. $$

To measure the total error of any GL estimator $f$, the expected error is given as
$$ \mathcal{E}(f) = \int_{\mathcal{Z}}\int_{\mathcal{Z}}\omega^{(s)}(u-x)\Big(y - v - f\Big(\frac{x+u}{2}\Big)\cdot(u-x)\Big)^2 d\rho(x,v)\,d\rho(u,y). $$
Under the statistical learning framework, a central concern is to measure the discrepancy between empirical estimators and the target functions or ideal rules. To this end, we restrict the hypothesis space $\mathcal{F}$ to the ball of $\mathcal{H}^d$ with radius $R$, denoted by $\mathcal{H}_R^d$, where
$$ \mathcal{H}_R^d = \{f\in\mathcal{H}^d:\ \|f^j\|_{K_j}\le R,\ j = 1, 2, \ldots, d\}. $$
In our analysis, we assume that $\nabla f_\rho$ is a minimizer of the expected error $\mathcal{E}(f)$ over $\mathcal{H}_R^d$ as $s$ goes to zero, that is, for any $x\in\mathcal{X}$,
$$ \nabla f_\rho(x) = \inf_{s\in(0,\infty)} f_{\rho,s}(x), \quad\text{where}\quad f_{\rho,s} = \arg\min_{f\in\mathcal{H}_R^d}\{\mathcal{E}(f)\}. $$


This assumption is mild and has been verified by proposition 15 in Mukherjee and Zhou (2006), which also shows that
$$ \mathcal{E}(f) = \int_{\mathcal{X}}\int_{\mathcal{X}}\omega^s(u-x)\Big(f_\rho(u) - f_\rho(x) - f\Big(\frac{x+u}{2}\Big)\cdot(u-x)\Big)^2 d\rho_X(x)\,d\rho_X(u) + 2\sigma_s^2, $$
where $\sigma_s^2 = \int_{\mathcal{Z}}\int_{\mathcal{X}}\omega^s(u-x)(v - f_\rho(x))^2\, d\rho(x,v)\,d\rho_X(u)$, which is referred to as the intrinsic error.

3 Generalization Bounds

Generalization performance is the main concern of theoretical research in machine learning. Its overall goal is to characterize the risk that an algorithm incurs in a given situation. In our case, this is the risk of $f_z$, quantified by $\mathcal{E}(f_z)$. However, $\mathcal{E}(f_z)$ is a random variable (since it depends on the data), and it cannot be computed from the data (since it also depends on the underlying distribution $\rho$). Estimates of $\mathcal{E}(f_z)$ thus usually take the form of probabilistic bounds.

To obtain refined generalization bounds for GL minimizers, the critical tool that we use is Hoeffding's decomposition (Pena & Gine, 1999) of the U-statistic $\mathcal{E}_z(f) - \mathcal{E}_z(\nabla f_\rho)$. For any $f\in\mathcal{H}_R^d$, we define
$$ \phi_f(Z_1, Z_2) = \omega^{(s)}(x_2 - x_1)\Big(y_2 - y_1 - f\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2 - x_1)\Big)^2 $$
with $Z_1 = (x_1, y_1)$ and $Z_2 = (x_2, y_2)$. Clearly, $\phi_f(Z_1, Z_2)$ is symmetric, nonnegative, and, by assumption 1 below, uniformly bounded on $\mathcal{Z}\times\mathcal{Z}$. We also set
$$ P_n(g) = \frac1n\sum_{i=1}^n g(Z_i), \qquad P(g) = E(g), \qquad Pf(z_2) = E[\phi_f(Z_1, Z_2)\,|\,Z_2 = z_2], $$
$$ U_n(h_f - h_{\nabla f_\rho}) = \frac{1}{n(n-1)}\sum_{i\neq j}\big[h_f(Z_i, Z_j) - h_{\nabla f_\rho}(Z_i, Z_j)\big], $$
$$ h_f(z_1, z_2) = \phi_f(z_1, z_2) - Pf(z_1) - Pf(z_2) + \mathcal{E}(f). $$


Let us slightly modify Hoeffding's decomposition of the U-statistic $\mathcal{E}_z(f) - \mathcal{E}_z(\nabla f_\rho)$, that is,
$$ \mathcal{E}(f) - \mathcal{E}(\nabla f_\rho) - [\mathcal{E}_z(f) - \mathcal{E}_z(\nabla f_\rho)] = 2P_n\big\{[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)] - [Pf - P\nabla f_\rho]\big\} - U_n(h_f - h_{\nabla f_\rho}) := I_1 + I_2, \qquad (3.1) $$
where
$$ I_1 = 2P_n\big\{[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)] - [Pf - P\nabla f_\rho]\big\} \quad\text{and}\quad I_2 = -U_n(h_f - h_{\nabla f_\rho}). $$
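Before proceeding, it may help to note that this decomposition is an exact algebraic identity. The following sketch checks it numerically for a toy symmetric kernel $\phi(z_1, z_2) = (z_1 - z_2)^2$ with standard gaussian data, for which the projection $P\phi(z) = z^2 + 1$ and the mean $E\phi = 2$ are available in closed form; the kernel and distribution are our own choices, made only so that these quantities are explicit.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    z = rng.standard_normal(n)

    phi = lambda a, b: (a - b) ** 2          # toy symmetric kernel
    P_phi = lambda a: a ** 2 + 1.0           # E[phi(a, Z)] for Z ~ N(0, 1)
    E_phi = 2.0                              # E[phi(Z1, Z2)]

    # U-statistic U_n(phi) and its degenerate (Hoeffding-projected) kernel h
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    U_phi = np.mean([phi(z[i], z[j]) for i, j in pairs])
    h = lambda a, b: phi(a, b) - P_phi(a) - P_phi(b) + E_phi
    U_h = np.mean([h(z[i], z[j]) for i, j in pairs])

    lhs = U_phi - E_phi                                   # centered U-statistic
    rhs = 2.0 * np.mean(P_phi(z) - E_phi) + U_h           # i.i.d. part + degenerate part
    print(lhs, rhs, abs(lhs - rhs) < 1e-10)               # identical up to rounding

The i.i.d. average is the part handled by the empirical process tools of section 5, and the degenerate U-statistic is the part controlled by the Rademacher chaos bound of section 4.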

Hoeffding's decomposition thus breaks the difference between the U-statistic $\mathcal{E}_z(f) - \mathcal{E}_z(\nabla f_\rho)$ and its expectation into a sum of independent and identically distributed (i.i.d.) random variables and a degenerate U-statistic $I_2$. As will become clear in the next sections, concentration inequalities from empirical process theory form the basis for our treatment of $I_1$ in Hoeffding's decomposition of the U-statistic. To bound $I_2$, one also needs to take into account the capacity of the hypothesis space. In this letter, we characterize the capacity of $\mathcal{H}$ by the so-called Rademacher chaos complexity of order two.

To derive generalization bounds for the GL minimizer 2.2, we need the following conditions.

Assumption 1. The noise term $\epsilon = y - f_\rho(x)$ is bounded by some constant $M$, that is, $|\epsilon|\le M$ almost surely.

The boundedness of the noise can be relaxed to unbounded settings, as in Steinwart and Christmann (2008), if we consider gaussian noise, but we do not pursue that direction for simplicity. The other key assumptions in the analysis are regularity conditions on the marginal distribution and its density, together with smoothness of the regression function:

Assumption 2. Assume that for some constants $c_\rho > 0$ and $0 < \theta\le 1$, the marginal distribution $\rho_X$ satisfies
$$ \rho_X(\{x\in\mathcal{X}: d(x,\partial\mathcal{X}) < t\})\le c_\rho t, \qquad (3.2) $$
and the density $p(x)$ of $\rho_X$ satisfies
$$ \sup_{x\in\mathcal{X}} p(x)\le c_\rho \quad\text{and}\quad |p(x) - p(u)|\le c_\rho\|x-u\|^\theta, \quad \forall\, u, x\in\mathcal{X}. \qquad (3.3) $$


Assumption 3. Suppose that $\nabla f_\rho\in\mathcal{H}_R^d$ and satisfies
$$ \Big|f_\rho(u) - f_\rho(x) - \nabla f_\rho\Big(\frac{x+u}{2}\Big)\cdot(u-x)\Big|\le c_\rho\|u-x\|^2, \quad \forall\, u, x\in\mathcal{X}. \qquad (3.4) $$

We now introduce an operator $L_K: L^2(\rho_X)\to L^2(\rho_X)$ associated with a kernel $K$ as
$$ (L_K f)(x) := \int_{\mathcal{X}} K(u,x) f(u)\, d\rho_X(u). $$
It is known that this operator is compact, self-adjoint, and positive semidefinite. Thus, it has at most countably many nonnegative eigenvalues. We denote by $\mu_k$ the $k$th largest eigenvalue of $L_K$. By theorem 4.27 of Steinwart and Christmann (2008), the sum of the $\mu_k$ is finite, that is, $\sum_k\mu_k < \infty$. Thus, $\mu_k$ decreases at rate $\mu_k = o(k^{-1})$, which naturally leads to assumption 4. Throughout this letter, we focus only on the special case where the RKHS $\mathcal{H}_j$ is identical across all components of the vector-valued functions. Indeed, our proofs extend to general settings with only notational changes.

Assumption 4. There exist $\alpha > 1/2$ and $c > 0$ such that
$$ \mu_k\le c\,k^{-2\alpha}, \quad \forall\, k\ge 1. \qquad (3.5) $$

It has been shown that the spectral assumption, equation 3.5, is equivalent to a classical covering number assumption (Steinwart, Hush, & Scovel, 2009). Recall that the $\epsilon$-covering number $N(\epsilon, B_{\mathcal{H}}, L^2(\mathcal{X}))$ with respect to $L^2(\mathcal{X})$ is the minimal number of balls of radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}}$ of $\mathcal{H}$. In particular, there exists a constant $c$ such that
$$ \log N(\epsilon, B_{\mathcal{H}}, L^2(\mathcal{X}))\le c\,\epsilon^{-1/\alpha}, \qquad (3.6) $$
and the converse is also true. Thus, if $\alpha$ is large, the RKHS is regarded as "simple." For the Sobolev class $W^{p,2}$, consisting of $p$-times continuously differentiable functions on the Euclidean ball of $\mathbb{R}^d$, equation 3.6 holds with $\alpha = p/d$.

Remark. The constants $c$, $c_1$, and $c_2$ that appear in this letter may differ from line to line. We do not attempt to optimize them.
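As a rough empirical illustration of assumption 4 (our own example, not taken from the letter), the eigenvalues of the normalized Gram matrix $\frac1n\big(K(x_i, x_j)\big)_{i,j}$ approximate the leading eigenvalues of $L_K$. For the kernel $K(x,u) = \min(x,u)$ on $[0,1]$ under the uniform distribution, the operator eigenvalues are known to be $\big(\frac{2}{(2k-1)\pi}\big)^2$, so the decay exponent is $2\alpha = 2$:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    x = rng.uniform(0.0, 1.0, size=n)

    # Sobolev-type kernel K(x, u) = min(x, u) on [0, 1]; its integral operator under the
    # uniform distribution has eigenvalues (2 / ((2k - 1) pi))^2, i.e., decay k^{-2} (alpha = 1).
    K = np.minimum(x[:, None], x[None, :])
    mu_hat = np.linalg.eigvalsh(K / n)[::-1]          # empirical eigenvalues, nonincreasing

    k = np.arange(1, 21)
    mu_true = (2.0 / ((2 * k - 1) * np.pi)) ** 2
    slope = np.polyfit(np.log(k), np.log(mu_hat[:20]), 1)[0]
    print(slope)                  # close to -2, consistent with mu_k <= c k^{-2 alpha}, alpha = 1
    print(mu_hat[:5], mu_true[:5])

A log-log regression of the leading empirical eigenvalues against $k$ returns a slope close to $-2$, consistent with equation 3.5 for $\alpha = 1$.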


We now state our first main result, which provides generalization bounds on the estimation error achieved by the GL estimator 2.2.

Theorem 1. Let $f_z$ be any minimizer of the empirical risk, equation 2.2, over $\mathcal{H}_R^d$, and suppose that assumptions 1, 3, and 4 hold simultaneously. For every $\delta\in(0,1)$, with probability at least $1-\delta$, there holds
$$ \mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho)\le\widetilde{C}\Big(\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}} + \frac{\log(2/\delta)}{n}\Big), $$
where $\widetilde{C}$ is some constant independent of $n$ and $\delta$. In particular, with the choice $s = \big(\frac1n\big)^{1/(d+4)}$, we have
$$ \mathcal{E}(f_z)/s^2\le\widetilde{C}\log(2/\delta)\Big(\frac1n\Big)^{\frac{d+2}{d+4}}. $$
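For readers who want to trace the second statement, here is the arithmetic behind the choice $s = n^{-1/(d+4)}$, written out as a sketch; the condition $\alpha d\ge 1$ below is our reading of when the capacity term is dominated and is not stated explicitly in the letter. Since $|\mathcal{E}(\nabla f_\rho)|\le c_2 s^{d+4}$ (see the appendix),
$$ \mathcal{E}(f_z)\le\big[\mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho)\big] + \mathcal{E}(\nabla f_\rho)\le\widetilde{C}\Big[\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}} + \frac{\log(2/\delta)}{n}\Big] + c_2 s^{d+4}. $$
Dividing by $s^2$ with $s = n^{-1/(d+4)}$ gives $c_2 s^{d+2} = c_2 n^{-\frac{d+2}{d+4}}$ and $\frac{\log(2/\delta)}{n s^2} = \log(2/\delta)\, n^{-\frac{d+2}{d+4}}$, while the capacity term is of no larger order whenever $\alpha d\ge 1$; this recovers the displayed rate.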

Generalization bounds based on McDiarmid's inequality were first pursued by Ye and Xie (2012) for a sparse GL algorithm. Theorem 4 there states the rate
$$ \mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho) = O\Big(\Big(\frac1n\Big)^{1/2} + s^{d+4}\Big), $$
which implies that $(\frac1n)^{1/2}$ is the best rate available there. From theorem 1, our worst rate is of order $(\frac1n)^{2\alpha/(2\alpha+1)}$, which is much sharper than that of Ye and Xie (2012) since $\alpha > 1/2$; in the extreme, the exponent tends to 1 as $\alpha$ goes to infinity. Note that no scaling is applied to the weight function, so the second result in theorem 1 is the more objective one and is regarded as an efficient measure of the GL estimator's quality. In this case, the bias induced by the local linear expansion must be taken into account, which requires an appropriate choice of the bandwidth $s$ to derive our final result. In order to derive refined learning rates in a stronger norm, assumption 2 is needed to characterize regularity conditions on the marginal distribution and its density:

Corollary 1. Let $f_z$ be any minimizer of the empirical risk, equation 2.2, over $\mathcal{H}_R^d$, and suppose that assumptions 1 to 4 hold simultaneously. Let $s = (\frac1n)^{1/(2\theta+d+2)}$. For every $\delta\in(0,1)$, with probability at least $1-\delta$, there holds
$$ \|f_z - \nabla f_\rho\|_{L^2_{\rho_X}}\le\widetilde{C}\max\Big\{\Big(\frac{\log(2/\delta)}{n}\Big)^{\frac{\theta}{2\theta+d+2}},\ \Big(\frac1n\Big)^{\frac{(2\alpha-1)\theta+2\alpha d}{(1+2\alpha)(2\theta+d+2)}}\Big\}, $$
where $\widetilde{C}$ is some constant independent of $n$ and $\delta$.


Note that without assumption 2, a looser learning rate can still be obtained from lemma 3 in the appendix. It is seen from the conclusion of theorem 1 that the capacity parameter $\alpha$ has a positive effect on the learning ability. We now compare our rate in corollary 1 with several existing results. Recall that the convergence rate derived in Ye and Xie (2012) is
$$ \|f_z - \nabla f_\rho\|_{L^2_{\rho_X}} = O\Big(\Big(\frac1n\Big)^{\frac{\theta}{4(d+2+2\theta)}}\Big). $$
Under the same conditions as in this letter, another similar result, given in Mukherjee and Zhou (2006), is
$$ \|f_z - \nabla f_\rho\|_{L^2_{\rho_X}} = O\Big(\Big(\frac1n\Big)^{\frac{\theta}{2(d+2+3\theta)}}\Big). $$
It is easily seen that our rate is much sharper than both of these previous convergence rates.
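As an illustrative comparison (the values $d = 10$ and $\theta = 1$ are our own, chosen only to make the exponents concrete), the exponent in corollary 1 when the first term dominates is $\theta/(2\theta+d+2) = 1/14\approx 0.071$, whereas the rate of Ye and Xie (2012) has exponent $\theta/(4(d+2+2\theta)) = 1/56\approx 0.018$ and that of Mukherjee and Zhou (2006) has exponent $\theta/(2(d+2+3\theta)) = 1/30\approx 0.033$. The refined bound thus improves the exponent by a factor of roughly 2 to 4 in this regime.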

4 Estimate of the Rademacher Chaos of Order Two

The estimate of $I_2$ is based on degenerate U-processes. The degeneracy of a U-statistic means that the conditional expectation of its kernel is the zero function, that is, $E[h_f(z_1, Z_2) - h_{\nabla f_\rho}(z_1, Z_2)] = 0$ for each $z_1\in\mathcal{X}\times\mathbb{R}$.

Let us recall the object under consideration, which is of the form
$$ \Big\{U_n(h_f - h_{\nabla f_\rho}) = \frac{1}{n(n-1)}\sum_{i\neq j}\big[h_f(Z_i, Z_j) - h_{\nabla f_\rho}(Z_i, Z_j)\big]:\ f\in\mathcal{H}_R^d\Big\}, $$
where $h_f(z_1, z_2) = \phi_f(z_1, z_2) - Pf(z_1) - Pf(z_2) + \mathcal{E}(f)$. The application of U-processes to the generalization analysis of ranking and scoring problems is well developed in Clemencon et al. (2008).

Definition 1. Let $\mathcal{F}$ be a class of functions on $\mathcal{Z}\times\mathcal{Z}$, and let $\{z_i, i\in\mathbb{N}_n\}$ be independent random variables distributed according to a distribution $\rho$ on $\mathcal{Z}$. The homogeneous Rademacher chaos process of order two, with respect to the Rademacher variables $\sigma$, is the random variable system defined by
$$ \hat U_f(\sigma) = \frac1n\sum_{i<j}\sigma_i\sigma_j f(z_i, z_j), \quad f\in\mathcal{F}. $$


Denote the expectation of its supremum over functions with normalized empirical variance at most $r$ by
$$ \hat U_n(\mathcal{F}; r) = E_\sigma\Big[\sup_{f\in\mathcal{F};\,P_{n(n-1)}(f^2)\le r}|\hat U_f(\sigma)|\Big], \qquad (4.1) $$
where $P_{n(n-1)}$ is the empirical measure with respect to the composite samples $\{(z_i, z_j)\}_{i<j,\ i,j = 1,\ldots,n}$.

We notice that the bounds for Rademacher chaos complexities established in Ying and Campbell (2009) and Arcones and Gine (1993) are based on the standard entropy integral. More precisely, the quantitative relationship between Rademacher chaos complexities and covering numbers can be formulated, as in Rejchel (2012), by the entropy integral
$$ \hat U_n(\mathcal{F};\infty)\le C_1\int_0^{1/4}\ln N(t,\mathcal{F},\omega)\,dt, $$
where $N(t,\mathcal{F},\omega)$ is the covering number with respect to the empirical norm
$$ \omega(f_1, f_2) = \frac1R\sqrt{\frac{1}{n(n-1)}\sum_{i\neq j}\big[f_1(Z_i, Z_j) - f_2(Z_i, Z_j)\big]^2}. $$
These bounds may not be useful when the entropy integral diverges, which occurs naturally if the logarithm of the covering number grows at a polynomial rate with exponent not less than 1. (We refer interested readers to Lei & Ding, 2014, and Van der Vaart & Wellner, 1996, where several common spaces with exponent greater than 1 are listed.) This letter handles this issue by presenting a novel bound on Rademacher chaos complexities. Compared with previous results, our bounds have a clear advantage in that most of the widely used kernels are covered by our setting, as noted in assumption 4.

To this end, we denote $\tilde{\mathcal{Z}} = \mathcal{Z}\times\mathcal{Z}$ and $z_{ij} = (z_i, z_j)\in\tilde{\mathcal{Z}}$ ($i<j$, $i,j = 1,\ldots,n$). Let $\tilde{\mathcal{H}}$ be an RKHS defined on $\tilde{\mathcal{Z}}$ associated with a kernel $\tilde K$. We introduce the normalized Gram matrix $T_n$, defined as $T_n = \frac{1}{n(n-1)}\big(\tilde K(z_{i_1 j_1}, z_{i_2 j_2})\big)_{i_l<j_l,\ l = 1,2}$. Obviously $T_n$ is an $n(n-1)/2\times n(n-1)/2$ positive semidefinite matrix. Let $\hat\mu_i$ be its eigenvalues, arranged in nonincreasing order. In addition, denote by $\mu_i$ the eigenvalue sequence of the integral operator $L_{\tilde K}$, defined as in section 3. It is known from Smale and Zhou (2009) that $\hat\mu_i$ converges to $\mu_i$ uniformly in probability as $n$ goes to infinity, a result based on perturbation theory and a concentration inequality for martingale difference sequences in Hilbert spaces.


Proposition 1. For any $r > 0$, conditional on $z_1, \ldots, z_n$, we have
$$ \hat U_n(\tilde{\mathcal{H}}_R; r)\le\Big(\sum_{k=1}^{n(n-1)/2}\min\{r, R^2\hat\mu_k\}\Big)^{1/2}. $$
In particular, there holds
$$ E\big[\hat U_n(\tilde{\mathcal{H}}_R;\infty)\big]\le R\Big(\sum_{k=1}^\infty\mu_k\Big)^{1/2}. $$
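Proposition 1 (with $r = \infty$) can be checked numerically, since for an RKHS ball the supremum in definition 1 has the closed form $\sup_{\|f\|_{\tilde K}\le R}|\hat U_f(\sigma)| = \frac{R}{n}\big(\sum_{p,q}\sigma_p\sigma_q\tilde K(z_p, z_q)\big)^{1/2}$, where $p = (i,j)$ ranges over pairs and $\sigma_p = \sigma_i\sigma_j$. The Gaussian kernel on concatenated pairs used below is our own illustrative choice of $\tilde K$; the letter does not fix a particular kernel on $\mathcal{Z}\times\mathcal{Z}$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, R, sigma_k, n_mc = 12, 3, 1.0, 1.0, 2000

    Z = rng.uniform(-1.0, 1.0, size=(n, d))        # treat each z_i as a point in R^d
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    P = np.array([np.concatenate([Z[i], Z[j]]) for i, j in pairs])
    sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / (2.0 * sigma_k ** 2))         # Gram matrix of the pair kernel
    T_n = G / (n * (n - 1))                        # normalized Gram matrix T_n
    mu_hat = np.linalg.eigvalsh(T_n)[::-1]         # empirical eigenvalues, nonincreasing

    # Monte Carlo estimate of the chaos complexity with r = infinity
    vals = []
    for _ in range(n_mc):
        sig = rng.choice([-1.0, 1.0], size=n)
        sgn = np.array([sig[i] * sig[j] for i, j in pairs])
        vals.append(R / n * np.sqrt(sgn @ G @ sgn))  # closed-form supremum over the ball
    chaos_mc = np.mean(vals)
    bound = R * np.sqrt(mu_hat.clip(min=0).sum())    # proposition 1 with r = infinity
    print(chaos_mc, bound, chaos_mc <= bound)

The Monte Carlo average stays below $R\big(\sum_k\hat\mu_k\big)^{1/2}$, as the proposition predicts.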

As mentioned in section 3, the quantity $\sum_k\mu_k < \infty$ holds under quite weak conditions. In fact, we have $\sum_k\mu_k\le\|\tilde K\|_\infty$ quite generally (see the details in Cucker & Smale, 2001).

Proof. We introduce the operator $L_n$ on $\tilde{\mathcal{H}}$ defined as
$$ (L_n f)(z) = \frac{2}{n(n-1)}\sum_{i<j} f(z_{ij})\tilde K(z_{ij}, z). \qquad (4.2) $$
It is also seen from Smale and Zhou (2009) that the nonzero eigenvalues of $L_n$ are the same as those of $T_n$, and the remaining eigenvalues of $L_n$ are zero. Also let $(\phi_i)_{i\ge1}$ be an orthonormal basis of $\tilde{\mathcal{H}}$ consisting of eigenfunctions of $L_n$ associated with the $\hat\mu_i$. Fix $N\in[0, n(n-1)/2]$ and notice that for any $f\in\tilde{\mathcal{H}}$, we have
$$ \sum_{i<j}\sigma_i\sigma_j f(z_{ij}) = \Big\langle f, \sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot)\Big\rangle = \sum_{k=1}^N\sqrt{\hat\mu_k}\,\langle f,\phi_k\rangle\,\frac{1}{\sqrt{\hat\mu_k}}\Big\langle\sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot),\phi_k\Big\rangle + \Big\langle f, \sum_{k>N}\Big\langle\sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot),\phi_k\Big\rangle\phi_k\Big\rangle. $$
Suppose now that $\|f\|_{\tilde K}\le R$ and
$$ r\ge P_{n(n-1)/2}(f^2) = \langle f, L_n f\rangle = \sum_{k\ge1}\hat\mu_k\langle f,\phi_k\rangle^2. $$


Then, by the Cauchy-Schwarz inequality,
$$ \Big|\sum_{i<j}\sigma_i\sigma_j f(z_{ij})\Big|\le\sqrt{r}\,\Big(\sum_{k=1}^N\frac{1}{\hat\mu_k}\Big\langle\sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot),\phi_k\Big\rangle^2\Big)^{1/2} + R\Big(\sum_{k>N}\Big\langle\sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot),\phi_k\Big\rangle^2\Big)^{1/2}. \qquad (4.3) $$
Moreover, for any given $k$, simple algebra yields
$$ E_\sigma\Big[\Big\langle\sum_{i<j}\sigma_i\sigma_j\tilde K(z_{ij},\cdot),\phi_k\Big\rangle^2\Big] = E_\sigma\sum_{i_1<j_1}\sum_{i_2<j_2}\sigma_{i_1}\sigma_{j_1}\sigma_{i_2}\sigma_{j_2}\langle\tilde K(z_{i_1j_1},\cdot),\phi_k\rangle\langle\tilde K(z_{i_2j_2},\cdot),\phi_k\rangle = \sum_{i<j}\langle\tilde K(z_{ij},\cdot),\phi_k\rangle^2 = \sum_{i<j}\phi_k(z_{ij})^2 = \frac{n(n-1)}{2}\langle\phi_k, L_n\phi_k\rangle = \frac{n(n-1)}{2}\hat\mu_k, $$
where the second equality follows from the independence of the $\sigma_i$, the third equality is based on the reproducing property of the RKHS, and the fourth equality follows from the definition of $L_n$ and the reproducing property again. Using equation 4.3 and Jensen's inequality, it follows that
$$ \hat U_n(\tilde{\mathcal{H}}_R; r)\le\min_{0\le N\le n(n-1)/2}\Big\{\sqrt{rN} + R\Big(\sum_{k>N}\hat\mu_k\Big)^{1/2}\Big\}. $$
This implies our first desired result immediately. By combining lemmas 1 and 2 of Guo and Zhou (2012), we have that $\hat\mu_k\to\mu_k$ uniformly in $k$ with probability 1. Letting $r = \infty$, we derive the last result easily from Jensen's inequality.

Proposition 2. If the loss function $l$ is $L_*$-Lipschitz, we have
$$ \hat U_n(l\circ\mathcal{F}; r)\le L_*\,\hat U_n(\mathcal{F}; r). $$


Proposition 2 can be shown using the following lemma:

Lemma 1. Let $g_{i,j}(\theta)$ and $f_{i,j}(\theta)$ be sets of functions such that for all $i, j, \theta, \theta'$, $|g_{i,j}(\theta) - g_{i,j}(\theta')|\le|f_{i,j}(\theta) - f_{i,j}(\theta')|$. Then for any function $c(x,\theta)$ and any distribution over $X$,
$$ E_\sigma E_x\sup_\theta\Big[c(x,\theta) + \sum_{i<j}\sigma_i\sigma_j g_{i,j}(\theta)\Big]\le E_\sigma E_x\sup_\theta\Big[c(x,\theta) + \sum_{i<j}\sigma_i\sigma_j f_{i,j}(\theta)\Big]. $$

Proof. We prove the lemma by induction. First, recall the following useful fact about Rademacher variables: for any function $f$,
$$ E_\sigma\Big[\sup_\theta\sigma f(\theta)\Big] = \sup_{\theta_1,\theta_2}\frac{f(\theta_1) - f(\theta_2)}{2}. \qquad (4.4) $$

We note that it is trivial for n = 0. Now suppose that the lemma holds for n = k, when n = k + 1: ) Eσ

1

...σk+1 Ex

sup c(x, θ ) + θ

1

...σk

Ex sup

k +

i=1

θ1 ,θ2

* σi σ j gi, j (θ )

i< j

) = Eσ

n=k+1 

gi, j (θ1 ) + gi, j (θ2 ) c(x, θ1 ) + c(x, θ2 )  + σi σ j 2 2 n=k

i< j

σi [gi,k+1 (θ1 ) − gi,k+1 (θ2 )]

*

2 )

,

 1 c(x, θl ) + σi σ j = Eσ ...σ Ex sup 1 k θ1 −θ4 4 k +

4

n=k

l=1

i< j

i=1 [gi,k+1 (θ1 )

4

− gi,k+1 (θ2 )]

 4 1 gi, j (θl ) 4 l=1

k +

i=1 [gi,k+1 (θ3 )

4

− gi,k+1 (θ4 )]

* ,

Refined Generalization

1309

)

 1 ≤ Eσ ...σ Ex sup c(x, θl ) + σi σ j 1 k θ1 −θ4 4 k i=1

+

4

n=k

l=1

i< j

| fi,k+1 (θ1 ) − fi,k+1 (θ2 )| 4 )

i=1

+

i=1

+

4

n=k

l=1

i< j

fi,k+1 (θ1 ) − fi,k+1 (θ2 ) 4 )

= Eσ

1

...σk

Ex sup

k i=1

+

θ1 −θ2

i=1

+

,

 4 1 gi, j (θl ) 4 l=1

fi,k+1 (θ3 ) − fi,k+1 (θ4 )

*

4

,

n=k

i< j

σi [ fi,k+1 (θ1 ) − fi,k+1 (θ2 )]

...σk+1 Ex

*

gi, j (θ1 ) + gi, j (θ2 ) c(x, θ1 ) + c(x, θ2 )  σi σ j + 2 2

) 1

| fi,k+1 (θ3 ) − fi,k+1 (θ4 )| 4

k

2

= Eσ

 4 1 gi, j (θl ) 4 l=1

k

 1 c(x, θl ) + σi σ j = Eσ ...σ Ex sup 1 k θ1 −θ4 4 k

sup c(x, θ ) + θ

n=k 

* ,

σi σ j gi, j (θ ) + σk+1

i< j

k 

* σi [ fi,k+1 (θ )] ,

i=1

where all the equalities except the third follow from equation 4.4, the inequality is based on the Lipschitz condition, and the third equality follows from the symmetrization argument and the definition of the supremum. The desired result now follows by induction. To prove proposition 2, it suffices to take $c(x,\theta) = 0$, $g_{i,j}(\theta) = l(f(z_i, z_j))$, $f_{i,j}(\theta) = L_* f(z_i, z_j)$, and the uniform distribution.

We are now ready to present the second main result of this letter. Theorem 2 shows that $I_2$ can be bounded by the quantity $\frac1n$ up to a constant. Indeed, our result is motivated by recent work on ordinary Rademacher complexities in Bartlett, Bousquet, and Mendelson (2005).

Theorem 2. Suppose that $\mathcal{H}$ satisfies equation 3.5 with $\alpha > 1/2$, and let $D = c_1(2M+R)R\sum_{j=1}^d\big(\sum_{k=1}^\infty\mu_{k,j}\big)^{1/2}$. Then for every $\delta\in(0,1)$, with probability at least $1-\delta$,
$$ |U_n(h_f - h_{\nabla f_\rho})|\le(1+D)^2\,\frac{\log(1/\delta)}{n}. $$


Proof. By Markov's inequality, for any $\lambda > 0$, we get
$$ P\Big(\sup_{f\in\mathcal{H}_R^d}|U_n(h_f - h_{\nabla f_\rho})|\ge\varepsilon\Big)\le\exp\big(-\lambda\sqrt{(n-1)\varepsilon}\big)\,E\exp\Big(\lambda\sup_{f\in\mathcal{H}_R^d}\sqrt{(n-1)\,|U_n(h_f - h_{\nabla f_\rho})|}\Big). \qquad (4.5) $$
Using symmetrization for the U-process (Pena & Gine, 1999), we can bound the expectation on the right side of equation 4.5 by
$$ E\exp\Big(\lambda\sup_{f\in\mathcal{H}_R^d}\sqrt{(n-1)\,|U_n(h_f - h_{\nabla f_\rho})|}\Big)\le c_2\,E\exp\Big(c_1\lambda\sup_{f\in\mathcal{H}_R^d}\sqrt{|\hat U_f(h_f - h_{\nabla f_\rho})|}\Big), \qquad (4.6) $$
where $c_1, c_2$ are constants that may differ from line to line and $\hat U_f$ denotes the Rademacher chaos version of the U-statistic. Furthermore, by Arcones and Gine (1994) and Hölder's inequality, we bound equation 4.6 by
$$ c_2\,E\exp\Big(c_1\lambda^2\,E_\sigma\sup_{f\in\mathcal{H}_R^d}|\hat U_f(h_f - h_{\nabla f_\rho})|\Big). \qquad (4.7) $$
By the subadditivity of the Rademacher chaos (see proposition 12 of Ying & Campbell, 2009) and the symmetrization of $\phi$, we have
$$ \hat U_n(h_{\mathcal{H}_R^d} - h_{\nabla f_\rho})\le\hat U_n(\phi_{\mathcal{H}_R^d} - \phi_{\nabla f_\rho}). \qquad (4.8) $$
Moreover, it is easy to calculate that
$$ |\phi_f(z_1, z_2) - \phi_{\nabla f_\rho}(z_1, z_2)|\le(2M+R)\Big|(f - \nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2 - x_1)\Big|. $$


Applying proposition 2 to $\phi_{\mathcal{H}_R^d} - \phi_{\nabla f_\rho}$, we obtain from equation 4.8 that
$$ \hat U_n(h_{\mathcal{H}_R^d} - h_{\nabla f_\rho})\le(2M+R)\,\hat U_n\big((\mathcal{H}_R^d - \nabla f_\rho)\otimes X^d\big)\le(2M+R)\sum_{j=1}^d\hat U_n\big(\mathcal{H}_{R,j} - (\nabla f_\rho)^j\big), $$
where the subadditivity of the Rademacher chaos is employed again for the second inequality. Then, by proposition 1, we conclude from the above that
$$ \hat U_n(h_{\mathcal{H}_R^d} - h_{\nabla f_\rho})\le(2M+R)R\sum_{j=1}^d\Big(\sum_{k=1}^{n(n-1)/2}\hat\mu_{k,j}\Big)^{1/2}. \qquad (4.9) $$
Denote $\hat D = c_1(2M+R)R\sum_{j=1}^d\big(\sum_{k=1}^{n(n-1)/2}\hat\mu_{k,j}\big)^{1/2}$. This, together with equations 4.5, 4.6, 4.7, and 4.9, yields
$$ P\Big(\sup_{f\in\mathcal{H}_R^d}|U_n(h_f - h_{\nabla f_\rho})|\ge\varepsilon\Big)\le\exp\big(-\lambda\sqrt{(n-1)\varepsilon}\big)\,E\exp(\hat D\lambda^2) = \exp\big(-\lambda\sqrt{(n-1)\varepsilon} + \hat D\lambda^2\big). $$
As a consequence, we can complete the proof by taking $\lambda = \frac{\sqrt{(n-1)\varepsilon}}{1+\hat D}$.

It is known that the quantity $\frac1n$ is the best rate in learning theory, so the upper bound on $I_2$ can be ignored when deriving generalization bounds. This situation is similar to that of ranking in Clemencon et al. (2008) and Rejchel (2012). However, we consider more candidate kernels than those works do, mainly as a result of the upper bounds for the Rademacher chaos derived in proposition 1.

5 Sample Error Using Local Rademacher Complexity

Empirical process theory is essential for handling the term $I_1$ from Hoeffding's decomposition of the U-statistic. To obtain better rates, one has to be able to uniformly bound second moments of random functions by their expectations. To this end, we need some notation characterizing the complexity of function classes. Let $\{\sigma_i\}_{i=1}^n$ be an i.i.d. sequence of Rademacher variables, and let $\{z_i\}_{i=1}^n$ be an i.i.d. sequence of random variables from $\mathcal{Z}$, drawn according to some distribution $Q$.


For each $r>0$, define the Rademacher complexity of the function class $\mathcal{F}$ at level $r$ as
$$ R_n(\mathcal{F}; r) = \sup_{f\in\mathcal{F},\,Ef^2\le r}\Big|\frac1n\sum_{i=1}^n\sigma_i f(z_i)\Big|, $$
and call the expression $E_{z,\sigma}[R_n(\mathcal{F}; r)]$ the local Rademacher average of the class $\mathcal{F}$. It is worth noting that the Rademacher complexity can be regarded as a homogeneous Rademacher chaos of order one. In general, a subroot function is used as an upper bound for the local Rademacher complexity. A function $\psi:[0,\infty)\to[0,\infty)$ is a subroot if it is nonnegative, nondecreasing, and such that $\psi(r)/\sqrt{r}$ is nonincreasing. Now we can state a theorem for empirical processes, with minor changes, that is similar to theorem 1 in Bartlett et al. (2005):

Lemma 2. Let $\mathcal{F}$ be a class of measurable, square integrable functions such that $Ef - f\le b$ for all $f\in\mathcal{F}$. Let $\psi$ be a subroot function, $A$ be some positive constant, and $r^*$ be the unique solution to $\psi(r) = r/A$. Assume that
$$ E[R_n(\mathcal{F}; r)]\le\psi(r), \quad r\ge r^*. $$
Then for all $t>0$ and all $K > A/7$, with probability at least $1 - e^{-t}$, there holds
$$ Ef - \frac1n\sum_{i=1}^n f(z_i)\le\frac{Ef^2}{K} + \frac{50K}{A^2}r^* + \frac{(K+9b)t}{n}, \quad f\in\mathcal{F}. $$
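As a quick illustration of the fixed point $r^*$ appearing in lemma 2, the following sketch solves $\psi(r) = r/A$ numerically for the kernel-type subroot $\psi(r) = \big(\frac1n\sum_k\min\{r,\mu_k\}\big)^{1/2}$ (cf. lemma 4 in the appendix) with the polynomial decay $\mu_k = k^{-2\alpha}$ of assumption 4; the choices $A = 1$ and $\alpha = 1$ are illustrative.

    import numpy as np

    def fixed_point(mu, n, A=1.0):
        """Solve psi(r) = r / A by bisection, with psi(r) = sqrt(sum_k min(r, mu_k) / n)."""
        g = lambda r: np.sqrt(np.minimum(r, mu).sum() / n) - r / A
        lo, hi = 1e-12, 1.0
        while g(hi) > 0:          # expand until the root is bracketed
            hi *= 2.0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if g(mid) > 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    alpha = 1.0
    mu = np.arange(1, 10001, dtype=float) ** (-2 * alpha)     # mu_k = k^{-2 alpha}
    for n in [100, 1000, 10000, 100000]:
        r_star = fixed_point(mu, n)
        # assumption 4 with alpha = 1 predicts r* of order n^{-2 alpha/(2 alpha + 1)} = n^{-2/3}
        print(n, r_star, r_star * n ** (2 * alpha / (2 * alpha + 1)))

The printed ratio $r^*\,n^{2\alpha/(2\alpha+1)}$ is roughly constant across $n$, matching the order of the fixed point used later in equation 5.3.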

Lemma 2 tells us that to get better bounds for the empirical term, one needs to study properties of the fixed point $r^*$ of a subroot $\psi$. Although there exists no general method for choosing $\psi$, a sharp bound for the local Rademacher complexity of a reproducing kernel space has been established in Mendelson (2002; see lemma 4 below). Analogous to the Rademacher chaos process, we also use the eigenvalue decay of the kernel operator to characterize the functional complexity of $\mathcal{H}$. First, we establish an equivalence between the loss involving $\phi_f$ and the usual $L^2_{\rho_X}$-norm:

Lemma 3. Suppose that assumptions 1 and 2 are both satisfied. When $s$ is bounded, there exists some positive constant $c$ such that for every function $f\in\mathcal{H}_R^d$,
$$ E[Pf - P\nabla f_\rho]^2\asymp s^{d+2}\|f - \nabla f_\rho\|_{L^2_{\rho_X}}^2. $$
Recall the notation $a\asymp b$, which means that there exists a numerical constant $c>0$ such that $c^{-1}\le a/b\le c$.


The proof of lemma 3, given in the appendix, is very useful for estimating the upper bound of $I_1$. The appearance of $s^{d+2}$ is mainly due to the fact that the weight function is not normalized.

Theorem 3. Suppose that the regularity conditions 3.2, 3.3, and 3.4 on the underlying distribution are satisfied and that the spectral assumption 3.5 holds. Then there exist constants $c_1$, $c_2$, and $c_3$ such that for every $t>0$ and $N>2$, with probability at least $1 - e^{-t}$,
$$ \mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)\le P_n(Pf - P\nabla f_\rho) + c_2 N\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}} + c_3\frac{(N+R)t}{n}, \quad \forall f\in\mathcal{H}_R^d. $$

Proof. In order to apply lemma 2, we define the family of functions
$$ P\mathcal{H} - P\nabla f_\rho = \{Pf - P\nabla f_\rho,\ f\in\mathcal{H}_R^d\} = \{\gamma = E[\phi_f(Z_1,\cdot) - \phi_{\nabla f_\rho}(Z_1,\cdot)\,|\,\cdot\,],\ f\in\mathcal{H}_R^d\}. $$
Since $\mathcal{H}_R^d$ is uniformly bounded, for any $z_2\in\mathcal{Z}$ it follows that
$$ \big\|E[\phi_f(Z_1,z_2) - \phi_{\nabla f_\rho}(Z_1,z_2)\,|\,z_2]\big\|_\infty\le c\|\nabla f_\rho - f\|_\infty\le c_1 R. \qquad (5.1) $$
This means that $b = 2c_1R$ in lemma 2. In addition, by the Cauchy-Schwarz inequality, we have
$$ E_{Z_2}\big(E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,Z_2]\big)^2\le E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)]^2\le(R+M)[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)], \quad \forall f\in\mathcal{H}_R^d. \qquad (5.2) $$
Thus, the remaining work is to provide a tight upper bound on $r^*$. We notice that $\Delta(f) := Pf - P\nabla f_\rho$ belongs to a Lipschitz class with respect to the function space $\mathcal{H}_R^d\otimes X^d$, defined as
$$ \mathcal{H}_R^d\otimes X^d = \Big\{F(x_1,x_2):\ F(x_1,x_2) = f\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1),\ f\in\mathcal{H}_R^d\Big\}. $$
From equation A.3 in the appendix, it is easy to check that
$$ |(Pf - P\nabla f_\rho)(z_2)|\le c\,s^{d+1}M\,\big\|(f-\nabla f_\rho)(x_2)\big\|_2, \quad \forall z_2\in\mathcal{Z}. $$


Hence, combining the contraction property of Rademacher complexities (see theorem A.6 of Bartlett et al., 2005) with lemma 3, we have
$$ E\big[R_n\big(P\mathcal{H} - P\nabla f_\rho;\ P[\Delta(f)]^2\le r\big)\big]\le c_1 s^{d+1}E\Big[R_n\Big(\mathcal{H}_R^d\otimes X^d;\ \|\nabla f_\rho - f\|_{L^2_{\rho_X}}^2\le\frac{r}{s^{d+2}}\Big)\Big]. $$
At the same time, note the fact (Mendelson, 2002) that for any sequence of function classes $\{\mathcal{F}_k\}_{k=1}^N$ with any $N\ge1$,
$$ R_n\Big(\sum_{k=1}^N\mathcal{F}_k\Big)\le\sum_{k=1}^N R_n(\mathcal{F}_k). $$
Then we can reach the following conclusion:
$$ E\big[R_n\big(P\mathcal{H} - P\nabla f_\rho;\ P\Delta(f)^2\le r\big)\big]\le c_1 s^{d+1}E\Big[R_n\Big(\mathcal{H}_R^d\otimes X^d;\ \|f\|_{L^2_{\rho_X}}^2\le\frac{r}{s^{d+2}}\Big)\Big]\le c_2 R s^{d+1}E\Big[R_n\Big(B_{\mathcal{H}};\ \|f\|_{L^2_{\rho_X}}^2\le\frac{r}{R^2 s^{d+2}}\Big)\Big]\le c\Big(\frac1n\sum_{k=1}^\infty\min\{rs^d, R^2\mu_k s^{2(d+1)}\}\Big)^{1/2}, $$
where we use the conclusion of lemma 4 in the appendix. Denote the subroot function by
$$ \psi(r) = \Big(\frac1n\sum_{k=1}^\infty\min\{rs^d, \mu_k s^{2(d+1)}\}\Big)^{1/2}. $$
By a truncation technique, one can show that $n\psi(r)^2 = O\big(s^{\frac{d(2\alpha+1)+2}{2\alpha}}r^{1-\frac{1}{2\alpha}}\big)$, provided that $\mu_k\asymp k^{-2\alpha}$. Consequently, the fixed point $r^*$ satisfies $nr^2\asymp s^{\frac{d(2\alpha+1)+2}{2\alpha}}r^{1-\frac{1}{2\alpha}}$, which yields
$$ r^*\asymp\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}}. \qquad (5.3) $$


Applying lemma 2 and combining it with the derived quantities 5.1, 5.2, and 5.3, we have
$$ \mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)\le P_n(Pf - P\nabla f_\rho) + \frac1N E_{Z_2}\big(E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,Z_2]\big)^2 + c_2 N\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}} + c_3\frac{(N+R)t}{n}. $$
Finally, let $N$ be sufficiently large that $\frac{R+M}{N} < \frac12$. In this case, equation 5.2 implies that the second term of the above inequality can be bounded by $\frac12\big[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)\big]$.

6 Discussion

In this letter, we have analyzed the generalization ability of gradient learning algorithms via Hoeffding's decomposition and modern empirical process theory. Compared with other pairwise problems such as ranking, scoring, and preference learning, GL algorithms are local methods that cannot be analyzed with most classic techniques. Our theoretical analysis may be helpful for studying other local learning algorithms, including local SVMs and local semisupervised learning. As a by-product, we also provide a tight upper bound for the Rademacher chaos complexity of order two. The quantity we derive also plays an important role in general pairwise-type estimation problems, including ranking and scoring problems.

From our main results, we see that the generalization error still depends heavily on the ambient dimension, which leads to inefficient estimation when the dimension is relatively high. In view of this problem, we suggest screening out irrelevant features by independent screening techniques and then running the GL algorithm on the selected relevant features. An alternative approach is to assume that the target function has a group-additive form. In that case, the dimensionality problem in GL can also be alleviated efficiently under such a semiparametric setting, since it is known that additivity avoids the curse of dimensionality in statistics.


Appendix: Proofs of Theorem 1 and Lemma 3

In contrast with the bound for $I_1$ given in theorem 3, the bound for the U-statistic $I_2$ can essentially be ignored in bounding $\mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho)$ from equation 3.1.

Proof of Theorem 1. By the definition of $f_z$, it follows that $\mathcal{E}_z(f_z)\le\mathcal{E}_z(\nabla f_\rho)$. Thus, the error decomposition of equation 3.1 can be formulated as
$$ \mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho)\le I_1 + I_2. $$

Combining theorems 3 and 2, for every $\delta\in(0,1)$, with probability at least $1-\delta$, there holds
$$ \mathcal{E}(f_z) - \mathcal{E}(\nabla f_\rho)\le c_2 N\Big(\frac1n\Big)^{\frac{2\alpha}{1+2\alpha}} s^{\,d+\frac{2}{1+2\alpha}} + c_3\frac{(N+R)\log(2/\delta)}{n} + (1+D)^2\frac{\log(2/\delta)}{n}. \qquad (A.1) $$

Furthermore, we notice that $\mathcal{E}(\nabla f_\rho) = \int_{\mathcal{X}}\int_{\mathcal{X}}\omega^s(u-x)\big(f_\rho(u) - f_\rho(x) - \nabla f_\rho(\frac{x+u}{2})\cdot(u-x)\big)^2 d\rho_X(x)\,d\rho_X(u)$. Following the regularity condition, equation 3.4, we see that $|\mathcal{E}(\nabla f_\rho)|\le c_2 s^{d+4}$, which yields the second part of theorem 1 directly.

Proof of Corollary 1. By slightly modifying the results of theorem 2 of Ye and Xie (2012), we have
$$ \|\nabla f_\rho - f\|_{L^2_{\rho_X}}\le c\Big(s^\theta + \frac{1}{s^{\theta+d+2}}\big[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho) + \mathcal{E}(\nabla f_\rho)\big]\Big). $$
From the proof of theorem 1, we see that $|\mathcal{E}(\nabla f_\rho)|\le c_2 s^{d+4}$; since $s^{d+4}/s^{\theta+d+2} = s^{2-\theta}\le s^\theta$ for $s\le 1$ and $\theta\le 1$, this contribution is absorbed into the $s^\theta$ term. It then follows that
$$ \|\nabla f_\rho - f\|_{L^2_{\rho_X}}\le c\Big(s^\theta + \frac{1}{s^{\theta+d+2}}\big[\mathcal{E}(f) - \mathcal{E}(\nabla f_\rho)\big]\Big). \qquad (A.2) $$
By choosing $s = \big(\frac{\log(2/\delta)}{n}\big)^{1/(2\theta+d+2)}$, we complete the proof from the conclusion of theorem 1.


Proof of Lemma 3. Let $y_1 = f_\rho(x_1) + \epsilon_1$ and $y_2 = f_\rho(x_2) + \epsilon_2$. It follows that
$$ E[\phi_f(Z_1,Z_2)\,|\,Z_2, x_1]/\omega^s(x_1 - x_2) = \Big[f_\rho(x_1) - f_\rho(x_2) - f\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]^2 + (\epsilon_1+\epsilon_2)^2 - 2\epsilon_2\Big[f_\rho(x_1) - f_\rho(x_2) - f\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]. $$
This implies that
$$
\begin{aligned}
& E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,Z_2, x_1]/\omega^s(x_1-x_2)\\
&\quad = \Big[f_\rho(x_1)-f_\rho(x_2)-f\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]^2 - \Big[f_\rho(x_1)-f_\rho(x_2)-\nabla f_\rho\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]^2 + 2\epsilon_2\Big[(f-\nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]\\
&\quad = -2\Big[f_\rho(x_1)-f_\rho(x_2)-\nabla f_\rho\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]\Big[(f-\nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big] + \Big[(f-\nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]^2 + 2\epsilon_2\Big[(f-\nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]\\
&\quad := \Delta(x_1,x_2) + 2\epsilon_2\Big[(f-\nabla f_\rho)\Big(\frac{x_1+x_2}{2}\Big)\cdot(x_2-x_1)\Big]. \qquad (A.3)
\end{aligned}
$$
By the Cauchy-Schwarz inequality, we have
$$
\begin{aligned}
E[Pf - P\nabla f_\rho]^2 &= E\big(E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,Z_2]\big)^2\\
&\le E\big(E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,Z_2]^2\big)\\
&\le E_{x_1}E_{x_2}\big(E[\phi_f(Z_1,Z_2) - \phi_{\nabla f_\rho}(Z_1,Z_2)\,|\,x_2,x_1]^2\big)\\
&\le E_{x_1}E_{x_2}\big[\omega^s(x_1-x_2)\Delta(x_1,x_2)\big]^2 + 4E[\epsilon_2^2]\,E_{x_1}E_{x_2}\big[\omega^s(x_1-x_2)\|f-\nabla f_\rho\|_2\|x_2-x_1\|_2\big]^2.
\end{aligned}
$$


It is easy to check that
$$ E_{x_1}E_{x_2}\big[\omega^s(x_1-x_2)\Delta(x_1,x_2)\big]^2\le c_1 s^{d+4}\|f-\nabla f_\rho\|_2^2 $$
and
$$ E_{x_1}E_{x_2}\big[\omega^s(x_1-x_2)\|f-\nabla f_\rho\|_2\|x_2-x_1\|_2\big]^2\le c_2 s^{d+2}\|f-\nabla f_\rho\|_2^2. $$
Hence, we conclude that
$$ E[Pf - P\nabla f_\rho]^2\le c_2 s^{d+2}\|f-\nabla f_\rho\|_2^2, \qquad (A.4) $$
where we used the noise condition $E[\epsilon_2^2] < \infty$. Similarly, as long as $|s|$ is finite and $E[\epsilon_2^2] > c > 0$, we can also verify that
$$ E[Pf - P\nabla f_\rho]^2\ge c_1 s^{d+2}\|f-\nabla f_\rho\|_2^2. \qquad (A.5) $$

Recall that Bartlett et al. (2005) provide an equivalent relationship between the local Rademacher complexity and the eigenvalue decay of the kernel operator $L_K$ defined in section 3.

Lemma 4. There exist two positive constants $c_1$ and $c_2$ such that if $r\ge 1/n$, there holds
$$ c_1\Big(\frac1n\sum_{k=1}^\infty\min\{r,\mu_k\}\Big)^{1/2}\le E\big[R_n(f\in B_{\mathcal{H}},\ Pf^2\le r)\big]\le c_2\Big(\frac1n\sum_{k=1}^\infty\min\{r,\mu_k\}\Big)^{1/2}. $$

Acknowledgments I thank the anonymous reviewers for their helpful comments. My research is supported partially by the National Natural Science Foundation of China (grant 11301421), and Fundamental Research Funds for the Central Universities of China (grants JBK141111, 14TD0046, and JBK140210). My research is also supported partially by MEXT Grant-in-Aid for Scientific Research on Innovative Areas of Japan (grant 25120012).


References

Arcones, A., & Gine, E. (1993). Limit theorems for U-processes. Ann. Probab., 21, 1494–1542.
Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). Local Rademacher complexities. Ann. Stat., 33, 1497–1537.
Breiman, L. (1995). Better subset regression using the non-negative garrote. Technom., 37, 373–384.
Clemencon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical minimization of U-statistics. Ann. Stat., 36, 844–874.
Cox, D. R., & Snell, E. J. (1974). The choice of variables in observational studies. Appl. Stat., 23, 51–59.
Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49.
Donoho, D., & Grimes, C. (2003). Hessian eigenmaps: New locally linear embedding techniques for high dimensional data. Proc. Natl. Acad. Sci., 100, 5591–5596.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Stat., 32, 407–499.
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Stat. Assoc., 96, 1348–1360.
Feng, Y. L., Yang, Y. N., & Suykens, J. K. (2014). Robust gradient learning algorithm with applications to nonlinear variable selection (Tech. Rep.). Leuven, Belgium: Katholieke Universiteit Leuven.
Guo, X. (2010). Learning gradients via an early stopping gradient descent method. J. Approx. Theory, 11, 1919–1944.
Guo, X., & Zhou, D. X. (2012). An empirical feature-based learning algorithm producing sparse approximations. Appl. Comput. Harmon. Anal., 32, 389–400.
Hu, T., Fan, J., Wu, Q., & Zhou, D. X. (2013). Learning theory approach to minimum error entropy criterion. J. Mach. Learn. Res., 14, 377–397.
Lei, Y. W., & Ding, L. X. (2014). Refined Rademacher chaos complexity bounds with applications to the multi-kernel learning problem. Neural Comput., 26, 739–760.
Li, H., Ren, C. B., & Li, L. Q. (2014). U-processes and preference learning. Neural Comput., 26, 2896–2924.
Li, L. X., Cook, R. D., & Nachtsheim, C. J. (2005). Model-free variable selection. J. Roy. Stat. Soc., Ser. B, 67, 285–299.
Lin, Y., & Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Stat., 34, 2272–2297.
Meier, L., Geer, S. V. D., & Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Stat., 37, 3779–3821.
Mendelson, S. (2002). Geometric parameters of kernel machines. In Proceedings of the 15th Annual Conference on Computational Learning Theory (pp. 29–43). New York: Springer.
Mukherjee, S., & Wu, Q. (2006). Estimation of gradients and coordinate covariation in classification. J. Mach. Learn. Res., 7, 2481–2514.
Mukherjee, S., Wu, Q., & Zhou, D. X. (2010). Learning gradient on manifolds. Bernoulli, 16, 181–207.


Mukherjee, S., & Zhou, D. X. (2006). Learning coordinate covariance via gradients. J. Mach. Learn. Res., 7, 519–549.
Pena, V. H., & Gine, E. (1999). Decoupling: From dependence to independence. New York: Springer-Verlag.
Ravikumar, P., Liu, H., Lafferty, J., & Wasserman, L. (2009). SpAM: Sparse additive models. J. Roy. Stat. Soc., Ser. B, 71, 1009–1030.
Rejchel, W. (2012). On ranking and generalization bounds. J. Mach. Learn. Res., 13, 1373–1392.
Smale, S., & Zhou, D. X. (2009). Geometry on probability spaces. Constr. Approx., 30, 311–323.
Steinwart, I., & Christmann, A. (2008). Support vector machines. New York: Springer.
Steinwart, I., Hush, D., & Scovel, C. (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory. New York: Springer-Verlag.
Van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.
Wu, Q., Guinney, J., Maggioni, M., & Mukherjee, S. (2010). Learning gradients: Predictive models that infer geometry and statistical dependence. J. Mach. Learn. Res., 99, 2175–2198.
Xia, Y. C., Tong, H., Li, W. K., & Zhu, L. X. (2002). An adaptive estimation of dimension reduction space. J. Roy. Stat. Soc., Ser. B, 64, 360–410.
Ye, G. B., & Xie, X. H. (2012). Learning sparse gradients for variable selection and dimension reduction. Mach. Learn., 87, 303–355.
Ying, Y. M., & Campbell, C. (2008). Learning coordinate gradients with multi-task kernels. In Proceedings of the 22nd Annual Conference on Learning Theory. New York: Springer-Verlag.
Ying, Y. M., & Campbell, C. (2009). Generalization bounds for learning the kernel. In Proceedings of the 22nd Annual Conference on Learning Theory. New York: Springer-Verlag.

Received November 10, 2014; accepted January 20, 2015.
