IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 9, SEPTEMBER 2015


Fully Empirical and Data-Dependent Stability-Based Bounds

Luca Oneto, Member, IEEE, Alessandro Ghio, Member, IEEE, Sandro Ridella, Member, IEEE, and Davide Anguita, Senior Member, IEEE

Abstract—The purpose of this paper is to obtain a fully empirical stability-based bound on the generalization ability of a learning procedure, thus circumventing some limitations of the structural risk minimization framework. We show that assuming a desirable property of a learning algorithm is sufficient to make data-dependency explicit for stability, which, instead, is usually bounded only in an algorithmic-dependent way. In addition, we prove that a well-known and widespread classifier, like the support vector machine (SVM), satisfies this condition. The obtained bound is then exploited for model selection purposes in SVM classification and tested on a series of real-world benchmarking datasets, demonstrating, in practice, the effectiveness of our approach.

Index Terms—Algorithmic stability, data-dependent bounds, fully empirical bounds, in-sample, model selection, out-of-sample, support vector machine (SVM).

I. INTRODUCTION

THE notion of stability [1]–[3] allows us to answer a fundamental question in learning theory: which properties should a learning algorithm A fulfill in order to achieve good generalization performance? Stability answers this question in a very intuitive way: if A selects similar models even when the training data are (slightly) modified, then we can be confident that the learning algorithm is stable. In other words, if stability shows that A does not excessively fit the noise that afflicts the available data, then A is able to achieve good generalization. Note that, using this approach, there is no need to aprioristically fix a set of models to be explored by the learning algorithm, which represents a novelty with respect to the structural risk minimization (SRM) principle, developed in the framework of statistical learning theory (SLT) [4]. The groundbreaking idea of stability, in fact, allows us to overcome some computational and theoretical issues of SRM, where it is necessary to fix a class of functions F in a data-independent way, and to measure its complexity for obtaining valid generalization bounds¹ [8]–[12].

Manuscript received January 10, 2014; revised July 16, 2014 and September 18, 2014; accepted October 1, 2014. Date of publication October 20, 2014; date of current version August 14, 2015. This paper was recommended by Associate Editor G.-B. Huang. L. Oneto, A. Ghio, and S. Ridella are with the Department of Electrical, Electronic, and Telecommunications Engineering, and Naval Architecture, University of Genoa, Genoa I-16145, Italy (e-mail: [email protected]; [email protected]; [email protected]). D. Anguita is with the Department of Informatics, BioEngineering, Robotics, and Systems Engineering, University of Genoa, Genoa I-16145, Italy (e-mail: [email protected]). Digital Object Identifier 10.1109/TCYB.2014.2361857

The main idea of SRM is to suggest a way to balance the two driving forces of learning: on one hand, it looks for the model that best fits the data, while, at the same time, it constrains the search for the best model to a suitable hypothesis set. The selection of the optimal hypothesis space is also known as the model selection problem [13], [14], which corresponds to finding the optimal hyperparameters of a learning algorithm (e.g., the value of k in k-local rules [15], the size or depth of decision trees [16], [17], the number of layers and hidden nodes in neural networks [18], [19], or the regularization and kernel hyperparameters in support vector machines (SVMs) [12]).

Up to now, the main competitive advantage of SRM over stability has been the development of complexity measures that allow studying the uniform deviation of models [10], [11] and provide fully empirical, data-dependent generalization bounds. Thanks to these results, it is possible to assess, in practice, the performance of a learning procedure [12]. However, several successful learning algorithms, like, for example, the k-nearest neighbors (k-NN) [20], are built from (possibly heuristic) training procedures or strategies, without explicitly defining a fixed hypothesis space. The k-NN idea is to group similar objects into the same class, but the hypothesis is only defined as soon as the data become available: no set of functions, from which we pick the one best fitting the available data, is defined [20]. When the hypothesis space cannot be defined in advance, the SRM principle fails and it becomes mandatory to resort to well-known out-of-sample techniques, such as the K-fold cross validation (KCV) or the bootstrap [21], [22]: however, they are unsatisfactory both from a theoretical and a practical point of view [12], [23]. Some attempts have also been made to extend the SRM principle to data-dependent hypothesis spaces, but no practical and general results have been obtained so far [9], [24], [25].

The difference between the two approaches can be clarified through the simple graphical description of Fig. 1. The SRM principle is depicted in Fig. 1(a). Let F1 ⊆ F2 be two classes of functions and T1, T2 be two different training sets, originated from the same distribution.

¹This is true not only in a frequentist setting, like SRM, but in the Bayesian learning framework as well, where a prior distribution of models must be specified before seeing the data [5]–[7].



Fig. 1. Different philosophies in learning theory. (a) SRM: F1 is a good hypothesis space, while F2 is not. (b) Stability: A1 is a stable algorithm, while A2 is not.

The learning phase consists of finding a model from the selected hypothesis space, say F1, which best fits the data: if T1 is used, then f_1^{F1} is obtained, while f_2^{F1} is selected if we opt to learn the dataset T2 instead. Since, in this case, the hypothesis space F1 is simple (namely, small) enough, f_1^{F1} will be forced to be “close” (in some sense) to f_2^{F1}. In other words, the final outcome of the learning phase will not be heavily influenced by the randomness of the process generating the data, so the risk of learning the noise, i.e., of overfitting the data, will be minimized. If, instead, we select the larger hypothesis set F2, the model f_1^{F2} could end up “far” from f_2^{F2}: this means that the hypothesis class is too large for the particular learning task and the risk of overfitting the data is high.

The advantage of stability, instead, is that it builds on a completely different approach: the hypothesis space is implicitly defined by the learning algorithm itself, when fitting the available data. A graphical description is reported in Fig. 1(b): let us consider two algorithms A1 and A2 and, again, two training sets T1, T2, generated from the same distribution. Let us suppose that, by applying A1 to T1 and T2, we obtain two models, f_1^{A1} and f_2^{A1}, respectively, that are “close” (in some sense) to each other. Instead, if we apply A2 to T1 and T2, we obtain two models, f_1^{A2} and f_2^{A2}, that are far from each other. Stability indicates that A1 will generalize better than A2, since it is a more stable algorithm. Note that no hypothesis space has been defined: the stability of a particular learning algorithm is simply computed after learning the dataset.

The main unsolved problem of the stability framework is the lack of data-dependent generalization bounds. In fact, the few results that have appeared so far in the literature allow computing stability only in some particular cases. Moreover, the results are only algorithmic-dependent but not data-dependent [1], [26]–[30]. This is a remarkable drawback for model selection purposes, where we target the definition of data-dependent complexity measures, which allow us to effectively design an optimal model.

To the best of the authors' knowledge, we propose for the first time a stability-based, fully data-dependent generalization bound, where all the involved quantities can be easily estimated from the available data. Despite the fact that our bounds do not achieve the optimal convergence rates of the ones stemming from the better-explored SRM framework, we will show that our stability-based bounds outperform even well-known out-of-sample methods on several real-world benchmarking datasets.

This paper is organized as follows. In Section II, we recall the usual learning framework, while Section III deals with the theoretical aspects of stability, with particular reference to the state-of-the-art survey and the derivation of the new results (Section III-B). Then, we briefly summarize in-sample and out-of-sample approaches to model selection in Section IV, where we will also propose a general approach for applying our result to in-sample model selection. In Section V, we consider a particular case study on SVM [31], for which we also propose the benchmarking comparison discussed in Section VI. The conclusion of this paper is drawn in Section VII.

II. LEARNING FRAMEWORK

Let X ∈ R^d and Y ∈ R be, respectively, an input and an output space. We consider a set of labeled independent and identically distributed (i.i.d.) data Sn: {z_1, ..., z_n} of size n, where z_{i∈{1,...,n}} = (x_i, y_i), sampled from an unknown distribution μ. A learning algorithm A, characterized by a set of hyperparameters H that must be tuned, maps Sn into a function f: A_(Sn,H) from X to Y. In particular, A allows designing f ∈ F_H and defining the hypothesis space F_H, which is generally unknown (and depends on H). We assume that A satisfies some minor properties detailed in [1]: namely, we consider only deterministic algorithms that are symmetric with respect to Sn (i.e., they do not depend on the order of the elements in the training set); moreover, all the functions are measurable and all the sets are countable.

We also define two modified training sets: S_n^{\i}, where the ith element is removed,

S_n^{\setminus i}: \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n\}   (1)

and S_n^i, where the ith element is replaced,

S_n^{i}: \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n\}   (2)

where z_i' is an i.i.d. pattern, sampled from μ. The accuracy of A_(Sn,H) in representing the hidden relationship μ is measured with reference to a loss function ℓ(A_(Sn,H), z) = c(f(x), y), where c: Y × Y → [0, 1]: note that assuming the boundedness of the target values is a typical assumption in the theoretical analysis of supervised learning [10], [12], [13]. To analyze the case of unbounded targets, one usually truncates the values at a certain threshold and bounds the probability of exceeding that threshold (refer, for example, to the techniques proposed in [32]–[34]). Consequently, the quantity of interest is the generalization error, namely the error that a model will incur on new data generated by μ and previously unseen

R(A_{(S_n,H)}) = \mathbb{E}_z\, \ell(A_{(S_n,H)}, z).   (3)

R(A_(Sn,H)) is a random variable that depends on Sn; unfortunately, it cannot be computed, since μ is unknown, and, consequently, it must be estimated. Two of its most exploited estimators are the empirical error [8], [10]–[12]

\hat{R}_{emp}(A_{(S_n,H)}, S_n) = \frac{1}{n} \sum_{z \in S_n} \ell(A_{(S_n,H)}, z)   (4)

and the leave-one-out (LOO) error [35], [36]

\hat{R}_{loo}(A_{(S_n,H)}, S_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(A_{(S_n^{\setminus i},H)}, z_i).   (5)

III. BOUNDING THE GENERALIZATION ERROR WITH STABILITY

A. Preliminary Definitions and State-of-the-Art of Stability Bounds

As remarked in the previous section, R(A_(Sn,H)) cannot be computed, since μ is unknown. Conventionally, an upper bound on R(A_(Sn,H)) is derived by studying the supremum of the uniform deviation of the generalization error from the empirical error of (4) or, alternatively, from the LOO error of (5)

\sup_{f: A_{(S_n,H)} \in \mathcal{F}_H} \left[ R(A_{(S_n,H)}) - \hat{R}_{emp}(A_{(S_n,H)}, S_n) \right]   (6)

\sup_{f: A_{(S_n,H)} \in \mathcal{F}_H} \left[ R(A_{(S_n,H)}) - \hat{R}_{loo}(A_{(S_n,H)}, S_n) \right].   (7)
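As a concrete illustration of the two estimators (4) and (5), the following minimal sketch computes both for a generic scikit-learn-style classifier under a hard {0, 1} loss; the learner factory, the synthetic data, and the hyperparameter values are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the empirical (4) and leave-one-out (5) error estimators,
# assuming a scikit-learn-style learner and the hard {0,1} loss.
import numpy as np
from sklearn.svm import SVC

def empirical_error(learner, X, y):
    # (4): error of the model trained on the whole S_n, evaluated on S_n itself
    f = learner().fit(X, y)
    return float(np.mean(f.predict(X) != y))

def loo_error(learner, X, y):
    # (5): for each i, train on S_n without z_i and test on the held-out z_i
    n = len(y)
    errs = []
    for i in range(n):
        keep = np.delete(np.arange(n), i)
        f_i = learner().fit(X[keep], y[keep])
        errs.append(float(f_i.predict(X[i:i + 1])[0] != y[i]))
    return float(np.mean(errs))

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.randn(50, 5)
    y = np.where(X[:, 0] + 0.3 * rng.randn(50) > 0, 1, -1)
    learner = lambda: SVC(C=1.0, kernel="rbf", gamma=0.5)  # placeholder hyperparameters
    print(empirical_error(learner, X, y), loo_error(learner, X, y))
```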

In SLT and, in particular, in the SRM framework [8], we hypothesize that the class of functions F_H is defined in a data-independent fashion and, then, is known. When dealing with stability, we suppose that F_H is not aprioristically designed; thus, studying the uniform deviation is not a road that can be run. The deviation D̂(A_(Sn,H), Sn) of the generalization error from the empirical or the LOO errors is analyzed, instead

\hat{D}_{emp}(A_{(S_n,H)}, S_n) = R(A_{(S_n,H)}) - \hat{R}_{emp}(A_{(S_n,H)}, S_n)   (8)

\hat{D}_{loo}(A_{(S_n,H)}, S_n) = R(A_{(S_n,H)}) - \hat{R}_{loo}(A_{(S_n,H)}, S_n).   (9)

Note that the deterministic counterparts of D̂_emp(A_(Sn,H), Sn) and D̂_loo(A_(Sn,H), Sn) can be defined as

D^2_{emp}(A_H, n) = \mathbb{E}_{S_n}\, \hat{D}^2_{emp}(A_{(S_n,H)}, S_n)   (10)

D^2_{loo}(A_H, n) = \mathbb{E}_{S_n}\, \hat{D}^2_{loo}(A_{(S_n,H)}, S_n).   (11)

In order to study D̂(A_(Sn,H), Sn), we can adopt different approaches. The first one consists of using the hypothesis stability H(A_H, n)

H_{emp}(A_H, n) = \mathbb{E}_{S_n, z_i'} \left| \ell(A_{(S_n,H)}, z_i) - \ell(A_{(S_n^{i},H)}, z_i) \right| \le \beta_{emp}   (12)

H_{loo}(A_H, n) = \mathbb{E}_{S_n, z} \left| \ell(A_{(S_n,H)}, z) - \ell(A_{(S_n^{\setminus i},H)}, z) \right| \le \beta_{loo}.   (13)

[1, Lemma 3] proves that

D^2_{emp}(A_H, n) \le \frac{1}{2n} + 3 H_{emp}(A_H, n)   (14)

D^2_{loo}(A_H, n) \le \frac{1}{2n} + 3 H_{loo}(A_H, n).   (15)

By exploiting the Chebyshev inequality, we get that, for a random variable a and with probability (1 − δ)

P[a > t] \le \frac{\mathbb{E}[a^2]}{t^2}, \qquad a < \sqrt{\frac{\mathbb{E}[a^2]}{\delta}}.   (16)

Then, by combining (8), (12), and (14) [or, analogously, (9), (13), and (15)] we obtain that, with probability (1 − δ)

R(A_{(S_n,H)}) \le \hat{R}_{emp}(A_{(S_n,H)}, S_n) + \sqrt{\frac{1}{2n\delta} + \frac{3\beta_{emp}}{\delta}}   (17)

R(A_{(S_n,H)}) \le \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \sqrt{\frac{1}{2n\delta} + \frac{3\beta_{loo}}{\delta}}   (18)

which are the polynomial bounds previously derived in [1] and based on hypothesis stability.²

Another approach, targeted toward deriving a stability bound on the generalization error, consists of exploiting the uniform stability U(A)

U^{i}(A_H, n) = \left\| \ell(A_{(S_n,H)}, \cdot) - \ell(A_{(S_n^{i},H)}, \cdot) \right\|_\infty \le \beta^{i}   (19)

U^{\setminus i}(A_H, n) = \left\| \ell(A_{(S_n,H)}, \cdot) - \ell(A_{(S_n^{\setminus i},H)}, \cdot) \right\|_\infty \le \beta^{\setminus i}.   (20)

By noting that

H_{emp}(A_H, n) \le U^{i}(A_H, n)   (21)

H_{loo}(A_H, n) \le U^{\setminus i}(A_H, n)   (22)


we can derive the following exponential bounds [1], which hold with probability (1 − δ):

R(A_{(S_n,H)}) \le \hat{R}_{emp}(A_{(S_n,H)}, S_n) + 2\beta^{i} + \left(4n\beta^{i} + 1\right)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}   (23)

R(A_{(S_n,H)}) \le \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \beta^{\setminus i} + \left(4n\beta^{\setminus i} + 1\right)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}.   (24)

The proof is mainly based on the exploitation of the McDiarmid inequalities [37]. Although the bounds are exponential, stability must decrease with n in order to obtain a nontrivial result. Unfortunately, this is seldom the case: for example, when a hard {0, 1} loss function is exploited in binary classification to count the number of misclassifications, it is possible to prove that β^i = β^{\i} = 1 for many well-known and widely used algorithms [38], [39] (such as k-local rules [15] or SVMs [31]). Moreover, in those cases where nontrivial results can be derived (e.g., in bounded support vector regression [40]), strong conditions on μ must hold, which are rarely satisfied in practice. Finally, also note that the previous results are all algorithmic-dependent, but they are not data-dependent: as remarked in the introduction, this represents a drawback in practical applications. In the next subsection, we cope with these blind spots of stability by deriving a new fully empirical and data-dependent result.

²For the sake of precision, note that the bounds slightly differ from the results proposed in [1]: as a matter of fact, as also underlined in [3], the original work on stability [1] contains one error, which motivates the exploitation of the two notions of hypothesis stability of (14) and (15).
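To see why the exponential bounds are of little practical use unless the stability constants decrease with n, the small numeric sketch below evaluates (18) and (24) for some illustrative values of β, n, and δ; the numbers are assumptions chosen only for illustration, not results from the paper.

```python
# Numeric sketch comparing the polynomial bound (18) and the exponential bound (24).
import numpy as np

def polynomial_bound(r_loo, beta_loo, n, delta):
    # (18): R <= R_loo + sqrt(1/(2 n delta) + 3 beta_loo / delta)
    return r_loo + np.sqrt(1.0 / (2 * n * delta) + 3.0 * beta_loo / delta)

def exponential_bound(r_loo, beta_del, n, delta):
    # (24): R <= R_loo + beta + (4 n beta + 1) * sqrt(log(1/delta) / (2 n))
    return r_loo + beta_del + (4 * n * beta_del + 1) * np.sqrt(np.log(1.0 / delta) / (2 * n))

n, delta, r_loo = 1000, 0.05, 0.10
for beta in (1e-1, 1e-2, 1e-3, 1.0 / n):
    # unless beta is roughly O(1/n), the exponential bound exceeds 1 and is trivial
    print(beta, polynomial_bound(r_loo, beta, n, delta), exponential_bound(r_loo, beta, n, delta))
```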

B. New Data-Dependent Stability Bounds

Let us consider the LOO error: we will devote the last part of this section to the motivations of this choice. We start by making an assumption on the learning algorithm A. In particular, we suppose that the hypothesis stability does not increase with the cardinality of the training set

D_{loo}(A_H, n) \le D_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right).   (25)

We point out that (25) is a desirable requirement for any learning algorithm: in fact, the impact on the learning procedure of removing samples from Sn should decrease, on average, as n grows. Alternatively, we can hypothesize that

H_{loo}(A_H, n) \le H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right).   (26)

Note that

if: H_{loo}(A_H, n) \le H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right), then: D_{loo}(A_H, n) \le D_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right).

Note also that (25) [or, alternatively, (26)] has already been studied by many researchers in the past. In particular, these properties are related to the concept of consistency [38], [41]. However, connections can also be identified with the trend of the learning curves of an algorithm [42]–[45]. Moreover, such quantities are strictly linked to the concept of smart rule [38]. The purpose of these works is to prove that an algorithm performs better as the cardinality of the learning set increases: then, the more data we have, the more concentrated the empirical or the LOO errors should be around the generalization error. It is worth underlining that, in many of the above-referenced works, (25) [or, alternatively, (26)] is proved to be satisfied by many well-known algorithms (SVMs, kernelized regularized least squares, k-local rules with k > 1, etc.).

In the following, we start by considering and using the assumption of (25). In this case, we exploit (16) and derive that, with probability (1 − δ)

\hat{D}_{loo}(A_{(S_n,H)}, S_n) \le \sqrt{\frac{D^2_{loo}(A_H, n)}{\delta}} \le \sqrt{\frac{D^2_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right)}{\delta}}.   (27)

By exploiting (15), we have that

D^2_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right) \le \frac{1}{\sqrt{n}} + 3 H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right).   (28)

We focus now on H_loo(A_H, √n/2). For this purpose, let us introduce the following empirical quantity:

\hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) = \frac{8}{n\sqrt{n}} \sum_{k=1}^{\frac{\sqrt{n}}{2}} \sum_{j=1}^{\frac{\sqrt{n}}{2}} \sum_{i=1}^{\frac{\sqrt{n}}{2}} \left| \ell\!\left(A_{\breve{S}^k_{\frac{\sqrt{n}}{2}}}, \breve{z}^k_j\right) - \ell\!\left(A_{\left(\breve{S}^k_{\frac{\sqrt{n}}{2}}\right)^{\setminus i}}, \breve{z}^k_j\right) \right|   (29)

where

\breve{S}^k_{\frac{\sqrt{n}}{2}}: \left\{ z_{(k-1)\sqrt{n}+1}, \ldots, z_{(k-1)\sqrt{n}+\frac{\sqrt{n}}{2}} \right\} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}   (30)

\breve{z}^k_j: z_{(k-1)\sqrt{n}+\frac{\sqrt{n}}{2}+j} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}.   (31)

Note that the quantity of (29) is the empirical unbiased estimator of H_loo(A_H, √n/2) and then

H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right) = \mathbb{E}_{S_{\frac{\sqrt{n}}{2}}}\, \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right).   (32)

It is worth noting that, when dealing with Ĥ_loo(A_(S_{√n/2},H), S_{√n/2}), all the samples z_i are i.i.d. and sampled from μ. Thus, all |ℓ(A_{S̆^k_{√n/2}}, z̆^k_j) − ℓ(A_{(S̆^k_{√n/2})^{\i}}, z̆^k_j)| ∈ [0, 1] will be i.i.d., and we can bound in probability the difference between H_loo(A_H, √n/2) and Ĥ_loo(A_(S_{√n/2},H), S_{√n/2}) by exploiting, for example, the Hoeffding inequality [46]


P\!\left[ H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right) - \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) > t \right] \le e^{-\sqrt{n}\, t^2}.   (33)

Then, with probability (1 − δ)

H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right) \le \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) + \sqrt{\frac{\log\frac{1}{\delta}}{\sqrt{n}}}.   (34)
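The estimator of (29)–(31) can be computed directly from the data by splitting S_n into √n/2 chunks of √n consecutive samples: the first half of each chunk plays the role of S̆^k, the second half supplies the test points z̆^k_j. The sketch below is a minimal illustration under these assumptions, using an SVM under a hard loss as the learning algorithm; it ignores practical corner cases (e.g., blocks containing a single class, or n not being a perfect square).

```python
# Minimal sketch of the empirical LOO hypothesis-stability estimator (29)-(31).
import numpy as np
from sklearn.svm import SVC

def hat_H_loo(learner, X, y):
    n = len(y)
    m = int(np.sqrt(n)) // 2                 # block size, roughly sqrt(n)/2
    diffs = []
    for k in range(m):                       # roughly sqrt(n)/2 chunks of 2m samples each
        start = k * 2 * m
        Xb, yb = X[start:start + m], y[start:start + m]                   # S^k
        Xt, yt = X[start + m:start + 2 * m], y[start + m:start + 2 * m]   # test points z^k_j
        loss_full = (learner().fit(Xb, yb).predict(Xt) != yt).astype(float)
        for i in range(m):                   # leave out the ith element of S^k
            keep = np.delete(np.arange(m), i)
            loss_del = (learner().fit(Xb[keep], yb[keep]).predict(Xt) != yt).astype(float)
            diffs.append(np.abs(loss_full - loss_del))
    # the mean over the ~m^3 terms corresponds to the 8/(n sqrt(n)) normalization in (29)
    return float(np.mean(diffs))

learner = lambda: SVC(C=1.0, kernel="rbf", gamma=0.1)  # placeholder hyperparameters
```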

Combining (27), (28), and (34) we get that, with probability (1 − δ)

R(A_{(S_n,H)}) \le \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \sqrt{\frac{2}{\delta}\left[ \frac{1}{\sqrt{n}} + 3\left( \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) + \sqrt{\frac{\log\frac{2}{\delta}}{\sqrt{n}}} \right) \right]}.   (35)

When exploiting (26), the proof is analogous. We can make use of (15) and (16), and state that, with probability (1 − δ)

\hat{D}_{loo}(A_{(S_n,H)}, S_n) \le \sqrt{\frac{D^2_{loo}(A_H, n)}{\delta}} \le \sqrt{\frac{1}{\delta}\left[ \frac{1}{2n} + 3 H_{loo}(A_H, n) \right]} \le \sqrt{\frac{1}{\delta}\left[ \frac{1}{2n} + 3 H_{loo}\!\left(A_H, \frac{\sqrt{n}}{2}\right) \right]}.   (36)

Then, with probability (1 − δ)

R(A_{(S_n,H)}) \le \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \sqrt{\frac{2}{\delta}\left[ \frac{1}{2n} + 3\left( \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) + \sqrt{\frac{\log\frac{2}{\delta}}{\sqrt{n}}} \right) \right]}.   (37)

Let us roll back to the choice of using the LOO error in place of the empirical error for the previous proofs. One can imagine that the empirical estimator could be exploited as well, e.g., by defining two properties analogous to the ones of (25) and (26)

D_{emp}(A_H, n) \le D_{emp}\!\left(A_H, \frac{\sqrt{n}}{2}\right)   (38)

H_{emp}(A_H, n) \le H_{emp}\!\left(A_H, \frac{\sqrt{n}}{2}\right).   (39)

Consequently, we can study the empirical estimator of H_emp(A_H, √n/2). For this purpose, we have to introduce the following empirical quantity:

\hat{H}_{emp}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) = \frac{4}{n} \sum_{k=1}^{\frac{\sqrt{n}}{2}} \sum_{i=1}^{\frac{\sqrt{n}}{2}} \left| \ell\!\left(A_{\breve{S}^k_{\frac{\sqrt{n}}{2}}}, \breve{z}^k_i\right) - \ell\!\left(A_{\left(\breve{S}^k_{\frac{\sqrt{n}}{2}}\right)^{i}}, \breve{z}^k_i\right) \right|   (40)

where

\breve{S}^k_{\frac{\sqrt{n}}{2}}: \left\{ z_{(k-1)\sqrt{n}+1}, \ldots, z_{(k-1)\sqrt{n}+\frac{\sqrt{n}}{2}} \right\} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}   (41)

\breve{z}^k_i: z_{(k-1)\sqrt{n}+i} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}   (42)

\left(\breve{S}^k_{\frac{\sqrt{n}}{2}}\right)^{i}: \left\{ z_{(k-1)\sqrt{n}+1}, \ldots, z_{(k-1)\sqrt{n}+i-1}, \breve{z}^{\prime k}_{i}, z_{(k-1)\sqrt{n}+i+1}, \ldots, z_{(k-1)\sqrt{n}+\frac{\sqrt{n}}{2}} \right\} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}   (43)

\breve{z}^{\prime k}_{i}: z_{(k-1)\sqrt{n}+\frac{\sqrt{n}}{2}+i} \qquad k \in \left\{1, \ldots, \frac{\sqrt{n}}{2}\right\}.   (44)
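For reference, this is how the fully empirical bounds (35) and (37) are assembled once R̂_loo and Ĥ_loo have been computed (e.g., with the sketches given earlier); the numeric values in the example are placeholders.

```python
# Minimal sketch assembling the data-dependent stability bounds (35) and (37).
import numpy as np

def stability_bound(r_loo_hat, h_loo_hat, n, delta, use_eq_37=False):
    first = 1.0 / (2 * n) if use_eq_37 else 1.0 / np.sqrt(n)   # (37) vs. (35)
    inner = first + 3.0 * (h_loo_hat + np.sqrt(np.log(2.0 / delta) / np.sqrt(n)))
    return r_loo_hat + np.sqrt(2.0 / delta * inner)

print(stability_bound(r_loo_hat=0.12, h_loo_hat=0.05, n=400, delta=0.05))
print(stability_bound(r_loo_hat=0.12, h_loo_hat=0.05, n=400, delta=0.05, use_eq_37=True))
```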

Unfortunately, although all the patterns z_i are i.i.d. and sampled from μ, the quantities |ℓ(A_{S̆^k_{√n/2}}, z̆^k_i) − ℓ(A_{(S̆^k_{√n/2})^i}, z̆^k_i)| ∈ [0, 1] are not i.i.d., in general: thus, a bound analogous to the one for the LOO error cannot be derived.

IV. MODEL SELECTION: IN-SAMPLE VERSUS OUT-OF-SAMPLE APPROACHES

Model selection is strictly linked to the generalization error estimation of a model: as a matter of fact, the function characterized by the lowest estimated error value is chosen among several alternatives. In other words, we opt for the set of hyperparameters whose corresponding trained model minimizes the upper bound on R(A_(Sn,H)). Out-of-sample and in-sample approaches have been proposed for this purpose, which are briefly recalled in the following.

Out-of-sample approaches suggest using an independent dataset for validation purposes, sampled from the same data distribution μ. Then, Sn is split into the training set S_{n_TR} and the validation set S_{n_MS}, such that Sn = S_{n_TR} ∪ S_{n_MS} and S_{n_TR} ∩ S_{n_MS} = ∅. Then, we define the following empirical estimator of R(A_(S_{n_TR},H)):

\hat{R}_{emp}\!\left(A_{(S_{n_{TR}},H)}, S_{n_{MS}}\right) = \frac{1}{n_{MS}} \sum_{z \in S_{n_{MS}}} \ell\!\left(A_{(S_{n_{TR}},H)}, z\right).   (45)


Since, by construction, the data in S_{n_MS} are i.i.d. with respect to the samples in S_{n_TR}, all ℓ(A_(S_{n_TR},H), z) ∈ [0, 1] with z ∈ S_{n_MS} are i.i.d. as well, and we can bound the generalization error through the Hoeffding inequality

R\!\left(A_{(S_{n_{TR}},H)}\right) \le \hat{R}_{emp}\!\left(A_{(S_{n_{TR}},H)}, S_{n_{MS}}\right) + \sqrt{\frac{\log\frac{1}{\delta}}{2 n_{MS}}}   (46)

which holds with probability (1 − δ). The model selection phase is thus performed by varying the hyperparameters of the algorithm until the right side of (46) is minimized. In particular, let us suppose we consider different sets of hyperparameters H1, H2, ...; then, the optimal classifier A_(S_{n_TR},H*) is the result of the following minimization process:

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} \left[ \hat{R}_{emp}\!\left(A_{(S_{n_{TR}},H)}, S_{n_{MS}}\right) + \sqrt{\frac{\log\frac{1}{\delta}}{2 n_{MS}}} \right].   (47)

Usually, the splitting procedure of Sn into S_{n_TR} and S_{n_MS} is replicated in order to obtain a more reliable estimate. It is also worth mentioning that the partition of the original dataset into training and validation sets can affect the tightness of the bound, due to lucky or unlucky splittings, and, therefore, its effectiveness. This is a major issue for out-of-sample methods, and several heuristics have been proposed in the literature for dealing with this problem (e.g., stratified sampling [22], topology-based splitting [47], and optimal splitting strategies [48]): however, we will not focus on these topics, as they are outside the scope of this paper. It is also worth underlining that the model selection procedure for out-of-sample approaches is valid for any A, namely it does not depend on the learning algorithm exploited and does not require any assumptions or constraints to be satisfied by A. Note, moreover, that out-of-sample procedures do not require a hypothesis space F_H to be designed and defined a priori.

In-sample approaches, instead, target the use of the same dataset for learning and model selection, without resorting to additional samples: thanks to this peculiarity, in-sample methods proved to outperform the widespread out-of-sample techniques in those settings where the use of the latter ones has been questioned (e.g., in the small-sample setting) [12], [23]. A learning algorithm A exploits the input data Sn to produce a function f: A_(Sn,H) and an estimate of the generalization error R̂_emp(A_(Sn,H), Sn). The latter is a random variable, depending on the data themselves. As the trained function cannot be aprioristically known, the uniform deviation of the error estimate is studied

R(A_{(S_n,H)}) - \hat{R}_{emp}(A_{(S_n,H)}, S_n) \le \sup_{f: A_{(S_n,H)} \in \mathcal{F}_H} \left[ R(A_{(S_n,H)}) - \hat{R}_{emp}(A_{(S_n,H)}, S_n) \right].   (48)

Then, the model selection phase can be performed accordingly in a data-dependent fashion in the SRM framework [8]. A possibly infinite sequence {F_{H_i}, i = 1, 2, ...} of model classes of increasing complexity is chosen, and the following bound, which consists of the empirical error and a penalty term, is minimized:

R(A_{(S_n,H)}) \le \hat{R}_{emp}(A_{(S_n,H)}, S_n) + \sup_{f: A_{(S_n,H)} \in \mathcal{F}_H} \left[ R(A_{(S_n,H)}) - \hat{R}_{emp}(A_{(S_n,H)}, S_n) \right].   (49)

From (49), it is straightforward to note that the hypothesis space F_H must be known in order to apply in-sample methods, as formulated above. In this paper, instead, we propose to use the bound of (35) [or equivalently of (37)], which requires that the hypothesis of (25) [or (26)] is satisfied but does not require F_H to be known: in this way, we eliminate the main disadvantages of SRM by introducing the exploitation of the innovative stability bounds proven in the previous section. A reformulation of in-sample approaches in this direction is thus necessary. In particular, let us consider different sets of hyperparameters H1, H2, .... Then, the optimal classifier A_(Sn,H*) is the result of the following minimization process:

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} R(A_{(S_n,H)})   (50)
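A minimal sketch of the out-of-sample procedure of (45)–(47) is given below, assuming an SVM learner and a user-supplied grid of candidate hyperparameter sets; the 70/30 split ratio and the dictionary format of the candidates are illustrative choices.

```python
# Minimal sketch of hold-out model selection via (46)-(47).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def holdout_selection(X, y, candidates, delta=0.05, seed=0):
    X_tr, X_ms, y_tr, y_ms = train_test_split(X, y, test_size=0.3, random_state=seed)
    n_ms = len(y_ms)
    best, best_bound = None, np.inf
    for H in candidates:                                    # H = {"C": ..., "gamma": ...}
        f = SVC(C=H["C"], kernel="rbf", gamma=H["gamma"]).fit(X_tr, y_tr)
        r_hat = np.mean(f.predict(X_ms) != y_ms)            # hat{R}_emp of (45)
        bound = r_hat + np.sqrt(np.log(1.0 / delta) / (2 * n_ms))  # right side of (46)
        if bound < best_bound:
            best, best_bound = H, bound
    return best, best_bound
```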

where R(A_(Sn,H)) is estimated in a data-dependent way thanks to the innovative bounds we introduced in (35) and (37).

V. SVM

We consider SVM models for binary classification applications (i.e., Y ∈ {−1, +1}) [31] and we show how stability bounds can be effectively used for model selection purposes: in fact, model selection is known to be an open problem for SVM, and a topic to which a huge amount of work has been devoted in the last decades (e.g., refer to [12], [14], and [49]–[54]). In particular, we exploit the hard loss function

\ell_H(A_{(S_n,H)}, z) = \frac{1 - y\,\mathrm{sign}(f(x))}{2}, \qquad \mathrm{sign}(\cdot) = \begin{cases} +1 & \text{if } (\cdot) > 0 \\ -1 & \text{if } (\cdot) \le 0 \end{cases}, \qquad f = A_{(S_n,H)}   (51)

where ℓ_H(A_(Sn,H), z) ∈ {0, 1}, to count the number of misclassifications. We exploit the nonlinear formulation of SVM, where we map our input space X ∈ R^d to another space X̃ ∈ R^D, where generally D ≫ d, using a function φ(x): R^d → R^D. The SVM function is defined as

f(x) = \mathrm{sign}\!\left(w^T \phi(x) + b\right)   (52)

where the weights w ∈ R^D and the bias b ∈ R are found by solving the following primal convex constrained quadratic programming problem [4], [31]:

\min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\!\left(w^T \phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1, \ldots, n\}   (53)

which is equivalent to maximizing the margin (underfitting tendency) while penalizing the errors (overfitting tendency)


through the hinge loss function

\ell_\xi(f(x), y) = \left[ 1 - y_i f(x_i) \right]_+ = \xi_i   (54)

where [·]_+ = max(0, ·) [4], [31]. Note that we do not exploit the hard loss function, but its convex upper bound ℓ_ξ(f(x), y), where the regularization hyperparameter C controls the trade-off between under- and over-fitting: as a consequence, the hypothesis space depends on the data, and cannot be aprioristically defined. By introducing n Lagrange multipliers (α_1, ..., α_n), we can reformulate the primal problem (53) in its dual form, for which efficient solvers have been developed throughout the years (see [55], [56])

\min_{\alpha} \ \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \quad \forall i \in \{1, \ldots, n\}.   (55)
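In practice, the dual problem (55) is handed to an off-the-shelf SMO-type solver; the following sketch does so through scikit-learn and reads back the quantities appearing in the dual classifier discussed next. The data and hyperparameter values are placeholders.

```python
# Minimal sketch: solve the dual (55) with an SMO-type solver and inspect the result.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
alpha_y = clf.dual_coef_.ravel()   # y_i * alpha_i for the support vectors
b = clf.intercept_[0]              # the bias term b
print(len(alpha_y), "support vectors, bias =", b)
```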

After solving (55), the Lagrange multipliers can be used to define the SVM classifier in its dual form

f(x) = \mathrm{sign}\!\left( \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b \right)   (56)

where we used the well-known kernel trick [57] (K(x_i, x_j) = φ(x_i)^T φ(x_j)), and K(x_i, x_j) is a suitable kernel function. In particular, in this paper, we will focus on Gaussian kernels

K(x_i, x_j) = \exp\!\left( -\gamma \left\| x_i - x_j \right\|^2 \right)   (57)

because of their widespread use and their practical and theoretical benefits [58]. However, our results can be straightforwardly generalized to other kernel functions. Note that using Gaussian kernels leads to the introduction of an additional hyperparameter (γ): the hyperparameter set is consequently defined as H = {C, γ}, which must be tuned during the model selection phase. As discussed in Section IV, out-of-sample and in-sample techniques can be used for this purpose. In the following, for the sake of benchmarking the in-sample method based on stability that we introduced in this paper [refer to (50)], we contextualize one of the most commonly used and most effective [12] out-of-sample approaches, namely the KCV [22], [49], [59]–[62], and its particular case where the number of folds coincides with the number of patterns, i.e., the LOO [28]. As a further comparison, we also propose an alternative data-independent measure based on the Rademacher complexity notion, namely the margin bound [63].

A. SVM Model Selection With KCV and LOO

KCV consists of splitting the dataset Sn into K folds of equal size S^i_{n/K}, with i ∈ {1, ..., K}, such that³

S_n = \bigcup_{i=1}^{K} S^i_{\frac{n}{K}}   (58)

S^i_{\frac{n}{K}} \cap S^j_{\frac{n}{K}} = \emptyset, \qquad i \ne j, \quad i, j \in \{1, \ldots, K\}.   (59)

Then, we use in turn (K − 1) sets S^i_{n/K} for training purposes, and the remaining fold for model selection purposes. In other words, we generate K training sets S_{n_TR} and K validation sets S_{n_MS}

S^i_{n_{TR}} = S_{n - \frac{n}{K}} = S_n \setminus S^i_{\frac{n}{K}}, \qquad S^i_{n_{MS}} = S^i_{\frac{n}{K}}, \qquad i \in \{1, \ldots, K\}.   (60)

³Note that, in general, K ≪ n.
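A minimal sketch of the fold construction of (58)–(60) is given below: each candidate H is trained on the K − 1 remaining folds and validated on the held-out one, as required by the selection rule given next. The candidate grid format and the use of scikit-learn's KFold are illustrative assumptions.

```python
# Minimal sketch of the KCV splitting (58)-(60) and the per-fold validation errors.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold

def kcv_scores(X, y, candidates, K=10, delta=0.05):
    folds = list(KFold(n_splits=K, shuffle=True, random_state=0).split(X))
    n = len(y)
    scores = {}
    for idx, H in enumerate(candidates):
        total = 0.0
        for tr, ms in folds:                                # S^i_TR and S^i_MS of (60)
            f = SVC(C=H["C"], kernel="rbf", gamma=H["gamma"]).fit(X[tr], y[tr])
            r_hat = np.mean(f.predict(X[ms]) != y[ms])
            total += r_hat + np.sqrt(np.log(1.0 / delta) / (2 * n))  # penalty, constant in H
        scores[idx] = total
    return scores   # the KCV rule keeps the candidate with the smallest total
```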

Thus, we exploit (47) in order to find H* = {C*, γ*} such that

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} \sum_{i=1}^{K} \left[ \hat{R}_{emp}\!\left(A_{(S^i_{n_{TR}},H)}, S^i_{n_{MS}}\right) + \sqrt{\frac{\log\frac{1}{\delta}}{2n}} \right].   (61)

As clearly emerges from the discussion above, the KCV procedure leads to training K classifiers A_(S^i_{n_TR},H*), i = 1, ..., K: the most used way to deal with this issue is to finally retrain a classifier f: A_(Sn,H*) after model selection has concluded, by exploiting the whole learning set Sn and the best hyperparameter values H* [62]. Finally, it is worth noting that the LOO approach represents a particular case of the KCV procedure, where K = n. In this case, n training sets of size (n − 1) are created, while one single instance in turn is left out for validation purposes.

B. SVM Model Selection With Stability Bounds

As a first step, in order to highlight the novelty of the proposed approaches for stability, we show how the bounds originally proposed in [1] cannot be effectively exploited in practice in the SVM case study. On the contrary, in the second part of this section, we will show that our proposal can be used to implement an in-sample model selection approach for SVM classifiers.

As a first issue, we recall that, in the original formulation of [1], the computation of stability leads to trivial results if a hard loss function is used (see Section III and refer to [38] and [39]). However, this first issue could be easily circumvented by replacing the hard loss function with the soft loss

\ell_S(f(x), y) = \min\!\left\{ 1, \ell_\xi(f(x), y) \right\}.   (62)

By exploiting ℓ_S(f(x), y), the following bound holds with probability (1 − δ):

R(A_{(S_n,H)}) \le \hat{R}_{emp}(A_{(S_n,H)}, S_n) + C + (2nC + 1)\sqrt{\frac{\log\frac{1}{\delta}}{2n}}   (63)

since

\beta^{i} \le \frac{C}{2}.   (64)

Despite being fully empirical, the bound is only algorithmic-dependent (and data-independent), as it relies on the value assumed by C, which is obviously problem-specific.
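To illustrate why (63) is of little use for model selection, the small sketch below evaluates it for a few placeholder values of C and n: unless C shrinks roughly like 1/n, the right-hand side exceeds one. The numeric values are assumptions chosen only for illustration.

```python
# Numeric sketch of the algorithmic-dependent bound (63), with beta^i <= C/2 from (64).
import numpy as np

def bound_63(r_emp, C, n, delta=0.05):
    return r_emp + C + (2 * n * C + 1) * np.sqrt(np.log(1.0 / delta) / (2 * n))

n = 400
for C in (1.0, 0.1, 1.0 / n):       # placeholder values of the regularization parameter
    print(C, bound_63(r_emp=0.10, C=C, n=n))
```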


Moreover, even if the soft loss function is used, in order to avoid deriving trivial results, C must also be proven to decrease with n: unfortunately, this condition is seldom satisfied, since both [41] and [64] underline that, in order to reach SVM consistency, C must be kept constant or even increased with n. For the reasons listed above, the original bound of [1] is of very scarce use for SVM model selection.

On the contrary, the bound we proposed in (35) [or, equivalently, (37)] is fully empirical and data-dependent, i.e., both the learning algorithm and the learning samples have influence on stability. Moreover, contrary to the conventional approach of [1], we have no restrictions with reference to the use of a hard loss function, and no strong conditions must hold on how C varies with respect to n. We simply have to prove that (25) [or, alternatively, (26)] is satisfied: several works addressed these points in the past [38], [42]–[45], showing that these properties are fully satisfied by SVM. Then, in our case, we can safely exploit (35) in order to find H* = {C*, γ*}, such that

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} \left[ \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \sqrt{\frac{2}{\delta}\left[ \frac{1}{\sqrt{n}} + 3\left( \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) + \sqrt{\frac{\log\frac{2}{\delta}}{\sqrt{n}}} \right) \right]} \right].   (65)

Alternatively, we can use (37)

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} \left[ \hat{R}_{loo}(A_{(S_n,H)}, S_n) + \sqrt{\frac{2}{\delta}\left[ \frac{1}{2n} + 3\left( \hat{H}_{loo}\!\left(A_{(S_{\frac{\sqrt{n}}{2}},H)}, S_{\frac{\sqrt{n}}{2}}\right) + \sqrt{\frac{\log\frac{2}{\delta}}{\sqrt{n}}} \right) \right]} \right].   (66)

The selected optimal classifier will then be f: A_(Sn,H*).
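Putting the pieces together, the in-sample selection rule (65) can be sketched as a plain grid search that reuses the loo_error and hat_H_loo functions from the earlier sketches; the learner factory and the candidate grid format are, again, illustrative assumptions.

```python
# Minimal sketch of stability-based model selection via (65).
import numpy as np
from sklearn.svm import SVC

def stab_selection(X, y, candidates, delta=0.05):
    # relies on loo_error(...) and hat_H_loo(...) defined in the previous sketches
    n = len(y)
    best, best_bound = None, np.inf
    for H in candidates:
        learner = lambda: SVC(C=H["C"], kernel="rbf", gamma=H["gamma"])
        r_loo = loo_error(learner, X, y)                 # hat{R}_loo of (5)
        h_loo = hat_H_loo(learner, X, y)                 # hat{H}_loo of (29)
        inner = 1.0 / np.sqrt(n) + 3.0 * (h_loo + np.sqrt(np.log(2.0 / delta) / np.sqrt(n)))
        bound = r_loo + np.sqrt(2.0 / delta * inner)     # right side of (65)
        if bound < best_bound:
            best, best_bound = H, bound
    return best, best_bound
```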

C. SVM Model Selection With Margin Bound

As a final issue, we also propose, for benchmarking purposes, a theoretical data-independent bound, namely the margin bound: it is less general than the approaches of the previous sections, but it can be applied to SVM as an in-sample method for selecting the best set of SVM hyperparameters [63]. The margin bound builds on the Rademacher complexity [10], [11], [13], a powerful tool for studying the performance of a learning classifier. Let us consider the following trimmed hinge loss function [63]:

\ell_T(f(x), y) = \min\!\left\{ 1, \max\!\left\{ 0, 1 - y_i f(x_i) \right\} \right\} = \min\{1, \xi_i\}   (67)

which is bounded. Then, we consider an algorithm able to select a function f in a hypothesis space F_H. The Rademacher complexity of the class of functions is defined as [12]

\hat{R}(\mathcal{F}_H) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}_H} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, \ell_T(f(x_i), y_i)   (68)

where σ_1, ..., σ_n are n independent random variables for which P[σ_i = +1] = P[σ_i = −1] = 1/2. Note that R̂(F_H) can be computed based on the available data [12]. It is possible to prove that the probability of error R(A_(Sn,H)) can be upper-bounded through empirical quantities only [12]

R(A_{(S_n,H)}) \le \frac{1}{n} \sum_{i=1}^{n} \ell_T(f(x_i), y_i) + \hat{R}(\mathcal{F}_H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.   (69)

As far as SVM is concerned, the bound of (69) can be applied, as described in [63], in order to obtain the margin bound, a data-independent upper bound on the probability of error of a classifier selected with SVM

R(A_{(S_n,H)}) \le \frac{1}{n} \sum_{i=1}^{n} \ell_\xi(f(x_i), y_i) + \frac{4W}{n}\sqrt{T} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}   (70)

where

W^2 = \|w\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)   (71)

T = \sum_{i=1}^{n} K(x_i, x_i).   (72)

Note that, when a Gaussian kernel is used, the bound of (70) becomes

R(A_{(S_n,H)}) \le \frac{1}{n} \sum_{i=1}^{n} \ell_\xi(f(x_i), y_i) + \frac{2W}{\sqrt{n}} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}}   (73)

since K(x_i, x_i) = 1. Based on the SRM principle [4], [12], we can safely exploit (70) [or (73), when applicable] in order to select H* = {C*, γ*}

H^* = \arg\min_{H \in \{H_1, H_2, \ldots\}} \left[ \frac{1}{n} \sum_{i=1}^{n} \ell_\xi(f(x_i), y_i) + \frac{4W}{n}\sqrt{T} + 3\sqrt{\frac{\log\frac{2}{\delta}}{2n}} \right].   (74)

Note that the margin bound is considered a data-independent theoretical bound, since it depends only on the empirical error and on the size of the SVM margin, while the distribution of the data does not affect it. In this sense, it is similar to the Akaike information criterion [65] and the Vapnik–Chervonenkis dimension (d_VC) [4].
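For completeness, the margin bound of (70)–(72) can be evaluated from a trained SVM as sketched below: W² is obtained from the dual coefficients as in (71), and T = n for the Gaussian kernel as in (72). The use of scikit-learn, the assumption that y ∈ {−1, +1}, and the assumption that a numeric γ was passed to the constructor are illustrative choices, not part of the paper.

```python
# Minimal sketch of the margin bound (70) for an SVM with Gaussian kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def margin_bound(clf, X, y, delta=0.05):
    n = len(y)
    sv = X[clf.support_]
    ay = clf.dual_coef_.ravel()                      # y_i * alpha_i on the support vectors
    K = rbf_kernel(sv, sv, gamma=clf.gamma)          # assumes gamma was given as a number
    W = np.sqrt(ay @ K @ ay)                         # (71): W^2 = sum a_i a_j y_i y_j K_ij
    hinge = np.clip(1.0 - y * clf.decision_function(X), 0.0, None)   # xi_i of (54)
    T = float(n)                                     # (72) with K(x_i, x_i) = 1
    return np.mean(hinge) + 4.0 * W / n * np.sqrt(T) + 3.0 * np.sqrt(np.log(2.0 / delta) / (2 * n))

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
print(margin_bound(clf, X, y))
```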

ONETO et al.: FULLY EMPIRICAL AND DATA-DEPENDENT STABILITY-BASED BOUNDS

D. Analysis of the Computational Complexities of the Analyzed Methods

All the procedures introduced above consist in searching for the best combination of the hyperparameters H* ∈ {H1, H2, ...}: thus, we have to estimate the cost of computing the upper bound on the probability of error for each hypothesis space, when the different methods are exploited. In order to compute this upper bound, all methods require the solution of several SVM problems, with different cardinalities of the training set. If the cardinality of the set is n, the computational complexity for finding the SVM solution lies between O(n), in some “lucky” cases or by exploiting additional hypotheses [66], [67], and O(n³) in the worst-case scenario [55], [68], [69]. On average, we can safely assume that the computational complexity of the learning phase of SVM is O(n²) [55], [69]–[71].

Let us start by considering KCV (see Section V-A). In this case, we have to perform K SVM learning procedures, each one with a dataset of cardinality (n − n/K): an approximate number of operations is then K(n − n/K)², leading to an overall complexity of the method of O(Kn²). LOO is a particular case of KCV (see Section V-A), where K = n. Then, its computational complexity is O(n³). In order to reduce it, we can take advantage of the possibility of upper-bounding the LOO error through the number of support vectors [4], [63]. In this case, the computational complexity is equivalent to a single SVM training [i.e., O(n²)], but model selection performance can be compromised. Instead, we suggest exploiting the unlearning strategy proposed in [72], which represents a viable trade-off. The unlearning strategy allows computing the LOO error while only charging a small overhead on the complexity of training a single SVM model [i.e., O(ζn²), where 1 ≤ ζ ≪ n]. It is also worth noting that a multiinstance unlearning technique [73] can also be used in place of the standard KCV approach: however, the overhead due to unlearning multiple patterns can lead to nonbeneficial effects on computational time, especially when K is small with respect to n.

Concerning model selection based on stability (see Section V-B), the first step consists of computing the LOO error. As a consequence, we can opt for the approach described above, which leads to O(ζn²) complexity. Then, we have to complete n/2 SVM learning procedures on training sets of size √n/2 to compute the empirical stability. The number of these operations is approximately (n/2)(√n/2)², thus leading the overall complexity of the method to O(ζn²). Finally, concerning the margin bound-based model selection (see Section V-C), we have only to train one SVM classifier with n samples for each hypothesis space, thus giving a complexity of O(n²).

From a practical point of view, the margin-bound technique is the fastest in terms of computational time, since it only requires learning one SVM model for each combination of the hyperparameters. The LOO procedure comes next, as it is affected only by a small overhead due to the unlearning procedure. Finally, KCV and stability-based procedures are comparable in terms of complexity: KCV usually requires few


TABLE I
DATASETS EXPLOITED IN THIS PAPER: NUMBER OF CLASSES (m), CARDINALITY (n_n), DIMENSIONALITY (d), AND NUMBER OF OVO BINARY CLASSIFIERS GENERATED TO COPE WITH MULTICLASS CLASSIFICATION

learning procedures to be run (because K ≪ n), each one on sets of cardinality close to n. Stability-based model selection, instead, consists of a larger number of learning procedures (n/2), but run on smaller cardinality training sets (√n/2).

VI. EXPERIMENTAL RESULTS

In this section, we compare the performance of the KCV and LOO out-of-sample approaches to SVM model selection, presented in Section V-A, with the in-sample stability-based technique, depicted in Section V-B, and the margin bound-based method of Section V-C. In particular, we make use of a series of real-world datasets.
1) Covertype Dataset: Classification of forest cover types from cartographic variables only [74].
2) DaimlerChrysler Dataset: Pedestrian detection in low-resolution images in the automotive domain [75].
3) Mnist Dataset: Recognition of handwritten digits in low-resolution images [76].
4) NotMnist Dataset: Optical character recognition database, distributed by Google and generated in the framework of the Google Books project [77].
Table I shows the main characteristics of the datasets, where m is the number of classes into which the samples are classified, n_n is the cardinality of the set, and d is its dimensionality. Since we are dealing with binary classification, in the case of multiclass datasets we adopt the one versus one (OVO) procedure [78] in order to derive m(m − 1)/2 binary classification problems: the number of binary problems generated is reported in Table I.

As in-sample approaches, like the stability-based one, are mostly targeted toward small-sample datasets [12], we create small-sample sets Sn by randomly sampling a limited number of patterns, ranging from n = 25 to n = 400; the remaining r = n_n − n data are included in a reference set Sr, used to assess the performance of the different model selection approaches. In order to build statistically relevant results, the entire procedure is repeated 30 times during the experiments. Given that n_n ≫ n, the size of the reference set is such that the error, measured on Sr, can be considered a reasonable approximation of the (unknown) true generalization error. Note that, in order to avoid coping with largely unbalanced problems, which are out of the scope of this paper, we opted for creating almost balanced learning sets Sn.

The experimental setup is the following. We set δ = 5% in the experiments computing KCV, LOO, stability (indicated as STAB in the results from here onward) and margin bound


TABLE II
COVERTYPE—95% CONFIDENCE INTERVALS OF ERROR ON THE REFERENCE SETS, PERFORMED BY THE MODEL SELECTED ACCORDING TO THE KCV, LOO, STAB, AND MARG PROCEDURES. BEST SCORES ARE INDICATED IN BOLDFACE

TABLE III
DAIMLERCHRYSLER—95% CONFIDENCE INTERVALS OF ERROR ON THE REFERENCE SETS, PERFORMED BY THE MODEL SELECTED ACCORDING TO THE KCV, LOO, STAB, AND MARG PROCEDURES. BEST SCORES ARE INDICATED IN BOLDFACE

TABLE IV
MNIST—95% CONFIDENCE INTERVALS OF ERROR ON THE REFERENCE SETS, PERFORMED BY THE MODEL SELECTED ACCORDING TO THE KCV, LOO, STAB, AND MARG PROCEDURES. BEST SCORES ARE INDICATED IN BOLDFACE

(indicated as MARG). We also fix K = 10 for the KCV procedure, as it is a common choice in SVM model-selection applications [12], [79]. As we are using Gaussian kernels for SVM, the hyperparameter set consists of C and γ: the hyperparameter search is performed via a grid search, where C is explored in the range [10^{-5}, 10^2] among 20 points equally spaced on a logarithmic scale, while γ ∈ [10^{-5}, 10^2] is explored among 15 points equally spaced on a logarithmic scale [12], [79]. SVM model selection is implemented according to the procedures depicted in Sections V-A, V-B, and V-C
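The grid described above can be generated, for instance, as follows; this is a minimal sketch, and the dictionary format for the candidates is an assumption carried over from the earlier sketches.

```python
# Minimal sketch of the (C, gamma) grid: 20 x 15 log-spaced values in [1e-5, 1e2].
import numpy as np
from itertools import product

C_grid = np.logspace(-5, 2, 20)
gamma_grid = np.logspace(-5, 2, 15)
candidates = [{"C": c, "gamma": g} for c, g in product(C_grid, gamma_grid)]
print(len(candidates))  # 300 hyperparameter pairs
```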


TABLE V
NOTMNIST—95% CONFIDENCE INTERVALS OF ERROR ON THE REFERENCE SETS, PERFORMED BY THE MODEL SELECTED ACCORDING TO THE KCV, LOO, STAB, AND MARG PROCEDURES. BEST SCORES ARE INDICATED IN BOLDFACE

for KCV, LOO, STAB, and MARG: the learning sets Sn are used in this phase, while, once the optimal classifier is selected, its performance is tested on the reference sets Sr. Note that the hard loss function of (51) is used both for computing the bounds and for evaluating the error of the selected models on Sr.

Results are summarized in Tables II–V for the different datasets: in each table, 95% confidence intervals, computed according to Student's t-distribution, are reported for the error on the reference sets.⁴ Error values are in percentage, while boldface indicates the best result, for a given cardinality n, obtained by KCV, LOO, STAB, and MARG. The last row of each table presents the number of “wins” of the procedures: an approach is considered a winner if it is characterized by a smaller average error rate than the direct antagonist; in case of equal average value, the approach characterized by the narrower confidence interval is preferred. In case of ex-aequo, a win is assigned to both approaches. Some conclusions can be drawn.
1) The proposed in-sample approach, based on STAB, remarkably outperforms the KCV, LOO, and MARG techniques, especially when d ≫ n, namely in the conventional small-sample setting. This is expected from previous results in the literature: out-of-sample methods require “enough” data for the splitting to be effective. Nevertheless, even when n is increased, STAB stands its ground against KCV: this is a new result with respect to other in-sample approaches (e.g., based on Rademacher complexity [10] and maximal discrepancy [13]), which are usually outperformed by KCV as n > 100 (e.g., refer to the results in [12, Table IV] on the DaimlerChrysler dataset).
2) LOO is usually outperformed by KCV. This phenomenon is also underlined in [12].
3) MARG tends to perform relatively well when n is very small, while its performance remarkably worsens as n increases.
4) In addition to allowing model selection, it is also worth mentioning that in-sample approaches (STAB and MARG) can be used for error estimation purposes, without the need of resampling [10], [12], [13]. On the contrary, in case we are interested in error estimation when exploiting KCV, we have to resort to complete-KCV (C-KCV) [22]: C-KCV requires that only (K − 2) folds are used, in turn, for training the model, while one set is used for validation and the remaining one for testing purposes [12], [22]. This further split generally leads to a worsening in KCV performance. Similar issues obviously affect LOO as well.
5) The results derived in this paper allow, to the authors' best knowledge, to effectively implement STAB bounds in practice for the first time in the literature: the results, obtained on the benchmarking real-world datasets, are appealing and encouraging. On the contrary, we performed some tests with the original STAB bound of (63) [1], but the results were of scarce use in practice: the procedure based on the bound of (63) always selects very small values of C, leading to noticeable underfitting and to poorly performing models.

⁴Note that some intervals are of the type 0.0 ± 0.0: in general, every model leads to some misclassifications on the reference sets, thus these null values are mostly due to rounding. However, in these cases, the number of errors is so limited that, for the sake of benchmarking the procedures, it can be safely approximated to 0 for the subsequent discussions.

TABLE VI
95% CONFIDENCE INTERVALS OF COMPUTATIONAL LEARNING (TRAINING AND MODEL SELECTION) TIME REQUIRED BY THE KCV, LOO, STAB, AND MARG PROCEDURES, AVERAGED OVER ALL THE PERFORMED EXPERIMENTS. ALL VALUES ARE REPORTED IN SECONDS

We also analyze, finally, the computational learning time required by KCV, LOO, STAB, and MARG on the different datasets, which includes the time needed to perform both the training and the model selection phases: results are shown in Table VI, where an averaged value depending only on n is reported for the sake of readability and easiness of analysis. Note that we exploited the unlearning approach to speed up LOO, KCV, and the first phase of STAB. Tools for analysis are implemented in MATLAB and experiments are performed on a server running Microsoft Windows Server 2012 and mounting four Intel Xeon E5 4620 2.2 GHz processors and 128 GB RAM. It is worthwhile noting that the effort necessary to complete STAB model selection is the largest recorded, since it requires a LOO round plus the training of (n/2) SVMs (though on smaller cardinality datasets). However, the computational time for STAB is still very close to the KCV one (1.2x/1.3x times slower), which is the only approach able to compete with STAB in terms of error rate. This represents a huge step forward for in-sample data-dependent approaches, which are conventionally known to be remarkably more computationally intensive than out-of-sample alternatives. For example, in [12] a comparison between the computational time needed by conventional in-sample and out-of-sample methods was performed (refer to [12, Table XIV]), where it was clearly shown that in-sample methods based on Rademacher complexity [10] and maximal discrepancy [13] were approximately 10x/15x times slower than KCV, independently of n. For the first time in the literature, to our best knowledge, the results obtained for STAB open new perspectives concerning the applicability of in-sample methods also in the medium-sample setting, i.e., when the number of available samples is comparable to, or slightly larger than, the dimensionality of the problem.

VII. CONCLUSION

In this paper, we derived a fully empirical and data-dependent bound based on the notion of STAB, derived from and inspired by [1] and [2]. STAB bounds have been seldom applied in practice up to now due to their limitations in practical problems; the approach we depicted allows us to perform some steps forward toward exploiting STAB bounds in practice, thanks to the capability of deriving data-dependent measures.

The results obtained by applying the approach of this paper to the case study of SVM model selection demonstrate the potential of the method. Nevertheless, room for improvement exists, mainly concerning the tightness in estimating the generalization error of a learned model and the convergence rate of the bound: as estimations based on STAB seem to be mostly (though not solely) targeted toward small-sample problems, the capability of obtaining effective estimations, even with small values of the cardinality of the learning set, is of crucial importance. In this sense, STAB bounds still have to be improved. As a matter of fact, the bound proposed so far is polynomial: more powerful statistical tools should be developed in order to overcome this limitation (for example, by exploiting recent results on concentration inequalities [37], [80]–[82]). Moreover, the proposed STAB bound is characterized by a slow convergence rate, which could be improved by exploiting the advances on fast convergence rates [83]–[85] developed in the last years. As a final remark, we also point out that this paper opens the door to the application of STAB to other learning algorithms, by checking whether other techniques satisfy the conditions of (25) [or (26)].

REFERENCES

[1] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, Mar. 2002. [2] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, “General conditions for predictivity in learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004. [3] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin, “Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization,” Adv. Comput. Math., vol. 25, no. 1, pp. 161–193, 2006. [4] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, 1998. [5] R. Herbrich and T. Graepel, “A PAC-Bayesian margin bound for linear classifiers,” IEEE Trans. Inf. Theory, vol. 48, no. 12, pp. 3140–3150, Dec. 2002. [6] D. A. McAllester, “PAC-Bayesian stochastic model selection,” Mach. Learn., vol. 51, no. 1, pp. 5–21, 2003. [7] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun, “PAC-Bayes bounds with data dependent priors,” J. Mach. Learn. Res., vol. 13, pp. 3507–3531, Dec. 2012. [8] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999. [9] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, “Structural risk minimization over data-dependent hierarchies,” IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1926–1940, Jan. 1998. [10] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482, Mar. 2003. [11] P. Bartlett, O. Bousquet, and S. Mendelson, “Localized Rademacher complexities,” in Computational Learning Theory. Berlin, Germany: Springer, 2002. [12] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-sample and out-of-sample model selection and error estimation for support vector machines,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1390–1406, Jun. 2012. [13] P. L. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and error estimation,” Mach. Learn., vol. 48, nos. 1–3, pp. 85–113, 2002. [14] I. Guyon, A. Saffari, G. Dror, and G. Cawley, “Model selection: Beyond the Bayesian/frequentist divide,” J. Mach. Learn. Res., vol. 11, pp. 61–87, Jan. 2010. [15] W. H. Rogers and T. J. Wagner, “A finite sample distribution-free performance bound for local discrimination rules,” Ann. Stat., vol. 6, no. 3, pp. 506–514, 1978.


[16] J. R. Quinlan, C4.5: Programs for Machine Learning, vol. 1. San Francisco, CA, USA: Morgan Kaufmann, 1993.
[17] H.-W. Hu, Y.-L. Chen, and K. Tang, “A novel decision-tree method for structured continuous-label classification,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 1734–1746, Jan. 2013.
[18] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[19] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, and Y. C. Soh, “Dynamic extreme learning machine and its approximation capability,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 2054–2065, Feb. 2013.
[20] P. Klesk and M. Korzen, “Sets of approximating functions with finite Vapnik–Chervonenkis dimension for nearest-neighbors algorithms,” Pattern Recognit. Lett., vol. 32, no. 14, pp. 1882–1893, 2011.
[21] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, vol. 38. Philadelphia, PA, USA: SIAM, 1982.
[22] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. Int. Joint Conf. Artif. Intell., San Francisco, CA, USA, 1995, pp. 1137–1143.
[23] U. M. Braga-Neto and E. R. Dougherty, “Is cross-validation valid for small-sample microarray classification?” Bioinformatics, vol. 20, no. 3, pp. 374–380, 2004.
[24] Q. Wu and D.-X. Zhou, “Learning with sample dependent hypothesis spaces,” Comput. Math. Appl., vol. 56, no. 11, pp. 2896–2907, 2008.
[25] A. Cannon, J. M. Ettinger, D. Hush, and C. Scovel, “Machine learning with data dependent hypothesis classes,” J. Mach. Learn. Res., vol. 2, pp. 335–358, Jan. 2002.
[26] A. Rakhlin, S. Mukherjee, and T. Poggio, “Stability results in learning theory,” Anal. Appl., vol. 3, no. 4, pp. 397–417, 2005.
[27] S. Agarwal and P. Niyogi, “Stability and generalization of bipartite ranking algorithms,” in Proc. 18th Annu. Conf. Learn. Theory, Berlin, Germany, 2005, pp. 32–47.
[28] M. Kearns and D. Ron, “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation,” Neural Comput., vol. 11, no. 6, pp. 1427–1453, 1999.
[29] A. Elisseeff, T. Evgeniou, and M. Pontil, “Stability of randomized learning algorithms,” J. Mach. Learn. Res., vol. 6, no. 1, p. 55, 2006.
[30] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” J. Mach. Learn. Res., vol. 11, pp. 2635–2670, Mar. 2010.
[31] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[32] L. Györfi, A Distribution-Free Theory of Nonparametric Regression. London, U.K.: Springer, 2002.
[33] R. Adamczak, “A tail inequality for suprema of unbounded empirical processes with applications to Markov chains,” Electron. J. Probab., vol. 13, no. 34, pp. 1000–1034, 2007.
[34] S. Kutin, “Extensions to McDiarmid’s inequality when differences are bounded with high probability,” Dept. Comp. Sci., Univ. Chicago, Chicago, IL, USA, Tech. Rep. TR-2002-04, 2002.
[35] K. Fukunaga and D. M. Hummels, “Leave-one-out procedures for nonparametric error estimates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 4, pp. 421–423, Aug. 1989.
[36] M. M. Lee, S. S. Keerthi, C. J. Ong, and D. DeCoste, “An efficient method for computing leave-one-out error in support vector machines with Gaussian kernels,” IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 750–757, May 2004.
[37] C. McDiarmid, “On the method of bounded differences,” Surv. Comb., vol. 141, no. 1, pp. 148–188, 1989.
[38] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, vol. 31. New York, NY, USA: Springer, 1996.
[39] L. Devroye and T. Wagner, “Distribution-free inequalities for the deleted and holdout error estimates,” IEEE Trans. Inf. Theory, vol. 25, no. 2, pp. 202–207, Jan. 1979.
[40] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” in Advances in Neural Information Processing Systems. San Mateo, CA, USA: Morgan Kaufmann, 1997, pp. 155–161.
[41] I. Steinwart, “Consistency of support vector machines and other regularized kernel classifiers,” IEEE Trans. Inf. Theory, vol. 51, no. 1, pp. 128–142, Jan. 2005.
[42] R. Dietrich, M. Opper, and H. Sompolinsky, “Statistical mechanics of support vector networks,” Phys. Rev. Lett., vol. 82, no. 14, p. 2975, 1999.
[43] M. Opper, W. Kinzel, J. Kleinz, and R. Nehl, “On the ability of the optimal perceptron to generalise,” J. Phys. A, Math. Gen., vol. 23, no. 11, p. L581, 1990.
[44] M. Opper, “Statistical mechanics of learning: Generalization,” in The Handbook of Brain Theory and Neural Networks, 2nd ed. Cambridge, MA, USA: MIT Press, 2002, pp. 922–925.
[45] S. Mukherjee et al., “Estimating dataset size requirements for classifying DNA microarray data,” J. Comput. Biol., vol. 10, no. 2, pp. 119–142, 2003.
[46] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
[47] M. Aupetit, “Nearly homogeneous multi-partitioning with a deterministic generator,” Neurocomputing, vol. 72, nos. 7–9, pp. 1379–1389, 2009.
[48] D. Anguita, L. Ghelardoni, A. Ghio, L. Oneto, and S. Ridella, “The ‘K’ in K-fold cross validation,” in Proc. Eur. Symp. Artif. Neural Netw., Bruges, Belgium, 2011, pp. 441–446.
[49] D. Anguita, S. Ridella, and F. Rivieccio, “K-fold generalization capability assessment for support vector classifiers,” in Proc. IEEE Int. Joint Conf. Neural Netw., Montreal, QC, Canada, 2005, pp. 855–858.
[50] B. Milenova, J. Yarmus, and M. Campos, “SVM in Oracle database 10g: Removing the barriers to widespread adoption of support vector machines,” in Proc. 31st Int. Conf. Very Large Data Bases, Trondheim, Norway, 2005, pp. 1152–1163.
[51] Z. Xu, M. Dai, and D. Meng, “Fast and efficient strategies for model selection of Gaussian support vector machine,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 5, pp. 1292–1307, Mar. 2009.
[52] T. Glasmachers and C. Igel, “Maximum likelihood model selection for 1-norm soft margin SVMs with multiple parameters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1522–1528, Apr. 2010.
[53] K. De Brabanter, J. De Brabanter, J. Suykens, and B. De Moor, “Approximate confidence and prediction intervals for least squares support vector regression,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 110–120, Nov. 2011.
[54] M. Karasuyama and I. Takeuchi, “Nonlinear regularization path for quadratic loss support vector machines,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1613–1625, Aug. 2011.
[55] J. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208.
[56] L. J. Cao et al., “Parallel sequential minimal optimization for the training of support vector machines,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 1039–1049, Jul. 2006.
[57] B. Schölkopf, “The kernel trick for distances,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 1299–1319.
[58] S. Keerthi and C. Lin, “Asymptotic behaviors of support vector machines with Gaussian kernel,” Neural Comput., vol. 15, no. 7, pp. 1667–1689, 2003.
[59] S. Arlot and A. Celisse, “A survey of cross-validation procedures for model selection,” Stat. Surv., vol. 4, pp. 40–79, 2010.
[60] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of k-fold cross validation in prediction error estimation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 3, pp. 569–575, Mar. 2010.
[61] J. G. Moreno-Torres, J. A. Sáez, and F. Herrera, “Study on the impact of partition-induced dataset shift on k-fold cross-validation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1304–1312, Jun. 2012.
[62] D. Anguita, A. Ghio, S. Ridella, and D. Sterpi, “K-fold cross validation for error rate estimate in support vector machines,” in Proc. Int. Conf. Data Mining, Las Vegas, NV, USA, 2009, pp. 291–297.
[63] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[64] I. Steinwart and A. Christmann, Support Vector Machines. Berlin, Germany: Springer, 2008.
[65] H. Akaike, “A new look at the statistical model identification,” IEEE Trans. Autom. Control, vol. 19, no. 6, pp. 716–723, Dec. 1974.
[66] T. Joachims, “Training linear SVMs in linear time,” in Proc. Int. Conf. Knowl. Disc. Data Mining, Philadelphia, PA, USA, 2006, pp. 217–226.
[67] R. Collobert, S. Bengio, and Y. Bengio, “A parallel mixture of SVMs for very large scale problems,” Neural Comput., vol. 14, no. 5, pp. 1105–1114, 2002.
[68] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Very large SVM training using core vector machines,” in Proc. Int. Workshop Artif. Intell. Stat., 2005, pp. 349–356.
[69] J. Shawe-Taylor and S. Sun, “A review of optimization methodologies in support vector machines,” Neurocomputing, vol. 74, no. 17, pp. 3609–3618, 2011.
[70] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, “Improvements to Platt’s SMO algorithm for SVM classifier design,” Neural Comput., vol. 13, no. 3, pp. 637–649, 2001.
[71] R.-E. Fan, P.-H. Chen, and C.-J. Lin, “Working set selection using second order information for training support vector machines,” J. Mach. Learn. Res., vol. 6, pp. 1889–1918, Dec. 2005.
[72] G. Cauwenberghs and T. Poggio, “Incremental and decremental support vector machine learning,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 409–415.
[73] M. Karasuyama and I. Takeuchi, “Multiple incremental decremental learning of support vector machines,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1048–1059.
[74] J. A. Blackard and D. J. Dean, “Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables,” Comput. Electron. Agr., vol. 24, no. 3, pp. 131–151, 1999.
[75] S. Munder and D. M. Gavrila, “An experimental study on pedestrian classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1863–1868, Nov. 2006.
[76] L. Bottou et al., “Comparison of classifier methods: A case study in handwritten digit recognition,” in Proc. Int. Conf. Pattern Recognit. Conf. B Comput. Vis. Image Process., Jerusalem, Israel, Oct. 1994, pp. 77–82.
[77] Y. Bulatov, “notMNIST dataset,” Google (Books/OCR), Tech. Rep., 2011. [Online]. Available: http://yaroslavvb.blogspot.it/2011/09/notmnist-dataset.html
[78] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[79] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A practical guide to support vector classification,” Dept. Comput. Sci., Nat. Taiwan Univ., Tech. Rep., 2003. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
[80] M. Talagrand, “A new look at independence,” Ann. Probab., vol. 24, no. 1, pp. 1–34, 1996.
[81] S. Boucheron, G. Lugosi, and P. Massart, “A sharp concentration inequality with applications,” Random Struct. Algorithms, vol. 16, no. 3, pp. 277–292, May 2000.
[82] O. Bousquet, “A Bennett concentration inequality and its application to suprema of empirical processes,” C. R. Math., vol. 334, no. 6, pp. 495–500, 2002.
[83] N. Srebro, K. Sridharan, and A. Tewari, “Smoothness, low-noise, and fast rates,” in Proc. Neural Inf. Process. Syst., 2010, pp. 2199–2207.
[84] E. Mammen and A. B. Tsybakov, “Smooth discrimination analysis,” Ann. Stat., vol. 27, no. 6, pp. 1808–1829, 1999.
[85] J.-Y. Audibert, “Fast learning rates in statistical inference through aggregation,” Ann. Stat., vol. 37, no. 4, pp. 1591–1646, 2009.

Luca Oneto (M’11) was born in Rapallo, Italy, in 1986. He received the bachelor’s and master’s degrees in electronic engineering and intelligent systems and statistics from the University of Genoa, Genoa, Italy, in 2008, and the Ph.D. degree in learning based on empirical data from the School of Sciences and Technologies for Knowledge and Information Retrieval, University of Genoa. He is currently a Researcher with the University of Genoa. Since 2010, he has been a Consultant with the Department of Electrical, Electronic, and Telecommunication Engineering and Naval Architecture and the Department of Biophysical and Electronic Engineering, University of Genoa, and has carried out further consulting activities for Mac96, Genoa, and Ansaldo STS, Canning Vale, WA, Australia, in the context of a few European projects. He also teaches several B.Sc. and M.Sc. courses at the University of Genoa. His current research interests include machine learning and statistical learning theory.

Alessandro Ghio (M’07) was born in Chiavari, Italy, in 1982. He received the master’s degree in electronic engineering and the Ph.D. degree in knowledge and information science in 2010, both from the University of Genoa, Genoa, Italy. He is currently a Research Associate and IT Consultant with the Department of Electrical, Electronic, Telecommunications Engineering, and Naval Architecture (DITEN), University of Genoa. His current research interests include both theoretical and practical aspects of smart systems based on computational intelligence and machine learning methods.

Sandro Ridella (M’93) received the Laurea degree in electronic engineering from the University of Genoa, Genoa, Italy, in 1966. He is currently a Full Professor with the Department of Biophysical and Electronic Engineering (DIBE, now the Department of Electrical, Electronic, and Telecommunication Engineering and Naval Architecture), University of Genoa, where he teaches circuits and algorithms for signal processing. His current research interests are in the field of neural networks.

Davide Anguita (SM’12) received the Laurea degree in electronic engineering and the Ph.D. degree in computer science and electronic engineering from the University of Genoa, Genoa, Italy, in 1989 and 1993, respectively. He was a Research Associate with the International Computer Science Institute, Berkeley, CA, USA, where he worked on special-purpose processors for neurocomputing, before returning to the University of Genoa, where he is currently an Associate Professor of Computer Engineering with the Department of Informatics, BioEngineering, Robotics, and Systems Engineering. His current research interests include the theory and application of kernel methods and artificial neural networks.
