

In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines

Davide Anguita, Member, IEEE, Alessandro Ghio, Member, IEEE, Luca Oneto, Member, IEEE, and Sandro Ridella, Member, IEEE

Abstract—In-sample approaches to model selection and error estimation of support vector machines (SVMs) are not as widespread as out-of-sample methods, where part of the data is removed from the training set for validation and testing purposes, mainly because their practical application is not straightforward and the latter provide, in many cases, satisfactory results. In this paper, we survey some recent and not-so-recent results of the data-dependent structural risk minimization framework and propose a proper reformulation of the SVM learning algorithm, so that the in-sample approach can be effectively applied. The experiments, performed both on simulated and real-world datasets, show that our in-sample approach compares favorably to out-of-sample methods, especially in cases where the latter provide questionable results. In particular, when the number of samples is small compared to their dimensionality, as in the classification of microarray data, our proposal can outperform conventional out-of-sample approaches such as cross validation, leave-one-out, or Bootstrap methods.

Index Terms—Bootstrap, cross validation, error estimation, leave one out, model selection, statistical learning theory (SLT), structural risk minimization (SRM), support vector machine (SVM).

I. INTRODUCTION

Model selection addresses the problem of tuning the complexity of a classifier to the available training data, so as to avoid either under- or overfitting [1]. These problems affect most classifiers because, in general, their complexity is controlled by one or more hyperparameters, which must be tuned separately from the training process in order to achieve optimal performance. Some examples of tunable hyperparameters are the number of hidden neurons or the amount of regularization in multilayer perceptrons (MLPs) [2], [3] and the margin/error tradeoff or the value of the kernel parameters in support vector machines (SVMs) [4]–[6]. Strictly related to this problem is the estimation of the generalization error of a classifier: in fact, the main objective of building an optimal classifier is to choose both its parameters and hyperparameters so as to minimize its generalization error, and to compute an estimate of this value for predicting the classification performance on future data. Unfortunately, despite the large amount of work on this important topic, the problem of model selection
and error estimation of SVMs is still open and the object of extensive research [7]–[12].

Among the several methods proposed for this purpose, it is possible to identify two main approaches: out-of-sample and in-sample methods. The first ones are favored by practitioners because they work well in many situations and allow the application of simple statistical techniques for estimating the quantities of interest. Some examples of out-of-sample methods are the well-known k-fold cross validation (KCV), the leave-one-out (LOO), and the Bootstrap (BTS) [13]–[15]. All these techniques rely on a similar idea: the original dataset is resampled, with or without replacement, to build two independent datasets called, respectively, the training and validation (or estimation) sets. The first one is used for training a classifier, while the second one is exploited for estimating its generalization error, so that the hyperparameters can be tuned to achieve its minimum value. Note that both error estimates computed through the training and validation sets are, obviously, optimistically biased; therefore, if a generalization estimate of the final classifier is desired, it is necessary to build a third independent set, called the test set, by nesting two of the resampling procedures mentioned above. Unfortunately, this additional splitting of the original dataset results in a further shrinking of the available learning data and contributes to a further increase of the computational burden. Furthermore, after the learning and model selection phases, the user is left with several classifiers (e.g., k classifiers in the case of KCV), each one with possibly different values of the hyperparameters, and combining them or retraining on the entire dataset to obtain a final classifier can lead to unexpected results [16].

Despite these drawbacks, when a reasonably large amount of data is available, out-of-sample techniques work well. However, there are several settings where their use has been questioned by many researchers [17]–[19]. In particular, the main difficulties arise in the small-sample regime or, in other words, when the size of the training set is small compared to the dimensionality of the patterns. A typical example is the case of microarray data, where less than a hundred samples, each composed of thousands of genes, are often available [20]. In these cases, in-sample methods would be the obvious choice for performing the model selection phase: in fact, they allow exploiting the whole set of available data for both training the model and estimating its generalization error, thanks to the application of rigorous statistical procedures. Despite their unquestionable advantages over the



out-of-sample methods, their use is not widespread: one of the reasons is the common belief that in-sample methods are very useful for gaining deep theoretical insights on the learning process or for developing new learning algorithms, but are not suitable for practical purposes. The SVM itself, which is one of the most successful classification algorithms of the last decade, stems from the well-known Vapnik's structural risk minimization (SRM) principle [5], [21], which represents the seminal approach to in-sample methods. However, SRM is not able, in practice, to estimate the generalization error of the trained classifier or to select its optimal hyperparameters [22], [23]. Similar principles are equally interesting from a theoretical point of view, but seldom useful in practice [21], [24], [25], as they are overly pessimistic. In the past years, some proposals have heuristically adapted the SRM principle to in-sample model selection purposes with some success, but they had to give up its theoretical rigor, thereby compromising its applicability [26].

We present in this paper a new method for applying a data-dependent SRM approach [27] to model selection and error estimation by exploiting new results in the field of statistical learning theory (SLT) [28]. In particular, we describe an in-sample method for applying the data-dependent SRM principle to a slightly modified version of the SVM. Our approach is general, but is particularly effective in performing model selection and error estimation in the small-sample setting: in these cases, it is able to outperform out-of-sample techniques. The novelty of our approach is the exploitation of new results on the maximal discrepancy and Rademacher complexity theory [28], trying not to give up any theoretical rigor, while achieving good performance in practice. Our purpose is not to claim the general superiority of in-sample methods over out-of-sample ones, but to explore the advantages and disadvantages of both approaches in order to understand why and when they can be successfully applied. For this reason, a theoretically rigorous analysis of out-of-sample methods is also presented. Finally, we show that the proposed in-sample method allows using a conventional quadratic programming solver for SVMs to control the complexity of the classifier. In other words, even though we make use of a modified SVM, to allow for the application of the in-sample approach, any well-known optimization algorithm such as, for example, the sequential minimal optimization (SMO) method [29], [30] can be used for performing the training, model selection, and error estimation phases, as in the out-of-sample cases.

This paper is organized as follows. Section II details the classification problem framework and describes the in-sample and out-of-sample general approaches. Sections III and IV survey old and new statistical tools that form the basis for the subsequent analysis of out-of-sample and in-sample methods. Section V proposes a new method for applying the data-dependent SRM approach to the model selection and error estimation of a modified SVM and also details an algorithm for exploiting conventional SVM-specific quadratic programming solvers. Finally, Section VI shows the application of our proposal to real-world small-sample problems, along with a comparison to out-of-sample methods.


II. CLASSIFICATION PROBLEM FRAMEWORK

We consider a binary classification problem with an input space X ⊆ ℝ^d and an output space Y = {−1, +1}. We assume that the data (x, y), with x ∈ X and y ∈ Y, are composed of random variables distributed according to an unknown distribution P, and we observe a sequence of n independent and identically distributed (i.i.d.) pairs D_n = {(x_1, y_1), ..., (x_n, y_n)}, sampled according to P. Our goal is to build a classifier or, in other words, to construct a function f : X → Y, which predicts Y from X. Obviously, we need a criterion to choose f; therefore, we measure the expected error performed by the selected function on the entire data population, i.e., the risk

\[ L(f) = \mathbb{E}_{(\mathcal{X},\mathcal{Y})}\, \ell(f(x), y) \tag{1} \]

where ℓ(f(x), y) is a suitable loss function that measures the discrepancy between the prediction of f and the true y, according to some user-defined criteria. Some examples of loss functions are the hard loss

\[ \ell_I(f(x), y) = \begin{cases} 0, & y f(x) > 0 \\ 1, & \text{otherwise} \end{cases} \tag{2} \]

which is an indicator function that simply counts the number of misclassified samples, the hinge loss, which is used by the SVM algorithm and is a convex upper bound of the previous one [4]

\[ \ell_H(f(x), y) = \max\left(0,\, 1 - y f(x)\right) \tag{3} \]

the logistic loss, which is used for obtaining probabilistic outputs from a classifier [31]

\[ \ell_L(f(x), y) = \frac{1}{1 + e^{-y f(x)}} \tag{4} \]

and, finally, the soft loss [32], [33]

\[ \ell_S(f(x), y) = \min\left(1,\, \max\left(0,\, \frac{1 - y f(x)}{2}\right)\right) \tag{5} \]

which is a piecewise linear approximation of the former and a clipped version of the hinge loss.

Let us consider a class of functions F. Thus, the optimal classifier f^* ∈ F is

\[ f^* = \arg\min_{f \in \mathcal{F}} L(f). \tag{6} \]

Since P is unknown, we cannot directly evaluate the risk, nor find f^*. The only available option is to devise a method for selecting F, and consequently f^*, based on the available data and, eventually, some a priori knowledge. Note that the model selection problem consists, generally speaking, in the identification of a suitable F: in fact, the hyperparameters of the classifier affect, directly or indirectly, the function class where the learning algorithm searches for the, possibly optimal, function [21], [34]. The empirical risk minimization (ERM) approach suggests the estimation of the true risk L(f) by its empirical version

\[ \hat{L}_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) \tag{7} \]

so that

\[ f_n^* = \arg\min_{f \in \mathcal{F}} \hat{L}_n(f). \tag{8} \]

Unfortunately, L̂_n(f) typically underestimates L(f) and can lead to severe overfitting because, if the class of functions is sufficiently large, it is always possible to find a function that perfectly fits the data but shows poor generalization capability. For this reason, it is necessary to perform a model selection step, by selecting an appropriate F, so as to avoid classifiers that are prone to overfitting the data. A typical approach is to study the random variable L(f) − L̂_n(f), which represents the generalization bias of the classifier f. In particular, given a user-defined confidence value δ, the objective is to bound the probability that the true risk exceeds the empirical one

\[ P\left[ L(f) \geq \hat{L}_n(f) + \epsilon \right] \leq \delta \tag{9} \]

leading to bounds of the following form, which hold with probability (1 − δ):

\[ L(f) \leq \hat{L}_n(f) + \epsilon. \tag{10} \]
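To make the quantities defined above concrete, the short sketch below implements the four losses (2)–(5) and the empirical risk (7) for a generic real-valued decision function. It is only an illustration of the definitions; the function names and the use of NumPy are choices made here, not part of the paper.

```python
import numpy as np

def hard_loss(margin):      # (2): 1 if y*f(x) <= 0, else 0
    return (margin <= 0).astype(float)

def hinge_loss(margin):     # (3): max(0, 1 - y*f(x)), unbounded above
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):  # (4), as printed: 1 / (1 + exp(-y*f(x)))
    return 1.0 / (1.0 + np.exp(-margin))

def soft_loss(margin):      # (5): hinge clipped to [0, 1]
    return np.clip((1.0 - margin) / 2.0, 0.0, 1.0)

def empirical_risk(f, X, y, loss):
    """(7): average loss of the decision function f on the sample (X, y)."""
    margin = y * f(X)                 # y_i * f(x_i) for each sample
    return float(np.mean(loss(margin)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.where(X[:, 0] + 0.1 * rng.normal(size=100) >= 0, 1.0, -1.0)
    w, b = np.ones(5), 0.0            # an arbitrary linear decision function f(x) = w.x + b
    f = lambda X: X @ w + b
    for loss in (hard_loss, hinge_loss, logistic_loss, soft_loss):
        print(loss.__name__, empirical_risk(f, X, y, loss))
```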

Equation (10) can be used to select an optimal class of functions and, consequently, an optimal classifier, by minimizing the term on the right side of the inequality. The out-of-sample approach suggests the use of an independent dataset, sampled from the same data distribution that generated the training set, so that the bound is valid, for any classifier, even after it has learned the D_n set. Given m additional samples D_m = {(x_1, y_1), ..., (x_m, y_m)} and given a classifier f_n, which has been trained on D_n, its generalization error can be upper bounded, in probability, according to

\[ L(f_n) \leq \hat{L}_m(f_n) + \epsilon \tag{11} \]

where the bound holds with probability (1 − δ). Then the model selection phase can be performed by varying the hyperparameters of the classifier until the right side of (11) reaches its minimum. In particular, let us consider several function classes F_1, F_2, ..., indexed by different values of the hyperparameters; then the optimal classifier f_n^* = f_{n,i^*}^* is the result of the following minimization process:

\[ i^* = \arg\min_i \left[ \hat{L}_m\left(f_{n,i}^*\right) + \epsilon \right] \tag{12} \]

where

\[ f_{n,i}^* = \arg\min_{f \in \mathcal{F}_i} \hat{L}_n(f). \tag{13} \]
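As an illustration of (12) and (13), the following sketch performs the selection over a grid of hyperparameter values using a held-out validation set. The linear SVM from scikit-learn is used only as a stand-in for a generic learning algorithm, and the confidence term epsilon is left as a user-supplied function; both are assumptions of this sketch, not part of the paper's method.

```python
import numpy as np
from sklearn.svm import SVC

def select_on_validation(X_tr, y_tr, X_val, y_val, C_grid, epsilon):
    """Pick the hyperparameter minimizing validation error + epsilon, as in (12)-(13)."""
    best = None
    for C in C_grid:
        clf = SVC(kernel="linear", C=C).fit(X_tr, y_tr)      # f*_{n,i}: ERM inside class F_i
        L_val = np.mean(clf.predict(X_val) != y_val)          # hard-loss estimate on D_m
        score = L_val + epsilon(L_val, len(y_val))            # right side of (11)
        if best is None or score < best[0]:
            best = (score, C, clf)
    return best  # (bound value, selected C, selected classifier)

# A simple Hoeffding-style confidence term with delta = 0.05 (an assumption for illustration)
epsilon = lambda L_hat, m: np.sqrt(np.log(2 / 0.05) / (2 * m))
```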

Note that, if we are interested in estimating the generalization error of f_n^*, we need to apply again the bound of (11), but using some data (i.e., the test set) that has not been involved in this procedure. It is also worth mentioning that the partition of the original dataset into training and validation (and eventually test) sets can affect the tightness of the bound, owing to lucky or unlucky splittings, and, therefore, its effectiveness. This is a major issue for out-of-sample methods and several heuristics have been proposed in the literature for dealing with this problem (e.g., stratified sampling or topology-based splitting [35]), but they will not be analyzed here, as they are outside of the scope of this paper.

Fig. 1. SRM principle.

Equation (11) can be rewritten as

\[ L(f_n) \leq \hat{L}_n(f_n) + \left( \hat{L}_m(f_n) - \hat{L}_n(f_n) \right) + \epsilon \tag{14} \]

clearly showing that the out-of-sample approach can be considered as a penalized ERM, where the penalty term takes into account the discrepancy between the classifier performance on the training and the validation set. This formulation also explains other approaches to model selection such as, for example, the early stopping procedure, which is widely used in neural network learning [36]. In fact, (14) suggests stopping the learning phase when the performances of the classifier on the training and validation set begin to diverge.

The in-sample approach, instead, targets the use of the same dataset for learning, model selection, and error estimation, without resorting to additional samples. In particular, this approach can be summarized as follows: a learning algorithm takes as input the data D_n and produces a function f_n and an estimate of the error L̂_n(f_n), which is a random variable depending on the data themselves. As we cannot know a priori which function will be chosen by the algorithm, we consider uniform deviations of the error estimate

\[ L(f_n) - \hat{L}_n(f_n) \leq \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_n(f) \right). \tag{15} \]

Then, the model selection phase can be performed according to a data-dependent version of the SRM framework [27], which suggests choosing a possibly infinite sequence {F_i, i = 1, 2, ...} of model classes of increasing complexity, F_1 ⊆ F_2 ⊆ ... (Fig. 1), and minimizing the empirical risk in each class with an added penalty term, which, in our case, gives rise to bounds of the following form:

\[ L(f_n) \leq \hat{L}_n(f_n) + \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_n(f) \right). \tag{16} \]

From (16) it is clear that the price to pay for avoiding the use of additional validation/test sets is the need to take into account the behavior of the worst possible classifier in the class, while the out-of-sample approach focuses on the actual learned classifier. Applying (16) for model selection and error estimation purposes is a straightforward operation, at least in theory.



A very small function class is selected and its size (i.e., its complexity) is increased, by varying the hyperparameters, until the bound reaches its minimum, which represents the optimal tradeoff between under- and overfitting, and therefore identifies the optimal classifier

\[ f_n^* = \arg\min_{\mathcal{F}_i \in \{\mathcal{F}_1, \mathcal{F}_2, \ldots\}} \left[ \min_{f \in \mathcal{F}_i} \hat{L}_n(f) + \sup_{f \in \mathcal{F}_i} \left( L(f) - \hat{L}_n(f) \right) \right]. \tag{17} \]

Furthermore, after the procedure has been completed, the value of (16) provides, by construction, a probabilistic upper bound of the error rate of the selected classifier.

A. Need for a Strategy for Selecting the Class of Functions

From the previous analysis, it is obvious that the class of functions F, where the classifier f_n^* is searched, plays a central role in the successful application of the in-sample approach. However, in the conventional data-dependent SRM formulation, the function space is arbitrarily centered and this choice severely influences the sequence F_i. Its detrimental effect on (16) can be clearly understood through the example of Fig. 2, where we suppose to know the optimal classifier f^*. In this case, in fact, the hypothesis space F_{i*}, which includes f_n^*, is characterized by a large penalty term and, thus, f_n^* is greatly penalized with respect to other models. Furthermore, the bound of (16) becomes very loose, as the penalty term takes into account the entire F_{i*} class, so the generalization error of the chosen classifier is greatly overestimated.

Fig. 2. Hypothesis spaces with different centroids.

The main problem lies in the fact that f_n^* is "far" from the a priori chosen centroid f_0 of the sequence F_i. On the contrary, if we were able to define a sequence of function classes F_i, i = 1, 2, ..., centered on a function f_0 sufficiently close to f^*, the penalty term would be noticeably reduced and we would be able to improve both the SRM model selection and error estimation phases, by choosing a model close to the optimal one. We argue that this is one of the main reasons why the in-sample approach has not been considered effective so far, and that this line of research should be better explored if we are interested in building better classification algorithms and, at the same time, more reliable performance estimates.

In the recent literature, the data-dependent selection of the centroid has been theoretically approached, for example, in [27], but only a few authors have proposed methods for dealing with this problem [37]. One example is the use of localized Rademacher complexities [38] or, in other words, the study of penalty terms that take into account only the classifiers with low empirical error: although this approach is very interesting from a theoretical point of view, its application in practice is not evident. A more practical approach has been proposed by Vapnik and others [21], [39], introducing the concept of Universum, i.e., a dataset composed of samples that do not belong to any class represented in the training set. However, no generalization bounds, such as the ones that will be presented here, have been proposed for this approach.

III. THEORETICAL ANALYSIS OF OUT-OF-SAMPLE TECHNIQUES

Out-of-sample methods are favored by practitioners because they work well in many real cases and are simple to implement. Here we present a rigorous statistical analysis of two well-known out-of-sample methods for model selection and error estimation: the KCV and BTS. The philosophy of the two methods is similar: part of the data is left out from the training set and is used for estimating the error of the classifier that has been found during the learning phase. The splitting of the original training data is repeated several times, in order to average out the unlucky cases; therefore, the entire procedure produces several classifiers (one for each data splitting). Note that, from the point of view of our analysis, it is not statistically correct to select one of the trained classifiers to perform the classification of new samples because, in this case, the samples of the validation (or test) set would not be i.i.d. anymore. For this reason, every time a new sample is received, the user should randomly select one of the classifiers, so that the error estimation bounds, one for each trained classifier, can be safely averaged.

A. Bounding the True Risk With Out-of-Sample Data

Depending on the chosen loss function, we can apply different statistical tools to estimate the classifier error. When dealing with the hard loss, we are considering sums of Bernoulli random variables, so we can use the well-known one-sided Clopper–Pearson bound [40], [41]. Given t = m L̂_m(f_n) misclassifications, and defining p = L̂_m(f_n) + ε, the errors follow a binomial distribution

\[ B(t; m, p) = \sum_{j=0}^{t} \binom{m}{j} p^j (1 - p)^{m-j} \tag{18} \]

so we can bound the generalization error by computing the inverse of the binomial tail

\[ \epsilon^*\left(\hat{L}_m, m, \delta\right) = \max\left\{ \epsilon : B\left(t; m, \hat{L}_m + \epsilon\right) \geq \delta \right\} \tag{19} \]

and, therefore, with probability (1 − δ)

\[ L(f_n) \leq \hat{L}_m(f_n) + \epsilon^*\left(\hat{L}_m, m, \delta\right). \tag{20} \]
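A minimal sketch of how the inverse binomial tail (19) can be computed numerically, using SciPy's binomial CDF and a simple bisection on ε; the function name and the bisection tolerance are choices made here for illustration.

```python
from scipy.stats import binom

def epsilon_star(L_hat, m, delta, tol=1e-9):
    """Largest eps such that the binomial tail B(t; m, L_hat + eps) is still >= delta, as in (19)."""
    t = int(round(L_hat * m))          # number of observed misclassifications
    lo, hi = 0.0, 1.0 - L_hat          # eps cannot push p = L_hat + eps beyond 1
    while hi - lo > tol:
        eps = 0.5 * (lo + hi)
        if binom.cdf(t, m, L_hat + eps) >= delta:
            lo = eps                    # tail still large enough: this eps is feasible
        else:
            hi = eps
    return lo

# Example: 5 errors out of m = 100 validation samples, delta = 0.05
print(epsilon_star(0.05, 100, 0.05))   # one-sided upper confidence margin on L(f_n)
```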

More explicit bounds, albeit less sharp, are available in the literature that allow us to gain a better insight into the behavior



of the error estimate. Among them are the so-called empirical Chernoff bound [40] and the more recent empirical Bernstein bound [42], [43], which is valid for bounded random variables, including Bernoulli ones

\[ L(f_n) \leq \hat{L}_m(f_n) + s \sqrt{\frac{2 \ln\frac{2}{\delta}}{m}} + \frac{7 \ln\frac{2}{\delta}}{3(m-1)} \tag{21} \]

where s = √(L̂_m(f_n)(1 − L̂_m(f_n))). This bound can be easily related to (11) and shows clearly that the classifier error decays at a pace between O(n^{-1}) and O(n^{-1/2}), depending on the performance of the trained classifier on the validation dataset.

The hinge loss, unfortunately, gives rise to bounds that decay at a much slower pace with respect to the previous one. In fact, as noted in [44] almost 50 years ago, when dealing with a positive unbounded random variable, the Markov inequality cannot be improved, as there are some distributions for which the equality is attained. Therefore, the assumption that the loss function is bounded becomes crucial to obtain any improvement over the Markov bound. On the contrary, the soft and the logistic losses show a behavior similar to the hard loss; in fact, (21) can be used in these cases as well.

In this paper, however, we propose to use a tighter bound that was conceived by Hoeffding in [44] and has been neglected in the literature, mainly because it cannot be presented in a closed form. With our notation, the bound is

\[ P\left[ L(f) - \hat{L}_m(f) > \epsilon \right] \leq \left[ \left( \frac{\hat{L}_m(f)}{\hat{L}_m(f) + \epsilon} \right)^{\hat{L}_m(f) + \epsilon} \left( \frac{1 - \hat{L}_m(f)}{1 - \hat{L}_m(f) - \epsilon} \right)^{1 - \hat{L}_m(f) - \epsilon} \right]^m. \tag{22} \]

By equating the right part of (22) to δ and solving it numerically, we can find the value ε^*, which can be inserted in (20), as a function of δ, m, and L̂_m(f). Note that the above inequality is, in practice, as tight as the one derived by the application of the Clopper–Pearson bound [45], [46], is numerically more tractable, and is valid for both hard and soft losses, so it will be used in the rest of the paper when dealing with the out-of-sample approach.

B. KCV

The KCV technique consists in splitting a dataset into k independent subsets and using, in turn, all but one set to train the classifier, while using the remaining set to estimate the generalization error. When k = n, this becomes the well-known LOO technique, which is often used in the small-sample setting, because all the samples except one are used for training the model [14]. If our target is to perform both the model selection and the error estimation of the final classifier, a nested KCV is required, where (k − 2) subsets are used, in turn, for the training phase, one is used as a validation set to optimize the hyperparameters, and the last one is used as a test set to estimate the generalization error. Note that O(k^2) training steps are necessary in this case.

To guarantee the statistical soundness of the KCV approach, one of the k trained classifiers must be randomly chosen before classifying a new sample. This procedure is seldom used in practice because, usually, one retrains a final classifier on the entire training data: however, as pointed out by many authors, we believe that this heuristic procedure is the one to blame for the unexpected and inconsistent results of the KCV technique in the small-sample setting. If the nested KCV procedure is applied, to guarantee the independence of the training, validation, and test sets, the generalization error can be bounded by [40], [44], [46]

\[ L(f_n^*) \leq \frac{1}{k} \sum_{j=1}^{k} \left[ \hat{L}^{\,j}_{n/k}\left(f^*_{n,j}\right) + \epsilon^*\left(\hat{L}^{\,j}_{n/k}, \frac{n}{k}, \delta\right) \right] \tag{23} \]

where L̂^j_{n/k}(f^*_{n,j}) is the error performed by the j-th optimal classifier on the corresponding test set, composed of (n/k) samples, and f_n^* is the randomly selected classifier. It is interesting to note that, for the LOO procedure, (n/k) = 1, so the bound becomes useless, in practice, for any reasonable value of the confidence δ. This is another hint that the LOO procedure should be used with care, as this result raises a strong concern about its reliability, especially in the small-sample setting, which is the setting of choice for LOO.
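The following sketch illustrates the averaged k-fold estimate in the spirit of (23) on top of a standard linear SVM. Here scikit-learn's KFold and SVC are used as stand-ins for the learning machinery, and a simple Hoeffding-style confidence term replaces the exact ε* of (19) to keep the example self-contained; these are assumptions of the sketch, not the paper's exact recipe.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def kcv_error_bound(X, y, C=1.0, k=10, delta=0.05, seed=0):
    """Average of per-fold test error plus a confidence term, in the spirit of (23)."""
    terms = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        m = len(test_idx)
        L_hat = np.mean(clf.predict(X[test_idx]) != y[test_idx])   # hard-loss error on the held-out fold
        eps = np.sqrt(np.log(2 / delta) / (2 * m))                 # Hoeffding stand-in for eps*(L_hat, m, delta)
        terms.append(L_hat + eps)
    return float(np.mean(terms))
```

For simplicity, the sketch evaluates each fold's classifier on a single held-out fold rather than using the nested train/validation/test scheme described above.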

C. BTS

The BTS method is a pure resampling technique: at each j-th step, a training set, with the same cardinality as the original one, is built by sampling the patterns with replacement. The remaining data, which consist, on average, of approximately 36.8% of the original dataset, are used to compose the validation set. The procedure is then repeated several times (1 ≤ N_B ≤ \binom{2n-1}{n}) in order to obtain statistically sound results [13]. As for the KCV, if the user is interested in performing the error estimation of the trained classifiers, a nested BTS is needed, where the sampling procedure is repeated twice in order to create both a validation set and a test set. If we suppose that the test set consists of m_j patterns, then, after the model selection phase, we will be left with N_B different models, for which the average generalization error can be expressed as

\[ L(f_n^*) \leq \frac{1}{N_B} \sum_{j=1}^{N_B} \left[ \hat{L}_{m_j}\left(f^*_{m_j,\,j}\right) + \epsilon^*\left(\hat{L}_{m_j}, m_j, \delta\right) \right]. \tag{24} \]

As can be seen by comparing (23) and (24), the KCV and the BTS are equivalent except for the different sampling approach of the original dataset.

IV. THEORETICAL ANALYSIS OF IN-SAMPLE TECHNIQUES

As detailed in the previous sections, the main objective of in-sample techniques is to upper-bound the supremum on the right side of (16), so we need a bound that holds simultaneously for all functions in a class. The maximal discrepancy and Rademacher complexity are two different statistical tools that can be exploited for such purposes. The Rademacher


complexity of a class of functions F is defined as

\[ \hat{R}(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i\, \ell(f(x_i), y_i) \tag{25} \]

where σ_1, ..., σ_n are n independent random variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2. An upper bound of L(f) in terms of R̂(F) was proposed in [47], and the proof is mainly an application of the following result, known as McDiarmid's inequality.

Theorem 1 [48]: Let Z_1, ..., Z_n be independent random variables taking values in a set Z, and assume that g : Z^n → ℝ is a function satisfying

\[ \sup_{z_1, \ldots, z_n, \hat{z}_i} \left| g(z_1, \ldots, z_n) - g(z_1, \ldots, \hat{z}_i, \ldots, z_n) \right| < c_i. \]

Then, for any ε > 0

\[ P\left\{ g(z_1, \ldots, z_n) - \mathbb{E}\left\{ g(z_1, \ldots, z_n) \right\} \geq \epsilon \right\} < e^{-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}} \]

\[ P\left\{ \mathbb{E}\left\{ g(z_1, \ldots, z_n) \right\} - g(z_1, \ldots, z_n) \geq \epsilon \right\} < e^{-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}}. \]

In other words, the theorem states that, if by replacing the i-th coordinate z_i by any other value g changes by at most c_i, then the function is sharply concentrated around its mean. Using McDiarmid's inequality, it is possible to bound the supremum of (16), thanks to the following theorem [47]. We detail here a simplified proof, which also corrects some of the errors that appear in [47].

Theorem 2: Given a dataset D_n, consisting of n patterns x_i ∈ X ⊆ ℝ^d, and given a class of functions F and a loss function ℓ(·, ·) ∈ [0, 1], then

\[ P\left\{ \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_n(f) \right) \geq \hat{R}(\mathcal{F}) + \epsilon \right\} \leq 2 \exp\left( \frac{-2 n \epsilon^2}{9} \right). \tag{26} \]

Proof: Let us consider a ghost sample D'_n = {(x'_i, y'_i)}, composed of n patterns generated from the same probability distribution of D_n. To simplify the notation, we define ℓ_i = ℓ(f(x_i), y_i) and ℓ'_i = ℓ(f(x'_i), y'_i). The following upper bound holds:

\[ \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_n(f) \right) \tag{27} \]
\[ = \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \left[ \hat{L}'_n(f) \right] - \hat{L}_n(f) \right) \tag{28} \]
\[ \leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \sup_{f \in \mathcal{F}} \left( \hat{L}'_n(f) - \hat{L}_n(f) \right) \tag{29} \]
\[ = \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left( \ell'_i - \ell_i \right) \tag{30} \]
\[ = \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{(\mathcal{X}',\mathcal{Y}')} \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \left( \ell'_i - \ell_i \right) \tag{31} \]
\[ \leq 2\, \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \ell_i \tag{32} \]
\[ = \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \tag{33} \]

from which we obtain

\[ \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \sup_{f \in \mathcal{F}} \left( L(f) - \hat{L}_n(f) \right) \leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}). \tag{34} \]

For the sake of simplicity, let us define Ŝ(F) = sup_{f∈F} ( L(f) − L̂_n(f) ). Then, by using McDiarmid's inequality, we know that Ŝ(F) is sharply concentrated around its mean

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + \epsilon \right\} \leq e^{-2 n \epsilon^2} \tag{35} \]

because the loss function is bounded. Therefore, combining these two results, we obtain

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) + \epsilon \right\} \tag{36} \]
\[ \leq P\left\{ \hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + \epsilon \right\} \leq e^{-2 n \epsilon^2}. \tag{37} \]

We are interested in bounding L(f) with R̂(F), and so we can write

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + \epsilon \right\} \tag{38} \]
\[ \leq P\left\{ \hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + a \epsilon \right\} + P\left\{ \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + (1 - a) \epsilon \right\} \tag{39} \]
\[ \leq e^{-2 n a^2 \epsilon^2} + e^{-\frac{n}{2} (1 - a)^2 \epsilon^2} \tag{40} \]

where, in the last step, we applied McDiarmid's inequality again. By setting a = 1/3 we have

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + \epsilon \right\} \leq 2 e^{\frac{-2 n \epsilon^2}{9}}. \tag{41} \]

The previous theorem allows us to obtain the main result by fixing a confidence δ and solving (26) with respect to ε, so as to obtain the following explicit bound, which holds with probability (1 − δ):

\[ L(f_n) \leq \hat{L}_n(f_n) + \hat{R}(\mathcal{F}) + 3 \sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \tag{42} \]

The approach based on the maximal discrepancy is similar to the previous one and provides similar results. For the sake of brevity, we refer the reader to [32], [49] for the complete proofs and to [32] for a comparison of the two approaches: here we give only the final results. Let us split D_n in two halves, D_{n/2}^{(1)} and D_{n/2}^{(2)}, and compute the corresponding empirical errors as

\[ \hat{L}^{(1)}_{n/2}(f) = \frac{2}{n} \sum_{i=1}^{n/2} \ell(f(x_i), y_i) \tag{43} \]

\[ \hat{L}^{(2)}_{n/2}(f) = \frac{2}{n} \sum_{i=n/2+1}^{n} \ell(f(x_i), y_i). \tag{44} \]

Then, the maximal discrepancy M̂(F) of F is defined as

\[ \hat{M}(\mathcal{F}) = \max_{f \in \mathcal{F}} \left( \hat{L}^{(1)}_{n/2}(f) - \hat{L}^{(2)}_{n/2}(f) \right) \tag{45} \]



and, under the same hypothesis of Theorem 2, the following bound holds, with probability (1 − δ):

\[ L(f) \leq \hat{L}_n(f) + \hat{M}(\mathcal{F}) + 3 \sqrt{\frac{\log\frac{2}{\delta}}{2n}}. \tag{46} \]

A. In-Sample Approach in Practice

The theoretical analysis of the previous section does not clarify how the in-sample techniques can be applied in practice and, in particular, how they can be used to develop effective model selection and error estimation phases for SVMs. The first problem, analogous to the case of the out-of-sample approach, is related to the boundedness requirement of the loss function, which is not satisfied by the SVM hinge loss. A recent result appears to be promising in generalizing McDiarmid's inequality to the case of (almost) unbounded functions [50], [51].

Theorem 3 [50]: Let Z_1, ..., Z_n be independent random variables taking values in a set Z and assume that a function g : Z^n → [−A, A] ⊆ ℝ satisfies

\[ \sup_{z_1, \ldots, z_n, \hat{z}_i} \left| g(z_1, \ldots, z_n) - g(z_1, \ldots, \hat{z}_i, \ldots, z_n) \right| < c_n \quad \forall i \]

on a subset G ⊆ Z^n with probability 1 − δ_n, while, ∀ {z_1, ..., z_n} ∈ Ḡ, ∃ ẑ_i ∈ Z such that

\[ c_n < \left| g(z_1, \ldots, z_n) - g(z_1, \ldots, \hat{z}_i, \ldots, z_n) \right| \leq 2A \]

where Ḡ ∪ G = Z^n; then, for any ε > 0

\[ P\left\{ \left| g - \mathbb{E}[g] \right| \geq \epsilon \right\} \leq 2 e^{-\frac{\epsilon^2}{8 n c_n^2}} + \frac{2 A n \delta_n}{c_n}. \]

In other words, the theorem states that if g satisfies the same conditions of Theorem 1, with high probability (1 − δ_n), then the function is (almost) concentrated around its mean. Unfortunately, the bound is exponential only if it is possible to show that δ_n decays exponentially, which requires us to introduce some constraints on the probability distribution generating the data. As we are working in the agnostic case, where no hypothesis on the data is assumed, this approach is outside of the scope of this paper, although it opens some interesting research cases such as, for example, when some additional information on the data is available.

The use of the soft loss function, instead, which is bounded in the interval [0, 1] and will be adapted to the SVM in the following sections, allows us to apply the bound of (42). By noting that the soft loss satisfies the symmetry property

\[ \ell(f(x), y) = 1 - \ell(f(x), -y) \tag{47} \]

it can be shown that the Rademacher complexity can be easily computed by learning a modified dataset. In fact, let us define I^+ = {i : σ_i = +1} and I^- = {i : σ_i = −1}; then

\[ \hat{R}(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i \ell_i \tag{48} \]
\[ = 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left[ \frac{2}{n} \sum_{i \in I^+} \left( \ell(f_i, y_i) - 1 \right) - \frac{2}{n} \sum_{i \in I^-} \ell(f_i, y_i) \right] \tag{49} \]
\[ = 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left[ -\frac{2}{n} \sum_{i \in I^+} \ell(f_i, -y_i) - \frac{2}{n} \sum_{i \in I^-} \ell(f_i, y_i) \right] \tag{50} \]
\[ = 1 + \mathbb{E}_{\boldsymbol{\sigma}} \sup_{f \in \mathcal{F}} \left[ -\frac{2}{n} \sum_{i=1}^{n} \ell(f_i, -\sigma_i y_i) \right] \tag{51} \]
\[ = 1 - \mathbb{E}_{\boldsymbol{\sigma}} \inf_{f \in \mathcal{F}} \left[ \frac{2}{n} \sum_{i=1}^{n} \ell(f_i, \sigma_i) \right]. \tag{52} \]

In other words, the Rademacher complexity of the class F can be computed by learning the original dataset, but where the labels have been randomly flipped. Analogously, it can be proved that the maximal discrepancy of the class F can be computed by learning the original dataset, where the labels of the samples in D^{(2)}_{n/2} have been flipped [28].

The second problem that must be addressed is to find an efficient way to compute the quantities R̂ and M̂, avoiding the computation of E_σ[·], which would require N = 2^n training phases. A simple approach is to adopt a Monte Carlo estimation of this quantity, by computing

\[ \hat{R}_k(\mathcal{F}) = \frac{1}{k} \sum_{j=1}^{k} \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^{n} \sigma_i^{\,j}\, \ell(f(x_i), y_i) \tag{53} \]

where 1 ≤ k ≤ N is the number of Monte Carlo trials. The effect of computing R̂_k, instead of R̂, can be made explicit by noting that the Monte Carlo trials can be modeled as a sampling without replacement from the N possible label configurations. Then, we can apply any bound for the tail of the hypergeometric distribution such as, for example, Serfling's bound [52], to write

\[ P\left\{ \hat{R}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + \epsilon \right\} \leq e^{-\frac{2 k \epsilon^2}{1 - \frac{k-1}{N}}}. \tag{54} \]

We know that

\[ \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) \leq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \tag{55} \]

and, moreover

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + a \epsilon \right\} \tag{56} \]
\[ \leq P\left\{ \hat{S}(\mathcal{F}) \geq \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{S}(\mathcal{F}) + a_1 \epsilon \right\} + P\left\{ \mathbb{E}_{(\mathcal{X},\mathcal{Y})} \hat{R}(\mathcal{F}) \geq \hat{R}(\mathcal{F}) + a_2 \epsilon \right\} + P\left\{ \hat{R}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + a_3 \epsilon \right\} \tag{57} \]
\[ \leq e^{-2 n a_1^2 \epsilon^2} + e^{-\frac{n}{2} a_2^2 \epsilon^2} + e^{-\frac{2 k a_3^2 \epsilon^2}{1 - \frac{k-1}{N}}} \tag{58} \]

where a = a_1 + a_2 + a_3. So, by setting a_1 = 1/4, a_2 = 1/2, and a_3 = (1/4)√(n(1 − (k−1)/N)/k), we have

\[ P\left\{ \hat{S}(\mathcal{F}) \geq \hat{R}_k(\mathcal{F}) + \frac{3 + \sqrt{\frac{n\left(1 - \frac{k-1}{N}\right)}{k}}}{4}\, \epsilon \right\} \leq 3 e^{-\frac{n \epsilon^2}{8}}. \tag{59} \]

Then, with probability (1 − δ)

\[ L(f) \leq \hat{L}_n(f) + \hat{R}_k(\mathcal{F}) + \left( 3 + \sqrt{\frac{n\left(1 - \frac{k-1}{N}\right)}{k}} \right) \sqrt{\frac{\ln\frac{3}{\delta}}{2n}} \tag{60} \]

which recovers the bound of (42), for k → N, up to some constants. The maximal discrepancy approach results in a very similar bound (the proofs can be found in [32])

\[ L(f) \leq \hat{L}_n(f) + \frac{1}{k} \sum_{j=1}^{k} \hat{M}^{(j)}(\mathcal{F}) + 3 \sqrt{\frac{\log\frac{2}{\delta}}{2n}} \tag{61} \]

which holds with probability (1−δ) and where k is the number of random shuffles of Dn before splitting it into two halves. Note that, in this case, the confidence term does not depend on k: this is a consequence of retaining the information provided by the labels yi , which is lost in the Rademacher complexity approach.
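The sketch below shows the Monte Carlo estimate (53), computed through the label-flipping identity (52), together with the resulting bound (60). A standard hinge-loss SVM is used as an approximate minimizer of the empirical soft loss (an assumption made only to keep the example short; the paper's actual procedure solves the modified problem of Section V), and N = 2^n is treated as effectively infinite.

```python
import numpy as np
from sklearn.svm import SVC

def soft_loss(margin):
    # (5): clipped hinge, bounded in [0, 1]
    return np.clip((1.0 - margin) / 2.0, 0.0, 1.0)

def rademacher_mc(X, y, C=1.0, k=10, seed=0):
    """Monte Carlo estimate R_k of (53), using (52): 1 - 2 * (min empirical loss on sigma-labels)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    terms = []
    for _ in range(k):
        sigma = rng.choice([-1.0, 1.0], size=n)               # random Rademacher labels
        while len(np.unique(sigma)) < 2:                      # SVC needs both classes present
            sigma = rng.choice([-1.0, 1.0], size=n)
        clf = SVC(kernel="linear", C=C).fit(X, sigma)          # approximate inf_f of the sigma-labeled risk
        margins = sigma * clf.decision_function(X)
        terms.append(1.0 - 2.0 * np.mean(soft_loss(margins)))  # one realization of (52)
    return float(np.mean(terms))

def in_sample_bound(X, y, C=1.0, k=10, delta=0.05):
    """Right side of (60), with (k-1)/N taken as ~0 since N = 2^n is huge."""
    n = len(y)
    clf = SVC(kernel="linear", C=C).fit(X, y)
    L_hat = np.mean(soft_loss(y * clf.decision_function(X)))
    R_k = rademacher_mc(X, y, C=C, k=k)
    conf = (3.0 + np.sqrt(n / k)) * np.sqrt(np.log(3.0 / delta) / (2.0 * n))
    return L_hat + R_k + conf
```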

V. APPLICATION OF THE IN-SAMPLE APPROACH TO THE SVM

Let us consider the training set D_n and the input space X ⊆ ℝ^d. We map the input space into a feature space of dimension D through the function φ : ℝ^d → ℝ^D. Then, the SVM classifier is defined as

\[ f(x) = \boldsymbol{w} \cdot \boldsymbol{\phi}(x) + b \tag{62} \]

where the weights w ∈ ℝ^D and the bias b ∈ ℝ are found by solving the following primal convex constrained quadratic programming (CCQP) problem

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\boldsymbol{w}\|^2 + C\, \boldsymbol{e}^T \boldsymbol{\xi} \tag{63} \]
\[ \text{s.t.} \quad y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) \geq 1 - \xi_i, \qquad \xi_i \geq 0 \]

where e_i = 1 ∀i ∈ {1, ..., n} [4]. The above problem is also known as the Tikhonov formulation of the SVM, because it can be seen as a regularized ill-posed problem. By introducing n Lagrange multipliers α_1, ..., α_n, it is possible to write (63) in its dual form, for which efficient solvers have been developed throughout the years

\[ \min_{\boldsymbol{\alpha}} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \tag{64} \]
\[ \text{s.t.} \quad 0 \leq \alpha_i \leq C, \qquad \sum_{i=1}^{n} y_i \alpha_i = 0 \]

where K(x_i, x_j) = φ(x_i) · φ(x_j) is a suitable kernel function. After solving (64), the Lagrange multipliers can be used to define the SVM classifier in its dual form as

\[ f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b. \tag{65} \]

The hyperparameter C in (63) and (64) is tuned during the model selection phase, and indirectly defines the set of functions F. Then, any out-of-sample technique can be applied to estimate the generalization error of the classifier and the optimal value of C can be chosen accordingly. Unfortunately, this formulation suffers from several drawbacks.

1) The hypothesis space F is not directly controlled by the hyperparameter C, but only indirectly through the minimization process.
2) The loss function of the SVM is not bounded, which represents a problem for out-of-sample techniques as well, because the optimization is performed using the hinge loss, while the error estimation is usually computed with the hard loss.
3) The function space is centered in an arbitrary way with respect to the optimal (unknown) classifier.

It is worthwhile to write the SVM optimization problem as [21]

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \; \sum_{i=1}^{n} \xi_i \tag{66} \]
\[ \text{s.t.} \quad \|\boldsymbol{w}\|^2 \leq \rho \tag{67} \]
\[ \qquad\;\; y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) \geq 1 - \xi_i \tag{68} \]
\[ \qquad\;\; \xi_i \geq 0 \tag{69} \]

which is the equivalent Ivanov formulation of (63) for some value of the hyperparameter ρ. From (67), it is clear that ρ explicitly controls the size of the function space F, which is centered in the origin and consists of the set of linear classifiers with margin greater than or equal to 2/ρ. In fact, as ρ is increased, the set of functions is enriched by classifiers with smaller margin and, therefore, of greater classification (and overfitting) capability.

A possibility to center the space F in a different point is to translate the weights of the classifiers by some constant value, so that (67) becomes ‖w − w_0‖² ≤ ρ. By applying this idea to the Ivanov formulation of the SVM and substituting the hinge loss with the soft loss, we obtain the following new optimization problem:

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}, \boldsymbol{\eta}} \; \sum_{i=1}^{n} \eta_i \tag{70} \]
\[ \text{s.t.} \quad \|\boldsymbol{w}\|^2 \leq \rho \]
\[ \qquad\;\; y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) + y_i \lambda f_0(x_i) \geq 1 - \xi_i \]
\[ \qquad\;\; \xi_i \geq 0 \]
\[ \qquad\;\; \eta_i = \min(2, \xi_i) \]

where f_0 is the classifier that has been selected as the center of the function space F, and λ is a normalization constant. Note that f_0 can be either a linear classifier f_0(x) = w_0 · φ(x) + b_0 or a nonlinear one (e.g., analogous to that shown in [53]) but, in general, can be any a priori and auxiliary information that helps in relocating the function space closer to the optimal classifier. In this respect, f_0 can be considered as a hint, a concept introduced in [54] in the context of neural networks, which must be defined independently from the training data.

The normalization constant λ weights the amount of hints that we are keen to accept in searching for our optimal classifier: if we set λ = 0, we obtain the conventional Ivanov formulation of the SVM, while for larger values of λ the hint is weighted even more than the regularization process itself. The sensitivity analysis of the SVM solution with respect to the variations of λ is an interesting issue that would require a thorough study, so we are not addressing it here. In any case, as we are working in the agnostic case, we equally weight the hint and the regularized learning process, and thus choose λ = 1 in this paper.



The previous optimization problem can be reformulated in its dual form and solved by general-purpose convex programming algorithms [21]. However, we show here that it can also be solved by conventional SVM learning algorithms if we rewrite it in the usual Tikhonov formulation

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}, \boldsymbol{\eta}} \; \frac{1}{2} \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{n} \eta_i \tag{71} \]
\[ \text{s.t.} \quad y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) + y_i f_0(x_i) \geq 1 - \xi_i \]
\[ \qquad\;\; \xi_i \geq 0 \]
\[ \qquad\;\; \eta_i = \min(2, \xi_i). \]

It can be shown that the two formulations are equivalent, in the sense that, for any ρ, there is at least one C for which the Ivanov and Tikhonov solutions coincide [21]. In particular, the value of C is a nondecreasing function of the value of ρ, so that, given a particular C, the corresponding ρ can be found by a simple bisection algorithm [55]–[57].

Regardless of the formulation, the optimization problem is nonconvex, so we must resort to methods that are able to find an approximate suboptimal solution, such as the Peeling technique [32], [58] or the concave-convex procedure (CCCP) [33]. In particular, the CCCP, which is synthesized in Algorithm 1, suggests breaking the objective function of (71) into its convex and concave parts as

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}, \boldsymbol{\varsigma}} \; \underbrace{\frac{1}{2} \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{n} \xi_i}_{J_{\mathrm{convex}}(\boldsymbol{\theta})} \; \underbrace{-\, C \sum_{i=1}^{n} \varsigma_i}_{J_{\mathrm{concave}}(\boldsymbol{\theta})} \tag{72} \]
\[ \text{s.t.} \quad y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) + y_i f_0(x_i) \geq 1 - \xi_i \]
\[ \qquad\;\; \xi_i \geq 0 \]
\[ \qquad\;\; \varsigma_i = \max(0, \xi_i - 2) \]

where θ = [w|b] is introduced to simplify the notation.

Algorithm 1 Concave-Convex Procedure (CCCP)
  Initialize θ^(0)
  repeat
    θ^(t+1) = arg min_θ [ J_convex(θ) + (dJ_concave(θ)/dθ)|_{θ = θ^(t)} · θ ]
  until θ^(t+1) = θ^(t)

Obviously, the algorithm does not guarantee finding the optimal solution, but it converges to a usually good solution in a finite number of steps [33].
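A minimal sketch of the CCCP iteration of Algorithm 1 in generic form: at each step the concave part is linearized at the current iterate and the resulting convex subproblem is solved. The inner solver `solve_convex_subproblem` is a placeholder (an assumption of this sketch), standing in for the SVM-specific problem (77) derived below.

```python
import numpy as np

def cccp(theta0, solve_convex_subproblem, grad_concave, max_iter=50, tol=1e-6):
    """Generic concave-convex procedure: linearize J_concave at theta_t, minimize the convex surrogate."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        # Gradient of the concave part at the current iterate (kept fixed during the inner minimization)
        g = grad_concave(theta)
        # Inner step: theta_{t+1} = argmin_theta  J_convex(theta) + g . theta
        theta_new = solve_convex_subproblem(g)
        if np.linalg.norm(theta_new - theta) <= tol:
            return theta_new
        theta = theta_new
    return theta
```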

To apply the CCCP, we must compute the derivative of the concave part of the objective function at the t-th step as

\[ \left. \frac{d J_{\mathrm{concave}}(\boldsymbol{\theta})}{d \boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}} \cdot \boldsymbol{\theta} \tag{73} \]
\[ = \sum_{i=1}^{n} \left. \frac{d (-C \varsigma_i)}{d \boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}} \cdot \boldsymbol{\theta} \tag{74} \]
\[ = \sum_{i=1}^{n} s_i^{(t)} y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) \tag{75} \]

where

\[ s_i^{(t)} = \begin{cases} C, & \text{if } y_i \left( \boldsymbol{w}^{(t)} \cdot \boldsymbol{\phi}(x_i) + b^{(t)} \right) < -1 \\ 0, & \text{otherwise.} \end{cases} \tag{76} \]

Then, the (t+1)-th solution [w^(t+1), b^(t+1)] can be found by solving the following learning problem:

\[ \min_{\boldsymbol{w}, b, \boldsymbol{\xi}} \; \frac{1}{2} \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} s_i^{(t)} y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) \tag{77} \]
\[ \text{s.t.} \quad y_i \left( \boldsymbol{w} \cdot \boldsymbol{\phi}(x_i) + b \right) + y_i f_0(x_i) \geq 1 - \xi_i \]
\[ \qquad\;\; \xi_i \geq 0. \]

As a last issue, it is worth noting that the dual formulation of the previous problem can be obtained by introducing n Lagrange multipliers β_i

\[ \min_{\boldsymbol{\beta}} \; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \beta_i \beta_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n} \left( y_i f_0(x_i) - 1 \right) \beta_i \tag{78} \]
\[ \text{s.t.} \quad -s_i^{(t)} \leq \beta_i \leq C - s_i^{(t)} \]
\[ \qquad\;\; \sum_{i=1}^{n} y_i \beta_i = 0 \]

which can be solved by any SVM-specific algorithm such as, for example, the well-known SMO algorithm [29], [30].

A. Our Method in a Nutshell

In this section, we briefly summarize the method, which allows us to apply the in-sample approach to the SVM model selection and error estimation problems. As a first step, we have to identify a centroid f_0: for this purpose, possible a priori information can be exploited; otherwise, a method to identify a hint in a data-dependent way is suggested in [53]. Note that f_0 can be either a linear or a nonlinear SVM classifier and, in principle, can even be computed by exploiting a kernel that differs from the one used during the learning phase. Once the sequence of classes of functions is centered, we explore its hierarchy according to the SRM principle, ideally by looking for the optimal hyperparameter ρ ∈ (0, +∞), similarly to the search for the optimal C in conventional SVMs. For every value of ρ, i.e., for every class of functions, (70) is solved by exploiting the procedure previously presented in Section V, and either the Rademacher complexity (60) or the maximal discrepancy (61) bound is computed. Finally, the class F, and therefore the corresponding classifier, is chosen for which the value of the estimated generalization error is minimized. Note that this value, by construction, is a statistically valid estimate.
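The sketch below summarizes the whole procedure in Python-like form: a grid over the complexity hyperparameter, the (approximate) solution of the centered problem, and the selection of the class minimizing the in-sample bound. The functions `solve_centered_svm` and `rademacher_penalty` are placeholders (assumptions of this sketch) for the CCCP solver of Section V and for the Monte Carlo estimate (53), respectively.

```python
import numpy as np

def in_sample_model_selection(X, y, f0, rho_grid, solve_centered_svm,
                              rademacher_penalty, delta=0.05, k=10):
    """Data-dependent SRM over a hierarchy of classes centered on the hint f0, as in Section V-A."""
    n = len(y)
    best = None
    for rho in rho_grid:
        f = solve_centered_svm(X, y, f0, rho)           # approximate solution of (70) for this class
        scores = f(X)                                   # decision values of the selected-class solution
        L_hat = np.mean(np.clip((1.0 - y * scores) / 2.0, 0.0, 1.0))  # empirical soft loss (5)
        penalty = rademacher_penalty(X, y, f0, rho, k)  # R_k of (53) for the class indexed by rho
        conf = (3.0 + np.sqrt(n / k)) * np.sqrt(np.log(3.0 / delta) / (2.0 * n))
        bound = L_hat + penalty + conf                  # right side of (60)
        if best is None or bound < best[0]:
            best = (bound, rho, f)
    return best   # (estimated generalization error, selected rho, selected classifier)

# Hyperparameter grid as used in the experiments: 30 log-spaced values in [1e-6, 1e3]
rho_grid = np.logspace(-6, 3, 30)
```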



TABLE I
MNIST Dataset: Error on the Reference Set Computed Using the Soft Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

8.46 ± 0.97 5.10 ± 0.67 3.05 ± 0.23 2.36 ± 0.23 1.96 ± 0.14 1.63 ± 0.11 1.44 ± 0.11 1.27 ± 0.09 1.20 ± 0.08 1.08 ± 0.09 0.92 ± 0.05 0.81 ± 0.07 0.70 ± 0.06

8.98 ± 1.12 5.10 ± 0.67 3.05 ± 0.23 2.36 ± 0.23 1.96 ± 0.14 1.63 ± 0.11 1.44 ± 0.11 1.27 ± 0.09 1.20 ± 0.08 1.08 ± 0.09 0.92 ± 0.05 0.81 ± 0.07 0.70 ± 0.06

12.90 ± 0.83 8.39 ± 1.11 6.26 ± 0.16 5.95 ± 0.12 5.61 ± 0.07 5.26 ± 0.29 4.98 ± 0.40 3.71 ± 0.58 2.71 ± 0.42 2.25 ± 0.21 2.07 ± 0.03 2.02 ± 0.04 1.93 ± 0.02

13.20 ± 0.86 8.93 ± 1.20 6.26 ± 0.16 5.95 ± 0.12 5.61 ± 0.07 5.36 ± 0.21 4.98 ± 0.40 4.41 ± 0.53 3.59 ± 0.57 2.75 ± 0.47 2.07 ± 0.03 2.02 ± 0.04 1.93 ± 0.02

10.70 ± 0.88 6.96 ± 0.70 4.56 ± 0.27 3.42 ± 0.27 2.94 ± 0.18 2.42 ± 0.14 2.17 ± 0.14 1.89 ± 0.12 1.74 ± 0.11 1.53 ± 0.09 1.34 ± 0.06 1.18 ± 0.08 0.98 ± 0.06

10.70 ± 0.88 6.69 ± 0.71 4.31 ± 0.26 3.25 ± 0.29 2.79 ± 0.17 2.35 ± 0.17 2.09 ± 0.17 1.85 ± 0.15 1.65 ± 0.11 1.44 ± 0.09 1.27 ± 0.06 1.11 ± 0.09 0.92 ± 0.07

13.40 ± 0.76 9.37 ± 0.62 5.93 ± 0.26 4.40 ± 0.25 3.61 ± 0.17 3.15 ± 0.14 2.86 ± 0.15 2.43 ± 0.14 2.18 ± 0.12 1.98 ± 0.09 1.67 ± 0.08 1.48 ± 0.09 1.24 ± 0.07

TABLE II
DaimlerChrysler Dataset: Error on the Reference Set Computed Using the Soft Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

37.40 ± 3.38 31.50 ± 2.02 28.00 ± 0.76 26.60 ± 0.51 25.70 ± 0.50 25.20 ± 0.71 23.80 ± 0.43 22.90 ± 0.38 22.40 ± 0.35 21.80 ± 0.39 21.30 ± 0.39 20.50 ± 0.40 19.60 ± 0.29

37.90 ± 3.52 31.70 ± 2.00 28.00 ± 0.75 26.60 ± 0.50 25.70 ± 0.50 25.20 ± 0.71 23.80 ± 0.43 22.90 ± 0.37 22.40 ± 0.35 21.90 ± 0.38 21.30 ± 0.39 20.50 ± 0.41 19.60 ± 0.29

42.80 ± 2.91 37.70 ± 2.43 33.10 ± 0.64 31.60 ± 0.49 30.60 ± 0.48 30.20 ± 0.46 29.70 ± 0.40 28.90 ± 0.37 27.90 ± 0.33 27.90 ± 0.33 27.10 ± 0.23 27.00 ± 0.30 26.10 ± 0.35

44.80 ± 2.54 37.90 ± 2.38 33.10 ± 0.64 31.70 ± 0.46 30.90 ± 0.47 30.40 ± 0.49 29.80 ± 0.39 29.40 ± 0.34 28.70 ± 0.41 28.20 ± 0.37 27.30 ± 0.21 27.10 ± 0.24 26.30 ± 0.32

37.10 ± 2.58 32.00 ± 1.16 29.10 ± 0.81 27.60 ± 0.53 26.80 ± 0.59 25.70 ± 0.60 24.60 ± 0.42 23.80 ± 0.45 23.30 ± 0.38 22.70 ± 0.40 21.80 ± 0.34 21.00 ± 0.33 20.00 ± 0.27

37.10 ± 2.58 31.40 ± 1.11 29.50 ± 0.80 27.60 ± 0.75 26.90 ± 0.78 26.00 ± 0.78 24.60 ± 0.53 23.70 ± 0.43 23.20 ± 0.52 22.60 ± 0.44 21.60 ± 0.39 20.80 ± 0.35 20.00 ± 0.30

39.00 ± 2.63 34.60 ± 1.34 30.70 ± 0.68 28.90 ± 0.55 28.20 ± 0.48 27.90 ± 0.69 26.60 ± 0.49 25.40 ± 0.34 25.00 ± 0.34 24.70 ± 0.38 23.40 ± 0.30 22.50 ± 0.32 21.50 ± 0.25

VI. EXPERIMENTAL RESULTS

We describe in this section two sets of experiments. The first one is built by using relatively large datasets that allow us to simulate the small-sample setting. Each dataset is sampled by extracting a small amount of data to build the training sets and exploiting the remaining data as a good representative of the entire sample population. The rationale behind this choice is to build some toy problems, but based on real-world data, so as to better explore the performance of our proposal in a controlled setting. Thanks to this approach, the experimental results can be easily interpreted and the two approaches, in-sample versus out-of-sample, easily compared. The second set, instead, targets the classification of microarray data, which consists of true small-sample datasets.

A. Simulated Small-Sample Setting

We consider the well-known MNIST [59] dataset, consisting of 62 000 images representing the numbers from 0 to 9: in particular, we consider the 13 074 patterns containing 0s and 1s, allowing us to deal with a binary classification problem. We build a small-sample dataset by randomly sampling a small
number of patterns, varying from n = 10 to n = 400, which is a value much smaller than the dimensionality of the data d = 28 × 28 = 784, while the remaining 13074 − n images are used as a reference set. In order to build statistically relevant results, the entire procedure is repeated 30 times during the experiments. We also consider a balanced version of the DaimlerChrysler dataset [60], where half of the 9800 images, of d = 36 × 18 = 648 pixels, contains the picture of a pedestrian, while the other half contains only some general background or other objects. These two datasets target different objectives: the MNIST dataset represents an easy classification problem, in the sense that a low classification error, well below 1%, can be easily achieved; on the contrary, the DaimlerChrysler dataset is a much more difficult problem, because the samples from each class are quite overlapped, so the small-sample setting makes this problem even more difficult to solve. By analyzing these two opposite cases, it is possible to gain a better insight into the performance of the various methods. In all cases, we use a linear kernel φ(x) = x, as the training data are obviously linearly separable (d > n) and the use of a nonlinear transformation would further complicate the interpretation of the results. In Tables I–VIII, the results obtained with the different methods are reported. Each column refers to a different approach.
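A short sketch of the sampling protocol described above (repeated 30 times for each training-set size in the experiments): draw n training patterns at random and keep the rest as the reference set. The array names and the use of NumPy are assumptions of this illustration.

```python
import numpy as np

def split_small_sample(X, y, n, seed):
    """Randomly pick n training patterns; the remaining ones form the reference set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train, reference = idx[:n], idx[n:]
    return X[train], y[train], X[reference], y[reference]

# 30 repetitions for each training-set size, as in Tables I-VIII
for n in (10, 20, 40, 60, 80, 100, 120, 150, 170, 200, 250, 300, 400):
    for rep in range(30):
        pass  # X_tr, y_tr, X_ref, y_ref = split_small_sample(X, y, n, seed=rep)
```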



TABLE III
MNIST Dataset: Error on the Reference Set Computed Using the Hard Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

2.33 ± 0.94 1.16 ± 0.31 0.47 ± 0.05 0.47 ± 0.10 0.39 ± 0.05 0.30 ± 0.04 0.28 ± 0.03 0.29 ± 0.04 0.30 ± 0.05 0.28 ± 0.04 0.25 ± 0.02 0.25 ± 0.04 0.20 ± 0.02

2.55 ± 1.04 1.16 ± 0.31 0.47 ± 0.05 0.47 ± 0.10 0.39 ± 0.05 0.30 ± 0.04 0.28 ± 0.03 0.29 ± 0.04 0.30 ± 0.05 0.28 ± 0.04 0.25 ± 0.02 0.25 ± 0.04 0.20 ± 0.02

2.69 ± 0.67 1.41 ± 0.43 0.72 ± 0.09 0.73 ± 0.10 0.65 ± 0.07 0.58 ± 0.06 0.56 ± 0.05 0.45 ± 0.07 0.35 ± 0.05 0.31 ± 0.04 0.28 ± 0.02 0.27 ± 0.02 0.26 ± 0.01

2.78 ± 0.66 1.49 ± 0.43 0.72 ± 0.09 0.73 ± 0.10 0.65 ± 0.07 0.59 ± 0.06 0.56 ± 0.05 0.51 ± 0.06 0.43 ± 0.07 0.37 ± 0.06 0.28 ± 0.02 0.27 ± 0.02 0.26 ± 0.01

2.77 ± 0.83 1.58 ± 0.42 0.90 ± 0.27 0.85 ± 0.26 0.73 ± 0.17 0.53 ± 0.14 0.57 ± 0.12 0.61 ± 0.19 0.49 ± 0.09 0.50 ± 0.11 0.40 ± 0.07 0.39 ± 0.08 0.29 ± 0.06

2.77 ± 0.83 1.91 ± 0.63 1.72 ± 0.56 1.15 ± 0.36 1.26 ± 0.30 0.76 ± 0.22 0.79 ± 0.18 0.77 ± 0.22 0.58 ± 0.11 0.61 ± 0.13 0.50 ± 0.10 0.48 ± 0.11 0.36 ± 0.07

3.21 ± 0.67 1.79 ± 0.34 0.78 ± 0.12 0.66 ± 0.12 0.56 ± 0.09 0.40 ± 0.05 0.43 ± 0.06 0.37 ± 0.05 0.37 ± 0.06 0.34 ± 0.04 0.32 ± 0.03 0.28 ± 0.04 0.24 ± 0.02

TABLE IV
DaimlerChrysler Dataset: Error on the Reference Set Computed Using the Hard Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

33.60 ± 4.53 27.10 ± 2.53 23.80 ± 0.75 23.10 ± 0.54 22.20 ± 0.54 22.00 ± 0.75 20.90 ± 0.50 20.10 ± 0.42 20.00 ± 0.43 19.40 ± 0.41 19.10 ± 0.39 18.50 ± 0.40 17.90 ± 0.32

34.60 ± 4.91 27.10 ± 2.52 23.80 ± 0.75 23.10 ± 0.54 22.30 ± 0.54 22.00 ± 0.75 20.90 ± 0.52 20.10 ± 0.40 20.00 ± 0.43 19.40 ± 0.41 19.10 ± 0.39 18.50 ± 0.41 17.90 ± 0.32

31.40 ± 4.01 27.20 ± 1.05 26.00 ± 0.63 25.90 ± 0.79 25.00 ± 0.51 24.20 ± 0.49 24.10 ± 0.55 23.70 ± 0.48 23.10 ± 0.36 22.70 ± 0.49 22.60 ± 0.43 22.40 ± 0.36 21.40 ± 0.55

31.70 ± 4.00 27.30 ± 1.09 26.00 ± 0.63 26.00 ± 0.80 25.20 ± 0.55 24.20 ± 0.48 24.30 ± 0.50 24.00 ± 0.49 23.60 ± 0.41 23.00 ± 0.51 22.70 ± 0.43 22.50 ± 0.31 21.60 ± 0.57

32.30 ± 3.54 26.70 ± 1.06 24.70 ± 0.88 23.50 ± 0.85 22.70 ± 0.73 21.80 ± 0.77 20.80 ± 0.51 19.80 ± 0.46 19.80 ± 0.51 19.20 ± 0.44 18.60 ± 0.44 17.80 ± 0.31 17.00 ± 0.25

32.30 ± 3.54 25.70 ± 0.61 24.40 ± 0.80 23.20 ± 0.78 22.80 ± 0.84 22.00 ± 0.74 21.00 ± 0.71 20.30 ± 0.78 19.90 ± 0.63 19.10 ± 0.48 18.50 ± 0.42 17.80 ± 0.37 16.90 ± 0.34

35.30 ± 3.56 29.40 ± 1.41 26.00 ± 0.78 24.40 ± 0.53 23.50 ± 0.48 23.40 ± 0.72 22.30 ± 0.51 21.50 ± 0.40 21.10 ± 0.40 20.80 ± 0.40 19.70 ± 0.32 19.00 ± 0.29 18.20 ± 0.27

TABLE V
MNIST Dataset: Error Estimation Using the Soft Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

– – 77.00 ± 0.00 62.90 ± 0.00 54.50 ± 0.00 48.70 ± 0.00 44.50 ± 0.00 39.80 ± 0.00 37.40 ± 0.00 34.40 ± 0.00 30.80 ± 0.00 28.10 ± 0.00 24.40 ± 0.00

– – 77.00 ± 0.00 62.90 ± 0.00 54.50 ± 0.00 48.70 ± 0.00 44.50 ± 0.00 39.80 ± 0.00 37.40 ± 0.00 34.40 ± 0.00 30.80 ± 0.00 28.10 ± 0.00 24.40 ± 0.00

– – – 83.60 ± 0.52 72.70 ± 0.45 65.30 ± 0.30 59.50 ± 0.33 54.90 ± 0.26 51.80 ± 0.22 47.80 ± 0.19 43.30 ± 0.18 39.70 ± 0.17 34.80 ± 0.15

– – – 85.00 ± 0.46 73.90 ± 0.41 66.50 ± 0.32 60.90 ± 0.34 55.10 ± 0.27 52.20 ± 0.23 48.40 ± 0.19 43.20 ± 0.17 39.60 ± 0.16 34.90 ± 0.16

– 85.60 ± 0.76 62.19 ± 0.43 48.28 ± 0.31 39.16 ± 0.25 32.66 ± 0.19 28.41 ± 0.18 24.02 ± 0.17 21.39 ± 0.11 18.66 ± 0.12 15.44 ± 0.11 13.40 ± 0.10 10.23 ± 0.08

– – – – – – – – – – – – –

78.09 ± 1.56 54.56 ± 0.98 32.34 ± 0.58 23.87 ± 0.41 18.21 ± 0.31 15.71 ± 0.24 13.81 ± 0.22 11.29 ± 0.21 10.01 ± 0.17 9.00 ± 0.14 7.13 ± 0.14 6.31 ± 0.16 5.03 ± 0.08

1) RC and MD are the in-sample procedures using, respectively, the Rademacher complexity and the maximal discrepancy approaches, with f_0 = 0.
2) RC_f and MD_f are similar to the previous cases, but 30% of the samples of the training set are used for finding a hint f_0(x) = w_0 · x + b_0 by learning a linear classifier on them (refer to [53] for further details).
3) KCV is the k-fold cross validation procedure, with k = 10.
4) LOO is the leave-one-out procedure.
5) BTS is the Bootstrap technique with N_B = 100.

For the in-sample methods, the model selection is performed by searching for the optimal hyperparameter ρ ∈ [10^{-6}, 10^{3}] among 30 values, equally spaced on a logarithmic scale, while for the out-of-sample approaches the search is performed by varying C in the same range.

Tables I and II show the error rate achieved by each selected classifier on the reference sets, using the soft loss for computing the error. In particular, the in-sample methods exploit the soft loss for the learning phase, which, by construction, also includes the model selection and error estimation phases. The out-of-sample approaches, instead, use the conventional hinge loss for finding the optimal classifier and the soft loss for model selection. When the classifiers have been found, according to the respective approaches, their performance is verified on the reference dataset, so as to check whether a good model has been selected, and the achieved misclassification rate is reported in the tables. All the figures are in percentage and the best values are highlighted.



TABLE VI
DaimlerChrysler Dataset: Error Estimation Using the Soft Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: MD_f, RC_f, MD, RC, KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

– – 88.30 ± 1.92 71.80 ± 1.48 64.40 ± 1.07 58.90 ± 1.06 52.20 ± 0.95 48.80 ± 0.92 44.50 ± 0.88 42.30 ± 0.76 36.70 ± 0.66 35.40 ± 0.58 30.80 ± 0.75

– – 88.40 ± 1.84 72.10 ± 1.50 64.80 ± 1.11 58.60 ± 1.08 52.20 ± 0.88 48.70 ± 0.89 44.60 ± 0.93 42.30 ± 0.78 36.70 ± 0.65 35.30 ± 0.60 30.80 ± 0.75

– – – – 94.50 ± 0.89 87.90 ± 1.08 82.00 ± 0.81 77.60 ± 0.85 73.90 ± 0.74 71.10 ± 0.81 66.20 ± 0.65 62.80 ± 0.68 58.30 ± 0.74

– – – – 94.30 ± 0.87 87.80 ± 1.01 82.30 ± 0.78 76.90 ± 0.83 73.50 ± 0.74 70.90 ± 0.82 65.60 ± 0.64 62.40 ± 0.64 57.80 ± 0.71

– – 84.05 ± 2.03 73.58 ± 2.02 68.70 ± 1.13 63.54 ± 1.31 58.93 ± 1.19 55.36 ± 1.36 51.12 ± 0.74 48.64 ± 0.90 44.61 ± 0.82 42.22 ± 0.74 37.97 ± 0.84

– – – – – – – – – – – – –

98.11 ± 4.96 77.76 ± 3.80 63.48 ± 2.18 54.50 ± 1.87 51.50 ± 1.45 48.10 ± 1.37 44.52 ± 1.02 41.83 ± 1.17 39.04 ± 0.94 37.84 ± 0.99 34.79 ± 0.82 33.05 ± 0.77 30.83 ± 0.84

TABLE VII
MNIST Dataset: Error Estimation Using the Hard Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

95.00 ± 1.99 77.64 ± 0.76 52.71 ± 0.32 39.30 ± 0.25 31.23 ± 0.24 25.89 ± 0.17 22.09 ± 0.13 18.10 ± 0.15 16.16 ± 0.09 13.91 ± 0.10 11.29 ± 0.09 9.50 ± 0.10 7.22 ± 0.07

– – – – – – – – – – – – –

57.65 ± 2.48 34.15 ± 0.90 18.63 ± 0.45 12.79 ± 0.33 9.74 ± 0.23 7.86 ± 0.19 6.59 ± 0.17 5.30 ± 0.15 4.69 ± 0.11 4.00 ± 0.11 3.21 ± 0.09 2.68 ± 0.09 2.02 ± 0.08

TABLE VIII
DaimlerChrysler Dataset: Error Estimation Using the Hard Loss
(The first line of figures below lists the training-set sizes n; each of the following lines reports the values, in %, of one method, in the order: KCV, LOO, BTS.)

10 20 40 60 80 100 120 150 170 200 250 300 400

95.00 ± 6.88 77.64 ± 3.83 75.14 ± 2.18 58.18 ± 2.27 59.97 ± 1.44 50.69 ± 1.67 43.81 ± 1.45 43.98 ± 1.52 39.56 ± 1.02 34.37 ± 0.91 32.96 ± 0.97 31.90 ± 0.89 27.47 ± 0.86

– – – – – – – – – – – – –

80.75 ± 7.42 64.80 ± 4.73 52.41 ± 2.78 42.17 ± 2.20 40.29 ± 1.53 38.98 ± 1.62 35.52 ± 1.28 32.94 ± 1.50 31.09 ± 1.09 29.71 ± 1.10 27.68 ± 0.99 26.27 ± 0.82 24.43 ± 0.83

conventional hinge loss for finding the optimal classifier and the soft loss for model selection. When the classifiers have been found, according to the respective approaches, their performance is verified on the reference dataset, so as to check

TABLE IX
Human Gene Expression Datasets (d = input dimensionality, n = number of samples)
Brain tumor 1 [62]: d = 5920, n = 90
Brain tumor 2 [62]: d = 10367, n = 50
Colon cancer 1 [63]: d = 22283, n = 47
Colon cancer 2 [64]: d = 2000, n = 62
DLBCL [62]: d = 5469, n = 77
DukeBreastCancer [65]: d = 7129, n = 44
Leukemia [66]: d = 7129, n = 72
Leukemia 1 [62]: d = 5327, n = 72
Leukemia 2 [62]: d = 11225, n = 72
Lung cancer [62]: d = 12600, n = 203
Myeloma [67]: d = 28032, n = 105
Prostate tumor [62]: d = 10509, n = 102
SRBCT [62]: d = 2308, n = 83

whether a good model has been selected, and the achieved misclassification rate is reported in the tables. All the figures are percentages and the best values are highlighted. As can be easily seen, the best approaches are RCf and MDf, which consistently outperform the out-of-sample ones. It is also clear that centering the function space at a more appropriate point, thanks to the hint f0, improves the ability of the procedure to select a better classifier with respect to the in-sample approaches without hints. This is a result of the shrinking of the function space, which directly affects the tightness of the generalization error bounds. As a last remark, it is possible to note that RC and MD often select the same classifier, with a slight superiority of RC when dealing with difficult classification problems. This is also an expected result, because MD makes use of the label information, which is misleading if the samples are not well separable [49]. The use of the soft loss is not so common in the SVM literature, so we repeated the experiments by applying the hard loss for computing the misclassification error on the reference dataset. Tables III and IV report the results and confirm the superiority of the in-sample approach when dealing with the MNIST problem, while the results are comparable with those of the out-of-sample methods in the case of the DaimlerChrysler dataset. In particular, the in-sample methods


TABLE X
Human Gene Expression Datasets: Error on the Reference Set Computed Using the Soft Loss (Values in %)
In each row, the 13 values correspond to the datasets in the order of Table IX: Brain tumor 1, Brain tumor 2, Colon cancer 1, Colon cancer 2, DLBCL, DukeBreastCancer, Leukemia, Leukemia 1, Leukemia 2, Lung cancer, Myeloma, Prostate tumor, SRBCT.
MDf:  14.00 ± 6.37  5.78 ± 2.35  27.40 ± 14.78  25.50 ± 8.59  19.10 ± 2.03  33.40 ± 5.07  14.20 ± 5.05  17.40 ± 4.66  11.30 ± 4.34  11.60 ± 2.93  8.60 ± 2.00  17.80 ± 4.65  10.20 ± 4.47
RCf:  14.00 ± 7.66  5.78 ± 3.34  27.40 ± 13.97  25.50 ± 7.92  19.10 ± 3.40  33.40 ± 5.61  14.20 ± 5.59  17.40 ± 5.16  11.30 ± 4.08  11.60 ± 3.35  8.60 ± 2.14  17.80 ± 3.82  10.20 ± 4.49
MD:   33.30 ± 0.02  76.20 ± 0.12  45.30 ± 0.05  67.70 ± 0.04  58.00 ± 0.04  50.00 ± 0.06  31.20 ± 0.04  38.80 ± 5.07  31.10 ± 0.03  30.20 ± 0.01  28.40 ± 0.00  25.00 ± 4.25  31.50 ± 0.01
RC:   33.30 ± 0.02  76.20 ± 0.12  45.30 ± 0.05  67.70 ± 0.04  58.00 ± 0.04  50.00 ± 0.06  31.20 ± 0.04  42.20 ± 3.33  31.10 ± 0.03  30.20 ± 0.00  28.40 ± 0.00  25.60 ± 3.96  31.50 ± 0.01
KCV:  17.10 ± 1.94  73.90 ± 0.30  30.10 ± 3.01  67.70 ± 0.07  57.90 ± 0.06  27.90 ± 1.61  20.90 ± 3.61  18.20 ± 3.46  9.56 ± 3.08  11.20 ± 1.53  9.06 ± 0.65  12.40 ± 3.71  10.90 ± 1.33
LOO:  15.70 ± 1.80  73.90 ± 0.27  29.00 ± 1.60  67.70 ± 0.07  57.90 ± 0.05  26.40 ± 3.00  20.60 ± 3.20  17.40 ± 4.15  9.59 ± 2.78  10.60 ± 1.63  8.63 ± 0.66  11.80 ± 3.40  10.50 ± 1.55
BTS:  18.90 ± 3.29  75.60 ± 0.14  29.80 ± 4.14  67.70 ± 0.05  58.00 ± 0.03  31.60 ± 3.08  23.30 ± 3.29  21.20 ± 4.24  10.80 ± 3.71  13.10 ± 1.38  11.50 ± 0.60  15.30 ± 3.93  13.60 ± 1.51

TABLE XI
Human Gene Expression Datasets: Error on the Reference Set Computed Using the Hard Loss (Values in %)
In each row, the 13 values correspond to the datasets in the order of Table IX.
MDf:  5.56 ± 4.52  2.86 ± 4.50  18.20 ± 16.50  18.60 ± 11.00  8.57 ± 4.58  21.70 ± 10.90  7.50 ± 6.01  1.00 ± 2.57  3.00 ± 3.15  5.96 ± 3.19  0.00 ± 0.00  10.90 ± 4.67  1.74 ± 2.74
RCf:  5.56 ± 4.52  2.86 ± 4.50  18.20 ± 16.50  18.60 ± 11.00  8.57 ± 4.58  21.70 ± 10.90  5.00 ± 6.01  1.00 ± 2.57  4.00 ± 4.81  5.96 ± 3.19  0.00 ± 0.00  10.90 ± 4.67  1.74 ± 2.74
MD:   33.30 ± 0.00  2.86 ± 4.50  54.50 ± 0.00  42.90 ± 0.00  33.30 ± 0.00  48.30 ± 12.50  31.20 ± 0.00  49.00 ± 2.57  40.00 ± 0.00  34.00 ± 0.00  28.00 ± 0.00  34.50 ± 12.00  39.10 ± 0.00
RC:   33.30 ± 0.00  2.86 ± 4.50  54.50 ± 0.00  42.90 ± 0.00  33.30 ± 0.00  48.30 ± 12.50  31.20 ± 0.00  50.00 ± 0.00  40.00 ± 0.00  34.00 ± 0.00  28.00 ± 0.00  35.50 ± 10.10  39.10 ± 0.00
KCV:  5.56 ± 4.52  2.86 ± 4.50  12.70 ± 17.50  15.70 ± 6.87  7.62 ± 6.24  6.67 ± 8.02  5.00 ± 3.21  2.00 ± 3.15  5.00 ± 4.07  7.23 ± 2.19  0.00 ± 0.00  10.90 ± 4.67  2.61 ± 4.47
LOO:  8.89 ± 7.28  2.86 ± 4.50  16.40 ± 15.50  17.10 ± 9.36  8.57 ± 7.14  10.00 ± 10.50  2.50 ± 3.94  1.00 ± 2.57  5.00 ± 4.07  5.96 ± 3.19  0.00 ± 0.00  10.90 ± 4.67  3.48 ± 6.52
BTS:  5.56 ± 7.82  2.86 ± 4.50  20.00 ± 13.60  17.10 ± 4.50  8.57 ± 4.58  8.33 ± 9.58  6.25 ± 5.08  1.00 ± 2.57  5.00 ± 4.07  7.23 ± 2.19  0.00 ± 0.00  10.90 ± 4.67  5.22 ± 8.21

with hints appear to perform slightly better than the BTS and slightly worse than KCV and LOO, even though the difference is almost negligible. This is not surprising, because the in-sample methods adopt a soft loss for the training phase, which is not the same loss used for evaluating them. Tables V and VI show the error estimation computed with the various approaches by exploiting the error bounds presented in Sections III and IV. In particular, the in-sample methods provide these values directly, as a byproduct of the training phase, while the figures for the out-of-sample methods are obtained by applying the Hoeffding bound of (22) to the samples of the test set. The missing values indicate that the estimation is not consistent, because it exceeds 100%. In this case, BTS (but not always KCV or LOO) outperforms the in-sample methods, which are more pessimistic. However, by taking a closer look at the results, several interesting facts can be inferred. The out-of-sample methods are very sensitive to the number of test samples: in fact, the LOO method, which uses only one sample at a time for the error estimation, is not able to provide consistent results. The quality of the error estimation improves for KCV, which uses 1/10 of the data, and even more for BTS, which selects, on average, a third of the data for performing the estimation. In any case, by

comparing the results in Table V with the ones in Table I, it is clear that even out-of-sample methods are overly pessimistic: in the best case (BTS, with n = 400), the generalization error is overestimated by a factor greater than 4. This result seems to be in contrast with the common belief that out-of-sample methods provide a good estimation of the generalization error, but it is not surprising because, most of the time, when the generalization error of a classifier is reported in the literature, the confidence term [i.e., the second term on the right side of (23)] is usually neglected and only the average performance is disclosed [i.e., the first term on the right side of (23)]. The results on the two datasets provide another interesting insight into the behavior of the two approaches: it is clear that out-of-sample methods exploit the distribution of the samples of the test set, because they are able to identify the intrinsic difficulty of the classification problem; in-sample methods, instead, do not possess this kind of information and, therefore, maintain a pessimistic approach in all cases, which is not useful for easy classification problems, such as MNIST. This is also confirmed by the small difference in performance between the two approaches on the difficult DaimlerChrysler problem. On the other hand, the advantage of having a test set, for out-of-sample methods, is offset by the need of reducing


TABLE XII
Human Gene Expression Datasets: Error Estimation Using the Soft Loss (Values in %)
In each row, the 13 values correspond to the datasets in the order of Table IX.
MDf:  61.10 ± 1.25  81.70 ± 0.49  92.60 ± 2.84  74.90 ± 2.36  66.30 ± 1.02  –  68.20 ± 0.96  70.70 ± 1.43  68.30 ± 0.43  39.10 ± 0.01  54.50 ± 0.00  56.90 ± 1.39  63.40 ± 0.63
RCf:  61.10 ± 1.25  81.50 ± 0.02  92.50 ± 2.91  75.10 ± 2.55  66.30 ± 1.02  –  67.60 ± 1.34  70.90 ± 1.44  69.00 ± 1.23  39.10 ± 0.01  54.50 ± 0.00  56.90 ± 1.39  63.50 ± 0.75
MD:   89.50 ± 0.00  96.10 ± 0.06  –  82.70 ± 0.06  66.90 ± 0.08  –  –  –  99.90 ± 0.01  70.60 ± 0.00  82.70 ± 0.01  –  97.70 ± 0.00
RC:   90.20 ± 0.00  –  –  91.10 ± 0.06  74.10 ± 0.07  –  –  –  99.20 ± 0.01  70.20 ± 0.00  82.60 ± 0.01  –  97.60 ± 0.00
KCV:  58.30 ± 0.78  82.40 ± 0.04  85.40 ± 0.76  72.10 ± 0.01  59.10 ± 0.02  85.30 ± 2.57  66.10 ± 1.22  70.00 ± 1.45  63.40 ± 0.86  42.00 ± 0.66  51.10 ± 0.36  63.50 ± 0.66  58.20 ± 0.64
LOO:  –  –  –  –  –  –  –  –  –  –  –  –  –
BTS:  31.03 ± 0.54  39.50 ± 0.07  55.18 ± 0.35  32.44 ± 0.02  23.70 ± 0.02  60.27 ± 4.02  36.21 ± 1.61  41.80 ± 1.52  33.41 ± 1.55  15.68 ± 0.81  22.77 ± 0.77  36.83 ± 1.08  31.10 ± 1.83

TABLE XIII
Human Gene Expression Datasets: Error Estimation Using the Hard Loss (Values in %)
In each row, the 13 values correspond to the datasets in the order of Table IX.
KCV:  34.04 ± 2.03  56.49 ± 1.29  56.49 ± 5.14  46.43 ± 4.11  41.43 ± 1.71  60.79 ± 2.40  41.43 ± 1.60  43.79 ± 1.60  43.79 ± 2.57  17.47 ± 0.88  31.23 ± 0.00  47.07 ± 1.64  39.30 ± 2.57
LOO:  –  –  –  –  –  –  –  –  –  –  –  –  –
BTS:  22.05 ± 1.46  20.50 ± 1.68  49.29 ± 9.54  38.65 ± 3.85  21.21 ± 3.54  45.08 ± 3.90  21.21 ± 1.92  22.70 ± 3.63  22.70 ± 0.99  13.00 ± 1.16  9.74 ± 0.18  20.00 ± 1.14  19.91 ± 1.68

TABLE XIV
MNIST Dataset: Computational Time Required by the Different In-Sample and Out-of-Sample Procedures (Values in Seconds)
In each row, the 13 values correspond to n = 10, 20, 40, 60, 80, 100, 120, 150, 170, 200, 250, 300, 400.
MDf:  0.1 ± 0.1  0.3 ± 0.1  0.7 ± 0.2  1.1 ± 0.3  2.3 ± 0.2  2.4 ± 0.2  4.5 ± 0.4  11.4 ± 0.4  17.8 ± 0.4  25.1 ± 0.3  27.4 ± 0.4  48.1 ± 0.4  58.9 ± 0.6
RCf:  0.1 ± 0.1  0.4 ± 0.1  0.6 ± 0.2  1.1 ± 0.2  2.2 ± 0.4  2.6 ± 0.3  3.9 ± 0.3  10.9 ± 0.4  17.1 ± 0.3  24.8 ± 0.4  28.9 ± 0.4  47.2 ± 0.4  59.4 ± 0.4
MD:   0.1 ± 0.1  0.3 ± 0.1  0.6 ± 0.1  1.1 ± 0.1  2.0 ± 0.1  2.7 ± 0.2  4.2 ± 0.3  10.1 ± 0.3  12.8 ± 0.3  21.3 ± 0.3  25.2 ± 0.2  36.1 ± 0.2  44.2 ± 0.2
RC:   0.1 ± 0.1  0.3 ± 0.1  0.7 ± 0.1  1.2 ± 0.1  1.9 ± 0.2  2.7 ± 0.2  4.0 ± 0.2  9.7 ± 0.2  11.9 ± 0.2  20.5 ± 0.2  25.9 ± 0.3  36.8 ± 0.2  44.1 ± 0.2
KCV:  0.0 ± 0.1  0.0 ± 0.1  0.1 ± 0.1  0.1 ± 0.1  0.3 ± 0.1  0.7 ± 0.1  1.1 ± 0.3  1.7 ± 0.3  2.4 ± 0.3  2.9 ± 0.2  4.1 ± 0.4  4.9 ± 0.3  6.3 ± 0.3
LOO:  0.0 ± 0.1  0.1 ± 0.1  0.3 ± 0.1  0.8 ± 0.1  2.0 ± 0.3  2.9 ± 0.7  5.1 ± 0.4  9.3 ± 0.4  13.4 ± 0.4  18.1 ± 1.1  27.6 ± 1.2  39.3 ± 1.1  59.7 ± 1.3
BTS:  0.0 ± 0.1  0.0 ± 0.1  0.0 ± 0.1  0.0 ± 0.1  0.1 ± 0.1  0.2 ± 0.2  0.5 ± 0.2  0.7 ± 0.1  0.9 ± 0.2  1.4 ± 0.3  1.9 ± 0.3  2.3 ± 0.3  4.1 ± 0.4

the size of the training and validation sets, which causes the methods to choose a worse performing classifier. This is related to the well-known issue of the optimal splitting of the data between training and test sets, which is still an open problem.
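To make the size of the neglected confidence term concrete, the following sketch computes a Hoeffding-style upper bound on the generalization error from test-set losses. The exact constants of (22) and (23) are defined earlier in the paper, so the standard one-sided form for a [0, 1]-bounded loss used here is an assumption, not a reproduction of the authors' expression.

```python
# Illustrative Hoeffding-style test-set bound for a loss bounded in [0, 1].
import numpy as np

def hoeffding_upper_bound(losses, delta=0.05):
    """Upper bound on the expected loss, holding with probability >= 1 - delta
    over the draw of m i.i.d. test samples:
        L <= L_hat + sqrt(ln(1/delta) / (2 m)).
    The first term is the average test loss; the second is the confidence
    term that is often neglected when results are reported."""
    losses = np.asarray(losses, dtype=float)
    m = losses.size
    return losses.mean() + np.sqrt(np.log(1.0 / delta) / (2.0 * m))

# Example: 40 test samples with a 10% average loss give a confidence term of
# about 0.19, so the bound (~0.29) roughly triples the raw estimate.
print(hoeffding_upper_bound(np.r_[np.ones(4), np.zeros(36)]))
```

This is why, with a small test set, even the best out-of-sample estimate remains far above the error actually measured on the reference set.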

Finally, Tables VII and VIII show the error estimation of the out-of-sample methods using the hard loss. In this case, the in-sample methods cannot be applied, because it is not possible to perform the learning with this loss. As expected, the error estimation improves, with respect to the


previous case, except for the LOO method, which is not able to provide consistent results. The improvement with respect to the case of the soft loss is due to the fact that we are now working in a parametric setting (i.e., the errors are distributed according to a binomial distribution), while the soft loss gives rise to a nonparametric estimation, which is a more difficult problem. In short, the experiments clearly show that in-sample methods with hints are more reliable for model selection than out-of-sample ones, and that the BTS appears to be the best approach for estimating the generalization error of the trained classifier.

B. Microarray Small-Sample Datasets

The last set of experiments deals with several Human Gene Expression datasets (Tables X–XIII), where all the problems are converted, where needed, to two classes by simply grouping some data. In this kind of setting, a reference set of reasonable size is not available, so we reproduce the methodology used in [61], which consists of generating five different training/test pairs using a cross-validation approach. The same procedures of Section VI-A are used in order to compare the different approaches to model selection and error estimation, and the results are reported in Tables X–XIII.

Table X shows the error rate obtained on the reference sets using the soft loss, where the in-sample methods outperform the out-of-sample ones most of the time (8 cases versus 5). The interesting fact is the large improvement of the in-sample methods with hints with respect to the analogous versions without hints. Providing some a priori knowledge for selecting the classifier space appears to be very useful and, in some cases (e.g., the Brain Tumor 2, Colon Cancer 2, and DLBCL datasets), makes it possible to solve problems that no other method, in-sample without hints or out-of-sample, can deal with. Analogously, Table XI reports the misclassification rates on the reference sets using the hard loss, which favors out-of-sample methods. In this case, the three out-of-sample methods globally outperform the in-sample ones, but none of them, considered alone, is consistently better than the in-sample ones.

Finally, Tables XII and XIII show the error estimation using the soft and hard loss, respectively. The BTS provides better estimates than all the other methods but, unfortunately, it suffers from two serious drawbacks: the estimates are very loose and, in some cases (e.g., the Brain Tumor 2, Colon Cancer 2, and DLBCL datasets), the estimation is not consistent, as it underestimates the actual classifier error rate. This is an indication that, in the small-sample setting, where the test data is very scarce, both in-sample and out-of-sample methods are of little use for estimating the generalization error of a classifier. However, while out-of-sample methods cannot be improved, because they work in a parametric setting where the Clopper–Pearson bound is the tightest possible, in-sample methods could lead to better estimates, as they allow for further improvements both in the theoretical framework and in their practical application.
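Since, with the hard loss, the number of test errors follows a binomial distribution, the Clopper–Pearson construction mentioned above can be sketched as follows. This is the textbook exact binomial bound implemented with SciPy for illustration, not the paper's own code, and the helper name is hypothetical.

```python
# Illustrative Clopper-Pearson (exact binomial) upper bound on the error rate.
from scipy.stats import beta

def clopper_pearson_upper(k_errors, m_samples, delta=0.05):
    """One-sided upper bound on the true error probability, holding with
    probability >= 1 - delta, when k_errors misclassifications are observed
    on m_samples i.i.d. test points (hard loss)."""
    if k_errors >= m_samples:
        return 1.0
    # Upper endpoint of the exact binomial (Clopper-Pearson) interval
    return float(beta.ppf(1.0 - delta, k_errors + 1, m_samples - k_errors))

# Example: 4 errors on 40 test samples (10% observed error rate)
print(clopper_pearson_upper(4, 40))   # ~0.21, tighter than a Hoeffding-style bound
```

In the parametric hard-loss setting this is essentially the best a test-set estimate can do, which is consistent with the observation above that out-of-sample estimates cannot be improved further.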

C. Some Notes on the Computational Effort

The proposed approach addresses the problem of model selection and error estimation of SVMs. Though general, this approach gives benefits when dealing with small-sample problems (d ≫ n), like the gene expression datasets, where only a few samples are available (n ≈ 100). In this framework, the computational cost of the proposed method is not a critical issue, because the time needed to perform the procedure is small. In fact, as an example, we report in Table XIV the computational time (in seconds) needed to perform the different in-sample and out-of-sample procedures on the MNIST dataset; it is worth noting that the learning procedures always require less than 1 min to conclude. The reported values refer to an Intel Core i5 2.3-GHz architecture, and the source code is written in Fortran90. Similar results are obtained using the other datasets of this paper.

VII. CONCLUSION

We have detailed a complete methodology for applying two in-sample approaches, based on the data-dependent SRM principle, to the model selection and error estimation of SV classifiers. The methodology is theoretically justified and obtains good results in practice. At the same time, we have shown that in-sample methods can be comparable to, or even better than, the more widely used out-of-sample methods, at least in the small-sample setting. A step toward improving their adoption is our proposal for transforming the in-sample learning problem from the Ivanov formulation to the Tikhonov one, so that it can be easily approached by conventional SVM solvers. We believe that our analysis opens new perspectives on the application of the data-dependent SRM theory to practical problems, by showing that the common belief about its poor practical effectiveness is greatly exaggerated. The SRM theory is just a different and sophisticated statistical tool that needs to be used with some care, and we hope it will be further improved in the future, both by building sharper theoretical bounds and by finding cleverer ways to exploit hints for centering the classifier space.

REFERENCES

[1] I. Guyon, A. Saffari, G. Dror, and G. Cawley, “Model selection: Beyond the Bayesian/frequentist divide,” J. Mach. Learn. Res., vol. 11, pp. 61–87, Jan. 2010.
[2] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Comput., vol. 4, no. 1, pp. 1–58, 1992.
[3] P. Bartlett, “The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network,” IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.
[4] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[5] V. Vapnik, “An overview of statistical learning theory,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[6] Y. Shao, C. Zhang, X. Wang, and N. Deng, “Improvements on twin support vector machines,” IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 962–968, Jun. 2011.
[7] D. Anguita, S. Ridella, and F. Rivieccio, “K-fold generalization capability assessment for support vector classifiers,” in Proc. IEEE Int. Joint Conf. Neural Netw., Jul.–Aug. 2005, pp. 855–858.

[8] B. Milenova, J. Yarmus, and M. Campos, “SVM in oracle database 10g: Removing the barriers to widespread adoption of support vector machines,” in Proc. 31st Int. Conf. Very Large Data Bases, 2005, pp. 1–1163.
[9] Z. Xu, M. Dai, and D. Meng, “Fast and efficient strategies for model selection of Gaussian support vector machine,” IEEE Trans. Syst., Man Cybern., Part B, Cybern., vol. 39, no. 5, pp. 1292–1307, Oct. 2009.
[10] T. Glasmachers and C. Igel, “Maximum likelihood model selection for 1-norm soft margin SVMs with multiple parameters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1522–1528, Aug. 2010.
[11] K. De Brabanter, J. De Brabanter, J. Suykens, and B. De Moor, “Approximate confidence and prediction intervals for least squares support vector regression,” IEEE Trans. Neural Netw., vol. 22, no. 1, pp. 110–120, Jan. 2011.
[12] M. Karasuyama and I. Takeuchi, “Nonlinear regularization path for quadratic loss support vector machines,” IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1613–1625, Oct. 2011.
[13] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London, U.K.: Chapman & Hall, 1993.
[14] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. Int. Joint Conf. Artif. Intell., 1995, pp. 1137–1143.
[15] F. Cheng, J. Yu, and H. Xiong, “Facial expression recognition in JAFFE dataset based on Gaussian process classification,” IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1685–1690, Oct. 2010.
[16] D. Anguita, A. Ghio, S. Ridella, and D. Sterpi, “K-fold cross validation for error rate estimate in support vector machines,” in Proc. Int. Conf. Data Mining, 2009, pp. 1–7.
[17] T. Clark, “Can out-of-sample forecast comparisons help prevent overfitting?” J. Forecasting, vol. 23, no. 2, pp. 115–139, 2004.
[18] D. Rapach and M. Wohar, “In-sample versus out-of-sample tests of stock return predictability in the context of data mining,” J. Empirical Finance, vol. 13, no. 2, pp. 231–247, 2006.
[19] A. Isaksson, M. Wallman, H. Goransson, and M. Gustafsson, “Cross-validation and bootstrapping are unreliable in small sample classification,” Pattern Recognit. Lett., vol. 29, no. 14, pp. 1960–1965, 2008.
[20] U. M. Braga-Neto and E. R. Dougherty, “Is cross-validation valid for small-sample microarray classification?” Bioinformatics, vol. 20, no. 3, pp. 374–380, 2004.
[21] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[22] K. Duan, S. Keerthi, and A. Poo, “Evaluation of simple performance measures for tuning SVM hyperparameters,” Neurocomputing, vol. 51, pp. 41–59, Apr. 2003.
[23] D. Anguita, A. Boni, R. Ridella, F. Rivieccio, and D. Sterpi, “Theoretical and practical model selection methods for support vector classifiers,” in Support Vector Machines: Theory and Applications, L. Wang, Ed. New York: Springer-Verlag, 2005, pp. 159–180.
[24] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi, “General conditions for predictivity in learning theory,” Nature, vol. 428, no. 6981, pp. 419–422, 2004.
[25] B. Scholkopf and A. J. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2001.
[26] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, “Model complexity control for regression using VC generalization bounds,” IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 1075–1089, Sep. 1999.
[27] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony, “Structural risk minimization over data-dependent hierarchies,” IEEE Trans. Inf. Theory, vol. 44, no. 5, pp. 1926–1940, Sep. 1998.
[28] P. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and error estimation,” Mach. Learn., vol. 48, no. 1, pp. 85–113, 2002.
[29] J. C. Platt, “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999.
[30] C. Lin, “Asymptotic convergence of an SMO algorithm without any assumptions,” IEEE Trans. Neural Netw., vol. 13, no. 1, pp. 248–250, Jan. 2002.
[31] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large Margin Classifiers. Cambridge, MA: MIT Press, 1999.
[32] D. Anguita, A. Ghio, N. Greco, L. Oneto, and S. Ridella, “Model selection for support vector machines: Advantages and disadvantages of the machine learning theory,” in Proc. Int. Joint Conf. Neural Netw., 2010, pp. 1–8.
[33] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading convexity for scalability,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 201–208.
[34] M. Anthony, Discrete Mathematics of Neural Networks: Selected Topics. Philadelphia, PA: SIAM, 2001.
[35] M. Aupetit, “Nearly homogeneous multi-partitioning with a deterministic generator,” Neurocomputing, vol. 72, nos. 7–9, pp. 1379–1389, 2009.
[36] C. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.
[37] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “The impact of unlabeled patterns in Rademacher complexity theory for kernel classifiers,” in Proc. Neural Inf. Process. Syst., 2011, pp. 1009–1016.
[38] P. Bartlett, O. Bousquet, and S. Mendelson, “Local Rademacher complexities,” Ann. Stat., vol. 33, no. 4, pp. 1497–1537, 2005.
[39] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, “Inference with the universum,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 1009–1016.
[40] J. Langford, “Tutorial on practical prediction theory for classification,” J. Mach. Learn. Res., vol. 6, no. 1, pp. 273–306, 2006.
[41] C. Clopper and E. Pearson, “The use of confidence or fiducial limits illustrated in the case of the binomial,” Biometrika, vol. 26, no. 4, pp. 404–413, 1934.
[42] J. Audibert, R. Munos, and C. Szepesvári, “Exploration-exploitation tradeoff using variance estimates in multi-armed bandits,” Theor. Comput. Sci., vol. 410, no. 19, pp. 1876–1902, 2009.
[43] A. Maurer and M. Pontil, “Empirical Bernstein bounds and sample variance penalization,” in Proc. Int. Conf. Learn. Theory, 2009, pp. 1–9.
[44] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer. Stat. Assoc., vol. 58, no. 301, pp. 13–30, 1963.
[45] V. Bentkus, “On Hoeffding’s inequalities,” Ann. Probab., vol. 32, no. 2, pp. 1650–1673, 2004.
[46] D. Anguita, A. Ghio, L. Ghelardoni, and S. Ridella, “Test error bounds for classifiers: A survey of old and new results,” in Proc. IEEE Symp. Found. Comput. Intell., Apr. 2011, pp. 80–87.
[47] P. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482, Nov. 2003.
[48] C. McDiarmid, “On the method of bounded differences,” Surv. Combinat., vol. 141, no. 1, pp. 148–188, 1989.
[49] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “Maximal discrepancy versus Rademacher complexity for error estimation,” in Proc. Eur. Symp. Artif. Neural Netw., 2011, pp. 257–262.
[50] S. Kutin, “Extensions to McDiarmid’s inequality when differences are bounded with high probability,” Dept. Comput. Sci., Univ. Chicago, Chicago, IL, Tech. Rep. TR-2002-04, 2002.
[51] E. Ordentlich, K. Viswanathan, and M. Weinberger, “Denoiser-loss estimators and twice-universal denoising,” in Proc. IEEE Int. Symp. Inf. Theory, Oct. 2009, pp. 1–9.
[52] R. Serfling, “Probability inequalities for the sum in sampling without replacement,” Ann. Stat., vol. 2, no. 1, pp. 39–48, 1974.
[53] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “Selecting the hypothesis space for improving the generalization ability of support vector machines,” in Proc. Int. Joint Conf. Neural Netw., 2011, pp. 1169–1176.
[54] Y. Abu-Mostafa, “Hints,” Neural Comput., vol. 7, no. 4, pp. 639–671, 1995.
[55] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The entire regularization path for the support vector machine,” J. Mach. Learn. Res., vol. 5, pp. 1391–1415, Dec. 2004.
[56] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K. Muller, and A. Zien, “Efficient and accurate lp-norm multiple kernel learning,” Adv. Neural Inf. Process. Syst., vol. 22, no. 22, pp. 997–1005, 2009.
[57] D. Anguita, A. Ghio, L. Oneto, and S. Ridella, “In-sample model selection for support vector machines,” in Proc. Int. Joint Conf. Neural Netw., 2011, pp. 1154–1161.
[58] D. Anguita, A. Ghio, and S. Ridella, “Maximal discrepancy for support vector machines,” Neurocomputing, vol. 74, no. 9, pp. 1436–1443, 2011.
[59] L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L. Jackel, Y. LeCun, U. Muller, E. Sackinger, and P. Simard, “Comparison of classifier methods: A case study in handwritten digit recognition,” in Proc. 12th IAPR Int. Conf. Pattern Recognit. Comput. Vis. Image Process., 1994, pp. 1–11.
[60] S. Munder and D. Gavrila, “An experimental study on pedestrian classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1863–1868, Nov. 2006.
[61] A. Statnikov, C. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis,” Bioinformatics, vol. 21, no. 5, pp. 631–643, 2005.


[62] A. Statnikov, I. Tsamardinos, and Y. Dosbayev, “GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data,” Int. J. Med. Inf., vol. 74, nos. 7–8, pp. 491–503, 2005.
[63] N. Ancona, R. Maglietta, A. Piepoli, A. D’Addabbo, R. Cotugno, M. Savino, S. Liuni, M. Carella, G. Pesole, and F. Perri, “On the statistical assessment of classifiers using DNA microarray data,” BMC Bioinf., vol. 7, no. 1, pp. 387–399, 2006.
[64] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proc. Nat. Acad. Sci. United States Amer., vol. 96, no. 12, pp. 6745–6767, 1999.
[65] M. West, C. Blanchette, H. Dressman, E. Huang, S. Ishida, R. Spang, H. Zuzan, J. Olson, J. Marks, and J. Nevins, “Predicting the clinical status of human breast cancer by using gene expression profiles,” Proc. Nat. Acad. Sci. United States Amer., vol. 98, no. 20, pp. 11462–11490, 2001.
[66] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
[67] D. Page, F. Zhan, J. Cussens, M. Waddell, J. Hardin, B. Barlogie, and J. Shaughnessy, Jr., “Comparative data mining for microarrays: A case study based on multiple myeloma,” presented at the International Conference on Intelligent Systems for Molecular Biology, Aug. 2002.

Davide Anguita (S’93-M’95) received the Laurea degree in electronic engineering and the Ph.D. degree in computer science and electronic engineering from the University of Genoa, Genoa, Italy, in 1989 and 1993, respectively. He was a Research Associate with the International Computer Science Institute, Berkeley, CA, working on special-purpose processors for neurocomputing. He joined the Department of Biophysical and Electronic Engineering (DIBE, now DITEN), University of Genoa, where he is currently an Associate Professor of smart electronic systems. His current research interests include the theory and applications of kernel methods and artificial neural networks.

Alessandro Ghio was born in Chiavari in 1982. He received the Master’s degree in electronic engineering and the Ph.D. degree in knowledge and information science from the University of Genoa, Genoa, Italy, in 2010. His current research interests include the theoretical and practical aspects of smart systems based on computational intelligence and machine learning methods. Currently, he is an IT Consultant.

Luca Oneto was born in Rapallo in 1986. He received the Specialistic Laurea degree in electronic engineering from the University of Genoa, Genoa, Italy, with the thesis Model Selection for Support Vector Machines: Advantages and Disadvantages of the Machine Learning Theory. He completed his thesis work together with Noemi Greco.

Sandro Ridella (M’93) received the Laurea degree in electronic engineering from the University of Genoa, Genoa, Italy, in 1966. He is currently a Full Professor with the Department of Biophysical and Electronic Engineering (DIBE, now DITEN), University of Genoa, where he teaches circuits and algorithms for signal processing. His scientific activity has been focused on the field of neural networks for the last five years.
