
Learning Sparse Kernel Classifiers for Multi-Instance Classification

Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang

Abstract— We propose a direct approach to learning sparse kernel classifiers for multi-instance (MI) classification to improve efficiency while maintaining predictive accuracy. The proposed method builds on a convex formulation for MI classification by considering the average score of individual instances for bag-level prediction. In contrast, existing formulations used the maximum score of individual instances in each bag, which leads to nonconvex optimization problems. Based on the convex MI framework, we formulate a sparse kernel learning algorithm by imposing additional constraints on the objective function to enforce the maximum number of expansions allowed in the prediction function. The formulated sparse learning problem for MI classification is convex with respect to the classifier weights. Therefore, we can employ an effective optimization strategy to solve the optimization problem that involves the joint learning of both the classifier and the expansion vectors. In addition, the proposed formulation can explicitly control the complexity of the prediction model while still maintaining competitive predictive performance. Experimental results on benchmark data sets demonstrate that our proposed approach is effective in building very sparse kernel classifiers while achieving comparable performance to the state-of-the-art MI classifiers.

Index Terms— Classification, convex optimization, kernel machines, multi-instance learning (MI).

Manuscript received December 11, 2012; revised February 16, 2013; accepted February 25, 2013. Date of publication May 13, 2013; date of current version August 16, 2013. This work was supported by the Australian Research Council under the Discovery Project DP0986052 entitled Automatic Music Feature Extraction, Classification and Annotation. Z. Fu is with the School of Computing, Engineering and Mathematics, University of Western Sydney, Kingswood 2747, Australia (e-mail: [email protected]). G. Lu, K. M. Ting, and D. Zhang are with the Gippsland School of Information Technology, Monash University, Churchill 3842, Australia (e-mail: [email protected]; [email protected]; dengsheng. [email protected]). Digital Object Identifier 10.1109/TNNLS.2013.2254721

I. INTRODUCTION

MULTI-INSTANCE (MI) classification is a paradigm in supervised learning first introduced by Dietterich [1]. In an MI classification problem, training examples are presented in the form of bags. Each bag contains a collection of instances drawn from two label classes. Only bag labels are given for classification purposes; instance labels are not observed a priori. The key assumption for MI classification is that a positively labeled bag contains at least one positive instance, whereas a negative bag contains negative instances only. We focus on kernel methods for MI classification. Despite the large amount of work on kernel methods for MI classification [2]–[6], the majority of it has concentrated on improving predictive performance. In this paper, we explore

the novel perspective of building sparse kernel prediction models for MI classification to improve efficiency in prediction without compromising on performance.

The prediction function for a standard kernel classifier follows a generalized linear model with predictor variables given by the kernel values evaluated on the training instances. The prediction speed of a kernel classifier is proportional to the number of training instances with nonzero coefficients, called expansion vectors (XVs) in this paper. Hence, for applications where fast prediction speed is required, it is desirable to have sparse prediction models with few XVs. The issue becomes more serious for MI classification for two reasons. Firstly, the number of XVs in a trained classifier largely depends on the number of training instances. Even a moderately sized MI problem may involve a large number of instances, leading to a classifier with potentially many XVs. Secondly, the prediction function is usually defined at the instance level explicitly [3] or implicitly [2]. To predict the bag label, one needs to apply the classifier to each instance in the bag. Removing a single XV from the prediction function therefore results in savings proportional to the size of the bag. Thus it is important to remove redundant XVs and hence improve efficiency for MI prediction.

In this paper, we address the issue of learning sparse classifiers directly for MI classification. To the best of our knowledge, this is the first comprehensive investigation of this problem in the literature despite its importance. We make the following main contributions to the fields of MI classification and sparse kernel classifier learning.

1) To formulate sparse MI classifier learning as a feasible optimization problem, we introduce an alternative label-mean formulation for MI classification that predicts the bag label by considering the average of decision values from individual instances in each bag. This is in contrast to the dominant label-max formulations that use the maximum individual decision value for bag-level prediction [3]–[5], [7]. Theoretical justification for the label-mean formulation is also provided, which hinges on the proper choice of kernel functions.

2) A discriminative approach is taken for sparse classifier learning. The XVs and the kernel classifier are learned jointly in a discriminative fashion to minimize the regularized risk for MI classification.

3) The proposed approach can explicitly control the complexity of the prediction model in the problem formulation. This is in contrast to existing sparse classifier


training algorithms, where model complexity is implicitly controlled by using a sparsity-preserving L1 norm [8], and is particularly desirable for classification scenarios where an optimal classifier needs to be trained within a budget of maximum complexity.

4) Unlike existing formulations of sparse kernel classifier learning in the standard single-instance (SI) setting [9], [10] that cast the optimization in the dual form, a primal optimization framework is developed in this paper, which gains a significant runtime advantage over the dual formulation and makes it particularly suited for sparse MI classifier learning.

II. RELATED WORK

Kernel methods have been popular in MI classification over the last decade. Specifically, various adaptations of kernel support vector machine (SVM) classifiers to the MI setting were proposed in the literature, including MI-kernel [2], mi-SVM [3], MI-SVM [3], regularization SVM [4], MILES [8], MissSVM [5], and MILIS [11]. Based on how classification is performed, these methods can be roughly divided into two categories. The first category includes MI-kernel, MILES, and MILIS, which aim at developing a bag-level kernel or feature representation and directly learning the classifier for bag-level prediction. The majority of methods are in the second category; they modify the regularization framework and incorporate the MI assumption explicitly into the constraints of the resulting optimization problem. A bottom-up strategy is usually followed by these methods for learning a classifier at the instance level. Bag-level predictions can be made by considering the maximum of the predictions of individual instances in accordance with the MI assumption. The maximum operation, however, leads to difficult optimization problems that hinder the development of effective MI classification techniques.

Along a separate line of research, sparse kernel classifiers have received much attention in the literature on supervised learning. As discussed earlier, the purpose of learning sparse kernel classifiers is to reduce the number of XVs in the prediction function and hence improve prediction speed. With a smaller number of XVs, the prediction model is sparser, resulting in fewer evaluations of the kernel function; hence the prediction process is more efficient. Early sparse kernel methods focus on pre- and postprocessing steps for building economically sized prediction models. The core idea behind preprocessing methods is low-rank approximation, where the full kernel matrix is decomposed into the inner product of a low-rank matrix and its transpose. Therefore, the classifier weight vector can be represented by the span of a subset of instances corresponding to columns of the low-rank matrix. The subset can either be selected randomly from the training set [12], [13], or by greedy selection of training instances as XVs that minimizes the reconstruction error of the kernel matrix [14]. The same trick can be applied for postprocessing methods like the reduced set (RS) method [15], which reduces the number of expansions involved in the classifier by fitting the weight vector with a linear combination of a reduced number of vectors that minimizes the reconstruction error of the classifier.

A major problem with pre- and postprocessing-based methods is that label information is ignored in the selection and approximation of XVs, which can negatively influence the learning of a kernel classifier for discriminative learning. This was addressed in recent works on discriminative sparse kernel learning [9], [10]. A method for building kernel SVMs with reduced classifier complexity was proposed in [9] by greedily locating a set of kernel expansions of a specified maximum size to minimize the SVM cost function. It is shown that the number of XVs needed in the resulting kernel classifier is much smaller than that in the standard SVM, with comparable performance. Wu et al. [10] proposed a direct formulation for learning sparse kernel classifiers with controlled complexity in the prediction function by constraining the maximum allowed number of expansions. Unlike [9], where the XVs are selected from the training instances, the XVs in [10] are obtained as part of the optimization process and thus allow more flexibility in approximating the classification function. Beyond the scope of sparse kernel classifiers, local classification models have also been proposed to reduce the complexity of kernel classifiers, most notably models based on the SVM [16]–[18]. However, most of these methods aim at reducing the training cost by using a divide-and-conquer strategy.

The method proposed in this paper for MI classification is related to the aforementioned joint optimization framework proposed by Wu et al. [10] for SI classification. Nevertheless, direct adaptation of their method to the MI setting is not possible, as the base classifier considered in their model must correspond to a convex optimization problem with a unique global minimum. This is a prerequisite for the applicability of the optimization technique used in [10]. The standard SVM classifier used in SI classification is essentially a convex quadratic problem [15], which is not the case for MI classification. Hence, a first step toward learning sparse kernel prediction models in the MI setting is to develop a convex model for MI classification, which is discussed in the following section.

III. LABEL-MEAN FORMULATION FOR MI CLASSIFICATION

A. Label-Mean Versus Label-Max Formulations

In MI classification, we are given m training bags $\{X_1, \ldots, X_m\}$ and their corresponding labels $\{y_1, \ldots, y_m\}$ with $y_i \in \{-1, 1\}$, where $X_i = \{x_{i,1}, \ldots, x_{i,n_i}\}$ is the ith bag containing $n_i$ instances. Each instance $x_{i,p}$ in the bag is also associated with a hidden label $y_{i,p}$. Under the MI assumption, $y_{i,p} = -1$ for all $p \in \{1, \ldots, n_i\}$ if $y_i = -1$, and $y_{i,p} = 1$ for some p if $y_i = 1$. Only bag labels are available for training, and instance labels are not observable from the data. Based on the above notation, the MI assumption can be written as

$$y_i = \max_{p=1}^{n_i} y_{i,p} \quad \forall i. \tag{1}$$

The purpose of MI classification is to learn a classifier F to predict bag labels. F can either be learned directly, or by learning an instance-level classifier f first and aggregating the scores of individual instances returned by f .
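To make the bag notation and the assumption in (1) concrete, the following minimal Python sketch (our own construction, not code from the paper) derives a bag label from hypothetical hidden instance labels:

```python
import numpy as np

def bag_label_from_instance_labels(instance_labels):
    """MI assumption (1): y_i = max_p y_{i,p}, with labels in {-1, +1}."""
    return int(np.max(instance_labels))

# toy bags: a positive bag holds at least one positive instance,
# a negative bag holds negative instances only
print(bag_label_from_instance_labels([-1, +1, -1]))   # +1
print(bag_label_from_instance_labels([-1, -1, -1]))   # -1
```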


The risk minimization framework provides a powerful tool for developing kernel classifiers. It can be naturally extended to the MI setting for learning instance- [3], [5] and bag-level kernel classifiers [2], [4]. In the following, we provide our formulation for MI classification based on the regularization framework:

$$\min_{f \in \mathcal{H}} L(f) = \frac{1}{2}\|f\|_{\mathcal{H}}^{2} + C\sum_{i=1}^{m}\ell\big(y_i, F(X_i)\big) \tag{2}$$

$$F(X_i) = \frac{1}{n_i}\sum_{p=1}^{n_i} f(x_{i,p}) \quad \forall i \tag{3}$$

where f is the instance-level classifier function to be solved for in the optimization problem, and F is the bag-level classifier that takes the average of the prediction values of individual instances given by f. f is defined in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a Hilbert space of functions induced by the kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the reproducing property $f(x) = \langle f, k(x, \cdot)\rangle$, where $\mathcal{X}$ is the input space of instances. The first term on the right-hand side (RHS) is a regularization term that controls the complexity of the classifier f using its RKHS norm, denoted by $\|\cdot\|_{\mathcal{H}}$, and the second term minimizes the empirical losses on bag predictions in the training data set, where C is a constant parameter specifying its relative importance. $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{+}$ is the nonnegative loss function that measures the discrepancy between the ground-truth label and the decision value. f is both a function on $\mathcal{X}$ and a vector in the RKHS $\mathcal{H}$ in (2). To distinguish the two roles, we use the notation w to denote the vector form of f in $\mathcal{H}$ in subsequent text. Denoting $\phi(x) = k(x, \cdot)$ and viewing it as the feature projection vector in the RKHS for input x, (2) translates to the following equivalent formulation for learning a linear classifier function in $\mathcal{H}$:

$$\min_{w \in \mathcal{H}} \frac{1}{2}\|f\|_{\mathcal{H}}^{2} + C\sum_{i=1}^{m}\ell\left(y_i,\ \frac{1}{n_i}\sum_{p=1}^{n_i} f(x_{i,p})\right) \tag{4}$$

$$f(x) = w^{T}\phi(x) \tag{5}$$

where $\|f\|_{\mathcal{H}}^{2} = \|w\|^{2}$ and the equality for f(x) follows directly from the reproducing property. A major difference between our formulation and existing risk minimization-based MI formulations [3]–[5] is the constraint term in (3) that captures the relationship between instance and bag predictions. Most existing formulations used different constraints for bags that involve the use of maximization

$$F(X_i) = \max_{p=1}^{n_i} f(x_{i,p}) \quad \forall i. \tag{6}$$

The use of maximization in the constraints, called label-max in subsequent text, leads to difficult optimization problems because of the nonsmooth and nonconvex nature of the max operator. In contrast, the formulation with label-mean constraints in (3) is more appropriate for optimization purposes because of the convex and smooth average operator.
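The contrast can be made concrete with a tiny sketch (ours, using made-up decision values): the label-mean rule in (3) is linear in the instance scores, whereas the label-max rule in (6) is not.

```python
import numpy as np

def bag_score_max(instance_scores):
    # label-max, cf. (6): nonsmooth and nonconvex as a function of f
    return np.max(instance_scores)

def bag_score_mean(instance_scores):
    # label-mean, cf. (3): a linear, hence convex and smooth, aggregation
    return np.mean(instance_scores)

scores = np.array([-0.8, 0.3, -0.2])          # hypothetical instance decision values
print(bag_score_max(scores), bag_score_mean(scores))
```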


B. Justification of the Label-Mean Formulation

Despite the computational advantage, the label-mean formulation is less intuitive than label-max and seemingly violates the MI assumption expressed in (1), which implies the use of a maximum relationship between bag and instance label predictions. Nevertheless, it can be justified by establishing its equivalence with the set kernel, which further connects to the MI-kernel [2], as shown in our previous work [19]. Here, we provide theoretical results to validate the label-mean formulation. Before introducing the main theorem, we begin with the definition of separability.

Definition 1 (Separability): In binary classification, we say two classes are separable with margin $\mu$ by classifier $f : \mathcal{X} \to \mathbb{R}^{+}$ if $f(x_{+}) > 1$ for each positive instance $x_{+}$ and $0 \le f(x_{-}) \le 1 - \mu$ for each negative instance $x_{-}$. Two classes are simply separable by classifier f if $f(x_{+}) > f(x_{-})$ for any positive instance $x_{+}$ and negative instance $x_{-}$.

The above definition of separability is slightly different from the standard definition used in the literature on large-margin classifiers [15], which requires that scores for negative instances be smaller than −1. This is not a major issue, though, as we can always apply scaling to f and add a bias term to translate between the two separability criteria. Based on the above definitions, we now present our main theorem.

Theorem 1: If positive and negative instances are separable with margin $\mu$ by classifier f in RKHS $\mathcal{H}$ with kernel k, then for sufficiently large r: 1) positive and negative bags satisfying the MI assumption are separable by the bag-level classifier F defined by the average operator in the label-mean formulation using the instance classifier $f(x) = \langle v, \psi(x)\rangle$, where $\psi(x)$ represents the feature map in the RKHS $\mathcal{H}_r$ induced by the kernel $k^{r}$, the rth power of kernel k, and v is given by the linear expansion below

$$v = \sum_{i=1}^{m}\sum_{j=1}^{n_i} \alpha_{i,j}\,\psi(x_{i,j}) \tag{7}$$

2) a similarity-based feature vector can be defined for each bag, such that positive and negative bags are separable by a linear classifier in the bag feature space

$$s(X_i) = \frac{1}{n_i}\left[\sum_{p} k^{r}(x_{i,p}, x_1),\ \ldots,\ \sum_{p} k^{r}(x_{i,p}, x_N)\right] \tag{8}$$

where $x_t$ (t = 1, . . . , N) is a reordering of the training instances sequentially over all bags.

The theorem establishes the validity of the label-mean formulation for MI classification. The main tradeoff for the MI problem is the use of a more complex kernel than in the SI counterpart, indicated by the r-factor in the theorem. This implies that if positive and negative instances are linearly separable, a kernel classifier with an rth-order polynomial kernel can be built to guarantee separability at the bag level using label-mean; if instances are separable with a Gaussian kernel, bags are separable with a Gaussian kernel with a larger bandwidth.


Fig. 1. Illustration of feature mapping for the label-mean formulation. (a) Input data. (b) Linear mapping. (c) Polynomial kernel mapping (r = 3). (d) Gaussian kernel mapping. (e) Poor selection of XVs [×s in (a)]. (f) Proper selection of XVs [+s in (a)].

In addition, a bag-level feature representation, inspired by label-mean, is also introduced in (8) with a guarantee of separability between positive and negative bags. The bag-level classifier F is of the form $F(X) = \sum_i \alpha_i s_i(X)$, where $s_i(X)$ is the ith feature in the bag feature vector. The instance classifier f is then given by

$$f(x) = \sum_i \alpha_i y_i \frac{1}{n_i}\sum_{p=1}^{n_i} k^{r}(x_{i,p}, x) = \sum_t \alpha_t\, k^{r}(x_t, x). \tag{9}$$

The proof of Theorem 1 is omitted here for simplicity. Interested readers can find the proof and more theoretical results on the complexity of kernels in the supplementary material. Now, we use an example to demonstrate the label-mean feature representation in (8). The data set contains ten positive bags and ten negative ones, each containing three instances in the 2-D space. Each positive bag contains a single positive instance and two negative instances. Instances from the two classes are linearly separable, as shown in Fig. 1(a), where circles and triangles represent positive and negative instances, respectively. Fig. 1(b) plots the bag-level feature vectors computed according to (8) with a linear kernel, which is equivalent to considering the average of the instance feature vectors for each bag. It can be easily seen that the bag feature vectors are no longer linearly separable. Fig. 1(c) and (d) shows the corresponding bag-level feature vectors rendered by the polynomial kernel with degree 3 and the Gaussian kernel, respectively. The original feature vectors are defined in the 60-D Euclidean space given by the total number of instances. For visualization purposes, we projected them to a 2-D space by applying linear discriminant analysis. It can be clearly observed that in both cases, the projected points representing training bags from the two classes can be separated by a linear classifier, demonstrating the effectiveness of the label-mean formulation.
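The bag-level feature map in (8) can be computed as in the short sketch below (our own code with toy data; a Gaussian instance kernel stands in for $k^r$, and the set of anchor instances is arbitrary):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma):
    # k(x, z) = exp(-gamma * ||x - z||^2) for all pairs of rows of X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def bag_feature(bag, anchors, gamma):
    """s(X_i) as in (8): average the kernel rows of the bag's instances
    against the anchor instances x_1, ..., x_N."""
    K = gaussian_kernel(bag, anchors, gamma)   # n_i x N
    return K.mean(axis=0)                      # length-N bag feature vector

bag = np.random.randn(3, 2)                    # a bag with three 2-D instances
anchors = np.random.randn(6, 2)                # e.g., all training instances
print(bag_feature(bag, anchors, gamma=0.5).shape)   # (6,)
```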

IV. SPARSE KERNEL CLASSIFIER FOR MI CLASSIFICATION

A. Motivation

A question naturally arises here, as the instance classifier f in (9) is computed from a linear combination of kernel values evaluated over all training instances. The complexity of data prediction depends on the number of nonzero $\alpha_i$s and the number of instances in each bag. From (9), we can see that a single nonzero $\alpha_i$ variable renders all instances in the corresponding bag XVs. The total number of XVs is then equal to $\sum_{i:\,\alpha_i \neq 0} n_i$, which is roughly proportional to the number of instances in the training set. In addition, to make a bag-level prediction, one needs to evaluate f(x) for all instances in the bag and average the instance prediction values. This creates a more adverse situation for MI prediction as compared with SI prediction with kernel classifiers. To make prediction faster, it is essential to have a model with fewer expansions in (9). It is possible to use a reduced number of XVs from which kernel values are evaluated and construct a sparse prediction model with many of the $\alpha_{i,j}$s in (7) being zero. Intuitively, this is equivalent to sampling the columns of the bag-level feature representation in (8) and learning a linear classifier. However, the XVs have to be chosen carefully. Arbitrary choices of XVs and sampling can lead to poor prediction performance. We illustrate this point using the same example as in Fig. 1. Specifically, a feature embedding with poorly selected XVs and the Gaussian kernel is shown in Fig. 1(e). Here we constructed the bag feature vectors by choosing two instances as XVs, indicated by cross signs in Fig. 1(a). Both XVs are negative instances and far from the distribution of the positive class, and this yields poor feature vectors for bag classification. On the other hand, a feature embedding with properly selected XVs is


shown in Fig. 1(f), where the two instances chosen as XVs for bag-level feature computation are indicated by plus signs in Fig. 1(a). Here, the XVs are chosen to cover the positive and negative populations. Therefore, the computed bag features are linearly separable with no loss of discriminative power as compared with the dense feature map in Fig. 1(d).

Hence, we need a method to automatically train sparse kernel classifiers for MI classification. Random selection of XVs [13] is likely to produce poor results, especially for small numbers of expansions. Postprocessing methods such as the RS method [15] basically fit a trained classifier with a reduced number of XVs. Again, there is a tradeoff between approximation error and sparsity. The sparser the classifier, the larger the approximation error, and the poorer the classification performance. This is largely because the label information is ignored in the approximation step. A more principled approach to building sparse classification models is to impose a sparsity-preserving norm like the L1 norm on the classifier weights. This strategy is employed by MILES [8], a popular SVM formulation for MI classification. However, we are still confronted with the tradeoff between sparsity and accuracy, as sparsity is not explicitly controlled by the model. Therefore, for difficult problems, either sparsity is sacrificed for better accuracy, or the other way around.

Without loss of generality, we focus on the SVM with squared hinge loss and the Gaussian kernel in the following discussion. However, the technique developed here is quite general in nature and can be applied to a variety of differentiable loss functions, such as the squared and logit losses for least squares and logistic regression models respectively, and nonlinear kernels such as the polynomial kernel and the sigmoid kernel.

B. Model

Based on the aforementioned concerns, we propose a direct approach for building a sparse kernel SVM classifier for MI classification based on the label-mean formulation. Classifier learning and XV selection are coupled and formulated in a single optimization problem. This is achieved by adding an explicit constraint to control the complexity of the linear expansion for the weight vector v in (7). In this way, we can control the number of kernel evaluations involved in the computation of the prediction function in (9). Specifically, we aim at approximating v with a reduced set of feature maps while maintaining the large margin of the SVM objective function. The new optimization problem can thus be formulated as follows:

$$\min_{Z, v, \rho} \frac{1}{2}\|v\|^{2} + C\sum_{i=1}^{m}\ell\left(y_i,\ \frac{1}{n_i}\sum_{p=1}^{n_i}\big(v^{T}\psi(x_{i,p}) + \rho\big)\right) \tag{10}$$

$$v = \sum_{j=1}^{N_{XV}} \beta_j\,\psi(z_j) \tag{11}$$

where $N_{XV}$ is the total number of XVs allowed in the representation, $z_j$ is the jth XV, and $Z = [z_1, \ldots, z_{N_{XV}}]$ is a matrix with the XVs in its columns. Intuitively, we can view searching for the optimal solution in the above formulation as minimizing the bag-level loss in a subspace spanned by the feature maps of the XVs (the $\psi(z_j)$s) in the RKHS $\mathcal{H}_r$ induced by the instance kernel $k^{r}$, instead of over the whole RKHS. By directly specifying $N_{XV}$, the maximum number of XVs allowed in v, the optimal solution is guaranteed to reside in a subspace whose dimension is no larger than $N_{XV}$. The above formulation can be regarded as a joint optimization problem that toggles between the search for the optimal subspace and the search for the optimal solution in the subspace. Given that $N_{XV}$ is usually a small number, we can build a very sparse model for MI prediction.

Let $K_Z$ denote the $N_{XV} \times N_{XV}$ Gram matrix with the (i, j)th entry given by $k^{r}(z_i, z_j)$, and $K_{X_i,Z}$ denote the $n_i \times N_{XV}$ Gram matrix between the instances in bag i and the XVs, with the (p, j)th entry given by $k^{r}(x_{i,p}, z_j)$. We can rewrite the optimization problem in (10) in terms of $\beta$ instead of v by substituting (11) into the cost function in (10). This leads to the following objective function that we solve for the sparse SVM model for MI classification:

$$\min_{\beta, \rho, Z} Q(\beta, \rho, Z) = \frac{1}{2}\beta^{T} K_Z\,\beta + C\sum_{i}\ell\left(y_i,\ \frac{1}{n_i}\big(K_{X_i,Z}\,\beta + \rho\big)\right). \tag{12}$$
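To illustrate how (12) can be evaluated, here is a rough, self-contained Python sketch under our own conventions (bags stored as a list of n_i x d arrays, squared hinge loss, a Gaussian kernel standing in for $k^r$); it is a sketch of the objective only, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(X, Z, gamma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def squared_hinge(y, f):
    return max(0.0, 1.0 - y * f) ** 2

def objective_Q(beta, rho, Z, bags, y, C, gamma):
    """Q(beta, rho, Z) of (12) with the label-mean bag score."""
    K_Z = gaussian_kernel(Z, Z, gamma)              # N_XV x N_XV Gram matrix
    cost = 0.5 * beta @ K_Z @ beta                  # regularization term
    for X_i, y_i in zip(bags, y):
        K_XiZ = gaussian_kernel(X_i, Z, gamma)      # n_i x N_XV
        f_i = (K_XiZ @ beta).mean() + rho           # average instance score + bias
        cost += C * squared_hinge(y_i, f_i)
    return cost

# toy usage: with beta = 0 and rho = 0 every bag incurs a unit loss
rng = np.random.default_rng(0)
bags = [rng.standard_normal((4, 2)), rng.standard_normal((6, 2))]
print(objective_Q(np.zeros(3), 0.0, rng.standard_normal((3, 2)),
                  bags, [1, -1], C=1.0, gamma=0.5))   # 2.0
```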

C. Optimization Strategy

The sparse formulation in (12) involves the joint optimization of two interdependent sets of target variables: the classifier weights $(\beta, \rho)$ and the XVs $z_j$s. A change in one of them would influence the optimal solution of the other. A straightforward strategy to tackle this problem is via alternating minimization [20]. However, alternating optimization lacks a convergence guarantee and usually leads to slow convergence in practice [20], [21]. Hence, we use a more efficient and effective strategy to solve the problem formulation by viewing it as an optimization problem for the optimal value function. Specifically, we convert the original problem defined in (12) into the following problem that depends on the variable Z only

$$\min_{Z}\ g(Z) \quad \text{with} \quad g(Z) = \min_{\beta,\rho} Q(\beta, \rho, Z). \tag{13}$$

g(Z), the new objective function, is special, as it is the optimal value of Q optimized over the variables $(\beta, \rho)$. The evaluation of g(Z) at a fixed point Z is equivalent to training a 2-norm SVM given the XVs and computing the cost of the trained model in (12). To train the 2-norm SVM, we minimize the function $Q(\beta, \rho, Z)$ over $\beta$ and $\rho$ by fixing Z. This can be done easily with various numerical optimization routines. In this paper, we used the limited-memory BFGS (L-BFGS) algorithm for its efficiency and superlinear convergence rate. The implementation of L-BFGS requires the cost $Q(\beta, \rho, Z)$ and the gradient information as follows:

$$\frac{\partial Q}{\partial \beta} = K_Z\,\beta + C\sum_{i=1}^{m}\frac{1}{n_i}\,\ell_{f_i}(y_i, f_i)\,K_{X_i,Z}^{T}\mathbf{1} \tag{14}$$

$$\frac{\partial Q}{\partial \rho} = C\sum_{i=1}^{m}\frac{1}{n_i}\,\ell_{f_i}(y_i, f_i) \tag{15}$$


where the partial derivative of $\ell$ with respect to $f_i$ is

$$\ell_{f_i}(y_i, f_i) = \begin{cases} 2(f_i - y_i), & y_i f_i < 1 \\ 0, & y_i f_i \ge 1. \end{cases} \tag{16}$$

Denoting $\beta$ and $\rho$ as the minimizers of $Q(\beta, \rho, Z)$ at Z, the value of the function g(Z) is then given by

$$g(Z) = \frac{1}{2}\beta^{T} K_Z\,\beta + C\sum_{i}\ell\left(y_i,\ \frac{1}{n_i}\big(K_{X_i,Z}\,\beta + \rho\big)\right). \tag{17}$$
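The loss derivative in (16) is easy to sanity-check numerically; the snippet below (purely illustrative) does so for the squared hinge loss with labels in {-1, +1}:

```python
def sq_hinge(y, f):
    return max(0.0, 1.0 - y * f) ** 2

def sq_hinge_grad(y, f):
    # l_f(y, f) = 2 (f - y) if y f < 1, else 0, cf. (16)
    return 2.0 * (f - y) if y * f < 1 else 0.0

eps, y, f = 1e-6, 1.0, 0.3
numeric = (sq_hinge(y, f + eps) - sq_hinge(y, f - eps)) / (2 * eps)
print(abs(numeric - sq_hinge_grad(y, f)) < 1e-6)      # True
```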

We assume the Gram matrix $K_Z$ to be positive definite, which is always the case with the Gaussian kernel for distinct $z_j$s.¹ Q is then a strictly convex function with respect to $\beta$ and $\rho$, and the optimal solution at each Z is unique. This makes g(Z) a proper function with a unique value for each Z. In addition, the uniqueness of the optimal solution also makes the derivative analysis for g(Z) possible. The existence and computation of the derivative of the optimal value function have been well studied in the optimization literature. Specifically, [22, Th. 4.1] provides sufficient conditions for the existence of the derivative of g(Z). According to the theorem, the differentiability of g(Z) is guaranteed by the uniqueness of the optimal solution $\beta$ and $\rho$, as we discussed earlier, and by the differentiability of $Q(\beta, \rho, Z)$ with respect to $\beta$ and $\rho$, which is ensured by the squared hinge loss function we adopted. In addition, the derivative of g(Z) can be computed at each given Z by substituting the minimizers $\beta$ and $\rho$ into (17) and taking the derivative as if g(Z) did not depend on $\beta$ and $\rho$:

$$\frac{\partial g}{\partial z_j} = \sum_{i=1}^{N_{XV}} \beta_i\,\beta_j\,\frac{\partial k(z_i, z_j)}{\partial z_j} + C\sum_{i=1}^{m}\frac{1}{n_i}\,\ell_{f_i}(y_i, f_i)\,\beta_j\sum_{p=1}^{n_i}\frac{\partial k(x_{i,p}, z_j)}{\partial z_j}. \tag{18}$$

The derivative terms in the above equation depend on the specific choice of kernel. In this paper, we adopted the Gaussian kernel

$$k(x, z) = \exp\big(-\gamma\|x - z\|^{2}\big)$$

where $\gamma$ is the scale parameter of the kernel. The power of a Gaussian kernel is still a Gaussian kernel. This is a nice property that saves us from searching for the optimal power r for bag-level separation according to Theorem 1 in the previous section. If instances are separable with a Gaussian kernel, then bags are separable with a Gaussian kernel with a larger $\gamma$ value. Hence, in practice, we only have to tune the $\gamma$ parameter of the Gaussian kernel. Based on the use of the Gaussian kernel, the partial derivative terms in (18) can be replaced by the following:

$$\frac{\partial k(x, z_j)}{\partial z_j} = 2\gamma\,(x - z_j)\,k(x, z_j).$$

¹In practice, we can enforce positive definiteness by adding a small value to the diagonal of $K_Z$.
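The assembly of (18) with the Gaussian kernel derivative above is sketched below (our own code and naming; beta and rho are assumed to be the minimizers returned by the inner 2-norm SVM solve, and bags is a list of n_i x d arrays):

```python
import numpy as np

def rbf(a, b, gamma):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def sq_hinge_grad(y, f):
    return 2.0 * (f - y) if y * f < 1 else 0.0

def grad_g_wrt_zj(j, Z, beta, rho, bags, y, C, gamma):
    """Gradient of g with respect to the j-th XV, following (18)."""
    z_j = Z[j]
    g = np.zeros_like(z_j)
    # first term of (18); the i = j summand vanishes since (z_i - z_j) = 0
    for i, z_i in enumerate(Z):
        g += beta[i] * beta[j] * 2 * gamma * (z_i - z_j) * rbf(z_i, z_j, gamma)
    # second term of (18): contribution of each bag through the loss derivative
    for X_i, y_i in zip(bags, y):
        K = np.array([[rbf(x, z, gamma) for z in Z] for x in X_i])
        f_i = (K @ beta).mean() + rho                    # label-mean bag score
        coeff = C * sq_hinge_grad(y_i, f_i) * beta[j] / len(X_i)
        for x in X_i:
            g += coeff * 2 * gamma * (x - z_j) * rbf(x, z_j, gamma)
    return g
```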

Algorithm 1 Sparse Kernel SVM for MI Classification

Input: data (X_i, y_i), N_XV, λ, s_max, and t_max
Output: classifier weights β, ρ, and XVs Z

Set t = 0 and initialize Z
Solve min_{β,ρ} Q(β, ρ, Z^(t)) and denote the optimizer as (β^(t), ρ^(t)) and the optimal value as g(Z^(t))
repeat
    for s = 1 to s_max do
        Set Z' = Z^(t) − λ ∂g(Z)/∂Z
        Solve min_{β,ρ} Q(β, ρ, Z') and denote the optimizer as (β', ρ') and the optimal value as g(Z')
        if g(Z') < g(Z^(t)) then
            Set (β^(t+1), ρ^(t+1)) = (β', ρ'), Z^(t+1) = Z'
            Set λ = 2λ if s equals 1
            break
        end if
        Set λ = λ/2
    end for
    Set t = t + 1
until convergence or t ≥ t_max or s ≥ s_max
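Algorithm 1 can be restated compactly as below, assuming two callables: eval_g(Z), which performs the inner SVM solve and returns (g(Z), beta, rho) as in (17), and grad_g(Z, beta, rho), which returns dg/dZ as in (18). The sketch mirrors the accept-any-decrease line search of the algorithm; it is an outline, not the authors' code.

```python
def sparse_mi_train(Z0, eval_g, grad_g, lam=1.0, s_max=10, t_max=50):
    Z = Z0.copy()
    g_val, beta, rho = eval_g(Z)
    for t in range(t_max):
        accepted = False
        for s in range(1, s_max + 1):
            Z_new = Z - lam * grad_g(Z, beta, rho)       # gradient step on the XVs
            g_new, beta_new, rho_new = eval_g(Z_new)     # retrain the 2-norm SVM
            if g_new < g_val:                            # accept any decrease
                Z, g_val, beta, rho = Z_new, g_new, beta_new, rho_new
                if s == 1:
                    lam *= 2.0                           # grow step if first try worked
                accepted = True
                break
            lam /= 2.0                                   # otherwise backtrack
        if not accepted:
            break                                        # line search exhausted
    return beta, rho, Z
```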

D. Algorithm and Extension

Based on g(Z) and its derivative given in (17) and (18), we can develop a gradient descent approach to solve the overall optimization problem for the sparse kernel SVM in (12). The detailed steps are outlined in Algorithm 1. In addition to the input data and N_XV, a predefined number of XVs, additional input parameters include λ, the initial step size, s_max, the maximum number of line searches allowed, and t_max, the maximum number of iterations. For the line search, we implemented a strategy similar to backtracking, without imposing the condition of sufficient decrease. Any step size resulting in a decrease in function value is immediately accepted. This is more efficient than more sophisticated line search strategies, with a significant reduction in the number of expensive function evaluations.

The optimization scheme we used was also adopted previously for the sparse kernel machine [10] and simple multiple kernel learning (MKL) [21]. The major difference here is that the SVM problem is solved directly in its primal formulation when evaluating the optimal value function g(Z). This is mainly because of the use of the squared hinge loss in the cost function, which makes the primal formulation continuously differentiable with respect to the classifier parameters. In contrast, both the sparse kernel machine [10] and simple MKL [21] used the nondifferentiable hinge loss for the basic SVM model, which has to be treated in the dual form to guarantee the differentiability of the optimal value function. Solving the primal problem has a great computational advantage as compared with the dual. For our formulation, it is cheaper to solve the primal problem for the SVM, as it only involves N_XV + 1 variables. In contrast, the complexity of the dual problem is much higher. It involves the computation and caching of the kernel matrix in solving the SVM. In addition, evaluating the gradients with respect to the XVs is also much more costly, as it needs to aggregate over the gradient value


for each entry in the kernel matrix, which is the sum of inner products between vectors of kernel function values over instances and XVs. The complexity of the gradient computation scales with $O(N_{XV} N^{2})$, where N is the total number of instances in the training set. This is much higher than the complexity of $O(N_{XV} N)$ in the primal case.

The proposed algorithm can also be extended to deal with multiclass MI problems. As a multiclass problem can be decomposed into several binary (i.e., two-class) problems using various decomposition schemes, each binary problem may introduce a different set of XVs if the binary version of the algorithm is applied directly. Therefore, the XVs have to be learned in a joint fashion for multiclass problems to ensure that each binary classifier shares the same XVs. Consider M binary problems and let $\beta^{c}$, $\rho^{c}$ denote the weight and bias for the cth classifier, respectively (c = 1, . . . , M); the optimization problem is only slightly different from (12)

$$\min_{\beta, \rho, Z} Q(\beta, \rho, Z) = \sum_{c=1}^{M} Q_c(\beta^{c}, \rho^{c}, Z) = \sum_{c=1}^{M}\left[\frac{1}{2}(\beta^{c})^{T} K_Z\,\beta^{c} + C\sum_{i}\ell\left(y_i^{c},\ \frac{1}{n_i}\big(K_{X_i,Z}\,\beta^{c} + \rho^{c}\big)\right)\right]. \tag{19}$$

The same strategy can be adopted for the optimization of the above equation by introducing g(Z) as the optimal value function of Q over the $\beta^{c}$s and $\rho^{c}$s. The first step of each iteration is the same, except that M classifiers are trained instead of one. These M classifiers can be trained separately by minimizing $Q_c(\beta^{c}, \rho^{c}, Z)$ for c = 1, . . . , M. The argument on the existence of the derivative of g(Z) still holds, and the derivative can be computed with a minor modification via

$$\frac{\partial g}{\partial z_j} = \sum_{c=1}^{M}\sum_{i=1}^{N_{XV}} \beta_i^{c}\,\beta_j^{c}\,\frac{\partial k(z_i, z_j)}{\partial z_j} + C\sum_{c=1}^{M}\sum_{i=1}^{m}\frac{1}{n_i}\,\ell_{f_i}(y_i^{c}, f_i^{c})\,\beta_j^{c}\sum_{p=1}^{n_i}\frac{\partial k(x_{i,p}, z_j)}{\partial z_j}. \tag{20}$$
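Structurally, the multiclass objective (19) is just a sum of per-class binary costs that share the same XVs, as in this small sketch (it reuses objective_Q from the sketch following (12); names are ours):

```python
def objective_Q_multiclass(betas, rhos, Z, bags, y_multi, C, gamma):
    """Sum of per-class costs Q_c in (19); betas/rhos are per-class weights and
    biases, y_multi[c] holds the {-1, +1} labels of the c-th binary problem."""
    return sum(objective_Q(betas[c], rhos[c], Z, bags, y_multi[c], C, gamma)
               for c in range(len(betas)))
```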


E. Updating Prototypes for Sparse Instance Features

The computational cost in both time and space for the evaluation of the gradient in (18) depends heavily on the feature dimension d. The first term on the RHS of (18) has a time complexity of $O(N_{XV} d)$, and the second term on the RHS has a time complexity of $O(\sum_{i=1}^{m} n_i d) = O(Nd)$. The overall time complexity is dominated by the second term, as $N \gg N_{XV}$ and d is usually smaller than both N and $N_{XV}$. The storage required by the computation of the gradients is O(d) for each XV. When the feature dimension d is high, the evaluation of the gradients would be quite costly. This is an issue for sparse instance feature vectors, where d can be in the order of hundreds or thousands, and most of the feature values are zero.

To overcome the issue encountered with sparse input features, we can make a further assumption that the XVs should be chosen from the span of a predefined set of basis vectors (BVs), instead of allowing them to take arbitrary values in the input domain. Each XV is then given by the following:

$$z_j = \sum_{q=1}^{Q} v_{j,q}\,b_q = B\,v_j \tag{21}$$

where the $b_q$s (q = 1, . . . , Q) are the BVs, $v_{j,q}$ is the weight of the qth BV for constructing $z_j$, the jth XV, and Q is the total number of BVs used. $B = [b_1, \ldots, b_Q]$ is the matrix with the BVs stacked in columns and $v_j = [v_{j,1}, \ldots, v_{j,Q}]$ is the weight vector for the jth XV.

The use of BVs for computing XVs in (21) has two advantages over the direct computation of XVs. Firstly, the sparsity of the XVs can be better preserved with the linear combination of BVs as compared with direct computation, which is likely to generate dense gradient vectors and consequently dense XVs. On the other hand, BVs are sparse vectors if chosen to overlap with the input feature vectors, and their span is also likely to be sparse unless the BVs have very different sparsity patterns. Secondly, using a linear span of BVs restricts the possible values that each XV can take. This serves as a regularization mechanism on the XVs. Otherwise, the XVs can take arbitrary values for each entry through the direct optimization process, which may not produce optimal results.

Based on the change of variables in (21), we can now reformulate the optimization problem in (12) in terms of V instead of Z as follows:

$$\min_{\beta, \rho, V} Q(\beta, \rho, V) = \frac{1}{2}\beta^{T} K_V\,\beta + C\sum_{i}\ell\left(y_i,\ \frac{1}{n_i}\big(K_{X_i,V}\,\beta + \rho\big)\right) \tag{22}$$

$$K_V = \big[k(Bv_i, Bv_j)\big]_{N_{XV} \times N_{XV}}, \qquad K_{X_i,V} = \big[k(x_{i,p}, Bv_j)\big]_{n_i \times N_{XV}}. \tag{23}$$

The gradient computation can also be revised accordingly by

$$\frac{\partial g}{\partial v_k} = \sum_{i=1}^{N_{XV}} \beta_i\,\beta_k\,\frac{\partial k(z_i, Bv_k)}{\partial v_k} + C\sum_{i=1}^{m}\frac{1}{n_i}\,\ell_{f_i}(y_i, f_i)\,\beta_k\sum_{p=1}^{n_i}\frac{\partial k(x_{i,p}, Bv_k)}{\partial v_k}. \tag{24}$$
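The change of variables in (21) and the kernel evaluations in (23) amount to the following sketch (our own code with random toy shapes), where each XV is the product of the basis matrix B and a small weight vector:

```python
import numpy as np

def xvs_from_bvs(B, V):
    """B: d x Q basis vectors in columns; V: Q x N_XV weights. z_j = B v_j, cf. (21)."""
    return B @ V                                    # d x N_XV, one XV per column

def rbf_against_xvs(X, B, V, gamma):
    # kernel values k(x_{i,p}, B v_j) forming K_{X_i,V} in (23)
    Z = xvs_from_bvs(B, V)
    d2 = ((X[:, :, None] - Z[None, :, :]) ** 2).sum(axis=1)
    return np.exp(-gamma * d2)                      # n_i x N_XV

B = np.random.randn(5, 8)                           # 8 basis vectors in 5-D
V = np.random.randn(8, 3)                           # weights defining 3 XVs
X = np.random.randn(4, 5)                           # a bag with 4 instances
print(rbf_against_xvs(X, B, V, gamma=0.1).shape)    # (4, 3)
```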

Based on the Gaussian kernel, the partial derivative terms in (24) can be replaced by

$$\frac{\partial k(x, Bv_k)}{\partial v_k} = 2\gamma\,B^{T}(x - Bv_k)\,k(x, Bv_k).$$

V. EXPERIMENTAL RESULTS

A. Results on Synthetic Data

In our experiments, we first use two synthetic data sets to showcase the interesting properties of the proposed sparse kernel SVM algorithm for MI classification.


Fig. 2. Demonstration of the proposed sparse kernel classifier on two synthetic MI data sets. (a) Binary MI data set. (b) 3-class MI data set.

Fig. 2(a) shows an example of binary MI classification, where each positive bag contains at least one instance in the center, whereas each negative bag contains only instances on the surrounding circle. Fig. 2(b) shows an example of MI classification with three classes. Instances are generated from four Gaussians $N(\mu_i, \sigma^{2})$ (i = 1, . . . , 4) with $\mu_1 = [-2, 2]^{T}$, $\mu_2 = [2, 2]^{T}$, $\mu_3 = [2, -2]^{T}$, $\mu_4 = [-2, -2]^{T}$, and $\sigma = 0.25$. Bags from class 3 contain only instances randomly drawn from $N(\mu_2, \sigma^{2})$ and $N(\mu_4, \sigma^{2})$, whereas each bag from class 1 contains at least one instance drawn from $N(\mu_1, \sigma^{2})$ and each bag from class 2 contains at least one instance drawn from $N(\mu_3, \sigma^{2})$. For the binary example, a single XV in the center is sufficient to discriminate between bags from the two classes. For the multiclass example, two XVs are needed for discrimination, and their optimal locations should overlap with $\mu_1$ and $\mu_3$, the centers of classes 1 and 2. For both cases, we initialize our algorithm with poor XV locations, as shown by the big plus signs in the first column of Fig. 2. The second column shows the data overlaid with the XVs obtained from the final iteration, and arrows pointing from the initial XVs to the corresponding final XVs. Based on the trained classifier, we can also partition the data space into regions predicted for each class. Decision regions corresponding to different classes are shown in different shades in each plot. It can be seen that, despite the poor initialization, the algorithm is able to find good XVs and make the right decisions. The convergence is quite fast, with the first few iterations already making pronounced improvements over the initialization and fine-tuning done by the remaining iterations. This is further demonstrated by the monotonically decreasing function values over the iterations in the rightmost column of Fig. 2, with a significant decrease in the first few iterations.
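The 3-class synthetic data described above can be generated as in the following sketch; the bag size, the single-witness construction, and the choice of drawing background instances from the class-3 components are our own assumptions, not specified by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
mus = np.array([[-2, 2], [2, 2], [2, -2], [-2, -2]], dtype=float)
sigma = 0.25

def sample(mu_idx, n):
    return mus[mu_idx] + sigma * rng.standard_normal((n, 2))

def make_bag(label, n_instances=5):
    if label == 3:                       # class 3: only N(mu_2, s^2) and N(mu_4, s^2)
        idx = rng.choice([1, 3], size=n_instances)
        return np.vstack([sample(i, 1) for i in idx])
    witness = 0 if label == 1 else 2     # class 1 -> mu_1, class 2 -> mu_3
    background = rng.choice([1, 3], size=n_instances - 1)
    return np.vstack([sample(witness, 1)] + [sample(i, 1) for i in background])

bags = [make_bag(c) for c in (1, 2, 3)]
print([b.shape for b in bags])           # [(5, 2), (5, 2), (5, 2)]
```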

B. Results on Real-World Data

We used five real-world data sets in our experiments. These include two data sets on drug activity prediction (MUSK1 and MUSK2), two data sets on image categorization (COREL-10 and COREL-20), and one data set on music genre classification (GENRE). MUSK1 and MUSK2 were first introduced in [1] and are widely used as benchmarks for MI classification. MUSK1 contains 47 positive bags and 45 negative ones, with an average of 5.2 instances in each bag. MUSK2 contains 39 positive bags and 63 negative ones, with an average of 64.7 instances in each bag. COREL-10 and COREL-20 are from the COREL image collection and were introduced for MI classification in [8]. COREL-20 contains 2000 JPEG images from 20 categories, with 100 images per category. COREL-10 is a subset of COREL-20 with images from the first ten categories. Each image is segmented into 2–13 regions, from which color and texture features are extracted. For our experiments, we simply used the processed features obtained from the authors' website. GENRE was originally introduced in [23] and is first tested for MI classification in this paper. The data set contains 1000 audio tracks equally distributed over ten music genres. We split each track, which is roughly 30 s long, into overlapping 3-s segments with 1-s steps. Audio features are then extracted from each segment following the procedures in [23].

We compared the proposed sparse kernel classifier for MI classification (SparseMI) with existing kernel SVM implementations for MI classification, including mi-SVM [3], MI-SVM [3], MI-kernel [2], and MILES [8]. For MUSK1 and MUSK2, we performed ten-fold cross validation (CV) and recorded CV accuracies.


TABLE I
PERFORMANCE COMPARISON FOR DIFFERENT METHODS ON MI CLASSIFICATION

| Method | MUSK1 | MUSK2 | COREL10 | COREL20 | GENRE |
|---|---|---|---|---|---|
| mi-SVM | 87.30 ± 1.56% / 400.2 ± 0.5 | 80.52 ± 1.88% / 2029.2 ± 18.6 | 75.62 ± 1.77% / 1458.2 ± 25.4 | 52.15 ± 2.02% / 2783.3 ± 37.8 | 80.28 ± 1.25% / 12438.6 ± 111.6 |
| MI-SVM | 77.10 ± 2.69% / 277.1 ± 0.5 | 83.20 ± 2.62% / 583.4 ± 20.4 | 74.35 ± 2.35% / 977.0 ± 31.5 | 55.37 ± 2.19% / 2300.1 ± 28.3 | 72.48 ± 1.85% / 3639.8 ± 113.6 |
| MI-Kernel | 89.93 ± 1.33% / 362.4 ± 3.4 | 90.26 ± 1.33% / 3531.2 ± 135.6 | 84.30 ± 1.48% / 1692.6 ± 214.4 | 73.19 ± 1.04% / 3664.8 ± 144.6 | 77.05 ± 2.02% / 13843.9 ± 364.8 |
| MILES | 85.73 ± 1.05% / 40.8 ± 2.8 | 87.64 ± 1.23% / 42.5 ± 3.5 | 82.86 ± 1.31% / 379.1 ± 23.2 | 69.16 ± 0.92% / 868 ± 48.1 | 69.87 ± 2.12% / 1127.9 ± 134.2 |
| SparseMI | 90.12 ± 1.05% / 170 ([50, 500]) | 88.52 ± 1.40% / 10 ([10, 10]) | 84.63 ± 1.29% / 120 ([50, 500]) | 72.97 ± 1.12% / 340 ([50, 1000]) | 76.89 ± 1.42% / 800 ([500, 1000]) |

| NXV | Method | MUSK1 | MUSK2 | COREL10 | COREL20 | GENRE |
|---|---|---|---|---|---|---|
| 10 | RS | 88.68 ± 2.36% | 86.39 ± 2.59% | 75.13 ± 2.30% | 55.20 ± 1.80% | 54.68 ± 2.09% |
| 10 | RSVM | 74.66 ± 3.47% | 77.70 ± 3.50% | 69.25 ± 2.17% | 48.53 ± 2.80% | 46.01 ± 3.36% |
| 10 | SparseMI | 88.44 ± 1.49% | 88.52 ± 1.40% | 80.10 ± 1.31% | 62.49 ± 1.23% | 71.28 ± 3.31% |
| 50 | RS | 89.61 ± 1.10% | 89.92 ± 1.81% | 79.87 ± 1.72% | 66.27 ± 1.51% | 65.27 ± 2.08% |
| 50 | RSVM | 87.23 ± 2.05% | 86.71 ± 2.00% | 76.94 ± 1.64% | 62.54 ± 1.45% | 63.58 ± 2.84% |
| 50 | SparseMI | 89.98 ± 1.19% | 88.02 ± 2.91% | 84.19 ± 1.20% | 71.66 ± 1.13% | 75.28 ± 1.76% |
| 100 | RS | 90.18 ± 1.33% | 89.16 ± 1.88% | 78.81 ± 2.04% | 65.35 ± 1.31% | 67.77 ± 1.64% |
| 100 | RSVM | 89.02 ± 2.14% | 88.26 ± 2.05% | 77.63 ± 1.58% | 63.82 ± 1.57% | 67.41 ± 2.02% |
| 100 | SparseMI | 90.40 ± 1.25% | 87.98 ± 3.23% | 84.31 ± 1.32% | 72.22 ± 1.39% | 76.06 ± 1.56% |

The first block shows the mean accuracies and standard deviations in percentage (first value in each cell) and the numbers of XVs (second value) obtained by SparseMI and existing methods. For SparseMI, the average numbers of XVs are reported along with the minimum and maximum in brackets. The second block shows the accuracies of the three sparse classifiers with a given number of XVs NXV. The highest accuracy for each data set is highlighted in each block along with any tied results determined by paired t-tests on the differences over ten runs.

For the other data sets, we randomly selected 50% of the data for training and the remaining 50% for testing, and recorded test accuracies. The experiments are repeated ten times for each data set with different random partitions. Instance features are normalized to zero mean and unit standard deviation for all data sets. We implemented all methods being compared here with a Gaussian kernel. The kernel parameter γ and the SVM parameter C for all methods are chosen via threefold CV on the training data. For SparseMI, we set s_max = 10, t_max = 50, and λ equal to the average pairwise distance between the initial XVs. The optimal number of XVs is determined by CV and chosen from a discrete set of candidate values {10, 50, 100, 500, 1000}.

In a different set of tests, we compared SparseMI with two alternative naive methods for sparse MI classification: RS and RSVM. The former was discussed earlier in the previous section, and RSVM is a reduced SVM classifier learned from a randomly selected set of XVs [13], which is equivalent to running SparseMI without any optimization. All three methods can explicitly control the sparsity of the prediction model; hence we tested them at varied sparsity levels with 10, 50, and 100 XVs.

Test results on the benchmark data sets used in our experiments are reported in Table I. Results are organized in two blocks. The first block features the comparison between SparseMI with a tuned number of XVs and existing MI classifiers. The second block includes results for SparseMI and the two naive sparsification methods at three fixed sparsity levels. From the results, we can see that SparseMI strikes a good balance between accuracy and sparsity among existing MI classifiers. Compared with existing classifiers like MI-kernel and MILES,

SparseMI has comparable accuracy rates yet significantly fewer XVs in the prediction function. Compared with alternative sparse implementations such as RS and RSVM, SparseMI achieves better performance in the majority of cases. The gap in performance is more evident with a smaller number of XVs, as can be observed for the case of N_XV = 10. With increasing N_XV, the performance of SparseMI is further improved. SparseMI performs comparably with MI-kernel at the optimally tuned sparsity level. However, MI-kernel has far more XVs than SparseMI, because MI-kernel is simply the dense version of SparseMI. Compared with MILES, which has better sparsity than the other SVM methods, SparseMI not only achieves better accuracy but also obtains sparser models. This is especially true in the cases of multiclass MI classification, for the reasons we discussed in the previous section. In addition, with SparseMI, we can explicitly control sparsity by specifying the number of XVs, whereas the sparsity of MILES can only be implicitly controlled by tuning the regularization parameter.

C. Results on Sparse Input Features

Here, we performed experiments to show the utility of SparseMI for MI classification on high-dimensional sparse input features. We used seven real-world text data sets from [3], namely TST1–4, TST7, TST9, and TST10. Each document, represented by a bag in text categorization, is split into passages with overlapping windows of maximal 50 words each. A bag-of-words (BOW) feature vector is extracted from each passage as the instance feature vector, whose length


depends on the size of the dictionary. Each feature represents the frequency of occurrence of the corresponding dictionary word, weighted by the inverse document frequency. Hence, the BOW representation yields high-dimensional feature vectors with many zero entries. For the TST data sets used in this experiment, each set has 200 positive bags and 200 negative ones. Each bag contains one to 20 instances, with an average of about 8.5 instances per bag. Each instance feature vector has around 65 000 attributes and only about 0.025% of the attributes in each data set are nonzero. Therefore, it is impossible to solve the SparseMI problem using the standard formulation. Instead, we employed a different expression of the XVs using a predefined set of BVs, as discussed in Section IV-E. We randomly selected 100 BVs from the training instances and tested SparseMI for each data set with ten XVs. The initial coefficients are also randomly generated for each XV. We also tested SparseMI with higher numbers of XVs without noticing significant changes in performance. This is also supported by the results of treating N_XV as a separate parameter and selecting it by CV. Hence, we only report the results of SparseMI with ten XVs here. Along with SparseMI, we applied the same algorithms from the previous subsection with an identical experimental setup.

The results for the different methods are shown in Table II. We can observe similar relative performance between SparseMI and the other methods in comparison with the results obtained on the data sets with dense input features in Table I. Compared with existing SVM methods like MI-kernel and MILES, SparseMI has comparable accuracy rates yet significantly fewer XVs in the prediction function. Compared with alternative sparse implementations like RS and RSVM, SparseMI achieves better performance in all cases. In summary, SparseMI is the only method able to produce satisfactory accuracies and sparse classification models at the same time.

TABLE II
PERFORMANCE COMPARISON FOR DIFFERENT METHODS ON MI CLASSIFICATION WITH SPARSE INPUT FEATURES

| Data Set | MI-SVM | MI-Kernel | MILES | RS | RSVM | SparseMI |
|---|---|---|---|---|---|---|
| TST1 | 94.88 ± 0.60% / 496.6 ± 7.1 | 94.85 ± 0.44% / 1935.8 ± 24.4 | 95.65 ± 0.61% / 999.8 ± 0.2 | 88.30 ± 4.36% | 81.35 ± 2.66% | 94.60 ± 0.69% |
| TST2 | 81.08 ± 1.16% / 633.7 ± 5.7 | 77.00 ± 1.44% / 2509.6 ± 20.4 | 83.58 ± 1.47% / 999.8 ± 0.2 | 66.25 ± 3.68% | 66.72 ± 2.14% | 76.38 ± 0.69% |
| TST3 | 86.75 ± 1.43% / 598.3 ± 7.3 | 86.92 ± 1.17% / 2290.0 ± 8.7 | 85.67 ± 1.12% / 999.9 ± 0.1 | 74.83 ± 7.22% | 68.58 ± 2.87% | 84.38 ± 1.00% |
| TST4 | 83.42 ± 0.99% / 573.4 ± 7.1 | 84.03 ± 0.53% / 2348.8 ± 22.3 | 83.47 ± 1.10% / 999.9 ± 0.1 | 71.17 ± 5.26% | 72.70 ± 2.18% | 84.33 ± 0.96% |
| TST7 | 80.78 ± 1.10% / 615.7 ± 5.4 | 80.67 ± 1.18% / 2407.2 ± 30.0 | 79.70 ± 1.03% / 999.8 ± 0.2 | 70.10 ± 4.64% | 67.97 ± 3.18% | 80.90 ± 0.93% |
| TST9 | 67.30 ± 1.40% / 701.8 ± 4.2 | 70.83 ± 1.29% / 2485.0 ± 13.3 | 67.17 ± 1.27% / 999.9 ± 0.1 | 61.95 ± 2.85% | 65.85 ± 1.31% | 70.25 ± 1.15% |
| TST10 | 83.38 ± 0.94% / 639.1 ± 4.1 | 83.50 ± 0.85% / 2450.9 ± 18.6 | 82.10 ± 1.23% / 999.9 ± 0.1 | 72.50 ± 3.50% | 69.85 ± 3.82% | 82.12 ± 1.11% |

For MI-SVM, MI-Kernel, and MILES, each cell shows the mean accuracy with standard deviation followed by the number of XVs; for RS, RSVM, and SparseMI, only accuracies are shown.

D. Initialization Schemes

All results reported in the previous subsections for SparseMI are based on choosing the initial XVs (or BVs, in the case of sparse instance features) randomly for the optimization algorithm. Here, we examine alternative strategies for selecting the XVs/bases as the starting point for the optimization and their influence on classification accuracies.

In addition to the first scheme with random selection of XVs, we compare two other initialization schemes here. The second scheme is unsupervised and chooses the centroids returned by applying the K-means clustering algorithm to the training instances as the initial XVs/BVs. The third scheme is a supervised XV/BV selection scheme. To choose the XVs/bases, we first ran an SVM classifier on the full-length bag-level feature vectors produced by the label-mean formulation in (8) and selected the top N_XV instances whose corresponding feature attributes have the largest SVM weights in absolute value. We then repeated the previous experiments for the data sets with both dense and sparse instance-level features, with 100 XVs for the dense data sets and 100 BVs and ten XVs for the sparse ones. The initial coefficients for the XVs are generated randomly for the sparse data sets.

Fig. 3 shows the classification performance in bar plots. Each group of bars corresponds to the accuracy rates and standard deviations of six methods: Random, Kmeans, SVM, Random+SparseMI, Kmeans+SparseMI, and SVM+SparseMI. These correspond to, from left to right, the three initialization schemes mentioned above without optimization, and the use of SparseMI with them. It can be seen that SparseMI is not very sensitive to the choice of initial XVs. The classification accuracies obtained by the different initialization schemes with SparseMI are quite close for almost all the data sets tested. The only exceptions are TST1 and TST2, where Random+SparseMI underperforms Kmeans+SparseMI and SVM+SparseMI, respectively, by a small margin. The differences for the other data sets are not statistically significant based on paired t-tests. This is not the case for the differences without optimization, where Kmeans performs poorly for three out of the seven sparse data sets. This may be because the BVs chosen by both SVM and Random overlap with the input instances, which is not the case for Kmeans, and this may alter the structure and sparsity patterns of the input features. However, this does not affect the performance of SparseMI. In all cases except the MUSK2 data set, SparseMI can significantly improve the classification performance as compared with the results without optimization, regardless of the initialization scheme chosen. This demonstrates the effectiveness of SparseMI for MI classification and its robustness to different starting points for the optimization.


Fig. 3. Performance comparison of different initialization schemes for SparseMI. (a) Data sets with dense instance features. (b) Data sets with sparse features. Bars for each data set correspond to the results obtained by Random, Kmeans, SVM, Random+SparseMI, Kmeans+SparseMI, and SVM+SparseMI from left to right in increasing brightness.


Fig. 4. Stepwise demonstration of the SparseMI optimization algorithm for six selected data sets. Darker solid line in each plot: evolution of cost function values. Broken line in each plot: evolution of test accuracies over iterations. (a) MUSK1. (b) COREL-20. (c) GENRE. (d) TST1. (e) TST2. (f) TST3.

E. Convergence Results

We proceed to study the behavior of SparseMI over the iterations of the optimization algorithm. For each data set, we randomly selected 50% of the bags for training and the remaining 50% for testing, with the same proportions of data for each class. We then applied SparseMI to the training data for 50 iterations and recorded the cost function value at each iteration. The classifier obtained from each iteration is used to perform prediction on the testing data, and the accuracy rates for all iterations are also recorded. The above stepwise test is repeated ten times for different random partitions of the training and testing data, and the average values are taken over the ten test rounds. Fig. 4 shows the results for six data

sets used in our previous experiments by plotting the cost function values and accuracy rates against the iteration number in each subfigure. These include three data sets with dense input features (MUSK1, COREL-20, and GENRE) and three with sparse input features (TST1–3). It can be seen clearly that, for each data set shown here, the cost function values are monotonically nonincreasing over the iterations, accompanied by a tendency of improvement in the test accuracies. The decrease in cost function values and the improvement in test accuracies are especially obvious in the first few iterations. After about 10 to 20 iterations, the performance gradually becomes stable on the test set and does not change much over further iterations despite small perturbations. Results on the other data sets used in the previous sections also show similar trends. These results


TABLE III
PERFORMANCE OF DIFFERENT METHODS FOR AUDIO SEGMENTATION

| Method | Train Time (s) | Test Time (s) | NXV | Accuracy (%) |
|---|---|---|---|---|
| LSVM | < 0.05 | 0.01 | | |
| SVM | ≈ 6 | 0.6 | | |
| MI-Kernel | 1200–2400 | 75–160 | | |
| MILES | 800–1200 | ≈ 2 | | |
| SparseMI | 650–700 | | | |
