
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 7, JULY 2015

Multitask Classification Hypothesis Space With Improved Generalization Bounds

Cong Li, Student Member, IEEE, Michael Georgiopoulos, Senior Member, IEEE, and Georgios C. Anagnostopoulos, Senior Member, IEEE

Abstract— This paper presents a pair of hypothesis spaces (HSs) of vector-valued functions intended to be used in the context of multitask classification. While both are parameterized on the elements of reproducing kernel Hilbert spaces and impose a feature mapping that is common to all tasks, one of them assumes this mapping as fixed, while the more general one learns the mapping via multiple kernel learning. For these new HSs, empirical Rademacher complexity-based generalization bounds are derived and shown to be tighter than the bound of a particular HS that appeared recently in the literature, leading to improved performance. As a matter of fact, the latter HS is shown to be a special case of ours. Based on an equivalence to Group-Lasso type HSs, the proposed HSs are utilized toward corresponding support vector machine-based formulations. Finally, experimental results on multitask learning problems underline the quality of the derived bounds and validate this paper's analysis.

Index Terms— Machine learning, pattern recognition, statistical learning, supervised learning, support vector machines.

I. INTRODUCTION

MULTITASK learning (MTL) has been an active research field for over a decade [5]. The fundamental philosophy of MTL is to simultaneously train several related tasks with shared information, in the hope of improving the generalization performance of each task through the assistance of the other tasks. More formally, in a typical MTL setting with $T$ tasks, we want to choose $T$ functions $f = (f_1, \ldots, f_T)$ from a hypothesis space (HS) $\mathcal{F} = \{x \mapsto [f_1(x), \ldots, f_T(x)]'\}$, where $x$ is an instance of some input set $\mathcal{X}$, such that the performance of each task is optimized based on a problem-specific criterion. Here, $[f_1(x), \ldots, f_T(x)]'$ denotes the transpose of the row vector $[f_1(x), \ldots, f_T(x)]$. MTL has been successfully applied in feature selection [2], [11], [13], regression [19], [23], metric learning [27], [33], and other kernel-based problems [1], [4], [10], [22], [28], to name a few.

Manuscript received October 30, 2013; revised May 10, 2014; accepted July 30, 2014. Date of publication August 26, 2014; date of current version June 16, 2015. The work of C. Li was supported by the National Science Foundation (NSF) under Grant 0806931 and Grant 0963146. The work of M. Georgiopoulos was supported by the NSF under Grant 0963146, Grant 1200566, and Grant 1161228. The work of G. C. Anagnostopoulos was supported by the NSF under Grant 1263011. C. Li and M. Georgiopoulos are with the Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816 USA (e-mail: [email protected]; [email protected]). G. C. Anagnostopoulos is with the Department of Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL 32901-6975 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2347054

Moreover, MTL has been shown to be advantageous in solving real-world problems, such as HIV therapy screening [3], web image and video search [31], web page classification [7], and disease prediction [35], again, to name a few.

Despite the abundance of MTL applications, relevant generalization bounds have only been developed for special cases. A theoretically well-studied MTL framework is the regularized linear MTL model, whose generalization bounds are studied in [16], [25], and [26]. In this framework, each function $f_t$ is a linear function with weight $w_t \in \mathcal{H}$, such that $\forall x \in \mathcal{X} \subseteq \mathcal{H}$, $f_t(x) = \langle w_t, x \rangle$, where $\mathcal{H}$ is a real Hilbert space equipped with the inner product $\langle \cdot, \cdot \rangle$. Different regularizers can be applied to the weights $w = (w_1, \ldots, w_T) \in \underbrace{\mathcal{H} \times \cdots \times \mathcal{H}}_{T \text{ times}}$ to fulfill different requirements of the problem at hand. Formally, given $(x_t^i, y_t^i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n_t$, $t = 1, \ldots, T$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output spaces for each task, the framework can be written as

$$\min_{w}\; R(w) + \lambda \sum_{i,t} L\big(f_t(x_t^i), y_t^i\big) \tag{1}$$

where $R(\cdot)$ and $L(\cdot,\cdot)$ are the regularizer and loss function, respectively. Many MTL models fall into this framework. For example, [8] aims for group sparsity of $w$, [34] discovers the group structure of multiple tasks, and [11] and [13] select features in an MTL context.

In the previous framework, tasks are implicitly related by regularizers on $w$. Another angle for considering information sharing among tasks is to preprocess the data from all tasks by a common processor and subsequently learn a linear model based on the processed data. One scenario of this learning framework is subspace learning, where the data of each task are projected onto a common subspace by an operator $A$, and the $w_t$'s are then learned in that subspace. Such an approach is followed in [2] and [17]. Another particularly straightforward and useful adaptation of this framework is kernel-based MTL. In this situation, the role of the operator $A$ is assumed by the nonlinear feature mapping $\phi$ associated with the kernel function in use: all data are preprocessed by a common kernel function, which is preselected or learned during the training phase, while the $w_t$'s are then learned in the corresponding reproducing kernel Hilbert space (RKHS). One example of this technique is given in [30]. One previous work that discussed the generalization bound of this method in a classification context is [24].
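As a toy illustration of framework (1), the sketch below instantiates it with a squared loss for $L$ and a mean-coupling regularizer $R(w) = \sum_t \|w_t - \bar w\|^2$ that implicitly relates the tasks, optimized by plain gradient descent. All concrete choices here (the loss, the regularizer, the solver, and all names) are our own assumptions for illustration, not prescriptions of the paper.

```python
import numpy as np

# Toy instance of framework (1): squared loss plus a mean-coupling
# regularizer R(w) = sum_t ||w_t - w_bar||^2 relating the tasks.
def mtl_fit(X, Y, lam=0.5, lr=0.05, epochs=300):
    T, d = len(X), X[0].shape[1]
    W = np.zeros((T, d))
    for _ in range(epochs):
        w_bar = W.mean(axis=0)  # treated as fixed within one sweep (a common simplification)
        for t in range(T):
            # gradient of per-task squared loss plus coupling term
            grad = X[t].T @ (X[t] @ W[t] - Y[t]) / len(Y[t]) + lam * (W[t] - w_bar)
            W[t] = W[t] - lr * grad
    return W

def mtl_loss(X, Y, W):
    return sum(np.mean((X[t] @ W[t] - Y[t]) ** 2) for t in range(len(X)))

# Three related regression tasks sharing a common underlying weight vector.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(4)
X = [rng.standard_normal((30, 4)) for _ in range(3)]
Y = [x @ (w_true + 0.1 * rng.standard_normal(4)) for x in X]
W = mtl_fit(X, Y)
```

The coupling term pulls all task weights toward their mean, which is one simple way the regularizer $R(w)$ can encode task relatedness.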

2162-237X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Given a set $\mathcal{A}$ of bounded self-adjoint linear operators on $\mathcal{X}$ and $T$ linear functions with weights $w_t$, the HS is given as $\mathcal{F} = \{x \mapsto [\langle w_1, Ax \rangle, \ldots, \langle w_T, Ax \rangle]' : \|w_t\|^2 \le R,\; A \in \mathcal{A}\}$. Clearly, in this HS, the data are preprocessed by the operator $A$ into a common space, as a strategy of information sharing among tasks. By either cleverly choosing $A$ beforehand or by learning $A \in \mathcal{A}$, it is expected that a tighter generalization bound can be attained compared with learning each task independently. It is straightforward to see that preselecting $A$ beforehand is a special case of learning $A \in \mathcal{A}$, i.e., preselecting $A$ is equivalent to $\mathcal{A} = \{A\}$. However, the limitations of $\mathcal{F}$ are twofold. First, in $\mathcal{F}$, all $w_t$'s are equally constrained in a ball whose radius $R$ is determined prior to training. In practice, however, an HS that lets each task have its own radius for the corresponding norm-ball constraint may be more appropriate and may lead to a better generalization bound and performance. The second limitation is that $\mathcal{F}$ cannot accommodate models that learn a common kernel function for all tasks, e.g., multitask multiple kernel learning (MT-MKL) models. One way to incorporate such kernel learning models into $\mathcal{F}$ is to let $A$ be the feature mapping $\phi : \mathcal{X} \to \mathcal{H}_\phi$, where $\mathcal{H}_\phi$ is the output space of the feature mapping $\phi$, and $\phi$ corresponds to the common kernel function $k$. In other words, this setting defines $\mathcal{F} = \{x \mapsto [\langle w_1, \phi(x) \rangle, \ldots, \langle w_T, \phi(x) \rangle]' : \|w_t\|^2 \le R,\; \phi \in \Omega(\phi)\}$, where $\Omega(\phi)$ is the set of feature mappings from which $\phi$ is learned. Obviously, the HS considered in [24] does not cover this scenario, since it only allows the operator $A$ to be a linear operator, which is not the case when $A = \phi$. Yet another limitation reveals itself when one considers the equivalent HS $\mathcal{F} = \{x \mapsto [\langle w_1, x \rangle, \ldots, \langle w_T, x \rangle]' : \|w_t\|^2 \le R,\; x \in \mathcal{H}_\phi,\; \phi \in \Omega(\phi)\}$, where, as mentioned above, $\mathcal{H}_\phi$ is the output space of the feature mapping $\phi$.
The HS in [24] fails to cover this HS due to the lack of the constraint $\phi \in \Omega(\phi)$, which indicates that the feature mapping $\phi$ (hence, its corresponding kernel function) is learned during the training phase instead of being selected beforehand. Therefore, in this paper, we generalize $\mathcal{F}$, particularly for kernel-based classification problems, by considering a common operator $\phi$ (associated with a kernel function) for all tasks and by imposing norm-ball constraints on the $w_t$'s with different radii that are learned during the training process, instead of being chosen prior to training. Specifically, we consider the HSs

$$\mathcal{F}_s \triangleq \big\{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \|w_t\|^2 \le \lambda_t^2 R,\; \lambda \in \Omega_s(\lambda)\big\} \tag{2}$$

and

$$\mathcal{F}_{s,r} \triangleq \big\{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \|w_t\|^2 \le \lambda_t^2 R,\; \lambda \in \Omega_s(\lambda),\; \phi \in \Omega_r(\phi)\big\} \tag{3}$$

where $\Omega_s(\lambda) \triangleq \{\lambda \succeq 0,\; \|\lambda\|_s \le 1,\; s \ge 1\}$, in which $\|\lambda\|_s \triangleq (\sum_{t=1}^T \lambda_t^s)^{1/s}$ stands for the Minkowski $L_s$-norm, and $\Omega_r(\phi) \triangleq \{\phi : \phi = (\sqrt{\theta_1}\phi_1, \ldots, \sqrt{\theta_M}\phi_M),\; \theta \succeq 0,\; \|\theta\|_r \le 1,\; r \ge 1\}$, with $\phi_m(x) \in \mathcal{H}_m$ and $\phi(x) \in \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$. The objective of our paper is to derive and analyze the generalization bounds of these two HSs. Specifically, the first HS, $\mathcal{F}_s$, has a fixed feature


mapping $\phi$, which is preselected, while the second HS, $\mathcal{F}_{s,r}$, learns the feature mapping via an MKL approach. We refer readers to [21] and the survey paper [12] for details on multiple kernel learning (MKL). By letting all $\lambda_t$'s equal 1, $\mathcal{F}_s$ degrades to the equal-radius HS, which is the special case of $\mathcal{F}_s$ obtained when $s \to +\infty$, as we show in the sequel. Considering the generalization bound of $\mathcal{F}_s$ based on the empirical Rademacher complexity (ERC) [25], we first demonstrate that the ERC is monotonically increasing with $s$, which implies that the tightest bound is achieved when $s = 1$. We then provide an upper bound for the ERC of $\mathcal{F}_s$, which also monotonically increases with respect to $s$. In the optimal case ($s = 1$), we achieve a generalization bound of order $O(\sqrt{\log T}/T)$, which decreases relatively fast with increasing $T$. On the other hand, when $s \to +\infty$, the bound does not decrease with increasing $T$ and is thus less preferred. We then derive the generalization bound for the HS $\mathcal{F}_{s,r}$, which still features a bound of order $O(\sqrt{\log T}/T)$ when $s = 1$, as in the single-kernel setting. In addition, if $M$ kernel functions are involved, the bound is of order $O(\sqrt{\log M})$, which has been proved to be the best bound that can be obtained in single-task multikernel classification [9]. Therefore, the optimal order of the bound is also preserved in the MT-MKL case. Note that the proofs of all theoretical results are in the Appendix.

After investigating the generalization bounds, we experimentally show that our bound on the ERC matches the real ERC very well. Moreover, we propose an MTL model based on support vector machines (SVMs) as an example of a classification framework that uses $\mathcal{F}_s$ as its HS. It is further extended to an MT-MKL setting, whose HS becomes $\mathcal{F}_{s,r}$. Experimental results on multitask classification problems show the effect of $s$ on the generalization ability of our model. In most situations, the optimal results are indeed achieved when $s = 1$, which matches our theoretical analysis. For the results that are not optimal at $s = 1$, contrary to expectation, we provide a justification.

II. FIXED FEATURE MAPPING

Let $\{x_t^i, y_t^i\} \in \mathcal{X} \times \{-1, 1\}$, $i = 1, \ldots, N$, $t = 1, \ldots, T$, be independent identically distributed (i.i.d.) training samples from some joint distribution. Without loss of generality, and on grounds of convenience, we assume an equal number of training samples for each task. Let $\mathcal{H}$ be an RKHS with reproducing kernel function $k(\cdot,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and associated feature mapping $\phi : \mathcal{X} \to \mathcal{H}$. In what follows, we give the theoretical analysis of our HS $\mathcal{F}_s$ when the feature mapping $\phi$ is fixed.

A. Theoretical Results

Given $T$ tasks, our objective is to learn $T$ linear functionals $f_t(\cdot) : \mathcal{H} \to \mathbb{R}$, such that $f_t(\phi(x)) = \langle w_t, \phi(x)\rangle$, $t = 1, \ldots, T$, $x \in \mathcal{X}$. Next, let $f \triangleq [f_1, \ldots, f_T]'$ and define the multitask classification error as

$$er(f) \triangleq \frac{1}{T}\sum_{t=1}^T E\big\{1_{(-\infty,0]}\big(y_t f_t(\phi(x_t))\big)\big\} \tag{4}$$


where $1_{(-\infty,0]}(\cdot)$ is the characteristic function of $(-\infty,0]$, referred to as the 0/1 loss function. The empirical error based on a surrogate loss function $\bar L : \mathbb{R} \to [0,1]$, which is a Lipschitz-continuous function that upper-bounds the 0/1 loss function, is defined as

$$\hat{er}(f) \triangleq \frac{1}{TN}\sum_{t,i=1}^{T,N} \bar L\big(y_t^i f_t(\phi(x_t^i))\big). \tag{5}$$

For the constraints on the $w_t$'s, instead of predefining a common radius $R$ for all tasks, as discussed in [24], we let $\|w_t\|^2 \le \lambda_t^2 R$, where $\lambda_t$ is learned during the training phase. This motivates our consideration of $\mathcal{F}_s$, as given in (2), which we repeat here:

$$\mathcal{F}_s \triangleq \big\{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \|w_t\|^2 \le \lambda_t^2 R,\; \lambda \in \Omega_s(\lambda)\big\}. \tag{6}$$

Note that the feature mapping $\phi$ is determined before training. To derive the generalization bound for $\mathcal{F}_s$, we first provide the following lemma.

Lemma 1: Let $\mathcal{F}_s$ be as defined in (6). Let $\bar L : \mathbb{R} \to [0,1]$ be a Lipschitz-continuous loss function with Lipschitz constant $\gamma$ that upper-bounds the 0/1 loss function $1_{(-\infty,0]}(\cdot)$. Then, with probability $1-\delta$, we have

$$er(f) \le \hat{er}(f) + \gamma \hat R(\mathcal{F}_s) + \sqrt{\frac{9\log\frac{2}{\delta}}{2TN}} \quad \forall f \in \mathcal{F}_s \tag{7}$$

where $\hat R(\mathcal{F}_s)$ is the ERC for MTL problems defined in [25]:

$$\hat R(\mathcal{F}_s) \triangleq E_\sigma\Big\{\sup_{f\in\mathcal{F}_s} \frac{2}{TN}\sum_{t,i=1}^{T,N}\sigma_t^i f_t(\phi(x_t^i))\Big\} \tag{8}$$

where the $\sigma_t^i$'s are i.i.d. Rademacher-distributed random variables (i.e., Bernoulli(1/2)-distributed random variables with sample space $\{-1,+1\}$). This lemma can be proved simply by utilizing Theorems 16 and 17 in [24]. Using the same proof strategy, it is easy to show that (7) is valid for all HSs considered in this paper. Therefore, we will not explicitly state a specialization of it for each additional HS encountered in the sequel. Next, we define the following duality mapping for all $a \in \mathbb{R}$:

$$(\cdot)^* : a \mapsto a^* \triangleq \begin{cases} \dfrac{a}{a-1}, & \forall a \ne 1 \\ +\infty, & a = 1 \end{cases} \tag{9}$$

and then we give the following results, which show that $\hat R(\mathcal{F}_s)$ is monotonically increasing with respect to $s$.

Lemma 2: Let $\sigma_t \triangleq [\sigma_t^1, \ldots, \sigma_t^N]'$ and $u_t \triangleq \sqrt{\sigma_t' K_t \sigma_t}$, where $K_t$ is the kernel matrix consisting of elements $k(x_t^i, x_t^j)$, $t = 1, \ldots, T$, and let $u \triangleq [u_1, \ldots, u_T]'$. Then, $\forall s \ge 1$, we have

$$\hat R(\mathcal{F}_s) = \frac{2\sqrt{R}}{TN} E_\sigma\{\|u\|_{s^*}\}. \tag{10}$$

Leveraging (10), one can show the following theorem.

Theorem 1: $\hat R(\mathcal{F}_s)$ is monotonically increasing with respect to $s$.

Define $\tilde{\mathcal{F}} \triangleq \{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \|w_t\|^2 \le R\}$, which is the HS given in [24] in the kernelized MTL setting; $\tilde{\mathcal{F}}$ is the HS with equal radius for each $\|w_t\|^2$. Obviously, it is the special case of $\mathcal{F}_s$ with all $\lambda_t$'s set to 1. We have the following result.

Theorem 2: $\hat R(\tilde{\mathcal{F}}) = \hat R(\mathcal{F}_{+\infty})$.

The above results imply that the tightest generalization bound is obtained when $s = 1$, while, on the other hand, the bound of $\mathcal{F}_{+\infty}$, which sets equal radii for all $w_t$'s, is the least preferred. It is clear that, to derive a generalization bound for $\mathcal{F}_s$, we need to compute, or at least upper-bound, $\hat R(\mathcal{F}_s)$. The following theorem addresses this requisite.

Theorem 3: Let $\mathcal{F}_s$ be as defined in (6), and let $\rho \triangleq 2\ln T$. Assume that $\forall x \in \mathcal{X}$, $k(x,x) = \langle\phi(x), \phi(x)\rangle \le 1$. Then the ERC can be bounded as follows:

$$\hat R(\mathcal{F}_s) \le \frac{2}{T\sqrt{N}}\sqrt{\tau R\, T^{2/s^*}} \tag{11}$$

where $\tau \triangleq (\max\{s, \rho^*\})^*$.

B. Analysis

It is worth pointing out some observations regarding the result of Theorem 3.

1) It is not difficult to see that the bound of the ERC in (11) is monotonically increasing in $s$, as is $\hat R(\mathcal{F}_s)$.

2) As $s \to +\infty$, $\mathcal{F}_s$ degrades to $\tilde{\mathcal{F}}$. In this case, $\hat R(\mathcal{F}_{+\infty}) \le 2\sqrt{R/N}$. Note that this bound matches the one given in [24]. This is because of the following relation between $\tilde{\mathcal{F}}$ and the HS of [24], $\mathcal{F}$, which was introduced in Section I: first, let the operator $A$ in $\mathcal{F}$ be the identity operator, and then let $x$ in $\mathcal{F}$ be an element of $\mathcal{H}$, i.e., let $x$ in $\mathcal{F}$ be $\phi(x)$ in $\tilde{\mathcal{F}}$. Then $\mathcal{F}$ becomes $\tilde{\mathcal{F}}$.

3) Obviously, when $s$ is finite, the bound for $\mathcal{F}_s$, which is of order $O((1/T^{1/s})\sqrt{1/N})$, is preferred over the aforementioned $O(1/\sqrt{N})$ bound, as it asymptotically decreases with an increasing number of tasks.

4) When $s = \rho^*$, $\hat R(\mathcal{F}_{\rho^*}) \le (2/(T\sqrt{N}))\sqrt{2eR\log T}$. Here, we achieve a bound of order $O(\sqrt{\log T}/T)$, which decreases faster with increasing $T$ compared with the bound when $s > \rho^*$.

5) When $s = 1$, $\hat R(\mathcal{F}_1) \le (2/(T\sqrt{N}))\sqrt{2R\log T}$. While still being of order $O(\sqrt{\log T}/T)$, it features a smaller
constant compared with the bound of $\hat R(\mathcal{F}_{\rho^*})$. In fact, due to the monotonicity of the bound given in (11), the tightest bound is obtained when $s = 1$.

In the next section, we derive and analyze the generalization bound by letting $\phi$ be learned during the training phase.

III. LEARNING THE FEATURE MAPPING

In this section, we consider the selection of the feature mapping $\phi$ during training via an MKL approach. In particular, we will assume that $\phi = (\sqrt{\theta_1}\phi_1, \ldots, \sqrt{\theta_M}\phi_M)$, where each $\phi_m : \mathcal{X} \to \mathcal{H}_m$ is selected before training.

A. Theoretical Results

Consider the following HS:

$$\mathcal{F}_{s,r} \triangleq \big\{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \|w_t\|^2 \le \lambda_t^2 R,\; \lambda \in \Omega_s(\lambda),\; \phi \in \Omega_r(\phi)\big\} \tag{12}$$


where $\Omega_r(\phi) = \{\phi : \phi = (\sqrt{\theta_1}\phi_1, \ldots, \sqrt{\theta_M}\phi_M),\; \theta \succeq 0,\; \|\theta\|_r \le 1\}$. By following the same derivation procedure as for Lemma 1, we can verify that (7) is also valid for $\mathcal{F}_{s,r}$. Therefore, we only need to estimate its ERC. As in the previous section, we first give results regarding the monotonicity of $\hat R(\mathcal{F}_{s,r})$.

Lemma 3: Let $\sigma_t \triangleq [\sigma_t^1, \ldots, \sigma_t^N]'$, $u_t^m \triangleq \sigma_t' K_t^m \sigma_t$, $u_t \triangleq [u_t^1, \ldots, u_t^M]'$, $v_m \triangleq \sum_{t=1}^T \lambda_t \sigma_t' K_t^m \alpha_t$, and $v \triangleq [v_1, \ldots, v_M]'$, where $K_t^m$ is the kernel matrix containing elements $k_m(x_t^i, x_t^j)$. Then, $\forall s \ge 1$ and $r \ge 1$,

$$\hat R(\mathcal{F}_{s,r}) = \frac{2\sqrt{R}}{TN} E_\sigma\Big\{\sup_{\theta\in\Omega_r(\theta)}\Big(\sum_{t=1}^T (\theta' u_t)^{s^*/2}\Big)^{1/s^*}\Big\} = \frac{2}{TN} E_\sigma\Big\{\sup_{\lambda\in\Omega_s(\lambda),\,\alpha\in\Omega(\alpha)} \|v\|_{r^*}\Big\} \tag{13}$$

where $\Omega_r(\theta) \triangleq \{\theta : \theta \succeq 0,\; \|\theta\|_r \le 1\}$ and $\Omega(\alpha) \triangleq \{\alpha_t : \sigma_t' K_t^m \alpha_t \le R,\; \forall t\}$. Based on (13), we have the following result.

Theorem 4: $\hat R(\mathcal{F}_{s,r})$ is monotonically increasing with respect to $s$.

We extend $\tilde{\mathcal{F}}$ to an MT-MKL setting by letting $\phi \in \Omega_r(\phi)$ for the $\phi$ in $\tilde{\mathcal{F}}$, which gives $\tilde{\mathcal{F}}_r \triangleq \{x \mapsto (\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle)' : \|w_t\|^2 \le R,\; \phi \in \Omega_r(\phi)\}$. Then, (13) leads to the following result.

Theorem 5: $\hat R(\tilde{\mathcal{F}}_r) = \hat R(\mathcal{F}_{+\infty,r})$, and thus $\tilde{\mathcal{F}}_r$ is a special case of $\mathcal{F}_{s,r}$.

Again, the above results imply that the tightest bound is obtained when $s = 1$. In the following theorem, we provide an upper bound for $\hat R(\mathcal{F}_{s,r})$.

Theorem 6: Let $\mathcal{F}_{s,r}$ be as defined in (12). Assume that $\forall x \in \mathcal{X}$, $m = 1, \ldots, M$, $k_m(x,x) = \langle\phi_m(x), \phi_m(x)\rangle \le 1$. The ERC can be bounded as follows:

$$\hat R(\mathcal{F}_{s,r}) \le \frac{2}{T\sqrt{N}}\sqrt{R\, s^*\, T^{2/s^*}\, M^{\max\{1/r^*,\,2/s^*\}}}. \tag{14}$$

The above theorem can be explicitly refined in the following two situations.

Corollary 1: Under the conditions given in Theorem 6, we have

$$\hat R(\mathcal{F}_{s,r}) \le \begin{cases} \dfrac{2}{T\sqrt{N}}\sqrt{\tau R\, T^{2/s^*}\, M^{1/r^*}}, & \text{if } r^* \le \log T \\[6pt] \dfrac{2}{T\sqrt{N}}\sqrt{\tau R\, T^{2/s^*}\, M^{2/\tau}}, & \text{if } r^* \ge \log MT \end{cases} \tag{15}$$

$\forall s \ge 1$, where $\tau \triangleq (\max\{s, \rho^*\})^*$ and

$$\rho \triangleq \begin{cases} 2\ln T, & r^* \le \ln T \\ 2\ln MT, & r^* \ge \ln MT. \end{cases} \tag{16}$$
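To make the behavior of these bounds concrete, the following sketch (our own illustration, not code from the paper) evaluates the single-kernel bound (11) of Theorem 3 and the two regimes of (15), using the duality mapping (9); extending $(\cdot)^*$ to $(+\infty)^* = 1$ is our continuity convention.

```python
import math

def conj(a):
    """Duality mapping (9): a* = a/(a - 1), with 1* = +inf.
    We also set (+inf)* = 1 by continuity (our convention)."""
    if math.isinf(a):
        return 1.0
    return math.inf if a == 1 else a / (a - 1)

def bound_11(s, T, N, R=1.0):
    """Single-kernel ERC bound (11): (2/(T*sqrt(N))) * sqrt(tau*R*T**(2/s*))."""
    tau = conj(max(s, conj(2 * math.log(T))))
    return 2 / (T * math.sqrt(N)) * math.sqrt(tau * R * T ** (2 / conj(s)))

def bound_15(s, r, T, N, M, R=1.0):
    """Multiple-kernel ERC bound (15), in the two regimes of (16)."""
    r_star = conj(r)
    if r_star <= math.log(T):
        tau = conj(max(s, conj(2 * math.log(T))))
        extra = M ** (1 / r_star)
    elif r_star >= math.log(M * T):
        tau = conj(max(s, conj(2 * math.log(M * T))))
        extra = M ** (2 / tau)
    else:
        raise ValueError("Corollary 1 covers r* <= log T or r* >= log(MT)")
    return 2 / (T * math.sqrt(N)) * math.sqrt(tau * R * T ** (2 / conj(s)) * extra)
```

For instance, with $T = 10$ and $N = 100$, `bound_11` increases monotonically in $s$ and reaches $2\sqrt{R/N}$ as $s \to +\infty$, matching items 1) and 2) of the analysis in Section II-B.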

B. Analysis

Once again, it is worth commenting on the results given in Theorem 6 and Corollary 1.


1) In general, $\forall s \ge 1$, (15) gives a bound of order $O(1/T^{1/s})$. Obviously, $s \to +\infty$ is least preferred, since its bound does not decrease with an increasing number of tasks. Moreover, based on (14), $\forall r \ge 1$, $\hat R(\mathcal{F}_{1,r})$'s bound is of order $O(\sqrt{M^{1/r^*}})$. Compared with the $O(\sqrt{M^{1/r^*}\min(\log M, r^*)})$ bound of the single-task MKL scenario, which is examined in [18], our bound for MT-MKL is tighter for almost all $M$ when $r$ is small, which is usually a preferred setting.

2) When $r^* \ge \log MT$, the bound given in (15) is monotonically increasing with respect to $s$. When $s = \rho^*$, we have $\hat R(\mathcal{F}_{\rho^*,r}) \le (2/(T\sqrt{N}))\sqrt{2eR\log MT}$. This gives an $O(\sqrt{\log MT}/T)$ bound. Note that it has been proved that the best bound that can be obtained in single-task multiple kernel classification is of order $O(\sqrt{\log M})$ [9]. Obviously, this logarithmic bound is preserved in the MT-MKL context. When $s$ further decreases to 1, we have $\hat R(\mathcal{F}_{1,r}) \le (2/(T\sqrt{N}))\sqrt{2R M^{1/\log MT}\log MT}$. Since $M^{1/\log MT}$ can never be larger than $e$, this bound is even tighter than the one obtained when $s = \rho^*$.

3) When $r^* \le \log T$, the bound given in (15) is monotonically increasing with respect to $s$. When $s = \rho^*$, we have $\hat R(\mathcal{F}_{\rho^*,r}) \le (2/(T\sqrt{N}))\sqrt{2eR M^{1/r^*}\log T}$. This gives an $O(\sqrt{M^{1/r^*}\log T}/T)$ bound. When $s$ further decreases to 1, we have $\hat R(\mathcal{F}_{1,r}) \le (2/(T\sqrt{N}))\sqrt{2R M^{1/r^*}\log T}$, which removes the factor $e$ inside the square root, i.e., it decreases the bound by a constant factor of $\sqrt{e}$.

4) Comparing the optimal bounds given in the previous two situations, i.e., $r^* \ge \log MT$ and $r^* \le \log T$, we can see that, when $r^* \ge \log MT$, we achieve a better bound with respect to $M$, i.e., $O(\sqrt{\log M})$ versus $O(\sqrt{M^{1/r^*}})$. On the other hand, with regard to $T$, even though we get an $O(\sqrt{\log T}/T)$ bound in both cases, the case $r^* \le \log T$ features a lower constant factor.
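The constant claimed in item 2), namely that $M^{1/\log MT}$ never exceeds $e$, is easy to verify numerically (our own check): taking logarithms gives $\log(M^{1/\log MT}) = \log M/\log MT \le 1$.

```python
import math

# Quick check of the claim in item 2): M**(1/log(M*T)) <= e for M, T >= 2,
# since log(M**(1/log(M*T))) = log(M)/log(M*T) <= 1.
def point2_constant(M, T):
    return M ** (1.0 / math.log(M * T))
```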
To summarize, MT-MKL not only preserves the optimal $O(\sqrt{\log M})$ bound encountered in single-task MKL, but also preserves the optimal $O(\sqrt{\log T}/T)$ bound encountered in the single-kernel MTL case, which was given in the previous section.

IV. DISCUSSION

A. Relation to Group-Lasso Type Regularizer

In the next theorem, we show the relation between our HSs and those based on Group-Lasso type regularizers.

Theorem 7: The HS $\mathcal{F}_s$ is equivalent to

$$\mathcal{F}_s^{GL} \triangleq \Big\{x \mapsto [\langle w_1, \phi(x)\rangle, \ldots, \langle w_T, \phi(x)\rangle]' : \Big(\sum_{t=1}^T\|w_t\|^s\Big)^{2/s} \le R\Big\}. \tag{17}$$


Similarly, $\mathcal{F}_{s,r}$ is equivalent to

$$\mathcal{F}_{s,r}^{GL} \triangleq \Big\{x \mapsto [\langle w_1, \tilde\phi(x)\rangle, \ldots, \langle w_T, \tilde\phi(x)\rangle]' : \Big(\sum_{t=1}^T\|\tilde w_t\|^s\Big)^{2/s} \le R,\; \theta \in \Omega_r(\theta)\Big\} \tag{18}$$

where $\|\tilde w_t\|^2 = \sum_{m=1}^M \|w_t^m\|^2/\theta_m$, $\tilde\phi = (\phi_1, \ldots, \phi_M)$, and $\Omega_r(\theta) = \{\theta : \theta \succeq 0,\; \|\theta\|_r \le 1\}$. Obviously, by employing the Group-Lasso type regularizer, one can obtain the HSs that are proposed in the previous sections. Below is a regularization-loss framework based on this regularizer with a preselected kernel:

$$\min_{w_1,\ldots,w_T}\;\Big(\sum_{t=1}^T\|w_t\|^s\Big)^{2/s} + C\sum_{i,t} L\big(\langle w_t, \phi(x_t^i)\rangle, y_t^i\big). \tag{19}$$
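For concreteness, the Group-Lasso type regularizer appearing in (17) and (19) can be computed as follows (a minimal sketch; stacking the task weight vectors as rows of a matrix is our own convention):

```python
import numpy as np

# The Group-Lasso type regularizer from (17)/(19): (sum_t ||w_t||^s)^(2/s),
# with the task weight vectors w_t stacked as rows of W.
def gl_regularizer(W, s):
    norms = np.linalg.norm(W, axis=1)  # per-task Euclidean norms ||w_t||
    return np.sum(norms ** s) ** (2.0 / s)
```

For $s = 2$ it reduces to the squared Frobenius norm (tasks penalized independently), while smaller $s$ couples the tasks more strongly and yields a larger penalty for the same $W$.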

The MKL-based model can be similarly defined.

B. Other Related Works

Substantial effort has been put into research on kernel-based MTL and MT-MKL. In this section, we discuss four closely related papers and emphasize the differences between this paper and those works.

First, we consider [1] and [28]. Both papers consider Group-Lasso type regularizers to achieve different levels of sparsity. Specifically, [1] utilized the regularizer

$$\Big(\sum_{j=1}^n\Big(\sum_{k=1}^{n_j}\|w_{jk}\|\Big)^s\Big)^{2/s}, \quad s \ge 2 \tag{20}$$

where $l_1$-norm regularization is applied to the weights of each of the $n$ groups (i.e., the inner summation) and the group-wise regularization is achieved via an $l_s$-norm regularizer. The hope is that, by utilizing this regularizer, one can achieve within-group sparsity and group-wise non-sparsity. This regularizer can be applied to MT-MKL by letting $w_{jk}$ be $w_t^m$, which yields the regularizer

$$\Big(\sum_{t=1}^T\Big(\sum_{m=1}^M\|w_t^m\|\Big)^s\Big)^{2/s}, \quad s \ge 2. \tag{21}$$

This is similar to the one appearing in $\mathcal{F}_{s,r}^{GL}$. However, the major difference between our regularizer and (21) is that, in $\mathcal{F}_{s,r}^{GL}$, instead of applying an $l_1$-norm to the inner summation, we use $\sum_{m=1}^M\|w_t^m\|^2/\theta_m$, where $\theta_m$ has a feasible region parameterized by $r$. Therefore, our regularizer encompasses MT-MKL with a common kernel function that is learned during training, while (21) does not.

Rakotomamonjy et al. [28] considered

$$\sum_{m=1}^M\Big(\sum_{t=1}^T\|w_t^m\|^q\Big)^{p/q}, \quad 0 \le p \le 1,\; q \ge 1. \tag{22}$$

By applying the $l_p$ (pseudo-)norm, the authors intended to achieve sparsity over the outer summation, while variable sparsity is obtained for the inner summation due to the $l_q$-norm. The major differences between (22) and $\mathcal{F}_{s,r}^{GL}$ are twofold. First, the order of the double summation is different, i.e., in (22), the $w_t^m$'s that belong to the same RKHS are considered as a group, while $\mathcal{F}_{s,r}^{GL}$ treats each task as a group. Second, similar to the reason discussed above, (22) does not encompass MT-MKL with a common kernel function that is learned during the training phase.

In the following, we discuss the differences between our work and the two theoretical works, [16] and [26], which derived generalization bounds for HSs that are similar to ours. Maurer and Pontil [26] consider the regularizer

$$\|w\|_{\mathcal{M}} \triangleq \inf\Big\{\sum_{M\in\mathcal{M}}\|v_M\| : v_M \in \mathcal{H},\; \sum_{M\in\mathcal{M}} M v_M = w\Big\} \tag{23}$$

where $\mathcal{M}$ is an at most countable set of symmetric bounded linear operators on $\mathcal{H}$. This general form covers several regularizers, such as Lasso, Group-Lasso, and weighted Group-Lasso. A key observation is that, in order for a specific regularizer to be covered by this general expression, it needs to be either a summation of several norms, or the infimum of such a summation over a feasible region. Our regularizer $(\sum_{t=1}^T\|w_t\|^s)^{2/s}$ is obviously not a summation of norms (note the power outside the summation). In addition, it is not immediately clear whether it can be represented by such an infimum. Therefore, there appears to be no succinct way to represent our regularizer as a special case of (23), and the same seems to be the case for our MT-MKL regularizer. Also, for our HSs, it is clear how the generalization bounds relate to the number of tasks $T$ and the number of kernels $M$ in $\mathcal{F}_s$ and $\mathcal{F}_{s,r}$, and under which circumstances the logarithmic bound can be achieved. This observation may be hard to obtain from the bound derived for (23), even if one could view our regularizers as special cases of (23).

Kakade et al. [16] derived generalization bounds for regularization-based MTL models with regularizer $\|W\|_{r,p} \triangleq \|(\|w_1\|_r, \ldots, \|w_n\|_r)\|_p$. However, their work assumes $W \in \mathbb{R}^{m\times n}$, while we assume our $w_t$'s to be vectors of a potentially infinite-dimensional Hilbert space. In addition, such a group norm does not generalize our MT-MKL regularizer; therefore, their bound cannot be applied to our HSs, even if their results were extended to infinite-dimensional vector spaces.

V. EXPERIMENTS

In this section, we investigate the generalization bounds of our HSs via experimentation. We first evaluate the discrepancy between the ERCs of $\mathcal{F}_s$ and $\mathcal{F}_{s,r}$ and their bounds, and show experimentally that the bounds give good estimates of the relevant ERCs. Then, we consider a new SVM-based MTL model that uses $\mathcal{F}_s$ as its HS. The model is subsequently extended to allow for MT-MKL using $\mathcal{F}_{s,r}$ as its HS.


A. ERC Bound Evaluation

For $\mathcal{F}_s$, given a data set and a preselected kernel function, we can calculate its kernel matrices $K_t$, $t = 1, \ldots, T$. Then, the ERC is given by (10). To approximate the expectation $E_\sigma\{\|u\|_{s^*}\}$, we resort to Monte Carlo simulation by drawing a large number $D$ of i.i.d. samples of the $\sigma_t$'s from a uniform distribution on the hypercube $\{-1, 1\}^N$. Subsequently, for each sample, we evaluate the argument of the expectation and average the results. For $\mathcal{F}_{s,r}$, the ERC is calculated as in the first equation of (13). For each of the $D$ samples of $\sigma_t$, we can calculate the corresponding $u_t$. Then, we solve the maximization problem using CVX [14], [15]. Finally, we average the $D$ values to approximate the ERC. For the experiment related to $\mathcal{F}_{s,r}$, we only considered the case $s \ge 2$. Under these circumstances, the maximization problem in (13) is concave and can be easily solved, unlike the case $s \in [1, 2)$.

We used the Letter data set$^1$ for this set of experiments. It is a collection of handwritten words compiled by Rob Kassel of the MIT Spoken Language Systems Group. The associated MTL problem involves eight tasks, each of which is a binary classification problem for handwritten letters. The eight tasks are: C versus E, G versus Y, M versus N, A versus G, I versus J, A versus O, F versus T, and H versus N. Each letter is represented by an 8×16 pixel image, which forms a 128-dimensional feature vector. We chose 100 samples for each letter and set $D = 10^4$. To calculate the kernel matrix, we used a Gaussian kernel with spread parameter $2^7$ for $\mathcal{F}_s$, and nine different Gaussian kernels with spreads $\{2^{-7}, 2^{-5}, 2^{-3}, 2^{-1}, 2^0, 2^1, 2^3, 2^5, 2^7\}$ for $\mathcal{F}_{s,r}$. Finally, $R$ was set to 1.

The experimental results are shown in Fig. 1. In both subfigures, both our bound and the real ERC are monotonically increasing. For $\mathcal{F}_s$, it can be seen that the bound is tight everywhere.
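The Monte Carlo procedure for estimating (10) can be sketched as follows (our own NumPy-based illustration with toy linear-kernel data rather than the Letter set; all names are assumptions):

```python
import numpy as np

def erc_estimate(K_list, R=1.0, s=2.0, D=1000, rng=None):
    """Monte Carlo estimate of the ERC in (10):
    R_hat(F_s) = (2*sqrt(R)/(T*N)) * E_sigma{ ||u||_{s*} },
    where u_t = sqrt(sigma_t' K_t sigma_t)."""
    rng = np.random.default_rng(rng)
    T = len(K_list)
    N = K_list[0].shape[0]
    # Dual exponent s* = s/(s-1); s = 1 gives s* = +inf (max-norm).
    s_star = np.inf if s == 1.0 else s / (s - 1.0)
    vals = np.empty(D)
    for d in range(D):
        # sigma_t drawn uniformly on the hypercube {-1, +1}^N for each task.
        sigma = rng.choice([-1.0, 1.0], size=(T, N))
        u = np.array([np.sqrt(sigma[t] @ K_list[t] @ sigma[t]) for t in range(T)])
        vals[d] = np.linalg.norm(u, ord=s_star)
    return 2.0 * np.sqrt(R) / (T * N) * vals.mean()

# Toy usage with linear kernels on random data (illustrative only).
rng = np.random.default_rng(0)
X = [rng.standard_normal((20, 5)) for _ in range(3)]
K_list = [x @ x.T for x in X]
print(erc_estimate(K_list, R=1.0, s=2.0, D=200, rng=1))
```

With the same random seed, increasing $s$ can only increase the estimate, since $\|u\|_{s^*}$ grows as $s^*$ shrinks, in line with Theorem 1.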
For $\mathcal{F}_{s,r}$, even though the difference between our bound and the Monte Carlo-estimated ERC becomes larger as $s$ grows, the bound is still tight for small $s$. This experiment shows a good match between the real ERC and our bound, which verifies our theoretical analysis in Sections II and III.

$^1$Available at http://www.cis.upenn.edu/∼taskar/ocr/.

Fig. 1. Comparison between Monte Carlo-estimated ERCs and our derived bounds using the Letter data set. $10^4$ $\sigma_t$ samples were used for Monte Carlo estimation. We sampled 100 data for each letter and used 9 kernel functions in the multiple kernel scenario. (a) HS: $\mathcal{F}_s$. (b) HS: $\mathcal{F}_{s,r}$.

B. SVM-Based Model

In this section, we present a new SVM-based model that reflects our proposed HS. For training data $\{x_t^i, y_t^i\} \in \mathcal{X} \times \{-1, 1\}$, $i = 1, \ldots, N_t$, $t = 1, \ldots, T$, and fixed feature mapping $\phi : \mathcal{X} \to \mathcal{H}$, our model is given as follows:

$$\min_{w,\xi,b}\;\frac{1}{2}\Big(\sum_{t=1}^T\|w_t\|^s\Big)^{2/s} + C\sum_{t,i=1}^{T,N_t}\xi_t^i$$
$$\text{s.t. } y_t^i\big(\langle w_t, \phi(x_t^i)\rangle + b_t\big) \ge 1 - \xi_t^i,\; \xi_t^i \ge 0,\; \forall i, t. \tag{24}$$

Obviously, $\mathcal{F}_s$ is the HS of (24). Problem (24) is convex when $s \ge 1$. In addition, it is worth pointing out that, when $s = 2$, (24) is a quadratic programming problem, and for $s = 1$, it can be written as a second-order cone programming problem, both of which are convex. Problem (24) can be solved as follows. First, note that, when $1 \le s \le 2$, the problem is equivalent to

$$\min_{w,\xi,b,\lambda}\;\sum_{t=1}^T\frac{\|w_t\|^2}{2\lambda_t} + C\sum_{t,i=1}^{T,N_t}\xi_t^i$$
$$\text{s.t. } y_t^i\big(\langle w_t, \phi(x_t^i)\rangle + b_t\big) \ge 1 - \xi_t^i,\; \xi_t^i \ge 0,\; \forall i, t$$
$$\lambda \succeq 0,\; \|\lambda\|_{\frac{s}{2-s}} \le 1 \tag{25}$$
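When (25) is minimized over $\lambda$ alone with the $w_t$'s held fixed, the $\lambda$-step admits a closed form via a standard Lagrangian argument. The sketch below is our own derivation and code, not taken from the paper; its optimal objective value recovers exactly the regularizer $(\sum_t\|w_t\|^s)^{2/s}$, numerically confirming the equivalence of (24) and (25).

```python
import numpy as np

def lambda_step(w_norms, s):
    """Closed-form minimizer of sum_t ||w_t||^2 / lambda_t subject to
    lambda >= 0, ||lambda||_{s/(2-s)} <= 1, for 1 <= s < 2.
    (Our derivation via a Lagrangian argument; hypothetical helper,
    not code from the paper.)"""
    assert 1 <= s < 2
    a = np.asarray(w_norms, dtype=float)
    lam = a ** (2.0 - s)
    lam /= np.sum(a ** s) ** ((2.0 - s) / s)  # makes ||lam||_{s/(2-s)} = 1
    return lam

def objective(w_norms, lam):
    return float(np.sum(np.asarray(w_norms) ** 2 / lam))

a = np.array([0.7, 1.3, 0.9, 1.8, 0.5])  # example values of ||w_t||
s = 1.5
lam = lambda_step(a, s)
q = s / (2 - s)                 # here q = 3
print(np.sum(lam ** q))         # ≈ 1.0: the norm constraint is active
```

At the minimizer, $\sum_t \|w_t\|^2/\lambda_t = (\sum_t\|w_t\|^s)^{2/s}$, which is why alternating this step with per-task SVM solves optimizes (24).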

which can be easily solved via a block coordinate descent method, with $\{w, \xi, b\}$ as one group and $\lambda$ as another. When $s > 2$, (24) is equivalent to

$$\min_{w,\xi,b}\;\max_{\lambda}\;\sum_{t=1}^T\frac{\lambda_t\|w_t\|^2}{2} + C\sum_{t,i=1}^{T,N_t}\xi_t^i$$
$$\text{s.t. } y_t^i\big(\langle w_t, \phi(x_t^i)\rangle + b_t\big) \ge 1 - \xi_t^i,\; \xi_t^i \ge 0,\; \forall i, t$$
$$\lambda \succeq 0,\; \|\lambda\|_{\frac{s}{s-2}} \le 1. \tag{26}$$


Since it is a convex–concave min–max problem with a compact feasible region, the order of min and max can be interchanged [29], which gives the objective function

$$\max_{\lambda}\;\min_{w,\xi,b}\;\sum_{t=1}^T\lambda_t\Big(\frac{\|w_t\|^2}{2} + \frac{C}{\lambda_t}\sum_{i=1}^{N_t}\xi_t^i\Big). \tag{27}$$

The optimization procedure stops when the difference between two consecutive iterations is smaller than a certain threshold. For the exact penalty function method, the optimization procedure stops when the duality gap is smaller than a certain threshold.

Calculating the dual form of the inner SVM problem gives the following maximization problem:

$$\max_{\alpha,\lambda}\;\sum_{t=1}^T\lambda_t\Big(\alpha_t'\mathbf{1} - \frac{1}{2}\alpha_t' Y_t K_t Y_t \alpha_t\Big)$$
$$\text{s.t. } 0 \preceq \alpha_t \preceq \frac{C}{\lambda_t}\mathbf{1},\; \alpha_t' y_t = 0,\; \forall t$$
$$\lambda \succeq 0,\; \|\lambda\|_{\frac{s}{s-2}} \le 1 \tag{28}$$

We performed our experiments on two well-known and frequently used multitask data sets, namely, Letter and Landmine, and two handwritten digit data sets, namely, MNIST and USPS. The Letter data set was described in the previous section. Due to the large size of the original Letter data set, we randomly sampled 200 points for each letter to construct a training set. One exception is the letter j, as it contains only 189 samples in total. The Landmine data set$^2$ consists of 29 binary classification tasks. Each datum is a 9-dimensional feature vector extracted from radar images that capture a single region of landmine fields. Tasks 1–15 correspond to regions that are relatively highly foliated, while the other 14 tasks correspond to regions that are bare earth or desert. The tasks entail different amounts of data, varying from 30 to 96 samples. The goal is to detect landmines in specific regions. Regarding the MNIST$^3$ and USPS$^4$ data sets, each of the two consists of grayscale images containing handwritten digits from 0 to 9, with 784 and 256 features, respectively. As was the case with the Letter data set, due to the large size of the original data sets, we randomly sampled 100 data from each digit population to form a training set consisting of 1000 samples in total. To simulate the MTL scenario, we split the data into 45 binary classification tasks by applying a one-versus-one strategy. The classification accuracy was then calculated as the average of the classification accuracies over all tasks. For all our experiments, the training set size was set to 10% of the available data. We did not choose large training sets, since, as we can see from the generalization bounds in (11), (14), and (15), when $N$ is large, the effect of $s$ becomes minor. For MT-MKL, we chose the nine Gaussian kernels that were introduced in Section V-A, as well as a linear and a second-order polynomial kernel.

For the single-kernel case, we selected the optimal kernel from these 11 kernel function candidates via cross-validation. SVM's regularization parameter $C$ was selected from the set $\{1/81, 1/27, 1/9, 1/3, 1, 3, 9, 27, 81\}$. In the MT-MKL case, the norm parameter $r$ for $\theta$ was set to 1 to induce sparsity on $\theta$. We varied $s$ from 1 to infinity and report the best average classification accuracy over 20 runs. The experimental results are given in Fig. 2. It can be seen that the classification accuracy is roughly monotonically decreasing with respect to $s$, and the performance deteriorates significantly when $s > 2$. In many situations, the best performance is achieved when $s = 1$. This result supports our theoretical analysis that the lowest generalization bound is obtained when $s = 1$. On the other hand, in some situations, such as the cases which consider the

where Y t  diag([yt1 , . . . , ytNt ] ) and K t is the kernel matrix that is calculated based on the training data from the tth task. Although (28) is not concave, it can be equivalently transformed to the following concave problem: # T "  1   βt 1 − β Y t K t Y t βt max β,λ 2λt t t =1

s.t. 0  β t  C1, β t yt = 0, ∀t s ≤ 1 λ 0, λ s−2

(29)

by letting α t λt  β t . Note that the equality constraint is originally β t yt /λt = 0, which is equivalent to β t yt = 0. Group coordinate descent can be utilized to solve (29), with λ as a group and β as another group. The model can be extended so that it can accommodate MKL as follows: ⎛  M  s ⎞ 2s T ,Nt T   wm 2 2 t ⎝ ⎠ +C ξti min w t ,ξ t ,bt ,θ 2θm t =1 m=1 t,i=1 

 s.t. yti wt , φ x it + bt ≥ 1 − ξti , ξti ≥ 0, ∀i, t θ 0, θ r ≤ 1 (30) where φ = (φ1 , . . . , φ M ). Obviously, its HS is Fs,r . This model can be solved via the similar strategy of solving (24). The only situation that needs a different algorithm is the case when s > 2, where (30) will be transformed to   M   T   1   m βt 1 − min max β Yt θm K t Y t β t θ β,λ 2λt t t =1

s.t. 0  β t  C1, β t yt = 0, ∀t s ≤ 1 λ 0, λ s−2 θ 0, θ r ≤ 1.

m=1

(31)

This min–max problem cannot be solved via group coordinate descent. Instead, we use the exact penalty function method to solve it. We omit the details of this method and refer the readers to [32] and [22], since it is not the focus of this paper. In our experiments, the SVM problems are solved using LIBSVM [6]. For the block-coordinate descent algorithm, the optimization procedure stops when the change of variables

C. Experimental Results on the SVM-Based Model

2 Available at http://people.ee.duke.edu/∼lcarin/LandmineData.zip. 3 Available at http://yann.lecun.com/exdb/mnist/. 4 Available at http://www.cs.nyu.edu/∼roweis/data.html.

LI et al.: MULTITASK CLASSIFICATION HYPOTHESIS SPACE WITH IMPROVED GENERALIZATION BOUNDS

1475

which is maxt { w2t }. In this scenario, the regularizer of the model is only the one which has the smallest margin, while the regularizers of other tasks are ignored. Therefore, it is not a surprise that the performance of the other tasks is bad, which leads to low average classification accuracy. For large s value, even though it is not infinity, the bad result can be similarly analyzed. 2

VI. C ONCLUSION

Fig. 2. Average classification accuracy on 20 runs for different s values. (a) Fixed kernel (single kernel) scenario. (b) MKL scenario.

In this paper, we proposed a MTL HS Fs involving T discriminative functions parametrized by weights wt . The weights are controlled by norm-ball constraints, whose radii are variable and estimated during the training phase. It extends a HS F˜ that has been previously investigated in the literature, where the radii are predetermined. It is shown that the latter space is a special case of Fs , when s → +∞. We derived and analyzed the generalization bound of Fs , and have shown that the bound is monotonically increasing with respect to s. In √ addition, in the optimal case (s = 1), a bound of order O( log T /T ) is achieved. We further extended the hypothesis space (HS) to Fs,r , which is suitable for MT-MKL. Similar results were obtained, including a bound that is monotonically increasing with s and an optimal bound √ of order O( log M T /T ), when s = 1. The experimental results have shown that our ERC bound is tight and matches the real ERC very well. We then demonstrated the relation between our HS and the Group-Lasso type regularizer, and a SVM-based model was proposed with HS Fs that was further extended to handle MT-MKL using the HS Fs,r . The experimental results on multitask classification data sets showed that the classification accuracy is monotonically decreasing with respect to s, and the optimal results for most experiments are indeed achieved, when s = 1, as indicated by our analysis. The presence of results that, contrary to our analysis, are optimal, when s = 1, can be justified similarly to [18, Sec. 5.1]. A PPENDIX A. Preliminaries

USPS data set in a single kernel setting and in the Letter data set in multiple kernel setting, the optimum model is not obtained when s = 1. This seems contradictory to our previously stated claims. However, this phenomenon can be explained similarly to the discussion in [18, Sec. 5.1], which we summarize here: Obviously, for different s, the optimal solution ( f t s) may be different. To get the optimal solution, we need to tune the size of the HS, such that the optimum f t s are contained in the HS. This implies that the size of the HS, which is parametrized by R, could be different for different s, instead of being fixed as discussed in previous sections. It is possible that the HS size (thus R) is very small, when s = 1. In this scenario, the lowest bound could be obtained when s = 1. For a more detailed discussion, we refer the reader to [18, Sec. 5.1]. Finally, it is interesting to see how the performance deteriorates when s becomes large. The reason for the bad performance is as follows. Observe that the regularizer is the l 2s norm of the T SVM regularizers. Consider the extreme case, when s → ∞, the l s2 norm becomes the l∞ norm,

In this section, we provide two results that will be used in the following sections. Lemma 4: Let p ≥ 1, x, a ∈ Rn such that a 0 and a = 0. Then max a x = a p

x∈(x)

(32)

where (x)  {x : x p∗ ≤ 1}. This lemma can be simply proved by utilizing Lagrangian multiplier method with respect to the maximization problem. Lemma 5: Let x 1 , . . . , x n ∈ H, then we have that p  n n 2   σi x i p ≤ p

x i 2 (33) Eσ

i=1

i=1

for any p ≥ 1, where σi s are the Rademacher-distributed random variables. For 1 ≤ p < 2, the above result can be simply proved using Lyapunov’s inequality. When p ≥ 2, the lemma can be proved in [20, Propositions 3.3.1 and 3.4.1].
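Lemma 4 is the standard dual-norm identity, and the maximizer suggested by the Lagrangian argument can be checked numerically. The sketch below (the values of p and a are our own illustrative choices, not from the paper) verifies that the candidate maximizer is feasible, attains ‖a‖_p, and is not beaten by random feasible points:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3.0
p_star = p / (p - 1.0)             # Hoelder conjugate: 1/p + 1/p* = 1
a = rng.uniform(0.1, 2.0, size=5)  # a >= 0, a != 0, as in Lemma 4

norm_p = np.sum(a ** p) ** (1.0 / p)

# Closed-form maximizer from the Lagrangian argument:
# x_i proportional to a_i^{p-1}, scaled onto the unit p*-ball.
x_opt = (a / norm_p) ** (p - 1.0)

assert abs(np.sum(x_opt ** p_star) - 1.0) < 1e-9  # feasible: ||x||_{p*} = 1
assert abs(a @ x_opt - norm_p) < 1e-9             # objective attains ||a||_p

# No random feasible point should do better (Hoelder's inequality).
for _ in range(1000):
    x = rng.normal(size=5)
    x /= np.sum(np.abs(x) ** p_star) ** (1.0 / p_star)
    assert a @ x <= norm_p + 1e-9
```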

B. Proof of Lemma 2

Proof: First notice that F_s is equivalent to the following HS:

  F_s' ≜ { x ↦ (λ_1 ⟨w_1, φ(x)⟩, …, λ_T ⟨w_T, φ(x)⟩)' : ‖w_t‖² ≤ R, λ ∈ Δ_s(λ) }.   (34)

According to the same reasoning as in [9, eqs. (1) and (2)], we know that w_t = Σ_{i=1}^N α_t^i φ(x_t^i), and the constraint ‖w_t‖² ≤ R is equivalent to α_t' K_t α_t ≤ R. Therefore, based on the definition of the empirical Rademacher complexity (ERC) given in (8), we have

  R̂(F_s) = (2/(TN)) E_σ [ sup_{λ ∈ Δ_s(λ), α_t ∈ Δ(F_s)} Σ_{t=1}^T λ_t σ_t' K_t α_t ]   (35)

where Δ(F_s) = {α_t : α_t' K_t α_t ≤ R, ∀t}. To solve the maximization with respect to the α_t, observe that the T problems are independent and can thus be solved individually. Based on the Cauchy–Schwarz inequality, the optimal α_t is achieved when K_t^{1/2} α_t = c_t K_t^{1/2} σ_t, where c_t is a constant. Substituting this result into each of the T maximization problems, we obtain

  max_{c_t}  c_t σ_t' K_t σ_t   s.t.  c_t² σ_t' K_t σ_t ≤ R.                        (36)

Obviously, the optimum is attained at c_t = √( R / (σ_t' K_t σ_t) ). Therefore, the ERC becomes

  R̂(F_s) = (2/(TN)) E_σ [ sup_{λ ∈ Δ_s(λ)} Σ_{t=1}^T λ_t √( σ_t' K_t σ_t R ) ].      (37)

Since s ≥ 1, based on Lemma 4, it is not difficult to solve the maximization with respect to λ, which gives

  R̂(F_s) = (2√R/(TN)) E_σ [ ( Σ_{t=1}^T (σ_t' K_t σ_t)^{s*/2} )^{1/s*} ]
          ≜ (2√R/(TN)) E_σ [ ‖u‖_{s*} ].                                            (38)

C. Proof of Theorem 1

Proof: First note that ∀s_1 > s_2 ≥ 1, we have 1 ≤ s_1* < s_2*, which means ‖u‖_{s_1*} ≥ ‖u‖_{s_2*}. Based on (38), we immediately have R̂(F_{s_1}) ≥ R̂(F_{s_2}). This establishes the monotonicity of R̂(F_s) with respect to s.

D. Proof of Theorem 2

Proof: Similarly to the proof of Lemma 2, we write the ERC of F̃ as

  R̂(F̃) = (2/(TN)) E_σ [ sup_{α_t ∈ Δ(F̃)} Σ_{t=1}^T σ_t' K_t α_t ].                 (39)

Optimizing with respect to the α_t gives

  R̂(F̃) = (2√R/(TN)) E_σ [ Σ_{t=1}^T √( σ_t' K_t σ_t ) ].                            (40)

Based on Lemma 2, we immediately obtain R̂(F̃) = R̂(F_{+∞}).

E. Proof of Theorem 3

Proof: According to (38) and Jensen's inequality, we have

  R̂(F_s) ≤ (2√R/(TN)) ( Σ_{t=1}^T E_σ [ (σ_t' K_t σ_t)^{s*/2} ] )^{1/s*}
          = (2√R/(TN)) ( Σ_{t=1}^T E_σ ‖ Σ_{i=1}^N σ_t^i φ(x_t^i) ‖^{s*} )^{1/s*}.    (41)

Based on Lemma 5, we have

  R̂(F_s) ≤ (2√R/(TN)) ( Σ_{t=1}^T (s* tr(K_t))^{s*/2} )^{1/s*}
          = (2/(TN)) √( R s* ‖(tr(K_t))_{t=1}^T‖_{s*/2} )                            (42)

where ‖(tr(K_t))_{t=1}^T‖_{s*/2} denotes the l_{s*/2}-norm of the vector [tr(K_1), …, tr(K_T)]'. Since we assumed that k(x, x) ≤ 1, ∀x, we have tr(K_t) ≤ N and therefore

  R̂(F_s) ≤ (2/(TN)) √( R s* ( Σ_{t=1}^T N^{s*/2} )^{2/s*} )
          = (2/(T√N)) √( R T^{2/s*} s* ).                                            (43)

Note that this bound can be further improved on the interval s ∈ [1, ρ*]. To make this improvement, we first prove that R̂(F_s) ≤ T^{1/s' − 1/s} R̂(F_{s'}) for any s' ≥ s ≥ 1:

  R̂(F_s) = (2/(TN)) E_σ [ sup_{λ ⪰ 0, ‖λ‖_s ≤ 1} Σ_{t=1}^T λ_t √( σ_t' K_t σ_t R ) ]
          ≤ (2/(TN)) E_σ [ sup_{λ ⪰ 0, ‖λ‖_{s'} ≤ T^{1/s' − 1/s}} Σ_{t=1}^T λ_t √( σ_t' K_t σ_t R ) ]
          = T^{1/s' − 1/s} (2/(TN)) E_σ [ sup_{λ ⪰ 0, ‖λ‖_{s'} ≤ 1} Σ_{t=1}^T λ_t √( σ_t' K_t σ_t R ) ]
          = T^{1/s' − 1/s} R̂(F_{s'}).                                                (44)
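The monotonicity claim of Theorem 1 can be probed directly from the ERC expression in (38). The sketch below uses randomly generated surrogate kernel matrices (our own stand-ins, not the paper's data, normalized so that k(x, x) ≤ 1) and identical Rademacher draws for every s, so the Monte Carlo estimates of (2√R/(TN)) E_σ‖u‖_{s*} are guaranteed to be nondecreasing in s, as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 5, 20

# Random PSD matrices playing the role of the task kernel matrices K_t,
# with unit diagonal (the paper's assumption k(x, x) <= 1).
Ks = []
for _ in range(T):
    A = rng.normal(size=(N, N))
    K = A @ A.T
    d = np.sqrt(np.diag(K))
    Ks.append(K / np.outer(d, d))

def erc_estimate(s, n_draws=300, R=1.0):
    """Monte Carlo estimate of (2 sqrt(R)/(T N)) E_sigma ||u||_{s*},
    with u_t = sqrt(sigma_t' K_t sigma_t), as in (38)."""
    s_star = np.inf if s == 1.0 else s / (s - 1.0)
    draws = np.random.default_rng(42)  # same sigma draws for every s
    total = 0.0
    for _ in range(n_draws):
        u = np.empty(T)
        for t in range(T):
            sigma = draws.choice([-1.0, 1.0], size=N)
            u[t] = np.sqrt(sigma @ Ks[t] @ sigma)
        total += u.max() if np.isinf(s_star) else np.sum(u ** s_star) ** (1.0 / s_star)
    return 2.0 * np.sqrt(R) / (T * N) * total / n_draws

vals = [erc_estimate(s) for s in (1.0, 1.5, 2.0, 4.0, 8.0)]
assert vals == sorted(vals)  # the ERC grows with s, as Theorem 1 states
```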

Based on this conclusion, we have that ∀s ∈ [1, ρ*]

  R̂(F_s) ≤ T^{1/ρ* − 1/s} R̂(F_{ρ*})
          = T^{1/ρ* − 1/s} (2/(TN)) √( 2eRN log T )
          = T^{1/ρ* − 1 + 1 − 1/s} (2/(TN)) √( 2eRN log T )
          = (T^{1/s*} / T^{1/ρ}) (2/(TN)) √( 2eRN log T )
          = (2 T^{1/s*}/(TN)) √( 2eRN log T / e )
          = (2/(T√N)) √( R T^{2/s*} ρ ).                                              (45)

Note that this is always less than the bound (2/(T√N)) √( R T^{2/s*} s* ) given in (43), since ρ* is the global minimizer of the expression in (43) as a function of s. In summary, we have R̂(F_s) ≤ (2/(T√N)) √( R T^{2/s*} ρ ) when s ∈ [1, ρ*], and R̂(F_s) ≤ (2/(T√N)) √( R T^{2/s*} s* ) when s > ρ*.

F. Proof of Lemma 3

Proof: Define K_t ≜ Σ_{m=1}^M θ_m K_t^m, ∀t. Then

  R̂(F_{s,r}) = (2/(TN)) E_σ [ sup_{α_t ∈ Δ(F_{s,r})} Σ_{t=1}^T λ_t σ_t' K_t α_t ]     (46)

where Δ(F_{s,r}) = {α_t : α_t' K_t α_t ≤ λ_t² R, ∀t; λ ∈ Δ_s(λ); θ ∈ Δ_r(θ)}. Then, using a proof similar to that of Lemma 2, we have

  R̂(F_{s,r}) = (2√R/(TN)) E_σ [ sup_{θ ∈ Δ_r(θ)} ( Σ_{t=1}^T (σ_t' K_t σ_t)^{s*/2} )^{1/s*} ]
             = (2√R/(TN)) E_σ [ sup_{θ ∈ Δ_r(θ)} ( Σ_{t=1}^T (θ' u_t)^{s*/2} )^{1/s*} ].   (47)

This gives the first equation in (13). To prove the second equation, we simply optimize (47) with respect to θ, which directly gives the result.

G. Proof of Theorem 4

Proof: Consider (13) and let g(λ) ≜ sup_{α ∈ Δ(α)} ‖v‖_{r*}. Then

  R̂(F_{s,r}) = (2/(TN)) E_σ [ sup_{λ ∈ Δ_s(λ)} g(λ) ].                                 (48)

Note that ∀1 ≤ s_1 < s_2, we have Δ_{s_1}(λ) ⊆ Δ_{s_2}(λ). Therefore, letting λ̂_1, λ̂_2 be the solutions of sup_{λ ∈ Δ_{s_1}(λ)} g(λ) and sup_{λ ∈ Δ_{s_2}(λ)} g(λ), respectively, we must have g(λ̂_1) ≤ g(λ̂_2). This directly implies R̂(F_{s_1,r}) ≤ R̂(F_{s_2,r}).

H. Proof of Theorem 5

Proof: Define K_t ≜ Σ_{m=1}^M θ_m K_t^m, ∀t. Then we can write

  R̂(F̃_r) = (2/(TN)) E_σ [ sup_{α_t ∈ Δ(F̃)} Σ_{t=1}^T σ_t' K_t α_t ].                  (49)

Fixing θ and optimizing with respect to the α_t gives

  R̂(F̃_r) = (2√R/(TN)) E_σ [ sup_{θ ∈ Δ_r(θ)} Σ_{t=1}^T √( θ' u_t ) ].                  (50)

Based on (13), we immediately obtain R̂(F̃_r) = R̂(F_{+∞,r}).

I. Proof of Theorem 6

Proof: Based on (13) and Hölder's inequality, and letting c ≜ max{0, 1/r* − 2/s*}, we have

  R̂(F_{s,r}) ≤ (2√R/(TN)) E_σ [ sup_{θ ∈ Δ_r(θ)} ( Σ_{t=1}^T ( ‖θ‖_r ‖u_t‖_{r*} )^{s*/2} )^{1/s*} ]
             = (2√R/(TN)) E_σ [ ( Σ_{t=1}^T ‖u_t‖_{r*}^{s*/2} )^{1/s*} ]
             ≤ (2√(R M^c)/(TN)) E_σ [ ( Σ_{t=1}^T ‖u_t‖_{s*/2}^{s*/2} )^{1/s*} ].        (51)

Applying Jensen's inequality, we have

  R̂(F_{s,r}) ≤ (2√(R M^c)/(TN)) ( Σ_{t,m=1}^{T,M} E_σ [ (u_t^m)^{s*/2} ] )^{1/s*}
             = (2√(R M^c)/(TN)) ( Σ_{t,m=1}^{T,M} E_σ ‖ Σ_{i=1}^N σ_t^i φ_m(x_t^i) ‖^{s*} )^{1/s*}.  (52)

Using Lemma 5, we have

  R̂(F_{s,r}) ≤ (2√(R M^c)/(TN)) ( Σ_{t,m=1}^{T,M} (s* tr(K_t^m))^{s*/2} )^{1/s*}.       (53)

Since we assume that k_m(x, x) ≤ 1, ∀m, x, we have

  R̂(F_{s,r}) ≤ (2/(T√N)) √( R s* T^{2/s*} M^{max{1/r*, 2/s*}} ).                        (54)

J. Proof of Corollary 1

Proof: First, by following the same proof as for R̂(F_s) ≤ T^{1/s' − 1/s} R̂(F_{s'}) for any s' ≥ s ≥ 1, we directly obtain that R̂(F_{s,r}) ≤ T^{1/s' − 1/s} R̂(F_{s',r}) for any s' ≥ s ≥ 1.
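The claim that ρ* minimizes the bound in (43), with ρ = 2 log T, amounts to minimizing the s-dependent factor s* T^{2/s*} over s*; at the minimizer, T^{2/ρ} = e, so the minimum equals 2e log T. A quick numerical check (T = 1000 is an arbitrary illustrative choice):

```python
import numpy as np

T = 1000
rho = 2.0 * np.log(T)  # claimed minimizing value of s* in the bound (43)

def g(s_star):
    # s-dependent factor inside the square root of the bound (43)
    return T ** (2.0 / s_star) * s_star

grid = np.linspace(1.0, 100.0, 200000)
s_star_opt = grid[np.argmin(g(grid))]

assert abs(s_star_opt - rho) < 0.01          # grid minimizer sits at 2 log T
assert g(rho) <= g(grid).min() + 1e-9        # rho is the global minimizer
# At the minimizer, T^{2/rho} = e, so g(rho) = 2 e log T.
assert abs(g(rho) - 2 * np.e * np.log(T)) < 1e-9
```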

When r* ≤ log T and s ∈ [1, ρ*], where ρ = 2 log T, we have

  R̂(F_{s,r}) ≤ T^{1/ρ* − 1/s} R̂(F_{ρ*,r})
             = (2 T^{1/s*}/(T √(Ne))) √( 2eR M^{1/r*} log T )
             = (2/(T√N)) √( R T^{2/s*} ρ M^{1/r*} ).                                     (55)

When s > ρ*, we obviously have R̂(F_{s,r}) ≤ (2/(T√N)) √( R T^{2/s*} s* M^{1/r*} ).

Similarly, for r* ≥ log MT and s ∈ [1, ρ*], where ρ = 2 log MT, we have

  R̂(F_{s,r}) ≤ T^{1/ρ* − 1/s} R̂(F_{ρ*,r})
             = (2/(T√N)) √( 2Re log MT ) (T^{1/s*} / T^{1/(2 log MT)})
             = (2/(T√N)) √( 2R T^{2/s*} M^{1/log MT} log MT )
             = (2/(T√N)) √( R T^{2/s*} ρ M^{2/ρ} ).                                      (56)

When s > ρ*, we obviously have R̂(F_{s,r}) ≤ (2/(T√N)) √( R T^{2/s*} s* M^{2/s*} ).
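The constants in (56) rest on the identity (MT)^{2/ρ} = e when ρ = 2 log MT, and on ρ being the minimizer of s* ↦ s* (MT)^{2/s*}. Both can be confirmed numerically; the values M = 11 and T = 45 below simply mirror the kernel and task counts of the handwritten-digit experiment and are otherwise arbitrary:

```python
import math

M, T = 11, 45
rho = 2.0 * math.log(M * T)

# Key identity behind (56): (MT)^{2/rho} = (MT)^{1/log(MT)} = e.
assert abs((M * T) ** (2.0 / rho) - math.e) < 1e-12

# rho minimizes the factor s* (MT)^{2/s*} appearing in the MKL bound.
g = lambda t: t * (M * T) ** (2.0 / t)
assert g(rho) < g(rho - 1e-3) and g(rho) < g(rho + 1e-3)
```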

K. Proof of Theorem 7

Proof: We have already shown that the ERC of F_s is

  R̂(F_s) = (2/(TN)) E_σ [ sup_{α,λ} Σ_{t=1}^T σ_t' K_t α_t ].                           (57)

It is not difficult to see that the maximization problem

  sup_{α,λ}  Σ_{t=1}^T σ_t' K_t α_t
  s.t.  α_t' K_t α_t ≤ λ_t² R,  ‖λ‖_s ≤ 1                                               (58)

must achieve its optimum at the boundary with respect to the α_t, i.e., the optimal α_t must satisfy α_t' K_t α_t = λ_t² R. Therefore, (58) can be rewritten as

  sup_{α,λ}  Σ_{t=1}^T σ_t' K_t α_t
  s.t.  α_t' K_t α_t = λ_t² R,  ‖λ‖_s ≤ 1.                                              (59)

Substituting the first constraint into the second one directly leads to the result. The proof regarding F_{s,r} is similar, and is therefore omitted.
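The boundary argument behind Theorem 7 — that the supremum in (58) with respect to α_t is attained when the quadratic constraint is active — can be sanity-checked for a single task. The sketch below (the random kernel matrix and the constraint level c, standing in for λ_t² R, are our own choices) compares the Cauchy–Schwarz maximizer against random strictly feasible points:

```python
import numpy as np

rng = np.random.default_rng(7)
N, c = 10, 2.0
A = rng.normal(size=(N, N))
K = A @ A.T + 1e-6 * np.eye(N)           # strictly PD surrogate kernel matrix
sigma = rng.choice([-1.0, 1.0], size=N)  # one Rademacher vector

# Candidate from the Cauchy-Schwarz argument: alpha* proportional to sigma,
# scaled so that alpha' K alpha = c, i.e., onto the constraint boundary.
alpha_star = np.sqrt(c / (sigma @ K @ sigma)) * sigma
assert abs(alpha_star @ K @ alpha_star - c) < 1e-8  # boundary is attained
best = sigma @ K @ alpha_star                        # = sqrt(c * sigma' K sigma)

# No random strictly interior feasible point should beat the boundary point.
for _ in range(2000):
    alpha = rng.normal(size=N)
    q = alpha @ K @ alpha
    alpha *= rng.uniform(0.0, 1.0) * np.sqrt(c / q)
    assert sigma @ K @ alpha <= best + 1e-9
```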

REFERENCES

[1] J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, S. Raman, and S. Sonnenburg, "Variable sparsity kernel learning," J. Mach. Learn. Res., vol. 12, pp. 565–592, Aug. 2011.
[2] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex multi-task feature learning," Mach. Learn., vol. 73, no. 3, pp. 243–272, 2008.
[3] S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer, "Multi-task learning for HIV therapy screening," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 56–63.
[4] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying, "Universal multi-task kernels," J. Mach. Learn. Res., vol. 9, pp. 1615–1646, Jun. 2008.
[5] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
[6] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 23, pp. 1–27, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/∼cjlin/libsvm
[7] J. Chen, L. Tang, J. Liu, and J. Ye, "A convex formulation for learning shared structures from multiple tasks," in Proc. 26th ICML, 2009, pp. 1–8.
[8] J. Chen, J. Zhou, and J. Ye, "Integrating low-rank and group-sparse structures for robust multi-task learning," in Proc. 17th KDD, 2011, pp. 42–50.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh, "Generalization bounds for learning kernels," in Proc. 27th ICML, 2010, pp. 1–8.
[10] T. Evgeniou, C. A. Micchelli, and M. Pontil, "Learning multiple tasks with kernel methods," J. Mach. Learn. Res., vol. 6, pp. 615–637, Dec. 2005.
[11] H. Fei and J. Huan, "Structured feature selection and task relationship inference for multi-task learning," in Proc. ICDM, Dec. 2011, pp. 171–180.
[12] M. Gönen and E. Alpaydin, "Multiple kernel learning algorithms," J. Mach. Learn. Res., vol. 12, pp. 2211–2268, Jul. 2011.
[13] P. Gong, J. Ye, and C. Zhang, "Multi-stage multi-task feature learning," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2012.
[14] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control (Lecture Notes in Control and Information Sciences), V. Blondel, S. Boyd, and H. Kimura, Eds. New York, NY, USA: Springer-Verlag, 2008, pp. 95–110.
[15] M. Grant and S. Boyd. (Apr. 2011). CVX: MATLAB Software for Disciplined Convex Programming, Version 1.21. [Online]. Available: http://cvxr.com/cvx
[16] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, "Regularization techniques for learning with matrices," J. Mach. Learn. Res., vol. 13, no. 1, pp. 1865–1890, 2012.
[17] Z. Kang, K. Grauman, and F. Sha, "Learning with whom to share in multi-task feature learning: Supplementary material," in Proc. ICML, 2011, pp. 1–8.
[18] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "l_p-norm multiple kernel learning," J. Mach. Learn. Res., vol. 12, pp. 953–997, Feb. 2011.
[19] M. Kolar and H. Liu, "Marginal regression for multitask learning," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2012.
[20] S. Kwapień and W. A. Woyczynski, Random Series and Stochastic Integrals: Single and Multiple. Basel, Switzerland: Birkhäuser, 1992.
[21] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," J. Mach. Learn. Res., vol. 5, pp. 27–72, Dec. 2004.
[22] C. Li, M. Georgiopoulos, and G. C. Anagnostopoulos, "A unifying framework for typical multitask multiple kernel learning problems," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 7, pp. 1287–1297, Jul. 2014.
[23] A. C. Lozano and G. Swirszcz, "Multi-level lasso for sparse multi-task regression," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2012.
[24] A. Maurer, "Bounds for linear multi-task learning," J. Mach. Learn. Res., vol. 7, pp. 117–139, Dec. 2006.
[25] A. Maurer, "The Rademacher complexity of linear transformation classes," in Learning Theory (Lecture Notes in Computer Science), vol. 4005, G. Lugosi and H. U. Simon, Eds. New York, NY, USA: Springer-Verlag, 2006, pp. 65–78.
[26] A. Maurer and M. Pontil, "Structured sparsity and generalization," J. Mach. Learn. Res., vol. 13, no. 1, pp. 671–690, 2012.
[27] S. Parameswaran and K. Q. Weinberger, "Large margin multi-task metric learning," in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2012.
[28] A. Rakotomamonjy, R. Flamary, G. Gasso, and S. Canu, "l_p–l_q penalty for sparse linear and sparse multiple kernel multitask learning," IEEE Trans. Neural Netw., vol. 22, no. 8, pp. 1307–1320, Aug. 2011.
[29] M. Sion, "On general minimax theorems," Pacific J. Math., vol. 8, no. 1, pp. 171–176, 1958.
[30] L. Tang, J. Chen, and J. Ye, "On multiple kernel learning with multiple labels," in Proc. IJCAI, Jul. 2009, pp. 1255–1260.
[31] X. Wang, C. Zhang, and Z. Zhang, "Boosted multi-task learning for face verification with applications to web image and video search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 142–149.
[32] G. A. Watson, "Globally convergent methods for semi-infinite programming," BIT Numer. Math., vol. 21, no. 3, pp. 392–373, 1981.
[33] Y. Zhang and D.-Y. Yeung, "Transfer metric learning by learning task relationships," in Proc. 16th KDD, 2010, pp. 1199–1208.
[34] W. Zhong and J. Kwok, "Convex multitask learning with flexible task clusters," in Proc. ICML, 2012, pp. 1–8.
[35] J. Zhou, L. Yuan, J. Liu, and J. Ye, "A multi-task learning formulation for predicting disease progression," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 814–822.

Michael Georgiopoulos (S'82–M'83–SM'01) received the Diploma degree from the National Technical University of Athens, Athens, Greece, in 1981, and the M.S. and Ph.D. degrees from the University of Connecticut, Storrs, CT, USA, in 1983 and 1986, respectively, all in electrical engineering. He served as the Interim Assistant Vice President of Research with the Office of Research and Commercialization, Orlando, FL, USA, from 2011 to 2012, and as the Interim Dean of the College of Engineering and Computer Science, Orlando, from 2012 to 2013. He has served as the Dean of the College of Engineering and Computer Science since 2013. He is currently a Professor with the Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando. He has authored more than 60 journal papers and more than 170 conference papers in a variety of venues. His current research interests include machine learning and its applications, with special emphasis on neural network and neuroevolutionary algorithms. Dr. Georgiopoulos was an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS from 2002 to 2006, and of the Neural Networks journal from 2006 to 2012. He served as the Technical Co-Chair of the 2011 International Joint Conference on Neural Networks.

Cong Li (S’11) received the B.S. degree in electronic and information engineering and mathematics from Tianjin University, Tianjin, China, in 2009. He is currently pursuing the Ph.D. degree in electrical engineering with the Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA. His current research interests include machine learning with an emphasis on kernel methods, multitask learning, and learning theories.

Georgios C. Anagnostopoulos (M’93–SM’10) received the Diploma degree from the University of Patras, Patras, Greece, in 1994, and the M.S. and Ph.D. degrees from the University of Central Florida, Orlando, FL, USA, in 1997 and 2001, respectively, all in electrical engineering. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, FL, USA. His current research interests include machine learning with an emphasis on pattern recognition, artificial neural networks, and kernel methods.
