IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 3, MARCH 2016


Local Rademacher Complexity for Multi-Label Learning
Chang Xu, Tongliang Liu, Dacheng Tao, Fellow, IEEE, and Chao Xu

Abstract— We analyze the local Rademacher complexity of empirical risk minimization-based multi-label learning algorithms, and in doing so propose a new algorithm for multi-label learning. Rather than using the trace norm to regularize the multi-label predictor, we instead minimize the tail sum of the singular values of the predictor in multi-label learning. Benefiting from the use of the local Rademacher complexity, our algorithm, therefore, has a sharper generalization error bound. Compared with methods that minimize over all singular values, concentrating on the tail singular values results in better recovery of the low-rank structure of the multi-label predictor, which plays an important role in exploiting label correlations. We propose a new conditional singular value thresholding algorithm to solve the resulting objective function. Moreover, a variance control strategy is employed to reduce the variance of variables in optimization. Empirical studies on real-world data sets validate our theoretical results and demonstrate the effectiveness of the proposed algorithm for multi-label learning.

Index Terms— Multi-label learning, local Rademacher complexity.

I. INTRODUCTION

IN MULTI-LABEL learning, an image can be assigned more than one label. This is different from the conventional single-label learning problem, in which each example corresponds to one, and only one, label. Over the past few decades, multi-label learning [1] has been successfully applied to many real-world applications such as text categorization [2], image annotation [3]–[8], and gene function analysis [9]. A straightforward approach to multi-label learning is to decompose it into a series of binary classification problems for different labels [10]. However, this approach can result in poor performance when strong label correlations exist. To improve

Manuscript received May 5, 2015; revised September 13, 2015, November 7, 2015, and December 8, 2015; accepted January 5, 2016. Date of publication February 12, 2016; date of current version February 12, 2016. This work was supported in part by the National Key Technology Research and Development Program under Grant 2015BAF15B00, in part by the National Natural Science Foundation of China under Grant 61375026, and in part by the Australian Research Council under Grant DP-140102164, Grant FT-130101457, and Grant LP-140100569. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Guoliang Fan. C. Xu and C. Xu are with the Key Laboratory of Machine Perception, Cooperative Medianet Innovation Center, School of Electronics Engineering and Computer Science, Ministry of Education, Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]). T. Liu and D. Tao are with the Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIP.2016.2524207

prediction, a large number of algorithms have been developed that approach the multi-label learning problem from different perspectives, such as the classifier chains algorithm [11], the max-margin multi-label classifier [12], the probabilistic multi-label learning algorithms [13], [14], the correlation learning algorithms [15], [16], and label dependency removal algorithms [17], [18]. Vapnik's learning theory [19] can be used to justify the successful development of multi-label learning [20], [21]. The generalization error of a multi-label learning model measures how well the algorithm generalizes to unseen data. Rademacher complexity [22] is a useful data-dependent complexity measure for deriving tighter generalization error bounds than those derived using the VC dimension and covering number. Recently, [23] proved that the Rademacher complexity of empirical risk minimization (ERM)-based multi-label learning algorithms can be bounded by the trace norm of the multi-label predictors, which provides a theoretical explanation for the effectiveness of using the trace norm for regularization in multi-label learning. On the other hand, minimizing the trace norm over the predictor implicitly exploits the correlations between different labels in multi-label learning. One shortcoming of the general Rademacher complexity is that it ignores the fact that the hypotheses selected by a learning algorithm usually belong to a more favorable subset of all the hypotheses, and therefore have better performance than in the worst case. To overcome this drawback, the local Rademacher complexity [24], [25] considers the Rademacher averages of smaller subsets of the hypothesis set. This results in a sharper generalization error bound than that derived using the global Rademacher complexity. Specifically, the generalization error bound derived from the global Rademacher complexity is at most of convergence order O(√(1/n)), while the bound obtained using the local Rademacher complexity usually converges as fast as O(log n/n). We therefore seek to use the local Rademacher complexity to analyze the generalization error of multi-label learning models. To bound the generalization error of ERM-based multi-label learning algorithms, we analyze their local Rademacher complexity, which is upper bounded in terms of the tail sum of the singular values of the multi-label predictor. As a result, we are motivated to develop a new multi-label learning algorithm that directly constrains its local Rademacher complexity, so that a tighter generalization error bound of the algorithm can be expected. In particular, the local Rademacher complexity can be constrained by penalizing the tail sum of the singular values of the multi-label predictor. As well as the


advantage of producing a sharper generalization error bound, this new constraint over the multi-label predictor achieves better recovery of the low-rank structure of the predictor and effectively exploits the correlations between labels in multi-label learning. The resulting objective function can be efficiently solved using a newly proposed conditional singular value thresholding algorithm. To encourage the objective variable to fall in a smaller hypothesis set, a variance control strategy is adopted to reduce the variance in the process of optimization. Extensive experiments on real-world datasets validate our theoretical analysis and demonstrate the effectiveness of the new algorithm for solving the multi-label learning problem. The remainder of the paper is organized as follows. In Section 2, we review related work. Section 3 introduces global and local Rademacher complexities. Section 4 presents the theory-guided multi-label learning algorithm, and Section 5 its optimization method. We report experiments in Section 6 and conclude in Section 7.

II. PREVIOUS WORK

In this section, we briefly review the algorithmic and theoretical works on multi-label learning and provide a survey on the trace norm and its variants for rank minimization problems.

A. Algorithms on Multi-Label Learning

According to [1], multi-label learning algorithms can be categorized into two groups: algorithm adaptation and problem transformation methods. The multi-label methods that adapt, extend and customize an existing machine learning algorithm for the task of multi-label learning are called algorithm adaptation methods. AdaBoost.MH and AdaBoost.MR [26] extend AdaBoost to minimize the Hamming loss and the ranking loss, respectively. Huang et al. [15] and Yan et al. [27] maintained a shared pool of base classifiers from AdaBoost for all the labels. Several variants [28], [29] of the popular k-nearest neighbors (kNN) lazy learning algorithm were developed for multi-label learning. Tang et al. [30] proposed a novel kNN-sparse graph-based semi-supervised learning approach for harnessing labeled and unlabeled data simultaneously, which has shown good performance in handling noisily-tagged web images. To handle multi-label data, [31] adapted the decision tree by modifying the entropy calculation. Vens et al. [32] and Bi and Kwok [33] employed trees to describe the hierarchical relationships among labels. New error functions [34], [35] were introduced into neural networks to take multiple labels into account. Based on SVM, [36] used a ranking approach to ensure that the relevant labels are ranked higher than any of the irrelevant ones, and [12] proposed a max-margin multi-label formulation to learn correlated predictors for labels. Cabral et al. [37] formulated weakly supervised multi-label image classification as a matrix completion problem.
The problem transformation methods are multi-label learning algorithms that transform the multi-label learning problem into one or more single-label classification or regression problems. The simplest strategy for problem transformation, BR [38], is to decompose the multi-label learning problem into several independent binary classification problems, where each binary classification problem corresponds to a possible label in the label space. Closely related to BR, the classifier chains method (CC) [11] links the binary classifiers along a chain. Label combination methods [39], [40] aim to combine entire label sets into single labels to form a single-label problem; however, the space of possible label subsets can be very large. Fürnkranz [41] and Wu et al. [42] trained pairwise classifiers to cover all pairs of labels, where each classifier is trained using the samples of the first label as positive examples and the samples of the second label as negative examples. Zhang and Schneider [43] investigated the multi-label learning problem from a composite likelihood view and constructed the connection between composite likelihood and many multi-label decomposition methods, such as one-vs-all and one-vs-one. The multi-label decomposition methods usually neglect the correlation between labels, which is important for improving multi-label learning. Within a general ERM multi-label learning framework, [23] employed the matrix factorization technique to discover the correlation between multiple labels. Jing et al. [44] proposed to exploit label correlation by minimizing the trace norm of the multi-label predictor in a semi-supervised setting. Recently, the idea of channel coding has been applied to multi-label prediction for handling large numbers of labels [45]–[47]: the output is encoded into a codeword, models are learned to predict the codeword, and the correct output is then recovered from noisy predictions. Besides the label part of the dataset, Chen and Lin [17] additionally exploited the feature part in label space dimension reduction.
B. Theories on Multi-Label Learning

Recently, [48] studied the consistency of multi-label learning, that is, whether the expected loss of a learned classifier converges to the Bayes loss as the training set size increases. A necessary and sufficient condition for the consistency of multi-label learning based on surrogate loss functions is given, which implies that the set of classifiers yielding the optimal surrogate loss must fall in the set of classifiers yielding the optimal original multi-label loss. Dembczyński et al. [49] distinguished two types of label dependence, i.e., conditional and marginal dependence, and presented three scenarios in which the exploitation of one of these types of dependence may boost the predictive performance of a classifier; theoretically, the benefit of exploiting label dependence is shown to depend on the type of loss to be minimized. In a generic empirical risk minimization (ERM) framework, [23] performed the generalization error analysis for the trace norm regularized ERM formulation of multi-label learning, and showed that its Rademacher complexity can be upper bounded in terms of the trace norm.

C. Trace Norm for Rank Minimization

Rank minimization has recently received much attention due to its success in different applications [50]. However, in general, the rank minimization problem is known to be


computationally intractable (NP-hard) [51]. Vapnik [19] made a breakthrough by stating that the minimization of the rank function, under broad conditions, can be achieved with the trace norm. The trace norm regularizer has been widely applied to classification tasks [52], [53], most of which exploit the trace norm to enforce correlations between classifiers. Since the natural reformulation of the trace norm leads to a semi-definite program, off-the-shelf optimizers are not efficient for large-scale problems. Some methods have therefore been devised for its efficient optimization [54]–[56].

III. GLOBAL AND LOCAL RADEMACHER COMPLEXITIES

In a standard supervised learning setting, a set of training examples z_1 = (x_1, y_1), ..., z_n = (x_n, y_n) is sampled i.i.d. from a distribution P over X × Y. Let F be a set of functions mapping X to Y. The learning problem is to select a function f ∈ F such that the expected loss E[ℓ(f(x), y)] is small, where ℓ(·, ·) : Y × Y → [0, 1] is a loss function. Defining G = ℓ(F, ·) as the loss class, the learning problem is then equivalent to finding a function g ∈ G with small E[g]. The global Rademacher complexity [22] is an effective approach for measuring the richness (complexity) of the function class G, and it is defined as follows.
Definition 1: Let σ_1, ..., σ_n be independent uniform {−1, 1}-valued random variables. The global Rademacher complexity of G is then defined as

R_n(G) = E[ sup_{g∈G} (1/n) Σ_{i=1}^n σ_i g(z_i) ].   (1)

Based on the notion of global Rademacher complexity, the algorithm has a standard generalization error bound, as shown in the following theorem [22].
Theorem 1 [22]: Given δ > 0, suppose the function ĝ is learned over n training points. Then, with probability at least 1 − δ, we have

E[ĝ] ≤ inf_{g∈G} E[g] + 4R_n(G) + √(2 log(2/δ)/n).   (2)

Since the global Rademacher complexity R_n(G) is of order O(√(1/n)) for various classes used in practice, the generalization error bound in Theorem 1 converges at rate O(√(1/n)). Global Rademacher complexity is a global estimation of the complexity of the function class, and thus it ignores the fact that the algorithm is likely to pick functions with a small error; in particular, only a small subset of the function class will be used. Instead of using the global Rademacher averages of the entire class as the complexity measure, it is more reasonable to consider the Rademacher complexity of a small subset of the class, e.g., the intersection of the class with a ball centered on the function of interest. Clearly, this local Rademacher complexity [25] is always smaller than the corresponding global Rademacher complexity, and its formal definition is given by
Definition 2: For any r > 0, the local Rademacher complexity of G is defined as

R_n(G, r) = R_n({g ∈ G : E[g²] ≤ r}).   (3)
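To make these two definitions concrete, here is a small illustrative sketch (our own Python/NumPy example, not code from the paper) that estimates the empirical global Rademacher complexity of a finite function class by Monte Carlo sampling of the Rademacher variables σ_i, and mimics the local version in Eq. (3) by restricting the supremum to functions whose empirical second moment is at most r.

```python
import numpy as np

def rademacher_complexity(G, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[sup_{g in G} (1/n) sum_i sigma_i g(z_i)].

    G is an array of shape (num_functions, n): row k holds g_k(z_1), ..., g_k(z_n).
    """
    rng = np.random.default_rng(seed)
    num_functions, n = G.shape
    values = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        values.append(np.max(G @ sigma) / n)      # supremum over the finite class
    return float(np.mean(values))

def local_rademacher_complexity(G, r, n_draws=2000, seed=0):
    """Restrict the class to {g : (1/n) sum_i g(z_i)^2 <= r} before taking the sup."""
    keep = (G ** 2).mean(axis=1) <= r
    return rademacher_complexity(G[keep], n_draws, seed)

# Toy usage: 50 random "functions" evaluated on 200 points.
G = np.random.default_rng(1).normal(size=(50, 200))
print(rademacher_complexity(G), local_rademacher_complexity(G, r=1.0))
```

Because the restricted class is a subset of the full class, the local estimate can never exceed the global one, which is the property exploited in the sequel.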


The following theorem describes the generalization error bound based on the local Rademacher complexity.
Theorem 2 [25]: Given δ > 0, suppose we learn the function ĝ over n training points. Assume that there is some r > 0 such that for every g ∈ G, E[g²] ≤ r. Then, with probability at least 1 − δ, we have

E[ĝ] ≤ inf_{g∈G} E[g] + 8R_n(G) + √(8r log(2/δ)/n) + 3 log(2/δ)/n.   (4)

By choosing a much smaller class G′ ⊆ G with as small a variance as possible, while requiring that ĝ still lies in G′, the generalization error bound in Theorem 2 attains a faster convergence rate than Theorem 1, of up to O(log n/n). Once the local Rademacher complexity is known, E[ĝ] − inf_{g∈G} E[g] can be bounded in terms of the fixed point of the local Rademacher complexity of F.

IV. LOCAL RADEMACHER COMPLEXITY FOR MULTI-LABEL LEARNING

In this section, we analyze the local Rademacher complexity for multi-label learning and illustrate our motivation for developing a new multi-label learning algorithm.
The multi-label learning model is described by a distribution Q on the space of data points and labels X × {0, 1}^L. We receive n training points {(x_i, y_i)}_{i=1}^n sampled i.i.d. from the distribution Q, where y_i ∈ {0, 1}^L are the ground-truth label vectors. Given these training data, we learn a linear multi-label predictor Ŵ ∈ R^{d×L} by performing ERM as follows:

Ŵ = arg inf_{W∈φ(W)} L̂(W) = (1/n) Σ_{i=1}^n ℓ(f(x_i, W), y_i),   (5)

where L̂(W) is the empirical risk of a multi-label learner W, and φ(W) is some constraint on W. Yu et al. [23] propose to solve the multi-label learning problem with Eq. (5) by setting φ(W) to the trace constraint Tr(W) < λ, and provide the corresponding global Rademacher complexity bound

R_n(W) ≤ λ/√n.   (6)

This global Rademacher complexity for multi-label learning is of order O(√(1/n)), which is exactly consistent with the general analysis in the previous section. Hence, the generalization error bound based on the global Rademacher complexity in [23] converges up to O(√(1/n)).
In practice, the hypotheses selected by a learning algorithm usually have better performance than the worst case and belong to a more favorable subset of all the hypotheses. Based on this idea, we employ the local Rademacher complexity to measure the complexity of smaller subsets of the hypothesis set, which results in sharper learning bounds and guarantees faster convergence rates. The local Rademacher complexity of the multi-label learning algorithm using Eq. (5) is given in Theorem 3.
Theorem 3: Suppose we learn W over n training points. Let W = UΣV^T be the SVD of W,
where U and V are unitary matrices and Σ is the diagonal matrix with singular values {λ_i} in descending order. Assume ‖W‖ ≤ 1 and that there is some r > 0 such that for every W ∈ W, E[WW^T] ≤ r. Then, the local Rademacher complexity of W satisfies

E[ sup_{W∈W} (1/n) Σ_{i=1}^n σ_i⟨x_i, W⟩ ] ≤ √(rθ/n) + (Σ_{j>θ} λ_j)/√n.

Proof: Considering W = UΣV^T, W can be rewritten as

W = Σ_j u_j v_j^T λ_j,   (7)

where u_j and v_j are the column vectors of U and V, respectively. Based on the orthogonality of U and V, we have the following decomposition:

(1/n) Σ_{i=1}^n σ_i⟨x_i, W⟩ = ⟨X_σ, W⟩
  ≤ ⟨X_σ Σ_{j=1}^θ u_j u_j^T λ_j^{-1}, Σ_{j=1}^θ u_j v_j^T λ_j²⟩ + ⟨X_σ Σ_{j>θ} u_j u_j^T, W⟩
  ≤ ‖X_σ Σ_{j=1}^θ u_j u_j^T λ_j^{-1}‖ ‖Σ_{j=1}^θ u_j v_j^T λ_j²‖ + ‖X_σ Σ_{j>θ} u_j u_j^T‖ ‖W‖.

Considering

E[ ‖X_σ Σ_{j=1}^θ u_j u_j^T λ_j^{-1}‖² ] = E[ Σ_{j=1}^θ λ_j^{-2} ⟨X_σ, u_j⟩² ] ≤ Σ_{j=1}^θ λ_j^{-2} E[⟨x, u_j⟩²]/n = θ/n   (8)

and

‖Σ_{j=1}^θ u_j v_j^T λ_j²‖ = ‖Σ_{j=1}^θ u_j u_j^T λ_j²‖ ≤ ‖Σ_{j=1}^∞ u_j u_j^T λ_j²‖ = ‖E[WW^T]‖ ≤ r,   (9)

we have

E[ sup_{W∈W} (1/n) Σ_{i=1}^n σ_i⟨x_i, W⟩ ] ≤ √(r θ/n) + (1/n) Σ_{j>θ} λ_j ≤ √(rθ/n) + (Σ_{j>θ} λ_j)/√n,   (10)

which completes the proof.
According to Theorem 3, the local Rademacher complexity for ERM-based multi-label learning algorithms is determined by the tail sum of the singular values. When Σ_{j>θ} λ_j = O(exp(−θ)), we have E[ĝ] − inf_{g∈G} E[g] = O(log n/n), which leads to a sharper generalization error bound than that based on the global Rademacher complexity.

V. ALGORITHM

In this section, the properties of the local Rademacher complexity discussed above are used to devise a new multi-label learning algorithm. Each training point has a feature vector x_i ∈ R^d and a corresponding label vector y_i ∈ {0, 1}^L. If y_{ij} = 1, example x_i has label-j; otherwise, example x_i does not have label-j. The multi-label predictor is parameterized as f(x, W) = W^T x, where W ∈ R^{d×L}, and ℓ(y, f(x, W)) is the loss function that computes the discrepancy between the true label vector and the predicted label vector.
The trace norm is an effective approach for modeling and capturing correlations between the labels associated with examples, and it has been widely adopted in many multi-label algorithms. Within the ERM framework, their objective functions usually take the form

min_W (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i, W)) + C‖W‖_*,   (11)

where ‖·‖_* is the trace norm and C is a constant. In particular, for Problem (11), [23] has proven that the global Rademacher complexity of W is upper bounded in terms of its trace. As shown in the previous section, however, the tail sum of the singular values of W, rather than its trace, determines the local Rademacher complexity. Since the local Rademacher complexity can lead to tighter generalization bounds than the global Rademacher complexity, this motivates us to consider the following objective function for the multi-label learning problem:

min_W F(W) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i, W)) + C Σ_{j>θ} λ_j(W),   (12)

where λ_j(W) is the j-th largest singular value of W and θ is a parameter that controls the tail sum. If we use the squared L2-loss function, we get

min_W (1/n) Σ_{i=1}^n ‖y_i − W^T x_i‖² + C Σ_{j>θ} λ_j(W).   (13)

The other loss functions can be applied within the framework as well.
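As a concrete reading of Eq. (13), the following minimal sketch (our illustrative Python/NumPy, not the authors' implementation) evaluates the objective: the averaged squared loss plus C times the tail sum of the singular values of W beyond the θ largest.

```python
import numpy as np

def tail_singular_sum(W, theta):
    # sum of the singular values of W beyond the theta largest ones
    return np.linalg.svd(W, compute_uv=False)[theta:].sum()

def objective(W, X, Y, C, theta):
    """Eq. (13): (1/n) sum_i ||y_i - W^T x_i||^2 + C * sum_{j>theta} lambda_j(W).

    X: (n, d) feature matrix, Y: (n, L) label matrix, W: (d, L) predictor.
    """
    residual = Y - X @ W
    return (residual ** 2).sum() / X.shape[0] + C * tail_singular_sum(W, theta)

# toy usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = rng.integers(0, 2, size=(100, 5)).astype(float)
W = rng.normal(size=(20, 5))
print(objective(W, X, Y, C=0.1, theta=2))
```

Only the evaluation is shown here; the optimization of this non-smooth objective is discussed in Section V.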


Fig. 1. Comparison of the trace norm and the proposed norm as functions of unknown entries M2,3 and M3,4 for the matrix in Eq. (14). (a) and (b) are the 3D function plot and the contour lines for the trace norm, while (c) and (d) are the same for the proposed norm.

In multi-label learning, the multi-label predictor W usually has a low-rank structure due to the correlations between multiple labels. The trace norm is regarded as an effective surrogate for rank minimization because it simultaneously penalizes all the singular values of W. However, it may incorrectly keep small singular values that should be zero, or shrink to zero large singular values that should be non-zero. In contrast, our new algorithm directly minimizes over the small singular values, which encourages the low-rank structure. To understand why the trace norm may fail in rank minimization, we consider the matrix

M = [ 2  1  2  1
      1  1  ?  2
      1  1  2  ? ],   (14)

where M_{2,3} and M_{3,4} are unknown. Figure 1 plots the trace norm and the proposed new norm of M for all possible completions in a range around the values M_{2,3} = 2 and M_{3,4} = 2 that minimize its rank. We find that the trace norm yields the optimal solution for M, with singular values λ = [5.1235, 1.0338, 0.2965], when M_{2,3} = 1.8377 and M_{3,4} = 1.4248. In contrast, when we constrain the tail singular values (setting θ = 2), the optimal solution has singular values λ = [5.3549, 1.1512, 0] at M_{2,3} = 2 and M_{3,4} = 2. Hence, the new norm successfully discovers the low-rank structure, while the trace norm fails in this case.
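As a quick numerical check of this example, the sketch below (illustrative Python/NumPy written for this discussion, not taken from the paper) evaluates the trace norm and the proposed tail-sum norm (θ = 2) of M over a grid of candidate completions and reports the minimizing entries; under the assumptions of this toy grid search it should reproduce the qualitative behavior described above.

```python
import numpy as np

def trace_norm(A):
    return np.linalg.svd(A, compute_uv=False).sum()

def tail_norm(A, theta):
    # sum of singular values beyond the theta largest ones
    return np.linalg.svd(A, compute_uv=False)[theta:].sum()

def completed(a, b):
    # M with the unknown entries M_{2,3} and M_{3,4} filled by a and b
    return np.array([[2., 1., 2., 1.],
                     [1., 1., a,  2.],
                     [1., 1., 2., b ]])

grid = np.linspace(0.0, 3.0, 301)
best_trace = min((trace_norm(completed(a, b)), a, b) for a in grid for b in grid)
best_tail = min((tail_norm(completed(a, b), theta=2), a, b) for a in grid for b in grid)
print("trace-norm minimizer (value, M23, M34):", best_trace)
print("tail-norm  minimizer (value, M23, M34):", best_tail)
```

The tail-sum norm should prefer completions near M_{2,3} = M_{3,4} = 2, where the third singular value vanishes, while the trace norm trades that off against the magnitude of the leading singular values.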

A. Optimization

Starting with Eq. (12) without the norm regularization, we get the following problem:

min_W g(W) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i, W)).   (15)

The gradient method is a natural approach for solving this problem and generates a sequence of approximate solutions:

W_k = W_{k−1} − η_k ∇g(W_{k−1}) = W_{k−1} − (η_k/n) Σ_{i=1}^n ∇g_i(W_{k−1}),   (16)

where η_k determines the step size. However, at each step, gradient descent requires the evaluation of n derivatives, which is expensive. A popular modification is stochastic gradient descent: at each iteration k = 1, 2, ..., we draw i_k randomly from {1, ..., n} and set

W_k = W_{k−1} − η_k ∇g_{i_k}(W_{k−1}),   (17)

which can be further reformulated as a proximal regularization of the linearized function g_{i_k}(W) at W_{k−1}:

W_k = arg min_W J_{η_k}(W, W_{k−1}),   (18)

where

J_{η_k}(W, W_{k−1}) = g(W_{k−1}) + ⟨W − W_{k−1}, ∇g_{i_k}(W_{k−1})⟩ + (1/(2η_k)) ‖W − W_{k−1}‖²_F.

Based on Eq. (18), Problem (13) can be solved by the following iterative step:

W_k = p_{η_k}(W_{k−1}) = arg min_W (1/(2η_k)) ‖W − (W_{k−1} − η_k ∇g_{i_k}(W_{k−1}))‖²_F + C Σ_{j>θ} λ_j(W),   (19)

where the terms in Eq. (18) that do not depend on W are ignored. Many works on proximal gradient optimization [57], [58] have suggested that the step size η_k in each iteration should satisfy the condition

g(p_{η_k}(W_{k−1})) ≤ J_{η_k}(p_{η_k}(W_{k−1}), W_{k−1}),   (20)

which acts as a principle for step size estimation in the iterations.
Recall that if Problem (19) is constrained with the trace norm, [55] showed that it can be efficiently solved using the singular value thresholding algorithm. Hence, we propose a new conditional singular value thresholding algorithm to handle the newly proposed norm regularization. The solution is summarized in the following theorem.
Theorem 4: Let Q ∈ R^{m×n} and let its SVD be Q = UΣV^T, where U ∈ R^{m×r} and V ∈ R^{n×r} have orthonormal columns and Σ ∈ R^{r×r} is diagonal. Then

D_θ(Q) = arg min_W { (1/2) ‖W − Q‖²_F + C Σ_{j>θ} λ_j(W) }   (21)

is given by D_θ(Q) = UΣ_θV^T, where Σ_θ is diagonal with (Σ_θ)_{ii} = Σ_{ii} if i ≤ θ and Σ_{ii} > C, and (Σ_θ)_{ii} = max(0, Σ_{ii} − C) otherwise.
Proof: Assuming that Ŵ is the optimal solution, 0 should be a subgradient of the objective function at the point Ŵ:

0 ∈ Ŵ − Q + C ∂( Σ_{j>θ} λ_j(Ŵ) ),   (22)

where ∂(Σ_{j>θ} λ_j(Ŵ)) is the set of subgradients of the new norm regularization. Letting Ŵ = UΣV^T, we have

∂( Σ_{j>θ} λ_j(Ŵ) ) = { U I_θ V^T + S : U^T S = 0, SV = 0, ‖S‖ ≤ 1 },   (23)

where I_θ is obtained by setting the diagonal values with indices smaller than θ in the identity matrix I to zeros. Write the SVD of Q as

Q = U_0 Σ_0 V_0^T + U_1 Σ_1 V_1^T,   (24)

where U_0, V_0 are the singular vectors associated with the singular values greater than C, while U_1, V_1 correspond to those smaller than or equal to C. With these definitions, we have

Ŵ = U_0 [Σ_0 − C I_θ] V_0^T,   (25)

and thus

Q − Ŵ = U_0 Σ_0 V_0^T + U_1 Σ_1 V_1^T − U_0 [Σ_0 − C I_θ] V_0^T = C( U_0 I_θ V_0^T + C^{−1} U_1 Σ_1 V_1^T ) ∈ C ∂( Σ_{j>θ} λ_j(Ŵ) ),   (26)

where S is taken as C^{−1} U_1 Σ_1 V_1^T. The proof is completed.
It turns out that the minimization in Problem (19) can be carried out by first computing the SVD of (W_{k−1} − η_k ∇g_{i_k}(W_{k−1})) and then applying the conditional thresholding to the singular values.
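The following sketch (our illustrative Python/NumPy rendering of the operator in Theorem 4, not the authors' code) applies the conditional thresholding rule to the singular values of Q and uses it for one stochastic proximal step of Eq. (19); the gradient argument and the scaling of the threshold by η_k are our assumptions.

```python
import numpy as np

def conditional_svt(Q, C, theta):
    """D_theta(Q) from Theorem 4: keep the top-theta singular values exceeding C,
    soft-threshold every other singular value by C."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    keep = (np.arange(len(s)) < theta) & (s > C)
    s_new = np.where(keep, s, np.maximum(0.0, s - C))
    return (U * s_new) @ Vt

def proximal_step(W, grad, eta, C, theta):
    """One iteration of Eq. (19): stochastic gradient step, then conditional
    thresholding; the threshold is eta*C because of the 1/(2*eta) factor."""
    return conditional_svt(W - eta * grad, C=eta * C, theta=theta)
```

Setting θ = 0 removes the condition, and the operator falls back to the standard singular value thresholding of [55].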

Chen et al. [59] have provided a general solution for the adaptive trace norm problem. Given a set of weights {w_1, ..., w_θ, ...} for the different singular values, the shrinkage operation in [59] is defined as

Σ_{ii} = max(0, Σ_{ii} − λ w_i),   (27)

which reduces to the singular value shrinkage operation for the trace norm problem in [55] by setting all weights to 1. For our newly defined conditional singular value shrinkage operation, the weights for singular values satisfying (i ≤ θ and Σ_{ii} > C) are set to 0; otherwise the weights are 1.

B. Variance Control

As discussed in Section IV, the variance of W should be constrained so that the assumption of the local Rademacher complexity is fulfilled. According to Eq. (19), the optimal W should be close to (W_{k−1} − η_k ∇g_{i_k}(W_{k−1})) while satisfying the rank constraint. Given the fixed W_{k−1} in the k-th iteration, controlling the variance of W is thus equivalent to controlling the variance introduced by computing the gradient ∇g_{i_k}(W_{k−1}) with respect to a random example. The aim of variance control is to construct a random matrix Z_{i_k} ∈ R^{d×L} that has the same expectation as ∇g_{i_k}(W_{k−1}) but a smaller variance.
For simplicity, we focus on the predictor for the j-th label and denote the j-th column vectors of ∇g_{i_k}(W_{k−1}) and Z_{i_k} by ∇g_{i_k}^j and Z_{i_k}^j, respectively. The rule to generate Z_{i_k}^j is defined as

Z_{i_k}^j = ∇g_{i_k}^j − α( h_{i_k}^j(W) − h^j(W) ),   (28)

where α is a real number, h_{i_k}^j(W) ∈ R^d is a random vector, and h^j(W) = E_{i_k}[h_{i_k}^j(W)]. The definition of h_{i_k}^j(W) will be given later. It is obvious that Z_{i_k}^j has the same expectation as ∇g_{i_k}^j, and thus Z_{i_k} can be used to replace ∇g_{i_k}(W_{k−1}) in Eq. (19). The variance of Z_{i_k}^j is computed by

Var[Z_{i_k}^j] = Var[∇g_{i_k}^j] + α² Var[h_{i_k}^j(W)] − α( Cov[∇g_{i_k}^j, h_{i_k}^j(W)] + Cov[h_{i_k}^j(W), ∇g_{i_k}^j] ).   (29)

By setting α* as the minimizer of Tr(Var[Z_{i_k}^j]), we obtain

α* = Tr( Cov[∇g_{i_k}^j, h_{i_k}^j(W)] ) / Tr( Var[h_{i_k}^j(W)] ).   (30)

We next proceed to show that the variance of Z_{i_k}^j is smaller than that of ∇g_{i_k}^j. By plugging α* back into Eq. (29), we have

E‖Z_{i_k}^j − E_{i_k}[Z_{i_k}^j]‖²_2 = Tr( Var[Z_{i_k}^j] )
  = Tr( Var[∇g_{i_k}^j] ) − ( Tr(Cov[∇g_{i_k}^j, h_{i_k}^j(W)]) )² / Tr( Var[h_{i_k}^j(W)] )
  ≤ Tr( Var[∇g_{i_k}^j] ) = E‖∇g_{i_k}^j − E_{i_k}[∇g_{i_k}^j]‖²_2,

where ( Tr(Cov[∇g_{i_k}^j, h_{i_k}^j(W)]) )² / Tr( Var[h_{i_k}^j(W)] ) ≥ 0. To estimate the optimal α*, we have to obtain Cov[∇g_{i_k}^j, h_{i_k}^j(W)] and Var[h_{i_k}^j(W)], which can be approximated by the covariance and variance of the mini-batch examples in stochastic gradient descent. According to the above analysis, the greater the correlation between ∇g_{i_k}^j and h_{i_k}^j(W), the greater the variance reduction. However, it is inappropriate to set h_{i_k}^j(W) = ∇g_{i_k}^j, because of the expensive computational complexity of E_{i_k}[h_{i_k}^j(W)]. We employ a Taylor expansion of the loss function as an alternative approach to construct h_{i_k}^j(W), such that it is highly correlated with ∇g_{i_k}^j and its expectation can be efficiently computed.
We take the squared loss function as an example to illustrate the main idea. The first-order Taylor expansion of the squared loss ℓ = ‖e‖²_2 around ê is

‖e‖²_2 ≈ ‖ê‖²_2 + 2ê(e − ê).   (31)

Let x̄_+^j be the mean of the examples whose j-th label is positive. We define h_{i_k}^j(W) as

h_{i_k}^j(W) = x_{i_k}( ‖1 − W_j^T x̄_+^j‖²_2 + 2(1 − W_j^T x̄_+^j)( y_{i_k j} − W_j^T x_{i_k} − 1 + W_j^T x̄_+^j ) ),   (32)

where W_j is the j-th column vector of W. By independently considering y_{i_k j} ∈ {−1, 1}, the expectation of h_{i_k}^j(W) can be computed by

E[h_{i_k}^j(W)] = (n_+^j / n) E[h_{i_k}^j(W) | y_{i_k j} = 1] + (n_−^j / n) E[h_{i_k}^j(W) | y_{i_k j} = −1],   (33)

where n_+^j and n_−^j are the numbers of examples with positive and negative label-j, respectively. The expectation E[h_{i_k}^j(W) | y_{i_k j} = 1] can be written as

E[h_{i_k}^j(W) | y_{i_k j} = 1] = x̄_+^j ‖1 − W_j^T x̄_+^j‖²_2 + 2(1 − W_j^T x̄_+^j)( x̄_+^j W_j^T x̄_+^j − ( Var(x_{i_k} | y_{i_k j} = 1) + x̄_+^j (x̄_+^j)^T ) W_j ).   (34)

Similarly, for y_{i_k j} = −1, we have

E[h_{i_k}^j(W) | y_{i_k j} = −1] = x̄_−^j ‖1 − W_j^T x̄_+^j‖²_2 + 2(1 − W_j^T x̄_+^j)( x̄_−^j (W_j^T x̄_+^j − 2) − ( Var(x_{i_k} | y_{i_k j} = −1) + x̄_−^j (x̄_−^j)^T ) W_j ).   (35)

Based on the data statistics computed in advance, the expectation E[h_{i_k}^j(W)] can be easily obtained for a particular W. Hence, we can efficiently compute Z_{i_k}^j for each label using Eq. (28), and then control the variance of W through Z_{i_k}. The whole procedure for optimizing the objective function is summarized in Algorithm 1.

Algorithm 1 Optimizing Problem (13) Through Variance Control and Conditional Singular Value Thresholding
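A simplified sketch of the control-variate idea in Eqs. (28)–(30) follows (our illustrative Python/NumPy; it uses a generic control variate h with a precomputed mean rather than the specific Taylor-expansion construction of Eq. (32)).

```python
import numpy as np

def control_variate_gradient(grads, hs, h_mean):
    """Variance-reduced per-label gradients for one mini-batch.

    grads, hs: arrays of shape (batch, d) holding the sampled gradients
    grad_{i_k}^j and the control variates h_{i_k}^j(W) for one label j;
    h_mean: the (precomputed) expectation E[h^j(W)], shape (d,).
    """
    g_bar, h_bar = grads.mean(axis=0), hs.mean(axis=0)
    # alpha* = Tr(Cov[grad, h]) / Tr(Var[h]), Eq. (30), estimated on the batch
    cov_tr = ((grads - g_bar) * (hs - h_bar)).sum(axis=1).mean()
    var_tr = ((hs - h_bar) ** 2).sum(axis=1).mean()
    alpha = cov_tr / max(var_tr, 1e-12)
    # Eq. (28): same expectation as grads, smaller variance when grad and h correlate
    return grads - alpha * (hs - h_mean)
```

In the paper, h is built from the class-conditional feature statistics so that its expectation in Eq. (33) is available in closed form.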

C. Algorithm Complexity and Convergence

Similar to the singular value thresholding algorithm [55], the computational cost of the proposed model is by and large dominated by the conditional singular value shrinkage operator, which calls for a singular value decomposition. Note that we do not need to compute the entire SVD of Q (see Theorem 4) to apply the shrinkage operator; only the part corresponding to singular values greater than C is needed. Hence, the computational cost can be largely decreased by applying the iterative Lanczos algorithm to compute only the top few singular values and singular vectors. When the rank of the solution is substantially smaller than either dimension of W, the storage requirement is low, since we can store each W_k in its SVD form.
Some works [60], [61] have already theoretically established the convergence of proximal SGD for convex optimization problems. However, it is much more difficult to analyse its convergence for non-convex optimization problems. To get a satisfactory solution, the optimization algorithm can be launched from several different random initialization points.
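Regarding the cost of the shrinkage step discussed above, only the leading singular values and vectors are required, so a truncated decomposition can replace the full SVD; a hedged sketch using SciPy's Lanczos-based svds routine (our illustration, since the paper does not specify an implementation):

```python
import numpy as np
from scipy.sparse.linalg import svds

def top_singular_triplets(Q, k):
    """Leading k singular values/vectors of Q via the Lanczos-type solver in SciPy."""
    U, s, Vt = svds(Q, k=k)          # returned with singular values in ascending order
    order = np.argsort(s)[::-1]      # sort descending to match the paper's convention
    return U[:, order], s[order], Vt[order, :]
```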

VI. EXPERIMENTS

TABLE I STATISTICS OF THE DATASETS USED IN EXPERIMENTS

In this section, we evaluate our proposed algorithm on three benchmark image annotation datasets¹: corel5k, espgame and iaprtc12, and two text datasets²: bibtex and delicious. We employ a pretrained convolutional network [62] to extract image features, owing to its promising performance. The convolutional network trained on ILSVRC is composed of a stack of convolutional layers (whose depth differs between architectures) followed by three fully-connected (FC) layers: the first two have 4096 channels each and the third performs 1000-way ILSVRC classification. We input the images in the annotation datasets to the convolutional network for feature extraction, and each image is represented by the neuronal responses of the penultimate layer. For the bibtex and delicious datasets, we used the text features provided by the Mulan website. All these datasets have already been pre-separated into training and test sets. A summary of the statistics of the datasets is shown in Table I: #train is the number of training examples; #test is the number of test examples; #features is the number of features; #labels is the number of labels; #cardinality is the average number of labels per example; #density is the number of labels per example divided by the total number of labels, averaged over the samples; #distinct is the number of distinct label combinations appearing in the datasets.
In the experiments, we compared our proposed LRML (Local Rademacher complexity Multi-label Learning) with the ML-Trace, ML-Fro, ML-Max and ML-None methods, which solve the multi-label learning problem based on Eq. (5) with the trace norm, the Frobenius norm, the max-norm, and without any norm, respectively. Since the LEML algorithm [23] explores the low-rank property of the multi-label predictor and analyses the algorithm's Rademacher complexity as well, it is the algorithm most closely related to ours. CPLST [17] is a representative multi-label algorithm based on label-space dimension reduction. The authors of LEML [23] suggested that CPLST has a close connection with LEML and employed CPLST as an important comparison algorithm in their experiments, which motivates us to consider CPLST in our experiments as well. To conduct comprehensive comparison experiments, we further consider the MaxMargin approach [47], which does
¹ http://lear.inrialpes.fr/people/guillaumin/data.php
² http://mulan.sourceforge.net/


TABLE II COMPARISON OF MULTI-LABEL LEARNING ALGORITHMS WITH VARIOUS LOSS FUNCTIONS AND NORMS ON THE COREL5K DATASET

not rely on a regularization penalty term for exploring label correlation. It is well understood by the multi-label community that different evaluation criteria may require different formulations of the multi-label learning algorithm: [63] discussed the 0/1 error and the Hamming loss in multi-label learning, [64] proposed to directly maximize the F-measure, [17] conducted label space dimension reduction in multi-label learning by minimizing an upper bound of the Hamming loss, and [65] studied cost-sensitive multi-label classification, which takes the evaluation criterion into account during learning. Hence, within the proposed multi-label learning framework (i.e., Eq. (12)), we consider different loss functions, including the squared loss, the squared hinge loss and the logistic regression loss. As evaluation criteria, we adopt the F1-measure, average AUC, Hamming loss and top-k accuracy. In the experiments, the parameter C was selected from the set {0.001, 0.01, 0.1, 1, 10, 100, 1000} via cross validation. The algorithm was considered to have converged when either the variation of W between two successive iterations was less than 10^{-5} or 2,000 iterations had been executed.

A. Multi-Label Classification Performance Comparison

Given three surrogates (squared loss (SQ), squared hinge loss (SH) and logistic regression loss (LR)), we first compared the proposed LRML algorithm with the ML-Trace, ML-Fro and ML-Max algorithms. For LRML, the parameter θ was selected from the set {0.1, 0.2, ..., 0.9}·rank(W) via cross validation. The results on the corel5k dataset are reported using macro-F1, micro-F1 and Hamming loss in Table II. To justify the connection between the proposed algorithm and ERM, the Hamming loss on the training set (denoted by 'Training: Hamming loss') is reported as well. The small Hamming loss on the training set for the different surrogate functions demonstrates the effectiveness of the proposed framework in minimizing the empirical risk of multi-label learning.
Table II shows that, for the different surrogate functions, the ML-Fro algorithm is outperformed by the other algorithms under the different criteria. Since the Frobenius norm can be decomposed as the sum of the squared norms of the individual label predictors, ML-Fro is equivalent to conducting classification independently for each label. The neglect of the label relationships thus limits the performance of ML-Fro. In contrast, the trace norm exploits the low-rank structure of the multi-label predictor by constraining all the singular values, and the max-norm, as an upper bound on the rescaled trace norm, is also a convex approximation of the matrix rank. Both the ML-Trace and ML-Max algorithms thus achieve significant performance improvements over ML-Fro. Adjusting θ to penalize the tail singular values enables the

LRML algorithm to better exploit the low-rank structure, which leads to further performance improvements. It is necessary to note that the actual training-test difference in Hamming loss has no explicit connection with the ideal E[ĝ] − inf E[g] (see Theorem 2) involving the squared loss, squared hinge loss, or logistic regression loss, since the performance measure and the loss functions are computed differently. Moreover, it is difficult to accurately assess the consistency of the algorithm (see Theorem 2) on a specific real-world dataset. This is because the test error is an approximation of E[g] and the approximation error is usually large given the small test sample size; the training error is likewise an approximation of E[ĝ]. Though the difference between training and test error cannot be straightforwardly used to justify our theoretical analysis, the small test/training error and the good performance indicate that the classifiers obtained by our proposed method are closer to the optimal ones than those of the other methods.
Given the squared loss, we conducted a significance test over the algorithms with different norms or without a norm to demonstrate the advantage of the proposed algorithm. The statistical significance result in Figure 2 is obtained through the Bonferroni-Dunn test [66], which is based on the methods' average rankings across all five datasets. By taking LRML as the control algorithm, the relative performance among the competing approaches can be demonstrated. Results are visualized using critical difference diagrams, which contain an enumerated axis on which the average ranks of the algorithms are drawn. The algorithms are depicted along the axis in such a manner that the best-ranking ones are at the rightmost side of the diagram, and methods that are not significantly different from LRML are interconnected by a bold line. As can be seen from Figure 2, LRML achieves the lowest average rank in terms of each evaluation metric, and it significantly outperforms ML-None and ML-Fro, which handle the multi-label learning problem independently per label with or without hypothesis complexity control. Most importantly, at a significance level of 0.05, LRML is suggested to be significantly better than ML-Trace in terms of micro-F1 and top-5 accuracy, and than ML-Max in terms of macro-F1.
We compared the LRML algorithm with state-of-the-art multi-label learning algorithms on the three image datasets and the two text datasets, and summarize the results in Tables III and IV, respectively. LRML either improves on, or has comparable performance with, the other methods on the different datasets. Although these algorithms all attempt to exploit the dependency structure of multiple labels, they study and discover it from different perspectives. LEML aims to minimize the rank of the multi-label predictor by penalizing its trace norm within the ERM framework, while the CPLST

Fig. 2. Comparison of LRML (control algorithm) against other ERM-based approaches with various norms using the Bonferroni-Dunn test.

TABLE III MACRO-F1 AND MICRO-F1 PERFORMANCE OF MULTI-LABEL LEARNING METHODS ON THE THREE IMAGE DATASETS

TABLE IV PERFORMANCE COMPARISON OF MULTI-LABEL LEARNING METHODS ON THE TWO TEXT DATASETS

Fig. 3. Samples of the annotation results (a-h) of the proposed LRML on the benchmark corel5k dataset, with the red tags being the ground-truth and the blue and black ones being the correct and incorrect annotation results of LRML.

and MaxMargin algorithms exploit the dependency between labels by finding predictable codewords of labels within the framework of multi-label output coding. The proposed LRML algorithm directly constrains the tail singular values of the predictor to better exploit the low-rank structure of the multi-label predictor. In addition, since the objective function of LRML is an explicit minimization of the local Rademacher complexity,

it will lead to a tight generalization error bound and guarantee stable performance on unseen examples.

B. Algorithm Analysis

Figures 3, 4 and 5 give examples of the annotation results of the proposed LRML on the benchmark corel5k, espgame and iaprtc12 datasets, respectively. The tags on the left of


Fig. 4. Samples of the annotation results (a-h) of the proposed LRML on the benchmark espgame dataset, with the red tags being the ground-truth and the blue and black ones being the correct and incorrect annotation results of LRML.

Fig. 5. Samples of the annotation results (a-h) of the proposed LRML on the benchmark iaprtc12 dataset, with the red tags being the ground-truth and the blue and black ones being the correct and incorrect annotation results of LRML.

TABLE V THE INFLUENCE OF PARAMETER θ ON CLASSIFICATION PERFORMANCE ON THE COREL5K DATASET

each image are the ground-truth, while the tags on the right are the top five prediction results. From these figures, we find that the proposed LRML algorithm correctly predicts the labels of the images in most cases. Although a few of the predictions given by LRML are sometimes not in the ground-truth, they are generally related to the image content, which further demonstrates the effectiveness of LRML. For example, in Figure 4 the 'group' and 'people' labels should be positive for image (e), and LRML provides the closely related 'cartoon' and 'wheel' labels, which are not in the ground-truth, to describe image (h).
We varied θ in the LRML algorithm to examine the influence of the number of constrained singular values on the corel5k dataset in Table V. When θ = 0, all the singular

values are penalized, and the proposed LRML algorithm thus reduces to ML-Trace. Tuning the value of θ enables the LRML algorithm to penalize different numbers of tail singular values, which is beneficial for better exploiting the low-rank structure of the multi-label predictor. Compared with ML-Trace, which constrains all the singular values, the best LRML setting usually offers further improvements.
In order to investigate the advantages of variance control in optimization, we plot the objective values of LRML on the corel5k dataset in Figure 6. We observe that the variance control strategy enables LRML to converge quickly. This confirms that the proposed variance control strategy can effectively reduce the variance of the objective variables during optimization,

Fig. 6. The convergence curve of LRML on the corel5k dataset.

and thus encourage the optimal solution to fall in a smaller hypothesis set.

VII. CONCLUSION

In this paper, we use the principle of local Rademacher complexity to guide the design of a new multi-label learning algorithm. We analyze the local Rademacher complexity of ERM-based multi-label learning algorithms, and discover that it is upper bounded by the tail sum of the singular values of the multi-label predictor. Inspired by this local Rademacher complexity bound, a new multi-label learning algorithm is therefore proposed that concentrates solely on the tail singular values of the predictor, rather than on all the singular values as with the trace norm. This use of the local Rademacher complexity results in a sharper generalization error bound; moreover, the new constraint over the tail singular values provides a tighter approximation of the low-rank structure than the trace norm. The experimental results on multi-label classification demonstrate the effectiveness of the proposed algorithm.

ACKNOWLEDGMENT

The authors greatly thank the handling Associate Editor and all anonymous reviewers for their positive support and constructive comments for improving the quality of this paper.

REFERENCES

[1] M.-L. Zhang and Z.-H. Zhou, "A review on multi-label learning algorithms," IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014.
[2] S. Gao, W. Wu, C.-H. Lee, and T.-S. Chua, "A MFoM learning approach to robust multiclass multi-label text categorization," in Proc. 21st Int. Conf. Mach. Learn., 2004, p. 42.
[3] J. Tang, Z.-J. Zha, D. Tao, and T.-S. Chua, "Semantic-gap-oriented active learning for multilabel image annotation," IEEE Trans. Image Process., vol. 21, no. 4, pp. 2354–2360, Apr. 2012.
[4] F. Sun, J. Tang, H. Li, G.-J. Qi, and T. S. Huang, "Multi-label image categorization with sparse factor representation," IEEE Trans. Image Process., vol. 23, no. 3, pp. 1028–1037, Mar. 2014.
[5] J. Fan, Y. Shen, C. Yang, and N. Zhou, "Structured max-margin learning for inter-related classifier training and multilabel image annotation," IEEE Trans. Image Process., vol. 20, no. 3, pp. 837–854, Mar. 2011.
[6] Z. Li, J. Liu, J. Tang, and H. Lu, "Robust structured subspace learning for data representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 10, pp. 2085–2098, Oct. 2015.


[7] J. Tang, Z. Li, M. Wang, and R. Zhao, “Neighborhood discriminant hashing for large-scale image retrieval,” IEEE Trans. Image Process., vol. 24, no. 9, pp. 2827–2840, Sep. 2015. [8] L. Jing and M. K. Ng, “Sparse label-indicator optimization methods for image classification,” IEEE Trans. Image Process., vol. 23, no. 3, pp. 1002–1014, Mar. 2014. [9] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya, “Hierarchical multi-label prediction of gene function,” Bioinformatics, vol. 22, no. 7, pp. 830–836, Jan. 2006. [10] G. Tsoumakas, I. Katakis, and I. Vlahavas, “Mining multi-label data,” in Data Mining and Knowledge Discovery Handbook. New York, NY, USA: Springer, 2010, pp. 667–685. [11] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” J. Mach. Learn., vol. 85, no. 3, pp. 333–359, Dec. 2011. [12] B. Hariharan, L. Zelnik-Manor, M. Varma, and S. Vishwanathan, “Large scale max-margin multi-label classification with priors,” in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 423–430. [13] M.-L. Zhang and K. Zhang, “Multi-label learning by exploiting label dependency,” in Proc. 16th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 999–1008. [14] Y. Guo and W. Xue, “Probabilistic multi-label classification with sparse feature learning,” in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013, pp. 1373–1379. [15] S.-J. Huang, Z.-H. Zhou, and Z. Zhou, “Multi-label learning by exploiting label correlations locally,” in Proc. AAAI, 2012, pp. 1–7. [16] W. Bi and J. T. Kwok, “Multilabel classification with label correlations and missing labels,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2014, pp. 1680–1686. [17] Y.-N. Chen and H.-T. Lin, “Feature-aware label space dimension reduction for multi-label classification,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1529–1537. [18] F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Comput., vol. 24, no. 9, pp. 2508–2542, 2012. [19] V. N. Vapnik, Statistical Learning Theory, vol. 2. New York, NY, USA: Wiley, 1998. [20] M. Xu, Y.-F. Li, and Z.-H. Zhou, “Multi-label learning with pro loss,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2013, pp. 1–7. [21] B. Zhang, Y. Wang, and F. Chen, “Multilabel image classification via high-order label correlation driven active learning,” IEEE Trans. Image Process., vol. 23, no. 3, pp. 1430–1441, Mar. 2014. [22] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482, Mar. 2003. [23] H.-F. Yu, P. Jain, and I. S. Dhillon, “Large-scale multi-label learning with missing labels,” in Proc. 21st Int. Conf. Mach. Learn., 2014, pp. 1–9. [24] C. Cortes, M. Kloft, and M. Mohri, “Learning kernels using local rademacher complexity,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 2760–2768. [25] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” Ann. Statist., vol. 3, no. 4, pp. 1497–1537, Aug. 2005. [26] R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization,” Mach. Learn., vol. 39, nos. 2–3, pp. 135–168, May 2000. [27] R. Yan, J. Tesic, and J. R. Smith, “Model-shared subspace boosting for multi-label classification,” in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 834–843. [28] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognit., vol. 40, no. 7, pp. 2038–2048, Jul. 2007. 
[29] E. Spyromitros, G. Tsoumakas, and I. Vlahavas, “An empirical study of lazy multilabel classification algorithms,” in Artificial Intelligence: Theories, Models and Applications. Berlin, Germany: Springer, 2008, pp. 401–406. [30] J. Tang, R. Hong, S. Yan, T.-S. Chua, G.-J. Qi, and R. Jain, “Image annotation by kNN-sparse graph-based label propagation over noisily tagged Web images,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 2, p. 14, Feb. 2011. [31] A. Clare and R. D. King, “Knowledge discovery in multi-label phenotype data,” in Principles of Data Mining and Knowledge Discovery. Berlin, Germany: Springer, 2001, pp. 42–53. [32] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, “Decision trees for hierarchical multi-label classification,” Mach. Learn., vol. 73, no. 2, pp. 185–214, Nov. 2008.


[33] W. Bi and J. T. Kwok, “Multi-label classification on tree- and DAGstructured hierarchies,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 17–24. [34] K. Crammer and Y. Singer, “A family of additive online algorithms for category ranking,” J. Mach. Learn. Res., vol. 3, pp. 1025–1058, Mar. 2003. [35] M.-L. Zhang and Z.-H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 10, pp. 1338–1351, Oct. 2006. [36] A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 681–687. [37] R. Cabral, F. De la Torre, J. P. Costeira, and A. Bernardino, “Matrix completion for weakly-supervised multi-label image classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 121–135, Jan. 2015. [38] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern Recognit., vol. 37, no. 9, pp. 1757–1771, Sep. 2004. [39] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in Machine Learning: ECML. Springer, 2007, pp. 406–417. [40] J. Read, B. Pfahringer, and G. Holmes, “Multi-label classification using ensembles of pruned sets,” in Proc. 8th IEEE Int. Conf. Data Mining (ICDM), Dec. 2008, pp. 995–1000. [41] J. Fürnkranz, “Round robin classification,” J. Mach. Learn. Res., vol. 2, pp. 721–747, Mar. 2002. [42] T.-F. Wu, C.-J. Lin, and R. C. Weng, “Probability estimates for multiclass classification by pairwise coupling,” J. Mach. Learn. Res., vol. 5, pp. 975–1005, Dec. 2004. [43] Y. Zhang and J. Schneider, “A composite likelihood view for multilabel classification,” in Proc. Int. Conf. Artif. Intell. Statist., 2012, pp. 1407–1415. [44] L. Jing, L. Yang, Y. Jian, and M. K. Ng, “Semi-supervised low-rank mapping learning for multi-label classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 1483–1491. [45] D. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in Proc. NIPS, vol. 22. 2009, pp. 772–780. [46] Y. Zhang and J. G. Schneider, “Multi-label output codes using canonical correlation analysis,” in Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 873–882. [47] Y. Zhang and J. Schneider, “Maximum margin output coding,” in Proc. Int. Conf. Mach. Learn., 2012, pp. 1575–1582. [48] W. Gao and Z.-H. Zhou, “On the consistency of multi-label learning,” Artif. Intell., vols. 199–200, pp. 22–44, Jun./Jul. 2013. [49] K. Dembczy´nski, W. Waegeman, W. Cheng, and E. Hüllermeier, “On label dependence and loss minimization in multi-label classification,” Mach. Learn., vol. 88, no. 1, pp. 5–45, Jul. 2012. [50] Y. Cong, J. Liu, J. Yuan, and J. Luo, “Self-supervised online metric learning with low rank constraint for scene categorization,” IEEE Trans. Image Process., vol. 22, no. 8, pp. 3179–3191, Aug. 2013. [51] L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996. [52] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick, “Largescale image classification with trace-norm regularization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3386–3393. [53] G. Zhu, S. Yan, and Y. Ma, “Image tag refinement towards low-rank, content-tag prior and error sparsity,” in Proc. Int. Conf. Multimedia, 2010, pp. 461–470. [54] S. Ma, D. Goldfarb, and L. 
Chen, "Fixed point and Bregman iterative methods for matrix rank minimization," Math. Program., vol. 128, no. 1, pp. 321–353, Jun. 2011.
[55] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, 2010.
[56] K. C. Toh and S. Yun, "An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems," Pacific J. Optim., vol. 6, no. 3, pp. 615–640, 2010.
[57] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[58] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proc. 26th Annu. Int. Conf. Mach. Learn., 2009, pp. 457–464.
[59] K. Chen, H. Dong, and K.-S. Chan, "Reduced rank regression via adaptive nuclear norm penalization," Biometrika, vol. 100, no. 4, pp. 901–920, Sep. 2013.

[60] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM J. Optim., vol. 24, no. 4, pp. 2057–2075, 2014.
[61] L. Rosasco, S. Villa, and B. C. Vũ. (2014). "Convergence of stochastic proximal gradient algorithm." [Online]. Available: http://arxiv.org/abs/1403.5074
[62] K. Simonyan and A. Zisserman. (2014). "Very deep convolutional networks for large-scale image recognition." [Online]. Available: http://arxiv.org/abs/1409.1556
[63] W. Cheng, E. Hüllermeier, and K. J. Dembczynski, "Bayes optimal multilabel classification via probabilistic classifier chains," in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 279–286.
[64] K. J. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier, "An exact algorithm for F-measure maximization," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 1404–1412.
[65] C.-L. Li and H.-T. Lin, "Condensed filter tree for cost-sensitive multi-label classification," in Proc. 31st Int. Conf. Mach. Learn. (ICML), 2014, pp. 423–431.
[66] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Jan. 2006.

Chang Xu received the B.E. degree from Tianjin University, in 2011. He is currently pursuing the Ph.D. degree with the Key Laboratory of Machine Perception (Ministry of Education), Peking University. He was a Research Intern with the Knowledge Mining Group, Microsoft Research Asia, and a Research Assistant with the Center of Quantum Computation & Intelligent Systems and the Faculty of Engineering & Information Technology, University of Technology Sydney. His research interests lie primarily in machine learning, multimedia search, and computer vision. He won the Best Student Paper Award in ACM ICIMCS 2013.

Tongliang Liu received the B.E. degree in electronics engineering and information science from the University of Science and Technology of China, Hefei, China, in 2012. He is currently pursuing the Ph.D. degree in computer science with the University of Technology Sydney, Ultimo, NSW, Australia. He is interested in statistical learning theory. He received the best paper award in the IEEE International Conference on Information Science and Technology in 2014.


Dacheng Tao (F'15) is a Professor of Computer Science with the Centre for Quantum Computation & Intelligent Systems, and the Faculty of Engineering and Information Technology in the University of Technology Sydney. He mainly applies statistics and mathematics to data analytics problems and his research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and 200+ publications at prestigious journals and prominent conferences, such as the IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM, and ACM SIGKDD, with several best paper awards, such as the Best Theory/Algorithm Paper Runner Up Award in IEEE ICDM'07, the Best Student Paper Award in IEEE ICDM'13, and the 2014 ICDM 10-Year Highest-Impact Paper Award. He received the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award and the 2015 UTS Vice-Chancellor's Medal for Exceptional Research. He is a Fellow of the IEEE, OSA, IAPR, and SPIE.


Chao Xu received the B.E. degree from Tsinghua University, in 1988, the M.S. degree from the University of Science and Technology of China, in 1991, and the Ph.D. degree from the Institute of Electronics, Chinese Academy of Sciences, in 1997. From 1991 to 1994, he was an Assistant Professor with the University of Science and Technology, China. Since 1997, he has been with the School of EECS, Peking University, where he is currently a Professor. He has authored or co-authored over 80 publications and holds five patents in his research fields. His research interests are in image and video coding, processing, and understanding.
