

Constrained Empirical Risk Minimization Framework for Distance Metric Learning

Wei Bian and Dacheng Tao, Senior Member, IEEE

Abstract— Distance metric learning (DML) has received increasing attention in recent years. In this paper, we propose a constrained empirical risk minimization framework for DML. This framework enriches the state-of-the-art studies on both the theoretical and the algorithmic aspects. Theoretically, we comprehensively analyze the generalization by bounding the sample error and the approximation error with respect to the best model. Algorithmically, we carefully derive an optimal gradient descent by using Nesterov's method, and provide two example algorithms that utilize the logarithmic loss and the smoothed hinge loss, respectively. We evaluate the new framework on data classification and image retrieval experiments. Results show that the new framework has competitive performance compared with representative DML algorithms, including Xing's method, the large margin nearest neighbor classifier, neighborhood component analysis, and regularized metric learning.

Index Terms— Data classification, distance metric learning, empirical risk minimization, first-order method, generalization, image retrieval, optimal convergence rate.

I. INTRODUCTION

DISTANCE metrics play an important role in many data analysis tasks, such as classification [1]–[6], clustering [7], recognition [8], [9], and retrieval [10]–[13]. Conventionally, a proper distance metric is usually defined by a particular prior according to the feature used for data representation, e.g., the cosine distance [14] is popular for measuring the distance between documents represented by the bag-of-words model. However, recent research shows that a distance metric can be learned empirically by properly utilizing the similarity and/or dissimilarity between instances when such prior information is not available. Most representative works in this direction learn a Mahalanobis distance metric that properly adapts to the similarity and/or dissimilarity information given in the training dataset. Empirical studies have shown that the learned Mahalanobis distance metric is invaluable for improving the performance of a wide range of learning tasks, from medical image retrieval [15] and object recognition [16] to human pose estimation and activity recognition [17], [18].

Manuscript received June 14, 2011; revised April 28, 2012; accepted April 28, 2012. Date of publication May 22, 2012; date of current version July 16, 2012. This work was supported by the Australian Research Council Discovery Project DP-120103730. The authors are with the Center for Quantum Computation & Intelligent Systems, the Faculty of Engineering & Information Technology, University of Technology, Sydney, NSW 2007, Australia (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2198075

In recent years, dozens of algorithms have been developed for distance metric learning (DML). We group the popular DML algorithms into the following four categories. It is impossible to refer to all the related literature; the survey [19] covers references that are missing here.

1) Similarity/Dissimilarity-Based Algorithms: A natural intuition in DML is to learn a distance metric that keeps similar instances close to each other and dissimilar ones apart. Several DML algorithms have been developed based upon this intuition. For example, Xing et al. [7] proposed to learn a distance metric by minimizing the sum of squared distances between similar instances while keeping the counterpart for dissimilar instances larger than a predefined threshold. Globerson and Roweis [20] proposed the collapsing classes method for DML, in which similar instances are kept as close as possible by mapping them to the same point. In order to encode relative similarity/dissimilarity among instances, Schultz and Joachims [21] proposed the relative comparison algorithm for DML.

2) Information Theory-Based Algorithms: Studies along this direction associate a distance metric (a positive semidefinite matrix) with the covariance matrix of a Gaussian distribution. In particular, a distance metric can be learned by minimizing the relative entropy between two Gaussian distributions, where the first one encodes the distance metric to be learned and the second one models a particular heuristic. For example, information-theoretic metric learning (ITML) [22] specifies the second one as a reference Gaussian distribution and minimizes the relative entropy between the two, subject to a set of similarity/dissimilarity constraints obtained from the dataset. The information geometry algorithm [23] constructs the covariance of the second Gaussian by using an "ideal kernel" computed from labels [24], [25]. Bar-Hillel et al. [26] proposed a DML algorithm that maximizes the mutual information between the original and the transformed instances by using side information about instance similarities.

3) Probabilistic Algorithms: Neighborhood component analysis (NCA) [27] encourages instances from the same class to be close. It defines for each instance the probability of selecting same-class instances as its neighbors, and then learns a distance metric that maximizes the summation of such probabilities over all instances. Another probabilistic DML algorithm is Bayesian DML [28]. It considers DML in a Bayesian setting, where a Wishart prior is imposed on the distance metric, and an approximate algorithm is derived for posterior estimation.



4) Classification-Based Algorithms: By transforming similarity and dissimilarity relationships into positive and negative labels, DML can be cast as a classification problem. Thus, well-established classification models are readily applicable to DML. In analogy to support vector machines (SVMs), Weinberger et al. [29] proposed the large-margin nearest neighbor classifier (LMNN), where the margin is defined as the difference between intraclass and interclass distances. Jin et al. introduced a regularized DML algorithm and analyzed its generalization error [30].

Besides, DML is also related to discriminative subspace selection [31]–[36], as the latter can be regarded as learning a Euclidean distance (ED) with low rank. However, an advantage of DML is that it automatically specifies the weighting parameters of different directions in the feature space.

In our earlier work [37], we presented an empirical risk minimization (ERM) framework for DML and showed basic results on the convergence of the empirical risk to the population risk. This paper largely extends [37] by proposing a constrained ERM framework for DML, as well as presenting algorithmic and theoretical results for the new framework. Compared to previous studies on DML [7], [20], [27], [29], [30], and our earlier work [37], the contributions of this paper are as follows.

1) We propose a trace-norm-constrained ERM framework for DML. The trace-norm constraint encourages sparsity on the spectrum of the learned distance metric by selecting a small number of the most discriminative directions in the feature space, which is important for improving the generalization ability and avoiding overfitting.

2) We prove a generalization bound for the new framework by investigating the sample error and the approximation error of the empirical learning procedure with respect to the best model. The bound shows the importance of the trace-norm constraint in improving the generalization ability by reducing the sample error. To the best of our knowledge, such theoretical analysis for DML is new. The only theoretical result available on DML is in [30], but it does not characterize the performance of empirical learning with respect to the best model. Thus, we believe that the theoretical contribution of this paper is of independent interest and supplements the theoretical studies on DML.

3) We derive an optimal first-order algorithm for the proposed trace-norm-constrained ERM framework. The algorithm is based mainly on Nesterov's method, as in [37], but introduces Dykstra's projection method [38] to project the distance metric onto the feasible set defined by the trace-norm constraint. Most existing DML algorithms [7], [20], [27], [29] are based on the standard gradient descent method, which has a slow convergence rate and thus limits the application of DML to high-dimensional data problems, e.g., image retrieval. In this paper, we propose to utilize the optimal first-order method to learn the distance metric. The obtained learning algorithm has the optimal convergence rate of O(1/k²) and is thus much faster than the standard gradient descent search, whose convergence rate is O(1/k), where k is the iteration number of the search.


The rest of this paper is organized as follows. In Section II, we present the constrained ERM framework for DML. Section III is devoted to the generalization analysis. In Section IV, we derive the optimal first-order method for solving the proposed framework, and give two example algorithms. Section V reports the experimental results on the UCI machine learning repository datasets and the MIT outdoor scene image dataset. Section VI concludes this paper.

II. DML BY EMPIRICAL RISK MINIMIZATION

In this section, we first define the expected risk for DML given the underlying distribution. Afterward, we present the constrained ERM framework to learn the distance metric from a set of training data.

A. Expected Risk

Without loss of generality, we consider the DML problem given the underlying distribution P(X, X', S) on $\mathcal{X} \times \mathcal{X} \times \{1, -1\}$, wherein $\mathcal{X}$ is a compact subset of the vector space $\mathbb{R}^m$ specified by $\mathcal{X} = \{x : \|x\|^2 \le a_0/4\}$.¹ For a realization (x, x', s), we say x and x' are similar (or dissimilar) if s is equal to 1 (or −1). Under this setting, we regard the DML problem as finding the distance metric that best fits P(X, X', S), so that the learned distance metric properly measures the distance between two instances x and x'. The distance metric M parameterizes a family of Mahalanobis distances, and thus $M \succeq 0$ belongs to the positive semidefinite matrix space $\mathbb{S}^{m\times m}_{+}$. Given M, the distance between two instances x and x' is given by

$$ d_M(x, x') = (x - x')^T M (x - x'). \tag{1} $$

The above distance is defined globally over the entire instance space. Given a distance metric M, a natural way to determine whether two instances are similar to each other or not is to compare their distance with a proper threshold c [28], [37], [39]

$$ \hat{s}(x, x') = \begin{cases} 1, & d_M(x, x') \le c \\ -1, & d_M(x, x') > c. \end{cases} \tag{2} $$

Given a set of similar or dissimilar instance pairs x and x', in order to properly learn the distance metric M, we introduce the following risk function to measure the disagreement between s and $\hat{s}(x, x')$:

$$ r(x, x', s \mid M, c) = \ell\big(s(c - d_M(x, x'))\big) \tag{3} $$

where $\ell(\cdot)$ is a monotonically decreasing loss function, e.g., the logarithmic loss. When s = 1, $d_M(x, x') < c$ gives a smaller risk than $d_M(x, x') > c$; the converse holds for s = −1. Further, by integrating (3) with respect to P(X, X', S), we obtain the expected risk

$$ R(M, c) = \int \ell\big(s(c - d_M(x, x'))\big)\, dP(x, x', s). \tag{4} $$

¹The bound $a_0/4$ is only for the convenience of later discussion.
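To make (1) and (2) concrete, the following minimal Python sketch computes the Mahalanobis distance for a pair of instances and applies the threshold rule; the function names and the example metric and threshold are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def mahalanobis_distance(x, x_prime, M):
    """Squared Mahalanobis distance d_M(x, x') = (x - x')^T M (x - x'), as in (1)."""
    diff = x - x_prime
    return float(diff @ M @ diff)

def predict_similarity(x, x_prime, M, c):
    """Decision rule (2): +1 if d_M(x, x') <= c, and -1 otherwise."""
    return 1 if mahalanobis_distance(x, x_prime, M) <= c else -1

# Illustrative usage with an identity metric (i.e., the Euclidean distance) and a guessed threshold;
# in the paper M and c are learned jointly from the data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, x_prime = rng.normal(size=3), rng.normal(size=3)
    M = np.eye(3)   # any positive semidefinite matrix is admissible
    c = 1.0         # hypothetical threshold for illustration only
    print(predict_similarity(x, x_prime, M, c))
```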


Thus, the optimal distance metric M* and the corresponding optimal threshold c* that best adapt to P(X, X', S) are given by

$$ (M^*, c^*) = \arg\min_{(M,c)\in\mathcal{Q}} R(M, c). \tag{5} $$

We refer to (M*, c*) as the best model. For the hypothesis space Q in (5), we have the following two concerns. 1) Since homogeneous rescaling should not affect the similarity between instances, we restrict $0 \preceq M \preceq I$. 2) Accordingly, we have $d_M(x, x') \le \|x - x'\|^2 \le a_0$ over the entire instance space $\mathcal{X}$, and thus the threshold c is meaningful only if c ≤ a0. Therefore, we have the hypothesis space

$$ \mathcal{Q} = \{(M, c) : 0 \preceq M \preceq I,\ 0 \le c \le a_0\}. \tag{6} $$

The above analysis shows that the range of c is induced given the space of M. An alternative is to restrict c first, e.g., setting c = 1 [30], and then determine the space of M. However, our choice makes the subsequent theoretical analysis of generalization convenient.

B. Constrained ERM for DML

In practice, we have a set of training data sampled from the underlying distribution P(X, X', S), although P(X, X', S) itself is not available. Denote the training data by $D = \{(x_i, x_i', s_i)\}$, $i = 1, 2, \ldots, n$, where the triplets are independent and identically distributed. Then an empirical estimation of R(M, c), i.e., the empirical risk, is given by

$$ \widehat{R}(M, c) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(s_i(c - d_M(x_i, x_i'))\big). \tag{7} $$

Although (7) provides an unbiased estimation of the expected risk, learning by directly minimizing (7) suffers from overfitting. Therefore, a regularization method is usually applied to control the model complexity; for example, [30] uses the Frobenius norm of M as the regularizer, while [22] adopts the Kullback–Leibler divergence between M and a reference M0. Besides, [28] places a Wishart prior on M to reduce overfitting. In this paper, we introduce a constraint method to control the model complexity, which is equivalent to the regularization method for learning but simplifies the subsequent generalization analysis. Specifically, we utilize the trace-norm constraint, which gives the reduced hypothesis space

$$ \mathcal{Q}_\rho = \{(M, c) \in \mathcal{Q} : \operatorname{Trace}(M) \le \rho\} \tag{8} $$

where $0 < \rho \le m$ is a tuning parameter that controls the volume of the search space. The trace-norm constraint has the following advantages. For a positive semidefinite matrix M, it satisfies

$$ \operatorname{Trace}(M) = \|\lambda(M)\|_1 \tag{9} $$

where $\lambda(\cdot)$ is the eigenvalue function and $\|\cdot\|_1$ is the $\ell_1$ norm over vectors. Since the $\ell_1$ norm encourages sparsity by penalizing the small eigenvalues of M to be exactly zero, the trace-norm constraint helps discard redundant directions (including noisy directions) when adapting M to P(X, X', S). Besides, from a VC-theory point of view, our sparsity-encouraging constraint helps reduce the VC dimension, i.e., the complexity, of the learned model. In contrast to the trace norm, Frobenius norm regularization satisfies

$$ \|M\|_2 = \|\lambda(M)\|_2 \tag{10} $$

which is equivalent to imposing the $\ell_2$ norm over the eigenvalues. It thus does not encourage sparsity and may enlarge the impact of the noisy directions.

Based on the above discussion, we finally estimate the distance metric and the corresponding threshold by

$$ (\widehat{M}, \hat{c}) = \arg\min_{(M,c)\in\mathcal{Q}_\rho} \widehat{R}(M, c) \tag{11} $$

where different loss functions $\ell(\cdot)$ can be chosen in (7). By using different loss functions, we obtain different forms of (11), and thus we refer to (11) as a constrained ERM framework for DML.
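As a reading aid for (7) and the constraint set (8), the sketch below evaluates the empirical risk of a candidate (M, c) over labeled pairs with the logarithmic loss and checks membership in Q_ρ; it is a minimal illustration under our own naming conventions, not the authors' implementation.

```python
import numpy as np

def empirical_risk(M, c, pairs, labels):
    """Empirical risk (7) with the logarithmic loss l(u) = ln(1 + exp(-u)).
    pairs: array of shape (n, 2, m) holding (x_i, x'_i); labels: array of +/-1 values s_i."""
    diffs = pairs[:, 0, :] - pairs[:, 1, :]
    dists = np.einsum('ij,jk,ik->i', diffs, M, diffs)   # d_M(x_i, x'_i)
    margins = labels * (c - dists)                      # s_i (c - d_M)
    return float(np.mean(np.logaddexp(0.0, -margins)))  # ln(1 + e^{-u}), numerically stable

def in_feasible_set(M, c, a0, rho, tol=1e-8):
    """Membership test for Q_rho in (8): 0 <= M <= I, 0 <= c <= a0, Trace(M) <= rho."""
    eigvals = np.linalg.eigvalsh((M + M.T) / 2)
    return (eigvals.min() >= -tol and eigvals.max() <= 1 + tol
            and 0 <= c <= a0 and np.trace(M) <= rho + tol)
```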

III. GENERALIZATION ANALYSIS

In this section, we theoretically analyze the generalization of the proposed constrained ERM framework. In particular, we derive an upper bound on the difference between the expected risks of the learned model and the best model

$$ R(\widehat{M}, \hat{c}) - \min_{(M,c)\in\mathcal{Q}} R(M, c) = R(\widehat{M}, \hat{c}) - R(M^*, c^*). \tag{12} $$

To the best of our knowledge, such a generalization bound for DML has not been studied yet. A related result on the generalization of DML [30] has been derived from the stability point of view [40], which bounds the difference between the expected risk and the empirical risk of the learned model, i.e., $R(\widehat{M}, \hat{c}) - \widehat{R}(\widehat{M}, \hat{c})$. In contrast, our analysis examines the performance of the learned model $(\widehat{M}, \hat{c})$ with respect to the best model (M*, c*).

First, we define $(M^*_\rho, c^*_\rho)$ as

$$ (M^*_\rho, c^*_\rho) = \arg\min_{(M,c)\in\mathcal{Q}_\rho} R(M, c). \tag{13} $$

It is clear that $(M^*_\rho, c^*_\rho)$ is biased from (M*, c*). This bias diminishes as the tradeoff parameter ρ increases to m, because $\mathcal{Q}_\rho = \mathcal{Q}$ when ρ = m. Recall the definition of $(\widehat{M}, \hat{c})$

$$ (\widehat{M}, \hat{c}) = \arg\min_{(M,c)\in\mathcal{Q}_\rho} \widehat{R}(M, c). \tag{14} $$

Since (14) is an exact empirical version of (13), the difference between $R(\widehat{M}, \hat{c})$ and $R(M^*_\rho, c^*_\rho)$ can be characterized by a probabilistic bound. By using $R(M^*_\rho, c^*_\rho)$, we decompose the difference (12) into

$$ \big(R(\widehat{M}, \hat{c}) - R(M^*_\rho, c^*_\rho)\big) + \big(R(M^*_\rho, c^*_\rho) - R(M^*, c^*)\big). \tag{15} $$

The first term in (15) depends on the training data and is referred to as the sample error, while the second is independent of the training data but affected by the tradeoff parameter ρ, and thus is referred to as the approximation error.

A. Probabilistic Bound on the Sample Error

In the following analysis, we require the loss function $\ell(u)$ to be monotonically decreasing and Lipschitz continuous with a constant η, that is

$$ \ell(u_1) \ge \ell(u_2) \quad \text{if } u_1 \le u_2 \tag{16} $$

and

$$ |\ell(u_1) - \ell(u_2)| \le \eta |u_1 - u_2|. \tag{17} $$

Most smooth loss functions, such as the logarithmic loss and the smoothed hinge loss, satisfy the above requirements. The approach for bounding the sample error is based on a uniform analysis of the deviation $\widehat{R}(M,c) - R(M,c)$ over $\mathcal{Q}_\rho$. Specifically, we first derive a concentration inequality for a fixed model (M, c). Then we generalize it into a uniform concentration inequality by utilizing the concept of the covering number, which finally gives a probabilistic bound on the sample error.

We first derive a concentration inequality for a fixed model (M, c) in the following Lemma 1.

Lemma 1: For $\forall (M, c) \in \mathcal{Q}_\rho$ and $\varepsilon > 0$, and $\ell(\cdot)$ satisfying (16), it holds that

$$ \Pr\Big[|\widehat{R}(M,c) - R(M,c)| \le \varepsilon\Big] \ge 1 - 2\exp\left(-\frac{2n\varepsilon^2}{\ell^2(-a_0)}\right). \tag{18} $$

Proof: By letting $\xi(s, x, x') = \ell(s(c - d_M(x, x')))$, we have $\widehat{R}(M,c) = \frac{1}{n}\sum_{i=1}^{n}\xi(s_i, x_i, x_i')$ and $R(M,c) = \mathbb{E}[\xi(s, x, x')]$. Since $(M, c) \in \mathcal{Q}_\rho$, it holds that

$$ 0 \le \xi(s, x, x') = \ell\big(s(c - d_M(x, x'))\big) \le \ell\big(-|c - d_M(x, x')|\big) \le \ell\big(-\max(a_0, \max d_M(x, x'))\big) = \ell(-a_0) \tag{19} $$

where we used the fact that $\max d_M(x, x') \le a_0$. Then, by Hoeffding's inequality, we complete the proof.

Afterward, we give a uniform concentration inequality over $\mathcal{Q}_\rho$ by using the covering number. By $r = N(\mathcal{Q}_\rho, \varepsilon)$ we mean that $\mathcal{Q}_\rho$ can be covered by r balls of radius ε. Note that covering a set requires the definition of a metric. To this end, we define the metric on (M, c) as²

$$ \|(M_1, c_1) - (M_2, c_2)\| = a_0\|M_1 - M_2\| + |c_1 - c_2|. \tag{20} $$

²It can be verified that (20) satisfies the common requirements for being a metric.

Then, the uniform concentration inequality is given by the following lemma.

Lemma 2: For $\forall \varepsilon > 0$, $0 < \rho \le m$, and $\ell(\cdot)$ satisfying (16) and (17), it holds that

$$ \Pr\Big[\sup_{(M,c)\in\mathcal{Q}_\rho} |\widehat{R}(M,c) - R(M,c)| \le \varepsilon\Big] \ge 1 - N\!\left(\mathcal{Q}_\rho, \tfrac{\varepsilon}{4\eta}\right) 2\exp\left(-\frac{n\varepsilon^2}{2\,\ell^2(-a_0)}\right) \tag{21} $$

where $N(\mathcal{Q}_\rho, \varepsilon/4\eta)$ is the covering number of $\mathcal{Q}_\rho$ with balls of radius $\varepsilon/4\eta$.

Proof: First, from the Lipschitz continuity of $\ell$, it holds that

$$ |\widehat{R}(M_1,c_1) - \widehat{R}(M_2,c_2)| \le \Big|\frac{1}{n}\sum_{i=1}^{n} \big[\ell(s_i(c_1 - d_{M_1}(x_i,x_i'))) - \ell(s_i(c_2 - d_{M_2}(x_i,x_i')))\big]\Big| \le \frac{\eta}{n}\sum_{i=1}^{n} \big|s_i(c_1 - d_{M_1}(x_i,x_i')) - s_i(c_2 - d_{M_2}(x_i,x_i'))\big| \le \eta\big(|c_1 - c_2| + a_0\|M_1 - M_2\|\big) = \eta\,\|(M_1,c_1) - (M_2,c_2)\|. \tag{22} $$

Similarly, by replacing the empirical average with the expectation, we have

$$ |R(M_1,c_1) - R(M_2,c_2)| \le \eta\,\|(M_1,c_1) - (M_2,c_2)\|. \tag{23} $$

Denoting $r = N(\mathcal{Q}_\rho, \varepsilon/4\eta)$, we can decompose $\mathcal{Q}_\rho$ into $\mathcal{Q}_\rho = \cup_{j=1}^{r} S_j$ such that each $S_j$ is centered at $(M_j, c_j)$ and has radius $\varepsilon/4\eta$. Then, for $\forall (M, c) \in S_j$ it holds that

$$ \sup_{(M,c)\in S_j}\Big|\,|\widehat{R}(M,c) - R(M,c)| - |\widehat{R}(M_j,c_j) - R(M_j,c_j)|\,\Big| \le \sup_{(M,c)\in S_j}\Big(|\widehat{R}(M,c) - \widehat{R}(M_j,c_j)| + |R(M,c) - R(M_j,c_j)|\Big) \le 2\eta \sup_{(M,c)\in S_j}\|(M,c) - (M_j,c_j)\| \le 2\eta\cdot\frac{\varepsilon}{4\eta} = \frac{\varepsilon}{2}. \tag{24} $$

Thus, by (24), we have

$$ \Pr\Big[\sup_{(M,c)\in S_j} |\widehat{R}(M,c) - R(M,c)| \le \varepsilon\Big] \ge \Pr\Big[|\widehat{R}(M_j,c_j) - R(M_j,c_j)| \le \tfrac{\varepsilon}{2}\Big] \ge 1 - 2\exp\left(-\frac{n\varepsilon^2}{2\,\ell^2(-a_0)}\right) \quad \text{(by Lemma 1)}. \tag{25} $$

Finally, by $\mathcal{Q}_\rho = \cup_{j=1}^{r} S_j$ and the union bound, we have (21).

By applying Lemma 2, we obtain the following probabilistic bound on the sample error.

Theorem 1: For $\forall \varepsilon > 0$, $0 < \rho \le m$, and $\ell(\cdot)$ satisfying (16) and (17), it holds for the learned model $(\widehat{M}, \hat{c})$ that

$$ \Pr\Big[R(\widehat{M},\hat{c}) - R(M^*_\rho, c^*_\rho) \le \varepsilon\Big] \ge 1 - N\!\left(\mathcal{Q}_\rho, \tfrac{\varepsilon}{4\eta}\right) 2\exp\left(-\frac{n\varepsilon^2}{8\,\ell^2(-a_0)}\right). \tag{26} $$

Proof: By Lemma 2, we have

$$ \Pr\Big[|\widehat{R}(\widehat{M},\hat{c}) - R(\widehat{M},\hat{c})| \le \tfrac{\varepsilon}{2},\ |\widehat{R}(M^*_\rho,c^*_\rho) - R(M^*_\rho,c^*_\rho)| \le \tfrac{\varepsilon}{2}\Big] \ge \Pr\Big[\sup_{(M,c)\in\mathcal{Q}_\rho} |\widehat{R}(M,c) - R(M,c)| \le \tfrac{\varepsilon}{2}\Big] \ge 1 - N\!\left(\mathcal{Q}_\rho, \tfrac{\varepsilon}{4\eta}\right) 2\exp\left(-\frac{n\varepsilon^2}{8\,\ell^2(-a_0)}\right). \tag{27} $$

Further, since

$$ R(\widehat{M},\hat{c}) - R(M^*_\rho,c^*_\rho) = R(\widehat{M},\hat{c}) - \widehat{R}(\widehat{M},\hat{c}) + \widehat{R}(\widehat{M},\hat{c}) - R(M^*_\rho,c^*_\rho) \le R(\widehat{M},\hat{c}) - \widehat{R}(\widehat{M},\hat{c}) + \widehat{R}(M^*_\rho,c^*_\rho) - R(M^*_\rho,c^*_\rho) \le |\widehat{R}(\widehat{M},\hat{c}) - R(\widehat{M},\hat{c})| + |\widehat{R}(M^*_\rho,c^*_\rho) - R(M^*_\rho,c^*_\rho)| \tag{28} $$

and from (27), we have (26). This completes the proof.

Theorem 1 shows that a large n gives high confidence in the bound on the sample error. In addition, a small ρ leads to a small $N(\mathcal{Q}_\rho, \varepsilon/4\eta)$, which also enhances the confidence. Besides, by fixing the confidence, we can show that the upper bound on the sample error decreases with increasing n or with decreasing ρ. This requires an explicit estimation of $N(\mathcal{Q}_\rho, \varepsilon/4\eta)$. According to Proposition 5 of [41, Ch. 1], we get an upper bound on the covering number, given by the following lemma.

Lemma 3: $\ln N(\mathcal{Q}_\rho, \varepsilon/4\eta) \le \dfrac{m(m+1) + 2}{2} \ln\dfrac{16 a_0 (\rho + 1)\eta}{\varepsilon}$.

Proof: Since $M \in \mathbb{S}^{m\times m}$, it is clear that the dimensionality of $\mathcal{Q}_\rho$ is $m(m+1)/2 + 1$. Besides, $c \le a_0$ and $\operatorname{Trace}(M) \le \rho$ indicate $a_0\|M\| + |c| \le a_0(\rho + 1)$, i.e., the radius of $\mathcal{Q}_\rho$ is $a_0(\rho+1)$. Then, according to Proposition 5 of [41, Ch. 1], we get the result.

B. Bound on the Approximation Error

Regarding the approximation error, we have the following theorem.

Theorem 2: For $\forall\, 0 < \rho \le m$ and $\ell(\cdot)$ satisfying (16) and (17), denote by $\lambda^*$ the eigenvalues of the best model M*; then it holds that

$$ R(M^*_\rho, c^*_\rho) - R(M^*, c^*) \le \eta a_0 \min_{0\le\lambda_1\le 1,\ e^T\lambda_1\le\rho} \|\lambda_1 - \lambda^*\|. \tag{29} $$

Proof: According to the definition (13), for $\forall (M_1, c_1) \in \mathcal{Q}_\rho$, it holds that

$$ R(M^*_\rho, c^*_\rho) \le R(M_1, c_1). \tag{30} $$

In addition, since $\mathcal{Q} \supseteq \mathcal{Q}_\rho$, we have

$$ R(M^*, c^*) \le R(M^*_\rho, c^*_\rho). \tag{31} $$

Thus

$$ R(M^*_\rho, c^*_\rho) - R(M^*, c^*) \le \min_{(M_1,c_1)\in\mathcal{Q}_\rho} |R(M_1, c_1) - R(M^*, c^*)|. \tag{32} $$

Further, by using (23) and (20),³ we have

$$ R(M^*_\rho, c^*_\rho) - R(M^*, c^*) \le \min_{(M_1,c_1)\in\mathcal{Q}_\rho} \eta\big(a_0\|M_1 - M^*\| + |c_1 - c^*|\big) = \eta a_0 \min_{M_1\in\mathcal{Q}_\rho} \|M_1 - M^*\|. \tag{33} $$

³Though (23) and (20) are defined over $\mathcal{Q}_\rho$, they are readily extended to $\mathcal{Q}$.

To minimize the distance between M1 and M*, M1 should have the same eigenspace as the latter. Thus, we can estimate the minimum distance by examining the distance between their eigenvalues $\lambda_1$ and $\lambda^*$

$$ \min_{M_1\in\mathcal{Q}_\rho} \|M_1 - M^*\| = \min_{0\le\lambda_1\le 1,\ e^T\lambda_1\le\rho} \|\lambda_1 - \lambda^*\|. \tag{34} $$

This completes the proof.

Therefore, the upper bound on the approximation error in (29) decreases with increasing ρ. In particular, when ρ = m, $\lambda^*$ becomes feasible, and thus the upper bound (and hence the approximation error) reduces to zero. In addition, we can also obtain a worst-case upper bound that is independent of $\lambda^*$ by using the fact that

$$ \max_{0\le\lambda^*\le 1}\ \min_{0\le\lambda_1\le 1,\ e^T\lambda_1\le\rho} \|\lambda_1 - \lambda^*\| \le \min_{0\le\lambda_1\le 1,\ e^T\lambda_1\le\rho}\ \max_{0\le\lambda^*\le 1} \|\lambda_1 - \lambda^*\| = \min_{0\le\lambda_1\le 1,\ e^T\lambda_1\le\rho} \|\lambda_1 - e\| = \frac{m - \rho}{\sqrt{m}}. \tag{35} $$

Again, this shows explicitly that the approximation error reduces to zero when ρ = m.

C. Empirical Evaluation

We design a two-class classification problem in $\mathbb{R}^{10}$, where the two classes are separable in the first five dimensions and completely overlapped in the remaining five dimensions. Thus, the optimal distance metric is $M^* = \operatorname{diag}(I_5, 0)$. We model each class as a unit polyhedron at a randomly selected location and normalize the data such that the magnitude $\|x\|^2$ is no larger than 1/4, which means a0 = 1 in (26). Then, we use different numbers of training samples, from 50 to 5000, to learn the distance metric, and use 5000 test samples to estimate the test risk. The generalization risk is composed of the sample error and the approximation error, with the confidence set to 0.9 and the best model's risk R(M*, c*) estimated from 10 000 samples. The model parameter ρ, used both in the empirical learning and in calculating the sample error, is selected by 10-fold cross validation on the training set. Two loss functions are examined in the experiment, i.e., the logarithmic loss and the smoothed hinge loss (see the next section for details), and the corresponding Lipschitz constants η are both 1. Fig. 1 shows the simulation results. With an increase in the number of training samples, the generalization risk bound approaches the test risk, and the generalization risk decreases at a rate of roughly $1/\sqrt{n}$, which is consistent with (26).
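For readers who wish to reproduce the flavor of this simulation, here is a hedged sketch that samples the two classes described above (informative first five dimensions, overlapping last five) and rescales the data so that $\|x\|^2 \le 1/4$ (hence a0 = 1); the uniform class-shape, the random class locations, and the sample counts are our own assumptions and only approximate the "unit polyhedron" construction used in the paper.

```python
import numpy as np

def make_synthetic_data(n_per_class, seed=0):
    """Two classes in R^10: the first five dimensions separate the classes, the last five
    are pure noise; the data are globally rescaled so that ||x||^2 <= 1/4 (i.e., a0 = 1)."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-1.0, 1.0, size=(2, 5))   # hypothetical class locations
    X, y = [], []
    for label, center in enumerate(centers):
        informative = center + rng.uniform(-0.5, 0.5, size=(n_per_class, 5))
        noise = rng.uniform(-0.5, 0.5, size=(n_per_class, 5))  # overlapping dimensions
        X.append(np.hstack([informative, noise]))
        y += [label] * n_per_class
    X = np.vstack(X)
    X *= 0.5 / np.linalg.norm(X, axis=1).max()      # global rescaling: ||x|| <= 1/2
    return X, np.asarray(y)
```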

IV. OPTIMAL FIRST-ORDER ALGORITHM

In this section, we derive an efficient first-order algorithm for solving (11), which converges at the optimal rate O(1/k²) in the sense of Nemirovsky and Yudin [42]. We first give a general algorithm under the Lipschitz-continuous gradient assumption on the empirical risk, and then present example algorithms with two specific loss functions that provably satisfy this assumption.


Fig. 1. Empirical evaluation of the generalization bound with two loss functions. (Left: logarithmic loss; right: smoothed hinge loss. Each panel plots the test risk and the generalization risk bound against the training sample size.)

A. General First-Order Algorithm

For convenience, we denote the empirical risk in (11) by

$$ J(X) = \widehat{R}(M, c) \tag{36} $$

where $X \in \mathbb{S}^{(m+1)\times(m+1)}$ has the form

$$ X = \operatorname{diag}(M, c) = \begin{bmatrix} M & 0 \\ 0^T & c \end{bmatrix}. $$

Further, we need the conditions that J(X) is convex and has a Lipschitz-continuous gradient

$$ \|\nabla J(X) - \nabla J(X')\|^2 \le L \langle \nabla J(X) - \nabla J(X'),\, X - X' \rangle \tag{37} $$

where L is the Lipschitz constant of $\nabla J(X)$. These conditions will be verified with the specific loss functions in the example algorithms provided later. Given the convexity and Lipschitz-continuous gradient conditions, we can solve (36) at the optimal convergence rate O(1/k²), where k is the number of iteration steps. We apply Nesterov's optimal first-order method, which has recently been widely used in the machine learning field, e.g., for nonnegative matrix factorization [43]–[45]. Specifically, given the current solution $X_k$, Nesterov's method uses the following three steps to calculate $X_{k+1}$ for the next iteration.

1) Solve the standard first-order problem

$$ X^1_{k+1} = \arg\min_{X\in\mathcal{Q}_\rho}\ \frac{L}{2}\|X - X_k\|^2 + \langle \nabla J(X_k),\, X - X_k \rangle. \tag{38} $$

2) Combine all previous gradients and solve

$$ X^2_{k+1} = \arg\min_{X\in\mathcal{Q}_\rho}\ \frac{L}{\sigma_p} p(X) + \sum_{i=0}^{k} \alpha_i \langle \nabla J(X_i),\, X - X_i \rangle \tag{39} $$

where the weighting parameter is $\alpha_i = (i+1)/2$ according to [46], and p(X) is called the prox-function and is required to be strongly convex with parameter $\sigma_p$.

3) Update $X_{k+1}$ by

$$ X_{k+1} = \frac{k+1}{k+3} X^1_{k+1} + \frac{2}{k+3} X^2_{k+1}. \tag{40} $$

It has been proved that the convergence rate of the above procedure is O(1/k²). We now show the details for minimizing (36) in steps 1 and 2.

Step 1: Note that $X_k = \operatorname{diag}(M_k, c_k)$, and thus

$$ \nabla J(X_k) = \operatorname{diag}(\nabla J_M, \nabla J_c) = \begin{bmatrix} \nabla J_M(X_k) & 0 \\ 0^T & \nabla J_c(X_k) \end{bmatrix}. \tag{41} $$

Then, step 1 can be rewritten as

$$ \{M^1_{k+1}, c^1_{k+1}\} = \arg\min_{(M,c)\in\mathcal{Q}_\rho}\ \frac{L}{2}\left\| M - \Big(M_k - \frac{\nabla J_M(X_k)}{L}\Big) \right\|^2 + \frac{L}{2}\left( c - \Big(c_k - \frac{\nabla J_c(X_k)}{L}\Big) \right)^2. \tag{42} $$

Since $M^1_{k+1}$ and $c^1_{k+1}$ are independent of each other, we can optimize them individually. Specifically, for $c^1_{k+1}$, an immediate result is

$$ c^1_{k+1} = \min\left\{ \max\left\{0,\ c_k - \frac{\nabla J_c(X_k)}{L}\right\},\ a_0 \right\} \tag{43} $$

where we have used the constraints $c \ge 0$ and $c \le a_0$ according to $\mathcal{Q}_\rho$. For $M^1_{k+1}$, one needs to solve

$$ M^1_{k+1} = \arg\min_{(M,c)\in\mathcal{Q}_\rho} \left\| M - \Big(M_k - \frac{\nabla J_M(X_k)}{L}\Big) \right\|^2. \tag{44} $$

Since the $\|\cdot\|_2$ norm of a matrix is invariant to rotation, we can transform (44) into

$$ M^1_{k+1} = \arg\min_{M\in\mathcal{Q}_\rho} \left\| U^T M U - \Lambda \right\|^2 \tag{45} $$

where U and Λ are obtained from the eigendecomposition

$$ M_k - \frac{\nabla J_M(X_k)}{L} = U \Lambda U^T. \tag{46} $$

Since Λ is a diagonal matrix, i.e., $\Lambda = \operatorname{diag}(\lambda)$, (45) implies that $U^T M U$ is also a diagonal matrix. By letting $U^T M U = \operatorname{diag}(\tilde{\lambda})$, the minimization problem (45) is equivalent to

$$ \tilde{\lambda} = \arg\min_{z} \|z - \lambda\|^2 \quad \text{s.t.}\quad 0 \le z \le 1,\ e^T z \le \rho. \tag{47} $$

By denoting $\tilde{\Lambda} = \operatorname{diag}(\tilde{\lambda})$, we then have $U^T M U = \tilde{\Lambda}$, i.e., the optimal solution $M^1_{k+1}$ is given by

$$ M^1_{k+1} = U \tilde{\Lambda} U^T. \tag{48} $$

The only problem remaining is solving (47). Note that (47) is actually a classical problem of projection onto a convex set. Thus, it can be efficiently solved by Dykstra's algorithm [38]. Algorithm 1 presents the pseudo code.

Algorithm 1 Dykstra's algorithm for (47) and (53)
Input: ρ, m, u2 = λ or σ, v1 = 0 and v2 = 0.
Output: z.
For k = 0, 1, 2, ...
  1. z = u2 + v1;
  2. u1 = z, u1(z > 1) = 1, u1(z < 0) = 0;
  3. v1 = z − u1;
  4. z = u1 + v2;
  5. u2 = z − max(0, eᵀz − ρ)/m;
  6. v2 = z − u2;
Until converged.

Step 2: First, we specify the prox-function p(X) as

$$ p(X) = \|X\|^2 \tag{49} $$

which vanishes at $X_0 = O$ and has convexity parameter $\sigma_p = 2$. Recall that $X^2_{k+1} = \operatorname{diag}(M^2_{k+1}, c^2_{k+1})$. Similar to step 1, we have

$$ c^2_{k+1} = \min\left\{ \max\left\{0,\ \frac{-\sum_{i=0}^{k}\alpha_i \nabla J_c(X_i)}{L}\right\},\ a_0 \right\} \tag{50} $$

and

$$ M^2_{k+1} = V \tilde{\Sigma} V^T \tag{51} $$

where V is obtained from the eigendecomposition

$$ \frac{-\sum_{i=0}^{k}\alpha_i \nabla J_M(X_i)}{L} = V \Sigma V^T \tag{52} $$

and $\tilde{\Sigma} = \operatorname{diag}(\tilde{\sigma})$ is obtained from

$$ \tilde{\sigma} = \arg\min_{z} \|z - \sigma\| \quad \text{s.t.}\quad 0 \le z \le 1,\ e^T z \le \rho \tag{53} $$

which is solved by Algorithm 1. The pseudo code for all three steps is summarized in Algorithm 2.

Algorithm 2 Optimal first-order method for the constrained ERM framework
Input: The training dataset (xᵢ, x'ᵢ, sᵢ), i = 1, 2, ..., n.
Output: X = diag(M, c), i.e., distance metric M and decision threshold c.
Initialize: M0 = I and c0 = 1.
For k = 0, 1, 2, ...
  1. Compute ∇J(Xk), by (56) and (57) with the logarithmic loss function, or by (66) and (67) with the smoothed hinge loss function.
  2. Update X¹ₖ₊₁ = diag(M¹ₖ₊₁, c¹ₖ₊₁) by (43) and (48).
  3. Update X²ₖ₊₁ = diag(M²ₖ₊₁, c²ₖ₊₁) by (50) and (51).
  4. Update Xₖ₊₁ = ((k+1)/(k+3)) X¹ₖ₊₁ + (2/(k+3)) X²ₖ₊₁.
Until ‖Xₖ₊₁ − Xₖ‖² < ε.

B. Example Algorithms

In this section, we present two examples by considering two different loss functions, i.e., the logarithmic loss and a smoothed hinge loss [47]. In classification, the logarithmic loss and the hinge loss are the most widely utilized, and correspond to logistic regression and SVMs, respectively. Since the similarity/dissimilarity between instances is also a binary output, these two loss functions can be directly used in the proposed framework for DML. One problem with the hinge loss is its nonsmoothness, which makes it unsuitable for the optimal first-order algorithm. Thus, we adopt a smoothed hinge loss in our algorithm. For both loss functions, we will show that the objective J(X) in (36) is convex and has a Lipschitz-continuous gradient.

1) Logarithmic Loss Function: The logarithmic loss function is defined by

$$ \ell_{\mathrm{lg}}(u) = \ln\big(1 + \exp(-u)\big). \tag{54} $$

Accordingly, the objective J(X) becomes

$$ J(X) = \frac{1}{n}\sum_{i=1}^{n} \ln\left(1 + e^{-s_i(c - (x_i - x_i')^T M (x_i - x_i'))}\right). \tag{55} $$

Since the logarithmic loss $\ell_{\mathrm{lg}}(u)$ is convex and $u = s_i(c - (x_i - x_i')^T M(x_i - x_i'))$ is jointly linear in (M, c), we know that J(X) is convex. The derivatives of J(X) with respect to M and c are given by

$$ \nabla J_M(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{s_i (x_i - x_i')(x_i - x_i')^T}{1 + e^{s_i(c - (x_i - x_i')^T M (x_i - x_i'))}} \tag{56} $$

and

$$ \nabla J_c(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{-s_i}{1 + e^{s_i(c - (x_i - x_i')^T M (x_i - x_i'))}}. \tag{57} $$

The following property checks the Lipschitz gradient property of J(X) and shows that J(X) has a Lipschitz-continuous gradient whose Lipschitz constant is upper bounded by

$$ L = \frac{1}{4} + \frac{1}{4n}\sum_{i=1}^{n} \|x_i - x_i'\|^4. \tag{58} $$

Property 1: Given any direction $\Delta \in \mathbb{S}^{(m+1)\times(m+1)}$, with the logarithmic loss function, J(X) satisfies

$$ \langle \nabla^2 J(X)\Delta,\, \Delta \rangle \le L \|\Delta\|^2 \tag{59} $$

where $L = 1/4 + (1/4n)\sum_{i=1}^{n}\|x_i - x_i'\|^4$.

Proof: Letting $A_i = \operatorname{diag}\big({-(x_i - x_i')(x_i - x_i')^T},\, 1\big)$, then, with the logarithmic loss, we have

$$ \nabla J(X) = \frac{1}{n}\sum_{i=1}^{n} \frac{-s_i A_i}{1 + e^{s_i \langle A_i, X\rangle}}. \tag{60} $$

Define the function $\phi(\epsilon) = \langle \nabla J(X + \epsilon\Delta),\, \Delta\rangle$ with $\epsilon > 0$ [46]; then we have

$$ \phi(\epsilon) - \phi(0) = \langle \nabla J(X + \epsilon\Delta) - \nabla J(X),\, \Delta\rangle = \frac{1}{n}\sum_{i=1}^{n}\left\langle \frac{-s_i A_i}{1 + e^{s_i\langle A_i, X + \epsilon\Delta\rangle}} - \frac{-s_i A_i}{1 + e^{s_i\langle A_i, X\rangle}},\, \Delta \right\rangle = \frac{1}{n}\sum_{i=1}^{n}\left\langle \frac{-s_i A_i\,\big(1 - e^{s_i\epsilon\langle A_i, \Delta\rangle}\big)}{\big(1 + e^{s_i\langle A_i, X + \epsilon\Delta\rangle}\big)\big(1 + e^{-s_i\langle A_i, X\rangle}\big)},\, \Delta \right\rangle. \tag{61} $$

Therefore

$$ \langle \nabla^2 J(X)\Delta, \Delta\rangle = \phi'(0) = \lim_{\epsilon\to 0}\frac{\phi(\epsilon) - \phi(0)}{\epsilon} = \frac{1}{n}\sum_{i=1}^{n}\frac{s_i^2 \langle A_i, \Delta\rangle^2}{\big(1 + e^{s_i\langle A_i, X\rangle}\big)\big(1 + e^{-s_i\langle A_i, X\rangle}\big)} \le \frac{1}{4n}\sum_{i=1}^{n} s_i^2\langle A_i, \Delta\rangle^2 \le \frac{1}{4n}\sum_{i=1}^{n}\|A_i\|^2\,\|\Delta\|^2 = \left(\frac{1}{4} + \frac{1}{4n}\sum_{i=1}^{n}\|x_i - x_i'\|^4\right)\|\Delta\|^2. \tag{62} $$

This completes the proof.

Thus, Nesterov's method is readily applicable by using the gradients calculated by (56) and (57). We briefly term the constrained ERM for DML with the logarithmic loss as logarithmic loss-based metric learning (LLML).
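A minimal NumPy sketch of the LLML objective (55) and its gradients (56) and (57), written directly from the formulas above; the vectorized layout and variable names are our own choices, and no claim is made that this mirrors the authors' code.

```python
import numpy as np

def llml_objective_and_grads(M, c, diffs, labels):
    """diffs[i] = x_i - x'_i (shape (n, m)); labels s_i in {+1, -1}.
    Returns J(X) in (55), the gradient (56) with respect to M, and (57) with respect to c."""
    dists = np.einsum('ij,jk,ik->i', diffs, M, diffs)      # d_M(x_i, x'_i)
    u = labels * (c - dists)                               # u_i = s_i (c - d_M)
    J = float(np.mean(np.logaddexp(0.0, -u)))              # (55)
    w = labels * 0.5 * (1.0 - np.tanh(0.5 * u))            # s_i / (1 + e^{u_i}), stable form
    outer = np.einsum('ij,ik->ijk', diffs, diffs)          # (x_i - x'_i)(x_i - x'_i)^T
    grad_M = (w[:, None, None] * outer).mean(axis=0)       # (56)
    grad_c = float(np.mean(-w))                            # (57)
    return J, grad_M, grad_c
```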

2) Smoothed Hinge Loss Function: The smoothed hinge loss is defined by [47]

$$ \ell_{\mathrm{sh}}(u) = \begin{cases} \tfrac{1}{2} - u, & u \le 0 \\ \tfrac{1}{2}(1 - u)^2, & 0 < u < 1 \\ 0, & u \ge 1. \end{cases} \tag{63} $$

It is convex, and its derivative is given by

$$ \nabla \ell_{\mathrm{sh}}(u) = \begin{cases} -1, & u \le 0 \\ u - 1, & 0 < u < 1 \\ 0, & u \ge 1. \end{cases} \tag{64} $$

By using the smoothed hinge loss, J(X) becomes

$$ J(X) = \frac{1}{n}\sum_{i=1}^{n} \ell_{\mathrm{sh}}\big(s_i(c - (x_i - x_i')^T M (x_i - x_i'))\big) \tag{65} $$

and its gradients with respect to M and c are

$$ \nabla J_M(X) = \frac{1}{n}\sum_{i=1}^{n} -s_i\, \nabla\ell_{\mathrm{sh}}\big(s_i(c - (x_i - x_i')^T M (x_i - x_i'))\big)\,(x_i - x_i')(x_i - x_i')^T \tag{66} $$

and

$$ \nabla J_c(X) = \frac{1}{n}\sum_{i=1}^{n} s_i\, \nabla\ell_{\mathrm{sh}}\big(s_i(c - (x_i - x_i')^T M (x_i - x_i'))\big). \tag{67} $$

The following Property 2 shows that J(X) also has a Lipschitz gradient and that the Lipschitz constant is upper bounded by

$$ L = 1 + \frac{1}{n}\sum_{i=1}^{n}\|x_i - x_i'\|^4. \tag{68} $$

Property 2: Given any direction $\Delta \in \mathbb{S}^{(m+1)\times(m+1)}$, with the smoothed hinge loss function, the empirical risk J(X) satisfies

$$ \langle \nabla^2 J(X)\Delta,\, \Delta\rangle \le L\|\Delta\|^2 \tag{69} $$

where $L = 1 + (1/n)\sum_{i=1}^{n}\|x_i - x_i'\|^4$.

Proof: With the same setting as in the proof of Property 1, we have for the smoothed hinge loss

$$ \phi(\epsilon) - \phi(0) = \langle \nabla J(X + \epsilon\Delta) - \nabla J(X),\, \Delta\rangle = \frac{1}{n}\sum_{i=1}^{n}\Big\langle \big(\nabla\ell_{\mathrm{sh}}(s_i\langle A_i, X + \epsilon\Delta\rangle) - \nabla\ell_{\mathrm{sh}}(s_i\langle A_i, X\rangle)\big)\, s_i A_i,\ \Delta\Big\rangle. \tag{70} $$

And thus

$$ \langle\nabla^2 J(X)\Delta, \Delta\rangle = \lim_{\epsilon\to 0}\frac{\phi(\epsilon)-\phi(0)}{\epsilon} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2\ell_{\mathrm{sh}}(s_i\langle A_i, X\rangle)\, s_i^2 \langle A_i, \Delta\rangle^2 \le \frac{1}{n}\sum_{i=1}^{n}\|A_i\|^2\|\Delta\|^2 = \left(1 + \frac{1}{n}\sum_{i=1}^{n}\|x_i - x_i'\|^4\right)\|\Delta\|^2. \tag{71} $$

This completes the proof.

Thus, Nesterov's method is readily applicable by using the gradients calculated by (66) and (67). We briefly term the constrained ERM for DML with the smoothed hinge loss as smoothed hinge loss-based metric learning (sHLML).

3) Remarks: We regard LLML (with the logarithmic loss) and sHLML (with the smoothed hinge loss) as two example algorithms of the proposed DML framework. Other loss functions, such as the exponential loss and the squared error loss, can also be used. For example, if the similarity/dissimilarity between instances is real-valued, the proposed framework is applicable by using the squared error loss. In general, sHLML is superior to LLML. This is similar to the fact that the hinge loss is generally better than the logarithmic loss for classification, although the opposite case surely exists due to the "no free lunch" theorem [48]. As shown later, our experiments confirm the superiority of sHLML, i.e., on four out of the six UCI datasets and on the MIT outdoor scene image dataset, sHLML outperforms LLML. Here, we present both as example algorithms in order to show the generality of the proposed framework.
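For completeness, here is a short sketch of the smoothed hinge loss (63) and its derivative (64) used by sHLML; the piecewise form follows [47] as quoted above, and the vectorized NumPy expression is our own. The gradients (66) and (67) then follow by the chain rule, exactly as in the LLML sketch given earlier.

```python
import numpy as np

def smoothed_hinge(u):
    """Smoothed hinge loss (63): linear for u <= 0, quadratic on (0, 1), zero for u >= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1, 0.0, np.where(u <= 0, 0.5 - u, 0.5 * (1.0 - u) ** 2))

def smoothed_hinge_grad(u):
    """Derivative (64): -1 for u <= 0, u - 1 on (0, 1), 0 for u >= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 1, 0.0, np.where(u <= 0, -1.0, u - 1.0))
```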

C. Evaluation of the Optimal Convergence Rate

We empirically verify the convergence rate of the optimal first-order method by comparing it with the standard gradient descent method. Taking the "Soybean" dataset as an example, we plot the objective function value during a training procedure (refer to Section V for the experimental setting) in Fig. 2. It can be seen that, for both the logarithmic loss and the smoothed hinge loss, the optimal first-order method converges much faster than the standard gradient descent method.

Fig. 2. Empirical study on the convergence of the optimal first-order method. (Objective function value versus iteration steps for the logarithmic loss and the smoothed hinge loss, comparing the optimal first-order method with the standard gradient descent method.)
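As a complement to the pseudo code of Algorithm 1 in Section IV-A, the following runnable sketch performs the same alternating (Dykstra) projection of an eigenvalue vector onto the set {z : 0 ≤ z ≤ 1, eᵀz ≤ ρ} that solves (47) and (53); the iteration cap, tolerance, and stopping criterion are our own assumptions.

```python
import numpy as np

def dykstra_projection(lam, rho, max_iter=1000, tol=1e-9):
    """Dykstra's alternating projection (Algorithm 1): project the eigenvalue vector `lam`
    onto {z : 0 <= z <= 1, e^T z <= rho}, as required in (47) and (53)."""
    m = lam.size
    u2 = lam.astype(float).copy()
    v1 = np.zeros(m)
    v2 = np.zeros(m)
    for _ in range(max_iter):
        z = u2 + v1
        u1 = np.clip(z, 0.0, 1.0)             # projection onto the box [0, 1]^m
        v1 = z - u1
        z = u1 + v2
        u2 = z - max(0.0, z.sum() - rho) / m  # projection onto the half-space e^T z <= rho
        v2 = z - u2
        if np.linalg.norm(u1 - u2) < tol:     # assumed convergence test: both projections agree
            break
    return u2
```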


Fig. 3. Performance evaluation on classification by using the NN classifier on six datasets from the UCI machine learning repository. Each method is represented by a box with whiskers. The box has lines at the lower quartile, median, and upper quartile values of the classification error rates on 20 independent experiments, and the error rate outside 1.5 times the interquartile range from the ends of the box is regarded as a whisker.

TABLE I
DATASETS FROM THE UCI MACHINE LEARNING REPOSITORY

Dataset        No. of Classes   No. of Dimensions   No. of Instances
BalanceScale   3                4                   625
BreastCancer   2                9                   699
Ecoli          8                7                   336
Flare          5                10                  1389
Iris           3                4                   150
Soybean        4                35                  47

V. EXPERIMENTS

In this section, we present empirical evaluations of the proposed constrained ERM-based DML framework by conducting two groups of experiments. The first is DML for data classification, which is commonly used to evaluate DML algorithms. The second evaluates different DML algorithms on a scene image retrieval task. We compare the proposed LLML and sHLML with several popular DML algorithms, including Xing's DML algorithm (Xing) [7], NCA [27], LMNN [29], and the regularized metric learning algorithm (RML) [30]. The parameters of the different algorithms are determined by 10-fold cross validation on the training set.

A. Classification on UCI Machine Learning Repository

Six datasets from the UCI Machine Learning Repository [49], shown in Table I, are used in this experiment, namely, "BalanceScale," "BreastCancer," "Ecoli," "Flare," "Iris," and "Soybean," some of which are commonly used as benchmarks for DML [7], [22], [29]. For each dataset, we randomly select 80% (on "BalanceScale," "BreastCancer," "Ecoli," and "Flare") or 40% ("Iris" and "Soybean")⁴ of the data for training and use the rest for testing. The datasets contain instances with the corresponding class labels, so we randomly select a training set composed of a portion (80%) of all similar and dissimilar instance pairs. The random selection reduces the number of instance pairs and thus reduces the time cost of training the different DML algorithms. Besides, the instance pairs generated from the original training set can be correlated, and the random selection helps in obtaining "roughly" independent instance pairs for learning a distance metric.

⁴When using 80% of the data for training, the misclassification rate (including that of the ED metric) on "Iris" and "Soybean" is nearly zero, and thus we use 40% of the data for training.

Fig. 3 compares the different algorithms, in which classification is conducted by using the nearest neighbor (NN) classifier with the learned distance metric, and the boxplots of the misclassification rate are obtained from 20 independent random trials. The ED metric is also used as a baseline for performance comparison. Based on this figure, we have the following observations. 1) On four out of the six datasets, i.e., "BalanceScale," "BreastCancer," "Iris," and "Soybean," the distance metrics learned by most algorithms outperform the ED, which confirms the effectiveness of DML. 2) On "Ecoli" and "Flare," all DML algorithms show performance comparable to the ED metric, which implies that DML may not be favorable on these two datasets. 3) Compared with the other algorithms, our LLML and sHLML have competitive performance on the four datasets where the learned distance metrics outperform the ED metric. 4) LLML and sHLML show significant improvements on "BalanceScale" and "Iris" compared with DML, NCA, and LMNN.
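Returning to the construction of the training pairs described at the beginning of this subsection, a minimal sketch of the random pair selection is given below; the triangular-index enumeration, the fixed random seed, and the 80% keep rate are illustrative assumptions.

```python
import numpy as np

def sample_training_pairs(X, y, fraction=0.8, seed=0):
    """Randomly keep a fraction of all similar (same-class) and dissimilar (different-class)
    instance pairs, following the pair-selection procedure described above."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx_i, idx_j = np.triu_indices(n, k=1)            # all unordered pairs (i, j), i < j
    labels = np.where(y[idx_i] == y[idx_j], 1, -1)    # s = +1 for similar, -1 for dissimilar
    keep = rng.random(idx_i.size) < fraction          # random subsampling of the pairs
    return X[idx_i[keep]], X[idx_j[keep]], labels[keep]
```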


Fig. 4. Performance evaluation on classification by using the RBF-kernel SVM on six datasets from the UCI machine learning repository. Each method is represented by a box with whiskers. The box has lines at the lower quartile, median, and upper quartile values of the classification error rates on 20 independent experiments, and the error rate outside 1.5 times the interquartile range from the ends of the box is regarded as a whisker.

Fig. 5. Performance evaluation on the scene image retrieval experiment on the MIT outdoor scene image dataset. (Precision–recall curves and F-1 score versus the number of returned images, for 5%, 10%, and 30% training data; the compared metrics are ED, NCA, LMNN, RML, LLML, and sHLML.)

We further study how DML improves the classification performance of SVM. The RBF-kernel SVM is a powerful classifier for nonlinear classification problems. Although it is simple to use the ED metric to calculate the kernel function in SVM, it has been widely acknowledged that the ED metric is not optimal for measuring the distance between samples. Thus, we replace the ED metric with the empirically learned distance metrics obtained by the different algorithms. The parameters of SVM are selected by 10-fold cross validation on the training set. Fig. 4 compares the different distance metrics for SVM-based classification. The experimental results suggest that: 1) the empirically learned distance metrics outperform the ED metric on at least three datasets, namely "BalanceScale," "BreastCancer," and "SoyBean"; and 2) the proposed LLML and sHLML outperform most competitors on these datasets. Besides, compared with Fig. 3, one can see that on most of the six datasets the RBF-kernel SVM shows better classification performance than NN. We attribute this to the good generalization ability of large-margin-based classification.
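A hedged sketch of the kernel substitution used above: the ED inside the RBF kernel is replaced by the learned Mahalanobis distance, and the resulting Gram matrix can be fed to any SVM solver that accepts precomputed kernels; the kernel width gamma and the factorization route are our own choices, not details from the paper.

```python
import numpy as np

def mahalanobis_rbf_kernel(X1, X2, M, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * d_M(x, z)) with a learned metric M in place of the
    Euclidean distance (gamma is a hypothetical width, e.g., chosen by cross validation)."""
    # Factor M = L L^T (M is positive semidefinite), so that d_M(x, z) = ||L^T (x - z)||^2.
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    L = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    A, B = X1 @ L, X2 @ L
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))
```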


B. Scene Image Retrieval

In this experiment, we apply DML algorithms to a scene image retrieval task on the MIT outdoor scene image dataset [50], which contains 2688 images from eight categories, namely, "Coast," "Forest," "Highway," "Inside City," "Mountain," "Open Country," "Street," and "Tall Building." For image representation, we use the 512-D GIST descriptors [50]. We further reduce the dimensionality to 100 by using principal component analysis (PCA), where 99.99% of the energy is preserved. Since the discarded energy is limited, the influence of the PCA-based preprocessing on the retrieval performance can be ignored. We randomly select 5%, 10%, and 30% of the images as the training data for learning a distance metric. The rest of the images compose the "database" for retrieval, and we repeatedly select each image in it as the query and calculate the average retrieval performance. Fig. 5 shows the precision–recall curves, where six distance metrics, including the Euclidean one, are reported.⁵ When fixing the recall at 20%, the precision improvements obtained by using sHLML's distance metric with respect to the ED metric are about 12%, 15%, and 16%, given 5%, 10%, and 30% training data, respectively. Besides, it can be observed that, even with limited (5%) training data, LLML and sHLML offer appealing improvements over the other DML algorithms. We further evaluate the retrieval performance by using the F-1 score, which is a collective measure of recall and precision, and the corresponding results are also shown in Fig. 5. Again, one can see the superiority of the proposed algorithms.

⁵The results of the DML algorithm [7] are omitted because the learned distance metric is nearly an ED metric.

VI. CONCLUSION

In this paper, we proposed a new learning framework for DML, which formulates DML as a constrained ERM problem. Theoretically, we proved a generalization bound for the new framework. Algorithmically, we derived a first-order method with the optimal convergence rate based on Nesterov's method, and provided two example algorithms that use the logarithmic loss and the smoothed hinge loss, respectively. Experiments on data classification and scene image retrieval show that the proposed framework with the example algorithms has competitive performance compared with several representative DML algorithms, including Xing's DML algorithm (Xing) [7], NCA [27], LMNN [29], and RML [30].

The proposed constrained ERM for DML requires that the empirical risk has a Lipschitz-continuous gradient. This does not hold for particular loss functions, such as the hinge loss. Thus, it is valuable to further extend the new DML framework to deal with nonsmooth loss functions. In addition, we use cross validation to select the tradeoff parameter that controls the hypothesis complexity, so it would be useful to find an automatic model selection method to determine the value of this tradeoff parameter.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers and the handling editor for their constructive comments and suggestions.

R EFERENCES [1] M. S. Baghshah and S. B. Shouraki, “Semi-supervised metric learning using pairwise constraints,” in Proc. Int. Joint Conf. Artif. Intell., Pasadena, CA, 2009, pp. 1217–1222. [2] D.-Y. Yeung and H. Chang, “A kernel approach for semisupervised metric learning,” IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 141– 149, Jan. 2007. [3] C. Shen, J. Kim, and L. Wang, “Scalable large-margin Mahalanobis distance metric learning,” IEEE Trans. Neural Netw., vol. 21, no. 9, pp. 1524–1530, Sep. 2010. [4] E. Hu, S. Chen, D. Zhang, and X. Yin, “Semisupervised kernel matrix learning by kernel propagation,” IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1831–1841, Nov. 2010. [5] D. Wang, D. Yeung, and E. Tsang, “Weighted Mahalanobis distance kernels for support vector machines,” IEEE Trans. Neural Netw., vol. 18, no. 5, pp. 1453–1462, Sep. 2007. [6] X. Liang and Z. Ni, “Hyperellipsoidal statistical classifications in a reproducing kernel Hilbert space,” IEEE Trans. Neural Netw., vol. 22, no. 6, pp. 968–975, Jun. 2011. [7] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 505–512. [8] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Washington, DC, Jun. 2005, pp. 539–546. [9] G. Lebanon, “Metric learning for text documents,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 497–508, Apr. 2006. [10] A. Frome, F. Sha, Y. Singer, and J. Malik, “Learning globally-consistent local distance functions for shape-based image retrieval and classification,” in Proc. IEEE 11th Int. Conf. Comput. Vis., Oct. 2007, pp. 1–8. [11] A. Frome, Y. Singer, and J. Malik, “Image retrieval and classification using local distance functions,” in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 417–424. [12] T. Hertz, A. Bar-Hillel, and D. Weinshall, “Learning distance functions for image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, Jul. 2004, pp. 570–577. [13] M. Slaney, K. Q. Weinberger, and W. White, “Learning a metric for music similarity,” in Proc. Int. Conf. Music Inf. Retr., Philadelphia, PA, 2008, pp. 313–318. [14] C. D. Manning, P. Raghavan, and H. Schtze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008. [15] L. Yang, R. Jin, L. Mummert, R. Sukthankar, A. Goode, B. Zheng, S. C. Hoi, and M. Satyanarayanan, “A boosting framework for visualitypreserving distance metric learning and its application to medical image retrieval,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 30–44, Jan. 2010. [16] A. Frome, Y. Singer, and J. Malik, “Image retrieval and classification using local distance functions,” in Proc. Adv. Neural Inf. Process. Syst., 2007, pp. 417–424. [17] P. Jain, B. Kulis, and K. Grauman, “Fast image search for learned metrics,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8. [18] D. Tran and A. Sorokin, “Human activity recognition with metric learning,” in Proc. 10th Eur. Conf. Comput. Vis., 2008, pp. 548–561. [19] L. Yang and R. Jin, “Distance metric learning: A comprehensive survey,” Dept. Comput. Sci. Eng., Michigan State Univ., East Lansing, Tech. Rep., May 2006. [20] A. Globerson and S. Roweis, “Metric learning by collapsing classes,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2005, pp. 1–8. 
[21] M. Schultz and T. Joachims, “Learning a distance metric from relative comparisons,” in Proc. Adv. Neural Inf. Process. Syst., 2004, pp. 1–8. [22] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Informationtheoretic metric learning,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216. [23] S. Wang and R. Jin, “An information geometry approach for distance metric learning,” in Proc. 12th Int. Conf. Artif. Intell. Stat., 2009, pp. 591–598. [24] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, “On kerneltarget alignment,” in Proc. Adv. Neural Inf. Process. Syst. 14, 2002, pp. 367–373. [25] J. T. Kwok and I. W. Tsang, “Learning with idealized kernels,” in Proc. Int. Conf. Mach. Learn., Aug. 2003, pp. 400–407.


[26] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, “Learning a Mahalanobis metric from equivalence constraints,” J. Mach. Learn. Res., vol. 6, pp. 937–965, Dec. 2005. [27] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighbourhood components analysis,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2004, pp. 513–520. [28] L. Yang, R. Sukthankar, and R. Jin, “Bayesian active distance metric learning,” in Proc. 23rd Conf. Uncertainty Artif. Intell., Vancouver, BC, Canada, 2007, pp. 1–8. [29] K. Q. Weinberger, J. Blitzer, and L. K. Saul, “Distance netric learning for large margin nearest neighbor classification,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2005, pp. 1473–1480. [30] R. Jin, S. Wang, and Y. Zhou, “Regularized distance metric learning: Theory and algorithm,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, 2010, pp. 1–9. [31] M. Loog, R. P. W. Duin, and R. Haeb-Umbach, “Multiclass linear dimension reduction by weighted pairwise fisher criteria,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 7, pp. 762–766, Jul. 2001. [32] M. Loog and R. P. W. Duin, “Linear dimensionality reduction via a heteroscedastic extension of LDA: The chernoff criterion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 6, pp. 732–739, Jun. 2004. [33] O. C. Hamsici and A. M. Martinez, “Bayes optimality in linear discriminant analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 647–657, Apr. 2008. [34] D. Tao, X. Li, X. Wu, and S. J. Maybank, “Geometric mean for subspace selection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 260–274, Feb. 2009. [35] W. Bian and D. Tao, “Biased discriminant euclidean embedding for content-based image retrieval,” IEEE Trans. Image Process., vol. 19, no. 2, pp. 545–554, Feb. 2010. [36] W. Bian and D. Tao, “Max-min distance analysis by using sequential SDP relaxation for dimension reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 1037–1050, May 2011. [37] W. Bian and D. Tao, “Learning a distance metric by empirical loss minimization,” in Proc. Int. Joint Conf. Artif. Intell., Barcelona, Spain, 2011, pp. 1186–1191. [38] J. Dattorro, Convex Optimization & Euclidean Distance Geometry. Palo Alto, CA: Meboo Publishing, 2011. [39] A. Maurer, “Generalization bounds for subspace selection and hyperbolic PCA,” in Proc. Subspace Latent Struct. Feature Select. Stat. Optim. Perspect. Workshop, Bohinj, Slovenia, 2006, pp. 185–197. [40] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, Mar. 2002. [41] F. Cucker and S. Smale, “On the mathematical foundations of learning,” Bull. Amer. Math. Soc., vol. 39, no. 1, pp. 1–49, 2002. [42] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization (Discrete Mathematics). New York: Wiley, 1983. [43] N. Guan, D. Tao, Z. Luo, and B. Yuan, “NeNMF: An optimal gradient method for solving non-negative matrix factorization and its variants,” IEEE Trans. Signal Process., vol. 60, no. 6, pp. 2882–2898, Jun. 2012.


[44] N. Guan, D. Tao, Z. Luo, and B. Yuan, “Non-negative patch alignment framework,” IEEE Trans. Neural Netw., vol. 22, no. 8, pp. 1218–1230, Aug. 2011. [45] N. Guan, D. Tao, Z. Luo, and B. Yuan, “Online non-negative matrix factorization with robust stochastic approximation,” IEEE Trans. Neural Netw., DOI: 10.1109/TNNLS.2012.2197827, 2012. [46] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Boston, MA: Kluwer, 2004. [47] J. D. M. Rennie. (2005, Feb.). Smooth Hinge Classification [Online]. Available: http://people.csail.mit.edu/jrennie/writing [48] D. Wolpert, “The lack of a priori distinctions between learning algorithms,” Neural Comput., vol. 8, no. 7, pp. 1341–1390, Oct. 1996. [49] A. Asuncion and D. Newman, “UCI machine learning repository,” School Inf. Comput. Sci., Univ. California, Irvine, Tech. Rep., 2007. [50] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, pp. 145–175, May 2001.

Wei Bian received the B.Eng. degree in electronic engineering, the B.Sc. degree in applied mathematics in 2005 and the M.Eng. degree in electronic engineering in 2007, all from the Harbin Institute of Technology, Harbin, China. He is currently pursuing the Ph.D. degree with the University of Technology, Sydney, Australia. His current research interests include pattern recognition and machine learning, such as dimension reduction and feature selection.

Dacheng Tao (M’07–SM’12) is a Professor of computer science with the Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia. He has authored or co-authored more than 100 scientific articles in top journals, including the IEEE T RANS ACTIONS ON PATTERN A NALYSIS AND M ACHINE I NTELLIGENCE, the IEEE T RANSACTIONS ON N EURAL N ETWORKS A ND L EARNING S YSTEMS , the IEEE T RANSACTIONS ON I MAGE P ROCESS ING , AISTATS, ICDM, CVPR, and ECCV. His current research interests include statistics and mathematics for data analysis problems in data mining, computer vision, machine learning, multimedia, and video surveillance. He received the Best Theory/Algorithm Paper Runner-Up Award from the IEEE Industrial Conference on Data Mining in 2007.
