A Maximum Margin Approach for Semisupervised Ordinal Regression Clustering

Yanshan Xiao, Bo Liu, and Zhifeng Hao

Abstract— Ordinal regression (OR) is generally defined as the task where the input samples are ranked on an ordinal scale. OR has found a wide variety of applications, and a great deal of work has been done on it. However, most of the existing work focuses on supervised/semisupervised OR classification, and the semisupervised OR clustering problem has not been explicitly addressed. In real-world OR applications, labeling a large number of training samples is usually time-consuming and costly; instead, a set of unlabeled samples can be utilized to set up the OR model. Moreover, although the sample labels are unavailable, we can sometimes obtain the relative ranking information of the unlabeled samples. This sample ranking information can be utilized to refine the OR model. Hence, how to build an OR model on the unlabeled samples and incorporate the sample ranking information into the process of improving the clustering accuracy remains a key challenge for OR applications. In this paper, we consider the semisupervised OR clustering problem with sample-ranking constraints, which give the relative ranking information of the unlabeled samples, and put forward a maximum margin approach for semisupervised OR clustering (M2 SORC). On one hand, M2 SORC seeks a set of parallel hyperplanes to partition the unlabeled samples into clusters. On the other hand, a loss function is put forward to incorporate the sample ranking information into the clustering process. As a result, the optimization function of M2 SORC is formulated to maximize the margins of the closest neighboring clusters and meanwhile minimize the loss associated with the sample-ranking constraints. Extensive experiments on OR data sets show that the proposed M2 SORC method outperforms the traditional semisupervised clustering methods considered.

Index Terms— Ordinal regression (OR), semisupervised clustering.
I. INTRODUCTION

Ordinal regression (OR) [1]–[3] refers to the learning paradigm where the input samples are ranked on an ordinal scale. Taking book rating as an example,
the readers grade a set of books into one of the following five ratings: 1) very bad; 2) bad; 3) not so bad; 4) good; and 5) very good, based on their preferences. A book with the bad rating is better than one with the very bad rating, and the very good rating is better than all the other ratings. It is seen that the ratings have a natural order and represent some ranking information, which distinguishes OR from traditional multiclass learning problems [4], [5]. In recent years, a lot of work [1], [2], [6]–[8] has been done to tackle OR problems. However, most of the existing work focuses on OR classification, and the semisupervised OR clustering problem has not been explicitly addressed. In this paper, we address the semisupervised OR clustering problem with sample-ranking constraints, where the OR model is built on a set of unlabeled samples, and the relative ranking information of the unlabeled samples is incorporated to improve it. Semisupervised OR clustering with sample-ranking constraints finds a wide variety of real-world applications. For example, in book rating, a great amount of unlabeled book review data is available from the Internet, and labeling this review data is usually expensive and time-consuming, since it requires substantial human effort. Moreover, another important observation is that although the sample labels are unknown, we can sometimes obtain the relative ranking of the unlabeled samples, and this sample ranking information can be included to boost the clustering performance. In book rating, readers may believe that book A is better than book B, which means that book A may have a higher rating than book B, namely, book A should be in a higher ranked cluster than book B. Although the readers do not give the exact ratings (labels), this sample ranking information, i.e., book A being in a higher ranked cluster than book B, can be used to improve the clustering accuracy. A more accurate book rating model can be constructed on the unlabeled book data by requiring book A to be in a higher ranked cluster than book B. Hence, how to build an unsupervised OR model by incorporating the sample ranking information remains a key challenge for OR applications. To deal with this problem, we propose a maximum margin approach for semisupervised OR clustering, termed M2 SORC. In M2 SORC, a set of parallel hyperplanes is utilized to separate the unlabeled samples into clusters, and the margins of the closest neighboring clusters are maximized. Moreover, a loss function associated with the sample-ranking constraints is introduced. Since the optimization function of M2 SORC is nonconvex, the constrained concave–convex procedure (CCCP) is applied to decompose the learning problem into a series of convex subproblems.


The main contributions of our method M2 SORC can be viewed from the following aspects.

1) A semisupervised clustering paradigm of OR involving the unlabeled samples and the sample-ranking constraints is introduced. To the best of our knowledge, this paper serves as the first attempt to address the OR problem in a semisupervised clustering setting.

2) Compared with the traditional semisupervised clustering methods with must-link and cannot-link constraints, our method is able to incorporate the sample-ranking constraints into the process of refining the clustering boundary, which makes our method closer to real-world OR applications.

3) We present a maximum-margin-based semisupervised OR clustering model. Distinguished from the traditional semisupervised maximum margin clustering methods, where the hyperplanes usually intersect each other, our method utilizes a set of parallel hyperplanes to separate the data, such that the parallel hyperplanes can be ordered to represent the ranking information of the samples.

4) Substantial experiments on benchmark and real-world OR data sets show that our method obtains markedly better clustering accuracy than the traditional semisupervised clustering methods considered.

The rest of this paper is organized as follows. Section II reviews the related work. The details of M2 SORC are presented in Section III. Experiments are reported in Section IV. Section V concludes this paper and outlines future work.

II. RELATED WORK ON ORDINAL REGRESSION

OR is an important research area in machine learning and data mining. To date, it has been widely applied to facial recognition [9], information retrieval [10], music recommendation [11], and gene expression analysis [7]. According to the availability of training labels, the existing OR work can be broadly classified into two categories: 1) supervised OR classification approaches, where all sample labels are known to train the classifier, and 2) semisupervised OR classification approaches, where the classifier is learnt on a few labeled samples and a large number of unlabeled samples. Many approaches [1], [2], [6]–[8], [12], [13] have been proposed for supervised OR classification. Herbrich et al. [1] extend the support vector machine (SVM) to OR and propose a threshold model in which the ordinal classes are separated by a set of hyperplanes. Shashua and Levin [8] introduce two SVM-based OR approaches: 1) FM-SVOR and 2) SM-SVOR. FM-SVOR is based on the fixed-margin policy, where the margins of the closest neighboring classes are the same and are maximized in the objective function. SM-SVOR is based on the sum-of-margins policy, where the margins of the closest neighboring classes can be different, and the sum of the margins is maximized. Chu and Keerthi [2], [6] present two support vector approaches for OR by enforcing explicit and implicit constraints on the order of the hyperplanes. Besides SVM, other learning techniques [14]–[16], such as Gaussian processes [17], neural networks [18], and

ensemble learning [19]–[21], are also extended to solve the OR classification problems. Some other approaches [22], [23] have been presented for semisupervised OR classification. Srijith et al. [23] design a Gaussian process approach for semisupervised OR classification by employing the expectation-propagation approximation idea. Liu et al. [22] propose a semisupervised OR classification method by modeling the manifold information in the objective function. Moreover, Seah et al. [24] put forward an OR approach for transductive learning, in which a label swapping scheme that facilitates a monotonic decrease in the objective function is introduced.

In contrast with supervised and semisupervised OR classification, there is very little work on unsupervised OR clustering. The only study we know of on unsupervised OR clustering is WSVORC [25], which proposes a weighted support vector approach for OR clustering. Similar to least squares SVM, each sample is required to cluster around the nearest hyperplane, and a least-squares-SVM-like formulation is presented. Nevertheless, WSVORC has limitations. On the one hand, it does not take the cluster size balance into account, which may lead to the trivially optimal solution where all the samples are assigned to one cluster or form very small clusters. On the other hand, it is not capable of incorporating the sample ranking information into the process of refining the clustering model.

In summary, most of the existing OR methods are proposed for supervised classification, semisupervised classification, and unsupervised clustering, and the semisupervised OR clustering problem has not been explicitly addressed. In real-world applications, labeling a large number of samples is usually expensive and time-consuming. Moreover, the sample ranking information can sometimes be obtained, and this information can be utilized to improve the clustering accuracy. In this paper, we propose a semisupervised OR clustering method that builds a maximum margin OR clustering model on the unlabeled samples by incorporating the sample ranking information.

III. PROPOSED APPROACH

A. Notations and Preliminaries

Suppose that the data set has $n_u$ unlabeled samples $\{x_i\}_{i=1}^{n_u}$ and $n_c$ sample-ranking constraints $\{x_{i1} \succ x_{i2}\}_{i=1}^{n_c}$, where $\succ$ denotes that the cluster of $x_{i1}$ should be ranked higher than that of $x_{i2}$. The goal of M2 SORC is to divide the data into $k$ clusters such that the margins of the closest neighboring clusters are maximized and $x_{i1}$ is in a higher ranked cluster than $x_{i2}$. In order to separate the $n_u$ unlabeled samples into $k$ clusters, M2 SORC builds $k+1$ parallel hyperplanes $(w, b_0), (w, b_1), \ldots, (w, b_k)$, where $b_0$ is set to a very small value (e.g., $-\infty$) and $b_k$ to a very large value (e.g., $+\infty$), as done in [1] and [2]; we only need to solve for $w, b_1, b_2, \ldots, b_{k-1}$. The $k+1$ parallel hyperplanes divide the samples into $k$ clusters, and the margins of the closest neighboring clusters are maximized. Similar to FM-SVOR [1], the fixed-margin strategy is adopted, namely, the margins of different pairs of closest neighboring clusters are fixed to be the same.

Take Fig. 1 as an example to illustrate the OR model. In Fig. 1, ◦ denotes an unlabeled sample, and the solid lines

Fig. 1. Example of the OR model.

stand for the hyperplanes we need to build. On one hand, M2 SORC seeks four parallel hyperplanes—(w, b0 ), (w, b1 ), (w, b2 ), and (w, b3 ) (seen as solid lines) to partition the unlabeled samples into three clusters. On the other hand, the margin between the first and second clusters is the same as that between the second and third clusters, being 2/||w||. This is because we employ the fix-margin strategy and the margins of different pairs of closest neighboring clusters are equivalent. To obtain the optimized hyperplanes, the learning problem can be given to maximize the margin 2/||w||. After the learning problem is solved, we can get hyperplanes (w, b0 ), . . . , (w, bk ). For an unlabeled sample x, let f i (x) = wT x − bi represent its decision value associated with the i th hyperplane (w, bi ). When wT x − bi−1 ≥ 0 and wT x−bi < 0 come true, x lies between hyperplanes (w, bi−1 ) and (w, bi ), and is assigned to cluster i (i = 1, . . . , k). B. Formulation In Fig. 1, a set of parallel hyperplanes is used to divide the samples into clusters. Compared with the traditional semisupervised clustering methods [26]–[28], which do not take the cluster order into account, and the learnt hyperplanes are usually disordered and intersecting, we use parallel hyperplanes to partition the samples, and the ranking information of the samples can be incorporated into the clustering model by imputing an order to the parallel hyperplanes. Theorem 1: For an unlabeled sample x, its cluster label is equivalent to Y (Y = arg maxi=1,...,k (((bi − bi−1 )/2||w||) − (|wT x − ((bi−1 + bi )/2)|)/||w||)). Proof: The cluster label of x can be determined by investigating the value of ((bi − bi−1 )/2||w||) − (|wT x − ((bi−1 + bi )/2)|)/||w||. Here, (|wT x − ((bi−1 + bi )/2)|)/||w|| is the distance from x to the middle hyperplane (w, ((bi−1 + bi )/2)), and ((bi − bi−1 )/2||w||) is the distance from the middle hyperplane (w, ((bi−1 + bi )/2)) to the bounded hyperplane of cluster i , i.e., (w, bi−1 ) or (w, bi ). For cluster Y which x belongs to, x lies inside cluster Y, namely, located between hyperplanes (w, bY −1 ) and (w, bY ). The distance from x to the middle hyperplane (w, ((bY −1 + bY )/2)), i.e., (|wT x − ((bY −1 + bY )/2)|)/||w||, is no larger than the distance from the middle hyperplane (w, ((bY −1 + bY )/2)) to the bounded hyperplane of cluster Y, i.e., ((bY − bY −1)/2||w||). Hence, it has ((bY − bY −1)/2||w||) − (|wT x − ((bY −1 + bY )/2)|)/||w|| ≥ 0.

Furthermore, for cluster $i$ ($i \ne Y$) which $x$ does not belong to, $x$ lies outside cluster $i$, namely, located outside the range of hyperplanes $(w, b_{i-1})$ and $(w, b_i)$. The distance from $x$ to the middle hyperplane $(w, (b_{i-1}+b_i)/2)$, i.e., $|w^T x-(b_{i-1}+b_i)/2|/\|w\|$, is larger than the distance from the middle hyperplane $(w, (b_{i-1}+b_i)/2)$ to the bounded hyperplane of cluster $i$, i.e., $(b_i-b_{i-1})/(2\|w\|)$. Therefore, it has $(b_i-b_{i-1})/(2\|w\|) - |w^T x-(b_{i-1}+b_i)/2|/\|w\| < 0$. From the above analysis, it is seen that the cluster which $x$ belongs to is the only one associated with a nonnegative value of $(b_i-b_{i-1})/(2\|w\|) - |w^T x-(b_{i-1}+b_i)/2|/\|w\|$, and thus the cluster label of $x$ is equal to $Y = \arg\max_{i=1,\ldots,k}\big(\frac{b_i-b_{i-1}}{2\|w\|} - \frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}\big)$.

Let us consider an example in Fig. 2, where cluster $i-1$ is bounded by hyperplanes $(w, b_{i-2})$ and $(w, b_{i-1})$, and hyperplane $(w, (b_{i-2}+b_{i-1})/2)$ is in the middle of cluster $i-1$; cluster $i$ is bounded by hyperplanes $(w, b_{i-1})$ and $(w, b_i)$, and hyperplane $(w, (b_{i-1}+b_i)/2)$ is in the middle of cluster $i$. For the sake of simplicity, only cluster $i-1$ and cluster $i$ are drawn, and the other clusters are omitted. In Fig. 2, the distance from $x_1$ to the middle hyperplane $(w, (b_{i-1}+b_i)/2)$ of cluster $i$ is $d_1 = |w^T x_1-(b_{i-1}+b_i)/2|/\|w\|$. The distance from the middle hyperplane $(w, (b_{i-1}+b_i)/2)$ to the bounded hyperplane of cluster $i$, i.e., $(w, b_{i-1})$ or $(w, b_i)$, is $(b_i-b_{i-1})/(2\|w\|)$. Based on the OR model in Section III-A, sample $x_1$ lies between hyperplanes $(w, b_{i-1})$ and $(w, b_i)$ and belongs to cluster $i$. Considering that $x_1$ is in the range of $(w, b_{i-1})$ and $(w, b_i)$, the distance from $x_1$ to the middle hyperplane of cluster $i$, i.e., $d_1$, is no larger than $(b_i-b_{i-1})/(2\|w\|)$. That is to say, $|w^T x_1-(b_{i-1}+b_i)/2|/\|w\| \le (b_i-b_{i-1})/(2\|w\|)$, and thus $(b_i-b_{i-1})/2 - |w^T x_1-(b_{i-1}+b_i)/2| \ge 0$ holds. Therefore, for cluster $i$ which $x_1$ belongs to, $x_1$ lies inside the range of $(w, b_{i-1})$ and $(w, b_i)$, and $(b_i-b_{i-1})/2 - |w^T x_1-(b_{i-1}+b_i)/2| \ge 0$ comes true. Moreover, the distance from $x_1$ to the middle hyperplane $(w, (b_{i-2}+b_{i-1})/2)$ of cluster $i-1$ is $d_2 = |w^T x_1-(b_{i-2}+b_{i-1})/2|/\|w\|$. The distance from the middle hyperplane $(w, (b_{i-2}+b_{i-1})/2)$ to the bounded hyperplane of cluster $i-1$, i.e., $(w, b_{i-1})$ or $(w, b_{i-2})$, is $(b_{i-1}-b_{i-2})/(2\|w\|)$. It is seen that $x_1$ is located outside the range of hyperplanes $(w, b_{i-2})$ and $(w, b_{i-1})$ and does not belong to cluster $i-1$. Since $x_1$ is outside the range of hyperplanes $(w, b_{i-2})$ and $(w, b_{i-1})$, the distance $d_2$ from $x_1$ to the middle hyperplane $(w, (b_{i-2}+b_{i-1})/2)$ of cluster $i-1$ is larger than $(b_{i-1}-b_{i-2})/(2\|w\|)$, namely, $|w^T x_1-(b_{i-2}+b_{i-1})/2|/\|w\| > (b_{i-1}-b_{i-2})/(2\|w\|)$. We can deduce $(b_{i-1}-b_{i-2})/2 - |w^T x_1-(b_{i-2}+b_{i-1})/2| < 0$. Hence, for cluster $i-1$ which $x_1$ does not belong to, $x_1$ is located outside the range of hyperplanes $(w, b_{i-2})$ and $(w, b_{i-1})$, and it has $(b_{i-1}-b_{i-2})/2 - |w^T x_1-(b_{i-2}+b_{i-1})/2| < 0$.

Based on Theorem 1, we can obtain the margin constraint for semisupervised OR clustering, as shown in Theorem 2.

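To make the assignment rule concrete, the following is a minimal NumPy sketch (not the authors' implementation) of the two equivalent ways to read off a sample's cluster from the parallel hyperplanes of Section III-A: via the interval test $w^T x \in [b_{i-1}, b_i)$ and via the arg max rule of Theorem 1. The data, $w$, and thresholds are made-up toy values, and $b_0$, $b_k$ are large finite stand-ins for $-\infty$ and $+\infty$ so the arithmetic stays well defined.

```python
import numpy as np

def assign_by_interval(X, w, b):
    """Assign sample x to cluster i when b_{i-1} <= w^T x < b_i.

    b holds thresholds b_0 < b_1 < ... < b_k; b_0 and b_k are very
    small/large values standing in for -inf/+inf (Section III-A)."""
    scores = X @ w                                   # decision values w^T x
    return np.searchsorted(b[1:-1], scores, side="right") + 1   # clusters 1..k

def assign_by_theorem1(X, w, b):
    """Arg max rule of Theorem 1:
    cluster = arg max_i (b_i - b_{i-1})/2 - |w^T x - (b_{i-1}+b_i)/2|."""
    scores = X @ w
    k = len(b) - 1
    vals = np.empty((len(X), k))
    for i in range(1, k + 1):
        half_width = (b[i] - b[i - 1]) / 2.0
        mid = (b[i - 1] + b[i]) / 2.0
        vals[:, i - 1] = half_width - np.abs(scores - mid)
    return np.argmax(vals, axis=1) + 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))                      # toy unlabeled samples
    w = np.array([1.0, -0.5])
    b = np.array([-1e6, -0.8, 0.3, 1e6])             # k = 3 clusters
    print(assign_by_interval(X, w, b))
    print(assign_by_theorem1(X, w, b))               # identical assignments on this toy data
```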
Fig. 2. Illustration of the margin constraint. The distances from x1 and x2 to hyperplane (w, ((bi−1 + bi )/2)) are d1 = (|wT x1 − ((bi−1 + bi )/2)|)/||w|| and d3 = (|wT x2 − ((bi−1 + bi )/2)|)/||w||, respectively. x1 is in the range of hyperplanes (w, bi−1 + 1) and (w, bi − 1), and d1 is no larger than ((bi − bi−1 )/2||w||) − (1/||w||), i.e., ((bi − bi−1 )/2||w||) − (1/||w||) − ((|wT x1 − ((bi−1 + bi )/2)|)/(||w||)) ≥ 0, which can be changed into ((bi − bi−1 )/2) − |wT x1 − ((bi−1 + bi )/2)| ≥ 1. Hence, it has ξ ∗ = 0 for x1 . x2 is outside the range of (w, bi−1 + 1) and (w, bi − 1), and d3 is larger than ((bi − bi−1 )/2||w||) − (1/||w||), i.e., ((bi − bi−1 )/2||w||) − (1/||w||) − ((|wT x2 − ((bi−1 + bi )/2)|)/||w||) < 0, which can be transformed into ((bi − bi−1 )/2) − |wT x2 − ((bi−1 + bi )/2)| < 1. Thus, it has ξ ∗ > 0 for x2 . From the above analysis, it is seen that by enforcing the margin constraint in Theorem 2, it has ξ ∗ = 0 for the samples (e.g., x1 ) lying inside the range of (w, bi−1 + 1) and (w, bi − 1), and it has ξ ∗ > 0 for those (e.g., x2 ) outside this range. Based on this, the semisupervised OR clustering problem can be formulated by maximizing the margin (2/||w||) between two closest neighboring clusters and meanwhile minimizing the errors ξ ∗ associated with the margin constraints.

Theorem 2: The margin constraint for semisupervised OR clustering can be given by
$$\max_{i=1,\ldots,k}\Big(\frac{b_i-b_{i-1}}{2}-\Big|w^T x-\frac{b_{i-1}+b_i}{2}\Big|\Big) \ \ge\ 1-\xi.$$
Proof: The value of $\|w\|$ is larger than 0, and the constraint in Theorem 2 can be changed into $\max_{i=1,\ldots,k}\big(\frac{b_i-b_{i-1}}{2\|w\|}-\frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}\big) \ge \frac{1}{\|w\|}-\xi^*$, where $\xi^* = \xi/\|w\|$. Let $i = \arg\max_{i=1,\ldots,k}\big(\frac{b_i-b_{i-1}}{2\|w\|}-\frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}\big)$; the constraint becomes $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|} \ge \frac{1}{\|w\|}-\xi^*$, which can be further rewritten as $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|} \ge \frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}-\xi^*$. As shown in Fig. 2, $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|}$ is the distance from the middle hyperplane $(w, (b_{i-1}+b_i)/2)$ to hyperplane $(w, b_{i-1}+1)$ or $(w, b_i-1)$, and $\frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}$ is the distance from $x$ to the middle hyperplane $(w, (b_{i-1}+b_i)/2)$. When $x$ falls inside the range of hyperplanes $(w, b_{i-1}+1)$ and $(w, b_i-1)$, its distance to the middle hyperplane is no larger than $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|}$, and $\xi^*=0$ holds. When $x$ lies outside this range, its distance to the middle hyperplane is larger than $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|}$, and $\xi^*>0$ holds. Hence, it has $\xi^*=0$ when $x$ falls inside the range of hyperplanes $(w, b_{i-1}+1)$ and $(w, b_i-1)$; otherwise, it has $\xi^*>0$.

Taking Fig. 2 as an example, it has $\xi^*=0$ for $x_1$ and $\xi^*>0$ for $x_2$. It is seen that $x_1$ is inside the range of hyperplanes $(w, b_{i-1}+1)$ and $(w, b_i-1)$, and its distance to the middle hyperplane $(w, (b_{i-1}+b_i)/2)$, i.e., $d_1 = \frac{|w^T x_1-(b_{i-1}+b_i)/2|}{\|w\|}$, is no larger than $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|}$. Hence, it has $\xi^*=0$ for $x_1$. In addition, $x_2$ is outside the range of $(w, b_{i-1}+1)$ and $(w, b_i-1)$, and its distance to the middle hyperplane, i.e., $d_3 = \frac{|w^T x_2-(b_{i-1}+b_i)/2|}{\|w\|}$, is larger than $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|}$. Thus, it has $\xi^*>0$ for $x_2$. From this example, it is observed that when $x$ is inside the range of $(w, b_{i-1}+1)$ and $(w, b_i-1)$, it has $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|} \ge \frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}$ and $\xi^*=0$ holds; when $x$ is outside this range, it has $\frac{b_i-b_{i-1}}{2\|w\|}-\frac{1}{\|w\|} < \frac{|w^T x-(b_{i-1}+b_i)/2|}{\|w\|}$ and $\xi^*>0$ comes true.

By incorporating the margin constraint in Theorem 2, the learning problem can be given by
$$\begin{aligned}
\min\ & \frac{1}{2}\|w\|^2 + C_1\sum_{j=1}^{n_u}\xi_j\\
\text{s.t.}\ & \max_{i=1,\ldots,k}\Big(\frac{b_i-b_{i-1}}{2}-\Big|w^T x_j-\frac{b_{i-1}+b_i}{2}\Big|\Big) \ge 1-\xi_j\\
& \xi_j \ge 0,\quad \forall j=1,\ldots,n_u \qquad (1)
\end{aligned}$$
where $C_1$ is a regularization parameter and $\xi_j$ are error terms. As shown in Fig. 1, $2/\|w\|$ is the margin of the closest neighboring clusters. In problem (1), minimizing $\frac{1}{2}\|w\|^2$ is equivalent to maximizing the margin of the closest neighboring clusters. The first set of constraints, $\max_{i=1,\ldots,k}\big(\frac{b_i-b_{i-1}}{2}-|w^T x_j-\frac{b_{i-1}+b_i}{2}|\big) \ge 1-\xi_j$, is the margin constraint, and $C_1\sum_{j=1}^{n_u}\xi_j$ is the error term. It is seen that problem (1) is formulated by maximizing the margin of the closest neighboring clusters and meanwhile minimizing the errors caused by the margin constraints.

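The slack $\xi_j$ in problem (1) is simply a hinge loss on the margin constraint of Theorem 2. A short illustrative sketch (toy values, not the paper's code) evaluating this slack:

```python
import numpy as np

def margin_slack(X, w, b):
    """xi_j = max(0, 1 - max_i [(b_i - b_{i-1})/2 - |w^T x_j - (b_{i-1}+b_i)/2|]),
    i.e. the hinge loss on the margin constraint of problem (1)."""
    scores = X @ w
    k = len(b) - 1
    best = np.full(len(X), -np.inf)
    for i in range(1, k + 1):
        half_width = (b[i] - b[i - 1]) / 2.0
        mid = (b[i - 1] + b[i]) / 2.0
        best = np.maximum(best, half_width - np.abs(scores - mid))
    return np.maximum(0.0, 1.0 - best)

# Samples deep inside a cluster get zero slack; samples within distance
# 1/||w|| of a cluster boundary get positive slack.
w = np.array([1.0, 0.0])
b = np.array([-1e6, -2.0, 2.0, 1e6])     # b_0, b_k: large finite stand-ins for -inf, +inf
X = np.array([[0.0, 0.0],                # centre of cluster 2 -> slack 0
              [1.5, 0.0]])               # within 1 of threshold b_2 = 2 -> slack > 0
print(margin_slack(X, w, b))             # [0.  0.5]
```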
C. Enforcing Sample-Ranking Constraints

The relative ranking information of unlabeled samples can be introduced to refine the clustering boundary. In this case, the OR clustering model is built on a set of unlabeled samples and a number of sample-ranking constraints. In this section, we put forward the sample-ranking constraint for semisupervised OR clustering. Given $n_c$ pairs of ranked samples $\{x_{j1} \succ x_{j2}\}_{j=1}^{n_c}$, where $x_{j1}$ should be assigned to a higher ranked cluster than $x_{j2}$, the sample-ranking constraint can be obtained as follows.

Fig. 3. Illustration of the sample-ranking constraint. $\{x_{j1} \succ x_{j2}\}$ is a pair of ranked samples, where $x_{j1}$ should be assigned to a higher ranked cluster than $x_{j2}$. For clarity, only samples $x_{j1}$ and $x_{j2}$ are drawn and the other samples are omitted. According to our proposed sample-ranking constraints: (a) when $x_{j1}$ and $x_{j2}$ are assigned to the same cluster, it has $\zeta_j > 0$; (b) when $x_{j1}$ is assigned to a lower ranked cluster than $x_{j2}$, it has $\zeta_j > 0$; (c) when $x_{j1}$ is assigned to a higher ranked cluster than $x_{j2}$, it has $\zeta_j = 0$.

Theorem 3: The sample-ranking constraint is given by
$$\min_{i=1,\ldots,k-1}\big[|w^T x_{j1}-b_i| + |w^T x_{j2}-b_i| - (w^T x_{j1}-w^T x_{j2})\big] \ \le\ \zeta_j.$$
Proof: Let $f_c = |w^T x_{j1}-b_i| + |w^T x_{j2}-b_i| - (w^T x_{j1}-w^T x_{j2})$. First, when $w^T x_{j1}-b_i > 0$ and $w^T x_{j2}-b_i > 0$ hold, it has $f_c = 2(w^T x_{j2}-b_i)$, whose value is larger than 0. Second, when $w^T x_{j1}-b_i < 0$ and $w^T x_{j2}-b_i < 0$ come true, it has $f_c = 2(b_i-w^T x_{j1})$, whose value is larger than 0. Third, when $w^T x_{j1}-b_i < 0$ and $w^T x_{j2}-b_i > 0$ are met, it has $f_c = 2(w^T x_{j2}-w^T x_{j1})$, whose value is also larger than 0, since $w^T x_{j1} < b_i$ and $b_i < w^T x_{j2}$ hold. Last, when $w^T x_{j1}-b_i > 0$ and $w^T x_{j2}-b_i < 0$ are satisfied, it has $f_c = 0$. Hence, the value of $f_c$ is nonnegative, and it has $f_c = 0$ only if $w^T x_{j1}-b_i > 0$ and $w^T x_{j2}-b_i < 0$ are met. That is to say, when $f_c = 0$ holds, among the $k-1$ hyperplanes $(w, b_1), \ldots, (w, b_{k-1})$, there must exist at least one hyperplane $(w, b_i)$ such that $x_{j1}$ is on the positive side of $(w, b_i)$ with $w^T x_{j1}-b_i > 0$ and $x_{j2}$ is on the negative side of $(w, b_i)$ with $w^T x_{j2}-b_i < 0$. If $x_{j1}$ and $x_{j2}$ are in the same cluster, they always lie on the same side of the hyperplanes, and we cannot find a hyperplane such that $x_{j1}$ is on its positive side and $x_{j2}$ is on its negative side. If $x_{j1}$ and $x_{j2}$ are in different clusters and $x_{j1}$ is in a lower ranked cluster than $x_{j2}$, such a hyperplane cannot be found either. If $x_{j1}$ is in a higher ranked cluster than $x_{j2}$, we can find at least one hyperplane which lies between $x_{j1}$ and $x_{j2}$, such that $x_{j1}$ is on its positive side and $x_{j2}$ is on its negative side. Hence, it has $f_c = 0$ only if $x_{j1}$ is in a higher ranked cluster than $x_{j2}$; otherwise, it has $f_c > 0$. Moreover, when $f_c = 0$ comes true, it has $\zeta_j = 0$ and no error occurs. When $f_c > 0$ holds, it has $\zeta_j > 0$ and a penalty is imposed.

Let us consider an example in Fig. 3(a) and (b). Assume that there is a pair of ranked samples $\{x_{j1} \succ x_{j2}\}$, where $x_{j1}$ should be assigned to a higher ranked cluster than $x_{j2}$. In Fig. 3(a), $x_{j1}$ and $x_{j2}$ are assigned to the same cluster. For the hyperplane $(w, b_i)$, $x_{j1}$ and $x_{j2}$ lie on the positive side of $(w, b_i)$, and it has $w^T x_{j1}-b_i > 0$ for sample $x_{j1}$ and $w^T x_{j2}-b_i > 0$ for sample $x_{j2}$. Hence, $f_c$ is equal to $2(w^T x_{j2}-b_i)$, whose value is larger than 0. Moreover, for the hyperplane $(w, b_{i+1})$, $x_{j1}$ and $x_{j2}$ are located on the negative side of $(w, b_{i+1})$, and it has $w^T x_{j1}-b_{i+1} < 0$ for sample $x_{j1}$ and $w^T x_{j2}-b_{i+1} < 0$ for sample $x_{j2}$. Thus, $f_c$ is equal to $2(b_{i+1}-w^T x_{j1})$, whose value is larger than 0.

The same analysis can be conducted on the other hyperplanes. It is seen that when the two samples are in the same cluster, they always lie on the same side of the separating hyperplanes, $f_c$ is larger than 0, which leads to $\zeta_j > 0$, and an error is incurred. In Fig. 3(b), $x_{j1}$ and $x_{j2}$ are in different clusters and $x_{j1}$ is assigned to a lower ranked cluster than $x_{j2}$: $x_{j1}$ is in cluster $i$ and $x_{j2}$ is in cluster $i+1$. It can be seen that $x_{j1}$ lies on the negative side of hyperplane $(w, b_i)$, and it has $w^T x_{j1}-b_i < 0$, i.e., $b_i > w^T x_{j1}$; $x_{j2}$ lies on the positive side of hyperplane $(w, b_i)$, and it has $w^T x_{j2}-b_i > 0$, i.e., $w^T x_{j2} > b_i$. Therefore, $f_c$ is equal to $2(w^T x_{j2}-w^T x_{j1})$, whose value is larger than 0, since $w^T x_{j2} > b_i$ and $b_i > w^T x_{j1}$ imply $w^T x_{j2} > w^T x_{j1}$. As for the other hyperplanes, e.g., $(w, b_{i+1})$, $x_{j1}$ and $x_{j2}$ are located on the same side of these hyperplanes. Similar to the case in Fig. 3(a), the value of $f_c$ is larger than 0 and the penalty is enforced. Moreover, in Fig. 3(c), $x_{j1}$ is assigned to a higher ranked cluster than $x_{j2}$: $x_{j1}$ is in cluster $i+1$ and $x_{j2}$ is in cluster $i$. Among all the hyperplanes, there must exist at least one hyperplane, e.g., $(w, b_i)$, which lies between $x_{j1}$ and $x_{j2}$, such that $x_{j1}$ is located on its positive side and $x_{j2}$ is on its negative side. In Fig. 3(c), $x_{j1}$ lies on the positive side of hyperplane $(w, b_i)$, and it has $w^T x_{j1}-b_i > 0$; $x_{j2}$ is located on the negative side of hyperplane $(w, b_i)$, and it has $w^T x_{j2}-b_i < 0$. Hence, $f_c$ is equal to 0, and no error occurs. From the above example, it can be seen that the function $\min_{i=1,\ldots,k-1}[|w^T x_{j1}-b_i| + |w^T x_{j2}-b_i| - (w^T x_{j1}-w^T x_{j2})]$ is equal to 0 only if samples $x_{j1}$ and $x_{j2}$ are assigned to distinct clusters and meanwhile $x_{j1}$ lies in a higher ranked cluster than $x_{j2}$. Otherwise, the value of this function is larger than 0 and an error is incurred, which demonstrates the effectiveness of our proposed sample-ranking constraints.

By applying the sample-ranking constraint, the optimization function (1) can be transformed into
$$\begin{aligned}
\min\ & \frac{1}{2}\|w\|^2 + C_1\sum_{j=1}^{n_u}\xi_j + C_2\sum_{j=1}^{n_c}\zeta_j\\
\text{s.t.}\ & \max_{i=1,\ldots,k}\Big(\frac{b_i-b_{i-1}}{2}-\Big|w^T x_j-\frac{b_{i-1}+b_i}{2}\Big|\Big) \ge 1-\xi_j,\quad \forall j=1,\ldots,n_u\\
& \min_{i=1,\ldots,k-1}\big[|w^T x_{j1}-b_i| + |w^T x_{j2}-b_i| - (w^T x_{j1}-w^T x_{j2})\big] \le \zeta_j,\quad \forall x_{j1}\succ x_{j2},\ j=1,\ldots,n_c\\
& \xi_j \ge 0,\quad \zeta_j \ge 0. \qquad (2)
\end{aligned}$$

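For intuition, the sample-ranking penalty $\zeta_j$ of Theorem 3 can be evaluated directly. The sketch below (illustrative only, with made-up thresholds and samples) shows that it vanishes exactly when the first sample of a ranked pair already sits in a higher ranked cluster.

```python
import numpy as np

def ranking_penalty(x1, x2, w, b_inner):
    """zeta_j = min over the inner thresholds b_1..b_{k-1} of
    |w^T x1 - b_i| + |w^T x2 - b_i| - (w^T x1 - w^T x2).
    By Theorem 3 this quantity is nonnegative, and it is 0 exactly when
    some threshold separates the pair with w^T x1 > b_i > w^T x2,
    i.e. x1 falls in a higher ranked cluster than x2."""
    s1, s2 = w @ x1, w @ x2
    vals = np.abs(s1 - b_inner) + np.abs(s2 - b_inner) - (s1 - s2)
    return float(np.min(vals))

w = np.array([1.0, 0.0])
b_inner = np.array([-2.0, 2.0])            # b_1, b_2 for k = 3 clusters
x_high = np.array([3.0, 0.0])              # lies in cluster 3
x_mid = np.array([0.0, 0.0])               # lies in cluster 2
print(ranking_penalty(x_high, x_mid, w, b_inner))   # 0.0: ranking already satisfied
print(ranking_penalty(x_mid, x_high, w, b_inner))   # 6.0: x_mid is not ranked above x_high
```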
The traditional semisupervised clustering methods [26]–[28] consider only the must-link and cannot-link constraints. Distinguished from these methods, we present the sample-ranking constraints to incorporate the relative ranking information of unlabeled samples into refining the clustering model, which brings our method closer to real-world OR applications.

D. Enforcing Cluster Balance Constraints

The cluster balance constraint should be enforced to avoid the trivially optimal solution where all samples are assigned to one cluster, and to prevent forming very small clusters [29]. In this section, the cluster balance constraint will be presented.

Theorem 4: The cluster balance constraint is obtained as
$$-l \ \le\ \frac{\sum_{j=1}^{n_u}(w^T x_j-b_i)}{k-2i} - \frac{\sum_{j=1}^{n_u}(w^T x_j-b_{i+1})}{k-2i-2} \ \le\ l$$
where $l$ is a cluster balance parameter.

Proof: Let $y_i(x) = \mathrm{sign}(w^T x-b_i) \in \{1,-1\}$ denote the predicted label of sample $x$ associated with the $i$th hyperplane $(w, b_i)$. When $x$ lies on the negative side of $(w, b_i)$, it has $w^T x-b_i < 0$ and $y_i(x) = -1$; when $x$ is located on its positive side, it has $w^T x-b_i \ge 0$ and $y_i(x) = 1$. Given a data set with $n_u$ unlabeled samples and $k$ clusters, suppose that the cluster sizes are equal, namely, each cluster contains $n_u/k$ samples. Based on the OR model as in Fig. 2, the $i$th hyperplane $(w, b_i)$ separates cluster $i$ and cluster $i+1$, and thus there are $i$ clusters (clusters $1,\ldots,i$) lying on its negative side and $k-i$ clusters (clusters $i+1,\ldots,k$) located on its positive side. Since each cluster has $n_u/k$ samples, the numbers of samples lying on the negative side and the positive side of hyperplane $(w, b_i)$ are $i\,(n_u/k)$ and $(k-i)(n_u/k)$, respectively. Therefore, for the $i$th hyperplane $(w, b_i)$, it has $\sum_{j=1}^{n_u} y_i(x_j) = i\,(n_u/k)(-1) + (k-i)(n_u/k) = (k-2i)(n_u/k)$. Furthermore, for the $(i+1)$th hyperplane $(w, b_{i+1})$, there are $(i+1)(n_u/k)$ samples lying on its negative side and $(k-i-1)(n_u/k)$ samples on its positive side. Hence, it has $\sum_{j=1}^{n_u} y_{i+1}(x_j) = (i+1)(n_u/k)(-1) + (k-i-1)(n_u/k) = (k-2i-2)(n_u/k)$. Based on these two equations, we can obtain $\frac{\sum_{j=1}^{n_u} y_i(x_j)}{k-2i} = \frac{\sum_{j=1}^{n_u} y_{i+1}(x_j)}{k-2i-2} = \frac{n_u}{k}$, and hence $\frac{\sum_{j=1}^{n_u} y_i(x_j)}{k-2i} - \frac{\sum_{j=1}^{n_u} y_{i+1}(x_j)}{k-2i-2} = 0$.

Above, we assume that each cluster has the same size. However, in practice, the number of samples in each cluster may not be equal. As in [30] and [31], we introduce a parameter $\ell_n \ge 0$, and the constraint $\frac{\sum_{j=1}^{n_u} y_i(x_j)}{k-2i} - \frac{\sum_{j=1}^{n_u} y_{i+1}(x_j)}{k-2i-2} = 0$ is relaxed as $-\ell_n \le \frac{\sum_{j=1}^{n_u} y_i(x_j)}{k-2i} - \frac{\sum_{j=1}^{n_u} y_{i+1}(x_j)}{k-2i-2} \le \ell_n$. However, this relaxed constraint is difficult to handle, since the sample label $y_i(x_j) = \mathrm{sign}(w^T x_j-b_i)$ involves the computation of the sign function. Following the same routine as in [30] and [31], $f_i(x_j) = w^T x_j-b_i$ can be viewed as the soft label of $x_j$, and the sample label $y_i(x_j)$ can be approximated by $w^T x_j-b_i$. Hence, the cluster balance constraint is given by $-l \le \frac{\sum_{j=1}^{n_u}(w^T x_j-b_i)}{k-2i} - \frac{\sum_{j=1}^{n_u}(w^T x_j-b_{i+1})}{k-2i-2} \le l$, where $l$ is a cluster balance parameter.
Compared with the constraint $-\ell_n \le \frac{\sum_{j=1}^{n_u} y_i(x_j)}{k-2i} - \frac{\sum_{j=1}^{n_u} y_{i+1}(x_j)}{k-2i-2} \le \ell_n$, it is seen that the cluster balance constraint is formulated by replacing $y_i(x_j)$ and $\ell_n$ with $f_i(x_j) = w^T x_j-b_i$ and $l$, respectively.

Let $\mathbf{b} = (b_0, b_1, \ldots, b_k)^T \in R^{k+1}$ and let $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)^T \in R^{k+1}$, where the $i$th element is 1 and the others are 0. Hence, $b_0, b_1, \ldots, b_k$ can be represented as $\mathbf{b}^T e_0, \mathbf{b}^T e_1, \ldots, \mathbf{b}^T e_k$, respectively. By enforcing the cluster balance constraint and replacing $b_i$ with $\mathbf{b}^T e_i$, the learning problem (2) can be further transformed into
$$\begin{aligned}
\min\ & \frac{1}{2}\|w\|^2 + C_1\sum_{j=1}^{n_u}\xi_j + C_2\sum_{j=1}^{n_c}\zeta_j\\
\text{s.t.}\ & \max_{i=1,\ldots,k}\Big(\frac{\mathbf{b}^T(e_i-e_{i-1})}{2}-\Big|w^T x_j-\frac{\mathbf{b}^T(e_{i-1}+e_i)}{2}\Big|\Big) \ge 1-\xi_j,\quad \forall j=1,\ldots,n_u\\
& \min_{i=1,\ldots,k-1}\big[|w^T x_{j1}-\mathbf{b}^T e_i| + |w^T x_{j2}-\mathbf{b}^T e_i| - (w^T x_{j1}-w^T x_{j2})\big] \le \zeta_j,\quad j=1,\ldots,n_c\\
& -l \le \frac{\sum_{j=1}^{n_u}(w^T x_j-\mathbf{b}^T e_i)}{k-2i} - \frac{\sum_{j=1}^{n_u}(w^T x_j-\mathbf{b}^T e_{i+1})}{k-2i-2} \le l,\quad \forall i=1,2,\ldots,k-2\\
& \xi_j \ge 0,\quad \zeta_j \ge 0. \qquad (3)
\end{aligned}$$
In problem (3), the cluster balance constraints are associated with the $k-1$ hyperplanes $(w, \mathbf{b}^T e_1), \ldots, (w, \mathbf{b}^T e_{k-1})$. These $k-1$ hyperplanes separate the data into $k$ clusters, and we can make the cluster sizes relatively balanced by enforcing the cluster balance constraints on these hyperplanes. Furthermore, problem (3) is not a convex optimization problem, since the first and second sets of constraints are nonconvex. In Section III-E, we present how to solve problem (3) effectively by using the CCCP technique [32]–[35]. CCCP has been extensively applied to solve nonconvex clustering [26], [27], [31], [36], [37] and classification problems [38]–[43].

E. CCCP Decomposition

The CCCP technique is designed to solve nonconvex optimization problems whose objective functions can be expressed as a difference of convex functions. In problem (3), although the first and second sets of constraints are nonconvex, they are differences of two convex functions, so the problem can be handled by employing the CCCP technique. Let $g(w, \mathbf{b}, i, j) = \frac{\mathbf{b}^T(e_i-e_{i-1})}{2} - |w^T x_j - \frac{\mathbf{b}^T(e_{i-1}+e_i)}{2}|$ and $h(w, \mathbf{b}, j) = \max_{i=1,\ldots,k} g(w, \mathbf{b}, i, j)$. Moreover, define $s(w, \mathbf{b}, i, j) = |w^T x_{j1}-\mathbf{b}^T e_i| + |w^T x_{j2}-\mathbf{b}^T e_i| - (w^T x_{j1}-w^T x_{j2})$ and $d(w, \mathbf{b}, j) = \min_{i=1,\ldots,k-1} s(w, \mathbf{b}, i, j)$. Hence, the first and second sets of constraints in problem (3) become $h(w, \mathbf{b}, j) \ge 1-\xi_j$ and $d(w, \mathbf{b}, j) \le \zeta_j$, respectively. Based on this, CCCP is applied to decompose the nonconvex problem (3) into a series of convex quadratic programming (QP) problems. Given an initial point $(w^{(0)}, \mathbf{b}^{(0)})$, CCCP computes $(w^{(t+1)}, \mathbf{b}^{(t+1)})$ from $(w^{(t)}, \mathbf{b}^{(t)})$ iteratively by replacing $h(w, \mathbf{b}, j)$ and $d(w, \mathbf{b}, j)$ with their corresponding first-order Taylor expansions at $(w^{(t)}, \mathbf{b}^{(t)})$.

The resulting convex subproblems are solved iteratively until the stopping criterion is met. The first-order Taylor expansion of $h(w, \mathbf{b}, j)$ at $(w^{(t)}, \mathbf{b}^{(t)})$ can be obtained as
$$h(w, \mathbf{b}, j) \approx -w^T x_j z_j^{(t)} + \mathbf{b}^T v_j^{(t)} \qquad (4)$$
where $z_j^{(t)} = \sum_{i=1}^{k}\theta_{ij}^{(t)} r_{ij}^{(t)} \in R$, $v_j^{(t)} = \frac{1}{2}\sum_{i=1}^{k}\theta_{ij}^{(t)}\big[e_i(r_{ij}^{(t)}+1) + e_{i-1}(r_{ij}^{(t)}-1)\big] \in R^{k+1}$, and
$$\theta_{ij}^{(t)} = \begin{cases}1, & \text{if } i = \arg\max_i g(w^{(t)}, \mathbf{b}^{(t)}, i, j)\\ 0, & \text{otherwise}\end{cases}
\qquad
r_{ij}^{(t)} = \begin{cases}1, & \text{if } (w^{(t)})^T x_j - (\mathbf{b}^{(t)})^T\frac{e_{i-1}+e_i}{2} \ge 0\\ -1, & \text{otherwise.}\end{cases}$$
Furthermore, the first-order Taylor expansion of $d(w, \mathbf{b}, j)$ at $(w^{(t)}, \mathbf{b}^{(t)})$ is given by
$$d(w, \mathbf{b}, j) \approx w^T\big(p_j^{(t)} x_{j1} - q_j^{(t)} x_{j2}\big) - \mathbf{b}^T u_j^{(t)} \qquad (5)$$
where $u_j^{(t)} = \sum_{i=1}^{k-1}\kappa_{ij}^{(t)}(\eta_{ij}^{(t)}+\mu_{ij}^{(t)})e_i \in R^{k+1}$, $p_j^{(t)} = \sum_{i=1}^{k-1}\kappa_{ij}^{(t)}(\eta_{ij}^{(t)}-1) \in R$, $q_j^{(t)} = -\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}(\mu_{ij}^{(t)}+1) \in R$, and
$$\kappa_{ij}^{(t)} = \begin{cases}1, & \text{if } i = \arg\min_{i=1,\ldots,k-1} s(w^{(t)}, \mathbf{b}^{(t)}, i, j)\\ 0, & \text{otherwise}\end{cases}
\qquad
\eta_{ij}^{(t)} = \begin{cases}1, & \text{if } (w^{(t)})^T x_{j1} - (\mathbf{b}^{(t)})^T e_i \ge 0\\ -1, & \text{otherwise}\end{cases}
\qquad
\mu_{ij}^{(t)} = \begin{cases}1, & \text{if } (w^{(t)})^T x_{j2} - (\mathbf{b}^{(t)})^T e_i \ge 0\\ -1, & \text{otherwise.}\end{cases}$$
The derivations of (4) and (5) are given in Appendixes A and B, respectively. In problem (3), we replace $h(w, \mathbf{b}, j)$ and $d(w, \mathbf{b}, j)$ with their corresponding first-order Taylor expansions (4) and (5) at $(w^{(t)}, \mathbf{b}^{(t)})$, and the convex optimization problem for the $t$th CCCP iteration can be obtained as
$$\begin{aligned}
\min\ & \frac{1}{2}\|w\|^2 + C_1\sum_{j=1}^{n_u}\xi_j + C_2\sum_{j=1}^{n_c}\zeta_j\\
\text{s.t.}\ & -w^T x_j z_j^{(t)} + \mathbf{b}^T v_j^{(t)} \ge 1-\xi_j,\quad \forall j=1,\ldots,n_u\\
& w^T\big(p_j^{(t)} x_{j1} - q_j^{(t)} x_{j2}\big) - \mathbf{b}^T u_j^{(t)} \le \zeta_j,\quad \forall j=1,\ldots,n_c\\
& -l \le \frac{\sum_{j=1}^{n_u}(w^T x_j-\mathbf{b}^T e_i)}{k-2i} - \frac{\sum_{j=1}^{n_u}(w^T x_j-\mathbf{b}^T e_{i+1})}{k-2i-2} \le l,\quad \forall i=1,2,\ldots,k-2\\
& \xi_j \ge 0,\quad \zeta_j \ge 0. \qquad (6)
\end{aligned}$$

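The linearization coefficients in (4) and (5) are straightforward to compute from the current iterate. The following sketch mirrors the definitions of $\theta$, $r$, $\kappa$, $\eta$, and $\mu$ above; it is an illustration of the formulas rather than the authors' implementation, and the trailing demo values are arbitrary.

```python
import numpy as np

def margin_linearization(x, w_t, b_t):
    """Coefficients (z_j, v_j) of expansion (4): the active cluster i* is the
    arg max of g(w^(t), b^(t), i, j); r is the sign inside the absolute value."""
    k = len(b_t) - 1
    score = w_t @ x
    g = np.array([(b_t[i] - b_t[i - 1]) / 2.0
                  - abs(score - (b_t[i - 1] + b_t[i]) / 2.0) for i in range(1, k + 1)])
    i_star = int(np.argmax(g)) + 1
    r = 1.0 if score - (b_t[i_star - 1] + b_t[i_star]) / 2.0 >= 0 else -1.0
    z = r                                      # z_j = sum_i theta_ij r_ij
    v = np.zeros(k + 1)                        # v_j in R^{k+1}
    v[i_star] += 0.5 * (r + 1.0)
    v[i_star - 1] += 0.5 * (r - 1.0)
    return z, v                                # h(w, b, j) ~ -z w^T x_j + b^T v

def ranking_linearization(x1, x2, w_t, b_t):
    """Coefficients (p_j, q_j, u_j) of expansion (5)."""
    k = len(b_t) - 1
    s1, s2 = w_t @ x1, w_t @ x2
    s_vals = np.array([abs(s1 - b_t[i]) + abs(s2 - b_t[i]) - (s1 - s2)
                       for i in range(1, k)])  # inner thresholds b_1..b_{k-1}
    i_star = int(np.argmin(s_vals)) + 1
    eta = 1.0 if s1 - b_t[i_star] >= 0 else -1.0
    mu = 1.0 if s2 - b_t[i_star] >= 0 else -1.0
    p, q = eta - 1.0, -(mu + 1.0)
    u = np.zeros(k + 1)
    u[i_star] = eta + mu
    return p, q, u                             # d(w, b, j) ~ w^T (p x1 - q x2) - b^T u

if __name__ == "__main__":
    w_t = np.array([1.0, 0.0])
    b_t = np.array([-1e6, -2.0, 2.0, 1e6])     # toy current iterate
    print(margin_linearization(np.array([1.5, 0.0]), w_t, b_t))
    print(ranking_linearization(np.array([3.0, 0.0]), np.array([0.0, 0.0]), w_t, b_t))
```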
F. Dual Form

In problem (3), we replace $h(w, \mathbf{b}, j)$ and $d(w, \mathbf{b}, j)$ with their corresponding first-order Taylor expansions, and the resulting problem (6) is a convex QP problem. The QP problem is usually solved via its dual form. Let us define
$$E^{(t)} = \Big\{-\sum_{i=1}^{k}\theta_{i1}^{(t)} r_{i1}^{(t)},\ \ldots,\ -\sum_{i=1}^{k}\theta_{in_u}^{(t)} r_{in_u}^{(t)},\ -\sum_{i=1}^{k-1}\kappa_{i1}^{(t)}\big(\eta_{i1}^{(t)}+\mu_{i1}^{(t)}\big),\ \ldots,\ -\sum_{i=1}^{k-1}\kappa_{in_c}^{(t)}\big(\eta_{in_c}^{(t)}+\mu_{in_c}^{(t)}\big),\ m_1 n_u,\ \ldots,\ m_{k-2} n_u,\ -m_1 n_u,\ \ldots,\ -m_{k-2} n_u\Big\}^T \in R^{n_u+n_c+2k-4}$$
and
$$Q^{(t)} = \Big\{-z_1^{(t)} x_1,\ \ldots,\ -z_{n_u}^{(t)} x_{n_u},\ q_1^{(t)} x_{12}-p_1^{(t)} x_{11},\ \ldots,\ q_{n_c}^{(t)} x_{n_c 2}-p_{n_c}^{(t)} x_{n_c 1},\ m_1\sum_{j=1}^{n_u}x_j,\ \ldots,\ m_{k-2}\sum_{j=1}^{n_u}x_j,\ -m_1\sum_{j=1}^{n_u}x_j,\ \ldots,\ -m_{k-2}\sum_{j=1}^{n_u}x_j\Big\} \in R^{(n_u+n_c+2k-4)\times d}$$
where $d$ is the dimension of the data. Based on these notations, the norm vector $w$ can be given by
$$w = (Q^{(t)})^T U \qquad (7)$$
where $U = (u_1, u_2, \ldots, u_{n_u+n_c+2k-4})^T$ are the Lagrange multipliers to be optimized. Refer to Appendix C for the derivations.

Proposition 1: The dual form of problem (6) is obtained as
$$\begin{aligned}
\max\ & -U^T Q^{(t)}(Q^{(t)})^T U + \sum_{j=1}^{n_u}u_j\\
\text{s.t.}\ & U^T E^{(t)} = 0\\
& 0 \le u_j \le C_1,\quad j=1,\ldots,n_u\\
& 0 \le u_j \le C_2,\quad j=n_u+1,\ldots,n_u+n_c. \qquad (8)
\end{aligned}$$

Proof: Refer to Appendix C.

The value of $U$ can be obtained by solving the dual form (35). By substituting $U$, we can calculate $w$ from (33) using the support vectors. After $(w, \mathbf{b})$ of problem (6) in the $t$th iteration is attained, we use it to compute $(w^{(t+1)}, \mathbf{b}^{(t+1)})$ in the $(t+1)$th iteration. The CCCP iterations repeat until the stopping criterion is met. At the end of the algorithm, we obtain a set of parallel hyperplanes $f_i(x) = U^T Q^{(t)}x - \mathbf{b}^T e_i$. For an unlabeled sample $x$, when $f_{i-1}(x) \ge 0$ and $f_i(x) < 0$ hold, it is assigned to cluster $i$.

When the samples are not linearly separable, a nonlinear mapping function $\phi(\cdot)$ is usually used to map the samples from the input space into a higher dimensional feature space. The inner product of two samples in the feature space can be computed as $K(x_i, x_j) = \phi(x_i)\cdot\phi(x_j)$. To extend M2 SORC to the feature space, we only need to replace $\phi(x_i)\cdot\phi(x_j)$ with $K(x_i, x_j)$ in the matrix $Q^{(t)}(Q^{(t)})^T$ of the dual form (35). The $i$th hyperplane in the feature space can be given by $f_i(\phi(x)) = U^T H_\phi^{(t)} - \mathbf{b}^T e_i$, where $H_\phi^{(t)} = \{-z_1^{(t)}K(x_1, x), \ldots, -z_{n_u}^{(t)}K(x_{n_u}, x), q_1^{(t)}K(x_{12}, x)-p_1^{(t)}K(x_{11}, x), \ldots, q_{n_c}^{(t)}K(x_{n_c 2}, x)-p_{n_c}^{(t)}K(x_{n_c 1}, x), m_1\sum_{j=1}^{n_u}K(x_j, x), \ldots, m_{k-2}\sum_{j=1}^{n_u}K(x_j, x), -m_1\sum_{j=1}^{n_u}K(x_j, x), \ldots, -m_{k-2}\sum_{j=1}^{n_u}K(x_j, x)\}^T \in R^{n_u+n_c+2k-4}$. By mapping the samples to the feature space, the data are expected to be more separable than in the input space.

G. Algorithm Overview

In Table I, we give an overview of the M2 SORC algorithm, which consists of a number of CCCP loops. Here, $J^{(t)} = \frac{1}{2}\|w^{(t)}\|^2 + C_1\sum_{j=1}^{n_u}\xi_j + C_2\sum_{j=1}^{n_c}\zeta_j$ is the value of the objective function in the $t$th CCCP iteration, and $\Delta J$ is the difference between the objective values in two successive CCCP iterations. We employ the same CCCP stopping criterion as in [30] and [31] to determine the termination of M2 SORC: when the ratio of $\Delta J$ to $J^{(t-1)}$ is no larger than a threshold $\epsilon$, the algorithm terminates.

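A high-level skeleton of the CCCP loop summarized in Table I is sketched below. The convex subproblem (6) and the objective evaluation are abstracted behind the callables solve_subproblem and objective, which are hypothetical placeholders introduced here for illustration (e.g., a QP solver for the dual (8)); only the iteration and stopping logic follows the text.

```python
import numpy as np

def m2sorc_cccp(X, ranked_pairs, k, C1, C2, l, solve_subproblem, objective,
                eps=1e-3, max_iter=50, seed=0):
    """CCCP outer loop of M2 SORC (cf. Table I).

    `solve_subproblem` and `objective` are hypothetical placeholders:
    the former should solve the convex QP (6) (e.g. via its dual (8))
    given the current iterate, the latter should return
    J = 0.5*||w||^2 + C1*sum(xi_j) + C2*sum(zeta_j)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                        # random initialization
    b = np.concatenate(([-1e6], np.sort(rng.normal(size=k - 1)), [1e6]))
    J_prev = None
    for t in range(max_iter):
        # linearize h and d at (w^(t), b^(t)) and solve the convex subproblem (6)
        w, b = solve_subproblem(X, ranked_pairs, w, b, C1, C2, l)
        J = objective(X, ranked_pairs, w, b, C1, C2)
        # stopping criterion: Delta J / J^(t-1) <= eps (Section III-G)
        if J_prev is not None and abs(J_prev - J) / abs(J_prev) <= eps:
            break
        J_prev = J
    return w, b
```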
TABLE I

H. Comparisons With Related Studies

Our method is proposed for semisupervised OR clustering with sample-ranking constraints, which is different from the traditional semisupervised clustering methods [26]–[28]. Our method involves the sample-ranking constraints, while the traditional semisupervised clustering methods incorporate the must-link and cannot-link constraints. The must-link constraint indicates that two samples should be assigned to the same cluster, and the cannot-link constraint implies that two samples should be assigned to different clusters. Hence, it is difficult to employ the must-link and cannot-link constraints to explicitly represent the sample ranking information that one sample should be assigned to a higher ranked cluster than the other. Moreover, an underlying assumption of the traditional semisupervised clustering methods is that there is no order among the clusters. Taking semisupervised maximum margin clustering [26], [27] as an example, it learns k hyperplanes for k clusters by assigning each sample to one hyperplane and meanwhile keeping it far from the other k − 1 hyperplanes. However, it does not take the scenario of cluster ordering

into account. The hyperplanes obtained are usually disordered and intersecting with each other. It is difficult to impute the intersecting hyperplanes with an order to represent the ranking information of samples. Moreover, our method is distinguished from the regression methods with order preferences (RWOPs) [44]–[46]. The RWOP methods focus on the regression problems with order preference and the training set is comprised with the following two parts. The first part is a set of fully labeled samples {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}, where xi (i = 1, . . . , n) is a training sample, and yi is the target value in regression. The second part is a number of ordered pairs (xi , x j ), namely, ordered preferences, where the target values of xi and x j are unknown, and (xi , x j ) indicates that the target value of xi should be larger than that of x j . Based on this, the RWOP methods estimate a regression function f (x) = wT x + b from the labeled samples, and at the same time ensure that for the ordered pair (xi , x j ), the target value f (xi ) of xi is greater than that of x j , i.e., f (x j ). Similar to our method, the RWOP methods include the relative ranking information of unlabeled samples (namely order preferences) as side information to improve the learning model. However, our method differs markedly from the RWOP methods. First, the RWOP methods are regression approaches, and our method is a clustering approach. The RWOP methods seek a regression function f (x) = wT x + b to fit the labeled samples, while our method attempts to find k + 1 parallel hyperplanes fi (x) = wT x + bi (i = 0, 1, . . . , k) to separate the unlabeled samples into k clusters. Second, the RWOP methods are designed for regression problems and our method is proposed for clustering problems which determines the different ways of our method and the RWOP methods when incorporating the sample ranking information. Assume that there is an ordered pair (xi , x j ), where xi is preferred to x j . In the RWOP methods, when the decision value of xi is larger than x j , the corresponding loss is equal to 0. This is because the RWOP methods are designed for regression problems and the sample ranking information is incorporated by measuring the samples’ decision values. By contrast, in our method, when the cluster of xi is higher than that of x j , the loss is equivalent to 0, since our method is proposed for clustering problems and the sample ranking information is incorporated by comparing the ranking of clusters that samples belong to. Due to the different natures of targeted problems and thus the distinctive ways of incorporating the sample ranking information, our method differs greatly from the RWOP methods. IV. E XPERIMENTS We conduct experiments on the benchmark data sets and real-world sentiment data set to investigate the performance of our proposed M2 SORC method. All the experiments run on the Windows platform with a 2.8-GHz processor and 3-GB DRAM. The objectives of our experiments are as follows. 1) To evaluate the effectiveness of our method when only the unlabeled samples and sample-ranking constraints are available to learn the OR model.

2) To evaluate the performance variation of our method with varying numbers of sample-ranking constraints.
3) To evaluate the improvement of our method when the number of CCCP iterations increases.

A. Evaluation Metrics

To investigate the performance of the proposed semisupervised OR clustering algorithm, we use the mean zero-one error (MZE), mean absolute error (MAE), average MAE (AMAE), and normalized mutual information (NMI) as the evaluation metrics. Among these metrics, the MZE, MAE, and AMAE are usually used to validate the effectiveness of OR methods [15], [47]–[50], while NMI is popularly utilized to evaluate the performance of clustering approaches [27], [30]. Following the same routine as in [26]–[28], we first select a set of labeled samples and remove the sample labels. Second, the clustering algorithms are run on the unlabeled samples. Third, the samples are relabeled using the clustering assignments returned by the algorithms. Finally, the metrics MZE, MAE, AMAE, and NMI are utilized to evaluate the algorithm performance by comparing the true label $Y_i$ and the label $\hat{Y}_i$ assigned by the clustering algorithms. The MZE, MAE, AMAE, and NMI are calculated as follows.

1) MZE: The fraction of incorrect assignments on individual samples
$$\mathrm{MZE} = \frac{1}{n}\sum_{i=1}^{n} I\big(\hat{Y}_i \ne Y_i\big) \qquad (9)$$
where $n$ is the data set size and $I(\cdot)$ is an indicator function: $I(\hat{Y}_i \ne Y_i) = 1$ if $\hat{Y}_i \ne Y_i$, and $I(\hat{Y}_i \ne Y_i) = 0$ otherwise. $Y_i$ is the true label of sample $x_i$, and $\hat{Y}_i$ is the label returned by the clustering assignment, which can be found by the Hungarian algorithm [30], [51]. The MZE measures the extent to which cluster assignments do not coincide with the corresponding true clusters. The smaller the MZE is, the better the performance the algorithm attains.

2) MAE: The average deviation of the clustering assignment from the true cluster, where the clusters are treated as consecutive integers
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\big|\hat{Y}_i - Y_i\big|. \qquad (10)$$
The MAE uses the absolute deviation between the clustering assignment and the true label, i.e., $|\hat{Y}_i - Y_i|$, to represent their difference, instead of 1 or 0 as the MZE does. This is because in OR there exists cluster ordering information, and the MZE cannot sufficiently reflect the exact deviation of the clustering assignment from the true cluster. The smaller the MAE is, the better the performance the algorithm obtains.

3) AMAE: The average of the MAE across different clusters
$$\mathrm{AMAE} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MAE}_i \qquad (11)$$

where $k$ is the number of clusters and $\mathrm{MAE}_i$ is the MAE taking into account only samples from cluster $i$. It has $\mathrm{MAE}_i = \frac{1}{n_i}\sum_{Y_j=i}|\hat{Y}_j - Y_j|$, where $n_i$ is the number of samples whose true cluster is $i$. The AMAE is typically used in imbalanced OR problems, as it emphasizes the errors in each cluster equally [18]–[20], [52]. A smaller value of AMAE indicates a better learning performance.

4) NMI: A symmetric measure that quantifies the statistical information shared between two distributions. Given two clustering results, the clustering assignments can be considered as distributions of random variables, and NMI can be calculated as
$$\mathrm{NMI} = \frac{\sum_{p=1}^{K}\sum_{q=1}^{K} n_{pq}\log\dfrac{n\,n_{pq}}{n_p\,\tilde{n}_q}}{\sqrt{\Big(\sum_{p=1}^{K} n_p\log\dfrac{n_p}{n}\Big)\Big(\sum_{q=1}^{K}\tilde{n}_q\log\dfrac{\tilde{n}_q}{n}\Big)}} \qquad (12)$$
where $n_p$ is the number of samples contained in the $p$th cluster, $\tilde{n}_q$ denotes the number of samples belonging to the $q$th cluster (the ground truth cluster), and $n_{pq}$ represents the number of samples which should belong to the $q$th cluster but are assigned to the $p$th cluster. The NMI value ranges from 0 to 1. Different from the above metrics, the larger the NMI value is, the more similar the groupings obtained by the algorithm are to the true clusters.

B. Baselines and Experimental Settings

Our method is proposed for semisupervised OR clustering. Considering that there is little work done on semisupervised OR clustering, we mainly compare our method with the traditional semisupervised clustering methods CPCMMC, SMMCPC, and MPCKmeans, the unsupervised maximum margin clustering method CPMMC, and the unsupervised OR clustering method WSVORC.

1) CPMMC [31]: This is a typical unsupervised maximum margin clustering algorithm, which finds k nonparallel hyperplanes to divide the data into k clusters. However, it is proposed for unsupervised clustering and does not take the sample-ranking constraints into account.

2) CPCMMC [26]: This is a semisupervised maximum margin clustering algorithm which builds a number of hyperplanes to separate the data and meanwhile incorporates two types of constraints, must-link and cannot-link constraints, into refining the clustering results.

3) SMMCPC [27]: This is also a semisupervised maximum margin clustering algorithm which seeks several hyperplanes to split the unlabeled data with must-link and cannot-link constraints. However, it has a different formulation from CPCMMC.

4) MPCKmeans [28]: This is a well-known semisupervised clustering algorithm which alternates between a metric learning step and a clustering step.

5) WSVORC [25]: This is an unsupervised OR clustering algorithm where each unlabeled sample is required to cluster around the nearest hyperplane and a least-squares-SVM-like formulation is presented.

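As a concrete reference for the metrics of Section IV-A, the sketch below computes MZE, MAE, and AMAE from (9)–(11) and uses scikit-learn's geometric-mean NMI for (12). It assumes the predicted cluster labels have already been aligned to the true labels (e.g., by the Hungarian algorithm mentioned above), that every true cluster is nonempty, and the example labels are made up.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def ordinal_clustering_metrics(y_true, y_pred, k):
    """MZE, MAE, AMAE (eqs. (9)-(11)) on cluster labels already aligned to
    the true clusters, plus NMI (eq. (12), geometric normalization)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mze = np.mean(y_true != y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    amae = np.mean([np.mean(np.abs(y_pred[y_true == c] - c))
                    for c in range(1, k + 1)])        # per-cluster MAE, then average
    nmi = normalized_mutual_info_score(y_true, y_pred, average_method="geometric")
    return {"MZE": mze, "MAE": mae, "AMAE": amae, "NMI": nmi}

print(ordinal_clustering_metrics([1, 1, 2, 3, 3], [1, 2, 2, 3, 1], k=3))
```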
TABLE II
DETAILED DESCRIPTIONS OF THE EXPERIMENTAL DATA SETS

Among the above baselines, CPCMMC, SMMCPC, and MPCKmeans are representative semisupervised clustering methods. However, they are not designed for OR problems: the learnt clusters are disordered and cannot represent the inherent cluster ordering information. More importantly, they consider only the must-link and cannot-link constraints, so the sample ranking information cannot be appropriately utilized to construct the clustering model. Distinguished from these methods, our method is proposed for semisupervised OR clustering and is capable of incorporating the sample ranking information into improving the accuracy.

For CPMMC, we pick $C$ from $2^{[-5:1:5]}$, $w$ is randomly initialized, and $\epsilon$ and $\alpha$ are set to 0.1. Following the same setting as in [30] and [31], the cluster balance parameter $l$ is searched from the grid $[0, 0.001, 0.01, 0.1, 1:1:5, 10]$. For CPCMMC, we let $C = C_1 = C_2$ and select the value from $2^{[-5:1:5]}$; the initial value of $w$ is randomly generated, and $\epsilon$ and $\alpha$ are equal to 0.1. For SMMCPC, $C$ is chosen from $2^{[-5:1:5]}$, $\delta_0$ is set to 1, and $\lambda$ is selected from the candidate set $10^{[-2:1:1]}$. For WSVORC, $C$ is picked from $2^{[-5:1:5]}$. For our method, $w$ and $\mathbf{b}$ are randomly initialized; we let $C_1 = C_2$ and select the value from $2^{[-5:1:5]}$, and $\epsilon$ is set to 0.1. The cluster balance parameter is selected in the same way as for CPMMC. For MPCKmeans, the cluster centers are initialized randomly. As in [26]–[28], the cluster number $k$ is set to the true class number for all algorithms.

C. Results on Benchmark Data Sets

We conduct experiments on ten benchmark data sets: AutoMPG, Abalone, Bank, Boston, California, Census, Computer, MachineCPU, Stock, and Triazines (available at http://www.gatsby.ucl.ac.uk/~chuwei/ordinalregression.html), which are popularly used in the existing OR work [6], [15], [24], [47], [48]. Each data set contains five clusters, and the sample size and feature number are shown in Table II. For our method, the experiments are conducted on the benchmark data sets with sample-ranking constraints. Each sample-ranking constraint is generated by randomly selecting a pair of samples

from two different clusters. Assume that samples $x_1$ and $x_2$ are picked from clusters $a$ and $b$, respectively. If $b > a$ holds, a ranked pair $x_2 \succ x_1$ is generated; otherwise, $x_1 \succ x_2$ is formed. For the traditional semisupervised clustering methods, i.e., MPCKmeans, CPCMMC, and SMMCPC, we investigate their performance on the benchmark data sets with both must-link and cannot-link constraints. As in [26] and [27], a must-link constraint is formed by randomly selecting two samples from the same cluster, and a cannot-link constraint is generated by choosing two samples from different clusters. For a fixed number of constraints, the results are averaged over 20 realizations with different constraints. These data sets are reported to perform well with the Gaussian kernel [6], [24], [47], and hence the Gaussian kernel is employed in the experiments. The kernel width is chosen from $10^{[-5:1:5]}$.

Table III presents the MZE on the benchmark data sets when the number of sample-ranking constraints is around $10\%n$, where $n$ is the sample size of the data set. For the traditional semisupervised clustering methods, i.e., MPCKmeans, CPCMMC, and SMMCPC, the numbers of must-link and cannot-link constraints are both $5\%n$, i.e., $10\%n$ in total. As seen from Table III, our proposed M2 SORC method delivers the lowest MZE on 8 out of 10 data sets, the exceptions being the Abalone and Computer data sets. Moreover, the semisupervised maximum margin clustering methods SMMCPC and CPCMMC show improved performance over the unsupervised maximum margin clustering method CPMMC, which is in line with the results obtained in [27]. WSVORC and MPCKmeans show higher MZE values than the other methods. Although WSVORC is designed for unsupervised OR clustering, it does not take the cluster size balance into account, which may lead to the trivially optimal solution where the samples are assigned to one cluster or very small clusters are formed.

Furthermore, we conduct the paired t-test on the benchmark data sets to compare our method and the baselines. The p-values are computed by performing the paired t-test on the MZE, comparing the baselines to our method M2 SORC under the null hypothesis that there is no difference between the MZE distributions. When the p-value is smaller than the confidence level 0.05, there is a significant difference between M2 SORC and the baseline. The average p-values of M2 SORC with MPCKmeans, CPMMC, CPCMMC, SMMCPC, and WSVORC on the benchmark data sets are 0.006, 0.008, 0.014, 0.027, and 0.002, respectively. It is seen that the p-values of M2 SORC with the baselines are markedly lower than 0.05, which implies that there is a significant difference between M2 SORC and the baselines.

Table IV shows the MAE for our method and the baselines. Compared with the MZE, the MAE can more appropriately represent how far a sample deviates from the targeted cluster. After investigating the details in Table IV, we can observe that our method attains the lowest MAE on all the benchmark data sets considered. Taking the AutoMPG data set as an example, the MAE of M2 SORC is 0.544, which is markedly lower than WSVORC (0.699), SMMCPC (0.647), CPCMMC (0.668), CPMMC (0.706), and MPCKmeans (0.712). The better performance of our method over the baselines confirms the effectiveness of our method in dealing with semisupervised OR clustering problems: our method is able to incorporate the sample ranking information into refining the OR model.

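The sample-ranking constraints used above are generated exactly as described at the beginning of this subsection: two samples are drawn from different clusters and the pair is ordered by their true ratings. A small illustrative sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def make_ranking_constraints(y, n_constraints, rng=None):
    """Return index pairs (i, j) meaning 'sample i should be in a higher
    ranked cluster than sample j', built from the true labels y by randomly
    pairing samples drawn from two different clusters."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    pairs = []
    while len(pairs) < n_constraints:
        i, j = rng.choice(len(y), size=2, replace=False)
        if y[i] == y[j]:
            continue                   # the pair must span two different clusters
        pairs.append((i, j) if y[i] > y[j] else (j, i))
    return pairs

y = [1, 3, 2, 5, 4, 1, 2]              # true ratings of 7 samples
print(make_ranking_constraints(y, n_constraints=5, rng=0))
```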
TABLE III
MZEs ON THE BENCHMARK DATA SETS

TABLE IV
MAEs ON THE BENCHMARK DATA SETS

TABLE V
AMAEs ON THE BENCHMARK DATA SETS

Table V shows the AMAE on the benchmark data sets. In real-world applications, a great number of OR data sets are imbalanced; thus, the AMAE is employed to measure the performance of the algorithms on imbalanced data sets. Compared with the MAE, which considers the absolute error over all the samples, the AMAE calculates the absolute errors per cluster, so that a small cluster counts as much as any other cluster [18]–[20], [52]. Table V also gives the cluster distributions of the benchmark data sets, and some of the data sets, e.g., Abalone, Boston, and MachineCPU, are imbalanced. In Table V, it is seen that the AMAE values on the Bank, California, Census, and Computer data sets are the same as the MAE. This is because these data sets are balanced, and the AMAE coincides with the MAE on balanced data sets [18]–[20], [52]. Moreover, it is seen that WSVORC obtains the worst AMAE on the benchmark data sets. WSVORC is designed for unsupervised OR clustering, but it

does not take the cluster size balance into account. Hence, the obtained result may be trivial optimal and the AMAE becomes high. Besides the MZE, MAE, and AMAE, Table VI reports the NMI values, which show the similarity of groupings by the cluster assignment and those by true sample labels. It can be seen that our method obtains the highest NMI on 9 out of 10 benchmark data sets, which indicates that our method can yield more similar clustering groupings to the true sample labels than the baselines. Moreover, although MPCKmeans has a smaller MZE than our method on the Computer data set, it can be seen from Table VI that the NMI value of our method on the Computer data set is larger than MPCKmeans. D. Results on Real-World Sentiment Data Sets The proposed M2 SORC method and the baselines are also evaluated on real-world sentiment data sets.2 The sentiment 2 Available at http://www.cs.jhu.edu/∼mdredze/datasets/sentiment/.
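For reference, the sketch below shows one way the evaluation measures reported in Tables III–VI and the paired t-test described above could be computed. It assumes that the clustering assignment and the true labels are expressed on the same ordinal index set; the numeric arrays are hypothetical placeholders, not results from the paper.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import normalized_mutual_info_score

def mze(y_true, y_pred):
    # Mean zero-one error: fraction of samples placed in the wrong cluster.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def mae(y_true, y_pred):
    # Mean absolute error between ordinal cluster indices.
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def amae(y_true, y_pred):
    # Average MAE: the MAE is computed per true cluster and then averaged,
    # so a small cluster weighs as much as a large one.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_cluster = [np.mean(np.abs(y_true[y_true == c] - y_pred[y_true == c]))
                   for c in np.unique(y_true)]
    return np.mean(per_cluster)

# Paired t-test over per-data-set MZE values of two methods (hypothetical numbers).
mze_baseline = np.array([0.42, 0.38, 0.51, 0.47])
mze_ours = np.array([0.35, 0.31, 0.44, 0.40])
t_stat, p_value = ttest_rel(mze_baseline, mze_ours)

# NMI between a cluster assignment and the true labels (permutation invariant).
nmi = normalized_mutual_info_score([0, 0, 1, 2], [1, 1, 0, 2])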



TABLE VI NMI ON THE BENCHMARK DATA SETS

Fig. 4. MZE (Books data set).

Fig. 5. MZE (DVDs data set).

The sentiment data sets are generated from Amazon.com and comprise four categories of product reviews: 1) Books; 2) DVDs; 3) Electronics; and 4) Kitchen appliances. Each product review is associated with one of five ordinal rating labels: 1, 2, 3, 4, and 5, where a higher rating means more positive review feedback. In the experiments, the review documents are first preprocessed as follows: 1) convert upper-case text to lower-case; 2) replace common nonword patterns with a unique identifier [e.g., mapping the smiley :-) to happy]; 3) remove HTML tags and any character that is neither alphanumeric nor punctuation; and 4) normalize contractions (e.g., transforming don't to do not). Then, function words on the SMART stop-list [53] are removed from the vocabulary, and the remaining words are stemmed.
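A minimal sketch of such a preprocessing pipeline is given below; NLTK's English stop-word list and Porter stemmer are used as stand-ins for the SMART stop-list and the unspecified stemmer, and the regular expressions are illustrative rather than the exact patterns used in the paper.

import re
from nltk.corpus import stopwords          # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))     # stand-in for the SMART stop-list [53]
STEM = PorterStemmer()

def preprocess(review: str) -> list[str]:
    text = review.lower()                              # 1) lower-case
    text = re.sub(r"[:;]-?[)d]", " happy ", text)      # 2) map smileys to a token
    text = re.sub(r"<[^>]+>", " ", text)               # 3) strip HTML tags
    text = re.sub(r"n't\b", " not", text)              # 4) expand contractions (crude)
    text = re.sub(r"[^a-z0-9\s.,!?']", " ", text)      # keep alphanumerics and punctuation
    tokens = re.findall(r"[a-z0-9']+", text)
    return [STEM.stem(t) for t in tokens if t not in STOP]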

Fig. 6. MZE (Electronics data set).

Fig. 7. MZE (Kitchen data set).

Each review document is represented as a term frequency–inverse document frequency (TF-IDF) feature vector, and each feature is normalized. Table II shows the sample sizes and feature sizes of the sentiment data sets. As shown in Table II, the data dimensions are relatively high, so principal component analysis, a well-known dimensionality reduction technique for clustering, is performed to reduce the data to 300 features. Hence, each review document is finally represented as a 300-D feature vector. Similar to the benchmark data sets, we form a number of constraints and report the results. Text data sets are known to usually perform well with the linear kernel, and as in [24] and [49], the linear kernel is applied on the sentiment data sets. Figs. 4–7 present the MZE when the number of constraints varies from 450 to 3000.
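The feature construction described above could look like the following sketch, shown on a toy corpus; the vectorizer defaults and the fallback on the number of components are assumptions, and TruncatedSVD is often preferred over PCA for large sparse TF-IDF matrices.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = ["great book , would read again", "slow and predictable plot"]  # toy reviews

# TF-IDF features; TfidfVectorizer L2-normalizes each document by default.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Reduce to 300 dimensions as described above; PCA needs a dense array, and the
# toy corpus has far fewer than 300 terms, hence the fallback on the component count.
n_components = min(300, X.shape[0] - 1, X.shape[1])
X_reduced = PCA(n_components=n_components).fit_transform(X.toarray())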


Fig. 8. MZE with different initialization points of CCCP.

Fig. 9. MZE with different numbers of CCCP iterations.

For MPCKmeans, CPCMMC, and SMMCPC, the constraints are evenly divided between the must-link and cannot-link constraints. On the one hand, the MZE of CPMMC and WSVORC remains unchanged, since they are unsupervised clustering methods that do not incorporate the constraints. Except for CPMMC and WSVORC, the MZE values of the other methods drop markedly as the number of constraints increases. This may be because, when more constraints are included in the learning process, more supervised information carried by the constraints can be utilized to improve the clustering performance. On the other hand, we can also observe that the MZE remains stable when the number of constraints becomes relatively large. This may be because not all constraints are equally important to the performance of the clustering model. As pointed out in [54] and [55], a sample located near the class boundary usually plays a more important role in building the model than one lying deep inside the class. Hence, a constraint that involves a sample around the class boundary may have a more decisive effect than one involving only samples inside the class. How to identify the constraints that include samples around the boundary, and how to decide the appropriate number of constraints, will be a valuable consideration in our future work.

Furthermore, the performance of our method with different initialization points of CCCP is also investigated. As presented in Section III-E, given an initial point (w^(0), Θ^(0)), CCCP computes (w^(t+1), Θ^(t+1)) from (w^(t), Θ^(t)) iteratively by replacing the nonconvex counterparts with their first-order Taylor expansions at (w^(t), Θ^(t)). In the experiments, we randomly initialize the starting point (w^(0), Θ^(0)) 10 times and investigate the variation of the MZE with different starting points on the Books data set. Here, the number of sample-ranking constraints is fixed at 1200. The results are shown in Fig. 8. The result is the same for 9 out of 10 starting points, which indicates that our method is relatively stable with respect to most starting points. On the seventh starting point, a higher MZE is observed, possibly because a locally optimal solution is obtained. Moreover, although only a local optimum is reached, the MZE on the seventh starting point is still markedly lower than that of the baselines.

E. Performance Variation With CCCP Iterations

The performance variation of M2 SORC when the number of CCCP iterations changes is investigated. Fig. 9 shows the MZE on the sentiment data sets. Here, the x-axis is the number of CCCP iterations, and the y-axis is the MZE value. As shown in Fig. 9, the MZE value decreases as the number of CCCP iterations grows. The CCCP technique solves the learning problem (3) by iteratively replacing the nonconvex counterparts with their first-order Taylor expansions. As the number of CCCP iterations increases, the solution moves closer to an optimal one and the clustering performance improves. Furthermore, the MZE remains relatively stable after a small number of CCCP iterations, which implies the convergence of M2 SORC: by applying the CCCP technique, M2 SORC converges after only a few iterations. Similar findings can also be observed in [30] and [31]. Using the termination criterion presented in Table I, M2 SORC stops on average after 4.58 iterations on the Books data set, 7.11 iterations on the DVDs data set, 3.23 iterations on the Electronics data set, and 6.47 iterations on the Kitchen data set.
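To make the iterative scheme concrete, here is a generic, hedged sketch of a CCCP outer loop; the surrogate solver and the toy objective are illustrative stand-ins, not the actual M2 SORC subproblem (which is the QP (35) derived in the Appendix).

import numpy as np

def cccp(x0, linearize_concave, solve_convex_surrogate, max_iter=50, tol=1e-4):
    # Generic CCCP loop: the concave part of the objective is replaced by its
    # first-order Taylor expansion at the current iterate, and the resulting
    # convex surrogate is solved exactly to produce the next iterate.
    x = np.asarray(x0, dtype=float)
    for t in range(1, max_iter + 1):
        slope = linearize_concave(x)             # subgradient of the concave part at x^(t)
        x_new = solve_convex_surrogate(slope)    # exact minimizer of the convex surrogate
        if np.linalg.norm(x_new - x) < tol:      # stop once successive iterates agree
            return x_new, t
        x = x_new
    return x, max_iter

# Toy usage: minimize f(x) = x^4 - x^2 by treating -x^2 as the concave part.
# The linearization of -x^2 at x_t has slope -2*x_t, and the surrogate
# x^4 + slope*x is minimized in closed form at x = cbrt(-slope/4).
sol, iters = cccp(
    x0=[1.0],
    linearize_concave=lambda x: -2.0 * x,
    solve_convex_surrogate=lambda s: np.cbrt(-s / 4.0),
)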

V. CONCLUSION

A. Contribution of This Paper

Most of the existing OR methods are proposed for supervised/semisupervised OR classification, and the semisupervised OR clustering problems have not been explicitly addressed. In this paper, we consider the semisupervised OR clustering problems with sample-ranking constraints and propose a maximum-margin-based approach, termed M2 SORC. To the best of our knowledge, this paper is the first attempt to address OR problems in a semisupervised clustering setting. Specifically, M2 SORC seeks a set of parallel hyperplanes to partition the unlabeled samples into clusters, and the optimization function is formulated to maximize the margin between the closest neighboring clusters. Moreover, a loss function associated with the sample-ranking constraints is proposed. The nonconvex optimization problem is then decomposed into a number of convex problems via CCCP. Substantial experiments have demonstrated that M2 SORC obtains explicitly better clustering accuracy than the traditional semisupervised clustering methods considered.

B. Limitations and Future Work



The learning problem (3) of M2 SORC is nonconvex and, as in [27] and [31], CCCP is applied to decompose the nonconvex problem into a number of QP problems (35). The size of each QP problem is N = n_u + n_c + 2k − 4, where n_u is the number of unlabeled samples, n_c is the number of sample-ranking constraints, and k is the number of clusters. Let T represent the number of CCCP iterations and t(N) be the running time taken to solve a QP problem of size N. The computational cost of M2 SORC is O(T t(N)). As seen in Section IV-E, T is typically small compared with the data set size. Furthermore, we utilize LibSVM [56] to solve the QP problems (35), and the computational cost of solving a QP problem of size N is about t(N) = O(N^2). Hence, the time complexity of M2 SORC is O(T N^2). When the number of unlabeled samples n_u or the number of sample-ranking constraints n_c is large, the running time may be relatively high. In the future, we will employ more efficient optimization methods [57], [58], e.g., Pegasos [59] and NORMA [60], to speed up M2 SORC. With these methods, the running time of solving a QP problem increases linearly with the number of nonzero features in each sample and does not depend directly on the data set size. They will be used to speed up the solution of the QP problems (35) on large-scale data sets.

Furthermore, we will extend the existing must-link constraints and present same-rank constraints. This paper addresses the sample-ranking constraints, where one sample is in a higher ranked cluster than another. In the future, we will investigate another kind of constraint, called the same-rank constraint, where two samples should be in the same rank. The same-rank constraints are similar to the must-link constraints in the existing semisupervised clustering work. Hence, we will introduce the ideas of must-link constraints, present different strategies to formulate the same-rank constraints, and validate their performance on real-world data sets.

Then, we will extend global optimization techniques [61], [62] to solve our nonconvex problem. This paper applies CCCP to solve the nonconvex problem (3). Although CCCP converges after only a few iterations, it is a local optimization technique and is not guaranteed to obtain the globally optimal solution. As future work, we would like to exploit global optimization techniques for solving the nonconvex problem. Depending on the real-world application, users can then choose between a global optimization technique and CCCP to trade off discriminatory accuracy against computational cost.

In addition, we will investigate the determination of the number of clusters. The selection of the number of clusters is a critical issue that all clustering methods confront. Following the same routine as the existing clustering methods [26], [27], [30], [31], [36], this paper focuses on the illustration of the proposed semisupervised OR clustering method. The determination of the number of clusters will be studied as future work. Finally, we will investigate how to determine the important sample-ranking constraints and decide the appropriate number of constraints.
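As a rough illustration of the O(T N^2) estimate above, the numbers below are hypothetical placeholders, not measurements from the paper.

# Hypothetical illustration of the O(T * N^2) cost estimate.
n_u, n_c, k = 2000, 200, 5          # unlabeled samples, ranking constraints, clusters
N = n_u + n_c + 2 * k - 4           # QP size: 2206 dual variables
T = 7                               # a number of CCCP iterations of the order observed above
relative_cost = T * N ** 2          # proportional to the quoted O(T N^2) bound
print(N, relative_cost)             # 2206 34065052 (up to a solver-dependent constant)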

APPENDIX

A. Derivations for Formula (4)

To solve problem (3), we first compute the first-order Taylor expansion of h(w, Θ, j) at (w^(t), Θ^(t)). However, h(w, Θ, j) is a nonsmooth function. Hence, we replace its gradient with the corresponding subgradient, as follows:

$$\frac{\partial h(\mathbf{w},\Theta,j)}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\frac{\partial h(\mathbf{w},\Theta,j)}{\partial g(\mathbf{w},\Theta,i,j)}\times\frac{\partial g(\mathbf{w},\Theta,i,j)}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=-\sum_{i=1}^{k}\theta_{ij}^{(t)}r_{ij}^{(t)}\mathbf{x}_j \tag{13}$$

$$\frac{\partial h(\mathbf{w},\Theta,j)}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\frac{\partial h(\mathbf{w},\Theta,j)}{\partial g(\mathbf{w},\Theta,i,j)}\times\frac{\partial g(\mathbf{w},\Theta,i,j)}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\frac{1}{2}\sum_{i=1}^{k}\theta_{ij}^{(t)}\big[\mathbf{e}_i\big(r_{ij}^{(t)}+1\big)+\mathbf{e}_{i-1}\big(r_{ij}^{(t)}-1\big)\big] \tag{14}$$

where

$$\theta_{ij}^{(t)}=\begin{cases}1, & \text{if } i=\arg\max_i g(\mathbf{w}^{(t)},\Theta^{(t)},i,j)\\ 0, & \text{otherwise}\end{cases}\qquad r_{ij}^{(t)}=\begin{cases}1, & \text{if } (\mathbf{w}^{(t)})^{\mathrm T}\mathbf{x}_j-(\Theta^{(t)})^{\mathrm T}\dfrac{\mathbf{e}_i+\mathbf{e}_{i-1}}{2}\ge 0\\ -1, & \text{otherwise.}\end{cases}$$

By substituting (13) and (14), the first-order Taylor expansion of h(w, Θ, j) at (w^(t), Θ^(t)) is obtained as

$$h(\mathbf{w},\Theta,j)\approx h(\mathbf{w}^{(t)},\Theta^{(t)},j)+(\mathbf{w}-\mathbf{w}^{(t)})^{\mathrm T}\frac{\partial h}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}+(\Theta-\Theta^{(t)})^{\mathrm T}\frac{\partial h}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=-\mathbf{w}^{\mathrm T}\mathbf{x}_j z_j^{(t)}+\Theta^{\mathrm T}\mathbf{v}_j^{(t)} \tag{15}$$

where $z_j^{(t)}=\sum_{i=1}^{k}\theta_{ij}^{(t)}r_{ij}^{(t)}\in\mathbb{R}$ and $\mathbf{v}_j^{(t)}=\frac{1}{2}\sum_{i=1}^{k}\theta_{ij}^{(t)}\big[\mathbf{e}_i\big(r_{ij}^{(t)}+1\big)+\mathbf{e}_{i-1}\big(r_{ij}^{(t)}-1\big)\big]\in\mathbb{R}^{k+1}$.

B. Derivations for Formula (5)

We calculate the first-order Taylor expansion of d(w, Θ, j) at (w^(t), Θ^(t)). Similar to h(w, Θ, j), we can obtain

$$\frac{\partial d(\mathbf{w},\Theta,j)}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\frac{\partial d(\mathbf{w},\Theta,j)}{\partial s(\mathbf{w},\Theta,i,j)}\times\frac{\partial s(\mathbf{w},\Theta,i,j)}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big[\big(\eta_{ij}^{(t)}-1\big)\mathbf{x}_{j1}+\big(\mu_{ij}^{(t)}+1\big)\mathbf{x}_{j2}\big] \tag{16}$$

$$\frac{\partial d(\mathbf{w},\Theta,j)}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\frac{\partial d(\mathbf{w},\Theta,j)}{\partial s(\mathbf{w},\Theta,i,j)}\times\frac{\partial s(\mathbf{w},\Theta,i,j)}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=-\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}+\mu_{ij}^{(t)}\big)\mathbf{e}_i \tag{17}$$

where

$$\kappa_{ij}^{(t)}=\begin{cases}1, & \text{if } i=\arg\min_{i=1,\ldots,k-1} s(\mathbf{w}^{(t)},\Theta^{(t)},i,j)\\ 0, & \text{otherwise}\end{cases}\qquad \eta_{ij}^{(t)}=\begin{cases}1, & \text{if } (\mathbf{w}^{(t)})^{\mathrm T}\mathbf{x}_{j1}-(\Theta^{(t)})^{\mathrm T}\mathbf{e}_i\ge 0\\ -1, & \text{otherwise}\end{cases}\qquad \mu_{ij}^{(t)}=\begin{cases}1, & \text{if } (\mathbf{w}^{(t)})^{\mathrm T}\mathbf{x}_{j2}-(\Theta^{(t)})^{\mathrm T}\mathbf{e}_i\ge 0\\ -1, & \text{otherwise.}\end{cases}$$

Substituting (16) and (17), the first-order Taylor expansion of d(w, Θ, j) at (w^(t), Θ^(t)) is given by

$$d(\mathbf{w},\Theta,j)\approx d(\mathbf{w}^{(t)},\Theta^{(t)},j)+(\mathbf{w}-\mathbf{w}^{(t)})^{\mathrm T}\frac{\partial d}{\partial \mathbf{w}}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}+(\Theta-\Theta^{(t)})^{\mathrm T}\frac{\partial d}{\partial \Theta}\Big|_{\mathbf{w}=\mathbf{w}^{(t)},\,\Theta=\Theta^{(t)}}=\mathbf{w}^{\mathrm T}\big(p_j^{(t)}\mathbf{x}_{j1}-q_j^{(t)}\mathbf{x}_{j2}\big)-\Theta^{\mathrm T}\mathbf{u}_j^{(t)} \tag{18}$$

where $\mathbf{u}_j^{(t)}=\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}+\mu_{ij}^{(t)}\big)\mathbf{e}_i\in\mathbb{R}^{k+1}$, $p_j^{(t)}=\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}-1\big)\in\mathbb{R}$, and $q_j^{(t)}=-\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big(\mu_{ij}^{(t)}+1\big)\in\mathbb{R}$.

C. Proof for Proposition 1

By introducing the Lagrangian multipliers α_j ≥ 0, β_j ≥ 0, ρ_i ≥ 0, π_i ≥ 0, σ_j ≥ 0, and ε_j ≥ 0, the Lagrange function of problem (6) can be obtained. Differentiating the Lagrange function with respect to w, Θ, ξ_j, and ζ_j, respectively, leads to

$$\frac{\partial L}{\partial \mathbf{w}}=\mathbf{w}+\sum_{j=1}^{n_u}\alpha_j z_j^{(t)}\mathbf{x}_j+\sum_{j=1}^{n_c}\beta_j\big(p_j^{(t)}\mathbf{x}_{j1}-q_j^{(t)}\mathbf{x}_{j2}\big)-\sum_{i=1}^{k-2}\rho_i m_i\sum_{j=1}^{n_u}\mathbf{x}_j+\sum_{i=1}^{k-2}\pi_i m_i\sum_{j=1}^{n_u}\mathbf{x}_j=\mathbf{0} \tag{19}$$

$$\frac{\partial L}{\partial \Theta}=-\sum_{j=1}^{n_u}\alpha_j\mathbf{v}_j^{(t)}-\sum_{j=1}^{n_c}\beta_j\mathbf{u}_j^{(t)}+\sum_{i=1}^{k-2}\rho_i\Big(\frac{n_u\mathbf{e}_i}{k-2i}-\frac{n_u\mathbf{e}_{i+1}}{k-2i-2}\Big)-\sum_{i=1}^{k-2}\pi_i\Big(\frac{n_u\mathbf{e}_i}{k-2i}-\frac{n_u\mathbf{e}_{i+1}}{k-2i-2}\Big)=\mathbf{0} \tag{20}$$

$$\frac{\partial L}{\partial \xi_j}=C_1-\alpha_j-\sigma_j=0,\quad j=1,\ldots,n_u \tag{21}$$

$$\frac{\partial L}{\partial \zeta_j}=C_2-\beta_j-\epsilon_j=0,\quad j=1,\ldots,n_c \tag{22}$$

where $m_i=\frac{1}{k-2i}-\frac{1}{k-2i-2}$. Equations (19), (21), and (22) result in the following formulas:

$$\mathbf{w}=-\sum_{j=1}^{n_u}\alpha_j z_j^{(t)}\mathbf{x}_j-\sum_{j=1}^{n_c}\beta_j\big(p_j^{(t)}\mathbf{x}_{j1}-q_j^{(t)}\mathbf{x}_{j2}\big)+\sum_{i=1}^{k-2}\rho_i m_i\sum_{j=1}^{n_u}\mathbf{x}_j-\sum_{i=1}^{k-2}\pi_i m_i\sum_{j=1}^{n_u}\mathbf{x}_j \tag{23}$$

$$0\le\alpha_j\le C_1 \tag{24}$$

$$0\le\beta_j\le C_2. \tag{25}$$

After substituting $\mathbf{v}_j^{(t)}=\frac{1}{2}\sum_{i=1}^{k}\theta_{ij}^{(t)}[\mathbf{e}_i(r_{ij}^{(t)}+1)+\mathbf{e}_{i-1}(r_{ij}^{(t)}-1)]\in\mathbb{R}^{k+1}$ and $\mathbf{u}_j^{(t)}=\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}(\eta_{ij}^{(t)}+\mu_{ij}^{(t)})\mathbf{e}_i\in\mathbb{R}^{k+1}$ into (20), it can be seen that ∂L/∂Θ is a linear combination of the e_i and is a (k+1)-dimensional column vector. Since e_i is the (k+1)-dimensional column vector whose ith element is 1 and whose other elements are 0, we can separate (20) into k+1 equations according to the different e_i. The k+1 equations acquired from (20) for e_0, e_1, ..., e_k are presented in (26)–(30), respectively.

For e_0, we have the following equation:

$$\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{1j}^{(t)}\big(r_{1j}^{(t)}-1\big)=0. \tag{26}$$

For e_1, (27) can be obtained

$$-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{1j}^{(t)}\big(r_{1j}^{(t)}+1\big)-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{2j}^{(t)}\big(r_{2j}^{(t)}-1\big)-\sum_{j=1}^{n_c}\beta_j\kappa_{1j}^{(t)}\big(\eta_{1j}^{(t)}+\mu_{1j}^{(t)}\big)+(\rho_1-\pi_1)\frac{n_u}{k-2}=0. \tag{27}$$

For e_i (i = 2, ..., k − 2), we can deduce

$$-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{ij}^{(t)}\big(r_{ij}^{(t)}+1\big)-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{i+1,j}^{(t)}\big(r_{i+1,j}^{(t)}-1\big)-\sum_{j=1}^{n_c}\beta_j\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}+\mu_{ij}^{(t)}\big)+(\rho_i-\rho_{i-1}+\pi_{i-1}-\pi_i)\frac{n_u}{k-2i}=0. \tag{28}$$

For e_{k−1}, the following equation can be acquired:

$$-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{k-1,j}^{(t)}\big(r_{k-1,j}^{(t)}+1\big)-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{kj}^{(t)}\big(r_{kj}^{(t)}-1\big)-\sum_{j=1}^{n_c}\beta_j\kappa_{k-1,j}^{(t)}\big(\eta_{k-1,j}^{(t)}+\mu_{k-1,j}^{(t)}\big)-(\rho_{k-2}-\pi_{k-2})\frac{n_u}{2-k}=0. \tag{29}$$

For e_k, we can have

$$\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\theta_{kj}^{(t)}\big(r_{kj}^{(t)}+1\big)=0. \tag{30}$$

Summing the k − 3 equations (28) for e_2, e_3, ..., e_{k−2} gives rise to another equation, as follows:

$$-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\sum_{i=2}^{k-2}\theta_{ij}^{(t)}\big(r_{ij}^{(t)}+1\big)-\frac{1}{2}\sum_{j=1}^{n_u}\alpha_j\sum_{i=2}^{k-2}\theta_{i+1,j}^{(t)}\big(r_{i+1,j}^{(t)}-1\big)-\sum_{j=1}^{n_c}\beta_j\sum_{i=2}^{k-2}\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}+\mu_{ij}^{(t)}\big)+\sum_{i=2}^{k-2}(\rho_i-\pi_i)\frac{n_u}{k-2i}-\sum_{i=2}^{k-2}(\rho_{i-1}-\pi_{i-1})\frac{n_u}{k-2i}=0. \tag{31}$$

Combining (26), (27), and (29)–(31), and replacing $\frac{1}{k-2i}-\frac{1}{k-2i-2}$ with $m_i$, lead to the following equation:

$$-\sum_{j=1}^{n_u}\alpha_j\sum_{i=1}^{k}\theta_{ij}^{(t)}r_{ij}^{(t)}-\sum_{j=1}^{n_c}\beta_j\sum_{i=1}^{k-1}\kappa_{ij}^{(t)}\big(\eta_{ij}^{(t)}+\mu_{ij}^{(t)}\big)+\sum_{i=1}^{k-2}\rho_i m_i n_u-\sum_{i=1}^{k-2}\pi_i m_i n_u=0. \tag{32}$$

Let us define

$$E^{(t)}=\Big\{-\sum_{i=1}^{k}\theta_{i1}^{(t)}r_{i1}^{(t)},\ldots,-\sum_{i=1}^{k}\theta_{in_u}^{(t)}r_{in_u}^{(t)},\ -\sum_{i=1}^{k-1}\kappa_{i1}^{(t)}\big(\eta_{i1}^{(t)}+\mu_{i1}^{(t)}\big),\ldots,-\sum_{i=1}^{k-1}\kappa_{in_c}^{(t)}\big(\eta_{in_c}^{(t)}+\mu_{in_c}^{(t)}\big),\ m_1 n_u,\ldots,m_{k-2}n_u,\ -m_1 n_u,\ldots,-m_{k-2}n_u\Big\}^{\mathrm T}\in\mathbb{R}^{n_u+n_c+2k-4}$$

and

$$Q^{(t)}=\Big\{-z_1^{(t)}\mathbf{x}_1,\ldots,-z_{n_u}^{(t)}\mathbf{x}_{n_u},\ q_1^{(t)}\mathbf{x}_{12}-p_1^{(t)}\mathbf{x}_{11},\ldots,q_{n_c}^{(t)}\mathbf{x}_{n_c 2}-p_{n_c}^{(t)}\mathbf{x}_{n_c 1},\ m_1\sum_{j=1}^{n_u}\mathbf{x}_j,\ldots,m_{k-2}\sum_{j=1}^{n_u}\mathbf{x}_j,\ -m_1\sum_{j=1}^{n_u}\mathbf{x}_j,\ldots,-m_{k-2}\sum_{j=1}^{n_u}\mathbf{x}_j\Big\}\in\mathbb{R}^{(n_u+n_c+2k-4)\times d}$$

where d is the dimension of the data. Moreover, let $U=\{\alpha_1,\ldots,\alpha_{n_u},\beta_1,\ldots,\beta_{n_c},\rho_1,\ldots,\rho_{k-2},\pi_1,\ldots,\pi_{k-2}\}^{\mathrm T}\in\mathbb{R}^{n_u+n_c+2k-4}$, and let u_i be the ith element of U. Equations (19) and (32) then become

$$\mathbf{w}=(Q^{(t)})^{\mathrm T}U \tag{33}$$

$$U^{\mathrm T}E^{(t)}=0. \tag{34}$$

By substituting (24), (25), (33), and (34) into the Lagrange function, and replacing α_j, β_j, ρ_j, and π_j with u_j, the dual form of problem (6) can be obtained as

$$\begin{aligned}
\max_{U}\quad & -U^{\mathrm T}Q^{(t)}(Q^{(t)})^{\mathrm T}U+\sum_{j=1}^{n_u}u_j\\
\text{s.t.}\quad & U^{\mathrm T}E^{(t)}=0\\
& 0\le u_j\le C_1,\quad j=1,\ldots,n_u\\
& 0\le u_j\le C_2,\quad j=n_u+1,\ldots,n_u+n_c.
\end{aligned} \tag{35}$$
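As a structural sanity check on the dual (35), a generic convex solver can handle a problem of this form directly. The sketch below uses CVXPY with small random placeholders for Q^(t), E^(t), C_1, and C_2 (the paper itself solves (35) with LibSVM), and it additionally imposes nonnegativity on the ρ and π blocks as the multiplier conditions require.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_u, n_c, k = 8, 4, 3
N = n_u + n_c + 2 * k - 4
d = 5
Q = rng.standard_normal((N, d))     # placeholder for Q^(t); rows index the dual variables
E = rng.standard_normal(N)          # placeholder for E^(t)
C1, C2 = 1.0, 1.0

u = cp.Variable(N)
# Objective of (35): maximize -U^T Q Q^T U + sum of the first n_u entries of U.
objective = cp.Maximize(-cp.sum_squares(Q.T @ u) + cp.sum(u[:n_u]))
constraints = [E @ u == 0,                 # U^T E^(t) = 0
               u >= 0,                     # all multipliers nonnegative (assumed)
               u[:n_u] <= C1,              # alpha block upper-bounded by C1
               u[n_u:n_u + n_c] <= C2]     # beta block upper-bounded by C2
cp.Problem(objective, constraints).solve()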

R EFERENCES [1] R. Herbrich, T. Graepel, and K. Obermayer, “Support vector learning for ordinal regression,” in Proc. 9th Int. Conf. Artif. Neural Netw., Edinburgh, U.K., Sep. 1999, pp. 97–102. [2] W. Chu and S. S. Keerthi, “New approaches to support vector ordinal regression,” in Proc. 22nd Int. Conf. Mach. Learn., Bonn, Germany, Aug. 2005, pp. 145–152. [3] L. Li and H.-T. Lin, “Ordinal regression by extended binary classification,” in Proc. 19th Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2006, pp. 865–872. [4] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.

[5] K. Zhang, I. W. Tsang, and J. T. Kwok, “Maximum margin clustering made practical,” IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 583–596, Apr. 2009. [6] W. Chu and S. S. Keerthi, “Support vector ordinal regression,” Neural Comput., vol. 19, no. 3, pp. 792–815, 2007. [7] J. S. Cardoso and J. F. P. da Costa, “Learning to classify ordinal data: The data replication method,” J. Mach. Learn. Res., vol. 8, no. 12, pp. 1393–1429, 2007. [8] A. Shashua and A. Levin, “Ranking with large margin principle: Two approaches,” in Proc. 15th Conf. Neural Inf. Process. Syst., Cambridge, U.K., Dec. 2002, pp. 937–944. [9] H. Yan, “Cost-sensitive ordinal regression for fully automatic facial beauty assessment,” Neurocomputing, vol. 129, pp. 334–342, Apr. 2014. [10] H. Wu, H. Lu, and S. Ma, “A practical SVM-based algorithm for ordinal regression in image retrieval,” in Proc. 11th ACM Int. Conf. Multimedia, Berkeley, CA, USA, Nov. 2003, pp. 612–621. [11] Z.-S. Chen, J.-S. R. Jang, and C.-H. Lee, “A kernel framework for content-based artist recommendation system in music,” IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1371–1380, Dec. 2011. [12] H.-T. Lin and L. Li, “Reduction from cost-sensitive ordinal ranking to weighted binary classification,” Neural Comput., vol. 24, no. 5, pp. 1329–1367, 2012. [13] S. Yu, K. Yu, V. Tresp, and H.-P. Kriegel, “Collaborative ordinal regression,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 1089–1096. [14] W. Kotlwski and R. Slowi´nski, “On nonparametric ordinal classification with monotonicity constraints,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 11, pp. 2576–2589, Nov. 2013. [15] B.-Y. Sun, J. Li, D. D. Wu, X.-M. Zhang, and W.-B. Li, “Kernel discriminant learning for ordinal regression,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 6, pp. 906–910, Jun. 2010. [16] J. F. P. da Costa, H. Alonso, and J. S. Cardoso, “The unimodal model for the classification of ordinal data,” Neural Netw., vol. 21, no. 1, pp. 78–91, 2008. [17] W. Chu and Z. Ghahramani, “Gaussian processes for ordinal regression,” J. Mach. Learn. Res., vol. 6, pp. 1019–1041, Dec. 2005. [18] F. Fernández-Navarro, A. Riccardi, and S. Carloni, “Ordinal neural networks without iterative tuning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 11, pp. 2075–2085, Nov. 2014. [19] F. Fernández-Navarro, P. A. Gutiérrez, C. Hervás-Martínez, and X. Yao, “Negative correlation ensemble learning for ordinal regression,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 11, pp. 1836–1849, Nov. 2013. [20] M. Perez-Orti´z, P. A. Gutierrez, and C. Hervás-Martínez, “Projectionbased ensemble learning for ordinal regression,” IEEE Trans. Cybern., vol. 44, no. 5, pp. 681–694, May 2014. [21] A. Riccardi, F. Fernández-Navarro, and S. Carloni, “Cost-sensitive AdaBoost algorithm for ordinal regression based on extreme learning machine,” IEEE Trans. Cybern., vol. 44, no. 10, pp. 1898–1909, Oct. 2014. [22] Y. Liu, Y. Liu, S. Zhong, and K. C. C. Chan, “Semi-supervised manifold ordinal regression for image ranking,” in Proc. 19th ACM Int. Conf. Multimedia, Scottsdale, AZ, USA, Nov. 2011, pp. 1393–1396. [23] P. K. Srijith, S. Shevade, and S. Sundararajan, “Semi-supervised Gaussian process ordinal regression,” in Proc. 24th Eur. Conf. Mach. Learn., Prague, Czech Republic, Sep. 2013, pp. 144–159. [24] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, “Transductive ordinal regression,” IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 7, pp. 1074–1086, Jul. 2012. [25] G. Liu, Y. Wu, and L. 
Yang, “Weighted ordinal support vector clustering,” in Proc. 1st Int. Multi-Symp. Comput. Comput. Sci., Hangzhou, China, Jun. 2006, pp. 743–745. [26] Y. Hu, J. Wang, N. Yu, and X.-S. Hua, “Maximum margin clustering with pairwise constraints,” in Proc. 8th Int. Conf. Data Mining, Pisa, Italy, Dec. 2008, pp. 253–262. [27] H. Zeng and Y.-M. Cheung, “Semi-supervised maximum margin clustering with pairwise constraints,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 5, pp. 926–939, May 2012. [28] M. Bilenko, S. Basu, and R. J. Mooney, “Integrating constraints and metric learning in semi-supervised clustering,” in Proc. 21st Int. Conf. Mach. Learn., Banff, AB, Canada, Jul. 2004, pp. 81–88. [29] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum margin clustering,” in Proc. 17th Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2005, pp. 1537–1544. [30] D. Zhang, F. Wang, L. Si, and T. Li, “Maximum margin multiple instance clustering with applications to image and text clustering,” IEEE Trans. Neural Netw., vol. 22, no. 5, pp. 739–751, May 2011.


[31] B. Zhao, F. Wang, and C. Zhang, “Linear time maximum margin clustering,” IEEE Trans. Neural Netw., vol. 21, no. 2, pp. 319–332, Feb. 2010. [32] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann, “Kernel methods for missing variables,” in Proc. 10th Int. Workshop Artif. Intell. Statist., 2005, pp. 325–332. [33] B. K. Sriperumbudur and G. R. G. Lanckriet, “A proof of convergence of the concave-convex procedure using Zangwill’s theory,” Neural Comput., vol. 24, no. 6, pp. 1391–1407, 2012. [34] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural Comput., vol. 15, no. 4, pp. 915–936, 2003. [35] B. K. Sriperumbudur and G. R. G. Lanckriet, “On the convergence of the concave-convex procedure,” in Proc. 22nd Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2009, pp. 1759–1767. [36] B. Zhao, F. Wang, and C. Zhang, “Efficient multiclass maximum margin clustering,” in Proc. 25th Int. Conf. Mach. Learn., Helsinki, Finland, Jun. 2008, pp. 1248–1255. [37] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading convexity for scalability,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 201–208. [38] Z.-H. Zhou and J.-M. Xu, “On the relation between multi-instance learning and semi-supervised learning,” in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, USA, Jun. 2007, pp. 1167–1174. [39] T. Zhang, C. Xu, G. Zhu, S. Liu, and H. Lu, “A generic framework for video annotation via semi-supervised learning,” IEEE Trans. Multimedia, vol. 14, no. 4, pp. 1206–1219, Aug. 2012. [40] B. Z. Yao, B. X. Nie, Z. Liu, and S.-C. Zhu, “Animated pose templates for modeling and detecting human actions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 3, pp. 436–452, Mar. 2014. [41] C.-N. J. Yu and T. Joachims, “Learning structural SVMs with latent variables,” in Proc. 26th Int. Conf. Mach. Learn., Montreal, QC, Canada, Jun. 2009, pp. 1169–1176. [42] G. Fung and O. L. Mangasarian, “Semi-supervised support vector machines for unlabeled data classification,” Optim. Methods Softw., vol. 15, no. 1, pp. 29–44, 2001. [43] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Large scale transductive SVMs,” J. Mach. Learn. Res., vol. 7, pp. 1687–1712, Dec. 2006. [44] X. Zhu and A. B. Goldberg, “Semi-supervised regression with order preferences,” Dept. Comput. Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. 1578, 2006. [45] X. Zhu and A. Goldberg, “Kernel regression with order preferences,” in Proc. 22nd AAAI Conf. Artif. Intell., Vancouver, BC, Canada, Jul. 2007, pp. 681–687. [46] H. Yu, S. Kim, and S. Na, “RankSVR: Can preference data help regression?” in Proc. 19th ACM Conf. Inf. Knowl. Manage., Toronto, ON, Canada, Oct. 2010, pp. 879–888. [47] B. Zhao, F. Wang, and C. Zhang, “Block-quantized support vector ordinal regression,” IEEE Trans. Neural Netw., vol. 20, no. 5, pp. 882–890, May 2009. [48] S. K. Shevade and W. Chu, “Minimum enclosing spheres formulations for support vector ordinal regression,” in Proc. 6th Int. Conf. Data Mining, Hong Kong, Dec. 2006, pp. 1054–1058. [49] C.-W. Seah, I. W. Tsang, and Y.-S. Ong, “Transfer ordinal label learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 11, pp. 1863–1876, Nov. 2013. [50] F. Fernández-Navarro, P. Campoy-Muñoz, M. de la Paz-Marín, C. Hervás-Martínez, and X. Yao, “Addressing the EU sovereign ratings using an ordinal regression approach,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 2228–2240, Dec. 2013. [51] C. H. Papadimitriou and K. 
Steiglitz, Combinatorial Optimization: Algorithms and Complexity. New York, NY, USA: Dover, 1998. [52] S. Baccianella, A. Esuli, and F. Sebastiani, “Evaluation measures for ordinal regression,” in Proc. 9th Int. Conf. Intell. Syst. Design Appl., Pisa, Italy, Nov./Dec. 2009, pp. 283–287. [53] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Boston, MA, USA: Addison-Wesley, 1989. [54] N. Panda, E. Y. Chang, and G. Wu, “Concept boundary detection for speeding up SVMs,” in Proc. 23rd Int. Conf. Mach. Learn., Pittsburgh, PA, USA, Jun. 2006, pp. 681–688.


[55] H. P. Graf, E. Cosatto, L. Bottou, I. Dourdanovic, and V. Vapnik, “Parallel support vector machines: The cascade SVM,” in Proc. 17th Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2004, pp. 521–528. [56] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. [Online]. Available: http://www.csie.ntu.edu.tw/cjlin/libsvm, accessed Apr. 2011. [57] T. Joachims, T. Finley, and C.-N. J. Yu, “Cutting-plane training of structural SVMs,” Mach. Learn., vol. 77, no. 1, pp. 27–59, 2009. [58] V. Tresp, “A Bayesian committee machine,” Neural Comput., vol. 12, no. 11, pp. 2719–2741, 2000. [59] S. Shalev-Shwartz, Y. Singer, and N. Srebro, “Pegasos: Primal estimated sub-gradient solver for SVM,” in Proc. 24th Int. Conf. Mach. Learn., Corvallis, OR, USA, Jun. 2007, pp. 807–814. [60] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online learning with kernels,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004. [61] C. A. Floudas and C. E. Gounaris, “A review of recent advances in global optimization,” J. Global Optim., vol. 45, no. 1, pp. 3–38, 2009. [62] R. Horst, P. M. Pardalos, and N. Van Thoai, Introduction to Global Optimization, 2nd ed. Dordrecht, The Netherlands: Kluwer, 2000.

Yanshan Xiao received the Ph.D. degree in computer science from the University of Technology, Sydney, NSW, Australia, in 2011. She is currently with the School of Computers, Guangdong University of Technology, Guangzhou, China. She has authored papers in the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS and the IEEE TRANSACTIONS ON CYBERNETICS. Her current research interests include multiple-instance learning, support vector machines, and data mining.

Bo Liu is currently with the School of Automation, Guangdong University of Technology, Guangzhou, China. He has authored papers in the IEEE TRANSACTIONS ON NEURAL NETWORKS and the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. His current research interests include machine learning and data mining.

Zhifeng Hao is currently a Professor with the School of Computers, Guangdong University of Technology, Guangzhou, China. His current research interests include design and analysis of algorithms, mathematical modeling, and combinatorial optimization.
