This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON CYBERNETICS


Semi-Supervised Text Classification With Universum Learning Chien-Liang Liu, Wen-Hoar Hsaio, Chia-Hoang Lee, Tao-Hsing Chang, and Tsung-Hsun Kuo

Abstract—Universum, a collection of nonexamples that do not belong to any class of interest, has become a new research topic in machine learning. This paper devises a semi-supervised learning with Universum algorithm based on the boosting technique, and focuses on situations where only a few labeled examples are available. We also show that the training error of AdaBoost with Universum is bounded by the product of the normalization factors, and that the training error drops exponentially fast when each weak classifier is slightly better than random guessing. Finally, the experiments use four data sets with several combinations. Experimental results indicate that the proposed algorithm can benefit from Universum examples and outperform several alternative methods, particularly when insufficient labeled examples are available. When the number of labeled examples is insufficient to estimate the parameters of the classification functions, the Universum can be used to approximate the prior distribution of the classification functions. The experimental results can be explained using the concept of Universum introduced by Vapnik, that is, Universum examples implicitly specify a prior distribution on the set of classification functions.

Index Terms—AdaBoost, learning with Universum, text classification.

I. INTRODUCTION

Text classification has become important given the strong growth in the volume of text documents available on the Internet. Supervised learning is the main approach to this problem, and numerous state-of-the-art supervised learning algorithms have been proposed and successfully used in text classification [1]–[3]. However, a problem in applying supervised learning methods to real-world problems is the cost of obtaining sufficient labeled training data, since supervised learning methods often require a prohibitively large number of labeled training examples to learn accurately. Labeling is time-consuming and typically done manually.

Manuscript received June 13, 2014; revised November 26, 2014 and February 2, 2015; accepted February 3, 2015. This work was supported in part by the National Science Council under Grant NSC-103-2221-E-009-153, and in part by the Aim for the Top University Project of the National Taiwan Normal University and the Ministry of Education, Taiwan. This paper was recommended by Associate Editor Y. Jin. C.-L. Liu is with the Computational Intelligence Technology Center, Industrial Technology Research Institute, Hsinchu 31040, Taiwan. W.-H. Hsaio and C.-H. Lee are with the Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan (e-mail: [email protected]). T.-H. Chang is with the Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 807, Taiwan. T.-H. Kuo is with Acer Inc., Hsinchu 221, Taiwan. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2015.2403573

Conversely, unlabeled data is relatively easy to collect, and many algorithms and experimental results have demonstrated that it can considerably improve learning accuracy in certain practical problems [4]. Consequently, semi-supervised learning, which involves learning from a combination of both labeled and unlabeled data, has recently attracted significant interest [5]–[12]. Semi-supervised learning often uses one of the following assumptions to model the structure of the underlying data distribution [13]. First, the classification function should be smooth with respect to the intrinsic structure revealed by the known labeled and unlabeled data. Numerous studies view smoothness as a constrained optimization problem, in which a regularization term is tailored to the problem at hand [14]–[16]. Second, points on the same cluster or manifold frequently share the same label. Consequently, the classification functions are naturally defined only on the sub-manifold in question rather than the entire ambient space [17]–[19]. Besides the above two assumptions, prior knowledge is another source of information that can improve classification performance. While prior knowledge has proven useful for classification [20], [21], it is notoriously difficult to apply in practice because of the difficulty of specifying it explicitly. The Universum, introduced by Vapnik [22], provides a novel means to encode prior knowledge by giving examples. The Universum is a collection of examples that do not belong to any category of interest, but do belong to the same domain as the problem. Weston et al. [23] devised an algorithm called the U-support vector machine (U-SVM) to demonstrate that using the Universum as a penalty term of the standard SVM [24] objective function can enhance classification performance. To our knowledge, this paper explores, for the first time, how Universum impacts semi-supervised learning with insufficient labeled examples.
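For reference, the U-SVM objective mentioned above is commonly written as follows. The notation here is our reconstruction (with hinge loss $H$ and $\varepsilon$-insensitive loss $U_\varepsilon$), not a quotation from [23]:

```latex
\min_{\mathbf{w},\, b}\;
\frac{1}{2}\lVert \mathbf{w} \rVert^{2}
+ C \sum_{i=1}^{\ell} H\!\left[\, y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \right]
+ C_U \sum_{j=1}^{|U|} U_\varepsilon\!\left[\, \mathbf{w}^{\top}\mathbf{x}^{*}_j + b \right],
\qquad
U_\varepsilon[t] = \max\left(0,\; |t| - \varepsilon\right)
```

where the $\mathbf{x}^{*}_j$ are Universum examples; the $\varepsilon$-insensitive term pushes the Universum examples toward the decision boundary rather than assigning them to either class.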
Both semi-supervised learning and learning with Universum rely on unlabeled examples to improve classification performance. Learning with Universum and semi-supervised learning differ mainly in the distribution of the unlabeled examples. Semi-supervised learning requires the unlabeled examples to share the same distribution as the labeled ones, and attempts to learn a model using a few labeled examples and numerous target unlabeled examples. Conversely, learning with Universum uses examples whose distribution differs from that of the target ones, and aims to use the Universum examples to estimate prior model information. Intuitively, the Universum examples should be close to the decision hyperplane, since they do not belong to any

2168-2267 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.



Fig. 1. (a) Example of semi-supervised learning with two classes, where distinct symbols represent positive and negative labeled examples; the others are unlabeled examples. (b) Example of semi-supervised learning with Universum, in which a third symbol denotes Universum examples. With the help of the Universum, the learned hyperplane correctly classifies an example misclassified by the semi-supervised classifier.

class. Thus, U-SVM is designed to minimize the empirical classification loss on the target examples rather than to give clear classification assignments for the Universum examples. Most learning with Universum methods use the margin to explain the Universum and devise algorithms inspired by U-SVM. Compared with most previous work on the Universum, this investigation uses the boosting technique to devise a semi-supervised learning with Universum algorithm. Analysis shows that using the Universum can improve classification performance, particularly when insufficient labeled examples are available. Although semi-supervised learning has achieved considerable success in the domain of machine learning, the availability of only a few labeled examples may affect classification performance. As shown in Fig. 1(a), the classifier fails to learn a robust hyperplane with insufficient labeled examples and a large number of unlabeled examples. In contrast, the Universum can provide a means to model the prior of the classification functions, and the learning algorithm can obtain a robust decision hyperplane by using the Universum to adjust the hyperplane when insufficient labeled examples are available. The above example inspires us to design a semi-supervised learning algorithm with Universum to enhance semi-supervised classification performance, particularly under conditions of insufficient labeled examples. To analyze the study findings, the experiments use different percentages of labeled examples to analyze the influence of the Universum on classification performance. Once more labeled examples become available, the benefits from the Universum diminish. The experimental results relate the proposed method to Bayesian approaches. Finally, the experiments use four data sets with several combinations, and the experimental results indicate that the proposed algorithm can benefit from Universum examples. The main contributions of this paper are as follows.
First, this paper proposes a semi-supervised learning with Universum algorithm using the boosting technique. Second, we show that the training error of AdaBoost with Universum is bounded by the product of the normalization factors, and that the training error drops exponentially fast when each weak classifier is slightly better than random guessing. Third, this paper investigates the influence of the Universum on classification performance

by conducting several experiments, and presents interesting findings from the experimental results. The remainder of this paper is organized as follows. Section II describes related work. Section III then introduces the proposed semi-supervised learning with Universum algorithm. Next, Section IV summarizes the results of several experiments. Section V then discusses and analyzes the proposed algorithm and the experimental results. Finally, Section VI draws the conclusion.

II. RELATED WORK

Weston et al. [23] demonstrated that the Universum can be encoded as prior knowledge imposed on the classification hyperplane, and proposed an algorithm called U-SVM to leverage the Universum by maximizing the number of observed contradictions, which associates the structural risk minimization principle with the use of Bayesian priors [25]. Their work viewed the Universum as a penalty term of the standard SVM objective function, in which the ε-insensitive loss function represents the penalty, and the experimental results indicated that U-SVM can benefit from Universum data and outperform the SVM. Sinz et al. [26] analyzed the influence of the Universum on U-SVM, and derived a least squares SVM algorithm called U_ls-SVM, which uses the quadratic loss in the objective function rather than the hinge loss. Graph-based methods are popular approaches in semi-supervised learning, and the basic assumption is that all the examples are situated on a low-dimensional manifold within the ambient space of the examples. Zhang et al. [27] proposed a graph-based framework to formulate the semi-supervised learning with Universum problem and developed two algorithms, called Lap-Universum (U-Lap) and NLap-Universum (U-NLap), in which the objective function considers three elements. The first element focuses on the labeled data, and penalizes the difference between the estimated labels and the given labels.
The second one is a graph regularization term that penalizes the unsmoothness of the classification function; the graph Laplacian matrix provides an approximation of this smoothness. The final one penalizes the loss on the Universum examples, and the goal is to let the soft labels


for the Universum examples be close to zero, which provides some prior knowledge for the classification function. Several differences exist between this framework and the proposed method. First, this framework is a graph-based method, while the proposed method is a boosting algorithm. Second, this framework is a multiclass algorithm, while the proposed algorithm is a multiclass and multilabel algorithm. Third, this framework relies on the smoothness assumption to devise a semi-supervised learning algorithm, while the proposed method uses a bootstrapping technique based on the proposed confidence scheme. In clustering, Zhang et al. [28] developed a document clustering with Universum algorithm, using maximum margin clustering to model target examples and Universum examples for clustering. Huang et al. [29] focused on semi-supervised learning using general unlabeled data. Their work focused on detecting irrelevant data in situations involving a small volume of labeled data and a large volume of unlabeled data. Thus, their work differs from learning with Universum, in which Universum data need to be specified in advance and the objective is to use Universum examples to improve classification performance. Inspired by the success of U-SVM, Shen et al. [30] proposed a boosting algorithm called UBoost that uses Universum data to improve classification performance on images and videos. Although both UBoost and the proposed method use the boosting technique along with Universum data to improve performance, several differences exist between the two algorithms. First, the proposed algorithm is a semi-supervised learning algorithm, while UBoost is a supervised learning algorithm. Second, the proposed algorithm considers the Universum in the decision stump to enhance each weak learner, while UBoost takes advantage of the available Universum data and labeled data to learn a strong classifier with a maximal contradiction on the Universum data and a minimal classification error on the labeled data.
Note that the idea behind UBoost is similar to that used by U-SVM. Finally, the proposed algorithm minimizes the exponential loss by using the Universum to derive the criterion for selecting the weak learner in each round, while UBoost applies an optimization technique called column generation to iteratively solve a relaxed version of the original problem. Collecting Universum examples is easy, and several approaches have been proposed. Typical approaches treat examples known to be unlikely to belong to any of the target categories as the Universum. The experiments in this paper use the same approach to collect the Universum. In cases where Universum data are not abundantly available, a priori domain knowledge can instead be used to construct purely artificial examples [23]. Intuitively, the Universum examples should be classified as zero in U-SVM, providing a basis for selecting in-between samples as Universum examples [31]. Cherkassky and Dai [32] used a similar idea to gather Universum data with a random selection approach, in which samples are generated by randomly selecting a pair of positive and negative training samples and averaging them. Numerous researchers have recently studied semi-supervised learning, namely learning with both labeled and unlabeled data [4], [33]. Compared with learning


with Universum, semi-supervised learning assumes that unlabeled examples must share the same distribution as the labeled ones. Numerous algorithms and experimental results have demonstrated that unlabeled data can considerably improve learning accuracy in certain practical problems. Various semi-supervised algorithms have been proposed, which generally transform prior or background knowledge into a set of constraints for learning and guide the learning process to satisfy the given constraints as far as possible; these include co-training [34], [35], semi-supervised naive Bayes [36], [37], transductive support vector machines (TSVMs) [14], graph-based approaches [38], [39], and clustering-based approaches [40]–[44]. For a more detailed survey of semi-supervised learning algorithms, refer to Zhu [4].

III. SEMI-SUPERVISED LEARNING WITH UNIVERSUM

Unlike traditional supervised learning approaches, ensemble learning combines multiple classifiers to solve a computational intelligence problem. Arguably the best known of all ensemble-based algorithms, AdaBoost [45] and its variants have been successfully applied to diverse domains [2], [46]–[51]. Essentially, the underlying classifier of AdaBoost can be of any type, including a decision tree, a neural network, or a simple rule of thumb. The weak hypothesis of the AdaBoost algorithm returns "+1" or "−1" to classify binary-category items. Schapire and Singer [46] extended the AdaBoost algorithm to develop the AdaBoost.MH algorithm. AdaBoost.MH maintains a set of weights over all training examples. In the training process, the weights of training examples that are difficult to predict correctly are increased, and those of examples that are easy to classify are decreased. Moreover, it permits a weak hypothesis to predict a real value for each label of a data example. This extension provides sufficient flexibility to design weak hypotheses.
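The AdaBoost.MH weight maintenance described above can be sketched in Python. This is a hedged illustration under simplified assumptions (dense weight and label matrices over example-label pairs; the function name `adaboost_mh_update` is hypothetical), not the paper's implementation:

```python
import numpy as np

def adaboost_mh_update(D, H, Ymat):
    """One AdaBoost.MH round's weight update (illustrative sketch).

    D:    (n, k) weights over example-label pairs
    H:    (n, k) real-valued weak-hypothesis outputs h_t(x_i, l)
    Ymat: (n, k) entries in {-1, +1}, +1 iff label l is in Y_i

    Pairs predicted correctly (Y * H > 0) lose weight; hard pairs gain it.
    """
    D_new = D * np.exp(-Ymat * H)
    Z = D_new.sum()            # normalization factor Z_t
    return D_new / Z, Z
```

A correctly scored pair contributes exp(−|h|) and shrinks, while a wrongly scored pair contributes exp(+|h|) and grows, which is exactly the "hard examples gain weight" behavior described above.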
This paper focuses on text classification, which tries to categorize each document based on its content. The proposed algorithm is based on the AdaBoost.MH algorithm [46], which is a multilabel and multiclass algorithm.

A. Notation

This section describes the notation used throughout this paper. The number of classes or categories is k, where the class labels are 1, . . . , k. Let X represent the feature space, and let Y be the label set. The input data include Universum examples U, a few labeled examples L, and numerous unlabeled examples R. No class information is provided for the unlabeled examples. Finally, 1{·} denotes the indicator function.

B. AdaBoost With Universum

AdaBoost uses only labeled examples that share the same distribution as the target ones to train a classification model. In contrast to previous research, we consider Universum examples in the training process to devise a variant of the AdaBoost algorithm. Algorithm 1 shows the AdaBoost with Universum algorithm, in which the inputs include Universum


Algorithm 1 AdaBoost With Universum Algorithm
1: Given Universum examples {(x_1, Y_1), . . . , (x_u, Y_u)} and training examples {(x_{u+1}, Y_{u+1}), . . . , (x_{u+a}, Y_{u+a})}, where x_i ∈ X for 1 ≤ i ≤ u + a, Y_i = 0 for 1 ≤ i ≤ u, and Y_i ∈ {−1, +1} for u + 1 ≤ i ≤ u + a.
2: Initialize D_1(i) = 1/(u + a), i = 1, . . . , u + a
3: for t = 1, . . . , T do
4:   Train the weak learner using distribution D_t
5:   Get weak hypothesis h_t : X → R
6:   Choose α_t ∈ R
7:   Update D_{t+1}(i) = D_t(i) exp(−α_t Y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor and D_{t+1} is a distribution
8: end for
9: Output the final classifier: H(x) = sign(∑_{t=1}^{T} α_t h_t(x))
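The loop in Algorithm 1 can be sketched in Python for the binary case. This is a hedged, minimal illustration (decision stumps over binary term features; the name `adaboost_universum` is hypothetical), not the authors' implementation. Universum examples carry Y_i = 0, so the exponential update exp(−α_t Y_i h_t(x_i)) leaves their weights unchanged:

```python
import numpy as np

def adaboost_universum(X, y, T=10):
    """AdaBoost with Universum (binary sketch).

    X: (n, d) binary term-presence matrix
    y: labels in {-1, 0, +1}, where 0 marks Universum examples
    """
    n, d = X.shape
    D = np.full(n, 1.0 / n)                 # uniform initial weights
    stumps, alphas = [], []
    for _ in range(T):
        # pick the stump (feature j, sign s) with the lowest weighted
        # error on the non-Universum examples
        best = None
        for j in range(d):
            for s in (+1, -1):
                pred = np.where(X[:, j] == 1, s, -s)
                err = D[(y != 0) & (pred != y)].sum() / D[y != 0].sum()
                if best is None or err < best[0]:
                    best = (err, j, s)
        err, j, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)    # smoothing
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] == 1, s, -s)
        D *= np.exp(-alpha * y * pred)          # y = 0 leaves weight unchanged
        D /= D.sum()                            # divide by Z_t
        stumps.append((j, s))
        alphas.append(alpha)

    def classify(Xq):
        score = sum(a * np.where(Xq[:, j] == 1, s, -s)
                    for a, (j, s) in zip(alphas, stumps))
        return np.sign(score)
    return classify
```

The strong classifier is the sign of the α-weighted sum of stump outputs, matching line 9 of Algorithm 1.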


examples as well as labeled examples. The D_t(i) in the algorithm denotes the weight of the distribution on training example i at iteration t, and all example weights are initially set equal. At each iteration, the weak hypothesis, i.e., the learner with the minimum training error, is selected, as listed in lines 4 and 5. This algorithm extends AdaBoost by considering Universum examples in the learning process to improve the selection of the weak learner at each iteration, and it can be viewed as a special case of the proposed USemi-AdaBoost.MH. A general and detailed description of the selection of the weak learner is given in the following subsection. The training error is measured with respect to the distribution D_t on which the weak classifier was trained. The algorithm then measures the importance of weak classifier h_t with α_t, which can be viewed as the weighting of h_t in the construction of the strong classifier. On completion of the training process, the final classifier is obtained as a linear combination of the weak classifiers. The training error of AdaBoost with Universum is bounded by the product of the normalization factors Z_t, as shown in Theorem 1.

Theorem 1: The training error of AdaBoost with Universum is bounded by the product of the normalization factors Z_t, namely

    \frac{1}{u+a} \sum_{i=1}^{u+a} \mathbb{1}\{H(x_i) \neq Y_i\} \le \prod_{t=1}^{T} Z_t.

Proof: The proof is presented in the Appendix.

C. USemi-AdaBoost.MH

The upper bound on the training error in Theorem 1 provides a direction for designing the weak hypothesis h_t and its corresponding weight α_t, whose goal is to minimize Z_t at each iteration. The proposed method is a multiclass and multilabel algorithm, so the weight distribution on training example i for label l is denoted D_t(i, l) rather than D_t(i). Meanwhile, the label of each labeled example is a set rather than a categorical value, so (1) is used to denote the label information

    Y[l] = \begin{cases} +1, & \text{if } l \in Y \\ -1, & \text{if } l \notin Y. \end{cases}   (1)

Essentially, the predictions of weak hypotheses should focus on partitioning the domain X, so that each weak hypothesis is only responsible for a specific partition, which can be viewed as local domain information. This paper follows the scheme used in BoosTexter [2] to design weak hypotheses, each of which is a decision stump. In text classification, a term is commonly used as the feature of a document, so this paper uses the same scheme to represent each document. Intuitively, text classification can be achieved by numerous rules based on the presence or absence of terms. Each term w's presence or absence partitions the whole data set into two blocks, X_0 and X_1, excluding and containing w, respectively. Let β_{jl} = h(x, l) represent the output of a hypothesis for label l and data x ∈ X_j, where j ∈ {0, 1}; the goal is to find appropriate choices for β_{jl}. Equation (2) shows the hypothesis output

    h_t(x, l) = \begin{cases} \beta_{0l}, & \text{if } w \notin x \\ \beta_{1l}, & \text{if } w \in x. \end{cases}   (2)

Besides, for each j and b ∈ {+1, −1}, a weighted fraction of the examples that fall in block j with label l is defined as (3), where s = u for the Universum example set U and s = a for the labeled example set L:

    W_{bs}^{jl} = \sum_{i=1}^{u+a} D_t(i, l)\, \mathbb{1}\{(x_i \in X_j) \wedge (Y_i[l] = b) \wedge (x_i \in s)\}.   (3)

In the following, we focus on the criterion for finding the weak hypothesis in each round. As described above, Z_t is a normalization factor, so Z_t is equivalent to the expression listed in (4). Then α_t can be absorbed into h_t, and the Z_t listed in (4) can be decomposed into the form represented by W_{bs}^{jl}, as shown in (5). As mentioned above, the goal is to minimize Z_t with respect to β_{jl}, so setting the partial derivative to zero yields the value of β_{jl} shown in (6), in which W_{+u}^{jl} is zero since the Universum represents the set not belonging to the target categories. In practice, W_{+a}^{jl} or W_{-a}^{jl} may be 0, so a smoothing technique can be applied to W_{+a}^{jl} and W_{-a}^{jl}:

    Z_t = \sum_{l \in Y} \sum_{i:\, x_i \in U \cup L} D_t(i, l) \exp(-\alpha_t Y_i[l] h_t(x_i, l))   (4)

    Z_t = \sum_{l \in Y} \sum_{i:\, x_i \in U \cup L} D_t(i, l) \exp(-Y_i[l] h_t(x_i, l))
        = \sum_{l \in Y} \sum_{j \in \{0,1\}} \left[ \left( W_{+u}^{jl} + W_{+a}^{jl} \right) \exp(-\beta_{jl}) + \left( W_{-u}^{jl} + W_{-a}^{jl} \right) \exp(\beta_{jl}) \right]   (5)

    \beta_{jl} = \frac{1}{2} \ln \frac{W_{+u}^{jl} + W_{+a}^{jl}}{W_{-u}^{jl} + W_{-a}^{jl}}
               = \frac{1}{2} \ln \frac{W_{+a}^{jl}}{W_{-u}^{jl} + W_{-a}^{jl}}.   (6)

Thus, the output of the weak hypothesis can be determined by (6). Furthermore, β_{jl} can be plugged into (5) to obtain Z_t, as shown in (7). As described above, minimizing Z_t identifies the best weak hypothesis at each iteration. Thus, the weak


hypothesis with the smallest Z_t, as shown in (7), is selected at each iteration:

    Z_t = 2 \sum_{l \in Y} \sum_{j \in \{0,1\}} \sqrt{ W_{+a}^{jl} \left( W_{-a}^{jl} + W_{-u}^{jl} \right) }.   (7)
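As a concrete illustration of (3), (6), and (7), the per-term computation of the weighted fractions, stump outputs, and Z_t can be sketched in Python. This is a hedged sketch for the binary, single-label case with explicit smoothing (`stump_for_term` and `eps` are illustrative choices, not from the paper):

```python
import numpy as np

def stump_for_term(presence, Y, is_universum, D, eps=1e-6):
    """Compute beta_{jl} and Z_t for one term and one label l (sketch).

    presence:     bool array, term appears in document i
    Y:            +1/-1 target labels (value ignored for Universum rows)
    is_universum: bool mask marking Universum examples
    D:            per-example weights D_t(i, l)
    """
    Z, beta = 0.0, {}
    for j, block in ((0, ~presence), (1, presence)):   # X_0 / X_1 partition
        # weighted fractions W_{bs}^{jl} of Eq. (3); W_{+u}^{jl} is zero
        # by definition, since Universum examples have no positive label
        W_pa = D[block & ~is_universum & (Y == +1)].sum()
        W_na = D[block & ~is_universum & (Y == -1)].sum()
        W_nu = D[block & is_universum].sum()
        # Eq. (6), with smoothing applied to avoid log(0)
        beta[j] = 0.5 * np.log((W_pa + eps) / (W_nu + W_na + eps))
        # Eq. (7): contribution of block j to Z_t
        Z += 2.0 * np.sqrt(W_pa * (W_na + W_nu))
    return beta, Z
```

Running this for every candidate term and keeping the one with the smallest Z_t gives the weak-hypothesis selection rule described above.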

As described above, the training error of AdaBoost with Universum is bounded by the product of the Z_t. We show that considering Universum in the learning process still drives the training error down exponentially fast, as stated in Theorem 2; the bound is the same as the one proved by Freund and Schapire [45] for AdaBoost.

Theorem 2: Considering Universum in the learning process, if each weak classifier is slightly better than random guessing, then there exist γ_t ≥ 0 such that the training error drops exponentially fast, since the training error is at most exp(−2 ∑_{t=1}^{T} γ_t²).

Proof: The proof is presented in the Appendix.

Algorithm 2 shows the USemi-AdaBoost.MH algorithm, where Z_t is a normalization factor at iteration t. The inputs include the number of categories k, Universum examples, labeled examples, and target unlabeled examples. As shown in Theorem 1, the product of all Z_t is an upper bound on the training error. This result provides a direction for designing the weak hypothesis by minimizing Z_t at each iteration to reduce the training error, and (6) shows how to design weak hypotheses. Besides, the proposed algorithm selects the best hypothesis using (7) at each iteration, indicating that the hypothesis that maximizes the difference between W_{+a}^{jl} and (W_{-a}^{jl} + W_{-u}^{jl}) is the best choice. Line 6 of Algorithm 2 lists this process. Meanwhile, this criterion can be applied to the selection of the weak learner during training for the AdaBoost with Universum algorithm mentioned in the previous subsection. This paper uses a simple example to explain the effect of Universum examples on weak hypothesis selection. For simplicity, the discussion is limited to a binary classification problem involving the "computer" and "talk" categories. Assume the term "network" is a distinctive feature, which appears in most documents belonging to the computer class but rarely appears in documents belonging to the talk class.
Using Universum examples can lead the weak learner to select network as a hypothesis when network appears in most Universum examples or rarely appears in Universum examples, based on the criterion listed in (7). According to the definition of the Universum, Universum examples should generally be free of distinctive features, since they do not belong to the classes of interest. Consequently, the first condition should not hold. Conversely, the second condition always holds for a distinctive feature, so Universum examples can help the weak learner select network as a hypothesis. We conduct experiments to verify the above analysis. Moreover, the value of the hypothesis output β_{jl} provides additional information. For instance, β_{jl} can be interpreted as the impact that the presence or absence of a feature has on the classification of class l. Consequently, this investigation uses the value of β_{jl} to represent the confidence of the hypothesis regarding the classification of label l in terms of the presence or absence of the feature. The value of β_j is a vector of β_{jl}, where 1 ≤ l ≤ k,


Algorithm 2 USemi-AdaBoost.MH Algorithm
1: Given the number of categories k, Universum examples U = {(x_1, Y_1), . . . , (x_u, Y_u)}, labeled examples L = {(x_{u+1}, Y_{u+1}), . . . , (x_{u+a}, Y_{u+a})}, and target unlabeled examples R = {(x_{u+a+1}, Y_{u+a+1}), . . . , (x_{u+m}, Y_{u+m})}, where x_i ∈ X for 1 ≤ i ≤ u + m, Y_i = ∅ for 1 ≤ i ≤ u, and Y_i ⊆ Y for u + 1 ≤ i ≤ u + m.
2: Use U, L, and R to represent the Universum, labeled, and unlabeled data sets, respectively.
3: repeat
4:   Initialize D_1(i, l) = 1/(|U ∪ L| k), where i = 1, . . . , |U ∪ L| and l = 1, . . . , k
5:   for t = 1, . . . , T do
6:     Pass distribution D_t to the weak learner to get a weak hypothesis h_t : (U ∪ L) × Y → R with the smallest Z_t as listed in (7)
7:     Update D_{t+1}(i, l) = D_t(i, l) exp(−Y_i[l] h_t(x_i, l)) / Z_t, where h_t(x_i, l) is the output of the weak hypothesis determined by (6), Z_t is a normalization factor, and i = 1, . . . , |U ∪ L|
8:   end for
9:   for i = u + a + 1, . . . , u + m do
10:     if x_i ∈ R then
11:       x_i.label = argmax_{1 ≤ l ≤ k} ∑_{t=1}^{T} h_t(x_i, l)
12:       x_i.conf = max_{1 ≤ l ≤ k} ∑_{t=1}^{T} h_t(x_i, l)
13:     end if
14:   end for
15:   for l = 1, . . . , k do
16:     avg_conf[l] = ∑_{i=u+a+1}^{u+m} x_i.conf · 1{(x_i ∈ R) ∧ (x_i.label = l)} / ∑_{i=u+a+1}^{u+m} 1{(x_i ∈ R) ∧ (x_i.label = l)}
17:     for i = u + a + 1, . . . , u + m do
18:       if (x_i ∈ R) and (x_i.label = l) and (x_i.conf ≥ avg_conf[l]) then
19:         L = L ∪ {x_i}
20:         R = R − {x_i}
21:       end if
22:     end for
23:   end for
24: until convergence
25: Output the final classifier: f(x, l) = ∑_{t=1}^{T} h_t(x, l)
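The bootstrapping step of Algorithm 2 (lines 9–23), which promotes unlabeled examples whose confidence reaches their class's average, can be sketched in Python. This is a hedged illustration (`promote_confident` is a hypothetical helper; the score matrix stands in for the summed weak-hypothesis outputs), not the authors' implementation:

```python
import numpy as np

def promote_confident(scores, labeled_idx, unlabeled_idx):
    """One bootstrapping pass (sketch of Algorithm 2, lines 9-23).

    scores: (n, k) summed weak-hypothesis outputs sum_t h_t(x_i, l)
    Returns updated labeled indices, remaining unlabeled indices,
    and a dict {index: assigned label} of promoted examples.
    """
    labels = scores.argmax(axis=1)       # x_i.label
    conf = scores.max(axis=1)            # x_i.conf
    promoted, still_unlabeled = {}, []
    for l in range(scores.shape[1]):
        members = [i for i in unlabeled_idx if labels[i] == l]
        if not members:
            continue
        avg = np.mean([conf[i] for i in members])   # per-class threshold
        for i in members:
            if conf[i] >= avg:
                promoted[i] = int(labels[i])        # move into labeled set
            else:
                still_unlabeled.append(i)
    new_labeled = list(labeled_idx) + list(promoted)
    return new_labeled, sorted(still_unlabeled), promoted
```

Repeating this pass until few unlabeled examples remain mirrors the outer repeat-until-convergence loop of Algorithm 2.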

and the highest among the vector values indicates the most likely category. Each hypothesis can provide a prediction based on its β_j value. The sum of β_j over all weak hypotheses reveals the classification result and the corresponding confidence, since the final (strong) classifier in the AdaBoost framework is a linear combination of the weak hypotheses. For simplicity, lines 11 and 12 of Algorithm 2 represent this process, in which the two attributes "label" and "conf" represent the classification result and the confidence of the classified data. Besides the Universum examples, this paper focuses on semi-supervised learning, particularly when only a few labeled examples are available. Intuitively, semi-supervised learning must use target unlabeled examples to train an accurate and reliable classifier. This investigation uses the bootstrapping



technique to design the algorithm. Initially, the proposed method uses the available labeled examples and Universum examples to learn a classification model and obtain T weak classifiers, where T is the number of iterations. Lines 4–8 list the training process. A strong classifier, comprising a linear combination of the T weak classifiers, can then classify each target unlabeled example. The confidence and label of the classification can be obtained from β_{jl}. Lines 9–14 present the classification process. This paper uses the confidence information to identify reliably classified unlabeled examples. The average of the confidence values serves as a threshold for selecting classified unlabeled examples, and it can be changed to fit application requirements. If the confidence value of a classified unlabeled example exceeds the threshold, the classification result is considered reliable. The proposed method then merges the reliably classified unlabeled examples and their labels into the labeled data set and removes them from the target unlabeled data set. Lines 15–23 of Algorithm 2 illustrate the filtering process. Furthermore, the proposed method is a multiclass algorithm, which explains why the algorithm iterates over all the labels.

IV. EXPERIMENTS

A. Data Sets

The experiments use nine combinations from four data sets: 20 Newsgroups,1 CiteULike,2 WebKB,3 and Reuters-21578.4 All are popular data sets commonly used in text analysis experiments to assess system performance.
1) The 20 Newsgroups data set, abbreviated as "20NG" in the experimental results, is a collection of approximately 20 000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. Some of the newsgroups are very closely related, while others are highly unrelated. The 20 Newsgroups data set has become popular for experiments involving text applications of machine learning techniques, such as text classification and text clustering.
2) CiteULike is a social bookmarking web site that aims to promote and develop the sharing of scientific references among researchers. Scientists can annotate academic papers of interest with tags and share the information with other people. However, CiteULike does not provide paper category information, which is necessary for this paper to evaluate performance. We assign papers to communities according to their venues, using the classification system adopted by Microsoft's academic search service,5 which provides the ranking of publications in different fields. For instance, the graphics field includes ACM Transactions on Graphics (TOG) and IEEE Computer Graphics and Applications. A paper published in the TOG would be classified in the "graphics"
1 http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html
2 CiteULike: http://www.citeulike.org/
3 http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/
4 http://www.daviddlewis.com/resources/testcollections/reuters21578/
5 Microsoft Academic Search: http://academic.research.microsoft.com

field. The above paper classification mechanism is also used by Shi et al. [52]. This study focused on the computer science domain and collected 3394 articles from three fields: “graphics” (741 articles), “databases” (1289 articles), and “programming languages” (1364 articles). The full text of each paper is unavailable in CiteULike, so the paper’s abstract and the tags annotated by users are taken as its content. 3) WebKB contains web pages collected from the computer science departments of various universities in January 1997 by the World Wide Knowledge Base Project of the Carnegie Mellon University text learning group. The WebKB data set contains seven categories, including “student” (1641 pages), “faculty” (1124 pages), “staff” (137 pages), “department” (182 pages), “course” (930 pages), “project” (504 pages), and “other” (3764 pages). 4) Reuters-21578, which is abbreviated as “Reuters” in the following tables, is one of the most widely used test collections for text classification research. The data was originally collected and labeled by Carnegie Group, Inc., and Reuters, Ltd., in developing the CONSTRUE text categorization system. Note that Reuters-21578 is a multilabel data set; the news articles of the ten largest classes, including “earn,” “acq,” “money-fx,” “grain,” “crude,” “trade,” “interest,” “ship,” “wheat,” and “corn,” are used to conduct experiments. Meanwhile, we collect 5140 and 2643 news articles as the target and Universum sets, respectively, after removing those whose class labels appear simultaneously in the target set and the Universum set. As mentioned above, this paper assumes that Universum examples are available, so we select a subset of a specific data set as Universum examples. For example, the 20 Newsgroups data set contains several subjects, and documents in different subjects can be viewed as different classes of data for the same problem.
Consequently, if the target examples are obtained from the computer and talk subjects, and the goal is to classify each target example as one of the two subjects, the collection of documents from the science and recreation subjects can be considered Universum examples. The same approach is applied to the experiments on all the data sets. Table I lists the experimental settings used in the experiments. In the preprocessing stage, stop words are removed from these data sets, since they fail to provide sufficient information for the classification task. Additionally, punctuation marks are removed and all English letters are converted to lower case. Finally, stemming is applied to the words. Furthermore, the proposed method sets the number of iterations T to 500, and the convergence condition holds when the number of remaining unlabeled examples is less than or equal to 4. B. Evaluation Measurements For each class, the correctness of a classification can be assessed by calculating the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and examples that either were incorrectly assigned to the class

TABLE I
EXPERIMENTAL COMBINATIONS USED IN THE EXPERIMENTS

(false positives) or were not recognized as class examples (false negatives) [53]. Equation (8) shows the definitions of precision, recall, and the F1 score, where TP represents the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives:

\begin{aligned}
\mathrm{Precision} &= \frac{TP}{TP+FP} \\
\mathrm{Recall} &= \frac{TP}{TP+FN} \\
F_1 &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\end{aligned} \tag{8}

Meanwhile, numerous classification tasks employed in the experiments are multiclass problems, so the evaluation should consider the prediction results for every class. Macro-average F1, which is the average of the F1 scores of all the classes, is used to assess system performance. Equation (9) shows the definition of the macro-average F1 score, where $k$ denotes the number of classes and $F_1^i$ represents the F1 score of the $i$th class:

\text{Macro-average } F_1 = \frac{\sum_{i=1}^{k} F_1^i}{k}. \tag{9}

C. Comparison Methods

The combination of U-SVM and TSVM can be considered a semi-supervised learning with Universum method, and this paper follows the naming convention of U-SVM and calls it U-TSVM. Although U-TSVM does not appear in the literature, the UniverSVM [54] package provides an example of the same idea. U-TSVM can be thought of as a semi-supervised learning with Universum method, and is an extension of TSVM, which is why it is used in the experiments for comparison. Besides U-TSVM, UniverSVM provides the implementations of SVM and TSVM, so we use UniverSVM in the

experiments involving SVM, TSVM, and U-TSVM. In the experiments, all SVM-related methods use the radial basis function as the kernel function. Besides SVM-related methods, Zhang et al. [27] proposed two semi-supervised learning methods with Universum, which simultaneously use the labeled examples, target unlabeled examples, and Universum examples to improve classification performance. These two methods, U-Lap and U-NLap, use a graph representation to encode all of the data, and use the Universum data to help represent the prior information needed to identify possible classifiers. The implementations of the proposed method and the comparison algorithms, except the SVM-related algorithms, can be downloaded from http://islab.cs.nctu.edu.tw/

D. Semi-Supervised Learning Experiments

The proposed method is a semi-supervised learning with Universum algorithm. Moreover, this paper focuses on the influence of Universum on classification performance when only a few labeled examples are available, and so randomly selects 1% of the target examples from each class as labeled ones, while the remainder of the target examples are unlabeled. Universum examples are those whose distribution differs from that of the target examples. Several of the machine learning algorithms described above are applied to the data sets for comparison with the proposed method. Each evaluation is repeated ten times, and the average of the results is taken as the experimental result; the results are listed in the tables as means plus or minus two standard deviations. In the experiments, supervised learning methods such as SVM use only the available labeled examples to train the model; meanwhile, supervised learning with Universum methods such as U-SVM use the available labeled examples and Universum examples to determine the decision hyperplane. Semi-supervised

TABLE II
EXPERIMENTAL RESULTS ON 20 NEWSGROUPS DATA SETS

learning methods, such as TSVM, use the available labeled examples and target unlabeled examples; meanwhile, semi-supervised learning with Universum methods, such as USemi-AdaBoost.MH, U-TSVM, U-Lap, and U-NLap, use all the target examples and Universum examples to train classification models. The first experiment involves the 20 Newsgroups data set. Binary classification and multiclass classification experiments are performed to evaluate system performance. Additionally, each newsgroup in the 20 Newsgroups data set comprises approximately 1000 documents, so it is a balanced data set. In the binary classification experiments, this investigation uses two combinations of newsgroups, listed as experimental combinations I and II of Table I, and the experimental results are listed in the corresponding columns of Table II. Experimental combination I uses newsgroups on recreation and science subjects as target examples, and newsgroups on computer and talk subjects as Universum examples. Each target class comprises two newsgroups, including approximately 2000 documents, and the goal is to classify each unlabeled target example into one of the two classes. In the experiments, SVM and TSVM do not employ Universum examples to train classification models, which is why their classification performance remains unchanged when Universum examples are incorporated. Experimental combination II uses documents that share the same subjects as the target examples but cover different topics as the Universum ones. To prevent the Universum examples from biasing a specific class of interest, the experiments collect documents from four different newsgroups. Experiments, namely combinations III and IV listed in Table I, are then conducted by using newsgroups on computer and talk subjects as target examples. Similarly, the documents collected from different subjects are viewed as Universum examples in the experiments.
The experimental results of combinations III and IV are also listed in corresponding columns of Table II. To assess whether the proposed method can perform well on multiclass problems, the experiments use newsgroups collected from three subjects as target examples. Similarly, we use two combinations as listed in the experimental combinations V and VI of Table I to conduct experiments. The experimental results are presented in Table II. Besides balanced data sets, this paper also uses imbalanced data sets to assess system performance. The experimental settings for imbalanced data sets, including CiteULike, WebKB, and Reuters-21578, are the combinations VII–IX listed in Table I. Table III lists the experimental results.
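The evaluation measures defined in (8) and (9) are straightforward to compute; the following is a minimal Python sketch (the helper name `macro_f1` is ours, not from the paper) of the macro-averaged F1 score reported throughout these tables.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 as in (8) and (9): per-class precision, recall,
    and F1 from TP/FP/FN counts, then the unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    # macro-average: every class contributes equally, regardless of its size
    return sum(f1_scores) / len(f1_scores)

print(round(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), 4))  # 0.7333
```

Because every class contributes equally, the macro-average penalizes poor performance on small classes, which is relevant for the imbalanced WebKB and Reuters-21578 experiments.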

V. DISCUSSION AND ANALYSIS

The experiments use four data sets and several combinations of those data sets. Table II lists the experimental results for the 20 Newsgroups data set, where binary and multiclass classification experiments are used for performance evaluation. Table III summarizes the experimental results for the CiteULike, WebKB, and Reuters-21578 data sets. SVM and TSVM, two state-of-the-art learning methods without Universum examples, serve as baselines to assess the influence of Universum examples on classification performance when only a few labeled examples are available. As shown in the above tables, the experimental results indicate that SVM fails to perform well owing to a lack of labeled examples. On the other hand, TSVM, a semi-supervised learning method, performs much better than SVM when only a few labeled examples are available. TSVM can consider both labeled and unlabeled target examples to determine a decision hyperplane, thus improving the classification performance. The experimental results conform to previous research findings; namely, semi-supervised learning can outperform supervised learning in certain applications with few available labeled examples. Additionally, these two methods only use target examples to train classification models, and thus Universum examples do not influence their classification performance. The experimental results presented in Tables II and III indicate that the proposed USemi-AdaBoost.MH outperforms the comparison methods mentioned above. This paper focuses on learning with Universum; thus, the following sections analyze and discuss the role of Universum in learning algorithms.

A. Learning With Universum on Insufficient Labeled Examples

Weston et al. [23] and Sinz et al. [26] have shown that learning with Universum, such as U-SVM, can improve performance, since it can use Universum examples to estimate prior information.
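To make this idea concrete, the ε-insensitive Universum penalty in the spirit of Weston et al. [23] can be sketched as follows. This is a simplified linear, gradient-descent version written purely for illustration (all names and hyperparameters are hypothetical); it is neither the U-SVM solver nor the proposed boosting method.

```python
import numpy as np

def universum_linear_train(X, y, X_univ, epochs=200, lr=0.1,
                           lam=0.01, c_u=0.5, eps=0.1):
    """Illustrative sketch: hinge loss on labeled points (y in {-1, +1})
    plus an eps-insensitive penalty max(0, |w.x| - eps) on Universum
    points, which pushes the decision function toward 0 on them."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        # subgradient of the hinge loss max(0, 1 - y w.x)
        g = -(X * y[:, None])[margins < 1].sum(axis=0) / n
        # subgradient of the eps-insensitive Universum penalty
        fu = X_univ @ w
        viol = np.abs(fu) > eps
        g_u = (np.sign(fu[viol])[:, None] * X_univ[viol]).sum(axis=0) / len(X_univ)
        # gradient step with a small L2 regularizer
        w -= lr * (g + c_u * g_u + lam * w)
    return w
```

The `c_u` term only fires when |w·x| exceeds `eps` on a Universum point, so the learned hyperplane is steered to pass near the Universum examples; this is one way of encoding the prior information they carry.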
Unlike previous research focusing on supervised learning with Universum, this paper focuses on analyzing semi-supervised learning with Universum. From the experimental results presented in the previous section, using Universum clearly improves classification performance in situations with insufficient labeled examples. The classifiers fail to yield a reliable decision hyperplane with insufficient labeled examples. In contrast, the Universum provides a means to impose prior knowledge on the classifiers, so the classifiers can benefit from Universum when insufficient labeled examples are available. Motivated by recent progress

TABLE III
EXPERIMENTAL RESULTS ON CITEULIKE, WEBKB, AND REUTERS-21578 DATA SETS

Fig. 2. Experimental results using different percentages of labeled data. (a) Target: two science newsgroups. (b) Target: four talk newsgroups.

on semi-supervised learning, we also conduct experiments using semi-supervised learning with Universum to analyze the influence of Universum. This paper also compares TSVM with U-TSVM, a semi-supervised learning with Universum method, in terms of their performance in the experiments. The experimental results listed above indicate that U-TSVM significantly outperforms TSVM. The main difference between TSVM and U-TSVM is that the latter uses additional Universum examples for model training. Universum can provide prior information regarding the classification functions, so the decision hyperplane determined by TSVM can be adjusted using the Universum, outperforming the hyperplane obtained without it. Consequently, semi-supervised learning can benefit from Universum in solving the text classification problem. Moreover, U-Lap and U-NLap are graph-based methods of semi-supervised learning with Universum, and they also outperform U-SVM, indicating that using unlabeled target examples and Universum examples can enhance text classification performance. The multiclass classification experiments exhibit similar results. The proposed method can outperform the other methods in both the binary and multiclass classification experiments. Besides labeled examples, the proposed method uses Universum examples in the boosting framework to help the weak learner select discriminative hypotheses, each of which is a decision stump. Consequently, the experimental results indicate that the proposed method performs well by using the boosting technique with Universum examples. Furthermore, the proposed method

is a multiclass and multilabel algorithm, and thus differs from U-TSVM and U-SVM. Although U-Lap and U-NLap can be extended to multiclass problems, the experimental results indicate that they perform poorly compared with the proposed method and U-TSVM.

B. Performance Impact of Universum Given Different Amounts of Labeled Examples

This paper analyzes the impact of Universum on classification performance by conducting experiments with different numbers of labeled examples. The experiments compare the proposed method with a semi-supervised learning method called Semi-AdaBoost.MH, which can be viewed as a semi-supervised learning extension of AdaBoost.MH based on the confidence mechanism described above. Semi-AdaBoost.MH does not use Universum in the training process, making it appropriate to compare with the proposed method to analyze the influence of Universum on classification performance. The first experiment focuses on the binary classification problem, where documents belonging to the “sci.crypt” and “sci.electronics” newsgroups are target examples. The Universum data is collected from the “rec.autos” and “rec.motorcycles” newsgroups. Fig. 2(a) illustrates the experimental results. The experimental results reveal several research issues related to Universum. First, using Universum can effectively improve the performance of semi-supervised learning in the text classification problem, particularly when

only a few labeled examples are available. Second, the experimental results can be explained using the concept of Universum introduced by Vapnik, that is, Universum examples implicitly specify a prior distribution on the set of classification functions. When the number of labeled examples is insufficient to estimate the parameters of the classification functions, the Universum can be used to approximate the prior distribution of the classification functions, explaining why the classification performance improves substantially when the percentage of labeled examples is below 10%. Third, the performance improvements obtained from Universum decrease with increasing numbers of labeled examples. Based on the present analysis, when the number of labeled examples is sufficient, the learning methods can use the available labeled examples to estimate prior information and train a robust classifier. Although the learning algorithms can use the Universum to provide additional prior information to adjust the decision boundary, the performance improvement diminishes as the number of labeled examples becomes sufficient to train an accurate classification model. Additionally, this investigation conducts experiments on a multiclass data set in which the target examples comprise documents from the “talk.politics.guns,” “talk.politics.mideast,” “talk.politics.misc,” and “talk.religion.misc” newsgroups. The Universum examples are collected from science-subject newsgroups, including “sci.crypt,” “sci.electronics,” “sci.med,” and “sci.space.” Fig. 2(b) illustrates the experimental results, which resemble those of the previous experiment.

VI. CONCLUSION

This paper developed a semi-supervised learning with Universum algorithm called USemi-AdaBoost.MH. Central to the proposed method is the use of boosting techniques with Universum examples to improve text classification performance, which we argue captures prior information about the classification functions.
We demonstrate and analyze why using Universum examples can improve classification performance, particularly when the available labeled examples are insufficient to train a robust model. Furthermore, we also show that the training error of AdaBoost with Universum is bounded by the product of the normalization factors, and that the training error drops exponentially fast when each weak classifier is slightly better than random guessing. Finally, the experiments use four data sets with several combinations. The experimental results indicate that the proposed algorithm can benefit from Universum examples, particularly when only a few labeled examples are available. In the future, we will try to extend the proposed method to image classification. We will also consider ways to generate Universum examples.

APPENDIX

A. Proof of Theorem 1

Proof: According to the update rule and the final classifier $H(x_i) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x_i)\big)$, we can obtain the following result:

D_{T+1}(i) = \frac{\exp\big(-\sum_t \alpha_t Y_i h_t(x_i)\big)}{(u+a)\prod_t Z_t} = \frac{\exp(-Y_i H(x_i))}{(u+a)\prod_t Z_t}. \tag{10}

Moreover, $\mathbb{1}\{H(x_i) \neq Y_i\}$ is an indicator function, implying that $\mathbb{1}\{H(x_i) \neq Y_i\} \leq 1$. On the other hand, if $H(x_i) \neq Y_i$, then $Y_i H(x_i) \leq 0$, implying that $\exp(-Y_i H(x_i)) \geq 1$. Therefore, the bound of the indicator function is as follows:

\mathbb{1}\{H(x_i) \neq Y_i\} \leq \exp(-Y_i H(x_i)). \tag{11}

Combining (10) and (11), the upper bound of the training error is

\begin{aligned}
\frac{1}{u+a}\sum_{i=1}^{u+a}\mathbb{1}\{H(x_i)\neq Y_i\}
&\leq \frac{1}{u+a}\sum_{i=1}^{u+a}\exp(-Y_i H(x_i)) \\
&= \frac{1}{u+a}\sum_{i=1}^{u+a} D_{T+1}(i)\,(u+a)\prod_t Z_t \\
&= \prod_t Z_t.
\end{aligned}
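The bound of Theorem 1 mirrors the classical AdaBoost result and can be checked numerically. The toy sketch below is our own code, using binary ±1 labels and exhaustive decision stumps rather than the paper's multilabel Universum setting; it verifies on a small data set that the training error never exceeds the product of the normalization factors.

```python
import numpy as np

def adaboost_bound_check(X, y, T=10):
    """Toy check of the Theorem 1 bound on plain binary AdaBoost with
    decision stumps: returns (training error, product of Z_t), where
    the error is guaranteed not to exceed the product."""
    n = len(y)
    D = np.ones(n) / n          # uniform initial weights
    F = np.zeros(n)             # accumulated margin sum_t alpha_t h_t(x)
    prod_Z = 1.0
    for _ in range(T):
        # exhaustively pick the stump (feature, threshold, polarity)
        # with the least weighted error under the current weights D
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    h = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = D[h != y].sum()
                    if best is None or err < best[0]:
                        best = (err, h)
        eps, h = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        Z = D @ np.exp(-alpha * y * h)          # normalization factor
        D = D * np.exp(-alpha * y * h) / Z      # weight update
        F += alpha * h
        prod_Z *= Z
    train_err = np.mean(np.sign(F) != y)
    return train_err, prod_Z
```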

B. Proof of Theorem 2

Proof: According to the definition of $Z_t$ as shown in (4), $Z_t$ can be written in the form listed in (12), in which $\epsilon_t$ is introduced to represent the weighted error rate. Equation (12) is a function of $\alpha_t$, so taking the partial derivative with respect to $\alpha_t$ and setting it to zero yields the value of $\alpha_t$ shown in (13):

\begin{aligned}
Z_t &= \sum_{l\in\mathcal{Y}} \sum_{i:\, x_i \in U \cup L} D_t(i,l)\exp(-\alpha_t Y_i[l]\, h_t(x_i,l)) \\
&= \sum_{l\in\mathcal{Y}} \Bigg( \sum_{i:\, h_t(x_i,l) = Y_i[l]} D_t(i,l)\exp(-\alpha_t) + \sum_{i:\, h_t(x_i,l) \neq Y_i[l]} D_t(i,l)\exp(\alpha_t) \Bigg) \\
&= (1-\epsilon_t)\exp(-\alpha_t) + \epsilon_t\exp(\alpha_t) \qquad\qquad (12)
\end{aligned}

\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right). \qquad\qquad (13)

Then, $\alpha_t$ can be plugged into (12) to obtain the upper bound of $Z_t$ shown in (14), in which the upper bound follows from the inequality $1+x \leq \exp(x)$ and $\gamma_t = \frac{1}{2} - \epsilon_t$. Finally, the upper bound of the training error, which is $\prod_t Z_t$, can be obtained from (15):

Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1-4\gamma_t^2} \leq \exp(-2\gamma_t^2), \quad \text{setting } \epsilon_t = \frac{1}{2}-\gamma_t, \text{ where } \frac{1}{2} \geq \gamma_t \geq 0 \qquad (14)

\prod_t Z_t = \prod_t \sqrt{1-4\gamma_t^2} \leq \exp\left(-2\sum_t \gamma_t^2\right). \qquad\qquad (15)
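The chain of inequalities in (14) can likewise be verified numerically. The helper below (a hypothetical name, written for illustration) checks the identity $2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1-4\gamma_t^2}$ and the bound $\sqrt{1-4\gamma_t^2} \leq \exp(-2\gamma_t^2)$ for several weighted error rates.

```python
import math

def zt_bounds(eps_t):
    """Check (14): Z_t = 2*sqrt(eps(1-eps)) = sqrt(1 - 4*gamma^2)
    <= exp(-2*gamma^2), where gamma = 1/2 - eps_t."""
    gamma = 0.5 - eps_t
    z = 2 * math.sqrt(eps_t * (1 - eps_t))
    # identity: eps(1-eps) = 1/4 - gamma^2, so 2*sqrt(...) = sqrt(1-4*gamma^2)
    assert abs(z - math.sqrt(1 - 4 * gamma ** 2)) < 1e-12
    # bound: 1 + x <= exp(x) with x = -4*gamma^2, then take square roots
    assert z <= math.exp(-2 * gamma ** 2) + 1e-12
    return z

for e in (0.1, 0.25, 0.4, 0.49):
    zt_bounds(e)  # all checks pass
```

Whenever the weak classifier is slightly better than random guessing ($\gamma_t > 0$), each factor $Z_t$ is strictly below 1, which is why the product in (15), and hence the training error, drops exponentially fast.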

REFERENCES [1] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Proc. 10th Eur. Conf. Mach. Learn., Chemnitz, Germany, 1998, pp. 137–142. [2] R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization,” Mach. Learn., vol. 39, nos. 2–3, pp. 135–168, 2000. [3] C. Silva, U. Lotric, B. Ribeiro, and A. Dobnikar, “Distributed text classification with an ensemble kernel-based learning approach,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 3, pp. 287–297, May 2010. [4] X. Zhu, “Semi-supervised learning literature survey,” Dept. Comput. Sci., Univ. Wisconsin, Madison, WI, USA, Tech. Rep. 1530, 2006. [5] N. Seliya and T. M. Khoshgoftaar, “Software quality analysis of unlabeled program modules with semisupervised clustering,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 2, pp. 201–211, Mar. 2007. [6] F. Nie, D. Xu, X. Li, and S. Xiang, “Semi-supervised dimensionality reduction and classification through virtual label regression,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 3, pp. 675–685, Jun. 2011. [7] F. Wang, “Semisupervised metric learning by maximizing constraint margin,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 4, pp. 931–939, Aug. 2011. [8] C. Liu, W. Hsaio, C. Lee, and F. Gou, “Semi-supervised linear discriminant clustering,” IEEE Trans. Cybern., vol. 44, no. 7, pp. 989–1000, Jul. 2014. [Online]. Available: http://dx.doi.org/10.1109/TCYB.2013.2278466 [9] D. Wang, F. Nie, and H. Huang, “Large-scale adaptive semisupervised learning via unified inductive and transductive model,” in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min. (KDD), New York, NY, USA, 2014, pp. 482–491. [Online]. Available: http://doi.acm.org/10.1145/2623330.2623731 [10] F. Nie, D. Xu, I. W. Tsang, and C. Zhang, “Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction,” IEEE Trans. Image Process., vol.
19, no. 7, pp. 1921–1932, Jul. 2010. [11] F. Nie, S. Xiang, Y. Jia, and C. Zhang, “Semi-supervised orthogonal discriminant analysis via label propagation,” Pattern Recognit., vol. 42, no. 11, pp. 2615–2627, 2009. [12] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405–2417, Dec. 2014. [Online]. Available: http://dx.doi.org/10.1109/TCYB.2014.2307349 [13] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA, USA: MIT Press, 2004. [14] T. Joachims, “Transductive inference for text classification using support vector machines,” in Proc. 16th Int. Conf. Mach. Learn. (ICML), Bled, Slovenia, 1999, pp. 200–209. [15] T. Joachims, “Transductive learning via spectral graph partitioning,” in Proc. 20th Int. Conf. Mach. Learn. (ICML), Washington, DC, USA, 2003, pp. 290–297. [16] M. Culp and G. Michailidis, “Graph-based semisupervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 1, pp. 174–179, Jan. 2008. [17] O. Chapelle, J. Weston, and B. Schölkopf, “Cluster kernels for semisupervised learning,” in Proc. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, 2002, pp. 585–592. [18] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervised learning using Gaussian fields and harmonic functions,” in Proc. 20th Int. Conf. Mach. Learn. (ICML), Washington, DC, USA, 2003, pp. 912–919. [19] M. Belkin and P. Niyogi, “Semi-supervised learning on Riemannian manifolds,” Mach. Learn., vol. 56, nos. 1–3, pp. 209–239, Jun. 2004. [20] B. Schölkopf, P. Simard, A. J. Smola, and V. Vapnik, “Prior knowledge in support vector kernels,” in Proc. Neural Inf. Process. Syst. (NIPS), Cambridge, MA, USA, 1997, pp. 640–646. [21] A. Epshteyn and G.
DeJong, “Generative prior knowledge for discriminative classification,” J. Artif. Int. Res., vol. 27, no. 1, pp. 25–53, Sep. 2006. [22] V. Vapnik, “Transductive inference and semi-supervised learning,” in Semi-Supervised Learning, O. Chapelle, B. Schölkopf, and A. Zien, Eds. Cambridge, MA, USA: MIT Press, 2006, ch. 24, pp. 454–472. [23] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, “Inference with the Universum,” in Proc. 23rd Int. Conf. Mach Learn. (ICML), Pittsburgh, PA, USA, 2006, pp. 1009–1016.

11

[24] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer, 1995. [25] J. M. Bernardo and A. F. M. Smith, Bayesian Theory. New York, NY, USA: Wiley, 1994. [26] F. H. Sinz, O. Chapelle, A. Agarwal, and B. Schölkopf, “An analysis of inference with the Universum,” in Proc. Neural Inf. Process. Syst. (NIPS), Vancouver, BC, Canada, 2007, pp. 1369–1376. [27] D. Zhang, J. Wang, F. Wang, and C. Zhang, “Semi-supervised classification with Universum,” in Proc. SIAM Int. Conf. Data Min. (SDM), Atlanta, GA, USA, 2008, pp. 323–333. [28] D. Zhang, J. Wang, and L. Si, “Document clustering with Universum,” in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval (SIGIR), Beijing, China, 2011, pp. 873–882. [29] K. Huang, Z. Xu, I. King, and M. R. Lyu, “Semi-supervised learning from general unlabeled data,” in Proc. 8th IEEE Int. Conf. Data Min. (ICDM), Pisa, Italy, 2008, pp. 273–282. [30] C. Shen, P. Wang, F. Shen, and H. Wang, “UBoost: Boosting with the Universum,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 825–832, Apr. 2012. [31] S. Chen and C. Zhang, “Selecting informative Universum sample for semi-supervised learning,” in Proc. 21st Int. Joint Conf. Artif. Intell. (IJCAI), Pasadena, CA, USA, 2009, pp. 1016–1021. [32] V. Cherkassky and W. Dai, “Empirical study of the Universum SVM learning for high-dimensional data,” in Proc. 19th Int. Conf. Artif. Neural Netw. I (ICANN), Limassol, Cyprus, 2009, pp. 932–941. [33] L. Shi, X. Ma, L. Xi, Q. Duan, and J. Zhao, “Rough set and ensemble learning based semi-supervised algorithm for text classification,” Expert Syst. Appl., vol. 38, no. 5, pp. 6300–6306, May 2011. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2010.11.069 [34] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in Proc. 11th Annu. Conf. Comput. Learn. Theory (COLT), San Francisco, CA, USA, 1998, pp. 92–100. [35] W. Wang and Z.-H. Zhou, “A new analysis of co-training,” in Proc. 
27th Int. Conf. Mach. Learn. (ICML), Haifa, Israel, 2010, pp. 1135–1142. [36] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Mach. Learn., vol. 39, nos. 2–3, pp. 103–134, May/Jun. 2000. [37] X. Ji and W. Xu, “Document clustering with prior knowledge,” in Proc. 29th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval (SIGIR), Seattle, WA, USA, 2006, pp. 405–412. [38] A. Blum and S. Chawla, “Learning from labeled and unlabeled data using graph mincuts,” in Proc. 18th Int. Conf. Mach. Learn. (ICML), Williamstown, MA, USA, 2001, pp. 19–26. [39] A. B. Goldberg and X. Zhu, “Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization,” in Proc. 1st Workshop Graph Based Methods Nat. Lang. Process. (TextGraphs-1), Stroudsburg, PA, USA, 2006, pp. 45–52. [40] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl, “Constrained K-means clustering with background knowledge,” in Proc. 18th Int. Conf. Mach. Learn. (ICML), Williamstown, MA, USA, 2001, pp. 577–584. [41] S. Basu, A. Banerjee, and R. J. Mooney, “Semi-supervised clustering by seeding,” in Proc. 19th Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2002, pp. 27–34. [42] S. Basu, M. Bilenko, and R. J. Mooney, “A probabilistic framework for semi-supervised clustering,” in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Disc. Data Min. (KDD), Seattle, WA, USA, 2004, pp. 59–68. [43] K. Zhou, X. Gui-Rong, Q. Yang, and Y. Yu, “Learning with positive and unlabeled examples using topic-sensitive PLSA,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 1, pp. 46–58, Jan. 2010. [44] F. Wang, T. Li, and C. Zhang, “Semi-supervised clustering via matrix factorization,” in Proc. SIAM Int. Conf. Data Min. (SDM), Atlanta, GA, USA, 2008, pp. 1–12. [45] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 
119–139, 1997. [46] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Mach. Learn., vol. 37, pp. 297–336, Dec. 1999. [47] X. Carreras, L. S. Marquez, and J. G. Salgado, “Boosting trees for antispam email filtering,” in Proc. 4th Int. Conf. Recent Adv. Nat. Lang. Process. (RANLP-01), Tsigov Chark, Bulgaria, 2001, pp. 58–64. [48] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 1. Kauai, HI, USA, 2001, pp. I-511–I-518.

[49] X. Carreras, L. Márquez, and L. Padró, “Named entity extraction using AdaBoost,” in Proc. 6th Conf. Nat. Lang. Learn. (COLING-02), Taipei, Taiwan, 2002, pp. 1–4. [50] W. Hu, W. Hu, and S. J. Maybank, “AdaBoost-based algorithm for network intrusion detection,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 577–583, Apr. 2008. [51] W. Hu, J. Gao, Y. Wang, O. Wu, and S. J. Maybank, “Online AdaBoost-based parameterized methods for dynamic distributed network intrusion detection,” IEEE Trans. Cybern., vol. 44, no. 1, pp. 66–82, Jan. 2014. [Online]. Available: http://dx.doi.org/10.1109/TCYB.2013.2247592 [52] X. Shi, B. L. Tseng, and L. A. Adamic, “Information diffusion in computer science citation networks,” in Proc. Int. Conf. Weblogs Soc. Media (ICWSM), San Jose, CA, USA, 2009, pp. 319–322. [53] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Inf. Process. Manage., vol. 45, pp. 427–437, Jul. 2009. [54] F. Sinz and M. Roffilli. (2010). UniverSVM. [Online]. Available: http://mloss.org/software/view/19/

Chien-Liang Liu received the M.S. and Ph.D. degrees in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2000 and 2005, respectively. He is currently an Engineer with the Computational Intelligence Technology Center, Industrial Technology Research Institute, Hsinchu. His current research interests include machine learning, data mining, and information retrieval.

Wen-Hoar Hsaio received the B.S. degree from the Department of Computer Science and Information Engineering, Chung Cheng Institute of Technology, National Defense University, Taoyuan City, Taiwan, in 1980, and the M.S. degree from the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan, in 1996, where he is currently pursuing the Ph.D. degree. His current research interests include information retrieval, data mining, and machine learning.

Chia-Hoang Lee received the Ph.D. degree in computer science from the University of Maryland, College Park, MD, USA, in 1983. He was a Faculty Member at the University of Maryland and Purdue University, West Lafayette, IN, USA. He is currently a Professor with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan. His current research interests include artificial intelligence, human machine interface systems, natural language processing, and opinion mining.

Tao-Hsing Chang received the Ph.D. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2007. From 1999 to 2008, he was a Research Fellow at the Research Center for Psychological and Educational Testing, National Taiwan Normal University, Taipei, Taiwan. He joined the National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, in 2008, where he is currently an Associate Professor with the Department of Computer Science and Information Engineering. He is the designer of the Automatic Chinese Essay Scoring System and the co-principal investigator of the Chinese Readability Index Explorer project. His current research interests include natural language processing, information retrieval, soft computing, and educational technology.

Tsung-Hsun Kuo received the M.S. degree in computer science from National Chiao Tung University, Hsinchu, Taiwan, in 2012. He is currently a Software Engineer at Acer Inc., New Taipei, Taiwan. His current research interests include machine learning, natural language processing, and pattern recognition.
