IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 3, MARCH 2012


Network-Based Stochastic Semisupervised Learning

Thiago Christiano Silva and Liang Zhao, Senior Member, IEEE

Abstract—Semisupervised learning is a machine learning approach that is able to employ both labeled and unlabeled samples in the training process. In this paper, we propose a semisupervised data classification model based on a combined random-preferential walk of particles in a network (graph) constructed from the input dataset. The particles of the same class cooperate among themselves, while the particles of different classes compete with each other to propagate class labels to the whole network. A rigorous model definition is provided via a nonlinear stochastic dynamical system, and a mathematical analysis of its behavior is carried out. A numerical validation presented in this paper confirms the theoretical predictions. An interesting feature brought by the competitive-cooperative mechanism is that the proposed model can achieve good classification rates while exhibiting a low computational complexity order in comparison to other network-based semisupervised algorithms. Computer simulations conducted on synthetic and real-world datasets reveal the effectiveness of the model.

Index Terms—Classification, complex networks, preferential walk, random walk, semisupervised learning, stochastic competitive learning.

I. INTRODUCTION

NOWADAYS, information reaches us at a remarkable speed and the amount of data it brings is unprecedented. In many situations, only a small subset of data items can be effectively labeled, because the labeling process is often expensive, time consuming, and requires intensive human involvement. As a result, partially labeled datasets are frequently encountered. In order to get a better characterization of partially labeled datasets, semisupervised classifiers are designed to learn from both labeled and unlabeled data. Semisupervised learning has turned out to be a topic of machine learning research that has received increasing attention in the past years [1]–[3]. Semisupervised methods include generative models [4], [5], cluster-and-label techniques [6], [7], co-training and tri-training techniques [8]–[10], low-density separation models such as transductive support vector machines (TSVM) [11], and graph-based methods. Many semisupervised learning techniques, such as TSVM, can identify data classes of well-defined forms but usually fail to identify classes of irregular forms. Thus, assumptions on the class distributions have to be made. Unfortunately, such information is usually not known

Manuscript received January 31, 2011; revised October 3, 2011; accepted December 17, 2011. Date of publication January 9, 2012; date of current version February 29, 2012. This work was supported in part by the São Paulo State Research Foundation (FAPESP) and by the Brazilian National Research Council (CNPq). The authors are with the Department of Computer Sciences, Institute of Mathematics and Computer Science, University of São Paulo, São Carlos 13560-970, Brazil (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2011.2181413

a priori. In order to overcome this problem, graph-based methods have been developed in the past years, such as Mincut [12], local and global consistency (LGC) [13], local learning regularization [14], local and global regularization [15], random walk techniques [16], [17], and label propagation (LP) techniques [18], [19]. The main advantage of graph-based methods is their ability to identify classes of arbitrary distributions. However, most graph-based methods share the regularization framework, differing only in the particular choice of the loss function and the regularizer [12], [13], [20]–[23], and most of them have cubic order of computational complexity, O(n^3). This factor limits their applicability to small or middle-sized datasets [1]. It is worth emphasizing that there are some recent graph-based methods that have linear time complexity [24] on networked data (the graph is given) or quadratic time complexity on non-networked data, the latter because of the graph construction. However, a drawback of the technique proposed in [24] is that it can only capture linear features of the labels, in the sense that the prediction vector is the multiplication of a matrix by an initial vector. On the other hand, the technique proposed in this paper is capable of capturing nonlinear features by grasping nonlinear interactions among the labels via the nonlinear movement policy of the particles. Even with this nonlinearity characteristic, the proposed method runs in linear time for sparse networked data and in quadratic time for non-networked data.

Competition is a natural process observed in nature and in social systems sharing limited resources. Competitive learning is an important category of machine learning and is widely implemented in artificial neural networks to realize unsupervised learning [25]–[33]. Without doubt, competitive learning neural networks represent one of the main successes of neural network theory development.
However, at least two problems remain. 1) The constructed network is usually small, so that competition occurs among a small number of neurons. Consequently, the model may not exhibit high robustness for data processing. 2) There is no direct connection between the input data and the trained competitive learning neural network. When a large dataset is mapped into a network with a small number of neurons, it becomes hard to see the correspondence between the original data and the trained neural network. This is one of the reasons why neural networks sometimes are considered as “black box” systems. In order to retain the interesting features and at the same time overcome the problems of competitive learning neural networks, Quiles et al. [34] proposed a particle walking model,




pertaining to the unsupervised learning scheme, to realize the competitive learning mechanism. The model considers a large-scale network in which several particles walk and compete with each other to occupy as many vertices as possible. Each particle can perform a random walk by choosing any neighbor to visit, a preferential walk by choosing the vertex with the highest domination level to visit, or a combination of the two. The model has been applied to community detection (a kind of unsupervised learning), where a community is defined as a subgraph whose vertices are densely connected among themselves but sparsely connected with the remainder of the network. The model shows high precision in community detection and, at the same time, presents low computational complexity.

In this paper, we propose a network-based semisupervised learning model based on particle competition and cooperation in the network constructed from the input dataset. The model proposed in [34] only describes a procedure of particle competition without a formal definition. In this paper, a rigorous definition is provided: the model is formally represented by a nonlinear stochastic dynamical system. Moreover, a mathematical analysis has been carried out, theoretical results have been obtained, and a validation is presented linking the numerical and theoretical results. Due to the lack of theory for this type of particle competition-cooperation, the rigorous model definition presented in this paper is an important step toward understanding and mastering such systems. Since models of several interactive walking processes correspond to many natural and artificial systems, the study of this topic is relevant for modeling complex systems. In this paper, we also introduce a cooperative mechanism among the particles.
In this way, particles of the same class proceed in the network in a cooperative manner to propagate their labels, while particles of different classes compete with each other to determine the class borders. Another interesting feature is that the model has a local label spreading fashion, i.e., due to the competition mechanism, each particle only visits a portion of the vertices potentially belonging to that particle or its teammates. This can be roughly understood as a "divide-and-conquer" effect embedded in the competitive-cooperative scheme. In this way, many long-range redundant operations are avoided. As a result, the proposed method has a lower computational complexity order. Since the underlying network is constructed directly from the input dataset, the correspondence between the input data and the processing result (the final network) is maintained. Consequently, the "black box" effect can be avoided to a large extent.

The remainder of this paper is organized as follows. The model definition is presented in Section II. In Section III, a detailed mathematical analysis and a validation of the model are provided. In Section IV, computer simulations are performed to show how the proposed model solves semisupervised data classification by using artificial and real-world datasets. Finally, Section V concludes this paper.

II. MODEL DESCRIPTION

In this section, the proposed semisupervised competitive learning model is presented in detail.

TABLE I
DESCRIPTION OF THE MOST RELEVANT NOTATIONS USED IN THIS PAPER

t — Index for time
i, j — Indices for vertices in the network
k — Index for a particle in the network
K — Number of particles inserted into the network
a_ij — The arc weight from vertex i to j
N_i^(k)(t) — Number of visits that vertex i has received from particle k up to time t
N̄_i^(k)(t) — Domination level of particle k on vertex i at time t
p^(k)(t) — Location of the kth particle in the network at time t
E^(k)(t) — Particle k's energy at time t
S^(k)(t) — Indicator of the state of the kth particle (active or exhausted) at time t
λ — Counterweights the amount of preferential and random walks performed by all particles
ω_min — Minimum energy that a particle can carry
ω_max — Maximum energy that a particle can carry
Δ — Fraction of gained/lost energy depending on which type of vertex a particle visits
P_transition^(k)(t) — Particle k's transition matrix at time t
P_rand^(k) — Particle k's random walk matrix (time-invariant)
P_pref^(k)(t) — Particle k's preferential walk matrix at time t
P_rean^(k)(t) — Particle k's reanimation matrix at time t
V — Set of vertices in the network
E — Set of edges in the network
K — Set of particles inserted into the network
C — Set of possible labels (classes)
L — Set of labeled data items and their corresponding class labels
S — Set corresponding to the space spawned by V × K
I_t — Set containing all elements satisfying Lemma 3 at time t
M_t — Set of all N(t) whose every entry is in I_t

A. Brief Overview of the Model

Consider a graph G = (V, E), where V = {v_1, ..., v_V} is the set of vertices (or nodes) and E = {e_1, ..., e_L} ⊂ V × V is the set of links (or edges). In the competitive learning model, a set of particles K = {1, ..., K} is inserted into the vertices of the network. Each particle can be conceptualized as a flag carrier whose main objective is to conquer new territories (vertices) while defending its previously dominated vertices. In view of this, a competition process naturally takes place among the particles. When a particle visits an arbitrary vertex, it strengthens its own domination level on this vertex and, simultaneously, weakens the domination levels of all other rival particles on this same vertex. It is expected that, over a broad horizon of time, this model will end up uncovering the classes in the network in such a way that each particle or team of particles dominates a class. For convenience, Table I summarizes all the relevant notations used in this paper.

B. Competitive Transition Matrix

In the proposed competitive model, each particle k ∈ K basically performs two types of movements: 1) a random movement term, modeled by the matrix P_rand^(k), which permits the particle to venture throughout the network, without accounting for the defense of its previously dominated


vertices; and 2) a preferential movement term, modeled by the matrix P_pref^(k)(t), which is responsible for inducing the particle to reinforce the domination imposed on already dominated vertices. In order to model such dynamics, consider the stochastic vector p(t) = [p^(1)(t), p^(2)(t), ..., p^(K)(t)], which denotes the locations of the K particles in the network, where the kth entry, p^(k)(t), indicates the location of particle k at time t, i.e., p^(k)(t) ∈ V, ∀k ∈ K. It is desirable to find a transition matrix that governs the movement performed by the particles to their respective next neighboring vertices, i.e., p(t + 1) = [p^(1)(t + 1), p^(2)(t + 1), ..., p^(K)(t + 1)].

A particle in this model can be in one of two modes: active or exhausted. The current states of the particles are stored in the stochastic vector S(t) = [S^(1)(t), ..., S^(K)(t)], where the kth entry, S^(k)(t) ∈ {0, 1}, indicates whether particle k is active (S^(k)(t) = 0) or exhausted (S^(k)(t) = 1) at time t. Depending on which state a particle is in, its movement policy in the network changes. Specifically, if particle k is active, then the particle presents an exploratory ability: it navigates the network according to a combined behavior of random and preferential movements. However, if particle k is exhausted, then it does not present exploratory ability. In this case, it switches its movement policy to a new transition matrix, here referred to as P_rean^(k)(t), which is responsible for taking the particle back to its dominated territory. At the next step, the exhausted particle will possibly be recharged by visiting its dominated vertices in the vicinity. After its energy has been properly recharged, the particle becomes active and can again perform the combined random-preferential movement in the network. We call this the reanimation procedure.
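The energy-driven switch between the active and exhausted modes can be illustrated with a minimal sketch. Here `update_energy`, `w_min`, `w_max`, and `delta` are our stand-ins for the quantities ω_min, ω_max, and Δ that are formalized later in this section:

```python
def update_energy(E, in_own_territory, w_min=1.0, w_max=10.0, delta=0.2):
    """Recharge when visiting an owned vertex; deplete in rival territory."""
    if in_own_territory:
        E = min(w_max, E + delta)       # energy is capped at w_max
    else:
        E = max(w_min, E - delta)       # energy never drops below w_min
    exhausted = (E == w_min)            # an exhausted particle switches to reanimation
    return E, exhausted

# A visit to rival territory drains the particle down to the lower bound:
E, exhausted = update_energy(1.1, in_own_territory=False)
# E == 1.0 and exhausted is True, so the particle would now be teleported home.
```

Once the particle re-enters its own territory, the same rule raises its energy above ω_min again, which re-enables the combined random-preferential walk.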
In brief, S(t) acts as a switch that determines the movement policy of all particles at time t. With all this information in mind, we are able to define the transition matrix associated to particle k as

$$\mathbf{P}_{\mathrm{transition}}^{(k)}(t) \triangleq (1 - S^{(k)}(t))\left[\lambda \mathbf{P}_{\mathrm{pref}}^{(k)}(t) + (1 - \lambda)\mathbf{P}_{\mathrm{rand}}^{(k)}\right] + S^{(k)}(t)\,\mathbf{P}_{\mathrm{rean}}^{(k)}(t) \qquad (1)$$

where λ ∈ [0, 1] indicates the desired fraction of preferential movement that all particles in the network will perform. Specifically, P_transition^(k)(i, j, t) indicates the probability that particle k performs a transition from vertex i to j at time t. It is worth noting that (1) is a convex combination of two transition matrices (the first term is itself a combination of two transition matrices), since the sum of the coefficients is unitary; therefore, the resulting matrix is guaranteed to be another transition matrix. We now define each matrix that appears in (1) in a detailed manner.

The random movement matrix only depends on the adjacency matrix of the graph, which is previously known. In this way, each entry (i, j) ∈ V × V of the matrix P_rand^(k) is given by

$$P_{\mathrm{rand}}^{(k)}(i, j) \triangleq \frac{a_{ij}}{\sum_{u=1}^{V} a_{iu}} \qquad (2)$$

where a_ij denotes the (i, j)th entry of the adjacency matrix A of the graph. Note that (2) resembles the traditional Markovian matrix for a single random walker, here symbolized as a particle [35]. In addition, note that the matrix P_rand^(k) is time-invariant and is the same for every particle in the network; therefore, we drop the superscript k whenever the situation makes it clear. In short, the probability of an adjacent neighbor j being visited from vertex i is proportional to the edge weight linking these two vertices.

In order to assist in the derivation of the matrix associated to the preferential movement term P_pref^(k)(t) for a given particle k ∈ K, we introduce the stochastic vector N_i(t) ≜ [N_i^(1)(t), N_i^(2)(t), ..., N_i^(K)(t)]^T, where dim(N_i(t)) = K × 1 and N_i(t) stands for the number of visits received by vertex i up to time t from all particles scattered throughout the network. Specifically, the kth entry, N_i^(k)(t), indicates the number of visits made by particle k to vertex i up to time t. Generalizing this notation, we define the matrix that maintains the number of visits made by every particle in the network to all the vertices as N(t) ≜ [N_1(t), N_2(t), ..., N_V(t)]^T, where dim(N(t)) = V × K. Let us also formally define the domination level vector of vertex i, N̄_i(t), according to the stochastic vector N̄_i(t) ≜ [N̄_i^(1)(t), N̄_i^(2)(t), ..., N̄_i^(K)(t)]^T, where dim(N̄_i(t)) = K × 1. Particularly, the kth entry, N̄_i^(k)(t), indicates the relative frequency of visits performed by particle k to vertex i up to time t. Similar to the previous case, we extend this notion to all vertices in the network, defining the domination level matrix that holds all the domination levels imposed by every particle on all the vertices as N̄(t) ≜ [N̄_1(t), N̄_2(t), ..., N̄_V(t)]^T, where dim(N̄(t)) = V × K. Mathematically, we define each entry of N̄_i(t) as

$$\bar N_i^{(k)}(t) \triangleq \frac{N_i^{(k)}(t)}{\sum_{u=1}^{K} N_i^{(u)}(t)}. \qquad (3)$$
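Equations (2) and (3) reduce to two row normalizations, one over the adjacency matrix and one over the visit-count matrix. A minimal NumPy sketch (the function names and toy values are ours):

```python
import numpy as np

def random_walk_matrix(A):
    """Eq. (2): row-normalize the (weighted) adjacency matrix A."""
    return A / A.sum(axis=1, keepdims=True)

def domination_levels(N):
    """Eq. (3): relative visit frequency of each particle on each vertex.
    N is V x K, with N[i, k] = number of visits of particle k to vertex i."""
    return N / N.sum(axis=1, keepdims=True)

# Toy example: a triangle graph and two particles.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
P_rand = random_walk_matrix(A)   # each row is a probability distribution

N = np.array([[3., 1.],          # vertex 1: 3 visits by particle 1, 1 by particle 2
              [1., 1.],
              [1., 4.]])
N_bar = domination_levels(N)     # N_bar[0, 0] == 0.75: particle 1 dominates vertex 1
```

Both helpers exploit the fact that each row of the result must sum to one, which is exactly the stochasticity required of P_rand and N̄(t).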

In view of this, we can define P_pref^(k)(i, j, t), which is the probability that a single particle k performs a transition from vertex i to j at time t using solely the preferential movement term, as follows:

$$P_{\mathrm{pref}}^{(k)}(i, j, t) \triangleq \frac{a_{ij}\,\bar N_j^{(k)}(t)}{\sum_{u=1}^{V} a_{iu}\,\bar N_u^{(k)}(t)}. \qquad (4)$$

From (4), it can be observed that each particle has a different transition matrix associated to its preferential movement and that, unlike the matrix related to the random movement, this matrix is time-variant, depending on the domination levels of all the vertices (N̄(t)) in the network at time t. It is worth remarking that the preferential movement of the particles is characterized here by the visiting frequency of each particle to a specific vertex, i.e., the more visits a specific particle performs to an arbitrary vertex, the higher the chance that the same particle repeatedly visits that vertex. Furthermore, it is important to emphasize that (4) produces two distinct features presented by a natural competition model: 1) the strengthening of the domination level of the visiting particle on a vertex; and 2) the consequent weakening of the domination levels of all other particles on that same vertex.

Next, we define each entry of P_rean^(k)(t), which is responsible for teleporting an exhausted particle k ∈ K back to its dominated territory with the purpose of recharging its energy (reanimation process). Suppose that particle k is visiting vertex i when its energy is completely depleted. In this situation, the particle must regress to an arbitrary vertex j of its possession at time t, according to the following expression:

$$P_{\mathrm{rean}}^{(k)}(i, j, t) \triangleq \frac{\mathbb{1}\left[\arg\max_{m\in\mathcal{K}} \bar N_j^{(m)}(t) = k\right]}{\sum_{u=1}^{V} \mathbb{1}\left[\arg\max_{m\in\mathcal{K}} \bar N_u^{(m)}(t) = k\right]} \qquad (5)$$

where ½[·] is the indicator function that yields 1 if the argument is logically true and 0 otherwise. The operator arg max_{m∈K}(·) returns an index M such that N̄_u^(M)(t) is the maximal value among all N̄_u^(m)(t) for m = 1, 2, ..., K. Indeed, a careful analysis of the expression in (5) reveals that the probability of returning to an arbitrary vertex j, which is dominated by particle k, follows a uniform distribution. In addition, (5) only results in nonzero transition probabilities for vertices j that are being dominated by particle k at time t, regardless of the existence of a connection between i and j in the adjacency matrix.

Now, we may proceed to the development of the particle's energy update rule. First, it is useful to introduce the stochastic vector E(t) = [E^(1)(t), ..., E^(K)(t)], where the kth entry, E^(k)(t) ∈ [ω_min, ω_max], ω_max ≥ ω_min, denotes the energy level of particle k at time t. The energy update rule is given by

$$E^{(k)}(t) = \begin{cases} \min(\omega_{\max},\, E^{(k)}(t-1) + \Delta) & \text{if } \mathrm{owner}(k, t) \\ \max(\omega_{\min},\, E^{(k)}(t-1) - \Delta) & \text{if } \neg\,\mathrm{owner}(k, t) \end{cases} \qquad (6)$$

where owner(k, t) = (arg max_{m∈K} N̄_{p^(k)(t)}^(m)(t) = k) is a logical expression that essentially yields true if the vertex that particle k visits at time t [i.e., vertex p^(k)(t)] is being dominated by it, and false otherwise; dim(E(t)) = 1 × K; and Δ > 0 symbolizes the increment or decrement of energy that each particle receives at time t. Indeed, the first expression in (6) represents the increment of the particle's energy, and it occurs when particle k visits a vertex p^(k)(t) that is dominated by itself, i.e., arg max_{m∈K} N̄_{p^(k)(t)}^(m)(t) = k. Similarly, the second expression in (6) indicates the decrement of the particle's energy, and it happens when particle k visits a vertex p^(k)(t) dominated by a rival particle. Hence, in this model, particles are penalized when wandering in rival territory, so as to minimize aimless navigation of the particles in the network.

Next, we advance to the update rule that governs S(t), which is responsible for determining the movement policy of each particle. As we have stated, an arbitrary particle k will be transported back to its domain only if its energy drops to a threshold ω_min. With that in mind, it is natural that each entry S^(k)(t) must monitor the current energy value of its corresponding particle k: if this energy ever drops to the given threshold, the switch must be enabled; analogously, if the particle still has an energy value greater than this threshold, then the switch should be disabled. Mathematically, the kth entry of S(t) can be written as

$$S^{(k)}(t) = \mathbb{1}\left[E^{(k)}(t) = \omega_{\min}\right] \qquad (7)$$


Fig. 1. Illustration of the reanimation procedure. The vertex color represents the particle that is imposing the highest domination level at time t. The beige color denotes a vertex not dominated by any particle.

where dim(S(t)) = 1 × K. Specifically, S^(k)(t) = 1 if E^(k)(t) = ω_min, and 0 otherwise. The upper limit ω_max has been introduced to prevent any particle in the network from increasing its energy to an undesirably high value and, therefore, taking a long time to become exhausted even if it constantly visits vertices of rival particles. In such a scenario, the accuracy rate achieved by the semisupervised classifier would probably be harmed. We now summarize the key concepts introduced so far in the following example.

Example 1: Consider the network depicted in Fig. 1, which encompasses 20 vertices. Suppose there are two particles, namely, red and blue, located at vertices 17 and 1, respectively. As both particles are visiting vertices whose owners are rival particles, their energy will drop. Consider, in this case, that both particles have reached the minimum allowed energy, i.e., ω_min, at time t. Therefore, according to (7), both particles are exhausted. In agreement with the mechanism of the dynamical system, these particles will be teleported back to their owned territory according to (5):

$$P_{\mathrm{transition}}^{(\mathrm{red})}(i, j, t) = \frac{1}{9} \quad \forall i \in \mathcal{V},\; j \in \{v_1, v_2, \ldots, v_9\}$$
$$P_{\mathrm{transition}}^{(\mathrm{red})}(i, j, t) = 0 \quad \forall i \in \mathcal{V},\; j \in \mathcal{V} \setminus \{v_1, v_2, \ldots, v_9\}$$
$$P_{\mathrm{transition}}^{(\mathrm{blue})}(i, j, t) = \frac{1}{8} \quad \forall i \in \mathcal{V},\; j \in \{v_{13}, v_{14}, \ldots, v_{20}\}$$
$$P_{\mathrm{transition}}^{(\mathrm{blue})}(i, j, t) = 0 \quad \forall i \in \mathcal{V},\; j \in \mathcal{V} \setminus \{v_{13}, v_{14}, \ldots, v_{20}\}. \qquad (8)$$
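The reanimation distribution of Example 1 can be reproduced directly from (5): each row of P_rean^(k) is uniform over the vertices currently dominated by particle k. A small sketch, with our own 0-based encoding of the ownership shown in Fig. 1:

```python
import numpy as np

def reanimation_row(owner, k):
    """Eq. (5): a uniform distribution over the vertices currently dominated
    by particle k (independent of the particle's current vertex i).
    owner[j] holds the index of the particle with the highest domination level
    on vertex j, or -1 if vertex j is not dominated (beige in Fig. 1)."""
    dominated = (owner == k).astype(float)
    return dominated / dominated.sum()

# Encoding of Example 1 (0-based: vertices v1..v20 become indices 0..19).
owner = np.full(20, -1)
owner[:9] = 0     # red dominates v1, ..., v9
owner[12:] = 1    # blue dominates v13, ..., v20

row_red = reanimation_row(owner, 0)    # nonzero entries all equal 1/9
row_blue = reanimation_row(owner, 1)   # nonzero entries all equal 1/8
```

Since the row does not depend on the current vertex i, stacking V copies of it yields the full V × V reanimation matrix of (8).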

One can verify that, given that a particle is exhausted, it will be teleported back to its territory (the set of dominated vertices), no matter where the particle is at the present iteration. The determination of which of the owned vertices to visit follows a uniform distribution, i.e., each owned vertex has an equal chance of being visited by the exhausted particle. This confinement mechanism prevents aimless navigation of the particle in the network, as will be confirmed by the computer simulations in the following sections.

Once the collection of matrices associated to each particle is defined, we couple all of them into a single representative transition matrix, denoted P_transition(t), which models the transition of p(t) to p(t + 1). In order to properly define such a matrix, the following fact will be used: given the system state X(t), the transition of an arbitrary active or exhausted particle to a next vertex at time t + 1 is independent of the transitions of the other rival particles at time t + 1. This happens because the transition matrix of each particle, according to (1), is


solely based on the current system state X(t). By virtue of this property, one has

$$\mathbf{P}_{\mathrm{transition}}(t) = \mathbf{P}_{\mathrm{transition}}^{(1)}(t) \otimes \ldots \otimes \mathbf{P}_{\mathrm{transition}}^{(K)}(t) \qquad (9)$$

where ⊗ denotes the Kronecker tensor product operator. In this way, (9) completely specifies the transition distribution matrix for all the particles in the network. Essentially, when K ≥ 2, p(t) is a vector and we can no longer conventionally index the row p(t) of the matrix P_transition(t). Owing to this, we define an invertible mapping f : V^K → ℕ*. The function f simply maps the input vector to a scalar that reflects the natural ordering of the tuples in the input vector. For example, p(t) = [1, 1, ..., 1, 1] (all particles at vertex 1) denotes the first state, p(t) = [1, 1, ..., 1, 2] (all particles at vertex 1, except the last particle, which is at vertex 2) is the second state, and so on, up to the scalar state V^K. Therefore, with this tool, we can fully manipulate the matrix P_transition(t).

Remark 1: The matrix P_transition(t) in (9) possesses dimensions V^K × V^K, which are undesirably high. In order to save space, the collection of K matrices given in (1) will be used in the computer simulations section. The matrix P_transition(t) has been introduced only to ease the derivations conducted in the mathematical analysis section.

C. Semisupervised Competitive Learning Model

In light of the results obtained in the previous section, the internal state of the dynamical system, X(t), is given by

$$X(t) = [\mathbf{N}(t)\;\; p(t)\;\; \mathbf{E}(t)\;\; \mathbf{S}(t)]^{T} \qquad (10)$$

and the proposed competitive dynamical system is given by

$$\phi : \begin{cases} N_i^{(k)}(t+1) = N_i^{(k)}(t) + \mathbb{1}\left[p^{(k)}(t+1) = i\right] \\[4pt] E^{(k)}(t+1) = \begin{cases} \min(\omega_{\max},\, E^{(k)}(t) + \Delta) & \text{if } \mathrm{owner}(k, t) \\ \max(\omega_{\min},\, E^{(k)}(t) - \Delta) & \text{if } \neg\,\mathrm{owner}(k, t) \end{cases} \\[4pt] S^{(k)}(t+1) = \mathbb{1}\left[E^{(k)}(t+1) = \omega_{\min}\right] \end{cases} \qquad (11)$$

Observe that p(t + 1) has no closed form because it is characterized as a probability distribution following P_transition(t); therefore, it is obtained merely by random number generation. Also note that system φ is nonlinear because of the indicator function. The first equation of system φ updates the number of visits that every vertex i ∈ V has received from every particle k ∈ K up to time t; the second equation maintains the current energy levels of all the particles inserted in the network; and the third equation indicates whether each particle is active or exhausted, depending on its actual energy level. One can also verify that system φ is Markovian, since the determination of the future state only depends on the present state.

D. Initial Conditions of the System

Consider a set of classes C and a set of prelabeled examples V_L ⊂ V. Let L denote the set in which each entry indicates a


pair of a labeled vertex and its corresponding class, i.e., L = {(v_1, c_1), ..., (v_{|V_L|}, c_{|V_L|})}, where v_i ∈ V_L and c_i ∈ C, 1 ≤ i ≤ |L| = |V_L|. For those initially labeled vertices, we set N_i^(k)(0) = ∞, where k ∈ K is the representative particle of vertex i ∈ V_L. In this way, in view of (3), the ownership of all prelabeled vertices cannot be reassigned to another particle. Mathematically, this can be precisely expressed by

$$N_i^{(k)}(0) = \begin{cases} \infty, & \text{if particle } k \text{ represents vertex } i \\ 1 + \mathbb{1}\left[p^{(k)}(0) = i\right], & \text{otherwise.} \end{cases} \qquad (12)$$

Observe that the scalar 1 is used in the second expression of (12) so as to prevent the denominator of (3) from becoming zero. Therefore, N_i^(k)(t) ∈ [1, ∞), (i, k) ∈ S, where S is the space spawned by V × K. Usually, more than one particle (a team) is generated to represent a set of prelabeled examples of the same class. Each particle attempts to dominate vertices in an independent manner; the cooperation among the particles of the same team happens only at the end of the process, when all the domination levels imposed by each member of a team are simply summed up to obtain the aggregated domination level of that team.

Regarding the initial condition of E(0), we desire a fair competition among the particles, so we assign them identical initial energy values, i.e., all particles k ∈ K start out with the same energy level, given by

$$E^{(k)}(0) = \omega_{\min} + \left(\frac{\omega_{\max} - \omega_{\min}}{K}\right). \qquad (13)$$

Lastly, the variable indicating whether particle k is active or exhausted at the initial step, i.e., S^(k)(0), ∀k ∈ K, is given by

$$S^{(k)}(0) = 0 \qquad (14)$$

i.e., we deliberately set all particles in the network to the active state at the beginning of the process.

E. Algorithm

In order to facilitate the understanding of how the proposed stochastic dynamical system evolves in time, Fig. 2 depicts a flowchart with the main tasks processed by the proposed technique. In the first block, "Set Initial Conditions," we initialize X(0), which is composed of N(0), p(0), E(0), and S(0). After that, the system starts to evolve, and the "Stop Criterion" logical statement is checked at each iteration. In a given iteration, each particle needs to perform a transition to the next vertex. This is done by the inner loop starting from the "k > K" logical statement. Inside this inner loop, the time-variant transition matrix associated to particle k is calculated ("Calculate Transition Matrix of Particle k" block). Next, the same particle performs the transition respecting the probability distribution of the matrix calculated in the previous step ("Particle k Visits Another Vertex" block). When all particles have properly completed their movements, the inner loop ceases and the remaining system variables are updated, i.e., N(t), E(t), and S(t), for



Fig. 2. Illustrative flowchart showing the main tasks of the stochastic dynamical system φ.

some t ≥ 1 ("Update Remaining System Variables" block). Finally, the system can either move forward to the next iteration or, if the stop criterion has been met, supply N̄(t) to the user ("Return Domination Level Matrix" block).

Algorithm 1 summarizes, in a detailed manner, all the steps required to iterate system φ. Essentially, the algorithm accepts the dataset (data) and the labeled set of data items (L), as well as three user-defined parameters: the fraction of gained/lost energy by the particles in the model (Δ), the desired proportion of preferential movement (λ), and a stopping factor (ε). Usually, good results can be obtained by selecting λ ∈ [0.2, 0.8] and Δ ∈ [0.05, 0.4] (see Section IV-A); ε can be set to an arbitrarily small value. K is the number of labeled data items and also the number of particles. Observe that, in Step 23, we utilize the matrix max-norm ‖·‖∞, which simply yields the maximum absolute row sum of the matrix. Note that the stopping criterion can also be defined as a certain number of iterations.

F. Computational Complexity

The whole algorithm for semisupervised data classification (Algorithm 1) can be divided into two distinct parts: 1) Step 3: network construction from the input dataset; and 2) Steps 4 to 23: the classification algorithm itself. The following list provides the computational complexity of each step of Algorithm 1.

Step 3: Construction of the graph from the input dataset. It has complexity order O(V²), since the distance matrix must be evaluated. If already networked data is presented to the algorithm, this step is skipped.

Step 4: Generation of K particles. This has complexity order O(K).

Step 5: In this step, every link of the network must be visited. Thus, its complexity order is O(L), where L denotes the number of links.

Steps 6 and 7: An assignment is made for each of the K × V entries of N(0) and N̄(0), respectively. Hence, these steps have complexity order O(KV).
Steps 8 and 9: A simple assignment is performed for each of the K entries of E(0) and S(0). In this way, the complexity order is O(K).

Step 13: Suppose that k is the average node degree of the network; it follows that this step can be performed in O(k).

Step 14: We maintain a hash table storing the vertices owned by each particle. In this way, it takes constant time to find a vertex dominated by an exhausted particle, i.e., O(1).
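The hash-table bookkeeping mentioned for Step 14 can be sketched as follows (an illustrative structure, not the authors' code; names are ours): each particle keeps the set of vertices its team currently dominates, so a dominated vertex for an exhausted particle is found without scanning the whole network.

```python
import random

class OwnerIndex:
    """Track which vertices each particle's team currently dominates, so
    that an exhausted particle can be sent back to one of its dominated
    vertices without a full network scan (cf. the Step-14 discussion)."""
    def __init__(self, n_particles):
        self.owned = [set() for _ in range(n_particles)]

    def transfer(self, vertex, old, new):
        """Move `vertex` from particle `old` (or None at start) to `new`."""
        if old is not None:
            self.owned[old].discard(vertex)
        self.owned[new].add(vertex)

    def pick_owned(self, particle, rng=random):
        """Return some vertex currently dominated by `particle`."""
        return rng.choice(tuple(self.owned[particle]))

idx = OwnerIndex(2)
idx.transfer(3, None, 0)
idx.transfer(7, None, 0)
idx.transfer(7, 0, 1)          # particle 1 conquers vertex 7
assert idx.owned[0] == {3} and idx.pick_owned(1) == 7
```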

Algorithm 1 Semisupervised Particle Competition Algorithm
1: procedure PARTICLECOMPETITION(data, L, Δ, λ, ε)
2:   K ← |L|
3:   A ← buildGraph(data)
4:   p(0) ← generateParticles(A, L)
5:   Prnd ← calculateRandomMatrix(A): Use (2)
6:   N(0) ← calculateInitialN(p(0), L): Use (12)
7:   N̄(0) ← calculateNbar(N(0)): Use (3)
8:   E(0) ← calculateInitialE(K): Use (13)
9:   S(0) ← calculateInitialS(): Use (14)
10:  t ← 0
11:  repeat
12:    for k = 1 to K do
13:      Ppref^(k)(t) ← calculatePpref(N(t), p(t)): Use (4)
14:      Prean^(k)(t) ← calculatePrean(N(t), p(t)): Use (5)
15:      Ptran^(k)(t) ← setPTran(λ, Prnd, Ppref^(k)(t), Prean^(k)(t)): Use (1)
16:      p^(k)(t+1) ← chooseNextVertices(Ptran^(k)(t), p^(k)(t))
17:    end for
18:    N(t+1) ← updateN(N(t), p(t+1)): Use the first equation in (11)
19:    N̄(t+1) ← calculateNbar(N(t+1)): Use (3)
20:    E(t+1) ← updateE(E(t), N̄(t+1), p(t+1)): Use the second equation in (11)
21:    S(t+1) ← updateS(E(t+1)): Use the third equation in (11)
22:    t ← t + 1
23:  until ‖N̄(t) − N̄(t−1)‖∞ < ε
24:  return N̄(t)
25: end procedure

Step 15: Scalar multiplication involving each neighbor of the vertex that particle k is visiting. This is finished in O(k) time.

Step 16: Particle k chooses the next vertex to visit according to the transition distribution calculated in Step 15. This task can be done in O(k) time.

Steps 18 and 19: Update of the matrices N(t) and N̄(t). Considering that at most K different vertices are visited at any given time, it is guaranteed that at most K rows of the matrices N and N̄ change. Thus, this update can be done in O(K²) time, since each of these K rows has K entries.

Steps 20 and 21: Completed in O(K) time.

Since Steps 13 to 16 repeat K times, this block has complexity order O(Kk). The complexity order of the next block, defined by Steps 18 to 22, is determined by Step 18 or 19, i.e., O(K²). Therefore, the semisupervised classifier, without the “repeat” loop, has complexity order O(Kk + K²). In what follows, the number of iterations of the “repeat” loop is estimated. Consider a network with some completely separated classes and suppose each class has a single particle. The ownership of each vertex can be determined by a single visit of the particle; thus, the number of iterations of the main loop is certainly O(V) = c1 V, where c1 is a positive constant proportional to the fraction of random movement


performed by the particles. If the classes are connected in a well-defined manner (there are few interclass connections), the ownership of each vertex can be determined by a small number of visits. Then, in order to have all V vertices dominated by the particles, the number of iterations is again O(V) = c2 V, where c2 is a positive constant satisfying c2 > c1. Following the same reasoning, we conclude that the number of iterations required for all vertices to be completely dominated by the particles is O(V) = cV, where c is a constant whose magnitude increases with the portion of interclass links. In summary, the semisupervised classifier has complexity order O(V² + KkV + K²V). Some specific cases can be observed.

1) If the network is sparse, i.e., k ≪ V, the algorithm runs in O(V²) for non-networked data and in O(K²V) for networked data.
2) If the average degree k is proportional to V (a highly connected network), the algorithm runs in O(KV²) time.
3) Since the quantity of particles inserted in the network is usually small and many real networks are sparse, i.e., K ≪ V and k ≪ V, it is reasonable to assume that, in the majority of cases, the semisupervised algorithm runs in O(V²) for non-networked data and in O(V) for networked data.

Next, an empirical evaluation of our prior analysis is performed using random clustered networks [36]. Here, we suppose that the network is given, i.e., Step 3 is not executed. Moreover, we consider that the number of particles is far smaller than the quantity of data items (K ≪ V) and that k = 16 ≪ V. In this scenario, the algorithm is expected to run in linear time, O(V). The random clustered networks are constructed with increasing network sizes (V = {1000, 2000, …, 10 000}). For each network, there are four interconnected subnetworks, each of which represents a data class. Within each subnetwork, the vertices are densely connected, whereas the connections between each pair of classes are relatively sparse. On average, the neighborhood of a vertex is defined so as to share 60% of its connections with vertices of the same class and 40% with the other classes. Each class has two prelabeled vertices; hence, K = 8 particles are inserted into the network. The processing time is measured on an Intel Core 2 CPU 6700 with 4 GB of RAM. The results are shown in Fig. 3. By inspecting this figure, we notice that the time increases linearly as a function of the network size, which empirically confirms our analysis.

III. THEORETICAL RESULTS

In this section, we supply an in-depth analysis of the proposed model, as well as a validation of the theoretical results. Here, we suppose that G is an undirected graph.

A. Mathematical Analysis

Primarily, it is of utmost importance to determine the transition probability function of system φ, i.e., P(X(t+1) | X(t)), before any rigorous analysis is undertaken. We first simplify the notation by observing that P(X(t)) = P(N(t), p(t), E(t), S(t)).


Fig. 3. Processing time (in seconds) versus network size V until the variations of N̄(t) become insignificant (ε = 0.05). λ = 0.6, Δ = 0.07, ωmin = 0, and ωmax = 1. Each point in the trace is averaged over 10 realizations. The error bars represent the maximum and minimum values.
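Before the derivation that follows, it may help to fix a concrete representation of the composite state X(t) = (N(t), p(t), E(t), S(t)); the record below is an illustrative sketch only (field names are ours, not the paper's).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class State:
    """Composite state X(t) = (N(t), p(t), E(t), S(t)) of system phi:
    visit counts, particle positions, energies, and exhaustion flags."""
    N: List[List[int]]   # V x K visit-count matrix
    p: List[int]         # current vertex of each of the K particles
    E: List[float]       # energy of each particle, within [w_min, w_max]
    S: List[int]         # 1 if the particle is exhausted, else 0

V, K = 4, 2
x0 = State(N=[[1] * K for _ in range(V)], p=[0, 3], E=[0.5] * K, S=[0] * K)
assert len(x0.N) == V and len(x0.p) == K
```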

In fact, a detailed algebraic derivation of the transition function P(X(t+1) | X(t)) is given as follows:

P(X(t+1) | X(t))
  = P(N(t+1), p(t+1), E(t+1), S(t+1) | N(t), p(t), E(t), S(t))
  = P(S(t+1) | N(t+1), p(t+1), E(t+1), N(t), p(t), E(t), S(t))
      × P(N(t+1), p(t+1), E(t+1) | N(t), p(t), E(t), S(t))
  = P_S(t+1) P(E(t+1) | N(t+1), p(t+1), N(t), p(t), E(t), S(t))
      × P(N(t+1), p(t+1) | N(t), p(t), E(t), S(t))
  = P_S(t+1) P_E(t+1) P(N(t+1) | p(t+1), N(t), p(t), E(t), S(t))
      × P(p(t+1) | N(t), p(t), E(t), S(t))
  = P_S(t+1) P_E(t+1) P_N(t+1) P_p(t+1)    (15)

where P_S(t+1) = P(S(t+1) | N(t+1), p(t+1), E(t+1), X(t)), P_E(t+1) = P(E(t+1) | N(t+1), p(t+1), X(t)), P_N(t+1) = P(N(t+1) | p(t+1), X(t)), and P_p(t+1) = P(p(t+1) | X(t)). Now we proceed to the evaluation of these four terms. We start by deriving P_p(t+1). Observing that the stochastic vector p(t+1) is directly evaluated from Ptransition(t) given in (9), which in turn only requires p(t) and N(t) to be constructed (X(t) is given), the following equivalence holds:

P_p(t+1) = P(p(t+1) | X(t)) = Ptransition(N(t), p(t)).    (16)

Here, we have used Ptransition(N(t), p(t)) to emphasize the dependence of the transition matrix on N(t) and p(t). Next, we proceed to the evaluation of PN(t +1) . In this case, we have additional information, besides the previous state


X(t), which is the knowledge about p(t+1). By a quick analysis of the update rule given in the first expression of system φ, it is possible to completely determine N(t+1), since p(t+1) and N(t) are known. Because of that, the following equation holds:

P_N(t+1) = P(N(t+1) | p(t+1), X(t)) = ½[N(t+1) = N(t) + Q_N(p(t+1))]    (17)

where Q_N(p(t+1)) is a matrix with dim(Q_N) = V × K and dependent on p(t+1), whose (i, k)th entry is given by

Q_N(p(t+1))(i, k) = ½[p^(k)(t+1) = i].    (18)

The argument of the indicator function in (17) is essentially the first expression of system φ in matrix notation. In brief, (17) yields 1 if the computation of N(t+1) is correct given p(t+1) and N(t), i.e., it is in compliance with the dynamical system rules, and 0 otherwise.

For the third term P_E(t+1), we have knowledge of the previous state X(t) as well as of p(t+1) and N(t+1). By (3), N̄(t+1) can be directly calculated from N(t+1), i.e., knowledge of N(t+1) permits us to evaluate N̄(t+1), which, probabilistically speaking, is also given information. In light of this, together with (6), one can see that E(t+1) can be evaluated if we have information on E(t), p(t+1), and N̄(t+1), which we actually do. On account of that, P_E(t+1) is surely determined and, analogously to the calculation of P_N(t+1), is given by

P_E(t+1) = P(E(t+1) | N(t+1), p(t+1), X(t)) = ½[E(t+1) = E(t) + Δ × Q_E(p(t+1), N(t+1))]    (19)

where Q_E(p(t+1), N(t+1)) is a matrix with dim(Q_E) = 1 × K and dependence on N(t+1) and p(t+1). The kth entry, k ∈ K, of this matrix is calculated as

Q_E^(k)(p(t+1), N(t+1)) = ½[owner(k, t+1)] − ½[¬owner(k, t+1)].    (20)

Note that the argument of the indicator functions in (20) is essentially (6) in a compact matrix form. Indicator functions are employed to describe the two types of behavior this variable can have: an increment or a decrement of the particle's energy. Suppose particle k ∈ K is visiting a vertex dominated by itself; then only the first indicator function in (20) is enabled, hence Q_E^(k)(p(t+1), N(t+1)) = 1. Similarly, if particle k is visiting a vertex dominated by a rival particle, the second indicator function is enabled, yielding Q_E^(k)(p(t+1), N(t+1)) = −1. This behavior, together with (19), is exactly the expression given by (6) in a compact matrix form.

Lastly, for the fourth term P_S(t+1), we have knowledge of E(t+1), N(t+1), p(t+1), and the previous internal state X(t). By a quick analysis of (7), one can verify that the calculation of the kth entry of S(t+1) is completely characterized once E(t+1) is known. In this way, one can surely evaluate P_S(t+1) as follows:

P_S(t+1) = P(S(t+1) | E(t+1), N(t+1), p(t+1), X(t)) = ½[S(t+1) = Q_S(E(t+1))]    (21)

where Q_S(E(t+1)) is a matrix with dim(Q_S) = 1 × K and dependence on E(t+1). The kth entry, k ∈ K, of this matrix is calculated as

Q_S^(k)(E(t+1)) = ½[E^(k)(t+1) = ωmin].    (22)

Substituting (16), (17), (19), and (21) into (15), we encounter the transition probability function of the competitive dynamical system

P(X(t+1) | X(t)) = ½[N(t+1) = N(t) + Q_N(p(t+1))] × ½[S(t+1) = Q_S(E(t+1))]
                     × ½[E(t+1) = E(t) + Δ Q_E(p(t+1), N(t+1))] × Ptransition(N(t), p(t))
                 = ½[Compliance(t)] Ptransition(N(t), p(t))    (23)

where Compliance(t) is a logical expression given by

Compliance(t) = [N(t+1) = N(t) + Q_N(p(t+1))]
                ∧ [S(t+1) = Q_S(E(t+1))]
                ∧ [E(t+1) = E(t) + Δ Q_E(p(t+1), N(t+1))].    (24)

That is, Compliance(t) encompasses all the rules that must be satisfied in order that all the indicators in (23) produce 1. If all the values provided to (23) are in compliance with the dynamics of the system, then Compliance(t) = true and the indicator function ½[Compliance(t)] yields 1; otherwise, if at least one quantity does not satisfy the system, the chain of logical ANDs in (24) produces false. As a consequence, Compliance(t) = false, the indicator function ½[Compliance(t)] in (23) yields 0, and the transition probability is zero.

In order to obtain N(t) as t → ∞, the calculation of the joint distribution of all states X(0), …, X(t) will be useful. By applying Bayes' theorem successively and using the Markovian property of the proposed dynamical system, we get

P(X(0), …, X(t)) = P(X(t) | X(t−1)) × P(X(t−1) | X(t−2)) × ⋯ × P(X(1) | X(0)) P(X(0)).    (25)
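The factorization in (25) is the ordinary Markov chain rule. The toy two-state chain below (purely illustrative, far smaller than system φ) checks it numerically, including the marginalization over intermediate states that is used later, in (31), to extract a marginal from the joint.

```python
import itertools

# toy two-state Markov chain: transition matrix P and initial distribution pi
P = [[0.9, 0.1],
     [0.4, 0.6]]
pi = [0.5, 0.5]

def joint(path):
    """P(X(0), ..., X(t)) for one concrete path, via the chain rule of (25)."""
    prob = pi[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a][b]
    return prob

# Marginalizing the joint over all length-4 paths recovers the same P(X(3))
# obtained by iterating the transition matrix directly.
marg = [sum(joint(p) for p in itertools.product((0, 1), repeat=4) if p[-1] == s)
        for s in (0, 1)]
direct = pi
for _ in range(3):
    direct = [sum(direct[a] * P[a][b] for a in (0, 1)) for b in (0, 1)]
assert all(abs(m - d) < 1e-12 for m, d in zip(marg, direct))
```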

Applying the transition function that governs system φ, as given in (23), to each shifted term in (25), we get

P(X(0), …, X(t)) = P(X(0)) ∏_{u=0}^{t−1} [ ½[Compliance(u)] Ptransition(N(u), p(u)) ]    (26)

where P(X(0)) = P(N(0), p(0), E(0), S(0)). However, we are interested in the marginal distribution of N(t) as t → ∞. We can obtain it from the joint distribution calculated in (26) by summing over all possible values of the random variables with no relevance to the analysis, i.e., N(t−1), …, N(0), p(t), …, p(0), E(t), …, E(0), S(t), …, S(0). In doing so, it is worth studying the limits of N(t) for an arbitrary t, because the domain that an entry of N(t) can take is [1, ∞). With this analysis, we expect to find the reachable values of


every entry of matrix N(t) for any t. In this way, values that exceed these limits are guaranteed to occur with probability 0. Lemma 1 precisely supplies this analysis.

Lemma 1: If i is an unlabeled vertex, then the maximum reachable value of N_i^(k)(t), ∀k ∈ K, t ∈ ℕ, is

N_imax^(k)(t) = ⌈(t + 1)/2⌉ + 1,   if t > 0 and a_ii = 0
              = t + 2,             if t > 0 and a_ii > 0.    (27)
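As a quick numerical sanity check of the first case in (27) (our own test, not part of the paper's analysis), exhaustive enumeration of walks on a small self-loop-free graph reproduces the bound, with the spawn vertex initially counted twice as in (12).

```python
import math

def max_visits(adj, start, t):
    """Enumerate all t-step walks from `start` and return the maximum visit
    count of `start`, counting the initial spawn as 2 visits (mirroring the
    initial condition in (12) for the spawn vertex)."""
    frontier = [(start, 2)]
    for _ in range(t):
        frontier = [(w, cnt + (w == start))
                    for v, cnt in frontier for w in adj[v]]
    return max(cnt for _, cnt in frontier)

# triangle graph: no self-loops, so the first case of (27) applies
adj = {0: (1, 2), 1: (0, 2), 2: (0, 1)}
for t in range(1, 8):
    assert max_visits(adj, 0, t) == math.ceil((t + 1) / 2) + 1
```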

Proof: The proof is based on finding the particle trajectory that increases N_i^(k)(t) in the quickest manner. We suppose particle k is generated at vertex i; otherwise, the maximum theoretical value would never be reached, in view of the second expression in (12). For the sake of clarity, consider two specific cases: 1) networks without self-loops; and 2) networks with self-loops. In the first case, ∀i ∈ V: a_ii = 0. By hypothesis, particle k starts at vertex i at t = 0. The quickest way to increase N_i^(k)(t) is for particle k to visit a neighbor of vertex i and immediately return to vertex i. Repeating this trajectory until time t, N_i^(k)(t) precisely matches the first expression in (27). In the second case, ∃i ∈ V: a_ii > 0. By hypothesis, particle k starts at vertex i at t = 0. Clearly, the quickest way to increase N_i^(k)(t) is for particle k to always travel through the self-loop. In view of this, the maximum value that N_i^(k)(t) can reach is given by the second expression in (27). The “+2” factor appears because the particle initially spawns at vertex i, according to the second expression in (12).

Lemma 1 does not provide information about prelabeled vertices, since this can be inferred in a straightforward manner from the initial conditions of the system. According to (12), if i is a prelabeled vertex and k is its representative particle, then N_i^(k)(t) = ∞, ∀t ≥ 0.

In what follows, we analyze the properties of the stochastic vector E(t). The upper limit of the kth entry of E(t) is always ωmax. Thus, provided that ωmax < ∞, the upper limit is always well defined. However, this entry can take only a discrete set of values between ωmin and ωmax. Lemma 2 provides all reachable values of E(t) within this interval.

Lemma 2: For a fixed t ∈ ℕ, the reachable domain of E^(k)(t), ∀k ∈ K, is

D_E ≜ { ωmin + (ωmax − ωmin)/K + nΔ : n = −n_i, …, n_m }
    ∪ { ωmin + nΔ : n = 1, 2, …, ⌊(ωmax − ωmin)/Δ⌋ }
    ∪ { ωmax − nΔ : n = 1, 2, …, ⌊(ωmax − ωmin)/Δ⌋ }    (28)

where n_i = ⌊(ωmax − ωmin)/(KΔ)⌋ ≥ 0 and n_m = ⌊((ωmax − ωmin)/Δ)(1 − 1/K)⌋ ≥ 0.

Proof: We divide this proof into three main steps, namely the three sets that appear in the expression in the caput of this lemma. The first set accounts for supplying all values that are multiples of Δ with the offset E^(k)(0) = ωmin + (ωmax − ωmin)/K, ∀k ∈ K [see (13)]. The minimum reachable value is given when n = −n_i

n_i = ⌊((ωmin + (ωmax − ωmin)/K) − ωmin)/Δ⌋ = ⌊(ωmax − ωmin)/(KΔ)⌋    (29)

whereas the maximum reachable value is given when n = n_m

n_m = ⌊(ωmax − (ωmin + (ωmax − ωmin)/K))/Δ⌋ = ⌊((ωmax − ωmin)/Δ)(1 − 1/K)⌋.    (30)

After some time, particle k might reach one of the two possible extremes of energy: ωmin or ωmax. On account of the max(·) operator in (6), it is also necessary to list all multiples of Δ with these two offsets. The second and third sets precisely fulfill this aspect when the offsets are ωmin and ωmax, respectively. Once the particle enters one of these sets, it never leaves it. Hence, all values have been properly mapped.

Lastly, the upper limit of an arbitrary entry of S(t) is 1, and the values this variable can take are {0, 1}. In light of the aforementioned limits, the marginal distribution P(N(t)) is expressed by

P(N(t)) = Σ_{p^(1)(0)=1}^{V} Σ_{p^(2)(0)=1}^{V} ⋯ Σ_{p^(K)(t)=1}^{V}
          × Σ_{N_1^(1)(0)=1}^{N_1max^(1)(0)} Σ_{N_1^(2)(0)=1}^{N_1max^(2)(0)} ⋯ Σ_{N_V^(K)(t−1)=1}^{N_Vmax^(K)(t−1)}
          × Σ_{E^(1)(0)∈D_E} Σ_{E^(2)(0)∈D_E} ⋯ Σ_{E^(K)(t)∈D_E}
          × Σ_{S^(1)(0)=0}^{1} Σ_{S^(2)(0)=0}^{1} ⋯ Σ_{S^(K)(t)=0}^{1}
          P(X(0)) ∏_{u=0}^{t−1} [ ½[Compliance(u)] Ptransition(N(u), p(u)) ]    (31)

The summations in the first line of (31) account for passing through all possible values of p(0), …, p(t). The summations in the second line are responsible for passing through all reachable values of N(0), …, N(t−1), where the upper limit is set with the aid of Lemma 1. The third line supplies the summation over all possible values of E(0), …, E(t), in which the set D_E defined in Lemma 2 is utilized. Lastly, the fourth-line summations cover all the values of S(0), …, S(t). Note that the logical expression Compliance(u) and the transition matrix inside the product are built up from all these previously fixed summation indices.

Now we show how to calculate N̄(t). First, it must be observed that positive integer multiples of N(t) compose the same N̄(t). Therefore, the mapping N(t) → N̄(t) is not injective, and hence not invertible. Before proceeding further with the deduction of how to calculate N̄(t) from N(t), let us present some helpful auxiliary results.

Lemma 3: If i is an unlabeled vertex, then the following assertions hold ∀k ∈ K.


1) The minimum value of N̄_i^(k)(t) is

N̄_imin^(k)(t) = 1 / (1 + Σ_{u ∈ K\{k}} N_imax^(u)(t)).    (32)

2) The maximum value of N̄_i^(k)(t) is

N̄_imax^(k)(t) = N_imax^(k)(t) / (N_imax^(k)(t) + (K − 1)).    (33)

Proof: 1) According to (3), the minimum value occurs if three conditions are met: a) particle k is not initially spawned at vertex i; b) particle k never visits vertex i; and c) all other K − 1 particles u ∈ K\{k} visit vertex i in the quickest possible manner, i.e., they follow the trajectory given in Lemma 1. In this way, vertex i will be visited Σ_{u ∈ K\{k}} N_imax^(u)(t) times by the other particles. However, bearing in mind the initial condition of N(0) shown in the second expression of (12), we must add 1 to the total number of visits received by vertex i. By virtue of that, the total number of visits is expected to be 1 + Σ_{u ∈ K\{k}} N_imax^(u)(t). In view of this scenario, applying (3) yields (32). 2) The maximum value occurs if three conditions are satisfied: a) particle k initially spawns at vertex i; b) particle k visits i in the quickest possible manner; and c) all other particles u ∈ K\{k} never visit i. In this scenario, vertex i is expected to receive N_imax^(k)(t) + (K − 1) visits, where the second term is due to the initialization of N(0), as the second expression in (12) reveals. This information, together with (3), implies (33).

Remark 2: If the network does not have self-loops, then (32) reduces to

N̄_imin^(k)(t) = 1 / (1 + (K − 1) N_imax^(k)(t)).    (34)

The following lemma provides all reachable elements that an arbitrary entry of N̄(t) can have.

Lemma 4: Denote num/den an arbitrary irreducible fraction. Consider that the set I_t retains all the reachable values of N̄_i^(k)(t), ∀(i, k) ∈ S, for a fixed t. Then, I_t comprises all elements that satisfy the following constraints.
1) With regard to unlabeled vertices:
   a) the minimum element is given by the expression in (32);
   b) the maximum element is given by the expression in (33);
   c) all the irreducible fractions within the interval delimited by a) and b) such that:
      i) num, den ∈ ℕ*;
      ii) num ≤ N_imax^(k)(t);
      iii) den ≤ Σ_{u ∈ K} N_imax^(u)(t).
2) With regard to labeled vertices:
   a) 0, if particle k does not represent vertex i;
   b) 1, if particle k represents vertex i.

Proof: Regarding item 1): a) and b) follow straightforwardly from Lemma 3. c) First, remember that N_i^(k)(t) may only take integer values. According to (3), N̄_i^(k)(t) = num/den is a ratio of integer numbers. As a consequence, num and den must be integers, and clause i) is demonstrated. In view of (3), num only registers visits performed by a single particle; therefore, its upper bound is established by Lemma 1, i.e., N_imax^(k)(t), and clause ii) is proved. Looking at the same expression, observe that den registers the number of visits performed by all particles. Again, using Lemma 1 proves clause iii). Regarding item 2): a) as vertex i is labeled, ∃u ∈ K: N_i^(u)(t) = ∞; in view of (3) and (12), we obtain N̄_i^(k)(t) = 0; b) similarly, using (3) and (12), we get N̄_i^(k)(t) = 1.

Another interesting feature of the set I_t is elucidated in the following lemma.

Lemma 5: For a fixed t ≤ ∞, the set I_t indicated in Lemma 4 is always finite.

Proof: To demonstrate this lemma, it is enough to verify that each set appearing in the caput of Lemma 4 is finite. With regard to item 1) of Lemma 4: a) and b) are scalars, hence finite sets; c) clause i) indicates a lower bound for the numerator and the denominator, while clauses ii) and iii) reveal upper bounds for the numerator and denominator, respectively. Also from clause i), it can be inferred that the interval delimited by these lower and upper bounds is discrete. Therefore, the number of irreducible fractions that can be made from these two limits is finite. With regard to item 2) of Lemma 4: a) and b) are scalars and thus finite sets. As all the sets are finite for any t, and I_t is the union of all these subsets, it follows that I_t is also finite for any t.

Lemma 4 supplies the reachable values of an arbitrary entry of N̄(t) by means of the definition of the set I_t. Next, we simply extend this notion to the space spanned by the V × K matrices N̄(t), in such a way that each entry must be an element of I_t:

M_t ≜ { N̄ : N̄_i^(k)(t) ∈ I_t, ∀(i, k) ∈ S }.    (35)

In light of all these previous considerations, we can now provide a compact way of determining the distribution of N̄(t):

P(N̄(t) = U : U ∈ M_t) = Σ_{u=1}^{t} P(L = uN(t) : L̄ = U)    (36)

where the upper limit provided in the summation of (36) is taken using a conservative approach. Indeed, the probability of events such that N_i^(k)(t) > N_imax^(k)(t) is zero. By virtue of that, we can stop summing whenever any entry of the matrix uN(t) exceeds this value. We have omitted this observation from (36) for the sake of clarity. As t → ∞, P(N̄(t)) provides enough information for classifying the unlabeled vertices. In this case, they are labeled according to the team of particles that imposes the highest domination level. Since the domination level is a normalized stochastic variable, the output of this model is fuzzy.

B. Validation of the Theoretical Results

In this section, we will show that the theoretical results presented in the previous section approximate the empirical

Fig. 4. Comparison of the theoretical and empirical distributions of three distinct vertices, namely, v4, v11, and v16. One can see that the most probable domination level that the red particle will impose on v4 is approximately 0.88 with 34% probability, on v11 it is 0.53 with 47% probability, and on v16 it is 0.14 with 33% probability.

Fig. 5. Parameter sensitivity analysis of the model. The network configuration is as follows: V = 1000, four equally sized classes, and k = 16. On average, each vertex shares 40% of its links with vertices belonging to different classes. Each point in the trace is averaged over 100 realizations. The error bars represent standard deviations. (a) Accuracy rate versus λ with Δ = 0.07. (b) Accuracy rate versus Δ with λ = 0.6.

behavior of the stochastic competitive model for a large number of independent runs of the algorithm. To do so, the network shown in Fig. 1 is used, i.e., V = {v1, …, v20}. We label two vertices: v6 with the red label and v18 with the blue label. Two particles, K = {1, 2}, are inserted, each representing the red and blue labeled vertices, respectively. For both the empirical and theoretical comparisons, we set λ = 0.6, Δ = 0.07, ωmin = 0, and ωmax = 1. With respect to the empirical evaluation, since the proposed competitive model is stochastic, the empirical domination level matrix N̄(t) is estimated by executing the algorithm 10 000 times independently. For each run, the system is iterated until t = 1000 and the generated N̄(1000) is stored. Afterwards, V × K histograms are constructed, each representing an entry of N̄(1000). The (i, k)th histogram is populated using the N̄_i^(k)(1000) generated in each of the 10 000 runs. Since the domination levels are continuous values in the interval [0, 1], they are discretized using bins of width 0.01, i.e., 100 bins. At the end, the estimated empirical probability function P(N̄_i^(k)(1000)) is built by normalizing each histogram. Regarding the theoretical evaluation of the network in Fig. 1, we directly apply (36). With that, we are able to evaluate the theoretical probability P(N̄(1000)) of a specific domination level matrix N̄(1000) occurring. Since we cannot visually plot the probability distribution of

N̄(1000), on account of it being in a space of dimension V × K + 1, this probability distribution is marginalized on three specific vertices. Mathematically, the marginalization consists of calculating P(N̄_i^(k)(1000)) for the three selected vertices using P(N̄(1000)). This process is performed on vertices v4 (a member of the red class), v11 (a vertex on the border between both classes), and v16 (a member of the blue class). Fig. 4 shows the empirical and theoretical distributions of the domination levels that the red particle can impose on these three vertices. As one can see from Fig. 4(a), v4 will most probably be dominated by the red particle, since a large portion of the distribution is near 1. Fig. 4(b) confirms the overlapping nature of v11, because with a relatively high probability the algorithm will produce a domination level of the red particle in the vicinity of 0.50, and Fig. 4(c) indicates that the red particle will probably have little domination on v16. These curves must be interpreted in the following manner: take Fig. 4(a), for instance. There is an approximately 34% chance that the domination level imposed by the red particle on v4 will lie in [0.88, 0.89] if we start the system at time t = 0 and stop evolving it at t = 1000. Other values are possible, but with much rarer chances. As we can visually verify, the theoretical results approximately match those obtained by the empirical evaluation, confirming our theoretical analysis.

IV. COMPUTER SIMULATIONS

In this section, we present simulation results in order to assess the effectiveness of the proposed competitive model.

A. Parameter Sensitivity Analysis

We start this section by analyzing the behavior of the parameter λ, which is responsible for counterweighting the proportion of preferential and random walks performed by all particles in the network. In particular, its impact on random clustered networks, whose construction methodology has already been described, is inspected.
We set Δ = 0.07, ωmin = 0, and ωmax = 1. Fig. 5(a) shows how the accuracy rate of the model behaves as λ is varied from 0 (pure random walks) to 1 (pure preferential walks). As one can deduce from the figure, this parameter is sensitive to the outcome of


Fig. 6. Illustration of an artificial semisupervised data classification process via particle competition. Snapshot of the network when (a) t = 0, (b) t = 100, (c) t = 200, and (d) t = 300.

Fig. 7. Evolution of the average class domination level imposed by (a) the blue particle on the blue (vertices 1 to 50) and red (vertices 51 to 100) classes and (b) the same information for the red particle.

the technique. Usually, the optimal accuracy rate is attained when a mixture of random and preferential walks occurs. Empirically, for 0.2 ≤ λ ≤ 0.8, the model provides good performance.

Another important parameter that needs to be addressed is Δ, which is responsible for updating the energy of the particles according to the type of vertex they are visiting. The same experimental setup as before is utilized, with λ = 0.6. Fig. 5(b) shows the accuracy rate reached by the algorithm for different values of Δ. One can see that for intermediate values of Δ, namely 0.05 < Δ < 0.4, the model's performance is not sensitive. However, as Δ gets larger, the accuracy rate decays accordingly. This happens because the particles get exhausted as soon as they visit vertices dominated by rival teams; in this way, it becomes hard for the particles to change the ownership of previously conquered vertices. One can conceive of this process as an artificial “hard labeling.” On the other hand, when Δ is small, the particles are free to travel around the network with no energy penalties, i.e., they will rarely get exhausted. Therefore, all vertices in the network are expected to be in constant competition and no class borders will be established. It is worth stressing that ωmin and ωmax do not need to be analyzed, since they simply define an interval. Furthermore, Δ is not a sensitive parameter, because the interval that produces good results is very large. In view of this scenario, in the following simulations, we will always fix Δ = 0.07, ωmin = 0, and ωmax = 1.
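The Δ-driven energy dynamics discussed above can be sketched as a clamped gain/loss update (our simplified reading of the rule the text refers to as (6); parameter names follow the text, the function name is ours):

```python
def update_energy(E, owned, delta=0.07, w_min=0.0, w_max=1.0):
    """A particle gains delta when visiting a vertex dominated by its own
    team and loses delta otherwise; the result is clamped to [w_min, w_max].
    Reaching w_min corresponds to exhaustion (the S indicator, cf. (7))."""
    return min(E + delta, w_max) if owned else max(E - delta, w_min)

E = 0.5
for owned in (True, True, False, False, False):
    E = update_energy(E, owned)
assert abs(E - 0.43) < 1e-9          # 0.5 + 2*delta - 3*delta
# exhaustion: losses below w_min are clamped, so E sticks at the floor
assert update_energy(0.05, False) == 0.0
```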

B. Simulations on Synthetic Datasets

In order to facilitate the understanding of how the proposed technique works, we have designed a synthetic dataset with two simple classes, each of which has 50 vertices. Fig. 6(a) shows the initial configuration of the network, where the colored dots symbolize labeled samples and the black dots denote unlabeled data. With this simple dataset, we closely observe the dynamic behavior of the proposed algorithm. For this simulation, we use λ = 0.6. Fig. 6(b) and (c) shows the ownership of each vertex after 100 and 200 iterations, respectively. Fig. 6(d) reveals the outcome of the algorithm after 300 iterations, in which all vertices are correctly classified. In these snapshots, the vertices' colors represent their owners.

We now take a closer look at the evolution of the average domination level of the vertices that belong to the same class. Fig. 7(a) provides the average domination level imposed by the particle representing the initially blue labeled vertex on vertices 1 to 50 (blue class) and 51 to 100 (red class), whereas Fig. 7(b) displays the same information for the particle representing the initially red labeled vertex. Clearly, as time progresses, the classes are unmistakably separated by the competitive system.

Next, the proposed technique is tested on toy datasets with arbitrary forms of class distributions. These datasets are generated using the PRTools framework [37]. Since the data generated by this framework are not in a networked representation, a graph formation technique must be employed. To this end, the k-nearest neighbor rule is used with k = 5, i.e., each data item is represented by a vertex, and each vertex is connected to its five nearest neighbors as determined by the Euclidean distance among the data items. For every 50 unlabeled vertices, we randomly choose one and label it. Every labeled vertex has a particle representing it. Here, we set λ = 0.6. The first dataset, shown in Fig. 8(a), consists of 600 samples equally divided into two banana-shaped classes; the result is shown in Fig. 8(b). The second dataset, depicted in Fig. 8(c), comprises 550 samples separated into two Lithuanian classes of 250 and 300 vertices; the corresponding result is given in Fig. 8(d). All results are visually satisfactory with respect to the provided inputs, reinforcing the robustness of the technique in detecting arbitrary forms of class distributions.

C. Simulations on Benchmarked Datasets
In order to measure the performance of the proposed method against competing techniques, the benchmark proposed in [2] will be used. Such benchmark is composed of seven standard semisupervised datasets in the transductive setting, i.e., the test set coincides with the set of unlabeled points. A brief description of the metadata of the selected datasets is provided in Table II. The artificial datasets (the first three in

TABLE III
INFORMATION OF SOME SEMISUPERVISED LEARNING TECHNIQUES
Fig. 8. Semisupervised data classification of toy datasets. (a) and (b) Initial configuration and results for two banana-shaped classes. (c) and (d) Initial configuration and results for two Lithuanian classes. Data with the same color represent the same class. The black dots depict unlabeled vertices. The colored dots in the left panels represent previously labeled data.

TABLE II
METADATA AND OPTIMAL PARAMETERS USED IN THE DATASETS COMPOSING THE CHAPELLE'S BENCHMARK

Dataset   Classes   Dimension   Points   Type              k   λopt
g241c     2         241         1500     artificial        2   0.46
g241d     2         241         1500     artificial        2   0.42
Digit1    2         241         1500     artificial        3   0.66
USPS      2         241         1500     unbalanced        3   0.60
COIL      6         241         1500     —                 5   0.62
BCI       2         117         400      —                 2   0.54
Text      2         11 960      1500     sparse discrete   3   0.58

Table II) were designed in an attempt to create situations that correspond to assumptions that many semisupervised learning algorithms rely upon, i.e., the smoothness, cluster, and manifold assumptions. The other four datasets are derived from real-world data. For a detailed explanation of each of them, refer to [2]. No preprocessing was performed on these datasets. For each dataset in Table II, we consider 10 and 100 labeled vertices. All the labeled sets are ensured to contain at least one labeled vertex of each class. For each dataset and each quantity of labeled vertices (10 or 100 in these simulations), 12 distinct nonbiased label sets are provided by the benchmark. For each of the 12 label sets, the model is run 100 times independently. Finally, the test error of each dataset is calculated by averaging over these 12 × 100 = 1200 runs. For comparison purposes, the proposed technique is tested against several competing semisupervised learning techniques; a brief description of each of them is given in Table III. The simulation results were taken integrally from [2], except for the LGC, LP, and LNP techniques. In this way, the comparisons tend to be more objective (less biased), since [2, ch. 21] compiles these results and the configuration of each algorithm in detail. The model selection of the LGC, LP, and

Abbreviation     Technique                                           Ref(s).
MVU + 1-NN       Maximum variance unfolding                          [38], [39]
LEM + 1-NN       Laplacian eigenmaps                                 [40]
QC + CMR         Quadratic criterion and class mass regularization   [21], [41]
Discrete reg.    Discrete regularization                             [42]
TSVM             Transductive support vector machines                [23], [43]
SGT              Spectral graph transducer                           [23]
Cluster-Kernel   Cluster kernels                                     [44]
Data-Dep. Reg.   Data-dependent regularization                       [45]
LDS              Low-density separation                              [43]
Laplacian RLS    Laplacian regularized least squares                 [46]
CHM (normed)     Conditional harmonic mixing                         [47]
LGC              Local and global consistency                        [13]
LP               Label propagation                                   [18]
LNP              Linear neighborhood propagation                     [19]
LNP is done as follows. The three techniques have σ selected from the discretized interval σ ∈ {0, 1, . . . , 100}, and α is fixed at α = 0.99 (the same setup as in [13] and [19]). For the LNP, k is also evaluated over the discretized interval k ∈ {1, 2, . . . , 100}. The model selection consists of choosing the combination of parameters that produces the highest accuracy rate. Regarding the proposed algorithm, the data are transformed into a networked representation using the k-nearest neighbor graph formation technique. For this purpose, k is chosen over the discretized interval k ∈ {1, 2, . . . , 10}. For the model's parameter, we first discretize the interval of λ with a step of 0.02, i.e., λ ∈ {0, 0.02, . . . , 1}. Then, we run the proposed technique on a given input dataset for each combination of the discretized values of λ and k, and select the result that produces the best classification accuracy. The parameter optimization results are shown in the k and λopt columns of Table II. Considering the outcomes of parameter selection on the real-world datasets, one can see that λ < 0.2 or λ > 0.8 usually does not produce good results. This empirically confirms that our parameter sensitivity analysis carried out on synthetic datasets [shown in Fig. 5(a) and (b)] also applies to real-world datasets. The test error results of these techniques when there are only 10 and 100 initially labeled vertices are reported in Tables IV and V, respectively. In both tables, the average rank of each algorithm is supplied, calculated as follows: 1) for each dataset, we rank the algorithms according to their average performance (average test error), i.e., the best algorithm is ranked first, the second best one second, and so on; and 2) for each algorithm, the average rank is given by averaging its ranks over all the datasets.
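The k-nearest neighbor graph formation used throughout these experiments can be sketched as follows (a straightforward interpretation with hypothetical names; the paper itself does not publish code):

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-nearest-neighbor graph: each data item becomes a
    vertex connected to its k nearest neighbors under the Euclidean
    distance. An edge is kept if either endpoint selects the other."""
    n = len(X)
    # Pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                  # forbid self-loops
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:           # k closest vertices to i
            A[i, j] = A[j, i] = 1                # undirected edge
    return A
```

The symmetrization step means every vertex ends up with degree at least k, which keeps the network connected enough for the particles to walk on.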
A careful analysis of Tables IV and V reveals that our technique obtains satisfactory results in comparison to the other techniques. Specifically, in the case of few labeled nodes (10 labeled nodes), our technique fares better relative to the other techniques than in the case of 100 initially labeled nodes. This is an attractive characteristic, because the task of vertex labeling is often expensive and cumbersome, and generally involves the work of a human expert.
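The benchmark's labeled sets described earlier (random selections guaranteed to contain at least one labeled vertex per class) can be emulated with a simple sampler; the function name and interface here are illustrative only:

```python
import random

def sample_labeled_set(labels, n_labeled, seed=0):
    """Randomly pick n_labeled item indices while guaranteeing at least
    one labeled sample per class (assumes n_labeled >= number of classes)."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    chosen = [rng.choice(idx) for idx in by_class.values()]   # one per class
    remaining = [i for i in range(len(labels)) if i not in chosen]
    chosen += rng.sample(remaining, n_labeled - len(classes)) # fill the rest
    return sorted(chosen)
```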


TABLE IV
TEST ERRORS (%) WITH 10 LABELED TRAINING POINTS AND THE CORRESPONDING AVERAGE RANK OF EACH TECHNIQUE

Methods, in row order: 1-NN; SVM; MVU + 1-NN; LEM + 1-NN; QC + CMR; Discrete Reg.; TSVM; SGT; Cluster-Kernel; Data-Dep. Reg.; LDS; Laplacian RLS; CHM (normed); LGC; LP; LNP; Proposed method.

g241c:     47.88, 47.32, 47.15, 44.05, 39.96, 49.59, 24.71, 22.76, 48.28, 41.25, 28.85, 43.95, 39.03, 45.82, 42.61, 47.82, 41.17
g241d:     46.72, 46.66, 45.56, 43.22, 46.55, 49.05, 50.08, 18.64, 42.05, 45.89, 50.63, 45.68, 43.01, 44.09, 41.93, 46.24, 43.51
Digit1:    13.65, 30.60, 14.42, 23.47, 9.80, 12.64, 17.77, 8.92, 18.73, 12.49, 15.63, 5.44, 14.86, 9.89, 11.31, 8.58, 8.10
USPS:      16.66, 20.03, 23.34, 19.82, 13.61, 16.07, 25.20, 25.36, 19.41, 17.96, 17.57, 18.99, 20.53, 9.03, 14.83, 17.87, 15.69
COIL (15 values; two methods have no reported result): 63.36, 68.36, 62.62, 65.91, 59.63, 63.38, 67.50, 67.32, 63.65, 61.90, 54.54, 63.45, 55.82, 55.50, 54.18
BCI:       49.00, 49.85, 47.95, 48.74, 50.36, 49.51, 49.15, 49.59, 48.31, 50.21, 49.27, 48.97, 46.90, 47.09, 46.37, 47.65, 48.00
Text (15 values; two methods have no reported result): 38.12, 45.37, 45.32, 39.44, 40.79, 40.37, 31.21, 29.02, 42.72, 27.15, 33.68, 45.50, 49.53, 41.06, 34.84
Avg. rank: 9.86, 14.14, 9.86, 10.00, 7.86, 10.86, 10.86, 6.50, 10.86, 10.17, 8.43, 6.29, 7.20, 7.43, 5.71, 7.57, 4.29

TABLE V
TEST ERRORS (%) WITH 100 LABELED TRAINING POINTS AND THE CORRESPONDING AVERAGE RANK OF EACH TECHNIQUE

Methods, in row order: 1-NN; SVM; MVU + 1-NN; LEM + 1-NN; QC + CMR; Discrete Reg.; TSVM; SGT; Cluster-Kernel; Data-Dep. Reg.; LDS; Laplacian RLS; CHM (normed); LGC; LP; LNP; Proposed method.

g241c:     43.93, 23.11, 43.01, 40.28, 22.05, 43.65, 18.46, 17.41, 13.49, 20.31, 18.04, 24.36, 24.82, 41.64, 30.39, 44.13, 21.41
g241d:     42.45, 24.64, 38.20, 37.49, 28.20, 41.65, 22.42, 9.11, 4.95, 32.82, 23.74, 26.46, 25.67, 40.08, 29.22, 38.30, 25.85
Digit1:    3.89, 5.53, 2.83, 6.12, 3.15, 2.77, 6.15, 2.61, 3.79, 2.44, 3.46, 2.92, 3.79, 2.72, 3.05, 3.27, 3.11
USPS:      5.81, 9.75, 6.50, 7.64, 6.36, 4.68, 9.77, 6.80, 9.68, 5.10, 4.96, 4.68, 7.65, 3.68, 6.98, 17.22, 4.82
COIL (15 values; two methods have no reported result): 17.35, 22.93, 28.71, 23.27, 10.03, 9.61, 25.80, 21.99, 11.46, 13.72, 11.92, 45.55, 11.14, 11.01, 10.94
BCI:       48.67, 34.31, 47.89, 44.83, 46.22, 47.67, 33.25, 45.03, 35.17, 47.47, 43.97, 31.36, 36.03, 43.50, 42.69, 46.22, 41.57
Text (15 values; two methods have no reported result): 30.11, 26.45, 32.83, 30.77, 25.71, 24.00, 24.52, 23.09, 24.38, 23.15, 23.57, 46.83, 40.79, 38.48, 27.92
Avg. rank: 9.00, 9.29, 11.86, 12.14, 7.79, 8.21, 8.71, 4.67, 6.79, 7.17, 6.00, 5.10, 9.30, 10.00, 9.29, 12.50, 6.14

It is worth noting that, even though the proposed technique achieved the best average rank for the case of 10 labeled vertices, its performance on g241c and g241d is not among the top algorithms. The g241c and g241d datasets contain two Gaussian distributed classes that slightly overlap with each other [2, Fig. 21.1]. By analyzing the 12 labeled sets provided by the Chapelle's benchmark with only 10 available labels (12 ways to select 10 labeled data items), one can verify that a significant portion of these labeled samples lies within the overlapping region. Since the proposed technique relies on the propagation of labels by means of particle competition, an intense competition is established in this overlapping region, and the particles will probably not be able to conquer the central (nonoverlapping) vertices of each class. Hence, some of the startup labeled sets pull down the performance of our algorithm. Another point that reinforces this argument is that, when 100 labeled samples are available, the performance gap between the best technique and the proposed one is reduced. In this case, as the labeled set is bigger, one or more central vertices of each class are labeled. Therefore, each particle can successfully propagate its label to the samples in its vicinity by using the "divide-and-conquer" aspect of the technique. From this example, we see that initial label selection is an important issue for semisupervised learning, since it largely affects the final results of label propagation. Owing to page limitations, the design of an initial labeling strategy is left as future work. It is important to emphasize that, if we look at the eigenfunctions of g241c and g241d, the sign of the second eigenfunction at a given sample is an excellent guess as to which class that sample belongs to


TABLE VI
TEST ERRORS (%) OBTAINED FOR THE LETTER RECOGNITION DATASET

                   10% Labeled   5% Labeled   1% Labeled
LP                    10.94         18.99        46.94
LNP                   24.22         34.08        54.61
Proposed method       12.09         15.51        38.24

(optimal strategy of classification in this case). Since the SGT technique relies precisely on this information for prediction, this explains why it obtained better results on these two datasets relative to most of the other techniques under comparison. We must also stress an interesting phenomenon that occurs on the two-class BCI dataset: almost all algorithms supply a test error rate near 50% with 10 labeled samples, i.e., theoretically, their performance is near that of a random-guessing classifier. This happens for the following reason: even though BCI is a manifold-like dataset [2], a 2-D principal component analysis projection reveals that its classes are heavily mixed, which means that the cluster assumption does not hold. In order to examine the results in a statistical manner, an adaptation of the procedure described in [48] is applied to Tables IV and V. The methodology described therein evaluates the significance of the data using a Skillings–Mack test (a Friedman test with missing values) [49] by means of the average rank of each algorithm. The null hypothesis of the Skillings–Mack test is that all the algorithms are equivalent, so their ranks should be equal. From here on, for all tests, we fix a significance level of 0.10. For our experiments, according to [48], we have N = 7 (number of datasets) and k = 17 (number of algorithms), resulting in a critical value F(16, 96) ≈ 1.55, where the two arguments are the degrees of freedom of the F-distribution, namely k − 1 and (k − 1)(N − 1). Regarding Table IV, we get a value FF ≈ 1.89, which is higher than the critical value, so the null hypothesis is rejected at a 10% significance level. On the other hand, regarding Table V, we get a value FF ≈ 0.29, which is not higher than the critical value; hence, we cannot reject the null hypothesis at a 10% significance level.
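The average-rank procedure of steps 1) and 2), together with the F statistic derived from the Friedman chi-square (the Iman–Davenport form), can be sketched as follows. Note this sketch assumes complete results for every algorithm on every dataset, whereas the paper applies the Skillings–Mack variant precisely because Tables IV and V contain missing entries:

```python
def average_ranks(errors):
    """errors: list of per-dataset lists of test errors, one entry per
    algorithm. The lowest error gets rank 1, the next rank 2, and so on
    (ties are not handled in this sketch)."""
    n_datasets, k = len(errors), len(errors[0])
    totals = [0.0] * k
    for row in errors:
        order = sorted(range(k), key=lambda j: row[j])  # best first
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    return [t / n_datasets for t in totals]

def friedman_F(avg_ranks, n_datasets):
    """Iman-Davenport F statistic from the Friedman chi-square for
    k algorithms compared on N datasets, following [48]."""
    k, N = len(avg_ranks), n_datasets
    chi2 = (12.0 * N / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)
```

The resulting statistic is compared against the F(k − 1, (k − 1)(N − 1)) critical value, exactly as done in the text.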
As the null hypothesis is rejected for the data in Table IV (only 10 labeled examples), we can advance to post hoc tests, which aim to verify the performance of our algorithm in relation to the others. For this task, we use the Bonferroni–Dunn test, with the proposed technique fixed as the control algorithm. Basically, the Bonferroni–Dunn test checks whether the performance gap between an arbitrary algorithm and the control exceeds a critical difference (CD); if it does, the better ranked algorithm is said to be statistically superior to the worse ranked one. Evaluating the CD for the problem at hand yields CD = 4.86. The average rank of the proposed method is 4.29; by virtue of that, any algorithm whose rank lies within the interval 4.29 ± 4.86 is statistically equivalent to the control. From this, we conclude that our algorithm is superior to 1-NN, SVM, MVU + 1-NN, LEM + 1-NN, Discrete Reg., TSVM, Cluster-Kernel, and Data-Dep. Reg. on this benchmark. Nonetheless, the proposed technique has obtained


the best performance (the lowest average rank) among all compared techniques when only 10 labeled examples are employed.

As our last experiment, a simulation using a large-scale multiclass dataset is performed, namely the Letter Recognition dataset from the UCI Machine Learning Repository [50]. This dataset comprises 20 000 samples of the 26 capital letters (classes) of the English alphabet. We apply two representative graph-based algorithms for comparison purposes (LP and LNP, using the same parameters as before). Each algorithm is executed using three distinct randomly selected labeled-subset sizes: 1, 5, and 10%. For the proposed technique, we fix λ = 0.6. Table VI reports the test errors for this dataset. Again, we see that our method obtains good results compared with the others, especially when a very small number of labeled samples is available.

V. CONCLUSION

This paper has proposed a mathematical model for competitive learning in complex networks, biologically inspired by the competition processes taking place in many natural and social systems. In this model, several particles, each representing a class, navigate in the network to explore their territory and, at the same time, attempt to defend it against rival particles. If several particles propagate the same class label, then a team is formed and a cooperation process among these particles occurs. Since the proposed model is nonlinear and stochastic, probabilistic expressions have been derived to predict its behavior as time progresses. A numerical validation presented in this paper confirmed the theoretical predictions. As future work, new measures to quantify the overlapping nature of vertices or subgraphs in the network will be investigated. Another important topic to be studied is the propagation of labels from wrongly prelabeled vertices.
In this scenario, we will attempt to use the competition process to prevent the propagation of these false labels and, consequently, increase the reliability of the final prediction produced by the model. Since the prelabeling task usually involves human effort, which is susceptible to errors, this turns out to be an important topic for further study.

REFERENCES

[1] X. Zhu, "Semi-supervised learning literature survey," Dept. Comput. Sci., Univ. Wisconsin-Madison, Madison, Tech. Rep. 1530, 2005.
[2] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2006.
[3] S. Abney, Semisupervised Learning for Computational Linguistics. Boca Raton, FL: CRC Press, 2008.
[4] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text classification from labeled and unlabeled documents using EM," Mach. Learn., vol. 39, nos. 2–3, pp. 103–134, May–Jun. 2000.
[5] A. Fujino, N. Ueda, and K. Saito, "A hybrid generative/discriminative approach to semi-supervised classifier design," in Proc. 20th Nat. Conf. Artif. Intell., 2005, pp. 764–769.
[6] A. Demiriz, K. P. Bennett, and M. J. Embrechts, "Semi-supervised clustering using genetic algorithms," in Proc. Artif. Neural Netw. Eng., 1999, pp. 809–814.
[7] R. Dara, S. Kremer, and D. Stacey, "Clustering unlabeled data with SOMs improves classification of labeled real-world data," in Proc. World Congr. Comput. Intell., 2002, pp. 2237–2242.


[8] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proc. Workshop Comput. Learn. Theory, 1998, pp. 92–100.
[9] T. M. Mitchell, "The role of unlabeled data in supervised learning," in Proc. 6th Int. Colloq. Cognit. Sci., 1999, pp. 1–8.
[10] Z.-H. Zhou and M. Li, "Tri-training: Exploiting unlabeled data using three classifiers," IEEE Trans. Knowl. Data Eng., vol. 17, no. 11, pp. 1529–1541, Nov. 2005.
[11] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[12] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 912–919.
[13] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, vol. 16. Cambridge, MA: MIT Press, 2004, pp. 321–328.
[14] M. Wu and B. Schölkopf, "Transductive classification via local learning regularization," in Proc. 11th Int. Conf. Artif. Intell. Stat., Mar. 2007, pp. 628–635.
[15] F. Wang, T. Li, G. Wang, and C. Zhang, "Semi-supervised classification using local and global regularization," in Proc. 23rd Int. Conf. Artif. Intell., 2008, pp. 726–731.
[16] M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks," in Advances in Neural Information Processing Systems, vol. 14. Cambridge, MA: MIT Press, 2002.
[17] L. Grady, "Random walks for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1768–1783, Nov. 2006.
[18] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CALD-02-107, 2002.
[19] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," IEEE Trans. Knowl. Data Eng., vol. 20, no. 1, pp. 55–67, Jan. 2008.
[20] A. Blum and S. Chawla, "Learning from labeled and unlabeled data using graph mincuts," in Proc. 18th Int. Conf. Mach. Learn., 2001, pp. 19–26.
[21] M. Belkin, I. Matveeva, and P. Niyogi, "Regularization and semisupervised learning on large graphs," in Proc. Conf. Learn. Theory, 2004, pp. 624–638.
[22] M. Belkin, P. Niyogi, and V. Sindhwani, "On manifold regularization," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 17–24.
[23] T. Joachims, "Transductive learning via spectral graph partitioning," in Proc. Int. Conf. Mach. Learn., 2003, pp. 290–297.
[24] H. Yang, M. R. Lyu, and I. King, "A volume-based heat-diffusion classifier," IEEE Trans. Syst., Man, Cybern. B, vol. 39, no. 2, pp. 417–430, Apr. 2009.
[25] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[26] S. Grossberg, "Competitive learning: From interactive activation to adaptive resonance," Cognit. Sci., vol. 11, no. 1, pp. 23–63, Jan.–Mar. 1987.
[27] R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[28] D. Bacciu and A. Starita, "Competitive repetition suppression (CoRe) clustering: A biologically inspired learning model with application to robust clustering," IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1922–1940, Nov. 2008.
[29] D. Liu, Z. Pang, and S. R. Lloyd, "A neural network method for detection of obstructive sleep apnea and narcolepsy based on pupil size and EEG," IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 308–318, Feb. 2008.
[30] A. Meyer-Bäse and V. Thümmler, "Local and global stability analysis of an unsupervised competitive neural network," IEEE Trans. Neural Netw., vol. 19, no. 2, pp. 346–351, Feb. 2008.
[31] J. C. Principe and R. Miikkulainen, Advances in Self-Organizing Maps (Lecture Notes in Computer Science), vol. 5629. New York: Springer-Verlag, 2009.
[32] E. López-Rubio, J. M. O. de Lazcano-Lobato, and D. López-Rodríguez, "Probabilistic PCA self-organizing maps," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1474–1489, Sep. 2009.
[33] A. Kaylani, M. Georgiopoulos, M. Mollaghasemi, G. C. Anagnostopoulos, C. Sentelle, and M. Zhong, "An adaptive multiobjective approach to evolving ART architectures," IEEE Trans. Neural Netw., vol. 21, no. 4, pp. 529–550, Apr. 2010.
[34] M. G. Quiles, L. Zhao, R. L. Alonso, and R. A. F. Romero, "Particle competition for complex network community detection," Chaos, vol. 18, no. 3, pp. 033107-1–033107-10, Jul. 2008.

[35] E. Çinlar, Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[36] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," J. Stat. Mech.: Theory Exp., vol. 2005, no. 9, pp. P09008-1–P09008-10, 2005.
[37] R. P. W. Duin, "PRTools version 3.0: A MATLAB toolbox for pattern recognition," Delft Univ. Technol., Delft, The Netherlands, 2000.
[38] K. Q. Weinberger and L. K. Saul, "Unsupervised learning of image manifolds by semidefinite programming," Int. J. Comput. Vis., vol. 70, no. 1, pp. 77–90, Oct. 2006.
[39] J. Sun, S. Boyd, L. Xiao, and P. Diaconis, "The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem," SIAM Rev., vol. 48, no. 4, pp. 681–699, 2006.
[40] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Comput., vol. 15, no. 6, pp. 1373–1396, Jun. 2003.
[41] O. Delalleau, Y. Bengio, and N. Le Roux, "Efficient non-parametric function induction in semi-supervised learning," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 96–103.
[42] D. Zhou and B. Schölkopf, "Discrete regularization," in Semi-Supervised Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2006, pp. 237–250.
[43] O. Chapelle and A. Zien, "Semi-supervised classification by low density separation," in Proc. 10th Int. Workshop Artif. Intell. Stat., 2005, pp. 57–64.
[44] O. Chapelle, J. Weston, and B. Schölkopf, "Cluster kernels for semi-supervised learning," in Advances in Neural Information Processing Systems, vol. 15. Cambridge, MA: MIT Press, 2003, pp. 585–592.
[45] A. Corduneanu and T. Jaakkola, "Data-dependent regularization," in Semi-Supervised Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2006, pp. 163–190.
[46] V. Sindhwani, P. Niyogi, and M. Belkin, "Beyond the point cloud: From transductive to semi-supervised learning," in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 824–831.
[47] C. J. C. Burges and J. C. Platt, "Semi-supervised learning with conditional harmonic mixing," in Semi-Supervised Learning (Adaptive Computation and Machine Learning). Cambridge, MA: MIT Press, 2006, pp. 251–273.
[48] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[49] M. Chatfield, "The Skillings–Mack test (Friedman test when there are missing data)," Stata J., vol. 9, no. 2, pp. 299–305, 2009.
[50] A. Frank and A. Asuncion. (2010). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml/

Thiago Christiano Silva was born in Americana, São Paulo, Brazil, in 1986. He received the B.E. degree (with first-place honors) in computer engineering from the University of São Paulo, São Carlos, in 2009. He is currently pursuing the Ph.D. degree with the same university. His current research interests include complex networks, machine learning, neural networks, and dynamical systems. Mr. Silva was the recipient of three distinctive awards during his B.E. degree: the Diploma of Honors Award, the Diploma of Academic Excellence Award, and the Diploma of Professional Excellence Award.

Liang Zhao (M'09–SM'11) received the B.S. degree from Wuhan University, Wuhan, China, and the M.Sc. and Ph.D. degrees from the Aeronautic Institute of Technology, São Paulo, Brazil, in 1988, 1996, and 1998, respectively, all in computer science. He joined the University of São Paulo, São Carlos, in 2000, where he is currently an Associate Professor with the Department of Computer Science. From 2003 to 2004, he was a Visiting Researcher with the Department of Mathematics, Arizona State University, Tempe. His current research interests include artificial neural networks, nonlinear dynamical systems, complex networks, bioinformatics, and pattern recognition. Dr. Zhao is a recipient of the Brazilian Research Productivity Fellowship. He is currently an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS.
