
Data Partition Learning With Multiple Extreme Learning Machines

Yimin Yang, Member, IEEE, Q. M. J. Wu, Senior Member, IEEE, Yaonan Wang, K. M. Zeeshan, Member, IEEE, Xiaofeng Lin, and Xiaofang Yuan, Member, IEEE

Abstract—As demonstrated earlier, the learning accuracy of the single-layer feedforward network (SLFN) is generally far lower than expected, which has been a major bottleneck for many applications. In fact, for some large real problems, it is accepted that after a tremendous learning time (within finite epochs), the network output error of an SLFN stops decreasing or decreases increasingly slowly. This report offers an extreme learning machine (ELM)-based learning method, referred to as the parent-offspring progressive learning method. The proposed method works by separating the data points into various parts, and then multiple ELMs learn and identify the clustered parts separately. The key advantages of the proposed algorithms compared to traditional supervised methods are twofold. First, it extends the ELM learning method from a single neural network to a multinetwork learning system, as the proposed multi-ELM method can approximate any target continuous function and classify disjointed regions. Second, the proposed method tends to deliver a similar or much better generalization performance than other learning methods. All the methods proposed in this paper are tested on both artificial and real datasets.

Index Terms—Data partition learning, extreme learning machine (ELM), learning accuracy, universal approximation.

Manuscript received June 25, 2014; revised August 16, 2014; accepted August 17, 2014. This work was supported by the Natural Sciences and Engineering Research Council of Canada. This paper was recommended by Associate Editor G.-B. Huang. Y. Yang is with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON N9B 3P4, Canada, and also with the College of Electric Engineering, Guangxi University, Nanning 530004, China. Q. M. J. Wu and K. M. Zeeshan are with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON N9B 3P4, Canada (e-mail: [email protected]). Y. Wang and X. Yuan are with the College of Electrical and Information Engineering, Hunan University, Changsha 410082, China. X. Lin is with the College of Electric Engineering, Guangxi University, Nanning 530004, China. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2352594

I. INTRODUCTION

THE widespread popularity of neural networks (NNs) in many fields is mainly due to their ability to approximate complex nonlinear mappings directly from input samples [1]. NNs can provide models for a large class of natural and artificial phenomena that are difficult to handle using classical parametric techniques. As a specific type of NN, single-layer feedforward networks (SLFNs) play an important role in practical applications. An SLFN consists of one input layer receiving stimuli from the external environment, a single hidden layer, and one output layer sending the network output to the external environment. For N arbitrary distinct samples

(x_j, t_j), j = 1, ..., N, where x_j ∈ R^n and t_j ∈ R^m, the network output is

$$f_L(x_j) = \sum_{i=1}^{L} \beta_i\, h\!\left(a_i \cdot x_j + b_i\right), \quad j = 1, \ldots, N \qquad (1)$$
where h denotes an activation function, (a_i, b_i) ∈ R^n × R denotes the parameters of the ith hidden node, and β_i ∈ R is the output weight between the ith hidden node and the output nodes.

An active topic on the universal approximation capability of SLFNs is how to determine the parameters (a, b, β) such that the network output f_L(x) can approximate a given target. According to conventional NN theories, SLFNs are universal approximators when all the parameters of the network (a, b, β) are allowed to be adjustable. Based on these network theories, many learning algorithms and variants have been playing dominant roles in training feedforward NNs, such as gradient-descent methods (see backpropagation and support vector machines [2]), least-square methods (see incremental radial basis function (RBF) networks as in the self-adaptive resource allocation network [3], growing and pruning RBF (GAP-RBF) [4], multi-innovation recursive least square-quantum particle swarm optimization [5], and generalized growing and pruning RBF (GGAP-RBF) [6]), subset selection methods (see orthogonal least squares [7], [8]), and so on.

Unlike the above NN theories, which claim that all the parameters in networks are allowed to be adjustable, several researchers have proposed partially random learning methods in which some hidden nodes are generated randomly [9]–[11]. For example, Broomhead and Lowe [11] focus on a specific RBF network: the centers a in [11] can be randomly selected from the training data instead of being tuned, but the impact factor b of an RBF hidden node is not randomly selected and is usually determined by users [12]. Different from these partially random learning methods, the extreme learning machine (ELM) is a fully random learning method that differs from the usual understanding of learning. It should be highlighted that in ELM all the hidden node parameters (a, b) are randomly generated and independent of the training data [13]. Huang et al. [1], [15] demonstrate that ELM can approximate any continuous target function with any required accuracy. In ELM methods, the hidden layer parameters of the NN need not be calculated; they are generated randomly. Although hidden nodes are important and critical, Huang et al. [13] indicated that they need not be tuned, and the hidden node parameters can be randomly generated beforehand. Unlike conventional learning


methods that must see the training data before generating the hidden node parameters, ELM can generate the hidden node parameters before seeing the training data. According to the universal approximation capability of SLFNs with random hidden nodes, Huang et al. [1], [14] proposed simple and efficient learning steps with an increased network architecture (I-ELM) and a fixed network architecture (ELM). These two methods generate the parameters of a hidden node randomly, and training is performed only at the output layer. Hence, the overall computational time for model structure selection and training of the model is often several hundred times less than in other learning methods such as the support vector machine (SVM) or backpropagation (BP). Reference [15] shows that ELM unifies fuzzy neural networks and SVM/least square SVM (LS-SVM), and that, compared to ELM, LS-SVM and proximal SVM achieve suboptimal solutions and have a higher computational cost. ELM has become increasingly attractive to researchers [13].

Based on ELM, improved methods such as enhanced incremental ELM (EI-ELM) [16], error minimized ELM (EM-ELM) [17], parallel chaos ELM (PC-ELM) [18], optimally pruned ELM (OP-ELM) [19], meta-cognitive ELM [20], real-coded genetic algorithm based ELM [21], architecture selection with a localized generalization error model based on ELM [22], dynamic ELM [23], and B-ELM [24] have been proposed to obtain better performance. Applications of ELM have also recently been presented in computer vision [25], [26], feature selection [27], [28], unsupervised learning, medical recognition [29], [30], power system analysis [31], [32], automation control [33], [34], etc.

Many of the above mentioned learning methods are established by decreasing the residual error of the NN until it reaches zero or some other stopping condition is met. For incremental learning methods, including I-ELM, EI-ELM, B-ELM, the minimal resource allocation network [4], GAP-RBF [4], GGAP-RBF [6], etc., the residual training error tends to decrease as hidden nodes are added. For fixed learning methods, including Basic-ELM, SVM, BP, and Cone [35], [36], the residual error of the NN reduces to zero when the number of neurons equals the number of training samples. However, it is accepted that the network output error reduces increasingly slowly as the number of hidden nodes increases. Especially for large real systems, it remains impossible to obtain zero error within a finite learning time, since the learning time increases without bound as the number of hidden nodes increases. For example, for the California House database (detailed information about this database is shown in Table I), the output error of an SLFN reduces very slowly after a large number of hidden nodes are used. The best mean testing root-mean-square error (RMSE) on the California House dataset obtained by three SLFN learning methods, ELM, BP, and SVM, is about 0.1267 (in our test, see Table III, the mean RMSE obtained by ELM is 0.1258, which is similar to the result shown in [14]), 0.1285, and 0.1258 [14].

This paper proposes an ELM-based learning method to further reduce the learning error. In this learning system, a partition growth method is proposed to separate similar feature data, which are easily trained by SLFNs, into the same partition. After all data points are classified, multiple NNs are

used to learn each corresponding partition. To differentiate our method from other popular learning algorithms, we refer to it as the parent-offspring progressive learning method (PPLM) in the context of this paper. In this paper, we prove that the multi-ELM method can approximate any target continuous function and classify disjointed regions, extending the ELM learning method from a single NN to multinetworks. Experimental results show that the proposed method tends to deliver a much better generalization performance than many other learning methods. For large datasets such as the IJCNN, Cod-RNA, and Covtype datasets [37], the learning error of the proposed method can be several to hundreds of times lower than that of other SLFN methods and multinetwork learning methods.

II. PRELIMINARIES AND PROBLEM STATEMENT

A. Notations

The sets {Ω_i} form a complete partition of the set Ω if ∪_{i=1}^n Ω_i = Ω and Ω_i ∩ Ω_j = ∅, ∀i ≠ j. The m-ary Cartesian product of a partition Ω is denoted by Ω^m. The difference of two sets Ω_i and Ω_j is denoted by Ω_i / Ω_j. The sets of real, integer, and positive integer numbers are denoted by R, Z, and Z^+, respectively. #A denotes the number of samples in a finite partition A. For all j ∈ Z^+, the notation {(a_l^j, b_l^j, β_l^j)}_{l=1}^{L_j} represents the parameter set of the jth SLFN NN_j, and L_j represents the number of hidden nodes of NN_j. H is called the hidden-layer output matrix of the SLFN; the lth column of H (H_l) is the lth hidden node output with respect to the inputs. e_n represents the network output error with n hidden nodes. For M training samples {(x_i, y_i)}_{i=1}^M, {x_i}_{i=1}^M represents the input data and {y_i}_{i=1}^M represents the output data.

B. Problem Statement and Definition

Definition 1: Given a finite dataset X sampled from a continuous model and a small positive value λ, the sets Q_i, i = 1, ..., c, are called an optimal partition of X, and c is the number of optimal partitions of X, if the following hold.
1) {(x_k^i, y_k^i)}_{k=1}^{M_i} = Q_i, M_i ∈ Z^+, Q_i ⊂ X, i = 1, ..., c.
2) Q_i, i = 1, ..., c, form a complete partition of X.
3) An L-hidden-node SLFN can learn each set Q_i, i = 1, ..., c, with less than λ error

$$Q_i = \Big\{(x_k, y_k)\ \Big|\ \big|y_k - \sum_{l=1}^{L} \beta_l H(a_l, b_l, x_k)\big| < \lambda,\ (x_k, y_k) \in X,\ k = 1, \ldots, M_i\Big\}. \qquad (2)$$

Problem 1: Given finite training data X = {(x_k, y_k)}_{k=1}^M and a small positive value λ > 0, find the smallest number c of NN parameter sets NN_j = {(a_i^j, b_i^j, β_i^j)}_{i=1}^L, j = 1, ..., c, and c optimal partitions Q_j = {(x_{k_j}, y_{k_j})}_{k_j=1}^{M_j}, j = 1, ..., c, such that all the points in each optimal partition Q_j, j = 1, ..., c, satisfy

$$Q_j = \Big\{(x_{k_j}, y_{k_j})\ \Big|\ \big|\sum_{i=1}^{L} \beta_i^j H(a_i^j, b_i^j, x_{k_j}) - y_{k_j}\big| \le \lambda,\ k_j = 1, \ldots, M_j,\ j = 1, \ldots, c\Big\} \qquad (3)$$

where Q_1 ∪ ··· ∪ Q_c = X, Q_1 ∩ ··· ∩ Q_c = ∅, M_1 + M_2 + ··· + M_c = M, and Q_1, ..., Q_c ⊂ X.
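Both Problem 1 and the algorithms that follow rely on a single basic operation: training an SLFN whose hidden parameters (a, b) are drawn at random and whose output weights β are obtained by least squares, as in (1). The following is a minimal sketch of that operation; it is a Python/NumPy illustration rather than the authors' MATLAB implementation, and the sigmoid activation and the helper names elm_fit/elm_predict are assumptions made for the example. The later sketches in this paper reuse these two helpers.

```python
# A minimal ELM training step in the sense of (1): random hidden parameters
# (a_i, b_i) and a least-squares solution for beta. This is a Python/NumPy
# illustration, not the authors' MATLAB code; the sigmoid activation and the
# names elm_fit/elm_predict are assumptions made for the example.
import numpy as np

def elm_fit(X, Y, L=10, rng=None):
    """Train an SLFN with L random hidden nodes."""
    rng = np.random.default_rng(rng)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))   # hidden weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)                  # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))              # hidden-layer output matrix H
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)        # output weights beta
    return {"A": A, "b": b, "beta": beta}

def elm_predict(model, X):
    H = 1.0 / (1.0 + np.exp(-(X @ model["A"] + model["b"])))
    return H @ model["beta"]
```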


Algorithm 1: Optimal Parent Partition Selection
Given a training set X, a small positive value λ, a loop count j_max, and j = 1.
Step 1) Set φ^1 = X.
Steps 2–4) Learning step:
  while j < j_max do
    a) calculate the offspring partition φ^{j+1} of φ^j according to Definition 3;
    b) j = j + 1;
  end while
return the optimal parent partition R = φ^1 ∩ φ^2 ∩ ··· ∩ φ^{j_max}.

Fig. 1. Structure of the parent-offspring progressive learning method.

Remark 1: The focus here is on providing several particular NNs to approach c "easy-learning" partitions in a continuous system and to approximate a continuous system model, i.e., to find the smallest number of NN parameter sets satisfying the above inequalities. The basic idea of the algorithm is to find NN parameters that make the inequalities in (3) true for as many M_i (i = 1, 2, ..., c) as possible, and then to remove those satisfied data points and repeat over those that remain until all data points have been processed [see Algorithm 3(c)].

The proposed method is composed of two parts: partition growth to classify the training data and partition estimation for the testing data. In Section III, we propose a partition growth algorithm for the training data. In Section IV, we propose a partition estimation algorithm for the testing data. The structure of the proposed PPLM is shown in Fig. 1.

III. PARTITION GROWTH ALGORITHM FOR TRAINING DATA

To solve Problem 1, we must first focus our attention on estimating a number of regions, classifying the data points into these regions, and obtaining the corresponding NN parameters. In this section, the partition growth algorithm is proposed to estimate feasible regions and to classify the training data points. The proposed method is composed of several steps that are shown in Sections III-A–III-C.

A. Optimal Parent Partition Selection

Definition 2: Given training data φ = {(x_k, y_k)}_{k=1}^M, for all j ∈ Z^+, the notation NN_φ^j represents a parameter set of the jth SLFN, NN_φ^j = {(a_l^j, b_l^j, β_l^j)}_{l=1}^L, which has already been trained by φ.

Definition 3: Given a finite number of data points X = {(x_k, y_k), k = 1, ..., M} sampled from a continuous system, a partition φ^j, φ^j ⊂ X, and n SLFNs NN_{φ^j}^i = {(a_l^i, b_l^i, β_l^i)}_{l=1}^L, i = 1, ..., n, that have already been trained by partition φ^j, the partition φ^{j+1} is obtained by

$$E_i = \Big\{(x_k, y_k)\ \Big|\ \big|y_k - \sum_{l=1}^{L} \beta_l^i H(a_l^i, b_l^i, x_k)\big| < \lambda,\ (x_k, y_k) \in X\Big\},\quad i = 1, \ldots, n$$
$$\varphi^{j+1} = \big\{\varphi^{j+1}\ \big|\ \varphi^{j+1} = E_z,\ z = \arg\max_i(\#E_i),\ i = 1, \ldots, n\big\}. \qquad (4)$$

We call the partition φ^j a parent partition, and φ^{j+1} is identified as the offspring partition of φ^j.

Remark 2: NN theories show that, with a sufficiently large number of hidden neurons, a network can approximate a given function to any accuracy. But for a large real problem, it is impossible to set too large a number of hidden nodes in an SLFN, because the SLFN learning time increases without bound as the number of hidden nodes increases. The idea of this definition is that the ith SLFN among these n SLFNs can learn all the points in partition φ^j with less than η error (η ≫ λ), i.e., all the points in φ^j = {(x_k, y_k)}_{k=1}^M satisfy |y_k − Σ_{l=1}^L β_l^i H(a_l^i, b_l^i, x_k)| < η, (x_k, y_k) ∈ X, k = 1, ..., M. However, according to Definition 3, we find that some points E_i = {(x_k, y_k)} in the partition φ^j satisfy |y_k − Σ_{l=1}^L β_l^i H(a_l^i, b_l^i, x_k)| < λ, (x_k, y_k) ∈ X.

According to Definitions 2 and 3, we have the following theorem.

Theorem 1: Given a training set X, a partition φ^j, φ^j ⊂ X, and a small positive value λ, repeat Definition 3 m times to generate m partitions φ^{j+1}, ..., φ^{j+m}. Then we have an optimal parent partition R given by R = φ^j ∩ φ^{j+1} ∩ ··· ∩ φ^{j+m}, and all the points in R = {(x_k, y_k)}_{k=1}^M belong to the same optimal partition such that

$$R = \Big\{(x_k, y_k)\ \Big|\ \big|y_k - \sum_{l=1}^{L} \beta_l^s H(a_l^s, b_l^s, x_k)\big| < \lambda,\ (x_k, y_k) \in X,\ s = 1, \ldots, m,\ k = 1, \ldots, M\Big\}. \qquad (5)$$

Proof: For each φ^{j+s}, s = 1, ..., m, there exists at least one SLFN that can learn φ^{j+s} with λ error based on Definition 3

$$\varphi^{j+s} = \Big\{(x, y)\ \Big|\ \big|y - \sum_{l=1}^{L} \beta_l^s H(a_l^s, b_l^s, x)\big| < \lambda,\ (x, y) \in X\Big\}. \qquad (6)$$

Because R = φ^{j+1} ∩ ··· ∩ φ^{j+m}, all the points in R satisfy (5).

According to Theorem 1, the optimal parent partition selection algorithm is summarized as Algorithm 1.

Remark 3: Theorem 1 gives the proof of Algorithm 1. The aim of Algorithm 1 is to find an optimal parent partition R.


Fig. 2. Steps of Condition 1.

All the points in R belong to the same optimal partition Q_i, i = 1, ..., c. In real applications, φ^1 and R should not equal an empty set. If φ^1 equals an empty set, the value of λ is increased. If R equals an empty set, the value of j_max is decreased.

B. Optimal Parent Partition Growth and Optimal Offspring Partition Selection

According to Algorithm 1 and Theorem 1, we obtain R ⊂ Q_i. Thus, the aim of the next step is to find a partition G with R ⊂ G ⊂ Q_i.

Lemma 1: Given a small positive value λ and an optimal parent partition R from Algorithm 1, update X = X/R. Generate j_max offspring partitions φ^1, φ^2, ..., φ^{j_max} of R according to Definition 3 and set G = φ^1 ∩ φ^2 ∩ ··· ∩ φ^{j_max}. If G satisfies Condition 1, then G ∪ R belongs to the same optimal partition (G ∪ R ⊂ Q_i, i = 1, ..., c, and all the points in G ∪ R satisfy {(x, y) | |y − Σ_{l=1}^L β_l H(a_l, b_l, x)| < λ}).

Condition 1: Given two sets G ⊂ X and R ⊂ X, a positive value λ, and n SLFNs NN_{G∪R}^i = {a_l^i, b_l^i, β_l^i}_{l=1}^L, i = 1, ..., n, which have already been trained by the training set G ∪ R, obtain a partition Y according to the following:

$$E_i = \Big\{(x, y)\ \Big|\ \big|y - \sum_{j=1}^{L} \beta_j^i H(a_j^i, b_j^i, x)\big| < \lambda,\ (x, y) \in G \cup R\Big\}$$
$$Y = \big\{Y\ \big|\ Y = E_z,\ z = \arg\max_i(\#E_i)\big\},\quad i = 1, \ldots, n. \qquad (7)$$

If Y = G ∪ R, we say that G satisfies Condition 1.

Proof: Based on Theorem 1, all the points in the sets R and G belong to the same optimal partition (R ⊂ Q_i, i = 1, ..., c, and G ⊂ Q_j, j = 1, ..., c). According to (7), all the points in Y can be approximated by an SLFN with less than λ error

$$Y = \Big\{(x, y)\ \Big|\ \big|y - \sum_{j=1}^{L} \beta_j^s H(a_j^s, b_j^s, x)\big| < \lambda,\ (x, y) \in G \cup R,\ 1 \le s \le n\Big\}. \qquad (8)$$

Thus, if Y = G ∪ R, we can say that G ∪ R belongs to the same optimal partition (G ∪ R ⊂ Q_i, i = 1, ..., c). Fig. 2 shows the detailed steps of Condition 1.

If we combine the result of Theorem 1 with the result of Lemma 1, then we can obtain a very important result for extending the optimal parent partition R.
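To make the constructions above concrete, the sketch below illustrates the offspring-partition step of Definition 3 and the intersection used in Algorithm 1. It represents a partition by the set of sample indices it contains, reuses the elm_fit/elm_predict helpers sketched earlier, assumes a one-dimensional target, and uses n = 10 candidate SLFNs by default; these choices and all names are illustrative assumptions, not values fixed by the paper.

```python
# Hedged sketch of Definition 3 and Algorithm 1. A partition is represented by
# the set of sample indices it contains; elm_fit/elm_predict are the helpers
# from the earlier sketch; Y is assumed to be a one-dimensional target.
import numpy as np

def offspring_partition(X, Y, train_idx, eval_idx, lam, n=10, L=10, rng=None):
    """Train n random SLFNs on train_idx and return the largest subset of
    eval_idx whose samples are fitted with error below lam, as in (4)."""
    rng = np.random.default_rng(rng)
    eval_arr = np.array(sorted(eval_idx))
    best = frozenset()
    for _ in range(n):                                   # n independent random SLFNs
        model = elm_fit(X[list(train_idx)], Y[list(train_idx)], L=L, rng=rng)
        err = np.abs(elm_predict(model, X[eval_arr]) - Y[eval_arr])
        E_i = frozenset(eval_arr[err < lam])             # the set E_i in (4)
        if len(E_i) > len(best):                         # z = arg max_i #E_i
            best = E_i
    return best

def optimal_parent_partition(X, Y, lam, j_max=5, **kw):
    """Algorithm 1: start from phi^1 = X, iterate Definition 3, and intersect,
    R = phi^1 ∩ ... ∩ phi^{j_max}."""
    everything = frozenset(range(len(X)))
    phi, R = everything, everything
    for _ in range(j_max - 1):
        phi = offspring_partition(X, Y, phi, everything, lam, **kw)
        R &= phi
        if not R:                                        # in practice, enlarge lam instead
            break
    return R
```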

Algorithm 2: Optimal Partition Growth and Optimal Offspring Partition Selection
Given a training set X, a small positive value λ, an optimal parent partition R from Algorithm 1, i = 1, X = X/R, and Y_i = R.
  while i < i_max do
    a) Generate j_max offspring partitions φ^1, φ^2, ..., φ^{j_max} of the set Y_i according to Definition 3;
    b) Calculate G = φ^1 ∩ φ^2 ∩ ··· ∩ φ^{j_max};
    c) Set i = i + 1 and H = Y_1 ∪ ··· ∪ Y_{i−1};
    if G satisfies Condition 1 and Condition 2 then
      Set Y_i = G;
    else
      Set Y_i = Y_{i−1} and let G equal an empty set;
    end if
  end while
return Y_1, Y_2, ..., Y_{i_max} and set Q = Y_1 ∪ Y_2 ∪ ··· ∪ Y_{i_max}.

Theorem 2: Given a small positive value λ, an optimal parent partition R from Algorithm 1, and a set G_j from Lemma 1 (G_j ∪ R ⊂ Q_i, i = 1, ..., c), update the dataset X = X/(G_1 ∪ R), generate j_max offspring partitions φ^1, φ^2, ..., φ^{j_max} of G_j according to Definition 3, and set G_{j+1} = φ^1 ∩ φ^2 ∩ ··· ∩ φ^{j_max}. Then G_j ∪ G_{j+1} ∪ R belongs to the same optimal partition (G_j ∪ G_{j+1} ∪ R ⊂ Q_i, i = 1, ..., c) if G_{j+1} satisfies the following Condition 2.

Condition 2: Given n SLFNs NN_H^s = {a^s, b^s, β^s}, s = 1, ..., n, which have already been trained by the training set H (H = G_j ∪ R), Y is obtained by the following:

$$E_i = \Big\{(x, y)\ \Big|\ \big|y - \sum_{l=1}^{L} \beta_l^i H(a_l^i, b_l^i, x)\big| < \lambda,\ (x, y) \in G_{j+1}\Big\}$$
$$Y = \big\{Y\ \big|\ Y = E_z,\ z = \arg\max_i(\#E_i)\big\},\quad i = 1, \ldots, n. \qquad (9)$$

If Y = G_{j+1}, we say that G_{j+1} satisfies Condition 2.

Proof: Because H belongs to the same optimal partition Q_i, the n NNs NN_H^s = {a^s, b^s, β^s}, s = 1, ..., n, must satisfy

$$H = \Big\{(x, y)\ \Big|\ \big|y - \sum_{j=1}^{L} \beta_j^s H(a_j^s, b_j^s, x)\big| < \lambda,\ (x, y) \in G_j \cup R\Big\}. \qquad (10)$$

For Y = G_{j+1}, there is at least one NN among these n NNs that can learn G_{j+1} with less than λ error according to (9). When n → +∞, there is at least one NN that can learn H ∪ G_{j+1} with less than λ error. Thus, G_j ∪ G_{j+1} ∪ R belongs to the same optimal partition.

According to Theorem 2, the optimal parent partition growth is summarized as Algorithm 2.

Remark 4: The aim of Algorithm 2 is to extend the optimal parent partition R. However, when #(R) > #(G_{j+1}) in Condition 2, it is impossible to achieve Y = G_{j+1}. Thus, we modify Condition 2 as follows.

Condition 3: Given a set H, which is defined as

$$H = \big\{H \subset Y_1 \cup \cdots \cup Y_{i_{\max}},\ \#(H) = \#(G_{j+1})\big\} \qquad (11)$$

and n NNs NN_H^s, s = 1, ..., n, which have already been trained by the training set H, Y is obtained by the following:

$$E_i = \Big\{(x, y)\ \Big|\ \big|y - \sum_{l=1}^{L} \beta_l^i H(a_l^i, b_l^i, x)\big| < \lambda,\ (x, y) \in G_{j+1}\Big\}$$
$$Y = \big\{Y\ \big|\ Y = E_z,\ z = \arg\max_i(\#E_i)\big\},\quad i = 1, \ldots, n. \qquad (12)$$

If Y = G_{j+1}, we say that G_{j+1} satisfies Condition 3.
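The growth step of Lemma 1, Theorem 2, and Algorithm 2 can be summarized by one recurring test: accept a candidate set only if at least one of n randomly initialized SLFNs learns it, together with what has already been accepted, within λ. The sketch below is a deliberately simplified illustration of that pattern under stated assumptions: it reuses elm_fit/elm_predict and offspring_partition from the earlier sketches, and it lets the union Q of the accepted pieces stand in both for the individual sets Y_i and for the size-matched set H of Condition 3, which the paper treats separately.

```python
# A deliberately simplified sketch of the growth step (Lemma 1, Theorem 2,
# Algorithm 2), reusing elm_fit/elm_predict and offspring_partition from the
# earlier sketches. It illustrates the acceptance pattern, not the exact
# procedure; all names and defaults are assumptions.
import numpy as np

def one_slfn_learns(X, Y, train_idx, check_idx, lam, n=10, L=10, rng=None):
    """Condition 1/2/3 pattern: does at least one of n random SLFNs trained on
    train_idx fit every sample of check_idx within lam?"""
    rng = np.random.default_rng(rng)
    check = np.array(sorted(check_idx))
    for _ in range(n):
        model = elm_fit(X[list(train_idx)], Y[list(train_idx)], L=L, rng=rng)
        if np.all(np.abs(elm_predict(model, X[check]) - Y[check]) < lam):
            return True
    return False

def grow_partition(X, Y, R, lam, i_max=40, j_max=5, **kw):
    """Algorithm 2 (simplified): extend R with intersected offspring
    partitions G drawn from the not-yet-assigned samples."""
    Q = frozenset(R)
    remaining = frozenset(range(len(X))) - Q            # X = X / R
    for _ in range(i_max):
        if not remaining:
            break
        phi, G = Q, remaining
        for _ in range(j_max):                          # steps a)-b) of Algorithm 2
            phi = offspring_partition(X, Y, phi, remaining, lam, **kw)
            G &= phi
            if not phi:
                break
        # Condition-1 style check: one SLFN must learn G together with Q.
        if G and one_slfn_learns(X, Y, Q | G, Q | G, lam, **kw):
            Q |= G                                      # Y_i = G is accepted
            remaining -= G
        # otherwise keep Q unchanged (Y_i = Y_{i-1}) and retry with new SLFNs
    return Q                                            # Q = Y_1 ∪ Y_2 ∪ ...
```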

2) As mentioned in our recent publication [24], in theory, a bidirectional extreme learning machine (B-ELM) with several hidden nodes can obtain a performance comparable to other ELMs with hundreds of hidden nodes. Thus, in the proposed method we use B-ELM to reduce the output error of the SLFNs to λ as quickly as possible.

D. Pseudo-Code for the PPLM Method

Our proposed method, PPLM, can be summarized as Algorithm 3. In this paper, S is generated randomly from the region [#X/15, #X/10] and λ ∈ [0.005, 0.1].

Algorithm 3: Parent-Offspring Progressive Learning Method
Given a training set X, a small positive value λ, and a specific value S, c = 1.
  while #(·) < S do
    a) Obtain the optimal parent partition R according to Algorithm 1;
    b) Obtain the optimal partition Q_c = Y according to Algorithm 2;
    c) Let X = X/Q_c, c = c + 1, and λ = λ + Δλ.
  end while
Obtain c optimal partitions Q_1, ..., Q_c.

IV. REGION ESTIMATION FOR TESTING DATA POINTS

The classification and learning steps described in Section III return c feasible regions Q_1, ..., Q_c extracted from the training dataset X. These provide the classification of the M training data points into the c regions. By training these c datasets, respectively, we obtain c corresponding NNs (c NN parameter sets {(a_i^1, b_i^1, β_i^1)}_{i=1}^L, ..., {(a_i^c, b_i^c, β_i^c)}_{i=1}^L). However, for testing data points, one still does not know into which of the c regions these N data points should be classified. Thus, we first classify these N testing data points into the c regions, and then use the c NNs to generate N estimated testing data outputs. In this paper, M-ELM [15] is first used to train on these c regions Q_1, ..., Q_c, after which all the testing data points can be classified by this trained M-ELM. Finally, we use the c NNs obtained by the process outlined in Section III to generate the testing data outputs. Fig. 4 shows the structure of the testing data partition estimation.
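To connect Algorithm 3 with the region estimation described in this section, the following hedged end-to-end sketch strings the earlier helpers together: the outer loop repeatedly extracts a partition and enlarges λ, and at test time a plain one-versus-all ELM classifier (used here only as a stand-in for the M-ELM of [15]) assigns each test point to a region whose dedicated SLFN then produces the output. The stopping rule on the number of unassigned samples, the classifier choice, and every name are assumptions made for the illustration.

```python
# Hedged end-to-end sketch of Algorithm 3 plus the test-time region estimation
# of Section IV, reusing elm_fit/elm_predict, optimal_parent_partition, and
# grow_partition from the earlier sketches.
import numpy as np

def pplm_train(X, Y, lam=0.02, d_lam=0.01, S=None):
    S = len(X) // 15 if S is None else S
    unassigned = list(range(len(X)))
    regions, models = [], []
    while len(unassigned) > S:                       # assumed stopping rule
        Xr, Yr = X[unassigned], Y[unassigned]
        R = optimal_parent_partition(Xr, Yr, lam)    # Algorithm 1
        if not R:
            lam += d_lam                             # no parent found: relax lambda
            continue
        Q_local = grow_partition(Xr, Yr, R, lam)     # Algorithm 2
        Q = [unassigned[i] for i in Q_local]         # back to global indices
        regions.append(Q)
        models.append(elm_fit(X[Q], Y[Q], L=10))     # dedicated SLFN for region Q_c
        assigned = set(Q)
        unassigned = [i for i in unassigned if i not in assigned]
        lam += d_lam                                 # step c) of Algorithm 3
    # Region classifier (stand-in for M-ELM): one-hot targets over the regions.
    labels = np.zeros(len(X), dtype=int)
    kept = []
    for c, Q in enumerate(regions):
        labels[Q] = c
        kept.extend(Q)
    clf = elm_fit(X[kept], np.eye(len(regions))[labels[kept]], L=50)
    return {"regions": regions, "models": models, "clf": clf}

def pplm_predict(state, X_test):
    region = np.argmax(elm_predict(state["clf"], X_test), axis=1)
    out = np.empty(len(X_test))
    for c, model in enumerate(state["models"]):
        mask = region == c
        if np.any(mask):
            out[mask] = elm_predict(model, X_test[mask])
    return out
```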


TABLE I. Specification of 19 Benchmark Data Sets.

Fig. 4. Region estimation for testing data points.

V. EXPERIMENT

In this section, aimed at examining the performance of our proposed learning method, we test the proposed method on 19 regression and classification problems. The experiments are conducted in MATLAB 2009a on a machine with 32 GB of memory and an E3-1230 v2 (3.3 GHz) processor. In the experiments, we compare the performance and efficiency of the proposed method with 15 other state-of-the-art learning methods. First, the proposed PPLM is compared with SLFN methods, including I-ELM, EM-ELM, OP-ELM, PC-ELM, EI-ELM, B-ELM, SVM, and BP. Furthermore, we compare the performance and efficiency of the proposed method with other multinetwork or clustering methods, including Boost SVM, MultiSVM_light^linear [38], MultiSVM_light^RBF, MultiSVM_perf^linear [38], MultiSVM_perf^RBF (SVMlight and SVMperf are available at http://svmlight.joachims.org/ and http://svmlight.joachims.org/svm_perf.html), classifier adaptation based on performance measure optimization (CAPO) [39], and UNlabel [40].

Nineteen different databases are chosen for the experiments. All databases, except those in Section V-D, are preprocessed in the same way (held-out test). Table I shows the training and testing data of the corresponding datasets. Random permutations of the whole dataset are taken with replacement, and #Train samples (shown in Table I) are used to create the training set. The remaining samples are used for the test set (#Test in Table I). In Section V-D, the databases are preprocessed by fivefold cross validation. The average results are obtained over ten trials for all problems. The performance results are shown in Figs. 5–10 and Tables II–VI. In these tables, close results obtained by different algorithms are underlined, and the apparently better results are shown in boldface. In Sections V-A–V-C, the number of hidden nodes in the proposed PPLM is ten, and S = card(A)/15, j_max = 5, and i_max = 40.

A. Performance Results for Regression Problems

1) Performance Results for Incremental Learning Methods: As mentioned in [24], for incremental learning methods such as EI-ELM, EM-ELM, I-ELM, and so on, Huang et al. [1] indicated that, as hidden nodes are added, the network output error reduces very slowly with network growth. In this test, the stopping criterion of I-ELM and B-ELM is set as E_{n−1} − E_n < 1.0 × 10^{−12}. For EI-ELM, the stopping criterion is set as e_{n−1} − e_n < 2.2 × 10^{−7} (if we set e_{n−1} − e_n < a with a < 2.0 × 10^{−7}, the number of hidden nodes becomes larger than 8000, and MATLAB reports "out of memory"). The comparisons are conducted on the nine real benchmark regression problems shown in Table I. B-ELM, I-ELM, and EI-ELM increase the hidden nodes one by one, and the proposed PPLM uses B-ELM with ten fixed hidden nodes.

As shown in Table II and Fig. 5, the advantage of PPLM in testing RMSE is obvious. In Table II, for the House 8L and Pole problems, the testing RMSE of I-ELM, B-ELM, and EI-ELM is about three times larger than that of PPLM. For other problems such as Parkinsons, Wine, Puma, and so on, the testing RMSE of I-ELM, B-ELM, and EI-ELM is about two times larger than that of PPLM. In fact, a single-hidden-layer NN with thousands of hidden nodes is an extremely large network structure, meaning that PPLM responds to new unknown external stimuli much more accurately than incremental learning methods in real deployment.

2) Performance Results for Fixed Learning Methods: For fixed learning methods such as ELM and EM-ELM, we use 500 hidden nodes, whereas in OP-ELM the optimal number of hidden nodes is selected by LOO, as shown in [19]. Therefore, for OP-ELM, the number of hidden nodes gradually increases by an interval of five until the number of nodes equals 500. To compare the generalization performance of the proposed PPLM with that of other fixed network learning methods, ELM, EM-ELM, and OP-ELM are also evaluated on these nine regression problems. Table III displays the performance evaluation of PPLM and the other fixed ELM methods. As seen from the simulation results given in these tables, the advantage of PPLM in testing RMSE is obvious. In Table III, for the Puma and California Housing problems, the testing RMSE of ELM, OP-ELM, and EM-ELM is about two times larger than that of PPLM.
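For reference, the held-out protocol described above (a fresh random permutation per trial, #Train samples for training, the rest for testing, and the testing RMSE averaged over ten trials) can be sketched as follows; the function names and the use of Python/NumPy instead of the paper's MATLAB setup are assumptions.

```python
# Hedged sketch of the held-out evaluation protocol used in Sections V-A to V-C.
import numpy as np

def held_out_rmse(X, Y, n_train, train_fn, predict_fn, trials=10, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(trials):
        perm = rng.permutation(len(X))
        tr, te = perm[:n_train], perm[n_train:]
        model = train_fn(X[tr], Y[tr])
        pred = predict_fn(model, X[te])
        scores.append(np.sqrt(np.mean((pred - Y[te]) ** 2)))
    return float(np.mean(scores))

# e.g., held_out_rmse(X, Y, n_train, elm_fit, elm_predict) for the basic ELM.
```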


Fig. 5. Average testing RMSE when training I-ELM, ELM, EM-ELM, B-ELM, PC-ELM, and the proposed method on Parkin, California House, and Puma, where the x- and y-axes show the number of hidden nodes and average testing RMSE. Result on (a) Parkin, (b) California House, and (c) Puma.


Fig. 6. Average testing accuracy when training ELM, EM-ELM, I-ELM, and the proposed method on Acoustic, Poker, Connect4, and A9a, where the x- and y-axes show the number of hidden nodes and average testing accuracy, respectively.

TABLE II. Performance Comparison of Incremental Learning Methods (Time: training time; Mean: mean testing RMSE).

Huang et al. [1], [15], [17] and Yang et al. [18], [24] have systematically investigated the performance of ELM, support vector regression (SVR), EM-ELM, B-ELM, I-ELM, and BP for most of the databases tested here. These papers show that ELM methods obtain a generalization performance similar to that of SVR or BP, but in much faster and simpler ways. As observed from our experimental results, the proposed PPLM achieves better generalization performance than ELM, B-ELM, EM-ELM, and so on. Thus, the proposed PPLM provides better generalization performance than SVR and BP as well (more detailed experimental results can be found at http://www1.uwindsor.ca/engineering/cvss/~yimin).

B. Performance Results for Classification Problems

In order to indicate the advantage of the proposed method in classification performance, tests have been conducted on the accuracy of the proposed method compared to other ELM-based algorithms. Table IV and Fig. 6 display the performance comparison of I-ELM, EM-ELM, ELM, and PPLM. As seen from Table IV and Fig. 6, the advantage of the proposed method in testing accuracy is obvious. Consider Poker (a large number of samples with medium input dimension), Acoustic (a medium number of samples with medium input dimension), and Cod-RNA (a large number of samples with low input dimension).


Fig. 7. Classification results for 2000 samples when training ELM and the proposed method on Cod-RNA, where the x- and y-axes show the first and second dimension values of the Cod-RNA dataset, respectively. These 2000 points are selected randomly from all 180 000 testing samples. (a) True partition. (b) Estimated partition by PPLM. (c) Estimated partition by ELM. (d) Optimal partitions information. (e) Classification results by PPLM. (f) Classification results by ELM.


Fig. 8. Average testing accuracy when training MultiSVMlight (linear), MultiSVMlight (RBF), MultiSVMperf (linear), MultiSVMperf (RBF), CAPO, I-ELM, ELM, and the proposed method on A9a, W3a, and IJCNN, where the x- and y-axes show the C values and average testing accuracy, respectively. Result on (a) A9a, (b) W3a, and (c) IJCNN.

1) For the Poker dataset, the testing error of PPLM is about 2.78, 3, and 2.78 times lower than that of ELM, I-ELM, and EM-ELM, respectively.
2) For the Acoustic dataset, the testing error of PPLM is about 21, 30, and 20 times lower than that of ELM, I-ELM, and EM-ELM, respectively.
3) For the Cod-RNA dataset, the testing error of PPLM is about 101, 700, and 100 times lower than that of ELM, I-ELM, and EM-ELM, respectively.

Huang et al. [1], [14], [15], [17] have systematically investigated the performance of ELM, SVM, and BP for most of the classification datasets. It is found that ELM obtains a generalization performance that is similar to or better than that of SVM/SVR. Thus, the proposed method always provides a better generalization performance than SVM and BP.

Furthermore, to demonstrate the advantage of the proposed method in generalization performance, we take the Cod-RNA dataset as an example. Fig. 7 shows the detailed classification performance. We can see that Fig. 7(a) is the same as Fig. 7(b), which means the proposed method achieves 100% testing accuracy on the Cod-RNA dataset. More importantly, according to single-network approximation theory, it is reasonable to accept that a separating hyperplane is necessary in a single NN (BP, SVM, or ELM). From Fig. 7(c), we find that there is a separating hyperplane in ELM; unfortunately, it is insufficient to achieve 100% accuracy by using only one separating hyperplane. From Fig. 7(b) and (d), we can see that, unlike traditional learning methods, there is more than one separating hyperplane in the proposed multinetwork learning method. The proposed method is superior to other SLFN methods because PPLM exploits more nonlinearity.

Fig. 9. Time complexity when training MultiSVMlight (linear), MultiSVMlight (RBF), CAPO, and the proposed method on Cod-RNA and Covapp, where the x- and y-axes show the number of training samples and CPU time (in seconds), respectively. Result on (a) Cod-RNA and (b) Covapp.

Fig. 10. Performance and CPU time (in seconds) of the proposed method with different parameters n, i_max, j_max, S, λ. Each subfigure shows performance in the first row and the corresponding CPU time in the second row. Result on (a) Bank32, (b) Puma, (c) Pole, (d) Bank32, (e) Puma, and (f) Pole.

C. Performance Comparison Between PPLM and Other State-of-the-Art Clustering Methods

In this subsection, we compare the performance and efficiency of PPLM with state-of-the-art methods. Specifically, we

compare seven methods: Boost SVM, MultiSVM_light^linear [38], MultiSVM_light^RBF, MultiSVM_perf^linear [38], MultiSVM_perf^RBF, CAPO [39], and UNlabel [40]. For MultiSVMlight, MultiSVMperf, Boost SVM, and CAPO, the parameter C is selected from C ∈ {2^{−7}, ..., 2^{7}}. Tables V and VI present the performance of the compared methods; the best result for each task is displayed in boldface. PPLM and CAPO succeed in finishing all tasks within 6 h, but the remaining methods can only work for binary-classification problems. In these tables, we observe that the proposed method achieves improvements in testing accuracy on most tasks, and many of the performance improvements are extremely large. Consider the following.

1) For the IJCNN dataset, the testing error of PPLM is about 9, 2, 5, 8, 10, and 12 times lower than that of MultiSVM_light^linear, MultiSVM_light^RBF, MultiSVM_perf^linear, MultiSVM_perf^RBF, Boost SVM, UNlabel, and CAPO, respectively.
2) For Cod-RNA, the testing error of PPLM is about 730, 250, 330, 930, 970, 275, and 920 times lower than that of MultiSVM_light^linear, MultiSVM_light^RBF, MultiSVM_perf^linear, MultiSVM_perf^RBF, Boost SVM, UNlabel, and CAPO, respectively.
3) For Covtype, the testing error of PPLM is about 45 and 40 times lower than that of Boost SVM and CAPO, respectively.

Moreover, we find that only CAPO and PPLM can work for multiclassification problems.

TABLE III. Performance Comparison of Fixed Learning Methods (Time: training time; Mean: mean testing RMSE).

TABLE IV. Performance Comparison of Fixed Learning Methods (Accuracy: mean testing classification accuracy; Time: training time; Error: mean testing classification error).

TABLE V. Performance Comparison of Other Clustering Methods (Error: mean testing classification error; Time: training time).

To further study the generalization performance for classification problems under different parameters, we perform experiments on three medium-sized datasets: A9a, W3a, and IJCNN. We compare the classification performance of PPLM and the compared methods under different values of the parameter C. For all compared methods, we vary C within {2^{−7}, 2^{−6}, ..., 2^{7}}. Moreover, for PPLM and CAPO, we fix the additional parameter λ at 0.03 and the additional parameter B at 1. Fig. 8 shows the results: the proposed method generally outperforms the compared methods at different C. Thus, our method is more robust with respect to C.

It is well known that many ensemble learning methods have computational loads of nonlinear complexity, i.e., the computational time is a nonlinear function of the training sample size N. As seen in Fig. 9, for the proposed method the computational complexity for N points is approximately proportional to N. However, CAPO and SVMlight (RBF) require O(N^2) computational effort. Thus, with respect to time complexity, SVMlight (linear) and PPLM cost comparable CPU time, which is much less than that of CAPO and MultiSVM_light^RBF.

TABLE VI. Performance Comparison of Other Clustering Methods (Time: training time; Mean: mean testing classification error).

D. Sensitivity of Parameters

In this subsection, we discuss the sensitivity of all the parameters in PPLM. In the proposed method, there are five parameters: n, λ, S, j_max, and i_max. The mean testing RMSE, mean learning time, and mean number of regions offered by PPLM with different parameter values for all tested real problems are shown in Fig. 10. We carry out fivefold cross-validation tests on some databases. These real databases are tested by selecting the parameters i_max, j_max, and S randomly from their respective feasible regions, with n always generated randomly from [10, 15], while fixing the parameter λ at 0.005, 0.02, 0.03, 0.05, 0.08, and 0.1. We consider three kinds of situations: 1) i_max, j_max, and S are generated randomly from their respective regions (j_max ∈ [4, 8], i_max ∈ [10, 50], and S ∈ [#X/15, #X/10]); 2) the parameter j_max is fixed at the specific value 5, but i_max and S are generated randomly from [10, 50] and [#X/15, #X/10], respectively; and 3) the parameters j_max, i_max, and S are fixed at the specific values 5, 50, and #X/12. The generalization performance and learning time are shown in Fig. 10.

We find in Fig. 10 that the mean testing RMSE for different parameters j_max, i_max, and S is nearly the same. This indicates that the generalization performance is not sensitive to the parameters j_max, i_max, and S. The user can choose the regions for these four parameters in PPLM randomly at the outset without affecting the generalization performance in the learning process. Furthermore, from these subfigures, it is clear that the parameter λ is in some cases not very sensitive, such as in Bank32: the mean cost values with different values of λ show no significant difference. However, in some cases, such as Puma, the value of the parameter λ is very sensitive to the performance of PPLM. In conclusion, no formal method is available to choose the value of the parameter λ; it depends on the characteristics of each real database. However, from the results in Fig. 10, we can see that λ < 0.05 generally performs well. In all, we suggest fixing λ at less than 0.05 in general; the other four parameters can be generated randomly from the corresponding regions.
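As a compact summary of this guidance, the helper below draws a PPLM parameter setting in the way suggested above: λ is kept below 0.05 while n, j_max, i_max, and S are sampled from the reported regions. The dictionary keys and the particular value λ = 0.03 are illustrative assumptions.

```python
# Illustrative helper reflecting the parameter guidance above: keep lambda
# below 0.05 and draw the remaining parameters from the reported regions.
import numpy as np

def sample_pplm_params(n_samples, rng=None):
    rng = np.random.default_rng(rng)
    return {
        "lam":   0.03,                                          # keep lambda < 0.05
        "n":     int(rng.integers(10, 16)),                     # n in [10, 15]
        "j_max": int(rng.integers(4, 9)),                       # j_max in [4, 8]
        "i_max": int(rng.integers(10, 51)),                     # i_max in [10, 50]
        "S":     int(rng.integers(n_samples // 15, n_samples // 10 + 1)),
    }
```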

VI. CONCLUSION

In this paper, a new learning system called the PPLM is proposed to further improve learning accuracy. Unlike other NN-based methods, in this new approach similar feature data are selected into the same region, and then multiple neural networks learn these regions and eventually further reduce the learning error. The experimental results show that, compared to other NN learning methods, the proposed PPLM significantly reduces the network output error. This means that the proposed method tends to deliver a much better generalization performance than other learning methods. For some large datasets such as the IJCNN, Cod-RNA, and Covtype datasets, the learning error of the PPLM can be several hundreds of times lower than that of other learning methods.

REFERENCES

[1] G. B. Huang, L. Chen, and C. K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 879–892, Jul. 2006.
[2] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[3] S. Suresh, K. Dong, and H. J. Kim, "A sequential learning algorithm for self-adaptive resource allocation network classifier," Neurocomputing, vol. 73, nos. 16–18, pp. 3012–3019, 2010.
[4] Y. W. Lu, N. Sundararajan, and P. Saratchandran, "A sequential learning scheme for function approximation using minimal radial basis function neural networks," Neural Comput., vol. 9, no. 2, pp. 461–478, 1997.
[5] H. Chen, Y. Gong, and X. Hong, "Online modeling with tunable RBF network," IEEE Trans. Cybern., vol. 43, no. 3, pp. 935–947, Jun. 2013.
[6] G. B. Huang, P. Saratchandran, and N. Sundararajan, "An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 6, pp. 2284–2292, Dec. 2004.
[7] Y. Zhao, H. L. Wei, and S. Billings, "A new adaptive fast cellular automaton neighborhood detection and rule identification algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 4, pp. 1283–1287, Aug. 2012.
[8] G. Huang, S. J. Song, and C. Wu, "Orthogonal least squares algorithm for training cascade neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 11, pp. 2629–2637, Nov. 2012.
[9] B. Igelnik and Y. H. Pao, "Stochastic choice of basis functions in adaptive function approximation and the functional-link net," IEEE Trans. Neural Netw., vol. 6, no. 6, pp. 1320–1329, Nov. 1995.
[10] Y. H. Pao, G. H. Park, and D. J. Sobajic, "Learning and generalization characteristics of the random vector functional-link net," Neurocomputing, vol. 6, no. 2, pp. 163–180, 1994.
[11] D. S. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Syst., vol. 2, no. 3, pp. 321–355, 1988.


[12] G. B. Huang, M. B. Li, L. Chen, and C. K. Siew, "Incremental extreme learning machine with fully complex hidden nodes," Neurocomputing, vol. 71, nos. 4–6, pp. 576–583, 2008.
[13] G. B. Huang, "An insight into extreme learning machines: Random neurons, random features and kernels," Cogn. Comput., vol. 6, no. 3, pp. 376–390, Sep. 2014.
[14] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[15] G. B. Huang, H. M. Zhou, X. J. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012.
[16] G. B. Huang and L. Chen, "Enhanced random search based incremental extreme learning machine," Neurocomputing, vol. 71, nos. 16–18, pp. 3460–3468, 2008.
[17] G. R. Feng, G. B. Huang, Q. P. Lin, and R. Gay, "Error minimized extreme learning machine with growth of hidden nodes and incremental learning," IEEE Trans. Neural Netw., vol. 20, no. 8, pp. 1352–1357, Aug. 2009.
[18] Y. Yang, Y. Wang, and X. Yuan, "Parallel chaos search based incremental extreme learning machine," Neural Process. Lett., vol. 37, no. 3, pp. 277–301, 2013.
[19] Y. Miche et al., "Optimal pruned extreme learning machine," IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 158–162, Jan. 2010.
[20] R. Savitha, S. Suresh, and H. J. Kim, "A meta-cognitive learning algorithm for an extreme learning machine classifier," Cogn. Comput., vol. 6, no. 2, pp. 253–263, 2013.
[21] S. Suresh, S. Saraswathi, and N. Sundararajan, "Performance enhancement of extreme learning machine for multi-category sparse data classification problems," Eng. Appl. Artif. Intell., vol. 23, pp. 1149–1157, Oct. 2010.
[22] X. Z. Wang, Q. Y. Shao, Q. Miao, and J. H. Zhai, "Architecture selection for networks trained with extreme learning machine using localized generalization error model," Neurocomputing, vol. 102, pp. 3–9, Feb. 2013.
[23] R. Zhang, Y. Lan, G. B. Huang, Z. B. Xu, and Y. C. Soh, "Dynamic extreme learning machine and its approximation capability," IEEE Trans. Cybern., vol. 43, no. 6, pp. 2054–2065, Dec. 2013.
[24] Y. Yang, Y. Wang, and X. Yuan, "Bidirectional extreme learning machine for regression problem and its learning effectiveness," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1498–1505, Sep. 2012.
[25] A. Baradarani, Q. M. J. Wu, and M. Ahmadi, "An efficient illumination invariant face recognition framework via illumination enhancement and DD-DTCWT filtering," Pattern Recognit., vol. 46, no. 1, pp. 57–72, 2013.
[26] R. Minhas, A. A. Mohammed, and Q. M. J. Wu, "Incremental learning in human action recognition based on snippets," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 11, pp. 1529–1541, Nov. 2012.
[27] L. L. C. Kasun, H. Zhou, G. B. Huang, and C. M. Vong, "Representational learning with extreme learning machine for big data," IEEE Intell. Syst., vol. 28, no. 6, pp. 31–34, Nov./Dec. 2013.
[28] A. A. Mohammed, R. Minhas, Q. M. J. Wu, and M. A. Sid-Ahmed, "Human face recognition based on multidimensional PCA and extreme learning machine," Pattern Recognit., vol. 44, nos. 10–11, pp. 2588–2597, 2011.
[29] C. Pan, D. S. Park, Y. Yang, and H. M. Yoo, "Leukocyte image segmentation by visual attention and extreme learning machine," Neural Comput. Appl., vol. 21, no. 6, pp. 1217–1227, 2012.
[30] L. C. Shi and B. L. Lu, "EEG-based vigilance estimation using extreme learning machines," Neurocomputing, vol. 102, pp. 135–143, Feb. 2013.
[31] X. Chen et al., "Electricity price forecasting with extreme learning machine and bootstrapping," IEEE Trans. Power Syst., vol. 27, no. 4, pp. 2055–2062, Nov. 2012.
[32] A. H. Nizar, Z. Y. Dong, and Y. Wang, "Power utility nontechnical loss analysis with extreme learning machine method," IEEE Trans. Power Syst., vol. 23, no. 3, pp. 946–955, Aug. 2008.
[33] Z. Yan and J. Wang, "Robust model predictive control of nonlinear systems with unmodeled dynamics and bounded uncertainties based on neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 457–469, Mar. 2014.
[34] Y. M. Yang, Y. N. Wang, X. F. Yuan, Y. H. Chen, and L. Tan, "Neural network-based self-learning control for power transmission line deicing robot," Neural Comput. Appl., vol. 22, no. 5, pp. 969–986, 2013.
[35] F. L. Minku and T. B. Ludermir, "Clustering and co-evolution to construct neural network ensembles: An experimental study," Neural Netw., vol. 21, no. 9, pp. 1363–1379, 2008.


[36] T. Chen and K. H. Yap, "Discriminative BoW framework for mobile landmark recognition," IEEE Trans. Cybern., vol. 44, no. 5, pp. 695–706, May 2014.
[37] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, 2011.
[38] T. Joachims, "A support vector method for multivariate performance measures," in Proc. Int. Conf. Mach. Learn., 2005, pp. 377–384.
[39] N. Li, I. W. Tsang, and Z. H. Zhou, "Efficient optimization of performance measures by classifier adaptation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1370–1382, Jun. 2013.
[40] M. L. Zhang and Z. H. Zhou, "Exploiting unlabeled data to enhance ensemble diversity," Data Min. Knowl. Dis., vol. 26, no. 1, pp. 98–129, 2013.

Yimin Yang (S'10–M'13) received the M.Sc. and Ph.D. degrees in electrical engineering from the College of Electrical and Information Engineering, Hunan University, Changsha, China, in 2009 and 2013, respectively. Since 2013, he has been an Assistant Professor with the College of Electric Engineering, Guangxi University, Nanning, China. He is currently a Post-Doctoral Fellow with the Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON, Canada. He has authored or coauthored over 20 refereed papers. His current research interests include extreme learning machines, hybrid system approximation, and image dimension reduction. Dr. Yang serves as a reviewer for international journals in his research field, such as the IEEE TRANSACTIONS ON CYBERNETICS and Neural Networks.

Q. M. J. Wu (M’92–SM’09) received the Ph.D. degree in electrical engineering from the University of Wales, Swansea, U.K., in 1990. He was with the National Research Council of Canada for ten years, from 1995–2005, where he became a Senior Research Officer and Group Leader. He is currently a Professor with the Department of Electrical and Computer Engineering, the University of Windsor, Windsor, ON, Canada. He has published over 250 peer-reviewed papers in computer vision, image processing, intelligent systems, robotics, and integrated microsystems. His current research interests include 3-D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu holds the Tier 1 Canada Research Chair in Automotive Sensors and Information Systems. He is an Associate Editor for the IEEE T RANSACTIONS ON N EURAL N ETWORKS AND L EARNING S YSTEMS , Cognitive Computation, and the International Journal of Robotics and Automation. He has served on technical program committees and international advisory committees for many prestigious conferences.

Yaonan Wang received the Ph.D. degree in electrical engineering from Hunan University, Changsha, China, in 1994. From 1994 to 1995, he was a Post-Doctoral Research Fellow with the National University of Defense Technology, Changsha, China. From 1998 to 2000, he was a Senior Humboldt Fellow in Germany, and from 2001 to 2004, he was a Visiting Professor with the University of Bremen, Bremen, Germany. Since 1995, he has been a Professor with the College of Electrical and Information Engineering, Hunan University. His current research interests include intelligent control, robotics, and image processing.


K. M. Zeeshan (M’11) received the B.S. (Hons.) and M.S. degrees from the University of the Central Punjab, Lahore, Pakistan, in 2003 and 2006, respectively. He is currently a Research Associate with the University of Windsor, Windsor, ON, Canada. He is an Assistant Professor with the University of Punjab, Lahore, and is currently on research leave. His current research interests include neural networks and image processing.

Xiaofeng Lin received the B.S. degree in electrical engineering from Guangxi University, Nanning, China, in 1982. He has been a Professor with the College of Electrical Engineering, Guangxi University, Nanning, China, since 1999. From 2008 to 2009, he was a Visiting Professor with the University of Illinois at Chicago, Chicago, IL, USA. His current research interests include neural networks, approximate dynamic programming, and industrial process control.


Xiaofang Yuan (M’14) received the B.S., M.S., and the Ph.D. degrees in electrical engineering from Hunan University, Changsha, China, in 2001, 2006, and 2008, respectively. In 2008, he joined the College of Electrical and Information Engineering, Hunan University, where he is currently an Associate Professor. His current research interests include artificial neural networks, industrial process control, and computing intelligence.
