IEEE TRANSACTIONS ON CYBERNETICS, VOL. 45, NO. 2, FEBRUARY 2015


Dynamic Adjustment of Hidden Node Parameters for Extreme Learning Machine

Guorui Feng, Member, IEEE, Yuan Lan, Xinpeng Zhang, and Zhenxing Qian

Abstract—Extreme learning machine (ELM), proposed by Huang et al., was developed for generalized single hidden layer feedforward networks with a wide variety of hidden nodes. ELMs have proved to be very fast and effective, especially for solving function approximation problems with a predetermined network structure. However, the resulting network may contain insignificant hidden nodes. In this paper, we propose the dynamic adjustment ELM (DA-ELM), which further tunes the input parameters of insignificant hidden nodes in order to reduce the residual error. It is proved in this paper that the energy error can be effectively reduced by applying the recursive expectation-minimization theorem. In DA-ELM, the input parameters of an insignificant hidden node are updated in the direction that decreases the energy error in each step. The detailed theoretical foundation of DA-ELM is presented in this paper. Experimental results show that the proposed DA-ELM is more efficient than state-of-the-art algorithms such as Bayesian ELM, optimally pruned ELM, two-stage ELM, Levenberg–Marquardt, and the sensitivity-based linear learning method, as well as the preliminary ELM.

Index Terms—Adjustment of hidden node parameters, error minimized approximation, extreme learning machine, least squares method.

Manuscript received May 6, 2013; revised December 9, 2013 and March 27, 2014; accepted April 24, 2014. Date of publication June 5, 2014; date of current version January 13, 2015. This work was supported in part by the National Natural Science Foundation of China under Grant 61373151, in part by the Natural Science Foundation of Shanghai under Grant 13ZR1415000, in part by the Qualified Personnel Foundation of Taiyuan University of Technology under Grant tyutrc-201307b, in part by the Taiyuan University of Technology Group Fund under Grant 1205-04020102, in part by the Innovation Program of Shanghai Municipal Education Commission under Grant 14YZ019, and in part by the Shanghai Rising-Star Program under Grant 14QA1401900. This paper was recommended by Associate Editor X. You.

G. Feng, X. Zhang, and Z. Qian are with the School of Communication and Information Engineering, Shanghai University, Shanghai 200072, China (e-mail: [email protected]; [email protected]; [email protected]). Y. Lan is with the Research Institute of Mechatronics Engineering, Taiyuan University of Technology, Taiyuan 030024, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCYB.2014.2325594

I. INTRODUCTION

Feedforward neural networks (FNNs) are known as massively parallel distributed processors that use weighted connections to store the knowledge learned through some learning process. According to Jaeger's estimation [4], 95% of neural network applications and models are actually based on FNNs. As a type of FNN, single hidden layer feedforward networks (SLFNs) play an important role in practical applications. The backpropagation (BP) algorithm [3] is probably the most famous among all gradient-descent-based algorithms in the SLFN learning field.

An advanced algorithm such as Levenberg–Marquardt (LM) [20], which incorporates second-order information, is much faster and able to find better solutions. When the activation function is reversible, the joint optimization problem based on sensitivity analysis is determined by the initial random weights [26]. The algorithm referred to as the sensitivity-based linear learning method (SBLLM) attempts to use gradient search to avoid local minima. Another typical machine learning method, the support vector machine (SVM) [21], has been extensively used to solve classification problems. Later, a faster implementation of SVM, the least squares SVM (LSSVM) [25], was introduced; it applies equality optimization constraints instead of the inequalities used in SVM. However, the traditional learning algorithms are mostly iterative and generally far slower than required, which has been a major drawback in their applications during the past decades.

To improve the efficiency of SLFNs and overcome the learning issues faced by traditional learning methods, an alternative learning algorithm named the extreme learning machine (ELM) was proposed by Huang et al. [5], [6]. ELM is designed on the basis of SLFNs. Theoretical analysis shows that SLFNs can approximate arbitrary nonlinear functions when the hidden neuron parameters (input weights and biases) are tunable [19]. Surprisingly, unlike popular learning theories, Huang et al. [5], [8] proved that the hidden neurons of SLFNs need not be tuned and that SLFNs whose hidden nodes are randomly generated have the universal approximation capability. In addition, a priori knowledge can be introduced in the Bayesian ELM (BELM) [18], which can provide confidence intervals (CIs) without applying computationally intensive methods. According to the studies in [5], ELM requires more hidden nodes than BP because the hidden neurons are generated randomly. One open question is whether the network complexity of ELMs can be further reduced while retaining good generalization performance.

One solution to the problem of the architecture design of SLFNs is to alter the network structure during the training process. There are normally two heuristic approaches to alter the network structure: constructive (or growing) and destructive (or pruning) approaches. Early studies altered the structure of SLFNs with RBF nodes by applying sequential learning algorithms, including the resource allocating network (RAN) [15] and its extension, the minimizing RAN (MRAN) [16]. They adaptively add hidden nodes so as to generate a more compact network, but the computational cost is quite high. This problem is resolved well by the error-minimized ELM (EM-ELM) [11].



EM-ELM adds random hidden nodes one by one (or group by group) and updates the output weights of all hidden nodes incrementally to minimize the sum-of-squares error on the training set. In fact, under the Lebesgue-integrability constraint, the adaptively growing ELM can approximate any target over a compact set [10]. On the other hand, many pruning algorithms have also been proposed in the literature. Based on ELM, the pruned ELM (P-ELM) [9] and the optimally pruned ELM (OP-ELM) [22] have been proposed. P-ELM prunes the hidden nodes with low relevance using the statistical criteria of Chi-squared (χ2) and information gain (IG). In OP-ELM, the hidden nodes are ranked by the multiresponse sparse regression algorithm (MRSR) [23], [24], and the final model is selected by leave-one-out (LOO) cross-validation. Because pruning methods always start from a network larger than necessary, they may have a high computational cost and can be rather time-consuming for cases with a huge amount of data. To reduce the computational cost, the random selection of hidden nodes suggested in [7] can quickly grow and prune the network nodes according to their significance to the network performance.

Although altering the network structure by constructive or destructive methods is useful for achieving a relatively compact network structure, the result may not be optimal. It is apparent that some hidden nodes in the ELM network are insignificant. In other words, some hidden nodes in the ELM network contribute little to the performance, probably because there is no selection of hidden nodes in the basic ELM. Therefore, better performance can be achieved by selecting the most significant hidden nodes. Many network selection methods have been proposed in the literature. Selection methods usually adopt an exhaustive search that considers all subset combinations. However, this is computationally expensive if the number of hidden nodes under investigation is large. Hence, various methods have been proposed that evaluate only a small number of subset models; these are generally referred to as stepwise methods. Orthogonal least squares (OLS) [1] is one of the forward stepwise selection methods and has been widely implemented. Based on the modified Gram–Schmidt (MGS) orthogonalization, OLS selects a suitable hidden node from a large set of candidates so as to reduce the residual error. Another popular stepwise forward selection method is the fast forward recursive algorithm (FRA) [2]. The modified fast recursive algorithm (MFRA) can quickly estimate the contribution of each hidden node and has been used for selecting the hidden nodes of ELM [12]. Unlike OLS, which uses QR decomposition, FRA solves the least-squares problem recursively without requiring matrix decomposition. Motivated by FRA, a systematic two-stage ELM (TS-ELM) for regression was proposed in [13]. In TS-ELM, after a forward recursive selection stage, all hidden nodes are reviewed and the insignificant ones are removed from the network in the second stage. Such forward stepwise selection methods require a large number of hidden nodes to form the candidate reservoir and randomly update hidden nodes to approximate the energy error.

To solve this problem, a new algorithm, referred to as the dynamic adjustment ELM (DA-ELM), is proposed in this paper based on the preliminary ELM; it can dynamically adjust the parameters of insignificant hidden nodes.


Inspired by the TS-ELM algorithm, instead of selecting hidden nodes from a candidate reservoir, the proposed algorithm is able to rank the hidden nodes and update the parameters of the insignificant hidden nodes based on their performance. The algorithm contains three phases: a fixed network model is set up according to the preliminary ELM in the first phase; the hidden nodes are ranked in order of significance and the least significant hidden node is identified in the second phase; the parameters of the least significant hidden node are updated in the third phase. The process is repeated until the change in the training error is negligible. The experimental studies have shown that, with the same network structure, DA-ELM can achieve better generalization performance than the preliminary ELM, BELM, OP-ELM, TS-ELM, and other typical algorithms.

The rest of this paper is organized as follows. In Section II, the theoretical foundation of DA-ELM is presented. Section III provides a detailed description of DA-ELM. The evaluations of DA-ELM are given in Section IV, where it is compared with the preliminary ELM, BELM, OP-ELM, TS-ELM, LM, and SBLLM. Finally, Section V concludes this paper.

II. THEORETICAL FOUNDATION OF DA-ELM

In this section, we provide the background knowledge of the proposed DA-ELM. The output of an SLFN with L hidden nodes can be represented by

Y_L(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x), \quad x \in \mathbb{R}^M, \ a_i \in \mathbb{R}^M    (1)

where a_i and b_i are the learning parameters of the ith hidden node and β_i is the weight connecting the ith hidden node to the output node. G(a_i, b_i, x) is the output of the ith hidden node with respect to the input x. SLFNs with a wide variety of random computational hidden nodes are proved to have the universal approximation capability as long as the hidden node activation functions are nonlinear piecewise continuous, so that the network output can be seen as a linear combination of basis functions, whether the SLFN uses additive or RBF hidden nodes. For an additive hidden node with activation function g(x): R -> R (e.g., sigmoid, threshold, sin/cos, etc.), G(a_i, b_i, x) is given by

G(a_i, b_i, x) = g(a_i \cdot x + b_i), \quad b_i \in \mathbb{R}    (2)

where a_i is the weight vector connecting the input layer to the ith hidden node and b_i is the bias of the ith hidden node. a_i \cdot x denotes the inner product of the vectors a_i and x in R^M.

For a given set of training examples {(x_i, t_i)}_{i=1}^{N} ⊂ R^M × R, if the real outputs of the network are equal to the targets, we have

Y_L(x_j) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = t_j, \quad j = 1, \ldots, N.    (3)

Equation (3) can be written compactly as

H_L \beta = t    (4)


where

H_L = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \cdots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}    (5)

\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_L \end{bmatrix}, \quad t = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}    (6)

where H_L is called the hidden layer output matrix of the network [5]; the ith column of H_L is the ith hidden node's output vector with respect to the inputs x_1, x_2, ..., x_N, and the jth row of H_L is the output vector of the hidden layer with respect to the input x_j. The whole input is X = [x_1, ..., x_N]^T.

In ELM, there may be some hidden nodes that contribute less to the network than the other hidden nodes. In order to solve this problem, DA-ELM is proposed in this section as an extension of conventional least-squares methods in Hilbert spaces (general N-dimensional vector spaces). DA-ELM recursively adjusts the parameters of the least significant hidden node, which is determined by ranking all hidden nodes in the hidden layer at each step. As a result, the training mean squared error (MSE) can be reduced. In this paper, the norm used is the l2 norm.

Consider the following approximation problem: let H be a Hilbert space and t ∈ H a target vector [t_1, ..., t_N]^T with respect to the input [x_1, ..., x_N]^T. The target t can be approximated by the L-order approximator t_L = [t_{L1}, ..., t_{LN}]^T of an SLFN with L hidden nodes. Therefore

P(L) = \|t - t_L\| = \sqrt{\sum_{j=1}^{N} \Big( t_j - \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) \Big)^2}    (7)

where P(L) is the approximation error. To better estimate the target function, it is desired to find a new approximator t'_L such that

P'(L) = \|t - t'_L\| \le \|t - t_L\|.    (8)

As a result, the approximation error decreases. In our investigation, we found that a better approximator t'_L can be achieved by updating the input parameters of certain hidden nodes. The output of the lth hidden node with respect to the input (i.e., the lth column of H_L) is denoted as

\delta h_l = G(X a_l) = \begin{bmatrix} G(a_l, b_l, x_1) \\ \vdots \\ G(a_l, b_l, x_N) \end{bmatrix}_{N \times 1}.    (9)

For additive hidden nodes, we compute the inverse of the activation function and have

G^{-1}(\delta h_l) = \begin{bmatrix} G^{-1}(G(a_l \cdot x_1 + b_l)) \\ \vdots \\ G^{-1}(G(a_l \cdot x_N + b_l)) \end{bmatrix} = \begin{bmatrix} a_l \cdot x_1 + b_l \\ \vdots \\ a_l \cdot x_N + b_l \end{bmatrix}.    (10)

In order to simplify the equations, define v_L = G^{-1}(t - t_L + β_l δh_l) and ignore the bias of the hidden nodes. We then have G^{-1}(δh_l) = X a_l.
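To make the preceding setup concrete, here is a minimal NumPy sketch (not taken from the paper) that builds the hidden layer output matrix H_L of (5) for additive sigmoid nodes, solves for β by least squares, and forms the column δh_l of (9) and the vector v_L defined above for one hidden node. The toy data, the sigmoid choice of G, and the clipping used to keep the sigmoid inverse real-valued are illustrative assumptions of ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(X, A, b):
    """Hidden layer output matrix H_L of (5): entry (j, i) = G(a_i, b_i, x_j)."""
    return sigmoid(X @ A + b)                    # X: N x M, A: M x L, b: 1 x L -> H: N x L

rng = np.random.default_rng(0)
N, M, L = 200, 3, 10
X = rng.uniform(-1, 1, size=(N, M))              # inputs x_1, ..., x_N
t = np.sin(X.sum(axis=1, keepdims=True))         # toy targets

A = rng.uniform(-1, 1, size=(M, L))              # random input weights a_i (columns)
b = rng.uniform(-1, 1, size=(1, L))              # random biases b_i

H = hidden_output(X, A, b)                       # H_L
beta = np.linalg.pinv(H) @ t                     # least-squares output weights
P_L = np.linalg.norm(t - H @ beta)               # residual norm P(L) of (7)

l = 0                                            # pick one hidden node
delta_h = H[:, [l]]                              # its output column, (9)
# v_L as defined above (bias ignored); clip into the sigmoid's open range before inverting
u = np.clip(t - H @ beta + beta[l] * delta_h, 1e-6, 1 - 1e-6)
v_L = np.log(u / (1 - u))                        # elementwise sigmoid inverse, cf. (10)
```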


Lemma 1: Given an SLFN as aforementioned, there exists a vector a'_l such that ||v_L - X a'_l|| <= ||v_L - X a_l||.

Proof: Under the constraint of the minimum-norm least-squares solution (i.e., min ||a'_l|| and min ||v_L - X a'_l||), and by the matrix pseudo-inversion theory, a simple solution of the system is given by

a'_l = X^{\dagger} v_L.    (11)

In real applications, the number of hidden nodes is usually less than the number of training data, L < N. Thus, X^{\dagger} can be calculated as (X^T X + μI)^{-1} X^T [13], where μ is a very small regularization parameter and I is the identity matrix. Lemma 1 indicates that there exists another vector X a'_l that is closer to v_L than the random vector X a_l in the input space.

Lemma 2: Given an SLFN as aforementioned, if there exists a vector a'_l such that ||v_L - X a'_l|| <= ||v_L - X a_l||, then the new vector ā_l = w_1 a'_l + w_2 a_l yields ||v_L - X ā_l|| <= ||v_L - X a_l||, where w_1 > 0, w_2 > 0, and w_1 + w_2 = 1.

Proof: Since ā_l = w_1 a'_l + w_2 a_l with w_1 + w_2 = 1, we have

\|v_L - X\bar{a}_l\| = \|w_1 v_L + w_2 v_L - X(w_1 a'_l + w_2 a_l)\|
= \|w_1 (v_L - X a'_l) + w_2 (v_L - X a_l)\|
\le w_1 \|v_L - X a'_l\| + w_2 \|v_L - X a_l\|
\le w_1 \|v_L - X a_l\| + w_2 \|v_L - X a_l\| = \|v_L - X a_l\|.    (12)
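The following is a quick numerical illustration of Lemmas 1 and 2 (our own sanity check, not part of the paper), using the regularized pseudo-inverse (X^T X + μI)^{-1} X^T mentioned above; the random data and the value of μ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 3
X = rng.uniform(-1, 1, size=(N, M))          # training matrix
v_L = rng.normal(size=(N, 1))                # stands in for the vector v_L of Lemma 1
a_l = rng.uniform(-1, 1, size=(M, 1))        # the original random input weights

mu = 1e-6                                    # small regularization parameter
X_pinv = np.linalg.inv(X.T @ X + mu * np.eye(M)) @ X.T
a_l_new = X_pinv @ v_L                       # a'_l = X^dagger v_L, cf. (11)

# Lemma 1: the refit weights are at least as close to v_L as the random ones
assert np.linalg.norm(v_L - X @ a_l_new) <= np.linalg.norm(v_L - X @ a_l)

# Lemma 2: every convex combination w1*a'_l + w2*a_l is also at least as close
for w1 in (0.25, 0.5, 0.75):
    a_bar = w1 * a_l_new + (1 - w1) * a_l
    assert np.linalg.norm(v_L - X @ a_bar) <= np.linalg.norm(v_L - X @ a_l)
```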

Lemma 2 states that if the vector X a'_l is closer to the preset point v_L than the vector X a_l, then the points between X a'_l and X a_l are also closer to v_L than X a_l.

Lemma 3: Given three random variables Y_1, Y_2, Z, if E[(Y_1 - Z)^2] > E[(Y_2 - Z)^2] and, meanwhile, (Y_1 - Z)^2 - E[(Y_1 - Z)^2] and (Y_2 - Z)^2 - E[(Y_2 - Z)^2] have the same probability density function (PDF), then E[(Y_1 - Z)^{2n}] > E[(Y_2 - Z)^{2n}] for any positive integer n.

Proof: Let W and V be two random variables with W = (Y_1 - Z)^2 and V = (Y_2 - Z)^2. According to the given conditions, we have E(W) > E(V), W >= 0, and V >= 0. Let p_W(w) and p_V(v) be the normalized PDFs of W and V, respectively. Then p_W(w - E(W)) = p_V(v - E(V)). For any positive integer n, we have

E[(Y_1 - Z)^{2n}] = \int_0^{\infty} w^n p_W(w - E(W)) \, dw
= \int_0^{\infty} [v + E(W) - E(V)]^n p_V(v - E(V)) \, dv
> \int_0^{\infty} v^n p_V(v - E(V)) \, dv = E[(Y_2 - Z)^{2n}].    (13)

Lemmas 1-3 support the general expectation-minimization theorem. Lemma 1 presents the foundation of DA-ELM, which states that after adjusting the input parameters of certain hidden nodes, the approximation error can be maintained or reduced (i.e., ||v_L - X a'_l|| <= ||v_L - X a_l||).
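As a small Monte Carlo sanity check of Lemma 3 (again ours, not the paper's), a nonnegative variable shifted by a constant is one simple way to realize two squared deviations whose centered PDFs coincide while their means differ:

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.exponential(scale=1.0, size=200_000)   # plays (Y2 - Z)^2: nonnegative, E(V) = 1
W = V + 0.5                                    # plays (Y1 - Z)^2: same-shaped PDF, larger mean
for n in (1, 2, 3):                            # higher moments E[W^n] stay above E[V^n]
    assert np.mean(W ** n) > np.mean(V ** n)
```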


However, it makes no difference to merely maintain the approximation error. Therefore, we shall require ||v_L - X a'_l|| < ||v_L - X a_l||. To establish the recursive expectation-minimization theorem, two assumptions are needed.

1) For the derivatives of the activation function G(x) of order 2n + 1 (i.e., G^{(2n+1)}(x), n = 1, 2, ...), assume that E(G^{(2n+1)}(x)) = 0.
2) Let Y_1, Y_2 be two random variables. If Y_1 is a nonlinear function of Y_2, assume that Y_1 and Y_2 are uncorrelated.

Theorem 1 (recursive expectation-minimization theorem): Given an SLFN as aforementioned, if there exists a vector a'_l such that ||v_L - X a'_l|| < ||v_L - X a_l|| and the derivatives of the activation function are uniformly bounded, then there exists a vector a^o_l = w_1 a'_l + w_2 a_l such that ||v_L - X a^o_l|| < ||v_L - X a_l|| and E[||t - t_L + β_l G(X a_l) - β_l G(X a^o_l)||] < E[||t - t_L||], where w_1 > 0, w_2 > 0, and w_1 + w_2 = 1.

Proof: According to Lemma 2, assume ||v_L - X a_l|| - ||v_L - X a^o_l|| = ε > 0. For the sake of simplicity, denote the vectors v_L, X a_l, and X a^o_l by [v_1, v_2, ..., v_N]^T, [u_1, u_2, ..., u_N]^T, and [r_1, r_2, ..., r_N]^T, respectively. Hence

\sum_{i=1}^{N} (v_i - r_i)^2 + \varepsilon = \sum_{i=1}^{N} (v_i - u_i)^2.    (14)

Applying the Taylor expansion, for any i = 1, 2, ..., we have

G(v_i) = G(r_i) + G'(r_i)(r_i - v_i) + \frac{1}{2} G''(r_i)(r_i - v_i)^2 + \cdots.    (15)

Then

(G(v_i) - G(r_i))^2 = (G'(r_i))^2 (r_i - v_i)^2 + G'(r_i) G''(r_i)(r_i - v_i)^3 + \frac{1}{4}(G''(r_i))^2 (r_i - v_i)^4 + \cdots.    (16)

Compute the least-squares solution of the output layer by calculating the mean on both sides of (16). Noting that G'(r_i), G''(r_i), and G(r_i) - G(v_i) are uncorrelated and that E(G^{(2n+1)}(x)) = 0 (n = 1, 2, ...), we have

E[(G(v_i) - G(r_i))^2] = E[(G'(v_i))^2] E[(r_i - v_i)^2] + \frac{1}{4} E[(G''(v_i))^2] E[(r_i - v_i)^4] + \cdots.    (17)

Similarly

E[(G(v_i) - G(u_i))^2] = E[(G'(v_i))^2] E[(u_i - v_i)^2] + \frac{1}{4} E[(G''(v_i))^2] E[(u_i - v_i)^4] + \cdots.    (18)

According to Lemma 3, considering only the even terms, we have

E[(r_i - v_i)^{2n}] < E[(u_i - v_i)^{2n}], \quad n = 1, 2, \ldots    (19)

and therefore

E[(G(v_i) - G(r_i))^2] < E[(G(v_i) - G(u_i))^2].    (20)

Finally, summing the terms in (20) over all i yields

E[\|t - t_L + \beta_l G(X a_l) - \beta_l G(X a^o_l)\|] < E[\|t - t_L + \beta_l G(X a_l) - \beta_l G(X a_l)\|] = E[\|t - t_L\|].    (21)

The recursive expectation-minimization theorem provides the key foundation of DA-ELM. After initializing the network based on the preliminary ELM, DA-ELM ranks all the hidden nodes according to their performance. The most insignificant hidden node is identified in each step. Then, the input parameters of that insignificant hidden node are updated to reduce the residual error. In addition, the recursive expectation-minimization theorem guarantees that the update of the hidden node parameters lies in the direction of cost-function reduction. From Theorem 1, when X a_l is replaced by X a'_l in the input layer, we have ||G^{-1}((t - t_L)/β_l + G(X a'_l)) - X a'_l|| < ||G^{-1}((t - t_L)/β_l + G(X a_l)) - X a_l|| = ||G^{-1}((t - t_L)/β_l)||. Meanwhile, according to the theorem, the vector X a^o_l between X a_l and X a'_l is closer to G^{-1}((t - t_L)/β_l + G(X a_l)) than X a_l is, where a^o_l = w_1 a'_l + w_2 a_l.

III. DA-ELM ALGORITHM

Consider an SLFN with additive hidden nodes and the sigmoid activation function. In most real applications, the number of training data is larger than the number of hidden nodes, i.e., N > L. Given the training data and randomly generated hidden nodes, the hidden layer output matrix H_L is known. Therefore, training the SLFN simply amounts to solving a linear system. Under the constraint of the minimum-norm least-squares solution, i.e., min ||β|| and min ||H_L β - t||, the solution of system (4) can be explicitly presented as [5]

\hat{\beta} = H_L^{\dagger} t    (22)

where H_L^{\dagger} is the Moore–Penrose generalized inverse [14] of the hidden layer output matrix H_L.

The proposed algorithm, DA-ELM, first sets up a fixed network according to the preliminary ELM with a user-specified number of hidden nodes. In the next phase, referred to as the hidden node selection phase, the hidden nodes are ranked in decreasing order of significance and the least significant hidden node is identified. In the final phase, the recursive input-weight updating phase, the parameters of the least significant hidden node are updated; this node is replaced by a new node in terms of the energy error, and the updated hidden node output contributes more to reducing the residual squared error than the original one. The second and third phases are repeated until the training error no longer changes. The whole procedure is summarized in Algorithm 1. The training error is calculated by

P(L) = \|t - H_L H_L^{\dagger} t\|.    (23)
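For reference, here is a compact sketch of the preliminary ELM training step (22) and the training error (23), assuming additive sigmoid nodes; the SinC toy target and all variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, t, L, rng):
    """Preliminary ELM: random hidden nodes, least-squares output weights, cf. (22)."""
    M = X.shape[1]
    A = rng.uniform(-1, 1, size=(M, L))      # random input weights
    b = rng.uniform(-1, 1, size=(1, L))      # random biases
    H = sigmoid(X @ A + b)                   # hidden layer output matrix H_L
    beta = np.linalg.pinv(H) @ t             # beta_hat = H_L^dagger t
    return A, b, beta

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(1000, 1))
t = np.sinc(X / np.pi)                       # SinC target: np.sinc(x) = sin(pi x)/(pi x)
A, b, beta = elm_train(X, t, L=20, rng=rng)

H = sigmoid(X @ A + b)
print("training error P(L) =", np.linalg.norm(t - H @ beta))   # cf. (23)
```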


Algorithm 1: Proposed DA-ELM Algorithm

Given a set of training data {(x_i, t_i)}_{i=1}^{N} and the maximal number of iterative epochs C_max, the proposed DA-ELM algorithm proceeds in three phases:

Phase 1: Initialization phase
a) Initialize the SLFN with a group of existing hidden nodes (a_i, b_i), i = 1, ..., L. The number of hidden nodes L is a positive integer given by the user.
b) Calculate the hidden layer output matrix H_L shown in (5).
c) Calculate the corresponding output error P(L) = ||t - H_L β||, where β is obtained from (22).

Phase 2: Hidden node selection phase
Rank all hidden nodes. Let k = 1.
While k <= C_max:
a) For i = 1, ..., L, remove the ith hidden node in turn while keeping the rest of the structure fixed. Calculate the corresponding residual energy P(L-1, i) = ||t - H_{L/i} β_{L/i}||, where H_{L/i} denotes the matrix obtained by removing the ith column of H_L and, similarly, β_{L/i} denotes the vector obtained by removing the ith entry of β.
b) At the kth epoch, select the minimal P(L-1, i) as P_{L-1,k} and record the label i of this hidden node as i_k. Set P_{L-1,0} = 0.
c) Calculate the backward compensation vector v_{L-1,k} = G^{-1}((t - H_{L/i_k} β_{L/i_k}) / (|P_{L-1,k}| + σ)), where σ is a small positive number used to avoid a zero value. In the following experiments, we set σ = 0.0001.

Phase 3: Recursively updating input weights phase
a) Update the input weights of the i_k th hidden node as δa_{i_k} = X^{\dagger} v_{L-1,k}, where X is the training matrix. Insert the corresponding outputs δh_{i_k} = G(X δa_{i_k}) as the i_k th column of H_{L/i_k}. The updated hidden layer output matrix is re-denoted as H_L for the next epoch.
b) Update the output weights β in a fast way using a pseudo-inverse calculation, as in ELM.
c) Set k = k + 1.
End While

For a rapid implementation of the pseudo-inversion, a simple and efficient method proposed in [11] can be applied here, which reduces the computational complexity by recursively updating the output weights

D_{L-1} = \frac{\delta h_{i_k}^T \big(I - H_{L/i_k} H_{L/i_k}^{\dagger}\big)}{\delta h_{i_k}^T \big(I - H_{L/i_k} H_{L/i_k}^{\dagger}\big)\, \delta h_{i_k}}
U_{L-1} = H_{L/i_k}^{\dagger} \big(I - \delta h_{i_k} D_{L-1}\big)
\hat{\beta}_L = H_L^{\dagger} t = \begin{bmatrix} U_{L-1} \\ D_{L-1} \end{bmatrix} t.    (24)
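Below is a rough Python sketch of Algorithm 1 under simplifying assumptions of ours: the pseudo-inverse is recomputed with np.linalg.pinv at each step instead of applying the recursive update (24), the sigmoid inverse is clipped to stay real-valued rather than using the surrogate inverse introduced later in (25), and the bias of a refit node is dropped as in the derivation of Section II. It illustrates the three phases and is not a faithful reimplementation of the authors' MATLAB code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(u, eps=1e-6):
    u = np.clip(u, eps, 1 - eps)             # keep the sigmoid inverse real-valued
    return np.log(u / (1 - u))

def da_elm(X, t, L=10, C_max=5, sigma=1e-4, mu=1e-6, rng=None):
    """Sketch of Algorithm 1: init (Phase 1), rank nodes (Phase 2), refit the
    least significant node's input weights (Phase 3), repeated C_max times."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, M = X.shape
    A = rng.uniform(-1, 1, size=(M, L))      # Phase 1: random hidden nodes
    b = rng.uniform(-1, 1, size=(1, L))
    H = sigmoid(X @ A + b)
    beta = np.linalg.pinv(H) @ t

    X_pinv = np.linalg.inv(X.T @ X + mu * np.eye(M)) @ X.T
    for _ in range(C_max):
        P = np.empty(L)                      # Phase 2: residual with each node removed
        for i in range(L):
            H_i = np.delete(H, i, axis=1)
            P[i] = np.linalg.norm(t - H_i @ (np.linalg.pinv(H_i) @ t))
        ik = int(np.argmin(P))               # least significant node
        H_ik = np.delete(H, ik, axis=1)
        beta_ik = np.linalg.pinv(H_ik) @ t
        v = logit((t - H_ik @ beta_ik) / (abs(P[ik]) + sigma))   # compensation vector
        A[:, [ik]] = X_pinv @ v              # Phase 3: refit the node's input weights
        b[0, ik] = 0.0                       # bias dropped for the refit node
        H[:, [ik]] = sigmoid(X @ A[:, [ik]])
        beta = np.linalg.pinv(H) @ t         # refresh the output weights
    return A, b, beta

# Toy usage on a SinC-like target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-5, 5, size=(500, 1))
t_demo = np.sinc(X_demo / np.pi)
A, b, beta = da_elm(X_demo, t_demo, L=10, C_max=5, rng=rng)
```

In a full implementation, the per-node residuals in Phase 2 and the output-weight refresh could reuse the rank-one update (24) instead of the repeated pinv calls; the sketch trades that speed for readability.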

IV. PERFORMANCE EVALUATION OF DA-ELM

In this section, we investigate the performance of the proposed DA-ELM algorithm on several benchmark applications. The simulations have been conducted on the MATLAB platform running on a desktop PC (Intel i7-2600 CPU). Since the performance of ELM was already compared in [5] with other sequential/incremental/growing algorithms, namely the resource allocation network (RAN) [15], the minimum resource allocation network (MRAN) [16], and the incremental extreme learning machine (I-ELM) [8], we only conduct the comparison among the proposed DA-ELM, BELM, OP-ELM, TS-ELM, LM, SBLLM, and the preliminary ELM. The network contains additive hidden nodes with the sigmoidal activation function G(a_i, b_i, x) = 1/(1 + exp(-(a_i · x + b_i))). For the sake of reversibility, G^{-1}((t - H_{L/i_k} β_{L/i_k}) / (|P_{L-1,k}| + σ)) must be real-valued in Phase 2. The inverse function G^{-1}(x) in (10) is therefore assigned as G^{-1}(x) = f_2(f_1(x)), where

f_1(x) = \begin{cases} x, & x > 0.25 \\ 0.25, & \text{otherwise} \end{cases}, \qquad f_2(x) = 4(x - 0.25) - 1.    (25)

For the SinC and 2-D function approximations, the training and testing data sets are randomly drawn from the whole data set before each trial. The average results are obtained over 100 trials for the SinC function case and the 2-D nonlinear function case, over 100 trials for the real-world benchmark applications, and over ten trials for the large-scale dataset and image steganalysis cases. For DA-ELM, the number of epochs C_max is five in Phase 2 for the SinC function, 2-D function, and classification cases, and C_max = 10 for the regression cases shown in Section IV-C. For the large-scale case and image steganalysis, the number of epochs C_max is three.

A. Approximation of SinC Function

In this subsection, the performance of DA-ELM is evaluated on the SinC function approximation with the same data generation procedure as in [5]. The dataset has 2000 data pairs randomly generated from the interval [-5, 5]. We add uniform noise distributed in [-0.2, 0.2] to all training samples, while the testing data remain noise-free. 1000 samples are chosen as training data after a random permutation, and the remaining samples are used for testing. The simulation results are presented in Table I, which contains the average testing root mean square error (RMSE) and its variance versus the number of hidden nodes used in the networks of the preliminary ELM, BELM, TS-ELM, LM, and DA-ELM.

From Table I, the performance of DA-ELM is better than that of the other algorithms when the number of hidden nodes is beyond seven, owing to the tuning of the parameters of the insignificant hidden nodes, where the recursive expectation-minimization theorem guarantees that the tuning is in the direction of reducing the training error. According to the experimental results, LM is unstable compared with the other methods, and DA-ELM shows better stability than ELM, BELM, TS-ELM, and LM. Beyond four hidden nodes, LM starts to suffer from overtraining. In this case, DA-ELM achieves the best performance with more than eight hidden nodes. From the results, in the SinC case, DA-ELM obtains at least about a 1% RMSE reduction compared with ELM. Fig. 1 shows the actual and the expected output of the network in the SinC case when five hidden nodes are used; a similar figure is obtained by DA-ELM with six or more hidden nodes. Because the difference in performance is insignificant beyond six hidden nodes and the curves cannot be clearly distinguished in one figure, we only show the actual and expected outputs of the DA-ELM learning algorithm with five hidden nodes.


TABLE I. Average testing RMSE (mean) and its variance (var) among DA-ELM, ELM, BELM, TS-ELM, and LM algorithms for the SinC function.

Fig. 1. Outputs of DA-ELM in SinC function approximation.
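As a small illustration of the setup described above, the sketch below implements the surrogate inverse f_2(f_1(x)) of (25) and generates a noisy SinC training set in the manner just described; the exact sampling details are assumptions on our part.

```python
import numpy as np

def g_inverse(x):
    """Surrogate inverse G^{-1}(x) = f2(f1(x)) of (25)."""
    f1 = np.where(x > 0.25, x, 0.25)         # f1 clips its argument at 0.25
    return 4.0 * (f1 - 0.25) - 1.0           # f2(x) = 4(x - 0.25) - 1

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=(2000, 1))       # 2000 points on [-5, 5]
t = np.sinc(x / np.pi)                       # SinC targets: sin(x)/x

perm = rng.permutation(2000)
x_train, t_train = x[perm[:1000]], t[perm[:1000]]
x_test, t_test = x[perm[1000:]], t[perm[1000:]]
t_train = t_train + rng.uniform(-0.2, 0.2, size=t_train.shape)   # noise on training targets only
```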

B. Approximation of 2-D Nonlinear Function

In this subsection, the performance of the proposed DA-ELM is evaluated on a 2-D nonlinear complex learning task. The database contains 1000 samples, which are points sampled from the unit square. This dataset is split equally to form a training and a testing set. In order to make this regression case realistic, normally distributed noise with mean zero and variance one has been added to all the training samples, while the testing data remain noise-free:

z = \max\big( e^{-10x^2},\, e^{-50y^2},\, 1.25\, e^{-5(x^2 + y^2)} \big) + N(0, 1).    (26)

Networks with two to ten hidden nodes are used in the evaluation. The simulation results are presented in Table II. DA-ELM obtains much better generalization performance than ELM, BELM, and TS-ELM with the same number of hidden nodes, except in the cases with fewer than six hidden nodes. Gradient descent over the steepness parameters can more easily find a good solution when there are few hidden nodes (i.e., in low dimensions); therefore, LM achieves better performance with few hidden nodes. Compared with the ELM-based solutions, however, LM tends to overfit easily.
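A short sketch of how the 2-D training data of (26) could be generated, assuming points drawn uniformly from the unit square and Gaussian noise added to the training targets only, as described above; it is illustrative rather than the authors' data-generation code.

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 1, size=(1000, 2))                   # 1000 points from the unit square
x, y = xy[:, [0]], xy[:, [1]]
z_clean = np.maximum.reduce([np.exp(-10 * x ** 2),
                             np.exp(-50 * y ** 2),
                             1.25 * np.exp(-5 * (x ** 2 + y ** 2))])   # (26) without noise

perm = rng.permutation(1000)
train, test = perm[:500], perm[500:]                     # equal split
xy_train, z_train = xy[train], z_clean[train] + rng.normal(0.0, 1.0, size=(500, 1))
xy_test, z_test = xy[test], z_clean[test]                # test targets stay noise-free
```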

C. Benchmark Problems

Further comparisons have been conducted on some real benchmark regression and classification problems [17], including four regression applications (servo, abalone, Boston housing, and auto price) and three classification applications (wine, image segmentation, and iris). The specifications of the applications are shown in Table III. In our simulations, the input data have been normalized into [-1, 1].

The performance comparisons among DA-ELM and other popular state-of-the-art algorithms (ELM, BELM, OP-ELM, TS-ELM, LM, and SBLLM) on the regression and classification cases are given in Tables IV and V. Because BELM and TS-ELM are designed only for regression applications, the comparison on the classification cases is conducted among DA-ELM, OP-ELM, SBLLM, and the preliminary ELM only. For the regression cases, the comparison is based on the testing RMSE; for the classification cases, it is based on the testing accuracy rate. For each method, a sufficiently large number of hidden neurons is required. OP-ELM can choose the optimal number of hidden nodes; therefore, all comparisons are assigned the same number of hidden nodes as the OP-ELM network. In the simulation, the number of hidden nodes obtained by OP-ELM is ten for all of the applications except the image segmentation case, for which the number of hidden nodes in the hidden layer is 15. In fact, the objective of DA-ELM is not to determine the optimal network topology but to provide a fine-tuning method on the basis of a fixed network structure in order to find more significant hidden nodes.

As observed from Tables IV and V, the proposed DA-ELM outperforms ELM, BELM, OP-ELM, and SBLLM on most of the regression datasets. Meanwhile, the variance of the testing results shows that DA-ELM networks are quite stable. For the classification cases, DA-ELM, LM, and SBLLM may be the better alternative schemes, but DA-ELM is less time-consuming than LM and SBLLM.

D. Large-Scale Features Problem

A further comparison has been conducted on two large-scale benchmark classification problems: letter recognition and shuttle. They are also from the UCI machine learning repository. In our simulations, we first normalize the training data to the interval [-1, 1]. The average results are obtained over ten trials on the two datasets.


TABLE II. Average testing RMSE (mean) and its variance (var) among DA-ELM, ELM, BELM, TS-ELM, and LM algorithms for the 2-D function.

TABLE III. Specifications of benchmark data sets.

TABLE IV. Comparison of average testing RMSE (mean), its variance (var), and training time (in seconds) obtained by different algorithms.

The letter recognition database contains 20 000 examples, from which we build a training set of 15 000 examples and a testing set of 5000 examples. The initial SLFNs are given ten hidden nodes.

Figs. 2 and 3 show the average testing classification accuracy rates and training times of the DA-ELM, ELM, and SBLLM methods.


TABLE V. Comparison of average testing classification accuracy rate (mean), its variance (var), and training time (in seconds) obtained by different algorithms.

Fig. 2. Average testing accuracy rate of ELM, SBLLM, and DA-ELM algorithms in letter recognition data.

Fig. 4. Average testing accuracy rate of ELM, SBLLM, and DA-ELM algorithms in shuttle data.

Fig. 3. Average training time (second) of ELM, SBLLM, and DA-ELM algorithms in letter recognition data.

Fig. 5. Average training time (second) of ELM, SBLLM, and DA-ELM algorithms in shuttle data.

Note that the LM method is not used here, because it is impractical in these cases owing to its high computational demand [26]. It is shown that DA-ELM is the best algorithm for this kind of problem compared with ELM and SBLLM over all numbers of hidden nodes. DA-ELM also improves on the performance of the traditional ELM and SBLLM, with gains of up to about 1% in accuracy over ELM and 8% over SBLLM.

We next conduct the evaluation using the shuttle dataset. This dataset consists of 58 000 instances with seven features. We set the size of the training data to 50 000 and the size of the testing data to 8000. As observed from Figs. 4 and 5, when the number of hidden nodes increases, the average testing accuracy rate increases as well. Furthermore, DA-ELM obtains much better generalization performance than ELM and SBLLM beyond five hidden nodes. Moreover, we observe that the training time used by SBLLM is much larger than that of ELM and DA-ELM. These results indicate that the iterative algorithm in DA-ELM does not seriously increase the computational complexity as compared with the conventional ELM.

E. Image Steganalysis Feature Classification

In this subsection, a real classification application is presented. This case is derived from image steganalysis, which distinguishes stego-images from cover-images through feature classification. In this case, only a specified feature set for steganalysis is used, namely the 548-dimensional feature space proposed by Kodovský and Fridrich [28]; the detailed definition can be found in that paper. We conduct the evaluations using DA-ELM, ELM, and SBLLM and compare their performance. All experiments are carried out on a database of 10 000 images from BOSSbase [30], resized to 512 × 512. The nsF5 (no-shrinkage F5) algorithm [27] is used to create 10 000 stego images carrying 0.05 bpac (bits per nonzero AC DCT coefficient); nsF5 is the most prevalent steganographic algorithm for the JPEG domain. The steganographic algorithm and extracted features are available in [29]. 10 000 mixed images (5000 stego and 5000 cover images) are selected for training, and the rest of the images are used for testing.


The results are shown in Fig. 6 for numbers of hidden nodes from two to ten; moreover, the specific settings of 100, 500, and 1000 hidden nodes are also considered. For the cases beyond ten nodes, SBLLM is omitted because of its high training cost. The figure shows that the proposed DA-ELM and SBLLM both outperform the preliminary ELM. Fig. 7 gives a comparison of the computational complexity of the algorithms: DA-ELM is faster than SBLLM, although the preliminary ELM remains the fastest of all the algorithms.

Fig. 6. Average testing accuracy rate of ELM, SBLLM, and DA-ELM algorithms in the image steganalysis case.

Fig. 7. Average training time (second) of ELM, SBLLM, and DA-ELM algorithms in the image steganalysis case.

V. CONCLUSION

In this paper, we have proposed a novel framework based on the preliminary ELM. The proposed DA-ELM can dynamically adjust the parameters of the least significant hidden node in the network. Our objective is to improve the generalization performance of an SLFN with a fixed network structure. Considering a network with M input nodes and L hidden nodes, there are M × L connections between the input layer and the hidden layer, which carry a great amount of information. Instead of only computing the output connections as in the conventional ELM, we introduce a fine-tuning of the input connections. The initial model depends on a random selection of the input connections. A specific method is used to rank all hidden nodes of this model and to identify the hidden node with the least contribution. The residual energy under l2-norm optimization is the approximation goal while updating the hidden node parameters. As a theoretical foundation, the recursive expectation-minimization theorem guarantees that the updated hidden node makes a greater contribution than the pruned unit. The experimental studies have shown that the proposed DA-ELM is effective and can provide better generalization performance on most of the applications.

REFERENCES

[1] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Trans. Neural Netw., vol. 2, no. 2, pp. 302–309, Mar. 1991.
[2] K. Li, J. Peng, and G. W. Irwin, "A fast nonlinear model identification method," IEEE Trans. Autom. Control, vol. 50, no. 8, pp. 1211–1216, Aug. 2005.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.
[4] H. Jaeger, "A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the 'echo state network' approach," German Nat. Res. Center Inf. Technol., GMD Rep. 159, 2002.
[5] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, nos. 1–3, pp. 489–501, 2006.
[6] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012.
[7] R. Zhang, Y. Lan, G.-B. Huang, Z.-B. Xu, and Y. C. Soh, "Dynamic extreme learning machine and its approximation capability," IEEE Trans. Cybern., vol. 43, no. 6, pp. 2054–2065, Dec. 2013.
[8] G.-B. Huang, L. Chen, and C.-K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 879–892, Jul. 2006.
[9] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. X. Zhu, "A fast pruned-extreme learning machine for classification problem," Neurocomputing, vol. 72, nos. 1–3, pp. 359–366, 2008.
[10] R. Zhang, Y. Lan, G.-B. Huang, and Z.-B. Xu, "Universal approximation of extreme learning machine with adaptive growth of hidden nodes," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 2, pp. 365–371, Feb. 2012.
[11] G. R. Feng, G.-B. Huang, Q. P. Lin, and G. Robert, "Error minimized extreme learning machine with growth of hidden nodes and incremental learning," IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 1352–1357, Aug. 2009.
[12] M. Han and X. Y. Wang, "A modified fast recursive hidden nodes selection algorithm for ELM," in Proc. Int. Joint Conf. Neural Netw., Brisbane, QLD, Australia, 2012, pp. 1–7.
[13] Y. Lan, Y. C. Soh, and G.-B. Huang, "Two-stage extreme learning machine for regression," Neurocomputing, vol. 73, nos. 16–18, pp. 3028–3038, 2010.
[14] C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and Its Applications. New York, NY, USA: Wiley, 1971.
[15] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, no. 2, pp. 213–225, 1991.
[16] Y. W. Lu, N. Sundararajan, and P. Saratchandran, "A sequential learning scheme for function approximation using minimal radial basis function (RBF) neural networks," Neural Comput., vol. 9, no. 2, pp. 461–478, 1997.
[17] C. Blake and C. Merz, "UCI repository of machine learning databases," Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, USA, 1998. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[18] E. Soria-Olivas et al., "BELM: Bayesian extreme learning machine," IEEE Trans. Neural Netw., vol. 22, no. 3, pp. 505–509, Mar. 2011.
[19] F. Scarselli and A. C. Tsoi, "Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results," Neural Netw., vol. 11, no. 1, pp. 15–37, 1998.
[20] M. T. Hagan and M. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Trans. Neural Netw., vol. 5, no. 6, pp. 989–993, Nov. 1994.
[21] C. Cortes and V. Vapnik, "Support vector networks," Mach. Learn., vol. 20, pp. 273–297, Sep. 1995.
[22] Y. Miche et al., "OP-ELM: Optimally pruned extreme learning machine," IEEE Trans. Neural Netw., vol. 21, no. 1, pp. 158–162, Jan. 2010.
[23] T. Similä and J. Tikka, "Multiresponse sparse regression with application to multidimensional scaling," in Proc. Int. Conf. Artif. Neural Netw., vol. 3697, Warsaw, Poland, 2005, pp. 97–102.
[24] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Ann. Statist., vol. 32, no. 2, pp. 407–499, 2004.


[25] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.
[26] E. Castillo, B. Guijarro-Berdiñas, O. Fontenla-Romero, and A. Alonso-Betanzos, "A very fast learning method for neural networks based on sensitivity analysis," J. Mach. Learn. Res., vol. 7, pp. 1159–1182, Jul. 2006.
[27] J. Fridrich, T. Pevný, and J. Kodovský, "Statistically undetectable JPEG steganography: Dead ends, challenges, and opportunities," in Proc. 9th ACM Workshop Multimedia Security, Dallas, TX, USA, Sep. 2007, pp. 3–14.
[28] J. Kodovský and J. Fridrich, "Calibration revisited," in Proc. 11th ACM Workshop Multimedia Security, Princeton, NJ, USA, 2009, pp. 63–74.
[29] [Online]. Available: http://dde.binghamton.edu/download/
[30] [Online]. Available: http://exile.felk.cvut.cz/boss/BOSSFinal/index.php?mode=VIEW&tmpl=materials

Xinpeng Zhang received the B.S. degree in computational mathematics from Jilin University, Changchun, China, in 1995, and the M.E. and Ph.D. degrees in communication and information system from Shanghai University, Shanghai, China, in 2001 and 2004, respectively. Since 2004, he has been a faculty member with the School of Communication and Information Engineering, Shanghai University, where he is currently a Professor. He joined the State University of New York at Binghamton, NY, USA, as a Visiting Scholar from 2010 to 2011, and Konstanz University, Konstanz, Germany, as an Experienced Researcher sponsored by the Alexander von Humboldt Foundation from 2011 to 2012. His current research interests include multimedia security, image processing, and digital forensics. He has published over 170 papers in these areas.

Guorui Feng (M'11) received the B.S. and M.S. degrees in computational mathematics from Jilin University, Changchun, China, in 1998 and 2001, respectively, and the Ph.D. degree in electronic engineering from Shanghai Jiaotong University, Shanghai, China, in 2005. In 2006, he was an Assistant Professor at East China Normal University, Shanghai. In 2007, he was a Research Fellow with Nanyang Technological University, Singapore. He is currently with the School of Communication and Information Engineering, Shanghai University, Shanghai. His current research interests include image processing, image analysis, and computational intelligence.

Zhenxing Qian received the B.S. and Ph.D. degrees from the University of Science and Technology of China, Hefei, China, in 2003 and 2007, respectively. He is currently an Associate Professor at the School of Communication and Information Engineering, Shanghai University, Shanghai, China. His current research interests include data hiding and image processing. He has published over 50 papers in these areas.

Yuan Lan received the B.Eng. degree in electrical and electronic engineering and the Ph.D. degree from Nanyang Technological University, Singapore, in 2006 and 2011, respectively. From 2010 to 2011, she was a Researcher in electrical and electronic engineering at Nanyang Technological University. In 2012, she joined Qiito Pte Ltd., Singapore, as a Web Scientist. Since 2013, she has been a Lecturer with the Research Institute of Mechatronics Engineering at the Taiyuan University of Technology, Taiyuan, China. Her current research interests include mechanical fault diagnosis, signal processing, machine learning, and data mining.
