

Quantized Kernel Recursive Least Squares Algorithm


Badong Chen, Member, IEEE, Songlin Zhao, Student Member, IEEE, Pingping Zhu, Student Member, IEEE, and José C. Príncipe, Fellow, IEEE


Abstract— In a recent paper, we developed a novel quantized kernel least mean square algorithm, in which the input space is quantized (partitioned into smaller regions) and the network size is upper bounded by the quantization codebook size (the number of regions). In this brief, we propose the quantized kernel least squares regression and derive its optimal solution. By incorporating a simple online vector quantization method, we derive a recursive algorithm to update the solution, namely the quantized kernel recursive least squares algorithm. The good performance of the new algorithm is demonstrated by Monte Carlo simulations.

Index Terms— Kernel recursive least squares (KRLS), quantization, quantized kernel recursive least squares (QKRLS), sparsification.

I. INTRODUCTION

Kernel methods have been shown to be highly successful in batch settings (e.g., support vector machines [1] and regularization networks [2]). Recently, much effort has also been devoted to developing online kernel methods [3], which approximate the desired nonlinear function incrementally and are particularly applicable when data arrive sequentially. The kernel adaptive filters [4] are among the most celebrated online kernel methods; they are a class of nonlinear adaptive filtering algorithms derived in a reproducing kernel Hilbert space (RKHS) by using the linear structure of this space to implement the well-established linear adaptive filtering algorithms. These algorithms include the kernel least mean square [5], kernel affine projection algorithms [6], kernel recursive least squares (KRLS) [7], and extended kernel recursive least squares (EX-KRLS) [8]. With the Gaussian kernel, kernel adaptive filters build a linearly growing radial basis function (RBF) network by allocating a new kernel unit for every new data sample as an RBF center, with the coefficient of each center directly related to the error at that sample. The biggest challenge of kernel adaptive filters (as well as other online kernel methods) is this linearly growing structure, which results in increasing computational complexity and memory requirements; when the sample size is very large, they become impractical to use.

Manuscript received July 7, 2012; revised November 23, 2012, and February 12, 2013; accepted April 10, 2013. Date of publication May 13, 2013; date of current version August 16, 2013. This work was supported in part by NSF under Grant ECCS 0856441 and the National NSF of China under Grant 60904054. B. Chen is with the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China (e-mail: [email protected]). S. Zhao, P. Zhu, and J. C. Príncipe are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2013.2258936



In order to curb the network growth and to obtain truly practical online algorithms, a variety of online sparsification techniques have been applied, in which only the important input data are accepted as new centers. A sparsification method in general uses a criterion function with thresholds to decide which data can be accepted as new centers. Typical sparsification criteria include the novelty criterion (NC) [9], the approximate linear dependency (ALD) criterion [7], the prediction variance criterion [10], the coherence criterion [11], and the surprise criterion (SC) [12]. These sparsification methods can dramatically reduce the network size (number of centers) of kernel adaptive filters. Nevertheless, they still have some limitations, such as high computational cost (e.g., ALD) or difficulty in determining the thresholds (e.g., SC).

In a recent paper [13], we proposed a novel quantization approach to constrain the network size and developed the quantized kernel least mean square (QKLMS) algorithm. The basic idea of quantization is to use a smaller body of data to represent the whole input data set by partitioning the input space into smaller regions. The quantization approach has several merits: 1) it is computationally simple; 2) only the quantization size must be determined, and its choice is relatively easy; and 3) the redundant data can easily be utilized to improve performance by quantizing them to the nearest center.

In this brief, the quantization method is applied to kernel least squares regression. The quantized kernel least squares regression (with regularization) is formulated and its optimal solution is derived. Based on a simple online quantization method, we then derive a recursive algorithm, namely the quantized kernel recursive least squares (QKRLS) algorithm, to update the solution when new data are available. The performance of QKRLS is illustrated by simulation results.


II. QUANTIZED KERNEL LEAST SQUARES REGRESSION

A. Kernel Least Squares Regression

Consider the problem of least squares regression. The goal is to learn a continuous mapping f : U → R based on a sequence of available examples (up to and including time i − 1) {u(j), y(j)}_{j=1}^{i−1}, where U ⊂ R^d is the input space. The hypothesis space for learning is assumed to be an RKHS H associated with a Mercer kernel κ(u, u′), i.e., a continuous, symmetric, and positive-definite function κ : U × U → R [1]. To find such a function f, one may solve the regularized least squares regression in H

    \min_{f \in H} \sum_{j=1}^{i-1} \bigl[ y(j) - f(u(j)) \bigr]^2 + \gamma \| f \|_H^2        (1)

where ‖·‖_H denotes the norm in H and γ ≥ 0 is the regularization factor that controls the smoothness of the solution (to avoid overfitting). By the representer theorem [1], [4], the function f in H minimizing (1) can be expressed as a linear combination of the kernels evaluated at the available data

    f(\cdot) = \sum_{j=1}^{i-1} \alpha_j \, \kappa(u(j), \cdot).        (2)

Let α = [α_1, …, α_{i−1}]^T be the coefficient vector. Then the learning problem can also be defined as finding the α ∈ R^{i−1} that minimizes

    \min_{\alpha \in R^{i-1}} \| \mathbf{y} - \mathbf{K}\alpha \|^2 + \gamma \, \alpha^T \mathbf{K} \alpha        (3)

where y = [y(1), …, y(i−1)]^T and K ∈ R^{(i−1)×(i−1)} is the Gram matrix with elements K_{ij} = κ(u(i), u(j)). The solution of (3) is α* = (K + γI)^{−1} y, where I denotes an identity matrix of appropriate dimension.

The above least squares problem can alternatively be formulated in a feature space. According to Mercer's theorem [1], [4], any Mercer kernel κ(u, u′) induces a mapping φ from the input space U to a high- (possibly infinite-) dimensional feature space F_κ. The inner product in the feature space can be easily calculated using the well-known kernel trick: φ(u)^T φ(u′) = κ(u, u′). The learning problem in the feature space F_κ is to find a high-dimensional weight vector Ω ∈ F_κ that minimizes

    \min_{\Omega \in F_\kappa} \sum_{j=1}^{i-1} \bigl[ y(j) - \Omega^T \varphi(u(j)) \bigr]^2 + \gamma \| \Omega \|_{F_\kappa}^2        (4)

where ‖·‖_{F_κ} denotes the norm in F_κ. The feature space F_κ is isometric-isomorphic to the RKHS H_κ induced by the kernel; this can be easily recognized by identifying φ(u) = κ(u, ·) and f = Ω. In this brief, we do not distinguish between the two spaces if no confusion arises. The solution of (4) can be derived as

    \Omega^* = \Phi \alpha^* = \Phi (\mathbf{K} + \gamma \mathbf{I})^{-1} \mathbf{y}        (5)

where Φ = [φ(u(1)), …, φ(u(i−1))] (the Gram matrix K can also be expressed as K = Φ^T Φ). The aim of the KRLS algorithm is to update this solution recursively as new data (u(i), y(i)) become available [7].
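For concreteness, the following minimal Python sketch computes the batch solution α* = (K + γI)^{−1} y of (3) and the resulting predictor (2). It is only an illustration of the closed-form solution: the Gaussian kernel, the function names, and the default parameter values are assumptions of the sketch, not specifications from this brief.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel kappa(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    d = np.asarray(u, float) - np.asarray(v, float)
    return np.exp(-np.dot(d, d) / (2.0 * sigma**2))

def batch_kernel_ls(U, y, gamma=0.01, sigma=1.0):
    """Batch regularized solution of (3): alpha* = (K + gamma*I)^{-1} y."""
    n = len(U)
    K = np.array([[gaussian_kernel(U[a], U[b], sigma) for b in range(n)] for a in range(n)])
    alpha = np.linalg.solve(K + gamma * np.eye(n), np.asarray(y, float))
    # Predictor (2): f(u) = sum_j alpha_j * kappa(u(j), u)
    predict = lambda u: float(sum(a * gaussian_kernel(uj, u, sigma) for a, uj in zip(alpha, U)))
    return alpha, predict
```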

B. Quantized Kernel Least Squares Regression

The quantized kernel least squares regression in feature space can simply be expressed as

    \min_{\Omega \in F_\kappa} \sum_{j=1}^{i-1} \bigl[ y(j) - \Omega^T \varphi(Q[u(j)]) \bigr]^2 + \gamma \| \Omega \|_{F_\kappa}^2        (6)

where Q[·] denotes a vector quantization (VQ) operator (or vector quantizer). Assume that the quantization codebook C contains N code vectors, i.e., C = {c(n) ∈ U}_{n=1}^{N}. The vector quantizer Q(u) is then a function that maps the input u ∈ U into one of the N code vectors in C. It is specified by the values of the code vectors and by a partition of the input space U into N disjoint and exhaustive regions S_1, S_2, …, S_N, where S_n = Q^{−1}(c(n)) ⊂ U, so that Q[u(j)] = c(n) if u(j) ∈ S_n. Based on these definitions, one can rewrite (6) as

    \min_{\Omega \in F_\kappa} \sum_{n \in J} \sum_{m=1}^{M_n} \bigl[ y(n,m) - \Omega^T \varphi(c(n)) \bigr]^2 + \gamma \| \Omega \|_{F_\kappa}^2        (7)

where J = {n | M_n ≠ 0, 1 ≤ n ≤ N}, M_n denotes the number of input data that lie in the region S_n, i.e., M_n = |{u(j) | u(j) ∈ S_n, 1 ≤ j ≤ i − 1}| (we use |A| to denote the cardinality of a set A), and y(n, m) denotes the desired output corresponding to the mth element from the nth region; that is,


    y(n, m) = y(j), \quad \text{if } u(j) \in S_n \text{ and } \bigl| \{ u(j') \mid u(j') \in S_n,\; 1 \le j' \le j \} \bigr| = m.        (8)

Clearly, \sum_{n=1}^{N} M_n = i − 1 (note that M_n can be zero for some values of n). Suppose now that the index set J contains L elements (L ≤ N), i.e., J = {n_1, …, n_L}. Then the solution of (7) can be readily derived as

    \Omega^* = \bigl( \bar{\Phi} \Lambda \bar{\Phi}^T + \gamma \mathbf{I} \bigr)^{-1} \bar{\Phi} \, \bar{\mathbf{y}}        (9)

where \bar{\Phi} = [φ(c(n_1)), …, φ(c(n_L))], \bar{\mathbf{y}} = [\sum_{m=1}^{M_{n_1}} y(n_1, m), …, \sum_{m=1}^{M_{n_L}} y(n_L, m)]^T, and Λ = diag[M_{n_1}, …, M_{n_L}]. By the matrix inversion lemma [4], (9) can alternatively be expressed as

    \Omega^* = \bar{\Phi} \bar{\alpha}^* = \bar{\Phi} \bigl( \Lambda \bar{\mathbf{K}} + \gamma \mathbf{I} \bigr)^{-1} \bar{\mathbf{y}}        (10)

where \bar{\alpha}^* = (\Lambda \bar{\mathbf{K}} + \gamma \mathbf{I})^{-1} \bar{\mathbf{y}} and \bar{\mathbf{K}} = \bar{\Phi}^T \bar{\Phi} is the Gram matrix with elements \bar{K}_{ij} = κ(c(n_i), c(n_j)). From (10), one can see that, with quantization, the network size of the learned mapping can never be larger than the quantization codebook size N, because dim(\bar{\alpha}^*) = L ≤ N. Thus the computational complexity of this new method is always bounded by the cardinality of the codebook.
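As a concrete illustration of the quantized batch solution (10), the sketch below aggregates the data falling in each region and solves ᾱ* = (ΛK̄ + γI)^{−1} ȳ. The function signature and argument names are illustrative assumptions; only the formula itself comes from the text.

```python
import numpy as np

def quantized_batch_solution(codebook, assign, y, kernel, gamma=0.01):
    """Closed-form quantized solution (10): alpha_bar = (Lambda K_bar + gamma I)^{-1} y_bar.

    codebook : list of code vectors c(n) that received at least one sample
    assign   : assign[j] is the codebook index of the region containing u(j)
    y        : desired outputs y(j), j = 1, ..., i-1
    kernel   : callable kernel(u, v) evaluating the Mercer kernel
    """
    L = len(codebook)
    M = np.zeros(L)                       # multiplicities M_n
    y_bar = np.zeros(L)                   # per-region sums of desired outputs
    for j, n in enumerate(assign):
        M[n] += 1.0
        y_bar[n] += y[j]
    K_bar = np.array([[kernel(codebook[a], codebook[b]) for b in range(L)] for a in range(L)])
    return np.linalg.solve(np.diag(M) @ K_bar + gamma * np.eye(L), y_bar)
```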

C. Online VQ Method

A key problem in the quantized kernel least squares regression is how to design the vector quantizer (i.e., how to build the codebook and how to assign the code vectors to the data). In the literature, there is a variety of VQ methods. Most of them, however, are not suitable for online implementation, because the codebook must be supplied in advance (usually trained on an offline data set) and the computational cost is in general very high. In [13], a simple online (sequential) VQ method was proposed, in which the codebook is trained sequentially from the sample data. We include in Table I a pseudocode description of this online VQ algorithm to make this brief self-contained. The method depends only on the Euclidean distance and does not take the data distribution into account; in general, it is not an optimal VQ that minimizes the mean square distortion D = E[‖u − Q(u)‖²]. However, it is computationally simple (linear complexity in terms of the codebook size) and easy to implement. In addition, if the input domain U is a compact subset of R^d, then for any training sequence and any ε > 0 the codebook size is always finite. This can be easily proved: since U is compact, a finite ε-cover exists for any ε > 0; therefore the packing number is also finite, and the maximum number of ε-separated points in U is finite.

TABLE I
ONLINE VQ ALGORITHM

Initialization: Select the quantization size ε, and initialize the codebook C(1) = {u(1)}.
Computation: while {u(i)} (i > 1) available do
  1) Compute the distance between u(i) and C(i − 1): dis(u(i), C(i − 1)) = ‖u(i) − C_{j*}(i − 1)‖, where j* = arg min_{1 ≤ j ≤ |C(i−1)|} ‖u(i) − C_j(i − 1)‖.
  2) If dis(u(i), C(i − 1)) ≤ ε, keep the codebook unchanged, C(i) = C(i − 1), and quantize u(i) to the closest code vector: Q[u(i)] = C_{j*}(i − 1).
  3) Otherwise, update the codebook, C(i) = {C(i − 1), u(i)}, and quantize u(i) as itself: Q[u(i)] = u(i).
end while
(‖·‖ denotes the Euclidean norm, and C_j(i − 1) denotes the jth element of the codebook C(i − 1).)
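The online VQ step of Table I translates directly into a few lines of Python; the class and method names below are illustrative, not part of the original description.

```python
import numpy as np

class OnlineVQ:
    """Sequential vector quantizer of Table I (Euclidean distance, quantization size epsilon)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.codebook = []                          # list of code vectors

    def quantize(self, u):
        """Return (index, code vector) for u, growing the codebook when needed."""
        u = np.asarray(u, dtype=float)
        if not self.codebook:                       # initialization: C(1) = {u(1)}
            self.codebook.append(u)
            return 0, u
        dists = [np.linalg.norm(u - c) for c in self.codebook]
        j_star = int(np.argmin(dists))
        if dists[j_star] <= self.epsilon:           # keep codebook unchanged
            return j_star, self.codebook[j_star]
        self.codebook.append(u)                     # otherwise add u as a new code vector
        return len(self.codebook) - 1, u
```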

III. QUANTIZED KRLS

In this section, we derive a recursive algorithm to update the solution (10) as new data (u(i), y(i)) become available. The quantizer Q is implemented by the online VQ algorithm of Table I. Since all the matrices and vectors in (10) are time dependent, we rewrite (10) as

    \Omega^*(i-1) = \bar{\Phi}(i-1)\, \bar{\alpha}^*(i-1) = \bar{\Phi}(i-1) \bigl[ \Lambda(i-1) \bar{\mathbf{K}}(i-1) + \gamma \mathbf{I} \bigr]^{-1} \bar{\mathbf{y}}(i-1)        (11)

where the argument (i − 1) denotes the time index; for example, Λ(i − 1) is the value of the matrix Λ at time (or iteration) i − 1. In the following, we denote P(i − 1) = [Λ(i − 1)K̄(i − 1) + γI]^{−1}. To derive the recursive algorithm, we consider two cases.

A. First Case: dis(u(i), C(i − 1)) ≤ ε

In this case, we have C(i) = C(i − 1) and Φ̄(i) = Φ̄(i − 1) (and hence K̄(i) = K̄(i − 1)). The input u(i) is quantized to the j*th element of the codebook C(i − 1), where j* = arg min_{1 ≤ j ≤ |C(i−1)|} ‖u(i) − C_j(i − 1)‖. The matrix Λ(i) and the vector ȳ(i) are

    \Lambda(i) = \Lambda(i-1) + \xi_{j^*} \xi_{j^*}^T, \qquad \bar{\mathbf{y}}(i) = \bar{\mathbf{y}}(i-1) + y(i)\, \xi_{j^*}        (12)

where ξ_{j*} is a |C(i − 1)|-dimensional column vector whose j*th element is 1 and all other elements are 0. The matrix P(i) can be expressed as P(i) = [P(i − 1)^{−1} + ξ_{j*} ξ_{j*}^T K̄(i − 1)]^{−1}. By using the matrix inversion lemma [4], we obtain

    P(i) = P(i-1) - \frac{P_{j^*}(i-1)\, \bar{\mathbf{K}}_{j^*}(i-1)^T P(i-1)}{1 + \bar{\mathbf{K}}_{j^*}(i-1)^T P_{j^*}(i-1)}        (13)

where P_{j*}(i − 1) and K̄_{j*}(i − 1) denote, respectively, the j*th columns of the matrices P(i − 1) and K̄(i − 1). The coefficient vector can then be updated as

    \bar{\alpha}^*(i) = P(i)\, \bar{\mathbf{y}}(i)
                      = \left[ P(i-1) - \frac{P_{j^*}(i-1)\, \bar{\mathbf{K}}_{j^*}(i-1)^T P(i-1)}{1 + \bar{\mathbf{K}}_{j^*}(i-1)^T P_{j^*}(i-1)} \right] \bigl( \bar{\mathbf{y}}(i-1) + y(i)\, \xi_{j^*} \bigr)
                      = \bar{\alpha}^*(i-1) + \frac{P_{j^*}(i-1)\, \bigl( y(i) - \bar{\mathbf{K}}_{j^*}(i-1)^T \bar{\alpha}^*(i-1) \bigr)}{1 + \bar{\mathbf{K}}_{j^*}(i-1)^T P_{j^*}(i-1)}.        (14)

B. Second Case: dis(u(i), C(i − 1)) > ε

In this case, we have C(i) = {C(i − 1), u(i)} and Φ̄(i) = [Φ̄(i − 1), φ(u(i))]. The matrix Λ(i) and the vector ȳ(i) are

    \Lambda(i) = \begin{bmatrix} \Lambda(i-1) & \mathbf{0} \\ \mathbf{0}^T & 1 \end{bmatrix}, \qquad \bar{\mathbf{y}}(i) = \begin{bmatrix} \bar{\mathbf{y}}(i-1) \\ y(i) \end{bmatrix}.        (15)

The matrix P(i) can be expressed as

    P(i) = \begin{bmatrix} P(i-1)^{-1} & \Lambda(i-1)\, \mathbf{h}(i) \\ \mathbf{h}(i)^T & \gamma + \kappa(u(i), u(i)) \end{bmatrix}^{-1}        (16)

where h(i) = [κ(C_1(i − 1), u(i)), …, κ(C_{|C(i−1)|}(i − 1), u(i))]^T. Using the block matrix inversion identity [4], we derive

    P(i) = r(i)^{-1} \begin{bmatrix} P(i-1)\, r(i) + \mathbf{z}'(i)\, \mathbf{z}(i)^T & -\mathbf{z}'(i) \\ -\mathbf{z}(i)^T & 1 \end{bmatrix}        (17)

where z(i) = P(i − 1)^T h(i), z′(i) = P(i − 1)Λ(i − 1)h(i), and r(i) = γ + κ(u(i), u(i)) − h(i)^T z′(i). Further, the coefficient vector is updated as

    \bar{\alpha}^*(i) = P(i)\, \bar{\mathbf{y}}(i)
                      = r(i)^{-1} \begin{bmatrix} P(i-1)\, r(i) + \mathbf{z}'(i)\, \mathbf{z}(i)^T & -\mathbf{z}'(i) \\ -\mathbf{z}(i)^T & 1 \end{bmatrix} \begin{bmatrix} \bar{\mathbf{y}}(i-1) \\ y(i) \end{bmatrix}
                      = \begin{bmatrix} \bar{\alpha}^*(i-1) - \mathbf{z}'(i)\, r(i)^{-1} e(i) \\ r(i)^{-1} e(i) \end{bmatrix}        (18)

where e(i) = y(i) − h(i)^T ᾱ*(i − 1). We have now obtained a recursive algorithm to solve the quantized kernel least squares problem, which is referred to as the quantized kernel recursive least squares (QKRLS) algorithm and is described in pseudocode in Table II.

TABLE II
QKRLS ALGORITHM

Initialization: Select the quantization size ε and the regularization parameter γ, and initialize C(1) = {u(1)}, Λ(1) = [1], P(1) = [κ(u(1), u(1)) + γ]^{−1}, and ᾱ*(1) = P(1)y(1).
Computation: while {u(i), y(i)} (i > 1) available do
  1) If dis(u(i), C(i − 1)) ≤ ε:
     Keep the codebook unchanged: C(i) = C(i − 1),
     Update the matrix Λ(i): Λ(i) = Λ(i − 1) + ξ_{j*}ξ_{j*}^T,
     Update P(i) using (13), and update ᾱ* using (14).
  2) If dis(u(i), C(i − 1)) > ε:
     Update the codebook: C(i) = {C(i − 1), u(i)},
     Update the matrix Λ(i): Λ(i) = diag[Λ(i − 1), 1],
     Update P(i) using (17), and update ᾱ* using (18).
end while
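The recursion of Table II can be sketched in Python as follows. This is a minimal illustration that follows (12)-(14) for the first case and (15)-(18) for the second case; the class interface, the default parameter values, and the choice of a Gaussian kernel are assumptions of the sketch rather than specifications from the brief.

```python
import numpy as np

class QKRLS:
    """Sketch of the QKRLS recursion (Table II) with a Gaussian kernel."""

    def __init__(self, epsilon=0.4, gamma=0.01, sigma=1.0):
        self.eps, self.gamma, self.sigma = epsilon, gamma, sigma
        self.C = None        # codebook (list of code vectors)
        self.Lam = None      # Lambda(i): diagonal multiplicity matrix
        self.P = None        # P(i) = (Lambda*K_bar + gamma*I)^{-1}
        self.alpha = None    # coefficient vector alpha_bar(i)

    def _k(self, u, v):
        d = u - v
        return np.exp(-np.dot(d, d) / (2.0 * self.sigma**2))

    def predict(self, u):
        if self.C is None:
            return 0.0
        u = np.asarray(u, dtype=float)
        return float(sum(a * self._k(c, u) for a, c in zip(self.alpha, self.C)))

    def update(self, u, y):
        u = np.asarray(u, dtype=float)
        if self.C is None:                               # initialization step of Table II
            self.C = [u]
            self.Lam = np.array([[1.0]])
            self.P = np.array([[1.0 / (self._k(u, u) + self.gamma)]])
            self.alpha = self.P @ np.array([y])
            return
        h = np.array([self._k(c, u) for c in self.C])    # kernel vector h(i)
        dists = [np.linalg.norm(u - c) for c in self.C]
        j = int(np.argmin(dists))
        if dists[j] <= self.eps:                         # Case 1: quantize to existing center
            self.Lam[j, j] += 1.0                        # (12)
            Pj = self.P[:, j]                            # j*th column of P(i-1)
            Kj = np.array([self._k(c, self.C[j]) for c in self.C])   # j*th column of K_bar(i-1)
            denom = 1.0 + Kj @ Pj
            self.P = self.P - np.outer(Pj, Kj @ self.P) / denom           # (13)
            self.alpha = self.alpha + Pj * (y - Kj @ self.alpha) / denom  # (14)
        else:                                            # Case 2: add a new center
            z = self.P.T @ h
            zp = self.P @ (self.Lam @ h)
            r = self.gamma + self._k(u, u) - h @ zp
            e = y - h @ self.alpha
            L = len(self.C)
            P_new = np.zeros((L + 1, L + 1))
            P_new[:L, :L] = self.P + np.outer(zp, z) / r                  # (17)
            P_new[:L, L] = -zp / r
            P_new[L, :L] = -z / r
            P_new[L, L] = 1.0 / r
            self.P = P_new
            self.alpha = np.append(self.alpha - zp * e / r, e / r)        # (18)
            self.Lam = np.block([[self.Lam, np.zeros((L, 1))],
                                 [np.zeros((1, L)), np.eye(1)]])          # (15)
            self.C.append(u)
```

A typical use would be to call model.update(u, y) for each incoming pair after constructing, e.g., QKRLS(epsilon=0.4, gamma=0.01), with model.predict(u) giving the current estimate.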

C. Computational Complexity

Table III summarizes the computational costs per iteration of QKLMS, QKRLS, and KRLS with ALD-based sparsification (KRLS-ALD) [7], where L denotes the network size. For QKLMS, the overall cost is O(L). For QKRLS, the costs of the online VQ and of updating α(i) are O(L), but the cost of updating P(i) is O(L²). For KRLS-ALD, all three parts of the algorithm have quadratic complexity in the network size, except that when the codebook is changed the cost of updating α(i) is O(L). Clearly, QKRLS requires less computation than KRLS-ALD.

TABLE III
COMPARISON OF COMPUTATIONAL COSTS PER ITERATION

Algorithm  | Operation    | Computational Cost
QKLMS      | Online VQ    | O(L)
           | Update α(i)  | O(L)
QKRLS      | Online VQ    | O(L)
           | Update P(i)  | O(L²)
           | Update α(i)  | O(L)
KRLS-ALD   | ALD test     | O(L²)
           | Update P(i)  | O(L²)
           | Update α(i)  | O(L²) (O(L) if codebook changed)

Fig. 1. Final network size versus testing MSE of QKRLS by varying the quantization size (ε ranging from 0.1 to 2.0).

It is worth noting that, besides the ALD, other sparsification criteria, such as the NC and SC, can also be applied to KRLS (see [4] for details). These sparsification criteria, however, are only used to decide whether or not to accept the current input as a new center, and it is hard to use them to derive a recursive algorithm like QKRLS and KRLS-ALD. So, when applying these sparsification criteria to KRLS, one usually simply discards the redundant data (that is, there is no weight update when the codebook is unchanged).


Fig. 2. Performance comparison between QKLMS, QKRLS, and KRLS-ALD. (a) Learning curves (testing MSE versus iteration). (b) Network growth curves.

IV. SIMULATION RESULTS

We test the performance of the QKRLS algorithm in online time series prediction.

A. Mackey-Glass Time Series

First, we consider Mackey-Glass time series prediction, a benchmark problem for nonlinear learning methods. The time series is generated from the Mackey-Glass differential equation with the same parameter settings as in [12] and is corrupted by additive white Gaussian noise with zero mean and variance 0.01. Except where mentioned otherwise, the Gaussian kernel with width σ = √2/2 is selected as the Mercer kernel (the Gaussian kernel is usually a default choice in kernel adaptive filtering due to its universal approximation capability, desirable smoothness, and numerical stability). The goal is to predict the current value using the previous seven points.

In the first simulation, we show how the quantization size affects the performance. We test the QKRLS algorithm with 20 different quantization sizes in the range [0.1, 2.0]. For each quantization size, 100 Monte Carlo simulations are conducted; in each Monte Carlo run, a segment of 500 samples is used as the training data and another 50 points as the test data (the filter is fixed in the testing phase). The regularization factor is experimentally set to γ = 0.01 (by scanning for the best results). Fig. 1 plots the final network size versus the testing MSE. One can see that the testing MSE is little affected when the quantization size lies in the range [0.1, 0.5], while the final network size decreases dramatically over this range. The result confirms that QKRLS can produce a very compact model while preserving a desirable performance.
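For orientation, a possible skeleton of this first experiment is sketched below. It reuses the hypothetical QKRLS class from Section III; the Mackey-Glass generator and its parameter values are illustrative placeholders (the brief follows the settings of [12]), so the numbers it produces are not the reported results.

```python
import numpy as np

def mackey_glass(n, tau=30, beta=0.2, gamma_mg=0.1, p=10, dt=1.0, x0=1.2):
    """Euler discretization of the Mackey-Glass delay equation (illustrative parameters)."""
    x = np.full(n + tau, x0)
    for t in range(tau, n + tau - 1):
        x[t + 1] = x[t] + dt * (beta * x[t - tau] / (1.0 + x[t - tau] ** p) - gamma_mg * x[t])
    return x[tau:]

# Embed the previous seven points to predict the current (noisy) value.
rng = np.random.default_rng(0)
series = mackey_glass(600) + 0.1 * rng.standard_normal(600)   # noise variance 0.01
emb = 7
X = np.array([series[t - emb:t] for t in range(emb, len(series))])
d = series[emb:]

model = QKRLS(epsilon=0.4, gamma=0.01, sigma=np.sqrt(2) / 2)  # QKRLS sketch from Sec. III
for u, y in zip(X[:500], d[:500]):                            # 500 training samples
    model.update(u, y)
test_mse = np.mean([(y - model.predict(u)) ** 2 for u, y in zip(X[500:550], d[500:550])])
```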

TABLE IV
PERFORMANCE COMPARISON BETWEEN QKLMS, QKRLS, AND KRLS-ALD

Algorithm | Testing MSE      | Network Size | Execution Time (Sec)
QKLMS     | 0.0401 ± 0.0156  | 102 ± 9      | 0.0342
QKRLS     | 0.0227 ± 0.0059  | 102 ± 9      | 0.0961
KRLS-ALD  | 0.0210 ± 0.0055  | 103 ± 6      | 0.2657

In the second simulation, we compare the performance of QKLMS, QKRLS, and KRLS-ALD. The quantization size and the ALD threshold are set to 0.4 and 0.04, respectively, so that the final network sizes of the three algorithms are almost identical. The step size of QKLMS is experimentally set to 0.5. The average learning curves and network growth curves are shown in Fig. 2. Evidently, both QKRLS and KRLS-ALD achieve much better accuracy (they converge to a smaller testing MSE) than QKLMS. The performance of QKRLS is close to that of KRLS-ALD but with less computational complexity (KRLS-ALD chooses its centers by optimizing a squared cost on the input data and hence should be more accurate than QKRLS). Table IV lists the testing MSE and network size at the final iteration, and the average execution time for one simulation run (500 iterations), where the execution time is measured on a PC with a 2.2-GHz processor and 3-GB memory.

In the third simulation, we compare the performance of QKRLS and three simplified KRLS algorithms, namely the KRLS algorithms with the sparsification criteria NC, SC, and ALD, in which the redundant data are simply discarded. The regularization factors of all algorithms are set to 0.01, and the other parameters are chosen so as to produce almost the same final testing MSE. Specifically, for KRLS-NC (simple discard), we set the distance threshold δ1 = 0.3 and the error threshold δ2 = 0.1; for KRLS-SC (simple discard), we set the upper threshold of the surprise measure T1 = −0.1 and the lower threshold T2 = −1.0; for KRLS-ALD (simple discard), we set the ALD threshold to 0.2; for QKRLS, we set the quantization size ε = 0.6. The network growth curves are illustrated in Fig. 3. It is clear that QKRLS significantly outperforms the other three algorithms, yielding a much smaller network size. The testing MSE and network size at the final iteration and the average execution time are shown in Table V. One can see that QKRLS requires less time to achieve the same accuracy. From the simulations we observe that, if the redundant data are purely discarded, the performance becomes worse even when the ALD criterion is used to choose the centers.

Fig. 3. Network growth curves of QKRLS and three simplified KRLS algorithms (KRLS-NC, KRLS-SC, and KRLS-ALD with simple discard).

Fig. 4. Learning curves of QKRLS (ε = 0.4) with different kernel widths (σ = 0.5, 1.0, 3.0, 5.0, 7.0).

Fig. 5. Learning curves of QKRLS with different quantization sizes (ε = 0.0, 0.2, 0.3, 0.4, 0.5, 0.6; polynomial kernel).

Fig. 6. Network growth curves of QKRLS with different quantization sizes (polynomial kernel).

In the fourth simulation, we show the performance of QKRLS with different Gaussian kernel widths. For the case ε = 0.4, the learning curves are shown in Fig. 4. As expected, the kernel width has a significant influence on the learning performance: if it is too large or too small, the performance will be poor. In practice, the kernel width can be set manually or determined by cross-validation.

In the fifth simulation, we show that QKRLS also works well with other Mercer kernels. In this simulation, a third-degree polynomial kernel κ(u, u′) = (1 + u^T u′)^3 was used. The average learning curves for different quantization sizes are shown in Fig. 5, and the corresponding network growth curves are shown in Fig. 6. The results confirm that, with a non-RBF Mercer kernel, QKRLS can still achieve a desirable tradeoff between performance and complexity.
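In code, switching to the polynomial kernel only requires replacing the kernel evaluation in the earlier QKRLS sketch; the subclass below is an illustrative assumption, not part of the brief.

```python
import numpy as np

class PolyQKRLS(QKRLS):
    """QKRLS sketch with the third-degree polynomial kernel kappa(u, u') = (1 + u^T u')^3."""

    def _k(self, u, v):
        return (1.0 + float(np.dot(u, v))) ** 3
```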


TABLE V
PERFORMANCE COMPARISON BETWEEN QKRLS AND THREE SIMPLIFIED KRLS ALGORITHMS

Algorithm                 | Testing MSE      | Network Size | Execution Time (Sec)
KRLS-NC (simple discard)  | 0.0272 ± 0.0068  | 163 ± 12     | 0.0985
KRLS-SC (simple discard)  | 0.0273 ± 0.0066  | 132 ± 9      | 0.0918
KRLS-ALD (simple discard) | 0.0271 ± 0.0064  | 152 ± 9      | 0.1062
QKRLS                     | 0.0273 ± 0.0065  | 25 ± 4       | 0.0497

B. Sunspot Data

The Wolf annual sunspot count time series has long been of interest in the physics and statistics communities and is a benchmark data set for time series prediction [14]. Fig. 7 shows the normalized annual sunspot data over the period 1700–2011. In this example, kernel adaptive filtering is performed using the normalized sunspot data over the period 1700–1979 as the training data and the data over the period 1980–2011 as the testing data. The previous four points are used to predict the current value. In the simulation, we compare the performance of three quantized kernel adaptive filtering algorithms: QKLMS, QKAPA (quantized kernel affine projection algorithm, which can be derived by quantizing the input vectors in the original KAPA algorithm), and QKRLS. The quantization size is set to 0.1 (since the three algorithms adopt the same quantization size, they produce the same network size). The other parameters are set by scanning for the best results: for QKLMS, the step size is 0.2; for QKAPA, the step size is 0.08 and the sliding window length is 20; for QKRLS, the regularization factor is 0.01. With the Gaussian kernel, the learning curves are shown in Fig. 8 (kernel width σ = 0.5) and Fig. 9 (kernel width σ = 1.0). Clearly, the QKRLS algorithm achieves the best performance among the three algorithms.

Fig. 7. Normalized annual sunspot data over the period 1700–2011.

Fig. 8. Learning curves of QKLMS, QKAPA, and QKRLS (σ = 0.5).

Fig. 9. Learning curves of QKLMS, QKAPA, and QKRLS (σ = 1.0).

V. CONCLUSION

A recursive algorithm, namely QKRLS, has been derived to update the solution of the quantized kernel least squares regression. Monte Carlo simulation results show that QKRLS is very effective in yielding a compact model (with a small network size) while preserving a desirable performance. The tradeoff between complexity and accuracy can be simply controlled by the quantization size. The online VQ method used in this brief is simple yet efficient; the demonstrated performance could certainly be further improved by applying more advanced online VQ methods, which would be an interesting topic for future research. Although the present study focuses on kernel least squares regression, we expect that the quantization approach can also be used in other kernel methods.

REFERENCES

[1] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge, MA, USA: MIT Press, 2002.
[2] F. Girosi, M. Jones, and T. Poggio, "Regularization theory and neural networks architectures," Neural Comput., vol. 7, no. 2, pp. 219–269, Feb. 1995.
[3] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
[4] W. Liu, J. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. New York, NY, USA: Wiley, 2010.
[5] W. Liu, P. Pokharel, and J. Principe, "The kernel least mean square algorithm," IEEE Trans. Signal Process., vol. 56, no. 2, pp. 543–554, Feb. 2008.
[6] W. Liu and J. Principe, "Kernel affine projection algorithms," EURASIP J. Adv. Signal Process., vol. 2008, no. 1, pp. 1–13, Jan. 2008.
[7] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004.
[8] W. Liu, I. L. Park, Y. Wang, and J. C. Principe, "Extended kernel recursive least squares algorithm," IEEE Trans. Signal Process., vol. 57, no. 10, pp. 3801–3814, Oct. 2009.


[9] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, no. 2, pp. 213–225, Jun. 1991.
[10] L. Csato and M. Opper, "Sparse online Gaussian process," Neural Comput., vol. 14, no. 3, pp. 641–668, Mar. 2002.
[11] C. Richard, J. C. M. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1058–1067, Mar. 2009.
[12] W. Liu, I. L. Park, and J. C. Principe, "An information theoretic approach of designing sparse kernel adaptive filters," IEEE Trans. Neural Netw., vol. 20, no. 12, pp. 1950–1961, Dec. 2009.
[13] B. Chen, S. Zhao, P. Zhu, and J. C. Principe, "Quantized kernel least mean square algorithm," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 1, pp. 22–32, Jan. 2012.
[14] H. Tong, Nonlinear Time Series: A Dynamical Systems Approach. New York, NY, USA: Oxford Univ. Press, 1990.

