IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 23, NO. 12, DECEMBER 2012


Adaptive Subset Kernel Principal Component Analysis for Time-Varying Patterns

Yoshikazu Washizawa, Member, IEEE

Abstract—Kernel principal component analysis (KPCA) and its online learning algorithms have been proposed and widely used. Since KPCA uses the training samples as the bases of the operator, its online learning algorithms require the preparation of all training samples beforehand. Subset KPCA (SubKPCA), which uses a subset of the samples for the basis set, has been proposed and has demonstrated better performance with less computational complexity. In this paper, we extend SubKPCA to an online version and propose methods to add and exchange a sample in the basis set. Since the proposed method uses the basis set, we do not need to prepare all training samples beforehand. Therefore, the proposed method can be applied to time-varying patterns, in contrast to existing online KPCA algorithms. Experimental results demonstrate the advantages of the proposed method.

Index Terms—Kernel Hebbian algorithm (KHA), kernel principal component analysis (KPCA), subset KPCA (SubKPCA).

Manuscript received May 1, 2011; revised June 21, 2012; accepted August 10, 2012. Date of publication October 23, 2012; date of current version November 20, 2012. This work was supported by the Japan Society for the Promotion of Science (JSPS) Grants-in-Aid for Scientific Research (KAKENHI) Program under Grant 23300069 and Grant 24700163. The author was with the RIKEN Brain Science Institute, Saitama 351-1098, Japan. He is now with the University of Electro-Communications, Tokyo 182-0021, Japan (e-mail: [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2214234

I. INTRODUCTION

Kernel principal component analysis (KPCA), which is a nonlinear extension of PCA, has been proposed [1]–[5] and is now widely used. Standard batch PCA obtains the principal components of data from the eigenvalue problem of the covariance matrix. An alternative method for obtaining principal components uses an online learning scheme and updates the principal components adaptively. These adaptive PCAs have been studied since the 1980s [6]–[9] and have been used for classification, feature extraction, data compression, representation, and so forth. Adaptive approaches have the ability to follow temporal variations of patterns. One of the algorithms of adaptive PCA is called the generalized Hebbian algorithm (GHA) [8], [9], and its nonlinear extension using the kernel trick is the kernel Hebbian algorithm (KHA) [10], [11]. KHA adaptively obtains nonlinear principal components with less computational complexity than batch KPCA in some problems, and has been applied to image processing, especially super resolution. However, KHA requires the preparation of all training samples beforehand. Therefore, KHA cannot be applied to pattern features that gradually change with time. For example, in the case of feature extraction from an image sequence or a movie, the angle of the moving object and the illumination may change during the observation. KHA cannot track such features unless all samples from the sequence are given.

We have previously proposed subset KPCA (SubKPCA), which uses two sets of training samples, i.e., a full training set and a subset of it for the basis [12]. We call these sets the learning set and the basis set, respectively. SubKPCA uses the subset as the basis of the nonlinear principal components and obtains the best transform operator over the full training set. This extension enables us to control the tradeoff between the accuracy of KPCA and the computational complexity. If the subset is identical to the learning set, SubKPCA is equivalent to standard batch KPCA. If we use a smaller number of samples for the basis set, say m, the computational complexity for learning (eigenvalue decomposition) is O(m^3) or O(m^2 r), where r is the required number of principal components, and the complexity for evaluation is O(m). The approximation error of SubKPCA is much smaller than if we simply reduced the number of samples for KPCA.

In this paper, we extend SubKPCA to an adaptive version. The advantage of this extension is not only its computational complexity; because SubKPCA uses the learning and basis sets, new pattern features can be learned using a predefined basis set. This extension overcomes one of the problems of KHA, namely that KHA requires the preparation of all training samples beforehand. Furthermore, we propose a method for exchanging samples in the basis set. If the input patterns gradually change with time, a predefined basis set cannot represent the input patterns. When the approximation error becomes large, we can add the input sample to the basis set. If the state of the patterns changes, old samples in the basis set are no longer useful for representing the input patterns. Therefore, such samples can be removed from the basis set, reducing the computational complexity. SubKPCA thus achieves flexible and sustainable feature extraction and a tracking learning system.

Incremental KPCA (IKPCA) and adaptive KPCA (AKPCA) have been proposed [13], [14]. AKPCA only takes into account adding samples. Therefore, if the state of the patterns continually changes, the size of the basis set increases without bound, and the computational complexity and the required memory become larger; it is then impossible to construct a sustainable online learning system. IKPCA uses reduced-set (RS) compression, i.e., obtaining preimages of the basis to avoid increasing the basis set. Preimaging has higher computational complexity, and IKPCA is only considered for incrementing samples, not for tracking dynamic data.


A. Notation

For vectors we use bold lower case letters, e.g., a, b, and for matrices bold capital letters, e.g., A, B. If vectors and matrices can be infinite dimensional (functions or operators), we use nonbold lower case and capital letters, respectively; nonlinear mappings determined by kernel functions sometimes map to infinite-dimensional spaces. For a matrix A, [A]_(a:b,c:d) denotes the submatrix of A extracted from the ath to the bth rows and from the cth to the dth columns. Similarly, [A]_(a,c:d) = [A]_(a:a,c:d) denotes a row vector, and [A]_(a:b,c) = [A]_(a:b,c:c) denotes a column vector.
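As a small illustration (not part of the original paper), this 1-indexed, inclusive submatrix notation corresponds to the following 0-indexed NumPy slicing; the array A below is arbitrary.

import numpy as np

A = np.arange(1, 21).reshape(4, 5)   # an arbitrary 4 x 5 matrix
sub = A[1:3, 0:4]                    # [A]_(2:3,1:4): rows 2..3, columns 1..4 (1-indexed, inclusive)
row = A[1, 0:4]                      # [A]_(2,1:4) = [A]_(2:2,1:4), a row vector
col = A[1:3, 0]                      # [A]_(2:3,1) = [A]_(2:3,1:1), a column vector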

II. CURRENT METHODS

This section briefly reviews the existing methods.

A. Kernel PCA and SubKPCA

Let X = {x_1, ..., x_n} be a set of d-dimensional input vectors, and let Φ be a mapping to a higher dimensional feature space for the kernel trick [2]. For computation, instead of directly using Φ, we use positive definite kernel functions that satisfy k(x_1, x_2) = ⟨Φ(x_1)|Φ(x_2)⟩ for all x_1, x_2 ∈ R^d, where ⟨·|·⟩ denotes the inner product. The Gaussian kernel function k(x_1, x_2) = exp(−c‖x_1 − x_2‖^2) and the polynomial kernel function k(x_1, x_2) = (⟨x_1|x_2⟩ + c)^p are often used, where c and p are hyperparameters. Suppose that the averages of the input samples and of the mapped samples are zero vectors. Let a matrix arrange the sample vectors, X = [x_1, ..., x_n] ∈ R^{d×n}. The principal components of batch PCA can then be obtained from the eigenvectors of the covariance matrix (1/n)XX^⊤. Let r be the desired number of principal components. The transform matrix to the r-dimensional subspace spanned by the largest r principal components is

U_PCA = [u_1, ..., u_r]

where u_i is the eigenvector corresponding to the ith largest eigenvalue. An input vector x is transformed by U_PCA^⊤ x.

In KPCA, let the operator of mapped input vectors be

S = [Φ(x_1), Φ(x_2), ..., Φ(x_n)].    (1)

Then the principal components are the eigenvectors of (1/n)SS^*, where ^* denotes the adjoint operator, which is equivalent to the transpose (Hermitian transpose) in real (complex) finite-dimensional space (we henceforth omit 1/n). An eigenvector u of SS^* satisfies SS^* u = λu. By multiplying by S^* from the left, we obtain S^*S(S^*u) = λ(S^*u); hence v_0 = S^*u is an eigenvector of the kernel Gram matrix K = S^*S ∈ R^{n×n} ([K]_(i,j) = k(x_i, x_j)). By normalizing v_0, we have v = (1/√λ)S^*u. Thus, the transform operator U, obtained by arranging the eigenvectors of the r largest eigenvalues, is given by

U_Kp = [u_1, u_2, ..., u_r] = S V Λ^{-1/2},  V = [v_1, v_2, ..., v_r],  Λ = diag([λ_1, ..., λ_r])

where λ_1, ..., λ_r are the sorted eigenvalues of K, and u_1, ..., u_r and v_1, ..., v_r are the corresponding eigenvectors of SS^* and K, respectively. Note that u_1, ..., u_r are in the feature space (which could be infinite dimensional) and v_1, ..., v_r are in the n-dimensional finite space. An input vector x is transformed to an r-dimensional vector

U_Kp^* Φ(x) = Λ^{-1/2} V^⊤ [k(x_1, x), ..., k(x_n, x)]^⊤

where the vector [k(x_1, x), ..., k(x_n, x)]^⊤ is called the empirical kernel map [2]. KPCA minimizes the mean squared error in the feature space

J(U) = Σ_{i=1}^{n} ‖Φ(x_i) − UU^* Φ(x_i)‖^2    (2)

that is, U = U_Kp is one of the minimum solutions of J(U) under the rank constraint. This criterion minimizes the average distance between the mapped samples Φ(x_i) and their projections UU^* Φ(x_i). Since the minimizer of J(U) lies in the space spanned by {Φ(x_1), Φ(x_2), ..., Φ(x_n)} (i.e., the range of S), U can be parameterized by U = SA, where A ∈ R^{n×r} is the parameter matrix whose rank is r. Therefore, the set {Φ(x_1), Φ(x_2), ..., Φ(x_n)} is the basis for KPCA.

On the other hand, SubKPCA uses a smaller number of bases, i.e., a subset of the full training samples [12]. Let

T = [Φ(y_1), Φ(y_2), ..., Φ(y_m)],  Y = {y_1, ..., y_m} ⊂ X    (3)

where m ≤ n and Y is the basis set. Then U is parameterized by U = TB, where B ∈ R^{m×r} is the parameter matrix. Note that the objective function (2) still evaluates all samples; only the basis is restricted to the subset Y. The subset is determined using k-means, forward search, the random sample consensus (RANSAC) approach, or other sample selection methods for data mining [12], [15]–[17]. Let K_xy = S^*T ∈ R^{n×m} ([K_xy]_(i,j) = k(x_i, y_j)) and K_y = T^*T ∈ R^{m×m} ([K_y]_(i,j) = k(y_i, y_j)). The principal components of SubKPCA are given by the generalized eigenvalue problem

K_xy^⊤ K_xy z = ξ K_y z.    (4)

The outline of the derivation is shown in the Appendix. Suppose that z_1, ..., z_r are the sorted generalized eigenvectors, each normalized to satisfy ⟨z|K_y z⟩ = 1. Let Z = [z_1, ..., z_r]. Then the transform of SubKPCA is given by U_SubKp = TZ, and the transform of an input vector x is

U_SubKp^* Φ(x) = Z^⊤ [k(y_1, x), ..., k(y_m, x)]^⊤.

The size of the eigenvalue problem (4), as well as the dimension of the empirical kernel map [k(y_1, x), ..., k(y_m, x)]^⊤, is m, the size of the basis set Y.
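As an illustration of (3)–(4), a minimal batch SubKPCA can be written as follows. This is a sketch under our own naming, with the Gaussian kernel and with centering omitted; it is not the author's code.

import numpy as np
from scipy.linalg import eigh

def gaussian_kernel(A, B, c=0.1):
    # [K]_(i,j) = exp(-c * ||a_i - b_j||^2)
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-c * d2)

def subkpca_fit(X, Y, r, c=0.1, reg=1e-10):
    # X: n x d learning set, Y: m x d basis set (subset of X), r: number of components
    Kxy = gaussian_kernel(X, Y, c)                        # S* T, size n x m
    Ky = gaussian_kernel(Y, Y, c) + reg * np.eye(len(Y))  # T* T, slightly regularized
    # generalized eigenvalue problem  Kxy^T Kxy z = xi Ky z   (4)
    xi, Z = eigh(Kxy.T @ Kxy, Ky)                         # ascending order, z^T Ky z = 1
    return Z[:, ::-1][:, :r]                              # keep the r largest

def subkpca_transform(x, Y, Z, c=0.1):
    # U_SubKp^* Phi(x) = Z^T [k(y_1, x), ..., k(y_m, x)]^T
    h = gaussian_kernel(x[None, :], Y, c).ravel()
    return Z.T @ h

When Y = X, this coincides with standard KPCA (up to the sign of the eigenvectors), in line with the equivalence stated in the Introduction.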


B. GHA and KHA

Suppose that the transform of GHA is U_GHA ∈ R^{d×r}, and for the input vector x, the output is given by y = U_GHA^⊤ x. The optimization criterion of GHA is given by

max_{U_GHA} ‖y‖^2 = x^⊤ U_GHA U_GHA^⊤ x   s.t.  U_GHA^⊤ U_GHA = I    (5)

which maximizes the variance of the transformed vector y = U_GHA^⊤ x (note that the mean of y is assumed to be zero). The Lagrangian is given by

L(U_GHA, Λ) = x^⊤ U_GHA U_GHA^⊤ x + (1/2) Trace[Λ(U_GHA^⊤ U_GHA − I)].

Hence, the updating rule of GHA is given by [9]

U_GHA^{t+1} = U_GHA^t + η ∂_U L    (6)
∂_U L = x_t x_t^⊤ U_GHA − U_GHA lt(U_GHA^⊤ x_t x_t^⊤ U_GHA)    (7)

where lt(·) is an operator that extracts the lower triangular part, and η is a learning parameter for which either a fixed positive real value or a value decreasing monotonically with time t

η = (τ / (t + τ)) η_0    (8)

is used, where τ is a time constant and η_0 is an initial value. For KHA, a stochastic meta-descent approach for obtaining optimal learning coefficients η was also proposed [11]. U_GHA in (5) has an ambiguity with respect to unitary rotation, i.e., if U_GHA is a solution to (5), U_GHA R is also a solution for any r × r unitary matrix R. The lower triangular operation lt is introduced to obtain each principal component. If we omit lt(·), U_GHA does not represent the individual principal components, but the space spanned by U_GHA is the same as in the case using lt(·) [18].

KHA is a nonlinear extension of GHA obtained by using the kernel trick. Using the operator S defined by (1), the transform of KHA is parameterized by

U_KHA = S A    (9)

where A is an n × r matrix. This is the same approach as with KPCA. From (6) and (7), the updating rule of KHA is

S A_{t+1} = S A_t + η δU_t    (10)
δU_t = Φ(x_t)Φ(x_t)^* S A_t − S A_t lt((A_t^⊤ S^* Φ(x_t))(Φ(x_t)^* S A_t))    (11)
A_{t+1} = A_t + η (e_t k_t^⊤ A_t − A_t lt(A_t^⊤ k_t k_t^⊤ A_t))    (12)

where the suffix t denotes the time index, k_t = [K]_(1:n,t), and e_t ∈ R^n is a vector whose tth element is 1 and whose other elements are zero. This updating rule requires the preparation of all training samples beforehand. Its calculation cost is O(nr^2).
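For concreteness, one iteration of (12) can be sketched as follows (NumPy; the Gram matrix K is assumed to be precomputed, and the names are ours, not the author's).

import numpy as np

def kha_step(A, K, t, eta):
    # A: n x r coefficients of U_KHA = S A; K: n x n Gram matrix; t: index of the current sample
    k_t = K[:, t]                              # k_t = [K]_(1:n,t)
    y = A.T @ k_t                              # r-dimensional output for x_t
    e_t = np.zeros(K.shape[0]); e_t[t] = 1.0
    # A_{t+1} = A_t + eta (e_t k_t^T A_t - A_t lt(A_t^T k_t k_t^T A_t))   (12)
    return A + eta * (np.outer(e_t, y) - A @ np.tril(np.outer(y, y)))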

III. SUBKHA

KHA requires the preparation of all training samples beforehand for the basis. In SubKPCA, two sets of training samples are considered, i.e., the full training set X and the basis set Y, which is a subset of X. Therefore, we propose SubKHA. Similar to KHA, the transform of SubKHA is parameterized by

U_SubKHA = T B    (13)

where B is an m × r matrix and T is the operator of the current basis, defined in (3). SubKHA has three kinds of updates: 1) update of the matrix B (Section III-A); 2) update of the basis set (Section III-B); and 3) update of a weight vector for centering (Section III-C). We provide a calculation method for these three updates. For every incoming sample, B and the weight for centering should be updated. If an incoming sample can be represented by the current basis set, we do not need to update the basis set; we update the basis set only if its approximation error is large. We discuss the criterion for this exchange in Section IV.

A. Update of B

If we simply apply an updating rule similar to that of KHA, (10)–(12), the first term of δU_t, i.e., Φ(x_t)Φ(x_t)^* S A_t, may be out of the range of T if x_t is not in the basis set Y. Therefore, the updated operator U_SubKHA cannot be parameterized with an m × r matrix as in (13). To learn in the limited space spanned by T, we only learn from the part within the range of T and ignore the part in its orthogonal complement. We introduce the orthogonal projection onto the range of T in front of the first term of δU_t. We assume that K_y is nonsingular. If the Gaussian kernel is used and the basis set has no duplicates, K_y is nonsingular [19]. Otherwise, we assume that K_y is regularized by K_y ← K_y + εI, where ε > 0 is a regularization parameter. The projection operator is given by T K_y^{-1} T^*, and the updating rule of SubKHA is

T B_{t+1} = T B_t + η δU_SubKHA    (14)
δU_SubKHA = T K_y^{-1} T^* Φ(x_t)Φ(x_t)^* T B_t − T B_t lt(B_t^⊤ T^* Φ(x_t)Φ(x_t)^* T B_t)    (15)
B_{t+1} = B_t + η (K_y^{-1} h_t h_t^⊤ B_t − B_t lt(B_t^⊤ h_t h_t^⊤ B_t))    (16)

where h_t = T^* Φ(x_t) = [k(y_1, x_t), ..., k(y_m, x_t)]^⊤ is the empirical kernel map by T. If we fix the basis set, K_y^{-1}, or the Cholesky decomposition of K_y for the inverse operation, can be calculated beforehand. Suppose that we have K_y^{-1} or the Cholesky decomposition; then the computational complexity of one iteration is as follows:
1) m evaluations of kernel functions for h_t;
2) m^2/2 multiplications for K_y^{-1} h_t;
3) rm multiplications for B^⊤ h_t;
4) mr^2/2 multiplications for B lt(B^⊤ h_t h_t^⊤ B);
5) mr multiplications for (K_y^{-1} h_t)(h_t^⊤ B_t).
In KHA, we can calculate K beforehand, and k_t is the tth column vector of K; its calculation is therefore rather simple. In SubKHA, however, we can choose the size of the basis set, m, which controls the tradeoff between accuracy and computational complexity.
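A corresponding sketch of one SubKHA coefficient update (16), with K_y^{-1} kept precomputed; centering (Section III-C) is omitted here and the names are ours.

import numpy as np

def subkha_step(B, Ky_inv, h_t, eta):
    # B: m x r coefficients of U_SubKHA = T B; Ky_inv: K_y^{-1}; h_t: empirical kernel map of x_t
    y = B.T @ h_t                               # r-dimensional output
    # B_{t+1} = B_t + eta (K_y^{-1} h_t h_t^T B_t - B_t lt(B_t^T h_t h_t^T B_t))   (16)
    return B + eta * (np.outer(Ky_inv @ h_t, y) - B @ np.tril(np.outer(y, y)))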


B. Updating Basis Set

If the pattern state gradually changes with time, it is necessary to change the basis set. In this subsection, we discuss how to update the basis set. We assume that the sample to be added or exchanged is already given; the criterion for deciding whether an input sample should be added to the basis set, and which sample in the basis set should be removed, is discussed in Section IV. If we modify the basis set Y, we also have to modify the matrix B (we henceforth omit the suffix t) as well as T. Suppose that we have U_SubKHA = TB and modify T to T̃. We then discuss how to obtain B̃, the modified version of B. We introduce a simple criterion for B̃ as follows:

B̃ = argmin ‖U_SubKHA − Ũ_SubKHA‖_F^2    (17)

where Ũ_SubKHA = T̃ B̃. This rule obtains B̃ such that Ũ_SubKHA is the closest to U_SubKHA. Let K̃_y = T̃^* T and K̃̃_y = T̃^* T̃. Then we have the least-mean-square-error solution

B̃ = K̃̃_y^{-1} K̃_y B.    (18)

This involves a rather heavy calculation cost for every update of the subset. By using information about B, we can reduce the cost of the update. We discuss two cases: 1) adding one sample to the subset and 2) exchanging one sample of the subset. For this update, we use K_y and its inverse K_y^{-1}. Suppose that, before updating, we keep K_y^{-1}, or its Cholesky decomposition for the inverse operation, in memory. Updating K_y^{-1} can be achieved in O(m^2) by using the following theorem on partitioned matrices [20].

Theorem 1: Let A be an m × m matrix, B an m × n matrix, C an n × m matrix, and D an n × n matrix. Suppose that A is nonsingular. Then the matrix

[A, B; C, D]    (19)

is nonsingular if and only if the n × n matrix Q = D − C A^{-1} B is nonsingular. In this case, we have

[A, B; C, D]^{-1} = [A^{-1} + A^{-1} B Q^{-1} C A^{-1},  −A^{-1} B Q^{-1};  −Q^{-1} C A^{-1},  Q^{-1}]    (20)
                  = [A^{-1}, 0; 0, 0] + [−A^{-1} B; I_n] Q^{-1} [−C A^{-1}, I_n].    (21)
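For example, when one sample is appended to the basis, applying (20) with A = K_y, B = C^⊤ = h = T^*Φ(y_{m+1}), and D = k(y_{m+1}, y_{m+1}) gives an O(m^2) update of K_y^{-1}. A sketch with our own naming:

import numpy as np

def add_sample_inverse(Ky_inv, h, k_new):
    # Ky_inv: current m x m K_y^{-1}; h: T^* Phi(y_{m+1}); k_new: k(y_{m+1}, y_{m+1})
    v = Ky_inv @ h                      # A^{-1} B
    q = k_new - h @ v                   # Schur complement Q = D - C A^{-1} B (a scalar here)
    m = Ky_inv.shape[0]
    out = np.empty((m + 1, m + 1))
    out[:m, :m] = Ky_inv + np.outer(v, v) / q
    out[:m, m] = -v / q
    out[m, :m] = -v / q
    out[m, m] = 1.0 / q
    return out

The result can be checked against a direct inversion of the enlarged K_y, e.g., with np.allclose.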

1) Adding One Sample to the Basis: Let us consider the case in which we add one sample y_{m+1} to the subset

T = [Φ(y_1), ..., Φ(y_m)]
T̃ = [Φ(y_1), ..., Φ(y_m), Φ(y_{m+1})].

It is obvious that the criterion (17) is equal to zero, which is the minimum, when

B̃ = [B; 0_(1,r)]    (22)

where 0_(1,r) is a 1 × r matrix (row vector) whose elements are zero.

2) Exchanging One Sample of the Basis: We assume that the sample to be removed has been moved to the end, position m, and that we add a new sample z to the basis set. Let the operators be

T_0 = [Φ(y_1), ..., Φ(y_{m−1})]
T = [Φ(y_1), ..., Φ(y_{m−1}), Φ(y_m)]
T̃ = [Φ(y_1), ..., Φ(y_{m−1}), Φ(z)].

Then the update rule is given by

B̃ = [B_0; 0_(1,r)] + [d_y + (1/p) d_z [h_z^0]^⊤ d_y − (1/p) d_z k(z, y_m);  −(1/p) [h_z^0]^⊤ d_y + (1/p) k(y_m, z)] B_1    (23)

where

h_z^0 = T_0^* Φ(z),  h_{y_m}^0 = T_0^* Φ(y_m),  K_0 = T_0^* T_0
p = k(z, z) − [h_z^0]^⊤ K_0^{-1} h_z^0    (24)
d_y = K_0^{-1} h_{y_m}^0    (25)
d_z = K_0^{-1} h_z^0    (26)
B = [B_0; B_1],  B_0 ∈ R^{(m−1)×r},  B_1 ∈ R^{1×r}.    (27)

The derivation is given in the Appendix. Its computational cost is as follows:
1) m − 1 evaluations of the kernel function for h_z^0;
2) m − 1 multiplications for d_y = K_0^{-1} h_{y_m}^0 [i.e., (51)];
3) m^2/2 multiplications for K_0^{-1} [i.e., (52)];
4) m^2/2 multiplications for d_z = K_0^{-1} h_z^0;
5) rm + O(m) multiplications for (23).
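As a reference implementation (not the optimized rule above), the exchange can also be computed directly from (18) by forming the kernel matrices of the old and new bases; (23) reproduces this result at lower cost. A sketch with illustrative names, where kernel(A, B) is assumed to return the matrix of pairwise kernel values:

import numpy as np

def exchange_basis_direct(B, Y, z, kernel, drop=-1):
    # B: m x r coefficients on basis Y; z: new basis sample; drop: index of the removed sample
    Y_new = np.vstack([np.delete(Y, drop, axis=0), z[None, :]])
    Kyy_new = kernel(Y_new, Y_new)          # tilde-tilde K_y = T~^* T~
    Ky_mix = kernel(Y_new, Y)               # tilde K_y = T~^* T
    B_new = np.linalg.solve(Kyy_new, Ky_mix @ B)   # (18)
    return B_new, Y_new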


C. Centering of Samples

Until now, we have assumed that the given samples have zero mean. However, real samples, or samples mapped using kernel functions, do not have zero mean. For PCA, we subtract the mean vector before performing the eigenvalue decomposition. In KPCA and SubKPCA, this centering process can be done without obtaining the mean vector directly [1]. For SubKHA, we assume that the mean vector is also time-varying.

We start from the centering of batch SubKPCA [21]. While KPCA usually adopts simple averaging for the mean vector, SubKPCA uses weighted averaging to use the optimal mean vector in the limited space. Let the simple mean vector of the mapped input samples {Φ(x_1), ..., Φ(x_n)} be Φ̄_1

Φ̄_1 = (1/n) Σ_{i=1}^{n} Φ(x_i) = (1/n) S 1_n

where 1_n is an n-dimensional vector whose elements are all 1. SubKPCA uses the best approximation of Φ̄_1 in the limited space spanned by T

Φ̄_2 = argmin_{Φ ∈ R(T)} ‖Φ − Φ̄_1‖ = (1/n) T K_y^{-1} K_xy^⊤ 1_n.    (28)

While KPCA applies the uniform weight (1/n)1_n to the columns of S, SubKPCA applies the weight (1/n) K_y^{-1} K_xy^⊤ 1_n to the columns of T.

Let us consider the estimation of the mean vector for online learning. In online learning, the mean vector x̄ is often estimated with one of the following two methods: 1) the simple average of the l latest samples, i.e., x̄_t = (1/l) Σ_{i=0}^{l−1} x_{t−i}, or 2) a geometric progression for the averaging weights

x̄_t = (1 − γ) x_t + γ x̄_{t−1} = (1 − γ) Σ_{i=0}^{t} γ^i x_{t−i}    (29)

where γ is the forgetting factor, and the summation of the weights is 1 for t → ∞ ((1 − γ) Σ_{i=0}^{∞} γ^i = 1). We use the second method because we do not need to store past vectors.

We next combine these two concepts, i.e., (28) and (29). In the feature space, according to (29), the current estimate of the mean mapped vector is

Φ̄_t^0 = (1 − γ) Φ(x_t) + γ Φ̄_{t−1}.

The best approximation of this estimated mean is obtained using the following criterion:

Φ̄_t = argmin_{Φ ∈ R(T_t)} ‖Φ − Φ̄_t^0‖    (30)

where T_t denotes the operator T at time t. Since Φ is in the range of T_t, Φ can be parameterized by a linear combination of T_t

Φ = T_t a    (31)

where a ∈ R^m is a weight vector for averaging. Then (30) is a simple least squares problem, which is minimized when

a_t = K_y^{-1} ((1 − γ) h_t + γ T_t^* T_{t−1} a_{t−1}).    (32)

By using the estimated mean vector Φ̄_t = T_t a_t, the updating rules (15) and (16) are modified to

δU_SubKHA = T K_y^{-1} T^* (Φ(x_t) − Φ̄_t)(Φ(x_t) − Φ̄_t)^* T B_t − T B_t lt(B_t^⊤ T^* (Φ(x_t) − Φ̄_t)(Φ(x_t) − Φ̄_t)^* T B_t)    (33)
B_{t+1} = B_t + η (K_y^{-1} (h_t − p_t)(h_t − p_t)^⊤ B_t − B_t lt(B_t^⊤ (h_t − p_t)(h_t − p_t)^⊤ B_t))    (34)

where p_t = K_y a_t. For this update, we keep and update p_t instead of a_t. We discussed three types of updating for T in Section III-B, and we now discuss the corresponding three types of updating for p_t = K_y a_t: 1) T is not changed; 2) one sample is added to T; and 3) one sample in T is exchanged.

1) T Is Not Changed: From (32), we simply update

p_{t+1} = (1 − γ) h_t + γ p_t.

2) One Sample Is Added to T: Suppose that

T_t = [Φ(y_1), ..., Φ(y_m)],  T_{t+1} = [Φ(y_1), ..., Φ(y_m), Φ(y_{m+1})]

[K_y]_t = T_t^* T_t, and [K_y]_{t+1} = T_{t+1}^* T_{t+1}. Then the update of p_t is given by

p_{t+1} = (1 − γ) h_t + γ [I_m;  [T_t^* Φ(y_{m+1})]^⊤ [K_y^{-1}]_t] p_t.

Since we keep [K_y^{-1}]_t, the calculation complexity is O(m^2).

3) One Sample Is Exchanged: Suppose that

T_t = [Φ(y_1), ..., Φ(y_{m−1}), Φ(y_m)],  T_{t+1} = [Φ(y_1), ..., Φ(y_{m−1}), Φ(z)]

and [K_y]_t = T_t^* T_t. Then the update of p_t is

p_{t+1} = (1 − γ) h_t + γ T_{t+1}^* T_t ([K_y]_t)^{-1} p_t

where T_{t+1}^* T_t ([K_y]_t)^{-1} can be calculated using (20) in O(m^2).
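A sketch of the bookkeeping for the unchanged-basis case: p_t is updated by the geometric weighting and then used in the centered rule (34). The ordering of the two updates and the names are our assumptions.

import numpy as np

def subkha_centered_step(B, p, Ky_inv, h_t, eta, gamma):
    # p: p_t = K_y a_t, kept in place of a_t; h_t: empirical kernel map of x_t
    p = (1.0 - gamma) * h_t + gamma * p      # case 1): T is not changed
    g = h_t - p                              # centered empirical kernel map
    y = B.T @ g
    # B_{t+1} = B_t + eta (K_y^{-1} g g^T B_t - B_t lt(B_t^T g g^T B_t))   (34)
    B = B + eta * (np.outer(Ky_inv @ g, y) - B @ np.tril(np.outer(y, y)))
    return B, p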


IV. CRITERION FOR EXCHANGE OF BASIS SET

In Section III, we discussed the updating of the basis set Y, the matrix B, and p. T and B define the transform onto the principal components, U_SubKHA = TB, and p is used for the centering. For sequential samples, we update B and p for every incoming sample; however, we update the basis set Y only if the incoming sample cannot be represented by the current basis set or, more precisely, only if the representation error of the incoming sample is large. In this section, we introduce a simple criterion for this exchange of the basis set. Other criteria can also be combined with the method of Section III.

In SubKPCA, the squared error J in (2) is caused by the rank reduction and by the approximation error due to the subset [the first and second terms of (42) in the Appendix]. The subset approximation error is given by Trace[K − K_xy K_y^† K_xy^⊤] [see (42) in the Appendix]. Let P_R(S) and P_R(T) be the orthogonal projectors onto R(S) and R(T), respectively. Then the approximation error yields

Trace[K − K_xy K_y^† K_xy^⊤] = Trace[S (P_R(S) − P_R(T)) S^*].    (35)

Recall that R(S) is the space spanned by the samples used for the criterion, and R(T) is the space spanned by the basis. Therefore, in order to minimize the approximation error, R(T) should be chosen to be close to R(S).

Let us consider the problem of whether we should add an incoming vector x to the basis set. From the above discussion, x should be added if x is not well represented in R(T). The representability of an input sample x by the current basis set is measured by the distance between the original mapped sample Φ(x) and its projection onto the subspace spanned by the subset. Since the orthogonal projection onto the subspace is P = T K_y^{-1} T^*, the representability d(x, T) is

d(x, T) = ‖Φ(x) − PΦ(x)‖^2    (36)
        = k(x, x) − ⟨h, K_y^{-1} h⟩    (37)

where h = T^* Φ(x) = [k(y_1, x), ..., k(y_m, x)]^⊤ is the empirical kernel map of x by T. Since we keep K_y^{-1} and use h for updating T and B, the additional computational complexity is small. If x is in the basis set, d(x, T) = 0, and if Φ(x) is far from the subspace spanned by T, d(x, T) is large. If d(x, T) is greater than a threshold, we add the sample x to the basis set. The threshold θ can be determined by a fixed value or by a probability

θ = d̄_t + α σ̄_t    (38)

where α is a hyperparameter, and d̄_t and σ̄_t are the estimated average and standard deviation of d(x, T), respectively. They are estimated using (29). If α = ∞, we never exchange, and if α = −∞, we exchange at every iteration. Suppose that d(x, T) follows a Gaussian distribution. When α = 0, 1, and 2, the probabilities of exchange are about 0.5, 0.16, and 0.023, respectively.

In the current basis set, the sample that has the least contribution to U_SubKHA is the one attaining

min_{i=1,...,m} k(y_i, y_i) ‖[B]_(i,1:r)‖^2

since U_SubKHA = T B = Σ_{i=1}^{m} Φ(y_i)[B]_(i,1:r). If the size of the basis set, m, exceeds the desired size, we replace this sample with the new input sample. This simple criterion may easily add outliers. However: 1) outliers in the basis set do not increase the cost function directly; 2) outliers are easily removed unless similar outliers are input; and 3) as we see in Section V, SubKHA is not very sensitive to the size of the basis set, m_0. For datasets with outliers, we have to use a larger basis set to avoid the basis set being occupied by outliers. More sophisticated robust basis selection criteria could avoid including outliers in the basis set.

We summarize our algorithm in Algorithm 1; a sketch of the exchange decision is given after the listing.

Fig. 1. Nonlinear dataset of first 200 samples.

Fig. 2. Total empirical error E_emp of concrete data. 50, 100, and 300 are SubKHA with m_0 = 50, 100, and 300, respectively.

Algorithm 1 SubKHA
1: Parameters: maximum size of the basis set m_0, forgetting factor γ, number of principal components (rank) r, α or θ for thresholding
2: Initialize: t = 0, basis set Y = ∅, number of bases m = 0, K_y^{-1} = [], B = []
3: for t = 0, 1, ... do
4:   Get input sample x_t
5:   if m < m_0 then
6:     Add x_t to the basis set, Y ← Y ∪ {x_t}
7:     Update B, (22)
8:     Update K_y^{-1}, (20)
9:     Obtain the empirical kernel map h and d(x, T), (37)
10:    m ← m + 1
11:  else
12:    Obtain the empirical kernel map h and d(x, T), (37)
13:    if d(x, T) > θ then
14:      Exchange a basis sample of Y
15:      Update K_y^{-1}, (20)
16:      Update B, (23)
17:    end if
18:  end if
19:  Update B, (34) or (16)
20: end for
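A sketch of the gating logic used in Algorithm 1: compute d(x, T) from (37), track its running mean and deviation in the spirit of (29), apply the threshold (38), and pick the least contributing basis sample for removal. The variance recursion and the names are our assumptions.

import numpy as np

def representability(h, Ky_inv, kxx):
    # d(x, T) = k(x, x) - <h, K_y^{-1} h>   (37)
    return kxx - h @ (Ky_inv @ h)

def should_exchange(d, stats, alpha, gamma):
    # stats = (running mean, running mean of squares) of d(x, T), updated as in (29)
    mean, mean_sq = stats
    mean = (1.0 - gamma) * d + gamma * mean
    mean_sq = (1.0 - gamma) * d * d + gamma * mean_sq
    sigma = np.sqrt(max(mean_sq - mean * mean, 0.0))
    return d > mean + alpha * sigma, (mean, mean_sq)     # threshold (38)

def least_contributing_index(B, Ky_diag):
    # index attaining min_i k(y_i, y_i) ||[B]_(i,1:r)||^2
    return int(np.argmin(Ky_diag * (B * B).sum(axis=1)))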

V. EXPERIMENTS

A. Static Pattern

We used two benchmark datasets, "concrete" from the UCI machine learning repository [22] and the benchmark used in [23] and [24]. We respectively call them "concrete" and "nonlinear." The concrete dataset has 1030 9-D samples.


TABLE II
COMPARISON WITH AKPCA USING NONLINEAR DATASET. VALUES ARE TOTAL EMPIRICAL ERRORS (39)

Threshold   0.05     0.1      0.2      0.3      0.4      0.5
m_0         9992     8571     5902     2918     102      6
AKPCA       2194.7   2213.7   2375.9   2586.0   2769.4   2894.8
SubKHA      2224.7   2226.8   2225.5   2225.2   2224.8   2626.8
SubKPCA     2194.4   2194.4   2194.4   2194.4   2206.4   2725.3

Fig. 3. Total empirical error E_emp of nonlinear data. 20 and 100 are SubKHA with m_0 = 20 and 100, respectively.

TABLE I
COMPARISON WITH AKPCA USING CONCRETE DATASET. VALUES ARE TOTAL EMPIRICAL ERRORS (39)

Threshold   0.01    0.05    0.1     0.2     0.3     0.4     0.5
m_0         1027    695     374     131     60      24      7
AKPCA       76.08   77.65   88.88   117.1   160.3   211.2   212.3
SubKHA      76.70   79.14   74.55   76.74   77.43   80.63   141.1
SubKPCA     69.05   69.05   69.05   69.28   71.27   82.26   198.9

Fig. 4. Instantaneous error of the cyclostationary data, filtered by a 500-point simple moving average.

The nonlinear dataset is generated by the difference equation

s_k = (0.8 − 0.5 exp(−s_{k−1}^2)) s_{k−1} − (0.3 + 0.9 exp(−s_{k−1}^2)) s_{k−2} + 0.1 sin(s_{k−1} π).

The data were generated from the initial condition s_0 = s_1 = 0.1. The outputs s_k were corrupted by zero-mean Gaussian noise with variance equal to 0.01. Fig. 1 plots the data before the corruption. We define a 6-D (d = 6) data vector stacked with these nonlinear signals, x_k = [s_{k−d+1}, ..., s_k]^⊤. The number of generated vectors, n, was 10 000.

We compared SubKHA, standard batch KPCA, and KHA. We used the Gaussian kernel function k(x_1, x_2) = exp(−c‖x_1 − x_2‖^2), where c = 1/(2σ^2) with σ the standard deviation of all elements for the concrete dataset, and c = 0.1 for the nonlinear dataset. The other parameters were as follows:
1) number of principal components r: concrete, 5; nonlinear, 2;
2) learning parameter η: concrete, 10^{−2}; nonlinear, 10^{−1};
3) forgetting factor γ: concrete, 1 − 10^{−4}; nonlinear, 1 − 10^{−4};
4) α in (38): concrete, 0.5; nonlinear, 3;
5) size of the basis set m_0: concrete, {50, 100, 300}; nonlinear, {20, 100}.
For the concrete dataset, we input the 1030 samples 100 times, and for the nonlinear dataset, we input the 10 000 samples once. For quantitative evaluation, we used the squared error over all training data

E_emp = Σ_{i=1}^{n} ‖Φ(x_i) − UU^* Φ(x_i)‖^2.    (39)
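For reference, the benchmark signal described above can be generated as follows (a sketch; the random seed and the exact windowing convention are our choices).

import numpy as np

def generate_nonlinear(n=10000, d=6, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    s = np.zeros(n + d)
    s[0] = s[1] = 0.1
    for k in range(2, n + d):
        e = np.exp(-s[k-1]**2)
        s[k] = (0.8 - 0.5 * e) * s[k-1] - (0.3 + 0.9 * e) * s[k-2] + 0.1 * np.sin(np.pi * s[k-1])
    s += noise_std * rng.standard_normal(s.shape)      # zero-mean noise, variance 0.01
    # x_k = [s_{k-d+1}, ..., s_k]^T
    return np.stack([s[k-d+1:k+1] for k in range(d-1, d-1+n)])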

Note that KPCA gives the minimum squared error under the rank constraint. Figs. 2 and 3 show the results for the concrete and nonlinear datasets, respectively. Even though we use only 100 or 300 samples for the basis, the squared error is almost the same as that of KHA, which uses all 1030 samples for the basis. Since KHA utilizes all 1030 training samples as the basis from the beginning, its convergence is much faster than that of SubKHA. On the other hand, SubKHA selectively obtains its basis from incoming samples. Because of this procedure, SubKHA can naturally work on dynamic data.

We coded KHA and SubKHA in C, on a PC with an Intel Core i7 X980 (3.33 GHz) and 12 GB of DDR3 RAM. For vector and matrix computations, we used the Intel Math Kernel Library 10.3.2.137. On this PC, KHA took 2.7 s for 100 iterations, and SubKHA took 2.0, 3.7, and 7.7 s for m = 50, 100, and 300, respectively, for the concrete dataset. Since SubKHA has to select and exchange samples, the computational complexity of SubKHA may be higher than that of KHA even though the size of the basis set is smaller.

We also obtained the squared error for the next 10 000 data of the nonlinear dataset. These 10 000 test data were not used for training. The test errors for SubKHA were 641.7 (m = 20) and 678.2 (m = 100), whereas that of KHA was 639.6. These results do not differ significantly. However, the runtimes were 0.06 s (SubKHA, m = 20), 0.12 s (SubKHA, m = 100), and 96 s (KHA, n = 10 000).

Fig. 5. Contour curves of transformed vectors. "+" represents samples in the basis set.

For the transformation of incoming data, KHA requires n evaluations of the kernel function to obtain the empirical kernel map and a multiplication of an r × n matrix with an n-dimensional vector. On the other hand, SubKHA requires m evaluations of the kernel function to obtain the empirical kernel map and a multiplication of an r × m matrix with an m-dimensional vector.

We also compared SubKHA with AKPCA. Since Ding et al. [14] did not discuss in detail how to obtain the center samples (unless we keep all samples as the basis, we cannot compute the true mean vector), we did not use centering for either AKPCA or SubKHA. The size of the basis set of AKPCA remains unknown until the end of learning. For a fair comparison, we first obtained AKPCA using fixed thresholds, and then obtained SubKHA with a basis set whose size was equal to that of AKPCA. Tables I and II list the results for the concrete and nonlinear datasets, respectively. When the threshold was small (0.01, 0.05), AKPCA was close to KPCA, and the error of AKPCA was smaller than that of SubKHA. However, when we used a larger threshold, i.e., a smaller basis set, the error of SubKHA was smaller than that of AKPCA. For SubKPCA, we used the basis set obtained by AKPCA. When the threshold was large (0.4 or 0.5), SubKHA showed a lower error than SubKPCA because SubKHA selectively determines its basis set.

Fig. 6. First frame of the image sequence. The initial center point and the object area are shown.

B. Dynamic Data

We verified SubKHA by using patterns whose state gradually changes with time. We generated a sample at time t as follows (a sketch of this generator appears after the list).
1) Generate random numbers r_1 and r_2 that follow the standard Gaussian distribution N(0, 1).
2) Let y = [r_1, −r_1^2 + r_2]^⊤.
3) Rotate the vector y by θ = 2πt/25 000 [rad]

x_t = G^⊤ y,   G = [cos(θ), sin(θ); −sin(θ), cos(θ)].
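import numpy as np

def rotating_sample(t, rng):
    # one sample of the rotating 2-D process described above (illustrative names)
    r1, r2 = rng.standard_normal(2)
    y = np.array([r1, -r1**2 + r2])
    theta = 2.0 * np.pi * t / 25000.0
    G = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    return G.T @ y                    # x_t = G^T y

# usage: rng = np.random.default_rng(0); X = np.stack([rotating_sample(t, rng) for t in range(1, 100001)])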

Fig. 7. Results of object tracking.

Fig. 8. Distance between the estimated center points and true center points.

Fig. 9. Size of basis set. th: threshold of AKPCA.

We generated samples for t = 1, 2, ..., 100 000. These data form a 2-D cyclostationary process with four cycles. We used the Gaussian kernel with c = 0.1, basis set size m_0 = 1000, forgetting factor γ = 0.98, learning rate η = 10^{−0.6}, and α = 2 and α = 1000. We used the instantaneous error

E(t) = ‖Φ(x_t) − UU^* Φ(x_t)‖^2    (40)

for quantitative measurements, and the results are plotted in Fig. 4. In this figure, the values are filtered by a simple moving-average filter with 500 points. For α = 1000, samples in the basis set were rarely exchanged. The error of α = 1000 is small only when the cyclic variation is close to the initial basis set. On the other hand, when α = 2, the algorithm exchanges the basis adaptively, and follows the state of the patterns.

Fig. 10. Outline of the experiment: the subject was asked to open and close her eyes, where “C” and “O” stand for “close” and “open,” respectively.

Fig. 5 shows the contour curves of the transformed vectors U^* Φ(x), x ∈ R^2. For α = 1000, the algorithm cannot follow the state since the basis set is almost fixed. On the other hand, for α = 2, the algorithm exchanges the basis set adaptively and follows the state. Since this experiment uses 100 000 samples, the other online KPCA methods, which are not sustainable, are difficult to apply.

Fig. 11. Instantaneous error of the MEG experiment.

For IKPCA, the size of the matrix to be updated becomes 100 000 × r, and we would have to store all 100 000 training data. This computational complexity is not realistic.

C. Object Tracking

We applied SubKHA to an object tracking problem and compared the results with those of other methods. The similarity between a predefined class and an input sample x can be measured by the norm of the KPCA transformation, ‖U_Kp^* Φ(x)‖ [25]. We recorded a sequence of 150 images (Fig. 6). Each frame is a 480 × 270 pixel color image. A ball, a die, and a cube were on a table, and the target was the ball. We rotated the table and changed the illumination gradually during the recording of the image sequence. Therefore, adaptive approaches are necessary to track these objects. This example is a demonstration of the proposed method that shows its application to adaptive feature extraction. For object tracking problems that include occlusions or objects of similar shape or color, further extensions may be required.

The procedure of the experiment was as follows.
1) Specify the initial center point, which is the center of the ball, and the initial object area of 61 × 61 pixels (shown in Fig. 6).
2) Extract all 31 × 31 pixel blocks from the object area.
3) Learn the initial KPCA of the object from the vectorized blocks. Let the operator be U_obj.
4) Randomly extract 500 31 × 31 pixel blocks that are not in the object area (i.e., in the background area).
5) Learn the initial KPCA of the background from the vectorized blocks. Let the operator be U_bak. Then, for frames n = 2, ..., 150:
a) Apply the KPCAs of the object and the background to each 31 × 31 block of the nth frame, and obtain the norms of the transformed vectors, ‖U_obj^* Φ(x)‖ and ‖U_bak^* Φ(x)‖.
b) Find the center point that maximizes

Σ_{x ∈ A} (‖U_obj^* Φ(x)‖ − ‖U_bak^* Φ(x)‖)

Fig. 12. Distribution maps of α-wave. The darker parts represent higher α-waves. (a) Original (eyes closed). (b) Original (eyes open). (c) SubKHA (eyes closed). (d) SubKHA (eyes open).

where A is the 61 × 61 pixel object area around each candidate center point. Update the center point and the object area.
c) Extract 31 × 31 pixel blocks from the object and background areas.
d) Update U_obj and U_bak using the vectorized blocks.
In order to check the learning algorithms, we did not use information from past frames. We compared four methods, i.e., SubKHA (proposed), AKPCA, KPCA, and the spatial-color mixture of Gaussians (SMOG) as a reference recent object tracking method [26]. For KPCA, the transformation obtained from the first frame was used for all frames. The parameters were as follows: the Gaussian kernel function (c = 0.1); forgetting factor (and the corresponding parameter of AKPCA) γ = 0.97; number of principal components r = 20; learning coefficient η = 5 × 10^{−5}; α = 0.5; maximum number of basis samples m_0 = 500 (only for SubKHA). For SMOG, the parameters in [26] were used (k_0 = 7, h = 16, T_τ = 0.8, σ_b = 0.2; these notations are defined in [26]).

Fig. 7 shows the results. KPCA could track the object before the 100th frame, but when the illumination changed, KPCA could not track the object at the 137th frame. We manually determined the center point for each frame and measured the distance between the manually determined points and the estimated points. Fig. 8 shows the results. SubKHA shows better performance than KPCA and AKPCA. SMOG also could not track the object after the 60th frame. For AKPCA, even though we used a very small threshold, the size of the basis set grew very quickly. For this reason, we could not perform the experiment for all frames. Fig. 9

Fig. 13. Instantaneous distribution maps. (a) Original data. (b) Feature-extracted data by SubKHA. Time corresponds to that in Fig. 10. O: open. C: closed.

shows the size of the basis set. Since AKPCA only considers adding the incoming sample to the basis set, the size of the basis monotonically increases if the state keeps changing. On the other hand, in SubKHA, since the algorithm exchanges samples in the basis set, the size of the basis set does not change.

D. Feature Extraction and Denoising of MEG Data

Here we show an example using magnetoencephalogram (MEG) data. The data were recorded by an Elekta Neuromag system that has 102 channels, each channel having one magnetometer and two gradiometers. We used only the two gradiometers. The subject was a 22-year-old healthy female. During the MEG measurement, the subject was asked to open and close her eyes three times (Fig. 10). The sampling frequency of the original signals was 1001 Hz; the signals were filtered by a 50-Hz low-pass filter and downsampled to a 100.1-Hz sampling frequency. Let s(n, t) be the downsampled signal, where n = 1, ..., 204 is the channel index and t = 0, 1, ..., 15 014 is the discrete time index. The data were vectorized over 20 samples (≈0.2 s), and 4080 (= 204 channels × 20 samples)-dimensional feature vectors x_i (i = 0, ..., 14 995) were used

x_i = [s(1, i), s(1, i + 1), ..., s(1, i + 19), s(2, i), ..., s(2, i + 19), ..., s(204, i), ..., s(204, i + 19)]^⊤.    (41)
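A sketch of the vectorization (41), stacking a 20-sample window of each of the 204 gradiometer signals into a 4080-dimensional vector (names are ours):

import numpy as np

def vectorize_meg(s, i, window=20):
    # s: array of shape (204, T) holding the downsampled signals s(n, t)
    # returns x_i of dimension 204 * window = 4080, ordered channel by channel as in (41)
    return s[:, i:i + window].reshape(-1)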

The parameters were as follows: number of bases m_0 = 30; forgetting factor γ = 0.999; rank r = 10; α = 0.5; kernel function: Gaussian kernel k(x_1, x_2) = exp(−0.5‖x_1 − x_2‖^2); learning coefficient η = 0.8. Fig. 11 shows the instantaneous residual error (40). For SubKHA, we obtained the preimage of the transformed data [27] and reconstructed the signal s̃(n, t) by the inverse transform of (41). It is known that an α-wave (8–12 Hz) appears when a subject is at rest with eyes closed and disappears when the eyes are open (α-blocking) [28]. Therefore, the state of the brain during the experiment is dynamic. We compared the original data and the feature-extracted data produced by SubKHA. In order to extract the α-wave, we applied a third-order Butterworth band-pass filter (BPF) whose passband was 8–12 Hz, and calculated the

instantaneous power of each channel. Since one channel has two signals from the gradiometers, the power of the channel is obtained as their sum. Fig. 12 shows the distribution maps of the square root of the α-wave power averaged over the eyes-closed and eyes-open periods. From Fig. 12, the SubKHA results, (c) and (d), have higher contrast than the original signals, (a) and (b). That is to say, the small noise-induced α-band signal in Fig. 12(b) is suppressed by SubKHA, and SubKHA enhances the major components in the MEG data. Without a priori knowledge of the frequency band (8–12 Hz) or of the phase differences between channels, SubKHA automatically learns that specific waveforms are observed when the eyes are closed and that random noise is observed when the eyes are open. SubKHA extracts the frequent patterns corresponding to the specific waveforms and suppresses the random noise. Note that, since the orthogonal projection of SubKHA reduces the norm of a vector, the ranges of darkness are not the same for the original data and SubKHA. Since there were about 15 000 samples in this experiment, it was difficult to apply conventional KHA.

Fig. 13 shows the instantaneous distribution maps, whose powers are obtained by averaging over 13 samples (≈0.13 s, about one cycle of 8 Hz). These figures also suggest that SubKHA reduces the effect of noise and enhances the features of the MEG data.

VI. CONCLUSION

We proposed SubKHA, which is an online version of SubKPCA. Since SubKHA uses two sets of learning samples, i.e., the full training set and the basis set, it does not require the preparation of all the training samples beforehand. This is in contrast to KHA, which is the online learning version of KPCA. We introduced three update rules for SubKHA: 1) updating B; 2) updating the basis set; and 3) updating p for centering. These three update rules enable SubKHA to serve as a sustainable learning algorithm. We also provided a criterion for the exchange of the basis set. Experimental examples using an open dataset, synthetic data, and object tracking showed that: 1) for the concrete dataset, SubKHA achieves almost the same error as KHA by using 10% or 30% of the training data for the basis set; 2) for


the concrete dataset, SubKHA shows a lower error rate than AKPCA; and 3) SubKHA adaptively pursues gradually changing patterns. Since the basis set of SubKHA does not expand in contrast to AKPCA, SubKHA can be applied to large-scale long-term problems.

APPENDIX

DERIVATION OF SUBKPCA

From (2) and U = TB, J(U) yields

J(B) = ‖S − T B B^⊤ T^* S‖_F^2
     = Trace(K + B B^⊤ K_xy^⊤ K_xy B B^⊤ K_y − 2 B B^⊤ K_xy^⊤ K_xy)
     = ‖K_y^{1/2} B B^⊤ K_xy^⊤ − (K_y^{1/2})^† K_xy^⊤‖_F^2 + Trace(K − K_xy K_y^† K_xy^⊤)    (42)

where ‖·‖_F denotes the Frobenius norm, (·)^{1/2} denotes the square root matrix, and (·)^† denotes the Moore–Penrose pseudo-inverse. Since the second term is constant with respect to B, by the Schmidt approximation theorem the minimum solution is given by the singular value decomposition (SVD) of (K_y^{1/2})^† K_xy^⊤

(K_y^{1/2})^† K_xy^⊤ = Σ_{i=1}^{m} √ξ_i w_i v_i^⊤.    (43)

Then the minimum solution is given by

K_y^{1/2} B B^⊤ K_xy^⊤ = Σ_{i=1}^{r} √ξ_i w_i v_i^⊤    (44)

where W = [w_1, ..., w_r]; hence we obtain B B^⊤ and, in particular, if R(S) ⊂ R(T),

B B^⊤ = (K_y^{1/2})^† W W^⊤ (K_y^{1/2})^†.    (45)

The detailed derivation is in [12]. Note that B has an ambiguity with respect to any orthogonal rotation, i.e., if B satisfies (45), then for any unitary matrix R, BR also satisfies (45). Suppose that B = (K_y^{1/2})^† W; then the row vectors of B are generalized eigenvectors of (4) [20].

DERIVATION OF UPDATING RULE (23)

Let the operators be

T_0 = [Φ(y_1), ..., Φ(y_{m−1})]
T = [Φ(y_1), ..., Φ(y_{m−1}), Φ(y_m)]
T̃ = [Φ(y_1), ..., Φ(y_{m−1}), Φ(z)]

and the vectors and matrices be

h_z^0 = T_0^* Φ(z),  h_{y_m}^0 = T_0^* Φ(y_m),  K_0 = T_0^* T_0
K_y = T^* T = [K_0, h_{y_m}^0; [h_{y_m}^0]^⊤, k(y_m, y_m)]    (46)
K̃̃_y = T̃^* T̃ = [K_0, h_z^0; [h_z^0]^⊤, k(z, z)]
K̃_y = T̃^* T = [K_0, h_{y_m}^0; [h_z^0]^⊤, k(z, y_m)].

Then from (20), we have

B̃ = K̃̃_y^{-1} K̃_y B = [I_{m−1}, b; 0, c] B    (47)
p = k(z, z) − [h_z^0]^⊤ K_0^{-1} h_z^0    (48)
b = K_0^{-1} h_{y_m}^0 + (1/p) K_0^{-1} h_z^0 [h_z^0]^⊤ K_0^{-1} h_{y_m}^0 − (1/p) K_0^{-1} h_z^0 k(z, y_m)    (49)
c = −(1/p) [h_z^0]^⊤ K_0^{-1} h_{y_m}^0 + (1/p) k(y_m, z).    (50)

In these equations, K_0^{-1} h_{y_m}^0 can be obtained from the inverse of (46) by

K_0^{-1} h_{y_m}^0 = −(1/[K_y^{-1}]_(m,m)) [K_y^{-1}]_(1:m−1,m).    (51)

We let d_y = K_0^{-1} h_{y_m}^0. Then, from the relation

[K_y^{-1}]_(1:m−1,1:m−1) = K_0^{-1} + [K_y^{-1}]_(m,m) d_y d_y^⊤

we have

K_0^{-1} = [K_y^{-1}]_(1:m−1,1:m−1) − [K_y^{-1}]_(m,m) d_y d_y^⊤.    (52)

By using the relations (51) and (52), we can reduce the computational complexity of the updating rules (47)–(50). Let d_z = K_0^{-1} h_z^0, and split the matrices as expressed by (27). Then we obtain the updating rule (23).

REFERENCES

[1] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Comput., vol. 10, no. 5, pp. 1299–1319, 1998.
[2] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.
[3] B. Schölkopf, C. Burges, and A. Smola, Advances in Kernel Methods. Cambridge, MA: MIT Press, 1998.
[4] K. Jørgensen and L. Hansen, "Model selection for Gaussian kernel PCA denoising," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 1, pp. 163–168, Jan. 2012.
[5] Z. Yang and E. Oja, "Linear and nonlinear projective nonnegative matrix factorization," IEEE Trans. Neural Netw., vol. 21, no. 5, pp. 734–749, May 2010.
[6] E. Oja, "Simplified neuron model as a principal component analyzer," J. Math. Biol., vol. 15, no. 3, pp. 267–273, 1982.
[7] E. Oja, Subspace Methods of Pattern Recognition. New York: Wiley, 1983.
[8] J. Karhunen and J. Joutsensalo, "Representation and separation of signals using nonlinear PCA type learning," Neural Netw., vol. 7, no. 1, pp. 113–127, 1994.


[9] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network," Neural Netw., vol. 2, no. 6, pp. 459–473, 1989.
[10] K. I. Kim, M. O. Franz, and B. Schölkopf, "Iterative kernel principal component analysis for image modeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 9, pp. 1351–1366, Sep. 2005.
[11] S. Günter, N. N. Schraudolph, and S. Vishwanathan, "Fast iterative kernel principal component analysis," J. Mach. Learn. Res., vol. 8, pp. 1893–1918, Aug. 2007.
[12] Y. Washizawa, "Subset kernel principal component analysis," in Proc. IEEE Int. Workshop Mach. Learn. Signal Process., Sep. 2009, pp. 1–6.
[13] T.-J. Chin and D. Suter, "Incremental kernel principal component analysis," IEEE Trans. Image Process., vol. 16, no. 6, pp. 1662–1674, Jun. 2007.
[14] M. Ding, Z. Tian, and H. Xu, "Adaptive kernel principal component analysis," Signal Process., vol. 90, no. 5, pp. 1542–1553, 2010.
[15] D. Achlioptas, F. McSherry, and B. Schölkopf, "Sampling techniques for kernel methods," in Proc. Annu. Conf. Adv. Neural Inf. Process. Syst., 2001, pp. 335–342.
[16] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 780–784, May 2002.
[17] H. Liu and H. Motoda, Instance Selection and Construction for Data Mining. New York: Academic, 2001.
[18] E. Oja and J. Karhunen, "On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix," J. Math. Anal. Appl., vol. 106, no. 1, pp. 69–84, 1985.
[19] S. Haykin, Neural Networks: A Comprehensive Foundation. Washington, DC: IEEE Computer Society Press, 1994.
[20] D. A. Harville, Matrix Algebra From a Statistician's Perspective. New York: Springer-Verlag, 1997.
[21] Y. Washizawa and M. Tanaka, "Centered subset kernel PCA for denoising," in Proc. 3rd Int. Workshop Subspace Methods, 2010, pp. 354–363.
[22] A. Asuncion and D. Newman. (2007). UCI Machine Learning Repository [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[23] C. Richard, J. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1058–1067, Mar. 2009.


[24] T. Dodd, V. Kadirkamanathan, and R. Harrison, "Function estimation in Hilbert space using sequential projections," in Proc. Int. Federat. Autom. Control Conf. Intell. Control Syst. Signal Process., 2003, pp. 113–118.
[25] K. Tsuda, "Subspace classifier in the Hilbert space," Pattern Recognit. Lett., vol. 20, no. 5, pp. 513–519, 1999.
[26] H. Wang, D. Suter, K. Schindler, and C. Shen, "Adaptive object tracking based on an effective appearance filter," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 9, pp. 1661–1667, Sep. 2007.
[27] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch, "Kernel PCA and de-noising in feature spaces," in Advances in Neural Information Processing Systems 11, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999.
[28] E. Niedermeyer and F. L. da Silva, Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, 5th ed. Baltimore, MD: Williams & Wilkins, 2004.

Yoshikazu Washizawa (M'05) received the B.E. degree from the Nagoya Institute of Technology, Nagoya, Japan, in 2002, and the M.E. and Ph.D. (Dr. Eng.) degrees from the Tokyo Institute of Technology, Tokyo, Japan, in 2004 and 2008, respectively.
He was with Toshiba Corporation, Tokyo, Japan, from 2004 to 2005. In 2005, he was with the Tokyo Institute of Technology, Tokyo, Japan, as a JSPS Research Fellow. From 2005 to 2011, he was with the Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Saitama, Japan. Since 2011, he has been an Assistant Professor with the University of Electro-Communications, Japan, and a Visiting Researcher with the Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute. His current research interests include pattern recognition, machine learning, and biomedical signal processing.
