
Multikernel Least Mean Square Algorithm

Felipe A. Tobar, Sun-Yuan Kung, and Danilo P. Mandic

Abstract—The multikernel least-mean-square algorithm is introduced for adaptive estimation of vector-valued nonlinear and nonstationary signals. This is achieved by mapping the multivariate input data to a Hilbert space of time-varying vector-valued functions, whose inner products (kernels) are combined in an online fashion. The proposed algorithm is equipped with novel adaptive sparsification criteria ensuring a finite dictionary, and is computationally efficient and suitable for nonstationary environments. We also show the ability of the proposed vector-valued reproducing kernel Hilbert space to serve as a feature space for the class of multikernel least-squares algorithms. The benefits of adaptive multikernel (MK) estimation algorithms are illuminated in the nonlinear multivariate adaptive prediction setting. Simulations on nonlinear inertial body sensor signals and nonstationary real-world wind signals of low, medium, and high dynamic regimes support the approach.

Index Terms—Adaptive sparsification, kernel methods, least mean square (LMS), multiple kernels, vector RKHS, wind prediction.

NOMENCLATURE

x^T                               Transpose of vector x.
R                                 Real field.
R^{m×n}                           Space of real m × n matrices.
C_T = {(x_i, d_i)}_{i=1:T}        Training pairs.
{x_i^s}_{i=1:T}                   Support vectors.
H_l                               Standard (monokernel) RKHS.
H^L = {H_l}_{l=1:L}               Set of (monokernel) RKHSs.
H_L (as in (13))                  Vector-valued RKHS.
H_L^e (as in (24))                Empirical vector-valued RKHS.
H = [h_l]_{l=1:L}                 Array of the scalar-valued maps h_l.
⟨h_l, g_l⟩_{H_l}                  Inner product in the RKHS H_l.
‖h_l‖_{H_l}                       Norm in the RKHS H_l.
[a, b] (as in (20))               Elementwise inner product.
Φ_l(x)                            l-th feature sample of the MK.
Φ(x) = [Φ_l(x)]_{l=1:L}           MK feature sample.

I. INTRODUCTION

HIGH-dimensional feature spaces have gained considerable popularity in the estimation of nonlinear functions within signal processing, computational intelligence, and related areas, because of their inherent ability to deal with coupled variables [1] and to model multivariate data [2], [3]. Their uses have been facilitated by recent developments in sensor technology (vector sensors) and computing resources,
making it possible to perform intense computation in real time. Practical estimation algorithms using feature spaces are in their infancy but are very important, as a number of real-world problems can be regarded as, or converted into, the estimation of a nonlinear function (universal function approximation). Linear estimators for this type of problem suffer inherently from nonconvex optimization surfaces, rendering suboptimal and nonunique solutions. All of these highlight the benefits of finding a suitable nonlinear transformation comprising multiple, and possibly heterogeneous, mappings that are combined in an adaptive fashion to track changes in statistical and dynamical properties. Estimation in the so-introduced dual spaces benefits from their completeness, rigorously defined norm and distance, and hence allows for simple and statistically meaningful data representation.

Linear adaptive estimation schemes typically use stochastic gradient optimization (e.g., the least mean square (LMS) [4]) to deal with nonstationary environments. When combined with kernel-type feature space mappings, such structures have shown promise in nonlinear channel equalization [5], [6], time series prediction [7], blind source extraction [8], wind forecasting [9], [10], system and channel identification [11], [12], and anomaly detection [13]. Approaches comprising multiple kernels have also been considered, mainly for classification and discrimination purposes [14], [15]. Their usefulness has been illustrated in a number of applications; however, physically meaningful multikernel approaches suitable for the generality of adaptive estimation tasks, such as nonlinear signal prediction, are still lacking.

The idea behind the proposed adaptive multikernel estimators is to adaptively combine the predefined rigid (static) kernels, based on changes in the dynamics of the data, offering a degree of flexibility necessary for operation in nonstationary environments. Current approaches perform the estimation by averaging Gaussian kernels of different widths, while not making full use of the opportunities offered in the feature space; these also focus mostly on the loss function, hence providing limited physical interpretation of the dual space corresponding to the multikernel concept [16]–[20].

To maximize the benefits of kernel-based learning, we propose to consider multikernel adaptive estimation as the result of a time-varying mapping to a Hilbert space of vector-valued functions. This is a key step toward providing rigorous and physically meaningful estimation of nonstationary multivariate data using multiple kernels, which has not been sufficiently addressed in the existing machine learning literature. Moreover, the kernel regression literature in general lacks an adaptive sparsification criterion, a prerequisite for real-world estimation of nonstationary signals.


In this paper, a vector-valued reproducing kernel Hilbert space (RKHS) is proposed as the backbone of two multikernel algorithms: the multikernel ridge regression (MKRR), first introduced in [21], and the novel multikernel LMS (MKLMS). These map the input samples onto a feature space of vector-valued functions (or an array of real-valued functions), which are then linearly transformed to yield the estimated output in an optimal least-squares sense (MKRR) or in an LMS fashion (MKLMS). For nonstationary environments, we further equip the MKLMS with a two-stage sparsification criterion: an acceptance/rejection step based on the novelty criterion [7], [22], followed by a novel sample elimination step, which ensures an accurate state-space representation throughout the online implementation of the algorithm. This makes it possible to bypass the limitations of current KLMS sparsification criteria in nonstationary environments.

Further contributions of this paper include:
1) a unifying account of multivariate versions of both classic and nonlinear stochastic gradient update stages in the dual space;
2) the introduction of a Hilbert space of vector-valued functions with a reproducing property, suitable as a feature space within multikernel regression;
3) the introduction and analysis of a time-varying mapping from the input space to the constructed feature space that guarantees online adaptation of the proposed algorithm;
4) physical meaning associated with the resulting multikernel structure, not present in the standard single kernel approach; and
5) a discussion of the convergence properties and computational complexity of the proposed algorithms.

Illustrative simulations on both multivariate real-world inertial body sensor signals and nonstationary wind data support the approach.

II. MULTIVARIATE KERNEL LMS

Consider the sets X ⊆ R^n, D ⊆ R^m and the unknown multivariate mapping

$f : X \to D, \quad x \mapsto d = f(x).$   (1)

The aim of the LMS algorithm [4] is to find a linear approximation of the mapping f(·) so that the estimation error is minimum in the mean-square sense. This linear approximation is given by d̂ = Ax, where A ∈ R^{m×n} is termed the weight matrix, whose entries are computed via the stochastic gradient minimization of a squared estimation error, that is, the cost function is approximated by

$J_{\text{LMS}} = \frac{1}{2}\|e\|^2 = \frac{1}{2}(d - Ax)^T (d - Ax)$   (2)

where e is the instantaneous estimation error. This optimization routine is performed in a recursive fashion whenever training pairs C_T = {(x_i, d_i)}_{i=1:T} ⊂ R^n × R^m become available. The update rule of the weight matrix A is given by [23]

$A_0 = 0_{m \times n}$   (3a)
$e_i = d_i - A_{i-1} x_i$   (3b)
$A_i = A_{i-1} + \eta e_i x_i^T$   (3c)

where 0_{m×n} is an m × n matrix of zeros, e_i = d_i − d̂_i, d̂_i = A_{i−1} x_i is the estimated output, and the learning rate η ∈ R^+ controls the tradeoff between the speed of convergence and the confidence in each sample. Note that the transposed vector x_i^T in (3c) is not found in the classic univariate LMS, because the multivariate LMS can be seen as an ensemble of univariate LMS algorithms. For a comprehensive analysis of the LMS algorithm, implementation considerations, and convergence properties, see [4], [23], and [24].
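For concreteness, the following is a minimal NumPy sketch of the multivariate LMS recursion (3a)-(3c); the synthetic data, dimensions, and learning rate are illustrative assumptions rather than values taken from the paper.

import numpy as np

def lms(X, D, eta=0.1):
    """Multivariate LMS, cf. (3a)-(3c): X is T x n inputs, D is T x m desired outputs."""
    T, n = X.shape
    m = D.shape[1]
    A = np.zeros((m, n))                 # (3a) zero initialization
    errors = np.zeros((T, m))
    for i in range(T):
        x, d = X[i], D[i]
        e = d - A @ x                    # (3b) instantaneous error
        A = A + eta * np.outer(e, x)     # (3c) stochastic gradient update
        errors[i] = e
    return A, errors

# Illustrative usage on an assumed synthetic linear system
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))
A_true = rng.standard_normal((2, 4))
D = X @ A_true.T + 0.01 * rng.standard_normal((1000, 2))
A_hat, err = lms(X, D, eta=0.05)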

A. Multivariate Nonlinear LMS

In real-world problems, the mapping (1) is often highly nonlinear and hence the classic linear LMS algorithm is not well equipped to provide a reliable and accurate estimate of f(·). In such cases, one solution is to map the input data onto feature samples through a nonlinear transformation, and then feed these feature samples as an input to an LMS-based algorithm. Consider the nonlinear mapping ψ : X → R^s, and an LMS estimation of the relationship between the transformed input–output pairs C_T^ψ = {(ψ(x_i), d_i)}_{i=1:T}. By identifying the set C_T^ψ as a particular case of the set C_T, the extension of the LMS to deal with feature spaces is straightforward. We refer to this extension as the multivariate nonlinear LMS [25], in which the update rule of the weight matrix A ∈ R^{m×s} is linear in the transformed input ψ(x_i), that is

$A_0 = 0_{m \times s}$   (4a)
$e_i = d_i - A_{i-1} \psi(x_i)$   (4b)
$A_i = A_{i-1} + \eta e_i \psi^T(x_i)$   (4c)

where the estimated output is given by d̂_i = A_{i−1} ψ(x_i). By combining the information contained in the input signals in a nonlinear fashion via the function ψ(·), the multivariate nonlinear LMS provides a powerful and robust modeling resource that inherits the simplicity and ease of implementation of the classic LMS. Nevertheless, as its performance heavily depends on a priori evidence of the nonlinear nature of f(·), its advantages will be negligible when insufficient information about f(·) is available or when f(·) is time varying. To improve the performance of the nonlinear LMS, a larger (or even an infinite) number of regressors can be considered, hence increasing the dimension of the mapping ψ. This extension of the nonlinear LMS is referred to as the class of kernel LMS (KLMS) algorithms.

B. Multivariate KLMS

To unify the derivation of kernel regression algorithms, we first introduce the concept of a Mercer kernel [26]. A Mercer kernel is a continuous, symmetric, positive-definite function K : X × X → R, X ⊆ R^n (or C^n); examples of Mercer kernels used in real-world applications are the triangular kernel K_T [13], the Gaussian kernel K_G [9], [10], [12], and the polynomial kernel K_P [27], given by

$K_T(x_i, x_j) = \begin{cases} b - \|x_i - x_j\|, & \text{if } \|x_i - x_j\| \le b - \epsilon \\ \epsilon, & \text{otherwise} \end{cases}$   (5a)
$K_G(x_i, x_j) = \exp(-a \|x_i - x_j\|^2)$   (5b)
$K_P(x_i, x_j) = (1 + x_i^T x_j)^p$   (5c)

where b, ε, a, and p are real-valued parameters.
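As an illustration, the three kernels in (5a)-(5c) translate directly into code; this is a minimal NumPy sketch, and the parameter values shown are arbitrary placeholders rather than those used later in the paper.

import numpy as np

def triangular_kernel(xi, xj, b=1.0, eps=1e-2):
    # Triangular kernel (5a): linear decay with distance, floored at eps
    d = np.linalg.norm(xi - xj)
    return b - d if d <= b - eps else eps

def gaussian_kernel(xi, xj, a=0.5):
    # Gaussian kernel (5b)
    return np.exp(-a * np.linalg.norm(xi - xj) ** 2)

def polynomial_kernel(xi, xj, p=3):
    # Polynomial kernel (5c)
    return (1.0 + xi @ xj) ** p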


According to Mercer's theorem [28], any Mercer kernel can be expanded using some fixed suitable nonlinear functions {φ(x) : X → H_1, x ∈ X} as follows:

$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$   (6)

where H_1 is a real- or complex-valued RKHS, for which K(·,·) is a reproducing kernel and ⟨·,·⟩ is the corresponding inner product. For a detailed description of RKHS, see [26]. Expression (6) shows that if x_i, x_j ∈ X ⊆ R^n (or C^n) are mapped onto the functions φ(x_i), φ(x_j) ∈ H_1, respectively, the inner product of such functions can be computed on the sole basis of evaluating the generating kernel K at the samples x_i and x_j, even if the mapping φ(·) is unknown. This result is known as the kernel trick [29] and provides advantageous properties in classification and nonlinear function estimation [3], [30], [31].

The Mercer kernel properties make it possible to derive an LMS-based nonlinear estimation scheme in an infinite-dimensional feature space without having to perform algebraic operations directly in such a space, hence combining the desirable properties of the nonlinear mapping stage with the ease of implementation of the LMS algorithm and its ability to operate in nonstationary environments. To this end, consider the infinite-dimensional, and possibly unknown, transformation φ(·) defined by

$\phi : \mathbb{R}^n \to H_1, \quad x \mapsto \phi(x)$   (7)

where H_1 is an RKHS. Note that for any x ∈ X, the feature sample φ(x) : R^n → R is a function. Using the term φ(x) as a regressor, the function f(·) in (1) can be estimated according to

$\hat{d} = A \phi(x)$   (8)

where the weight structure A is an array composed of m transposed elements of H_1, so that d̂ = Aφ(x) ∈ R^m. Similarly to the nonlinear LMS update stage in (4c), A is sequentially updated using the pairs C_T^φ = {(φ(x_i), d_i)}_{i=1:T} according to

$A_i = A_{i-1} + \bar{\mu} e_i \phi^T(x_i)$   (9)

where μ̄ is a learning rate. Observe that although the update law in (9) is theoretically correct when φ(x) is an infinite-dimensional feature element, it cannot be implemented in practice, or at least not using any commonly used software, because these resources rely on the computation of arrays with a finite number of elements. In order not to perform algebraic operations in the feature space, we set out to restate the algebraic operations in the update stage of the algorithm in the form of inner products and then, via the kernel trick, to replace those by evaluations of the reproducing kernel. The first step in this direction is to present the update rule (9) in a nonrecursive fashion, that is,

$A_i = \bar{\mu} \sum_{j=1}^{i} e_j \phi^T(x_j).$   (10)

This allows for the current estimate of the output (8) to be rewritten using the estimated weight matrix A_{i−1} as follows:

$\hat{d}_i = \Big( \bar{\mu} \sum_{j=1}^{i-1} e_j \phi^T(x_j) \Big) \phi(x_i) = \bar{\mu} \sum_{j=1}^{i-1} e_j \phi^T(x_j) \phi(x_i).$   (11)

Finally, if the mapping φ(·) is chosen to be an expansion function [26] of the reproducing kernel of H_1 [see (6)], the inner product in (11) can be computed using the kernel trick (6), and therefore the KLMS output can be written as [5]

$\hat{d}_i = \bar{\mu} \sum_{j=1}^{i-1} e_j K(x_j, x_i)$   (12)

where K is the generating kernel of H_1. As highlighted earlier, the main advantage of the KLMS algorithm is that it does not require the mapping φ, nor the weights A, to be known, as long as the kernel K is specified. Therefore, a successful implementation of the KLMS only depends upon finding a suitable kernel K with respect to the data, rather than the specific mapping φ, which is far more complex.

An important issue with the KLMS algorithm is that its complexity increases each time the update stage is performed, because all the samples and the estimation errors need to be stored for the computation of future estimates. This problem can be surmounted by equipping the KLMS algorithm with a sparsification stage in which new training samples can be either rejected or accepted based on: 1) the current operating point of the algorithm and 2) the performance improvement that the new training sample would provide. For a review of KLMS sparsification criteria, see [31].
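To make the recursion (12) concrete, the following is a minimal NumPy sketch of a KLMS predictor with a Gaussian kernel; the kernel width, learning rate, and toy data are illustrative assumptions, and no sparsification is applied here.

import numpy as np

class KLMS:
    """Kernel LMS, cf. (12): the estimate is a kernel expansion over past inputs."""
    def __init__(self, kernel, mu=0.5):
        self.kernel = kernel
        self.mu = mu
        self.centers = []   # stored past inputs x_j
        self.weights = []   # stored scaled errors mu * e_j

    def predict(self, x):
        # d_hat_i = mu * sum_j e_j K(x_j, x_i), cf. (12)
        return sum(w * self.kernel(c, x) for c, w in zip(self.centers, self.weights))

    def update(self, x, d):
        e = d - self.predict(x)
        self.centers.append(x)
        self.weights.append(self.mu * e)
        return e

# Illustrative usage: one-step-ahead prediction of a toy nonlinear series (assumed data)
gauss = lambda xi, xj, a=0.5: np.exp(-a * np.linalg.norm(xi - xj) ** 2)
rng = np.random.default_rng(1)
s = np.sin(0.1 * np.arange(400)) + 0.05 * rng.standard_normal(400)
klms = KLMS(gauss, mu=0.3)
for i in range(2, 400):
    klms.update(s[i-2:i], s[i])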

III. VECTOR-VALUED RKHS

Recent higher dimensional kernel regression algorithms consider mapping the input samples to complex or quaternion functions [9], [10], [32], [33]. This has proven advantageous when dealing with general bivariate and quadrivariate signals compared with the standard kernel regression paradigm, which maps the input sample to a real-valued function. This also suggests that the performance of general kernel regression algorithms can be improved by augmenting the dimensionality of the feature space, that is, by mapping input samples to multiple feature elements, which are then linearly combined to yield the output estimate. This also allows the learning of multiple nonlinear features present in the data.

We next introduce a novel vector-valued generalization of RKHS as the basis of the proposed class of multikernel regression algorithms, where each constitutive subkernel accounts for a different nonlinear property of the process in hand. Our aim is to perform a flexible linear combination of multiple kernels, whereby the weighting coefficients can be adjusted based on the minimization of the estimation error; this can be achieved, for instance, in an LS or LMS fashion.

A. Hilbert Space of Vector-Valued Functions With a Reproducing Property

Consider the sample set X ⊆ R^n (or C^n) and an indexed collection of L RKHSs over X denoted by H^L = {H_l}_{l=1:L}. From Section II-B, observe that the elements h_l ∈ H_l are functions h_l : X → R. Define the space H_L as a set of vector-valued functions according to

$\mathbf{H}_L = \left\{ H = [h_1, \ldots, h_L]^T,\ h_l \in H_l,\ l = 1, \ldots, L \right\}$   (13)

that is, for H ∈ H_L, its l-th element h_l ∈ H_l. For convenience of notation, we will denote H ∈ H_L in terms of its components in compact notation by H = [h_l]_{l=1:L}. Note that the construction of H_L allows for the representation of its elements as vector-valued functions of the form

$H : X \to \mathbb{R}^L, \quad x \mapsto H(x).$   (14)

We now introduce three properties (lemmas) of the space H_L that are the backbone of the proposed multikernel approach.

Lemma 1 (induced Hilbert space): For any L, X, and H_L as defined above, H_L is a Hilbert space.

Proof: As by construction H_L is a complete vector space, for it to become a Hilbert space we only have to equip it with an inner product inducing a norm. By the definition of H_L, each H_l is also a Hilbert space and is therefore equipped with an inner product. Without loss of generality, we denote the inner product between feature elements h_l, g_l ∈ H_l by ⟨h_l, g_l⟩_{H_l}, and the corresponding norm by ‖h_l‖_{H_l} = sqrt(⟨h_l, h_l⟩_{H_l}). This allows us to define the inner product for H = [h_l]_{l=1:L}, G = [g_l]_{l=1:L} ∈ H_L by

$\langle H, G \rangle = \sum_{l=1}^{L} \langle h_l, g_l \rangle_{H_l}$   (15)

where the inner product properties of ⟨·,·⟩_{H_l} are inherited by ⟨·,·⟩ because of its multilinear construction. With the properties of the norms of the RKHSs in H_L, it then follows that the inner product in (15) induces a norm given by

$\|H\| = \sqrt{\langle H, H \rangle} = \sqrt{\sum_{l=1}^{L} \langle h_l, h_l \rangle_{H_l}} = \sqrt{\sum_{l=1}^{L} \|h_l\|_{H_l}^2}$   (16)

where the norm properties are inherited because of the multilinear construction of H_L.

Lemma 2 (boundedness): The evaluation functional F_x(H) = H(x) ∈ R^L is bounded for any x ∈ X and H ∈ H_L.

Proof: The evaluation functional F_x(h_l) = h_l(x) ∈ R is bounded ∀ h_l ∈ H_l, that is, ∃ M_l ∈ R^+ s.t. ‖F_x(h_l)‖_2 ≤ M_l ‖h_l‖_{H_l}, where ‖·‖_2 is the L_2-norm. Combined with the norm definition in (16), this leads to

$\|F_x(H)\|_2^2 = \sum_{l=1}^{L} \|F_x(h_l)\|_2^2 \le \sum_{l=1}^{L} M_l^2 \|h_l\|_{H_l}^2 \le \max_{l=1,\ldots,L}\{M_l^2\} \sum_{l=1}^{L} \|h_l\|_{H_l}^2 = M^2 \|H\|^2$   (17)

where M^2 = max_{l=1:L}{M_l^2}. Therefore, ‖F_x(H)‖_2 ≤ M‖H‖, that is, F_x(H) is bounded.

Lemma 3 (reproducing property): The evaluation functional F_x(H) = H(x), ∀ x ∈ X, H ∈ H_L, can be written as a product between the arrays H = [h_l]_{l=1,...,L} and K_L(x,·) = [k_l(x,·)]_{l=1,...,L}, where k_l is a reproducing kernel of H_l.

Proof: The reproducing property of the RKHS H_l, F_x(h_l) = ⟨h_l, k_l(x,·)⟩_{H_l}, leads to

$F_x(H) = [F_x(h_l)]_{l=1:L} = \big[ \langle h_l, k_l(x, \cdot) \rangle_{H_l} \big]_{l=1:L}.$   (18)

Therefore the evaluation functional can be expressed as a function of K_L and H given by

$F_x(H) = [H, K_L(x, \cdot)]$   (19)

where the operator [·,·] denotes the elementwise inner product

$\left[ \begin{bmatrix} a_1 \\ \vdots \\ a_N \end{bmatrix}, \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix} \right] = \begin{bmatrix} \langle a_1, b_1 \rangle \\ \vdots \\ \langle a_N, b_N \rangle \end{bmatrix}.$   (20)

As a consequence of the reproducing property presented in Lemma 3, the kernel trick for vector-valued RKHS can now be implemented by choosing H = K_L(y, ·), y ∈ X in (19), leading to

$K_L(y, x) = [K_L(y, \cdot), K_L(x, \cdot)].$   (21)
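As a small illustration of (20) and (21), the vector-valued kernel evaluation K_L(y, x) is simply the stack of the subkernel evaluations; the NumPy sketch below uses assumed Gaussian and polynomial subkernels for demonstration.

import numpy as np

def K_L(y, x, subkernels):
    # Vector-valued kernel evaluation: one subkernel evaluation per coordinate, cf. (20)-(21)
    return np.array([k(y, x) for k in subkernels])

# Assumed subkernels for illustration
k_gauss = lambda y, x: np.exp(-0.5 * np.linalg.norm(y - x) ** 2)
k_poly = lambda y, x: (1.0 + y @ x) ** 2
y, x = np.array([1.0, 0.0]), np.array([0.5, -0.5])
print(K_L(y, x, [k_gauss, k_poly]))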

Remark 1: The approach in (21) is generic, as for the dimension L = 1, H_L is by construction a (single kernel) RKHS and Lemmas 1–3 correspond to the standard RKHS properties. This is observed from (19), which collapses into the standard reproducing property of RKHS when (20) degenerates into a scalar product.

The standard reproducing property is defined for real-valued (scalar) function spaces [26], and therefore it cannot be applied to general vector-valued function spaces, such as H_L. If K_L = [k_l]_{l=1:L} is, however, regarded as a vector-valued extension of the concept of a reproducing kernel, (19) can be seen as a generalized reproducing property in Hilbert spaces of vector-valued functions, which we refer to as a vector-valued RKHS.

Remark 2: The space of vector-valued functions H_L in (13) has the reproducing property presented in (19), that is, for any x ∈ X, H ∈ H_L, the evaluation functional F_x(H) can be expanded as a product between H and K_L(x, ·). The uniqueness of K_L is guaranteed by construction from the uniqueness of the kernels k_l. This makes it possible to use the space H_L as a feature space in the construction of kernel regression algorithms, providing a theoretical backbone for their design.


With the so-introduced vector-valued RKHS, we next propose a function approximation scheme which relies on the utilization of both multiple kernels and different optimization criteria for the weight update. We refer to this class of algorithms as multikernel regression algorithms.

B. Multikernel Ridge Regression

For convenience, we present the single-output version of the algorithm; this will serve as a basis for introducing a general multivariate case as an ensemble of multiple single-output algorithms. We also show that the multikernel extension of the kernel ridge regression algorithm [34], [35], first presented in [21], can be regarded as a particular case of kernel regression using the above-defined vector-valued RKHS as the feature space.

Consider the spaces X ⊆ R^n, D ⊆ R, a collection of training pairs C_T = {(x_t, d_t)}_{t=1:T}, a collection of support samples C_S = {x_t^s}_{t=1:S}, and the vector-valued RKHS H_L, as defined in (13), built from the collection of RKHSs {H_l}_{l=1,...,L}. The aim of MKRR is to perform a nonlinear approximation of the relationship defining the pairs in C_T by linearly combining multiple feature space elements in an optimal least-squares sense. To that end, consider the time-invariant mapping given by

$\Phi : X \to \mathbf{H}_L, \quad x \mapsto \Phi(x) = [\Phi_1(x), \Phi_2(x), \ldots, \Phi_L(x)]^T$   (22)

where Φ_l(x) ∈ H_l is a real-valued (scalar) function, l = 1, ..., L. Using the feature elements Φ(x), x ∈ X, our aim is to approximate the output d by

$\hat{d} = \langle W, \Phi(x) \rangle$   (23)

where W ∈ H_L is an array of coefficients. As an extension to the monokernel ridge regression algorithm, we propose to find a projection of the optimal weights W in the empirical [30] space H_L^e built upon the support samples C_S

$\mathbf{H}_L^e = \left\{ H = \Big[ \sum_{i=1}^{S} \omega_{1,i} \Phi_1(x_i^s), \ \ldots, \ \sum_{i=1}^{S} \omega_{L,i} \Phi_L(x_i^s) \Big]^T,\ \omega_{l,i} \in \mathbb{R},\ x_i^s \in C_S \right\}.$   (24)

This projection has the form $W = [\sum_{i=1}^{S} \omega_{l,i} \Phi_l(x_i^s)]_{l=1:L}$, and therefore the estimate (23) can be written as $\hat{d} = \sum_{i=1}^{S} \sum_{l=1}^{L} \omega_{l,i} \langle \Phi_l(x_i^s), \Phi_l(x) \rangle$ which, by choosing the mappings Φ_l to be expansion functions of the spaces H_l, can be expressed via the kernel trick as follows:

$\hat{d} = \sum_{i=1}^{S} \sum_{l=1}^{L} \omega_{l,i} K_l(x_i^s, x).$   (25)

As in [21], the optimal least-squares coefficients ω_{l,i} are found via the minimization of the cost function evaluated over the training set C_T

$J = \frac{1}{2} \sum_{i=1}^{T} e_i^2 + \frac{r}{2} \sum_{i=1}^{S} \sum_{l=1}^{L} \omega_{l,i}^2$   (26)

where r is a regularization coefficient and e_i denotes the estimation error for the training pair (x_i, d_i), given by¹

$e_i = d_i - \hat{d}_i.$   (27)

¹Note that training pairs and support vectors are not necessarily the same.

Setting ∂J/∂ω_{l,i} = 0, the optimal weights become

$\Omega = \left( \mathbf{K}\mathbf{K}^T + r I \right)^{-1} \mathbf{K} Y$   (28)

where I is the identity matrix, and

$\Omega = \begin{bmatrix} [\omega_{1,i}]_{i=1:S} \\ \vdots \\ [\omega_{L,i}]_{i=1:S} \end{bmatrix}, \quad [\omega_{l,i}]_{i=1:S} = \begin{bmatrix} \omega_{l,1} \\ \vdots \\ \omega_{l,S} \end{bmatrix}, \quad \mathbf{K} = \begin{bmatrix} \mathbf{K}_1 \\ \vdots \\ \mathbf{K}_L \end{bmatrix}, \quad Y = \begin{bmatrix} d_1 \\ \vdots \\ d_T \end{bmatrix}$   (29)

while the subkernel Gram matrices (also referred to as kernel evaluation matrices) are given by K_l(p, q) = K_l(x_p^s, x_q), and are the blocks of the multikernel Gram matrix K. The MKRR thus provides an estimation structure that cannot be achieved by standard monokernel ridge regression [34], because the MKRR assigns a different kernel combination to each support vector.

Remark 3: The MKRR algorithm offers a physically meaningful framework to approximate nonhomogeneous nonlinear mappings by combining subkernels in an optimal regularized least-squares sense.
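A compact NumPy sketch of the MKRR training step (28) and the prediction (25) is given below; the choice of subkernels, the regularization value, and the synthetic data are assumptions made for illustration only.

import numpy as np

def mkrr_fit(X_train, y_train, X_support, kernels, r=1e-3):
    """Solve (28): Omega = (K K^T + r I)^(-1) K y, with K the stacked subkernel Gram matrices."""
    S, T, L = len(X_support), len(X_train), len(kernels)
    K = np.zeros((L * S, T))
    for l, k in enumerate(kernels):
        for p in range(S):
            for q in range(T):
                K[l * S + p, q] = k(X_support[p], X_train[q])   # blocks K_l(p, q), cf. (29)
    Omega = np.linalg.solve(K @ K.T + r * np.eye(L * S), K @ y_train)
    return Omega.reshape(L, S)   # omega[l, i] multiplies K_l(x_i^s, x), cf. (25)

def mkrr_predict(x, X_support, kernels, omega):
    # d_hat = sum_i sum_l omega[l, i] * K_l(x_i^s, x), cf. (25)
    return sum(omega[l, i] * k(xs, x)
               for l, k in enumerate(kernels)
               for i, xs in enumerate(X_support))

# Illustrative usage with assumed Gaussian and polynomial subkernels
kg = lambda a, b: np.exp(-2.0 * np.linalg.norm(a - b) ** 2)
kp = lambda a, b: (1.0 + a @ b) ** 2
rng = np.random.default_rng(2)
X = rng.standard_normal((150, 3)); y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(150)
Xs = X[rng.choice(150, 10, replace=False)]
W = mkrr_fit(X, y, Xs, [kg, kp])
print(mkrr_predict(X[0], Xs, [kg, kp], W))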



C. Adaptive MKLMS

The aim of MKLMS is to perform a nonlinear approximation of the relationship defining the pairs in C_T by both linearly combining multiple feature space elements and building a support dictionary in an adaptive fashion. For the MKLMS, we now consider the time-varying mapping given by

$\Phi^i : X \to \mathbf{H}_L, \quad x \mapsto \Phi^i(x) = [c_1^i \Phi_1(x), c_2^i \Phi_2(x), \ldots, c_L^i \Phi_L(x)]^T$   (30)

where i is the time index and {c_l^i}_{i=1,2,...} is a sequence of time-varying parameters, to approximate the output d_i by

$\hat{d}_i = \langle W, \Phi^i(x_i) \rangle.$   (31)

Note that although the inclusion of both the parameters {c_l^i}_{l=1,...,L} and W may at first appear redundant, they have different physical meanings: the parameters {c_l^i}_{l=1,...,L} cater for the instantaneous contribution of each kernel within the multikernel algorithm at the time instant i, and therefore their update is crucial for the adaptive behavior of the algorithm, while the parameter matrix W (via its L elements) extracts information from the feature samples to replicate the nonlinear nature of the signal. Accordingly, the updates of these two weight structures are performed separately.

In a similar manner to the KLMS algorithm presented in Section II, the update of W is performed using a stochastic gradient as follows:

$W_i = W_{i-1} + \mu e_i \Phi^i(x_i) = \mu \sum_{j=1}^{i} e_j \Phi^j(x_j)$   (32)

where μ is the learning rate. Similarly to the classic (single kernel) KLMS, (31) and (32) lead to the following expression for the output estimate

$\hat{d}_i = \Big\langle \mu \sum_{j=1}^{i-1} e_j \Phi^j(x_j), \ \Phi^i(x_i) \Big\rangle = \mu \sum_{j=1}^{i-1} e_j \big\langle \Phi^j(x_j), \Phi^i(x_i) \big\rangle.$   (33)

The standard KLMS uses the kernel trick to express the above product via a kernel evaluation. To restate the problem so that the kernel trick can be applied within MKLMS, we use the definition of Φ^i in (30) to rewrite the inner product in (33) using the vector-valued RKHS inner product (15) as follows:

$\big\langle \Phi^j(x_j), \Phi^i(x_i) \big\rangle = \sum_{l=1}^{L} \big\langle c_l^j \Phi_l(x_j), c_l^i \Phi_l(x_i) \big\rangle_{H_l} = \sum_{l=1}^{L} c_l^j c_l^i k_l(x_j, x_i).$   (34)

By combining (34) and (33), the output estimate d̂_i can be rewritten as

$\hat{d}_i = \mu \sum_{j=1}^{i-1} e_j \sum_{l=1}^{L} c_l^i c_l^j k_l(x_i, x_j)$   (35)

while, by setting ω_{i,j,l} = e_j c_l^i c_l^j, we have

$\hat{d}_i = \mu \sum_{j=1}^{i-1} \sum_{l=1}^{L} \omega_{i,j,l} k_l(x_i, x_j).$   (36)

By virtue of introducing a time-varying optimization of kernel combinations into MKLMS, the adaptive design of Φ^i can be performed by updating ω_{i,j,l}, and therefore the parameters {c_l^i}_{l=1:L} need not be explicitly updated. Notice that, as a result of incorporating an LMS update for W into (31), the relation (36) models the estimate of d_i as a linear combination of multiple kernels. Therefore, (36) can be regarded as a multiple kernel generalization of (12).

We now present the multivariate version of the proposed algorithm, which represents an ensemble of multiple univariate MKLMS estimators, in which d ∈ R^m, for which the coefficient vector is given by ω_{i,j,l} ∈ R^m and the estimated output is of the form

$\hat{d}_i = \mu \sum_{j=1}^{i-1} \sum_{l=1}^{L} \omega_{i,j,l} k_l(x_i, x_j)$   (37)

or, in a more compact notation,

$\hat{d}_i = \mu \sum_{j=1}^{i-1} \Omega_{i,j} K_L(x_i, x_j)$   (38)

where Ω_{i,j} = [ω_{i,j,1}, ..., ω_{i,j,L}] ∈ R^{m×L} is the coefficient matrix and K_L(x_i, x_j) = [k_l(x_i, x_j)]_{l=1:L} is the Gram matrix. We refer to the expressions in (37) and (38) as a general multivariate version of the MKLMS.

D. Weight Update Within MKLMS

A real-time update stage for the vector ω_{i,j,l} is needed for the expression (37) to accurately represent the observed signal d_i. Because this update is performed in an LMS fashion, the MKLMS architecture effectively comprises two cascaded LMS update stages. Furthermore, unlike the existing KLMS algorithm, where the regressors are evaluations of a single (fixed) kernel, the regressors in the MKLMS vary in magnitude for different kernels; therefore, the optimal learning rate depends on the magnitude of these kernel evaluations. To that end, we implement a normalized version of the MKLMS so that the value of the learning rate is robust to the kernel magnitude. For illustration, consider the estimated output d̂_i in (37) and the estimation error defined by e_i = d_i − d̂_i ∈ R^m. By implementing a normalized LMS-based update for ω_{i,j,l}, we have

$\omega_{i,j,l} = \omega_{i-1,j,l} + \hat{\mu} e_i \frac{k_l(x_i, x_j)}{\epsilon + k_l^2(x_i, x_j)}$   (39)

where the learning rate μ is absorbed into the learning rate μ̂ and ε is a small positive constant. This also reflects the physical meaning of (37) and (39), namely the estimation and update stages.
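The following NumPy sketch illustrates the prediction (37) and the normalized weight update (39) for a fixed dictionary; the subkernels, learning rates, and regularizer are assumed values, and the dictionary management of Section IV is omitted here.

import numpy as np

def mklms_predict(x, dictionary, omega, kernels, mu=0.3):
    # d_hat = mu * sum_j sum_l omega[j, l] * k_l(x, x_j), cf. (37)
    d_hat = 0.0
    for j, xj in enumerate(dictionary):
        for l, k in enumerate(kernels):
            d_hat += omega[j, l] * k(x, xj)
    return mu * d_hat

def mklms_update(x, d, dictionary, omega, kernels, mu=0.3, mu_hat=0.5, eps=1e-4):
    # Normalized update of the coefficients, cf. (39)
    e = d - mklms_predict(x, dictionary, omega, kernels, mu)
    for j, xj in enumerate(dictionary):
        for l, k in enumerate(kernels):
            kval = k(x, xj)
            omega[j, l] += mu_hat * e * kval / (eps + kval ** 2)
    return e

# Illustrative usage with an assumed two-kernel dictionary of 5 samples
kg = lambda a, b: np.exp(-0.5 * np.linalg.norm(a - b) ** 2)
kt = lambda a, b: max(2.0 - np.linalg.norm(a - b), 1e-2)   # triangular-type kernel, assumed parameters
rng = np.random.default_rng(3)
D = list(rng.standard_normal((5, 2)))
omega = np.zeros((5, 2))
for _ in range(200):
    x = rng.standard_normal(2)
    mklms_update(x, np.sin(x[0]), D, omega, [kg, kt])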

E. Comparison With the Monokernel Approach

It is important to notice that the kernel sum in expression (37) can be reformulated as a single time-varying kernel, that is,

$\hat{d}_i = \mu \sum_{j=1}^{i-1} e_j K_j^*(x_i, x_j)$   (40)

where $K_j^*(x_i, x_j) = \sum_{l=1}^{L} c_l^i c_l^j k_l(x_i, x_j)$. Observe that (40) is a version of the single kernel result in (12), whose kernel K_j^* represents a time-varying relationship, hence the subscript j. The adaptive nature of K_j^* is a result of the consideration of the time-varying mapping Φ^i in (30). We now highlight the unique features of the proposed multikernel approach and its advantages in nonstationary estimation as compared with the single kernel approach.


If the mapping Φ^i is considered to be time invariant, there is no justification for the weight update stage, as presented in (39). In that case, the MKLMS collapses into the KLMS algorithm, for which the kernel, K_j^* = K^*, although designed as a sum of kernels, is fixed for every support sample and is thus time invariant. Accordingly, for fixed coefficients c_l^i = c_l, based on (40), the resulting structure features a single kernel $K^*(x_i, x_j) = \sum_{l=1}^{L} c_l^2 k_l(x_i, x_j)$, providing a fixed (in both time and state space) mixture of kernels.

Remark 4: For a time-invariant feature transformation Φ^i, the MKLMS solution degenerates into a variant of the KLMS in (40), which inherits the properties of the kernels k_l within MKLMS and also caters for the multimodality of the data. A well-known example of such a mapping is the Gaussian mixture approach.

One restrictive property of the standard KLMS is that, for nonstationary environments, it does not allow for the consideration of time-varying kernels. This is because, if the kernel were arbitrarily modified in an update recursion, it would be impossible to express the inner product in (11) as a kernel evaluation that yields the output estimate (12). However, in the case of the proposed MKLMS algorithm, as the coefficient vector ω_{i,j,l} is updated in an online adaptive fashion based on the prediction error, then according to (40) the adaptive kernel combination of the MKLMS can be regarded as a KLMS with a time-varying kernel, a critical advantage over the KLMS. In addition, the use of multiple kernels provides intuition and physical meaning, together with enhanced control over the algorithm, as illustrated in the wind modeling example in Section V-C.

Remark 5: The MKLMS represents, via the time-varying mapping in (30), a universal nonlinear function approximation model, which is general and theoretically justified, conveys physical meaning through its subkernel components, and is straightforward to implement. Unlike the standard single kernel KLMS approach, the time-varying kernel in the MKLMS is adaptively adjusted in an online fashion and is therefore able to match the time-varying dynamics of the data.

IV. IMPLEMENTATION OF MULTIKERNEL ALGORITHMS

We now propose a novel adaptive sparsification rule and discuss the convergence properties and computational complexity of the proposed algorithms.

A. Sparsification Criteria

For the MKLMS algorithm to be truly adaptive, the dictionary samples must accurately represent the current operating region in the state space. To that end, we propose a sparsification criterion that includes: 1) an acceptance/rejection stage of samples based on the two-part novelty criterion proposed in² [22] and 2) a novel elimination stage in which samples that do not contribute to the estimation are eliminated.

²An implementation of this concept in KLMS applications is also known as the coherence criterion proposed in [7].

The novelty criterion [22], originally proposed to deal with resource-allocating networks, states that a given sample must be included in the network (or, for the kernel algorithm, the dictionary D) only if both the predicted error norm and its distance to the dictionary, d(x, D) = min_{x_i ∈ D} ‖x_i − x‖, are, respectively, larger than the predefined thresholds δ_e, δ_d. The extension of this criterion to the MKLMS algorithm is as follows: a new sample x_i is added to the dictionary when

$\|d_i - \hat{d}_i\| \ge \delta_e$   (41a)
$\min_{x_j \in D} \|x_i - x_j\| \ge \delta_d$   (41b)

where d_i and d̂_i are, respectively, the true system output (desired signal) and the output estimated using the MKLMS, based on the input x_i.

Remark 6: The definition of the coherence criterion given in [7] differs from that in (41b), as our criterion is defined with respect to the sample difference rather than the kernel evaluation (i.e., a new sample x is only added to the dictionary if max_{x_i ∈ D} k(x_i, x) ≤ δ̄_d). The standard coherence criterion has a physical meaning only for single kernel structures, and therefore a kernel-independent definition is needed for the multikernel approach. To that end, we restate the coherence criterion in terms of the sample distance instead of the evaluation of a particular kernel.

Regarding the physical meaning of the sparsification rules (41a) and (41b), the parameter δ_e represents a measure of accuracy of the algorithm, and its choice is related to the desired steady-state error, whereas δ_d, regarded as the coherence of the dictionary, is a direct measure of sparsity. In this way, the parameter δ_d represents a tradeoff between the addition of a new sample to the dictionary and the update of the parameters (ω_{i,j,l} for the MKLMS). The choice of δ_d determines the maximum dictionary size, which is guaranteed to be finite for a compact set X [7].

Together with the addition/rejection stage, to provide a truly adaptive structure that operates on nonstationary data, we introduce an elimination step in which previously added samples that no longer contribute to the estimation are eliminated. The elimination is based on the presence of dictionary samples in the current neighborhood of the state space: if the distance between a given dictionary sample and the last received samples remains below a predefined bound, the dictionary sample is retained; otherwise it is considered to be representative of an invalid state space region, or an outlier due to noisy observations, and is eliminated. To assess the presence of a given dictionary sample, consider the Gaussian kernel K_G(·,·) and define the instantaneous presence of the dictionary sample x_d at time i as

$p_i(x_d) = K_G(x_d, x_i).$   (42)

By adopting this definition, the larger p_i, the closer x_d is to the current input sample, and hence to the valid region of operation. To enhance robustness to outliers, we define the presence of x_d as a filtered version of p_i(x_d), given by

$P_i(x_d) = (1 - \rho) P_{i-1}(x_d) + \rho \, p_i(x_d)$   (43)

where ρ ∈ (0, 1] is a parameter controlling the smoothness of P_i(x_d). For a dictionary sample x_d, this quantity can be interpreted as a measure of its contribution to the estimation of the output throughout the adaptation. Therefore, the elimination of samples whose presence P_i is below a predefined threshold will lead to a dictionary that accurately represents the current operating region, and will yield an algorithm that requires fewer operations. With the presence of x_d at time i, P_i(x_d), the elimination rule can now be stated as a dictionary update of the form

$D_i = \{ x \in D_{i-1} : P_i(x) \ge \delta_p \}$   (44)

where δ_p > 0 is a parameter controlling the size of the operating region represented by the algorithm dictionary. One pragmatic approach to composing the dictionary using (41a) and (41b) would be to perform the elimination stage at a lower rate than that of the addition/rejection stage.

A pseudocode implementation of the MKLMS is outlined in Algorithm 1, stating the novelty criterion and its corresponding sample addition and weight update stages, the computation of the presence, and the sample elimination stage.

Algorithm 1: Multikernel LMS
  # Initialization
  Dictionary: D = {x_0}
  Kernel set: K = {k_1, k_2, ..., k_{n_k}}
  Initial weights: ω_{k,0} = μ̂ d_0 (for each kernel)
  for each training pair (x, d) do
    Sample deviation: e_x ← min_{x_j ∈ D} ‖x − x_j‖
    Prediction: d̂ ← μ Σ_{x_j ∈ D} Σ_{k ∈ K} ω_{k,j} k(x, x_j)
    Error: e ← d − d̂
    # Novelty criterion
    if ‖e‖ ≥ δ_e ∧ e_x ≥ δ_d then
      Add new sample: D ← D ∪ {x}
      for all k ∈ K do
        Initialize new weight: ω_{k,x} ← μ̂ d
      end for
    else
      for all k ∈ K, x_j ∈ D do
        Update: ω_{k,x_j} ← ω_{k,x_j} + μ̂ e k(x, x_j) / (ε + k²(x, x_j))
      end for
    end if
    # Presence and elimination
    for all x_j ∈ D do
      Instantaneous presence: p(x_j) ← K_G(x_j, x)
      Presence: P(x_j) ← (1 − ρ) P(x_j) + ρ p(x_j)
    end for
    if perform elimination then
      Eliminate samples: D ← {x_j ∈ D : P(x_j) ≥ δ_p}
    end if
  end for
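A Python sketch of the procedure in Algorithm 1 is given below, combining the prediction (37), the normalized update (39), the novelty criterion (41), and the presence-based elimination (42)-(44); all parameter values, the single-output setting, and the presence initialization are illustrative assumptions.

import numpy as np

class MKLMS:
    def __init__(self, kernels, mu=0.2, mu_hat=0.1, d_e=0.15, d_d=0.1, d_p=0.01,
                 rho=0.1, eps=1e-4):
        self.kernels, self.mu, self.mu_hat = kernels, mu, mu_hat
        self.d_e, self.d_d, self.d_p, self.rho, self.eps = d_e, d_d, d_p, rho, eps
        self.dict, self.omega, self.presence = [], [], []   # x_j, per-kernel weights, P(x_j)

    def predict(self, x):
        # d_hat = mu * sum_j sum_l omega[j][l] * k_l(x, x_j), cf. (37)
        return self.mu * sum(w[l] * k(x, xj)
                             for xj, w in zip(self.dict, self.omega)
                             for l, k in enumerate(self.kernels))

    def update(self, x, d):
        e = d - self.predict(x)
        dist = min((np.linalg.norm(x - xj) for xj in self.dict), default=np.inf)
        if abs(e) >= self.d_e and dist >= self.d_d:           # novelty criterion (41a)-(41b)
            self.dict.append(x)
            self.omega.append(np.full(len(self.kernels), self.mu_hat * d))
            self.presence.append(1.0)                         # initial presence: assumed value
        else:
            for xj, w in zip(self.dict, self.omega):          # normalized update (39)
                for l, k in enumerate(self.kernels):
                    kv = k(x, xj)
                    w[l] += self.mu_hat * e * kv / (self.eps + kv ** 2)
        for j, xj in enumerate(self.dict):                    # presence filtering (42)-(43),
            p = self.kernels[0](x, xj)                        # first kernel assumed Gaussian (K_G)
            self.presence[j] = (1 - self.rho) * self.presence[j] + self.rho * p
        keep = [j for j, P in enumerate(self.presence) if P >= self.d_p]   # elimination (44)
        self.dict = [self.dict[j] for j in keep]
        self.omega = [self.omega[j] for j in keep]
        self.presence = [self.presence[j] for j in keep]
        return e

# Illustrative usage on an assumed scalar series
kg = lambda a, b: np.exp(-0.5 * np.linalg.norm(a - b) ** 2)
kt = lambda a, b: max(2.0 - np.linalg.norm(a - b), 1e-2)
f = MKLMS([kg, kt])
s = np.sin(0.05 * np.arange(500))
for i in range(3, 500):
    f.update(s[i-3:i], s[i])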

B. Convergence Properties of the Multikernel Approach

The existence and uniqueness of solutions for the MKRR [21] are based on the invertibility of the matrix K [(28) and (29)], which is ensured via the selection of both the support and training sets and the regularization factor. On the other hand, the convergence of the MKLMS algorithm is governed by the LMS update stage, because for stationary signals the sparsification criteria result in a fixed support dictionary, and therefore the convergence of the algorithm depends on the finite-size weight vector. In this way, the convergence of the MKLMS (and also the standard KLMS) algorithm collapses into that of the standard nonlinear LMS reviewed in Section II-A, where the regressors are the kernel evaluations. Therefore, convergence properties for the MKLMS stem from the standard (normalized) LMS convergence [4], [23], by considering the transformed samples (or kernel evaluations) as the regressors. Furthermore, note that for stationary data, and therefore for a fixed support dictionary, the MKLMS converges to the nonregularized (r = 0) MKRR, because the relationship between the MKRR and the MKLMS can be thought of as the relationship between the ridge regression algorithm and the standard LMS, with the kernel evaluation samples as regressors.

C. Computational Complexity of Proposed Algorithms

Computation of the weights in ridge regression algorithms involves both matrix multiplications and inversions. As the complexity of the multiplication of an n × m matrix by an m × p matrix is of order O(nmp), and the inversion of an n × n matrix is of order O(n³), based on (28) we summarize the complexity of the weight computation for KRR algorithms in Table I, for S support vectors, T training pairs, and an L-kernel MKRR.

TABLE I: COMPUTATIONAL COMPLEXITY OF WEIGHT COMPUTATION FOR KRR ALGORITHMS

The computational complexity for the optimal weights of an L-kernel MKRR is, therefore, L³ times larger than that of the monokernel algorithm; however, this routine is only performed once and does not affect the algorithm implementation. Regarding the computational complexity of the algorithm implementation, the number of kernel evaluations executed by an L-kernel MKRR algorithm is L times that of the monokernel version, that is, S kernel evaluations for the KRR and LS for the MKRR. Furthermore, for the adaptive MKLMS, as the dictionary size is time varying and is given by the sparsification criteria, the ability of the algorithm to estimate an unknown mapping with fewer support vectors will affect its complexity. The proposed adaptive sparsification criteria, which eliminate samples not contributing to the estimation, such as the one in Section IV-A, reduce the computational complexity and running time.

V. SIMULATION RESULTS

The performance of the proposed multikernel algorithms was assessed against their respective single kernel counterparts, evaluated on the prediction of synthetic nonlinear signals, real-world inertial body sensor signals, and nonstationary wind data of different dynamic properties.

A. Estimation of a Multivariate Body Sensor Signal

We first studied the behavior of the MKRR on the estimation of a real-world body sensor signal. Five accelerometers (placed at the wrists, ankles, and waist) recorded the three Euler angles (Fig. 1), giving a total of 15 signals {θ_s}_{s=1,...,15} taking values in the range [−π, π]. For a more accurate representation, and to avoid discontinuities close to the borders of the signal range, each θ_s was mapped into a pair (sin θ_s, cos θ_s); in this way, the recorded body motion data were analyzed as a 30-D signal. The training set used to learn the evolution of the recorded data was constructed using two regressors, that is, the training pairs were elements of R^60 × R^30.

Fig. 1. Inertial body sensor estimation. Left: fixed coordinate system (red), sensor coordinate system (blue), and Euler angles (green). Right: body sensor placed at the right wrist.

Quadratic and cubic polynomial kernels were fitted to implement two monokernel and one two-kernel MKRR algorithms. The same parameters as in [21] were used, that is, r = 10^{-3}, and the cubic kernel was normalized with η = 30. By randomly choosing both the support vectors and the training samples from the recorded data (the number of training samples was 15 times that of the support vectors), the optimal weights were computed according to (28).

Fig. 2 shows the running time for the computation of the optimal least-squares weights for all three KRR algorithms in MATLAB. As discussed in Section IV-C, although the implementation of the MKRR is more complex than the standard KRR, the actual running time is below 10 ms, which is desired for the learning of a 1200-sample sequence (80 support samples). Fig. 3 shows the averaged training MSE and its standard deviation as a function of the number of support vectors for 100 independent trials (i.e., 100 independent choices of the support samples). The ability of the multikernel approach to replicate nonlinear nonhomogeneous multivariate mappings is highlighted by the lower training error achieved by the MKRR, compared with the KRR algorithms, for the same number of support vectors, and in particular for a lower number of support vectors.

Fig. 2. Average running time and standard deviation for kernel algorithms as a function of the number of support vectors.

Fig. 3. Training MSE and standard deviation for kernel algorithms as a function of the number of support vectors.

B. Prediction of a Nonlinear Signal

The second case study considered the contribution of individual kernels to the global estimation, on the prediction of a well understood nonlinear and chaotic trivariate signal, a discretized version of the Lorenz attractor

$l_{i+1} = \begin{bmatrix} x_{i+1} \\ y_{i+1} \\ z_{i+1} \end{bmatrix} = \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + s \begin{bmatrix} \sigma (y_i - x_i) \\ x_i (\rho - z_i) - y_i \\ x_i y_i - \beta z_i \end{bmatrix}$   (45)

where s is the discretization step, σ, ρ, and β are real-valued parameters, and x_i, y_i, z_i are the components of l_i.

The MKLMS and two versions of the KLMS with different kernels were assessed on the one-step ahead prediction of the trivariate signal in (45), considering five regressors for each signal, that is, the task was to predict l_i using l_{i−5:i−1}. We implemented the KLMS algorithms with triangular and Gaussian kernels, and calculated the kernel parameters via exhaustive search minimization of the predicted MSE for a predefined training set. The resulting parameters were a = 0.0125 and b = 0.18 for the Gaussian and triangular kernels, respectively [see (5a) and (5b)]. The MKLMS structure comprised a triangular and a Gaussian kernel with the same parameters. All three algorithms were implemented using the coherence sparsification criteria with parameters δ_e = 0.15, δ_d = 1 [see (41a) and (41b)], and learning rates μ = 0.3, μ̂ = 0.5, while the system parameters were s = 0.01, σ = 10, β = 8/3, and ρ = 28.
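A short NumPy sketch that generates the discretized Lorenz signal (45) with the parameter values quoted above is given below; the initial condition and the number of samples are illustrative assumptions.

import numpy as np

def lorenz_series(n, s=0.01, sigma=10.0, rho=28.0, beta=8.0/3.0, l0=(1.0, 1.0, 1.0)):
    """Discretized Lorenz attractor, cf. (45): l_{i+1} = l_i + s * f(l_i)."""
    L = np.zeros((n, 3))
    L[0] = l0
    for i in range(n - 1):
        x, y, z = L[i]
        L[i + 1] = L[i] + s * np.array([sigma * (y - x),
                                        x * (rho - z) - y,
                                        x * y - beta * z])
    return L

signal = lorenz_series(3000)          # 3000-sample trivariate realization
regressor = signal[0:5].flatten()     # l_{i-5:i-1} used to predict l_i, as in the text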

Fig. 4 shows the original chaotic signal and the kernel estimates for a 1500-sample realization. We employed the norm of the columns of Ω_{i,j} as a measure of the contribution of each kernel. Fig. 5 shows that the kernelwise contributions in both the multikernel and single kernel cases were similar, illustrating that, as desired, different kernels within the MKLMS account for different nonlinear features of the data and can be associated with physical meaning. Fig. 6 shows the averaged prediction MSE and the size of the dictionary (in number of samples) for each of the algorithms considered, over 30 independent 3000-sample trials. Observe that, as desired, the MKLMS provided more accurate estimates in terms of the averaged prediction MSE, while using fewer samples than the existing KLMS algorithms. Notice that in the transient period, the estimation error of the multikernel algorithm is greater than that of the monokernel ones, because the MKLMS adjusts more parameters than the monokernel algorithms and therefore needs more training data.

Fig. 4. Trivariate original signal and KLMS estimates.

Fig. 5. Time-varying magnitude of the weight matrix for all three implemented kernel algorithms.

Fig. 6. Averaged MSE and dictionary size (in number of samples) over 30 trials for the kernel algorithms.

The steady-state prediction performance of the kernel algorithms was assessed using the prediction gain

$R_p = 10 \log_{10} \frac{\sum_{i=T_0}^{T} \|l_i\|^2}{\sum_{i=T_0}^{T} \|l_i - \hat{l}_i\|^2}$   (46)

evaluated at the steady state (i.e., T_0 = 1500 and T = 3000) for each of the realizations. The average prediction gains over 30 realizations, as well as the running times, are given in Table II. The multikernel approach provided the most accurate prediction at a lower computational complexity than the sum of its subkernels, because of the smaller support dictionary (Fig. 6, bottom).

TABLE II: AVERAGED PREDICTION GAINS AND RUNNING TIMES FOR THE GAUSSIAN KLMS, TRIANGULAR KLMS, AND MKLMS
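The prediction gain (46) translates directly into code; a minimal NumPy sketch is shown below, with the steady-state window passed in by the caller.

import numpy as np

def prediction_gain(l, l_hat, T0, T):
    """R_p = 10 log10( sum ||l_i||^2 / sum ||l_i - l_hat_i||^2 ) over the steady-state window, cf. (46)."""
    num = np.sum(np.linalg.norm(l[T0:T], axis=1) ** 2)
    den = np.sum(np.linalg.norm(l[T0:T] - l_hat[T0:T], axis=1) ** 2)
    return 10.0 * np.log10(num / den)

# e.g., prediction_gain(signal, estimates, T0=1500, T=3000) for the steady-state window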

C. Adaptive Prediction of Nonstationary Wind Data

The benefits of the proposed MKLMS algorithm and adaptive sparsification criteria are next illustrated on real-world nonstationary bivariate wind signals. The bivariate measurements correspond to the wind speed in the north–south (V_N) and east–west (V_E) directions, that is,

$V = [V_N, V_E]^T.$   (47)

The wind speed readings were taken³ with a 2-D ultrasonic anemometer with a sampling rate of 50 Hz, over three different wind regimes: low, medium, and high dynamics regions. The KLMS and MKLMS algorithms were evaluated for the 10-step ahead prediction of the wind speed, using 20 regressors for each signal component.

³The data are publicly available at http://www.commsp.ee.ic.ac.uk/~mandic/research/wind.htm

The physical intuition of a multikernel algorithm in this context is illuminated by considering the empirical distribution of the deviations of the input samples (i.e., regressors), evaluated through the set

$R_{\mathrm{dyn}} = \left\{ \|a - b\|_2,\ a, b \in C_{\mathrm{dyn}} \subset \mathbb{R}^{20 \times 2} \right\}$   (48)

where the symbol C_dyn corresponds to a 1000-sample set in the low, medium, or high dynamics region. Fig. 7 shows the histograms for the sets R_dyn and suggests that different kernels should be used for different wind dynamics regions. Furthermore, as the deviation sets corresponding to different dynamics regimes overlap, a kernel designed to fit a particular set may also be useful in the estimation of data from other regions.
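The deviation sets in (48) amount to pairwise distances between regressor matrices; a minimal NumPy sketch is given below, where the 1000-sample regime sets are assumed to be supplied by the caller and the Frobenius norm is used as the matrix deviation (an assumption).

import numpy as np

def deviation_set(regressors):
    """R_dyn in (48): pairwise distances between 20x2 regressor matrices (Frobenius norm assumed)."""
    R = []
    for i in range(len(regressors)):
        for j in range(i + 1, len(regressors)):
            R.append(np.linalg.norm(regressors[i] - regressors[j]))
    return np.array(R)

# e.g., np.histogram(deviation_set(low_regime_regressors), bins=50) gives the kind of
# empirical distribution compared across regimes in Fig. 7 (regime data assumed).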

We compared the MKLMS with three standard Gaussian KLMS structures whose kernel parameters were fitted (via combined exhaustive search and genetic algorithm minimization of the prediction MSE) to the low, medium, and high dynamics training sets; the parameters found for the wind regimes were, respectively, a_L = 9.5, a_M = 0.75, and a_H = 0.109. We refer to such kernels [see (5b)] as the low-fitted (LF), medium-fitted (MF), and high-fitted (HF) kernels. The MKLMS structure comprised the same three kernels. To model the nonstationary nature of wind, all four KLMS algorithms were equipped with the adaptive sparsification criteria presented in Section IV-A, with parameters δ_e = 0.15, δ_d = 0.1, δ_p = 0.01, and μ = 0.2, μ̂ = 0.1.

Fig. 7. Input sample deviation for the low, medium, and high dynamics wind regimes.

Fig. 8 shows the performance of a 400-sample implementation of the kernel algorithms for the low dynamics wind. Observe that the MKLMS provided an accurate prediction of the signal even when compared with the specifically designed LF KLMS, confirming the benefits of a combination of kernels. To further highlight the physical meaning associated with the proposed multikernel approach, the kernel algorithms were used to predict mixed-regime wind data whose segments [1, 500], [501, 1000], and [1001, 1500] corresponded, respectively, to the low, medium, and high dynamics regimes. Fig. 9 shows the magnitudes of the weights associated with each kernel within the MKLMS: the LF kernel (blue) was active in the first third of the data duration, the MF and HF kernels were active for the segment [501, 1000], and the HF kernel in the last third of the data. This illuminates the ability of the MKLMS algorithm to assign more confidence (weights) to the kernel corresponding to the particular dynamics region. Notice also that the HF kernel was prominent for the second segment (samples [501, 1000]) because of the overlap of the sample deviations for the medium and high dynamics regions (Fig. 7).

Fig. 8. Original bivariate wind signal and its KLMS estimates for the low dynamics region.

Fig. 9. Weight magnitudes for the kernels within the MKLMS algorithm, evaluated for mixed-regime wind, using the proposed adaptive sparsification criteria.

Fig. 10 shows the efficiency of the MKLMS for the prediction of mixed-regime wind data in terms of the dictionary size for all the kernel algorithms considered. Observe that the proposed adaptive sparsification criterion allows the dictionary to have a bounded number of samples throughout different dynamics regions, by rejecting samples that are not in use, thus avoiding kernel evaluations that do not improve the estimate. This desired feature of the adaptive sparsification criteria is present in all four implemented kernel algorithms.

Fig. 10. Dictionary size for all kernel algorithms for the prediction of mixed-regime wind using the proposed adaptive sparsification criteria.

TABLE III: AVERAGED PREDICTION GAINS FOR THE LF KLMS, MF KLMS, HF KLMS, AND MKLMS FOR ALL THREE DYNAMIC REGIONS

Fig. 11. Averaged MSE and dictionary size over 30 trials of combined low, medium, and high dynamics regions.

The MSE performance of the MKLMS on the prediction of wind speed was considered over 10 nonoverlapping segments of 470 samples for each regime (low, medium, and high dynamics, 30 segments in total). Table III presents the averaged prediction gain R_p [T_0 = 235 and T = 470; see (46)] calculated for all four kernel algorithms. Observe that, among the single kernel KLMS algorithms, each algorithm provided the most accurate estimates in the specific region used for its training, whereas the MKLMS performed better than any region-fitted KLMS, for all three dynamics regions.


TABLE IV: AVERAGED RUNNING TIME FOR THE LF KLMS, MF KLMS, HF KLMS, AND MKLMS FOR ALL THREE DYNAMIC REGIONS

The averaged performance of all the algorithms considered was evaluated over all 30 segments (10 segments of each dynamics regime). The bottom row in Table III shows that the MKLMS had the highest averaged prediction gain, and Fig. 11 shows that, as desired, the MKLMS exhibited the lowest MSE and the smallest dictionary size. This is also reflected in the running times shown in Table IV, where the complexity of the MKLMS is bounded by the sum of the KLMS complexities. These results confirm the analysis, highlighting the power of the MKLMS in real-world prediction applications in nonstationary environments, where there is no a priori information about the instantaneous wind dynamics. Note that, as in the previous simulation in Fig. 6, the convergence of the MKLMS was slower in the transient regime because of the larger number of parameters to update.

VI. CONCLUSION

We have introduced a multikernel extension of the KLMS algorithm, termed the MKLMS. This has been achieved using a time-varying mapping from the input space to a vector-valued Hilbert space of feature samples, and by equipping the algorithm with an adaptive sparsification criterion. The rigorous analysis of the reproducing properties of such a space has shown that it also serves as a feature space for the existing MKRR algorithm, and that the proposed adaptive structure has the ability to deal with nonstationary data while maintaining a small dictionary size. The advantages of the proposed MKLMS have been illuminated on generic problems for the prediction of nonlinear multivariate real-world data, highlighting the ability of multikernel regression to represent multivariate mappings in a physically meaningful way. Simulations on nonstationary bivariate wind data of different dynamic regimes have further illustrated the benefits of the consideration of multiple mappings, as compared with the standard KLMS, in terms of the estimation error, dictionary size, and computational complexity.

ACKNOWLEDGMENT

The authors would like to thank P. Wu from Princeton University for his useful and constructive suggestions.

REFERENCES

[1] D. P. Mandic and S. L. Goh, Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. New York, NY, USA: Wiley, 2009.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.

ACKNOWLEDGMENT

The authors would like to thank P. Wu from Princeton University for his useful and constructive suggestions.

REFERENCES

[1] D. P. Mandic and S. L. Goh, Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. New York, NY, USA: Wiley, 2009.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[3] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.
[4] B. Widrow and M. E. Hoff, "Adaptive switching circuits," IRE WESCON Conv. Rec., vol. 4, no. 1, pp. 96–104, 1960.
[5] W. Liu, P. P. Pokharel, and J. C. Principe, "The kernel least-mean-square algorithm," IEEE Trans. Signal Process., vol. 56, no. 2, pp. 543–554, Feb. 2008.
[6] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004.
[7] C. Richard, J. C. M. Bermudez, and P. Honeine, "Online prediction of time series data with kernels," IEEE Trans. Signal Process., vol. 57, no. 3, pp. 1058–1067, Mar. 2009.
[8] E. Santana, J. C. Principe, E. E. Santana, R. C. S. Freire, and A. K. Barros, "Extraction of signals with specific temporal structure using kernel methods," IEEE Trans. Signal Process., vol. 58, no. 10, pp. 5142–5150, Oct. 2010.
[9] A. Kuh and D. P. Mandic, "Applications of complex augmented kernels to wind profile prediction," in Proc. IEEE ICASSP, Apr. 2009, pp. 3581–3584.
[10] F. Tobar, A. Kuh, and D. Mandic, "A novel augmented complex valued kernel LMS," in Proc. 7th IEEE Sensor Array Multichannel Signal Process. Workshop, Jun. 2012, pp. 481–484.
[11] G. D. Nicolao and G. Pillonetto, "A new kernel-based approach for linear system identification," Automatica, vol. 56, no. 12, pp. 2825–2840, 2010.
[12] P. Honeine, C. Richard, and J. Bermudez, "On-line nonlinear sparse approximation of functions," in Proc. IEEE ISIT, Jun. 2007, pp. 956–960.
[13] F. A. Tobar, L. Yacher, R. Paredes, and M. E. Orchard, "Anomaly detection in power generation plants using similarity-based modeling and multivariate analysis," in Proc. ACC, 2011, pp. 1940–1945.
[14] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learn., 2004, pp. 1–6.
[15] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," J. Mach. Learn. Res., vol. 12, pp. 2211–2268, Jul. 2011.
[16] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning," J. Mach. Learn. Res., vol. 7, pp. 1531–1565, Dec. 2006.
[17] C.-V. Nguyen and D. B. H. Tay, "Regression using multikernel and semiparametric support vector algorithms," IEEE Signal Process. Lett., vol. 15, no. 1, pp. 481–484, May 2008.
[18] J. Xie, "Time series prediction based on recurrent LS-SVM with mixed kernel," in Proc. APCIP, vol. 1, 2009, pp. 113–116.
[19] M. Yukawa, "Nonlinear adaptive filtering techniques with multiple kernels," in Proc. Eur. Signal Process. Conf., 2011, pp. 136–140.
[20] C.-Y. Yeh, C.-W. Huang, and S.-J. Lee, "A multiple-kernel support vector regression approach for stock market price forecasting," Expert Syst. Appl., vol. 38, no. 3, pp. 2177–2186, 2011.
[21] F. A. Tobar and D. P. Mandic, "Multikernel least squares estimation," in Proc. Signal Processing Defence Conf., 2012, pp. 1–5.
[22] J. Platt, "A resource-allocating network for function interpolation," Neural Comput., vol. 3, no. 2, pp. 213–225, 1991.
[23] S. Haykin, Adaptive Filter Theory, 4th ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2001.
[24] A. H. Sayed, Fundamentals of Adaptive Filtering. New York, NY, USA: Wiley, 2003.
[25] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures, and Stability. New York, NY, USA: Wiley, 2001.
[26] N. Aronszajn, "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, no. 3, pp. 337–404, 1950.
[27] G. L. Prajapati and A. Patle, "On performing classification using SVM with radial basis and polynomial kernel functions," in Proc. 3rd ICETET, 2010, pp. 512–515.
[28] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Phil. Trans. R. Soc. London (A), vol. 209, pp. 415–446, Jan. 1909.
[29] A. Aizerman, E. M. Braverman, and L. I. Rozoner, "Theoretical foundations of the potential function method in pattern recognition learning," Autom. Remote Control, vol. 25, pp. 821–837, 1964.
[30] S.-Y. Kung, "Kernel approaches to unsupervised and supervised machine learning," in Advanced Multimedia Information Processing. New York, NY, USA: Springer-Verlag, 2009, pp. 1–32.


[31] W. Liu, J. C. Principe, and S. Haykin, Kernel Adaptive Filtering: A Comprehensive Introduction. New York, NY, USA: Wiley, 2010.
[32] P. Bouboulis, S. Theodoridis, and M. Mavroforakis, "The augmented complex kernel LMS," IEEE Trans. Signal Process., vol. 60, no. 9, pp. 4962–4967, Sep. 2012.
[33] F. A. Tobar and D. P. Mandic, "The quaternion kernel least squares," in Proc. IEEE ICASSP, May 2013, pp. 6128–6132.
[34] H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 155–161.
[35] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in Proc. 15th ICML, 1998, pp. 1–7.

Felipe A. Tobar received the B.Sc. and M.Sc. degrees in electrical and electronic engineering from Universidad de Chile, Santiago, Chile, in 2008 and 2010, respectively. He is currently pursuing the Ph.D. degree in adaptive signal processing at Imperial College London, London, U.K. His current research interests include kernel regression, enhanced-dimensionality reproducing kernel Hilbert spaces, nonlinear filtering, and Monte Carlo methods.

Sun-Yuan Kung is a Professor with the Department of Electrical Engineering, Princeton University, Princeton, NJ, USA. He has authored or coauthored more than 500 technical publications and numerous textbooks, including VLSI and Modern Signal Processing (Prentice-Hall, 1985), VLSI Array Processors (Prentice-Hall, 1988), Digital Neural Networks (Prentice-Hall, 1993), Principal Component Neural Networks (Wiley, 1996), Biometric Authentication: A Machine Learning Approach (Prentice-Hall, 2004), and Kernel Methods and Machine Learning (Cambridge University Press, 2013). His current research interests include VLSI array processors, system modeling and identification, neural networks, wireless communication, sensor array processing, multimedia signal processing, bioinformatic data mining, and biometric authentication. Dr. Kung served as a member of the Board of Governors of the IEEE Signal Processing Society from 1989 to 1991. Since 1990, he has been the Editor-in-Chief of the Journal of VLSI Signal Processing Systems. He was a recipient of the IEEE Signal Processing Society Technical Achievement Award in 1992 for his contributions to parallel processing and neural network algorithms for signal processing, the IEEE Signal Processing Society Best Paper Award in 1996 for his publication on principal component neural networks, and the IEEE Third Millennium Medal in 2000, and was a Distinguished Lecturer of the IEEE Signal Processing Society in 1994.


Danilo P. Mandic is a Professor in signal processing with Imperial College London, London, U.K., where he has been working in the area of nonlinear adaptive signal processing and nonlinear dynamics. He has been a Guest Professor with Katholieke Universiteit Leuven, Leuven, Belgium, the Tokyo University of Agriculture and Technology, Tokyo, Japan, and Westminster University, London, and a Frontier Researcher with RIKEN, Tokyo. His publications include two research monographs titled Recurrent Neural Networks for Prediction (Wiley, 2001) and Complex-Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models (Wiley, 2009), an edited book titled Signal Processing for Information Fusion (Springer, 2008), and more than 200 publications on signal and image processing. He has been a member of the IEEE Technical Committees on Signal Processing Theory and Methods, and Machine Learning for Signal Processing, and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON NEURAL NETWORKS, and the IEEE SIGNAL PROCESSING MAGAZINE. He has produced award-winning papers and products resulting from his collaboration with industry. He is a member of the London Mathematical Society.
