

Model-Based Online Learning With Kernels

Guoqi Li, Member, IEEE, Changyun Wen, Fellow, IEEE, Zheng Guo Li, Senior Member, IEEE, Aimin Zhang, Feng Yang, Member, IEEE, and Kezhi Mao, Member, IEEE

Abstract— New optimization models and algorithms for online learning with Kernels (OLK) in classification, regression, and novelty detection are proposed in a reproducing Kernel Hilbert space. Unlike the stochastic gradient descent algorithm called the naive online Reg minimization algorithm (NORMA), the OLK algorithms are obtained by solving a constrained optimization problem based on the proposed models. By exploiting the techniques of the Lagrange dual problem, as in Vapnik's support vector machine (SVM), the solution of the optimization problem can be obtained iteratively, and the iteration process is similar to that of NORMA. This further strengthens the foundation of OLK and enriches the research area of SVM. We also apply the obtained OLK algorithms to problems in classification, regression, and novelty detection, including real-time background subtraction, to show their effectiveness. The experimental results in both classification and regression illustrate that the accuracy of the OLK algorithms is comparable with that of traditional SVM-based algorithms, such as SVM and least squares SVM (LS-SVM), and with state-of-the-art algorithms, such as the Kernel recursive least squares (KRLS) method and the projectron method, while it is slightly higher than that of NORMA. On the other hand, the computational cost of the OLK algorithms is comparable with or slightly lower than that of existing online methods, such as the above-mentioned NORMA, KRLS, and projectron methods, but much lower than that of SVM-based algorithms. In addition, different from SVM and LS-SVM, it is possible for the OLK algorithms to be applied to non-stationary problems. Also, the applicability of OLK in novelty detection is illustrated by simulation results.

Index Terms— Classification, Kernels, novelty detection, online learning, regression, reproducing Kernel Hilbert space.

I. INTRODUCTION

KERNEL-BASED algorithms have been successfully applied in many batch settings [1]–[3], and there have been a lot of schemes to derive their online counterparts, as shown in [4]–[9]. Here, the term "online" means that the learning process can be updated one-by-one, without re-training on all of the learning data, when a new learning data point becomes available. Many researchers have developed different kinds of online learning algorithms, resulting in significant advances both theoretically and practically; see, for example, the perceptron method [10], [11], the Kernel perceptron method [12], the forgetron method [13], sub-gradient methods [14], the Kernel recursive least squares (KRLS) method [15], the limited stochastic meta-descent method [16], and the projectron method [17]. All of these methods are restricted to classification or regression.

One popular online learning algorithm with Kernels, proposed by Kivinen et al. [4], [18] in a reproducing Kernel Hilbert space (RKHS), deals with classification, regression, and novelty detection [19] in a unified framework. Such an algorithm is named the naive online Reg minimization algorithm (NORMA). The general idea is to make use of classical stochastic gradient descent, which performs gradient descent with respect to a defined instantaneous risk function. As discussed in [4], [20], and [21], in order to determine the gradient in a closed form and to update the learning process more easily, an explicit update is used to substitute the implicit update approximately. This results in the familiar stochastic gradient descent update. The advantage of NORMA is that its loss bound can be derived exactly.

In our point of view, one less satisfactory point of NORMA is that it is not based on solving an optimization model. Although an objective cost function is given, the constraint conditions are not clearly presented. Note that when minimizing an objective cost function under constraint conditions, algorithms can be derived in other ways than the gradient descent method alone, and by doing this the performance of an online learning algorithm can be improved.

In this paper, we propose new optimization models and derive new algorithms for online learning with Kernels (OLK) in classification (OLK_C), regression (OLK_R), and novelty detection (OLK_N) in an RKHS. We obtain an iteration process similar to that of NORMA. Our approach further strengthens the foundation of online learning with Kernels. In OLK, analogous to Vapnik's support vector machine (SVM) [22], the optimal solution of the optimization problem is achieved by exploiting the techniques of the Lagrange dual problem. However, a major difference in the learning process is that the data points are processed one by one, so that the Lagrange dual problem is solved simply by maximizing a quadratic function on a given interval. As such, our learning algorithm can be easily realized by a simple update process with a low computational cost. We also note that the well-known sequential minimal optimization (SMO) proposed in [23] is a tool for training the traditional SVM by dividing a large quadratic programming (QP) problem into a



series of smaller QP problems, which results in much faster computation. However, SMO is still not an online algorithm, as all the data points still need to be used for retraining when a newly arrived data point becomes available.

To illustrate the performance of our proposed algorithms, a few examples in classification, regression, and novelty detection are tested. The proposed OLK is compared with SVM-based methods and several state-of-the-art online learning algorithms. It is shown that the accuracy of the OLK algorithms is comparable with that of traditional SVM-based algorithms, such as SVM and LS-SVM, in both classification and regression, and with the projectron method (in classification) and KRLS (in regression), while it is slightly higher than that of NORMA. On the other hand, the computational cost of the OLK algorithms is comparable with or slightly lower than that of the NORMA, KRLS, and projectron methods, but much lower than that of SVM-based algorithms. In addition, the OLK algorithms can be applied to non-stationary regression problems [24]. For novelty detection, the algorithm is applied to real-time background subtraction [21], [25], and both quantitative and qualitative comparisons are made with NORMA. The experimental results clearly reveal that the proposed OLK is applicable in this field.

The remaining part of this paper is organized as follows. In Section II, some background knowledge is presented. The models of OLK in an RKHS and the resulting algorithms are then presented in Section III. Simulation examples are given in Section IV to show the performance of the proposed algorithms, with comparisons to other similar schemes. Finally, this paper is concluded in Section V.

II. BACKGROUND KNOWLEDGE

Preliminaries used to model the problem of online learning with Kernels in an RKHS are presented in this section. We first review the theory of RKHS and the Lagrange dual problem. Then we show the model of the standard SVM in an RKHS by exploiting the Lagrange dual problem technique.

A. Basic Properties of RKHS

We briefly review the relevant techniques in RKHS, which are useful in our approach. Let X be a set and H be a Hilbert space of functions on X. We say that H is an RKHS if for every x in X there exists a unique Kernel k(x, .) of H with the property that f(x) = <f, k(x, .)>_H for all f ∈ H, where <., .>_H denotes the dot product in H. Note that an RKHS H has the following properties [1], [4], [26], [27].
1) k(x, y) = <k(x, .), k(., y)>_H, where x, y ∈ X.
2) k(x, y) has the reproducing property that <f, k(x, .)>_H = f(x), where x ∈ X and f ∈ H.
3) H is the closure of the span of all k(x, .) with x ∈ X. In other words, all f ∈ H are linear combinations of Kernel functions.
4) The inner product induces a norm of f in H, i.e., ||f||_H^2 = <f, f>_H.
5) Let φ(.) = {φ_i}_{i=1}^∞ be an orthonormal sequence such that the closure of its span is equal to H. Then k(x, y) = Σ_{k=1}^∞ φ_k(x)φ_k(y) = φ(x)^T φ(y), where φ(.) ('.' can be either x or y) denotes a map taking x from the input space into a feature space.
6) For f ∈ H, we have d<f(.), k(x, .)>_H / df = k(x, .) and d<f, f>_H / df = 2f [4].
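Property 3 is what all of the algorithms in this paper rely on: every hypothesis is a finite Kernel expansion f(.) = Σ_i α_i k(x_i, .). As a minimal illustration (our own NumPy sketch, not code from the paper; the names gauss_kernel and eval_f are ours), the Gaussian Kernel used later in Section IV satisfies k(x, x) = 1, and evaluating f(x) is just a weighted sum of Kernel values:

```python
import numpy as np

def gauss_kernel(x, y, p=1.0):
    """Gaussian Kernel k(x, y) = exp(-||x - y||^2 / p^2); note k(x, x) = 1."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(d, d) / p**2)

def eval_f(x, centers, alphas, p=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x), a linear combination of
    Kernel functions as in property 3."""
    return sum(a * gauss_kernel(c, x, p) for c, a in zip(centers, alphas))

# Toy usage: a function supported on two centers.
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
alphas = [0.5, -0.2]
print(eval_f(np.array([0.5, 0.5]), centers, alphas))
```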

B. Lagrange Dual Problem

Background knowledge of the Lagrange dual problem plays an important role in modeling online learning with Kernels. Consider the following primal constrained optimization problem:

min_x F(x)
s.t. c_i(x) ≤ 0, x ∈ R^n, i = 1, . . . , l    (1)

where F(x) is an objective cost function that is minimized under the l constraints c_i(x) ≤ 0 for i = 1, . . . , l. The Lagrange function of (1) is

L(x, α) = F(x) + Σ_{i=1}^l α_i c_i(x)    (2)

where α = {α_i}_{i=1}^l = (α_1, . . . , α_l)^T are non-negative Lagrange multipliers. Note that {.} denotes a column vector in this paper. Optimizing (1) is actually minimizing the maximum (the worst case) of the Lagrange function, that is,

min_x max_α {L(x, α)}
s.t. α_i ≥ 0, i = 1, . . . , l.    (3)

Note that the Lagrange dual problem of (1) or (3) is to maximize the minimum of its Lagrange function [28]

max_α min_x {L(x, α)}
s.t. α_i ≥ 0, i = 1, . . . , l.    (4)

Lemma 2.1 [29]: If F(x) and c_i(x), for i = 1, . . . , l, are all continuous and differentiable convex functions, then the optimization problems (3) and (4) are equivalent, i.e., their optimal solutions are the same.

Our original aim is to minimize the objective function in (1) under some constraints. Though the optimization problem in (1) is convex, it is difficult to obtain its solution directly. So (1) is converted to (3) and then further converted to (4) to make the optimal solution easier to search for. In this case, the above-mentioned Lagrange dual problem is called the Wolfe dual problem. Actually, the standard SVM is based on the theory of the Lagrange dual problem, as shown in the next section.
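As a one-line illustration of Lemma 2.1 (our example, not from the paper): for min_x x^2 s.t. 1 − x ≤ 0, the inner minimization in (4) gives the dual function g(α) = min_x {x^2 + α(1 − x)} = α − α^2/4, which is maximized at α = 2 with g(2) = 1, matching the primal optimum x = 1, F(x) = 1.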

C. SVM in RKHS

Given the training data (x_1, y_1), . . . , (x_t, y_t), . . . , (x_l, y_l), where x_t ∈ X and y_i ∈ Y = {−1, 1}, consider the following problem:

min_{f,ξ} { (1/2)||f||_H^2 + C Σ_{i=1}^l ξ_i }
s.t. y_i f(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , l    (5)

where the ξ_i are the slack variables corresponding to the classification errors. The standard SVM was initially proposed by Vapnik in [22] for solving the problem in (5). Basically, SVM tries to find an f in H that compromises between the training error Σ_{i=1}^l ξ_i and the functional complexity (1/2)||f||_H^2; C is a parameter chosen to realize such a regularization. In an RKHS H, f is an element or a vector denoted as f ∈ H. Strictly speaking, the dimension of the vector f in an RKHS can be infinite, but SVM searches for the optimal solution of (5) in the subspace spanned by the basis functions constructed by the training data points. In this case, we have f = w^T φ(.) + b with ||f||_H^2 = w^T w, where φ(.) is a map determined by a chosen Mercer Kernel k(x, y), and w and b are parameters to be determined based on the solutions of the following problem, which is the primal problem of SVM in [22] converted from (5):

min_{w,b,ξ} { (1/2) w^T w + C Σ_{i=1}^l ξ_i }
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , l.    (6)

The above optimization problem is convex as the Kernel is chosen to be a Mercer Kernel. From the previous section, instead of solving (6) directly, the standard SVM solves its Wolfe dual problem as follows:

max_α min_{w,b,ξ} {L(w, b, ξ, α, β)}
s.t. α_i ≥ 0, β_i ≥ 0, i = 1, . . . , l    (7)

where α = (α_1, . . . , α_l)^T and β = (β_1, . . . , β_l)^T are the Lagrange multipliers corresponding to the constraints in (6), and the Lagrange function L(w, b, ξ, α, β) is given by

L(w, b, ξ, α, β) = (1/2) w^T w + C Σ_{i=1}^l ξ_i + Σ_{i=1}^l α_i (1 − ξ_i − y_i (w^T φ(x_i) + b)) − Σ_{i=1}^l β_i ξ_i.    (8)

Note that L(w, b, ξ, α, β) attains its minimum with respect to w, b, and ξ if and only if the following conditions are satisfied [22]:

∇_b L(w, b, ξ, α, β) = 0
∇_w L(w, b, ξ, α, β) = 0
∇_ξ L(w, b, ξ, α, β) = 0.    (9)

These conditions yield

Σ_{i=1}^l y_i α_i = 0
w = Σ_{i=1}^l α_i y_i φ(x_i)
C − α_i − β_i = 0.    (10)

Actually, the conditions C − α_i − β_i = 0, α_i ≥ 0, and β_i ≥ 0 imply that 0 ≤ α_i ≤ C for all i. Substituting (10) into (7) and maximizing the Lagrange function with respect to α, we obtain the Wolfe dual problem of (7)

max_α { −(1/2) α^T Q α + Σ_{i=1}^l α_i }
s.t. 0 ≤ α_i ≤ C, Σ_{i=1}^l y_i α_i = 0    (11)

which is equivalent to

min_α { (1/2) α^T Q α − Σ_{i=1}^l α_i }
s.t. 0 ≤ α_i ≤ C, Σ_{i=1}^l y_i α_i = 0    (12)

where Q_ij = y_i y_j k(x_i, x_j). Note that Q can still be a positive definite matrix since the Kernel is chosen as a Mercer Kernel, which ensures that the above optimization remains convex. The strategic technique of SVM is to map the input learning data into a higher dimensional feature space, which may be of infinite dimension corresponding to the chosen Kernel, and to construct a decision function

h(x) = sign(f(x)) = sign( Σ_{i=1}^l α_i y_i k(x, x_i) + b )    (13)

after obtaining the solutions of α and b. As seen in (13), f(x) is represented by a linear combination of Kernel functions in H that separates the data set with the maximum margin.
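For concreteness, the box-constrained QP (12) can be solved by any general-purpose convex solver. The sketch below is our illustration only (it uses scipy.optimize rather than the SMO solver the paper employs in Section IV; the name svm_dual is ours): it builds Q from the Gaussian Kernel, solves (12), and recovers b from a free support vector via the KKT condition y_i f(x_i) = 1.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y, C=1.0, p=1.0):
    """Solve the dual (12): min_a 0.5 a^T Q a - sum(a),
    s.t. 0 <= a_i <= C and sum_i y_i a_i = 0, with Q_ij = y_i y_j k(x_i, x_j)."""
    y = np.asarray(y, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / p ** 2)                     # Gaussian Kernel matrix
    Q = (y[:, None] * y[None, :]) * K
    l = len(y)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   np.zeros(l),
                   jac=lambda a: Q @ a - 1.0,
                   method="SLSQP",
                   bounds=[(0.0, C)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y,
                                 "jac": lambda a: y}])
    alpha = res.x
    # Recover b from a free support vector (0 < alpha_i < C); index 0 is a
    # crude fallback if none is strictly free.
    i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
    b = y[i] - (alpha * y) @ K[:, i]
    return alpha, b          # decision function h(x) as in (13)
```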

III. MODELING OLK IN RKHS

In this section, we propose and derive a new model of OLK in an RKHS H based on the Lagrange dual problem theory. The proposed algorithms for classification, regression, and novelty detection are named OLK_C, OLK_R, and OLK_N, respectively.

A. Modeling OLK for Classification (OLK_C)

Assume that the current learning data for classification are (x_1, y_1), . . . , (x_i, y_i), . . . , (x_t, y_t), where x_i ∈ X and y_i ∈ Y = {−1, 1}, and that the current decision function is h_t = sign(f_t). In online learning, we wish to update the function f_{t+1} based on f_t when the next data point (x_{t+1}, y_{t+1}) becomes available. The learning algorithm starts from one learning data point (x_1, y_1); in the beginning, f_0 is set as f_0 = 0. Note that f_t and f_{t+1} can be written as follows, without b, in H [30]:

f_t(.) = Σ_{i=1}^t α*_i k(x_i, .)
f_{t+1}(.) = Σ_{i=1}^t α̃*_i k(x_i, .) + α̃*_{t+1} k(x_{t+1}, .).    (14)

Note that k(., .) is a chosen Kernel function such that k(x, y) = 1 when x = y. This is reasonable and realizable after normalization. Then, the objective is to update {α̃*_i}_{i=1}^{t+1} from {α*_i}_{i=1}^t based on a proper algorithm to be derived. For simplicity, we let f denote the f_{t+1} that needs to be determined


at time t + 1. Similar to the key idea of SVM in compromising between the classification error and the complexity of f, our proposed model updating f from f_t can be formulated as follows:

min_{f,ξ_{t+1}} { (1/2)||f − f_t||_H^2 + (r/2)||f||_H^2 + C ξ_{t+1} }
s.t. y_{t+1} f(x_{t+1}) ≥ 1 − ξ_{t+1}, ξ_{t+1} ≥ 0    (15)

where (1/2)||f − f_t||_H^2 measures the distance of a predicted f from the previous f_t, the term ||f||_H^2 controls the complexity of the predicted f, ξ_{t+1} is the slack variable denoting a possible error for the newly arrived data (x_{t+1}, y_{t+1}) after f is determined, and r and C are parameters reflecting the weights compromising between complexity and the classification error. Actually, r also plays the role of a forgetting factor. This can be seen more clearly in (33), (52), and (67) later, when deriving OLK_C, OLK_R, and OLK_N, where we have α̃*_i = α*_i/(1 + r) for i = 1, . . . , t. Due to the forgetting factor, controlling the number of support vectors becomes possible, as discussed in [13]. Also, such a forgetting factor may make it possible for OLK to deal with non-stationary or time-varying problems [24], but its qualitative and quantitative performance in this case needs to be further investigated.

Remark 1: Note that f(.) = w^T φ(.), f_t(x) = w_t^T φ(x), and f_{t+1}(x) = w_{t+1}^T φ(x) in an RKHS. This implies that there is a one-to-one mapping between a vector w and f(.). More specifically, in (14) we actually project f_t(x) and f_{t+1}(x) onto the spaces spanned by the basis functions {k(x_i, x)}_{i=1}^t and {k(x_i, x)}_{i=1}^{t+1}, respectively. So f(.) can be treated as a vector in a finite dimensional space. In this sense, f_t(x) is a previously known vector and f_{t+1}(x) is a vector to be determined at t + 1. And f(x_{t+1}) is the inner product of two vectors in H: <f, k(x_{t+1}, .)>_H = f(x_{t+1}). Thus, the optimization problem (15) is actually a convex optimization with respect to an unknown finite dimensional vector. This is the motivation for applying Lemma 2.1 to (15) with an unknown function f(.) in an RKHS.

Then, by applying Lemma 2.1 in Section II-B, the Lagrange dual problem of (15) is

max_{α_{t+1}} min_{f,ξ_{t+1}} {L(f, ξ_{t+1}, α_{t+1}, β_{t+1})}
s.t. α_{t+1} ≥ 0, β_{t+1} ≥ 0    (16)

where α_{t+1} and β_{t+1} are the Lagrange multipliers corresponding to the constraints y_{t+1} f(x_{t+1}) ≥ 1 − ξ_{t+1} and ξ_{t+1} ≥ 0, respectively, and

L(f, ξ_{t+1}, α_{t+1}, β_{t+1}) = (1/2)||f − f_t||_H^2 + (r/2)||f||_H^2 + C ξ_{t+1} − α_{t+1}(y_{t+1} f(x_{t+1}) − 1 + ξ_{t+1}) − β_{t+1} ξ_{t+1}.    (17)

Similarly to (9), L(f, ξ_{t+1}, α_{t+1}, β_{t+1}) attains its minimum with respect to f and ξ_{t+1} if and only if the following conditions are satisfied:

∇_f L(f, ξ_{t+1}, α_{t+1}, β_{t+1}) = 0    (18)
∇_{ξ_{t+1}} L(f, ξ_{t+1}, α_{t+1}, β_{t+1}) = 0.    (19)

Recall from [1] and [4] that

d<f, k(x_{t+1}, .)>_H / df = d f(x_{t+1}) / df = k(x_{t+1}, .).    (20)

Then (18) gives

∇_f L(.) = f − f_t + r f − α_{t+1} y_{t+1} k(x_{t+1}, .) = 0.    (21)

Thus, from (18), (19), and (21), it can be obtained that

f = (1/(1 + r)) f_t + (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, .)
  = Σ_{i=1}^t (1/(1 + r)) α*_i k(x_i, .) + (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, .)    (22)

C − α_{t+1} − β_{t+1} = 0.    (23)

Note that (22) gives the update rule for the α̃*_i from 1 to t + 1

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
α̃*_{t+1} = α_{t+1} y_{t+1}/(1 + r)    (24)

where r, y_{t+1}, and α*_i for i = 1, . . . , t are known, while the Lagrange multiplier α_{t+1} needs to be determined. So the parameter r can also be considered as a forgetting factor, as seen in (24). From (22), we notice that

f(x_{t+1}) = (1/(1 + r)) f_t(x_{t+1}) + (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, x_{t+1})
           = (1/(1 + r)) f_t(x_{t+1}) + (α_{t+1}/(1 + r)) y_{t+1}    (25)

since k(x, y) = 1 when x = y. Also, we have y_{t+1}^2 = 1 and C − α_{t+1} − β_{t+1} = 0. Together with β_{t+1} ≥ 0, these result in

0 ≤ α_{t+1} ≤ C.    (26)

Substituting (22), (23), and (25) into (17) and using the properties of RKHS in Section II-A, we have

(1/2)||f − f_t||_H^2 = (1/2)|| (r/(1 + r)) f_t − (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, .) ||_H^2
= (r^2/(2(1 + r)^2)) ||f_t||_H^2 − (r/(1 + r)^2) α_{t+1} <f_t, y_{t+1} k(x_{t+1}, .)>_H + (α_{t+1}^2/(2(1 + r)^2)) y_{t+1}^2 <k(x_{t+1}, .), k(x_{t+1}, .)>_H
= (r^2/(2(1 + r)^2)) ||f_t||_H^2 − (r/(1 + r)^2) α_{t+1} y_{t+1} f_t(x_{t+1}) + α_{t+1}^2/(2(1 + r)^2)    (27)

and

(r/2)||f||_H^2 = (r/2)|| (1/(1 + r)) f_t + (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, .) ||_H^2
= (r/(2(1 + r)^2)) ||f_t||_H^2 + (r/(1 + r)^2) α_{t+1} y_{t+1} f_t(x_{t+1}) + (r/(2(1 + r)^2)) α_{t+1}^2.    (28)

Finally, optimizing (16) becomes maximizing L(α_{t+1}), which is the following quadratic function of α_{t+1}:

L(α_{t+1}) = −(1/(2(1 + r))) α_{t+1}^2 + α_{t+1} (1 − (1/(1 + r)) y_{t+1} f_t(x_{t+1})) + c_0    (29)

where c_0 = (r^2/(2(1 + r)^2)) ||f_t||_H^2 + (r/(2(1 + r)^2)) ||f_t||_H^2. That is, by combining (26) and (29), the Lagrange dual problem of (16) has now been simplified to

min_{α_{t+1}} { (1/(2(1 + r))) α_{t+1}^2 − α_{t+1} (1 − (1/(1 + r)) y_{t+1} f_t(x_{t+1})) − c_0 }
s.t. 0 ≤ α_{t+1} ≤ C.    (30)

Let ᾱ_{t+1} be the stationary point such that dL(α_{t+1})/dα_{t+1} |_{α_{t+1} = ᾱ_{t+1}} = 0. Then we have ᾱ_{t+1} = 1 + r − y_{t+1} f_t(x_{t+1}). For simplicity of expression, we use α_{t+1} to denote the optimal solution of (30) in the rest of this paper. Minimizing −L(α_{t+1}) on the interval [0, C] is straightforward, and the analytic optimal solution is at either 0, C, or ᾱ_{t+1}. Then

α_{t+1} = 0, if ᾱ_{t+1} ≤ 0
α_{t+1} = C, if ᾱ_{t+1} ≥ C
α_{t+1} = ᾱ_{t+1}, otherwise.    (31)

Combining (24) and (31), the derived online learning algorithm for classification is presented as

f_{t+1}(x) = (1/(1 + r)) f_t(x) + (α_{t+1}/(1 + r)) y_{t+1} k(x_{t+1}, x)
           = Σ_{i=1}^t α̃*_i k(x_i, x) + α̃*_{t+1} k(x_{t+1}, x)
h_{t+1}(x) = sign(f_{t+1}(x))    (32)

where

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
α̃*_{t+1} = 0, if ᾱ_{t+1} ≤ 0
α̃*_{t+1} = (C/(1 + r)) y_{t+1}, if ᾱ_{t+1} ≥ C
α̃*_{t+1} = ((1 + r − y_{t+1} f_t(x_{t+1}))/(1 + r)) y_{t+1}, otherwise.    (33)

Remark 2: It is easy to see that r plays the role of a forgetting factor, which can also be seen in (52) and (67) for OLK_R and OLK_N later. This implies that OLK should have the ability to deal with time-varying (non-stationary) problems, since the choice of r allows us to give higher weight to more recent data. To this end, the variation rate r needs to be known or estimated before applying the OLK algorithms. Nevertheless, we focus on the applications of OLK to stationary problems in this paper, and it is an interesting future work to investigate its capability of dealing with time-varying problems in detail. A minimal sketch of one OLK_C update step is given below.
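The following NumPy sketch implements exactly the update (33) (the function name olkc_update and the list-based representation of the expansion (14) are ours; gauss_kernel is from the sketch in Section II-A):

```python
def olkc_update(alphas, xs, x_new, y_new, C=1.0, r=0.001, kernel=None):
    """One OLK_C step implementing (33); assumes k(x, x) = 1."""
    f_t = sum(a * kernel(xi, x_new) for a, xi in zip(alphas, xs))  # f_t(x_{t+1})
    a_bar = 1.0 + r - y_new * f_t               # stationary point of (30)
    a_opt = min(max(a_bar, 0.0), C)             # optimal multiplier per (31)
    alphas = [a / (1.0 + r) for a in alphas]    # forgetting factor, see (24)
    alphas.append(a_opt * y_new / (1.0 + r))    # new coefficient per (33)
    return alphas, xs + [x_new]
```

Note how clipping the stationary point ᾱ_{t+1} = 1 + r − y_{t+1} f_t(x_{t+1}) to [0, C] reproduces the three cases of (33).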

B. Modeling OLK for Regression (OLK_R)

In this section, we consider modeling online learning with Kernels for a regression problem in H. For the given learning data (x_1, y_1), . . . , (x_i, y_i), . . . , (x_t, y_t) and the current f_t = Σ_{i=1}^t α*_i k(x_i, .) ∈ H, where x_i ∈ R^n and y_i ∈ R, we hope to update {α̃*_i}_{i=1}^{t+1} based on {α*_i}_{i=1}^t when a new pair of data (x_{t+1}, y_{t+1}) becomes available. The model of the online regression problem is formulated as

min_{f,ξ_{t+1}} { (1/2)||f − f_t||^2 + (r/2)||f||^2 + C ξ_{t+1} }
s.t. |y_{t+1} − f(x_{t+1})| ≤ ε + ξ_{t+1}, ξ_{t+1} ≥ 0    (34)

which can be modified to

min_{f,ξ_{t+1},ξ̃_{t+1}} { (1/2)||f − f_t||^2 + (r/2)||f||^2 + C(ξ_{t+1} + ξ̃_{t+1}) }
s.t. y_{t+1} − f(x_{t+1}) ≤ ε + ξ_{t+1}, ξ_{t+1} ≥ 0
     f(x_{t+1}) − y_{t+1} ≤ ε + ξ̃_{t+1}, ξ̃_{t+1} ≥ 0    (35)

where ε ≥ 0 is a pre-chosen parameter reflecting the tolerated error in the regression, and ξ_{t+1}, ξ̃_{t+1} are the slack variables. For more details on ε, one may refer to [31], where an error-tolerance SVM is proposed. Like (3), we formulate the primal optimization problem of online learning for regression, together with its Lagrange dual problem, as

min_{f,ξ_{t+1},ξ̃_{t+1}} max_{α_{t+1},α̃_{t+1}} {L(.)}
s.t. α_{t+1} ≥ 0, β_{t+1} ≥ 0, α̃_{t+1} ≥ 0, β̃_{t+1} ≥ 0    (36)

and

max_{α_{t+1},α̃_{t+1}} min_{f,ξ_{t+1},ξ̃_{t+1}} {L(.)}
s.t. α_{t+1} ≥ 0, β_{t+1} ≥ 0, α̃_{t+1} ≥ 0, β̃_{t+1} ≥ 0    (37)

where α_{t+1}, α̃_{t+1}, β_{t+1}, β̃_{t+1} are the Lagrange multipliers corresponding to the constraints in (35), and

L(.) = L(f, ξ_{t+1}, ξ̃_{t+1}, α_{t+1}, α̃_{t+1}, β_{t+1}, β̃_{t+1})
     = (1/2)||f − f_t||^2 + (r/2)||f||^2 + C(ξ_{t+1} + ξ̃_{t+1})
       + α_{t+1}(y_{t+1} − f(x_{t+1}) − ε − ξ_{t+1}) − β_{t+1} ξ_{t+1}
       + α̃_{t+1}(f(x_{t+1}) − y_{t+1} − ε − ξ̃_{t+1}) − β̃_{t+1} ξ̃_{t+1}.    (38)

Note that the optimization problem in (36) is a convex optimization problem due to the convexity of the objective function and of the constraints. Then, by Lemma 2.1, optimizing (36) and (37) is equivalent. To obtain min_{f,ξ_{t+1},ξ̃_{t+1}} L(.), we take the partial derivatives of L(.) with respect to f, ξ_{t+1}, and ξ̃_{t+1}, respectively, and set them to zero. This gives

f = (1/(1 + r)) f_t + ((α_{t+1} − α̃_{t+1})/(1 + r)) k(x_{t+1}, .)    (39)
C − α_{t+1} − β_{t+1} = 0    (40)
C − α̃_{t+1} − β̃_{t+1} = 0.    (41)

With notations similar to those used in OLK_C, (39) implies that

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
α̃*_{t+1} = (α_{t+1} − α̃_{t+1})/(1 + r)    (42)

where the Lagrange multipliers α_{t+1} and α̃_{t+1} need to be determined. From (40) and (41), as well as β_{t+1} ≥ 0 and β̃_{t+1} ≥ 0, we have 0 ≤ α_{t+1} ≤ C and 0 ≤ α̃_{t+1} ≤ C. Substituting (39)–(41) into (37) and ignoring the constant term c_0, (37) becomes

max_{α_{t+1},α̃_{t+1}} { −(1/(2(1 + r))) α̃_{t+1}^2 + α̃_{t+1} ((1/(1 + r)) f_t(x_{t+1}) − y_{t+1} − ε)
                       − (1/(2(1 + r))) α_{t+1}^2 + α_{t+1} (y_{t+1} − (1/(1 + r)) f_t(x_{t+1}) − ε) }
s.t. 0 ≤ α_{t+1} ≤ C, 0 ≤ α̃_{t+1} ≤ C    (43)

which is equivalent to

min_{α_{t+1},α̃_{t+1}} { (1/(2(1 + r))) α̃_{t+1}^2 − α̃_{t+1} ((1/(1 + r)) f_t(x_{t+1}) − y_{t+1} − ε)
                       + (1/(2(1 + r))) α_{t+1}^2 − α_{t+1} (y_{t+1} − (1/(1 + r)) f_t(x_{t+1}) − ε) }
s.t. 0 ≤ α_{t+1} ≤ C, 0 ≤ α̃_{t+1} ≤ C.    (44)

On the other hand, from (38), we have

α_{t+1} ≥ 0, α̃_{t+1} ≥ 0
y_{t+1} − f(x_{t+1}) − ε − ξ_{t+1} ≤ 0
f(x_{t+1}) − y_{t+1} − ε − ξ̃_{t+1} ≤ 0.    (45)

By using the KKT conditions [22], we have

α_{t+1} α̃_{t+1} = 0    (46)

i.e., α_{t+1} and α̃_{t+1} cannot be nonzero at the same time. Then (44) is equivalent to

min_{α_{t+1},α̃_{t+1}} {L_1(α̃_{t+1}), L_2(α_{t+1})}
s.t. 0 ≤ α_{t+1} ≤ C, 0 ≤ α̃_{t+1} ≤ C    (47)

where

L_1(α̃) = (1/(2(1 + r))) α̃^2 − α̃ ((1/(1 + r)) f_t(x_{t+1}) − y_{t+1} − ε)
L_2(α) = (1/(2(1 + r))) α^2 − α (y_{t+1} − (1/(1 + r)) f_t(x_{t+1}) − ε).    (48)

As both L_1(α) and L_2(α) are quadratic functions, each attains its minimum either at its stationary point (dL_1(α̃)/dα̃ |_{α̃ = ᾱ̃} = 0 or dL_2(α)/dα |_{α = ᾱ} = 0) or at a boundary point (0 or C). The stationarity conditions yield

ᾱ̃ = f_t(x_{t+1}) − (1 + r)(y_{t+1} + ε)
ᾱ = (1 + r)(y_{t+1} − ε) − f_t(x_{t+1}).    (49)

As L_1(0) = L_2(0) = 0, the minimum of (47) on [0, C] becomes

min {L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0}.    (50)

Combining (39) and (50), the derived online learning algorithm for regression is now given as

f_{t+1}(x) = (1/(1 + r)) f_t(x) + ((α_{t+1} − α̃_{t+1})/(1 + r)) k(x_{t+1}, x)
           = Σ_{i=1}^t α̃*_i k(x_i, x) + α̃*_{t+1} k(x_{t+1}, x)    (51)

where

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
ᾱ̃ = f_t(x_{t+1}) − (1 + r)(y_{t+1} + ε)
ᾱ = (1 + r)(y_{t+1} − ε) − f_t(x_{t+1})
if min{L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0} = L_1(ᾱ̃): α̃_{t+1} = ᾱ̃, α_{t+1} = 0
if min{L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0} = L_2(ᾱ): α̃_{t+1} = 0, α_{t+1} = ᾱ
if min{L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0} = L_1(C): α̃_{t+1} = C, α_{t+1} = 0
if min{L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0} = L_2(C): α̃_{t+1} = 0, α_{t+1} = C
if min{L_1(ᾱ̃), L_1(C), L_2(ᾱ), L_2(C), 0} = 0: α̃_{t+1} = 0, α_{t+1} = 0
α̃*_{t+1} = (α_{t+1} − α̃_{t+1})/(1 + r)    (52)

consistent with (42) and (51). Remark 3: In the case that ε = 0 in (52), it can be obtained from (49) that ᾱ̃ = −ᾱ. Then from (48), we have either L_1(ᾱ̃) < 0 < L_2(ᾱ) or L_2(ᾱ) < 0 < L_1(ᾱ̃), and α̃*_{t+1} can never be zero except when f_t(x_{t+1}) and (1 + r) y_{t+1} are identical. So all the α̃*_{t+1} are nonzero almost surely in this case. Also, by considering the model in (34) with ε = 0, either ξ_{t+1} or ξ̃_{t+1} is nonzero almost surely for every learning data point; therefore, either α_{t+1} or α̃_{t+1} will be nonzero almost surely. Actually, the error tolerance ε controls the number of nonzero α̃*_i: for a larger ε, the nonzero α̃*_i will be sparse relative to the number of learning data, as seen in Example 2 later. A minimal sketch of one OLK_R update step follows.
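The sketch below (our illustration; olkr_update is our name) implements (48)–(52). Since each quadratic in (48) is convex, its minimum on [0, C] is attained at the stationary point (49) clipped to [0, C], which also covers the boundary candidates L_1(C) and L_2(C) in (50):

```python
def olkr_update(alphas, xs, x_new, y_new, C=1.0, r=0.001, eps=0.5, kernel=None):
    """One OLK_R step following (48)-(52); assumes k(x, x) = 1."""
    f_t = sum(a * kernel(xi, x_new) for a, xi in zip(alphas, xs))
    L1 = lambda a: a**2 / (2*(1 + r)) - a * (f_t/(1 + r) - y_new - eps)
    L2 = lambda a: a**2 / (2*(1 + r)) - a * (y_new - f_t/(1 + r) - eps)
    at_c = min(max(f_t - (1 + r) * (y_new + eps), 0.0), C)  # clipped stationary pt of L1
    a_c = min(max((1 + r) * (y_new - eps) - f_t, 0.0), C)   # clipped stationary pt of L2
    if min(L1(at_c), L2(a_c)) >= 0.0:      # doing nothing (value 0) is optimal
        alpha, alpha_t = 0.0, 0.0
    elif L1(at_c) <= L2(a_c):              # KKT (46): at most one is nonzero
        alpha, alpha_t = 0.0, at_c
    else:
        alpha, alpha_t = a_c, 0.0
    alphas = [v / (1.0 + r) for v in alphas]        # forgetting factor
    alphas.append((alpha - alpha_t) / (1.0 + r))    # per (42) and (51)
    return alphas, xs + [x_new]
```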

C. Modeling OLK for Novelty Detection (OLK_N)

Novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. Generally speaking, novelty detection is similar to classification without labels, as seen in [32], where a one-class SVM is investigated. Novelty detection is useful for condition monitoring tasks, such as network intrusion detection. Recently, novelty detection has also been used for background subtraction in the research field of image processing [21]. In our simulation section later, we will consider such applications.

Given a training data set without any class information (x_1, . . . , x_i, . . . , x_l), where x_i ∈ X for i = 1, . . . , l, we wish to assign each data point a label +1 or −1. Besides the unknown classification labels, another difference of novelty detection from general classification lies in that most of the training data points are assigned the label +1 and only a few data points are assigned the label −1. Those assigned −1 are considered to be abnormal points or novelty points. Before proposing our online model, we review the one-class SVM proposed by Schölkopf [1], which is formulated as

min_{w,ξ,ρ} { (1/2) w^T w − ρ + (1/(vl)) Σ_{i=1}^l ξ_i }
s.t. w^T φ(x_i) ≥ ρ − ξ_i, ξ_i ≥ 0, i = 1, . . . , l    (53)

where w, φ(.), and ξ have the same meanings as those of SVM in Section II, and v ∈ (0, 1] is a pre-chosen parameter reflecting the tolerance in terms of the percentage of novelty points we wish to detect. Just as we use α_{t+1} to denote the optimal solution of L(α_{t+1}) for simplicity of expression, we let w and ρ be the optimal solutions of (53). Since nonzero slack variables ξ_i, for i = 1, . . . , l, are penalized in the objective function, it can be expected that, if w and ρ are obtained, the decision function

h(x) = sign(f(x)) = sign(w^T φ(x) − ρ)    (54)

will be positive for most samples x_i contained in the training set, while the simplicity of the function f(x), reflected by the term w^T w in (53), will remain.

Similar to the classification and regression problems, the online novelty detection problem is to update f_{t+1}, i.e., {α̃*_i}_{i=1}^{t+1} based on {α*_i}_{i=1}^t for t = 0, . . . , l − 1, when the newly arrived data point x_{t+1} becomes available. The optimization model is given as

min_{f,ρ,ξ_{t+1}} { (1/2)||f − f_t||^2 + (r/2)||f||^2 + C ξ_{t+1} − vρ }
s.t. f(x_{t+1}) − 1 ≥ ρ − ξ_{t+1}, ξ_{t+1} ≥ 0, ρ ≥ 0    (55)

where C and v are pre-set parameters with C > v. Note that we require ρ ≥ 0, as discussed in [32]. A modification from the one-class SVM made here is that the constraint is f(x_{t+1}) − 1 ≥ ρ − ξ_{t+1} instead of f(x_{t+1}) ≥ ρ − ξ_{t+1}, which means that most data points satisfy f(x_{t+1}) − ρ ≥ 1. The decision function is given as

h(x) = sign(f(x) − 1 − ρ).    (56)

The primal problem and its corresponding dual problem to be optimized are

min_{f,ξ_{t+1},ρ_{t+1}} max_{α_{t+1}} {L(f, ξ_{t+1}, ρ_{t+1}, α_{t+1}, β_{t+1}, γ_{t+1})}
s.t. α_{t+1} ≥ 0, β_{t+1} ≥ 0, γ_{t+1} ≥ 0    (57)

and

max_{α_{t+1}} min_{f,ξ_{t+1},ρ_{t+1}} {L(f, ξ_{t+1}, ρ_{t+1}, α_{t+1}, β_{t+1}, γ_{t+1})}
s.t. α_{t+1} ≥ 0, β_{t+1} ≥ 0, γ_{t+1} ≥ 0    (58)

with

L(f, ξ_{t+1}, ρ_{t+1}, α_{t+1}, β_{t+1}, γ_{t+1}) = (1/2)||f − f_t||^2 + (r/2)||f||^2 + C ξ_{t+1} − v ρ_{t+1} + α_{t+1}(ρ_{t+1} − ξ_{t+1} − f(x_{t+1}) + 1) − β_{t+1} ξ_{t+1} − γ_{t+1} ρ_{t+1}    (59)

where β_{t+1} and γ_{t+1} are the Lagrange multipliers corresponding to the constraints ξ_{t+1} ≥ 0 and ρ ≥ 0, respectively. Minimizing L(.) with respect to f, ξ_{t+1}, and ρ_{t+1} results in

f(.) = (1/(1 + r)) f_t(.) + (α_{t+1}/(1 + r)) k(x_{t+1}, .)    (60)
C − α_{t+1} − β_{t+1} = 0    (61)
−v + α_{t+1} − γ_{t+1} = 0.    (62)

Similar to OLK_C and OLK_R, (60) gives

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
α̃*_{t+1} = α_{t+1}/(1 + r).    (63)

We need to solve for α_{t+1}. Also, (61) and (62) imply that α_{t+1} = C − β_{t+1} ≤ C and α_{t+1} = v + γ_{t+1} ≥ v, respectively, as both γ_{t+1} and β_{t+1} are non-negative variables. Then, the constraint in terms of α_{t+1} is

v ≤ α_{t+1} ≤ C.    (64)

Substituting (60)–(62) into (58), we obtain

max_{α_{t+1}} { −(1/(2(1 + r))) α_{t+1}^2 + α_{t+1}(1 − (1/(1 + r)) f_t(x_{t+1})) + c_0 }
s.t. v ≤ α_{t+1} ≤ C    (65)

where c_0 denotes the constant part. Maximizing (65) is quite straightforward, and this gives the online algorithm for novelty detection as follows:

f_{t+1}(x) = (1/(1 + r)) f_t(x) + (α_{t+1}/(1 + r)) k(x_{t+1}, x)
           = Σ_{i=1}^t α̃*_i k(x_i, x) + α̃*_{t+1} k(x_{t+1}, x)
h_{t+1}(x) = sign(f_{t+1}(x) − 1 − ρ_{t+1})    (66)

where

α̃*_i = α*_i/(1 + r), for i = 1, . . . , t
ᾱ_{t+1} = 1 + r − f_t(x_{t+1})
α_{t+1} = v, if ᾱ_{t+1} ≤ v
α_{t+1} = C, if ᾱ_{t+1} ≥ C
α_{t+1} = ᾱ_{t+1}, otherwise
α̃*_{t+1} = α_{t+1}/(1 + r).    (67)

Now ρ needs to be determined. Let ξ and ρ be feasible values for (55), and let δ = min(ξ, ρ). If we substitute ρ = ρ − δ and ξ = ξ − δ, then the constraints remain in force and the objective value decreases by δ(C − v). Hence, at the optimum point, at most one of ξ and ρ can be nonzero, and the constraint f(x) − 1 ≥ ρ − ξ holds as an equality, since otherwise we could get a better solution by increasing ρ. After we obtain f by (66), there are two possibilities

ρ = 0, ξ = 1 − f(x), if f(x) − 1 ≤ 0
ρ = f(x) − 1, ξ = 0, if f(x) − 1 > 0.    (68)

Note that (68) is derived directly and rigorously from the primal problem (55). But we feel that ρ_{t+1} should be prevented from changing too much during the iteration. So when α_{t+1} = v or α_{t+1} = C, i.e., when α_{t+1} is on the boundary of the interval [v, C], we keep ρ_{t+1} = ρ_t; otherwise, we update ρ_{t+1} based on (68). In simulation, it is found that this actually improves the performance of OLK_N. In summary, ρ_{t+1} is updated as follows:

ρ_{t+1} = ρ_t, if α_{t+1} = C or α_{t+1} = v
ρ_{t+1} = max{0, f_{t+1}(x_{t+1}) − 1}, if v < α_{t+1} < C.    (69)

A minimal sketch of one OLK_N update step is given after this derivation.
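The sketch below (our illustration; olkn_update is our name) implements the update (67) together with the ρ rule (69):

```python
def olkn_update(alphas, xs, rho, x_new, C=0.2, v=0.1, r=0.001, kernel=None):
    """One OLK_N step per (67) and (69); assumes k(x, x) = 1 and C > v."""
    f_t = sum(a * kernel(xi, x_new) for a, xi in zip(alphas, xs))
    a_bar = 1.0 + r - f_t                       # stationary point of (65)
    a_opt = min(max(a_bar, v), C)               # clip to [v, C], see (64)
    alphas = [a / (1.0 + r) for a in alphas]    # forgetting factor
    alphas.append(a_opt / (1.0 + r))            # new coefficient per (67)
    xs = xs + [x_new]
    if v < a_opt < C:                           # (69): keep rho if on the boundary
        f_new = sum(a * kernel(xi, x_new) for a, xi in zip(alphas, xs))
        rho = max(0.0, f_new - 1.0)             # rho_{t+1} = max{0, f_{t+1}(x_{t+1}) - 1}
    return alphas, xs, rho
```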

So for OLK_C and OLK_R, the update algorithms are rigorously obtained by solving their respective primal problems, while for OLK_N, the update algorithm is based on a modified version of the solution of its primal problem. Also, we notice that only a few values of α_{t+1} are located in the interval (v, C), i.e., v < α_{t+1} < C, as observed in the simulation studies. This is consistent with what we expect, since we feel that ρ should not change too much during the iteration.

D. Algorithms and Discussions

In this section, we present the implementation procedures of the algorithms and then give some discussions, including a comparison of the OLK methods with other algorithms in the related area. Let t be the t-th step in learning the training data (x_1, y_1), . . . , (x_l, y_l). The steps for OLK_C, OLK_R, and OLK_N are as follows; a compact sketch of the resulting learning loop is given after this list.
1) Step 1: Let f_0(x) = 0. Collect (x_1, y_1) and set appropriate values for the parameters p, C, r, and ε in OLK. Then, obtain α*_1 based on (33), (52), or (67) for OLK_C, OLK_R, and OLK_N, respectively.
2) Step t + 1, for 1 ≤ t ≤ l − 1: In this step, (x_{t+1}, y_{t+1}) becomes available and we already have f_t(x) = Σ_{i=1}^t α*_i k(x_i, x). Update α*_i for i = 1, . . . , t and α*_{t+1} based on (33), (52), or (67) for OLK_C, OLK_R, and OLK_N, respectively. Update f_{t+1}(x) = Σ_{i=1}^{t+1} α*_i k(x_i, x).
3) Step l: Obtain f_l(x) and end.
Note that for OLK_N, in order to obtain its decision function in (66), ρ also needs to be updated in each step based on (69).
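The following driver (our sketch; it reuses gauss_kernel from Section II-A and olkc_update from Section III-A, and the optional threshold thr_sv anticipates the support-vector dropping discussed in Remark 4 below) realizes the three steps for OLK_C; OLK_R and OLK_N follow the same pattern with their respective update functions:

```python
def olk_train(stream, C=1.0, r=0.001, p=1.0, thr_sv=0.0):
    """Run OLK_C over a data stream [(x_1, y_1), ..., (x_l, y_l)]."""
    kernel = lambda a, b: gauss_kernel(a, b, p)
    alphas, xs = [], []                          # Step 1 starts from f_0 = 0
    for x_new, y_new in stream:                  # Step t + 1
        alphas, xs = olkc_update(alphas, xs, x_new, y_new, C, r, kernel)
        if thr_sv > 0.0:                         # drop faded support vectors
            kept = [(a, x) for a, x in zip(alphas, xs) if abs(a) >= thr_sv]
            alphas, xs = [a for a, _ in kept], [x for _, x in kept]
    return alphas, xs    # Step l: f_l(x) = sum_i alphas[i] * k(xs[i], x)
```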

Remark 4: One of the main issues in designing a learning algorithm is that the number of support vectors, which is related to the memory requirement of the algorithm, grows linearly with the number of data points. In [15] and [17], how to bound the number of support vectors is discussed, and the scheme of the forgetron is considered in [13]. These are very interesting approaches. Note that the "forgetting" in this paper does not limit the number of support vectors of OLK in any manner, but it offers the possibility of setting a threshold to drop some support vectors as the learning process continues, as discussed below.
1) For OLK_C, it is possible to select a small threshold Thr_sv to drop some support vectors. By applying the forgetting parameter r, the weights on the previous support vectors become smaller and smaller, since we have α̃*_i = α*_i/(1 + r) for i = 1, . . . , t. Let α_1^t and α_1^{t_0} be α_1 at times t and t_0, respectively. Then α_1^t = (1/(1 + r))^{t − t_0} α_1^{t_0} ≤ (1/(1 + r))^{t − t_0} C. A small threshold Thr_sv can thus be selected to drop some support vectors, since α_1^t < Thr_sv when t > |lg(Thr_sv) − lg C| / lg(1 + r) + t_0. In Example 1 of Section IV, how different values of Thr_sv control the number of support vectors will be illustrated.
2) For OLK_R, ε controls the number of support vectors, and their number can be bounded, which is similar to the projectron method of Orabona, Keshet, and Caputo [17]. The illustration of ε controlling the number of support vectors is shown in Fig. 1.
3) For OLK_N, all the α_i are nonzero since v ≤ α_i ≤ C. However, in this case we do not need additional memory to store the values of those satisfying α_i = v.

Fig. 1. Illustration of ε controlling the number of nonzero α̃*_{t+1}.

Remark 5: Note that OLK, NORMA, and SVM can all be applied in classification, regression, and novelty detection. The differences among these three methods are as follows. In our opinion, SVM optimizes the sequence {α*_i}_{i=1}^t on a batch data set {x_i, y_i}_{i=1}^t, while OLK optimizes α*_i at each iteration. It is not hard to see that the solution of OLK is a feasible solution of SVM, but obviously not a globally optimal one. However, the solution is already good enough, with a low computational cost, to be applied in many classification, regression, and novelty detection problems. In addition, OLK handles online classification, regression, and novelty detection in a unified framework. Similar to OLK, NORMA is a method that updates α_i at each iteration. The difference between OLK and NORMA is that OLK guarantees that α*_i is the optimal solution at each iteration step: the step size is pre-selected in NORMA, while it is determined by solving an optimization problem in our method.

IV. APPLICATIONS

To verify the proposed OLK algorithms, simulation experiments have been conducted in three areas: classification, regression, and novelty detection. The Gaussian Kernel k(x, y) = e^{−||x − y||^2 / p^2}, with p being a parameter controlling the width of the Kernel, is chosen for our simulation studies. The SMO method is used to train SVM to accelerate its speed.

A. Online Learning With Kernels in Classification (OLK_C)

We compare the performance of OLK_C with the well-known standard SVM [22], the perceptron [10], the Kernel perceptron [12], the projectron [17], and NORMA [4].

Example 1: The training and testing data are constructed as follows:

D_training = {(x_i, y_i)}_{i=1}^l, with
x_i = (N(1, 1), N(1, 1)) if y_i = sign(u_i) = 1
x_i = (N(−1, 1), N(−1, 1)) if y_i = sign(u_i) = −1

D_test = {(x'_i, y'_i)}_{i=1}^l, with
x'_i = (N(1, 1), N(1, 1)) if y'_i = sign(u'_i) = 1
x'_i = (N(−1, 1), N(−1, 1)) if y'_i = sign(u'_i) = −1

where x_i is a 2-D learning data point with a horizontal coordinate and a vertical coordinate, u_i, u'_i ∈ U(−0.5, 0.5), and l is the number of training data. Note that U(−0.5, 0.5) denotes the uniform distribution on the interval [−0.5, 0.5], and N(−1, 1) is the normal distribution with mean −1 and standard deviation 1. A data-generation sketch is given below.
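The following NumPy snippet (our illustration; make_data is our name) generates the data of Example 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(l):
    """Labels y_i = sign(u_i) with u_i ~ U(-0.5, 0.5); class +1 points are
    (N(1,1), N(1,1)) and class -1 points are (N(-1,1), N(-1,1))."""
    y = np.sign(rng.uniform(-0.5, 0.5, size=l))
    X = rng.normal(loc=y[:, None], scale=1.0, size=(l, 2))
    return X, y

X_train, y_train = make_data(1500)
X_test, y_test = make_data(1500)
```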

The parameters to be tuned in SVM are C and p: C is the regularization parameter and p reflects the width of the Gaussian Kernel. For OLK_C, another parameter r needs to be chosen to control the norm ||f||_H^2, as seen in (15). As mentioned, r also plays the role of a forgetting factor: when r > 0, more recent data points are treated with higher importance, and when r = 0, all the training data are treated equally. Usually r can be chosen as a very small positive number. The choices of parameters for the perceptron and NORMA can be seen in [4] and [10]; they involve the regularization parameter λ and a learning rate η, as well as the Kernel width p in NORMA. In the simulations, "k-fold cross-validation" with k = 10 is applied to select the optimal parameters for each method in this example. The search intervals for all parameters are p ∈ [0.1, 2], C ∈ (0, 2], r ∈ [0, 0.1], λ ∈ (0, 1], and η ∈ (0, 1]. The parameters chosen for the different methods are given in Table I.

TABLE I
CROSS-VALIDATION OF THE PARAMETERS

SVM:               p = 1,   C = 2
OLK_C:             p = 1.2, C = 0.8, r = 0.001
NORMA:             p = 1,   λ = 0.01, η = 0.1
Perceptron:        η = e^{−t/50}
Kernel Perceptron: p = 1.2, η = 0.1
Projectron:        p = 1.2, η = 0.02

As mentioned in Remark 4, it is possible to use Thr_sv to drop some support vectors. Fig. 2 shows the results. It can be seen that, when Thr_sv = 0, the number of support vectors increases linearly as time goes on. If a larger threshold is allowed, the number of support vectors can be controlled over time.

Fig. 2. Number of support vectors for different thresholds (Thr_sv = 0, 0.001, 0.002, and 0.01).

To compare the accuracy in terms of the number of learning data points, we conduct the experiments with l ranging from 50 to 1500. Fig. 3 shows the cumulative errors for the algorithms. It can be observed that all the algorithms achieve good results, except that the perceptron algorithm performs slightly more poorly. OLK_C is comparable to SVM and the projectron. When l is about 1000, it takes almost half an hour for SVM to obtain the decision function, while only a few seconds for

the OLK_C, NORMA, and perceptron algorithms. We feel that, though SVM is a powerful tool for many learning problems, its computational cost can be unbearable for problems with a huge number of learning data points.

Fig. 3. Classification errors for each method.

The time cost, using MATLAB, of all the algorithms is shown in Fig. 4. These results show that the perceptron is the fastest algorithm but has the least accuracy. OLK_C is comparable to SVM and the projectron and performs slightly better than NORMA in accuracy, while its speed is comparable to or slightly faster than those of the projectron and NORMA.

Fig. 4. Time cost for each method.

B. Online Learning With Kernels in Regression (OLK_R)

In this experiment, two examples are presented to compare the performances of OLK_R, NORMA, the Kernel perceptron, KRLS, and LS-SVM [33] in regression.

Example 2:
1) D_training = {(x_i, y_i)}_{i=1}^l, where x_i ∈ U(−π, π), y_i = f(x_i) + v_i, and f(x_i) = 0.2 sin(5x_i) + 0.5 sin(x_i), with noise v_i ∈ U(−0.5, 0.5);


2) D_test = {(x'_i, y'_i)}_{i=1}^l, where x'_i ∈ U(−π, π) and y'_i = 0.2 sin(5x'_i) + 0.5 sin(x'_i).

The cross-validation method mentioned in Example 1 has been employed to choose the optimal parameters for all algorithms. Note that the ε-insensitive loss is considered in both OLK_R and NORMA. The parameters are chosen as follows. For LS-SVM, the parameters p and C are chosen as 0.5 and 10, respectively. For OLK_R, the parameters are chosen as p = 0.5, ε = 0.5, and C = 1, with a forgetting factor r = 0.001. For NORMA, p = 0.5, λ = 0.01, and η = 0.1. For the Kernel perceptron, p = 0.5 and η = 0.02. For KRLS, p = 0.5 and v = 0.001. More details about how to set the parameters for each method can be found in [4], [12], [15], and [33].

Note that our objective is to estimate f(x), with the estimate denoted as f̂(x), from the data set D_training. To study the accuracy, the root mean square error (RMSE), defined as

RMSE = ( Σ_{i=1}^l (f̂(x'_i) − f(x'_i))^2 / l )^{1/2}

is evaluated and compared for each algorithm. Fig. 5 illustrates the regression results of OLK_R on the data set D_training, where the number of data points l is chosen to be 50, 150, 300, and 600, respectively.

Fig. 5. Illustration of OLK_R with the number of training data l (50, 150, 300, and 600 learning data points; each panel shows the learning data, the true function value, and the estimated function value).

The RMSEs based on the data in D_test are compared among LS-SVM, NORMA, KRLS, the Kernel perceptron, and our proposed OLK_R, trained using data in D_training with l ranging from 10 to 700. The results are shown in Fig. 6. Clearly, OLK_R slightly outperforms the NORMA and Kernel perceptron methods, and is comparable to KRLS and to the batch learning method LS-SVM.

Fig. 6. RMSE on D_test of the algorithms with respect to l.
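An end-to-end sketch of this experiment (our illustration, reusing olkr_update from Section III-B and the parameters stated above) is:

```python
import numpy as np

rng = np.random.default_rng(0)
l = 300
x_tr = rng.uniform(-np.pi, np.pi, l)
f_true = lambda x: 0.2 * np.sin(5 * x) + 0.5 * np.sin(x)
y_tr = f_true(x_tr) + rng.uniform(-0.5, 0.5, l)          # noisy targets

k = lambda a, b: np.exp(-abs(a - b) ** 2 / 0.5 ** 2)     # Gaussian Kernel, p = 0.5
alphas, xs = [], []
for xi, yi in zip(x_tr, y_tr):                           # online pass over D_training
    alphas, xs = olkr_update(alphas, xs, xi, yi, C=1.0, r=0.001, eps=0.5, kernel=k)

x_te = rng.uniform(-np.pi, np.pi, l)                     # noise-free test set
f_hat = np.array([sum(a * k(c, xt) for a, c in zip(alphas, xs)) for xt in x_te])
print(np.sqrt(np.mean((f_hat - f_true(x_te)) ** 2)))     # RMSE as defined above
```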

Note that the number of nonzero α̃*_{t+1} reflects the complexity of the decision function, and a decision function with a lower complexity can improve the generalization ability of a learning algorithm. In this example, it is seen that the error tolerance parameter ε actually controls the number of nonzero α̃*_{t+1}; Fig. 1 shows how this is achieved. This implies that our algorithm has the flexibility of controlling the complexity of the decision function by choosing ε. From Fig. 1, it can be observed that the larger the ε, the simpler the decision function. For example, when ε = 0, α̃*_{t+1} is almost surely nonzero for every newly arrived data point. On the other hand, when ε is big enough, for example ε > |f(x)| + max{v_i}, then α̃*_{t+1} is zero for any newly arrived data.

C. Online Learning With Kernels in Novelty Detection (OLK_N)

As mentioned, novelty detection is the identification of new or unknown data or signals that a machine learning system is not aware of during training. There are a multitude of applications where novelty detection is extremely important, including signal processing, computer vision, pattern recognition, data mining, and robotics. In this section, three novelty detection examples making use of OLK_N are illustrated. One example is conducted on an artificial data set, and the other two deal with real-time background subtraction in the field of image processing. Note that the one-class SVM in [32] is too slow to be suitable for the real-time background subtraction in Examples 4 and 5.

Example 3: The training and testing data are the same data set

D_training = {(x_i, y_i)}_{i=1}^{100}, with
x_i = (N(2, 1), N(2, 1)) if y_i = sign(u_i) = 1
x_i = (N(−2, 1), N(−2, 1)) if y_i = sign(u_i) = −1

where x_i is a 2-D vector with a horizontal coordinate and a vertical coordinate and u_i ∈ U(−0.05, 0.95). Basically, 95% of the data points are normal points with label +1 and only 5% of the data points are −1. The objective is to determine whether there are novelty points in a given set and to label them if possible. In training OLK_N, we do not use any of the classification label information of the data. Those data points that are detected to be novelty points by OLK_N are assigned the label −1, and the other points are assigned the label +1. In testing OLK_N, it can be checked whether a detected novelty point is a true novelty point by comparing the assigned label with its given label value.

Set the parameters as r = 0.001, p = 1, C = 0.2, and v = 0.1 for conducting the experiment. As seen in Fig. 7, all the novelty points can be separated from the normal points by OLK_N. However, it is also observed that a few points that are supposed to be normal points are assigned the label −1. Nevertheless, this is reasonable, as these points are far away from the other normal points and thus behave like novelty points.

Fig. 7. Illustration of novelty points (normal points, true novelty points, and the novelty points found by the algorithm).

In this example, when determining the values of α̃*_{t+1} for t = 0, . . . , l − 1, we need to solve for α_{t+1}, as α̃*_{t+1} = α_{t+1}/(1 + r) in (67). Note that the majority of the α_{t+1} are located on the boundary of the interval [v, C], while only a few are in (v, C).

Thus, C and v control the number of novelty points detected by the algorithm. By changing the parameters of a novelty detection algorithm, we can obtain the true positive rates as well as the corresponding false positive rates. Fig. 8 shows the receiver operating characteristic (ROC) curves of OLK_N and NORMA, where a larger area under a curve represents a better performance. We can observe that the performance of OLK_N is slightly better than that of NORMA.

Fig. 8. ROC curves of novelty detection by OLK_N and NORMA.

Example 4: Real-time background subtraction with a dynamic background. First, the problem is formulated as follows. Let x_i ∈ X denote an observed image consisting of n pixels {x_{i1}, . . . , x_{in}}, and let y_i be the assigned label of x_i, i.e., y_i = {y_{ij}}_{j=1}^n, where y_{ij} ∈ {−1, 1}. When processing an input image sequence D = (x_1, . . . , x_i, . . . , x_l), the goal is to assign a label sequence (y_1, . . . , y_i, . . . , y_l) to the pixels. Note that when the image sequence changes, the pixels of background points do not change much, while those of foreground points change dramatically. So the background pixels can be considered as normal points and the foreground pixels as novelty points. Thus, our proposed OLK_N can be applied to such a novelty detection problem; a per-pixel sketch is given below.
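The following is a hypothetical per-pixel sketch of this idea (ours; the paper does not give implementation details, and the names label_frame and models are our assumptions). Each pixel j keeps its own OLK_N model (alphas, xs, rho), initialized to ([], [], 0.0), so the first frames serve as burn-in; olkn_update is from Section III-C:

```python
import numpy as np

def label_frame(frame, models, C=0.2, v=0.1, r=0.001, p=5.0):
    """Label each pixel of a frame: novel (foreground) = -1, normal = +1,
    using the decision rule (56), then update that pixel's OLK_N model."""
    k = lambda a, b: np.exp(-abs(float(a) - float(b)) ** 2 / p ** 2)
    labels = np.empty(len(frame), dtype=int)
    for j, pix in enumerate(frame):          # frame: 1-D array of pixel values
        alphas, xs, rho = models[j]
        f = sum(a * k(c, pix) for a, c in zip(alphas, xs))
        labels[j] = -1 if f - 1.0 - rho < 0 else 1
        models[j] = olkn_update(alphas, xs, rho, pix, C, v, r, kernel=k)
    return labels
```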

In this example, a real scene with a dynamic background is considered. The Jug data in [34] are used. The data set contains 117 frames of images, with 230 × 320 pixels in each image. The background is rippling water and the foreground is a floating jug. All the image frames can be downloaded from http://www.cs.bu.edu/groups/ivc/data.php. The parameters of OLK_N are set to r = 0.001, p = 5, C = 0.2, and v = 0.1 in the experiment. A selected batch of the original image sequence and the visual results are given in Figs. 9 and 10, respectively. It is observed that OLK_N produces the foreground segments without image pre-processing and post-processing. Thus, OLK_N is capable of dealing with image scenes with a dynamic background.

Fig. 9. Original Jug data.

Fig. 10. Results of processing the Jug117 data with r = 0.001, v = 0.1, p = 10, and C = 0.2 by OLK_N.

Similar experiments by NORMA with different parameters were also conducted. Relatively good results with λ = 0.0001, p = 15, and η = 0.001 are shown in Fig. 11. Qualitatively, a comparison between the results by OLK_N and the results by NORMA given in Fig. 11 shows that OLK_N performs comparably to NORMA.

Fig. 11. Jug117 data with λ = 0.0001, p = 8, and η = 0.001 by NORMA.

Example 5: Real-time background subtraction with an indoor background. In this example, OLK_N is applied to indoor scenes. The background is a laboratory, and two human subjects simultaneously walk into and out of the scene. Note that the shadows of the human subjects may slightly affect the pixel values of the background. The data are a video sequence containing

100 frames, with 230 × 320 pixels in each frame, captured using a camcorder. The experiments are conducted for two cases. In Case 1, all 100 frames are used, while in Case 2 only nine selected frames (i.e., frames 11, 20, 29, 38, 47, 56, 65, 74, and 82) are used. The purpose of Case 2 is to test the performance of OLK_N in scenarios with small samples. The frames of the image sequence are shown in Fig. 12. The parameters are the same as those in Example 4 for Case 1, and the parameters p and C are set to 20 and 0.5 for Case 2 in consideration of the small number of frames.

Fig. 12. Original data of Video-DSCF6721.

Fig. 13. Results of processing Video-DSCF6721 with the nine-frame data (r = 0.001, v = 0.1, p = 20, C = 0.5).

Note that for both OLK_N and NORMA running in the MATLAB environment, it takes only a few seconds to process each frame in Examples 4 and 5. To save space, only the visual results of the above nine frames for Case 2 are given in Fig. 13. Overall, OLK_N performs successfully in producing segmentations of the foregrounds in both cases. It seems that OLK_N in Case 2 produces even better foreground points than in Case 1, but with a higher level of noise in the background points. The successful application in Case 2 implies that OLK_N can be applied to situations in which only a limited number of samples are available. This makes OLK_N all the more meaningful, as the learning samples may be very limited in many applications.


V. CONCLUSION

In this paper, we considered the modeling of online learning with Kernels in an RKHS H. The major contributions of this paper are summarized as follows.
1) We proposed a method to derive the online learning algorithms OLK_C, OLK_R, and OLK_N for classification, regression, and novelty detection, which is new in the field of online learning with Kernels and SVM.
2) We tested the obtained algorithms in classification, regression, and novelty detection, including background subtraction in image processing. The performances of the algorithms show their effectiveness.
3) It was shown, based on the experimental results in both classification and regression, that the accuracy of the OLK algorithms is comparable with that of traditional SVM-based algorithms, such as SVM and LS-SVM, and with state-of-the-art algorithms, such as the KRLS and projectron methods, while it is slightly higher than that of NORMA. On the other hand, the computational cost of the OLK algorithms is comparable with or slightly lower than that of existing online methods, such as the above-mentioned NORMA, KRLS, and projectron methods, but much lower than that of SVM-based algorithms. In addition, different from SVM and LS-SVM, the OLK algorithms can be applied to non-stationary problems. Also, the applicability of OLK in novelty detection is illustrated by simulation results.
4) The proposal of model-based online learning with Kernels strengthens the foundation of online learning with Kernels and also enriches the research area of SVM.

REFERENCES

[1] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Cambridge, MA: MIT Press, 2002.
[2] G. Li, C. Wen, W. Zheng, and Y. Chen, "Identification of a class of nonlinear autoregressive with exogenous inputs models based on Kernel machines," IEEE Trans. Signal Process., vol. 59, no. 5, pp. 2146–2158, May 2011.
[3] M. Ramona, G. Richard, and B. David, "Multiclass feature selection with Kernel gram-matrix-based criteria," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1611–1623, Oct. 2012.
[4] J. Kivinen, A. Smola, and R. Williamson, "Online learning with Kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165–2176, Aug. 2004.
[5] D. J. Sebald and J. A. Bucklew, "Support vector machine techniques for nonlinear equalization," IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3217–3226, Nov. 2000.
[6] E. Hu, S. Chen, D. Zhang, and X. Yin, "Semisupervised Kernel matrix learning by Kernel propagation," IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1831–1841, Nov. 2010.
[7] H. Ning, X. Jing, and L. Cheng, "Online identification of nonlinear spatiotemporal systems using Kernel learning approach," IEEE Trans. Neural Netw., vol. 22, no. 9, pp. 1381–1394, Sep. 2011.
[8] G. Li and C. Wen, "Identification of Wiener systems with clipped observations," IEEE Trans. Signal Process., vol. 60, no. 7, pp. 3845–3852, Jul. 2012.
[9] S. Liwicki, S. Zafeiriou, G. Tzimiropoulos, and M. Pantic, "Efficient online subspace learning with an indefinite Kernel for visual tracking and recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 10, pp. 1624–1636, Oct. 2012.

[10] S. I. Gallant, "Perceptron-based learning algorithms," IEEE Trans. Neural Netw., vol. 1, no. 2, pp. 179–191, Feb. 1990.
[11] M. Fernandez-Delgado, J. Ribeiro, E. Cernadas, and S. B. Ameneiro, "Direct parallel perceptrons (DPPs): Fast analytical calculation of the parallel perceptrons weights with margin control for classification tasks," IEEE Trans. Neural Netw., vol. 22, no. 11, pp. 1837–1848, Nov. 2011.
[12] Y. Freund and R. E. Schapire, "Large margin classification using the perceptron algorithm," Mach. Learn., vol. 37, no. 3, pp. 277–296, 1999.
[13] O. Dekel, S. S. Shwartz, and Y. Singer, "The forgetron: A Kernel-based perceptron on a budget," SIAM J. Comput., vol. 37, no. 5, pp. 1342–1372, 2008.
[14] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jan. 2011.
[15] Y. Engel, S. Mannor, and R. Meir, "The Kernel recursive least-squares algorithm," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug. 2004.
[16] W. He, "Limited stochastic meta-descent for Kernel-based online learning," Neural Comput., vol. 21, no. 9, pp. 2667–2686, 2009.
[17] F. Orabona, J. Keshet, and B. Caputo, "Bounded Kernel-based online learning," J. Mach. Learn. Res., vol. 10, pp. 2643–2666, Nov. 2009.
[18] P. Bouboulis, K. Slavakis, and S. Theodoridis, "Adaptive learning in complex reproducing Kernel Hilbert spaces employing Wirtinger's subgradients," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 3, pp. 425–438, Mar. 2012.
[19] M. Davy, F. Desobry, A. Gretton, and C. Doncarli, "An online support vector machine for abnormal events detection," Signal Process., vol. 86, pp. 2009–2025, Sep. 2006.
[20] J. Kivinen, M. K. Warmuth, and B. Hassibi, "The p-norm generalization of the LMS algorithm for adaptive filtering," IEEE Trans. Signal Process., vol. 54, no. 5, pp. 1782–1793, May 2006.
[21] L. Cheng, M. Gong, D. Schuurmans, and T. Caelli, "Real-time discriminative background subtraction," IEEE Trans. Image Process., vol. 20, no. 10, pp. 1401–1414, Oct. 2011.
[22] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[23] J. C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1998.
[24] E. A. D. Oliveira and N. Caticha, "Inference from aging information," IEEE Trans. Neural Netw., vol. 21, no. 6, pp. 1015–1020, Jun. 2010.
[25] Y. Dong and G. N. DeSouza, "Adaptive learning of multisubspace for foreground detection under illumination changes," Comput. Vis. Image Understand., vol. 115, no. 1, pp. 31–39, 2011.
[26] H. Sun, "Mercer theorem for RKHS on noncompact sets," J. Complex., vol. 21, no. 3, pp. 337–349, 2005.
[27] S. Vijayakumar and H. Ogawa, "RKHS-based functional analysis for exact incremental learning," Neurocomputing, vol. 29, pp. 85–113, Oct. 1999.
[28] S. Boyd and L. Vandenberghe, Convex Optimization (2004). [Online]. Available: http://www.stanford.edu/~boyd/cvxbook/
[29] S. G. Nash and A. Sofer, Linear and Nonlinear Programming. New York: McGraw-Hill, 1996.
[30] B. Schölkopf, R. Herbrich, and A. J. Smola, "A generalized representer theorem," in Proc. Annu. Conf. Comput. Learn. Theory, 2001, pp. 416–426.
[31] G. Li, C. Wen, G. B. Huang, and Y. Chen, "Error tolerance based support vector machine for regression," Neurocomputing, vol. 74, no. 5, pp. 771–782, 2011.
[32] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.
[33] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999.
[34] J. Zhong and S. Sclaroff, "Segmenting foreground objects from a dynamic textured background via a robust Kalman filter," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, pp. 44–50.
Williamson, “Estimating the support of a high-dimensional distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001. [33] J. A. K. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, 1999. [34] J. Zhong and S. Sclaroff, “Segmenting foreground objects from a dynamic textured background via a robust Kalman filter,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, pp. 44–50.


Guoqi Li (M’12) received the B.Eng. degree from the Xi’an University of Technology, Xi’an, China, and the M.Eng. degree from Xi’an Jiaotong University, Xi’an, in 2004 and 2007, respectively, and the Ph.D. degree in electrical engineering from Nanyang Technological University, Singapore, in 2012. He has been a Scientist with the Data Storage Institute, Singapore, since 2011. His current research interests include artificial cognitive memory, system identification, neural networks, statistical learning theory, and pattern-based learning control. He has authored or co-authored about 20 papers in journals and conferences. Dr. Li is a Reviewer for a number of international journals and conferences, such as the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, Automatica, the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, Pattern Recognition, Neurocomputing, IET Signal Processing, the IEEE Conference on Decision and Control, the American Control Conference, the IEEE Conference on Industrial Electronics and Applications, and ICARCV.

Changyun Wen (F’10) received the B.Eng. degree from Xi’an Jiaotong University, Xi’an, China, and the Ph.D. degree from the University of Newcastle, Newcastle, Australia, in 1983 and 1990, respectively. He was a Post-Doctoral Fellow with the University of Adelaide, Adelaide, Australia, from 1989 to 1991. He then joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as a Lecturer, where he was promoted to Full Professor through the first Promotion and Tenure Review Exercise in 2008. His current research interests include adaptive control, modeling and control of active chilled beams for HVAC systems, switching and impulsive systems, model-based online learning, system identification, control and synchronization of chaotic systems, and biomedical control systems and signal processing. Dr. Wen was a recipient of the IES Prestigious Engineering Achievement Award from the Institution of Engineers, Singapore (IES), in 2005. He is an Associate Editor of a number of journals, including Automatica and the IEEE Control Systems Magazine, and was an Associate Editor of the IEEE TRANSACTIONS ON AUTOMATIC CONTROL from 2000 to 2002. He is actively involved in organizing international conferences, having served as General Chair, General Co-Chair, Technical Program Committee Chair, General Advisor, and Publicity Chair, among other roles. He is a Distinguished Lecturer of the IEEE Control Systems Society.

Zheng Guo Li (SM’03) received the B.Sci. and M.Eng. degrees from Northeastern University, Shenyang, China, in 1992 and 1995, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2001. He joined MPEG in 2002, where he was involved in the development of H.264 and HEVC; three of his informative proposals were adopted by H.264 and three of his normative proposals were adopted by HEVC. He is currently with the Agency for Science, Technology and Research, Singapore. He has authored or co-authored more than 150 journal and conference papers and co-authored one monograph. He holds six granted patents, including normative technologies on the scalable extension of H.264. His current research interests include high-dynamic-range imaging, video processing, QoS, hybrid systems, and chaotic secure communication.


Dr. Li is an elected member of the IEEE Visual Signal Processing and Communications Technical Committee. He is the Invited Session Chair of ICARCV 2008, the Technical Chair of the IEEE ICIEA 2010, the General Chair of the IEEE ICIEA 2011, the Best Paper Award Chair of the IEEE ICIEA 2012, the Technical Brief Co-Chair of SIGGRAPH Asia 2012, the General Co-Chair of the CCDC 2013, and the Workshop Chair of the IEEE ICME 2013. He is an Invited Lecturer of the 2011 IEEE Signal Processing Summer School on “3-D and High Definition/High Contrast Video Processing Systems” and a Distinguished Invited Lecturer of the IEEE INDIN 2012.

Aimin Zhang received the B.Eng., M.Eng., and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 1983, 1989, and 2008, respectively. She has been with the School of Electronic and Information Engineering, Xi’an Jiaotong University, since 1983, where she is currently an Associate Professor. Her current research interests include adaptive control, new energy control systems, and embedded intelligent measurement and control systems. Dr. Zhang is a member of the Council of the Shaanxi Instrument Association.

Feng Yang (M’10) received the B.Eng. and M.Eng. degrees in electronic and information engineering from Xi’an Jiaotong University, Xi’an, China, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University (NTU), Singapore. He was a Research Assistant with NTU from 2011 to 2012. Since 2012, he has been a Research Scientist with the Institute of High Performance Computing, Singapore. His current research interests include data mining, pattern recognition, and complex system monitoring.

Kezhi Mao (M’09) received the B.Eng. degree from Jinan University, Guangzhou, China, the M.Eng. degree from Northeastern University, Shenyang, China, and the Ph.D. degree from the University of Sheffield, Sheffield, U.K., in 1989, 1992, and 1998, respectively. He was a Research Associate with the University of Sheffield in 1998 and then a Research Fellow with the Centre for Signal Processing (now part of I2R of A*STAR), Singapore, from 1998 to 2001. He joined the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, as an Assistant Professor in 2001, where he is currently an Associate Professor. His current research interests include machine learning, computational intelligence, biomedical image analysis, and bioinformatics.
