
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 1, JANUARY 2014

Online Bayesian Learning With Natural Sequential Prior Distribution

Yohei Nakada, Member, IEEE, Makio Wakahara, and Takashi Matsumoto, Life Fellow, IEEE

Abstract—Online Bayesian learning has been successfully applied to online learning for multilayer perceptrons and radial basis functions. Typically, the conventional transition model has been used. Because this transition model is based only on the squared norm of the difference between the current and previous parameter vectors, it does not adequately reflect the difference between the current and previous observation models. To account for this difference between the observation models, we propose a natural sequential prior. The proposed transition model uses a Fisher information matrix to capture the difference between the observation models more naturally. For validation, the proposed transition model is applied to an online learning problem for a three-layer perceptron.

Index Terms—Bayesian learning, Fisher information, online learning, prior distribution, sequential Monte Carlo (SMC).

Manuscript received April 11, 2012; revised February 12, 2013; accepted February 20, 2013. Date of publication March 29, 2013; date of current version December 13, 2013. Y. Nakada is with the College of Science and Engineering, Aoyama Gakuin University, Kanagawa 252-5258, Japan (e-mail: [email protected]). M. Wakahara and T. Matsumoto are with Waseda University, Tokyo 165-8050, Japan (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2013.2250999

I. INTRODUCTION

GIVEN data z_t (t = 1, 2, 3, ...) generated sequentially from an unknown time-varying system, the problem of approximating the target system by a parameterized probabilistic model P(z_t | θ) via the data sequence z_{1:t} = (z_1, ..., z_t) is often called an online learning problem. Such online learning problems arise in many fields of science and engineering, e.g., control systems, signal processing, and data analysis. In many cases, approaches for tackling such problems are required to sequentially track the target system behind the data as the observations arrive one after another.

A reasonable approach to such problems is the online Bayesian approach, in which two probabilistic models are typically considered: 1) an observation model P(z_t | θ_t), parameterized by a time-varying parameter θ_t ∈ ℝ^k, for the behavior of the observation data z_t at time t; and 2) a transition model (prior distribution) P(θ_t | θ_{t-1}) describing the behavior of the time-varying parameter θ_t. Using these two models, the online Bayesian approach enables us to track the unknown system behind the data z_t. Here, the integer k stands for the dimension of the parameter θ_t. The observation model P(z_t | θ_t) and the transition model P(θ_t | θ_{t-1}), described later, depend on hyperparameters β_t, denoting the noise level of the observation data, and γ_t, controlling the scale of the transition of θ_t. For simplicity, the dependencies of the models on the hyperparameters are omitted until Section IV.¹

In many applications of online Bayesian approaches, the definition of the transition model P(θ_t | θ_{t-1}) can be an important problem. Consider first a case where the observation model P(z_t | θ_t) can be defined directly from knowledge about the system behind the data. In this case, the relation between the parameter of the observation model P(z_t | θ_t) and the target system behind the data z_t may be clear. Moreover, the implications of the transition model P(θ_t | θ_{t-1}) can be understood, and modeling the transition model may not be such a difficult task. Indeed, online Bayesian approaches have been successfully applied to real-world problems with several observation models based on linear regression models or autoregressive (AR) models (e.g., [1]–[6]).

On the other hand, in many cases where sufficient knowledge about the system behind the data is not available, probabilistic models based on (nonlinear) basis functions (such as multilayer perceptrons and radial basis functions), which can approximate various functions/systems, are used as the observation models P(z_t | θ_t). For such observation models with basis functions, transition models P(θ_t | θ_{t-1}) based on the squared norm of the difference between the current parameter θ_t and the previous parameter θ_{t-1}, which can be written as

P(θ_t | θ_{t-1}) ∝ exp(−(γ_t/2) ||θ_t − θ_{t-1}||²)    (1)

have typically been considered (e.g., [7]–[10]). Here, the variable γ_t is often called the hyperparameter, which controls the scale of the transition.
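In implementation terms, the transition model (1) is an isotropic Gaussian random walk on the parameter vector whose precision is γ_t. A minimal sketch (the function and variable names are ours, for illustration only):

```python
import numpy as np

def conventional_transition_sample(theta_prev, gamma_t, rng):
    """Draw theta_t ~ P(theta_t | theta_{t-1}) of Eq. (1):
    an isotropic Gaussian around theta_prev with covariance (1/gamma_t) * I."""
    std = 1.0 / np.sqrt(gamma_t)
    return theta_prev + std * rng.standard_normal(theta_prev.shape)

rng = np.random.default_rng(0)
theta_prev = np.zeros(3)
theta_t = conventional_transition_sample(theta_prev, gamma_t=100.0, rng=rng)
```

A larger γ_t shrinks the expected step, so γ_t directly controls how quickly the model is allowed to drift.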
However, the squared norm ||θ_t − θ_{t-1}||² does not adequately relate to the difference between the current observation model P(z_t | θ_t) and the previous observation model P(z_{t-1} | θ_{t-1}); the change between these observation models depends strongly on the value of θ_{t-1}. For example, under the transition model (1), the change between the observation models can be extremely small at θ_{t-1} = θ_a yet large at θ_{t-1} = θ_b. This phenomenon can occur when the eigenvalues of

¹In many frameworks, variables describing the noise level of the data (i.e., β_t described later) may be called parameters rather than hyperparameters. However, such variables are treated as hyperparameters in several Bayesian learning approaches that consider hierarchical structures of probabilistic models (e.g., [43], [44], [46]). Following this precedent, the variable describing the noise level of the data is treated as a hyperparameter in this paper.

2162-237X © 2013 IEEE

NAKADA et al.: ONLINE BAYESIAN LEARNING WITH NATURAL SEQUENTIAL PRIOR DISTRIBUTION

the Fisher information matrix are small at θ_{t-1} = θ_a and when the eigenvalues of the Fisher information matrix are large at θ_{t-1} = θ_b. This phenomenon of the transition model (1) can adversely affect the learning and/or tracking of the system behind the data z_t.

In this paper, we propose the natural sequential prior transition model

P(θ_t | θ_{t-1}) ∝ exp(−(γ_t/2) Δθ_t^T F_{t-1} Δθ_t)    (2)

to adequately consider the difference between the observation models P(z_t | θ_t) and P(z_{t-1} | θ_{t-1}), and to avoid such adverse influences on the learning and/or tracking of the target system. Here, Δθ_t = θ_t − θ_{t-1}, and T stands for the transpose operator of vectors and matrices. The natural sequential prior uses the Fisher information matrix

F_{t-1} = ∫ P(z | θ_{t-1}) (∂ log P(z | θ_{t-1})/∂θ_{t-1}) (∂ log P(z | θ_{t-1})/∂θ_{t-1})^T dz

derived from the observation model P(z_{t-1} | θ_{t-1}), in order to consider the difference between the observation models more naturally. In information geometry approaches, this Fisher information matrix F_{t-1} ∈ ℝ^{k×k} is defined as the metric matrix of the model manifold corresponding to the observation model P(z_{t-1} | θ_{t-1}) [11]. The natural sequential prior is based on the norm Δθ_t^T F_{t-1} Δθ_t, which respects this metric of the model manifold.

This paper is mainly related to learning methods based on Bayesian approaches. Although the contribution of this paper is likely to be of interest mainly to researchers of Bayesian approaches, we do not target only these researchers. For example, we consider many researchers who are interested in approaches based on information theory and/or information geometry as potential readers, since the proposed model is closely related to several studies based on such approaches, which have widely contributed to improved learning theories for neural networks and learning systems. In addition, many approaches based on information theory [12]–[14] and/or information geometry [15], [16] can be found in this journal.
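Concretely, the prior (2) replaces the Euclidean step length of (1) with the Fisher-weighted quadratic form Δθ_t^T F_{t-1} Δθ_t. A small sketch with a hypothetical 2×2 Fisher information matrix (our own toy numbers) shows how two steps of equal Euclidean norm can differ by orders of magnitude under this metric:

```python
import numpy as np

# Hypothetical Fisher information matrix with very different eigenvalues.
F = np.array([[100.0, 0.0],
              [0.0,   0.01]])

step_a = np.array([0.1, 0.0])   # moves along the high-information direction
step_b = np.array([0.0, 0.1])   # moves along the low-information direction

def fisher_norm_sq(delta, F):
    """Quadratic form delta^T F delta used by the natural sequential prior (2)."""
    return float(delta @ F @ delta)

print(fisher_norm_sq(step_a, F))  # ~1.0
print(fisher_norm_sq(step_b, F))  # ~1e-4
```

A step along a high-information direction changes the observation model far more than an equally long step along a low-information direction, which is exactly the asymmetry that the conventional prior (1) ignores.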
Thus, we believe that this paper will be of interest not only to specific groups but also to a broader readership of this journal.

In the next section, we mention some work related to this paper from several viewpoints. Section III describes an information-theoretic view of the natural sequential prior. In Section IV, concrete model specifications with the natural sequential prior are described for online Bayesian learning. An implementation of online Bayesian learning with the natural sequential prior is explained in Section V. For validation, the natural sequential prior is applied to the online learning problem for a three-layer perceptron in Sections VI and VII. In Section VIII, we discuss two possible improvements. In the last section, we give some conclusions and describe future work.

II. RELATED WORK

In this section, we mention some work related to this paper considered from four viewpoints: 1) state space models; 2) prior distributions; 3) Bayesian machine learning; and 4) non-Bayesian online learning.


A. State Space Models

Combinations of an observation model for observation data and a transition model for unknown variables, such as parameters, can be described as state space models [17]–[21]. At least three classes of state space models can be found: 1) standard state space models for linear Gaussian models; 2) general state space models, which extend the standard state space models to nonlinear and/or non-Gaussian models; and 3) self-organizing state space models, which also consider transition models of hyperparameters. More details of self-organizing state space models are described in [20] and [21].

Such state space models, in particular standard state space models, have a long history. Models that can be classified as standard state space models have been proposed and applied in several fields, including Bayesian statistics and time-series analysis. For instance, we can find standard state space models such as time-varying AR models [1]–[3] and dynamic linear models [4]–[6]. For these models, the Kalman filter method [22] has been used as an implementation method to evaluate the filtering distributions (posterior distributions) and/or the predictive distributions. For extensions of such standard state space models involving partial nonlinearity and/or non-Gaussian properties, the extended Kalman filter (EKF) method [23] has long been used to approximate the filtering distribution. The EKF method approximates a target filtering distribution by a unimodal Gaussian distribution and has therefore often been shown to be inadequate for approximating the filtering distributions of general state space models when those distributions are multimodal. On the other hand, the sequential Monte Carlo (SMC) method (also known as a particle filter or Monte Carlo filter) began to attract attention with the explosive growth in computational power over the past few decades.
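As a concrete illustration (not taken from the paper; all names and numbers are ours), the simplest standard state space model pairs a Gaussian random-walk transition with a Gaussian observation model, and the Kalman filter evaluates its filtering distribution in closed form:

```python
import numpy as np

# Scalar linear Gaussian state space model:
#   transition:  theta_t = theta_{t-1} + N(0, q_std^2)
#   observation: z_t     = theta_t     + N(0, r_std^2)
rng = np.random.default_rng(0)
T, q_std, r_std = 100, 0.1, 0.5

theta = np.zeros(T)
z = np.zeros(T)
for t in range(1, T):
    theta[t] = theta[t - 1] + q_std * rng.standard_normal()
    z[t] = theta[t] + r_std * rng.standard_normal()

def kalman_step(m, P, z_t, q_std, r_std):
    """One Kalman filter step: predict through the random walk, then update."""
    P_pred = P + q_std ** 2
    K = P_pred / (P_pred + r_std ** 2)          # Kalman gain
    return m + K * (z_t - m), (1.0 - K) * P_pred

m, P = 0.0, 1.0                                  # Gaussian filtering distribution
for t in range(1, T):
    m, P = kalman_step(m, P, z[t], q_std, r_std)
# (m, P) now parameterize the filtering distribution of theta_T given z_{1:T}.
```

For nonlinear or non-Gaussian extensions, this closed-form recursion is no longer available, which is where the EKF and SMC methods discussed above come in.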
The SMC method can be applied to a wider variety of filtering distributions, including multimodal cases, e.g., online estimation of stochastic volatility models in finance time-series analysis [20], [30]–[32], online learning for perceptrons and/or radial basis functions [7]–[10], and online estimation of hyperparameters in self-organizing state space models [19].

The model structure used in this paper can be viewed as a state space model. The challenge in this paper was to propose a more natural state space model for online Bayesian learning compared with other studies of state space models.

B. Prior Distributions

This paper can also be regarded as an attempt to design natural and/or adequate prior distributions by considering the characteristics of the observation model and/or posterior distribution. Many attempts to design such natural and/or adequate prior distributions have been proposed from several concepts based on (subjective or objective) Bayesian statistics, information theory, and information geometry.

For instance, the natural conjugate prior [33] was proposed by considering the natural conjugacy between a prior distribution and a posterior distribution. Such a natural conjugate prior is known to be a mathematically convenient family of prior distributions that facilitates mathematical analysis and/or computational implementation. In [34], the use of such a natural conjugate prior was advocated for an exponential family. The Jeffreys (invariant) prior [35], which uses the Fisher information matrix and was designed based on parameter-transformation invariance under proper conditions, is another well-known example. Another example is the reference prior [36], based on maximization of an expected Kullback–Leibler (KL) divergence between a prior distribution and a posterior distribution. An extension of the Jeffreys prior can be found in the α-parallel prior [37], based on information geometry. These priors can be seen as attempts to design natural noninformative priors from objective Bayesian concepts.

The entropic prior [38]–[41] can be considered the most important example for this paper. The entropic prior is designed by using the Fisher information matrix and the KL divergence between an observation model and a reference measure of observation. More details are given in Section III. In such related studies of prior distributions, however, we could not find any attempt to design an adequate prior distribution for parameter changes by using a Fisher information matrix, as we did in our study.

C. Bayesian Learning

This section describes several studies related to Bayesian learning for basis functions, such as multilayer perceptrons and radial basis functions. Several well-known Bayesian learning methods can be found, e.g., a parameter learning method using maximum a posteriori estimation [42], a hierarchical learning method for parameters and hyperparameters based on an evidence framework with a Gaussian approximation method [43], [44], and a posterior sampling method based on the Markov chain Monte Carlo approach [45], [46].
These methods have been extended and employed in many applications, e.g., nonlinear time-series prediction [47], [48]. The above Bayesian learning methods are not based on online frameworks but on batch Bayesian frameworks (typically using a weight decay prior to avoid overfitting). On the other hand, several methods can also be found for online problems. Online Bayesian methods based on the EKF can be found in [49]–[51]. Gaussian approximation methods without the EKF have also been proposed in [52] and [53]. SMC-based online learning methods, described in [7]–[10], enable us to approximate the posterior distribution (filtering distribution) without Gaussian approximation methods, including the EKF. In such studies of Bayesian learning methods, we could not find a transition model (prior distribution) for parameter changes based on a Fisher information matrix.

D. Non-Bayesian Online Learning

A vast number of online learning methods [54]–[57] that are not based on Bayesian frameworks have been proposed. Many such non-Bayesian online methods are described in [58]. Mainly, these online methods are based on gradients of loss functions (such as negative log-likelihood functions). Among such online methods, the most important with regard to this paper is an online learning method with a natural gradient [59]. This natural gradient method is extended in [60] and [61]. In [60], a natural gradient method with an online adaptive approximation of the Fisher information matrix is proposed. In [61], a natural gradient method combining a regularization method via network information criteria is described. In the natural gradient method, the parameter updating based on a Fisher information matrix has some similarity to the online learning method described in this paper. There is, however, an obvious difference in the parameter update procedures. While the update procedure of the natural gradient method uses the gradient of negative log-likelihood functions, the update procedure described in this paper is based on Monte Carlo methods within an online Bayesian framework.

III. INFORMATION-THEORETIC VIEW OF NATURAL SEQUENTIAL PRIOR

As mentioned in Section I, the natural sequential prior can be described as a transition model designed by using a Fisher information matrix as the metric matrix of the model manifold, in analogy with information geometry. This natural sequential prior can also be elicited by considering optimization problems involving two other types of information: the KL divergence (also known as relative entropy) and the Shannon entropy. This section describes the process of eliciting the natural sequential prior.

A. Preliminaries and Conditions

In this section, the transition model P(θ_t | θ_{t-1}) is assumed to be a multivariate Gaussian distribution with the zero-mean property ∫ Δθ_t P(θ_t | θ_{t-1}) dθ_t = 0

P(θ_t | θ_{t-1}) = (1 / Z(Σ_t)) exp(−(1/2) Δθ_t^T Σ_t^{-1} Δθ_t)    (3)

where the matrix Σ_t ∈ ℝ^{k×k} is a covariance matrix at time t, and the function Z(·) stands for a normalizing constant. Note that this model (3) is the natural sequential prior (2) when the covariance matrix satisfies Σ_t ∝ F_{t-1}^{-1}. Under the condition Σ_t ∝ I, the model (3) is identical to the conventional transition model (1) based on the squared norm ||θ_t − θ_{t-1}||², where the matrix I stands for a unit matrix.

To simplify the discussion here, the following two conditions are assumed for the observation model P(z_t | θ_t): 1) the observation model P(z_t | θ_t) is continuously differentiable sufficiently many times at all θ_t for any z_t, and 2) the Fisher information matrix F_{t-1} is a symmetric positive-definite matrix. As in the previous sections, the dependencies of the models on the hyperparameters are omitted for simplicity.

B. Concepts

This part describes two design concepts for the transition model by using the Shannon entropy S

S(P; θ_{t-1}) = −∫ P(θ_t | θ_{t-1}) log P(θ_t | θ_{t-1}) dθ_t    (4)


which describes the degree of uncertainty in the parameter space of θ_t, and an expectation of the KL divergence L

L(P; θ_{t-1}) = ∫ D_KL(θ_{t-1} || θ_t) P(θ_t | θ_{t-1}) dθ_t    (5)

which can represent the information loss from the previous time t − 1. Here, D_KL(θ_{t-1} || θ_t) stands for the KL divergence² between P(z | θ_t) and P(z | θ_{t-1})

D_KL(θ_{t-1} || θ_t) = ∫ P(z | θ_{t-1}) log [P(z | θ_{t-1}) / P(z | θ_t)] dz.    (6)

²Actually, the results described in this section can also be derived by using the α-divergence

D_α(θ_{t-1} || θ_t) = (4 / (1 − α²)) [1 − ∫ P(z | θ_{t-1})^{(1+α)/2} P(z | θ_t)^{(1−α)/2} dz]

which is a single-parameter generalization of the KL divergence by α ∈ ℝ, or the symmetrized KL divergence

D_SKL(θ_{t-1} || θ_t) = (1/2) D_KL(θ_{t-1} || θ_t) + (1/2) D_KL(θ_t || θ_{t-1})

instead of the KL divergence.

More concretely, we consider two approaches, as follows.

Approach 1: Maximize the Shannon entropy S with respect to P(θ_t | θ_{t-1}) under a fixed expected KL divergence L.

Approach 2: Minimize the expected KL divergence L with respect to P(θ_t | θ_{t-1}) under a fixed Shannon entropy S.

In the context of online learning, the Shannon entropy (4) can be related to the area searched in the parameter space, whereas the expected KL divergence (5) can be regarded as an expected degree of forgetting of the observation model. In this context, the two approaches can be seen as optimizations of the transition model with respect to its searching and forgetting characteristics. One can also see that Approach 1 has some similarity to the framework of the maximum entropy (MaxEnt) prior [62], which also considers a maximization of the Shannon entropy (or its variants). In this sense, the first approach can be regarded as an approach for online Bayesian learning within the MaxEnt prior framework.

C. Optimization Problems

This section formulates optimization problems based on the approaches described above. Under the setting (3), the Shannon entropy (4) can be written as

S(P; θ_{t-1}) = (1/2) log det Σ_t + (k/2) log 2π + k/2    (7)

where det denotes the determinant of a matrix. The expected KL divergence L is difficult to treat when the observation model P(z_t | θ_t) is complex. To ease this difficulty, the KL divergence is expanded around θ_{t-1} by using the Fisher information matrix F_{t-1} as

D_KL(θ_{t-1} || θ_t) = (1/2) Δθ_t^T F_{t-1} Δθ_t + O(||Δθ_t||³)    (8)

where the symbol O(·) stands for Landau's big-O. If we can assume that the term O(||Δθ_t||³) is small enough to be ignored, we get an approximation of the expected KL divergence L as

L(P; θ_{t-1}) ≈ (1/2) tr(F_{t-1} Σ_t)    (9)

by using the covariance matrix Σ_t of the transition model (3).³ Here, the operator tr is defined as the trace of a matrix.

³More strictly, to ignore the term O(||Δθ_t||³), we need additional conditions, e.g., a condition that the absolute values of the elements of Σ_t are small. At least, this condition on the elements of Σ_t can be satisfied by using a large hyperparameter γ_t when Σ_t is controlled by γ_t, as described later.

By using (7) and (9), we can derive the following optimization problems with respect to the symmetric nonnegative-definite matrix Σ_t from the approaches described in the previous subsection.

Problem 1: Minimize tr(F_{t-1} Σ_t) subject to det Σ_t = c_1.

Problem 2: Maximize det Σ_t subject to tr(F_{t-1} Σ_t) = c_2.

Here, c_1 and c_2 are (positive) constants. To derive these optimization problems, we use the fact that the Shannon entropy (7) is a monotonically increasing function of det Σ_t. One can see that the solutions of these optimization problems satisfy the relation Σ_t ∝ F_{t-1}^{-1}; namely, the natural sequential prior is a solution of both optimization problems.

D. Relation to Entropic Prior

As mentioned in Section II, there is a relation between the natural sequential prior and the entropic prior [38]–[41]. Actually, the natural sequential prior can also be elicited by considering an approximation of an extension of the entropic prior. This part describes an elicitation of the natural sequential prior from the entropic prior.

The entropic prior can be regarded as a special extension of the MaxEnt prior [62] based on an entropy for the joint distribution of the observation data z and a parameter θ. For a parametric probabilistic model P(z | θ) of the observation data z parameterized by θ, the entropic prior P(θ) is given by

P(θ) = (1 / Ψ(ζ)) exp(ζ H(θ)) √(det F(θ))    (10)

where the function Ψ(·) stands for a normalizing constant

Ψ(ζ) = ∫ exp(ζ H(θ)) √(det F(θ)) dθ    (11)

and the function H(θ) stands for the "entropy of the likelihood" [41]

H(θ) = −∫ P(z | θ) log [P(z | θ) / m(z)] dz.    (12)

The matrix F(θ) stands for the Fisher information matrix at θ, and the function m(z) is the reference measure (or distribution), which can be considered a prior distribution for the data z. The (nonnegative) variable ζ stands for a hyperparameter of the entropic prior (10). One can easily see that this entropic prior (10) is identical to the Jeffreys prior [35], P(θ) ∝ √(det F(θ)), when the hyperparameter ζ is set to ζ = 0. One may also see that this prior (10) has parameter-transformation invariance, similar to the Jeffreys prior. From



this context, we can regard the entropic prior (10) as a single-parameter extension of the Jeffreys prior by using the hyperparameter ζ and the entropy of the likelihood H(θ).

In Bayesian frameworks with the entropic prior, the reference measure m(z) can be selected by considering the characteristics of the observation data z [41]. A reasonable setting for the reference measure m(z) was described by Rodríguez in [39]. Rodríguez's idea is to define the reference measure as m(z) = P(z | θ*) by using the model P(z | θ) at θ = θ*, where the parameter value θ* is an "initial guess" of the parameter θ. Specifically, Rodríguez's entropic prior is given by

P(θ) = (1 / Ψ(ζ)) exp(−ζ D_KL(θ || θ*)) √(det F(θ))    (13)

since the entropy of the likelihood can be written as H(θ) = −D_KL(θ || θ*) under m(z) = P(z | θ*).

We can extend Rodríguez's entropic prior (13) to the time-varying parameter θ_t in online Bayesian frameworks by considering the current parameter θ_t as the target parameter θ and the previous parameter θ_{t-1} as the initial guess value θ*. That is, an extension of Rodríguez's entropic prior for online Bayesian frameworks is

P(θ_t | θ_{t-1}) = (1 / Ψ(ζ)) exp(−ζ D_KL(θ_t || θ_{t-1})) √(det F(θ_t)).    (14)

The logarithm of this extension (14) is written as

log P(θ_t | θ_{t-1}) = −ζ D_KL(θ_t || θ_{t-1}) + (1/2) log det F(θ_t) − log Ψ(ζ).    (15)

In the following argument, let us make two assumptions: 1) the divergence D_KL(θ_t || θ_{t-1}) can be approximated as D_KL(θ_t || θ_{t-1}) ≈ (1/2) Δθ_t^T F_{t-1} Δθ_t by considering the expansion at θ_t = θ_{t-1}

D_KL(θ_t || θ_{t-1}) = (1/2) Δθ_t^T F_{t-1} Δθ_t + O(||Δθ_t||³)    (16)

and 2) we can ignore the influence of the second term, (1/2) log det F(θ_t), in (15).⁴ Under these two assumptions, we can approximate (15) by

log P(θ_t | θ_{t-1}) ≈ −(ζ/2) Δθ_t^T F_{t-1} Δθ_t + constant.    (17)

That is, the extension of Rodríguez's entropic prior (14) is approximated by

P(θ_t | θ_{t-1}) ≈ (1 / Z(ζ F_{t-1})) exp(−(ζ/2) Δθ_t^T F_{t-1} Δθ_t)    (18)

where the function Z(·) stands for a normalizing constant. This approximated prior (18) is obviously identical to the natural sequential prior (2) with γ_t = ζ.

⁴At least, assumption 2) can be made when the hyperparameter ζ is sufficiently large, under the nonsingular condition det F(θ_t) ≠ 0.

IV. MODEL SPECIFICATIONS

This section explains the model specifications for online Bayesian learning with the natural sequential prior. Note that the dependencies of the models on the hyperparameters β_t and γ_t are explicitly described from here on.
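Before turning to the concrete specifications, note that the quadratic approximation of the KL divergence used in (8) and (16) is easy to check numerically on a toy model. As an illustration (our own example, not from the paper), take a Bernoulli observation model with success probability θ, whose Fisher information is F(θ) = 1/(θ(1 − θ)):

```python
import math

def kl_bernoulli(p, q):
    """Exact KL divergence D_KL(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_quadratic_approx(p, dp):
    """Second-order approximation (1/2) * F(p) * dp^2 as in (8) and (16)."""
    fisher = 1.0 / (p * (1.0 - p))       # Fisher information of Bernoulli(p)
    return 0.5 * fisher * dp ** 2

p, dp = 0.3, 0.01
exact = kl_bernoulli(p, p + dp)
approx = kl_quadratic_approx(p, dp)
# For a small step, the O(dp^3) remainder is negligible.
print(exact, approx)
```

For dp = 0.01 the two values agree to within roughly one percent; the agreement degrades as the step grows, which is why the argument requires the transition scale (a large γ_t) to keep Δθ_t small.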

A. Observation Model

In this paper, we consider observation data z_t generated by an input–output system. The observation model P(z_t | θ_t, β_t) describing the observation data z_t is

P(z_t | θ_t, β_t) = P(y_t | x_t, θ_t, β_t) P(x_t)    (19)

where the variables y_t ∈ ℝ and x_t ∈ ℝ^{n_x} stand for the output variable and the input variable at time t, respectively, and the observation data z_t is the pair of these variables, i.e., z_t = (y_t, x_t). The function P(x_t) describes the probability distribution of the input variable x_t, and we assume that this distribution does not depend on the parameter θ_t; i.e., the distribution P(x_t) is not a target of estimation here. On the other hand, the conditional distribution P(y_t | x_t, θ_t, β_t) of the output variable y_t is

P(y_t | x_t, θ_t, β_t) = (1 / Z(β_t)) exp(−(β_t/2) (y_t − f(x_t; θ_t))²)    (20)

by assuming that the variable y_t contains Gaussian noise. The variable β_t ∈ (0, ∞) stands for a hyperparameter describing the inverse variance of the Gaussian noise at time t. The function Z(·) is a normalizing constant. The function f(·) denotes a parameterized basis function. In the experiments described later, the following three-layer perceptron is used as the function f(·):

f(x_t; θ_t) = a_{0,t} + Σ_{i=1}^{n_h} a_{i,t} σ(x_t; b_{i,t})    (21)

where the integer n_h > 0 denotes the number of hidden units, and the integer n_x > 0 is the dimension of the input variable x_t. The parameter a_{0,t} ∈ ℝ is the output bias, the parameter a_{i,t} ∈ ℝ (i = 1, ..., n_h) is the weight of the ith hidden unit, and the parameter vector b_{i,t} ∈ ℝ^{n_x+1} (i = 1, ..., n_h) is the weight vector connecting the inputs x_t and the input bias to the ith hidden unit. The function σ(·), which is the output function of the hidden units, is

σ(x_t; b_{i,t}) = φ(b_{i0,t} + Σ_{j=1}^{n_x} b_{ij,t} x_{j,t})    (22)

where the variable b_{ij,t} stands for the jth component of the parameter vector b_{i,t} (i.e., b_{i,t} = (b_{i0,t}, ..., b_{in_x,t})), and the variable x_{j,t} stands for the jth component of the input variable x_t (i.e., x_t = (x_{1,t}, ..., x_{n_x,t})). In the experiments described in this paper, the following logistic sigmoid function is used as the function φ(·):

φ(r) = 1 / (1 + exp(−r)).    (23)

In the settings described here, the parameter θ_t is written as θ_t = (a_t, b_t), where a_t = (a_{0,t}, ..., a_{n_h,t}) and b_t = (b_{1,t}, ..., b_{n_h,t}).
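The observation model (19)–(23) can be sketched directly in code. The following minimal implementation (illustrative shapes and names are ours) computes the three-layer perceptron output (21)–(23) and the Gaussian log-likelihood corresponding to (20):

```python
import numpy as np

def mlp_forward(x, a, b):
    """Three-layer perceptron of Eqs. (21)-(23).
    x: input vector of length n_x
    a: output weights (a_0, ..., a_{n_h}), length n_h + 1
    b: hidden weights, shape (n_h, n_x + 1); b[i, 0] is the input bias."""
    pre = b[:, 0] + b[:, 1:] @ x                  # Eq. (22) pre-activations
    s = 1.0 / (1.0 + np.exp(-pre))                # logistic sigmoid, Eq. (23)
    return a[0] + a[1:] @ s                       # Eq. (21)

def log_likelihood(y, x, a, b, beta):
    """Gaussian log-likelihood of Eq. (20), with inverse noise variance beta."""
    resid = y - mlp_forward(x, a, b)
    return 0.5 * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid ** 2

rng = np.random.default_rng(0)
n_h, n_x = 4, 2
a = rng.standard_normal(n_h + 1)
b = rng.standard_normal((n_h, n_x + 1))
x = np.array([0.5, -1.0])
print(mlp_forward(x, a, b))
```

At y = f(x; θ) the residual vanishes, so the log-likelihood reduces to the normalizing term alone, which is a convenient sanity check on the implementation.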


B. Fisher Information Matrix

Given the observation model in (19), the Fisher information matrix F_t can be written as

F_t = ∫∫ P(x_t) P(y_t | x_t, θ_t, β_t) (∂ log P(y_t | x_t, θ_t, β_t)/∂θ_t) (∂ log P(y_t | x_t, θ_t, β_t)/∂θ_t)^T dx_t dy_t.    (24)

Specifically, when the conditional distribution of the output variable y_t is written as (20), the Fisher information matrix (24) can be described as

F_t = β_t ∫ g_{θ,t} g_{θ,t}^T P(x_t) dx_t    (25)

where the vector g_{θ,t} stands for ∂f(x_t; θ_t)/∂θ_t. When we use the three-layer perceptron described by (21), the vector g_{θ,t} can be written as g_{θ,t} = (g_{a,t}, g_{b,t}), where g_{a,t} = (s_{0,t}, s_{1,t}, ..., s_{n_h,t}), g_{b,t} = (g_{b_1,t}, ..., g_{b_{n_h},t}), and g_{b_i,t} = a_{i,t} s'_{i,t} x*_t. For simplicity of notation, here we use s_{i,t} = σ(x_t; b_{i,t}), s'_{i,t} = s_{i,t}(1 − s_{i,t}), s_{0,t} = 1, x*_t = (x_{0,t}, x_t), and x_{0,t} = 1. More specifically, we can describe the Fisher information matrix (24) as

F_t = [ F_{aa,t}  F_{ab,t} ]
      [ F_{ba,t}  F_{bb,t} ]    (26)

where

F_{ab,t} = F_{ba,t}^T = (F_{ab_1,t}, ..., F_{ab_{n_h},t})

F_{bb,t} = [ F_{b_1 b_1,t}      ...  F_{b_1 b_{n_h},t}   ]
           [ ...                ...  ...                 ]
           [ F_{b_{n_h} b_1,t}  ...  F_{b_{n_h} b_{n_h},t} ]

[F_{aa,t}]_{ij} = β_t ∫ s_{i−1,t} s_{j−1,t} P(x_t) dx_t

[F_{ab_i,t}]_{jl} = β_t ∫ a_{i,t} s_{j−1,t} s'_{i,t} x_{l−1,t} P(x_t) dx_t

[F_{b_i b_j,t}]_{lm} = β_t ∫ a_{i,t} a_{j,t} s'_{i,t} s'_{j,t} x_{l−1,t} x_{m−1,t} P(x_t) dx_t.

Here, the operator [·]_{ij} denotes the element in row i and column j of a matrix, F_{aa,t} ∈ ℝ^{(n_h+1)×(n_h+1)}, F_{ab_i,t} ∈ ℝ^{(n_h+1)×(n_x+1)}, and F_{b_i b_j,t} ∈ ℝ^{(n_x+1)×(n_x+1)}.

C. Natural Sequential Prior for Parameter

The natural sequential prior for the parameter θ_t can be written as the Gaussian distribution with covariance matrix γ_t^{-1} F_{t-1}^{-1}

P(θ_t | θ_{t-1}, τ_t) = (1 / Z(γ_t F_{t-1})) exp(−(γ_t/2) Δθ_t^T F_{t-1} Δθ_t).    (27)

Here, the variable γ_t ∈ (0, ∞) is a hyperparameter controlling the scale of the transition, the function Z(·) is a normalizing constant, and the variable τ_t is defined as τ_t = (γ_t, β_t). Unfortunately, when we use the observation model (19) and (20) with a three-layer perceptron, as in the experiment described later, we cannot ensure that the Fisher information matrix F_{t-1} is positive definite, though it can be ensured to be nonnegative definite. Therefore, it is difficult to use the prior (27) directly. For such cases, the Fisher information matrix F_{t-1} is modified to a positive-definite matrix F̃_{t-1} = F_{t-1} + δI in this paper; i.e., we use the modified natural sequential prior

P(θ_t | θ_{t-1}, τ_t) = (1 / Z(γ_t F̃_{t-1})) exp(−(γ_t/2) Δθ_t^T F̃_{t-1} Δθ_t)    (28)

in the experiment described later. Here, the matrix I denotes a unit matrix, and the constant δ is a positive real number (i.e., δ > 0). A detailed discussion of such singularity of the Fisher information matrix of a multilayer perceptron can be found in [63].

D. Hyperparameters

In the experiment described later, the hyperparameters β_t and γ_t are specified in two ways, as follows: 1) both hyperparameters are fixed, i.e., we set β_t = β and γ_t = γ, where β and γ are fixed values; and 2) hyperparameter transition models are used for automatic tuning of the hyperparameters, i.e., we use transition models based on a log-normal distribution as follows:

P(β_t | β_{t-1}) = (1 / (√(2π) σ_β β_t)) exp(−(log β_t − log β_{t-1})² / (2σ_β²))    (29)

P(γ_t | γ_{t-1}) = (1 / (√(2π) σ_γ γ_t)) exp(−(log γ_t − log γ_{t-1})² / (2σ_γ²))    (30)

where the variables σ_γ and σ_β are often called hyper-hyperparameters.

E. Filtering Distribution and Predictive Distribution

Within the online Bayesian framework under the model specifications described here, our goal is to evaluate the filtering distribution of all parameters Θ_t = (θ_t, β_t, γ_t)

P(Θ_t | z_{1:t}) = P(y_t | x_t, θ_t, β_t) P(Θ_t | z_{1:t-1}) / ∫ P(y_t | x_t, θ_t, β_t) P(Θ_t | z_{1:t-1}) dΘ_t    (31)

and the predictive distribution of the observation y_{t+1}

P(y_{t+1} | x_{t+1}, z_{1:t}) = ∫ P(y_{t+1} | x_{t+1}, θ_{t+1}, β_{t+1}) P(Θ_{t+1} | z_{1:t}) dΘ_{t+1}    (32)

where

P(Θ_{t+1} | z_{1:t}) = ∫ P(Θ_{t+1} | Θ_t) P(Θ_t | z_{1:t}) dΘ_t

P(Θ_{t+1} | Θ_t) = P(θ_{t+1} | θ_t, τ_{t+1}) P(β_{t+1} | β_t) P(γ_{t+1} | γ_t).

The implementation method used to evaluate these distributions (31) and (32) is described in the next section.

V. IMPLEMENTATION

The model described in the previous section obviously involves partial nonlinearity and non-Gaussian properties. Therefore, it is difficult to use analytical implementation methods such as the Kalman filter method for evaluating the

⁵The proposal distribution Q(·) is set as P(θ_t | θ_{t-1}, τ_t) P(β_t | β_{t-1}) P(γ_t | γ_{t-1}) in the experiment described later.

46

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 25, NO. 1, JANUARY 2014

Fig. 2.

Fig. 1.

Implementation based on SMC.

filtering distribution (31) and the predictive distribution (32). This indicates that we need a numerical approximation method to evaluate the distributions (31) and (32). In this paper, the SMC method, which has been successfully applied to approximate the filtering distributions of many nonlinear/non-Gaussian models [64], is used to evaluate the distributions. The SMC method consists of two steps: 1) an importance sampling step, and 2) a resampling step. By iterating these two steps, the SMC method enables us to draw samples from the filtering distribution (31) sequentially. The predictive distribution is then approximated by using the samples from the filtering distribution (31). Although there are several variants of the SMC method, here we use the SMC implementation summarized in Fig. 1. The integer N in Fig. 1 denotes the number of particles.

VI. APPLICATION TO ONLINE LEARNING FOR PERCEPTRONS

In this section, the natural sequential prior is applied to the online learning problem for the three-layer perceptron. Although the examples described in this section are related to regression problems, similar examples related to classification problems can be found in several papers on adaptive classifiers, e.g., JIT adaptive classifiers [65], [66], and Learn++.NSE [67].

A. Target Problem

In this section, we consider a nonlinear regression problem given the sequential observation data, with the following conditions.
1) Unknown functional form: The model does not involve any information about the form of the true function.
2) Unobservable variable: There is an unknown variable in the true function that cannot be observed directly.

Fig. 2. Shapes of true function in the experiment.

3) Uncertainty of observation: The observation data involve probabilistic uncertainty (i.e., observation noise).
These conditions correspond to cases where not enough knowledge about the system behind the data is available, as described in Section I. More concrete settings of the target problem are given in the following.
1) True Function: We consider the true function

$$\rho(x; v) = 2.6 \sin(0.5x + \pi v) \cos(2.0x) \qquad (33)$$

where the variable x ∈ ℝ stands for an input variable (known), and the variable v ∈ ℝ is an unobservable variable (unknown). The true function is plotted in Fig. 2 for several settings of the unobservable variable v. In the following, we denote the input variable at time t by x_t, and the unobservable variable at time t by v_t.
2) Unobservable Variable: We consider the following cases for setting the variable v_t:
Example 1: the variable v_t is fixed at v_t = 0.0;
Example 2: the variable v_t gradually increases, v_t = t/1000;
Example 3: the variable v_t changes cyclically, v_t = 0.5 sin(πt/1000);
Example 4: the variable v_t changes probabilistically, v_t = v_{t-1} + κξ_t, where the variable ξ_t stands for independent noise generated from the standard Gaussian distribution, v_0 = 0.0, and κ = 1.0/1000.
3) Observation Data: By using the true function (33) with the unobservable variable v_t, the observation data z_t = (y_t, x_t) are generated as follows:

$$x_t \sim \text{i.i.d. } U(-3.0,\, 3.0) \qquad (34)$$

$$y_t = \rho(x_t; v_t) + \nu_t, \qquad \nu_t \sim \text{i.i.d. } N(0.0,\, 0.03) \qquad (35)$$

where the observation noise ν_t is generated from the Gaussian distribution with mean 0.0 and variance 0.03, and the input variable x_t is generated from the uniform distribution with range (-3.0, 3.0).
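The four example settings and the data-generating process (33)-(35) above can be transcribed directly into code; the following is a minimal sketch (the function names `rho`, `unobservable`, and `make_data` are our own):

```python
import numpy as np

def rho(x, v):
    """True function (33)."""
    return 2.6 * np.sin(0.5 * x + np.pi * v) * np.cos(2.0 * x)

def unobservable(example, t, rng, v_prev=0.0):
    """Unobservable variable v_t for Examples 1-4."""
    if example == 1:
        return 0.0                                # fixed
    if example == 2:
        return t / 1000.0                         # gradual increase
    if example == 3:
        return 0.5 * np.sin(np.pi * t / 1000.0)   # cyclic change
    # Example 4: Gaussian random walk with kappa = 1/1000.
    return v_prev + (1.0 / 1000.0) * rng.standard_normal()

def make_data(example, T=2000, seed=0):
    """Observation sequence z_t = (y_t, x_t) generated via (34) and (35)."""
    rng = np.random.default_rng(seed)
    xs, ys, v = [], [], 0.0
    for t in range(1, T + 1):
        v = unobservable(example, t, rng, v_prev=v)
        x = rng.uniform(-3.0, 3.0)                       # (34)
        y = rho(x, v) + rng.normal(0.0, np.sqrt(0.03))   # (35)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
```

For instance, `make_data(3)` yields the cyclically varying case of Example 3 over the range t = 1, ..., 2000 used in the experiments.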

Fig. 3. Averaged squared errors for the last 100 data points (Examples 1-4) with various fixed hyperparameters; the horizontal axis is the logarithm of the fixed hyperparameter γ. (a) Conventional model. (b) Proposed model.
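Before turning to the model settings, the SMC scheme of Section V (Fig. 1), with the transition model used as the proposal (footnote 5), can be sketched as follows. This is a minimal illustration with our own function names and a scalar toy observation model standing in for the perceptron; it is not the authors' exact implementation. Because the proposal equals the transition model, the incremental importance weight reduces to the observation likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_step(particles, weights, y, x, log_obs_lik, sample_transition):
    """One SMC iteration (cf. Fig. 1): proposal -> importance weighting -> resampling."""
    N = len(weights)
    # 1) Importance sampling step: propose from the transition model.
    particles = np.stack([sample_transition(p) for p in particles])
    logw = np.array([log_obs_lik(y, x, p) for p in particles])
    w = weights * np.exp(logw - logw.max())   # subtract max for numerical stability
    w /= w.sum()
    # 2) Resampling step: multinomial resampling back to uniform weights.
    idx = rng.choice(N, size=N, p=w)
    return particles[idx], np.full(N, 1.0 / N)

# Toy usage: track the slope theta of y = theta * x + noise
# (beta and gamma are kept fixed for simplicity).
beta, gamma = 1.0 / 0.03, 250.0
log_lik = lambda y, x, th: -0.5 * beta * (y - th[0] * x) ** 2
transition = lambda th: th + rng.normal(0.0, gamma ** -0.5, size=th.shape)

particles = rng.normal(size=(1000, 1))
weights = np.full(1000, 1.0 / 1000)
for t in range(1, 51):
    x_t = rng.uniform(-3.0, 3.0)
    y_t = 0.7 * x_t + rng.normal(0.0, np.sqrt(0.03))
    particles, weights = smc_step(particles, weights, y_t, x_t, log_lik, transition)

estimate = float(particles.mean())            # posterior mean of the slope
```

After a few dozen observations, `estimate` concentrates near the true slope 0.7; in the experiments below, the same two-step iteration is applied with the perceptron observation model and the parameter/hyperparameter transition models.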

B. Model Settings

This part describes the model settings for the experiment.
1) Observation Model Settings: In this section, we use the observation model described by (19) and (20). To approximate the true function ρ(·) (33), the parameterized function f(·) is set to a three-layer perceptron (21) with the number of hidden units n_h = 8.
2) Parameter Transition Model Settings: When we use the observation model with the three-layer perceptron (21), we cannot ensure that the Fisher information matrix is positive definite. Therefore, the transition model described by (28) with δ = 0.1 is used here. For comparison, we also use the standard transition model based on the squared norm ||θ_t - θ_{t-1}||² (conventional model), as follows:

$$P(\theta_t \mid \theta_{t-1}, \gamma_t) = \frac{1}{Z(\gamma_t I)} \exp\Bigl(-\frac{\gamma_t}{2} \|\theta_t - \theta_{t-1}\|^2\Bigr). \qquad (36)$$

For both cases, the components of the initial parameter θ_0 are generated independently from the standard normal distribution.
3) Settings for Hyperparameters: In this section, the hyperparameters β_t and γ_t are treated in two ways, as described before. First, the hyperparameters β_t and γ_t are fixed. More specifically, the hyperparameter β_t is fixed at the inverse variance of the noise in (35), i.e., β_t = β = 0.03⁻¹, while the hyperparameter γ_t is fixed at various values depending on the case: γ_t = γ = 50, 100, 250, 500, 1000, 2500, and 5000. Second, we use the transition models of the hyperparameters (29) and (30). We set σ_γ = σ_β = 0.01 for the conventional model (36), and σ_γ = σ_β = 0.05 for the proposed model.⁶ The initial values of the hyperparameters β_t and γ_t are generated independently from the standard lognormal distribution.

Footnote 6: More specifically, these values of the hyper-hyperparameters showed the best performance (among 0.1, 0.05, 0.025, 0.01, and 0.005) in terms of the sum of the averaged squared errors over the last 100 data points of all four examples in a single trial.

C. Computation of Fisher Information Matrix

This subsection explains the computation of the Fisher information matrix F_t. When we used the observation model with the three-layer perceptron, we could not find any analytical solution for the Fisher information matrix F_t (24), except for several special cases.⁷ Therefore, we used Monte Carlo integration to compute the Fisher information matrix F_t:

$$F_t \approx \frac{\beta_t}{100} \sum_{s=1}^{100} g_{\theta,t}\bigl(x^{(s)}\bigr)\, g_{\theta,t}\bigl(x^{(s)}\bigr)^{T} \qquad (37)$$

where g_{θ,t}(x) = ∂f(x; θ_t)/∂θ_t and x^{(s)} (s = 1, ..., 100) are generated from P(x_t).

Footnote 7: For example, an analytical solution can be found in [68] under the condition that the probability distribution of the input variable P(x_t) is the normal distribution (with zero mean vector and unit covariance matrix), and the output function of the hidden units σ(·) can be written as ψ(r) = ∫_{-∞}^{r/√2} exp(-ω²) dω.

D. Results

This part describes the results. For all cases, we consider the range t = 1, ..., 2000, and all of the results are averaged over 20 trials. The number of particles for the SMC method is N = 1000.
1) Results With the Fixed Hyperparameter: The results obtained with the fixed hyperparameters are given here. Fig. 3 plots the averaged squared errors for the last 100 data points: (1/100) Σ_{t=1901}^{2000} (y_t - ŷ_t)². Here, the variable ŷ_t stands for the predictive mean evaluated by our implementation. Fig. 4 plots the transition of the moving-average squared errors: (1/100) Σ_{t'=t-99}^{t} (y_{t'} - ŷ_{t'})². Fig. 5 shows the cumulative squared errors: Σ_{t'=1}^{t} (y_{t'} - ŷ_{t'})². The results shown in Figs. 4 and 5 were computed with the best hyperparameters in terms of the averaged squared error for the last 100 data points for each case. It can be observed from Fig. 3 that the averaged errors in the last 100 data points with the proposed model (natural sequential prior) tend to be smaller than those with the conventional model (36). We can also observe that the proposed model is superior to the conventional model in the case of the best hyperparameter in terms of the averaged errors in the last 100 data points. From Figs. 4 and 5, the average errors and the cumulative errors with the proposed model are


Fig. 4. Transition of averaged squared error of 100 data points with the best fixed hyperparameter in Fig. 3. (a) Conventional model. (b) Proposed model.

Fig. 5. Time evolution of cumulative squared error with the best fixed hyperparameter in Fig. 3. (a) Conventional model. (b) Proposed model.

smaller than those with the conventional model after t = 200. In Examples 2 and 3, one can see that the average errors with the conventional model increase temporarily around t = 1000, since the true function changes greatly around t = 1000 in these examples. On the other hand, one cannot see such an obvious increase in the average errors with the proposed model around t = 1000.
2) Best Fixed Hyperparameters: In order to quantify the degree of change in the four examples, let us consider the squared norm of the time difference of the true function:

$$\|\Delta\rho_t\|^2 = \int \Delta\rho_t(x_t)^2\, P(x_t)\, dx_t \qquad (38)$$

where Δρ_t(·) = ρ(·; v_t) - ρ(·; v_{t-1}). Table I shows the averaged values and the maximum values of the squared norms (38) over the last 100 data points in each example. The values for Example 4 were averaged over 100 trials. From this table, one can see that Example 3 has the largest averaged squared norm over the last 100 data points. The table also shows that Example 4 has the largest maximum value. These results indicate that the true functions ρ(·) of Examples 3 and 4 changed more dynamically than those of Examples 1 and 2 in the last 100 data points. In Fig. 3, one can see that the best hyperparameters γ of the conventional model are similar for the four examples.

TABLE I
CHANGE OF THE TRUE FUNCTION IN THE LAST 100 DATA POINTS

            Example 1   Example 2      Example 3      Example 4
Average     0.0         1.04 × 10⁻⁴    2.48 × 10⁻⁴    1.03 × 10⁻⁴
Max         0.0         1.05 × 10⁻⁴    2.59 × 10⁻⁴    7.94 × 10⁻⁴

Therefore, the best fixed hyperparameters γ of the conventional model are not considered to adequately reflect the characteristics of each example. In contrast, the best fixed hyperparameters γ of the proposed model in Examples 3 and 4 are much smaller than those in Examples 1 and 2. This result indicates that the proposed model properly captured the characteristics of each example, and these characteristics were reflected in the best fixed hyperparameters.
3) Results With the Transition Model of Hyperparameters: The results obtained with the hyperparameter transition models (29) and (30) are described here. Fig. 6 shows the time evolution of the moving-average squared errors: (1/100) Σ_{t'=t-99}^{t} (y_{t'} - ŷ_{t'})². Fig. 7 plots the cumulative squared errors: Σ_{t'=1}^{t} (y_{t'} - ŷ_{t'})². From these figures, one can see that the averaged squared errors and the cumulative squared errors obtained with the proposed model tend to be smaller than those obtained with the conventional model, similar to the case with the fixed hyperparameter.
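The change measure (38) used for Table I can be estimated by Monte Carlo under the uniform input distribution (34); a minimal sketch (the function `delta_rho_sq` is our naming, and the sample size is arbitrary):

```python
import numpy as np

def rho(x, v):
    """True function (33)."""
    return 2.6 * np.sin(0.5 * x + np.pi * v) * np.cos(2.0 * x)

def delta_rho_sq(v_t, v_prev, n_mc=100_000, seed=0):
    """Monte Carlo estimate of (38): E_x[(rho(x; v_t) - rho(x; v_{t-1}))^2],
    with x drawn from the uniform input distribution (34)."""
    x = np.random.default_rng(seed).uniform(-3.0, 3.0, size=n_mc)
    return float(np.mean((rho(x, v_t) - rho(x, v_prev)) ** 2))

# Example 1 (v fixed): the change is exactly zero.
change_ex1 = delta_rho_sq(0.0, 0.0)
# Example 2 (v_t = t/1000) and Example 3 (v_t = 0.5 sin(pi t / 1000)) at t = 1950.
change_ex2 = delta_rho_sq(1.950, 1.949)
change_ex3 = delta_rho_sq(0.5 * np.sin(np.pi * 1.950), 0.5 * np.sin(np.pi * 1.949))
```

Consistent with the relative ordering in Table I, the estimate is zero for Example 1 and larger for Example 3 than for Example 2 in the last 100 data points.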

Fig. 6. Time evolution of averaged squared error of 100 data points with hyperparameter transition model. (a) Conventional model. (b) Proposed model.

Fig. 7. Time evolution of cumulative squared error with hyperparameter transition model. (a) Conventional model. (b) Proposed model.

Fig. 8. Comparison of computational times. (a) With the best fixed hyperparameter in Fig. 3. (b) With the transition model of the hyperparameter.
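The error curves reported in Figs. 4-7 are the moving-average squared error over a 100-point window and the cumulative squared error; they can be computed as, e.g. (our helper naming):

```python
import numpy as np

def moving_average_sq_error(y, y_hat, window=100):
    """(1/window) * sum_{t'=t-window+1}^{t} (y_{t'} - y_hat_{t'})^2, per t."""
    se = (np.asarray(y) - np.asarray(y_hat)) ** 2
    kernel = np.ones(window) / window
    return np.convolve(se, kernel, mode="valid")   # defined for t >= window

def cumulative_sq_error(y, y_hat):
    """sum_{t'=1}^{t} (y_{t'} - y_hat_{t'})^2, per t."""
    return np.cumsum((np.asarray(y) - np.asarray(y_hat)) ** 2)
```

Here `y_hat` is the sequence of predictive means ŷ_t produced by the SMC implementation; the averaged squared error for the last 100 data points in Fig. 3 is simply the final entry of the moving-average curve.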

4) Results by Considering Computational Time: In the results described in the previous sections, the number of particles N is set at N = 1000 for both the conventional model and the proposed model. However, the computational cost with

the proposed model is larger than that with the conventional model with the same number of particles N, since the proposed model requires computation of the Fisher information matrix. Therefore, the conventional model with N = 20 000 is also


Fig. 9. Comparison of averaged squared errors of the last 100 data points. (a) With the best fixed hyperparameter in Fig. 3. (b) With the transition model of the hyperparameter.

Fig. 10. Comparison of cumulative squared errors. (a) With the best fixed hyperparameter in Fig. 3. (b) With the transition model of the hyperparameter.

tested in this part. Fig. 8 compares the computational times for the conventional model (with N = 20 000) and the proposed model (with N = 1000).⁸ Fig. 9 plots the averaged squared errors of the last 100 data points. The cumulative squared errors are shown in Fig. 10. The results obtained with the conventional model with N = 1000 are also shown in these figures. From Fig. 8, one can see that the computational time of the conventional model with N = 20 000 is slightly larger than that of the proposed model with N = 1000. Figs. 9 and 10 indicate that the proposed model with N = 1000 performed better than the conventional model with N = 20 000, except for a few cases.

VII. DEMONSTRATION WITH ABRUPT CHANGE

In the previous section, the unobservable variable v_t was fixed or changed gradually. To consider the more difficult situation of a changing environment, an example including an abrupt change is used in this section.

Footnote 8: Specifications of the PC are as follows. CPU: Intel Xeon 3.00 GHz × 2; memory: 1.0 GB; OS: Linux version 2.6.9.

A. Settings

In this experiment, the true function ρ(·) (33) is used again, and the variable v_t is defined as v_t = t/1000 (0 < t ≤ 1000) and v_t = t/1000 + 1 (1000 < t ≤ 2000). Note that there is an abrupt change at t = 1001 in this case. The hyperparameters β_t and γ_t are treated in two ways, as in the previous section. In the cases with the fixed hyperparameters, the hyperparameter β_t is fixed at β_t = 0.03⁻¹ for both the conventional model and the proposed model, and the hyperparameter γ_t is fixed at γ_t = 100 for the conventional model and γ_t = 250 for the proposed model, since these settings showed the best performance (among γ_t = 50, 100, 250, 500, 1000, 2500, and 5000) in terms of the averaged squared error for the last 100 data points. In the cases with the transition model of the hyperparameters, we set σ_γ = σ_β = 0.01 for the conventional model and σ_γ = σ_β = 0.05 for the proposed model, as in the previous section. The other settings are the same as those used in the previous section.

B. Results

Fig. 11 plots the averaged squared errors for 100 data points. Fig. 12 plots the cumulative squared error. These


Fig. 11. Time evolution of averaged squared error of 100 data points in the case with an abrupt change. (a) With the best fixed hyperparameter. (b) With hyperparameter transition model.


Fig. 12. Time evolution of cumulative squared error in the case with an abrupt change. (a) With the best fixed hyperparameter. (b) With hyperparameter transition model.

results are averaged over 20 trials. Fig. 11 shows that the averaged squared errors in all cases have peaks after the abrupt change occurs. The averaged squared error with the fixed hyperparameters of the proposed model has a slightly higher peak than that of the conventional model around t = 1100. However, the averaged squared errors of the proposed model are lower than those of the conventional model elsewhere. From Fig. 12, we can see that the cumulative error with the proposed model is smaller than that with the conventional model in this case.

VIII. DISCUSSION

In this section, we discuss two possible improvements: 1) an improvement to the modification of the Fisher information matrix, and 2) an improvement to the proposed transition model.

A. Improving the Modification of the Fisher Information Matrix

In the experiments reported in this paper, the parameter δ of the modification was set to δ = 0.1 in the transition

model (28). Although an extremely small value of δ seems suitable in a theoretical sense, the numerical calculations can become unstable with an extremely small δ in practice. Such problems mainly occur when the minimum eigenvalue of the modified matrix F̃_t is too small compared with its maximum eigenvalue. Indeed, there were problems in the calculations with δ = 0.01 in our preliminary experiment. On the other hand, the proposed transition model with δ = 1.0 performed worse than that with δ = 0.1, since the proposed transition model becomes quite similar to the conventional transition model (1) when δ is large. To overcome such problems, one can consider several adaptive methods for setting δ based on information about F_t. For example, we can use a simple adaptive method based on the trace of F_t, such as δ = δ_0 tr(F_t), where the variable δ_0 is a small positive number. With this method, the minimum eigenvalue of the modified matrix F̃_t is not less than δ_0/(1 + kδ_0) times its maximum eigenvalue, i.e., λ_min(F̃_t) ≥ δ_0/(1 + kδ_0) λ_max(F̃_t). Here, the functions λ_max(·) and λ_min(·) are the maximum and minimum eigenvalue functions, and the integer k denotes the size of F_t (i.e., F_t ∈ ℝ^{k×k}). Alternatively, one can consider another method in which the eigenvalues are used directly, such as δ = max[0, (η λ_max(F_t) - λ_min(F_t))/(1 - η)], where η ∈ (0, 1). With this method, it is guaranteed that the minimum eigenvalue of the modified matrix is not less than η times the maximum eigenvalue (i.e., λ_min(F̃_t) ≥ η λ_max(F̃_t)).

Instead of the modification F̃_t = F_t + δI, one may consider the modification F̃_t = F_t + δ diag[F_t], which is used in the Levenberg-Marquardt method in other fields, such as least-squares curve fitting and nonlinear programming. Here, the matrix diag[F_t] stands for the diagonal matrix consisting of the diagonal elements of F_t. In addition, we can also consider a more elaborate modification method based on the eigen decomposition (or spectral decomposition). Specifically, let us write the eigen decomposition of F_t as F_t = Q_t Λ_t Q_t^T, where the matrix Q_t is an orthogonal matrix (i.e., Q_t^T Q_t = Q_t Q_t^T = I), the matrix Λ_t is a diagonal matrix Λ_t = diag[λ_{1,t}, ..., λ_{k,t}], and the variable λ_{i,t} stands for the ith eigenvalue of F_t. Under these settings, the modification F̃_t = Q_t Λ̃_t Q_t^T can be used, where Λ̃_t = diag[λ̃_{1,t}, ..., λ̃_{k,t}] and λ̃_{i,t} = max[λ_{i,t}, ε λ_max(F_t)]. The variable ε is a small positive number. With this modification, the minimum eigenvalue is not less than ε times the maximum eigenvalue (i.e., λ_min(F̃_t) ≥ ε λ_max(F_t)).

B. Improving the Proposed Transition Model

In the proposed model, the quadratic form with the Fisher information matrix (metric tensor) F_t is used to approximate the KL divergence.⁹ Therefore, more complicated information about the manifold (e.g., connection, curvature) is not directly considered in this model. Moreover, the approximation of the expected KL divergence (5) may not be valid when the hyperparameter γ_t is small.
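As a concrete illustration, the Fisher-matrix modifications discussed in Section VIII-A can be sketched as follows (the function names are ours, and the default constants are arbitrary):

```python
import numpy as np

def modify_trace(F, delta0=0.01):
    """F + delta * I with delta = delta0 * tr(F); this guarantees
    lam_min(F_mod) >= delta0 / (1 + k * delta0) * lam_max(F_mod)."""
    return F + delta0 * np.trace(F) * np.eye(F.shape[0])

def modify_lm(F, delta=0.1):
    """Levenberg-Marquardt-style modification F + delta * diag[F]."""
    return F + delta * np.diag(np.diag(F))

def modify_eig(F, eps=0.01):
    """Eigen-decomposition variant: clip the eigenvalues from below
    at eps * lam_max(F), so that lam_min >= eps * lam_max."""
    lam, Q = np.linalg.eigh(F)
    lam_clipped = np.maximum(lam, eps * lam.max())
    return Q @ np.diag(lam_clipped) @ Q.T

# A rank-deficient (singular) PSD matrix, as can arise for the perceptron
# Fisher matrix computed via (37).
g = np.array([1.0, 2.0, 0.0])
F = np.outer(g, g)                                    # lam_min(F) = 0
assert np.linalg.eigvalsh(modify_trace(F)).min() > 0  # positive definite again
assert np.linalg.eigvalsh(modify_eig(F)).min() > 0
```

Note that the Levenberg-Marquardt-style variant alone does not repair a matrix whose diagonal contains exact zeros, which is one reason the trace- and eigenvalue-based rules above come with explicit lower bounds on the eigenvalue ratio.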
Specifically, under several conditions, the approximation error of the expected KL divergence with the proposed transition model can be written as O(γ_t^{-2} ||F_t^{-1}||²). A simple way to reduce the approximation error is to adopt an approach using multiple transitions with step size ε := 1/s (where s is a positive integer). More specifically, the transition model for θ_{t+ε} given θ_t is defined as

$$P(\theta_{t+\varepsilon} \mid \theta_t) = \frac{1}{Z(\varepsilon^{-1} \gamma_t F_t)} \exp\Bigl(-\frac{\gamma_t}{2\varepsilon}\, (\theta_{t+\varepsilon} - \theta_t)^{T} F_t\, (\theta_{t+\varepsilon} - \theta_t)\Bigr). \qquad (39)$$

In this approach, s transitions are considered in one step using this transition model (39) (i.e., the transitions generate θ_{t-1+ε}, θ_{t-1+2ε}, ..., θ_t). The error term of the expected KL divergence (5) for one transition with step size ε can be written as O(ε² γ_t^{-2} ||F_t^{-1}||²). This approach can also be derived from the stochastic process

$$d\theta_t = r_t \Sigma_t\, dW_t \qquad (40)$$

where the vector W_t stands for the k-dimensional Wiener process, the matrix Σ_t is the square root of F_t^{-1}, and the variable r_t denotes a scaling parameter, defined as r_t = γ_t^{-1/2}. When this stochastic process is discretized with step size ε by the Euler method for stochastic processes (also known as the Euler-Maruyama method) [69], one obtains the transition model (39). Although the complicated information of the model manifold is not used explicitly in this approach, the approach (with a small step size) is more valid, at least in terms of the approximation error of the expected KL divergence.

Footnote 9: To simplify the discussion here, the Fisher information matrix F_t is assumed to be a positive-definite matrix.

Alternatively, one may also consider an approach based on a modification of the process (40) that rests on a more differential-geometric concept:

$$d\theta_t = r_t \Sigma_t\, dW_t + r_t^2 \chi_t\, dt \qquad (41)$$

where χ_t = (χ_{1,t}, ..., χ_{k,t})^T, χ_{l,t} = -(1/2) Σ_{i=1}^{k} Σ_{j=1}^{k} [F_t^{-1}]_{ij} Γ^l_{ij}(θ_t), the operator [·]_{ij} stands for the ijth element of the matrix, and the coefficient Γ^l_{ij}(θ_t) is the Christoffel symbol of the second kind. This stochastic process (with the scaling parameter r_t fixed) is a definition of Brownian motion derived from the Laplace operator on a (Riemannian) manifold (i.e., the Laplace-Beltrami operator) with the metric tensor F_t [70]. By considering a discretization of this stochastic process with step size ε, the following modification of the transition model (39) can be obtained:

$$P(\theta_{t+\varepsilon} \mid \theta_t) = \frac{1}{Z(\varepsilon^{-1} \gamma_t F_t)} \exp\Bigl(-\frac{\gamma_t}{2\varepsilon}\, (\theta_{t+\varepsilon} - \mu_t)^{T} F_t\, (\theta_{t+\varepsilon} - \mu_t)\Bigr) \qquad (42)$$

where μ_t = θ_t + ε r_t² χ_t. With this modification (42), it can be very costly to calculate the vector χ_t, at least for complex models such as three-layer perceptrons; however, we consider that this modification (and its variants) may be interesting from the viewpoint of Bayesian learning based on information geometry and stochastic process theory.

IX. CONCLUSION

In this paper, we described a natural sequential prior that uses a Fisher information matrix to consider the difference between the observation models more naturally. For validation, the proposed model was applied to the online learning problem for a three-layer perceptron with several settings. As a result, we confirmed the superiority of the proposed model in terms of errors. One of the most important directions for future work is to generalize and/or extend the proposed model so that it can be applied to other types of observation models.

ACKNOWLEDGMENT

The authors would like to thank Prof. A. Doucet, Oxford University, U.K., and Prof. N. Murata, Waseda University, Japan, for valuable comments. They also wish to thank K. Ito, T. Kudo, H. Sasaki, T. Kimura, K. Sega, and T. Tokuda, Waseda University, and the Associate Editor and the reviewers for their suggestions.


R EFERENCES [1] G. Kitagawa, “Changing spectrum estimation,” J. Sound Vibrat., vol. 89, no. 3, pp. 433–445, Aug. 1983. [2] G. Kitagawa and W. Gersch, “A smoothness priors time-varying AR coefficient modeling of nonstationary covariance time series,” IEEE Trans. Autom. Control, vol. 30, no. 1, pp. 48–56, Jan. 1985. [3] X. Q. Jiang and G. Kitagawa, “A time varying coefficient vector AR modeling of nonstationary covariance time series,” Signal Process., vol. 33, no. 3, pp. 315–331, Sep. 1993. [4] P. J. Harrison and C. Stevens, “Bayesian forecasting,” J. Res. Stat. Soc. Ser. B, Stat. Methodol., vol. 38, no. 3, pp. 205–247, Mar. 1976. [5] R. D. Snyder, “Recursive estimation of dynamic linear models,” J. Res. Stat. Soc. Ser. B, Stat. Methodol., vol. 47, no. 2, pp. 272–276, Feb. 1985. [6] J. Harrison and M. West, “Dynamic linear model diagnostics,” Biometrika, vol. 78, no. 4, pp. 797–808, 1991. [7] J. F. G. de Freitas, M. Niranjan, A. Gee, and A. Doucet, “Sequential Monte Carlo methods for optimisation of neural network models,” Dept. Eng., Cambridge Univ., Tech. Rep. CUED/F-INFENG/TR 328, 1998. [8] J. F. G. de Freitas, M. Niranjan, A. Gee, and A. Doucet, “Sequential Monte Carlo methods to train neural network models,” Neural Comput., vol. 12, no. 4, pp. 955–993, Apr. 2000. [9] N. de Freitas, C. Andrieu, P. Højen-Sørensen, M. Niranjan, and A. Gee, “Sequential Monte Carlo methods for neural networks,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. Gordon, Eds. New York, USA: Springer-Verlag, 2001, pp. 359–380. [10] T. Kurihara, Y. Nakada, K. Yosui, and T. Matsumoto, “Bayesian online learning: A sequential Monte Carlo with importance resampling,” in Proc. IEEE Signal Process. Soc. Workshop Neural Netw. Signal Process., Sep. 2001, pp. 163–172. [11] S. Amari, Differential Geometrical Method in Statistics (Lecture Notes in Statistics). New York, USA: Springer-Verlag, 1985. [12] R. A. Morejon and J. C. 
Principe, “Advanced search algorithms for information-theoretic learning with kernel-based estimators,” IEEE Trans. Neural Netw., vol. 15, no. 4, pp. 874–884, Jul. 2004. [13] A. Seghouane and S. Amari, “The AIC criterion and symmetrizing the Kullback–Leibler divergence,” IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 97–106, Jan. 2007. [14] S. Melacci and M. Gori, “Unsupervised learning by minimal entropy encoding,” IEEE Trans. Neural Netw. Learning Syst., vol. 23, no. 12, pp. 1849–1861, Dec. 2012. [15] S. Amari, K. Kurata, and H. Nagaoka, “Information geometry of Boltzmann machines,” IEEE Trans. Neural Netw., vol. 3, no. 2, pp. 260–271, Mar. 1992. [16] F. Cousseau, T. Ozeki, and S. Amari, “Dynamics of learning in multilayer perceptrons near singularities,” IEEE Trans. Neural Netw., vol. 19, no. 8, pp. 1313–1328, Aug. 2008. [17] G. Kitagawa, “Non-Gaussian state space modeling of nonstationary time series,” J. Amer. Stat. Assoc., vol. 82, no. 4, pp. 1032–1041, Dec. 1987. [18] G. Kitagawa and W. Gersch, Smoothness Priors Analysis of Time Series (Lecture Notes in Statistic). New York, USA: Springer-Verlag, 1996. [19] G. Kitagawa, “A self-organizing state space model,” J. Amer. Stat. Assoc., vol. 93, pp. 1203–1215, Sep. 1998. [20] G. Kitagawa and S. Sato, “Monte Carlo smoothing and self-organizing state-space model,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. Gordon, Eds. New York, USA: Springer-Verlag, 2001, pp. 177–195. [21] T. Higuchi, “Self-organizing time series model,” in Sequential Monte Carlo Methods in Practice, A. Doucet, N. de Freitas, and N. Gordon, Eds. New York, USA: Springer-Verlag, 2001, pp. 429–444. [22] R. E. Kalman, “A new approach to linear filtering and prediction problems,” ASME Trans. J. Basic Eng., vol. 82, no. 1, pp. 35–45, 1960. [23] H. Cox, “On the estimation of state variables and parameters for noisy dynamic systems,” IEEE Trans. Autom. Control, vol. 9, no. 1, pp. 5–12, Jan. 1964. [24] B. Friedland and I. 
Bernstein, “Estimation of the state of a nonlinear process in the presence of non-Gaussian noise and disturbances,” J. Franklin Inst., vol. 281, no. 6, pp. 455–480, Jan. 1966. [25] L. Ljung, “Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems,” IEEE Trans. Autom. Control, vol. 24, no. 1, pp. 36–50, Feb. 1979. [26] J. E. Handschin and D. Mayne, “Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering,” Int. J. Control, vol. 9, no. 5, pp. 547–559, 1969. [27] H. Akashi and H. Kumamoto, “Random sampling approach to state estimation in switching environments,” Automatica, vol. 13, no. 4, pp. 429–434, Jul. 1977.


Yohei Nakada (M’06) received the B.E., M.E., and D.E. degrees from Waseda University, Tokyo, Japan, in 1999, 2001, and 2009, respectively. He is currently an Assistant Professor with the College of Science and Engineering, Aoyama Gakuin University, Tokyo, Japan. His current research interests include machine learning, pattern recognition, and signal processing based on Bayesian approaches.

Makio Wakahara received the B.E. and M.E. degrees from Waseda University, Tokyo, Japan, in 2004 and 2006, respectively. His research interests include Bayesian learning, in particular online Bayesian learning algorithms based on sequential Monte Carlo methods.

Takashi Matsumoto (M’71–SM’83–F’85–LF’10) received the B.S. degree in electrical engineering from Waseda University, Tokyo, Japan, the M.S. degree in applied mathematics from Harvard University, Cambridge, MA, USA, and the Ph.D. degree in electrical engineering from Waseda University, in 1966, 1970, and 1973, respectively. He is currently a Professor with the Graduate School of Advanced Science and Engineering, Waseda University. His current research interests include Bayesian learning algorithms for real-world problems with Monte Carlo implementations. He is also constructing EM algorithms with nonparametric priors for Bayesian learning and their applications; target applications include time-series prediction, EEG- and NIRS-based brain signal decoding, hyperspectral imaging-based individual authentication, image segmentation with nonparametric priors, language processing, and HMM-based event prediction with nonparametric priors. Dr. Matsumoto was a recipient of the 1994 Best Paper Award of the Japanese Neural Network Society. He has held visiting positions at Cambridge University, U.K., and UC Berkeley.
