
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015

Context Dependent Encoding Using Convolutional Dynamic Networks

Rakesh Chalasani, Student Member, IEEE, and Jose C. Principe, Fellow, IEEE

Abstract— Perception of sensory signals is strongly influenced by their context, both in space and time. In this paper, we propose a novel hierarchical model, called convolutional dynamic networks, that effectively utilizes this contextual information, while inferring the representations of the visual inputs. We build this model based on a predictive coding framework and use the idea of empirical priors to incorporate recurrent and top-down connections. These connections endow the model with contextual information coming from temporal as well as abstract knowledge from higher layers. To perform inference efficiently in this hierarchical model, we rely on a novel scheme based on a smoothing proximal gradient method. When trained on unlabeled video sequences, the model learns a hierarchy of stable attractors, representing low-level to high-level parts of the objects. We demonstrate that the model effectively utilizes contextual information to produce robust and stable representations for object recognition in video sequences, even in the case of highly corrupted inputs.

Index Terms— Context, deep learning, dynamic models, empirical priors, object recognition.

I. INTRODUCTION

NO SENSORY signal (or stimulus) occurs in isolation; instead, it is surrounded and influenced by many other signals. The perception of (or response to) any stimulus is very strongly influenced by both its spatial and temporal context [1]. Visual perception can be treated as obtaining some unknown representation of the sensory signals in light of past experiences learned from the environment and surrounding elements, both in space and time. Such a representation can be useful, both in biology and machine vision, for a variety of tasks: recognizing objects, reconstructing missing or occluded parts of an object, attention, tracking, and segmentation of a scene, to name a few. Biological vision has two important characteristics that help it to obtain such rich representations: first, it decomposes the inputs in a hierarchical fashion [2] and, second, it is a dynamic process and uses contextual information very efficiently [1], [3]. It has the ability to obtain information from various sources and combine them at various levels: information from the instantaneous inputs, spatio-temporal relationships (short-term memory) and the expectations

Manuscript received December 15, 2013; accepted September 11, 2014. Date of publication November 5, 2014; date of current version August 17, 2015. This work was supported by the Office of Naval Research under Grant N000141010375. R. Chalasani is with AnalytXbook Inc., Boston, MA 02109 USA, and also with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]). J. C. Principe is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2014.2360060

coming from prior knowledge about the external environment (long-term memory). Predictive coding has been proposed [4], [5] as a unified mathematical framework to explain such biological sensory processing and combine information from these various sources.

In this paper, inspired by predictive coding, we propose a novel convolutional dynamic network (CDN) for robust object recognition in a video sequence. Keeping in mind the human visual system, we restrict our model to have certain important properties as follows.

1) The model has to consider the inputs as a dynamic process, i.e., it has to maintain the temporal relationships during inference. This leads to incorporating short-term contextual information into the model [1].
2) It has to decompose the inputs in a hierarchical and distributive fashion. Such a deep representation should extract abstract information from the inputs, leading to better generalization and invariance to several transformations of the objects [6].
3) The representations have to be highly overcomplete, and the model has to maintain sparsity at every level. This helps not only to obtain an efficient representation of the inputs but also to learn better discriminability [7].
4) It should combine the top-down feedback information from the higher layers in the hierarchy while performing inference in the lower layers. Such bidirectional information flow can help the lower layers to reconstruct the missing parts of an object or to disambiguate between different objects in a noisy environment [8].

Previously, many methods were proposed to model visual perception, and they encompass some of the above mentioned properties. Notably, Rao and Ballard [4] proposed a predictive coding model based on Kalman filter-like update rules, while Friston [5] proposed a similar hierarchical dynamic model with generalized coordinates based on empirical Bayes and presented a unified theory based on free energy principles. Lee and Mumford [9] proposed a particle filtering based approach to build a model for visual perception. Though these models are successful in incorporating top-down connections and temporal relationships, none of them considers sparsity on the hierarchical representations or scales well to large images and videos. In spite of these limitations, predictive coding models are able to explain the importance of context (or empirical priors) in sensory perception [10]–[12] and make it possible to explore the same in deep networks for visual object recognition.

From a machine vision point of view, imposing some prior knowledge on the model can lead to better representation of



the inputs. Such prior knowledge can be domain specific, as in the scale-invariant feature transform (SIFT) [13] or histograms of oriented gradients [14], or it can be more generic, imposing fixed constraints such as sparsity [15] or temporal coherence [16]. The use of these generic priors has become particularly useful while learning deep networks [17]–[20]. However, contextual information might contain more task- or objective-specific information, which cannot be easily incorporated while using fixed priors. One way to adapt the model with contextual information is by empirically altering the priors. In this paper, we use this idea of adaptive priors to incorporate contextual information, obtained from the temporal relationships in the video and from top-down abstract information about the objects, to build a robust recognition system.

A. Overview

We start with a brief description of the predictive coding framework proposed in [4] and [5] and describe how a hierarchical generative model can be built within this framework (Section II-A). The basic building block of this model, pervasive across the hierarchy, is a state-space model with some unknown causes. We consider a specific architecture based on spatial convolution for this state-space model, designed to extract information that is invariant to transformations of the objects in the input scene (Section II-B). We show that this model can combine the bottom-up, top-down, and lateral (or temporal) influences, as envisaged above. However, performing inference efficiently in this hierarchical model is essential to scale it to large videos. We propose an efficient inference procedure based on smoothing proximal gradient methods to obtain sparse states from the state-space model, when the parameters of the model are fixed (Section III-A). In addition, the unknown causes are modeled as modulatory signals influencing the shape of the sparse prior on the states and are inferred along with the states. We stack such two-stage models (with states and causes) and build the hierarchy using a greedy layer-wise learning procedure (Section IV). The top-down connections in this hierarchical model are the predictions coming from higher layer state-space equations and, as we will show, can be readily incorporated while performing inference (Section III-B).

The performance of the proposed model is tested on a variety of tasks (Section V). First, we show that the model can learn complex structures from video sequences and use them to extract representations for object recognition in static images (resembling self-taught learning [21]). We then analyze the effect of contextual information during inference in a sequence labeling task, where the objective is to label every frame in a video sequence and eventually classify the entire sequence. Finally, we show that the model can learn a hierarchical decomposition of the objects from unlabeled video sequences and show that the top-down and temporal connections can help denoise video sequences, even when corrupted with structured noise.

II. MODEL

In this section, we start with a brief description of the predictive coding framework for constructing hierarchical models.


We then discuss the architecture of the proposed model that combines the predictive coding framework with convolutional (or deconvolutional [22]) networks.

A. Predictive Coding

The general idea of predictive coding models is to predict the external sensory inputs using a priori latent variables and to encode only the residual prediction errors, such that the internal representations are modified only according to the unexpected (or surprising) changes in the inputs. This prediction error, from a generative model perspective, is the difference between the actual observation and the predicted observation generated by the underlying causes (also called latent or hidden variables) [23]. Mathematically, if y_t is an observation at time t, then it is described by an underlying cause u_t through a function F as follows:

  y_t = F(u_t) + n_t.                                                          (1)

In addition, since we assume that the observations are time varying, intermediate states x_t can be considered to encode the dynamics over time. Hence, a combined model that encodes a sequence of observations can be written as a generalized state-space model of the form [5]

  y_t = F(x_t, u_t) + n_t
  x_t = G(x_{t-1}, u_t) + v_t.                                                 (2)

Here, G is called the state-transition function. We assume that both the functions F and G can be parameterized by some θ. As we are usually interested in obtaining abstract information from the observations, the unknown causes u_t are encouraged to have a nonlinear relationship with the observations. The hidden states x_t can then be said to mediate the influence of the causes on the observations and endow the system with memory [5]. The terms v_t and n_t are stochastic and model uncertainty in the predictions.

Now, to build a multilayer hierarchical network, several such state-space models can be stacked such that the causes from one act as observations to the model in the layer above. Mathematically, an L-layered network of this form can be written as

  u_t^{(l-1)} = F^{(l)}(x_t^{(l)}, u_t^{(l)}) + n_t^{(l)},   ∀ l ∈ {1, 2, ..., L}
  x_t^{(l)}   = G^{(l)}(x_{t-1}^{(l)}, u_t^{(l)}) + v_t^{(l)}.                 (3)

When l = 1, i.e., at the bottom layer, u_t^{(l-1)} = y_t, where y_t is the input data. The terms v_t^{(l)} and n_t^{(l)} are stochastic fluctuations at the higher layers and enter each layer independently. In other words, this model forms a Markov chain across the layers, and the latent variables at any layer are only dependent on the observations coming from the layer below and the predictions from the layer above. Such a model can be called an empirical Bayes model because the higher level latent variables model the priors on the lower level representations [23].
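To make the structure of (2) and (3) concrete, the following toy sketch simulates a two-layer stack of linear state-space models; the linear maps, dimensions, and noise levels are illustrative assumptions and not the convolutional parameterization developed in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearLayer:
    """One linear instance of the per-layer model in Eq. (3):
    u_below_t = F(x_t, u_t) + n_t,   x_t = G(x_{t-1}, u_t) + v_t."""
    def __init__(self, obs_dim, state_dim, cause_dim, noise=0.05):
        self.Fx = rng.normal(size=(obs_dim, state_dim)) / np.sqrt(state_dim)
        self.Fu = rng.normal(size=(obs_dim, cause_dim)) / np.sqrt(cause_dim)
        self.Gx = 0.9 * np.eye(state_dim)
        self.Gu = rng.normal(size=(state_dim, cause_dim)) / np.sqrt(cause_dim)
        self.x = np.zeros(state_dim)
        self.noise = noise

    def step(self, u):
        # x_t = G(x_{t-1}, u_t) + v_t
        self.x = self.Gx @ self.x + self.Gu @ u + self.noise * rng.normal(size=self.x.shape)
        # the layer's output: observation for the layer below (or y_t at the bottom)
        return self.Fx @ self.x + self.Fu @ u + self.noise * rng.normal(size=self.Fx.shape[0])

# Two stacked layers: the top layer's output acts as the causes of the bottom layer,
# and the bottom layer's output plays the role of the observation y_t.
top = LinearLayer(obs_dim=4, state_dim=4, cause_dim=2)
bottom = LinearLayer(obs_dim=16, state_dim=8, cause_dim=4)

y = []
for t in range(10):
    u_top = rng.normal(size=2)      # exogenous causes driving the top layer
    u_mid = top.step(u_top)         # causes generated for the bottom layer
    y.append(bottom.step(u_mid))    # observation y_t
y = np.stack(y)
```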



Fig. 1. Block diagram of a single layer model. Inputs here contain three channels (denoted in RGB colors) and each channel is modeled as a combination of the state maps (black) convolved with filters C (blue). The max-pooled state maps (orange) are decomposed using the cause maps (purple) convolved with filters B (blue). During inference there is a two-way interaction between the state and the cause mappings through pooling/unpooling operations, which is left implicit here.

In the following section, we consider a particular architecture for this general model that is suitable for extracting discriminative information from inputs. Specifically, we consider a convolutional architecture for the dynamic model described above to extract features from large-scale video sequences. We use an architecture similar to other convolutional networks [19], [22], but with temporal and top-down connections, as well as learning more complex invariances.

B. Model Architecture

1) Single Layer Model: We first consider a single layer model, as shown in Fig. 1, to process a video sequence. Here, the input to the model is a sequence of video frames I_t, ∀ t ∈ {1, 2, ..., T}, and each frame is composed of M color channels, denoted as {I_t^1, I_t^2, ..., I_t^M}. Now, we assume that each channel I_t^m acts as an observation to a state-space model, with the same set of states used across all the channels. Specifically, each channel I_t^m is modeled as a linear combination of K matrices X_t^k, ∀ k ∈ {1, 2, ..., K}, convolved with filters C_{m,k}, ∀ k. The state-space equations for this model can be written as

  I_t^m = Σ_{k=1}^{K} C_{m,k} * X_t^k + N_t^m,   ∀ m ∈ {1, 2, ..., M}
  X_t^k(i, j) = Σ_{k̃=1}^{K} a_{k,k̃} X_{t-1}^{k̃}(i, j) + V_t^k(i, j)          (4)

where * denotes convolution and X_t^k(i, j) indicates an element in the matrix X_t^k. If I_t^m is a w × h frame and C_{m,k} is an s × s pixel filter, then X_t^k is a matrix of size (w + s − 1) × (h + s − 1). We refer to X_t = [X_t^1 | X_t^2 | ... | X_t^k | ... | X_t^K] as state maps (or sometimes simply as states). In addition, a_{k,k̃} indicates the lateral connections between the state maps over time. Since we are only interested in object recognition in this paper, we assume

  a_{k,k̃} = 1 if k = k̃, and 0 otherwise

that is, we consider only self-recurrent connections between state maps, which encourages temporal coherence. However, one can alternatively model the motion in the observations by learning the coefficients a_{k,k̃} along with the rest of the model parameters [24].

Since (4) is an underdetermined model, we regularize it with a sparsity constraint on the states to obtain a unique solution. Hence, the combined energy function for the state-space model in (4) can be written as follows:

  E_x(X_t, C) = Σ_{m=1}^{M} || I_t^m − Σ_{k=1}^{K} C_{m,k} * X_t^k ||_2^2 + λ || X_t − X_{t-1} ||_1 + Σ_{k=1}^{K} γ^k · | X_t^k |          (5)

where || · ||_1 is the L1-norm over all the elements, | · | the element-wise absolute value, and || · ||_2 represents the Frobenius norm over the matrix. Notice that we consider the state transition noise V_t in (4) to also be sparse, so that it is consistent with the sparsity on the states. This makes practical sense, as the number of changes between two consecutive frames in a typical video sequence is small. In (5), γ^k is a sparsity parameter on the kth state map. Instead of assuming that the sparsity of the states is constant (or that the prior distribution over the states is stationary) as in [22], here we assume that the cause maps (or causes) U_t modulate the activity of the states through the sparsity parameter.
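For concreteness, the energy in (5) can be written out with NumPy/SciPy as follows; this is a minimal sketch with illustrative function names, assuming the array shapes stated above.

```python
import numpy as np
from scipy.signal import convolve2d

def state_energy(I, X, X_prev, C, gamma, lam):
    """Single-layer state energy of Eq. (5) (illustrative sketch).

    I      : (M, w, h)              input channels
    X      : (K, w+s-1, h+s-1)      current state maps
    X_prev : same shape as X        state maps from the previous frame
    C      : (M, K, s, s)           observation filters
    gamma  : per-map sparsity weights (scalars or maps, see Eq. (6))
    lam    : temporal-coherence weight
    """
    M, K = C.shape[0], C.shape[1]
    recon = 0.0
    for m in range(M):
        pred = sum(convolve2d(X[k], C[m, k], mode='valid') for k in range(K))
        recon += np.sum((I[m] - pred) ** 2)                 # squared Frobenius norm
    temporal = lam * np.sum(np.abs(X - X_prev))             # l1 on the state innovations
    sparsity = sum(np.sum(gamma[k] * np.abs(X[k])) for k in range(K))
    return recon + temporal + sparsity
```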


In line with the model proposed by Karklin and Lewicki [25], we consider the sparsity parameter in terms of the causes U_t ∈ R^{(w+s−p)×(h+s−p)×D} as

  γ^k = (γ_0 / 2) [ 1 + exp( − Σ_{d=1}^{D} B_{k,d} * U_t^d ) ]                 (6)

where γ_0 > 0 is a constant. We note that all the elements of B are initialized to be nonnegative and they remain so without any additional constraint. This nonlinear multiplicative interaction between the state and the cause mappings leads to extracting information from the observations that is invariant to several transformations. Essentially, through the filters B_{k,d} ∈ R^{p×p}, U_t^d learns to group together the states that co-occur frequently. Since co-occurring components typically share some common statistical regularity, such activity typically leads to a locally invariant representation [25]. More importantly, unlike many other deep learning methods [22], [26], the activity of the causes influences the states directly through the top-down connections (B_{k,d}), and the statistical grouping is learned from the data instead of through predetermined topographic connections [27]. Given fixed state maps, the energy function that needs to be minimized to obtain the causes is

  E_u(U_t, B) = Σ_{k=1}^{K} (γ_0 / 2) [ 1 + exp( − Σ_{d=1}^{D} B_{k,d} * U_t^d ) ] · | X_t^k | + β || U_t ||_1          (7)

where we regularized the solution using an ℓ1 sparsity penalty.
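The cause-dependent sparsity of (6) and the cause energy of (7) can likewise be sketched as below (reusing the imports above; names and shapes are illustrative).

```python
def sparsity_from_causes(U, B, gamma0):
    """Eq. (6): the causes set the sparsity parameter of each state map (sketch).

    U      : (D, hu, wu)    cause maps
    B      : (K, D, p, p)   nonnegative filters
    Returns gamma with one map per state map, of size (hu+p-1, wu+p-1).
    """
    K, D = B.shape[0], B.shape[1]
    gamma = []
    for k in range(K):
        drive = sum(convolve2d(U[d], B[k, d], mode='full') for d in range(D))
        gamma.append(0.5 * gamma0 * (1.0 + np.exp(-drive)))
    return np.stack(gamma)

def cause_energy(U, X, B, gamma0, beta):
    """Eq. (7): energy minimized over the causes for fixed state maps (sketch)."""
    gamma = sparsity_from_causes(U, B, gamma0)
    data_term = sum(np.sum(gamma[k] * np.abs(X[k])) for k in range(X.shape[0]))
    return data_term + beta * np.sum(np.abs(U))
```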

2) Building a Hierarchy: Several of these single-layer models described above can easily be stacked to form a hierarchical model. Like many other deep architectures, such as deep belief networks [26] and stacked autoencoders [28], the cause maps from one layer act as observations to the layer above. However, unlike these models, each layer receives, along with the bottom-up observations, a top-down prediction of its output causes. The goal during inference of the states and the causes at any layer is then to come up with representations that best predict the observations, while reducing the top-down prediction error. More formally, combining the top-down predictions into the single layer model described in Section II-B.1, the energy function at the lth layer in the hierarchical model can be written as

  E_l(X_t^l, U_t^l, C^l, B^l) = Σ_{m=1}^{D_{l-1}} || U_t^{m,l-1} − Σ_{k=1}^{K_l} C_{m,k}^l * X_t^{k,l} ||_2^2 + λ^l || X_t^l − X_{t-1}^l ||_1 + Σ_{k=1}^{K_l} γ^k · | X_t^{k,l} | + β^l || U_t^l ||_1 + η^l || U_t^l − Û_t^l ||_2^2
  γ^k = (γ_0 / 2) ( 1 + exp( − Σ_{d=1}^{D_l} B_{k,d}^l * U_t^{d,l} ) )          (8)

where U_t^{l-1} (in the first term) are the causes coming from the layer below and Û_t^l (in the last term) are the top-down predictions coming from the state-space model in the layer above.
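Putting these together, the layer energy in (8) adds the ℓ1 prior on the causes and the top-down term to the single-layer energy; a sketch (reusing the helpers above and omitting max pooling for brevity) is:

```python
def layer_energy(U_below, X, X_prev, U, U_hat, C, B, lam, beta, eta, gamma0):
    """Layer-l energy E_l of Eq. (8) (sketch; pooling/unpooling omitted).

    U_below : causes from the layer below, acting as observations
    U_hat   : top-down prediction of this layer's causes
    """
    gamma = sparsity_from_causes(U, B, gamma0)            # Eq. (6): causes shape the state prior
    e = state_energy(U_below, X, X_prev, C, gamma, lam)   # reconstruction + temporal + sparsity
    e += beta * np.sum(np.abs(U))                         # l1 prior on the causes
    e += eta * np.sum((U - U_hat) ** 2)                   # bias toward the top-down expectation
    return e
```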


As indicated by the energy function in (8), the architecture at each layer is similar to the single layer model described before, though the number of states (K_l) and causes (D_l) might vary over the layers.

3) Implementation Details: To make the implementation more efficient, we introduce some restrictions on the architecture. First, like other convolutional models [19], [22], [29], [30], we assume sparse connectivity between the observations and the states and, also, between the states and the causes. This not only increases the efficiency during inference but also breaks the symmetry between layers and helps to learn complex relationships. Second, we shrink the size of the states using max pooling between the states and the causes. Correspondingly, the sparsity parameters (γ) obtained from the causes are unpooled during inference of the states (see Section III for details). This reduces the size of the observations going into the higher layers and, hence, is more efficient during inference. In addition, pooling has been shown to produce better invariant representations [31].

III. INFERENCE

At any layer l, inference involves finding the states X_t^l and the causes U_t^l that minimize the energy function E_l in (8). To perform this joint inference, we alternately update the states with the causes fixed and then update the causes with the states fixed, until convergence. Updating either of them involves solving an ℓ1 convolutional sparse coding problem, and we use a proximal gradient based method called FISTA [32], [33] (and some variations [34]) for this, where each update step involves computing the gradient, followed by a soft thresholding function to obtain a sparse solution.

A. Procedure

Algorithm 1 shows the steps involved in this iterative inference procedure, and below we elucidate each of the steps in detail.

Updating States: First, with the causes fixed, updating the states involves finding the gradient of all the terms other than the sparsity penalty in E_l with respect to X_t^l. For convenience, we rewrite these terms as

  h(X_t^l) = Σ_{m=1}^{D_{l-1}} || U_t^{m,l-1} − Σ_{k=1}^{K_l} C_{m,k}^l * X_t^{k,l} ||_2^2 + λ || X_t^l − X_{t-1}^l ||_1.          (9)

Since the second term in h(X_t^l), involving the state transitions, has an ℓ1 penalty, h(X_t^l) is not differentiable and the gradient does not exist everywhere. However, to approximately compute it, we first use the idea of Nesterov's smoothness [35] to approximate the nonsmooth state transition term in h(X_t^l) with a smooth function. To begin, let us consider the nonsmooth term || e_t ||_1, where e_t = vec(X_t^l) − vec(X_{t-1}^l). The idea is to approximate this term with a smooth function and compute its gradient with respect to e_t. Since e_t is a linear function of X_t^l, computing the gradient with respect to X_t^l then becomes straightforward.


Algorithm 1 Inference in CDN
Require: Inputs I_{1:T}; N: number of FISTA iterations; L: number of layers; parameters C^{1:L}, B^{1:L}
Require: Hyper-parameters λ^{1:L}, β^{1:L}, η^{1:L}, γ_0^{1:L}
Require: Initialize states X_0^{1:L} = 0
 1: for t = 1 : T do                          // Loop over time
        // Top-down predictions
 2:     for l = L : −1 : 1 do                 // Loop over layers
 3:         Compute X̂_t^l using (21)
 4:         Predict Û_t^{l-1} using (20)
 5:     end for
        // Bottom-up inference
 6:     Initialize: X_t^l = X_{t-1}^l, U_t^l = Û_t^l
 7:     for l = 1 : L do                      // Loop over layers
 8:         for n = 1 : N do                  // FISTA iterations
 9:             Compute the state prediction term α*
10:             Update the states X_t^l using (15) and (16)
11:             Max pooling: [down(X_t^{k,l}), p_t^{k,l}] = pool(X_t^{k,l})
12:             Update the causes U_t^l using (18)
13:             Unpool and recompute γ^l using (19)
14:         end for
15:     end for
16: end for

Now, using the dual of the ℓ1-norm, we can rewrite || e_t ||_1 as

  || e_t ||_1 = max_{||α||_∞ ≤ 1} α^T e_t                                      (10)

where α ∈ R^{card(e_t)}. Using Nesterov's smoothness property, we can approximate this term with a smooth function of the form

  || e_t ||_1 ≈ f_μ(e_t) = max_{||α||_∞ ≤ 1} [ α^T e_t − μ d(α) ]              (11)

where d(α) = ||α||_2^2 / 2 is a smoothness function and μ is the smoothness parameter. Following [36, Th. 1], we can show that f_μ(e_t) is convex and smooth and, moreover, the gradient of f_μ(e_t) with respect to e_t is given by

  ∇_{e_t} f_μ(e_t) = α*                                                        (12)

where α* is the optimal solution to (11). We can obtain a closed-form solution for α* as (for a proof refer to [34])

  α* = S(e_t / μ)                                                              (13)

where S(·) is a projection operator applied over every element of its argument and is defined as S(x) = x for −1 ≤ x ≤ 1, S(x) = 1 for x > 1, and S(x) = −1 for x < −1. As discussed above, using the chain rule, f_μ(e_t) is also convex and smooth in X_t^l, and its gradient ∇_{X_t^l} f_μ(e_t) remains the same as in (12).

Given this smooth approximation of the nonsmooth state transition term and its gradient, we now apply the iterative shrinkage-thresholding algorithm [32] for convolutional state-space models with a sparsity constraint. The gradient of the reformulated h(X_t^l) with respect to X_t^{k,l} is given as follows:

  ∇_{X_t^{k,l}} h(X_t^l) = − Σ_{m=1}^{D_{l-1}} C̃_{k,m} * ( U_t^{m,l-1} − Σ_{k=1}^{K_l} C_{k,m} * X_t^{k,l} ) + λ M_{α*}^k          (14)

where C̃_{k,m} indicates that the matrix C_{k,m} is flipped vertically and horizontally, and M_{α*}^k is the kth map from a matrix obtained after reshaping α*. Once we obtain the gradient, the states can be updated as

  X_t^l = X_t^l − γ^l τ ∇_{X_t^l} h(X_t^l)                                     (15)

where τ is a step size for the gradient descent update (FISTA uses a momentum term during the gradient update, leading to much faster convergence; we use this in our implementation and refer to [32] for details). Following this, we pass the updated states through a soft thresholding function that clamps the smaller values, leading to a sparse solution:

  X_t^l = sign(X_t^l) max(| X_t^l | − γ^l, 0).                                 (16)
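A single state update following (13)-(16) can be sketched as below; a full implementation would add the FISTA momentum term and run on a GPU, and the filter flip used for the adjoint convolution is an implementation detail of this sketch.

```python
def soft_threshold(Z, thresh):
    """Elementwise shrinkage used in Eqs. (16) and (18)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - thresh, 0.0)

def project_linf(A):
    """S(.) of Eq. (13): elementwise projection onto [-1, 1]."""
    return np.clip(A, -1.0, 1.0)

def state_update_step(U_below, X, X_prev, C, gamma, lam, mu, tau):
    """One smoothed proximal-gradient step on the state maps, Eqs. (13)-(16) (sketch)."""
    M, K = C.shape[0], C.shape[1]
    alpha = project_linf((X - X_prev) / mu)        # dual variable alpha* of Eq. (13)
    grad = lam * alpha                             # smoothed temporal term of Eq. (14)
    for m in range(M):
        resid = U_below[m] - sum(convolve2d(X[k], C[m, k], mode='valid') for k in range(K))
        for k in range(K):
            # back-project the residual with the flipped filter (the C-tilde term of Eq. (14))
            grad[k] -= convolve2d(resid, C[m, k][::-1, ::-1], mode='full')
    X_new = X - gamma * tau * grad                 # gradient step, Eq. (15)
    return soft_threshold(X_new, gamma)            # shrinkage, Eq. (16)
```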

Max Pooling: We then perform spatial max pooling [37] over small neighborhoods across each and every 2-D state map as

  [down(X_t^{k,l}), p_t^{k,l}] = pool(X_t^{k,l})

where p_t^{k,l} indicates the pooling indexes (refer to [38] for details). We do not pool across the state maps, so the number of state maps remains the same, while the resolution of each map decreases [denoted as down(X_t^{k,l})]. We use nonoverlapping spatial windows for the pooling operation.

Update Causes: Similar to the state updates described above, we fix the states and compute the gradient of only the smooth part of the energy function E_l [denoted as h(U_t^l)] with respect to U_t^l. Given the pooled states, the gradient can be computed as follows:

  ∇_{U_t^{d,l}} h(U_t^l) = − Σ_{k=1}^{K_l} (γ_0 / 2) B̃_{k,d} * [ exp( − Σ_{d=1}^{D_l} B_{k,d}^l * U_t^{d,l} ) · down(| X_t^{k,l} |) ] + 2η ( U_t^{d,l} − Û_t^{d,l} ).          (17)

Using this gradient information, we update the causes by first taking a gradient step, followed by a soft thresholding function:

  U_t^l = U_t^l − β^l τ ∇_{U_t^l} h(U_t^l)
  U_t^l = sign(U_t^l) max(| U_t^l | − β^l, 0).                                 (18)
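The corresponding cause update of (17) and (18) can be sketched as follows (soft_threshold as above; the adjoint 'valid' convolution with the flipped filter and the array sizes are assumptions of this sketch):

```python
def cause_update_step(U, U_hat, B, Xp_abs, gamma0, eta, beta, tau):
    """One proximal-gradient step on the cause maps, Eqs. (17)-(18) (sketch).

    U      : (D, q, q)          cause maps
    U_hat  : same shape as U    top-down prediction of the causes
    B      : (K, D, p, p)       filters linking causes to state sparsity
    Xp_abs : (K, q+p-1, q+p-1)  absolute values of the (pooled) state maps
    """
    K, D = B.shape[0], B.shape[1]
    grad = 2.0 * eta * (U - U_hat)                 # top-down term of Eq. (17)
    for k in range(K):
        drive = sum(convolve2d(U[d], B[k, d], mode='full') for d in range(D))
        weighted = 0.5 * gamma0 * np.exp(-drive) * Xp_abs[k]
        for d in range(D):
            # adjoint of the 'full' convolution above: 'valid' convolution with the flipped filter
            grad[d] -= convolve2d(weighted, B[k, d][::-1, ::-1], mode='valid')
    U_new = U - beta * tau * grad                  # gradient step of Eq. (18)
    return soft_threshold(U_new, beta)             # shrinkage of Eq. (18)
```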


Fig. 2. Block diagram of the inference procedure, with arrows indicating the flow of information during inference.

Unpooling: Now, after updating the causes, we reevaluate the sparsity parameter γ for the next iteration. We do this as follows:

  γ^{k,l} = (γ_0 / 2) ( 1 + exp( − unpool_{p_t^{k,l}}( Σ_{d=1}^{D_l} B_{k,d}^l * U_t^{d,l} ) ) )          (19)

where unpool_{p_t^{k,l}}(·) indicates reversing the pooling operation using the indexes p_t^{k,l} obtained during the max pooling operation described above [39]. Notice that, while the inputs to the pooling operation are the inferred states, the inputs to the unpooling operation are the likely states generated by the causes.

Overall Iteration: A single iteration consists of the four steps mentioned above: update the states using a single FISTA step, perform max pooling over the states, update the causes using a single FISTA step and, finally, reevaluate the sparsity parameter for the next iteration. All the computations during inference involve only basic operations, such as convolution, summation, pooling, and unpooling. All of these can be efficiently implemented on a GPU with parallelization [38], making the overall process very quick.

B. Approximate Inference With Top-Down Connections

In the inference procedure described above, while updating the causes, we assumed that the top-down predictions Û_t^l are already available and constant throughout the inference procedure. However, ideally, this should not be the case. Since the layers are arranged in a Markov chain, all the layers have to be updated concurrently, while passing top-down and bottom-up information, until the system reaches an equilibrium. In practice, this can be very slow to converge. To avoid this, we perform an approximate inference, where we first make a single approximate top-down prediction at each time step using the states from the previous time instance and then perform a single bottom-up inference with fixed top-down predictions, starting from the bottom layer. More formally, at every time step, using the state-space model at each layer, we predict the most likely causes at the layer below (Û_t^{l-1}), given only the previous states and the predicted causes from the layer above. Mathematically, the top-down

prediction at layer l can be written as

  Û_t^{m,l-1} = Σ_{k=1}^{K_l} C_{m,k} * X̂_t^{k,l},   ∀ m ∈ {1, 2, ..., D_{l-1}}

where

  X̂_t^l = arg min_{X_t^l}  λ^l || X_t^l − X_{t-1}^l ||_1 + || γ̂ · X_t^l ||_1
  γ̂^k = (γ_0 / 2) ( 1 + exp( − unpool_{p_{t-1}^{k,l}}( Σ_{d=1}^{D_l} B_{k,d} * Û_t^{d,l} ) ) )          (20)

and Û_t^l itself is a top-down prediction coming from layer l + 1. At the top layer, we consider the output from the previous time step as the predicted causes, i.e., Û_t^L = U_{t-1}^L, allowing temporal smoothness over the outputs of the model. A simple analytic solution can be obtained for X̂_t^l in (20) and is given by

  X̂_t^{k,l}(i, j) = X_{t-1}^{k,l}(i, j)  if  γ̂^k(i, j) < λ^l,  and  0  if  γ̂^k(i, j) ≥ λ^l.          (21)

Fig. 2 shows the block diagram of a two-layered network, indicating the flow of information during inference.

Before going further, there are several important things about this inference procedure that are worth noting. First, the prior (or the regularization term) on the causes in the hierarchical model (8) involves two terms: an ℓ1 regularization encouraging sparsity and an ℓ2 term with a bias coming from the top-down predictions. This resembles the elastic net regularization [40], albeit with a bias. Second, observing the role of the top-down predictions during inference, through (17) and (21), one can see that they play the dual roles of driving as well as modulatory signals. While in the former equation the predictions Û_t^l drive the representations through the gradient and bias them toward some top-down expectations, in (21) they shut off some of the state elements while performing top-down predictions and, hence, act as a modulatory signal. In addition, at any layer the mapping between the inputs and the output causes is highly nonlinear. This nonlinearity comes from several factors: 1) the thresholding function used while updating the states and the causes; 2) the pooling operation; and 3) the exponential interaction between the causes and the states, as shown in (17).
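The per-layer top-down prediction of (20) and (21) reduces to an elementwise gate on the previous states followed by a convolution; a sketch (omitting the pooling indexes and assuming compatible array sizes) is:

```python
def top_down_predict(X_prev, U_hat, B, C, gamma0, lam):
    """Top-down prediction at one layer, Eqs. (20)-(21) (sketch; unpooling omitted).

    X_prev : (K, H, W)  previous state maps of this layer
    U_hat  : (D, q, q)  predicted causes of this layer (from the layer above)
    Returns (X_hat, U_hat_below): gated states and predicted causes for the layer below.
    """
    K, D, M = B.shape[0], B.shape[1], C.shape[0]
    X_hat = np.empty_like(X_prev)
    for k in range(K):
        drive = sum(convolve2d(U_hat[d], B[k, d], mode='full') for d in range(D))
        gamma_hat = 0.5 * gamma0 * (1.0 + np.exp(-drive))
        # Eq. (21): keep a previous state element only where the predicted sparsity is weak
        X_hat[k] = np.where(gamma_hat < lam, X_prev[k], 0.0)
    # Eq. (20): predict the causes (observations) of the layer below from the gated states
    U_hat_below = np.stack([sum(convolve2d(X_hat[k], C[m, k], mode='valid') for k in range(K))
                            for m in range(M)])
    return X_hat, U_hat_below
```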


IV. LEARNING

During learning, the goal is to estimate the filters across all the layers in the model, such that they capture the structure across the entire input sequence {I_1, I_2, ..., I_T}. We do this in a greedy layer-wise fashion, where we estimate the parameters of one layer at a time, starting from the bottom layer. We do not consider any top-down connections during learning [i.e., we set η^l = 0 in (8) during learning]. At layer l, after inferring X_t^l and U_t^l and fixing them, we update the filters C^l and B^l using gradient descent with momentum, minimizing the cost function E_l(·). The gradient of E_l(·) with respect to C^l can be computed as

  ∇_{C_{m,k}^l} E_l = −2 X̃_t^{k,l} * ( I_t^m − Σ_{k=1}^{K_l} C_{m,k} * X_t^{k,l} )          (22)

and the gradient of E_l(·) with respect to B^l can be computed as

  ∇_{B_{k,d}^l} E_l = − Ũ_t^{d,l} * [ exp( − Σ_{d=1}^{D_l} B_{k,d}^l * U_t^{d,l} ) · down(| X_t^{k,l} |) ].          (23)

After updating the filters, we normalize each filter to be of unit norm to avoid a trivial solution.
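A sketch of the filter update for C in (22), followed by the unit-norm projection described above (the update for B in (23) follows the same pattern); the gradient back-projects the reconstruction residual through each state map:

```python
def filter_gradient_C(I, X, C):
    """Gradient of the layer energy w.r.t. the observation filters C, Eq. (22) (sketch)."""
    M, K = C.shape[0], C.shape[1]
    grad = np.zeros_like(C)
    for m in range(M):
        resid = I[m] - sum(convolve2d(X[k], C[m, k], mode='valid') for k in range(K))
        for k in range(K):
            # back-project the residual through the state map to get an s-by-s filter gradient
            grad[m, k] = -2.0 * convolve2d(X[k][::-1, ::-1], resid, mode='valid')
    return grad

def normalize_filters(C):
    """Project every filter back to unit norm after a gradient step (avoids trivial solutions)."""
    for m in range(C.shape[0]):
        for k in range(C.shape[1]):
            C[m, k] /= np.linalg.norm(C[m, k]) + 1e-12
    return C
```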

TABLE I: MODEL ARCHITECTURE

V. EXPERIMENTS

We test the performance of the proposed model on various tasks: its ability to learn hierarchical representations and object parts from unlabeled video sequences, object recognition with contextual information, sequential labeling in video sequences for recognition, and robustness in noisy environments.

Preprocessing: In all the experiments, we perform the same preprocessing on the images/videos. Each frame in a video sequence (or each image) is converted into gray scale. We then normalize each frame to be zero mean and unit norm, followed by local contrast normalization as described in [37].

A. Learning From Natural Video Sequences

To visualize what internal representations the model can learn, we construct a two-layered network using the Hans van Hateren natural scene videos. Here, each frame is 128 × 128 pixels in size and is preprocessed as described above. The bottom layer consists of 16 states of 7 × 7 filters and 32 causes of 6 × 6 filters, while the second layer is made up of 64 states of 7 × 7 filters and 128 causes of 6 × 6 filters. The pooling size between the states and the causes for both layers is 2 × 2. Table I summarizes the model architecture, and Fig. 3 shows the receptive fields of the first layer states and the second layer causes. We observe that the receptive fields of the first layer states [Fig. 3 (top)] resemble simple oriented filters, similar to those obtained from sparse encoding methods [15].
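For reference, the two-layer configuration just described (cf. Table I) can be written out explicitly; the field names below are illustrative.

```python
# Two-layer CDN used for the van Hateren video experiments (values taken from the text above).
model_config = [
    {"layer": 1, "num_states": 16, "state_filter": (7, 7),
     "num_causes": 32, "cause_filter": (6, 6), "pooling": (2, 2)},
    {"layer": 2, "num_states": 64, "state_filter": (7, 7),
     "num_causes": 128, "cause_filter": (6, 6), "pooling": (2, 2)},
]
```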

Fig. 3. Receptive fields of the two-layered network learned on natural video sequences. Top: receptive fields of layer-1 states. Bottom: receptive fields of layer-2 causes. They are constructed as a linear combination of bottom layer filters.

The receptive fields of the second layer causes, as shown in Fig. 3 (bottom), contain more complex structures like edge junctions and curves. These are constructed as weighted combinations of the lower layer filters.

B. Object Recognition: Caltech-101 Dataset

One advantage of distributive models, like ours, is their ability to transfer the model learned on unlabeled data to extract features for generic object recognition, the so-called self-taught learning [21]. We use this to assess the quality of the learning procedure by performing object recognition on static images from the Caltech-101 dataset [43]. Each image in the dataset is resized to 152 × 152 (zero padded to preserve the aspect ratio) and preprocessed as described above. We use the same two-layered model learned from natural video sequences as above and extract features for each image using a single bottom-up inference, i.e., without any temporal or top-down information, by setting λ = 0 and η = 0 for both the layers in (8). Note that, without temporal and top-down connections, the proposed model is similar to deconvolutional networks (DNs) [22]. Following the procedure described in [22] and [30], the output causes from layer 1 and layer 2 are taken and made into a three-level spatial pyramid for each layer output [42]. They are both then concatenated to form a feature vector for each image and are fed as inputs to a


TABLE II: CLASSIFICATION PERFORMANCE OVER CALTECH-101 DATASET WITH ONLY A SINGLE BOTTOM-UP INFERENCE

TABLE III: CLASSIFICATION PERFORMANCE OVER COIL-100 DATASET WITH VARIOUS CONFIGURATIONS

linear classifier (we use an L2-SVM from the liblinear package [44]). Table II shows the results obtained when 30 images per class were used for training and testing, following the standard protocol, and averaged over 10 runs. The parameters of the model were set through cross validation. We observe that using layer 1 causes alone leads to an accuracy of 62.1%, while using the causes from both the layers improves it to 66.9%. These results are comparable with other similar methods that use a convolutional architecture [19], [22], [30], [41] and are slightly better than using hand-designed features like SIFT [42].

C. Recognition With Context

As discussed in [1], visual perception is not static and uses contextual information from both space and time. We argue that our model can effectively utilize this contextual information and produce a robust representation of the objects in video sequences. While the temporal relationships are encoded through the state-space model at each layer, the spatial context modulates the representation through two different mechanisms: 1) spatial convolution along with sparsity ensures that there is competition between elements, leading to some kind of interaction across space and 2) the top-down modulations coming from higher layers contribute context from accumulated information across larger receptive fields. To test this hypothesis, we show the performance of the model over two different tasks. First, we show that using contextual information during inference can lead to a consistent representation of the objects, even in cases where there are large transformations of the object over time. We use the COIL-100 [45] dataset for this task. Second, we use the model for a sequence labeling task, where we assign a class to each frame in a sequence before classifying the entire sequence as a whole. The goal is to show the extent of invariance the model can encode, particularly in cases of corrupted inputs. We test the performance using the Honda/UCSD [46] and the YouTube celebrities [47] face video datasets.

1) Role of Contextual Information During Inference: For this experiment, we consider the COIL-100 dataset, which contains 100 different objects (or classes). For each object there is a sequence obtained by placing the object on a turntable and taking a picture for every 5° turn, resulting in a 72-frame-long video per object. Each frame is resized to 128 × 128 pixels and preprocessed as described above.

We use the same two-layered network described in Section V-A and perform inference with top-down connections over each of the sequences. We combine the causes from both the layers for each frame and use them to train a linear SVM for classification. As per the protocol described in [45], we consider 4 frames per object at viewing angles 0°, 90°, 180°, and 270° as labeled data used for training the classifier, and the rest are used for testing. Notice that we assume that we have access to the test samples during training. This resembles the transductive learning setting described in [20]. Here, we compare the proposed method with other deep learning models: a two-stage hierarchical model built using more biologically plausible feature detectors called the view-tuned network [48], stacked independent subspace analysis learned with temporal regularization (Stacked ISA + Temporal) [18], and convolutional networks trained with temporal regularization (ConvNet + Temporal) [20]. While the first two methods do not utilize contextual information during training of the classifier (Zou et al. [18] use temporal regularization only, while learning the model with an unrelated video sequence), Mobahi et al. [20] use a setting similar to ours in which the entire object sequence is considered during training. In addition, we consider three different settings during inference in our model, where each frame is processed: 1) independently and without considering any contextual information, i.e., no temporal or top-down connections; 2) with only temporal connections; and 3) with both temporal and top-down connections.

As shown in Table III, our method performed much better than all the other methods when contextual information was used. While using temporal connections alone proved sufficient to obtain good performance, having top-down connections improved the performance further. On the other hand, not using any contextual information led to a significant drop in performance. In addition, we would like to emphasize that the model was learned on video sequences completely unrelated to the task, indicating that the contextual information during inference is more important than using it just for training the classifier as in [20]. The reason for this could be that contextual information might push the representations from each sequence into a well-defined attractor, separating it from other classes (more about this in Section V-D).


Fig. 4. Part of the face sequence (from left to right) belonging to three different subjects extracted from Honda/UCSD dataset.

2) Sequence Labeling: While the above experiment shows the role of context during inference, it does not tell much about the discriminability of the model itself. For this, in the following experiment, we test the performance of the proposed model on a sequence labeling task, where the goal is to classify a probe sequence given a set of labeled training sequences. Here, we perform this experiment on face recognition in the Honda/UCSD dataset [46] and the YouTube celebrities dataset [47]. The Honda dataset contains 59 videos of 20 different subjects, while the YouTube dataset contains 1910 videos of 47 subjects. We note here that, while the Honda dataset is obtained from a controlled environment, the YouTube dataset is obtained from a more natural setting, with very noisy and low-resolution videos, making the task very challenging.

First, for all the videos, the faces in each frame were detected using Viola–Jones face detection [53] and then resized to 20 × 20 pixels for the Honda dataset and 30 × 30 pixels for the YouTube dataset. Fig. 4 shows some example face sequences obtained from the Honda/UCSD dataset. Each set of faces detected from a video is then considered as an observation sequence. Next, in addition to the preprocessing described above, we also performed histogram equalization on each frame to remove any illumination variations. Finally, following the standard protocol [46], for the Honda dataset we consider 20 face sequences for training and the remaining 39 sequences for testing. We report results using a varying number of frames per sequence (N) as defined in [51]: 50, 100, and full length. When the length of the sequence was less than N, all the frames in the sequence were used. In the YouTube dataset, we first randomly partitioned the dataset into 10 subsets of nine videos each, then divided each subset into three videos for training and the remaining six for testing. We report the average performance over all the 10 subsets.

For the Honda dataset, we used the 20 training sequences to learn a two-layered network, with the first layer made up of 16 states and 48 causes and the second layer made up of 64 states and 100 causes. All the filters are 5 × 5 pixels in size, and the pooling size in both the layers is 2 × 2. We used a similar architecture for the YouTube dataset as well, but with filter size 7 × 7, and the model parameters were learned by randomly sampling from all the sequences in the dataset. We emphasize that the learning is completely unsupervised. During classification, for the Honda dataset, the inferred causes from both the layers for each frame were concatenated and used as feature vectors. On the other hand, for the YouTube dataset, we made a three-level spatial pyramid of the causes from both the layers [42] and used it as a feature vector. Any probe sequence was assigned a class based on the label predicted most often across all its frames.

All the parameters were set after performing a parameter sweep to find the best performance (on the YouTube dataset, the parameter sweep was done on a single subset and the same parameters were used for the rest of the subsets).

Table IV summarizes the results obtained on the Honda/UCSD dataset. We compare our method here with manifold discriminant analysis (MDA) [49], set-based face recognition methods (Affine Hull Image Set Distance and Convex Hull Image Set Distance) [50], sparse approximated nearest points (SANP) [51], and dictionary-based face recognition from video [52]. Our method (CDN) clearly outperforms all these methods, across all the sequence lengths considered. On the YouTube dataset, we compare our method, in addition to SANP [51] and MDA [49], with other methods that use covariance features [Covariance matrix (COV) + Partial Least Squares (PLS)] [54] and kernel learning [COV + Kullback–Leibler (KL) and Proj. + KL] [55]. As shown in Table V, the proposed model is competitive with the other state-of-the-art methods. We note here that most of the methods mentioned above (particularly, COV + PLS, Proj. + PLS, and COV + KL) consider all the frames in the sequence to extract features before performing classification. On the other hand, we perform sequential labeling, utilizing knowledge only from the past frames to extract the features. In addition, without either the temporal or top-down connections, the performance of the proposed method again drops to around 69.5% (CDN without context).

Finally, to evaluate the performance of the model with noisy observations, we corrupt the Honda/UCSD sequences with some structured noise in the above experiment (but maintain the same parameters learned from clean sequences). We make the noisy sequences as follows: one-half of each frame of all the sequences is corrupted by adding one-half of a randomly chosen frame of a random subject. We repeat this a number of times per frame (the number is based on a Poisson distribution with mean 2). Fig. 5 summarizes the classification results [per sequence in Fig. 5(a) and per frame in Fig. 5(b)] obtained with a sequence length of 50 frames. While the performance of the proposed model drops in both cases, i.e., with and without temporal and top-down connections (denoted as CDN with context and CDN without context, respectively), the performance drop is much steeper when contextual information is not used than when it is. The difference is more prominent in the classification accuracy per frame. For comparison, we also show the performance of SANP, whose performance drops significantly with noise.

3) Analysis of Temporal and Top-Down Connections: To understand the extent of influence the temporal and top-down connections have on the representations, we varied the hyperparameters λ and η in (8), which determine the extent of influence they have, respectively, during inference. We used the same experimental setup as above with the noisy Honda/UCSD sequences and recorded the classification performance (per sequence and per frame) for different λ and η values. To make the visualization easier, we used the same set of hyperparameters for both the layers, with sparsity parameters fixed at γ0 = 0.3 and β = 0.05, which were obtained after performing a parameter sweep for best performance.


TABLE IV: RECOGNITION RATE (IN PERCENTAGE) FOR FACE RECOGNITION IN HONDA/UCSD DATASET

TABLE V: CLASSIFICATION PERFORMANCE OVER YOUTUBE CELEBRITIES DATASET

Fig. 5. Classification performance on the noisy Honda/UCSD dataset with 50 frames per sequence. The plots show the recognition rates (a) per sequence and (b) per frame of different methods with clean (green/dark) and noisy (yellow/light) sequences. For noisy sequences, the performance shown is averaged over five runs.

Fig. 6 shows the recognition rate on the noisy Honda/UCSD dataset as a function of both the temporal connection parameter (λ) and the top-down connection parameter (η). We observe that the performance depends on both parameters, which should be set reasonably (neither too high nor too low) to obtain good results. While these plots show the effective contribution of temporal and top-down connections, they also show something more interesting. While the performance is better with either temporal or top-down connections, the best performance is obtained only when both are available. This indicates that both temporal and top-down connections play an equally important role during inference. This is in accordance with other predictive coding models used for detecting bird songs [56].

D. Learning Hierarchy of Attractors

In this section, we further analyze the model from a slightly different perspective. The aim here is to visualize and understand the representations learned in the hierarchical

Fig. 6. Performance on the noisy Honda/UCSD dataset for various values of λ and η. (a) and (c) Recognition rates versus the temporal connection parameter (λ), where each color plot indicates a particular value of η. (b) and (d) Recognition rates versus the top-down connection parameter (η), where each color plot indicates a particular value of λ. (a) and (b) show recognition rates per sequence; (c) and (d) show recognition rates per frame. Note that higher recognition rates per frame do not always reflect as higher recognition rates per sequence.

model and get some insight into the working of the top-down and temporal connections. The key assumption in our model is that any visual input sequence unfolds with well-defined spatio-temporal dynamics [56], [57] and that these dynamics can be modeled as trajectories in some underlying attractor manifold. In the hierarchical setting, we further assume that the shape of the manifold that describes the inputs is itself modulated by the dynamics in an even higher level attractor manifold. From a generative model perspective, this is equivalent to saying that a sequence of causes in a higher layer nonlinearly modulates the dynamics of lower layer representations, which in turn represent an input sequence. In other words, as succinctly described in [56], such hierarchical dynamic models represent the inputs as sequences of sequences. In the following experiments, we show that the model can learn a hierarchy of attractors such that the complexity of the representation increases with the depth of the model. In addition, we show that the temporal and top-down


Fig. 7. Hierarchical decomposition of object parts learned by the model from face videos of 16 different subjects in the VidTIMIT dataset. (a) Receptive fields of layer 1 causes. (b) Receptive fields of layer-2 causes. Both are constructed as weighted linear combination of filters in the layers below. Top: layer-1 states have similar receptive fields, as shown in Fig. 3.

connections (or empirical priors) lead the representations into stable attractors, making them robust to noise.

1) Learning Object Parts From Unlabeled Sequences: We show that the model can learn the hierarchical compositions of the objects from the data itself in a completely unsupervised manner. For this, we consider the VidTIMIT dataset [58], where face videos of 16 different people with different facial expressions are used as inputs. We learned a two-layered network with 16 first layer states, 36 first layer causes, 36 second layer states, and 16 second layer causes. We further use 3 × 3 nonoverlapping pooling regions in the first layer and 2 × 2 nonoverlapping pooling regions in the second layer. We constructed the receptive fields of the layer 1 and layer 2 causes using linear combinations of the bases in the layers below; they are shown in Fig. 7. We observe that the model is able to learn a hierarchical structure of the faces. While the first layer states represent primitive features like edges, the first layer causes learn parts of the faces. The second layer causes, where the model combines the responses of the first layer causes, are able to represent an entire face. More importantly, we observe that each cause unit in the second layer is specific to a particular face (or object), increasing the discriminability between faces.

2) Denoising Videos Using Top-Down Information: We next show that the top-down information can be useful to denoise a highly corrupted video using the contextual information. To show this, we used the same model as above on the face video sequences. We corrupted a face video sequence (different from the one used to learn the model) with structured noise, where one-fourth of each frame was occluded with a completely unrelated image. There was no correlation between the occlusions in two consecutive frames. Figs. 8 and 9 show the results obtained (these results are made into videos, available at http://cnel.ufl.edu/~rakesh/face_video_2013.html). In Fig. 8, we project the response of the layer two states into the input space to understand the underlying representation of the model. Since the layer two states get information from the bottom layer as well as top-down information from the second layer causes, they should be able to resolve the occluded portion of the video sequence using the

Fig. 8. Video denoising with temporal and top-down connections. (a) and (b) Examples with two different video sequences. Top: corrupted video sequences where in every frame one-fourth of the frame is occluded with an unrelated image. Middle: linear projection of layer-2 states onto the image space when inference is performed with temporal and top-down connections. Bottom: similarly, linear projection of layer-2 states when inference is performed without temporal or top-down connections.

Fig. 9. PCA projections of layer two causes in the denoising experiment (a) without and (b) with temporal and top-down connections.

contextual information over time and space. We observe that with the top-down information the representation stabilized over time and the model was able to resolve the occluded part of the input video sequence. On the other hand, without the contextual information the representations did not converge to a stable solution. Fig. 9 shows the 2-D principal component analysis (PCA) projections of the layer-2 causes. Again, we observe that the representations obtained with temporal and top-down connections for each subject are stable and mapped into well-defined attractors, separated from one another [Fig. 9(b)]. On the other hand, without these connections the representations are not stable and cannot be well separated [Fig. 9(a)].

VI. DISCUSSION

A. Relationship With Feed-Forward Networks

Many deep learning methods, such as deep belief networks [26], stacked autoencoders [28], and convolutional neural networks [29], encode the inputs as a hierarchical representation. It is observed that increasingly invariant representations can be obtained with the depth of the hierarchical models [59]. However, in contrast to our model, these methods neither perform explaining away nor consider temporal and top-down connections, and they only focus on feed-forward rapid recognition without context.

The proposed model can also be written as a feed-forward network by performing approximate inference. Starting from


initial rest (i.e., all the variables are initialized to zeros) and considering only a single FISTA iteration, the states and the causes can be (approximately) inferred as [60]

  X_t^{l,k} = T_{γ_0}( (1/L) Σ_{m=1}^{D_{l-1}} C̃_{k,m} * U_t^{m,l-1} )
  U_t^{l,d} = T_{β}( (1/L) Σ_{k=1}^{K_l} B_{k,d} * X_t^{k,l} )                 (24)

where T_γ(·) is a soft thresholding function and L determines the step size.
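A sketch of this single-pass feed-forward approximation (reusing soft_threshold from the inference sketch; pooling and local contrast normalization are omitted, and the 'full'/'valid' convolution modes are shape assumptions of this sketch):

```python
def feed_forward_layer(U_below, C, B, gamma0, beta, L_step):
    """Single bottom-up pass of one layer, Eq. (24) (sketch)."""
    M, K, D = C.shape[0], C.shape[1], B.shape[1]
    X = np.stack([soft_threshold(sum(convolve2d(U_below[m], C[m, k][::-1, ::-1], mode='full')
                                     for m in range(M)) / L_step, gamma0)
                  for k in range(K)])
    U = np.stack([soft_threshold(sum(convolve2d(X[k], B[k, d], mode='valid') for k in range(K))
                                 / L_step, beta)
                  for d in range(D)])
    return X, U
```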

However, such representations have only a limited capacity, as there is no competition between the elements to explain the inputs. On the Caltech-101 experiment described in Section V-B, such approximate inference only produced a modest rate of 46% (chance is below 1%); it should be noted that we do not perform any local contrast normalization between layers, which is reported to produce better performance in feed-forward networks [31].

B. Comparison With Other Methods

The methods that are closest to our approach are those involving convolutional sparse coding [22], [38]. DN uses a similar hierarchical sparse coding but does not consider temporal or top-down connections. Our method can be considered a generalization of DN to a broader class of dynamic systems. In addition, pooling with switch settings [38] can also be incorporated into our model without any significant changes. Deep Boltzmann machines [61] and convolutional deep belief networks [30] use undirected graphical models to construct deep networks. Similar to our model, they also incorporate top-down connections, though they do not consider temporal connections. However, they rely on sampling methods to perform inference and require several iterations across the layers before a stable solution is obtained. In contrast, in our method, we only perform a single top-down and bottom-up pass. In addition, learning in these models is slow.

VII. CONCLUSION

In this paper, we proposed a novel CDN based on the predictive coding framework. The crux of our approach is to build a hierarchical generative model by stacking several state-space models with sparse states and causes. The temporal (or recurrent) connections within each layer and the interaction between top-down and bottom-up connections across the layers allow the model to incorporate contextual information while extracting features from a sequence of sensory signals. We have shown that these features are stable and robust to transformations and noise on the objects in the input sequence. The performance of the model on object recognition and sequence labeling datasets shows that using contextual information can lead to significant gains in recognition rates.

ACKNOWLEDGMENT

The authors would like to thank M. Emigh for his comments, which improved this manuscript significantly.


REFERENCES

[1] O. Schwartz, A. Hsu, and P. Dayan, “Space and time in visual context,” Nature Rev. Neurosci., vol. 8, no. 7, pp. 522–535, Jul. 2007. [Online]. Available: http://dx.doi.org/10.1038/nrn2155
[2] D. J. Felleman and D. C. Van Essen, “Distributed hierarchical processing in the primate cerebral cortex,” Cerebral Cortex, vol. 1, no. 1, pp. 1–47, Jan. 1991. [Online]. Available: http://dx.doi.org/10.1093/cercor/1.1.1-a
[3] M. Bar, “Visual objects in context,” Nature Rev. Neurosci., vol. 5, no. 8, pp. 617–629, 2004.
[4] R. P. N. Rao and D. H. Ballard, “Dynamic model of visual recognition predicts neural response properties in the visual cortex,” Neural Comput., vol. 9, no. 4, pp. 721–763, 1997.
[5] K. Friston, “Hierarchical models in the brain,” PLoS Comput. Biol., vol. 4, no. 11, p. e1000211, Nov. 2008.
[6] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1561/2200000006
[7] R. Rigamonti, M. A. Brown, and V. Lepetit, “Are sparse representations really relevant for image classification?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 1545–1552.
[8] C. D. Gilbert and W. Li, “Top-down influences on visual processing,” Nature Rev. Neurosci., vol. 14, no. 5, pp. 350–363, 2013.
[9] T. S. Lee and D. Mumford, “Hierarchical Bayesian inference in the visual cortex,” J. Opt. Soc. Amer. A, Opt., Image Sci., Vis., vol. 20, no. 7, pp. 1434–1448, Jul. 2003.
[10] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neurosci., vol. 2, no. 1, pp. 79–87, Jan. 1999. [Online]. Available: http://dx.doi.org/10.1038/4580
[11] S. J. Kiebel, J. Daunizeau, and K. J. Friston, “A hierarchy of timescales and the brain,” PLoS Comput. Biol., vol. 4, no. 11, p. e1000209, Nov. 2008.
[12] S. J. Kiebel, K. von Kriegstein, J. Daunizeau, and K. J. Friston, “Recognizing sequences of sequences,” PLoS Comput. Biol., vol. 5, no. 8, p. e1000464, Aug. 2009. [Online]. Available: http://dx.doi.org/10.1371/journal.pcbi.1000464
[13] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. 7th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2. Washington, DC, USA, Sep. 1999, pp. 1150–1157.
[14] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1. Washington, DC, USA, Jun. 2005, pp. 886–893.
[15] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, Jun. 1996.
[16] L. Wiskott and T. Sejnowski, “Slow feature analysis: Unsupervised learning of invariances,” Neural Comput., vol. 14, no. 4, pp. 715–770, Apr. 2002.
[17] H. Lee, C. Ekanadham, and A. Y. Ng, “Sparse deep belief net model for visual area V2,” in Advances in Neural Information Processing Systems 20. Red Hook, NY, USA: Curran Associates, 2007, pp. 873–880.
[18] W. Zou, S. Zhu, K. Yu, and A. Y. Ng, “Deep learning of invariant features via simulated fixations in video,” in Advances in Neural Information Processing Systems 25. Red Hook, NY, USA: Curran Associates, 2012, pp. 3212–3220.
[19] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, “Learning convolutional feature hierarchies for visual recognition,” in Advances in Neural Information Processing Systems 23. Red Hook, NY, USA: Curran Associates, 2010, pp. 1090–1098.
[20] H. Mobahi, R. Collobert, and J. Weston, “Deep learning from temporal coherence in video,” in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2009, pp. 737–744. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553469
[21] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, “Self-taught learning: Transfer learning from unlabeled data,” in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 759–766.
[22] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2528–2535.
[23] K. Friston, “A theory of cortical responses,” Philosoph. Trans. Roy. Soc. B, Biol. Sci., vol. 360, no. 1456, pp. 815–836, 2005.
[24] C. Cadieu and B. A. Olshausen, “Learning transformational invariants from natural movies,” in Advances in Neural Information Processing Systems 21. Red Hook, NY, USA: Curran Associates, 2008, pp. 209–216.


[25] Y. Karklin and M. S. Lewicki, “A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals,” Neural Comput., vol. 17, no. 2, pp. 397–423, 2005.
[26] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, Jul. 2006.
[27] A. Hyvärinen and P. O. Hoyer, “A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images,” Vis. Res., vol. 41, no. 18, pp. 2413–2423, Aug. 2001.
[28] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Mar. 2010.
[29] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Comput., vol. 1, no. 4, pp. 541–551, Dec. 1989. [Online]. Available: http://dx.doi.org/10.1162/neco.1989.1.4.541
[30] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. 26th Annu. Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2009, pp. 609–616.
[31] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, “Learning mid-level features for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2559–2566.
[32] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.
[33] R. Chalasani, J. C. Principe, and N. Ramakrishnan, “A fast proximal method for convolutional sparse coding,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), Aug. 2013, pp. 1–5.
[34] R. Chalasani and J. C. Principe, “Deep predictive coding networks,” in Proc. Workshop Int. Conf. Learn. Represent. (ICLR), Apr. 2013, doi: arXiv:1301.3541.
[35] Y. Nesterov, “Smooth minimization of non-smooth functions,” Math. Programm., vol. 103, no. 1, pp. 127–152, 2005.
[36] X. Chen, Q. Lin, S. Kim, J. G. Carbonell, and E. P. Xing, “Smoothing proximal gradient method for general structured sparse regression,” Ann. Appl. Statist., vol. 6, no. 2, pp. 719–752, 2012.
[37] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 2146–2153.
[38] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2018–2025.
[39] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[40] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” J. Roy. Statist. Soc., B (Statist. Methodol.), vol. 67, no. 2, pp. 301–320, 2005.
[41] B. Chen, G. Polatkan, G. Sapiro, L. Carin, and D. B. Dunson, “The hierarchical beta process for convolutional factor analysis and deep learning,” in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 361–368.
[42] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2. 2006, pp. 2169–2178, doi: 10.1109/CVPR.2006.68.
[43] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” Comput. Vis. Image Understand., vol. 106, no. 1, pp. 59–70, 2007.
[44] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jun. 2008.
[45] S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (COIL-20),” Columbia Univ., New York, NY, USA, Tech. Rep. CUCS-005-96, 1996.
[46] K.-C. Lee, J. Ho, M.-H. Yang, and D. Kriegman, “Visual tracking and recognition using probabilistic appearance manifolds,” Comput. Vis. Image Understand., vol. 99, no. 3, pp. 303–331, 2005.
[47] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, “Face tracking and recognition with visual constraints in real-world videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.
[48] H. Wersing and E. Körner, “Learning optimized features for hierarchical models of invariant object recognition,” Neural Comput., vol. 15, no. 7, pp. 1559–1588, Jul. 2003.
[49] R. Wang and X. Chen, “Manifold discriminant analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 429–436.

[50] H. Cevikalp and B. Triggs, “Face recognition based on image sets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2567–2573.
[51] Y. Hu, A. S. Mian, and R. Owens, “Sparse approximated nearest points for image set classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 121–128.
[52] Y.-C. Chen, V. M. Patel, P. J. Phillips, and R. Chellappa, “Dictionary-based face recognition from video,” in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012, pp. 766–779.
[53] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. Comput. Vis., vol. 57, no. 2, pp. 137–154, May 2004.
[54] R. Wang, H. Guo, L. S. Davis, and Q. Dai, “Covariance discriminative learning: A natural and efficient approach to image set classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 2496–2503.
[55] R. Vemulapalli, J. K. Pillai, and R. Chellappa, “Kernel learning for extrinsic classification of manifold features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1782–1789.
[56] K. Friston and S. Kiebel, “Cortical circuits for perceptual inference,” Neural Netw., vol. 22, no. 8, pp. 1093–1104, 2009.
[57] D. George and J. Hawkins, “A hierarchical Bayesian model of invariant pattern recognition in the visual cortex,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), vol. 3. Jul./Aug. 2005, pp. 1812–1817.
[58] C. Sanderson, Biometric Person Recognition: Face, Speech and Fusion. Saarbrücken, Germany: VDM Verlag, 2008.
[59] I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, “Measuring invariances in deep networks,” in Advances in Neural Information Processing Systems 22. Red Hook, NY, USA: Curran Associates, 2009, pp. 646–654.
[60] M. Denil and N. de Freitas. (2012). “Recklessly approximate sparse coding.” [Online]. Available: http://arxiv.org/abs/1208.0959
[61] R. Salakhutdinov and G. Hinton, “An efficient learning procedure for deep Boltzmann machines,” Neural Comput., vol. 24, no. 8, pp. 1967–2006, 2012.

Rakesh Chalasani (S’10) received the B.Tech. degree in electronics and communication engineering from the National Institute of Technology, Nagpur, India, in 2008, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Florida, Gainesville, FL, USA, in 2010 and 2013, respectively. He was a Graduate Research Assistant with the Computational NeuroEngineering Laboratory, University of Florida, from 2010 to 2013, and a Research Intern with the Bosch Research and Technology Center, Palo Alto, CA, USA, in 2012. He is currently a Machine Learning Researcher with AnalytXbook, Inc., Boston, MA, USA. His current research interests include machine learning, pattern recognition, unsupervised learning, kernel methods, information theoretic learning, and computer vision.

Jose C. Principe (M’83–SM’90–F’00) received the bachelor’s degree in electrical engineering from the University of Porto, Porto, Portugal, the master’s and Ph.D. degrees from the University of Florida, Gainesville, FL, USA, and the Laurea (Hons.) degree from the Mediterranea University of Reggio Calabria, Reggio Calabria, Italy. He joined the University of Florida in 1987, after an eight-year appointment as a Professor with the University of Aveiro, Aveiro, Portugal. He founded the Computational NeuroEngineering Laboratory at the University of Florida, in 1991, to synergistically focus his research on biological information processing models.
He has been a Distinguished Professor of Electrical and Biomedical Engineering with the University of Florida since 2002. He was the Supervisory Committee Chair of 65 Ph.D. and 67 master’s students, and has authored over 500 refereed publications (three books, four edited books, 14 book chapters, 200 journal papers, and 380 conference proceedings). His current research interests include nonlinear non-Gaussian optimal signal processing and modeling, and biomedical engineering. He holds five patents and has submitted seven more.

Dr. Principe is a fellow of the American Institute for Medical and Biological Engineering. He was the President of the International Neural Network Society, the Editor-in-Chief of the IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, and a member of the Advisory Science Board of the FDA. He was a recipient of the Gabor Award from the International Neural Network Society for his contributions.
