

Error Surface of Recurrent Neural Networks

Manh Cong Phan and Martin T. Hagan, Member, IEEE

Abstract—We found in previous work that the error surfaces of recurrent networks have spurious valleys that can cause significant difficulties in training these networks. Our earlier work focused on single-layer networks. In this paper, we extend the previous results to general layered digital dynamic networks. We describe two types of spurious valleys that appear in the error surfaces of these networks. These valleys are not affected by the desired network output (or by the problem that the network is trying to solve). They depend only on the input sequence and the architecture of the network. The insights gained from this analysis suggest procedures for improving the training of recurrent neural networks.

Index Terms—Error surface, Fibonacci polynomials, recurrent neural networks, spurious valleys.

I. INTRODUCTION

IT IS well known that recurrent neural networks (RNNs) have powerful computational abilities; however, they are very difficult to train consistently [1], [2]. One of the proposed reasons for the difficulties in training RNNs is the bifurcation of the network dynamics. The problem of bifurcation in RNNs has been investigated in several papers [3]–[5]. Bifurcation is a change in the dynamic behavior of the network (equilibrium points, periodic orbits, stability properties, etc.) as a parameter is varied. The parameter space could be divided into many regions with very different dynamic behaviors. At bifurcation points, the output of a network can change discontinuously with the change of parameters, and therefore the convergence of gradient descent algorithms is not guaranteed [3].

Another proposed reason for difficulties in RNN training is the problem of long-term dependencies. This problem was discussed by Bengio et al. [6] and others [7]–[12]. They argued that for the tasks with long-term dependencies (i.e., where the output depends on inputs presented at times far in the past), the gradient descent algorithms perform poorly. The reason is that the gradient vanishes as it is propagated back into the past [6], [7], [13].

Another reason for difficulties in training recurrent networks is the existence of spurious valleys in the error surface, where training algorithms can be easily trapped. We first presented this issue in [14], and later provided a mathematical analysis of the first-order case in [15]. The problem was discovered when we trained a neural network-based model reference controller.

We found that the error sometimes increased during training, although a line search algorithm was used. Through plotting the mean square error surface along the search direction, we obtained profiles like the one shown in Fig. 1 (where α is the fractional change in the weights). Obviously, it is very difficult for any line search algorithm to locate the minimum along the search direction, because there are so many local minima in such a small range. In [15], we demonstrated that such spurious valleys appear even in the simplest recurrent network—a single-layer network with only one neuron.

There has been other work discussing valleys in the error surface of RNNs. For example, [16] provided a qualitative discussion of the spurious valleys that appeared in the surface of a recurrent neural adaptive filter that was functionally equivalent to the RNN analyzed in [14] and [15]. They discussed the training difficulties caused by the valleys, but they did not provide a mathematical analysis. In [17]–[19], the authors described methods for searching along valleys in the error surface. They wanted to reach the minimum of the error surface, which occurred in the valleys. The valleys they discussed are different than the spurious valleys discussed in this paper, which are not related to the minimum of the error surface. We want to avoid the spurious valleys, to reach the true minimum.

The purpose of this paper is to extend the results in [15] to general RNNs. This is a further step in understanding the error surface of RNNs, which can then lead to improvements in training. To the best of our knowledge, this method of analyzing the error surface of RNNs has not been previously reported in the literature.

This paper is organized as follows. In Section II, we provide some background material on linear systems that will be needed for our analysis of recurrent networks. Next, in Section III, we analyze the error surface of a two-layer linear network, and then, in Section IV, a two-layer nonlinear network. Section V extends the previous results to general RNN architectures, and Section VI concludes this paper and discusses practical implications of the results.

II. PRELIMINARY MATERIAL

In analyzing the spurious valleys of RNNs, we will begin with a linearized network, in which all sigmoid activation functions are replaced with linear activation functions. We will then analyze the resulting linear network using standard linear system theory. In this section, we review some of the key results from linear system theory that we will need later in this paper. (See [20] and [21] for more detailed development.) We also present some notation for describing frozen roots (introduced in [15]), which will be essential in describing one type of spurious valley.




Fig. 1. Error profile along the search direction.

First, consider the difference equation representation of a linear dynamic system (linear RNN)

a(t) + m_1 a(t-1) + m_2 a(t-2) + ... + m_n a(t-n) = s_1 p(t-1) + s_2 p(t-2) + ... + s_n p(t-n)    (1)

where p(t) is the input to the system and a(t) is the system response. This system can also be represented by its transfer function

G(z) = (s_1 z^{n-1} + s_2 z^{n-2} + ... + s_n) / (z^n + m_1 z^{n-1} + ... + m_n).    (2)

The response of this linear system to an arbitrary input sequence, assuming zero initial conditions, can be found from the following convolution sum:

a(t) = \sum_{i=1}^{t-1} g(i) p(t-i) = \sum_{l=1}^{t-1} g(t-l) p(l)    (3)

where g(i) is the impulse response of (1), which can be found from the following:

g(i) + m_1 g(i-1) + m_2 g(i-2) + ... + m_n g(i-n) = { s_i,  i = 1, ..., n
                                                    { 0,    i > n    (4)

where g(i) = 0, i <= 0. The difference equation (1) is characterized by the following characteristic equation:

D_n(λ) = λ^n + m_1 λ^{n-1} + m_2 λ^{n-2} + ... + m_n = 0.    (5)

The roots of this equation, λ_i, i = 1, ..., n, are called the system poles. The impulse response can be written in terms of the system poles, as in the following:

g(i) = \sum_{j=1}^{n} α_j (λ_j)^i.    (6)

From this, it can be shown ([22], Lesson 8) that

\lim_{i→∞} g(i+1)/g(i) = λ_max    (7)

where λ_max is the real pole with the largest magnitude. In addition, it is known that (1) will be unstable if λ_max has magnitude greater than 1.

In addition to the difference equation in (1), a linear dynamic system can also be written in state space form as follows:

x(t+1) = A x(t) + B p(t)
a(t) = C x(t)    (8)

where x = [x_1, ..., x_n]^T ∈ R^n is the state vector. The solution to the state equation for arbitrary initial condition x(1) can be written as follows:

a(t) = C A^{t-1} x(1) + \sum_{l=1}^{t-1} C A^{t-l-1} B p(l).    (9)

If we compare this with (3), we can see that g(i) = C A^{i-1} B. The second term is the convolution sum and the first term is the initial condition response. From the Cayley–Hamilton theorem [20], we obtain the following:

A^n = -m_1 A^{n-1} - ... - m_n I.    (10)

Therefore, (9) can be written as follows:

a(t) = \sum_{l=1}^{n} C A^{t-l-1} x(1) (-m_l) + \sum_{l=1}^{t-1} C A^{t-l-1} B p(l)
     = \sum_{l=1}^{t-1} C A^{t-l-1} B p'(l) = \sum_{l=1}^{t-1} g(t-l) p'(l).    (11)

This is equivalent to the convolution sum of (3), where the first n elements of the input sequence are modified to include effects from the initial conditions. These modified elements can be computed from the following:

p'(l) = p(l) + p*(l),  l = 1, ..., n    (12)

where

p*(l) = -m_l C A^{n-l} x(1) / (C A^{n-l} B),  l = 1, ..., n.    (13)

The choice of state vector is not unique, but for all choices of state, the transfer function of (2) can be computed as follows:

G(z) = C (zI - A)^{-1} B = C adj(zI - A) B / det(zI - A)
     = (s_1 z^{n-1} + s_2 z^{n-2} + ... + s_n) / (z^n + m_1 z^{n-1} + ... + m_n).    (14)

A key for determining the location of a certain type of spurious valley is the set of roots of a polynomial whose coefficients are elements of the input sequence

P_i^t(λ) = p(i) λ^{t-i} + p(i+1) λ^{t-i-1} + ... + p(t) = \sum_{l=i}^{t} p(l) λ^{t-l} = \sum_{l=i}^{t} p(t-l+i) λ^{l-i}.    (15)

At every time point, this polynomial increases in order by 1, according to the following:

P_i^{t+1}(λ) = λ P_i^t(λ) + p(t+1).    (16)
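As a concrete illustration of (3) and (4) (this sketch is ours, not part of the original paper), the impulse response can be generated from the recursion in (4) and the zero-initial-condition output from the convolution sum in (3). The coefficient and input values below are arbitrary placeholders.

```python
import numpy as np

def impulse_response(m, s, length):
    """g(1), ..., g(length) from the recursion in (4), with g(i) = 0 for i <= 0.
    m = [m_1, ..., m_n], s = [s_1, ..., s_n]."""
    n = len(m)
    g = np.zeros(length + 1)                  # g[i] holds g(i); g[0] = 0
    for i in range(1, length + 1):
        forcing = s[i - 1] if i <= n else 0.0
        past = sum(m[j] * g[i - 1 - j] for j in range(n) if i - 1 - j >= 0)
        g[i] = forcing - past                 # g(i) = s_i - m_1 g(i-1) - ... - m_n g(i-n)
    return g

def response(m, s, p):
    """a(1), ..., a(Q) from the convolution sum (3), assuming zero initial conditions."""
    Q = len(p)
    g = impulse_response(m, s, Q)
    a = np.zeros(Q + 1)
    for t in range(2, Q + 1):
        a[t] = sum(g[i] * p[t - i - 1] for i in range(1, t))   # p[t-i-1] holds p(t-i)
    return a[1:]

# Example: the second-order case used later in Section III, where m = [-w1, -w2], s = [1, 0].
w1, w2 = 0.5, -0.3
p = np.random.randn(15)
a = response([-w1, -w2], [1.0, 0.0], p)
```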


Fig. 2. Two-layer linear network.

Fig. 3. Stable region of Fig. 2 network.

Fig. 4. Magnetic levitation system.

We showed in [15] that if P_i^t(λ) has a real root greater than 1 in magnitude (and there is a high probability that this will occur), then P_i^{t+1}(λ) also has a root at approximately the same location. We call this a frozen root, and it is the cause of spurious valleys in the error surface, as we will demonstrate. If the frozen root, call it r, occurs at time step t = t*, then we can say the following:

P_i^t(r) = \sum_{l=i}^{t} r^{t-l} p(l) = 0,  t > t*.    (17)
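The frozen-root behavior can be checked numerically with a short sketch (ours, not from the paper): for a random input sequence, the largest real root of P_1^t(λ) in (15) typically settles to an essentially fixed value with magnitude greater than 1 after only a few time steps. The tolerance and sequence length are arbitrary choices.

```python
import numpy as np

def largest_real_root(p_seq):
    """Largest-magnitude real root of P_1^t(lambda) in (15).
    p_seq holds the coefficients p(1), ..., p(t), highest power of lambda first."""
    roots = np.roots(p_seq)
    real = roots[np.abs(roots.imag) < 1e-8].real
    return None if real.size == 0 else real[np.argmax(np.abs(real))]

np.random.seed(0)
p = np.random.randn(15)                  # input sequence p(1), ..., p(Q)
for t in range(3, len(p) + 1):
    print(t, largest_real_root(p[:t]))   # the printed root barely moves once it is "frozen"
```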

III. ERROR SURFACE OF A TWO-LAYER LINEAR RNN

A. Description of Network

Consider the two-layer linear network shown in Fig. 2. The network can be represented in state space form as follows:

x(t+1) = [ w_1  w_2 ; 1  0 ] x(t) + [ 1 ; 0 ] p(t)
a(t) = [ 1  0 ] x(t)    (18)

where x_1(t) = a_2(t), x_2(t) = a_2(t-1), and a(t) = a_2(t). The corresponding transfer function is

G(z) = z / (z^2 - w_1 z - w_2)    (19)

and the resulting difference equation is

a(t) - w_1 a(t-1) - w_2 a(t-2) = p(t-1)    (20)

which has the following characteristic equation:

D_2(λ) = λ^2 - w_1 λ - w_2 = 0.    (21)

The system will be stable if the poles (roots of the characteristic equation) are inside the unit circle. Using the Jury stability test [23, p. 185], we can find three conditions for stability:

w_1 + w_2 < 1
-w_1 + w_2 < 1
|w_2| < 1.

These inequalities define the interior of the triangle shown in Fig. 3. In the region below the parabola in Fig. 3, the poles are complex. This region is defined by

w_1^2 + 4 w_2 < 0.    (22)

This region will be important in determining the shape of the error surface of the network.
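The two checks just described translate directly into code. The following small sketch (ours) tests the three Jury conditions and the complex-pole condition (22) for a given weight pair.

```python
def is_stable(w1, w2):
    """Jury conditions for D2(lambda) = lambda^2 - w1*lambda - w2 (the triangle in Fig. 3)."""
    return (w1 + w2 < 1) and (-w1 + w2 < 1) and (abs(w2) < 1)

def has_complex_poles(w1, w2):
    """Region below the parabola in Fig. 3, defined by (22)."""
    return w1**2 + 4*w2 < 0

print(is_stable(0.5, -0.3), has_complex_poles(0.5, -0.3))   # True True
```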

To generate an error surface for this network, we first generate training data. One of the very useful applications of recurrent networks is the identification and control of dynamic systems. We will use the recurrent network of Fig. 2 to model the magnetic levitation system shown in Fig. 4. (See [24] for details on the model.) We use a sequence of independent Gaussian random variables with mean zero and variance one for the network input p(t), which represents the current i(t) going to the electromagnet. This current is applied to a computer simulation of the magnetic levitation system, and the magnet position y(t) is stored as the desired output d(t) of the neural network. The sum square error (SSE) is then computed with the following:

F = \sum_{t=1}^{Q} (d(t) - a(t))^2    (23)

where Q is the number of points in the input sequence. (Note that the network in Fig. 2 will not be able to accurately approximate the magnetic levitation system, which would require a nonlinear network with more neurons [24]. Our objective, however, is to show the error surfaces of recurrent networks, and this data will serve nicely for that purpose.) In addition, we normalize the inputs and desired outputs so that the input weight of 1 shown in Fig. 2 will be adequate for the data. With only two remaining adjustable weights, we will be able to plot the error surface, which provides insights into the spurious valleys.
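An error surface such as the one in Fig. 5 can be generated with a sketch like the following (ours). Since the magnetic levitation simulation of [24] is not reproduced here, the desired output d is replaced by a placeholder filtered version of the input; only the shape of the surface, not the particular target, matters for the valleys discussed below.

```python
import numpy as np

def simulate_net(w1, w2, p):
    """Output of the two-layer linear network from the difference equation (20),
    a(t) = w1*a(t-1) + w2*a(t-2) + p(t-1), with zero initial conditions."""
    Q = len(p)
    a = np.zeros(Q)
    for t in range(1, Q):                       # a[t] holds a(t+1)
        a[t] = w1 * a[t - 1] + (w2 * a[t - 2] if t >= 2 else 0.0) + p[t - 1]
    return a

np.random.seed(0)
Q = 15
p = np.random.randn(Q)                          # normalized input current
d = np.convolve(p, [0.1, 0.05])[:Q]             # placeholder stand-in for the maglev position data

w1_grid = np.linspace(-3, 3, 201)
w2_grid = np.linspace(-3, 3, 201)
sse = np.zeros((len(w2_grid), len(w1_grid)))
for i, w2 in enumerate(w2_grid):
    for j, w1 in enumerate(w1_grid):
        a = simulate_net(w1, w2, p)
        sse[i, j] = np.sum((d - a) ** 2)        # the SSE of (23); plot log(sse) to see the valleys
```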


Fig. 5. Error surface of Fig. 2 network.

Fig. 6. Valleys in the error surface.

Fig. 7. Average surface valleys.

The error surface for a particular input sequence of 15 points is plotted in Fig. 5. Fig. 6 is a plot of the valleys. We have two types of valleys: root valleys and architecture valleys. The architecture valleys are a group of parabolas with different curvatures, and the root valley is a line segment to the left of the parabolas. We will explain these valleys in the following sections.

B. Root Valleys

To investigate the root valleys, as in the single neuron case [15], we consider the polynomial P_1^t(λ). Assume that there is a frozen root, r, starting at t = t*, hence (17) is satisfied. We want to find the weight values so that r is also a pole (root of the characteristic equation). If we plug in r for λ in (21), we have the following:

D_2(r) = r^2 - w_1 r - w_2 = 0.    (24)

This equation will define the root valley, and we want to find the values of the weights that will cause this equation to be satisfied. We can rewrite (24) as follows:

w_2 = -r (w_1 - r).    (25)

We will show that the root valley is a segment of this line. As |r| > 1, this line lies outside the stable region in Fig. 3. It will also be important that r be the largest root of D_2(λ).

Let λ_1 = r. Since λ_1 λ_2 = -w_2, λ_2 = -w_2/r. From this condition, we will be able to tell if r is the largest root. From (7), we can say that

\lim_{i→∞} g(i+1)/g(i) = λ_1 = r    (26)

where we are considering the segment of the line (25) where |w_2| < r^2, and, therefore, |λ_1| > |λ_2|. Therefore, given ε > 0, there exists an i* such that |g(i+1)/g(i) - r| < ε for every i >= i*. Therefore, g(i+1) ≈ r g(i) for i >= i* (in our tests, i* = 5 is usually sufficient). If we substitute this into the convolution (3), we have the following:

a(t) = \sum_{i=1}^{i*-1} g(i) p(t-i) + \sum_{i=i*}^{t-1} g(i) p(t-i)
     ≈ \sum_{i=1}^{i*-1} g(i) p(t-i) + g(i*) \sum_{i=i*}^{t-1} r^{i-i*} p(t-i)
     = \sum_{i=1}^{i*-1} g(i) p(t-i) + g(i*) \sum_{l=1}^{t-i*} r^{t-i*-l} p(l)
     = \sum_{i=1}^{i*-1} g(i) p(t-i) + g(i*) P_1^{t-i*}(r).    (27)

If t - i* >= t*, or t >= t* + i*, then the second term is zero [from (17)]. Thus,

a(t) = \sum_{i=1}^{i*-1} g(i) p(t-i).    (28)

We need to emphasize that this is true for every t >= t* + i*. Therefore, every point on the line (25) at which |w_2| < r^2 will produce small outputs a(t) for every t >= t* + i* (and so relatively small errors). The output at points on either side of this line segment, however, will be significantly higher, as they are in the unstable region of the system. Thus, the SSE is small on this line segment, compared with points on either side. Therefore, this line segment becomes a valley in the error surface. This valley does not depend on the targets, only on the input sequence [which determines P_1^t(λ)] and the network architecture [which determines D_2(λ)]. An example of this


line segment is shown in Fig. 6, which demonstrates how the actual valley follows the line segment.

We should make very clear that the location of the root valley depends only on the input sequence (and potentially on nonzero initial conditions); it does not depend on the desired output. This is a very unexpected result. During training, we are trying to get the neural network to approximate some unknown dynamic system that defines the relationship between the inputs and the desired outputs. (In our example here, the unknown dynamic system is the magnetic levitation system of Fig. 4.) Now, we find that there are steep valleys in the error surface that depend only on the input sequence. In addition, they depend only on the first few points of the input sequence. This is because the frozen root generally occurs within the first five to ten time steps and then remains frozen for the rest of the sequence [15].

In later sections of this paper, we extend these results to a very general RNN of arbitrary size. We will have equations equivalent to (17), (27), and (24). In other words, we will have a polynomial P_i^t(λ) whose coefficients are elements of the input sequence. (In some cases, the first few coefficients may be modified, as in (12).) This polynomial will have a root, r, outside the unit circle as follows:

P_i^t(r) = 0, for some |r| > 1, t > t*.    (29)

We will then find a set of network weights so that a root of the characteristic equation (system pole) is equal to that root of P_i^t(λ) outside the unit circle

D_n(r) = 0    (30)

for some set of network weights. Because of this root, the network response will be very small, even though the network is unstable. This means that a small change in the weights will cause a large change in the network response, which will result in a very narrow and deep valley in the error surface. (The width of the valleys will decrease, and the depth of the valleys will increase, with the length of the input sequence, because, as the length increases, the network outputs will become larger for unstable weights.) The equation that defines the valley will be D_n(r) = 0, where the coefficients of the D_n(r) polynomial are functions of the network weights. (In some cases, the first few coefficients of the P_i^t(λ) polynomial will also be functions of the network weights, in which case we will also have the equation P_i^t(r) = 0 to be satisfied.)

Because these root valleys depend only on the first few steps of the input sequence, and not on the desired outputs (or on the underlying dynamic system that we are attempting to approximate), we call these valleys "spurious." (The dictionary definition of spurious is "not proceeding from the true source.") The minima of these valleys are unrelated to the global minimum of the error surface. They cause difficulties during training, because they trap the search for the true minimum.

Again, the cause of this valley is that P_1^t(λ) has a root, r, greater than 1 in magnitude for t > t*, and that D_2(r) = 0 over some regions of the weight space. These regions of the weight space will form the valley. This same analysis can be applied to linear networks of arbitrary size, as we will show in a later section. If P_1^t(λ) has more than one root greater than 1 in magnitude, the error surface will have more valleys of this kind.

Fig. 8. Intrinsic architecture valleys and their equations.

C. Intrinsic Architecture Valleys

In the previous sections, we found that the existence and location of root valleys in the error surface depend on the input sequence p(t). How about the parabolic valleys that appear in Fig. 6? Does their presence depend on the input sequence? To answer these questions, we will develop an error surface that is independent of the input sequence. The idea is to take the average of all possible surfaces (one for each possible input sequence). In other words, we will consider p(t) as a random process and find an expression for the expectation of the SSE (23). For simplicity, we need to make some approximations first. Consider the following approximation of the SSE (23):

F_a = \sum_{t=1}^{Q} (a(t))^2.    (31)

The normalized error of this approximation is

(F_a - F)/F_a = \sum_{t=1}^{Q} [2 a(t) d(t) - (d(t))^2] / \sum_{t=1}^{Q} (a(t))^2.    (32)

From (3), we have a(t) = \sum_{i=1}^{t-1} g(i) p(t-i), which is a function of w_2. As the order of the numerator in (32) is less than the order of the denominator (notice that the d(t) are bounded constants, and a(t) increases with w_2), the above ratio goes to zero as w_2 goes to infinity. Therefore, we can approximate F by F_a for large w_2. We will work with the approximate average surface F_a to find the equations for the intrinsic architecture valleys.

Now we can take the expectation of (31) to get the average surface. This average surface is a function of only the two weights w_1 and w_2:

F_t(w_1, w_2) = E[ \sum_{t=1}^{Q} (a(t))^2 ] = \sum_{t=2}^{Q} E[ ( \sum_{i=1}^{t-1} g(i) p(t-i) )^2 ].

Fig. 9. Impulse responses for different intrinsic architecture valleys. Last impulse response equals zero.

Suppose that the p(i) are independent and identically distributed random variables. (For illustration, we will assume mean of zero and variance of one.) Then, the expectation of every cross term in the sum ( \sum_{i=1}^{t-1} g(i) p(t-i) )^2 is zero. Therefore

F_t(w_1, w_2) = \sum_{t=2}^{Q} \sum_{i=1}^{t-1} (g(i))^2 E[(p(t-i))^2] = \sum_{t=2}^{Q} \sum_{i=1}^{t-1} (g(i))^2 = \sum_{i=1}^{Q-1} (Q-i) (g(i))^2.    (33)

The average surface valleys are plotted in Fig. 7. It looks like the sample surface valleys shown in Fig. 6, except that there is no root valley. This confirms that the root valley depends on the input sequence, whereas the parabolic valleys do not. They are the properties of the network architecture, and therefore we call them "intrinsic architecture valleys."

Next, we will find equations that describe the intrinsic architecture valleys. A natural way to find the valleys is to set the first partial derivatives of F_t equal to zero. Here, we will take the first derivative of F_t with respect to w_1 and solve for w_1. That is

dF_t/dw_1 = 2 \sum_{i=1}^{Q-1} (Q-i) g(i) (dg(i)/dw_1).    (34)
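As a quick numerical check of (33) and (34) (our own sketch), the average surface F_t can be evaluated directly from the impulse response of the Fig. 2 network, and its local minima along w_1 (for a fixed, large-magnitude w_2) trace out the intrinsic architecture valleys. The grid limits are arbitrary.

```python
import numpy as np

def impulse_response(w1, w2, Q):
    """g(1), ..., g(Q-1) for the Fig. 2 network: g(1) = 1, g(2) = w1, g(i) = w1*g(i-1) + w2*g(i-2)."""
    g = np.zeros(Q)                  # g[i] holds g(i); g[0] unused
    g[1] = 1.0
    if Q > 2:
        g[2] = w1
    for i in range(3, Q):
        g[i] = w1 * g[i - 1] + w2 * g[i - 2]
    return g[1:]

def average_surface(w1, w2, Q):
    """F_t(w1, w2) from (33): sum_i (Q - i) * g(i)^2."""
    g = impulse_response(w1, w2, Q)
    i = np.arange(1, Q)
    return np.sum((Q - i) * g ** 2)

Q, w2 = 7, -4.0
w1_grid = np.linspace(-6, 6, 1201)
F = np.array([average_surface(w1, w2, Q) for w1 in w1_grid])
# interior local minima along w1 approximate the intrinsic architecture valleys for this w2
valleys = w1_grid[1:-1][(F[1:-1] < F[:-2]) & (F[1:-1] < F[2:])]
print(valleys)
```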

Through solving (4) to find the g(i), and rearranging terms, this can be rewritten as follows:

(1/2) dF_t/dw_1 = {(Q-2) + 2(Q-3) w_2 + 4(Q-4) w_2^2 + 6(Q-5) w_2^3 + ... + a_0 w_2^{Q-3}} w_1
               + {2(Q-3) + 8(Q-4) w_2 + 22(Q-5) w_2^2 + ... + a_1 w_2^{Q-4}} w_1^3 + ... + a_{Q-3} w_1^{2Q-5}.    (35)

We want to find the roots of the right-hand side of (35), which is a polynomial in w_1. The coefficients of this polynomial are polynomials in w_2. If w_2 is large enough, we can approximate these coefficients by the highest order terms. The polynomial with these approximate coefficients is

(1/2) dF_t/dw_1 ≈ a_0 w_2^{Q-3} w_1 + a_1 w_2^{Q-4} w_1^3 + ... + a_{Q-3} w_1^{2Q-5} = g(Q-1) dg(Q-1)/dw_1.    (36)

In the appendix, we show that the roots of the polynomial in (35) approach the corresponding roots of the polynomial in (36) as w_2 increases. Therefore, to locate the valleys, we can set g(Q-1) dg(Q-1)/dw_1 equal to zero and check the second derivative of F_t with respect to w_1. If that second derivative is positive, we have the valleys. If it is negative, we have the ridges. The second derivative can be written as follows:

(1/2) d^2 F_t/dw_1^2 = (dg(Q-1)/dw_1)^2 + g(Q-1) d^2 g(Q-1)/dw_1^2.    (37)

Now we consider two cases where dF_t/dw_1 = 0. The first case is g(Q-1) = 0. The second case is dg(Q-1)/dw_1 = 0. In the appendix, we show that d^2 F_t/dw_1^2 > 0 if g(Q-1) = 0. Therefore, in the first case, the second derivative is positive, and we have a valley. Thus, g(Q-1) = 0 is the equation for the valleys. From [25], Corollary 10, we can write the following expressions for g(Q-1):

g(Q-1) = w_1 \prod_{k=1}^{(Q-3)/2} ( w_1^2 + 4 w_2 cos^2(kπ/(Q-1)) )   if Q is odd
g(Q-1) = \prod_{k=1}^{(Q-2)/2} ( w_1^2 + 4 w_2 cos^2(kπ/(Q-1)) )   if Q is even.

For the second case, where dg(Q-1)/dw_1 = 0, we show in the appendix that d^2 F_t/dw_1^2 < 0. Because the second derivative is negative, we have a ridge, instead of a valley. For the case dg(Q-1)/dw_1 = 0, we have just one more valley for even Q. That is, the vertical line w_1 = 0 with w_2 > 0.

To summarize, the equations for the intrinsic architecture valleys are as follows:
1) if Q is odd, the architecture valleys are made up of the vertical line w_1 = 0 and a family of parabolas w_2 = -w_1^2 / (4 cos^2(kπ/(Q-1))) for k = 1, 2, ..., (Q-3)/2;
2) if Q is even, the architecture valleys are made up of the section of the vertical line w_1 = 0 with w_2 > 0 and the family of parabolas w_2 = -w_1^2 / (4 cos^2(kπ/(Q-1))) for k = 1, 2, ..., (Q-2)/2.

Three sets of curves are shown in Fig. 8. The thick dark curve is w_2 = -w_1^2/4, below which the system characteristic equation has complex roots. The intrinsic architecture valleys



(except the section of the vertical line w_1 = 0 with w_2 > 0) fall below this curve. There are always (Q-2) such valleys. The black curves show numerically computed valleys in the average performance surface. The thin gray curves show the equations for the approximate intrinsic architecture valleys. We can see that the approximate valleys accurately match the actual valleys.

A more intuitive way to explain the intrinsic architecture valleys is to use the impulse response g(i). In the region defined by inequality (22) (and outside the stable region), the impulse response oscillates and expands with time. Therefore, the last impulse response coefficient, g(t-1), has the most significant effect on the output a(t) in (3). At those points (w_1, w_2) where g(t-1) = 0, the output becomes small (compared with the outputs at those points where g(t-1) ≠ 0) and the SSE is small as well. This means that g(t-1) = 0 is the equation of the intrinsic architecture valleys. We confirmed this experimentally. The impulse responses at different intrinsic architecture valleys are shown in Fig. 9 (an input sequence of seven points is used in this demonstration). Each valley corresponds to a different frequency of oscillation of the impulse response; however, the last impulse response coefficient g(Q-1) is always small for all valleys.

D. Sample Architecture Valleys

In the previous section, we found the locations of architecture valleys on the average performance surface. They are independent of the input sequence. For a particular input sequence, we call these valleys "sample architecture valleys." One example of sample architecture valleys is shown in Fig. 6. In this section, we will investigate how the sample architecture valleys are different than the intrinsic architecture valleys. We will use the approximate SSE in (31) and the argument for intrinsic architecture valleys to find the sample architecture valleys:

F ≈ \sum_{t=1}^{Q} (a(t))^2 = \sum_{t=2}^{Q} ( \sum_{i=1}^{t-1} g(i) p(t-i) )^2.

Fig. 10. Sample architecture valleys and their approximations.

Fig. 11. Sample architecture valleys compared with intrinsic architecture valleys.

The first derivative of F with respect to w_1 is

(1/2) dF/dw_1 = \sum_{t=2}^{Q} ( \sum_{i=1}^{t-1} g(i) p(t-i) ) ( \sum_{i=1}^{t-1} (dg(i)/dw_1) p(t-i) ).

Next, approximate this derivative by the last term in the sum as follows:

(1/2) dF/dw_1 ≈ [ \sum_{i=1}^{Q-1} g(i) p(Q-i) ] [ \sum_{i=1}^{Q-1} (dg(i)/dw_1) p(Q-i) ].    (38)

Therefore, either

\sum_{i=1}^{Q-1} g(i) p(Q-i) = 0    (39)

or

\sum_{i=1}^{Q-1} (dg(i)/dw_1) p(Q-i) = 0    (40)

will define the sample architecture valleys. We will consider these two cases individually.

1) \sum_{i=1}^{Q-1} g(i) p(Q-i) = 0: Like Appendix II, we can show that if (39) is satisfied, the second derivative of F with respect to w_1 is positive. Thus, (39) is indeed an equation for sample architecture valleys. Comparing with (3), this means that the last output equals zero. This makes sense, because, like the impulse response g(t), the output a(t) is expanding with time. Therefore, the last output dominates the SSE. If it equals the last target (targets are small numbers), the SSE becomes small, and we have sample architecture valleys. We confirmed experimentally that at the sample architecture valleys, the last output is small. The network responses for the sample architecture valleys are similar in form to the impulse responses shown in Fig. 9.



2) \sum_{i=1}^{Q-1} (dg(i)/dw_1) p(Q-i) = 0: Unlike Appendix III, the sign of the second derivative of F with respect to w_1 depends on the input sequence. This means that (40) could be the equation for valleys or ridges, depending on the input sequence. Our experiments show that, in most cases, (40) is the equation for ridges. There are, however, some input sequences in which some of the curves from (40) are valleys.

Substituting g(i) from (4) into (39) and solving for w_1 in terms of w_2, we can obtain equations for the sample architecture valleys. [We can do the same thing for (40) to find possible additional sample architecture valleys.] The actual valleys for a particular input sequence and the corresponding approximate valleys obtained from (39) are shown in Fig. 10. The approximation is quite good for the regions in which |w_2| is large enough. Like the intrinsic architecture valleys, all sample architecture valleys [there are still (Q-2) such valleys] lie in the region below the parabola w_2 = -w_1^2/4 (the region in which (21) has complex roots). The root valley acts as a transition between two architecture valleys (see the leftmost one and the vertical one in Fig. 10). The sample architecture valleys are shifted versions of the intrinsic architecture valleys, as shown in Fig. 11. The direction of shifting depends on the sign of the root of P_1^t(λ), which determines the location of the root valley.

IV. ERROR SURFACE OF A TWO-LAYER NONLINEAR RNN

A. Description of Network

The objective of this section is to understand the effect of adding nonlinear transfer functions to the two-layer network. If the linear transfer functions in both layers of the network shown in Fig. 2 are replaced by the tanh function, we have the following:

a_1(t) = tanh[ p(t) + w_1 a_1(t-1) + w_2 a_2(t-1) ]    (41)
a_2(t) = tanh[ a_1(t-1) ].    (42)

As in the one-neuron network [15], the nonlinearity makes the shape of valleys in the error surface much more complex. The error surface, and the valleys, for a two-layer nonlinear RNN are shown in Fig. 12.

Fig. 12. Error surface and valleys for a two-layer nonlinear RNN.

B. Root Valleys

In this section, we will show that D_2(r) = 0 still defines the root valley in the nonlinear case, with a modification of the meaning of the root r. We will show that on the root valley, the output is small. When the output is small, the tanh activation functions will be approximately linear, and the same analysis from the linear case can be applied. The layer outputs for the first few time steps are (assuming zero initial conditions) as follows:

a_1(1) = tanh[ p(1) ]
a_2(1) = 0
a_1(2) = tanh[ p(2) + w_1 a_1(1) ]
a_2(2) = tanh[ a_1(1) ]
a_1(3) = tanh[ p(3) + w_1 a_1(2) + w_2 a_2(2) ]
a_2(3) = tanh[ a_1(2) ]
a_1(4) = tanh[ p(4) + w_1 a_1(3) + w_2 a_2(3) ]
a_2(4) = tanh[ a_1(3) ].    (43)

At some points (w_1, w_2) on the error surface where p(3) + w_1 a_1(2) + w_2 a_2(2) is small enough, the tanh function can be approximated as linear. Because of this, a_1(3) is also small. Thus, we can make the following approximation:

a(4) = a_2(4) ≈ a_1(3) ≈ p(3) + w_1 a_1(2) + w_2 a_2(2).

If this type of approximation can be applied for the following time points, then we can use the linear difference equation of (20), which will be valid for t >= 4, with initial conditions a(2) and a(3). This means that we can use the modified convolution equation of (11), which allows for nonzero initial conditions, as follows:

a(t) = \sum_{l=3}^{t-1} g(t-l) p'(l),  t >= 4    (44)

where, using (13), we can find the modified input elements, p'(3) and p'(4), to be

p'(3) = p(3) + w_1 a_1(2) + w_2 a_2(2)
p'(4) = p(4) + w_2 a_2(3).    (45)



Now, let r be the largest root of D_2(λ) = 0. Using the same argument as in the linear case, we have an expression for the output similar to (27). To be on the root valley, we need to have

P'_3^{t-i*}(r) = \sum_{l=3}^{t-i*} r^{t-i*-l} p'(l) = 0    (46)

where P'_i^t(λ) is defined as in (15), but with p'(l) instead of p(l). So now, as in the linear case, the root valley is defined by the following two equations:

P'_3^{t-i*}(r) = 0
D_2(r) = 0.    (47)

The difference in the nonlinear case is that the first two coefficients in P'_3^{t-i*}(r) (p'(3) and p'(4)) depend on the network weights w_1 and w_2, whereas in the linear case, the coefficients of P_1^{t-i*}(r) are just the elements of the input sequence. To find the root valley in the linear case, we simply find the frozen root of P_1^{t-i*}(λ), and then plug that value, r, into D_2(r) = 0. This gives us the line w_2 = -r(w_1 - r). For the nonlinear case, we have to solve the simultaneous equations in (47). It is difficult to find a closed form solution to these equations, but we can find a numerical solution using the following procedure. If we substitute (45) and w_2 = r^2 - w_1 r into (46), we have the following:

P'_3^{t-i*}(r) = a_2(2) r^{t-i*-1} + [a_2(3) - w_1 a_2(2)] r^{t-i*-2} + [p(3) + w_1 a_1(2) - w_1 a_2(3)] r^{t-i*-3} + \sum_{l=4}^{t-i*} p(l) r^{t-i*-l}.    (48)

Now, let \bar{p} be the input sequence p, except for the following points:

\bar{p}(1) = a_2(2)
\bar{p}(2) = a_2(3) - w_1 a_2(2)
\bar{p}(3) = p(3) + w_1 a_1(2) - w_1 a_2(3).    (49)

(Notice that these terms are only functions of w_1, and not w_2.) Then

P'_3^{t-i*}(r) = \bar{P}_1^{t-i*}(r) = \sum_{l=1}^{t-i*} r^{t-i*-l} \bar{p}(l).    (50)

If r is the frozen root of \bar{P}_1^{t-i*}(λ), then w_2 = -r(w_1 - r) is the equation for the root valley. The procedure to find the nonlinear root valley can be summarized as follows. For each value of w_1:
1) compute a_1(2), a_2(2), and a_2(3) from (43);
2) form \bar{p} as in (49), and find r, which is the frozen root of \bar{P}_1^t(λ);
3) compute w_2 = -r(w_1 - r). [The condition |w_2| < r^2 needs to be satisfied, so that r is the largest root of (21).]

An example error surface and associated valleys are shown in Fig. 12. The approximate root valley, determined by the procedure above, accurately matches a true valley. As in the linear case, the root valley is a part of D_2(r) = 0. In the linear case, r is the frozen root of P_1^t(λ) and is fixed. In the nonlinear case, r is the frozen root of \bar{P}_1^t(λ) and varies with the weight w_1 (and various initial conditions). That is why the valley is no longer a straight line; in the general case, it is a hypersurface in the weight space. We want to emphasize that the mechanism that causes the nonlinear root valley is unchanged: the input sequence (or adjusted input sequence) has a frozen root outside the unit circle. The shape of the valley is changed because of the saturation of the sigmoid function.

In the general case, which will be discussed in a later section, a similar pattern emerges. Root valleys are determined by simultaneous equations, like those in (47). The polynomial P'_i^t(λ) has a frozen root, r (for t >= t*), which is greater than 1 in magnitude. The valley then appears along the hypersurface D_n(r) = 0. The difficulty in describing the valley for the general case is caused by the fact that the first few coefficients of P'_i^t(λ) are functions of the network weights. We are able to solve this numerically for the second-order network, because we can solve D_n(r) = 0 for w_2, and then use that to eliminate w_2 from the other equation.
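A minimal sketch (ours) of the three-step procedure listed above, assuming zero initial conditions and a given input sequence p(1), ..., p(Q); the root finder is the same idea used in the earlier frozen-root sketch.

```python
import numpy as np

def largest_real_root(coeffs):
    roots = np.roots(coeffs)                       # coefficients: highest power of lambda first
    real = roots[np.abs(roots.imag) < 1e-8].real
    return None if real.size == 0 else real[np.argmax(np.abs(real))]

def nonlinear_root_valley(w1, p):
    """Approximate w2 on the nonlinear root valley for a given w1 (steps 1-3 above)."""
    # Step 1: the needed layer outputs from (43), zero initial conditions.
    a1_1 = np.tanh(p[0])                           # a1(1)
    a1_2 = np.tanh(p[1] + w1 * a1_1)               # a1(2)
    a2_2 = np.tanh(a1_1)                           # a2(2)
    a2_3 = np.tanh(a1_2)                           # a2(3)
    # Step 2: the modified sequence of (49); elements from p(4) onward are unchanged.
    p_bar = np.concatenate(([a2_2,
                             a2_3 - w1 * a2_2,
                             p[2] + w1 * a1_2 - w1 * a2_3], p[3:]))
    r = largest_real_root(p_bar)                   # frozen root of the modified polynomial
    if r is None or abs(r) <= 1:
        return None
    # Step 3: the root-valley equation, subject to |w2| < r^2.
    w2 = -r * (w1 - r)
    return w2 if abs(w2) < r ** 2 else None

p = np.random.randn(15)
valley = [(w1, nonlinear_root_valley(w1, p)) for w1 in np.linspace(-3, 3, 61)]
```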

C. Architecture Valleys

Architecture valleys in the nonlinear case are created by the same mechanisms as the linear case, but their shapes are modified. An example of how the architecture valleys change as we go from the linear case to the nonlinear case is shown in Fig. 13. In the nonlinear case, valleys 1 and 2 move to the right, and valleys 3–5 move to the left. (Valley 5 in the linear case—the part below the tangent point—is one of the architecture valleys.) In addition, some new valleys appear (see valleys a–e in Fig. 13). We can see that all of the architecture valleys in the linear case are still present in the nonlinear error surface. (We call them basic architecture valleys.) They just move to new locations. The new architecture valleys appear around them.

We verified experimentally that the mechanism that causes these nonlinear architecture valleys is similar to the linear case (see Section III-D): one of the last outputs is changing the most quickly and is crossing the corresponding target, while the other outputs are almost constant. For the basic architecture valleys, the last output is changing the most quickly between tanh(1) and -tanh(1), and it is crossing the last target (see valleys 1–5 in Fig. 13). Each valley corresponds to a different frequency of oscillation of the network response. This is similar to the linear case. For the new architecture valleys, in the process of changing from one frequency to another, the last output stays saturated at tanh(1) or -tanh(1). If the second to last output changes the most quickly, and crosses the second to last target, we have new valleys (valleys a and b in Fig. 13). Furthermore, if the last two outputs are saturated, and if the third to last output changes the most quickly and crosses the third to last target, we have additional new valleys (see valleys c–e in Fig. 13). This cycle has the potential to continue until the first time point, hence it can cause many more new valleys around the basic valleys. These new valleys are similar to the type 3 valleys described in [15] for the single-neuron network.
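For reference, the nonlinear surfaces in Figs. 12 and 13 can be reproduced by simulating (41) and (42) directly and evaluating the SSE of (23); a minimal sketch (ours) follows, where p and d are whatever training input and target sequences are being used.

```python
import numpy as np

def simulate_tanh_net(w1, w2, p):
    """Output a(t) = a2(t) of the nonlinear network (41)-(42), zero initial conditions."""
    Q = len(p)
    a1 = np.zeros(Q + 1)       # a1[t] holds a1(t); index 0 is the zero initial condition
    a2 = np.zeros(Q + 1)
    for t in range(1, Q + 1):
        a1[t] = np.tanh(p[t - 1] + w1 * a1[t - 1] + w2 * a2[t - 1])
        a2[t] = np.tanh(a1[t - 1])
    return a2[1:]

def sse(w1, w2, p, d):
    a = simulate_tanh_net(w1, w2, p)
    return np.sum((d - a) ** 2)
```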


Fig. 13. Movement of architecture valleys from the linear case to the nonlinear case (seven-point sequence).

Through replacing the linear transfer function with the sigmoid transfer function, we increase the number and complexity of the architecture valleys. These valleys inherit properties of valleys that appear in the linear case; however, they are more numerous and complex because of the saturation of the sigmoid transfer function.

V. ERROR SURFACE OF A MORE GENERAL CLASS OF RNNS

In this section, we will discuss how the results for the one- and two-layer RNNs can be extended to a more general class of RNN. We first analyze linear networks (where linear functions replace sigmoid functions), before proceeding to nonlinear networks.

A. Description of Network

Now we consider a very general class of RNN—the layered digital dynamic network (LDDN)—first introduced in [26]. The net input n^m(k) for layer m of an LDDN can be computed as follows:

n^m(k) = \sum_{l ∈ L_m^f} \sum_{d ∈ DL^{m,l}} LW^{m,l}(d) a^l(k-d) + \sum_{l ∈ I_m} \sum_{d ∈ DI^{m,l}} IW^{m,l}(d) p^l(k-d) + b^m    (51)

where p^l(k) is the lth input to the network at time k, IW^{m,l} is the input weight between input l and layer m, LW^{m,l} is the layer weight between layer l and layer m, b^m is the bias vector for layer m, DL^{m,l} is the set of all delays in the tapped delay line between layer l and layer m, I_m is the set of indices of input vectors that connect to layer m, and L_m^f is the set of indices of layers that connect directly forward to layer m. The output of layer m is as follows:

a^m(k) = f^m(n^m(k))    (52)

for m = 1, 2, ..., M, where f^m is the transfer function at layer m. The set of M paired equations (51) and (52) describes the LDDN. LDDNs can have any number of layers, any number of neurons in any layer, and arbitrary connections between layers (as long as there are no zero-delay loops).

An LDDN can be linearized by making all activation functions in (52) linear. Any linear LDDN can be represented in state space form using (8). The corresponding transfer function can then be obtained using (14), which leads to difference equation (1) and characteristic equation (5). The output can be computed by convolving the input sequence with the impulse response, as in (3) [or (11) for nonzero initial conditions].

B. Root Valleys

The steps we followed in Section III for the two-layer network can be followed almost exactly for the general case. If P_1^t(λ) has a frozen root r bigger than 1 in magnitude, the error surface will have a root valley. That valley can be obtained by substituting r into the characteristic equation (5). In other words, the equation of the root valley is as follows:

D_n(r) = r^n + m_1 r^{n-1} + ... + m_n = 0    (53)

where P_1^t(λ) has a frozen root, r (for t >= t*). The valley is therefore defined by the following simultaneous equations:

P_1^t(r) = 0,  t >= t*
D_n(r) = 0.    (54)

We will summarize the main points that lead to this conclusion.
1) For all sets of weights that satisfy (53), there is a system pole outside the unit circle, which means that the network is unstable (the m_i are complex functions of the LDDN layer weights).
2) If there is a region on (53) where r is the root of D_n(λ) with largest magnitude, then we can approximate the


impulse response using g(i+1) = r g(i) for i > i* [from (7)]. This leads to (27). Because r is the frozen root of P_1^t(λ), the output is small.
3) The root is frozen as time increases, therefore the output remains small after a certain time point.
4) As the network weights vary even slightly from the hypersurface (53), the network output will increase dramatically, because that hypersurface is in the unstable region of the network.

C. Architecture Valleys

We can extend the analysis of the architecture valleys in the two-layer network case to explain how these valleys will form in the general case.
1) The intrinsic architecture valleys appear at those points where the last impulse response, g(Q-1), equals zero. Their presence is independent of the input. They are properties of the network architecture.
2) The sample architecture valleys appear when the last output equals zero. They are shifted versions of the intrinsic architecture valleys.

D. Valleys for Nonlinear RNNs

Because of the saturation effect, the nonlinear valleys are modified (in terms of shape and location) versions of the linear valleys. In addition, the number of valleys is multiplied.
1) The nonlinear root valley is a shifted version of the linear root valley. Starting at a certain point in time and at certain places on the error surface, the layer outputs of the network become small. Therefore, the sigmoid activation functions can be approximated as linear. (This would at least be true in regions where the biases are zero and the inputs are very small, but it seems likely that this could be possible in many other cases.) Therefore, the network operation can be described by the linear difference equation (1). The root, r, that is used to determine the root valley equation, using (53), is changing, as the first n coefficients of the input sequence are modified by the first n network outputs [see (13)]. (This involves modifying (54) to replace P_1^t(r) with \bar{P}_n^t(r).) Thus, the nonlinear root valley is shifted compared with the linear case. The shift depends on the s_i [see (1)]. The s_i depend on the input weights of the LDDN.
2) The architecture valleys still appear. They just move to new locations. In addition, new architecture valleys appear, as described for the two-layer network.

Although we cannot visualize the error surface and valleys of a general RNN, the difference equation (1) allows us to use the method of analysis for the two-layer RNN to analyze the general RNN. The understanding of the mechanisms that cause the spurious valleys can help us avoid them during training. It is not necessary to know the exact locations of the valleys. It is enough to know that they are not related to the problem that the RNN is being trained to solve. They are functions of the network inputs in the training set, the initial states of the network, and the network architecture.
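To make the LDDN update of (51) and (52) concrete, here is a minimal sketch (ours, using hypothetical data structures: dictionaries of weight matrices and stored signals keyed by layer/input index and delay). It evaluates one layer at one time step.

```python
import numpy as np

def lddn_layer_step(m, k, LW, IW, b, f, a_hist, p_hist, DL, DI, Lf, I):
    """One evaluation of (51) and (52) for layer m at time k.
    LW[(m, l, d)] and IW[(m, l, d)] are layer/input weight matrices for delay d;
    a_hist[(l, k - d)] and p_hist[(l, k - d)] are stored layer outputs and inputs;
    DL[(m, l)] and DI[(m, l)] are delay sets; Lf[m] and I[m] list connecting layers
    and inputs; f[m] is the transfer function of layer m."""
    n = np.array(b[m], dtype=float)           # start from the bias vector b^m
    for l in Lf[m]:                           # layer connections, first sum in (51)
        for d in DL[(m, l)]:
            n = n + LW[(m, l, d)] @ a_hist[(l, k - d)]
    for l in I[m]:                            # input connections, second sum in (51)
        for d in DI[(m, l)]:
            n = n + IW[(m, l, d)] @ p_hist[(l, k - d)]
    return f[m](n)                            # a^m(k) = f^m(n^m(k)), as in (52)
```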


VI. CONCLUSION

This paper generalized the results of [15], which described mechanisms by which spurious valleys are introduced into the error surface of a first-order recurrent network. These valleys are spurious, because they are unrelated to the problem that we are attempting to solve with the RNN. They do not depend to any significant extent on the desired network output. They depend only on the network inputs, the initial network states, and the architecture of the network.

In this paper, we used some basic concepts of linear system theory to develop a framework that can be used to analyze spurious valleys that appear in most practical recurrent network architectures, no matter their size. We showed that there are two main types of spurious valleys in the error surface of RNNs: root valleys and architecture valleys. Root valleys depend on the input sequence and can be explained using some properties of random polynomials and concepts from linear systems. Architecture valleys appear independently of the input sequence and are fundamental characteristics of the network architecture. We derived approximate equations for the valleys for a very general class of RNN and confirmed the equations experimentally for the second-order networks. This type of analysis has not been previously used to investigate the error surface of RNNs.

The presence of spurious valleys in the error surface of RNNs makes training, especially using batch, gradient-based methods, very difficult. For large networks with arbitrary recurrent connections, there will be many such valleys where the search algorithm is very likely to become trapped (see Fig. 1 for an actual error surface in a practical problem). In this paper, we identified several characteristics of these valleys, which suggest modifications to standard training procedures. For example, as the training sequence gets longer, more valleys appear in the error surface. In addition, the valleys become deeper and more narrow. Therefore, it would be a good training practice to use short input sequences, or to segment long sequences into short sequences before training.

In addition, we showed that the spurious valleys appear in unstable regions of the error surface. Therefore, by constraining the network weights to maintain stability, we might be able to avoid the valleys during the training process. Some stability criteria for RNNs were recently developed in [27] and [28]. The next step is to find an effective way to incorporate these constraints into current RNN training algorithms. Another way to maintain network stability would be to use regularized training, in which large network weights are penalized. The regularization could be reduced as training proceeded, so as not to bias the final results.

We also showed that the locations and shapes of the valleys are affected by the initial conditions of the network. This suggests that small changes in initial conditions might assist in moving a valley, when training is stuck. In addition, because the valleys are so steep, training gradients will tend to become large whenever the search arrives in a valley. This could indicate to the algorithm that a new sequence, or new initial conditions, should be used. In addition to switching sequences and/or initial conditions, it might be helpful to train with



several independent sequences and to average the results. This would have the tendency to smooth out the valleys.

APPENDIX I
PROOF THAT (35) APPROXIMATES (36)

We will show that the roots of the polynomial in (35) approach the corresponding roots of the polynomial in (36) as w_2 increases. For simplicity, we will rewrite the polynomial in (36) as follows:

f(w_1) = \sum_{k=0}^{Q-3} b_k w_1^{2k+1}    (55)

where b_k = a_k w_2^{Q-3-k}. Let w_{1i} (i = 1, 2, ..., Q-3) be the Q-3 roots of the (Q-3)rd order polynomial (55). We consider the right-hand side of (35) as a polynomial in w_1 with perturbed coefficients. The coefficient b_k is perturbed by Δb_k, which is a polynomial in w_2 with order less than the order of b_k. Because of this perturbation, the root of (35) corresponding to w_{1i} differs from w_{1i} by Δw_{1i}. The following formula shows the sensitivity of the roots to the perturbation of coefficients [29]:

| Δw_{1i} / w_{1i} | = S_{b_k}^{w_{1i}} | Δb_k / b_k |

where S_{b_k}^{w_{1i}} is the sensitivity of the relative error in the root w_{1i} to the relative perturbations of the coefficient b_k, and

S_{b_k}^{w_{1i}} = | b_k w_{1i}^{2k} / f'(w_{1i}) | = | b_k w_{1i}^{2k} / ( b_{Q-3} \prod_{j ≠ i} (w_{1i} - w_{1j}) ) |.

As we will show in Appendix II, f'(w_{1i}), which is (1/2)(d^2 F_t/dw_1^2)|_{w_{1i}}, could not be zero. Therefore, S_{b_k}^{w_{1i}} is finite. In other words, we can always find a bound M for all S_{b_k}^{w_{1i}}. Given small ε > 0, for large enough w_2, we have |Δb_k/b_k| < ε/M. Thus, |Δw_{1i}/w_{1i}| < M × (ε/M) = ε. This implies that we can make the relative error of the roots as small as desired.

APPENDIX II
PROOF THAT d^2 F_t/dw_1^2 > 0 IF g(Q-1) = 0

Through (37), we have (1/2)(d^2 F_t/dw_1^2) = [dg(Q-1)/dw_1]^2 >= 0. We will show that dg(Q-1)/dw_1 ≠ 0. Through induction, we can prove the following identity:

dg(Q-1)/dw_1 = [ 2(Q-1) w_2 g(Q-2) + (Q-2) w_1 g(Q-1) ] / (w_1^2 + 4 w_2).    (56)

One property of Fibonacci polynomials is that the greatest common divisor of g(Q-1) and g(Q-2) is 1 [25, Th. 4]. Thus, g(Q-1) and g(Q-2) could not both be zero for the same w_1 and w_2. As g(Q-1) = 0, g(Q-2) ≠ 0. In addition, we have w_2 ≠ 0 (our assumption above is that |w_2| is large enough). Therefore, by (56), we have dg(Q-1)/dw_1 = [2(Q-1) w_2 g(Q-2)] / (w_1^2 + 4 w_2) ≠ 0. This means that d^2 F_t/dw_1^2 > 0.

APPENDIX III
PROOF THAT d^2 F_t/dw_1^2 < 0 IF dg(Q-1)/dw_1 = 0

Through (37), we have (1/2)(d^2 F_t/dw_1^2) = g(Q-1)(d^2 g(Q-1)/dw_1^2). In addition, by induction, we can prove the following identity:

d^2 g(Q-1)/dw_1^2 = [ -3 w_1 (dg(Q-1)/dw_1) + ((Q-1)^2 - 1) g(Q-1) ] / (w_1^2 + 4 w_2).    (57)

Since dg(Q-1)/dw_1 = 0, d^2 g(Q-1)/dw_1^2 = [ ((Q-1)^2 - 1) g(Q-1) ] / (w_1^2 + 4 w_2). Thus

(1/2)(d^2 F_t/dw_1^2) = ((Q-1)^2 - 1) [g(Q-1)]^2 / (w_1^2 + 4 w_2).    (58)

As in Appendix II, we can show that because dg(Q-1)/dw_1 = 0, g(Q-1) ≠ 0. Therefore, the numerator of (1/2) d^2 F_t/dw_1^2 in (58) is positive. We just need to consider the denominator, which is w_1^2 + 4 w_2. We consider two subcases as follows.
1) w_1 = 0: this occurs when Q is even. If w_2 > 0, then w_1^2 + 4 w_2 = 4 w_2 > 0, hence d^2 F_t/dw_1^2 > 0. Thus, part of the vertical line w_1 = 0 with w_2 > 0 is a valley. Otherwise, if w_2 < 0, then d^2 F_t/dw_1^2 < 0, hence a part of the vertical line w_1 = 0 with w_2 < 0 is a ridge.
2) w_1 ≠ 0: We will show that d^2 F_t/dw_1^2 < 0, or w_1^2 + 4 w_2 < 0. We will prove this by contradiction. Suppose that w_1^2 + 4 w_2 > 0; then (21) has two real and distinct roots λ_1 and λ_2 (assume |λ_1| > |λ_2|). The condition w_1 ≠ 0 guarantees that |λ_1| ≠ |λ_2| (if w_1 = 0 then |λ_1| = |λ_2| = \sqrt{w_2}). It also guarantees that g(Q-2) ≠ 0, which makes the limit below valid. Again, by [22], Lesson 8, we have the following:

\lim_{Q→∞} g(Q-1)/g(Q-2) = λ_1 = ( w_1 + \sqrt{w_1^2 + 4 w_2} ) / 2.    (59)

From (56) and the condition that dg(Q-1)/dw_1 = 0, we have the following:

w_2 = - [ (Q-2) / (2(Q-1)) ] × w_1 × g(Q-1)/g(Q-2).    (60)

Taking the limit of both sides as Q → ∞ and using (59), we have the following:

w_2 = - (1/2) w_1 \lim_{Q→∞} g(Q-1)/g(Q-2) = - w_1 ( w_1 + \sqrt{w_1^2 + 4 w_2} ) / 4
⇔ w_1^2 + 4 w_2 = - w_1 \sqrt{w_1^2 + 4 w_2}.

This implies that w_1^2 + 4 w_2 = w_1^2 (by squaring both sides), therefore w_2 = 0. This does not satisfy our assumption that |w_2| is large enough. Therefore, the assumption that w_1^2 + 4 w_2 > 0 is false. Thus, we have w_1^2 + 4 w_2 < 0. So dg(Q-1)/dw_1 = 0 (w_1 ≠ 0) is the equation for the ridges.


REFERENCES

[1] A. F. Atiya and A. G. Parlos, "New results on recurrent network training: Unifying the algorithms and accelerating convergence," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 697–709, May 2000.
[2] M. Gori, B. Hammer, P. Hitzler, and G. Palm, "Perspectives and challenges for recurrent neural network training," Logic J. IGPL, vol. 18, no. 5, pp. 617–619, 2010.
[3] K. Doya, "Bifurcations in the learning of recurrent neural networks," in Proc. IEEE Int. Symp. Circuits Syst., vol. 6, May 1992, pp. 2777–2780.
[4] F. Pasemann, "Dynamics of a single model neuron," Int. J. Bifurcat. Chaos, vol. 3, no. 2, pp. 271–278, 1993.
[5] R. Haschke and J. Steil, "Input space bifurcation manifolds of recurrent neural networks," Neurocomputing, vol. 64, pp. 25–38, Mar. 2005.
[6] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[7] P. Frasconi, M. Gori, and G. Soda, "Local feedback multilayer networks," Neural Comput., vol. 4, no. 1, pp. 120–130, 1991.
[8] P. Frasconi, M. Gori, M. Maggini, and G. Soda, "Unified integration of explicit knowledge and learning by example in recurrent networks," IEEE Trans. Knowl. Data Eng., vol. 7, no. 2, pp. 340–346, Apr. 1995.
[9] T. Lin, B. Horne, P. Tino, and C. Giles, "Learning long-term dependencies in NARX recurrent networks," IEEE Trans. Neural Netw., vol. 7, no. 6, pp. 1329–1338, Nov. 1996.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[11] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. Piscataway, NJ, USA: IEEE Press, 2001.
[12] A. Schaefer, S. Udluft, and H. Zimmermann, "Learning long-term dependencies with recurrent neural networks," Neurocomputing, vol. 71, nos. 13–15, pp. 2481–2488, 2008.
[13] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. New York, NY, USA: Wiley, 2001.
[14] O. D. Jesus, J. Horn, and M. T. Hagan, "Analysis of recurrent network training and suggestions for improvements," in Proc. Int. Joint Conf. Neural Netw., Jul. 2001, pp. 2632–2637.
[15] J. Horn, O. D. Jesus, and M. T. Hagan, "Spurious valleys in the error surface of recurrent networks—Analysis and avoidance," IEEE Trans. Neural Netw., vol. 20, no. 4, pp. 686–700, Apr. 2009.
[16] A. I. Hanna, I. R. Krcmar, and D. P. Mandic, "Perlustration of error surfaces for nonlinear stochastic gradient descent algorithms," in Proc. 6th Seminar Neural Netw. Appl. Electr. Eng., Sep. 2002, pp. 11–16.
[17] K. Gouhara, T. Watanabe, and Y. Uchikawa, "Learning process of recurrent neural networks," in Proc. Int. Joint Conf. Neural Netw., Nov. 1991, pp. 746–751.
[18] K. Gouhara, K. Yokoi, and Y. Uchikawa, "Valley searching method for recurrent neural networks," in Proc. Int. Joint Conf. Neural Netw., Jun. 1992, pp. 972–979.
[19] K. Yokoi, K. Gouhara, and Y. Uchikawa, "Comparative study of searching method in recurrent neural networks," Syst. Comput. Jpn., vol. 27, no. 12, pp. 22–32, 1996.
[20] C.-T. Chen, Linear System Theory and Design. Oxford, U.K.: Oxford Univ. Press, 1998.
[21] A. V. Oppenheim, A. S. Willsky, and S. Hamid, Signals and Systems. Englewood Cliffs, NJ, USA: Prentice-Hall, 1997.
[22] B. A. Brousseau, Linear Recursion and Fibonacci Sequences. San Jose, CA, USA: Fibonacci Association, 1971.
[23] K. Ogata, Discrete-Time Control Systems, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1995.
[24] M. T. Hagan, H. B. Demuth, and O. D. Jesus, "An introduction to the use of neural networks in control systems," Int. J. Robust Nonlinear Control, vol. 12, no. 11, pp. 959–985, 2002.
[25] V. E. Hoggatt, "Divisibility properties of generalized Fibonacci polynomials," Fibonacci Quart., vol. 12, no. 2, pp. 113–120, Apr. 1974.
[26] O. D. Jesus and M. T. Hagan, "Backpropagation algorithms for a broad class of dynamic networks," IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 14–27, Jan. 2007.
[27] N. H. Nguyen and M. T. Hagan, "Stability analysis of layered digital dynamic networks using dissipativity theory," in Proc. Int. Joint Conf. Neural Netw., Aug. 2011, pp. 1692–1699.
[28] R. Jafari and M. T. Hagan, "Global stability analysis using the method of reduction of dissipativity domain," in Proc. Int. Joint Conf. Neural Netw., Aug. 2011, pp. 2550–2556.
[29] P. Guillaume, J. Schoukens, and R. Pintelon, "Sensitivity of roots to errors in the coefficient of polynomials obtained by frequency-domain estimation methods," IEEE Trans. Instrum. Meas., vol. 38, no. 6, pp. 1050–1056, Dec. 1989.

Manh Cong Phan received the B.S. degree in electrical engineering from the Hanoi University of Science and Technology, Hanoi, Vietnam, in 2008. Since 2009, he has been pursuing the Ph.D. degree with the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK, USA. His current research interests include recurrent neural networks and their applications in system identification and control systems.

Martin T. Hagan (M'78) received the B.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, USA, in 1972, the M.S. degree in information and computer science from the Georgia Institute of Technology, Atlanta, GA, USA, in 1973, and the Ph.D. degree in electrical engineering from the University of Kansas, Lawrence, KS, USA, in 1977. He is currently a Professor of electrical and computer engineering with Oklahoma State University, Stillwater, OK, USA, where he has taught and conducted research in the areas of statistical modeling and control systems since 1986. He was with the faculty of electrical engineering at the University of Tulsa, Tulsa, OK, from 1978 to 1986. He was a Visiting Scholar with the Department of Electrical and Electronic Engineering, University of Canterbury, Christchurch, New Zealand, in 1994, and the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, from 2005 to 2006. He is the author of Neural Network Design (Boston, MA: PWS, 1994) with H. Demuth and M. Beale. He is co-author of Neural Network Toolbox for MATLAB.
