This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON CYBERNETICS


F-Discrepancy for Efficient Sampling in Approximate Dynamic Programming Cristiano Cervellera and Danilo Macciò

Abstract—In this paper, we address the problem of generating efficient state sample points for the solution of continuous-state finite-horizon Markovian decision problems through approximate dynamic programming. It is known that the selection of the sampling points at which the value function is observed is a key factor when such a function is approximated by a model based on a finite number of evaluations. A standard approach consists in generating these points through a random or deterministic procedure aiming at a balanced covering of the state space. Yet, this solution may not be efficient if the state trajectories are not uniformly distributed. Here, we propose to exploit F-discrepancy, a quantity that measures how closely a set of random points represents a probability distribution, and introduce an example of an algorithm based on this concept to automatically select point sets that are efficient with respect to the underlying Markovian process. An error analysis of the approximate solution is provided, showing how the proposed algorithm enables convergence under suitable regularity hypotheses. Then, simulation results are provided concerning an inventory forecasting test problem. The tests generally confirm the important role of F-discrepancy, and show how the proposed algorithm is able to yield better results than uniform sampling, using sets even 50 times smaller.

Index Terms—Approximate dynamic programming (ADP), F-discrepancy, Markovian decision problem (MDP), state sampling, value function approximation.

I. INTRODUCTION

WE ADDRESS the problem of finding approximate solutions to a discrete-time finite-horizon Markovian decision problem (MDP) with continuous state and decision vectors. The aim of the MDP is to find optimal decision vectors at each stage that minimize a given cost function related to the evolution of the state, ruled by a suitable state equation. The standard algorithm to solve this problem is dynamic programming (DP), based on the definition, at each temporal stage, of a value function that quantifies the optimal cost that must be paid from a given state point to the end of the horizon. The DP equations generally need to be solved numerically, which requires a procedure for approximating the value functions; this leads to a family of algorithms known in the literature as approximate DP (ADP). In the past years, there has been a vast amount of research on ADP, and on the closely related reinforcement learning

Manuscript received January 30, 2015; revised May 19, 2015; accepted July 1, 2015. This paper was recommended by Associate Editor H. Zhang. The authors are with the Institute of Intelligent Systems for Automation, National Research Council, Genova 16149, Italy (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TCYB.2015.2453123

(RL), more focused on decision problems where the state equation is not known in advance. Contributions have been provided from different disciplines and scientific communities: good introductions to ADP are [1]–[3] and the references therein, while the works [4]–[16] contain recent developments in the cybernetics and machine learning fields, including RL and its connections to MDP and ADP. These works address a wide spectrum of MDP instances: discrete/continuous time, finite/infinite horizon, discrete/continuous state space, finite/infinite decision vector space, etc. For the analysis presented in this paper, we consider the framework in which we look for an approximation of the value functions by means of parameterized models chosen from a suitable class. Many different architectures have been proposed in the literature for the approximation of the value function (or the Q function in RL), such as polynomial approximators [17], splines [18], multivariate adaptive regression splines [19], kernel models [20], and fuzzy models [21]. Neural networks are also a very popular choice (see [9], [13], [22], [23]). In general, to obtain such approximations we need to evaluate the value function at a set of sampling points in the state space at the various temporal stages. It is known that one of the main issues of this approach is how to select efficient sets of sampling points for the state space that do not lead, at least structurally, to an exponential growth of their size. In the considered setting, since the early days of DP (when the regular grid was proposed, which is computationally unfeasible in high-dimensional contexts due to its exponential growth), the standard approach has mainly been to look for discretization methods aiming at a uniform covering of the state space. The simplest solution is to use random sampling.
For instance, in [9], ADP is analyzed for the particular case in which the system has an affine structure and the cost is quadratic, over an infinite horizon. In the proposed iterative algorithm, the initial state is sampled at random. Another ADP algorithm for infinite-horizon problems was recently proposed in [11], where the value function is not approximated directly and a reconstruction error is estimated instead. In that work too, the algorithm starts with a random sampling of the initial states. It must be remarked that the case addressed in both aforementioned works is one in which the state equation is deterministic, so the way the initial point is sampled has a strong impact on the resulting trajectories. Other examples of uniform sampling are techniques commonly used in statistics for the design of experiments,

2168-2267 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


such as orthogonal arrays [24], Latin hypercubes [25], low-discrepancy sequences, and lattice points, commonly employed for numerical integration and number-theoretic algorithms (see [20], [23], [26]–[28]). There are good reasons behind the choice of uniform sampling, the most obvious one being that it leads to a balanced approximation of the value function over the whole state space, thus reducing the risk of finding a wrong minimum in some zones of the state space. Yet, a main limitation of this approach is the risk of placing points in regions of the state space that are never actually reached while, conversely, undersampling regions that are very likely to be visited by the system. In fact, in the stochastic setting, the presence of the random vector influences the evolution of the system in such a way that the actual distribution of the state points, relevant for the computation of the value function, may be quite different from a uniform one. To overcome this limitation, we propose to exploit the concept of F-discrepancy (see, for instance, [29], [30]) as a quality parameter to select sampling sets that are representative of the distribution underlying the MDP. F-discrepancy is a generalization to arbitrary probability distributions of the uniform discrepancy, a measure commonly employed in quasi-Monte Carlo integration and number-theoretic methods to evaluate the uniformity of a sequence of points in a bounded set (see [31], [32]). F-discrepancy is defined, for a sample of points in a given set, as the maximum difference, over all subintervals, between the fraction of points falling in a subinterval and the measure of that subinterval according to F.
Loosely speaking, for the F-discrepancy to be small we need as many points as possible in the zones where the density of F is high, so a set of points with smaller F-discrepancy represents the distribution underlying the simulated trajectories better than a set with higher F-discrepancy. An example of an algorithm exploiting the concept of F-discrepancy will be proposed, as a method to select efficient sampling points starting from a large set of trajectories obtained by simulation of the system or, when the state equation is unknown, by actual operation if possible. It will be proved that such an algorithm allows convergence of the ADP solutions to the true solutions of the DP equations, under suitable regularity hypotheses on the involved functions. To this purpose, we will exploit results that relate the F-discrepancy of a sample of points to the estimation of integration errors through empirical quantities based on data. We remark that, apart from the aforementioned regularity hypotheses, these results apply to very general instances of MDP, since they do not rely on any particular form of the state equation, the cost function, or the approximating architectures for the value functions. To showcase the importance of the F-discrepancy of the sampling points, as well as the proposed algorithm, simulation results are provided involving a classic MDP testbed, namely, an inventory forecasting problem. Concerning recent works specifically focused on sampling, a similar problem of collecting a representative subsample of the available data is considered in [7] for ADP over an infinite horizon using the policy iteration approach. For a continuous state


space, the subsample for the initial state is obtained through a procedure involving clustering methods and the Nyström extension method. Although there are important differences with the context addressed here (e.g., the infinite-horizon problem, the policy iteration approach, the predefined structure of the value function approximation), the work is important to show how a smart subsampling of the collected data can improve greatly the efficiency of methods based on the DP paradigm for the solution of MDPs. Also notice that, in general, selecting states to visit according to the actual evolution of the system is not a new concept in the ADP and RL literature (see [12] for a recent method taking advantage of sampled state trajectories to solve linear-quadratic infinite-horizon ADP problems). Most approaches based on this kind of strategy either do not build an approximation of the value function based on a predefined architecture (like Monte Carlo methods, see [22]), or build the approximations through an iterative procedure, possibly online. Then, in many cases, the state and action spaces are assumed to be discrete (see [6]). Formally, here we consider the ADP version in which the value functions for all the stages of the horizon are constructed entirely offline and in one shot, starting from a set of sampling points that is decided a priori. However, it is straightforward to place the proposed approach based on F-discrepancy in an iterative procedure in which at each iteration we build an improved set of value function approximations exploiting those computed in the previous iteration. Furthermore, even if here the analysis is carried out for the continuous-state case, the approach can be applied straightforwardly also in the case of discrete state space when the number of states is very large, and there is need to select a representative subsample. 
Finally, the addressed case of finite horizon is actually quite general, since infinite-horizon problems can often be handled by applying, e.g., multistage lookahead policies (see [1]). See also [9] for ideas on how to extend results for finite-horizon problems to infinite-horizon ones. This paper is organized as follows. Section II describes the problem formulation. In Section III, an overview of the DP algorithm is presented. Section IV is devoted to the concept of F-discrepancy, and some results useful for the error analysis are introduced there as well. The section also contains a description of the proposed algorithm to derive efficient point sets for ADP. In Section V, the analysis of the ADP error is derived, together with conditions for its convergence. Finally, Section VI reports simulation results for an inventory forecasting test problem, while in Section VII conclusions and some remarks on future developments are drawn.

II. PROBLEM FORMULATION

We address the problem of finding approximate solutions to a discrete-time finite-horizon MDP with continuous state and decision vectors. Such problems are characterized by a dynamic system evolving according to the state equation

$$x_{t+1} = f(x_t, u_t, w_t), \quad t = 0, \dots, T-1$$

where $x_t \in X_t \subset \mathbb{R}^n$ is the state vector, $u_t \in U_t(x_t) \subset \mathbb{R}^m$ is the decision vector, and $w_t \in W_t \subset \mathbb{R}^p$ is


a random variable characterized by a suitable probability measure P. $U_t(x_t) \subset \mathbb{R}^m$ is the set of admissible decision vectors for state $x_t$. We also define the total set of admissible decision vectors as $\bar{U}_t = \bigcup_{x_t \in X_t} U_t(x_t)$. The vectors $w_t$ are assumed to be independent over the T stages forming the decision horizon. The aim of the MDP is to find optimal decision vectors at each stage t that minimize a given cost function related to the evolution of the state, having an additive form over the T stages

$$J = \sum_{t=0}^{T-1} c(x_t, u_t, w_t) + c_T(x_T)$$

where c is the cost function of the single stage and $c_T$ is the cost function of the final stage. In some problems, we may have $c_T = 0$. Notice that we want the decision vectors $u_t$ to be the output of functions of the current state vector, i.e., $u_t = \mu_t(x_t)$, where $\mu_t$ is usually called a policy. Concerning the state equation, we must guarantee that the transition map f never drives the system from a state $x_t$ to a state $x_{t+1}$ belonging to an unbounded set, starting from $x_0 \in X_0$. More specifically, we assume the following feasibility condition.

Assumption 1: For every $x_t \in X_t$, $u_t \in U_t(x_t)$ and $w_t \in W_t$, we have $f(x_t, u_t, w_t) \in X_{t+1}$ for any $t = 0, \dots, T-1$, with $X_{t+1}$ compact.

An example of a transition map satisfying Assumption 1 occurs when the function f is continuous and the sets $X_0$, $U_t$ and $W_t$ (for all t) are compact. In this case, $\sup_{x_t, u_t, w_t} \|f(x_t, u_t, w_t)\| = \gamma$ is finite; therefore $X_{t+1}$ can be defined as the closed ball of radius $\gamma$, which is a compact set. Since the MDP is solved over a finite horizon, $X_t$ is compact for every $t = 0, \dots, T$. Notice that, once this assumption is guaranteed, any further requirement on the state vector (e.g., norm bounded by a desired value, tracking of a reference target, etc.) can be satisfied, if needed, by defining the cost suitably.

III. OVERVIEW OF APPROXIMATE DYNAMIC PROGRAMMING

The finite-horizon DP algorithm relies on the definition of a value function at each stage t, denoted by $J_t^\circ(x_t)$, representing the cost that needs to be paid from a given stage to the end of the horizon, computed by solving the following well-known recursive equations:

$$J_T^\circ(x_T) = c_T(x_T)$$
$$J_t^\circ(x_t) = \min_{u_t \in U_t(x_t)} E_{w_t}\!\left[ c(x_t, u_t, w_t) + J_{t+1}^\circ\!\left( f(x_t, u_t, w_t) \right) \right] \qquad (1)$$

for $t = T-1, \dots, 0$, where $E(\cdot)$ is the expectation operator. It is known that $J_0^\circ(x_0)$ corresponds to the optimal cost of the original MDP (see [1]). If we introduce the Bellman operator $\mathcal{T}$ defined, for the generic cost function J, as

$$\mathcal{T} J(x) = \min_{u} E_w\!\left[ c(x, u, w) + J\!\left( f(x, u, w) \right) \right]$$


the DP equations (1) for the various stages can be written in compact form as

$$J_T^\circ = c_T, \qquad J_t^\circ = \mathcal{T} J_{t+1}^\circ, \quad t = T-1, \dots, 0.$$

Thus, we have $J_0^\circ = \mathcal{T}^T J_T^\circ$, where $\mathcal{T}^k$ denotes the composition of the mapping $\mathcal{T}$ with itself k times, i.e., $\mathcal{T}^k J = \mathcal{T}(\mathcal{T}^{k-1} J)$. Since in general the DP equations (1) cannot be solved analytically, ADP relies at each stage t on an approximation $\hat{J}_t$ of the value function $J_t^\circ$, obtained by estimation through a suitable class of models $\Psi_t$ (such as, e.g., the class of feedforward neural networks, kernel smoothers, multivariate adaptive regression splines, etc.). Then, the ADP algorithm consists in solving, at each stage, the following set of recursive equations:

$$\tilde{J}_T = J_T^\circ = c_T, \qquad \tilde{J}_t = \mathcal{T} \hat{J}_{t+1}, \quad t = T-1, \dots, 0.$$

At each stage t, $\tilde{J}_t$ is an approximation of the true value function $J_t^\circ$ obtained using the approximation $\hat{J}_{t+1}$ built at the previous stage. It is known that the error between the cost obtained by the ADP solution and the optimal one is directly affected by the accuracy with which $\hat{J}_t$ approximates $\tilde{J}_t$ at stage t (a more detailed analysis will be reported in Section V). To actually obtain such approximations, we rely on a set $\Sigma_t^L$ of L sampling points in $X_t$, assuming for simplicity that the latter is a compact n-dimensional hypercube

$$\Sigma_t^L = \left\{ x_t^j \in X_t : j = 1, \dots, L \right\}, \quad t = 0, \dots, T-1$$

where $x_t^j$ indicates the jth sampling point in $X_t$, and compute the values $\tilde{J}_t(x_t^j)$ at each point of $\Sigma_t^L$. Then, for each stage t, consider a class of models $\Psi_t = \{\psi(\cdot, \alpha_t) : \alpha_t \in \Lambda_t \subset \mathbb{R}^k\}$ characterized by a vector of parameters $\alpha_t$ to be determined. For instance, $\alpha_t$ can be the set of weights of a feedforward neural network or the width parameters of a kernel model. Then, we obtain the optimal parameter vector $\alpha_t^*$ by minimizing a given error function (such as a mean squared error) between the computed values of $\tilde{J}_t$ and $\psi$ at the points of $\Sigma_t^L$, and define $\hat{J}_t(x_t) = \psi(x_t, \alpha_t^*)$. However, in general we are not able to express analytically the expected value with respect to $w_t$, and need to estimate it through an average computed over S realizations $W_t^S = \{w_t^1, \dots, w_t^S\}$ of the random variables drawn from their probability distribution P. Then, the ADP equation eventually becomes

$$\tilde{J}_t^S(x_t^j) = \min_{u_t \in U_t(x_t^j)} \frac{1}{S} \sum_{s=1}^{S} \left[ c(x_t^j, u_t, w_t^s) + \hat{J}_{t+1}\!\left( f(x_t^j, u_t, w_t^s) \right) \right] \qquad (2)$$

for $t = T-1, \dots, 0$ and for each $x_t^j \in \Sigma_t^L$, where $\hat{J}_{t+1}$ is the approximate value function obtained at the previous stage. Then, if we employ a mean squared error criterion,



we can obtain the optimal parameter vector $\alpha_t^*$ by solving the following optimization problem:

$$\alpha_t^* = \arg\min_{\alpha \in \Lambda_t} \frac{1}{L} \sum_{j=1}^{L} \left( \tilde{J}_t^S(x_t^j) - \psi(x_t^j, \alpha) \right)^2. \qquad (3)$$
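The backward construction defined by (2) and (3) can be sketched in a few lines of code. The following is a minimal illustration on a hypothetical one-dimensional toy instance: the dynamics f, the costs c and c_T, the decision grid, the polynomial model class, and all numerical values below are our own assumptions for demonstration, not taken from the paper.

```python
import numpy as np

# Hypothetical toy instance: 1-D state and decision,
# f(x,u,w) = x + u + w, c(x,u,w) = x^2 + 0.1 u^2, c_T(x) = x^2.
rng = np.random.default_rng(0)
T, L, S = 3, 50, 20                    # horizon, sample points per stage, noise draws
U_GRID = np.linspace(-1.0, 1.0, 21)    # discretized admissible decisions

def f(x, u, w):  return x + u + w
def c(x, u, w):  return x**2 + 0.1 * u**2
def c_T(x):      return x**2

def fit_poly(xs, ys, deg=4):
    """Least-squares fit of a polynomial model psi(., alpha), in the spirit of (3)."""
    coef = np.polyfit(xs, ys, deg)
    return lambda x: np.polyval(coef, x)

J_hat = [None] * (T + 1)
J_hat[T] = c_T
for t in reversed(range(T)):
    xs = rng.uniform(-2.0, 2.0, L)     # sampling set for stage t
    ws = rng.normal(0.0, 0.1, S)       # S noise realizations
    ys = np.empty(L)
    for j, x in enumerate(xs):         # sample-average Bellman step, as in (2)
        q = [np.mean(c(x, u, ws) + J_hat[t + 1](f(x, u, ws))) for u in U_GRID]
        ys[j] = min(q)
    J_hat[t] = fit_poly(xs, ys)        # build the approximation, as in (3)
```

Here the fitted polynomial plays the role of $\psi(\cdot, \alpha_t^*)$; in the paper the model class is, e.g., a feedforward neural network rather than a fixed-degree polynomial.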

Notice that the estimation of the expected value through an empirical mean introduces a further level of approximation, which needs to be included in the error analysis, as will be detailed in the following. The recursive procedure to obtain the value function approximations at stage t, for $t = T-1, \dots, 1$, can eventually be summarized as follows.
1) Choose a sample of points $\Sigma_t^L$ in $X_t$.
2) For each point $x_t^j \in \Sigma_t^L$, $j = 1, \dots, L$, compute the value $\tilde{J}_t^S(x_t^j)$ with (2), using the value function approximation $\hat{J}_{t+1}$ computed at stage t + 1.
3) Build the approximate value function $\hat{J}_t$ with (3), using the input–output pairs $\{x_t^j, \tilde{J}_t^S(x_t^j)\}$, $j = 1, \dots, L$.
Notice that the procedure described above allows one to obtain the value function approximations offline and backwards in time. In the online phase, the approximations are employed to compute the optimal decision vector in a given state, forward from time 0 to time T - 1. In particular, the optimal vectors $\tilde{u}_0^\circ, \dots, \tilde{u}_{T-1}^\circ$ are computed online, starting from the initial state $\tilde{x}_0$, using again at each stage $t = 0, \dots, T-1$ the ADP equations as follows:

$$\tilde{u}_t^\circ = \arg\min_{u \in U_t(\tilde{x}_t)} \frac{1}{S} \sum_{s=1}^{S} \left[ c(\tilde{x}_t, u, w_t^s) + \hat{J}_{t+1}\!\left( f(\tilde{x}_t, u, w_t^s) \right) \right]$$

being $\tilde{x}_{t+1} = f(\tilde{x}_t, \tilde{u}_t^\circ, \tilde{w}_t)$, where $\tilde{w}_t$ is the realization of the random vector that affects the system in the online phase. By applying this procedure for all the stages, we obtain the total cost actually paid in the online phase from $\tilde{x}_0$ to $\tilde{x}_T$. In general, the accuracy of the approximator $\hat{J}_t$ is influenced by two concurrent effects: 1) how "rich" the class of approximators $\Psi_t$ is, i.e., how close we can get to the true value function with elements of the class and 2) how well we are able to estimate, within the class $\Psi_t$, the best element, i.e., the element that is closest to the true value function. The latter depends not only on the complexity of the class $\Psi_t$, but also on the set of sampling points $\Sigma_t^L$ used for the optimization of the parameters of the approximating model in (3). In the following, we propose a method to obtain efficient samples for the construction of the value function approximations, based on the concept of F-discrepancy of the set $\Sigma_t^L$.

IV. AN APPROXIMATE DYNAMIC PROGRAMMING APPROACH BASED ON F-DISCREPANCY

In order to present the proposed algorithm and proceed with the analysis, we recall the concept of F-discrepancy of a generic set $\Sigma_N$ of N points in a compact $X \subset \mathbb{R}^n$, as a way to quantify its distribution properties with respect to the measure F on X. The following notation is adopted. For $u, v \in X$, we denote by $[u, v]$ the subinterval $\prod_{i=1}^{n} [u_i, v_i]$

and by $A([u, v], \Sigma_N)$ the number of points of $\Sigma_N$ belonging to $[u, v]$.

Definition 1: The F-discrepancy of the sequence $\Sigma_N = \{x^1, \dots, x^N\}$ is defined as

$$D_F(\Sigma_N) = \sup_{u, v \in X} \left| \frac{A([u, v], \Sigma_N)}{N} - F([u, v]) \right|.$$

By the definition we can see that, if a set of points has small F-discrepancy, the fraction of points belonging to a given subinterval of X is close to the measure of the subinterval according to F. For the sake of simplicity and without loss of generality, we assume from now on that $X = [0, 1]^n$. The case in which F is the Lebesgue measure is by far the most common in the literature, due to its importance in quasi-random integration; in that case, the F-discrepancy is simply called "discrepancy." The convergence of $D_F(\Sigma_N)$ in the uniform case as N grows has been studied in many works, and many examples of efficient sampling schemes leading to good convergence rates for the integration of sufficiently smooth functions have been proposed (see [33]). In the general case of F-discrepancy, far fewer results are available in the literature, yet important ones have been provided. In particular, it has been shown that the smallest possible F-discrepancy of N points $\Sigma_N$ in $[0, 1]^n$ is bounded by $D_F^\circ(\Sigma_N) \le \sqrt{2}\, C n^{1/2} N^{-1/2}$, where C is a constant that does not depend on n (see [34]). In practice, estimating the F-discrepancy of a set of points is not a straightforward task. For the uniform case, efficient methods have been proposed in the literature to provide estimates of the discrepancy, at least for moderate N and dimension n (see [35]). In the case of the F-discrepancy considered in the ADP context of this paper, the task is doubly difficult, since it is not possible to exploit the regularities of the uniform measure on which the aforementioned methods are based, and F is unknown.
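To make Definition 1 concrete, the following one-dimensional sketch approximates $D_F$ by brute force over a grid of candidate interval endpoints, for a known F (here the uniform CDF on [0, 1]). This is our own illustration, not the estimation scheme developed later in the paper.

```python
import numpy as np

def f_discrepancy_1d(points, F, grid=2001):
    """Approximate D_F over subintervals [u, v] of [0, 1] by brute force on a
    grid of candidate endpoints (the sup in Definition 1 is only approached
    as the grid is refined)."""
    pts = np.sort(np.asarray(points))
    cand = np.linspace(0.0, 1.0, grid)
    # fraction of points in [0, g] for each candidate endpoint g
    frac = np.searchsorted(pts, cand, side="right") / len(pts)
    Fg = F(cand)
    best = 0.0
    for i in range(grid):                       # u = cand[i]
        # |A([u,v],.)/N - (F(v) - F(u))| maximized over v >= u
        diff = np.abs((frac[i:] - frac[i]) - (Fg[i:] - Fg[i]))
        best = max(best, diff.max())
    return best

F_unif = lambda x: x                            # uniform measure on [0, 1]
rng = np.random.default_rng(1)
even = (np.arange(100) + 0.5) / 100             # evenly spread sample
clumped = rng.uniform(0.0, 0.2, 100)            # all mass in [0, 0.2]
# An evenly spread sample has far smaller uniform discrepancy:
assert f_discrepancy_1d(even, F_unif) < f_discrepancy_1d(clumped, F_unif)
```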
Then, to apply the methods and the theory proposed in this paper, we estimate the F-discrepancy of a set $\Sigma_N$ by approximating the probability F through a larger set $\Sigma_M$ of points with $M \gg N$ (possibly with $\Sigma_N \subset \Sigma_M$). This leads to the definition of the estimated F-discrepancy as

$$\tilde{D}_F(\Sigma_N, \Sigma_M) = \sup_{u, v \in X} \left| \frac{A([u, v], \Sigma_N)}{N} - \frac{A([u, v], \Sigma_M)}{M} \right|. \qquad (4)$$

The error committed using $\tilde{D}_F$ instead of the true F-discrepancy can be analyzed by writing

$$\tilde{D}_F(\Sigma_N, \Sigma_M) = \sup_{u, v \in X} \left| \frac{A([u, v], \Sigma_N)}{N} - F([u, v]) + F([u, v]) - \frac{A([u, v], \Sigma_M)}{M} \right|$$
$$\le \sup_{u, v \in X} \left| \frac{A([u, v], \Sigma_N)}{N} - F([u, v]) \right| + \sup_{u, v \in X} \left| F([u, v]) - \frac{A([u, v], \Sigma_M)}{M} \right| = D_F(\Sigma_N) + D_F(\Sigma_M).$$


Then, the error committed by using $\tilde{D}_F$ instead of $D_F$ is quantified by the F-discrepancy of the set $\Sigma_M$ used to approximate the probability F. We have seen that the bound for the smallest possible F-discrepancy of M points corresponds to an error $\tilde{D}_F(\Sigma_N, \Sigma_M) \le D_F(\Sigma_N) + \sqrt{2}\, C n^{1/2} M^{-1/2}$. In particular, the bound holds when the points of $\Sigma_M$ are drawn according to the probability F (see [34]).

Notice that the value of $\tilde{D}_F(\Sigma_N, \Sigma_M)$ in (4) must be estimated through a numerical optimization procedure over the 2n-dimensional space of the vertices uniquely defining the subintervals $[u, v] \subset X$. Since the function to be maximized is discontinuous, a Monte Carlo search algorithm seems to be an appropriate choice.

For the purposes of our analysis, the F-discrepancy plays an important role in a generalization to the F distribution of the Koksma–Hlawka inequality, commonly employed in quasi-random numerical integration. To this end, we first define the concept of variation of a function in the sense of Hardy and Krause. For each vertex of a given subinterval $B = \prod_{i=1}^{n} [a_i, b_i]$ of X, assign a binary label "0" to every $a_i$ and "1" to every $b_i$. Given a function $\varphi : X \to \mathbb{R}$, denote by $\Delta(\varphi, B)$ the alternating sum of $\varphi$ computed at the vertices of B, where the sign of each term depends on the parity of the vertex label. Formally, we have

$$\Delta(\varphi, B) = \sum_{x \in e_B} \varphi(x) - \sum_{x \in o_B} \varphi(x)$$

where $e_B$ is the set of vertices with an even number of 1s in their label, and $o_B$ is the set of vertices with an odd number of 1s. As an example, for n = 2 and $B = [a_1, b_1] \times [a_2, b_2]$, we have $\Delta(\varphi, B) = \varphi(a_1, a_2) - \varphi(a_1, b_2) - \varphi(b_1, a_2) + \varphi(b_1, b_2)$.

Definition 2: The variation of $\varphi$ on X in the sense of Vitali is defined by

$$V^{(n)}(\varphi) = \sup_{\wp} \sum_{B \in \wp} |\Delta(\varphi, B)| \qquad (5)$$

where $\wp$ is any partition of X into subintervals. The variation in the sense of Vitali is a measure of the variability of a multidimensional function over its domain. In fact, the sum of the terms $|\Delta|$ is large when the domain can be divided into many subintervals having vertices on which the function $\varphi$ oscillates significantly. Then, in order to measure the total variability on each possible subset of the components, the variation in the sense of Hardy and Krause is introduced. To this purpose, for $1 \le k \le n$ and $1 \le i_1 < i_2 < \cdots < i_k \le n$, let $V^{(k)}[\varphi; i_1, \dots, i_k]$ be the variation in the sense of Vitali of the restriction of $\varphi$ to the k-dimensional face $\{(x_1, \dots, x_n) \in X : x_i = 1 \text{ for } i \ne i_1, \dots, i_k\}$.

Definition 3: The variation of $\varphi$ on X in the sense of Hardy and Krause is defined by

$$V_{HK}(\varphi) = \sum_{k=1}^{n} \sum_{1 \le i_1 < \cdots < i_k \le n} V^{(k)}[\varphi; i_1, \dots, i_k].$$

[…] there exists $\eta > 0$, independent of K and $\alpha$, such that $|r_t(x, \alpha_t^*) - r_t(y, \alpha_t^*)| \le \eta \|x - y\|$ for all $x, y \in X_t$. Notice that $r_K^{\max}$, $r_K^{\min}$ and $x_K^*$ exist due to the continuity of $r_t$ and, since $\delta_K \to 0$, we have that $r_K^{\min} \to 0$ as $K \to \infty$. Let $\epsilon_K$ be the largest radius of the ball $B_{\epsilon_K}(x_K^*)$ centered in $x_K^*$ such that $|r_t(x, \alpha_t^*) - r_t(x_K^*, \alpha_t^*)| \le (r_K^{\max} - r_K^{\min})/2$. There exists a point $\bar{x}_K$ on the boundary of $B_{\epsilon_K}(x_K^*)$ such that $|r_t(\bar{x}_K, \alpha_t^*) - r_t(x_K^*, \alpha_t^*)| = (r_K^{\max} - r_K^{\min})/2$. The Lipschitz continuity of $r_t$ implies that $\epsilon_K \ge (r_K^{\max} - r_K^{\min})/(2\eta)$. Then, we have

$$\delta_K = \int_{X_t} r_t(x, \alpha_t^*)\, dF_t(x) \ge \int_{B_{\epsilon_K}(x_K^*)} r_t(x, \alpha_t^*)\, dF_t(x) \ge \min_{x \in B_{\epsilon_K}(x_K^*)} r_t(x, \alpha_t^*)\; F_t\!\left(B_{\epsilon_K}(x_K^*)\right). \qquad (11)$$

Since $\min_{x \in B_{\epsilon_K}(x_K^*)} |r_t(x, \alpha_t^*)| = (r_K^{\max} + r_K^{\min})/2$, expression (11) becomes

$$\delta_K \ge \frac{r_K^{\max} + r_K^{\min}}{2}\, F_t\!\left(B_{\epsilon_K}(x_K^*)\right) \ge \frac{r_K^{\max} + r_K^{\min}}{2}\, \kappa\, \epsilon_K^{\,l}$$

where the second inequality derives from assumption 2). Recalling that $\epsilon_K \ge (r_K^{\max} - r_K^{\min})/(2\eta)$, we have

$$\left(r_K^{\max} + r_K^{\min}\right)\left(r_K^{\max} - r_K^{\min}\right)^l \le \frac{2^{l+1} \eta^l}{\kappa}\, \delta_K.$$

Since $\delta_K \to 0$ by assumption, we have that $r_K^{\max} \to r_K^{\min}$. Remembering that $r_K^{\min} \to 0$ as $K \to \infty$, the assertion is proved.

Concerning the assumptions required by the theorem, item 1) is a uniform Lipschitz condition satisfied by a large set of functions (e.g., differentiable functions). Item 2) states that the measure $F_t(B_\epsilon(x))$ must increase at least polynomially with respect to $\epsilon$. Notice that, in order to guarantee the convergence of $G(\alpha_t^*)$ as K grows, since in this case we have $\bar{V} = \bar{V}_K$, we must assume that $\sup_K \bar{V}_K < \infty$. In other words, we must choose a class of approximating functions with uniformly bounded variation, endowed with a universal approximation property within the set of functions with bounded total variation, to ensure that $G^\circ$ can be made arbitrarily small (since the target function is assumed to have bounded variation). The class of feedforward neural networks with a finite number of neural units and bounded weights is an example of a class suited to this purpose.


If the class of functions is not endowed with such a universal approximation property, we must accept a bias in the error $r_K^{\max}$ due to $G^\circ$, as shown in the proof. However, the analysis shows that even in this case a smaller F-discrepancy of the set $\Sigma_t^L$ is still advantageous, since it makes $G(\alpha_t^*)$ closer to $G^\circ$. As a last remark, notice that in practice we must replace $D_{F_t}$ in the analysis with its estimated version $\tilde{D}_{F_t}$. Then, the proper error term (decreasing with M) must be added to the bounds, as discussed in Section IV.

VI. SIMULATION TESTS

In order to evaluate the effect of the F-discrepancy of the sampling sets, as well as the performance of the algorithm proposed in Section IV-A, simulation tests have been carried out. The chosen test problem is a 9-dimensional inventory forecasting one, often used in the literature as a testbed for ADP algorithms (see [19], [26]). The problem consists in optimizing the orders of three items through a finite horizon of T stages, with the aim of both satisfying the demand and keeping the storage levels as small as possible. We denote by $x_{t,i}$ and $u_{t,i}$ the ith components of the state vector and the decision vector at stage t, respectively. The component $x_{t,i}$ represents the surplus (or deficit) of item i with respect to the demand in period t for i = 1, 2, 3, the forecast for the demand of item i - 3 in period t + 1 for i = 4, 5, 6, and the forecast for the demand of item i - 6 in period t + 2 for i = 7, 8, 9. The component $u_{t,i}$ represents the amount of product i ordered at stage t, for i = 1, 2, 3. The state equation that determines the evolution of the system between stages t and t + 1 has the form

$$x_{t+1,i} = x_{t,i} + u_{t,i} - x_{t,i+3} \cdot w_{t,i} \quad \text{for } i = 1, 2, 3$$
$$x_{t+1,i} = x_{t,i+3} \cdot w_{t,i} \quad \text{for } i = 4, 5, 6$$
$$x_{t+1,i} = d_{i-6} \cdot w_{t,i} \quad \text{for } i = 7, 8, 9$$

where $d_i$ is a constant term and $w_{t,i}$ is a random correction between the forecast and the true demand. The 9-dimensional vector $w_t = (w_{t,1}, \dots, w_{t,9})^T$ has a lognormal distribution with mean equal to 1 and variance equal to 0.1. The single-stage cost function has the V-shaped form

$$c(x_t, u_t, w_t) = \sum_{i=1}^{3} \left[ \beta_i \max\left(x_{t+1,i}, 0\right) - \gamma_i \min\left(x_{t+1,i}, 0\right) \right].$$
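The dynamics and stage cost above can be sketched directly; the following minimal implementation uses the parameter values given in the text, with helper names of our own choosing (and the state equation as reconstructed here):

```python
import numpy as np

# Inventory-forecasting testbed: 9-D state, 3-D decision.
d = np.array([8.0, 10.0, 6.0])           # constant demand terms d_1, d_2, d_3
beta = np.array([1.0, 0.75, 1.5])        # holding-cost coefficients
gamma = np.array([10.0, 9.0, 8.0])       # backorder-cost coefficients

def step(x, u, w):
    """One transition x_{t+1} = f(x_t, u_t, w_t) for the 9-D state (0-indexed)."""
    x_next = np.empty(9)
    x_next[0:3] = x[0:3] + u - x[3:6] * w[0:3]   # surplus/deficit update
    x_next[3:6] = x[6:9] * w[3:6]                # new one-period-ahead forecast
    x_next[6:9] = d * w[6:9]                     # new two-period-ahead forecast
    return x_next

def stage_cost(x_next):
    """V-shaped cost on the resulting surplus/deficit components."""
    s = x_next[0:3]
    return float(np.sum(beta * np.maximum(s, 0.0) - gamma * np.minimum(s, 0.0)))
```

For example, a state with one-period-ahead forecasts of 2 units, no stock and no order incurs a pure backorder cost of 2(10 + 9 + 8) = 54 when all noise corrections equal 1.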

To reflect both objectives of the inventory forecasting problem, the coefficients βi ≥ 0 and γi ≥ 0 represent the holding cost and the backorder cost parameters for item i, respectively. Concerning the parameters for the actual tests, the following values have been employed: d1 = 8, d2 = 10, d3 = 6; β1 = 1, β2 = 0.75, β3 = 1.5; γ1 = 10, γ2 = 9, γ3 = 8. The components of the decision vector ut are constrained in such a way that ut,1 + ut,2 ≤ 20, ut,2 + ut,3 ≤ 18, and ut,i ≥ 0 for every i. The first tests have been performed considering a horizon of T = 3 stages. To implement the algorithm described in Section IV-A, Q = 20 000 initial state vectors x0 have been sampled uniformly in the hypercube whose bounds are


TABLE I
BOUNDS ON THE INITIAL STATE VECTOR


shown in Table I. For each initial state $x_0^q$, a corresponding sequence of decision vectors $u_0^q, \dots, u_{T-1}^q$ has been sampled with uniform distribution. Then, a proper scaling has been operated to ensure that the resulting vectors are feasible according to the constraints described above. Finally, for each of the 20 000 initial points, a sequence of random log-normally distributed vectors $w_0^q, \dots, w_{T-1}^q$ has been generated. From this procedure, sets $\Sigma_t^Q$ of Q = 20 000 sampling points have been obtained for each stage t = 0, 1, 2. Two kinds of tests have been performed to showcase the importance of the F-discrepancy in choosing efficient sampling points. In a first round of tests, many subsets $\Sigma_t^L$ of different sizes L have been generated for each stage t = 1, 2. Specifically, six values of L (L = 200, 500, 1000, 2000, 3000, 5000) have been considered and, for each size L, ten different subsets have been generated, for a total of 60 sets for each stage t. Of the ten subsets for each size L, three have been generated in such a way that an increasing share of the points does not come from $\Sigma_t^Q$, but rather from a uniform distribution over the state space $X_t$ (taken as the smallest hypercube containing $\Sigma_t^Q$). Notice that, obviously, we expect these sets to have the highest F-discrepancy. This portion of the test is thus aimed at comparing the performance of a sampling based on the F-discrepancy with the classic sampling targeted at a uniform approximation of the value function. For all the 60 sets, the corresponding F-discrepancy has been estimated by applying (4), using $\Sigma_M = \Sigma_t^Q$. The supremum in the formula has been estimated by direct Monte Carlo optimization, taking the maximum value observed over 50 000 randomly extracted pairs (u, v) in $X_t$.
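The Monte Carlo estimation of (4) used above can be sketched as follows: the sup over subintervals is replaced by the largest deviation observed over randomly drawn boxes (u, v). The function name and defaults are our own illustrative choices.

```python
import numpy as np

def estimated_f_discrepancy(sn, sm, n_trials=50_000, rng=None):
    """Monte Carlo estimate of the estimated F-discrepancy (4): compare the
    fraction of points of the subset sn and of the large set sm falling in
    randomly drawn boxes [u, v], and keep the largest deviation."""
    rng = rng or np.random.default_rng()
    sn, sm = np.asarray(sn), np.asarray(sm)
    lo = np.minimum(sn.min(axis=0), sm.min(axis=0))
    hi = np.maximum(sn.max(axis=0), sm.max(axis=0))
    best = 0.0
    for _ in range(n_trials):
        a, b = rng.uniform(lo, hi, size=(2, sn.shape[1]))
        u, v = np.minimum(a, b), np.maximum(a, b)   # random box [u, v]
        in_n = np.all((sn >= u) & (sn <= v), axis=1).mean()
        in_m = np.all((sm >= u) & (sm <= v), axis=1).mean()
        best = max(best, abs(in_n - in_m))
    return best
```

As a sanity check, a subset drawn from the same distribution as the large set should score much lower than a subset concentrated in a corner of the space.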
Then, pairs of stage-1 and stage-2 subsets have been formed according to an ascending sorting of the estimated F-discrepancy at each stage t (i.e., the stage-1 set with the lowest estimated F-discrepancy D̃_F1 has been paired with the stage-2 set with the lowest estimated F-discrepancy D̃_F2, and so on) and used to obtain the 60 corresponding ADP solutions as described in Section III. As a global measure of the F-discrepancy of a pair, the sum D̃_F* = D̃_F1 + D̃_F2 has been considered. For each ADP solution, the approximation of the value functions has been obtained using a one-hidden-layer feedforward neural network with 20 neural units, trained with the Levenberg–Marquardt algorithm [39]. To test the goodness of a given solution, i.e., the performance of a given pair of subsets, 20 different initial state vectors x̃_0 have been sampled uniformly in X_0, and for each state vector 20 sequences of random log-normally distributed vectors {w̃_0, w̃_1, w̃_2} have been employed to evaluate the online cost obtained using the value function approximations, as described in Section III. Then, the average over these 200 costs has been taken as the performance indicator for the 60 pairs of sampling subsets under test.
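The pairing and scoring bookkeeping can be written compactly. The sketch below assumes two lists of estimated per-stage discrepancies, one entry per candidate subset; the names are illustrative and do not come from the paper.

```python
def pair_by_discrepancy(d_stage1, d_stage2):
    """Pair stage-1 and stage-2 subsets by ascending estimated
    F-discrepancy, scoring each pair by the sum D_F* = D_F1 + D_F2.
    Returns (index in stage-1 list, index in stage-2 list, D_F*)."""
    order1 = sorted(range(len(d_stage1)), key=d_stage1.__getitem__)
    order2 = sorted(range(len(d_stage2)), key=d_stage2.__getitem__)
    return [(i, j, d_stage1[i] + d_stage2[j]) for i, j in zip(order1, order2)]
```

For instance, with two candidate subsets per stage, the subsets with the lowest per-stage estimates are matched first, and their summed discrepancy gives the global score D̃_F* of the pair.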


Fig. 1. Average costs versus estimated D̃_F*.

Fig. 2. Size of the sampling sets versus estimated D̃_F*.

The results are depicted in Fig. 1, where the cost yielded by a given sampling pair is reported in correspondence to the estimated D̃_F*. Then, Fig. 2 plots the size L corresponding to the aforementioned values of D̃_F*.

For the sake of evaluating the actual goodness of these solutions, a reasonable heuristic approach for the selection of the quantities to be ordered has been tested as well. Such an approach consists in ordering, at each stage, the difference between the forecasted item quantities and the surplus already available in the store. More in detail, the heuristic decision vector at stage t is defined, for item i, as u_{t,i} = max(0, x_{t,i+6} + x_{t,i+3} − x_{t,i}). Then, suitable scalings are performed again to ensure that the resulting vectors are feasible according to the constraints. This heuristic policy yielded, over the 200 test instances defined above, an average cost of 84.02, which shows that all the 60 solutions obtained by the ADP procedures are actually good solutions.

Two things are immediately evident from inspecting the figures: first, the clear increasing trend, showing in practice that a smaller F-discrepancy is likely to lead to a smaller cost; second, the irrelevance of the sample size with respect to the F-discrepancy: in fact, most pairs with large L and large F-discrepancy yield worse costs than sets with L = 200 having small discrepancy. To focus on this last point, a second round of tests has been set up by randomly selecting 100 pairs of stage-1 and stage-2 subsets with

Fig. 3. Boxplots for the second round of tests.

Fig. 4. Mean costs over ten stages.

L = 200 and 100 pairs with L = 5000, in such a way that the pairs with L = 200 have a small estimated F-discrepancy D̃_F* (average equal to 0.21), while those with L = 5000 have a large D̃_F* (average equal to 1.25). Then, the ADP solutions have been computed using the aforementioned pairs, and the corresponding mean costs over the 200 test instances have been evaluated as described above. Fig. 3 reports boxplots of the results. It can be seen clearly how the sets with L = 200, despite having 50 times fewer points than the sets with L = 5000, yield better results due to the smaller D̃_F*.

A final round of tests has been devoted to investigating the behavior of the algorithm on longer horizons. To this purpose, a horizon of length T = 10 has been considered. Three different samplings of the state space with L = 200 have been tested, each characterized by a different value of the F-discrepancy D̃_F*: small (D̃_F* = 0.41), medium (D̃_F* = 3.63), and large (D̃_F* = 7.34). In particular, the set with large discrepancy corresponds to a pure independent identically distributed (i.i.d.) uniform sampling of the state space. To test the goodness of the solution corresponding to a given sampling, the same 20 initial state vectors x̃_0 of the previous tests have been employed, and for each state vector ten new sequences of random log-normally distributed vectors {w̃_0, ..., w̃_9} have been extracted. Then, the average cost


over these 200 configurations has been taken again as the performance indicator, as described above. Fig. 4 reports the average cumulative costs, plotted over the various stages, of the three solutions. In the figure, these are compared with the costs obtained with the reference heuristic policy already employed for the first round of tests. It can be seen how, once again, the best costs are given by the sets with smaller F-discrepancy, while the one with the largest F-discrepancy, i.e., the pure i.i.d. uniform one, yields a cost that is actually worse than that of the heuristic policy. This is not surprising, since it is well known that the propagation of the approximation errors over the stages can be an issue with long horizons. It is thus remarkable that the sets having small F-discrepancy are able to cope well with the long horizon directly.

Overall, the simulation results for the chosen test problem show that the F-discrepancy is a useful measure to select efficient samplings of the state space, following the algorithm proposed in Section IV-A. Moreover, the possibility of estimating D̃_F* makes the algorithm constructive and applicable in practice. It is worth remarking that the worst costs in Fig. 1, i.e., those with the largest values of D̃_F*, correspond to sets obtained by uniform sampling of the state space, as explained previously. This is further proof that selecting sampling points according to a notion of F-discrepancy leads to an improvement over the standard employment of sampling schemes aiming at a uniform covering of the state space. Furthermore, the results depicted in Fig. 4 suggest that a sampling algorithm based on the F-discrepancy makes it possible to avoid a degradation of the performance as the number of stages in the horizon grows, and hence to cope with long horizons without the need to adopt ad-hoc approaches such as multistage lookahead.

VII.
CONCLUSION

In this paper, we have proposed the F-discrepancy as a quantity useful to select good state sample points for the approximate solution of continuous-state finite-horizon MDPs. Accordingly, we have proposed an example of a constructive algorithm that, starting from a large set of simulated trajectories, yields subsets of efficient points to be used in a DP paradigm. The error analysis has proved that this procedure leads to convergence of the approximated costs to the optimal ones under suitable regularity hypotheses on the involved functions. In the simulated application case, involving inventory forecasting, the algorithm yielded sampling sets that achieve better results than uniform sampling while using 50 times fewer points. Concerning future developments, we can point out the following topics.

1) A key feature of the proposed methodology is the need for reliable estimates of the F-discrepancy, based on the available data. As said, some efficient algorithms are available for the case of uniform discrepancy, but so far there are no equivalent ones for the nonuniform case. Future work will thus be aimed at developing algorithms specifically tailored to this context, integrating them as well as possible within the MDP framework.


2) Application of the framework based on the F-discrepancy to "exploration-exploitation" procedures. An idea worth exploring is to build the sampling sets by adding points iteratively starting from the empty set, instead of subsampling them from a large set of existing trajectories. In this way, in the exploration phase we may try to explore new system states by choosing those that make the overall F-discrepancy of the current set decrease the most. How to find these points will be a subject of future research.

3) Exploiting the concept of F-discrepancy also for infinite-horizon MDPs, developing a specific method for policy iteration algorithms.

4) Testing the algorithm in other application cases will also be a matter of future work.
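The iterative construction mentioned in point 2) could, for instance, take the following greedy form. This is a speculative sketch under the assumption that the target distribution is represented by a reference sample and that the supremum in the discrepancy is approximated over a fixed collection of random boxes; all names are illustrative and not part of the proposed method.

```python
import numpy as np

def greedy_select(pool, reference, k, n_boxes=200, seed=0):
    """Greedily grow a sampling set: at each step, add the pool point whose
    inclusion yields the smallest estimated F-discrepancy with respect to
    the empirical distribution of `reference`."""
    rng = np.random.default_rng(seed)
    lo, hi = reference.min(axis=0), reference.max(axis=0)
    # fixed random boxes [u, v] used to approximate the supremum
    a = rng.uniform(lo, hi, size=(n_boxes, reference.shape[1]))
    b = rng.uniform(lo, hi, size=(n_boxes, reference.shape[1]))
    u, v = np.minimum(a, b), np.maximum(a, b)

    def memberships(pts):
        # boolean matrix: pts[n] inside box [u[m], v[m]] -> shape (n, n_boxes)
        return np.all((pts[:, None, :] >= u) & (pts[:, None, :] <= v), axis=2)

    ref_frac = memberships(reference).mean(axis=0)  # box probabilities of the target
    pool_mem = memberships(pool).astype(float)      # (n_pool, n_boxes)
    chosen, counts = [], np.zeros(n_boxes)
    for step in range(1, k + 1):
        # discrepancy of the set obtained by adding each candidate point
        disc = np.abs((counts + pool_mem) / step - ref_frac).max(axis=1)
        disc[chosen] = np.inf  # never pick the same point twice
        i = int(np.argmin(disc))
        chosen.append(i)
        counts += pool_mem[i]
    return chosen
```

In an exploration setting, the candidate pool would be replaced by reachable system states, so that each new visit is the one that most reduces the discrepancy of the set collected so far.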

REFERENCES

[1] D. Bertsekas, Dynamic Programming and Optimal Control, vol. 1, 2nd ed. Belmont, CA, USA: Athena Scientific, 2000.
[2] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Hoboken, NJ, USA: Wiley, 2011.
[3] M. Puterman, Markov Decision Processes. New York, NY, USA: Wiley, 1994.
[4] F. L. Lewis and K. G. Vamvoudakis, "Reinforcement learning for partially observable dynamic processes: Adaptive dynamic programming using measured output data," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 1, pp. 14–25, Feb. 2011.
[5] T. Mori and S. Ishii, "Incremental state aggregation for value function estimation in reinforcement learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 5, pp. 1407–1416, Oct. 2011.
[6] K. Senda, S. Hattori, T. Hishinuma, and T. Kohda, "Acceleration of reinforcement learning by policy evaluation using nonstationary iterative method," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2696–2705, Dec. 2014.
[7] X. Xu, Z. Huang, D. Graves, and W. Pedrycz, "A clustering-based graph Laplacian framework for value function approximation in reinforcement learning," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2613–2625, Dec. 2014.
[8] M. L. Koga, V. Freire, and A. H. R. Costa, "Stochastic abstract policies: Generalizing knowledge to improve reinforcement learning," IEEE Trans. Cybern., vol. 45, no. 1, pp. 77–88, Jan. 2015.
[9] A. Heydari, "Revisiting approximate dynamic programming and its convergence," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2733–2743, Dec. 2014.
[10] M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb, "Continuous-time Q-learning for infinite-horizon discounted cost linear quadratic regulator problems," IEEE Trans. Cybern., vol. 45, no. 2, pp. 165–176, Feb. 2015.
[11] Q. Wei, F.-Y. Wang, D. Liu, and X. Yang, "Finite-approximation-error-based discrete-time iterative adaptive dynamic programming," IEEE Trans. Cybern., vol. 44, no. 12, pp. 2820–2833, Dec. 2014.
[12] B. Kiumarsi, F. L. Lewis, M.-B. Naghibi-Sistani, and A. Karimpour, "Optimal tracking control of unknown discrete-time linear systems using input-output measured data," IEEE Trans. Cybern., DOI: 10.1109/TCYB.2014.2384016.
[13] A. Heydari and S. N. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145–157, Jan. 2013.
[14] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598–1611, 2012.
[15] H. Zhang, C. Qin, and Y. Luo, "Neural-network-based constrained optimal control scheme for discrete-time switched nonlinear system using dual heuristic programming," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 3, pp. 839–849, Jul. 2014.
[16] Q. Wei and D. Liu, "Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification," IEEE Trans. Autom. Sci. Eng., vol. 11, no. 4, pp. 1020–1036, Oct. 2014.


[17] R. Bellman, R. Kalaba, and B. Kotkin, "Polynomial approximation—A new computational technique in dynamic programming allocation processes," Math. Comput., vol. 17, no. 82, pp. 155–161, 1963.
[18] S. A. Johnson, J. Stedinger, C. A. Shoemaker, Y. Li, and J. A. Tejada-Guibert, "Numerical solution of continuous-state dynamic programs using linear and spline interpolation," Oper. Res., vol. 41, no. 3, pp. 484–500, 1993.
[19] V. Chen, D. Ruppert, and C. A. Shoemaker, "Applying experimental design and regression splines to high-dimensional continuous-state stochastic dynamic programming," Oper. Res., vol. 47, no. 1, pp. 38–53, 1999.
[20] C. Cervellera, M. Gaggero, and D. Macciò, "Low-discrepancy sampling for approximate dynamic programming with local approximators," Comput. Oper. Res., vol. 43, pp. 108–115, Mar. 2014.
[21] H. Zhang, J. Zhang, G.-H. Yang, and Y. Luo, "Leader-based optimal coordination control for the consensus problem of multiagent differential games via fuzzy adaptive dynamic programming," IEEE Trans. Fuzzy Syst., vol. 23, no. 1, pp. 152–163, Feb. 2015.
[22] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, CA, USA: Athena Scientific, 1996.
[23] M. Baglietto, C. Cervellera, M. Sanguineti, and R. Zoppoli, "Management of water resources systems in the presence of uncertainties by nonlinear approximators and deterministic sampling techniques," Comput. Optim. Appl., vol. 47, no. 2, pp. 349–376, 2010.
[24] V. C. P. Chen, "Measuring the goodness of orthogonal array discretizations for stochastic programming and stochastic dynamic programming," SIAM J. Optim., vol. 12, no. 2, pp. 322–344, 2001.
[25] C. Cervellera, V. Chen, and A. Wen, "Neural network and regression spline value function approximations for stochastic dynamic programming," Comput. Oper. Res., vol. 34, no. 1, pp. 70–90, 2006.
[26] C. Cervellera and D. Macciò, "A comparison of global and semi-local approximation in T-stage stochastic optimization," Eur. J. Oper. Res., vol. 208, no. 2, pp. 109–118, 2011.
[27] C. Cervellera and M. Muselli, "Efficient sampling in approximate dynamic programming algorithms," Comput. Optim. Appl., vol. 38, no. 3, pp. 417–443, 2007.
[28] H. Fan, P. K. Tarun, and V. Chen, "Adaptive value function approximation for continuous-state stochastic dynamic programming," Comput. Oper. Res., vol. 40, no. 4, pp. 1076–1084, 2013.
[29] J. Hartinger and R. Kainhofer, "Non-uniform low-discrepancy sequence generation and integration of singular integrands," in Monte Carlo and Quasi-Monte Carlo Methods 2004, H. Niederreiter and D. Talay, Eds. Berlin, Germany: Springer, 2006, pp. 163–179.
[30] C. Cervellera, M. Gaggero, and D. Macciò, "An analysis based on F-discrepancy for sampling in regression tree learning," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Beijing, China, Jul. 2014, pp. 1115–1121.
[31] K. T. Fang and Y. Wang, Number-Theoretic Methods in Statistics. London, U.K.: Chapman and Hall, 1994.
[32] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia, PA, USA: SIAM, 1992.
[33] J. Dick, F. Y. Kuo, and I. H. Sloan, "High dimensional integration—The quasi-Monte Carlo way," Acta Numer., vol. 22, pp. 133–288, May 2013.
[34] S. Heinrich, E. Novak, G. Wasilkowski, and H. Woźniakowski, "The inverse of the star-discrepancy depends linearly on the dimension," Acta Arith., vol. 96, no. 3, pp. 279–302, 2001.


[35] M. Gnewuch, M. Wahlström, and C. Winzen, “A new randomized algorithm to approximate the star discrepancy based on threshold accepting,” SIAM J. Numer. Anal., vol. 50, no. 2, pp. 781–807, 2012. [36] A. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–945, May 1993. [37] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural networks architectures,” Neural Comput., vol. 7, no. 2, pp. 219–269, 1995. [38] L. K. Jones, “A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training,” Ann. Stat., vol. 20, no. 1, pp. 608–613, 1992. [39] M. Hagan and M. Menhaj, “Training feedforward networks with the Marquardt algorithm,” IEEE Trans. Neural Netw., vol. 5, no. 6, pp. 989–993, Nov. 1994.

Cristiano Cervellera received the M.Sc. degree in electronic engineering and the Ph.D. degree in electronic engineering and computer science from the University of Genoa, Genova, Italy, in 1998 and 2002, respectively. He is currently a Researcher with the Genoa Branch of the Institute of Intelligent Systems for Automation of the Italian National Research Council, Genova. His current research interests include approximate dynamic programming, machine learning for optimal control, number-theoretic methods for optimization and learning, and local learning models. He has authored over 70 papers in various journals and conferences. Dr. Cervellera is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Danilo Macciò received the M.Sc. degree in telecommunication engineering and the Ph.D. degree in mathematical engineering from the University of Genoa, Genova, Italy, in 2005 and 2009, respectively. He is currently a Researcher with the Genoa Branch of the Institute of Intelligent Systems for Automation of the Italian National Research Council, Genova. His current research interests include machine learning, statistical estimation, kernel methods, and numerical solutions of functional optimization problems.
