
A Clustering-Based Graph Laplacian Framework for Value Function Approximation in Reinforcement Learning

Xin Xu, Senior Member, IEEE, Zhenhua Huang, Daniel Graves, Member, IEEE, and Witold Pedrycz, Fellow, IEEE

Abstract—In order to deal with sequential decision problems with large or continuous state spaces, feature representation and function approximation have been a major research topic in reinforcement learning (RL). In this paper, a clustering-based graph Laplacian framework is presented for feature representation and value function approximation (VFA) in RL. By making use of clustering techniques, that is, K-means clustering or fuzzy C-means clustering, a graph Laplacian is constructed by subsampling in Markov decision processes (MDPs) with continuous state spaces. The basis functions for VFA can be automatically generated from spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with a class of approximate policy iteration algorithms called representation policy iteration (RPI) for RL in MDPs with continuous state spaces. Simulation and experimental results show that, compared with previous RPI methods, the proposed approach needs fewer sample points to compute an efficient set of basis functions and the learning control performance can be improved for a variety of parameter settings.

Index Terms—Approximate dynamic programming, clustering, learning control, Markov decision processes, reinforcement learning, value function approximation.

Manuscript received August 12, 2012; revised December 30, 2013 and March 5, 2014; accepted March 6, 2014. Date of publication April 25, 2014; date of current version November 13, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61075072 and Grant 91220301, and in part by the New Century Excellent Talent Plan under Grant NCET-10-0901. This paper was recommended by Associate Editor F. Karray.

X. Xu and Z. Huang are with the College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]).

D. Graves is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada.

W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada, with the Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia, and also with the Systems Research Institute, Polish Academy of Sciences, Warsaw 01-447, Poland (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2014.2311578

I. INTRODUCTION

Reinforcement learning (RL) is a class of machine learning methods for solving sequential decision-making problems that can be described as Markov decision processes (MDPs). In the past decade, RL has been studied from various perspectives such as machine learning, optimal control, and operations research [1]. In RL, by interacting with an uncertain

environment, the agent learns action policies that maximize cumulative payoffs. Therefore, RL has been viewed as an important framework for solving learning control or adaptive optimal control problems with model uncertainties [2]–[7]. In earlier stages of RL research, most efforts were focused on learning control algorithms for MDPs with discrete state and action spaces. However, in many real-world applications, RL algorithms must solve MDPs with large or continuous state spaces. In such cases, since most earlier RL algorithms such as Q-learning and Sarsa-learning are tabular methods, they usually converge slowly and incur huge computational costs [2], [6].

To address this problem, approximate RL methods have received increasing attention in recent years. Generally, three classes of approximate RL methods have been widely studied: policy search [7], value function approximation (VFA) [8]–[10], and actor-critic methods [11], [12]. In particular, actor-critic algorithms, which can be viewed as a combination of policy search and VFA, have been shown to be very efficient in online learning control tasks [13]. In an actor-critic-based controller, an actor is used to learn the policies and a critic is used for policy evaluation or VFA. Among the various actor-critic methods for RL, adaptive critic designs (ACDs) [14]–[16] have attracted considerable research interest for the learning control of nonlinear dynamical systems. ACDs and related learning control methods are also called approximate dynamic programming (ADP) or adaptive dynamic programming [17]–[19].

It has long been recognized that VFA lies at the heart of most successful RL applications, and a variety of linear and nonlinear approximation architectures have been studied in the literature. Since linear approximation structures for VFA usually have advantages in convergence and stability, RL algorithms with linear function approximators have been widely studied in the past decade. Recent advances in VFA with linear approximators include linear Sarsa-learning [20], least-squares policy iteration (LSPI) [12], and others. Although RL with nonlinear VFA may exhibit better approximation ability than linear VFA, existing RL applications using nonlinear VFA usually lack rigorous theoretical analysis. In [5], a kernel-based least squares policy iteration (KLSPI) algorithm was proposed for MDPs with large state spaces, but the kernel functions still need to be selected by the designer. Therefore, a common drawback of the previous work in VFA is that



the basis functions are usually hand-coded by human experts rather than automatically constructed from the geometry of the underlying state space.

Recently, a methodology for VFA called proto-value functions (PVFs) was proposed [21]. According to the results reported in [21], PVFs can be constructed by spectral analysis of the self-adjoint Laplacian operator. After diagonalizing the Laplacian matrix L, basis functions are generated by computing the smallest eigenvalues and the corresponding smooth eigenvectors. Based on this approach, a class of approximate policy iteration algorithms called representation policy iteration (RPI) [21], [22] was proposed. For MDPs with large or continuous state spaces, one important problem in RPI is subsampling, that is, selecting an appropriate subset of samples to construct the graph, since the number of collected samples may be very large. In RPI [22], a random subsampling method and a trajectory-based subsampling method were studied, but the performance of RPI still needs to be improved for MDPs with large or continuous state spaces.

In this paper, a clustering-based graph Laplacian framework is presented for feature representation and VFA in RL. Clustering is one of the most popular methods for unsupervised data analysis, with application areas that include data mining, bioinformatics, computer security, and the social sciences [23]–[27]. Clustering analysis assigns patterns to different clusters based on similarity and computes the cluster centers accordingly. In the proposed framework, by making use of clustering techniques, that is, K-means clustering or fuzzy C-means (FCM) clustering, a graph Laplacian is constructed by subsampling in MDPs with continuous state spaces. The basis functions for VFA can be automatically generated from the spectral analysis of the graph Laplacian. The clustering-based graph Laplacian is integrated with RPI for RL in MDPs with continuous state spaces, where a new learning control algorithm called clustering-based RPI (CRPI) is proposed.

The purpose of subsampling in CRPI is to filter out points that carry little information about the underlying manifold of the state space and to use the representative points to learn an efficient set of basis functions. After clustering, all points in the same cluster exhibit maximal similarity, and the cluster centers are the averages or weighted averages of the points in each cluster. The graph constructed from the centers of all clusters can therefore represent the underlying manifold of the state space more accurately. Simulation results and experiments show that, compared with previous approaches, the clustering-based graph Laplacian method needs fewer sample points to compute an efficient set of basis functions, and the performance of CRPI is much better than that of RPI.

This paper is organized as follows. In Section II, an introduction to MDPs and related work is given. The clustering-based graph Laplacian method is presented in Section III, where the CRPI algorithm and its performance analysis are discussed in detail. In Section IV, simulation and experimental results on three learning control problems with continuous state spaces are provided to illustrate the effectiveness of the proposed method. Section V draws conclusions and suggests future work.
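To make the construction above concrete, the following Python sketch (an illustration under simple assumptions, not the authors' implementation) builds a k-nearest-neighbor similarity graph over a set of representative states, forms the normalized graph Laplacian, and returns the eigenvectors associated with the smallest eigenvalues as PVF-style basis functions; the function name and parameters are illustrative.

```python
import numpy as np

def laplacian_basis(points, k_neighbors=10, num_basis=20, sigma=1.0):
    """Build a k-NN graph on representative states and return the eigenvectors
    of its normalized Laplacian with the smallest eigenvalues, to be used as
    proto-value-function-style basis functions."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise squared distances between representative states.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # Symmetric k-nearest-neighbor adjacency with Gaussian weights.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k_neighbors + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)
    # The smoothest eigenvectors (smallest eigenvalues) form the basis.
    return eigvals[:num_basis], eigvecs[:, :num_basis]
```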

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 44, NO. 12, DECEMBER 2014

II. MDP AND RELATED WORKS ON RL AND ADP

A. Markov Decision Processes

Let X denote the state space, A the action space, P the state transition probability, and R = R(x, a) the reward function. Then, an MDP M can be denoted as a tuple {X, A, R, P}. A stationary policy π is a mapping from states to a probability distribution over the action space. Under policy π, the probability of selecting action a in state x is denoted by π(a|x). A deterministic stationary policy directly determines the selected action as

a_t = \pi(x_t), \quad t \ge 0.   (1)

The objective of RL is to use observed data to estimate the optimal policy π* that satisfies

J_{\pi^*} = \max_\pi J_\pi = \max_\pi E^\pi \left[ \sum_{t=0}^{\infty} \gamma^t r_t(x_t, a_t) \right]   (2)

where 0 < γ < 1 is the discount factor.

for k = 1, 2, . . . , K do
    i = \arg\min_i ||x_i − c_k||, i ∈ {1, 2, . . . , N};
    add the data point x_i into the subset D_c;
end for
return subset D_c.
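The loop above assigns to the subsampled set D_c, for each cluster center, the collected sample nearest to that center. A minimal Python sketch of this K-means-based subsampling step is given below; it assumes scikit-learn's KMeans for the clustering itself, and the function name is illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_subsample(samples, K):
    """Cluster the collected samples into K groups and return, for each
    cluster center, the nearest actual sample point (the subset D_c)."""
    samples = np.asarray(samples, dtype=float)
    centers = KMeans(n_clusters=K, n_init=10).fit(samples).cluster_centers_
    subset = []
    for c in centers:
        # Nearest collected sample to this cluster center.
        i = np.argmin(np.linalg.norm(samples - c, axis=1))
        subset.append(samples[i])
    return np.array(subset)
```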

generated last time will be used as the initial centroids for the next episode.

The FCM algorithm [23], [25], [27] partitions the given dataset X into c fuzzy sets by minimizing the within-cluster variance

J(U, v) = \sum_{i=1}^{N} \sum_{k=1}^{c} (u_{ki})^b \, ||x_i − v_k||_A^2   (16)

where b is a weighting exponent, 1 < b < ∞, and U = [u_{ki}]_{c×N} is the fuzzy c-partition of X, which satisfies the following conditions:

u_{ki} ∈ [0, 1];  \sum_{k=1}^{c} u_{ki} = 1, for all i;  \sum_{i=1}^{N} u_{ki} > 0, for all k.   (17)

v = (v_1, v_2, . . . , v_c) is the set of cluster center vectors, v_k = (v_{k1}, v_{k2}, . . . , v_{kn}) denotes the kth cluster center, and || · ||_A is the induced A-norm on R^p [23], [25]. The identity matrix is substituted for A in Algorithm 3.

The objective of the clustering methods is to minimize the within-class distance and maximize the between-class distance. The resulting subset is therefore more representative than the subsets obtained with other subsampling methods. In addition, the simulation and experimental results in the sequel show that the graph constructed under the clustering-based graph Laplacian framework reflects the topological structure of the whole state space much better than before.

C. Performance Analysis and Discussions

To learn near-optimal policies in MDPs with continuous states, it is necessary to compute eigenfunctions of new unexplored points based on the eigenfunctions of existing samples. In RPI, the Nyström extension method was used, combining iterative updates and randomized algorithms to realize low-rank approximations [22].
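As a rough sketch of this extension step, the function below implements one standard form of the Nyström approximation for eigenvectors of the normalized graph Laplacian (not necessarily the exact variant used in [22]); the Gaussian similarity and all names are assumptions made for illustration.

```python
import numpy as np

def nystrom_extend(x_new, subset, degrees, eigvals, eigvecs, sigma=1.0):
    """Approximate each Laplacian eigenvector (basis function) at a new,
    unexplored state x_new from its values on the subsampled states.

    subset  : (n, dim) states used to build the graph Laplacian
    degrees : (n,) degrees of those states in the graph
    eigvals : (m,) eigenvalues of the normalized Laplacian
    eigvecs : (n, m) corresponding eigenvectors evaluated on `subset`
    """
    # Similarity of the new state to every subsampled state.
    w = np.exp(-np.sum((subset - x_new) ** 2, axis=1) / (2.0 * sigma ** 2))
    d_new = w.sum()
    # Nystrom formula for normalized-Laplacian eigenvectors:
    # phi_m(x) ~ (1 / (1 - lambda_m)) * sum_i w(x, x_i) phi_m(x_i) / sqrt(d(x) d(x_i))
    coeff = w / np.sqrt(np.maximum(d_new * degrees, 1e-12))
    return (coeff @ eigvecs) / (1.0 - eigvals + 1e-12)
```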

Algorithm 3 input:
    c: the number of clusters;
    b: the weighting exponent;
    T: the maximal iteration number;
    μ: the convergence criterion;
    X = {x_1, x_2, . . . , x_N}: the collected samples.

1: Choose an initial matrix U^{(0)} = [u_{ki}^{(0)}]_{c×N} satisfying condition (17), and set D_c = ∅;
2: for j = 0, 1, . . . , T do
3:    Compute the cluster centers v_k^{(j+1)} using

      v_k^{(j+1)} = \sum_{i=1}^{N} (u_{ki}^{(j)})^b x_i \Big/ \sum_{i=1}^{N} (u_{ki}^{(j)})^b,  1 ≤ k ≤ c.   (18)

4:    Compute an updated membership matrix U^{(j+1)} with

      d_{ki} = ||x_i − v_k||_A   (19)

      u_{ki}^{(j+1)} = \left( \sum_{l=1}^{c} \left( d_{ki} / d_{li} \right)^{2/(b−1)} \right)^{-1},  1 ≤ i ≤ N, 1 ≤ k ≤ c.   (20)
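To make the update equations (18)–(20) concrete, the following Python sketch performs one FCM iteration with A taken as the identity matrix, as stated above; it is an illustrative implementation rather than the authors' code.

```python
import numpy as np

def fcm_iteration(X, U, b=2.0):
    """One FCM iteration: update centers via (18) and memberships via (19)-(20).
    X: (N, dim) samples; U: (c, N) membership matrix; b: weighting exponent."""
    Ub = U ** b
    # (18): cluster centers as membership-weighted means of the samples.
    V = (Ub @ X) / Ub.sum(axis=1, keepdims=True)               # (c, dim)
    # (19): Euclidean distances d_ki = ||x_i - v_k|| (A = identity).
    D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # (c, N)
    D = np.maximum(D, 1e-12)
    # (20): u_ki = ( sum_l (d_ki / d_li)^(2/(b-1)) )^(-1)
    ratio = (D[:, None, :] / D[None, :, :]) ** (2.0 / (b - 1.0))  # (c, c, N)
    U_new = 1.0 / ratio.sum(axis=1)
    return V, U_new
```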

5:    if ||U^{(j+1)} − U^{(j)}|| < μ then stop the iteration;

An episode terminates when the absolute value of the pendulum angle exceeds 12° or the number of simulation steps is greater than 30. When testing the performance of the learned policy, the maximum number of steps was set to 10 000. The performance was evaluated by the median over 50 independent learning runs. In order to demonstrate the performance of CRPI, four different numbers of clusters, 50, 80, 100, and 120, were tested with different values of l and k. Figs. 7 and 8 show the


performance comparisons between CRPI and RPI in the inverted pendulum domain under two groups of parameters: l = 5, k = 25 and l = 10, k = 10, respectively. The normalized Laplacian operator was selected. The effects of K or c in CRPI and of the distance threshold in the trajectory-based subsampling method for the RPI algorithm were evaluated. By selecting K or c from 50 to 120, the learning curves obtained by CRPI all converged after 10 000 steps, whereas the balancing steps obtained by the RPI algorithm had large variances. In addition, different values of the threshold also affect the performance of RPI, which means that human experience is essential for the tuning of RPI. The simulation results show that the proposed CRPI approach converges to a near-optimal policy after 50 or 100 episodes with the same size of subset samples used for graph construction. However, the trajectory-based RPI algorithm could hardly balance the inverted pendulum steadily except when the threshold was set to 0.02. In addition, when using the trajectory-based method, the subset size also increases with the number of episodes, which makes graph construction more computationally expensive.

The control performance of the policies learned by the CRPI and RPI algorithms was also tested on a real inverted pendulum system. In the experiment, the real inverted pendulum system depicted in Fig. 9 was used for the performance test. In this system, the control step is set to 0.05 s. The state contains four elements: the angle and angular velocity of the pendulum, and the position and velocity of the cart. A reward of 0 is given as long as the angle of the pendulum does not exceed π/15 in absolute value and the position of the cart does not exceed 0.4 in absolute value; otherwise, the reward is −1. The discount factor is set to 0.95. The control policies were learned by the CRPI and RPI algorithms from simulated data. Training samples were collected by using a simulated model of the real system controlled by an initial random policy. Each episode starts from a randomly perturbed state close to the equilibrium state (0, 0, 0, 0) and follows a policy that selects actions at random. The number of nearest neighbors was set to 10 and the number of PVFs was set to 30. In the experiments, the K-means-based subsampling method was used, in which K was set to 400. After learning, the final policy was applied to the actual pendulum control system, as shown in Fig. 10. The angle of the pendulum and the position of the cart both varied in small ranges because of the discrete actions, as depicted in Fig. 10. However, the real-time control performance of RPI is worse than that of CRPI in terms of the variance of the balanced position or angle. The experimental results indicate that CRPI is more effective than RPI in the inverted pendulum problem.

C. Learning Control of a Real Mobile Robot

Motion control is a fundamental problem in autonomous mobile robots. Due to uncertain dynamics and external disturbances, optimal motion control of mobile robots has been a challenge in robotics and control engineering. In the following, the proposed CRPI algorithm is applied and tested in path tracking control of a real wheeled mobile robot.


Fig. 10. Real-time inverted pendulum control module and performance comparisons between CRPI and RPI. (a) Angle of the inverted pendulum. (b) Cart position.

Fig. 11. Learning-based robot motion control system based on CRPI and RPI.

In the experiments, a P3-AT wheeled mobile robot system is used. The robot is differentially driven and the control commands are in terms of forward velocity and angular velocity. A built-in controller transforms the commands into a differential control law of the wheels [11]. By using high-resolution optical encoders mounted on the driven wheels, the robot position and orientation are measured as state information. Fig. 11 shows that approximate policy iteration (API) algorithms, such as CRPI or RPI, can be used to optimize the proportional-derivative (PD) control parameters based on sampled trajectories. The aim of motion control is to track a given path with the smallest tracking errors. The following learning-based PD control law is used:

\omega_d(t) = k_p(t) e(t) + k_d(t) \dot{e}(t)   (34)

where e(t) = y_d(t) − y(t), with y_d(t) and y(t) denoting the expected and actual positions along the vertical axis, respectively, and k_p(t) and k_d(t)



are variable PD parameters that are adaptively selected by the API algorithms. A global coordinate frame (x, y, θ) of the robot is defined in the experiments and the initial position of the robot is set near the origin (x, y) = (0, 0), where the initial orientation is randomly selected [11]. The desired path is y = 1 m and the absorbing states are defined as

x_T ∈ \{(y, θ) : |y − 1| < 0.005, |θ| < 2°\}.   (35)

In order to focus on the lateral control problem, the longitudinal velocity of the robot is set as a constant v = 0.2 m/s both in simulation and in real-time control. In the experiments, the state of the robot is described by the position and orientation information. Thus, the state of the MDP is defined as x = (y, θ). The candidate actions of the learning controller are defined as a set of preselected PD parameters a(t) ∈ {(k_{p1}, k_{d1}), (k_{p2}, k_{d2}), . . . , (k_{pn}, k_{dn})}. The aim of the learning controller is to compute a time-optimal policy for the robot path tracking problem. The reward function is given as follows:

r_t = 0, if |y_t − 1| < 0.005 and |θ_t| < 2°;  r_t = −1, otherwise.

To simplify sample collection, the simulation software of the P3-AT robot is used to generate sample trajectories. To emulate disturbances in real systems, random noise is added to the execution of the robot control commands. The initial state (x, y, θ) of an episode for sample collection is uniformly generated in the interval [0.0, 1.0] × [−1.0, 6.0] × [−π/18, 10π/18]. The maximum number of state transition steps in an episode is 10. A random policy is used during sample collection to select the PD parameters [k_p(t), k_d(t)] from three candidate combinations: a ∈ {[27, 117], [10.8, 126], [9, 135]}. The candidate PD parameters were manually selected to improve stability. The sampling time for control is 0.05 s and 23 550 samples were collected for approximate policy iteration.

In order to evaluate the performance of CRPI and RPI with the same size of sample subsets, the number of basis functions and the number of nearest neighbors were set to 30 and 20, respectively. Let Z denote the size of the subsets used for constructing the graph Laplacian. By setting the parameters of the subsampling methods, two subsets with different sizes were obtained for basis function construction in RPI and CRPI. These two subsets have Z1 = 342 and Z2 = 692 samples, respectively.

The final near-optimal policies of CRPI and RPI were evaluated both in simulation and on the real robot platform. The initial configuration of the robot is set as (x, y, θ) = (0, 0, 0). The path-tracking trajectories of CRPI and RPI are shown in Fig. 12. Curves 1 and 3 show the simulated path tracking results using the final policies obtained by RPI with subset sizes Z1 = 342 and Z2 = 692, respectively. Curves 2 and 4 show the path tracking results using the final policies obtained by CRPI with subset sizes Z1 = 342 and Z2 = 692, respectively. For comparison, curve 5 shows the best performance obtained by conventional PD control using one of the three candidate PD parameters.
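The task setup above (the reward signal, the absorbing region around the desired path, and the discrete choice among candidate PD gains) can be summarized in a short Python sketch; the function and constant names are illustrative, not taken from the paper.

```python
import math

# Candidate PD gain pairs (k_p, k_d) described above; the learning
# controller's discrete action is the index of the pair to use.
CANDIDATE_PD = [(27.0, 117.0), (10.8, 126.0), (9.0, 135.0)]

def reward(y, theta):
    """Reward for the path tracking task: 0 inside the absorbing region
    around the desired path y = 1 m, and -1 otherwise."""
    if abs(y - 1.0) < 0.005 and abs(theta) < math.radians(2.0):
        return 0.0
    return -1.0

def pd_command(action_index, e, e_dot):
    """Angular-velocity command of the learning-based PD law (34),
    given the tracking error e(t) and its derivative."""
    k_p, k_d = CANDIDATE_PD[action_index]
    return k_p * e + k_d * e_dot
```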

Fig. 12. Path tracking experiments of P3-AT.

It can be seen that the near-optimal policy of CRPI can make the robot reach the absorbing state in a much shorter time, which is better than both RPI and conventional PD control. Curve 6 corresponds to the tracking result of the real mobile robot using the final control policy obtained by the CRPI algorithm. The above simulation and experimental results illustrate that the final policies obtained by CRPI perform better than those obtained by the previous RPI.

V. CONCLUSION

This paper presents a new clustering-based graph Laplacian framework for feature representation and VFA in RL. The clustering-based graph Laplacian is integrated with RPI for RL in MDPs with continuous state spaces, where the CRPI algorithm is proposed. One advantage of CRPI is that smooth basis functions can be learned for VFA in MDPs with continuous state spaces. The performance of CRPI is compared with that of RPI both theoretically and experimentally. It is shown that the graph constructed based on the centers of all clusters can more accurately represent the underlying manifold of the state space. Therefore, the basis functions constructed by CRPI can be expected to approximate the value functions in continuous state spaces with higher accuracy. Simulation and experimental results show that the proposed clustering-based graph Laplacian method needs fewer sample points to compute an efficient set of basis functions and that the performance of CRPI is better than that of RPI. More rigorous analysis and the extension of the proposed framework to VFA in continuous action spaces are subjects of our ongoing work.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their constructive comments and suggestions.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[2] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York, NY, USA: Wiley, 2007.
[3] L. Busoniu, R. Babuska, B. Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL, USA: CRC Press, 2010.
[4] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, May 2009.


[5] X. Xu, D. Hu, and X. Lu, “Kernel-based least squares policy iteration for reinforcement learning,” IEEE Trans. Neural Netw., vol. 18, no. 4, pp. 973–992, Jul. 2007. [6] M. Wiering and H. van Hasselt, “Ensemble algorithms in reinforcement learning,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 4, pp. 930–936, Aug. 2008. [7] P. L. Bartlett and J. Baxter, “Infinite-horizon policy-gradient estimation,” J. Artif. Intell. Res., vol. 12, pp. 319–350, 2001. [8] M. Sugiyama, H. Hachiya, C. Towell, and S. Vijayakumar, “Value function approximation on non-linear manifolds for robot motor control,” in Proc. IEEE Int. Conf. Robot. Automat., Apr. 2007, pp. 1733–1740. [9] J. Boyan, “Technical update: Least-squares temporal difference learning,” Mach. Learn., vol. 49, nos. 2–3, pp. 233–246, 2002. [10] T. Mori and S. Ishii, “Incremental state aggregation for value function estimation in reinforcement learning,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 5, pp. 1407–1416, Oct. 2011. [11] X. Xu, C. Liu, S. Yang, and D. Hu, “Hierarchical approximate policy iteration with binary-tree state space decomposition,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 1863–1877, Dec. 2011. [12] M. G. Lagoudakis and R. Parr, “Least-squares policy iteration,” J. Mach. Learn. Res., vol. 4, pp. 1107–1149, Dec. 2003. [13] I. Grondman, M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema, “Efficient model learning methods for actor-critic control,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 3, pp. 591–602, Jun. 2012. [14] F. L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, Jul./Sep. 2009. [15] D. Liu, H. Javaherian, O. Kovalenko, and T. Huang, “Adaptive critic learning techniques for engine torque and air-fuel ratio control,” IEEE Trans. Syst., Man, Cybern., Part B, Cybern., vol. 38, no. 4, pp. 988–993, Aug. 2008. [16] H. Zhang, L. Cui, X. Zhang, and Y. Luo, “Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method,” IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226–2236, Dec. 2011. [17] D. Prokhorov and D. Wunsch, “Adaptive critic designs,” IEEE Trans. Neural Netw., vol. 8, no. 6, pp. 1563–1573, Jun. 1997. [18] R. Padhi, N. Unnikrishnan, and S. Balakrishnan, “Optimal control synthesis of a class of nonlinear systems using single network adaptive critics,” in Proc. Am. Contr. Conf., vol. 2. 2004, pp. 1592–1597. [19] A. Al-Tamimi and F. Lewis, “Discrete-time nonlinear hjb solution using approximate dynamic programming: Convergence proof,” in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinforcement Learn., Apr. 2007, pp. 38–43. [20] R. S. Sutton, “Generalization in reinforcement learning: Successful examples using sparse coarse coding,” in Proc. Adv. Neural Inform. Process. Syst., vol. 8. 1996, pp. 1038–1044. [21] S. Mahadevan, “Proto-value functions: Developmental reinforcement learning,” in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 553–560. [22] S. Mahadevan and M. Maggioni, “Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes,” J. Mach. Learn. Res., vol. 8, pp. 2169–2231, Oct. 2007. [23] J. V. de Oliveira and W. Pedrycz, Advances in Fuzzy Clustering and Its Applications. New York, NY, USA: Wiley Online Library, 2007. [24] V. Cherkassky and F. M. Mulier, Learning from Data: Concepts, Theory, and Methods. 
New York, NY, USA: Wiley, 2007. [25] W. Pedrycz, V. Loia, and S. Senatore, “Fuzzy clustering with viewpoints,” IEEE Trans. Fuzzy Syst., vol. 18, no. 2, pp. 274–284, Apr. 2010. [26] W. P. D. Graves and J. Noppen, “Clustering with proximity knowledge and relational knowledge,” Pattern Recog., vol. 45, no. 7, pp. 2633–2644, 2012. [27] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: The fuzzy c-means clustering algorithm,” Comput. Geosci., vol. 10, no. 2, pp. 191–203, 1984. [28] H. Zhang, Y. Luo, and D. Liu, “Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints,” IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1490–1503, Sep. 2009. [29] D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin, “Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming,” IEEE Trans. Automat. Sci. Eng., vol. 9, no. 3, pp. 628–634, Mar. 2012. [30] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, “Natural actor-critic algorithms,” Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.


[31] D. Ormoneit and Ś. Sen, "Kernel-based reinforcement learning," Mach. Learn., vol. 49, nos. 2–3, pp. 161–178, 2002.
[32] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: The Gaussian process approach to temporal difference learning," in Proc. ICML, 2003, pp. 154–161.
[33] T. G. Dietterich and X. Wang, "Batch value function approximation via support vectors," in Proc. NIPS, 2001, pp. 1491–1498.
[34] C. E. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," in Proc. NIPS, 2003, pp. 751–759.
[35] X. Xu, Z. Hou, C. Lian, and H. He, "Online learning control using adaptive critic designs with sparse kernel machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 762–775, May 2013.
[36] S. Rosenberg, The Laplacian on a Riemannian Manifold. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[37] F. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, no. 92. Providence, RI, USA: American Mathematical Society, 1997.
[38] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," J. Mach. Learn. Res., vol. 6, pp. 2153–2175, Dec. 2005.
[39] M.-A. Belabbas and P. J. Wolfe, "Spectral methods in machine learning and new strategies for very large datasets," Proc. Nat. Acad. Sci., vol. 106, no. 2, pp. 369–374, 2009.
[40] D. E. Crabtree and E. V. Haynsworth, "An identity for the Schur complement of a matrix," Proc. Am. Math. Soc., vol. 22, no. 2, pp. 364–366, 1969.

Xin Xu (M'07–SM'12) received the B.S. degree in electrical engineering from the Department of Automatic Control and the Ph.D. degree in control science and engineering from the College of Mechatronics and Automation, National University of Defense Technology (NUDT), Changsha, China, in 1996 and 2002, respectively. He is currently a Full Professor with the College of Mechatronics and Automation, NUDT. He has co-authored more than 100 papers in international journals and conferences, and four books. His current research interests include reinforcement learning, approximate dynamic programming, machine learning, robotics, and autonomous vehicles. Dr. Xu was a recipient of the 2nd Class National Natural Science Award of China in 2012. He is currently an Associate Editor of the Information Sciences Journal and a Guest Editor of the International Journal of Adaptive Control and Signal Processing. He is a Technical Committee Member of the IEEE on Approximate Dynamic Programming and Reinforcement Learning and on Robot Learning.

Zhenhua Huang received the master's degree from the College of Mechatronics and Automation (CMA), National University of Defense Technology (NUDT), Changsha, China, in 2010. He is currently pursuing the Ph.D. degree at the Institute of Unmanned Systems, CMA, NUDT. His current research interests include machine learning, robot control, and autonomous land vehicles.

Daniel Graves (M’12) received the B.Sc. degree in computing science from Thompson Rivers University, Kamloops, BC, Canada, in 2006, and the Ph.D. degree from the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada, in 2011. He is currently involved in machine vision and pattern recognition applications at Intelligent Imaging Systems, Edmonton, AB, Canada. He was a Software Developer with Raumfahrt Systemtechnik, Salem, Germany, where he was involved in synthetic aperture radar signal processing applications. He is a co-inventor of a commercial product called ClassTalk. His current research interests include computational intelligence, image/audio/video processing, pattern recognition, machine vision, data analytics, fuzzy sets, neural networks, and acoustics. Dr. Graves was the recipient of the Hetu Prize from the Canadian Acoustical Association for undergraduate research in acoustical signal processing.


Witold Pedrycz (F'13) is a Professor and Canada Research Chair of computational intelligence with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is also with the Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland. He holds an appointment of special professorship with the School of Computer Science, University of Nottingham, Nottingham, U.K. He is an author of 15 research monographs covering various aspects of computational intelligence, data mining, and software engineering. His current research interests include computational intelligence, fuzzy modeling and granular computing, knowledge discovery and data mining, fuzzy control, pattern recognition, knowledge-based neural networks, relational computing,


and software engineering. He has published numerous papers in these areas. Dr. Pedrycz was the recipient of the prestigious Norbert Wiener Award from the IEEE Systems, Man, and Cybernetics Council, the IEEE Canada Computer Engineering Medal, the Cajastur Prize for pioneering and multifaceted contributions to granular computing from the European Centre for Soft Computing, and the Killam Prize in 2007, 2008, 2009, and 2013, respectively. He was elected a Foreign Member of the Polish Academy of Sciences in 2009. In 2012, he was elected a fellow of the Royal Society of Canada. He has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He has been intensively involved in editorial activities. He is the Editor-in-Chief of Information Sciences and WIREs Data Mining and Knowledge Discovery (Wiley). He is currently an Associate Editor of the IEEE TRANSACTIONS ON FUZZY SYSTEMS and an editorial board member of several other international journals.
