
Bioinspir. Biomim. 9 (2014) 016015 (16pp)    doi:10.1088/1748-3182/9/1/016015

Reaching control of a full-torso, modelled musculoskeletal robot using muscle synergies emergent under reinforcement learning

A Diamond and O E Holland

Department of Engineering and Informatics, University of Sussex, Brighton, UK

E-mail: [email protected]

Received 5 August 2013, revised 13 December 2013
Accepted for publication 13 January 2014
Published 12 February 2014

Abstract

‘Anthropomimetic’ robots mimic both human morphology and internal structure—skeleton, muscles, compliance and high redundancy—thus presenting a formidable challenge to conventional control. Here we derive a novel controller for this class of robot which learns effective reaching actions through the sustained activation of weighted muscle synergies, an approach which draws upon compelling, recent evidence from animal and human studies, but is almost unexplored to date in the musculoskeletal robot literature. Since the effective synergy patterns for a given robot will be unknown, we derive a reinforcement-learning approach intended to allow their emergence, in particular those patterns aiding linearization of control. Using an extensive physics-based model of the anthropomimetic ECCERobot, we find that effective reaching actions can be learned comprising only two sequential motor co-activation patterns, each controlled by just a single common driving signal. Factor analysis shows the emergent muscle co-activations can be largely reconstructed using weighted combinations of only 13 common fragments. Testing these ‘candidate’ synergies as drivable units, the same controller now learns the reaching task both faster and better.

Keywords: musculoskeletal, biomimetic, robotics, muscle synergies, reinforcement learning

Online supplementary data available from stacks.iop.org/BB/9/016015/mmedia

(Some figures may appear in colour only in the online journal)

1. Introduction

1.1. Background

Introducing robots into human environments requires them to handle settings designed specifically for human size and morphology. However, large conventional humanoid robots, designed for ease of control with stiff, high-powered joint actuators, pose a significant danger to humans. By contrast, ‘anthropomimetic’ humanoids—mimicking both morphology and internal structure—are far safer, yet their resultant compliant structure presents a formidable challenge to conventional control. Here we seek to address characteristic control issues of this class of robot, such as the ECCERobot (figure 1), whilst exploiting their biomimetic nature by drawing upon biological motor control research. In particular, we consider an approach almost unexplored to date in the musculoskeletal robot literature, namely, muscle synergies. We define a synergy here as a precise, stable and distinct muscle activation pattern distributed between its participant muscles and driven as a single unit by a control signal. This approach offers the potential for significant dimensionality reduction and is supported by an increasing and compelling evidence base suggesting that this mechanism is widely employed in the motor systems of animals, including humans.


Figure 1. The ‘anthropomimetic’ ECCERobot and biomechanical details. The ECCERobot torso has a skeleton of ‘bones’, hand-moulded from polycaprolactone, connected by flexible joints of up to 6 DOF using elastic shockcord to imitate ligaments. With nearly 100 DOF, construction closely follows Gray’s Anatomy and includes floating shoulder blades hung from collar bones and dislocateable ball joints in the shoulders. A flexible spine with individual vertebrae and deformable foam discs means that, as for a human, it cannot stay upright without tensed muscles. 50+ ‘muscles’ each comprise a length of thin inelastic ‘kiteline’ cable attached to the bones via sections of elastic shockcord that add compliance. Muscles are tensed by driving individual high-torque DC motors to wind the kiteline cable onto a spindle via one or more pulleys.

1.2. Evidence of muscle synergies in biology

Numerous biological studies now strongly suggest that effective control of seemingly highly complex structures such as the bodies of frogs, cats or humans is achieved largely through advantageous, co-evolved natural dynamics combined with a small set of relatively simple signals, each activating a selection of precise muscle groupings (synergies) [1–13]. Studies fall into two broad categories: empirical and modelled. The former encompasses those studies which have uncovered empirical evidence of an underlying, synergy-based construction of muscle signals through the use of electromyographic (EMG) measurements followed by component or factor analysis (FA) to extract commonality. A good example is the study by Cheung et al [7], which elegantly demonstrates that the seemingly complex muscle EMG signals captured during reaching can be accurately reconstructed from the combination of just a few fixed (time-invariant) muscle synergy patterns, if each is driven by a distinct, time-dependent activating waveform. This is supported by other studies [11, 12] where combinations of just five synergies, extracted during fast reaching movements, were found to explain around 75% of the signal data if appropriately scaled in amplitude and shifted in time. The same synergy patterns were reproduced across different loads, postures or directions and were also modulated to correct movements when a target location was changed [13]. The latter category comprises model-based work which demonstrates synergy-based control effectively on either generic or biologically accurate muscle models of the limbs of frogs or humans [2, 10, 14]. The synergy patterns employed were either taken from biological measurements [10], approximated [14] or extracted from the model itself using a ‘balanced truncation’ technique [2]. However, no studies exist to date on the potential application of this approach to musculoskeletal robots.

1.3. Selecting a learning strategy to acquire synergies

We therefore seek to test the theory, arising from this biological evidence, that a synergy-based control approach for a musculoskeletal, biomimetic structure (i.e. driving combinations of distinct fixed-weighting muscle co-activation patterns with simple, sustained signals) offers sufficient dimensionality reduction to allow relatively elementary search and learning techniques to be effective. Of these potential techniques, we choose to trial elementary reinforcement learning (RL), as it affords the following advantages. Firstly, its bio-inspired nature: RL-like mechanisms for motor control of the body are well indicated in the CNS through the agent of dopamine [15]. Secondly, its ‘action discovery’ focus, exploiting amenable natural dynamics and the morphological computation potential of the complete biomimetic body structure. This becomes possible via learning that is driven by the goal of maximizing overall reward, in contrast to the conventional classical control approach of using a mathematically tractable model (shown to be unobtainable in this case; see [16]) to calculate pre-planned, tightly controlled trajectories in state space. Finally, the form of RL employed, an elementary variant of Q-learning, was favoured over newer and faster-converging variants of high-dimensional RL algorithms (e.g. [17, 18]) in order to demonstrate that the large cut in dimensionality afforded by reducing control signals to sustained combinations of specific synergy patterns brings effective control of such structures within range of such elementary algorithms. Put another way, if effective control was acquired, we did not wish to assign credit to an advanced, refined algorithm but rather to the shift to the sustained synergy paradigm.

2. Method

2.1. Approach overview

We focus on the potential for control of biomimetic structures, such as the ECCERobot, through the use of simple sustained co-activations of muscles, combined in simple weightings and driven by a shared activation signal. To achieve this, we employ a simple standard RL cycle for acquiring maximum reward over time through an iterative series of trials and policy improvements [19]. Both the muscle co-activation pattern and the form of the signal are selected by a policy function driven by the problem state, which comprises the set of environmental and robot state variables intended to describe the control problem sufficiently for its solution; for example, the relative position and posture of the robot with respect to a target object to be reached. We choose this deliberately simple approach in order to test to what extent a realistic control task can be addressed by the specific means of locating an effective set of cooperating muscle co-activation patterns acting on the amenable natural dynamics of the biomimetic structure, so as to minimize the introduction of complexity to the controller.

Figure 2. Schematic illustration of the reinforcement learning approach, which seeks to tune a policy function through a search in action parameter space, explicitly favouring the emergence and dominance of actions (comprising driven muscle co-activation patterns) that, when used in linear weighted combination with other actions (a radial basis function), prove to be effective contributors to estimated new actions targeting the next presented problem state.

Figure 3. Anatomy of an action: driving a co-activation pattern of muscle motors to cause a movement by the body. A co-activation of muscle motors M_1, ..., M_n comprises a weighting pattern w_1, ..., w_n where −1 < w < +1. Note that a negative weighting implies that the motor is driven in reverse to unwind its muscle cable; for example, this may cause a raised arm to be lowered.

Figure 2 offers a figurative illustration of the proposed learning mechanism. The task of the policy is to generate, per trial, a single net action intended to address the problem state by triggering a sustained movement lasting a period of time in the order of seconds, not milliseconds, whilst accumulating as much reward as possible as it does so, in an amount governed by a reward function. Figures 3 and 4 illustrate the composition of actions selected by the policy function.

The growing set of information from completed trials that is retained for the RL policy to draw upon is structured conventionally [19] into pairings comprising the problem state presented and the resultant action selected (a ‘state-action’). These are retained alongside the reward accrued in the trialling of the action, forming a stored state-action-reward (SAR) combination [19]. As the problem state space is large, continuous and potentially high dimensional, there will be no previous action that will have addressed a given randomized sample. The policy must therefore estimate a new action drawing upon both the sampled problem state and the set of past (now stored) state-actions. It achieves this using a radial basis function driven by the state-space proximity of the sampled state to the set of stored states, weight-biased by the value Q of their associated stored actions (the average of the reward R they have accrued).
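For concreteness, the SAR store and the proximity-and-value-weighted estimation of a new action described above can be sketched as follows. This is our own illustrative Python rendering, not the implementation used in this work; all names, types and the flat action vector are hypothetical, and the quantities p, Q and ψ are formalized in section 2.3.

```python
# Minimal illustrative sketch (not the authors' code) of the stored
# state-action-reward (SAR) memory and the proximity/value-weighted estimation
# of a new action described above; all names and types are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class SAR:
    state: np.ndarray        # problem state, e.g. target position (x, y, z)
    action: np.ndarray       # flat vector of action parameters (weights, gain, waypoints)
    reward_sum: float = 0.0  # total reward credited to this SAR so far
    weight_sum: float = 0.0  # total contribution weighting it has received so far

    @property
    def q(self) -> float:
        # value Q: average reward per unit of contribution (formalized in eq. (6))
        return self.reward_sum / self.weight_sum if self.weight_sum > 0 else 0.0

def estimate_action(store, state, d_max):
    """Blend stored actions, weighted by state proximity biased by value Q."""
    p = np.array([max(0.0, 1.0 - np.linalg.norm(s.state - state) / d_max) for s in store])
    raw = p * np.array([s.q for s in store])
    w = raw / raw.sum() if raw.sum() > 0 else np.full(len(store), 1.0 / len(store))
    new_action = sum(w_i * s.action for w_i, s in zip(w, store))
    return new_action, w  # w is later used to credit each contributor's reward
```

After a trial, the accrued reward would be credited back to each contributor in proportion to its entry in w, as described above and formalized in section 2.3.4.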

Recall that our overarching aim is to locate a limited set of stored muscle co-activation patterns that are effective—specifically in linear weighted combination—in addressing a sufficiently large region of problem state space. Therefore, although a new state-action is created, trialled, rewarded and stored as a SAR, those stored actions that contributed to its construction are also commensurately rewarded, according to both the trial outcome (reward obtained) and, importantly, the size of their contribution. The approach thus closely resembles the established RL technique of eligibility traces, often used to commensurately strengthen earlier actions in a temporal sequence leading to reward [19]. By limiting the overall storage capacity (through pruning of the least valuable), the stored SARs compete to be retained according to their ability to contribute effectively to new actions.

Whilst there are clearly important balances to be achieved in the weighting and reward functions, it is also important to achieve an effective exploration–exploitation trade-off. As our problem state space is continuous, no two problem states are ever identical; this generates a form of ‘noise’-driven exploration, since the state-proximity function will generate, from the set of stored SARs, weighted contributions that vary randomly. In addition, since new actions are created through combination alone, after trialling (but before storage) the new action is mutated by a small degree of exploratory parameter creep (Gaussian-based, s.d. 5%) to encourage exploration. Finally, in order to significantly boost the learning rate, the mutated action is reappraised by attempting to estimate a better problem state to pair it with. For example, in a reaching task the original action’s trial may reveal that it is actually most effective at reaching to a different location. It is therefore re-trialled against this revised criterion, obtaining a correspondingly larger reward. Only now is the resultant new SAR finally added to the stored set. Note that stored SARs are not repeatedly reassessed in this way: a prospective SAR is assessed on its own only at the point it is created; after this it is only used to make weighted contributions to generate new actions when new states are presented.

2.2. Experiment design

We test the described approach on the task of reaching to a target object placed at successive random locations. The control subject is an extensive physics-based, reverse-engineered model of the ECCERobot developed in the impulse-based Bullet Physics engine, running in real time with 36 drivable muscles and 58 degrees of freedom (figure 5). Features include modelled elastic muscle cables, motors, gearboxes, pulleys, joint friction and the wrapping of body parts by muscles—an essential element of a biomimetic musculoskeletal humanoid. See [20, 21] for modelling details.

Figure 5. Complete model of the biomimetic ECCERobot implemented in the Bullet Physics engine. The detail (right) shows how the floating shoulder blade and arm hang from the collarbone (red constraint), with the scapula held in place by wrapping muscle cables (orange).

2.2.1. Problem state. The model robot’s pelvis is anchored to a static, immoveable base. Each trial commences with a model reset to a ‘ready position’ such that the robot is held upright under pre-tensioned torso and back muscles, with arms at its sides. To simplify the initial learning task we employ this same starting position on each trial, the problem state thus reducing to solely the randomly generated position (x, y, z) of the target object. The state-proximity function employed in weighting contributions from past actions is correspondingly reduced to a linear function of the scalar distance d from the nearest hand to the new target location.

2.2.2. Target object and placement. The modelled target object comprises a plastic bottle of mass 200 g. For each reaching trial the bottle model is placed into the physics scene, balanced on a minimal static base, in front of the robot model at a random location within a limiting spherical zone. The experimental setup is illustrated in figure 6.

Figure 6. Side and top views of the reaching experiment: for each trial, the centre of mass of the bottle model is placed at the target location (white cross), which is generated at random per trial within the zone denoted by the red lines. The green sphere centred on the white cross indicates the extended zone for obtaining reward by proximity of the hand to the target.

2.2.3. Action definition for a reaching movement. An overarching aim of the project is to demonstrate a simple reach and grasp of an object. We therefore define the reaching task as moving either hand from its constant start location to the close vicinity of the target, slowing or stopping the hand in that vicinity so as to potentially enable a successful grasp. As the model hand lacks actuation, we do not currently extend to attempting an actual grasp. We begin with the simple assumption that such a compound reach/grasp movement can be achieved by only two sustained muscle co-activation stages: the first co-activation to generate a movement of the hand towards the target vicinity, and a second to slow or hold the hand in that vicinity. We place no explicit stipulation on the roles of either stage, however, beyond extending the reward function to favour both reaching to touch (rather than to strike) the target and also holding the hand in its vicinity. It may thus emerge that, in some subset of target positions, the second stage takes a different role, perhaps acting as a direction-correction mechanism in the case of movements to regions where a single co-activation is insufficient to generate an accurate trajectory. To estimate a new compound action from a weighted combination of past compound actions, we combine the actions of each of the two co-activation stages individually, according to the combination function (equation (3) below). Note that, in this initial experiment, hand selection is not part of the learning and the nearest hand to the target is always selected; however, muscles from both sides of the body can form part of any given co-activation pattern.
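As a purely illustrative rendering of the two-stage compound action just described (the containers and parameter layout below are our assumptions, not the authors' implementation), each stage can pair a co-activation weight vector with its own driving-signal parameters, and past compound actions can be blended stage by stage:

```python
# Hypothetical rendering of the two-stage compound reaching action of
# section 2.2.3; the parameter layout is assumed, not taken from the paper.
from dataclasses import dataclass
import numpy as np

@dataclass
class Stage:
    weights: np.ndarray    # per-muscle co-activation weights, each in [-1, +1]
    gain: float            # positive gain g applied to the driving signal
    duration: float        # driving-signal duration T, in seconds
    waypoints: np.ndarray  # 4 x 2 array of (relative time t_k, voltage level v_k)

@dataclass
class CompoundAction:
    reach: Stage  # first co-activation: move the hand towards the target vicinity
    hold: Stage   # second co-activation: slow, hold or correct near the target

def blend(actions, w):
    """Blend past compound actions stage by stage with normalized weights w."""
    def mix(stages):
        return Stage(
            weights=sum(w_i * s.weights for w_i, s in zip(w, stages)),
            gain=float(sum(w_i * s.gain for w_i, s in zip(w, stages))),
            duration=float(sum(w_i * s.duration for w_i, s in zip(w, stages))),
            waypoints=sum(w_i * s.waypoints for w_i, s in zip(w, stages)),
        )
    return CompoundAction(reach=mix([a.reach for a in actions]),
                          hold=mix([a.hold for a in actions]))
```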

2.2.4. Initial set of stored actions. Two related approaches were employed to obtain an initial set of stored SARs before learning commenced. A first set of ten functional actions was selected from those developed by hand during initial feasibility testing, focusing on spanning a range of endpoints and trajectories and ensuring that all available muscles were represented. A further 20 actions were then generated as functional variations of the first set by applying gain alterations and time stretching. Particular emphasis was placed on generating actions that moved the hand nearer the outer reaches of the target placement zone. Every action was trialled and stored as an SAR by the same algorithm as the new actions subsequently produced by the policy.
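The gain-alteration and time-stretching variations might be generated along the following lines; this is a sketch under our own assumptions (the Stage/CompoundAction containers are the hypothetical ones sketched earlier, and the scaling ranges are placeholders not stated in the paper):

```python
# Hypothetical sketch: deriving functional variations of hand-designed seed
# actions by gain alteration and time stretching (section 2.2.4). The scaling
# ranges are placeholders; the paper does not state the values used.
import copy
import random

def make_variation(action, gain_scale=None, time_scale=None):
    """Copy a compound action, altering its gain and stretching its duration."""
    new = copy.deepcopy(action)
    for stage in (new.reach, new.hold):
        stage.gain *= gain_scale if gain_scale is not None else random.uniform(0.7, 1.3)
        stage.duration *= time_scale if time_scale is not None else random.uniform(0.8, 1.5)
        # waypoint times t_k are stored relative to T, so stretching T alone
        # preserves the waveform shape while slowing or speeding the movement
    return new

# e.g. 20 variations drawn from 10 hand-designed seed actions:
# variations = [make_variation(random.choice(seed_actions)) for _ in range(20)]
```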

2.3. Algorithm

We detail here the specific equations and parameter values employed in the learning algorithm outlined above.

2.3.1. Actions. In its raw form an action A resolves to a vector of 1, ..., n simulated voltage signals, where v_i(t) is applied to the ith of n modelled motors. As discussed above, for this controller the voltage signals are formed simply from a motor co-activation weighting pattern driven by a single shared driving signal (figure 3). A co-activation of n muscle motors is thus configured as a weighting pattern parameterized by weight values w_1, ..., w_n, where −1 < w < 1 and a negative weighting implies that the motor is driven in reverse to unwind its muscle cable; for example, this might cause a raised arm to be lowered.

The set of muscle motors is activated, as a single unit, by a driving signal m(t) whose waveform shape is set by six parameters: a duration T and a simple positive gain g, plus the locations of four waypoints in the voltage/time domain (see figure 4). We choose this number of waypoints as the minimum that can still describe a useful range of waveforms, from a single level or rising ramp to a nonlinear curve upwards or downwards. It also makes possible a period of zero level at the start or end, allowing co-activations to be potentially shifted in phase with respect to each other. Each of the (k = 4) waypoints is thus parameterized as a voltage level −1 < v_k < 1 applied to the motor, along with a time held as a relative fraction 0 < t_k < 1 of the signal duration T. The kth waypoint is thus set as the point [g v_k, t_k T].

Figure 4. The driving signal used to control the waveform of the motor input voltage is parameterized by a duration T and a simple positive gain g, plus the positions of four waypoints.

Finally, to avoid the unwanted high-frequency artefacts inherent in the raw waypoint-to-waypoint form of the driving signal, we employ a digital low-pass filtering function (LPF) to smooth out discontinuities before a final voltage signal reaches each motor. The final voltage value v_i(t) arriving at time t at the ith muscle motor is given by

v_i(t) = \varphi(g w_i m(t))    (1)

where m(t) is the raw waypoint-to-waypoint form of the driving signal and the LPF function y = \varphi(x) is defined as

y_j = y_{j-1} + \alpha (x_j - y_{j-1})    (2)

where y_j and x_j are respectively the filter output and input on the jth timestep of duration \Delta t, \alpha = \Delta t / (\tau + \Delta t), and the time constant \tau = 1 / (2\pi f), where f is the filter cut-off frequency in Hz.
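For illustration, equations (1) and (2) can be rendered as follows. This is our own sketch: linear interpolation between waypoints, zero levels at the signal end-points and the cut-off frequency are assumptions not specified here; the 100 ms sampling step matches the control granularity quoted in section 2.4.

```python
# Illustrative sketch of equations (1)-(2): a shared driving signal m(t) built
# from the four waypoints, distributed to the motors by the co-activation
# weights and smoothed by a first-order low-pass filter. Linear interpolation
# between waypoints, the zero end-points and the cut-off frequency are our
# assumptions.
import numpy as np

def driving_signal(T, waypoints, dt=0.1):
    """m(t) sampled every dt seconds; waypoints = [(t_k, v_k)] with t_k in (0,1), sorted."""
    t = np.arange(0.0, T, dt)
    tk = np.array([0.0] + [t_k * T for t_k, _ in waypoints] + [T])
    vk = np.array([0.0] + [v_k for _, v_k in waypoints] + [0.0])
    return t, np.interp(t, tk, vk)  # raw waypoint-to-waypoint waveform

def low_pass(x, dt=0.1, f_cut=2.0):
    """First-order LPF phi, eq. (2): y_j = y_{j-1} + alpha * (x_j - y_{j-1})."""
    tau = 1.0 / (2.0 * np.pi * f_cut)  # time constant from the cut-off frequency f
    alpha = dt / (tau + dt)
    y = np.zeros_like(x)
    for j in range(1, len(x)):
        y[j] = y[j - 1] + alpha * (x[j] - y[j - 1])
    return y

def motor_voltages(weights, g, T, waypoints, dt=0.1, f_cut=2.0):
    """Eq. (1): v_i(t) = phi(g * w_i * m(t)) for each of the n muscle motors."""
    t, m = driving_signal(T, waypoints, dt)
    return t, np.array([low_pass(g * w_i * m, dt, f_cut) for w_i in weights])
```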

2.3.2. Policy function. The policy constructs the nth new action A_n from a linear weighted sum of the stored actions:

A_n = \sum_{i=1}^{n-1} \omega_i A_i    (3)

where the weighting \omega_i placed on the ith stored action A_i is given by

\omega_i = \psi(p_i Q_i)    (4)

where p_i is a scalar measure of proximity between the new randomly selected problem state S and the state S_i attached to the stored action A_i, Q_i is the value (averaged reward) acquired by A_i, and \psi is a simple normalizing function that linearly rescales all p_i Q_i values proportionally between 0 and 1 whilst ensuring they always sum to 1.

2.3.3. Proximity function. As the problem state comprises only the target location, we simply make p_i decrease linearly with the absolute distance d_i in 3D space from the new target location to the target location [x_i, y_i, z_i] attached to the stored action A_i:

p_i = 1 - \frac{d_i}{d_{\max}}    (5)

where d_{\max} is the maximum distance by which two targets can be separated, corresponding to the diameter of the target placement sphere.

2.3.4. Action values. On the nth learning iteration, the value Q_{ni} of the ith stored action A_i is defined as the average reward over time awarded to A_i from its contributions to estimating new actions in the previous n−1 iterations:

Q_{ni} = \frac{\text{reward}}{\text{contribution}} = \frac{\sum_{j=0}^{n-1} R_{ji}}{\sum_{j=0}^{n-1} \omega_{ji}}    (6)

where \omega_{ji} is the contribution weighting that was assigned to the ith action on learning iteration j. This affords the opportunity to reformulate (6) recursively in terms of its previous action values, via the relationship expressed in (4):

Q_{ni} = \frac{\sum_{j=0}^{n-1} R_{ji}}{\sum_{j=0}^{n-1} \psi(p_{ji} Q_{ji})}.    (7)

2.3.5. Parameterizing new actions from weightings. The weighting \omega generated from (4) is applied individually to each parameter of the contributing stored actions to create a value for the new action; e.g. the new gain parameter g_n that will be applied to the new driving signal is calculated as

g_n = \sum_{i=1}^{n-1} \omega_i g_i.    (8)
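Equation (6) can be maintained incrementally after each trial by crediting every contributing SAR with its share of the trial reward in proportion to its weighting. The bookkeeping below continues the hypothetical SAR container sketched in section 2.1 and reflects our reading of the crediting scheme, not the authors' code:

```python
# Illustrative bookkeeping for eq. (6): after trialling a new action built with
# contribution weights w, each stored SAR accumulates its share of the trial
# reward in proportion to its weighting, so q = reward_sum / weight_sum remains
# the average reward per unit of contribution.
def credit_contributors(store, w, trial_reward):
    for s, w_i in zip(store, w):
        if w_i > 0.0:
            s.reward_sum += w_i * trial_reward  # R_ji: share of reward credited to action i
            s.weight_sum += w_i                 # omega_ji: contribution accumulated
```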

2.3.6. Reward function. The primary striking reward R_s is issued whenever a target strike is detected, but the amount is reduced in proportion to the hand speed v in order to favour a slowed hand at the target (to aid future grasping). Hand speeds above v_fast (set to 0.5 m s−1) are treated uniformly as ‘fast’. Thus:

R_s = \kappa \left( 2 - \frac{v}{v_{\text{fast}}} \right) \quad [v \le v_{\text{fast}}], \qquad R_s = \kappa \quad [v > v_{\text{fast}}]    (9)

where \kappa is a scaling parameter that sets the strike reward relative to the secondary zonal reward (see (10) below). To kick-start early learning towards the vicinity of the target, a secondary, zonal reward R_z is provided that is incremented on every timestep that the hand is within the reward zone about the target (see figure 6, green region):

R_z = 1 - \frac{d_t}{r} \quad [d_t \le r], \qquad R_z = 0 \quad [d_t > r]    (10)

and the total accumulated over a trial is

R_z = \sum_{t=1}^{T} R_z    (11)

where d_t is the distance from the hand centroid to the target centroid on timestep t, r is the radius of the spherical reward zone around the target (set to 20 cm) and T is the number of timesteps for which the hand remains within the reward zone.
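A compact rendering of equations (9)–(11) is given below; it is illustrative only, with the constants taken from the values quoted in the text and the function boundaries chosen by us:

```python
# Illustrative sketch of the reward terms, eqs (9)-(11). v_fast and r follow the
# values quoted in the text; the framing as two helper functions is our own.
V_FAST = 0.5   # m/s: hand speeds above this are treated uniformly as 'fast'
R_ZONE = 0.20  # m: radius of the spherical reward zone around the target

def strike_reward(hand_speed, kappa):
    """Eq. (9): the slower the hand at the moment of striking, the larger the reward."""
    if hand_speed <= V_FAST:
        return kappa * (2.0 - hand_speed / V_FAST)
    return kappa

def zonal_reward(hand_target_distances):
    """Eqs (10)-(11): accumulate 1 - d_t / r over every timestep spent inside the zone."""
    return sum(1.0 - d / R_ZONE for d in hand_target_distances if d <= R_ZONE)
```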

2.3.7. Policy update. As discussed, before adding a new action to the policy, and to encourage exploration beyond combination, the parameters of the new action are mutated by a small degree. For example, the new gain parameter g_n is recalculated as

g_n = (1 + k n_g) \sum_{i=1}^{n-1} \omega_i g_i.    (12)

Artificial Gaussian noise n_g is generated for each parameter and clipped to the range −1 ≤ n_g ≤ 1; k ≈ 0.05 is a tuning parameter scaling the maximum mutation effect to approximately 5%.

2.4. Trials

A maximum of N = 100 SARs were retained by the policy, pruned by lowest Q value. The granularity of the control signals and the state assessment functions was set at a timestep of 100 ms. The timestep of the physics model was, however, set at 3 ms for the best performance-versus-stability trade-off; the model can run stably at close to real time. Each target presentation trial was limited to 3 s, unless the target was struck earlier. Four long extended learning trials were undertaken in which the main learning cycle was set to iterate continuously while the reward issued was monitored. On each it was found that learning (as judged by the reward distribution pattern) plateaued in the region of 800 presentations; trials were therefore curtailed at 1000.
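The per-parameter mutation of equation (12) and the capacity-limited pruning described above might look like the following sketch; beyond the stated values of k and N, the details (noise generator, sort-based pruning) are our assumptions:

```python
# Illustrative sketch of the policy update: Gaussian parameter creep, eq. (12),
# applied to every parameter of a newly combined action, and pruning of the SAR
# store back to its capacity by discarding the lowest-valued entries.
import numpy as np

K_MUTATE = 0.05  # k: scales the maximum mutation effect to approximately 5%
MAX_SARS = 100   # N: maximum number of SARs retained by the policy

def mutate(params, rng):
    """Apply (1 + k * n_g) per parameter, with n_g ~ N(0, 1) clipped to [-1, 1]."""
    n_g = np.clip(rng.standard_normal(np.shape(params)), -1.0, 1.0)
    return np.asarray(params) * (1.0 + K_MUTATE * n_g)

def prune(store):
    """Keep at most MAX_SARS stored SARs, discarding those with the lowest value Q."""
    return sorted(store, key=lambda s: s.q, reverse=True)[:MAX_SARS]

# usage: rng = np.random.default_rng(); new_params = mutate(new_params, rng)
```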

3. Results

Initial results are encouraging: after each extended trial the robot had learned to at least strike the bottle in a majority of target locations, albeit less successfully for higher locations. Figure 7 illustrates some examples of successful reaching; however, the outcome can be better appreciated by viewing the video at http://tinyurl.com/ECCE-RL1 (available from stacks.iop.org/BB/9/016015/mmedia).

Figure 7. Example screenshots showing the robot model successfully reaching the target object under muscle co-activation based control acquired under reinforcement learning.

3.1. Reaching outcomes

The distribution of trial outcomes at six points during the learning is shown in figure 8. Although striking the target has been assimilated, the ability to just touch the target is rather less well developed (18.1% of trials, s.d. 4.1%). Investigation of these actions suggests that they are restricted to a subset of amenable locations; analysis suggests these match the cases where the hand is able to approach the target with the first muscle co-activation alone, leaving the second to take a greater slowing role rather than a corrective guiding role. Figure 8 also shows how the primary reward driver shifts steadily from low-scoring zonal only, through high-scoring

Figure 8. Distribution of trial outcomes at six stages of learning, measured as a percentage of the trials undertaken during each phase. Bar heights show the mean value across the four extended learning trials; error bars indicate the standard deviation. A trial constitutes testing an action intended to reach a target presented at a random location. Outcome categories: no reward (grey)—no reward was awarded during the trial. Low zonal (yellow)—zonal reward < mean (no strikes). High zonal (green)—zonal reward > mean (no strikes). Strike (red)—the target was struck with hand speed >0.1 m s−1. Touch (blue)—the target was ‘touched’; i.e. hand speed
