Vector AI Engineering Blog: Benchmarking Robustness of Reinforcement Learning Approaches using safe-control-gym
August 23, 2022
By Catherine Glossop
Robots are increasingly prevalent in our everyday lives as they take to the roads in the form of self-driving vehicles, assist in operations in our hospitals, work alongside us in industrial settings, and are even emerging in our own homes.
Reinforcement learning (RL) has become a promising area of research for teaching robots to perform tasks. It has been applied to trajectory tracking, goal reaching, and many other problems across a variety of robot types, from robotic manipulators to self-driving vehicles. When RL is used to solve these real-world problems, safety must be paramount: unsafe interaction with the environment, and with the people in it, can have serious consequences, ranging from harm to humans to the destruction of the robot itself. For safety to be guaranteed, the agent (the robot) must satisfy the constraints that define safe behaviour (i.e. not produce actions that damage the robot or hit obstacles or people) and must be robust to variations in the environment, to changes in its own dynamics, and to the unseen situations that emerge in the real world.
In this review, existing reinforcement learning approaches to robotic control (‘controllers’) will be evaluated quantitatively and qualitatively on their robustness and performance. To do this, safe-control-gym, an RL safety benchmarking gym, will be used to examine and compare the control approaches under a variety of applied disturbances.
This review aims to provide a basis for further research into robust control with RL and give an introduction to useful tools, like safe-control-gym, that can be leveraged to benchmark the performance of new robust algorithms.
Figure 1. Reinforcement learning with disturbances
In reinforcement learning, an agent (here, a simulated robot) performs an action and receives feedback, a reward, from the environment based on how well the action aligned with the desired behaviour (e.g. did it get the agent closer to the goal, did it follow a trajectory closely). It then perceives the updated state resulting from that action and repeats the process, learning as it goes to choose actions that move it closer to the desired behaviour and its goals. The mapping from the agent’s state to its action is called the policy.
In this review, we introduce disturbances at different points in this loop to simulate the real-world situations the agent might encounter. The system used is a cartpole moving along a 1-D rail.
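To make concrete where these disturbances enter the loop, the sketch below runs a toy cartpole interaction loop with white noise injected at the observation, action, and dynamics level. The stub environment and linear policy are purely illustrative stand-ins, not safe-control-gym's actual classes or API.

```python
import numpy as np

rng = np.random.default_rng(0)

class StubCartpole:
    """Toy stand-in for the cartpole environment (not safe-control-gym's API)."""

    def reset(self):
        self.state = np.zeros(4)  # [x, x_dot, theta, theta_dot]
        return self.state.copy()

    def step(self, action, dynamics_disturbance=0.0):
        # Extremely simplified dynamics: the commanded force (plus any external
        # disturbance) accelerates the cart, and gravity accelerates the pole.
        force = float(action) + dynamics_disturbance
        self.state += 0.02 * np.array(
            [self.state[1], force, self.state[3], 9.8 * self.state[2]]
        )
        reward = -(self.state[0] ** 2 + self.state[2] ** 2)  # negative quadratic cost
        done = abs(self.state[2]) > 0.5  # "failure" once the pole leans too far
        return self.state.copy(), reward, done


def policy(obs):
    # Placeholder linear feedback policy; a trained RL policy would go here.
    return -np.array([1.0, 1.5, 20.0, 3.0]) @ obs


env = StubCartpole()
obs = env.reset()
for t in range(250):
    noisy_obs = obs + rng.normal(0.0, 0.01, size=4)   # observation disturbance
    act = policy(noisy_obs) + rng.normal(0.0, 0.1)    # action disturbance
    obs, reward, done = env.step(act, dynamics_disturbance=rng.normal(0.0, 0.1))
    if done:
        break
```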
Figure 2. Block diagram of the safe-control-gym (source)
Safe-control-gym is an RL benchmarking gym developed by the Dynamic Systems Lab, led by Professor Angela Schoellig at the University of Toronto Institute for Aerospace Studies. The gym helps bridge the gap between simulated and real-world environments while providing tools to evaluate safety.
In the paper released with the repository, safety is broken down into robustness and constraint satisfaction. Constraint satisfaction of the approaches implemented in the gym is explored in more depth in the paper. Given the importance of robustness for translating RL and other learned approaches to the real world, it is worth exploring the performance of control approaches in the context of robustness in detail, as is done in this review.
To simplify these experiments, training and evaluation will be done using the cartpole environment on a subset of the controllers.
The controllers that will be compared are:
PPO is a state-of-the-art policy gradient method proposed in 2017 by OpenAI. It improves upon previous policy optimisation methods such as TRPO (Trust Region Policy Optimisation) and ACER (Actor-Critic with Experience Replay). PPO reduces the complexity of implementation, sampling, and parameter tuning by using a novel clipped objective function that approximates a trust-region update while remaining compatible with stochastic gradient descent.
safe-control-gym PPO implementation
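For reference, the clipped surrogate objective at the heart of PPO, as given in the original paper, is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is the estimated advantage and $\epsilon$ is the clipping parameter that keeps each policy update close to the previous policy.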
SAC is an off-policy actor-critic deep RL algorithm proposed in 2018 by UC Berkeley’s Artificial Intelligence Research Lab. The algorithm merges stochastic policy optimisation and off-policy methods like DDPG (Deep Deterministic Policy Gradient). This allows it to better tackle the exploration-exploitation trade-off pervasive in all reinforcement learning problems by having the actor maximise both the reward and the entropy of the policy. This helps to increase exploration and prevent the policy from getting stuck in local optima.
safe-control-gym SAC implementation
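The entropy-regularised objective that SAC maximises can be written as:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t)\sim\rho_\pi}\!\left[r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]$$

where $\alpha$ is a temperature parameter trading off the reward against the policy entropy $\mathcal{H}$, which is what encourages the additional exploration described above.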
RARL was proposed in 2017 by researchers at Carnegie Mellon University and Google Brain. Unlike the previous two approaches, RARL, along with the following approach, RAP, is designed to be robust and to bridge the gap between simulated results and performance in the real world. To achieve this, an adversary is introduced that learns an optimal destabilisation policy and applies destabilising forces to the agent, increasing the agent's robustness to real disturbances.
safe-control-gym RARL implementation
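At a high level, RARL trains the protagonist policy $\mu_\theta$ and the adversary policy $\nu_\phi$ on a zero-sum objective of roughly the following form:

$$\max_{\theta}\ \min_{\phi}\ \mathbb{E}\!\left[\sum_{t} r\big(s_t,\ a_t^{\mu_\theta},\ a_t^{\nu_\phi}\big)\right]$$

where the adversary's action is the destabilising force applied to the agent, so the protagonist learns to perform well under the worst disturbances the adversary can produce.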
RAP is an algorithm introduced by researchers from UC Berkeley in 2020 that extends RARL by introducing a population of adversaries that are sampled from and trained against. The algorithm aims to reduce the vulnerability of previous adversarial formulations to unseen adversaries by increasing the variety of adversaries, and therefore of adversarial behaviours, seen during training.
safe-control-gym RAP implementation
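RAP replaces the single adversary with a population $\{\nu_{\phi_1}, \dots, \nu_{\phi_n}\}$; on each rollout the protagonist faces an adversary sampled uniformly from the population, giving an objective of roughly the form:

$$\max_{\theta}\ \mathbb{E}_{i \sim \mathrm{Unif}(1,\dots,n)}\ \mathbb{E}\!\left[\sum_{t} r\big(s_t,\ a_t^{\mu_\theta},\ a_t^{\nu_{\phi_i}}\big)\right]$$

with each adversary in the population still trained to minimise the protagonist's return.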
The safety layer is an approach proposed by researchers at the Israel Institute of Technology and Google DeepMind. It adds a layer directly to the policy that corrects the action at each step so that it does not violate defined constraints. The action correction function is learned on past trajectories with arbitrary actions.
safe-control-gym Safe PPO implementation
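In outline, the safety layer learns a linear model of how each constraint value responds to the action, $\bar{c}_j(s') \approx c_j(s) + g_j(s; w_j)^{\top} a$, and then projects the policy's action onto the safe set by solving a small quadratic programme:

$$a^{*} = \arg\min_{a}\ \tfrac{1}{2}\,\lVert a - \mu_\theta(s)\rVert^{2} \quad \text{s.t.} \quad c_j(s) + g_j(s; w_j)^{\top} a \le C_j \ \ \forall j$$

where $\mu_\theta(s)$ is the policy's proposed action and $C_j$ the constraint limits. The projection has a closed-form solution when a single constraint is active, which connects to the failure mode discussed later in the baseline experiment.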
Safe-control-gym provides a plethora of settings to test the effects of changes to the robot's parameters and of different kinds of noise/disturbances on the actions, observations, and external dynamics of the agent during training and evaluation. To limit the scope of this review, the control approaches are evaluated on different kinds of noise across the three available categories, not on parameter uncertainties.
Each RL agent was trained using a randomised initial state to improve performance [1]. For consistency in testing, a single arbitrary initial state is used. The range of disturbance levels used in each experiment was then manually tuned to find a series of values that allowed the agents' performance to be compared from relatively low levels of disturbance up to the point where they failed to stabilise the cartpole. The controller's goal is to stabilise the cartpole at a pose of 0 m in x (the centre of the rail) and 0 rad in theta (the pole upright).
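To make the setup concrete, a sketch of the kind of settings used for an evaluation run is shown below. safe-control-gym is driven by YAML-style configuration; the keys and values here are illustrative assumptions rather than the library's verbatim schema.

```python
# Illustrative disturbance settings in the spirit of safe-control-gym's
# configuration files; the exact keys and values are assumptions, not the
# library's verbatim schema.
eval_disturbance_config = {
    "task": "stabilization",          # stabilise at x = 0 m, theta = 0 rad
    "init_state": {                   # single fixed initial state for evaluation
        "init_x": 0.1, "init_x_dot": 0.0,
        "init_theta": 0.1, "init_theta_dot": 0.0,
    },
    "disturbances": {
        # Applied to the force the policy commands.
        "action": [{"disturbance_func": "white_noise", "std": 0.5}],
        # Applied to the state the policy observes.
        "observation": [{"disturbance_func": "white_noise", "std": 0.05}],
        # Applied as an external force on the cart.
        "dynamics": [{"disturbance_func": "white_noise", "std": 0.5}],
    },
}
```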
To measure the robustness and performance of these controllers, the negative quadratic return is averaged over the length of each evaluation run and then over 25 evaluation runs. The same cost/reward function was used for training and evaluation.
Equation 1 shows the cost computed at each step i of an evaluation run, where x and xᵍᵒᵃˡ are the actual and goal states of the system, u and uᵍᵒᵃˡ are the actual and goal inputs to the system, and Wₓ and Wᵤ are constant reward weights. Equation 2 shows the return computed from the cost function, where L is the number of steps in the run; L can be at most 250 but will be smaller if the agent fails early. Equation 3 shows the average return over N (25) evaluation runs, which is denoted as the reward in this review.
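Based on the description above, the metric has roughly the following form; this is a reconstruction, and the exact placement of the per-run normalisation by L (in Equation 2 or 3) may differ in the original equations:

$$c_i = (x_i - x_i^{goal})^{\top} W_x \,(x_i - x_i^{goal}) + (u_i - u_i^{goal})^{\top} W_u \,(u_i - u_i^{goal}) \tag{1}$$

$$R = -\frac{1}{L}\sum_{i=1}^{L} c_i \tag{2}$$

$$\text{reward} = \frac{1}{N}\sum_{n=1}^{N} R_n, \qquad N = 25 \tag{3}$$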
In addition to this quantitative measurement, visualisations of the performance of the approaches will be provided for qualitative comparison in certain cases.
As the majority of these approaches do not incorporate constraint satisfaction and this was explored in the aforementioned paper, constraint satisfaction will not be evaluated in this review. However, if implementing versions of these approaches that allow for constraint satisfaction is of interest, see the safe-control-gym paper for more information.
First, let’s look at the results with no disturbances applied during training or evaluation. This shows the base robustness of the controllers without having seen any noise during training or having to compensate for disturbances during evaluation. Therefore, this should be the best performance they can achieve in an essentially ideal environment.
Figure 3. Number of training steps vs. evaluation reward averaged over 10 runs
Figure 4. Number of training steps vs. evaluation MSE averaged over 10 runs
The baseline performance of each of the controllers can be visualised in the GIFs below.
Fig 5. PPO baseline
Fig 6. SAC baseline
Fig 7. RARL baseline
Fig 8. RAP baseline
Fig 9. PPO with safety layer baseline
The three algorithms that reached convergence fastest were SAC, PPO, and RAP. SAC and PPO benefit from the stochastic characteristics of their updates. RARL and safe explorer PPO train more slowly. This would be expected for RARL, as it is also learning to counteract the adversary, although the same behaviour is not seen for RAP. Safe explorer PPO does not achieve as high a reward during training as the other algorithms.
Experiment 0 shows that while the first four algorithms perform very similarly, safe explorer PPO fails to perform well even in the simplest experiment. In the controller's configuration, two absolute constraints are set, one on theta and one on x, shrinking the agent's allowable state space. However, the safety layer can only satisfy one constraint at a time, resulting in failure when one constraint is broken while it tries to satisfy the other. While constraint satisfaction is an essential part of safety, it is not explored further in this report, so this approach is omitted from the remaining experiments as it does not provide a useful comparison. Further discussion of constraint satisfaction is available in the paper linked on the safe-control-gym GitHub page.
Safe-control-gym offers the ability to apply disturbances to actions, observations, and external dynamics to simulate the different points at which noise can enter the system. To start, we will look at white noise disturbances, which mimic the natural stochastic noise an agent might see in the real world.
Figure 10. Example of one dimensional white noise
The noise is applied during testing at increasing values of standard deviation, starting from zero (see the evaluation sweep sketched below). The three figures that follow show the results for disturbances on external dynamics, actions, and observations respectively:
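The sweep below sketches this protocol, reusing the stub environment and linear policy from the earlier interaction-loop sketch: zero-mean Gaussian noise is added to the commanded action and the per-step return is averaged over 25 runs. It illustrates the shape of the experiment rather than the gym's own evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(ctrl, std, n_runs=25, horizon=250):
    """Average per-step return of `ctrl` on the StubCartpole from the earlier
    sketch, with white noise of scale `std` added to the commanded action.
    Illustrative only; not safe-control-gym's evaluation code."""
    returns = []
    for _ in range(n_runs):
        env = StubCartpole()
        obs, total, steps = env.reset(), 0.0, 0
        for _ in range(horizon):
            act = ctrl(obs) + rng.normal(0.0, std)  # white-noise action disturbance
            obs, reward, done = env.step(act)
            total += reward
            steps += 1
            if done:
                break
        returns.append(total / steps)
    return float(np.mean(returns))

# Sweep the disturbance level, mirroring the x-axes of the plots below.
for std in [0.0, 1.0, 2.0, 4.0]:
    print(f"std={std}: reward={evaluate(policy, std):.3f}")
```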
Figure 11. Comparison of evaluation reward for white noise applied to external dynamics
Figure 12. Comparison of evaluation reward for white noise applied to actions
Figure 13. Comparison of evaluation reward for white noise applied to observations
The performance is very similar across control approaches for white noise disturbances on external dynamics, with a linear decrease in reward as the noise increases. PPO performs slightly better. Surprisingly, the robust approaches, RARL and RAP, show no significant difference in performance. The performance with the disturbance is visualised in figure 14.
Figure 14. The performance of PPO with a white noise disturbance on external dynamics over standard deviation values of 0.1, 0.5, and 1.0
For action disturbances, RARL consistently has the highest reward. For observation disturbances, PPO has the highest reward. Figures 15 and 16 show the performance of the control approaches at select values of standard deviation to visualise the difference in performance. Overall, the difference in performance across the four approaches is not large and they all demonstrate robustness when white noise is applied to observations and actions.
Figure 15. Visualisation of performance with white noise applied to actions at a standard deviation of 4.0
Figure 16. Visualisation of performance with white noise applied to observations at a standard deviation of 0.15
Step disturbances allow us to see the system’s response to a sudden and sustained change.
Figure 17. Example of a one-dimensional step function
Similar to the previous experiment, the disturbance is applied at varying levels, here the magnitude of the step. The step is set to occur two steps into the episode for all runs.
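As a quick illustration (not the gym's own implementation), this kind of step disturbance can be generated as a signal that is zero until the onset step and then holds a constant offset:

```python
import numpy as np

def step_disturbance(magnitude, horizon=250, onset=2):
    """Step disturbance: zero until `onset`, then a constant `magnitude`
    for the remainder of the episode (illustrative only)."""
    signal = np.zeros(horizon)
    signal[onset:] = magnitude
    return signal

force_offset = step_disturbance(2.0)  # e.g. a 2 N step applied from step 2 onward
```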
Figure 18. Comparison of evaluation reward for a step disturbance applied to external dynamics
Figure 19. Comparison of evaluation reward for a step disturbance applied to actions
Figure 20. Comparison of evaluation reward for a step disturbance applied to observations
Compared with white noise, the step disturbance has a much greater effect: even low magnitudes of disturbance result in a large decrease in reward. There is a dramatic drop in reward once the agent can no longer stabilise the cartpole, as seen in figures 19 and 20.
For the step disturbance on external dynamics, no controller is significantly better, although PPO has a higher reward overall across the varying magnitudes. For actions and observations, PPO again achieves the best overall performance, with its reward not decreasing until the step magnitude exceeds 5 N for action disturbances, comparatively much higher than the other approaches. SAC appears to maintain a slightly higher reward than the remaining approaches at high magnitudes, but this is actually because SAC fails quickly there: failing early results in a higher reward due to a smaller denominator (fewer steps) in Equation 3. This behaviour is shown in figure 22, where SAC fails while PPO succeeds in stabilising the cartpole.
Figure 21. Visualisation of performance with an action step disturbance with a magnitude of 2.0
Figure 22. Visualisation of performance with an action step disturbance with a magnitude of 4.5
Overall, PPO is the most robust of these approaches to step disturbances.
Impulse disturbances allow us to see the system’s response to a sudden, but temporary change.
Figure 23. Example of a one-dimensional impulse function
Again, we look at varying levels of the impulse’s magnitude to test the controllers’ robustness. The width of the impulse is kept at 2 steps for all runs.
Figure 24. Comparison of evaluation reward for an impulse disturbance applied to external dynamics
Figure 25. Comparison of evaluation reward for an impulse disturbance applied to actions
Figure 26. Comparison of evaluation reward for an impulse disturbance applied to observations
Impulse disturbances are more easily handled than step disturbances, which is understandable: a step requires the system to adapt to a new baseline, whereas an impulse is only temporary, allowing the agent to tolerate larger disturbance magnitudes. On the other hand, when a sharp impulse does cause the agent to fail to stabilise, the resulting drop in reward is even more dramatic than with the step disturbance, as shown in figures 24 and 25. PPO is the most robust to impulse disturbances on external dynamics, and RARL shows increased robustness compared to its performance under step disturbances.
For the disturbances applied to actions, SAC and PPO are able to handle higher magnitudes of impulse disturbance than RARL or RAP.
Figure 27. Visualisation of performance with an action impulse disturbance with a magnitude of 110
The impulse disturbance on observations does not affect any of the control approaches, even at very high values. This may be because the disturbance only lasts two steps, so its effect does not persist like the other kinds of disturbances shown previously.
In addition to white noise, step, and impulse disturbances, it is worth exploring the controllers' ability to deal with periodic disturbances. These disturbances challenge the approaches because, as with white noise, the disturbance is constantly present, but here the controllers must also be robust to more complex, structured patterns of disturbance.
A sawtooth or saw wave is a cyclic wave that increases linearly to a set magnitude and drops back to a starting point instantaneously before repeating the cycle. This disturbance type introduces some of the characteristics of the step and impulse disturbances but is now applied periodically throughout the evaluation run.
Figure 28. Example of a one-dimensional sawtooth wave
The amplitude or magnitude of the wave is modified in this experiment while the period, sign, and offset are kept the same for all runs. The sign of the wave can be selected per dimension: with a negative sign the wave decreases from the starting value to the set magnitude, and with a positive sign it increases from the starting value to the set magnitude. For the external dynamics disturbance, the first dimension has a positive sign and the second a negative sign. For the action disturbance, which is one-dimensional, the sign is positive. For observation disturbances, the first and third dimensions have a positive sign and the second and fourth a negative sign.
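A sketch of such a saw wave with a per-dimension sign, in the spirit of the settings described above (the parameter names are illustrative, not the gym's), is:

```python
import numpy as np

def saw_wave(amplitude, signs, period=50, horizon=250, offset=0.0):
    """Sawtooth disturbance: each dimension ramps linearly from `offset`
    towards +/- `amplitude` over one period, then resets (illustrative only)."""
    t = np.arange(horizon)
    phase = (t % period) / period                    # 0 -> 1 within each period
    base = offset + amplitude * phase                # rising ramp
    return np.outer(base, np.asarray(signs, float))  # one column per dimension

# External-dynamics disturbance: positive sign on the first dimension,
# negative on the second, as in the experiments above.
dynamics_disturbance = saw_wave(amplitude=2.0, signs=[+1, -1])
```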
Figure 29. Comparison of evaluation reward for a saw wave disturbance applied to external dynamics
Figure 30. Comparison of evaluation reward for a saw wave disturbance applied to actions
Figure 31. Comparison of evaluation reward for a saw wave disturbance applied to observations
The differences in performance across the approaches are less obvious than for the other types of disturbances applied in previous experiments. For disturbances in external dynamics, there is little difference in performance across the control approaches; RARL performs better than the other approaches at lower amplitudes of noise, and SAC, as in experiment 3, shows a higher reward due to fast failure. For action disturbances, PPO shows little decrease in reward while the other approaches, particularly SAC, have a lower reward.
Figure 32. Visualisation of performance with a saw wave disturbance on actions with an amplitude/magnitude of 4.0
In figure 32, it can be seen that RAP and RARL fail more often than PPO and SAC, resulting in a lower average reward. SAC takes longer to stabilise, creating a larger denominator in Equation 3 and hence a lower reward. When the saw wave disturbance is applied to observations, all approaches have difficulty stabilising and the reward quickly drops to zero.
A triangle wave is a cyclic wave that increases linearly to a set magnitude and then decreases at the same rate back to the starting point before repeating. This disturbance type is very similar to the saw wave but behaves more like a sinusoidal wave.
Figure 33. Example of a one-dimensional triangle wave
The same settings were used for this experiment as for the previous one.
Figure 34. Comparison of evaluation reward for a triangle wave disturbance applied to external dynamics
Figure 35. Comparison of evaluation reward for a triangle wave disturbance applied to actions
Figure 36. Comparison of evaluation reward for a triangle wave disturbance applied to observations
As expected, the results for triangle wave disturbances look similar to those of the sawtooth wave disturbances. The triangle wave disturbance results in a slightly lower reward than the sawtooth wave disturbance but the relative performances of the control approaches remain the same. SAC performs slightly worse in the case of disturbances applied to dynamics. The drop in reward occurs earlier for all controllers for disturbances applied to observations, showing the increased sensitivity to the triangle wave disturbance compared to the sawtooth wave disturbance.
In the previous experiments, no disturbances were introduced during training. One would expect that including some level of noise or disturbance during training improves the controllers' performance during evaluation: seeing more variation in the environment during training should produce a more generalised model, similar to how RARL and RAP use an adversary to increase robustness.
Therefore, in this experiment we look at two control approaches, PPO (a classic state-of-the-art RL approach) and RAP (a recent robust approach), each trained for 1,000,000 steps on varying levels of white noise, to see whether robustness improves.
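The experiment forms a grid: one controller is trained per training-noise level and then evaluated across all testing-noise levels, producing the heat maps below. The sketch makes that structure explicit; `train_controller` and `evaluate_controller` are placeholders for full safe-control-gym training and evaluation runs, and the noise levels listed are illustrative.

```python
import numpy as np

train_stds = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative training-noise levels
test_stds = [0.0, 0.25, 0.5, 0.75, 1.0]   # illustrative testing-noise levels

def train_controller(train_std, steps=1_000_000):
    # Placeholder: a full PPO/RAP training run with white noise of scale
    # `train_std` applied during training would go here.
    return {"train_std": train_std}

def evaluate_controller(controller, test_std, n_runs=25):
    # Placeholder: the averaged-return evaluation protocol described earlier.
    return 0.0

heatmap = np.zeros((len(train_stds), len(test_stds)))
for i, train_std in enumerate(train_stds):
    controller = train_controller(train_std)
    for j, test_std in enumerate(test_stds):
        heatmap[i, j] = evaluate_controller(controller, test_std)
```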
Figure 37. Evaluation reward heat map of PPO and RAP trained on and tested with varying levels of white noise applied to external dynamics
Figure 38. Evaluation reward heat map of PPO and RAP trained on and tested with varying levels of white noise applied to actions
Figure 39. Evaluation reward heat map of PPO and RAP trained on and tested with varying levels of white noise applied to observations
In general, evaluation performance at higher levels of noise is actually better when the controller is trained on lower levels of noise, with the best performance achieved when training with no disturbances at all. For external dynamics disturbances, the reward gradually decreases as the training noise is increased, although at higher values of training noise the reward at higher testing-noise values improves slightly, suggesting a possible small gain in robustness. For action disturbances, the reward is largely unaffected by increased training or testing noise, except at specific high values where it decreases dramatically; in general, adding training noise does not improve robustness to action disturbances. For noise added to observations, the reward drops to nearly zero as soon as noise is introduced during training. RAP, which trains against adversarial populations, generally achieves a higher reward than PPO and appears marginally better at higher values of training noise, though the difference is negligible. Applying noise to external dynamics during training is the only case in which any improvement is seen.
Figure 40. PPO trained with no disturbances, tested on external dynamics disturbance of white noise with standard deviation 0, 0.5, 1.0
Figure 41. PPO trained with white noise disturbance of standard deviation 0.25, tested on external dynamics disturbance of white noise with standard deviation 0, 0.5, 1.0
It could be argued that, given longer to train in this harder environment, improved results would be possible. Therefore, the PPO model for external dynamics disturbances, the only case in which a potential for improvement was observed, was trained for an additional 500,000 and 1,000,000 steps.
Figure 42. Evaluation reward heat map of PPO trained and tested on varying levels of white noise for 500,000, 1,500,000, and 2,000,000 steps
In figure 42, results from training for fewer than 1,000,000 steps (500,000 steps) show a reward still lower than that of the best model. As the model is trained for longer, at 1,500,000 and 2,000,000 steps, its performance worsens further, resulting in lower reward at the same values of testing and training noise. There is some small improvement at high values of training noise at 1,500,000 steps, although it does not compare to the performance with no training noise, and this improvement disappears as training continues. A possible explanation is that the model does not have the capacity to learn the behaviour of the two systems (the cartpole and the noise), resulting in worse performance, but further exploration is required.
An interesting extension to this experiment would be to train the models for some number of episodes without disturbances and then introduce disturbances, either continuing to train with disturbances or alternating between training with and without them. In addition, varying the parameters of the disturbance, similar to the approach of RAP, might help increase the diversity of the disturbances seen in training. This is left for future work.
Overall, these results suggest there is no improvement in performance and often performance can suffer when noise is included during training, even with longer training sessions.
From this review, we can start to understand the effects of disturbances on reinforcement learning algorithms during training and testing. Robustness to these disturbances will continue to grow in importance as these algorithms are applied in the real world.
The approaches are robust to disturbances on actions but often struggle more with disturbances on dynamics and observations, which result in large and sudden decreases in reward, except in the case of impulse disturbances on observations. To deal with these disturbances better, many RL approaches to robotic control would benefit from an additional component: state estimation. This could be a learned or known model introduced into the system to improve the agent's knowledge of its state in the presence of disturbances and uncertainty. Ensuring the agent has improved knowledge of its state would be a necessary addition for real-world applications of reinforcement learning.
Looking at the results as a whole, the most robust algorithm is PPO, which achieves the highest reward most often across all disturbance types on external dynamics, actions, and observations in these experiments. RARL and SAC appear to be the next most robust approaches, each outperforming the other in different experiments. RAP is often less stable than RARL or SAC.
For using disturbances in training to improve performance, simply introducing disturbances throughout training does not seem to be a promising approach and often compromises the approach’s performance. Other non-adversarial methods of introducing disturbances during training to improve performance should be explored.
Robust reinforcement learning will continue to grow in importance as a means of ensuring safety in real-world environments. In the context of working with safe-control-gym, there are many interesting next steps that can be taken.
This report is supported by Vector Institute and the Dynamic Systems Lab. I’d like to specifically acknowledge Amrit Krishnan (Vector Institute), Jacopo Panerati (Dynamic Systems Lab), Justin Yuan (Dynamic Systems Lab), and Professor Angela Schoellig (Dynamic Systems Lab) for their support and guidance.
[1] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, L. Paull, “Active Domain Randomization”, Proceedings of the Conference on Robot Learning, vol. 100, pp. 1162–1176, Oct. 2020