Reinforcement Learning for Robotic Precision: Adaptive Object Manipulation
By: Gautham Sathyanarayanan, Elmaddin Guliyev, Bharat Kumar Jhawar
As part of Fellowship.AI, we were tasked with developing a user-friendly tool that leverages reinforcement learning to enable a robotic system to perform precise, real-time object manipulation. This project aims to explore dynamic, adaptive robotic tasks, such as applying products to human bodies, ensuring efficiency, safety, and high-quality results in real-world scenarios.
The advances in robotics have given rise to new applications that are becoming a part of our everyday lives. One particularly innovative idea is to use a robotic arm to apply sunscreen on a human body. This idea offers potential benefits to industries like healthcare, elderly care, and even daily personal convenience.
To explore this, we delved into two distinct approaches: pre-defined coordinates and vision-based robotic manipulation. In both approaches, we use reinforcement learning for object manipulation.
Before we dive in, let’s look at the basic concepts of reinforcement learning in the context of our project:
The reinforcement learning process enables the robot to learn from its environment and improve its actions over time (like a child learning through trial and error). These are the key concepts in our case:
- Agent: The robotic arm interacting with the environment.
- Environment: A simulation that includes the human body and objects, like the sunscreen bottle.
- State: The robot’s current status, such as its arm’s position relative to the body.
- Action: The robot’s movement or decision, such as picking up the bottle or applying sunscreen.
- Reward: Feedback based on the success of an action. For example, grasping the sunscreen bottle or applying sunscreen correctly earns the robot a positive reward.
Now let’s explore the project.
The task here is to build a robotic arm capable of applying sunscreen evenly across a person’s body. The robot should pick up the sunscreen bottle, dispense sunscreen, and apply it to the human body, all while making real-time adjustments. To achieve this, two primary approaches were explored:
- Approach 1: Coordinate-based manipulation using reinforcement learning.
- Approach 2: Vision-based manipulation using reinforcement learning with real-time feedback from cameras.
Approach 1: Coordinate-Based Method
The first approach implemented a more traditional, deterministic method: using pre-defined coordinates to move the robotic arm across a human body. Although straightforward in theory, it required precise calibration to ensure accuracy.
Setup and Initialization
The environment was modelled using PyBullet (a physics simulation engine) and followed the standard OpenAI Gym interface. The robot used in this approach was a Franka Panda arm, chosen for its flexibility and common use in manipulation tasks. Gravity was set to 9.8 m/s², and collision detection was enabled. Several objects were loaded into the environment (a minimal setup sketch follows this list), including:
- A bottle of sunscreen
- A sponge (represented as a cube) to rub the sunscreen
- The human body model (both large and small variations)
- A wooden table as the surface for the interaction
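For reference, here is a minimal sketch of how such a PyBullet environment might be initialized. Only the plane, table, and Panda models ship with pybullet_data; the URDF file names for the sunscreen bottle, sponge, and human body are hypothetical placeholders for this project's assets.

```python
import pybullet as p
import pybullet_data

# Minimal environment setup sketch (task-object URDF paths are placeholders).
p.connect(p.GUI)  # use p.DIRECT for headless training
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)  # gravity of 9.8 m/s^2 pointing down

p.loadURDF("plane.urdf")
table_id = p.loadURDF("table/table.urdf", basePosition=[0.5, 0.0, 0.0])
robot_id = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)

# Task-specific objects (hypothetical URDF files).
bottle_id = p.loadURDF("sunscreen_bottle.urdf", basePosition=[0.5, 0.2, 0.65])
sponge_id = p.loadURDF("cube_small.urdf", basePosition=[0.5, -0.2, 0.65])
body_id = p.loadURDF("human_body.urdf", basePosition=[0.9, 0.0, 0.65])
```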
In the initialization method, we set the action and observation spaces:
- The action space consisted of 25 discrete actions, representing all possible arm movements, with the height hack enabled (the gripper is automatically lowered at every step).
- The observation space was a 9-dimensional state space, encompassing the position and orientation of the robot and its objects.
(x, y, z, α, β, γ, x-gripper, y-gripper, α-gripper)
Here, (x, y, z) is the position of the gripper, (α, β, γ) its orientation, and (x-gripper, y-gripper, α-gripper) the x, y position and yaw rotation of the block object relative to the gripper's coordinate frame.
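In code, these spaces might be declared roughly as follows. This is only a sketch of how an environment's __init__ could define them, with unbounded observation limits assumed.

```python
import numpy as np
from gym import spaces

# 25 discrete actions covering the possible arm movements
# (with the height hack lowering the gripper each step).
action_space = spaces.Discrete(25)

# 9-dimensional observation: gripper pose (x, y, z, alpha, beta, gamma)
# plus the block's (x, y, yaw) relative to the gripper frame.
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(9,), dtype=np.float32)
```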
Movement and Execution
This approach combined reinforcement learning with pre-defined coordinates.
Grasping the sunscreen bottle was learned using Q-learning (a framework based on the general formulation of robotic manipulation as a Markov Decision Process), which helped the robot learn from the environment and decide how to grasp and hold the bottle properly.
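The post does not show the exact implementation, but the core of Q-learning is the temporal-difference update below. This is a tabular sketch over a discretized state space, which is a simplifying assumption rather than the project's actual code; the learning rate and discount factor are illustrative values.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q is an (n_states, n_actions) table over a discretized state space.
    """
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped return estimate
    td_error = td_target - Q[s, a]              # temporal-difference error
    Q[s, a] += alpha * td_error
    return Q
```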
To apply sunscreen to the human body, the robotic arm followed a precise path calculated from coordinates. Different pressures were applied to different parts of the body for even distribution of the sunscreen. Considering human safety, flipping the body was done without the robotic arm; this decision was made because of potential injuries that could arise from body weight, pressure sensitivity, positioning, etc.
Reward Mechanism
- Grasp Success: The robot received a reward of +10 when it successfully picked up the sunscreen bottle and either held it for a certain period or lifted it above a height of 0.2 m.
- Grasp Failure: If the robot dropped the bottle or failed to pick it up, it received a reward of 0. (A minimal sketch of this reward logic follows the list.)
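Here is a minimal sketch of this sparse reward, assuming the episode tracks the bottle height and how long it has been held; the hold-duration threshold is an illustrative value, since the post only says "a certain period".

```python
GRASP_REWARD = 10.0
LIFT_HEIGHT = 0.2        # metres, from the success criterion above
MIN_HOLD_STEPS = 50      # placeholder for "a certain period", in simulation steps

def grasp_reward(bottle_height, hold_steps):
    """Return +10 for a successful grasp (lifted above 0.2 m or held long enough), else 0."""
    if bottle_height > LIFT_HEIGHT or hold_steps >= MIN_HOLD_STEPS:
        return GRASP_REWARD
    return 0.0
```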
Demo
Limitations of the Coordinate Approach
While this method demonstrated the fundamental functionality, it had several drawbacks:
- Lack of flexibility: Any change in body position, size, or environmental elements required constant manual recalibration, making it unsuitable for real-world scenarios where variations are common.
- Real-World Adaptation Challenges: Moving from simulation to the real world requires a 3D representation of the real-world environment, and the system must adjust the coordinates in real time based on body movements. With current technology, this transition is challenging.
When translating to the real world, methods like NeRF and vision models can be used, but they are still an active area of research and not fully explored. For immediate translation, manual recalibration is required even for minor changes, which makes this method less practical for real-world applications.
Approach 2: Vision-Based Method
On the other hand, the vision-based approach gave the robot the ability to adapt in real-time using cameras and reinforcement learning. This allows the robotic arm to adjust based on the exact position of the body and even avoid obstacles like clothing or jewelry, making it much more versatile than the coordinate-based method.
In this approach, we switched to a Kuka robotic arm, considering its advanced safety features, torque sensing (to detect and avoid collisions), and suitability for close human-robot collaboration.
Reinforcement Learning Setup
The robotic arm was equipped with an eye-in-hand camera, which moved with the robot’s arm to capture real-time images. These visual inputs allowed the robot to make decisions based on what it “saw” (just like humans). This setup was paired with reinforcement learning algorithms that allowed the robot to learn from its environment and adjust its actions over time.
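As a sketch, an eye-in-hand camera in PyBullet can be rendered by rebuilding the view matrix from the end-effector link pose each step. The link index and the assumption that the camera looks along the link's x-axis are ours, not from the post.

```python
import pybullet as p

def eye_in_hand_image(robot_id, ee_link_index, width=64, height=64):
    """Render an RGB-D image from a camera that moves with the end-effector (sketch)."""
    pos, orn = p.getLinkState(robot_id, ee_link_index)[:2]
    rot = p.getMatrixFromQuaternion(orn)          # 3x3 rotation as a flat row-major list
    forward = [rot[0], rot[3], rot[6]]            # assume the camera looks along the link x-axis
    up = [rot[2], rot[5], rot[8]]                 # link z-axis as the camera "up" direction
    target = [pos[i] + 0.1 * forward[i] for i in range(3)]
    view = p.computeViewMatrix(pos, target, up)
    proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=2.0)
    _, _, rgb, depth, _ = p.getCameraImage(width, height, view, proj)
    return rgb, depth
```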
We explored two RL algorithms:
- Deep Q-Network (DQN): This algorithm was used for discrete action spaces, enabling the robot to make clear decisions such as moving to a new position or applying more pressure.
- Twin Delayed DDPG (TD3): This algorithm is designed for continuous action spaces, allowing the robotic arm to move smoothly and precisely, adjusting as it senses changes in object position or movement.
Deep Q-Network (DQN)
The Q-network used for training is a convolutional neural network (CNN) that takes a state as input and outputs a Q-value for each action. The input to the network is a stack of consecutive (64, 64, 3) RGB frames from the camera.
One reason for using consecutive frames as input is to capture the environment dynamics. Each frame is converted to grayscale and resized before being sent to the network. Experience replay is used to store transitions in a cyclic replay buffer, and during training batches are sampled from it uniformly at random.
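A minimal sketch of such a cyclic replay buffer with uniform random sampling is shown below; the default capacity is illustrative (Experiment 1 later in the post uses a buffer of size 100).

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayBuffer:
    """Cyclic buffer of transitions, sampled uniformly at random during training."""

    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)   # oldest transitions are overwritten

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```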
The loss function used for minimizing the error in the network weights is the Huber loss:
L(δ) = ½ δ² for |δ| ≤ 1, and L(δ) = |δ| − ½ otherwise,
where δ is the temporal-difference error:
δ = r + γ · maxₐ′ Q(s′, a′) − Q(s, a).
The Huber loss behaves like the mean squared error for small values of the error and like the mean absolute error for larger values. Finally, epsilon decay was used to encourage exploration at the beginning of training and exploitation later on.
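Putting these pieces together, a DQN loss computation and an exponentially decaying epsilon schedule might look like the sketch below (written with PyTorch, and assuming a separate target network, which is standard for DQN but not stated explicitly in the post; all hyperparameter values are illustrative).

```python
import math
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Huber (smooth L1) loss on the temporal-difference error for a sampled batch."""
    states, actions, rewards, next_states, dones = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values        # max_a' Q(s', a')
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    return F.smooth_l1_loss(q_sa, targets)  # quadratic near zero, linear for large errors

def epsilon(step, eps_start=0.9, eps_end=0.05, eps_decay=1000):
    """Exponentially decaying exploration rate used for epsilon-greedy action selection."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)
```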
The model architecture of the DQN is shown in the figure below.
The input to the model is a stack of frames, and the output is the predicted Q-value q(s, aᵢ) for each action aᵢ. In this case, the size of the action space is 7, so the output of the Q-network is 7 Q-values corresponding to the predicted expected return for each action.
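As a complement to the figure, here is a sketch of a small CNN Q-network consistent with the description (a stack of preprocessed 64×64 frames as input channels, 7 output Q-values). The specific layer sizes and the default number of stacked frames are our assumptions, not the exact architecture used.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN mapping a stack of grayscale frames to one Q-value per action (sketch)."""

    def __init__(self, n_frames=4, n_actions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),  # 64x64 -> 15x15
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),        # 15x15 -> 6x6
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),        # 6x6  -> 4x4
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_actions),  # q(s, a_i) for each of the 7 actions
        )

    def forward(self, x):  # x: (batch, n_frames, 64, 64)
        return self.head(self.features(x))
```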
TD3
The idea here is to replace the Q-learning algorithm (DQN) with Twin Delayed Deep Deterministic Policy Gradient (TD3). TD3 is a model-free RL algorithm that uses both policy optimization and Q-learning, which means it can trade off the weaknesses and strengths of policy gradient methods and Q-learning methods.
Furthermore, Q-learning methods tend to be more sample-efficient, thanks to their ability to reuse previous experiences, but less stable unless tuned carefully. In contrast, policy gradient methods are more robust and more suitable for directly finding the optimal policy.
TD3 also handles continuous action spaces, while DQN is limited to discrete actions. To combine the strengths of both methods, we developed a TD3 implementation that reuses the same network architecture and replay buffer as the DQN setup.
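The core of TD3 is captured by its critic target: clipped double-Q (take the minimum of two target critics) and target-policy smoothing noise. Below is a sketch of that target computation, assuming actor and twin-critic target networks are already defined; the noise and clipping values are the commonly used defaults, not necessarily the ones used in this project.

```python
import torch

def td3_target(reward, next_state, done, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Compute the TD3 critic target for a batch of transitions (sketch)."""
    with torch.no_grad():
        # Target-policy smoothing: perturb the target action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Clipped double-Q: take the minimum of the two target critics.
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        return reward + gamma * (1.0 - done) * torch.min(q1, q2)
```

The "delayed" part of the algorithm then comes from updating the actor and the target networks only once every few critic updates.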
Reward Mechanism
Both versions of the environment provide a sparse reward. The reward mechanism worked as follows:
- Grasp Success: The robot received a reward of +1 when it successfully picked up the sunscreen bottle and either held it for a certain period or lifted it above a threshold height.
- Grasp Failure: If the robot dropped the bottle or failed to pick it up, it received a reward of 0.
In this approach, we ran two experiments, each comparing several approaches, including both image-based and non-image-based observations (similar to Approach 1).
In the non-image-based solution, DQN operated on low-dimensional state observations rather than camera images. Additionally, the reward function is no longer sparse but takes into account the distance between the gripper and the object.
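One common way to shape such a reward is to penalize the gripper-object distance while keeping a bonus for the successful grasp; the sketch below assumes both positions are available from the simulator, and the scaling constant is illustrative.

```python
import numpy as np

def shaped_reward(gripper_pos, object_pos, grasp_success,
                  success_bonus=1.0, distance_scale=0.1):
    """Dense reward: small penalty proportional to gripper-object distance, plus a grasp bonus."""
    distance = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(object_pos))
    reward = -distance_scale * distance
    if grasp_success:
        reward += success_bonus
    return reward
```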
In the vision-based solutions, a static camera was used for consistent performance in controlled environments where the body and objects remain stationary, while a dynamic eye-in-hand camera enabled the robotic arm to adjust in real time, making it much more dynamic and adaptive to changes.
Experiment 1: Single Frame Observations
The first experiment is based on single-frame observations and a replay buffer of size 100.
Approaches Compared: Kuka RL (Static RGB Camera), Kuka Eye in Hand (Dynamic Camera), TD3 Eye in Hand, Kuka RL Simple (Non-Image Observations).
Observations:
The learning curves plotted across episodes show that Kuka RL and Kuka Eye in Hand performed similarly, whereas TD3 Eye in Hand performed moderately well and Kuka RL Simple showed the slowest learning curve.
Experiment 2: Stacked Frames Observations
In this experiment, the concept of stacked frames was introduced. The idea is to capture a sequence of images rather than a single frame. Stacking 10 sequential images from the eye-in-hand camera enabled the robot to capture temporal dynamics (how objects move and interact over time), allowing it to track the movement of objects and better predict future actions.
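Frame stacking is usually implemented with a fixed-length queue of the most recent observations. A sketch, assuming already-preprocessed frames:

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the last n frames and expose them as a single stacked observation (sketch)."""

    def __init__(self, n_frames=10):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame at the start of an episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def step(self, frame):
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape: (n_frames, H, W) for grayscale/depth, or (n_frames, H, W, C) for RGB.
        return np.stack(self.frames, axis=0)
```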
Approaches Compared: Static (Stacked RGB Frames), Eye in Hand (Stacked RGB Frames), Depth in Hand (Stacked Depth Images).
Observations:
The above graph makes it clear that the static method had good initial performance but poor generalization to unseen objects. Eye in Hand improved over time with better generalization, while Depth in Hand showed the best overall performance due to clearer object features and reduced noise.
Demo
Key Findings:
- Eye-in-hand cameras with stacked frames and depth images performed best by a significant margin. This setup enhances dynamic observation, learning, and generalization in robotic manipulation tasks.
- Reward function and hyperparameter tuning proved to be critical for DRL success, with further tuning of DRL methods offering the potential for improved results.
Challenges:
While the vision-based RL approach significantly improved the robot’s performance, we faced several challenges throughout the project:
- Computational Demands: Reinforcement learning and real-time visual processing, especially with stacked frames, required substantial compute. Tasks like this demand significant GPU resources and long training times.
- Integration of Multimodal Sensors: Combining visual, tactile, and proximity sensors is crucial for real-time decision-making on tasks involving human interaction. But this comes with the challenge of increased computational needs and complex planning for accurate interpretation.
- Human Safety: Actions involving contact with the human body require detailed planning. Even though tactile sensors and proximity sensors could help in minimizing the risks, other considerations had to be explored for actions like applying sunscreen and flipping a human.
Future Plans:
- Further develop the vision-based approach for sunscreen application and move beyond object manipulation.
- Include tactile and proximity sensors in future iterations to ensure safe and effective application, and explore additional algorithms such as Soft Actor-Critic (SAC) for better performance.
- Incorporate LLaRA (Large Language and Robotic Assistant) for user-friendliness and safety, and to adapt to different environments and user needs.
- Explore cuRobo and Isaac Sim (which we could not do due to computational restrictions) for advanced robot modelling and high-fidelity physics and rendering.
Conclusion:
In the approaches explored, the coordinate-based approach demonstrated the feasibility of robotic sunscreen application but lacked adaptability. Meanwhile, the vision-based approach showed improved versatility and real-time adaptability, making it more suitable for dynamic environments.
While challenges like computational limitations and human safety should be addressed properly, incorporating advanced sensors and formulating sample-efficient solutions can significantly enhance robotic systems for human-assistive tasks.
In the future of robotics, projects like this stand as a stepping stone to make robots smarter, more adaptable, and ready to assist in ways we have never thought possible.