âœī¸ Manipulation with Two-Stage Training#

This example demonstrates robotic manipulation using a two-stage training paradigm that combines reinforcement learning (RL) and imitation learning (IL). The central idea is to first train a privileged teacher policy using full state information, and then distill that knowledge into a vision-based student policy that relies on camera observations (and optionally robot proprioception). This approach enables efficient learning in simulation while bridging the gap toward real-world deployment where privileged states are unavailable.


Environment Overview#

The manipulation environment is composed of the following elements:

  • Robot: A 7-DoF Franka Panda arm with a parallel-jaw gripper.

  • Object: A box with randomized initial position and orientation, ensuring diverse training scenarios.

  • Cameras: Two stereo RGB cameras (left and right) facing the manipulation scene. Here, we use the Madrona engine for batched rendering.

  • Observations:

    • Privileged state: End-effector pose and object pose (used only during teacher training).

    • Vision state: Stereo RGB images (used by the student policy).

  • Actions: 6-DoF delta end-effector pose commands (3D position + orientation).

  • Rewards: A keypoint alignment reward is used. It defines reference keypoints on the gripper and the object and encourages the gripper to align with a graspable pose (a minimal sketch follows this list).

    • This formulation avoids dense shaping terms and directly encodes task success.

    • Only this reward is required for the policy to learn goal reaching.
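
The exact reward implementation lives in the example code; as a rough illustration of the idea, a keypoint alignment reward can be sketched as follows. The keypoint offsets, kernel, and scale below are assumptions, not the example's actual values:

import torch

def keypoint_alignment_reward(ee_pos, ee_rot, obj_pos, obj_rot, scale=10.0):
    """Toy keypoint-alignment reward (illustrative, not the example's exact formulation).

    ee_pos, obj_pos: (N, 3) world positions; ee_rot, obj_rot: (N, 3, 3) rotation matrices.
    A small set of keypoints is rigidly attached to the gripper frame and to a
    reference grasp frame on the object; the reward grows as the two sets align.
    """
    # Keypoints in the local frame: the origin plus one offset along each axis,
    # so that both position and orientation must match for a high reward.
    local_kps = torch.tensor(
        [[0.00, 0.00, 0.00],
         [0.05, 0.00, 0.00],
         [0.00, 0.05, 0.00],
         [0.00, 0.00, 0.05]], device=ee_pos.device)                    # (K, 3)

    # Transform the keypoints into the world frame for gripper and object.
    ee_kps = ee_pos[:, None, :] + torch.einsum("nij,kj->nki", ee_rot, local_kps)
    obj_kps = obj_pos[:, None, :] + torch.einsum("nij,kj->nki", obj_rot, local_kps)

    # Mean keypoint distance, mapped through a bounded kernel so the reward
    # saturates near zero error and stays in (0, 1].
    dist = (ee_kps - obj_kps).norm(dim=-1).mean(dim=-1)                # (N,)
    return 1.0 / (1.0 + scale * dist * dist)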


RL Training (Stage 1: Teacher Policy)#

In the first stage, we train a teacher policy using Proximal Policy Optimization (PPO) from the RSL-RL library.

Setup:

pip install tensorboard rsl-rl-lib==2.2.4

Training:

python examples/manipulation/grasp_train.py --stage=rl

Monitoring:

tensorboard --logdir=logs

If training is successful, the reward curve should look like the following:

[Figure: reward learning curve during successful training (manipulation_curve.png)]

Key details:

  • Inputs: Privileged state (no images).

  • Outputs: End-effector action commands.

  • Parallelization: Large vectorized rollouts (e.g., 1024–4096 envs) for high throughput.

  • Reward design: Keypoint alignment suffices to produce consistent grasping behavior.

  • Outcome: A lightweight MLP policy that learns stable grasping given ground-truth state information.

The teacher policy serves as the demonstration source for the next stage.
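
For intuition, the teacher is essentially a small actor-critic MLP over the privileged state vector. The sketch below is illustrative only: RSL-RL ships its own actor-critic module that the example configures instead, and the observation dimension and layer sizes here are assumptions.

import torch
import torch.nn as nn

class TeacherActorCritic(nn.Module):
    """Minimal actor-critic MLP over privileged state (illustrative sizes only)."""

    def __init__(self, obs_dim=14, act_dim=6, hidden=(256, 128, 64)):
        # obs_dim is a placeholder, e.g. end-effector pose (7) + object pose (7).
        super().__init__()

        def mlp(out_dim):
            layers, last = [], obs_dim
            for h in hidden:
                layers += [nn.Linear(last, h), nn.ELU()]
                last = h
            return nn.Sequential(*layers, nn.Linear(last, out_dim))

        self.actor = mlp(act_dim)      # mean of a Gaussian over 6-DoF delta-pose actions
        self.critic = mlp(1)           # state-value estimate used by PPO
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def act(self, obs):
        dist = torch.distributions.Normal(self.actor(obs), self.log_std.exp())
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), self.critic(obs)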


Imitation Learning (Stage 2: Student Policy)#

The second stage trains a vision-conditioned student policy that imitates the RL teacher.

Architecture (a minimal sketch follows this list):

  • Encoder: Shared stereo CNN encoder extracts visual features.

  • Fusion network: Merges image features with optional robot proprioception.

  • Heads:

    • Action head: Predicts 6-DoF manipulation actions.

    • Pose head: Auxiliary task to predict object pose (xyz + quaternion).
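
A minimal version of this architecture might look like the sketch below; the channel counts, feature sizes, and input conventions are placeholders rather than the example's actual configuration:

import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Stereo-CNN student with action and auxiliary pose heads (illustrative sizes)."""

    def __init__(self, proprio_dim=0, feat_dim=128):
        super().__init__()
        # Shared CNN encoder applied to both the left and right camera image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Fusion MLP over concatenated stereo features (+ optional proprioception).
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.action_head = nn.Linear(128, 6)  # 6-DoF delta end-effector action
        self.pose_head = nn.Linear(128, 7)    # auxiliary object pose: xyz + quaternion

    def forward(self, img_left, img_right, proprio=None):
        feats = [self.encoder(img_left), self.encoder(img_right)]
        if proprio is not None:
            feats.append(proprio)
        z = self.fusion(torch.cat(feats, dim=-1))
        return self.action_head(z), self.pose_head(z)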

Training Objective:

  • Loss (see the sketch after this list):

    • Action MSE (student vs teacher).

    • Pose loss = position MSE + quaternion distance.

  • Data Collection: Teacher provides online supervision, optionally with DAgger-style corrections to mitigate covariate shift.
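
A minimal sketch of both the loss and a DAgger-style data-collection loop follows. The loss weights, the quaternion-distance form, and the env / teacher / student interfaces are placeholders for illustration, not the example's actual API:

import torch
import torch.nn.functional as F

def bc_loss(student_action, teacher_action, pred_pose, true_pose, pose_weight=0.1):
    """Imitation loss: action MSE plus an auxiliary pose term (illustrative weights)."""
    # Behavior-cloning term: match the teacher's action.
    action_loss = F.mse_loss(student_action, teacher_action)

    # Auxiliary pose term: position MSE plus a quaternion distance that is
    # invariant to the q / -q sign ambiguity.
    pos_loss = F.mse_loss(pred_pose[:, :3], true_pose[:, :3])
    q_pred = F.normalize(pred_pose[:, 3:], dim=-1)
    q_true = F.normalize(true_pose[:, 3:], dim=-1)
    quat_loss = (1.0 - (q_pred * q_true).sum(dim=-1).abs()).mean()

    return action_loss + pose_weight * (pos_loss + quat_loss)

@torch.no_grad()
def collect_dagger_batch(env, teacher, student, steps, beta=0.5):
    """DAgger-style rollout: execute a teacher/student mixture, label with the teacher."""
    obs_buffer, action_labels = [], []
    priv_obs, img_obs = env.reset()                     # placeholder interface
    for _ in range(steps):
        teacher_action = teacher(priv_obs)              # supervision label
        student_action, _ = student(img_obs["left"], img_obs["right"])
        # With probability beta execute the teacher, otherwise the student, so the
        # dataset covers states the student actually visits (mitigating covariate shift).
        use_teacher = (torch.rand(teacher_action.shape[0], 1,
                                  device=teacher_action.device) < beta)
        action = torch.where(use_teacher, teacher_action, student_action)
        obs_buffer.append({k: v.clone() for k, v in img_obs.items()})
        action_labels.append(teacher_action.clone())
        priv_obs, img_obs = env.step(action)            # placeholder interface
    return obs_buffer, action_labels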

Outcome: A vision-only policy capable of generalizing grasping behavior without access to privileged states.

Run training:

python examples/manipulation/grasp_train.py --stage=bc

Evaluation#

Both teacher and student policies can be evaluated in simulation (with or without visualization).

  • Teacher Policy (MLP):

python examples/manipulation/grasp_eval.py --stage=rl

  • Student Policy (CNN+MLP):

python examples/manipulation/grasp_eval.py --stage=bc --record

The student observes the environment via stereo cameras rendered with Madrona.

Logging & Monitoring:

  • Metrics recorded in TensorBoard (logs/grasp_rl/ or logs/grasp_bc/).

  • Periodic checkpoints for both RL and BC stages.


Summary#

This two-stage pipeline illustrates a practical strategy for robotic manipulation:

  1. Teacher policy (RL): Efficient learning with full information.

  2. Student policy (IL): Vision-based control distilled from demonstrations.

The result is a policy that is both sample-efficient in training and robust to realistic perception inputs.