# ✍️ Manipulation with Two-Stage Training
This example demonstrates robotic manipulation using a **two-stage training paradigm** that combines **reinforcement learning (RL)** and **imitation learning (IL)**. The central idea is to first train a **privileged teacher policy** using full state information, and then distill that knowledge into a **vision-based student policy** that relies on camera observations (and optionally robot proprioception).
This approach enables efficient learning in simulation while bridging the gap toward real-world deployment where privileged states are unavailable.
---
## Environment Overview
The manipulation environment is composed of the following elements:
* **Robot:** A 7-DoF Franka Panda arm with a parallel-jaw gripper.
* **Object:** A box with randomized initial position and orientation, ensuring diverse training scenarios.
* **Cameras:** Two stereo RGB cameras (left and right) facing the manipulation scene. Here, we use the [Madrona Engine](https://madrona-engine.github.io/) for batch rendering.
* **Observations:**
* **Privileged state:** End-effector pose and object pose (used only during teacher training).
* **Vision state:** Stereo RGB images (used by the student policy).
* **Actions:** 6-DoF delta end-effector pose commands (3D position + orientation).
* **Rewards:** A **keypoint alignment** reward is used: matching reference keypoints are attached to the gripper and the object, and the reward encourages the corresponding keypoints to coincide, driving the gripper toward a graspable pose (a minimal sketch follows this list).
* This formulation avoids dense shaping terms and directly encodes task success.
* Only this reward is required for the policy to learn goal reaching.
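The keypoint reward can be expressed in a few lines of batched PyTorch. The sketch below is only illustrative (the function name, keypoint offsets, and distance-to-reward mapping are assumptions, not the exact formulation in `grasp_train.py`): the same local keypoints are attached to the gripper and the object, transformed into the world frame, and the reward increases as corresponding keypoints coincide.

```python
import torch

def keypoint_alignment_reward(ee_pos, ee_rot, obj_pos, obj_rot, offsets, scale=5.0):
    """Reward that grows as gripper keypoints coincide with object keypoints.

    ee_pos, obj_pos: (B, 3) positions; ee_rot, obj_rot: (B, 3, 3) rotation
    matrices; offsets: (K, 3) keypoint offsets expressed in the local frame.
    """
    # Transform the same local keypoints into the world frame of each body.
    ee_kp = ee_pos[:, None, :] + torch.einsum("bij,kj->bki", ee_rot, offsets)
    obj_kp = obj_pos[:, None, :] + torch.einsum("bij,kj->bki", obj_rot, offsets)
    # Per-keypoint distance, mapped to (0, 1] and averaged over keypoints.
    dist = (ee_kp - obj_kp).norm(dim=-1)            # (B, K)
    return torch.exp(-scale * dist).mean(dim=-1)    # (B,)

# Example with a batch of 1024 envs and 4 keypoints per body (offsets assumed).
B = 1024
offsets = torch.tensor([[0.00, 0.00, 0.00],
                        [0.05, 0.00, 0.00],
                        [0.00, 0.05, 0.00],
                        [0.00, 0.00, 0.05]])
reward = keypoint_alignment_reward(torch.randn(B, 3), torch.eye(3).expand(B, 3, 3),
                                   torch.randn(B, 3), torch.eye(3).expand(B, 3, 3),
                                   offsets)          # (B,)
```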
---
## RL Training (Stage 1: Teacher Policy)
In the first stage, we train a teacher policy using **Proximal Policy Optimization (PPO)** from the [RSL-RL library](https://github.com/leggedrobotics/rsl_rl).
**Setup:**
```bash
pip install tensorboard rsl-rl-lib==2.2.4
```
**Training:**
```bash
python examples/manipulation/grasp_train.py --stage=rl
```
**Monitoring:**
```bash
tensorboard --logdir=logs
```
If training is successful, the reward curve should look like the following:
```{figure} ../../_static/images/manipulation_curve.png
```
**Key details:**
* **Inputs:** Privileged state (no images).
* **Outputs:** End-effector action commands.
* **Parallelization:** Large vectorized rollouts (e.g., 1024–4096 envs) for fast throughput.
* **Reward design:** Keypoint alignment suffices to produce consistent grasping behavior.
* **Outcome:** A lightweight MLP policy that learns stable grasping given ground-truth state information (sketched below).
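Concretely, the teacher can be a small Gaussian MLP actor over the privileged state, as used by PPO. Below is a minimal PyTorch sketch with assumed observation and hidden sizes; the actual actor-critic is constructed by rsl_rl in `grasp_train.py`.

```python
import torch
import torch.nn as nn

class TeacherActor(nn.Module):
    """Gaussian MLP policy over 6-DoF delta end-effector actions.

    Dimensions are assumptions: here the privileged observation is taken to be
    the end-effector pose (pos + quat) and object pose (pos + quat), 14 numbers.
    """
    def __init__(self, obs_dim=14, act_dim=6, hidden=(256, 128, 64)):
        super().__init__()
        layers, last = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ELU()]
            last = h
        layers.append(nn.Linear(last, act_dim))
        self.mu = nn.Sequential(*layers)                 # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

# Example: sample actions for a batch of 1024 parallel envs.
actor = TeacherActor()
obs = torch.randn(1024, 14)
dist = actor(obs)
actions = dist.sample()                      # (1024, 6) delta EE pose commands
log_prob = dist.log_prob(actions).sum(-1)    # used by the PPO update
```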
The teacher policy serves as the demonstration source for the next stage.
---
## Imitation Learning (Stage 2: Student Policy)
The second stage trains a **vision-conditioned student policy** that imitates the RL teacher.
**Architecture** (a minimal sketch follows this list):
* **Encoder:** Shared stereo CNN encoder extracts visual features.
* **Fusion network:** Merges image features with optional robot proprioception.
* **Heads:**
* **Action head:** Predicts 6-DoF manipulation actions.
* **Pose head:** Auxiliary task to predict object pose (xyz + quaternion).
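A minimal PyTorch sketch of this architecture is shown below. Image resolution, channel counts, feature sizes, and the proprioception dimension are assumptions for illustration; see `grasp_train.py` for the actual network.

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Stereo-vision student: shared CNN encoder, fusion MLP, two heads."""
    def __init__(self, proprio_dim=7, feat_dim=128):
        super().__init__()
        # Shared encoder applied to both the left and right RGB image.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Fuse left/right image features with optional proprioception.
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.action_head = nn.Linear(128, 6)   # 6-DoF delta EE pose
        self.pose_head = nn.Linear(128, 7)     # object xyz + quaternion

    def forward(self, left_img, right_img, proprio):
        feats = torch.cat(
            [self.encoder(left_img), self.encoder(right_img), proprio], dim=-1
        )
        h = self.fusion(feats)
        return self.action_head(h), self.pose_head(h)

# Example forward pass with assumed 128x128 stereo images.
policy = StudentPolicy()
left, right = torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128)
action, pose = policy(left, right, torch.randn(8, 7))   # (8, 6) and (8, 7)
```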
**Training Objective** (a loss sketch follows this list):
* **Loss:**
* Action MSE (student vs teacher).
* Pose loss = position MSE + quaternion distance.
* **Data Collection:** Teacher provides online supervision, optionally with **DAgger-style corrections** to mitigate covariate shift.
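The objective can be sketched as follows; the pose-loss weighting and the sign-invariant quaternion distance are assumptions for illustration, not the exact terms used in `grasp_train.py`.

```python
import torch
import torch.nn.functional as F

def bc_loss(pred_action, teacher_action, pred_pose, true_pose, pose_weight=0.1):
    """Behavior-cloning loss: action MSE plus an auxiliary object-pose loss.

    pred_pose / true_pose are (B, 7) tensors: xyz position + unit quaternion.
    """
    action_loss = F.mse_loss(pred_action, teacher_action)

    pos_loss = F.mse_loss(pred_pose[:, :3], true_pose[:, :3])
    q_pred = F.normalize(pred_pose[:, 3:], dim=-1)
    q_true = F.normalize(true_pose[:, 3:], dim=-1)
    # Quaternion distance that is invariant to the q / -q sign ambiguity.
    quat_loss = (1.0 - (q_pred * q_true).sum(-1).abs()).mean()

    return action_loss + pose_weight * (pos_loss + quat_loss)
```

With DAgger-style corrections, the teacher is queried on the states the student actually visits, and the resulting (observation, teacher action) pairs supply the `teacher_action` targets above.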
**Outcome:** A vision-only policy capable of generalizing grasping behavior without access to privileged states.
**Run training:**
```bash
python examples/manipulation/grasp_train.py --stage=bc
```
---
## Evaluation
Both teacher and student policies can be evaluated in simulation (with or without visualization).
* **Teacher Policy (MLP):**
```bash
python examples/manipulation/grasp_eval.py --stage=rl
```
* **Student Policy (CNN+MLP):**
```bash
python examples/manipulation/grasp_eval.py --stage=bc --record
```
The student observes the environment via stereo cameras rendered with Madrona.
**Logging & Monitoring:**
* Metrics recorded in TensorBoard (`logs/grasp_rl/` or `logs/grasp_bc/`).
* Periodic checkpoints for both RL and BC stages.
---
## Summary
This two-stage pipeline illustrates a practical strategy for robotic manipulation:
1. **Teacher policy (RL):** Efficient learning with full information.
2. **Student policy (IL):** Vision-based control distilled from demonstrations.
The result is a policy that is sample-efficient to train and operates directly from realistic perception inputs.