PyTorch implementation of RECAP, based on the Physical Intelligence blog post "π*0.6: a VLA that Learns from Experience".
RECAP stands for RL with Experience & Corrections via Advantage-conditioned Policies. It implements a three-stage training process for Vision-Language-Action (VLA) models:
- Demonstrations: Supervised learning from expert demonstrations
- Corrections: Learning from expert interventions when the robot makes mistakes
- Autonomous Experience: Reinforcement learning with advantage-conditioned policies
The key innovation is using a value function for credit assignment and conditioning the policy on advantage values, enabling the model to learn from both good and bad experiences.
Features:
- LeRobot dataset v2.1 format support
- Configurable HuggingFace tokenizers for text
- Configurable CLIP ViT encoders for vision
- Advantage-conditioned policy training
- Value function for credit assignment
- Three-stage training pipeline
The following steps have been tested with CUDA Version: 12.4.
- Clone this repository and navigate to the pi06 directory:
      git clone https://github.com/nahidalam/pi06
      cd pi06
- Install the package:
      conda create -n pi06 python=3.11 -y
      conda activate pi06
      pip install --upgrade pip  # enable PEP 660 support
      pip install -e .
- Install additional packages for training (optional):
      pip install -e ".[train]"
- Prepare your LeRobot v2.1 dataset (or use an existing one). The dataset should follow the LeRobot v2.1 format:
      <dataset_name>/
      ├── data/chunk-000/episode_*.parquet
      ├── videos/chunk-000/observation.images.*/episode_*.mp4
      └── meta/episodes.jsonl
- Create or edit the config file (src/pi06/configs/recap_config.yaml):
      dataset:
        path: "path/to/lerobot/dataset"  # Path to dataset root directory
        batch_size: 1
        chunk_id: "chunk-000"  # Chunk identifier
        image_keys: ["observation.images.main"]  # Camera keys
      model:
        action_dim: 7  # Adjust for your robot
      training:
        demo_epochs: 10
        correction_epochs: 5
        autonomous_epochs: 20
- Run training:
      python -m pi06.train --config src/pi06/configs/recap_config.yaml
Policy network:
- Vision Encoder: CLIP ViT (configurable)
- Language Encoder: HuggingFace transformer (configurable)
- Fusion Layer: Combines vision and language features
- Action Expert: MLP head for action prediction
- Conditioning: Supports advantage conditioning
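To make the conditioning item above concrete, here is a minimal sketch of one way an advantage estimate could be turned into a conditioning signal: threshold it into a coarse label and append that label to the language instruction. The function names, the default threshold, and the prompt wording are all hypothetical and are not taken from this repository.

```python
from typing import Optional


def advantage_indicator(advantage: float, threshold: float = 0.0) -> str:
    """Map a scalar advantage to a coarse label (hypothetical scheme)."""
    return "positive" if advantage > threshold else "negative"


def build_conditioned_prompt(instruction: str, advantage: Optional[float]) -> str:
    """Append an advantage label to the instruction.

    Passing None leaves the instruction unconditioned.
    """
    if advantage is None:
        return instruction
    return f"{instruction} Advantage: {advantage_indicator(advantage)}"


if __name__ == "__main__":
    print(build_conditioned_prompt("pick up the cup", 0.8))
    print(build_conditioned_prompt("pick up the cup", -0.3))
    print(build_conditioned_prompt("pick up the cup", None))
```

Under this scheme, a policy trained on both labels could be prompted with the "positive" label at inference time, so that it imitates the behavior that scored better than the threshold.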
Value function:
- Predicts expected future return from state features
- Used for credit assignment via GAE (Generalized Advantage Estimation)
- Enables learning from both good and bad experiences
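The GAE computation mentioned above can be sketched in plain Python for a single episode. This is the standard GAE recursion, not code copied from this repository; the function name and the default values for `gamma` and `lam` are assumptions, and in practice they would come from the `training.gamma` and `training.lambda` config options.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Return GAE advantages for a single episode.

    `values` must hold one more entry than `rewards`: the final entry is the
    bootstrap value of the state after the last step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # With gamma = lam = 1 and zero values, advantages equal the reward-to-go.
    print(compute_gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

A positive advantage means a step did better than the value function expected, and a negative one means it did worse. That signal is what lets training use both good and bad experience.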
Training stages:
- Demonstrations: Supervised learning to match expert actions
- Corrections: Learn recovery strategies from expert interventions
- Autonomous: RL training with advantage-conditioned policy
Key configuration options in recap_config.yaml:
- model.action_dim: Dimension of action space
- model.vision_model_name: CLIP model for vision
- model.text_model_name: HuggingFace model for text
- training.gamma: Discount factor for RL
- training.lambda: GAE lambda parameter
- training.value_loss_weight: Weight for value function loss
- training.policy_loss_weight: Weight for policy loss
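As an illustration of how the two loss weights listed above might interact, the sketch below combines a policy loss and a value loss into one objective as a weighted sum. The weighted-sum form, the function name, and the numeric values are assumptions made for this example; check the training code for the exact objective.

```python
from typing import Any, Mapping


def combine_losses(policy_loss, value_loss, training_cfg: Mapping[str, Any]):
    """Weighted sum of the two losses.

    Works for plain floats and for any numeric type that supports
    multiplication and addition (for example tensors).
    """
    return (
        training_cfg["policy_loss_weight"] * policy_loss
        + training_cfg["value_loss_weight"] * value_loss
    )


if __name__ == "__main__":
    # Hypothetical values for the `training` section of recap_config.yaml.
    training_cfg = {
        "gamma": 0.99,
        "lambda": 0.95,
        "value_loss_weight": 0.5,
        "policy_loss_weight": 1.0,
    }
    print(combine_losses(2.0, 4.0, training_cfg))
```

With these example weights, the value loss counts half as much as the policy loss toward the total.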
Metrics are logged to WandB:
- train/demo_loss: Demonstration training loss
- train/policy_loss: Policy loss (autonomous training)
- train/value_loss: Value function loss
- train/advantage_mean: Mean advantage values
- train/entropy: Policy entropy
Checkpoints are saved after each training stage:
- checkpoint_demo.pt: After demonstration training
- checkpoint_correction.pt: After correction training
- checkpoint_final.pt: After autonomous training
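The relationship between stages, epoch settings, and checkpoint files can be summarized in a small sketch. The `STAGES` table and the `remaining_stages` helper are hypothetical names for this example. The idea that a run started from a checkpoint skips the stages that produced it is an assumption, not documented behavior of `pi06.train`.

```python
from typing import Dict, List, Optional, Tuple

# (stage name, epoch key in the `training` config, checkpoint written afterwards)
STAGES: List[Tuple[str, str, str]] = [
    ("demo", "demo_epochs", "checkpoint_demo.pt"),
    ("correction", "correction_epochs", "checkpoint_correction.pt"),
    ("autonomous", "autonomous_epochs", "checkpoint_final.pt"),
]


def remaining_stages(
    training_cfg: Dict[str, int], last_checkpoint: Optional[str] = None
) -> List[Tuple[str, int]]:
    """List (stage, epochs) pairs still to run after `last_checkpoint`."""
    names = [ckpt for _, _, ckpt in STAGES]
    start = 0
    if last_checkpoint is not None:
        start = names.index(last_checkpoint) + 1  # raises ValueError if unknown
    return [(stage, training_cfg[key]) for stage, key, _ in STAGES[start:]]


if __name__ == "__main__":
    cfg = {"demo_epochs": 10, "correction_epochs": 5, "autonomous_epochs": 20}
    print(remaining_stages(cfg))
    print(remaining_stages(cfg, "checkpoint_demo.pt"))
```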
Resume training with:
    python -m pi06.train --config src/pi06/configs/recap_config.yaml --checkpoint checkpoints/checkpoint_demo.pt
The implementation expects the LeRobot v2.1 format with the following structure:
- Data: Episode data stored as parquet files in data/chunk-*/episode_*.parquet
- Videos: MP4 video files in videos/chunk-*/observation.images.*/episode_*.mp4
- Metadata: Episode metadata in meta/episodes.jsonl
- Supports: Multiple camera views, episode type filtering (demo/correction/autonomous)
See the LeRobot documentation for details on the v2.1 format.
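Before starting a long training run, it can help to confirm that a dataset directory matches the layout described above. The following sketch uses only the Python standard library and does not decode any parquet or video data. The function name and the keys of the returned dictionary are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def inspect_lerobot_dataset(root: str, chunk_id: str = "chunk-000") -> Dict[str, object]:
    """Count the files the expected layout describes for one chunk."""
    root_path = Path(root)
    parquet_files = sorted((root_path / "data" / chunk_id).glob("episode_*.parquet"))
    video_files = sorted(
        (root_path / "videos" / chunk_id).glob("observation.images.*/episode_*.mp4")
    )
    meta_path = root_path / "meta" / "episodes.jsonl"
    episodes: List[dict] = []
    if meta_path.is_file():
        with meta_path.open("r", encoding="utf-8") as f:
            # JSON Lines: one JSON object per non-empty line.
            episodes = [json.loads(line) for line in f if line.strip()]
    return {
        "num_parquet": len(parquet_files),
        # Each camera has its own directory, e.g. observation.images.main.
        "camera_keys": sorted({p.parent.name for p in video_files}),
        "num_episodes_in_meta": len(episodes),
    }


if __name__ == "__main__":
    print(inspect_lerobot_dataset("path/to/lerobot/dataset"))
```

The camera keys reported here should match the `image_keys` entry in the config, and the parquet count should normally agree with the number of episodes in the metadata.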
This project is not affiliated with Physical Intelligence and is provided as-is for research purposes.