A complete implementation of a robotic handover system using imitation learning and reinforcement learning, featuring multi-modal sensor fusion (vision, proprioception, audio) and deformable object manipulation.
This project implements a 4-step pipeline for training a robot to perform deformable object handovers:
- Data Preparation: Streaming data pipeline from ALOHA and FrodoBots datasets
- Environment Setup: Custom Gymnasium environment with PyBullet physics
- Imitation Learning: DiffusionPolicy pretraining on expert demonstrations
- RL Fine-Tuning: PPO-based fine-tuning with hybrid IL+RL policy
- ✅ Streaming data loading from HuggingFace datasets (125k+ frames)
- ✅ Multi-modal observations: RGB images, force/torque, IMU, audio waveforms
- ✅ Custom Gymnasium environment with deformable object physics
- ✅ Hybrid IL+RL architecture with Transformer fusion
- ✅ Proper PPO implementation with GAE, stochastic policies, mini-batch updates
- ✅ Reward shaping and curriculum learning strategies
- ✅ GPU-accelerated training on cloud infrastructure (Vast.ai)
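The mini-batch PPO update named in the feature list optimizes a clipped surrogate objective. Below is a minimal sketch of that loss over one mini-batch, in plain Python rather than the project's PyTorch code. The function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions for illustration, not values taken from this repository.

```python
import math
from typing import Sequence


def ppo_clip_loss(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # assumed default, not taken from the repo
) -> float:
    """Clipped surrogate loss (to be minimized) for one mini-batch."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated and the data-collecting policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)
```

When the new and old log-probabilities are equal the ratio is 1, so the loss reduces to the negative mean advantage. Clipping only takes effect once the policy has moved away from the one that collected the data.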
Multi-Modal Encoder (Transformer Fusion)
├── Image Encoder: CNN (3x84x84 → 256)
├── Proprioception Encoder: MLP (effort + IMU → 256)
└── Audio Encoder: 1D CNN + Adaptive Pooling (16kHz → 256)
↓
Transformer Encoder (3 layers, 8 heads)
↓
Policy Head (Stochastic)
├── Mean Network: MLP (256 → 6)
└── Learned Log-Std: Parameter (6)
↓
6DoF End-Effector Control
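The stochastic policy head above outputs a 6-dimensional mean and keeps a learned log-standard-deviation per action dimension. As a rough sketch of what that implies, the snippet below samples an action and evaluates its diagonal-Gaussian log-probability in plain Python. The function names are hypothetical, and the real implementation would operate on batched tensors.

```python
import math
import random
from typing import List, Sequence


def gaussian_log_prob(
    action: Sequence[float], mean: Sequence[float], log_std: Sequence[float]
) -> float:
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        var = math.exp(2.0 * ls)
        total += -0.5 * (a - m) ** 2 / var - ls - 0.5 * math.log(2.0 * math.pi)
    return total


def sample_action(
    mean: Sequence[float], log_std: Sequence[float], rng: random.Random
) -> List[float]:
    """Draw one action: mean + std * standard-normal noise, per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


# Illustrative usage with a zero mean and unit standard deviation.
rng = random.Random(0)
mean, log_std = [0.0] * 6, [0.0] * 6
action = sample_action(mean, log_std, rng)
log_p = gaussian_log_prob(action, mean, log_std)
```

Because the log-std is a free parameter rather than a network output, the exploration noise is shared across all observations and changes only through gradient updates.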
- Observation Space: Dict with image (84×84×3), effort (6), IMU (6), audio (16000)
- Action Space: Box(6) - normalized 6DoF end-effector velocities
- Physics: PyBullet soft-body simulation for deformable towel
- Reward: Progress-based with contact detection and success bonus
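The audio entry of the observation is a fixed-length waveform, but the feature sequence coming out of a 1D CNN can vary in length with kernel and stride choices. Adaptive average pooling maps any such sequence to a fixed number of bins. The sketch below shows the idea in plain Python using a floor/ceil windowing rule commonly used by adaptive pooling layers. The function name is hypothetical, and the project itself presumably relies on a framework layer rather than this code.

```python
from typing import List, Sequence


def adaptive_avg_pool_1d(seq: Sequence[float], out_size: int) -> List[float]:
    """Average-pool `seq` into exactly `out_size` bins, whatever its length."""
    n = len(seq)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size             # floor(i * n / out_size)
        end = -((-(i + 1) * n) // out_size)     # ceil((i + 1) * n / out_size)
        window = seq[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


# Windows may overlap when the length is not a multiple of out_size.
print(adaptive_avg_pool_1d([1, 2, 3, 4, 5], 3))  # [1.5, 3.0, 4.5]
```

Since the output length depends only on `out_size`, downstream layers see the same shape regardless of the input length, which is one way to avoid dimension mismatches in an audio encoder.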
collaborative-robot/
├── README.md # This file
├── requirements.txt # Python dependencies
│
├── data_preparation.py # Step 1: Data pipeline
├── deformable_handover_env.py # Step 2: Gymnasium environment
├── diffusion_policy.py # Step 3: IL policy architecture
├── train_step3.py # Step 3: IL training script
├── train_step4_rl.py # Step 4: RL fine-tuning
├── train_step4_curriculum.py # Step 4: Curriculum learning
│
├── verify_step1.py # Verification scripts
├── verify_step2.py
├── test_environment.py
├── integration_test.py
├── debug_single_episode.py
│
└── docs/ # Documentation
├── STEP1_COMPLETION_SUMMARY.md
├── STEP2_COMPLETION_SUMMARY.md
├── STEP2_QUICKSTART.md
├── REWARD_SHAPING_FIXES.md
├── PPO_FIXES_SUMMARY.md
├── PLATEAU_BREAKING_FIXES.md
├── FINAL_TRAINING_FIXES.md
└── CURRICULUM_LEARNING_APPROACH.md
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (recommended)
# Clone repository
git clone https://github.com/yourusername/collaborative-robot.git
cd collaborative-robot
# Install dependencies
pip install -r requirements.txt
python3 data_preparation.py
Loads and augments the ALOHA dataset and creates a streaming generator that yields batches of observations and actions.
Output: Validated data pipeline with 125k+ frames ready for training.
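The streaming generator pattern described for this step can be sketched as follows. This is not the code in `data_preparation.py`: the function name `stream_batches`, the `augment` callback, and the frame schema are hypothetical, and the sketch assumes the 5x multiplier means five augmented variants per source frame.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Frame = Dict[str, Any]  # hypothetical schema, e.g. {"image": ..., "action": ...}


def stream_batches(
    frames: Iterable[Frame],
    batch_size: int,
    augment: Callable[[Frame, int], Frame],
    copies: int = 5,  # assumed reading of the "5x multiplier"
) -> Iterator[List[Frame]]:
    """Yield fixed-size batches without materializing the whole dataset."""
    batch: List[Frame] = []
    for frame in frames:
        for k in range(copies):
            # Augmentation happens on the fly, one variant at a time.
            batch.append(augment(frame, k))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Emit the final partial batch instead of dropping it.
        yield batch
```

Only one batch is held in memory at a time, so the same loop works whether `frames` is a small list or an iterator over a streamed dataset.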
python3 verify_step2.py
Creates a custom Gymnasium environment with:
- Multi-modal observations
- Deformable object physics
- Reward shaping
- Safety constraints
Output: Working environment with 3-5 average reward.
python3 train_step3.py
Trains DiffusionPolicy on expert demonstrations:
- 100 epochs, batch size 64
- Adam optimizer, lr=1e-4
- MSE loss on action predictions
Output: IL checkpoint with validation MSE ≤ 0.15 ✅
Achieved: MSE = 0.027 (validation)
# Standard training
python3 train_step4_rl.py
# Curriculum learning
python3 train_step4_curriculum.py
PPO-based fine-tuning with:
- Hybrid IL+RL policy blending
- Multi-modal Transformer fusion
- Safety constraints
- Curriculum learning (3 stages)
Target: 85% success rate on handover task
Achieved: See Results section
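One simple way to blend an IL policy with an RL policy is a weighted average of their actions, with the IL weight annealed toward zero as training progresses. The sketch below illustrates that idea only. The linear schedule, the function names, and the assumption that normalized actions lie in [-1, 1] are not taken from `train_step4_rl.py`, which may blend the two policies differently.

```python
from typing import List, Sequence


def il_weight(step: int, anneal_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Linearly anneal the IL weight from `start` to `end` over `anneal_steps`."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start + frac * (end - start)


def blend_actions(
    il_action: Sequence[float], rl_action: Sequence[float], w_il: float
) -> List[float]:
    """Convex combination of both actions, clipped to the assumed [-1, 1] range."""
    return [
        max(-1.0, min(1.0, w_il * a + (1.0 - w_il) * b))
        for a, b in zip(il_action, rl_action)
    ]


# Halfway through annealing, both policies contribute equally.
w = il_weight(step=50, anneal_steps=100)
action = blend_actions([1.0] * 6, [0.0] * 6, w)
```

Starting with a weight near 1 keeps early rollouts close to the demonstrations, while the decay lets the RL policy take over once it has useful gradients.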
- Data Pipeline: Successfully loads and augments 125k frames from HuggingFace datasets
- IL Training: Achieved 0.027 validation MSE (target: ≤0.15)
- Environment: Stable simulation with multi-modal observations
- Policy Architecture: Proper Transformer fusion of vision, proprioception, and audio
- PPO Implementation: Stable training with proper GAE, stochastic policies, mini-batching
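The "proper GAE" mentioned above depends on not bootstrapping across episode boundaries. A minimal plain-Python sketch of Generalized Advantage Estimation with that handling is shown below. The function name and the default `gamma` and `lam` values are assumptions rather than the project's settings.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,  # assumed default
    lam: float = 0.95,    # assumed default
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one rollout of length T.

    values[t] is V(s_t); last_value is V(s_T), used only if the rollout
    was cut off before the final episode ended.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        # A finished episode contributes no bootstrap value and no carried GAE.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [adv + v for adv, v in zip(advantages, values)]
    return advantages, returns
```

With `dones[t]` set at the last step of an episode, the advantage at that step uses only the immediate reward, so reward from the next episode in the buffer never leaks backward.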
Reward Engineering: The primary challenge was reward function design:
- Dense Rewards: The policy found an exploit, earning +17 reward without completing the handover by "holding still near human"
- Sparse Rewards: The policy failed to learn (0% success rate) because there was no gradient signal to learn from
- Curriculum Learning: Not enough to overcome the sparse-reward problem
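The dense-versus-sparse trade-off above can be illustrated with a toy calculation. The reward terms and every number below are hypothetical and are not this project's reward function. They only show how a per-step proximity bonus lets an idle policy out-earn one that actually completes the handover.

```python
from typing import Sequence


def dense_return(
    distances: Sequence[float],
    handed_over: bool,
    near_radius: float = 0.1,     # hypothetical threshold
    near_bonus: float = 0.5,      # hypothetical per-step bonus
    success_bonus: float = 10.0,  # hypothetical terminal bonus
) -> float:
    """Episode return: per-step bonus when near the human, plus success bonus."""
    shaped = sum(near_bonus for d in distances if d < near_radius)
    return shaped + (success_bonus if handed_over else 0.0)


def sparse_return(handed_over: bool, success_bonus: float = 10.0) -> float:
    """Episode return when only a completed handover is rewarded."""
    return success_bonus if handed_over else 0.0


# Hovering near the human for a full 70-step episode without handing over.
idle = dense_return([0.05] * 70, handed_over=False)
# Approaching for 16 steps, then completing the handover after 4 near steps.
honest = dense_return([0.5] * 16 + [0.05] * 4, handed_over=True)
print(idle, honest)  # 35.0 12.0
```

Under the sparse variant the idle episode earns nothing, which removes the exploit but also removes any signal until the first successful handover occurs.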
Training Results:
- 500k steps (dense rewards): 3% success (plateau at local optimum)
- 500k steps (sparse rewards): 0% success (no learning)
- 450k steps (curriculum): 0% success across all stages
- Task Difficulty: Deformable object manipulation with multi-modal RL is at the edge of tractability with current methods
- Reward Hacking: Dense reward shaping is susceptible to exploitation
- Exploration: Sparse rewards require prohibitive exploration in high-dimensional spaces
- IL Foundation: Strong IL initialization (0.027 MSE) was not sufficient to bootstrap RL
Despite not achieving the target success rate, this project demonstrates:
- Streaming data loading for large datasets
- On-the-fly augmentation (5x multiplier)
- Memory-efficient generator pattern
- Proper train/val split
- Stochastic policies with learned variance
- True Gaussian log probabilities (not heuristics)
- GAE with proper episode boundary handling
- Mini-batch PPO updates
- Gradient clipping and normalization
- Transformer-based sensor fusion
- Proper normalization for different modalities
- Audio processing with adaptive pooling
- Efficient batched operations
- Fixed audio encoder dimension mismatches
- Resolved action space shape issues
- Corrected observation normalization
- Addressed reward exploitation
- Simpler Task First: Start with rigid objects before deformable
- Shorter Horizon: 30 steps instead of 70
- Denser Success Signal: Multiple levels of partial credit
- Demonstration Quality: More diverse handover demonstrations
- Model-Based Components: Hybrid model-free + model-based approach
- ✅ Comprehensive logging and diagnostics
- ✅ Incremental verification at each step
- ✅ Proper tensor shape handling
- ✅ Cloud GPU utilization
- ✅ Documentation throughout development
- Simplify Task: Rigid object handover (remove deformability)
- Increase Demonstrations: Collect more diverse IL data
- Hybrid Approaches: Add model predictive control
- Reduce Horizon: 30-step episodes
- Better Success Detection: Multiple handover stages
- Inverse RL: Learn reward function from demonstrations
- Hierarchical RL: Separate grasping and reaching skills
- Sim-to-Real: Domain randomization and reality gap bridging
- Multi-Task Learning: Pretrain on related tasks
If you use this code in your research, please cite:
@software{collaborative_robot_2025,
title={Collaborative Robot: Deformable Object Handover with Multi-Modal RL},
author={Your Name},
year={2025},
url={https://github.com/yourusername/collaborative-robot}
}
MIT License - see LICENSE file for details
- ALOHA dataset for demonstration data
- LeRobot framework for policy architecture inspiration
- Stable-Baselines3 for RL reference implementation
- PyBullet for physics simulation
For questions or collaboration opportunities:
- GitHub Issues: github.com/yourusername/collaborative-robot/issues
- Email: your.email@example.com
Project Status: Complete implementation with documented challenges and insights. Suitable for educational purposes, research baselines, and further development.
Last Updated: December 2025