Collaborative Robot - Deformable Object Handover with Multi-Modal RL

A complete implementation of a robotic handover system using imitation learning and reinforcement learning, featuring multi-modal sensor fusion (vision, proprioception, audio) and deformable object manipulation.

Project Overview

This project implements a 4-step pipeline for training a robot to perform deformable object handovers:

1. Data Preparation: Streaming data pipeline from ALOHA and FrodoBots datasets
2. Environment Setup: Custom Gymnasium environment with PyBullet physics
3. Imitation Learning: DiffusionPolicy pretraining on expert demonstrations
4. RL Fine-Tuning: PPO-based fine-tuning with hybrid IL+RL policy

Key Features

✅ Streaming data loading from HuggingFace datasets (125k+ frames)
✅ Multi-modal observations: RGB images, force/torque, IMU, audio waveforms
✅ Custom Gymnasium environment with deformable object physics
✅ Hybrid IL+RL architecture with Transformer fusion
✅ Proper PPO implementation with GAE, stochastic policies, mini-batch updates
✅ Reward shaping and curriculum learning strategies
✅ GPU-accelerated training on cloud infrastructure (Vast.ai)

Architecture

Neural Network Architecture

Multi-Modal Encoder (Transformer Fusion)
├── Image Encoder: CNN (3x84x84 → 256)
├── Proprioception Encoder: MLP (effort + IMU → 256)
└── Audio Encoder: 1D CNN + Adaptive Pooling (16kHz → 256)
    ↓
Transformer Encoder (3 layers, 8 heads)
    ↓
Policy Head (Stochastic)
├── Mean Network: MLP (256 → 6)
└── Learned Log-Std: Parameter (6)
    ↓
6DoF End-Effector Control
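
As a rough illustration of the stochastic policy head in the diagram, the sketch below samples a 6-D action from a diagonal Gaussian and evaluates its log-probability. It uses only the standard library. The names `sample_action` and `gaussian_log_prob` and the example values are assumptions made for illustration, not code from this repository.

```python
import math
import random

ACTION_DIM = 6  # 6DoF end-effector command, per the diagram above


def sample_action(mean, log_std, rng):
    """Draw one action from a diagonal Gaussian N(mean, exp(log_std)^2)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def gaussian_log_prob(action, mean, log_std):
    """Sum of per-dimension Gaussian log-densities (dimensions are independent)."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        z = (a - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


rng = random.Random(0)
mean = [0.0] * ACTION_DIM      # placeholder for the mean network output
log_std = [-0.5] * ACTION_DIM  # placeholder for the learned log-std parameter
action = sample_action(mean, log_std, rng)
log_prob = gaussian_log_prob(action, mean, log_std)
```

Parameterizing the spread as a log-std keeps the standard deviation positive without any constraint on the learned parameter.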

Environment

Observation Space: Dict with image (84×84×3), effort (6), IMU (6), audio (16000)
Action Space: Box(6) - normalized 6DoF end-effector velocities
Physics: PyBullet soft-body simulation for deformable towel
Reward: Progress-based with contact detection and success bonus
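
A reward of the shape described above (progress term, contact detection, success bonus) might look like the following sketch. The weights, the distance-based progress measure, and the names `RewardConfig` and `step_reward` are all hypothetical; the README does not state the actual formula or constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # All weights are hypothetical; the real values are not given in this README.
    progress_scale: float = 1.0
    contact_bonus: float = 0.1
    success_bonus: float = 10.0


def step_reward(prev_dist, curr_dist, in_contact, handed_over, cfg=RewardConfig()):
    """Reward for a single environment step.

    prev_dist / curr_dist: object-to-human distance before and after the step
    in_contact: whether the gripper touches the object on this step
    handed_over: whether the handover completed on this step
    """
    # Progress is the reduction in distance, so standing still earns nothing here.
    reward = cfg.progress_scale * (prev_dist - curr_dist)
    if in_contact:
        reward += cfg.contact_bonus
    if handed_over:
        reward += cfg.success_bonus
    return reward
```

For example, `step_reward(0.5, 0.4, True, False)` combines 0.1 of progress with the 0.1 contact bonus, while a step with no movement and no contact yields zero.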

Project Structure

collaborative-robot/
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── data_preparation.py                # Step 1: Data pipeline
├── deformable_handover_env.py         # Step 2: Gymnasium environment  
├── diffusion_policy.py                # Step 3: IL policy architecture
├── train_step3.py                     # Step 3: IL training script
├── train_step4_rl.py                  # Step 4: RL fine-tuning
├── train_step4_curriculum.py          # Step 4: Curriculum learning
│
├── verify_step1.py                    # Verification scripts
├── verify_step2.py
├── test_environment.py
├── integration_test.py
├── debug_single_episode.py
│
└── docs/                              # Documentation
    ├── STEP1_COMPLETION_SUMMARY.md
    ├── STEP2_COMPLETION_SUMMARY.md
    ├── STEP2_QUICKSTART.md
    ├── REWARD_SHAPING_FIXES.md
    ├── PPO_FIXES_SUMMARY.md
    ├── PLATEAU_BREAKING_FIXES.md
    ├── FINAL_TRAINING_FIXES.md
    └── CURRICULUM_LEARNING_APPROACH.md

Installation

Requirements

Python 3.8+
PyTorch 2.0+
CUDA-capable GPU (recommended)

Setup

# Clone repository
git clone https://github.com/yourusername/collaborative-robot.git
cd collaborative-robot

# Install dependencies
pip install -r requirements.txt

Usage

Step 1: Data Preparation

python3 data_preparation.py

Loads and augments the ALOHA dataset and creates a streaming generator that yields batches of observations and actions.

Output: Validated data pipeline with 125k+ frames ready for training.
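
The streaming generator pattern can be sketched as below. `stream_batches` and `fake_frames` are illustrative names rather than functions from `data_preparation.py`; in practice the frame iterable would presumably come from a HuggingFace dataset opened with `load_dataset(..., streaming=True)`.

```python
from itertools import islice


def stream_batches(frames, batch_size):
    """Yield lists of at most `batch_size` frames without materialising the stream."""
    iterator = iter(frames)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_frames(n):
    """Stand-in for a streaming dataset: lazily yields (observation, action) pairs."""
    for i in range(n):
        yield {"frame_index": i}, [0.0] * 6


batch_sizes = [len(b) for b in stream_batches(fake_frames(10), batch_size=4)]
# batch_sizes == [4, 4, 2]
```

Only one batch is held in memory at a time, which is what makes the approach workable for a dataset of this size.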

Step 2: Environment Setup

python3 verify_step2.py

Creates custom Gymnasium environment with:

Multi-modal observations
Deformable object physics
Reward shaping
Safety constraints

Output: Working environment with an average reward of 3-5.

Step 3: Imitation Learning Pretraining

python3 train_step3.py

Trains DiffusionPolicy on expert demonstrations:

100 epochs, batch size 64
Adam optimizer, lr=1e-4
MSE loss on action predictions

Output: IL checkpoint with validation MSE ≤ 0.15 ✅

Achieved: MSE = 0.027 (validation)
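
For reference, the check against the 0.15 threshold can be sketched as below, assuming the validation MSE is averaged over every action dimension of every sample. `action_mse` and the example values are illustrative only and are not taken from `train_step3.py`.

```python
def action_mse(predicted, target):
    """Mean squared error over every action dimension of every sample in the batch."""
    squared_errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared_errors) / len(squared_errors)


TARGET_MSE = 0.15  # acceptance threshold quoted in this README

# Two made-up 6-D action predictions against all-zero targets.
predicted = [[0.1, 0.0, -0.2, 0.0, 0.0, 0.3], [0.0, 0.5, 0.0, 0.0, -0.1, 0.0]]
target = [[0.0] * 6, [0.0] * 6]
mse = action_mse(predicted, target)
passes = mse <= TARGET_MSE
```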

Step 4: RL Fine-Tuning

# Standard training
python3 train_step4_rl.py

# Curriculum learning
python3 train_step4_curriculum.py

PPO-based fine-tuning with:

Hybrid IL+RL policy blending
Multi-modal Transformer fusion
Safety constraints
Curriculum learning (3 stages)

Target: 85% success rate on handover task

Achieved: See Results section
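
One simple way to realize the hybrid IL+RL policy blending listed above is a convex combination of the two policies' actions with a weight that shifts toward RL over time. This is an assumption for illustration: the README does not specify the blending rule, and `blend_actions`, `rl_weight_schedule`, the linear schedule, and the [-1, 1] action range are all hypothetical.

```python
def blend_actions(il_action, rl_action, rl_weight):
    """Convex combination of IL and RL actions, clipped to an assumed [-1, 1] range."""
    if not 0.0 <= rl_weight <= 1.0:
        raise ValueError("rl_weight must be in [0, 1]")
    blended = [
        (1.0 - rl_weight) * a_il + rl_weight * a_rl
        for a_il, a_rl in zip(il_action, rl_action)
    ]
    return [max(-1.0, min(1.0, a)) for a in blended]


def rl_weight_schedule(step, warmup_steps):
    """Linearly shift control from the IL prior to the RL policy over `warmup_steps`."""
    return min(1.0, step / warmup_steps)


weight = rl_weight_schedule(step=25_000, warmup_steps=100_000)  # 0.25
action = blend_actions([0.2] * 6, [0.6] * 6, weight)            # about 0.3 per dimension
```

Starting with a small RL weight keeps early behavior close to the pretrained IL policy while the RL component is still untrained.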

Results

What Works ✅

Data Pipeline: Successfully loads and augments 125k frames from HuggingFace datasets
IL Training: Achieved 0.027 validation MSE (target: ≤0.15)
Environment: Stable simulation with multi-modal observations
Policy Architecture: Proper Transformer fusion of vision, proprioception, and audio
PPO Implementation: Stable training with proper GAE, stochastic policies, mini-batching
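
For readers unfamiliar with PPO, the clipped surrogate loss and mini-batch index shuffling referred to above can be sketched with the standard library as follows. The function names and the `clip_eps=0.2` default are assumptions for illustration; the values used in `train_step4_rl.py` are not stated here.

```python
import math
import random


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective, averaged over a mini-batch."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum makes the objective a pessimistic bound.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)


def minibatch_indices(n_samples, batch_size, rng):
    """Shuffle rollout indices once and yield them in mini-batches."""
    order = list(range(n_samples))
    rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


batches = list(minibatch_indices(10, 4, random.Random(0)))
```

The clip removes any incentive to push the probability ratio beyond `1 ± clip_eps` in the direction the advantage favors, which is what keeps each update close to the data-collecting policy.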

Challenges Encountered ⚠️

Reward Engineering: The primary challenge was reward function design:

Dense Rewards: The policy found an exploit, reaching +17 reward without completing the handover by holding still near the human.
Sparse Rewards: The policy failed to learn, staying at a 0% success rate because the reward provided no learning signal.
Curriculum Learning: Staging the task was not enough to overcome the sparse-reward problem.

Training Results:

500k steps (dense rewards): 3% success (plateau at local optimum)
500k steps (sparse rewards): 0% success (no learning)
450k steps (curriculum): 0% success across all stages
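
The toy calculation below shows one way a dense-reward exploit like the one described above can arise: if being near the human pays a small bonus every step and the episode ends on success, hovering can out-earn finishing. The constants and the `episode_return` helper are hypothetical and are not the project's reward function; only the 70-step horizon comes from this README.

```python
HORIZON = 70           # episode length mentioned under "What Would Help"
REACH_STEPS = 20       # hypothetical steps needed to get near the human
PROXIMITY_BONUS = 0.2  # hypothetical per-step reward for being near the human
SUCCESS_BONUS = 5.0    # hypothetical one-off reward for completing the handover


def episode_return(near_human_steps, completed):
    """Undiscounted return given steps spent near the human and the outcome."""
    total = PROXIMITY_BONUS * near_human_steps
    if completed:
        total += SUCCESS_BONUS
    return total


# Policy A reaches the human, then hovers until the horizon without releasing.
hover_return = episode_return(near_human_steps=HORIZON - REACH_STEPS, completed=False)

# Policy B reaches the human and completes the handover 5 steps later,
# which ends the episode and stops the per-step bonus.
handover_return = episode_return(near_human_steps=5, completed=True)

exploit_pays_more = hover_return > handover_return
```

With these numbers the hovering policy collects about 10 while the successful one collects 6, so a reward maximizer prefers to hover unless the success bonus outweighs the proximity bonus accumulated over the remaining horizon.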

Key Insights

Task Difficulty: Deformable object manipulation with multi-modal RL is at the edge of tractability with current methods
Reward Hacking: Dense reward shaping is susceptible to exploitation
Exploration: Sparse rewards require prohibitive exploration in high-dimensional spaces
IL Foundation: Strong IL initialization (0.027 MSE) was not sufficient to bootstrap RL

Technical Achievements

Despite not achieving the target success rate, this project demonstrates:

1. Production-Grade Data Pipeline

Streaming data loading for large datasets
On-the-fly augmentation (5x multiplier)
Memory-efficient generator pattern
Proper train/val split
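
A minimal sketch of on-the-fly augmentation and a deterministic split follows. It assumes the "5x multiplier" means five samples per source frame (one original plus four noisy copies) and that the split is keyed on the episode so that no episode appears in both sets; neither detail is confirmed by this README, and `augment_stream` and `is_validation` are illustrative names.

```python
import random
import zlib

AUGMENT_FACTOR = 5  # "5x multiplier": assumed to mean 5 samples per source frame


def augment_stream(samples, rng, noise_std=0.01):
    """Yield each (effort, action) sample unchanged, then as noisy copies."""
    for effort, action in samples:
        yield effort, action
        for _ in range(AUGMENT_FACTOR - 1):
            noisy = [x + rng.gauss(0.0, noise_std) for x in effort]
            yield noisy, action


def is_validation(episode_id, val_fraction=0.1):
    """Deterministic split keyed on the episode id via a stable hash."""
    bucket = zlib.crc32(str(episode_id).encode("utf-8")) % 1000
    return bucket < int(val_fraction * 1000)


source = [([0.0] * 6, [0.1] * 6) for _ in range(3)]
augmented = list(augment_stream(source, random.Random(0)))
# len(augmented) == 15
```

Hashing the episode id rather than drawing a random number keeps the split reproducible across runs and across streaming passes.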

2. Proper RL Implementation

Stochastic policies with learned variance
True Gaussian log probabilities (not heuristics)
GAE with proper episode boundary handling
Mini-batch PPO updates
Gradient clipping and normalization
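
GAE with episode boundary handling can be sketched as below. This is a generic standard-library version written for illustration, not the code in `train_step4_rl.py`; `compute_gae` and the default `gamma` and `lam` values are assumptions. It treats every `done` as a true termination and does not distinguish time-limit truncation.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    dones[t] is True when the episode ended at step t. Both the bootstrap
    value and the running advantage are zeroed there, so one episode does
    not leak into the next.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Two episodes in one rollout: the first ends at t=1, the second is still running.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, True, False],
    last_value=0.0,
    gamma=1.0,
    lam=1.0,
)
# advantages == [2.0, 1.0, 1.0]
```

The reward at `t=2` belongs to the second episode, so it does not contribute to the advantages at `t=0` or `t=1`.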

3. Multi-Modal Fusion

Transformer-based sensor fusion
Proper normalization for different modalities
Audio processing with adaptive pooling
Efficient batched operations
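
Adaptive pooling is what lets a variable-length audio signal map to a fixed-size feature. The pure-Python sketch below shows the idea for average pooling, using the usual floor/ceil windowing rule for adaptive pooling layers. It is illustrative only; the project's audio encoder presumably uses a framework layer, and the pooling type and the choice of 64 output bins are assumptions.

```python
def adaptive_avg_pool_1d(signal, output_size):
    """Average `signal` into exactly `output_size` bins, whatever its length.

    Bin i covers indices [floor(i * L / out), ceil((i + 1) * L / out)).
    """
    length = len(signal)
    pooled = []
    for i in range(output_size):
        start = (i * length) // output_size
        end = ((i + 1) * length + output_size - 1) // output_size  # integer ceil
        window = signal[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


small = adaptive_avg_pool_1d([1, 2, 3, 4, 5, 6], 3)
# small == [1.5, 3.5, 5.5]

one_second = [0.0] * 16000  # 1 s of audio at 16 kHz
fixed = adaptive_avg_pool_1d(one_second, 64)
# len(fixed) == 64 regardless of the input length
```

When the input length is not a multiple of the output size, neighboring windows overlap slightly, as in `adaptive_avg_pool_1d([1, 2, 3, 4, 5], 2)`, which averages `[1, 2, 3]` and `[3, 4, 5]`.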

4. Iterative Debugging Process

Fixed audio encoder dimension mismatches
Resolved action space shape issues
Corrected observation normalization
Addressed reward exploitation

Lessons Learned

What Would Help

Simpler Task First: Start with rigid objects before deformable
Shorter Horizon: 30 steps instead of 70
Denser Success Signal: Multiple levels of partial credit
Demonstration Quality: More diverse handover demonstrations
Model-Based Components: Hybrid model-free + model-based approach

Best Practices Demonstrated

✅ Comprehensive logging and diagnostics
✅ Incremental verification at each step
✅ Proper tensor shape handling
✅ Cloud GPU utilization
✅ Documentation throughout development

Future Directions

Immediate Next Steps

Simplify Task: Rigid object handover (remove deformability)
Increase Demonstrations: Collect more diverse IL data
Hybrid Approaches: Add model predictive control
Reduce Horizon: 30-step episodes
Better Success Detection: Multiple handover stages

Research Directions

Inverse RL: Learn reward function from demonstrations
Hierarchical RL: Separate grasping and reaching skills
Sim-to-Real: Domain randomization and reality gap bridging
Multi-Task Learning: Pretrain on related tasks

Citation

If you use this code in your research, please cite:

@software{collaborative_robot_2025,
  title={Collaborative Robot: Deformable Object Handover with Multi-Modal RL},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/collaborative-robot}
}

License

MIT License - see LICENSE file for details

Acknowledgments

ALOHA dataset for demonstration data
LeRobot framework for policy architecture inspiration
Stable-Baselines3 for RL reference implementation
PyBullet for physics simulation

Contact

For questions or collaboration opportunities:

GitHub Issues: github.com/yourusername/collaborative-robot/issues
Email: your.email@example.com

Project Status: Complete implementation with documented challenges and insights. Suitable for educational purposes, research baselines, and further development.

Last Updated: December 2025
