# EMDSAC-ft: Bridging the Gap in Offline-to-Online Reinforcement Learning through Value Distribution Learning
This repository contains the official implementation of EMDSAC-ft, a novel algorithm for offline-to-online reinforcement learning that addresses key challenges in both offline pre-training and online fine-tuning phases.
## Key Features

- Uncertainty Decoupling: separates epistemic and aleatoric uncertainty for stronger offline RL performance
- Distributional Value Learning: captures full return distributions instead of only expected values
- Efficient Fine-tuning: UDPE and TTRPI modules for stable online adaptation
- State-of-the-art Performance: 14.9% average improvement over baselines and 25.8% improvement during fine-tuning
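Distributional value learning is usually implemented with quantile regression over the return distribution. The repository's exact loss is not shown here; a minimal sketch of the standard quantile Huber loss (QR-DQN style), which such a critic is commonly trained with, might look like:

```python
import torch

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile Huber loss for a critic that outputs N return quantiles.
    pred_quantiles: (batch, N) predicted quantiles.
    target_samples: (batch, M) quantiles/samples of the target distribution.
    """
    n = pred_quantiles.shape[1]
    # Midpoint quantile fractions tau_i = (2i + 1) / (2N)
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n  # (N,)
    # Pairwise TD errors between every target and predicted quantile: (batch, M, N)
    td = target_samples.unsqueeze(2) - pred_quantiles.unsqueeze(1)
    # Huber smoothing of the TD errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weighting |tau - 1{td < 0}| makes each output track its quantile
    weight = (taus - (td.detach() < 0).float()).abs()
    return (weight * huber).mean()
```

This is a generic sketch, not the repository's exact objective; EMDSAC-ft builds its uncertainty decoupling on top of critics trained this way.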
## Repository Structure

```
EMDSAC-ft/
├── Independent/              # Independent training implementation
│   ├── example_train/        # Main training scripts
│   │   ├── train_ORL.py      # Offline training script
│   │   ├── train_O2O.py      # Online fine-tuning script
│   │   ├── configs/          # Configuration files
│   │   │   ├── offline/      # Offline training configs
│   │   │   └── ft/           # Fine-tuning configs
│   │   ├── networks/         # Network architectures
│   │   ├── training/         # Training utilities
│   │   └── utils/            # Utility functions
│   └── Algorithms/           # Algorithm implementations
├── Vectorized/               # Vectorized implementation
│   ├── main.py               # Main training script
│   ├── configs/              # Configuration files
│   └── Algorithms/           # Algorithm implementations
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Requirements

- Python 3.7+
- PyTorch 1.9+
- CUDA 11.0+ (optional, for GPU acceleration)
## Installation

- Clone the repository:

```bash
git clone https://github.com/your-username/EMDSAC-ft.git
cd EMDSAC-ft
```

- Create and activate a conda environment:

```bash
conda create -n EMDSAC python=3.8
conda activate EMDSAC
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Offline Training

Independent implementation:

```bash
conda activate EMDSAC
cd Independent/example_train
python train_ORL.py --config configs/offline/halfcheetah-medium-replay-v2.yaml
```

Vectorized implementation:

```bash
conda activate EMDSAC
cd Vectorized
python main.py --config configs/halfcheetah-medium-replay-v2.yaml
```

### Online Fine-tuning

```bash
conda activate EMDSAC
cd Independent/example_train
python train_O2O.py --config configs/ft/halfcheetah-medium-replay-v2.yaml
```

## Method Overview

Core Components:
- Ensemble Value Distribution Networks: Quantify epistemic uncertainty from OOD actions
- Distributional Value Learning: Capture aleatoric uncertainty from environmental randomness
- Uncertainty Decoupling: Separate epistemic and aleatoric uncertainties for better performance
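A common way to realize this decoupling (a sketch of the general idea, not necessarily the repository's exact formulation) is to train an ensemble of distributional critics: disagreement *across* ensemble members estimates epistemic uncertainty, while the quantile spread *within* a member estimates aleatoric uncertainty:

```python
import torch

def decouple_uncertainty(ensemble_quantiles):
    """Split uncertainty estimates for an ensemble of distributional critics.
    ensemble_quantiles: (E, batch, N) - N return quantiles predicted by each
    of E ensemble members for the same batch of (state, action) pairs.
    Returns per-sample epistemic and aleatoric uncertainty estimates.
    """
    # Mean return predicted by each member: (E, batch)
    member_means = ensemble_quantiles.mean(dim=2)
    # Epistemic: disagreement between members about the expected return
    epistemic = member_means.std(dim=0)                    # (batch,)
    # Aleatoric: average within-member spread of the return distribution
    aleatoric = ensemble_quantiles.std(dim=2).mean(dim=0)  # (batch,)
    return epistemic, aleatoric
```

Under this decomposition, an OOD action yields high epistemic uncertainty (members disagree) even in a deterministic environment, while environmental randomness widens each member's quantile spread without forcing members to disagree.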
Key Innovations:
- Uneven Distribution of Pessimism Elimination (UDPE): adaptively removes the excess pessimism inherited from offline pre-training
- True Trust Region Policy Improvement (TTRPI): Stable policy updates
- Seamless Offline-to-Online Transition: Maintain performance during adaptation
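UDPE and TTRPI are specific to this work and their exact update rules are not reproduced here. As a rough illustration of the trust-region principle behind TTRPI, the sketch below shows a generic KL-penalized policy-gradient loss for a discrete policy: penalizing divergence from the pre-trained policy discourages large policy shifts during the offline-to-online transition. All names and the discrete-action setting are illustrative assumptions, not the repository's API:

```python
import torch
import torch.nn.functional as F

def kl_penalized_actor_loss(new_logits, old_logits, advantages, actions, beta=1.0):
    """Generic KL-regularized policy-gradient loss (illustrative, not TTRPI).
    Penalizing KL(old || new) keeps the fine-tuned policy close to the
    pre-trained one, the basic trust-region idea that stabilizes adaptation.
    """
    new_logp = F.log_softmax(new_logits, dim=-1)
    old_logp = F.log_softmax(old_logits, dim=-1).detach()
    # Policy-gradient term on the actions actually taken
    taken = new_logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(advantages * taken).mean()
    # KL(old || new) penalty, summed over the action dimension
    kl = (old_logp.exp() * (old_logp - new_logp)).sum(dim=-1).mean()
    return pg_loss + beta * kl
```

With `beta = 0` this reduces to an unconstrained policy gradient; larger `beta` trades improvement speed for stability, which is the knob a trust-region method effectively controls.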