This repository contains the implementation code for the paper "Selective Scheduling: A Reinforcement Learning Approach to Improving GPU Allocation Rates in Kubernetes Clusters".
SelSched addresses the GPU fragmentation problem in Kubernetes clusters through selective scheduling: strategically skipping pod placements that would harm overall GPU allocation. By combining reinforcement learning via Proximal Policy Optimization (PPO) with attention mechanisms, our approach achieves up to a 39.2% improvement in GPU allocation rates.
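As an illustration of the core idea (not the paper's implementation), selective scheduling can be modeled by extending the agent's discrete action space with one extra "skip" index, letting the policy decline a placement instead of being forced to fragment a node:

```python
# Illustrative sketch: node indices 0..NUM_NODES-1 place the pod,
# and the extra index NUM_NODES means "defer this pod for now".
NUM_NODES = 4
SKIP = NUM_NODES

def apply_action(action, pod, nodes):
    """Place the pod on nodes[action], or defer it when action == SKIP."""
    if action == SKIP:
        return None  # no GPUs are committed; the pod can be retried later
    node = nodes[action]
    node["free_gpus"] -= pod["gpus"]
    return node
```

The RL agent learns when deferring a pod yields a higher long-run allocation rate than greedily placing it.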
- Selective Scheduling: Strategic skipping of pod placements to optimize GPU allocation
- PPO-based RL: Reinforcement learning agent with attention mechanisms for scalable scheduling
- High-fidelity Simulator: Kubernetes scheduling environment for training and evaluation
- Multiple Baselines: Comparison with traditional scheduling strategies (round-robin, binpack, spread)
- Ablation Studies: Configurable components for understanding the impact of different features
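For intuition, the binpack and spread baselines can be sketched as two greedy policies that score feasible nodes by remaining GPUs and differ only in sort direction (a simplified sketch; the actual implementations are in training/baseline_policies.py):

```python
# Hypothetical sketch of two baseline strategies over a list of node dicts.
def binpack(nodes, pod_gpus):
    """Prefer the feasible node with the fewest free GPUs (pack tightly)."""
    feasible = [n for n in nodes if n["free_gpus"] >= pod_gpus]
    return min(feasible, key=lambda n: n["free_gpus"], default=None)

def spread(nodes, pod_gpus):
    """Prefer the feasible node with the most free GPUs (spread load)."""
    feasible = [n for n in nodes if n["free_gpus"] >= pod_gpus]
    return max(feasible, key=lambda n: n["free_gpus"], default=None)
```

Neither baseline can skip a placement, which is exactly the capability selective scheduling adds.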
```shell
# Python 3.8+ recommended
pip install -r requirements.txt
```

Dependencies:
- PyTorch >= 1.8.0
- NumPy
- OpenAI Gym
- Matplotlib (for visualization)
- Seaborn (for enhanced plots)
```shell
# Train with the demo configuration
python run.py --config config.demo_config

# Train with custom parameters
python run.py --mode train --episodes 5000 --save_interval 500

# Use a custom configuration file
python run.py --config config.your_custom_config
```

```
SelSched/
├── config/                    # Configuration files for different experiments
│   ├── demo_config.py         # Configuration template with documentation
│   └── ablation/              # Ablation study configurations
├── entities/                  # Core entities (Node, Pod, utilities)
├── environment/               # Kubernetes scheduling environment
│   ├── scheduler_env.py       # Main RL environment
│   └── static_cluster.py      # Cluster simulation
├── models/                    # Neural network models
│   ├── ppo.py                 # PPO agent implementation
│   ├── networks.py            # Actor-Critic networks
│   └── efficient_attention.py # Attention mechanisms
├── training/                  # Training and evaluation
│   ├── trainer.py             # Training loop
│   ├── evaluator.py           # Evaluation metrics
│   └── baseline_policies.py   # Baseline scheduling strategies
└── run.py                     # Main entry point
```
The configuration system allows full customization of the scheduling environment. See config/demo_config.py for a complete template with detailed documentation.
Key configuration categories:
- Cluster Setup: Define nodes with CPU, memory, and GPU resources
- Workload Patterns: Specify pods with their resource requirements
- RL Parameters: PPO hyperparameters and network architecture
- Environment Settings: Episode length, reward functions, scheduling rounds
- Training Options: Batch size, learning rate, update frequency
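A hypothetical configuration sketch mirroring the categories above (field names are assumptions for illustration; the authoritative names are in config/demo_config.py):

```python
# Illustrative configuration dict; keys here are placeholders, not the
# actual schema from config/demo_config.py.
config = {
    "cluster": [  # Cluster Setup: per-node CPU cores, memory (GiB), GPUs
        {"cpu": 64, "memory": 256, "gpus": 8},
        {"cpu": 32, "memory": 128, "gpus": 4},
    ],
    "workload": [  # Workload Patterns: pod resource requests
        {"cpu": 8, "memory": 32, "gpus": 1},
    ],
    "ppo": {"lr": 3e-4, "clip_ratio": 0.2, "hidden_dim": 256},  # RL Parameters
    "env": {"episode_length": 200, "scheduling_rounds": 10},    # Environment Settings
    "train": {"batch_size": 64, "update_every": 4},             # Training Options
}
```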
You can either:
- Modify the demo config directly with your parameters
- Use the provided `create_config()` helper function for programmatic generation
- Create custom configuration files following the same structure
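A minimal sketch of what a programmatic helper of this kind could look like; the real `create_config()` lives in config/demo_config.py and its parameter names may differ (the ones below are assumptions):

```python
# Assumed-shape helper: builds a config dict for a uniform cluster.
def create_config(num_nodes=4, gpus_per_node=8, episodes=1000):
    """Generate a configuration for num_nodes identical GPU nodes."""
    return {
        "cluster": [{"gpus": gpus_per_node} for _ in range(num_nodes)],
        "train": {"episodes": episodes},
    }

cfg = create_config(num_nodes=2, gpus_per_node=4, episodes=500)
```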
- GPU Allocation Rate: Percentage of GPUs successfully allocated
- CPU/Memory Utilization: Resource usage efficiency
- Episode Rewards: Cumulative rewards during training
- Scheduling Success Rate: Percentage of pods successfully placed
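The headline metric can be illustrated as follows (a hedged example of how a GPU allocation rate could be computed; the evaluator in training/evaluator.py is the authoritative definition):

```python
# Allocation rate = allocated GPUs across all nodes / total cluster GPUs.
def gpu_allocation_rate(nodes):
    """Fraction of cluster GPUs currently allocated to pods."""
    total = sum(n["total_gpus"] for n in nodes)
    allocated = sum(n["total_gpus"] - n["free_gpus"] for n in nodes)
    return allocated / total if total else 0.0

nodes = [
    {"total_gpus": 8, "free_gpus": 2},  # 6 GPUs allocated
    {"total_gpus": 8, "free_gpus": 8},  # idle node
]
print(gpu_allocation_rate(nodes))  # → 0.375
```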
For questions or issues, please contact mirocody@gmail.com.