This repository contains the implementation code for the paper "Selective Scheduling: A Reinforcement Learning Approach to Improving GPU Allocation Rates in Kubernetes Clusters".
SelSched addresses the GPU fragmentation problem in Kubernetes clusters through selective scheduling: strategically skipping pod placements that would harm overall GPU allocation. By combining reinforcement learning via Proximal Policy Optimization (PPO) with attention mechanisms, our approach achieves up to a 39.2% improvement in GPU allocation rates.
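As an illustration of the core idea (not the paper's implementation), selective scheduling can be modeled by extending the agent's discrete action space with one extra "skip" index, letting the policy decline a placement instead of being forced to fragment a node:

```python
# Illustrative sketch: node indices 0..NUM_NODES-1 place the pod,
# and the extra index NUM_NODES means "defer this pod for now".
NUM_NODES = 4
SKIP = NUM_NODES

def apply_action(action, pod, nodes):
    """Place the pod on nodes[action], or defer it when action == SKIP."""
    if action == SKIP:
        return None  # no GPUs are committed; the pod can be retried later
    node = nodes[action]
    node["free_gpus"] -= pod["gpus"]
    return node
```

The RL agent learns when deferring a pod yields a higher long-run allocation rate than greedily placing it.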
- Selective Scheduling: Strategic skipping of pod placements to optimize GPU allocation
- PPO-based RL: Reinforcement learning agent with attention mechanisms for scalable scheduling
- High-fidelity Simulator: Kubernetes scheduling environment for training and evaluation
- Multiple Baselines: Comparison with traditional scheduling strategies (round-robin, binpack, spread)
- Ablation Studies: Configurable components for understanding the impact of different features
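For intuition, the binpack and spread baselines can be sketched as two greedy policies that score feasible nodes by remaining GPUs and differ only in sort direction (a simplified sketch; the actual implementations are in training/baseline_policies.py):

```python
# Hypothetical sketch of two baseline strategies over a list of node dicts.
def binpack(nodes, pod_gpus):
    """Prefer the feasible node with the fewest free GPUs (pack tightly)."""
    feasible = [n for n in nodes if n["free_gpus"] >= pod_gpus]
    return min(feasible, key=lambda n: n["free_gpus"], default=None)

def spread(nodes, pod_gpus):
    """Prefer the feasible node with the most free GPUs (spread load)."""
    feasible = [n for n in nodes if n["free_gpus"] >= pod_gpus]
    return max(feasible, key=lambda n: n["free_gpus"], default=None)
```

Neither baseline can skip a placement, which is exactly the capability selective scheduling adds.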
```shell
# Python 3.8+ recommended
pip install -r requirements.txt
```

Dependencies:
- PyTorch >= 1.8.0
- NumPy
- OpenAI Gym
- Matplotlib (for visualization)
- Seaborn (for enhanced plots)
```shell
# Train with the demo configuration
python run.py --config config.demo_config

# Train with custom parameters
python run.py --mode train --episodes 5000 --save_interval 500

# Use a custom configuration file
python run.py --config config.your_custom_config
```

```
SelSched/
├── config/                    # Configuration files for different experiments
│   ├── demo_config.py         # Configuration template with documentation
│   └── ablation/              # Ablation study configurations
├── entities/                  # Core entities (Node, Pod, utilities)
├── environment/               # Kubernetes scheduling environment
│   ├── scheduler_env.py       # Main RL environment
│   └── static_cluster.py      # Cluster simulation
├── models/                    # Neural network models
│   ├── ppo.py                 # PPO agent implementation
│   ├── networks.py            # Actor-Critic networks
│   └── efficient_attention.py # Attention mechanisms
├── training/                  # Training and evaluation
│   ├── trainer.py             # Training loop
│   ├── evaluator.py           # Evaluation metrics
│   └── baseline_policies.py   # Baseline scheduling strategies
└── run.py                     # Main entry point
```
The configuration system allows full customization of the scheduling environment. See config/demo_config.py for a complete template with detailed documentation.
Key configuration categories:
- Cluster Setup: Define nodes with CPU, memory, and GPU resources
- Workload Patterns: Specify pods with their resource requirements
- RL Parameters: PPO hyperparameters and network architecture
- Environment Settings: Episode length, reward functions, scheduling rounds
- Training Options: Batch size, learning rate, update frequency
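A hypothetical configuration sketch mirroring the categories above (field names are assumptions for illustration; the authoritative names are in config/demo_config.py):

```python
# Illustrative configuration dict; keys here are placeholders, not the
# actual schema from config/demo_config.py.
config = {
    "cluster": [  # Cluster Setup: per-node CPU cores, memory (GiB), GPUs
        {"cpu": 64, "memory": 256, "gpus": 8},
        {"cpu": 32, "memory": 128, "gpus": 4},
    ],
    "workload": [  # Workload Patterns: pod resource requests
        {"cpu": 8, "memory": 32, "gpus": 1},
    ],
    "ppo": {"lr": 3e-4, "clip_ratio": 0.2, "hidden_dim": 256},  # RL Parameters
    "env": {"episode_length": 200, "scheduling_rounds": 10},    # Environment Settings
    "train": {"batch_size": 64, "update_every": 4},             # Training Options
}
```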
You can either:
- Modify the demo config directly with your parameters
- Use the provided `create_config()` helper function for programmatic generation
- Create custom configuration files following the same structure
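A minimal sketch of what a programmatic helper of this kind could look like; the real `create_config()` lives in config/demo_config.py and its parameter names may differ (the ones below are assumptions):

```python
# Assumed-shape helper: builds a config dict for a uniform cluster.
def create_config(num_nodes=4, gpus_per_node=8, episodes=1000):
    """Generate a configuration for num_nodes identical GPU nodes."""
    return {
        "cluster": [{"gpus": gpus_per_node} for _ in range(num_nodes)],
        "train": {"episodes": episodes},
    }

cfg = create_config(num_nodes=2, gpus_per_node=4, episodes=500)
```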
- GPU Allocation Rate: Percentage of GPUs successfully allocated
- CPU/Memory Utilization: Resource usage efficiency
- Episode Rewards: Cumulative rewards during training
- Scheduling Success Rate: Percentage of pods successfully placed
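The headline metric can be illustrated as follows (a hedged example of how a GPU allocation rate could be computed; the evaluator in training/evaluator.py is the authoritative definition):

```python
# Allocation rate = allocated GPUs across all nodes / total cluster GPUs.
def gpu_allocation_rate(nodes):
    """Fraction of cluster GPUs currently allocated to pods."""
    total = sum(n["total_gpus"] for n in nodes)
    allocated = sum(n["total_gpus"] - n["free_gpus"] for n in nodes)
    return allocated / total if total else 0.0

nodes = [
    {"total_gpus": 8, "free_gpus": 2},  # 6 GPUs allocated
    {"total_gpus": 8, "free_gpus": 8},  # idle node
]
print(gpu_allocation_rate(nodes))  # → 0.375
```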
For questions or issues, please contact mirocody@gmail.com.