SelSched: Selective Scheduling for GPU Allocation in Kubernetes

This repository contains the implementation code for the paper "Selective Scheduling: A Reinforcement Learning Approach to Improving GPU Allocation Rates in Kubernetes Clusters".

Overview

SelSched addresses the GPU fragmentation problem in Kubernetes clusters through strategic selective scheduling: the ability to skip certain pod placements when they would harm overall GPU allocation. By applying reinforcement learning with Proximal Policy Optimization (PPO) and attention mechanisms, our approach achieves up to a 39.2% improvement in GPU allocation rates.
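The core idea can be sketched as an ordinary node-selection action space augmented with one extra "skip" index. This is a minimal illustration with hypothetical names (`Node`, `apply_action`); the actual environment lives in environment/scheduler_env.py.

```python
# Hypothetical sketch: selective scheduling adds a "skip" action that defers
# a pod instead of forcing a placement that would fragment GPU capacity.
from dataclasses import dataclass

@dataclass
class Node:
    free_gpus: int

def apply_action(nodes, pod_gpus, action):
    """Place the pod on nodes[action]; action == len(nodes) means skip."""
    if action == len(nodes):
        return "skipped"            # defer this pod to a later round
    node = nodes[action]
    if node.free_gpus < pod_gpus:
        return "infeasible"         # the chosen node cannot host the pod
    node.free_gpus -= pod_gpus
    return "placed"

nodes = [Node(free_gpus=8), Node(free_gpus=2)]
print(apply_action(nodes, pod_gpus=4, action=0))  # places on node 0
print(apply_action(nodes, pod_gpus=4, action=2))  # index == len(nodes): skip
```

The RL agent learns when choosing the skip action now leads to better overall allocation later, e.g. holding back a small pod so a large multi-GPU pod can still fit.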

Key Features

  • Selective Scheduling: Strategic skipping of pod placements to optimize GPU allocation
  • PPO-based RL: Reinforcement learning agent with attention mechanisms for scalable scheduling
  • High-fidelity Simulator: Kubernetes scheduling environment for training and evaluation
  • Multiple Baselines: Comparison with traditional scheduling strategies (round-robin, binpack, spread)
  • Ablation Studies: Configurable components for understanding the impact of different features
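For reference, the binpack and spread baselines mentioned above can be sketched as greedy node-scoring rules. The helper below is hypothetical; the repository's implementations are in training/baseline_policies.py.

```python
# Hypothetical sketch of two classic baselines: binpack packs pods onto the
# fullest feasible node, spread places them on the emptiest one.
def pick_node(free_gpus, need, strategy):
    feasible = [i for i, g in enumerate(free_gpus) if g >= need]
    if not feasible:
        return None                                        # no node fits
    if strategy == "binpack":
        return min(feasible, key=lambda i: free_gpus[i])   # tightest fit
    if strategy == "spread":
        return max(feasible, key=lambda i: free_gpus[i])   # loosest fit
    raise ValueError(f"unknown strategy: {strategy}")

free = [6, 2, 8]
print(pick_node(free, need=2, strategy="binpack"))  # -> 1
print(pick_node(free, need=2, strategy="spread"))   # -> 2
```

Neither baseline can skip a pod, which is exactly the degree of freedom SelSched exploits.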

Installation

Requirements

# Python 3.8+ recommended
pip install -r requirements.txt

Dependencies

  • PyTorch >= 1.8.0
  • NumPy
  • OpenAI Gym
  • Matplotlib (for visualization)
  • Seaborn (for enhanced plots)

Quick Start

Training a Model

# Train with the demo configuration
python run.py --config config.demo_config

# Train with custom parameters
python run.py --mode train --episodes 5000 --save_interval 500

# Use custom configuration file
python run.py --config config.your_custom_config

Project Structure

SelSched/
├── config/               # Configuration files for different experiments
│   ├── demo_config.py   # Configuration template with documentation
│   └── ablation/        # Ablation study configurations
├── entities/            # Core entities (Node, Pod, utilities)
├── environment/         # Kubernetes scheduling environment
│   ├── scheduler_env.py  # Main RL environment
│   └── static_cluster.py # Cluster simulation
├── models/              # Neural network models
│   ├── ppo.py          # PPO agent implementation
│   ├── networks.py     # Actor-Critic networks
│   └── efficient_attention.py  # Attention mechanisms
├── training/            # Training and evaluation
│   ├── trainer.py      # Training loop
│   ├── evaluator.py    # Evaluation metrics
│   └── baseline_policies.py  # Baseline scheduling strategies
└── run.py              # Main entry point

Configuration

The configuration system allows full customization of the scheduling environment. See config/demo_config.py for a complete template with detailed documentation.

Key configuration categories:

  • Cluster Setup: Define nodes with CPU, memory, and GPU resources
  • Workload Patterns: Specify pods with their resource requirements
  • RL Parameters: PPO hyperparameters and network architecture
  • Environment Settings: Episode length, reward functions, scheduling rounds
  • Training Options: Batch size, learning rate, update frequency

You can:

  1. Modify the demo config directly with your parameters
  2. Use the provided create_config() helper function for programmatic generation
  3. Create custom configuration files following the same structure
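The five categories above can be pictured as a nested configuration dict. This sketch is illustrative only; the authoritative field names and the create_config() helper live in config/demo_config.py.

```python
# Hypothetical configuration sketch mirroring the categories above;
# field names here are assumptions, not the repo's actual schema.
config = {
    "cluster": {
        "nodes": [{"cpu": 64, "memory_gb": 256, "gpus": 8} for _ in range(4)],
    },
    "workload": {
        "pods": [{"cpu": 8, "memory_gb": 32, "gpus": 1} for _ in range(100)],
    },
    "rl": {"gamma": 0.99, "clip_ratio": 0.2, "hidden_dim": 128},
    "environment": {"episode_length": 200, "scheduling_rounds": 10},
    "training": {"batch_size": 64, "learning_rate": 3e-4},
}
```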

Key Metrics

  • GPU Allocation Rate: Percentage of GPUs successfully allocated
  • CPU/Memory Utilization: Resource usage efficiency
  • Episode Rewards: Cumulative rewards during training
  • Scheduling Success Rate: Percentage of pods successfully placed
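The headline metric is simple to compute: GPUs allocated divided by total GPUs in the cluster. A trivial sketch (not the repo's evaluator, which is in training/evaluator.py):

```python
# Hypothetical helper: GPU allocation rate = allocated GPUs / total GPUs.
def gpu_allocation_rate(allocated_per_node, total_per_node):
    total = sum(total_per_node)
    return sum(allocated_per_node) / total if total else 0.0

rate = gpu_allocation_rate([6, 8, 3], [8, 8, 8])
print(f"{rate:.1%}")  # 17 of 24 GPUs -> 70.8%
```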

Contact

For questions or issues, please contact mirocody@gmail.com.
