A distributed training orchestration system using gRPC and PyTorch DDP.

Project layout:

    distributed/
    ├── protos/
    │   └── orchestrator.proto       # gRPC service definition
    ├── src/
    │   ├── master.py                # Orchestrator server
    │   ├── worker.py                # DDP training worker
    │   ├── orchestrator_pb2.py      # Generated message classes
    │   └── orchestrator_pb2_grpc.py # Generated service stubs
    ├── scripts/
    │   ├── run.sh                   # Launch master + 4 workers
    │   └── generate_proto.sh        # Regenerate proto files
    ├── requirements.txt
    └── README.md

Architecture:

    ┌─────────────────────────────────────────────────────────────┐
    │                     Master (port 50051)                     │
    │  - Worker registry & rank assignment                        │
    │  - Training coordination                                    │
    │  - Live dashboard                                           │
    └─────────────────────────────────────────────────────────────┘
           ▲               ▲               ▲               ▲
           │ gRPC          │ gRPC          │ gRPC          │ gRPC
           ▼               ▼               ▼               ▼
      ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
      │ Worker  │     │ Worker  │     │ Worker  │     │ Worker  │
      │ Rank 0  │     │ Rank 1  │     │ Rank 2  │     │ Rank 3  │
      └─────────┘     └─────────┘     └─────────┘     └─────────┘
           ▲               ▲               ▲               ▲
           └───────────────┴───────┬───────┴───────────────┘
                                   │
                       PyTorch DDP (port 29500)

Setup:

- Create conda environment:

      conda create -n myproject python=3.10
      conda activate myproject

- Install dependencies:

      pip install -r requirements.txt

- Generate proto files (if needed):

      ./scripts/generate_proto.sh

Launch the master and four workers with the run script:

    ./scripts/run.sh

Or start each process manually:

    # Terminal 1: Master
    cd src && python master.py
    # Terminals 2-5: Workers
    cd src && python worker.py

gRPC API:

| RPC | Description |
|---|---|
| `Register` | Worker registers and receives its rank |
| `SendHeartbeat` | Worker sends training progress |
| `CanStartTraining` | Checks whether all workers are ready |
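
To make the three RPCs concrete, here is a minimal in-memory sketch of the bookkeeping a master might do behind them. It is not the code in `master.py`: the `MasterState` class, its method names, and the heartbeat fields (`epoch`, `loss`) are assumptions for illustration, and the gRPC transport is left out entirely.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class MasterState:
    """Hypothetical stand-in for the master's worker registry (no gRPC involved)."""
    world_size: int = 4
    ranks: dict = field(default_factory=dict)     # worker_id -> assigned rank
    progress: dict = field(default_factory=dict)  # rank -> latest (epoch, loss)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def register(self, worker_id: str) -> int:
        """Register: assign the next free rank; a repeat call returns the same rank."""
        with self._lock:
            if worker_id not in self.ranks:
                if len(self.ranks) >= self.world_size:
                    raise RuntimeError("all ranks are already assigned")
                self.ranks[worker_id] = len(self.ranks)
            return self.ranks[worker_id]

    def send_heartbeat(self, rank: int, epoch: int, loss: float) -> None:
        """SendHeartbeat: remember the most recent progress report per rank."""
        with self._lock:
            self.progress[rank] = (epoch, loss)

    def can_start_training(self) -> bool:
        """CanStartTraining: true once every expected worker has registered."""
        with self._lock:
            return len(self.ranks) == self.world_size


state = MasterState(world_size=4)
assigned = [state.register(f"worker-{i}") for i in range(3)]
ready_before = state.can_start_training()   # only three of four workers so far
assigned.append(state.register("worker-3"))
ready_after = state.can_start_training()
state.send_heartbeat(rank=0, epoch=1, loss=0.42)
```

The lock is there because a gRPC server typically handles requests on several threads, so rank assignment has to be atomic.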

Workflow:

- Master starts on port 50051
- Workers register (get ranks 0, 1, 2, 3)
- Workers poll `CanStartTraining` until ready
- All workers init DDP process group (gloo backend)
- Training loop with heartbeats
- Master displays live dashboard
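
The worker side of the steps above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `worker.py`: the helper names (`wait_until_ready`, `ddp_env`, `init_ddp`), the polling interval and timeout, and the `127.0.0.1` master address are all hypothetical. The `can_start` callback stands in for the `CanStartTraining` RPC call.

```python
import os
import time
from typing import Callable


def wait_until_ready(can_start: Callable[[], bool], interval_s: float = 1.0,
                     timeout_s: float = 60.0) -> bool:
    """Poll `can_start` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if can_start():
            return True
        time.sleep(interval_s)
    return False


def ddp_env(rank: int, world_size: int, master_addr: str = "127.0.0.1",
            master_port: int = 29500) -> dict:
    """Environment variables read by torch.distributed's default env:// rendezvous."""
    return {"MASTER_ADDR": master_addr, "MASTER_PORT": str(master_port),
            "RANK": str(rank), "WORLD_SIZE": str(world_size)}


def init_ddp(rank: int, world_size: int) -> None:
    """Join the process group with the gloo backend (requires PyTorch)."""
    import torch.distributed as dist  # imported lazily so the sketch loads without torch
    os.environ.update(ddp_env(rank, world_size))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)


# Simulated master that reports "ready" on the third poll.
answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(answers), interval_s=0.01, timeout_s=1.0)
env = ddp_env(rank=2, world_size=4)
```

In a real worker, `init_ddp(rank, world_size)` would be called once `wait_until_ready` returns `True`, using the rank received from `Register`. Every worker must use the same address and port so that they all rendezvous in one group.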