Train: add auto-DDP distributed training #1123

Open
GeorgePearse wants to merge 1 commit into main from feat/auto-ddp-training

Conversation

@GeorgePearse
Collaborator

Summary

  • Port the automatic single-node DDP spawning approach from BinItAI/core PR #5414.
  • Fill in the missing visdet.engine.dist implementation (rank/world_size, init, collectives, result collection), which is required by Runner/samplers/metrics.
  • Add visdet.engine.runner.auto_train() and a small entrypoint script scripts/train_auto_ddp.py (no torchrun required).
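Launching without torchrun means the parent process has to hand each worker the rendezvous settings that torchrun would normally export. The PR's auto_train() is not shown in this conversation, so the following is only a minimal sketch of that step, assuming one worker per device on a single node. The helper names find_free_port and build_worker_envs are hypothetical and are not part of visdet; the environment variable names are the ones PyTorch's env:// initialization reads.

```python
import socket
from typing import Dict, List, Optional


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def build_worker_envs(
    num_workers: int, master_port: Optional[int] = None
) -> List[Dict[str, str]]:
    """Return one environment dict per worker for a single-node job.

    Hypothetical helper: on a single node the global rank and the local
    rank coincide, so both are set to the worker index.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    port = find_free_port() if master_port is None else master_port
    envs = []
    for rank in range(num_workers):
        envs.append(
            {
                "MASTER_ADDR": "127.0.0.1",
                "MASTER_PORT": str(port),
                "WORLD_SIZE": str(num_workers),
                "RANK": str(rank),
                "LOCAL_RANK": str(rank),
            }
        )
    return envs
```

A launcher could merge each dict into a child process's environment before the child initializes its process group. How the PR actually passes these values to workers may differ.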

Usage

  • python scripts/train_auto_ddp.py path/to/runner_config.py

Testing

  • python3 -m pytest -q tests/test_runtime/test_dist_env_fallback.py
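The test file name suggests that rank and world size fall back to environment values when no process group has been initialized. Its contents are not shown here, so the snippet below is a sketch of one plausible fallback rather than the PR's implementation. The function name get_dist_info_from_env is hypothetical; RANK and WORLD_SIZE are the standard variables set by PyTorch launchers.

```python
import os
from typing import Mapping, Optional, Tuple


def get_dist_info_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[int, int]:
    """Return (rank, world_size) read from environment variables.

    Hypothetical helper: defaults to rank 0 in a world of size 1, which
    matches a plain single-process run where neither variable is set.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(
            f"inconsistent values: RANK={rank}, WORLD_SIZE={world_size}"
        )
    return rank, world_size
```

Accepting an explicit mapping keeps the function easy to test without mutating os.environ.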

@github-actions
Contributor

🤖 Multi-Model Consensus Review

Model Status
Claude Sonnet 4
GPT-4o
Gemini 2.0 Flash

scripts/train_auto_ddp.py

Pass Rate: 3/3 models

tests/test_runtime/test_dist_env_fallback.py

Pass Rate: 3/3 models

visdet/engine/dist/__init__.py

Pass Rate: 3/3 models

visdet/engine/dist/dist.py

Pass Rate: 3/3 models

visdet/engine/dist/dist_utils.py

Pass Rate: 3/3 models

visdet/engine/dist/utils.py

Pass Rate: 3/3 models

visdet/engine/runner/__init__.py

Pass Rate: 3/3 models

visdet/engine/runner/auto_train.py

Pass Rate: 3/3 models

@github-actions
Contributor

Skylos Scan: No dead code or security issues detected.
