diff --git a/.github/ISSUE_TEMPLATE/bug_report.yml b/.github/ISSUE_TEMPLATE/bug_report.yml new file mode 100644 index 00000000..1294bccb --- /dev/null +++ b/.github/ISSUE_TEMPLATE/bug_report.yml @@ -0,0 +1,87 @@ +name: Bug Report +description: Submit a bug report to help improve SimAI +title: "[BUG]: " +labels: ["bug"] +body: + - type: markdown + attributes: + value: | + Thanks for taking the time to fill out this bug report! + + It would be very helpful if you could provide as much detail as possible. + + - type: textarea + id: bug-description + attributes: + label: Describe the Bug + description: A clear and concise description of what the bug is. + validations: + required: true + + - type: textarea + id: reproduction + attributes: + label: Reproduction Details + description: | + Please provide detailed steps to reproduce the issue. + Include the branch names or commit IDs of SimAI/AICB you are using. + placeholder: | + 1. **Branches / Commit IDs**: SimAI branch `master` (commit `abc1234`), AICB branch `master` + 2. Go to '...' + 3. Run `...` + 4. See error: ... + validations: + required: true + + - type: textarea + id: expected + attributes: + label: Expected Behavior + description: What did you expect to happen? + validations: + required: true + + - type: textarea + id: actual + attributes: + label: Actual Behavior + description: What actually happened? Please include any error messages or logs. + validations: + required: true + + - type: textarea + id: environment + attributes: + label: Environment + description: Please provide details about your environment. + placeholder: | + - OS: Ubuntu 20.04 + - GCC/G++: 9.4.0 + - Python: 3.8.10 + - Docker image (if applicable): ... + - CUDA version (if applicable): ... 
+ - SimAI branch/commit: master / abc1234 + - AICB branch/commit: master / def5678 + validations: + required: true + + - type: textarea + id: usage-scenario + attributes: + label: Usage Scenario (Optional) + description: | + If possible, please describe your usage scenario for SimAI: + - What task or project you are working on + - The underlying goals or business context + + This information will help us collect relevant use cases and optimize the SimAI simulator to better meet your needs. + validations: + required: false + + - type: textarea + id: screenshots + attributes: + label: Screenshots / Logs + description: If applicable, add screenshots or log snippets to help explain your problem. + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 00000000..3b1de558 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,8 @@ +blank_issues_enabled: false +contact_links: + - name: SimAI Documentation + url: https://github.com/aliyun/SimAI/tree/master/docs + about: Refer to the SimAI documentation to help you get started. + - name: SimAI Community (DingTalk / WeChat) + url: https://github.com/aliyun/SimAI#contact-us + about: Join our DingTalk or WeChat community groups for discussion and support. diff --git a/.github/ISSUE_TEMPLATE/feature_request.yml b/.github/ISSUE_TEMPLATE/feature_request.yml new file mode 100644 index 00000000..c04c204f --- /dev/null +++ b/.github/ISSUE_TEMPLATE/feature_request.yml @@ -0,0 +1,37 @@ +name: Feature Request +description: Suggest an improvement for SimAI +title: "[FEATURE]: " +labels: ["enhancement"] +body: + - type: markdown + attributes: + value: | + Thank you for suggesting a feature to improve SimAI! + + - type: textarea + id: feature-description + attributes: + label: Feature Description + description: A clear and concise description of the feature you'd like. 
+ validations: + required: true + + - type: textarea + id: problem + attributes: + label: Problem / Motivation + description: | + What problem does this feature solve? Why is it needed? + Please describe the use case or scenario where this feature would be helpful. + validations: + required: true + + - type: textarea + id: alternatives + attributes: + label: Alternatives Considered + description: | + Have you considered any alternative solutions or workarounds? + Please describe them if applicable. + validations: + required: false diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 00000000..3a65802c --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,34 @@ +## Description + + + +## Related Issue + + + +## Type of Change + +- [ ] Bug fix (non-breaking change which fixes an issue) +- [ ] New feature (non-breaking change which adds functionality) +- [ ] Performance improvement +- [ ] Refactoring (no functional changes) +- [ ] Documentation update +- [ ] Build / CI configuration change + +## Checklist + +- [ ] I have read the [CONTRIBUTING.md](../CONTRIBUTING.md) guide +- [ ] My code follows the existing code style of this project +- [ ] I have tested my changes locally +- [ ] I have added/updated documentation as needed +- [ ] My changes do not introduce new warnings or errors +- [ ] I have verified that simulation accuracy is not degraded (if applicable) + +## Test Results + + + + +## Additional Notes + + diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml new file mode 100644 index 00000000..6490eb32 --- /dev/null +++ b/.github/workflows/lint.yml @@ -0,0 +1,52 @@ +name: Lint + +on: + push: + branches: [master, main] + pull_request: + branches: [master, main] + +jobs: + python-lint: + name: Python Lint + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v5 + with: + python-version: "3.10" + + - name: Install linters 
+ run: pip install flake8 + + - name: Run flake8 + run: | + # Stop the build if there are Python syntax errors or undefined names + flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics + # Treat all other issues as warnings (non-blocking) + flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics + + markdown-lint: + name: Markdown Lint + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: markdownlint + uses: DavidAnson/markdownlint-cli2-action@v19 + with: + globs: | + README.md + CONTRIBUTING.md + CHANGELOG.md + docs/**/*.md + config: | + { + "default": true, + "MD013": false, + "MD033": false, + "MD041": false + } + continue-on-error: true diff --git a/.gitignore b/.gitignore index e4a9790f..b6d67543 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,4 @@ -.vscode +# .vscode astra-sim-alibabacloud/build/simai_analytical/build/ astra-sim-alibabacloud/build/astra_ns3/build/ astra-sim-alibabacloud/extern/ @@ -8,3 +8,16 @@ test/log/ *.log .cur* .DS_Store + +# fth add +*.csv +*.txt +tmp_simai_inference_workload/ +aicb/ +Spectrum-X* + +fth-test/* + +# Personal dev / fth files +fth.sh +**/fth.sh \ No newline at end of file diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 00000000..9ad03ac6 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,62 @@ +

+ 中文  |  English +

+ +# Changelog + +All notable changes to SimAI will be documented in this file. + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). + +> **Note**: This changelog covers v1.0 (initial open-source release) and later versions. + +## [Unreleased] + +## [1.6.0] - 2026-03-16 + +### Added + +- GPU memory calculation module: accurate parameter counting and KV cache management for DeepSeek-V3-671B, Qwen3-MoE-235B, and Qwen3-Next-80B +- PD-separation memory planning for independent Prefill/Decode memory budgets +- Improved AICB decode time estimation with linear interpolation and global cache +- 4-scenario end-to-end inference test suite (`run_scenarios.sh`) +- SimAI 1.6 Technical Report (EN/ZH) +- Complete bilingual documentation system (30+ files under `docs/en/`, `docs/zh/`) +- GitHub community health files: issue/PR templates, Code of Conduct, Security Policy, Contributing Guide + +### Changed + +- Replaced print statements with logging across vidur-alibabacloud modules +- Added bilingual docstrings for public APIs +- Standardized TODO comments format + +### Removed + +- Removed ~390 lines of dead code in vidur-alibabacloud +- Cleaned personal debug markers across 8 files + +## [1.5.0] - 2025-12-30 + +### Added + +- **End-to-end multi-request inference simulation**: Full simulation support for multi-request inference workloads. +- **Prefill/Decode separation**: Model complex inference scenarios with Prefill/Decode phase separation. +- **Modern model support**: Added support for DeepSeek, Qwen3-MoE, and Qwen3-Next models. +- **Request scheduling via Vidur**: Integrated request scheduling component adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur) (see [vidur-alibabacloud](./vidur-alibabacloud/)). +- **AICB inference workload generation**: AICB now supports generating prefill/decode inference workloads for DeepSeek, Qwen3-MoE, and Qwen3-Next. 
+- **DeepSeek training workload support**: AICB now supports generating training workloads for DeepSeek (contributed by [@parthpower](https://github.com/parthpower)). +- **SimCCL initial release**: First public release of the SimCCL collective communication transformation module. + +## [1.0.0] - 2024-10-18 + +### Added + +- Initial open-source release of SimAI: full-stack simulator for AI large-scale training +- Core components: AICB, SimCCL, astra-sim-alibabacloud, ns-3-alibabacloud +- SimAI-Analytical: fast simulation using bus bandwidth abstraction +- SimAI-Simulation: full-stack NS3-based network simulation +- SimAI-Physical (Beta): CPU RDMA cluster physical traffic generation + +### Academic + +- SimAI paper accepted by **NSDI'25 Spring**. See [paper](https://arxiv.org/abs/2410.07346). diff --git a/CHANGELOG_CN.md b/CHANGELOG_CN.md new file mode 100644 index 00000000..a9e50a03 --- /dev/null +++ b/CHANGELOG_CN.md @@ -0,0 +1,62 @@ +

+ 中文  |  English +

+ +# 更新日志 + +SimAI 的所有重要变更均记录在此文件中。 + +格式基于 [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)。 + +> **注意**:本更新日志涵盖 v1.0(首次开源发布)及之后的版本。 + +## [未发布] + +## [1.6.0] - 2026-03-16 + +### 新增 + +- GPU 内存计算模块:支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 的精确参数计数与 KV Cache 管理 +- PD 分离内存规划:Prefill/Decode 阶段独立的内存预算计算 +- 改进 AICB decode 时间估算(首尾线性插值 + 全局缓存) +- 4 场景端到端推理测试套件(`run_scenarios.sh`) +- SimAI 1.6 技术报告(EN/ZH) +- 完整双语文档系统(`docs/en/`、`docs/zh/` 下 30+ 文件) +- GitHub 社区规范文件:Issue/PR 模板、行为准则、安全政策、贡献指南 + +### 变更 + +- vidur-alibabacloud 各模块 print 输出替换为 logging +- 公开 API 添加双语 docstring +- TODO 注释格式统一规范化 + +### 移除 + +- 清理 vidur-alibabacloud 中约 390 行死代码 +- 清理 8 个文件中的个人调试标记 + +## [1.5.0] - 2025-12-30 + +### 新增 + +- **端到端多请求推理仿真**:全面支持多请求推理工作负载的端到端仿真。 +- **Prefill/Decode 分离**:支持 Prefill/Decode 阶段分离等复杂推理场景建模。 +- **主流模型支持**:新增对 DeepSeek、Qwen3-MoE 和 Qwen3-Next 模型的支持。 +- **基于 Vidur 的请求调度**:集成了基于微软 [Vidur](https://github.com/microsoft/vidur) 适配的请求调度组件(详见 [vidur-alibabacloud](./vidur-alibabacloud/))。 +- **AICB 推理工作负载生成**:AICB 现已支持为 DeepSeek、Qwen3-MoE 和 Qwen3-Next 生成 prefill/decode 推理工作负载。 +- **DeepSeek 训练工作负载支持**:AICB 新增 DeepSeek 训练工作负载生成支持(由 [@parthpower](https://github.com/parthpower) 贡献)。 +- **SimCCL 首次发布**:SimCCL 集合通信转换模块首次对外公开发布。 + +## [1.0.0] - 2024-10-18 + +### 新增 + +- SimAI 首次开源发布:业界首个全栈高精度 AI 大规模训练模拟器 +- 核心组件:AICB、SimCCL、astra-sim-alibabacloud、ns-3-alibabacloud +- SimAI-Analytical:基于总线带宽抽象的快速仿真 +- SimAI-Simulation:基于 NS3 的全栈网络仿真 +- SimAI-Physical(Beta):CPU RDMA 集群物理流量生成 + +### 学术 + +- SimAI 论文被 **NSDI'25 Spring** 接收。详见 [论文](https://arxiv.org/abs/2410.07346)。 diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 00000000..1a84e5ce --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,136 @@ +

+ 中文  |  English +

+ +# Contributor Covenant Code of Conduct + +## Our Pledge + +We as members, contributors, and leaders pledge to make participation in our +community a harassment-free experience for everyone, regardless of age, body +size, visible or invisible disability, ethnicity, sex characteristics, gender +identity and expression, level of experience, education, socio-economic status, +nationality, personal appearance, race, religion, or sexual identity +and orientation. + +We pledge to act and interact in ways that contribute to an open, welcoming, +diverse, inclusive, and healthy community. + +## Our Standards + +Examples of behavior that contributes to a positive environment for our +community include: + +* Demonstrating empathy and kindness toward other people +* Being respectful of differing opinions, viewpoints, and experiences +* Giving and gracefully accepting constructive feedback +* Accepting responsibility and apologizing to those affected by our mistakes, + and learning from the experience +* Focusing on what is best not just for us as individuals, but for the + overall community + +Examples of unacceptable behavior include: + +* The use of sexualized language or imagery, and sexual attention or + advances of any kind +* Trolling, insulting or derogatory comments, and personal or political attacks +* Public or private harassment +* Publishing others' private information, such as a physical or email + address, without their explicit permission +* Other conduct which could reasonably be considered inappropriate in a + professional setting + +## Enforcement Responsibilities + +Community leaders are responsible for clarifying and enforcing our standards of +acceptable behavior and will take appropriate and fair corrective action in +response to any behavior that they deem inappropriate, threatening, offensive, +or harmful. 
+ +Community leaders have the right and responsibility to remove, edit, or reject +comments, commits, code, wiki edits, issues, and other contributions that are +not aligned to this Code of Conduct, and will communicate reasons for moderation +decisions when appropriate. + +## Scope + +This Code of Conduct applies within all community spaces, and also applies when +an individual is officially representing the community in public spaces. +Examples of representing our community include using an official e-mail address, +posting via an official social media account, or acting as an appointed +representative at an online or offline event. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be +reported to the community leaders responsible for enforcement at: + +* yunding.lg@alibaba-inc.com +* xuefeiyang.xfy@alibaba-inc.com +* qingxu.lqx@alibaba-inc.com + +All complaints will be reviewed and investigated promptly and fairly. + +All community leaders are obligated to respect the privacy and security of the +reporter of any incident. + +## Enforcement Guidelines + +Community leaders will follow these Community Impact Guidelines in determining +the consequences for any action they deem in violation of this Code of Conduct: + +### 1. Correction + +**Community Impact**: Use of inappropriate language or other behavior deemed +unprofessional or unwelcome in the community. + +**Consequence**: A private, written warning from community leaders, providing +clarity around the nature of the violation and an explanation of why the +behavior was inappropriate. A public apology may be requested. + +### 2. Warning + +**Community Impact**: A violation through a single incident or series +of actions. + +**Consequence**: A warning with consequences for continued behavior. No +interaction with the people involved, including unsolicited interaction with +those enforcing the Code of Conduct, for a specified period of time. 
This +includes avoiding interactions in community spaces as well as external channels +like social media. Violating these terms may lead to a temporary or +permanent ban. + +### 3. Temporary Ban + +**Community Impact**: A serious violation of community standards, including +sustained inappropriate behavior. + +**Consequence**: A temporary ban from any sort of interaction or public +communication with the community for a specified period of time. No public or +private interaction with the people involved, including unsolicited interaction +with those enforcing the Code of Conduct, is allowed during this period. +Violating these terms may lead to a permanent ban. + +### 4. Permanent Ban + +**Community Impact**: Demonstrating a pattern of violation of community +standards, including sustained inappropriate behavior, harassment of an +individual, or aggression toward or disparagement of classes of individuals. + +**Consequence**: A permanent ban from any sort of public interaction within +the community. + +## Attribution + +This Code of Conduct is adapted from the [Contributor Covenant][homepage], +version 2.0, available at +https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. + +Community Impact Guidelines were inspired by [Mozilla's code of conduct +enforcement ladder](https://github.com/mozilla/diversity). + +[homepage]: https://www.contributor-covenant.org + +For answers to common questions about this code of conduct, see the FAQ at +https://www.contributor-covenant.org/faq. Translations are available at +https://www.contributor-covenant.org/translations. diff --git a/CODE_OF_CONDUCT_CN.md b/CODE_OF_CONDUCT_CN.md new file mode 100644 index 00000000..79faac72 --- /dev/null +++ b/CODE_OF_CONDUCT_CN.md @@ -0,0 +1,91 @@ +

+ 中文  |  English +

+ +# 贡献者公约行为准则 + +## 我们的承诺 + +作为成员、贡献者和领导者,我们承诺让参与我们社区的每个人都能获得无骚扰的体验,无论其年龄、体型、可见或不可见的残疾、种族、性别特征、性别认同与表达、经验水平、受教育程度、社会经济地位、国籍、个人外貌、种族、宗教或性取向如何。 + +我们承诺以有助于建设开放、包容、多元、友好和健康社区的方式行事与互动。 + +## 我们的标准 + +有助于为我们社区创造积极环境的行为示例包括: + +* 对他人表现出同理心和善意 +* 尊重不同的意见、观点和经历 +* 给予建设性的反馈,并能优雅地接受建设性反馈 +* 承担责任,向受到我们错误影响的人道歉,并从中学习 +* 关注对整个社区最有利的事情,而不仅仅是对个人最有利的事情 + +不可接受的行为示例包括: + +* 使用性暗示的语言或图像,以及任何形式的性关注或性挑逗 +* 恶意评论、侮辱性或贬损性言论,以及人身攻击或政治攻击 +* 公开或私下的骚扰行为 +* 未经明确许可,发布他人的私人信息(如实际地址或电子邮件地址) +* 在职业环境中可能被合理认为不适当的其他行为 + +## 执行职责 + +社区领导者负责阐明和执行我们的可接受行为标准,并将对任何被认为不适当、威胁性、冒犯性或有害的行为采取适当且公平的纠正措施。 + +社区领导者有权利和责任删除、编辑或拒绝与本行为准则不一致的评论、提交、代码、Wiki 编辑、Issue 及其他贡献,并在适当时说明审核决定的原因。 + +## 适用范围 + +本行为准则适用于所有社区空间,也适用于个人在公共场合正式代表社区的情形。代表我们社区的示例包括:使用官方电子邮件地址、通过官方社交媒体账户发帖,或在线上或线下活动中担任指定代表。 + +## 执行 + +若发生滥用、骚扰或其他不可接受的行为,可向负责执行的社区领导者举报: + +* yunding.lg@alibaba-inc.com +* xuefeiyang.xfy@alibaba-inc.com +* qingxu.lqx@alibaba-inc.com + +所有投诉都将得到及时、公平的审查和调查。 + +所有社区领导者都有义务尊重任何事件举报者的隐私和安全。 + +## 执行指南 + +社区领导者在确定违反本行为准则的行为的处理后果时,将遵循以下社区影响指南: + +### 1. 纠正 + +**社区影响**:使用不恰当的语言或其他被认为在社区中不专业或不受欢迎的行为。 + +**处理结果**:由社区领导者发出私下书面警告,说明违规行为的性质,并解释为何该行为不恰当。可能要求公开道歉。 + +### 2. 警告 + +**社区影响**:通过单一事件或一系列行为造成的违规。 + +**处理结果**:发出警告,并说明持续此类行为的后果。在规定时间内,禁止与相关人员互动,包括禁止与执行行为准则的人员进行未经请求的互动。这包括避免在社区空间以及社交媒体等外部渠道进行互动。违反这些条款可能导致临时或永久禁止参与。 + +### 3. 临时禁止 + +**社区影响**:严重违反社区标准,包括持续的不适当行为。 + +**处理结果**:在规定时间内,临时禁止与社区进行任何形式的互动或公开交流。在此期间,禁止与相关人员进行任何公开或私下互动,包括禁止与执行行为准则的人员进行未经请求的互动。违反这些条款可能导致永久禁止参与。 + +### 4. 
永久禁止 + +**社区影响**:表现出违反社区标准的规律性行为,包括持续的不适当行为、骚扰某个人,或对某类人群的攻击或贬低。 + +**处理结果**:永久禁止在社区内进行任何形式的公开互动。 + +## 署名 + +本行为准则改编自 [贡献者公约][homepage] 2.0 版本, +原文见 https://www.contributor-covenant.org/version/2/0/code_of_conduct.html。 + +社区影响指南参考了 [Mozilla 的行为准则执行阶梯](https://github.com/mozilla/diversity)。 + +[homepage]: https://www.contributor-covenant.org + +有关本行为准则常见问题的解答,请参阅 https://www.contributor-covenant.org/faq。 +译文见 https://www.contributor-covenant.org/translations。 diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..a6a3658b --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,487 @@ +# Contributing to SimAI + +[中文版](CONTRIBUTING.zh-CN.md) + +Thank you for your interest in contributing to SimAI! This guide will help you get started with contributing code, documentation, and ideas. + +--- + +## What We're Building + +**Vision**: The industry's first full-stack, high-precision simulator for AI large-scale inference and training. + +**Goal**: Provide end-to-end modeling and simulation of AI training/inference processes—encompassing framework, collective communication, network layers, and more—so researchers can analyze performance, evaluate optimizations, and explore infrastructure designs without real hardware. + +**Current Progress**: SimAI 1.5 released (Dec 2025), with end-to-end multi-request inference simulation, DeepSeek/Qwen3 model support, and Prefill/Decode separation. + +**Academic Background**: Accepted by NSDI'25 Spring. See our [paper](https://arxiv.org/abs/2410.07346) for technical details. + +--- + +## How to Contribute + +### Ways to Contribute + +1. **New features** — Add model support, parallelism strategies, scheduling policies, etc. +2. **Bug fixes** — Fix simulation inaccuracies, crashes, or incorrect results +3. **Performance optimization** — Improve simulation speed, memory usage, or scalability +4. **Documentation** — Improve tutorials, add examples, fix errors +5. 
**Benchmarks & validation** — Add validation against real hardware results +6. **Issue reports** — Report bugs, request features, or share feedback + +--- + +## Project Architecture + +SimAI is a modular project composed of 5 core submodules (Git submodules) and several supporting directories: + +``` +SimAI/ +├── aicb/ # AI Computation Benchmark — workload generation (Python) +│ ├── workload_generator/ # Generates training/inference workloads +│ └── aicb.py # Main entry point +├── astra-sim-alibabacloud/ # Simulation engine — core simulator (C++) +│ ├── astra-sim/ # Extended from astra-sim 1.0 +│ └── build.sh # Build script +├── ns-3-alibabacloud/ # NS-3 network simulator backend (C++) +├── vidur-alibabacloud/ # LLM inference simulation (Python) +│ ├── vidur/ # Core simulation framework +│ └── setup.py # Python package config +├── SimCCL/ # Collective communication transformation +├── docs/ # Documentation and tutorials +├── example/ # Example workloads and configurations +├── scripts/ # Build and utility scripts +│ └── build.sh # Main build script +├── results/ # Simulation output directory +├── bin/ # Compiled binary output +├── Dockerfile # Docker container definition +└── README.md # Project documentation +``` + +--- + +## Development Environment Setup + +### Prerequisites + +- **Python** 3.8+ (3.12 recommended with Docker image) +- **CMake** 3.16+ +- **GCC/G++** 9.4+ +- **Git** with submodule support + +### Option A: Docker (Recommended) + +```bash +# Build the Docker image +docker build -t simai:latest . + +# Run a container with GPU support +docker run --gpus all -it --rm \ + -v $(pwd)/results:/workspace/SimAI/results \ + simai:latest /bin/bash +``` + +### Option B: Build from Source + +```bash +# 1. Clone with submodules +git clone --recurse-submodules https://github.com/aliyun/SimAI.git +cd SimAI + +# 2. 
Build C++ components (choose one mode) +# Analytical mode (fast, no network detail): +bash scripts/build.sh -c analytical + +# NS-3 simulation mode (full-stack, detailed network): +bash scripts/build.sh -c ns3 + +# Physical mode (beta, RDMA clusters): +bash scripts/build.sh -c phy + +# 3. Install Python dependencies +pip install -r aicb/requirements.txt +pip install -r vidur-alibabacloud/requirements.txt + +# 4. Verify the build +ls bin/ # Should contain SimAI_analytical or SimAI_simulator +``` + +### Verify Installation + +```bash +# Quick check: run a simple analytical simulation +cd bin +./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml +``` + +--- + +## Working with Submodules + +SimAI uses Git submodules for its core components. Understanding this is crucial for contributing. + +### Submodule Overview + +| Submodule | Repository | Language | Description | +|-----------|-----------|----------|-------------| +| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | Python | Workload generation | +| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | Python | Collective communication | +| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | C++ | Network simulation | +| `astra-sim-alibabacloud` | In-tree | C++ | Simulation engine | +| `vidur-alibabacloud` | In-tree | Python | Inference simulation | + +### Key Rules + +1. **Submodules have independent Git histories.** Changes inside a submodule directory are tracked by that submodule's own repo, not the parent. +2. **The parent repo only tracks the commit hash** of each submodule. After modifying a submodule, you must commit in both the submodule and the parent repo. +3. 
**Always initialize submodules** after cloning: + ```bash + git submodule update --init --recursive + ``` + +### Cross-Submodule Changes + +If your contribution spans multiple submodules (e.g., adding a new model in `aicb` and simulation support in `astra-sim-alibabacloud`): + +1. Make and commit changes in each submodule separately +2. Update the parent repo to point to the new submodule commits +3. Create separate PRs for each submodule repository if they have independent remotes +4. Reference the related PRs in your descriptions + +--- + +## Development Workflow + +### Step 1: Fork and Clone + +```bash +# Fork the repository on GitHub, then: +git clone --recurse-submodules https://github.com/YOUR_USERNAME/SimAI.git +cd SimAI + +# Add upstream remote +git remote add upstream https://github.com/aliyun/SimAI.git +``` + +### Step 2: Create a Feature Branch + +```bash +# Sync with upstream first +git fetch upstream +git checkout -b feature/your-feature-name upstream/master + +# Branch naming conventions: +# feature/xxx — New features +# fix/xxx — Bug fixes +# docs/xxx — Documentation +# perf/xxx — Performance improvements +# refactor/xxx — Code refactoring +``` + +### Step 3: Develop and Test + +```bash +# Make your changes... +# Run relevant tests (see Testing section below) + +# For C++ changes, rebuild: +bash scripts/build.sh -c analytical # or ns3 + +# For Python changes, verify imports and basic functionality +python -c "from aicb import ..." 
+``` + +### Step 4: Commit Your Changes + +```bash +# Stage your changes +git add -A + +# Commit with a descriptive message (see Commit Convention below) +git commit -m "feat(aicb): add Llama-4 model workload generation" +``` + +### Step 5: Push and Create PR + +```bash +# Push to your fork +git push origin feature/your-feature-name + +# Then create a Pull Request on GitHub +``` + +--- + +## Code Style + +### Python + +- **Formatter**: [black](https://github.com/psf/black) (default settings) +- **Import sorting**: [isort](https://pycqa.github.io/isort/) (compatible with black) +- **Linter**: [flake8](https://flake8.pycqa.org/) +- **Max line length**: 120 characters + +```bash +# Format your Python code +black --line-length 120 your_file.py +isort your_file.py +flake8 your_file.py --max-line-length 120 +``` + +### C++ + +- Follow the existing code style in `astra-sim-alibabacloud/` +- Use 4-space indentation +- Keep function and variable names in `snake_case` +- Add comments for non-trivial logic + +### Shell Scripts + +- Use `#!/bin/bash` shebang +- Quote all variables: `"${VAR}"` not `$VAR` +- Use `set -e` for error handling where appropriate + +### General Rules + +- Write comments in **English** +- All new functions/classes should have docstrings or header comments +- Avoid hardcoded paths; use relative paths or configuration variables +- Keep changes focused — one feature/fix per PR + +--- + +## Commit Message Convention + +Use [Conventional Commits](https://www.conventionalcommits.org/) format: + +``` +(): + +[optional body] + +[optional footer] +``` + +### Types + +| Type | Description | +|------|-------------| +| `feat` | New feature | +| `fix` | Bug fix | +| `docs` | Documentation only | +| `refactor` | Code refactoring (no feature/fix) | +| `test` | Adding or updating tests | +| `perf` | Performance improvement | +| `chore` | Build process, tooling, dependencies | + +### Scopes + +`aicb`, `vidur`, `astra-sim`, `ns3`, `simccl`, `docs`, `docker`, `scripts` + 
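The type and scope tables above translate directly into a pattern a script can check. The sketch below is purely illustrative (the `check_msg` helper is hypothetical, not something shipped with SimAI); it accepts exactly the types and scopes listed above:

```bash
#!/bin/bash
# Illustrative conventional-commit check built from the type/scope tables above.
# NOTE: check_msg is a hypothetical helper, not part of the SimAI repository.
check_msg() {
    echo "$1" | grep -Eq \
        '^(feat|fix|docs|refactor|test|perf|chore)(\((aicb|vidur|astra-sim|ns3|simccl|docs|docker|scripts)\))?: .+'
}

check_msg "feat(aicb): add DeepSeek-V3 inference workload generation" && echo "accepted"
check_msg "update code" || echo "rejected"   # no type/scope prefix
```

A check like this could, for example, be wired into a local `commit-msg` Git hook so malformed messages are caught before they reach a PR.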
+### Examples + +**Good:** +``` +feat(aicb): add DeepSeek-V3 inference workload generation +fix(astra-sim): correct AllReduce latency calculation for ring algorithm +docs: update build instructions for NS-3 mode +perf(vidur): reduce memory allocation in request scheduler +``` + +**Bad:** +``` +update code # Too vague +fix bug # No scope, no description +feat(aicb): Add DeepSeek-V3 inference workload generation support for the new model architecture # Too long +``` + +--- + +## Pull Request Guidelines + +### PR Title + +Use the same format as commit messages: `(): ` + +### PR Description Template + +```markdown +## Summary +Brief description of what this PR does. + +## Changes +- Change 1 +- Change 2 + +## Testing +Describe how you tested these changes. + +## Related Issues +Closes #xxx (if applicable) + +## Checklist +- [ ] Code compiles without errors +- [ ] Existing simulations produce unchanged results (no precision regression) +- [ ] New code has appropriate comments +- [ ] Tests added for new functionality +- [ ] Documentation updated if needed +``` + +--- + +## Testing + +### Training Simulation (Analytical Mode) + +```bash +# Generate a training workload +cd aicb +python aicb.py -m training --model_name GPT-175B + +# Run analytical simulation +cd ../bin +./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml +``` + +### Training Simulation (NS-3 Mode) + +```bash +# Build NS-3 backend first +bash scripts/build.sh -c ns3 + +# Run full-stack simulation +cd bin +./SimAI_simulator [simulation_parameters] +``` + +### Inference Simulation + +```bash +# Run inference simulation via vidur +cd vidur-alibabacloud +python -m vidur.main [config_options] +``` + +### Verify No Regression + +When modifying simulation logic, always compare results against a known-good baseline: + +```bash +# Save baseline results before your changes +cp results/output_baseline.csv /tmp/baseline.csv + +# 
After your changes, re-run the same simulation, then compare +diff results/output_baseline.csv /tmp/baseline.csv +``` + +--- + +## Pre-Submission Quality Checklist + +Before submitting your PR, run through this checklist: + +```bash +# 1. C++ compilation check (if you changed C++ code) +bash scripts/build.sh -c analytical +bash scripts/build.sh -c ns3 + +# 2. Python lint check +black --check --line-length 120 your_changed_files.py +flake8 your_changed_files.py --max-line-length 120 + +# 3. Basic simulation test +cd bin && ./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml + +# 4. Submodule state check +git submodule status # Ensure no unexpected submodule changes + +# 5. Verify no unintended files +git diff --stat # Review all changes before committing +``` + +**Checklist Summary:** +- [ ] C++ code compiles without errors or warnings +- [ ] Python code passes lint checks (black, flake8) +- [ ] Basic simulation runs successfully +- [ ] No unintended submodule pointer changes +- [ ] Commit messages follow the convention +- [ ] PR description is complete + +--- + +## Review Process and Acceptance Criteria + +### Acceptance Criteria + +Your contribution will be accepted if it meets these standards: + +| Criterion | Requirement | +|-----------|-------------| +| **Build** | Compiles without errors (C++ and Python) | +| **Precision** | Does not degrade existing simulation accuracy | +| **Tests** | Key code paths are covered by tests or validated | +| **Documentation** | New features have comments and/or doc updates | +| **Style** | Follows the code style guidelines above | +| **Scope** | Changes are focused and well-explained | + +### Reasons for Rejection + +- Build failures +- Simulation precision regression without justification +- Missing tests for new functionality +- Overly large PRs mixing unrelated changes +- Insufficient description of what/why + +### Review Timeline + +1.
**Initial review**: Within 3-5 business days +2. **Feedback**: Constructive comments with actionable suggestions +3. **Iteration**: Address feedback and update PR +4. **Merge**: Approved PRs are merged to the main branch + +--- + +## What NOT to Contribute + +- Proprietary or closed-source dependencies +- Changes that break backward compatibility without discussion +- Large-scale reformatting changes (open an issue first) +- Untested code in simulation-critical paths +- Commits with sensitive information (API keys, internal URLs, etc.) + +--- + +## Recognition + +Contributors will be: +- Acknowledged in release notes +- Listed in project documentation +- Credited in commit history + +Significant contributors may be invited to join the maintainer team. + +--- + +## Getting Help + +- **Issues**: [GitHub Issues](https://github.com/aliyun/SimAI/issues) +- **Discussions**: Open an issue with "Question:" prefix +- **Documentation**: See [docs/Tutorial.md](docs/Tutorial.md) for detailed usage guides +- **Community Events**: Check [README.md](README.md) for upcoming events and workshops + +--- + +## Thank You! + +SimAI is built by a growing community of researchers and engineers. Your contributions help advance AI systems research for everyone. + +**Let's build something amazing together!** diff --git a/CONTRIBUTING.zh-CN.md b/CONTRIBUTING.zh-CN.md new file mode 100644 index 00000000..4d27c930 --- /dev/null +++ b/CONTRIBUTING.zh-CN.md @@ -0,0 +1,487 @@ +# SimAI 贡献指南 + +[English Version](CONTRIBUTING.md) + +感谢你对 SimAI 项目的关注!本指南将帮助你了解如何贡献代码、文档和想法。 + +--- + +## 项目愿景与目标 + +**愿景**:打造业界首个全栈、高精度的 AI 大规模推理与训练仿真器。 + +**目标**:提供端到端的 AI 训练/推理过程建模与仿真——涵盖框架层、集合通信层、网络层等——使研究人员无需真实硬件即可分析性能、评估优化方案、探索基础设施设计。 + +**当前进展**:SimAI 1.5 已发布(2025年12月),支持端到端多请求推理仿真、DeepSeek/Qwen3 模型、Prefill/Decode 分离。 + +**学术背景**:已被 NSDI'25 Spring 接收。技术细节请参阅我们的[论文](https://arxiv.org/abs/2410.07346)。 + +--- + +## 贡献方式 + +### 你可以通过以下方式参与贡献 + +1. **新功能开发** — 添加模型支持、并行策略、调度策略等 +2. **Bug 修复** — 修复仿真精度问题、崩溃或错误结果 +3. 
**性能优化** — 提升仿真速度、内存使用或可扩展性 +4. **文档改进** — 完善教程、添加示例、修正错误 +5. **基准测试与验证** — 添加与真实硬件结果的对比验证 +6. **Issue 报告** — 报告 Bug、提出需求或分享反馈 + +--- + +## 项目架构 + +SimAI 是一个模块化项目,由 5 个核心子模块(Git submodules)和若干辅助目录组成: + +``` +SimAI/ +├── aicb/ # AI 计算基准 — 工作负载生成(Python) +│ ├── workload_generator/ # 训练/推理工作负载生成器 +│ └── aicb.py # 主入口 +├── astra-sim-alibabacloud/ # 仿真引擎 — 核心仿真器(C++) +│ ├── astra-sim/ # 基于 astra-sim 1.0 扩展 +│ └── build.sh # 编译脚本 +├── ns-3-alibabacloud/ # NS-3 网络仿真后端(C++) +├── vidur-alibabacloud/ # LLM 推理仿真框架(Python) +│ ├── vidur/ # 核心仿真框架 +│ └── setup.py # Python 包配置 +├── SimCCL/ # 集合通信变换库 +├── docs/ # 文档与教程 +├── example/ # 示例工作负载和配置 +├── scripts/ # 构建和工具脚本 +│ └── build.sh # 主编译脚本 +├── results/ # 仿真结果输出目录 +├── bin/ # 编译产物目录 +├── Dockerfile # Docker 容器定义 +└── README.md # 项目文档 +``` + +--- + +## 开发环境搭建 + +### 前置依赖 + +- **Python** 3.8+(Docker 镜像中推荐 3.12) +- **CMake** 3.16+ +- **GCC/G++** 9.4+ +- **Git**(支持 submodule) + +### 方式一:Docker(推荐) + +```bash +# 构建 Docker 镜像 +docker build -t simai:latest . + +# 启动容器(支持 GPU) +docker run --gpus all -it --rm \ + -v $(pwd)/results:/workspace/SimAI/results \ + simai:latest /bin/bash +``` + +### 方式二:源码编译 + +```bash +# 1. 克隆仓库(含子模块) +git clone --recurse-submodules https://github.com/aliyun/SimAI.git +cd SimAI + +# 2. 编译 C++ 组件(选择一种模式) +# 分析模式(快速仿真,不含网络细节): +bash scripts/build.sh -c analytical + +# NS-3 仿真模式(全栈,详细网络建模): +bash scripts/build.sh -c ns3 + +# 物理模式(Beta,RDMA 集群): +bash scripts/build.sh -c phy + +# 3. 安装 Python 依赖 +pip install -r aicb/requirements.txt +pip install -r vidur-alibabacloud/requirements.txt + +# 4. 
验证编译结果 +ls bin/ # 应包含 SimAI_analytical 或 SimAI_simulator +``` + +### 验证安装 + +```bash +# 快速测试:运行一个简单的分析仿真 +cd bin +./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml +``` + +--- + +## 子模块开发指南 + +SimAI 使用 Git submodule 管理核心组件。理解子模块的工作方式对于贡献至关重要。 + +### 子模块概览 + +| 子模块 | 仓库 | 语言 | 说明 | +|--------|------|------|------| +| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | Python | 工作负载生成 | +| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | Python | 集合通信变换 | +| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | C++ | 网络仿真 | +| `astra-sim-alibabacloud` | 仓库内 | C++ | 仿真引擎 | +| `vidur-alibabacloud` | 仓库内 | Python | 推理仿真 | + +### 关键规则 + +1. **子模块有独立的 Git 历史。** 子模块目录内的更改由该子模块自身的仓库跟踪,而非父仓库。 +2. **父仓库只跟踪子模块的 commit hash。** 修改子模块后,需在子模块和父仓库中分别提交。 +3. **克隆后务必初始化子模块:** + ```bash + git submodule update --init --recursive + ``` + +### 跨子模块修改 + +如果你的贡献涉及多个子模块(例如在 `aicb` 中添加新模型,同时在 `astra-sim-alibabacloud` 中添加仿真支持): + +1. 在每个子模块中分别修改并提交 +2. 更新父仓库,指向子模块的新 commit +3. 如果子模块有独立的远程仓库,需分别创建 PR +4. 在 PR 描述中互相引用关联的 PR + +--- + +## 开发工作流 + +### 第一步:Fork 和 Clone + +```bash +# 先在 GitHub 上 Fork 仓库,然后: +git clone --recurse-submodules https://github.com/YOUR_USERNAME/SimAI.git +cd SimAI + +# 添加上游远程仓库 +git remote add upstream https://github.com/aliyun/SimAI.git +``` + +### 第二步:创建功能分支 + +```bash +# 先同步上游代码 +git fetch upstream +git checkout -b feature/your-feature-name upstream/master + +# 分支命名约定: +# feature/xxx — 新功能 +# fix/xxx — Bug 修复 +# docs/xxx — 文档更新 +# perf/xxx — 性能优化 +# refactor/xxx — 代码重构 +``` + +### 第三步:开发与测试 + +```bash +# 进行修改... +# 运行相关测试(见下方"测试要求"章节) + +# C++ 代码修改后需重新编译: +bash scripts/build.sh -c analytical # 或 ns3 + +# Python 代码修改后,验证导入和基本功能: +python -c "from aicb import ..." 
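# (示意,可选)提交前可用 grep 粗检提交信息是否符合下文的 Conventional Commits 规范;
# 变量名 msg 仅为示例,类型列表取自本文档的"类型(Type)"表
msg="feat(aicb): add Llama-4 model workload generation"
echo "$msg" | grep -Eq '^(feat|fix|docs|refactor|test|perf|chore)(\([a-z0-9-]+\))?: .+' \
  && echo "commit message format OK"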
+``` + +### 第四步:提交变更 + +```bash +# 暂存更改 +git add -A + +# 使用规范的提交消息(见下方 Commit 规范) +git commit -m "feat(aicb): add Llama-4 model workload generation" +``` + +### 第五步:推送并创建 PR + +```bash +# 推送到你的 Fork +git push origin feature/your-feature-name + +# 然后在 GitHub 上创建 Pull Request +``` + +--- + +## 代码规范 + +### Python + +- **格式化工具**:[black](https://github.com/psf/black)(默认设置) +- **导入排序**:[isort](https://pycqa.github.io/isort/)(兼容 black) +- **静态检查**:[flake8](https://flake8.pycqa.org/) +- **最大行宽**:120 字符 + +```bash +# 格式化 Python 代码 +black --line-length 120 your_file.py +isort your_file.py +flake8 your_file.py --max-line-length 120 +``` + +### C++ + +- 遵循 `astra-sim-alibabacloud/` 中现有的代码风格 +- 使用 4 空格缩进 +- 函数和变量名使用 `snake_case` +- 为非平凡逻辑添加注释 + +### Shell 脚本 + +- 使用 `#!/bin/bash` 声明 +- 变量一律加引号:`"${VAR}"` 而非 `$VAR` +- 适当使用 `set -e` 进行错误处理 + +### 通用规则 + +- 代码注释使用**英文** +- 所有新函数/类应有文档字符串或头注释 +- 避免硬编码路径,使用相对路径或配置变量 +- 保持改动聚焦——每个 PR 只做一件事 + +--- + +## Commit 消息规范 + +使用 [Conventional Commits](https://www.conventionalcommits.org/) 格式: + +``` +(): + +[可选的正文] + +[可选的脚注] +``` + +### 类型(Type) + +| 类型 | 说明 | +|------|------| +| `feat` | 新功能 | +| `fix` | Bug 修复 | +| `docs` | 仅文档变更 | +| `refactor` | 代码重构(非新功能/修复) | +| `test` | 添加或更新测试 | +| `perf` | 性能优化 | +| `chore` | 构建流程、工具、依赖 | + +### 作用域(Scope) + +`aicb`、`vidur`、`astra-sim`、`ns3`、`simccl`、`docs`、`docker`、`scripts` + +### 示例 + +**正确:** +``` +feat(aicb): add DeepSeek-V3 inference workload generation +fix(astra-sim): correct AllReduce latency calculation for ring algorithm +docs: update build instructions for NS-3 mode +perf(vidur): reduce memory allocation in request scheduler +``` + +**错误:** +``` +update code # 太模糊 +fix bug # 无作用域,无描述 +feat(aicb): Add DeepSeek-V3 inference workload generation support for the new model architecture # 太长 +``` + +--- + +## Pull Request 规范 + +### PR 标题 + +与 Commit 消息格式一致:`(): ` + +### PR 描述模板 + +```markdown +## 概述 +简要描述本 PR 的内容。 + +## 变更内容 +- 变更 1 +- 变更 2 + +## 测试方式 +描述你如何测试了这些变更。 + +## 关联 Issue +Closes 
#xxx(如适用) + +## 自检清单 +- [ ] 代码编译无错误 +- [ ] 现有仿真结果不受影响(无精度回归) +- [ ] 新代码有适当的注释 +- [ ] 为新功能添加了测试 +- [ ] 必要时更新了文档 +``` + +--- + +## 测试要求 + +### 训练仿真测试(分析模式) + +```bash +# 生成训练工作负载 +cd aicb +python aicb.py -m training --model_name GPT-175B + +# 运行分析仿真 +cd ../bin +./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml +``` + +### 训练仿真测试(NS-3 模式) + +```bash +# 先编译 NS-3 后端 +bash scripts/build.sh -c ns3 + +# 运行全栈仿真 +cd bin +./SimAI_simulator [仿真参数] +``` + +### 推理仿真测试 + +```bash +# 通过 vidur 运行推理仿真 +cd vidur-alibabacloud +python -m vidur.main [配置选项] +``` + +### 精度回归验证 + +修改仿真逻辑时,务必与已知基线结果进行对比: + +```bash +# 修改前保存基线结果 +cp results/output_baseline.csv /tmp/baseline.csv + +# 修改后对比 +diff results/output_baseline.csv /tmp/baseline.csv +``` + +--- + +## 提交前质量检查清单 + +提交 PR 之前,请逐项检查: + +```bash +# 1. C++ 编译检查(如果修改了 C++ 代码) +bash scripts/build.sh -c analytical +bash scripts/build.sh -c ns3 + +# 2. Python 代码检查 +black --check --line-length 120 your_changed_files.py +flake8 your_changed_files.py --max-line-length 120 + +# 3. 基本仿真测试 +cd bin && ./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml + +# 4. 子模块状态检查 +git submodule status # 确保没有意外的子模块变更 + +# 5. 确认无遗漏文件 +git diff --stat # 提交前审查所有变更 +``` + +**检查清单总结:** +- [ ] C++ 代码编译无错误或警告 +- [ ] Python 代码通过 lint 检查(black、flake8) +- [ ] 基本仿真运行成功 +- [ ] 无意外的子模块指针变更 +- [ ] Commit 消息符合规范 +- [ ] PR 描述完整 + +--- + +## 审查流程与接受标准 + +### 接受标准 + +你的贡献需要满足以下标准才能被接受: + +| 标准 | 要求 | +|------|------| +| **编译** | C++ 和 Python 代码均无编译/语法错误 | +| **精度** | 不降低现有仿真精度 | +| **测试** | 关键代码路径有测试或验证覆盖 | +| **文档** | 新功能有注释和/或文档更新 | +| **风格** | 遵循上述代码规范 | +| **范围** | 变更聚焦且解释清晰 | + +### 拒绝原因 + +- 编译失败 +- 仿真精度回归且无合理解释 +- 新功能缺少测试 +- PR 过大且混合了不相关的变更 +- 描述不充分 + +### 审查时间线 + +1. **初始审查**:3-5 个工作日内 +2. **反馈**:建设性的意见和可操作的建议 +3. **迭代**:根据反馈更新 PR +4. 
**合并**:审批通过后合入主分支 + +--- + +## 不接受的贡献 + +- 引入私有或闭源依赖 +- 未经讨论就破坏向后兼容性的变更 +- 大规模格式化变更(请先开 Issue 讨论) +- 仿真关键路径上的未测试代码 +- 包含敏感信息的提交(API 密钥、内部 URL 等) + +--- + +## 致谢与认可 + +贡献者将获得以下认可: +- 在发布说明中致谢 +- 在项目文档中列名 +- 在 commit 历史中署名 + +重要贡献者可能被邀请加入维护者团队。 + +--- + +## 获取帮助 + +- **Issue**:[GitHub Issues](https://github.com/aliyun/SimAI/issues) +- **讨论**:以 "Question:" 前缀开一个 Issue +- **文档**:参阅 [docs/Tutorial.md](docs/Tutorial.md) 获取详细使用指南 +- **社区活动**:查看 [README.md](README.md) 了解近期活动和研讨会 + +--- + +## 感谢 + +SimAI 由一个不断壮大的研究者和工程师社区共同构建。你的每一份贡献都在推动 AI 系统研究的发展。 + +**让我们一起创造更好的未来!** diff --git a/README.ja.md b/README.ja.md index dd96f29f..fd98ac53 100644 --- a/README.ja.md +++ b/README.ja.md @@ -8,12 +8,13 @@ | 日付 | イベント | 場所 | 内容 | 形式 | |:----:|:------|:---------|:--------|:----:| -| 未定 | SimAI 2.0 | 🌐 オンライン | SimAI 2.0のリリース | 💻 バーチャル | +| -- | | | | | ### 🌟 過去のイベント | 日付 | イベント | 場所 | 内容 | 形式 | |:----:|:------|:---------|:--------|:----:| +| 2025年12月30日 | SimAI 1.5 | 🌐 オンライン | SimAI 1.5のリリース | 💻 バーチャル | | 2025年6月4日 | SimAIコミュニティ第1回ワークショップ | 📍 北京大学 | コミュニティ貢献者による3つの講演 | 🎓 現地 | | 2025年5月24日 | 第28回Chinasysワークショップ | 📍 重慶大学 | SimAIに関する招待講演 | 🎓 現地 | | 2024年12月27日 | SimAI技術発表会 | 📍 北京航空航天大学 | SimAI技術共有とディスカッション | 🎓 現地 | diff --git a/README.md b/README.md index 85fbcb63..3edf4667 100755 --- a/README.md +++ b/README.md @@ -1,12 +1,31 @@ +

+ 中文  |  English +

+ +# SimAI + +[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) +[![NSDI'25](https://img.shields.io/badge/NSDI'25-SimAI-blue.svg)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + # Latest News ### Recent Updates +- [2026/03] **SimAI 1.6 Released!** This release adds GPU memory modeling for inference simulation. Key features include: + + - **GPU Memory Module:** Accurate parameter counting and KV cache management for DeepSeek-V3-671B, Qwen3-MoE-235B, and Qwen3-Next-80B. See [SimAI 1.6 Tech Report](./docs/SimAI_1.6_Tech_Report.md). + - **PD-Separation Memory Planning:** Independent parameter memory and KV cache budget calculation for Prefill and Decode phases. See [memory_planner.py](./vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py). + - **Improved Decode Time Estimation:** Linear interpolation replacing nearest-neighbor for AICB decode time prediction, with global cache for cross-run reuse. See [execution_time.py](./vidur-alibabacloud/vidur/entities/execution_time.py). + - **4-Scenario Test Suite:** End-to-end validation covering Qwen3-Next-80B, DeepSeek-671B, and Qwen3-MoE-235B. See [run_scenarios.sh](./vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh). + - **Bilingual Documentation:** Complete EN/ZH documentation system. See [English Docs](./docs/en/index.md) | [中文文档](./docs/zh/index.md). + - **GitHub Community Health Files:** Added [Issue Templates](./.github/ISSUE_TEMPLATE/), [PR Template](./.github/pull_request_template.md), [Code of Conduct](./CODE_OF_CONDUCT.md), [Security Policy](./SECURITY.md), [Contributing Guide](./CONTRIBUTING.md), and [Changelog](./CHANGELOG.md). + - **Code Quality:** Replaced print with logging, added bilingual docstrings, removed ~390 lines of dead code, standardized TODOs, and added type annotations across vidur-alibabacloud modules. + - [2025/12] **SimAI 1.5 Released!** This release brings end-to-end simulation for multi-request **inference** workloads. 
Key features include: - - - **Advanced Inference Simulation:** Model complex scenarios with Prefill/Decode separation. - - **Modern Model Support:** Now includes DeepSeek, Qwen3Moe and Qwen3Next. See [AICB's README](./aicb/README.md) for more detailed information. - - **Request Scheduling:** Request scheduling is now handled by a component adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur). See [Vidur-Alibabacloud's README](./vidur-alibabacloud/README.md) for more detailed information. + + - **Advanced Inference Simulation:** Model complex scenarios with Prefill/Decode separation. + - **Modern Model Support:** Now includes DeepSeek, Qwen3Moe and Qwen3Next. See [AICB's README](./aicb/README.md) for more detailed information. + - **Request Scheduling:** Request scheduling is now handled by a component adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur). See [Vidur-Alibabacloud's README](./vidur-alibabacloud/README.md) for more detailed information. - [2025/11] [AICB](https://github.com/aliyun/aicb/tree/master) now supports generating **prefill/decode** inference workloads for **DeepSeek**, **Qwen3-MoE** and **Qwen3-Next**. @@ -14,7 +33,8 @@ - [2025/06] The code of SimCCL is first released in the branch [SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL) and will be released in SimCCL repository soon. -**We warmly welcome contributions from the community!** If you are interested in helping shape the future of SimAI, please feel free to open an issue to discuss your ideas or submit a pull request. +**We warmly welcome contributions from the community!** If you are interested in helping shape the future of SimAI, please feel free to open an issue to discuss your ideas or submit a pull request. +
🎯 Events & Community Engagement 🎯 @@ -29,6 +49,7 @@ | Date | Event | Location | Content | Type | |:----------------:|:------------------------------------------------------------------------ |:----------------------- |:-------------------------------------------------------- |:-------------:| +| Mar 16, 2026 | SimAI 1.6 | 🌐 Online | The release of SimAI 1.6 | 💻 Virtual | | Dec 30, 2025 | SimAI 1.5 | 🌐 Online | The release of SimAI 1.5 | 💻 Virtual | | Jun 4, 2025 | The first workshop of the SimAI community | 📍 Peking University | Three talks from community contributors | 🎓 On-site | | May 24, 2025 | The 28th Chinasys workshop | 📍 Chongqing University | An invited talk about SimAI | 🎓 On-site | @@ -44,6 +65,15 @@ --- +## Documentation + +| | | +|---|---| +| [English Documentation](./docs/en/index.md) | Full documentation in English | +| [中文文档](./docs/zh/index.md) | 完整中文文档 | + +--- + # Table of Contents - [SimAI Overview](#simai-overview) @@ -51,18 +81,17 @@ - [Components](#components) - [Scenario](#scenario) - [Citation](#citation) -- [Usage](#usage) +- [Quick Start](#quick-start) - [Setup](#setup) - - [From Source Code](#from-source-code) - [Use SimAI-Analytical](#use-simai-analytical) - [Use SimAI-Simulation](#use-simai-simulation) - - [Use Vidur-AICB](#use-vidur-aicb) + - [Use Multi-requests Inference Simulation](#use-multi-requests-inference-simulation) # SimAI Overview ## Introduction -**SimAI** is the industry's first full-stack, high-precision **Sim**ulator for **AI** large-scale **\*\*inference\*\*** and **training**. It provides detailed modeling and simulation of the entire LLM training process, encompassing framework, collective communication, network layers, and more. This comprehensive approach offers end-to-end performance data, enabling researchers to: +**SimAI** is the industry's first full-stack, high-precision **Sim**ulator for **AI** large-scale **inference** and **training**. 
It provides detailed modeling and simulation of the entire LLM training process, encompassing framework, collective communication, network layers, and more. This comprehensive approach offers end-to-end performance data, enabling researchers to: - Analyze inference/training process details - Evaluate the time consumption of AI tasks under specific conditions @@ -86,7 +115,7 @@ SimAI --|--- SimCCL |--- vidur-alibabacloud -Building on pure simulation capabilities, SimAI has evolved into a versatile full-stack toolkit comprising four components ([aicb](https://github.com/aliyun/aicb), [SimCCL](https://github.com/aliyun/SimCCL), [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud), [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud)). These components can be combined in various ways to achieve different functionalities. Below, we present the six main usage scenarios for SimAI. We encourage users to explore even more possibilities with this powerful tool. +Building on pure simulation capabilities, SimAI has evolved into a versatile full-stack toolkit comprising four components ([aicb](https://github.com/aliyun/aicb), [SimCCL](https://github.com/aliyun/SimCCL), [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud), [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud)). These components can be combined in various ways to achieve different functionalities. Below, we present the main usage scenarios for SimAI. We encourage users to explore even more possibilities with this powerful tool. Below is the architecture diagram of the SimAI Simulator: ![SimAI_Arc](./docs/images/SimAI_Arc.png) @@ -103,15 +132,15 @@ SimAI supports three major operation modes to meet different simulation requirem **SimAI-Physical** *(Beta)* enables physical traffic generation for CPU RDMA cluster environments. 
This mode generates NCCL-like traffic patterns, allowing in-depth study of NIC behaviors during LLM training. It is currently in internal testing phase. -| Scenario | Description | Component Combination | -| -------------------------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 1. AICB Test Suite | Run communication patterns on GPU clusters using AICB Test suite | [AICB](https://github.com/aliyun/aicb) | -| 2. AICB/AIOB Workload | Model compute/communication patterns of **\*\*inference\*\*/training** process to generate workload | [AICB](https://github.com/aliyun/aicb) | -| 3. Collective Comm Analyze | Break down collective communication operations into point-to-point communication sets | [SimCCL](https://github.com/aliyun/SimCCL) | -| 4. Collective Comm w/o GPU | Perform RDMA collective communication traffic on non-GPU clusters | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(physical) | +| Scenario | Description | Component Combination | +|----------------------------------------|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| 1. AICB Test Suite | Run communication patterns on GPU clusters using AICB Test suite | [AICB](https://github.com/aliyun/aicb) | +| 2. 
AICB/AIOB Workload | Model compute/communication patterns of **inference**/training process to generate workload | [AICB](https://github.com/aliyun/aicb) | +| 3. Collective Comm Analyze | Break down collective communication operations into point-to-point communication sets | [SimCCL](https://github.com/aliyun/SimCCL) | +| 4. Collective Comm w/o GPU | Perform RDMA collective communication traffic on non-GPU clusters | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(physical) | | 5. SimAI-Analytical | Conduct rapid AICB workload analysis and simulation on any server (ignoring underlying network details) | [AICB](https://github.com/aliyun/aicb) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical) | -| 6. SimAI-Simulation | Perform full simulation on any server | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(simulation) + [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | -| 7. Multi-requests Inference Simulation | Perform full multi-requests **inference** simulation using one GPU server | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [vidur-alibabacloud](./vidur-alibabacloud) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical/simulation) | +| 6. SimAI-Simulation | Perform full simulation on any server | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(simulation) + [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | +| 7. 
Multi-requests Inference Simulation | Perform full multi-requests **inference** simulation using one GPU server | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [vidur-alibabacloud](./vidur-alibabacloud) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical/simulation) | ## Citation @@ -125,15 +154,15 @@ We encourage innovative research and extensions based on SimAI. Welcome to join # Quick Start -Here are some simple examples, SimAI full tutorials can be found here: [**SimAI@Tutorial**](./docs/Tutorial.md), [**aicb@Tutorial**](https://github.com/aliyun/aicb/blob/master/training/tutorial.md), [SimCCL@Tutorial], [ns-3-alibabacloud@Tutorial] +Here are some simple examples. SimAI full tutorials can be found here: [**SimAI@Tutorial**](./docs/Tutorial.md), [**aicb@Tutorial**](https://github.com/aliyun/aicb/blob/master/training/tutorial.md), [SimCCL@Tutorial], [ns-3-alibabacloud@Tutorial] ## Setup -You can follow the instrucitons below to quickly set up the environtments and run SimAI +You can follow the instructions below to quickly set up the environments and run SimAI. ### From Source Code -The following code has been successfully tested on GCC/G++ 9.4.0, python 3.8.10 in Ubuntu 20.04 +The following code has been successfully tested on GCC/G++ 9.4.0, python 3.8.10 in Ubuntu 20.04. You can use the official Ubuntu 20.04 image, and do not install ninja. 
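As a quick self-check before building, the version floor above can be verified from the shell. This is an illustrative sketch, not part of the official build scripts; it assumes GNU `sort -V` is available, and `have_gcc` is a placeholder you would replace with the real `gcc -dumpfullversion` output:

```shell
# version_ge A B — succeeds when dotted version A >= B (compares via GNU sort -V)
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

have_gcc="9.4.0"   # placeholder: in practice, have_gcc=$(gcc -dumpfullversion)
if version_ge "$have_gcc" "9.4.0"; then
    echo "gcc version OK"
else
    echo "gcc too old: need >= 9.4.0" >&2
fi
```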
@@ -159,13 +188,13 @@ $ ./scripts/build.sh -c ns3 ## Use SimAI-Analytical ```bash -$ ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml +$ ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml ``` -For calculating bus bandwidth autolly, please try the following command: +For calculating bus bandwidth automatically, please try the following command: ```bash -$ ./bin/SimAI_analytical -w ./example/workload_analytical.txt -g 9216 -nv 360 -nic 48.5 -n_p_s 8 -g_p_s 8 -r example- +$ ./bin/SimAI_analytical -w ./example/workload_analytical.txt -g 9216 -nv 360 -nic 48.5 -n_p_s 8 -g_p_s 8 -r example- ``` ## Use SimAI-Simulation @@ -180,39 +209,57 @@ $ AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microA ## Use Multi-requests Inference Simulation -For detailed information, please refer to the [README](./vidur-alibabacloud/README.md) file in the `vidur-alibabacloud` directory. This module leverages AICB to profile the computation time of **inference** workloads. Due to its reliance on specific hardware-accelerated libraries like DeepGEMM and FlashMLA, it is exclusively compatible with NVIDIA GPUs based on the **Hopper (SM90)** and **Blackwell (SM100)** architectures. +For detailed information, please refer to the [README](./vidur-alibabacloud/README.md) file in the `vidur-alibabacloud` directory. This module leverages AICB to profile the computation time of **inference** workloads. Due to its reliance on specific hardware-accelerated libraries like DeepGEMM and FlashMLA, it is exclusively compatible with NVIDIA GPUs based on the **Hopper (SM90)** and **Blackwell (SM100)** architectures. -```shell +```bash # Build from Dockerfile docker build -t image:latest . -docker run --gpus all -it --rm image:latest +docker run --gpus all -it --rm image:latest ``` -**Note**: please add `ENV FLASH_MLA_DISABLE_SM100=1` to Dockerfile if using Hopper GPUs. 
+**Note:** Please add `ENV FLASH_MLA_DISABLE_SM100=1` to Dockerfile if using Hopper GPUs. -# Acknowledgments +To quickly validate all supported inference scenarios (Qwen3-Next-80B, DeepSeek-671B, Qwen3-MoE-235B), use the bundled 4-scenario test suite: -A huge thanks to the following people and organizations who have contributed to this project: +```bash +# Prerequisites: conda activate vidur +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +# Or run a single scenario: +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` -- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/), +> **Prerequisites:** Requires `conda activate vidur` environment. See [Environment Setup](./vidur-alibabacloud/README.md#-environment-setup) for details. +> +> For detailed scenario configuration table and output file descriptions, see [Vidur-AlibabaCloud README](./vidur-alibabacloud/README.md#4-scenario-configuration). -- Parth Parikh (KEYSIGHT), +# Acknowledgments -- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin), +A huge thanks to the following people and organizations who have contributed to this project: -- Xinyue Li (BUPT), +- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/) +- Parth Parikh (KEYSIGHT) +- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin) +- Xinyue Li (BUPT) +- Tong Chen (Zhejiang University) +- Ming Wang (BUPT) +- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences) -- Tong Chen (Zhejiang University), +...and many other individual contributors from the community (See the [Contributors to aliyun/SimAI](https://github.com/aliyun/SimAI/graphs/contributors)). -- Ming Wang (BUPT), +We also thank Chenning Li (MIT CSAIL) who initiated the cooperation on integrating SimAI into [M4](https://github.com/netiken/m4), a new, innovative simulator. 
-- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences), +**This project still welcomes more contributions and suggestions.** -and many other individual contributors from the community (See the [Contributors to aliyun/SimAI · GitHub](https://github.com/aliyun/SimAI/graphs/contributors)). +# Contributing -We also thank Chenning Li (MIT CSAIL) who initiated the cooperation on integrating SimAI into [M4](https://github.com/netiken/m4), a new, innovative simulator. +We welcome all contributions! Please read the following guides before getting started: -**This project still welcomes more contributions and suggestions**. +| | | +|---|---| +| [Contributing Guide](./CONTRIBUTING.md) | How to submit issues and pull requests | +| [Security Policy](./SECURITY.md) | How to report security vulnerabilities | +| [Code of Conduct](./CODE_OF_CONDUCT.md) | Our community standards | +| [Changelog](./CHANGELOG.md) | Version history from v1.5 onwards | # Contact us @@ -224,5 +271,3 @@ Welcome to join the SimAI community chat groups, with the DingTalk group on the SimAI DingTalk SimAI WeChat
- -
diff --git a/README_CN.md b/README_CN.md new file mode 100644 index 00000000..1302de50 --- /dev/null +++ b/README_CN.md @@ -0,0 +1,273 @@ +

+ 中文  |  English +

+ +# SimAI + +[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) +[![NSDI'25](https://img.shields.io/badge/NSDI'25-SimAI-blue.svg)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + +# 最新动态 + +### 近期更新 + +- [2026/03] **SimAI 1.6 正式发布!** 本版本新增推理仿真的 GPU 内存建模能力。主要特性包括: + + - **GPU 内存计算模块:** 支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 的精确参数计数与 KV Cache 管理。详见 [SimAI 1.6 技术报告](./docs/SimAI_1.6_Tech_Report_CN.md)。 + - **PD 分离内存规划:** Prefill 与 Decode 阶段独立的参数内存和 KV Cache 预算计算。详见 [memory_planner.py](./vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py)。 + - **Decode 时间估算改进:** 首尾线性插值替代最近邻的 AICB decode 时间预测,全局缓存支持跨运行复用。详见 [execution_time.py](./vidur-alibabacloud/vidur/entities/execution_time.py)。 + - **4 场景端到端测试:** 覆盖 Qwen3-Next-80B、DeepSeek-671B、Qwen3-MoE-235B 的完整验证套件。详见 [run_scenarios.sh](./vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh)。 + - **双语文档体系:** 全套 EN/ZH 文档系统。详见 [English Docs](./docs/en/index.md) | [中文文档](./docs/zh/index.md)。 + - **GitHub 社区规范文件:** 新增 [Issue 模板](./.github/ISSUE_TEMPLATE/)、[PR 模板](./.github/pull_request_template.md)、[行为准则](./CODE_OF_CONDUCT.md) ([中文](./CODE_OF_CONDUCT_CN.md))、[安全政策](./SECURITY.md) ([中文](./SECURITY_CN.md))、[贡献指南](./CONTRIBUTING.md) ([中文](./CONTRIBUTING.zh-CN.md)) 和 [更新日志](./CHANGELOG.md) ([中文](./CHANGELOG_CN.md))。 + - **代码质量提升:** logging 替换 print 输出、双语 docstring、清理 ~390 行死代码、TODO 规范化、类型标注补全。 + +- [2025/12] **SimAI 1.5 正式发布!** 本版本新增对多请求**推理**工作负载的端到端仿真支持,主要特性包括: + + - **高级推理仿真:** 支持 Prefill/Decode 分离等复杂场景建模。 + - **主流模型支持:** 新增 DeepSeek、Qwen3Moe 和 Qwen3Next 模型。详见 [AICB README](./aicb/README.md)。 + - **请求调度:** 请求调度组件基于微软 [Vidur](https://github.com/microsoft/vidur) 适配,详见 [Vidur-Alibabacloud README](./vidur-alibabacloud/README_CN.md)。 + +- [2025/11] [AICB](https://github.com/aliyun/aicb/tree/master) 新增对 **DeepSeek**、**Qwen3-MoE** 和 **Qwen3-Next** 的 **prefill/decode** 推理工作负载生成支持。 + +- [2025/09] [AICB](https://github.com/aliyun/aicb/tree/master) 新增 DeepSeek 训练工作负载生成支持。感谢 
[@parthpower](https://github.com/parthpower) 的贡献。 + +- [2025/06] SimCCL 代码首次在 [SimCCL](https://github.com/aliyun/SimAI/tree/SimCCL) 分支发布,后续将在独立仓库正式开源。 + +**欢迎社区贡献!** 如有想法,欢迎提交 Issue 讨论或发起 Pull Request。 + +
+🎯 活动与社区 🎯 + +### 📅 即将举办 + +| 日期 | 活动 | 地点 | 内容 | 形式 | +|:----:|:----- |:-------- |:------- |:----:| +| -- | | | | | + +### 🌟 往期活动 + +| 日期 | 活动 | 地点 | 内容 | 形式 | +|:----------------:|:------------------------------------------------------------------------ |:----------------------- |:-------------------------------------------------------- |:-------------:| +| Mar 16, 2026 | SimAI 1.6 | 🌐 线上 | SimAI 1.6 正式发布 | 💻 线上直播 | +| Dec 30, 2025 | SimAI 1.5 | 🌐 线上 | SimAI 1.5 正式发布 | 💻 线上直播 | +| Jun 4, 2025 | SimAI 社区第一届研讨会 | 📍 北京大学 | 三场社区贡献者演讲 | 🎓 线下 | +| May 24, 2025 | 第 28 届 Chinasys 研讨会 | 📍 重庆大学 | SimAI 受邀演讲 | 🎓 线下 | +| Dec 27, 2024 | SimAI 技术分享 | 📍 北京航空航天大学 | SimAI 技术分享与交流 | 🎓 线下 | +| Dec 6, 2024 | 香港科技大学技术研讨会 | 📍 香港科技大学(广州) | SimAI 技术分享与交流 | 🎓 线下 | +| Dec 5, 2024 | [Bench'24 会议](https://mp.weixin.qq.com/s/STic_E12xMhZRxhzK9wRnw) | 📍 广州 | SimAI 教程与深度技术专场 | 🎓 线下 | +| Nov 26, 2024 | SimAI 社区直播 | 🌐 线上 | 互动技术交流与演示(400+ 参与者) | 💻 线上直播 | +| Nov 15, 2024 | 技术研讨会 | 📍 千岛湖 | SimAI 线下技术交流 | 🎯 线下 | +| Oct 18, 2024 | 嘉宾讲座 | 📍 复旦大学 | SimAI 教程与公开课 | 🎓 线下 | +| Sept 24-26, 2024 | CCF HPC China 2024 | 📍 武汉 | SimAI 介绍与技术报告 | 🎤 会议 | + +
+ +--- + +## 文档 + +| | | +|---|---| +| [English Documentation](./docs/en/index.md) | Full documentation in English | +| [中文文档](./docs/zh/index.md) | 完整中文文档 | + +--- + +# 目录 + +- [SimAI 概述](#simai-概述) + - [简介](#简介) + - [组件](#组件) + - [应用场景](#应用场景) + - [引用](#引用) +- [快速开始](#快速开始) + - [环境搭建](#环境搭建) + - [使用 SimAI-Analytical](#使用-simai-analytical) + - [使用 SimAI-Simulation](#使用-simai-simulation) + - [使用多请求推理仿真](#使用多请求推理仿真) + +# SimAI 概述 + +## 简介 + +**SimAI** 是业界首个全栈高精度 AI 大规模**推理**与**训练**模拟器(**Sim**ulator for **AI**)。它对 LLM 训练全流程进行详细建模和仿真,涵盖框架、集合通信、网络层等,提供端到端的性能数据,帮助研究人员: + +- 分析推理/训练过程细节 +- 评估特定条件下 AI 任务的耗时 +- 评估各类算法优化带来的 E2E 性能收益,包括: + - 框架参数配置 + - 集合通信算法 + - NCCL 环境变量 + - 网络传输协议 + - 拥塞控制算法 + - 自适应路由算法 + - 扩展/集合网络拓扑调整 + - …… + +## 组件 + +
+        |--- AICB
+SimAI --|--- SimCCL
+        |--- astra-sim-alibabacloud
+        |--- ns-3-alibabacloud
+        |--- vidur-alibabacloud
+
+ +在纯仿真能力基础上,SimAI 已演进为一个由四个组件([aicb](https://github.com/aliyun/aicb)、[SimCCL](https://github.com/aliyun/SimCCL)、[astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)、[ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud))构成的全栈工具套件。这些组件可以灵活组合以实现不同功能。我们鼓励用户探索更多可能性。 + +下图为 SimAI 模拟器架构图: +![SimAI_Arc](./docs/images/SimAI_Arc.png) + +astra-sim-alibabacloud 基于 [astra-sim](https://github.com/astra-sim/astra-sim/tree/ASTRA-sim-1.0) 扩展开发。感谢 astra-sim 团队的优秀工作和开源贡献。我们在其基础上集成了 NCCL 算法并添加了若干新特性。 + +## 应用场景 + +SimAI 支持三种主要运行模式: + +**SimAI-Analytical** 通过使用总线带宽(busbw)抽象网络通信细节来估算集合通信时间,实现快速仿真。目前支持用户自定义 busbw,自动计算 busbw 功能即将推出。 + +**SimAI-Simulation** 提供基于细粒度网络通信建模的全栈仿真。利用 NS-3 或其他网络模拟器(当前 NS-3 已开源)实现对所有通信行为的详细仿真,力求高保真还原真实训练环境。 + +**SimAI-Physical** *(Beta)* 支持在 CPU RDMA 集群环境下生成物理流量,通过生成类 NCCL 的流量模式深入研究 LLM 训练中的 NIC 行为。当前处于内测阶段。 + +| 场景 | 描述 | 组件组合 | +|------|------|----------| +| 1. AICB 测试套件 | 在 GPU 集群上使用 AICB 测试套件运行通信模式 | [AICB](https://github.com/aliyun/aicb) | +| 2. AICB/AIOB 工作负载 | 建模**推理**/训练过程的计算/通信模式以生成工作负载 | [AICB](https://github.com/aliyun/aicb) | +| 3. 集合通信分析 | 将集合通信操作分解为点对点通信集合 | [SimCCL](https://github.com/aliyun/SimCCL) | +| 4. 无 GPU 集合通信 | 在非 GPU 集群上执行 RDMA 集合通信流量 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(physical) | +| 5. SimAI-Analytical | 在任意服务器上快速进行 AICB 工作负载分析与仿真(忽略底层网络细节) | [AICB](https://github.com/aliyun/aicb) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical) | +| 6. SimAI-Simulation | 在任意服务器上进行全栈仿真 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(simulation) + [ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | +| 7. 
多请求推理仿真 | 在单 GPU 服务器上进行多请求**推理**全栈仿真 | [AICB](https://github.com/aliyun/aicb) + [SimCCL](https://github.com/aliyun/SimCCL) + [vidur-alibabacloud](./vidur-alibabacloud) + [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud)(analytical/simulation) | + +## 引用 + +SimAI 论文已被 NSDI'25 Spring 接收,详情请参阅: + +*SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision.* + +[[pdf](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf)] / [[slides](./docs/SimAI_Intro_Online.pdf)] / [[video](https://n.dingtalk.com/dingding/live-room/index.html?roomId=OF5BkBUXVxmgsK7x&liveUuid=305736cd-aa70-498b-8003-2b471a53decd)] + +欢迎基于 SimAI 开展创新研究和功能扩展。欢迎加入社区群或通过邮件联系我们交流,我们可提供技术支持。 + +# 快速开始 + +以下为简单示例。完整教程请参见:[**SimAI@Tutorial**](./docs/Tutorial.md)、[**aicb@Tutorial**](https://github.com/aliyun/aicb/blob/master/training/tutorial.md)、[SimCCL@Tutorial]、[ns-3-alibabacloud@Tutorial] + +## 环境搭建 + +请按照以下步骤快速搭建环境并运行 SimAI。 + +### 从源码安装 + +以下步骤已在 Ubuntu 20.04 的 GCC/G++ 9.4.0、python 3.8.10 环境下验证。 + +可使用官方 Ubuntu 20.04 镜像,**不要安装 ninja**。 + +(对于工作负载生成场景,推荐直接使用 NGC 容器镜像。) + +```bash +# 克隆仓库 +$ git clone https://github.com/aliyun/SimAI.git +$ cd ./SimAI/ + +# 初始化子模块 +$ git submodule update --init --recursive +# 更新到最新提交 +$ git submodule update --remote + +# 编译 SimAI-Analytical +$ ./scripts/build.sh -c analytical + +# 编译 SimAI-Simulation (ns3) +$ ./scripts/build.sh -c ns3 +``` + +## 使用 SimAI-Analytical + +```bash +$ ./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml +``` + +若需自动计算总线带宽,请尝试: + +```bash +$ ./bin/SimAI_analytical -w ./example/workload_analytical.txt -g 9216 -nv 360 -nic 48.5 -n_p_s 8 -g_p_s 8 -r example- +``` + +## 使用 SimAI-Simulation + +```bash +# 生成网络拓扑 +$ python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +# 运行仿真 +$ AS_SEND_LAT=3 
AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 -w ./example/microAllReduce.txt -n ./Spectrum-X_128g_8gps_100Gbps_A100 -c astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +## 使用多请求推理仿真 + +详情请参见 `vidur-alibabacloud` 目录下的 [README](./vidur-alibabacloud/README_CN.md)。该模块利用 AICB 对**推理**工作负载的计算时间进行 profiling。由于依赖 DeepGEMM 和 FlashMLA 等特定硬件加速库,目前仅兼容基于 **Hopper(SM90)** 和 **Blackwell(SM100)** 架构的 NVIDIA GPU。 + +```bash +# 从 Dockerfile 构建 +docker build -t image:latest . +docker run --gpus all -it --rm image:latest +``` + +**注意:** 若使用 Hopper GPU,请在 Dockerfile 中添加 `ENV FLASH_MLA_DISABLE_SM100=1`。 + +如需快速验证所有支持的推理场景(Qwen3-Next-80B、DeepSeek-671B、Qwen3-MoE-235B),可使用内置的四场景测试套件: + +```bash +# 前置条件:conda activate vidur +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +# 或单独运行某个场景: +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +> **前置条件:** 需先激活 `conda activate vidur` 环境。详见 [环境配置](./vidur-alibabacloud/README_CN.md#-环境配置)。 +> +> 完整场景配置表与输出文件说明请参见 [Vidur-AlibabaCloud README](./vidur-alibabacloud/README_CN.md#四场景配置说明)。 + +# 致谢 + +衷心感谢以下人员和机构对本项目的贡献: + + +- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/) +- Parth Parikh (KEYSIGHT) +- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin) +- Xinyue Li (BUPT) +- Tong Chen (Zhejiang University) +- Ming Wang (BUPT) +- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences) + +……以及众多来自社区的个人贡献者(详见 [Contributors to aliyun/SimAI](https://github.com/aliyun/SimAI/graphs/contributors))。 + +同时感谢 Chenning Li(MIT CSAIL)发起了将 SimAI 集成到 [M4](https://github.com/netiken/m4) 的合作——M4 是一个新型创新模拟器。 + +**本项目持续欢迎更多贡献与建议。** + +# 贡献指南 + +欢迎参与贡献!开始前请阅读以下指引: + +| | | +|---|---| +| [贡献指南](./CONTRIBUTING.zh-CN.md) | 如何提交 Issue 和 Pull Request | +| [安全政策](./SECURITY_CN.md) | 如何报告安全漏洞 | +| [行为准则](./CODE_OF_CONDUCT_CN.md) | 社区行为规范 | +| [更新日志](./CHANGELOG_CN.md) | v1.5 起的版本历史 | + +# 联系我们 + +如有任何问题,欢迎发送邮件至:Gang Lu(yunding.lg@alibaba-inc.com)、Feiyang 
Xue(xuefeiyang.xfy@alibaba-inc.com)或 Qingxu Li(qingxu.lqx@alibaba-inc.com)。 + +欢迎加入 SimAI 社区交流群,左侧为钉钉群,右侧为微信群。 + +
+ SimAI 钉钉群 + SimAI 微信群 +
diff --git a/SECURITY.md b/SECURITY.md new file mode 100644 index 00000000..aa9e066e --- /dev/null +++ b/SECURITY.md @@ -0,0 +1,47 @@ +

+ 中文  |  English +

+ +# Security Policy + +## Reporting a Vulnerability + +The SimAI team takes security issues seriously. We appreciate your efforts to responsibly disclose any security vulnerabilities you find. + +**Please do NOT report security vulnerabilities through public GitHub issues.** + +Instead, please report them via email to: + +- Gang Lu: yunding.lg@alibaba-inc.com +- Feiyang Xue: xuefeiyang.xfy@alibaba-inc.com +- Qingxu Li: qingxu.lqx@alibaba-inc.com + +Please include the following information in your report: + +- Description of the vulnerability +- Steps to reproduce the issue +- Potential impact +- Suggested fix (if any) + +## Response Timeline + +- We will acknowledge receipt of your vulnerability report within **3 business days**. +- We will provide a more detailed response within **10 business days**, indicating the next steps for handling your report. +- We will keep you informed of the progress towards a fix and full announcement. + +## Supported Versions + +| Version | Supported | +|---------|--------------------| +| 1.5.x | :white_check_mark: | +| < 1.5 | :x: | + +## Disclosure Policy + +When we receive a security bug report, we will: + +1. Confirm the problem and determine the affected versions. +2. Audit code to find any similar problems. +3. Prepare fixes and release them as soon as possible. + +Thank you for helping keep SimAI and its users safe! diff --git a/SECURITY_CN.md b/SECURITY_CN.md new file mode 100644 index 00000000..87c911be --- /dev/null +++ b/SECURITY_CN.md @@ -0,0 +1,47 @@ +

+ 中文  |  English +

+ +# 安全政策 + +## 报告漏洞 + +SimAI 团队高度重视安全问题。我们感谢您以负责任的方式披露所发现的安全漏洞。 + +**请勿通过公开的 GitHub Issue 报告安全漏洞。** + +请通过电子邮件将漏洞报告发送至以下联系人: + +- Gang Lu:yunding.lg@alibaba-inc.com +- Feiyang Xue:xuefeiyang.xfy@alibaba-inc.com +- Qingxu Li:qingxu.lqx@alibaba-inc.com + +报告中请包含以下信息: + +- 漏洞描述 +- 复现步骤 +- 潜在影响 +- 修复建议(如有) + +## 响应时间线 + +- 我们将在 **3 个工作日**内确认收到您的漏洞报告。 +- 我们将在 **10 个工作日**内提供更详细的回复,说明后续处理步骤。 +- 我们将持续向您通报修复进展及公告发布情况。 + +## 支持的版本 + +| 版本 | 是否支持 | +|---------|-----------------------| +| 1.5.x | :white_check_mark: | +| < 1.5 | :x: | + +## 披露政策 + +收到安全漏洞报告后,我们将: + +1. 确认问题并确定受影响的版本。 +2. 对代码进行审查,排查类似问题。 +3. 尽快准备并发布修复补丁。 + +感谢您帮助保障 SimAI 及其用户的安全! diff --git a/docs/SimAI_1.6_Tech_Report.md b/docs/SimAI_1.6_Tech_Report.md new file mode 100644 index 00000000..b00809b7 --- /dev/null +++ b/docs/SimAI_1.6_Tech_Report.md @@ -0,0 +1,429 @@ +

+ 中文  |  English +

+ +# SimAI 1.6 Technical Report + +> This report covers all features from SimAI 1.5 as well as the new enhancements introduced in SimAI 1.6. + +## 1. Overview + +**SimAI** is the industry's first full-stack, high-precision **Sim**ulator for **AI** large-scale inference and training, open-sourced by Alibaba Cloud. SimAI provides detailed modeling and simulation of the entire LLM inference and training process, encompassing the framework layer, collective communication layer, and network transport layer, delivering end-to-end performance data. The SimAI paper was accepted by NSDI'25 Spring [1]. + +SimAI 1.6 builds upon SimAI 1.5 with further enhancements, primarily introducing the **GPU Memory Calculation Module** (supporting accurate parameter counting and KV cache management for DeepSeek-V3-671B, Qwen3-MoE-235B, and Qwen3-Next-80B), a **4-Scenario End-to-End Test Suite**, and comprehensive code quality improvements (bilingual documentation, logging system, dead code cleanup, etc.). + +### Component Overview + +``` + |--- AICB (Workload generation & compute profiling) +SimAI --|--- SimCCL (Collective communication algorithm analysis) + |--- astra-sim-alibabacloud (Simulation engine: Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 network backend) + |--- vidur-alibabacloud (Multi-request inference scheduling & memory management) +``` + +--- + +## 2. 
Key Milestones + +The following are the key development events from November 2025 to March 2026: + +| Date | Event | Description | +|------|-------|-------------| +| 2025/11 | AICB PR [#58](https://github.com/aliyun/aicb/pull/58) | AICB adds inference workload generation with prefill/decode phase separation, supporting DeepSeek, Qwen3-MoE, and Qwen3-Next | +| 2025/12 | AICB PR [#60](https://github.com/aliyun/aicb/pull/60) | AICB further update, refining inference workload generation | +| 2025/12 | SimAI PR [#203](https://github.com/aliyun/SimAI/pull/203) | SimAI 1.5 core update: end-to-end inference simulation, PD disaggregation, Vidur scheduling integration, modern model support | +| 2025/12 | ns-3 commit [7e3cb5b](https://github.com/aliyun/ns-3-alibabacloud/commit/7e3cb5b88c99abcb582c5abc3919484a4805111b) | ns-3-alibabacloud README documentation enhancement with detailed NS3 backend modifications | +| 2026/01 | Memory module commits | Completed accurate memory calculation for DeepSeek-V3-671B, Qwen3-Next-80B, and Qwen3-MoE-235B | +| 2026/02 | PD disaggregation memory planning | Implemented independent parameter memory and KV cache budget calculation for Prefill/Decode phases | +| 2026/03 | Code quality improvements | Comprehensive bilingual comments/docs/logs, dead code cleanup, TODO standardization, type annotations | + +--- + +## 3. 
End-to-End Inference Simulation + +SimAI supports complete multi-request LLM inference simulation with the following core features: + +### 3.1 Prefill-Decode (PD) Disaggregation Architecture + +The inference process is divided into two phases: + +- **Prefill phase**: Processes all input prompt tokens and generates the first output token (compute-intensive) +- **Decode phase**: Autoregressively generates subsequent output tokens one at a time (memory-bandwidth-intensive) + +PD disaggregation allows deploying Prefill and Decode phases on different GPU nodes, enabling: +- Elastic resource allocation (Prefill nodes can be configured with more compute, Decode nodes with more memory) +- Performance isolation (avoiding resource contention between Prefill and Decode) +- Flexible P:D node ratio configuration (via `--replica_config_pd_node_ratio`) + +This design was inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim) [6]. + +### 3.2 Multi-Request Inference Scheduling + +The request scheduling component is adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur) [5] (vidur-alibabacloud), supporting the following scheduling strategies: + +| Scheduler Type | Level | Description | +|---------------|-------|-------------| +| `split_wise` | Global | Global scheduling for PD disaggregation, dispatching requests to Prefill and Decode replicas | +| `lor` | Global | Least Outstanding Requests, dispatching to the least-loaded replica | +| `round_robin` | Global | Round-robin dispatch | +| `sarathi` | Per-replica | Intra-replica batch scheduling | +| `split_wise` | Per-replica | Per-replica scheduling for PD disaggregation | + +### 3.3 Flexible Parallelism + +Supports combinations of multiple parallelism strategies: + +- **Data Parallel (DP)** — via `--cluster_config_num_replicas` +- **Tensor Parallel (TP)** — via `--replica_config_tensor_parallel_size` +- **Pipeline Parallel (PP)** — via `--replica_config_num_pipeline_stages` +- **Expert Parallel (EP)** 
— via `--replica_config_expert_model_parallel_size` + +Works for both dense and MoE (Mixture-of-Experts) models. + +### 3.4 Multiple Execution-Time Prediction Backends + +| Backend | Description | +|---------|-------------| +| **AICB/AIOB** | Partially supports compute kernels and TP/DP/PP/EP communication size modeling for DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B | +| **SimAI Simulation** | SimAI NS-3-based full-stack network simulation (currently supports TP) | +| **SimAI Analytical** | SimAI analytical performance model (currently supports TP) | +| **Native Vidur** | Original Vidur backend, supports TP, DP, PP | + +--- + +## 4. Modern Model Support + +SimAI 1.6 supports the following three state-of-the-art MoE large models, with configuration files located in `vidur-alibabacloud/data/hf_configs/`: + +### 4.1 DeepSeek-V3-671B + +| Attribute | Value | +|-----------|-------| +| Total Layers | 61 | +| Attention Type | MLA (Multi-head Latent Attention) | +| Attention Heads | 128 | +| Hidden Size | 7168 | +| KV LoRA Rank | 512 | +| Q LoRA Rank | 1536 | +| QK RoPE Head Dim | 64 | +| QK NoPE Head Dim | 128 | +| V Head Dim | 128 | +| MoE Routed Experts | 256 | +| Experts Per Token | 8 | +| Shared Experts | 1 | +| Dense Layers (first 3) | Fixed activation of 8 routed experts + 1 shared expert | +| Sparse Layers (layers 3-60) | Dynamically select 8 from 256 routed experts + 1 shared expert | + +Configuration file: `data/hf_configs/deepseek_v3_config.json` + +### 4.2 Qwen3-MoE-235B + +| Attribute | Value | +|-----------|-------| +| Total Layers | 94 | +| Attention Type | MHA/GQA | +| Attention Heads | 64 | +| KV Heads | 4 | +| Hidden Size | 4096 | +| Head Dim | 128 | +| MoE Routed Experts | 128 | +| Experts Per Token | 8 | +| MoE Intermediate Size | 1536 | + +Configuration file: `data/hf_configs/qwen3_moe_config.json` + +### 4.3 Qwen3-Next-80B + +| Attribute | Value | +|-----------|-------| +| Total Layers | 48 | +| Attention Type | Hybrid (full + linear attention, 
alternating every 4 layers) | +| Full Attention Heads | 16 | +| KV Heads | 2 | +| Hidden Size | 2048 | +| Head Dim | 256 | +| Linear Attention Key Heads | 16 | +| Linear Attention Value Heads | 32 | +| MoE Routed Experts | 512 | +| Experts Per Token | 10 | +| MoE Intermediate Size | 512 | + +Configuration file: `data/hf_configs/qwen3-next-80B-A3B_config.json` + +--- + +## 5. GPU Memory Calculation Module + +This is the core new feature in SimAI 1.6. The module provides accurate GPU memory estimation for inference simulation, covering model parameter memory, KV cache memory, and maximum batch size calculation, with separate memory budget computation for Prefill and Decode phases under PD disaggregation. + +### 5.1 Parameter Counting (ParamCounter) + +**File path**: `vidur-alibabacloud/vidur/utils/param_counter.py` + +ParamCounter supports per-layer and per-device parameter counting, returning a triple `(total_params, prefill_params, decode_params)` under PD disaggregation. + +#### MLA Parameters (DeepSeek-V3-671B) + +Per-layer MLA parameter components: + +- **Q LoRA down-projection**: `wq_down = hidden_size * q_lora_rank` = 7168 * 1536 +- **Q LoRA up-projection**: `wq_up = q_lora_rank * num_attention_heads * qk_head_dim` = 1536 * 128 * 192, where `qk_head_dim = qk_nope_head_dim + qk_rope_head_dim = 128 + 64 = 192` +- **KV LoRA down-projection**: `wkv_down = hidden_size * kv_lora_rank` = 7168 * 512 +- **KV LoRA up-projection**: `wkv_up = kv_lora_rank * num_attention_heads * (qk_nope_head_dim + v_head_dim)` = 512 * 128 * 256 +- **Output projection**: `wo = hidden_size * num_attention_heads * v_head_dim` = 7168 * 128 * 128 + +Under FP8 quantization, each parameter element uses 1 byte; under FP16/BF16, each uses 2 bytes. 
+ +References: [3] [4] + +#### MHA/GQA Parameters (Qwen3-MoE-235B) + +Per-layer MHA parameters: + +``` +wq = hidden_size * num_attention_heads * head_dim +wk = hidden_size * num_key_value_heads * head_dim +wv = hidden_size * num_key_value_heads * head_dim +wo = hidden_size * num_attention_heads * head_dim +total = (wq + wk + wv + wo) * bytes_per_element +``` + +#### Linear Attention Parameters (Qwen3-Next-80B) + +Qwen3-Next-80B uses a hybrid attention architecture, alternating between full attention and linear (GDN) attention every 4 layers. Linear attention layers use independent key/value head configurations (`linear_key_head_dim`, `linear_num_key_heads`, etc.). + +#### MoE Expert Parameters + +Per-expert FFN parameters (3 weight matrices W1, W2, W3): + +``` +expert_params = 3 * hidden_size * moe_intermediate_size * bytes_per_element +``` + +#### PD Disaggregation Parameter Calculation + +Under PD disaggregation, the expert parallelism (EP) may differ between Prefill and Decode clusters: + +- **Prefill cluster**: Uses `prefill_world_size` as EP, experts per device = `num_routed_experts / prefill_world_size` +- **Decode cluster**: Uses `decode_world_size` as EP, experts per device = `num_routed_experts / decode_world_size` + +This results in different parameter memory for Prefill and Decode clusters, which in turn affects their respective available KV cache capacity. + +### 5.2 KV Cache Memory Management + +**File path**: `vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py`, `vidur-alibabacloud/vidur/entities/replica.py` + +#### MHA/GQA KV Cache Calculation + +``` +kv_cache_per_token = 2 * num_kv_heads * head_dim * num_layers * bytes_per_element +``` + +The factor of 2 represents the K (Key) and V (Value) caches. + +#### MLA KV Cache Calculation (DeepSeek-V3-671B) + +The MLA architecture uses compressed KV representations. 
Unlike MHA which stores separate K and V caches, MLA stores a single compressed latent vector (`kv_lora_rank`) that jointly encodes K and V, plus the RoPE position keys (`qk_rope_head_dim`). Per-token KV cache size: + +``` +kv_cache_per_token = (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_element +``` + +Where `kv_lora_rank = 512` and `qk_rope_head_dim = 64`. Compared to MHA's per-token cache of `2 * num_kv_heads * head_dim` = 2 * 128 * 128 = 32768 elements, MLA reduces this to 576 elements — a **~57x** reduction. + +#### Per-Request KV Cache Tracking + +The `Replica` entity (`vidur/entities/replica.py`) maintains the following state: + +- `_allocated_kv_cache_memory`: Currently allocated KV cache memory (bytes) +- `_max_kv_cache_memory`: Maximum KV cache capacity (computed on first call by MemoryPlanner) +- `_kv_cache_allocation_map`: Per-request KV cache allocation mapping + +Supported operations: +- `allocate_request_kv_cache_memory(request, num_blocks, block_size)` — Allocate KV cache for a request +- `release_request_kv_cache_memory(request)` — Release KV cache for a completed request +- `get_remaining_kv_cache_capacity()` — Query remaining KV cache capacity and serviceable request count + +### 5.3 MemoryPlanner + +**File path**: `vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py` + +MemoryPlanner is the central component for memory management, with the following calculation flow: + +1. **Compute available GPU memory**: `available_memory = total_GPU_memory * (1 - memory_margin_fraction)` +2. **Get model parameter memory**: Computed via ParamCounter; under PD disaggregation returns `(total, prefill, decode)` triple +3. **Compute KV cache available memory**: `kv_cache_available = available_memory - param_memory` +4. 
**Compute maximum concurrent requests**: `max_requests = kv_cache_available / kv_cache_per_request` + +Under PD disaggregation: +- Prefill replicas use `prefill_param_mem` for KV cache budget calculation +- Decode replicas use `decode_param_mem` for KV cache budget calculation + +Includes OOM detection: when parameter memory exceeds available memory, error messages are output with suggestions to increase TP/EP, use larger GPUs, or enable FP8 quantization. + +--- + +## 6. AICB Inference Workload Generation + +[AICB](https://github.com/aliyun/aicb) introduces inference workload generation capabilities (PR [#58](https://github.com/aliyun/aicb/pull/58), [#60](https://github.com/aliyun/aicb/pull/60)), with key features: + +- **Prefill/Decode phase separation**: Generates separate compute and communication workloads for Prefill and Decode phases +- **Compute kernel profiling**: Relies on the following hardware-accelerated libraries (requires Hopper SM90 or Blackwell SM100 GPUs): + - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) — FP8 matrix multiplication + - [FlashMLA](https://github.com/deepseek-ai/FlashMLA) — MLA attention acceleration + - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) — High-performance inference kernels +- **Communication size modeling**: Supports communication size calculation for TP, DP, PP, EP parallelism strategies +- **Model support**: DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B + +--- + +## 7. Four-Scenario End-to-End Test Suite + +**File path**: `vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh` + +Provides 4 pre-configured end-to-end test scenarios covering different models, parallelism strategies, and PD disaggregation configurations. 
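The request capacity of each scenario's replicas is bounded by the memory-budget flow described in Section 5.3. A minimal self-contained sketch of that four-step calculation (the function name and the sample numbers are illustrative, not taken from MemoryPlanner or the test suite):

```python
def kv_cache_budget(total_gpu_mem_gb: float,
                    memory_margin_fraction: float,
                    param_mem_gb: float,
                    kv_cache_per_request_gb: float) -> int:
    """Simplified mirror of the Section 5.3 flow (illustrative names/numbers)."""
    # Step 1: available GPU memory after the safety margin
    available = total_gpu_mem_gb * (1 - memory_margin_fraction)
    # Steps 2-3: subtract parameter memory to get the KV cache budget
    kv_cache_available = available - param_mem_gb
    if kv_cache_available <= 0:
        # OOM case: parameters alone exceed the budget
        raise MemoryError("Parameter memory exceeds available memory: "
                          "increase TP/EP, use larger GPUs, or enable FP8")
    # Step 4: maximum concurrent requests the KV cache budget can hold
    return int(kv_cache_available // kv_cache_per_request_gb)

# Hypothetical inputs: 96 GB GPU, 10% margin, 60 GB of parameters,
# 0.5 GB of KV cache per request.
print(kv_cache_budget(96.0, 0.1, 60.0, 0.5))  # → 52
```

Under PD disaggregation the same function would be called twice, once with `prefill_param_mem` and once with `decode_param_mem`.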
+ +### Shared Hardware Configuration + +- GPU: H20 (h20_dgx) +- NVLink bandwidth: 1600 Gbps +- RDMA bandwidth: 800 Gbps +- PD P2P bandwidth: 800 Gbps +- Data type: fp8 +- Requests: Poisson QPS=100, 4 requests, fixed prefill=100 / decode=8 tokens + +### Scenario Configuration + +| Scenario | Model | PD Separation | World Size | TP | PP | EP | Global Scheduler | +|----------|-------|---------------|-----------|----|----|-----|-----------------| +| 1 | Qwen3-Next-80B (MoE) | No | 32 (dp=32) | 1 | 1 | 1 | lor | +| 2 | Qwen3-Next-80B (MoE) | Yes (P=2, D=6) | 8 | 1 | 1 | 1 | split_wise | +| 3 | DeepSeek-671B (MoE) | Yes (P=2, D=6) | 8 | 8 | 1 | 8 | split_wise | +| 4 | Qwen3-MoE-235B (MoE) | Yes (P=2, D=6) | 8 | 4 | 1 | 4 | split_wise | + +### Running + +```bash +# Run all 4 scenarios +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all + +# Run a single scenario +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 3 +``` + +For detailed performance data, please run the test suite. Each run produces output files including `request_metrics.csv` (per-request metrics), `chrome_trace.json` (timeline trace), `config.json` (configuration snapshot), and metric files under the `plots/` directory. + +--- + +## 8. 
Code Quality Improvements + +SimAI 1.6 includes systematic code quality improvements: + +### 8.1 Bilingual Comments and Documentation + +- Added bilingual (Chinese/English) docstrings to all public APIs +- Added bilingual comments to config, scheduler, predictor, and utils modules +- Added bilingual comments to entity modules +- Shell script outputs and Python runtime outputs use bilingual format + +### 8.2 Logging System Improvements + +- Comprehensive replacement of `print` statements with the `logging` module (~12 files) +- Unified log format using parenthetical bilingual style (e.g., `"GPU总内存 (Total GPU mem): 96.00 GB"`) + +### 8.3 Dead Code Cleanup + +- Removed approximately 390 lines of dead code blocks +- Cleaned up personal debug markers + +### 8.4 TODO Standardization + +- Unified to `TODO(author): description` format +- Added missing type annotations + +--- + +## 9. System Architecture + +### Inference Simulation Data Flow + +``` +Request Generator + | Generate synthetic / real-trace requests + v +Global Scheduler + | Dispatch requests to Prefill / Decode replicas + v +Replica Scheduler + | Batch assembly and scheduling + v +Memory Management (MemoryPlanner + Replica) + | KV cache allocation and capacity checking + v +Execution Time Predictor + | AICB / SimAI Simulation / SimAI Analytical / Vidur + v +Metrics Store + | TTFT, TBT, E2E, communication / compute cost + v +Output (request_metrics.csv, chrome_trace.json, plots/) +``` + +--- + +## 10. Quick Start + +### Environment Setup + +#### Option 1: Docker (Recommended) + +```bash +# Build from project root +docker build -t simai:latest . +docker run --gpus all -it --rm simai:latest +``` + +> If using Hopper GPUs, add `ENV FLASH_MLA_DISABLE_SM100=1` to the Dockerfile. 
+ +#### Option 2: Conda + +```bash +cd vidur-alibabacloud +conda env create -p ./env -f ./environment.yml +conda activate vidur +pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ +``` + +### Run 4-Scenario Test Suite + +```bash +# Prerequisites: conda activate vidur +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +``` + +### Compile and Run SimAI Training Simulation + +```bash +# Compile SimAI-Analytical +./scripts/build.sh -c analytical + +# Run +./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml +``` + +--- + +## 11. References + +[1] SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision. NSDI'25 Spring. [[pdf](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf)] + +[2] InferSim — Alibaba. Parameter counting and KV cache estimation. [[GitHub](https://github.com/alibaba/InferSim)] + +[3] DeepSeek V3 Parameter Derivation (Chinese). Zhihu. [[link](https://zhuanlan.zhihu.com/p/21455638257)] + +[4] DeepSeek V3 Parameter Size Analysis. Yang Wenbo. [[link](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html)] + +[5] Vidur: A Large-Scale Simulation Framework For LLM Inference. Microsoft Research. [[GitHub](https://github.com/microsoft/vidur)] + +[6] splitwise-sim — Prefill-Decode Disaggregation Simulation. [[GitHub](https://github.com/Mutinifni/splitwise-sim)] diff --git a/docs/SimAI_1.6_Tech_Report_CN.md b/docs/SimAI_1.6_Tech_Report_CN.md new file mode 100644 index 00000000..228ae90a --- /dev/null +++ b/docs/SimAI_1.6_Tech_Report_CN.md @@ -0,0 +1,429 @@ +

+ 中文  |  English +

+ +# SimAI 1.6 技术报告 + +> 本报告涵盖 SimAI 1.5 全部功能及 SimAI 1.6 新增特性。 + +## 1. 概述 + +**SimAI** 是业界首个全栈高精度 AI 大规模推理与训练模拟器(**Sim**ulator for **AI**),由阿里云开源。SimAI 对 LLM 推理与训练全流程进行详细建模和仿真,涵盖框架层、集合通信层、网络传输层等,提供端到端的性能数据。SimAI 论文已被 NSDI'25 Spring 接收 [1]。 + +SimAI 1.6 在 SimAI 1.5 的基础上进一步增强,主要新增了 **GPU 显存计算模块**(支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 三种 MoE 模型的精确参数量计算与 KV Cache 管理)、**四场景端到端测试套件**,以及全面的代码质量改进(双语文档、日志系统、死代码清理等)。 + +### 组件构成 + +``` + |--- AICB (工作负载生成与计算 profiling) +SimAI --|--- SimCCL (集合通信算法分析) + |--- astra-sim-alibabacloud (仿真引擎:Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 网络后端) + |--- vidur-alibabacloud (多请求推理调度与显存管理) +``` + +--- + +## 2. 关键里程碑 + +以下为 2025 年 11 月至 2026 年 3 月的关键开发事件: + +| 时间 | 事件 | 说明 | +|------|------|------| +| 2025/11 | AICB PR [#58](https://github.com/aliyun/aicb/pull/58) | AICB 新增推理工作负载生成能力,区分 prefill/decode 阶段,支持 DeepSeek、Qwen3-MoE、Qwen3-Next | +| 2025/12 | AICB PR [#60](https://github.com/aliyun/aicb/pull/60) | AICB 进一步更新,完善推理工作负载生成 | +| 2025/12 | SimAI PR [#203](https://github.com/aliyun/SimAI/pull/203) | SimAI 1.5 核心更新:端到端推理仿真、PD 分离、Vidur 调度集成、现代模型支持 | +| 2025/12 | ns-3 commit [7e3cb5b](https://github.com/aliyun/ns-3-alibabacloud/commit/7e3cb5b88c99abcb582c5abc3919484a4805111b) | ns-3-alibabacloud README 文档增强,详细说明 NS3 网络后端修改 | +| 2026/01 | 显存模块系列 commit | 完成 DeepSeek-V3-671B、Qwen3-Next-80B、Qwen3-MoE-235B 的精确显存计算 | +| 2026/02 | PD 分离显存规划 | 实现 Prefill/Decode 阶段独立的参数显存与 KV Cache 预算计算 | +| 2026/03 | 代码质量改进 | 双语注释/文档/日志全面改进,死代码清理,TODO 标准化,类型注解补充 | + +--- + +## 3. 
端到端推理仿真 + +SimAI 支持完整的多请求 LLM 推理仿真,核心特性如下: + +### 3.1 Prefill–Decode(PD)分离架构 + +推理过程分为两个阶段: + +- **Prefill 阶段**:处理输入 prompt 的全部 token,生成第一个输出 token(计算密集型) +- **Decode 阶段**:逐 token 自回归生成后续输出(访存密集型) + +PD 分离允许将 Prefill 和 Decode 阶段部署在不同的 GPU 节点上,实现: +- 弹性资源分配(Prefill 节点可配置更多计算资源,Decode 节点可配置更多显存) +- 性能隔离(避免 Prefill 和 Decode 之间的资源争用) +- 灵活的 P:D 节点比例配置(通过 `--replica_config_pd_node_ratio` 控制) + +该设计参考了 [splitwise-sim](https://github.com/Mutinifni/splitwise-sim) [6]。 + +### 3.2 多请求推理调度 + +请求调度组件基于微软 [Vidur](https://github.com/microsoft/vidur) [5] 改编(vidur-alibabacloud),支持以下调度策略: + +| 调度器类型 | 级别 | 说明 | +|-----------|------|------| +| `split_wise` | 全局 | PD 分离场景下的全局调度,将请求分配到 Prefill 和 Decode 副本 | +| `lor` | 全局 | Least Outstanding Requests,将请求分配到负载最轻的副本 | +| `round_robin` | 全局 | 轮询分配 | +| `sarathi` | 副本级 | 单副本内的批处理调度 | +| `split_wise` | 副本级 | PD 分离场景下的副本级调度 | + +### 3.3 灵活的并行策略 + +支持多种并行策略的组合: + +- **数据并行(DP)** — 通过 `--cluster_config_num_replicas` 控制 +- **张量并行(TP)** — 通过 `--replica_config_tensor_parallel_size` 控制 +- **流水线并行(PP)** — 通过 `--replica_config_num_pipeline_stages` 控制 +- **专家并行(EP)** — 通过 `--replica_config_expert_model_parallel_size` 控制 + +同时支持 Dense 模型和 MoE(混合专家)模型。 + +### 3.4 多种执行时间预测后端 + +| 后端 | 说明 | +|------|------| +| **AICB/AIOB** | 部分支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 的计算核与 TP/DP/PP/EP 通信量建模 | +| **SimAI Simulation** | 基于 SimAI NS-3 的网络通信全栈仿真(当前支持 TP) | +| **SimAI Analytical** | SimAI 解析性能模型(当前支持 TP) | +| **Native Vidur** | 原版 Vidur 后端,支持 TP、DP、PP | + +--- + +## 4. 
现代模型支持 + +SimAI 1.6 支持以下三种前沿 MoE 大模型,模型配置文件位于 `vidur-alibabacloud/data/hf_configs/`: + +### 4.1 DeepSeek-V3-671B + +| 属性 | 值 | +|------|-----| +| 总层数 | 61 | +| 注意力类型 | MLA(Multi-head Latent Attention) | +| 注意力头数 | 128 | +| 隐藏维度 | 7168 | +| KV LoRA 秩 | 512 | +| Q LoRA 秩 | 1536 | +| QK RoPE 头维度 | 64 | +| QK NoPE 头维度 | 128 | +| V 头维度 | 128 | +| MoE 路由专家数 | 256 | +| 每 token 激活专家数 | 8 | +| 共享专家数 | 1 | +| Dense 层(前 3 层) | 固定激活 8 个路由专家 + 1 个共享专家 | +| Sparse 层(第 3-60 层) | 从 256 个路由专家中动态选择 8 个 + 1 个共享专家 | + +配置文件:`data/hf_configs/deepseek_v3_config.json` + +### 4.2 Qwen3-MoE-235B + +| 属性 | 值 | +|------|-----| +| 总层数 | 94 | +| 注意力类型 | MHA/GQA | +| 注意力头数 | 64 | +| KV 头数 | 4 | +| 隐藏维度 | 4096 | +| 头维度 | 128 | +| MoE 路由专家数 | 128 | +| 每 token 激活专家数 | 8 | +| MoE 中间维度 | 1536 | + +配置文件:`data/hf_configs/qwen3_moe_config.json` + +### 4.3 Qwen3-Next-80B + +| 属性 | 值 | +|------|-----| +| 总层数 | 48 | +| 注意力类型 | 混合(全注意力 + 线性注意力,每 4 层交替) | +| 全注意力头数 | 16 | +| KV 头数 | 2 | +| 隐藏维度 | 2048 | +| 头维度 | 256 | +| 线性注意力键头数 | 16 | +| 线性注意力值头数 | 32 | +| MoE 路由专家数 | 512 | +| 每 token 激活专家数 | 10 | +| MoE 中间维度 | 512 | + +配置文件:`data/hf_configs/qwen3-next-80B-A3B_config.json` + +--- + +## 5. 
GPU 显存计算模块 + +这是 SimAI 1.6 的核心新增特性。该模块为推理仿真提供精确的 GPU 显存估算,覆盖模型参数显存、KV Cache 显存和最大批处理量计算,并在 PD 分离架构下分别计算 Prefill 和 Decode 阶段的显存预算。 + +### 5.1 参数量计算(ParamCounter) + +**文件路径**:`vidur-alibabacloud/vidur/utils/param_counter.py` + +ParamCounter 支持按层、按设备计算模型参数量,并在 PD 分离架构下返回三元组 `(total_params, prefill_params, decode_params)`。 + +#### MLA 参数量(DeepSeek-V3-671B) + +单层 MLA 参数量由以下部分组成: + +- **Q LoRA 下投影**:`wq_down = hidden_size * q_lora_rank` = 7168 * 1536 +- **Q LoRA 上投影**:`wq_up = q_lora_rank * num_attention_heads * qk_head_dim` = 1536 * 128 * 192,其中 `qk_head_dim = qk_nope_head_dim + qk_rope_head_dim = 128 + 64 = 192` +- **KV LoRA 下投影**:`wkv_down = hidden_size * kv_lora_rank` = 7168 * 512 +- **KV LoRA 上投影**:`wkv_up = kv_lora_rank * num_attention_heads * (qk_nope_head_dim + v_head_dim)` = 512 * 128 * 256 +- **输出投影**:`wo = hidden_size * num_attention_heads * v_head_dim` = 7168 * 128 * 128 + +FP8 量化下每个参数元素占 1 字节;FP16/BF16 下每个占 2 字节。 + +参考:[3] [4] + +#### MHA/GQA 参数量(Qwen3-MoE-235B) + +单层 MHA 参数量: + +``` +wq = hidden_size * num_attention_heads * head_dim +wk = hidden_size * num_key_value_heads * head_dim +wv = hidden_size * num_key_value_heads * head_dim +wo = hidden_size * num_attention_heads * head_dim +total = (wq + wk + wv + wo) * bytes_per_element +``` + +#### 线性注意力参数量(Qwen3-Next-80B) + +Qwen3-Next-80B 采用混合注意力架构,每 4 层交替使用全注意力和线性(GDN)注意力。线性注意力层使用独立的键/值头配置(`linear_key_head_dim`、`linear_num_key_heads` 等)。 + +#### MoE 专家参数量 + +每个专家的 FFN 参数量(3 个权重矩阵 W1、W2、W3): + +``` +expert_params = 3 * hidden_size * moe_intermediate_size * bytes_per_element +``` + +#### PD 分离下的参数量计算 + +在 PD 分离架构下,Prefill 和 Decode 集群的专家并行度(EP)可能不同: + +- **Prefill 集群**:使用 `prefill_world_size` 作为 EP,每设备加载的专家数 = `num_routed_experts / prefill_world_size` +- **Decode 集群**:使用 `decode_world_size` 作为 EP,每设备加载的专家数 = `num_routed_experts / decode_world_size` + +这导致 Prefill 和 Decode 集群的参数显存不同,进而影响各自可用的 KV Cache 容量。 + +### 5.2 KV Cache 显存管理 + 
+**文件路径**:`vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py`、`vidur-alibabacloud/vidur/entities/replica.py` + +#### MHA/GQA KV Cache 计算 + +``` +kv_cache_per_token = 2 * num_kv_heads * head_dim * num_layers * bytes_per_element +``` + +其中因子 2 代表 K(Key)和 V(Value)两个缓存。 + +#### MLA KV Cache 计算(DeepSeek-V3-671B) + +MLA 架构使用压缩的 KV 表示。与 MHA 分别存储 K 和 V 缓存不同,MLA 存储一个联合编码 K 和 V 的压缩潜向量(`kv_lora_rank`),外加 RoPE 位置键(`qk_rope_head_dim`)。每 token 的 KV Cache 大小为: + +``` +kv_cache_per_token = (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_element +``` + +其中 `kv_lora_rank = 512`,`qk_rope_head_dim = 64`。相比 MHA 每 token 缓存量 `2 * num_kv_heads * head_dim` = 2 * 128 * 128 = 32768 个元素,MLA 减少至 576 个元素——约 **57 倍**压缩。 + +#### 逐请求 KV Cache 追踪 + +`Replica` 实体(`vidur/entities/replica.py`)维护以下状态: + +- `_allocated_kv_cache_memory`:已分配的 KV Cache 显存(字节) +- `_max_kv_cache_memory`:最大 KV Cache 容量(首次调用时由 MemoryPlanner 计算) +- `_kv_cache_allocation_map`:每请求 KV Cache 分配映射 + +支持的操作: +- `allocate_request_kv_cache_memory(request, num_blocks, block_size)` — 为请求分配 KV Cache +- `release_request_kv_cache_memory(request)` — 释放已完成请求的 KV Cache +- `get_remaining_kv_cache_capacity()` — 查询剩余 KV Cache 容量和可服务请求数 + +### 5.3 MemoryPlanner 显存规划 + +**文件路径**:`vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py` + +MemoryPlanner 是显存管理的核心组件,计算流程如下: + +1. **计算可用 GPU 显存**:`available_memory = total_GPU_memory * (1 - memory_margin_fraction)` +2. **获取模型参数显存**:通过 ParamCounter 计算,PD 分离下返回 `(total, prefill, decode)` 三元组 +3. **计算 KV Cache 可用显存**:`kv_cache_available = available_memory - param_memory` +4. **计算最大并发请求数**:`max_requests = kv_cache_available / kv_cache_per_request` + +在 PD 分离架构下: +- Prefill 副本使用 `prefill_param_mem` 计算 KV Cache 预算 +- Decode 副本使用 `decode_param_mem` 计算 KV Cache 预算 + +包含 OOM 检测:当参数显存超过可用显存时,输出错误信息并建议增加 TP/EP、使用更大 GPU 或启用 FP8 量化。 + +--- + +## 6. 
## 6. AICB Inference Workload Generation

[AICB](https://github.com/aliyun/aicb) adds inference workload generation (PRs [#58](https://github.com/aliyun/aicb/pull/58) and [#60](https://github.com/aliyun/aicb/pull/60)). Key features:

- **Prefill/Decode phase separation**: generates compute and communication workloads separately for the Prefill and Decode phases
- **Compute kernel profiling**: relies on the following hardware-accelerated libraries (requires Hopper SM90 or Blackwell SM100 GPUs):
  - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM): FP8 matrix multiplication
  - [FlashMLA](https://github.com/deepseek-ai/FlashMLA): MLA attention acceleration
  - [FlashInfer](https://github.com/flashinfer-ai/flashinfer): high-performance inference kernels
- **Communication volume modeling**: supports communication volume calculation under the TP, DP, PP, and EP parallelism strategies
- **Model support**: DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B

---

## 7. 4-Scenario End-to-End Test Suite

**File path**: `vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh`

Four pre-configured end-to-end test scenarios cover different combinations of models, parallelism strategies, and PD-disaggregation settings.

### Shared Hardware Configuration

- GPU: H20 (h20_dgx)
- NVLink bandwidth: 1600 Gbps
- RDMA bandwidth: 800 Gbps
- PD P2P bandwidth: 800 Gbps
- Data type: fp8
- Requests: Poisson QPS=100, 4 requests, fixed prefill=100 / decode=8 tokens

### Scenario Configuration Table

| Scenario | Model | PD Disaggregation | World Size | TP | PP | EP | Global Scheduler |
|----------|-------|-------------------|-----------|----|----|-----|-----------------|
| 1 | Qwen3-Next-80B (MoE) | No | 32 (dp=32) | 1 | 1 | 1 | lor |
| 2 | Qwen3-Next-80B (MoE) | Yes (P=2, D=6) | 8 | 1 | 1 | 1 | split_wise |
| 3 | DeepSeek-671B (MoE) | Yes (P=2, D=6) | 8 | 8 | 1 | 8 | split_wise |
| 4 | Qwen3-MoE-235B (MoE) | Yes (P=2, D=6) | 8 | 4 | 1 | 4 | split_wise |

### Running

```bash
# Run all 4 scenarios
bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all

# Run a single scenario
bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 3
```

Run the test suite to obtain detailed performance numbers. Each run produces `request_metrics.csv` (per-request metrics), `chrome_trace.json` (timeline trace), `config.json` (configuration snapshot), and metric files under the `plots/` directory.

---
## 8. Code Quality Improvements

SimAI 1.6 includes systematic code quality improvements:

### 8.1 Bilingual Comments and Documentation

- Bilingual (Chinese/English) docstrings added to all public APIs
- Bilingual comments added to the config, scheduler, predictor, and utils modules
- Bilingual comments added to the entities module
- Both shell script output and Python runtime output use the bilingual format

### 8.2 Logging Improvements

- `print` statements replaced with the `logging` module throughout (~12 files)
- Unified log format using a parenthesized bilingual style (e.g. `"GPU总内存 (Total GPU mem): 96.00 GB"`)

### 8.3 Dead Code Cleanup

- Removed roughly 390 lines of dead code
- Removed personal debugging markers

### 8.4 TODO Standardization

- Unified to the `TODO(author): description` format
- Added missing type annotations

---

## 9. System Architecture

### Inference Simulation Data Flow

```
Request Generator
    │ generates synthetic / real-trace requests
    ▼
Global Scheduler
    │ assigns requests to Prefill/Decode replicas
    ▼
Replica Scheduler
    │ batch assembly and scheduling
    ▼
Memory Management (MemoryPlanner + Replica)
    │ KV cache allocation and capacity checks
    ▼
Execution Time Predictor
    │ AICB / SimAI Simulation / SimAI Analytical / Vidur
    ▼
Metrics Store
    │ TTFT, TBT, E2E, communication/compute overhead
    ▼
Output (request_metrics.csv, chrome_trace.json, plots/)
```

---

## 10. Quick Start

### Environment Setup

#### Option 1: Docker (Recommended)

```bash
# Build from the project root
docker build -t simai:latest .
docker run --gpus all -it --rm simai:latest
```

> If using a Hopper GPU, add `ENV FLASH_MLA_DISABLE_SM100=1` to the Dockerfile.

#### Option 2: Conda

```bash
cd vidur-alibabacloud
conda env create -p ./env -f ./environment.yml
conda activate vidur
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
```

### Running the 4-Scenario Test Suite

```bash
# Prerequisite: conda activate vidur
bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all
```

### Building and Running SimAI Training Simulation

```bash
# Build SimAI-Analytical
./scripts/build.sh -c analytical

# Run
./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml
```

---

## 11. References

[1] SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision. NSDI'25 Spring. [[pdf](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf)]

[2] InferSim — Alibaba.
Parameter counting and KV cache estimation. [[GitHub](https://github.com/alibaba/InferSim)]

[3] DeepSeek V3 Parameter Derivation Explained. Zhihu. [[link](https://zhuanlan.zhihu.com/p/21455638257)]

[4] DeepSeek V3 Parameter Size Analysis. Yang Wenbo. [[link](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html)]

[5] Vidur: A Large-Scale Simulation Framework For LLM Inference. Microsoft Research. [[GitHub](https://github.com/microsoft/vidur)]

[6] splitwise-sim — Prefill-Decode Disaggregation Simulation. [[GitHub](https://github.com/Mutinifni/splitwise-sim)]

diff --git a/docs/en/benchmarking/index.md b/docs/en/benchmarking/index.md
new file mode 100644
index 00000000..2431cf17
--- /dev/null
+++ b/docs/en/benchmarking/index.md
@@ -0,0 +1,33 @@
# Benchmarking

This section covers benchmarking and validation approaches for SimAI.

---

## Contents

| Document | Description |
|----------|-------------|
| [4-Scenario End-to-End Test Suite](test_suite.md) | Pre-configured test scenarios covering different models, parallelism strategies, and PD configurations |

---

## Benchmarking Approaches

SimAI supports several benchmarking methodologies:

### Architecture Comparison

Compare different network architectures (e.g., Spectrum-X vs DCN+) under identical workloads to evaluate their performance characteristics.

### Algorithm Comparison

Compare different collective communication algorithms (e.g., RING vs NVLS) to understand their performance trade-offs at various message sizes.

### Parameter Optimization

Use SimAI-Analytical for rapid exploration of parallel parameter combinations (TP, PP, EP, DP) to find optimal configurations.

### Validation Against Real Hardware

Use AICB physical execution results as ground truth to validate simulation accuracy.
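As a worked example of such validation, one simple accuracy metric compares a simulated duration against the measured ground truth; the function name below is illustrative and not part of SimAI's tooling:

```python
def relative_error(simulated, measured):
    """Relative error of a simulated duration vs. a measured ground-truth duration."""
    return abs(simulated - measured) / measured

# e.g. simulated iteration time of 1.23 s vs. a measured 1.30 s
err = relative_error(1.23, 1.30)
print(f"simulation error: {err:.1%}")  # prints "simulation error: 5.4%"
```

The same calculation applies equally to per-collective bus bandwidth or end-to-end iteration times.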
diff --git a/docs/en/benchmarking/test_suite.md b/docs/en/benchmarking/test_suite.md new file mode 100644 index 00000000..54a54404 --- /dev/null +++ b/docs/en/benchmarking/test_suite.md @@ -0,0 +1,139 @@ +# 4-Scenario End-to-End Test Suite + +SimAI provides a pre-configured test suite covering 4 representative inference scenarios, enabling quick validation of all supported configurations. + +--- + +## Overview + +The test suite is located at `vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh` and covers different combinations of models, parallelism strategies, and PD disaggregation configurations. + +--- + +## Running + +```bash +# Run all 4 scenarios +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all + +# Run a single scenario +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 + +# Show help +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --help +``` + +> **Prerequisites**: `conda activate vidur` environment must be active. 
+ +--- + +## Shared Hardware Configuration + +All scenarios share the following hardware settings: + +| Parameter | Value | +|-----------|-------| +| GPU | H20 (h20_dgx) | +| NVLink Bandwidth | 1600 Gbps | +| RDMA Bandwidth | 800 Gbps | +| PD P2P Bandwidth | 800 Gbps | +| PD P2P Data Type | fp8 | +| Request Generator | Poisson, QPS=100 | +| Request Count | 4 | +| Prefill Tokens | 100 (fixed) | +| Decode Tokens | 8 (fixed) | + +--- + +## Scenario Configuration + +| Scenario | Model | PD Separation | World Size | TP | PP | EP | Global Scheduler | +|----------|-------|---------------|-----------|----|----|-----|-----------------| +| **1** | Qwen3-Next-80B (MoE) | No | 32 (dp=32) | 1 | 1 | 1 (default) | lor | +| **2** | Qwen3-Next-80B (MoE) | Yes (P=2, D=6) | 8 | 1 | 1 | 1 (default) | split_wise | +| **3** | DeepSeek-671B (MoE) | Yes (P=2, D=6) | 8 | 8 | 1 | 8 | split_wise | +| **4** | Qwen3-MoE-235B (MoE) | Yes (P=2, D=6) | 8 | 4 | 1 | 4 | split_wise | + +### Scenario Details + +- **Scenario 1**: Large-scale DP without PD separation — tests baseline throughput +- **Scenario 2**: Same model with PD separation — tests PD disaggregation overhead +- **Scenario 3**: DeepSeek-671B with large TP/EP — tests MoE with MLA attention +- **Scenario 4**: Qwen3-MoE-235B with moderate TP/EP — tests MHA/GQA attention model + +--- + +## Output + +### Output Directory + +- **Via run_scenarios.sh**: `examples/vidur-ali-scenarios/simulator_output/` +- **Direct python**: `./simulator_output/` + +### Output Files + +``` +// +├── request_metrics.csv # Per-request metrics +├── chrome_trace.json # Chrome DevTools timeline +├── config.json # Configuration snapshot +└── plots/ # Metric CSV/JSON files +``` + +### Logs + +Run logs are saved to `examples/vidur-ali-scenarios/logs/scenario__.log`. 
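After a run, the per-request CSV can be summarized with a few lines of Python. This is a generic sketch; the column name passed in is an assumption, so inspect the actual header of `request_metrics.csv` first:

```python
import csv
import statistics

def summarize_column(path, column):
    # NOTE: 'column' must match a real header in request_metrics.csv
    # (e.g. an end-to-end latency column) -- check the file first.
    with open(path, newline="") as f:
        values = sorted(float(row[column]) for row in csv.DictReader(f))
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "p50": values[len(values) // 2],
        "max": values[-1],
    }
```

For example, `summarize_column("simulator_output/<run>/request_metrics.csv", "<latency column>")` returns the count, mean, median, and maximum of that metric across requests.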
+ +--- + +## Architecture Comparison Examples + +### RING vs NVLS (SimAI-Simulation) + +```bash +# NVLS topology and run +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py --ro -g 32 -gt H100 -bw 400Gbps -nvbw 1360Gbps +AS_SEND_LAT=12 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 8 -w ./example/microAllReduce.txt \ + -n ./Rail_Opti_SingleToR_32g_8gps_400Gbps_H100 -c ./astra-sim-alibabacloud/inputs/config/SimAI.conf + +# RING topology and run +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py --ro -g 32 -gt H100 -bw 400Gbps -nvbw 1440Gbps +AS_SEND_LAT=2 AS_PXN_ENABLE=1 ./bin/SimAI_simulator -t 8 -w ./example/microAllReduce.txt \ + -n ./Rail_Opti_SingleToR_32g_8gps_400Gbps_H100 -c ./astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +**Results** (busbw in GB/s): + +| Message Size | NVLS | RING | +|-------------|------|------| +| 16M | 148.88 | 141.84 | +| 32M | 178.04 | 153.68 | +| 64M | 197.38 | 160.60 | +| 128M | 208.70 | 163.85 | +| 256M | 214.87 | 165.72 | +| 512M | 218.09 | 166.68 | + +### Spectrum-X vs DCN+ (SimAI-Simulation) + +```bash +# Generate topologies +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo DCN+ -g 256 -psn 64 -bw 400Gbps +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 256 +``` + +**Results** (busbw in GB/s): + +| Message Size | Spectrum-X | DCN+ SingleToR | +|-------------|------------|----------------| +| 16M | 33.10 | 23.33 | +| 64M | 42.05 | 23.68 | +| 256M | 45.10 | 36.21 | +| 512M | 45.65 | 36.24 | + +--- + +## See Also + +- [Inference Simulation](../user_guide/inference_simulation.md) — Full inference simulation guide +- [vidur-alibabacloud](../components/vidur.md) — Component documentation +- [Result Analysis](../user_guide/result_analysis.md) — Output interpretation diff --git a/docs/en/community/index.md b/docs/en/community/index.md new file mode 100644 index 00000000..f5579139 --- /dev/null +++ b/docs/en/community/index.md @@ -0,0 +1,90 @@ 
+# Community + +Welcome to the SimAI community! SimAI is built by a growing community of researchers and engineers from Alibaba Cloud and academic institutions worldwide. + +--- + +## Getting Help + +- **GitHub Issues**: [github.com/aliyun/SimAI/issues](https://github.com/aliyun/SimAI/issues) — Bug reports, feature requests, questions +- **Discussions**: Open an issue with "Question:" prefix for general questions +- **Email**: Gang Lu (yunding.lg@alibaba-inc.com), Feiyang Xue (xuefeiyang.xfy@alibaba-inc.com), Qingxu Li (qingxu.lqx@alibaba-inc.com) + +--- + +## Community Chat Groups + +Join the SimAI community via DingTalk or WeChat: + +
*(QR code images: SimAI DingTalk group, SimAI WeChat group)*
+ +--- + +## Events + +### Upcoming Events + +| Date | Event | Location | Type | +|:----:|:------|:---------|:----:| +| — | — | — | — | + +### Past Events + +| Date | Event | Location | Type | +|:----:|:------|:---------|:----:| +| Dec 30, 2025 | SimAI 1.5 Release | Online | Virtual | +| Jun 4, 2025 | First SimAI Community Workshop | Peking University | On-site | +| May 24, 2025 | 28th Chinasys Workshop — SimAI Talk | Chongqing University | On-site | +| Dec 27, 2024 | SimAI Technical Presentation | Beihang University | On-site | +| Dec 6, 2024 | HKUST Technical Workshop | HKUST(GZ) | On-site | +| Dec 5, 2024 | Bench'24 Conference — SimAI Tutorial | Guangzhou | On-site | +| Nov 26, 2024 | SimAI Community Live Stream (400+ attendees) | Online | Virtual | +| Nov 15, 2024 | Technical Workshop | Thousand Island Lake | On-site | +| Oct 18, 2024 | Guest Lecture — SimAI Tutorial | Fudan University | On-site | +| Sept 24-26, 2024 | CCF HPC China 2024 | Wuhan | Conference | + +--- + +## Citation + +SimAI has been accepted by **NSDI'25 Spring**. 
If you use SimAI in your research, please cite: + +```bibtex +@inproceedings{simai-nsdi25, + title={SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision}, + booktitle={NSDI'25 Spring}, + year={2025} +} +``` + +**Resources:** +- [Paper (PDF)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) +- [Slides](../../docs/SimAI_Intro_Online.pdf) +- [Video](https://n.dingtalk.com/dingding/live-room/index.html?roomId=OF5BkBUXVxmgsK7x&liveUuid=305736cd-aa70-498b-8003-2b471a53decd) + +--- + +## Acknowledgments + +Major contributors: + +- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/) +- Parth Parikh (KEYSIGHT) +- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin) +- Xinyue Li (BUPT) +- Tong Chen (Zhejiang University) +- Ming Wang (BUPT) +- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences) + +And many other individual contributors — see [Contributors to aliyun/SimAI](https://github.com/aliyun/SimAI/graphs/contributors). + +Special thanks to Chenning Li (MIT CSAIL) for initiating the cooperation on integrating SimAI into [M4](https://github.com/netiken/m4). + +--- + +## Contributing + +We warmly welcome contributions! See our [Contributing Guide](../developer_guide/contributing.md) for details. diff --git a/docs/en/components/aicb.md b/docs/en/components/aicb.md new file mode 100644 index 00000000..b6765545 --- /dev/null +++ b/docs/en/components/aicb.md @@ -0,0 +1,212 @@ +# AICB — AI Communication Benchmark + +**Repository**: [aliyun/aicb](https://github.com/aliyun/aicb) | **Language**: Python + +AICB is a specialized communication benchmarking suite for AI scenarios. It generates realistic communication workloads aligned to real-world LLM training and inference processes. 
+ +--- + +## Introduction + +AICB (Artificial Intelligence Communication Benchmark) produces communication workloads with precise patterns aligned to real-world applications. It supports: + +- Benchmarking and tuning GPU cluster communication systems +- Investigating communication patterns of specific model configurations +- Generating workloads for simulators like SimAI + +--- + +## Benchmark Suite + +AICB provides 10 pre-configured benchmark cases covering typical LLM configurations: + +| ID | Model | Seq Length | Framework | TP | PP | SP | MoE | +|----|-------|-----------|-----------|----|----|-----|-----| +| 1 | LLaMA-7B | 2048 | Megatron | 1 | 1 | - | - | +| 2 | GPT-13B | 2048 | Megatron | 2 | 1 | Yes | - | +| 3 | GPT-22B | 2048 | Megatron | 4 | 1 | - | - | +| 4 | LLaMA-65B | 4096 | Megatron | 8 | 2 | Yes | - | +| 5 | GPT-175B | 2048 | Megatron | 8 | 8 | Yes | - | +| 6 | GPT-175B | 2048 | Megatron | 8 | 8 | - | - | +| 7 | Llama3-405B | 8192 | Megatron | 8 | 16 | Yes | - | +| 8 | LLaMA-7B | 4096 | DeepSpeed | 1 | 1 | - | Zero-2 | +| 9 | LLaMA-65B | 4096 | DeepSpeed | 1 | 1 | - | Zero-3 | +| 10 | Mistral-8x7B | 2048 | Megatron | 2 | 1 | Yes | 8 experts | + +--- + +## Environment Setup + +### Docker + +```bash +docker build -t aicb:v0.0.1 . +docker run --gpus all --net host --shm-size 16g -it --rm aicb:v0.0.1 +``` + +### Local Environment + +Requirements: Python >= 3.8, CUDA >= 11.8, PyTorch >= 2.0.0, NVIDIA APEX + +### NGC Container + +```bash +docker pull nvcr.io/nvidia/pytorch:xx.xx-py3 +docker run --gpus all -it --rm -v /path/to/aicb:/workspace/aicb nvcr.io/nvidia/pytorch:xx.xx-py3 +``` + +> **Note**: Inference workload profiling requires NVIDIA Hopper (SM90) or Blackwell (SM100) GPUs. 
+ +--- + +## Physical Execution on GPU Clusters + +### Environment Variables + +| Parameter | Description | +|-----------|-------------| +| `nnodes` | Number of nodes | +| `node_rank` | Rank of the node | +| `nproc_per_node` | Number of GPUs per node | +| `master_addr` | Master node address | +| `master_port` | Master node port | + +### Running Megatron Workloads + +```bash +sh scripts/megatron_gpt.sh \ + --nnodes 1 --node_rank 0 --nproc_per_node 8 \ + --master_addr localhost --master_port 29500 \ + -m 7 --world_size 8 --tensor_model_parallel_size 2 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 16 --micro_batch 1 \ + --seq_length 2048 --swiglu --use_flash_attn --aiob_enable +``` + +### Running MoE Workloads + +```bash +sh scripts/megatron_gpt.sh \ + -m moe --world_size 8 --tensor_model_parallel_size 4 \ + --moe_enable --expert_model_parallel_size 1 \ + --num_experts 4 --moe_router_topk 2 \ + --frame Megatron --global_batch 16 --micro_batch 1 \ + --sp --grouped_gemm --aiob_enable --swiglu --use_flash_attn +``` + +### Running DeepSeek Workloads + +```bash +sh scripts/megatron_gpt.sh \ + --frame DeepSeek -m deepseek \ + --tensor_model_parallel_size 4 --moe_enable \ + --expert_model_parallel_size 1 --num_experts 4 \ + --global_batch 4 --micro_batch 1 --world_size 4 \ + --num_layers 10 --sp --swiglu --aiob_enable +``` + +--- + +## Training Workload Generation + +Generate workload files for SimAI simulation: + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name GPT-13B --frame=Megatron \ + --world_size=16 --tensor_model_parallel_size=2 --pipeline_model_parallel=1 \ + --global_batch=16 --micro_batch=1 --num_layers=40 --seq_length=2048 \ + --hidden_size=5120 --epoch_num=1 --num_attention_heads=40 \ + --aiob_enable --use_flash_attn --swiglu +``` + +Output saved in `results/mocked_workload/`. 
+ +--- + +## Inference Workload Generation + +AICB generates inference workloads with prefill/decode phase separation for: + +| Model | Attention | MoE Experts | +|-------|-----------|-------------| +| DeepSeek-V3-671B | MLA | 256 routed + 1 shared | +| Qwen3-MoE-235B | MHA/GQA | 128 routed | +| Qwen3-Next-80B | Hybrid (full + linear) | 512 routed | + +Requires hardware-accelerated libraries: [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM), [FlashMLA](https://github.com/deepseek-ai/FlashMLA), [FlashInfer](https://github.com/flashinfer-ai/flashinfer). + +--- + +## AIOB: Computation Profiling + +AIOB profiles actual GPU computation times and embeds them into workloads: + +- `--aiob_enable` — Profile on current GPU +- `--comp_filepath ` — Use pre-existing profiling data + +Output saved in `results/aiob_outputs/`. + +--- + +## Custom Model Development + +AICB supports creating workloads for custom model architectures using `MockedParam` and `MockedModel` base classes. + +The training process is abstracted into: `init → forward → backward → step` + +Each workload item consists of: +1. **Communication info**: `comm_type`, `comm_group`, `comm_group_size`, `msg_size` +2. **Additional info**: source node (broadcast), compute time +3. **Runtime info**: `elapsed_time`, `algo_bw`, `bus_bw` + +Refer to existing `MockedMegatron` and `MockedDeepSpeed` implementations for examples. 
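As a rough illustration, the three groups of fields above could be modeled like this. The class and field names are a hypothetical sketch, not AICB's actual identifiers; see the `MockedMegatron` and `MockedDeepSpeed` sources for the real structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkloadItem:
    # 1. Communication info
    comm_type: str                    # e.g. "all_reduce", "broadcast"
    comm_group: str                   # e.g. "tp", "dp", "ep"
    comm_group_size: int
    msg_size: int                     # message size in bytes
    # 2. Additional info
    src_node: Optional[int] = None    # source rank, meaningful for broadcast
    compute_time: float = 0.0         # attached compute time (e.g. from AIOB)
    # 3. Runtime info, filled in during physical execution
    elapsed_time: Optional[float] = None
    algo_bw: Optional[float] = None
    bus_bw: Optional[float] = None

# A 64 MB tensor-parallel all-reduce, before execution
item = WorkloadItem(comm_type="all_reduce", comm_group="tp",
                    comm_group_size=8, msg_size=64 * 1024 * 1024)
```

A custom `MockedModel` then emits a sequence of such items for each phase of the `init → forward → backward → step` loop.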
+ +--- + +## Key Parameters + +| Category | Parameter | Description | +|----------|-----------|-------------| +| Framework | `frame` | Megatron / DeepSpeed / DeepSeek | +| Model | `model_size` | Pre-configured size (7/13/22/175/moe/deepseek) | +| Training | `world_size` | Total GPU count | +| | `global_batch` | Total batch size | +| | `micro_batch` | Micro-batch size | +| | `seq_length` | Sequence length | +| Parallelism | `tensor_model_parallel_size` | TP degree | +| | `pipeline_model_parallel` | PP degree | +| | `expert_model_parallel_size` | EP degree | +| MoE | `moe_enable` | Enable MoE | +| | `num_experts` | Number of experts | +| | `moe_router_topk` | Experts per token | +| DeepSeek | `qk_rope_dim` | RoPE dimension for QK | +| | `kv_lora_rank` | KV compression LoRA dimension | +| | `q_lora_rank` | Q compression LoRA dimension | +| | `n_shared_expert` | Number of shared experts | +| Optimization | `use_flash_attn` | FlashAttention | +| | `swiglu` | SwiGLU activation | +| | `aiob_enable` | AIOB compute profiling | +| | `comp_filepath` | Pre-existing computation file | + +--- + +## Result Output + +### Physical Execution + +- Per-communication logs: type, group, message size, execution time, throughput +- Per-iteration timing analysis +- CSV outputs in `results/comm_logs/` + +### Workload Files + +- Training workloads: `results/mocked_workload/` or `results/workload/` +- AIOB profiles: `results/aiob_outputs/` + +--- + +## See Also + +- [Workload Generation Guide](../user_guide/workload_generation.md) — User-facing workload generation guide +- [Supported Models](../user_guide/supported_models.md) — Full model list +- [Tutorial](https://github.com/aliyun/aicb/blob/master/training/tutorial.md) — Detailed AICB tutorial diff --git a/docs/en/components/astra_sim.md b/docs/en/components/astra_sim.md new file mode 100644 index 00000000..d3685618 --- /dev/null +++ b/docs/en/components/astra_sim.md @@ -0,0 +1,147 @@ +# astra-sim-alibabacloud — Simulation Engine + 
+**Location**: In-tree (`astra-sim-alibabacloud/`) | **Language**: C++ + +The core simulation engine of SimAI, extended from [astra-sim 1.0](https://github.com/astra-sim/astra-sim/tree/ASTRA-sim-1.0). It supports three operation modes and integrates NCCL algorithms with custom enhancements. + +--- + +## Overview + +astra-sim-alibabacloud serves as the central orchestrator for SimAI simulations. It: + +- Receives workloads from AICB +- Uses SimCCL to decompose collective operations into P2P transfers +- Drives network simulation via NS-3 (simulation mode) or direct RDMA (physical mode) +- Computes timing using busbw parameters (analytical mode) + +--- + +## Three Operation Modes + +### SimAI-Analytical + +Fast analytical simulation using bus bandwidth (busbw) to estimate collective communication times. + +**Build**: `./scripts/build.sh -c analytical` +**Binary**: `bin/SimAI_analytical` + +### SimAI-Simulation + +Full-stack simulation with NS-3 network backend for fine-grained network modeling. + +**Build**: `./scripts/build.sh -c ns3` +**Binary**: `bin/SimAI_simulator` + +### SimAI-Physical + +Physical traffic generation using RDMA on real hardware. + +**Build**: `./scripts/build.sh -c phy` +**Binary**: `bin/SimAI_phynet` + +--- + +## Core Components + +| Component | Description | +|-----------|-------------| +| **AstraComputeAPI** | Manages computation timing and scheduling | +| **MemoryAPI** | Handles memory allocation and tracking | +| **NetworkAPI** | Interface to network backends (NS-3, physical) | +| **MockNcclGroup** | Simulates NCCL communication groups | +| **MockNcclChannel** | Manages individual communication channels | +| **SimAiFlowModelRdma** | RDMA flow model for traffic simulation | + +--- + +## Configuration + +### SimAI.conf + +The main configuration file is located at `astra-sim-alibabacloud/inputs/config/SimAI.conf`. 
It controls simulation parameters including: + +- Communication algorithms +- Buffer sizes +- Timing parameters +- Network backend settings + +### Environment Variables (Simulation Mode) + +| Variable | Description | Default | +|----------|-------------|---------| +| `AS_LOG_LEVEL` | Log level: DEBUG, INFO, WARNING, ERROR | `INFO` | +| `AS_PXN_ENABLE` | Enable PXN (Proxied NVLINK) | `0` (false) | +| `AS_NVLS_ENABLE` | Enable NVLS (NVLink Sharp) | `0` (false) | +| `AS_SEND_LAT` | Packet sending latency (us) | `6` | +| `AS_NVLSTREE_ENABLE` | Enable NVLS Tree algorithm | `false` | + +### Simulation Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `-t` / `--thread` | Number of threads for acceleration | `1` (recommended: 8-16) | +| `-w` / `--workload` | Path to workload file | Required | +| `-n` / `--network-topo` | Network topology file path | Required (simulation mode) | +| `-c` / `--config` | SimAI configuration file | Required | + +--- + +## Topology Generation + +astra-sim provides 5 topology templates via `gen_Topo_Template.py`: + +### Available Templates + +| Template | Architecture | Description | +|----------|-------------|-------------| +| `Spectrum-X` | NVIDIA Spectrum-X | Rail-optimized, single ToR, single plane | +| `AlibabaHPN` (single plane) | Alibaba HPN 7.0 | Dual ToR, rail-optimized, single plane | +| `AlibabaHPN` (dual plane) | Alibaba HPN 7.0 | Dual ToR, rail-optimized, dual plane | +| `DCN+` (single ToR) | DCN+ | Single ToR, non rail-optimized | +| `DCN+` (dual ToR) | DCN+ | Dual ToR, non rail-optimized | + +### Topology Parameters + +| Level | Parameter | Description | +|-------|-----------|-------------| +| **Global** | `-topo` | Template name | +| | `-g` | Number of GPUs | +| | `--dp` | Enable dual plane | +| | `--ro` | Enable rail-optimized | +| | `--dt` | Enable dual ToR | +| **Intra-Host** | `-gps` | GPUs per server | +| | `-gt` | GPU type (A100/H100) | +| | `-nvbw` | NVLink bandwidth | +| | `-nl` 
| NVLink latency | +| **Intra-Segment** | `-bw` | NIC to ASW bandwidth | +| | `-asw` | ASW switch count | +| | `-nps` | NICs per switch | +| **Intra-Pod** | `-psn` | PSW switch count | +| | `-apbw` | ASW to PSW bandwidth | + +### Examples + +```bash +# Spectrum-X with 128 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +# Dual-Plane AlibabaHPN with 64 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo AlibabaHPN --dp -g 64 -asn 16 -psn 16 + +# Dual-ToR DCN+ with 128 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo DCN+ --dt -g 128 -asn 2 -psn 8 +``` + +--- + +## See Also + +- [SimAI-Analytical Guide](../user_guide/simai_analytical.md) — Analytical mode usage +- [SimAI-Simulation Guide](../user_guide/simai_simulation.md) — NS-3 simulation usage +- [SimAI-Physical Guide](../user_guide/simai_physical.md) — Physical mode usage +- [NS-3 Component](ns3.md) — Network backend details +- [SimCCL Component](simccl.md) — Collective communication decomposition diff --git a/docs/en/components/index.md b/docs/en/components/index.md new file mode 100644 index 00000000..80890c35 --- /dev/null +++ b/docs/en/components/index.md @@ -0,0 +1,80 @@ +# Components Overview + +SimAI is a modular project composed of 5 core components. Each component can be used independently or combined to achieve different simulation scenarios. 
+ +--- + +## Architecture + +``` + |--- AICB (Workload generation & compute profiling) +SimAI --|--- SimCCL (Collective communication algorithm analysis) + |--- astra-sim-alibabacloud (Simulation engine: Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 network backend) + |--- vidur-alibabacloud (Multi-request inference scheduling & memory management) +``` + +--- + +## Component Summary + +| Component | Language | Repository | Description | +|-----------|----------|------------|-------------| +| [AICB](aicb.md) | Python | [aliyun/aicb](https://github.com/aliyun/aicb) | AI Communication Benchmark — workload generation for training and inference | +| [SimCCL](simccl.md) | Python | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | Collective communication to point-to-point transformation | +| [astra-sim-alibabacloud](astra_sim.md) | C++ | In-tree | Core simulation engine supporting 3 modes (Analytical/Simulation/Physical) | +| [ns-3-alibabacloud](ns3.md) | C++ | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | NS-3 network simulator backend with RDMA/datacenter extensions | +| [vidur-alibabacloud](vidur.md) | Python | In-tree | LLM inference simulation with PD disaggregation and request scheduling | + +--- + +## Component Combinations by Scenario + +| Scenario | AICB | SimCCL | astra-sim | ns-3 | vidur | +|----------|------|--------|-----------|------|-------| +| AICB Test Suite (physical GPU) | Required | - | - | - | - | +| Workload Generation | Required | - | - | - | - | +| Collective Comm Analysis | - | Required | - | - | - | +| SimAI-Analytical | Required | - | Required (analytical) | - | - | +| SimAI-Simulation | Required | Required | Required (simulation) | Required | - | +| SimAI-Physical | Required | Required | Required (physical) | - | - | +| Inference Simulation | Required | Required | Required (analytical/simulation) | Optional | Required | + +--- + +## Data Flow + +``` +AICB (Workload Generation) + | + |-- 
Training workload (.txt) --> astra-sim-alibabacloud + |-- Inference workload -------> vidur-alibabacloud + | +SimCCL (Collective → P2P) + | + |--> astra-sim-alibabacloud (Simulation/Physical mode) + | +astra-sim-alibabacloud (Simulation Engine) + | + |-- Analytical mode: busbw-based estimation + |-- Simulation mode: NS-3 backend + |-- Physical mode: RDMA traffic injection + | +ns-3-alibabacloud (Network Backend) + | + |--> Fine-grained network simulation results + | +vidur-alibabacloud (Inference Scheduling) + | + |--> request_metrics.csv, chrome_trace.json, plots/ +``` + +--- + +## Detailed Component Documentation + +- **[AICB](aicb.md)** — Workload generation, benchmark suite, AIOB compute profiling +- **[SimCCL](simccl.md)** — Collective communication decomposition +- **[astra-sim-alibabacloud](astra_sim.md)** — Core simulation engine, configuration, topology generation +- **[ns-3-alibabacloud](ns3.md)** — RDMA network simulation, CC algorithms, analysis tools +- **[vidur-alibabacloud](vidur.md)** — Inference simulation, PD disaggregation, GPU memory management diff --git a/docs/en/components/ns3.md b/docs/en/components/ns3.md new file mode 100644 index 00000000..ed5c0468 --- /dev/null +++ b/docs/en/components/ns3.md @@ -0,0 +1,175 @@ +# ns-3-alibabacloud — Network Simulation Backend + +**Repository**: [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | **Language**: C++ + +An NS-3-based network simulator acting as the network backend for SimAI, extended with datacenter/RDMA-oriented end-to-end modeling. 
+ +--- + +## Overview + +Compared to upstream [NS-3](https://www.nsnam.org/), ns-3-alibabacloud extends the point-to-point module with comprehensive datacenter networking features: + +- **QBB/PFC + multi-priority queues** — 8 priority queues with PAUSE/RESUME handling +- **ECN + CNP feedback** — Switch-side ECN marking and receiver-side congestion notification +- **RDMA host stack (QP-level)** — Full QP modeling with 5 congestion control algorithms +- **Switch and NVSwitch modeling** — ECMP forwarding, buffer management, PFC logic + +### dev/qp Branch + +The [dev/qp](https://github.com/aliyun/ns-3-alibabacloud/tree/dev/qp) branch includes additional enhancements: + +1. QP logic support with creation/destruction based on actual RDMA logic +2. Per-IP or per-QP NIC CC configuration +3. Optimized Max-Min scheduling logic +4. Decoupled CC module for modularity + +--- + +## Core Modules + +### QBB Net Device (`qbb-net-device`) + +A QBB-capable net device with 8 priorities built on top of `PointToPointNetDevice`. 
Features: + +- PFC PAUSE/RESUME handling +- `RdmaEgressQueue` with high-priority ACK/NACK queue + round-robin across QPs +- `BEgressQueue` for switch port round-robin +- NVSwitch send path support (NVLS mode) + +**Key attributes**: `QbbEnabled`, `QcnEnabled`, `DynamicThreshold`, `PauseTime`, `NVLS_enable` + +### RDMA Host Stack (`rdma-hw`) + +Host RDMA core implementing: + +- QP create/delete lifecycle +- Packet construction (PPP + IPv4 + UDP + SeqTs headers) +- ACK/NACK/CNP processing +- Per-QP congestion control algorithms +- NVSwitch routing tables + +**Congestion Control Algorithms**: + +| Algorithm | Description | +|-----------|-------------| +| **DCQCN** | Data Center Quantized Congestion Notification | +| **HPCC** | High Precision Congestion Control | +| **TIMELY** | RTT-based congestion control | +| **DCTCP** | Data Center TCP | +| **HPCC-PINT** | HPCC with Probabilistic INT | + +**Protocol Numbers (IPv4 Protocol field)**: + +| Protocol | Number | Description | +|----------|--------|-------------| +| UDP Data | `0x11` | Normal data packets | +| CNP | `0xFF` | Congestion Notification Packet | +| PFC | `0xFE` | Priority Flow Control | +| ACK | `0xFC` | Acknowledgment | +| NACK | `0xFD` | Negative Acknowledgment | + +### Switch Node (`switch-node`) + +Switch pipeline implementing: +- ECMP forwarding (5-tuple hash) +- Admission control via MMU +- PFC pause/resume generation +- ECN marking +- INT/PINT injection for HPCC/HPCC-PINT + +### Switch MMU (`switch-mmu`) + +Switch buffer/MMU model: +- Ingress/egress accounting +- Shared buffer and headroom management +- PFC trigger/resume logic +- ECN marking probability curve (`kmin/kmax/pmax`) + +### NVSwitch Node (`nvswitch-node`) + +NVSwitch model for intra-server GPU communication, paired with NVLS routing logic in `RdmaHw`/`QbbNetDevice`. 
+ +### QP State (`rdma-queue-pair`) + +Per-QP and per-RxQP state management including: +- Window and rate control +- ACKed sequence tracking +- Per-CC algorithm state (DCQCN alpha/targetRate, HPCC hop state, TIMELY RTT, DCTCP alpha/ecnCnt, PINT state) + +--- + +## Analysis Tools + +Located in `ns-3-alibabacloud/analysis/`: + +### FCT Analysis + +```bash +python fct_analysis.py -h # See help for usage +``` + +Reads FCT output files and produces statistics for flow completion time analysis. + +### Trace Reader + +```bash +# Build +make trace_reader + +# Usage +./trace_reader <.tr file> [filter_expr] + +# Filter examples +./trace_reader output.tr "time > 2000010000" +./trace_reader output.tr "sip=0x0b000101&dip=0x0b000201" +``` + +### Trace Output Format + +``` +2000055540 n:338 4:3 100608 Enqu ecn:0 0b00d101 0b012301 10000 100 U 161000 0 3 1048(1000) +``` + +Fields: timestamp, node, port:queue, queue_length, event, ecn, src_ip, dst_ip, src_port, dst_port, pkt_type, seq, tx_time, priority, size(payload) + +--- + +## Headers and Utilities + +| File | Description | +|------|-------------| +| `qbb-header` | ACK/NACK header with optional INT header | +| `cn-header` | CNP header (feedback fields) | +| `pause-header` | PFC pause header | +| `pint` | PINT encode/decode utilities | +| `trace-format.h` | Binary trace record structure for offline analysis | + +--- + +## Extension Guide + +### Adding a New CC Algorithm + +1. **Primary**: `rdma-hw.{h,cc}` — Add `HandleAckX`/`UpdateRateX` methods, dispatch by `m_cc_mode` +2. **Often needed**: `rdma-queue-pair.h` — Add new per-QP state variables +3. **If switch feedback required**: `switch-node.cc` — Add INT/PINT or new markings + +### Changing Switch Behavior + +1. **Primary**: `switch-mmu.{h,cc}` — Modify thresholds, curves, formulas +2. **Marking/injection**: `switch-node.cc::SwitchNotifyDequeue()` +3. **Admission/priority**: `switch-node.cc::SendToDev()` + +### Adding New Control Packets + +1. 
Create new `*Header` in `model/` (follow `CnHeader`/`PauseHeader` pattern) +2. Add parsing in `QbbNetDevice::Receive()` or `RdmaHw::Receive()` + +--- + +## See Also + +- [SimAI-Simulation Guide](../user_guide/simai_simulation.md) — Full-stack simulation usage +- [astra-sim Component](astra_sim.md) — Simulation engine +- [Extending NS-3 Guide](../developer_guide/extending_ns3.md) — Detailed extension guide diff --git a/docs/en/components/simccl.md b/docs/en/components/simccl.md new file mode 100644 index 00000000..f5712b45 --- /dev/null +++ b/docs/en/components/simccl.md @@ -0,0 +1,82 @@ +# SimCCL — Collective Communication Library + +**Repository**: [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | **Language**: Python/C++ + +SimCCL enables the transformation of collective communication operations into point-to-point communications, serving as a critical bridge between the workload layer and the simulation engine. + +--- + +## Overview + +In distributed LLM training, collective communication operations (AllReduce, AllGather, ReduceScatter, AlltoAll, etc.) are fundamental building blocks. SimCCL breaks down these high-level collective operations into sequences of point-to-point communications that can be precisely simulated by the network backend. 
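To make the decomposition concrete, here is a sketch of the classic ring algorithm: each of N ranks runs 2(N-1) steps (a reduce-scatter phase followed by an all-gather phase), sending a 1/N-sized chunk to its ring neighbor in every step, for the usual 2(N-1)/N × message-size traffic per rank. The function name and tuple layout are illustrative, not SimCCL's actual API:

```python
def ring_allreduce_schedule(num_ranks: int, msg_bytes: int):
    """Decompose one AllReduce into point-to-point sends (ring algorithm).

    Returns a list of (step, src, dst, nbytes) tuples: 2*(N-1) steps,
    with every rank sending one 1/N-sized chunk to its ring neighbor
    per step.
    """
    chunk = msg_bytes // num_ranks
    schedule = []
    for step in range(2 * (num_ranks - 1)):
        for src in range(num_ranks):
            schedule.append((step, src, (src + 1) % num_ranks, chunk))
    return schedule
```

The real library additionally chooses between ring, tree, and NVSwitch-assisted variants based on topology and message size, but the output consumed by the network backend has this same shape: a timed sequence of P2P transfers.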
+ +--- + +## Role in SimAI + +SimCCL sits between AICB (workload generation) and astra-sim-alibabacloud (simulation engine): + +``` +AICB generates workload with collective ops + | + v +SimCCL decomposes collective → point-to-point + | + v +astra-sim sends P2P traffic to NS-3 or physical network +``` + +SimCCL is required for: +- **SimAI-Simulation** — Full-stack NS-3 simulation +- **SimAI-Physical** — Physical RDMA traffic generation +- **Inference Simulation** — When using SimAI Simulation backend + +SimCCL is NOT required for: +- **SimAI-Analytical** — Uses busbw-based estimation directly + +--- + +## Versions + +### Basic Version (mocknccl) + +A basic implementation is currently available in the [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud) repository. Files are prefixed with `mocknccl` and provide fundamental collective-to-P2P conversion. + +### Complete Version + +The full SimCCL library with advanced collective communication algorithms is available in the [SimCCL repository](https://github.com/aliyun/SimCCL). + +--- + +## Supported Collective Operations + +| Operation | Description | +|-----------|-------------| +| AllReduce | Reduce data across all ranks, result available on all ranks | +| AllGather | Gather data from all ranks, result available on all ranks | +| ReduceScatter | Reduce and scatter data across all ranks | +| AlltoAll | All-to-all personalized communication | +| Broadcast | Broadcast from one rank to all others | + +--- + +## Integration with astra-sim + +SimCCL integrates with astra-sim-alibabacloud through the `MockNcclGroup` and `MockNcclChannel` interfaces: + +- **MockNcclGroup**: Manages a group of ranks participating in a collective operation +- **MockNcclChannel**: Handles the actual point-to-point data transfer for a specific channel within a collective operation + +The decomposition considers: +- Network topology (ring, tree, etc.) 
+- Number of participating ranks +- Message size +- Available communication channels + +--- + +## See Also + +- [Components Overview](index.md) — SimAI component architecture +- [astra-sim Component](astra_sim.md) — Simulation engine that consumes SimCCL output +- [NS-3 Component](ns3.md) — Network backend for P2P simulation diff --git a/docs/en/components/vidur.md b/docs/en/components/vidur.md new file mode 100644 index 00000000..67dc4a79 --- /dev/null +++ b/docs/en/components/vidur.md @@ -0,0 +1,196 @@ +# vidur-alibabacloud + +**vidur-alibabacloud** is the LLM inference simulation component of SimAI, adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur). It provides multi-request inference scheduling, GPU memory management, and Prefill-Decode (PD) disaggregation support. + +- **Repository**: In-tree (`vidur-alibabacloud/`) +- **Language**: Python +- **License**: MIT + +--- + +## Key Features + +- **Prefill-Decode (PD) Separation** — Run prefill and decode stages on different nodes for elastic resource allocation and performance isolation. Inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim). +- **Flexible Parallelism** — Data Parallel (DP), Tensor Parallel (TP), Pipeline Parallel (PP), Expert Parallel (EP) +- **Multiple Execution Backends** — AICB/AIOB, SimAI Simulation (NS-3), SimAI Analytical, Native Vidur +- **Workload Generation & Replay** — Synthetic (fixed/Poisson) or real-trace request replay +- **Fine-Grained Metrics** — TTFT, TBT/TPOT, E2E latency, communication cost, compute cost, scheduling delay + +--- + +## GPU Memory Calculation Module + +This module provides accurate GPU memory estimation for MoE models during inference. + +### Components + +| Component | File | Description | +|-----------|------|-------------| +| **ParamCounter** | `vidur/utils/param_counter.py` | Per-layer and per-device parameter counting for MLA, MHA/GQA, linear attention, and MoE experts. 
Returns `(total_params, prefill_params, decode_params)` under PD disaggregation | +| **MemoryPlanner** | `vidur/scheduler/utils/memory_planner.py` | Plans GPU memory budget: `available = GPU_mem * (1 - margin) - param_mem`, computes KV cache capacity and max concurrent requests. Includes OOM detection | +| **Per-request KV Cache Tracking** | `vidur/entities/replica.py` | Allocates/releases KV cache memory per request, enabling runtime remaining-capacity queries | + +### Supported Attention Architectures + +| Architecture | Model | Description | +|---|---|---| +| **MLA** (Multi-head Latent Attention) | DeepSeek-V3-671B | LoRA-compressed KV cache (`kv_lora_rank` + `qk_rope_head_dim`), ~57x memory savings vs MHA | +| **MHA / GQA** | Qwen3-MoE-235B | Standard KV cache with `num_kv_heads * head_dim` per token per layer | +| **Hybrid Full + Linear Attention** | Qwen3-Next-80B | Alternates between full attention and linear (GDN) attention every 4 layers | + +--- + +## Supported Models + +| Model | Attention | Experts | Status | +|-------|-----------|---------|--------| +| DeepSeek-V3-671B | MLA | 256 routed + 1 shared | PP/EP adaptation in progress | +| Qwen3-MoE-235B | MHA/GQA | 128 routed | PP/EP adaptation in progress | +| Qwen3-Next-80B | Hybrid | 512 routed | PP/EP adaptation in progress | +| Meta-Llama-3-8B / 70B | MHA | Dense | Supported | +| Llama-2-7b / 70b | MHA | Dense | Supported | +| CodeLlama-34b | MHA | Dense | Supported | +| InternLM-20B | MHA | Dense | Supported | +| Qwen-72B | MHA | Dense | Supported | + +--- + +## Environment Setup + +### Docker (Recommended) + +```bash +docker build -t simai:latest . +docker run --gpus all -it --rm simai:latest +``` + +> Add `ENV FLASH_MLA_DISABLE_SM100=1` to Dockerfile when using Hopper GPUs. 
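As a sanity check on the ~57x figure quoted in the attention-architecture table above, the per-token-per-layer KV cache size can be computed directly. The DeepSeek-V3 dimensions below (128 heads with a nominal head dim of 128, `kv_lora_rank=512`, `qk_rope_head_dim=64`) come from its public model config; the helpers themselves are an illustrative sketch, not the actual `ParamCounter` API:

```python
def kv_elems_per_token_mha(num_kv_heads: int, head_dim: int) -> int:
    # Standard attention caches one K and one V vector per head, per layer.
    return 2 * num_kv_heads * head_dim

def kv_elems_per_token_mla(kv_lora_rank: int, qk_rope_head_dim: int) -> int:
    # MLA caches only the compressed KV latent plus the decoupled RoPE key.
    return kv_lora_rank + qk_rope_head_dim

# DeepSeek-V3-style dimensions
mha = kv_elems_per_token_mha(num_kv_heads=128, head_dim=128)         # 32768 elems
mla = kv_elems_per_token_mla(kv_lora_rank=512, qk_rope_head_dim=64)  # 576 elems
ratio = mha / mla                                                    # ~56.9
```

Multiplying by bytes per element and layer count gives the per-token KV footprint that the MemoryPlanner converts, via `available = GPU_mem * (1 - margin) - param_mem`, into a maximum concurrent-request budget.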
+ +### Conda + +```bash +cd vidur-alibabacloud +conda env create -p ./env -f ./environment.yml +conda activate vidur +pip install -r requirements.txt +``` + +--- + +## Key Input Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `--replica_config_model_name` | HuggingFace model ID or config path | Required | +| `--cluster_config_num_replicas` | Number of replicas (DP) | 1 | +| `--replica_config_tensor_parallel_size` | TP degree | 1 | +| `--replica_config_num_pipeline_stages` | PP stages | 1 | +| `--replica_config_expert_model_parallel_size` | EP degree | 1 | +| `--replica_config_pd_node_ratio` | P:D node ratio (e.g., `"2:6"`) | `""` (no PD) | +| `--cluster_config_global_scheduler_type` | Global scheduler: `lor` / `round_robin` / `split_wise` | `lor` | +| `--cluster_config_replica_scheduler_type` | Per-replica scheduler: `sarathi` / `split_wise` | `sarathi` | +| `--request_generator_config_type` | `synthetic` / `trace_replay` | `synthetic` | +| `--synthetic_request_generator_config_num_requests` | Number of requests to generate | 100 | +| `--poisson_request_generator_config_qps` | Queries per second (Poisson mode) | 1.0 | +| `--replica_config_device` | GPU type (e.g., `h20_dgx`) | Required | +| `--replica_config_network_device` | Network type | Same as device | +| `--execution_time_predictor_config_type` | Backend: `aicb` / `simai_simulation` / `simai_analytical` / `random_forrest` | `random_forrest` | +| `--nvlink_bandwidth_gbps` | NVLink bandwidth | 1600 | +| `--rdma_bandwidth_gbps` | RDMA bandwidth | 800 | +| `--pd_p2p_bandwidth_gbps` | PD inter-node P2P bandwidth | 800 | +| `--replica_config_fp8_enabled` | Enable FP8 quantization | false | +| `--replica_config_memory_margin_fraction` | GPU memory safety margin | 0.1 | + +--- + +## Output Files + +Each run produces the following outputs: + +| File | Description | +|------|-------------| +| `request_metrics.csv` | Per-request metrics with 17 columns | +| 
`chrome_trace.json` | Timeline trace for Chrome `chrome://tracing` visualization | +| `config.json` | Configuration snapshot | +| `plots/` | Metric visualization plots | + +### request_metrics.csv Columns + +| Column | Description | +|--------|-------------| +| `request_id` | Unique request identifier | +| `arrived_at` | Request arrival time | +| `scheduled_at` | First schedule time | +| `completed_at` | Request completion time | +| `prefill_completed_at` | Prefill completion time (first token) | +| `num_prefill_tokens` | Number of input tokens | +| `num_decode_tokens` | Number of generated tokens | +| `scheduling_delay` | Wait time before scheduling | +| `e2e_time` | End-to-end latency | +| `e2e_time_normalized` | E2E latency / num_decode_tokens | +| `execution_time` | Actual GPU execution time | +| `preemption_time` | Time spent preempted | +| `num_restarts` | Number of restarts | +| `prefill_e2e_time` | TTFT (Time to First Token) | +| `decode_time_normalized` | Average TBT (Time Between Tokens) | +| `total_comm_cost` | Total communication time | +| `total_compute_cost` | Total compute time | + +--- + +## Simulation Metrics (23 Items) + +The simulator logs the following metrics (see `vidur-alibabacloud/docs/metrics.md` for details): + +1. `request_inter_arrival_delay_histogram` — Request inter-arrival delay distribution +2. `request_num_tokens_histogram` — Token count distribution (prefill + decode) +3. `request_num_restarts_histogram` — Restart count distribution +4. `request_e2e_time_cdf` — End-to-end latency CDF +5. `request_e2e_time_normalised_cdf` — Normalized E2E latency CDF +6. `request_execution_plus_preemption_times_cdf` — Execution + preemption time CDF +7. `request_scheduling_delay_cdf` — Scheduling delay CDF +8. `request_execution_time_cdf` — Pure execution time CDF +9. `request_preempted_time_cdf` — Preemption time CDF +10. `decode_token_execution_plus_preemption_times` — Per-token inter-token delay CDF +11. 
`batch_num_tokens_cdf` — Batch total token count CDF +12. `batch_sizes_cdf` — Batch size CDF +13. `prefill_time_e2e_cdf` — TTFT CDF +14. `prefill_time_execution_plus_preemption_cdf` — Prefill processing time CDF +15. `prefill_time_execution_plus_preemption_normalized_cdf` — Normalized prefill time CDF +16. `decode_time_execution_plus_preemption_normalized_cdf` — Normalized decode time CDF +17. `request_completions_time_series` — Request completion time series +18. `prefill_completions_time_series` — Prefill completion time series +19. `decode_completions_time_series` — Decode completion time series +20. `replica_{id}_memory_usage_weighted_mean` — Per-replica memory utilization +21. `replica_{id}_stage_{id}_busy_time_percent_weighted_mean` — Per-stage busy time percentage +22. `replica_{id}_stage_{id}_mfu_weighted_mean` — Per-stage MFU +23. `request_arrivals_time_series` — Request arrival time series + +--- + +## 4-Scenario Test Suite + +Run all pre-configured scenarios: + +```bash +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +# Or a single scenario: +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 3 +``` + +For detailed scenario configuration, see [Benchmarking — Test Suite](../benchmarking/test_suite.md). + +--- + +## Adding New Models + +To add a new model to vidur-alibabacloud, see the [Adding Models Guide](../developer_guide/adding_models.md) and the upstream documentation at `vidur-alibabacloud/docs/profiling.md`. 
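As a post-processing sketch, the per-request columns documented above can be aggregated into the headline serving metrics (TTFT, TBT, E2E). The helper below assumes only the column names from the `request_metrics.csv` table (`prefill_e2e_time` is TTFT, `decode_time_normalized` is the per-request mean TBT); it is not part of the simulator itself:

```python
import csv
import statistics

def summarize_latency(path: str) -> dict:
    """Aggregate TTFT / TBT / E2E latency from a request_metrics.csv file."""
    ttft, tbt, e2e = [], [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            ttft.append(float(row["prefill_e2e_time"]))
            tbt.append(float(row["decode_time_normalized"]))
            e2e.append(float(row["e2e_time"]))
    return {
        "ttft_p50": statistics.median(ttft),
        "tbt_p50": statistics.median(tbt),
        "e2e_mean": statistics.fmean(e2e),
    }
```

The same pattern extends to any percentile (e.g., `statistics.quantiles` for p99), which is often more informative than the mean for scheduling-delay analysis.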
+ +--- + +## Related Documentation + +- [Inference Simulation User Guide](../user_guide/inference_simulation.md) — End-to-end inference simulation workflow +- [Result Analysis](../user_guide/result_analysis.md) — How to interpret output files +- [GPU Memory Module Technical Reference](../technical_reference/memory_module.md) — Detailed memory calculation formulas +- [Benchmarking Test Suite](../benchmarking/test_suite.md) — 4-scenario configuration details diff --git a/docs/en/developer_guide/adding_models.md b/docs/en/developer_guide/adding_models.md new file mode 100644 index 00000000..4d1eeb00 --- /dev/null +++ b/docs/en/developer_guide/adding_models.md @@ -0,0 +1,204 @@ +# Adding New Models + +This guide covers how to add new model support to SimAI, including both the Vidur inference simulation side (GPU memory, profiling) and the AICB workload generation side. + +--- + +## Overview + +Adding a new model typically involves two components: + +| Component | What to add | Required hardware | +|-----------|-------------|-------------------| +| **vidur-alibabacloud** | Model config, profiling data (compute + network) | GPU (for profiling only) | +| **AICB** | Workload generation parameters (`MockedParam` / `MockedModel`) | None | + +--- + +## Part 1: Vidur — Model Configuration and Profiling + +### Step 1: Add Model Configuration + +Create a YAML/JSON model config in `vidur-alibabacloud/data/model_configs/` or `vidur-alibabacloud/data/hf_configs/`: + +- Use the model's HuggingFace model ID as filename (e.g., `meta-llama/Llama-2-70b-hf.yml`) +- Reference the model's HuggingFace `config.json` for parameter values +- Ensure the correct parameters are set so the reference transformer model closely resembles the new model + +**Example config parameters:** + +```yaml +num_layers: 80 +hidden_size: 8192 +num_attention_heads: 64 +num_key_value_heads: 8 # For GQA models +head_dim: 128 +intermediate_size: 28672 +vocab_size: 128256 +max_position_embeddings: 8192 +``` + +For MoE 
models, also include: + +```yaml +num_routed_experts: 256 +num_experts_per_tok: 8 +num_shared_experts: 1 +moe_intermediate_size: 2048 +``` + +### Step 2: Profiling Data Structure + +Profiling data is stored in `vidur-alibabacloud/data/profiling/`: + +``` +profiling/ +├── compute/ +│ ├── a100/ +│ │ └── model-name/ +│ │ ├── mlp.csv +│ │ └── attention.csv +│ └── h100/ +│ └── model-name/ +│ ├── mlp.csv +│ └── attention.csv +└── network/ + ├── a100_pair_nvlink/ + │ ├── allreduce.csv + │ └── send_recv.csv + └── h100_dgx/ + ├── allreduce.csv + └── send_recv.csv +``` + +**Key distinction:** +- **Compute profiling**: Only GPU SKU matters (e.g., `a100`, `h100`), not network topology +- **Network profiling**: Network configuration matters (e.g., `a100_pair_nvlink` vs `a100_dgx`) + +### Step 3: Compute Profiling (MLP) + +Requires actual GPUs. 1 GPU is sufficient even for TP > 1. + +```bash +# Install sarathi-serve (vidur branch) for profiling +# Then run MLP profiling: +python vidur/profiling/mlp/main.py \ + --models your-model/model-name \ + --num_gpus 4 + +# Copy output to data directory: +cp profiling_outputs/mlp//your-model/model-name/mlp.csv \ + data/profiling/compute//your-model/model-name/mlp.csv +``` + +### Step 4: Compute Profiling (Attention) + +```bash +python vidur/profiling/attention/main.py \ + --models your-model/model-name \ + --num_gpus 4 + +# Copy output: +cp profiling_outputs/attention//your-model/model-name/attention.csv \ + data/profiling/compute//your-model/model-name/attention.csv +``` + +### Step 5: Network Profiling (if needed) + +Network profiling is **model-independent** — same data works for all models on the same hardware configuration. 
+ +```bash +# AllReduce profiling (for TP): +python vidur/profiling/collectives/main.py \ + --num_workers_per_node_combinations 1,2,4,8 \ + --collective all_reduce + +# Send/Recv profiling (for PP, requires multi-node): +python vidur/profiling/collectives/main.py \ + --num_workers_per_node_combinations 1,2,4,8 \ + --collective send_recv +``` + +**Available network device profiles:** +- `a100_pair_nvlink` — Azure Standard_NC96ads_A100_v4 (4x A100 PCIe + NVLink pairs) +- `h100_pair_nvlink` — Azure internal (4x H100 NVL + NVLink pairs) +- `a100_dgx` — A100 DGX (8x A100) +- `h100_dgx` — H100 DGX (8x H100) + +--- + +## Part 2: AICB — Workload Generation + +### Custom Model Parameters (MockedParam) + +To add a new model for workload generation in AICB, create a `MockedParam` subclass: + +```python +# In aicb/workload_generator/mocked_params/ +class YourModelParam(MockedParam): + def __init__(self): + super().__init__() + self.num_layers = 80 + self.hidden_size = 8192 + self.num_attention_heads = 64 + self.num_key_value_heads = 8 + self.ffn_hidden_size = 28672 + self.vocab_size = 128256 + self.seq_length = 8192 + # MoE parameters (if applicable) + self.num_experts = 256 + self.topk = 8 + self.moe_intermediate_size = 2048 +``` + +### Custom Model Workflow (MockedModel) + +For full control over the workload generation process, create a `MockedModel` subclass that defines the compute and communication operations for each layer. + +See [AICB Component Documentation](../components/aicb.md#custom-model-development) for detailed examples. 
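As an illustration of what a `MockedModel`-style layer definition boils down to, the sketch below emits a Megatron-style op sequence per transformer layer: a compute op for the attention block and the MLP block, each followed by a TP AllReduce over the activation tensor. All names here (`emit_layer_ops`, the tuple layout) are hypothetical, not actual AICB APIs:

```python
def emit_layer_ops(hidden_size: int, seq_len: int, tp_size: int, bytes_per_elem: int = 2):
    """Hypothetical sketch (not AICB's real API): per-layer op sequence
    under tensor parallelism, as (op_type, stage, comm_bytes) tuples."""
    activation_bytes = seq_len * hidden_size * bytes_per_elem
    ops = []
    for stage in ("attention", "mlp"):
        ops.append(("compute", stage, 0))
        if tp_size > 1:
            # Megatron-style TP: one AllReduce after each block's output projection.
            ops.append(("allreduce", stage, activation_bytes))
    return ops
```

A real `MockedModel` additionally interleaves DP gradient communication and PP send/recv, but the principle is the same: the workload file is just this op stream serialized per rank.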
+ +### Inference Workload Generation + +For inference workloads with prefill/decode separation: + +```bash +# Generate inference workload +python -m aicb.main \ + --model_name your-model-name \ + --workload_type inference \ + --num_prefill_tokens 1024 \ + --num_decode_tokens 128 +``` + +--- + +## Part 3: GPU Memory Module + +If your model uses a non-standard attention architecture, you may need to extend the `ParamCounter` in `vidur/utils/param_counter.py`: + +1. Add attention parameter calculation for your architecture +2. Add KV cache per-token size calculation +3. Test with the MemoryPlanner to verify OOM detection works correctly + +See [GPU Memory Module Technical Reference](../technical_reference/memory_module.md) for calculation formulas. + +--- + +## Verification Checklist + +- [ ] Model config file added to `data/model_configs/` or `data/hf_configs/` +- [ ] Compute profiling data (MLP + attention) added +- [ ] Network profiling data available for target hardware +- [ ] AICB `MockedParam` created (if workload generation needed) +- [ ] GPU memory calculation works correctly (ParamCounter + MemoryPlanner) +- [ ] End-to-end inference simulation produces reasonable results +- [ ] Documentation updated + +--- + +## Related Documentation + +- [vidur-alibabacloud Component](../components/vidur.md) — Full vidur documentation +- [AICB Component](../components/aicb.md) — AICB workload generation +- [GPU Memory Module](../technical_reference/memory_module.md) — Memory calculation formulas +- [Supported Models](../user_guide/supported_models.md) — Current model support status diff --git a/docs/en/developer_guide/architecture.md b/docs/en/developer_guide/architecture.md new file mode 100644 index 00000000..f3536339 --- /dev/null +++ b/docs/en/developer_guide/architecture.md @@ -0,0 +1,176 @@ +# System Architecture + +This document describes SimAI's modular architecture, component interactions, and data flow for both training and inference simulation. 
+ +--- + +## Project Structure + +``` +SimAI/ +├── aicb/ # AI Computation Benchmark — workload generation (Python) +│ ├── workload_generator/ # Generates training/inference workloads +│ └── aicb.py # Main entry point +├── astra-sim-alibabacloud/ # Simulation engine — core simulator (C++) +│ ├── astra-sim/ # Extended from astra-sim 1.0 +│ └── build.sh # Build script +├── ns-3-alibabacloud/ # NS-3 network simulator backend (C++) +├── vidur-alibabacloud/ # LLM inference simulation (Python) +│ ├── vidur/ # Core simulation framework +│ └── setup.py # Python package config +├── SimCCL/ # Collective communication transformation +├── docs/ # Documentation and tutorials +├── example/ # Example workloads and configurations +├── scripts/ # Build and utility scripts +├── results/ # Simulation output directory +├── bin/ # Compiled binary output +└── Dockerfile # Docker container definition +``` + +--- + +## Component Architecture + +``` + |--- AICB (Workload generation & compute profiling) +SimAI --|--- SimCCL (Collective communication algorithm analysis) + |--- astra-sim-alibabacloud (Simulation engine: Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 network backend) + |--- vidur-alibabacloud (Multi-request inference scheduling & memory management) +``` + +![SimAI Architecture](../../images/SimAI_Arc.png) + +### Component Responsibilities + +| Component | Role | Language | +|-----------|------|----------| +| **AICB** | Generates training/inference workloads, profiles compute kernels, runs physical benchmarks | Python | +| **SimCCL** | Transforms collective communication operations (AllReduce, AllGather, etc.) 
into point-to-point communication sets | Python | +| **astra-sim-alibabacloud** | Core simulation engine supporting 3 modes; manages compute/memory/network APIs | C++ | +| **ns-3-alibabacloud** | Packet-level network simulation with RDMA, datacenter topology, and CC algorithms | C++ | +| **vidur-alibabacloud** | Multi-request inference scheduling with PD disaggregation and GPU memory management | Python | + +--- + +## Three Operation Modes + +### SimAI-Analytical + +``` +AICB (workload.txt) → astra-sim (analytical) → busbw estimation → CSV results +``` + +- **Use case**: Fast performance analysis, parallel parameter sweeps +- **Components**: AICB + astra-sim-alibabacloud (analytical mode) +- **Network model**: Bus bandwidth (busbw) abstraction + +### SimAI-Simulation + +``` +AICB (workload.txt) → SimCCL (collective→P2P) → astra-sim (simulation) → NS-3 → detailed traces +``` + +- **Use case**: Full-stack network research, CC algorithm evaluation +- **Components**: AICB + SimCCL + astra-sim-alibabacloud (simulation) + ns-3-alibabacloud +- **Network model**: Packet-level NS-3 simulation + +### SimAI-Physical + +``` +AICB (workload.txt) → SimCCL (collective→P2P) → astra-sim (physical) → RDMA traffic on real NICs +``` + +- **Use case**: NIC behavior study, physical traffic analysis +- **Components**: AICB + SimCCL + astra-sim-alibabacloud (physical) +- **Network model**: Real RDMA traffic via MPI + +--- + +## Inference Simulation Data Flow + +``` +Request Generator + | Generate synthetic / real-trace requests + v +Global Scheduler + | Dispatch requests to Prefill / Decode replicas + v +Replica Scheduler + | Batch assembly and scheduling + v +Memory Management (MemoryPlanner + Replica) + | KV cache allocation and capacity checking + v +Execution Time Predictor + | AICB / SimAI Simulation / SimAI Analytical / Vidur + v +Metrics Store + | TTFT, TBT, E2E, communication / compute cost + v +Output (request_metrics.csv, chrome_trace.json, plots/) +``` + +### Key Inference 
Components + +| Component | File | Description | +|-----------|------|-------------| +| Request Generator | `vidur/request_generator/` | Generates synthetic or trace-based requests | +| Global Scheduler | `vidur/scheduler/global_scheduler/` | Dispatches requests across replicas (`lor`, `round_robin`, `split_wise`) | +| Replica Scheduler | `vidur/scheduler/replica_scheduler/` | Per-replica batch scheduling (`sarathi`, `split_wise`) | +| MemoryPlanner | `vidur/scheduler/utils/memory_planner.py` | GPU memory budget computation | +| ParamCounter | `vidur/utils/param_counter.py` | Model parameter counting (MLA/MHA/GQA/linear/MoE) | +| Execution Predictor | `vidur/execution_time_predictor/` | Execution time estimation via multiple backends | +| Metrics Store | `vidur/metrics/` | Collects and exports 23 simulation metrics | + +--- + +## Submodule Structure + +SimAI uses Git submodules for its core components: + +| Submodule | Repository | Branch | +|-----------|------------|--------| +| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | master | +| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | master | +| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | master / dev/qp | +| `astra-sim-alibabacloud` | In-tree | — | +| `vidur-alibabacloud` | In-tree | — | + +**Key rules:** +1. Submodules have independent Git histories +2. The parent repo only tracks the commit hash of each submodule +3. 
Always initialize after cloning: `git submodule update --init --recursive` + +--- + +## Build System + +### Build Scripts + +```bash +# Analytical mode (fast, busbw-based) +bash scripts/build.sh -c analytical + +# NS-3 simulation mode (full-stack) +bash scripts/build.sh -c ns3 + +# Physical mode (beta, RDMA) +bash scripts/build.sh -c phy +``` + +### Build Outputs + +| Mode | Binary | Location | +|------|--------|----------| +| Analytical | `SimAI_analytical` | `bin/` | +| Simulation | `SimAI_simulator` | `bin/` | +| Physical | `SimAI_physical` | `bin/` | + +--- + +## Related Documentation + +- [Components Overview](../components/index.md) — Detailed documentation for each component +- [Contributing Guide](contributing.md) — How to contribute code +- [Configuration Reference](../technical_reference/configuration.md) — Configuration files and parameters diff --git a/docs/en/developer_guide/contributing.md b/docs/en/developer_guide/contributing.md new file mode 100644 index 00000000..e08b5048 --- /dev/null +++ b/docs/en/developer_guide/contributing.md @@ -0,0 +1,224 @@ +# Contributing to SimAI + +Thank you for your interest in contributing to SimAI! This guide covers the complete development workflow. + +> **Full version**: See [CONTRIBUTING.md](../../../CONTRIBUTING.md) in the project root for the comprehensive guide. + +--- + +## Ways to Contribute + +1. **New features** — Add model support, parallelism strategies, scheduling policies +2. **Bug fixes** — Fix simulation inaccuracies, crashes, or incorrect results +3. **Performance optimization** — Improve simulation speed, memory usage, or scalability +4. **Documentation** — Improve tutorials, add examples, fix errors +5. **Benchmarks & validation** — Add validation against real hardware results +6. 
**Issue reports** — Report bugs, request features, or share feedback
+
+---
+
+## Development Workflow
+
+### Step 1: Fork and Clone
+
+```bash
+git clone --recurse-submodules https://github.com/YOUR_USERNAME/SimAI.git
+cd SimAI
+git remote add upstream https://github.com/aliyun/SimAI.git
+```
+
+### Step 2: Create a Feature Branch
+
+```bash
+git fetch upstream
+git checkout -b feature/your-feature-name upstream/master
+
+# Branch naming conventions:
+# feature/xxx — New features
+# fix/xxx — Bug fixes
+# docs/xxx — Documentation
+# perf/xxx — Performance improvements
+# refactor/xxx — Code refactoring
+```
+
+### Step 3: Develop and Test
+
+```bash
+# For C++ changes, rebuild:
+bash scripts/build.sh -c analytical # or ns3
+
+# For Python changes:
+python -c "from aicb import ..."
+```
+
+### Step 4: Commit
+
+```bash
+git add -A
+git commit -m "feat(aicb): add Llama-4 model workload generation"
+```
+
+### Step 5: Push and Create PR
+
+```bash
+git push origin feature/your-feature-name
+# Then create a Pull Request on GitHub
+```
+
+---
+
+## Commit Message Convention
+
+Use [Conventional Commits](https://www.conventionalcommits.org/) format:
+
+```
+<type>(<scope>): <subject>
+```
+
+### Types
+
+| Type | Description |
+|------|-------------|
+| `feat` | New feature |
+| `fix` | Bug fix |
+| `docs` | Documentation only |
+| `refactor` | Code refactoring |
+| `test` | Adding or updating tests |
+| `perf` | Performance improvement |
+| `chore` | Build process, tooling |
+
+### Scopes
+
+`aicb`, `vidur`, `astra-sim`, `ns3`, `simccl`, `docs`, `docker`, `scripts`
+
+### Examples
+
+```
+feat(aicb): add DeepSeek-V3 inference workload generation
+fix(astra-sim): correct AllReduce latency calculation for ring algorithm
+docs: update build instructions for NS-3 mode
+perf(vidur): reduce memory allocation in request scheduler
+```
+
+---
+
+## Code Style
+
+### Python
+
+- **Formatter**: [black](https://github.com/psf/black) (default settings)
+- **Import sorting**: 
[isort](https://pycqa.github.io/isort/)
+- **Linter**: [flake8](https://flake8.pycqa.org/)
+- **Max line length**: 120 characters
+
+```bash
+black --line-length 120 your_file.py
+isort your_file.py
+flake8 your_file.py --max-line-length 120
+```
+
+### C++
+
+- Follow existing code style in `astra-sim-alibabacloud/`
+- 4-space indentation
+- `snake_case` for functions and variables
+- Comments for non-trivial logic
+
+### General Rules
+
+- Write comments in **English**
+- All new functions/classes should have docstrings or header comments
+- Avoid hardcoded paths; use relative paths or configuration variables
+- One feature/fix per PR
+
+---
+
+## Working with Submodules
+
+SimAI uses Git submodules. Key points:
+
+| Submodule | Repository | Language |
+|-----------|------------|----------|
+| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | Python |
+| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | Python |
+| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | C++ |
+| `astra-sim-alibabacloud` | In-tree | C++ |
+| `vidur-alibabacloud` | In-tree | Python |
+
+### Cross-Submodule Changes
+
+If your contribution spans multiple submodules:
+
+1. Make and commit changes in each submodule separately
+2. Update the parent repo to point to new submodule commits
+3. Create separate PRs for each submodule with independent remotes
+4. Reference related PRs in descriptions
+
+---
+
+## Pull Request Guidelines
+
+### PR Title
+
+Use the same format as commit messages: `<type>(<scope>): <subject>`
+
+### PR Checklist
+
+- [ ] Code compiles without errors
+- [ ] Existing simulations produce unchanged results (no precision regression)
+- [ ] New code has appropriate comments
+- [ ] Tests added for new functionality
+- [ ] Documentation updated if needed
+
+---
+
+## Pre-Submission Quality Checklist
+
+```bash
+# 1. C++ compilation check
+bash scripts/build.sh -c analytical
+bash scripts/build.sh -c ns3
+
+# 2. 
Python lint check +black --check --line-length 120 your_changed_files.py +flake8 your_changed_files.py --max-line-length 120 + +# 3. Basic simulation test +cd bin && ./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml + +# 4. Submodule state check +git submodule status +``` + +--- + +## Acceptance Criteria + +| Criterion | Requirement | +|-----------|-------------| +| **Build** | Compiles without errors | +| **Precision** | No existing simulation accuracy degradation | +| **Tests** | Key code paths are covered | +| **Documentation** | New features have comments/doc updates | +| **Style** | Follows code style guidelines | +| **Scope** | Changes are focused and well-explained | + +--- + +## Review Timeline + +1. **Initial review**: 3-5 business days +2. **Feedback**: Constructive comments with actionable suggestions +3. **Iteration**: Address feedback and update PR +4. **Merge**: Approved PRs merged to main branch + +--- + +## Getting Help + +- **Issues**: [GitHub Issues](https://github.com/aliyun/SimAI/issues) +- **Discussions**: Open an issue with "Question:" prefix +- **Documentation**: See [Tutorial](../../../docs/Tutorial.md) for usage guides diff --git a/docs/en/developer_guide/extending_ns3.md b/docs/en/developer_guide/extending_ns3.md new file mode 100644 index 00000000..484465f3 --- /dev/null +++ b/docs/en/developer_guide/extending_ns3.md @@ -0,0 +1,219 @@ +# Extending the NS-3 Network Backend + +This guide covers how to extend `ns-3-alibabacloud` with new congestion control algorithms, switch behaviors, control packets, and NVSwitch features. + +> **Source reference**: See `astra-sim-alibabacloud/extern/network_backend/ns3-interface/README.md` for the detailed module map. 
+ +--- + +## Module Overview + +All key source files are located in `ns-3-alibabacloud/simulation/src/point-to-point/model/`: + +| File | Class | Purpose | +|------|-------|---------| +| `qbb-net-device.{h,cc}` | `QbbNetDevice`, `RdmaEgressQueue` | QBB-capable NIC with 8 priority queues, PFC handling, NVSwitch send path | +| `rdma-hw.{h,cc}` | `RdmaHw` | Host RDMA core: QP management, packet construction, ACK/NACK, CC algorithms | +| `rdma-queue-pair.{h,cc}` | `RdmaQueuePair`, `RdmaRxQueuePair` | Per-QP state (window, rate, CC-specific state) | +| `switch-node.{h,cc}` | `SwitchNode` | Switch pipeline: ECMP forwarding, ECN marking, PFC, INT/PINT injection | +| `switch-mmu.{h,cc}` | `SwitchMmu` | Switch buffer/MMU: ingress/egress accounting, PFC thresholds, ECN curves | +| `nvswitch-node.{h,cc}` | `NVSwitchNode` | NVSwitch model for intra-server GPU communication | +| `rdma-driver.{h,cc}` | `RdmaDriver` | Wiring layer between Node/NICs and RdmaHw | +| `qbb-header.{h,cc}` | — | ACK/NACK header (PG/seq/CNP-flag + INT header) | +| `cn-header.{h,cc}` | — | CNP header (feedback fields) | +| `pause-header.{h,cc}` | — | PFC pause header | +| `pint.{h,cc}` | — | PINT encode/decode utilities | +| `trace-format.h` | `TraceFormat` | Binary trace record structure for offline analysis | + +--- + +## Adding a New Congestion Control Algorithm + +The NS-3 backend supports 5 built-in CC algorithms: **DCQCN**, **HPCC**, **TIMELY**, **DCTCP**, and **HPCC-PINT**. To add a new CC algorithm: + +### Step 1: Define CC Mode + +Add a new `CcMode` value in `rdma-hw.h`: + +```cpp +// Existing modes: 1=DCQCN, 3=HPCC, 7=TIMELY, 8=DCTCP, 10=HPCC-PINT +static const uint32_t CC_MODE_YOUR_ALG = 11; +``` + +### Step 2: Add Per-QP State (if needed) + +In `rdma-queue-pair.h`, add new state variables to `RdmaQueuePair`: + +```cpp +// Your CC algorithm state +double m_your_alg_rate; +double m_your_alg_alpha; +// ... 
+``` + +### Step 3: Implement Algorithm Logic + +In `rdma-hw.cc`, add two key functions: + +```cpp +void RdmaHw::HandleAckYourAlg(Ptr<RdmaQueuePair> qp, ...) { + // Process ACK and update rate/window +} + +void RdmaHw::UpdateRateYourAlg(Ptr<RdmaQueuePair> qp, ...) { + // Rate update logic +} +``` + +### Step 4: Register Dispatch + +Add dispatch cases in `ReceiveAck()` and/or `ReceiveCnp()` in `rdma-hw.cc`: + +```cpp +switch (m_cc_mode) { + // ... existing cases ... + case CC_MODE_YOUR_ALG: + HandleAckYourAlg(qp, ...); + break; +} +``` + +### Step 5: Add Switch Feedback (if needed) + +If your CC algorithm requires switch-side information (like INT/PINT metadata): + +- Modify `switch-node.cc::SwitchNotifyDequeue()` to inject your metadata +- Add header parsing in `RdmaHw::Receive()` or `QbbNetDevice::Receive()` + +--- + +## Modifying Switch Behavior + +### Buffer Management / PFC Thresholds + +**Primary file**: `switch-mmu.{h,cc}` + +Key methods to modify: + +| Method | Purpose | +|--------|---------| +| `ConfigBufferSize()` | Total buffer pool size | +| `ConfigHdrm()` | Headroom allocation | +| `ConfigEcn()` | ECN marking thresholds (`kmin`, `kmax`, `pmax`) | +| `CheckIngressAdmission()` | Ingress admission control | +| `CheckEgressAdmission()` | Egress admission control | +| `GetPfcThreshold()` | PFC trigger threshold formula | + +### ECN Marking / INT Injection + +**File**: `switch-node.cc` + +Modify `SwitchNotifyDequeue()` for: +- ECN marking based on custom queue occupancy formulas +- INT/PINT metadata injection for advanced CC algorithms +- Custom packet tagging + +### Forwarding / ECMP + +**File**: `switch-node.cc` + +Modify for routing changes: +- `GetOutDev()` — Output port selection +- `EcmpHash()` — ECMP hash function (currently 5-tuple) +- `AddTableEntry()` — Routing table management + +--- + +## Introducing New Control Packets + +### Step 1: Create Header + +Create new header files in `model/` following the pattern of `CnHeader` or `PauseHeader`: + +```cpp +// your-header.h +class 
YourHeader : public Header { +public: + static TypeId GetTypeId(); + // Serialize/Deserialize methods + uint32_t GetSerializedSize() const override; + void Serialize(Buffer::Iterator start) const override; + uint32_t Deserialize(Buffer::Iterator start) override; + + // Your header fields + uint32_t m_your_field; +}; +``` + +### Step 2: Define Protocol Number + +Add a new protocol number (following existing conventions): + +```cpp +// Existing protocol numbers (IPv4 Protocol field): +// UDP data: 0x11 +// CNP: 0xFF +// PFC: 0xFE +// ACK: 0xFC +// NACK: 0xFD +// Your new: 0xFB (example) +``` + +### Step 3: Add Parsing/Dispatch + +Add packet handling in: +- `QbbNetDevice::Receive()` — Device-level parsing +- `RdmaHw::Receive()` — Host stack processing + +--- + +## NVSwitch / NVLS Extensions + +**Files**: `nvswitch-node.{h,cc}`, `qbb-net-device.{h,cc}` (NVLS send path), `rdma-hw.{h,cc}` (NVLS routing) + +The `NVSwitchNode` models intra-server GPU communication via NVSwitch. To extend: + +- **Forwarding**: Similar to `SwitchNode` but without ECN/INT injection +- **NVLS routing**: Modify `RdmaHw::GetNicIdxOfQp()` and `GetNicIdxOfRxQp()` for NVSwitch routing tables +- **QP redistribution**: `RdmaHw::RedistributeQp()` for load balancing across NVSwitch links + +--- + +## Analysis Tools + +The `ns-3-alibabacloud/analysis/` directory contains trace analysis tools: + +| Tool | Purpose | +|------|---------| +| FCT Analysis | Flow Completion Time analysis from simulation traces | +| Trace Reader | Parse binary `TraceFormat` records | +| Bandwidth Analysis | Per-link bandwidth utilization over time | +| Queue Analysis | Queue occupancy and PFC event analysis | +| QP Analysis | Per-QP performance metrics | + +### Trace Format + +The binary trace record structure (`trace-format.h`) captures per-packet events. Use the offline analysis tools to: + +1. Parse trace files from simulation output +2. Compute FCT, throughput, queue depth statistics +3. 
Identify congestion hotspots and PFC events + +--- + +## dev/qp Branch Enhancements + +The [dev/qp](https://github.com/aliyun/ns-3-alibabacloud/tree/dev/qp) branch includes: + +1. **QP Logic Support** — QP creation/destruction based on actual RDMA logic +2. **NIC CC Configuration** — PerIP or perQP CC settings +3. **Optimized Scheduling** — Max-Min principle for fair resource allocation +4. **Decoupled CC Module** — Improved modularity + +--- + +## Related Documentation + +- [NS-3 Component](../components/ns3.md) — Full NS-3 backend documentation +- [SimAI-Simulation User Guide](../user_guide/simai_simulation.md) — Using NS-3 simulation mode +- [Configuration Reference](../technical_reference/configuration.md) — Topology and configuration files diff --git a/docs/en/developer_guide/index.md b/docs/en/developer_guide/index.md new file mode 100644 index 00000000..ddb9ccbe --- /dev/null +++ b/docs/en/developer_guide/index.md @@ -0,0 +1,25 @@ +# Developer Guide + +Welcome to the SimAI Developer Guide. This section covers the project architecture, contribution workflow, and guides for extending SimAI with new models and network features. + +--- + +## Contents + +| Document | Description | +|----------|-------------| +| [Architecture](architecture.md) | System architecture, data flow, and module interaction | +| [Contributing](contributing.md) | Development workflow, code style, PR guidelines | +| [Adding Models](adding_models.md) | Guide to adding new model support (Vidur profiling + AICB workload) | +| [Extending NS-3](extending_ns3.md) | Guide to extending the NS-3 network backend | + +--- + +## Prerequisites + +- **Python** 3.8+ (3.12 recommended with Docker) +- **CMake** 3.16+ +- **GCC/G++** 9.4+ +- **Git** with submodule support + +See [Installation Guide](../getting_started/installation.md) for setup instructions. 
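The prerequisites above can be checked with a short script. A minimal, illustrative sketch (not part of the repository; it assumes the tools are invoked by these names on `PATH`):

```python
# Illustrative sketch: report which developer-guide prerequisites are missing.
import shutil

REQUIRED_TOOLS = ("python3", "cmake", "g++", "git")

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    gaps = missing_tools()
    print("all prerequisites found" if not gaps else "missing: " + ", ".join(gaps))
```

Note that this only confirms the tools exist; version requirements (Python 3.8+, CMake 3.16+, GCC/G++ 9.4+) still need to be checked with `--version`.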
diff --git a/docs/en/getting_started/index.md b/docs/en/getting_started/index.md new file mode 100644 index 00000000..46d39f68 --- /dev/null +++ b/docs/en/getting_started/index.md @@ -0,0 +1,25 @@ +# Getting Started + +This section helps you set up SimAI and run your first simulation. + +## Contents + +| Page | Description | +|------|-------------| +| [Installation](installation.md) | How to install SimAI via Docker or from source | +| [Quickstart](quickstart.md) | Run your first SimAI-Analytical, SimAI-Simulation, and Inference simulation | + +## Prerequisites + +- **Python** 3.8+ (3.12 recommended with Docker) +- **CMake** 3.16+ +- **GCC/G++** 9.4+ +- **Git** with submodule support + +For workload generation with AIOB (computation profiling), an **NVIDIA Hopper (SM90)** or **Blackwell (SM100)** GPU is required. + +## Next Steps + +1. Follow the [Installation Guide](installation.md) to set up your environment +2. Run the [Quickstart](quickstart.md) examples to verify your setup +3. Explore the [User Guide](../user_guide/index.md) for detailed usage instructions diff --git a/docs/en/getting_started/installation.md b/docs/en/getting_started/installation.md new file mode 100644 index 00000000..778138b6 --- /dev/null +++ b/docs/en/getting_started/installation.md @@ -0,0 +1,96 @@ +# Installation + +This guide covers how to install SimAI and its dependencies. + +## Option A: Docker (Recommended) + +```bash +# Build the Docker image +docker build -t simai:latest . + +# Run a container with GPU support +docker run --gpus all -it --rm \ + -v $(pwd)/results:/workspace/SimAI/results \ + simai:latest /bin/bash +``` + +> **Note:** If using Hopper GPUs, add `ENV FLASH_MLA_DISABLE_SM100=1` to the Dockerfile. + +## Option B: Build from Source + +The following instructions have been tested on GCC/G++ 9.4.0, Python 3.8.10, Ubuntu 20.04. + +> **Important:** Ninja must not be present when compiling SimAI-Simulation. It comes pre-installed in NGC images, so remove it first: 
+> ```bash +> apt remove ninja-build && pip uninstall ninja +> ``` + +### Step 1: Clone the Repository + +```bash +git clone https://github.com/aliyun/SimAI.git +cd ./SimAI/ + +# Initialize submodules +git submodule update --init --recursive +# Update to latest commits +git submodule update --remote +``` + +### Step 2: Compile C++ Components + +Choose the mode(s) you need: + +```bash +# SimAI-Analytical (fast, abstracts network details) +./scripts/build.sh -c analytical + +# SimAI-Simulation (full-stack with NS-3 network backend) +./scripts/build.sh -c ns3 + +# SimAI-Physical (beta, requires RDMA environment) +sudo yum install openmpi openmpi-devel +export MPI_INCLUDE_PATH=/usr/include/openmpi-x86_64/ +export MPI_BIN_PATH=/usr/lib64/openmpi/bin/mpic++ +./scripts/build.sh -c phy +``` + +### Step 3: Install Python Dependencies + +```bash +pip install -r aicb/requirements.txt +pip install -r vidur-alibabacloud/requirements.txt +``` + +### Step 4: Verify the Build + +```bash +ls bin/ # Should contain SimAI_analytical and/or SimAI_simulator +``` + +## Option C: Conda Environment (for Inference Simulation) + +```bash +cd vidur-alibabacloud +conda env create -p ./env -f ./environment.yml +conda activate vidur +pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ +``` + +## NGC Container (for Workload Generation) + +For generating workloads with computation profiling (AIOB), we recommend using NGC container images directly: + +```bash +docker pull nvcr.io/nvidia/pytorch:xx.xx-py3 +docker run --gpus all -it --rm \ + -v /path/to/SimAI:/workspace/SimAI \ + nvcr.io/nvidia/pytorch:xx.xx-py3 +``` + +> **Note:** Use PyTorch >= 23.08 NGC images. 
+ +## What's Next + +- [Quickstart Guide](quickstart.md) — Run your first simulation +- [User Guide](../user_guide/index.md) — Detailed usage for each mode diff --git a/docs/en/getting_started/quickstart.md b/docs/en/getting_started/quickstart.md new file mode 100644 index 00000000..985ee2bc --- /dev/null +++ b/docs/en/getting_started/quickstart.md @@ -0,0 +1,85 @@ +# Quickstart + +This guide walks you through running your first simulation with SimAI. + +## 1. SimAI-Analytical + +The fastest way to get started. Abstracts network details using bus bandwidth (busbw). + +```bash +# Run analytical simulation +./bin/SimAI_analytical \ + -w example/workload_analytical.txt \ + -g 9216 \ + -g_p_s 8 \ + -r test- \ + -busbw example/busbw.yaml +``` + +For automatic bus bandwidth calculation: + +```bash +./bin/SimAI_analytical \ + -w ./example/workload_analytical.txt \ + -g 9216 -nv 360 -nic 48.5 \ + -n_p_s 8 -g_p_s 8 -r example- +``` + +For detailed parameter descriptions, see [SimAI-Analytical User Guide](../user_guide/simai_analytical.md). + +## 2. SimAI-Simulation + +Full-stack simulation with NS-3 network backend. + +```bash +# Step 1: Create network topology +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +# Step 2: Run simulation +AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \ + -t 16 \ + -w ./example/microAllReduce.txt \ + -n ./Spectrum-X_128g_8gps_100Gbps_A100 \ + -c astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +For detailed parameter descriptions, see [SimAI-Simulation User Guide](../user_guide/simai_simulation.md). + +## 3. Multi-Request Inference Simulation + +End-to-end inference simulation using the Vidur framework. 
+ +### Prerequisites + +```bash +# Activate the vidur conda environment +conda activate vidur +``` + +### Run the 4-Scenario Test Suite + +```bash +# Run all 4 scenarios +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all + +# Or run a single scenario +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +### Scenarios Overview + +| Scenario | Model | PD Separation | World Size | TP | EP | Scheduler | +|----------|-------|---------------|-----------|----|----|-----------| +| 1 | Qwen3-Next-80B | No | 32 | 1 | 1 | lor | +| 2 | Qwen3-Next-80B | Yes (P=2, D=6) | 8 | 1 | 1 | split_wise | +| 3 | DeepSeek-671B | Yes (P=2, D=6) | 8 | 8 | 8 | split_wise | +| 4 | Qwen3-MoE-235B | Yes (P=2, D=6) | 8 | 4 | 4 | split_wise | + +For detailed information, see [Inference Simulation User Guide](../user_guide/inference_simulation.md). + +## What's Next + +- [User Guide](../user_guide/index.md) — Deep dive into each simulation mode +- [Components](../components/index.md) — Learn about each submodule +- [Benchmarking](../benchmarking/index.md) — Run the full test suite diff --git a/docs/en/index.md b/docs/en/index.md new file mode 100644 index 00000000..6953fc17 --- /dev/null +++ b/docs/en/index.md @@ -0,0 +1,73 @@ +# Welcome to SimAI Documentation + +

+ 中文  |  English +

+ +[![License](https://img.shields.io/badge/license-MIT-green.svg)](../../LICENSE) +[![NSDI'25](https://img.shields.io/badge/NSDI'25-SimAI-blue.svg)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + +**SimAI** is the industry's first full-stack, high-precision **Sim**ulator for **AI** large-scale **inference** and **training**, open-sourced by Alibaba Cloud. It provides detailed modeling and simulation of the entire LLM training and inference process, encompassing the framework layer, collective communication layer, and network transport layer, delivering end-to-end performance data. + +SimAI enables researchers to: + +- Analyze inference/training process details +- Evaluate the time consumption of AI tasks under specific conditions +- Evaluate E2E performance gains from various algorithmic optimizations (framework parameters, collective communication algorithms, network protocols, congestion control, routing, topology, etc.) + +--- + +## Documentation Overview + +| Section | Description | +|---------|-------------| +| [Getting Started](getting_started/index.md) | Installation, environment setup, and quickstart guide | +| [User Guide](user_guide/index.md) | Detailed usage for SimAI-Analytical, SimAI-Simulation, SimAI-Physical, and Inference Simulation | +| [Components](components/index.md) | In-depth documentation for each submodule: AICB, SimCCL, astra-sim, ns-3, vidur | +| [Technical Reference](technical_reference/index.md) | GPU memory module, CLI parameters, and configuration reference | +| [Benchmarking](benchmarking/index.md) | 4-scenario end-to-end test suite and benchmark results | +| [Developer Guide](developer_guide/index.md) | Architecture, contributing guide, adding models, and extending NS-3 | +| [Community](community/index.md) | Events, contact information, and citation | + +--- + +## Architecture + +``` + |--- AICB (Workload generation & compute profiling) +SimAI --|--- SimCCL (Collective communication algorithm analysis) + |--- 
astra-sim-alibabacloud (Simulation engine: Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 network backend) + |--- vidur-alibabacloud (Multi-request inference scheduling & memory management) +``` + +![SimAI Architecture](../images/SimAI_Arc.png) + +--- + +## Three Operation Modes + +| Mode | Description | Use Cases | +|------|-------------|-----------| +| **SimAI-Analytical** | Fast simulation using bus bandwidth (busbw) to estimate collective communication time | Performance analysis, parallel parameter optimization, scale-up exploration | +| **SimAI-Simulation** | Full-stack simulation with NS-3 network backend for fine-grained network modeling | CC algorithm research, network protocol evaluation, novel architecture design | +| **SimAI-Physical** *(Beta)* | Physical traffic generation on CPU RDMA clusters | NIC behavior study during LLM training | + +--- + +## Supported Models + +- **DeepSeek-V3-671B** — MLA attention, 256 routed experts +- **Qwen3-MoE-235B** — MHA/GQA, 128 routed experts +- **Qwen3-Next-80B** — Hybrid full + linear attention, 512 routed experts +- **Meta-Llama-3-8B / 70B**, **Llama-2-7b / 70b**, **CodeLlama-34b**, **InternLM-20B**, **Qwen-72B** + +--- + +## Quick Links + +- [GitHub Repository](https://github.com/aliyun/SimAI) +- [NSDI'25 Paper (PDF)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) +- [Slides](../../docs/SimAI_Intro_Online.pdf) +- [Technical Report (1.6)](../SimAI_1.6_Tech_Report.md) +- [Contributing Guide](../../CONTRIBUTING.md) diff --git a/docs/en/technical_reference/cli_reference.md b/docs/en/technical_reference/cli_reference.md new file mode 100644 index 00000000..689d2058 --- /dev/null +++ b/docs/en/technical_reference/cli_reference.md @@ -0,0 +1,185 @@ +# CLI Reference + +Complete command-line parameter reference for all SimAI tools. 
+ +--- + +## SimAI-Analytical + +**Binary**: `bin/SimAI_analytical` + +### Required Parameters + +| Flag | Long Form | Description | +|------|-----------|-------------| +| `-w` | `--workload` | Path to workload file | +| `-g` | `--gpus` | Simulation GPU scale | +| `-g_p_s` | `--gpus-per-server` | Scale-up size (GPUs per server) | +| `-r` | `--result` | Output file path and prefix (default: `./results/`) | +| `-busbw` | `--bus-bandwidth` | Path to busbw.yaml file | + +### Optional Parameters + +| Flag | Long Form | Description | +|------|-----------|-------------| +| `-v` | `--visual` | Generate visualization files | +| `-dp_o` | `--dp-overlap-ratio` | DP overlap ratio [0.0-1.0] | +| `-ep_o` | `--ep-overlap-ratio` | EP overlap ratio [0.0-1.0] | +| `-tp_o` | `--tp-overlap-ratio` | TP overlap ratio [0.0-1.0] | +| `-pp_o` | `--pp-overlap-ratio` | PP overlap ratio [0.0-1.0] | + +### Auto Busbw Calculation + +| Flag | Description | +|------|-------------| +| `-nv` | NVLink bandwidth (GB/s) | +| `-nic` | NIC bandwidth (GB/s) | +| `-n_p_s` | NICs per server | + +--- + +## SimAI-Simulation + +**Binary**: `bin/SimAI_simulator` + +### Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `AS_LOG_LEVEL` | Log level: DEBUG/INFO/WARNING/ERROR | `INFO` | +| `AS_PXN_ENABLE` | Enable PXN | `0` | +| `AS_NVLS_ENABLE` | Enable NVLS | `0` | +| `AS_SEND_LAT` | Send latency (us) | `6` | +| `AS_NVLSTREE_ENABLE` | Enable NVLS Tree | `false` | + +### Parameters + +| Flag | Long Form | Description | Default | +|------|-----------|-------------|---------| +| `-t` | `--thread` | Number of threads | `1` | +| `-w` | `--workload` | Path to workload | Required | +| `-n` | `--network-topo` | Topology file path | Required | +| `-c` | `--config` | SimAI.conf path | Required | + +--- + +## SimAI-Physical + +**Binary**: `bin/SimAI_phynet` + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `hostlist` | Path to host IP list 
| Required | +| `-w` / `--workload` | Workload file path | `./microAllReduce.txt` | +| `-i` / `--gid_index` | GID index for RDMA | `0` | +| `-g` / `--gpus` | Number of GPUs | `8` | + +--- + +## Topology Generator + +**Script**: `astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py` + +| Level | Flag | Description | +|-------|------|-------------| +| Global | `-topo` | Template: Spectrum-X / AlibabaHPN / DCN+ | +| | `-g` | Number of GPUs | +| | `--dp` | Enable dual plane | +| | `--ro` | Enable rail-optimized | +| | `--dt` | Enable dual ToR | +| | `-er` | Error rate | +| Intra-Host | `-gps` | GPUs per server | +| | `-gt` | GPU type (A100/H100) | +| | `-nsps` | NV switches per server | +| | `-nvbw` | NVLink bandwidth | +| | `-nl` | NVLink latency | +| | `-l` | NIC latency | +| Intra-Segment | `-bw` | NIC to ASW bandwidth | +| | `-asw` | ASW switch count | +| | `-nps` | NICs per switch | +| Intra-Pod | `-psn` | PSW switch count | +| | `-apbw` | ASW to PSW bandwidth | +| | `-app` | ASW per PSW | + +--- + +## AICB Workload Generator + +**Script**: `scripts/megatron_workload_with_aiob.sh` or `python -m workload_generator.SimAI_training_workload_generator` + +### Core Parameters + +| Parameter | Description | +|-----------|-------------| +| `--frame` | Framework: Megatron / DeepSpeed / DeepSeek | +| `-m` / `--model_size` | Model size: 7/13/22/175/moe/deepseek | +| `--world_size` | Total GPU count | +| `--global_batch` | Total batch size | +| `--micro_batch` | Micro-batch size | +| `--seq_length` | Sequence length | +| `--epoch_num` | Number of iterations | + +### Parallelism Parameters + +| Parameter | Description | +|-----------|-------------| +| `--tensor_model_parallel_size` | TP degree | +| `--pipeline_model_parallel` | PP degree | +| `--expert_model_parallel_size` | EP degree | +| `--enable_sequence_parallel` | Enable SP | + +### Model Parameters + +| Parameter | Description | +|-----------|-------------| +| `--num_layers` | Transformer layers | +| `--hidden_size` 
| Hidden size | +| `--num_attention_heads` | Attention heads | +| `--ffn_hidden_size` | FFN hidden size | +| `--vocab_size` | Vocabulary size | + +### MoE Parameters + +| Parameter | Description | +|-----------|-------------| +| `--moe_enable` | Enable MoE | +| `--num_experts` | Number of experts | +| `--moe_router_topk` | Experts per token | +| `--moe_grouped_gemm` | Enable grouped GEMM | + +### DeepSeek Parameters + +| Parameter | Description | +|-----------|-------------| +| `--qk_rope_dim` | RoPE dimension for QK | +| `--qk_nope_dim` | Non-RoPE dimension for QK | +| `--q_lora_rank` | Q LoRA rank | +| `--kv_lora_rank` | KV LoRA rank | +| `--v_head_dim` | V head dimension | +| `--n_shared_expert` | Shared experts per MoE layer | +| `--n_dense_layer` | Dense layers count | + +### Optimization Parameters + +| Parameter | Description | +|-----------|-------------| +| `--use_flash_attn` | FlashAttention | +| `--swiglu` | SwiGLU activation | +| `--aiob_enable` | AIOB computation profiling | +| `--comp_filepath` | Pre-computed times file | + +--- + +## Vidur Inference Simulation + +**Command**: `python -m vidur.main` + +Run `python -m vidur.main -h` for the full parameter list. Key parameters are documented in the [vidur component page](../components/vidur.md). + +--- + +## See Also + +- [Configuration Reference](configuration.md) — Config file formats +- [SimAI-Analytical Guide](../user_guide/simai_analytical.md) — Usage examples +- [AICB Component](../components/aicb.md) — Full parameter details diff --git a/docs/en/technical_reference/configuration.md b/docs/en/technical_reference/configuration.md new file mode 100644 index 00000000..7ad67fb6 --- /dev/null +++ b/docs/en/technical_reference/configuration.md @@ -0,0 +1,163 @@ +# Configuration Reference + +This document covers all configuration files used by SimAI. 
+ +--- + +## SimAI.conf + +**Path**: `astra-sim-alibabacloud/inputs/config/SimAI.conf` + +The main simulation configuration file used by both SimAI-Analytical and SimAI-Simulation modes. It controls communication algorithms, buffer sizes, and timing parameters. + +--- + +## busbw.yaml + +**Path**: `example/busbw.yaml` + +Used by SimAI-Analytical to specify bus bandwidth for different communication groups and collective operations. + +### Format + +```yaml +test +TP: + allreduce,: 300 # AllReduce busbw 300GB/s in TP group + allgather,: 280 + reducescatter,: 280 + alltoall,: 230 +DP: + allreduce,: null # null = not used in this group + allgather,: 380 + reducescatter,: 380 + alltoall,: null +EP: + allreduce,: null + allgather,: 45 + reducescatter,: 45 + alltoall,: 80 +``` + +### Communication Groups + +| Group | Description | +|-------|-------------| +| `TP` | Tensor Parallelism — intra-server NVLink communication | +| `DP` | Data Parallelism — inter-server RDMA communication | +| `EP` | Expert Parallelism — MoE expert communication | + +### Collective Operations + +| Operation | Description | +|-----------|-------------| +| `allreduce` | Reduce + broadcast across all ranks | +| `allgather` | Gather data from all ranks | +| `reducescatter` | Reduce and scatter | +| `alltoall` | All-to-all personalized exchange | + +Set value to `null` for operations not used in a particular group. + +--- + +## Topology Files + +Generated by `gen_Topo_Template.py`, topology files define the network structure for SimAI-Simulation. + +### Generation + +```bash +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps +``` + +The output file is named based on parameters, e.g., `Spectrum-X_128g_8gps_100Gbps_A100`. 
+ +### Template Defaults + +| Template | GPUs | Topo | Bandwidth | GPU Type | +|----------|------|------|-----------|----------| +| Spectrum-X | 4096 | Rail-optimized, single ToR | 400Gbps | H100 | +| AlibabaHPN (single) | 15360 | Rail-optimized, dual ToR | 200Gbps | H100 | +| AlibabaHPN (dual) | 15360 | Rail-optimized, dual ToR, dual plane | 200Gbps | H100 | +| DCN+ (single ToR) | 512 | Non rail-optimized | 400Gbps | A100 | +| DCN+ (dual ToR) | 512 | Non rail-optimized, dual ToR | 200Gbps | H100 | + +--- + +## Model Configuration Files + +### Inference Model Configs + +Located in `vidur-alibabacloud/data/hf_configs/`: + +| Model | Config File | +|-------|------------| +| DeepSeek-V3-671B | `deepseek_v3_config.json` | +| Qwen3-MoE-235B | `qwen3_moe_config.json` | +| Qwen3-Next-80B | `qwen3-next-80B-A3B_config.json` | + +These files follow the HuggingFace `config.json` format and define model architecture parameters. + +### Profiling Data + +Located in `vidur-alibabacloud/data/profiling/`: + +``` +profiling/ +├── compute/ +│ ├── a100/ +│ │ └── / +│ │ ├── mlp.csv +│ │ └── attention.csv +│ └── h100/ +│ └── / +│ ├── mlp.csv +│ └── attention.csv +└── network/ + ├── a100_pair_nvlink/ + │ ├── allreduce.csv + │ └── send_recv.csv + └── h100_dgx/ + ├── allreduce.csv + └── send_recv.csv +``` + +- **Compute profiling**: GPU-type dependent (a100, h100) +- **Network profiling**: Network-configuration dependent (pair_nvlink, dgx) + +--- + +## Workload Files + +### Training Workload Format + +``` +HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 vpp: 8 ga: 1 all_gpus: 32 checkpoints: 0 checkpoint_initiates: 0 +6 +embedding_layer -1 556000 ALLREDUCE 16777216 1 NONE 0 1 NONE 0 1 +... 
+``` + +Header fields: +- `model_parallel_NPU_group`: TP size +- `ep`: EP size +- `pp`: PP size +- `vpp`: Virtual Pipeline Parallelism +- `ga`: Gradient accumulation +- `all_gpus`: Total GPU count + +### Request Trace Files + +For inference simulation, located in `vidur-alibabacloud/data/processed_traces/`: + +- `splitwise_conv.csv` — Conversational trace +- `sharegpt_8k_filtered_stats_llama2_tokenizer.csv` — ShareGPT trace + +--- + +## See Also + +- [CLI Reference](cli_reference.md) — Command-line parameters +- [SimAI-Analytical Guide](../user_guide/simai_analytical.md) — busbw configuration usage +- [SimAI-Simulation Guide](../user_guide/simai_simulation.md) — Topology configuration usage diff --git a/docs/en/technical_reference/index.md b/docs/en/technical_reference/index.md new file mode 100644 index 00000000..05cbc062 --- /dev/null +++ b/docs/en/technical_reference/index.md @@ -0,0 +1,13 @@ +# Technical Reference + +This section provides detailed technical specifications and reference documentation for SimAI's internal modules, CLI parameters, and configuration files. + +--- + +## Contents + +| Document | Description | +|----------|-------------| +| [GPU Memory Module](memory_module.md) | ParamCounter, KV Cache calculation, and MemoryPlanner | +| [CLI Reference](cli_reference.md) | Complete command-line parameter reference for all modes | +| [Configuration Files](configuration.md) | SimAI.conf, busbw.yaml, topology files, and model configs | diff --git a/docs/en/technical_reference/memory_module.md b/docs/en/technical_reference/memory_module.md new file mode 100644 index 00000000..9fb41c53 --- /dev/null +++ b/docs/en/technical_reference/memory_module.md @@ -0,0 +1,155 @@ +# GPU Memory Calculation Module + +The GPU memory calculation module (introduced in SimAI 1.6) provides accurate GPU memory estimation for inference simulation, covering model parameter memory, KV cache memory, and maximum batch size calculation. 
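In outline, the module turns a GPU memory budget into a maximum concurrent-request count. A simplified, illustrative sketch of that arithmetic (not the actual `MemoryPlanner` code):

```python
# Simplified, illustrative sketch of the memory-budget arithmetic (not the real MemoryPlanner).
def max_concurrent_requests(total_gpu_mem_bytes, memory_margin_fraction,
                            param_mem_bytes, kv_cache_per_request_bytes):
    """GPU memory left after parameters bounds how many requests' KV cache fits."""
    available = total_gpu_mem_bytes * (1 - memory_margin_fraction)
    kv_budget = available - param_mem_bytes
    if kv_budget <= 0:
        raise MemoryError("model parameters alone exceed the GPU memory budget (OOM)")
    return int(kv_budget // kv_cache_per_request_bytes)
```

The sections below detail how each input to this calculation (parameter memory, KV cache per token, margins) is derived.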
+ +--- + +## Architecture + +``` +ParamCounter (param_counter.py) + |-- Computes per-layer, per-device parameter counts + |-- Returns (total_params, prefill_params, decode_params) under PD + | +MemoryPlanner (memory_planner.py) + |-- Plans GPU memory budget + |-- Computes KV cache capacity + |-- Detects OOM conditions + | +Replica KV Cache Tracker (replica.py) + |-- Per-request allocation/release + |-- Runtime capacity queries +``` + +--- + +## ParamCounter + +**File**: `vidur-alibabacloud/vidur/utils/param_counter.py` + +### MLA Parameters (DeepSeek-V3-671B) + +Per-layer MLA parameter components: + +| Component | Formula | DeepSeek-V3 Value | +|-----------|---------|-------------------| +| Q LoRA down-projection | `hidden_size * q_lora_rank` | 7168 * 1536 | +| Q LoRA up-projection | `q_lora_rank * num_heads * qk_head_dim` | 1536 * 128 * 192 | +| KV LoRA down-projection | `hidden_size * kv_lora_rank` | 7168 * 512 | +| KV LoRA up-projection | `kv_lora_rank * num_heads * (qk_nope_dim + v_head_dim)` | 512 * 128 * 256 | +| Output projection | `hidden_size * num_heads * v_head_dim` | 7168 * 128 * 128 | + +Where `qk_head_dim = qk_nope_head_dim + qk_rope_head_dim = 128 + 64 = 192` + +### MHA/GQA Parameters (Qwen3-MoE-235B) + +``` +wq = hidden_size * num_attention_heads * head_dim +wk = hidden_size * num_key_value_heads * head_dim +wv = hidden_size * num_key_value_heads * head_dim +wo = hidden_size * num_attention_heads * head_dim +total = (wq + wk + wv + wo) * bytes_per_element +``` + +### Linear Attention Parameters (Qwen3-Next-80B) + +Qwen3-Next-80B uses hybrid attention: full attention and linear (GDN) attention alternating every 4 layers. Linear attention layers use independent `linear_key_head_dim` / `linear_num_key_heads` configurations. 
+ +### MoE Expert Parameters + +Per-expert FFN (3 weight matrices W1, W2, W3): + +``` +expert_params = 3 * hidden_size * moe_intermediate_size * bytes_per_element +``` + +### PD Disaggregation + +Under PD disaggregation, expert parallelism differs between clusters: + +- **Prefill cluster**: `experts_per_device = num_routed_experts / prefill_world_size` +- **Decode cluster**: `experts_per_device = num_routed_experts / decode_world_size` + +Returns triple: `(total_params, prefill_params, decode_params)` + +--- + +## KV Cache Calculation + +### MHA/GQA KV Cache + +``` +kv_cache_per_token = 2 * num_kv_heads * head_dim * num_layers * bytes_per_element +``` + +Factor of 2 = K (Key) + V (Value) caches. + +### MLA KV Cache (DeepSeek-V3-671B) + +MLA uses compressed KV representations — a single latent vector encoding both K and V: + +``` +kv_cache_per_token = (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_element +``` + +Where `kv_lora_rank = 512`, `qk_rope_head_dim = 64`. + +**Comparison**: MHA would need `2 * 128 * 128 = 32768` elements per token. MLA needs only `576` elements — a **~57x reduction**. + +### Per-Request KV Cache Tracking + +The `Replica` entity maintains: + +| State | Description | +|-------|-------------| +| `_allocated_kv_cache_memory` | Currently allocated KV cache (bytes) | +| `_max_kv_cache_memory` | Maximum KV cache capacity | +| `_kv_cache_allocation_map` | Per-request allocation mapping | + +Operations: +- `allocate_request_kv_cache_memory(request, num_blocks, block_size)` +- `release_request_kv_cache_memory(request)` +- `get_remaining_kv_cache_capacity()` + +--- + +## MemoryPlanner + +**File**: `vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py` + +### Calculation Flow + +1. **Available GPU memory**: `available = total_GPU_memory * (1 - memory_margin_fraction)` +2. **Parameter memory**: Via ParamCounter; PD returns `(total, prefill, decode)` +3. **KV cache budget**: `kv_available = available - param_memory` +4. 
**Max concurrent requests**: `max_requests = kv_available / kv_cache_per_request` + +### PD Disaggregation + +- Prefill replicas: use `prefill_param_mem` for budget +- Decode replicas: use `decode_param_mem` for budget + +### OOM Detection + +When `param_memory > available_memory`, outputs error with suggestions: +- Increase TP/EP size +- Use larger GPU (more VRAM) +- Enable FP8 quantization + +--- + +## Quantization Support + +| Precision | Bytes per Element | Use Case | +|-----------|-------------------|----------| +| FP32 | 4 | Reference | +| FP16/BF16 | 2 | Default inference | +| FP8 | 1 | Reduced memory, supported by ParamCounter | + +--- + +## See Also + +- [vidur-alibabacloud Component](../components/vidur.md) — Full component documentation +- [Supported Models](../user_guide/supported_models.md) — Model specifications +- [SimAI 1.6 Tech Report](../../SimAI_1.6_Tech_Report.md) — Detailed technical report diff --git a/docs/en/user_guide/index.md b/docs/en/user_guide/index.md new file mode 100644 index 00000000..3e95fc10 --- /dev/null +++ b/docs/en/user_guide/index.md @@ -0,0 +1,25 @@ +# User Guide + +This section provides detailed usage instructions for all SimAI operation modes. 
+ +## Contents + +| Page | Description | +|------|-------------| +| [SimAI-Analytical](simai_analytical.md) | Fast analytical simulation using bus bandwidth | +| [SimAI-Simulation](simai_simulation.md) | Full-stack NS-3 network simulation with topology configuration | +| [SimAI-Physical](simai_physical.md) | Physical RDMA traffic generation on real clusters | +| [Inference Simulation](inference_simulation.md) | Multi-request LLM inference simulation with PD disaggregation | +| [Workload Generation](workload_generation.md) | Generate training and inference workloads using AICB | +| [Supported Models](supported_models.md) | Complete list of supported models and configurations | +| [Result Analysis](result_analysis.md) | Analyze and visualize simulation results | + +## Workflow Overview + +A typical SimAI workflow involves three steps: + +1. **Generate workloads** using [AICB](workload_generation.md) — defines the computation and communication patterns +2. **Run simulation** using one of the three modes (Analytical, Simulation, or Physical) +3. **Analyze results** using built-in tools or custom scripts + +For inference simulation, the workflow uses Vidur for request scheduling and memory management, with AICB or SimAI as the execution-time prediction backend. diff --git a/docs/en/user_guide/inference_simulation.md b/docs/en/user_guide/inference_simulation.md new file mode 100644 index 00000000..b99b75fa --- /dev/null +++ b/docs/en/user_guide/inference_simulation.md @@ -0,0 +1,176 @@ +# Multi-Request Inference Simulation + +SimAI supports complete multi-request LLM inference simulation, enabling end-to-end performance evaluation of inference serving systems with support for Prefill-Decode (PD) disaggregation. 
+ +--- + +## Overview + +The inference simulation pipeline combines several SimAI components: + +- **[AICB](../components/aicb.md)** — Generates inference workloads and profiles computation time +- **[vidur-alibabacloud](../components/vidur.md)** — Request scheduling, memory management, and metrics collection +- **[astra-sim-alibabacloud](../components/astra_sim.md)** — Simulation engine (Analytical or Simulation mode) +- **[SimCCL](../components/simccl.md)** — Collective communication transformation + +--- + +## Prefill-Decode (PD) Disaggregation + +The inference process is divided into two distinct phases: + +| Phase | Characteristic | Description | +|-------|---------------|-------------| +| **Prefill** | Compute-intensive | Processes all input prompt tokens and generates the first output token | +| **Decode** | Memory-bandwidth-intensive | Autoregressively generates subsequent output tokens one at a time | + +PD disaggregation allows deploying these phases on different GPU nodes, enabling: + +- **Elastic resource allocation** — Prefill nodes can be configured with more compute, Decode nodes with more memory +- **Performance isolation** — Avoiding resource contention between phases +- **Flexible P:D ratio** — Configurable via `--replica_config_pd_node_ratio` + +--- + +## Request Scheduling + +The scheduling component is adapted from Microsoft's [Vidur](https://github.com/microsoft/vidur), supporting multiple strategies: + +| Scheduler | Level | Description | +|-----------|-------|-------------| +| `split_wise` | Global | PD disaggregation-aware dispatch to Prefill and Decode replicas | +| `lor` | Global | Least Outstanding Requests — dispatch to the least-loaded replica | +| `round_robin` | Global | Round-robin dispatch | +| `sarathi` | Per-replica | Intra-replica batch scheduling | +| `split_wise` | Per-replica | Per-replica scheduling for PD disaggregation | + +--- + +## Parallelism Strategies + +Supports combinations of multiple parallelism strategies: + +| 
Strategy | Flag | Description | +|----------|------|-------------| +| **Data Parallel (DP)** | `--cluster_config_num_replicas` | Number of replicas | +| **Tensor Parallel (TP)** | `--replica_config_tensor_parallel_size` | Intra-node parallelism | +| **Pipeline Parallel (PP)** | `--replica_config_num_pipeline_stages` | Inter-stage parallelism | +| **Expert Parallel (EP)** | `--replica_config_expert_model_parallel_size` | MoE expert parallelism | + +Works for both dense and MoE (Mixture-of-Experts) models. + +--- + +## Execution-Time Prediction Backends + +| Backend | Flag Value | Description | +|---------|-----------|-------------| +| **AICB/AIOB** | `aicb` | Supports compute kernels and TP/DP/PP/EP communication size for DeepSeek-V3, Qwen3-MoE, Qwen3-Next | +| **SimAI Simulation** | `simai_simulation` | NS-3-based full-stack network simulation (currently supports TP) | +| **SimAI Analytical** | `simai_analytical` | Analytical performance model (currently supports TP) | +| **Native Vidur** | `vidur` | Original Vidur backend, supports TP, DP, PP | + +Set via `--random_forrest_execution_time_predictor_config_backend`. 
+ +--- + +## Quick Start + +### Prerequisites + +- **AICB backend**: SimAI Docker environment with Hopper (SM90) or Blackwell (SM100) GPUs +- **SimAI backends**: Compile SimAI-Analytical or SimAI-Simulation first +- **Vidur backend**: Conda environment with profiling data + +### Run with AICB Backend + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 2 \ + --replica_config_expert_model_parallel_size 8 \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 5 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 1024 \ + --fixed_request_length_generator_config_decode_tokens 10 \ + --random_forrest_execution_time_predictor_config_backend aicb +``` + +### Run with SimAI Simulation Backend + +```bash +cd SimAI + +# Compile and generate topology +./scripts/build.sh -c ns3 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +cd vidur-alibabacloud + +python -m vidur.main \ + --replica_config_model_name meta-llama/Meta-Llama-3-8B \ + --replica_config_tensor_parallel_size 4 \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --random_forrest_execution_time_predictor_config_backend simai_simulation \ + --random_forrest_execution_time_predictor_config_simai_dir ../ \ + --random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \ + --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +### Run 
the 4-Scenario Test Suite + +```bash +# Run all 4 pre-configured scenarios +bash examples/vidur-ali-scenarios/run_scenarios.sh --all + +# Run a single scenario +bash examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +--- + +## 4-Scenario Configuration + +**Shared Hardware**: H20 GPU (h20_dgx), NVLink 1600 Gbps, RDMA 800 Gbps, PD P2P 800 Gbps (fp8) + +| Scenario | Model | PD Separation | World Size | TP | EP | Scheduler | +|----------|-------|---------------|------------|----|----|-----------| +| 1 | Qwen3-Next-80B | No | 32 (dp=32) | 1 | 1 | lor | +| 2 | Qwen3-Next-80B | Yes (P=2, D=6) | 8 | 1 | 1 | split_wise | +| 3 | DeepSeek-671B | Yes (P=2, D=6) | 8 | 8 | 8 | split_wise | +| 4 | Qwen3-MoE-235B | Yes (P=2, D=6) | 8 | 4 | 4 | split_wise | + +--- + +## Output + +Each simulation run produces: + +``` +// +├── request_metrics.csv # Per-request metrics +├── chrome_trace.json # Chrome DevTools timeline trace +├── config.json # Configuration snapshot +└── plots/ # Per-metric CSV/JSON files +``` + +See [Result Analysis](result_analysis.md) for output interpretation. + +--- + +## See Also + +- [vidur-alibabacloud Component](../components/vidur.md) — Full inference simulation documentation +- [Supported Models](supported_models.md) — Model compatibility matrix +- [Result Analysis](result_analysis.md) — Output interpretation guide diff --git a/docs/en/user_guide/result_analysis.md b/docs/en/user_guide/result_analysis.md new file mode 100644 index 00000000..eb4d71e3 --- /dev/null +++ b/docs/en/user_guide/result_analysis.md @@ -0,0 +1,183 @@ +# Result Analysis & Visualization + +This guide covers how to interpret and analyze simulation outputs from all SimAI modes. + +--- + +## SimAI-Analytical Output + +### CSV Output + +Running SimAI-Analytical generates a CSV file in the `results/` directory. 
The output contains: + +- **Summary row**: Exposure time, computation time (absolute and percentage) for each communication group, and end-to-end iteration time +- **Per-layer rows**: Detailed operation timing for each layer + +Key columns include per-communication-group breakdown (TP, DP, EP, PP) showing time allocation and overlap effects. + +### Visualization + +When running with the `-v` flag, SimAI-Analytical generates additional visualization files showing the timing breakdown across communication groups. + +```bash +# Run with visualization enabled +./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml -v +``` + +--- + +## SimAI-Simulation Output + +SimAI-Simulation (NS-3 mode) generates detailed trace data capturing fine-grained network behavior. The NS-3 backend outputs `.tr` trace files that can be analyzed using the provided analysis tools. + +### Analysis Tools + +Located in `ns-3-alibabacloud/analysis/`: + +| Tool | Description | +|------|-------------| +| `fct_analysis.py` | Flow Completion Time analysis — reads FCT output files and produces statistics | +| `trace_reader` | Parses `.tr` trace files with filtering support | + +### Using trace_reader + +```bash +# Build +cd ns-3-alibabacloud/analysis +make trace_reader + +# Parse trace file +./trace_reader <.tr file> [filter_expr] + +# Examples: +./trace_reader output.tr "time > 2000010000" +./trace_reader output.tr "sip=0x0b000101&dip=0x0b000201" +``` + +### Trace Output Format + +Each line in the trace output follows this format: + +``` +2000055540 n:338 4:3 100608 Enqu ecn:0 0b00d101 0b012301 10000 100 U 161000 0 3 1048(1000) +``` + +Fields: timestamp (ns), node ID, port:queue, queue length (bytes), event type, ECN flag, source IP, destination IP, source port, destination port, packet type, sequence number, TX timestamp, priority group, packet size (payload). 
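For programmatic analysis beyond `trace_reader`, a trace line can be split into a record directly. A hedged sketch, assuming only the field order documented above (names are illustrative):

```python
def parse_trace_line(line):
    """Parse one NS-3 trace line into a dict, following the documented field order."""
    p = line.split()
    size, payload = p[14].rstrip(")").split("(")  # e.g. "1048(1000)"
    return {
        "timestamp_ns": int(p[0]),
        "node": int(p[1].split(":")[1]),          # "n:338" -> 338
        "port": int(p[2].split(":")[0]),          # "4:3" -> port 4, queue 3
        "queue": int(p[2].split(":")[1]),
        "queue_len_bytes": int(p[3]),
        "event": p[4],                            # e.g. "Enqu"
        "ecn": int(p[5].split(":")[1]),
        "sip": p[6], "dip": p[7],
        "sport": int(p[8]), "dport": int(p[9]),
        "pkt_type": p[10],
        "seq": int(p[11]),
        "tx_timestamp": int(p[12]),
        "priority_group": int(p[13]),
        "size_bytes": int(size), "payload_bytes": int(payload),
    }

rec = parse_trace_line(
    "2000055540 n:338 4:3 100608 Enqu ecn:0 0b00d101 0b012301 "
    "10000 100 U 161000 0 3 1048(1000)")
print(rec["event"], rec["queue_len_bytes"])  # → Enqu 100608
```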
+ +--- + +## Inference Simulation Output + +### Output Directory Structure + +Each inference simulation run produces: + +``` +// +├── request_metrics.csv # Per-request metrics +├── chrome_trace.json # Chrome DevTools timeline trace +├── config.json # Configuration snapshot +└── plots/ # Per-metric CSV/JSON files + ├── request_e2e_time.csv + ├── prefill_e2e_time.csv + ├── pd_p2p_comm_time.csv + ├── replica_N_memory_usage.json + └── ... +``` + +### request_metrics.csv Columns + +| Column | Meaning | +|--------|---------| +| `arrived_at` | Timestamp when the request entered the system (seconds) | +| `scheduled_at` | Timestamp when the request was first scheduled (seconds) | +| `prefill_completed_at` | Timestamp when Prefill completed and first token generated | +| `decode_arrived_at` | Timestamp when Decode phase started | +| `decode_time` | Duration of Decode phase (seconds) | +| `prefill_replica_id` | Replica ID that executed Prefill (PD mode) | +| `decode_replica_id` | Replica ID that executed Decode (PD mode) | +| `request_num_prefill_tokens` | Number of input tokens (prompt length) | +| `request_num_decode_tokens` | Number of output tokens (generation length) | +| `pd_p2p_comm_size` | P2P communication size from Prefill to Decode node (bytes) | +| `pd_p2p_comm_time` | P2P communication time (seconds) | +| `completed_at` | Request completion timestamp | +| `request_execution_time` | Total execution time excluding delays (seconds) | +| `request_preemption_time` | Wait time due to preemption/bubbles (seconds) | +| `request_scheduling_delay` | Scheduling delay: `scheduled_at - arrived_at` (seconds) | +| `request_e2e_time` | End-to-end latency: `completed_at - arrived_at` (seconds) | +| `prefill_e2e_time` | Time To First Token (TTFT): `prefill_completed_at - arrived_at` (seconds) | +| `tbt` | Time Between Tokens: `decode_time / request_num_decode_tokens` (seconds/token) | + +### Chrome Trace Visualization + +Open `chrome_trace.json` in Chrome DevTools for visual 
timeline analysis: + +1. Open Chrome browser +2. Navigate to `chrome://tracing` +3. Load the `chrome_trace.json` file + +### Simulation Metrics (23 metrics) + +The simulator records 23 fine-grained metrics: + +| Category | Metrics | +|----------|---------| +| **Request Latency** | E2E time CDF, normalized E2E CDF, execution+preemption CDF | +| **Scheduling** | Scheduling delay CDF | +| **Execution** | Execution time CDF, preemption time CDF | +| **Token-level** | Decode token execution+preemption times, inter-token delay | +| **Batch** | Batch num tokens CDF, batch sizes CDF | +| **Prefill** | Prefill E2E CDF, prefill execution+preemption CDF (normalized) | +| **Decode** | Decode execution+preemption normalized CDF | +| **Time Series** | Request/prefill/decode completions, request arrivals | +| **Per-replica** | Memory usage (weighted mean), busy time %, MFU | + +For detailed metric definitions, see the [vidur metrics documentation](../components/vidur.md). + +--- + +## AICB Physical Execution Output + +### Log Output + +After each communication, AICB prints: +- Communication type and group +- Message size +- Execution time +- Throughput (algbw and busbw) + +### Iteration Summary + +After all communications complete, a summary shows: +- Overall runtime and per-iteration timing +- Per-communication-type statistics (message sizes, frequencies, latency min/max/avg) + +### CSV Output + +Results are saved in `results/comm_logs/`: +- `__log.csv` — Execution log with timing, phase, algbw, busbw per comm_group and comm_type +- `__workload.csv` — Generated workload description + +### Programmatic Analysis + +```python +# Read workload log +from log_analyzer.log import Workload +workload, args = Workload.load("results/comm_logs/megatron_gpt_13B_8n_workload.csv") + +# Read execution log +from log_analyzer.log import Log +log = Log.load("results/comm_logs/megatron_gpt_13B_8n_log.csv") +# log.comm_logs: List[LogItem] +# log.epoch_times: List[int] +# log.comm_log_each_epoch: 
List[List[LogItem]] +``` + +--- + +## See Also + +- [SimAI-Analytical](simai_analytical.md) — Analytical mode usage +- [SimAI-Simulation](simai_simulation.md) — NS-3 simulation mode usage +- [Inference Simulation](inference_simulation.md) — Inference simulation guide +- [NS-3 Component](../components/ns3.md) — NS-3 analysis tools diff --git a/docs/en/user_guide/simai_analytical.md b/docs/en/user_guide/simai_analytical.md new file mode 100644 index 00000000..94dec88d --- /dev/null +++ b/docs/en/user_guide/simai_analytical.md @@ -0,0 +1,104 @@ +# SimAI-Analytical + +SimAI-Analytical offers fast simulation by abstracting network communication details using bus bandwidth (busbw) to estimate collective communication time. It is ideal for rapid scenario validation and performance analysis. + +## Use Cases + +- **Performance Analysis**: Compare completion times across different models (e.g., impact of expert numbers on MoE training) +- **Parallel Parameter Optimization**: Balance and optimize TP/EP/PP parameters +- **Scale-up Exploration**: Investigate parallel parameter performance across different scale-up domains +- **Scale-out Bandwidth Selection**: Research cost-effective bandwidth configurations + +## Workload Generation + +Generate workloads using [AICB](workload_generation.md): + +```bash +sh ./aicb/scripts/megatron_workload_with_aiob.sh \ + -m 7 --world_size 4096 \ + --tensor_model_parallel_size 2 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 8192 \ + --micro_batch 1 --seq_length 4096 \ + --swiglu --use_flash_attn --aiob_enable +``` + +This produces a `.txt` workload file containing: +- `model_parallel_NPU_group`: Tensor Parallelism size +- `ep`: Expert model parallelism size +- `pp`: Pipeline model parallelism size +- `vpp`: Virtual Pipeline Parallelism + +> For more details, see [AICB Workload Generation](workload_generation.md) and the [AICB Component Documentation](../components/aicb.md). 
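These fields appear on the header line of the generated `.txt` workload file. An illustrative header (the values depend on your generation flags) looks like:

```
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 8 ga: 1 all_gpus: 2 checkpoints: 0 checkpoint_initiates: 0
```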
+ +## Busbw Configuration + +SimAI-Analytical uses a `busbw.yaml` file to specify bus bandwidth for different communication groups: + +```yaml +test +TP: + allreduce,: 300 # AllReduce busbw 300GB/s in TP + allgather,: 280 + reducescatter,: 280 + alltoall,: 230 +DP: + allreduce,: null + allgather,: 380 # AllGather busbw 380GB/s in DP + reducescatter,: 380 + alltoall,: null +EP: + allreduce,: null + allgather,: 45 + reducescatter,: 45 + alltoall,: 80 # AlltoAll busbw 80GB/s in EP +``` + +## Running Analytical Simulation + +```bash +./bin/SimAI_analytical \ + -w example/workload_analytical.txt \ + -g 9216 \ + -g_p_s 8 \ + -r test- \ + -busbw example/busbw.yaml +``` + +### Required Parameters + +| Parameter | Long Form | Description | +|:---------:|:----------|:------------| +| `-w` | `--workload` | Path to the workload file | +| `-g` | `--gpus` | Simulation GPU scale | +| `-g_p_s` | `--gpus-per-server` | Scale-up size (GPUs per server) | +| `-r` | `--result` | Output file path and prefix (default: `./results/`) | +| `-busbw` | `--bus-bandwidth` | Path to the busbw file | + +### Optional Parameters + +| Parameter | Long Form | Description | +|:---------:|:----------|:------------| +| `-v` | `--visual` | Generate visualization files | + +### Overlap Ratios + +| Parameter | Long Form | Description | Range | +|:---------:|:----------|:------------|:------| +| `-dp_o` | `--dp-overlap-ratio` | DP overlap ratio | [0.0-1.0] | +| `-ep_o` | `--ep-overlap-ratio` | EP overlap ratio | [0.0-1.0] | +| `-tp_o` | `--tp-overlap-ratio` | TP overlap ratio | [0.0-1.0] | +| `-pp_o` | `--pp-overlap-ratio` | PP overlap ratio | [0.0-1.0] | + +## Result Analysis + +Running SimAI-Analytical generates a CSV output containing: +- Summary row with exposure time, computation time percentages for each communication group, and end-to-end iteration time +- Per-layer operation details + +![Raw Output](../../images/simai_raw.png) + +If you specify `-v`, a visualization file is also generated: + 
+![Visualization](../../images/simai_visual.png) + +For more on result analysis, see [Result Analysis](result_analysis.md). diff --git a/docs/en/user_guide/simai_physical.md b/docs/en/user_guide/simai_physical.md new file mode 100644 index 00000000..e3481408 --- /dev/null +++ b/docs/en/user_guide/simai_physical.md @@ -0,0 +1,123 @@ +# SimAI-Physical Mode + +> **Status**: Beta — Currently in internal testing phase. + +SimAI-Physical enables physical traffic generation for CPU RDMA cluster environments. This mode generates NCCL-like traffic patterns, allowing in-depth study of NIC behaviors during LLM training. + +--- + +## Overview + +Unlike SimAI-Analytical and SimAI-Simulation which run entirely in software simulation, SimAI-Physical injects real RDMA traffic into a physical network. This enables: + +- Studying actual NIC behavior under realistic LLM training traffic patterns +- Validating network configurations on real hardware +- Benchmarking RDMA performance with representative collective communication workloads + +**Component Combination**: [AICB](../../components/aicb.md) + [SimCCL](../../components/simccl.md) + [astra-sim-alibabacloud](../../components/astra_sim.md) (physical mode) + +--- + +## Prerequisites + +SimAI-Physical uses the RoCEv2 protocol for traffic generation. 
Before compilation, ensure your environment meets: + +- **RDMA Support**: Working `libibverbs` / RDMA device drivers +- **MPI**: OpenMPI installed and functional +- **Verification**: Successfully run `ib_write_bw` or similar RDMA perftest tools + +--- + +## Compilation + +```bash +# Clone and initialize +git clone https://github.com/aliyun/SimAI.git +cd SimAI/ +git submodule update --init --recursive +git submodule update --remote + +# Install MPI (CentOS/RHEL) +sudo yum install openmpi openmpi-devel + +# Set MPI paths +export MPI_INCLUDE_PATH=/usr/include/openmpi-x86_64/ +export MPI_BIN_PATH=/usr/lib64/openmpi/bin/mpic++ + +# Build SimAI-Physical +./scripts/build.sh -c phy +``` + +--- + +## Workload Generation + +SimAI-Physical uses the same workload format as SimAI-Simulation, generated through [AICB](../../components/aicb.md). See [Workload Generation](workload_generation.md) for details. + +### Example Workload + +``` +HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 8 ga: 1 all_gpus: 2 checkpoints: 0 checkpoint_initiates: 0 +10 +mlp_norm -1 1055000 ALLGATHER 1073741824 1055000 NONE 0 1055000 NONE 0 100 +mlp_norm -1 1055000 ALLGATHER 1073741824 1055000 NONE 0 1055000 NONE 0 100 +... +``` + +--- + +## Prepare the Host List + +Prepare an IP list file for the MPI program. The number of IPs should match the number of NICs involved in physical traffic generation (not the number of nodes). 
+ +``` +33.255.199.130 +33.255.199.129 +``` + +--- + +## Running + +### MPI Execution + +```bash +/usr/lib64/openmpi/bin/mpirun -np 2 \ + -host 33.255.199.130,33.255.199.129 \ + --allow-run-as-root \ + -x AS_LOG_LEVEL=0 \ + ./bin/SimAI_phynet ./hostlist -g 2 -w ./example/microAllReduce.txt +``` + +### MPI Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `-np` | Number of processes | Required | +| `-host` | Comma-separated IP list | Required | +| `--allow-run-as-root` | Allow running as root | `FALSE` | + +### SimAI-Physical Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `hostlist` | Path to host IP list file | Required | +| `-w` / `--workload` | Path to workload file | `./microAllReduce.txt` | +| `-i` / `--gid_index` | GID index for RDMA device | `0` | +| `-g` / `--gpus` | Number of GPUs (must match IP count in hostlist) | `8` | + +--- + +## Notes + +- The number of GPUs (`-g`) must be consistent with the number of IPs in the host IP list +- Ensure all nodes have network connectivity and RDMA is properly configured +- SimAI-Physical is currently in beta; some features may change in future releases + +--- + +## See Also + +- [AICB Component](../components/aicb.md) — Workload generation +- [SimCCL Component](../components/simccl.md) — Collective communication transformation +- [astra-sim Component](../components/astra_sim.md) — Simulation engine details diff --git a/docs/en/user_guide/simai_simulation.md b/docs/en/user_guide/simai_simulation.md new file mode 100644 index 00000000..9053d2eb --- /dev/null +++ b/docs/en/user_guide/simai_simulation.md @@ -0,0 +1,113 @@ +# SimAI-Simulation + +SimAI-Simulation provides high-fidelity, full-stack simulation with fine-grained network communication modeling using NS-3 as the network backend. It is designed for detailed research into collective communication algorithms, network protocols, and novel network architectures. 
+ +## Use Cases + +- **Collective Communication Algorithm Research**: Design and optimize traffic patterns for non-switch architectures +- **Network Protocol Research**: Evaluate congestion control, routing mechanisms, and low-level protocols +- **Novel Network Architecture Design**: Explore innovative network topologies and configurations + +## Workload Generation + +Use the same workload as SimAI-Analytical, generated by [AICB](workload_generation.md). + +## Network Topology Configuration + +Before running SimAI-Simulation, you need to generate a topology file recognized by ns-3-alibabacloud. + +### Topology Templates + +SimAI provides 5 templates for common architectures: + +| Template | Description | Default GPUs | +|----------|-------------|-------------| +| `Spectrum-X` | Rail-optimized, single ToR, single plane | 4096 | +| `AlibabaHPN` (Single Plane) | Rail-optimized, dual ToR, single plane | 15360 | +| `AlibabaHPN` (Dual Plane) | Rail-optimized, dual ToR, dual plane | 15360 | +| `DCN+` (Single ToR) | Non rail-optimized, single ToR | 512 | +| `DCN+` (Dual ToR) | Non rail-optimized, dual ToR | 512 | + +![Spectrum-X](../../images/Spectrum-X.jpg) + +### Generating Topology + +```bash +# Spectrum-X with 8 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 8 -psn 1 + +# Dual-Plane AlibabaHPN with 64 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo AlibabaHPN --dp -g 64 -asn 16 -psn 16 + +# Dual-ToR DCN+ with 128 GPUs +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo DCN+ --dt -g 128 -asn 2 -psn 8 + +# Custom rail-optimized topology +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -g 32 -bw 200Gbps -gt A100 -psn 8 --ro +``` + +### Topology Parameters + +| Level | Parameter | Description | +|-------|-----------|-------------| +| Whole Structure | `-topo` | Template name | +| | `-g` | Number of GPUs | +| | `--dp` | Enable dual plane | +| | 
`--ro` | Enable rail-optimized | +| | `--dt` | Enable dual NICs and dual ToRs | +| | `-er` | Error rate | +| Intra-Host | `-gps` | GPUs per server | +| | `-gt` | GPU type | +| | `-nvbw` | NVLink bandwidth | +| | `-nl` | NVLink latency | +| | `-l` | NIC latency | +| Intra-Segment | `-bw` | NIC to ASW bandwidth | +| | `-asw` | ASW switch count | +| | `-nps` | NICs per switch | +| Intra-Pod | `-psn` | PSW switch count | +| | `-apbw` | ASW to PSW bandwidth | +| | `-app` | ASW per PSW | + +> For detailed topology parameters and default values per template, see the [astra-sim Component Documentation](../components/astra_sim.md). + +## Running NS-3 Simulation + +```bash +AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \ + -t 16 \ + -w ./example/microAllReduce.txt \ + -n ./Spectrum-X_8g_8gps_400Gbps_H100 \ + -c astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +### Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `AS_LOG_LEVEL` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR` | `INFO` | +| `AS_PXN_ENABLE` | Enable PXN (`0`/`1`) | `0` | +| `AS_NVLS_ENABLE` | Enable NVLS (`0`/`1`) | `0` | +| `AS_SEND_LAT` | Packet sending latency (us) | `6` | +| `AS_NVLSTREE_ENABLE` | Enable NVLSTREE | `false` | + +### Simulation Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `-t` / `--thread` | Number of threads (recommended 8-16) | `1` | +| `-w` / `--workload` | Path to workload file | `./microAllReduce.txt` | +| `-n` / `--network-topo` | Network topology path | None | +| `-c` / `--config` | SimAI configuration file | None | + +## Example: RING vs NVLS Comparison + +See the [Tutorial](../../docs/Tutorial.md#ring-vs-nvls) for a complete comparison of RING and NVLS algorithms across different message sizes. 
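Such a comparison can be reproduced by running the same workload twice and toggling `AS_NVLS_ENABLE`; a sketch, assuming the topology generated in the example above:

```bash
# RING baseline (NVLS off)
AS_SEND_LAT=3 AS_NVLS_ENABLE=0 ./bin/SimAI_simulator -t 16 \
  -w ./example/microAllReduce.txt \
  -n ./Spectrum-X_8g_8gps_400Gbps_H100 \
  -c astra-sim-alibabacloud/inputs/config/SimAI.conf

# NVLS enabled — compare the reported collective completion times
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 16 \
  -w ./example/microAllReduce.txt \
  -n ./Spectrum-X_8g_8gps_400Gbps_H100 \
  -c astra-sim-alibabacloud/inputs/config/SimAI.conf
```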
+ +## What's Next + +- [NS-3 Component Documentation](../components/ns3.md) — Detailed NS-3 module reference +- [Result Analysis](result_analysis.md) — Analyze simulation output diff --git a/docs/en/user_guide/supported_models.md b/docs/en/user_guide/supported_models.md new file mode 100644 index 00000000..06ebcf59 --- /dev/null +++ b/docs/en/user_guide/supported_models.md @@ -0,0 +1,141 @@ +# Supported Models + +SimAI supports a range of LLM models for both training and inference simulation. + +--- + +## Inference Models (SimAI 1.5+) + +These models are supported for multi-request inference simulation via vidur-alibabacloud, with GPU memory calculation, PD disaggregation, and workload generation support. + +### DeepSeek-V3-671B + +| Attribute | Value | +|-----------|-------| +| **Total Layers** | 61 | +| **Attention Type** | MLA (Multi-head Latent Attention) | +| **Attention Heads** | 128 | +| **Hidden Size** | 7168 | +| **KV LoRA Rank** | 512 | +| **Q LoRA Rank** | 1536 | +| **QK RoPE Head Dim** | 64 | +| **QK NoPE Head Dim** | 128 | +| **V Head Dim** | 128 | +| **MoE Routed Experts** | 256 | +| **Experts Per Token** | 8 | +| **Shared Experts** | 1 | +| **Dense Layers** | First 3 layers (fixed activation of 8 routed + 1 shared expert) | +| **Sparse Layers** | Layers 3-60 (dynamically select 8 from 256 routed + 1 shared expert) | +| **Config File** | `vidur-alibabacloud/data/hf_configs/deepseek_v3_config.json` | + +### Qwen3-MoE-235B + +| Attribute | Value | +|-----------|-------| +| **Total Layers** | 94 | +| **Attention Type** | MHA / GQA | +| **Attention Heads** | 64 | +| **KV Heads** | 4 | +| **Hidden Size** | 4096 | +| **Head Dim** | 128 | +| **MoE Routed Experts** | 128 | +| **Experts Per Token** | 8 | +| **MoE Intermediate Size** | 1536 | +| **Config File** | `vidur-alibabacloud/data/hf_configs/qwen3_moe_config.json` | + +### Qwen3-Next-80B + +| Attribute | Value | +|-----------|-------| +| **Total Layers** | 48 | +| **Attention Type** | Hybrid (full + 
linear attention, alternating every 4 layers) | +| **Full Attention Heads** | 16 | +| **KV Heads** | 2 | +| **Hidden Size** | 2048 | +| **Head Dim** | 256 | +| **Linear Attention Key Heads** | 16 | +| **Linear Attention Value Heads** | 32 | +| **MoE Routed Experts** | 512 | +| **Experts Per Token** | 10 | +| **MoE Intermediate Size** | 512 | +| **Config File** | `vidur-alibabacloud/data/hf_configs/qwen3-next-80B-A3B_config.json` | + +### Legacy Inference Models (via Vidur Backend) + +These models are supported using the original Vidur profiling-based backend: + +| Model | TP Support | PP Support | +|-------|-----------|-----------| +| meta-llama/Meta-Llama-3-8B | Yes | Yes | +| meta-llama/Meta-Llama-3-70B | Yes | Yes | +| meta-llama/Llama-2-7b-hf | Yes | Yes | +| meta-llama/Llama-2-70b-hf | Yes | Yes | +| codellama/CodeLlama-34b-Instruct-hf | Yes | Yes | +| internlm/internlm-20b | Yes | Yes | +| Qwen/Qwen-72B | Yes | Yes | + +--- + +## Training Models (AICB) + +The following models are supported for training workload generation: + +### AICB Benchmark Suite + +| ID | Model | Seq Length | Framework | TP | PP | SP | MoE | +|----|-------|-----------|-----------|----|----|-----|-----| +| 1 | LLaMA-7B | 2048 | Megatron | 1 | 1 | - | - | +| 2 | GPT-13B | 2048 | Megatron | 2 | 1 | Yes | - | +| 3 | GPT-22B | 2048 | Megatron | 4 | 1 | - | - | +| 4 | LLaMA-65B | 4096 | Megatron | 8 | 2 | Yes | - | +| 5 | GPT-175B | 2048 | Megatron | 8 | 8 | Yes | - | +| 6 | GPT-175B | 2048 | Megatron | 8 | 8 | - | - | +| 7 | Llama3-405B | 8192 | Megatron | 8 | 16 | Yes | - | +| 8 | LLaMA-7B | 4096 | DeepSpeed | 1 | 1 | - | Zero-2 | +| 9 | LLaMA-65B | 4096 | DeepSpeed | 1 | 1 | - | Zero-3 | +| 10 | Mistral-8x7B | 2048 | Megatron | 2 | 1 | Yes | 8 experts | + +### Training Framework Support + +| Framework | Models | AIOB Support | +|-----------|--------|-------------| +| **Megatron** | GPT, LLaMA, MoE | Yes | +| **DeepSpeed** | LLaMA (Zero Stage 1/2/3) | No (fixed times) | +| **DeepSeek** | 
DeepSeek (16B/236B/671B) | Yes | + +--- + +## Attention Architecture Comparison + +| Architecture | Model | KV Cache Strategy | Memory Efficiency | +|-------------|-------|-------------------|-------------------| +| **MLA** | DeepSeek-V3-671B | Compressed latent vector (`kv_lora_rank` + `qk_rope_head_dim`) | ~57x reduction vs MHA | +| **MHA / GQA** | Qwen3-MoE-235B | Standard KV cache (`num_kv_heads * head_dim`) | Standard | +| **Hybrid Full + Linear** | Qwen3-Next-80B | Full attention layers + linear (GDN) attention alternating every 4 layers | Reduced (linear layers have no KV cache) | + +--- + +## Hardware Requirements + +### Inference Profiling (AICB Backend) + +| Requirement | Details | +|-------------|---------| +| **GPU Architecture** | NVIDIA Hopper (SM90) or Blackwell (SM100) | +| **Reason** | Dependency on DeepGEMM, FlashMLA, FlashInfer | +| **Hopper Note** | Add `ENV FLASH_MLA_DISABLE_SM100=1` to Dockerfile | + +### Training Simulation + +- **SimAI-Analytical**: Any CPU (no GPU required) +- **SimAI-Simulation**: Any CPU (no GPU required) +- **AICB Physical Execution**: Requires GPU cluster with NCCL support + +--- + +## See Also + +- [Inference Simulation](inference_simulation.md) — Multi-request inference guide +- [Workload Generation](workload_generation.md) — AICB workload generation +- [GPU Memory Module](../technical_reference/memory_module.md) — Memory calculation details +- [vidur-alibabacloud](../components/vidur.md) — Inference scheduling component diff --git a/docs/en/user_guide/workload_generation.md b/docs/en/user_guide/workload_generation.md new file mode 100644 index 00000000..e5859b5e --- /dev/null +++ b/docs/en/user_guide/workload_generation.md @@ -0,0 +1,160 @@ +# Workload Generation + +AICB (AI Communication Benchmark) provides workload generation capabilities for both training and inference simulation in SimAI. 
+ +--- + +## Overview + +AICB generates workload description files (`.txt`) that describe the communication and computation patterns of LLM training/inference processes. These workloads are consumed by SimAI's simulation engine. + +Two types of workload generation are supported: + +| Type | Description | Models Supported | +|------|-------------|------------------| +| **Training** | Generates training communication/computation patterns | GPT (7B/13B/22B/175B), LLaMA (7B/65B/405B), DeepSeek (16B/236B/671B), MoE | +| **Inference** | Generates prefill/decode phase workloads | DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B | + +--- + +## Training Workload Generation + +### Quick Start with Pre-configured Models + +```bash +# Generate workload for Megatron GPT-7B +sh ./scripts/megatron_workload_with_aiob.sh -m 7 \ + --world_size 4096 --tensor_model_parallel_size 4 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 8192 \ + --micro_batch 1 --seq_length 4096 --swiglu \ + --use_flash_attn --aiob_enable \ + --comp_filepath workload/aiob_inputs/Example.txt +``` + +Available pre-configured model sizes: `7`, `13`, `22`, `175` (GPT/LLaMA), `moe`, `deepseek` (16/236/671). 
+ +### Generating for Different Frameworks + +#### Megatron + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name GPT-13B --frame=Megatron \ + --world_size=16 --tensor_model_parallel_size=2 --pipeline_model_parallel=1 \ + --global_batch=16 --micro_batch=1 --num_layers=40 --seq_length=2048 \ + --hidden_size=5120 --epoch_num=1 --num_attention_heads=40 \ + --aiob_enable --use_flash_attn --swiglu +``` + +#### MoE + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name MoE --frame=Megatron \ + --world_size=32 --tensor_model_parallel_size=4 --pipeline_model_parallel=1 \ + --expert_model_parallel_size=2 --moe_enable --num_experts=8 --moe_router_topk=2 \ + --global_batch=32 --micro_batch=1 --seq_length=2048 \ + --aiob_enable --swiglu --use_flash_attn +``` + +#### DeepSeek + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --frame=DeepSeek \ + --world_size=32 --tensor_model_parallel_size=4 \ + --expert_model_parallel_size=2 --moe_enable --num_experts=4 --moe_router_topk=2 \ + --global_batch=16 --micro_batch=1 --seq_length=4096 \ + --aiob_enable --swiglu -m deepseek +``` + +#### DeepSpeed + +```bash +python -m workload_generator.generate_deepspeed_stage3_workload \ + --world_size=64 --global_batch=64 \ + --num_layers=40 --hidden_size=5120 --seq_length=4096 \ + --zero_stage=3 --reduce_bucket_size=1000000000 +``` + +### Output + +Generated workload files are saved in: +- Training: `results/mocked_workload/` or `results/workload/` + +--- + +## Inference Workload Generation + +SimAI uses AICB to generate inference workloads with prefill/decode phase separation. + +> **Note**: Inference compute profiling requires NVIDIA Hopper (SM90) or Blackwell (SM100) GPUs due to dependencies on [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) and [FlashMLA](https://github.com/deepseek-ai/FlashMLA). 
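+The GPU requirement maps to CUDA compute capabilities 9.x (Hopper) and 10.x (Blackwell). A hedged sketch of a pre-flight check — the helper is illustrative; on a machine with PyTorch installed, `torch.cuda.get_device_capability(0)` returns the tuple to pass in:
+
+```python
+def is_supported_arch(compute_capability: tuple) -> bool:
+    """True for Hopper (SM90, capability 9.x) or Blackwell (SM100, 10.x),
+    the architectures required by DeepGEMM / FlashMLA."""
+    major = compute_capability[0]
+    return major in (9, 10)
+
+print(is_supported_arch((9, 0)))   # H100/H20 (Hopper) → True
+print(is_supported_arch((8, 0)))   # A100 (Ampere) → False
+```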
+
+### Supported Inference Models
+
+| Model | Attention | MoE Experts | Experts/Token |
+|-------|-----------|-------------|---------------|
+| DeepSeek-V3-671B | MLA | 256 routed + 1 shared | 8 |
+| Qwen3-MoE-235B | MHA/GQA | 128 routed | 8 |
+| Qwen3-Next-80B | Hybrid (full + linear) | 512 routed | 10 |
+
+Inference workloads are automatically generated and consumed by the vidur-alibabacloud scheduling framework. See [Inference Simulation](inference_simulation.md) for end-to-end usage.
+
+---
+
+## AIOB: Computation Time Embedding
+
+AIOB (AI Operation Benchmark) is a sub-module within AICB that profiles actual GPU computation times and embeds them into workloads.
+
+### Usage Options
+
+| Option | Description |
+|--------|-------------|
+| `--aiob_enable` | Enable AIOB to profile computation times on the current GPU |
+| `--comp_filepath <file>` | Use a pre-existing computation time description file |
+| Neither | Use fixed default computation times |
+
+### Example: Profile and Embed
+
+```bash
+sh scripts/megatron_gpt.sh \
+  -m 7 --world_size 8 --tensor_model_parallel_size 2 \
+  --frame Megatron --global_batch 16 --micro_batch 1 \
+  --seq_length 2048 --swiglu --use_flash_attn --aiob_enable
+```
+
+Computation description files are saved in `results/aiob_outputs/`.
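+The three options above resolve to a single computation-time source. A sketch of that decision follows — the precedence shown (an explicit file wins over live profiling) is an assumption to verify against AICB's actual behavior, and the function name is hypothetical:
+
+```python
+def computation_time_source(aiob_enable: bool, comp_filepath=None) -> str:
+    """Pick where computation times come from (assumed precedence:
+    explicit file > AIOB profiling > fixed defaults)."""
+    if comp_filepath:
+        return f"file:{comp_filepath}"
+    if aiob_enable:
+        return "aiob-profiling"
+    return "fixed-defaults"
+
+print(computation_time_source(True, "workload/aiob_inputs/Example.txt"))
+print(computation_time_source(False))  # → fixed-defaults
+```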
+ +--- + +## Key Parameters + +| Category | Parameter | Description | +|----------|-----------|-------------| +| **Framework** | `--frame` | Megatron / DeepSpeed / DeepSeek | +| **Model** | `--model_size` or `-m` | Pre-configured model size | +| **Training** | `--world_size` | Total number of GPUs | +| | `--global_batch` | Total batch size | +| | `--micro_batch` | Micro-batch size | +| | `--seq_length` | Sequence length | +| | `--epoch_num` | Number of iterations | +| **Parallelism** | `--tensor_model_parallel_size` | TP degree | +| | `--pipeline_model_parallel` | PP degree | +| | `--expert_model_parallel_size` | EP degree | +| **MoE** | `--moe_enable` | Enable MoE | +| | `--num_experts` | Number of experts | +| | `--moe_router_topk` | Experts per token | +| **Optimization** | `--use_flash_attn` | Use FlashAttention | +| | `--swiglu` | Use SwiGLU activation | +| | `--aiob_enable` | Enable AIOB computation profiling | +| | `--comp_filepath` | Path to computation time file | + +For the full parameter list, see the [AICB component documentation](../components/aicb.md) or the [CLI Reference](../technical_reference/cli_reference.md). 
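+A couple of the MoE parameters above constrain each other in the usual Megatron fashion: routed experts must divide evenly across expert-parallel ranks, and `--moe_router_topk` fixes the average load per expert. A small illustrative sketch (the helper names are not AICB's):
+
+```python
+def experts_per_rank(num_experts: int, ep: int) -> int:
+    """Routed experts hosted by each expert-parallel rank."""
+    if num_experts % ep != 0:
+        raise ValueError("num_experts must divide evenly by expert parallel size")
+    return num_experts // ep
+
+def avg_tokens_per_expert(tokens: int, topk: int, num_experts: int) -> float:
+    """Expected tokens per routed expert under a perfectly balanced router."""
+    return tokens * topk / num_experts
+
+print(experts_per_rank(8, 2))             # the MoE example above → 4
+print(avg_tokens_per_expert(2048, 2, 8))  # one 2048-token sequence → 512.0
+```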
+ +--- + +## See Also + +- [AICB Component](../components/aicb.md) — Complete AICB documentation +- [Inference Simulation](inference_simulation.md) — End-to-end inference simulation guide +- [Supported Models](supported_models.md) — Full model compatibility list diff --git a/docs/images/simai_dingtalk.jpg b/docs/images/simai_dingtalk.jpg index ee2d0416..556d4f2b 100644 Binary files a/docs/images/simai_dingtalk.jpg and b/docs/images/simai_dingtalk.jpg differ diff --git a/docs/images/simai_wechat.jpeg b/docs/images/simai_wechat.jpeg index e70c9f02..cd75e4a0 100644 Binary files a/docs/images/simai_wechat.jpeg and b/docs/images/simai_wechat.jpeg differ diff --git a/docs/zh/benchmarking/index.md b/docs/zh/benchmarking/index.md new file mode 100644 index 00000000..5cd15da1 --- /dev/null +++ b/docs/zh/benchmarking/index.md @@ -0,0 +1,33 @@ +# 基准测试 + +本节涵盖 SimAI 的基准测试与验证方法。 + +--- + +## 目录 + +| 文档 | 说明 | +|----------|-------------| +| [4 场景端到端测试套件](test_suite.md) | 预配置测试场景,覆盖不同模型、并行策略和 PD 配置 | + +--- + +## 基准测试方法 + +SimAI 支持多种基准测试方法: + +### 架构对比 + +在相同工作负载下比较不同网络架构(如 Spectrum-X vs DCN+),评估其性能特征。 + +### 算法对比 + +比较不同集合通信算法(如 RING vs NVLS),了解不同消息大小下的性能权衡。 + +### 参数优化 + +使用 SimAI-Analytical 快速探索并行参数组合(TP、PP、EP、DP),寻找最优配置。 + +### 对比真实硬件验证 + +使用 AICB 物理执行结果作为基准,验证仿真精度。 diff --git a/docs/zh/benchmarking/test_suite.md b/docs/zh/benchmarking/test_suite.md new file mode 100644 index 00000000..a843bf7c --- /dev/null +++ b/docs/zh/benchmarking/test_suite.md @@ -0,0 +1,139 @@ +# 4 场景端到端测试套件 + +SimAI 提供预配置的测试套件,覆盖 4 个典型推理场景,可快速验证所有支持的配置。 + +--- + +## 概述 + +测试套件位于 `vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh`,覆盖不同模型、并行策略和 PD 分离配置的组合。 + +--- + +## 运行 + +```bash +# 运行所有 4 个场景 +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all + +# 运行单个场景 +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 + +# 显示帮助 +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --help +``` + +> **前置条件**:需激活 `conda 
activate vidur` 环境。 + +--- + +## 共享硬件配置 + +所有场景共享以下硬件设置: + +| 参数 | 值 | +|-----------|-------| +| GPU | H20 (h20_dgx) | +| NVLink 带宽 | 1600 Gbps | +| RDMA 带宽 | 800 Gbps | +| PD P2P 带宽 | 800 Gbps | +| PD P2P 数据类型 | fp8 | +| 请求生成器 | 泊松分布,QPS=100 | +| 请求数量 | 4 | +| Prefill Tokens | 100(固定) | +| Decode Tokens | 8(固定) | + +--- + +## 场景配置 + +| 场景 | 模型 | PD 分离 | World Size | TP | PP | EP | 全局调度器 | +|----------|-------|---------------|-----------|----|----|-----|-----------------| +| **1** | Qwen3-Next-80B (MoE) | 否 | 32 (dp=32) | 1 | 1 | 1(默认) | lor | +| **2** | Qwen3-Next-80B (MoE) | 是 (P=2, D=6) | 8 | 1 | 1 | 1(默认) | split_wise | +| **3** | DeepSeek-671B (MoE) | 是 (P=2, D=6) | 8 | 8 | 1 | 8 | split_wise | +| **4** | Qwen3-MoE-235B (MoE) | 是 (P=2, D=6) | 8 | 4 | 1 | 4 | split_wise | + +### 场景详情 + +- **场景 1**:大规模 DP 无 PD 分离 — 测试基线吞吐量 +- **场景 2**:同模型加 PD 分离 — 测试 PD 分离开销 +- **场景 3**:DeepSeek-671B 大 TP/EP — 测试 MoE + MLA 注意力 +- **场景 4**:Qwen3-MoE-235B 中等 TP/EP — 测试 MHA/GQA 注意力模型 + +--- + +## 输出 + +### 输出目录 + +- **通过 run_scenarios.sh**:`examples/vidur-ali-scenarios/simulator_output/` +- **直接 Python 运行**:`./simulator_output/` + +### 输出文件 + +``` +// +├── request_metrics.csv # 每请求指标 +├── chrome_trace.json # Chrome DevTools 时间线 +├── config.json # 配置快照 +└── plots/ # 指标 CSV/JSON 文件 +``` + +### 日志 + +运行日志保存在 `examples/vidur-ali-scenarios/logs/scenario__.log`。 + +--- + +## 架构对比示例 + +### RING vs NVLS(SimAI-Simulation) + +```bash +# NVLS 拓扑和运行 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py --ro -g 32 -gt H100 -bw 400Gbps -nvbw 1360Gbps +AS_SEND_LAT=12 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 8 -w ./example/microAllReduce.txt \ + -n ./Rail_Opti_SingleToR_32g_8gps_400Gbps_H100 -c ./astra-sim-alibabacloud/inputs/config/SimAI.conf + +# RING 拓扑和运行 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py --ro -g 32 -gt H100 -bw 400Gbps -nvbw 1440Gbps +AS_SEND_LAT=2 AS_PXN_ENABLE=1 ./bin/SimAI_simulator -t 8 -w ./example/microAllReduce.txt \ + -n 
./Rail_Opti_SingleToR_32g_8gps_400Gbps_H100 -c ./astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +**结果**(busbw 单位 GB/s): + +| 消息大小 | NVLS | RING | +|-------------|------|------| +| 16M | 148.88 | 141.84 | +| 32M | 178.04 | 153.68 | +| 64M | 197.38 | 160.60 | +| 128M | 208.70 | 163.85 | +| 256M | 214.87 | 165.72 | +| 512M | 218.09 | 166.68 | + +### Spectrum-X vs DCN+(SimAI-Simulation) + +```bash +# 生成拓扑 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo DCN+ -g 256 -psn 64 -bw 400Gbps +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 256 +``` + +**结果**(busbw 单位 GB/s): + +| 消息大小 | Spectrum-X | DCN+ SingleToR | +|-------------|------------|----------------| +| 16M | 33.10 | 23.33 | +| 64M | 42.05 | 23.68 | +| 256M | 45.10 | 36.21 | +| 512M | 45.65 | 36.24 | + +--- + +## 相关文档 + +- [多请求推理仿真](../user_guide/inference_simulation.md) — 完整推理仿真指南 +- [vidur-alibabacloud](../components/vidur.md) — 组件文档 +- [结果分析](../user_guide/result_analysis.md) — 输出解读 diff --git a/docs/zh/community/index.md b/docs/zh/community/index.md new file mode 100644 index 00000000..565892f7 --- /dev/null +++ b/docs/zh/community/index.md @@ -0,0 +1,91 @@ +# 社区 + +欢迎加入 SimAI 社区!SimAI 由阿里云和全球学术机构的研究人员和工程师共同构建。 + +--- + +## 获取帮助 + +- **GitHub Issues**: [github.com/aliyun/SimAI/issues](https://github.com/aliyun/SimAI/issues) — Bug 报告、功能请求、问题咨询 +- **讨论**: 创建 Issue 并以 "Question:" 为前缀 +- **联系邮箱**: Gang Lu (yunding.lg@alibaba-inc.com), Feiyang Xue (xuefeiyang.xfy@alibaba-inc.com), Qingxu Li (qingxu.lqx@alibaba-inc.com) + +--- + +## 社区群 + +通过钉钉或微信加入 SimAI 社区: + +
+  <img src="../../images/simai_dingtalk.jpg" alt="SimAI 钉钉群">
+  <img src="../../images/simai_wechat.jpeg" alt="SimAI 微信群">
+ +--- + +## 活动 + +### 即将举办 + +| 日期 | 活动 | 地点 | 类型 | +|:----:|:------|:---------|:----:| +| — | — | — | — | + +### 历史活动 + +| 日期 | 活动 | 地点 | 类型 | +|:----:|:------|:---------|:----:| +| 2025/12/30 | SimAI 1.5 发布 | 线上 | 线上 | +| 2025/06/04 | 首届 SimAI 社区研讨会 | 北京大学 | 线下 | +| 2025/05/24 | 第 28 届 Chinasys Workshop — SimAI 报告 | 重庆大学 | 线下 | +| 2024/12/27 | SimAI 技术报告 | 北京航空航天大学 | 线下 | +| 2024/12/06 | 香港科技大学(广州)技术研讨会 | 香港科技大学(广州) | 线下 | +| 2024/12/05 | Bench'24 会议 — SimAI 教程 | 广州 | 线下 | +| 2024/11/26 | SimAI 社区直播(400+ 参与者) | 线上 | 线上 | +| 2024/11/15 | 技术研讨会 | 千岛湖 | 线下 | +| 2024/10/18 | 客座讲座 — SimAI 教程 | 复旦大学 | 线下 | +| 2024/09/24-26 | CCF HPC China 2024 | 武汉 | 会议 | + +--- + +## 引用 + +SimAI 已被 **NSDI'25 Spring** 接收。如果您在研究中使用了 SimAI,请引用: + +```bibtex +@inproceedings{simai-nsdi25, + title={SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision}, + booktitle={NSDI'25 Spring}, + year={2025} +} +``` + +**相关资源:** +- [论文 (PDF)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) +- [幻灯片](../../docs/SimAI_Intro_Online.pdf) +- [视频回放](https://n.dingtalk.com/dingding/live-room/index.html?roomId=OF5BkBUXVxmgsK7x&liveUuid=305736cd-aa70-498b-8003-2b471a53decd) + +--- + +## 致谢 + +主要贡献者: + + +- TianHao Fu (Peking University) and [TELOS-syslab](https://github.com/TELOS-syslab/) +- Parth Parikh (KEYSIGHT) +- Sarah-Michelle Hammer & Ziyi Wang (TU-Berlin) +- Xinyue Li (BUPT) +- Tong Chen (Zhejiang University) +- Ming Wang (BUPT) +- Tao Jiang (Institute of Computing Technology, Chinese Academy of Sciences) + +以及其他个人贡献者——详见 [SimAI 贡献者列表](https://github.com/aliyun/SimAI/graphs/contributors)。 + +特别感谢 Chenning Li(MIT CSAIL)发起将 SimAI 集成到 [M4](https://github.com/netiken/m4) 的合作。 + +--- + +## 参与贡献 + +我们热烈欢迎社区贡献!详见 [贡献指南](../developer_guide/contributing.md)。 diff --git a/docs/zh/components/aicb.md b/docs/zh/components/aicb.md new file mode 100644 index 00000000..b0842a18 --- /dev/null +++ b/docs/zh/components/aicb.md @@ 
-0,0 +1,212 @@ +# AICB — AI 通信基准测试 + +**仓库**: [aliyun/aicb](https://github.com/aliyun/aicb) | **语言**: Python + +AICB 是面向 AI 场景的专用通信基准测试套件。它能生成与真实 LLM 训练和推理流程对齐的通信工作负载。 + +--- + +## 简介 + +AICB(Artificial Intelligence Communication Benchmark)生成与真实应用精准对齐的通信工作负载模式,支持: + +- 基准测试和调优 GPU 集群通信系统 +- 研究特定模型配置的通信模式 +- 为 SimAI 等仿真器生成工作负载 + +--- + +## 基准测试套件 + +AICB 提供 10 个预配置的基准测试案例,覆盖典型 LLM 配置: + +| 编号 | 模型 | 序列长度 | 框架 | TP | PP | SP | MoE | +|----|-------|-----------|-----------|----|----|-----|-----| +| 1 | LLaMA-7B | 2048 | Megatron | 1 | 1 | - | - | +| 2 | GPT-13B | 2048 | Megatron | 2 | 1 | 是 | - | +| 3 | GPT-22B | 2048 | Megatron | 4 | 1 | - | - | +| 4 | LLaMA-65B | 4096 | Megatron | 8 | 2 | 是 | - | +| 5 | GPT-175B | 2048 | Megatron | 8 | 8 | 是 | - | +| 6 | GPT-175B | 2048 | Megatron | 8 | 8 | - | - | +| 7 | Llama3-405B | 8192 | Megatron | 8 | 16 | 是 | - | +| 8 | LLaMA-7B | 4096 | DeepSpeed | 1 | 1 | - | Zero-2 | +| 9 | LLaMA-65B | 4096 | DeepSpeed | 1 | 1 | - | Zero-3 | +| 10 | Mistral-8x7B | 2048 | Megatron | 2 | 1 | 是 | 8 experts | + +--- + +## 环境搭建 + +### Docker + +```bash +docker build -t aicb:v0.0.1 . 
+docker run --gpus all --net host --shm-size 16g -it --rm aicb:v0.0.1 +``` + +### 本地环境 + +要求:Python >= 3.8、CUDA >= 11.8、PyTorch >= 2.0.0、NVIDIA APEX + +### NGC 容器 + +```bash +docker pull nvcr.io/nvidia/pytorch:xx.xx-py3 +docker run --gpus all -it --rm -v /path/to/aicb:/workspace/aicb nvcr.io/nvidia/pytorch:xx.xx-py3 +``` + +> **注意**:推理工作负载 Profiling 需要 NVIDIA Hopper (SM90) 或 Blackwell (SM100) GPU。 + +--- + +## 物理集群执行 + +### 环境变量 + +| 参数 | 说明 | +|-----------|-------------| +| `nnodes` | 节点数量 | +| `node_rank` | 当前节点编号 | +| `nproc_per_node` | 每节点 GPU 数 | +| `master_addr` | 主节点地址 | +| `master_port` | 主节点端口 | + +### 运行 Megatron 工作负载 + +```bash +sh scripts/megatron_gpt.sh \ + --nnodes 1 --node_rank 0 --nproc_per_node 8 \ + --master_addr localhost --master_port 29500 \ + -m 7 --world_size 8 --tensor_model_parallel_size 2 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 16 --micro_batch 1 \ + --seq_length 2048 --swiglu --use_flash_attn --aiob_enable +``` + +### 运行 MoE 工作负载 + +```bash +sh scripts/megatron_gpt.sh \ + -m moe --world_size 8 --tensor_model_parallel_size 4 \ + --moe_enable --expert_model_parallel_size 1 \ + --num_experts 4 --moe_router_topk 2 \ + --frame Megatron --global_batch 16 --micro_batch 1 \ + --sp --grouped_gemm --aiob_enable --swiglu --use_flash_attn +``` + +### 运行 DeepSeek 工作负载 + +```bash +sh scripts/megatron_gpt.sh \ + --frame DeepSeek -m deepseek \ + --tensor_model_parallel_size 4 --moe_enable \ + --expert_model_parallel_size 1 --num_experts 4 \ + --global_batch 4 --micro_batch 1 --world_size 4 \ + --num_layers 10 --sp --swiglu --aiob_enable +``` + +--- + +## 训练工作负载生成 + +为 SimAI 仿真生成工作负载文件: + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name GPT-13B --frame=Megatron \ + --world_size=16 --tensor_model_parallel_size=2 --pipeline_model_parallel=1 \ + --global_batch=16 --micro_batch=1 --num_layers=40 --seq_length=2048 \ + --hidden_size=5120 --epoch_num=1 --num_attention_heads=40 \ + --aiob_enable 
--use_flash_attn --swiglu +``` + +输出保存在 `results/mocked_workload/`。 + +--- + +## 推理工作负载生成 + +AICB 为以下模型生成带 Prefill/Decode 阶段分离的推理工作负载: + +| 模型 | 注意力架构 | MoE 专家数 | +|-------|-----------|-------------| +| DeepSeek-V3-671B | MLA | 256 路由 + 1 共享 | +| Qwen3-MoE-235B | MHA/GQA | 128 路由 | +| Qwen3-Next-80B | 混合(全注意力 + 线性注意力) | 512 路由 | + +需要硬件加速库:[DeepGEMM](https://github.com/deepseek-ai/DeepGEMM)、[FlashMLA](https://github.com/deepseek-ai/FlashMLA)、[FlashInfer](https://github.com/flashinfer-ai/flashinfer)。 + +--- + +## AIOB:计算性能分析 + +AIOB 可采集实际 GPU 计算耗时并嵌入工作负载: + +- `--aiob_enable` — 在当前 GPU 上进行 Profiling +- `--comp_filepath ` — 使用已有 Profiling 数据 + +输出保存在 `results/aiob_outputs/`。 + +--- + +## 自定义模型开发 + +AICB 支持使用 `MockedParam` 和 `MockedModel` 基类为自定义模型架构创建工作负载。 + +训练过程被抽象为:`init → forward → backward → step` + +每条工作负载项包含: +1. **通信信息**:`comm_type`、`comm_group`、`comm_group_size`、`msg_size` +2. **附加信息**:源节点(broadcast 场景)、计算耗时 +3. **运行时信息**:`elapsed_time`、`algo_bw`、`bus_bw` + +可参考现有 `MockedMegatron` 和 `MockedDeepSpeed` 实现。 + +--- + +## 关键参数 + +| 类别 | 参数 | 说明 | +|----------|-----------|-------------| +| 框架 | `frame` | Megatron / DeepSpeed / DeepSeek | +| 模型 | `model_size` | 预配置大小(7/13/22/175/moe/deepseek) | +| 训练 | `world_size` | 总 GPU 数量 | +| | `global_batch` | 总批量大小 | +| | `micro_batch` | 微批量大小 | +| | `seq_length` | 序列长度 | +| 并行策略 | `tensor_model_parallel_size` | TP 度 | +| | `pipeline_model_parallel` | PP 度 | +| | `expert_model_parallel_size` | EP 度 | +| MoE | `moe_enable` | 启用 MoE | +| | `num_experts` | 专家数量 | +| | `moe_router_topk` | 每 Token 专家数 | +| DeepSeek | `qk_rope_dim` | QK 的 RoPE 维度 | +| | `kv_lora_rank` | KV 压缩 LoRA 维度 | +| | `q_lora_rank` | Q 压缩 LoRA 维度 | +| | `n_shared_expert` | 共享专家数 | +| 优化 | `use_flash_attn` | FlashAttention | +| | `swiglu` | SwiGLU 激活函数 | +| | `aiob_enable` | AIOB 计算 Profiling | +| | `comp_filepath` | 预有计算文件 | + +--- + +## 结果输出 + +### 物理执行 + +- 每次通信日志:类型、分组、消息大小、执行时间、吞吐量 +- 每次迭代耗时分析 +- CSV 输出在 `results/comm_logs/` + +### 工作负载文件 + +- 
训练工作负载:`results/mocked_workload/` 或 `results/workload/` +- AIOB Profiling:`results/aiob_outputs/` + +--- + +## 相关文档 + +- [工作负载生成指南](../user_guide/workload_generation.md) — 用户指南中的工作负载生成 +- [支持的模型](../user_guide/supported_models.md) — 完整模型列表 +- [Tutorial](https://github.com/aliyun/aicb/blob/master/training/tutorial.md) — AICB 详细教程 diff --git a/docs/zh/components/astra_sim.md b/docs/zh/components/astra_sim.md new file mode 100644 index 00000000..52541430 --- /dev/null +++ b/docs/zh/components/astra_sim.md @@ -0,0 +1,147 @@ +# astra-sim-alibabacloud — 仿真引擎 + +**位置**: 项目内(`astra-sim-alibabacloud/`) | **语言**: C++ + +SimAI 的核心仿真引擎,扩展自 [astra-sim 1.0](https://github.com/astra-sim/astra-sim/tree/ASTRA-sim-1.0)。支持三种运行模式,并集成了 NCCL 算法与自定义增强。 + +--- + +## 概述 + +astra-sim-alibabacloud 是 SimAI 仿真的中央调度器: + +- 接收 AICB 生成的工作负载 +- 使用 SimCCL 将集合操作分解为 P2P 传输 +- 通过 NS-3(仿真模式)或直接 RDMA(物理模式)驱动网络仿真 +- 使用 busbw 参数进行时间计算(分析模式) + +--- + +## 三种运行模式 + +### SimAI-Analytical + +使用总线带宽(busbw)进行快速分析仿真,估算集合通信耗时。 + +**编译**: `./scripts/build.sh -c analytical` +**二进制**: `bin/SimAI_analytical` + +### SimAI-Simulation + +使用 NS-3 网络后端的全栈仿真,实现细粒度网络建模。 + +**编译**: `./scripts/build.sh -c ns3` +**二进制**: `bin/SimAI_simulator` + +### SimAI-Physical + +在真实硬件上使用 RDMA 生成物理流量。 + +**编译**: `./scripts/build.sh -c phy` +**二进制**: `bin/SimAI_phynet` + +--- + +## 核心组件 + +| 组件 | 说明 | +|-----------|-------------| +| **AstraComputeAPI** | 管理计算时序和调度 | +| **MemoryAPI** | 处理内存分配和追踪 | +| **NetworkAPI** | 网络后端接口(NS-3、物理网络) | +| **MockNcclGroup** | 模拟 NCCL 通信组 | +| **MockNcclChannel** | 管理单个通信通道 | +| **SimAiFlowModelRdma** | RDMA 流量模型 | + +--- + +## 配置 + +### SimAI.conf + +主配置文件位于 `astra-sim-alibabacloud/inputs/config/SimAI.conf`,控制以下仿真参数: + +- 通信算法 +- 缓冲区大小 +- 时序参数 +- 网络后端设置 + +### 环境变量(仿真模式) + +| 变量 | 说明 | 默认值 | +|----------|-------------|---------| +| `AS_LOG_LEVEL` | 日志级别:DEBUG、INFO、WARNING、ERROR | `INFO` | +| `AS_PXN_ENABLE` | 启用 PXN(Proxied NVLINK) | `0`(禁用) | +| `AS_NVLS_ENABLE` | 启用 NVLS(NVLink Sharp) | `0`(禁用) | +| 
`AS_SEND_LAT` | 包发送延迟(us) | `6` | +| `AS_NVLSTREE_ENABLE` | 启用 NVLS Tree 算法 | `false` | + +### 仿真参数 + +| 参数 | 说明 | 默认值 | +|-----------|-------------|---------| +| `-t` / `--thread` | 加速线程数 | `1`(建议 8-16) | +| `-w` / `--workload` | 工作负载文件路径 | 必需 | +| `-n` / `--network-topo` | 网络拓扑文件路径 | 必需(仿真模式) | +| `-c` / `--config` | SimAI 配置文件 | 必需 | + +--- + +## 拓扑生成 + +astra-sim 通过 `gen_Topo_Template.py` 提供 5 种拓扑模板: + +### 可用模板 + +| 模板 | 架构 | 说明 | +|----------|-------------|-------------| +| `Spectrum-X` | NVIDIA Spectrum-X | Rail-optimized,单 ToR,单 Plane | +| `AlibabaHPN`(单 Plane) | Alibaba HPN 7.0 | 双 ToR,Rail-optimized,单 Plane | +| `AlibabaHPN`(双 Plane) | Alibaba HPN 7.0 | 双 ToR,Rail-optimized,双 Plane | +| `DCN+`(单 ToR) | DCN+ | 单 ToR,非 Rail-optimized | +| `DCN+`(双 ToR) | DCN+ | 双 ToR,非 Rail-optimized | + +### 拓扑参数 + +| 层级 | 参数 | 说明 | +|-------|-----------|-------------| +| **全局** | `-topo` | 模板名称 | +| | `-g` | GPU 数量 | +| | `--dp` | 启用双 Plane | +| | `--ro` | 启用 Rail-optimized | +| | `--dt` | 启用双 ToR | +| **服务器内** | `-gps` | 每服务器 GPU 数 | +| | `-gt` | GPU 型号(A100/H100) | +| | `-nvbw` | NVLink 带宽 | +| | `-nl` | NVLink 延迟 | +| **Segment 内** | `-bw` | NIC 到 ASW 带宽 | +| | `-asw` | ASW 交换机数量 | +| | `-nps` | 每交换机 NIC 数 | +| **Pod 内** | `-psn` | PSW 交换机数量 | +| | `-apbw` | ASW 到 PSW 带宽 | + +### 示例 + +```bash +# Spectrum-X 128 GPU +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +# 双 Plane AlibabaHPN 64 GPU +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo AlibabaHPN --dp -g 64 -asn 16 -psn 16 + +# 双 ToR DCN+ 128 GPU +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo DCN+ --dt -g 128 -asn 2 -psn 8 +``` + +--- + +## 相关文档 + +- [SimAI-Analytical 使用指南](../user_guide/simai_analytical.md) — 分析模式使用 +- [SimAI-Simulation 使用指南](../user_guide/simai_simulation.md) — NS-3 仿真使用 +- [SimAI-Physical 使用指南](../user_guide/simai_physical.md) — 物理模式使用 +- [NS-3 组件](ns3.md) 
— 网络后端详情 +- [SimCCL 组件](simccl.md) — 集合通信分解 diff --git a/docs/zh/components/index.md b/docs/zh/components/index.md new file mode 100644 index 00000000..7635e442 --- /dev/null +++ b/docs/zh/components/index.md @@ -0,0 +1,80 @@ +# 组件概述 + +SimAI 是一个模块化项目,由 5 个核心组件组成。各组件可独立使用,也可组合使用以实现不同的仿真场景。 + +--- + +## 架构 + +``` + |--- AICB (工作负载生成 & 计算性能分析) +SimAI --|--- SimCCL (集合通信算法分析) + |--- astra-sim-alibabacloud (仿真引擎:Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 网络后端) + |--- vidur-alibabacloud (多请求推理调度 & 显存管理) +``` + +--- + +## 组件摘要 + +| 组件 | 语言 | 仓库 | 说明 | +|------|------|------|------| +| [AICB](aicb.md) | Python | [aliyun/aicb](https://github.com/aliyun/aicb) | AI 通信基准测试——训练和推理工作负载生成 | +| [SimCCL](simccl.md) | Python | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | 集合通信到点对点通信转换 | +| [astra-sim-alibabacloud](astra_sim.md) | C++ | In-tree | 支持 3 种模式的核心仿真引擎 | +| [ns-3-alibabacloud](ns3.md) | C++ | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | 带 RDMA/数据中心扩展的 NS-3 网络仿真后端 | +| [vidur-alibabacloud](vidur.md) | Python | In-tree | 支持 PD 分离和请求调度的 LLM 推理仿真 | + +--- + +## 场景与组件组合 + +| 场景 | AICB | SimCCL | astra-sim | ns-3 | vidur | +|------|------|--------|-----------|------|-------| +| AICB 测试套件(物理 GPU) | 必需 | - | - | - | - | +| 工作负载生成 | 必需 | - | - | - | - | +| 集合通信分析 | - | 必需 | - | - | - | +| SimAI-Analytical | 必需 | - | 必需(analytical) | - | - | +| SimAI-Simulation | 必需 | 必需 | 必需(simulation) | 必需 | - | +| SimAI-Physical | 必需 | 必需 | 必需(physical) | - | - | +| 推理仿真 | 必需 | 必需 | 必需 | 可选 | 必需 | + +--- + +## 数据流 + +``` +AICB(工作负载生成) + | + |-- 训练工作负载 (.txt) --> astra-sim-alibabacloud + |-- 推理工作负载 -------> vidur-alibabacloud + | +SimCCL(集合 → P2P) + | + |--> astra-sim-alibabacloud(Simulation/Physical 模式) + | +astra-sim-alibabacloud(仿真引擎) + | + |-- Analytical 模式:busbw 估算 + |-- Simulation 模式:NS-3 后端 + |-- Physical 模式:RDMA 流量注入 + | +ns-3-alibabacloud(网络后端) + | + |--> 细粒度网络仿真结果 + | +vidur-alibabacloud(推理调度) + | + |--> request_metrics.csv, 
chrome_trace.json, plots/ +``` + +--- + +## 组件详细文档 + +- **[AICB](aicb.md)** — 工作负载生成、基准测试套件、AIOB 计算分析 +- **[SimCCL](simccl.md)** — 集合通信分解 +- **[astra-sim-alibabacloud](astra_sim.md)** — 核心仿真引擎、配置、拓扑生成 +- **[ns-3-alibabacloud](ns3.md)** — RDMA 网络仿真、CC 算法、分析工具 +- **[vidur-alibabacloud](vidur.md)** — 推理仿真、PD 分离、GPU 显存管理 diff --git a/docs/zh/components/ns3.md b/docs/zh/components/ns3.md new file mode 100644 index 00000000..d682f568 --- /dev/null +++ b/docs/zh/components/ns3.md @@ -0,0 +1,175 @@ +# ns-3-alibabacloud — 网络仿真后端 + +**仓库**: [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | **语言**: C++ + +基于 NS-3 的网络仿真器,作为 SimAI 的网络后端,扩展了面向数据中心/RDMA 的端到端建模能力。 + +--- + +## 概述 + +相比上游 [NS-3](https://www.nsnam.org/),ns-3-alibabacloud 在点对点模块上扩展了全面的数据中心网络特性: + +- **QBB/PFC + 多优先级队列** — 8 个优先级队列,支持 PAUSE/RESUME 处理 +- **ECN + CNP 反馈** — 交换机侧 ECN 标记和接收端拥塞通知 +- **RDMA 主机协议栈(QP 级别)** — 完整 QP 建模,支持 5 种拥塞控制算法 +- **交换机和 NVSwitch 建模** — ECMP 转发、缓冲区管理、PFC 逻辑 + +### dev/qp 分支 + +[dev/qp](https://github.com/aliyun/ns-3-alibabacloud/tree/dev/qp) 分支包含额外增强: + +1. 基于实际 RDMA 逻辑的 QP 创建/销毁支持 +2. 按 IP 或按 QP 的 NIC CC 配置 +3. 优化的 Max-Min 调度逻辑 +4. 
解耦的 CC 模块,提升模块化程度 + +--- + +## 核心模块 + +### QBB 网络设备(`qbb-net-device`) + +基于 `PointToPointNetDevice` 构建的支持 QBB 的网络设备,具有 8 个优先级。特性: + +- PFC PAUSE/RESUME 处理 +- `RdmaEgressQueue`:高优先级 ACK/NACK 队列 + QP 间轮询 +- `BEgressQueue`:交换机端口轮询 +- NVSwitch 发送路径支持(NVLS 模式) + +**关键属性**: `QbbEnabled`、`QcnEnabled`、`DynamicThreshold`、`PauseTime`、`NVLS_enable` + +### RDMA 主机协议栈(`rdma-hw`) + +主机 RDMA 核心实现: + +- QP 创建/删除生命周期 +- 报文构造(PPP + IPv4 + UDP + SeqTs 头) +- ACK/NACK/CNP 处理 +- 按 QP 的拥塞控制算法 +- NVSwitch 路由表 + +**拥塞控制算法**: + +| 算法 | 说明 | +|-----------|-------------| +| **DCQCN** | 数据中心量化拥塞通知 | +| **HPCC** | 高精度拥塞控制 | +| **TIMELY** | 基于 RTT 的拥塞控制 | +| **DCTCP** | 数据中心 TCP | +| **HPCC-PINT** | HPCC + 概率 INT | + +**协议号(IPv4 Protocol 字段)**: + +| 协议 | 编号 | 说明 | +|----------|--------|-------------| +| UDP 数据 | `0x11` | 普通数据报文 | +| CNP | `0xFF` | 拥塞通知报文 | +| PFC | `0xFE` | 优先级流控 | +| ACK | `0xFC` | 确认报文 | +| NACK | `0xFD` | 否定确认报文 | + +### 交换机节点(`switch-node`) + +交换机流水线实现: +- ECMP 转发(5 元组哈希) +- 通过 MMU 进行准入控制 +- PFC Pause/Resume 生成 +- ECN 标记 +- INT/PINT 注入(用于 HPCC/HPCC-PINT) + +### 交换机 MMU(`switch-mmu`) + +交换机缓冲区/MMU 模型: +- 入口/出口记账 +- 共享缓冲区和 Headroom 管理 +- PFC 触发/恢复逻辑 +- ECN 标记概率曲线(`kmin/kmax/pmax`) + +### NVSwitch 节点(`nvswitch-node`) + +用于服务器内 GPU 通信的 NVSwitch 模型,配合 `RdmaHw`/`QbbNetDevice` 中的 NVLS 路由逻辑。 + +### QP 状态(`rdma-queue-pair`) + +按 QP 和按 RxQP 的状态管理,包括: +- 窗口和速率控制 +- 已确认序列号追踪 +- 按 CC 算法的状态(DCQCN alpha/targetRate、HPCC hop state、TIMELY RTT、DCTCP alpha/ecnCnt、PINT state) + +--- + +## 分析工具 + +位于 `ns-3-alibabacloud/analysis/`: + +### FCT 分析 + +```bash +python fct_analysis.py -h # 查看使用帮助 +``` + +读取 FCT 输出文件,生成流完成时间(FCT)分析统计。 + +### Trace 阅读器 + +```bash +# 编译 +make trace_reader + +# 使用 +./trace_reader <.tr 文件> [过滤表达式] + +# 过滤示例 +./trace_reader output.tr "time > 2000010000" +./trace_reader output.tr "sip=0x0b000101&dip=0x0b000201" +``` + +### Trace 输出格式 + +``` +2000055540 n:338 4:3 100608 Enqu ecn:0 0b00d101 0b012301 10000 100 U 161000 0 3 1048(1000) +``` + +字段:时间戳、节点、端口:队列、队列长度、事件、ECN、源 IP、目的 
IP、源端口、目的端口、报文类型、序列号、发送时间、优先级、大小(载荷) + +--- + +## 头部和工具 + +| 文件 | 说明 | +|------|-------------| +| `qbb-header` | ACK/NACK 头(含可选 INT 头) | +| `cn-header` | CNP 头(反馈字段) | +| `pause-header` | PFC Pause 头 | +| `pint` | PINT 编解码工具 | +| `trace-format.h` | 用于离线分析的二进制 Trace 记录结构 | + +--- + +## 扩展指南 + +### 添加新拥塞控制算法 + +1. **主要修改**: `rdma-hw.{h,cc}` — 添加 `HandleAckX`/`UpdateRateX` 方法,按 `m_cc_mode` 分发 +2. **通常需要**: `rdma-queue-pair.h` — 添加新的按 QP 状态变量 +3. **如需交换机反馈**: `switch-node.cc` — 添加 INT/PINT 或新标记 + +### 修改交换机行为 + +1. **主要修改**: `switch-mmu.{h,cc}` — 修改阈值、曲线、公式 +2. **标记/注入**: `switch-node.cc::SwitchNotifyDequeue()` +3. **准入/优先级**: `switch-node.cc::SendToDev()` + +### 添加新控制报文 + +1. 在 `model/` 中创建新 `*Header`(参照 `CnHeader`/`PauseHeader` 模式) +2. 在 `QbbNetDevice::Receive()` 或 `RdmaHw::Receive()` 中添加解析 + +--- + +## 相关文档 + +- [SimAI-Simulation 使用指南](../user_guide/simai_simulation.md) — 全栈仿真使用 +- [astra-sim 组件](astra_sim.md) — 仿真引擎 +- [NS-3 扩展指南](../developer_guide/extending_ns3.md) — 详细扩展指南 diff --git a/docs/zh/components/simccl.md b/docs/zh/components/simccl.md new file mode 100644 index 00000000..cceac180 --- /dev/null +++ b/docs/zh/components/simccl.md @@ -0,0 +1,82 @@ +# SimCCL — 集合通信库 + +**仓库**: [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | **语言**: Python/C++ + +SimCCL 将集合通信操作转换为点对点通信,是工作负载层与仿真引擎之间的关键桥梁。 + +--- + +## 概述 + +在分布式 LLM 训练中,集合通信操作(AllReduce、AllGather、ReduceScatter、AlltoAll 等)是基础构建块。SimCCL 将这些高层集合操作分解为点对点通信序列,以便网络后端精确仿真。 + +--- + +## 在 SimAI 中的角色 + +SimCCL 位于 AICB(工作负载生成)和 astra-sim-alibabacloud(仿真引擎)之间: + +``` +AICB 生成包含集合操作的工作负载 + | + v +SimCCL 分解集合操作 → 点对点通信 + | + v +astra-sim 将 P2P 流量发送到 NS-3 或物理网络 +``` + +SimCCL 在以下场景中**必需**: +- **SimAI-Simulation** — 全栈 NS-3 仿真 +- **SimAI-Physical** — 物理 RDMA 流量生成 +- **推理仿真** — 使用 SimAI Simulation 后端时 + +SimCCL 在以下场景中**不需要**: +- **SimAI-Analytical** — 直接使用 busbw 估算 + +--- + +## 版本 + +### 基础版(mocknccl) + +基础实现目前位于 [astra-sim-alibabacloud](https://github.com/aliyun/SimAI/tree/master/astra-sim-alibabacloud) 仓库中。文件以 
`mocknccl` 为前缀,提供基本的集合→P2P 转换功能。 + +### 完整版 + +具备高级集合通信算法的完整 SimCCL 库可在 [SimCCL 仓库](https://github.com/aliyun/SimCCL) 获取。 + +--- + +## 支持的集合操作 + +| 操作 | 说明 | +|-----------|-------------| +| AllReduce | 跨所有 Rank 进行归约,结果在所有 Rank 上可用 | +| AllGather | 从所有 Rank 收集数据,结果在所有 Rank 上可用 | +| ReduceScatter | 跨所有 Rank 进行归约并分发 | +| AlltoAll | 全对全个性化通信 | +| Broadcast | 从一个 Rank 广播到所有其他 Rank | + +--- + +## 与 astra-sim 的集成 + +SimCCL 通过 `MockNcclGroup` 和 `MockNcclChannel` 接口与 astra-sim-alibabacloud 集成: + +- **MockNcclGroup**:管理参与集合操作的一组 Rank +- **MockNcclChannel**:处理集合操作中特定 Channel 的实际点对点数据传输 + +分解过程考虑: +- 网络拓扑(Ring、Tree 等) +- 参与的 Rank 数量 +- 消息大小 +- 可用通信通道 + +--- + +## 相关文档 + +- [组件概述](index.md) — SimAI 组件架构 +- [astra-sim 组件](astra_sim.md) — 消费 SimCCL 输出的仿真引擎 +- [NS-3 组件](ns3.md) — P2P 仿真的网络后端 diff --git a/docs/zh/components/vidur.md b/docs/zh/components/vidur.md new file mode 100644 index 00000000..bc6d2463 --- /dev/null +++ b/docs/zh/components/vidur.md @@ -0,0 +1,194 @@ +# vidur-alibabacloud — LLM 推理仿真 + +**位置**: 项目内(`vidur-alibabacloud/`) | **语言**: Python | **许可**: MIT + +vidur-alibabacloud 是 SimAI 的 LLM 推理仿真组件,改编自微软 [Vidur](https://github.com/microsoft/vidur)。提供多请求推理调度、GPU 显存管理和 Prefill-Decode(PD)分离支持。 + +--- + +## 核心特性 + +- **Prefill-Decode(PD)分离** — 在不同节点上运行 Prefill 和 Decode 阶段,实现弹性资源分配和性能隔离。灵感来自 [splitwise-sim](https://github.com/Mutinifni/splitwise-sim) +- **灵活的并行策略** — 数据并行(DP)、张量并行(TP)、流水线并行(PP)、专家并行(EP) +- **多种执行后端** — AICB/AIOB、SimAI Simulation (NS-3)、SimAI Analytical、原生 Vidur +- **工作负载生成与回放** — 合成请求(固定/泊松分布)或真实 Trace 回放 +- **细粒度指标** — TTFT、TBT/TPOT、端到端延迟、通信开销、计算开销、调度延迟 + +--- + +## GPU 显存计算模块 + +该模块为推理场景下的 MoE 模型提供精确的 GPU 显存估算。 + +### 组件 + +| 组件 | 文件 | 说明 | +|-----------|------|-------------| +| **ParamCounter** | `vidur/utils/param_counter.py` | 按层和按设备的参数计数,支持 MLA、MHA/GQA、线性注意力和 MoE 专家。PD 分离下返回 `(total_params, prefill_params, decode_params)` | +| **MemoryPlanner** | `vidur/scheduler/utils/memory_planner.py` | 规划 GPU 显存预算:`可用 = GPU显存 * (1 - margin) - 参数显存`,计算 KV Cache 
容量和最大并发请求数。含 OOM 检测 | +| **按请求 KV Cache 追踪** | `vidur/entities/replica.py` | 按请求分配/释放 KV Cache 显存,支持运行时剩余容量查询 | + +### 支持的注意力架构 + +| 架构 | 模型 | 说明 | +|---|---|---| +| **MLA**(多头潜注意力) | DeepSeek-V3-671B | LoRA 压缩 KV Cache(`kv_lora_rank` + `qk_rope_head_dim`),相比 MHA 节省约 57 倍显存 | +| **MHA / GQA** | Qwen3-MoE-235B | 标准 KV Cache,每 Token 每层 `num_kv_heads * head_dim` | +| **混合全注意力 + 线性注意力** | Qwen3-Next-80B | 全注意力与线性(GDN)注意力每 4 层交替 | + +--- + +## 支持的模型 + +| 模型 | 注意力 | 专家 | 状态 | +|-------|-----------|---------|--------| +| DeepSeek-V3-671B | MLA | 256 路由 + 1 共享 | PP/EP 适配中 | +| Qwen3-MoE-235B | MHA/GQA | 128 路由 | PP/EP 适配中 | +| Qwen3-Next-80B | 混合 | 512 路由 | PP/EP 适配中 | +| Meta-Llama-3-8B / 70B | MHA | 稠密 | 已支持 | +| Llama-2-7b / 70b | MHA | 稠密 | 已支持 | +| CodeLlama-34b | MHA | 稠密 | 已支持 | +| InternLM-20B | MHA | 稠密 | 已支持 | +| Qwen-72B | MHA | 稠密 | 已支持 | + +--- + +## 环境搭建 + +### Docker(推荐) + +```bash +docker build -t simai:latest . +docker run --gpus all -it --rm simai:latest +``` + +> 在 Hopper GPU 上使用时,在 Dockerfile 中添加 `ENV FLASH_MLA_DISABLE_SM100=1`。 + +### Conda + +```bash +cd vidur-alibabacloud +conda env create -p ./env -f ./environment.yml +conda activate vidur +pip install -r requirements.txt +``` + +--- + +## 关键输入参数 + +| 参数 | 说明 | 默认值 | +|-----------|-------------|---------| +| `--replica_config_model_name` | HuggingFace 模型 ID 或配置路径 | 必需 | +| `--cluster_config_num_replicas` | 副本数量(DP) | 1 | +| `--replica_config_tensor_parallel_size` | TP 度 | 1 | +| `--replica_config_num_pipeline_stages` | PP 阶段数 | 1 | +| `--replica_config_expert_model_parallel_size` | EP 度 | 1 | +| `--replica_config_pd_node_ratio` | P:D 节点比例(如 `"2:6"`) | `""`(无 PD) | +| `--cluster_config_global_scheduler_type` | 全局调度器:`lor` / `round_robin` / `split_wise` | `lor` | +| `--cluster_config_replica_scheduler_type` | 副本调度器:`sarathi` / `split_wise` | `sarathi` | +| `--request_generator_config_type` | `synthetic` / `trace_replay` | `synthetic` | +| `--synthetic_request_generator_config_num_requests` | 生成请求数 | 
100 | +| `--poisson_request_generator_config_qps` | 每秒请求数(泊松模式) | 1.0 | +| `--replica_config_device` | GPU 型号(如 `h20_dgx`) | 必需 | +| `--replica_config_network_device` | 网络类型 | 与 device 相同 | +| `--execution_time_predictor_config_type` | 后端:`aicb` / `simai_simulation` / `simai_analytical` / `random_forrest` | `random_forrest` | +| `--nvlink_bandwidth_gbps` | NVLink 带宽 | 1600 | +| `--rdma_bandwidth_gbps` | RDMA 带宽 | 800 | +| `--pd_p2p_bandwidth_gbps` | PD 节点间 P2P 带宽 | 800 | +| `--replica_config_fp8_enabled` | 启用 FP8 量化 | false | +| `--replica_config_memory_margin_fraction` | GPU 显存安全余量 | 0.1 | + +--- + +## 输出文件 + +每次运行产生以下输出: + +| 文件 | 说明 | +|------|-------------| +| `request_metrics.csv` | 每请求指标(17 列) | +| `chrome_trace.json` | 时间线 Trace,可在 Chrome `chrome://tracing` 中可视化 | +| `config.json` | 配置快照 | +| `plots/` | 指标可视化图表 | + +### request_metrics.csv 列说明 + +| 列名 | 说明 | +|--------|-------------| +| `request_id` | 请求唯一标识 | +| `arrived_at` | 请求到达时间 | +| `scheduled_at` | 首次调度时间 | +| `completed_at` | 请求完成时间 | +| `prefill_completed_at` | Prefill 完成时间(首 Token) | +| `num_prefill_tokens` | 输入 Token 数 | +| `num_decode_tokens` | 生成 Token 数 | +| `scheduling_delay` | 调度前等待时间 | +| `e2e_time` | 端到端延迟 | +| `e2e_time_normalized` | E2E 延迟 / num_decode_tokens | +| `execution_time` | 实际 GPU 执行时间 | +| `preemption_time` | 被抢占时间 | +| `num_restarts` | 重启次数 | +| `prefill_e2e_time` | TTFT(首 Token 时间) | +| `decode_time_normalized` | 平均 TBT(Token 间隔时间) | +| `total_comm_cost` | 总通信耗时 | +| `total_compute_cost` | 总计算耗时 | + +--- + +## 仿真指标(23 项) + +仿真器记录以下指标(详见 `vidur-alibabacloud/docs/metrics.md`): + +1. `request_inter_arrival_delay_histogram` — 请求到达间隔分布 +2. `request_num_tokens_histogram` — Token 数量分布(Prefill + Decode) +3. `request_num_restarts_histogram` — 重启次数分布 +4. `request_e2e_time_cdf` — 端到端延迟 CDF +5. `request_e2e_time_normalised_cdf` — 归一化 E2E 延迟 CDF +6. `request_execution_plus_preemption_times_cdf` — 执行 + 抢占时间 CDF +7. `request_scheduling_delay_cdf` — 调度延迟 CDF +8. 
`request_execution_time_cdf` — 纯执行时间 CDF +9. `request_preempted_time_cdf` — 抢占时间 CDF +10. `decode_token_execution_plus_preemption_times` — 按 Token 的 inter-token 延迟 CDF +11. `batch_num_tokens_cdf` — 批次总 Token 数 CDF +12. `batch_sizes_cdf` — 批次大小 CDF +13. `prefill_time_e2e_cdf` — TTFT CDF +14. `prefill_time_execution_plus_preemption_cdf` — Prefill 处理时间 CDF +15. `prefill_time_execution_plus_preemption_normalized_cdf` — 归一化 Prefill 时间 CDF +16. `decode_time_execution_plus_preemption_normalized_cdf` — 归一化 Decode 时间 CDF +17. `request_completions_time_series` — 请求完成时间序列 +18. `prefill_completions_time_series` — Prefill 完成时间序列 +19. `decode_completions_time_series` — Decode 完成时间序列 +20. `replica_{id}_memory_usage_weighted_mean` — 按副本显存利用率 +21. `replica_{id}_stage_{id}_busy_time_percent_weighted_mean` — 按阶段忙碌时间百分比 +22. `replica_{id}_stage_{id}_mfu_weighted_mean` — 按阶段 MFU +23. `request_arrivals_time_series` — 请求到达时间序列 + +--- + +## 4 场景测试套件 + +运行所有预配置场景: + +```bash +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all +# 或运行单个场景: +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 3 +``` + +场景配置详情请参阅 [基准测试 — 测试套件](../benchmarking/test_suite.md)。 + +--- + +## 添加新模型 + +如需为 vidur-alibabacloud 添加新模型支持,请参阅 [添加新模型指南](../developer_guide/adding_models.md) 和上游文档 `vidur-alibabacloud/docs/profiling.md`。 + +--- + +## 相关文档 + +- [多请求推理仿真](../user_guide/inference_simulation.md) — 端到端推理仿真工作流 +- [结果分析](../user_guide/result_analysis.md) — 输出文件解读 +- [GPU 显存模块技术参考](../technical_reference/memory_module.md) — 详细显存计算公式 +- [基准测试套件](../benchmarking/test_suite.md) — 4 场景配置详情 diff --git a/docs/zh/developer_guide/adding_models.md b/docs/zh/developer_guide/adding_models.md new file mode 100644 index 00000000..92081963 --- /dev/null +++ b/docs/zh/developer_guide/adding_models.md @@ -0,0 +1,204 @@ +# 添加新模型 + +本指南介绍如何为 SimAI 添加新模型支持,包括 Vidur 推理仿真侧(GPU 显存、Profiling)和 AICB 工作负载生成侧。 + +--- + +## 概述 + +添加新模型通常涉及两个组件: + +| 组件 | 需要添加的内容 | 硬件要求 | 
+|-----------|-------------|-------------------| +| **vidur-alibabacloud** | 模型配置、Profiling 数据(计算 + 网络) | GPU(仅 Profiling 阶段需要) | +| **AICB** | 工作负载生成参数(`MockedParam` / `MockedModel`) | 无 | + +--- + +## 第一部分:Vidur — 模型配置与 Profiling + +### 步骤 1:添加模型配置 + +在 `vidur-alibabacloud/data/model_configs/` 或 `vidur-alibabacloud/data/hf_configs/` 中创建 YAML/JSON 模型配置: + +- 使用模型的 HuggingFace 模型 ID 作为文件名(如 `meta-llama/Llama-2-70b-hf.yml`) +- 参考模型的 HuggingFace `config.json` 获取参数值 +- 确保正确设置参数,使参考 Transformer 模型尽可能接近新模型 + +**配置参数示例:** + +```yaml +num_layers: 80 +hidden_size: 8192 +num_attention_heads: 64 +num_key_value_heads: 8 # GQA 模型 +head_dim: 128 +intermediate_size: 28672 +vocab_size: 128256 +max_position_embeddings: 8192 +``` + +MoE 模型还需包含: + +```yaml +num_routed_experts: 256 +num_experts_per_tok: 8 +num_shared_experts: 1 +moe_intermediate_size: 2048 +``` + +### 步骤 2:Profiling 数据结构 + +Profiling 数据存储在 `vidur-alibabacloud/data/profiling/`: + +``` +profiling/ +├── compute/ +│ ├── a100/ +│ │ └── model-name/ +│ │ ├── mlp.csv +│ │ └── attention.csv +│ └── h100/ +│ └── model-name/ +│ ├── mlp.csv +│ └── attention.csv +└── network/ + ├── a100_pair_nvlink/ + │ ├── allreduce.csv + │ └── send_recv.csv + └── h100_dgx/ + ├── allreduce.csv + └── send_recv.csv +``` + +**关键区别:** +- **计算 Profiling**:仅依赖 GPU 型号(如 `a100`、`h100`),不依赖网络拓扑 +- **网络 Profiling**:依赖网络配置(如 `a100_pair_nvlink` vs `a100_dgx`) + +### 步骤 3:计算 Profiling(MLP) + +需要实际 GPU。TP > 1 时仅需 1 块 GPU 即可。 + +```bash +# 安装 sarathi-serve(vidur 分支)用于 Profiling +# 然后运行 MLP Profiling: +python vidur/profiling/mlp/main.py \ + --models your-model/model-name \ + --num_gpus 4 + +# 将输出复制到数据目录: +cp profiling_outputs/mlp//your-model/model-name/mlp.csv \ + data/profiling/compute//your-model/model-name/mlp.csv +``` + +### 步骤 4:计算 Profiling(Attention) + +```bash +python vidur/profiling/attention/main.py \ + --models your-model/model-name \ + --num_gpus 4 + +# 复制输出: +cp profiling_outputs/attention//your-model/model-name/attention.csv \ + 
data/profiling/compute//your-model/model-name/attention.csv +``` + +### 步骤 5:网络 Profiling(如需) + +网络 Profiling 是**与模型无关**的——相同硬件配置的数据可用于所有模型。 + +```bash +# AllReduce Profiling(用于 TP): +python vidur/profiling/collectives/main.py \ + --num_workers_per_node_combinations 1,2,4,8 \ + --collective all_reduce + +# Send/Recv Profiling(用于 PP,需要多节点): +python vidur/profiling/collectives/main.py \ + --num_workers_per_node_combinations 1,2,4,8 \ + --collective send_recv +``` + +**可用网络设备 Profile:** +- `a100_pair_nvlink` — Azure Standard_NC96ads_A100_v4(4x A100 PCIe + NVLink pairs) +- `h100_pair_nvlink` — Azure 内部(4x H100 NVL + NVLink pairs) +- `a100_dgx` — A100 DGX(8x A100) +- `h100_dgx` — H100 DGX(8x H100) + +--- + +## 第二部分:AICB — 工作负载生成 + +### 自定义模型参数(MockedParam) + +在 AICB 中添加新模型的工作负载生成,需创建 `MockedParam` 子类: + +```python +# 在 aicb/workload_generator/mocked_params/ 中 +class YourModelParam(MockedParam): + def __init__(self): + super().__init__() + self.num_layers = 80 + self.hidden_size = 8192 + self.num_attention_heads = 64 + self.num_key_value_heads = 8 + self.ffn_hidden_size = 28672 + self.vocab_size = 128256 + self.seq_length = 8192 + # MoE 参数(如适用) + self.num_experts = 256 + self.topk = 8 + self.moe_intermediate_size = 2048 +``` + +### 自定义模型工作流(MockedModel) + +如需完全控制工作负载生成过程,可创建 `MockedModel` 子类,定义每层的计算和通信操作。 + +详见 [AICB 组件文档](../components/aicb.md#自定义模型开发)。 + +### 推理工作负载生成 + +生成带 Prefill/Decode 分离的推理工作负载: + +```bash +# 生成推理工作负载 +python -m aicb.main \ + --model_name your-model-name \ + --workload_type inference \ + --num_prefill_tokens 1024 \ + --num_decode_tokens 128 +``` + +--- + +## 第三部分:GPU 显存模块 + +如果您的模型使用非标准注意力架构,可能需要扩展 `vidur/utils/param_counter.py` 中的 `ParamCounter`: + +1. 添加您的架构的注意力参数计算 +2. 添加 KV Cache 每 Token 大小计算 +3. 
使用 MemoryPlanner 测试验证 OOM 检测正确工作 + +详见 [GPU 显存模块技术参考](../technical_reference/memory_module.md)。 + +--- + +## 验证清单 + +- [ ] 模型配置文件已添加到 `data/model_configs/` 或 `data/hf_configs/` +- [ ] 计算 Profiling 数据(MLP + Attention)已添加 +- [ ] 目标硬件的网络 Profiling 数据可用 +- [ ] AICB `MockedParam` 已创建(如需工作负载生成) +- [ ] GPU 显存计算正确(ParamCounter + MemoryPlanner) +- [ ] 端到端推理仿真产生合理结果 +- [ ] 文档已更新 + +--- + +## 相关文档 + +- [vidur-alibabacloud 组件](../components/vidur.md) — 完整 vidur 文档 +- [AICB 组件](../components/aicb.md) — AICB 工作负载生成 +- [GPU 显存模块](../technical_reference/memory_module.md) — 显存计算公式 +- [支持的模型](../user_guide/supported_models.md) — 当前模型支持状态 diff --git a/docs/zh/developer_guide/architecture.md b/docs/zh/developer_guide/architecture.md new file mode 100644 index 00000000..702bf80d --- /dev/null +++ b/docs/zh/developer_guide/architecture.md @@ -0,0 +1,176 @@ +# 系统架构 + +本文档描述 SimAI 的模块化架构、组件交互以及训练和推理仿真的数据流。 + +--- + +## 项目结构 + +``` +SimAI/ +├── aicb/ # AI 计算基准——工作负载生成(Python) +│ ├── workload_generator/ # 训练/推理工作负载生成器 +│ └── aicb.py # 主入口 +├── astra-sim-alibabacloud/ # 仿真引擎——核心仿真器(C++) +│ ├── astra-sim/ # 扩展自 astra-sim 1.0 +│ └── build.sh # 编译脚本 +├── ns-3-alibabacloud/ # NS-3 网络仿真后端(C++) +├── vidur-alibabacloud/ # LLM 推理仿真(Python) +│ ├── vidur/ # 核心仿真框架 +│ └── setup.py # Python 包配置 +├── SimCCL/ # 集合通信转换 +├── docs/ # 文档和教程 +├── example/ # 示例工作负载和配置 +├── scripts/ # 编译和工具脚本 +├── results/ # 仿真输出目录 +├── bin/ # 编译二进制输出 +└── Dockerfile # Docker 容器定义 +``` + +--- + +## 组件架构 + +``` + |--- AICB (工作负载生成 & 计算性能分析) +SimAI --|--- SimCCL (集合通信算法分析) + |--- astra-sim-alibabacloud (仿真引擎:Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 网络后端) + |--- vidur-alibabacloud (多请求推理调度 & 显存管理) +``` + +![SimAI 架构](../../images/SimAI_Arc.png) + +### 组件职责 + +| 组件 | 角色 | 语言 | +|-----------|------|----------| +| **AICB** | 生成训练/推理工作负载、采集计算内核性能、运行物理基准测试 | Python | +| **SimCCL** | 将集合通信操作(AllReduce、AllGather 等)转换为点对点通信集合 | Python | +| **astra-sim-alibabacloud** | 支持 3 种模式的核心仿真引擎;管理计算/内存/网络 API | C++ | +| 
**ns-3-alibabacloud** | 带 RDMA、数据中心拓扑和 CC 算法的包级网络仿真 | C++ | +| **vidur-alibabacloud** | 支持 PD 分离和 GPU 显存管理的多请求推理调度 | Python | + +--- + +## 三种运行模式 + +### SimAI-Analytical + +``` +AICB (workload.txt) → astra-sim (analytical) → busbw 估算 → CSV 结果 +``` + +- **适用场景**:快速性能分析、并行参数扫描 +- **组件**:AICB + astra-sim-alibabacloud(分析模式) +- **网络模型**:总线带宽(busbw)抽象 + +### SimAI-Simulation + +``` +AICB (workload.txt) → SimCCL (集合→P2P) → astra-sim (simulation) → NS-3 → 详细 Trace +``` + +- **适用场景**:全栈网络研究、CC 算法评估 +- **组件**:AICB + SimCCL + astra-sim-alibabacloud (simulation) + ns-3-alibabacloud +- **网络模型**:包级 NS-3 仿真 + +### SimAI-Physical + +``` +AICB (workload.txt) → SimCCL (集合→P2P) → astra-sim (physical) → 真实 NIC 上的 RDMA 流量 +``` + +- **适用场景**:NIC 行为研究、物理流量分析 +- **组件**:AICB + SimCCL + astra-sim-alibabacloud(物理模式) +- **网络模型**:通过 MPI 的真实 RDMA 流量 + +--- + +## 推理仿真数据流 + +``` +请求生成器 + | 生成合成/真实 Trace 请求 + v +全局调度器 + | 将请求分发到 Prefill / Decode 副本 + v +副本调度器 + | 批次组装和调度 + v +显存管理(MemoryPlanner + Replica) + | KV Cache 分配和容量检查 + v +执行时间预测器 + | AICB / SimAI Simulation / SimAI Analytical / Vidur + v +指标存储 + | TTFT、TBT、E2E、通信/计算开销 + v +输出(request_metrics.csv, chrome_trace.json, plots/) +``` + +### 推理关键组件 + +| 组件 | 文件 | 说明 | +|-----------|------|-------------| +| 请求生成器 | `vidur/request_generator/` | 生成合成或基于 Trace 的请求 | +| 全局调度器 | `vidur/scheduler/global_scheduler/` | 跨副本分发请求(`lor`、`round_robin`、`split_wise`) | +| 副本调度器 | `vidur/scheduler/replica_scheduler/` | 副本内批次调度(`sarathi`、`split_wise`) | +| MemoryPlanner | `vidur/scheduler/utils/memory_planner.py` | GPU 显存预算计算 | +| ParamCounter | `vidur/utils/param_counter.py` | 模型参数计数(MLA/MHA/GQA/线性/MoE) | +| 执行预测器 | `vidur/execution_time_predictor/` | 通过多种后端估算执行时间 | +| 指标存储 | `vidur/metrics/` | 采集并导出 23 项仿真指标 | + +--- + +## 子模块结构 + +SimAI 使用 Git submodule 管理核心组件: + +| 子模块 | 仓库 | 分支 | +|-----------|------------|--------| +| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | master | +| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | master | 
+| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | master / dev/qp | +| `astra-sim-alibabacloud` | 项目内 | — | +| `vidur-alibabacloud` | 项目内 | — | + +**关键规则:** +1. 子模块拥有独立的 Git 历史 +2. 父仓库仅追踪每个子模块的 commit hash +3. 克隆后务必初始化:`git submodule update --init --recursive` + +--- + +## 构建系统 + +### 编译脚本 + +```bash +# 分析模式(快速,基于 busbw) +bash scripts/build.sh -c analytical + +# NS-3 仿真模式(全栈) +bash scripts/build.sh -c ns3 + +# 物理模式(Beta,RDMA) +bash scripts/build.sh -c phy +``` + +### 编译产物 + +| 模式 | 二进制 | 位置 | +|------|--------|----------| +| Analytical | `SimAI_analytical` | `bin/` | +| Simulation | `SimAI_simulator` | `bin/` | +| Physical | `SimAI_physical` | `bin/` | + +--- + +## 相关文档 + +- [组件概述](../components/index.md) — 各组件详细文档 +- [贡献指南](contributing.md) — 如何贡献代码 +- [配置文件参考](../technical_reference/configuration.md) — 配置文件和参数 diff --git a/docs/zh/developer_guide/contributing.md b/docs/zh/developer_guide/contributing.md new file mode 100644 index 00000000..59780065 --- /dev/null +++ b/docs/zh/developer_guide/contributing.md @@ -0,0 +1,224 @@ +# 贡献指南 + +感谢您对 SimAI 项目的关注!本指南介绍完整的开发工作流。 + +> **完整版**:详见项目根目录下的 [CONTRIBUTING.md](../../../CONTRIBUTING.md)。 + +--- + +## 贡献方式 + +1. **新功能** — 添加模型支持、并行策略、调度策略 +2. **Bug 修复** — 修复仿真不准确、崩溃或结果错误 +3. **性能优化** — 提升仿真速度、内存使用或可扩展性 +4. **文档** — 改进教程、添加示例、修正错误 +5. **基准测试与验证** — 添加对比真实硬件的验证结果 +6. **问题报告** — 报告 Bug、请求功能或分享反馈 + +--- + +## 开发工作流 + +### 步骤 1:Fork 和克隆 + +```bash +git clone --recurse-submodules https://github.com/YOUR_USERNAME/SimAI.git +cd SimAI +git remote add upstream https://github.com/aliyun/SimAI.git +``` + +### 步骤 2:创建功能分支 + +```bash +git fetch upstream +git checkout -b feature/your-feature-name upstream/master + +# 分支命名规范: +# feature/xxx — 新功能 +# fix/xxx — Bug 修复 +# docs/xxx — 文档 +# perf/xxx — 性能优化 +# refactor/xxx — 代码重构 +``` + +### 步骤 3:开发和测试 + +```bash +# C++ 修改需重新编译: +bash scripts/build.sh -c analytical # 或 ns3 + +# Python 修改: +python -c "from aicb import ..." 
+``` + +### 步骤 4:提交 + +```bash +git add -A +git commit -m "feat(aicb): add Llama-4 model workload generation" +``` + +### 步骤 5:推送并创建 PR + +```bash +git push origin feature/your-feature-name +# 然后在 GitHub 上创建 Pull Request +``` + +--- + +## 提交信息规范 + +使用 [Conventional Commits](https://www.conventionalcommits.org/) 格式: + +``` +(): +``` + +### 类型 + +| 类型 | 说明 | +|------|-------------| +| `feat` | 新功能 | +| `fix` | Bug 修复 | +| `docs` | 仅文档修改 | +| `refactor` | 代码重构 | +| `test` | 添加或更新测试 | +| `perf` | 性能优化 | +| `chore` | 构建流程、工具 | + +### 范围 + +`aicb`、`vidur`、`astra-sim`、`ns3`、`simccl`、`docs`、`docker`、`scripts` + +### 示例 + +``` +feat(aicb): add DeepSeek-V3 inference workload generation +fix(astra-sim): correct AllReduce latency calculation for ring algorithm +docs: update build instructions for NS-3 mode +perf(vidur): reduce memory allocation in request scheduler +``` + +--- + +## 代码风格 + +### Python + +- **格式化**: [black](https://github.com/psf/black)(默认设置) +- **Import 排序**: [isort](https://pycqa.github.io/isort/) +- **Linter**: [flake8](https://flake8.pycqa.org/) +- **最大行宽**: 120 字符 + +```bash +black --line-length 120 your_file.py +isort your_file.py +flake8 your_file.py --max-line-length 120 +``` + +### C++ + +- 遵循 `astra-sim-alibabacloud/` 中现有代码风格 +- 4 空格缩进 +- 函数和变量使用 `snake_case` +- 非显而易见的逻辑需添加注释 + +### 通用规则 + +- 注释使用**英文**编写 +- 所有新函数/类应有文档字符串或头部注释 +- 避免硬编码路径;使用相对路径或配置变量 +- 每个 PR 只包含一个功能/修复 + +--- + +## 子模块操作 + +SimAI 使用 Git submodule,关键要点: + +| 子模块 | 仓库 | 语言 | +|-----------|------------|----------| +| `aicb` | [aliyun/aicb](https://github.com/aliyun/aicb) | Python | +| `SimCCL` | [aliyun/SimCCL](https://github.com/aliyun/SimCCL) | Python | +| `ns-3-alibabacloud` | [aliyun/ns-3-alibabacloud](https://github.com/aliyun/ns-3-alibabacloud) | C++ | +| `astra-sim-alibabacloud` | 项目内 | C++ | +| `vidur-alibabacloud` | 项目内 | Python | + +### 跨子模块修改 + +如果您的贡献涉及多个子模块: + +1. 在每个子模块中分别进行修改并提交 +2. 更新父仓库指向新的子模块 commit +3. 为有独立远程仓库的子模块创建单独的 PR +4. 
在 PR 描述中引用相关 PR + +--- + +## Pull Request 指南 + +### PR 标题 + +使用与 commit message 相同的格式:`(): ` + +### PR 检查清单 + +- [ ] 代码编译无错误 +- [ ] 现有仿真结果不变(无精度退化) +- [ ] 新代码有适当注释 +- [ ] 新功能添加了测试 +- [ ] 必要时更新了文档 + +--- + +## 提交前质量检查 + +```bash +# 1. C++ 编译检查 +bash scripts/build.sh -c analytical +bash scripts/build.sh -c ns3 + +# 2. Python lint 检查 +black --check --line-length 120 your_changed_files.py +flake8 your_changed_files.py --max-line-length 120 + +# 3. 基本仿真测试 +cd bin && ./SimAI_analytical \ + --workload_path=../example/workload_analytical.txt \ + --comm_group_type=TP_GROUP \ + --busbw_path=../example/busbw.yaml + +# 4. 子模块状态检查 +git submodule status +``` + +--- + +## 验收标准 + +| 标准 | 要求 | +|-----------|-------------| +| **编译** | 编译无错误 | +| **精度** | 不降低现有仿真精度 | +| **测试** | 关键代码路径有覆盖 | +| **文档** | 新功能有注释/文档更新 | +| **风格** | 遵循代码风格规范 | +| **范围** | 修改集中且解释清晰 | + +--- + +## 审核时间线 + +1. **初审**:3-5 个工作日 +2. **反馈**:建设性评论和可操作建议 +3. **迭代**:处理反馈并更新 PR +4. **合并**:批准的 PR 合入主分支 + +--- + +## 获取帮助 + +- **Issues**: [GitHub Issues](https://github.com/aliyun/SimAI/issues) +- **讨论**: 创建 Issue 并以 "Question:" 为前缀 +- **文档**: 参见 [Tutorial](../../../docs/Tutorial.md) diff --git a/docs/zh/developer_guide/extending_ns3.md b/docs/zh/developer_guide/extending_ns3.md new file mode 100644 index 00000000..980aee84 --- /dev/null +++ b/docs/zh/developer_guide/extending_ns3.md @@ -0,0 +1,219 @@ +# NS-3 网络后端扩展指南 + +本指南介绍如何扩展 `ns-3-alibabacloud`,包括新增拥塞控制算法、交换机行为、控制报文和 NVSwitch 特性。 + +> **源码参考**:详见 `astra-sim-alibabacloud/extern/network_backend/ns3-interface/README.md` 获取完整模块映射。 + +--- + +## 模块概览 + +所有关键源文件位于 `ns-3-alibabacloud/simulation/src/point-to-point/model/`: + +| 文件 | 类 | 用途 | +|------|-------|---------| +| `qbb-net-device.{h,cc}` | `QbbNetDevice`, `RdmaEgressQueue` | 支持 QBB 的 NIC,8 优先级队列,PFC 处理,NVSwitch 发送路径 | +| `rdma-hw.{h,cc}` | `RdmaHw` | 主机 RDMA 核心:QP 管理、报文构造、ACK/NACK、CC 算法 | +| `rdma-queue-pair.{h,cc}` | `RdmaQueuePair`, `RdmaRxQueuePair` | 按 QP 状态(窗口、速率、CC 特定状态) | +| `switch-node.{h,cc}` | 
`SwitchNode` | 交换机流水线:ECMP 转发、ECN 标记、PFC、INT/PINT 注入 | +| `switch-mmu.{h,cc}` | `SwitchMmu` | 交换机缓冲区/MMU:入口/出口记账、PFC 阈值、ECN 曲线 | +| `nvswitch-node.{h,cc}` | `NVSwitchNode` | 服务器内 GPU 通信的 NVSwitch 模型 | +| `rdma-driver.{h,cc}` | `RdmaDriver` | Node/NIC 与 RdmaHw 之间的连接层 | +| `qbb-header.{h,cc}` | — | ACK/NACK 头(PG/seq/CNP-flag + INT 头) | +| `cn-header.{h,cc}` | — | CNP 头(反馈字段) | +| `pause-header.{h,cc}` | — | PFC Pause 头 | +| `pint.{h,cc}` | — | PINT 编解码工具 | +| `trace-format.h` | `TraceFormat` | 用于离线分析的二进制 Trace 记录结构 | + +--- + +## 添加新拥塞控制算法 + +NS-3 后端内置 5 种 CC 算法:**DCQCN**、**HPCC**、**TIMELY**、**DCTCP** 和 **HPCC-PINT**。添加新算法步骤: + +### 步骤 1:定义 CC 模式 + +在 `rdma-hw.h` 中添加新的 `CcMode` 值: + +```cpp +// 现有模式:1=DCQCN, 3=HPCC, 7=TIMELY, 8=DCTCP, 10=HPCC-PINT +static const uint32_t CC_MODE_YOUR_ALG = 11; +``` + +### 步骤 2:添加按 QP 状态(如需) + +在 `rdma-queue-pair.h` 中为 `RdmaQueuePair` 添加新状态变量: + +```cpp +// 您的 CC 算法状态 +double m_your_alg_rate; +double m_your_alg_alpha; +// ... +``` + +### 步骤 3:实现算法逻辑 + +在 `rdma-hw.cc` 中添加两个关键函数: + +```cpp +void RdmaHw::HandleAckYourAlg(Ptr qp, ...) { + // 处理 ACK 并更新速率/窗口 +} + +void RdmaHw::UpdateRateYourAlg(Ptr qp, ...) { + // 速率更新逻辑 +} +``` + +### 步骤 4:注册分发 + +在 `rdma-hw.cc` 的 `ReceiveAck()` 和/或 `ReceiveCnp()` 中添加分发: + +```cpp +switch (m_cc_mode) { + // ... 现有 case ... 
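+    // 示意:现有算法即在此处按 m_cc_mode 逐个 case 分发。
+    // 下列函数名仅为假设示例,具体命名以 rdma-hw.cc 的实际实现为准:
+    //   case 3: HandleAckHp(qp, p, ch);     break;  // HPCC
+    //   case 7: HandleAckTimely(qp, p, ch); break;  // TIMELY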
+ case CC_MODE_YOUR_ALG: + HandleAckYourAlg(qp, ...); + break; +} +``` + +### 步骤 5:添加交换机反馈(如需) + +如果您的 CC 算法需要交换机侧信息(如 INT/PINT 元数据): + +- 修改 `switch-node.cc::SwitchNotifyDequeue()` 注入元数据 +- 在 `RdmaHw::Receive()` 或 `QbbNetDevice::Receive()` 中添加头部解析 + +--- + +## 修改交换机行为 + +### 缓冲区管理 / PFC 阈值 + +**主要文件**: `switch-mmu.{h,cc}` + +关键修改方法: + +| 方法 | 用途 | +|--------|---------| +| `ConfigBufferSize()` | 总缓冲池大小 | +| `ConfigHdrm()` | Headroom 分配 | +| `ConfigEcn()` | ECN 标记阈值(`kmin`、`kmax`、`pmax`) | +| `CheckIngressAdmission()` | 入口准入控制 | +| `CheckEgressAdmission()` | 出口准入控制 | +| `GetPfcThreshold()` | PFC 触发阈值公式 | + +### ECN 标记 / INT 注入 + +**文件**: `switch-node.cc` + +修改 `SwitchNotifyDequeue()` 实现: +- 基于自定义队列占用公式的 ECN 标记 +- 用于高级 CC 算法的 INT/PINT 元数据注入 +- 自定义报文标记 + +### 转发 / ECMP + +**文件**: `switch-node.cc` + +路由修改: +- `GetOutDev()` — 输出端口选择 +- `EcmpHash()` — ECMP 哈希函数(当前为 5 元组) +- `AddTableEntry()` — 路由表管理 + +--- + +## 引入新控制报文 + +### 步骤 1:创建头部 + +在 `model/` 中创建新头部文件,参照 `CnHeader` 或 `PauseHeader` 模式: + +```cpp +// your-header.h +class YourHeader : public Header { +public: + static TypeId GetTypeId(); + // 序列化/反序列化方法 + uint32_t GetSerializedSize() const override; + void Serialize(Buffer::Iterator start) const override; + uint32_t Deserialize(Buffer::Iterator start) override; + + // 头部字段 + uint32_t m_your_field; +}; +``` + +### 步骤 2:定义协议号 + +添加新协议号(遵循现有约定): + +```cpp +// 现有协议号(IPv4 Protocol 字段): +// UDP 数据: 0x11 +// CNP: 0xFF +// PFC: 0xFE +// ACK: 0xFC +// NACK: 0xFD +// 新协议: 0xFB(示例) +``` + +### 步骤 3:添加解析/分发 + +在以下位置添加报文处理: +- `QbbNetDevice::Receive()` — 设备级解析 +- `RdmaHw::Receive()` — 主机协议栈处理 + +--- + +## NVSwitch / NVLS 扩展 + +**文件**: `nvswitch-node.{h,cc}`、`qbb-net-device.{h,cc}`(NVLS 发送路径)、`rdma-hw.{h,cc}`(NVLS 路由) + +`NVSwitchNode` 模拟通过 NVSwitch 的服务器内 GPU 通信。扩展方式: + +- **转发**:类似 `SwitchNode` 但不包含 ECN/INT 注入 +- **NVLS 路由**:修改 `RdmaHw::GetNicIdxOfQp()` 和 `GetNicIdxOfRxQp()` 以适配 NVSwitch 路由表 +- **QP 重分配**:`RdmaHw::RedistributeQp()` 用于 NVSwitch 链路间负载均衡 + +--- + +## 分析工具 + 
+`ns-3-alibabacloud/analysis/` 目录包含 Trace 分析工具: + +| 工具 | 用途 | +|------|---------| +| FCT 分析 | 从仿真 Trace 分析流完成时间 | +| Trace 阅读器 | 解析二进制 `TraceFormat` 记录 | +| 带宽分析 | 按链路的带宽利用率随时间变化 | +| 队列分析 | 队列占用和 PFC 事件分析 | +| QP 分析 | 按 QP 的性能指标 | + +### Trace 格式 + +二进制 Trace 记录结构(`trace-format.h`)捕获按报文的事件。使用离线分析工具: + +1. 解析仿真输出的 Trace 文件 +2. 计算 FCT、吞吐量、队列深度统计 +3. 识别拥塞热点和 PFC 事件 + +--- + +## dev/qp 分支增强 + +[dev/qp](https://github.com/aliyun/ns-3-alibabacloud/tree/dev/qp) 分支包含: + +1. **QP 逻辑支持** — 基于实际 RDMA 逻辑的 QP 创建/销毁 +2. **NIC CC 配置** — 按 IP 或按 QP 的 CC 设置 +3. **优化调度** — Max-Min 原则的公平资源分配 +4. **解耦 CC 模块** — 提升模块化程度 + +--- + +## 相关文档 + +- [NS-3 组件](../components/ns3.md) — 完整 NS-3 后端文档 +- [SimAI-Simulation 使用指南](../user_guide/simai_simulation.md) — NS-3 仿真模式使用 +- [配置文件参考](../technical_reference/configuration.md) — 拓扑和配置文件 diff --git a/docs/zh/developer_guide/index.md b/docs/zh/developer_guide/index.md new file mode 100644 index 00000000..f937de50 --- /dev/null +++ b/docs/zh/developer_guide/index.md @@ -0,0 +1,25 @@ +# 开发者指南 + +欢迎来到 SimAI 开发者指南。本节涵盖项目架构、贡献工作流以及扩展 SimAI 的新模型和网络特性指南。 + +--- + +## 目录 + +| 文档 | 说明 | +|----------|-------------| +| [系统架构](architecture.md) | 系统架构、数据流和模块交互 | +| [贡献指南](contributing.md) | 开发工作流、代码风格、PR 指南 | +| [添加新模型](adding_models.md) | 添加新模型支持指南(Vidur profiling + AICB 工作负载) | +| [NS-3 扩展](extending_ns3.md) | NS-3 网络后端扩展指南 | + +--- + +## 前置条件 + +- **Python** 3.8+(Docker 推荐 3.12) +- **CMake** 3.16+ +- **GCC/G++** 9.4+ +- **Git** 支持 submodule + +详见 [安装指南](../getting_started/installation.md)。 diff --git a/docs/zh/getting_started/index.md b/docs/zh/getting_started/index.md new file mode 100644 index 00000000..dc865b17 --- /dev/null +++ b/docs/zh/getting_started/index.md @@ -0,0 +1,25 @@ +# 快速入门 + +本章节帮助你搭建 SimAI 环境并运行第一次仿真。 + +## 目录 + +| 页面 | 说明 | +|------|------| +| [安装指南](installation.md) | 通过 Docker 或源码安装 SimAI | +| [快速开始](quickstart.md) | 运行第一个 SimAI-Analytical、SimAI-Simulation 和推理仿真 | + +## 前置条件 + +- **Python** 3.8+(Docker 镜像中推荐 3.12) +- **CMake** 3.16+ +- 
**GCC/G++** 9.4+ +- **Git**(支持 submodule) + +若需使用 AIOB(计算性能分析)生成工作负载,则需要 **NVIDIA Hopper (SM90)** 或 **Blackwell (SM100)** GPU。 + +## 下一步 + +1. 按照[安装指南](installation.md)搭建环境 +2. 运行[快速开始](quickstart.md)示例验证安装 +3. 查看[用户指南](../user_guide/index.md)了解详细用法 diff --git a/docs/zh/getting_started/installation.md b/docs/zh/getting_started/installation.md new file mode 100644 index 00000000..bebb2844 --- /dev/null +++ b/docs/zh/getting_started/installation.md @@ -0,0 +1,96 @@ +# 安装指南 + +本指南介绍如何安装 SimAI 及其依赖。 + +## 方式一:Docker(推荐) + +```bash +# 构建 Docker 镜像 +docker build -t simai:latest . + +# 运行容器(带 GPU 支持) +docker run --gpus all -it --rm \ + -v $(pwd)/results:/workspace/SimAI/results \ + simai:latest /bin/bash +``` + +> **注意:** 如使用 Hopper GPU,请在 Dockerfile 中添加 `ENV FLASH_MLA_DISABLE_SM100=1`。 + +## 方式二:从源码编译 + +以下步骤已在 GCC/G++ 9.4.0、Python 3.8.10、Ubuntu 20.04 环境下测试通过。 + +> **重要:** 请勿安装 ninja(NGC 镜像中已预装,需移除以兼容 SimAI-Simulation 编译)。 +> ```bash +> apt remove ninja-build && pip uninstall ninja +> ``` + +### 第一步:克隆仓库 + +```bash +git clone https://github.com/aliyun/SimAI.git +cd ./SimAI/ + +# 初始化子模块 +git submodule update --init --recursive +# 更新到最新提交 +git submodule update --remote +``` + +### 第二步:编译 C++ 组件 + +根据需要选择编译模式: + +```bash +# SimAI-Analytical(快速,抽象网络细节) +./scripts/build.sh -c analytical + +# SimAI-Simulation(使用 NS-3 网络后端的全栈仿真) +./scripts/build.sh -c ns3 + +# SimAI-Physical(Beta,需要 RDMA 环境) +sudo yum install openmpi openmpi-devel +export MPI_INCLUDE_PATH=/usr/include/openmpi-x86_64/ +export MPI_BIN_PATH=/usr/lib64/openmpi/bin/mpic++ +./scripts/build.sh -c phy +``` + +### 第三步:安装 Python 依赖 + +```bash +pip install -r aicb/requirements.txt +pip install -r vidur-alibabacloud/requirements.txt +``` + +### 第四步:验证编译结果 + +```bash +ls bin/ # 应包含 SimAI_analytical 和/或 SimAI_simulator +``` + +## 方式三:Conda 环境(推理仿真专用) + +```bash +cd vidur-alibabacloud +conda env create -p ./env -f ./environment.yml +conda activate vidur +pip install -r requirements.txt -i 
https://mirrors.aliyun.com/pypi/simple/ +``` + +## NGC 容器(工作负载生成) + +使用 AIOB 进行计算性能分析生成工作负载时,建议直接使用 NGC 容器镜像: + +```bash +docker pull nvcr.io/nvidia/pytorch:xx.xx-py3 +docker run --gpus all -it --rm \ + -v /path/to/SimAI:/workspace/SimAI \ + nvcr.io/nvidia/pytorch:xx.xx-py3 +``` + +> **注意:** 请使用 PyTorch >= 23.08 版本的 NGC 镜像。 + +## 下一步 + +- [快速开始](quickstart.md) — 运行第一次仿真 +- [用户指南](../user_guide/index.md) — 各模式详细使用方法 diff --git a/docs/zh/getting_started/quickstart.md b/docs/zh/getting_started/quickstart.md new file mode 100644 index 00000000..10e4a374 --- /dev/null +++ b/docs/zh/getting_started/quickstart.md @@ -0,0 +1,85 @@ +# 快速开始 + +本指南帮助你运行第一次 SimAI 仿真。 + +## 1. SimAI-Analytical + +最快的入门方式。使用总线带宽(busbw)抽象网络细节。 + +```bash +# 运行 Analytical 仿真 +./bin/SimAI_analytical \ + -w example/workload_analytical.txt \ + -g 9216 \ + -g_p_s 8 \ + -r test- \ + -busbw example/busbw.yaml +``` + +自动计算总线带宽: + +```bash +./bin/SimAI_analytical \ + -w ./example/workload_analytical.txt \ + -g 9216 -nv 360 -nic 48.5 \ + -n_p_s 8 -g_p_s 8 -r example- +``` + +详细参数说明请参考 [SimAI-Analytical 用户指南](../user_guide/simai_analytical.md)。 + +## 2. SimAI-Simulation + +使用 NS-3 网络后端的全栈仿真。 + +```bash +# 第一步:创建网络拓扑 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps + +# 第二步:运行仿真 +AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \ + -t 16 \ + -w ./example/microAllReduce.txt \ + -n ./Spectrum-X_128g_8gps_100Gbps_A100 \ + -c astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +详细参数说明请参考 [SimAI-Simulation 用户指南](../user_guide/simai_simulation.md)。 + +## 3. 
多请求推理仿真 + +使用 Vidur 框架的端到端推理仿真。 + +### 前置条件 + +```bash +# 激活 vidur conda 环境 +conda activate vidur +``` + +### 运行四场景测试套件 + +```bash +# 运行全部 4 个场景 +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --all + +# 或运行单个场景 +bash vidur-alibabacloud/examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +### 场景概览 + +| 场景 | 模型 | PD 分离 | World Size | TP | EP | 调度器 | +|------|------|---------|-----------|----|----|--------| +| 1 | Qwen3-Next-80B | 否 | 32 | 1 | 1 | lor | +| 2 | Qwen3-Next-80B | 是(P=2, D=6) | 8 | 1 | 1 | split_wise | +| 3 | DeepSeek-671B | 是(P=2, D=6) | 8 | 8 | 8 | split_wise | +| 4 | Qwen3-MoE-235B | 是(P=2, D=6) | 8 | 4 | 4 | split_wise | + +详细信息请参考[推理仿真用户指南](../user_guide/inference_simulation.md)。 + +## 下一步 + +- [用户指南](../user_guide/index.md) — 深入了解各仿真模式 +- [组件详情](../components/index.md) — 了解各子模块 +- [基准测试](../benchmarking/index.md) — 运行完整测试套件 diff --git a/docs/zh/index.md b/docs/zh/index.md new file mode 100644 index 00000000..9dcc8707 --- /dev/null +++ b/docs/zh/index.md @@ -0,0 +1,73 @@ +# 欢迎使用 SimAI 文档 + +


+ +[![License](https://img.shields.io/badge/license-MIT-green.svg)](../../LICENSE) +[![NSDI'25](https://img.shields.io/badge/NSDI'25-SimAI-blue.svg)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) + +**SimAI** 是阿里云开源的业界首个全栈高精度 AI 大规模**推理**与**训练**仿真器。它提供了对 LLM 训练和推理全流程的详细建模与仿真,涵盖框架层、集合通信层和网络传输层,提供端到端的性能数据。 + +SimAI 使研究人员能够: + +- 分析推理/训练过程细节 +- 评估特定条件下 AI 任务的时间消耗 +- 评估各种算法优化带来的端到端性能提升(框架参数、集合通信算法、网络协议、拥塞控制、路由、拓扑等) + +--- + +## 文档概览 + +| 章节 | 说明 | +|------|------| +| [快速入门](getting_started/index.md) | 安装、环境搭建、快速开始 | +| [用户指南](user_guide/index.md) | SimAI-Analytical、SimAI-Simulation、SimAI-Physical、推理仿真的详细使用方法 | +| [组件详情](components/index.md) | 各子模块详细文档:AICB、SimCCL、astra-sim、ns-3、vidur | +| [技术参考](technical_reference/index.md) | GPU 显存模块、CLI 参数、配置文件参考 | +| [基准测试](benchmarking/index.md) | 四场景端到端测试套件及基准测试结果 | +| [开发者指南](developer_guide/index.md) | 架构、贡献指南、添加模型、扩展 NS-3 | +| [社区](community/index.md) | 活动、联系方式、引用 | + +--- + +## 系统架构 + +``` + |--- AICB (工作负载生成 & 计算性能分析) +SimAI --|--- SimCCL (集合通信算法分析) + |--- astra-sim-alibabacloud (仿真引擎:Analytical / Simulation / Physical) + |--- ns-3-alibabacloud (NS-3 网络后端) + |--- vidur-alibabacloud (多请求推理调度 & 显存管理) +``` + +![SimAI 架构图](../images/SimAI_Arc.png) + +--- + +## 三种运行模式 + +| 模式 | 说明 | 适用场景 | +|------|------|----------| +| **SimAI-Analytical** | 使用总线带宽(busbw)估算集合通信时间的快速仿真 | 性能分析、并行参数优化、scale-up 探索 | +| **SimAI-Simulation** | 使用 NS-3 网络后端的全栈仿真,提供细粒度网络建模 | CC 算法研究、网络协议评估、新架构设计 | +| **SimAI-Physical** *(Beta)* | 在 CPU RDMA 集群上生成物理流量 | NIC 行为研究 | + +--- + +## 支持的模型 + +- **DeepSeek-V3-671B** — MLA 注意力,256 个路由专家 +- **Qwen3-MoE-235B** — MHA/GQA,128 个路由专家 +- **Qwen3-Next-80B** — 混合全注意力 + 线性注意力,512 个路由专家 +- **Meta-Llama-3-8B / 70B**、**Llama-2-7b / 70b**、**CodeLlama-34b**、**InternLM-20B**、**Qwen-72B** + +--- + +## 快速链接 + +- [GitHub 仓库](https://github.com/aliyun/SimAI) +- [NSDI'25 论文 (PDF)](https://ennanzhai.github.io/pub/nsdi25spring-simai.pdf) +- [演示文稿](../../docs/SimAI_Intro_Online.pdf) +- [技术报告 (1.6)](../SimAI_1.6_Tech_Report.md) 
+- [贡献指南](../../CONTRIBUTING.md) diff --git a/docs/zh/technical_reference/cli_reference.md b/docs/zh/technical_reference/cli_reference.md new file mode 100644 index 00000000..3c1e1ad2 --- /dev/null +++ b/docs/zh/technical_reference/cli_reference.md @@ -0,0 +1,185 @@ +# CLI 参考 + +SimAI 所有工具的完整命令行参数参考。 + +--- + +## SimAI-Analytical + +**二进制**: `bin/SimAI_analytical` + +### 必需参数 + +| 标志 | 长格式 | 说明 | +|------|-----------|-------------| +| `-w` | `--workload` | 工作负载文件路径 | +| `-g` | `--gpus` | 仿真 GPU 规模 | +| `-g_p_s` | `--gpus-per-server` | Scale-up 大小(每服务器 GPU 数) | +| `-r` | `--result` | 输出文件路径和前缀(默认:`./results/`) | +| `-busbw` | `--bus-bandwidth` | busbw.yaml 文件路径 | + +### 可选参数 + +| 标志 | 长格式 | 说明 | +|------|-----------|-------------| +| `-v` | `--visual` | 生成可视化文件 | +| `-dp_o` | `--dp-overlap-ratio` | DP 重叠比例 [0.0-1.0] | +| `-ep_o` | `--ep-overlap-ratio` | EP 重叠比例 [0.0-1.0] | +| `-tp_o` | `--tp-overlap-ratio` | TP 重叠比例 [0.0-1.0] | +| `-pp_o` | `--pp-overlap-ratio` | PP 重叠比例 [0.0-1.0] | + +### 自动 Busbw 计算 + +| 标志 | 说明 | +|------|-------------| +| `-nv` | NVLink 带宽(GB/s) | +| `-nic` | NIC 带宽(GB/s) | +| `-n_p_s` | 每服务器 NIC 数 | + +--- + +## SimAI-Simulation + +**二进制**: `bin/SimAI_simulator` + +### 环境变量 + +| 变量 | 说明 | 默认值 | +|----------|-------------|---------| +| `AS_LOG_LEVEL` | 日志级别:DEBUG/INFO/WARNING/ERROR | `INFO` | +| `AS_PXN_ENABLE` | 启用 PXN | `0` | +| `AS_NVLS_ENABLE` | 启用 NVLS | `0` | +| `AS_SEND_LAT` | 发送延迟(us) | `6` | +| `AS_NVLSTREE_ENABLE` | 启用 NVLS Tree | `false` | + +### 参数 + +| 标志 | 长格式 | 说明 | 默认值 | +|------|-----------|-------------|---------| +| `-t` | `--thread` | 线程数 | `1` | +| `-w` | `--workload` | 工作负载路径 | 必需 | +| `-n` | `--network-topo` | 拓扑文件路径 | 必需 | +| `-c` | `--config` | SimAI.conf 路径 | 必需 | + +--- + +## SimAI-Physical + +**二进制**: `bin/SimAI_phynet` + +| 参数 | 说明 | 默认值 | +|-----------|-------------|---------| +| `hostlist` | 主机 IP 列表路径 | 必需 | +| `-w` / `--workload` | 工作负载文件路径 | `./microAllReduce.txt` | +| `-i` / `--gid_index` | RDMA 的 GID 索引 | `0` 
| +| `-g` / `--gpus` | GPU 数量 | `8` | + +--- + +## 拓扑生成器 + +**脚本**: `astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py` + +| 层级 | 标志 | 说明 | +|-------|------|-------------| +| 全局 | `-topo` | 模板:Spectrum-X / AlibabaHPN / DCN+ | +| | `-g` | GPU 数量 | +| | `--dp` | 启用双 Plane | +| | `--ro` | 启用 Rail-optimized | +| | `--dt` | 启用双 ToR | +| | `-er` | 错误率 | +| 服务器内 | `-gps` | 每服务器 GPU 数 | +| | `-gt` | GPU 型号(A100/H100) | +| | `-nsps` | 每服务器 NV Switch 数 | +| | `-nvbw` | NVLink 带宽 | +| | `-nl` | NVLink 延迟 | +| | `-l` | NIC 延迟 | +| Segment 内 | `-bw` | NIC 到 ASW 带宽 | +| | `-asn` | ASW 交换机数量 | +| | `-nps` | 每交换机 NIC 数 | +| Pod 内 | `-psn` | PSW 交换机数量 | +| | `-apbw` | ASW 到 PSW 带宽 | +| | `-app` | 每 PSW 的 ASW 数 | + +--- + +## AICB 工作负载生成器 + +**脚本**: `scripts/megatron_workload_with_aiob.sh` 或 `python -m workload_generator.SimAI_training_workload_generator` + +### 核心参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--frame` | 框架:Megatron / DeepSpeed / DeepSeek | +| `-m` / `--model_size` | 模型大小:7/13/22/175/moe/deepseek | +| `--world_size` | 总 GPU 数量 | +| `--global_batch` | 总批量大小 | +| `--micro_batch` | 微批量大小 | +| `--seq_length` | 序列长度 | +| `--epoch_num` | 迭代次数 | + +### 并行参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--tensor_model_parallel_size` | TP 度 | +| `--pipeline_model_parallel` | PP 度 | +| `--expert_model_parallel_size` | EP 度 | +| `--enable_sequence_parallel` | 启用 SP | + +### 模型参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--num_layers` | Transformer 层数 | +| `--hidden_size` | 隐藏层大小 | +| `--num_attention_heads` | 注意力头数 | +| `--ffn_hidden_size` | FFN 隐藏层大小 | +| `--vocab_size` | 词表大小 | + +### MoE 参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--moe_enable` | 启用 MoE | +| `--num_experts` | 专家数量 | +| `--moe_router_topk` | 每 Token 专家数 | +| `--moe_grouped_gemm` | 启用分组 GEMM | + +### DeepSeek 参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--qk_rope_dim` | QK 的 RoPE 维度 | +| `--qk_nope_dim` | QK 的非 RoPE 维度 | +| `--q_lora_rank` | Q LoRA 秩 | +| `--kv_lora_rank` | KV
LoRA 秩 | +| `--v_head_dim` | V Head 维度 | +| `--n_shared_expert` | 每 MoE 层共享专家数 | +| `--n_dense_layer` | 稠密层数 | + +### 优化参数 + +| 参数 | 说明 | +|-----------|-------------| +| `--use_flash_attn` | FlashAttention | +| `--swiglu` | SwiGLU 激活函数 | +| `--aiob_enable` | AIOB 计算 Profiling | +| `--comp_filepath` | 预计算时间文件 | + +--- + +## Vidur 推理仿真 + +**命令**: `python -m vidur.main` + +运行 `python -m vidur.main -h` 查看完整参数列表。关键参数见 [vidur 组件页面](../components/vidur.md)。 + +--- + +## 相关文档 + +- [配置文件参考](configuration.md) — 配置文件格式 +- [SimAI-Analytical 指南](../user_guide/simai_analytical.md) — 使用示例 +- [AICB 组件](../components/aicb.md) — 完整参数详情 diff --git a/docs/zh/technical_reference/configuration.md b/docs/zh/technical_reference/configuration.md new file mode 100644 index 00000000..6486f738 --- /dev/null +++ b/docs/zh/technical_reference/configuration.md @@ -0,0 +1,163 @@ +# 配置文件参考 + +本文档涵盖 SimAI 使用的所有配置文件。 + +--- + +## SimAI.conf + +**路径**: `astra-sim-alibabacloud/inputs/config/SimAI.conf` + +SimAI-Analytical 和 SimAI-Simulation 模式共用的主仿真配置文件,控制通信算法、缓冲区大小和时序参数。 + +--- + +## busbw.yaml + +**路径**: `example/busbw.yaml` + +SimAI-Analytical 使用,用于指定不同通信组和集合操作的总线带宽。 + +### 格式 + +```yaml +test +TP: + allreduce,: 300 # TP 组 AllReduce busbw 300GB/s + allgather,: 280 + reducescatter,: 280 + alltoall,: 230 +DP: + allreduce,: null # null = 该组不使用此操作 + allgather,: 380 + reducescatter,: 380 + alltoall,: null +EP: + allreduce,: null + allgather,: 45 + reducescatter,: 45 + alltoall,: 80 +``` + +### 通信组 + +| 组 | 说明 | +|-------|-------------| +| `TP` | 张量并行 — 服务器内 NVLink 通信 | +| `DP` | 数据并行 — 服务器间 RDMA 通信 | +| `EP` | 专家并行 — MoE 专家通信 | + +### 集合操作 + +| 操作 | 说明 | +|-----------|-------------| +| `allreduce` | 归约 + 广播到所有 Rank | +| `allgather` | 从所有 Rank 收集数据 | +| `reducescatter` | 归约并分发 | +| `alltoall` | 全对全个性化交换 | + +对特定组中不使用的操作,将值设为 `null`。 + +--- + +## 拓扑文件 + +由 `gen_Topo_Template.py` 生成,拓扑文件为 SimAI-Simulation 定义网络结构。 + +### 生成 + +```bash +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + 
-topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps +``` + +输出文件以参数命名,如 `Spectrum-X_128g_8gps_100Gbps_A100`。 + +### 模板默认值 + +| 模板 | GPU 数 | 拓扑 | 带宽 | GPU 型号 | +|----------|------|------|-----------|----------| +| Spectrum-X | 4096 | Rail-optimized,单 ToR | 400Gbps | H100 | +| AlibabaHPN(单 Plane) | 15360 | Rail-optimized,双 ToR | 200Gbps | H100 | +| AlibabaHPN(双 Plane) | 15360 | Rail-optimized,双 ToR,双 Plane | 200Gbps | H100 | +| DCN+(单 ToR) | 512 | 非 Rail-optimized | 400Gbps | A100 | +| DCN+(双 ToR) | 512 | 非 Rail-optimized,双 ToR | 200Gbps | H100 | + +--- + +## 模型配置文件 + +### 推理模型配置 + +位于 `vidur-alibabacloud/data/hf_configs/`: + +| 模型 | 配置文件 | +|-------|------------| +| DeepSeek-V3-671B | `deepseek_v3_config.json` | +| Qwen3-MoE-235B | `qwen3_moe_config.json` | +| Qwen3-Next-80B | `qwen3-next-80B-A3B_config.json` | + +这些文件遵循 HuggingFace `config.json` 格式,定义模型架构参数。 + +### Profiling 数据 + +位于 `vidur-alibabacloud/data/profiling/`: + +``` +profiling/ +├── compute/ +│ ├── a100/ +│ │ └── / +│ │ ├── mlp.csv +│ │ └── attention.csv +│ └── h100/ +│ └── / +│ ├── mlp.csv +│ └── attention.csv +└── network/ + ├── a100_pair_nvlink/ + │ ├── allreduce.csv + │ └── send_recv.csv + └── h100_dgx/ + ├── allreduce.csv + └── send_recv.csv +``` + +- **计算 Profiling**:仅依赖 GPU 型号(如 `a100`、`h100`),不依赖网络拓扑 +- **网络 Profiling**:依赖网络配置(如 `a100_pair_nvlink` vs `a100_dgx`) + +--- + +## 工作负载文件 + +### 训练工作负载格式 + +``` +HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 vpp: 8 ga: 1 all_gpus: 32 checkpoints: 0 checkpoint_initiates: 0 +6 +embedding_layer -1 556000 ALLREDUCE 16777216 1 NONE 0 1 NONE 0 1 +... 
+``` + +头部字段: +- `model_parallel_NPU_group`:TP 大小 +- `ep`:EP 大小 +- `pp`:PP 大小 +- `vpp`:虚拟流水线并行 +- `ga`:梯度累积 +- `all_gpus`:总 GPU 数量 + +### 请求 Trace 文件 + +用于推理仿真,位于 `vidur-alibabacloud/data/processed_traces/`: + +- `splitwise_conv.csv` — 对话式 Trace +- `sharegpt_8k_filtered_stats_llama2_tokenizer.csv` — ShareGPT Trace + +--- + +## 相关文档 + +- [CLI 参考](cli_reference.md) — 命令行参数 +- [SimAI-Analytical 指南](../user_guide/simai_analytical.md) — busbw 配置使用 +- [SimAI-Simulation 指南](../user_guide/simai_simulation.md) — 拓扑配置使用 diff --git a/docs/zh/technical_reference/index.md b/docs/zh/technical_reference/index.md new file mode 100644 index 00000000..78eac060 --- /dev/null +++ b/docs/zh/technical_reference/index.md @@ -0,0 +1,13 @@ +# 技术参考 + +本节提供 SimAI 内部模块、CLI 参数和配置文件的详细技术规格与参考文档。 + +--- + +## 目录 + +| 文档 | 说明 | +|----------|-------------| +| [GPU 显存模块](memory_module.md) | ParamCounter、KV Cache 计算、MemoryPlanner | +| [CLI 参考](cli_reference.md) | 所有模式的完整命令行参数参考 | +| [配置文件参考](configuration.md) | SimAI.conf、busbw.yaml、拓扑文件和模型配置 | diff --git a/docs/zh/technical_reference/memory_module.md b/docs/zh/technical_reference/memory_module.md new file mode 100644 index 00000000..1949621c --- /dev/null +++ b/docs/zh/technical_reference/memory_module.md @@ -0,0 +1,155 @@ +# GPU 显存计算模块 + +GPU 显存计算模块(SimAI 1.6 引入)为推理仿真提供精确的 GPU 显存估算,涵盖模型参数显存、KV Cache 显存和最大批量大小计算。 + +--- + +## 架构 + +``` +ParamCounter (param_counter.py) + |-- 按层、按设备计算参数量 + |-- PD 分离下返回 (total_params, prefill_params, decode_params) + | +MemoryPlanner (memory_planner.py) + |-- 规划 GPU 显存预算 + |-- 计算 KV Cache 容量 + |-- 检测 OOM 条件 + | +Replica KV Cache Tracker (replica.py) + |-- 按请求分配/释放 + |-- 运行时容量查询 +``` + +--- + +## ParamCounter + +**文件**: `vidur-alibabacloud/vidur/utils/param_counter.py` + +### MLA 参数(DeepSeek-V3-671B) + +每层 MLA 参数组成: + +| 组件 | 公式 | DeepSeek-V3 值 | +|-----------|---------|-------------------| +| Q LoRA 下投影 | `hidden_size * q_lora_rank` | 7168 * 1536 | +| Q LoRA 上投影 | `q_lora_rank * num_heads * qk_head_dim` | 1536 * 128 * 
192 | +| KV LoRA 下投影 | `hidden_size * kv_lora_rank` | 7168 * 512 | +| KV LoRA 上投影 | `kv_lora_rank * num_heads * (qk_nope_dim + v_head_dim)` | 512 * 128 * 256 | +| 输出投影 | `hidden_size * num_heads * v_head_dim` | 7168 * 128 * 128 | + +其中 `qk_head_dim = qk_nope_head_dim + qk_rope_head_dim = 128 + 64 = 192` + +### MHA/GQA 参数(Qwen3-MoE-235B) + +``` +wq = hidden_size * num_attention_heads * head_dim +wk = hidden_size * num_key_value_heads * head_dim +wv = hidden_size * num_key_value_heads * head_dim +wo = hidden_size * num_attention_heads * head_dim +total = (wq + wk + wv + wo) * bytes_per_element +``` + +### 线性注意力参数(Qwen3-Next-80B) + +Qwen3-Next-80B 使用混合注意力:全注意力和线性(GDN)注意力每 4 层交替。线性注意力层使用独立的 `linear_key_head_dim` / `linear_num_key_heads` 配置。 + +### MoE 专家参数 + +每专家 FFN(3 个权重矩阵 W1、W2、W3): + +``` +expert_params = 3 * hidden_size * moe_intermediate_size * bytes_per_element +``` + +### PD 分离 + +PD 分离下,专家并行在不同集群间有差异: + +- **Prefill 集群**: `experts_per_device = num_routed_experts / prefill_world_size` +- **Decode 集群**: `experts_per_device = num_routed_experts / decode_world_size` + +返回三元组:`(total_params, prefill_params, decode_params)` + +--- + +## KV Cache 计算 + +### MHA/GQA KV Cache + +``` +kv_cache_per_token = 2 * num_kv_heads * head_dim * num_layers * bytes_per_element +``` + +因子 2 = K(Key)+ V(Value)缓存。 + +### MLA KV Cache(DeepSeek-V3-671B) + +MLA 使用压缩的 KV 表示——单个潜向量同时编码 K 和 V: + +``` +kv_cache_per_token = (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_element +``` + +其中 `kv_lora_rank = 512`、`qk_rope_head_dim = 64`。 + +**对比**:MHA 每 Token 需要 `2 * 128 * 128 = 32768` 个元素。MLA 仅需 `576` 个元素——**约 57 倍压缩**。 + +### 按请求 KV Cache 追踪 + +`Replica` 实体维护: + +| 状态 | 说明 | +|-------|-------------| +| `_allocated_kv_cache_memory` | 当前已分配的 KV Cache(字节) | +| `_max_kv_cache_memory` | 最大 KV Cache 容量 | +| `_kv_cache_allocation_map` | 按请求的分配映射 | + +操作: +- `allocate_request_kv_cache_memory(request, num_blocks, block_size)` +- `release_request_kv_cache_memory(request)` +- 
`get_remaining_kv_cache_capacity()` + +--- + +## MemoryPlanner + +**文件**: `vidur-alibabacloud/vidur/scheduler/utils/memory_planner.py` + +### 计算流程 + +1. **可用 GPU 显存**: `available = 总GPU显存 * (1 - memory_margin_fraction)` +2. **参数显存**: 通过 ParamCounter 计算;PD 返回 `(total, prefill, decode)` +3. **KV Cache 预算**: `kv_available = available - param_memory` +4. **最大并发请求**: `max_requests = kv_available / kv_cache_per_request` + +### PD 分离 + +- Prefill 副本:使用 `prefill_param_mem` 计算预算 +- Decode 副本:使用 `decode_param_mem` 计算预算 + +### OOM 检测 + +当 `param_memory > available_memory` 时,输出错误并给出建议: +- 增加 TP/EP 度 +- 使用更大 GPU(更多显存) +- 启用 FP8 量化 + +--- + +## 量化支持 + +| 精度 | 每元素字节数 | 使用场景 | +|-----------|-------------------|----------| +| FP32 | 4 | 参考基准 | +| FP16/BF16 | 2 | 默认推理 | +| FP8 | 1 | 降低显存,ParamCounter 支持 | + +--- + +## 相关文档 + +- [vidur-alibabacloud 组件](../components/vidur.md) — 完整组件文档 +- [支持的模型](../user_guide/supported_models.md) — 模型规格 +- [SimAI 1.6 技术报告](../../SimAI_1.6_Tech_Report.md) — 详细技术报告 diff --git a/docs/zh/user_guide/index.md b/docs/zh/user_guide/index.md new file mode 100644 index 00000000..75013786 --- /dev/null +++ b/docs/zh/user_guide/index.md @@ -0,0 +1,25 @@ +# 用户指南 + +本章节提供 SimAI 各运行模式的详细使用说明。 + +## 目录 + +| 页面 | 说明 | +|------|------| +| [SimAI-Analytical](simai_analytical.md) | 使用总线带宽的快速分析仿真 | +| [SimAI-Simulation](simai_simulation.md) | 使用 NS-3 网络后端的全栈仿真与拓扑配置 | +| [SimAI-Physical](simai_physical.md) | 在真实集群上生成物理 RDMA 流量 | +| [推理仿真](inference_simulation.md) | 支持 PD 分离的多请求 LLM 推理仿真 | +| [工作负载生成](workload_generation.md) | 使用 AICB 生成训练和推理工作负载 | +| [支持的模型](supported_models.md) | 支持的模型完整列表及配置 | +| [结果分析](result_analysis.md) | 仿真结果分析与可视化 | + +## 工作流程概览 + +典型的 SimAI 工作流程包含三个步骤: + +1. 使用 [AICB](workload_generation.md) **生成工作负载** — 定义计算和通信模式 +2. 使用三种模式之一(Analytical、Simulation 或 Physical)**运行仿真** +3. 
使用内置工具或自定义脚本**分析结果** + +推理仿真的工作流程使用 Vidur 进行请求调度和内存管理,AICB 或 SimAI 作为执行时间预测后端。 diff --git a/docs/zh/user_guide/inference_simulation.md b/docs/zh/user_guide/inference_simulation.md new file mode 100644 index 00000000..c6723ab4 --- /dev/null +++ b/docs/zh/user_guide/inference_simulation.md @@ -0,0 +1,151 @@ +# 多请求推理仿真 + +SimAI 支持完整的多请求 LLM 推理仿真,提供端到端的推理服务系统性能评估,支持 Prefill-Decode (PD) 分离架构。 + +--- + +## 概述 + +推理仿真流水线组合了多个 SimAI 组件: + +- **[AICB](../components/aicb.md)** — 生成推理工作负载和计算时间分析 +- **[vidur-alibabacloud](../components/vidur.md)** — 请求调度、显存管理和指标收集 +- **[astra-sim-alibabacloud](../components/astra_sim.md)** — 仿真引擎(Analytical 或 Simulation 模式) +- **[SimCCL](../components/simccl.md)** — 集合通信转换 + +--- + +## Prefill-Decode (PD) 分离 + +推理过程分为两个阶段: + +| 阶段 | 特征 | 说明 | +|------|------|------| +| **Prefill** | 计算密集 | 处理所有输入 prompt token,生成第一个输出 token | +| **Decode** | 内存带宽密集 | 逐个自回归生成后续输出 token | + +PD 分离允许将这两个阶段部署在不同的 GPU 节点上: + +- **弹性资源分配** — Prefill 节点可配置更多算力,Decode 节点可配置更多显存 +- **性能隔离** — 避免阶段间的资源竞争 +- **灵活 P:D 比例** — 通过 `--replica_config_pd_node_ratio` 配置 + +--- + +## 请求调度 + +调度组件改编自微软 [Vidur](https://github.com/microsoft/vidur),支持多种策略: + +| 调度器 | 级别 | 说明 | +|--------|------|------| +| `split_wise` | 全局 | PD 分离感知调度,将请求分发到 Prefill 和 Decode 副本 | +| `lor` | 全局 | 最少未完成请求——分发到负载最轻的副本 | +| `round_robin` | 全局 | 轮询分发 | +| `sarathi` | 副本级 | 副本内批量调度 | +| `split_wise` | 副本级 | PD 分离的副本级调度 | + +--- + +## 并行策略 + +支持多种并行策略组合: + +| 策略 | 参数 | 说明 | +|------|------|------| +| **数据并行 (DP)** | `--cluster_config_num_replicas` | 副本数量 | +| **张量并行 (TP)** | `--replica_config_tensor_parallel_size` | 节点内并行 | +| **流水线并行 (PP)** | `--replica_config_num_pipeline_stages` | 阶段间并行 | +| **专家并行 (EP)** | `--replica_config_expert_model_parallel_size` | MoE 专家并行 | + +适用于稠密模型和 MoE(混合专家)模型。 + +--- + +## 执行时间预测后端 + +| 后端 | 参数值 | 说明 | +|------|--------|------| +| **AICB/AIOB** | `aicb` | 支持 DeepSeek-V3、Qwen3-MoE、Qwen3-Next 的计算核和 TP/DP/PP/EP 通信量建模 | +| **SimAI Simulation** | `simai_simulation` | 基于 NS-3 
的全栈网络仿真(当前支持 TP) | +| **SimAI Analytical** | `simai_analytical` | 分析性能模型(当前支持 TP) | +| **原生 Vidur** | `vidur` | 原版 Vidur 后端,支持 TP、DP、PP | + +通过 `--random_forrest_execution_time_predictor_config_backend` 设置。 + +--- + +## 快速开始 + +### 前置条件 + +- **AICB 后端**:SimAI Docker 环境 + Hopper (SM90) 或 Blackwell (SM100) GPU +- **SimAI 后端**:先编译 SimAI-Analytical 或 SimAI-Simulation +- **Vidur 后端**:Conda 环境 + profiling 数据 + +### 使用 AICB 后端运行 + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 2 \ + --replica_config_expert_model_parallel_size 8 \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 5 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 1024 \ + --fixed_request_length_generator_config_decode_tokens 10 \ + --random_forrest_execution_time_predictor_config_backend aicb +``` + +### 运行四场景测试套件 + +```bash +# 运行全部 4 个预配置场景 +bash examples/vidur-ali-scenarios/run_scenarios.sh --all + +# 运行单个场景 +bash examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 +``` + +--- + +## 四场景配置 + +**共享硬件**:H20 GPU (h20_dgx),NVLink 1600 Gbps,RDMA 800 Gbps,PD P2P 800 Gbps(fp8) + +| 场景 | 模型 | PD 分离 | World Size | TP | EP | 调度器 | +|------|------|---------|-----------|----|----|--------| +| 1 | Qwen3-Next-80B | 否 | 32 (dp=32) | 1 | 1 | lor | +| 2 | Qwen3-Next-80B | 是(P=2, D=6) | 8 | 1 | 1 | split_wise | +| 3 | DeepSeek-671B | 是(P=2, D=6) | 8 | 8 | 8 | split_wise | +| 4 | Qwen3-MoE-235B | 是(P=2, D=6) | 8 | 4 | 4 | split_wise | + +--- + +## 输出 + +每次仿真运行生成: + +``` +// +├── request_metrics.csv # 逐请求指标 +├── chrome_trace.json # Chrome DevTools 时间线追踪 +├── config.json # 配置快照 +└── plots/ # 各指标 CSV/JSON 文件 +``` + 
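输出目录中的 `request_metrics.csv` 可以用标准库脚本快速聚合为 TTFT、端到端延迟等服务指标。下面是一个最小示意脚本(假设该文件包含 `arrived_at`、`prefill_completed_at`、`completed_at`、`decode_time`、`request_num_decode_tokens` 等列;脚本中的数值为虚构示例数据,仅作演示):

```python
import csv
import io
import statistics

# 虚构的示例数据,列名取自 request_metrics.csv;
# 实际使用时请改为 open("request_metrics.csv") 读取真实输出
sample = io.StringIO(
    "arrived_at,prefill_completed_at,completed_at,decode_time,request_num_decode_tokens\n"
    "0.00,0.12,0.52,0.40,10\n"
    "0.05,0.30,1.10,0.80,20\n"
)

rows = list(csv.DictReader(sample))

# TTFT(首 token 时间)= prefill_completed_at - arrived_at
ttft = [float(r["prefill_completed_at"]) - float(r["arrived_at"]) for r in rows]
# 端到端延迟 = completed_at - arrived_at
e2e = [float(r["completed_at"]) - float(r["arrived_at"]) for r in rows]
# TBT(token 间时间)= decode_time / request_num_decode_tokens
tbt = [float(r["decode_time"]) / int(r["request_num_decode_tokens"]) for r in rows]

print(f"mean TTFT = {statistics.mean(ttft):.3f} s")
print(f"mean E2E  = {statistics.mean(e2e):.3f} s")
print(f"mean TBT  = {statistics.mean(tbt):.4f} s/token")
```

这些指标在真实输出的 `plots/` 目录中也会以 CDF 形式给出,此类脚本只适合快速抽查单次运行的结果。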
+输出解读请参见[结果分析](result_analysis.md)。 + +--- + +## 相关文档 + +- [vidur-alibabacloud 组件](../components/vidur.md) — 完整推理仿真文档 +- [支持的模型](supported_models.md) — 模型兼容性矩阵 +- [结果分析](result_analysis.md) — 输出解读指南 diff --git a/docs/zh/user_guide/result_analysis.md b/docs/zh/user_guide/result_analysis.md new file mode 100644 index 00000000..cec5ab78 --- /dev/null +++ b/docs/zh/user_guide/result_analysis.md @@ -0,0 +1,183 @@ +# 结果分析与可视化 + +本指南介绍如何解读和分析 SimAI 各模式的仿真输出。 + +--- + +## SimAI-Analytical 输出 + +### CSV 输出 + +运行 SimAI-Analytical 后会在 `results/` 目录生成 CSV 文件,包含: + +- **汇总行**:暴露时间、各通信组的计算时间(绝对值和百分比)、端到端迭代时间 +- **逐层行**:每层的详细操作时间 + +关键列包含各通信组(TP、DP、EP、PP)的分解,展示时间分配和重叠效果。 + +### 可视化 + +使用 `-v` 参数运行时,SimAI-Analytical 会生成额外的可视化文件,展示各通信组的时间分解。 + +```bash +# 启用可视化运行 +./bin/SimAI_analytical -w example/workload_analytical.txt -g 9216 -g_p_s 8 -r test- -busbw example/busbw.yaml -v +``` + +--- + +## SimAI-Simulation 输出 + +SimAI-Simulation(NS-3 模式)生成详细的追踪数据,捕获细粒度的网络行为。NS-3 后端输出 `.tr` 追踪文件,可使用提供的分析工具进行分析。 + +### 分析工具 + +位于 `ns-3-alibabacloud/analysis/`: + +| 工具 | 说明 | +|------|------| +| `fct_analysis.py` | 流完成时间(FCT)分析——读取 FCT 输出文件并生成统计数据 | +| `trace_reader` | 解析 `.tr` 追踪文件,支持过滤 | + +### 使用 trace_reader + +```bash +# 编译 +cd ns-3-alibabacloud/analysis +make trace_reader + +# 解析追踪文件 +./trace_reader <.tr 文件> [过滤表达式] + +# 示例: +./trace_reader output.tr "time > 2000010000" +./trace_reader output.tr "sip=0x0b000101&dip=0x0b000201" +``` + +### 追踪输出格式 + +追踪输出每行格式如下: + +``` +2000055540 n:338 4:3 100608 Enqu ecn:0 0b00d101 0b012301 10000 100 U 161000 0 3 1048(1000) +``` + +字段:时间戳(ns)、节点 ID、端口:队列、队列长度(字节)、事件类型、ECN 标志、源 IP、目的 IP、源端口、目的端口、包类型、序列号、发送时间戳、优先级组、包大小(有效负载)。 + +--- + +## 推理仿真输出 + +### 输出目录结构 + +每次推理仿真运行生成: + +``` +// +├── request_metrics.csv # 逐请求指标 +├── chrome_trace.json # Chrome DevTools 时间线追踪 +├── config.json # 配置快照 +└── plots/ # 各指标 CSV/JSON 文件 + ├── request_e2e_time.csv + ├── prefill_e2e_time.csv + ├── pd_p2p_comm_time.csv + ├── replica_N_memory_usage.json + └── ... 
+``` + +### request_metrics.csv 列说明 + +| 列名 | 含义 | +|------|------| +| `arrived_at` | 请求进入系统的时间戳(秒) | +| `scheduled_at` | 请求首次被调度的时间戳(秒) | +| `prefill_completed_at` | Prefill 完成并生成第一个 token 的时间戳 | +| `decode_arrived_at` | Decode 阶段开始的时间戳 | +| `decode_time` | Decode 阶段持续时间(秒) | +| `prefill_replica_id` | 执行 Prefill 的副本 ID(PD 模式) | +| `decode_replica_id` | 执行 Decode 的副本 ID(PD 模式) | +| `request_num_prefill_tokens` | 输入 token 数(prompt 长度) | +| `request_num_decode_tokens` | 输出 token 数(生成长度) | +| `pd_p2p_comm_size` | Prefill 到 Decode 节点的 P2P 通信大小(字节) | +| `pd_p2p_comm_time` | P2P 通信时间(秒) | +| `completed_at` | 请求完成时间戳 | +| `request_execution_time` | 总执行时间(不含延迟,秒) | +| `request_preemption_time` | 因抢占/气泡导致的等待时间(秒) | +| `request_scheduling_delay` | 调度延迟:`scheduled_at - arrived_at`(秒) | +| `request_e2e_time` | 端到端延迟:`completed_at - arrived_at`(秒) | +| `prefill_e2e_time` | 首 token 时间(TTFT):`prefill_completed_at - arrived_at`(秒) | +| `tbt` | token 间时间:`decode_time / request_num_decode_tokens`(秒/token) | + +### Chrome Trace 可视化 + +在 Chrome DevTools 中打开 `chrome_trace.json` 进行可视化时间线分析: + +1. 打开 Chrome 浏览器 +2. 访问 `chrome://tracing` +3. 
加载 `chrome_trace.json` 文件 + +### 仿真指标(23 项) + +仿真器记录 23 项细粒度指标: + +| 类别 | 指标 | +|------|------| +| **请求延迟** | E2E 时间 CDF、归一化 E2E CDF、执行+抢占 CDF | +| **调度** | 调度延迟 CDF | +| **执行** | 执行时间 CDF、抢占时间 CDF | +| **Token 级** | Decode token 执行+抢占时间、token 间延迟 | +| **批次** | 批次 token 数 CDF、批次大小 CDF | +| **Prefill** | Prefill E2E CDF、Prefill 执行+抢占 CDF(归一化) | +| **Decode** | Decode 执行+抢占归一化 CDF | +| **时间序列** | 请求/Prefill/Decode 完成、请求到达 | +| **逐副本** | 显存使用(加权均值)、繁忙时间百分比、MFU | + +详细指标定义请参见 [vidur 指标文档](../components/vidur.md)。 + +--- + +## AICB 物理执行输出 + +### 日志输出 + +每次通信后,AICB 输出: +- 通信类型和组 +- 消息大小 +- 执行时间 +- 吞吐量(algbw 和 busbw) + +### 迭代汇总 + +所有通信完成后,汇总显示: +- 总运行时间和每次迭代的时间 +- 按通信类型的统计(消息大小、频率、延迟最小/最大/平均值) + +### CSV 输出 + +结果保存在 `results/comm_logs/`: +- `<模型>_<配置>_log.csv` — 执行日志(包含时间、阶段、algbw、busbw 等) +- `<模型>_<配置>_workload.csv` — 生成的工作负载描述 + +### 编程分析 + +```python +# 读取工作负载日志 +from log_analyzer.log import Workload +workload, args = Workload.load("results/comm_logs/megatron_gpt_13B_8n_workload.csv") + +# 读取执行日志 +from log_analyzer.log import Log +log = Log.load("results/comm_logs/megatron_gpt_13B_8n_log.csv") +# log.comm_logs: List[LogItem] +# log.epoch_times: List[int] +# log.comm_log_each_epoch: List[List[LogItem]] +``` + +--- + +## 相关文档 + +- [SimAI-Analytical](simai_analytical.md) — Analytical 模式用法 +- [SimAI-Simulation](simai_simulation.md) — NS-3 仿真模式用法 +- [推理仿真](inference_simulation.md) — 推理仿真指南 +- [NS-3 组件](../components/ns3.md) — NS-3 分析工具 diff --git a/docs/zh/user_guide/simai_analytical.md b/docs/zh/user_guide/simai_analytical.md new file mode 100644 index 00000000..4f47a85f --- /dev/null +++ b/docs/zh/user_guide/simai_analytical.md @@ -0,0 +1,104 @@ +# SimAI-Analytical + +SimAI-Analytical 通过总线带宽(busbw)抽象网络通信细节来估算集合通信时间,提供快速仿真。适用于快速场景验证和性能分析。 + +## 适用场景 + +- **性能分析**:比较不同模型的完成时间(如专家数对 MoE 训练的影响) +- **并行参数优化**:平衡和优化 TP/EP/PP 参数 +- **Scale-up 探索**:研究不同 scale-up 域下的并行参数性能 +- **Scale-out 带宽选择**:研究高性价比的带宽配置 + +## 工作负载生成 + +使用 [AICB](workload_generation.md) 生成工作负载: + +```bash +sh 
./aicb/scripts/megatron_workload_with_aiob.sh \ + -m 7 --world_size 4096 \ + --tensor_model_parallel_size 2 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 8192 \ + --micro_batch 1 --seq_length 4096 \ + --swiglu --use_flash_attn --aiob_enable +``` + +生成的 `.txt` 工作负载文件包含: +- `model_parallel_NPU_group`:张量并行度 +- `ep`:专家模型并行度 +- `pp`:流水线并行度 +- `vpp`:虚拟流水线并行度 + +> 更多信息参见 [AICB 工作负载生成](workload_generation.md) 和 [AICB 组件文档](../components/aicb.md)。 + +## Busbw 配置 + +SimAI-Analytical 使用 `busbw.yaml` 文件为不同通信组指定总线带宽: + +```yaml +test +TP: + allreduce,: 300 # TP 组内 AllReduce busbw 300GB/s + allgather,: 280 + reducescatter,: 280 + alltoall,: 230 +DP: + allreduce,: null + allgather,: 380 # DP 组内 AllGather busbw 380GB/s + reducescatter,: 380 + alltoall,: null +EP: + allreduce,: null + allgather,: 45 + reducescatter,: 45 + alltoall,: 80 # EP 组内 AlltoAll busbw 80GB/s +``` + +## 运行 Analytical 仿真 + +```bash +./bin/SimAI_analytical \ + -w example/workload_analytical.txt \ + -g 9216 \ + -g_p_s 8 \ + -r test- \ + -busbw example/busbw.yaml +``` + +### 必选参数 + +| 参数 | 长格式 | 说明 | +|:----:|:-------|:-----| +| `-w` | `--workload` | 工作负载文件路径 | +| `-g` | `--gpus` | 仿真 GPU 规模 | +| `-g_p_s` | `--gpus-per-server` | Scale-up 大小(每服务器 GPU 数) | +| `-r` | `--result` | 输出文件路径和前缀(默认:`./results/`) | +| `-busbw` | `--bus-bandwidth` | busbw 文件路径 | + +### 可选参数 + +| 参数 | 长格式 | 说明 | +|:----:|:-------|:-----| +| `-v` | `--visual` | 生成可视化文件 | + +### 重叠比例 + +| 参数 | 长格式 | 说明 | 范围 | +|:----:|:-------|:-----|:-----| +| `-dp_o` | `--dp-overlap-ratio` | DP 重叠比例 | [0.0-1.0] | +| `-ep_o` | `--ep-overlap-ratio` | EP 重叠比例 | [0.0-1.0] | +| `-tp_o` | `--tp-overlap-ratio` | TP 重叠比例 | [0.0-1.0] | +| `-pp_o` | `--pp-overlap-ratio` | PP 重叠比例 | [0.0-1.0] | + +## 结果分析 + +运行 SimAI-Analytical 后生成的 CSV 文件包含: +- 汇总行:暴露时间、各通信组的计算时间百分比、端到端迭代时间 +- 逐层操作详情 + +![原始输出](../../images/simai_raw.png) + +指定 `-v` 参数后还会生成可视化文件: + +![可视化](../../images/simai_visual.png) + +更多结果分析方法请参考[结果分析](result_analysis.md)。 diff --git 
a/docs/zh/user_guide/simai_physical.md b/docs/zh/user_guide/simai_physical.md new file mode 100644 index 00000000..7a19f651 --- /dev/null +++ b/docs/zh/user_guide/simai_physical.md @@ -0,0 +1,113 @@ +# SimAI-Physical 模式 + +> **状态**:Beta — 目前处于内部测试阶段。 + +SimAI-Physical 在 CPU RDMA 集群环境中生成物理流量。该模式生成类 NCCL 的流量模式,用于深入研究 LLM 训练过程中的 NIC 行为。 + +--- + +## 概述 + +与 SimAI-Analytical 和 SimAI-Simulation 完全在软件中仿真不同,SimAI-Physical 将真实的 RDMA 流量注入物理网络。可用于: + +- 研究真实 LLM 训练流量模式下的 NIC 行为 +- 在真实硬件上验证网络配置 +- 使用典型集合通信工作负载进行 RDMA 性能基准测试 + +**组件组合**:[AICB](../components/aicb.md) + [SimCCL](../components/simccl.md) + [astra-sim-alibabacloud](../components/astra_sim.md)(物理模式) + +--- + +## 前置条件 + +SimAI-Physical 使用 RoCEv2 协议生成流量。编译前请确保: + +- **RDMA 支持**:可用的 `libibverbs` / RDMA 设备驱动 +- **MPI**:已安装并可运行 OpenMPI +- **验证**:能成功运行 `ib_write_bw` 等 RDMA 性能测试工具 + +--- + +## 编译 + +```bash +# 克隆和初始化 +git clone https://github.com/aliyun/SimAI.git +cd SimAI/ +git submodule update --init --recursive +git submodule update --remote + +# 安装 MPI(CentOS/RHEL) +sudo yum install openmpi openmpi-devel + +# 设置 MPI 路径 +export MPI_INCLUDE_PATH=/usr/include/openmpi-x86_64/ +export MPI_BIN_PATH=/usr/lib64/openmpi/bin/mpic++ + +# 编译 SimAI-Physical +./scripts/build.sh -c phy +``` + +--- + +## 工作负载生成 + +SimAI-Physical 使用与 SimAI-Simulation 相同的工作负载格式,通过 [AICB](../components/aicb.md) 生成。详见[工作负载生成](workload_generation.md)。 + +--- + +## 准备主机列表 + +为 MPI 程序准备 IP 列表文件。IP 数量需与参与物理流量生成的 NIC 数量一致(非节点数)。 + +``` +33.255.199.130 +33.255.199.129 +``` + +--- + +## 运行 + +### MPI 执行 + +```bash +/usr/lib64/openmpi/bin/mpirun -np 2 \ + -host 33.255.199.130,33.255.199.129 \ + --allow-run-as-root \ + -x AS_LOG_LEVEL=0 \ + ./bin/SimAI_phynet ./hostlist -g 2 -w ./example/microAllReduce.txt +``` + +### MPI 参数 + +| 参数 | 说明 | 默认值 | +|------|------|--------| +| `-np` | 进程数 | 必填 | +| `-host` | 逗号分隔的 IP 列表 | 必填 | +| `--allow-run-as-root` | 允许以 root 运行 | `FALSE` | + +### SimAI-Physical 参数 + +| 参数 | 说明 | 默认值 | +|------|------|--------| +| `hostlist` | 主机 
IP 列表文件路径 | 必填 | +| `-w` / `--workload` | 工作负载文件路径 | `./microAllReduce.txt` | +| `-i` / `--gid_index` | RDMA 设备 GID 索引 | `0` | +| `-g` / `--gpus` | GPU 数量(须与 hostlist 中 IP 数一致) | `8` | + +--- + +## 注意事项 + +- GPU 数量(`-g`)必须与主机 IP 列表中的 IP 数一致 +- 确保所有节点具有网络连通性且 RDMA 已正确配置 +- SimAI-Physical 目前为 Beta 版本;部分功能可能在后续版本中变更 + +--- + +## 相关文档 + +- [AICB 组件](../components/aicb.md) — 工作负载生成 +- [SimCCL 组件](../components/simccl.md) — 集合通信转换 +- [astra-sim 组件](../components/astra_sim.md) — 仿真引擎详情 diff --git a/docs/zh/user_guide/simai_simulation.md b/docs/zh/user_guide/simai_simulation.md new file mode 100644 index 00000000..d6377c60 --- /dev/null +++ b/docs/zh/user_guide/simai_simulation.md @@ -0,0 +1,113 @@ +# SimAI-Simulation + +SimAI-Simulation 使用 NS-3 作为网络后端,提供高保真的全栈仿真和细粒度网络通信建模。适用于集合通信算法、网络协议和新型网络架构的深入研究。 + +## 适用场景 + +- **集合通信算法研究**:设计和优化非交换机架构下的流量模式 +- **网络协议研究**:评估拥塞控制、路由机制和底层协议 +- **新型网络架构设计**:探索创新的网络拓扑和配置 + +## 工作负载生成 + +使用与 SimAI-Analytical 相同的工作负载,由 [AICB](workload_generation.md) 生成。 + +## 网络拓扑配置 + +运行 SimAI-Simulation 之前,需要生成 ns-3-alibabacloud 识别的拓扑文件。 + +### 拓扑模板 + +SimAI 提供 5 种常见架构模板: + +| 模板 | 说明 | 默认 GPU 数 | +|------|------|-------------| +| `Spectrum-X` | Rail-optimized,单 ToR,单 Plane | 4096 | +| `AlibabaHPN`(单 Plane) | Rail-optimized,双 ToR,单 Plane | 15360 | +| `AlibabaHPN`(双 Plane) | Rail-optimized,双 ToR,双 Plane | 15360 | +| `DCN+`(单 ToR) | 非 Rail-optimized,单 ToR | 512 | +| `DCN+`(双 ToR) | 非 Rail-optimized,双 ToR | 512 | + +![Spectrum-X](../../images/Spectrum-X.jpg) + +### 生成拓扑 + +```bash +# 8 GPU 的 Spectrum-X 拓扑 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 8 -psn 1 + +# 64 GPU 的双 Plane AlibabaHPN 拓扑 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo AlibabaHPN --dp -g 64 -asn 16 -psn 16 + +# 128 GPU 的双 ToR DCN+ 拓扑 +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo DCN+ --dt -g 128 -asn 2 -psn 8 + +# 自定义 Rail-optimized 拓扑 +python3 
./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -g 32 -bw 200Gbps -gt A100 -psn 8 --ro +``` + +### 拓扑参数 + +| 层级 | 参数 | 说明 | +|------|------|------| +| 整体结构 | `-topo` | 模板名称 | +| | `-g` | GPU 数量 | +| | `--dp` | 启用双 Plane | +| | `--ro` | 启用 Rail-optimized | +| | `--dt` | 启用双 NIC 和双 ToR | +| | `-er` | 错误率 | +| 服务器内 | `-gps` | 每服务器 GPU 数 | +| | `-gt` | GPU 类型 | +| | `-nvbw` | NVLink 带宽 | +| | `-nl` | NVLink 延迟 | +| | `-l` | NIC 延迟 | +| Segment 内 | `-bw` | NIC 到 ASW 带宽 | +| | `-asn` | ASW 交换机数量 | +| | `-nps` | 每交换机 NIC 数 | +| Pod 内 | `-psn` | PSW 交换机数量 | +| | `-apbw` | ASW 到 PSW 带宽 | +| | `-app` | 每 PSW 的 ASW 数 | + +> 详细拓扑参数和各模板默认值请参见 [astra-sim 组件文档](../components/astra_sim.md)。 + +## 运行 NS-3 仿真 + +```bash +AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \ + -t 16 \ + -w ./example/microAllReduce.txt \ + -n ./Spectrum-X_8g_8gps_400Gbps_H100 \ + -c astra-sim-alibabacloud/inputs/config/SimAI.conf +``` + +### 环境变量 + +| 变量 | 说明 | 默认值 | +|------|------|--------| +| `AS_LOG_LEVEL` | 日志级别:`DEBUG`、`INFO`、`WARNING`、`ERROR` | `INFO` | +| `AS_PXN_ENABLE` | 启用 PXN(`0`/`1`) | `0` | +| `AS_NVLS_ENABLE` | 启用 NVLS(`0`/`1`) | `0` | +| `AS_SEND_LAT` | 数据包发送延迟(us) | `6` | +| `AS_NVLSTREE_ENABLE` | 启用 NVLSTREE | `false` | + +### 仿真参数 + +| 参数 | 说明 | 默认值 | +|------|------|--------| +| `-t` / `--thread` | 线程数(推荐 8-16) | `1` | +| `-w` / `--workload` | 工作负载文件路径 | `./microAllReduce.txt` | +| `-n` / `--network-topo` | 网络拓扑路径 | 无 | +| `-c` / `--config` | SimAI 配置文件 | 无 | + +## 示例:RING vs NVLS 对比 + +请参阅 [Tutorial](../../docs/Tutorial.md#ring-vs-nvls) 了解 RING 和 NVLS 算法在不同消息大小下的完整对比。 + +## 延伸阅读 + +- [NS-3 组件文档](../components/ns3.md) — NS-3 模块详细参考 +- [结果分析](result_analysis.md) — 仿真输出分析 diff --git a/docs/zh/user_guide/supported_models.md b/docs/zh/user_guide/supported_models.md new file mode 100644 index 00000000..1a6913e2 --- /dev/null +++ b/docs/zh/user_guide/supported_models.md @@ -0,0 +1,133 @@ +# 支持的模型 + +SimAI 支持多种 LLM 模型的训练和推理仿真。 + +--- + +## 推理模型(SimAI 1.5+) + +以下模型通过
vidur-alibabacloud 支持多请求推理仿真,包含 GPU 显存计算、PD 分离和工作负载生成支持。 + +### DeepSeek-V3-671B + +| 属性 | 值 | +|------|------| +| **总层数** | 61 | +| **注意力类型** | MLA(多头潜在注意力) | +| **注意力头数** | 128 | +| **隐藏维度** | 7168 | +| **KV LoRA 秩** | 512 | +| **Q LoRA 秩** | 1536 | +| **QK RoPE Head Dim** | 64 | +| **QK NoPE Head Dim** | 128 | +| **V Head Dim** | 128 | +| **MoE 路由专家数** | 256 | +| **每 token 激活专家数** | 8 | +| **共享专家数** | 1 | +| **稠密层** | 前 3 层(固定激活 8 个路由 + 1 个共享专家) | +| **稀疏层** | 第 3-60 层(从 256 个路由专家中动态选择 8 个 + 1 个共享专家) | +| **配置文件** | `vidur-alibabacloud/data/hf_configs/deepseek_v3_config.json` | + +### Qwen3-MoE-235B + +| 属性 | 值 | +|------|------| +| **总层数** | 94 | +| **注意力类型** | MHA / GQA | +| **注意力头数** | 64 | +| **KV 头数** | 4 | +| **隐藏维度** | 4096 | +| **Head Dim** | 128 | +| **MoE 路由专家数** | 128 | +| **每 token 激活专家数** | 8 | +| **MoE 中间维度** | 1536 | +| **配置文件** | `vidur-alibabacloud/data/hf_configs/qwen3_moe_config.json` | + +### Qwen3-Next-80B + +| 属性 | 值 | +|------|------| +| **总层数** | 48 | +| **注意力类型** | 混合(全注意力 + 线性注意力,每 4 层交替) | +| **全注意力头数** | 16 | +| **KV 头数** | 2 | +| **隐藏维度** | 2048 | +| **Head Dim** | 256 | +| **线性注意力 Key 头数** | 16 | +| **线性注意力 Value 头数** | 32 | +| **MoE 路由专家数** | 512 | +| **每 token 激活专家数** | 10 | +| **MoE 中间维度** | 512 | +| **配置文件** | `vidur-alibabacloud/data/hf_configs/qwen3-next-80B-A3B_config.json` | + +### 传统推理模型(通过 Vidur 后端) + +以下模型使用原版 Vidur 基于 profiling 的后端: + +| 模型 | TP 支持 | PP 支持 | +|------|---------|---------| +| meta-llama/Meta-Llama-3-8B | 是 | 是 | +| meta-llama/Meta-Llama-3-70B | 是 | 是 | +| meta-llama/Llama-2-7b-hf | 是 | 是 | +| meta-llama/Llama-2-70b-hf | 是 | 是 | +| codellama/CodeLlama-34b-Instruct-hf | 是 | 是 | +| internlm/internlm-20b | 是 | 是 | +| Qwen/Qwen-72B | 是 | 是 | + +--- + +## 训练模型(AICB) + +以下模型支持训练工作负载生成: + +### AICB 基准测试套件 + +| ID | 模型 | 序列长度 | 框架 | TP | PP | SP | MoE | +|----|------|----------|------|----|----|-----|-----| +| 1 | LLaMA-7B | 2048 | Megatron | 1 | 1 | - | - | +| 2 | GPT-13B | 2048 | Megatron | 2 | 1 | 是 | - | +| 3 | 
GPT-22B | 2048 | Megatron | 4 | 1 | - | - | +| 4 | LLaMA-65B | 4096 | Megatron | 8 | 2 | 是 | - | +| 5 | GPT-175B | 2048 | Megatron | 8 | 8 | 是 | - | +| 6 | GPT-175B | 2048 | Megatron | 8 | 8 | - | - | +| 7 | Llama3-405B | 8192 | Megatron | 8 | 16 | 是 | - | +| 8 | LLaMA-7B | 4096 | DeepSpeed | 1 | 1 | - | Zero-2 | +| 9 | LLaMA-65B | 4096 | DeepSpeed | 1 | 1 | - | Zero-3 | +| 10 | Mistral-8x7B | 2048 | Megatron | 2 | 1 | 是 | 8 专家 | + +--- + +## 注意力架构对比 + +| 架构 | 模型 | KV Cache 策略 | 内存效率 | +|------|------|-------------|----------| +| **MLA** | DeepSeek-V3-671B | 压缩潜向量(`kv_lora_rank` + `qk_rope_head_dim`) | 相比 MHA 约 57 倍缩减 | +| **MHA / GQA** | Qwen3-MoE-235B | 标准 KV 缓存(`num_kv_heads * head_dim`) | 标准 | +| **混合全注意力 + 线性注意力** | Qwen3-Next-80B | 全注意力层 + 线性 (GDN) 注意力每 4 层交替 | 减少(线性注意力层无 KV 缓存) | + +--- + +## 硬件要求 + +### 推理性能分析(AICB 后端) + +| 要求 | 详情 | +|------|------| +| **GPU 架构** | NVIDIA Hopper (SM90) 或 Blackwell (SM100) | +| **原因** | 依赖 DeepGEMM、FlashMLA、FlashInfer | +| **Hopper 注意** | 在 Dockerfile 中添加 `ENV FLASH_MLA_DISABLE_SM100=1` | + +### 训练仿真 + +- **SimAI-Analytical**:任意 CPU(无需 GPU) +- **SimAI-Simulation**:任意 CPU(无需 GPU) +- **AICB 物理执行**:需要支持 NCCL 的 GPU 集群 + +--- + +## 相关文档 + +- [推理仿真](inference_simulation.md) — 多请求推理指南 +- [工作负载生成](workload_generation.md) — AICB 工作负载生成 +- [GPU 显存模块](../technical_reference/memory_module.md) — 显存计算详情 +- [vidur-alibabacloud](../components/vidur.md) — 推理调度组件 diff --git a/docs/zh/user_guide/workload_generation.md b/docs/zh/user_guide/workload_generation.md new file mode 100644 index 00000000..95b1eb37 --- /dev/null +++ b/docs/zh/user_guide/workload_generation.md @@ -0,0 +1,160 @@ +# 工作负载生成 + +AICB(AI Communication Benchmark)为 SimAI 的训练和推理仿真提供工作负载生成功能。 + +--- + +## 概述 + +AICB 生成工作负载描述文件(`.txt`),描述 LLM 训练/推理过程的通信和计算模式。这些工作负载由 SimAI 仿真引擎消费。 + +支持两类工作负载生成: + +| 类型 | 说明 | 支持的模型 | +|------|------|-----------| +| **训练** | 生成训练通信/计算模式 | GPT (7B/13B/22B/175B)、LLaMA (7B/65B/405B)、DeepSeek (16B/236B/671B)、MoE | +| **推理** | 生成 prefill/decode 阶段工作负载 
| DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B | + +--- + +## 训练工作负载生成 + +### 使用预配置模型快速开始 + +```bash +# 生成 Megatron GPT-7B 工作负载 +sh ./scripts/megatron_workload_with_aiob.sh -m 7 \ + --world_size 4096 --tensor_model_parallel_size 4 --pipeline_model_parallel 1 \ + --frame Megatron --global_batch 8192 \ + --micro_batch 1 --seq_length 4096 --swiglu \ + --use_flash_attn --aiob_enable \ + --comp_filepath workload/aiob_inputs/Example.txt +``` + +可用预配置模型大小:`7`、`13`、`22`、`175`(GPT/LLaMA)、`moe`、`deepseek`(16/236/671)。 + +### 不同框架的生成方法 + +#### Megatron + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name GPT-13B --frame=Megatron \ + --world_size=16 --tensor_model_parallel_size=2 --pipeline_model_parallel=1 \ + --global_batch=16 --micro_batch=1 --num_layers=40 --seq_length=2048 \ + --hidden_size=5120 --epoch_num=1 --num_attention_heads=40 \ + --aiob_enable --use_flash_attn --swiglu +``` + +#### MoE + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --model_name MoE --frame=Megatron \ + --world_size=32 --tensor_model_parallel_size=4 --pipeline_model_parallel=1 \ + --expert_model_parallel_size=2 --moe_enable --num_experts=8 --moe_router_topk=2 \ + --global_batch=32 --micro_batch=1 --seq_length=2048 \ + --aiob_enable --swiglu --use_flash_attn +``` + +#### DeepSeek + +```bash +python -m workload_generator.SimAI_training_workload_generator \ + --frame=DeepSeek \ + --world_size=32 --tensor_model_parallel_size=4 \ + --expert_model_parallel_size=2 --moe_enable --num_experts=4 --moe_router_topk=2 \ + --global_batch=16 --micro_batch=1 --seq_length=4096 \ + --aiob_enable --swiglu -m deepseek +``` + +#### DeepSpeed + +```bash +python -m workload_generator.generate_deepspeed_stage3_workload \ + --world_size=64 --global_batch=64 \ + --num_layers=40 --hidden_size=5120 --seq_length=4096 \ + --zero_stage=3 --reduce_bucket_size=1000000000 +``` + +### 输出 + +生成的工作负载文件保存在: +- 训练:`results/mocked_workload/` 或 `results/workload/` + 
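上面训练示例中的结构超参(如 GPT-13B 的 `--num_layers=40`、`--hidden_size=5120`)可以用标准稠密 Transformer 公式做一次量级校验,确认参数配置与模型名称相符。下面是一个示意性脚本(假设 GPT-2 词表大小 50257,仅作粗略估算,并非 AICB 的实际实现):

```python
def gpt_param_estimate(num_layers: int, hidden_size: int,
                       vocab_size: int = 50257, seq_length: int = 2048) -> int:
    """粗略估算 GPT 类稠密模型参数量(量级校验用,非精确值)。"""
    # 每层约 12*h^2:注意力 QKV 与输出投影约 4*h^2,MLP(4h 中间维度)约 8*h^2
    per_layer = 12 * hidden_size ** 2
    # 词嵌入与位置嵌入
    embeddings = (vocab_size + seq_length) * hidden_size
    return num_layers * per_layer + embeddings

# 对应上文 Megatron GPT-13B 示例:--num_layers=40 --hidden_size=5120
print(f"{gpt_param_estimate(40, 5120) / 1e9:.1f}B")  # 约 12.9B,与 13B 规模一致
```

若估算结果与模型名称量级明显不符,通常说明 `--num_layers` 或 `--hidden_size` 配置有误。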
+--- + +## 推理工作负载生成 + +SimAI 使用 AICB 生成带有 prefill/decode 阶段分离的推理工作负载。 + +> **注意**:推理计算性能分析需要 NVIDIA Hopper (SM90) 或 Blackwell (SM100) GPU,因为依赖 [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) 和 [FlashMLA](https://github.com/deepseek-ai/FlashMLA)。 + +### 支持的推理模型 + +| 模型 | 注意力 | MoE 专家数 | 每 token 激活专家数 | +|------|--------|-----------|-------------------| +| DeepSeek-V3-671B | MLA | 256 路由 + 1 共享 | 8 | +| Qwen3-MoE-235B | MHA/GQA | 128 路由 | 8 | +| Qwen3-Next-80B | 混合(全注意力 + 线性注意力) | 512 路由 | 10 | + +推理工作负载由 vidur-alibabacloud 调度框架自动生成和消费。端到端用法请参见[推理仿真](inference_simulation.md)。 + +--- + +## AIOB:计算时间嵌入 + +AIOB(AI Operation Benchmark)是 AICB 的子模块,用于分析实际 GPU 计算时间并将其嵌入工作负载。 + +### 使用选项 + +| 选项 | 说明 | +|------|------| +| `--aiob_enable` | 启用 AIOB,在当前 GPU 上分析计算时间 | +| `--comp_filepath ` | 使用已有的计算时间描述文件 | +| 均不指定 | 使用固定默认计算时间 | + +### 示例:分析并嵌入 + +```bash +sh scripts/megatron_gpt.sh \ + -m 7 --world_size 8 --tensor_model_parallel_size 2 \ + --frame Megatron --global_batch 16 --micro_batch 1 \ + --seq_length 2048 --swiglu --use_flash_attn --aiob_enable +``` + +计算描述文件保存在 `results/aiob_outputs/`。 + +--- + +## 关键参数 + +| 类别 | 参数 | 说明 | +|------|------|------| +| **框架** | `--frame` | Megatron / DeepSpeed / DeepSeek | +| **模型** | `--model_size` 或 `-m` | 预配置模型大小 | +| **训练** | `--world_size` | GPU 总数 | +| | `--global_batch` | 全局批大小 | +| | `--micro_batch` | 微批大小 | +| | `--seq_length` | 序列长度 | +| | `--epoch_num` | 迭代次数 | +| **并行** | `--tensor_model_parallel_size` | TP 并行度 | +| | `--pipeline_model_parallel` | PP 并行度 | +| | `--expert_model_parallel_size` | EP 并行度 | +| **MoE** | `--moe_enable` | 启用 MoE | +| | `--num_experts` | 专家数量 | +| | `--moe_router_topk` | 每 token 激活专家数 | +| **优化** | `--use_flash_attn` | 使用 FlashAttention | +| | `--swiglu` | 使用 SwiGLU 激活函数 | +| | `--aiob_enable` | 启用 AIOB 计算分析 | +| | `--comp_filepath` | 计算时间文件路径 | + +完整参数列表请参见 [AICB 组件文档](../components/aicb.md) 或 [CLI 参考](../technical_reference/cli_reference.md)。 + +--- + +## 相关文档 + +- [AICB 
组件](../components/aicb.md) — 完整 AICB 文档 +- [推理仿真](inference_simulation.md) — 端到端推理仿真指南 +- [支持的模型](supported_models.md) — 完整模型兼容列表 diff --git a/vidur-alibabacloud/README-vidur.md b/vidur-alibabacloud/README-vidur.md index 6f1cd2d0..00c08c04 100644 --- a/vidur-alibabacloud/README-vidur.md +++ b/vidur-alibabacloud/README-vidur.md @@ -164,4 +164,164 @@ This project may contain trademarks or logos for projects, products, or services trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. -Any use of third-party trademarks or logos are subject to those third-party's policies. \ No newline at end of file +Any use of third-party trademarks or logos are subject to those third-party's policies. + +## SimAI / AICB 场景示例(一键运行) + +> 以下命令均在 `vidur-alibabacloud/` 目录下执行,需提前激活 `vidur` conda 环境。 +> 使用 AICB 后端 (`--random_forrest_execution_time_predictor_config_backend aicb`), +> 设备为 H20 DGX (`h20_dgx`),请求生成为 Poisson QPS=100,固定长度 prefill=100/decode=8。 +> 所有输入输出文件统一汇聚至 `examples/vidur-ali-scenarios/` 目录: +> - 脚本: `examples/vidur-ali-scenarios/run_scenarios.sh` +> - 运行日志: `examples/vidur-ali-scenarios/logs/scenario__.log` +> - 模拟输出: `examples/vidur-ali-scenarios/simulator_output//` + +### 场景汇总 + +| 场景 | 模型 | ws | TP | PP | EP | PD分离 | 调度器 | +|------|------|----|----|----|-----|--------|--------| +| 1 | Qwen3-Next-80B | 32 | 1 | 1 | 32 | 否 | lor | +| 2 | Qwen3-Next-80B | 8 (P=2,D=6) | 1 | 1 | auto | 是 | split_wise | +| 3 | DeepSeek-671B | 8 (P=2,D=6) | 8 | 1 | 8 | 是 | split_wise | +| 4 | Qwen3-MoE-235B | 8 (P=2,D=6) | 4 | 1 | 4 | 是 | split_wise | + +### 使用方法 + +```sh +# 激活环境 +conda activate vidur + +# 运行单个场景(1~4) +bash examples/vidur-ali-scenarios/run_scenarios.sh --scenario 1 + +# 顺序运行所有场景 +bash 
examples/vidur-ali-scenarios/run_scenarios.sh --all + +# 查看帮助 +bash examples/vidur-ali-scenarios/run_scenarios.sh --help +``` + +### 分场景命令(手动方式) + +**场景 1: Qwen3-Next-80B 无PD分离 (ws=32, lor)** + +```sh +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 32 \ + --replica_config_pd_node_ratio 1 \ + --global_scheduler_config_type lor \ + --replica_scheduler_config_type sarathi \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 +``` + +**场景 2: Qwen3-Next-80B PD分离 (P=2, D=6, split_wise)** + +```sh +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + 
--fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --replica_config_num_prefill_replicas 2 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-next-80B \ + --replica_config_tensor_parallel_size 1 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_prefill_tensor_parallel_size 1 \ + --replica_config_prefill_num_pipeline_stages 1 \ + --replica_config_decode_tensor_parallel_size 1 \ + --replica_config_decode_num_pipeline_stages 1 +``` + +**场景 3: DeepSeek-671B PD分离 (tp=8, ep=8, split_wise)** + +```sh +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + 
--replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 8 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_expert_model_parallel_size 8 +``` + +**场景 4: Qwen3-MoE-235B PD分离 (tp=4, ep=4, split_wise)** + +```sh +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype fp8 \ + --replica_config_network_device h20_dgx \ + --replica_config_device h20 \ + --request_generator_config_type synthetic \ + --interval_generator_config_type poisson \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 4 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 100 \ + --fixed_request_length_generator_config_decode_tokens 8 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --random_forrest_execution_time_predictor_config_backend aicb \ + --metrics_config_output_dir examples/vidur-ali-scenarios/simulator_output \ + --cluster_config_num_replicas 8 \ + --replica_config_pd_node_ratio 0.25 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name qwen3-moe-235B \ + --replica_config_tensor_parallel_size 4 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_expert_model_parallel_size 4 +``` \ No newline at end of file diff --git a/vidur-alibabacloud/README.md b/vidur-alibabacloud/README.md index 244a1e47..f844bb5a 100644 --- a/vidur-alibabacloud/README.md +++ b/vidur-alibabacloud/README.md @@ -1,64 +1,125 @@ -# README +

+ 中文  |  English +

+# Vidur-AlibabaCloud + +[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/) +[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) + +Vidur ([original](https://github.com/microsoft/vidur)) is a simulation framework for large language model (LLM) inference systems. +**Vidur-AlibabaCloud** (this repository) is a customized version optimized for Alibaba Cloud **SimAI** scenarios. It supports advanced features such as **Prefill–Decode (PD) disaggregation** and includes dedicated adaptations for SOTA LLM models including **DeepSeek-V3-671B**, **Qwen3-MoE-235B**, **Qwen3-Next-80B**, and others. -Vidur ([original](https://github.com/microsoft/vidur)) is a simulation framework for large language model (LLM) inference systems. -**Vidur-AlibabaCloud** (this repository) is a customized version optimized for Alibaba Cloud **SimAI** scenarios. It supports advanced features such as **Prefill–Decode (PD) disaggregation** and includes dedicated adaptations for state-of-the-art (SOTA) LLM models including **DeepSeek-V3-671B**, **Qwen3-MoE-235B**, **Qwen3-Next-80B**, and other models. + +--- + +## Table of Contents + +- [Key Features](#key-features) +- [GPU Memory Calculation](#gpu-memory-calculation) +- [Supported Models](#supported-models) +- [Environment Setup](#-environment-setup) +- [Running Examples](#%EF%B8%8F-running-examples) + - [4-Scenario Configuration](#4-scenario-configuration) + - [Output Files](#output-files) +- [Key Input Parameters](#-key-input-parameter-reference) +- [Key Output Interpretation](#-key-output-interpretation) +- [Known Issues](#%EF%B8%8F-known-issues) +- [Help](#-help) --- ## Key Features -+ **Prefill–Decode (PD) Separation** – Enables running the prefill and decode stages on different nodes, allowing elastic resource allocation and performance isolation. -(Inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)). 
-+ **Flexible Parallelism** – Supports: - - **Data Parallel (DP)** - - **Tensor Parallel (TP)** - - **Pipeline Parallel (PP)** - - **Expert Parallel (EP)** (support in progress) -Works for both **dense** and **Mixture-of-Experts (MoE)** models (MoE support in progress). -+ **Multiple Execution-Time Prediction Backends** – Choose from: - - **AICB/AIOB** - Partially supports computation kernels and TP, DP, PP, EP communication size for DeepSeek-V3-671B, Qwen3-Moe-235B, Qwen3-Next-80B - - **SimAi_simulation** – SimAI NS-3-based network simulation (supports TP) - - **SimAi_analytical** – SimAI analytical performance model (supports TP) - - **Native Vidur [original]** – Supports TP, DP, PP -+ **Workload Generation & Replay** – Replay real-world traces or generate synthetic requests using fixed or Poisson distributions. -+ **Fine-Grained Metrics** – Records: - - TTFT – Time to First Token - - TBT / TPOT – Time Between Tokens / Time Per Output Token - - End-to-end latency - - Communication cost - - Computation cost - - Scheduling delay + +- **Prefill–Decode (PD) Separation** — Enables running the prefill and decode stages on different nodes, allowing elastic resource allocation and performance isolation. + (Inspired by [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)) +- **Flexible Parallelism** — Supports: + - **Data Parallel (DP)** + - **Tensor Parallel (TP)** + - **Pipeline Parallel (PP)** + - **Expert Parallel (EP)** (support in progress) + + Works for both **dense** and **Mixture-of-Experts (MoE)** models (MoE support in progress). 
+- **Multiple Execution-Time Prediction Backends** — Choose from: + - **AICB/AIOB** — Partially supports computation kernels and TP, DP, PP, EP communication size for DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B + - **SimAI Simulation** — SimAI NS-3-based network simulation (supports TP) + - **SimAI Analytical** — SimAI analytical performance model (supports TP) + - **Native Vidur [original]** — Supports TP, DP, PP +- **Workload Generation & Replay** — Replay real-world traces or generate synthetic requests using fixed or Poisson distributions. +- **Fine-Grained Metrics** — Records: + - TTFT — Time to First Token + - TBT / TPOT — Time Between Tokens / Time Per Output Token + - End-to-end latency + - Communication cost + - Computation cost + - Scheduling delay + +--- + +## GPU Memory Calculation + +This module provides accurate GPU memory estimation for modern MoE (Mixture-of-Experts) models during inference simulation, covering **model parameter memory**, **KV cache memory**, and **maximum batch size** calculation under Prefill–Decode (PD) disaggregation. + +### Supported Attention Architectures + +| Architecture | Model | Description | +|---|---|---| +| **MLA** (Multi-head Latent Attention) | DeepSeek-V3-671B | Uses LoRA-compressed KV cache (`kv_lora_rank` + `qk_rope_head_dim`) for reduced memory footprint | +| **MHA / GQA** (Multi-Head / Grouped-Query Attention) | Qwen3-MoE-235B | Standard KV cache with `num_kv_heads * head_dim` per token per layer | +| **Hybrid Full + Linear Attention** | Qwen3-Next-80B | Alternates between full attention and linear (GDN) attention every 4 layers | + +### Key Components + +- **`ParamCounter`** (`vidur/utils/param_counter.py`) — Computes per-layer and per-device parameter counts for MLA, MHA/GQA, linear attention, and MoE expert weights, with FP8 quantization support. Under PD disaggregation, it returns separate `(total_params, prefill_params, decode_params)` based on `prefill_world_size` / `decode_world_size`. 
+- **`MemoryPlanner`** (`vidur/scheduler/utils/memory_planner.py`) — Plans GPU memory budget: `available = GPU_mem * (1 - margin) - param_mem`, then computes KV cache capacity and maximum concurrent requests. Includes OOM detection with actionable suggestions. +- **Per-request KV cache tracking** (`vidur/entities/replica.py`) — Allocates and releases KV cache memory on a per-request basis, enabling accurate remaining-capacity queries at runtime. + +### References & Acknowledgments + +The GPU memory calculation module was developed with reference to the following works: + +- [InferSim](https://github.com/alibaba/InferSim) — Parameter counting and KV cache estimation methodology +- [DeepSeek V3 Parameter Size Analysis](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html) — DeepSeek V3 MLA parameter derivation +- [DeepSeek V3 Parameter Derivation (Chinese)](https://zhuanlan.zhihu.com/p/21455638257) — Detailed MLA weight decomposition + +We gratefully acknowledge these resources for providing the foundational analysis that guided our implementation. 
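The budget rule and per-token cache sizes described above reduce to a few lines of arithmetic. The sketch below is illustrative only (fp16 cache entries assumed; the actual `ParamCounter`/`MemoryPlanner` additionally handle FP8 quantization, prefill/decode world-size splits, and OOM reporting):

```python
GIB = 1024 ** 3

def kv_bytes_mla(kv_lora_rank: int, qk_rope_head_dim: int, dtype_bytes: int = 2) -> int:
    # MLA caches one compressed latent plus the decoupled RoPE key, per token per layer
    return (kv_lora_rank + qk_rope_head_dim) * dtype_bytes

def kv_bytes_gqa(num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # A standard MHA/GQA cache stores both K and V, per token per layer
    return 2 * num_kv_heads * head_dim * dtype_bytes

def max_kv_tokens(gpu_mem_gib: float, margin: float, param_mem_gib: float,
                  kv_bytes_per_token: int, num_layers: int) -> int:
    # MemoryPlanner budget rule: available = GPU_mem * (1 - margin) - param_mem
    available_bytes = (gpu_mem_gib * (1 - margin) - param_mem_gib) * GIB
    return int(available_bytes // (kv_bytes_per_token * num_layers))

# DeepSeek-V3-671B: MLA latent (512 + 64) vs. an uncompressed 128-head MHA cache
print(kv_bytes_mla(512, 64))    # 1152 bytes per token per layer
print(kv_bytes_gqa(128, 128))   # 65536 bytes, roughly 57x larger
```

For Qwen3-MoE-235B the GQA figure is `kv_bytes_gqa(4, 128)` = 2048 bytes per token per layer, while Qwen3-Next-80B only pays a comparable cost on its full-attention layers.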
--- ## Supported Models -+ **DeepSeek-V3-671B** (SimAI PP/EP communication、GPU memory allocation module adaptations in progress) -+ **Qwen3-Moe-235B**, **Qwen3-Next-80B** (SimAI PP/EP communication、GPU memory allocation module adaptations in progress) -+ **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B** -+ **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf** -+ **codellama/CodeLlama-34b-Instruct-hf** -+ **internlm/internlm-20b** -+ **Qwen/Qwen-72B** + +- **DeepSeek-V3-671B** (SimAI PP/EP communication and GPU memory allocation module adaptations in progress) +- **Qwen3-MoE-235B**, **Qwen3-Next-80B** (SimAI PP/EP communication and GPU memory allocation module adaptations in progress) +- **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B** +- **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf** +- **codellama/CodeLlama-34b-Instruct-hf** +- **internlm/internlm-20b** +- **Qwen/Qwen-72B** --- ## 📦 Environment Setup + ### 1. Create Conda Environment + ```bash conda env create -p ./env -f ./environment.yml ``` ### 2. (Optional) Update Dev Dependencies + ```bash conda env update -f environment-dev.yml ``` ### 3. Activate Environment + ```bash conda activate vidur ``` ### 4. Install Python Dependencies (Using Alibaba Cloud PyPI Mirror) + ```bash pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ @@ -66,13 +127,16 @@ pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ --- -## ▶️ Running Example -### Run DeepSeek-671B **with** AICB -**Requirements: **SimAI and AICB Docker environment (see [README](../README.md) for setup instructions). +## ▶️ Running Examples + +### Run DeepSeek-671B with AICB -After setting up the environment, run the following commands: +**Requirements:** SimAI and AICB Docker environment (see [README](../README.md) for setup instructions). 
+ +After setting up the environment, run the following commands: + +#### DeepSeek-671B with AICB (Fixed Length Generator) -#### Run DeepSeek-671B **with** AICB (Fixed Length Generator) ```bash cd SimAI/vidur-alibabacloud @@ -94,10 +158,11 @@ python -m vidur.main --replica_config_pd_p2p_comm_bandwidth 800 \ --replica_config_tensor_parallel_size 2 \ --replica_config_num_pipeline_stages 1 \ --replica_config_expert_model_parallel_size 8 \ - --random_forrest_execution_time_predictor_config_backend aicb + --random_forrest_execution_time_predictor_config_backend aicb ``` -#### Run DeepSeek-671B **with** AICB (Trace Length Generator) +#### DeepSeek-671B with AICB (Trace Length Generator) + ```bash cd SimAI/vidur-alibabacloud @@ -124,11 +189,9 @@ python -m vidur.main \ ``` > ✅ Full parameter descriptions are available via `python -m vidur.main -h`. -> +### Run Llama-3-8B with SimAI Simulation - -### Run Llama-3-8B **with** simai_simulation ```bash cd SimAI @@ -136,8 +199,8 @@ cd SimAI ./scripts/build.sh -c ns3 # Create network topo (Spectrum-X_128g_8gps_100Gbps_A100) -python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps - +python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \ + -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps cd SimAI/vidur-alibabacloud @@ -163,18 +226,16 @@ python -m vidur.main \ --random_forrest_execution_time_predictor_config_backend simai_simulation \ --random_forrest_execution_time_predictor_config_simai_dir ../ \ --random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \ - --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf + --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf ``` -> -> +### Run Llama-3-8B with SimAI Analytical -### Run Llama-3-8B **with** 
simai_analytical ```bash cd SimAI # Compile SimAI-Analytical -$ ./scripts/build.sh -c analytical +./scripts/build.sh -c analytical cd SimAI/vidur-alibabacloud @@ -200,10 +261,8 @@ python -m vidur.main \ --random_forrest_execution_time_predictor_config_backend simai_analytical ``` -> -> +### Run Llama-3-8B with Native Vidur [original] -### Run Llama-3-8B **with** native Vidur [original] ```bash cd SimAI/vidur-alibabacloud @@ -229,125 +288,172 @@ python -m vidur.main \ --random_forrest_execution_time_predictor_config_backend vidur ``` -> -> +### Run 4-Scenario Suite +For a quick validation of all supported configurations, use the bundled test script: +```bash +bash examples/vidur-ali-scenarios/run_scenarios.sh --all +``` + +See `bash examples/vidur-ali-scenarios/run_scenarios.sh --help` for details. + +#### 4-Scenario Configuration + +The following scenarios are pre-configured in `run_scenarios.sh`. All scenarios share the hardware configuration below. + +**Shared Hardware Configuration:** +- GPU: H20 (h20_dgx), NVLink: 1600 Gbps, RDMA: 800 Gbps +- PD P2P bandwidth: 800 Gbps, dtype: fp8 +- Request: Poisson QPS=100, 4 requests, fixed prefill=100 / decode=8 tokens + +| Scenario | Model | PD Separation | World Size | TP | PP | EP | Global Scheduler | +|----------|-------|---------------|------------|----|----|------------|------------------| +| 1 | Qwen3-Next-80B (MoE) | No | 32 (dp=32) | 1 | 1 | 1 (default) | lor | +| 2 | Qwen3-Next-80B (MoE) | Yes (P=2, D=6) | 8 | 1 | 1 | 1 (default) | split_wise | +| 3 | DeepSeek-671B (MoE) | Yes (P=2, D=6) | 8 | 8 | 1 | 8 | split_wise | +| 4 | Qwen3-MoE-235B (MoE) | Yes (P=2, D=6) | 8 | 4 | 1 | 4 | split_wise | + +> **Note:** All four models use Mixture-of-Experts (MoE) architecture. The EP column reflects the explicit `--replica_config_expert_model_parallel_size` value set in the script; scenarios without an explicit EP setting use the default value of 1. 
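The P/D split shown in the scenario table follows directly from `--replica_config_pd_node_ratio`. A minimal sketch of that arithmetic (simple rounding assumed here; scenario 2 additionally pins the split explicitly with `--replica_config_num_prefill_replicas 2`):

```python
def pd_split(num_replicas: int, pd_node_ratio: float) -> tuple:
    # pd_node_ratio is the fraction of replicas acting as prefill (P) nodes;
    # the remaining replicas serve as decode (D) nodes.
    num_prefill = round(num_replicas * pd_node_ratio)
    return num_prefill, num_replicas - num_prefill

print(pd_split(8, 0.25))  # (2, 6), i.e. P=2 / D=6 as in scenarios 2-4
```

With `pd_node_ratio=0.5` and 8 replicas the same rule yields a 4/4 split (P:D = 1:1), matching the parameter's documented default behavior.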
+ +#### Output Files + +**Output path depends on how you run the simulation:** + +- **`run_scenarios.sh`** --- outputs to `examples/vidur-ali-scenarios/simulator_output/` +- **Direct `python -m vidur.main`** --- outputs to `./simulator_output/` (or the path specified by `--metrics_config_output_dir`) + +Each run produces the following directory: + +``` +// +├── request_metrics.csv # per-request metrics (see Key Output Interpretation) +├── chrome_trace.json # Chrome DevTools timeline trace (open at chrome://tracing) +├── config.json # snapshot of all simulation parameters +└── plots/ # per-metric CSV / JSON files (including but not limited to) + ├── request_e2e_time.csv + ├── prefill_e2e_time.csv + ├── pd_p2p_comm_time.csv + ├── replica_N_memory_usage.json + └── ... +``` + +> **Note:** The exact file list in `plots/` may vary across versions. +> Run-time logs (when using `run_scenarios.sh`) are saved separately to `examples/vidur-ali-scenarios/logs/scenario__.log`. --- ## 🔧 Key Input Parameter Reference + | Parameter | Default | Description | -| --- | --- | --- | -| `--replica_config_pd_p2p_comm_bandwidth` | 800 | Bandwidth (Gbps) for point-to-point communication between Prefill and Decode nodes in PD disaggregation | +|-----------|---------|-------------| +| `--replica_config_pd_p2p_comm_bandwidth` | 800 | Bandwidth (Gbps) for P2P communication between Prefill and Decode nodes in PD disaggregation | | `--replica_config_nvlink_bandwidth` | 1600 | NVLink bandwidth (Gbps) for TP/EP communications | | `--replica_config_rdma_bandwidth` | 800 | RDMA bandwidth (Gbps) for inter-node communication | | `--replica_config_pd_p2p_comm_dtype` | float16 | Data type for PD communication (`float16`, `float32`, etc.) 
| | `--poisson_request_interval_generator_config_qps` | 0.5 | Queries per second (QPS) for Poisson request generator | | `--synthetic_request_generator_config_num_requests` | 128 | Number of synthetic requests to generate | | `--length_generator_config_type` | fixed | Request length generator type (`fixed`, `trace`, etc.) | -| `--fixed_request_length_generator_config_prefill_tokens` | `2048` | Number of prefill tokens per request (only effective when `--length_generator_config_type=fixed`) | -| `--fixed_request_length_generator_config_decode_tokens` | `512` | Number of decode tokens per request (only effective when `--length_generator_config_type=fixed`) | +| `--fixed_request_length_generator_config_prefill_tokens` | 2048 | Number of prefill tokens per request (only effective when `--length_generator_config_type=fixed`) | +| `--fixed_request_length_generator_config_decode_tokens` | 512 | Number of decode tokens per request (only effective when `--length_generator_config_type=fixed`) | | `--trace_request_length_generator_config_max_tokens` | 4096 | Max tokens when using trace-based length generator | -| `--trace_request_length_generator_config_trace_file` | data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv | Path to trace file for request lengths | +| `--trace_request_length_generator_config_trace_file` | `data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv` | Path to trace file for request lengths | | `--interval_generator_config_type` | poisson | Inter-arrival time generator type | | `--cluster_config_num_replicas` | 1 | Total number of replicas (i.e., data parallelism degree) | -| `--replica_config_pd_node_ratio` | 0.5 | Ratio of P-nodes to (P-nodes + D-nodes) Fraction of replicas allocated as prefill (P) nodes. The remaining replicas are used as decode (D) nodes. For example, 0.5 means half of the replicas are prefill nodes and half are decode nodes (P:D = 1:1). 
| +| `--replica_config_pd_node_ratio` | 0.5 | Fraction of replicas allocated as prefill (P) nodes. The remaining replicas are used as decode (D) nodes. For example, 0.5 means P:D = 1:1. | | `--global_scheduler_config_type` | round_robin | Global scheduler type (`split_wise`, `round_robin`, etc.) | | `--replica_scheduler_config_type` | sarathi | Per-replica scheduler type | -| `--replica_config_model_name` | meta-llama/Llama-2-7b-hf | Model name (DeepSeek-671B, Qwen3-Moe-235B, Qwen3-Next-80B , etc.)
⚠️ **Note**: Vidur GPU Memory management module is still under adaptation for DeepSeek-671B, Qwen3-Moe-235B, Qwen3-Next-80B | +| `--replica_config_model_name` | meta-llama/Llama-2-7b-hf | Model name (DeepSeek-671B, Qwen3-MoE-235B, Qwen3-Next-80B, etc.) | | `--replica_config_tensor_parallel_size` | 1 | Tensor parallelism size (TP) | | `--replica_config_num_pipeline_stages` | 1 | Number of pipeline stages (PP) | | `--replica_config_expert_model_parallel_size` | 1 | Expert model parallelism size (EP) | -| `--random_forrest_execution_time_predictor_config_backend` | vidur | Backend for execution time prediction
('vidur', 'simai_simulation', 'simai_analytical','aicb', etc.)
⚠️ **Note**: `simai_simulation` and `simai_analytical` currently only model TP communication and do not support pipeline or expert parallelism | -| `--random_forrest_execution_time_predictor_config_simai_dir` | `'../'` | Root directory of the SimAI simulator(default: `../`)
(only effective when `--random_forrest_execution_time_predictor_config_backend simai_simulation`) | -| `--random_forrest_execution_time_predictor_config_simai_simulation_topo` | `'../example/topo'` | Path to SimAI topology file (e.g., `'../Spectrum-X_128g_8gps_100Gbps_A100'`)(only effective when `--random_forrest_execution_time_predictor_config_backend simai_simulation`) | -| `--random_forrest_execution_time_predictor_config_simai_simulation_config` | `'../astra-sim-alibabacloud/inputs/config/SimAI.conf'` | Path to SimAI configuration file (e.g., `'../astra-sim-alibabacloud/inputs/config/SimAI.conf'`)
(only effective when `--random_forrest_execution_time_predictor_config_backend simai_simulation`) | - +| `--random_forrest_execution_time_predictor_config_backend` | vidur | Backend for execution time prediction (`vidur`, `simai_simulation`, `simai_analytical`, `aicb`, etc.). **Note:** `simai_simulation` and `simai_analytical` currently only model TP communication and do not support pipeline or expert parallelism. | +| `--random_forrest_execution_time_predictor_config_simai_dir` | `../` | Root directory of the SimAI simulator (only effective when backend = `simai_simulation`) | +| `--random_forrest_execution_time_predictor_config_simai_simulation_topo` | `../example/topo` | Path to SimAI topology file (only effective when backend = `simai_simulation`) | +| `--random_forrest_execution_time_predictor_config_simai_simulation_config` | `../astra-sim-alibabacloud/inputs/config/SimAI.conf` | Path to SimAI configuration file (only effective when backend = `simai_simulation`) | --- ## 📊 Key Output Interpretation + Simulation results are saved to: -```plain +``` ./simulator_output/YYYY-MM-DD_HH-MM-SS-XXXXXX/request_metrics.csv ``` ### Key Columns in `request_metrics.csv` + | Column | Meaning | -| --- | --- | +|--------|---------| | `arrived_at` / `prefill_arrived_at` | Timestamp when the request entered the system (in seconds). | -| `scheduled_at` | Timestamp when the request was first scheduled by the scheduler and began execution (in seconds). | +| `scheduled_at` | Timestamp when the request was first scheduled and began execution (in seconds). | | `prefill_completed_at` | Timestamp when the Prefill phase completed and the first output token was generated. | -| `decode_arrived_at` | Timestamp when the Decode phase started. In non-PD-Disaggregated setup, this typically equals `prefill_completed_at`. In PD-Disaggregated setup, it is `prefill_completed_at + pd_p2p_comm_time`. 
| -| `decode_time` | Duration of the Decode phase (in seconds), computed as `completed_at - decode_arrived_at` (equivalently: `request_e2e_time - prefill_e2e_time`). | -| `prefill_replica_id` | Replica ID that executed the Prefill phase (in PD-Disaggregated setup). | -| `decode_replica_id` | Replica ID that executed the Decode phase (in PD-Disaggregated setup). | +| `decode_arrived_at` | Timestamp when the Decode phase started. In non-PD-disaggregated setup, this typically equals `prefill_completed_at`. In PD-disaggregated setup, it is `prefill_completed_at + pd_p2p_comm_time`. | +| `decode_time` | Duration of the Decode phase (in seconds), computed as `completed_at - decode_arrived_at`. | +| `prefill_replica_id` | Replica ID that executed the Prefill phase (in PD-disaggregated setup). | +| `decode_replica_id` | Replica ID that executed the Decode phase (in PD-disaggregated setup). | | `request_num_prefill_tokens` | Number of input tokens (i.e., prompt length). | | `request_num_decode_tokens` | Number of output tokens (i.e., generation length). | -| `pd_p2p_comm_size` | Point-to-point communication size (in bytes) of data transferred from the Prefill node to the Decode node (KV cache, etc.) in PD-Disaggregated setup. | -| `pd_p2p_comm_time` | Point-to-point communication time (in seconds) between Prefill and Decode nodes in PD-Disaggregated setup. | +| `pd_p2p_comm_size` | P2P communication size (in bytes) of data transferred from the Prefill node to the Decode node (KV cache, etc.) in PD-disaggregated setup. | +| `pd_p2p_comm_time` | P2P communication time (in seconds) between Prefill and Decode nodes in PD-disaggregated setup. | | `completed_at` | Timestamp when the request finished processing. | | `request_execution_time` | Total actual execution time (in seconds), excluding delays due to preemption or pipeline bubbles. | | `request_preemption_time` | Time (in seconds) spent waiting due to scheduler preemption, pipeline bubbles, or other non-execution gaps. 
| | `request_scheduling_delay` | Scheduling delay before execution: `scheduled_at - arrived_at` (in seconds). | | `request_e2e_time` | End-to-end latency: `completed_at - arrived_at` (in seconds). | | `prefill_e2e_time` | Time To First Token (TTFT): `prefill_completed_at - arrived_at` (in seconds). | -| `tbt` | Time Between Tokens (TBT), also known as Time Per Output Token (TPOT). Computed as: `decode_time / request_num_decode_tokens` or equivalently: `(request_e2e_time - prefill_e2e_time) / request_num_decode_tokens` (in seconds/token). | +| `tbt` | Time Between Tokens (TBT), also known as TPOT. Computed as: `decode_time / request_num_decode_tokens` (in seconds/token). | +**Notes:** - **Notes**: +- All time-related fields are in **seconds (s)**, based on monotonic clock or Unix timestamps. +- In non-PD-separated deployments, `prefill_replica_id` and `decode_replica_id` are typically identical. +- If `request_num_decode_tokens = 0`, `tbt` is undefined (may be recorded as `NaN` or `0`). +- TBT is not yet logged in `request_metrics.csv`; it can be computed manually for now. -+ All time-related fields are in **seconds (s)**, based on monotonic clock or Unix timestamps. -+ In non-PD-separated deployments, `prefill_replica_id` and `decode_replica_id` are typically identical. -+ If `request_num_decode_tokens = 0`, `tbt` is undefined (may be recorded as `NaN` or `0`). -+ **TBT is not yet logged in request_metrics.csv; it can be computed manually for now.** +### Sample Row (`request_metrics.csv`) -### Sample Row (request_metrics.csv) -```plain +``` Request Id,request_e2e_time,...,arrived_at,prefill_arrived_at,scheduled_at,prefill_completed_at,decode_arrived_at,completed_at,...,prefill_replica_id,decode_replica_id,pd_p2p_comm_size,pd_p2p_comm_time,... 0,0.03607,...,0.0102006,0.0102006,0.0102006,0.0102265,0.0433997,0.0462744,...,0,2,3561947136,0.0331732,... 
``` --- -## ⚠️ Known Issue: Plotting Warning +## ⚠️ Known Issues + +### Plotting Warning + You may see this error at exit: -```plain +``` RuntimeError: Kaleido requires Google Chrome to be installed. ``` -This occurs because the simulator tries to generate PNG plots but lacks Chrome. -✅ **Important**: This **does NOT affect** the generation of `request_metrics.csv`. +This occurs because the simulator tries to generate PNG plots but lacks Chrome. +**Important:** This does **NOT** affect the generation of `request_metrics.csv`. -### Solutions: -1. **Ignore it** – CSV output is unaffected. -2. **Install Chrome**: - -```bash -plotly_get_chrome -``` +**Solutions:** +1. **Ignore it** — CSV output is unaffected. +2. **Install Chrome:** + ```bash + plotly_get_chrome + ``` 3. **Disable plotting** (not recommended): Comment out these lines in `vidur/simulator.py`: - -```python -# self._metric_store.plot() -# logger.info("Metrics written") -``` - -> ⚠️ Disabling plotting will skip all visual outputs and request_metrics.csv. -> + ```python + # self._metric_store.plot() + # logger.info("Metrics written") + ``` + > Disabling plotting will skip all visual outputs and `request_metrics.csv`. --- ## 📚 Help + View all CLI options: ```bash python -m vidur.main -h ``` - ---- - diff --git a/vidur-alibabacloud/README_CN.md b/vidur-alibabacloud/README_CN.md new file mode 100644 index 00000000..372f8303 --- /dev/null +++ b/vidur-alibabacloud/README_CN.md @@ -0,0 +1,458 @@ +

+ 中文  |  English +

+ +# Vidur-AlibabaCloud + +[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/) +[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) + +Vidur([原版](https://github.com/microsoft/vidur))是一个大语言模型(LLM)推理系统的模拟框架。 +**Vidur-AlibabaCloud**(本仓库)是针对阿里云 **SimAI** 场景优化的定制版本。支持 **Prefill–Decode(PD)分离**等高级特性,并针对 **DeepSeek-V3-671B**、**Qwen3-MoE-235B**、**Qwen3-Next-80B** 等 SOTA 大模型进行了专门适配。 + +--- + +## 目录 + +- [主要特性](#主要特性) +- [GPU 显存计算模块](#gpu-显存计算模块) +- [支持的模型](#支持的模型) +- [📦 环境配置](#-环境配置) +- [▶️ 运行示例](#️-运行示例) + - [四场景配置说明](#四场景配置说明) + - [输出文件说明](#输出文件说明) +- [🔧 关键输入参数参考](#-关键输入参数参考) +- [📊 输出结果解读](#-输出结果解读) +- [⚠️ 已知问题](#️-已知问题) +- [📚 帮助](#-帮助) + +--- + +## 主要特性 + +- **Prefill–Decode(PD)分离** — 支持 prefill 和 decode 阶段在不同节点运行,实现弹性资源分配和性能隔离。 + (参考 [splitwise-sim](https://github.com/Mutinifni/splitwise-sim)) +- **灵活的并行策略** — 支持: + - **数据并行(DP)** + - **张量并行(TP)** + - **流水线并行(PP)** + - **专家并行(EP)**(适配中) + + 同时支持 **Dense** 模型和 **混合专家(MoE)** 模型(MoE 适配中)。 +- **多种执行时间预测后端** — 可选: + - **AICB/AIOB** — 部分支持 DeepSeek-V3-671B、Qwen3-MoE-235B、Qwen3-Next-80B 的计算核与 TP、DP、PP、EP 通信量建模 + - **SimAI 仿真(Simulation)** — 基于 SimAI NS-3 的网络通信全栈仿真(支持 TP) + - **SimAI 解析(Analytical)** — SimAI 解析性能模型(支持 TP) + - **原版 Vidur [original]** — 支持 TP、DP、PP +- **负载生成与回放** — 支持真实 trace 回放,或使用固定/泊松分布生成合成请求。 +- **细粒度指标** — 记录: + - TTFT — 首 token 时延 + - TBT / TPOT — 相邻 token 时延 / 每输出 token 耗时 + - 端到端延迟 + - 通信开销 + - 计算开销 + - 调度延迟 + +--- + +## GPU 显存计算模块 + +本模块为现代 MoE(混合专家)模型的推理仿真提供精确的 GPU 显存估算,涵盖**模型参数显存**、**KV Cache 显存**以及 Prefill–Decode(PD)分离架构下的**最大批处理量**计算。 + +### 支持的注意力架构 + +| 架构 | 模型 | 说明 | +|---|---|---| +| **MLA**(多头潜在注意力) | DeepSeek-V3-671B | 使用 LoRA 压缩的 KV Cache(`kv_lora_rank` + `qk_rope_head_dim`),显著降低显存占用 | +| **MHA / GQA**(多头 / 分组查询注意力) | Qwen3-MoE-235B | 标准 KV Cache,每 token 每层使用 `num_kv_heads * head_dim` | +| **混合全注意力 + 线性注意力** | Qwen3-Next-80B | 每 4 层交替使用全注意力和线性(GDN)注意力 | + +### 核心组件 + +- 
**`ParamCounter`**(`vidur/utils/param_counter.py`)— 计算每层和每设备的参数量,支持 MLA、MHA/GQA、线性注意力和 MoE 专家权重,支持 FP8 量化。在 PD 分离架构下,根据 `prefill_world_size` / `decode_world_size` 分别返回 `(total_params, prefill_params, decode_params)` 三元组。
+- **`MemoryPlanner`**(`vidur/scheduler/utils/memory_planner.py`)— 规划 GPU 显存预算:`available = GPU_mem * (1 - margin) - param_mem`,计算 KV Cache 容量和最大并发请求数,包含 OOM 检测与建议输出。
+- **逐请求 KV Cache 追踪**(`vidur/entities/replica.py`)— 按请求粒度分配和释放 KV Cache 显存,支持运行时精确查询剩余容量。
+
+### 参考与致谢
+
+本 GPU 显存计算模块的开发参考了以下工作:
+
+- [InferSim](https://github.com/alibaba/InferSim) — 参数量计算与 KV Cache 估算方法论
+- [DeepSeek V3 Parameter Size Analysis](https://yangwenbo.com/articles/deepseek-v3-parameter-size.html) — DeepSeek V3 MLA 参数推导
+- [DeepSeek V3 参数推导详解](https://zhuanlan.zhihu.com/p/21455638257) — MLA 权重分解详细分析
+
+衷心感谢以上资源为我们的实现提供了基础性的分析与指导。
+
+---
+
+## 支持的模型
+
+- **DeepSeek-V3-671B**(SimAI PP/EP 通信及 GPU 显存管理模块适配中)
+- **Qwen3-MoE-235B**、**Qwen3-Next-80B**(SimAI PP/EP 通信及 GPU 显存管理模块适配中)
+- **meta-llama/Meta-Llama-3-8B** / **Meta-Llama-3-70B**
+- **meta-llama/Llama-2-7b-hf** / **Llama-2-70b-hf**
+- **codellama/CodeLlama-34b-Instruct-hf**
+- **internlm/internlm-20b**
+- **Qwen/Qwen-72B**
+
+---
+
+## 📦 环境配置
+
+### 1. 创建 Conda 环境
+
+```bash
+conda env create -p ./env -f ./environment.yml
+```
+
+### 2.(可选)更新开发依赖
+
+```bash
+conda env update -p ./env -f environment-dev.yml
+```
+
+### 3. 激活环境
+
+环境在第 1 步通过 `-p ./env` 以路径方式创建,因此需按路径激活:
+
+```bash
+conda activate ./env
+```
+
+### 4. 
安装 Python 依赖(使用阿里云 PyPI 镜像) + +```bash +pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ +pip install -r requirements-dev.txt -i https://mirrors.aliyun.com/pypi/simple/ +``` + +--- + +## ▶️ 运行示例 + +### 使用 AICB 运行 DeepSeek-671B + +**前置条件:** 需要 SimAI 和 AICB Docker 环境(参见 [README](../README.md) 了解搭建方法)。 + +完成环境配置后,运行以下命令: + +#### DeepSeek-671B + AICB(固定长度生成器) + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 5 \ + --length_generator_config_type fixed \ + --fixed_request_length_generator_config_prefill_tokens 1024 \ + --fixed_request_length_generator_config_decode_tokens 10 \ + --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \ + --cluster_config_num_replicas 4 \ + --replica_config_pd_node_ratio 0.5 \ + --global_scheduler_config_type split_wise \ + --replica_scheduler_config_type split_wise \ + --replica_config_model_name deepseek-671B \ + --replica_config_tensor_parallel_size 2 \ + --replica_config_num_pipeline_stages 1 \ + --replica_config_expert_model_parallel_size 8 \ + --random_forrest_execution_time_predictor_config_backend aicb +``` + +#### DeepSeek-671B + AICB(Trace 长度生成器) + +```bash +cd SimAI/vidur-alibabacloud + +python -m vidur.main \ + --replica_config_pd_p2p_comm_bandwidth 800 \ + --replica_config_nvlink_bandwidth 1600 \ + --replica_config_rdma_bandwidth 800 \ + --replica_config_pd_p2p_comm_dtype float32 \ + --poisson_request_interval_generator_config_qps 100 \ + --synthetic_request_generator_config_num_requests 10 \ + --length_generator_config_type trace \ + --trace_request_length_generator_config_max_tokens 1024 \ + --trace_request_length_generator_config_trace_file 
./data/processed_traces/splitwise_conv.csv \
+    --interval_generator_config_type poisson \
+    --cluster_config_num_replicas 4 \
+    --replica_config_pd_node_ratio 0.5 \
+    --global_scheduler_config_type split_wise \
+    --replica_scheduler_config_type split_wise \
+    --replica_config_model_name deepseek-671B \
+    --replica_config_tensor_parallel_size 2 \
+    --replica_config_num_pipeline_stages 1 \
+    --replica_config_expert_model_parallel_size 8 \
+    --random_forrest_execution_time_predictor_config_backend aicb
+```
+
+> ✅ 完整参数说明可通过 `python -m vidur.main -h` 查看。
+
+### 使用 SimAI 仿真运行 Llama-3-8B
+
+```bash
+cd SimAI
+
+# 编译 SimAI-Simulation(ns3)
+./scripts/build.sh -c ns3
+
+# 生成网络拓扑(Spectrum-X_128g_8gps_100Gbps_A100)
+python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
+    -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps
+
+# 当前已在 SimAI 目录下,直接进入 vidur-alibabacloud
+cd vidur-alibabacloud
+
+python -m vidur.main \
+    --replica_config_pd_p2p_comm_bandwidth 800 \
+    --replica_config_nvlink_bandwidth 1600 \
+    --replica_config_rdma_bandwidth 800 \
+    --replica_config_pd_p2p_comm_dtype float32 \
+    --poisson_request_interval_generator_config_qps 100 \
+    --synthetic_request_generator_config_num_requests 10 \
+    --length_generator_config_type trace \
+    --trace_request_length_generator_config_max_tokens 2048 \
+    --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
+    --interval_generator_config_type poisson \
+    --cluster_config_num_replicas 4 \
+    --replica_config_pd_node_ratio 0.5 \
+    --global_scheduler_config_type split_wise \
+    --replica_scheduler_config_type split_wise \
+    --replica_config_model_name meta-llama/Meta-Llama-3-8B \
+    --replica_config_tensor_parallel_size 4 \
+    --replica_config_num_pipeline_stages 1 \
+    --replica_config_expert_model_parallel_size 1 \
+    --random_forrest_execution_time_predictor_config_backend simai_simulation \
+    --random_forrest_execution_time_predictor_config_simai_dir ../ \
+    
--random_forrest_execution_time_predictor_config_simai_simulation_topo ../Spectrum-X_128g_8gps_100Gbps_A100 \
+    --random_forrest_execution_time_predictor_config_simai_simulation_config ../astra-sim-alibabacloud/inputs/config/SimAI.conf
+```
+
+### 使用 SimAI 解析模型运行 Llama-3-8B
+
+```bash
+cd SimAI
+
+# 编译 SimAI-Analytical
+./scripts/build.sh -c analytical
+
+# 当前已在 SimAI 目录下,直接进入 vidur-alibabacloud
+cd vidur-alibabacloud
+
+python -m vidur.main \
+    --replica_config_pd_p2p_comm_bandwidth 800 \
+    --replica_config_nvlink_bandwidth 1600 \
+    --replica_config_rdma_bandwidth 800 \
+    --replica_config_pd_p2p_comm_dtype float32 \
+    --poisson_request_interval_generator_config_qps 100 \
+    --synthetic_request_generator_config_num_requests 10 \
+    --length_generator_config_type trace \
+    --trace_request_length_generator_config_max_tokens 2048 \
+    --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
+    --interval_generator_config_type poisson \
+    --cluster_config_num_replicas 4 \
+    --replica_config_pd_node_ratio 0.5 \
+    --global_scheduler_config_type split_wise \
+    --replica_scheduler_config_type split_wise \
+    --replica_config_model_name meta-llama/Meta-Llama-3-8B \
+    --replica_config_tensor_parallel_size 4 \
+    --replica_config_num_pipeline_stages 1 \
+    --replica_config_expert_model_parallel_size 1 \
+    --random_forrest_execution_time_predictor_config_backend simai_analytical
+```
+
+### 使用原版 Vidur 运行 Llama-3-8B
+
+```bash
+cd SimAI/vidur-alibabacloud
+
+python -m vidur.main \
+    --replica_config_pd_p2p_comm_bandwidth 800 \
+    --replica_config_nvlink_bandwidth 1600 \
+    --replica_config_rdma_bandwidth 800 \
+    --replica_config_pd_p2p_comm_dtype float32 \
+    --poisson_request_interval_generator_config_qps 100 \
+    --synthetic_request_generator_config_num_requests 10 \
+    --length_generator_config_type trace \
+    --trace_request_length_generator_config_max_tokens 2048 \
+    --trace_request_length_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
+    
--interval_generator_config_type poisson \
+    --cluster_config_num_replicas 4 \
+    --replica_config_pd_node_ratio 0.5 \
+    --global_scheduler_config_type split_wise \
+    --replica_scheduler_config_type split_wise \
+    --replica_config_model_name meta-llama/Meta-Llama-3-8B \
+    --replica_config_tensor_parallel_size 4 \
+    --replica_config_num_pipeline_stages 1 \
+    --replica_config_expert_model_parallel_size 1 \
+    --random_forrest_execution_time_predictor_config_backend vidur
+```
+
+### 运行四场景套件
+
+使用内置脚本快速验证所有支持的配置:
+
+```bash
+bash examples/vidur-ali-scenarios/run_scenarios.sh --all
+```
+
+详细信息请运行 `bash examples/vidur-ali-scenarios/run_scenarios.sh --help`。
+
+#### 四场景配置说明
+
+以下场景已在 `run_scenarios.sh` 中预配置,所有场景共享下方硬件配置。
+
+**共用硬件配置:**
+- GPU:H20(h20_dgx),NVLink:1600 Gbps,RDMA:800 Gbps
+- PD P2P 带宽:800 Gbps,数据类型:fp8
+- 请求生成:Poisson QPS=100,4 requests,固定 prefill=100 / decode=8 tokens
+
+| 场景 | 模型 | PD 分离 | World Size | TP | PP | EP | 全局调度器 |
+|------|------|---------|------------|----|----|----|------------|
+| 1 | Qwen3-Next-80B (MoE) | 无 | 32 (dp=32) | 1 | 1 | 1(默认) | lor |
+| 2 | Qwen3-Next-80B (MoE) | 是(P=2, D=6) | 8 | 1 | 1 | 1(默认) | split_wise |
+| 3 | DeepSeek-671B (MoE) | 是(P=2, D=6) | 8 | 8 | 1 | 8 | split_wise |
+| 4 | Qwen3-MoE-235B (MoE) | 是(P=2, D=6) | 8 | 4 | 1 | 4 | split_wise |
+
+> **说明:** 四个场景使用的模型均为混合专家(MoE)架构。EP 列反映脚本中 `--replica_config_expert_model_parallel_size` 的显式设定值;未显式指定时使用默认值 1。
+
+#### 输出文件说明
+
+**输出路径取决于运行方式:**
+
+- **`run_scenarios.sh`** — 输出到 `examples/vidur-ali-scenarios/simulator_output/`
+- **直接 `python -m vidur.main`** — 输出到 `./simulator_output/`(或通过 `--metrics_config_output_dir` 指定的路径)
+
+每次运行产生如下目录:
+
+```
+<输出目录>/<时间戳目录>/
+├── request_metrics.csv    # 逐请求指标(参见"输出结果解读")
+├── chrome_trace.json      # Chrome DevTools 时间轴 trace(可在 chrome://tracing 打开)
+├── config.json            # 本次仿真的全部参数快照
+└── plots/                 # 逐指标 CSV / JSON 文件(包括但不限于)
+    ├── request_e2e_time.csv
+    ├── prefill_e2e_time.csv
+    ├── pd_p2p_comm_time.csv
+    ├── replica_N_memory_usage.json
+    └── ...
+``` + +> **说明:** `plots/` 中的具体文件列表可能因版本不同而变化。 +> 使用 `run_scenarios.sh` 时,运行日志另存于 `examples/vidur-ali-scenarios/logs/scenario__.log`。 + +--- + +## 🔧 关键输入参数参考 + +| 参数 | 默认值 | 说明 | +|------|--------|------| +| `--replica_config_pd_p2p_comm_bandwidth` | 800 | PD 分离中 Prefill 节点与 Decode 节点间 P2P 通信带宽(Gbps) | +| `--replica_config_nvlink_bandwidth` | 1600 | TP/EP 通信使用的 NVLink 带宽(Gbps) | +| `--replica_config_rdma_bandwidth` | 800 | 节点间通信使用的 RDMA 带宽(Gbps) | +| `--replica_config_pd_p2p_comm_dtype` | float16 | PD 通信数据类型(`float16`、`float32` 等) | +| `--poisson_request_interval_generator_config_qps` | 0.5 | 泊松请求生成器的 QPS(每秒请求数) | +| `--synthetic_request_generator_config_num_requests` | 128 | 合成请求总数 | +| `--length_generator_config_type` | fixed | 请求长度生成器类型(`fixed`、`trace` 等) | +| `--fixed_request_length_generator_config_prefill_tokens` | 2048 | 每请求的 prefill token 数(仅在 `--length_generator_config_type=fixed` 时生效) | +| `--fixed_request_length_generator_config_decode_tokens` | 512 | 每请求的 decode token 数(仅在 `--length_generator_config_type=fixed` 时生效) | +| `--trace_request_length_generator_config_max_tokens` | 4096 | 使用 trace 长度生成器时的最大 token 数 | +| `--trace_request_length_generator_config_trace_file` | `data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv` | trace 文件路径 | +| `--interval_generator_config_type` | poisson | 请求到达间隔生成器类型 | +| `--cluster_config_num_replicas` | 1 | replica 总数(即数据并行度) | +| `--replica_config_pd_node_ratio` | 0.5 | 分配为 Prefill(P)节点的 replica 比例,其余为 Decode(D)节点。例如 0.5 表示 P:D = 1:1。 | +| `--global_scheduler_config_type` | round_robin | 全局调度器类型(`split_wise`、`round_robin` 等) | +| `--replica_scheduler_config_type` | sarathi | 单 replica 调度器类型 | +| `--replica_config_model_name` | meta-llama/Llama-2-7b-hf | 模型名称(DeepSeek-671B、Qwen3-MoE-235B、Qwen3-Next-80B 等) | +| `--replica_config_tensor_parallel_size` | 1 | 张量并行大小(TP) | +| `--replica_config_num_pipeline_stages` | 1 | 流水线阶段数(PP) | +| `--replica_config_expert_model_parallel_size` | 1 | 专家并行大小(EP) | +| 
`--random_forrest_execution_time_predictor_config_backend` | vidur | 执行时间预测后端(`vidur`、`simai_simulation`、`simai_analytical`、`aicb` 等)。**注意:** `simai_simulation` 和 `simai_analytical` 当前仅建模 TP 通信,不支持流水线或专家并行。 | +| `--random_forrest_execution_time_predictor_config_simai_dir` | `../` | SimAI 模拟器根目录(仅在 backend = `simai_simulation` 时生效) | +| `--random_forrest_execution_time_predictor_config_simai_simulation_topo` | `../example/topo` | SimAI 拓扑文件路径(仅在 backend = `simai_simulation` 时生效) | +| `--random_forrest_execution_time_predictor_config_simai_simulation_config` | `../astra-sim-alibabacloud/inputs/config/SimAI.conf` | SimAI 配置文件路径(仅在 backend = `simai_simulation` 时生效) | + +--- + +## 📊 输出结果解读 + +仿真结果保存于: + +``` +./simulator_output/YYYY-MM-DD_HH-MM-SS-XXXXXX/request_metrics.csv +``` + +### `request_metrics.csv` 关键列说明 + +| 列名 | 含义 | +|------|------| +| `arrived_at` / `prefill_arrived_at` | 请求进入系统的时间戳(秒)。 | +| `scheduled_at` | 请求首次被调度并开始执行的时间戳(秒)。 | +| `prefill_completed_at` | Prefill 阶段完成、生成第一个输出 token 的时间戳。 | +| `decode_arrived_at` | Decode 阶段开始的时间戳。非 PD 分离场景下通常等于 `prefill_completed_at`;PD 分离场景下为 `prefill_completed_at + pd_p2p_comm_time`。 | +| `decode_time` | Decode 阶段持续时间(秒),计算公式:`completed_at - decode_arrived_at`。 | +| `prefill_replica_id` | 执行 Prefill 阶段的 replica ID(PD 分离场景下)。 | +| `decode_replica_id` | 执行 Decode 阶段的 replica ID(PD 分离场景下)。 | +| `request_num_prefill_tokens` | 输入 token 数(即 prompt 长度)。 | +| `request_num_decode_tokens` | 输出 token 数(即生成长度)。 | +| `pd_p2p_comm_size` | PD 分离场景下,从 Prefill 节点传输至 Decode 节点的数据量(字节,含 KV Cache 等)。 | +| `pd_p2p_comm_time` | PD 分离场景下,Prefill 节点与 Decode 节点间 P2P 通信耗时(秒)。 | +| `completed_at` | 请求处理完成的时间戳。 | +| `request_execution_time` | 实际执行总时间(秒),不含抢占或流水线气泡导致的等待。 | +| `request_preemption_time` | 因调度器抢占、流水线气泡或其他非执行间隔导致的等待时间(秒)。 | +| `request_scheduling_delay` | 执行前的调度延迟:`scheduled_at - arrived_at`(秒)。 | +| `request_e2e_time` | 端到端延迟:`completed_at - arrived_at`(秒)。 | +| `prefill_e2e_time` | 首 token 时延(TTFT):`prefill_completed_at - 
arrived_at`(秒)。 | +| `tbt` | 相邻 token 时延(TBT / TPOT)。计算公式:`decode_time / request_num_decode_tokens`(秒/token)。 | + +**说明:** + +- 所有时间字段单位均为**秒(s)**,基于单调时钟或 Unix 时间戳。 +- 非 PD 分离部署中,`prefill_replica_id` 与 `decode_replica_id` 通常相同。 +- 若 `request_num_decode_tokens = 0`,则 `tbt` 未定义(可能记录为 `NaN` 或 `0`)。 +- `tbt` 暂未写入 `request_metrics.csv`,目前需手动计算。 + +### 示例行(`request_metrics.csv`) + +``` +Request Id,request_e2e_time,...,arrived_at,prefill_arrived_at,scheduled_at,prefill_completed_at,decode_arrived_at,completed_at,...,prefill_replica_id,decode_replica_id,pd_p2p_comm_size,pd_p2p_comm_time,... +0,0.03607,...,0.0102006,0.0102006,0.0102006,0.0102265,0.0433997,0.0462744,...,0,2,3561947136,0.0331732,... +``` + +--- + +## ⚠️ 已知问题 + +### 绘图警告 + +退出时可能出现以下错误: + +``` +RuntimeError: Kaleido requires Google Chrome to be installed. +``` + +这是因为模拟器尝试生成 PNG 图表但缺少 Chrome。 +**重要:** 此问题**不影响** `request_metrics.csv` 的生成。 + +**解决方案:** + +1. **忽略** — CSV 输出不受影响。 +2. **安装 Chrome:** + ```bash + plotly_get_chrome + ``` +3. **禁用绘图**(不推荐):注释掉 `vidur/simulator.py` 中的以下行: + ```python + # self._metric_store.plot() + # logger.info("Metrics written") + ``` + > 禁用绘图将跳过所有可视化输出及 `request_metrics.csv`。 + +--- + +## 📚 帮助 + +查看所有 CLI 选项: + +```bash +python -m vidur.main -h +```
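

---

## 附:手动计算 TBT 示例

上文「输出结果解读」提到 `tbt` 暂未写入 `request_metrics.csv`,需按公式 `decode_time / request_num_decode_tokens` 手动计算。下面是一个最小示意(假设性代码,并非仓库自带脚本);时间数值取自上文样例行,`request_num_decode_tokens=10` 为演示用假设值:

```python
import math

def compute_tbt(row):
    """按文档公式计算 TBT:decode_time / request_num_decode_tokens。
    其中 decode_time = completed_at - decode_arrived_at。
    request_num_decode_tokens 为 0 时返回 NaN(与文档约定一致)。"""
    n = int(row["request_num_decode_tokens"])
    if n == 0:
        return math.nan
    decode_time = float(row["completed_at"]) - float(row["decode_arrived_at"])
    return decode_time / n

# 字段名来自 request_metrics.csv;时间取自上文样例行,token 数为假设值
sample = {
    "completed_at": "0.0462744",
    "decode_arrived_at": "0.0433997",
    "request_num_decode_tokens": "10",
}
print(f"tbt = {compute_tbt(sample):.6f} s/token")  # → tbt = 0.000287 s/token
```

处理整个 `request_metrics.csv` 时,可用 `csv.DictReader` 逐行读取并对每行套用该函数。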
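

---

## 附:显存预算估算示例

上文「GPU 显存计算模块」给出了 `MemoryPlanner` 的预算公式 `available = GPU_mem * (1 - margin) - param_mem`。下面按该公式给出一个估算 KV Cache 容量与最大并发请求数的极简示意(假设性代码,并非 `MemoryPlanner` 的真实实现;显存、余量、每 token KV 字节数等数值均为演示用假设):

```python
def plan_kv_budget(gpu_mem_gib, margin, param_mem_gib,
                   kv_bytes_per_token, tokens_per_request):
    """返回 (可用 KV 显存字节数, 最大并发请求数)。
    预算公式:available = GPU_mem * (1 - margin) - param_mem。"""
    gib = 1024 ** 3
    available = gpu_mem_gib * (1 - margin) * gib - param_mem_gib * gib
    if available <= 0:
        # 对应文档所述的 OOM 检测:参数显存已超出预算
        raise MemoryError("param memory exceeds GPU budget (OOM)")
    per_request = kv_bytes_per_token * tokens_per_request
    return available, int(available // per_request)

# 演示:96 GiB 显存,10% 余量,模型参数占 40 GiB,
# 每 token KV Cache 70 KiB,每请求 4096 token(均为假设值)
available, max_reqs = plan_kv_budget(96, 0.10, 40, 70 * 1024, 4096)
print(f"KV budget ≈ {available / 1024**3:.1f} GiB, max concurrent requests = {max_reqs}")
```

真实实现还需结合注意力架构(MLA / GQA / 线性注意力)计算每 token 的 KV 字节数,并在 PD 分离下对 Prefill / Decode 节点分别规划。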