Merge Release/v26.4 back to main #759
Closed
GeneDer wants to merge 10 commits into
1. Adds a complete native SFT (Supervised Fine-Tuning) training stack to Primus on the Megatron backend, parallel to the existing pretrain path.
2. Implements custom dataset, packing, forward_step, LoRA/PEFT, multi-turn conversation, and offline JSONL/JSON loaders without depending on Megatron-Bridge at runtime, while keeping a megatron_bridge_adapter.py for users who still want the Bridge path.
3. Performance results: with memory alignment, llama3_8b and llama2_70b outperform the third-party library megatron-bridge by 4%, deepseek_v2_lite and qwen3-30b-a3b outperform it by 6%, and the results are comparable to mlperf_llama2_70b_lora.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Xiaoming-AMD <198007710+Xiaoming-AMD@users.noreply.github.com>
Co-authored-by: Xiaoming <xiaoming@primus.dev>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: botahu_qle <botahu@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Botao Hu <botahu@smc300x-ccs-aus-a16-19.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR merges Release/v26.4 back into main, bringing in Megatron-native SFT (datasets/formatters/runtime wiring), stage-based trainer registration in BackendRegistry, and supporting scripts/configs for SFT runs and diagnostics.
Changes:
- Add Megatron-native SFT stack (schema/formatters/tokenization/datasets/forward_step/runtime) plus unit tests and example configs for SFT + packed sequences.
- Introduce stage-aware trainer registration/lookup in BackendRegistry and update adapters/backends (megatron, megatron_bridge, torchtitan) + related tests.
- Add operational hooks/tools: HF→Megatron checkpoint conversion hook, diagnostics scripts, and various training launch examples/config updates (including FP4/Turbo knobs).
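The stage-aware lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `StageTrainerRegistry`, the method signatures, and the two placeholder trainer classes are assumptions. Only the idea of keying trainers by backend and stage, and raising `RuntimeError` when a stage is missing, comes from the review summary.

```python
from collections import defaultdict
from typing import Dict, Type


class StageTrainerRegistry:
    """Hypothetical stand-in for a stage-aware trainer registry."""

    def __init__(self) -> None:
        # backend name -> stage name -> trainer class
        self._trainers: Dict[str, Dict[str, Type]] = defaultdict(dict)

    def register_trainer_class(self, backend: str, trainer_cls: Type, stage: str = "pretrain") -> None:
        """Register (or overwrite) the trainer used by `backend` for `stage`."""
        self._trainers[backend][stage] = trainer_cls

    def get_trainer_class(self, backend: str, stage: str = "pretrain") -> Type:
        """Return the trainer for (backend, stage) or raise RuntimeError."""
        stages = self._trainers.get(backend, {})
        if stage not in stages:
            known = sorted(stages) or ["<none>"]
            raise RuntimeError(
                f"No trainer registered for backend={backend!r}, stage={stage!r}; "
                f"known stages: {known}"
            )
        return stages[stage]


# Placeholder trainer classes, for illustration only.
class DummyPretrainTrainer:
    pass


class DummySFTTrainer:
    pass


registry = StageTrainerRegistry()
registry.register_trainer_class("megatron", DummyPretrainTrainer)
registry.register_trainer_class("megatron", DummySFTTrainer, stage="sft")
```

With this shape, an adapter only needs the backend name and the requested stage to resolve its trainer, and a missing registration fails loudly instead of silently falling back to pretrain.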
Reviewed changes
Copilot reviewed 83 out of 83 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit_tests/backends/megatron/test_sft_dataset_offline.py | Unit tests for offline JSON/JSONL SFT dataset loading. |
| tests/unit_tests/backends/megatron/test_sft_abstractions.py | Tests for SFT normalization/tokenization and forward_step behavior. |
| tests/unit_tests/backends/megatron/test_messages_format.py | Tests for OpenAI messages format and formatter selection. |
| tests/unit_tests/backends/megatron/test_megatron_sft_trainer.py | Tests for MegatronSFTTrainer wiring into runtime factories/entrypoints. |
| tests/unit_tests/backends/megatron/test_megatron_registration.py | Verifies megatron adapter + stage trainers are registered in registry. |
| tests/unit_tests/backends/megatron/test_megatron_adapter.py | Adapter tests updated for stage-based trainer lookup and errors. |
| runner/helpers/hooks/train/posttrain/megatron/01_convert_checkpoints.py | Hook to convert HF checkpoints to Megatron format via Megatron-Bridge. |
| runner/helpers/hooks/train/posttrain/megatron/00_install_requirements.sh | Installs dependencies needed for checkpoint conversion hook. |
| run.sh | Convenience launcher script with quieter Torch/NCCL logging defaults. |
| primus/tools/diag/verify_mlp_merge_fix.py | GPU allocator diagnostic for MLP shard-merge fragmentation fix. |
| primus/tools/diag/verify_mlp_merge_fix_realistic.py | “Realistic” scale allocator diagnostic with large context prealloc. |
| primus/tools/diag/inspect_sft_data.py | Diagnostic comparing Bridge packed parquet vs Native packed cache. |
| primus/tools/diag/__init__.py | Declares diag utilities package. |
| primus/core/launcher/config.py | Adds pre_trainer/post_trainer aliasing for SFT configs. |
| primus/core/config/primus_config.py | Adds module-name aliasing in get_module_config. |
| primus/core/backend/backend_registry.py | Implements stage-based trainer registry and debug dump. |
| primus/core/backend/backend_adapter.py | Minor doc/log string normalization (->). |
| primus/configs/modules/megatron/sft_trainer.yaml | New Megatron SFT trainer module config (packing, bridge parity flags, LoRA). |
| primus/configs/modules/megatron/primus_turbo.yaml | Adds use_turbo_fp4_autocast flag. |
| primus/configs/models/megatron/qwen3_235B_A22B_4layer.yaml | Adds 4-layer smoke-test model variant. |
| primus/configs/models/megatron/llama3_8B.yaml | Switches tokenizer_type to HuggingFaceTokenizer. |
| primus/configs/models/megatron_bridge/qwen3_30b_a3b.yaml | Adds Bridge model config for Qwen3-30B-A3B. |
| primus/configs/models/megatron_bridge/llama3_8b.yaml | Adds Bridge model config for Llama3-8B. |
| primus/backends/torchtitan/torchtitan_adapter.py | Uses BackendRegistry.get_trainer_class for trainer loading. |
| primus/backends/torchtitan/__init__.py | Registers torchtitan pretrain trainer in stage registry. |
| primus/backends/megatron/training/evaluator.py | Handles [loss, num_tokens] tensor metric shape for averaging. |
| primus/backends/megatron/sft/schema.py | Defines normalized SFT sample/message schema and formatted spans. |
| primus/backends/megatron/sft/runtime.py | Provides dataset provider + pretrain entrypoint wrapper with signature probing. |
| primus/backends/megatron/sft/preprocessing.py | Adds local record loading + tokenization + loss mask/label shifting. |
| primus/backends/megatron/sft/formatters.py | Adds Alpaca/ChatML/OpenAI-messages/SQuAD formatters + selector. |
| primus/backends/megatron/sft/dataset.py | Implements SFTDataset and dataset-builder with packed/mlperf dispatch. |
| primus/backends/megatron/sft/__init__.py | Exposes SFT public API surface. |
| primus/backends/megatron/peft/recompute.py | Adds adapter-only recompute grad fix hook for PP=1 cases. |
| primus/backends/megatron/peft/module_matcher.py | PEFT module matcher utility (ported/adjusted). |
| primus/backends/megatron/peft/lora.py | LoRA implementation/transformations (ported/adjusted). |
| primus/backends/megatron/peft/import_utils.py | Safe import helpers for optional dependencies. |
| primus/backends/megatron/peft/base.py | Base PEFT API + freeze/walk + adapter save filtering. |
| primus/backends/megatron/peft/adapter_wrapper.py | Adapter wrapper state_dict/sharded_state_dict handling. |
| primus/backends/megatron/peft/__init__.py | PEFT package exports. |
| primus/backends/megatron/patches/turbo/fp4_patches.py | Changes FP4 patch gating condition to fp4 enabled. |
| primus/backends/megatron/patches/sft_grad_sanitize_patches.py | Adds optional NaN/Inf grad sanitization patch for benchmark configs. |
| primus/backends/megatron/patches/checkpoint_patches.py | Adds tolerant factory merge + torch_dist load_checkpoint fixes. |
| primus/backends/megatron/megatron_adapter.py | Uses stage-based trainer registry (raises RuntimeError on missing). |
| primus/backends/megatron/core/transformer/moe/router.py | Removes inconsistent force-LB routing_map override; adds rationale. |
| primus/backends/megatron/core/fp4_utils.py | Lazier Turbo imports; TE fallback autocast; improved recipe handling. |
| primus/backends/megatron/core/datasets/sft_dataset.py | Compatibility shim re-exporting new SFT dataset APIs. |
| primus/backends/megatron/__init__.py | Registers megatron pretrain + sft trainers in stage registry. |
| primus/backends/megatron_bridge/__init__.py | Registers bridge pretrain + sft trainers in stage registry. |
| examples/moe_package/start_training_qwen_30B_a3B.sh | Example pretrain launch script for Qwen3-30B-A3B. |
| examples/moe_package/start_training_dsv2_lite.sh | Example pretrain launch script for DeepSeek-V2-Lite. |
| examples/megatron/prepare.py | Skips pretrain dataset tokenization for stage=sft; handles empty submodule. |
| examples/megatron/convert_to_jsonl.py | Utility to export HF/CSV datasets into JSONL for offline SFT. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B-BF16-sft.yaml | Example native SFT config for Qwen3-235B-A22B. |
| examples/megatron/configs/MI355X/qwen3_235B_A22B_4layer-BF16-sft.yaml | Smoke-test SFT config for 4-layer Qwen3-235B-A22B. |
| examples/megatron/configs/MI355X/llama3.1_8B-MXFP8-pretrain.yaml | Adds Llama3.1 MXFP8 pretrain config. |
| examples/megatron/configs/MI355X/llama3.1_8B-MXFP4-pretrain.yaml | Adds Llama3.1 MXFP4 pretrain config. |
| examples/megatron/configs/MI355X/llama3_8B-BF16-sft.yaml | Example native SFT config for Llama3-8B. |
| examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed.yaml | Example packed-sequence SFT config for Llama3-8B. |
| examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed-squad.yaml | Packed SFT SQuAD config for Bridge-vs-Native benchmarking. |
| examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed-bridge_aligned.yaml | Bridge-aligned packed SFT benchmark config (native path). |
| examples/megatron/configs/MI355X/llama3_8B-BF16-lora-sft.yaml | LoRA-focused SFT config variant for Llama3-8B. |
| examples/megatron/configs/MI355X/llama2_70B-FP8-sft-packed-perf.yaml | FP8 performance-oriented packed SFT config for Llama2-70B. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-sft.yaml | Example SFT config for DeepSeek-V2-Lite. |
| examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-sft-packed.yaml | Packed SFT config for DeepSeek-V2-Lite with extensive perf notes. |
| examples/megatron_bridge/configs/MI355X/qwen3_30b_a3b_lora_posttrain_packed.yaml | Bridge packed LoRA SFT benchmark config for Qwen3-30B-A3B. |
| examples/megatron_bridge/configs/MI355X/llama3_8b_lora_posttrain_packed.yaml | Bridge packed LoRA SFT config for Llama3-8B. |
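The table mentions "loss mask/label shifting" in the SFT preprocessing module. As a rough illustration of what that step usually involves in next-token training, the sketch below shifts labels by one position and masks out tokens that did not come from the assistant. The function name, the `is_assistant` flag list, and the list-based representation are assumptions made for this example and are not taken from the PR.

```python
from typing import List, Tuple


def shift_labels_and_mask(
    token_ids: List[int], is_assistant: List[bool]
) -> Tuple[List[int], List[int], List[int]]:
    """Build (inputs, labels, loss_mask) for next-token prediction.

    Position t of `inputs` predicts token t+1 of the original sequence, so
    the loss at position t counts only if token t+1 is an assistant token.
    """
    if len(token_ids) != len(is_assistant):
        raise ValueError("token_ids and is_assistant must have the same length")
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")

    inputs = token_ids[:-1]   # drop the last token: nothing follows it
    labels = token_ids[1:]    # labels are the inputs shifted left by one
    loss_mask = [1 if flag else 0 for flag in is_assistant[1:]]
    return inputs, labels, loss_mask


# Two prompt tokens followed by three response tokens.
inputs, labels, loss_mask = shift_labels_and_mask(
    token_ids=[10, 11, 12, 13, 14],
    is_assistant=[False, False, True, True, True],
)
```

Here the last prompt token is the first position that contributes to the loss, because its target is the first response token.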
Comment on lines +385 to +387

    _pre_forward_canary(model)

Comment on lines +157 to +162

    while not done_file.exists() and elapsed < timeout:
        if not lock_file.exists() and not done_file.exists():
            time.sleep(2)
        else:
            time.sleep(5)
        elapsed += 5
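A wait loop like the one excerpted above tracks elapsed time with a fixed increment, which can drift from the time actually slept. One possible alternative, sketched here under assumptions, is to compare against a monotonic deadline. The helper name `wait_for_done` and its parameters are hypothetical and not part of the PR.

```python
import time
from pathlib import Path


def wait_for_done(done_file: Path, timeout: float, poll_interval: float = 2.0) -> bool:
    """Poll until `done_file` exists or `timeout` seconds have passed.

    Returns True if the file appeared, False on timeout. The deadline is
    taken from time.monotonic(), so the wait is independent of how long
    each individual sleep lasted.
    """
    deadline = time.monotonic() + timeout
    while not done_file.exists():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval, remaining))
    return True
```

A caller could pass a longer `poll_interval` while a lock file is present, keeping the two-speed polling of the original without affecting the timeout accounting.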
Comment on lines +110 to +114

    except Exception as e:
        # If torch or datasets not available, skip
        if "No module named" in str(e):
            self.skipTest(f"Required module not available: {e}")
        raise
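The excerpt above catches a broad `Exception` and decides whether to skip by matching the message text. A narrower pattern, shown here as a sketch only, is to catch `ModuleNotFoundError` directly so that unrelated failures are never mistaken for a missing dependency. The test class, the helper `_import_or_skip`, and the deliberately nonexistent module name are all illustrative assumptions.

```python
import importlib
import unittest


class OptionalDependencyTest(unittest.TestCase):
    def _import_or_skip(self, module_name: str):
        """Import `module_name`, skipping the current test if it is absent."""
        try:
            return importlib.import_module(module_name)
        except ModuleNotFoundError as exc:
            # Only a genuinely missing module triggers the skip; any other
            # exception propagates and fails the test as usual.
            self.skipTest(f"Required module not available: {exc.name}")

    def test_runs_when_module_is_present(self):
        json_mod = self._import_or_skip("json")
        self.assertEqual(json_mod.loads("[1]"), [1])

    def test_skips_when_module_is_missing(self):
        # Hypothetical module name chosen so the import fails.
        self._import_or_skip("definitely_not_installed_module_xyz")
        self.fail("unreachable: the test should have been skipped")
```

The same helper can be reused by each test that needs an optional package, which also removes the repeated `except` blocks.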
Comment on lines +141 to +145

    except Exception as e:
        # If torch or datasets not available, skip
        if "No module named" in str(e):
            self.skipTest(f"Required module not available: {e}")
        raise
Comment on lines +219 to +222

    except Exception as e:
        if "No module named" in str(e):
            self.skipTest(f"Required module not available: {e}")
        raise
Comment on lines +257 to +260

    except Exception as e:
        if "No module named" in str(e):
            self.skipTest(f"Required module not available: {e}")
        raise
Comment on lines +3 to +5

    # Default config if no argument is provided
    CONFIG_FILE=${1:-"./examples/megatron/configs/MI355X/llama3_8B-BF16-sft.yaml"}

    echo "Starting training with config: $CONFIG_FILE"
    echo "Experiment Name: $PRIMUS_EXP_NAME"

    PRIMUS_TRAIN_RUNTIME=core ./primus-cli --debug direct -- train posttrain --config "$CONFIG_FILE"
Comment on lines +1 to +8

    # Primus Native SFT LoRA — Quick Start

    > **Branch**: `feat/megatron/support-sft-native` (PR701)
    > **Backend**: Megatron-LM **native** (no Megatron-Bridge runtime dependency)
    > **Hardware**: AMD MI355X / MI300X
    > **Models verified**: Llama2-70B, Llama3-8B, Llama3-70B, Qwen3-30B-A3B, Qwen3-235B-A22B, DeepSeek-V2-Lite

    This README walks through how to launch training on Primus's **native SFT LoRA** path, and explains exactly which fields to change when switching from BF16 / FP8 to FP4 (NVFP4 / MXFP4).
Comment on lines +85 to +88

    -e EXP=examples/megatron/configs/MI355X/llama2_70B-BF16-sft-packed-mlperf_aligned.yaml \
    sft_primus_0507_native \
    bash -c 'cd /workspace/Primus && bash examples/run_pretrain.sh' \
    2>&1 | tee /home/botahu/llama2_70b_500iter_runs/${EXP_NAME}.log
Comment on lines +142 to +146

    modules:
      pre_trainer:
        framework: megatron
        config: sft_trainer.yaml
        model: llama2_70B.yaml
Comment on lines +204 to +208

    # Recommended invocation:
    # export PRIMUS_EXP_NAME=native_llama2_70b_fp4_perf_$(date +%Y%m%d_%H%M%S)
    # EXP=examples/megatron/configs/MI355X/llama2_70B-FP4-sft-packed-perf.yaml \
    # bash examples/run_pretrain.sh
    # =============================================================================
Comment on lines +213 to +217

    modules:
      pre_trainer:
        framework: megatron
        config: sft_trainer.yaml
        model: llama2_70B.yaml
        overrides:
          data_path: null
          sft_dataset_name: rajpurkar/squad
          sft_dataset_formatter: squad
Comment on lines +364 to +367

    -e EXP=examples/megatron/configs/MI355X/llama2_70B-FP4-sft-packed-perf.yaml \
    sft_primus_0507_native \
    bash -c 'cd /workspace/Primus && bash examples/run_pretrain.sh' \
    2>&1 | tee /home/botahu/llama2_70b_500iter_runs/${EXP_NAME}.log
Comment on lines +61 to +65

    modules:
      sft_trainer:
        framework: megatron
        config: sft_trainer.yaml
        model: llama3_8B.yaml
Member
Author
These files are not needed |