Merge Release/v26.4 back to main #759

Closed
GeneDer wants to merge 10 commits into main from release/v26.4

Conversation

@GeneDer GeneDer commented Jun 9, 2026

Member

No description provided.

Vidushi Goyal and others added 7 commits June 3, 2026 20:14
1. Adds a complete native SFT (Supervised Fine-Tuning) training stack to
   Primus on the Megatron backend, parallel to the existing pretrain path.
2. Implements a custom dataset, packing, forward_step, LoRA/PEFT,
   multi-turn conversation support, and offline JSONL/JSON loaders without
   depending on Megatron-Bridge at runtime, while keeping
   megatron_bridge_adapter.py for users who still want the Bridge path.
3. Performance results: with memory alignment, llama3_8b and llama2_70b
   outperform the third-party library megatron-bridge by 4%,
   deepseek_v2_lite and qwen3-30b-a3b outperform it by 6%, and both are
   comparable to mlperf_llama2_70b_lora.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Xiaoming-AMD <198007710+Xiaoming-AMD@users.noreply.github.com>
Co-authored-by: Xiaoming <xiaoming@primus.dev>
Co-authored-by: WangLingxun <linxwang@amd.com>
Co-authored-by: botahu_qle <botahu@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Botao Hu <botahu@smc300x-ccs-aus-a16-19.prov.aus.ccs.cpe.ice.amd.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 9, 2026 16:23
Comment thread primus/backends/megatron/sft/forward_step.py Fixed
Comment thread primus/backends/megatron/sft/forward_step.py Fixed
Comment thread primus/backends/megatron_bridge/megatron_bridge_adapter.py Fixed
Comment thread primus/backends/megatron/peft/walk_utils.py Fixed
Comment thread primus/backends/megatron/sft/mlperf_packed_dataset.py Fixed
Comment thread primus/backends/megatron/peft/base.py Fixed
Comment thread primus/backends/megatron/sft/forward_step.py Fixed
Comment thread primus/backends/megatron/peft/lora_layers.py Fixed

Copilot AI left a comment

Contributor

Pull request overview

This PR merges Release/v26.4 back into main, bringing in Megatron-native SFT (datasets/formatters/runtime wiring), stage-based trainer registration in BackendRegistry, and supporting scripts/configs for SFT runs and diagnostics.

Changes:

  • Add Megatron-native SFT stack (schema/formatters/tokenization/datasets/forward_step/runtime) plus unit tests and example configs for SFT + packed sequences.
  • Introduce stage-aware trainer registration/lookup in BackendRegistry and update adapters/backends (megatron, megatron_bridge, torchtitan) + related tests.
  • Add operational hooks/tools: HF→Megatron checkpoint conversion hook, diagnostics scripts, and various training launch examples/config updates (including FP4/Turbo knobs).
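
A stage-aware registry of the kind described above can be pictured as a lookup keyed by (backend, stage). The sketch below is only an illustration under stated assumptions: `TrainerRegistry`, `PretrainTrainer`, and `SFTTrainer` are hypothetical names and are not taken from Primus's `BackendRegistry`; the only behavior borrowed from the PR summary is that a missing trainer raises `RuntimeError`.

```python
from typing import Dict, Tuple, Type


class TrainerRegistry:
    """Hypothetical stand-in for a stage-aware trainer registry."""

    def __init__(self) -> None:
        # Trainers are keyed by (backend name, training stage).
        self._trainers: Dict[Tuple[str, str], Type] = {}

    def register(self, backend: str, stage: str, trainer_cls: Type) -> None:
        self._trainers[(backend, stage)] = trainer_cls

    def get(self, backend: str, stage: str) -> Type:
        try:
            return self._trainers[(backend, stage)]
        except KeyError:
            # List the stages that are registered for this backend to aid debugging.
            known = sorted(s for b, s in self._trainers if b == backend)
            raise RuntimeError(
                f"No trainer registered for backend={backend!r}, stage={stage!r}; "
                f"known stages: {known}"
            ) from None


class PretrainTrainer:  # placeholder trainer class (assumption)
    pass


class SFTTrainer:  # placeholder trainer class (assumption)
    pass


registry = TrainerRegistry()
registry.register("megatron", "pretrain", PretrainTrainer)
registry.register("megatron", "sft", SFTTrainer)
```

Keying on the pair rather than on the backend alone lets one backend expose several trainers without the adapter having to branch on the stage itself.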

Reviewed changes

Copilot reviewed 83 out of 83 changed files in this pull request and generated 6 comments.

Summary per file

| File | Description |
| --- | --- |
| `tests/unit_tests/backends/megatron/test_sft_dataset_offline.py` | Unit tests for offline JSON/JSONL SFT dataset loading. |
| `tests/unit_tests/backends/megatron/test_sft_abstractions.py` | Tests for SFT normalization/tokenization and forward_step behavior. |
| `tests/unit_tests/backends/megatron/test_messages_format.py` | Tests for OpenAI messages format and formatter selection. |
| `tests/unit_tests/backends/megatron/test_megatron_sft_trainer.py` | Tests for MegatronSFTTrainer wiring into runtime factories/entrypoints. |
| `tests/unit_tests/backends/megatron/test_megatron_registration.py` | Verifies megatron adapter + stage trainers are registered in registry. |
| `tests/unit_tests/backends/megatron/test_megatron_adapter.py` | Adapter tests updated for stage-based trainer lookup and errors. |
| `runner/helpers/hooks/train/posttrain/megatron/01_convert_checkpoints.py` | Hook to convert HF checkpoints to Megatron format via Megatron-Bridge. |
| `runner/helpers/hooks/train/posttrain/megatron/00_install_requirements.sh` | Installs dependencies needed for checkpoint conversion hook. |
| `run.sh` | Convenience launcher script with quieter Torch/NCCL logging defaults. |
| `primus/tools/diag/verify_mlp_merge_fix.py` | GPU allocator diagnostic for MLP shard-merge fragmentation fix. |
| `primus/tools/diag/verify_mlp_merge_fix_realistic.py` | “Realistic” scale allocator diagnostic with large context prealloc. |
| `primus/tools/diag/inspect_sft_data.py` | Diagnostic comparing Bridge packed parquet vs Native packed cache. |
| `primus/tools/diag/__init__.py` | Declares diag utilities package. |
| `primus/core/launcher/config.py` | Adds pre_trainer/post_trainer aliasing for SFT configs. |
| `primus/core/config/primus_config.py` | Adds module-name aliasing in get_module_config. |
| `primus/core/backend/backend_registry.py` | Implements stage-based trainer registry and debug dump. |
| `primus/core/backend/backend_adapter.py` | Minor doc/log string normalization (->). |
| `primus/configs/modules/megatron/sft_trainer.yaml` | New Megatron SFT trainer module config (packing, bridge parity flags, LoRA). |
| `primus/configs/modules/megatron/primus_turbo.yaml` | Adds use_turbo_fp4_autocast flag. |
| `primus/configs/models/megatron/qwen3_235B_A22B_4layer.yaml` | Adds 4-layer smoke-test model variant. |
| `primus/configs/models/megatron/llama3_8B.yaml` | Switches tokenizer_type to HuggingFaceTokenizer. |
| `primus/configs/models/megatron_bridge/qwen3_30b_a3b.yaml` | Adds Bridge model config for Qwen3-30B-A3B. |
| `primus/configs/models/megatron_bridge/llama3_8b.yaml` | Adds Bridge model config for Llama3-8B. |
| `primus/backends/torchtitan/torchtitan_adapter.py` | Uses BackendRegistry.get_trainer_class for trainer loading. |
| `primus/backends/torchtitan/__init__.py` | Registers torchtitan pretrain trainer in stage registry. |
| `primus/backends/megatron/training/evaluator.py` | Handles [loss, num_tokens] tensor metric shape for averaging. |
| `primus/backends/megatron/sft/schema.py` | Defines normalized SFT sample/message schema and formatted spans. |
| `primus/backends/megatron/sft/runtime.py` | Provides dataset provider + pretrain entrypoint wrapper with signature probing. |
| `primus/backends/megatron/sft/preprocessing.py` | Adds local record loading + tokenization + loss mask/label shifting. |
| `primus/backends/megatron/sft/formatters.py` | Adds Alpaca/ChatML/OpenAI-messages/SQuAD formatters + selector. |
| `primus/backends/megatron/sft/dataset.py` | Implements SFTDataset and dataset-builder with packed/mlperf dispatch. |
| `primus/backends/megatron/sft/__init__.py` | Exposes SFT public API surface. |
| `primus/backends/megatron/peft/recompute.py` | Adds adapter-only recompute grad fix hook for PP=1 cases. |
| `primus/backends/megatron/peft/module_matcher.py` | PEFT module matcher utility (ported/adjusted). |
| `primus/backends/megatron/peft/lora.py` | LoRA implementation/transformations (ported/adjusted). |
| `primus/backends/megatron/peft/import_utils.py` | Safe import helpers for optional dependencies. |
| `primus/backends/megatron/peft/base.py` | Base PEFT API + freeze/walk + adapter save filtering. |
| `primus/backends/megatron/peft/adapter_wrapper.py` | Adapter wrapper state_dict/sharded_state_dict handling. |
| `primus/backends/megatron/peft/__init__.py` | PEFT package exports. |
| `primus/backends/megatron/patches/turbo/fp4_patches.py` | Changes FP4 patch gating condition to fp4 enabled. |
| `primus/backends/megatron/patches/sft_grad_sanitize_patches.py` | Adds optional NaN/Inf grad sanitization patch for benchmark configs. |
| `primus/backends/megatron/patches/checkpoint_patches.py` | Adds tolerant factory merge + torch_dist load_checkpoint fixes. |
| `primus/backends/megatron/megatron_adapter.py` | Uses stage-based trainer registry (raises RuntimeError on missing). |
| `primus/backends/megatron/core/transformer/moe/router.py` | Removes inconsistent force-LB routing_map override; adds rationale. |
| `primus/backends/megatron/core/fp4_utils.py` | Lazier Turbo imports; TE fallback autocast; improved recipe handling. |
| `primus/backends/megatron/core/datasets/sft_dataset.py` | Compatibility shim re-exporting new SFT dataset APIs. |
| `primus/backends/megatron/__init__.py` | Registers megatron pretrain + sft trainers in stage registry. |
| `primus/backends/megatron_bridge/__init__.py` | Registers bridge pretrain + sft trainers in stage registry. |
| `examples/moe_package/start_training_qwen_30B_a3B.sh` | Example pretrain launch script for Qwen3-30B-A3B. |
| `examples/moe_package/start_training_dsv2_lite.sh` | Example pretrain launch script for DeepSeek-V2-Lite. |
| `examples/megatron/prepare.py` | Skips pretrain dataset tokenization for stage=sft; handles empty submodule. |
| `examples/megatron/convert_to_jsonl.py` | Utility to export HF/CSV datasets into JSONL for offline SFT. |
| `examples/megatron/configs/MI355X/qwen3_235B_A22B-BF16-sft.yaml` | Example native SFT config for Qwen3-235B-A22B. |
| `examples/megatron/configs/MI355X/qwen3_235B_A22B_4layer-BF16-sft.yaml` | Smoke-test SFT config for 4-layer Qwen3-235B-A22B. |
| `examples/megatron/configs/MI355X/llama3.1_8B-MXFP8-pretrain.yaml` | Adds Llama3.1 MXFP8 pretrain config. |
| `examples/megatron/configs/MI355X/llama3.1_8B-MXFP4-pretrain.yaml` | Adds Llama3.1 MXFP4 pretrain config. |
| `examples/megatron/configs/MI355X/llama3_8B-BF16-sft.yaml` | Example native SFT config for Llama3-8B. |
| `examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed.yaml` | Example packed-sequence SFT config for Llama3-8B. |
| `examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed-squad.yaml` | Packed SFT SQuAD config for Bridge-vs-Native benchmarking. |
| `examples/megatron/configs/MI355X/llama3_8B-BF16-sft-packed-bridge_aligned.yaml` | Bridge-aligned packed SFT benchmark config (native path). |
| `examples/megatron/configs/MI355X/llama3_8B-BF16-lora-sft.yaml` | LoRA-focused SFT config variant for Llama3-8B. |
| `examples/megatron/configs/MI355X/llama2_70B-FP8-sft-packed-perf.yaml` | FP8 performance-oriented packed SFT config for Llama2-70B. |
| `examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-sft.yaml` | Example SFT config for DeepSeek-V2-Lite. |
| `examples/megatron/configs/MI355X/deepseek_v2_lite-BF16-sft-packed.yaml` | Packed SFT config for DeepSeek-V2-Lite with extensive perf notes. |
| `examples/megatron_bridge/configs/MI355X/qwen3_30b_a3b_lora_posttrain_packed.yaml` | Bridge packed LoRA SFT benchmark config for Qwen3-30B-A3B. |
| `examples/megatron_bridge/configs/MI355X/llama3_8b_lora_posttrain_packed.yaml` | Bridge packed LoRA SFT config for Llama3-8B. |
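
The "loss mask/label shifting" step listed for the preprocessing module is a standard part of SFT data preparation: labels are the input tokens shifted left by one, and the loss mask is 1 only where the label is an answer token. A minimal sketch of that idea follows; `build_labels_and_mask` is a hypothetical helper written for illustration and is not the function in `preprocessing.py`.

```python
from typing import List, Tuple


def build_labels_and_mask(
    prompt_ids: List[int], answer_ids: List[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Return (inputs, labels, loss_mask) for next-token prediction.

    Hypothetical helper: the real Primus preprocessing may differ.
    """
    tokens = prompt_ids + answer_ids
    inputs = tokens[:-1]   # the model sees every token except the last
    labels = tokens[1:]    # each position predicts the following token
    # labels[i] is tokens[i + 1]; train on it only if it belongs to the answer.
    loss_mask = [1 if i + 1 >= len(prompt_ids) else 0 for i in range(len(labels))]
    return inputs, labels, loss_mask


inputs, labels, loss_mask = build_labels_and_mask([1, 2, 3], [4, 5])
```

With a three-token prompt and a two-token answer, only the last two label positions contribute to the loss, so prompt tokens are never trained on.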

Comment on lines +385 to +387

_pre_forward_canary(model)

Comment on lines +157 to +162
while not done_file.exists() and elapsed < timeout:
if not lock_file.exists() and not done_file.exists():
time.sleep(2)
else:
time.sleep(5)
elapsed += 5
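
A wait loop like the one quoted above sums assumed sleep durations, which drifts whenever the actual sleep differs from the counted increment. One common alternative is to compare against a deadline taken from a monotonic clock. The following is a minimal sketch, assuming a hypothetical helper `wait_for_done`; it is not the hook's actual code and it leaves out the lock-file branch.

```python
import time
from pathlib import Path


def wait_for_done(done_file: Path, timeout: float, poll_interval: float = 2.0) -> bool:
    """Poll until done_file exists or timeout seconds pass.

    Hypothetical helper: elapsed time comes from time.monotonic(), so it
    stays correct whatever the poll interval is.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if done_file.exists():
            return True
        # Never sleep past the deadline.
        remaining = max(0.0, deadline - time.monotonic())
        time.sleep(min(poll_interval, remaining))
    # One final check in case the file appeared during the last sleep.
    return done_file.exists()
```

The return value tells the caller whether the file appeared, so a timeout can be reported explicitly instead of falling through silently.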
Comment on lines +110 to +114
except Exception as e:
    # If torch or datasets not available, skip
    if "No module named" in str(e):
        self.skipTest(f"Required module not available: {e}")
    raise
Comment on lines +141 to +145
except Exception as e:
    # If torch or datasets not available, skip
    if "No module named" in str(e):
        self.skipTest(f"Required module not available: {e}")
    raise
Comment on lines +219 to +222
except Exception as e:
    if "No module named" in str(e):
        self.skipTest(f"Required module not available: {e}")
    raise
Comment on lines +257 to +260
except Exception as e:
    if "No module named" in str(e):
        self.skipTest(f"Required module not available: {e}")
    raise
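
The four excerpts above detect a missing dependency by matching the text "No module named" in an arbitrary exception. A narrower pattern is to catch `ImportError` (of which `ModuleNotFoundError` is a subclass) at the import site. The sketch below assumes a hypothetical helper named `import_or_skip`; it is not part of the PR's tests.

```python
import importlib
import unittest
from types import ModuleType


def import_or_skip(test_case: unittest.TestCase, module_name: str) -> ModuleType:
    """Import module_name, or skip the calling test if it is unavailable.

    Hypothetical helper: only ImportError triggers a skip, so unrelated
    failures still surface as errors.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # skipTest raises unittest.SkipTest, so nothing is returned here.
        test_case.skipTest(f"Required module not available: {exc}")


class ExampleTest(unittest.TestCase):
    def test_uses_optional_dependency(self) -> None:
        json_mod = import_or_skip(self, "json")  # stdlib module, always present
        self.assertEqual(json_mod.loads("[1, 2]"), [1, 2])
```

Catching the exception type instead of its message avoids skipping a test when some unrelated error happens to contain the same text.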
Copilot AI review requested due to automatic review settings June 10, 2026 15:27

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.

Comment thread run.sh
Comment on lines +3 to +5
# Default config if no argument is provided
CONFIG_FILE=${1:-"./examples/megatron/configs/MI355X/llama3_8B-BF16-sft.yaml"}

Comment thread run.sh
echo "Starting training with config: $CONFIG_FILE"
echo "Experiment Name: $PRIMUS_EXP_NAME"

PRIMUS_TRAIN_RUNTIME=core ./primus-cli --debug direct -- train posttrain --config "$CONFIG_FILE"
Comment on lines +1 to +8
# Primus Native SFT LoRA — Quick Start

> **Branch**: `feat/megatron/support-sft-native` (PR701)
> **Backend**: Megatron-LM **native** (no Megatron-Bridge runtime dependency)
> **Hardware**: AMD MI355X / MI300X
> **Models verified**: Llama2-70B, Llama3-8B, Llama3-70B, Qwen3-30B-A3B, Qwen3-235B-A22B, DeepSeek-V2-Lite

This README walks through how to launch training on Primus's **native SFT LoRA** path, and explains exactly which fields to change when switching from BF16 / FP8 to FP4 (NVFP4 / MXFP4).
Comment on lines +85 to +88
-e EXP=examples/megatron/configs/MI355X/llama2_70B-BF16-sft-packed-mlperf_aligned.yaml \
sft_primus_0507_native \
bash -c 'cd /workspace/Primus && bash examples/run_pretrain.sh' \
2>&1 | tee /home/botahu/llama2_70b_500iter_runs/${EXP_NAME}.log
Comment on lines +142 to +146
modules:
  pre_trainer:
    framework: megatron
    config: sft_trainer.yaml
    model: llama2_70B.yaml
Comment on lines +204 to +208
# Recommended invocation:
# export PRIMUS_EXP_NAME=native_llama2_70b_fp4_perf_$(date +%Y%m%d_%H%M%S)
# EXP=examples/megatron/configs/MI355X/llama2_70B-FP4-sft-packed-perf.yaml \
# bash examples/run_pretrain.sh
# =============================================================================
Comment on lines +213 to +217
modules:
  pre_trainer:
    framework: megatron
    config: sft_trainer.yaml
    model: llama2_70B.yaml
    overrides:
      data_path: null
      sft_dataset_name: rajpurkar/squad
      sft_dataset_formatter: squad
Comment on lines +364 to +367
-e EXP=examples/megatron/configs/MI355X/llama2_70B-FP4-sft-packed-perf.yaml \
sft_primus_0507_native \
bash -c 'cd /workspace/Primus && bash examples/run_pretrain.sh' \
2>&1 | tee /home/botahu/llama2_70b_500iter_runs/${EXP_NAME}.log
Comment on lines +61 to +65
modules:
  sft_trainer:
    framework: megatron
    config: sft_trainer.yaml
    model: llama3_8B.yaml
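
The config excerpts above use both `pre_trainer` and `sft_trainer` as the module key, and the file summary mentions module-name aliasing in the config lookup. The exact alias table is not shown in this thread, so the sketch below is purely illustrative: `MODULE_ALIASES` and `resolve_module` are hypothetical names, and the alias direction is an assumption.

```python
from typing import Any, Dict, Tuple

# Hypothetical alias table: a request for one name may be satisfied by another key.
MODULE_ALIASES: Dict[str, Tuple[str, ...]] = {
    "sft_trainer": ("post_trainer", "pre_trainer"),
}


def resolve_module(modules: Dict[str, Any], name: str) -> Any:
    """Return the config stored under name or under one of its aliases."""
    for key in (name, *MODULE_ALIASES.get(name, ())):
        if key in modules:
            return modules[key]
    raise KeyError(f"No module config found for {name!r} or its aliases")


modules = {"pre_trainer": {"framework": "megatron", "config": "sft_trainer.yaml"}}
sft_cfg = resolve_module(modules, "sft_trainer")
```

Checking the requested name first keeps an explicit key authoritative, with aliases used only as a fallback.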
GeneDer commented Jun 10, 2026

Member Author

These files are not needed

@GeneDer GeneDer closed this Jun 10, 2026
@GeneDer GeneDer deleted the release/v26.4 branch June 10, 2026 15:40
3 participants