sync: gitlab/main -> github/main #43
Merged
# 🔒 Security
## Narrow gitleaks scanning exceptions
- Run gitleaks over the full working tree from pre-commit
- Replace broad docs/tests allowlists with generated-artifact-only exclusions
- Keep exact placeholder IP allowlists for 10.0.0.1 and 192.168.1.100
- Tighten QS_ and LTA token rules to reduce identifier false positives
---
# 📝 Documentation
## Replace non-allowlisted private endpoints
- Update rollout and checkpoint docs to use documentation IP ranges where values are not explicitly allowlisted
---
# ✅ Tests
## Align Ray address fixtures
- Update rollout manager test fixtures for the remaining non-allowlisted example endpoint
# 🐛 Bug Fix
## Extract `MODEL_DIR` / `DATA_DIR` / `EXP_DIR` in training scripts
- Fix `EXP_DIR="${MODEL_DIR:=...}"` variable-name typo in
`run-qwen3-30B-A3B-int4-8xgpu.sh` and `run-qwen3-4B-8xgpu-hybrid-async.sh`
- Split into the canonical three-way `EXP_DIR` / `MODEL_DIR` / `DATA_DIR`
form already used by `run-qwen3-30B-A3B-fp8-8xgpu.sh`
- Route `--hf-checkpoint` and `--ref-load` via `${MODEL_DIR}`, prompt /
eval data via `${DATA_DIR}`, keep `--save` / `--load` on `${EXP_DIR}`
## Allow `vision_dp_when_cp` to pass through model provider
- Extend the Megatron-Bridge override allowlist in `get_model_provider_func`
so the CLI flag is no longer silently dropped
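
The silent drop described above is typical of allowlist filtering: any key that is not listed never reaches the provider. A minimal sketch, assuming a dict of CLI values; `ALLOWED_OVERRIDES`, `collect_overrides`, and every key other than `vision_dp_when_cp` are hypothetical names, not the actual contents of `get_model_provider_func`:

```python
# Hypothetical allowlist; only `vision_dp_when_cp` comes from the change description.
ALLOWED_OVERRIDES = frozenset({"example_existing_override", "vision_dp_when_cp"})


def collect_overrides(cli_args, allowed=ALLOWED_OVERRIDES):
    """Keep only the CLI values that the allowlist lets through to the provider."""
    return {k: v for k, v in cli_args.items() if k in allowed and v is not None}


cli_args = {"vision_dp_when_cp": True, "unrelated_flag": 1}

# Before the fix (flag missing from the allowlist): dropped without any error.
before = collect_overrides(cli_args, allowed=frozenset({"example_existing_override"}))

# After the fix: the flag is forwarded.
after = collect_overrides(cli_args)
```

Logging the keys that get filtered out would make this kind of omission visible sooner, though nothing in the change description says the project does so.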
---
# 🔧 CI/CD
## Restrict gitleaks pre-commit hook to tracked content
- Switch the entry to the upstream-recommended
`gitleaks git --pre-commit --staged` form so the hook scans only staged
changes; the previous `gitleaks dir .` form scanned the full working
tree and false-positived on untracked `log/` files
…d add MFU metrics

Replace the old bottom-up per-component FLOPS calculator (flops_utils.py) with verl's 6N formula-based FlopsCounter, supporting per-model-type estimators for dense, MoE, MLA+MoE, and vision-language architectures. Add MFU (Model FLOPS Utilization) metrics via GPU peak FLOPS auto-detection.

Supported model families: Qwen2/3/3.5/3.6, LLaMA, Mistral, DeepSeek-V3, GLM4/GLM4V/GLM46V (dense & MoE & MLA), MiniCPM-V/O, SEED-OSS, MIMO, Qwen3-VL, Qwen3-VL-MoE, Qwen3-Omni-MoE.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
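
For orientation, the 6N rule of thumb and the resulting MFU ratio can be sketched as below. This is a dense-model approximation only; the function names and the example numbers are illustrative assumptions and do not reproduce verl's `FlopsCounter` or its per-architecture estimators.

```python
def estimate_train_flops(num_params, num_tokens):
    """Dense-model rule of thumb: about 6 FLOPs per parameter per trained token
    (2 for the forward pass, 4 for the backward pass). Attention-score FLOPs
    are ignored in this sketch."""
    return 6.0 * num_params * num_tokens


def mfu(num_params, num_tokens, step_seconds, num_gpus, peak_flops_per_gpu):
    """Achieved FLOPS divided by the aggregate hardware peak."""
    achieved = estimate_train_flops(num_params, num_tokens) / step_seconds
    return achieved / (num_gpus * peak_flops_per_gpu)


# Illustrative numbers only: a 4e9-parameter model, 1e6 tokens per step,
# a 10 s step, 8 devices, and an assumed peak of 1e15 FLOPS per device.
example = mfu(
    num_params=4e9,
    num_tokens=1_000_000,
    step_seconds=10.0,
    num_gpus=8,
    peak_flops_per_gpu=1e15,
)
```

For MoE models the usual substitute for `num_params` is the number of parameters active per token; that is an assumption of this sketch, not something the commit message states.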
…e converter path
# ⭐ Feature
## Add Qwen3.5-397B-A17B model config and training scripts
- Add model config `scripts/models/qwen35-397B-A17B.sh` (60 layers, 512 experts, MoE)
- Add 128xGPU text training script with DeepEP flex dispatcher
- Add 128xGPU multimodal training script for open-r1mm dataset
- Add `--warm-hf-checkpoint-page-cache` CLI flag to optionally pre-read HF checkpoints
---
# ♻️ Refactor
## Unify bridge converter expert/non-expert weight sync paths
- Remove dual BF16/INT4 code paths in `HfWeightIteratorBridge`; always use bucket-based broadcast
- Remove upstream `megatron-bridge` fallback (`_iter_hf_params_via_upstream_bridge`)
- Rename `_broadcast_quantized_*` to `_broadcast_converted_*` (no longer quantization-specific)
- Add `broadcast_and_apply_configs()` for PP-rank config exchange in `BridgeConverter`
- Add prefix-aware fallback module lookup for tasks missing `megatron_module`
- Fix EP `src_rank` dedup: keep lowest rank when multiple EP ranks own the same expert param
- Add error logging with param shape/mapping details on `megatron_to_hf` failure
---
# 🔩 Chore
## Update scripts and tooling
- Enable `--warm-hf-checkpoint-page-cache` across all existing bridge-mode training scripts
- Scale up Qwen3.5-35B-A3B multimodal script (CP=4, EP=16, 128xGPU resources)
- Fix `xargs` in `ray-job.sh` with `--no-run-if-empty` to avoid error on empty input
- Extend `.gitleaks.toml` to exclude logs, caches, and build artifacts
- Simplify `ssh-ray-cluster` SKILL.md to a concise 3-step debug loop
- Rename test file to `test_broadcast_converted.py` matching function renames
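
The EP `src_rank` dedup rule in the refactor list (keep the lowest rank when several EP ranks own the same expert param) can be illustrated with a small sketch. The `(param_name, rank)` input shape, the function name, and the parameter names are assumptions for illustration, not the converter's real data structures.

```python
def dedup_src_ranks(owners):
    """owners: iterable of (param_name, ep_rank) pairs.

    Returns {param_name: lowest owning rank}, so that exactly one rank acts
    as the broadcast source for each expert parameter.
    """
    src = {}
    for name, rank in owners:
        if name not in src or rank < src[name]:
            src[name] = rank
    return src


# Hypothetical ownership report: expert 7 is reported by ranks 3 and 1.
owners = [("experts.7.w1", 3), ("experts.7.w1", 1), ("experts.2.w1", 0)]
src_rank = dedup_src_ranks(owners)
```

Picking the minimum keeps the choice deterministic regardless of the order in which ranks report ownership.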
# 🐛 Bug Fix
## Fix HF→Megatron mapping under PP/EP > 1
- Switch `_build_hf_to_megatron_mapping` to use `task.global_param_name` instead of `task.param_name`
- `hf_param_name` yielded by `export_hf_weights` is already global, so the megatron-side counterpart must also be global to keep both sides of the mapping in the same namespace
- Under PP > 1 or EP > 1, `param_name` carries local layer/expert indices and would otherwise collide with PP-placeholder tasks that store global names, producing same-string-different-meaning entries that break any downstream layer-indexed dispatch
---
# 🔩 Chore
## Patch megatron-bridge progress tracking to tqdm
- Monkey-patch `MegatronModelBridge._with_progress_tracking` at module import to use tqdm instead of `rich.Progress`
- Upstream rich live-rendering keeps overwriting log lines in distributed training; tqdm is single-line and plays nicely with our loggers
- Guarded by try/except ImportError so environments without megatron-bridge silently skip the patch
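
The local-versus-global naming collision behind the mapping fix can be shown with a toy example. The name pattern `layers.<i>.` and the "stage index times layers per stage" offset are illustrative assumptions; the bridge's real naming and offset logic may differ.

```python
import re


def to_global_name(local_name, layer_offset):
    """Shift the first local layer index by this PP stage's layer offset."""
    return re.sub(
        r"layers\.(\d+)\.",
        lambda m: f"layers.{int(m.group(1)) + layer_offset}.",
        local_name,
        count=1,
    )


# Two PP stages with two layers each; every stage numbers its layers from 0.
layers_per_stage = 2
stages = {
    0: ["layers.0.mlp.weight", "layers.1.mlp.weight"],
    1: ["layers.0.mlp.weight", "layers.1.mlp.weight"],
}

# Keyed by local names, the two stages collapse onto the same two strings.
local_keys = {name for names in stages.values() for name in names}

# Keyed by global names, all four layers stay distinct.
global_keys = {
    to_global_name(name, stage * layers_per_stage)
    for stage, names in stages.items()
    for name in names
}
```

With local names, `layers.0.mlp.weight` refers to a different layer depending on the stage, which is the same-string-different-meaning problem the fix removes.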
# 🐛 Bug Fix
## Detect colocated engines via Ray node_id instead of GPU offset
- Replace GPU-offset arithmetic with a node_id comparison between actor and rollout engines to decide whether to route weights via CUDA IPC
- Gather all actor node_ids with `all_gather_object` over the gloo group and treat an engine as colocated only when its node_id is in that set
- Fix hybrid-mode crash where rollout engines on a remote node were mis-classified as colocated (rollout pg has its own offsets starting at 0), causing CUDA IPC handles to fail across nodes with `cudaErrorMapBufferObjectFailed` in `_rebuild_cuda_tensor`
- Remove the special-case `self.args.hybrid` branch; node_id check handles both colocate and hybrid uniformly
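
The colocation rule above reduces to a set-membership test once the actor node IDs are known. In the sketch below, plain lists stand in for the result of `all_gather_object` and for the node IDs reported by the rollout engines; the function name and the sample IDs are hypothetical.

```python
def colocated_engines(actor_node_ids, engine_node_ids):
    """Return, per rollout engine, whether it shares a node with any actor.

    Only engines flagged True would be eligible for CUDA IPC; the others
    need a cross-node transfer path.
    """
    actor_nodes = set(actor_node_ids)  # stand-in for the gathered actor node IDs
    return [node_id in actor_nodes for node_id in engine_node_ids]


# Hypothetical layout: actors on node-a and node-b, one rollout engine on a
# remote node-c (the hybrid case that used to be mis-classified).
actor_node_ids = ["node-a", "node-a", "node-b"]
engine_node_ids = ["node-a", "node-c"]
flags = colocated_engines(actor_node_ids, engine_node_ids)
```

Because the test depends only on node identity, it does not care how each placement group numbers its GPUs, which is why the separate hybrid branch becomes unnecessary.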
# ⚡ Performance
## Move rollout_temperature division into per-chunk yield in get_responses
- Remove full-tensor `logits.div(rollout_temperature)` that allocated a duplicate `[T, V]` fp32 buffer (~16 GiB on Qwen3 with long packed sequences), doubling loss-step peak memory and triggering OOM under allocator fragmentation
- Apply the scalar division to each `logits_chunk` right before yielding, so allocations are bounded by per-sample response size and happen incrementally instead of as a single giant contiguous block
- Numerically equivalent across all four chunking paths (cp_size==1 RL, SFT, allgather_cp, zigzag CP) since scalar division commutes with slicing and concatenation
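
The equivalence claim (scalar division commutes with slicing and concatenation) can be checked on a toy case. Plain Python lists stand in for the logits tensor here, and `iter_chunks` with its `boundaries` argument is a hypothetical stand-in for the chunking done in `get_responses`, not its actual signature.

```python
def iter_chunks(logits, boundaries, temperature=1.0):
    """Yield per-sample slices, scaling each slice only when it is yielded.

    Only one chunk-sized scaled copy exists at a time, instead of a scaled
    copy of the whole sequence.
    """
    for start, end in boundaries:
        yield [x / temperature for x in logits[start:end]]


logits = [2.0, -1.0, 0.5, 4.0, 3.0]
boundaries = [(0, 2), (2, 5)]
temperature = 0.7

# Old behaviour: scale everything up front (a full-size duplicate), then slice.
full_then_slice = [x / temperature for x in logits]

# New behaviour: slice first, scale each chunk lazily.
chunked = [x for chunk in iter_chunks(logits, boundaries, temperature) for x in chunk]
```

Each element goes through the same single division in both orders, so the results match exactly rather than only approximately.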
# 🐛 Bug Fix
## Fall back when device properties are unavailable
- Treat missing current-device properties as CPU for peak FLOPS detection
- Keep explicit device-name lookup unchanged for known accelerator MFU reporting
---
# ✅ Tests
## Cover CPU-only CI path
- Add a regression test for unavailable device properties in peak FLOPS detection
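
One way the fallback could look is sketched below. `resolve_device_name` and its `get_props` callable are hypothetical; the callable stands in for whatever the project uses to query current-device properties, and the exception types caught are an assumption about how such a query fails on CPU-only machines.

```python
from types import SimpleNamespace


def resolve_device_name(get_props, explicit_name=None):
    """Return the device name used for the peak-FLOPS lookup.

    An explicit name always wins. Otherwise, missing or failing device
    properties are treated as "CPU" instead of raising.
    """
    if explicit_name is not None:
        return explicit_name
    try:
        props = get_props()
    except (RuntimeError, AssertionError):
        props = None
    name = getattr(props, "name", None)
    return name if name else "CPU"


def _no_device():
    raise RuntimeError("no accelerator available")


cpu_from_none = resolve_device_name(lambda: None)
cpu_from_error = resolve_device_name(_no_device)
gpu_name = resolve_device_name(lambda: SimpleNamespace(name="NVIDIA A100"))
explicit = resolve_device_name(_no_device, explicit_name="H100")
```

The first two calls mirror the CPU-only CI path that the regression test is said to cover; the last call shows that an explicit device name bypasses the property query entirely.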
NINGBENZHE approved these changes on Jun 3, 2026
Routine internal -> external sync.