sync: gitlab/main -> github/main by Yangruipis · Pull Request #43 · redai-infra/Relax

Yangruipis · 2026-06-03T12:03:21Z

Routine internal -> external sync.

# 🔒 Security

## Narrow gitleaks scanning exceptions

- Run gitleaks over the full working tree from pre-commit
- Replace broad docs/tests allowlists with generated-artifact-only exclusions
- Keep exact placeholder IP allowlists for 10.0.0.1 and 192.168.1.100
- Tighten QS_ and LTA token rules to reduce identifier false positives

---

# 📝 Documentation

## Replace non-allowlisted private endpoints

- Update rollout and checkpoint docs to use documentation IP ranges where values are not explicitly allowlisted

---

# ✅ Tests

## Align Ray address fixtures

- Update rollout manager test fixtures for the remaining non-allowlisted example endpoint

# 🐛 Bug Fix

## Extract `MODEL_DIR` / `DATA_DIR` / `EXP_DIR` in training scripts

- Fix `EXP_DIR="${MODEL_DIR:=...}"` variable-name typo in `run-qwen3-30B-A3B-int4-8xgpu.sh` and `run-qwen3-4B-8xgpu-hybrid-async.sh`
- Split into the canonical three-way `EXP_DIR` / `MODEL_DIR` / `DATA_DIR` form already used by `run-qwen3-30B-A3B-fp8-8xgpu.sh`
- Route `--hf-checkpoint` and `--ref-load` via `${MODEL_DIR}`, prompt / eval data via `${DATA_DIR}`, keep `--save` / `--load` on `${EXP_DIR}`

## Allow `vision_dp_when_cp` to pass through model provider

- Extend the Megatron-Bridge override allowlist in `get_model_provider_func` so the CLI flag is no longer silently dropped

---

# 🔧 CI/CD

## Restrict gitleaks pre-commit hook to tracked content

- Switch the entry to the upstream-recommended `gitleaks git --pre-commit --staged` form so the hook scans only staged changes; the previous `gitleaks dir .` form scanned the full working tree and false-positived on untracked `log/` files

…d add MFU metrics

Replace the old bottom-up per-component FLOPS calculator (flops_utils.py) with verl's 6N formula-based FlopsCounter, supporting per-model-type estimators for dense, MoE, MLA+MoE, and vision-language architectures. Add MFU (Model FLOPS Utilization) metrics via GPU peak FLOPS auto-detection.

Supported model families: Qwen2/3/3.5/3.6, LLaMA, Mistral, DeepSeek-V3, GLM4/GLM4V/GLM46V (dense & MoE & MLA), MiniCPM-V/O, SEED-OSS, MIMO, Qwen3-VL, Qwen3-VL-MoE, Qwen3-Omni-MoE.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
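As a rough illustration of the 6N estimate and the MFU ratio described in this commit, a minimal sketch follows. The function names, the parameter count, the step time, and the peak-FLOPS figure are all illustrative assumptions and are not taken from `FlopsCounter`; the per-model-type estimators presumably refine this (for example, counting only active parameters for MoE).

```python
def estimate_train_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training FLOPs with the 6N rule: about 6 FLOPs per parameter per token."""
    return 6.0 * num_params * num_tokens


def mfu(achieved_flops_per_sec: float, peak_flops_per_sec: float) -> float:
    """Model FLOPS Utilization: achieved throughput as a fraction of hardware peak."""
    if peak_flops_per_sec <= 0:
        raise ValueError("peak_flops_per_sec must be positive")
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical numbers, for illustration only.
step_flops = estimate_train_flops(num_params=4e9, num_tokens=1_000_000)
step_seconds = 60.0
num_gpus = 8
assumed_peak_per_gpu = 1e15  # assumed per-GPU peak FLOPS, not a vendor figure

# Per-GPU achieved FLOPS divided by per-GPU peak FLOPS.
utilization = mfu(step_flops / step_seconds / num_gpus, assumed_peak_per_gpu)
```

With these made-up inputs the per-GPU throughput is 5e13 FLOPS, so the reported utilization would be 5% of the assumed peak.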

…e converter path

# ⭐ Feature

## Add Qwen3.5-397B-A17B model config and training scripts

- Add model config `scripts/models/qwen35-397B-A17B.sh` (60 layers, 512 experts, MoE)
- Add 128xGPU text training script with DeepEP flex dispatcher
- Add 128xGPU multimodal training script for open-r1mm dataset
- Add `--warm-hf-checkpoint-page-cache` CLI flag to optionally pre-read HF checkpoints

---

# ♻️ Refactor

## Unify bridge converter expert/non-expert weight sync paths

- Remove dual BF16/INT4 code paths in `HfWeightIteratorBridge`; always use bucket-based broadcast
- Remove upstream `megatron-bridge` fallback (`_iter_hf_params_via_upstream_bridge`)
- Rename `_broadcast_quantized_*` to `_broadcast_converted_*` (no longer quantization-specific)
- Add `broadcast_and_apply_configs()` for PP-rank config exchange in `BridgeConverter`
- Add prefix-aware fallback module lookup for tasks missing `megatron_module`
- Fix EP `src_rank` dedup: keep lowest rank when multiple EP ranks own the same expert param
- Add error logging with param shape/mapping details on `megatron_to_hf` failure

---

# 🔩 Chore

## Update scripts and tooling

- Enable `--warm-hf-checkpoint-page-cache` across all existing bridge-mode training scripts
- Scale up Qwen3.5-35B-A3B multimodal script (CP=4, EP=16, 128xGPU resources)
- Fix `xargs` in `ray-job.sh` with `--no-run-if-empty` to avoid error on empty input
- Extend `.gitleaks.toml` to exclude logs, caches, and build artifacts
- Simplify `ssh-ray-cluster` SKILL.md to a concise 3-step debug loop
- Rename test file to `test_broadcast_converted.py` matching function renames
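The EP `src_rank` dedup rule from the Refactor section (keep the lowest rank when several EP ranks own the same expert param) can be sketched as below. This is only an illustration of the rule, not the code in `BridgeConverter`; the function name and the parameter names in the sample data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def dedup_src_ranks(owners: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each param name to the lowest rank that reports owning it."""
    src_rank: Dict[str, int] = {}
    for name, rank in owners:
        # Keep the smallest rank seen so far so every process picks the same source.
        if name not in src_rank or rank < src_rank[name]:
            src_rank[name] = rank
    return src_rank


# Hypothetical ownership reports as (param name, rank) pairs.
owners = [("experts.0.w1", 3), ("experts.0.w1", 1), ("experts.1.w1", 2)]
result = dedup_src_ranks(owners)
```

Choosing the minimum makes the result independent of the order in which ownership reports arrive, which matters when every rank has to agree on the broadcast source.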

# 🐛 Bug Fix

## Fix HF→Megatron mapping under PP/EP > 1

- Switch `_build_hf_to_megatron_mapping` to use `task.global_param_name` instead of `task.param_name`
- `hf_param_name` yielded by `export_hf_weights` is already global, so the megatron-side counterpart must also be global to keep both sides of the mapping in the same namespace
- Under PP > 1 or EP > 1, `param_name` carries local layer/expert indices and would otherwise collide with PP-placeholder tasks that store global names, producing same-string-different-meaning entries that break any downstream layer-indexed dispatch

---

# 🔩 Chore

## Patch megatron-bridge progress tracking to tqdm

- Monkey-patch `MegatronModelBridge._with_progress_tracking` at module import to use tqdm instead of `rich.Progress`
- Upstream rich live-rendering keeps overwriting log lines in distributed training; tqdm is single-line and plays nicely with our loggers
- Guarded by try/except ImportError so environments without megatron-bridge silently skip the patch
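To make the namespace problem in the mapping fix concrete, here is a small sketch. The `Task` dataclass is a hypothetical stand-in with only the three name fields mentioned above, and the parameter names are invented; it assumes a two-stage pipeline where the second stage's local layer 0 is global layer 2.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:  # hypothetical stand-in for a conversion task
    hf_param_name: str       # HF-side name, already global
    param_name: str          # Megatron-side name with a stage-local layer index
    global_param_name: str   # Megatron-side name with a global layer index


def build_hf_to_megatron(tasks: List[Task], use_global: bool) -> Dict[str, str]:
    """Build an HF name -> Megatron name mapping from either name field."""
    return {
        t.hf_param_name: (t.global_param_name if use_global else t.param_name)
        for t in tasks
    }


tasks = [
    # PP stage 0, local layer 0 == global layer 0
    Task("model.layers.0.mlp.weight", "decoder.layers.0.mlp.weight", "decoder.layers.0.mlp.weight"),
    # PP stage 1, local layer 0 == global layer 2 (assumed layout)
    Task("model.layers.2.mlp.weight", "decoder.layers.0.mlp.weight", "decoder.layers.2.mlp.weight"),
]

local_map = build_hf_to_megatron(tasks, use_global=False)
global_map = build_hf_to_megatron(tasks, use_global=True)

# With local names, two different HF params point at the same Megatron string.
ambiguous = len(set(local_map.values())) < len(local_map)
# With global names, every HF param has a distinct Megatron counterpart.
unambiguous = len(set(global_map.values())) == len(global_map)
```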

# 🐛 Bug Fix

## Detect colocated engines via Ray node_id instead of GPU offset

- Replace GPU-offset arithmetic with a node_id comparison between actor and rollout engines to decide whether to route weights via CUDA IPC
- Gather all actor node_ids with `all_gather_object` over the gloo group and treat an engine as colocated only when its node_id is in that set
- Fix hybrid-mode crash where rollout engines on a remote node were mis-classified as colocated (rollout pg has its own offsets starting at 0), causing CUDA IPC handles to fail across nodes with `cudaErrorMapBufferObjectFailed` in `_rebuild_cuda_tensor`
- Remove the special-case `self.args.hybrid` branch; node_id check handles both colocate and hybrid uniformly
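The colocation decision described above reduces to a set-membership test once the actor node IDs have been gathered. The sketch below shows only that final step in plain Python; the gather itself (`all_gather_object` over the gloo group) is replaced by a ready-made list, and the function name, engine names, and node IDs are hypothetical.

```python
from typing import Dict, Iterable, List


def colocated_engines(
    actor_node_ids: Iterable[str],
    engine_node_ids: Dict[str, str],
) -> List[str]:
    """Return engines whose node hosts at least one actor (CUDA IPC candidates)."""
    actor_nodes = set(actor_node_ids)  # duplicates from multiple actors per node collapse
    return [name for name, node in engine_node_ids.items() if node in actor_nodes]


# Stand-in for the gathered actor node IDs, one entry per actor rank.
actor_node_ids = ["node-a", "node-a", "node-b"]
# Hypothetical rollout engines and the nodes they run on.
engine_node_ids = {"engine-0": "node-a", "engine-1": "node-c"}

local = colocated_engines(actor_node_ids, engine_node_ids)
```

Here `engine-1` runs on a node with no actors, so it would be excluded from the CUDA IPC path regardless of what its GPU offsets look like.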

# ⚡ Performance

## Move rollout_temperature division into per-chunk yield in get_responses

- Remove full-tensor `logits.div(rollout_temperature)` that allocated a duplicate `[T, V]` fp32 buffer (~16 GiB on Qwen3 with long packed sequences), doubling loss-step peak memory and triggering OOM under allocator fragmentation
- Apply the scalar division to each `logits_chunk` right before yielding, so allocations are bounded by per-sample response size and happen incrementally instead of as a single giant contiguous block
- Numerically equivalent across all four chunking paths (cp_size==1 RL, SFT, allgather_cp, zigzag CP) since scalar division commutes with slicing and concatenation
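The equivalence claim above (scalar division commutes with slicing and concatenation) can be checked with a toy example. This sketch uses nested Python lists in place of tensors, and the generator name and chunk bounds are hypothetical; it is not the `get_responses` implementation.

```python
from typing import Iterator, List, Tuple

Rows = List[List[float]]


def iter_scaled_chunks(
    logits: Rows,
    bounds: List[Tuple[int, int]],
    temperature: float,
) -> Iterator[Rows]:
    """Yield each [start, end) slice of logits, divided by temperature on the fly."""
    for start, end in bounds:
        # Only this slice is materialized in scaled form at any one time.
        yield [[x / temperature for x in row] for row in logits[start:end]]


logits = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
temperature = 0.7

# Old approach: scale everything up front (a full-size duplicate buffer).
full = [[x / temperature for x in row] for row in logits]

# New approach: scale per chunk, then concatenate the chunks.
chunked = [
    row
    for chunk in iter_scaled_chunks(logits, [(0, 1), (1, 3)], temperature)
    for row in chunk
]
```

Both paths perform the same element-wise division on the same values, so `chunked` and `full` match exactly; only the peak size of the scaled buffer differs.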

# 🐛 Bug Fix

## Fall back when device properties are unavailable

- Treat missing current-device properties as CPU for peak FLOPS detection
- Keep explicit device-name lookup unchanged for known accelerator MFU reporting

---

# ✅ Tests

## Cover CPU-only CI path

- Add a regression test for unavailable device properties in peak FLOPS detection
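A fallback of this kind might look roughly like the sketch below. How the repository actually reads device properties, and which exceptions it guards against, are not stated in this PR, so the `get_properties` callable, the caught exception types, and the device names here are all assumptions.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def resolve_device_name(get_properties: Callable[[], Optional[object]]) -> str:
    """Return the accelerator name, or "cpu" when properties cannot be read."""
    try:
        props = get_properties()
    except (RuntimeError, AssertionError):  # assumed failure modes on CPU-only hosts
        props = None
    name = getattr(props, "name", None)
    return name if name else "cpu"


def _no_device() -> object:
    """Simulate a CPU-only CI host where querying the device fails."""
    raise RuntimeError("no accelerator available")


on_cpu_ci = resolve_device_name(_no_device)
on_gpu = resolve_device_name(lambda: SimpleNamespace(name="example-accelerator"))
```

Keeping the name resolution separate from the peak-FLOPS table lookup means the known-accelerator path stays untouched while the CPU-only path gets a defined result.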

Yangruipis and others added 9 commits June 3, 2026 19:42

fix(R3): shape mismatch when cp > 1

4be1014

fix(misc): tui bug

d0a6125

Yangruipis requested a review from NINGBENZHE as a code owner June 3, 2026 12:03

NINGBENZHE approved these changes Jun 3, 2026

NINGBENZHE merged commit 13d31af into main Jun 3, 2026
5 checks passed

Yangruipis deleted the sync/from-gitlab branch June 3, 2026 13:40


NINGBENZHE merged 10 commits into main from sync/from-gitlab


4 participants
