
sync: gitlab/main -> github/main #40

Merged: Yangruipis merged 15 commits into main from sync/from-gitlab on May 29, 2026

@Yangruipis

Collaborator

Routine internal -> external sync.

Yangruipis and others added 15 commits May 29, 2026 17:18
# 🐛 Bug Fix

## Fix MODEL_DIR/EXP_DIR initialization in qwen35-9B hybrid-async script

- Replace buggy `EXP_DIR="${MODEL_DIR:=...}"` side-effect assignment with separate `EXP_DIR`/`MODEL_DIR`/`DATA_DIR` defaults, matching `run-qwen35-9B-8xgpu-openr1mm-async.sh`
- Point `--hf-checkpoint` / `--ref-load` at `${MODEL_DIR}` and `PROMPT_SET` at `${DATA_DIR}` so model and dataset roots can be overridden independently of the experiment output dir

(cherry picked from commit 466c779)
# 🔩 Chore

## Align DeepEyes dataset inputs

- Align fp16, GenRM, and partial-rollout scripts with examples/deepeyes/run_deepeyes.sh
- Use the same DeepEyes v1 training shards as the main script
- Use the same thinklite reasoning accuracy eval slice as the main script

(cherry picked from commit df613b4)
# 📝 Documentation

## Add bilingual Hybrid training mode guide

- Add `docs/en/guide/hybrid-training.md` and `docs/zh/guide/hybrid-training.md` describing the hybrid execution mode (streaming TransferQueue + in-process TensorBackuper weight sharing)
- Cover mode comparison vs Colocate / Fully Async, role layout (`ROLES_COLOCATE` with disjoint actor/rollout placement groups), `--hybrid` flag resolution, and the three-phase `train_hybrid` loop
- Document required and optional flags (`--hybrid`, `--num-iters-per-train-update`, `--max-staleness`, `--balance-data`) and the default overrides applied in `relax/utils/arguments.py`
- Include the 8-GPU multimodal reference launch from `scripts/training/multimodal/run-qwen35-9B-8xgpu-openr1mm-hybrid-async.sh` and troubleshooting tips (stalled sub-batches, balance-data rejection)
- List planned next steps: integrate DCS for weight sync, split `train_actor` by `num_iters_per_train_update`

## Register pages in VitePress sidebar

- Add Hybrid Training Mode under the Advanced group in both `en` and `zh` sidebars in `docs/.vitepress/config.mts`

(cherry picked from commit 7f78ff7)
# ⭐ Feature

## Add JSON provider config dump

- Keep the existing transformer_config.pkl dump for compatibility
- Also write transformer_config.json next to it for easier inspection
- Convert non-JSON-safe values recursively and fall back to str() when needed
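
The recursive conversion with a `str()` fallback can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper name `to_json_safe` and the set of handled types are assumptions.

```python
import dataclasses
import enum
import json


def to_json_safe(value):
    """Recursively convert a value into something json.dumps accepts."""
    # Enums first, so IntEnum members are emitted by name rather than as ints.
    if isinstance(value, enum.Enum):
        return value.name
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # Dataclass instances (not dataclass types) are expanded field by field.
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        return {f.name: to_json_safe(getattr(value, f.name))
                for f in dataclasses.fields(value)}
    if isinstance(value, dict):
        # JSON object keys must be strings.
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_json_safe(v) for v in value]
    # Fallback for callables, dtypes, and other opaque objects.
    return str(value)


def dump_config_json(config, path):
    """Write the converted config as indented JSON (hypothetical helper)."""
    with open(path, "w") as f:
        json.dump(to_json_safe(config), f, indent=2, sort_keys=True)
```

Keeping the pickle dump alongside means the JSON file only has to be readable, not round-trippable, which is why a lossy `str()` fallback is acceptable here.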

(cherry picked from commit 598c15c)
# 📝 Documentation

## Add hybrid mode to bilingual README

- Add Hybrid bullet to Highlights section in both README.md and README_zh.md
- Add 05/26/2026 News entry pointing to the Hybrid Training guide
- Expand Architecture section from two to three execution modes with Hybrid description (separate PG + in-process ref/actor_fwd via TensorBackuper + _switch_model)
- Add Hybrid Training doc link to the "Learn more" line

(cherry picked from commit 6a262cb)
# ⚡ Performance

## Pre-fault HF safetensors into page cache once per node

- Add `_warm_hf_checkpoint_page_cache(source_path)` in `relax/backends/megatron/checkpoint.py`, invoked from `_load_checkpoint_hf` before `AutoBridge.from_hf_pretrained`
- Eliminates the dominant NFS-mmap small-read bottleneck during `bridge.load_hf_weights` (`aten::cat` was running at ~20 MB/s, accounting for ~65% of init CPU time on 30B-A3B-class MoE models)
- Explicit per-node coordination: `LOCAL_RANK == 0` runs `cat <ckpt>/*.{safetensors,bin} > /dev/null`, other local ranks poll a marker under `/dev/shm`
- Advisory `flock` wraps the rank-0 path so two Relax jobs sharing a host and ckpt do not duplicate the warmup
- Marker lives in `/dev/shm` (tmpfs) so it naturally clears on reboot, avoiding stale-marker / cold-cache mismatches
- Warmup is best-effort: missing path, non-zero `cat` exit, or wait timeout only log a warning, never a correctness gate
- Configurable wait via `RELAX_HF_WARMUP_TIMEOUT_S` (default 1800s)
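
The per-node coordination described above can be sketched in pure Python. This is an illustration under assumptions, not the body of `_warm_hf_checkpoint_page_cache`: the function name `warm_page_cache`, its parameters, and the lock-file naming are hypothetical, and files are read in Python instead of shelling out to `cat`.

```python
import fcntl
import time
from pathlib import Path


def warm_page_cache(ckpt_dir, marker_path, local_rank, timeout_s=1800.0, poll_s=0.5):
    """Best-effort warmup; returns True when the cache is believed warm."""
    ckpt_dir, marker = Path(ckpt_dir), Path(marker_path)
    if not ckpt_dir.is_dir():
        return False  # missing path: the caller only logs a warning
    if local_rank != 0:
        # Non-leader ranks poll the marker until the deadline passes.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if marker.exists():
                return True
            time.sleep(poll_s)
        return marker.exists()
    lock_path = marker.with_name(marker.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Advisory lock: a second job on the same host waits here.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if marker.exists():
                return True  # another job already warmed this checkpoint
            for path in sorted(ckpt_dir.iterdir()):
                if path.suffix in (".safetensors", ".bin"):
                    with open(path, "rb") as f:
                        # Read and discard, pulling pages into the cache.
                        while f.read(16 << 20):
                            pass
            marker.touch()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A `False` return is informational only; consistent with the best-effort design, the caller would proceed with the normal load either way.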

---

# ✅ Tests

## Reshape repro profiler around the bridge progress loop

- Replace `_maybe_profile` contextmanager in `scripts/tools/repro_megatron_bridge_load.py` with `_install_bridge_progress_profiler` that monkey-patches `MegatronModelBridge._with_progress_tracking`
- Profiles a fixed `RELAX_REPRO_PROFILE_STEPS` window of conversion tasks (default 50) after `RELAX_REPRO_PROFILE_WARMUP` warmup tasks (default 5), then dumps trace/operator-table/stacks/metadata immediately
- Add `RELAX_REPRO_PROFILE_EXIT_AFTER_DUMP` early-exit knob so a long load can be cut short once the profile window is captured
- `scripts/tools/repro_qwen35_moe_bridge_load_tp4pp2.sh`: default `RELAX_REPRO_PROFILE=0`, set `PYTHONPATH=$REPO_ROOT`, default profile dir to `/tmp/relax/profile`

(cherry picked from commit 335001c)
# ✨ Feature

- Propagate virtual pipeline size into Megatron-Bridge providers.
- Derive vp_stage from Megatron virtual pipeline state for provider wrappers.
- Round dynamic microbatch counts up to the VPP group multiple.
- Add Qwen3.6-35B 8xGPU VPP trial settings.
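
Rounding a dynamic microbatch count up to a group multiple is plain ceiling arithmetic. The sketch below assumes the multiple is already known; the function and parameter names are illustrative and do not come from the commit.

```python
def round_up_to_multiple(num_microbatches, group_multiple):
    """Return the smallest multiple of group_multiple >= num_microbatches."""
    if group_multiple <= 0:
        raise ValueError("group_multiple must be positive")
    # Ceiling division without floats: -(-a // b) == ceil(a / b) for b > 0.
    return -(-num_microbatches // group_multiple) * group_multiple
```

For example, 5 dynamic microbatches with a group multiple of 4 would be padded to 8, while 8 stays at 8.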

---

# ✅ Tests

- Add focused VPP provider and microbatch rounding regressions.
- Verified with focused pytest and pre-commit.

(cherry picked from commit 460aa2d)
# 🐛 Bug Fix

## Resume aborted DeepEyes samples by status

- Detect aborted samples from sample status and response length
- Preserve multimodal rollout state needed for continued generation
- Keep off-policy masking controlled by the existing partial-rollout mask flag

## Align resumed generation budgets

- Track current-turn generated tokens separately from context budget
- Apply the smaller active budget to resumed inference calls
- Clear turn-local resume metadata when the turn completes
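
One way to read "apply the smaller active budget" is as the minimum of two remaining allowances. The sketch below is an assumption about the shape of that computation; all four parameter names are hypothetical.

```python
def resumed_max_new_tokens(turn_budget, turn_generated, context_budget, context_used):
    """Token limit for a resumed call: the tighter of the two remaining budgets."""
    # Tokens still allowed in the current turn, tracked separately from context.
    remaining_turn = max(turn_budget - turn_generated, 0)
    # Tokens still allowed before the overall context limit is reached.
    remaining_context = max(context_budget - context_used, 0)
    return min(remaining_turn, remaining_context)
```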

## Repair rollout prefetch and abort handoff

- Wait for aborted samples to return to the buffer before the next fetch
- Submit the next synchronous prefetch after transfer tasks complete

(cherry picked from commit 867ed47)
# ⭐ Feature

## Add INT4 QAT weight sync pipeline

- Add BridgeConverter to unify HF→Megatron weight conversion for bridge and DCS backends
- Add fake INT4 quantization CUDA kernel for QAT forward pass
- Add compressed-tensors INT4 quantizer processor for weight repacking
- Add quantization_config ignore-list augmentation for non-quantized namespaces
- Add `--sglang-hf-checkpoint` arg to let INT4 QAT point SGLang at original INT4 weights
- Add `--rollout-engine-init-timeout` arg with progress-bar wait for engine startup
- Add Kimi K2.6 model config and INT4 training launch scripts (text + multimodal)
- Add MoE INT4→BF16 offline cast tool (`relax/tools/quant_cast/convert_moe_int4_to_bf16.py`)
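
For readers unfamiliar with fake quantization, the idea is to quantize and immediately dequantize so the forward pass sees INT4 rounding error while tensors stay in floating point. The pure-Python reference below illustrates the general technique only; it is not the CUDA kernel from this commit, and the symmetric scheme, group size, and function name are assumptions.

```python
def fake_quant_int4(weights, group_size=32):
    """Quantize-dequantize a flat list of floats with per-group symmetric INT4."""
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7 so values fit the signed range [-8, 7].
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        for w in group:
            q = max(-8, min(7, round(w / scale)))  # integer code
            out.append(q * scale)                  # back to float
    return out
```

The output has the same length as the input, and each value differs from its original by at most half a quantization step within its group.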

## Add Kimi K2.5-style multimodal processor adapters

- Add processor kwargs adaptation for K2.5-style VLM chat processors
- Add placeholder expansion and response token sanitization for K2.5 vision tokens
- Add multimodal train_inputs remapping for K2.5 pixel_values/grid_thws

---

# ♻️ Refactor

## Refactor weight update broadcast into bucketed pipeline

- Extract param-info bucketing, GPU loading, PP/EP broadcast into composable functions
- Add quantized-weight broadcast phase with metadata encoding for INT4 triplets
- Consolidate DCS device_direct backend to reuse BridgeConverter
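
Param-info bucketing can be pictured as greedy packing under a byte cap. The sketch below is a generic illustration and not the refactored Relax code; the function name, the `(name, nbytes)` input shape, and the rule that an oversized parameter gets its own bucket are all assumptions.

```python
def bucket_params(param_sizes, bucket_bytes):
    """Greedily pack (name, nbytes) pairs into ordered buckets under a byte cap."""
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in param_sizes:
        # Flush the open bucket if this parameter would overflow it.
        if current and current_bytes + nbytes > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

Preserving input order keeps the bucket layout deterministic, which matters when every rank has to agree on what each broadcast contains.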

---

# ✅ Tests

- Add test_broadcast_quantized for INT4 weight broadcast round-trip
- Add test_processing_utils for K2.5 processor adapter functions
- Update test_dcs_weight_conversion and test_state_machine for new APIs

(cherry picked from commit ec24de0)
# 🐛 Bug Fix

## Make overlap grad/param sync setup idempotent in train()

- Relax invokes `train()` once per rollout (upstream Megatron calls it once per run); re-assigning `config.no_sync_func` / `config.param_sync_func` after rollout 0 trips the "no_sync_func must be None" assertion.
- Guard the sync-func wiring so it only runs when the slot is still `None` — works for both `--overlap-grad-reduce` and `--overlap-param-gather --align-param-gather`.
- Leave forward pre-hooks enabled on exit; disabling them here would empty `DDP.remove_forward_pre_hook_handles` and the next `train()` would `KeyError` on the second `disable_forward_pre_hook` call.
- Drop the now-dead `pre_hook_enabled` flag.
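
The guard amounts to "assign only when the slot is still `None`". A minimal sketch, with a `SimpleNamespace` standing in for the Megatron config object; the helper name `wire_sync_funcs` is hypothetical.

```python
from types import SimpleNamespace


def wire_sync_funcs(config, no_sync_func, param_sync_func=None):
    """Assign each hook only if its slot is empty, so repeated calls are safe."""
    if config.no_sync_func is None:
        config.no_sync_func = no_sync_func
    if param_sync_func is not None and config.param_sync_func is None:
        config.param_sync_func = param_sync_func
    return config


# Stand-in config: both slots start empty, as before rollout 0.
config = SimpleNamespace(no_sync_func=None, param_sync_func=None)
```

Calling `wire_sync_funcs` again on a later rollout leaves the first assignment in place instead of tripping an assertion that expects the slot to be `None`.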

---

# 📝 Documentation

## Document distributed-optimizer and overlap flags

- Add `--use-distributed-optimizer`, `--overlap-grad-reduce`, `--overlap-param-gather` to optimizer tables (EN + ZH).
- Add compatibility matrix covering text dense, dense VL (CP=1 vs CP>1), and MoE.

(cherry picked from commit 8425002)
(cherry picked from commit f1e8764)
# 🐛 Bug Fix

## Avoid CUDA IPC across nodes in hybrid weight sync

- In hybrid mode actor and rollout sit on separate placement groups, so
  rollout `engine_gpu_offsets` are local to the rollout pg and start at 0.
- The previous numeric `gpu_offset < total_actor_gpus` check mis-classified
  cross-node engines as colocated and routed weights through CUDA IPC
  handles, which are not valid across nodes
  (`cudaErrorMapBufferObjectFailed` in `_rebuild_cuda_tensor`).
- Short-circuit `colocate_engine_nums = 0` when `args.hybrid` so all engines
  go through the distributed (NCCL broadcast) path.
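
The misclassification and the short-circuit can be sketched as follows. The function is a simplified stand-in for the real logic; only `engine_gpu_offsets`, `total_actor_gpus`, and the hybrid flag are taken from the description above.

```python
def count_colocated_engines(engine_gpu_offsets, total_actor_gpus, hybrid):
    """Number of rollout engines that may use the CUDA IPC (same-GPU) path."""
    if hybrid:
        # Offsets are local to the rollout placement group, so comparing them
        # with the actor GPU count is meaningless; use the distributed path.
        return 0
    return sum(1 for offset in engine_gpu_offsets if offset < total_actor_gpus)
```

With offsets `[0, 4]` and 8 actor GPUs, the numeric check alone would report both engines as colocated even when they live on another node; the hybrid branch returns 0 instead.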

(cherry picked from commit 387d76d)
# 📝 Documentation

## Add Kimi K2.6 to supported models tables

- Add Kimi K2.6 row (256B-A16B MoE, Vision+Language, INT4 QAT) to README.md
- Add Kimi K2.6 row to README_zh.md
- Add Kimi K2.6 to Vision modality column in docs/en/guide/introduction.md
- Add Kimi K2.6 to Vision modality column in docs/zh/guide/introduction.md
- Fix untranslated "**vision**" header in zh introduction table → "**视觉**"

---

# ⭐ Feature

## Add AI Coding Skills section to README

- Add "🛠️ AI Coding Skills" section before Citation in README.md and README_zh.md
- Lists all 11 skills (code-review, debug-hang, dev, doc-writer, git-commit,
  model-integration, perf-doctor, redaccel-to-relax, ssh-ray-cluster,
  verl-to-relax, creating-skills) with one-line descriptions

(cherry picked from commit bc96e1c)
# 🐛 Bug Fix

## Backport SGLang PR #24244 to docker patch

Cherry-pick of upstream sgl-project/sglang#24244 ("size mamba mappings
from req pool, not mamba pool"), manually ported onto the
`update-transformers-v5` branch used by our docker image. Upstream
patch context did not apply cleanly because that branch dropped the
`mamba_layer_ids` kwarg from `_init_mamba_pool`; the semantic fix is
identical.

- Rename `_init_mamba_pool(size=...)` to `_init_mamba_pool(mamba_size=...)`
  to remove the parameter-name ambiguity that caused the bug
- Size `req_index_to_mamba_index_mapping` and the ping-pong track
  buffer from `self.req_to_token.shape[0]` (req pool size) instead of
  from the mamba pool size — indices into these tensors are
  `req_pool_idx`, not mamba slot ids
- Update both call sites in `HybridReqToTokenPool.__init__` and
  `HybridMambaDecodeReqToTokenPool.__init__`

Fixes `torch.AcceleratorError: CUDA error: an illegal memory access
was encountered` at `HybridReqToTokenPool.alloc -> req_index_to_mamba_index_mapping[select_index] = ...`
when running hybrid attention + linear-state models under SGLang where
`max_mamba_cache_size < max_running_requests` (easy to hit with
`--sglang-mem-fraction-static 0.7` on tight GPU memory).
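
The sizing bug is easiest to see with a plain list standing in for the mapping tensor. This sketch is not SGLang code; the two helpers are hypothetical, and a Python `IndexError` plays the role of the CUDA illegal memory access.

```python
def build_req_to_mamba_mapping(num_slots):
    """Mapping indexed by req_pool_idx; -1 means no mamba slot assigned."""
    return [-1] * num_slots


def assign_mamba_slot(mapping, req_pool_idx, mamba_slot):
    """Record which mamba slot a request uses."""
    mapping[req_pool_idx] = mamba_slot


req_pool_size, mamba_pool_size = 8, 4

# Correct: one entry per request slot, regardless of the mamba pool size.
good = build_req_to_mamba_mapping(req_pool_size)
assign_mamba_slot(good, req_pool_idx=6, mamba_slot=2)

# Buggy sizing: indices up to req_pool_size - 1 overflow a mamba-sized mapping.
bad = build_req_to_mamba_mapping(mamba_pool_size)
try:
    assign_mamba_slot(bad, req_pool_idx=6, mamba_slot=2)
    overflowed = False
except IndexError:
    overflowed = True
```

The failure only appears when the request pool is larger than the mamba pool, which matches the `max_mamba_cache_size < max_running_requests` condition noted above.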

Note: regenerating the patch via `git diff` also moved
`base_processor.py` from the end of the file to its alphabetical
position under `multimodal/processors/`, and added `@@` function
context labels. No semantic change; `git apply` is order-agnostic.

(cherry picked from commit 4354003)
@Yangruipis merged commit d529a5b into main on May 29, 2026
2 of 5 checks passed
@Yangruipis deleted the sync/from-gitlab branch on May 29, 2026 09:43

5 participants