Bump sglang from 0.4.6.post5 to 0.4.10.post2#3
Closed
dependabot[bot] wants to merge 499 commits into
Closed
Conversation
- Add allgather method to dataproto - Add tests - Replace existing raw allgather with this function
This PR solves these 2 following problems. 1. Last step skipped `self.global_steps += 1` before if `self.global_steps >= self.total_training_steps` makes the last step skipped. We start from step 1, and we expect `self.total_training_steps` in total. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L999-L1001 When `self.global_steps == self.total_training_steps-1`: * we have only executed `self.total_training_steps-1` steps * `self.global_steps` is updated to `self.total_training_steps` * `self.global_steps >= self.total_training_steps` is satisfied, and the training ends. Therefore, we should put `self.global_steps += 1` at last 2. redundant validation and logging If `self.total_training_steps % self.config.trainer.test_freq == 0` : * `self._validate()` will be executed twice 1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L984 2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1005 * logging will also be executed twice 1. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L985 and https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L997 2. https://github.com/volcengine/verl/blob/82b38e25c72e1b6de7d7d2092af6e1ed5dd2a400/verl/trainer/ppo/ray_trainer.py#L1007
…llelism with multiple bug fixed (#495) This PR combines multiple modifications. # QWen2.5 checkpoint saver bug fix Thanks for the efforts @uygnef contributed to #368 , we use the new saver for model loader and saver for 3D parallelism support. # Megatron backend 3D-parallelism test benches We modify the scripts in `examples/ppo_trainer` and `tests/e2e`, as well as the CI workflows, all tested. # Bug Fix for 3D-parallelism Including configuration bugs as well as the module packing. Original TP VocabParallelEntropy can lead to CUDA OOM, we refactor the implementation with `torch.bmm`. # Fully migration to Megatron Core Now we only use Megatron core in verl, fully get rid of calling other components. If they are in need, please integrate them into `utils/megatron`. --------- Co-authored-by: uygnef <admin@fengyu.org>
# Background In RLHFDataset, we filter out prompts that are too long. This requires apply_chat_template to the whole dataset, which is not scalable when the dataset is large. https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L132 Instead of performing filtering online, we probably want to move this process offline and add an assertion to avoid truncation or simply perform truncation Reference: #502 # Key Changes - Add an option `data.filter_overlong_prompts=True \` to enable the above data filtering. The default value is set to False, but we enable it for all the example scripts. - Add an option `data.truncation` to truncate the input_ids or prompt length if they exceed max_prompt_length. The default is 'error', which does not allow the max_prompt_length to be exceeded. The users should increase the max_prompt_length if throwing the error. You can also set `left` and `right`. ### Suggestion for large-scale dataset. For large-scale datasets, filtering overlong prompts could be time-consuming. You should set `data.filtering_overlong_prompts=False` and set `truncation='left'`. Also, please note that you should increase `data.max_prompt_length` to avoid over-truncation of the prompts.
…469) Since searching for an appropriate `simplify` algorithm may cause `sympy.simplify` to timeout, and `ProcessPool` may get stuck due to excessive concurrency, the timeout mechanism in `verl/verl/workers/reward_manager/prime.py` cannot capture the timeout. To address this issue, a timeout detection mechanism is added to `verl/verl/utils/reward_score/prime_math/__init__.py` for `sympy.simplify` to solve it easily.
- [x] Add concurrency to workflows to cancel previous workflows when new commit is pushed to the same branch. - [ ] Cancel all workflows/jobs from the same commit if any fails? (Not sure whether we really need it) Note: we leave out `secrets_scan.yml` and `scorecard.yml` to avoid any possible leakage or security risk, which also cost little.
…g` is None (#322)
Current bugs when enable hsdp: - **Incorrect Division in Batch Sizes** - `ppo_micro_batch`, `ppo_minibatch`, etc... should be divided by `self.device_mesh.size()` instead of `self.device_mesh.shape[0]`. - **Improper Weight Initialization** in `get_init_weight_context_manager` - The `get_init_weight_context_manager` function must initialize empty weights only on local_rank == 0 within every fsdp mesh. - When `sync_module_states=True`, PyTorch's FSDP first broadcasts parameters within the fsdp process group and then within the ddp process group. If weights are not initialized correctly on `local_rank == 0` of each fsdp mesh, the synchronization process may fail or produce incorrect results. https://github.com/pytorch/pytorch/blob/3f069e7679588d5ee4b1d5b2492ca0e20f9320b5/torch/distributed/fsdp/_init_utils.py#L614-L621 - Ensure initialization occurs only when `self.device_mesh.get_coordinate()[-1] == 0`, which corresponds to `local_rank == 0 `within each fsdp mesh.
[bugfix] Fix position embedding processing for Qwen2.5-VL In the `RLHFDataset.__getitem__` method, a bug was identified in how multimodal position IDs (3D in Qwen2.5-VL) are determined. Previously, the code checked for `self.image_key in row_dict` to decide whether to use multimodal position IDs. However, since `self.image_key` is popped from `row_dict` during image token expansion, this check incorrectly fails for subsequent operations. This causes the VL model to use incorrect position IDs, resulting in significant performance degradation. <img width="349" alt="image" src="https://github.com/user-attachments/assets/79790bbf-239e-4667-a2c5-d63d91d63165" /> The fix introduces an explicit `is_multi_modal` flag to properly track multimodal content throughout the processing pipeline. Co-authored-by: songyifan <songyifan3@xiaomi.com>
Refactor and merge PRIME algorithm into verl/main https://github.com/PRIME-RL/PRIME Breaking changes: `trainer.fsdp_config.min_num_params` is now moved to `trainer.fsdp_config.wrap_policy.min_num_params`.
1. add [PRIME](https://arxiv.org/abs/2502.01456) to README.md 2. slightly change the example script to align with the paper
This PR removes several unnecessary `empty_cache` to improve efficiency. Credit to @PeterSH6
Urgently update megatron core_r0.11.0 documentation.
# Description verl-project/verl#287, verl-project/verl#295. This PR introduces support for [Math-Verify](https://github.com/huggingface/Math-Verify) as a new rule-based reward scorer, significantly improving evaluation accuracy. # Key changes - Added `math-verify` to the installation dependencies. - Introduced `reward_score/math_verify.py` and updated `reward_score/__init__.py`. # Test Comparison between the existing scorer in math.py and the newly added `math_verify.py`, using Qwen2.5-Math-7B-Instruct: ``` # Use scorer in math.py (original) {'val/test_score/DigitalLearningGmbH/MATH-lighteval': 0.803} # Use scorer in math_verify.py (newly added) {'val/test_score/DigitalLearningGmbH/MATH-lighteval': 0.8338} ``` Test scripts: ```bash set -x # Data Process python examples/data_preprocess/math_dataset.py --local_dir /workspace/datasets/math # Evaluation export CUDA_VISIBLE_DEVICES=4,5,6,7 export VLLM_ATTENTION_BACKEND=XFORMERS math_train_path=/workspace/datasets/math/train.parquet math_test_path=/workspace/datasets/math/test.parquet python3 -m verl.trainer.main_ppo \ data.train_files="$math_train_path" \ data.val_files="$math_test_path" \ data.max_prompt_length=2048 \ data.max_response_length=2048 \ actor_rollout_ref.model.path=Qwen/Qwen2.5-Math-7B-Instruct \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.name=vllm \ actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \ actor_rollout_ref.rollout.n=1 \ actor_rollout_ref.rollout.temperature=0 \ trainer.logger=['console'] \ trainer.project_name='test-math-verify' \ trainer.experiment_name='test-math-verify' \ +trainer.val_before_train=True \ trainer.n_gpus_per_node=4 \ trainer.nnodes=1 \ trainer.total_epochs=0 \ data.train_batch_size=1024 \ actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \ actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \ actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \ algorithm.adv_estimator=grpo $@ ```
… directly (#543) As we're moving to vllm>=0.7.3, we should remove `verl/third_party` complelely in the future.
Currently, eager mode is applied in the validation stage. However, in some reasoning tasks, we may need to generate n times and average the scores. In this PR, we support using non-eager sampling parameters during validation by specifying the `val_kwargs` in `actor_rollout_ref.rollout` config field. **Future work** - [ ] Merge `vllm_rollout_spmd.py` and `vllm_rollout.py` into one file.
# Description - Corrected dummy size to avoid faulty communication. - Fixed batch number calculation. - Adjusted worker group role to alleviate memory overhead. - Add ray.init() to prevent failing to register worker.
… xformers (#570) ### Description - fix filter_overlong_prompts setting in PRIME - fix padding side incorrect for Qwen in PRIME - When I utilize PRIME recipe to train Qwen series models, I got “*ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Qwen2. Make sure to call tokenizer.padding_side = 'left' before tokenizing the input.*” So I set `use_cache = False` when calling model to calculate output logits. - fix CUDA error with vllm v0.6.3 - When I run PRIME, I may get an error — *CUDA error: an illegal memory access was encountered*. According to vllm-project/vllm#10389, I set `VLLM_ATTENTION_BACKEND=XFORMERS` .
#556 take effort to remove remove unnecessary empty_cache, but will
cause CUDA oom at vllm wake_up.
```text
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/fsdp_workers.py", line 481, in generate_sequences
with self.rollout_sharding_manager:
File "/opt/tiger/ray/session_2025-03-13_12-11-30_408315_2895/runtime_resources/working_dir_files/_ray_pkg_a64b690733067c5c/verl/workers/sharding_manager/fsdp_vllm.py", line 82, in __enter__
self.inference_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/entrypoints/llm.py", line 1244, in wake_up
self.llm_engine.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 1859, in wake_up
self.model_executor.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/executor_base.py", line 216, in wake_up
self.collective_rpc("wake_up")
File "/usr/local/lib/python3.11/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 140, in wake_up
allocator.wake_up()
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 207, in wake_up
create_and_map(handle)
File "/usr/local/lib/python3.11/dist-packages/vllm/device_allocator/cumem.py", line 75, in create_and_map
python_create_and_map(*allocation_handle)
RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
```
This PR remove all redundant `torch.cuda.empty_cache()` in FSDP worker
and only empty cache before vllm wake_up and after vllm sleep, since
vllm has its own caching memory allocator
[CuMemAllocator](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/device_allocator/cumem.py#L103).
Out of vllm scope, we should avoid empty cache to let pytorch using
caching memory to speed up memory allocations.
- [x] Cleanup FSDP worker torch.cuda.empty_cache()
- [ ] Cleanup Megatron worker torch.cuda.empty_cache()
Remove redundant broadcast in fsdp vllm postprocess since vllm output in each tp rank should be identical.
* Update README.md
* update README.md * fix some tiny bugs
…ch env worker (#148) * add 'resources_per_worker' config for easily managing cpus/gpus of each env worker
…arch-r1 experiments & similarity-based GiGPO (#159) * add search-r1 experiment (tool-calling); add similarity-based GiGPO * update script * use env_kwargs * add the results of Search-R1 experiments * add wandb of search * add copyright * fix some bug & add tool use count
* add search-r1 experiment (tool-calling); add similarity-based GiGPO * update script * use env_kwargs * add the results of Search-R1 experiments * add wandb of search * add copyright * fix some bug & add tool use count * GiGPO @ NeurIPS 2025 * GiGPO @ NeurIPS 2025 * adjust space * adjust space
* fix bugs in adjust_batch of VLM
* update readme * clarify data prepare
Co-authored-by: Markus Junttila <markus.1.junttila@nokia.com>
fix(ray): ensure Ray version < 2.50.0 to avoid incompatibility with latest release
* My initial modifications * Add gspo to verl-agent * Add an example file for gspo --------- Co-authored-by: Markus Junttila <markus.1.junttila@nokia.com>
* add qwen3 * fix some issues in qwen3-vl pr * address the issues of tensor model parallel * update scripts * update readme --------- Co-authored-by: langfeng <langfeng.cs@gmail.com> Co-authored-by: langfeng <1371441151@qq.com>
* Add recipe/hgpo (HGPO trainer, configs and run scripts) * docs: update README files for HGPO recipe * docs: update recipe/hgpo README * docs: update README
Bumps [sglang](https://github.com/sgl-project/sglang) from 0.4.6.post5 to 0.4.10.post2. - [Release notes](https://github.com/sgl-project/sglang/releases) - [Commits](sgl-project/sglang@v0.4.6.post5...v0.4.10.post2) --- updated-dependencies: - dependency-name: sglang dependency-version: 0.4.10.post2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Contributor
Author
|
OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting If you change your mind, just re-open this PR and I'll resolve any conflicts on it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bumps sglang from 0.4.6.post5 to 0.4.10.post2.
Release notes
Sourced from sglang's releases.
... (truncated)
Commits
8cd3445chore: bump v0.4.10.post2 (#8727)0e0eef0[DP] fix the compatibility issue between DP attention and `--attention-backen...cb099d2[CUDA Graph] save cuda graph memory by using next_token_logits_buffer (#8579)7a91330Save cuda graph memory for fa3 (#8567)5ce5093chore: bump sgl-kernel 0.3.0 with torch 2.8.0 (#8718)6f9baf1[Improvements] Merge health check route (#8444)a31b7a7feat: Add new moe triton for NVIDIA RTX 6000 Ada (#8547)7ed8e51[fix] Fix divide by zero error for llama4. (#8683)32f2815Do layernorm before allgather for DP attention (#8631)f7b2853[feat] support minimum token load balance in dp attention (#7379)Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting
@dependabot rebase.Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
@dependabot rebasewill rebase this PR@dependabot recreatewill recreate this PR, overwriting any edits that have been made to it@dependabot show <dependency name> ignore conditionswill show all of the ignore conditions of the specified dependency@dependabot ignore this major versionwill close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)@dependabot ignore this minor versionwill close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)@dependabot ignore this dependencywill close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)