fix(grpo,rloo): apply generation_config override in use_transformers_paged path #5888

Open
Sumu004 wants to merge 1 commit into huggingface:main from Sumu004:fix/paged-generation-config-override
@Sumu004 Sumu004 commented May 30, 2026

What does this PR do?

The use_transformers_paged generation path in both GRPOTrainer and RLOOTrainer was missing the generation_kwargs override that the regular path already applies via _override_model_generation_config.

On transformers < 5.0, values from the model's bundled generation_config (e.g. Qwen2.5 ships with temperature=0.7) silently take precedence over the training kwargs during model.generate() (see transformers#42762). The regular generation path was already fixed (after the v0.24.0 tag) by passing generation_kwargs=self.generation_kwargs to unwrap_model_for_generation. The use_transformers_paged branch still called unwrap_model_for_generation without generation_kwargs, so the same bug remained active there: rollouts collapse to near-duplicates, advantage variance goes to 0, and GRPO/RLOO training fails silently, with no error or warning.
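To make the failure mode concrete, here is a minimal sketch of why near-duplicate rollouts erase the learning signal. It assumes a group-normalized advantage of the form (r - mean) / (std + eps), computed per prompt; the function name `group_advantages`, the `eps` value, and the use of population standard deviation are illustrative assumptions, not the trainers' actual code.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of rollouts sampled for the same prompt.

    Illustrative only: eps and the population std are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Diverse rollouts: rewards differ, so advantages are non-zero.
diverse = group_advantages([0.0, 1.0, 0.0, 1.0])

# Collapsed rollouts: near-greedy sampling yields identical completions,
# hence identical rewards, hence every advantage is exactly zero.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With identical rewards every numerator is zero, so the policy gradient for that prompt vanishes even though generation itself ran without error.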

Fix: pass generation_kwargs=self.generation_kwargs in the use_transformers_paged branch of both trainers, matching the comment and pattern already on the regular path. Also adds unit tests for _override_model_generation_config (none existed before).
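For readers unfamiliar with the pattern, the override can be pictured as a context manager that temporarily copies the training kwargs onto the model's generation_config and restores the original values on exit. The sketch below is a hypothetical stand-in, not the body of _override_model_generation_config: the name `override_generation_config` and the `SimpleNamespace` model are assumptions made so the example runs without transformers installed.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override_generation_config(model, generation_kwargs=None):
    """Hypothetical stand-in: apply training kwargs to model.generation_config,
    then restore the original values when the block exits."""
    if not generation_kwargs:
        # Nothing to override: leave the config untouched.
        yield model
        return

    config = model.generation_config
    missing = object()  # sentinel for attributes that did not exist before
    saved = {key: getattr(config, key, missing) for key in generation_kwargs}
    for key, value in generation_kwargs.items():
        setattr(config, key, value)
    try:
        yield model
    finally:
        # Restore even if generation raised.
        for key, old in saved.items():
            if old is missing:
                delattr(config, key)
            else:
                setattr(config, key, old)


# Fake model whose bundled config carries temperature=0.7, as in the Qwen2.5 example.
model = SimpleNamespace(generation_config=SimpleNamespace(temperature=0.7, top_p=0.8))

with override_generation_config(model, {"temperature": 1.0}):
    inside = model.generation_config.temperature  # value seen by generation

after = model.generation_config.temperature  # value once the block exits
```

The paged branch needs the same wrapping as the regular path because generate_batch reads the same generation_config; skipping the override there leaves the bundled values in effect.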

Fixes #5783

Before submitting

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@lewtun @kashif — this touches the GRPO and RLOO generation paths. Happy to make any adjustments.


Note

Low Risk
Small, targeted parity fix in rollout generation plus tests; aligns paged path with an already-used pattern on the regular path.

Overview
The use_transformers_paged rollout path in GRPOTrainer and RLOOTrainer now passes generation_kwargs=self.generation_kwargs into unwrap_model_for_generation, matching the non-paged generation path. That applies the existing _override_model_generation_config workaround for transformers#42762 (on transformers < 5.0, bundled model generation_config can override training sampling settings during generate/generate_batch).

tests/test_model_utils.py adds TestOverrideModelGenerationConfig covering override during the context, restore afterward, no-op when kwargs are None, and no-op on transformers ≥ 5.0.
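The four cases listed above can be sketched against a toy, version-gated override. Everything here is an assumption for illustration: `needs_override`, `maybe_override`, and the explicit `transformers_version` argument are hypothetical, and the real tests in tests/test_model_utils.py may be structured differently.

```python
from contextlib import contextmanager
from types import SimpleNamespace


def needs_override(transformers_version):
    """Hypothetical gate: the workaround only applies before major version 5."""
    return int(transformers_version.split(".")[0]) < 5


@contextmanager
def maybe_override(model, generation_kwargs, transformers_version):
    """Toy override that is a no-op for empty kwargs or transformers >= 5."""
    config = model.generation_config
    active = bool(generation_kwargs) and needs_override(transformers_version)
    saved = dict(vars(config))  # snapshot of the original attributes
    if active:
        vars(config).update(generation_kwargs)
    try:
        yield model
    finally:
        vars(config).clear()
        vars(config).update(saved)


def observed_temperature(generation_kwargs, transformers_version):
    """Return (temperature inside the context, temperature after it)."""
    model = SimpleNamespace(generation_config=SimpleNamespace(temperature=0.7))
    with maybe_override(model, generation_kwargs, transformers_version):
        inside = model.generation_config.temperature
    return inside, model.generation_config.temperature


results = {
    "override_and_restore_lt5": observed_temperature({"temperature": 1.0}, "4.57.0"),
    "noop_when_kwargs_none": observed_temperature(None, "4.57.0"),
    "noop_on_ge5": observed_temperature({"temperature": 1.0}, "5.0.0"),
}
```

Returning the value both inside and after the context lets a single helper cover the override and restore cases together.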

Reviewed by Cursor Bugbot for commit f63a893.

…paged path

The regular generation path already passes generation_kwargs to
unwrap_model_for_generation (which calls _override_model_generation_config)
to work around transformers#42762, where model-specific generation_config
values (e.g. Qwen2.5's temperature=0.7) silently override training kwargs
such as temperature=1.0 during model.generate().

The use_transformers_paged path was missing this same fix: it called
unwrap_model_for_generation without generation_kwargs, leaving the model's
generation_config unoverridden before generate_batch(). This caused
near-greedy sampling in the paged path under transformers < 5.0, silently
collapsing GRPO/RLOO rollout diversity and making the advantage signal
degenerate (std(R) ≈ 0) without any error.

Fix: pass generation_kwargs=self.generation_kwargs in the
use_transformers_paged branch of both GRPOTrainer and RLOOTrainer,
consistent with the regular generation path.

Also adds unit tests for _override_model_generation_config covering:
- config override during context (transformers < 5)
- original config restored after context
- no-op when generation_kwargs=None
- no-op on transformers >= 5 (upstream fix)

Fixes: huggingface#5783
Development

Successfully merging this pull request may close these issues.

GRPOTrainer silently uses near-greedy decoding when temperature=1.0 (transformers >= 4.50 + Qwen2.5)
