fix(grpo,rloo): apply generation_config override in use_transformers_paged path by Sumu004 · Pull Request #5888 · huggingface/trl

Sumu004 · 2026-05-30T01:12:58Z

What does this PR do?

The use_transformers_paged generation path in both GRPOTrainer and RLOOTrainer was missing the generation_kwargs override that the regular path already applies via _override_model_generation_config.

On transformers < 5.0, model-specific generation_config values (e.g. Qwen2.5 ships with temperature=0.7) are silently merged on top of training kwargs during model.generate() — see transformers#42762. The regular generation path was already fixed (after the v0.24.0 tag) by passing generation_kwargs=self.generation_kwargs to unwrap_model_for_generation. The use_transformers_paged branch called unwrap_model_for_generation without generation_kwargs, leaving the same silent bug active: rollouts collapse to near-duplicates, advantage variance → 0, and GRPO/RLOO training silently fails with no error or warning.
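To make the failure mode concrete, here is a minimal sketch of a group-normalized advantage in the style of GRPO. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration; TRL's actual normalization may differ in detail. The point is only that when every rollout in a group earns the same reward, every advantage is zero and the policy gets no learning signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of rollouts (illustrative only)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Diverse rollouts: rewards differ, so advantages carry signal.
diverse = group_advantages([0.0, 1.0, 0.0, 1.0])

# Collapsed rollouts: near-duplicate completions earn identical rewards,
# so every advantage is zero and the gradient contribution vanishes.
collapsed = group_advantages([1.0, 1.0, 1.0, 1.0])

print(diverse)
print(collapsed)
```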

Fix: pass generation_kwargs=self.generation_kwargs in the use_transformers_paged branch of both trainers, matching the comment and pattern already on the regular path. Also adds unit tests for _override_model_generation_config (none existed before).
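The difference between the two call sites can be sketched with stand-ins. Everything below is hypothetical: `unwrap_for_generation` and `effective_temperature` are simplified placeholders, not TRL's `unwrap_model_for_generation` or the real `generate_batch()`, and the model is a plain namespace. The sketch assumes every overridden key already exists on the config.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def unwrap_for_generation(model, generation_kwargs=None):
    """Hypothetical stand-in: temporarily apply generation_kwargs to the model's config."""
    config = model.generation_config
    saved = {}
    for key, value in (generation_kwargs or {}).items():
        saved[key] = getattr(config, key)  # assumes the key already exists
        setattr(config, key, value)
    try:
        yield model
    finally:
        # Restore the bundled values so nothing leaks outside the context.
        for key, value in saved.items():
            setattr(config, key, value)


def effective_temperature(model):
    # Stand-in for generation: under the pre-5.0 behavior described above,
    # the model's own generation_config value is what sampling ends up using.
    return model.generation_config.temperature


model = SimpleNamespace(generation_config=SimpleNamespace(temperature=0.7))
training_kwargs = {"temperature": 1.0}

# Paged branch before the fix: no generation_kwargs, bundled value wins.
with unwrap_for_generation(model) as m:
    before_fix = effective_temperature(m)

# Paged branch after the fix: same call shape as the regular path.
with unwrap_for_generation(model, generation_kwargs=training_kwargs) as m:
    after_fix = effective_temperature(m)

print(before_fix, after_fix)
```

In this sketch the only change between the two branches is the extra keyword argument, which mirrors the shape of the fix described above.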

Fixes #5783

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. — GRPOTrainer silently uses near-greedy decoding when temperature=1.0 (transformers >= 4.50 + Qwen2.5) #5783
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@lewtun @kashif — this touches the GRPO and RLOO generation paths. Happy to make any adjustments.

Note

Low Risk
Small, targeted parity fix in rollout generation plus tests; aligns paged path with an already-used pattern on the regular path.

Overview
The use_transformers_paged rollout path in GRPOTrainer and RLOOTrainer now passes generation_kwargs=self.generation_kwargs into unwrap_model_for_generation, matching the non-paged generation path. That applies the existing _override_model_generation_config workaround for transformers#42762 (on transformers < 5.0, bundled model generation_config can override training sampling settings during generate/generate_batch).

tests/test_model_utils.py adds TestOverrideModelGenerationConfig covering override during the context, restore afterward, no-op when kwargs are None, and no-op on transformers ≥ 5.0.
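The four cases listed above could be exercised roughly as follows. This is a sketch against a hypothetical stand-in, not the contents of tests/test_model_utils.py: `override_config`, its explicit `transformers_major` parameter, and `make_model` are invented here so the example runs without transformers or TRL installed.

```python
import unittest
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def override_config(model, generation_kwargs, transformers_major):
    """Hypothetical stand-in for the override helper, with the version passed in explicitly."""
    # No-op when there is nothing to apply or when upstream already fixed the merge.
    if generation_kwargs is None or transformers_major >= 5:
        yield
        return
    config = model.generation_config
    saved = {key: getattr(config, key) for key in generation_kwargs}
    for key, value in generation_kwargs.items():
        setattr(config, key, value)
    try:
        yield
    finally:
        for key, value in saved.items():
            setattr(config, key, value)


def make_model():
    # Mimics a checkpoint that bundles temperature=0.7 in its generation_config.
    return SimpleNamespace(generation_config=SimpleNamespace(temperature=0.7))


class TestOverrideSketch(unittest.TestCase):
    def test_override_then_restore(self):
        model = make_model()
        with override_config(model, {"temperature": 1.0}, transformers_major=4):
            self.assertEqual(model.generation_config.temperature, 1.0)
        self.assertEqual(model.generation_config.temperature, 0.7)

    def test_noop_when_kwargs_none(self):
        model = make_model()
        with override_config(model, None, transformers_major=4):
            self.assertEqual(model.generation_config.temperature, 0.7)

    def test_noop_on_v5(self):
        model = make_model()
        with override_config(model, {"temperature": 1.0}, transformers_major=5):
            self.assertEqual(model.generation_config.temperature, 0.7)
```

Saved to a file, these cases can be run with `python -m unittest` or pytest.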

Reviewed by Cursor Bugbot for commit f63a893. Bugbot is set up for automated code reviews on this repo.

…paged path

The regular generation path already passes generation_kwargs to unwrap_model_for_generation (which calls _override_model_generation_config) to work around transformers#42762, where model-specific generation_config values (e.g. Qwen2.5's temperature=0.7) silently override training kwargs such as temperature=1.0 during model.generate().

The use_transformers_paged path was missing this same fix: it called unwrap_model_for_generation without generation_kwargs, leaving the model's generation_config unoverridden before generate_batch(). This caused near-greedy sampling in the paged path under transformers < 5.0, silently collapsing GRPO/RLOO rollout diversity and making the advantage signal degenerate (std(R) ≈ 0) without any error.

Fix: pass generation_kwargs=self.generation_kwargs in the use_transformers_paged branch of both GRPOTrainer and RLOOTrainer, consistent with the regular generation path.

Also adds unit tests for _override_model_generation_config covering:
- config override during context (transformers < 5)
- original config restored after context
- no-op when generation_kwargs=None
- no-op on transformers >= 5 (upstream fix)

Fixes: huggingface#5783

Sumu004 wants to merge 1 commit into huggingface:main from Sumu004:fix/paged-generation-config-override
