Removed `generate_rollout_completions` by sergiopaniego · Pull Request #5870 · huggingface/trl

sergiopaniego · 2026-05-27T16:10:45Z

What does this PR do?

Idea comes from #5841
Still WIP

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@qgallouedec

Note

Medium Risk
Removes a public experimental API and changes GRPO/OpenEnv training paths; users on rollout_func must migrate, but core trainers still expose rollout_func elsewhere and changes are mostly docs/examples plus additive FunctionGemma templating.

Overview
Removes the experimental generate_rollout_completions helper (trl/experimental/openenv/) and steers OpenEnv GRPO examples toward environment_factory only.

Docs and examples: OpenEnv overview text no longer promotes rollout_func; openenv.md adds BrowserGym install instructions, a short environment_factory vs rollout_func section (deprecation warning), and drops the long migration guide. The FunctionGemma BrowserGym notebook is rewritten to use a tool-based BrowserGymFunctionGemmaEnv with GRPOTrainer(environment_factory=...) instead of custom rollout_func / vLLM rollout code.

TRL support for FunctionGemma: Adds functiongemma.jinja and wires functiongemma_schema in add_response_schema so tool calls can be parsed during training/inference.

BrowserGym scripts/notebook: Clients use .sync(), optional 100MB WebSocket message-size patching for large observations, safer bid formatting for actions, episode-end handling without raising on done steps, default gradient_accumulation_steps=1, and an extra reward_efficiency term on the VLM script.
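For readers unfamiliar with the `environment_factory` pattern mentioned in the overview, here is a minimal sketch of a tool-based environment. It is not code from this PR: `CounterEnv` and `reward_success` are hypothetical names, and the contract it assumes is a guess at the expected shape rather than something the PR confirms. The assumed contract is a class with a `reset` method, public methods exposed as tools, and reward functions that receive the environment instances through an `environments` argument, as in the `reward_efficiency` signature in the diff.

```python
class CounterEnv:
    """Hypothetical toy environment; not part of TRL or OpenEnv."""

    def __init__(self):
        self.reward = 0.0
        self._step_count = 0
        self._value = 0

    def reset(self, **kwargs) -> str:
        # Start a fresh episode and return the initial observation text.
        self.reward = 0.0
        self._step_count = 0
        self._value = 0
        return "Counter is at 0. Reach 3."

    def increment(self, amount: int) -> str:
        """Increase the counter (assumed to be exposed to the model as a tool).

        Args:
            amount: How much to add to the counter.
        """
        self._step_count += 1
        self._value += amount
        if self._value == 3:
            self.reward = 1.0
        return f"Counter is at {self._value}."


def reward_success(completions, environments, **kwargs) -> list[float]:
    """Read the task reward stored on each environment instance."""
    return [env.reward for env in environments]


# Assumed wiring, not executed here:
# trainer = GRPOTrainer(..., reward_funcs=reward_success, environment_factory=CounterEnv)
```

Keeping the episode state on the environment object is what lets a reward function inspect it after the rollout, instead of threading it through custom `rollout_func` code.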

Reviewed by Cursor Bugbot for commit 789cc10. Bugbot is set up for automated code reviews on this repo.

HuggingFaceDocBuilderDev · 2026-05-29T14:38:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sergiopaniego · 2026-05-29T15:36:18Z

I've ported the changes from #5568 here, so we can close it once this one lands.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.


cursor · 2026-05-29T16:13:26Z


+def reward_efficiency(completions, environments, **kwargs) -> list[float]:
+    """Penalize extra tool calls beyond the first: -0.1 per extra call."""
+    return [-0.1 * max(0, env._step_count - 1) for env in environments]


Efficiency reward penalizes exploration during early training

Medium Severity

reward_efficiency penalizes extra steps unconditionally, including for failed episodes. In early training when the model hasn't learned the task and all episodes fail, GRPO's group-relative normalization will reward shorter failures over longer attempts. This creates a strong incentive to always noop immediately (reward 0.0) rather than explore multi-step solutions (reward -0.4 to -0.9 for failures), making task learning extremely difficult. The penalty likely needs to be conditioned on success (e.g., only penalize steps when env.reward > 0).
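The incentive described above can be illustrated with a small numeric sketch. It assumes the usual GRPO advantage, each reward minus the group mean, divided by the group standard deviation. The population standard deviation and the `eps` value are simplifications chosen for this example, and the reward values are made up to match the ranges quoted in the comment.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group in which every episode failed (task reward 0), so only the
# efficiency penalty differs: an immediate noop scores 0.0, longer attempts
# score -0.1 per extra step.
failed_group = [0.0, -0.4, -0.9, -0.2]
advantages = group_relative_advantages(failed_group)

for reward, advantage in zip(failed_group, advantages):
    print(f"reward={reward:+.1f}  advantage={advantage:+.2f}")
```

The immediate noop receives the largest advantage in the group and the longest attempt the smallest, even though none of the episodes solved the task.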

Additional Locations (1)

examples/scripts/openenv/browsergym.py#L299-L300
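One possible shape for the suggested fix is sketched below. It is an illustration, not the change applied in the PR. It assumes each environment exposes `_step_count` (as in the diff above) and a `reward` attribute (as the comment implies). `FakeEnv` is a hypothetical stand-in used only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class FakeEnv:
    """Hypothetical stand-in for the BrowserGym environment wrapper."""

    _step_count: int = 0
    reward: float = 0.0


def reward_efficiency(completions, environments, **kwargs) -> list[float]:
    """Penalize extra tool calls (-0.1 each), but only for successful episodes."""
    return [
        -0.1 * max(0, env._step_count - 1) if env.reward > 0 else 0.0
        for env in environments
    ]


envs = [
    FakeEnv(_step_count=5, reward=0.0),  # failed after 5 steps: no penalty
    FakeEnv(_step_count=5, reward=1.0),  # solved in 5 steps: 4 extra calls penalized
    FakeEnv(_step_count=1, reward=1.0),  # solved in 1 step: no extra calls
]
scores = reward_efficiency(completions=[None] * len(envs), environments=envs)
```

With this gating, failed episodes all receive the same efficiency score, so the penalty no longer separates short failures from long ones within a group.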


Removed generate_rollout_completions

5507e82

sergiopaniego force-pushed the remove-rollout-func-openenv branch from 09da11c to 5507e82 on May 29, 2026 13:50

sergiopaniego added 4 commits May 29, 2026 16:22

update

4d66446

update notebook

a6d5cdf

updated notebook

704cd01

udpated notebook

fb20b4d

sergiopaniego marked this pull request as ready for review May 29, 2026 14:35

Merge branch 'main' into remove-rollout-func-openenv

f644d06

cursor Bot reviewed May 29, 2026


Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated

sergiopaniego added 3 commits May 29, 2026 16:41

nit

030cf2e

cursor feedbackg

f5e219e

added changes in PR#5568

3ace556

cursor Bot reviewed May 29, 2026


Comment thread examples/scripts/openenv/browsergym.py Outdated

update based on cursor

187407b

cursor Bot reviewed May 29, 2026


Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated

Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated

cursor review

789cc10

cursor Bot reviewed May 29, 2026


sergiopaniego wants to merge 11 commits into main from remove-rollout-func-openenv
