Skip to content

Removed generate_rollout_completions#5870

Open
sergiopaniego wants to merge 11 commits into
mainfrom
remove-rollout-func-openenv
Open

Removed generate_rollout_completions#5870
sergiopaniego wants to merge 11 commits into
mainfrom
remove-rollout-func-openenv

Conversation

@sergiopaniego
Copy link
Copy Markdown
Member

@sergiopaniego sergiopaniego commented May 27, 2026

What does this PR do?

Idea comes from #5841
Still WIP

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

@qgallouedec


Note

Medium Risk
Removes a public experimental API and changes GRPO/OpenEnv training paths; users on rollout_func must migrate, but core trainers still expose rollout_func elsewhere and changes are mostly docs/examples plus additive FunctionGemma templating.

Overview
Removes the experimental generate_rollout_completions helper (trl/experimental/openenv/) and steers OpenEnv GRPO examples toward environment_factory only.

Docs and examples: OpenEnv overview text no longer promotes rollout_func; openenv.md adds BrowserGym install instructions, a short environment_factory vs rollout_func section (deprecation warning), and drops the long migration guide. The FunctionGemma BrowserGym notebook is rewritten to use a tool-based BrowserGymFunctionGemmaEnv with GRPOTrainer(environment_factory=...) instead of custom rollout_func / vLLM rollout code.

TRL support for FunctionGemma: Adds functiongemma.jinja and wires functiongemma_schema in add_response_schema so tool calls can be parsed during training/inference.

BrowserGym scripts/notebook: Clients use .sync(), optional 100MB WebSocket message-size patching for large observations, safer bid formatting for actions, episode-end handling without raising on done steps, default gradient_accumulation_steps=1, and an extra reward_efficiency term on the VLM script.

Reviewed by Cursor Bugbot for commit 789cc10. Bugbot is set up for automated code reviews on this repo. Configure here.

@sergiopaniego sergiopaniego force-pushed the remove-rollout-func-openenv branch from 09da11c to 5507e82 Compare May 29, 2026 13:50
@sergiopaniego sergiopaniego marked this pull request as ready for review May 29, 2026 14:35
Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sergiopaniego
Copy link
Copy Markdown
Member Author

i've ported the changes in #5568 here so we can just close it once this one lands

Comment thread examples/scripts/openenv/browsergym.py Outdated
Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated
Comment thread examples/notebooks/grpo_functiongemma_browsergym_openenv.ipynb Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 789cc10. Configure here.


def reward_efficiency(completions, environments, **kwargs) -> list[float]:
"""Penalize extra tool calls beyond the first: -0.1 per extra call."""
return [-0.1 * max(0, env._step_count - 1) for env in environments]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Efficiency reward penalizes exploration during early training

Medium Severity

reward_efficiency penalizes extra steps unconditionally, including for failed episodes. In early training when the model hasn't learned the task and all episodes fail, GRPO's group-relative normalization will reward shorter failures over longer attempts. This creates a strong incentive to always noop immediately (reward 0.0) rather than explore multi-step solutions (reward -0.4 to -0.9 for failures), making task learning extremely difficult. The penalty likely needs to be conditioned on success (e.g., only penalize steps when env.reward > 0).

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 789cc10. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants