Removed generate_rollout_completions #5870
I've ported the changes from #5568 here, so we can close that one once this lands.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 789cc10. Configure here.
def reward_efficiency(completions, environments, **kwargs) -> list[float]:
    """Penalize extra tool calls beyond the first: -0.1 per extra call."""
    return [-0.1 * max(0, env._step_count - 1) for env in environments]
Efficiency reward penalizes exploration during early training
Medium Severity
reward_efficiency penalizes extra steps unconditionally, including for failed episodes. In early training when the model hasn't learned the task and all episodes fail, GRPO's group-relative normalization will reward shorter failures over longer attempts. This creates a strong incentive to always noop immediately (reward 0.0) rather than explore multi-step solutions (reward -0.4 to -0.9 for failures), making task learning extremely difficult. The penalty likely needs to be conditioned on success (e.g., only penalize steps when env.reward > 0).
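A minimal sketch of the concern and of the suggested fix, under stated assumptions: `FakeEnv` is a hypothetical stand-in that carries only `_step_count` and `reward`, `reward_efficiency_on_success` is an illustrative name rather than code from this PR, and `group_advantages` is a simplified group-relative normalization, not necessarily the exact formula the trainer uses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class FakeEnv:
    """Hypothetical stand-in exposing only the two fields used below."""
    _step_count: int
    reward: float  # assumed to be > 0 only when the episode succeeded


def reward_efficiency(completions, environments, **kwargs) -> list[float]:
    """Behaviour under review: -0.1 per extra call, regardless of outcome."""
    return [-0.1 * max(0, env._step_count - 1) for env in environments]


def reward_efficiency_on_success(completions, environments, **kwargs) -> list[float]:
    """Illustrative variant: apply the step penalty only to successful episodes."""
    return [
        -0.1 * max(0, env._step_count - 1) if env.reward > 0 else 0.0
        for env in environments
    ]


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Simplified group-relative normalization: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group in which every episode failed, after 1, 5 and 10 steps.
group = [FakeEnv(1, 0.0), FakeEnv(5, 0.0), FakeEnv(10, 0.0)]
completions = [None] * len(group)  # unused by either reward function

unconditional = reward_efficiency(completions, group)
conditioned = reward_efficiency_on_success(completions, group)

adv_unconditional = group_advantages(unconditional)
adv_conditioned = group_advantages(conditioned)
```

With the unconditional penalty, the one-step failure gets the highest advantage in the group and the ten-step failure the lowest, which is the incentive described above. With the conditioned variant, an all-failed group contributes no efficiency signal at all.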


What does this PR do?
Idea comes from #5841
Still WIP
Before submitting
AI writing disclosure
We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
Who can review?
@qgallouedec
Note
Medium Risk
Removes a public experimental API and changes GRPO/OpenEnv training paths; users on rollout_func must migrate, but core trainers still expose rollout_func elsewhere and changes are mostly docs/examples plus additive FunctionGemma templating.
Overview
Removes the experimental generate_rollout_completions helper (trl/experimental/openenv/) and steers OpenEnv GRPO examples toward environment_factory only.
Docs and examples: OpenEnv overview text no longer promotes rollout_func; openenv.md adds BrowserGym install instructions, a short environment_factory vs rollout_func section (deprecation warning), and drops the long migration guide. The FunctionGemma BrowserGym notebook is rewritten to use a tool-based BrowserGymFunctionGemmaEnv with GRPOTrainer(environment_factory=...) instead of custom rollout_func / vLLM rollout code.
TRL support for FunctionGemma: Adds functiongemma.jinja and wires functiongemma_schema in add_response_schema so tool calls can be parsed during training/inference.
BrowserGym scripts/notebook: Clients use .sync(), optional 100MB WebSocket message-size patching for large observations, safer bid formatting for actions, episode-end handling without raising on done steps, default gradient_accumulation_steps=1, and an extra reward_efficiency term on the VLM script.
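To make the tool-based environment idea concrete, here is a minimal sketch under stated assumptions. `ToyClickEnv`, its `click` tool, and the bid normalization are all hypothetical and are not the PR's BrowserGymFunctionGemmaEnv. The exact contract that GRPOTrainer expects from environment_factory is not shown in this text, so the trainer wiring is left out. The sketch only illustrates two behaviours named above: counting steps in `_step_count` (the field reward_efficiency reads) and returning a message instead of raising when a tool is called after the episode has ended.

```python
class ToyClickEnv:
    """Hypothetical tool-based environment; all names are illustrative only."""

    def __init__(self, target: str = "13", max_steps: int = 10):
        self.target = target
        self.max_steps = max_steps
        self.reset()

    def reset(self) -> str:
        """Start a new episode and return the initial observation."""
        self._step_count = 0
        self.reward = 0.0
        self.done = False
        return f"Click the element with bid {self.target}."

    def click(self, bid: str) -> str:
        """Tool: click the element identified by `bid`."""
        if self.done:
            # Return a message rather than raising, so that a late tool
            # call does not abort the whole rollout.
            return "Episode has ended; no further actions are accepted."
        self._step_count += 1
        # Assumed normalization: tolerate padded or quoted ids.
        bid = str(bid).strip().strip("'\"")
        if bid == self.target:
            self.reward, self.done = 1.0, True
            return "Task solved."
        if self._step_count >= self.max_steps:
            self.done = True
            return "Step limit reached."
        return f"Element {bid} is not the target."


env = ToyClickEnv(target="13")
first = env.click(" '13' ")  # quoted, padded id is still accepted
late = env.click("13")       # call after the episode ended
```

The late call leaves `_step_count` unchanged, so an efficiency term computed from it is not inflated by actions issued after the episode is over.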