Skip to content

monitor: track TRL #5120 for future migration from verl-agent #85

@abrichr

Description

@abrichr

Context

We use verl-agent/VAGEN for multi-turn VLM GRPO training because TRL (HuggingFace) cannot handle multi-turn VLM rollouts — chat template flattening destroys multimodal data before it reaches rollout_func.

Decision doc: docs/verl_agent_decision.md (PR #84)

Upstream Dependency

When to Revisit

Check quarterly (June, September, December 2026). If any of:

  1. TRL #5120 is resolved or has a merged fix
  2. TRL's GRPOTrainer passes multi-turn VLM E2E tests
  3. TRL release notes announce multi-turn VLM GRPO support

Then:

  • Test TRL against our WAA RL environment (RLEnvironment / WAADesktopEnv)
  • Benchmark: verl-agent vs TRL on same task (wall time, VRAM, convergence)
  • If TRL matches verl-agent AND adds per-step credit assignment (GiGPO equivalent), consider switching

Why We'd Want to Switch

verl-agent is excellent but adds Ray/vLLM complexity. TRL has broader adoption and simpler deployment. Switching would reduce the dependency footprint. But only if TRL also adds per-step credit assignment — without GiGPO-equivalent step-level advantages, training on 15+ step desktop tasks is significantly less sample-efficient.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions