monitor: track TRL #5120 for future migration from verl-agent

## Context

We use verl-agent/VAGEN for multi-turn VLM GRPO training because TRL (HuggingFace) cannot handle multi-turn VLM rollouts — chat template flattening destroys multimodal data before it reaches `rollout_func`.

Decision doc: `docs/verl_agent_decision.md` (PR #84)

## Upstream Dependency

- **TRL #5120**: [Preserve structured multimodal messages through rollout and generation pipeline](https://github.com/huggingface/trl/issues/5120) (opened Feb 18, 2026, OPEN)
- **TRL #5119**: [Decouple inference backend from rollout & agent logic](https://github.com/huggingface/trl/issues/5119) (OPEN)

## When to Revisit

Check quarterly (June, September, December 2026). If any of:

1. TRL #5120 is resolved or has a merged fix
2. TRL's GRPOTrainer passes multi-turn VLM E2E tests
3. TRL release notes announce multi-turn VLM GRPO support

Then:
- Test TRL against our WAA RL environment (`RLEnvironment` / `WAADesktopEnv`)
- Benchmark: verl-agent vs TRL on same task (wall time, VRAM, convergence)
- If TRL matches verl-agent AND adds per-step credit assignment (GiGPO equivalent), consider switching

## Why We'd Want to Switch

verl-agent is excellent but adds Ray/vLLM complexity. TRL has broader adoption and simpler deployment. Switching would reduce the dependency footprint. **But only if TRL also adds per-step credit assignment** — without GiGPO-equivalent step-level advantages, training on 15+ step desktop tasks is significantly less sample-efficient.

## Related

- PR #84: verl-agent spike (`WAADesktopEnv` adapter)
- `openadapt_ml/training/grpo/trainer.py`: inline comment tracking this

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

monitor: track TRL #5120 for future migration from verl-agent #85

Context

Upstream Dependency

When to Revisit

Why We'd Want to Switch

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

monitor: track TRL #5120 for future migration from verl-agent #85

Description

Context

Upstream Dependency

When to Revisit

Why We'd Want to Switch

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions