
Agent framework Phase A0: durable GRPO training loop (checkpoint/resume + bounded concurrency) #326

Merged
lingduoduo merged 2 commits into main from feat/agent-framework-grpo-phase-a0 on Jun 23, 2026


@lingduoduo

Owner

What & why

Phase A0 of the agent-framework GRPO optimization — training durability. Independent of #324/#325 (branches off main). The GRPO stack was single-step, foreground, no-resume: a crash lost everything and a hung rollout stalled the run. This makes long runs survivable.

Spec/plan/tasks: SPEC.md · plan · tasks

Changes

  • train_loop (src/training/ppo/train_loop.py) — durable multi-step driver, trainer-agnostic (any step_async):
    • step-level asyncio.wait_for timeout + skip — a hung or failing step is logged and skipped; prior progress is intact
    • periodic checkpoint every ckpt_every and resume-from-checkpoint (continues at the saved step)
    • per-step metrics JSONL for observability
  • LLMGRPOTrainer.save_checkpoint / load_checkpoint — persist/restore policy + tokenizer + optimizer state (reference policy is frozen/reconstructable; no scheduler exists in the trainer).
  • Bounded rollout concurrency: _resolve_max_concurrent(None) -> 8, so a B×G batch of live search rollouts can't saturate the retrieval server or trip its rate limits (concurrency was unbounded by default before).
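
The train_loop behaviour listed above (step timeout + skip, periodic checkpoint, resume, metrics JSONL) can be sketched roughly as follows. This is a minimal illustration, not the code in src/training/ppo/train_loop.py: the function name `train_loop_sketch`, the `save_ckpt` / `load_ckpt` callables, and the record fields are all assumptions made for the example.

```python
import asyncio
import json
from pathlib import Path


async def train_loop_sketch(step_async, save_ckpt, load_ckpt, *, total_steps,
                            ckpt_every, step_timeout, metrics_path):
    """Run steps with a per-step timeout, skipping hung or failing steps.

    Hypothetical contract: load_ckpt() returns the last saved step (0 if there
    is no checkpoint) and save_ckpt(step) persists trainer state for that step.
    """
    start = load_ckpt()  # resume: continue right after the saved step
    skipped = []
    with Path(metrics_path).open("a", encoding="utf-8") as log:
        for step in range(start + 1, total_steps + 1):
            try:
                metrics = await asyncio.wait_for(step_async(step), timeout=step_timeout)
                record = {"step": step, "status": "ok", **metrics}
            except asyncio.TimeoutError:
                # A hung step is cancelled by wait_for, logged, and skipped.
                record = {"step": step, "status": "timeout"}
            except Exception as exc:
                # A failing step is logged and skipped; earlier progress is untouched.
                record = {"step": step, "status": "error", "error": repr(exc)}
            if record["status"] != "ok":
                skipped.append(step)
            log.write(json.dumps(record) + "\n")  # one JSON object per step
            log.flush()
            if step % ckpt_every == 0:
                save_ckpt(step)
    return skipped
```

Because the driver only awaits `step_async(step)`, any trainer exposing an async step fits, which is the sense in which the loop is trainer-agnostic. Whether the real loop checkpoints after a skipped step is not stated in the description; the sketch checkpoints on the step count alone.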

Scope note

Retrieval-level retries (T-A0.2) are deferred: they'd touch src/agents/search.py _retrieve_many, which Phase B (#325) rewrote — a guaranteed merge conflict. The step-level timeout/skip already covers the catastrophic case; retries land as a small follow-up after #325 merges.

Testing

  • TDD throughout. 6 new tests: 5 train_loop (runs N steps, periodic ckpt + jsonl, resume continues from saved step, hung step skipped, failing step skipped) + 1 concurrency-default. 53-test training slice green; ruff clean.

🤖 Generated with Claude Code

lingduoduo and others added 2 commits June 23, 2026 19:44
The GRPO stack was single-step/foreground/no-resume; this adds the durability a
long run needs:
- train_loop: durable multi-step driver — step-level asyncio.wait_for timeout +
  skip (a hung/failing step is logged and skipped, not fatal), periodic
  checkpoint, resume-from-checkpoint, per-step metrics JSONL. Trainer-agnostic.
- LLMGRPOTrainer.save_checkpoint/load_checkpoint: policy + tokenizer + optimizer.
- Bounded rollout concurrency: _resolve_max_concurrent(None) -> 8 so a B×G batch
  of live search rollouts can't saturate the retrieval server.
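
One common way to bound rollout concurrency like this is an asyncio.Semaphore around each rollout. The sketch below assumes that approach; the trainer may do it differently. `resolve_max_concurrent` is a hypothetical stand-in for `_resolve_max_concurrent`, of which only the `None -> 8` behaviour is documented here, and `run_rollouts_bounded` is an illustrative name.

```python
import asyncio

DEFAULT_MAX_CONCURRENT = 8  # the default stated in this PR


def resolve_max_concurrent(value):
    """Hypothetical stand-in: None falls back to the bounded default."""
    return DEFAULT_MAX_CONCURRENT if value is None else value


async def run_rollouts_bounded(rollout_fns, max_concurrent=None):
    """Run every rollout, keeping at most `max_concurrent` in flight at once."""
    sem = asyncio.Semaphore(resolve_max_concurrent(max_concurrent))

    async def guarded(fn):
        async with sem:  # waits here while the limit is reached
            return await fn()

    # gather preserves input order, so results line up with the B×G batch.
    return await asyncio.gather(*(guarded(fn) for fn in rollout_fns))
```

With B=4 prompts and G=8 samples per prompt, all 32 rollouts are scheduled, but the retrieval server sees at most 8 of them at a time.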

6 new tests (5 train_loop + 1 concurrency). Includes spec/plan/tasks docs.
Retrieval-level retries deferred to avoid a search.py conflict with PR #325.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lingduoduo lingduoduo merged commit ae99f4f into main Jun 23, 2026
5 of 6 checks passed
@lingduoduo lingduoduo deleted the feat/agent-framework-grpo-phase-a0 branch June 23, 2026 23:51
lingduoduo added a commit that referenced this pull request Jun 24, 2026
…unner

Wires the full action-policy stack:
  from_pretrained policy -> on-policy SearchAgentLoop rollouts -> retriever_aware()
  reward (web 5x vdb) -> train_loop (checkpoint/resume, step timeout+skip) ->
  action_eval baseline-vs-trained comparison table.

- PolicyServerManager: tiny in-process server_manager over the trainer's LIVE
  policy so rollouts are on-policy (no second model load); loop_factory is set
  after trainer construction so it closes over trainer.policy.
- Imports train_loop (PR #326) — documented dependency; runs once the stack
  merges. Lint clean, compiles, all other symbols resolve.

GRPO smoke is already proven by test_grpo_smoke_step_with_retriever_aware_reward;
this runner is the manual GPU/MPS entry point for a real convergent run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lingduoduo added a commit that referenced this pull request Jun 24, 2026
* Add SearchAgentState foundation + agent-framework GRPO spec/plan/tasks

Spec-driven groundwork for optimizing the agent framework (modular
components + GRPO action policy):
- SPEC.md, plan, and task breakdown under docs/superpowers/
- SearchAgentState: six-field search-loop state (question, previous_queries,
  retrieved_docs, evidence_score, search_rounds, citations) + Retriever enum
  and Citation, added alongside the existing orchestration AgentState
- 9 unit tests (dedup, round counting, rerank, evidence clamp)
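
A rough sketch of what a six-field state of this shape could look like. The field names come from the commit message; everything else (the `Citation` fields, the method names `record_round` and `set_evidence`, and the choice to dedup queries rather than documents) is assumed for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Retriever(Enum):
    VDB = "vdb"
    WEB = "web"


@dataclass
class Citation:
    marker: str  # hypothetical field, e.g. "[R1Q1D2]"
    doc_id: str  # hypothetical field


@dataclass
class SearchAgentState:
    question: str
    previous_queries: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    evidence_score: float = 0.0
    search_rounds: int = 0
    citations: list = field(default_factory=list)

    def record_round(self, query, docs):
        """Count one search round; remember each distinct query only once."""
        self.search_rounds += 1
        if query not in self.previous_queries:
            self.previous_queries.append(query)
        self.retrieved_docs.extend(docs)

    def set_evidence(self, score):
        """Clamp the evidence score into [0, 1]."""
        self.evidence_score = min(1.0, max(0.0, score))
```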

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Extract EvidenceJudge, AnswerGenerator, SearchTool, RerankerTool (T-A.2, T-A.3)

Behavior-preserving, additive component modules under src/agents/components/,
each unit-tested in isolation with injected deps:
- EvidenceJudge: wraps SearchResultEvaluator -> continuous evidence_score in [0,1]
  (blends query sufficiency with squashed top scores; monotonic in quality)
- AnswerGenerator: resolves [RxQyDz] markers to structured Citations via AgentContext
- SearchTool: single-retriever wrapper that records the round into SearchAgentState
- RerankerTool: reorders retrieved_docs via an injected rerank fn (no round counted)

Also align SearchAgentState.retrieved_docs to the loop's native SearchResult type
(lazy TYPE_CHECKING annotation; no runtime import cycle). 13 component tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Record T-A.4 wiring decision in task plan (recommend deferring into Phase B)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Mark Checkpoint 1 reached; revise PR slicing after T-A.4 deferral

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add spec copy under docs/superpowers/specs/ (per-PR spec+plan convention)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

- SearchTool gains a second backend: run(state, query, retriever) selects web vs
  vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured.
  Backward compatible with the Phase A single-arg constructor.
- Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed
  SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer;
  malformed input degrades to a safe vector-DB search. 7 new tests (23 total).
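
The tag grammar and precedence described above can be illustrated with a small regex-based parser. This is a sketch under assumptions, not the Planner itself: the action field names, the regexes, and the choice to reuse the raw text as the query for malformed input are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchAction:
    query: str
    retriever: str = "vdb"


@dataclass(frozen=True)
class RerankAction:
    pass


@dataclass(frozen=True)
class AnswerAction:
    text: str


_SEARCH = re.compile(r"<search\b(?P<attrs>[^>]*)>(?P<query>.*?)</search>", re.DOTALL)
_RETRIEVER = re.compile(r'retriever\s*=\s*"(web|vdb)"')
_RERANK = re.compile(r"<rerank\s*/>")
_ANSWER = re.compile(r"<answer>(?P<text>.*?)</answer>", re.DOTALL)


def parse_action(text):
    """Parse model output into a typed action; precedence search > rerank > answer."""
    m = _SEARCH.search(text)
    if m:
        r = _RETRIEVER.search(m.group("attrs"))
        # A missing or unrecognised retriever attribute falls back to vdb.
        return SearchAction(m.group("query").strip(), r.group(1) if r else "vdb")
    if _RERANK.search(text):
        return RerankAction()
    m = _ANSWER.search(text)
    if m:
        return AnswerAction(m.group("text").strip())
    # Malformed input: degrade to a vector-DB search (query choice is an assumption).
    return SearchAction(text.strip(), "vdb")
```

For example, `parse_action('<answer>a</answer><search>q</search>')` yields a vdb `SearchAction` for `q`, because a search tag wins over an answer tag wherever it appears in the text.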

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

The policy can now target a second backend per search round via
<search retriever="web">. Surgical, behavior-preserving (vdb is the default,
identical to before):
- SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
- action regex tolerates tag attributes; _parse_round_retriever reads the choice
- retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many;
  cache keyed by (retriever, query); both clients closed on exit
- system instruction teaches the retriever attribute
- 2 loop tests (web routing + degradation); on_turn fake updated for new signature
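
To see why the cache key includes the retriever, consider the following sketch. It is not the code in SearchAgentLoop: the class name `RoutedRetrievalCache` and the backend callables are hypothetical, and the real loop logs the degradation, which is omitted here.

```python
import asyncio


class RoutedRetrievalCache:
    """Route a query to web or vdb and cache results per (retriever, query)."""

    def __init__(self, vdb_search, web_search=None):
        # web_search=None models an unconfigured web backend.
        self._backends = {"vdb": vdb_search, "web": web_search}
        self._cache = {}

    async def retrieve(self, retriever, query):
        if self._backends.get(retriever) is None:
            retriever = "vdb"  # degrade to the vector DB
        key = (retriever, query)  # same query on another backend is a different entry
        if key not in self._cache:
            self._cache[key] = await self._backends[retriever](query)
        return self._cache[key]
```

Keying by query alone would let a vector-DB result be served for a later web request with the same text, which would hide the routing choice the policy is supposed to learn.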

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update task plan: Phase B routing done; defer rerank action to Phase C

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C: retriever-aware reward terms + action metrics + preset (T-C.1/C.2/C.3)

Make GRPO able to learn the action policy (search web vs vector-DB, search
budget, evidence informativeness):
- SearchRewardConfig: retriever_cost_vdb/retriever_cost_web (web 5x vdb in the
  preset), rerank_cost (flat), evidence_gain_weight (Δevidence_score/round). All
  default 0 -> existing presets byte-stable; new breakdown components added.
- Loop surfaces web_searches/vdb_searches/evidence_score_final/evidence_gain_total
  into output.metrics; per-round continuous score reuses EvidenceJudge.score_round
  (wires the component into the loop). rerank_calls reserved (0 until rerank action).
- SearchRewardConfig.retriever_aware() preset wires the new weights on top of
  second_pass; existing presets untouched.

12 new tests; 235-test slice green; ruff clean. GRPO --smoke run is a manual
integration step (needs a real model); rerank-as-action + eval remain (Phase D).
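
As a rough illustration of how these terms might price a rollout: the config field names and metric keys below follow the commit message, but the class name `ActionCostConfig`, the sign convention (costs subtract, evidence gain adds), and the example weights are assumptions. Only the 5x web-to-vdb ratio and the all-zero defaults come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionCostConfig:
    # All default to 0, so a config that sets none of them adds nothing.
    retriever_cost_vdb: float = 0.0
    retriever_cost_web: float = 0.0
    rerank_cost: float = 0.0
    evidence_gain_weight: float = 0.0


def action_reward_terms(cfg, metrics):
    """Return the action-related reward components for one rollout."""
    retriever_cost = -(cfg.retriever_cost_vdb * metrics.get("vdb_searches", 0)
                       + cfg.retriever_cost_web * metrics.get("web_searches", 0))
    rerank_cost = -cfg.rerank_cost * metrics.get("rerank_calls", 0)  # flat per call
    evidence_gain = cfg.evidence_gain_weight * metrics.get("evidence_gain_total", 0.0)
    return {"retriever_cost": retriever_cost,
            "rerank_cost": rerank_cost,
            "evidence_gain": evidence_gain}


# Illustrative weights only: web is priced at 5x vdb, as in the preset.
example_cfg = ActionCostConfig(retriever_cost_vdb=0.01, retriever_cost_web=0.05,
                               rerank_cost=0.02, evidence_gain_weight=0.5)
```

Under these example weights, a rollout with two vdb searches and one web search pays 2 × 0.01 + 1 × 0.05 = 0.07 in retriever cost, so a single web search costs more than both vdb searches combined.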

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Agent framework Phase B: web-vs-vector-DB retriever action (Planner + live loop) (#325)

* Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

- SearchTool gains a second backend: run(state, query, retriever) selects web vs
  vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured.
  Backward compatible with the Phase A single-arg constructor.
- Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed
  SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer;
  malformed input degrades to a safe vector-DB search. 7 new tests (23 total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

The policy can now target a second backend per search round via
<search retriever="web">. Surgical, behavior-preserving (vdb is the default,
identical to before):
- SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
- action regex tolerates tag attributes; _parse_round_retriever reads the choice
- retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many;
  cache keyed by (retriever, query); both clients closed on exit
- system instruction teaches the retriever attribute
- 2 loop tests (web routing + degradation); on_turn fake updated for new signature

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update task plan: Phase B routing done; defer rerank action to Phase C

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C/D: reranker as a priced policy action (per-search rerank flag)

Completes the action space the reward already prices:
- <search rerank="true"> reranks that round's results BEFORE they are labeled, so
  the [RxQyDz] citations the model sees match the reranked order (sidesteps the
  citation-shift problem of a standalone mid-loop <rerank/>).
- Reranker injected via loop._reranker (callable (query,docs)->docs); None = logged
  no-op (degrade). Counts rerank_calls -> consumed by rerank_cost.
- System instruction teaches the rerank attribute. 2 loop tests.
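
The ordering point (rerank first, label second) can be shown in a few lines. This sketch assumes that x, y, z in [RxQyDz] are 1-based round, query, and document indices, and that a missing reranker does not count as a rerank call; neither detail is confirmed by the commit message, and `label_round` is a hypothetical helper name.

```python
def label_round(round_idx, query_idx, query, docs, reranker=None, rerank=False):
    """Optionally rerank one round's docs, then attach [RxQyDz] labels."""
    rerank_calls = 0
    if rerank and reranker is not None:
        docs = reranker(query, docs)  # callable (query, docs) -> docs
        rerank_calls = 1
    # With reranker=None the request is a no-op and the original order is kept.
    labeled = [(f"[R{round_idx}Q{query_idx}D{i}]", doc)
               for i, doc in enumerate(docs, start=1)]
    return labeled, rerank_calls
```

Because labels are assigned after reranking, D1 always refers to the top document the model actually sees, so citations never point at a pre-rerank position.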

Also: T-A0.2 retrieval retries already covered by SearchClient's exponential
backoff (max_retries=3) — no redundant code. Eval (T-D.1) needs a real trained
checkpoint (manual); action-mix metrics are already on output.metrics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase D scaffold: action-eval aggregation + baseline-vs-trained comparison

src/training/eval/action_eval.py aggregates the action-mix metrics the loop
already emits (web/vdb searches, search_rounds, rerank rate, evidence) and
encodes the spec's headline success criterion — fewer search rounds AND
correctness preserved — with a readable comparison table. 6 unit tests.

Ready to consume real eval rollouts; the actual baseline-vs-trained numbers
still need a converged GRPO checkpoint (manual training run).
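
A sketch of the aggregation and the headline check, under stated assumptions: the metric keys mirror those the loop is said to emit, but the `correct` key, the function names, and the `tolerance` parameter are hypothetical and not taken from action_eval.py.

```python
from statistics import mean


def aggregate(rollouts):
    """Average per-rollout action metrics over an eval set (list of dicts)."""
    return {
        "search_rounds": mean(r["search_rounds"] for r in rollouts),
        "web_searches": mean(r.get("web_searches", 0) for r in rollouts),
        "vdb_searches": mean(r.get("vdb_searches", 0) for r in rollouts),
        # Fraction of rollouts that reranked at least once.
        "rerank_rate": mean(1.0 if r.get("rerank_calls", 0) > 0 else 0.0 for r in rollouts),
        "correct": mean(1.0 if r["correct"] else 0.0 for r in rollouts),
    }


def meets_success_criterion(baseline, trained, tolerance=0.0):
    """Fewer search rounds AND correctness preserved (within `tolerance`)."""
    fewer_rounds = trained["search_rounds"] < baseline["search_rounds"]
    correctness_kept = trained["correct"] >= baseline["correct"] - tolerance
    return fewer_rounds and correctness_kept
```

Both conditions must hold: a policy that searches less but answers worse fails the check, as does one that keeps accuracy without reducing rounds.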

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add GRPO smoke step test exercising retriever_aware reward + action metrics

Runs a real SearchAgentGRPOTrainer.step() (rollout -> reward -> advantage ->
loss -> optimizer) on the toy LM with SearchRewardConfig.retriever_aware() over
rollouts that emit web/vdb/rerank metrics. Asserts the step is finite (no NaN)
and that retriever_cost/rerank_cost/evidence_gain actually price the rollout —
proving the Phase B/C action space flows end-to-end through GRPO training.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add examples/run_retriever_aware_grpo.py — one-command train + eval runner

Wires the full action-policy stack:
  from_pretrained policy -> on-policy SearchAgentLoop rollouts -> retriever_aware()
  reward (web 5x vdb) -> train_loop (checkpoint/resume, step timeout+skip) ->
  action_eval baseline-vs-trained comparison table.

- PolicyServerManager: tiny in-process server_manager over the trainer's LIVE
  policy so rollouts are on-policy (no second model load); loop_factory is set
  after trainer construction so it closes over trainer.policy.
- Imports train_loop (PR #326) — documented dependency; runs once the stack
  merges. Lint clean, compiles, all other symbols resolve.

GRPO smoke is already proven by test_grpo_smoke_step_with_retriever_aware_reward;
this runner is the manual GPU/MPS entry point for a real convergent run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>