Agent framework Phase A0: durable GRPO training loop (checkpoint/resume + bounded concurrency) by lingduoduo · Pull Request #326 · lingduoduo/Agentic-Search-GRPO

lingduoduo · 2026-06-23T23:44:38Z

What & why

Phase A0 of the agent-framework GRPO optimization — training durability. Independent of #324/#325 (branches off main). The GRPO stack was single-step, foreground, no-resume: a crash lost everything and a hung rollout stalled the run. This makes long runs survivable.

Spec/plan/tasks: SPEC.md · plan · tasks

Changes

train_loop (src/training/ppo/train_loop.py) — durable multi-step driver, trainer-agnostic (any step_async):
- step-level asyncio.wait_for timeout + skip — a hung or failing step is logged and skipped; prior progress is intact
- periodic checkpoint every ckpt_every and resume-from-checkpoint (continues at the saved step)
- per-step metrics JSONL for observability
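The three behaviors above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/training/ppo/train_loop.py: the signature, the state.json and metrics.jsonl file names, and the demo_step helper are all hypothetical, and the checkpoint here stores only a step counter rather than model state.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def train_loop(step_async, num_steps, ckpt_dir, ckpt_every=2, step_timeout=1.0):
    """Run step_async(step) per step; skip hung or failing steps; checkpoint periodically."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    state_path = ckpt_dir / "state.json"        # hypothetical checkpoint file
    metrics_path = ckpt_dir / "metrics.jsonl"   # hypothetical metrics file

    # Resume: continue at the saved step if a checkpoint exists.
    start = 0
    if state_path.exists():
        start = json.loads(state_path.read_text())["next_step"]

    with metrics_path.open("a") as metrics_file:
        for step in range(start, num_steps):
            record = {"step": step, "status": "ok"}
            try:
                # Step-level timeout: a hung step is cancelled instead of stalling the run.
                metrics = await asyncio.wait_for(step_async(step), timeout=step_timeout)
                record.update(metrics)
            except asyncio.TimeoutError:
                record["status"] = "timeout"
            except Exception as exc:  # a failing step is logged and skipped, not fatal
                record["status"] = "error"
                record["error"] = repr(exc)
            metrics_file.write(json.dumps(record) + "\n")
            metrics_file.flush()

            if (step + 1) % ckpt_every == 0:
                # Write to a temp file, then rename, so a crash never leaves a partial file.
                tmp_path = ckpt_dir / "state.json.tmp"
                tmp_path.write_text(json.dumps({"next_step": step + 1}))
                tmp_path.replace(state_path)
    return start


async def demo_step(step):
    """Hypothetical trainer step: step 1 hangs, step 2 fails, the rest succeed."""
    if step == 1:
        await asyncio.sleep(10)
    if step == 2:
        raise RuntimeError("simulated failure")
    return {"loss": 1.0 / (step + 1)}


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        asyncio.run(train_loop(demo_step, 3, tmp, ckpt_every=2, step_timeout=0.05))
        # A second call simulates a restart: it picks up at the last checkpointed step.
        resumed_from = asyncio.run(train_loop(demo_step, 5, tmp, ckpt_every=2, step_timeout=0.05))
        lines = (Path(tmp) / "metrics.jsonl").read_text().splitlines()
        statuses = [json.loads(line)["status"] for line in lines]
    return resumed_from, statuses
```

In this sketch the second call resumes at step 2, because the last checkpoint was written after step 1 and the failed step 2 never reached a checkpoint boundary. Whether the real driver re-runs a skipped step after resume is not stated in the PR.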
LLMGRPOTrainer.save_checkpoint / load_checkpoint: persist and restore policy, tokenizer, and optimizer state. The reference policy is not saved because it is frozen and reconstructable, and the trainer has no scheduler to save.
Bounded rollout concurrency: _resolve_max_concurrent(None) now resolves to 8 (it was unbounded by default), so a B×G batch of live search rollouts can't saturate the retrieval server or trip its rate limits.
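One common way to bound a B×G batch of rollouts is an asyncio.Semaphore. The sketch below assumes that approach; the PR does not say how the trainer enforces the limit. resolve_max_concurrent is a hypothetical stand-in for _resolve_max_concurrent, and run_rollouts, fake_rollout, and demo are illustrative names only.

```python
import asyncio

DEFAULT_MAX_CONCURRENT = 8  # the default the PR describes for None


def resolve_max_concurrent(value):
    """Hypothetical stand-in for _resolve_max_concurrent: None falls back to 8."""
    if value is None:
        return DEFAULT_MAX_CONCURRENT
    if value < 1:
        raise ValueError("max_concurrent must be >= 1")
    return value


async def run_rollouts(rollout, prompts, group_size, max_concurrent=None):
    """Run len(prompts) * group_size rollouts with at most `limit` in flight."""
    limit = resolve_max_concurrent(max_concurrent)
    semaphore = asyncio.Semaphore(limit)

    async def bounded(prompt, sample_idx):
        # Only `limit` coroutines can hold the semaphore at once.
        async with semaphore:
            return await rollout(prompt, sample_idx)

    tasks = [bounded(p, g) for p in prompts for g in range(group_size)]
    return await asyncio.gather(*tasks)


async def demo(batch_size=4, group_size=5):
    """Run 20 fake rollouts and record the peak number in flight."""
    in_flight = 0
    peak = 0

    async def fake_rollout(prompt, sample_idx):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a live retrieval call
        in_flight -= 1
        return (prompt, sample_idx)

    prompts = [f"q{i}" for i in range(batch_size)]
    results = await run_rollouts(fake_rollout, prompts, group_size)
    return len(results), peak
```

Running asyncio.run(demo()) completes all 20 rollouts while the recorded peak never exceeds 8, which is the property the new default is meant to guarantee.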

Scope note

Retrieval-level retries (T-A0.2) are deferred: they'd touch src/agents/search.py _retrieve_many, which Phase B (#325) rewrote — a guaranteed merge conflict. The step-level timeout/skip already covers the catastrophic case; retries land as a small follow-up after #325 merges.

Testing

TDD throughout. 6 new tests: 5 train_loop (runs N steps, periodic ckpt + jsonl, resume continues from saved step, hung step skipped, failing step skipped) + 1 concurrency-default. 53-test training slice green; ruff clean.

🤖 Generated with Claude Code

The GRPO stack was single-step/foreground/no-resume; this adds the durability a long run needs:
- train_loop: durable multi-step driver with step-level asyncio.wait_for timeout + skip (a hung/failing step is logged and skipped, not fatal), periodic checkpoint, resume-from-checkpoint, per-step metrics JSONL. Trainer-agnostic.
- LLMGRPOTrainer.save_checkpoint/load_checkpoint: policy + tokenizer + optimizer.
- Bounded rollout concurrency: _resolve_max_concurrent(None) -> 8 so a B×G batch of live search rollouts can't saturate the retrieval server.

6 new tests (5 train_loop + 1 concurrency). Includes spec/plan/tasks docs. Retrieval-level retries deferred to avoid a search.py conflict with PR #325.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


* Add SearchAgentState foundation + agent-framework GRPO spec/plan/tasks

  Spec-driven groundwork for optimizing the agent framework (modular components + GRPO action policy):
  - SPEC.md, plan, and task breakdown under docs/superpowers/
  - SearchAgentState: six-field search-loop state (question, previous_queries, retrieved_docs, evidence_score, search_rounds, citations) + Retriever enum and Citation, added alongside the existing orchestration AgentState
  - 9 unit tests (dedup, round counting, rerank, evidence clamp)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Extract EvidenceJudge, AnswerGenerator, SearchTool, RerankerTool (T-A.2, T-A.3)

  Behavior-preserving, additive component modules under src/agents/components/, each unit-tested in isolation with injected deps:
  - EvidenceJudge: wraps SearchResultEvaluator -> continuous evidence_score in [0,1] (blends query sufficiency with squashed top scores; monotonic in quality)
  - AnswerGenerator: resolves [RxQyDz] markers to structured Citations via AgentContext
  - SearchTool: single-retriever wrapper that records the round into SearchAgentState
  - RerankerTool: reorders retrieved_docs via an injected rerank fn (no round counted)

  Also align SearchAgentState.retrieved_docs to the loop's native SearchResult type (lazy TYPE_CHECKING annotation; no runtime import cycle). 13 component tests.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Record T-A.4 wiring decision in task plan (recommend deferring into Phase B)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Mark Checkpoint 1 reached; revise PR slicing after T-A.4 deferral

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add spec copy under docs/superpowers/specs/ (per-PR spec+plan convention)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

  - SearchTool gains a second backend: run(state, query, retriever) selects web vs vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured. Backward compatible with the Phase A single-arg constructor.
  - Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer; malformed input degrades to a safe vector-DB search.

  7 new tests (23 total).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

  The policy can now target a second backend per search round via <search retriever="web">. Surgical, behavior-preserving (vdb is the default, identical to before):
  - SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
  - action regex tolerates tag attributes; _parse_round_retriever reads the choice
  - retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many; cache keyed by (retriever, query); both clients closed on exit
  - system instruction teaches the retriever attribute
  - 2 loop tests (web routing + degradation); on_turn fake updated for new signature

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update task plan: Phase B routing done; defer rerank action to Phase C

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C: retriever-aware reward terms + action metrics + preset (T-C.1/C.2/C.3)

  Make GRPO able to learn the action policy (search web vs vector-DB, search budget, evidence informativeness):
  - SearchRewardConfig: retriever_cost_vdb/retriever_cost_web (web 5x vdb in the preset), rerank_cost (flat), evidence_gain_weight (Δevidence_score/round). All default 0 -> existing presets byte-stable; new breakdown components added.
  - Loop surfaces web_searches/vdb_searches/evidence_score_final/evidence_gain_total into output.metrics; per-round continuous score reuses EvidenceJudge.score_round (wires the component into the loop). rerank_calls reserved (0 until rerank action).
  - SearchRewardConfig.retriever_aware() preset wires the new weights on top of second_pass; existing presets untouched.

  12 new tests; 235-test slice green; ruff clean. GRPO --smoke run is a manual integration step (needs a real model); rerank-as-action + eval remain (Phase D).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Agent framework Phase B: web-vs-vector-DB retriever action (Planner + live loop) (#325)

  * Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

    - SearchTool gains a second backend: run(state, query, retriever) selects web vs vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured. Backward compatible with the Phase A single-arg constructor.
    - Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer; malformed input degrades to a safe vector-DB search.

    7 new tests (23 total).

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  * Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

    The policy can now target a second backend per search round via <search retriever="web">. Surgical, behavior-preserving (vdb is the default, identical to before):
    - SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
    - action regex tolerates tag attributes; _parse_round_retriever reads the choice
    - retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many; cache keyed by (retriever, query); both clients closed on exit
    - system instruction teaches the retriever attribute
    - 2 loop tests (web routing + degradation); on_turn fake updated for new signature

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  * Update task plan: Phase B routing done; defer rerank action to Phase C

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C/D: reranker as a priced policy action (per-search rerank flag)

  Completes the action space the reward already prices:
  - <search rerank="true"> reranks that round's results BEFORE they are labeled, so the [RxQyDz] citations the model sees match the reranked order (sidesteps the citation-shift problem of a standalone mid-loop <rerank/>).
  - Reranker injected via loop._reranker (callable (query,docs)->docs); None = logged no-op (degrade). Counts rerank_calls -> consumed by rerank_cost.
  - System instruction teaches the rerank attribute. 2 loop tests.

  Also: T-A0.2 retrieval retries already covered by SearchClient's exponential backoff (max_retries=3), so no redundant code. Eval (T-D.1) needs a real trained checkpoint (manual); action-mix metrics are already on output.metrics.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase D scaffold: action-eval aggregation + baseline-vs-trained comparison

  src/training/eval/action_eval.py aggregates the action-mix metrics the loop already emits (web/vdb searches, search_rounds, rerank rate, evidence) and encodes the spec's headline success criterion (fewer search rounds AND correctness preserved) with a readable comparison table. 6 unit tests. Ready to consume real eval rollouts; the actual baseline-vs-trained numbers still need a converged GRPO checkpoint (manual training run).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add GRPO smoke step test exercising retriever_aware reward + action metrics

  Runs a real SearchAgentGRPOTrainer.step() (rollout -> reward -> advantage -> loss -> optimizer) on the toy LM with SearchRewardConfig.retriever_aware() over rollouts that emit web/vdb/rerank metrics. Asserts the step is finite (no NaN) and that retriever_cost/rerank_cost/evidence_gain actually price the rollout, proving the Phase B/C action space flows end-to-end through GRPO training.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add examples/run_retriever_aware_grpo.py: one-command train + eval runner

  Wires the full action-policy stack: from_pretrained policy -> on-policy SearchAgentLoop rollouts -> retriever_aware() reward (web 5x vdb) -> train_loop (checkpoint/resume, step timeout+skip) -> action_eval baseline-vs-trained comparison table.
  - PolicyServerManager: tiny in-process server_manager over the trainer's LIVE policy so rollouts are on-policy (no second model load); loop_factory is set after trainer construction so it closes over trainer.policy.
  - Imports train_loop (PR #326), a documented dependency; runs once the stack merges.

  Lint clean, compiles, all other symbols resolve. GRPO smoke is already proven by test_grpo_smoke_step_with_retriever_aware_reward; this runner is the manual GPU/MPS entry point for a real convergent run.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
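The Planner behavior described in the Phase B commit (typed actions, precedence search>rerank>answer, malformed input degrading to a vector-DB search) might look roughly like the sketch below. This is an illustration only, not the repository's Planner: the regexes, the parse_action name, the dataclass fields, and the choice to use the raw text as the fallback query are all assumptions.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Retriever(Enum):
    VDB = "vdb"
    WEB = "web"


@dataclass(frozen=True)
class SearchAction:
    query: str
    retriever: Retriever = Retriever.VDB


@dataclass(frozen=True)
class RerankAction:
    pass


@dataclass(frozen=True)
class AnswerAction:
    text: str


Action = Union[SearchAction, RerankAction, AnswerAction]

# Hypothetical patterns; the search tag tolerates attributes such as retriever="web".
_SEARCH_RE = re.compile(r"<search\b([^>]*)>(.*?)</search>", re.DOTALL)
_RETRIEVER_RE = re.compile(r'retriever\s*=\s*"(\w+)"')
_RERANK_RE = re.compile(r"<rerank\s*/>")
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_action(text: str) -> Action:
    """Parse model output into one action, with precedence search > rerank > answer."""
    search = _SEARCH_RE.search(text)
    if search and search.group(2).strip():
        attr = _RETRIEVER_RE.search(search.group(1))
        # Anything other than an explicit "web" falls back to the vector DB.
        is_web = attr is not None and attr.group(1) == "web"
        return SearchAction(search.group(2).strip(), Retriever.WEB if is_web else Retriever.VDB)
    if _RERANK_RE.search(text):
        return RerankAction()
    answer = _ANSWER_RE.search(text)
    if answer:
        return AnswerAction(answer.group(1).strip())
    # Malformed input: degrade to a vector-DB search (raw text as query is an assumption).
    return SearchAction(text.strip(), Retriever.VDB)
```

For example, parse_action('<search retriever="web">grpo paper</search>') yields a web SearchAction, while text with no recognizable tag becomes a vector-DB SearchAction over the raw text.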


lingduoduo and others added 2 commits June 23, 2026 19:44

Merge main into Phase A0 branch

c303487

lingduoduo merged commit ae99f4f into main Jun 23, 2026
5 of 6 checks passed

lingduoduo deleted the feat/agent-framework-grpo-phase-a0 branch June 23, 2026 23:51

lingduoduo merged 2 commits into main from feat/agent-framework-grpo-phase-a0
