Agent framework Phase A0: durable GRPO training loop (checkpoint/resume + bounded concurrency) #326
Merged
The GRPO stack was single-step/foreground/no-resume; this adds the durability a long run needs:
- train_loop: durable multi-step driver with step-level asyncio.wait_for timeout + skip (a hung/failing step is logged and skipped, not fatal), periodic checkpoint, resume-from-checkpoint, per-step metrics JSONL. Trainer-agnostic.
- LLMGRPOTrainer.save_checkpoint/load_checkpoint: policy + tokenizer + optimizer.
- Bounded rollout concurrency: _resolve_max_concurrent(None) -> 8 so a B×G batch of live search rollouts can't saturate the retrieval server.

6 new tests (5 train_loop + 1 concurrency). Includes spec/plan/tasks docs. Retrieval-level retries deferred to avoid a search.py conflict with PR #325.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
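The bounded rollout concurrency described in this commit message can be illustrated with a minimal sketch. It assumes the cap is enforced with an asyncio.Semaphore; the PR only states that _resolve_max_concurrent(None) returns 8. The names resolve_max_concurrent and run_bounded, and the handling of values below 1, are hypothetical and not taken from the repository.

```python
import asyncio

DEFAULT_MAX_CONCURRENT = 8  # the default the PR names for a None setting


def resolve_max_concurrent(value):
    """Hypothetical stand-in for _resolve_max_concurrent: None -> 8."""
    if value is None:
        return DEFAULT_MAX_CONCURRENT
    if value < 1:
        # Assumption: non-positive limits are rejected; the PR does not say.
        raise ValueError("max_concurrent must be >= 1")
    return value


async def run_bounded(rollout_fns, max_concurrent=None):
    """Run zero-argument async rollouts with at most N in flight at once."""
    sem = asyncio.Semaphore(resolve_max_concurrent(max_concurrent))

    async def guarded(fn):
        async with sem:  # waits here while N rollouts are already running
            return await fn()

    # gather preserves input order, so results line up with rollout_fns.
    return await asyncio.gather(*(guarded(fn) for fn in rollout_fns))
```

With a B×G batch of 32 rollouts and no explicit limit, this sketch lets only 8 hit the retrieval server at any moment; the remaining 24 wait on the semaphore instead of being launched immediately.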
lingduoduo
added a commit
that referenced
this pull request
Jun 24, 2026
…unner Wires the full action-policy stack: from_pretrained policy -> on-policy SearchAgentLoop rollouts -> retriever_aware() reward (web 5x vdb) -> train_loop (checkpoint/resume, step timeout+skip) -> action_eval baseline-vs-trained comparison table. - PolicyServerManager: tiny in-process server_manager over the trainer's LIVE policy so rollouts are on-policy (no second model load); loop_factory is set after trainer construction so it closes over trainer.policy. - Imports train_loop (PR #326) — documented dependency; runs once the stack merges. Lint clean, compiles, all other symbols resolve. GRPO smoke is already proven by test_grpo_smoke_step_with_retriever_aware_reward; this runner is the manual GPU/MPS entry point for a real convergent run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lingduoduo
added a commit
that referenced
this pull request
Jun 24, 2026
* Add SearchAgentState foundation + agent-framework GRPO spec/plan/tasks

  Spec-driven groundwork for optimizing the agent framework (modular components + GRPO action policy):
  - SPEC.md, plan, and task breakdown under docs/superpowers/
  - SearchAgentState: six-field search-loop state (question, previous_queries, retrieved_docs, evidence_score, search_rounds, citations) + Retriever enum and Citation, added alongside the existing orchestration AgentState
  - 9 unit tests (dedup, round counting, rerank, evidence clamp)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Extract EvidenceJudge, AnswerGenerator, SearchTool, RerankerTool (T-A.2, T-A.3)

  Behavior-preserving, additive component modules under src/agents/components/, each unit-tested in isolation with injected deps:
  - EvidenceJudge: wraps SearchResultEvaluator -> continuous evidence_score in [0,1] (blends query sufficiency with squashed top scores; monotonic in quality)
  - AnswerGenerator: resolves [RxQyDz] markers to structured Citations via AgentContext
  - SearchTool: single-retriever wrapper that records the round into SearchAgentState
  - RerankerTool: reorders retrieved_docs via an injected rerank fn (no round counted)

  Also align SearchAgentState.retrieved_docs to the loop's native SearchResult type (lazy TYPE_CHECKING annotation; no runtime import cycle). 13 component tests.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Record T-A.4 wiring decision in task plan (recommend deferring into Phase B)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Mark Checkpoint 1 reached; revise PR slicing after T-A.4 deferral

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add spec copy under docs/superpowers/specs/ (per-PR spec+plan convention)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

  - SearchTool gains a second backend: run(state, query, retriever) selects web vs vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured. Backward compatible with the Phase A single-arg constructor.
  - Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer; malformed input degrades to a safe vector-DB search.

  7 new tests (23 total).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

  The policy can now target a second backend per search round via <search retriever="web">. Surgical, behavior-preserving (vdb is the default, identical to before):
  - SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
  - action regex tolerates tag attributes; _parse_round_retriever reads the choice
  - retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many; cache keyed by (retriever, query); both clients closed on exit
  - system instruction teaches the retriever attribute
  - 2 loop tests (web routing + degradation); on_turn fake updated for new signature

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Update task plan: Phase B routing done; defer rerank action to Phase C

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C: retriever-aware reward terms + action metrics + preset (T-C.1/C.2/C.3)

  Make GRPO able to learn the action policy (search web vs vector-DB, search budget, evidence informativeness):
  - SearchRewardConfig: retriever_cost_vdb/retriever_cost_web (web 5x vdb in the preset), rerank_cost (flat), evidence_gain_weight (Δevidence_score/round). All default 0 -> existing presets byte-stable; new breakdown components added.
  - Loop surfaces web_searches/vdb_searches/evidence_score_final/evidence_gain_total into output.metrics; per-round continuous score reuses EvidenceJudge.score_round (wires the component into the loop). rerank_calls reserved (0 until rerank action).
  - SearchRewardConfig.retriever_aware() preset wires the new weights on top of second_pass; existing presets untouched.

  12 new tests; 235-test slice green; ruff clean. GRPO --smoke run is a manual integration step (needs a real model); rerank-as-action + eval remain (Phase D).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Agent framework Phase B: web-vs-vector-DB retriever action (Planner + live loop) (#325)

  * Phase B: Planner tag parser + SearchTool web/vdb routing (T-B.1, T-B.2)

    - SearchTool gains a second backend: run(state, query, retriever) selects web vs vector-DB; a WEB request degrades to vector-DB (logged) when web is unconfigured. Backward compatible with the Phase A single-arg constructor.
    - Planner: parses <search retriever="web|vdb">, <rerank/>, <answer> into typed SearchAction/RerankAction/AnswerAction. Precedence search>rerank>answer; malformed input degrades to a safe vector-DB search.

    7 new tests (23 total).

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  * Phase B: wire per-round web/vdb retriever routing into SearchAgentLoop

    The policy can now target a second backend per search round via <search retriever="web">. Surgical, behavior-preserving (vdb is the default, identical to before):
    - SearchAgentLoopConfig.web_search_url + a second SearchClient (None -> degrade)
    - action regex tolerates tag attributes; _parse_round_retriever reads the choice
    - retriever threaded through _execute_search_round/_retrieve_with_cache/_retrieve_many; cache keyed by (retriever, query); both clients closed on exit
    - system instruction teaches the retriever attribute
    - 2 loop tests (web routing + degradation); on_turn fake updated for new signature

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  * Update task plan: Phase B routing done; defer rerank action to Phase C

    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

  ---------

  Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase C/D: reranker as a priced policy action (per-search rerank flag)

  Completes the action space the reward already prices:
  - <search rerank="true"> reranks that round's results BEFORE they are labeled, so the [RxQyDz] citations the model sees match the reranked order (sidesteps the citation-shift problem of a standalone mid-loop <rerank/>).
  - Reranker injected via loop._reranker (callable (query,docs)->docs); None = logged no-op (degrade). Counts rerank_calls -> consumed by rerank_cost.
  - System instruction teaches the rerank attribute. 2 loop tests.

  Also: T-A0.2 retrieval retries already covered by SearchClient's exponential backoff (max_retries=3), so no redundant code. Eval (T-D.1) needs a real trained checkpoint (manual); action-mix metrics are already on output.metrics.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Phase D scaffold: action-eval aggregation + baseline-vs-trained comparison

  src/training/eval/action_eval.py aggregates the action-mix metrics the loop already emits (web/vdb searches, search_rounds, rerank rate, evidence) and encodes the spec's headline success criterion (fewer search rounds AND correctness preserved) with a readable comparison table. 6 unit tests. Ready to consume real eval rollouts; the actual baseline-vs-trained numbers still need a converged GRPO checkpoint (manual training run).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add GRPO smoke step test exercising retriever_aware reward + action metrics

  Runs a real SearchAgentGRPOTrainer.step() (rollout -> reward -> advantage -> loss -> optimizer) on the toy LM with SearchRewardConfig.retriever_aware() over rollouts that emit web/vdb/rerank metrics. Asserts the step is finite (no NaN) and that retriever_cost/rerank_cost/evidence_gain actually price the rollout, proving the Phase B/C action space flows end-to-end through GRPO training.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Add examples/run_retriever_aware_grpo.py: one-command train + eval runner

  Wires the full action-policy stack: from_pretrained policy -> on-policy SearchAgentLoop rollouts -> retriever_aware() reward (web 5x vdb) -> train_loop (checkpoint/resume, step timeout+skip) -> action_eval baseline-vs-trained comparison table.
  - PolicyServerManager: tiny in-process server_manager over the trainer's LIVE policy so rollouts are on-policy (no second model load); loop_factory is set after trainer construction so it closes over trainer.policy.
  - Imports train_loop (PR #326), a documented dependency; runs once the stack merges.

  Lint clean, compiles, all other symbols resolve. GRPO smoke is already proven by test_grpo_smoke_step_with_retriever_aware_reward; this runner is the manual GPU/MPS entry point for a real convergent run.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What & why
Phase A0 of the agent-framework GRPO optimization: training durability. Independent of #324/#325 (branches off main). The GRPO stack was single-step, foreground, no-resume: a crash lost everything and a hung rollout stalled the run. This makes long runs survivable.

Spec/plan/tasks: SPEC.md · plan · tasks

Changes
- train_loop (src/training/ppo/train_loop.py): durable multi-step driver, trainer-agnostic (any step_async):
  - asyncio.wait_for timeout + skip: a hung or failing step is logged and skipped; prior progress is intact
  - ckpt_every and resume-from-checkpoint (continues at the saved step)
- LLMGRPOTrainer.save_checkpoint / load_checkpoint: persist/restore policy + tokenizer + optimizer state (reference policy is frozen/reconstructable; no scheduler exists in the trainer).
- _resolve_max_concurrent(None) -> 8, so a B×G batch of live search rollouts can't saturate/rate-limit the retrieval server (was unbounded by default).

Scope note
Retrieval-level retries (T-A0.2) are deferred: they'd touch src/agents/search.py _retrieve_many, which Phase B (#325) rewrote, so a merge conflict is guaranteed. The step-level timeout/skip already covers the catastrophic case; retries land as a small follow-up after #325 merges.

Testing
train_loop (runs N steps, periodic ckpt + jsonl, resume continues from saved step, hung step skipped, failing step skipped) + 1 concurrency-default. 53-test training slice green; ruff clean.

🤖 Generated with Claude Code
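For readers who want a concrete picture of the driver behavior the tests cover (timeout + skip, periodic checkpoint, JSONL metrics, resume), here is a minimal sketch. It is not the code in src/training/ppo/train_loop.py: the signature, the file names loop_state.json and metrics.jsonl, and the ToyTrainer class are all assumptions made for illustration. Only asyncio.wait_for, step_async, ckpt_every, and save_checkpoint/load_checkpoint come from the PR text.

```python
import asyncio
import json
from pathlib import Path


async def train_loop(trainer, num_steps, out_dir, step_timeout=5.0, ckpt_every=2):
    """Sketch of a durable driver; returns the list of skipped step indices."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    state_file = out / "loop_state.json"  # hypothetical file name
    metrics_file = out / "metrics.jsonl"  # hypothetical file name

    start = 0
    if state_file.exists():  # resume: continue at the saved step
        start = json.loads(state_file.read_text())["step"]
        trainer.load_checkpoint(out / "ckpt")

    skipped = []
    for step in range(start, num_steps):
        try:
            metrics = await asyncio.wait_for(trainer.step_async(), timeout=step_timeout)
            record = {"step": step, "status": "ok", **metrics}
        except asyncio.TimeoutError:  # hung step: cancelled by wait_for, then skipped
            record = {"step": step, "status": "timeout"}
            skipped.append(step)
        except Exception as exc:  # failing step: logged and skipped, not fatal
            record = {"step": step, "status": "error", "error": repr(exc)}
            skipped.append(step)

        with metrics_file.open("a") as fh:  # one JSON object per step
            fh.write(json.dumps(record) + "\n")

        if (step + 1) % ckpt_every == 0:  # periodic checkpoint
            trainer.save_checkpoint(out / "ckpt")
            state_file.write_text(json.dumps({"step": step + 1}))
    return skipped


class ToyTrainer:
    """Hypothetical trainer exposing only the interface the sketch relies on."""

    def __init__(self, hang_on=(), fail_on=()):
        self.calls = 0
        self.loaded = False
        self.hang_on = set(hang_on)
        self.fail_on = set(fail_on)

    async def step_async(self):
        i = self.calls
        self.calls += 1
        if i in self.hang_on:
            await asyncio.sleep(3600)  # simulates a hung rollout
        if i in self.fail_on:
            raise RuntimeError("simulated step failure")
        return {"loss": 0.0}

    def save_checkpoint(self, path):
        Path(path).mkdir(parents=True, exist_ok=True)

    def load_checkpoint(self, path):
        self.loaded = True
```

In this sketch the resume point is the last checkpointed step, so steps completed after the most recent checkpoint are re-run after a crash. Whether the real driver makes the same trade-off is not stated in the PR.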