Proposal: surgical regeneration in retry chain
Problem
When a contract validation fails on part of a structured output (e.g. 3/15 array items violate a rule), the current retry_policy { escalate(...) } regenerates the entire output. Two consequences:
- Cost/latency — full prompt+completion to fix a few bad items
- Structural bias persists — same prompt + same model converges to the same bias on retry. Escalating to a stronger model partially helps but is expensive and slow.
Real-world data point (pdf_to_quiz, ADR-016, 2026-04-28):
- ~11 Sentry events / 48h: "Quiz delivered after contract retry exhaustion"
- Two consecutive nano@low attempts converge to ~5.5/15 mean imbalance: the bias is structural, not sampling
- mini@low fallback reduces to ~2.7/15, so most quizzes still ship with ≥1 imbalanced question
- Validated experiment: regenerating only failing items with nano@low and a different prompt → 3/3 fixed, latency 14.8s vs ~60s for mini@low, cost ~$0.0003
Shape of the idea
A new attempt mode inside the existing retry_policy { escalate(...) }, operating on a slice of the previous output instead of regenerating the whole thing:
retry_policy do
  escalate(
    { model: "gpt-5-nano", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    {
      model: "gpt-5-nano", reasoning_effort: "low",
      mode: :surgical,
      when: ->(result) { result.failures.size <= 4 },
      target: ->(result) { result.failures.map(&:path) },
      preserve: %i[correct_answer_index correct_answer_text],
      prompt: ->(slice, invariants) { ... }
    }
  )
end
This is illustrative, not a final API. Concrete keyword names, whether mode: :surgical or a separate top-level method, etc. — all open. The DSL is the easy part.
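To make the gating concrete, here is a minimal sketch of how a retry chain might evaluate the when: and target: lambdas before running a surgical attempt. Everything in it is hypothetical: the Failure and Result structs, the surgical_targets helper, and the choice to skip the attempt when there are no failures are assumptions for illustration, not existing gem API.

```ruby
# Hypothetical stand-ins for the gem's result objects.
Failure = Struct.new(:path, :message, keyword_init: true)
Result = Struct.new(:failures, keyword_init: true)

# A surgical entry as it might appear inside escalate(...).
surgical_attempt = {
  mode: :surgical,
  when: ->(result) { result.failures.size <= 4 },
  target: ->(result) { result.failures.map(&:path) }
}

# Returns the paths to regenerate, or nil when the attempt should be skipped
# (not a surgical entry, nothing failed, or the when: guard rejects the result).
def surgical_targets(attempt, result)
  return nil unless attempt[:mode] == :surgical
  return nil if result.failures.empty?
  return nil unless attempt[:when].call(result)

  attempt[:target].call(result)
end

few = Result.new(failures: [2, 7, 11].map { |i|
  Failure.new(path: [:questions, i, :options], message: "imbalanced")
})
many = Result.new(failures: (0..4).map { |i|
  Failure.new(path: [:questions, i, :options], message: "imbalanced")
})

few_targets = surgical_targets(surgical_attempt, few)
many_targets = surgical_targets(surgical_attempt, many)
```

With three failures the guard passes and few_targets holds the three paths. With five failures the guard rejects the result and many_targets is nil, so the chain would fall through to whatever comes next.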
Hard requirements (must be solved before DSL has meaning)
1. Validators must report failure paths
Today validators return boolean/message. Surgical needs result.failures with structured path (e.g. [:questions, 3, :options]) so a target lambda has something to point at. This is a change in the validator API, not a new keyword. Backward-compatible default could be "whole output" path, but it makes surgical attempts effectively no-ops for legacy validators.
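A rough sketch of what a path-reporting validator could look like. The Failure struct, the imbalanced_option_failures rule (including its 2x length threshold), and the wrap_legacy helper are all assumptions made for illustration. The actual imbalance rule from ADR-016 is not reproduced here.

```ruby
# Hypothetical failure shape: a structured path plus a human-readable message.
Failure = Struct.new(:path, :message, keyword_init: true)

# Illustrative validator: flags questions whose longest option is more than
# twice as long as the shortest one. The threshold is arbitrary.
def imbalanced_option_failures(output)
  output.fetch(:questions).each_with_index.filter_map do |question, index|
    lengths = question.fetch(:options).map(&:length)
    next if lengths.max <= 2 * lengths.min

    Failure.new(path: [:questions, index, :options],
                message: "option lengths are imbalanced")
  end
end

# Backward-compatible wrapper for a legacy boolean/message validator.
# An empty path stands for "whole output", which gives a surgical
# attempt nothing narrower than the full output to target.
def wrap_legacy(valid, message)
  valid ? [] : [Failure.new(path: [], message: message)]
end

output = {
  questions: [
    { options: ["Paris", "Rome", "Oslo", "Bern"] },
    { options: ["Yes", "No, because the premise is false in every case"] }
  ]
}
failures = imbalanced_option_failures(output)
```

Here only the second question fails, so failures.map(&:path) yields a single path that a target lambda can hand to the surgical attempt.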
2. Splice & merge
Surgical output is a slice. Gem must merge it back into the previous attempt's output by path. Simplest case: arrays-of-objects keyed by index (the quiz case). General case (arbitrary JSON path, nested arrays, dict keys) is an order of magnitude more work and probably out of scope for v1.
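For the simple case, a merge by path can be quite small. The sketch below assumes JSON-like data (hashes, arrays, scalars) and a hypothetical splice helper that takes a map from path to regenerated value. It copies the previous output so a later rollback still has the original, and it refuses paths that do not already exist instead of silently extending arrays.

```ruby
# Deep-copies JSON-like data so the previous attempt's output stays untouched.
def deep_copy(value)
  Marshal.load(Marshal.dump(value))
end

# Returns a new output with each replacement written at its path.
# `replacements` maps a path (array of hash keys / array indexes) to a value.
def splice(previous_output, replacements)
  merged = deep_copy(previous_output)
  replacements.each do |path, value|
    raise ArgumentError, "empty path" if path.empty?

    *parents, last = path
    container = parents.empty? ? merged : merged.dig(*parents)
    raise KeyError, "path not found: #{path.inspect}" if container.nil?

    # Only overwrite slots that already exist; never grow the structure.
    exists = container.is_a?(Array) ? (0...container.size).cover?(last) : container.key?(last)
    raise KeyError, "path not found: #{path.inspect}" unless exists

    container[last] = value
  end
  merged
end

previous = {
  questions: [
    { text: "Q1", options: %w[a b] },
    { text: "Q2", options: %w[c d] }
  ]
}
merged = splice(previous, { [:questions, 1, :options] => %w[x y] })
```

After the call, merged has the new options for the second question while previous is unchanged. Nested arrays and arbitrary JSON paths would need more care than this helper takes.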
3. Preservation invariants as post-merge guard
preserve: is effectively an auto-validator that compares pre/post values at named paths after the merge. On violation → rollback to previous attempt's output (soft-deliver, no regression vs current behavior). This needs to happen after merge but before counting the surgical attempt as success.
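A sketch of the guard, under one assumption the proposal leaves open: that the preserve: keys are read relative to each regenerated item (for example [:questions, 3]) rather than as absolute paths. The guard_preserved helper and its return shape are hypothetical.

```ruby
# Compares preserved fields before and after the merge.
# Returns the merged output when nothing changed, otherwise rolls back
# to the previous output and reports which paths were violated.
def guard_preserved(previous, merged, item_paths:, preserve:)
  violations = item_paths.flat_map do |item_path|
    preserve
      .reject { |field| previous.dig(*item_path, field) == merged.dig(*item_path, field) }
      .map { |field| item_path + [field] }
  end

  if violations.empty?
    { output: merged, rolled_back: false, violations: [] }
  else
    { output: previous, rolled_back: true, violations: violations }
  end
end

previous = { questions: [{ options: %w[a b], correct_answer_index: 1, correct_answer_text: "b" }] }
good = { questions: [{ options: %w[a2 b], correct_answer_index: 1, correct_answer_text: "b" }] }
bad = { questions: [{ options: %w[a2 b], correct_answer_index: 0, correct_answer_text: "b" }] }

fields = %i[correct_answer_index correct_answer_text]
accepted = guard_preserved(previous, good, item_paths: [[:questions, 0]], preserve: fields)
rejected = guard_preserved(previous, bad, item_paths: [[:questions, 0]], preserve: fields)
```

The accepted case returns the merged output. The rejected case returns the previous output with rolled_back set, which is the point where the chain would decline to count the surgical attempt as a success.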
Other open questions
- Trace shape. Surgical attempt is conceptually one entry in result.trace[:attempts] but its input is a slice and its output is a slice. Does the trace record the slice, the post-merge full output, or both?
- around_call semantics. Hook fires once per run() with post-retry Result (existing invariant). Surgical attempts should not change this: they're still part of the retry chain, not separate calls.
- Eval/optimizer interaction. compare_models and optimize currently treat each escalate entry as a candidate. A mode: :surgical entry has different cost characteristics (smaller prompt, smaller output) and shouldn't be benchmarked the same way as a full attempt.
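For the trace question, one possible answer is to record both views. The sketch below is only an illustration of that option. The surgical_trace_entry helper and every key in the hash are hypothetical, and nothing here reflects the current shape of result.trace[:attempts].

```ruby
# Builds a hypothetical trace entry for one surgical attempt,
# keeping both the raw slice and the post-merge output.
def surgical_trace_entry(model:, target_paths:, slice_output:, merged_output:, rolled_back:)
  {
    mode: :surgical,
    model: model,
    target_paths: target_paths,
    slice_output: slice_output,   # what the model actually returned
    merged_output: merged_output, # what the caller would receive if accepted
    rolled_back: rolled_back      # true when a preserve: check rejected the merge
  }
end

entry = surgical_trace_entry(
  model: "gpt-5-nano",
  target_paths: [[:questions, 3, :options]],
  slice_output: { [:questions, 3, :options] => %w[w x y z] },
  merged_output: { questions: [] }, # placeholder for the full merged output
  rolled_back: false
)
```

Recording both makes the entry larger but keeps per-attempt cost accounting (based on the slice) separate from what downstream consumers see (the merged output).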
Why file this now
Not asking for implementation. Logging the pattern because:
- It came from real production data (not speculation)
- The cost/latency win was experimentally validated on prod logs
- "Structural bias resistant to same-model retry" is a class of problem the retry-chain abstraction currently doesn't have a clean answer for
- The hard requirements (especially failure paths in validators) touch core API and are worth thinking about before they accumulate technical debt
Reference: ADR-016 in pdf_to_quiz (doc/adr/016-surgical-regen-imbalanced-options.md).