Proposal: surgical regeneration as a third retry strategy #31

@justi

Proposal: surgical regeneration in retry chain

Problem

When a contract validation fails on part of a structured output (e.g. 3/15 array items violate a rule), the current retry_policy { escalate(...) } regenerates the entire output. Two consequences:

  1. Cost/latency: a full prompt and completion are spent to fix a few bad items.
  2. Structural bias persists: the same prompt with the same model converges to the same bias on retry. Escalating to a stronger model partially helps but is expensive and slow.

Real-world data point (pdf_to_quiz, ADR-016, 2026-04-28):

  • ~11 Sentry events / 48h: "Quiz delivered after contract retry exhaustion"
  • Two consecutive nano@low attempts converge to ~5.5/15 mean imbalance — bias is structural, not sampling
  • mini@low fallback reduces to ~2.7/15 — most quizzes still ship with ≥1 imbalanced question
  • Validated experiment: regenerating only failing items with nano@low and a different prompt → 3/3 fixed, latency 14.8s vs ~60s for mini@low, cost ~$0.0003

Shape of the idea

A new attempt mode inside the existing retry_policy { escalate(...) }, operating on a slice of the previous output instead of regenerating the whole thing:

retry_policy do
  escalate(
    { model: "gpt-5-nano", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    {
      model: "gpt-5-nano", reasoning_effort: "low",
      mode: :surgical,
      when: ->(result) { result.failures.size <= 4 },
      target: ->(result) { result.failures.map(&:path) },
      preserve: %i[correct_answer_index correct_answer_text],
      prompt: ->(slice, invariants) { ... }
    }
  )
end

This is illustrative, not a final API. The concrete keyword names, and whether this is mode: :surgical or a separate top-level method, are all open. The DSL is the easy part.

Hard requirements (must be solved before the DSL has meaning)

1. Validators must report failure paths

Today validators return a boolean or a message. Surgical mode needs result.failures, where each failure carries a structured path (e.g. [:questions, 3, :options]), so that a target lambda has something to point at. This is a change to the validator API, not a new keyword. A backward-compatible default could be a "whole output" path, but that would make surgical attempts effectively no-ops for legacy validators.
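To make the requirement concrete, here is a minimal sketch of a path-reporting validator. Everything in it is an assumption for illustration: the Failure struct, validate_option_balance, wrap_legacy, and the "longest option more than twice the shortest" rule are hypothetical names and logic, not part of the gem today.

```ruby
# Hypothetical failure shape: a structured path plus a human-readable message.
Failure = Struct.new(:path, :message, keyword_init: true)

# Hypothetical validator. The imbalance rule is made up for this sketch:
# a question fails when its longest option is more than twice its shortest.
def validate_option_balance(output)
  output.fetch(:questions).each_with_index.filter_map do |question, index|
    lengths = question.fetch(:options).map(&:length)
    next if lengths.max <= 2 * lengths.min

    Failure.new(path: [:questions, index, :options],
                message: "option lengths are imbalanced")
  end
end

# Hypothetical adapter for legacy boolean/message validators: a single failure
# with an empty path, standing for "whole output".
def wrap_legacy(ok, message)
  ok ? [] : [Failure.new(path: [], message: message)]
end

output = {
  questions: [
    { options: ["cat", "dog", "emu"] },
    { options: ["yes", "a very long and detailed distractor"] }
  ]
}

failures = validate_option_balance(output)
failures.map(&:path) # => [[:questions, 1, :options]]
```

With a shape like this, the when: and target: lambdas from the DSL sketch have something to read, while a legacy validator wrapped by wrap_legacy only ever yields the empty "whole output" path, which is why a surgical attempt has nothing narrower to target in that case.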

2. Splice & merge

Surgical output is a slice. The gem must merge it back into the previous attempt's output by path. The simplest case is arrays of objects keyed by index (the quiz case). The general case (arbitrary JSON paths, nested arrays, dict keys) is an order of magnitude more work and probably out of scope for v1.
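A minimal sketch of the simplest case only (one top-level array of objects, keyed by index) might look like the following. splice_items and its argument shapes are hypothetical; it assumes the surgical attempt returns a hash of index => regenerated item.

```ruby
# Hypothetical helper: returns a new output in which the items at the given
# indices are replaced. The previous attempt's output is left untouched so it
# stays available for rollback.
def splice_items(previous, key, replacements)
  items = previous.fetch(key)

  # Reject indices that do not point at an existing item.
  replacements.each_key do |index|
    valid = index.is_a?(Integer) && index.between?(0, items.size - 1)
    raise IndexError, "no item at #{key}[#{index}]" unless valid
  end

  merged_items = items.each_with_index.map do |item, index|
    replacements.fetch(index, item)
  end
  previous.merge(key => merged_items)
end

previous = {
  title: "Quiz",
  questions: [{ text: "Q0" }, { text: "Q1" }, { text: "Q2" }]
}
merged = splice_items(previous, :questions, { 1 => { text: "Q1 (regenerated)" } })

merged[:questions].map { |q| q[:text] } # => ["Q0", "Q1 (regenerated)", "Q2"]
previous[:questions][1][:text]          # => "Q1"
```

Building a new hash rather than mutating in place is a deliberate choice in this sketch: it keeps the pre-merge output intact for any later comparison or rollback.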

3. Preservation invariants as post-merge guard

preserve: is effectively an auto-validator that compares pre- and post-merge values at the named paths. On violation, roll back to the previous attempt's output (soft-deliver, no regression vs. current behavior). The check must run after the merge but before the surgical attempt is counted as a success.
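The guard described above could be sketched as follows. guard_preserved and the returned hash shape are hypothetical, and the sketch assumes preserved fields are plain keys on each regenerated array item, as in the quiz case.

```ruby
# Hypothetical post-merge guard: for every regenerated item, the preserved
# fields must be identical before and after the merge. Any difference rolls
# back to the previous attempt's output.
def guard_preserved(previous, merged, key, indices, preserve)
  violated = indices.any? do |index|
    before = previous.fetch(key).fetch(index)
    after = merged.fetch(key).fetch(index)
    preserve.any? { |field| before[field] != after[field] }
  end

  if violated
    { output: previous, surgical_success: false } # rollback, soft-deliver
  else
    { output: merged, surgical_success: true }
  end
end

preserve = %i[correct_answer_index correct_answer_text]
previous = { questions: [{ correct_answer_index: 2, correct_answer_text: "Paris", options: %w[a b Paris] }] }
good     = { questions: [{ correct_answer_index: 2, correct_answer_text: "Paris", options: %w[Rome Oslo Paris] }] }
bad      = { questions: [{ correct_answer_index: 0, correct_answer_text: "Rome", options: %w[Rome Oslo Paris] }] }

guard_preserved(previous, good, :questions, [0], preserve)[:surgical_success] # => true
guard_preserved(previous, bad, :questions, [0], preserve)[:output].equal?(previous) # => true
```

Only the indices that were actually regenerated are compared here; untouched items cannot have changed if the merge copies them through unmodified.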

Other open questions

  • Trace shape. A surgical attempt is conceptually one entry in result.trace[:attempts], but both its input and its output are slices. Does the trace record the slice, the post-merge full output, or both?
  • around_call semantics. Hook fires once per run() with post-retry Result (existing invariant). Surgical attempts should not change this — they're still part of the retry chain, not separate calls.
  • Eval/optimizer interaction. compare_models and optimize currently treat each escalate entry as a candidate. A mode: :surgical entry has different cost characteristics (smaller prompt, smaller output) and shouldn't be benchmarked the same way as a full attempt.

Why file this now

Not asking for implementation. Logging the pattern because:

  • It came from real production data (not speculation)
  • The cost/latency win was experimentally validated on prod logs
  • "Structural bias resistant to same-model retry" is a class of problem the retry-chain abstraction currently doesn't have a clean answer for
  • The hard requirements (especially failure paths in validators) touch core API and are worth thinking about before they accumulate technical debt

Reference: ADR-016 in pdf_to_quiz (doc/adr/016-surgical-regen-imbalanced-options.md).
