Context
A 2026-04-19 smoke run produced this budget_summary.per_agent in the session JSON:
"per_agent": {
  "apprentice_pipeline": {
    "tokens_used": 3987,
    "cost_usd": 0.059805,
    "calls": 1,
    "duration_seconds": 47.27
  }
}
All six pipeline stages (discovery, implementation, instrumentation, visualization, assessment, review) are aggregated into a single row labeled apprentice_pipeline.
Problem
When a generated artifact fails quality review, there is no way to diagnose which stage is responsible without re-running with manual instrumentation. Per-stage breakdown is also needed to:
- Tune individual prompt costs
- Detect when one stage dominates the budget
- Compare model choices stage-by-stage (e.g., Haiku for assessment, gpt-5.4 for implementation)
- Validate that gates actually ran
Proposed fix
budget_summary.per_agent should report one row per stage defined in src/apprentice/stages/:
"per_agent": {
  "discovery": {"tokens_used": N, "cost_usd": N, "calls": N, "duration_seconds": N},
  "implementation": {...},
  "instrumentation": {...},
  "visualization": {...},
  "assessment": {...},
  "review": {...}
}
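One way the per-stage rows could be accumulated is sketched below. This is a minimal illustration, not apprentice's actual code: `StageBudget`, `BudgetTracker`, `record_call`, and `timed` are hypothetical names, and the stage list is copied from the six stages named above rather than discovered from src/apprentice/stages/.

```python
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass

# Assumption: stage names hard-coded from the issue text, in pipeline order.
STAGES = (
    "discovery",
    "implementation",
    "instrumentation",
    "visualization",
    "assessment",
    "review",
)


@dataclass
class StageBudget:
    """One per_agent row; field names match the existing session JSON."""
    tokens_used: int = 0
    cost_usd: float = 0.0
    calls: int = 0
    duration_seconds: float = 0.0


class BudgetTracker:
    """Hypothetical accumulator that keeps one row per stage."""

    def __init__(self, stages=STAGES):
        # Pre-create every row so a stage that never ran still shows up with zeros.
        self._rows = {name: StageBudget() for name in stages}

    def record_call(self, stage, tokens, cost_usd):
        """Add one model call to a stage's row (KeyError for an unknown stage)."""
        row = self._rows[stage]
        row.tokens_used += tokens
        row.cost_usd += cost_usd
        row.calls += 1

    @contextmanager
    def timed(self, stage):
        """Add the wall-clock time of the enclosed block to a stage's row."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self._rows[stage].duration_seconds += time.perf_counter() - start

    def per_agent(self):
        """Return a JSON-serializable dict shaped like budget_summary.per_agent."""
        return {name: asdict(row) for name, row in self._rows.items()}
```

Each stage would wrap its work in `with tracker.timed("discovery"):` and call `tracker.record_call(...)` after every model call. Because every row is pre-created, a stage that was silently skipped appears with `calls: 0` instead of being absent.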
Additionally, record per-stage gate verdicts (gate_name → pass/fail/skipped) in the session JSON so a reader can confirm which gates executed and in what order — the current session has no evidence of gate ordering.
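A possible shape for the gate record is sketched below, under stated assumptions: `GateLog`, `Verdict`, the `order` field, and the gate names in the usage note are hypothetical and are not taken from apprentice. An append-only list is used instead of a plain gate_name → verdict mapping so that execution order is explicit in the serialized output.

```python
from enum import Enum


class Verdict(str, Enum):
    """The three outcomes named in the proposal."""
    PASS = "pass"
    FAIL = "fail"
    SKIPPED = "skipped"


class GateLog:
    """Hypothetical append-only log of gate verdicts, kept in execution order."""

    def __init__(self):
        self._entries = []

    def record(self, stage, gate_name, verdict):
        """Append one verdict; raises ValueError if verdict is not pass/fail/skipped."""
        self._entries.append(
            {
                "order": len(self._entries) + 1,  # 1-based execution position
                "stage": stage,
                "gate_name": gate_name,
                "verdict": Verdict(verdict).value,
            }
        )

    def to_json(self):
        """Return a JSON-serializable list of copies, ordered by execution."""
        return [dict(entry) for entry in self._entries]
```

For example, `log.record("implementation", "example_gate", "pass")` followed by `log.record("review", "another_gate", Verdict.SKIPPED)` yields two entries with `order` 1 and 2. Recording skipped gates explicitly lets a reader distinguish a gate that was deliberately bypassed from one that never ran.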
Related
Observability is the "no magic" principle applied to apprentice itself. Currently apprentice is opaque about its own pipeline — ironic given the project it serves.