Problem
Subcalls currently use a stronger but slower model (gpt-5.1), which increases per-sample latency.
Proposal
Use a faster model for subcalls (e.g., gpt-4o) while keeping the baseline/final reasoning path on gpt-5.1.
Suggested split:
- Baseline/final pass: gpt-5.1
- Subcalls/tool-like intermediate calls: gpt-4o
Expected Impact
- Estimated latency improvement: ~40-50% for subcall-heavy flows.
- Approximate per-sample runtime: 108.5s -> 60-70s (workload dependent).
Risk
- Quality risk: correctness/grounding may drop by roughly 5-10%.
Implementation Considerations
- Add model selection by call type in config.
- Support easy rollback to single-model mode.
- Add per-call tracing to measure where latency is actually saved.
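A minimal sketch of how these three considerations could fit together, assuming a Python codebase. The names `ModelConfig`, `load_config`, `Tracer`, and the `SINGLE_MODEL_MODE` environment variable are hypothetical and are not taken from the existing code; only the model names come from the proposal above.

```python
import os
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical config: one model per call type, plus a rollback switch."""
    final_model: str = "gpt-5.1"
    subcall_model: str = "gpt-4o"
    single_model_mode: bool = False  # True routes every call to final_model

    def model_for(self, call_type: str) -> str:
        if call_type not in ("final", "subcall"):
            raise ValueError(f"unknown call type: {call_type!r}")
        if self.single_model_mode or call_type == "final":
            return self.final_model
        return self.subcall_model


def load_config(env=None) -> ModelConfig:
    """Read the rollback switch from an (assumed) environment variable."""
    env = os.environ if env is None else env
    return ModelConfig(single_model_mode=env.get("SINGLE_MODEL_MODE", "0") == "1")


@dataclass
class Tracer:
    """Collects one wall-clock record per model call."""
    records: list = field(default_factory=list)

    @contextmanager
    def span(self, call_type: str, model: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.records.append({
                "call_type": call_type,
                "model": model,
                "seconds": time.perf_counter() - start,
            })

    def total_by_call_type(self) -> dict:
        totals = {}
        for rec in self.records:
            totals[rec["call_type"]] = totals.get(rec["call_type"], 0.0) + rec["seconds"]
        return totals
```

In this sketch a call site would wrap each request in `with tracer.span(call_type, config.model_for(call_type)):`, and rollback amounts to setting `SINGLE_MODEL_MODE=1` before starting the run. Whether the switch lives in an environment variable or a config file is a design choice left open here.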
Acceptance Criteria
- Config supports separate models for baseline/final vs subcalls.
- Benchmark report compares before/after on:
  - latency per sample
  - token usage/cost
  - correctness
  - grounding/evidence metrics
- Quality delta is explicitly documented and approved before default rollout.
- One-command rollback path is available.
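The before/after comparison in the benchmark report could be computed along these lines. This is a sketch only: `RunMetrics`, `relative_change`, and `compare` are hypothetical names, and every number below except the 108.5s baseline latency is a placeholder, not a measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical per-run averages for the four reported metrics."""
    latency_s: float
    tokens: float
    correctness: float
    grounding: float


def relative_change(before: float, after: float) -> float:
    """Signed fractional change; negative means the value went down."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before


def compare(before: RunMetrics, after: RunMetrics) -> dict:
    """Relative change for each metric, keyed by metric name."""
    names = ("latency_s", "tokens", "correctness", "grounding")
    return {n: relative_change(getattr(before, n), getattr(after, n)) for n in names}


# Placeholder inputs: only the 108.5s baseline latency comes from the issue text.
baseline = RunMetrics(latency_s=108.5, tokens=1000.0, correctness=0.80, grounding=0.80)
candidate = RunMetrics(latency_s=65.0, tokens=1000.0, correctness=0.76, grounding=0.76)

for name, delta in compare(baseline, candidate).items():
    print(f"{name}: {delta:+.1%}")
```

Note that against a 108.5s baseline, a 60-70s runtime is a reduction of roughly 35-45%, so the report should state which baseline and workload the latency percentage refers to.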