Use a faster model for subcalls (gpt-4o) to reduce latency #16

@apenab

Problem

Subcalls currently use a stronger but slower model (gpt-5.1), which increases per-sample latency.

Proposal

Use a faster model for subcalls (e.g., gpt-4o) while keeping the baseline/final reasoning path on gpt-5.1.

Suggested split:

  • Baseline/final pass: gpt-5.1
  • Subcalls/tool-like intermediate calls: gpt-4o
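A minimal sketch of how the split above might be routed in code. The names `CallType`, `ModelConfig`, `select_model`, and `single_model_mode` are hypothetical and do not come from the existing codebase; only the model names are taken from this proposal.

```python
from dataclasses import dataclass
from enum import Enum


class CallType(Enum):
    """Hypothetical call categories mirroring the suggested split."""
    BASELINE = "baseline"
    FINAL = "final"
    SUBCALL = "subcall"


@dataclass(frozen=True)
class ModelConfig:
    # Field names are assumptions; the defaults are the models named in this issue.
    primary_model: str = "gpt-5.1"
    subcall_model: str = "gpt-4o"
    # When True, every call uses primary_model (single-model fallback).
    single_model_mode: bool = False


def select_model(call_type: CallType, config: ModelConfig) -> str:
    """Return the model name to use for a given call type."""
    if config.single_model_mode or call_type is not CallType.SUBCALL:
        return config.primary_model
    return config.subcall_model


config = ModelConfig()
print(select_model(CallType.FINAL, config))    # gpt-5.1
print(select_model(CallType.SUBCALL, config))  # gpt-4o
```

Keeping the decision in one function means a single flag can restore the current behaviour without touching call sites.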

Expected Impact

  • Estimated latency improvement: ~40-50% for subcall-heavy flows.
  • Approximate per-sample runtime: 108.5s -> 60-70s (workload dependent).
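As a quick sanity check, the runtime range above can be converted into a relative reduction. The helper name `relative_reduction` is made up for illustration; the inputs are the figures quoted in this issue.

```python
def relative_reduction(before_s: float, after_s: float) -> float:
    """Fraction of runtime saved when going from before_s to after_s."""
    return (before_s - after_s) / before_s


baseline_s = 108.5  # current per-sample runtime quoted above
for after_s in (60.0, 70.0):
    saved = relative_reduction(baseline_s, after_s)
    print(f"{baseline_s}s -> {after_s}s: {saved:.1%} lower")
# 108.5s -> 60.0s: 44.7% lower
# 108.5s -> 70.0s: 35.5% lower
```

The runtime range therefore corresponds to roughly a 35-45% reduction, which only partly overlaps the ~40-50% estimate; the two figures may describe different workloads, so the benchmark should report both.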

Risk

  • Quality risk: possible ~5-10% drop in correctness/grounding, since subcalls would run on the faster but weaker model.

Implementation Considerations

  • Add model selection by call type in config.
  • Support easy rollback to single-model mode.
  • Add per-call tracing to measure where latency is actually saved.
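Per-call tracing could be as small as a context manager that records wall-clock time per call type and model. This is a sketch under assumptions: `CallTracer` and its methods are hypothetical, and the `time.sleep` call stands in for a real model request.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CallTracer:
    """Collects wall-clock durations keyed by (call_type, model)."""

    def __init__(self) -> None:
        self.durations = defaultdict(list)

    @contextmanager
    def trace(self, call_type: str, model: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped call raises.
            elapsed = time.perf_counter() - start
            self.durations[(call_type, model)].append(elapsed)

    def totals(self) -> dict:
        """Total seconds spent per (call_type, model) pair."""
        return {key: sum(values) for key, values in self.durations.items()}


tracer = CallTracer()
with tracer.trace("subcall", "gpt-4o"):
    time.sleep(0.01)  # placeholder for the real model call
print(tracer.totals())
```

Summing per key shows whether the saved time really comes from subcalls rather than from the baseline/final pass.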

Acceptance Criteria

  • Config supports separate models for baseline/final vs subcalls.
  • Benchmark report compares before/after on:
    • latency per sample
    • token usage/cost
    • correctness
    • grounding/evidence metrics
  • Quality delta is explicitly documented and approved before default rollout.
  • One-command rollback path is available.
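The before/after comparison and the approval gate could be sketched as below. Everything here is an assumption for illustration: the `BenchmarkResult` fields, the `compare` and `ready_for_default_rollout` helpers, and the `max_quality_drop` threshold. The numbers in the usage lines are placeholders, not measurements.

```python
from dataclasses import dataclass

METRICS = ("latency_s", "cost_usd", "correctness", "grounding")


@dataclass(frozen=True)
class BenchmarkResult:
    latency_s: float    # mean latency per sample
    cost_usd: float     # token cost per sample
    correctness: float  # fraction correct, 0..1
    grounding: float    # grounding/evidence score, 0..1


def compare(before: BenchmarkResult, after: BenchmarkResult) -> dict:
    """Relative change per metric; assumes every 'before' value is nonzero."""
    return {
        name: (getattr(after, name) - getattr(before, name)) / getattr(before, name)
        for name in METRICS
    }


def ready_for_default_rollout(deltas: dict, max_quality_drop: float, approved: bool) -> bool:
    """Gate on both the quality metrics and an explicit human approval."""
    quality_ok = all(
        deltas[name] >= -max_quality_drop for name in ("correctness", "grounding")
    )
    return quality_ok and approved


# Placeholder values only, to show the shape of the report.
before = BenchmarkResult(latency_s=100.0, cost_usd=1.0, correctness=0.80, grounding=0.50)
after = BenchmarkResult(latency_s=60.0, cost_usd=0.5, correctness=0.76, grounding=0.50)
deltas = compare(before, after)
for name, change in deltas.items():
    print(f"{name}: {change:+.1%}")
print(ready_for_default_rollout(deltas, max_quality_drop=0.10, approved=False))
```

Requiring the `approved` flag separately from the metric check keeps the "documented and approved" criterion from being satisfied by numbers alone.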
