Use a faster model for subcalls (gpt-4o) to reduce latency #16

## Problem
Subcalls currently use a stronger but slower model (`gpt-5.1`), which increases per-sample latency.

## Proposal
Use a faster model for subcalls (e.g., `gpt-4o`) while keeping the baseline/final reasoning path on `gpt-5.1`.

Suggested split:
- Baseline/final pass: `gpt-5.1`
- Subcalls/tool-like intermediate calls: `gpt-4o`
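
A minimal sketch of how the split above could be expressed in code. The names `CallType`, `MODEL_BY_CALL_TYPE`, and `model_for` are hypothetical and do not come from the existing codebase; only the two model names are taken from this issue.

```python
from enum import Enum


class CallType(Enum):
    """Hypothetical call categories matching the suggested split."""
    BASELINE_FINAL = "baseline_final"  # baseline/final reasoning pass
    SUBCALL = "subcall"                # tool-like intermediate calls


# Assumed mapping; in practice this would be loaded from config.
MODEL_BY_CALL_TYPE = {
    CallType.BASELINE_FINAL: "gpt-5.1",
    CallType.SUBCALL: "gpt-4o",
}


def model_for(call_type: CallType) -> str:
    """Return the model name configured for the given call type."""
    return MODEL_BY_CALL_TYPE[call_type]
```

Keeping the mapping in one place means call sites only declare what kind of call they are making, not which model they want.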

## Expected Impact
- Estimated latency improvement: **~40-50%** for subcall-heavy flows.
- Approximate per-sample runtime: **108.5s -> 60-70s**, i.e. roughly 35-45% lower (workload dependent).

## Risk
- **Quality risk**: possible **~5-10% drop** in correctness/grounding, since subcalls would no longer run on the stronger model.

## Implementation Considerations
- Add model selection by call type in config.
- Support easy rollback to single-model mode.
- Add per-call tracing to measure where latency is actually saved.
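
One possible shape for the rollback switch and per-call tracing, sketched under assumptions: the environment variable `SUBCALL_SINGLE_MODEL`, the function `resolve_model`, and the class `LatencyTracer` are all hypothetical names, not part of the current code.

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical environment variable acting as the rollback switch.
SINGLE_MODEL_ENV = "SUBCALL_SINGLE_MODEL"


def resolve_model(call_type: str, env=None) -> str:
    """Pick a model by call type; setting the env var to "1" forces single-model mode."""
    env = os.environ if env is None else env
    if call_type == "subcall" and env.get(SINGLE_MODEL_ENV) != "1":
        return "gpt-4o"
    return "gpt-5.1"


class LatencyTracer:
    """Accumulates wall-clock time per (call_type, model) pair."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def span(self, call_type: str, model: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped call raises.
            key = (call_type, model)
            self.totals[key] += self._clock() - start
            self.counts[key] += 1
```

A call site would wrap each model request in `with tracer.span(call_type, resolve_model(call_type)):`, so the totals show how much of the per-sample runtime is spent in subcalls versus the baseline/final pass.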

## Acceptance Criteria
- Config supports separate models for baseline/final vs subcalls.
- Benchmark report compares before/after on:
  - latency per sample
  - token usage/cost
  - correctness
  - grounding/evidence metrics
- Quality delta is explicitly documented and approved before default rollout.
- One-command rollback path is available.
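
The before/after comparison could be computed along these lines. This is only a sketch: `RunSummary`, `compare`, and `quality_within_budget` are hypothetical names, and the `max_drop` default is an illustrative placeholder, not a threshold set by this issue.

```python
from dataclasses import dataclass

METRICS = ("latency_s", "tokens", "correctness", "grounding")


@dataclass(frozen=True)
class RunSummary:
    """Per-sample averages for one benchmark run (hypothetical schema)."""
    latency_s: float
    tokens: float
    correctness: float
    grounding: float


def compare(before: RunSummary, after: RunSummary) -> dict:
    """Relative change per metric; negative means the value went down."""
    return {
        name: (getattr(after, name) - getattr(before, name)) / getattr(before, name)
        for name in METRICS
    }


def quality_within_budget(deltas: dict, max_drop: float = 0.10) -> bool:
    """True if neither quality metric dropped by more than max_drop (assumed budget)."""
    return all(deltas[m] >= -max_drop for m in ("correctness", "grounding"))
```

The resulting dictionary can be pasted into the benchmark report so the quality delta is stated alongside the latency gain when approval is requested.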

