
Reasoning models aren't benchmarked correctly #5

@jpreagan

Description

Problem

When benchmarking reasoning/thinking models (e.g., QwQ, DeepSeek-R1), llmnop reports:

  • 0 output tokens
  • 0 throughput
  • Misleading TTFT (appears instant or missing)

Example output with a reasoning model:

number_output_tokens
    mean = 0
request_output_throughput_token_per_s
    mean = 0
ttft_s
    mean = 0.001  # Suspiciously fast

Root Cause

In src/benchmark.rs, we only check delta.content:

if let Some(content) = choice.delta.content {
    if !content.is_empty() {
        chunk_arrivals.push((Instant::now(), content.clone()));
        generated_text.push_str(&content);
    }
}

Reasoning models stream their thinking process via different fields depending on the inference server:

  • choices[].delta.reasoning_content (vLLM)
  • choices[].delta.reasoning (Ollama, LM Studio)

Since we ignore these fields, all reasoning tokens are missed. If the model does extended thinking before producing content, we miss that latency window entirely.
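For illustration, a vLLM-style stream chunk during the thinking phase might look like the following (field placement per the list above; the exact chunk shape varies by server, and the `reasoning_content` text here is made up):

```json
{"choices":[{"index":0,"delta":{"content":null,"reasoning_content":"First, consider..."}}]}
```

With the current code, a chunk like this contributes nothing to `chunk_arrivals`, so every metric derived from arrivals is wrong for the entire thinking phase.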

Caveat

The async-openai crate does not expose reasoning_content or reasoning fields on ChatCompletionStreamResponseDelta. The maintainers have declined PRs to add these fields, as the crate targets the official OpenAI API.

See: PR #418 (labeled "out of scope")

Recommended Approach: BYOT

The async-openai maintainer recommended:

"You can achieve this with combination of byot feature and Serde struct flattening to re-use/expand existing types."

The byot (Bring Your Own Types) feature enables *_byot() methods that accept custom request/response types. See: BYOT documentation

Implementation

  1. Enable the byot feature in Cargo.toml:
async-openai = { version = "0.30", features = ["byot"] }
  2. Define custom streaming types with reasoning fields:
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct StreamDelta {
    content: Option<String>,
    reasoning_content: Option<String>,  // vLLM
    reasoning: Option<String>,          // Ollama, LM Studio
}

#[derive(Debug, Deserialize)]
struct StreamChoice {
    delta: StreamDelta,
}

#[derive(Debug, Deserialize)]
struct StreamChunk {
    choices: Vec<StreamChoice>,
}
  3. Use create_stream_byot() instead of create_stream():
let stream: Pin<Box<dyn Stream<Item = Result<StreamChunk, OpenAIError>> + Send>> =
    client.chat().create_stream_byot(request).await?;

Solution

1. Track arrivals separately

let mut content_arrivals: Vec<(Instant, String)> = Vec::new();
let mut reasoning_arrivals: Vec<(Instant, String)> = Vec::new();
let mut generated_text = String::new();
let mut reasoning_text = String::new();

2. Parse both content and reasoning

let content = delta.content.as_deref().unwrap_or("");
let reasoning = delta.reasoning_content
    .as_deref()
    .or(delta.reasoning.as_deref())
    .unwrap_or("");

let now = Instant::now();

if !reasoning.is_empty() {
    reasoning_arrivals.push((now, reasoning.to_string()));
    reasoning_text.push_str(reasoning);
}

if !content.is_empty() {
    content_arrivals.push((now, content.to_string()));
    generated_text.push_str(content);
}

3. Add new metrics

TTFT (Time to First Token): Time to first token of ANY kind (including reasoning)

  • Uses: min(content_arrivals[0], reasoning_arrivals[0]) if both exist

TTFO (Time to First Output Token): Time to first NON-reasoning token

  • Uses: content_arrivals[0] only
  • For non-reasoning models: TTFO = TTFT

Reasoning token count: Number of reasoning tokens generated

  • Tokenize reasoning_text separately from generated_text
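The TTFT/TTFO rules above can be sketched with stdlib-only helpers. These functions are not part of llmnop today; they are a minimal illustration of the proposed semantics, operating on the arrival vectors from step 1:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: timestamp of the first arrival in a stream, if any.
fn first_arrival(arrivals: &[(Instant, String)]) -> Option<Instant> {
    arrivals.first().map(|(t, _)| *t)
}

/// TTFT: time from request start to the first token of ANY kind
/// (reasoning or content), per the definition above.
fn ttft(
    start: Instant,
    content: &[(Instant, String)],
    reasoning: &[(Instant, String)],
) -> Option<Duration> {
    let first = match (first_arrival(content), first_arrival(reasoning)) {
        (Some(c), Some(r)) => Some(c.min(r)),
        (c, None) => c,
        (None, r) => r,
    };
    first.map(|t| t.duration_since(start))
}

/// TTFO: time to the first NON-reasoning (content) token; None if the
/// model never produced content.
fn ttfo(start: Instant, content: &[(Instant, String)]) -> Option<Duration> {
    first_arrival(content).map(|t| t.duration_since(start))
}

fn main() {
    let start = Instant::now();
    let reasoning = vec![(start + Duration::from_millis(100), "thinking".to_string())];
    let content = vec![(start + Duration::from_millis(500), "answer".to_string())];

    // Reasoning model: TTFT comes from the first reasoning token.
    assert_eq!(ttft(start, &content, &reasoning), Some(Duration::from_millis(100)));
    assert_eq!(ttfo(start, &content), Some(Duration::from_millis(500)));

    // Non-reasoning model (empty reasoning stream): TTFO == TTFT.
    assert_eq!(ttft(start, &content, &[]), ttfo(start, &content));
}
```

Note that for a non-reasoning model the reasoning stream is empty, so the `(c, None)` arm makes TTFT collapse to TTFO automatically.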

4. Update BenchmarkResult

pub struct BenchmarkResult {
    pub ttft: Duration,              // First token (any kind)
    pub ttfo: Option<Duration>,      // First content token (None if no content)
    pub total_latency: Duration,
    pub throughput: f64,
    pub input_tokens: u32,
    pub output_tokens: u32,          // Content tokens only
    pub reasoning_tokens: u32,       // Reasoning tokens (new)
    pub inter_token_latency_s: f64,
    pub total_tokens: u32,           // input + output + reasoning
}
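The issue's testing notes say throughput should reflect total tokens generated, which implies counting reasoning tokens alongside content tokens. One way the `throughput` field could be computed (an assumption; this helper does not exist in llmnop):

```rust
use std::time::Duration;

// Hypothetical helper: generation throughput counting both content and
// reasoning tokens over the full request latency.
fn generation_throughput(output_tokens: u32, reasoning_tokens: u32, total_latency: Duration) -> f64 {
    let secs = total_latency.as_secs_f64();
    if secs == 0.0 {
        return 0.0; // guard against a degenerate zero-length measurement
    }
    (output_tokens + reasoning_tokens) as f64 / secs
}

fn main() {
    // 400 content + 600 reasoning tokens over 10 s -> 100 tokens/s.
    let t = generation_throughput(400, 600, Duration::from_secs(10));
    assert!((t - 100.0).abs() < 1e-9);
}
```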

5. Update output

Add new metrics to console and JSON output:

  • time_to_first_output_token (TTFO) with percentiles
  • reasoning_tokens with percentiles
  • Update total_tokens to include reasoning

For non-reasoning models:

  • TTFO = TTFT (or omit TTFO entirely)
  • reasoning_tokens = 0 or null
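An illustrative shape for the extended JSON output (field names `time_to_first_output_token_s` and `reasoning_tokens` taken from the bullets above; all values and the exact key naming are assumptions):

```json
{
  "ttft_s": { "mean": 0.812 },
  "time_to_first_output_token_s": { "mean": 4.103 },
  "input_tokens": { "mean": 40 },
  "output_tokens": { "mean": 148 },
  "reasoning_tokens": { "mean": 512 },
  "total_tokens": { "mean": 700 }
}
```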

Files to Modify

  • Cargo.toml - Add byot feature to async-openai
  • src/client.rs - Use create_stream_byot() with custom types
  • src/benchmark.rs - Track content vs reasoning arrivals separately, compute TTFT/TTFO
  • src/output.rs - Add TTFO and reasoning token output fields

Testing

  1. Test with a non-reasoning model (e.g., Llama 3.1, Gemma 3) - should work as before
  2. Test with a reasoning model (e.g., QwQ, DeepSeek-R1) - should now show:
    • Non-zero reasoning tokens
    • TTFT reflecting first reasoning token
    • TTFO reflecting first content token
    • Correct throughput based on total tokens generated
