Reasoning models aren't benchmarked correctly #5
Description
Problem
When benchmarking reasoning/thinking models (e.g., QwQ, DeepSeek-R1), llmnop reports:
- 0 output tokens
- 0 throughput
- Misleading TTFT (appears instant or missing)
Example output with a reasoning model:
```
number_output_tokens
    mean = 0
request_output_throughput_token_per_s
    mean = 0
ttft_s
    mean = 0.001  # Suspiciously fast
```
Root Cause
In src/benchmark.rs, we only check delta.content:

```rust
if let Some(content) = choice.delta.content {
    if !content.is_empty() {
        chunk_arrivals.push((Instant::now(), content.clone()));
        generated_text.push_str(&content);
    }
}
```

Reasoning models stream their thinking process via different fields depending on the inference server:
- `choices[].delta.reasoning_content` (vLLM)
- `choices[].delta.reasoning` (Ollama, LM Studio)

Since we ignore these fields, all reasoning tokens are missed. If the model does extended thinking before producing content, we miss that latency window entirely.
Caveat
The async-openai crate does not expose reasoning_content or reasoning fields on ChatCompletionStreamResponseDelta. The maintainers have declined PRs to add these fields, as the crate targets the official OpenAI API.
See: PR #418 (labeled "out of scope")
Recommended Approach: BYOT
The async-openai maintainer recommended:
> "You can achieve this with combination of `byot` feature and Serde struct flattening to re-use/expand existing types."
The byot (Bring Your Own Types) feature enables *_byot() methods that accept custom request/response types. See: BYOT documentation
Implementation
- Enable the `byot` feature in `Cargo.toml`:

```toml
async-openai = { version = "0.30", features = ["byot"] }
```

- Define custom streaming types with reasoning fields:
```rust
#[derive(Debug, Deserialize)]
struct StreamDelta {
    content: Option<String>,
    reasoning_content: Option<String>, // vLLM
    reasoning: Option<String>,         // Ollama, LM Studio
}

#[derive(Debug, Deserialize)]
struct StreamChoice {
    delta: StreamDelta,
}

#[derive(Debug, Deserialize)]
struct StreamChunk {
    choices: Vec<StreamChoice>,
}
```

- Use `create_stream_byot()` instead of `create_stream()`:

```rust
let stream: Pin<Box<dyn Stream<Item = Result<StreamChunk, OpenAIError>> + Send>> =
    client.chat().create_stream_byot(request).await?;
```

Solution
1. Track arrivals separately
```rust
let mut content_arrivals: Vec<(Instant, String)> = Vec::new();
let mut reasoning_arrivals: Vec<(Instant, String)> = Vec::new();
let mut generated_text = String::new();
let mut reasoning_text = String::new();
```

2. Parse both content and reasoning

```rust
let content = delta.content.as_deref().unwrap_or("");
let reasoning = delta.reasoning_content
    .as_deref()
    .or(delta.reasoning.as_deref())
    .unwrap_or("");
let now = Instant::now();

if !reasoning.is_empty() {
    reasoning_arrivals.push((now, reasoning.to_string()));
    reasoning_text.push_str(reasoning);
}
if !content.is_empty() {
    content_arrivals.push((now, content.to_string()));
    generated_text.push_str(content);
}
```

3. Add new metrics
TTFT (Time to First Token): time to the first token of ANY kind (including reasoning)
- Uses `min(content_arrivals[0], reasoning_arrivals[0])` if both exist

TTFO (Time to First Output Token): time to the first NON-reasoning token
- Uses `content_arrivals[0]` only
- For non-reasoning models: TTFO = TTFT

Reasoning token count: number of reasoning tokens generated
- Tokenize `reasoning_text` separately from `generated_text`
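Deriving both metrics from the two arrival vectors is a small, self-contained computation. This sketch uses only the standard library; the helper name `ttft_ttfo` and its signature are illustrative, not existing llmnop code:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: compute TTFT (first token of any kind) and TTFO
// (first content token) from the two arrival vectors, relative to the
// request start time.
fn ttft_ttfo(
    start: Instant,
    content_arrivals: &[(Instant, String)],
    reasoning_arrivals: &[(Instant, String)],
) -> (Option<Duration>, Option<Duration>) {
    let first_content = content_arrivals.first().map(|(t, _)| *t);
    let first_reasoning = reasoning_arrivals.first().map(|(t, _)| *t);

    // TTFT: the earlier of the two first arrivals, if either exists.
    let ttft = match (first_content, first_reasoning) {
        (Some(c), Some(r)) => Some(c.min(r)),
        (a, b) => a.or(b),
    }
    .map(|t| t - start);

    // TTFO: first content token only; None if the model emitted no content.
    let ttfo = first_content.map(|t| t - start);
    (ttft, ttfo)
}

fn main() {
    // Simulated run: reasoning starts at 50 ms, first content at 900 ms.
    let start = Instant::now();
    let r = start + Duration::from_millis(50);
    let c = start + Duration::from_millis(900);
    let (ttft, ttfo) = ttft_ttfo(
        start,
        &[(c, "Answer".to_string())],
        &[(r, "thinking...".to_string())],
    );
    assert_eq!(ttft, Some(Duration::from_millis(50)));  // first reasoning token
    assert_eq!(ttfo, Some(Duration::from_millis(900))); // first content token
}
```

For a non-reasoning model, `reasoning_arrivals` is empty and the helper naturally yields TTFO = TTFT.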
4. Update BenchmarkResult
```rust
pub struct BenchmarkResult {
    pub ttft: Duration,              // First token (any kind)
    pub ttfo: Option<Duration>,      // First content token (None if no content)
    pub total_latency: Duration,
    pub throughput: f64,
    pub input_tokens: u32,
    pub output_tokens: u32,          // Content tokens only
    pub reasoning_tokens: u32,       // Reasoning tokens (new)
    pub inter_token_latency_s: f64,
    pub total_tokens: u32,           // input + output + reasoning
}
```

5. Update output
Add new metrics to console and JSON output:
- `time_to_first_output_token` (TTFO) with percentiles
- `reasoning_tokens` with percentiles
- Update `total_tokens` to include reasoning

For non-reasoning models:
- TTFO = TTFT (or omit TTFO entirely)
- `reasoning_tokens` = 0 or null
Files to Modify
- `Cargo.toml` - Add `byot` feature to async-openai
- `src/client.rs` - Use `create_stream_byot()` with custom types
- `src/benchmark.rs` - Track content vs reasoning arrivals separately, compute TTFT/TTFO
- `src/output.rs` - Add TTFO and reasoning token output fields
Testing
- Test with a non-reasoning model (e.g., Llama 3.1, Gemma 3) - should work as before
- Test with a reasoning model (e.g., QwQ, DeepSeek-R1) - should now show:
- Non-zero reasoning tokens
- TTFT reflecting first reasoning token
- TTFO reflecting first content token
- Correct throughput based on total tokens generated