
Reasoning models aren't benchmarked correctly #5

@jpreagan

Description

Problem

When benchmarking reasoning/thinking models (e.g., QwQ, DeepSeek-R1), llmnop reports:

  • 0 output tokens
  • 0 throughput
  • Misleading TTFT (appears instant or missing)

Example output with a reasoning model:

number_output_tokens
    mean = 0
request_output_throughput_token_per_s
    mean = 0
ttft_s
    mean = 0.001  # Suspiciously fast

Root Cause

In src/benchmark.rs, we only check delta.content:

if let Some(content) = choice.delta.content {
    if !content.is_empty() {
        chunk_arrivals.push((Instant::now(), content.clone()));
        generated_text.push_str(&content);
    }
}

Reasoning models stream their thinking process via different fields depending on the inference server:

  • choices[].delta.reasoning_content (vLLM)
  • choices[].delta.reasoning (Ollama, LM Studio)

Since we ignore these fields, all reasoning tokens are missed. If the model does extended thinking before producing content, we miss that latency window entirely.
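For illustration, a vLLM-style stream chunk during the thinking phase might look like the following (field placement per the list above; the exact chunk shape varies by server, and the `reasoning_content` text here is made up):

```json
{"choices":[{"index":0,"delta":{"content":null,"reasoning_content":"First, consider..."}}]}
```

With the current code, a chunk like this contributes nothing to `chunk_arrivals`, so every metric derived from arrivals is wrong for the entire thinking phase.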

Caveat

The async-openai crate does not expose reasoning_content or reasoning fields on ChatCompletionStreamResponseDelta. The maintainers have declined PRs to add these fields, as the crate targets the official OpenAI API.

See: PR #418 (labeled "out of scope")

Recommended Approach: BYOT

The async-openai maintainer recommended:

"You can achieve this with combination of byot feature and Serde struct flattening to re-use/expand existing types."

The byot (Bring Your Own Types) feature enables *_byot() methods that accept custom request/response types. See: BYOT documentation

Implementation

  1. Enable the byot feature in Cargo.toml:
async-openai = { version = "0.30", features = ["byot"] }
  2. Define custom streaming types with reasoning fields:
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct StreamDelta {
    content: Option<String>,
    reasoning_content: Option<String>,  // vLLM
    reasoning: Option<String>,          // Ollama, LM Studio
}

#[derive(Debug, Deserialize)]
struct StreamChoice {
    delta: StreamDelta,
}

#[derive(Debug, Deserialize)]
struct StreamChunk {
    choices: Vec<StreamChoice>,
}
  3. Use create_stream_byot() instead of create_stream():
let stream: Pin<Box<dyn Stream<Item = Result<StreamChunk, OpenAIError>> + Send>> =
    client.chat().create_stream_byot(request).await?;

Solution

1. Track arrivals separately

let mut content_arrivals: Vec<(Instant, String)> = Vec::new();
let mut reasoning_arrivals: Vec<(Instant, String)> = Vec::new();
let mut generated_text = String::new();
let mut reasoning_text = String::new();

2. Parse both content and reasoning

let content = delta.content.as_deref().unwrap_or("");
let reasoning = delta.reasoning_content
    .as_deref()
    .or(delta.reasoning.as_deref())
    .unwrap_or("");

let now = Instant::now();

if !reasoning.is_empty() {
    reasoning_arrivals.push((now, reasoning.to_string()));
    reasoning_text.push_str(reasoning);
}

if !content.is_empty() {
    content_arrivals.push((now, content.to_string()));
    generated_text.push_str(content);
}

3. Add new metrics

TTFT (Time to First Token): Time to first token of ANY kind (including reasoning)

  • Uses: min(content_arrivals[0], reasoning_arrivals[0]) if both exist

TTFO (Time to First Output Token): Time to first NON-reasoning token

  • Uses: content_arrivals[0] only
  • For non-reasoning models: TTFO = TTFT

Reasoning token count: Number of reasoning tokens generated

  • Tokenize reasoning_text separately from generated_text
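The TTFT/TTFO rules above can be sketched with stdlib-only helpers. These functions are not part of llmnop today; they are a minimal illustration of the proposed semantics, operating on the arrival vectors from step 1:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: timestamp of the first arrival in a stream, if any.
fn first_arrival(arrivals: &[(Instant, String)]) -> Option<Instant> {
    arrivals.first().map(|(t, _)| *t)
}

/// TTFT: time from request start to the first token of ANY kind
/// (reasoning or content), per the definition above.
fn ttft(
    start: Instant,
    content: &[(Instant, String)],
    reasoning: &[(Instant, String)],
) -> Option<Duration> {
    let first = match (first_arrival(content), first_arrival(reasoning)) {
        (Some(c), Some(r)) => Some(c.min(r)),
        (c, None) => c,
        (None, r) => r,
    };
    first.map(|t| t.duration_since(start))
}

/// TTFO: time to the first NON-reasoning (content) token; None if the
/// model never produced content.
fn ttfo(start: Instant, content: &[(Instant, String)]) -> Option<Duration> {
    first_arrival(content).map(|t| t.duration_since(start))
}

fn main() {
    let start = Instant::now();
    let reasoning = vec![(start + Duration::from_millis(100), "thinking".to_string())];
    let content = vec![(start + Duration::from_millis(500), "answer".to_string())];

    // Reasoning model: TTFT comes from the first reasoning token.
    assert_eq!(ttft(start, &content, &reasoning), Some(Duration::from_millis(100)));
    assert_eq!(ttfo(start, &content), Some(Duration::from_millis(500)));

    // Non-reasoning model (empty reasoning stream): TTFO == TTFT.
    assert_eq!(ttft(start, &content, &[]), ttfo(start, &content));
}
```

Note that for a non-reasoning model the reasoning stream is empty, so the `(c, None)` arm makes TTFT collapse to TTFO automatically.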

4. Update BenchmarkResult

pub struct BenchmarkResult {
    pub ttft: Duration,              // First token (any kind)
    pub ttfo: Option<Duration>,      // First content token (None if no content)
    pub total_latency: Duration,
    pub throughput: f64,
    pub input_tokens: u32,
    pub output_tokens: u32,          // Content tokens only
    pub reasoning_tokens: u32,       // Reasoning tokens (new)
    pub inter_token_latency_s: f64,
    pub total_tokens: u32,           // input + output + reasoning
}
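The issue's testing notes say throughput should reflect total tokens generated, which implies counting reasoning tokens alongside content tokens. One way the `throughput` field could be computed (an assumption; this helper does not exist in llmnop):

```rust
use std::time::Duration;

// Hypothetical helper: generation throughput counting both content and
// reasoning tokens over the full request latency.
fn generation_throughput(output_tokens: u32, reasoning_tokens: u32, total_latency: Duration) -> f64 {
    let secs = total_latency.as_secs_f64();
    if secs == 0.0 {
        return 0.0; // guard against a degenerate zero-length measurement
    }
    (output_tokens + reasoning_tokens) as f64 / secs
}

fn main() {
    // 400 content + 600 reasoning tokens over 10 s -> 100 tokens/s.
    let t = generation_throughput(400, 600, Duration::from_secs(10));
    assert!((t - 100.0).abs() < 1e-9);
}
```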

5. Update output

Add new metrics to console and JSON output:

  • time_to_first_output_token (TTFO) with percentiles
  • reasoning_tokens with percentiles
  • Update total_tokens to include reasoning

For non-reasoning models:

  • TTFO = TTFT (or omit TTFO entirely)
  • reasoning_tokens = 0 or null
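An illustrative shape for the extended JSON output (field names `time_to_first_output_token_s` and `reasoning_tokens` taken from the bullets above; all values and the exact key naming are assumptions):

```json
{
  "ttft_s": { "mean": 0.812 },
  "time_to_first_output_token_s": { "mean": 4.103 },
  "input_tokens": { "mean": 40 },
  "output_tokens": { "mean": 148 },
  "reasoning_tokens": { "mean": 512 },
  "total_tokens": { "mean": 700 }
}
```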

Files to Modify

  • Cargo.toml - Add byot feature to async-openai
  • src/client.rs - Use create_stream_byot() with custom types
  • src/benchmark.rs - Track content vs reasoning arrivals separately, compute TTFT/TTFO
  • src/output.rs - Add TTFO and reasoning token output fields

Testing

  1. Test with a non-reasoning model (e.g., Llama 3.1, Gemma 3) - should work as before
  2. Test with a reasoning model (e.g., QwQ, DeepSeek-R1) - should now show:
    • Non-zero reasoning tokens
    • TTFT reflecting first reasoning token
    • TTFO reflecting first content token
    • Correct throughput based on total tokens generated
