Description
Hi, the following write-up is AI-generated, but the core observation is mine: during step 2 of the KV cache loading process, only the last turn is loaded. I'm not sure whether this is a design decision; at first glance it doesn't look right to me. Please let me know.
Summary
The multi-turn conversation feature currently reads only the previous turn (turn N-1) before writing a new turn (turn N), rather than reading all previous turns (turns 1 through N-1) as would be required for realistic LLM inference simulation.
This significantly underestimates I/O volume for long conversations and doesn't match real-world LLM behavior.
Current Behavior
Code Location: kv_cache_benchmark/kv_cache/benchmark.py lines 517-525
```python
# 2. For multi-turn conversations, access cache from previous turn.
if self.conversation_manager and request.turn_number > 1:
    prev_turn_key = f"{request.conversation_id}_turn_{request.turn_number - 1}"
    location, read_latency = self.cache.access_cache(prev_turn_key, InferencePhase.DECODE, 'multi_turn')
```
What happens:
- Turn 2: Reads turn 1 only (1 read)
- Turn 3: Reads turn 2 only (1 read) - Missing turn 1!
- Turn 10: Reads turn 9 only (1 read) - Missing turns 1-8!
Total reads for 10-turn conversation: 9 reads (one per turn)
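The current per-turn access pattern reduces to a simple read count. A minimal sketch (`current_reads` is a hypothetical helper for illustration, not part of the benchmark):

```python
def current_reads(total_turns: int) -> int:
    """Reads issued by the current implementation: each turn N (> 1)
    reads only turn N-1, so a T-turn conversation issues T-1 reads."""
    return sum(1 for turn in range(2, total_turns + 1))

print(current_reads(10))  # 9
```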
Expected Behavior
Real LLM inference requires reading all previous context for each turn.
Option 1: Read All Previous Turns (Separate Entries)
```python
# Proposed fix:
if self.conversation_manager and request.turn_number > 1:
    prev_keys = self.conversation_manager.get_all_previous_turn_keys(
        request.conversation_id, request.turn_number
    )
    for prev_key in prev_keys:
        location, read_latency = self.cache.access_cache(prev_key, InferencePhase.DECODE, 'multi_turn')
        storage_latency += read_latency
```
What happens:
- Turn 2: Reads turn 1 (1 read)
- Turn 3: Reads turns 1, 2 (2 reads)
- Turn 10: Reads turns 1-9 (9 reads)
Total reads for 10-turn conversation: 1 + 2 + 3 + ... + 9 = 45 reads
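The Option 1 totals follow the triangular numbers. A quick sketch (`option1_reads` is a hypothetical helper, not benchmark code):

```python
def option1_reads(total_turns: int) -> int:
    """Reads under Option 1: turn N reads all turns 1..N-1,
    giving 1 + 2 + ... + (total_turns - 1) reads in total."""
    return sum(turn - 1 for turn in range(2, total_turns + 1))

print(option1_reads(10))  # 45
```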
Option 2: Cumulative Cache Entries (Concatenation)
Alternatively, implement cumulative KV cache growth (more realistic but more complex):
```python
# Turn 2:
prev_data = cache.read("turn_1")                  # 100 tokens
new_data = generate(num_tokens=50)                # 50 newly generated tokens
combined = np.concatenate([prev_data, new_data])  # 150 tokens
cache.write("turn_2", combined)                   # write 150 tokens (cumulative)

# Turn 3:
prev_data = cache.read("turn_2")                  # 150 tokens (cumulative)
new_data = generate(num_tokens=75)                # 75 newly generated tokens
combined = np.concatenate([prev_data, new_data])  # 225 tokens
cache.write("turn_3", combined)                   # write 225 tokens (cumulative)
```
This matches real LLM behavior, where the KV cache grows with each turn.
Why This Matters
1. I/O Volume Underestimation
Current implementation (10-turn conversation):
- Reads: 9 (one per turn)
- I/O volume: 9 × avg_cache_size
Realistic implementation (Option 1):
- Reads: 45 (cumulative)
- I/O volume: 45 × avg_cache_size
Underestimation factor: 5× for 10-turn conversations (45 vs. 9 reads), and worse for longer conversations.
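The underestimation factor grows linearly with conversation length, since the ratio of triangular reads to one-per-turn reads simplifies to T/2. A quick check (hypothetical helper):

```python
def underestimation_factor(total_turns: int) -> float:
    """Ratio of realistic reads (1 + 2 + ... + T-1) to current
    reads (T - 1). Algebraically this is total_turns / 2."""
    current = total_turns - 1
    realistic = total_turns * (total_turns - 1) // 2
    return realistic / current

print(underestimation_factor(10))  # 5.0
print(underestimation_factor(50))  # 25.0
```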
2. Cache Locality Patterns
Current implementation:
- Only recent turn (turn N-1) is accessed
- Older turns (1 to N-2) may be evicted without consequence
Realistic implementation:
- All previous turns must be accessible
- Evicting old turns causes cache misses
- Tests realistic cache capacity requirements
3. Storage Throughput Requirements
For a 50-turn conversation (max_turns_per_conv = 50):
Current: 49 reads total
Realistic: 1 + 2 + 3 + ... + 49 = 1,225 reads total
This 25× difference dramatically changes storage throughput requirements.
Evidence of Incomplete Implementation
Function Exists But Is Unused
File: kv_cache_benchmark/kv_cache/conversation.py lines 105-111
```python
def get_all_previous_turn_keys(self, conversation_id: str, current_turn: int) -> List[str]:
    """Retrieves all cache keys from previous turns in a conversation."""
    with self.lock:
        if conversation_id not in self.conversations:
            return []
        state = self.conversations[conversation_id]
        return [key for key in state.cache_keys if key != f"{conversation_id}_turn_{current_turn}"]
```
This function is never called in the benchmark code itself.
```
$ grep -rn "get_all_previous_turn_keys" kv_cache_benchmark/ --include="*.py"
kv_cache_benchmark/tests/test_kv_cache.py:953: prev_keys = conversation_manager.get_all_previous_turn_keys(conv_id, 2)
kv_cache_benchmark/kv_cache/conversation.py:105: def get_all_previous_turn_keys(self, conversation_id: str, current_turn: int) -> List[str]:
```
It is only used in tests, never in the actual benchmark.
Documentation Inconsistency
File: kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md line 1831
"Context tokens = cumulative conversation history"
But the code doesn't implement cumulative context! Each turn only reads the previous turn, not the full cumulative history.
Proposed Solution
Minimal Fix (Option 1): Read All Previous Turns
File: kv_cache_benchmark/kv_cache/benchmark.py
Replace lines 517-525 with:
```python
# 2. For multi-turn conversations, access cache from ALL previous turns.
if self.conversation_manager and request.turn_number > 1:
    prev_keys = self.conversation_manager.get_all_previous_turn_keys(
        request.conversation_id, request.turn_number
    )
    for prev_key in prev_keys:
        location, read_latency = self.cache.access_cache(prev_key, InferencePhase.DECODE, 'multi_turn')
        if location is not None:
            storage_latency += read_latency
            with self.results_lock:
                self.results['multi_turn_cache_hits'] += 1
        else:
            with self.results_lock:
                self.results['multi_turn_cache_misses'] += 1
```
Pros:
- Simple change (uses existing function)
- Realistic I/O pattern
- Tests cache capacity requirements
- No change to cache entry structure
Cons:
- Increases I/O volume (but this is realistic!)
- Longer benchmark runtime for long conversations
Complete Fix (Option 2): Cumulative Cache Entries
Implement true cumulative KV cache growth with concatenation. This is more complex but matches real LLM behavior exactly.
Pros:
- Matches real LLM inference exactly
- Single read per turn (of cumulative cache)
- Tests realistic cache size growth
Cons:
- Requires significant code changes
- More complex implementation
- Higher memory usage
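To illustrate the memory-growth concern: under Option 2, the entry written at turn N holds all tokens from turns 1 through N, so entry sizes grow cumulatively. A sketch (the per-turn token counts match the hypothetical 100/50/75 example from Option 2 above):

```python
def cumulative_entry_tokens(tokens_per_turn):
    """Size of each cumulative cache entry: the entry written at
    turn N holds all tokens generated in turns 1..N."""
    sizes, running = [], 0
    for tokens in tokens_per_turn:
        running += tokens
        sizes.append(running)
    return sizes

print(cumulative_entry_tokens([100, 50, 75]))  # [100, 150, 225]
```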
Impact Assessment
Performance Impact
Current (10-turn conversation):
- 9 reads
- ~90ms total read latency (assuming ~10ms/read from NVMe)
Proposed (10-turn conversation):
- 45 reads
- ~450ms total read latency
This is realistic! Real LLM inference DOES require reading all previous context.
Benchmark Runtime
For --max-turns-per-conv 50 with 100 conversations:
Current: 4,900 reads (49 per conversation)
Proposed: 122,500 reads (1,225 per conversation)
Runtime increase: ~25× for multi-turn reads
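The totals above can be reproduced with a quick calculation (a sketch of the arithmetic, using the conversation count and turn limit stated above):

```python
conversations = 100
max_turns = 50

# Current: one read per turn after the first, per conversation.
current_total = conversations * (max_turns - 1)
# Proposed (Option 1): triangular sum 1 + 2 + ... + 49 per conversation.
proposed_total = conversations * (max_turns * (max_turns - 1) // 2)

print(current_total)   # 4900
print(proposed_total)  # 122500
```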
Mitigation: Users can reduce --max-turns-per-conv if needed, but the I/O pattern will be realistic.
Recommendation
Implement Option 1 (Read All Previous Turns) as the minimal fix.
This:
- Uses the existing `get_all_previous_turn_keys()` function
- Requires minimal code changes
- Provides realistic I/O patterns
- Tests actual cache capacity requirements
- Matches documentation claims about "cumulative conversation history"
Option 2 (cumulative cache entries) could be considered for a future version if more accurate LLM simulation is desired.
Related Issues
- Documentation at line 1831 claims "cumulative conversation history" but code doesn't implement this
- Function `get_all_previous_turn_keys()` exists but is unused (dead code)
- Current implementation may give misleading results for storage capacity planning