Description
Sending the identical prompt twice to the same server with temperature=0.0 produces different outputs. The second response often contains corrupted text (Cyrillic characters, garbled tokens), suggesting uninitialized memory or state corruption between requests.
Steps to Reproduce
# Start server
./build-metal/quant-server SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080
# First request
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is 2+2?"}],"temperature":0.0,"max_tokens":30}'
# Response 1: "2+2 is equal to 4." (coherent)
# Second request (identical)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is 2+2?"}],"temperature":0.0,"max_tokens":30}'
# Response 2: "2+2 = 4\nОтвет: 4" (Cyrillic corruption)
Expected Behavior
With temperature=0.0 (greedy decoding), identical inputs must produce identical outputs every time.
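The expectation above can be checked mechanically instead of by eye. The following is a minimal sketch, not part of the project: `check_deterministic` and `send_prompt` are hypothetical helper names, and the `jq` filter assumes the server returns an OpenAI-style response body (`choices[0].message.content`). Comparing only the message content avoids false mismatches from fields such as an id or timestamp, if the response contains any.

```shell
# check_deterministic RUNS CMD [ARGS...]
# Runs CMD RUNS times and compares every output with the first one.
# Returns 0 if all outputs are identical, 1 on the first mismatch,
# and 2 if CMD itself fails.
check_deterministic() {
  local runs=$1; shift
  local first current i
  first=$("$@") || return 2
  for ((i = 2; i <= runs; i++)); do
    current=$("$@") || return 2
    if [[ "$current" != "$first" ]]; then
      printf 'run %d differs from run 1\n' "$i" >&2
      # Show the difference; diff exits non-zero here, which is expected.
      diff <(printf '%s\n' "$first") <(printf '%s\n' "$current") >&2 || true
      return 1
    fi
  done
  return 0
}

# Hypothetical wrapper around the request from the reproduction steps.
# Assumption: the response follows the OpenAI chat-completions schema.
send_prompt() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"temperature":0.0,"max_tokens":30}' |
    jq -r '.choices[0].message.content'
}
```

With the server running, `check_deterministic 5 send_prompt` should exit with status 0 once the bug is fixed. On the affected build it would be expected to report a mismatch on the second run.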
Impact
- Severity: P0 — Breaks reproducibility, a fundamental requirement for testing and production
- Suggests memory corruption or uninitialized state in the KV cache between requests
- May be related to the KV cache reuse feature (chat-mode optimization)
Root Cause Hypothesis
The KV cache from the previous request may not be fully cleared or reset before the next request. If the cache reuse logic incorrectly detects a "match" or leaves stale entries in place, the attention computation reads values that do not belong to the current prompt, so the output depends on request history rather than on the input alone.
Suggested Investigation
- Check if the KV cache is properly reset between unrelated requests
- Verify that memset/zero-initialization happens on all state buffers
- Test with KV cache reuse disabled to isolate the issue
- Run under AddressSanitizer (-fsanitize=address) to detect memory issues
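For the AddressSanitizer step, one possible way to get an instrumented binary is a separate build directory. This is a sketch under assumptions: `build_asan` and the `build-asan` directory are hypothetical names, the `CMAKE_*` variables are standard CMake, and `TQ_BUILD_METAL=ON` is copied from the Environment section. Whether the project's CMake setup honors these flags for every target has not been verified, and any Objective-C sources in the Metal backend would likely need the same flags via `CMAKE_OBJC_FLAGS`.

```shell
# Sketch: configure and build an AddressSanitizer variant next to the
# regular build. Run from the repository root.
build_asan() {
  # -fno-omit-frame-pointer and -g give readable stack traces in reports.
  local flags="-fsanitize=address -fno-omit-frame-pointer -g"
  cmake -S . -B build-asan \
    -DCMAKE_BUILD_TYPE=Debug \
    -DTQ_BUILD_METAL=ON \
    -DCMAKE_C_FLAGS="$flags" \
    -DCMAKE_CXX_FLAGS="$flags" \
    -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=address" &&
    cmake --build build-asan
}
```

After `build_asan`, starting the server from the new directory (presumably `./build-asan/quant-server`, mirroring the path in the reproduction steps) and sending the two identical requests should surface any out-of-bounds or use-after-free access in CPU-side code. AddressSanitizer does not instrument GPU memory, so a clean run does not rule out a problem inside the Metal kernels.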
Environment
- quant.cpp: latest main (49c6605)
- Model: SmolLM2-1.7B-Instruct-Q8_0.gguf
- Build: cmake -DTQ_BUILD_METAL=ON
- OS: macOS 15 (Apple M3)
Reported by ClawTeam Claw-5 (Researcher persona)