Symptom
This is the sequel to #1798. The crash is gone since #1799, but the same trigger now corrupts silently.
As soon as a plain chat completion and a constrained request decode concurrently in the same batch:
response_format: json_schema (strict) responses come back as plain prose - the grammar is never applied. 5/5 in my repro.
thinking_budget is ignored: the model reasons to its natural length. 10/10 in my repro (603 thinking tokens with thinking_budget: 300, every single time).
No error, no log, the requests complete "successfully". Solo, both features work perfectly (299/300, valid JSON, deterministic at temperature 0).
I originally chased this as a flaky "the first budgeted request after a model load doesn't enforce" - it was actually the overlap with the previous request still draining in the batch. Two threads 0.3s apart make it 100% deterministic.
Root cause
I put a composition probe on the step chokepoint and on GenerationBatch.extend. The smoking gun:
PROBE GenerationBatch.extend self=[0, 0] + in=[2] -> [0, 0, 2] uids=[1, 2]
PROBE step composition: [(1, 0), (2, 0)]
Three processor slots for two uids. A stale slot left behind by a finished request shifts everything after it: the structured request (uid 2) reads slot 1 (empty), while its grammar + budget processors sit in slot 2, which nothing ever reads. The #1799 normalisation makes the step crash-safe on None slots, but by design it can't restore alignment - the row just runs without its processors.
samplers go through the exact same positional logic in extend()/filter(), so a row can also silently run another request's sampler.
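To make the misalignment concrete, here is a toy model of the positional lookup. It is only a sketch: `step_composition` and the list-of-lists layout are hypothetical stand-ins, not the actual mlx-lm data structures, and I am assuming the numbers in the probe output are per-slot processor counts.

```python
# Toy model of positional processor slots; NOT the real batch code.
# Assumption: each row finds its processors at the same index as its uid.

def step_composition(uids, slots):
    """Pair each uid with the processor count of the slot at its index."""
    return [(uid, len(slots[i])) for i, uid in enumerate(uids)]

# An earlier request finished and left a stale slot behind; uid 1 is still decoding.
uids = [1]
slots = [[], []]  # stale slot, then uid 1's (empty) slot

# uid 2 joins the batch with its grammar and budget processors.
uids = uids + [2]
slots = slots + [["grammar", "budget"]]

composition = step_composition(uids, slots)
unread = slots[len(uids):]

print([len(s) for s in slots])  # [0, 0, 2]
print(composition)              # [(1, 0), (2, 0)]
print(unread)                   # [['grammar', 'budget']]
```

uid 2 ends up with zero processors while its real slot sits past the end of the uid list, which is the same shape as the probe output above.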
Repro
import json, threading, time, urllib.request

BASE = "http://localhost:1234/v1/chat/completions"
MODEL = "Qwen3.6-35B-A3B-nvfp4"
SCHEMA = {"type": "object", "additionalProperties": False,
          "required": ["minutes"], "properties": {"minutes": {"type": "integer"}}}

def post(body):
    req = urllib.request.Request(BASE, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=600) as r:
        return json.loads(r.read())

def plain(i):
    # Unconstrained request that is still decoding when the constrained one joins the batch.
    post({"model": MODEL, "messages": [{"role": "user", "content": f"Briefly explain tides, variant {i}."}],
          "temperature": 0, "max_tokens": 700})

def constrained(i):
    # Strict JSON schema plus a thinking budget: both should be enforced.
    d = post({"model": MODEL,
              "messages": [{"role": "user", "content": "A train leaves at 14:07 and arrives at 17:43, stopping 12 minutes total. How many minutes was it moving?"}],
              "temperature": 0, "max_tokens": 4000, "thinking_budget": 300,
              "response_format": {"type": "json_schema",
                                  "json_schema": {"name": "x", "strict": True, "schema": SCHEMA}}})
    msg = d["choices"][0]["message"]
    print(i, "content:", (msg.get("content") or "")[:60],
          "| reasoning chars:", len(msg.get("reasoning_content") or ""))

for i in range(10):
    t1 = threading.Thread(target=plain, args=(i,))
    t2 = threading.Thread(target=constrained, args=(i,))
    # Start the constrained request 0.3s later so it overlaps the plain one in the batch.
    t1.start(); time.sleep(0.3); t2.start()
    t1.join(); t2.join()
On main: every constrained iteration prints prose content and ~2000+ reasoning chars. Expected: {"minutes": 204} and bounded reasoning.
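If you want to count violations instead of eyeballing the output, something like the check below could be applied to the constrained response's content. `is_violation` is a hypothetical helper written by hand for this one schema; it is not a general JSON Schema validator and not part of the repro script.

```python
import json

def is_violation(content):
    """Return True if content is not a JSON object of the form {"minutes": <int>}."""
    try:
        obj = json.loads(content)
    except (TypeError, ValueError):
        # None, prose, or otherwise unparseable content.
        return True
    # additionalProperties is False and "minutes" is required: exactly one key.
    if not isinstance(obj, dict) or set(obj) != {"minutes"}:
        return True
    value = obj["minutes"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, bool) or not isinstance(value, int)
```

In the repro, this would be called as `is_violation(msg.get("content"))` inside `constrained`.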
Fix
I have a fix up: record at insert time what each uid is supposed to run, and realign the positional lists from that registry at the step chokepoint (same spot as #1799). With it the repro goes 15/15 violations → 0/15, solo/sequential behavior unchanged. PR incoming right after this issue.
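Roughly, the idea looks like the sketch below. The class and method names (`ProcessorRegistry`, `record`, `forget`, `realign`) are made up for illustration and the processors are plain strings; the actual PR may be structured differently.

```python
# Hypothetical sketch of the registry idea; names are illustrative only.

class ProcessorRegistry:
    def __init__(self):
        self._by_uid = {}

    def record(self, uid, processors):
        """At insert time: remember what this uid is supposed to run."""
        self._by_uid[uid] = list(processors)

    def forget(self, uid):
        """When a request finishes: drop its entry."""
        self._by_uid.pop(uid, None)

    def realign(self, uids):
        """At the step chokepoint: rebuild the list so slot i belongs to uids[i]."""
        return [self._by_uid.get(uid, []) for uid in uids]

registry = ProcessorRegistry()
registry.record(1, [])                     # plain request, no processors
registry.record(2, ["grammar", "budget"])  # constrained request

# Whatever the positional list drifted to, it is rebuilt from the uids in the batch.
slots = registry.realign([1, 2])
print(slots)  # [[], ['grammar', 'budget']]
```

Keying on uid rather than position means a stale slot can no longer shift later rows, since the positional list is derived from the live uids on every step.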
Environment
6ae5142 (includes the #1799 fix, "fix: normalise logits_processors row slots dropped to None by batch merge"), mlx-lm at the pinned 39c4019
mlx-community/Qwen3.6-35B-A3B-nvfp4, thinking enabled