Grammar and thinking_budget silently stop applying under concurrent mixed load (row misalignment after batch merge — sequel to #1798)

## Environment

- omlx main @ `6ae5142` (includes the #1799 fix), mlx-lm at the pinned `39c4019`
- macOS 26.5, Apple Silicon (M5 Pro, 64 GB)
- Model: `mlx-community/Qwen3.6-35B-A3B-nvfp4`, thinking enabled
- Default scheduler settings

## Symptom

This is the sequel to #1798. The crash is gone since #1799, but the same trigger now corrupts silently.

As soon as a plain chat completion and a constrained request decode concurrently in the same batch:

- `response_format: json_schema` (strict) responses come back as **plain prose** - the grammar is never applied. 5/5 in my repro.
- `thinking_budget` is ignored — the model reasons to its natural length. 10/10 in my repro (603 thinking tokens with `thinking_budget: 300`, every single time).

No error and no log; the requests complete "successfully". Solo, both features work perfectly (299/300, valid JSON, deterministic at temperature 0).

I originally chased this as a flaky "the first budgeted request after a model load doesn't enforce the budget" - it was actually the overlap with the previous request still draining in the batch. Starting two threads 0.3s apart makes it 100% deterministic.

## Root cause

I put a composition probe on the step chokepoint and on `GenerationBatch.extend`. The smoking gun:

```
PROBE GenerationBatch.extend self=[0, 0] + in=[2] -> [0, 0, 2]  uids=[1, 2]
PROBE step composition: [(1, 0), (2, 0)]
```

Three processor slots for two uids. A stale slot left behind by a finished request shifts everything after it: the structured request (uid 2) reads slot 1 (empty), while its grammar + budget processors sit in slot 2, which nothing ever reads. The #1799 normalisation makes the step crash-safe on `None` slots, but by design it can't restore alignment - the row just runs without its processors.

`samplers` go through the exact same positional logic in `extend()`/`filter()`, so a row can also silently run another request's sampler.
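
To make the shift concrete, here is a toy model of the misalignment. It is only a sketch: plain Python lists stand in for the per-row state, and `extend` / `processors_for` are made-up names, not the omlx or mlx-lm implementation. It assumes, as the probe suggests, that the step reads each row's processors by row index.

```python
# Toy model only: plain lists stand in for the batch's per-row state.
# None of these names are omlx or mlx-lm APIs.

def extend(uids, slots, new_uids, new_slots):
    """Append new rows positionally, as a merge of two batches would."""
    return uids + new_uids, slots + new_slots

def processors_for(uids, slots):
    """Look up each row's processors by row index (positional read)."""
    return {uid: slots[row] for row, uid in enumerate(uids)}

# uid 1 is a plain request still decoding. One stale slot from a finished
# request was never removed, so there are two slots for one live row.
uids, slots = [1], [[], []]

# uid 2 arrives with its grammar + thinking-budget processors.
uids, slots = extend(uids, slots, [2], [["grammar", "budget"]])

print(len(slots), "slots for", len(uids), "uids")  # 3 slots for 2 uids
print(processors_for(uids, slots))                 # {1: [], 2: []}
```

uid 2 ends up reading the stale empty slot at index 1, while its own processors sit at index 2, which no row maps to.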

## Repro

```python
import json, threading, time, urllib.request

BASE = "http://localhost:1234/v1/chat/completions"
MODEL = "Qwen3.6-35B-A3B-nvfp4"
SCHEMA = {"type": "object", "additionalProperties": False,
          "required": ["minutes"], "properties": {"minutes": {"type": "integer"}}}

def post(body):
    req = urllib.request.Request(BASE, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=600) as r:
        return json.loads(r.read())

def plain(i):
    post({"model": MODEL, "messages": [{"role": "user", "content": f"Briefly explain tides, variant {i}."}],
          "temperature": 0, "max_tokens": 700})

def constrained(i):
    d = post({"model": MODEL,
              "messages": [{"role": "user", "content": "A train leaves at 14:07 and arrives at 17:43, stopping 12 minutes total. How many minutes was it moving?"}],
              "temperature": 0, "max_tokens": 4000, "thinking_budget": 300,
              "response_format": {"type": "json_schema",
                                  "json_schema": {"name": "x", "strict": True, "schema": SCHEMA}}})
    msg = d["choices"][0]["message"]
    print(i, "content:", (msg.get("content") or "")[:60],
          "| reasoning chars:", len(msg.get("reasoning_content") or ""))

for i in range(10):
    t1 = threading.Thread(target=plain, args=(i,))
    t2 = threading.Thread(target=constrained, args=(i,))
    t1.start(); time.sleep(0.3); t2.start()
    t1.join(); t2.join()
```

On main: every `constrained` iteration prints prose content and more than 2000 reasoning chars. Expected: `{"minutes": 204}` and bounded reasoning.
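
If it helps to count violations automatically, a small checker along these lines could be applied to each `content` string. This is a hedged sketch: `is_violation` is a hypothetical helper, not part of the repro above, and it only checks the shape required by `SCHEMA`. The arithmetic at the end confirms the expected answer for the prompt.

```python
import json
from datetime import datetime

def is_violation(content):
    """True if content is not a JSON object of the form {"minutes": <int>}."""
    try:
        obj = json.loads(content)
    except (TypeError, ValueError):
        # Prose, empty string, or None: the grammar was not applied.
        return True
    if not isinstance(obj, dict) or set(obj) != {"minutes"}:
        return True
    value = obj["minutes"]
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, bool) or not isinstance(value, int)

# Expected answer: 14:07 -> 17:43 is 216 minutes, minus 12 minutes stopped.
fmt = "%H:%M"
elapsed = datetime.strptime("17:43", fmt) - datetime.strptime("14:07", fmt)
expected = int(elapsed.total_seconds()) // 60 - 12

print(expected)                                               # 204
print(is_violation('{"minutes": 204}'))                       # False
print(is_violation("The train was moving for 204 minutes."))  # True
```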

## Fix

I have a fix up: record at insert time what each uid is supposed to run, and realign the positional lists from that registry at the step chokepoint (same spot as #1799). With it the repro goes 15/15 violations → 0/15, solo/sequential behavior unchanged. PR incoming right after this issue.
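
Roughly, the idea looks like the sketch below. It is illustrative only, not the code in the PR: `registry`, `insert`, `realign` and `finish` are hypothetical names, and plain lists stand in for the real per-row processor state.

```python
# Hypothetical sketch of the registry idea; not the PR's actual code.

registry = {}  # uid -> processors recorded at insert time

def insert(uid, processors):
    """Record what this uid is supposed to run when it enters the batch."""
    registry[uid] = list(processors)

def realign(uids):
    """Rebuild the positional list from the registry, one slot per live uid."""
    return [registry.get(uid, []) for uid in uids]

def finish(uid):
    """Forget a uid once its request completes."""
    registry.pop(uid, None)

insert(1, [])                     # plain chat completion
insert(2, ["grammar", "budget"])  # constrained request

# Positional state after a merge that left a stale slot behind.
stale_slots = [[], [], ["grammar", "budget"]]
uids = [1, 2]

# At the step chokepoint, ignore the positional list and rebuild it by uid.
slots = realign(uids)
print(slots)  # [[], ['grammar', 'budget']]
```

Keying on uid rather than position means a leftover slot can no longer shift later rows, whatever order `extend()` and `filter()` ran in.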

