Grammar and thinking_budget silently stop applying under concurrent mixed load (row misalignment after batch merge — sequel to #1798) #1823

@efortin


Environment

Symptom

This is the sequel to #1798. The crash has been gone since #1799, but the same trigger now silently drops the constraints instead.

As soon as a plain chat completion and a constrained request decode concurrently in the same batch:

  • response_format: json_schema (strict) responses come back as plain prose: the grammar is never applied. 5/5 in my repro.
  • thinking_budget is ignored: the model reasons to its natural length. 10/10 in my repro (603 thinking tokens with thinking_budget: 300, every single time).

No error, no log, the requests complete "successfully". Solo, both features work perfectly (299/300, valid JSON, deterministic at temperature 0).

I originally chased this as a flaky "the first budgeted request after a model load doesn't enforce" - it was actually the overlap with the previous request still draining in the batch. Two threads 0.3s apart make it 100% deterministic.

Root cause

I put a composition probe on the step chokepoint and on GenerationBatch.extend. The smoking gun:

PROBE GenerationBatch.extend self=[0, 0] + in=[2] -> [0, 0, 2]  uids=[1, 2]
PROBE step composition: [(1, 0), (2, 0)]

Three processor slots for two uids. A stale slot left behind by a finished request shifts everything after it: the structured request (uid 2) reads slot 1 (empty), while its grammar + budget processors sit in slot 2, which nothing ever reads. The #1799 normalisation makes the step crash-safe on None slots, but by design it can't restore alignment - the row just runs without its processors.

Samplers go through the exact same positional logic in extend()/filter(), so a row can also silently run another request's sampler.
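The misalignment in the probe output can be illustrated with a toy model. This is only a sketch: the names `slots` and `processors_by_position` are hypothetical and are not the library's actual GenerationBatch code. It assumes processors are looked up by row index, which is what the probe suggests.

```python
# Toy model only: hypothetical names, not the real GenerationBatch internals.
# Two live requests, but three processor slots, as in the probe ([0, 0, 2]).
uids = [1, 2]

# Index 1 is a stale, empty slot left behind by a finished request, so the
# constrained request's processors landed at index 2 instead of index 1.
slots = [[], [], ["grammar", "thinking_budget"]]


def processors_by_position(uids, slots):
    """Pair each uid with the slot at the same index (positional lookup)."""
    return {uid: slots[i] for i, uid in enumerate(uids)}


seen = processors_by_position(uids, slots)
orphaned = slots[len(uids):]

print(seen)      # {1: [], 2: []}
print(orphaned)  # [['grammar', 'thinking_budget']]
```

Under this model, uid 2 reads the empty slot and its real processors sit past the end of the uid list, which matches the "runs without its processors" behavior described above.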

Repro

import json, threading, time, urllib.request

BASE = "http://localhost:1234/v1/chat/completions"
MODEL = "Qwen3.6-35B-A3B-nvfp4"
SCHEMA = {"type": "object", "additionalProperties": False,
          "required": ["minutes"], "properties": {"minutes": {"type": "integer"}}}

def post(body):
    req = urllib.request.Request(BASE, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=600) as r:
        return json.loads(r.read())

def plain(i):
    post({"model": MODEL, "messages": [{"role": "user", "content": f"Briefly explain tides, variant {i}."}],
          "temperature": 0, "max_tokens": 700})

def constrained(i):
    d = post({"model": MODEL,
              "messages": [{"role": "user", "content": "A train leaves at 14:07 and arrives at 17:43, stopping 12 minutes total. How many minutes was it moving?"}],
              "temperature": 0, "max_tokens": 4000, "thinking_budget": 300,
              "response_format": {"type": "json_schema",
                                  "json_schema": {"name": "x", "strict": True, "schema": SCHEMA}}})
    msg = d["choices"][0]["message"]
    print(i, "content:", (msg.get("content") or "")[:60],
          "| reasoning chars:", len(msg.get("reasoning_content") or ""))

for i in range(10):
    t1 = threading.Thread(target=plain, args=(i,))
    t2 = threading.Thread(target=constrained, args=(i,))
    # Start the constrained request 0.3 s after the plain one so both decode in the same batch.
    t1.start(); time.sleep(0.3); t2.start()
    t1.join(); t2.join()

On main, every constrained iteration prints prose content and 2000+ reasoning chars. Expected: {"minutes": 204} and bounded reasoning.
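To count violations automatically rather than eyeballing the printed lines, a check along these lines could be applied to each response message. It is a sketch under assumptions: `is_violation` is a hypothetical helper, and `max_reasoning_chars` is an arbitrary character threshold chosen by the caller, since the budget is expressed in tokens while the repro measures characters.

```python
import json


def is_violation(message, max_reasoning_chars):
    """Return True if the message breaks the schema or exceeds the reasoning bound.

    `max_reasoning_chars` is a caller-chosen threshold (an assumption), not a
    value derived from thinking_budget.
    """
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return True  # prose instead of JSON: the grammar was not applied
    schema_ok = (
        isinstance(parsed, dict)
        and set(parsed) == {"minutes"}          # required key, no extras
        and isinstance(parsed["minutes"], int)
        and not isinstance(parsed["minutes"], bool)  # bool is a subclass of int
    )
    return (not schema_ok) or len(reasoning) > max_reasoning_chars


ok = {"content": '{"minutes": 204}', "reasoning_content": "x" * 100}
bad = {"content": "The train was moving for 204 minutes.", "reasoning_content": ""}
print(is_violation(ok, 1500), is_violation(bad, 1500))  # False True
```

The 1500-character threshold in the usage lines is illustrative only; pick one that separates the solo and concurrent reasoning lengths observed on your setup.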

Fix

I have a fix up: record at insert time what each uid is supposed to run, and realign the positional lists from that registry at the step chokepoint (same spot as #1799). With it, the repro goes from 15/15 violations to 0/15, and solo/sequential behavior is unchanged. PR incoming right after this issue.
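The registry idea can be sketched in a few lines. This is a hypothetical illustration of the approach described above, not the actual patch: `registry`, `insert`, `finish` and `realign` are made-up names, and the real fix may differ in where and how it stores this state.

```python
# Hypothetical sketch of the registry approach; not the code from the PR.
registry = {}  # uid -> processors recorded when the request was inserted


def insert(uid, processors):
    """Record at insert time what this uid is supposed to run."""
    registry[uid] = list(processors)


def finish(uid):
    """Forget a finished request so it cannot leave a stale entry behind."""
    registry.pop(uid, None)


def realign(uids):
    """Rebuild the positional list from the registry, one slot per live uid."""
    return [registry.get(uid, []) for uid in uids]


insert(1, [])
insert(2, ["grammar", "thinking_budget"])
aligned = realign([1, 2])
print(aligned)  # [[], ['grammar', 'thinking_budget']]
```

Because the positional list is rebuilt from uid-keyed state at every step, a stale slot cannot shift later rows in this model; each row gets exactly what was registered for its uid.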
