fix(scheduler): realign per-row samplers and logits processors by uid#1824
Conversation
79d4534 to
7d2048a
Compare
jundot#1799 made the generation step crash-safe by normalising None row slots, but the positional drift behind it was still there: a stale or offset slot left in samplers/logits_processors by batch extend/filter/split shifts every row after it, so a request can decode with another request's - or no - sampler and processors. Under concurrent mixed load this silently disables grammar constraints (json_schema responses come back as prose) and thinking budgets (unbounded reasoning), with no error anywhere. Record at insert time what each uid must run, and realign the positional lists from that registry at the step chokepoint - the same place as the jundot#1799 normalisation. A rate-limited warning fires whenever a drift is actually corrected, so the silent corruption becomes observable. The registry is a bounded OrderedDict so a missed cleanup can never grow it unbounded. Repro (concurrent plain + constrained request, 0.3s apart): 10/10 thinking_budget violations and 5/5 grammar violations on main; 0/15 with this fix. Solo and sequential behavior unchanged, byte-identical at temperature 0.
…pletion mlx-lm's BatchGenerator numbers uids per instance starting at zero, so two engines serving concurrently (or an engine reload) produce colliding uid sequences; a registry keyed by uid alone could install one model's sampler and processors on another model's row. Key entries by (id(model), uid) instead - the model object is the one identity both the insert sites and the step chokepoint can see. Also guard the registry with a lock (engines run on separate executor threads and share the module-level dict) and release a request's row at the existing uid-map cleanup site, so heavy grammar processors are not pinned until LRU eviction; the bounded size stays as the backstop. New tests: same uid on two models must not cross-contaminate, and completion unregister drops the row (double-unregister is a no-op).
Each (model, uid) key is inserted exactly once per request, so the OrderedDict already appends it last; plain insertion order is all the oldest-first backstop eviction needs. One dict operation per row instead of two, and the lps copy reads as list(lps or ()).
…shutdown The row registry was only released in the deferred async-remove path, so a request finishing through the synchronous fallback, an aborted request, or a generator reset (failure recovery, reset(), shutdown()) kept its sampler and logits processors pinned until the FIFO backstop — and entries surviving an engine unload were exactly the residue a recycled id(model) could later match. Release the row in every path that retires a uid, and release by model wherever the uid maps are cleared wholesale. Rate-limit the drift warning to one WARNING per 60s window (DEBUG otherwise) so a pathological merge pattern cannot flood the logs. Tests: structural AST checks that every retirement path releases its rows, a model-scoped clear unit test, the pre-fix offset behavior pinned through the empty-registry fallback, and the warning rate limit.
Pull the chokepoint realignment into a pure _realigned_rows function, name the drift comparison rules (_row_drifted: identity fast path stays inline at the caller, two fresh empty lists are equivalent, content comparison last), move the rate-limited log into _log_drift_correction, and type the registry entries as a NamedTuple. No behavior change; the rebuilt lists are still installed unconditionally and drift only drives logging. Two direct unit tests on the pure rebuild.
b77ea27 to
5f59d6a
Compare
|
Thanks for chasing this down and for rebasing it on top of #1845. The uid registry approach is the right fix for the remaining BatchGenerator row drift: it covers samplers as well as logits processors, and it corrects the positional lists at the decode chokepoint instead of only patching one reshape path. One follow-up I'll fold in after merge: native MTP can bypass GenerationBatch._step from its patched next(), so it should call the same row realignment before MTP eligibility and row extraction. That does not block this standard decode fix, and I want to get this row-drift fix in first. This looks good to me, and I'm going to merge it. |
|
This bug was critical for me (lot of batch with structured output on my side). |
Summary
Fixes #1823.
This PR fixes a row-alignment bug in continuous batching.
samplersandlogits_processorsare positional lists that must stay aligned withuids; under mixed concurrent load, stale or shifted slots could make a request run another request's constraints, or none at all.In practice, this could silently disable grammar constraints or thinking budgets.
Changes
(model, uid)when a row is inserted.Tests
Covers the #1823 stale-slot reproduction, cross-model uid collisions, cleanup paths, registry bounds, warning rate limiting, and the existing #1799 normalization case.
Validation: reproduction is red on the base commit and green on this PR; targeted scheduler tests pass.