Skip to content

fix(scheduler): realign per-row samplers and logits processors by uid#1824

Merged
jundot merged 8 commits into
jundot:mainfrom
efortin:fix/batch-row-realignment
Jun 14, 2026
Merged

fix(scheduler): realign per-row samplers and logits processors by uid#1824
jundot merged 8 commits into
jundot:mainfrom
efortin:fix/batch-row-realignment

Conversation

@efortin

@efortin efortin commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #1823.

This PR fixes a row-alignment bug in continuous batching. samplers and logits_processors are positional lists that must stay aligned with uids; under mixed concurrent load, stale or shifted slots could make a request run another request's constraints, or none at all.

In practice, this could silently disable grammar constraints or thinking budgets.

Changes

  • Record the expected sampler and logits processors for each (model, uid) when a row is inserted.
  • Realign batch rows at the generation-step chokepoint before decoding.
  • Key by model as well as uid to avoid cross-engine uid collisions.
  • Release registry rows on completion, abort, recovery, reset, and shutdown.
  • Keep the existing fix: normalise logits_processors row slots dropped to None by batch merge #1799 None-slot normalization as fallback behavior.
  • Log when a drift is corrected, with rate limiting.

Tests

Covers the #1823 stale-slot reproduction, cross-model uid collisions, cleanup paths, registry bounds, warning rate limiting, and the existing #1799 normalization case.

Validation: reproduction is red on the base commit and green on this PR; targeted scheduler tests pass.

efortin added 7 commits June 13, 2026 12:20
jundot#1799 made the generation step crash-safe by normalising None row slots,
but the positional drift behind it was still there: a stale or offset
slot left in samplers/logits_processors by batch extend/filter/split
shifts every row after it, so a request can decode with another
request's - or no - sampler and processors. Under concurrent mixed load
this silently disables grammar constraints (json_schema responses come
back as prose) and thinking budgets (unbounded reasoning), with no error
anywhere.

Record at insert time what each uid must run, and realign the positional
lists from that registry at the step chokepoint - the same place as the
jundot#1799 normalisation. A rate-limited warning fires whenever a drift is
actually corrected, so the silent corruption becomes observable. The
registry is a bounded OrderedDict so a missed cleanup can never grow it
unbounded.

Repro (concurrent plain + constrained request, 0.3s apart): 10/10
thinking_budget violations and 5/5 grammar violations on main; 0/15 with
this fix. Solo and sequential behavior unchanged, byte-identical at
temperature 0.
…pletion

mlx-lm's BatchGenerator numbers uids per instance starting at zero, so
two engines serving concurrently (or an engine reload) produce colliding
uid sequences; a registry keyed by uid alone could install one model's
sampler and processors on another model's row. Key entries by
(id(model), uid) instead - the model object is the one identity both the
insert sites and the step chokepoint can see.

Also guard the registry with a lock (engines run on separate executor
threads and share the module-level dict) and release a request's row at
the existing uid-map cleanup site, so heavy grammar processors are not
pinned until LRU eviction; the bounded size stays as the backstop.

New tests: same uid on two models must not cross-contaminate, and
completion unregister drops the row (double-unregister is a no-op).
Each (model, uid) key is inserted exactly once per request, so the
OrderedDict already appends it last; plain insertion order is all the
oldest-first backstop eviction needs. One dict operation per row instead
of two, and the lps copy reads as list(lps or ()).
…shutdown

The row registry was only released in the deferred async-remove path, so a
request finishing through the synchronous fallback, an aborted request, or
a generator reset (failure recovery, reset(), shutdown()) kept its sampler
and logits processors pinned until the FIFO backstop — and entries
surviving an engine unload were exactly the residue a recycled id(model)
could later match.

Release the row in every path that retires a uid, and release by model
wherever the uid maps are cleared wholesale. Rate-limit the drift warning
to one WARNING per 60s window (DEBUG otherwise) so a pathological merge
pattern cannot flood the logs.

Tests: structural AST checks that every retirement path releases its rows,
a model-scoped clear unit test, the pre-fix offset behavior pinned through
the empty-registry fallback, and the warning rate limit.
Pull the chokepoint realignment into a pure _realigned_rows function,
name the drift comparison rules (_row_drifted: identity fast path stays
inline at the caller, two fresh empty lists are equivalent, content
comparison last), move the rate-limited log into _log_drift_correction,
and type the registry entries as a NamedTuple. No behavior change; the
rebuilt lists are still installed unconditionally and drift only drives
logging. Two direct unit tests on the pure rebuild.
@jundot

jundot commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Thanks for chasing this down and for rebasing it on top of #1845. The uid registry approach is the right fix for the remaining BatchGenerator row drift: it covers samplers as well as logits processors, and it corrects the positional lists at the decode chokepoint instead of only patching one reshape path.

One follow-up I'll fold in after merge: native MTP can bypass GenerationBatch._step from its patched next(), so it should call the same row realignment before MTP eligibility and row extraction. That does not block this standard decode fix, and I want to get this row-drift fix in first.

This looks good to me, and I'm going to merge it.

@jundot jundot merged commit 205a23b into jundot:main Jun 14, 2026
4 checks passed
@efortin efortin deleted the fix/batch-row-realignment branch June 14, 2026 09:18
@efortin

efortin commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

This bug was critical for me (lot of batch with structured output on my side).
Thanks a lot, I can move away from my fork :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Grammar and thinking_budget silently stop applying under concurrent mixed load (row misalignment after batch merge — sequel to #1798)

2 participants