Add Gemma 12b Unified Support by neilmehta24 · Pull Request #334 · lmstudio-ai/mlx-engine

neilmehta24 · 2026-06-10T16:05:30Z

Changes:

Delete the legacy "VisionModelKit" and "Vision Add On", now that everything is routed through the batched vision model kit.
Add support for Gemma 4 12B unified. I added some patches to the model to enable APC, as mlx-vlm disables chunked prefill if there is an image in the prompt. Some code is added to ensure that we don't restore the cache in the middle of an image.
Add support for vision feature caching. In v1.8.5, the KV cache did not need to be recomputed, but the image embeddings did. Caching the embeddings gives a substantial speedup for follow-up image requests with large models and large image data. This hooks into the mlx-vlm implementation.
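
For readers who want a concrete picture of two ideas in the list above (not restoring the cache in the middle of an image, and reusing image embeddings across requests), here is a minimal, self-contained sketch. It is not the code in this PR and not an mlx-vlm API: `safe_restore_length`, `ImageFeatureCache`, the half-open `(start, end)` span convention, and the SHA-256 key are all assumptions made for illustration.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")


def safe_restore_length(cached_len: int, image_spans: List[Tuple[int, int]]) -> int:
    """Clamp a cache-restore length so it never lands inside an image span.

    image_spans holds hypothetical half-open (start, end) token ranges that
    are occupied by image placeholder tokens. If cached_len falls strictly
    inside a span, back up to the span start so the whole image is
    prefilled again instead of being split across the restore point.
    """
    for start, end in image_spans:
        if start < cached_len < end:
            return start
    return cached_len


class ImageFeatureCache(Generic[T]):
    """Small LRU cache of image features keyed by a hash of the image bytes."""

    def __init__(self, max_entries: int = 8) -> None:
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, T]" = OrderedDict()

    def get_or_compute(self, image_bytes: bytes, encode: Callable[[bytes], T]) -> T:
        # Hypothetical cache key: a content hash of the raw image bytes.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        features = encode(image_bytes)  # the expensive vision-encoder call
        self._entries[key] = features
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return features
```

With an image occupying tokens 3 through 9, a cached prefix of length 5 would be cut back to 3, while prefixes ending at 3 or at 10 are left alone. The feature cache only calls `encode` the first time it sees a given image, which is the kind of saving described above for follow-up requests that repeat an image.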

Fixes:

Fix MLX thread shutdown bug with a workaround: install_mlx_compile_cache_cleanup_for_thread. This bug is already fixed upstream, so we can remove this workaround after the next MLX release.
Add model compatibility fix for LFM2 VL.
Add patches to Qwen 3.5 to restore generation speed to v1.8.5 levels, as some of the upstream MTP additions caused a slowdown.

will-lms · 2026-06-11T18:19:03Z

+        or target_verify
+        or (isinstance(mask, str) and mask == "left_padded_decode")
+    ):
+        return OriginalVlmQwen3_5AttentionCall(


I was thinking about this path with respect to caching. My understanding is that we'll route here if a prompt has images, and to Qwen3NextAttention if it does not.

Have we tested the case where a vision prompt comes in, restores a cache that was previously generated using the text-only route, and then generates? It is not obvious to me that it would work as expected, but hopefully should be close enough either way. Just wondering if we tested such a branch (or if I'm misunderstanding the caching).

This works as intended, though there's no explicit test case for it.

will-lms · 2026-06-11T18:50:03Z

Reviewed with agentic help. Overall, I am aligned with the decisions.

neilmehta24 added 30 commits June 3, 2026 17:30

Keep model-prefixed Mistral3 vision weights

f501192

Sync Qwen3.5 quantization config keys

1d1c756

Accept deserialized Qwen nested configs

d53839e

Remap Qwen3.5 vision weight keys before filtering

27c7606

Route Qwen3.5 target verify attention to VLM

e3a419c

Materialize batch state before cache mutation

e456ba0

Handle Qwen3.5 attention position embeddings

9dd4811

Slice Gemma4 token type prompt kwargs

ce981a0

Remove legacy vision kit

14eadf7

Remove legacy vision model kit

fe35245

Remove legacy vision add-on tests

2713eb4

Port vision feature memoizer to batched vision

d8a92a1

Align vision feature cache with upstream

9e332a4

Load VLM image processor before processor

05740e9

Handle Gemma4 unified visual APC

00d313e

Simplify Gemma4 unified visual prefill

6a52fa6

Reuse Gemma4 prefix before new images

4729546

Force safe mlx-vlm model loading

3818e07

Restore Qwen decode fast path

0cdae5e

Remove legacy Qwen image parity test

95740e9

Use loaded processors in VLM parity tests

5db7ddc

Work around MLX threaded compile cache cleanup

1cefe5f

Fix VLM cache restore CI failures

303b788

Fix Gemma4 cache follow-up test prompt

0b51c03

Hardcode Gemma4 cache test prompt

955c03f

Restore prompt cache save priority

5185b9b

Update generated requirements

126351b

Handle Qwen left-padded decode mask

ae55e21

Handle Qwen left-padded text decode

970a7c7

Limit Qwen left-padded positions to decode

bfdd7b9

Add Gemma4 12B VLM parity coverage

24b71df

github-actions Bot added the "CLA signed" label Jun 10, 2026

neilmehta24 marked this pull request as ready for review June 10, 2026 19:06

Preserve VLM backend errors for active requests

af14231

will-lms reviewed Jun 11, 2026


will-lms approved these changes Jun 11, 2026


neilmehta24 merged commit 9445b31 into main Jun 11, 2026
2 checks passed

neilmehta24 deleted the neil/mlx-upstream-sync branch June 11, 2026 19:02

github-actions Bot locked and limited conversation to collaborators Jun 11, 2026

neilmehta24 merged 32 commits into main from neil/mlx-upstream-sync
