Add Gemma 12b Unified Support #334

Merged
neilmehta24 merged 32 commits into main from neil/mlx-upstream-sync on Jun 11, 2026

Conversation

@neilmehta24

Member

Changes:

  • Delete the legacy "VisionModelKit" and "Vision Add On", now that everything goes through the batched vision model kit.
  • Add support for gemma 4 12 unified. I added some patches to the model to enable APC, because mlx-vlm disables chunked prefill if there is an image in the prompt. Some code is also added to ensure that we don't restore the cache in the middle of an image.
  • Add support for vision feature caching. In v1.8.5, the KV cache did not need to be recomputed for follow-up requests, but the image embeddings did. Caching them gives a substantial speedup for follow-up image requests with large models and large image data. This hooks into the mlx-vlm implementation.
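To illustrate the "don't restore the cache in the middle of an image" rule, here is a minimal sketch, not the code from this PR. It assumes image placeholder tokens are tracked as half-open `[start, end)` token ranges; the function name `safe_restore_length` and that span representation are hypothetical and chosen only for this example.

```python
from typing import List, Tuple


def safe_restore_length(prefix_len: int, image_spans: List[Tuple[int, int]]) -> int:
    """Return the longest prefix length <= prefix_len that does not split an image.

    image_spans holds half-open [start, end) token ranges covered by image
    placeholder tokens (a hypothetical representation for this sketch).
    """
    for start, end in image_spans:
        # A cut strictly inside the span would leave part of the image in the
        # restored cache and part outside it, so back up to the image start.
        if start < prefix_len < end:
            return start
    return prefix_len


if __name__ == "__main__":
    spans = [(2, 6)]  # tokens 2..5 belong to one image
    for candidate in (2, 4, 6, 7):
        print(candidate, "->", safe_restore_length(candidate, spans))
```

A cut that lands exactly on an image boundary is left alone; only a cut strictly inside an image is moved back to where that image begins.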

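The idea behind vision feature caching can be sketched as a small LRU cache keyed by a hash of the image bytes. This is only an illustration under stated assumptions: the PR hooks into the mlx-vlm implementation, and nothing below is mlx-vlm's API. `VisionFeatureCache`, `get_or_compute`, and the `encode` callback are hypothetical names.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class VisionFeatureCache:
    """Tiny LRU cache from image bytes to computed embeddings (hypothetical sketch)."""

    def __init__(self, max_entries: int = 8) -> None:
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self._max_entries = max_entries

    def get_or_compute(
        self, image_bytes: bytes, encode: Callable[[bytes], List[float]]
    ) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        features = encode(image_bytes)  # stands in for the expensive vision pass
        self._entries[key] = features
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return features
```

With a cache like this, a follow-up request that reuses the same image skips the `encode` call entirely, which is where the speedup for large images would come from.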
Fixes:

  • Fix an MLX thread shutdown bug with a workaround: install_mlx_compile_cache_cleanup_for_thread. This bug is already fixed upstream, so we can remove the workaround after the next MLX release.
  • Add model compatibility fix for LFM2 VL
  • Add patches to Qwen 3.5 to restore generation speed to v1.8.5 levels, as some of the upstream MTP additions caused a slowdown.

github-actions bot added the "CLA signed" label (indicates that all contributors have signed) on Jun 10, 2026
neilmehta24 marked this pull request as ready for review on June 10, 2026 19:06
        or target_verify
        or (isinstance(mask, str) and mask == "left_padded_decode")
    ):
        return OriginalVlmQwen3_5AttentionCall(

Contributor

I was thinking about this path with respect to caching. My understanding is that we'll route here if a prompt has images, and to Qwen3NextAttention if it does not.

Have we tested the case where a vision prompt comes in, restores a cache that was previously generated using the text-only route, and then generates? It is not obvious to me that it would work as expected, though hopefully it is close enough either way. Just wondering if we tested such a branch (or if I'm misunderstanding the caching).

Member Author

This works as intended, though there's no explicit test case for it.

@will-lms

Contributor

Reviewed with agentic help. Overall, I am aligned with the decisions.

neilmehta24 merged commit 9445b31 into main on Jun 11, 2026 (2 checks passed)
neilmehta24 deleted the neil/mlx-upstream-sync branch on June 11, 2026 19:02
github-actions bot locked and limited conversation to collaborators on Jun 11, 2026