Add CoPE (Clipped RoPE) soft-clipping for zero-shot context extension #1344
Open
machiabeli wants to merge 1 commit into
Implements the soft-clipping strategy from "CoPE: Clipped RoPE as A
Scalable Free Lunch for Long Context LLMs" (arXiv:2602.05258) as an
opt-in rope_parameters option. Low-frequency components whose rotation
periods exceed the pre-training context window are attenuated with a
raised-cosine taper applied to inv_freq inside MRoPERotaryEmbedding,
so it propagates through every mrope style (interleaved, chunked,
sectioned) and the fused Metal apply path with no kernel changes.
Enabled via:
"rope_parameters": {"rope_type": "cope",
"original_max_position_embeddings": N}
with optional explicit "clip_n". No behavior change for any existing
config. Wired through Qwen3_5RotaryEmbedding for the Qwen3.5/3.6
family (e.g. 262K-native models extended toward 1M).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Author
Evidence grid for the shared clip math (perplexity deltas + depth × length NIAH, zero-shot on Qwen3.6-35B-A3B): ml-explore/mlx-lm#1387 (comment) — the |
Collaborator
@machiabeli Are there any models that use this in their config today, or is this just speculative?
Author
I used it successfully with Qwen3.6-35B-A3B. I extended the context window to approximately 400K tokens with near-zero loss.
Add CoPE (Clipped RoPE) as an opt-in rope option
This adds the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (reference implementation) to MRoPERotaryEmbedding, enabling zero-shot context extension past a model's native window via a config edit: no weight changes, and no behavior change for any existing config.
Text-side counterpart: ml-explore/mlx-lm#1387.
Method
RoPE extrapolation past the trained context fails primarily because of the lowest-frequency components: their rotation periods exceed the pre-training window, so they never complete a full rotation during training and go out-of-distribution on longer sequences. CoPE attenuates exactly those components with a raised-cosine (Hann) taper on inv_freq.
- The taper is applied where inv_freq is computed, in MRoPERotaryEmbedding.__init__, so it propagates through every mrope style (interleaved, chunked, sectioned) and the fused Metal apply path with no kernel changes. A fully clipped component is simply inv_freq = 0 (identity rotation).
- Any component whose rotation period 2π/inv_freq[i] is greater than original_max_position_embeddings is clipped (overridable via clip_n). For the Qwen3.5/3.6 family (rope_theta=10M, 64 rotary dims, native 262144) this clips 10 of 32 components.
- Qwen3_5RotaryEmbedding now forwards rope_parameters so the Qwen3.5/3.6 family (262K-native) picks this up directly from config.
Usage
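As a rough sketch of enabling the option, the snippet below builds the rope_parameters entry described in the commit message and reproduces the auto-sizing arithmetic for the Qwen3.5/3.6 numbers quoted above. The helper name count_clipped and the frequency schedule inv_freq[i] = rope_theta ** (-2i / rotary_dims) are assumptions made for illustration; they are not taken from this PR's code, and any other keys already present in a model's rope_parameters are assumed to stay untouched.

```python
import json
import math


def count_clipped(rope_theta, rotary_dims, original_max_pos):
    """Count components whose rotation period exceeds the pre-training window.

    Assumes the standard RoPE schedule
    inv_freq[i] = rope_theta ** (-2 * i / rotary_dims), i = 0 .. rotary_dims/2 - 1.
    """
    clipped = 0
    for i in range(rotary_dims // 2):
        inv_freq = rope_theta ** (-2.0 * i / rotary_dims)
        period = 2.0 * math.pi / inv_freq  # positions per full rotation
        if period > original_max_pos:
            clipped += 1
    return clipped


# Keys as named in the commit message; clip_n is the optional override.
rope_parameters = {
    "rope_type": "cope",
    "original_max_position_embeddings": 262144,
    # "clip_n": 10,  # optional: set explicitly instead of auto-sizing
}

n_clipped = count_clipped(
    rope_theta=10_000_000, rotary_dims=64, original_max_pos=262144
)

print(json.dumps({"rope_parameters": rope_parameters}, indent=2))
print(n_clipped)  # 10 of the 32 components for these values
```

Under these assumptions the boundary falls between component 21 (period of roughly 247K positions) and component 22 (roughly 408K), which is consistent with the 10-of-32 figure above.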
Testing
- mlx_vlm/tests/test_cope_rope.py: auto clip sizing, head-frequency preservation, taper monotonicity, frozen tail, explicit clip_n / no-op / bounds, ValueError without sizing info, MRoPERotaryEmbedding integration (on and off), and the Qwen3_5RotaryEmbedding passthrough. All passing.
- Configured through rope_parameters exactly as a user would enable it: taper verified in the constructed rotary embeddings on every full-attention layer (inv_freq[-1] == 0), with coherent text and image generation through the interleaved mrope + fused Metal apply path (a zero-frequency component reduces to the identity rotation, so no kernel changes are needed). Note: the config pipeline normalizes rope_type -> type; the hook accepts both keys.
- black clean.
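The properties listed above can be illustrated with a small pure-Python sketch. The function cope_taper and its half-Hann weighting are hypothetical: the exact window used by this PR and by the paper's reference implementation may differ, and the real code operates on MLX arrays inside MRoPERotaryEmbedding rather than on Python lists.

```python
import math


def cope_taper(inv_freq, clip_n):
    """Attenuate the last clip_n (lowest-frequency) entries of inv_freq.

    Hypothetical half-Hann weights: w_k = 0.5 * (1 + cos(pi * (k + 1) / clip_n))
    for k = 0 .. clip_n - 1, so the weights fall from just under 1 to 0.
    """
    if not 0 <= clip_n <= len(inv_freq):
        raise ValueError("clip_n must be between 0 and len(inv_freq)")
    head = len(inv_freq) - clip_n
    out = list(inv_freq[:head])  # unclipped components are copied unchanged
    for k in range(clip_n):
        weight = 0.5 * (1.0 + math.cos(math.pi * (k + 1) / clip_n))
        out.append(inv_freq[head + k] * weight)
    return out


# Assumed standard RoPE schedule with the Qwen3.5/3.6 values quoted in the PR.
rope_theta, rotary_dims, clip_n = 10_000_000, 64, 10
inv_freq = [rope_theta ** (-2.0 * i / rotary_dims) for i in range(rotary_dims // 2)]
tapered = cope_taper(inv_freq, clip_n)

head = len(inv_freq) - clip_n
# Head-frequency preservation: untouched components match exactly.
assert tapered[:head] == inv_freq[:head]
# Taper monotonicity: the clipped region never increases.
assert all(a >= b for a, b in zip(tapered[head:], tapered[head + 1:]))
# Frozen tail: the lowest-frequency component is driven to zero.
assert abs(tapered[-1]) < 1e-12
# No-op: clip_n = 0 leaves every component unchanged.
assert cope_taper(inv_freq, 0) == inv_freq
```

A component scaled to zero yields a rotation angle of zero at every position, which is why the existing apply path can treat it as an identity rotation.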