Add CoPE (Clipped RoPE) soft-clipping for zero-shot context extension #1344

Open
machiabeli wants to merge 1 commit into Blaizzy:main from machiabeli:feat/cope-rope

@machiabeli machiabeli commented Jun 10, 2026

Add CoPE (Clipped RoPE) as an opt-in rope option

This adds the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (reference implementation) to MRoPERotaryEmbedding. It enables zero-shot context extension past a model's native window via a config edit, with no weight changes and no behavior change for any existing config.

Text-side counterpart: ml-explore/mlx-lm#1387.

Method

RoPE extrapolation past the trained context fails primarily because of the lowest-frequency components — their rotation periods exceed the pre-training window, so they never complete a full rotation during training and go out-of-distribution on longer sequences. CoPE attenuates exactly those components with a raised-cosine (Hann) taper on inv_freq:

  • Single hook point — the taper is applied where inv_freq is computed in MRoPERotaryEmbedding.__init__, so it propagates through every mrope style (interleaved, chunked, sectioned) and the fused Metal apply path with no kernel changes. A fully clipped component is simply inv_freq=0 (identity rotation).
  • Auto-sized clip — every component with period 2π/inv_freq[i] greater than original_max_position_embeddings is clipped (overridable via clip_n). For the Qwen3.5/3.6 family (rope_theta=10M, 64 rotary dims, native 262144) this clips 10 of 32 components.
  • Smooth rolloff — the mask goes 1 → 0 across the clipped range: the boundary component is untouched, the lowest-frequency component is frozen, avoiding the spectral leakage / attention ringing of a hard cutoff. Unlike YaRN, in-distribution frequencies are untouched, so short-context behavior is preserved.
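
To make the three points above concrete, here is a minimal NumPy sketch of the auto-sized taper. It is not the code in this PR: the function name `cope_inv_freq` is hypothetical, and the exact taper formula (a half-cosine sampled with `np.linspace` so the first clipped component keeps weight 1 and the last gets weight 0) is an assumption based on the description, not a copy of the implementation.

```python
import numpy as np


def cope_inv_freq(dim, theta, original_max_position_embeddings=None, clip_n=None):
    """Illustrative sketch (hypothetical helper): RoPE inv_freq with a
    raised-cosine taper on the low-frequency tail."""
    # Standard RoPE inverse frequencies, highest frequency first.
    inv_freq = 1.0 / (theta ** (np.arange(0, dim, 2, dtype=np.float64) / dim))

    if clip_n is None:
        if original_max_position_embeddings is None:
            raise ValueError("need clip_n or original_max_position_embeddings")
        # Auto-size: count components whose rotation period 2*pi/inv_freq
        # is longer than the native context window.
        periods = 2.0 * np.pi / inv_freq
        clip_n = int(np.sum(periods > original_max_position_embeddings))

    if not 0 <= clip_n <= inv_freq.size:
        raise ValueError("clip_n out of range")
    if clip_n == 0:
        return inv_freq, clip_n  # no-op

    # Assumed taper: half-cosine from 1 (boundary component) down to 0
    # (lowest-frequency component). With clip_n == 1 this sketch leaves the
    # single component untouched; the real edge-case handling may differ.
    t = np.linspace(0.0, 1.0, clip_n)
    mask = 0.5 * (1.0 + np.cos(np.pi * t))

    tapered = inv_freq.copy()
    tapered[-clip_n:] *= mask  # the clipped components are the tail
    return tapered, clip_n


# Qwen3.5/3.6-style parameters quoted in the description above.
tapered, n = cope_inv_freq(64, 1e7, original_max_position_embeddings=262144)
print(n)  # 10
```

With these parameters the period first exceeds 262144 at component index 22, so indices 22 through 31 are tapered, which matches the "10 of 32" figure in the description.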

Qwen3_5RotaryEmbedding now forwards rope_parameters so the Qwen3.5/3.6 family (262K-native) picks this up directly from config.

Usage

"rope_parameters": {
  "rope_type": "cope",
  "original_max_position_embeddings": 262144
}
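
For anyone scripting the config edit, the following is a small sketch of one way to apply it. The helper name `enable_cope` is hypothetical, and it assumes `rope_parameters` sits at the top level of `config.json`; some model configs may nest it elsewhere (for example under a text sub-config), in which case the lookup would need adjusting. Existing keys in `rope_parameters` are kept, and the optional `clip_n` override is written only when given.

```python
import json
from pathlib import Path


def enable_cope(config_path, original_max_position_embeddings, clip_n=None):
    """Hypothetical helper: write a CoPE rope_parameters block into a config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text())

    # Assumption: rope_parameters lives at the top level of the config.
    # Copy any existing entries so unrelated rope settings are preserved.
    rope = dict(config.get("rope_parameters") or {})
    rope["rope_type"] = "cope"
    rope["original_max_position_embeddings"] = original_max_position_embeddings
    if clip_n is not None:
        rope["clip_n"] = clip_n  # optional explicit override of the auto-sized clip

    config["rope_parameters"] = rope
    path.write_text(json.dumps(config, indent=2))
    return config
```

A call such as `enable_cope("path/to/config.json", 262144)` would reproduce the fragment above; the path is a placeholder.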

Testing

  • mlx_vlm/tests/test_cope_rope.py: auto clip sizing, head-frequency preservation, taper monotonicity, frozen tail, explicit clip_n / no-op / bounds, ValueError without sizing info, MRoPERotaryEmbedding integration (on and off), and the Qwen3_5RotaryEmbedding passthrough. All passing.
  • End-to-end on Qwen3.6-35B-A3B (4-bit), config-driven via rope_parameters exactly as a user would enable it: taper verified in the constructed rotary embeddings on every full-attention layer (inv_freq[-1] == 0), with coherent text and image generation through the interleaved mrope + fused Metal apply path (a zero-frequency component reduces to the identity rotation, so no kernel changes are needed). Note: the config pipeline normalizes rope_type -> type; the hook accepts both keys.
  • Empirical validation of the same clip on the text side (Qwen3.6-35B-A3B, native 262K): needle-in-a-haystack at 393K tokens passes where default RoPE fails outright; short-context next-token distribution preserved (KL 0.0067 at 8K); no speed cost — details in mlx-lm#1387.
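
The identity-rotation claim used above can be checked with a few lines of NumPy. This is a standalone illustration of the 2-D rotation RoPE applies to each feature pair, not the fused Metal path; `rotate_pair` is a hypothetical name.

```python
import numpy as np


def rotate_pair(x1, x2, position, inv_freq):
    """Rotate one (x1, x2) feature pair by the RoPE angle position * inv_freq."""
    angle = position * inv_freq
    c, s = np.cos(angle), np.sin(angle)
    return x1 * c - x2 * s, x1 * s + x2 * c


# A fully clipped component (inv_freq == 0) has angle 0 at every position,
# so the pair passes through unchanged however long the sequence is.
frozen = rotate_pair(0.3, -1.2, position=393216, inv_freq=0.0)

# A non-zero frequency at the same position does rotate the pair.
rotated = rotate_pair(0.3, -1.2, position=393216, inv_freq=1e-5)
```

Because the frozen component needs no special casing in the rotation itself, an apply path that multiplies positions by `inv_freq` handles it without modification.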

Formatting: black reports no changes.

Implements the soft-clipping strategy from "CoPE: Clipped RoPE as A
Scalable Free Lunch for Long Context LLMs" (arXiv:2602.05258) as an
opt-in rope_parameters option. Low-frequency components whose rotation
periods exceed the pre-training context window are attenuated with a
raised-cosine taper applied to inv_freq inside MRoPERotaryEmbedding,
so it propagates through every mrope style (interleaved, chunked,
sectioned) and the fused Metal apply path with no kernel changes.

Enabled via:
  "rope_parameters": {"rope_type": "cope",
                      "original_max_position_embeddings": N}
with optional explicit "clip_n". No behavior change for any existing
config. Wired through Qwen3_5RotaryEmbedding for the Qwen3.5/3.6
family (e.g. 262K-native models extended toward 1M).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@machiabeli (Author)

Evidence grid for the shared clip math (perplexity deltas + depth × length NIAH, zero-shot on Qwen3.6-35B-A3B): ml-explore/mlx-lm#1387 (comment) — the inv_freq produced there is bit-identical to this implementation.

@lucasnewman (Collaborator)

@machiabeli Are there any models that use this in their config today? Or is this just speculative?

@machiabeli (Author)

> @machiabeli Are there any models that use this in their config today? Or is this just speculative?

I used it successfully with Qwen3.6-35B-A3B: I extended the context window to approximately 400K tokens with near-zero loss.
