Add CoPE (Clipped RoPE) rope type for zero-shot context extension by machiabeli · Pull Request #1387 · ml-explore/mlx-lm

machiabeli · 2026-06-10T05:46:42Z

Add CoPE (Clipped RoPE) as an opt-in rope type

This adds CoPERoPE to rope_utils.py, implementing the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (reference implementation). It enables zero-shot context extension past a model's native window via a config edit, with no weight changes and no behavior change for any existing config.

Method

RoPE extrapolation past the trained context fails primarily because of the lowest-frequency components — those whose rotation period exceeds the pre-training window never complete a full rotation during training, so longer sequences expose them to angle ranges the model has never seen. CoPE attenuates exactly those components with a raised-cosine (Hann) taper:

Auto-sized clip: every component whose period 2π·freqs[i] exceeds original_max_position_embeddings is clipped (overridable via clip_n). Here freqs[i] is the inverse frequency of component i, so larger values rotate more slowly. For example, with Qwen3.5/3.6-family configs (rope_theta=10M, 64 rotary dims, native 262144), this clips 10 of 32 components.
Smooth rolloff: the mask goes 1 → 0 across the clipped range, so the boundary component is untouched and the lowest-frequency component is frozen entirely. The taper avoids the spectral leakage and attention ringing that a hard cutoff causes.
Frozen components use inf freqs: this is the same identity-rotation convention ProportionalRoPE already uses, so mx.fast.rope handles everything with no new kernel paths.

Unlike YaRN, in-distribution frequencies are left untouched, so short-context behavior is preserved without an mscale correction.
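The three steps above can be sketched in plain Python. This is a minimal illustration, not the PR's CoPERoPE code: the function name `cope_freqs` is hypothetical, and two details are assumptions, namely that the taper is applied by dividing each inverse frequency by its mask value (so a mask of 0 becomes inf) and that the taper runs from the first clipped component (mask 1) to the last one (mask 0).

```python
import math


def cope_freqs(dims, base, original_max_pos, clip_n=None):
    """Return (freqs, clip_n) with a raised-cosine taper on the slowest components.

    freqs are inverse frequencies: the rotation angle at position p for
    component i is p / freqs[i], so freqs[i] = inf means "never rotates".
    """
    # Standard RoPE inverse frequencies, fastest component first.
    freqs = [base ** (i / dims) for i in range(0, dims, 2)]
    n = len(freqs)

    # Auto-size: count components whose full period exceeds the trained window.
    if clip_n is None:
        clip_n = sum(1 for f in freqs if 2 * math.pi * f > original_max_pos)
    clip_n = max(0, min(clip_n, n))

    out = list(freqs)
    for k in range(clip_n):
        # t runs 0 -> 1 across the clipped range (assumed endpoint convention).
        # With clip_n == 1 this leaves the single component untouched; the PR
        # text does not say how that edge case is handled.
        t = k / (clip_n - 1) if clip_n > 1 else 0.0
        mask = 0.5 * (1.0 + math.cos(math.pi * t))  # Hann taper, 1 -> 0
        idx = n - clip_n + k
        # Assumption: attenuating the rotation rate by `mask` is the same as
        # dividing the inverse frequency by it; mask == 0 freezes the component.
        out[idx] = freqs[idx] / mask if mask > 1e-12 else math.inf
    return out, clip_n


if __name__ == "__main__":
    freqs, clipped = cope_freqs(dims=64, base=1e7, original_max_pos=262144)
    print(clipped, "of", len(freqs), "components clipped")
```

For the Qwen-style settings quoted above (`dims=64`, `base=1e7`, `original_max_pos=262144`), the auto-sizing step in this sketch selects 10 of the 32 components, matching the count in the description. Passing `clip_n=0` returns the unmodified RoPE table.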

Usage

"rope_parameters": {
  "rope_type": "cope",
  "original_max_position_embeddings": 262144
}

original_max_position_embeddings falls back to the model's max_position_embeddings when unset; clip_n may be set explicitly to override the derivation.
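As a sketch of the config edit, the helper below adds the CoPE keys to an already-loaded config dict. The key names come from the usage snippet above; the helper name `with_cope` and the sample config are illustrative assumptions, and existing entries under `rope_parameters` (such as a theta value) are kept rather than overwritten.

```python
import copy


def with_cope(config, original_max=None, clip_n=None):
    """Return a copy of `config` with rope_type set to "cope".

    `config` is a plain dict as loaded from a model's config.json.
    The input dict is not modified.
    """
    cfg = copy.deepcopy(config)
    # Keep whatever is already under rope_parameters (hypothetical contents).
    rope = dict(cfg.get("rope_parameters") or {})
    rope["rope_type"] = "cope"
    if original_max is not None:
        # When omitted, the PR says max_position_embeddings is used instead.
        rope["original_max_position_embeddings"] = original_max
    if clip_n is not None:
        # Explicit override of the auto-derived clip count.
        rope["clip_n"] = clip_n
    cfg["rope_parameters"] = rope
    return cfg


if __name__ == "__main__":
    sample = {"max_position_embeddings": 262144,
              "rope_parameters": {"rope_theta": 10_000_000}}
    print(with_cope(sample, original_max=262144))
```

Returning a copy keeps the original config available for an A/B comparison against default RoPE.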

Testing

Unit tests added following the existing test_rope conventions: dispatch + auto clip sizing, head-frequency preservation, taper monotonicity, frozen tail, explicit clip_n override, max_position_embeddings fallback, clip_n=0 equivalence with default RoPE, and a no-mutation test matching the SuScaled/Yarn ones.
Zero-shot validation on Qwen3.6-35B-A3B (4-bit, native 262K): needle-in-a-haystack at 393K tokens (1.5× native, needle at 65% depth) — default RoPE fails (no needle awareness); CoPE retrieves the exact answer. Short-context next-token distribution is preserved (same argmax, KL 0.0067 vs default at 8K), with no measurable speed difference (the change is constants in the frequency table).
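The short-context check (same argmax, small KL) can be reproduced from two logit vectors for the same next-token position. The sketch below is an assumption about how such a comparison might be computed, not the script used for the numbers above; in particular the direction KL(default || CoPE) is a guess, since the description does not state it.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats between two next-token distributions."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def same_argmax(ref_logits, test_logits):
    """True when both distributions pick the same top token."""
    top = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    return top(ref_logits) == top(test_logits)


if __name__ == "__main__":
    default_logits = [2.0, 0.5, -1.0]  # toy values, not model output
    cope_logits = [1.8, 0.6, -0.9]     # toy values, not model output
    print(same_argmax(default_logits, cope_logits),
          kl_divergence(default_logits, cope_logits))
```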

pre-commit/black clean.

Implements the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (arXiv:2602.05258) as an opt-in rope_type. Low-frequency components whose rotation periods exceed the pre-training context window are attenuated with a raised-cosine taper, eliminating the out-of-distribution positions that break extrapolation past the native context, without YaRN-style interpolation of the in-distribution frequencies.

Enabled via rope_parameters/rope_scaling: {"rope_type": "cope", "original_max_position_embeddings": N} with optional explicit "clip_n". No behavior change for any existing config.

Validated zero-shot on Qwen3.6-35B-A3B (native 262K, theta=10M, 10/32 components clipped): 393K-token needle retrieval passes where default RoPE fails, short-context next-token distribution preserved (KL 0.0067), no speed cost.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

machiabeli · 2026-06-10T07:44:13Z

@/tmp/pr1387-comment.md

machiabeli · 2026-06-10T09:41:14Z

NIAH Evidence Grid — Zero-Shot Context Extension

Ran a full needle-in-a-haystack sweep on Qwen3.6-35B-A3B-4bit (M3 Ultra 512GB, mlx_lm 0.31.3) with the CoPE rope patch applied at runtime. The model's trained context window is 262,144 tokens.

Results

Context Length	% Past Trained	Depth 10%	Depth 35%	Depth 65%	Depth 90%
200,000	within trained	✅	—	✅	—
314,308	+20%	✅	✅	✅	✅
442,211	+69%	✅	✅	✅	✅

All needles were found correctly at every tested depth. With CoPE applied, the model retrieved every needle in this sweep at up to 69% beyond its trained context window, with no fine-tuning and no RoPE scaling, just the clipped rotation.
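The "% Past Trained" column follows from the trained window of 262,144 tokens. A short check of the arithmetic (the helper name is illustrative):

```python
TRAINED_WINDOW = 262_144  # trained context window stated above


def pct_past_trained(context_len, trained=TRAINED_WINDOW):
    """Percentage by which a context length exceeds the trained window."""
    return 100.0 * (context_len - trained) / trained


if __name__ == "__main__":
    for n in (314_308, 442_211):
        print(f"{n:,}: +{round(pct_past_trained(n))}%")
```

The two extended rows come out at roughly 19.9% and 68.7%, which round to the +20% and +69% shown in the table.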

Methodology

Haystack: repeated filler text with a single 6-digit needle inserted at the target depth percentage
Generation: 64 max tokens, greedy decode, answer extracted from response
Each 442K point took ~20 min (prefill-dominated, quadratic attention)
Sweep script and full logs at commit 44f81ec
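For readers without access to the sweep script, the haystack construction and answer extraction described above can be sketched as follows. This is not the script from the commit: the needle wording, the filler sentence, and measuring depth in filler units rather than tokens are all assumptions made for illustration.

```python
import re


def build_haystack(filler, n_units, depth, needle_value):
    """Repeat `filler` n_units times and insert one needle sentence.

    `depth` is a fraction in [0, 1]; 0.65 places the needle about 65% of
    the way through the filler units (a proxy for token depth).
    """
    needle = f"The secret number is {needle_value}."  # hypothetical wording
    units = [filler] * n_units
    units.insert(round(depth * n_units), needle)
    return " ".join(units)


def extract_answer(response):
    """Return the first standalone 6-digit number in the response, if any."""
    match = re.search(r"\b\d{6}\b", response)
    return match.group(0) if match else None


if __name__ == "__main__":
    hay = build_haystack("The grass is green.", 1000, 0.65, "482913")
    prompt = hay + "\nWhat is the secret number?"
    # A model call would go here; the reply below is a stand-in string.
    print(extract_answer("The secret number is 482913."))
```

A real sweep would size the haystack by token count with the model's tokenizer and compare `extract_answer` against the inserted value for each (length, depth) cell.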

Short-Context Sanity

Perplexity at 4K and 16K tokens is unchanged from baseline — CoPE doesn't degrade in-distribution performance.

machiabeli mentioned this pull request Jun 10, 2026

Add CoPE (Clipped RoPE) soft-clipping for zero-shot context extension Blaizzy/mlx-vlm#1344

Open

machiabeli wants to merge 1 commit into ml-explore:main from machiabeli:feat/cope-rope
