Add CoPE (Clipped RoPE) rope type for zero-shot context extension #1387

Open
machiabeli wants to merge 1 commit into ml-explore:main from machiabeli:feat/cope-rope

@machiabeli

Add CoPE (Clipped RoPE) as an opt-in rope type

This adds CoPERoPE to rope_utils.py, implementing the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (reference implementation). It enables zero-shot context extension past a model's native window via a config edit, with no weight changes and no behavior change for any existing config.

Method

RoPE extrapolation past the trained context fails primarily because of the lowest-frequency components — those whose rotation period exceeds the pre-training window never complete a full rotation during training, so longer sequences expose them to angle ranges the model has never seen. CoPE attenuates exactly those components with a raised-cosine (Hann) taper:

  • Auto-sized clip: every component whose period 2π·freqs[i] exceeds original_max_position_embeddings is clipped (overridable via clip_n). For example, Qwen3.5/3.6-family configs (rope_theta=10M, 64 rotary dims, native 262144) get 10 of 32 components clipped.
  • Smooth rolloff: the mask goes 1 → 0 across the clipped range, so the boundary component is untouched and the lowest-frequency component is frozen entirely. The taper avoids the spectral leakage / attention ringing that a hard cutoff causes.
  • Frozen components use inf freqs: this is the same identity-rotation convention ProportionalRoPE already uses, so mx.fast.rope handles everything with no new kernel paths.
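For reviewers who want to check the numbers, the three points above can be sketched in plain Python. This is an illustration, not the code in this PR: `cope_freqs` is a hypothetical helper, the table is assumed to follow the period/2π convention implied by "period 2π·freqs[i]", and dividing each entry by the mask (so that a mask of 0 becomes an inf entry) is an assumption about how the taper is applied.

```python
import math


def cope_freqs(dims, base, original_max_position_embeddings, clip_n=None):
    """Hypothetical sketch of a clipped-RoPE frequency table (not the PR's code).

    Entry i is assumed to hold period / (2*pi) for rotary component i, so the
    rotation angle at position p is p / freqs[i] and an inf entry never rotates.
    """
    half = dims // 2
    freqs = [base ** (2 * i / dims) for i in range(half)]

    # Auto-sized clip: count components whose period exceeds the trained window.
    if clip_n is None:
        clip_n = sum(
            1 for f in freqs if 2 * math.pi * f > original_max_position_embeddings
        )

    start = half - clip_n
    for k in range(clip_n):
        # Raised-cosine (Hann) taper: 1 at the boundary, 0 at the lowest frequency.
        # Assumption: a single clipped component is frozen outright.
        t = k / (clip_n - 1) if clip_n > 1 else 1.0
        mask = 0.5 * (1.0 + math.cos(math.pi * t))
        if mask < 1e-12:
            freqs[start + k] = math.inf  # frozen: identity rotation
        else:
            freqs[start + k] /= mask  # slower rotation, i.e. a longer period
    return freqs, clip_n


freqs, n = cope_freqs(64, 1e7, 262144)
print(n)  # 10 clipped components out of 32
```

With rope_theta=10M and 64 rotary dims, component 21 has a period of roughly 247K positions and component 22 roughly 408K, so components 22 through 31 fall past the 262144-token window, which matches the 10 of 32 quoted above.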

Unlike YaRN, in-distribution frequencies are left untouched, so short-context behavior is preserved without an mscale correction.

Usage

"rope_parameters": {
  "rope_type": "cope",
  "original_max_position_embeddings": 262144
}

original_max_position_embeddings falls back to the model's max_position_embeddings when unset; clip_n may be set explicitly to override the derivation.
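The fallback described above amounts to something like the following. `resolve_cope_params` is a hypothetical name used only for illustration; the actual dispatch in rope_utils.py may be structured differently.

```python
def resolve_cope_params(rope_parameters, max_position_embeddings):
    """Hypothetical helper: resolve CoPE settings from a config dict.

    Returns None for any other rope_type, so existing configs are unaffected.
    """
    if rope_parameters.get("rope_type") != "cope":
        return None
    original = rope_parameters.get("original_max_position_embeddings")
    if original is None:
        # Fall back to the model's own context length when unset.
        original = max_position_embeddings
    return {
        "original_max_position_embeddings": original,
        # None means "derive clip_n from the frequency table".
        "clip_n": rope_parameters.get("clip_n"),
    }


print(resolve_cope_params({"rope_type": "cope"}, 262144))
```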

Testing

  • Unit tests added following the existing test_rope conventions: dispatch + auto clip sizing, head-frequency preservation, taper monotonicity, frozen tail, explicit clip_n override, max_position_embeddings fallback, clip_n=0 equivalence with default RoPE, and a no-mutation test matching the SuScaled/Yarn ones.
  • Zero-shot validation on Qwen3.6-35B-A3B (4-bit, native 262K): needle-in-a-haystack at 393K tokens (1.5× native, needle at 65% depth) — default RoPE fails (no needle awareness); CoPE retrieves the exact answer. Short-context next-token distribution is preserved (same argmax, KL 0.0067 vs default at 8K), with no measurable speed difference (the change is constants in the frequency table).
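A short-context check of this kind can be reproduced along the following lines. This is a sketch under assumptions: the description does not state the KL direction, so KL(default || CoPE) is assumed here, and `compare_next_token` is a hypothetical helper that takes the two models' next-token logits as plain lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def compare_next_token(baseline_logits, variant_logits):
    """Return (same_argmax, KL(baseline || variant)) for one position."""
    p = softmax(baseline_logits)
    q = softmax(variant_logits)
    same_argmax = p.index(max(p)) == q.index(max(q))
    return same_argmax, kl_divergence(p, q)


same, kl = compare_next_token([2.0, 1.0, 0.1], [1.9, 1.1, 0.1])
print(same, kl)
```

The logits in the usage line are made-up toy values, not measurements from the model.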

pre-commit/black clean.

Implements the soft-clipping strategy from "CoPE: Clipped RoPE as A
Scalable Free Lunch for Long Context LLMs" (arXiv:2602.05258) as an
opt-in rope_type. Low-frequency components whose rotation periods
exceed the pre-training context window are attenuated with a
raised-cosine taper, eliminating the out-of-distribution positions
that break extrapolation past the native context, without YaRN-style
interpolation of the in-distribution frequencies.

Enabled via rope_parameters/rope_scaling:
  {"rope_type": "cope", "original_max_position_embeddings": N}
with optional explicit "clip_n". No behavior change for any existing
config.

Validated zero-shot on Qwen3.6-35B-A3B (native 262K, theta=10M,
10/32 components clipped): 393K-token needle retrieval passes where
default RoPE fails, short-context next-token distribution preserved
(KL 0.0067), no speed cost.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@machiabeli

Author

@/tmp/pr1387-comment.md

@machiabeli

Author

NIAH Evidence Grid — Zero-Shot Context Extension

Ran a full needle-in-a-haystack sweep on Qwen3.6-35B-A3B-4bit (M3 Ultra 512GB, mlx_lm 0.31.3) with the CoPE rope patch applied at runtime. The model's trained context window is 262,144 tokens.

Results

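| Context Length (tokens) | % Past Trained | Depth 10% | Depth 35% | Depth 65% | Depth 90% |
|---|---|---|---|---|---|
| 200,000 | within trained | found | found | found | found |
| 314,308 | +20% | found | found | found | found |
| 442,211 | +69% | found | found | found | found |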

All 12 needles (3 context lengths × 4 depths) were retrieved correctly. With CoPE applied, retrieval remains perfect at 69% beyond the trained context window, with no fine-tuning and no RoPE scaling, only the clipped rotation.

Methodology

  • Haystack: repeated filler text with a single 6-digit needle inserted at the target depth percentage
  • Generation: 64 max tokens, greedy decode, answer extracted from response
  • Each 442K point took ~20 min (prefill-dominated, quadratic attention)
  • Sweep script and full logs at commit 44f81ec
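For readers without access to the sweep script, the haystack construction and answer extraction described above can be approximated as below. This is not the script from the commit: `build_haystack` and `extract_answer` are hypothetical names, the filler sentence and needle wording are placeholders, and depth is measured in filler units here rather than in tokens.

```python
import re


def build_haystack(filler, n_units, needle, depth):
    """Repeat `filler` n_units times and insert one needle sentence at `depth` (0..1)."""
    units = [filler] * n_units
    index = round(depth * n_units)
    units.insert(index, f"The secret number is {needle}.")
    return " ".join(units)


def extract_answer(response):
    """Return the first standalone 6-digit number in the response, or None."""
    match = re.search(r"(?<!\d)\d{6}(?!\d)", response)
    return match.group(0) if match else None


prompt = build_haystack("The grass is green.", 100, "482913", 0.65)
print(extract_answer("I believe the secret number is 482913."))  # 482913
```

A real run would size `n_units` with the model's tokenizer so that the prompt reaches the target token count.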

Short-Context Sanity

Perplexity at 4K and 16K tokens is unchanged from baseline — CoPE doesn't degrade in-distribution performance.
