Add CoPE (Clipped RoPE) rope type for zero-shot context extension #1387
Open
machiabeli wants to merge 1 commit into
Implements the soft-clipping strategy from "CoPE: Clipped RoPE as A
Scalable Free Lunch for Long Context LLMs" (arXiv:2602.05258) as an
opt-in rope_type. Low-frequency components whose rotation periods
exceed the pre-training context window are attenuated with a
raised-cosine taper, eliminating the out-of-distribution positions
that break extrapolation past the native context, without YaRN-style
interpolation of the in-distribution frequencies.
Enabled via rope_parameters/rope_scaling:
{"rope_type": "cope", "original_max_position_embeddings": N}
with optional explicit "clip_n". No behavior change for any existing
config.
Validated zero-shot on Qwen3.6-35B-A3B (native 262K, theta=10M,
10/32 components clipped): 393K-token needle retrieval passes where
default RoPE fails, short-context next-token distribution preserved
(KL 0.0067), no speed cost.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Author
@/tmp/pr1387-comment.md
Author
NIAH Evidence Grid: Zero-Shot Context Extension
Ran a full needle-in-a-haystack sweep on Qwen3.6-35B-A3B-4bit (M3 Ultra 512GB, mlx_lm 0.31.3) with the CoPE rope patch applied at runtime. The model's trained context window is 262,144 tokens.
Results
All needles found correctly at every depth. The model maintains perfect retrieval at 69% beyond its trained context window with CoPE applied: no fine-tuning, no RoPE scaling, just the clipped rotation.
Methodology
Short-Context Sanity
Perplexity at 4K and 16K tokens is unchanged from baseline; CoPE doesn't degrade in-distribution performance.
Add CoPE (Clipped RoPE) as an opt-in rope type
This adds CoPERoPE to rope_utils.py, implementing the soft-clipping strategy from "CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs" (reference implementation). It enables zero-shot context extension past a model's native window via a config edit, with no weight changes and no behavior change for any existing config.
Method
RoPE extrapolation past the trained context fails primarily because of the lowest-frequency components — those whose rotation period exceeds the pre-training window never complete a full rotation during training, so longer sequences expose them to angle ranges the model has never seen. CoPE attenuates exactly those components with a raised-cosine (Hann) taper:
Any component whose period 2π·freqs[i] is greater than original_max_position_embeddings is clipped (overridable via clip_n). For example, for Qwen3.5/3.6-family configs (rope_theta=10M, 64 rotary dims, native 262144) this clips 10 of 32 components.
… inf freqs, the same identity-rotation convention ProportionalRoPE already uses, so mx.fast.rope handles everything with no new kernel paths.
Unlike YaRN, in-distribution frequencies are left untouched, so short-context behavior is preserved without an mscale correction.
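The clip-sizing rule above can be checked with a few lines of plain Python. This is only a sketch, not the code in this PR: it assumes the MLX-style convention freqs[i] = base ** (2i / dims) (larger value means slower rotation), and the function names, the exact taper shape, and the way the taper is applied (dividing freqs by a weight, with weight 0 mapping to inf) are illustrative assumptions.

```python
import math

def mlx_style_freqs(base, dims):
    # One value per rotary pair; assumed convention: larger value = slower rotation.
    return [base ** (2 * i / dims) for i in range(dims // 2)]

def auto_clip_n(freqs, original_max_position_embeddings):
    # Count components whose period 2*pi*freqs[i] exceeds the trained window.
    return sum(1 for f in freqs if 2 * math.pi * f > original_max_position_embeddings)

def raised_cosine_weights(clip_n):
    # Hypothetical Hann-style taper: decreases monotonically and ends at 0.
    return [0.5 * (1.0 + math.cos(math.pi * (k + 1) / clip_n)) for k in range(clip_n)]

def tapered_freqs(freqs, clip_n):
    # Assumed application of the taper: scale each clipped component's angular
    # frequency by its weight, i.e. divide freqs by w. A weight of 0 becomes
    # inf, which corresponds to an identity rotation.
    if clip_n == 0:
        return list(freqs)  # nothing clipped: same as default RoPE
    out = list(freqs[: len(freqs) - clip_n])  # in-distribution head, untouched
    tail = freqs[len(freqs) - clip_n :]
    for f, w in zip(tail, raised_cosine_weights(clip_n)):
        out.append(math.inf if w < 1e-12 else f / w)
    return out

# Qwen3.5/3.6-family numbers quoted in the description.
freqs = mlx_style_freqs(10_000_000, 64)
clip_n = auto_clip_n(freqs, 262_144)
new_freqs = tapered_freqs(freqs, clip_n)
print(clip_n, len(freqs))  # 10 32
```

Under these assumptions the first 22 components are returned unchanged and the last one is frozen at inf, which mirrors the head-preservation and frozen-tail properties this PR lists under its tests.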
Usage
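As a rough illustration of the config edit, the snippet below adds the CoPE entry to a config dictionary. The keys rope_type, original_max_position_embeddings, and clip_n come from this PR; the helper name with_cope, its key argument, and the sample config values are hypothetical, and whether a given model reads rope_scaling or rope_parameters depends on that model's config.

```python
import copy
import json

def with_cope(config, original_max_position_embeddings=None, clip_n=None,
              key="rope_scaling"):
    """Return a copy of `config` with CoPE enabled; the input is not mutated.

    Hypothetical helper for illustration only, not part of mlx_lm.
    """
    cfg = copy.deepcopy(config)
    rope = dict(cfg.get(key) or {})  # keep any existing entries under `key`
    rope["rope_type"] = "cope"
    if original_max_position_embeddings is not None:
        rope["original_max_position_embeddings"] = original_max_position_embeddings
    if clip_n is not None:
        rope["clip_n"] = clip_n  # optional explicit override
    cfg[key] = rope
    return cfg

# Illustrative values only, matching the numbers quoted in this PR.
base_config = {"max_position_embeddings": 262144, "rope_theta": 10_000_000}
cope_config = with_cope(base_config, original_max_position_embeddings=262144)
print(json.dumps(cope_config["rope_scaling"]))
# {"rope_type": "cope", "original_max_position_embeddings": 262144}
```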
original_max_position_embeddings falls back to the model's max_position_embeddings when unset; clip_n may be set explicitly to override the derivation.
Testing
Tests follow test_rope conventions: dispatch + auto clip sizing, head-frequency preservation, taper monotonicity, frozen tail, explicit clip_n override, max_position_embeddings fallback, clip_n=0 equivalence with default RoPE, and a no-mutation test matching the SuScaled/Yarn ones.
pre-commit / black clean.