[DRAFT] Support for Zaya1 8B model (depends on PR #22833) #23112
Juste-Leo2 wants to merge 33 commits
- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant that's already used by other SSM-based architectures (mamba, jamba, etc.).

- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this a semantically correct replacement that reduces custom constants.

- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input, making this a semantically reasonable replacement.

- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate, making this a reasonable replacement for reducing custom constants.

- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant for both weight and bias tensors, differentiated by suffix.

- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant for both weight and bias tensors, differentiated by suffix.
Remove hardcoded 256 value for router MLP hidden size and read it from the GGUF expert_feed_forward_length metadata key instead. The converter now writes zaya_mlp_expansion from config.json.
val_proj1 and val_proj2 output dimension should be latent_k_dim / 2 (n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently both are equal for ZAYA1-8B (n_head_kv=2), but this would break for any other n_head_kv configuration.
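The arithmetic behind this fix can be checked quickly. The sketch below assumes that latent_k_dim (n_embd_k) equals n_head_kv * n_embd_head, which is not stated in the commit message; the helper name and the sample head size of 128 are illustrative only.

```python
def val_proj_out_dim(n_head_kv: int, n_embd_head: int) -> int:
    """Output width of val_proj1/val_proj2: half of the latent K dimension.

    Assumption: n_embd_k = n_head_kv * n_embd_head.
    """
    n_embd_k = n_head_kv * n_embd_head
    return n_embd_k // 2


n_embd_head = 128  # illustrative head size, not read from the model
for n_head_kv in (2, 4, 8):
    out_dim = val_proj_out_dim(n_head_kv, n_embd_head)
    # The last column shows whether the old choice (n_embd_head) would still be right.
    print(n_head_kv, out_dim, out_dim == n_embd_head)
```

Under that assumption the two sizes coincide only when n_head_kv is 2, which is why the old code happened to work for ZAYA1-8B.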
Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv, and RWKV time_mix tensors. These small conv weights (d_conv=2) are not divisible by quant block sizes (32), causing Q8_0 failures.
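A minimal sketch of the divisibility check that motivates keeping these tensors unquantized. The block size of 32 comes from the commit message; the function name and the idea of testing a single row length are simplifying assumptions, not the actual llama.cpp quantization logic.

```python
QUANT_BLOCK_SIZE = 32  # block size quoted in the commit message


def can_block_quantize(row_len: int, block_size: int = QUANT_BLOCK_SIZE) -> bool:
    """Return True when a row can be split into whole quantization blocks."""
    return row_len % block_size == 0


print(can_block_quantize(2))     # conv kernel row with d_conv=2
print(can_block_quantize(4096))  # illustrative row length that is a multiple of 32
```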
Wanted to give this a try but I guess the Vulkan backend needs some more work for this, or is this unexpected?
ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw and cca_conv_grp to F16 before convolution to support quantized models (Q4, Q8). CUDA/SYCL backends are unaffected since their im2col implementation only reads kernel dimensions, not data.
OK, after some trouble, I think I've found the cause. It was a type error: on CPU backends and the like, im2col requires F16 kernel weights for the conversion. CUDA only checks the kernel's dimensions during the conversion, not its internal data, which is why it worked correctly there.
I wanted to try the Zaya model with my 9060 XT and ROCm 7.1, so I compiled the MTP version of llama and backported your branch. I got some issues such as:
ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed
This is my setup.
I used AI to fix it; it made a few changes in zaya.cpp. Here's the unified diff:
It works so far, but token generation is slow (~15 t/s) compared to a Qwen 35B-A3B (60 t/s). I was expecting much more from this model considering the small number of active params.
ROCm and Vulkan backends require contiguous tensors for im2col and mul_mat operations. Add ggml_cont after ggml_cast for conv kernels and after ggml_concat for hs_d to ensure compatibility across all backends. CUDA was unaffected since it handles non-contiguous tensors more permissively.
Thanks for the catch! I was also unsure about the speed during implementation, and there are a few reasons for it:
I just pushed a commit with only the necessary ggml_cont calls. I think the AI added a few extras out of caution that weren't actually needed and would have added unnecessary memory copies. Could you try again and let me know how it goes, @kdrapelinexto? Thanks!
Ports PR ggml-org#22833 and PR ggml-org#23112 from ggml-org/llama.cpp onto our fork.

- ggml: add ggml_conv_1d_grouped op (depthwise + headwise conv via ggml_view_3d slicing, falls back to existing conv1d/dw for groups=1 and groups=IC)
- gguf: register ZAYA arch, CCA_VAL_PROJ1/2, CCA_CONV_GRP, CCA_K_SCALE, RES_SCALE_HS/RES/FINAL, ZAYA_ROUTER_MLP2/4/BIASES/EDA_SCALE tensors
- src: add llama_model_zaya with alternating CCA (even) and MoE (odd) layers; residual scaling at every layer and final norm
- conversion/zaya.py: HF→GGUF converter for ZayaModel/ZayaForCausalLM
- Includes ggml_cont fixes for ROCm non-contiguous tensor compatibility and F16 cast fixes for CPU backend (from Zyphra fork review)

Markovian RSA (test-time compute method) is intentionally excluded and will be a separate implementation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
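To make the grouped convolution mentioned in this commit concrete, here is a minimal pure-Python sketch of a grouped 1D convolution computed by slicing channels per group. It is not the ggml_conv_1d_grouped implementation: the function name, the list-of-lists layout, the absence of padding and stride, and the cross-correlation convention are all assumptions chosen for readability.

```python
def conv1d_grouped(x, w, groups):
    """Grouped 1D cross-correlation without padding or stride.

    x: [in_channels][length]
    w: [out_channels][in_channels // groups][kernel_size]
    Each output channel only sees the input-channel slice of its own group.
    """
    in_ch, out_ch = len(x), len(w)
    assert in_ch % groups == 0 and out_ch % groups == 0
    ic_per_group = in_ch // groups
    oc_per_group = out_ch // groups
    assert all(len(kernel) == ic_per_group for kernel in w)
    k = len(w[0][0])
    out_len = len(x[0]) - k + 1

    out = []
    for oc in range(out_ch):
        g = oc // oc_per_group                                   # group of this output channel
        x_slice = x[g * ic_per_group:(g + 1) * ic_per_group]     # channels visible to the group
        row = []
        for t in range(out_len):
            acc = 0.0
            for ic in range(ic_per_group):
                for j in range(k):
                    acc += w[oc][ic][j] * x_slice[ic][t + j]
            row.append(acc)
        out.append(row)
    return out


x = [[1, 2, 3], [4, 5, 6]]
# groups == in_channels: depthwise, one kernel per channel.
print(conv1d_grouped(x, [[[1, 1]], [[1, -1]]], groups=2))
# groups == 1: ordinary convolution mixing both channels.
print(conv1d_grouped(x, [[[1, 0], [0, 1]]], groups=1))
```

With groups equal to the number of input channels this reduces to a depthwise convolution, and with groups=1 to an ordinary convolution, which mirrors the two fallback cases named in the commit message.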
This works now, thanks! I've tested both f16 and Q8_0.
BTW thanks for doing this. I was tempted to attempt it, but I figured I have enough to do and this architecture is a bit crazy :)
Thank you! I put Q6_K on my Pi5 16GB and got 6.5~6.9 TPS generation. A little disappointing, but it's a good substitute for Gemma4-e2b-it-IQ4_NL since ZAYA1 is a lot smarter!
The model's config.json reports vocab_size=262272 but the actual tokenizer only has 262147 tokens. The 125 extra entries are padding in PyTorch's embed_tokens.weight matrix that don't correspond to any real tokens. Use the pre-computed _tokenizer_vocab_size to write the correct vocab size in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
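The numbers in this commit message can be reproduced with a short sketch. The two sizes are taken from the message; the function names are hypothetical and do not correspond to anything in the converter.

```python
CONFIG_VOCAB_SIZE = 262272     # vocab_size reported by config.json
TOKENIZER_VOCAB_SIZE = 262147  # tokens actually present in the tokenizer


def padding_rows(config_vocab: int, tokenizer_vocab: int) -> int:
    """Number of embedding rows that do not map to a real token."""
    if tokenizer_vocab > config_vocab:
        raise ValueError("tokenizer is larger than the embedding matrix")
    return config_vocab - tokenizer_vocab


def gguf_vocab_size(config_vocab: int, tokenizer_vocab: int) -> int:
    """Vocab size to write into the metadata: the tokenizer's, not the padded one."""
    padding_rows(config_vocab, tokenizer_vocab)  # validates the inputs
    return tokenizer_vocab


print(padding_rows(CONFIG_VOCAB_SIZE, TOKENIZER_VOCAB_SIZE))
print(gguf_vocab_size(CONFIG_VOCAB_SIZE, TOKENIZER_VOCAB_SIZE))
```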
I had some trouble running the test initially due to a vocabulary size mismatch. The PyTorch config included some extra unused padding tokens (PyTorch vocab: 262272 vs llama.cpp vocab: 262147). Since this doesn't impact inference for current GGUFs, I made a commit to explicitly align with the
Here are the detailed results of the verification:
🔍 Token & Logits Verification
✅ Match: All 6 tokens match between PyTorch and llama.cpp.
Here are the raw logits for the top-10 predictions:
📈 NMSE Metrics
❌ Final Verdict & Context
The script returned an error (
Given that the NMSE is just slightly above the threshold but the top-10 tokens are aligned, @pwilkin what do you think? Is this sufficient to validate the test, or is further investigation required regarding this gap?
Yeah, does look like a bug. You should probably dump intermediate tensors at this stage to see where the divergence starts.
Thanks for the feedback! I'll investigate this during the week to find where the divergence starts.
I had some time to run tests with opencode. I thought it would be nice to map it out with a Python script, so here are the results for each layer in BF16.
I noticed we have spikes on the odd layers. A few examples:
To try and fix this, I re-analyzed things against Zyphra's official vLLM fork (which I based this on), specifically looking at the MoE logic (since the odd layers are MoE layers). We have the exact same implementation for the final residual scaling, as well as for the EDA (Exponential Decay Averaging) and MoE gate/up/down logic.
I also tested disabling this EDA: the NMSE exploded to 5.72e-01 (46x worse), which confirms that it is indeed reducing the drift rather than causing it.
I thought it might be a routing precision issue, so I switched the softmax from bf16 to f32 (line 382 in
So after all these tests, I finally tried running everything in full F32 (which I should have done first 😅), and it passed the logit test! Here are the results:
📈 METRICS
==============================
MSE (Mean Squared Error): 2.335141e-02
Reference Variance: 6.342938e+00
NMSE: 3.681481e-03
Max Absolute Error: 0.798369
Mean Absolute Error: 0.117253
NMSE (dB): -24.34 dB
🎯 INTERPRETATION
==============================
👍 Good match
📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
Small differences are likely due to precision/quantization.
📚 NMSE BENCHMARKS
==============================
< 1e-6: Essentially identical
< 1e-4: Excellent (typical for good conversions)
< 1e-3: Very good
< 1e-2: Good (acceptable for most use cases)
< 0.1: Acceptable (may need verification)
> 1.0: Poor (worse than random)
✅ RESULT: PASS (NMSE = 3.68e-03)
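The headline numbers in this report are related by simple formulas. The sketch below assumes the script defines NMSE as MSE divided by the reference variance and reports decibels as 10·log10(NMSE); those definitions are inferred from the printed values rather than taken from the script's source.

```python
import math


def nmse(mse: float, reference_variance: float) -> float:
    """Normalized MSE: mean squared error divided by the reference variance."""
    return mse / reference_variance


def to_db(value: float) -> float:
    """Express a power-like ratio in decibels."""
    return 10.0 * math.log10(value)


# Values copied from the F32 report above.
mse = 2.335141e-02
ref_var = 6.342938e+00

score = nmse(mse, ref_var)
print(f"NMSE:      {score:.6e}")
print(f"NMSE (dB): {to_db(score):.2f} dB")
```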
Here is the graph with all the layers in F32:
On this graph, I still observe a noticeable MSE increase at the end. In F32, layers 0-72 are correct, but I saw a sharp increase after that. I think this is probably due to error accumulating from the previous layers, and probably not a structural bug.
So, given that F32 passes the logit test (NMSE = 3.68e-03, "Good") and BF16 just barely fails due to this accumulating precision loss, the model's architecture seems consistent. @pwilkin pinging you again with these new observations. Is it worth investigating further, or is this explanation sufficient to validate the architecture?
That probably means that more tensors need less aggressive quantization for this model?
It still seems useful to try to understand exactly where this comes from, if the PyTorch implementation of the very same model doesn't show this behaviour. I have no idea, but this feels like something is going wrong somewhere.
Thanks for the feedback! Since I'm still getting used to interpreting the exact impact of quantization with these tools, I'm still a bit hesitant to draw definitive conclusions.
Sorry, I'll take a look when I'm able. From my intuition, if it doesn't validate at BF16, then that suggests something is still wrong, but I'd have to look at the intermediate tensors themselves to form an informed opinion.
Add detailed inline comments mapping each C++ code section to the corresponding zaya.py and cca.py Python lines, including code snippets for direct comparison.
zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via (self.layer_number != zaya_first_layer). Add il != 1 guard to match.
05ec4f4 to 2b0c8c8
… _FP32EmbeddingMethod
Correct line reference from zaya.py L387-389 to L459-469, and add note explaining why excluding the skip expert from gate_probs is correct (bias=-1.0 makes it effectively never selected at inference with topk=1).
I added comments that compare this implementation with Zyphra's vLLM fork (zaya.py and cca.py).
Note: For the logit test, I'm using Zyphra's transformers fork, so my next step is to compare against that fork as well to see if there are any differences. I also wanted to clarify that the test showing the graph with the layers was a cumulative test, not a test with separate, independent layers.
- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)
Match Python reference which casts hidden_states and residual to float32 before ggml_add in both per-layer and final residual paths. zaya.py ref: L900, L1387, L1701
After hours of debugging, I think I've finally found the root cause of the issue! For the
I put together a diagram (thanks opencode) to illustrate exactly what happens:
BEFORE (NMSE 1.23e-02)              AFTER (NMSE 3.94e-03)
QK_dw [7,1280,1] F32                QK_dw [7,1280,1] F32
     │                                   │
     ▼                                   ▼
ggml_conv_1d_grouped(10)            ggml_conv_1d_grouped(10)
     │                                   │
┌────┴────┐                         ┌────┴────┐
│ group 0 │ ...                     │ group 0 │ ...
│ slice   │                         │ slice   │
│ [7,128] │ F32                     │ [7,128] │ F32
│    │    │                         │    │    │
│    ▼    │                         │    ▼    │
│ im2col  │                         │ im2col  │
│ F16 ⚠️  │← precision loss         │ F32 ✅  │← exact
│ -1.191  │                         │ -1.191  │
│ →-1.194 │                         │ →-1.191 │
│    │    │                         │    │    │
│    ▼    │                         │    ▼    │
│ mul_mat │ F16×F32                 │ mul_mat │ F32×F32
│    │    │                         │    │    │
│  0.818  │← err amplified          │  0.818  │✓
│ →0.750  │  δ=0.068                │ →0.818  │  δ≈0
└────┬────┘                         └────┬────┘
     │                                   │
┌────┴────┐                         ┌────┴────┐
│ concat  │                         │ concat  │
│ groups  │                         │ groups  │
└────┬────┘                         └────┬────┘
     │                                   │
┌────┴────┐                         ┌────┴────┐
│ attn +  │ err amplified           │ attn +  │ minimal
│ rest    │ by later layers         │ rest    │ error
│    │    │                         │    │    │
│ logits  │ NMSE 1.23e-02           │ logits  │ NMSE 3.94e-03
└─────────┘                         └─────────┘
Given these results, I think the best path forward is to revert the recent adjustments and go back to the state just after this commit. The subsequent commits didn't bring any improvements (they mainly documented the code and added elements that are likely handled implicitly). We can discuss if we want to keep a few specific commits, but the metrics clearly point to this
Note that this means the test works perfectly in f16:
📈 METRICS
==============================
MSE (Mean Squared Error): 2.430202e-02
Reference Variance: 6.339689e+00
NMSE: 3.833313e-03
Max Absolute Error: 0.842108
Mean Absolute Error: 0.121947
NMSE (dB): -24.16 dB
🎯 INTERPRETATION
==============================
👍 Good match
📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
Small differences are likely due to precision/quantization.
📚 NMSE BENCHMARKS
==============================
✅ RESULT: PASS (NMSE = 3.83e-03)
but not in BF16:
📈 METRICS
==============================
MSE (Mean Squared Error): 7.868546e-02
Reference Variance: 6.339689e+00
NMSE: 1.241156e-02
Max Absolute Error: 1.791535
Mean Absolute Error: 0.202144
NMSE (dB): -19.06 dB
🎯 INTERPRETATION
==============================
⚠️ Acceptable match
📋 GUIDANCE
==============================
⚠️ ACCEPTABLE: Conversion is working but with some differences.
Check if you're using quantization (Q4, Q8, etc.)
Test generation quality to see if it's acceptable.
📚 NMSE BENCHMARKS
==============================
❌ RESULT: NEEDS REVIEW (NMSE = 1.24e-02)
This reverts commit f1bd772.
- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
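The type-selection rule from this commit, and the rounding it avoids, can be illustrated without ggml. The sketch below is plain Python: the string type names and both helper functions are hypothetical stand-ins, the `kernel_type` argument is assumed to correspond to `a->type` in the commit, and the half-precision format of the standard `struct` module is used only to show what rounding a value through F16 does.

```python
import struct


def im2col_dst_type(kernel_type: str) -> str:
    """Mirror of the rule in the commit: F16 kernels keep F16, everything else uses F32."""
    return "F16" if kernel_type == "F16" else "F32"


def round_through_f16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for kernel_type in ("F16", "F32", "BF16", "Q4_0"):
    print(kernel_type, "->", im2col_dst_type(kernel_type))

# A value that is not exactly representable in half precision picks up a small error.
x = 0.1
print(x, round_through_f16(x), abs(x - round_through_f16(x)))
```

Each such rounding is small on its own; the commit message attributes the NMSE regression to these errors feeding into the following mul_mat and later layers.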
I found the fix! We simply need to allocate im2col dynamically: F32 for unsupported types (like BF16 or quantized) and F16 for native F16 models. This keeps F16 hardware-optimized while allowing BF16/quantized models to use precise F32 math without backend crashes. Here are the final passing results for both:
BF16:
📈 METRICS
==============================
MSE (Mean Squared Error): 2.556950e-02
Reference Variance: 6.339689e+00
NMSE: 4.033242e-03
Max Absolute Error: 0.862177
Mean Absolute Error: 0.122009
NMSE (dB): -23.94 dB
✅ RESULT: PASS (NMSE = 4.03e-03)
@Juste-Leo2 I took a stab at some perf follow-ups for CCA and qkmean fusion: CUDA ~80 -> 108 t/s for bsz1 decode, Q4_K_M on a 4090, validated with per-layer dumps vs zaya transformers fp32 + op tests. Happy to open a follow-up after this lands, or share the dev branch if useful.
This is a safety guard matching self.layer_number != zaya_first_layer in the original implementation. No behavioral change for correctly converted models since the tensor is already nullptr for layer 1.
The model config has residual_in_fp32=true. Cast both residual branches to float32 to align with the python reference.
I believe the implementation is complete. I'm waiting for the im2col and grouped conv PRs to be merged first since this draft depends on them. Once those are in, I plan to convert this draft into a proper PR.


Overview
This PR adds support for the Zaya1 8B model (without Markovian RSA). (see issue #22776)
Note: This draft depends on PR #22833, hence the choice of opening a draft PR to be able to update the tree based on the requested changes.
Zaya is a hybrid recurrent/attention model. It consists of a succession of classic MoE layers and convolution-based CCA layers.
The goal of CCA is to substitute classic attention. From what I understand, the process is:
The researchers of the Zaya model also used an attention projection on the previous token.
Additional information
Heavily based on the vLLM implementation.
For context, I initially started this port on my own fork. I later got stuck on an inference issue, and the work was moved to a separate branch on the Zyphra repository where @nanduruganesh greatly helped unblock the situation (see Zyphra PR #1 and Zyphra PR #2). I then took care of the refactoring and various fixes (using OpenCode) to propose this clean version upstream. A huge thanks to him for his crucial help!
Regarding inference with an RTX 4070 Ti and 64GB of RAM:
In BF16
Q4_K_M
Requirements
I am more than willing to do my best to answer maintainers' questions about the architecture or anything else.