[DRAFT] Support for Zaya1 8B model (depends on PR #22833) by Juste-Leo2 · Pull Request #23112 · ggml-org/llama.cpp

Juste-Leo2 · 2026-05-15T16:59:17Z

Overview

This PR adds support for the Zaya1 8B model, without Markovian RSA (see issue #22776).

Note: This draft depends on PR #22833, so I opened it as a draft to be able to update the tree as changes are requested.

Zaya is a hybrid recurrent/attention model. It consists of a succession of classic MoE layers and convolution-based CCA layers.

The goal of CCA is to replace classic attention. From what I understand, the process is:

1. Q and K projections from the hidden state
2. Computation of pre-means (before convolution): the average per GQA group of Q and K
3. Q+K concatenation + depthwise convolution + grouped convolution
4. Injection of the pre-means AFTER the convolution: the convolution result is added to the means computed in step 2
5. L2 norm + learned temperature on K (similar to Kimi Linear or Qwen3 Next models)
6. And finally, 50% RoPE
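The pre-mean step can be sketched in a few lines of plain Python. This is only an illustration for a single token, inferred from the `qk_mean_q` / `qk_mean_k` lines of the graph code quoted later in this thread; the function name `cca_pre_means` and the list-of-lists layout are hypothetical, and the convolutions, normalization, and RoPE are left out.

```python
def cca_pre_means(q_heads, k_heads):
    """Hypothetical single-token sketch of the CCA pre-means.

    q_heads: list of n_head vectors, k_heads: list of n_head_kv vectors,
    all with the same head dimension.
    """
    n_head, n_head_kv = len(q_heads), len(k_heads)
    assert n_head % n_head_kv == 0, "Q heads must split evenly into GQA groups"
    n_gqa = n_head // n_head_kv

    # Q side: every Q head is averaged with the K head of its GQA group.
    mean_q = [
        [0.5 * (q + k) for q, k in zip(q_heads[h], k_heads[h // n_gqa])]
        for h in range(n_head)
    ]

    # K side: average the Q heads of each group, then average with the K head.
    mean_k = []
    for g in range(n_head_kv):
        group = q_heads[g * n_gqa:(g + 1) * n_gqa]
        q_avg = [sum(col) / n_gqa for col in zip(*group)]
        mean_k.append([0.5 * (qa + k) for qa, k in zip(q_avg, k_heads[g])])
    return mean_q, mean_k


# Toy example: 4 Q heads, 2 KV heads, head dimension 2.
q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
k = [[0.0, 0.0], [2.0, 2.0]]
mean_q, mean_k = cca_pre_means(q, k)
```

In the real graph these means are added back to the convolution output (step 4); the sketch stops before that point.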

The researchers of the Zaya model also used an attention projection on the previous token.

Additional information

Heavily based on the vLLM implementation.

For context, I initially started this port on my own fork. I later got stuck on an inference issue, and the work was moved to a separate branch on the Zyphra repository where @nanduruganesh greatly helped unblock the situation (see Zyphra PR #1 and Zyphra PR #2). I then took care of the refactoring and various fixes (using OpenCode) to propose this clean version upstream. A huge thanks to him for his crucial help!

Regarding inference with an RTX 4070 Ti and 64GB of RAM:

In BF16

~/zaya/llama.cpp$ ./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf   -p "Quelle est la capitale de la France ?"   -n 64 -c 512 -ngl all -sm none --single-turn --simple-io
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes, VRAM: 12281 MiB

Loading model...

...

> Quelle est la capitale de la France ?

[Start thinking]
We need to answer the question: "Quelle est la capitale de la France ?" which is French asking: What is the capital of France? The assistant should answer in French presumably, or maybe in English? The user wrote in French. Usually we answer in the same language. So answer: Paris. Possibly add a brief

[ Prompt: 81.5 t/s | Generation: 11.0 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4070 Ti) | 12281 =    0 + (17445 = 15879 +      21 +    1545) +       -5164 |
common_memory_breakdown_print: |   - Host                |                 1033 =  1024 +       0 +       9

Q4_K_M

~/zaya/llama.cpp$ ./build/bin/llama-cli -m models/ZAYA1-8B-Q4_K_M.gguf   -p "Quelle est la capitale de la France ?"   -n 64 -c 512 -ngl all -sm none --single-turn --simple-io
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes, VRAM: 12281 MiB

Loading model...

...

> Quelle est la capitale de la France ?

[Start thinking]
We have a conversation: user asks "Quelle est la capitale de la France?" which is French: "What is the capital of France?" The user is asking a factual question. According to knowledge, the capital of France is Paris.

We need to answer appropriately. The user is using French language. Probably we should

[ Prompt: 167.8 t/s | Generation: 45.9 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4070 Ti) | 12281 = 5092 + (5836 =  4874 +      21 +     941) +         1352 |
common_memory_breakdown_print: |   - Host                |                  429 =   420 +       0 +       9

Requirements

I have read and agree with the contributing guidelines

AI usage disclosure: YES
- I used OpenCode to better understand the implementation and architectural concepts.
- I discussed with the AI to refactor the code and use existing constants to improve maintainability.
- The naive implementation of the architecture was ported from vLLM.
- Translation of this PR (originally handwritten by me in English) and improving its readability.

I am more than willing to do my best to answer maintainers' questions about the architecture or anything else.

- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant that's already used by other SSM-based architectures (mamba, jamba, etc.)

- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this a semantically correct replacement that reduces custom constants.

- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input, making this a semantically reasonable replacement.

- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate, making this a reasonable replacement for reducing custom constants.

- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant for both weight and bias tensors, differentiated by suffix.

- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant for both weight and bias tensors, differentiated by suffix.

Remove hardcoded 256 value for router MLP hidden size and read it from the GGUF expert_feed_forward_length metadata key instead. The converter now writes zaya_mlp_expansion from config.json.

val_proj1 and val_proj2 output dimension should be latent_k_dim / 2 (n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently both are equal for ZAYA1-8B (n_head_kv=2), but this would break for any other n_head_kv configuration.
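A quick arithmetic check shows why the two dimensions only coincide for n_head_kv=2. The sketch assumes that n_embd_k equals n_embd_head * n_head_kv, and the head dimension of 128 is used purely for illustration; `val_proj_out_dims` is a hypothetical helper, not a function from the PR.

```python
def val_proj_out_dims(n_embd_head, n_head_kv):
    """Return (n_embd_k / 2, n_embd_head) for a given head layout.

    Assumption: n_embd_k = n_embd_head * n_head_kv.
    """
    n_embd_k = n_embd_head * n_head_kv
    return n_embd_k // 2, n_embd_head


# Illustrative head dimension; only n_head_kv == 2 makes both values equal.
for n_head_kv in (1, 2, 4, 8):
    correct, old = val_proj_out_dims(128, n_head_kv)
    print(n_head_kv, correct, old, "same" if correct == old else "differs")
```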

Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv, and RWKV time_mix tensors. These small conv weights (d_conv=2) are not divisible by quant block sizes (32), causing Q8_0 failures.
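The divisibility constraint mentioned above can be illustrated with a tiny check. This is a simplified sketch: `can_block_quantize` is a hypothetical helper, the block size of 32 is the one quoted in the commit message, and the row length of 4096 is an arbitrary example value.

```python
BLOCK_SIZE = 32  # quant block size quoted in the commit message


def can_block_quantize(row_len, block_size=BLOCK_SIZE):
    """A row can only be split into whole quant blocks if its length
    is a multiple of the block size."""
    return row_len % block_size == 0


print(can_block_quantize(2))     # conv kernel with d_conv=2 -> False
print(can_block_quantize(4096))  # arbitrary example width -> True
```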

sdroege · 2026-05-16T15:40:26Z

Wanted to give this a try but I guess the Vulkan backend needs some more work for this, or is this unexpected?

build/bin/llama-cli -hf JusteLeo/ZAYA1-8B-GGUF:Q8_0 --jinja -ngl 99

Loading model... \.../llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7653: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw and cca_conv_grp to F16 before convolution to support quantized models (Q4, Q8). CUDA/SYCL backends are unaffected since their im2col implementation only reads kernel dimensions, not data.

Juste-Leo2 · 2026-05-16T19:11:11Z

OK, after some trouble, I think I've found the cause. It was a type error: im2col requires F16 kernel weights on the CPU backend and similar backends. The CUDA implementation only reads the kernel's dimensions during the conversion, not its data, which is why it worked correctly on CUDA.
Could you try again and let me know how it goes, @sdroege?

kdrapelinexto · 2026-05-16T19:53:06Z

I wanted to try the Zaya model with my 9060 XT and ROCm 7.1. I compiled the MTP version of llama and backported your branch, and ran into issues such as: ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed

This is my setup.

"llama-server.exe" ^
  -m "D:\04_AI\Models\8B\ZAYA1-8B-Q6_K.gguf" ^
  -ngl all ^
  --mlock ^
  -sm none ^
  -fa 1 ^
  --temp 0.7 ^
  --top-k 20 ^
  --top-p 0.8 ^
  --min-p 0.0 ^
  -t 8 -tb 8 --prio 2 --prio-batch 2 --poll 100 --poll-batch 1 ^
  --presence-penalty 1.5 ^
   --ctx-size 24000 ^
   -n -1 ^
   --host 0.0.0.0 ^
  --port 9090 ^
  --alias "ZAYA1-8B-Q6_K" ^
  -np 1

I used AI to fix it. It made a few changes in zaya.cpp; here's the unified diff:

 src/models/zaya.cpp | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/models/zaya.cpp b/src/models/zaya.cpp
index cda5abeea..01a7cdd0d 100644
--- a/src/models/zaya.cpp
+++ b/src/models/zaya.cpp
@@ -221,7 +221,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
             ggml_tensor * cur_state_src = ggml_cont(ctx0, cur);
             ggml_tensor * cur_seq = ggml_reshape_3d(ctx0, cur_state_src, n_embd, n_seq_tokens, n_seqs);
 
-            ggml_tensor * hs_d = ggml_reshape_3d(ctx0, prev_hs, n_embd, 1, n_seqs);
+            ggml_tensor * hs_d = ggml_reshape_3d(ctx0, ggml_cont(ctx0, prev_hs), n_embd, 1, n_seqs);
             if (n_seq_tokens > 1) {
                 ggml_tensor * cur_shift = ggml_view_3d(ctx0, cur_seq, n_embd, n_seq_tokens - 1, n_seqs,
                         cur_seq->nb[1],
@@ -229,7 +229,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
                         0);
                 hs_d = ggml_concat(ctx0, hs_d, cur_shift, 1);
             }
-            hs_d = ggml_reshape_2d(ctx0, hs_d, n_embd, n_tokens);
+            hs_d = ggml_reshape_2d(ctx0, ggml_cont(ctx0, hs_d), n_embd, n_tokens);
             cb(hs_d, "cca_hs_d", il);
 
             // V = concat(val_proj1(x), val_proj2(x delayed)) -> [n_embd_k, n_tokens]
@@ -249,7 +249,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * Kpre_grouped = ggml_reshape_4d(ctx0, Kpre, n_embd_head, 1, n_head_kv, n_tokens);
             Kpre_grouped = ggml_repeat_4d(ctx0, Kpre_grouped, n_embd_head, n_gqa, n_head_kv, n_tokens);
-            ggml_tensor * Kpre_rep = ggml_reshape_3d(ctx0, Kpre_grouped, n_embd_head, n_head, n_tokens);
+            ggml_tensor * Kpre_rep = ggml_reshape_3d(ctx0, ggml_cont(ctx0, Kpre_grouped), n_embd_head, n_head, n_tokens);
             ggml_tensor * qk_mean_q = ggml_scale(ctx0, ggml_add(ctx0, Qpre, Kpre_rep), 0.5f);
             cb(qk_mean_q, "qk_mean_q", il);
 
@@ -257,7 +257,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
             Qgroup = ggml_permute(ctx0, Qgroup, 1, 0, 2, 3);
             Qgroup = ggml_cont(ctx0, Qgroup);
             ggml_tensor * Qmean = ggml_mean(ctx0, Qgroup);
-            Qmean = ggml_reshape_3d(ctx0, Qmean, n_embd_head, n_head_kv, n_tokens);
+            Qmean = ggml_reshape_3d(ctx0, ggml_cont(ctx0, Qmean), n_embd_head, n_head_kv, n_tokens);
             ggml_tensor * qk_mean_k = ggml_scale(ctx0, ggml_add(ctx0, Qmean, Kpre), 0.5f);
             cb(qk_mean_k, "qk_mean_k", il);
 
@@ -289,7 +289,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * conv_dw = layer.cca_conv_dw;
             if (conv_dw->type != GGML_TYPE_F16) {
-                conv_dw = ggml_cast(ctx0, conv_dw, GGML_TYPE_F16);
+                conv_dw = ggml_cont(ctx0, ggml_cast(ctx0, conv_dw, GGML_TYPE_F16));
             }
             conv_dw = ggml_reshape_3d(ctx0, conv_dw, conv_dw->ne[0], 1, n_qk);
             ggml_tensor * QK = ggml_conv_1d_dw(ctx0, conv_dw, conv_input, 1, 0, 1);
@@ -300,7 +300,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * conv_grp = layer.cca_conv_grp;
             if (conv_grp->type != GGML_TYPE_F16) {
-                conv_grp = ggml_cast(ctx0, conv_grp, GGML_TYPE_F16);
+                conv_grp = ggml_cont(ctx0, ggml_cast(ctx0, conv_grp, GGML_TYPE_F16));
             }
             QK = ggml_conv_1d_grouped(ctx0, conv_grp, QK, 1, 0, 1, n_groups);
             QK = ggml_add(ctx0, QK, ggml_reshape_3d(ctx0, layer.cca_conv_grp_b, 1, n_qk, 1));

It works so far, but token generation is slow (~15 t/s) compared to a Qwen 35B-A3B (60 t/s). I was expecting much more from this model considering its small number of active parameters.

ROCm and Vulkan backends require contiguous tensors for im2col and mul_mat operations. Add ggml_cont after ggml_cast for conv kernels and after ggml_concat for hs_d to ensure compatibility across all backends. CUDA was unaffected since it handles non-contiguous tensors more permissively.
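To make the contiguity issue concrete, here is a simplified model of tensor strides in plain Python. It is not ggml's implementation: strides are counted in elements rather than bytes, quantized block types are ignored, and `contiguous_strides`, `is_contiguous`, `permute`, and `make_contiguous` are hypothetical helpers. It only shows why a permuted view stops being contiguous until its data is copied into a fresh layout, which is the role `ggml_cont` plays in the graph.

```python
def contiguous_strides(shape):
    """Element strides of a densely packed tensor (dimension 0 is fastest)."""
    strides, step = [], 1
    for n in shape:
        strides.append(step)
        step *= n
    return strides


def is_contiguous(shape, strides):
    """True when the strides match a densely packed layout."""
    return list(strides) == contiguous_strides(shape)


def permute(shape, strides, axes):
    """Reorder dimensions without moving data (a view)."""
    return [shape[a] for a in axes], [strides[a] for a in axes]


def make_contiguous(shape, strides):
    """Model of a copy into a fresh, densely packed buffer."""
    return list(shape), contiguous_strides(shape)


shape = [4, 3, 2]
strides = contiguous_strides(shape)                      # [1, 4, 12]
p_shape, p_strides = permute(shape, strides, [1, 0, 2])  # view, no copy
c_shape, c_strides = make_contiguous(p_shape, p_strides)
print(is_contiguous(p_shape, p_strides), is_contiguous(c_shape, c_strides))
```

The copy has a cost, which is why only the calls that are actually required are worth keeping.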

Juste-Leo2 · 2026-05-16T20:41:19Z

Thanks for the catch!

I was also unsure about the speed during implementation, and there are a few reasons for it:

- ggml_conv_1d_grouped is a naive implementation that uses basic operators instead of a dedicated kernel per backend, which explains the low throughput.
- CCA + MoE does more computation per token than a standard transformer, so it's naturally more resource-intensive.

I just pushed a commit with only the necessary ggml_cont calls. I think the AI added a few extras out of caution that weren't actually needed and would have caused unnecessary memory copies.

Could you try again and let me know how it goes, @kdrapelinexto? Thanks!

Ports PR ggml-org#22833 and PR ggml-org#23112 from ggml-org/llama.cpp onto our fork.

- ggml: add ggml_conv_1d_grouped op (depthwise + headwise conv via ggml_view_3d slicing, falls back to existing conv1d/dw for groups=1 and groups=IC)
- gguf: register ZAYA arch, CCA_VAL_PROJ1/2, CCA_CONV_GRP, CCA_K_SCALE, RES_SCALE_HS/RES/FINAL, ZAYA_ROUTER_MLP2/4/BIASES/EDA_SCALE tensors
- src: add llama_model_zaya with alternating CCA (even) and MoE (odd) layers; residual scaling at every layer and final norm
- conversion/zaya.py: HF→GGUF converter for ZayaModel/ZayaForCausalLM
- Includes ggml_cont fixes for ROCm non-contiguous tensor compatibility and F16 cast fixes for CPU backend (from Zyphra fork review)

Markovian RSA (test-time compute method) is intentionally excluded and will be a separate implementation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sdroege · 2026-05-17T06:13:56Z

> Could you try again and let me know how it goes, @sdroege?

This works now, thanks! I've tested both f16 and Q8_0.

pwilkin · 2026-05-17T20:17:23Z

BTW thanks for doing this. I was tempted to attempt it, but I figured I have enough to do and this architecture is a bit crazy :)

lfung109-web · 2026-05-18T14:43:20Z

Thank you! I put Q6_K on my Pi 5 16GB and got 6.5~6.9 TPS generation. A little disappointing, but it's a good substitute for Gemma4-e2b-it-IQ4_NL since ZAYA1 is a lot smarter!

The model's config.json reports vocab_size=262272 but the actual tokenizer only has 262147 tokens. The 125 extra entries are padding in PyTorch's embed_tokens.weight matrix that don't correspond to any real tokens. Use the pre-computed _tokenizer_vocab_size to write the correct vocab size in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
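The padding count follows directly from the two sizes quoted above. The helper name `padding_rows` below is hypothetical and only illustrates the arithmetic; it does not reproduce the converter's actual code.

```python
def padding_rows(config_vocab_size, tokenizer_vocab_size):
    """Number of embedding rows that have no corresponding tokenizer entry."""
    if tokenizer_vocab_size > config_vocab_size:
        raise ValueError("tokenizer is larger than the embedding matrix")
    return config_vocab_size - tokenizer_vocab_size


# Values taken from the commit message above.
extra = padding_rows(262272, 262147)
print(extra)
```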

Juste-Leo2 · 2026-05-18T18:46:40Z

> @Juste-Leo2 as for correctness, you should probably just run make causal-verify-logits from examples/model_conversion first :)

I had some trouble running the test initially due to a vocabulary size mismatch. The PyTorch config included some extra unused padding tokens (PyTorch vocab: 262272 vs llama.cpp vocab: 262147). Since this doesn't impact inference for current GGUFs, I made a commit to explicitly align with the llama.cpp vocab size.

Here are the detailed results of the verification:

🔍 Token & Logits Verification

✅ Match: All 6 tokens match between PyTorch and llama.cpp.
Observation: The top-10 predictions share the exact same tokens. The llama.cpp scores are systematically ~0.2 to 0.8 higher.

Here are the raw logits for the top-10 predictions:

Rank	Token	PyTorch Logits	llama.cpp Logits	Difference
1	`Alex`	15.744840	15.971514	+0.226674
2	`X`	15.345150	15.850997	+0.505847
3	`\n`	15.037668	15.798749	+0.761081
4	`Dr`	15.006845	15.574100	+0.567255
5	`<`	15.006233	15.410849	+0.404616
6	`—`	14.760404	15.110687	+0.350283
7	`—`	14.739218	14.959461	+0.220243
8	`—`	14.561875	14.884461	+0.322586
9	`—`	14.437909	14.692179	+0.254270
10	`—`	14.418798	14.672704	+0.253906

📈 NMSE Metrics

Metric	Value
MSE	7.805752e-02
Reference Variance	6.342938e+00
NMSE	`1.230621e-02`
Max Absolute Error	1.780159
Mean Absolute Error	0.201440

❌ Final Verdict & Context

The script returned an error (make: *** [Makefile:68: causal-verify-logits] Error 1) because the NMSE of 1.23e-02 is slightly above the automatic validation threshold (< 1e-02).
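For reference, the reported NMSE is simply the MSE divided by the variance of the reference logits. The sketch below recomputes it from the two numbers in the table and shows a generic NMSE function; using the population variance is an assumption, and `nmse` is a hypothetical helper rather than the verification script's own code.

```python
def nmse(reference, candidate):
    """Normalized mean squared error: MSE divided by the reference variance.

    Assumption: population variance (divide by n) of the reference values.
    """
    n = len(reference)
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / n
    mean = sum(reference) / n
    variance = sum((r - mean) ** 2 for r in reference) / n
    return mse / variance


THRESHOLD = 1e-2  # validation threshold quoted above

# Recompute the reported value from the MSE and reference variance in the table.
reported = 7.805752e-02 / 6.342938e+00
print(f"NMSE = {reported:.6e}, passes = {reported < THRESHOLD}")
```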

Given that the NMSE is just slightly above the threshold but the top-10 tokens are aligned, @pwilkin what do you think? Is this sufficient to validate the test, or is further investigation required regarding this gap?

pwilkin · 2026-05-18T20:00:18Z

Yeah, does look like a bug. You should probably dump intermediate tensors at this stage to see where the divergence starts.

Juste-Leo2 · 2026-05-18T20:16:25Z

Thanks for the feedback! I'll investigate this during the week to find where the divergence starts.

Juste-Leo2 · 2026-05-19T18:37:39Z

I had some time to run tests with OpenCode. I thought it would be nice to map things out with a Python script, so here are the results for each layer in BF16.

I noticed we have spikes on the odd layers. A few examples:

Layer 1: NMSE = 1e-5
Layer 55: NMSE = 0.15
Layer 69: NMSE = 0.22
Layer 79: NMSE = 0.78

To try to fix this, I re-analyzed things against Zyphra's official vLLM fork (which I based this on), specifically looking at the MoE logic (since the odd layers are MoE layers). We have the exact same implementation for the final residual scaling, as well as for the EDA (Exponential Decay Averaging) and the MoE gate/up/down logic.

I also tested disabling this EDA: the NMSE exploded to 5.72e-01 (46x worse), which confirms that it is indeed reducing the drift rather than causing it.

I thought it might be a routing precision issue, so I switched the softmax from bf16 to f32 (line 382 in zaya.cpp). The test result remained unchanged.

So after all these tests, I finally tried running everything in full F32 (which I should have done first 😅), and it passed the logit test! Here are the results:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.335141e-02
Reference Variance:           6.342938e+00
NMSE:                         3.681481e-03
Max Absolute Error:           0.798369
Mean Absolute Error:          0.117253
NMSE (dB):                    -24.34 dB
🎯 INTERPRETATION
==============================
👍 Good match
📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
   Small differences are likely due to precision/quantization.
📚 NMSE BENCHMARKS
==============================
< 1e-6:  Essentially identical
< 1e-4:  Excellent (typical for good conversions)
< 1e-3:  Very good
< 1e-2:  Good (acceptable for most use cases)
< 0.1:   Acceptable (may need verification)
> 1.0:   Poor (worse than random)
✅ RESULT: PASS (NMSE = 3.68e-03)

Here is the graph with all the layers in F32:

On this graph, I still observe a noticeable MSE increase at the end. In F32, layers 0-72 are correct, but I saw a sharp increase after that. I think this is probably due to the accumulation from the previous layers, and probably not a structural bug.

So, given that F32 passes the logit test (NMSE = 3.68e-03, rated "Good" by the script) and BF16 just barely fails due to this accumulating precision loss, the model's architecture seems consistent.

@pwilkin pinging you again with these new observations. Is it worth investigating further, or is this explanation sufficient to validate the architecture?

sdroege · 2026-05-21T09:45:07Z

> I finally tried running everything in full F32

That probably means that more tensors need less aggressive quantization for this model?

> but I saw a sharp increase after that. I think this is probably due to the accumulation from the previous layers

It still seems useful to understand exactly where this comes from if the PyTorch implementation of the very same model doesn't show this behaviour. I have no idea, but this feels like something is going wrong somewhere.

Juste-Leo2 · 2026-05-21T10:59:20Z

Thanks for the feedback! Since I'm still getting used to interpreting the exact impact of quantization with these tools, I'm still a bit hesitant to draw definitive conclusions.
To move forward and make the logic review easier, I'll push a temporary commit. I'm going to literally put the Python code blocks from the vLLM fork as comments right next to my C++ implementation. This side-by-side comparison will help us strictly verify the architecture. Based on that, we'll see if it's worth digging deeper or if we can validate this step and move on.

pwilkin · 2026-05-21T11:45:10Z

Sorry, I'll take a look when I'm able - from my intuition, if it doesn't validate at BF16, then that suggests something is still wrong, but I'd have to look at the intermediate tensors themselves to make an informed opinion.

Add detailed inline comments mapping each C++ code section to the corresponding zaya.py and cca.py Python lines, including code snippets for direct comparison.

zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via (self.layer_number != zaya_first_layer). Add il != 1 guard to match.

… _FP32EmbeddingMethod

Correct line reference from zaya.py L387-389 to L459-469, and add note explaining why excluding the skip expert from gate_probs is correct (bias=-1.0 makes it effectively never selected at inference with topk=1).

Juste-Leo2 · 2026-05-22T09:46:06Z

I added comments comparing my implementation with Zyphra's vLLM fork (zaya.py and CCA.py), then corrected the minor differences.
Unfortunately, the BF16 test result remained unchanged.

Note: for the logit test, I'm using Zyphra's transformers fork. So my next step is to compare against that fork to see if there are any differences there as well.

I also wanted to clarify that the test showing the graph with the layers was a cumulative test, not a test with separate, independent layers.

- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)

Match Python reference which casts hidden_states and residual to float32 before ggml_add in both per-layer and final residual paths. zaya.py ref: L900, L1387, L1701

Juste-Leo2 · 2026-05-22T18:02:46Z

After hours of debugging, I think I've finally found the root cause of the issue!

For the ggml_conv_1d_grouped operation, the Zaya model absolutely needs F32 when running the BF16 weights. Currently, the operation relies on the standard ggml_conv_1d, which forces a pass through im2col in F16. This downcast introduces a precision loss.

I put together a diagram (thanks opencode) to illustrate exactly what happens:

BEFORE (NMSE 1.23e-02)       AFTER  (NMSE 3.94e-03)
QK_dw [7,1280,1] F32         QK_dw [7,1280,1] F32
        │                                │
        ▼                                ▼
ggml_conv_1d_grouped(10)         ggml_conv_1d_grouped(10)
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ group 0 │ ...                │ group 0 │ ...
   │ slice   │                    │ slice   │
   │ [7,128] │ F32                │ [7,128] │ F32
   │    │    │                    │    │    │
   │    ▼    │                    │    ▼    │
   │ im2col  │                    │ im2col  │
   │ F16 ⚠️  │← precision loss    │ F32 ✅  │← exact
   │ -1.191  │                    │ -1.191  │
   │ →-1.194 │                    │ →-1.191 │
   │    │    │                    │    │    │
   │    ▼    │                    │    ▼    │
   │ mul_mat │ F16×F32            │ mul_mat │ F32×F32
   │    │    │                    │    │    │
   │ 0.818   │← err amplified     │ 0.818   │✓
   │→0.750   │  δ=0.068           │→0.818   │  δ≈0
   └────┬────┘                    └────┬────┘
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ concat  │                    │ concat  │
   │ groups  │                    │ groups  │
   └────┬────┘                    └────┬────┘
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ attn +  │ err amplified      │ attn +  │ minimal
   │  rest   │ by later layers    │  rest   │ error
   │         │                    │         │
   │logits   │ NMSE 1.23e-02      │logits   │ NMSE 3.94e-03
   └─────────┘                    └─────────┘
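
The F16 rounding step shown in the diagram can be reproduced in isolation. The following is a minimal sketch, assuming that Python's half-precision `struct` format (`"e"`) rounds the same way as an F16 im2col buffer; `to_f16` and `dot` are hypothetical helpers, and the activations and weights are made-up values, not data from the model.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(a, b):
    """Plain dot product, standing in for one row of the mul_mat."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative activations and weights (assumptions, not model data).
acts = [(-1.0) ** i * (1.0 + i / 7.0) for i in range(14)]
weights = [0.05 * (i % 5 - 2) for i in range(14)]

exact = dot(acts, weights)                        # activations kept as-is
lossy = dot([to_f16(a) for a in acts], weights)   # activations rounded to F16 first

print("single value:", -1.191, "->", to_f16(-1.191))
print("dot product error:", abs(exact - lossy))
```

Each rounded element is off by at most a relative 2^-11, so the effect on a single value is small. The diagram attributes the NMSE gap to that error accumulating through mul_mat and the later layers.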

Given these results, I think the best path forward is to revert the recent adjustments and go back to the state just after this commit. The subsequent commits didn't bring any improvements (they mainly documented the code and added elements that are likely handled implicitly). We can discuss if we want to keep a few specific commits, but the metrics clearly point to this f16 bottleneck as the main culprit.

Note that this means the test passes with the F16 model:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.430202e-02
Reference Variance:           6.339689e+00
NMSE:                         3.833313e-03
Max Absolute Error:           0.842108
Mean Absolute Error:          0.121947
NMSE (dB):                    -24.16 dB

🎯 INTERPRETATION
==============================
👍 Good match

📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
   Small differences are likely due to precision/quantization.

📚 NMSE BENCHMARKS
==============================

✅ RESULT: PASS (NMSE = 3.83e-03)

but not with the BF16 model:

📈 METRICS
==============================
MSE (Mean Squared Error):     7.868546e-02
Reference Variance:           6.339689e+00
NMSE:                         1.241156e-02
Max Absolute Error:           1.791535
Mean Absolute Error:          0.202144
NMSE (dB):                    -19.06 dB

🎯 INTERPRETATION
==============================
⚠️ Acceptable match

📋 GUIDANCE
==============================
⚠️  ACCEPTABLE: Conversion is working but with some differences.
   Check if you're using quantization (Q4, Q8, etc.)
   Test generation quality to see if it's acceptable.

📚 NMSE BENCHMARKS
==============================

❌ RESULT: NEEDS REVIEW (NMSE = 1.24e-02)
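
The printed metrics are consistent with NMSE being the MSE divided by the reference variance, and with the dB figure being 10·log10(NMSE). The sketch below recomputes both reports from the numbers quoted above. The formulas are inferred from the output, not taken from the comparison script itself, and `nmse` and `to_db` are hypothetical helper names.

```python
import math


def nmse(mse: float, ref_var: float) -> float:
    """Normalized MSE: error power relative to the reference signal's variance."""
    return mse / ref_var


def to_db(x: float) -> float:
    """Express a power ratio in decibels."""
    return 10.0 * math.log10(x)


REF_VAR = 6.339689e+00  # "Reference Variance" from both reports

f16 = nmse(2.430202e-02, REF_VAR)   # F16 report
bf16 = nmse(7.868546e-02, REF_VAR)  # BF16 report

print(f"F16  NMSE = {f16:.6e} ({to_db(f16):.2f} dB)")
print(f"BF16 NMSE = {bf16:.6e} ({to_db(bf16):.2f} dB)")
```

Under this reading, the BF16 run is roughly 5 dB worse than the F16 run, which is the gap between the two verdicts.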

- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
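
The selection rule quoted in the commit message can be written out as a small model. This is only a sketch of the decision, not the ggml code: `TensorType` is a hypothetical stand-in for `ggml_type`, and `im2col_dst_type` is a made-up name for the conditional expression.

```python
from enum import Enum


class TensorType(Enum):
    """Hypothetical stand-in for a few ggml_type values."""
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    Q4_0 = "q4_0"


def im2col_dst_type(kernel_type: TensorType) -> TensorType:
    """Mirror of: a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32."""
    if kernel_type is TensorType.F16:
        return TensorType.F16  # native F16 weights keep the F16 path
    return TensorType.F32      # everything else is promoted to F32


for t in TensorType:
    print(f"{t.name:>5} kernel -> {im2col_dst_type(t).name} im2col buffer")
```

Only F16 kernels keep the F16 buffer; BF16, F32, and quantized kernels all fall through to F32, which is the behaviour the commit message describes.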

Juste-Leo2 · 2026-05-22T21:33:49Z

I found the fix! We simply need to allocate im2col dynamically: F32 for unsupported types (like BF16 or quantized) and F16 for native F16 models.

This keeps F16 hardware-optimized while allowing BF16/quantized models to use precise F32 math without backend crashes.

Here are the final passing results for both:
F16 model: MSE 2.430202e-02 | NMSE 3.833313e-03 (PASS)

BF16 model:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.556950e-02
Reference Variance:           6.339689e+00
NMSE:                         4.033242e-03
Max Absolute Error:           0.862177
Mean Absolute Error:          0.122009
NMSE (dB):                    -23.94 dB

✅ RESULT: PASS (NMSE = 4.03e-03)

rawsh · 2026-05-24T21:00:31Z

@Juste-Leo2 I took a stab at some perf follow-ups for CCA and qkmean fusion: CUDA goes from ~80 to 108 t/s (batch size 1 decode, Q4_K_M, on a 4090), validated with per-layer dumps against the Zaya transformers FP32 implementation plus op tests. Happy to open a follow-up after this lands, or share the dev branch if useful.

Juste-Leo2 · 2026-05-25T10:45:15Z

I believe the implementation is complete. I'm waiting for the im2col and grouped conv PRs to be merged first since this draft depends on them. Once those are in, I plan to convert this draft into a proper PR.

Juste-Leo2 and others added 17 commits May 8, 2026 11:08

ops: add Conv1dGrouped operation

99e5d03

initial implementation

e0ac753

implementation checkpoint

7cc554a

update

02a9843

add corrections

8362c10

zaya generation running

109856e

zaya: remove unused CCA_QK_NORM tensor constant

fede4c6

zaya: remove dead ZAYA_ROUTER_MLP2 mapping from non-block config

2069583

zaya: revert unrelated debug.cpp changes

356e962

zaya: replace hardcoded n_ff_exp with GGUF metadata

81d727f

Remove hardcoded 256 value for router MLP hidden size and read it from the GGUF expert_feed_forward_length metadata key instead. The converter now writes zaya_mlp_expansion from config.json.

github-actions Bot added the model, testing, python, and ggml labels May 15, 2026

quant: exclude Zaya cca_conv_grp tensors from quantization

800fbe8

Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv, and RWKV time_mix tensors. These small conv weights (d_conv=2) are not divisible by quant block sizes (32), causing Q8_0 failures.
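
The divisibility problem described in this commit can be sketched as a simple check. This is an illustration under stated assumptions, not the llama.cpp quantization code: the block size of 32 and the `cca_conv_grp` name come from the commit message, while `should_quantize` and the example tensor names are hypothetical.

```python
QUANT_BLOCK_SIZE = 32  # block size quoted in the commit message


def row_is_quantizable(row_len: int, block: int = QUANT_BLOCK_SIZE) -> bool:
    """Block quantization needs each row to split evenly into blocks."""
    return row_len % block == 0


def should_quantize(name: str, row_len: int) -> bool:
    """Skip the grouped-conv weights by name, then apply the size check."""
    if "cca_conv_grp" in name:  # pattern taken from the commit title
        return False
    return row_is_quantizable(row_len)


# Hypothetical tensor names, for illustration only.
print(should_quantize("blk.0.cca_conv_grp.weight", 2))  # conv weight, d_conv=2
print(should_quantize("blk.0.ffn_up.weight", 4096))     # ordinary matmul weight
```

Excluding the tensors by name, rather than relying on the size check alone, matches the pattern the commit message cites for Mamba, Kimi, and RWKV.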

Juste-Leo2 added 2 commits May 21, 2026 22:50

docs(zaya): add Python reference comments to C++ implementation

f1bd772

Add detailed inline comments mapping each C++ code section to the corresponding zaya.py and cca.py Python lines, including code snippets for direct comparison.

fix(zaya): gate EDA with layer check matching Python use_eda logic

2234dab

zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via (self.layer_number != zaya_first_layer). Add il != 1 guard to match.

Juste-Leo2 force-pushed the Zaya1 branch 2 times, most recently from 05ec4f4 to 2b0c8c8 Compare May 21, 2026 22:35

feat(zaya): add zaya_high_prec for FP32 output logits matching Python _FP32EmbeddingMethod

1fc4581

Juste-Leo2 force-pushed the Zaya1 branch from 2b0c8c8 to 1fc4581 Compare May 21, 2026 23:21

zaya.cpp: fix comment reference to MOD skip expert handling

0f37ace

Correct line reference from zaya.py L387-389 to L459-469, and add note explaining why excluding the skip expert from gate_probs is correct (bias=-1.0 makes it effectively never selected at inference with topk=1).

leo added 2 commits May 22, 2026 16:30

zaya: cast residual to F32 before addition (residual_in_fp32)

9aaef94

Match Python reference which casts hidden_states and residual to float32 before ggml_add in both per-layer and final residual paths. zaya.py ref: L900, L1387, L1701

Juste-Leo2 added 3 commits May 22, 2026 22:31

cleanup: revert debugs commits

abe9e40

Revert "docs(zaya): add Python reference comments to C++ implementation"

6fad5d8

This reverts commit f1bd772.

Juste-Leo2 added 2 commits May 25, 2026 11:30

zaya: add il != 1 check for EDA to match python reference

894ffd4

This is a safety guard matching self.layer_number != zaya_first_layer in the original implementation. No behavioral change for correctly converted models since the tensor is already nullptr for layer 1.

zaya: compute residual in fp32 to match config

1a7582b

The model config has residual_in_fp32=true. Cast both residual branches to float32 to align with the python reference.
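
One reason a reference implementation may keep the residual stream in FP32 is that BF16 has only 8 significant bits, so small updates added to a large residual can be rounded away entirely. The sketch below illustrates this with made-up numbers. `to_bf16` is a hypothetical helper that emulates BF16 by rounding the upper 16 bits of a float32, and none of the values come from the model.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to float32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16: keep the top 16 bits of the float32 pattern.

    Rounds to nearest, ties to even. NaN/Inf are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(residual, updates, cast):
    """Add each update to the residual, rounding after every addition."""
    acc = cast(residual)
    for u in updates:
        acc = cast(acc + cast(u))
    return acc


updates = [0.5] * 8  # illustrative per-layer contributions
bf16_sum = accumulate(256.0, updates, to_bf16)
f32_sum = accumulate(256.0, updates, to_f32)

print("BF16 residual:", bf16_sum)
print("F32 residual: ", f32_sum)
```

In this example the BF16 accumulator never moves from 256.0, because each 0.5 update is below half the spacing between representable values at that magnitude, while the F32 accumulator reaches 260.0.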

Juste-Leo2 mentioned this pull request May 25, 2026

ggml: uniformize im2col dst_type for all conv ops #23660

Open
