
[DRAFT] Support for Zaya1 8B model (depends on PR #22833) #23112

Draft
Juste-Leo2 wants to merge 33 commits into ggml-org:master from Juste-Leo2:Zaya1

@Juste-Leo2
Contributor

@Juste-Leo2 Juste-Leo2 commented May 15, 2026

Overview

This PR adds support for the Zaya1 8B model, without Markovian RSA (see issue #22776).

Note: This PR depends on PR #22833, which is why I opened it as a draft: I can update the tree as changes are requested.

Zaya is a hybrid recurrent/attention model. It alternates classic MoE layers with convolution-based CCA layers.

CCA is meant to replace classic attention. As far as I understand, the process is:

  1. Q and K projections from the hidden state
  2. Computation of the pre-means (before convolution): the average per GQA group of Q and K
  3. Q+K concatenation, then a depthwise convolution followed by a grouped convolution
  4. Injection of the pre-means AFTER the convolution: the convolution result is added to the means computed in step 2
  5. L2 norm + learned temperature on K (similar to the Kimi Linear or Qwen3 Next models)
  6. Finally, 50% RoPE
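As a rough illustration of the pre-mean step (step 2), here is a minimal pure-Python sketch for a single token. It mirrors the `qk_mean_q` / `qk_mean_k` lines visible in the `zaya.cpp` diff quoted later in this thread. The function name `cca_pre_means`, the list-of-lists layout, and the assumption that consecutive Q heads share one K head are illustrative choices, not the actual ggml code.

```python
def cca_pre_means(q_heads, k_heads):
    """Sketch of the CCA pre-means for one token.

    q_heads: list of n_head vectors, k_heads: list of n_head_kv vectors.
    Assumes Q heads [kv * n_gqa, (kv + 1) * n_gqa) belong to K head kv.
    """
    n_head, n_head_kv = len(q_heads), len(k_heads)
    assert n_head % n_head_kv == 0, "n_head must be a multiple of n_head_kv"
    n_gqa = n_head // n_head_kv
    dim = len(q_heads[0])

    # Q side: each Q head is averaged with the K head of its GQA group.
    mean_q = [
        [0.5 * (q_heads[h][d] + k_heads[h // n_gqa][d]) for d in range(dim)]
        for h in range(n_head)
    ]

    # K side: each K head is averaged with the mean of the Q heads in its group.
    mean_k = []
    for kv in range(n_head_kv):
        group = q_heads[kv * n_gqa:(kv + 1) * n_gqa]
        q_mean = [sum(q[d] for q in group) / n_gqa for d in range(dim)]
        mean_k.append([0.5 * (q_mean[d] + k_heads[kv][d]) for d in range(dim)])

    return mean_q, mean_k
```

In the pipeline described above, these two results would be added back to the convolution output in step 4.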

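The Zaya authors also apply an attention projection to the previous token: in the code, V concatenates val_proj1 of the current hidden state with val_proj2 of the one-token-delayed hidden state.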

Additional information

This port is heavily based on the vLLM implementation.

For context, I initially started this port on my own fork. I later got stuck on an inference issue, and the work was moved to a separate branch on the Zyphra repository where @nanduruganesh greatly helped unblock the situation (see Zyphra PR #1 and Zyphra PR #2). I then took care of the refactoring and various fixes (using OpenCode) to propose this clean version upstream. A huge thanks to him for his crucial help!

Inference results on an RTX 4070 Ti with 64 GB of RAM:

In BF16

~/zaya/llama.cpp$ ./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf   -p "Quelle est la capitale de la France ?"   -n 64 -c 512 -ngl all -sm none --single-turn --simple-io
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes, VRAM: 12281 MiB

Loading model...

...

> Quelle est la capitale de la France ?

[Start thinking]
We need to answer the question: "Quelle est la capitale de la France ?" which is French asking: What is the capital of France? The assistant should answer in French presumably, or maybe in English? The user wrote in French. Usually we answer in the same language. So answer: Paris. Possibly add a brief

[ Prompt: 81.5 t/s | Generation: 11.0 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4070 Ti) | 12281 =    0 + (17445 = 15879 +      21 +    1545) +       -5164 |
common_memory_breakdown_print: |   - Host                |                 1033 =  1024 +       0 +       9            

Q4_K_M

~/zaya/llama.cpp$ ./build/bin/llama-cli -m models/ZAYA1-8B-Q4_K_M.gguf   -p "Quelle est la capitale de la France ?"   -n 64 -c 512 -ngl all -sm none --single-turn --simple-io
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 12281 MiB):
  Device 0: NVIDIA GeForce RTX 4070 Ti, compute capability 8.9, VMM: yes, VRAM: 12281 MiB

Loading model...

...

> Quelle est la capitale de la France ?

[Start thinking]
We have a conversation: user asks "Quelle est la capitale de la France?" which is French: "What is the capital of France?" The user is asking a factual question. According to knowledge, the capital of France is Paris.

We need to answer appropriately. The user is using French language. Probably we should

[ Prompt: 167.8 t/s | Generation: 45.9 t/s ]

Exiting...
common_memory_breakdown_print: | memory breakdown [MiB]  | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4070 Ti) | 12281 = 5092 + (5836 =  4874 +      21 +     941) +         1352 |
common_memory_breakdown_print: |   - Host                |                  429 =   420 +       0 +       9            

Requirements

  • AI usage disclosure: YES
    • I used OpenCode to better understand the implementation and architectural concepts.
    • I discussed with the AI to refactor the code and use existing constants to improve maintainability.
    • The naive implementation of the architecture was ported from vLLM.
    • I used AI to translate this PR description (originally handwritten by me in English) and to improve its readability.

I am more than willing to do my best to answer maintainers' questions about the architecture or anything else.

Juste-Leo2 and others added 17 commits May 8, 2026 11:08
- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant
that's already used by other SSM-based architectures (mamba, jamba, etc.)

- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this
a semantically correct replacement that reduces custom constants.

- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input,
making this a semantically reasonable replacement.

- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate,
making this a reasonable replacement for reducing custom constants.

- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant
for both weight and bias tensors, differentiated by suffix.

- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant
for both weight and bias tensors, differentiated by suffix.

Remove hardcoded 256 value for router MLP hidden size and read it
from the GGUF expert_feed_forward_length metadata key instead.
The converter now writes zaya_mlp_expansion from config.json.

val_proj1 and val_proj2 output dimension should be latent_k_dim / 2
(n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently
both are equal for ZAYA1-8B (n_head_kv=2), but this would break
for any other n_head_kv configuration.

github-actions bot added the model, testing, python, and ggml labels May 15, 2026

Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv,
and RWKV time_mix tensors. These small conv weights (d_conv=2)
are not divisible by quant block sizes (32), causing Q8_0 failures.
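The divisibility constraint in this commit message can be sketched as a small type-selection helper. This is only an illustration: the function name `pick_tensor_type` and the F32 fallback are assumptions, not the converter's actual code; only the block size of 32 and `d_conv=2` come from the message above.

```python
Q8_0_BLOCK_SIZE = 32  # block size quoted in the commit message above


def pick_tensor_type(row_len, wanted="Q8_0", block_size=Q8_0_BLOCK_SIZE):
    """Return the wanted quant type only if a row splits into whole blocks.

    Hypothetical helper: rows that are not a multiple of the block size
    (such as a conv kernel with d_conv=2) stay unquantized. F32 is an
    assumed fallback type for this sketch.
    """
    if row_len % block_size != 0:
        return "F32"
    return wanted
```

With `d_conv=2`, `pick_tensor_type(2)` keeps the tensor unquantized, while a row length such as 4096 can be quantized as requested.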
@sdroege

sdroege commented May 16, 2026

Wanted to give this a try but I guess the Vulkan backend needs some more work for this, or is this unexpected?

build/bin/llama-cli -hf JusteLeo/ZAYA1-8B-GGUF:Q8_0 --jinja -ngl 99

Loading model... \.../llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7653: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw
and cca_conv_grp to F16 before convolution to support quantized
models (Q4, Q8). CUDA/SYCL backends are unaffected since their
im2col implementation only reads kernel dimensions, not data.
@Juste-Leo2
Contributor Author

> Wanted to give this a try but I guess the Vulkan backend needs some more work for this, or is this unexpected?
>
> build/bin/llama-cli -hf JusteLeo/ZAYA1-8B-GGUF:Q8_0 --jinja -ngl 99
>
> Loading model... \.../llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:7653: GGML_ASSERT(y_non_contig || !qy_needs_dequant) failed

OK, after some trouble, I think I've found the cause. It was a type issue: on the CPU backend and similar ones, im2col requires F16 kernel weights. The CUDA implementation only reads the kernel's dimensions during the conversion, not its data, which is why it worked correctly on CUDA.
Could you try again and let me know how it goes, @sdroege?

@kdrapelinexto

kdrapelinexto commented May 16, 2026

I wanted to try the Zaya model on my 9060 XT with ROCm 7.1. I compiled the MTP version of llama and backported your branch, but ran into issues such as: ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed

This is my setup.

"llama-server.exe" ^
  -m "D:\04_AI\Models\8B\ZAYA1-8B-Q6_K.gguf" ^
  -ngl all ^
  --mlock ^
  -sm none ^
  -fa 1 ^
  --temp 0.7 ^
  --top-k 20 ^
  --top-p 0.8 ^
  --min-p 0.0 ^
  -t 8 -tb 8 --prio 2 --prio-batch 2 --poll 100 --poll-batch 1 ^
  --presence-penalty 1.5 ^
   --ctx-size 24000 ^
   -n -1 ^
   --host 0.0.0.0 ^
  --port 9090 ^
  --alias "ZAYA1-8B-Q6_K" ^
  -np 1

I used AI to fix it; it made a few changes in zaya.cpp. Here's the unified diff:

 src/models/zaya.cpp | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/models/zaya.cpp b/src/models/zaya.cpp
index cda5abeea..01a7cdd0d 100644
--- a/src/models/zaya.cpp
+++ b/src/models/zaya.cpp
@@ -221,7 +221,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
             ggml_tensor * cur_state_src = ggml_cont(ctx0, cur);
             ggml_tensor * cur_seq = ggml_reshape_3d(ctx0, cur_state_src, n_embd, n_seq_tokens, n_seqs);
 
-            ggml_tensor * hs_d = ggml_reshape_3d(ctx0, prev_hs, n_embd, 1, n_seqs);
+            ggml_tensor * hs_d = ggml_reshape_3d(ctx0, ggml_cont(ctx0, prev_hs), n_embd, 1, n_seqs);
             if (n_seq_tokens > 1) {
                 ggml_tensor * cur_shift = ggml_view_3d(ctx0, cur_seq, n_embd, n_seq_tokens - 1, n_seqs,
                         cur_seq->nb[1],
@@ -229,7 +229,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
                         0);
                 hs_d = ggml_concat(ctx0, hs_d, cur_shift, 1);
             }
-            hs_d = ggml_reshape_2d(ctx0, hs_d, n_embd, n_tokens);
+            hs_d = ggml_reshape_2d(ctx0, ggml_cont(ctx0, hs_d), n_embd, n_tokens);
             cb(hs_d, "cca_hs_d", il);
 
             // V = concat(val_proj1(x), val_proj2(x delayed)) -> [n_embd_k, n_tokens]
@@ -249,7 +249,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * Kpre_grouped = ggml_reshape_4d(ctx0, Kpre, n_embd_head, 1, n_head_kv, n_tokens);
             Kpre_grouped = ggml_repeat_4d(ctx0, Kpre_grouped, n_embd_head, n_gqa, n_head_kv, n_tokens);
-            ggml_tensor * Kpre_rep = ggml_reshape_3d(ctx0, Kpre_grouped, n_embd_head, n_head, n_tokens);
+            ggml_tensor * Kpre_rep = ggml_reshape_3d(ctx0, ggml_cont(ctx0, Kpre_grouped), n_embd_head, n_head, n_tokens);
             ggml_tensor * qk_mean_q = ggml_scale(ctx0, ggml_add(ctx0, Qpre, Kpre_rep), 0.5f);
             cb(qk_mean_q, "qk_mean_q", il);
 
@@ -257,7 +257,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
             Qgroup = ggml_permute(ctx0, Qgroup, 1, 0, 2, 3);
             Qgroup = ggml_cont(ctx0, Qgroup);
             ggml_tensor * Qmean = ggml_mean(ctx0, Qgroup);
-            Qmean = ggml_reshape_3d(ctx0, Qmean, n_embd_head, n_head_kv, n_tokens);
+            Qmean = ggml_reshape_3d(ctx0, ggml_cont(ctx0, Qmean), n_embd_head, n_head_kv, n_tokens);
             ggml_tensor * qk_mean_k = ggml_scale(ctx0, ggml_add(ctx0, Qmean, Kpre), 0.5f);
             cb(qk_mean_k, "qk_mean_k", il);
 
@@ -289,7 +289,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * conv_dw = layer.cca_conv_dw;
             if (conv_dw->type != GGML_TYPE_F16) {
-                conv_dw = ggml_cast(ctx0, conv_dw, GGML_TYPE_F16);
+                conv_dw = ggml_cont(ctx0, ggml_cast(ctx0, conv_dw, GGML_TYPE_F16));
             }
             conv_dw = ggml_reshape_3d(ctx0, conv_dw, conv_dw->ne[0], 1, n_qk);
             ggml_tensor * QK = ggml_conv_1d_dw(ctx0, conv_dw, conv_input, 1, 0, 1);
@@ -300,7 +300,7 @@ llama_model_zaya::graph::graph(const llama_model & model, const llm_graph_params
 
             ggml_tensor * conv_grp = layer.cca_conv_grp;
             if (conv_grp->type != GGML_TYPE_F16) {
-                conv_grp = ggml_cast(ctx0, conv_grp, GGML_TYPE_F16);
+                conv_grp = ggml_cont(ctx0, ggml_cast(ctx0, conv_grp, GGML_TYPE_F16));
             }
             QK = ggml_conv_1d_grouped(ctx0, conv_grp, QK, 1, 0, 1, n_groups);
             QK = ggml_add(ctx0, QK, ggml_reshape_3d(ctx0, layer.cca_conv_grp_b, 1, n_qk, 1));

It works so far, but token generation is slow (~15 t/s) compared to a Qwen 35B-A3B (60 t/s). I was expecting much more from this model considering the small number of active params.

ROCm and Vulkan backends require contiguous tensors for im2col and
mul_mat operations. Add ggml_cont after ggml_cast for conv kernels
and after ggml_concat for hs_d to ensure compatibility across all
backends. CUDA was unaffected since it handles non-contiguous
tensors more permissively.
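To make the contiguity requirement concrete, here is a simplified Python model of a tensor described by its sizes (`ne`) and byte strides (`nb`), fastest dimension first. It is a sketch under stated assumptions: the helper names are hypothetical, and the real `ggml_is_contiguous` handles more cases (block-quantized types, for example) than this check does.

```python
def packed_strides(ne, elem_size):
    """Byte strides of a densely packed tensor, fastest dimension first."""
    strides, step = [], elem_size
    for n in ne:
        strides.append(step)
        step *= n
    return tuple(strides)


def is_contiguous(ne, nb, elem_size):
    """True when the strides match a dense layout (simplified check)."""
    return tuple(nb) == packed_strides(ne, elem_size)


def permute_2d(ne, nb):
    """Swap the two dimensions of a 2D view without moving any data."""
    return (ne[1], ne[0]), (nb[1], nb[0])


ne = (4, 3)                        # 4 x 3 tensor of 4-byte floats
nb = packed_strides(ne, 4)         # (4, 16): dense layout
p_ne, p_nb = permute_2d(ne, nb)    # same memory, swapped view: (3, 4), (16, 4)
cont_nb = packed_strides(p_ne, 4)  # strides after a copy into a dense layout
```

A permuted or sliced view keeps the original memory, so its strides no longer describe a dense layout. A `ggml_cont`-style copy restores that property at the cost of an extra memory copy, which is why only the necessary calls are worth keeping.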
@Juste-Leo2
Contributor Author

Juste-Leo2 commented May 16, 2026

> I wanted to give a try to the Zaya model with my 9060 XT and ROCM 7.1 [...] Got some issues such as: ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed
>
> [...]
>
> It works so far but the token generation is low (~15 t/s) compared to a Qwen 35B-A3B (60t/s), I was expecting much more from this model considering the small amount of active params.

Thanks for the catch!

I was also unsure about the speed during implementation, and there are a few reasons for it:

  • ggml_conv_1d_grouped is a naive implementation that composes basic operators instead of using a dedicated kernel per backend, which explains the low efficiency.
  • CCA + MoE does more computation per token than a standard transformer, so it's naturally more resource-intensive.

I just pushed a commit with only the necessary ggml_cont calls. I think the AI added a few extras out of caution that weren't actually needed and would have introduced unnecessary memory copies.

Could you try again and let me know how it goes, @kdrapelinexto? Thanks!

kmbandy added a commit to kmbandy/llama.cpp that referenced this pull request May 17, 2026
Ports PR ggml-org#22833 and PR ggml-org#23112 from ggml-org/llama.cpp onto our fork.

- ggml: add ggml_conv_1d_grouped op (depthwise + headwise conv via
  ggml_view_3d slicing, falls back to existing conv1d/dw for groups=1
  and groups=IC)
- gguf: register ZAYA arch, CCA_VAL_PROJ1/2, CCA_CONV_GRP, CCA_K_SCALE,
  RES_SCALE_HS/RES/FINAL, ZAYA_ROUTER_MLP2/4/BIASES/EDA_SCALE tensors
- src: add llama_model_zaya with alternating CCA (even) and MoE (odd)
  layers; residual scaling at every layer and final norm
- conversion/zaya.py: HF→GGUF converter for ZayaModel/ZayaForCausalLM
- Includes ggml_cont fixes for ROCm non-contiguous tensor compatibility
  and F16 cast fixes for CPU backend (from Zyphra fork review)

Markovian RSA (test-time compute method) is intentionally excluded and
will be a separate implementation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sdroege

sdroege commented May 17, 2026

> Could you try again and let me know how it goes, @sdroege?

This works now, thanks! I've tested both f16 and Q8_0.

@pwilkin
Member

pwilkin commented May 17, 2026

BTW thanks for doing this, I was tempted to attempt it but I figured I have enough to do and this architecture is a bit crazy :)

@lfung109-web

lfung109-web commented May 18, 2026

Thank you! I put Q6_K on my Pi 5 (16 GB), which resulted in 6.5~6.9 TPS generation. A little disappointing, but it's a good substitute for Gemma4-e2b-it-IQ4_NL since ZAYA1 is a lot smarter!

The model's config.json reports vocab_size=262272 but the actual tokenizer
only has 262147 tokens. The 125 extra entries are padding in PyTorch's
embed_tokens.weight matrix that don't correspond to any real tokens.

Use the pre-computed _tokenizer_vocab_size to write the correct vocab size
in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
@Juste-Leo2
Contributor Author

Juste-Leo2 commented May 18, 2026

> @Juste-Leo2 as for correctness, you should probably just run make causal-verify-logits from examples/model_conversion first :)

I had some trouble running the test initially due to a vocabulary size mismatch. The PyTorch config included some extra unused padding tokens (PyTorch vocab: 262272 vs llama.cpp vocab: 262147). Since this doesn't impact inference for current GGUFs, I made a commit to explicitly align with the llama.cpp vocab size.

Here are the detailed results of the verification:

🔍 Token & Logits Verification

Match: All 6 tokens match between PyTorch and llama.cpp.
Observation: The top-10 predictions share the exact same tokens. The llama.cpp scores are systematically ~0.2 to 0.8 higher.

Here are the raw logits for the top-10 predictions:

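| Rank | Token | PyTorch logits | llama.cpp logits | Difference |
|------|-------|----------------|------------------|------------|
| 1    | Alex  | 15.744840      | 15.971514        | +0.226674  |
| 2    | X     | 15.345150      | 15.850997        | +0.505847  |
| 3    | \n    | 15.037668      | 15.798749        | +0.761081  |
| 4    | Dr    | 15.006845      | 15.574100        | +0.567255  |
| 5    | <     | 15.006233      | 15.410849        | +0.404616  |
| 6    |       | 14.760404      | 15.110687        | +0.350283  |
| 7    |       | 14.739218      | 14.959461        | +0.220243  |
| 8    |       | 14.561875      | 14.884461        | +0.322586  |
| 9    |       | 14.437909      | 14.692179        | +0.254270  |
| 10   |       | 14.418798      | 14.672704        | +0.253906  |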

📈 NMSE Metrics

| Metric              | Value        |
|---------------------|--------------|
| MSE                 | 7.805752e-02 |
| Reference Variance  | 6.342938e+00 |
| NMSE                | 1.230621e-02 |
| Max Absolute Error  | 1.780159     |
| Mean Absolute Error | 0.201440     |
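For readers unfamiliar with the metric, the reported values are consistent with NMSE being the mean squared error divided by the variance of the reference logits (7.805752e-02 / 6.342938e+00 ≈ 1.2306e-02). The sketch below is a plain-Python illustration under that assumption; it is not the verification script itself, and the function names are hypothetical.

```python
import math


def nmse(reference, test):
    """Mean squared error normalised by the (population) variance of the reference."""
    n = len(reference)
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / n
    mean_ref = sum(reference) / n
    ref_var = sum((r - mean_ref) ** 2 for r in reference) / n
    return mse / ref_var


def nmse_db(value):
    """Express an NMSE value in decibels."""
    return 10.0 * math.log10(value)


# Tiny worked example: MSE = 0.025, reference variance = 1.25, NMSE = 0.02.
example = nmse([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Because the metric is normalised by the reference variance, it measures error relative to the spread of the logits rather than their absolute scale.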

❌ Final Verdict & Context

The script returned an error (make: *** [Makefile:68: causal-verify-logits] Error 1) because the NMSE of 1.23e-02 is slightly above the automatic validation threshold (< 1e-02).

Given that the NMSE is just slightly above the threshold but the top-10 tokens are aligned, @pwilkin what do you think? Is this sufficient to validate the test, or is further investigation required regarding this gap?

@pwilkin
Member

pwilkin commented May 18, 2026

Yeah, does look like a bug. You should probably dump intermediate tensors at this stage to see where the divergence starts.

@Juste-Leo2
Contributor Author

> Yeah, does look like a bug. You should probably dump intermediate tensors at this stage to see where the divergence starts.

Thanks for the feedback! I'll investigate this during the week to find where the divergence starts.

@Juste-Leo2
Contributor Author

I had some time to run tests with opencode. I thought it would be nice to map it out with a Python script, so here are the results for each layer in BF16.

[Image: per-layer error plot, BF16]

I noticed we have spikes on the odd layers. A few examples:

  • Layer 1: NMSE = 1e-5
  • Layer 55: NMSE = 0.15
  • Layer 69: NMSE = 0.22
  • Layer 79: NMSE = 0.78

To try and fix this, I re-analyzed things against Zyphra's official vLLM fork (which I based this port on), specifically looking at the MoE logic (since the odd layers are the MoE layers). We have the exact same implementation for the final residual scaling, as well as for the EDA (Exponential Decay Averaging) and the MoE gate/up/down logic.

I also tested disabling this EDA: the NMSE exploded to 5.72e-01 (46x worse), which confirms that it is indeed reducing the drift rather than causing it.

I thought it might be a routing precision issue, so I switched the softmax from bf16 to f32 (line 382 in zaya.cpp). The test result remained unchanged.

So after all these tests, I finally tried running everything in full F32 (which I should have done first 😅), and it passed the logit test! Here are the results:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.335141e-02
Reference Variance:           6.342938e+00
NMSE:                         3.681481e-03
Max Absolute Error:           0.798369
Mean Absolute Error:          0.117253
NMSE (dB):                    -24.34 dB
🎯 INTERPRETATION
==============================
👍 Good match
📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
   Small differences are likely due to precision/quantization.
📚 NMSE BENCHMARKS
==============================
< 1e-6:  Essentially identical
< 1e-4:  Excellent (typical for good conversions)
< 1e-3:  Very good
< 1e-2:  Good (acceptable for most use cases)
< 0.1:   Acceptable (may need verification)
> 1.0:   Poor (worse than random)
✅ RESULT: PASS (NMSE = 3.68e-03)
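The benchmark bands printed above can be read as a simple lookup. The sketch below encodes them in Python to show where the two reported values fall. The function name `nmse_band` is hypothetical, and since the printed table does not label the range between 0.1 and 1.0, the sketch reports that range as unlabelled instead of guessing.

```python
# Upper bounds and labels copied from the "NMSE BENCHMARKS" output above.
BENCHMARKS = [
    (1e-6, "Essentially identical"),
    (1e-4, "Excellent"),
    (1e-3, "Very good"),
    (1e-2, "Good"),
    (0.1, "Acceptable"),
]


def nmse_band(value):
    """Return the benchmark label for an NMSE value (illustrative helper)."""
    for upper, label in BENCHMARKS:
        if value < upper:
            return label
    # The printed table jumps from "< 0.1" straight to "> 1.0".
    return "Poor" if value > 1.0 else "not labelled in the table"
```

With these bands, the F32 result (3.68e-03) falls under "Good" and the BF16 result (1.23e-02) under "Acceptable", which sits just above the 1e-02 pass threshold mentioned earlier.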

Here is the graph with all the layers in F32:

[Image: per-layer error plot, F32]

On this graph, I still observe a noticeable MSE increase at the end. In F32, layers 0-72 are correct, but there is a sharp increase after that. I think this is probably due to error accumulating from the previous layers rather than a structural bug.

So, given that F32 passes the logit test (NMSE = 3.68e-03, "Good" on the scale above) and BF16 just barely fails due to this accumulating precision loss, the model's architecture seems consistent.

@pwilkin pinging you again with these new observations. Is it worth investigating further, or is this explanation sufficient to validate the architecture?

@sdroege

sdroege commented May 21, 2026

> I finally tried running everything in full F32

That probably means that more tensors need less aggressive quantization for this model?

> but I saw a sharp increase after that. I think this is probably due to the accumulation from the previous layers

It still seems useful to understand exactly where this comes from if the PyTorch implementation of the very same model doesn't show this behaviour. I have no idea, but this feels like something is going wrong somewhere.

@Juste-Leo2
Contributor Author

> > I finally tried running everything in full F32
>
> That probably means that more tensors need less aggressive quantization for this model?
>
> > but I saw a sharp increase after that. I think this is probably due to the accumulation from the previous layers
>
> It still seems useful to understand exactly where this comes from if the PyTorch implementation of the very same model doesn't show this behaviour. I have no idea, but this feels like something is going wrong somewhere.

Thanks for the feedback! Since I'm still getting used to interpreting the exact impact of quantization with these tools, I'm still a bit hesitant to draw definitive conclusions.
To move forward and make the logic review easier, I'll push a temporary commit. I'm going to literally put the Python code blocks from the vLLM fork as comments right next to my C++ implementation. This side-by-side comparison will help us strictly verify the architecture. Based on that, we'll see if it's worth digging deeper or if we can validate this step and move on.

@pwilkin
Member

pwilkin commented May 21, 2026

Sorry, I'll take a look when I'm able - from my intuition, if it doesn't validate at BF16, then that suggests something is still wrong, but I'd have to look at the intermediate tensors themselves to make an informed opinion.

Add detailed inline comments mapping each C++ code section to the
corresponding zaya.py and cca.py Python lines, including code snippets
for direct comparison.

zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via
(self.layer_number != zaya_first_layer). Add il != 1 guard to match.

Juste-Leo2 force-pushed the Zaya1 branch 2 times, most recently from 05ec4f4 to 2b0c8c8 May 21, 2026 22:35

Correct line reference from zaya.py L387-389 to L459-469, and add
note explaining why excluding the skip expert from gate_probs is
correct (bias=-1.0 makes it effectively never selected at inference
with topk=1).
@Juste-Leo2
Contributor Author

Juste-Leo2 commented May 22, 2026

I added comments comparing my C++ implementation with the vLLM implementation in Zyphra's fork (zaya.py and CCA.py), then corrected the minor differences. Unfortunately, the BF16 test result remained unchanged.

Note: For the logit test, I'm using Zyphra's transformers fork, so my next step is to compare against that fork to see if there are any differences there as well.

I also wanted to clarify that the test showing the graph with the layers was a cumulative test, not a test with separate, independent layers.

leo added 2 commits May 22, 2026 16:30
- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states
  before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)

Match Python reference which casts hidden_states and residual to
float32 before ggml_add in both per-layer and final residual paths.

zaya.py ref: L900, L1387, L1701
@Juste-Leo2
Contributor Author

Juste-Leo2 commented May 22, 2026

After hours of debugging, I think I've finally found the root cause of the issue!

For the ggml_conv_1d_grouped operation, the Zaya model absolutely needs F32 precision (when the model runs in BF16). Currently, the operation relies on the standard ggml_conv_1d, which forces a pass through im2col in F16. This downcast introduces a precision loss.

I put together a diagram (thanks OpenCode) to illustrate exactly what happens:

BEFORE (NMSE 1.23e-02)       AFTER  (NMSE 3.94e-03)
QK_dw [7,1280,1] F32         QK_dw [7,1280,1] F32
        │                                │
        ▼                                ▼
ggml_conv_1d_grouped(10)         ggml_conv_1d_grouped(10)
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ group 0 │ ...                │ group 0 │ ...
   │ slice   │                    │ slice   │
   │ [7,128] │ F32                │ [7,128] │ F32
   │    │    │                    │    │    │
   │    ▼    │                    │    ▼    │
   │ im2col  │                    │ im2col  │
   │ F16 ⚠️  │← precision loss    │ F32 ✅  │← exact
   │ -1.191  │                    │ -1.191  │
   │ →-1.194 │                    │ →-1.191 │
   │    │    │                    │    │    │
   │    ▼    │                    │    ▼    │
   │ mul_mat │ F16×F32            │ mul_mat │ F32×F32
   │    │    │                    │    │    │
   │ 0.818   │← err amplified     │ 0.818   │✓
   │→0.750   │  δ=0.068           │→0.818   │  δ≈0
   └────┬────┘                    └────┬────┘
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ concat  │                    │ concat  │
   │ groups  │                    │ groups  │
   └────┬────┘                    └────┬────┘
        │                                │
   ┌────┴────┐                    ┌────┴────┐
   │ attn +  │ err amplified      │ attn +  │ minimal
   │  rest   │ by later layers    │  rest   │ error
   │         │                    │         │
   │logits   │ NMSE 1.23e-02      │logits   │ NMSE 3.94e-03
   └─────────┘                    └─────────┘
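
To make the effect in the diagram easy to reproduce outside ggml, here is a rough sketch in plain Python. It unfolds a single-channel 1D signal into im2col columns, optionally rounds them to half precision using the standard `struct` module's `"e"` format, and compares the result with the unrounded path. The signal and kernel values are made up for illustration and are not taken from the model; the helper names are hypothetical.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def im2col_1d(signal, kernel_size, cast=None):
    """Unfold a 1D signal into overlapping windows (stride 1, no padding)."""
    cols = [signal[i:i + kernel_size]
            for i in range(len(signal) - kernel_size + 1)]
    if cast is not None:
        # Emulates storing the im2col buffer in a narrower type.
        cols = [[cast(v) for v in col] for col in cols]
    return cols


def conv1d_via_im2col(signal, kernel, cast=None):
    """im2col followed by a dot product per column (the mul_mat step)."""
    cols = im2col_1d(signal, len(kernel), cast)
    return [sum(c * k for c, k in zip(col, kernel)) for col in cols]


# Illustrative values only (assumption, not model data).
signal = [-1.191, 0.3337, 2.7183, -0.5772, 1.4142, 0.1234, -3.1416]
kernel = [0.25, -0.5, 0.125]

exact = conv1d_via_im2col(signal, kernel)               # full-precision columns
lossy = conv1d_via_im2col(signal, kernel, cast=to_f16)  # columns rounded to F16
max_err = max(abs(a - b) for a, b in zip(exact, lossy))
print(f"max abs error introduced by the F16 im2col buffer: {max_err:.6f}")
```

A single convolution only shows a small per-element error here; the point of the diagram is that the error is introduced at the im2col step and then carried through the later layers.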

Given these results, I think the best path forward is to revert the recent adjustments and go back to the state just after this commit. The subsequent commits didn't bring any improvements (they mainly documented the code and added elements that are likely handled implicitly). We can discuss if we want to keep a few specific commits, but the metrics clearly point to this f16 bottleneck as the main culprit.

Note that this means the test passes in F16:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.430202e-02
Reference Variance:           6.339689e+00
NMSE:                         3.833313e-03
Max Absolute Error:           0.842108
Mean Absolute Error:          0.121947
NMSE (dB):                    -24.16 dB

🎯 INTERPRETATION
==============================
👍 Good match

📋 GUIDANCE
==============================
👍 GOOD: Your GGML conversion is working well.
   Small differences are likely due to precision/quantization.

📚 NMSE BENCHMARKS
==============================

✅ RESULT: PASS (NMSE = 3.83e-03)

but not in BF16:

📈 METRICS
==============================
MSE (Mean Squared Error):     7.868546e-02
Reference Variance:           6.339689e+00
NMSE:                         1.241156e-02
Max Absolute Error:           1.791535
Mean Absolute Error:          0.202144
NMSE (dB):                    -19.06 dB

🎯 INTERPRETATION
==============================
⚠️ Acceptable match

📋 GUIDANCE
==============================
⚠️  ACCEPTABLE: Conversion is working but with some differences.
   Check if you're using quantization (Q4, Q8, etc.)
   Test generation quality to see if it's acceptable.

📚 NMSE BENCHMARKS
==============================

❌ RESULT: NEEDS REVIEW (NMSE = 1.24e-02)
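
For reference, the figures in these reports are related as NMSE = MSE / reference variance and NMSE (dB) = 10·log10(NMSE). A minimal sketch of that computation follows. It is not the comparison script used above; in particular, using the population variance (dividing by n) is an assumption, and the function names are hypothetical.

```python
import math


def to_db(nmse):
    """Convert an NMSE ratio to decibels."""
    return 10.0 * math.log10(nmse) if nmse > 0 else float("-inf")


def nmse_report(reference, candidate):
    """Compare two equally long logit vectors (reference must not be constant)."""
    n = len(reference)
    diffs = [r - c for r, c in zip(reference, candidate)]
    mse = sum(d * d for d in diffs) / n
    mean = sum(reference) / n
    ref_var = sum((r - mean) ** 2 for r in reference) / n  # assumed: population variance
    nmse = mse / ref_var
    return {
        "mse": mse,
        "ref_var": ref_var,
        "nmse": nmse,
        "nmse_db": to_db(nmse),
        "max_abs": max(abs(d) for d in diffs),
        "mean_abs": sum(abs(d) for d in diffs) / n,
    }


# Toy vectors for illustration only.
report = nmse_report([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Plugging in the reported values gives the same ratios as the logs: 2.430202e-02 / 6.339689 ≈ 3.83e-03 for F16 and 7.868546e-02 / 6.339689 ≈ 1.24e-02 for BF16.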

- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
@Juste-Leo2
Contributor Author

Juste-Leo2 commented May 22, 2026

I found the fix! We simply need to allocate im2col dynamically: F32 for unsupported types (like BF16 or quantized) and F16 for native F16 models.

This keeps F16 hardware-optimized while allowing BF16/quantized models to use precise F32 math without backend crashes.
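
The selection rule itself is small. As a sketch, here is the conditional quoted in the commit message (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) mirrored in Python. The `TensorType` enum and `im2col_dst_type` function are hypothetical stand-ins for illustration, not ggml API.

```python
from enum import Enum


class TensorType(Enum):
    """Hypothetical stand-in for a few ggml tensor types."""
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    Q4_0 = "q4_0"


def im2col_dst_type(kernel_type):
    """Pick the im2col buffer type from the convolution kernel's type.

    Only F16 kernels keep the F16 buffer; every other type
    (F32, BF16, quantized) is promoted to F32.
    """
    return TensorType.F16 if kernel_type is TensorType.F16 else TensorType.F32


for t in TensorType:
    print(f"{t.name:>5} kernel -> im2col in {im2col_dst_type(t).name}")
```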

Here are the final passing results for both:
F16: MSE 2.430202e-02 | NMSE 3.833313e-03 (PASS)

BF16:

📈 METRICS
==============================
MSE (Mean Squared Error):     2.556950e-02
Reference Variance:           6.339689e+00
NMSE:                         4.033242e-03
Max Absolute Error:           0.862177
Mean Absolute Error:          0.122009
NMSE (dB):                    -23.94 dB

✅ RESULT: PASS (NMSE = 4.03e-03)

@rawsh

rawsh commented May 24, 2026

@Juste-Leo2 I took a stab at some perf follow-ups for CCA and qkmean fusion: CUDA bsz1 decode goes from ~80 to 108 t/s with Q4_K_M on a 4090, validated with per-layer dumps against the Zaya Transformers fp32 reference plus op tests. Happy to open a follow-up after this lands, or share the dev branch if useful.

This is a safety guard matching self.layer_number != zaya_first_layer
in the original implementation. No behavioral change for correctly
converted models since the tensor is already nullptr for layer 1.

The model config has residual_in_fp32=true. Cast both residual
branches to float32 to align with the python reference.
@Juste-Leo2
Contributor Author

I believe the implementation is complete. I'm waiting for the im2col and grouped conv PRs to be merged first since this draft depends on them. Once those are in, I plan to convert this draft into a proper PR.


Labels

ggml, model, Nvidia GPU, python, testing
