Draft PR for ZAYA1 llama.cpp implementation by nanduruganesh · Pull Request #1 · Zyphra/llama.cpp

nanduruganesh · 2026-05-12T04:57:31Z

Draft PR with an initial working implementation of ZAYA1-8B, forked from https://github.com/Juste-Leo2/llama.cpp/tree/CCA.
Open to public collaboration for this merge.

Quickstart

Build llama.cpp
Hugging Face -> GGUF conversion

hf download Zyphra/ZAYA1-8B --local-dir ./models/ZAYA1-8B
python convert_hf_to_gguf.py models/ZAYA1-8B --outfile models/ZAYA1-8B-BF16.gguf --outtype bf16

Inference

./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf -p \
"Derive bernoulli's equation from the work-energy theorem" \
-n 256 -c 4096 -ngl all -sm none --single-turn --simple-io \
--no-warmup --ctx-checkpoints 0 --cache-ram 0 --no-context-shift

Output:

Todo:

Fix slight output logit mismatch between GGUF and transformers forward
Optimize for lower inter-token latency (ITL), probably under single-batch conditions
Clean up this vibe-coded implementation for a proper PR, mainly by working through the Contributing guidelines
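
For the logit-mismatch item, a minimal sketch of how two logit vectors for the same token position could be compared. It assumes the logits from the transformers forward pass and from the GGUF run have already been dumped to plain Python lists; `compare_logits` and the sample values are hypothetical and not part of this PR.

```python
def compare_logits(ref, test):
    """Return (max_abs_diff, same_argmax) for two equal-length logit vectors."""
    if len(ref) != len(test):
        raise ValueError("logit vectors must have the same length")
    # Largest element-wise deviation between the two forward passes.
    max_abs_diff = max(abs(r - t) for r, t in zip(ref, test))
    # Whether greedy decoding would pick the same token on both sides.
    same_argmax = ref.index(max(ref)) == test.index(max(test))
    return max_abs_diff, same_argmax


# Hypothetical values standing in for real dumps.
ref_logits = [2.0, 0.5, -1.0, 3.25]
gguf_logits = [2.0, 0.75, -1.0, 3.0]
max_diff, same_top1 = compare_logits(ref_logits, gguf_logits)
print(max_diff, same_top1)
```

A small maximum difference with matching top-1 tokens would suggest precision noise; a large difference or a flipped argmax would point to a graph or weight-mapping problem.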

Juste-Leo2 · 2026-05-12T21:54:41Z

I tested the current PR with a Q3_K_M quantized model, and it works. Thanks a lot for the adjustments @nanduruganesh, the generation is now usable!

I'll take a closer look at the code. During quantization I got a few warnings, but that's probably fixable without too much trouble:

warning: blk.20.cca_conv_dw.weight            - ncols      2 not divisible by 256 (required for type    q3_K) (WARNING: must use F16 due to unusual shape) -> falling back to     f16
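
The warning follows from the K-quant block layout: the row length must be a multiple of 256, and this tensor has only 2 columns. The sketch below only illustrates that divisibility rule. `pick_tensor_type` and the set of type names are assumptions made for the example; this is not llama.cpp's actual quantization code.

```python
QK_K = 256  # block size the warning above requires for q3_K

# Assumed list of K-quant type names, for illustration only.
K_QUANTS = {"q2_K", "q3_K", "q4_K", "q5_K", "q6_K"}


def pick_tensor_type(ncols: int, requested: str, fallback: str = "f16") -> str:
    """Return the requested type, or the fallback if the row length does not fit."""
    if requested in K_QUANTS and ncols % QK_K != 0:
        return fallback
    return requested


print(pick_tensor_type(2, "q3_K"))     # the cca_conv_dw case from the warning
print(pick_tensor_type(4096, "q3_K"))  # a row length that is a multiple of 256
```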

I've had the chance to contribute a bit to llama.cpp and I follow their work closely. The best approach would really be to break this large PR into several smaller ones, to make it easier for the maintainers to review and to adjust things incrementally. The ggml_conv_1d_grouped operation already gives us one natural split, but it's worth checking whether ZAYA1 support itself could be split into multiple distinct PRs.

I'm with you on speed. Given how sparse this MoE is (only 760M active parameters), there's probably a lot of optimization potential. But for a llama.cpp integration, I'd recommend going with a naive implementation first, then optimizing later. That's often how it's done.

We should probably also consider running some perplexity tests. This would help spot any potential degradation if something's off.
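
For reference, perplexity is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming the per-token natural-log probabilities have already been collected from a model run; the `perplexity` helper and the sample values are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp(-(1/N) * sum(log p_i)) over the natural-log probabilities of the observed tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every observed token has perplexity 4.
sample = [math.log(0.25)] * 4
print(perplexity(sample))
```

Comparing this number between the BF16 GGUF and each quantized variant on the same text is one way to spot degradation.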

As an electronics student by background, I need to get more comfortable with the codebase to really understand llama.cpp's structure and potentially suggest changes to the implementation if needed. I'll do that when I have some free time. Don't hesitate to move ahead if you see things that need changing.

Juste-Leo2

I think the cleanup will mainly involve remapping tensors to make use of what's already there. Feel free to give me feedback, but I think this is a good starting point. I've mainly commented on a single file, but the same logic applies to the others as well.

Juste-Leo2 · 2026-05-14T19:35:19Z

+        struct ggml_tensor * a_g = ggml_view_3d(ctx, a,
+            a->ne[0], IC_G, OC_G,
+            a->nb[1], a->nb[2],
+            g * OC_G * a->nb[2]);
+
+        // slice input for group g: [L, IC_G, N]
+        struct ggml_tensor * b_g = ggml_view_3d(ctx, b,
+            b->ne[0], IC_G, b->ne[2],
+            b->nb[1], b->nb[2],
+            g * IC_G * b->nb[1]);
+
+        struct ggml_tensor * out_g = ggml_conv_1d(ctx, a_g, b_g, s0, p0, d0);
+
+        if (result == NULL) {
+            result = out_g;
+        } else {
+            result = ggml_concat(ctx, result, out_g, 1);
+        }
+    }
+
+    return result;
+}


We're reusing the ggml operations; we'll need to see later whether it's necessary to create a specific operation for each backend.
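
To make the slice, convolve, concatenate pattern in the hunk above easier to check, here is a pure-Python reference sketch. It assumes a channel-major layout (`x` as `[C_in][L]`, `w` as `[C_out][C_in / groups][K]`) and cross-correlation semantics. The function names are hypothetical, and ggml's actual tensor layout and strides differ.

```python
def conv1d(x, w, stride=1, pad=0, dil=1):
    """Plain 1D cross-correlation. x: [C_in][L], w: [C_out][C_in][K]."""
    length, k = len(x[0]), len(w[0][0])
    padded = [[0.0] * pad + list(ch) + [0.0] * pad for ch in x]
    l_out = (length + 2 * pad - dil * (k - 1) - 1) // stride + 1
    out = []
    for w_oc in w:
        row = []
        for t in range(l_out):
            acc = 0.0
            for ic, w_ic in enumerate(w_oc):
                for j, w_val in enumerate(w_ic):
                    acc += w_val * padded[ic][t * stride + j * dil]
            row.append(acc)
        out.append(row)
    return out


def conv1d_grouped(x, w, groups, stride=1, pad=0, dil=1):
    """Grouped conv built from per-group slices, concatenated along channels."""
    ic_g = len(x) // groups  # input channels per group
    oc_g = len(w) // groups  # output channels per group
    assert len(w[0]) == ic_g, "kernel must hold C_in / groups input channels"
    result = []
    for g in range(groups):
        x_g = x[g * ic_g:(g + 1) * ic_g]  # slice input for group g
        w_g = w[g * oc_g:(g + 1) * oc_g]  # slice kernel for group g
        result.extend(conv1d(x_g, w_g, stride, pad, dil))
    return result


x = [[1, 2, 3], [4, 5, 6]]    # 2 input channels, length 3
w = [[[1, 1]], [[1, -1]]]     # 2 output channels, 1 input channel each, K = 2
print(conv1d_grouped(x, w, groups=2))
```

With `groups` equal to the channel count, as here, this reduces to a depthwise convolution. A reference like this could serve as a test oracle if a dedicated backend operation is added later.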

Juste-Leo2 · 2026-05-14T19:35:39Z

+    GGML_API struct ggml_tensor * ggml_conv_1d_grouped(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * a,      // convolution kernel
+            struct ggml_tensor  * b,      // data
+            int                   s0,     // stride
+            int                   p0,     // padding
+            int                   d0,     // dilation
+            int                   groups); // number of groups


So this belongs in a separate PR.

Juste-Leo2 · 2026-05-14T19:36:02Z

@@ -162,6 +155,13 @@ bool common_debug_cb_eval(struct ggml_tensor * t, bool ask, void * user_data) {
        }
    }

+    if (ask) {
+        return matches_filter;
+    }
+
+    const struct ggml_tensor * src0 = t->src[0];
+    const struct ggml_tensor * src1 = t->src[1];
+


These changes will need to be removed.

Juste-Leo2 · 2026-05-14T19:38:42Z

+    CCA_CONV_DW          = auto() # Zaya
+    CCA_CONV_GRP         = auto() # Zaya
+    CCA_CONV_DW_B        = auto() # Zaya: conv_qk.0.bias
+    CCA_QK_NORM          = auto() # Zaya (weightless - unit RMSNorm)
+    CCA_K_SCALE          = auto() # Zaya
+    CCA_VAL_PROJ1        = auto() # Zaya: CCA value projection stream 1
+    CCA_VAL_PROJ2        = auto() # Zaya: CCA value projection stream 2
+    RES_SCALE_HS         = auto() # Zaya: hidden_states_scale
+    RES_SCALE_HS_B       = auto() # Zaya: hidden_states_bias
+    RES_SCALE_RES        = auto() # Zaya: residual_scale
+    RES_SCALE_RES_B      = auto() # Zaya: residual_bias
+    RES_SCALE_HS_FINAL   = auto() # Zaya: final hidden_states_scale
+    RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
+    RES_SCALE_RES_FINAL  = auto() # Zaya: final residual_scale
+    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
+    ZAYA_ROUTER_DOWN     = auto() # Zaya
+    ZAYA_ROUTER_DOWN_B   = auto() # Zaya
+    ZAYA_ROUTER_NORM     = auto() # Zaya
+    ZAYA_ROUTER_MLP0     = auto() # Zaya
+    ZAYA_ROUTER_MLP0_B   = auto() # Zaya
+    ZAYA_ROUTER_MLP2     = auto() # Zaya
+    ZAYA_ROUTER_MLP2_B   = auto() # Zaya
+    ZAYA_ROUTER_MLP4     = auto() # Zaya
+    ZAYA_ROUTER_BIASES   = auto() # Zaya
+    ZAYA_ROUTER_EDA_SCALE = auto() # Zaya


I think we need to simplify and reuse what we already have

Juste-Leo2 · 2026-05-14T19:39:15Z

    SSM_BETA             = auto() # Kimi Linear qwen3.5
    SSM_G_A              = auto() # Kimi Linear
    SSM_G_B              = auto() # Kimi Linear
+    CCA_CONV_DW          = auto() # Zaya


CCA_CONV_DW --> SSM_CONV1D ?

Juste-Leo2 · 2026-05-14T19:43:32Z

+    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
+    ZAYA_ROUTER_DOWN     = auto() # Zaya
+    ZAYA_ROUTER_DOWN_B   = auto() # Zaya
+    ZAYA_ROUTER_NORM     = auto() # Zaya


Juste-Leo2 · 2026-05-14T19:44:19Z

+    ZAYA_ROUTER_MLP0     = auto() # Zaya
+    ZAYA_ROUTER_MLP0_B   = auto() # Zaya
+    ZAYA_ROUTER_MLP2     = auto() # Zaya
+    ZAYA_ROUTER_MLP2_B   = auto() # Zaya


Juste-Leo2 · 2026-05-14T19:44:52Z

+    ZAYA_ROUTER_MLP0_B   = auto() # Zaya
+    ZAYA_ROUTER_MLP2     = auto() # Zaya
+    ZAYA_ROUTER_MLP2_B   = auto() # Zaya
+    ZAYA_ROUTER_MLP4     = auto() # Zaya


Note: I think I mentioned the wrong line (see 629)

Juste-Leo2 · 2026-05-14T19:48:14Z

+    RES_SCALE_HS         = auto() # Zaya: hidden_states_scale
+    RES_SCALE_HS_B       = auto() # Zaya: hidden_states_bias
+    RES_SCALE_RES        = auto() # Zaya: residual_scale
+    RES_SCALE_RES_B      = auto() # Zaya: residual_bias
+    RES_SCALE_HS_FINAL   = auto() # Zaya: final hidden_states_scale
+    RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
+    RES_SCALE_RES_FINAL  = auto() # Zaya: final residual_scale
+    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias


Can the _B (bias) entries be removed?

Juste-Leo2 · 2026-05-14T19:50:58Z

@@ -3992,6 +4044,42 @@ class MODEL_TENSOR(IntEnum):
        MODEL_TENSOR.FFN_DOWN_SHEXP,
        MODEL_TENSOR.FFN_UP_SHEXP,
    ],
+    MODEL_ARCH.ZAYA: [
+        MODEL_TENSOR.TOKEN_EMBD,
+        MODEL_TENSOR.OUTPUT_NORM,
+        MODEL_TENSOR.OUTPUT,
+        MODEL_TENSOR.ATTN_NORM,
+        MODEL_TENSOR.ATTN_Q,
+        MODEL_TENSOR.ATTN_K,
+        MODEL_TENSOR.ATTN_OUT,
+        MODEL_TENSOR.CCA_CONV_DW,
+        MODEL_TENSOR.CCA_CONV_DW_B,
+        MODEL_TENSOR.CCA_CONV_GRP,
+        MODEL_TENSOR.CCA_QK_NORM,
+        MODEL_TENSOR.CCA_K_SCALE,
+        MODEL_TENSOR.CCA_VAL_PROJ1,
+        MODEL_TENSOR.CCA_VAL_PROJ2,
+        MODEL_TENSOR.RES_SCALE_HS,
+        MODEL_TENSOR.RES_SCALE_HS_B,
+        MODEL_TENSOR.RES_SCALE_RES,
+        MODEL_TENSOR.RES_SCALE_RES_B,
+        MODEL_TENSOR.RES_SCALE_HS_FINAL,
+        MODEL_TENSOR.RES_SCALE_HS_B_FINAL,
+        MODEL_TENSOR.RES_SCALE_RES_FINAL,
+        MODEL_TENSOR.RES_SCALE_RES_B_FINAL,
+        MODEL_TENSOR.ZAYA_ROUTER_DOWN,
+        MODEL_TENSOR.ZAYA_ROUTER_DOWN_B,
+        MODEL_TENSOR.ZAYA_ROUTER_NORM,
+        MODEL_TENSOR.ZAYA_ROUTER_MLP0,
+        MODEL_TENSOR.ZAYA_ROUTER_MLP0_B,
+        MODEL_TENSOR.ZAYA_ROUTER_MLP2,
+        MODEL_TENSOR.ZAYA_ROUTER_MLP2_B,
+        MODEL_TENSOR.ZAYA_ROUTER_MLP4,
+        MODEL_TENSOR.ZAYA_ROUTER_BIASES,
+        MODEL_TENSOR.ZAYA_ROUTER_EDA_SCALE,
+        MODEL_TENSOR.FFN_GATE_UP_EXP,
+        MODEL_TENSOR.FFN_DOWN_EXP,
+    ],


We can try removing the _B or _B_FINAL.

Juste-Leo2 · 2026-05-14T20:05:03Z

I think the best approach is to stick with two PRs:
One for the ggml_conv_1d_grouped operation (in progress)
Another to integrate the naive version of the model with mappings that reflect the existing setup
I don't know enough about Markovian RSA, but it could be implemented later
I think I can handle the first refactoring for the mapping; I'll probably get it done this week.

Juste-Leo2 and others added 6 commits May 8, 2026 11:08

ops: add Conv1dGrouped operation

99e5d03

initial implementation

e0ac753

implementation checkpoint

7cc554a

update

02a9843

add corrections

8362c10

zaya generation running

109856e

nanduruganesh mentioned this pull request May 12, 2026

Feature Request: Support ZAYA1-8B (Sparse MoE) and Markovian RSA Sampling ggml-org/llama.cpp#22776

Open

4 tasks

Juste-Leo2 reviewed May 14, 2026


This was referenced May 15, 2026

[DRAFT] Zaya 1 Draft Support #2

Open

[DRAFT] Support for Zaya1 8B model (depends on PR #22833) ggml-org/llama.cpp#23112

Draft
