
Draft PR for ZAYA1 llama.cpp implementation #1

Open
nanduruganesh wants to merge 6 commits into master from CCA

@nanduruganesh
Collaborator

Draft PR with an initial working implementation of ZAYA1-8B, forked from https://github.com/Juste-Leo2/llama.cpp/tree/CCA.
Open to public collaboration for this merge.

Quickstart

  1. Build llama.cpp
  2. Hugging Face -> GGUF conversion
hf download Zyphra/ZAYA1-8B --local-dir ./models/ZAYA1-8B
python convert_hf_to_gguf.py models/ZAYA1-8B --outfile models/ZAYA1-8B-BF16.gguf --outtype bf16
  3. Inference
./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf -p \
"Derive bernoulli's equation from the work-energy theorem" \
-n 256 -c 4096 -ngl all -sm none --single-turn --simple-io \
--no-warmup --ctx-checkpoints 0 --cache-ram 0 --no-context-shift

Output:

[image: screenshot of the llama-cli output]

Todo:

  1. Fix slight output logit mismatch between GGUF and transformers forward
  2. Optimize for faster ITL (probably under single-batch conditions?)
  3. Clean up this vibecoded implementation for a proper PR, mainly by working through the Contributing guidelines
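
For the first item, one way to quantify the mismatch is to dump the logits for the same prompt from both implementations and compare them. A minimal sketch, assuming the two logit vectors are already available as plain lists; `compare_logits` and the toy values are hypothetical and not part of this PR:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def compare_logits(ref, test):
    """Summarize how far `test` logits drift from `ref` logits."""
    lp_ref, lp_test = log_softmax(ref), log_softmax(test)
    # KL(ref || test) over the next-token distribution
    kl = sum(math.exp(a) * (a - b) for a, b in zip(lp_ref, lp_test))
    return {
        "max_abs_diff": max(abs(a - b) for a, b in zip(ref, test)),
        "kl_ref_to_test": kl,
        "same_argmax": ref.index(max(ref)) == test.index(max(test)),
    }

# Toy values standing in for real logit dumps (assumption)
ref = [2.0, 1.0, 0.5, -1.0]
test = [2.01, 0.98, 0.5, -0.97]
report = compare_logits(ref, test)
print(report)
```

A small maximum difference with a matching argmax usually points to precision or operation-ordering effects, while a large KL value suggests a real graph difference.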

@Juste-Leo2

Juste-Leo2 commented May 12, 2026

I tested the current PR with a Q3_K_M quantized model, and it works. Thanks a lot for the adjustments @nanduruganesh, the generation is now usable!

I'll take a closer look at the code. During quantization I got a few warnings, but that's probably fixable without too much trouble:

warning: blk.20.cca_conv_dw.weight            - ncols      2 not divisible by 256 (required for type    q3_K) (WARNING: must use F16 due to unusual shape) -> falling back to     f16
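
The warning above says q3_K needs the row length to be a multiple of 256, and that this tensor has only 2 columns, so it is stored as f16 instead. The rule it describes can be sketched roughly as follows. This illustrates the check only and is not llama.cpp's actual quantization code; `choose_tensor_type` and the 4096-column example are hypothetical:

```python
def choose_tensor_type(ncols: int, requested: str,
                       block_size: int = 256, fallback: str = "f16") -> str:
    """Return `requested` if rows split evenly into blocks, else `fallback`."""
    if ncols % block_size != 0:
        # Row length is not a whole number of blocks: keep the tensor unquantized
        return fallback
    return requested

print(choose_tensor_type(2, "q3_K"))     # the cca_conv_dw case from the warning
print(choose_tensor_type(4096, "q3_K"))  # hypothetical wide tensor
```

Since the affected tensor is tiny, the fallback costs little file size, which fits the reading that the warning is harmless.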

I've had the chance to contribute a bit to llama.cpp and I follow their work closely. The best approach would really be to break this large PR into several smaller ones, to make it easier for the maintainers to review and to adjust things incrementally. The ggml_conv_1d_grouped operation already gives us one natural split, but it's worth checking whether ZAYA1 support itself could be split into multiple distinct PRs.

I'm with you on speed. Given how sparse this MoE is (only 760M active parameters), there's probably a lot of optimization potential. But for a llama.cpp integration, I'd recommend going with a naive implementation first, then optimizing later. That's often how it's done.

We should probably also consider running some perplexity tests. This would help spot any potential degradation if something's off.
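
As a reminder of what such a test measures, perplexity is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming per-token natural-log probabilities have already been collected; the numbers are made up for illustration and are not measurements from this model:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities for the same text
baseline = [-1.2, -0.4, -2.0, -0.9]
quantized = [-1.3, -0.5, -2.1, -1.0]
print(perplexity(baseline), perplexity(quantized))
```

Comparing the BF16 GGUF against each quantized variant on the same text would show whether a given quantization, or a bug in the graph, degrades the model noticeably.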

As an electronics student by background, I need to get more comfortable with the codebase to really understand llama.cpp's structure and potentially suggest changes to the implementation if needed. I'll do that when I have some free time. Don't hesitate to move ahead if you see things that need changing.

Juste-Leo2 left a comment

I think the cleanup will mainly involve remapping to make use of what's already there. Feel free to give me feedback, but I think this is a good starting point. I've mainly commented on a single file, but the logic applies to the others as well, of course.

Comment thread ggml/src/ggml.c
Comment on lines +4578 to +4599
        struct ggml_tensor * a_g = ggml_view_3d(ctx, a,
                a->ne[0], IC_G, OC_G,
                a->nb[1], a->nb[2],
                g * OC_G * a->nb[2]);

        // slice input for group g: [L, IC_G, N]
        struct ggml_tensor * b_g = ggml_view_3d(ctx, b,
                b->ne[0], IC_G, b->ne[2],
                b->nb[1], b->nb[2],
                g * IC_G * b->nb[1]);

        struct ggml_tensor * out_g = ggml_conv_1d(ctx, a_g, b_g, s0, p0, d0);

        if (result == NULL) {
            result = out_g;
        } else {
            result = ggml_concat(ctx, result, out_g, 1);
        }
    }

    return result;
}

We're reusing the ggml operations; we'll need to see later whether it's necessary to create a specific operation for each backend.
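
The slice, convolve, and concatenate pattern in the excerpt can be illustrated outside ggml. Below is a pure-Python sketch of the same idea for a single sequence (no batch dimension). It is a reference illustration under that simplification, not the PR's code, and the function names and toy data are hypothetical:

```python
def conv1d(w, x, stride=1, pad=0, dil=1):
    """Plain 1D convolution (cross-correlation).
    w: [OC][IC][K] kernel, x: [IC][L] input -> [OC][L_out] output."""
    ic, length, k = len(x), len(x[0]), len(w[0][0])
    l_out = (length + 2 * pad - dil * (k - 1) - 1) // stride + 1
    out = []
    for w_oc in w:
        row = []
        for t in range(l_out):
            acc = 0.0
            for c in range(ic):
                for j in range(k):
                    pos = t * stride - pad + j * dil
                    if 0 <= pos < length:  # outside the input counts as zero padding
                        acc += w_oc[c][j] * x[c][pos]
            row.append(acc)
        out.append(row)
    return out

def conv1d_grouped(w, x, groups, stride=1, pad=0, dil=1):
    """Grouped convolution as `groups` independent convolutions,
    concatenated along the channel axis.
    w: [OC][IC // groups][K], x: [IC][L]."""
    oc, ic = len(w), len(x)
    assert oc % groups == 0 and ic % groups == 0
    oc_g, ic_g = oc // groups, ic // groups
    out = []
    for g in range(groups):
        w_g = w[g * oc_g:(g + 1) * oc_g]  # kernels for group g
        x_g = x[g * ic_g:(g + 1) * ic_g]  # input channels for group g
        out.extend(conv1d(w_g, x_g, stride, pad, dil))
    return out

# Depthwise case: groups == channels, each channel has its own 2-tap kernel
x = [[1.0, 2.0, 3.0, 4.0],
     [10.0, 20.0, 30.0, 40.0]]
w = [[[1.0, 1.0]],    # channel 0: x[t] + x[t+1]
     [[1.0, -1.0]]]   # channel 1: x[t] - x[t+1]
y = conv1d_grouped(w, x, groups=2)
print(y)
```

Building the operation from existing primitives keeps it backend-agnostic, at the cost of one convolution and one concat per group. That cost is the trade-off a dedicated per-backend kernel would remove.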

Comment thread ggml/include/ggml.h
Comment on lines +2050 to +2057
GGML_API struct ggml_tensor * ggml_conv_1d_grouped(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,   // convolution kernel
        struct ggml_tensor  * b,   // data
        int                   s0,  // stride
        int                   p0,  // padding
        int                   d0,  // dilation
        int                   groups); // number of groups

So this should go in a separate PR.

Comment thread common/debug.cpp
Comment on lines 146 to +164
@@ -162,6 +155,13 @@ bool common_debug_cb_eval(struct ggml_tensor * t, bool ask, void * user_data) {
}
}

if (ask) {
return matches_filter;
}

const struct ggml_tensor * src0 = t->src[0];
const struct ggml_tensor * src1 = t->src[1];

These changes will need to be removed.

Comment thread gguf-py/gguf/constants.py
Comment on lines +614 to +638
CCA_CONV_DW = auto() # Zaya
CCA_CONV_GRP = auto() # Zaya
CCA_CONV_DW_B = auto() # Zaya: conv_qk.0.bias
CCA_QK_NORM = auto() # Zaya (weightless - unit RMSNorm)
CCA_K_SCALE = auto() # Zaya
CCA_VAL_PROJ1 = auto() # Zaya: CCA value projection stream 1
CCA_VAL_PROJ2 = auto() # Zaya: CCA value projection stream 2
RES_SCALE_HS = auto() # Zaya: hidden_states_scale
RES_SCALE_HS_B = auto() # Zaya: hidden_states_bias
RES_SCALE_RES = auto() # Zaya: residual_scale
RES_SCALE_RES_B = auto() # Zaya: residual_bias
RES_SCALE_HS_FINAL = auto() # Zaya: final hidden_states_scale
RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
RES_SCALE_RES_FINAL = auto() # Zaya: final residual_scale
RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
ZAYA_ROUTER_DOWN = auto() # Zaya
ZAYA_ROUTER_DOWN_B = auto() # Zaya
ZAYA_ROUTER_NORM = auto() # Zaya
ZAYA_ROUTER_MLP0 = auto() # Zaya
ZAYA_ROUTER_MLP0_B = auto() # Zaya
ZAYA_ROUTER_MLP2 = auto() # Zaya
ZAYA_ROUTER_MLP2_B = auto() # Zaya
ZAYA_ROUTER_MLP4 = auto() # Zaya
ZAYA_ROUTER_BIASES = auto() # Zaya
ZAYA_ROUTER_EDA_SCALE = auto() # Zaya

I think we need to simplify and reuse what we already have.

Comment thread gguf-py/gguf/constants.py
SSM_BETA = auto() # Kimi Linear qwen3.5
SSM_G_A = auto() # Kimi Linear
SSM_G_B = auto() # Kimi Linear
CCA_CONV_DW = auto() # Zaya

CCA_CONV_DW --> SSM_CONV1D?

Comment thread gguf-py/gguf/constants.py
RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
ZAYA_ROUTER_DOWN = auto() # Zaya
ZAYA_ROUTER_DOWN_B = auto() # Zaya
ZAYA_ROUTER_NORM = auto() # Zaya

FFN_NORM ?

Comment thread gguf-py/gguf/constants.py
Comment on lines +632 to +635
ZAYA_ROUTER_MLP0 = auto() # Zaya
ZAYA_ROUTER_MLP0_B = auto() # Zaya
ZAYA_ROUTER_MLP2 = auto() # Zaya
ZAYA_ROUTER_MLP2_B = auto() # Zaya

FFN_GATE ?

Comment thread gguf-py/gguf/constants.py
ZAYA_ROUTER_MLP0_B = auto() # Zaya
ZAYA_ROUTER_MLP2 = auto() # Zaya
ZAYA_ROUTER_MLP2_B = auto() # Zaya
ZAYA_ROUTER_MLP4 = auto() # Zaya

FFN_DOWN ?

Note: I think I commented on the wrong line (see line 629).

Comment thread gguf-py/gguf/constants.py
Comment on lines +621 to +628
RES_SCALE_HS = auto() # Zaya: hidden_states_scale
RES_SCALE_HS_B = auto() # Zaya: hidden_states_bias
RES_SCALE_RES = auto() # Zaya: residual_scale
RES_SCALE_RES_B = auto() # Zaya: residual_bias
RES_SCALE_HS_FINAL = auto() # Zaya: final hidden_states_scale
RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
RES_SCALE_RES_FINAL = auto() # Zaya: final residual_scale
RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias

Can the "_B" entries be removed?

Comment thread gguf-py/gguf/constants.py
Comment on lines 1153 to +4082
@@ -3992,6 +4044,42 @@ class MODEL_TENSOR(IntEnum):
MODEL_TENSOR.FFN_DOWN_SHEXP,
MODEL_TENSOR.FFN_UP_SHEXP,
],
MODEL_ARCH.ZAYA: [
MODEL_TENSOR.TOKEN_EMBD,
MODEL_TENSOR.OUTPUT_NORM,
MODEL_TENSOR.OUTPUT,
MODEL_TENSOR.ATTN_NORM,
MODEL_TENSOR.ATTN_Q,
MODEL_TENSOR.ATTN_K,
MODEL_TENSOR.ATTN_OUT,
MODEL_TENSOR.CCA_CONV_DW,
MODEL_TENSOR.CCA_CONV_DW_B,
MODEL_TENSOR.CCA_CONV_GRP,
MODEL_TENSOR.CCA_QK_NORM,
MODEL_TENSOR.CCA_K_SCALE,
MODEL_TENSOR.CCA_VAL_PROJ1,
MODEL_TENSOR.CCA_VAL_PROJ2,
MODEL_TENSOR.RES_SCALE_HS,
MODEL_TENSOR.RES_SCALE_HS_B,
MODEL_TENSOR.RES_SCALE_RES,
MODEL_TENSOR.RES_SCALE_RES_B,
MODEL_TENSOR.RES_SCALE_HS_FINAL,
MODEL_TENSOR.RES_SCALE_HS_B_FINAL,
MODEL_TENSOR.RES_SCALE_RES_FINAL,
MODEL_TENSOR.RES_SCALE_RES_B_FINAL,
MODEL_TENSOR.ZAYA_ROUTER_DOWN,
MODEL_TENSOR.ZAYA_ROUTER_DOWN_B,
MODEL_TENSOR.ZAYA_ROUTER_NORM,
MODEL_TENSOR.ZAYA_ROUTER_MLP0,
MODEL_TENSOR.ZAYA_ROUTER_MLP0_B,
MODEL_TENSOR.ZAYA_ROUTER_MLP2,
MODEL_TENSOR.ZAYA_ROUTER_MLP2_B,
MODEL_TENSOR.ZAYA_ROUTER_MLP4,
MODEL_TENSOR.ZAYA_ROUTER_BIASES,
MODEL_TENSOR.ZAYA_ROUTER_EDA_SCALE,
MODEL_TENSOR.FFN_GATE_UP_EXP,
MODEL_TENSOR.FFN_DOWN_EXP,
],

We can try removing the _B and _B_FINAL entries.
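
One way to read this suggestion: keep a single enum entry per tensor and distinguish weight from bias by the name suffix, rather than giving every bias its own entry. The sketch below illustrates that idea only. The class, the base-name strings, and the helper are hypothetical stand-ins, and whether ZAYA's scale/bias pairs actually fit this pattern would need checking against the converter:

```python
from enum import IntEnum, auto

class ModelTensor(IntEnum):  # hypothetical stand-in for MODEL_TENSOR
    RES_SCALE_HS = auto()
    RES_SCALE_RES = auto()

# Hypothetical base names; the real strings would live in gguf-py's name table
TENSOR_NAMES = {
    ModelTensor.RES_SCALE_HS: "blk.{bid}.res_scale_hs",
    ModelTensor.RES_SCALE_RES: "blk.{bid}.res_scale_res",
}

def tensor_name(key: ModelTensor, bid: int, suffix: str) -> str:
    """Build a full tensor name; `suffix` is 'weight' or 'bias'."""
    if suffix not in ("weight", "bias"):
        raise ValueError(f"unexpected suffix: {suffix}")
    return TENSOR_NAMES[key].format(bid=bid) + "." + suffix

print(tensor_name(ModelTensor.RES_SCALE_HS, 0, "weight"))
print(tensor_name(ModelTensor.RES_SCALE_HS, 0, "bias"))
```

With this layout, the eight RES_SCALE_* entries would shrink to four (or fewer if the final-layer variants can share entries too), and the router bias entries could be folded in the same way.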

@Juste-Leo2

Juste-Leo2 commented May 14, 2026

I think the best approach is to stick with two PRs:

  1. One for the ggml_conv_1d_grouped operation (in progress)
  2. Another to integrate the naive version of the model with mappings that reflect the existing setup

I don't know enough about Markovian RSA, but it could be implemented later.
I think I can handle the first refactoring for the mapping; I'll probably get it done this week.


2 participants