[DRAFT] Zaya 1 Draft Support by Juste-Leo2 · Pull Request #2 · Zyphra/llama.cpp

Juste-Leo2 · 2026-05-15T10:01:16Z

@nanduruganesh I've opened a new pull request here so that we can both push changes.
I've refactored the code to reuse the existing constants while keeping it functional. I used Open Code to make the changes. This PR will likely replace PR #1.

I'm reposting the message you originally posted below for anyone who wants to try the PR.

Quickstart

Build llama.cpp
Huggingface -> GGUF conversion

hf download Zyphra/ZAYA1-8B --local-dir ./models/ZAYA1-8B
python convert_hf_to_gguf.py models/ZAYA1-8B --outfile models/ZAYA1-8B-BF16.gguf --outtype bf16

Inference

./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf -p \
"Derive bernoulli's equation from the work-energy theorem" \
-n 256 -c 4096 -ngl all -sm none --single-turn --simple-io \
--no-warmup --ctx-checkpoints 0 --cache-ram 0 --no-context-shift

Output:

Todo:

Fix slight output logit mismatch between GGUF and transformers forward
Optimize for faster ITL (probably under single-batch conditions?)
Clean up this vibecoded implementation for proper PR, mainly just going through Contributing

- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant that's already used by other SSM-based architectures (mamba, jamba, etc.)

- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this a semantically correct replacement that reduces custom constants.

- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input, making this a semantically reasonable replacement.

- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate, making this a reasonable replacement for reducing custom constants.

- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant for both weight and bias tensors, differentiated by suffix.

- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant for both weight and bias tensors, differentiated by suffix.
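
The idea of sharing one constant between a weight and its bias can be sketched roughly as follows. This is only an illustration: `BASE_NAMES` and `tensor_name` are hypothetical helpers, and the base strings are assumed to follow the `blk.{bid}.<name>` pattern used for GGUF tensor names rather than being copied from this PR.

```python
# Hypothetical subset of base names, one entry per shared constant.
BASE_NAMES = {
    "FFN_GATE_INP": "blk.{bid}.ffn_gate_inp",  # router down projection
    "FFN_GATE": "blk.{bid}.ffn_gate",          # router MLP hidden layer
}


def tensor_name(kind: str, bid: int, suffix: str) -> str:
    """Build a full tensor name from a shared base constant and a suffix."""
    if suffix not in ("weight", "bias"):
        raise ValueError(f"unexpected suffix: {suffix}")
    return BASE_NAMES[kind].format(bid=bid) + "." + suffix


# The same constant yields two distinct tensors, told apart by suffix only.
router_down_w = tensor_name("FFN_GATE_INP", 3, "weight")
router_down_b = tensor_name("FFN_GATE_INP", 3, "bias")
```

With this layout no separate `*_B` constant is needed, since the suffix alone keeps the two tensors apart.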

Juste-Leo2 · 2026-05-15T11:23:38Z

Okay, regarding the warning that appears during quantization: it is expected, and suppressing it would require changes that affect all models. So it's best to keep the warning; it doesn't impact inference.
@nanduruganesh Can you give me a quick review if you think everything is good as is? I will open a draft PR on llama.cpp with a comprehensive description of how the model works, and I'll see about including inference speed tests. I will keep it as a draft because I'd prefer to merge the ggml_conv_1d_grouped operation first, if it meets the maintainers' requirements.

val_proj1 and val_proj2 output dimension should be latent_k_dim / 2 (n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently both are equal for ZAYA1-8B (n_head_kv=2), but this would break for any other n_head_kv configuration.
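
A small arithmetic sketch of why the two sizes coincide only when n_head_kv is 2. It assumes n_embd_k = n_embd_head * n_head_kv; `val_proj_out_dim` and the head size of 128 are hypothetical and used for illustration only.

```python
def val_proj_out_dim(n_embd_head: int, n_head_kv: int) -> int:
    """Output width of val_proj1/val_proj2: half of the latent K dimension."""
    n_embd_k = n_embd_head * n_head_kv  # assumed definition of latent_k_dim
    return n_embd_k // 2


HEAD_DIM = 128  # hypothetical head size, not taken from the model config

# n_head_kv = 2: n_embd_k / 2 equals n_embd_head, so either sizing works.
same_for_two_heads = val_proj_out_dim(HEAD_DIM, 2) == HEAD_DIM

# n_head_kv = 4: the sizes diverge, so sizing by n_embd_head would be wrong.
differs_for_four_heads = val_proj_out_dim(HEAD_DIM, 4) != HEAD_DIM
```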

ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw and cca_conv_grp to F16 before convolution to support quantized models (Q4, Q8). CUDA/SYCL backends are unaffected since their im2col implementation only reads kernel dimensions, not data.

ROCm and Vulkan backends require contiguous tensors for im2col and mul_mat operations. Add ggml_cont after ggml_cast for conv kernels and after ggml_concat for hs_d to ensure compatibility across all backends. CUDA was unaffected since it handles non-contiguous tensors more permissively.

Ramachandrajoshi · 2026-05-17T14:09:51Z

Tested this PR. llama-cli works, but llama-server fails:

llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9081-3aaab7f7b
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 8 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '../models/ZAYA1-8B-Q4_K_S.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
/data/data/com.termux/files/home/llama.cpp/ggml/src/ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed
0: 0x7ab358313c
1: 0x7ab35830fc ggml_print_backtrace
2: 0x7ab3596ac0 ggml_abort
3: 0x7ab3589f50 ggml_reshape_3d
4: 0x7aab66e34c _ZN16llama_model_zaya5graphC2ERK11llama_modelRK16llm_graph_params
5: 0x7aab66dc20 _ZNK16llama_model_zaya16build_arch_graphERK16llm_graph_params
6: 0x7aab5b03fc _ZNK11llama_model11build_graphERK16llm_graph_params
7: 0x7aab548bb4 _ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPm
8: 0x7aab547924 _ZN13llama_context13sched_reserveEv
9: 0x7aab5469e0 _ZN13llama_contextC2ERK11llama_model20llama_context_params
10: 0x7aab54f64c llama_init_from_model
11: 0x7ab188ed8c
12: 0x7ab1889c8c
13: 0x7ab1889a68 _Z17common_fit_paramsPKcP18llama_model_paramsP20llama_context_paramsPfP32llama_model_tensor_buft_overridePmj14ggml_log_level
14: 0x7ab186ec4c _ZN18common_init_resultC2ER13common_params
15: 0x7ab186ffdc _Z23common_init_from_paramsR13common_params
16: 0x7ab5746654
17: 0x7ab56c820c
18: 0x7aa852dd64 __libc_init
Aborted                    llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf

Juste-Leo2 · 2026-05-17T14:55:43Z

Thanks for the catch, @Ramachandrajoshi. I'll fix that later. In the meantime, you can use this command instead:

llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf --parallel 1

That should work

- Add ggml_cont(prev_hs) for non-contiguous tensor view (n_seqs > 1)
- Replace ggml_conv_1d_dw with ggml_ssm_conv for proper batch support
- Cast conv kernel to F32 and permute output shape

ggml_conv_1d_dw does not support n_seqs > 1 (assert b->ne[3] == 1). Use ggml_ssm_conv which is designed for SSM models with batching.
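
For readers unfamiliar with the contiguity assert seen in the llama-server crash, here is a simplified model of what "contiguous" means for a tensor described by a shape and byte strides. It is a sketch, not ggml's implementation: `is_contiguous` is a hypothetical helper, and block-quantized types are ignored.

```python
def is_contiguous(ne, nb, type_size):
    """Return True when byte strides nb describe a densely packed layout for shape ne."""
    expected = type_size
    for n, stride in zip(ne, nb):
        # Dimensions of size 1 never get stepped over, so their stride is irrelevant.
        if n != 1 and stride != expected:
            return False
        expected *= n
    return True


F32_SIZE = 4  # bytes per float32 element

# A freshly allocated 4x3 float32 tensor: strides grow with the packed shape.
packed = is_contiguous((4, 3, 1, 1), (4, 16, 48, 48), F32_SIZE)

# A transposed view of the same data: shape and strides are swapped, nothing is copied.
transposed_view = is_contiguous((3, 4, 1, 1), (16, 4, 48, 48), F32_SIZE)
```

A view like the second one is the kind of tensor that has to be copied into a packed buffer (which is what ggml_cont is used for here) before an operation that requires contiguous input.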

Juste-Leo2 · 2026-05-17T17:06:40Z

@nanduruganesh Can you try again with the correction? It should be fixed by now :)

Ramachandrajoshi · 2026-05-18T01:22:46Z

Thanks, with the --parallel 1 flag llama-server is working fine.

The model's config.json reports vocab_size=262272 but the actual tokenizer only has 262147 tokens. The 125 extra entries are padding in PyTorch's embed_tokens.weight matrix that don't correspond to any real tokens. Use the pre-computed _tokenizer_vocab_size to write the correct vocab size in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
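
The numbers above can be checked with a short sketch. The function names are hypothetical; only the two sizes come from the commit message.

```python
CONFIG_VOCAB_SIZE = 262272     # vocab_size reported by config.json
TOKENIZER_VOCAB_SIZE = 262147  # tokens the tokenizer actually defines


def padding_rows(config_vocab: int, tokenizer_vocab: int) -> int:
    """Number of embedding rows that do not correspond to any real token."""
    if tokenizer_vocab > config_vocab:
        raise ValueError("tokenizer defines more tokens than the embedding matrix holds")
    return config_vocab - tokenizer_vocab


def gguf_vocab_size(config_vocab: int, tokenizer_vocab: int) -> int:
    """Vocab size to record in the metadata: the tokenizer's, not the padded one."""
    padding_rows(config_vocab, tokenizer_vocab)  # validates the relationship
    return tokenizer_vocab


extra = padding_rows(CONFIG_VOCAB_SIZE, TOKENIZER_VOCAB_SIZE)
written = gguf_vocab_size(CONFIG_VOCAB_SIZE, TOKENIZER_VOCAB_SIZE)
```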

- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)
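
A rough pure-Python sketch of the masking behaviour described in this commit, assuming hidden states are stored as one row per token. `build_cca_mask` and `apply_cca_mask` are hypothetical names and do not mirror the C++ graph code.

```python
def build_cca_mask(n_tokens: int) -> list:
    """One mask value per token; all ones because no padding info is available."""
    return [1.0] * n_tokens


def apply_cca_mask(hidden, mask, n_seq_tokens: int):
    """Scale each token row by its mask value, but only during prefill."""
    if n_seq_tokens <= 1:
        return hidden  # single-token decode: the mask is not applied
    return [[value * m for value in row] for row, m in zip(hidden, mask)]


hidden_states = [[0.5, -1.0], [2.0, 3.0], [4.0, -0.25]]  # 3 tokens, 2 features
masked = apply_cca_mask(hidden_states, build_cca_mask(3), n_seq_tokens=3)
```

Since the mask is all ones, the masked output equals the input here; the multiplication only matters once real padding positions are marked with 0.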

- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
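
The two type-selection rules can be modelled as follows. This is a sketch in Python, not the C code from the commit: the `TensorType` enum and both function names are hypothetical, and treating F32 and F16 as the only kernel types that pass through unchanged is an assumption drawn from the wording above.

```python
from enum import Enum


class TensorType(Enum):
    F32 = "f32"
    F16 = "f16"
    BF16 = "bf16"
    Q4_0 = "q4_0"
    Q8_0 = "q8_0"


def im2col_type(kernel_type: TensorType) -> TensorType:
    """Mirror of the conditional: F16 kernels keep F16, everything else uses F32."""
    return TensorType.F16 if kernel_type is TensorType.F16 else TensorType.F32


def promote_kernel(kernel_type: TensorType) -> TensorType:
    """Assumed rule: leave F32/F16 kernels alone, promote any other type to F32."""
    if kernel_type in (TensorType.F32, TensorType.F16):
        return kernel_type
    return TensorType.F32


# A quantized kernel is promoted first, so im2col then sees an F32 kernel.
promoted = promote_kernel(TensorType.Q4_0)
activation_type = im2col_type(promoted)
```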

Juste-Leo2 and others added 15 commits May 8, 2026 11:08

ops: add Conv1dGrouped operation

99e5d03

initial implementation

e0ac753

implementation checkpoint

7cc554a

update

02a9843

add corrections

8362c10

zaya generation running

109856e

zaya: remove unused CCA_QK_NORM tensor constant

fede4c6

zaya: remove dead ZAYA_ROUTER_MLP2 mapping from non-block config

2069583

zaya: revert unrelated debug.cpp changes

356e962

Juste-Leo2 mentioned this pull request May 15, 2026

Feature Request: Support ZAYA1-8B (Sparse MoE) and Markovian RSA Sampling ggml-org/llama.cpp#22776

Open

4 tasks

Juste-Leo2 added 2 commits May 15, 2026 17:21

zaya: replace hardcoded n_ff_exp with GGUF metadata

81d727f

Remove hardcoded 256 value for router MLP hidden size and read it from the GGUF expert_feed_forward_length metadata key instead. The converter now writes zaya_mlp_expansion from config.json.

Juste-Leo2 mentioned this pull request May 15, 2026

[DRAFT] Support for Zaya1 8B model (depends on PR #22833) ggml-org/llama.cpp#23112

Draft

Juste-Leo2 added 3 commits May 16, 2026 11:08

quant: exclude Zaya cca_conv_grp tensors from quantization

800fbe8

Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv, and RWKV time_mix tensors. These small conv weights (d_conv=2) are not divisible by quant block sizes (32), causing Q8_0 failures.
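
The divisibility problem can be illustrated with a short check. This is a sketch that assumes the quantizer needs each row length to be a whole number of blocks and that the conv kernel's row length is d_conv; `can_quantize_rows` is a hypothetical helper.

```python
QK8_0 = 32  # elements per Q8_0 block, as stated in the commit message


def can_quantize_rows(row_len: int, block_size: int = QK8_0) -> bool:
    """A row can be block-quantized only if it splits into whole blocks."""
    return row_len % block_size == 0


D_CONV = 2  # kernel width of the small conv weights

conv_ok = can_quantize_rows(D_CONV)  # 2 is not a multiple of 32
proj_ok = can_quantize_rows(4096)    # a typical projection row (hypothetical size)
```

Tensors that fail this check are the ones kept in their original type instead of being quantized.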

Juste-Leo2 added 2 commits May 18, 2026 20:19

docs(zaya): add Python reference comments to C++ implementation

f1bd772

Add detailed inline comments mapping each C++ code section to the corresponding zaya.py and cca.py Python lines, including code snippets for direct comparison.

fix(zaya): gate EDA with layer check matching Python use_eda logic

2234dab

zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via (self.layer_number != zaya_first_layer). Add il != 1 guard to match.

Juste-Leo2 force-pushed the Zaya1 branch 2 times, most recently from 05ec4f4 to 2b0c8c8 Compare May 21, 2026 22:35

feat(zaya): add zaya_high_prec for FP32 output logits matching Python…

1fc4581

… _FP32EmbeddingMethod

Juste-Leo2 force-pushed the Zaya1 branch from 2b0c8c8 to 1fc4581 Compare May 21, 2026 23:21

Juste-Leo2 and others added 8 commits May 22, 2026 01:56

zaya.cpp: fix comment reference to MOD skip expert handling

0f37ace

Correct line reference from zaya.py L387-389 to L459-469, and add note explaining why excluding the skip expert from gate_probs is correct (bias=-1.0 makes it effectively never selected at inference with topk=1).

zaya: cast residual to F32 before addition (residual_in_fp32)

9aaef94

Match Python reference which casts hidden_states and residual to float32 before ggml_add in both per-layer and final residual paths. zaya.py ref: L900, L1387, L1701

cleanup: revert debugs commits

abe9e40

Revert "docs(zaya): add Python reference comments to C++ implementation"

6fad5d8

This reverts commit f1bd772.

zaya: add il != 1 check for EDA to match python reference

894ffd4

This is a safety guard matching self.layer_number != zaya_first_layer in the original implementation. No behavioral change for correctly converted models since the tensor is already nullptr for layer 1.

zaya: compute residual in fp32 to match config

1a7582b

The model config has residual_in_fp32=true. Cast both residual branches to float32 to align with the python reference.
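
Why the precision of the residual addition can matter is easy to show with an emulated bfloat16. The sketch below assumes the lower-precision alternative would be BF16 (the thread does not state the activation type) and ignores NaN and infinity; `to_bf16` is a hypothetical helper.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a value to bfloat16 precision (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for nearest-even
    bits &= 0xFFFF0000                   # keep only the upper 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


residual = 256.0
update = 1.0

sum_in_bf16 = to_bf16(residual + update)  # result stored back in bfloat16
sum_in_fp32 = residual + update           # result kept in float32
```

In this example the bfloat16 sum stays at 256.0 and the small update is lost, while the float32 sum is 257.0.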

Juste-Leo2 wants to merge 33 commits into Zyphra:master from Juste-Leo2:Zaya1
