[DRAFT] Zaya 1 Draft Support #2

Open
Juste-Leo2 wants to merge 33 commits into Zyphra:master from Juste-Leo2:Zaya1

Conversation

@Juste-Leo2

Juste-Leo2 commented May 15, 2026

@nanduruganesh I've created a new pull request here so that we can both push changes.
I've refactored the code to reuse the existing constants while keeping it functional; I used Open Code to make the changes. This PR will likely replace PR #1.

I'm reposting your original message below for anyone who wants to try the PR.

Quickstart

  1. Build llama.cpp
  2. Hugging Face -> GGUF conversion
hf download Zyphra/ZAYA1-8B --local-dir ./models/ZAYA1-8B
python convert_hf_to_gguf.py models/ZAYA1-8B --outfile models/ZAYA1-8B-BF16.gguf --outtype bf16
  3. Inference
./build/bin/llama-cli -m models/ZAYA1-8B-BF16.gguf -p \
"Derive bernoulli's equation from the work-energy theorem" \
-n 256 -c 4096 -ngl all -sm none --single-turn --simple-io \
--no-warmup --ctx-checkpoints 0 --cache-ram 0 --no-context-shift

Output:

[image: screenshot of the output]

Todo:

  1. Fix slight output logit mismatch between GGUF and transformers forward
  2. Optimize for faster ITL (probably under single-batch conditions?)
  3. Clean up this vibecoded implementation for a proper PR, mainly by going through the Contributing guidelines

Juste-Leo2 and others added 15 commits May 8, 2026 11:08
- Remove LLM_TENSOR_CCA_CONV_DW and LLM_TENSOR_CCA_CONV_DW_B from llama-arch.h
- Update tensor name mappings in llama-arch.cpp to use SSM_CONV1D
- Remove CCA_CONV_DW and CCA_CONV_DW_B from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use SSM_CONV1D
- Update zaya.cpp to create tensors using LLM_TENSOR_SSM_CONV1D
- Update convert_hf_to_gguf.py to map conv_qk.0 to SSM_CONV1D
- Add HuggingFace tensor mapping for zaya conv_qk.0 to SSM_CONV1D

This improves consistency by reusing the existing SSM_CONV1D constant
that's already used by other SSM-based architectures (Mamba, Jamba, etc.)

- Remove LLM_TENSOR_ZAYA_ROUTER_NORM from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_NORM
- Remove ZAYA_ROUTER_NORM from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_NORM
- Update zaya.cpp to create router norm tensor using LLM_TENSOR_FFN_NORM
- Update convert_hf_to_gguf.py to map rmsnorm_eda to FFN_NORM
- Add HuggingFace tensor mapping for zaya rmsnorm_eda to FFN_NORM

Router normalization is a standard FFN norm (RMSNorm), making this
a semantically correct replacement that reduces custom constants.

- Remove LLM_TENSOR_ZAYA_ROUTER_DOWN from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE_INP
- Remove ZAYA_ROUTER_DOWN from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE_INP
- Update zaya.cpp to create router down tensor using LLM_TENSOR_FFN_GATE_INP
- Update convert_hf_to_gguf.py to map down_proj.weight to FFN_GATE_INP
- Add HuggingFace tensor mapping for zaya router down_proj to FFN_GATE_INP

Router down projection is a linear projection similar to MoE gate input,
making this a semantically reasonable replacement.

- Remove LLM_TENSOR_ZAYA_ROUTER_MLP0 from llama-arch.h
- Update tensor mappings in llama-arch.cpp to use FFN_GATE
- Remove ZAYA_ROUTER_MLP0 from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list to use FFN_GATE
- Update zaya.cpp to create router mlp0 tensor using LLM_TENSOR_FFN_GATE
- Update convert_hf_to_gguf.py to map router_mlp.0.weight to FFN_GATE
- Add HuggingFace tensor mapping for zaya router_mlp.0 to FFN_GATE

Router MLP hidden layer is a linear projection similar to FFN gate,
making this a reasonable replacement for reducing custom constants.

- Remove LLM_TENSOR_RES_SCALE_HS_B, RES_SCALE_RES_B, RES_SCALE_HS_B_FINAL, RES_SCALE_RES_B_FINAL
- Use single RES_SCALE_HS for both weight and bias (same for RES_SCALE_RES)
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix

This reduces 8 custom ZAYA constants to 4 by reusing the same constant
for both weight and bias tensors, differentiated by suffix.

- Remove ZAYA_ROUTER_DOWN_B, ZAYA_ROUTER_MLP0_B, ZAYA_ROUTER_MLP2_B
- Use FFN_GATE_INP for both router down weight and bias
- Use FFN_GATE for both router mlp0 weight and bias
- Use ZAYA_ROUTER_MLP2 for both router mlp2 weight and bias
- Update tensor mappings in llama-arch.cpp
- Remove bias constants from gguf constants.py
- Update MODEL_ARCH.ZAYA1 tensor list
- Update zaya.cpp to create bias tensors using same constant with 'bias' suffix
- Update convert_hf_to_gguf.py to map bias tensors with .bias suffix
- Add ZAYA_ROUTER_MLP2 tensor mapping for HuggingFace auto-detection

This reduces 3 more custom constants by reusing the same constant
for both weight and bias tensors, differentiated by suffix.
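
The weight/bias sharing described in the commits above can be sketched roughly as follows. GGUF tensor names conventionally end in `.weight` or `.bias`, so a single base name can cover both tensors. The `res_scale_hs` base name and the `tensor_name` helper are hypothetical, chosen only for illustration; the names actually used by the PR may differ.

```python
# Minimal sketch: one constant (base name) serves both weight and bias tensors.
# NOTE: "res_scale_hs" is a hypothetical base name used only for illustration.
BASE_NAMES = {
    "RES_SCALE_HS": "blk.{bid}.res_scale_hs",
    "FFN_GATE_INP": "blk.{bid}.ffn_gate_inp",
}

VALID_SUFFIXES = ("weight", "bias")


def tensor_name(key: str, bid: int, suffix: str) -> str:
    """Build a per-layer tensor name from a shared base name and a suffix."""
    if suffix not in VALID_SUFFIXES:
        raise ValueError(f"unexpected suffix: {suffix!r}")
    return BASE_NAMES[key].format(bid=bid) + "." + suffix


if __name__ == "__main__":
    # The same key yields two distinct tensor names for layer 3.
    print(tensor_name("FFN_GATE_INP", 3, "weight"))
    print(tensor_name("FFN_GATE_INP", 3, "bias"))
```

With this layout, dropping a dedicated `*_B` constant does not lose information, because the suffix alone distinguishes the two tensors.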
@Juste-Leo2
Author

Juste-Leo2 commented May 15, 2026

Okay, regarding the warning that appears during quantization: this warning is expected, and suppressing it would require changes that affect all models. So it's best to keep the warning; it doesn't impact inference.
@nanduruganesh Can you give me a quick review if you think everything is good as is? I will open a draft PR on llama.cpp with a comprehensive description of how the model works, and I'll look into including inference speed tests. I'll keep it as a draft because I'd prefer to merge the ggml_conv_1d_grouped operation first, if it meets the maintainers' requirements.

Remove hardcoded 256 value for router MLP hidden size and read it
from the GGUF expert_feed_forward_length metadata key instead.
The converter now writes zaya_mlp_expansion from config.json.

val_proj1 and val_proj2 output dimension should be latent_k_dim / 2
(n_embd_k / 2) as per vLLM reference, not n_embd_head. Currently
both are equal for ZAYA1-8B (n_head_kv=2), but this would break
for any other n_head_kv configuration.
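
A small arithmetic sketch of why the two expressions coincide only when n_head_kv=2. It assumes n_embd_k = n_embd_head * n_head_kv, which is consistent with the commit message but not confirmed by it, and the head size of 128 is a hypothetical value picked for illustration.

```python
# Sketch: val_proj output dimension as latent_k_dim / 2.
# ASSUMPTION: n_embd_k (latent_k_dim) = n_embd_head * n_head_kv.

def val_proj_out_dim(n_embd_head: int, n_head_kv: int) -> int:
    """Return latent_k_dim / 2 under the assumption stated above."""
    n_embd_k = n_embd_head * n_head_kv
    return n_embd_k // 2


if __name__ == "__main__":
    n_embd_head = 128  # hypothetical head size, for illustration only
    for n_head_kv in (2, 4, 8):
        out_dim = val_proj_out_dim(n_embd_head, n_head_kv)
        same = out_dim == n_embd_head
        print(f"n_head_kv={n_head_kv}: out_dim={out_dim}, equals n_embd_head: {same}")
```

Only the n_head_kv=2 row matches n_embd_head, which is why the earlier code happened to work for ZAYA1-8B.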
Follows the same pattern as Mamba ssm_conv1d, Kimi shortconv,
and RWKV time_mix tensors. These small conv weights (d_conv=2)
are not divisible by quant block sizes (32), causing Q8_0 failures.
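
The divisibility constraint can be sketched as a simple type-selection rule. This is an illustration only: the helper names are hypothetical, and falling back to F32 is an assumption based on the "same pattern as Mamba ssm_conv1d" remark rather than on the PR's code.

```python
# Sketch: block quantization needs the row length to be a multiple of the block size.
QUANT_BLOCK_SIZE = 32  # block size cited in the commit message


def can_block_quantize(row_len: int, block_size: int = QUANT_BLOCK_SIZE) -> bool:
    """True if a row of row_len elements splits evenly into quantization blocks."""
    return row_len > 0 and row_len % block_size == 0


def pick_tensor_type(row_len: int, requested: str = "Q8_0") -> str:
    """Keep the requested type when possible, otherwise fall back to F32 (assumed fallback)."""
    return requested if can_block_quantize(row_len) else "F32"


if __name__ == "__main__":
    print(pick_tensor_type(2))     # d_conv=2 rows cannot be split into blocks of 32
    print(pick_tensor_type(4096))  # a typical large row length divides evenly
```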
ggml_im2col on CPU requires F16 kernel weights. Cast cca_conv_dw
and cca_conv_grp to F16 before convolution to support quantized
models (Q4, Q8). CUDA/SYCL backends are unaffected since their
im2col implementation only reads kernel dimensions, not data.

ROCm and Vulkan backends require contiguous tensors for im2col and
mul_mat operations. Add ggml_cont after ggml_cast for conv kernels
and after ggml_concat for hs_d to ensure compatibility across all
backends. CUDA was unaffected since it handles non-contiguous
tensors more permissively.
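
To illustrate what "contiguous" means here, the following is a simplified model of a strided tensor, with shapes listed innermost dimension first and strides counted in elements. It is not ggml's implementation; `is_contiguous`, `permute`, and `make_contiguous_strides` are hypothetical helpers, and real ggml strides are in bytes.

```python
# Sketch: a tensor view is contiguous when each stride equals the product
# of all inner dimension sizes. Permuted views break this property.

def make_contiguous_strides(ne):
    """Strides (in elements) for a densely packed tensor of shape ne."""
    strides, acc = [], 1
    for n in ne:
        strides.append(acc)
        acc *= n
    return strides


def is_contiguous(ne, nb):
    """Check strides against the dense layout, ignoring dimensions of size 1."""
    expected = 1
    for n, stride in zip(ne, nb):
        if n != 1 and stride != expected:
            return False
        expected *= n
    return True


def permute(ne, nb, axes):
    """Reorder dimensions without moving data, as a view would."""
    return [ne[a] for a in axes], [nb[a] for a in axes]


if __name__ == "__main__":
    ne = [4, 3]
    nb = make_contiguous_strides(ne)
    t_ne, t_nb = permute(ne, nb, [1, 0])              # transposed view
    print(is_contiguous(ne, nb))                      # dense tensor
    print(is_contiguous(t_ne, t_nb))                  # view with swapped strides
    print(is_contiguous(t_ne, make_contiguous_strides(t_ne)))  # after a copy
```

In this model, the last line plays the role of ggml_cont: copying the view into a densely packed buffer restores the stride pattern that stricter backends expect.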
@Ramachandrajoshi

Tested this PR: llama-cli works, but llama-server fails.

llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9081-3aaab7f7b
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 8 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '../models/ZAYA1-8B-Q4_K_S.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
/data/data/com.termux/files/home/llama.cpp/ggml/src/ggml.c:3647: GGML_ASSERT(ggml_is_contiguous(a)) failed
0: 0x7ab358313c
1: 0x7ab35830fc ggml_print_backtrace
2: 0x7ab3596ac0 ggml_abort
3: 0x7ab3589f50 ggml_reshape_3d
4: 0x7aab66e34c _ZN16llama_model_zaya5graphC2ERK11llama_modelRK16llm_graph_params
5: 0x7aab66dc20 _ZNK16llama_model_zaya16build_arch_graphERK16llm_graph_params
6: 0x7aab5b03fc _ZNK11llama_model11build_graphERK16llm_graph_params
7: 0x7aab548bb4 _ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ibPm
8: 0x7aab547924 _ZN13llama_context13sched_reserveEv
9: 0x7aab5469e0 _ZN13llama_contextC2ERK11llama_model20llama_context_params
10: 0x7aab54f64c llama_init_from_model
11: 0x7ab188ed8c
12: 0x7ab1889c8c
13: 0x7ab1889a68 _Z17common_fit_paramsPKcP18llama_model_paramsP20llama_context_paramsPfP32llama_model_tensor_buft_overridePmj14ggml_log_level
14: 0x7ab186ec4c _ZN18common_init_resultC2ER13common_params
15: 0x7ab186ffdc _Z23common_init_from_paramsR13common_params
16: 0x7ab5746654
17: 0x7ab56c820c
18: 0x7aa852dd64 __libc_init
Aborted                    llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf

@Juste-Leo2
Author

Juste-Leo2 commented May 17, 2026

Thanks for the catch, @Ramachandrajoshi. I'll fix that later. In the meantime, you can use this command instead:

llama-server -m ../models/ZAYA1-8B-Q4_K_S.gguf --parallel 1

That should work.

- Add ggml_cont(prev_hs) for non-contiguous tensor view (n_seqs > 1)
- Replace ggml_conv_1d_dw with ggml_ssm_conv for proper batch support
- Cast conv kernel to F32 and permute output shape

ggml_conv_1d_dw does not support n_seqs > 1 (assert b->ne[3] == 1).
Use ggml_ssm_conv, which is designed for SSM models with batching.

@Juste-Leo2
Author

Juste-Leo2 commented May 17, 2026

@nanduruganesh Can you try again with the correction? It should be fixed by now :)

@Ramachandrajoshi

Thanks, with the --parallel 1 flag llama-server is working fine.

The model's config.json reports vocab_size=262272 but the actual tokenizer
only has 262147 tokens. The 125 extra entries are padding in PyTorch's
embed_tokens.weight matrix that don't correspond to any real tokens.

Use the pre-computed _tokenizer_vocab_size to write the correct vocab size
in the GGUF metadata, matching llama.cpp's actual tokenizer vocabulary.
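
The vocabulary-size arithmetic above can be checked with a short sketch. The two sizes come from the commit message; the `gguf_vocab_size` helper is hypothetical and only illustrates preferring the tokenizer's count over the padded config value.

```python
# Sketch: prefer the tokenizer's vocabulary size over the padded config value.
CONFIG_VOCAB_SIZE = 262272     # vocab_size reported by config.json
TOKENIZER_VOCAB_SIZE = 262147  # tokens actually present in the tokenizer


def gguf_vocab_size(config_vocab: int, tokenizer_vocab: int) -> tuple[int, int]:
    """Return (vocab size to write, number of padding rows in the embedding matrix)."""
    if tokenizer_vocab > config_vocab:
        raise ValueError("tokenizer has more tokens than the embedding matrix has rows")
    return tokenizer_vocab, config_vocab - tokenizer_vocab


if __name__ == "__main__":
    size, padding = gguf_vocab_size(CONFIG_VOCAB_SIZE, TOKENIZER_VOCAB_SIZE)
    print(f"write vocab_size={size}, padding rows={padding}")
```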
Add detailed inline comments mapping each C++ code section to the
corresponding zaya.py and cca.py Python lines, including code snippets
for direct comparison.

zaya.py L294-296: EDA is disabled for layer 1 (first MoE layer) via
(self.layer_number != zaya_first_layer). Add il != 1 guard to match.

Juste-Leo2 force-pushed the Zaya1 branch 2 times, most recently from 05ec4f4 to 2b0c8c8 on May 21, 2026 22:35
Juste-Leo2 and others added 8 commits May 22, 2026 01:56
Correct line reference from zaya.py L387-389 to L459-469, and add
note explaining why excluding the skip expert from gate_probs is
correct (bias=-1.0 makes it effectively never selected at inference
with topk=1).

- New llm_graph_input_cca_mask class + build_inp_cca_mask() in graph infra
- cca_mask tensor [1, n_tokens] F32 binary mask applied to hidden_states
  before CCA convolutions (modeling_zaya.py ref: CCA.forward L325-328)
- Applied only during prefill (n_seq_tokens > 1), matching Python logic
- Mask filled with 1.0f for all positions (no padding info in ubatch)
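
The masking behaviour listed above can be sketched in plain Python. This is an illustration, not the PR's C++ code: `apply_cca_mask` is a hypothetical helper, and hidden states are modelled as a list of per-token vectors.

```python
# Sketch: multiply each token's hidden state by a per-token binary mask,
# but only during prefill (more than one token per sequence).

def apply_cca_mask(hidden, mask, n_seq_tokens):
    """hidden: list of per-token vectors; mask: one 0.0/1.0 value per token."""
    if n_seq_tokens <= 1:
        return hidden  # decode step: the mask is not applied
    if len(hidden) != len(mask):
        raise ValueError("mask must have one entry per token")
    return [[value * m for value in token] for token, m in zip(hidden, mask)]


if __name__ == "__main__":
    hidden = [[0.5, -1.0], [2.0, 3.0], [4.0, 0.25]]
    all_ones = [1.0] * len(hidden)  # no padding info available, so keep every token
    print(apply_cca_mask(hidden, all_ones, n_seq_tokens=3))
    print(apply_cca_mask(hidden, [1.0, 0.0, 1.0], n_seq_tokens=3))
```

Because the mask is currently filled with 1.0 for every position, applying it leaves the hidden states unchanged; the second call only shows what a padded position would look like.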
Match Python reference which casts hidden_states and residual to
float32 before ggml_add in both per-layer and final residual paths.

zaya.py ref: L900, L1387, L1701

- ggml: Update `ggml_conv_1d` (and variants) to use a conditional type for `im2col` activation (`a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32`) instead of hardcoding `GGML_TYPE_F16`. This aligns with `ggml_conv_2d`, preserving F32/BF16 precision while still safely protecting against quantized weight crashes (e.g., Q4_0).
- zaya: Replace the forced F16 downcast for grouped convolutions with a dynamic promotion to F32 for unsupported types (like BF16 or quantized types). This ensures `im2col` properly allocates an F32 matrix and computes an F32xF32 mul_mat, avoiding CUDA/CPU backend crashes while fully restoring model accuracy and NMSE metrics.
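
The type selection described in these two bullets can be summarised in a short sketch. Type names are plain strings here and both helpers are hypothetical; the real logic lives in ggml's C code and in zaya.cpp.

```python
# Sketch of the two rules: im2col output type follows the kernel type,
# and kernels in unsupported types are promoted to F32 first.
SUPPORTED_KERNEL_TYPES = ("F16", "F32")


def im2col_dst_type(kernel_type: str) -> str:
    """Mirror of: a->type == GGML_TYPE_F16 ? GGML_TYPE_F16 : GGML_TYPE_F32."""
    return "F16" if kernel_type == "F16" else "F32"


def promote_kernel_type(kernel_type: str) -> str:
    """Keep F16/F32 kernels as they are; promote anything else (BF16, quantized) to F32."""
    return kernel_type if kernel_type in SUPPORTED_KERNEL_TYPES else "F32"


if __name__ == "__main__":
    for stored in ("F16", "F32", "BF16", "Q4_0"):
        kernel = promote_kernel_type(stored)
        print(f"stored={stored} -> kernel={kernel} -> im2col={im2col_dst_type(kernel)}")
```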
This is a safety guard matching self.layer_number != zaya_first_layer
in the original implementation. No behavioral change for correctly
converted models since the tensor is already nullptr for layer 1.

The model config has residual_in_fp32=true. Cast both residual
branches to float32 to align with the python reference.
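
As a rough illustration of why the residual stream is kept in float32, the sketch below accumulates many small updates in a low-precision format and in float32. Python has no native BF16, so IEEE half precision (the struct format `e`) is used as a stand-in; the step size and count are arbitrary, and this does not reproduce the model's actual numerics.

```python
# Sketch: repeated additions drift more when the running sum is stored in low precision.
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the precision of a struct format ('e' = half, 'f' = float32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(fmt: str, delta: float, steps: int) -> float:
    """Add delta repeatedly, rounding the running sum to fmt after every step."""
    total = 0.0
    step = round_to(fmt, delta)
    for _ in range(steps):
        total = round_to(fmt, total + step)
    return total


if __name__ == "__main__":
    exact = 1.0  # 1000 steps of 0.001
    low = accumulate("e", 0.001, 1000)
    high = accumulate("f", 0.001, 1000)
    print(f"half-precision sum error: {abs(low - exact):.6f}")
    print(f"float32 sum error:        {abs(high - exact):.6f}")
```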

2 participants