Draft PR for ZAYA1 llama.cpp implementation #1
I tested the current PR with a Q3_K_M quantized model, and it works. Thanks a lot for the adjustments @nanduruganesh, the generation is now usable! I'll take a closer look at the code.

During quantization I got a few warnings, but that's probably fixable without too much trouble:

I've had the chance to contribute a bit to llama.cpp and I follow their work closely. The best approach would really be to break this large PR into several smaller ones, to make it easier for the maintainers to review and to adjust things incrementally. Even if visually we already have a natural split with the

I'm with you on speed. Given how sparse this MoE is (only 760M active parameters), there's probably a lot of optimization potential. But for a llama.cpp integration, I'd recommend going with a naive implementation first, then optimizing later. That's often how it's done.

We should probably also consider running some perplexity tests. This would help spot any potential degradation if something's off.

As an electronics student by background, I need to get more comfortable with the codebase to really understand llama.cpp's structure and potentially suggest changes to the implementation if needed. I'll do that when I have some free time. Don't hesitate to move ahead if you see things that need changing.
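To make the perplexity suggestion above concrete, here is a minimal sketch of the metric itself: perplexity is the exponential of the mean negative log-likelihood per token. The log-probability values below are made up for illustration and are not measurements from this PR; in practice the numbers would come from running the same text through the reference model and the quantized GGUF (llama.cpp ships a perplexity tool for this).

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token natural-log probabilities for the same text,
# once from a reference model and once from a quantized model.
baseline_logprobs = [-1.2, -0.4, -2.1, -0.9]
quantized_logprobs = [-1.3, -0.5, -2.4, -1.0]

ppl_base = perplexity(baseline_logprobs)
ppl_quant = perplexity(quantized_logprobs)
relative_increase = (ppl_quant - ppl_base) / ppl_base

print(f"baseline={ppl_base:.3f} quantized={ppl_quant:.3f} "
      f"relative increase={relative_increase:.1%}")
```

A small relative increase is expected from quantization alone; a large jump would more likely point at a bug in the graph or in the tensor conversion.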
        struct ggml_tensor * a_g = ggml_view_3d(ctx, a,
                a->ne[0], IC_G, OC_G,
                a->nb[1], a->nb[2],
                g * OC_G * a->nb[2]);

        // slice input for group g: [L, IC_G, N]
        struct ggml_tensor * b_g = ggml_view_3d(ctx, b,
                b->ne[0], IC_G, b->ne[2],
                b->nb[1], b->nb[2],
                g * IC_G * b->nb[1]);

        struct ggml_tensor * out_g = ggml_conv_1d(ctx, a_g, b_g, s0, p0, d0);

        if (result == NULL) {
            result = out_g;
        } else {
            result = ggml_concat(ctx, result, out_g, 1);
        }
    }

    return result;
}
We're reusing the ggml operations; we'll need to see later whether it's necessary to create a specific operation for each backend.
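As a reference for what the slice/conv/concat approach computes, here is a small pure-Python sketch of grouped 1D convolution built from a plain 1D convolution. It is not the PR's code: the function names are hypothetical, the nested-list layout ([OC][IC/groups][K] for the kernel, [IC][L] for the data, single batch) is an assumption chosen for readability, and the output-length formula is the usual one for strided, padded, dilated convolution.

```python
def conv1d(kernel, data, stride=1, pad=0, dilation=1):
    """Plain 1D convolution (cross-correlation) with zero padding.

    kernel: [OC][IC][K], data: [IC][L] -> output: [OC][OL]
    """
    oc, ic, k = len(kernel), len(kernel[0]), len(kernel[0][0])
    length = len(data[0])
    out_len = (length + 2 * pad - dilation * (k - 1) - 1) // stride + 1
    out = [[0.0] * out_len for _ in range(oc)]
    for o in range(oc):
        for t in range(out_len):
            acc = 0.0
            for c in range(ic):
                for j in range(k):
                    pos = t * stride + j * dilation - pad
                    if 0 <= pos < length:  # positions outside the input are zero
                        acc += kernel[o][c][j] * data[c][pos]
            out[o][t] = acc
    return out


def conv1d_grouped(kernel, data, groups, stride=1, pad=0, dilation=1):
    """Grouped convolution: slice per group, convolve, concatenate channels.

    kernel: [OC][IC/groups][K], data: [IC][L] -> output: [OC][OL]
    """
    oc, ic = len(kernel), len(data)
    assert oc % groups == 0 and ic % groups == 0
    oc_g, ic_g = oc // groups, ic // groups
    result = []
    for g in range(groups):
        kernel_g = kernel[g * oc_g:(g + 1) * oc_g]  # output channels of group g
        data_g = data[g * ic_g:(g + 1) * ic_g]      # input channels of group g
        result.extend(conv1d(kernel_g, data_g, stride, pad, dilation))
    return result


# Depthwise case (groups == IC): each channel gets its own 2-tap filter.
kernel = [[[1.0, 1.0]], [[1.0, -1.0]]]
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(conv1d_grouped(kernel, data, groups=2))
```

A reference like this can be handy for checking a dedicated backend operation against the composed version, if one is added later.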
    GGML_API struct ggml_tensor * ggml_conv_1d_grouped(
            struct ggml_context * ctx,
            struct ggml_tensor  * a,       // convolution kernel
            struct ggml_tensor  * b,       // data
            int                   s0,      // stride
            int                   p0,      // padding
            int                   d0,      // dilation
            int                   groups); // number of groups
@@ -162,6 +155,13 @@ bool common_debug_cb_eval(struct ggml_tensor * t, bool ask, void * user_data) {
        }
    }

    if (ask) {
        return matches_filter;
    }

    const struct ggml_tensor * src0 = t->src[0];
    const struct ggml_tensor * src1 = t->src[1];
These changes will need to be removed
    CCA_CONV_DW = auto() # Zaya
    CCA_CONV_GRP = auto() # Zaya
    CCA_CONV_DW_B = auto() # Zaya: conv_qk.0.bias
    CCA_QK_NORM = auto() # Zaya (weightless - unit RMSNorm)
    CCA_K_SCALE = auto() # Zaya
    CCA_VAL_PROJ1 = auto() # Zaya: CCA value projection stream 1
    CCA_VAL_PROJ2 = auto() # Zaya: CCA value projection stream 2
    RES_SCALE_HS = auto() # Zaya: hidden_states_scale
    RES_SCALE_HS_B = auto() # Zaya: hidden_states_bias
    RES_SCALE_RES = auto() # Zaya: residual_scale
    RES_SCALE_RES_B = auto() # Zaya: residual_bias
    RES_SCALE_HS_FINAL = auto() # Zaya: final hidden_states_scale
    RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
    RES_SCALE_RES_FINAL = auto() # Zaya: final residual_scale
    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
    ZAYA_ROUTER_DOWN = auto() # Zaya
    ZAYA_ROUTER_DOWN_B = auto() # Zaya
    ZAYA_ROUTER_NORM = auto() # Zaya
    ZAYA_ROUTER_MLP0 = auto() # Zaya
    ZAYA_ROUTER_MLP0_B = auto() # Zaya
    ZAYA_ROUTER_MLP2 = auto() # Zaya
    ZAYA_ROUTER_MLP2_B = auto() # Zaya
    ZAYA_ROUTER_MLP4 = auto() # Zaya
    ZAYA_ROUTER_BIASES = auto() # Zaya
    ZAYA_ROUTER_EDA_SCALE = auto() # Zaya
I think we need to simplify and reuse what we already have
    SSM_BETA = auto() # Kimi Linear qwen3.5
    SSM_G_A = auto() # Kimi Linear
    SSM_G_B = auto() # Kimi Linear
    CCA_CONV_DW = auto() # Zaya

    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
    ZAYA_ROUTER_DOWN = auto() # Zaya
    ZAYA_ROUTER_DOWN_B = auto() # Zaya
    ZAYA_ROUTER_NORM = auto() # Zaya

    ZAYA_ROUTER_MLP0 = auto() # Zaya
    ZAYA_ROUTER_MLP0_B = auto() # Zaya
    ZAYA_ROUTER_MLP2 = auto() # Zaya
    ZAYA_ROUTER_MLP2_B = auto() # Zaya

    ZAYA_ROUTER_MLP0_B = auto() # Zaya
    ZAYA_ROUTER_MLP2 = auto() # Zaya
    ZAYA_ROUTER_MLP2_B = auto() # Zaya
    ZAYA_ROUTER_MLP4 = auto() # Zaya
Note: I think I mentioned the wrong line (see 629)
    RES_SCALE_HS = auto() # Zaya: hidden_states_scale
    RES_SCALE_HS_B = auto() # Zaya: hidden_states_bias
    RES_SCALE_RES = auto() # Zaya: residual_scale
    RES_SCALE_RES_B = auto() # Zaya: residual_bias
    RES_SCALE_HS_FINAL = auto() # Zaya: final hidden_states_scale
    RES_SCALE_HS_B_FINAL = auto() # Zaya: final hidden_states_bias
    RES_SCALE_RES_FINAL = auto() # Zaya: final residual_scale
    RES_SCALE_RES_B_FINAL = auto() # Zaya: final residual_bias
@@ -3992,6 +4044,42 @@ class MODEL_TENSOR(IntEnum):
        MODEL_TENSOR.FFN_DOWN_SHEXP,
        MODEL_TENSOR.FFN_UP_SHEXP,
    ],
    MODEL_ARCH.ZAYA: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.CCA_CONV_DW,
        MODEL_TENSOR.CCA_CONV_DW_B,
        MODEL_TENSOR.CCA_CONV_GRP,
        MODEL_TENSOR.CCA_QK_NORM,
        MODEL_TENSOR.CCA_K_SCALE,
        MODEL_TENSOR.CCA_VAL_PROJ1,
        MODEL_TENSOR.CCA_VAL_PROJ2,
        MODEL_TENSOR.RES_SCALE_HS,
        MODEL_TENSOR.RES_SCALE_HS_B,
        MODEL_TENSOR.RES_SCALE_RES,
        MODEL_TENSOR.RES_SCALE_RES_B,
        MODEL_TENSOR.RES_SCALE_HS_FINAL,
        MODEL_TENSOR.RES_SCALE_HS_B_FINAL,
        MODEL_TENSOR.RES_SCALE_RES_FINAL,
        MODEL_TENSOR.RES_SCALE_RES_B_FINAL,
        MODEL_TENSOR.ZAYA_ROUTER_DOWN,
        MODEL_TENSOR.ZAYA_ROUTER_DOWN_B,
        MODEL_TENSOR.ZAYA_ROUTER_NORM,
        MODEL_TENSOR.ZAYA_ROUTER_MLP0,
        MODEL_TENSOR.ZAYA_ROUTER_MLP0_B,
        MODEL_TENSOR.ZAYA_ROUTER_MLP2,
        MODEL_TENSOR.ZAYA_ROUTER_MLP2_B,
        MODEL_TENSOR.ZAYA_ROUTER_MLP4,
        MODEL_TENSOR.ZAYA_ROUTER_BIASES,
        MODEL_TENSOR.ZAYA_ROUTER_EDA_SCALE,
        MODEL_TENSOR.FFN_GATE_UP_EXP,
        MODEL_TENSOR.FFN_DOWN_EXP,
    ],
We can try removing the _B or _B_FINAL.
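One way to read this suggestion is to keep a single enum entry per tensor and let the `.weight` / `.bias` suffix distinguish the two, instead of declaring separate `_B` entries. The sketch below only illustrates that idea: the `Tensor` enum, the `BASE_NAMES` strings, and `tensor_name` are hypothetical stand-ins, not gguf-py's API or the names used in the PR, and whether ZAYA's checkpoint layout allows this is an open question.

```python
from enum import IntEnum, auto


class Tensor(IntEnum):
    """Hypothetical stand-in for MODEL_TENSOR: one entry per tensor, no _B twin."""
    RES_SCALE_HS = auto()
    RES_SCALE_RES = auto()


# Hypothetical base names; the actual GGUF names in the PR may differ.
BASE_NAMES = {
    Tensor.RES_SCALE_HS: "blk.{bid}.res_scale_hs",
    Tensor.RES_SCALE_RES: "blk.{bid}.res_scale_res",
}


def tensor_name(tensor, bid, suffix="weight"):
    """Build a full tensor name; the suffix selects weight or bias."""
    if suffix not in ("weight", "bias"):
        raise ValueError(f"unexpected suffix: {suffix!r}")
    return f"{BASE_NAMES[tensor].format(bid=bid)}.{suffix}"


# Two enum entries cover the four scale/bias tensors of a block.
for tensor in Tensor:
    for suffix in ("weight", "bias"):
        print(tensor_name(tensor, bid=3, suffix=suffix))
```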
I think the best approach is to stick with two PRs:
Draft PR with an initial working implementation of ZAYA1-8B, forked from https://github.com/Juste-Leo2/llama.cpp/tree/CCA.
Open to public collaboration for this merge.
Quickstart
Output:
Todo: