Add cache-optimized embedding ops (~12x lookup speedup)#39

Open
dev-erik wants to merge 1 commit into maderix:main from dev-erik:neon-embed-opt
Conversation

@dev-erik dev-erik commented Mar 4, 2026

Summary

Drop-in replacement for embed_lookup and embed_backward that eliminates strided cache misses by using a contiguous memcpy gather followed by a vDSP_mtrans transpose.

Before / After

Benchmarked against upstream stories_cpu_ops.h on Apple M4 Max, compiled with clang -O2. Stories110M config: dim=768, seq=256, vocab=32000. 500 iterations, 10 warmup.

| Operation | Before (ms/call) | After (ms/call) | Speedup |
|---|---|---|---|
| embed_lookup | 0.39 | 0.033 | ~12x |
| embed_backward | 0.50 | 0.45 | ~1.1x |

Results were consistent across 3 consecutive runs (11.5x-12.0x for lookup, ~1.1x for backward).

Why it's faster

The original embed_lookup writes x[d*seq + t] in a double loop: every write strides by seq floats (1 KB at seq=256), causing an L1 cache miss per element. The optimized version:

  1. Gathers contiguous embedding rows via memcpy into a temp buffer
  2. Transposes to channel-first layout in one call to vDSP_mtrans

Same approach for backward: transpose dx first, then scatter-add contiguous rows with vDSP_vadd.

Correctness

Bit-exact match (max |diff| = 0.00e+00) with upstream functions. Bounds checks preserved.

Usage

```c
#include "stories_cpu_ops_opt.h"
#include <stdlib.h>

float *tmp = malloc(SEQ * DIM * sizeof(float));
embed_lookup_opt(x, embed, tokens, DIM, SEQ, tmp);
embed_backward_opt(d_embed, dx, tokens, DIM, SEQ, tmp);
free(tmp);
```

Both functions require a caller-provided scratch buffer of seq * dim floats, which can be reused across calls. No new dependencies: the implementation uses Accelerate's vDSP_mtrans and vDSP_vadd, which are already linked.
