Add cache-optimized embedding ops (~12x lookup speedup) #39
Open
dev-erik wants to merge 1 commit into maderix:main from
Summary
Drop-in replacement for `embed_lookup` and `embed_backward` that eliminates stride-`seq` cache misses by using a contiguous `memcpy` gather + `vDSP_mtrans` transpose.

Before / After
Benchmarked against upstream `stories_cpu_ops.h` on an Apple M4 Max, compiled with `clang -O2`. Stories110M config: dim=768, seq=256, vocab=32000. 500 iterations, 10 warmup. Results for `embed_lookup` and `embed_backward` were consistent across 3 consecutive runs: 11.5x-12.0x speedup for lookup, 1.1x for backward.
Why it's faster
The original `embed_lookup` writes `x[d*seq + t]` in a double loop -- every write strides by `seq` floats (1 KB at seq=256), causing an L1 cache miss per element. The optimized version:

- gathers each token's embedding row into a temp buffer with a contiguous `memcpy`
- transposes the whole buffer to the expected layout with a single `vDSP_mtrans` call

Same approach for backward: transpose `dx` first, then scatter-add contiguous rows with `vDSP_vadd`.

Correctness
Bit-exact match (max |diff| = 0.00e+00) with upstream functions. Bounds checks preserved.
Usage
Requires a caller-provided scratch buffer of `seq * dim` floats. No new dependencies (uses Accelerate `vDSP_mtrans` / `vDSP_vadd`, already linked).