
fix: guard short token datasets in train_large_ane and dynamic pipeline #48

Open
log-wade wants to merge 1 commit into maderix:main from log-wade:contribution/benchmark-m5-and-fixes

Conversation


@log-wade log-wade commented Mar 7, 2026

Summary

  • Add a token dataset length guard in training/train_large_ane.m
  • Add the same guard in training/training_dynamic/train.m
  • Fail early with a clear error when the dataset is too short for one (input, target) window

Why

Both paths use max_pos = n_tokens - SEQ - 1. When n_tokens <= SEQ + 1, this unsigned subtraction underflows, producing a huge range and potentially out-of-bounds reads. train_large.m already had this guard (lines 299–304); this PR aligns the other two pipelines.
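For concreteness, the failure mode and the added check look roughly like this (a minimal C sketch: the guard condition and error message mirror the diff quoted in the review below, while the size_t types and the exit(1) call are assumptions about the surrounding code):

/* Added in train_large_ane.m and training_dynamic/train.m:
   reject the dataset before max_pos is ever computed. */
if (n_tokens <= (size_t)(SEQ + 1)) {
    printf("Token data too short: need at least %d tokens, got %zu\n",
           SEQ + 2, n_tokens);
    exit(1);  /* assumed early exit; the actual code may return an error instead */
}

/* Without the guard, the unsigned subtraction wraps: for a too-short
   file, max_pos becomes a huge value near SIZE_MAX and the position
   sampling can read far past the end of the token buffer. */
size_t max_pos = n_tokens - SEQ - 1;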

Validation

  • make -C training train_large_ane — builds
  • make -C training/training_dynamic train — builds
  • With a too-short data file, both binaries exit with the new error message.

Also included

  • M5 MacBook Pro benchmark result (20 steps, static pipeline): ~127 ms/step, 0.73 ANE TFLOPS. Entry added to benchmarks/community_results.json; full output in benchmarks/my_m5_benchmark_output.txt. The benchmark will also be posted to Issue #3 ("Results on M1/2/3/4, mini, pro and max?").

- Add n_tokens <= SEQ+1 check in train_large_ane.m and training_dynamic/train.m
- Prevents underflow in max_pos and possible OOB reads (aligns with train_large.m)
- Add M5 MacBook Pro benchmark result and full output for Issue maderix#3

Made-with: Cursor

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9fbd4dff5b

Comment on lines +288 to +289
if (n_tokens <= (size_t)(SEQ + 1)) {
printf("Token data too short: need at least %d tokens, got %zu\n", SEQ + 2, n_tokens);

P2: Allow exactly one training window

The new guard rejects datasets with exactly SEQ + 1 tokens, but that case is still valid for one (input,target) window and does not underflow max_pos = n_tokens - SEQ - 1 (it becomes 0, so pos is 0). As written, both this file and training/training_dynamic/train.m now fail valid minimal datasets and smoke tests with the misleading message “need at least SEQ + 2 tokens.”

Useful? React with 👍 / 👎.
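If that suggestion is adopted, the guard would relax by one token, roughly as sketched here (an illustration only, keeping the names and message format from the original diff; whether to allow the single-window case is the maintainers' call):

/* Accept the minimal valid dataset of exactly SEQ + 1 tokens:
   max_pos = n_tokens - SEQ - 1 is then 0, so pos = 0 is the only window. */
if (n_tokens < (size_t)(SEQ + 1)) {
    printf("Token data too short: need at least %d tokens, got %zu\n",
           SEQ + 1, n_tokens);
    exit(1);
}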
