fix: guard short token datasets in train_large_ane and dynamic pipeline#48
log-wade wants to merge 1 commit into maderix:main from
Conversation
- Add n_tokens <= SEQ+1 check in train_large_ane.m and training_dynamic/train.m
- Prevents underflow in max_pos and possible OOB reads (aligns with train_large.m)
- Add M5 MacBook Pro benchmark result and full output for Issue maderix#3

Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9fbd4dff5b
```c
if (n_tokens <= (size_t)(SEQ + 1)) {
    printf("Token data too short: need at least %d tokens, got %zu\n", SEQ + 2, n_tokens);
```
Allow exactly one training window
The new guard rejects datasets with exactly SEQ + 1 tokens, but that case is still valid for one (input, target) window and does not underflow max_pos = n_tokens - SEQ - 1 (it becomes 0, so pos is 0). As written, both this file and training/training_dynamic/train.m now fail valid minimal datasets and smoke tests with the misleading message "need at least SEQ + 2 tokens."
Summary
training/train_large_ane.m
training/training_dynamic/train.m

Why

Both paths use max_pos = n_tokens - SEQ - 1. When n_tokens <= SEQ + 1, this unsigned subtraction underflows, producing a huge range and potentially out-of-bounds reads. train_large.m already had this guard (lines 299–304); this PR aligns the other two pipelines.

Validation
make -C training train_large_ane — builds
make -C training/training_dynamic train — builds

Also included
Benchmark results added to benchmarks/community_results.json; full output in benchmarks/my_m5_benchmark_output.txt. The benchmark will also be posted to Issue #3 ("Results on M1/2/3/4, mini, pro and max?").