Harden token-data ingestion across training pipelines (validation + FD hygiene) by nabbilkhan · Pull Request #30 · maderix/ANE

nabbilkhan · 2026-03-03T19:43:52Z

Why

I’m using this project for local Apple-device training workflows in healthcare IT, where keeping PHI on-device is a hard requirement. While running repeated training/restart cycles, I hit two practical reliability risks:

- malformed token files (misaligned byte length / bad token IDs) were not fully rejected at ingestion time
- exec()-based restart loops could keep token-data file descriptors open longer than necessary

In long-running jobs, these are the kinds of issues that can turn into hard-to-diagnose failures.

What this PR changes

1) Fail-fast token file layout validation in all training pipelines

Applied in:

- training/train_large.m
- training/train_large_ane.m
- training/training_dynamic/train.m

New checks before training starts:

- token file byte length must align to 16-bit token boundaries
- token file must not be empty
- token count must be at least SEQ + 1
- all token IDs must be in [0, VOCAB)
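The four checks above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `check_token_layout`, the `token_check_t` codes, and the `seq` / `vocab` parameters (standing in for the pipelines' SEQ and VOCAB constants) are all hypothetical names, and the real pipelines may order or report the checks differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; the PR's actual error reporting may differ. */
typedef enum {
    TOKENS_OK = 0,
    TOKENS_EMPTY,
    TOKENS_MISALIGNED,
    TOKENS_TOO_SHORT,
    TOKENS_ID_OUT_OF_RANGE
} token_check_t;

/* Checks a raw token buffer of n_bytes bytes against the four rules.
 * seq and vocab are assumed stand-ins for SEQ and VOCAB. */
token_check_t check_token_layout(const void *data, size_t n_bytes,
                                 size_t seq, uint32_t vocab)
{
    if (data == NULL || n_bytes == 0)
        return TOKENS_EMPTY;

    /* Tokens are 16-bit, so an odd byte length means a truncated token. */
    if (n_bytes % sizeof(uint16_t) != 0)
        return TOKENS_MISALIGNED;

    /* Need at least seq + 1 tokens; written as <= to avoid overflow. */
    size_t n_tokens = n_bytes / sizeof(uint16_t);
    if (n_tokens <= seq)
        return TOKENS_TOO_SHORT;

    /* IDs are unsigned, so only the upper bound needs checking. */
    const uint16_t *tokens = (const uint16_t *)data;
    for (size_t i = 0; i < n_tokens; i++) {
        if (tokens[i] >= vocab)
            return TOKENS_ID_OUT_OF_RANGE;
    }
    return TOKENS_OK;
}
```

A caller would run this once on the mapped file before the first training step and exit with a message on any result other than `TOKENS_OK`.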

2) Prevent restart-time FD accumulation

After a successful mmap(), each pipeline now immediately closes data_fd.
The mapping remains valid because closing a descriptor does not unmap the region, and this avoids descriptor accumulation across repeated exec() restarts.
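As a rough sketch of the map-then-close pattern (not the PR's actual code), the function below maps a file read-only and closes the descriptor before returning. `map_token_file` and its parameters are hypothetical names; only the POSIX calls (`open`, `fstat`, `mmap`, `close`) are real.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a token file read-only and closes the descriptor right away.
 * Returns NULL on failure; on success *out_bytes holds the mapping length.
 * map_token_file is a hypothetical name, not a function from the PR. */
const uint16_t *map_token_file(const char *path, size_t *out_bytes)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* Closing the descriptor does not unmap the region, so the fd is
     * released here whether or not mmap succeeded. */
    close(fd);

    if (map == MAP_FAILED)
        return NULL;

    *out_bytes = (size_t)st.st_size;
    return (const uint16_t *)map;
}
```

With the descriptor closed at this point, a later exec() has no token-data fd left to inherit; the caller releases the mapping with `munmap` when done.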

3) Shared helper + expanded unit tests

Added helper in training/data_validation.h:
- token_data_bytes_to_token_count(...)
Expanded training/test_data_validation.c from 8 to 18 tests, including:
- byte-alignment checks (even/odd)
- null-output-pointer behavior
- sequence boundary checks
- vocab boundary checks
- OOB detection (first/middle/last)
- randomized consistency checks for OOB scanning
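
To give a feel for the test categories listed above, here is a small sketch in the same spirit. The signature of the real helper is not shown in this description, so `sketch_bytes_to_token_count` is only an assumed stand-in, as are `sketch_first_oob` and `sketch_run_tests`; treating an empty input and a null output pointer as failures is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PR's helper; the real signature may differ. */
bool sketch_bytes_to_token_count(size_t n_bytes, size_t *out_count)
{
    if (out_count == NULL)
        return false;   /* assumed: null output pointer reports failure */
    if (n_bytes == 0 || n_bytes % sizeof(uint16_t) != 0)
        return false;   /* assumed: empty or odd byte length is rejected */
    *out_count = n_bytes / sizeof(uint16_t);
    return true;
}

/* Returns the index of the first token >= vocab, or -1 if all are in range. */
long sketch_first_oob(const uint16_t *tokens, size_t n, uint32_t vocab)
{
    for (size_t i = 0; i < n; i++) {
        if (tokens[i] >= vocab)
            return (long)i;
    }
    return -1;
}

/* Test-style checks: alignment, null output, and OOB at first/middle/last. */
void sketch_run_tests(void)
{
    size_t count = 0;
    assert(sketch_bytes_to_token_count(8, &count) && count == 4);
    assert(!sketch_bytes_to_token_count(7, &count));
    assert(!sketch_bytes_to_token_count(8, NULL));

    uint16_t t[5] = {0, 1, 2, 3, 4};
    assert(sketch_first_oob(t, 5, 5) == -1);

    t[0] = 9;
    assert(sketch_first_oob(t, 5, 5) == 0);
    t[0] = 0;

    t[2] = 9;
    assert(sketch_first_oob(t, 5, 5) == 2);
    t[2] = 2;

    t[4] = 9;
    assert(sketch_first_oob(t, 5, 5) == 4);
}
```

A randomized consistency check could extend this by planting one out-of-range ID at a random index and asserting that the scan returns that index.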

4) Documentation update

training/README.md now documents byte-alignment validation in the startup checks section.

Validation performed

On Apple Silicon (macOS):

make test_data_validation && ./test_data_validation → 18 passed, 0 failed
make train_large train_large_ane
(cd training/training_dynamic && make train)

Runtime negative-path checks across all three binaries:

- odd-byte token file → rejected with clear error
- too-short token file → rejected with clear error
- out-of-range token ID → rejected with clear error

Compatibility / scope

- No model math changes
- No optimizer changes
- No kernel-shape changes
- This is a safety/reliability hardening PR for data ingestion and process hygiene

I’m an active contributor to OpenClaw and I’m excited to contribute back here as well. Thanks for building and open-sourcing this project.

nabbilkhan added 2 commits March 3, 2026 19:36

Harden token dataset validation across all training pipelines

991bf4d

Harden token file layout checks and prevent exec-time fd leaks

60b0512

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026

[fix] Harden token-data ingestion with validation (upstream PR maderi…

fd86f99

…x#30)

nabbilkhan wants to merge 2 commits into maderix:main from nabbilkhan:security/token-data-validation-enterprise
