Harden token-data ingestion across training pipelines (validation + FD hygiene) #30

Open
nabbilkhan wants to merge 2 commits into maderix:main from nabbilkhan:security/token-data-validation-enterprise

@nabbilkhan
Contributor

Why

I’m using this project for local Apple-device training workflows in healthcare IT, where keeping PHI on-device is a hard requirement. While running repeated training/restart cycles, I hit two practical reliability risks:

  1. malformed token files (misaligned byte length / bad token ids) were not fully rejected at ingestion time
  2. exec()-based restart loops could keep token-data file descriptors open longer than necessary

In long-running jobs, these are the kinds of issues that can turn into hard-to-diagnose failures.

What this PR changes

1) Fail-fast token file layout validation in all training pipelines

Applied in:

  • training/train_large.m
  • training/train_large_ane.m
  • training/training_dynamic/train.m

New checks before training starts:

  • token file byte length must align to 16-bit token boundaries
  • token file must not be empty
  • token count must be at least SEQ + 1
  • all token IDs must be in [0, VOCAB)
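
The four checks above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `check_token_data`, the `token_check_t` codes, and the `bad_index` parameter are hypothetical names, and the sketch assumes tokens are stored as unsigned 16-bit values in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; the PR's helper may report errors differently. */
typedef enum {
    TOKENS_OK = 0,
    TOKENS_ERR_EMPTY,
    TOKENS_ERR_MISALIGNED,
    TOKENS_ERR_TOO_SHORT,
    TOKENS_ERR_OUT_OF_RANGE
} token_check_t;

/* Validate a token buffer of n_bytes bytes before training starts.
 * seq and vocab stand in for the pipeline's SEQ and VOCAB constants.
 * On TOKENS_ERR_OUT_OF_RANGE, *bad_index (if non-NULL) receives the
 * index of the first offending token. */
static token_check_t check_token_data(const uint16_t *tokens, size_t n_bytes,
                                      size_t seq, uint32_t vocab,
                                      size_t *bad_index)
{
    if (n_bytes == 0)
        return TOKENS_ERR_EMPTY;

    /* Byte length must be a whole number of 16-bit tokens. */
    if (n_bytes % sizeof(uint16_t) != 0)
        return TOKENS_ERR_MISALIGNED;

    size_t n_tokens = n_bytes / sizeof(uint16_t);

    /* Need at least seq + 1 tokens; written as <= to avoid overflow in seq + 1. */
    if (n_tokens <= seq)
        return TOKENS_ERR_TOO_SHORT;

    /* Every token ID must lie in [0, vocab). */
    for (size_t i = 0; i < n_tokens; i++) {
        if (tokens[i] >= vocab) {
            if (bad_index)
                *bad_index = i;
            return TOKENS_ERR_OUT_OF_RANGE;
        }
    }
    return TOKENS_OK;
}
```

A caller would run this once on the mapped buffer and abort with a descriptive message on any non-`TOKENS_OK` result, so a bad file fails at startup instead of mid-run.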

2) Prevent restart-time FD accumulation

After a successful mmap(), each pipeline now closes data_fd immediately.
The mapping remains valid because mmap() holds its own reference to the file, and closing the descriptor avoids accumulation across repeated exec() restarts.
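
As an illustration of the close-after-mmap pattern, here is a minimal sketch. It is not the PR's code: `map_token_file` is a hypothetical helper, and the read-only `MAP_PRIVATE` mapping is an assumption about how the pipelines map their data. POSIX specifies that `mmap()` adds a reference to the file that a later `close()` on the descriptor does not remove.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a token file read-only and close the descriptor
 * before returning. Returns NULL on failure; on success stores the file
 * size in *n_bytes. The caller releases the mapping with munmap(). */
static const void *map_token_file(const char *path, size_t *n_bytes)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* The descriptor is no longer needed whether or not mmap() succeeded,
     * so it is closed here and cannot survive into a later exec(). */
    close(fd);

    if (p == MAP_FAILED)
        return NULL;

    *n_bytes = (size_t)st.st_size;
    return p;
}
```

Closing on every path, including the failure paths, means no descriptor from this helper is left open to be inherited across a restart.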

3) Shared helper + expanded unit tests

  • Added helper in training/data_validation.h:
    • token_data_bytes_to_token_count(...)
  • Expanded training/test_data_validation.c from 8 to 18 tests, including:
    • byte-alignment checks (even/odd)
    • null-output-pointer behavior
    • sequence boundary checks
    • vocab boundary checks
    • OOB detection (first/middle/last)
    • randomized consistency checks for OOB scanning

4) Documentation update

  • training/README.md now documents byte-alignment validation in the startup checks section.

Validation performed

On Apple Silicon (macOS):

  • make test_data_validation && ./test_data_validation → 18 passed, 0 failed
  • make train_large train_large_ane
  • (cd training/training_dynamic && make train)

Runtime negative-path checks across all three binaries:

  • odd-byte token file → rejected with clear error
  • too-short token file → rejected with clear error
  • out-of-range token ID → rejected with clear error

Compatibility / scope

  • No model math changes
  • No optimizer changes
  • No kernel-shape changes
  • This is a safety/reliability hardening PR for data ingestion and process hygiene

I’m an active contributor to OpenClaw and I’m excited to contribute back here as well. Thanks for building and open-sourcing this project.

dev-erik added a commit to dev-erik/ANE that referenced this pull request Mar 4, 2026