Skip to content

fix: address HIGH security findings (HIGH-01 to HIGH-05)#13

Open
manni07 wants to merge 17 commits intomaderix:mainfrom
manni07:fix/high-security-findings
Open

fix: address HIGH security findings (HIGH-01 to HIGH-05)#13
manni07 wants to merge 17 commits intomaderix:mainfrom
manni07:fix/high-security-findings

Conversation

@manni07
Copy link

@manni07 manni07 commented Mar 2, 2026

Summary

Fixes all 5 high-severity findings from the security audit (docs/reports/security-audit-2026-03-02.md).

  • HIGH-01: Token index out-of-bounds — added n_tokens < SEQ+1 early-exit guard in train_large.m (with munmap+close cleanup); added if (tok >= VOCAB) { tok = 0; } clamp in both embed_lookup() AND embed_backward() in stories_cpu_ops.h to prevent both OOB read and OOB write/heap corruption
  • HIGH-02: Relative-path DATA_PATH/MODEL_PATH without validation — realpath() guard before open() with actionable error message; realpath() audit log in load_pretrained()
  • HIGH-03: execl() without cleanup — munmap(token_data) + close(data_fd) before exec; realpath(argv[0]) to prevent exec of wrong binary via symlink/relative path
  • HIGH-04: All unchecked malloc()/calloc() — added xmf(n)/xcf(n) OOM-abort helpers in stories_config.h; replaced all 58 raw float allocations in stories_config.h, train_large.m, stories_cpu_ops.h; added inline OOM guards for non-float allocations in stories_io.h, stories_mil.h, ane_runtime.h, ane_mil_gen.h (total: 68 guarded sites)
  • HIGH-05: Silent ANE evaluation failures — ane_eval() changed from void to bool; step_ok &= ane_eval(...) at all 6 call sites; Adam gradient update skipped when any ANE eval fails in a step

Test plan

  • cd training && make train_large — must compile clean (zero errors, zero new warnings)
  • HIGH-01: Run with a token file smaller than (SEQ+1)*2 = 514 bytes → stderr: "Token file too small"
  • HIGH-02: Run train_large from a directory where DATA_PATH doesn't exist → stderr: "Data file not found: '...' Hint: run from training/ directory"
  • HIGH-04: ulimit -v 1000000 && ./train_large → OOM abort with "OOM: malloc(..." diagnostic
  • HIGH-05: Check ane_eval() in stories_io.h returns bool, step_ok used in train_large.m
  • Full training run: verify normal operation unaffected

Related

  • Follows PR fixing CRIT findings (HIGH-04 xmf/xcf helpers build on size_t type safety from CRIT-04)
  • Follows PR fixing MED findings (MED-06 dispatch_once + MED-01 IOSurfaceLock guards in same files)

manni07 and others added 17 commits March 2, 2026 22:37
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CRIT-01: dlopen() return check + NSClassFromString validation in ane_init()
           (ane_runtime.h + stories_config.h); g_ane_ok / g_ane_ok_large flag
           only set when all private classes load successfully; stories_config.h
           gets re-entry guard (g_ane_init_done) that was previously missing
- CRIT-02: g_ane_ok guard in ane_compile() and compile_kern_mil_w(); NULL check
           for inMemoryModel after inMemoryModelWithDescriptor: — prevents crash
           when API call returns nil (ane_runtime.h, stories_io.h)
- CRIT-03: Validate fread() return for critical config/header reads to prevent
           garbage malloc() sizes; fopen() NULL check in save_checkpoint();
           design decision documented (model.h, train_large.m)
- CRIT-04: int -> size_t in build_blob*/build_blob_t/build_blob_fp16; calloc()
           NULL checks added; (size_t) cast in malloc() size calculations to
           prevent signed integer overflow UB (stories_io.h, model.h)

Simulation: 3 iterations, overall score 96.15% (all criteria >= 95%)
ref: docs/reports/security-audit-2026-03-02.md
- MED-01: IOSurfaceLock() return checked in all 6 I/O functions; early return
          on failure prevents data race (stories_io.h, ane_runtime.h)
- MED-02: Per-process/per-call unique temp dirs via getpid()+g_compile_seq
          (stories_io.h, ane_runtime.h)
- MED-03: mil_dims_valid() guard in all 7 MIL-gen functions; nil return on
          invalid params (ane_mil_gen.h)
- MED-04: CkptHdr.pad[0]=0x01020304 byte-order sentinel; runtime check in
          load_checkpoint; _Static_assert for compile-time LE guarantee (train_large.m)
- MED-05: _Static_assert(SEQ%8==0) + ARM64 alignment rationale comment (stories_io.h)
- MED-06: dispatch_once replaces manual g_ane_loaded/g_ane_init_done guards;
          thread-safe one-time ANE init (ane_runtime.h, stories_config.h)

ref: docs/reports/security-audit-2026-03-02.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Simulation plan for HIGH-01 to HIGH-05 with 5-criteria scoring.
Overall avg: 95.76% (all criteria >=95%).
ref: docs/reports/security-audit-2026-03-02.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- train_large.m: early exit if n_tokens < SEQ+1 prevents arithmetic
  underflow and downstream index corruption
- stories_cpu_ops.h: clamp tok >= VOCAB to 0 in embed_lookup() to
  prevent heap-buffer-overflow

ref: docs/reports/security-audit-2026-03-02.md HIGH-01
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- stories_cpu_ops.h: apply same VOCAB clamp in embed_backward() to
  prevent OOB write / heap corruption (mirrors embed_lookup fix)
- train_large.m: munmap+close before early-exit to avoid FD/mmap leak
- train_large.m: realpath() guard for DATA_PATH before open() provides
  clear error message when run from wrong directory
- train_large.m: realpath() audit log in load_pretrained() for model path

ref: docs/reports/security-audit-2026-03-02.md HIGH-02
- munmap(token_data) + close(data_fd) before execl() prevents FD/mmap
  leak across process boundary
- realpath(argv[0]) resolves binary path to prevent exec of wrong binary
  when invoked via relative path

ref: docs/reports/security-audit-2026-03-02.md HIGH-03
Document the FD/mmap cleanup and realpath() argv[0] fix applied in
commit b5c3cf9, including design decisions and updated finding status.
- stories_config.h: add xmf(n)/xcf(n) static inline helpers that abort()
  with diagnostic on OOM — replaces all unchecked malloc/calloc
- stories_config.h: replace all (float*)malloc(X*4) -> xmf(X) and
  (float*)calloc(X,4) -> xcf(X) in all 5 alloc functions
- train_large.m: same replacement for direct alloc calls (embed, rms_final,
  grads, per-iteration temporaries — 31 call sites total)

ref: docs/reports/security-audit-2026-03-02.md HIGH-04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- 7 additional raw malloc/calloc calls replaced with xmf()/xcf() helpers
- cross_entropy_loss buf cast to (size_t) to prevent int overflow

ref: docs/reports/security-audit-2026-03-02.md HIGH-04
- stories_io.h: Kern struct calloc NULL guard + abort
- ane_runtime.h: 5 allocs (ANEKernel, inputBytes, outputBytes, ioInputs,
  ioOutputs) guarded with abort on NULL
- ane_mil_gen.h: 3 uint8_t buf calloc calls guarded with abort on NULL

ref: docs/reports/security-audit-2026-03-02.md HIGH-04

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- stories_mil.h: get_mask_blob() mask calloc now guarded with abort()
  on NULL before immediate array write

ref: docs/reports/security-audit-2026-03-02.md HIGH-04
- stories_io.h: ane_eval() changed from void to bool; captures BOOL
  return from evaluateWithQoS:options:request:error: with NSError log
- train_large.m: step_ok tracks all 6 ane_eval() results per step;
  gradient update skipped if any ANE eval fails in that step

ref: docs/reports/security-audit-2026-03-02.md HIGH-05

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- security-audit-2026-03-02.md: add BEHOBEN status to all 5 HIGH findings
- 001-initial-setup-and-security-audit.md: update status table to show
  all 19 findings resolved; add stories_mil.h to HIGH-04 completeness list

ref: docs/reports/security-audit-2026-03-02.md
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2e0c8332a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +72 to +75
if (IOSurfaceLock(s, 0, NULL) != kIOReturnSuccess) { // MED-01
fprintf(stderr, "IOSurfaceLock(write) failed — surface write skipped\n");
return;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate IOSurface lock failures to step_ok

io_write_fp16 now returns early on IOSurfaceLock failure, but the training loop only tracks ane_eval() in step_ok. In the lock-failure path (e.g., invalid or busy IOSurface), ANE eval can still run on stale inputs and step_ok remains true, so the optimizer update is applied with corrupted gradients instead of being skipped. This should be surfaced as a boolean error back to the caller and combined into the same failure gate used for ane_eval.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant