
[DRAFT] fix: harden storage codec against silent truncation and OOB writes#13

Draft
speckhard wants to merge 2 commits into LeMaterial:main from speckhard:fix/storage-safety

@speckhard speckhard commented Apr 29, 2026

Summary

Three independent safety fixes in the storage codec, all replacing unchecked as u32 / as u8 / as u16 casts with checked conversions that surface a clear error instead of silently producing corrupt output. The PR also adds a defensive type-tag schema check in the flat-batch reader. All of these are defensive checks at boundaries; happy-path behavior is unchanged.
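
For readers less familiar with Rust's cast semantics, the difference between an unchecked `as` cast and a checked conversion can be sketched as below. The function name `cast_vs_checked` is hypothetical and exists only for this illustration; it is not part of the codec.

```rust
use std::convert::TryFrom;
use std::num::TryFromIntError;

/// Illustration only: returns what an unchecked `as` cast and a checked
/// `try_from` conversion each do with the same length.
fn cast_vs_checked(len: usize) -> (u8, Result<u8, TryFromIntError>) {
    // `as` keeps only the low 8 bits, so out-of-range values wrap silently.
    let wrapped = len as u8;
    // `try_from` reports out-of-range values as an error instead.
    let checked = u8::try_from(len);
    (wrapped, checked)
}
```

With a 300-byte length the cast yields 44 (300 mod 256), while the checked conversion returns an error that the caller can surface.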

Fix 1 — Section encoder (atompack/src/storage/soa.rs)

write_section now returns Result<()> and rejects:

  • keys longer than 255 bytes (the on-disk key_len: u8 field cannot hold them; the old cast would have silently truncated the length and produced an unreadable record),
  • payloads longer than u32::MAX (the payload_len: u32 field cannot hold them).

n_sections is accumulated as usize and validated against u16::MAX before being written to the wire. Previously a pathological molecule with > 65535 sections would have wrapped silently.
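
A minimal sketch of the checks described for Fix 1, under stated assumptions: the byte layout in the comment, the `String` error type, and the helper name `checked_section_count` are all hypothetical and are not taken from soa.rs. Only the field widths (key_len: u8, payload_len: u32, n_sections: u16) come from the description above.

```rust
use std::convert::TryFrom;

// Assumed layout, for illustration only:
// [key_len: u8][key bytes][type_tag: u8][payload_len: u32 LE][payload bytes]
fn write_section(
    buf: &mut Vec<u8>,
    key: &str,
    type_tag: u8,
    payload: &[u8],
) -> Result<(), String> {
    // Validate both lengths before touching the buffer, so that a rejected
    // section leaves no partial bytes behind.
    let key_len = u8::try_from(key.len())
        .map_err(|_| format!("section key is {} bytes; key_len is a u8", key.len()))?;
    let payload_len = u32::try_from(payload.len())
        .map_err(|_| format!("section payload is {} bytes; payload_len is a u32", payload.len()))?;

    buf.push(key_len);
    buf.extend_from_slice(key.as_bytes());
    buf.push(type_tag);
    buf.extend_from_slice(&payload_len.to_le_bytes());
    buf.extend_from_slice(payload);
    Ok(())
}

/// Hypothetical helper: the section count is accumulated as usize and only
/// narrowed to the on-disk u16 after a range check.
fn checked_section_count(n_sections: usize) -> Result<u16, String> {
    u16::try_from(n_sections)
        .map_err(|_| format!("{} sections exceed the u16 section count", n_sections))
}
```

Checking before writing is a design choice of this sketch; whether the real encoder orders its checks the same way is not stated in the description.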

Fix 2 — Record-size index fields (atompack/src/storage/mod.rs)

The four compressed.len() as u32 / bytes.len() as u32 casts in append_soa_records and append_owned_soa_records are now routed through a small record_size_u32 helper that errors when the size exceeds u32::MAX. Previously a single record larger than 4 GiB would have silently corrupted the on-disk index.
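
One plausible shape for such a helper is sketched below. Only the name record_size_u32 comes from this PR; the signature, the `what` label parameter, and the `String` error type are assumptions made for the example.

```rust
use std::convert::TryFrom;

/// Sketch of a checked size conversion for the index fields.
/// `what` is a hypothetical label (e.g. "compressed") used in the message.
fn record_size_u32(len: usize, what: &str) -> Result<u32, String> {
    u32::try_from(len).map_err(|_| {
        format!(
            "{} record size {} does not fit in the u32 index field (max {})",
            what,
            len,
            u32::MAX
        )
    })
}
```

A call site would then replace `compressed.len() as u32` with something like `record_size_u32(compressed.len(), "compressed")?`, so an oversized record aborts the append instead of writing a wrapped size into the index.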

Fix 3 — Per-molecule slot length and type-tag check (atompack-py/src/database_flat.rs)

get_molecules_flat's flat-batch hot loop derives slot_bytes and type_tag from molecule 0, then memcpy's each subsequent molecule's per-molecule section payload into a slot of that size. The per-atom path was already validated per-record; the per-molecule slot path was not. A malformed later molecule with a different payload length would OOB-write into an adjacent buffer slot from a parallel rayon thread.

Two new checks:

  • sec.payload.len() == schema_entry.slot_bytes before the unsafe memcpy.
  • sec.type_tag == schema_entry.type_tag so a same-key-different-dtype mismatch can't reinterpret bytes silently.
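
The two checks can be sketched as follows. This is an illustration under assumptions: the `SchemaEntry` and `Section` structs, the `copy_into_slot` function, and the `String` error type are hypothetical stand-ins, and the sketch uses a safe slice copy where the real hot loop uses an unsafe memcpy.

```rust
/// Hypothetical schema entry, derived from molecule 0 in the description.
struct SchemaEntry {
    slot_bytes: usize,
    type_tag: u8,
}

/// Hypothetical decoded per-molecule section.
struct Section<'a> {
    type_tag: u8,
    payload: &'a [u8],
}

/// Copies one molecule's payload into its slot, validating it against the
/// schema first so a malformed molecule cannot spill into a neighbor's slot.
fn copy_into_slot(
    out: &mut [u8],
    mol_idx: usize,
    schema: &SchemaEntry,
    sec: &Section<'_>,
) -> Result<(), String> {
    if sec.type_tag != schema.type_tag {
        return Err(format!(
            "molecule {}: type tag {} does not match schema tag {}",
            mol_idx, sec.type_tag, schema.type_tag
        ));
    }
    if sec.payload.len() != schema.slot_bytes {
        return Err(format!(
            "molecule {}: payload is {} bytes, slot is {} bytes",
            mol_idx,
            sec.payload.len(),
            schema.slot_bytes
        ));
    }
    // Overflow-checked slot bounds; `get_mut` also rejects out-of-range slots.
    let start = mol_idx
        .checked_mul(schema.slot_bytes)
        .ok_or_else(|| "slot offset overflows usize".to_string())?;
    let end = start
        .checked_add(schema.slot_bytes)
        .ok_or_else(|| "slot end overflows usize".to_string())?;
    let slot = out
        .get_mut(start..end)
        .ok_or_else(|| format!("molecule {}: slot is outside the output buffer", mol_idx))?;
    slot.copy_from_slice(sec.payload);
    Ok(())
}
```

In the unsafe version the length equality is what makes the copy sound, since nothing else bounds the write to the slot.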

Files

  • atompack/src/storage/soa.rs: fallible write_section + checked n_sections.
  • atompack/src/storage/mod.rs: record_size_u32 helper + 4 call-sites.
  • atompack-py/src/database_flat.rs: per-molecule slot length check + type-tag check.
  • atompack-py/tests/test_database.py: three new tests covering Fix 1 (overlong key), Fix 3 (length mismatch), and the new type-tag check.

Test plan

  • maturin develop --release builds clean (no new warnings):
$ uv run --extra dev --locked --with "maturin>=1.4,<2.0" maturin develop --release
   Compiling atompack-py v0.2.1
    Finished `release` profile [optimized] target(s) in ~5s
🛠 Installed atompack-db-0.2.1
  • Full test suite green; three new tests cover the fixes:
$ uv run --extra dev --locked pytest tests/ -q
126 passed, 6 skipped in 2.77s
  • The three new safety tests run individually and exercise the new code paths:
$ uv run --extra dev --locked pytest tests/test_database.py::test_overlong_property_key_rejected_at_write \
                                       tests/test_database.py::test_get_molecules_flat_rejects_per_mol_slot_mismatch \
                                       tests/test_database.py::test_get_molecules_flat_rejects_per_mol_type_tag_mismatch -v
3 passed in 0.06s
  • (follow-up in PR 9e) Add a test for Fix 2 (>4 GiB record). Skipped here because it requires constructing a multi-GiB corpus and the helper itself has trivial logic.

What this is not

This PR does not change the on-disk format, does not change the public Python API, does not affect performance on valid inputs (the new checks are O(1) per record/section), and does not silence any pre-existing diagnostic. It only converts silent corruption / undefined behavior on bad inputs into clean errors.

Three independent safety fixes in the storage path; all are defensive
checks at boundaries that previously cast unconditionally.

1. Section encoder (atompack/src/storage/soa.rs)
   - write_section() now returns Result and rejects keys > 255 bytes
     (key_len is a u8 field) and payloads > u32::MAX (payload_len is u32).
     Previously these would silently truncate and produce an unreadable
     record.
   - n_sections is accumulated as usize and checked against u16::MAX
     before writing the on-disk u16. Previously the cast wrapped silently
     on pathological inputs.

2. Record-size index fields (atompack/src/storage/mod.rs)
   - The four `compressed.len() as u32` / `uncompressed_size as u32`
     casts in append_soa_records and append_owned_soa_records now go
     through a small helper that errors if the size exceeds u32::MAX
     instead of silently corrupting the index for >4 GiB single records.

3. Per-molecule slot length check (atompack-py/src/database_flat.rs)
   - The flat-batch hot loop validates the slot_bytes schema against the
     first molecule, but the per-molecule slot memcpy did not re-validate
     subsequent molecules' payload lengths. A malformed later molecule
     could OOB-write into an adjacent buffer slot from a parallel rayon
     thread. Added a length assertion before the memcpy.

Tests:
- New test_overlong_property_key_rejected_at_write covers fix 1 end-to-end.
- All existing tests (123) pass unchanged, confirming no behavior regression
  on the happy path.

Independent review caught two follow-ups for the storage-safety PR:

- get_molecules_flat now also validates that subsequent molecules' section
  type_tag matches the schema (taken from molecule 0). Previously a
  same-key-same-size-different-dtype mismatch would have silently
  reinterpreted bytes downstream as a different dtype.
- Two new tests cover fix 3 (per-mol slot length mismatch, the unsafe
  memcpy hazard) and the new type-tag check. Built without a multi-GiB
  corpus by using two molecules with same-key but different per-molecule
  property dtypes/lengths.