[DRAFT] fix: harden storage codec against silent truncation and OOB writes by speckhard · Pull Request #13 · LeMaterial/atompack

speckhard · 2026-04-29T10:11:13Z

Summary

Three independent safety fixes in the storage path, plus a defensive type-tag schema check in the flat-batch reader. The codec fixes replace unchecked as u32 / as u8 / as u16 casts with checked conversions that surface a clear error instead of silently producing corrupt output. All of these are defensive checks at boundaries; happy-path behavior is unchanged.

Fix 1 — Section encoder (`atompack/src/storage/soa.rs`)

write_section now returns Result<()> and rejects:

keys longer than 255 bytes (the on-disk key_len: u8 field cannot hold them; previously the length was silently truncated, producing an unreadable record),
payloads longer than u32::MAX (the payload_len: u32 field cannot hold them).

n_sections is accumulated as usize and validated against u16::MAX before being written to the wire. Previously a pathological molecule with > 65535 sections would have wrapped silently.
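The checks described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the field order in the section layout, the `io::Error` error type, and the `invalid` and `checked_section_count` helper names are assumptions made for the example. Only the field widths (key_len: u8, payload_len: u32, n_sections: u16) come from the description.

```rust
use std::convert::TryFrom; // already in the prelude on edition 2021
use std::io;

/// Hypothetical error helper; the crate's real error type is not shown in this PR.
fn invalid(msg: String) -> io::Error {
    io::Error::new(io::ErrorKind::InvalidInput, msg)
}

/// Sketch of a fallible section writer.
/// Assumed layout: [key_len: u8][key][type_tag: u8][payload_len: u32 LE][payload].
fn write_section(out: &mut Vec<u8>, key: &str, type_tag: u8, payload: &[u8]) -> io::Result<()> {
    // Both conversions run before any byte is written, so a rejected
    // section leaves `out` untouched.
    let key_len = u8::try_from(key.len()).map_err(|_| {
        invalid(format!("section key is {} bytes; key_len (u8) holds at most 255", key.len()))
    })?;
    let payload_len = u32::try_from(payload.len()).map_err(|_| {
        invalid(format!("section payload is {} bytes; payload_len is a u32", payload.len()))
    })?;

    out.push(key_len);
    out.extend_from_slice(key.as_bytes());
    out.push(type_tag);
    out.extend_from_slice(&payload_len.to_le_bytes());
    out.extend_from_slice(payload);
    Ok(())
}

/// Validate a section count accumulated as usize before it is written as a u16.
fn checked_section_count(n_sections: usize) -> io::Result<u16> {
    u16::try_from(n_sections)
        .map_err(|_| invalid(format!("{} sections; the on-disk count is a u16", n_sections)))
}
```

With a plain `as u8` cast, a 256-byte key would be recorded with key_len 0, which is why the reader could no longer parse the record.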

Fix 2 — Record-size index fields (`atompack/src/storage/mod.rs`)

The four compressed.len() as u32 / bytes.len() as u32 casts in append_soa_records and append_owned_soa_records are now routed through a small record_size_u32 helper that errors when the size exceeds u32::MAX. Previously a single record larger than 4 GiB would have silently corrupted the on-disk index.
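A helper of this kind might look like the sketch below. The PR names record_size_u32 but does not show its signature, so the `what` parameter and the `io::Error` return type are assumptions for illustration.

```rust
use std::convert::TryFrom; // already in the prelude on edition 2021
use std::io;

/// Sketch of a checked size conversion for the u32 index fields.
/// `what` is a hypothetical label used only to make the error message readable.
fn record_size_u32(len: usize, what: &str) -> io::Result<u32> {
    u32::try_from(len).map_err(|_| {
        io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is {} bytes, which exceeds the u32 index field", what, len),
        )
    })
}
```

A call site would then read something like `record_size_u32(compressed.len(), "compressed record")?` in place of `compressed.len() as u32`. On a 64-bit target the unchecked cast keeps only the low 32 bits, so a record of 2^32 + 5 bytes would be indexed as 5 bytes.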

Fix 3 — Per-molecule slot length and type-tag check (`atompack-py/src/database_flat.rs`)

get_molecules_flat's flat-batch hot loop derives slot_bytes and type_tag from molecule 0, then memcpy's each subsequent molecule's per-molecule section payload into a slot of that size. The per-atom path was already validated per-record; the per-molecule slot path was not. A malformed later molecule with a different payload length would OOB-write into an adjacent buffer slot from a parallel rayon thread.

Two new checks:

sec.payload.len() == schema_entry.slot_bytes before the unsafe memcpy.
sec.type_tag == schema_entry.type_tag so a same-key-different-dtype mismatch can't reinterpret bytes silently.
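The two checks can be sketched as below. `SchemaEntry`, `Section`, and `copy_into_slot` are simplified stand-ins written for this example; only the field names slot_bytes, type_tag, and payload come from the description. The sketch also uses a bounds-checked slice copy rather than the unsafe memcpy in the real hot loop, so it illustrates the validation order, not the actual implementation.

```rust
/// Hypothetical schema entry, derived from molecule 0 in the real reader.
struct SchemaEntry {
    slot_bytes: usize,
    type_tag: u8,
}

/// Hypothetical decoded per-molecule section.
struct Section<'a> {
    type_tag: u8,
    payload: &'a [u8],
}

/// Validate a section against the schema, then copy it into slot `mol_idx` of `dst`.
fn copy_into_slot(
    dst: &mut [u8],
    mol_idx: usize,
    schema: &SchemaEntry,
    sec: &Section<'_>,
) -> Result<(), String> {
    // Same key but a different dtype: reject instead of reinterpreting bytes.
    if sec.type_tag != schema.type_tag {
        return Err(format!(
            "molecule {}: type tag {} does not match schema tag {}",
            mol_idx, sec.type_tag, schema.type_tag
        ));
    }
    // A longer payload would spill into the neighbouring slot.
    if sec.payload.len() != schema.slot_bytes {
        return Err(format!(
            "molecule {}: payload is {} bytes, slot is {} bytes",
            mol_idx,
            sec.payload.len(),
            schema.slot_bytes
        ));
    }
    let start = mol_idx
        .checked_mul(schema.slot_bytes)
        .ok_or_else(|| "slot offset overflows usize".to_string())?;
    let end = start
        .checked_add(schema.slot_bytes)
        .ok_or_else(|| "slot end overflows usize".to_string())?;
    let slot = dst
        .get_mut(start..end)
        .ok_or_else(|| format!("molecule {}: slot is outside the output buffer", mol_idx))?;
    // Lengths are equal at this point, so copy_from_slice cannot panic.
    slot.copy_from_slice(sec.payload);
    Ok(())
}
```

Both comparisons are constant-time per section, which is consistent with the claim that valid inputs see no measurable slowdown.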

Files

atompack/src/storage/soa.rs — fallible write_section + checked n_sections.
atompack/src/storage/mod.rs — record_size_u32 helper + 4 call-sites.
atompack-py/src/database_flat.rs — per-molecule slot length check + type-tag check.
atompack-py/tests/test_database.py — three new tests covering fix #1 (overlong key), the fix #3 length mismatch, and the new type-tag check.

Test plan

maturin develop --release builds clean (no new warnings):

$ uv run --extra dev --locked --with "maturin>=1.4,<2.0" maturin develop --release
   Compiling atompack-py v0.2.1
    Finished `release` profile [optimized] target(s) in ~5s
🛠 Installed atompack-db-0.2.1

Full test suite green; three new tests cover the fixes:

$ uv run --extra dev --locked pytest tests/ -q
126 passed, 6 skipped in 2.77s

The three new safety tests run individually and exercise the new code paths:

$ uv run --extra dev --locked pytest tests/test_database.py::test_overlong_property_key_rejected_at_write \
                                       tests/test_database.py::test_get_molecules_flat_rejects_per_mol_slot_mismatch \
                                       tests/test_database.py::test_get_molecules_flat_rejects_per_mol_type_tag_mismatch -v
3 passed in 0.06s

(follow-up in PR 9e) Add a test for fix #2 (>4 GiB record). Skipped here because it requires constructing a multi-GiB corpus and the helper itself has trivial logic.

What this is not

This PR does not change the on-disk format, does not change the public Python API, does not affect performance on valid inputs (the new checks are O(1) per record/section), and does not silence any pre-existing diagnostic. It only converts silent corruption / undefined behavior on bad inputs into clean errors.

Three independent safety fixes in the storage path; all are defensive checks at boundaries that previously cast unconditionally.

1. Section encoder (atompack/src/storage/soa.rs)
- write_section() now returns Result and rejects keys > 255 bytes (key_len is a u8 field) and payloads > u32::MAX (payload_len is u32). Previously these would silently truncate and produce an unreadable record.
- n_sections is accumulated as usize and checked against u16::MAX before writing the on-disk u16. Previously the cast wrapped silently on pathological inputs.

2. Record-size index fields (atompack/src/storage/mod.rs)
- The four `compressed.len() as u32` / `uncompressed_size as u32` casts in append_soa_records and append_owned_soa_records now go through a small helper that errors if the size exceeds u32::MAX instead of silently corrupting the index for >4 GiB single records.

3. Per-molecule slot length check (atompack-py/src/database_flat.rs)
- The flat-batch hot loop validates the slot_bytes schema against the first molecule, but the per-molecule slot memcpy did not re-validate subsequent molecules' payload lengths. A malformed later molecule could OOB-write into an adjacent buffer slot from a parallel rayon thread. Added a length assertion before the memcpy.

Tests:
- New test_overlong_property_key_rejected_at_write covers fix #1 end-to-end.
- All existing tests (123) pass unchanged, confirming no behavior regression on the happy path.

Independent review caught two follow-ups for the storage-safety PR:
- get_molecules_flat now also validates that subsequent molecules' section type_tag matches the schema (taken from molecule 0). Previously a same-key-same-size-different-dtype mismatch would have silently reinterpreted bytes downstream as a different dtype.
- Two new tests cover fix #3 (per-mol slot length mismatch, the unsafe memcpy hazard) and the new type-tag check. Built without a multi-GiB corpus by using two molecules with same-key but different per-molecule property dtypes/lengths.

speckhard added 2 commits April 29, 2026 12:10

speckhard wants to merge 2 commits into LeMaterial:main from speckhard:fix/storage-safety
