[DRAFT] fix: harden storage codec against silent truncation and OOB writes#13
Draft
speckhard wants to merge 2 commits intoLeMaterial:mainfrom
Draft
[DRAFT] fix: harden storage codec against silent truncation and OOB writes#13speckhard wants to merge 2 commits intoLeMaterial:mainfrom
speckhard wants to merge 2 commits intoLeMaterial:mainfrom
Conversation
Three independent safety fixes in the storage path; all are defensive
checks at boundaries that previously cast unconditionally.
1. Section encoder (atompack/src/storage/soa.rs)
- write_section() now returns Result and rejects keys > 255 bytes
(key_len is a u8 field) and payloads > u32::MAX (payload_len is u32).
Previously these would silently truncate and produce an unreadable
record.
- n_sections is accumulated as usize and checked against u16::MAX
before writing the on-disk u16. Previously the cast wrapped silently
on pathological inputs.
2. Record-size index fields (atompack/src/storage/mod.rs)
- The four `compressed.len() as u32` / `uncompressed_size as u32`
casts in append_soa_records and append_owned_soa_records now go
through a small helper that errors if the size exceeds u32::MAX
instead of silently corrupting the index for >4 GiB single records.
3. Per-molecule slot length check (atompack-py/src/database_flat.rs)
- The flat-batch hot loop validates the slot_bytes schema against the
first molecule, but the per-molecule slot memcpy did not re-validate
subsequent molecules' payload lengths. A malformed later molecule
could OOB-write into an adjacent buffer slot from a parallel rayon
thread. Added a length assertion before the memcpy.
Tests:
- New test_overlong_property_key_rejected_at_write covers fix LeMaterial#1 end-to-end.
- All existing tests (123) pass unchanged, confirming no behavior regression
on the happy path.
Independent review caught two follow-ups for the storage-safety PR: - get_molecules_flat now also validates that subsequent molecules' section type_tag matches the schema (taken from molecule 0). Previously a same-key-same-size-different-dtype mismatch would have silently reinterpreted bytes downstream as a different dtype. - Two new tests cover fix LeMaterial#3 (per-mol slot length mismatch, the unsafe memcpy hazard) and the new type-tag check. Built without a multi-GiB corpus by using two molecules with same-key but different per-molecule property dtypes/lengths.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three independent safety fixes in the storage codec, all replacing unchecked
as u32/as u8/as u16casts with checked conversions that surface a clear error instead of silently producing corrupt output. Plus a defensive type-tag schema check in the flat-batch reader. These are all defensive checks at boundaries; happy-path behavior is unchanged.Fix 1 — Section encoder (
atompack/src/storage/soa.rs)write_sectionnow returnsResult<()>and rejects:key_len: u8field cannot hold them — would have silently truncated and produced an unreadable record),u32::MAX(thepayload_len: u32field cannot hold them).n_sectionsis accumulated asusizeand validated againstu16::MAXbefore being written to the wire. Previously a pathological molecule with > 65535 sections would have wrapped silently.Fix 2 — Record-size index fields (
atompack/src/storage/mod.rs)The four
compressed.len() as u32/bytes.len() as u32casts inappend_soa_recordsandappend_owned_soa_recordsare now routed through a smallrecord_size_u32helper that errors when the size exceedsu32::MAX. Previously a single record larger than 4 GiB would have silently corrupted the on-disk index.Fix 3 — Per-molecule slot length and type-tag check (
atompack-py/src/database_flat.rs)get_molecules_flat's flat-batch hot loop derivesslot_bytesandtype_tagfrom molecule 0, then memcpy's each subsequent molecule's per-molecule section payload into a slot of that size. The per-atom path was already validated per-record; the per-molecule slot path was not. A malformed later molecule with a different payload length would OOB-write into an adjacent buffer slot from a parallel rayon thread.Two new checks:
sec.payload.len() == schema_entry.slot_bytesbefore the unsafe memcpy.sec.type_tag == schema_entry.type_tagso a same-key-different-dtype mismatch can't reinterpret bytes silently.Files
atompack/src/storage/soa.rs— falliblewrite_section+ checkedn_sections.atompack/src/storage/mod.rs—record_size_u32helper + 4 call-sites.atompack-py/src/database_flat.rs— per-molecule slot length check + type-tag check.atompack-py/tests/test_database.py— three new tests covering fix docs: update README with PyPI install instructions #1 (overlong key), fix chore: build for python>3.13 #3 length mismatch, and the new type-tag check.Test plan
maturin develop --releasebuilds clean (no new warnings):What this is not
This PR does not change the on-disk format, does not change the public Python API, does not affect performance on valid inputs (the new checks are O(1) per record/section), and does not silence any pre-existing diagnostic. It only converts silent corruption / undefined behavior on bad inputs into clean errors.