Verify bounded MinHasher sketch invariants (7.2.7)#230
Conversation
Add the pre-implementation ExecPlan for roadmap item 7.2.7. The plan captures the bounded Kani proof scope for `MinHasher::sketch`, the ordinary unit and behaviour test expectations, documentation updates, validation gates, and the approval boundary before implementation starts.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
SummaryThis PR implements roadmap task 7.2.7: bounded Kani verification of Key changesVerification harnesses (crates/whitaker_clones_core/src/index/kani.rs)Added three
Core implementation updates (crates/whitaker_clones_core/src/index/minhash.rs)
Test coverage enhancements
Documentation and planning
Validation
WalkthroughThis PR completes roadmap item 7.2.7 by refactoring ChangesMinHasher Kani Verification
Suggested labels
Suggested reviewers
Poem
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (15 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Record implementation approval and restore executable mode on the Kani runner so the Makefile proof target can run before the MinHasher harness work begins.
Add parameterized unit coverage for deterministic sketches, duplicate hash insensitivity, empty inputs, and reordered set semantics. Extend the BDD candidate-generation feature so duplicate retained hashes are observable through the public LSH workflow before adding the Kani harnesses.
Add the planned MinHasher sketch harnesses and register them with the clone-detector Kani runner. Record the current blocker in the ExecPlan: verification of the real production path needs a higher unwind bound or an approved proof seam because `MinHasher::new` builds the fixed 128-slot seed array.
Add Kani harnesses that exercise real `MinHasher::sketch` for empty input rejection, deterministic output, and duplicate-hash set semantics. Keep the proof tractable with a private `cfg(kani)` seed fixture and explicit fixed-width signature builder.
Record the bounded Kani proof shape for `MinHasher::sketch` in the clone-detector design and developer guide. Mark roadmap item 7.2.7 complete and close the execplan with the validation and CodeRabbit results.
|
@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response. Ensure that this is validated against the current version of the codegraph. If further refinement to address this finding would be deleterious, please supply a clear explanatory one to two paragraph markdown message I can paste into the CodeScene web ui's diagnostic suppression function so this diagnostic can be silenced. crates/whitaker_clones_core/src/index/minhash.rs Comment on lines +76 to +206 fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
[
minimum_mixed_hash(seeds[0], unique_hashes),
minimum_mixed_hash(seeds[1], unique_hashes),
minimum_mixed_hash(seeds[2], unique_hashes),
minimum_mixed_hash(seeds[3], unique_hashes),
minimum_mixed_hash(seeds[4], unique_hashes),
minimum_mixed_hash(seeds[5], unique_hashes),
minimum_mixed_hash(seeds[6], unique_hashes),
minimum_mixed_hash(seeds[7], unique_hashes),
minimum_mixed_hash(seeds[8], unique_hashes),
minimum_mixed_hash(seeds[9], unique_hashes),
minimum_mixed_hash(seeds[10], unique_hashes),
minimum_mixed_hash(seeds[11], unique_hashes),
minimum_mixed_hash(seeds[12], unique_hashes),
minimum_mixed_hash(seeds[13], unique_hashes),
minimum_mixed_hash(seeds[14], unique_hashes),
minimum_mixed_hash(seeds[15], unique_hashes),
minimum_mixed_hash(seeds[16], unique_hashes),
minimum_mixed_hash(seeds[17], unique_hashes),
minimum_mixed_hash(seeds[18], unique_hashes),
minimum_mixed_hash(seeds[19], unique_hashes),
minimum_mixed_hash(seeds[20], unique_hashes),
minimum_mixed_hash(seeds[21], unique_hashes),
minimum_mixed_hash(seeds[22], unique_hashes),
minimum_mixed_hash(seeds[23], unique_hashes),
minimum_mixed_hash(seeds[24], unique_hashes),
minimum_mixed_hash(seeds[25], unique_hashes),
minimum_mixed_hash(seeds[26], unique_hashes),
minimum_mixed_hash(seeds[27], unique_hashes),
minimum_mixed_hash(seeds[28], unique_hashes),
minimum_mixed_hash(seeds[29], unique_hashes),
minimum_mixed_hash(seeds[30], unique_hashes),
minimum_mixed_hash(seeds[31], unique_hashes),
minimum_mixed_hash(seeds[32], unique_hashes),
minimum_mixed_hash(seeds[33], unique_hashes),
minimum_mixed_hash(seeds[34], unique_hashes),
minimum_mixed_hash(seeds[35], unique_hashes),
minimum_mixed_hash(seeds[36], unique_hashes),
minimum_mixed_hash(seeds[37], unique_hashes),
minimum_mixed_hash(seeds[38], unique_hashes),
minimum_mixed_hash(seeds[39], unique_hashes),
minimum_mixed_hash(seeds[40], unique_hashes),
minimum_mixed_hash(seeds[41], unique_hashes),
minimum_mixed_hash(seeds[42], unique_hashes),
minimum_mixed_hash(seeds[43], unique_hashes),
minimum_mixed_hash(seeds[44], unique_hashes),
minimum_mixed_hash(seeds[45], unique_hashes),
minimum_mixed_hash(seeds[46], unique_hashes),
minimum_mixed_hash(seeds[47], unique_hashes),
minimum_mixed_hash(seeds[48], unique_hashes),
minimum_mixed_hash(seeds[49], unique_hashes),
minimum_mixed_hash(seeds[50], unique_hashes),
minimum_mixed_hash(seeds[51], unique_hashes),
minimum_mixed_hash(seeds[52], unique_hashes),
minimum_mixed_hash(seeds[53], unique_hashes),
minimum_mixed_hash(seeds[54], unique_hashes),
minimum_mixed_hash(seeds[55], unique_hashes),
minimum_mixed_hash(seeds[56], unique_hashes),
minimum_mixed_hash(seeds[57], unique_hashes),
minimum_mixed_hash(seeds[58], unique_hashes),
minimum_mixed_hash(seeds[59], unique_hashes),
minimum_mixed_hash(seeds[60], unique_hashes),
minimum_mixed_hash(seeds[61], unique_hashes),
minimum_mixed_hash(seeds[62], unique_hashes),
minimum_mixed_hash(seeds[63], unique_hashes),
minimum_mixed_hash(seeds[64], unique_hashes),
minimum_mixed_hash(seeds[65], unique_hashes),
minimum_mixed_hash(seeds[66], unique_hashes),
minimum_mixed_hash(seeds[67], unique_hashes),
minimum_mixed_hash(seeds[68], unique_hashes),
minimum_mixed_hash(seeds[69], unique_hashes),
minimum_mixed_hash(seeds[70], unique_hashes),
minimum_mixed_hash(seeds[71], unique_hashes),
minimum_mixed_hash(seeds[72], unique_hashes),
minimum_mixed_hash(seeds[73], unique_hashes),
minimum_mixed_hash(seeds[74], unique_hashes),
minimum_mixed_hash(seeds[75], unique_hashes),
minimum_mixed_hash(seeds[76], unique_hashes),
minimum_mixed_hash(seeds[77], unique_hashes),
minimum_mixed_hash(seeds[78], unique_hashes),
minimum_mixed_hash(seeds[79], unique_hashes),
minimum_mixed_hash(seeds[80], unique_hashes),
minimum_mixed_hash(seeds[81], unique_hashes),
minimum_mixed_hash(seeds[82], unique_hashes),
minimum_mixed_hash(seeds[83], unique_hashes),
minimum_mixed_hash(seeds[84], unique_hashes),
minimum_mixed_hash(seeds[85], unique_hashes),
minimum_mixed_hash(seeds[86], unique_hashes),
minimum_mixed_hash(seeds[87], unique_hashes),
minimum_mixed_hash(seeds[88], unique_hashes),
minimum_mixed_hash(seeds[89], unique_hashes),
minimum_mixed_hash(seeds[90], unique_hashes),
minimum_mixed_hash(seeds[91], unique_hashes),
minimum_mixed_hash(seeds[92], unique_hashes),
minimum_mixed_hash(seeds[93], unique_hashes),
minimum_mixed_hash(seeds[94], unique_hashes),
minimum_mixed_hash(seeds[95], unique_hashes),
minimum_mixed_hash(seeds[96], unique_hashes),
minimum_mixed_hash(seeds[97], unique_hashes),
minimum_mixed_hash(seeds[98], unique_hashes),
minimum_mixed_hash(seeds[99], unique_hashes),
minimum_mixed_hash(seeds[100], unique_hashes),
minimum_mixed_hash(seeds[101], unique_hashes),
minimum_mixed_hash(seeds[102], unique_hashes),
minimum_mixed_hash(seeds[103], unique_hashes),
minimum_mixed_hash(seeds[104], unique_hashes),
minimum_mixed_hash(seeds[105], unique_hashes),
minimum_mixed_hash(seeds[106], unique_hashes),
minimum_mixed_hash(seeds[107], unique_hashes),
minimum_mixed_hash(seeds[108], unique_hashes),
minimum_mixed_hash(seeds[109], unique_hashes),
minimum_mixed_hash(seeds[110], unique_hashes),
minimum_mixed_hash(seeds[111], unique_hashes),
minimum_mixed_hash(seeds[112], unique_hashes),
minimum_mixed_hash(seeds[113], unique_hashes),
minimum_mixed_hash(seeds[114], unique_hashes),
minimum_mixed_hash(seeds[115], unique_hashes),
minimum_mixed_hash(seeds[116], unique_hashes),
minimum_mixed_hash(seeds[117], unique_hashes),
minimum_mixed_hash(seeds[118], unique_hashes),
minimum_mixed_hash(seeds[119], unique_hashes),
minimum_mixed_hash(seeds[120], unique_hashes),
minimum_mixed_hash(seeds[121], unique_hashes),
minimum_mixed_hash(seeds[122], unique_hashes),
minimum_mixed_hash(seeds[123], unique_hashes),
minimum_mixed_hash(seeds[124], unique_hashes),
minimum_mixed_hash(seeds[125], unique_hashes),
minimum_mixed_hash(seeds[126], unique_hashes),
minimum_mixed_hash(seeds[127], unique_hashes),
]❌ New issue: Large Method |
This comment was marked as resolved.
This comment was marked as resolved.
Replace an unnecessary `rstest` marker with `test` and refresh the execplan risk and orientation text to match the shipped proof seam.
Update the execplan milestone text to reference Vec-backed dedup and the approved unwind 129 proof-bound resolution.
Assert the intermediate sorted and deduplicated hash vector in the existing MinHasher set-semantics tests so removing `dedup` fails.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/whitaker_clones_core/src/index/minhash.rs (2)
70-73: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winDocument the cfg-specific sketch_values implementations.
Both variants of
sketch_valueslack doc comments explaining the conditional compilation split. The unrolled#[cfg(kani)]version (lines 75-207) is a deliberate proof seam that prevents Kani's bounded model checker from spending proof budget on iterator state expansion, whilst the production version (lines 70-73) uses the idiomaticarray::from_fn. Add doc comments on both variants explaining:
- Why two implementations exist (proof tractability versus production idiom)
- That the cfg(kani) version is mechanically equivalent but explicitly unrolled
- That both compute the same 128-lane MinHash signature via
minimum_mixed_hash📝 Suggested documentation
+/// Computes the 128-lane MinHash signature from seeds and unique hashes. +/// +/// Production builds use [`array::from_fn`] for idiomatic iteration. #[cfg(not(kani))] fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] { array::from_fn(|index| minimum_mixed_hash(seeds[index], unique_hashes)) } +/// Computes the 128-lane MinHash signature with explicit unrolling for Kani. +/// +/// This cfg(kani) variant manually unrolls all 128 calls to +/// [`minimum_mixed_hash`] so Kani's bounded model checker does not spend proof +/// budget on loop or iterator state expansion. The explicit array literal keeps +/// the unwind bound tractable whilst still verifying the real signature +/// construction. This implementation is mechanically equivalent to the +/// production [`array::from_fn`] version. #[cfg(kani)] fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {As per coding guidelines: "Document public Rust APIs using Rustdoc comments (///) so documentation can be generated with cargo doc", and ensure unusual patterns are explained for maintainability.
Also applies to: 75-207
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 70 - 73, Add Rustdoc comments to both cfg-specific implementations of sketch_values explaining that there are two implementations: the #[cfg(kani)] unrolled variant exists as a proof seam to avoid Kani's iterator/state expansion and is mechanically equivalent to the production implementation, while the #[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state that both compute the same 128-lane MinHash signature by calling minimum_mixed_hash for each lane. Place these /// comments directly above each sketch_values definition and mention the equivalence and intent (proof tractability vs production idiom) and that both call minimum_mixed_hash to produce the MinHash signature.
60-68: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueExpand the doc comment to explain the proof-specific purpose.
The current comment describes what the function does but not why it exists. Explain that this fixture avoids seed-stream construction to reduce Kani's symbolic state space, referencing the
#[cfg(kani)]proof seam pattern.📝 Suggested expansion
- /// Creates a deterministic proof fixture without exercising seed-stream - /// construction. + /// Creates a deterministic proof fixture for Kani harnesses. + /// + /// This constructor bypasses seed-stream generation to reduce symbolic + /// state-space growth in bounded model checking. Production code continues + /// to use [`MinHasher::new`] and the full seed sequence. #[cfg(kani)] pub(super) fn from_seed_for_kani(seed: u64) -> Self {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 60 - 68, The doc comment for pub(super) fn from_seed_for_kani(seed: u64) currently describes behavior but not intent; update the comment to state this is a proof-only fixture guarded by #[cfg(kani)] that fills seeds with the same value to avoid constructing the seed-stream and thus drastically reduce Kani's symbolic state space as a proof seam; mention MINHASH_SIZE and that this pattern is used to provide deterministic, lightweight inputs for Kani proofs rather than production logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/whitaker_clones_core/src/index/minhash.rs`:
- Around line 70-73: Add Rustdoc comments to both cfg-specific implementations
of sketch_values explaining that there are two implementations: the #[cfg(kani)]
unrolled variant exists as a proof seam to avoid Kani's iterator/state expansion
and is mechanically equivalent to the production implementation, while the
#[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state
that both compute the same 128-lane MinHash signature by calling
minimum_mixed_hash for each lane. Place these /// comments directly above each
sketch_values definition and mention the equivalence and intent (proof
tractability vs production idiom) and that both call minimum_mixed_hash to
produce the MinHash signature.
- Around line 60-68: The doc comment for pub(super) fn from_seed_for_kani(seed:
u64) currently describes behavior but not intent; update the comment to state
this is a proof-only fixture guarded by #[cfg(kani)] that fills seeds with the
same value to avoid constructing the seed-stream and thus drastically reduce
Kani's symbolic state space as a proof seam; mention MINHASH_SIZE and that this
pattern is used to provide deterministic, lightweight inputs for Kani proofs
rather than production logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a0ce00a0-0aaa-46e8-b6b1-42559315a8fd
📒 Files selected for processing (3)
crates/whitaker_clones_core/src/index/minhash.rscrates/whitaker_clones_core/src/index/tests.rsdocs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md
Update the ExecPlan milestone wording to distinguish the recorded unwind 129 investigation approval from the shipped MinHasher harnesses, which use unwind 4 through the proof seam.
Summary
This branch implements roadmap task (7.2.7): bounded Kani verification of
MinHasher::sketchinvariants. It verifies empty-input failure, deterministic output, and duplicate-hash insensitivity while preserving the existing public API.Roadmap task: (7.2.7)
Execplan: docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md
Review walkthrough
MinHasher::sketchharnesses.cfg(kani)proof seam.Validation
make kani-clone-detector: passed.make check-fmt: passed.make lint: passed.make test: passed, 1400 tests passed and 2 skipped under the default nextest profile.make markdownlint: passed.make nixie: passed.coderabbit review --agent: passed after the Kani proof milestone with zero findings.coderabbit review --agent: passed after the documentation and roadmap milestone with zero findings.Notes
The Kani harnesses call real
MinHasher::sketch. The proof uses a privatecfg(kani)seed fixture and fixed-width signature builder to avoid spending the proof bound on seed-stream array construction while keeping the production public API unchanged.docs/users-guide.mdis unchanged because this work adds maintainer-facing proof coverage and internal documentation only; it does not change CLI behaviour, output formats, or user-facing workflows.References
Lody session: https://lody.ai/leynos/sessions/9a0b5dd8-98b1-49cc-88e1-01ed3a3ad10b