Verify bounded MinHasher sketch invariants (7.2.7) by leynos · Pull Request #230 · leynos/whitaker

leynos · 2026-05-18T15:45:52Z

Summary

This branch implements roadmap task (7.2.7): bounded Kani verification of MinHasher::sketch invariants. It verifies empty-input failure, deterministic output, and duplicate-hash insensitivity while preserving the existing public API.

Roadmap task: (7.2.7)
Execplan: docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Review walkthrough

Start with crates/whitaker_clones_core/src/index/kani.rs for the new MinHasher::sketch harnesses.
Review crates/whitaker_clones_core/src/index/minhash.rs for the vector-backed set semantics and the private cfg(kani) proof seam.
Check docs/whitaker-clone-detector-design.md and docs/developers-guide.md for the recorded implementation decisions and maintainer-facing proof guidance.
Finish with docs/roadmap.md and the execplan to confirm item 7.2.7 is marked complete and the validation notes are current.

Validation

make kani-clone-detector: passed.
make check-fmt: passed.
make lint: passed.
make test: passed, 1400 tests passed and 2 skipped under the default nextest profile.
make markdownlint: passed.
make nixie: passed.
coderabbit review --agent: passed after the Kani proof milestone with zero findings.
coderabbit review --agent: passed after the documentation and roadmap milestone with zero findings.

Notes

The Kani harnesses call real MinHasher::sketch. The proof uses a private cfg(kani) seed fixture and fixed-width signature builder to avoid spending the proof bound on seed-stream array construction while keeping the production public API unchanged.

docs/users-guide.md is unchanged because this work adds maintainer-facing proof coverage and internal documentation only; it does not change CLI behaviour, output formats, or user-facing workflows.

References

Lody session: https://lody.ai/leynos/sessions/9a0b5dd8-98b1-49cc-88e1-01ed3a3ad10b

Add the pre-implementation ExecPlan for roadmap item 7.2.7. The plan captures the bounded Kani proof scope for `MinHasher::sketch`, the ordinary unit and behaviour test expectations, documentation updates, validation gates, and the approval boundary before implementation starts.

sourcery-ai

Sorry @leynos, you have reached your weekly rate limit of 2500000 diff characters.

Please try again later or upgrade to continue using Sourcery

coderabbitai · 2026-05-18T15:46:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: cecfe9d1-ceca-4c49-8b29-627db371b1aa

📥 Commits

Reviewing files that changed from the base of the PR and between 640dbef and 368e4ac.

📒 Files selected for processing (1)

docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Summary

This PR implements roadmap task 7.2.7: bounded Kani verification of MinHasher::sketch invariants, adding formal verification harnesses that validate three critical properties whilst preserving the public API and production code paths.

Key changes

Verification harnesses (crates/whitaker_clones_core/src/index/kani.rs)

Added three #[kani::proof] harnesses:

verify_min_hasher_sketch_rejects_empty_input: asserts that sketching from an empty fingerprint set fails with IndexError::EmptyFingerprintSet
verify_min_hasher_sketch_is_deterministic: verifies deterministic output by sketching fixed fingerprints twice and comparing results in a symbolically chosen signature lane
verify_min_hasher_sketch_ignores_duplicate_hashes: confirms that duplicate fingerprint hashes do not affect the first signature lane

Core implementation updates (crates/whitaker_clones_core/src/index/minhash.rs)

Refactored MinHasher::sketch to delegate to a new sketch_values helper that computes [u64; MINHASH_SIZE] from deduplicated hashes
Changed fingerprint hash deduplication from BTreeSet to Vec<u64> with stable sorting and in-place deduplication, preserving set semantics without relying on verifier-hostile data structures
sketch_values uses conditional compilation: production code uses array::from_fn; Kani builds use an explicit 128-element literal expansion to avoid proof-bound expenditure on loop/iterator state exploration
Added #[cfg(kani)]-only helper MinHasher::from_seed_for_kani(seed: u64) to construct test fixtures with uniform seeds, bypassing seed-stream generation costs in proofs

Test coverage enhancements

Refactored unit tests in crates/whitaker_clones_core/src/index/tests.rs to use rstest parameterised cases
Added BDD scenario in crates/whitaker_clones_core/tests/features/min_hash_lsh.feature verifying that duplicate retained hashes use set semantics during candidate-pair generation
Extended behaviour-driven test suite with scenario_duplicate_hashes_use_set_semantics

Documentation and planning

Created new ExecPlan document (docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md) detailing bounded verification scope, operational constraints, tolerances, risk assessment, and implementation milestones
Updated docs/whitaker-clone-detector-design.md with implementation decisions: vector-based hash collection, cfg(kani) seed-constructor seam, and proof structure for three separate invariant harnesses
Extended docs/developers-guide.md proof-workflow section to describe invariant coverage split across unit tests, BDD harnesses, and Kani harnesses
Marked task 7.2.7 as complete in docs/roadmap.md
Updated scripts/run-kani.sh to execute new verify_min_hasher_sketch_* harnesses alongside existing LSH verification

Validation

make kani-clone-detector, make check-fmt, make lint, make test (1,400 tests passed, 2 skipped) all successful
Kani harnesses use unwind bound 4 via the cfg(kani) proof seam, avoiding state-space explosion
CodeRabbit automated review passed; Large Method finding addressed with documented rationale for intentional proof-seam unrolling (mechanically generated, only in cfg(kani) builds, replacement would render proof intractable)
Production public API and user-facing documentation unchanged

Walkthrough

This PR completes roadmap item 7.2.7 by refactoring MinHasher::sketch to use deterministic Vec-based set semantics, adding three bounded Kani harnesses proving empty-input failure, determinism, and duplicate-hash invariance, strengthening unit and BDD tests, and documenting the work through design updates and comprehensive execplan.

Changes

MinHasher Kani Verification

Layer / File(s)	Summary
MinHasher implementation refactoring `crates/whitaker_clones_core/src/index/minhash.rs`	`MinHasher::sketch` now calls `sketch_values` helper which collapses fingerprint hashes into a Vec, sorts with `sort_unstable`, and deduplicates before computing per-seed minima. Kani-only `from_seed_for_kani` fixture constructor builds a MinHasher with all seeds set to a single value. The non-Kani path uses `array::from_fn`; the Kani path explicitly enumerates `minimum_mixed_hash` calls for each seed index.
Kani harnesses for MinHasher::sketch `crates/whitaker_clones_core/src/index/kani.rs`	Three new `#[kani::proof]` harnesses verify: empty fingerprint input fails with `IndexError::EmptyFingerprintSet`; sketching identical inputs twice produces identical signatures at a symbolically chosen lane; sketches from unique vs duplicated hashes match at the first lane. Support helpers build deterministic `Fingerprint` ranges and select bounded symbolic lane indices.
Unit and BDD test strengthening `crates/whitaker_clones_core/src/index/tests.rs`, `crates/whitaker_clones_core/tests/features/min_hash_lsh.feature`, `crates/whitaker_clones_core/tests/min_hash_lsh_behaviour.rs`	MinHash unit tests are converted to `#[rstest]` parameterised cases for empty inputs, determinism across instances, and duplicate hash invariance with multiple input combinations. New Gherkin scenario "Duplicate retained hashes use set semantics" with corresponding behaviour test verifies duplicate hash handling produces canonical pairs.
Tooling and documentation updates `scripts/run-kani.sh`, `docs/whitaker-clone-detector-design.md`, `docs/developers-guide.md`	Script extended to run new `verify_min_hasher_sketch_*` harnesses. Design doc adds implementation-decisions section (7.2.7) detailing Vec sort+dedup mechanics and Kani proof structure. Developer guide expanded with MinHasher sketch coverage distribution across tests, BDD harness, and Kani harnesses, and updated Kani execution details including bounded symbolic domains.
Execplan documentation `docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md`	Comprehensive project document recording scope, constraints, tolerances, identified risks (state-space explosion, helper model proving, edge case semantics), dated progress log with completed milestones, discoveries made during proof shaping, decision log capturing harness/unwind/seam strategy choices, implementation guidance with acceptance criteria, and rollback plan.
Roadmap checkpoint `docs/roadmap.md`	Task 7.2.7 marked complete by updating checkbox from unchecked to checked.

Suggested labels

Roadmap

Suggested reviewers

codescene-delta-analysis

Poem

🔍 MinHash finds its proof at last,
Empty, duplicates, deterministic—three harnesses cast.
Vec replaces BTreeSet for Kani's sharp eye,
Set semantics secured, no stray hashes fly.
✨

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Module-Level Documentation	❌ Error	minhash.rs has only a one-line docstring lacking explanation of its cfg(kani) seam, Vec-based hash deduplication, and integration with the clone-detector pipeline required by the check.	Expand the minhash.rs module docstring to document the cfg(kani) seam purpose, proof bounds optimisation, Vec-based deduplication strategy, and the module's role in clone-detector pipeline.
Testing (Unit And Behavioural)	⚠️ Warning	The `unique_hashes` helper function lacks dedicated unit tests verifying its local behaviour and edge cases in isolation.	Add unit tests for `unique_hashes` covering: empty input, single element, duplicates, unsorted input, all equal elements, and ordering preservation.
Testing (Overall)	❓ Inconclusive	No result was produced after verification. Marking as INCONCLUSIVE.	Re-run the check or adjust instructions to produce a final result.

✅ Passed checks (15 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly references the roadmap task (7.2.7) and accurately summarises the main change: implementing bounded Kani verification of MinHasher sketch invariants.
Description check	✅ Passed	The description comprehensively relates to the changeset, covering the roadmap task, implementation approach, validation results, and key design decisions with references to modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
User-Facing Documentation	✅ Passed	PR adds Kani verification harnesses and refactors MinHasher without user-facing changes. Users-guide.md documents Whitaker lints, not clone-detector internals, so it correctly remains unchanged.
Developer Documentation	✅ Passed	Item 7.2.7 marked complete in roadmap. Execplan document created. Design document updated with implementation decisions. Developers guide extended with Kani proof documentation. Run script updated.
Testing (Property / Proof)	✅ Passed	Kani harnesses verify three substantive MinHasher::sketch invariants (empty-input rejection, determinism, duplicate-hash insensitivity) with proper symbolic constraints and real production code.
Testing (Compile-Time / Ui)	✅ Passed	The PR makes no changes that warrant compile-time UI tests or snapshot tests: no trybuild scenarios, code generation, or diagnostic output to snapshot.
Unit Architecture	✅ Passed	MinHasher::sketch remains pure query; dependencies injected at boundaries; cfg(kani) seam isolated and documented; fallibility explicit in API; tests exercise real functional boundaries.
Domain Architecture	✅ Passed	Domain logic segregated from infrastructure. MinHasher sketch operations use domain language only. Proof seams cfg-gated and pub(super). No infrastructure leakage into core code.
Observability	✅ Passed	PR introduces no operational behaviour changes: public API unchanged, Kani harnesses are verification-only code, internal refactoring is isolated with no new failure modes.
Security And Privacy	✅ Passed	The PR adds Kani harnesses with no secrets, unsafe code, injection vectors, or sensitive data. Proof-only code is cfg(kani)-scoped; public API unchanged. Test data is synthetic.
Performance And Resource Use	✅ Passed	No algorithmic regressions detected. Vec-based deduplication is O(n log n) with lower overhead than BTreeSet; no unbounded allocations, cloning, or blocking on hot paths.
Concurrency And State	✅ Passed	PR introduces only pure, deterministic functions with no async, locks, shared state, ordering constraints, or parallelism; concurrency check is not applicable.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Record implementation approval and restore executable mode on the Kani runner so the Makefile proof target can run before the MinHasher harness work begins.

Add parameterized unit coverage for deterministic sketches, duplicate hash insensitivity, empty inputs, and reordered set semantics. Extend the BDD candidate-generation feature so duplicate retained hashes are observable through the public LSH workflow before adding the Kani harnesses.

Add the planned MinHasher sketch harnesses and register them with the clone-detector Kani runner. Record the current blocker in the ExecPlan: verification of the real production path needs a higher unwind bound or an approved proof seam because `MinHasher::new` builds the fixed 128-slot seed array.

Add Kani harnesses that exercise real `MinHasher::sketch` for empty input rejection, deterministic output, and duplicate-hash set semantics. Keep the proof tractable with a private `cfg(kani)` seed fixture and explicit fixed-width signature builder.

Record the bounded Kani proof shape for `MinHasher::sketch` in the clone-detector design and developer guide. Mark roadmap item 7.2.7 complete and close the execplan with the validation and CodeRabbit results.

leynos · 2026-05-19T23:14:21Z

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response. Ensure that this is validated against the current version of the codegraph.

If further refinement to address this finding would be deleterious, please supply a clear explanatory one to two paragraph markdown message I can paste into the CodeScene web ui's diagnostic suppression function so this diagnostic can be silenced.

crates/whitaker_clones_core/src/index/minhash.rs

Comment on lines +76 to +206

fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
    [
        minimum_mixed_hash(seeds[0], unique_hashes),
        minimum_mixed_hash(seeds[1], unique_hashes),
        minimum_mixed_hash(seeds[2], unique_hashes),
        minimum_mixed_hash(seeds[3], unique_hashes),
        minimum_mixed_hash(seeds[4], unique_hashes),
        minimum_mixed_hash(seeds[5], unique_hashes),
        minimum_mixed_hash(seeds[6], unique_hashes),
        minimum_mixed_hash(seeds[7], unique_hashes),
        minimum_mixed_hash(seeds[8], unique_hashes),
        minimum_mixed_hash(seeds[9], unique_hashes),
        minimum_mixed_hash(seeds[10], unique_hashes),
        minimum_mixed_hash(seeds[11], unique_hashes),
        minimum_mixed_hash(seeds[12], unique_hashes),
        minimum_mixed_hash(seeds[13], unique_hashes),
        minimum_mixed_hash(seeds[14], unique_hashes),
        minimum_mixed_hash(seeds[15], unique_hashes),
        minimum_mixed_hash(seeds[16], unique_hashes),
        minimum_mixed_hash(seeds[17], unique_hashes),
        minimum_mixed_hash(seeds[18], unique_hashes),
        minimum_mixed_hash(seeds[19], unique_hashes),
        minimum_mixed_hash(seeds[20], unique_hashes),
        minimum_mixed_hash(seeds[21], unique_hashes),
        minimum_mixed_hash(seeds[22], unique_hashes),
        minimum_mixed_hash(seeds[23], unique_hashes),
        minimum_mixed_hash(seeds[24], unique_hashes),
        minimum_mixed_hash(seeds[25], unique_hashes),
        minimum_mixed_hash(seeds[26], unique_hashes),
        minimum_mixed_hash(seeds[27], unique_hashes),
        minimum_mixed_hash(seeds[28], unique_hashes),
        minimum_mixed_hash(seeds[29], unique_hashes),
        minimum_mixed_hash(seeds[30], unique_hashes),
        minimum_mixed_hash(seeds[31], unique_hashes),
        minimum_mixed_hash(seeds[32], unique_hashes),
        minimum_mixed_hash(seeds[33], unique_hashes),
        minimum_mixed_hash(seeds[34], unique_hashes),
        minimum_mixed_hash(seeds[35], unique_hashes),
        minimum_mixed_hash(seeds[36], unique_hashes),
        minimum_mixed_hash(seeds[37], unique_hashes),
        minimum_mixed_hash(seeds[38], unique_hashes),
        minimum_mixed_hash(seeds[39], unique_hashes),
        minimum_mixed_hash(seeds[40], unique_hashes),
        minimum_mixed_hash(seeds[41], unique_hashes),
        minimum_mixed_hash(seeds[42], unique_hashes),
        minimum_mixed_hash(seeds[43], unique_hashes),
        minimum_mixed_hash(seeds[44], unique_hashes),
        minimum_mixed_hash(seeds[45], unique_hashes),
        minimum_mixed_hash(seeds[46], unique_hashes),
        minimum_mixed_hash(seeds[47], unique_hashes),
        minimum_mixed_hash(seeds[48], unique_hashes),
        minimum_mixed_hash(seeds[49], unique_hashes),
        minimum_mixed_hash(seeds[50], unique_hashes),
        minimum_mixed_hash(seeds[51], unique_hashes),
        minimum_mixed_hash(seeds[52], unique_hashes),
        minimum_mixed_hash(seeds[53], unique_hashes),
        minimum_mixed_hash(seeds[54], unique_hashes),
        minimum_mixed_hash(seeds[55], unique_hashes),
        minimum_mixed_hash(seeds[56], unique_hashes),
        minimum_mixed_hash(seeds[57], unique_hashes),
        minimum_mixed_hash(seeds[58], unique_hashes),
        minimum_mixed_hash(seeds[59], unique_hashes),
        minimum_mixed_hash(seeds[60], unique_hashes),
        minimum_mixed_hash(seeds[61], unique_hashes),
        minimum_mixed_hash(seeds[62], unique_hashes),
        minimum_mixed_hash(seeds[63], unique_hashes),
        minimum_mixed_hash(seeds[64], unique_hashes),
        minimum_mixed_hash(seeds[65], unique_hashes),
        minimum_mixed_hash(seeds[66], unique_hashes),
        minimum_mixed_hash(seeds[67], unique_hashes),
        minimum_mixed_hash(seeds[68], unique_hashes),
        minimum_mixed_hash(seeds[69], unique_hashes),
        minimum_mixed_hash(seeds[70], unique_hashes),
        minimum_mixed_hash(seeds[71], unique_hashes),
        minimum_mixed_hash(seeds[72], unique_hashes),
        minimum_mixed_hash(seeds[73], unique_hashes),
        minimum_mixed_hash(seeds[74], unique_hashes),
        minimum_mixed_hash(seeds[75], unique_hashes),
        minimum_mixed_hash(seeds[76], unique_hashes),
        minimum_mixed_hash(seeds[77], unique_hashes),
        minimum_mixed_hash(seeds[78], unique_hashes),
        minimum_mixed_hash(seeds[79], unique_hashes),
        minimum_mixed_hash(seeds[80], unique_hashes),
        minimum_mixed_hash(seeds[81], unique_hashes),
        minimum_mixed_hash(seeds[82], unique_hashes),
        minimum_mixed_hash(seeds[83], unique_hashes),
        minimum_mixed_hash(seeds[84], unique_hashes),
        minimum_mixed_hash(seeds[85], unique_hashes),
        minimum_mixed_hash(seeds[86], unique_hashes),
        minimum_mixed_hash(seeds[87], unique_hashes),
        minimum_mixed_hash(seeds[88], unique_hashes),
        minimum_mixed_hash(seeds[89], unique_hashes),
        minimum_mixed_hash(seeds[90], unique_hashes),
        minimum_mixed_hash(seeds[91], unique_hashes),
        minimum_mixed_hash(seeds[92], unique_hashes),
        minimum_mixed_hash(seeds[93], unique_hashes),
        minimum_mixed_hash(seeds[94], unique_hashes),
        minimum_mixed_hash(seeds[95], unique_hashes),
        minimum_mixed_hash(seeds[96], unique_hashes),
        minimum_mixed_hash(seeds[97], unique_hashes),
        minimum_mixed_hash(seeds[98], unique_hashes),
        minimum_mixed_hash(seeds[99], unique_hashes),
        minimum_mixed_hash(seeds[100], unique_hashes),
        minimum_mixed_hash(seeds[101], unique_hashes),
        minimum_mixed_hash(seeds[102], unique_hashes),
        minimum_mixed_hash(seeds[103], unique_hashes),
        minimum_mixed_hash(seeds[104], unique_hashes),
        minimum_mixed_hash(seeds[105], unique_hashes),
        minimum_mixed_hash(seeds[106], unique_hashes),
        minimum_mixed_hash(seeds[107], unique_hashes),
        minimum_mixed_hash(seeds[108], unique_hashes),
        minimum_mixed_hash(seeds[109], unique_hashes),
        minimum_mixed_hash(seeds[110], unique_hashes),
        minimum_mixed_hash(seeds[111], unique_hashes),
        minimum_mixed_hash(seeds[112], unique_hashes),
        minimum_mixed_hash(seeds[113], unique_hashes),
        minimum_mixed_hash(seeds[114], unique_hashes),
        minimum_mixed_hash(seeds[115], unique_hashes),
        minimum_mixed_hash(seeds[116], unique_hashes),
        minimum_mixed_hash(seeds[117], unique_hashes),
        minimum_mixed_hash(seeds[118], unique_hashes),
        minimum_mixed_hash(seeds[119], unique_hashes),
        minimum_mixed_hash(seeds[120], unique_hashes),
        minimum_mixed_hash(seeds[121], unique_hashes),
        minimum_mixed_hash(seeds[122], unique_hashes),
        minimum_mixed_hash(seeds[123], unique_hashes),
        minimum_mixed_hash(seeds[124], unique_hashes),
        minimum_mixed_hash(seeds[125], unique_hashes),
        minimum_mixed_hash(seeds[126], unique_hashes),
        minimum_mixed_hash(seeds[127], unique_hashes),
    ]

❌ New issue: Large Method
sketch_values has 132 lines, threshold = 70

sourcery-ai

Sorry @leynos, you have reached your weekly rate limit of 2500000 diff characters.

Please try again later or upgrade to continue using Sourcery

Replace an unnecessary `rstest` marker with `test` and refresh the execplan risk and orientation text to match the shipped proof seam.

Update the execplan milestone text to reference Vec-backed dedup and the approved unwind 129 proof-bound resolution.

Assert the intermediate sorted and deduplicated hash vector in the existing MinHasher set-semantics tests so removing `dedup` fails.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/whitaker_clones_core/src/index/minhash.rs (2)
70-73: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Document the cfg-specific sketch_values implementations.

Both variants of sketch_values lack doc comments explaining the conditional compilation split. The unrolled #[cfg(kani)] version (lines 75-207) is a deliberate proof seam that prevents Kani's bounded model checker from spending proof budget on iterator state expansion, whilst the production version (lines 70-73) uses the idiomatic array::from_fn. Add doc comments on both variants explaining:

Why two implementations exist (proof tractability versus production idiom)

That the cfg(kani) version is mechanically equivalent but explicitly unrolled

That both compute the same 128-lane MinHash signature via minimum_mixed_hash
📝 Suggested documentation
+/// Computes the 128-lane MinHash signature from seeds and unique hashes.
+///
+/// Production builds use [`array::from_fn`] for idiomatic iteration.
 #[cfg(not(kani))]
 fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
     array::from_fn(|index| minimum_mixed_hash(seeds[index], unique_hashes))
 }

+/// Computes the 128-lane MinHash signature with explicit unrolling for Kani.
+///
+/// This cfg(kani) variant manually unrolls all 128 calls to
+/// [`minimum_mixed_hash`] so Kani's bounded model checker does not spend proof
+/// budget on loop or iterator state expansion. The explicit array literal keeps
+/// the unwind bound tractable whilst still verifying the real signature
+/// construction. This implementation is mechanically equivalent to the
+/// production [`array::from_fn`] version.
 #[cfg(kani)]
 fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
As per coding guidelines: "Document public Rust APIs using Rustdoc comments (///) so documentation can be generated with cargo doc", and ensure unusual patterns are explained for maintainability.

Also applies to: 75-207
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 70 - 73, Add
Rustdoc comments to both cfg-specific implementations of sketch_values
explaining that there are two implementations: the #[cfg(kani)] unrolled variant
exists as a proof seam to avoid Kani's iterator/state expansion and is
mechanically equivalent to the production implementation, while the
#[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state
that both compute the same 128-lane MinHash signature by calling
minimum_mixed_hash for each lane. Place these /// comments directly above each
sketch_values definition and mention the equivalence and intent (proof
tractability vs production idiom) and that both call minimum_mixed_hash to
produce the MinHash signature.
60-68: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Expand the doc comment to explain the proof-specific purpose.

The current comment describes what the function does but not why it exists. Explain that this fixture avoids seed-stream construction to reduce Kani's symbolic state space, referencing the #[cfg(kani)] proof seam pattern.
📝 Suggested expansion
-    /// Creates a deterministic proof fixture without exercising seed-stream
-    /// construction.
+    /// Creates a deterministic proof fixture for Kani harnesses.
+    ///
+    /// This constructor bypasses seed-stream generation to reduce symbolic
+    /// state-space growth in bounded model checking. Production code continues
+    /// to use [`MinHasher::new`] and the full seed sequence.
     #[cfg(kani)]
     pub(super) fn from_seed_for_kani(seed: u64) -> Self {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 60 - 68, The
doc comment for pub(super) fn from_seed_for_kani(seed: u64) currently describes
behavior but not intent; update the comment to state this is a proof-only
fixture guarded by #[cfg(kani)] that fills seeds with the same value to avoid
constructing the seed-stream and thus drastically reduce Kani's symbolic state
space as a proof seam; mention MINHASH_SIZE and that this pattern is used to
provide deterministic, lightweight inputs for Kani proofs rather than production
logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/whitaker_clones_core/src/index/minhash.rs`:
- Around line 70-73: Add Rustdoc comments to both cfg-specific implementations
of sketch_values explaining that there are two implementations: the #[cfg(kani)]
unrolled variant exists as a proof seam to avoid Kani's iterator/state expansion
and is mechanically equivalent to the production implementation, while the
#[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state
that both compute the same 128-lane MinHash signature by calling
minimum_mixed_hash for each lane. Place these /// comments directly above each
sketch_values definition and mention the equivalence and intent (proof
tractability vs production idiom) and that both call minimum_mixed_hash to
produce the MinHash signature.
- Around line 60-68: The doc comment for pub(super) fn from_seed_for_kani(seed:
u64) currently describes behavior but not intent; update the comment to state
this is a proof-only fixture guarded by #[cfg(kani)] that fills seeds with the
same value to avoid constructing the seed-stream and thus drastically reduce
Kani's symbolic state space as a proof seam; mention MINHASH_SIZE and that this
pattern is used to provide deterministic, lightweight inputs for Kani proofs
rather than production logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a0ce00a0-0aaa-46e8-b6b1-42559315a8fd

📥 Commits

Reviewing files that changed from the base of the PR and between 4630d3d and 640dbef.

📒 Files selected for processing (3)

crates/whitaker_clones_core/src/index/minhash.rs
crates/whitaker_clones_core/src/index/tests.rs
docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Update the ExecPlan milestone wording to distinguish the recorded unwind 129 investigation approval from the shipped MinHasher harnesses, which use unwind 4 through the proof seam.

sourcery-ai Bot reviewed May 18, 2026

View reviewed changes