Skip to content

Verify bounded MinHasher sketch invariants (7.2.7)#230

Open
leynos wants to merge 10 commits into
mainfrom
7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants
Open

Verify bounded MinHasher sketch invariants (7.2.7)#230
leynos wants to merge 10 commits into
mainfrom
7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants

Conversation

@leynos
Copy link
Copy Markdown
Owner

@leynos leynos commented May 18, 2026

Summary

This branch implements roadmap task (7.2.7): bounded Kani verification of MinHasher::sketch invariants. It verifies empty-input failure, deterministic output, and duplicate-hash insensitivity while preserving the existing public API.

Roadmap task: (7.2.7)
Execplan: docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Review walkthrough

Validation

  • make kani-clone-detector: passed.
  • make check-fmt: passed.
  • make lint: passed.
  • make test: passed, 1400 tests passed and 2 skipped under the default nextest profile.
  • make markdownlint: passed.
  • make nixie: passed.
  • coderabbit review --agent: passed after the Kani proof milestone with zero findings.
  • coderabbit review --agent: passed after the documentation and roadmap milestone with zero findings.

Notes

The Kani harnesses call real MinHasher::sketch. The proof uses a private cfg(kani) seed fixture and fixed-width signature builder to avoid spending the proof bound on seed-stream array construction while keeping the production public API unchanged.

docs/users-guide.md is unchanged because this work adds maintainer-facing proof coverage and internal documentation only; it does not change CLI behaviour, output formats, or user-facing workflows.

References

Lody session: https://lody.ai/leynos/sessions/9a0b5dd8-98b1-49cc-88e1-01ed3a3ad10b

Add the pre-implementation ExecPlan for roadmap item 7.2.7. The plan
captures the bounded Kani proof scope for `MinHasher::sketch`, the ordinary
unit and behaviour test expectations, documentation updates, validation gates,
and the approval boundary before implementation starts.
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @leynos, you have reached your weekly rate limit of 2500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: cecfe9d1-ceca-4c49-8b29-627db371b1aa

📥 Commits

Reviewing files that changed from the base of the PR and between 640dbef and 368e4ac.

📒 Files selected for processing (1)
  • docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Summary

This PR implements roadmap task 7.2.7: bounded Kani verification of MinHasher::sketch invariants, adding formal verification harnesses that validate three critical properties whilst preserving the public API and production code paths.

Key changes

Verification harnesses (crates/whitaker_clones_core/src/index/kani.rs)

Added three #[kani::proof] harnesses:

  • verify_min_hasher_sketch_rejects_empty_input: asserts that sketching from an empty fingerprint set fails with IndexError::EmptyFingerprintSet
  • verify_min_hasher_sketch_is_deterministic: verifies deterministic output by sketching fixed fingerprints twice and comparing results in a symbolically chosen signature lane
  • verify_min_hasher_sketch_ignores_duplicate_hashes: confirms that duplicate fingerprint hashes do not affect the first signature lane

Core implementation updates (crates/whitaker_clones_core/src/index/minhash.rs)

  • Refactored MinHasher::sketch to delegate to a new sketch_values helper that computes [u64; MINHASH_SIZE] from deduplicated hashes
  • Changed fingerprint hash deduplication from BTreeSet to Vec<u64> with stable sorting and in-place deduplication, preserving set semantics without relying on verifier-hostile data structures
  • sketch_values uses conditional compilation: production code uses array::from_fn; Kani builds use an explicit 128-element literal expansion to avoid proof-bound expenditure on loop/iterator state exploration
  • Added #[cfg(kani)]-only helper MinHasher::from_seed_for_kani(seed: u64) to construct test fixtures with uniform seeds, bypassing seed-stream generation costs in proofs

Test coverage enhancements

  • Refactored unit tests in crates/whitaker_clones_core/src/index/tests.rs to use rstest parameterised cases
  • Added BDD scenario in crates/whitaker_clones_core/tests/features/min_hash_lsh.feature verifying that duplicate retained hashes use set semantics during candidate-pair generation
  • Extended behaviour-driven test suite with scenario_duplicate_hashes_use_set_semantics

Documentation and planning

  • Created new ExecPlan document (docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md) detailing bounded verification scope, operational constraints, tolerances, risk assessment, and implementation milestones
  • Updated docs/whitaker-clone-detector-design.md with implementation decisions: vector-based hash collection, cfg(kani) seed-constructor seam, and proof structure for three separate invariant harnesses
  • Extended docs/developers-guide.md proof-workflow section to describe invariant coverage split across unit tests, BDD harnesses, and Kani harnesses
  • Marked task 7.2.7 as complete in docs/roadmap.md
  • Updated scripts/run-kani.sh to execute new verify_min_hasher_sketch_* harnesses alongside existing LSH verification

Validation

  • make kani-clone-detector, make check-fmt, make lint, make test (1,400 tests passed, 2 skipped) all successful
  • Kani harnesses use unwind bound 4 via the cfg(kani) proof seam, avoiding state-space explosion
  • CodeRabbit automated review passed; Large Method finding addressed with documented rationale for intentional proof-seam unrolling (mechanically generated, only in cfg(kani) builds, replacement would render proof intractable)
  • Production public API and user-facing documentation unchanged

Walkthrough

This PR completes roadmap item 7.2.7 by refactoring MinHasher::sketch to use deterministic Vec-based set semantics, adding three bounded Kani harnesses proving empty-input failure, determinism, and duplicate-hash invariance, strengthening unit and BDD tests, and documenting the work through design updates and comprehensive execplan.

Changes

MinHasher Kani Verification

Layer / File(s) Summary
MinHasher implementation refactoring
crates/whitaker_clones_core/src/index/minhash.rs
MinHasher::sketch now calls sketch_values helper which collapses fingerprint hashes into a Vec, sorts with sort_unstable, and deduplicates before computing per-seed minima. Kani-only from_seed_for_kani fixture constructor builds a MinHasher with all seeds set to a single value. The non-Kani path uses array::from_fn; the Kani path explicitly enumerates minimum_mixed_hash calls for each seed index.
Kani harnesses for MinHasher::sketch
crates/whitaker_clones_core/src/index/kani.rs
Three new #[kani::proof] harnesses verify: empty fingerprint input fails with IndexError::EmptyFingerprintSet; sketching identical inputs twice produces identical signatures at a symbolically chosen lane; sketches from unique vs duplicated hashes match at the first lane. Support helpers build deterministic Fingerprint ranges and select bounded symbolic lane indices.
Unit and BDD test strengthening
crates/whitaker_clones_core/src/index/tests.rs, crates/whitaker_clones_core/tests/features/min_hash_lsh.feature, crates/whitaker_clones_core/tests/min_hash_lsh_behaviour.rs
MinHash unit tests are converted to #[rstest] parameterised cases for empty inputs, determinism across instances, and duplicate hash invariance with multiple input combinations. New Gherkin scenario "Duplicate retained hashes use set semantics" with corresponding behaviour test verifies duplicate hash handling produces canonical pairs.
Tooling and documentation updates
scripts/run-kani.sh, docs/whitaker-clone-detector-design.md, docs/developers-guide.md
Script extended to run new verify_min_hasher_sketch_* harnesses. Design doc adds implementation-decisions section (7.2.7) detailing Vec sort+dedup mechanics and Kani proof structure. Developer guide expanded with MinHasher sketch coverage distribution across tests, BDD harness, and Kani harnesses, and updated Kani execution details including bounded symbolic domains.
Execplan documentation
docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md
Comprehensive project document recording scope, constraints, tolerances, identified risks (state-space explosion, helper model proving, edge case semantics), dated progress log with completed milestones, discoveries made during proof shaping, decision log capturing harness/unwind/seam strategy choices, implementation guidance with acceptance criteria, and rollback plan.
Roadmap checkpoint
docs/roadmap.md
Task 7.2.7 marked complete by updating checkbox from unchecked to checked.

Suggested labels

Roadmap

Suggested reviewers

  • codescene-delta-analysis

Poem

🔍 MinHash finds its proof at last,
Empty, duplicates, deterministic—three harnesses cast.
Vec replaces BTreeSet for Kani's sharp eye,
Set semantics secured, no stray hashes fly.


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Module-Level Documentation ❌ Error minhash.rs has only a one-line docstring lacking explanation of its cfg(kani) seam, Vec-based hash deduplication, and integration with the clone-detector pipeline required by the check. Expand the minhash.rs module docstring to document the cfg(kani) seam purpose, proof bounds optimisation, Vec-based deduplication strategy, and the module's role in clone-detector pipeline.
Testing (Unit And Behavioural) ⚠️ Warning The unique_hashes helper function lacks dedicated unit tests verifying its local behaviour and edge cases in isolation. Add unit tests for unique_hashes covering: empty input, single element, duplicates, unsorted input, all equal elements, and ordering preservation.
Testing (Overall) ❓ Inconclusive No result was produced after verification. Marking as INCONCLUSIVE. Re-run the check or adjust instructions to produce a final result.
✅ Passed checks (15 passed)
Check name Status Explanation
Title check ✅ Passed The title directly references the roadmap task (7.2.7) and accurately summarises the main change: implementing bounded Kani verification of MinHasher sketch invariants.
Description check ✅ Passed The description comprehensively relates to the changeset, covering the roadmap task, implementation approach, validation results, and key design decisions with references to modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
User-Facing Documentation ✅ Passed PR adds Kani verification harnesses and refactors MinHasher without user-facing changes. Users-guide.md documents Whitaker lints, not clone-detector internals, so it correctly remains unchanged.
Developer Documentation ✅ Passed Item 7.2.7 marked complete in roadmap. Execplan document created. Design document updated with implementation decisions. Developers guide extended with Kani proof documentation. Run script updated.
Testing (Property / Proof) ✅ Passed Kani harnesses verify three substantive MinHasher::sketch invariants (empty-input rejection, determinism, duplicate-hash insensitivity) with proper symbolic constraints and real production code.
Testing (Compile-Time / Ui) ✅ Passed The PR makes no changes that warrant compile-time UI tests or snapshot tests: no trybuild scenarios, code generation, or diagnostic output to snapshot.
Unit Architecture ✅ Passed MinHasher::sketch remains pure query; dependencies injected at boundaries; cfg(kani) seam isolated and documented; fallibility explicit in API; tests exercise real functional boundaries.
Domain Architecture ✅ Passed Domain logic segregated from infrastructure. MinHasher sketch operations use domain language only. Proof seams cfg-gated and pub(super). No infrastructure leakage into core code.
Observability ✅ Passed PR introduces no operational behaviour changes: public API unchanged, Kani harnesses are verification-only code, internal refactoring is isolated with no new failure modes.
Security And Privacy ✅ Passed The PR adds Kani harnesses with no secrets, unsafe code, injection vectors, or sensitive data. Proof-only code is cfg(kani)-scoped; public API unchanged. Test data is synthetic.
Performance And Resource Use ✅ Passed No algorithmic regressions detected. Vec-based deduplication is O(n log n) with lower overhead than BTreeSet; no unbounded allocations, cloning, or blocking on hot paths.
Concurrency And State ✅ Passed PR introduces only pure, deterministic functions with no async, locks, shared state, ordering constraints, or parallelism; concurrency check is not applicable.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants

Comment @coderabbitai help to get the list of available commands and usage tips.

codescene-delta-analysis[bot]

This comment was marked as outdated.

leynos added 3 commits May 18, 2026 20:27
Record implementation approval and restore executable mode on the Kani runner so the Makefile proof target can run before the MinHasher harness work begins.
Add parameterized unit coverage for deterministic sketches, duplicate hash
insensitivity, empty inputs, and reordered set semantics. Extend the BDD
candidate-generation feature so duplicate retained hashes are observable
through the public LSH workflow before adding the Kani harnesses.
Add the planned MinHasher sketch harnesses and register them with the
clone-detector Kani runner. Record the current blocker in the ExecPlan:
verification of the real production path needs a higher unwind bound or an
approved proof seam because `MinHasher::new` builds the fixed 128-slot seed
array.
codescene-delta-analysis[bot]

This comment was marked as outdated.

leynos added 2 commits May 19, 2026 22:22
Add Kani harnesses that exercise real `MinHasher::sketch` for empty
input rejection, deterministic output, and duplicate-hash set
semantics. Keep the proof tractable with a private `cfg(kani)` seed
fixture and explicit fixed-width signature builder.
Record the bounded Kani proof shape for `MinHasher::sketch` in the
clone-detector design and developer guide. Mark roadmap item 7.2.7
complete and close the execplan with the validation and CodeRabbit
results.
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented May 19, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response. Ensure that this is validated against the current version of the codegraph.

If further refinement to address this finding would be deleterious, please supply a clear explanatory one to two paragraph markdown message I can paste into the CodeScene web ui's diagnostic suppression function so this diagnostic can be silenced.

crates/whitaker_clones_core/src/index/minhash.rs

Comment on lines +76 to +206

fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
    [
        minimum_mixed_hash(seeds[0], unique_hashes),
        minimum_mixed_hash(seeds[1], unique_hashes),
        minimum_mixed_hash(seeds[2], unique_hashes),
        minimum_mixed_hash(seeds[3], unique_hashes),
        minimum_mixed_hash(seeds[4], unique_hashes),
        minimum_mixed_hash(seeds[5], unique_hashes),
        minimum_mixed_hash(seeds[6], unique_hashes),
        minimum_mixed_hash(seeds[7], unique_hashes),
        minimum_mixed_hash(seeds[8], unique_hashes),
        minimum_mixed_hash(seeds[9], unique_hashes),
        minimum_mixed_hash(seeds[10], unique_hashes),
        minimum_mixed_hash(seeds[11], unique_hashes),
        minimum_mixed_hash(seeds[12], unique_hashes),
        minimum_mixed_hash(seeds[13], unique_hashes),
        minimum_mixed_hash(seeds[14], unique_hashes),
        minimum_mixed_hash(seeds[15], unique_hashes),
        minimum_mixed_hash(seeds[16], unique_hashes),
        minimum_mixed_hash(seeds[17], unique_hashes),
        minimum_mixed_hash(seeds[18], unique_hashes),
        minimum_mixed_hash(seeds[19], unique_hashes),
        minimum_mixed_hash(seeds[20], unique_hashes),
        minimum_mixed_hash(seeds[21], unique_hashes),
        minimum_mixed_hash(seeds[22], unique_hashes),
        minimum_mixed_hash(seeds[23], unique_hashes),
        minimum_mixed_hash(seeds[24], unique_hashes),
        minimum_mixed_hash(seeds[25], unique_hashes),
        minimum_mixed_hash(seeds[26], unique_hashes),
        minimum_mixed_hash(seeds[27], unique_hashes),
        minimum_mixed_hash(seeds[28], unique_hashes),
        minimum_mixed_hash(seeds[29], unique_hashes),
        minimum_mixed_hash(seeds[30], unique_hashes),
        minimum_mixed_hash(seeds[31], unique_hashes),
        minimum_mixed_hash(seeds[32], unique_hashes),
        minimum_mixed_hash(seeds[33], unique_hashes),
        minimum_mixed_hash(seeds[34], unique_hashes),
        minimum_mixed_hash(seeds[35], unique_hashes),
        minimum_mixed_hash(seeds[36], unique_hashes),
        minimum_mixed_hash(seeds[37], unique_hashes),
        minimum_mixed_hash(seeds[38], unique_hashes),
        minimum_mixed_hash(seeds[39], unique_hashes),
        minimum_mixed_hash(seeds[40], unique_hashes),
        minimum_mixed_hash(seeds[41], unique_hashes),
        minimum_mixed_hash(seeds[42], unique_hashes),
        minimum_mixed_hash(seeds[43], unique_hashes),
        minimum_mixed_hash(seeds[44], unique_hashes),
        minimum_mixed_hash(seeds[45], unique_hashes),
        minimum_mixed_hash(seeds[46], unique_hashes),
        minimum_mixed_hash(seeds[47], unique_hashes),
        minimum_mixed_hash(seeds[48], unique_hashes),
        minimum_mixed_hash(seeds[49], unique_hashes),
        minimum_mixed_hash(seeds[50], unique_hashes),
        minimum_mixed_hash(seeds[51], unique_hashes),
        minimum_mixed_hash(seeds[52], unique_hashes),
        minimum_mixed_hash(seeds[53], unique_hashes),
        minimum_mixed_hash(seeds[54], unique_hashes),
        minimum_mixed_hash(seeds[55], unique_hashes),
        minimum_mixed_hash(seeds[56], unique_hashes),
        minimum_mixed_hash(seeds[57], unique_hashes),
        minimum_mixed_hash(seeds[58], unique_hashes),
        minimum_mixed_hash(seeds[59], unique_hashes),
        minimum_mixed_hash(seeds[60], unique_hashes),
        minimum_mixed_hash(seeds[61], unique_hashes),
        minimum_mixed_hash(seeds[62], unique_hashes),
        minimum_mixed_hash(seeds[63], unique_hashes),
        minimum_mixed_hash(seeds[64], unique_hashes),
        minimum_mixed_hash(seeds[65], unique_hashes),
        minimum_mixed_hash(seeds[66], unique_hashes),
        minimum_mixed_hash(seeds[67], unique_hashes),
        minimum_mixed_hash(seeds[68], unique_hashes),
        minimum_mixed_hash(seeds[69], unique_hashes),
        minimum_mixed_hash(seeds[70], unique_hashes),
        minimum_mixed_hash(seeds[71], unique_hashes),
        minimum_mixed_hash(seeds[72], unique_hashes),
        minimum_mixed_hash(seeds[73], unique_hashes),
        minimum_mixed_hash(seeds[74], unique_hashes),
        minimum_mixed_hash(seeds[75], unique_hashes),
        minimum_mixed_hash(seeds[76], unique_hashes),
        minimum_mixed_hash(seeds[77], unique_hashes),
        minimum_mixed_hash(seeds[78], unique_hashes),
        minimum_mixed_hash(seeds[79], unique_hashes),
        minimum_mixed_hash(seeds[80], unique_hashes),
        minimum_mixed_hash(seeds[81], unique_hashes),
        minimum_mixed_hash(seeds[82], unique_hashes),
        minimum_mixed_hash(seeds[83], unique_hashes),
        minimum_mixed_hash(seeds[84], unique_hashes),
        minimum_mixed_hash(seeds[85], unique_hashes),
        minimum_mixed_hash(seeds[86], unique_hashes),
        minimum_mixed_hash(seeds[87], unique_hashes),
        minimum_mixed_hash(seeds[88], unique_hashes),
        minimum_mixed_hash(seeds[89], unique_hashes),
        minimum_mixed_hash(seeds[90], unique_hashes),
        minimum_mixed_hash(seeds[91], unique_hashes),
        minimum_mixed_hash(seeds[92], unique_hashes),
        minimum_mixed_hash(seeds[93], unique_hashes),
        minimum_mixed_hash(seeds[94], unique_hashes),
        minimum_mixed_hash(seeds[95], unique_hashes),
        minimum_mixed_hash(seeds[96], unique_hashes),
        minimum_mixed_hash(seeds[97], unique_hashes),
        minimum_mixed_hash(seeds[98], unique_hashes),
        minimum_mixed_hash(seeds[99], unique_hashes),
        minimum_mixed_hash(seeds[100], unique_hashes),
        minimum_mixed_hash(seeds[101], unique_hashes),
        minimum_mixed_hash(seeds[102], unique_hashes),
        minimum_mixed_hash(seeds[103], unique_hashes),
        minimum_mixed_hash(seeds[104], unique_hashes),
        minimum_mixed_hash(seeds[105], unique_hashes),
        minimum_mixed_hash(seeds[106], unique_hashes),
        minimum_mixed_hash(seeds[107], unique_hashes),
        minimum_mixed_hash(seeds[108], unique_hashes),
        minimum_mixed_hash(seeds[109], unique_hashes),
        minimum_mixed_hash(seeds[110], unique_hashes),
        minimum_mixed_hash(seeds[111], unique_hashes),
        minimum_mixed_hash(seeds[112], unique_hashes),
        minimum_mixed_hash(seeds[113], unique_hashes),
        minimum_mixed_hash(seeds[114], unique_hashes),
        minimum_mixed_hash(seeds[115], unique_hashes),
        minimum_mixed_hash(seeds[116], unique_hashes),
        minimum_mixed_hash(seeds[117], unique_hashes),
        minimum_mixed_hash(seeds[118], unique_hashes),
        minimum_mixed_hash(seeds[119], unique_hashes),
        minimum_mixed_hash(seeds[120], unique_hashes),
        minimum_mixed_hash(seeds[121], unique_hashes),
        minimum_mixed_hash(seeds[122], unique_hashes),
        minimum_mixed_hash(seeds[123], unique_hashes),
        minimum_mixed_hash(seeds[124], unique_hashes),
        minimum_mixed_hash(seeds[125], unique_hashes),
        minimum_mixed_hash(seeds[126], unique_hashes),
        minimum_mixed_hash(seeds[127], unique_hashes),
    ]

❌ New issue: Large Method
sketch_values has 132 lines, threshold = 70

@coderabbitai

This comment was marked as resolved.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos marked this pull request as ready for review May 19, 2026 23:21
Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @leynos, you have reached your weekly rate limit of 2500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai coderabbitai Bot added the Roadmap label May 19, 2026
coderabbitai[bot]

This comment was marked as resolved.

Replace an unnecessary `rstest` marker with `test` and refresh the
execplan risk and orientation text to match the shipped proof seam.
codescene-delta-analysis[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as resolved.

Update the execplan milestone text to reference Vec-backed dedup and
the approved unwind 129 proof-bound resolution.
codescene-delta-analysis[bot]

This comment was marked as outdated.

Assert the intermediate sorted and deduplicated hash vector in the
existing MinHasher set-semantics tests so removing `dedup` fails.
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/whitaker_clones_core/src/index/minhash.rs (2)

70-73: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Document the cfg-specific sketch_values implementations.

Both variants of sketch_values lack doc comments explaining the conditional compilation split. The unrolled #[cfg(kani)] version (lines 75-207) is a deliberate proof seam that prevents Kani's bounded model checker from spending proof budget on iterator state expansion, whilst the production version (lines 70-73) uses the idiomatic array::from_fn. Add doc comments on both variants explaining:

  • Why two implementations exist (proof tractability versus production idiom)
  • That the cfg(kani) version is mechanically equivalent but explicitly unrolled
  • That both compute the same 128-lane MinHash signature via minimum_mixed_hash
📝 Suggested documentation
+/// Computes the 128-lane MinHash signature from seeds and unique hashes.
+///
+/// Production builds use [`array::from_fn`] for idiomatic iteration.
 #[cfg(not(kani))]
 fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {
     array::from_fn(|index| minimum_mixed_hash(seeds[index], unique_hashes))
 }

+/// Computes the 128-lane MinHash signature with explicit unrolling for Kani.
+///
+/// This cfg(kani) variant manually unrolls all 128 calls to
+/// [`minimum_mixed_hash`] so Kani's bounded model checker does not spend proof
+/// budget on loop or iterator state expansion. The explicit array literal keeps
+/// the unwind bound tractable whilst still verifying the real signature
+/// construction. This implementation is mechanically equivalent to the
+/// production [`array::from_fn`] version.
 #[cfg(kani)]
 fn sketch_values(seeds: &[u64; MINHASH_SIZE], unique_hashes: &[u64]) -> [u64; MINHASH_SIZE] {

As per coding guidelines: "Document public Rust APIs using Rustdoc comments (///) so documentation can be generated with cargo doc", and ensure unusual patterns are explained for maintainability.

Also applies to: 75-207

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 70 - 73, Add
Rustdoc comments to both cfg-specific implementations of sketch_values
explaining that there are two implementations: the #[cfg(kani)] unrolled variant
exists as a proof seam to avoid Kani's iterator/state expansion and is
mechanically equivalent to the production implementation, while the
#[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state
that both compute the same 128-lane MinHash signature by calling
minimum_mixed_hash for each lane. Place these /// comments directly above each
sketch_values definition and mention the equivalence and intent (proof
tractability vs production idiom) and that both call minimum_mixed_hash to
produce the MinHash signature.

60-68: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Expand the doc comment to explain the proof-specific purpose.

The current comment describes what the function does but not why it exists. Explain that this fixture avoids seed-stream construction to reduce Kani's symbolic state space, referencing the #[cfg(kani)] proof seam pattern.

📝 Suggested expansion
-    /// Creates a deterministic proof fixture without exercising seed-stream
-    /// construction.
+    /// Creates a deterministic proof fixture for Kani harnesses.
+    ///
+    /// This constructor bypasses seed-stream generation to reduce symbolic
+    /// state-space growth in bounded model checking. Production code continues
+    /// to use [`MinHasher::new`] and the full seed sequence.
     #[cfg(kani)]
     pub(super) fn from_seed_for_kani(seed: u64) -> Self {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/whitaker_clones_core/src/index/minhash.rs` around lines 60 - 68, The
doc comment for pub(super) fn from_seed_for_kani(seed: u64) currently describes
behavior but not intent; update the comment to state this is a proof-only
fixture guarded by #[cfg(kani)] that fills seeds with the same value to avoid
constructing the seed-stream and thus drastically reduce Kani's symbolic state
space as a proof seam; mention MINHASH_SIZE and that this pattern is used to
provide deterministic, lightweight inputs for Kani proofs rather than production
logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@crates/whitaker_clones_core/src/index/minhash.rs`:
- Around line 70-73: Add Rustdoc comments to both cfg-specific implementations
of sketch_values explaining that there are two implementations: the #[cfg(kani)]
unrolled variant exists as a proof seam to avoid Kani's iterator/state expansion
and is mechanically equivalent to the production implementation, while the
#[cfg(not(kani)] production variant uses the idiomatic array::from_fn; state
that both compute the same 128-lane MinHash signature by calling
minimum_mixed_hash for each lane. Place these /// comments directly above each
sketch_values definition and mention the equivalence and intent (proof
tractability vs production idiom) and that both call minimum_mixed_hash to
produce the MinHash signature.
- Around line 60-68: The doc comment for pub(super) fn from_seed_for_kani(seed:
u64) currently describes behavior but not intent; update the comment to state
this is a proof-only fixture guarded by #[cfg(kani)] that fills seeds with the
same value to avoid constructing the seed-stream and thus drastically reduce
Kani's symbolic state space as a proof seam; mention MINHASH_SIZE and that this
pattern is used to provide deterministic, lightweight inputs for Kani proofs
rather than production logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a0ce00a0-0aaa-46e8-b6b1-42559315a8fd

📥 Commits

Reviewing files that changed from the base of the PR and between 4630d3d and 640dbef.

📒 Files selected for processing (3)
  • crates/whitaker_clones_core/src/index/minhash.rs
  • crates/whitaker_clones_core/src/index/tests.rs
  • docs/execplans/7-2-7-kani-verification-of-bounded-min-hasher-sketch-invariants.md

Update the ExecPlan milestone wording to distinguish the recorded unwind
129 investigation approval from the shipped MinHasher harnesses, which use
unwind 4 through the proof seam.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant