Self-clone evidence bounds overlap due to inconsistent a/b snippet ordering #22

@drogers0

Summary

Self-clone findings (similar code at different positions within the same function) display overlapping evidence bounds in the HTML report header. This makes it look as though the matched windows themselves overlap, which should be impossible.

Affected code

  • src/clonehunter/similarity/rollup.py:17-19 (rollup_findings grouping loop)
  • src/clonehunter/similarity/rollup.py:65-68 (_fn_pair_key normalizes pair order)
  • src/clonehunter/reporting/html_reporter.py:353-360 (_evidence_bounds aggregates per-side)

Problem

When window snippets within the same function match each other, _fn_pair_key normalizes the pair key to (identity, identity), so all self-clone matches land in one group. However, individual CandidateMatch entries have arbitrary a/b ordering. When window@1800 queries and finds window@2100, the match is (a=1800, b=2100); when window@2100 queries and finds window@1800, the match is (a=2100, b=1800). After _dedupe_matches, one match of each pair survives, but which a/b orientation survives is inconsistent across the group.

_evidence_bounds then aggregates each side separately: min(all snippet_a.start_line) and max(all snippet_a.end_line) for side a, and the same over snippet_b for side b. With mixed ordering, both sides span nearly the entire function, producing overlapping header ranges like:

build_default_step_handlers
app/services/step_handlers.py:1254-4101
build_default_step_handlers
app/services/step_handlers.py:1182-4101

The actual matched content (visible in the diff) is disjoint (e.g. lines 1788–1822 vs 2082–2118), so the header summary is misleading.

This also affects _duplicated_lines and _hidden_duplicated_lines, which aggregate per side in the same way.
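
A minimal sketch of how mixed a/b ordering can inflate both sides. The `Snippet` and `Match` classes and the `evidence_bounds` function below are hypothetical stand-ins, not clonehunter's actual types; only the per-side min/max aggregation is taken from the description above. The second match's line ranges are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:  # hypothetical stand-in for a window snippet
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Match:  # hypothetical stand-in for CandidateMatch
    snippet_a: Snippet
    snippet_b: Snippet


def evidence_bounds(matches):
    """Per-side aggregation: min start and max end over each side's snippets."""
    side_a = (
        min(m.snippet_a.start_line for m in matches),
        max(m.snippet_a.end_line for m in matches),
    )
    side_b = (
        min(m.snippet_b.start_line for m in matches),
        max(m.snippet_b.end_line for m in matches),
    )
    return side_a, side_b


# Two surviving matches from one self-clone group with opposite orientation.
matches = [
    Match(Snippet(1788, 1822), Snippet(2082, 2118)),  # lower window on side a
    Match(Snippet(2200, 2230), Snippet(1900, 1930)),  # lower window on side b (assumed ranges)
]

bounds_a, bounds_b = evidence_bounds(matches)
print(bounds_a, bounds_b)
```

Each individual match is disjoint, yet the aggregated ranges for side a and side b overlap because each side mixes low and high windows.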

Expected

Evidence bounds for each side should be disjoint and accurately reflect which line ranges belong to "Function A" vs "Function B". For self-clones, the lower-positioned snippets should consistently be on one side.

Suggested fix

Normalize a/b ordering when appending to each group in rollup_findings: for self-clone groups, ensure snippet_a.start_line < snippet_b.start_line; for cross-function groups, ensure snippet_a.function.identity matches the first element of the pair key. This keeps _evidence_bounds, _duplicated_lines, and diff rendering consistent.
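
The suggested normalization could look roughly like the sketch below. It assumes a pair key of the form (identity_first, identity_second) and uses hypothetical `Function`, `Snippet`, and `Match` classes plus a hypothetical `normalize_match` helper; none of these are clonehunter's real definitions, and the actual change inside rollup_findings may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Function:  # hypothetical stand-in
    identity: str


@dataclass(frozen=True)
class Snippet:  # hypothetical stand-in
    function: Function
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Match:  # hypothetical stand-in for CandidateMatch
    snippet_a: Snippet
    snippet_b: Snippet


def normalize_match(match, pair_key):
    """Return the match oriented consistently with pair_key (hypothetical helper)."""
    first, second = pair_key
    a, b = match.snippet_a, match.snippet_b
    if first == second:
        # Self-clone group: keep the lower-positioned snippet on side a.
        swap = a.start_line > b.start_line
    else:
        # Cross-function group: side a must belong to the first key element.
        swap = a.function.identity != first
    return Match(snippet_a=b, snippet_b=a) if swap else match


fn = Function("build_default_step_handlers")
flipped = Match(Snippet(fn, 2082, 2118), Snippet(fn, 1788, 1822))
fixed = normalize_match(flipped, (fn.identity, fn.identity))

other = Function("other_fn")  # assumed second function for the cross-function case
cross = Match(Snippet(other, 10, 20), Snippet(fn, 30, 40))
cross_fixed = normalize_match(cross, (fn.identity, other.identity))
```

Applying this as each match is appended to its group would give every downstream per-side aggregation the same orientation. If a function contains several clone pairs at interleaved positions, consistent orientation alone may still not make the aggregated ranges disjoint; this sketch does not address that case.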
