Summary
Self-clone findings (similar code at different positions within the same function) display overlapping evidence bounds in the HTML report header. This makes it appear that the matched windows overlap, which should be impossible.
Affected code
src/clonehunter/similarity/rollup.py:17-19 (rollup_findings grouping loop)
src/clonehunter/similarity/rollup.py:65-68 (_fn_pair_key normalizes pair order)
src/clonehunter/reporting/html_reporter.py:353-360 (_evidence_bounds aggregates per-side)
Problem
When window snippets within the same function match each other, _fn_pair_key normalizes to (identity, identity) and all self-clone matches land in one group. However, individual CandidateMatch entries have arbitrary a/b ordering: when window@1800 queries and finds window@2100 the match is (a=1800, b=2100), but when window@2100 queries and finds window@1800 the match is (a=2100, b=1800). After _dedupe_matches, one of each pair survives, but the surviving a/b assignment is inconsistent across the group.
_evidence_bounds then aggregates each side separately: min(snippet_a.start_line) and max(snippet_a.end_line) over all matches for side A, and the same over snippet_b for side B. With mixed ordering, both sides span nearly the entire function, producing overlapping header ranges like:
build_default_step_handlers
app/services/step_handlers.py:1254-4101
build_default_step_handlers
app/services/step_handlers.py:1182-4101
The actual matched content (visible in the diff) is disjoint (e.g. lines 1788–1822 vs 2082–2118), but the summary is misleading.
This also affects _duplicated_lines and _hidden_duplicated_lines, which aggregate per side in the same way.
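The effect can be reproduced in isolation with a minimal sketch. The Snippet and Match classes and the evidence_bounds helper below are hypothetical stand-ins, not the real CloneHunter types; the first match uses the line ranges quoted above, and the second match's line numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Match:
    snippet_a: Snippet
    snippet_b: Snippet


def evidence_bounds(matches):
    """Aggregate (min start, max end) per side, as described in this issue."""
    side_a = (
        min(m.snippet_a.start_line for m in matches),
        max(m.snippet_a.end_line for m in matches),
    )
    side_b = (
        min(m.snippet_b.start_line for m in matches),
        max(m.snippet_b.end_line for m in matches),
    )
    return side_a, side_b


# Two self-clone matches that survived deduplication with opposite orientation.
matches = [
    Match(Snippet(1788, 1822), Snippet(2082, 2118)),  # lower window on side a
    Match(Snippet(2400, 2430), Snippet(1900, 1930)),  # lower window on side b (hypothetical lines)
]

bounds_a, bounds_b = evidence_bounds(matches)
# Side a now stretches across both low and high windows and swallows side b.
overlapping = bounds_a[0] <= bounds_b[1] and bounds_b[0] <= bounds_a[1]
```

Each individual match is disjoint, yet the aggregated ranges overlap because side a contains one low window and one high window.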
Expected
Evidence bounds for each side should be disjoint and accurately reflect which line ranges belong to "Function A" vs "Function B". For self-clones, the lower-positioned snippets should consistently be on one side.
Suggested fix
Normalize a/b ordering when appending to each group in rollup_findings: for self-clone groups, ensure snippet_a.start_line < snippet_b.start_line; for cross-function groups, ensure snippet_a.function.identity matches the first element of the pair key. This keeps _evidence_bounds, _duplicated_lines, and diff rendering consistent.
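A rough sketch of that normalization, assuming the pair key is a tuple of two function identities and that a match can be copied with its sides swapped. Function, Snippet, Match, and normalize_match are hypothetical names for illustration; the real CandidateMatch may carry additional fields, which dataclasses.replace would preserve only if it is a dataclass.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Function:
    identity: str


@dataclass(frozen=True)
class Snippet:
    function: Function
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Match:
    snippet_a: Snippet
    snippet_b: Snippet


def normalize_match(match, pair_key):
    """Return the match oriented consistently with its group's pair key."""
    a, b = match.snippet_a, match.snippet_b
    if pair_key[0] == pair_key[1]:
        # Self-clone group: keep the lower-positioned snippet on side a.
        swap = a.start_line > b.start_line
    else:
        # Cross-function group: side a must belong to the first key element.
        swap = a.function.identity != pair_key[0]
    return replace(match, snippet_a=b, snippet_b=a) if swap else match


fn = Function("build_default_step_handlers")
key = (fn.identity, fn.identity)
matches = [
    Match(Snippet(fn, 1788, 1822), Snippet(fn, 2082, 2118)),
    Match(Snippet(fn, 2400, 2430), Snippet(fn, 1900, 1930)),  # hypothetical lines
]
normalized = [normalize_match(m, key) for m in matches]

max_end_a = max(m.snippet_a.end_line for m in normalized)
min_start_b = min(m.snippet_b.start_line for m in normalized)
```

In this example the side-a bounds end before the side-b bounds begin. Note that per-match ordering alone may not guarantee disjoint aggregates when a function contains three or more repeated windows, since a middle window can be the lower snippet in one match and the higher one in another.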