CD-8797 SourceLinesDiffFinder: prefix/suffix trim + asymmetry + cell … by borisbsu · Pull Request #715 · codescan-io/sonarqube

borisbsu · 2026-05-13T17:30:28Z

The Myers diff algorithm behaves very differently on large and small files because its running time depends on both the file size and the number of differences. Its time complexity is O(N*D), where N is the sum of the file lengths and D is the size of the minimum edit script (total insertions and deletions).

When comparing a massive file (e.g. ~30,000 lines) directly against a tiny file (~50 lines), the Myers Diff algorithm encounters a highly asymmetrical workload.

Because Myers operates on a grid where N = length of file A + length of file B, the total sequence length N is roughly 30,050. The minimum edit script D (the number of deletions and insertions) is exceptionally high, at least 29,950, because nearly the entire large file must be deleted to match the small one.

In this specific scenario, 30,050 x 29,950 translates to nearly 900 million computational operations.
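The arithmetic above can be reproduced with a minimal sketch. The class and method names are hypothetical and are not part of the PR; the estimate assumes D sits at its lower bound (the difference in file lengths) and treats N*D as a rough operation count rather than an exact one.

```java
// Hypothetical helper for illustration only; not code from this PR.
final class MyersCostEstimate {

    /** Lower bound on the edit-script size D: the length difference must be inserted or deleted. */
    static long minEditScript(long linesA, long linesB) {
        return Math.abs(linesA - linesB);
    }

    /** Rough O(N*D) work estimate, with N = linesA + linesB and D at its lower bound. */
    static long estimatedOperations(long linesA, long linesB) {
        long n = linesA + linesB;
        return n * minEditScript(linesA, linesB);
    }

    public static void main(String[] args) {
        System.out.println(minEditScript(30_000, 50));       // 29950
        System.out.println(estimatedOperations(30_000, 50)); // 899997500
    }
}
```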

Previously, we added a guard check in NewCoverageMeasuresStep.java, which helps to avoid calling MyersDiff when no test coverage files are attached.
#697

That helped there, but MyersDiff is also executed in later steps, so the slowness reappeared in NewSizeMeasuresStep.java.

This PR moves the guard to SourceLinesDiffFinder.findMatchingLines(), so all callers (and any future caller) are protected by a single change.

…gates

The previous narrow fix in NewCoverageMeasuresStep (commit 82308bb) only prevented one of six callers of NewLinesRepository.getNewLines() from warming the SCM cache. Once that caller was guarded, the unbounded Myers diff in SourceLinesDiffFinder.findMatchingLines() simply moved to whoever asked next - NewSizeMeasuresStep in this case, and any of the other four consumers (NewMaintainabilityMeasuresVisitor, IsNewLineReader, NewIssueClassifier, PullRequestTrackerExecution) thereafter.

Fix the algorithm itself, not the call sites:

1. Trim common prefix/suffix on the line-hash inputs before invoking Myers. Standard speedup in production diff implementations; for the typical "large file with small PR delta" pattern this collapses to the small divergent core. Cost is O(min(N, M)) hash equality checks - milliseconds even for 100K-line inputs.
2. Apply a cell-product gate (4_000_000 cells) against the divergent core. Catches catastrophic shapes like symmetric fully-disjoint 5K x 5K (25 M cells) that prefix/suffix trim cannot reduce.
3. Apply an asymmetry-ratio gate (max/min > 100 when max >= 5_000) against the divergent core. Catches the EZ-Commit signature: small scanner delta against a large reference-branch file (e.g. 30K x 50 = ratio 600, 1.5 M cells - below the cell gate but forced quadratic by D >= N - M).

When a gate fires, unmatched report lines are returned as zero - semantically identical to the existing dead DifferentiationFailedException catch path, which downstream consumers already tolerate.
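The prefix/suffix trim described in the commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the class `PrefixSuffixTrim`, its nested `Core` type, and the use of `List<String>` for the line hashes are all hypothetical.

```java
import java.util.List;

// Hypothetical sketch of a common prefix/suffix trim; not the PR's implementation.
final class PrefixSuffixTrim {

    /** Sizes of the shared prefix and suffix, plus what remains on each side. */
    static final class Core {
        final int prefix;
        final int suffix;
        final int coreA;
        final int coreB;

        Core(int prefix, int suffix, int coreA, int coreB) {
            this.prefix = prefix;
            this.suffix = suffix;
            this.coreA = coreA;
            this.coreB = coreB;
        }
    }

    /** Performs at most min(a.size(), b.size()) successful hash comparisons in total. */
    static Core trim(List<String> a, List<String> b) {
        int limit = Math.min(a.size(), b.size());

        // Count leading line hashes that are equal on both sides.
        int prefix = 0;
        while (prefix < limit && a.get(prefix).equals(b.get(prefix))) {
            prefix++;
        }

        // Count trailing equal hashes, never reaching back into the prefix.
        int suffix = 0;
        while (suffix < limit - prefix
                && a.get(a.size() - 1 - suffix).equals(b.get(b.size() - 1 - suffix))) {
            suffix++;
        }

        return new Core(prefix, suffix,
                a.size() - prefix - suffix,
                b.size() - prefix - suffix);
    }
}
```

Only the `coreA` x `coreB` region would then be handed to the diff or checked against the gates; two identical inputs leave an empty core, so no diff is needed at all.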
Tested cases (all gated cases return in <= 6 ms on M-series; full test results in the new SourceLinesDiffFinderTest entries):

- 100K x 100K identical -> prefix trim, no Myers, identity map
- 80K x 80K with 100-line mid diff -> trim leaves 100 x 100 core
- 30K x 50 disjoint (your case) -> asymmetry gate fires
- 100K x 100 disjoint (BOI scale) -> asymmetry + cell gate fire
- 5K x 5K disjoint symmetric -> cell gate fires
- 4K x 50 disjoint (below floor) -> Myers runs normally
- all 10 existing golden tests -> preserved
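The gate outcomes for the disjoint shapes listed above can be checked with a small sketch of the two thresholds from the commit message (4_000_000 cells; max/min > 100 when max >= 5_000). The class and method names are hypothetical. Treating the cell limit as a strict `>` and expressing the ratio as a multiplication (to avoid dividing by zero when one side of the core is empty) are assumptions, not details confirmed by the PR.

```java
// Hypothetical sketch of the two gates; thresholds are taken from the commit message.
final class DiffGates {
    static final long MAX_CELLS = 4_000_000L;
    static final int ASYMMETRY_FLOOR = 5_000;
    static final int MAX_RATIO = 100;

    /** Fires when the divergent core's grid (coreA x coreB) exceeds the cell budget. */
    static boolean cellGate(int coreA, int coreB) {
        return (long) coreA * coreB > MAX_CELLS; // strict '>' is an assumption
    }

    /** Fires when the larger side is at least the floor and more than 100x the smaller side. */
    static boolean asymmetryGate(int coreA, int coreB) {
        int max = Math.max(coreA, coreB);
        int min = Math.min(coreA, coreB);
        // max > ratio * min is equivalent to max / min > ratio for min > 0
        return max >= ASYMMETRY_FLOOR && (long) max > (long) MAX_RATIO * min;
    }

    /** True when either gate fires and the Myers diff would be skipped. */
    static boolean skipsMyers(int coreA, int coreB) {
        return cellGate(coreA, coreB) || asymmetryGate(coreA, coreB);
    }

    public static void main(String[] args) {
        System.out.println(skipsMyers(30_000, 50)); // true  (asymmetry gate only)
        System.out.println(skipsMyers(5_000, 5_000)); // true  (cell gate only)
        System.out.println(skipsMyers(4_000, 50));  // false (below the 5_000 floor)
    }
}
```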

borisbsu added 4 commits May 13, 2026 15:47

debug

f920a75

debug

980c649

debug

0757d69

borisbsu requested review from dewan-som, lokeswarayadav-ar and ritesh-ghiya-cs May 14, 2026 07:43

ritesh-ghiya-cs approved these changes May 14, 2026

borisbsu changed the base branch from codescan-24.12 to Release-26.0.10 May 14, 2026 12:53

borisbsu changed the base branch from Release-26.0.10 to codescan-24.12 May 14, 2026 12:54

borisbsu wants to merge 4 commits into codescan-24.12 from feature/CD-8797
