
CD-8797 SourceLinesDiffFinder: prefix/suffix trim + asymmetry + cell …#715

Open
borisbsu wants to merge 4 commits into codescan-24.12 from feature/CD-8797


@borisbsu
Collaborator

The Myers diff algorithm behaves very differently on large and small files because its running time depends on both the file sizes and the number of differences. Its time complexity is O(N*D), where N is the sum of the two file lengths and D is the size of the minimum edit script (total insertions and deletions).

When comparing a massive file (e.g. ~30,000 lines) directly against a tiny file (~50 lines), the Myers Diff algorithm encounters a highly asymmetrical workload.

Because Myers operates on a grid where N = length of file A + length of file B, the total sequence length N is roughly 30,050. The minimum edit script D (the number of deletes and inserts) will be exceptionally high, at least 29,950, because nearly the entire large file must be deleted to match the small one.

In this specific scenario, 30,050 x 29,950 translates to nearly 900 million computational operations.
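The arithmetic above can be reproduced with a small sketch. The class and method names are hypothetical and not part of this PR; the estimate uses only the lower bound on D (the difference in line counts), so it is a floor on the O(N*D) cost rather than a measurement.

```java
final class MyersCostEstimate {
    /** Lower bound on D: at least |linesA - linesB| lines must be inserted or deleted. */
    static long minEditScript(long linesA, long linesB) {
        return Math.abs(linesA - linesB);
    }

    /** Rough O(N*D) operation count, with N = linesA + linesB and D at its lower bound. */
    static long estimatedOperations(long linesA, long linesB) {
        long n = linesA + linesB;
        return n * minEditScript(linesA, linesB);
    }

    public static void main(String[] args) {
        // 30,050 * 29,950 = 899,997,500, i.e. nearly 900 million
        System.out.println(estimatedOperations(30_000, 50));
    }
}
```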

Previously, we added a guard check in NewCoverageMeasuresStep.java, which helps avoid calling MyersDiff when no test coverage files are attached.
#697

That helped there, but MyersDiff is also executed in later steps, so the slowness reappeared in NewSizeMeasuresStep.java.

This PR moves the guard to SourceLinesDiffFinder.findMatchingLines() — so all callers (and any future caller) are protected by a single change.

borisbsu added 4 commits May 13, 2026 15:47
…gates

The previous narrow fix in NewCoverageMeasuresStep (commit 82308bb) only
prevented one of six callers of NewLinesRepository.getNewLines() from
warming the SCM cache. Once that caller was guarded, the unbounded Myers
diff in SourceLinesDiffFinder.findMatchingLines() simply moved to whoever
asked next - NewSizeMeasuresStep in this case, and any of the other four
consumers (NewMaintainabilityMeasuresVisitor, IsNewLineReader,
NewIssueClassifier, PullRequestTrackerExecution) thereafter.

Fix the algorithm itself, not the call sites:

1. Trim common prefix/suffix on the line-hash inputs before invoking Myers.
   Standard speedup in production diff implementations; for the typical
   "large file with small PR delta" pattern this collapses to the small
   divergent core. Cost is O(min(N, M)) hash equality checks - milliseconds
   even for 100K-line inputs.

2. Apply a cell-product gate (4_000_000 cells) against the divergent core.
   Catches catastrophic shapes like symmetric fully-disjoint 5K x 5K
   (25 M cells) that prefix/suffix trim cannot reduce.

3. Apply an asymmetry-ratio gate (max/min > 100 when max >= 5 000) against
   the divergent core. Catches the EZ-Commit signature: small scanner
   delta against a large reference-branch file (e.g. 30K x 50 = ratio 600,
   1.5 M cells - below the cell gate but forced quadratic by D >= N - M).
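Step 1 above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `PrefixSuffixTrim` and `Core` names are hypothetical, and line hashes are assumed to be plain strings.

```java
import java.util.List;

final class PrefixSuffixTrim {
    /** Divergent core: [start, endA) in the first list and [start, endB) in the second. */
    static final class Core {
        final int start;
        final int endA;
        final int endB;

        Core(int start, int endA, int endB) {
            this.start = start;
            this.endA = endA;
            this.endB = endB;
        }

        int sizeA() { return endA - start; }
        int sizeB() { return endB - start; }
    }

    /** Strips the common prefix and suffix with O(min(N, M)) equality checks. */
    static Core trim(List<String> a, List<String> b) {
        int limit = Math.min(a.size(), b.size());
        int prefix = 0;
        while (prefix < limit && a.get(prefix).equals(b.get(prefix))) {
            prefix++;
        }
        int endA = a.size();
        int endB = b.size();
        // Stop at the prefix boundary so prefix and suffix never overlap.
        while (endA > prefix && endB > prefix && a.get(endA - 1).equals(b.get(endB - 1))) {
            endA--;
            endB--;
        }
        return new Core(prefix, endA, endB);
    }
}
```

Only the core (`sizeA()` x `sizeB()`) would then be handed to Myers or checked against the gates; lines in the trimmed prefix and suffix map to each other directly.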

When a gate fires, unmatched report lines are returned as zero - semantically
identical to the existing dead DifferentiationFailedException catch path,
which downstream consumers already tolerate.
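The two gates and the all-zero fallback can be sketched as below. The thresholds come from the description above; everything else is an assumption for illustration: the `DiffGates` name, the strict `>` comparison on the cell limit, treating the ratio as real-valued (max > 100 * min), skipping the asymmetry gate when one side of the core is empty, and representing the result as an `int[]` in which 0 means "no match".

```java
final class DiffGates {
    static final long MAX_CELLS = 4_000_000L;  // cell-product limit from the description
    static final int ASYMMETRY_FLOOR = 5_000;  // asymmetry gate applies only when max >= 5 000
    static final int MAX_RATIO = 100;          // max/min ratio limit from the description

    /** Cell-product gate on the divergent core; long arithmetic avoids int overflow. */
    static boolean cellGateFires(int coreA, int coreB) {
        return (long) coreA * coreB > MAX_CELLS;
    }

    /** Asymmetry gate on the divergent core (assumption: an empty side never fires it). */
    static boolean asymmetryGateFires(int coreA, int coreB) {
        int max = Math.max(coreA, coreB);
        int min = Math.min(coreA, coreB);
        if (min == 0 || max < ASYMMETRY_FLOOR) {
            return false;
        }
        return (long) max > (long) MAX_RATIO * min;
    }

    static boolean shouldSkipMyers(int coreA, int coreB) {
        return cellGateFires(coreA, coreB) || asymmetryGateFires(coreA, coreB);
    }

    /** Fallback when a gate fires: every report line is unmatched (Java zero-fills the array). */
    static int[] allUnmatched(int reportLineCount) {
        return new int[reportLineCount];
    }
}
```

With these assumptions, a 30,000 x 50 core trips only the asymmetry gate, a 5,000 x 5,000 core trips only the cell gate, and a 4,000 x 50 core passes both and would go on to Myers.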

Tested cases (all gated cases return in <= 6 ms on M-series; full test
results in the new SourceLinesDiffFinderTest entries):

  - 100K x 100K identical              -> prefix trim, no Myers, identity map
  - 80K x 80K with 100-line mid diff   -> trim leaves 100 x 100 core
  - 30K x 50 disjoint (your case)      -> asymmetry gate fires
  - 100K x 100 disjoint (BOI scale)    -> asymmetry + cell gate fire
  - 5K x 5K disjoint symmetric         -> cell gate fires
  - 4K x 50 disjoint (below floor)     -> Myers runs normally
  - all 10 existing golden tests       -> preserved
@borisbsu borisbsu changed the base branch from codescan-24.12 to Release-26.0.10 May 14, 2026 12:53
@borisbsu borisbsu changed the base branch from Release-26.0.10 to codescan-24.12 May 14, 2026 12:54