CD-8797 SourceLinesDiffFinder: prefix/suffix trim + asymmetry + cell …#715
Open
borisbsu wants to merge 4 commits into
…gates

The previous narrow fix in NewCoverageMeasuresStep (commit 82308bb) only prevented one of six callers of NewLinesRepository.getNewLines() from warming the SCM cache. Once that caller was guarded, the unbounded Myers diff in SourceLinesDiffFinder.findMatchingLines() simply moved to whoever asked next: NewSizeMeasuresStep in this case, and any of the other four consumers (NewMaintainabilityMeasuresVisitor, IsNewLineReader, NewIssueClassifier, PullRequestTrackerExecution) thereafter.

Fix the algorithm itself, not the call sites:

1. Trim common prefix/suffix on the line-hash inputs before invoking Myers. Standard speedup in production diff implementations; for the typical "large file with small PR delta" pattern this collapses to the small divergent core. Cost is O(min(N, M)) hash equality checks - milliseconds even for 100K-line inputs.
2. Apply a cell-product gate (4_000_000 cells) against the divergent core. Catches catastrophic shapes like symmetric fully-disjoint 5K x 5K (25 M cells) that prefix/suffix trim cannot reduce.
3. Apply an asymmetry-ratio gate (max/min > 100 when max >= 5_000) against the divergent core. Catches the EZ-Commit signature: small scanner delta against a large reference-branch file (e.g. 30K x 50 = ratio 600, 1.5 M cells - below the cell gate but forced quadratic by D >= N - M).

When a gate fires, unmatched report lines are returned as zero - semantically identical to the existing dead DifferentiationFailedException catch path, which downstream consumers already tolerate.
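The prefix/suffix trim in step 1 can be illustrated with a minimal sketch. This is not the code from the PR: the class `TrimSketch`, the record `Trim`, and the methods `commonEnds` and `core` are hypothetical names, and plain strings stand in for line hashes.

```java
import java.util.List;

/** Hypothetical helper (not the PR's code): trims the common prefix/suffix of two line-hash lists. */
final class TrimSketch {

    /** Lengths of the shared prefix and shared suffix; the suffix never overlaps the prefix. */
    record Trim(int prefix, int suffix) {}

    static Trim commonEnds(List<String> left, List<String> right) {
        int limit = Math.min(left.size(), right.size());

        // Walk forward while the hashes agree.
        int prefix = 0;
        while (prefix < limit && left.get(prefix).equals(right.get(prefix))) {
            prefix++;
        }

        // Walk backward while the hashes agree, without re-using prefix lines.
        int suffix = 0;
        while (suffix < limit - prefix
                && left.get(left.size() - 1 - suffix).equals(right.get(right.size() - 1 - suffix))) {
            suffix++;
        }
        return new Trim(prefix, suffix);
    }

    /** The divergent core of one side: the only part a diff would still need to see. */
    static List<String> core(List<String> lines, Trim trim) {
        return lines.subList(trim.prefix(), lines.size() - trim.suffix());
    }

    public static void main(String[] args) {
        List<String> db = List.of("a", "b", "c", "d", "e");
        List<String> report = List.of("a", "b", "x", "y", "d", "e");
        Trim trim = commonEnds(db, report);
        System.out.println(core(db, trim));     // [c]
        System.out.println(core(report, trim)); // [x, y]
    }
}
```

Both loops together perform at most min(N, M) equality checks, which matches the cost stated in step 1. For identical inputs the prefix consumes everything and both cores are empty, so no diff is needed.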
Tested cases (all gated cases return in <= 6 ms on M-series; full test results in the new SourceLinesDiffFinderTest entries):

- 100K x 100K identical -> prefix trim, no Myers, identity map
- 80K x 80K with 100-line mid diff -> trim leaves 100 x 100 core
- 30K x 50 disjoint (your case) -> asymmetry gate fires
- 100K x 100 disjoint (BOI scale) -> asymmetry + cell gate fire
- 5K x 5K disjoint symmetric -> cell gate fires
- 4K x 50 disjoint (below floor) -> Myers runs normally
- all 10 existing golden tests -> preserved
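The gate outcomes listed above follow from the thresholds in the commit message. The sketch below restates them as two predicates over the divergent-core sizes. The class, method, and constant names are hypothetical, the strict `>` comparison on the cell gate is an assumption, and so is treating an empty core as ungated.

```java
/** Hypothetical sketch of the two gates; only the threshold values come from the commit message. */
final class DiffGateSketch {
    static final long MAX_CELLS = 4_000_000L;   // cell-product gate
    static final long ASYMMETRY_FLOOR = 5_000L; // ratio gate applies only when max >= this
    static final long MAX_RATIO = 100L;         // gate fires when max/min exceeds this

    /** Fires when the diff grid would exceed the cell budget (strict comparison is an assumption). */
    static boolean cellGateFires(long coreLeft, long coreRight) {
        return coreLeft * coreRight > MAX_CELLS;
    }

    /** Fires when one side is large and more than MAX_RATIO times the other. */
    static boolean asymmetryGateFires(long coreLeft, long coreRight) {
        long max = Math.max(coreLeft, coreRight);
        long min = Math.min(coreLeft, coreRight);
        // Assumption: an empty core needs no diff at all, so it is not gated here.
        if (max < ASYMMETRY_FLOOR || min == 0) {
            return false;
        }
        // Equivalent to max/min > 100, written without integer division.
        return max > MAX_RATIO * min;
    }

    static boolean anyGateFires(long coreLeft, long coreRight) {
        return cellGateFires(coreLeft, coreRight) || asymmetryGateFires(coreLeft, coreRight);
    }

    public static void main(String[] args) {
        long[][] shapes = {{30_000, 50}, {100_000, 100}, {5_000, 5_000}, {4_000, 50}, {100, 100}};
        for (long[] s : shapes) {
            System.out.println(s[0] + " x " + s[1]
                    + " cell=" + cellGateFires(s[0], s[1])
                    + " asymmetry=" + asymmetryGateFires(s[0], s[1]));
        }
    }
}
```

For the disjoint cases the divergent core equals the full input, so the sizes can be passed in directly. The 80K x 80K case reaches the gates as 100 x 100 after trimming and passes both.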
ritesh-ghiya-cs
approved these changes
May 14, 2026
The Myers diff algorithm behaves very differently on large and small files because its running time depends on both the file sizes and the number of differences. It has a time complexity of `O(N*D)`, where `N` is the sum of the file lengths and `D` is the size of the minimum edit script (total insertions and deletions).

When comparing a massive file (e.g. ~30,000 lines) directly against a tiny file (~50 lines), the algorithm faces a highly asymmetric workload. Because `N = Length of File A + Length of File B`, the total sequence length `N` is roughly 30,050. The minimum edit script `D` will be exceptionally large, at least 29,950, because nearly the entire large file must be deleted to match the small one. In this scenario, `30,050 x 29,950` translates to nearly 900 million computational operations.

Previously, we added a guard check in `NewCoverageMeasuresStep.java` that avoids calling MyersDiff when no test coverage files are attached (#697). That helped there, but MyersDiff is also executed in later steps, so the slowness reappeared in `NewSizeMeasuresStep.java`.

This PR moves the guard to `SourceLinesDiffFinder.findMatchingLines()`, so all callers (and any future caller) are protected by a single change.
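The cost figure quoted for the 30,000 x 50 case can be reproduced with a few lines of arithmetic. This is only a rough sketch of the `N*D` estimate, using the lower bound `D >= |A - B|` on the edit script. The class and method names are hypothetical, and the result is an order-of-magnitude estimate, not a measured operation count.

```java
/** Hypothetical back-of-the-envelope estimate of Myers work; not a measurement. */
final class MyersCostSketch {

    /** Lower bound on D: the length difference must be covered by inserts or deletes. */
    static long minEditScript(long linesA, long linesB) {
        return Math.abs(linesA - linesB);
    }

    /** N * D with N = A + B and D replaced by its lower bound. */
    static long estimatedOps(long linesA, long linesB) {
        return (linesA + linesB) * minEditScript(linesA, linesB);
    }

    public static void main(String[] args) {
        System.out.println(minEditScript(30_000, 50)); // 29950
        System.out.println(estimatedOps(30_000, 50));  // 899997500
    }
}
```

Because `D` is only bounded from below here, fully disjoint inputs cost even more than this estimate, while same-length inputs get a bound of zero that says nothing about their real cost.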