perf: batch _compute_summaries queries to stop hangs on large repos #184

realkotob wants to merge 2 commits into tirth8205:main from …
Conversation
Code Review — Approved, needs rebase

This is excellent work. The batch-aggregate refactor is correct, well-reasoned, and the test coverage is outstanding.

One minor issue: in

```python
node_rows = conn.execute(  # nosec B608
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",
    batch,
).fetchall()
```

the `# nosec B608` comment belongs on the f-string line it suppresses:

```python
node_rows = conn.execute(
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",  # nosec B608
    batch,
).fetchall()
```

This is a nit — the code is correct either way since …

Action needed: This branch has a merge conflict with …
…, batch store_communities

Narrowed scope per review feedback — cohesion batch fix and build.py changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on every community >50 nodes caused exponential blow-up on large graphs. With 3k+ communities, each sub-pass re-scanned all edges and ran a full Leiden + cohesion pass. Removed entirely — the first-pass partitioning is sufficient.
2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until convergence, which can take unbounded time on dense code graphs. Two passes produce equivalent partition quality for dependency graphs.
3. **Batch UPDATE in `store_communities`**: Replace per-member `UPDATE nodes SET community_id` with a single `WHERE qualified_name IN (...)` per community. Fully parameterized (nosec B608 is correct).
4. **Progress logging**: Added `logger.info()` at each phase boundary (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase             | Before          | After       |
|-------------------|-----------------|-------------|
| Leiden clustering | >2 hours (hung) | ~45 seconds |
| store_communities | ~5 min          | ~15 seconds |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
Force-pushed 6eab902 to 175e67e
Thanks for the thorough review @tirth8205! I've addressed your feedback — moved the `# nosec B608` comment onto the f-string line.
…, batch store_communities (#213)

Made-with: Cursor
Co-authored-by: Cursor <noreply@cursor.com>
Co-authored-by: Tirth Kanani <tirthkanani18@gmail.com>
Running `code-review-graph build` on the Godot source (~9k files,
~100k edges, ~93k nodes) effectively hung during post-processing even
after a previous perf fix to community detection. A sampled stack
trace showed the Python process pinned 100% CPU inside a single
SQLite `.execute(...)` call, doing B-tree page reads:
pysqlite_connection_execute
_pysqlite_query_execute
sqlite3_step
sqlite3VdbeExec
sqlite3VdbeFinishMoveto
sqlite3BtreeTableMoveto
moveToChild -> getPageNormal -> pread
The last printed log said "Detecting communities with Leiden algorithm
(igraph)", which made it look like Leiden was the culprit — but that
was a stale line. Leiden had actually finished; the hang was in the
very next step: `_compute_summaries` in code_review_graph/tools/build.py,
which populates the community_summaries, flow_snapshots, and
risk_index tables.
Why it hung:
All three sections of `_compute_summaries` ran per-row / per-community
SQLite queries inside Python loops. The worst offender was
risk_index, which ran two COUNT(*) queries against the edges table
for every Function/Class/Test node:
    for n in nodes:  # qn below is n's qualified_name; the binding is elided here
        caller_count = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE target_qualified = ? "
            "AND kind = 'CALLS'", (qn,),
        ).fetchone()[0]
        tested = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE source_qualified = ? "
            "AND kind = 'TESTED_BY'", (qn,),
        ).fetchone()[0]
On Godot that filter matches tens of thousands of nodes, so the loop
issues ~100k round trips. Each individual query uses the edges index
on target_qualified / source_qualified, but the index isn't covering
— SQLite still has to fault in the main-table row to check `kind`.
Add Python's per-query overhead and the loop is effectively
unbounded on a repo this size.
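The "index isn't covering" claim can be checked directly with `EXPLAIN QUERY PLAN`. A standalone sketch with a hypothetical minimal `edges` table — the actual fix batches the queries rather than changing indexes, so this is illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (id INTEGER PRIMARY KEY, "
    "source_qualified TEXT, target_qualified TEXT, kind TEXT)"
)
conn.execute("CREATE INDEX idx_target ON edges (target_qualified)")

QUERY = ("EXPLAIN QUERY PLAN SELECT COUNT(*) FROM edges "
         "WHERE target_qualified = ? AND kind = 'CALLS'")

# With an index on target_qualified alone, SQLite must still read the
# main-table row to evaluate the `kind` predicate.
detail_plain = conn.execute(QUERY, ("x",)).fetchone()[3]
print(detail_plain)     # e.g. SEARCH edges USING INDEX idx_target (...)

# Adding `kind` makes the index covering: the whole filter is answered
# from index pages, with no main-table faults.
conn.execute("CREATE INDEX idx_target_kind ON edges (target_qualified, kind)")
detail_covering = conn.execute(QUERY, ("x",)).fetchone()[3]
print(detail_covering)  # e.g. SEARCH edges USING COVERING INDEX idx_target_kind (...)
```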
`community_summaries` had the same disease with a triple-JOIN
aggregate query per community, and `flow_snapshots` fetched node
names one id at a time.
The fix:
Same batch-aggregate pattern that fixed the community cohesion perf
bug — pre-compute what the loops need in a handful of `GROUP BY`
queries, then do the rest in memory.
- risk_index: two `GROUP BY` aggregates give us `caller_counts`
and `tested_counts` dicts. The per-node loop becomes dict
lookups.
- community_summaries: three aggregate/group queries pre-compute
per-node edge counts, per-community node lists, and per-community
distinct file paths. The per-community loop picks top symbols
and the path prefix from in-memory lists.
- flow_snapshots: collect every node id referenced by any flow,
fetch names in one batched `IN (...)` query (chunked at 450 to
stay under SQLite's default `SQLITE_MAX_VARIABLE_NUMBER`), then
build critical paths from the id→name dict.
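As a sketch of the risk_index half of that pattern (hypothetical minimal schema; the real code lives in `_compute_summaries`), the two per-node queries collapse into two `GROUP BY` aggregates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT, kind TEXT);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                        target_qualified TEXT, kind TEXT);
    INSERT INTO nodes VALUES (1, 'pkg.f', 'Function'), (2, 'pkg.g', 'Function');
    INSERT INTO edges VALUES
        (1, 'pkg.g', 'pkg.f', 'CALLS'),
        (2, 'pkg.f', 'pkg.f', 'CALLS'),
        (3, 'pkg.f', 'tests.t', 'TESTED_BY');
""")

# One aggregate per metric, instead of two SELECTs per node.
caller_counts = dict(conn.execute(
    "SELECT target_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'CALLS' GROUP BY target_qualified"
))
tested_counts = dict(conn.execute(
    "SELECT source_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'TESTED_BY' GROUP BY source_qualified"
))

# The hot loop is now pure dict lookups — zero SQL per node.
for _node_id, qn in conn.execute("SELECT id, qualified_name FROM nodes ORDER BY id"):
    print(qn, caller_counts.get(qn, 0), tested_counts.get(qn, 0))
# pkg.f 2 1
# pkg.g 0 0
```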
Semantics are preserved for every table except one minor incidental
fix: the old `community_summaries` top-symbols query used
`LEFT JOIN edges e1 LEFT JOIN edges e2 ... COUNT(e1.id) + COUNT(e2.id)`,
which produces a cartesian product over (in, out) edges. The comment
on the original code said the intent was "in+out edge count", so the
refactor now uses the correct `in + out` sum. Hub nodes may appear
slightly differently in the top-5 list, but the ordering is more
sensible and matches what the code claimed to do.
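The double-LEFT-JOIN blow-up is easy to reproduce: a toy node with two in-edges and three out-edges yields 2 × 3 = 6 joined rows, so each COUNT sees 6 and the "in+out" score comes out as 12 instead of 5. A minimal sketch (hypothetical schema, not the project's actual query text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                        target_qualified TEXT);
    INSERT INTO nodes VALUES (1, 'hub');
    -- two in-edges and three out-edges for 'hub'
    INSERT INTO edges VALUES (1, 'a', 'hub'), (2, 'b', 'hub'),
                             (3, 'hub', 'c'), (4, 'hub', 'd'), (5, 'hub', 'e');
""")

# Old query shape: the two LEFT JOINs multiply rows (2 * 3 = 6),
# so each COUNT scans 6 rows and the sum is 12.
(buggy,) = conn.execute("""
    SELECT COUNT(e1.id) + COUNT(e2.id) FROM nodes n
    LEFT JOIN edges e1 ON e1.target_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.source_qualified = n.qualified_name
""").fetchone()

# Correct in+out: count each direction independently, then add.
(fixed,) = conn.execute("""
    SELECT (SELECT COUNT(*) FROM edges WHERE target_qualified = n.qualified_name)
         + (SELECT COUNT(*) FROM edges WHERE source_qualified = n.qualified_name)
    FROM nodes n
""").fetchone()

print(buggy, fixed)  # 12 5
```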
Test coverage:
Three new tests in `tests/test_tools.py::TestComputeSummaries` seed
a fixture with two communities, cross-community CALLS edges,
TESTED_BY edges, and a security-keyword-matching node, then:
1. `test_risk_index_populated_with_correct_values` — pins exact
caller_count, test_coverage, security_relevant, and risk_score
values for each node.
2. `test_community_summaries_populated_with_correct_values` —
pins key_symbols, size, and dominant_language for both
communities, including the first-ranked symbol (which has
strictly more edges than the others to catch ordering bugs).
3. `test_compute_summaries_does_not_scale_per_node` — attaches
a `sqlite3.Connection.set_trace_callback` hook and fails if
any per-row SELECT (`WHERE target_qualified = 'x'`,
`WHERE id = 5`, etc.) fires. This is the real regression
guard against someone reintroducing the hot loop.
All three tests were validated with a mutation test (reintroduced
the per-row caller_count query) — the trace test caught the
regression with the offending SQL printed in the failure message,
then the mutation was reverted.
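The trace-hook guard works roughly like this (a sketch, not the actual test; `sqlite3.Connection.set_trace_callback` invokes the callable once per executed statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id INTEGER PRIMARY KEY, target_qualified TEXT, kind TEXT)")

seen: list[str] = []
conn.set_trace_callback(seen.append)  # record every SQL statement as it executes

# Simulate the regression: one SELECT per node.
for qn in ("a", "b", "c"):
    conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (qn,),
    ).fetchone()

conn.set_trace_callback(None)

# The guard scans the trace for per-row query shapes; the real test
# asserts this list is empty and prints the offenders when it is not.
per_row = [s for s in seen if "WHERE target_qualified =" in s]
print(len(per_row))  # 3
```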
Full suite: 624 passed, 1 skipped, 2 xpassed. Ruff clean on
`code_review_graph/tools/build.py`, mypy clean.
- Split long SQL string in communities.py to stay under 100-char limit
- Remove trailing whitespace in parser.py
- Rename uppercase local variables in skills.py (N806)
- Guard tomllib import in test_skills.py for Python 3.10 compat
Force-pushed 6bd762d to 441472f
Rebased on main and resolved conflicts again (`communities.py` SQL formatting and `test_skills.py` imports). All 743 tests pass locally.
## The problem

Running `code-review-graph build` on a large source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook M1 Pro effectively hangs during post-processing. A sampled stack trace from the hung Python process shows it pinned at 100% CPU inside a single SQLite `.execute(...)` call doing B-tree page reads.

The last log line printed says `Detecting communities with Leiden algorithm (igraph)`, which made it look like Leiden was the culprit — but that line is stale. Leiden actually finished fine. The hang is in the very next step: `_compute_summaries` in `code_review_graph/tools/build.py`, which populates the `community_summaries`, `flow_snapshots`, and `risk_index` tables.

## Why it happens
All three sections of `_compute_summaries` ran per-row / per-community SQLite queries inside Python loops. The worst offender was `risk_index`.

On Godot the `WHERE kind IN ('Function', 'Class', 'Test')` filter matches tens of thousands of nodes, so this loop issues ~100k SQLite round trips. Each query uses the edges index, but the index isn't covering — SQLite still has to fault in the main-table row to check `kind`. Add Python's per-query overhead on top, and the loop is effectively unbounded on a repo this size.

`community_summaries` had the same disease (triple-JOIN aggregate query per community) and `flow_snapshots` fetched node names one id at a time.

## The fix
Same batch-aggregate pattern as the recent community cohesion perf fix — pre-compute what the loops need in a handful of `GROUP BY` queries, then do the rest in memory.

- `risk_index` — two `GROUP BY` aggregates give us `caller_counts` and `tested_counts` dicts. The per-node loop becomes dict lookups.
- `community_summaries` — three aggregate/group queries pre-compute per-node edge counts, per-community node lists, and per-community distinct file paths. The per-community loop just picks top symbols and the path prefix from in-memory lists.
- `flow_snapshots` — collect every node id referenced by any flow, fetch names in one batched `IN (...)` query (chunked at 450 to stay under SQLite's default `SQLITE_MAX_VARIABLE_NUMBER`), then build critical paths from the id→name dict.

Semantics are preserved for every table except one minor incidental fix: the old `community_summaries` top-symbols query used `LEFT JOIN edges e1 LEFT JOIN edges e2 ... COUNT(e1.id) + COUNT(e2.id)`, which actually produces a cartesian product over (in, out) edges. The comment on the original code said the intent was "in+out edge count", so the refactor now uses the correct `in + out` sum. Hub nodes may appear slightly differently in the top-5 list, but the ordering matches what the code was trying to do.

## Test coverage
Three new tests in `tests/test_tools.py::TestComputeSummaries` seed a fixture with two communities, cross-community `CALLS` edges, `TESTED_BY` edges, and a security-keyword-matching node.

- `test_risk_index_populated_with_correct_values` — pins exact `caller_count`, `test_coverage`, `security_relevant`, and `risk_score` values for every node
- `test_community_summaries_populated_with_correct_values` — pins `key_symbols`, `size`, and `dominant_language` for both communities, including the first-ranked symbol (strictly more edges than the rest to catch ordering bugs)
- `test_compute_summaries_does_not_scale_per_node` — attaches a `sqlite3.Connection.set_trace_callback` hook and fails if any per-row `SELECT` (`WHERE target_qualified = 'x'`, `WHERE id = 5`, etc.) fires. This is the real regression guard against someone reintroducing the hot loop

Mutation test validation: I temporarily reintroduced the per-row `caller_count` query and ran the trace test — it failed loudly with exactly 7 offending SQL strings printed in the assertion message. The mutation was then reverted.
## Test plan

- `uv run pytest tests/test_tools.py::TestComputeSummaries` — 3 passed
- `uv run pytest tests/` — 624 passed, 1 skipped, 2 xpassed
- `uv run ruff check code_review_graph/tools/build.py` — clean
- `uv run mypy code_review_graph/tools/build.py --ignore-missing-imports --no-strict-optional` — clean

## Out of scope
- No changes to the formulas for `caller_count`, `risk_score`, `test_coverage`, or `security_relevant`.
- The only semantic change is the `community_summaries.key_symbols` ordering (cartesian → sum), noted above.
- `_compute_summaries` is still private.