perf: batch _compute_summaries queries to stop hangs on large repos #184

realkotob wants to merge 2 commits into tirth8205:main from …
Conversation
Code Review — Approved, needs rebase

This is excellent work. The batch-aggregate refactor is correct, well-reasoned, and the test coverage is outstanding.

One minor issue: in

```python
node_rows = conn.execute(  # nosec B608
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",
    batch,
).fetchall()
```

the `# nosec B608` comment belongs on the f-string line it suppresses:

```python
node_rows = conn.execute(
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",  # nosec B608
    batch,
).fetchall()
```

This is a nit — the code is correct either way since …

Action needed: This branch has a merge conflict with …
…, batch store_communities

Narrowed scope per review feedback — cohesion batch fix and build.py changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on every community >50 nodes caused exponential blow-up on large graphs. With 3k+ communities, each sub-pass re-scanned all edges and ran a full Leiden + cohesion pass. Removed entirely — the first-pass partitioning is sufficient.
2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until convergence, which can take unbounded time on dense code graphs. Two passes produce equivalent partition quality for dependency graphs.
3. **Batch UPDATE in `store_communities`**: Replace per-member `UPDATE nodes SET community_id` with a single `WHERE qualified_name IN (...)` per community. Fully parameterized (nosec B608 is correct).
4. **Progress logging**: Added `logger.info()` at each phase boundary (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase             | Before          | After       |
|-------------------|-----------------|-------------|
| Leiden clustering | >2 hours (hung) | ~45 seconds |
| store_communities | ~5 min          | ~15 seconds |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
Force-pushed 6eab902 to 175e67e
Thanks for the thorough review @tirth8205! I've addressed your feedback — moved the `# nosec B608` comment onto the f-string line.
…, batch store_communities (#213)

Made-with: Cursor
Co-authored-by: Cursor <noreply@cursor.com>
Co-authored-by: Tirth Kanani <tirthkanani18@gmail.com>
Running `code-review-graph build` on the Godot source (~9k files,
~100k edges, ~93k nodes) effectively hung during post-processing even
after a previous perf fix to community detection. A sampled stack
trace showed the Python process pinned 100% CPU inside a single
SQLite `.execute(...)` call, doing B-tree page reads:
pysqlite_connection_execute
_pysqlite_query_execute
sqlite3_step
sqlite3VdbeExec
sqlite3VdbeFinishMoveto
sqlite3BtreeTableMoveto
moveToChild -> getPageNormal -> pread
The last printed log said "Detecting communities with Leiden algorithm
(igraph)", which made it look like Leiden was the culprit — but that
was a stale line. Leiden had actually finished; the hang was in the
very next step: `_compute_summaries` in code_review_graph/tools/build.py,
which populates the community_summaries, flow_snapshots, and
risk_index tables.
Why it hung:
All three sections of `_compute_summaries` ran per-row / per-community
SQLite queries inside Python loops. The worst offender was
risk_index, which ran two COUNT(*) queries against the edges table
for every Function/Class/Test node:
    for n in nodes:  # qn below is n's qualified_name; the binding is elided here
        caller_count = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE target_qualified = ? "
            "AND kind = 'CALLS'", (qn,),
        ).fetchone()[0]
        tested = conn.execute(
            "SELECT COUNT(*) FROM edges WHERE source_qualified = ? "
            "AND kind = 'TESTED_BY'", (qn,),
        ).fetchone()[0]
On Godot that filter matches tens of thousands of nodes, so the loop
issues ~100k round trips. Each individual query uses the edges index
on target_qualified / source_qualified, but the index isn't covering
— SQLite still has to fault in the main-table row to check `kind`.
Add Python's per-query overhead and the loop is effectively
unbounded on a repo this size.
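The "index isn't covering" claim can be checked directly with `EXPLAIN QUERY PLAN`. A standalone sketch with a hypothetical minimal `edges` table — the actual fix batches the queries rather than changing indexes, so this is illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (id INTEGER PRIMARY KEY, "
    "source_qualified TEXT, target_qualified TEXT, kind TEXT)"
)
conn.execute("CREATE INDEX idx_target ON edges (target_qualified)")

QUERY = ("EXPLAIN QUERY PLAN SELECT COUNT(*) FROM edges "
         "WHERE target_qualified = ? AND kind = 'CALLS'")

# With an index on target_qualified alone, SQLite must still read the
# main-table row to evaluate the `kind` predicate.
detail_plain = conn.execute(QUERY, ("x",)).fetchone()[3]
print(detail_plain)     # e.g. SEARCH edges USING INDEX idx_target (...)

# Adding `kind` makes the index covering: the whole filter is answered
# from index pages, with no main-table faults.
conn.execute("CREATE INDEX idx_target_kind ON edges (target_qualified, kind)")
detail_covering = conn.execute(QUERY, ("x",)).fetchone()[3]
print(detail_covering)  # e.g. SEARCH edges USING COVERING INDEX idx_target_kind (...)
```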
`community_summaries` had the same disease with a triple-JOIN
aggregate query per community, and `flow_snapshots` fetched node
names one id at a time.
The fix:
Same batch-aggregate pattern that fixed the community cohesion perf
bug — pre-compute what the loops need in a handful of `GROUP BY`
queries, then do the rest in memory.
- risk_index: two `GROUP BY` aggregates give us `caller_counts`
and `tested_counts` dicts. The per-node loop becomes dict
lookups.
- community_summaries: three aggregate/group queries pre-compute
per-node edge counts, per-community node lists, and per-community
distinct file paths. The per-community loop picks top symbols
and the path prefix from in-memory lists.
- flow_snapshots: collect every node id referenced by any flow,
fetch names in one batched `IN (...)` query (chunked at 450 to
stay under SQLite's default `SQLITE_MAX_VARIABLE_NUMBER`), then
build critical paths from the id→name dict.
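As a sketch of the risk_index half of that pattern (hypothetical minimal schema; the real code lives in `_compute_summaries`), the two per-node queries collapse into two `GROUP BY` aggregates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT, kind TEXT);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                        target_qualified TEXT, kind TEXT);
    INSERT INTO nodes VALUES (1, 'pkg.f', 'Function'), (2, 'pkg.g', 'Function');
    INSERT INTO edges VALUES
        (1, 'pkg.g', 'pkg.f', 'CALLS'),
        (2, 'pkg.f', 'pkg.f', 'CALLS'),
        (3, 'pkg.f', 'tests.t', 'TESTED_BY');
""")

# One aggregate per metric, instead of two SELECTs per node.
caller_counts = dict(conn.execute(
    "SELECT target_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'CALLS' GROUP BY target_qualified"
))
tested_counts = dict(conn.execute(
    "SELECT source_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'TESTED_BY' GROUP BY source_qualified"
))

# The hot loop is now pure dict lookups — zero SQL per node.
for _node_id, qn in conn.execute("SELECT id, qualified_name FROM nodes ORDER BY id"):
    print(qn, caller_counts.get(qn, 0), tested_counts.get(qn, 0))
# pkg.f 2 1
# pkg.g 0 0
```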
Semantics are preserved for every table except one minor incidental
fix: the old `community_summaries` top-symbols query used
`LEFT JOIN edges e1 LEFT JOIN edges e2 ... COUNT(e1.id) + COUNT(e2.id)`,
which produces a cartesian product over (in, out) edges. The comment
on the original code said the intent was "in+out edge count", so the
refactor now uses the correct `in + out` sum. Hub nodes may appear
slightly differently in the top-5 list, but the ordering is more
sensible and matches what the code claimed to do.
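The double-LEFT-JOIN blow-up is easy to reproduce: a toy node with two in-edges and three out-edges yields 2 × 3 = 6 joined rows, so each COUNT sees 6 and the "in+out" score comes out as 12 instead of 5. A minimal sketch (hypothetical schema, not the project's actual query text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT);
    CREATE TABLE edges (id INTEGER PRIMARY KEY, source_qualified TEXT,
                        target_qualified TEXT);
    INSERT INTO nodes VALUES (1, 'hub');
    -- two in-edges and three out-edges for 'hub'
    INSERT INTO edges VALUES (1, 'a', 'hub'), (2, 'b', 'hub'),
                             (3, 'hub', 'c'), (4, 'hub', 'd'), (5, 'hub', 'e');
""")

# Old query shape: the two LEFT JOINs multiply rows (2 * 3 = 6),
# so each COUNT scans 6 rows and the sum is 12.
(buggy,) = conn.execute("""
    SELECT COUNT(e1.id) + COUNT(e2.id) FROM nodes n
    LEFT JOIN edges e1 ON e1.target_qualified = n.qualified_name
    LEFT JOIN edges e2 ON e2.source_qualified = n.qualified_name
""").fetchone()

# Correct in+out: count each direction independently, then add.
(fixed,) = conn.execute("""
    SELECT (SELECT COUNT(*) FROM edges WHERE target_qualified = n.qualified_name)
         + (SELECT COUNT(*) FROM edges WHERE source_qualified = n.qualified_name)
    FROM nodes n
""").fetchone()

print(buggy, fixed)  # 12 5
```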
Test coverage:
Three new tests in `tests/test_tools.py::TestComputeSummaries` seed
a fixture with two communities, cross-community CALLS edges,
TESTED_BY edges, and a security-keyword-matching node, then:
1. `test_risk_index_populated_with_correct_values` — pins exact
caller_count, test_coverage, security_relevant, and risk_score
values for each node.
2. `test_community_summaries_populated_with_correct_values` —
pins key_symbols, size, and dominant_language for both
communities, including the first-ranked symbol (which has
strictly more edges than the others to catch ordering bugs).
3. `test_compute_summaries_does_not_scale_per_node` — attaches
a `sqlite3.Connection.set_trace_callback` hook and fails if
any per-row SELECT (`WHERE target_qualified = 'x'`,
`WHERE id = 5`, etc.) fires. This is the real regression
guard against someone reintroducing the hot loop.
All three tests were validated with a mutation test (reintroduced
the per-row caller_count query) — the trace test caught the
regression with the offending SQL printed in the failure message,
then the mutation was reverted.
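The trace-hook guard works roughly like this (a sketch, not the actual test; `sqlite3.Connection.set_trace_callback` invokes the callable once per executed statement):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (id INTEGER PRIMARY KEY, target_qualified TEXT, kind TEXT)")

seen: list[str] = []
conn.set_trace_callback(seen.append)  # record every SQL statement as it executes

# Simulate the regression: one SELECT per node.
for qn in ("a", "b", "c"):
    conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
        (qn,),
    ).fetchone()

conn.set_trace_callback(None)

# The guard scans the trace for per-row query shapes; the real test
# asserts this list is empty and prints the offenders when it is not.
per_row = [s for s in seen if "WHERE target_qualified =" in s]
print(len(per_row))  # 3
```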
Full suite: 624 passed, 1 skipped, 2 xpassed. Ruff clean on
`code_review_graph/tools/build.py`, mypy clean.
- Split long SQL string in communities.py to stay under 100-char limit
- Remove trailing whitespace in parser.py
- Rename uppercase local variables in skills.py (N806)
- Guard tomllib import in test_skills.py for Python 3.10 compat
Force-pushed 6bd762d to 441472f
Rebased on main and resolved conflicts again (`communities.py` SQL formatting and `test_skills.py` imports). All 743 tests pass locally.
## The problem

Running `code-review-graph build` on a large source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook M1 Pro effectively hangs during post-processing. A sampled stack trace from the hung Python process shows it pinned at 100% CPU inside a single SQLite `.execute(...)` call doing B-tree page reads.

The last log line printed says `Detecting communities with Leiden algorithm (igraph)`, which made it look like Leiden was the culprit — but that line is stale. Leiden actually finished fine. The hang is in the very next step: `_compute_summaries` in `code_review_graph/tools/build.py`, which populates the `community_summaries`, `flow_snapshots`, and `risk_index` tables.

## Why it happens
All three sections of `_compute_summaries` ran per-row / per-community SQLite queries inside Python loops. The worst offender was `risk_index`.

On Godot the `WHERE kind IN ('Function', 'Class', 'Test')` filter matches tens of thousands of nodes, so this loop issues ~100k SQLite round trips. Each query uses the edges index, but the index isn't covering — SQLite still has to fault in the main-table row to check `kind`. Add Python's per-query overhead on top, and the loop is effectively unbounded on a repo this size.

`community_summaries` had the same disease (triple-JOIN aggregate query per community) and `flow_snapshots` fetched node names one id at a time.

## The fix
Same batch-aggregate pattern as the recent community cohesion perf fix — pre-compute what the loops need in a handful of `GROUP BY` queries, then do the rest in memory.

- `risk_index` — two `GROUP BY` aggregates give us `caller_counts` and `tested_counts` dicts. The per-node loop becomes dict lookups.
- `community_summaries` — three aggregate/group queries pre-compute per-node edge counts, per-community node lists, and per-community distinct file paths. The per-community loop just picks top symbols and the path prefix from in-memory lists.
- `flow_snapshots` — collect every node id referenced by any flow, fetch names in one batched `IN (...)` query (chunked at 450 to stay under SQLite's default `SQLITE_MAX_VARIABLE_NUMBER`), then build critical paths from the id→name dict.

Semantics are preserved for every table except one minor incidental fix: the old `community_summaries` top-symbols query used `LEFT JOIN edges e1 LEFT JOIN edges e2 ... COUNT(e1.id) + COUNT(e2.id)`, which actually produces a cartesian product over (in, out) edges. The comment on the original code said the intent was "in+out edge count", so the refactor now uses the correct `in + out` sum. Hub nodes may appear slightly differently in the top-5 list, but the ordering matches what the code was trying to do.

## Test coverage
Three new tests in `tests/test_tools.py::TestComputeSummaries` seed a fixture with two communities, cross-community `CALLS` edges, `TESTED_BY` edges, and a security-keyword-matching node.

- `test_risk_index_populated_with_correct_values` — pins exact `caller_count`, `test_coverage`, `security_relevant`, and `risk_score` values for every node
- `test_community_summaries_populated_with_correct_values` — pins `key_symbols`, `size`, and `dominant_language` for both communities, including the first-ranked symbol (strictly more edges than the rest to catch ordering bugs)
- `test_compute_summaries_does_not_scale_per_node` — attaches a `sqlite3.Connection.set_trace_callback` hook and fails if any per-row `SELECT` (`WHERE target_qualified = 'x'`, `WHERE id = 5`, etc.) fires. This is the real regression guard against someone reintroducing the hot loop

Mutation test validation: I temporarily reintroduced the per-row `caller_count` query and ran the trace test — it failed loudly with exactly 7 offending SQL strings printed in the assertion message. The mutation was then reverted.
## Test plan

- `uv run pytest tests/test_tools.py::TestComputeSummaries` — 3 passed
- `uv run pytest tests/` — 624 passed, 1 skipped, 2 xpassed
- `uv run ruff check code_review_graph/tools/build.py` — clean
- `uv run mypy code_review_graph/tools/build.py --ignore-missing-imports --no-strict-optional` — clean

## Out of scope
- No changes to the formulas for `caller_count`, `risk_score`, `test_coverage`, or `security_relevant`.
- The only semantic change is the `community_summaries.key_symbols` ordering (cartesian → sum), noted above.
- `_compute_summaries` is still private.