
perf: batch _compute_summaries queries to stop hangs on large repos#184

Open
realkotob wants to merge 2 commits into tirth8205:main from realkotob:perf/batch-compute-summaries

Conversation

@realkotob
Contributor

@realkotob realkotob commented Apr 9, 2026

The problem

Running code-review-graph build on the Godot source tree (~9k files, ~100k edges, ~93k nodes) on my MacBook M1 Pro effectively hangs during post-processing. A sampled stack trace of the hung Python process shows it pinned at 100% CPU inside a single SQLite .execute(...) call doing B-tree page reads:

pysqlite_connection_execute
  _pysqlite_query_execute
    sqlite3_step
      sqlite3VdbeExec
        sqlite3VdbeFinishMoveto
          sqlite3BtreeTableMoveto
            moveToChild -> getPageNormal -> pread

The last log line printed says Detecting communities with Leiden algorithm (igraph), which made it look like Leiden was the culprit — but that line is stale. Leiden actually finished fine. The hang is in the very next step: _compute_summaries in code_review_graph/tools/build.py, which populates the community_summaries, flow_snapshots, and risk_index tables.

Why it happens

All three sections of _compute_summaries ran per-row / per-community SQLite queries inside Python loops. The worst offender was risk_index:

for n in nodes:
    qn = n["qualified_name"]  # the node's qualified name (binding shown for clarity)
    caller_count = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE target_qualified = ? "
        "AND kind = 'CALLS'", (qn,),
    ).fetchone()[0]
    tested = conn.execute(
        "SELECT COUNT(*) FROM edges WHERE source_qualified = ? "
        "AND kind = 'TESTED_BY'", (qn,),
    ).fetchone()[0]

On Godot the WHERE kind IN ('Function', 'Class', 'Test') filter matches tens of thousands of nodes, so this loop issues ~100k SQLite round trips. Each query uses the edges index, but the index isn't covering — SQLite still has to fault in the main-table row to check kind. Add Python's per-query overhead on top, and the loop is effectively unbounded on a repo this size.

community_summaries had the same disease (triple-JOIN aggregate query per community) and flow_snapshots fetched node names one id at a time.

The fix

Same batch-aggregate pattern as the recent community cohesion perf fix — pre-compute what the loops need in a handful of GROUP BY queries, then do the rest in memory.

  • risk_index — two GROUP BY aggregates give us caller_counts and tested_counts dicts. The per-node loop becomes dict lookups.
  • community_summaries — three aggregate/group queries pre-compute per-node edge counts, per-community node lists, and per-community distinct file paths. The per-community loop just picks top symbols and the path prefix from in-memory lists.
  • flow_snapshots — collect every node id referenced by any flow, fetch names in one batched IN (...) query (chunked at 450 to stay under SQLite's default SQLITE_MAX_VARIABLE_NUMBER), then build critical paths from the id→name dict.
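As a rough illustration of the risk_index change — using a pared-down stand-in for the real edges schema, not the project's actual tables — the batch-aggregate pattern looks like this:

```python
import sqlite3

# Toy schema standing in for the real edges table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (source_qualified TEXT, target_qualified TEXT, kind TEXT)"
)
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("a.py::f", "b.py::g", "CALLS"),
        ("a.py::h", "b.py::g", "CALLS"),
        ("b.py::g", "tests::test_g", "TESTED_BY"),
    ],
)

# One GROUP BY per metric instead of one COUNT(*) round trip per node.
caller_counts = dict(conn.execute(
    "SELECT target_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'CALLS' GROUP BY target_qualified"
))
tested_counts = dict(conn.execute(
    "SELECT source_qualified, COUNT(*) FROM edges "
    "WHERE kind = 'TESTED_BY' GROUP BY source_qualified"
))

# The per-node loop becomes plain dict lookups, no SQL inside the loop.
for qn in ("b.py::g", "a.py::f"):
    # b.py::g has 2 callers and 1 TESTED_BY edge; a.py::f has none
    print(qn, caller_counts.get(qn, 0), tested_counts.get(qn, 0))
```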

Semantics are preserved for every table except one minor incidental fix: the old community_summaries top-symbols query used LEFT JOIN edges e1 LEFT JOIN edges e2 ... COUNT(e1.id) + COUNT(e2.id), which actually produces a cartesian product over (in, out) edges. The comment on the original code said the intent was "in+out edge count", so the refactor now uses the correct in + out sum. Hub nodes may appear slightly differently in the top-5 list, but the ordering matches what the code was trying to do.
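The overcounting is easy to reproduce in isolation. This self-contained snippet uses a toy schema (not the project's real one) to show the double LEFT JOIN producing 12 where the intended in + out sum is 5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (qn TEXT)")
conn.execute("CREATE TABLE edges (id INTEGER PRIMARY KEY, src TEXT, dst TEXT)")
conn.execute("INSERT INTO nodes VALUES ('n')")
# Node 'n' has 2 incoming and 3 outgoing edges, so in + out should be 5.
conn.executemany(
    "INSERT INTO edges (src, dst) VALUES (?, ?)",
    [("a", "n"), ("b", "n"), ("n", "x"), ("n", "y"), ("n", "z")],
)

# Old shape: the double LEFT JOIN yields one row per (in-edge, out-edge)
# pair -- 2 x 3 = 6 rows -- so each COUNT sees 6 and the sum is 12, not 5.
bad = conn.execute(
    "SELECT COUNT(e1.id) + COUNT(e2.id) FROM nodes "
    "LEFT JOIN edges e1 ON e1.dst = nodes.qn "
    "LEFT JOIN edges e2 ON e2.src = nodes.qn"
).fetchone()[0]

# Correct in + out sum via independent scalar subqueries.
good = conn.execute(
    "SELECT (SELECT COUNT(*) FROM edges WHERE dst = nodes.qn) + "
    "(SELECT COUNT(*) FROM edges WHERE src = nodes.qn) FROM nodes"
).fetchone()[0]

print(bad, good)  # 12 5
```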

Test coverage

Three new tests in tests/test_tools.py::TestComputeSummaries seed a fixture with two communities, cross-community CALLS edges, TESTED_BY edges, and a security-keyword-matching node.

| Test | What it guards |
| --- | --- |
| test_risk_index_populated_with_correct_values | Pins exact caller_count, test_coverage, security_relevant, and risk_score values for every node |
| test_community_summaries_populated_with_correct_values | Pins key_symbols, size, and dominant_language for both communities, including the first-ranked symbol (strictly more edges than the rest, to catch ordering bugs) |
| test_compute_summaries_does_not_scale_per_node | Attaches a sqlite3.Connection.set_trace_callback hook and fails if any per-row SELECT (WHERE target_qualified = 'x', WHERE id = 5, etc.) fires. This is the real regression guard against someone reintroducing the hot loop |
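A minimal sketch of the trace-callback guard, with a simplified schema and a hypothetical per-row pattern regex (the real test's matching rules may differ):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (target_qualified TEXT, kind TEXT)")

# Record every statement SQLite actually executes on this connection.
statements: list[str] = []
conn.set_trace_callback(statements.append)

# A reintroduced per-row hot-loop query would look like this:
conn.execute(
    "SELECT COUNT(*) FROM edges WHERE target_qualified = ? AND kind = 'CALLS'",
    ("auth.py::login",),
)

# Flag any SELECT filtering on a single row's key (illustrative pattern).
per_row = re.compile(r"WHERE\s+(target_qualified|source_qualified|id)\s*=")
offenders = [
    s for s in statements
    if s.lstrip().startswith("SELECT") and per_row.search(s)
]
print(len(offenders))  # 1 offending per-row SELECT captured
```

In the actual test the assertion fails if offenders is non-empty, printing the offending SQL strings as shown in the mutation-test output below.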

Mutation test validation: I temporarily reintroduced the per-row caller_count query and ran the trace test — it failed loudly with exactly 7 offending SQL strings printed in the assertion message:

AssertionError: _compute_summaries issued 7 per-row SELECTs — the batch-aggregate refactor has regressed:
  - SELECT COUNT(*) FROM edges WHERE target_qualified = 'auth.py::login' AND kind = 'CALLS'
  - SELECT COUNT(*) FROM edges WHERE target_qualified = 'auth.py::logout' AND kind = 'CALLS'
  ...

Then the mutation was reverted.

Test plan

  • uv run pytest tests/test_tools.py::TestComputeSummaries — 3 passed
  • uv run pytest tests/ — 624 passed, 1 skipped, 2 xpassed
  • uv run ruff check code_review_graph/tools/build.py — clean
  • uv run mypy code_review_graph/tools/build.py --ignore-missing-imports --no-strict-optional — clean
  • Mutation test: reintroduced per-row query, confirmed trace test catches it, reverted.

Out of scope

  • No schema changes. No migration. No changes to the semantics of caller_count, risk_score, test_coverage, or security_relevant.
  • Minor incidental fix to community_summaries.key_symbols ordering (cartesian → sum), noted above.
  • No change to public API. _compute_summaries is still private.
  • Independent of and compatible with #183 (perf: batch community cohesion to stop hangs on large repos — the community cohesion batch refactor); the two PRs touch disjoint files.

@tirth8205
Owner

Code Review — Approved, needs rebase

This is excellent work. The batch-aggregate refactor is correct, well-reasoned, and the test coverage is outstanding:

  • risk_index: Two GROUP BY queries instead of 2×N per-node SELECT COUNT(*). Semantically identical to original.
  • community_summaries: Pre-computes edge counts and file paths in memory — fixes the cartesian-product bug in the original LEFT JOIN edges e1 LEFT JOIN edges e2 (which produces a cross product over in/out edges). The comment in the PR description correctly identifies this as an incidental correctness fix, not just a perf fix.
  • flow_snapshots: Chunked IN(?) at 450 to stay under SQLITE_MAX_VARIABLE_NUMBER. Correct pattern, matches what GraphStore.get_edges_among already does.
  • Tests: The SQL trace callback regression guard in test_compute_summaries_does_not_scale_per_node is excellent — it will catch any future per-row regression immediately with a clear error message.

One minor issue: In the flow_snapshots section, the chunked node fetch uses:

node_rows = conn.execute(  # nosec B608
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",
    batch,
).fetchall()

The # nosec B608 comment is on conn.execute( rather than the f-string line. Move it to the f-string line for consistency with the rest of the codebase:

node_rows = conn.execute(
    "SELECT id, qualified_name FROM nodes "
    f"WHERE id IN ({placeholders})",  # nosec B608
    batch,
).fetchall()

This is a nit — the code is correct either way since placeholders is built from "?" * len(batch).
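For reference, here is the chunked-IN pattern sketched against a toy nodes table; the chunk size matches the PR, but the helper name and schema here are illustrative:

```python
import sqlite3

CHUNK = 450  # stay under SQLite's default SQLITE_MAX_VARIABLE_NUMBER


def fetch_names(conn: sqlite3.Connection, node_ids: list[int]) -> dict[int, str]:
    """Fetch id -> qualified_name in batched IN (...) queries."""
    names: dict[int, str] = {}
    for start in range(0, len(node_ids), CHUNK):
        batch = node_ids[start:start + CHUNK]
        placeholders = ", ".join("?" * len(batch))
        rows = conn.execute(
            "SELECT id, qualified_name FROM nodes "
            f"WHERE id IN ({placeholders})",  # nosec B608 - placeholders are all '?'
            batch,
        ).fetchall()
        names.update(rows)
    return names


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(i, f"mod.py::fn{i}") for i in range(1000)],
)
names = fetch_names(conn, list(range(1000)))
print(len(names))  # 1000 names fetched in 3 chunked queries
```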

Action needed: This branch has a merge conflict with main. Please rebase onto current main and fix the conflict. After that this is ready to merge.

dg264 added a commit to dg264/code-review-graph that referenced this pull request Apr 11, 2026
…, batch store_communities

Narrowed scope per review feedback — cohesion batch fix and build.py
changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes
only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on
   every community >50 nodes caused exponential blow-up on large graphs.
   With 3k+ communities, each sub-pass re-scanned all edges and ran
   a full Leiden + cohesion pass. Removed entirely — the first-pass
   partitioning is sufficient.

2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until
   convergence, which can take unbounded time on dense code graphs.
   Two passes produce equivalent partition quality for dependency graphs.

3. **Batch UPDATE in `store_communities`**: Replace per-member
   `UPDATE nodes SET community_id` with a single
   `WHERE qualified_name IN (...)` per community. Fully parameterized
   (nosec B608 is correct).

4. **Progress logging**: Added `logger.info()` at each phase boundary
   (node loading, igraph construction, Leiden execution, cohesion,
   completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase                | Before             | After          |
|----------------------|--------------------|----------------|
| Leiden clustering    | >2 hours (hung)    | ~45 seconds    |
| store_communities    | ~5 min             | ~15 seconds    |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
@realkotob realkotob force-pushed the perf/batch-compute-summaries branch from 6eab902 to 175e67e on April 11, 2026 12:36
@realkotob
Contributor Author

Thanks for the thorough review @tirth8205! I've addressed your feedback — moved the # nosec B608 comment to the f-string line for consistency. I also rebased the branch on main (resolved the docstring conflict in build.py) and fixed the CI failures (lint errors + Python 3.10 tomllib import issue in test_skills.py). All 664 tests passing locally.

tirth8205 added a commit that referenced this pull request Apr 11, 2026
…, batch store_communities (#213)

@realkotob realkotob force-pushed the perf/batch-compute-summaries branch from 6bd762d to 441472f on April 12, 2026 12:33
@realkotob
Contributor Author

Rebased on main and resolved conflicts again (communities.py SQL formatting and test_skills.py imports). All 743 tests pass locally.
