perf: remove recursive sub-community splitting, cap Leiden iterations, batch store_communities by dg264 · Pull Request #213 · tirth8205/code-review-graph

dg264 · 2026-04-11T08:36:07Z

Problem

On large monorepos (100k+ files, 300k+ nodes, 1M+ edges), the Leiden community detection hangs indefinitely due to:

Recursive sub-community splitting — After the initial Leiden pass, every community >50 nodes triggers a second full Leiden + cohesion pass via _detect_leiden_sub. With 3k+ communities on large graphs, this compounds exponentially.
Uncapped Leiden iterations — The default runs until convergence, which can take unbounded time on dense code graphs.
Per-member UPDATE in store_communities — Each member node gets an individual UPDATE query, resulting in thousands of sequential writes.

Tested on a production monorepo: 132,353 files → 328k nodes → 1.7M edges.

Scope (per review feedback)

This PR is narrowed to contributions not covered by #183 or #184:

Cohesion O(C×E) fix → handled by #183 (perf: batch community cohesion to stop hangs on large repos; _compute_cohesion_batch) ✅ already merged
tools/build.py batch summaries → handled by #184 (perf: batch _compute_summaries queries to stop hangs on large repos; more complete + regression tests) ✅ defer to #184
This PR → the 4 items below, unique to this branch

Changes (communities.py only)

#	Change	Impact
1	Remove `_detect_leiden_sub`	Eliminates the recursive second-pass Leiden on communities >50 nodes that caused exponential blow-up
2	Cap Leiden iterations (`n_iterations=2`)	Prevents unbounded convergence on dense graphs; 2 passes produce equivalent quality for code dependency graphs
3	Batch UPDATE in `store_communities`	Single `WHERE qualified_name IN (...)` per community instead of per-member UPDATE. Fully parameterized (`# nosec B608` consistent with project pattern in `graph.py`)
4	Progress logging	`logger.info()` at each phase (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor large builds
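A minimal sketch of what phase-boundary logging (item 4) could look like. The `log_phase` helper, the logger name, and the stand-in pipeline are hypothetical and are not taken from communities.py; the real Leiden call is replaced by a placeholder.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("communities_sketch")  # hypothetical logger name


@contextmanager
def log_phase(name):
    """Log when a phase starts and how long it took (hypothetical helper)."""
    start = time.perf_counter()
    logger.info("%s: started", name)
    try:
        yield
    finally:
        logger.info("%s: finished in %.2fs", name, time.perf_counter() - start)


def detect_communities_sketch(edges):
    """Stand-in pipeline: only the logging structure is the point here."""
    with log_phase("node loading"):
        nodes = sorted({node for edge in edges for node in edge})
    with log_phase("clustering"):
        # Placeholder for the real Leiden call: every node lands in community 0.
        membership = {node: 0 for node in nodes}
    return membership


membership = detect_communities_sketch([("a", "b"), ("b", "c")])
```

With `logging.basicConfig(level=logging.INFO)` enabled, each phase prints a start line and a duration line, which makes it possible to tell a slow phase from a hung one on a large build.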

Performance (132k-file monorepo, Apple M3 Pro)

Phase	Before	After
Leiden clustering	>2 hours (hung)	~45 seconds
store_communities	~5 min	~15 seconds

Backward Compatibility

All changes are internal optimizations
Public API, database schema, and output formats are unchanged
Existing graphs do not need migration

Test Plan

Verified full build completes end-to-end on 132k-file monorepo
Verified community count and cohesion values are reasonable (3,273 communities detected)
Verified MCP tools (semantic_search, query_graph, detect_changes) work correctly after build

dg264 · 2026-04-11T08:42:10Z

FYI @tirth8205: I ran this on my local machine, and with this optimization the build now completes. Previously it never finished.

tirth8205 · 2026-04-11T08:51:04Z

Code Review

This PR makes real performance improvements but has overlap and conflict with two other PRs now in queue.

Conflict with #183 (communities.py) — now merged

PR #183 was just merged. It addresses the same cohesion O(C×E) bottleneck via _compute_cohesion_batch (single pass over all edges, community-index bucketing). Your PR solves it via _compute_cohesion_fast (pre-built adjacency dict). The branch now has a merge conflict on communities.py.
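To illustrate the single-pass idea described above, here is a rough sketch of bucketing edge counts by community index in one sweep over the edges. It is not the code from #183. The function name and the cohesion formula (internal edges divided by internal plus boundary edges) are assumptions made for the example.

```python
from collections import defaultdict


def cohesion_batch_sketch(edges, membership):
    """Compute per-community cohesion in a single pass over the edges.

    Assumption: cohesion = internal / (internal + external), where an
    external edge has exactly one endpoint inside the community.
    """
    internal = defaultdict(int)
    external = defaultdict(int)
    for src, dst in edges:
        c_src = membership.get(src)
        c_dst = membership.get(dst)
        if c_src is not None and c_src == c_dst:
            internal[c_src] += 1
            continue
        # Boundary edge: count it once for each community it touches.
        if c_src is not None:
            external[c_src] += 1
        if c_dst is not None:
            external[c_dst] += 1
    result = {}
    for cid in set(membership.values()):
        total = internal[cid] + external[cid]
        result[cid] = internal[cid] / total if total else 0.0
    return result


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
cohesion = cohesion_batch_sketch(edges, membership)
```

Each edge is inspected once regardless of how many communities exist, so the cost is O(E) rather than O(C×E). No adjacency dict is built, which is the memory difference the review mentions.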

Both algorithms are correct O(E) improvements. The merged approach is slightly more memory-efficient (no extra adjacency dict). What is unique and valuable in your PR that #183 does not have:

Removing _detect_leiden_sub: the recursive second-pass Leiden on communities >50 nodes. This is a real fix for the sub-community exponential blow-up on large graphs. #183 kept this code path.
n_iterations=2 for Leiden: capping the iteration count. #183 left this uncapped.
Batch UPDATE nodes SET community_id ... WHERE qualified_name IN (...) in store_communities: reduces per-member UPDATE calls to one per community. Not in #183.
Progress logging throughout detect_communities: not in #183.

Overlap with #184 (tools/build.py) — still open

Your tools/build.py changes are substantively identical to PR #184. Both batch the risk_index using GROUP BY target_qualified + DISTINCT source_qualified. However, #184 is strictly more complete:

#184 also batches community_summaries (in-memory edge count dict) and flow_snapshots (chunked IN(?) queries at 450 to respect SQLITE_MAX_VARIABLE_NUMBER).
#184 ships 3 regression tests including a SQL trace-callback guard that will catch re-introduction of per-row queries.
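The two techniques named above can be sketched together: chunked `IN (...)` lookups, plus a statement counter built on `sqlite3.Connection.set_trace_callback`. This is an illustration rather than the code or tests from #184. The `kind` column, the function name, and the sample data are assumptions; only the chunk size of 450 comes from the review.

```python
import sqlite3

CHUNK_SIZE = 450  # value quoted in the review; keeps each query under the variable limit


def fetch_kinds_chunked(conn, qualified_names, chunk_size=CHUNK_SIZE):
    """Look up many rows with a handful of chunked IN (...) queries."""
    kinds = {}
    for start in range(0, len(qualified_names), chunk_size):
        chunk = qualified_names[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = conn.execute(
            "SELECT qualified_name, kind FROM nodes "
            f"WHERE qualified_name IN ({placeholders})",  # only "?" is interpolated
            chunk,
        )
        kinds.update(rows.fetchall())
    return kinds


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (qualified_name TEXT PRIMARY KEY, kind TEXT)")
names = [f"pkg.mod.func_{i}" for i in range(1000)]
conn.executemany("INSERT INTO nodes VALUES (?, 'Function')", [(n,) for n in names])

# Guard: record every statement SQLite runs, then count the SELECTs.
statements = []
conn.set_trace_callback(statements.append)
kinds = fetch_kinds_chunked(conn, names)
conn.set_trace_callback(None)
select_count = sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))
```

A regression test in this style can assert that `select_count` stays near `len(names) / CHUNK_SIZE` instead of growing with the number of rows, which is how a reintroduced per-row query would show up.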

Your tools/build.py changes should be dropped in favour of #184.

Action needed

Rebase onto current main to pick up #183.
In communities.py: remove _compute_cohesion_fast / _build_adjacency and instead use the already-merged _compute_cohesion_batch for the Leiden path. Keep your unique contributions: removing _detect_leiden_sub, adding n_iterations=2, and the batch UPDATE in store_communities.
Drop all tools/build.py changes; they are superseded by #184.

One security note on the batch UPDATE in store_communities:

f"UPDATE nodes SET community_id = ? WHERE qualified_name IN ({placeholders})",  # nosec B608
[community_id] + member_qns,

placeholders is built from "?" * len(member_qns) so this is fully parameterized — the # nosec B608 annotation is correct and consistent with the project's existing pattern in graph.py.
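For reference, a self-contained sketch of the one-UPDATE-per-community pattern discussed in this note, run against an in-memory database. The helper name and the two-column schema are assumptions for the example. Here the placeholder string is comma-joined so the `IN (...)` list is valid SQL for more than one member.

```python
import sqlite3


def store_community_members(conn, community_id, member_qns):
    """Assign all members of one community with a single parameterized UPDATE."""
    if not member_qns:
        return 0
    placeholders = ",".join("?" * len(member_qns))
    cursor = conn.execute(
        "UPDATE nodes SET community_id = ? "
        f"WHERE qualified_name IN ({placeholders})",  # only "?" marks are interpolated
        [community_id] + list(member_qns),
    )
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (qualified_name TEXT PRIMARY KEY, community_id INTEGER)"
)
conn.executemany(
    "INSERT INTO nodes (qualified_name) VALUES (?)",
    [("a.f",), ("a.g",), ("b.h",)],
)
updated = store_community_members(conn, 7, ["a.f", "a.g"])
rows = conn.execute(
    "SELECT qualified_name, community_id FROM nodes ORDER BY qualified_name"
).fetchall()
```

Only the count of `?` marks reaches the SQL text, so member names never do. One caveat: a very large community could exceed SQLite's bound-variable limit (SQLITE_MAX_VARIABLE_NUMBER), in which case the member list would need chunking. Whether the merged code chunks is not stated in this thread.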

Please rebase, narrow scope to the three unique contributions above, and this will be mergeable.

…, batch store_communities

Narrowed scope per review feedback: the cohesion batch fix and build.py changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on every community >50 nodes caused exponential blow-up on large graphs. With 3k+ communities, each sub-pass re-scanned all edges and ran a full Leiden + cohesion pass. Removed entirely; the first-pass partitioning is sufficient.
2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until convergence, which can take unbounded time on dense code graphs. Two passes produce equivalent partition quality for dependency graphs.
3. **Batch UPDATE in `store_communities`**: Replace per-member `UPDATE nodes SET community_id` with a single `WHERE qualified_name IN (...)` per community. Fully parameterized (nosec B608 is correct).
4. **Progress logging**: Added `logger.info()` at each phase boundary (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase             | Before          | After       |
|-------------------|-----------------|-------------|
| Leiden clustering | >2 hours (hung) | ~45 seconds |
| store_communities | ~5 min          | ~15 seconds |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor

dg264 · 2026-04-11T10:12:15Z

@tirth8205 addressed comments

Unreleased fixes since v2.2.2 that users are complaining about:

- #208 Claude Code hook schema (fixes #97, #138, #163, #168, #172, #182, #188, #191, #201): v2.2.2 generates an invalid hooks schema and timeouts in ms instead of seconds; PreCommit is also not a real event.
- #205 SQLite transaction nesting (fixes #110, #135, #181): implicit transactions from the legacy sqlite3 default caused "cannot start a transaction within a transaction" on update.
- #166 Go method receivers resolved from field_identifier.
- #170 UTF-8 decode errors in detect_changes (fixes #169).
- #142 --platform target filters (fixes #133).
- #213 / #183 large-repo community detection hangs.
- #220 CI lint + tomllib on Python 3.10.
- #159 missing pytest-cov dev dep.
- #154 JSX component CALLS edges.

Plus features: #177 Codex, #165 Luau (#153), #217 REFERENCES edge, #215 recurse_submodules, #185 gitignore default (#175), #171 gitignore docs (#157).

Verified locally on Python 3.11: ruff clean, mypy clean, bandit clean, 691 tests pass, coverage 73.72%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dg264 force-pushed the perf-optimize-large-repos branch from 39f15a1 to 5f1fe15 · April 11, 2026 09:07

dg264 changed the title from ~~perf: optimize community detection and risk index for large repos (100k+ files)~~ to perf: remove recursive sub-community splitting, cap Leiden iterations, batch store_communities · Apr 11, 2026

Merge branch 'main' into perf-optimize-large-repos

48cbfd1

tirth8205 merged commit 533e581 into tirth8205:main Apr 11, 2026
9 checks passed

tirth8205 mentioned this pull request Apr 11, 2026

chore: release v2.2.3 #221

Merged

5 tasks

tirth8205 merged 2 commits into tirth8205:main from dg264:perf-optimize-large-repos
