perf: remove recursive sub-community splitting, cap Leiden iterations, batch store_communities #213

Merged
tirth8205 merged 2 commits into tirth8205:main from dg264:perf-optimize-large-repos
Apr 11, 2026

Conversation

@dg264
Contributor

@dg264 dg264 commented Apr 11, 2026

Problem

On large monorepos (100k+ files, 300k+ nodes, 1M+ edges), the Leiden community detection hangs indefinitely due to:

  1. Recursive sub-community splitting — After the initial Leiden pass, every community >50 nodes triggers a second full Leiden + cohesion pass via _detect_leiden_sub. With 3k+ communities on large graphs, this compounds exponentially.
  2. Uncapped Leiden iterations — The default runs until convergence, which can take unbounded time on dense code graphs.
  3. Per-member UPDATE in store_communities — Each member node gets an individual UPDATE query, resulting in thousands of sequential writes.
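
To make the second point concrete: in both python-igraph and leidenalg, a negative `n_iterations` means "repeat until an iteration yields no improvement", while a positive value is a hard cap. The sketch below mimics that contract with a toy refinement step. `run_capped` and `halve` are hypothetical stand-ins for illustration, not code from communities.py or from either library.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def run_capped(
    step: Callable[[S], Tuple[S, bool]],
    state: S,
    n_iterations: int = 2,
) -> Tuple[S, int]:
    """Apply `step` until it reports no change or the cap is reached.

    A negative `n_iterations` means "run until convergence", which is
    the behaviour the PR replaces with a fixed cap of 2.
    """
    done = 0
    while n_iterations < 0 or done < n_iterations:
        state, changed = step(state)
        done += 1
        if not changed:
            break
    return state, done


def halve(x: int) -> Tuple[int, bool]:
    """Toy refinement step (hypothetical): halve until the value reaches 1."""
    new = max(1, x // 2)
    return new, new != x


capped = run_capped(halve, 1024, n_iterations=2)       # (256, 2)
unbounded = run_capped(halve, 1024, n_iterations=-1)   # (1, 11)
```

The capped run stops after a fixed amount of work regardless of input size; the unbounded run's cost depends on how long the step keeps finding improvements.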

Tested on a production monorepo: 132,353 files → 328k nodes → 1.7M edges.

Scope (per review feedback)

This PR is narrowed to contributions not covered by #183 or #184:

Changes (communities.py only)

| # | Change | Impact |
|---|--------|--------|
| 1 | Remove _detect_leiden_sub | Eliminates the recursive second-pass Leiden on communities >50 nodes that caused exponential blow-up |
| 2 | Cap Leiden iterations (n_iterations=2) | Prevents unbounded convergence on dense graphs; 2 passes produce equivalent quality for code dependency graphs |
| 3 | Batch UPDATE in store_communities | Single WHERE qualified_name IN (...) per community instead of per-member UPDATE. Fully parameterized (# nosec B608 consistent with project pattern in graph.py) |
| 4 | Progress logging | logger.info() at each phase (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor large builds |
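
As an illustration of the progress-logging change, phase-boundary logging could be packaged as a small context manager. This is a sketch only: the PR states that `logger.info()` is called at each phase, while the `phase` helper, the logger name, and the elapsed-time message format are assumptions made here.

```python
import logging
import time
from contextlib import contextmanager

# Logger name is an assumption; the project may use a different one.
logger = logging.getLogger("communities")


@contextmanager
def phase(name):
    """Log the start and end of a named build phase with elapsed time."""
    start = time.perf_counter()
    logger.info("%s: started", name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s: finished in %.1fs", name, elapsed)


logging.basicConfig(level=logging.INFO)

with phase("node loading"):
    nodes = list(range(1000))  # placeholder for the real work
```

Because the end message is emitted in a `finally` clause, it is logged even when the phase raises.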

Performance (132k-file monorepo, Apple M3 Pro)

| Phase | Before | After |
|-------|--------|-------|
| Leiden clustering | >2 hours (hung) | ~45 seconds |
| store_communities | ~5 min | ~15 seconds |

Backward Compatibility

  • All changes are internal optimizations
  • Public API, database schema, and output formats are unchanged
  • Existing graphs do not need migration

Test Plan

  • Verified full build completes end-to-end on 132k-file monorepo
  • Verified community count and cohesion values are reasonable (3,273 communities detected)
  • Verified MCP tools (semantic_search, query_graph, detect_changes) work correctly after build

@dg264
Contributor Author

dg264 commented Apr 11, 2026

FYI @tirth8205: I ran this on my local machine, and with this optimization the build now completes. Previously it never finished.

@tirth8205
Owner

Code Review

This PR makes real performance improvements but has overlap and conflict with two other PRs now in queue.

Conflict with #183 (communities.py) — now merged

PR #183 was just merged. It addresses the same cohesion O(C×E) bottleneck via _compute_cohesion_batch (single pass over all edges, community-index bucketing). Your PR solves it via _compute_cohesion_fast (pre-built adjacency dict). The branch now has a merge conflict on communities.py.
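
For readers unfamiliar with the single-pass idea, here is a minimal sketch of bucketing edges by community in one scan. It is not the merged `_compute_cohesion_batch`. The function name, the input shapes, and the cohesion definition (internal edges divided by all edges touching the community) are assumptions chosen for illustration.

```python
from collections import defaultdict


def cohesion_single_pass(membership, edges):
    """Score every community in one scan over the edge list.

    membership: dict mapping node name -> community index
    edges: iterable of (source, target) node-name pairs
    Assumed definition: internal edges / edges touching the community.
    """
    internal = defaultdict(int)
    touching = defaultdict(int)
    for src, dst in edges:
        c_src = membership.get(src)
        c_dst = membership.get(dst)
        if c_src is None or c_dst is None:
            continue  # skip edges with an endpoint outside any community
        if c_src == c_dst:
            internal[c_src] += 1
            touching[c_src] += 1
        else:
            touching[c_src] += 1
            touching[c_dst] += 1
    return {
        c: (internal[c] / touching[c] if touching[c] else 0.0)
        for c in set(membership.values())
    }


membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
scores = cohesion_single_pass(membership, edges)
```

Each edge is visited once, so the cost is O(E) overall rather than O(C×E) from rescanning the edge list for every community.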

Both algorithms are correct O(E) improvements. The merged approach is slightly more memory-efficient (no extra adjacency dict). What is unique and valuable in your PR that #183 does not have:

Overlap with #184 (tools/build.py) — still open

Your tools/build.py changes are substantively identical to PR #184. Both batch the risk_index using GROUP BY target_qualified + DISTINCT source_qualified. However #184 is strictly more complete:

Your tools/build.py changes should be dropped in favour of #184.
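
For context, the GROUP BY plus DISTINCT pattern mentioned above can be sketched with the standard sqlite3 module. The column names come from the review comment. The `edges` table name, the sample rows, and the idea that a distinct-caller count feeds the risk index are assumptions, not details confirmed by either PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the two column names appear in the review.
conn.execute("CREATE TABLE edges (source_qualified TEXT, target_qualified TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [
        ("a.f", "lib.util"),
        ("a.f", "lib.util"),  # second call site from the same caller
        ("b.g", "lib.util"),
        ("b.g", "lib.io"),
    ],
)


def distinct_caller_counts(conn):
    """Return {target: number of distinct sources} using a single query."""
    rows = conn.execute(
        "SELECT target_qualified, COUNT(DISTINCT source_qualified) "
        "FROM edges GROUP BY target_qualified"
    )
    return dict(rows.fetchall())


counts = distinct_caller_counts(conn)
```

One aggregate query replaces a separate lookup per target node, which is where the batching gain comes from.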

Action needed

  1. Rebase onto current main to pick up #183 (perf: batch community cohesion to stop hangs on large repos).
  2. In communities.py: remove _compute_cohesion_fast / _build_adjacency and instead use the already-merged _compute_cohesion_batch for the Leiden path. Keep your unique contributions: removing _detect_leiden_sub, adding n_iterations=2, and the batch UPDATE in store_communities.
  3. Drop all tools/build.py changes; they are superseded by #184 (perf: batch _compute_summaries queries to stop hangs on large repos).

One security note on the batch UPDATE in store_communities:

f"UPDATE nodes SET community_id = ? WHERE qualified_name IN ({placeholders})",  # nosec B608
[community_id] + member_qns,

placeholders is built from "?" * len(member_qns) so this is fully parameterized — the # nosec B608 annotation is correct and consistent with the project's existing pattern in graph.py.
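
A self-contained sketch of this kind of batched, parameterized UPDATE using the standard sqlite3 module follows. The table schema, the `store_members` helper, and the chunking are assumptions added for illustration. Chunking is not part of the PR as described; it is included because SQLite limits the number of bound parameters per statement (999 by default before version 3.32.0). The sketch joins one `?` per member with commas to form the IN list.

```python
import sqlite3


def store_members(conn, community_id, member_qns, chunk_size=500):
    """Assign community_id to all members using one UPDATE per chunk."""
    for start in range(0, len(member_qns), chunk_size):
        chunk = member_qns[start:start + chunk_size]
        # Only "?" markers are interpolated into the SQL text;
        # every value is passed as a bound parameter.
        placeholders = ",".join("?" * len(chunk))
        conn.execute(
            f"UPDATE nodes SET community_id = ? "
            f"WHERE qualified_name IN ({placeholders})",  # nosec B608
            [community_id] + chunk,
        )


conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema for the demonstration.
conn.execute(
    "CREATE TABLE nodes (qualified_name TEXT PRIMARY KEY, community_id INTEGER)"
)
names = [f"pkg.mod.func_{i}" for i in range(1200)]
conn.executemany(
    "INSERT INTO nodes (qualified_name) VALUES (?)", [(n,) for n in names]
)

store_members(conn, 7, names[:1000])
conn.commit()

assigned = conn.execute(
    "SELECT COUNT(*) FROM nodes WHERE community_id = 7"
).fetchone()[0]
unassigned = conn.execute(
    "SELECT COUNT(*) FROM nodes WHERE community_id IS NULL"
).fetchone()[0]
```

With 1,000 members and a chunk size of 500, this issues two statements instead of 1,000.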

Please rebase, narrow scope to the three unique contributions above, and this will be mergeable.

…, batch store_communities

Narrowed scope per review feedback — cohesion batch fix and build.py
changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes
only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on
   every community >50 nodes caused exponential blow-up on large graphs.
   With 3k+ communities, each sub-pass re-scanned all edges and ran
   a full Leiden + cohesion pass. Removed entirely — the first-pass
   partitioning is sufficient.

2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until
   convergence, which can take unbounded time on dense code graphs.
   Two passes produce equivalent partition quality for dependency graphs.

3. **Batch UPDATE in `store_communities`**: Replace per-member
   `UPDATE nodes SET community_id` with a single
   `WHERE qualified_name IN (...)` per community. Fully parameterized
   (nosec B608 is correct).

4. **Progress logging**: Added `logger.info()` at each phase boundary
   (node loading, igraph construction, Leiden execution, cohesion,
   completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase                | Before             | After          |
|----------------------|--------------------|----------------|
| Leiden clustering    | >2 hours (hung)    | ~45 seconds    |
| store_communities    | ~5 min             | ~15 seconds    |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
@dg264 dg264 force-pushed the perf-optimize-large-repos branch from 39f15a1 to 5f1fe15 Compare April 11, 2026 09:07
@dg264 dg264 changed the title from "perf: optimize community detection and risk index for large repos (100k+ files)" to "perf: remove recursive sub-community splitting, cap Leiden iterations, batch store_communities" Apr 11, 2026
@dg264
Contributor Author

dg264 commented Apr 11, 2026

@tirth8205 addressed comments

@tirth8205 tirth8205 merged commit 533e581 into tirth8205:main Apr 11, 2026
9 checks passed
tirth8205 added a commit that referenced this pull request Apr 11, 2026
Unreleased fixes since v2.2.2 that users are complaining about:
- #208 Claude Code hook schema (fixes #97, #138, #163, #168, #172, #182,
  #188, #191, #201) — v2.2.2 generates an invalid hooks schema and
  timeouts in ms instead of seconds; PreCommit is also not a real event.
- #205 SQLite transaction nesting (fixes #110, #135, #181) — implicit
  transactions from the legacy sqlite3 default caused "cannot start a
  transaction within a transaction" on update.
- #166 Go method receivers resolved from field_identifier.
- #170 UTF-8 decode errors in detect_changes (fixes #169).
- #142 --platform target filters (fixes #133).
- #213 / #183 large-repo community detection hangs.
- #220 CI lint + tomllib on Python 3.10.
- #159 missing pytest-cov dev dep.
- #154 JSX component CALLS edges.

Plus features: #177 Codex, #165 Luau (#153), #217 REFERENCES edge,
#215 recurse_submodules, #185 gitignore default (#175), #171 gitignore
docs (#157).

Verified locally on Python 3.11: ruff clean, mypy clean, bandit clean,
691 tests pass, coverage 73.72%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tirth8205 tirth8205 mentioned this pull request Apr 11, 2026
5 tasks