perf: remove recursive sub-community splitting, cap Leiden iterations, batch store_communities #213
FYI @tirth8205: I ran this on my local machine, and with this optimization the build now completes. Previously it never finished.
Code Review

This PR makes real performance improvements but has overlap and conflict with two other PRs now in queue.

Conflict with #183 (communities.py): now merged

PR #183 was just merged. It addresses the same cohesion O(C×E) bottleneck via … Both algorithms are correct O(E) improvements. The merged approach is slightly more memory-efficient (no extra adjacency dict). What is unique and valuable in your PR that #183 does not have:

Overlap with #184 (tools/build.py): still open

Your …

Your …

Action needed

One security note on the batch UPDATE in …

f"UPDATE nodes SET community_id = ? WHERE qualified_name IN ({placeholders})",  # nosec B608
[community_id] + member_qns,

Please rebase, narrow scope to the three unique contributions above, and this will be mergeable.
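The batched statement quoted in the review can be sketched with the standard-library `sqlite3` module. This is a minimal sketch, not the project's actual `store_communities`: the table layout, the `batch_assign_community` helper, and the chunk size are assumptions made for illustration. Only the `?` markers are formatted into the SQL text and every value is still bound as a parameter, which is the argument for the `# nosec B608` annotation. The chunking is an extra precaution against SQLite's cap on bound parameters per statement (999 by default in builds older than 3.32.0).

```python
import sqlite3


def batch_assign_community(conn, community_id, member_qns, chunk_size=500):
    """Set community_id for many nodes using one UPDATE per chunk.

    Hypothetical helper: the name, schema, and chunk_size are illustrative.
    """
    updated = 0
    for start in range(0, len(member_qns), chunk_size):
        chunk = member_qns[start:start + chunk_size]
        # Only "?" markers are formatted into the SQL text; values are bound.
        placeholders = ",".join("?" * len(chunk))
        cur = conn.execute(
            f"UPDATE nodes SET community_id = ? "
            f"WHERE qualified_name IN ({placeholders})",  # nosec B608
            [community_id, *chunk],
        )
        updated += cur.rowcount
    return updated


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (qualified_name TEXT PRIMARY KEY, community_id INTEGER)"
)
names = [f"pkg.mod.func_{i}" for i in range(1200)]
conn.executemany(
    "INSERT INTO nodes (qualified_name) VALUES (?)", [(n,) for n in names]
)

# Three statements (500 + 500 + 200 names) instead of 1200 single-row UPDATEs.
changed = batch_assign_community(conn, 7, names)
conn.commit()
```

Wrapping all chunks for one build in a single transaction, as the single `commit()` does here, avoids paying a disk sync per statement.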
…, batch store_communities

Narrowed scope per review feedback: cohesion batch fix and build.py changes are handled by tirth8205#183 and tirth8205#184 respectively. This PR contributes only the optimizations unique to this branch:

1. **Remove `_detect_leiden_sub`**: The recursive second-pass Leiden on every community >50 nodes caused exponential blow-up on large graphs. With 3k+ communities, each sub-pass re-scanned all edges and ran a full Leiden + cohesion pass. Removed entirely; the first-pass partitioning is sufficient.
2. **Cap Leiden iterations** (`n_iterations=2`): The default runs until convergence, which can take unbounded time on dense code graphs. Two passes produce equivalent partition quality for dependency graphs.
3. **Batch UPDATE in `store_communities`**: Replace per-member `UPDATE nodes SET community_id` with a single `WHERE qualified_name IN (...)` per community. Fully parameterized (nosec B608 is correct).
4. **Progress logging**: Added `logger.info()` at each phase boundary (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor progress on large builds.

## Performance (tested on 132k-file monorepo, M3 Pro)

| Phase | Before | After |
|---|---|---|
| Leiden clustering | >2 hours (hung) | ~45 seconds |
| store_communities | ~5 min | ~15 seconds |

Co-Authored-By: Cursor <noreply@cursor.com>
Made-with: Cursor
39f15a1 to 5f1fe15
@tirth8205 addressed comments
Unreleased fixes since v2.2.2 that users are complaining about:

- #208 Claude Code hook schema (fixes #97, #138, #163, #168, #172, #182, #188, #191, #201): v2.2.2 generates an invalid hooks schema and timeouts in ms instead of seconds; PreCommit is also not a real event.
- #205 SQLite transaction nesting (fixes #110, #135, #181): implicit transactions from the legacy sqlite3 default caused "cannot start a transaction within a transaction" on update.
- #166 Go method receivers resolved from field_identifier.
- #170 UTF-8 decode errors in detect_changes (fixes #169).
- #142 --platform target filters (fixes #133).
- #213 / #183 large-repo community detection hangs.
- #220 CI lint + tomllib on Python 3.10.
- #159 missing pytest-cov dev dep.
- #154 JSX component CALLS edges.

Plus features: #177 Codex, #165 Luau (#153), #217 REFERENCES edge, #215 recurse_submodules, #185 gitignore default (#175), #171 gitignore docs (#157).

Verified locally on Python 3.11: ruff clean, mypy clean, bandit clean, 691 tests pass, coverage 73.72%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
On large monorepos (100k+ files, 300k+ nodes, 1M+ edges), the Leiden community detection hangs indefinitely due to:
- _detect_leiden_sub. With 3k+ communities on large graphs, this compounds exponentially.
- store_communities: Each member node gets an individual UPDATE query, resulting in thousands of sequential writes.

Tested on a production monorepo: 132,353 files → 328k nodes → 1.7M edges.
Scope (per review feedback)
This PR is narrowed to contributions not covered by #183 or #184:
- … _compute_cohesion_batch) ✅ already merged
- tools/build.py batch summaries → handled by #184, "perf: batch _compute_summaries queries to stop hangs on large repos" (more complete + regression tests) ✅ defer to #184

Changes (communities.py only)

- _detect_leiden_sub
- … n_iterations=2)
- store_communities: … WHERE qualified_name IN (...) per community instead of per-member UPDATE. Fully parameterized (# nosec B608, consistent with project pattern in graph.py)
- logger.info() at each phase (node loading, igraph construction, Leiden execution, cohesion, completion) so users can monitor large builds

Performance (132k-file monorepo, Apple M3 Pro)
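Per-phase timings like the ones reported for this PR can be collected with the same `logger.info()` calls that mark the phase boundaries. The following is a minimal sketch using only the standard library; the `log_phase` context manager, the logger name, and the placeholder phase bodies are hypothetical and are not taken from communities.py.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("communities_sketch")  # hypothetical logger name


@contextmanager
def log_phase(name, timings):
    """Log the start and end of a phase and record its duration in seconds."""
    logger.info("%s: started", name)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the phase raises.
        elapsed = time.perf_counter() - start
        timings[name] = elapsed
        logger.info("%s: finished in %.2fs", name, elapsed)


logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
timings = {}

# Placeholder work stands in for the real phases listed above.
with log_phase("node loading", timings):
    nodes = list(range(10_000))
with log_phase("cohesion", timings):
    total = sum(nodes)
```

Logging a "started" line before each phase, and not only a summary at the end, is what lets a user tell a slow phase from a hung one while the build is still running.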
Backward Compatibility
Test Plan