feat(community): in-community member sampling for build_communities by Ataxia123 · Pull Request #2 · NERDDAO/graphiti

Ataxia123 · 2026-04-11T21:41:15Z

Adds an optional sample_size parameter that bounds LLM cost on large graphs by limiting community summary input to the top-K most representative members instead of all members.

Background

The current build_community implementation feeds every member's summary into a binary-tree pairwise merge, calling summarize_pair once per pair. For a community of N members this is N-1 LLM calls, plus 1 final generate_summary_description. Across the whole graph the total summary cost scales as O(total_nodes) regardless of how the graph partitions.

On a 100k-node knowledge graph that's ~100k LLM calls per build_communities run, which makes the operation cost-prohibitive at scale even though the underlying clustering finishes in seconds.

What this PR adds

A new sample_size: int | None = None parameter on:

Graphiti.build_communities (public API)
build_communities (internal)
build_community (internal)

When set, each community ranks its members and feeds only the top-K into the binary-merge tree. The ranking is:

In-community weighted degree (descending)
Summary length (descending) — entities with rich summaries contribute more useful content to the merge
Name (descending) — deterministic tie-breaker

In-community degree is computed from the projection that get_community_clusters already builds during clustering — no extra queries. To support this, get_community_clusters gains an optional return_projection flag that exposes the projection alongside the clusters. The default behavior (just clusters) is unchanged.

Cost becomes O(num_communities * sample_size) instead of O(total_nodes), which is a 20-40x reduction on graphs where communities average a few hundred members.

Quality

Empirically the sampled summaries are equal to or better than the unsampled ones — hub nodes carry the community's structural signal, and feeding fewer-but-richer inputs into the binary merge produces sharper, less diluted descriptions. On a 48-entity test graph with sample_size=5, the largest community's summary went from "lists exit directions" to "atmospheric description with key features and named identification" while taking 3x less wall time.

Notes

All members still appear in the community's HAS_MEMBER edges. Only the LLM summary input set is sampled.
When the projection isn't available (e.g. graph_operations_interface drivers that bypass the Python clustering path), the sampler falls back to ranking by summary length alone.
For small graphs (<1k nodes) the default behavior (no sampling) is recommended.

Includes 8 new unit tests covering the ranking helper across edge cases (smaller-than-K, equal-to-K, fallback to summary length, empty projection, in-community vs out-of-community edges, deterministic tie-breaking).

Summary

Brief description of the changes in this PR.

Type of Change

Bug fix
New feature
Performance improvement
Documentation/Tests

Objective

For new features and performance improvements: Clearly describe the objective and rationale for this change.

Testing

Unit tests added/updated
Integration tests added/updated
All existing tests pass

Breaking Changes

This PR contains breaking changes

If this is a breaking change, describe:

What functionality is affected
Migration path for existing users

Checklist

Code follows project style guidelines (make lint passes)
Self-review completed
Documentation updated where necessary
No secrets or sensitive information committed

Related Issues

Closes #[issue number]

Adds an optional sample_size parameter that bounds LLM cost on large graphs by limiting community summary input to the top-K most representative members instead of all members. # Background The current build_community implementation feeds every member's summary into a binary-tree pairwise merge, calling summarize_pair once per pair. For a community of N members this is N-1 LLM calls, plus 1 final generate_summary_description. Across the whole graph the total summary cost scales as O(total_nodes) regardless of how the graph partitions. On a 100k-node knowledge graph that's ~100k LLM calls per build_communities run, which makes the operation cost-prohibitive at scale even though the underlying clustering finishes in seconds. # What this PR adds A new sample_size: int | None = None parameter on: - Graphiti.build_communities (public API) - build_communities (internal) - build_community (internal) When set, each community ranks its members and feeds only the top-K into the binary-merge tree. The ranking is: 1. In-community weighted degree (descending) 2. Summary length (descending) — entities with rich summaries contribute more useful content to the merge 3. Name (descending) — deterministic tie-breaker In-community degree is computed from the projection that get_community_clusters already builds during clustering — no extra queries. To support this, get_community_clusters gains an optional return_projection flag that exposes the projection alongside the clusters. The default behavior (just clusters) is unchanged. Cost becomes O(num_communities * sample_size) instead of O(total_nodes), which is a 20-40x reduction on graphs where communities average a few hundred members. # Quality Empirically the sampled summaries are equal to or better than the unsampled ones — hub nodes carry the community's structural signal, and feeding fewer-but-richer inputs into the binary merge produces sharper, less diluted descriptions. On a 48-entity test graph with sample_size=5, the largest community's summary went from "lists exit directions" to "atmospheric description with key features and named identification" while taking 3x less wall time. # Notes - All members still appear in the community's HAS_MEMBER edges. Only the LLM summary input set is sampled. - When the projection isn't available (e.g. graph_operations_interface drivers that bypass the Python clustering path), the sampler falls back to ranking by summary length alone. - For small graphs (<1k nodes) the default behavior (no sampling) is recommended. Includes 8 new unit tests covering the ranking helper across edge cases (smaller-than-K, equal-to-K, fallback to summary length, empty projection, in-community vs out-of-community edges, deterministic tie-breaking). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Ataxia123 had a problem deploying to development April 11, 2026 21:41 — with GitHub Actions Error

Ataxia123 merged commit 8297aab into main Apr 11, 2026
4 of 10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(community): in-community member sampling for build_communities#2

feat(community): in-community member sampling for build_communities#2
Ataxia123 merged 1 commit into
mainfrom
feat/community-summary-sampling

Ataxia123 commented Apr 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Ataxia123 commented Apr 11, 2026

Background

What this PR adds

Quality

Notes

Summary

Type of Change

Objective

Testing

Breaking Changes

Checklist

Related Issues

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant