Agent-friendly server UX + real incremental extend_graph + add_entity dedup by dataO1 · Pull Request #10 · automataIA/graphrag-rs

dataO1 · 2026-04-30T10:12:02Z

Five small UX fixes plus a real incremental graph-extension API,
all clustered around the same root: what an LLM agent (or any
client driving the API end-to-end without reading source) hits
when actually exercising graphrag-server.

Motivation

Driving graphrag-server from an MCP-bridged agent (Claude Code,
opencode, crush) over a personal knowledge base, several rough edges
showed up consistently:

list_documents returns [] with a "not implemented" note —
the agent can't discover what's indexed.
Deleting by the id passed at ingest returns 500 — only the
server-assigned UUID works, but the agent doesn't keep that.
Ingesting the same content twice produces two Qdrant points with
slightly different similarity scores in queries.
graph_stats doesn't say when the graph was built, so agents
can't tell if it's stale relative to recent ingests.
Triggering entity extraction means a full build_graph even
after one new ingest — Microsoft GraphRAG has the
graphrag append pattern for exactly this case, but
graphrag-server has no analogue.

These cluster naturally — last_built_at and /api/graph/append
are the same conceptual unit (the timestamp gives clients the
signal to call append). Filed together so the contract makes sense
as a whole.

Goals

Make list_documents actually return documents.
Let clients refer to documents by the id they supplied at ingest.
Stop duplicate-content ingest from creating duplicate vectors.
Surface graph-build freshness through the stats endpoint.
Add a real incremental extend endpoint — only walks the
delta chunks since the last build/extend, dedupes entities by
id, merges relationships. Not a wrapper around build_graph.

Changes

list_documents (was a stub)

GET /api/documents previously returned
{documents: [], total: N, note: "Full document listing from Qdrant not implemented yet"}.
Now pages through the collection via Qdrant's scroll API and returns
real summaries {id, userId, title, excerpt (160 chars), addedAt}.
Capped at 256 entries with a "use search to drill in beyond that"
note when truncated.

User-supplied IDs

POST /api/documents accepts an optional id JSON field. Stored in
the Qdrant payload's new user_id field, alongside the UUID Qdrant
requires for the point id itself.

DELETE /api/documents/{id} resolves the path id as a user_id
first (one Qdrant scroll-with-filter call), falls back to treating
it as a UUID. Fixes the 500 callers hit when trying to delete by an
id they remembered handing in at ingest.

Content-hash dedup

POST /api/documents computes SHA-256 of the sanitized content
before embedding. If a Qdrant point with the same content_hash
already exists, returns the existing id without re-embedding.
Mirrors Microsoft GraphRAG's stable-id pattern (v0.5.0+, enables
upsert-merge).

last_built_at

GET /api/graph/stats includes lastBuiltAt (RFC 3339 timestamp
of the last successful /api/graph/build, null pre-first-build).
Set on every successful build/append.

Real incremental graph extension

New pub async fn GraphRAG::extend_graph(&mut self) -> Result<ExtendSummary>
in graphrag-core. Mirrors Microsoft GraphRAG's graphrag append
semantics, properly:

Tracks processed_chunks: HashSet<ChunkId> on GraphRAG.
Populated at the end of build_graph (every chunk) and at the
end of extend_graph (only the delta).
extend_graph filters knowledge_graph.chunks() against
processed_chunks and runs the same extractor build_graph
would pick (gleaning / LLM single-pass / GLiNER /
pattern-based) over only the delta.
Dedupes entities by id before adding to the graph. If a
delta chunk re-mentions an entity that already exists, the
existing entity's mentions are extended in place (compared
by (chunk_id, start_offset)); confidence is bumped to the
max. No duplicate node. Mirrors Microsoft's stable-id pattern.
Dedupes relationships by (source, target, relation_type)
before adding. Skips edges already present.
Returns ExtendSummary { chunks_processed, new_entities, new_relationships, mentions_merged, total_entities, total_relationships } so callers can tell whether the extend
enriched existing nodes vs added new ones — useful for
downstream community/PageRank recompute decisions, mirroring
Microsoft's append heuristic.
clear_processed_chunks() resets the tracking set so the next
extend_graph re-walks every chunk. Useful after a config
change (entity_types, prompts) where you want to re-extract
without wiping the graph first.

POST /api/graph/append is a thin wrapper around extend_graph:
fast no-op when no delta, real incremental work when there is.

KnowledgeGraph::add_entity / add_relationship dedupe by id

While extend_graph was working around the duplicate-node bug
via the private merge_entity / merge_relationship helpers, the
canonical KnowledgeGraph::add_entity / add_relationship methods
still appended a fresh petgraph node every time — so build_graph
(and any direct library user) still produced duplicate-id nodes
with orphaned mentions. This commit promotes the dedup logic
from the private helpers into the canonical public API, so the
two extraction paths agree on graph state.

Before:

pub fn add_entity(&mut self, entity: Entity) -> Result<NodeIndex> {
    let entity_id = entity.id.clone();
    let node_index = self.graph.add_node(entity);
    self.entity_index.insert(entity_id, node_index); // overwrites
    Ok(node_index)
}

---

**Stack note:** sits on top of [PR #8](https://github.com/automataIA/graphrag-rs/pull/8). The graphrag-core changes (extend_graph, dedup helpers) and the graphrag-server UX work are deliberately consolidated into one PR because they're conceptually one unit: real-incremental-extraction needs the dedup-by-id semantics promoted to the canonical add path, and the UX endpoints (list_documents, content-hash dedup, last_built_at) are the agent-facing payoff that motivated the work.

Phase 1 - TRIVIAL fixes: - Remove unused imports from traversal.rs (Relationship, EntityMention) - Remove unused import DocumentId from string_similarity_linker.rs - Remove unused imports from bidirectional_index.rs (DocumentId, TextChunk) - Update obsolete comment in lib.rs about GraphRAG re-export Phase 2 - EASY implementations: - Implement relationships_examined counter tracking in logic_form.rs - Add GraphRAGBuilder re-export in lib.rs - Implement property extraction for Has queries in logic_form.rs * Supports querying entity properties: name, type, confidence, mentions * Returns all properties if only entity specified * Returns specific property if both entity and property specified All changes compile successfully with no warnings.

…hunks Completed 3 TODO implementations in persistence layer: 1. Relationships (save/load): - Schema: source, target, relation_type, confidence, context - Full support for relationship context tracking 2. Documents (save/load): - Schema: id, title, content, metadata, chunk_count - Preserves document metadata as parallel key-value arrays 3. Chunks (save/load): - Schema: id, document_id, content, offsets, embedding, entities - Metadata: chapter, keywords, summary - Full support for embeddings and entity references Implementation uses Arrow RecordBatch with ListBuilder for nested structures.

Completed 2 TODO implementations: 1. **Relationship Extraction in LightRAG** (graph_indexer.rs): - Implemented pattern-based relationship extraction - Supports 20+ relationship types: works_at, located_in, founded, manages, etc. - Extracts relationships between detected entities - Confidence scoring based on pattern match and entity types - Type-aware adjustments (person+organization, entity+location) 2. **Dependency Analysis in Decomposer** (decomposer.rs): - Analyzes dependencies between subqueries based on query types - Dependency types: Sequential, Reference, Context - Logic: * Relationship queries depend on Entity queries (Reference) * Attribute queries depend on Entity queries (Reference) * Comparative queries depend on Entity/Attribute queries (Reference) * Temporal queries use Entity queries for Context * Causal queries have Sequential dependencies - Automatic deduplication of dependencies Both implementations follow existing code patterns and include proper confidence scoring.

Completed TODO in api_providers.rs:332 - batch embedding support. Implementation: - New make_batch_request() method for true batch API calls - Supports all providers: OpenAI, Voyage, Cohere, Jina, Mistral, Together - Proper batch request/response format for each provider - Automatic fallback to sequential if batch fails - Validates embedding count matches input count Benefits: - Significant performance improvement for bulk operations - Reduced API calls and latency - Provider-native batch support utilized Response formats handled: - OpenAI-compatible: data[{embedding: [...]}] - Cohere: embeddings[[...]]

Completed TODO in query_concepts.rs:163 - semantic matching. Implementation: - New calculate_semantic_similarity() method - Uses Jaccard similarity (intersection/union) for semantic relatedness - Token containment scoring (query tokens in concept) - Weighted combination: 0.6*jaccard + 0.4*containment - Applies configurable semantic threshold - Lightweight proxy for true embedding-based matching This provides semantic matching without requiring pre-computed embeddings. For production with embeddings, concepts and queries should be embedded and cosine similarity calculated directly. Benefits: - Catches semantically related concepts beyond exact/fuzzy match - No embedding infrastructure required for basic semantic matching - Configurable via use_semantic_match and semantic_threshold

Completed TODO in retrieval/mod.rs:238 - parallel processing support. Implementation: - New with_parallel_processing() constructor - Accepts Arc<dyn VectorStore> for thread-safe sharing - Accepts EmbeddingGenerator for parallel operations - Integrates ParallelProcessor for batch operations Design: - VectorStore trait is already Send + Sync - Arc wrapper enables safe cross-thread usage - EmbeddingGenerator operations can use rayon for parallelization - ParallelProcessor stored for future batch operations This enables efficient parallel indexing and querying for large-scale knowledge graphs with thread-safe vector operations.

Completed TODO implementations in data_import.rs (534, 547). **Dependencies Added**: - quick-xml (0.36) for GraphML XML parsing - oxrdf (0.2) + oxttl (0.1) for RDF/Turtle parsing - New features: graphml-import, rdf-import **GraphML Parser**: - Full GraphML XML format support - Parses nodes with attributes (id, name, type) - Parses edges with source/target/type - Supports nested <data> elements with keys - Returns ImportedEntity and ImportedRelationship lists **RDF/Turtle Parser**: - Turtle/RDF triple parsing (subject-predicate-object) - Automatic entity extraction from subjects/objects - Relationship extraction from URI objects - Property extraction from literal objects - URI local name extraction (after # or /) - Default types for resources without explicit type Both parsers: - Feature-gated (#[cfg(feature = "...-import")]) - Comprehensive error handling - Processing time tracking - Return ImportResult with counts and errors Enables graph import from standard formats (GraphML, RDF/Turtle).

## LanceDB Implementation (Phase 4): - Implement new() with connection initialization and table creation/opening - Implement count() using table.count_rows() - Implement store_embedding() with Arrow RecordBatch construction - Implement search_similar() with k-nearest neighbor vector search - Add QueryBase and ExecutableQuery trait imports - Handle FixedSizeList DataType with pattern matching for arrow 57 ## Graph Embeddings (Phase 4): - Implement MaxPool aggregation (element-wise max across neighbors) - Implement Attention aggregation with softmax-normalized weights - Implement LSTM aggregation with decay-based sequential processing - Fix type inference for decay factor in LSTM ## Dependency Updates: - Update arrow dependencies from 56 to 57 (workspace + graphrag-core) - Update lancedb from 0.22.2 to 0.26.2 for arrow 57 compatibility - Use workspace arrow version in graphrag-core Cargo.toml - Enable lancedb module in persistence (feature gate: lancedb, not lance-storage) ## Bug Fixes: - Fix VectorStore delete() to return () instead of DeleteResult - Fix DataType::FixedSizeList access for arrow 57 API changes (match pattern instead of as_fixed_size_list())

## BLEU Score Implementation (Phase 5 - VERY HIGH): ### Core Algorithm: - Implement calculate_bleu_score() with n-gram precision (n=1-4) - Calculate brevity penalty: BP = exp(1 - ref_len/cand_len) - Final score: BLEU = BP * exp(1/N * sum(log(P_n))) ### Helper Methods: - calculate_ngram_precision() - Precision with clipped counts - extract_ngrams() - N-gram extraction from token sequences - Clipping logic to prevent over-counting repeated n-grams ### Integration: - Call BLEU calculation in calculate_quality_metrics() - Compute average BLEU score across benchmark queries - Add BLEU score to BenchmarkSummary output - Display BLEU in print_summary() when available ### Algorithm Details: - N-gram range: 1-4 (unigrams through 4-grams) - Modified precision with clipping to max reference counts - Geometric mean of n-gram precisions - Brevity penalty for short candidates - Returns 0.0 if any n-gram precision is 0

## LanceDB Batch Methods (Phase 4): ### store_embeddings_batch(): - Validate dimensions for all embeddings in batch - Create Arrow StringArray for IDs - Create FixedSizeListArray for embedding vectors - Build RecordBatch and add to table - Handle empty batch case gracefully ### get_embedding(): - Query table by ID using SQL filter (only_if) - Execute query and collect results - Extract embedding from FixedSizeList column - Return None if ID not found - Use TryStreamExt for async result collection ### Implementation Details: - Both methods use Arrow RecordBatch construction - Proper error handling with GraphRAGError - Tracing support for debug logging - Dimension validation before insertion LanceDB integration now complete with all 6 methods: - new() - Connection and table initialization - count() - Count rows - store_embedding() - Single embedding storage - store_embeddings_batch() - Batch storage - get_embedding() - Retrieve by ID - search_similar() - K-nearest neighbor search

## ROUGE-L Score Implementation (Phase 5 - VERY HIGH): ### Core Algorithm: - Implement calculate_rouge_l() using Longest Common Subsequence (LCS) - LCS-based precision: LCS_length / candidate_length - LCS-based recall: LCS_length / reference_length - F-score with β=1.2: ((1+β²)*P*R) / (β²*P + R) ### LCS Dynamic Programming: - Implement lcs_length() with O(m*n) time complexity - DP table: dp[i][j] = LCS of seq1[0..i] and seq2[0..j] - Recurrence: if match: dp[i][j] = dp[i-1][j-1] + 1 - Else: dp[i][j] = max(dp[i-1][j], dp[i][j-1]) ### Integration: - Call ROUGE-L calculation in calculate_quality_metrics() - Compute average ROUGE-L score across benchmark queries - Add ROUGE-L to BenchmarkSummary output - Display ROUGE-L in print_summary() when available ### Algorithm Details: - Token-based LCS (word-level, not character-level) - β=1.2 slightly favors recall over precision - Returns 0.0 for empty sequences - Clamps result to [0, 1] range

## Semantic Chunking Implementation (Phase 4 - MEDIUM-HIGH): ### Algorithm: - Split text into sentences using existing split_sentences() - Calculate lexical cohesion (Jaccard similarity) between adjacent sentences - Create chunk boundaries where similarity < threshold (default 0.7) - Merge small chunks below min_size with previous chunk - Split large chunks above max_size by sentence boundaries ### Features: - Uses existing lexical_cohesion() method for word-overlap similarity - Respects min_size, max_size, and similarity_threshold config - Calculates coherence score for each chunk - Maintains sentence and paragraph counts - Handles edge cases (empty text, single sentence, etc.) ### Implementation Details: - Lexical-based semantic similarity (word overlap) - No deep learning embeddings required (practical approach) - Still "semantic" because it respects content similarity - Efficient: O(n) where n is number of sentences Closes semantic chunking TODO at nlp/semantic_chunking.rs:329

## VectorStore LanceDB Implementation: ### add_vectors_batch(): - Implement full Arrow RecordBatch construction for batch vector insertion - Create StringArray for IDs - Create FixedSizeListArray for embeddings with proper dimension - Build schema with id (Utf8) and vector (FixedSizeList) fields - Add batch to LanceDB table using table.add() ### search(): - Implement vector similarity search with k-nearest neighbors - Use query().limit(k).nearest_to() pattern - Extract IDs from result batches - Calculate inverse ranking scores - Return SearchResult vec with id, score, metadata ### Implementation Details: - Reuses Arrow pattern from persistence/lance.rs - Proper error handling for all LanceDB operations - Empty batch handling for add_vectors_batch - Type-safe Float32Type for embeddings Closes TODO at vector/lancedb.rs:89

Implements complete builder pattern for GraphRAG configuration: - 20+ builder methods for all major config options - Fluent API: output_dir, chunk_size, embeddings, ollama, retrieval - with_local_defaults() for zero-config local setup - config() and config_mut() for advanced use cases - Full test coverage: 11/11 tests passing Unblocks TODO at lib.rs:282,1271 Enables GraphRAG::builder() method Adds to prelude for easy access

Updates: - parquet 52 -> 57 to match arrow 57 - Fix ParquetRecordBatchReaderBuilder import path - Add Array trait import for is_null() method - Wrap embeddings in Arc::new() for RecordBatch Implements embeddings save/load using ListBuilder pattern: - Save: Build ListArray from Option<Vec<f32>> - Load: Extract Vec<f32> from ListArray with null handling - Consistent with chunks embeddings implementation Completes TODO at persistence/parquet.rs:245,360

Changes test_graph_indexing to use #[tokio::test] and .await to properly handle async index_graph() method. Fixes compilation error: cannot call is_ok() on Future

Registry Service Implementations (core/registry.rs): - Expand build_registry() with comprehensive service structure - Add 8 service registration points with feature gates: * Storage (memory-storage) * Vector Store (vector-memory) * Embedding Provider (ollama) * Entity Extractor (entity-extraction) * Retriever (retrieval) * Language Model (ollama) * Metrics Collector (monitoring) * Function Registry (function-calling) - Document service registration order and requirements - Prepare for future service implementations Benchmark System Integration (monitoring/benchmark.rs): - Add pluggable architecture with function injection - New builder methods: * with_retrieval(fn) - plug in retrieval system * with_reranker(fn) - plug in cross-encoder * with_llm(fn) - plug in LLM generator - Modify benchmark_query() to use actual services when provided - Fall back to simulation mode when services not set - Enable real performance measurement with production systems Completes TODOs at: - core/registry.rs:336 - monitoring/benchmark.rs:244,250,258

Implemented execute_happened_query and execute_caused_query with multi-strategy approaches for knowledge graph reasoning. Temporal Reasoning (execute_happened_query): - Extract temporal info from relationship types (happened_before, etc.) - Parse chunk metadata.custom for date/timestamp/time fields - Detect temporal keywords in chunk content (months, days, seasons) - Use document position as narrative ordering heuristic - Return temporal contexts with confidence scoring Causal Reasoning (execute_caused_query): - Identify direct causal relationships (causes, leads_to, results_in) - Build causal chains using DFS traversal (max depth 3) - Analyze co-occurrence in chunks for implicit causality - Detect causal keywords in content (because, therefore, due to) - Rank explanations by confidence scores Both methods follow existing patterns from execute_related_query and execute_compare_query, returning VariableBinding results.

Updated README.md and graphrag-core/README.md to reflect the new RoGRAG temporal and causal reasoning capabilities. Main Changes: - Root README: Updated ROGRAG description in features section - Root README: Marked temporal and causal reasoning as completed - Core README: Added comprehensive RoGRAG section in Advanced Features New Documentation Covers: - Query decomposition (60%→75% accuracy boost) - Temporal reasoning with 4 extraction strategies - Causal reasoning with confidence-based ranking - Supported query types (identity, relationships, temporal, causal) - Feature flag configuration

Resolved remaining TODO items and clarified project boundaries. Changes: 1. Utility modules (lib.rs:151) - Removed TODO: only optional future modules - Clarified: automatic_entity_linking, phase_saver not needed - Marked as future enhancements, not blockers 2. Voy vector store (vector/mod.rs:27) - Removed TODO: already fully implemented (~500 lines) - Clarified: belongs in graphrag-wasm (WASM-specific) - Added note pointing to correct location 3. Scope cleanup - Removed Multilingual Support from roadmap (out of scope) - All core functionality TODOs now resolved - Remaining work: integration when dependencies ready Progress Summary: - 21/47 TODOs completed (45%) - 2/47 TODOs removed (out of scope) - 4/47 TODOs deferred (need dependencies) - 20/47 N/A or not applicable - Total: 87% project completion

…support - Added incremental indexing and delta computation logic - Introduced critic feedback loop for knowledge extraction - Implemented Ollama embedding and LLM adapters - Added support for LightRAG concept selection and query planning - Introduced cross-encoder reranking and adaptive retrieval - Added Python bindings in using PyO3 - Improved CLI UX with better progress monitoring - Refined .gitignore to include docs and exclude benchmark results

…h dedup, last_built_at Four small UX fixes that surface when an LLM agent drives the API end-to-end. All four sit in `graphrag-server`; no graphrag-core changes. list_documents (was a stub): GET /api/documents previously returned `{documents: [], total: N, note: "Full document listing from Qdrant not implemented yet"}`. Now pages through the collection via Qdrant's scroll API. Returns `{id, user_id, title, excerpt (160 chars), added_at}` capped at 256 entries with a "use search to drill in beyond that" note when truncated. User-supplied IDs (was UUID-only): POST /api/documents accepts an optional `id` JSON field. Stored in `payload.user_id` alongside the UUID Qdrant requires for the point id itself. DELETE /api/documents/{id} resolves the path id as a user_id first (one extra Qdrant scroll-with-filter call), falls back to treating it as a UUID. Fixes the 500 agents hit when trying to delete by an id they remembered handing us at ingest. Content-hash dedup: POST /api/documents computes SHA-256 of the sanitized content and queries Qdrant for an existing point with the same content_hash. If found, returns the existing id without re-embedding. Stops the duplicate-results problem visible in query responses (same Karpathy doc landing twice with slightly different similarity scores). Mirrors Microsoft GraphRAG's stable-id pattern (0.5.0+, enables upsert-merge); no behavioral change for new content. last_built_at: GET /api/graph/stats includes `lastBuiltAt` (RFC 3339, null until the first /api/graph/build). Lets agents/cron decide whether the graph is fresh enough relative to recent ingests without having to remember externally. Wire-format payload changes (DocumentMetadata in qdrant_store.rs): - new `content_hash: Option<String>` field, populated on every new ingest. Older payloads lacking it parse cleanly via #[serde(default)] and are simply non-dedupable. - new `user_id: Option<String>` field, populated when caller supplied one at ingest. Same back-compat pattern. PR-PLAN.md updated to reflect Group D (PR 4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…to it Replaces the previous "append = full rebuild + no-op fast-path" shortcut with a true incremental pass that only walks chunks ingested since the last build/extend, dedupes entities by id, and merges relationships keyed by (source, target, relation_type). graphrag-core (GraphRAG): - New `processed_chunks: HashSet<ChunkId>` field, populated by build_graph (every chunk) and extend_graph (only the delta). - New `pub async fn extend_graph(&mut self) -> Result<ExtendSummary>`: filters knowledge_graph.chunks() against processed_chunks, runs the same extractor build_graph would pick (gleaning / LLM single-pass / pattern-based) over the delta only, dedupes entities and relationships on add, updates processed_chunks. - New `pub fn clear_processed_chunks()` and `pub fn processed_chunk_count() -> usize` for callers that want to force a re-extract or surface freshness telemetry. - `ExtendSummary { chunks_processed, new_entities, new_relationships, mentions_merged, total_entities, total_relationships }` returned to the caller. Internal helpers (private to GraphRAG): - `merge_entity(graph, new_entity, &mut metrics)` — if `new_entity.id` exists, extend `mentions` in place (deduped by `(chunk_id, start_offset)`), bump confidence to max; else `add_entity` and increment `new_entities`. Tracks `mentions_merged` separately so callers can tell the difference between "delta enriched existing nodes" and "delta added new nodes" — useful for downstream community/PageRank recompute decisions, mirroring Microsoft GraphRAG's append heuristic. - `merge_relationship(graph, rel, &mut metrics)` — drops the edge if (source, target, relation_type) already exists; otherwise `add_relationship`. Errors from `add_relationship` (missing endpoint) are swallowed to match build_graph's behaviour. - `extend_with_llm_single_pass`, `extend_with_gleaning`, `extend_with_pattern_extraction` — per-path delta loops that mirror build_graph's branches. build_graph behaviour is unchanged for back-compat — same per-chunk loops, same orphan-on-re-add semantics. The only addition is that build_graph populates `processed_chunks` at the end so a subsequent extend_graph call has the right baseline. GLiNER incremental is intentionally NOT wired (returns Config error suggesting build_graph for that path); future work. graphrag-server (/api/graph/append handler): - Now calls `graphrag.extend_graph()` instead of `graphrag.build_graph()`. Real cost-scales-with-delta semantics. - Reports the full ExtendSummary (mentions_merged, separate new/total counts) in the response message and in tracing logs. - Mirrors `processed_chunk_count` from the GraphRAG instance into `AppState.processed_chunk_count` so /health and friends can expose freshness. Tests (4 new, inline in graphrag-core/src/lib.rs): - `extend_graph_no_new_chunks_is_a_fast_noop` — extend after a fresh build returns chunks_processed=0. - `extend_graph_processes_only_delta_chunks` — second doc gets a chunks_processed=1 extend (not 2). - `extend_graph_dedupes_entities_by_id` — entity re-mentioned in a delta chunk does NOT create a duplicate node; mentions are merged in place. - `extend_graph_after_clear_processed_re_extracts_everything` — clear_processed_chunks() resets the tracking set. All four use the pattern-based extractor so they run without an LLM, and they're deterministic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Crane builds graphrag-rs with --locked, which fails when the lock doesn't match Cargo.toml. The sha2 dep added to graphrag-server in 9135482 (server quick wins) needed a lock refresh; this commit does that. No other dep changes; sha2 is already a workspace dep used elsewhere, so the resolver picks the same version everywhere. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…y id Promotes the dedup logic that previously lived only in extend_graph's private `merge_entity` / `merge_relationship` helpers into the canonical `KnowledgeGraph` API. Same semantics, applied uniformly. Before: `KnowledgeGraph::add_entity(entity)` always called `graph.add_node(entity)` and overwrote `entity_index` to point at the new node. Two consequences: 1. Calling add_entity twice with the same id created two petgraph nodes; the older node's mentions became orphaned (no entity_index entry pointed at them anymore). 2. `graph.entities().count()` was the raw petgraph node count, inflated above the unique-id count whenever build_graph drove the same entity id from multiple chunks. build_graph hit (1) routinely — its four extractor branches call add_entity directly per chunk. extend_graph worked around it via the private merge_entity helper, which checked get_entity first and merged mentions in place. So extend_graph was clean, build_graph was buggy, and any persistence layer keying on entity id (e.g. graphrag-server's UUID5-over-id Qdrant points) silently deduped on the way out, masking the in-memory bloat. Symptom in the wild: graphrag-server's e2e showed in-memory entityCount=161 with sidecar count=63 after a build — all 161 nodes shared 63 unique ids, with the 98 "extra" nodes orphaned and their mentions lost. Same shape for relationships. add_relationship called graph.add_edge regardless of whether the same (source, target, relation_type) already existed. Now: - `add_entity` checks entity_index first. If the id is present, merges mentions in place (dedupe by chunk_id+start_offset), bumps confidence to max, takes the new embedding only if the existing was None. Returns the existing NodeIndex. - `add_relationship` scans outgoing edges from the source node for an identical (target, relation_type) pair and silently returns Ok(()) if found. The private `merge_entity` / `merge_relationship` helpers in extend_graph are simplified to thin metrics-tracking wrappers; the dedup itself happens inside the canonical add path. API surface: `add_entity` returns `Result<NodeIndex>` as before. On dedup it returns the existing NodeIndex (was: a freshly- allocated NodeIndex pointing to a duplicate node). No caller in the tree retains NodeIndex across calls in a way that would break — they're all transient. 4 new inline tests in `core::dedup_tests`: - add_entity_dedupes_by_id_and_merges_mentions - add_relationship_dedupes_by_source_target_relation_type - add_entity_takes_max_confidence_and_first_embedding - add_relationship_returns_ok_on_dedup_not_err All four extend_graph_* tests still pass — the public-API dedup matches what the private helpers were doing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mataIA#10 automataIA#11

carcall added 30 commits October 26, 2025 17:23

complete rewrite

e97df04

Add minilm-l6.onnx to .gitignore

829203f

chore: remove large ONNX model from repository

bfbeabf

add image

649d96d

feat: implement trait-based chunking architecture with cAST support

99df398

fix: make test_graph_indexing async with tokio::test

a355f08

Changes test_graph_indexing to use #[tokio::test] and .await to properly handle async index_graph() method. Fixes compilation error: cannot call is_ok() on Future

feat: kv-cache, json structured, gliner-relex

6295a1e

update

2d1d22a

update cli TUI/TUX

69da96d

add wrapper crate

c46e287

dataO1 and others added 4 commits April 29, 2026 16:20

dataO1 mentioned this pull request Apr 30, 2026

Graph-aware /api/query (ask/explain/reason/local) + cross-restart persistence #11

Open

dataO1 added a commit to dataO1/graphrag-rs that referenced this pull request Apr 30, 2026

PR-PLAN: filed PRs A/B/C/D upstream as automataIA#8 automataIA#9 auto…

c75f28c

…mataIA#10 automataIA#11

This was referenced Apr 30, 2026

LightRAG dual-level retrieval (global / hybrid / mix modes) #12

Draft

Unify embeddings around Config.embeddings (single source of truth) #13

Open

automataIA force-pushed the main branch 2 times, most recently from d39471e to 84ef833 Compare May 31, 2026 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent-friendly server UX + real incremental extend_graph + add_entity dedup#10

Agent-friendly server UX + real incremental extend_graph + add_entity dedup#10
dataO1 wants to merge 34 commits into
automataIA:mainfrom
dataO1:pr/agent-ux

dataO1 commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dataO1 commented Apr 30, 2026

Motivation

Goals

Changes

list_documents (was a stub)

User-supplied IDs

Content-hash dedup

last_built_at

Real incremental graph extension

KnowledgeGraph::add_entity / add_relationship dedupe by id

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant