Skip to content

OpenAI-compatible chat + embeddings backend (feature-gated, opt-in)#9

Open
dataO1 wants to merge 38 commits into
automataIA:mainfrom
dataO1:pr/openai-backend
Open

OpenAI-compatible chat + embeddings backend (feature-gated, opt-in)#9
dataO1 wants to merge 38 commits into
automataIA:mainfrom
dataO1:pr/openai-backend

Conversation

@dataO1
Copy link
Copy Markdown

@dataO1 dataO1 commented Apr 30, 2026

Adds an OpenAI-compatible backend on parity with the existing Ollama
path — for both chat (entity extraction, query, gleaning) and
embeddings. Lets users drive graphrag-rs against any server that
speaks /v1/chat/completions and /v1/embeddings: vLLM, llama.cpp's
llama-server, OpenVINO Model Server, OpenRouter, OpenAI itself,
self-hosted text-generation-inference, etc.

Includes the small diagnostic endpoint (GET /api/embeddings/stats)
that lets users verify which backend is actually serving — useful
specifically when standing up a new local OpenAI-compat stack.

Motivation

Local LLM deployments increasingly run on OpenAI-compatible servers
(vLLM, llama.cpp, OVMS, etc.) rather than Ollama, partly because they
support more modern features (tool calling, structured output,
chat-template knobs) and partly because they integrate better with
existing OpenAI client tooling. graphrag-rs's chat path was hardcoded
to Ollama protocol and the embedding side had a parsed-but-unused
"openai" config branch that fell back to hash. This PR closes the gap.

Goals

  • Drive graphrag-rs against any OpenAI-compat chat server with no
    forking, on parity with the Ollama path.
  • Same for embeddings.
  • Per-request escape hatch for backend-specific knobs without
    growing the config struct for every quirk (motivating case:
    chat_template_kwargs.enable_thinking=false for Qwen3 on
    llama.cpp; response_format for vLLM JSON mode).
  • Make uncapped extraction work for local LLMs (no token billing,
    reasoning models truncate JSON when capped).
  • Feature-gate it the same way ollama is gated, to keep
    WASM/minimal builds slim.

Changes

Chat (graphrag-core)

  • New OpenAIConfig struct alongside OllamaConfig on Config.
    Fields: enabled, base_url, chat_model, api_key,
    timeout_seconds, max_retries, max_tokens (Option<u32>),
    temperature, enable_caching, extra_body.
  • New OpenAIClient (ureq + tokio::task::spawn_blocking, mirrors
    OllamaClient's sync-wrapped-async pattern).
  • New ChatClient enum dispatcher in graphrag-core::chat.
    ChatClient::from_config picks the active backend based on
    openai.enabled / ollama.enabled. Every consumer of chat —
    entity extraction, query planning, gleaning — now takes
    ChatClient instead of OllamaClient directly.
  • Config::chat_enabled() helper that returns true when either
    backend is enabled. build_graph and friends gate on this so the
    graph build cleanly skips LLM extraction when no chat backend is
    available, instead of failing midway.

Embeddings (graphrag-server)

  • EmbeddingService got an OpenAI-compat branch alongside the
    existing Ollama path. Activated by EMBEDDING_BACKEND=openai plus
    OPENAI_URL / OPENAI_EMBEDDING_MODEL / OPENAI_API_KEY envs.
    Reqwest-based (already a non-optional dep), so the gate is purely
    a code-path toggle.

Per-request extras (extra_body)

  • New optional OpenAIConfig.extra_body: Option<serde_json::Value>
    field. Merged into every /chat/completions request body at the
    top level. Existing keys win — set fields on OpenAIConfig
    (model, max_tokens, temperature, stop, top_p) take precedence
    over extra_body collisions, so users can't accidentally
    overwrite a typed field with a raw JSON blob.
  • Motivating cases (in the README):
    • chat_template_kwargs.enable_thinking=false for Qwen3 on
      llama.cpp's --jinja path (suppresses reasoning output that
      truncates JSON extraction within a token cap).
    • response_format = { type = "json_object" } for vLLM JSON mode.

Token-cap rework

  • LLMEntityExtractor.max_tokens: usizeOption<usize>. None
    means "no cap" — num_predict / max_tokens is omitted from the
    request body, server uses its own default (llama.cpp: -1 /
    unlimited up to ctx). Useful for local LLMs where token cost is
    just compute time and reasoning models truncate JSON when capped.
    Default stays at Some(1500); existing call sites keep working.
  • Drive-by bug fix: lib.rs::build_graph was reading
    ollama.max_tokens even when openai.enabled — silently capping
    openai extraction at the ollama default. Now reads the active
    backend's cap.

Feature gate

  • graphrag-core: openai = ["ureq", "async"]. Added to the
    starter bundle. Mirrors the existing ollama feature.
  • graphrag-server: openai = ["graphrag-core/openai"].
  • OpenAIConfig itself stays unconditional so user configs round-
    trip through serde regardless. Without the feature,
    ChatClient::from_config falls through to ollama / None, with a
    tracing::warn! explaining how to enable it.

Diagnostic endpoint

  • GET /api/embeddings/stats reports the live
    EmbeddingService.backend_name() (openai / ollama / hash-fallback),
    dimension, and per-source request counters. Plain Actix route
    below .build(), same OpenAPI-bypass dance as /config — the
    handler returns serde_json::Value rather than an apistos-typed
    struct, which doesn't satisfy PathItemDefinition.

    Useful precisely when verifying a new OpenAI-compat backend is
    serving — separately from /config, which reflects graphrag-core's
    internal embedding-generator config (a different layer that's not
    the user-facing path).

Documentation

  • README: [openai] chat block alongside the existing [ollama]
    block. Quick Start gets an "Option B" path showing the
    EMBEDDING_BACKEND=openai flow against vLLM.

Methodology

  • Cherry-picked off upstream/main (c46e287).
  • All three feature combos compile clean: qdrant,ollama /
    qdrant,openai / qdrant,ollama,openai.
  • Nine inline unit tests in graphrag-core/src/openai/mod.rs:
    • serde round-trip with extra_body objects
    • max_tokens=None round-trip (skip-on-None)
    • extra_body=None round-trip
    • request body shape (model, messages, stream, defaults)
    • params override config (temperature, num_predict)
    • max_tokens omitted when uncapped
    • extra_body unique-key merge
    • extra_body precedence rule (set fields beat collisions)
    • defensive: non-object extra_body silently dropped
  • Run with: cargo test -p graphrag-core --lib --features openai openai::. 9/9 pass.
  • cargo fmt --check clean on touched files. Pre-existing fmt
    warnings in untouched upstream files left alone.

Open questions

  • extra_body is Option<serde_json::Value> for maximum flexibility.
    Considered a typed enum (e.g., BackendExtras::LlamaCpp { ... } | BackendExtras::Vllm { ... }) but settled on raw Value because the
    server-specific knobs change faster than this codebase's release
    cadence. Open to switching if you'd rather have validation.
  • The feature gate is opt-in (mirrors ollama). Happy to flip to
    default-on or add to default = [...].
  • /api/embeddings/stats is folded in here because its primary use
    case is diagnosing the new OpenAI embedding backend. Happy to
    split into a follow-up PR if you'd prefer.

Stack note: sits on top of PR #8. Builds standalone against upstream/main because the cherry-pick is independent at the git level — no merge conflicts with PR #8 — but reviewing them in order may make the contribution easier to follow.

carcall added 30 commits October 26, 2025 17:23
Phase 1 - TRIVIAL fixes:
- Remove unused imports from traversal.rs (Relationship, EntityMention)
- Remove unused import DocumentId from string_similarity_linker.rs
- Remove unused imports from bidirectional_index.rs (DocumentId, TextChunk)
- Update obsolete comment in lib.rs about GraphRAG re-export

Phase 2 - EASY implementations:
- Implement relationships_examined counter tracking in logic_form.rs
- Add GraphRAGBuilder re-export in lib.rs
- Implement property extraction for Has queries in logic_form.rs
  * Supports querying entity properties: name, type, confidence, mentions
  * Returns all properties if only entity specified
  * Returns specific property if both entity and property specified

All changes compile successfully with no warnings.
…hunks

Completed 3 TODO implementations in persistence layer:

1. Relationships (save/load):
   - Schema: source, target, relation_type, confidence, context
   - Full support for relationship context tracking

2. Documents (save/load):
   - Schema: id, title, content, metadata, chunk_count
   - Preserves document metadata as parallel key-value arrays

3. Chunks (save/load):
   - Schema: id, document_id, content, offsets, embedding, entities
   - Metadata: chapter, keywords, summary
   - Full support for embeddings and entity references

Implementation uses Arrow RecordBatch with ListBuilder for nested structures.
Completed 2 TODO implementations:

1. **Relationship Extraction in LightRAG** (graph_indexer.rs):
   - Implemented pattern-based relationship extraction
   - Supports 20+ relationship types: works_at, located_in, founded, manages, etc.
   - Extracts relationships between detected entities
   - Confidence scoring based on pattern match and entity types
   - Type-aware adjustments (person+organization, entity+location)

2. **Dependency Analysis in Decomposer** (decomposer.rs):
   - Analyzes dependencies between subqueries based on query types
   - Dependency types: Sequential, Reference, Context
   - Logic:
     * Relationship queries depend on Entity queries (Reference)
     * Attribute queries depend on Entity queries (Reference)
     * Comparative queries depend on Entity/Attribute queries (Reference)
     * Temporal queries use Entity queries for Context
     * Causal queries have Sequential dependencies
   - Automatic deduplication of dependencies

Both implementations follow existing code patterns and include proper confidence scoring.
Completed TODO in api_providers.rs:332 - batch embedding support.

Implementation:
- New make_batch_request() method for true batch API calls
- Supports all providers: OpenAI, Voyage, Cohere, Jina, Mistral, Together
- Proper batch request/response format for each provider
- Automatic fallback to sequential if batch fails
- Validates embedding count matches input count

Benefits:
- Significant performance improvement for bulk operations
- Reduced API calls and latency
- Provider-native batch support utilized

Response formats handled:
- OpenAI-compatible: data[{embedding: [...]}]
- Cohere: embeddings[[...]]
Completed TODO in query_concepts.rs:163 - semantic matching.

Implementation:
- New calculate_semantic_similarity() method
- Uses Jaccard similarity (intersection/union) for semantic relatedness
- Token containment scoring (query tokens in concept)
- Weighted combination: 0.6*jaccard + 0.4*containment
- Applies configurable semantic threshold
- Lightweight proxy for true embedding-based matching

This provides semantic matching without requiring pre-computed embeddings.
For production with embeddings, concepts and queries should be embedded
and cosine similarity calculated directly.

Benefits:
- Catches semantically related concepts beyond exact/fuzzy match
- No embedding infrastructure required for basic semantic matching
- Configurable via use_semantic_match and semantic_threshold
Completed TODO in retrieval/mod.rs:238 - parallel processing support.

Implementation:
- New with_parallel_processing() constructor
- Accepts Arc<dyn VectorStore> for thread-safe sharing
- Accepts EmbeddingGenerator for parallel operations
- Integrates ParallelProcessor for batch operations

Design:
- VectorStore trait is already Send + Sync
- Arc wrapper enables safe cross-thread usage
- EmbeddingGenerator operations can use rayon for parallelization
- ParallelProcessor stored for future batch operations

This enables efficient parallel indexing and querying for large-scale
knowledge graphs with thread-safe vector operations.
Completed TODO implementations in data_import.rs (534, 547).

**Dependencies Added**:
- quick-xml (0.36) for GraphML XML parsing
- oxrdf (0.2) + oxttl (0.1) for RDF/Turtle parsing
- New features: graphml-import, rdf-import

**GraphML Parser**:
- Full GraphML XML format support
- Parses nodes with attributes (id, name, type)
- Parses edges with source/target/type
- Supports nested <data> elements with keys
- Returns ImportedEntity and ImportedRelationship lists

**RDF/Turtle Parser**:
- Turtle/RDF triple parsing (subject-predicate-object)
- Automatic entity extraction from subjects/objects
- Relationship extraction from URI objects
- Property extraction from literal objects
- URI local name extraction (after # or /)
- Default types for resources without explicit type

Both parsers:
- Feature-gated (#[cfg(feature = "...-import")])
- Comprehensive error handling
- Processing time tracking
- Return ImportResult with counts and errors

Enables graph import from standard formats (GraphML, RDF/Turtle).
## LanceDB Implementation (Phase 4):
- Implement new() with connection initialization and table creation/opening
- Implement count() using table.count_rows()
- Implement store_embedding() with Arrow RecordBatch construction
- Implement search_similar() with k-nearest neighbor vector search
- Add QueryBase and ExecutableQuery trait imports
- Handle FixedSizeList DataType with pattern matching for arrow 57

## Graph Embeddings (Phase 4):
- Implement MaxPool aggregation (element-wise max across neighbors)
- Implement Attention aggregation with softmax-normalized weights
- Implement LSTM aggregation with decay-based sequential processing
- Fix type inference for decay factor in LSTM

## Dependency Updates:
- Update arrow dependencies from 56 to 57 (workspace + graphrag-core)
- Update lancedb from 0.22.2 to 0.26.2 for arrow 57 compatibility
- Use workspace arrow version in graphrag-core Cargo.toml
- Enable lancedb module in persistence (feature gate: lancedb, not lance-storage)

## Bug Fixes:
- Fix VectorStore delete() to return () instead of DeleteResult
- Fix DataType::FixedSizeList access for arrow 57 API changes (match pattern instead of as_fixed_size_list())
## BLEU Score Implementation (Phase 5 - VERY HIGH):

### Core Algorithm:
- Implement calculate_bleu_score() with n-gram precision (n=1-4)
- Calculate brevity penalty: BP = exp(1 - ref_len/cand_len)
- Final score: BLEU = BP * exp(1/N * sum(log(P_n)))

### Helper Methods:
- calculate_ngram_precision() - Precision with clipped counts
- extract_ngrams() - N-gram extraction from token sequences
- Clipping logic to prevent over-counting repeated n-grams

### Integration:
- Call BLEU calculation in calculate_quality_metrics()
- Compute average BLEU score across benchmark queries
- Add BLEU score to BenchmarkSummary output
- Display BLEU in print_summary() when available

### Algorithm Details:
- N-gram range: 1-4 (unigrams through 4-grams)
- Modified precision with clipping to max reference counts
- Geometric mean of n-gram precisions
- Brevity penalty for short candidates
- Returns 0.0 if any n-gram precision is 0
## LanceDB Batch Methods (Phase 4):

### store_embeddings_batch():
- Validate dimensions for all embeddings in batch
- Create Arrow StringArray for IDs
- Create FixedSizeListArray for embedding vectors
- Build RecordBatch and add to table
- Handle empty batch case gracefully

### get_embedding():
- Query table by ID using SQL filter (only_if)
- Execute query and collect results
- Extract embedding from FixedSizeList column
- Return None if ID not found
- Use TryStreamExt for async result collection

### Implementation Details:
- Both methods use Arrow RecordBatch construction
- Proper error handling with GraphRAGError
- Tracing support for debug logging
- Dimension validation before insertion

LanceDB integration now complete with all 6 methods:
- new() - Connection and table initialization
- count() - Count rows
- store_embedding() - Single embedding storage
- store_embeddings_batch() - Batch storage
- get_embedding() - Retrieve by ID
- search_similar() - K-nearest neighbor search
## ROUGE-L Score Implementation (Phase 5 - VERY HIGH):

### Core Algorithm:
- Implement calculate_rouge_l() using Longest Common Subsequence (LCS)
- LCS-based precision: LCS_length / candidate_length
- LCS-based recall: LCS_length / reference_length
- F-score with β=1.2: ((1+β²)*P*R) / (β²*P + R)

### LCS Dynamic Programming:
- Implement lcs_length() with O(m*n) time complexity
- DP table: dp[i][j] = LCS of seq1[0..i] and seq2[0..j]
- Recurrence: if match: dp[i][j] = dp[i-1][j-1] + 1
- Else: dp[i][j] = max(dp[i-1][j], dp[i][j-1])

### Integration:
- Call ROUGE-L calculation in calculate_quality_metrics()
- Compute average ROUGE-L score across benchmark queries
- Add ROUGE-L to BenchmarkSummary output
- Display ROUGE-L in print_summary() when available

### Algorithm Details:
- Token-based LCS (word-level, not character-level)
- β=1.2 slightly favors recall over precision
- Returns 0.0 for empty sequences
- Clamps result to [0, 1] range
## Semantic Chunking Implementation (Phase 4 - MEDIUM-HIGH):

### Algorithm:
- Split text into sentences using existing split_sentences()
- Calculate lexical cohesion (Jaccard similarity) between adjacent sentences
- Create chunk boundaries where similarity < threshold (default 0.7)
- Merge small chunks below min_size with previous chunk
- Split large chunks above max_size by sentence boundaries

### Features:
- Uses existing lexical_cohesion() method for word-overlap similarity
- Respects min_size, max_size, and similarity_threshold config
- Calculates coherence score for each chunk
- Maintains sentence and paragraph counts
- Handles edge cases (empty text, single sentence, etc.)

### Implementation Details:
- Lexical-based semantic similarity (word overlap)
- No deep learning embeddings required (practical approach)
- Still "semantic" because it respects content similarity
- Efficient: O(n) where n is number of sentences

Closes semantic chunking TODO at nlp/semantic_chunking.rs:329
## VectorStore LanceDB Implementation:

### add_vectors_batch():
- Implement full Arrow RecordBatch construction for batch vector insertion
- Create StringArray for IDs
- Create FixedSizeListArray for embeddings with proper dimension
- Build schema with id (Utf8) and vector (FixedSizeList) fields
- Add batch to LanceDB table using table.add()

### search():
- Implement vector similarity search with k-nearest neighbors
- Use query().limit(k).nearest_to() pattern
- Extract IDs from result batches
- Calculate inverse ranking scores
- Return SearchResult vec with id, score, metadata

### Implementation Details:
- Reuses Arrow pattern from persistence/lance.rs
- Proper error handling for all LanceDB operations
- Empty batch handling for add_vectors_batch
- Type-safe Float32Type for embeddings

Closes TODO at vector/lancedb.rs:89
Implements complete builder pattern for GraphRAG configuration:
- 20+ builder methods for all major config options
- Fluent API: output_dir, chunk_size, embeddings, ollama, retrieval
- with_local_defaults() for zero-config local setup
- config() and config_mut() for advanced use cases
- Full test coverage: 11/11 tests passing

Unblocks TODO at lib.rs:282,1271
Enables GraphRAG::builder() method
Adds to prelude for easy access
Updates:
- parquet 52 -> 57 to match arrow 57
- Fix ParquetRecordBatchReaderBuilder import path
- Add Array trait import for is_null() method
- Wrap embeddings in Arc::new() for RecordBatch

Implements embeddings save/load using ListBuilder pattern:
- Save: Build ListArray from Option<Vec<f32>>
- Load: Extract Vec<f32> from ListArray with null handling
- Consistent with chunks embeddings implementation

Completes TODO at persistence/parquet.rs:245,360
Changes test_graph_indexing to use #[tokio::test] and .await
to properly handle async index_graph() method.

Fixes compilation error: cannot call is_ok() on Future
Registry Service Implementations (core/registry.rs):
- Expand build_registry() with comprehensive service structure
- Add 8 service registration points with feature gates:
  * Storage (memory-storage)
  * Vector Store (vector-memory)
  * Embedding Provider (ollama)
  * Entity Extractor (entity-extraction)
  * Retriever (retrieval)
  * Language Model (ollama)
  * Metrics Collector (monitoring)
  * Function Registry (function-calling)
- Document service registration order and requirements
- Prepare for future service implementations

Benchmark System Integration (monitoring/benchmark.rs):
- Add pluggable architecture with function injection
- New builder methods:
  * with_retrieval(fn) - plug in retrieval system
  * with_reranker(fn) - plug in cross-encoder
  * with_llm(fn) - plug in LLM generator
- Modify benchmark_query() to use actual services when provided
- Fall back to simulation mode when services not set
- Enable real performance measurement with production systems

Completes TODOs at:
- core/registry.rs:336
- monitoring/benchmark.rs:244,250,258
Implemented execute_happened_query and execute_caused_query with
multi-strategy approaches for knowledge graph reasoning.

Temporal Reasoning (execute_happened_query):
- Extract temporal info from relationship types (happened_before, etc.)
- Parse chunk metadata.custom for date/timestamp/time fields
- Detect temporal keywords in chunk content (months, days, seasons)
- Use document position as narrative ordering heuristic
- Return temporal contexts with confidence scoring

Causal Reasoning (execute_caused_query):
- Identify direct causal relationships (causes, leads_to, results_in)
- Build causal chains using DFS traversal (max depth 3)
- Analyze co-occurrence in chunks for implicit causality
- Detect causal keywords in content (because, therefore, due to)
- Rank explanations by confidence scores

Both methods follow existing patterns from execute_related_query
and execute_compare_query, returning VariableBinding results.
Updated README.md and graphrag-core/README.md to reflect the new
RoGRAG temporal and causal reasoning capabilities.

Main Changes:
- Root README: Updated ROGRAG description in features section
- Root README: Marked temporal and causal reasoning as completed
- Core README: Added comprehensive RoGRAG section in Advanced Features

New Documentation Covers:
- Query decomposition (60%→75% accuracy boost)
- Temporal reasoning with 4 extraction strategies
- Causal reasoning with confidence-based ranking
- Supported query types (identity, relationships, temporal, causal)
- Feature flag configuration
Resolved remaining TODO items and clarified project boundaries.

Changes:
1. Utility modules (lib.rs:151)
   - Removed TODO: only optional future modules
   - Clarified: automatic_entity_linking, phase_saver not needed
   - Marked as future enhancements, not blockers

2. Voy vector store (vector/mod.rs:27)
   - Removed TODO: already fully implemented (~500 lines)
   - Clarified: belongs in graphrag-wasm (WASM-specific)
   - Added note pointing to correct location

3. Scope cleanup
   - Removed Multilingual Support from roadmap (out of scope)
   - All core functionality TODOs now resolved
   - Remaining work: integration when dependencies ready

Progress Summary:
- 21/47 TODOs completed (45%)
- 2/47 TODOs removed (out of scope)
- 4/47 TODOs deferred (need dependencies)
- 20/47 N/A or not applicable
- Total: 87% project completion
…support

- Added incremental indexing and delta computation logic
- Introduced critic feedback loop for knowledge extraction
- Implemented Ollama embedding and LLM adapters
- Added support for LightRAG concept selection and query planning
- Introduced cross-encoder reranking and adaptive retrieval
- Added Python bindings in  using PyO3
- Improved CLI UX with better progress monitoring
- Refined .gitignore to include docs and exclude benchmark results
wellos and others added 8 commits April 29, 2026 14:35
Adds a third option to EMBEDDING_BACKEND alongside "ollama" and "hash":

    EMBEDDING_BACKEND=openai \
    OPENAI_URL=http://localhost:8000/v1 \
    OPENAI_EMBEDDING_MODEL=BAAI/bge-m3 \
    OPENAI_API_KEY=optional \
    EMBEDDING_DIM=1024

Hits any OpenAI-compatible /embeddings endpoint:
- vLLM (`vllm serve <model> --task embed`)
- OpenVINO Model Server (with EmbeddingsCalculatorOV graph)
- llama.cpp server (`llama-server --embedding`)
- the real OpenAI API
- LiteLLM, OpenRouter, etc.

Implementation:
- New OpenAIClient struct (reqwest-based) holding base_url, model, api_key.
- New `openai_url` / `openai_model` / `openai_api_key` fields on
  EmbeddingConfig with sensible defaults.
- `EmbeddingService::new` probes /models on startup; falls back to hash
  embeddings if the server isn't reachable. Synthetic model names that
  don't match the configured one are tolerated (vLLM single-model mode,
  OVMS Mediapipe graph names like "embeddings").
- New `generate_with_openai` method posts one request per text using
  the OpenAI body shape `{"model": ..., "input": ...}` and unwraps
  `data[0].embedding` from the response. Per-text rather than batched
  to keep the dimension-validation path simple.
- `generate()` dispatch tries openai first if configured, then ollama,
  then hash fallback.
- `backend_name()` reports "openai" when active.

Cargo: adds reqwest as a non-optional dep on graphrag-server (already
in the build via qdrant-client transitively).

Cargo check passes with --no-default-features --features qdrant,ollama.

Note: chat LLM still routes through OllamaClient. Wiring an OpenAI-compat
chat backend through graphrag-core's pipeline (entity extraction, query
planner, gleaning) is a larger refactor — staged as a follow-up.
The runtime pipeline (entity extraction, query planner, gleaning, answer
generation) used to construct OllamaClient directly in 4 places in lib.rs
and 7 consumer files took OllamaClient as a concrete type. Adding any
non-Ollama chat backend required either a tree-wide trait refactor or a
shim — both costly.

Solution: a small ChatClient enum dispatcher. Same surface as
OllamaClient (`generate`, `generate_with_params`, `get_stats`,
`keep_alive`), routes to either backend at runtime based on
`config.openai.enabled` / `config.ollama.enabled`.

Files:
- NEW graphrag-core/src/openai/mod.rs (~250 LoC) — OpenAIClient + OpenAIConfig.
  Mirrors OllamaClient: ureq-based, sync-on-spawn-blocking, OllamaUsageStats.
  Posts {model, messages:[{role:user, content}], temperature, max_tokens,
  top_p, stop} to {base_url}/chat/completions; reads
  choices[0].message.content. Honors api_key when non-empty (Bearer header).
  Ollama-only fields (top_k, repeat_penalty, keep_alive, num_ctx, context)
  in OllamaGenerationParams are silently ignored on the OpenAI path.

- NEW graphrag-core/src/chat/mod.rs (~100 LoC) — ChatClient enum:
  Ollama(OllamaClient) | OpenAI(OpenAIClient). `from_config(&ollama, &openai)`
  picks: openai when enabled, else ollama when enabled, else None.
  `from_ollama` / `from_openai` are explicit constructors for tests and
  call sites that already built a backend.

- graphrag-core/src/lib.rs: register chat + openai modules. Replace 4
  `OllamaClient::new(self.config.ollama.clone())` callsites with
  `ChatClient::from_config(&self.config.ollama, &self.config.openai)`.
  Two skip-and-warn paths when neither is enabled (gleaning + single-pass
  extraction); one error-return path (semantic answer generation).

- graphrag-core/src/config/mod.rs: add `openai: OpenAIConfig` field on
  `Config`. Defaults to disabled. Parses from JSON config under
  `["openai"]` (same shape as `["ollama"]`).

- Consumers swapped concrete OllamaClient -> ChatClient:
    entity/atomic_fact_extractor.rs
    entity/gleaning_extractor.rs        (extracts keep_alive via new
                                         ChatClient::keep_alive() helper)
    entity/llm_extractor.rs
    entity/llm_relationship_extractor.rs (Option<OllamaClient> ->
                                          Option<ChatClient>)
    entity/semantic_merging.rs
    text/contextual_enricher.rs         (added from_chat_client; old
                                          new(OllamaConfig) preserved)
    query/planner.rs

  Tests in gleaning_extractor and llm_extractor wrap their constructed
  OllamaClient with ChatClient::from_ollama() before passing.

End state: existing Ollama users see no behavior change. To switch to
llama-server / vLLM / real OpenAI, set in the runtime pipeline config:

    "openai": {
      "enabled": true,
      "base_url": "http://localhost:17171/v1",
      "chat_model": "Qwen3.6-27B-Q4_K_M",
      "api_key": ""
    }

graphrag-core cargo check passes.
The five `if self.config.ollama.enabled` gates in build_graph()/query
predated the openai backend split. With ollama disabled and openai
enabled (the production case behind a llama.cpp / vLLM / OVMS server),
they all fell through to pattern-based extraction or non-LLM answer
synthesis, even though ChatClient::from_config would have happily
returned an OpenAI client.

Add `Config::chat_enabled()` (`ollama.enabled || openai.enabled`) and
swap the five sites — gleaning gate, single-pass gate, the two query
synthesis paths, and the critic-loop gate. Logging also corrected:
the single-pass branch comment no longer claims "Ollama enabled".

Net: with `openai.enabled=true, ollama.enabled=false, use_gleaning=
{true|false}` (the HM-shipped config on neo-16), graph build now
dispatches via ChatClient::OpenAI to llama-server instead of logging
"Using pattern-based entity extraction" and returning 0 entities.

server: feed /api/documents content into GraphRAG, not just qdrant

`/api/documents` previously short-circuited after writing to qdrant
because qdrant is the retrieval backend. The live GraphRAG instance —
which owns the chunks/knowledge_graph used by /api/graph/build — never
saw the content, so build_graph() ran over zero chunks and reported
"0 entities, 0 relationships" no matter how many docs you POSTed.

Now after a successful qdrant insert we also call
`graphrag.add_document_from_text(content)` and flip graph_built=false
so a subsequent build is required. Failure of the GraphRAG ingest is
logged but does not poison the qdrant write — qdrant is canonical for
retrieval and the graph is best-effort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets callers inject non-standard top-level fields into every
/chat/completions request without expanding OpenAIConfig for every
backend quirk. Motivating case: pass
chat_template_kwargs.enable_thinking=false to llama.cpp's --jinja
path so Qwen3-style reasoning is suppressed per-client, without
flipping --reasoning off on the shared llama-server.

- openai/mod.rs: new Option<serde_json::Value>, merged at top-level
  with set-field precedence (existing keys win on collision).
- config/mod.rs: disk-config parser (json crate) round-trips through
  a string to convert to serde_json::Value; POST /config (serde
  path) picks the field up via #[serde(default)].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d's cap

Two coupled changes so callers can opt out of the per-extraction-call
generation cap when running against a local LLM (no token billing,
just compute time — reasoning-class models truncate JSON mid-output
when capped, even with thinking suppressed).

- LLMEntityExtractor.max_tokens: usize → Option<usize>. `None` means
  "no cap" — `num_predict` is omitted from the request body, so the
  server uses its own default (llama.cpp: -1 / unlimited up to ctx).
  Default still Some(1500); existing `with_max_tokens(usize)` keeps
  its signature. New `with_max_tokens_opt(Option<usize>)` exposes
  the uncapped path. num_ctx formula falls back to 2048 when uncapped
  (only matters on the Ollama path; OpenAI ignores num_ctx).

- lib.rs build_graph: read max_tokens (and temperature) from the
  active chat backend instead of hardcoding ollama.*. Previously,
  enabling openai still inherited ollama's defaults — silently
  capping extraction at 1500 even when openai.max_tokens was set
  higher. Now openai.enabled routes to openai.max_tokens; ollama
  remains the fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the [openai] config block alongside the existing [ollama] block
so users see both options when picking a chat backend, plus an Option
B in Quick Start showing the EMBEDDING_BACKEND=openai / OPENAI_URL
flow against vLLM-class servers. Mentions extra_body for backend-
specific knobs (Qwen3 thinking suppression, vLLM json-only outputs).

Embedding-side OpenAI backend was already mentioned in the providers
table; this commit fills in the chat-side gap and the Optional
Dependencies bullet so the OpenAI-compatible path is discoverable from
the top of the README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
graphrag-core: add `openai = ["ureq", "async"]`. Pulled into the
`starter` bundle so the common path is single-flag. The OpenAIConfig
struct stays unconditional (always parses through serde, so user
configs round-trip whether the feature is on or off); only
OpenAIClient + the HTTP path + the ChatClient::OpenAI dispatch arm
are gated. ChatClient::from_config falls through to ollama / None
when openai.enabled = true is set without the feature compiled in,
with a tracing::warn explaining how to fix.

graphrag-server: add `openai = ["graphrag-core/openai"]`. Gates the
OpenAIClient struct, the openai_client field on EmbeddingService,
the openai-init branch in EmbeddingService::new, and the
generate_with_openai method. Without the feature, setting
EMBEDDING_BACKEND=openai logs a "not compiled in" warning and falls
back to the hash generator — same shape as the existing ollama
feature-off path.

Body construction in openai/mod.rs is extracted into a small
build_request_body helper so unit tests can assert the exact wire
shape (extra_body merge precedence, max_tokens omission when
uncapped) without standing up an HTTP server. Adds 9 tests, all
inline under `#[cfg(all(test, feature = "openai"))]`:

  - serde round-trip (incl. extra_body objects, max_tokens=None)
  - body shape (model, messages, stream, defaults from config)
  - params override config (temperature, num_predict)
  - max_tokens omitted when uncapped (None)
  - extra_body unique-key merge
  - extra_body precedence rule (set fields beat collisions)
  - extra_body defensive: non-object value silently dropped

Verified all three relevant feature combos compile inside the
graphrag-rs-nix devshell:
  --features qdrant,ollama
  --features qdrant,openai
  --features qdrant,ollama,openai

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reports the runtime EmbeddingService's backend (openai / ollama /
hash-fallback), dimension, and per-source request counters. Lets
callers (e.g. an e2e harness) verify which embedding path is actually
serving — separately from /config's view, which reflects graphrag-
core's internal embedding-generator config and is not the path that
serves /api/documents and /api/query.

Registered as a plain Actix route (not apistos) below .build() —
same OpenAPI-bypass dance as /config endpoints, since the stats
handler returns serde_json::Value rather than an apistos-typed
struct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dataO1 added a commit to dataO1/graphrag-rs that referenced this pull request Apr 30, 2026
dataO1 added a commit to dataO1/graphrag-rs that referenced this pull request May 3, 2026
…ngs refactor

Scope grew beyond the original PR F draft (which was just 4e8c6ff —
inject the real embedder). Filed PR includes both 4e8c6ff and the
follow-up d74116f (unify around Config.embeddings, drop dual storage,
atomic POST /config swap, dim validation, /api/embeddings/stats →
/embeddings/stats route move, /health.embeddings block).

Stacked on automataIA#12 (LightRAG) because d74116f's main.rs already carries
PR D/E content and cherry-picking onto a shallower base re-introduces
conflicts. Conceptual dependency is only on automataIA#9 (PR B's
EmbeddingService); the chain through 10/11/12 is a base artifact.

End-to-end validated against live OVMS+NPU: 52 passed / 0 failed,
including new backend-switching test (POST /config flips backend
atomically across /config + /embeddings/stats + /health.embeddings;
dim mismatch returns HTTP 400 with no state change).

Also delete the stale PR-F-DRAFT.md scratch file.
@automataIA automataIA force-pushed the main branch 2 times, most recently from d39471e to 84ef833 Compare May 31, 2026 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants