Proposal Draft: Split Control Plane Metadata from Vector Indexing in KB
Abstract
This proposal introduces a clear storage boundary for KB:
MetadataStore for control-plane metadata (collection_metadata, ingestion_runs, prompt_templates, main_pointers)
VectorIndexStore for vector indexing and retrieval (documents/parses/chunks/embeddings and search/index operations)
The goal is to remove current coupling, reduce migration risk, and make multi-backend vector support practical.
This proposal follows the direction discussed in Issue #90.
What We Are Proposing
Define a strict storage boundary:
Control-plane -> MetadataStore
Data-plane -> VectorIndexStore
Migrate control-plane metadata from LanceDB to a relational database (RDB) in phases.
Keep the vector path provider-agnostic through the VectorIndexStore SPI.
Use a single primary vector store for writes and reads by default, while reserving interfaces for future multi-store reads.
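To make the boundary concrete for review, the two contracts could be sketched as below. This is only an illustration under stated assumptions: the method names, the `Chunk` record, and the in-memory implementations are hypothetical and are not taken from the KB codebase or from any provider's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Chunk:
    """Hypothetical data-plane record: one embedded chunk of a document."""
    chunk_id: str
    doc_id: str
    embedding: tuple[float, ...]


class MetadataStore(Protocol):
    """Control-plane contract; method names are illustrative assumptions."""

    def put_collection_metadata(self, collection: str, meta: dict) -> None: ...
    def get_collection_metadata(self, collection: str) -> dict | None: ...


class VectorIndexStore(Protocol):
    """Data-plane contract; method names are illustrative assumptions."""

    def upsert(self, collection: str, chunks: Sequence[Chunk]) -> int: ...
    def delete(self, collection: str, doc_id: str) -> int: ...
    def search(self, collection: str, query: Sequence[float], k: int) -> list[Chunk]: ...


class InMemoryMetadataStore:
    """Toy MetadataStore, useful only for tests."""

    def __init__(self) -> None:
        self._meta: dict[str, dict] = {}

    def put_collection_metadata(self, collection, meta):
        self._meta[collection] = dict(meta)

    def get_collection_metadata(self, collection):
        return self._meta.get(collection)


class InMemoryVectorIndexStore:
    """Toy VectorIndexStore that ranks chunks by dot product."""

    def __init__(self) -> None:
        self._chunks: dict[str, dict[str, Chunk]] = {}  # collection -> chunk_id -> Chunk

    def upsert(self, collection, chunks):
        table = self._chunks.setdefault(collection, {})
        for chunk in chunks:
            table[chunk.chunk_id] = chunk
        return len(chunks)

    def delete(self, collection, doc_id):
        table = self._chunks.get(collection, {})
        doomed = [cid for cid, c in table.items() if c.doc_id == doc_id]
        for cid in doomed:
            del table[cid]
        return len(doomed)

    def search(self, collection, query, k):
        def score(chunk):
            return sum(q * e for q, e in zip(query, chunk.embedding))

        table = self._chunks.get(collection, {})
        return sorted(table.values(), key=score, reverse=True)[:k]
```

Business and API code would depend only on the two `Protocol` types, so swapping LanceDB for another provider would mean adding a new `VectorIndexStore` implementation rather than touching callers.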
Current State
Today LanceDB serves both:
control-plane metadata
vector indexing and retrieval
This leads to cross-layer coupling in API/service code and makes backend expansion expensive.
Current (As-Is) Core Flow
sequenceDiagram
participant API as API/Service
participant Biz as KB Business Modules
participant LDB as LanceDB (mixed responsibilities)
API->>Biz: ingest/search/delete
Biz->>LDB: write metadata tables
Biz->>LDB: write vector tables
Biz->>LDB: read/search/index ops
Problems and Motivation
Problems
Boundary ambiguity: control-plane and vector-plane responsibilities are mixed.
High migration cost: adding a new vector backend risks copying existing coupling.
Consistency complexity: metadata and vector updates are difficult to reason about and to recover when they diverge.
Review friction: architecture intent is hidden behind implementation details.
Motivation
Make responsibilities explicit and auditable.
Enable safe phased migration with rollback controls.
Prepare a clean SPI for pluggable vector providers.
Improve long-term maintainability for open-source contributors.
Proposed Architecture
Target (To-Be) Core Flow
sequenceDiagram
participant API as API/Service
participant Coord as KBWriteCoordinator
participant MS as MetadataStore (RDB)
participant OP as Operation/Outbox
participant VS as VectorIndexStore (VDB)
API->>Coord: submit write/delete
Coord->>MS: write control-plane state
Coord->>OP: enqueue operation
OP->>VS: apply vector-side changes
VS-->>MS: update status/stats via coordinator
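The target flow can be sketched as a small coordinator that records control-plane state and an outbox entry in one relational transaction, then applies the vector-side change asynchronously. This is a minimal sketch, not the proposed implementation: it uses SQLite from the Python standard library as a stand-in RDB, and the table schemas, status values, and the `FakeVectorStore` class are assumptions made for illustration.

```python
import json
import sqlite3


class FakeVectorStore:
    """Stand-in for a VectorIndexStore provider (hypothetical, in-memory)."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc_id, payload):
        self.docs[doc_id] = payload

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


class KBWriteCoordinator:
    """Writes control-plane state plus an outbox row, then drains the outbox."""

    def __init__(self, conn, vector_store):
        self.conn = conn
        self.vs = vector_store
        with self.conn:
            # Schemas below are assumptions for this sketch.
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS ingestion_runs "
                "(doc_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
            )
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS outbox "
                "(op_id INTEGER PRIMARY KEY AUTOINCREMENT, kind TEXT NOT NULL, "
                "doc_id TEXT NOT NULL, payload TEXT, done INTEGER NOT NULL DEFAULT 0)"
            )

    def submit(self, kind, doc_id, payload=None):
        if kind not in ("upsert", "delete"):
            raise ValueError(f"unknown operation kind: {kind}")
        with self.conn:  # status row and outbox row commit or roll back together
            self.conn.execute(
                "INSERT OR REPLACE INTO ingestion_runs(doc_id, status) VALUES (?, 'pending')",
                (doc_id,),
            )
            self.conn.execute(
                "INSERT INTO outbox(kind, doc_id, payload) VALUES (?, ?, ?)",
                (kind, doc_id, json.dumps(payload)),
            )

    def drain(self):
        """Apply pending operations in order; returns how many were applied."""
        rows = self.conn.execute(
            "SELECT op_id, kind, doc_id, payload FROM outbox WHERE done = 0 ORDER BY op_id"
        ).fetchall()
        for op_id, kind, doc_id, payload in rows:
            if kind == "upsert":
                self.vs.upsert(doc_id, json.loads(payload))
                status = "indexed"
            else:
                self.vs.delete(doc_id)
                status = "deleted"
            with self.conn:  # mark done and report status back to the metadata side
                self.conn.execute("UPDATE outbox SET done = 1 WHERE op_id = ?", (op_id,))
                self.conn.execute(
                    "UPDATE ingestion_runs SET status = ? WHERE doc_id = ?", (status, doc_id)
                )
        return len(rows)
```

Because upsert and delete are idempotent in this sketch, a crash between the vector-side call and the `done = 1` update only causes the operation to be replayed on the next drain, which is the property that lets the design avoid 2PC on the default path.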
Design Principles
Domain contracts first (no direct DB semantics in business/API layer)
Explicit consistency model (operation/outbox/reconcile)
Configurable and rollback-friendly migration
Provider-agnostic vector API with explicit capability fallback
Two-phase commit (2PC) is treated as a reference option, not the default path
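The "explicit capability fallback" principle could look roughly like the sketch below: a provider declares what it supports, and the caller decides whether degrading from hybrid to dense search is acceptable. The capability names echo the baseline list later in this draft; the `Capability`, `SearchPlan`, and `plan_search` names, and the choice to treat hybrid search as the optional feature, are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    """Hypothetical capability flags a vector provider might declare."""
    DENSE_SEARCH = auto()
    METADATA_FILTER = auto()
    HYBRID_SEARCH = auto()


class UnsupportedCapability(RuntimeError):
    """Raised when a request cannot be served and fallback is not allowed."""


@dataclass(frozen=True)
class SearchPlan:
    mode: str        # "dense" or "hybrid"
    degraded: bool   # True when the plan is a fallback from what was requested


def plan_search(provider_caps, want_hybrid, allow_fallback=True):
    """Pick a search mode, making any degradation visible to the caller."""
    if Capability.DENSE_SEARCH not in provider_caps:
        raise UnsupportedCapability("provider lacks dense search")
    if not want_hybrid:
        return SearchPlan("dense", degraded=False)
    if Capability.HYBRID_SEARCH in provider_caps:
        return SearchPlan("hybrid", degraded=False)
    if allow_fallback:
        # Fallback is explicit: the result records that it was degraded.
        return SearchPlan("dense", degraded=True)
    raise UnsupportedCapability("provider lacks hybrid search and fallback is disabled")
```

Returning a `degraded` flag instead of silently switching modes keeps the behavior auditable, which matches the intent of making responsibilities explicit.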
Phased Plan
Phase 1A: Interface Decoupling (no physical DB split yet)
… MetadataStore / VectorIndexStore / coordinator contracts.
Phase 1B: Control Plane Migration to RDB
… MetadataStore.
Phase 2: Pluggable Vector Providers
Phase 3: Stabilization and Decommission
Compatibility and Migration Principles
Non-Goals (for this proposal)
Open Questions
Suggested Baseline (for discussion)
To make review concrete, here is a proposed starting baseline for the three open questions:
Reconcile thresholds before read switch
… <= 0.1% over at least 3 consecutive runs.
… (collection/doc_id/step_type/model_tag): 0 unresolved in final pre-switch run.
Minimum provider capability matrix for Phase 2 exit
… upsert, delete, dense search, metadata filter, index health checks.
… sparse/FTS, hybrid search, advanced index tuning controls.
Phase 3 retention and archival policy