Proposal Draft: Split Control Plane Metadata from Vector Indexing in KB #135

@sqhyz55

Abstract

This proposal introduces a clear storage boundary for KB:

  • MetadataStore for control-plane metadata (collection_metadata, ingestion_runs, prompt_templates, main_pointers)
  • VectorIndexStore for vector indexing and retrieval (documents/parses/chunks/embeddings and search/index operations)

The goal is to remove current coupling, reduce migration risk, and make multi-backend vector support practical.
This proposal follows the direction discussed in Issue #90.

What We Are Proposing

  1. Define a strict storage boundary:
    • Control-plane -> MetadataStore
    • Data-plane -> VectorIndexStore
  2. Migrate control-plane metadata from LanceDB to an RDB in phases.
  3. Keep the vector path provider-agnostic through the VectorIndexStore SPI.
  4. Use a single primary vector store for writes and reads by default, while reserving interfaces for future multi-store reads.
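
To make the boundary concrete, the two contracts could look roughly like the sketch below. This is only an illustration: every method name, the `Chunk` type, and the in-memory implementations are assumptions made for this example, not the existing KB API or the final SPI.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Chunk:
    """Hypothetical data-plane record; field names are illustrative."""
    chunk_id: str
    collection: str
    text: str
    embedding: Sequence[float]


class MetadataStore(Protocol):
    """Control-plane contract: no vector or index semantics appear here."""

    def put_collection_metadata(self, collection: str, metadata: dict) -> None: ...
    def get_collection_metadata(self, collection: str) -> dict | None: ...


class VectorIndexStore(Protocol):
    """Data-plane contract: chunk storage plus search."""

    def upsert(self, chunks: Sequence[Chunk]) -> None: ...
    def delete(self, collection: str, chunk_ids: Sequence[str]) -> None: ...
    def search(self, collection: str, query: Sequence[float], top_k: int) -> list[Chunk]: ...


class InMemoryMetadataStore:
    """Toy implementation that satisfies MetadataStore structurally."""

    def __init__(self) -> None:
        self._collections: dict[str, dict] = {}

    def put_collection_metadata(self, collection: str, metadata: dict) -> None:
        self._collections[collection] = dict(metadata)

    def get_collection_metadata(self, collection: str) -> dict | None:
        return self._collections.get(collection)


class InMemoryVectorIndexStore:
    """Toy implementation that ranks chunks by dot product."""

    def __init__(self) -> None:
        self._chunks: dict[tuple[str, str], Chunk] = {}

    def upsert(self, chunks: Sequence[Chunk]) -> None:
        for chunk in chunks:
            # Keyed by (collection, chunk_id), so re-upserting overwrites.
            self._chunks[(chunk.collection, chunk.chunk_id)] = chunk

    def delete(self, collection: str, chunk_ids: Sequence[str]) -> None:
        for chunk_id in chunk_ids:
            self._chunks.pop((collection, chunk_id), None)

    def search(self, collection: str, query: Sequence[float], top_k: int) -> list[Chunk]:
        def score(chunk: Chunk) -> float:
            return sum(a * b for a, b in zip(chunk.embedding, query))

        candidates = [c for (col, _), c in self._chunks.items() if col == collection]
        return sorted(candidates, key=score, reverse=True)[:top_k]
```

Business and API code would depend only on the two protocols, so a LanceDB-backed or RDB-backed class can be swapped in without touching callers.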

Current State

Today LanceDB serves both:

  • control-plane metadata
  • vector indexing and retrieval

This leads to cross-layer coupling in API/service code and makes backend expansion expensive.

Current (As-Is) Core Flow

sequenceDiagram
    participant API as API/Service
    participant Biz as KB Business Modules
    participant LDB as LanceDB (mixed responsibilities)

    API->>Biz: ingest/search/delete
    Biz->>LDB: write metadata tables
    Biz->>LDB: write vector tables
    Biz->>LDB: read/search/index ops

Problems and Motivation

Problems

  • Boundary ambiguity: control-plane and vector-plane responsibilities are mixed.
  • High migration cost: adding a new vector backend risks copying existing coupling.
  • Consistency complexity: metadata and vector updates are hard to reason about and to recover from.
  • Review friction: architecture intent is hidden behind implementation details.

Motivation

  • Make responsibilities explicit and auditable.
  • Enable safe phased migration with rollback controls.
  • Prepare a clean SPI for pluggable vector providers.
  • Improve long-term maintainability for open-source contributors.

Proposed Architecture

Target (To-Be) Core Flow

sequenceDiagram
    participant API as API/Service
    participant Coord as KBWriteCoordinator
    participant MS as MetadataStore (RDB)
    participant OP as Operation/Outbox
    participant VS as VectorIndexStore (VDB)

    API->>Coord: submit write/delete
    Coord->>MS: write control-plane state
    Coord->>OP: enqueue operation
    OP->>VS: apply vector-side changes
    VS-->>Coord: report apply result
    Coord->>MS: update status/stats

Design Principles

  • Domain contracts first (no direct DB semantics in business/API layer)
  • Explicit consistency model (operation/outbox/reconcile)
  • Configurable and rollback-friendly migration
  • Provider-agnostic vector API with explicit capability fallback
  • 2PC is treated as a reference option, not the default path
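
The operation/outbox model in the target flow can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: `Operation`, `FakeVectorStore`, the status strings, and the method names are hypothetical, and the dict and list stand in for RDB tables that a real implementation would write in a single transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    op_id: str            # idempotency key (hypothetical field name)
    kind: str             # "upsert" or "delete"
    doc_id: str
    status: str = "pending"


@dataclass
class FakeVectorStore:
    """Stand-in for a VectorIndexStore; remembers which operations it applied."""
    docs: set = field(default_factory=set)
    applied_ops: set = field(default_factory=set)

    def apply(self, op: Operation) -> None:
        if op.op_id in self.applied_ops:
            return  # replayed operation: already applied, nothing to do
        if op.kind == "upsert":
            self.docs.add(op.doc_id)
        elif op.kind == "delete":
            self.docs.discard(op.doc_id)
        else:
            raise ValueError(f"unknown operation kind: {op.kind}")
        self.applied_ops.add(op.op_id)


class KBWriteCoordinator:
    def __init__(self, vector_store: FakeVectorStore) -> None:
        self.metadata: dict = {}   # doc_id -> status; stands in for MetadataStore
        self.outbox: list = []     # stands in for an outbox table
        self.vector_store = vector_store

    def submit(self, op_id: str, kind: str, doc_id: str) -> None:
        # Assumption: in an RDB these two writes share one transaction.
        self.metadata[doc_id] = "pending"
        self.outbox.append(Operation(op_id, kind, doc_id))

    def drain(self) -> int:
        """Apply every pending operation and return how many were processed."""
        processed = 0
        for op in self.outbox:
            if op.status == "done":
                continue
            self.vector_store.apply(op)
            op.status = "done"
            self.metadata[op.doc_id] = "indexed" if op.kind == "upsert" else "deleted"
            processed += 1
        return processed
```

Because `apply` is keyed on `op_id`, a worker that crashes after applying but before marking the operation done can safely retry it, which is what lets this path avoid 2PC.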

Phased Plan

Phase 1A: Interface Decoupling (no physical DB split yet)

  • Introduce MetadataStore / VectorIndexStore / coordinator contracts.
  • Remove direct control-plane LanceDB semantics from service/API paths.

Phase 1B: Control Plane Migration to RDB

  • Add RDB MetadataStore.
  • Run dual-write + backfill.
  • Switch control-plane reads to RDB.
  • Stop VDB control-plane writes after a stability window.
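
One way to picture the Phase 1B steps is a wrapper with independent read and write gates plus an idempotent backfill. The sketch below uses plain dicts in place of the legacy LanceDB tables and the new RDB; `MigrationGates`, `DualWriteMetadataStore`, and the flag names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MigrationGates:
    """Hypothetical feature gates; each can be flipped back independently."""
    write_legacy: bool = True    # keep writing control-plane rows to the legacy store
    write_rdb: bool = False      # dual-write to the new RDB
    read_from_rdb: bool = False  # serve control-plane reads from the RDB


class DualWriteMetadataStore:
    def __init__(self, legacy: dict, rdb: dict, gates: MigrationGates) -> None:
        self.legacy = legacy
        self.rdb = rdb
        self.gates = gates

    def put(self, key: str, value: dict) -> None:
        if self.gates.write_legacy:
            self.legacy[key] = value
        if self.gates.write_rdb:
            self.rdb[key] = value

    def get(self, key: str):
        source = self.rdb if self.gates.read_from_rdb else self.legacy
        return source.get(key)

    def backfill(self) -> int:
        """Copy legacy rows missing from the RDB; safe to re-run."""
        copied = 0
        for key, value in self.legacy.items():
            if key not in self.rdb:
                self.rdb[key] = value
                copied += 1
        return copied
```

The four bullets map to gate changes in order: enable `write_rdb`, run `backfill()`, enable `read_from_rdb`, then disable `write_legacy` once the stability window has passed. Rolling back a step means flipping only that gate.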

Phase 2: Pluggable Vector Providers

  • Finalize vector SPI.
  • Incrementally add providers (for example, Milvus/Chroma).
  • Define and test explicit degrade/fallback behavior.
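
Explicit degrade/fallback behavior might be expressed as a capability check at the call site, as in this sketch. The `Capability` enum, `DenseOnlyProvider`, and the `search` helper are hypothetical names, not part of any real provider SDK.

```python
from enum import Enum, auto


class Capability(Enum):
    DENSE_SEARCH = auto()
    HYBRID_SEARCH = auto()


class DenseOnlyProvider:
    """Hypothetical provider that declares dense search only."""
    capabilities = frozenset({Capability.DENSE_SEARCH})

    def dense_search(self, query_vector, top_k):
        return [f"dense-hit-{i}" for i in range(top_k)]


def search(provider, query_vector, query_text, top_k, allow_fallback=True):
    """Return (hits, mode) so callers can observe when a degrade happened."""
    wants_hybrid = query_text is not None
    if wants_hybrid and Capability.HYBRID_SEARCH in provider.capabilities:
        return provider.hybrid_search(query_vector, query_text, top_k), "hybrid"
    if wants_hybrid and not allow_fallback:
        raise NotImplementedError("provider lacks hybrid search and fallback is disabled")
    mode = "dense_fallback" if wants_hybrid else "dense"
    return provider.dense_search(query_vector, top_k), mode
```

Returning the mode alongside the hits keeps the degrade visible to logging and tests instead of silently changing result quality.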

Phase 3: Stabilization and Decommission

  • Remove legacy paths.
  • Harden operations (reconcile, alerts, runbooks, SLO checks).
  • Finalize archival/cleanup and compatibility policy.

Compatibility and Migration Principles

  • Read compatibility first
  • Idempotent backfill/dual-write
  • Independent read/write feature gates for rollback
  • Continuous reconcile checks during migration window
  • Prefer Outbox + idempotency + reconcile as the default consistency path; keep 2PC only as an optional future mode.

Non-Goals (for this proposal)

  • No Memory subsystem refactor in this workstream.
  • No frontend progress subsystem redesign.
  • No immediate default enablement of multi-store read.

Open Questions

  1. What are the acceptance thresholds for reconcile mismatches before the read switch?
  2. Which provider capability matrix is required before Phase 2 exits?
  3. What retention/archival policy should be enforced in Phase 3?

Suggested Baseline (for discussion)

To make review concrete, here is a proposed starting baseline for the three open questions:

  1. Reconcile thresholds before read switch

    • Record-count mismatch rate: <= 0.1% over at least 3 consecutive runs.
    • Critical-key mismatch (collection/doc_id/step_type/model_tag): 0 unresolved in final pre-switch run.
    • Operational health gates: no sustained outbox backlog breach and no high-severity reconcile errors for a full stability window (for example, 7 days).
  2. Minimum provider capability matrix for Phase 2 exit

    • Required capabilities (must pass): upsert, delete, dense search, metadata filter, index health checks.
    • Optional with explicit fallback (must be documented): sparse/FTS, hybrid search, advanced index tuning controls.
    • Exit gate: at least one non-LanceDB provider passes required capabilities in CI + integration environment, with fallback behavior verified.
  3. Phase 3 retention and archival policy

    • Operations/outbox records: hot retention 30 days, archive retention 180 days (configurable by deployment).
    • Legacy VDB control-plane artifacts: keep read-only for one release cycle after cutover, then archive and remove.
    • Reconcile/audit evidence: keep summaries for at least one full release cycle to support post-cutover incident analysis.
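
The reconcile thresholds in item 1 could be checked mechanically before the read switch. The sketch below encodes only the two numeric gates (mismatch rate over consecutive runs, and zero unresolved critical-key mismatches in the final run); the operational health gates are not modeled, and `ReconcileRun` and `ready_for_read_switch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconcileRun:
    total_records: int
    mismatched_records: int
    unresolved_critical_mismatches: int


def ready_for_read_switch(runs, max_mismatch_rate=0.001, required_consecutive=3):
    """Check the proposed baseline: <= 0.1% mismatch over the last 3 runs
    and no unresolved critical-key mismatches in the final run."""
    if len(runs) < required_consecutive:
        return False
    for run in runs[-required_consecutive:]:
        if run.total_records == 0:
            # Assumption: an empty run is treated as insufficient evidence.
            return False
        if run.mismatched_records / run.total_records > max_mismatch_rate:
            return False
    return runs[-1].unresolved_critical_mismatches == 0
```

For example, three consecutive runs with 5 mismatches out of 10,000 records each (0.05%) would pass, while a single run at 0.5% within the last three would block the switch.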
