RAGRig

Open-source RAG governance and pipeline platform for enterprise knowledge.

源栈 ("source stack"): from scattered enterprise sources to traceable, permission-aware, model-ready knowledge.

Chinese README: README.zh-CN.md


About

RAGRig is an open-source platform for building lightweight, governable RAG systems for small and medium-sized teams.

It helps organizations connect scattered knowledge sources, clean and structure documents with LLM-assisted pipelines, index them into vector stores such as Qdrant and pgvector, and serve retrieval results through traceable, permission-aware APIs.

RAGRig is not meant to be another generic chatbot wrapper. Its focus is the hard operational layer around RAG:

  • source connectors for documents, wikis, shared drives, databases, object storage, and enterprise document hubs
  • customizable ingestion and cleaning workflows
  • model registry for LLMs, embedding models, rerankers, OCR, and parsers
  • Qdrant and Postgres/pgvector as first-class vector backends
  • document, chunk, and metadata versioning
  • permission-aware retrieval with pre-retrieval access filtering
  • RAG evaluation, observability, and regression checks
  • source traceability from answer to document, version, chunk, and pipeline run
  • Markdown and document preview/editing integrations for knowledge review workflows

The goal is to make enterprise knowledge usable by AI systems without losing control over source provenance, permissions, quality, or deployment cost.

Why RAGRig

Many RAG tools make it easy to upload files and chat with them. Production RAG inside a company needs more than that.

Teams need to know where each answer came from, whether the source is still valid, which model created the embedding, who is allowed to retrieve the content, and whether a pipeline change made retrieval better or worse.

RAGRig treats RAG as an operational system:

  • Source-first: every generated answer should point back to inspectable source material.
  • Governed by default: access control, metadata, versions, and audit events are part of the core model.
  • Model-flexible: bring local or hosted LLMs, embedding models, rerankers, OCR, and parsers.
  • Local-first: prefer local files, pgvector, Ollama, LM Studio, BGE, and self-hosted runtimes before cloud services.
  • Vector-store portable: start with pgvector, scale to Qdrant, and keep migration paths explicit.
  • Ops-friendly: designed for Docker Compose first, with a path to Kubernetes later.
  • Plugin-first: keep the core small, then extend sources, sinks, models, vector stores, preview tools, and workflow nodes through explicit contracts.
  • Quality-gated: core modules must reach and maintain 100% test coverage, with cloud and enterprise plugins covered through contract tests.

Architecture

```mermaid
flowchart LR
    sources["Source plugins<br/>files, object storage, docs, wiki, DB"]
    pipeline["Pipeline engine<br/>scan, parse, clean, chunk, embed, index"]
    profiles["Processing Profiles<br/>extension × task matrix"]
    formats["Supported Format<br/>registry"]
    understanding["Document Understanding<br/>summaries, glossary, knowledge map"]
    core["RAGRig core<br/>KB, versions, chunks, runs, audit"]
    vectors["Vector backends<br/>pgvector, Qdrant, others"]
    console["Web Console<br/>operate, review, debug, upload"]
    api["Retrieval API / MCP / exports"]

    formats --> pipeline
    profiles --> pipeline
    sources --> pipeline --> core
    core --> vectors
    core --> understanding
    core --> console
    vectors --> api
    core --> api
```

Project Status

RAGRig is in early project design and scaffolding.

Current implementation status:

  1. Phase 0 docs and project framing are committed.
  2. Phase 1a scaffold provides a FastAPI service, local Docker Compose stack, pgvector-enabled PostgreSQL, and verification commands.
  3. Phase 1a metadata DB adds SQLAlchemy models, Alembic migrations, and DB smoke commands for the MVP metadata boundary.
  4. Phase 1b now supports local Markdown/Text ingestion into the metadata DB, including document_versions and pipeline-run tracking.
  5. Phase 1c now supports deterministic local chunking and embedding into chunks and embeddings for the latest ingested document versions.
  6. Phase 1d now supports a minimal retrieval API and smoke CLI over the real indexed chunks and embeddings.
  7. Phase 1e PR-1 now adds a core provider registry contract and registers deterministic-local through it.
  8. Phase 1e PR-2 now adds local provider adapters for Ollama, LM Studio, OpenAI-compatible local runtimes, and optional BGE boundaries without changing the default secret-free test path.
  9. Phase 1e PR-3 now adds cloud-second provider stubs for Vertex AI, Bedrock, Azure OpenAI, OpenRouter, OpenAI, Cohere, Voyage, and Jina through the same registry and discovery surfaces.
  10. source.s3 now supports real S3-compatible Markdown/Text ingestion with fake-client-first tests and opt-in runtime dependencies.
  11. source.fileshare now supports offline-tested SMB, mounted NFS/local path, WebDAV, and SFTP ingestion contracts with truthful readiness, delete-detection placeholders, and an explicit make fileshare-check smoke path.
  12. The Web Console now includes a plugin/data source setup wizard that drafts registry-backed config, rejects raw secrets, and validates plugin config before wiring.
  13. A ProcessingProfile, SupportedFormat, browser upload, and Document Understanding architecture spec is committed as a design document; implementation is queued for follow-up issues.
  14. Semantic production embeddings, live local runtime smoke checks, production cloud adapters, reranking, and richer source types remain intentionally limited or deferred in this repository state.

Authoritative specs live under docs/specs/; see the Repository Layout section for the full list.

Web Console

RAGRig now ships a first lightweight Web Console inside the same FastAPI service. It is an operator workbench for knowledge bases, sources, ingestion tasks, pipeline runs, document and chunk review, retrieval debugging, model shells, and health status.

The console lives at:

GET /console

What the current MVP covers:

  • knowledge base inventory from the real DB
  • local-directory source configuration from the real DB
  • CLI-connected ingestion entry with disabled browser-write state
  • pipeline run history and per-item detail from the real DB
  • document latest-version preview and real chunk preview or empty state
  • retrieval Playground backed by the real POST /retrieval/search contract
  • embedding profile inventory from indexed chunks
  • health, DB dialect, Alembic revision, extension state, and visible tables
  • vector backend readiness with backend type, dependency state, collection rows, and score semantics
  • plugin/data source setup wizard backed by real registry metadata and POST /plugins/{plugin_id}/validate-config

Current limitations:

  • browser-triggered create/update actions are intentionally not implemented yet
  • the plugin wizard validates config drafts and next-step commands, but does not persist plugin configuration or create sources from the browser
  • model registry remains read-only, but now exposes local LLM and reranker registry shells for PR-2 providers
  • provider registry metadata is exposed read-only, including Ollama, LM Studio, OpenAI-compatible local runtimes, and optional BGE boundaries
  • the console only shows capabilities backed by existing DB/API boundaries and uses empty, disabled, or degraded states for the rest
  • qdrant remains optional; missing qdrant-client or missing live collections degrade only the vector panel instead of the whole console

RAGRig Web Console prototype

Phase 1a Foundation

Phase 1a currently ships the engineering scaffold and metadata database foundation required for follow-on ingestion and retrieval work:

  • Python 3.11+ service with FastAPI
  • typed settings via pydantic-settings
  • GET /health with explicit app and database status
  • SQLAlchemy 2.x models for the metadata boundary from MVP Section 12
  • Alembic migrations rooted at alembic/
  • pgvector-backed embeddings table with dynamic dimensions metadata
  • uv-managed dependencies in pyproject.toml
  • ruff format/lint commands and pytest tests
  • Docker Compose for the app and PostgreSQL with pgvector
  • smoke commands for migration and schema validation

Phase 1b, Phase 1c, and Phase 1d add these implemented boundaries:

  • src/ragrig/ingestion
  • src/ragrig/parsers
  • src/ragrig/repositories
  • src/ragrig/chunkers
  • src/ragrig/embeddings
  • src/ragrig/indexing
  • src/ragrig/retrieval.py

Still reserved for later phases:

  • src/ragrig/cleaners
  • src/ragrig/vectorstore

The current repository state supports local Markdown/Text parsing, character-window chunking, deterministic local embeddings, a provider registry core contract, and a minimal pgvector-backed retrieval API for smoke validation. Production embedding providers, reranking, and answer generation are still deferred.
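
For intuition, character-window chunking amounts to sliding a fixed-size window over the document text with a small overlap. The sketch below is illustrative only; it does not reproduce the actual src/ragrig/chunkers implementation, and the window size, overlap, and metadata field names (other than chunk_index, which the retrieval API does return) are assumptions.

```python
# Minimal character-window chunking sketch (illustrative, not the repo's chunker).
def character_window_chunks(text: str, window: int = 800, overlap: int = 100) -> list[dict]:
    """Split text into fixed-size character windows with a small overlap."""
    if window <= 0 or overlap >= window:
        raise ValueError("window must be positive and larger than overlap")
    chunks, start, index = [], 0, 0
    while start < len(text):
        end = min(start + window, len(text))
        chunks.append(
            {
                "chunk_index": index,      # matches the chunk_index field in retrieval results
                "start_offset": start,     # assumed span metadata name
                "end_offset": end,         # assumed span metadata name
                "text": text[start:end],
            }
        )
        if end == len(text):
            break
        start = end - overlap
        index += 1
    return chunks


if __name__ == "__main__":
    sample = "# RAGRig Guide\n" + "lorem ipsum " * 200
    print(len(character_window_chunks(sample)))
```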

Provider Registry

Phase 1e PR-1 establishes the core provider registry contract in src/ragrig/providers/.

What exists now:

  • provider metadata and capability declarations
  • register/get/read/list/health-check registry operations
  • structured provider errors for missing providers and unsupported capabilities
  • deterministic-local registered as the built-in embedding provider for CI and smoke flows
  • read-only provider inventory in GET /models

PR-2 additions:

  • model.ollama local adapter metadata and fake-client contract tests
  • model.lm_studio local OpenAI-compatible adapter metadata and fake-client contract tests
  • shared local adapter declarations for model.llama_cpp, model.vllm, model.xinference, and model.localai
  • embedding.bge and reranker.bge provider boundaries with lazy optional dependency loading
  • read-only /models and /plugins visibility for the above providers

PR-3 additions:

  • model.vertex_ai, model.bedrock, model.azure_openai, model.openrouter, model.openai, model.cohere, model.voyage, and model.jina registry metadata
  • cloud-second plugin discovery entries in /plugins and make plugins-check
  • optional cloud dependency groups in pyproject.toml without changing the default install path
  • read-only /models visibility for the cloud stubs, including required secret and config metadata

What is still deferred:

  • no production cloud API calls in this PR slice
  • no DB-backed model profile management
  • no default live local or cloud runtime smoke in make test

deterministic-local remains a secret-free, network-free test and smoke provider. It is not a production semantic embedding model.
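
The concrete Python API inside src/ragrig/providers/ is not reproduced in this README, so the snippet below is only a sketch of the register/get/list/health-check shape, with a toy hash-based embedder standing in for deterministic-local. Every class and method name here is hypothetical, and the toy vector math is not the real hash-8d algorithm.

```python
# Hypothetical provider-registry sketch; names are illustrative, not the RAGRig API.
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProviderMetadata:
    provider_id: str
    capabilities: set[str] = field(default_factory=set)


class UnknownProviderError(KeyError):
    """Structured error for lookups of providers that were never registered."""


class ProviderRegistry:
    def __init__(self) -> None:
        self._providers: dict[str, tuple[ProviderMetadata, object]] = {}

    def register(self, metadata: ProviderMetadata, impl: object) -> None:
        self._providers[metadata.provider_id] = (metadata, impl)

    def get(self, provider_id: str) -> object:
        try:
            return self._providers[provider_id][1]
        except KeyError as exc:
            raise UnknownProviderError(provider_id) from exc

    def list(self) -> list[ProviderMetadata]:
        return [meta for meta, _ in self._providers.values()]

    def health_check(self, provider_id: str) -> bool:
        # A secret-free local provider is "healthy" if it is simply registered.
        return provider_id in self._providers


class DeterministicLocalEmbedder:
    """Toy stand-in for a deterministic, network-free embedding provider.
    Not a semantic model and not the actual deterministic-local implementation."""

    dimensions = 8

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([b / 255.0 for b in digest[: self.dimensions]])
        return vectors


registry = ProviderRegistry()
registry.register(ProviderMetadata("deterministic-local", {"embedding"}), DeterministicLocalEmbedder())
print(registry.health_check("deterministic-local"))  # True
```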

Model Provider Catalog

RAGRig also exposes a provider catalog for mainstream model vendors and API protocols. The catalog is based on official provider documentation links and is visible in GET /models and the Web Console model panel.

Current catalog coverage includes OpenAI-compatible providers and gateways, Anthropic, Google Gemini, Azure OpenAI, Amazon Bedrock, OpenRouter, Mistral, Cohere, Together, Fireworks, Groq, DeepSeek, Moonshot/Kimi, MiniMax, Alibaba DashScope, SiliconFlow, Zhipu/Z.ai, Baidu Qianfan, Volcengine Ark, xAI, Perplexity, NVIDIA NIM, Ollama, LM Studio, llama.cpp, vLLM, Xinference, LocalAI, BGE embedding, and BGE reranking.

Runtime probing endpoints:

GET  /models/{provider_name}/available-models
POST /models/{provider_name}/speed-test

Without credentials, these endpoints return missing_credentials with the exact required environment variable names and do not attempt a network call. With credentials, the first implementation measures the provider's model-list endpoint latency; it does not spend tokens on generation.
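
For example, probing a provider from Python might look like the sketch below, assuming the default APP_HOST_PORT from the Quick Start and using httpx purely as an example HTTP client. model.openai is just one provider name from the registry; the response keys beyond missing_credentials are illustrative.

```python
# Sketch of probing the catalog endpoints. Adjust the base URL if APP_HOST_PORT changed.
import httpx

BASE = "http://localhost:8000"

# List models a provider can serve. Without credentials the endpoint is documented
# to answer with missing_credentials plus the required environment variable names.
resp = httpx.get(f"{BASE}/models/model.openai/available-models", timeout=10)
print(resp.status_code, resp.json())

# Latency-only speed test: measures the provider's model-list endpoint,
# it does not spend generation tokens.
resp = httpx.post(f"{BASE}/models/model.openai/speed-test", timeout=30)
print(resp.status_code, resp.json())
```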

Local Provider Extras

PR-2 keeps local runtime SDKs and heavy ML packages out of the default install.

Install optional local runtime support with:

uv sync --extra local-ml --dev

The local-ml extra currently groups:

  • ollama
  • openai
  • FlagEmbedding
  • sentence-transformers
  • torch

Default tests still use fake clients and optional-dependency-safe loaders. A fresh clone does not need Ollama, LM Studio, GPUs, or local model downloads.

Default local endpoints documented by PR-2:

  • model.ollama: http://localhost:11434
  • model.lm_studio: http://localhost:1234/v1
  • model.llama_cpp: http://localhost:8080/v1
  • model.vllm: http://localhost:8000/v1
  • model.xinference: http://localhost:9997/v1
  • model.localai: http://localhost:8080/v1

Cloud Provider Extras

PR-3 keeps cloud SDKs out of the default install and ships only contract-first cloud stubs.

Optional cloud dependency groups:

  • cloud-google: google-cloud-aiplatform
  • cloud-aws: boto3
  • cloud-openai: openai
  • cloud-cohere: cohere
  • cloud-voyage: voyageai
  • cloud-jina: no SDK package yet; the stub documents an httpx-style API boundary only

Example installs:

uv sync --extra cloud-openai --extra cloud-google --dev
uv sync --extra cloud-aws --extra cloud-cohere --dev

PR-3 cloud stubs are intentionally contract-only:

  • no live cloud API calls in default tests
  • no real API keys required for fresh-clone verification
  • /models, /plugins, and make plugins-check expose metadata, secret requirements, and current stub status only
  • production cloud adapters should land in follow-up PRs, not inside this stub/docs slice

Processing Profile System

PR-5 introduces a read-only ProcessingProfile module that defines a file-type × task-type pipeline matrix. The system ships with default wildcard profiles and serves as the resolution layer for processing decisions in the ingestion and indexing pipelines.

Concepts

  • TaskType: correct, clean, chunk, summarize, understand, embed
  • ProcessingProfile: defines provider, kind (deterministic/LLM-assisted), and status per (extension, task_type) combination
  • Profile Resolution: extension override → wildcard default → safe fallback
  • Default Profiles: all task types ship with wildcard defaults; chunk/embed use deterministic providers, summarize/understand are LLM-assisted stubs
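
To make the resolution order concrete, here is a small illustrative lookup. The dictionary layout, function name, and profile values below are hypothetical and do not reproduce the actual src/ragrig/processing_profile/ code; only the provider and task-type names come from this README.

```python
# Illustrative-only (extension, task_type) profile resolution sketch.
PROFILES = {
    # An extension override beats the wildcard default for the same task type.
    (".md", "chunk"): {"provider": "chunker.character_window", "kind": "deterministic"},
    ("*", "chunk"): {"provider": "chunker.character_window", "kind": "deterministic"},
    ("*", "embed"): {"provider": "embedding.deterministic_local", "kind": "deterministic"},
    ("*", "summarize"): {"provider": "model.ollama", "kind": "llm_assisted"},
}

SAFE_FALLBACK = {"provider": None, "kind": "deterministic"}


def resolve_profile(extension: str, task_type: str) -> dict:
    """extension override -> wildcard default -> safe fallback."""
    return (
        PROFILES.get((extension, task_type))
        or PROFILES.get(("*", task_type))
        or SAFE_FALLBACK
    )


print(resolve_profile(".md", "chunk"))     # extension override
print(resolve_profile(".txt", "embed"))    # wildcard default
print(resolve_profile(".pdf", "correct"))  # safe fallback
```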

Current Scope (P0)

  • src/ragrig/processing_profile/ core module with 100% test coverage
  • GET /processing-profiles — list all profiles (with provider, status, task_type; no raw secrets)
  • GET /processing-profiles/matrix — returns extension × task_type grid with default/override and deterministic/LLM-assisted markers
  • Chunk and embed metadata include profile_id for traceability
  • Web Console Processing Profile Matrix read-only view
  • resolve_provider_availability() correctly reports unavailable providers (not faked as ready)

Limitations (not yet implemented)

  • No browser-side profile CRUD (create/edit/delete)
  • No real LLM summarize/understand calls — profiles define configuration only
  • No per-profile A/B evaluation metrics
  • No secret storage or secret echo in API responses
  • Provider availability for LLM tasks is read-only; no runtime health check past plugin registry status

Degradation Semantics

When a profile's LLM provider is unavailable:

  • The matrix marks the cell with provider_available: false and an amber "⚠ unavail" indicator in the console
  • The API response includes provider_available: false without fabricating a ready state
  • Pipeline runs record chunk_profile_id/embed_profile_id in config snapshots; future phases will use these for fallback logic


Quick Start

  1. Install uv if it is not already available.

  2. Sync dependencies:

    make sync
  3. Create a local env file:

    cp .env.example .env

    If 8000 or 5432 are already in use on the host, set alternate values in .env, for example APP_HOST_PORT=18000 or DB_HOST_PORT=15433.

  4. Run code quality checks:

    make format
    make lint
    make test
    make coverage
    make dependency-inventory
  5. Run supply-chain checks:

    make licenses
    make sbom
    make audit

    make audit requires network access. If the environment is offline, use make audit-dry-run and treat the vulnerability audit as blocked rather than silently skipped.

  6. Start the database service:

    docker compose up --build -d db
  7. Run the initial migration:

    make migrate
  8. Verify the extension and schema:

    make db-check

    Expected output shape:

    {
      "current_revision": "20260503_0001",
      "extension": "vector",
      "missing_tables": [],
      "present_tables": [
        "chunks",
        "document_versions",
        "documents",
        "embeddings",
        "knowledge_bases",
        "pipeline_run_items",
        "pipeline_runs",
        "sources"
      ],
      "revision_matches_head": true
    }
  9. Preview the local ingestion fixture without writing to the database:

    make ingest-local-dry-run
  10. Ingest the local Markdown/Text fixture into the database:

    make ingest-local
  11. Query the latest local-ingestion run summary:

    make ingest-check

    Expected output shape:

    {
      "counts": {
        "document_versions": 4,
        "documents": 5,
        "pipeline_run_items": 5,
        "sources": 1
      },
      "knowledge_base": {
        "name": "fixture-local"
      },
      "latest_pipeline_run": {
        "failure_count": 0,
        "status": "completed",
        "success_count": 4,
        "total_items": 5
      }
    }
  12. Chunk and embed the latest ingested document versions:

    make index-local
  13. Query the latest chunking and embedding run summary:

    make index-check

    Expected output shape:

    {
      "counts": {
        "chunks": 4,
        "embeddings": 4
      },
      "embedding_dimensions": [
        {
          "count": 4,
          "dimensions": 8,
          "model": "hash-8d",
          "provider": "deterministic-local"
        }
      ],
      "latest_pipeline_run": {
        "failure_count": 0,
        "status": "completed",
        "success_count": 3,
        "total_items": 4
      }
    }
  14. Run a retrieval smoke query against the indexed chunks:

    make retrieve-check QUERY="RAGRig Guide"

    Expected output shape:

    {
      "dimensions": 8,
      "distance_metric": "cosine_distance",
      "knowledge_base": "fixture-local",
      "model": "hash-8d",
      "provider": "deterministic-local",
      "query": "RAGRig Guide",
      "results": [
        {
          "chunk_id": "...",
          "chunk_index": 0,
          "document_id": "...",
          "document_uri": ".../guide.md",
          "document_version_id": "...",
          "distance": 0.0,
          "score": 1.0,
          "source_uri": ".../tests/fixtures/local_ingestion",
          "text_preview": "# RAGRig Guide ..."
        }
      ],
      "top_k": 3,
      "total_results": 1
    }

    The default path uses VECTOR_BACKEND=pgvector. If you explicitly enable Qdrant, the response shape stays the same and adds backend metadata:

    {
      "backend": "qdrant",
      "backend_metadata": {
        "distance_metric": "cosine",
        "status": "ready"
      }
    }
  15. Start optional local Qdrant only when you want the alternate backend smoke path:

    docker compose --profile qdrant up -d qdrant
    uv sync --extra vectorstores
    VECTOR_BACKEND=qdrant make index-local
    VECTOR_BACKEND=qdrant make retrieve-check QUERY="RAGRig Guide"

    qdrant-client is intentionally optional. Fresh clone make test and make coverage continue to pass without the package or a running Qdrant container.

  16. Inspect plugin readiness offline:

    make plugins-check

    source.s3 reports unavailable until you install the optional S3 SDK:

    uv sync --extra s3
  17. Run the opt-in S3-compatible smoke path against MinIO or another S3-compatible endpoint:

    docker compose --profile minio up -d minio
    uv sync --extra s3
    make s3-check

    The default .env.example values target the local MinIO profile. make s3-check seeds tests/fixtures/local_ingestion/ into the configured bucket before ingesting it.

    Minimal runtime config uses declared secret refs only:

    {
      "bucket": "ragrig-smoke",
      "prefix": "ragrig-smoke",
      "endpoint_url": "http://127.0.0.1:9000",
      "region": "us-east-1",
      "use_path_style": true,
      "verify_tls": false,
      "access_key": "env:AWS_ACCESS_KEY_ID",
      "secret_key": "env:AWS_SECRET_ACCESS_KEY",
      "session_token": "env:AWS_SESSION_TOKEN"
    }

    Current source.s3 limits:

    • only Markdown and plain-text objects are parsed
    • unsupported extensions, binary objects, and oversized objects are skipped with recorded reasons
    • delete detection, tombstones, and standalone cursor state are not implemented yet
  18. Start the local API service, including the Web Console:

```bash
make run-web
```

Then open `http://localhost:8000/console`.

If you changed `APP_HOST_PORT`, open that port instead.
  19. Run the Web Console smoke contract:
```bash
make web-check
```
  20. Start the full local development stack when you also want Docker-managed app + DB:
```bash
docker compose up --build
```
  21. Verify the service and pgvector bootstrap:
```bash
curl http://localhost:8000/health
docker compose exec db psql -U ragrig -d ragrig -c "SELECT extname FROM pg_extension WHERE extname = 'vector';"
docker compose exec db psql -U ragrig -d ragrig -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename;"
```

If you changed `APP_HOST_PORT`, use that port in the `curl` command.
If you changed `DB_HOST_PORT`, keep using `docker compose exec db ...`; no command change is required.

Expected healthy response:

{
  "status": "healthy",
  "app": "ok",
  "db": "connected",
  "version": "0.1.0"
}

If PostgreSQL is unavailable, /health returns 503 with a clear error payload.

  22. Exercise the retrieval API directly:

    curl -X POST http://localhost:8000/retrieval/search \
      -H "Content-Type: application/json" \
      -d '{"knowledge_base":"fixture-local","query":"RAGRig Guide","top_k":1}'

    If you changed APP_HOST_PORT, use that port in the request URL.

Database Commands

Repository-level DB commands:

  • make migrate: apply Alembic migrations to head
  • make migrate-down: roll back one migration step
  • make db-check: verify pgvector extension, required Phase 1a tables, and Alembic head revision
  • make db-shell: open psql in the Compose database container
  • make test-db: alias for the DB smoke check
  • make web-check: verify /console and the Web Console data routes
  • make ingest-local-dry-run: preview scanned files and skip reasons without DB writes
  • make ingest-local: ingest the local fixture corpus or an overridden root path into the metadata DB
  • make ingest-check: query the latest local-ingestion run and document-version evidence
  • make index-local: chunk and embed the latest ingested document versions for the chosen knowledge base
  • make index-check: query the latest chunk and embedding run, counts, spans, and embedding dimensions
  • make retrieve-check QUERY="...": query the indexed chunks and print top-k citation fields

Fresh-clone schema verification path:

make sync
cp .env.example .env
docker compose up --build -d db
make migrate
make db-check

The Compose file still supports shared-machine port overrides through .env, for example:

APP_HOST_PORT=18000
DB_HOST_PORT=15433

This override path must remain available for 192.168.3.100 and other shared hosts where default ports are already in use.

Host-side migration and smoke commands (make migrate, make db-check) connect through localhost:${DB_HOST_PORT} so they work from the machine that launched Docker Compose, even though the application container still uses DATABASE_URL=postgresql://ragrig:ragrig_dev@db:5432/ragrig internally.

The same host-side runtime URL rule also applies to make ingest-local and make ingest-check, so shared-host verification can use alternate mapped DB ports without rewriting the app container path.

Local Ingestion

Phase 1b currently implements the smallest reproducible local ingestion loop for Markdown and plain text files.

What it does:

  • scans an explicit local root path
  • applies include and exclude glob filters
  • skips excluded, oversized, unsupported, and binary files with recorded reasons
  • parses UTF-8 Markdown and text files
  • computes SHA-256 file hashes
  • writes sources, documents, document_versions, pipeline_runs, and pipeline_run_items
  • avoids duplicate document_versions when the file content hash has not changed

What it does not do yet:

  • chunking
  • embeddings or pgvector writes
  • deletion cleanup or tombstones

Default fixture path:

tests/fixtures/local_ingestion

Custom run example:

uv run python -m scripts.ingest_local \
  --knowledge-base demo \
  --root-path tests/fixtures/local_ingestion \
  --include "*.md" \
  --include "*.txt" \
  --exclude "nested/*"

Dry-run example:

uv run python -m scripts.ingest_local \
  --knowledge-base demo \
  --root-path tests/fixtures/local_ingestion \
  --dry-run

Retrieval API

Phase 1d implements the smallest retrieval boundary on top of Phase 1c indexed chunks.

What it does:

  • embeds query text with the same deterministic-local provider used for default indexing smoke runs
  • searches only the latest document_versions rows per document
  • returns top-k chunk matches with document_id, document_version_id, chunk_id, chunk_index, document_uri, source_uri, distance, score, and chunk_metadata
  • exposes both POST /retrieval/search and make retrieve-check

What it does not do yet:

  • answer generation
  • reranking or lexical fallback
  • ACL filtering
  • external embedding providers as the default path
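
The same contract can be exercised from Python. The sketch below mirrors the Quick Start curl example, uses httpx only as an example HTTP client, and assumes the fixture-local knowledge base has already been ingested and indexed:

```python
# Calls the Phase 1d retrieval boundary and prints the citation fields the
# response is documented to carry. Adjust the port if APP_HOST_PORT changed.
import httpx

payload = {"knowledge_base": "fixture-local", "query": "RAGRig Guide", "top_k": 3}
resp = httpx.post("http://localhost:8000/retrieval/search", json=payload, timeout=30)
resp.raise_for_status()

for hit in resp.json()["results"]:
    print(hit["chunk_id"], hit["document_uri"], hit["score"])
```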

Web Console Usage

The Web Console is served by the same FastAPI process as /health, /docs, and /retrieval/search.

Local startup sequence from a fresh clone:

make sync
cp .env.example .env
docker compose up --build -d db
make migrate
make ingest-local
make index-local
make run-web

Then open:

http://localhost:8000/console

Suggested local verification sequence:

make test
make web-check
make retrieve-check QUERY="RAGRig Guide"

Relationship to other interfaces:

  • Web Console: operator-facing overview and debugging workbench
  • Swagger (/docs): raw API exploration
  • CLI / Make targets: write-path orchestration for ingest and indexing in this MVP

The console does not invent data. If a knowledge base has no chunks, models, or retrieval results yet, the UI shows real empty or degraded states instead of placeholders.

Plugin Architecture

RAGRig is designed as a small core with plugin-first extension points. The core owns workspace state, knowledge bases, documents, versions, chunks, embeddings, pipeline runs, metadata, access boundaries, audit events, and plugin contracts. Integrations live behind typed plugin interfaces.

The goal is not to build a plugin marketplace first. The goal is to make every integration explicit, testable, observable, and replaceable.

The README uses official platform links instead of embedding third-party logos. A visual integration gallery can be added later under docs/ when each logo's trademark and usage rules are checked.

Provider priority is local-first, cloud-second. Local model runtimes, local embeddings, local rerankers, and self-hosted vector stores must be usable before a user configures a cloud account.

Plugin families:

| Family | Purpose | Examples |
| --- | --- | --- |
| Source connectors | Read enterprise knowledge from external systems | local files, SMB/NFS, S3-compatible storage, Google Drive, SharePoint, Confluence, databases |
| Parsers and OCR | Convert raw files into extracted text and structure | Markdown, plain text, PDF, DOCX, XLSX, Docling, MinerU, Tesseract, PaddleOCR |
| Cleaning nodes | Normalize, redact, classify, dedupe, and enrich content | deterministic cleaners, LLM-assisted cleaners, PII redaction, metadata extraction |
| Chunkers | Split document versions into traceable chunks | character windows, Markdown heading chunks, recursive text chunks, table-aware chunks |
| Model providers | Supply LLMs, embedding models, rerankers, OCR, and parsing models | local Ollama, LM Studio, vLLM, llama.cpp, Xinference, BAAI BGE, plus cloud Google Vertex AI, Amazon Bedrock, OpenRouter, OpenAI, Cohere, Voyage AI |
| Vector backends | Store and search vectors with backend-specific capability reporting | pgvector, Qdrant, Milvus/Zilliz, Weaviate, OpenSearch/Elasticsearch, Redis/Valkey |
| Output sinks | Write governed knowledge or retrieval artifacts elsewhere | Amazon S3/Cloudflare R2/MinIO, NFS, relational databases, JSONL, Parquet, Markdown, webhooks, MCP |
| Preview/edit integrations | Let operators inspect or edit source and cleaned knowledge | Markdown editor, WPS, OnlyOffice, Collabora Online, source-system deep links |
| Evaluation plugins | Measure retrieval and answer quality | golden questions, citation coverage, latency/cost, regression checks |
| Workflow nodes | Compose ingestion, indexing, export, and evaluation pipelines | scan, parse, clean, chunk, embed, index, retrieve, evaluate, export, notify |

Plugin Tiers

RAGRig separates plugins by stability, priority, and maintenance ownership.

| Tier | Meaning | Ships with core | Extension policy |
| --- | --- | --- | --- |
| Built-in core plugins | Minimal local-first path required for a reproducible RAG pipeline | Yes | Maintained in this repository, no optional external service dependency |
| Official plugins | High-demand integrations maintained by the RAGRig project | Usually optional | May live in this repository first, then move to separate packages as APIs stabilize |
| Community plugins | Third-party integrations built against public contracts | No | Installed through Python packages or plugin manifests once the contract is stable |

Initial built-in core plugins:

| Plugin | Family | Read/write | Why it is core |
| --- | --- | --- | --- |
| source.local | Source connector | Read | Fresh-clone demo, fixture validation, shared-host smoke testing |
| parser.markdown | Parser | Read | Common documentation format, deterministic tests |
| parser.text | Parser | Read | Smallest plain-text ingestion path |
| chunker.character_window | Chunker | Write chunks | Reproducible chunking before semantic chunkers exist |
| embedding.deterministic_local | Model provider | Write embeddings | Secret-free development and CI validation |
| vector.pgvector | Vector backend | Read/write | Default lightweight backend on Postgres |
| sink.jsonl | Output sink | Write | Portable debug/export format |
| preview.markdown | Preview/edit | Read/write draft | Operator review without needing an office suite |

Priority official plugins:

| Priority | Plugin area | Platforms and protocols to cover first |
| --- | --- | --- |
| P0 | vector.qdrant | Self-hosted Qdrant first, Qdrant Cloud second |
| P0 | model.local_runtime | Ollama, LM Studio, llama.cpp server, vLLM, Xinference, LocalAI through official SDKs or OpenAI-compatible local APIs |
| P0 | embedding.bge and reranker.bge | BAAI BGE embedding and reranker models through local FlagEmbedding, sentence-transformers, or OpenAI-compatible serving |
| P1 | model.cloud_provider | Google Vertex AI, Amazon Bedrock, OpenRouter, OpenAI, Azure OpenAI, Cohere, Voyage AI, Jina AI |
| P1 | source.s3 | AWS S3, Cloudflare R2, MinIO, Ceph RGW, Wasabi, Backblaze B2 S3 API, Tencent COS S3 API, Alibaba OSS S3-compatible mode when available |
| P1 | sink.object_storage | AWS S3, Cloudflare R2, MinIO, Ceph RGW, Wasabi, Backblaze B2, Google Cloud Storage, Azure Blob Storage |
| P1 | source.fileshare | SMB/CIFS, NFS, WebDAV, SFTP/OpenSSH |
| P1 | source.google_workspace | Google Drive, Google Docs, Google Sheets, Google Slides |
| P1 | source.microsoft_365 | SharePoint, OneDrive, Word, Excel, PowerPoint |
| P1 | source.wiki | Confluence, MediaWiki, GitBook, Docusaurus, MkDocs |
| P1 | source.database | PostgreSQL, MySQL/MariaDB, SQL Server, Oracle Database, SQLite, MongoDB, Elasticsearch/OpenSearch |
| P1 | preview.office | WPS, OnlyOffice, Collabora Online |
| P2 | source.collaboration | Notion, Lark/Feishu, DingTalk, WeCom, Slack files, Microsoft Teams files |
| P2 | parser.advanced_documents | PDF layout extraction, DOCX/PPTX/XLSX, Docling, MinerU, Unstructured |
| P2 | ocr | PaddleOCR, Tesseract, AWS Textract, Azure Document Intelligence, Google Document AI |
| P2 | vector.enterprise | Milvus/Zilliz, Weaviate, OpenSearch/Elasticsearch vector, Redis/Valkey vector, Vespa |
| P2 | sink.analytics | Parquet, DuckDB, ClickHouse, BigQuery, Snowflake |
| P2 | sink.agent_access | MCP server, webhooks, retrieval API export adapters |

Every plugin should declare:

  • plugin id, type, version, and owner
  • supported read/write operations
  • configuration schema
  • required secrets / secret requirements
  • capability matrix
  • local/cloud classification
  • dimensions and context-window metadata when applicable
  • SDK or protocol surface
  • cursor or incremental-sync support
  • delete detection support
  • permission mapping support
  • failure and retry behavior
  • emitted metrics and audit events

Example manifest shape:

manifest_version: 1
id: source.s3
type: source
version: 0.1.0
capabilities:
  - read
  - incremental_sync
  - delete_detection
config_model: S3SourceConfig
secret_requirements:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
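
Since plugin configs flow through typed models (the manifest above points at config_model: S3SourceConfig and the service already uses pydantic-settings), a plausible pydantic sketch of that config might look like the following. The field names mirror the S3 runtime config shown earlier in this README; the class body, defaults, and validation rules are assumptions, not the actual source.

```python
# Hedged sketch only: this is NOT the real S3SourceConfig definition.
from pydantic import BaseModel


class S3SourceConfig(BaseModel):
    bucket: str
    prefix: str = ""
    endpoint_url: str | None = None
    region: str = "us-east-1"
    use_path_style: bool = False
    verify_tls: bool = True
    # Secrets stay as declared references such as "env:AWS_ACCESS_KEY_ID";
    # raw secret values never live inside plugin config.
    access_key: str = "env:AWS_ACCESS_KEY_ID"
    secret_key: str = "env:AWS_SECRET_ACCESS_KEY"
    session_token: str | None = "env:AWS_SESSION_TOKEN"


config = S3SourceConfig(
    bucket="ragrig-smoke",
    prefix="ragrig-smoke",
    endpoint_url="http://127.0.0.1:9000",
    use_path_style=True,
    verify_tls=False,
)
print(config.model_dump())
```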

Current contract-first implementation adds:

  • src/ragrig/plugins/ for the registry, manifest schema, dependency guards, and built-in plus official stub manifests.
  • GET /plugins for offline plugin discovery with readiness, missing dependency, configurability, and secret requirement reporting.
  • POST /plugins/{plugin_id}/validate-config for safe Web Console config validation without collecting raw secrets.
  • make plugins-check for offline JSON inspection of the registry.
  • source.fileshare as a real official source plugin with mounted-path NFS support, fake-client SMB/WebDAV/SFTP coverage, and protocol-level readiness reporting.
  • make fileshare-check for offline mounted-path and fake remote fileshare smoke validation.
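
As a concrete illustration of the validate-config boundary, a draft fileshare config could be checked like this. The exact request envelope for POST /plugins/{plugin_id}/validate-config is not documented in this README, so the sketch simply posts the draft config as the JSON body, with secrets passed as env: references rather than raw values; httpx is used only as an example client.

```python
# Dry validation of a draft plugin config through the documented endpoint.
import httpx

draft = {
    "protocol": "sftp",
    "host": "files.example.internal",
    "root_path": "/docs",
    "username": "env:FILESHARE_USERNAME",
    "password": "env:FILESHARE_PASSWORD",
}

resp = httpx.post(
    "http://localhost:8000/plugins/source.fileshare/validate-config",
    json=draft,
    timeout=10,
)
print(resp.status_code, resp.json())
```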

Enterprise Connector Catalog and Workflow Engine

RAGRig now exposes an enterprise connector catalog separate from live connector execution. It covers local files, fileshares, S3-compatible storage, Google Workspace, Microsoft 365, wikis, databases, collaboration suites, Notion, Slack files, Box, Dropbox, and GitHub repository contents with official documentation links, protocols, credential names, and workflow operation metadata.

New endpoints:

  • GET /enterprise-connectors lists connector families, protocols, credential env var names, docs links, and workflow operation mappings.
  • POST /enterprise-connectors/{connector_id}/probe performs a safe local probe. Without credentials, cloud/SaaS connectors return missing_credentials and do not make network calls.
  • GET /workflows/operations lists workflow node operations.
  • POST /workflows/runs runs or dry-runs a lightweight DAG with dependency validation.

Workflow operations available now:

  • ingest.local
  • ingest.fileshare
  • ingest.s3
  • ingest.connector
  • index.knowledge_base
  • noop

The engine executes steps in topological order, rejects duplicate steps, unknown dependencies, cycles, and unsupported operations, supports dry-runs, per-step retry counts, dependency skipping, and returns linked pipeline_run_id values for real ingest/index steps. Default tests stay network-free and secret-free.
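
The exact request schema for POST /workflows/runs is not reproduced here, so the payload below is only a guess at how a two-step dry-run DAG (ingest the local fixture, then index the knowledge base) might be expressed; field names such as steps, operation, and depends_on are assumptions, while the operation names come from the list above.

```python
# Hypothetical dry-run of a two-step workflow DAG. Payload field names are assumptions.
import httpx

run_request = {
    "dry_run": True,
    "steps": [
        {"id": "ingest", "operation": "ingest.local", "depends_on": []},
        {"id": "index", "operation": "index.knowledge_base", "depends_on": ["ingest"]},
    ],
}

resp = httpx.post("http://localhost:8000/workflows/runs", json=run_request, timeout=30)
print(resp.status_code, resp.json())
```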

Fileshare Source

source.fileshare is the current local-first bridge for enterprise shared storage.

What it supports now:

  • protocol = nfs_mounted: mount the share through the OS, then point RAGRig at the mounted directory
  • protocol = smb: SMB/CIFS contract, readiness reporting, fake-client tests, optional smbprotocol runtime dependency
  • protocol = webdav: WebDAV contract, readiness reporting, fake-client tests, optional httpx runtime dependency
  • protocol = sftp: SFTP contract, readiness reporting, fake-client tests, optional paramiko runtime dependency

Current boundaries:

  • default make test and make coverage stay network-free and secret-free
  • delete detection is a placeholder audit signal only; it records deleted_upstream in pipeline items but does not delete stored documents
  • permission mapping is metadata-only for now; access enforcement is not implemented in this phase
  • only Markdown/Text parsing goes through the existing parser path by default

Install optional runtime SDKs with:

uv sync --extra fileshare --dev

Offline smoke:

make fileshare-check

Live smoke (local Docker services, explicit opt-in):

make preflight-fileshare-live           # check Docker, ports, and optional SDKs
make test-live-fileshare                # preflight + up + seed + pytest + evidence
make test-live-fileshare-print-evidence # same, but prints the evidence record to stdout
make fileshare-live-down                # tear down

Live smoke validates real list/read/stat/skip behavior against local Samba, WebDAV, and SFTP containers. It does not run in default CI.

Prerequisites:

  • Optional SDKs can be installed with:
    uv sync --extra fileshare --dev

QA acceptance path:

  1. Run make preflight-fileshare-live first. If it reports blockers, do not start containers.
  2. Run make test-live-fileshare to produce a full evidence record at docs/operations/artifacts/fileshare-live-smoke-record.json.
  3. Paste the record (or make test-live-fileshare-print-evidence output) into the PR or issue as acceptance evidence.

Unavailable environment fallback:

  • If .env is missing, preflight blocks with cp .env.example .env and stops before any container checks.
  • If Docker is not installed or the daemon is not running, preflight prints actionable steps and exits without starting containers.
  • If optional SDKs (smbprotocol, paramiko, httpx) are missing, preflight warns with the exact install command (uv sync --extra fileshare --dev) and a fallback note; pytest will skip the corresponding protocol tests.
  • If a required port is occupied, preflight outputs the port number and three fix options:
    1. free the port,
    2. override in .env (e.g. SMB_HOST_PORT=1446),
    3. or run with FILESHARE_AUTO_PICK_PORTS=1 make test-live-fileshare to auto-select free ports.
  • Offline coverage is still enforced by make test and make coverage; live smoke is an additive, explicit opt-in only.

Example SMB config:

{
  "protocol": "smb",
  "host": "files.example.internal",
  "share": "knowledge",
  "root_path": "/docs",
  "username": "env:FILESHARE_USERNAME",
  "password": "env:FILESHARE_PASSWORD",
  "include_patterns": ["*.md", "*.txt"],
  "exclude_patterns": [],
  "max_file_size_mb": 50,
  "page_size": 1000,
  "max_retries": 3,
  "connect_timeout_seconds": 10,
  "read_timeout_seconds": 30
}

Example mounted NFS/local-path config:

{
  "protocol": "nfs_mounted",
  "root_path": "/mnt/company-knowledge",
  "include_patterns": ["*.md", "*.txt"],
  "exclude_patterns": [],
  "max_file_size_mb": 50,
  "page_size": 1000,
  "max_retries": 1,
  "connect_timeout_seconds": 10,
  "read_timeout_seconds": 30
}
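
Both example configs reference credentials as env: names rather than raw values. A minimal sketch of how such a reference could be resolved at runtime is shown below; it is illustrative only, and the repository's actual resolver may differ.

```python
# Illustrative resolver for the "env:VARIABLE_NAME" secret-reference convention
# used throughout the plugin config examples. Not the repository's actual code.
import os


def resolve_secret_ref(value: str | None) -> str | None:
    """Return the referenced environment variable, or the value unchanged."""
    if value and value.startswith("env:"):
        env_name = value[len("env:"):]
        secret = os.environ.get(env_name)
        if secret is None:
            raise RuntimeError(f"missing required environment variable: {env_name}")
        return secret
    return value


# Example only: the password is read from FILESHARE_PASSWORD at runtime, never stored.
os.environ.setdefault("FILESHARE_PASSWORD", "example-only")
print(resolve_secret_ref("env:FILESHARE_PASSWORD"))
```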

Plugin development will start with internal Python interfaces. Public third-party plugin packaging should wait until the core contracts, test kit, and capability matrix are stable.

Quality and Supply Chain

RAGRig uses a strict quality and dependency policy:

  • Core modules must reach and maintain 100% test coverage.
  • Default tests must not require network access, cloud accounts, or secrets.
  • Provider SDKs must be official or actively maintained open-source packages whenever possible.
  • Heavy or cloud-specific SDKs must live behind optional plugin extras, not the core runtime.
  • uv.lock stays committed, and release candidates should include vulnerability checks, license review, and SBOM generation.

Executable commands in this repository:

  • make coverage: enforces 100% line coverage for the hard core scope: db, repositories, ingestion, parsers, chunkers, embeddings, indexing, plugins, retrieval.py, config.py, and health.py.
  • make plugins-check: prints the plugin registry discovery payload as offline JSON.
  • make export-object-storage-check: runs an opt-in object storage export smoke command and defaults to dry_run unless explicitly overridden.
  • make licenses: fails on GPL, AGPL, SSPL, or source-available third-party packages.
  • make sbom: writes a CycloneDX JSON SBOM to docs/operations/artifacts/sbom.cyclonedx.json.
  • make audit: runs a vulnerability audit of the local environment and writes docs/operations/artifacts/pip-audit.json.
  • make dependency-inventory: refreshes docs/operations/dependency-inventory.md.
  • make supply-chain-check: runs the license check, SBOM export, and vulnerability audit together.

Hard-scope omissions are explicit rather than hidden by a broad exclude:

  • src/ragrig/main.py: app wiring only
  • src/ragrig/web_console.py: Web Console adapter layer, outside this issue's hard scope
  • src/ragrig/cleaners/* and src/ragrig/vectorstore/*: placeholder packages with no shipped behavior

See the local-first, quality, and supply chain policy for the SDK inventory and supply chain rules. See core coverage and supply chain gates, supply chain operations, and the dependency inventory for the executable gate details.

Object Storage Sink

sink.object_storage now exports a minimal governed artifact set to S3-compatible object storage using optional boto3, with opt-in Parquet export support via optional pyarrow.

Current runtime-ready targets:

  • AWS S3
  • Cloudflare R2
  • MinIO
  • Ceph RGW
  • Wasabi
  • Backblaze B2 S3 API
  • Tencent COS S3 API
  • Alibaba OSS in S3-compatible mode

Contract-only targets in this phase:

  • Google Cloud Storage
  • Azure Blob Storage

Example config:

{
  "bucket": "exports",
  "prefix": "team-a",
  "endpoint_url": "http://localhost:9000",
  "region": "us-east-1",
  "use_path_style": true,
  "verify_tls": true,
  "access_key": "env:AWS_ACCESS_KEY_ID",
  "secret_key": "env:AWS_SECRET_ACCESS_KEY",
  "session_token": "env:AWS_SESSION_TOKEN",
  "path_template": "{knowledge_base}/{run_id}/{artifact}.{format}",
  "overwrite": false,
  "dry_run": true,
  "include_retrieval_artifact": true,
  "include_markdown_summary": true,
  "parquet_export": false,
  "object_metadata": {
    "environment": "dev"
  }
}

Behavior notes:

  • JSONL artifacts use application/x-ndjson.
  • Markdown summaries use text/markdown; charset=utf-8.
  • Parquet artifacts use application/vnd.apache.parquet when parquet_export=true.
  • Install uv sync --dev --extra parquet to enable local Parquet export and validation.
  • Existing objects are skipped when overwrite=false.
  • dry_run=true computes the export plan without uploading objects.
  • Retrieval and evaluation exports are explicitly marked unsupported/degraded until dedicated runtimes exist.
  • retrieval_status.parquet is emitted only when include_retrieval_artifact=true; schema-only Parquet remains typed.
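
To make the path_template behavior concrete, here is a small illustration of key expansion from the example config above. The artifact names other than retrieval_status are placeholders, not a documented artifact list, and the run identifier is invented.

```python
# Illustration of expanding the path_template from the example config above.
template = "{knowledge_base}/{run_id}/{artifact}.{format}"

for artifact, fmt in [("documents", "jsonl"), ("retrieval_status", "parquet")]:
    key = template.format(
        knowledge_base="fixture-local",
        run_id="2026-05-03T12-00-00Z",  # placeholder run identifier
        artifact=artifact,
        format=fmt,
    )
    print(key)
```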

GitHub CI

RAGRig now includes a GitHub Actions baseline workflow named RAGRig CI, running on Python 3.11 and 3.12.

What it covers on pull_request and push to main:

  • frozen dependency install from uv.lock with uv sync --dev --frozen
  • formatting check with uv run ruff format --check .
  • lint with uv run ruff check .
  • repository test suite with make test
  • hard-scope coverage gate with make coverage
  • Web Console smoke contract with make web-check

What it does not cover yet:

  • shared-environment runtime validation on 192.168.3.100
  • Docker Compose deployment checks
  • supply-chain, SBOM, license, or vulnerability gates that are still intentionally excluded from default GitHub CI
  • any workflow that depends on secrets, cloud accounts, GPUs, Ollama, LM Studio, or model downloads

Validation boundary:

  • GitHub CI proves the fresh-clone lint and test baseline inside GitHub Actions.
  • Local developer validation still covers targeted repro, iterative debugging, and pre-PR confirmation.
  • Shared-environment validation remains a separate requirement for issues that explicitly require 192.168.3.100 evidence.

After the first successful GitHub Actions run exists, the repository owner may still need to configure branch protection required checks in GitHub settings.

Repository Layout

.
├── alembic/
│   ├── env.py
│   └── versions/
│       └── 20260503_0001_phase_1a_metadata_schema.py
├── assets/
│   ├── ragrig-icon.png
│   └── ragrig-icon.svg
├── .github/
│   └── workflows/
│       └── ci.yml
├── docs/
│   ├── operations/
│   ├── prototypes/
│   ├── roadmap.md
│   └── specs/
│       ├── ragrig-github-ci-checks-spec.md
│       ├── ragrig-mvp-spec.md
│       ├── ragrig-phase-1a-metadata-db-spec.md
│       ├── ragrig-phase-1a-scaffold-spec.md
│       ├── ragrig-phase-1b-local-ingestion-spec.md
│       ├── ragrig-phase-1c-chunking-embedding-spec.md
│       ├── ragrig-phase-1d-retrieval-api-spec.md
│       ├── ragrig-local-first-quality-supply-chain-policy.md
│       ├── ragrig-web-console-plugin-source-wizard-spec.md
│       └── ragrig-web-console-spec.md
├── scripts/
│   ├── db_check.py
│   ├── index_check.py
│   ├── index_local.py
│   ├── ingest_check.py
│   ├── ingest_local.py
│   ├── retrieve_check.py
│   └── init-db.sql
├── src/
│   └── ragrig/
│       ├── db/
│       │   ├── engine.py
│       │   ├── models/
│       │   └── session.py
│       ├── chunkers/
│       ├── cleaners/
│       ├── embeddings/
│       ├── retrieval.py
│       ├── indexing/
│       ├── ingestion/
│       ├── parsers/
│       ├── repositories/
│       ├── vectorstore/
│       ├── config.py
│       └── main.py
├── tests/
│   ├── fixtures/
│   ├── test_alembic_sql.py
│   ├── test_db_check.py
│   ├── test_db_config.py
│   ├── test_db_models.py
│   ├── test_db_runtime_url.py
│   ├── test_db_session.py
│   ├── test_health.py
│   ├── test_indexing_pipeline.py
│   ├── test_ingestion_pipeline.py
│   ├── test_parsers.py
│   ├── test_retrieval.py
│   └── test_scanner.py
├── .env.example
├── alembic.ini
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── README.zh-CN.md
└── SECURITY.md

License

RAGRig is licensed under the Apache License 2.0. See LICENSE.
