Open-source RAG governance and pipeline platform for enterprise knowledge.
源栈: from scattered enterprise sources to traceable, permission-aware, model-ready knowledge.
RAGRig is an open-source platform for building lightweight, governable RAG systems for small and medium-sized teams.
It helps organizations connect scattered knowledge sources, clean and structure documents with LLM-assisted pipelines, index them into vector stores such as Qdrant and pgvector, and serve retrieval results through traceable, permission-aware APIs.
RAGRig is not meant to be another generic chatbot wrapper. Its focus is the hard operational layer around RAG:
- source connectors for documents, wikis, shared drives, databases, object storage, and enterprise document hubs
- customizable ingestion and cleaning workflows
- model registry for LLMs, embedding models, rerankers, OCR, and parsers
- Qdrant and Postgres/pgvector as first-class vector backends
- document, chunk, and metadata versioning
- permission-aware retrieval with pre-retrieval access filtering
- RAG evaluation, observability, and regression checks
- source traceability from answer to document, version, chunk, and pipeline run
- Markdown and document preview/editing integrations for knowledge review workflows
The goal is to make enterprise knowledge usable by AI systems without losing control over source provenance, permissions, quality, or deployment cost.
Many RAG tools make it easy to upload files and chat with them. Production RAG inside a company needs more than that.
Teams need to know where each answer came from, whether the source is still valid, which model created the embedding, who is allowed to retrieve the content, and whether a pipeline change made retrieval better or worse.
RAGRig treats RAG as an operational system:
- Source-first: every generated answer should point back to inspectable source material.
- Governed by default: access control, metadata, versions, and audit events are part of the core model.
- Model-flexible: bring local or hosted LLMs, embedding models, rerankers, OCR, and parsers.
- Local-first: prefer local files, pgvector, Ollama, LM Studio, BGE, and self-hosted runtimes before cloud services.
- Vector-store portable: start with pgvector, scale to Qdrant, and keep migration paths explicit.
- Ops-friendly: designed for Docker Compose first, with a path to Kubernetes later.
- Plugin-first: keep the core small, then extend sources, sinks, models, vector stores, preview tools, and workflow nodes through explicit contracts.
- Quality-gated: core modules must reach and maintain 100% test coverage, with cloud and enterprise plugins covered through contract tests.
```mermaid
flowchart LR
sources["Source plugins<br/>files, object storage, docs, wiki, DB"]
pipeline["Pipeline engine<br/>scan, parse, clean, chunk, embed, index"]
profiles["Processing Profiles<br/>extension × task matrix"]
formats["Supported Format<br/>registry"]
understanding["Document Understanding<br/>summaries, glossary, knowledge map"]
core["RAGRig core<br/>KB, versions, chunks, runs, audit"]
vectors["Vector backends<br/>pgvector, Qdrant, others"]
console["Web Console<br/>operate, review, debug, upload"]
api["Retrieval API / MCP / exports"]
formats --> pipeline
profiles --> pipeline
sources --> pipeline --> core
core --> vectors
core --> understanding
core --> console
vectors --> api
core --> api
```
RAGRig is in the early project design and scaffolding stage.
Current implementation status:
- Phase 0 docs and project framing are committed.
- Phase 1a scaffold provides a FastAPI service, local Docker Compose stack, pgvector-enabled PostgreSQL, and verification commands.
- Phase 1a metadata DB adds SQLAlchemy models, Alembic migrations, and DB smoke commands for the MVP metadata boundary.
- Phase 1b now supports local Markdown/Text ingestion into the metadata DB, including `document_versions` and pipeline-run tracking.
- Phase 1c now supports deterministic local chunking and embedding into `chunks` and `embeddings` for the latest ingested document versions.
- Phase 1d now supports a minimal retrieval API and smoke CLI over the real indexed chunks and embeddings.
- Phase 1e PR-1 now adds a core provider registry contract and registers `deterministic-local` through it.
- Phase 1e PR-2 now adds local provider adapters for Ollama, LM Studio, OpenAI-compatible local runtimes, and optional BGE boundaries without changing the default secret-free test path.
- Phase 1e PR-3 now adds cloud-second provider stubs for Vertex AI, Bedrock, Azure OpenAI, OpenRouter, OpenAI, Cohere, Voyage, and Jina through the same registry and discovery surfaces.
- `source.s3` now supports real S3-compatible Markdown/Text ingestion with fake-client-first tests and opt-in runtime dependencies.
- `source.fileshare` now supports offline-tested SMB, mounted NFS/local path, WebDAV, and SFTP ingestion contracts with truthful readiness, delete-detection placeholders, and an explicit `make fileshare-check` smoke path.
- The Web Console now includes a plugin/data source setup wizard that drafts registry-backed config, rejects raw secrets, and validates plugin config before wiring.
- A ProcessingProfile, SupportedFormat, browser upload, and Document Understanding architecture spec is committed as a design document; implementation is queued for follow-up issues.
- Semantic production embeddings, live local runtime smoke checks, production cloud adapters, reranking, and richer source types remain intentionally limited or deferred in this repository state.
Authoritative specs:
- MVP spec
- Phase 1a scaffold spec
- Phase 1a metadata DB spec
- Phase 1b local ingestion spec
- Phase 1c chunking and embedding spec
- Phase 1d retrieval API spec
- Phase 1e local model provider plugin spec
- GitHub CI checks spec
- Web Console spec
- Web Console plugin/source setup wizard spec
- Vector backend status console spec
- ProcessingProfile, SupportedFormat, browser upload, and Document Understanding spec
- Local-first, quality, and supply chain policy
- Core coverage and supply chain gates
- Web Console prototype
RAGRig now ships a first lightweight Web Console inside the same FastAPI service. It is an operator workbench for knowledge bases, sources, ingestion tasks, pipeline runs, document and chunk review, retrieval debugging, model shells, and health status.
The console lives at `GET /console`.
What the current MVP covers:
- knowledge base inventory from the real DB
- local-directory source configuration from the real DB
- CLI-connected ingestion entry with disabled browser-write state
- pipeline run history and per-item detail from the real DB
- document latest-version preview and real chunk preview or empty state
- retrieval Playground backed by the real `POST /retrieval/search` contract
- embedding profile inventory from indexed chunks
- health, DB dialect, Alembic revision, extension state, and visible tables
- vector backend readiness with backend type, dependency state, collection rows, and score semantics
- plugin/data source setup wizard backed by real registry metadata and `POST /plugins/{plugin_id}/validate-config`
Current limitations:
- browser-triggered create/update actions are intentionally not implemented yet
- the plugin wizard validates config drafts and next-step commands, but does not persist plugin configuration or create sources from the browser
- model registry remains read-only, but now exposes local LLM and reranker registry shells for PR-2 providers
- provider registry metadata is exposed read-only, including Ollama, LM Studio, OpenAI-compatible local runtimes, and optional BGE boundaries
- the console only shows capabilities backed by existing DB/API boundaries and uses empty, disabled, or degraded states for the rest
- Qdrant remains optional; a missing `qdrant-client` package or missing live collections degrades only the vector panel instead of the whole console
Phase 1a currently ships the engineering scaffold and metadata database foundation required for follow-on ingestion and retrieval work:
- Python 3.11+ service with FastAPI
- typed settings via `pydantic-settings`
- `GET /health` with explicit app and database status
- SQLAlchemy 2.x models for the metadata boundary from MVP Section 12
- Alembic migrations rooted at `alembic/`
- pgvector-backed `embeddings` table with dynamic dimensions metadata
- uv-managed dependencies in `pyproject.toml`
- `ruff` format/lint commands and `pytest` tests
- Docker Compose for the app and PostgreSQL with pgvector
- smoke commands for migration and schema validation
Phase 1b, Phase 1c, and Phase 1d add these implemented boundaries:
- `src/ragrig/ingestion`
- `src/ragrig/parsers`
- `src/ragrig/repositories`
- `src/ragrig/chunkers`
- `src/ragrig/embeddings`
- `src/ragrig/indexing`
- `src/ragrig/retrieval.py`
Still reserved for later phases:
- `src/ragrig/cleaners`
- `src/ragrig/vectorstore`
The current repository state supports local Markdown/Text parsing, character-window chunking, deterministic local embeddings, a provider registry core contract, and a minimal pgvector-backed retrieval API for smoke validation. Production embedding providers, reranking, and answer generation are still deferred.
Phase 1e PR-1 establishes the core provider registry contract in `src/ragrig/providers/`.
What exists now:
- provider metadata and capability declarations
- register/get/read/list/health-check registry operations
- structured provider errors for missing providers and unsupported capabilities
- `deterministic-local` registered as the built-in embedding provider for CI and smoke flows
- read-only provider inventory in `GET /models`
PR-2 additions:
- `model.ollama` local adapter metadata and fake-client contract tests
- `model.lm_studio` local OpenAI-compatible adapter metadata and fake-client contract tests
- shared local adapter declarations for `model.llama_cpp`, `model.vllm`, `model.xinference`, and `model.localai`
- `embedding.bge` and `reranker.bge` provider boundaries with lazy optional dependency loading
- read-only `/models` and `/plugins` visibility for the above providers
PR-3 additions:
- `model.vertex_ai`, `model.bedrock`, `model.azure_openai`, `model.openrouter`, `model.openai`, `model.cohere`, `model.voyage`, and `model.jina` registry metadata
- cloud-second plugin discovery entries in `/plugins` and `make plugins-check`
- optional cloud dependency groups in `pyproject.toml` without changing the default install path
- read-only `/models` visibility for the cloud stubs, including required secret and config metadata
What is still deferred:
- no production cloud API calls in this PR slice
- no DB-backed model profile management
- no default live local or cloud runtime smoke in `make test`
`deterministic-local` remains a secret-free, network-free test and smoke provider. It is not a production semantic embedding model.
RAGRig also exposes a provider catalog for mainstream model vendors and API protocols. The catalog is based on official provider documentation links and is visible in `GET /models` and the Web Console model panel.
Current catalog coverage includes OpenAI-compatible providers and gateways, Anthropic, Google Gemini, Azure OpenAI, Amazon Bedrock, OpenRouter, Mistral, Cohere, Together, Fireworks, Groq, DeepSeek, Moonshot/Kimi, MiniMax, Alibaba DashScope, SiliconFlow, Zhipu/Z.ai, Baidu Qianfan, Volcengine Ark, xAI, Perplexity, NVIDIA NIM, Ollama, LM Studio, llama.cpp, vLLM, Xinference, LocalAI, BGE embedding, and BGE reranking.
Runtime probing endpoints:

- `GET /models/{provider_name}/available-models`
- `POST /models/{provider_name}/speed-test`

Without credentials, these endpoints return `missing_credentials` with the exact required environment variable names and do not attempt a network call. With credentials, the first implementation measures the provider's model-list endpoint latency; it does not spend tokens on generation.
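As a concrete sketch against the quick-start stack (the response body below is illustrative; only the `missing_credentials` status and the env var listing are documented behavior):

```bash
# Probe a cloud provider stub without credentials: the service answers
# locally and never calls the provider itself.
curl http://localhost:8000/models/model.openai/available-models
# Illustrative response shape:
# {"status": "missing_credentials", "required_env": ["OPENAI_API_KEY"]}
```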
PR-2 keeps local runtime SDKs and heavy ML packages out of the default install.
Install optional local runtime support with:
```bash
uv sync --extra local-ml --dev
```

The `local-ml` extra currently groups:

- `ollama`
- `openai`
- `FlagEmbedding`
- `sentence-transformers`
- `torch`
Default tests still use fake clients and optional-dependency-safe loaders. A fresh clone does not need Ollama, LM Studio, GPUs, or local model downloads.
PR-3 keeps cloud SDKs out of the default install and ships only contract-first cloud stubs.
Optional cloud dependency groups:

- `cloud-google`: `google-cloud-aiplatform`
- `cloud-aws`: `boto3`
- `cloud-openai`: `openai`
- `cloud-cohere`: `cohere`
- `cloud-voyage`: `voyageai`
- `cloud-jina`: no SDK package yet; the stub documents an `httpx`-style API boundary only

Example installs:

```bash
uv sync --extra cloud-openai --extra cloud-google --dev
uv sync --extra cloud-aws --extra cloud-cohere --dev
```

PR-3 cloud stubs are intentionally contract-only:

- no live cloud API calls in default tests
- no real API keys required for fresh clone verification
- `/models`, `/plugins`, and `make plugins-check` expose metadata, secret requirements, and current stub status only
- production cloud adapters should land in follow-up PRs, not inside this stub/docs slice
PR-5 introduces a read-only ProcessingProfile module that defines a file-type × task-type pipeline matrix. The system ships with default wildcard profiles and serves as the resolution layer for processing decisions in the ingestion and indexing pipelines.
- TaskType: `correct`, `clean`, `chunk`, `summarize`, `understand`, `embed`
- ProcessingProfile: defines provider, kind (deterministic/LLM-assisted), and status per (extension, task_type) combination
- Profile Resolution: extension override → wildcard default → safe fallback
- Default Profiles: all task types ship with wildcard defaults; chunk/embed use deterministic providers, summarize/understand are LLM-assisted stubs
What exists now:

- `src/ragrig/processing_profile/` core module with 100% test coverage
- `GET /processing-profiles` — lists all profiles (with provider, status, task_type; no raw secrets)
- `GET /processing-profiles/matrix` — returns the extension × task_type grid with default/override and deterministic/LLM-assisted markers (example after this list)
- chunk and embed metadata include `profile_id` for traceability
- Web Console `Processing Profile Matrix` read-only view
- `resolve_provider_availability()` correctly reports unavailable providers (not faked as ready)
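A quick way to see the per-cell resolution result is the matrix endpoint; the fields below are illustrative of the documented markers, not an exact contract:

```bash
# Each cell resolves extension override → wildcard default → safe fallback.
curl http://localhost:8000/processing-profiles/matrix
# Illustrative cell shape:
# {"extension": "*", "task_type": "chunk", "provider": "deterministic-local",
#  "kind": "deterministic", "provider_available": true}
```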
Current limitations:

- No browser-side profile CRUD (create/edit/delete)
- No real LLM summarize/understand calls — profiles define configuration only
- No per-profile A/B evaluation metrics
- No secret storage or secret echo in API responses
- Provider availability for LLM tasks is read-only; no runtime health check beyond plugin registry status
When a profile's LLM provider is unavailable:
- The matrix marks the cell with `provider_available: false` and an amber "⚠ unavail" indicator in the console
- The API response includes `provider_available: false` without fabricating a ready state
- Pipeline runs record `chunk_profile_id`/`embed_profile_id` in config snapshots; future phases will use these for fallback logic
Default local endpoints documented by PR-2:
- `model.ollama`: `http://localhost:11434`
- `model.lm_studio`: `http://localhost:1234/v1`
- `model.llama_cpp`: `http://localhost:8080/v1`
- `model.vllm`: `http://localhost:8000/v1`
- `model.xinference`: `http://localhost:9997/v1`
- `model.localai`: `http://localhost:8080/v1`
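Before registering one of these runtimes, it can help to confirm it is actually listening on the documented default endpoint, for example:

```bash
# Ollama lists locally pulled models on its default port.
curl http://localhost:11434/api/tags
# LM Studio and the other OpenAI-compatible runtimes expose /v1/models.
curl http://localhost:1234/v1/models
```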
- Install `uv` if it is not already available.

- Sync dependencies:

```bash
make sync
```

- Create a local env file:

```bash
cp .env.example .env
```

If `8000` or `5432` are already in use on the host, set alternate values in `.env`, for example `APP_HOST_PORT=18000` or `DB_HOST_PORT=15433`.

- Run code quality checks:

```bash
make format
make lint
make test
make coverage
make dependency-inventory
```

- Run supply-chain checks:

```bash
make licenses
make sbom
make audit
```

`make audit` requires network access. If the environment is offline, use `make audit-dry-run` and treat the vulnerability audit as blocked rather than silently skipped.

- Start the database service:

```bash
docker compose up --build -d db
```

- Run the initial migration:

```bash
make migrate
```

- Verify the extension and schema:

```bash
make db-check
```

Expected output shape:

```json
{
  "current_revision": "20260503_0001",
  "extension": "vector",
  "missing_tables": [],
  "present_tables": [
    "chunks",
    "document_versions",
    "documents",
    "embeddings",
    "knowledge_bases",
    "pipeline_run_items",
    "pipeline_runs",
    "sources"
  ],
  "revision_matches_head": true
}
```
- Preview the local ingestion fixture without writing to the database:

```bash
make ingest-local-dry-run
```

- Ingest the local Markdown/Text fixture into the database:

```bash
make ingest-local
```

- Query the latest local-ingestion run summary:

```bash
make ingest-check
```

Expected output shape:

```json
{
  "counts": {
    "document_versions": 4,
    "documents": 5,
    "pipeline_run_items": 5,
    "sources": 1
  },
  "knowledge_base": {
    "name": "fixture-local"
  },
  "latest_pipeline_run": {
    "failure_count": 0,
    "status": "completed",
    "success_count": 4,
    "total_items": 5
  }
}
```
- Chunk and embed the latest ingested document versions:

```bash
make index-local
```

- Query the latest chunking and embedding run summary:

```bash
make index-check
```

Expected output shape:

```json
{
  "counts": {
    "chunks": 4,
    "embeddings": 4
  },
  "embedding_dimensions": [
    {
      "count": 4,
      "dimensions": 8,
      "model": "hash-8d",
      "provider": "deterministic-local"
    }
  ],
  "latest_pipeline_run": {
    "failure_count": 0,
    "status": "completed",
    "success_count": 3,
    "total_items": 4
  }
}
```

- Run a retrieval smoke query against the indexed chunks:

```bash
make retrieve-check QUERY="RAGRig Guide"
```

Expected output shape:

```json
{
  "dimensions": 8,
  "distance_metric": "cosine_distance",
  "knowledge_base": "fixture-local",
  "model": "hash-8d",
  "provider": "deterministic-local",
  "query": "RAGRig Guide",
  "results": [
    {
      "chunk_id": "...",
      "chunk_index": 0,
      "document_id": "...",
      "document_uri": ".../guide.md",
      "document_version_id": "...",
      "distance": 0.0,
      "score": 1.0,
      "source_uri": ".../tests/fixtures/local_ingestion",
      "text_preview": "# RAGRig Guide ..."
    }
  ],
  "top_k": 3,
  "total_results": 1
}
```

The default path uses `VECTOR_BACKEND=pgvector`. If you explicitly enable Qdrant, the response shape stays the same and adds backend metadata:

```json
{
  "backend": "qdrant",
  "backend_metadata": {
    "distance_metric": "cosine",
    "status": "ready"
  }
}
```

- Start optional local Qdrant only when you want the alternate backend smoke path:

```bash
docker compose --profile qdrant up -d qdrant
uv sync --extra vectorstores
VECTOR_BACKEND=qdrant make index-local
VECTOR_BACKEND=qdrant make retrieve-check QUERY="RAGRig Guide"
```

`qdrant-client` is intentionally optional. Fresh clone `make test` and `make coverage` continue to pass without the package or a running Qdrant container.
- Inspect plugin readiness offline:

```bash
make plugins-check
```

`source.s3` reports `unavailable` until you install the optional S3 SDK with `uv sync --extra s3`.

- Run the opt-in S3-compatible smoke path against MinIO or another S3-compatible endpoint:

```bash
docker compose --profile minio up -d minio
uv sync --extra s3
make s3-check
```

The default `.env.example` values target the local MinIO profile. `make s3-check` seeds `tests/fixtures/local_ingestion/` into the configured bucket before ingesting it.

Minimal runtime config uses declared secret refs only:

```json
{
  "bucket": "ragrig-smoke",
  "prefix": "ragrig-smoke",
  "endpoint_url": "http://127.0.0.1:9000",
  "region": "us-east-1",
  "use_path_style": true,
  "verify_tls": false,
  "access_key": "env:AWS_ACCESS_KEY_ID",
  "secret_key": "env:AWS_SECRET_ACCESS_KEY",
  "session_token": "env:AWS_SESSION_TOKEN"
}
```

Current `source.s3` limits:

- only Markdown and plain-text objects are parsed
- unsupported extensions, binary objects, and oversized objects are skipped with recorded reasons
- delete detection, tombstones, and standalone cursor state are not implemented yet
- Start the local API service, including the Web Console:
```bash
make run-web
```
Then open `http://localhost:8000/console`.
If you changed `APP_HOST_PORT`, open that port instead.
- Run the Web Console smoke contract:
```bash
make web-check
```
- Start the full local development stack when you also want Docker-managed app + DB:
```bash
docker compose up --build
```
- Verify the service and pgvector bootstrap:
```bash
curl http://localhost:8000/health
docker compose exec db psql -U ragrig -d ragrig -c "SELECT extname FROM pg_extension WHERE extname = 'vector';"
docker compose exec db psql -U ragrig -d ragrig -c "SELECT tablename FROM pg_tables WHERE schemaname = 'public' ORDER BY tablename;"
```
If you changed `APP_HOST_PORT`, use that port in the `curl` command.
If you changed `DB_HOST_PORT`, keep using `docker compose exec db ...`; no command change is required.
Expected healthy response:

```json
{
  "status": "healthy",
  "app": "ok",
  "db": "connected",
  "version": "0.1.0"
}
```

If PostgreSQL is unavailable, `/health` returns 503 with a clear error payload.
- Exercise the retrieval API directly:

```bash
curl -X POST http://localhost:8000/retrieval/search \
  -H "Content-Type: application/json" \
  -d '{"knowledge_base":"fixture-local","query":"RAGRig Guide","top_k":1}'
```

If you changed `APP_HOST_PORT`, use that port in the request URL.
Repository-level DB commands:
- `make migrate`: apply Alembic migrations to head
- `make migrate-down`: roll back one migration step
- `make db-check`: verify the `pgvector` extension, required Phase 1a tables, and Alembic head revision
- `make db-shell`: open `psql` in the Compose database container
- `make test-db`: alias for the DB smoke check
- `make web-check`: verify `/console` and the Web Console data routes
- `make ingest-local-dry-run`: preview scanned files and skip reasons without DB writes
- `make ingest-local`: ingest the local fixture corpus or an overridden root path into the metadata DB
- `make ingest-check`: query the latest local-ingestion run and document-version evidence
- `make index-local`: chunk and embed the latest ingested document versions for the chosen knowledge base
- `make index-check`: query the latest chunk and embedding run, counts, spans, and embedding dimensions
- `make retrieve-check QUERY="..."`: query the indexed chunks and print top-k citation fields
Fresh-clone schema verification path:
```bash
make sync
cp .env.example .env
docker compose up --build -d db
make migrate
make db-check
```

The Compose file still supports shared-machine port overrides through `.env`, for example:

```bash
APP_HOST_PORT=18000
DB_HOST_PORT=15433
```

This override path must remain available for 192.168.3.100 and other shared hosts where default ports are already in use.

Host-side migration and smoke commands (`make migrate`, `make db-check`) connect through `localhost:${DB_HOST_PORT}`, so they work from the machine that launched Docker Compose even though the application container still uses `DATABASE_URL=postgresql://ragrig:ragrig_dev@db:5432/ragrig` internally.

The same host-side runtime URL rule applies to `make ingest-local` and `make ingest-check`, so shared-host verification can use alternate mapped DB ports without rewriting the app container path.
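A concrete shared-host example that puts the override and the host-side URL rule together:

```bash
# 5432 is already taken on this host, so remap the DB port first.
echo "DB_HOST_PORT=15433" >> .env
docker compose up --build -d db
# Host-side commands now reach PostgreSQL on localhost:15433, while the
# app container keeps using db:5432 inside the Compose network.
make migrate
make db-check
```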
Phase 1b currently implements the smallest reproducible local ingestion loop for Markdown and plain text files.
What it does:
- scans an explicit local root path
- applies include and exclude glob filters
- skips excluded, oversized, unsupported, and binary files with recorded reasons
- parses UTF-8 Markdown and text files
- computes SHA-256 file hashes
- writes `sources`, `documents`, `document_versions`, `pipeline_runs`, and `pipeline_run_items`
- avoids duplicate `document_versions` when the file content hash has not changed (a hand check is sketched below)
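Since the dedup check is driven by the stored SHA-256 content hash, the skip behavior can be sanity-checked by hand; this assumes the `guide.md` fixture referenced in the retrieval example:

```bash
# Hash the fixture file, then re-run make ingest-local without editing
# it: no new document_versions row should appear for this document.
shasum -a 256 tests/fixtures/local_ingestion/guide.md
make ingest-local
make ingest-check
```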
What it does not do yet:
- chunking
- embeddings or pgvector writes
- deletion cleanup or tombstones
Default fixture path: `tests/fixtures/local_ingestion`

Custom run example:

```bash
uv run python -m scripts.ingest_local \
  --knowledge-base demo \
  --root-path tests/fixtures/local_ingestion \
  --include "*.md" \
  --include "*.txt" \
  --exclude "nested/*"
```

Dry-run example:

```bash
uv run python -m scripts.ingest_local \
  --knowledge-base demo \
  --root-path tests/fixtures/local_ingestion \
  --dry-run
```

Phase 1d implements the smallest retrieval boundary on top of Phase 1c indexed chunks.
What it does:
- embeds query text with the same deterministic-local provider used for default indexing smoke runs
- searches only the latest `document_versions` rows per document
- returns top-k chunk matches with `document_id`, `document_version_id`, `chunk_id`, `chunk_index`, `document_uri`, `source_uri`, `distance`, `score`, and `chunk_metadata`
- exposes both `POST /retrieval/search` and `make retrieve-check`
What it does not do yet:
- answer generation
- reranking or lexical fallback
- ACL filtering
- external embedding providers as the default path
The Web Console is served by the same FastAPI process as `/health`, `/docs`, and `/retrieval/search`.
Local startup sequence from a fresh clone:
```bash
make sync
cp .env.example .env
docker compose up --build -d db
make migrate
make ingest-local
make index-local
make run-web
```

Then open `http://localhost:8000/console`.

Suggested local verification sequence:

```bash
make test
make web-check
make retrieve-check QUERY="RAGRig Guide"
```

Relationship to other interfaces:
- Web Console: operator-facing overview and debugging workbench
- Swagger (`/docs`): raw API exploration
- CLI / Make targets: write-path orchestration for ingest and indexing in this MVP
The console does not invent data. If a knowledge base has no chunks, models, or retrieval results yet, the UI shows real empty or degraded states instead of placeholders.
RAGRig is designed as a small core with plugin-first extension points. The core owns workspace state, knowledge bases, documents, versions, chunks, embeddings, pipeline runs, metadata, access boundaries, audit events, and plugin contracts. Integrations live behind typed plugin interfaces.
The goal is not to build a plugin marketplace first. The goal is to make every integration explicit, testable, observable, and replaceable.
The README uses official platform links instead of embedding third-party logos. A visual integration gallery can be added later under docs/ when each logo's trademark and usage rules are checked.
Provider priority is local-first, cloud-second. Local model runtimes, local embeddings, local rerankers, and self-hosted vector stores must be usable before a user configures a cloud account.
Plugin families:
| Family | Purpose | Examples |
|---|---|---|
| Source connectors | Read enterprise knowledge from external systems | local files, SMB/NFS, S3-compatible storage, Google Drive, SharePoint, Confluence, databases |
| Parsers and OCR | Convert raw files into extracted text and structure | Markdown, plain text, PDF, DOCX, XLSX, Docling, MinerU, Tesseract, PaddleOCR |
| Cleaning nodes | Normalize, redact, classify, dedupe, and enrich content | deterministic cleaners, LLM-assisted cleaners, PII redaction, metadata extraction |
| Chunkers | Split document versions into traceable chunks | character windows, Markdown heading chunks, recursive text chunks, table-aware chunks |
| Model providers | Supply LLMs, embedding models, rerankers, OCR, and parsing models | local Ollama, LM Studio, vLLM, llama.cpp, Xinference, BAAI BGE, plus cloud Google Vertex AI, Amazon Bedrock, OpenRouter, OpenAI, Cohere, Voyage AI |
| Vector backends | Store and search vectors with backend-specific capability reporting | pgvector, Qdrant, Milvus/Zilliz, Weaviate, OpenSearch/Elasticsearch, Redis/Valkey |
| Output sinks | Write governed knowledge or retrieval artifacts elsewhere | Amazon S3/Cloudflare R2/MinIO, NFS, relational databases, JSONL, Parquet, Markdown, webhooks, MCP |
| Preview/edit integrations | Let operators inspect or edit source and cleaned knowledge | Markdown editor, WPS, OnlyOffice, Collabora Online, source-system deep links |
| Evaluation plugins | Measure retrieval and answer quality | golden questions, citation coverage, latency/cost, regression checks |
| Workflow nodes | Compose ingestion, indexing, export, and evaluation pipelines | scan, parse, clean, chunk, embed, index, retrieve, evaluate, export, notify |
RAGRig separates plugins by stability, priority, and maintenance ownership.
| Tier | Meaning | Ships with core | Extension policy |
|---|---|---|---|
| Built-in core plugins | Minimal local-first path required for a reproducible RAG pipeline | Yes | Maintained in this repository, no optional external service dependency |
| Official plugins | High-demand integrations maintained by the RAGRig project | Usually optional | May live in this repository first, then move to separate packages as APIs stabilize |
| Community plugins | Third-party integrations built against public contracts | No | Installed through Python packages or plugin manifests once the contract is stable |
Initial built-in core plugins:
| Plugin | Family | Read/write | Why it is core |
|---|---|---|---|
| `source.local` | Source connector | Read | Fresh-clone demo, fixture validation, shared-host smoke testing |
| `parser.markdown` | Parser | Read | Common documentation format, deterministic tests |
| `parser.text` | Parser | Read | Smallest plain-text ingestion path |
| `chunker.character_window` | Chunker | Write chunks | Reproducible chunking before semantic chunkers exist |
| `embedding.deterministic_local` | Model provider | Write embeddings | Secret-free development and CI validation |
| `vector.pgvector` | Vector backend | Read/write | Default lightweight backend on Postgres |
| `sink.jsonl` | Output sink | Write | Portable debug/export format |
| `preview.markdown` | Preview/edit | Read/write draft | Operator review without needing an office suite |
Priority official plugins:
| Priority | Plugin area | Platforms and protocols to cover first |
|---|---|---|
| P0 | `vector.qdrant` | Self-hosted Qdrant first, Qdrant Cloud second |
| P0 | `model.local_runtime` | Ollama, LM Studio, llama.cpp server, vLLM, Xinference, LocalAI through official SDKs or OpenAI-compatible local APIs |
| P0 | `embedding.bge` and `reranker.bge` | BAAI BGE embedding and reranker models through local FlagEmbedding, sentence-transformers, or OpenAI-compatible serving |
| P1 | `model.cloud_provider` | Google Vertex AI, Amazon Bedrock, OpenRouter, OpenAI, Azure OpenAI, Cohere, Voyage AI, Jina AI |
| P1 | `source.s3` | AWS S3, Cloudflare R2, MinIO, Ceph RGW, Wasabi, Backblaze B2 S3 API, Tencent COS S3 API, Alibaba OSS S3-compatible mode when available |
| P1 | `sink.object_storage` | AWS S3, Cloudflare R2, MinIO, Ceph RGW, Wasabi, Backblaze B2, Google Cloud Storage, Azure Blob Storage |
| P1 | `source.fileshare` | SMB/CIFS, NFS, WebDAV, SFTP/OpenSSH |
| P1 | `source.google_workspace` | Google Drive, Google Docs, Google Sheets, Google Slides |
| P1 | `source.microsoft_365` | SharePoint, OneDrive, Word, Excel, PowerPoint |
| P1 | `source.wiki` | Confluence, MediaWiki, GitBook, Docusaurus, MkDocs |
| P1 | `source.database` | PostgreSQL, MySQL/MariaDB, SQL Server, Oracle Database, SQLite, MongoDB, Elasticsearch/OpenSearch |
| P1 | `preview.office` | WPS, OnlyOffice, Collabora Online |
| P2 | `source.collaboration` | Notion, Lark/Feishu, DingTalk, WeCom, Slack files, Microsoft Teams files |
| P2 | `parser.advanced_documents` | PDF layout extraction, DOCX/PPTX/XLSX, Docling, MinerU, Unstructured |
| P2 | `ocr` | PaddleOCR, Tesseract, AWS Textract, Azure Document Intelligence, Google Document AI |
| P2 | `vector.enterprise` | Milvus/Zilliz, Weaviate, OpenSearch/Elasticsearch vector, Redis/Valkey vector, Vespa |
| P2 | `sink.analytics` | Parquet, DuckDB, ClickHouse, BigQuery, Snowflake |
| P2 | `sink.agent_access` | MCP server, webhooks, retrieval API export adapters |
Every plugin should declare:
- plugin id, type, version, and owner
- supported read/write operations
- configuration schema
- required secrets
- capability matrix
- local/cloud classification
- dimensions and context-window metadata when applicable
- SDK or protocol surface
- cursor or incremental-sync support
- delete detection support
- permission mapping support
- failure and retry behavior
- emitted metrics and audit events
Example manifest shape:
```yaml
manifest_version: 1
id: source.s3
type: source
version: 0.1.0
capabilities:
  - read
  - incremental_sync
  - delete_detection
config_model: S3SourceConfig
secret_requirements:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
```

Current contract-first implementation adds:
- `src/ragrig/plugins/` for the registry, manifest schema, dependency guards, and built-in plus official stub manifests.
- `GET /plugins` for offline plugin discovery with readiness, missing-dependency, configurability, and secret requirement reporting.
- `POST /plugins/{plugin_id}/validate-config` for safe Web Console config validation without collecting raw secrets.
- `make plugins-check` for offline JSON inspection of the registry.
- `source.fileshare` as a real official source plugin with mounted-path NFS support, fake-client SMB/WebDAV/SFTP coverage, and protocol-level readiness reporting.
- `make fileshare-check` for offline mounted-path and fake remote fileshare smoke validation.
RAGRig now exposes an enterprise connector catalog separate from live connector execution. It covers local files, fileshares, S3-compatible storage, Google Workspace, Microsoft 365, wikis, databases, collaboration suites, Notion, Slack files, Box, Dropbox, and GitHub repository contents with official documentation links, protocols, credential names, and workflow operation metadata.
New endpoints:
- `GET /enterprise-connectors` lists connector families, protocols, credential env var names, docs links, and workflow operation mappings.
- `POST /enterprise-connectors/{connector_id}/probe` performs a safe local probe (example below). Without credentials, cloud/SaaS connectors return `missing_credentials` and do not make network calls.
- `GET /workflows/operations` lists workflow node operations.
- `POST /workflows/runs` runs or dry-runs a lightweight DAG with dependency validation.
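For example, probing a SaaS connector from the quick-start stack stays entirely local; the connector id below is illustrative, and `GET /enterprise-connectors` lists the real ids:

```bash
# Without credentials this returns missing_credentials and performs
# no network call to the SaaS platform itself.
curl -X POST http://localhost:8000/enterprise-connectors/source.notion/probe
```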
Workflow operations available now:
- `ingest.local`
- `ingest.fileshare`
- `ingest.s3`
- `ingest.connector`
- `index.knowledge_base`
- `noop`
The engine executes steps in topological order; rejects duplicate steps, unknown dependencies, cycles, and unsupported operations; supports dry-runs, per-step retry counts, and dependency skipping; and returns linked `pipeline_run_id` values for real ingest/index steps. Default tests stay network-free and secret-free. A request sketch follows below.
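A minimal dry-run sketch against `POST /workflows/runs`; the field names (`steps`, `operation`, `depends_on`, `dry_run`) are assumptions for illustration, not the committed request schema:

```bash
# Dry-run a two-step DAG: local ingestion followed by indexing.
# The engine validates dependencies and reports the plan without
# writing anything when the dry-run flag is set.
curl -X POST http://localhost:8000/workflows/runs \
  -H "Content-Type: application/json" \
  -d '{
        "dry_run": true,
        "steps": [
          {"id": "ingest", "operation": "ingest.local"},
          {"id": "index", "operation": "index.knowledge_base", "depends_on": ["ingest"]}
        ]
      }'
```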
`source.fileshare` is the current local-first bridge for enterprise shared storage.
What it supports now:
- `protocol = nfs_mounted`: mount the share through the OS, then point RAGRig at the mounted directory
- `protocol = smb`: SMB/CIFS contract, readiness reporting, fake-client tests, optional `smbprotocol` runtime dependency
- `protocol = webdav`: WebDAV contract, readiness reporting, fake-client tests, optional `httpx` runtime dependency
- `protocol = sftp`: SFTP contract, readiness reporting, fake-client tests, optional `paramiko` runtime dependency
Current boundaries:
- default `make test` and `make coverage` stay network-free and secret-free
- delete detection is a placeholder audit signal only; it records `deleted_upstream` in pipeline items but does not delete stored documents
- permission mapping is metadata-only for now; access enforcement is not implemented in this phase
- only Markdown/Text parsing goes through the existing parser path by default
Install optional runtime SDKs with:

```bash
uv sync --extra fileshare --dev
```

Offline smoke:

```bash
make fileshare-check
```

Live smoke (local Docker services, explicit opt-in):
```bash
make preflight-fileshare-live              # check Docker, ports, and optional SDKs
make test-live-fileshare                   # preflight + up + seed + pytest + evidence
make test-live-fileshare-print-evidence    # same, but prints the evidence record to stdout
make fileshare-live-down                   # tear down
```

Live smoke validates real list/read/stat/skip behavior against local Samba, WebDAV, and SFTP containers. It does not run in default CI.
Prerequisites:
- Optional SDKs can be installed with `uv sync --extra fileshare --dev`.
QA acceptance path:
- Run `make preflight-fileshare-live` first. If it reports blockers, do not start containers.
- Run `make test-live-fileshare` to produce a full evidence record at `docs/operations/artifacts/fileshare-live-smoke-record.json`.
- Paste the record (or the `make test-live-fileshare-print-evidence` output) into the PR or issue as acceptance evidence.
Unavailable environment fallback:
- If `.env` is missing, preflight blocks with `cp .env.example .env` and stops before any container checks.
- If Docker is not installed or the daemon is not running, preflight prints actionable steps and exits without starting containers.
- If optional SDKs (`smbprotocol`, `paramiko`, `httpx`) are missing, preflight warns with the exact install command (`uv sync --extra fileshare --dev`) and a fallback note; pytest will skip the corresponding protocol tests.
- If a required port is occupied, preflight outputs the port number and three fix options:
  - free the port,
  - override it in `.env` (e.g. `SMB_HOST_PORT=1446`),
  - or run with `FILESHARE_AUTO_PICK_PORTS=1 make test-live-fileshare` to auto-select free ports.
- Offline coverage is still enforced by `make test` and `make coverage`; live smoke is an additive, explicit opt-in only.
Example SMB config:

```json
{
"protocol": "smb",
"host": "files.example.internal",
"share": "knowledge",
"root_path": "/docs",
"username": "env:FILESHARE_USERNAME",
"password": "env:FILESHARE_PASSWORD",
"include_patterns": ["*.md", "*.txt"],
"exclude_patterns": [],
"max_file_size_mb": 50,
"page_size": 1000,
"max_retries": 3,
"connect_timeout_seconds": 10,
"read_timeout_seconds": 30
}
```

Example mounted NFS/local-path config:

```json
{
"protocol": "nfs_mounted",
"root_path": "/mnt/company-knowledge",
"include_patterns": ["*.md", "*.txt"],
"exclude_patterns": [],
"max_file_size_mb": 50,
"page_size": 1000,
"max_retries": 1,
"connect_timeout_seconds": 10,
"read_timeout_seconds": 30
}
```

Plugin development will start with internal Python interfaces. Public third-party plugin packaging should wait until the core contracts, test kit, and capability matrix are stable.
RAGRig uses a strict quality and dependency policy:
- Core modules must reach and maintain 100% test coverage.
- Default tests must not require network access, cloud accounts, or secrets.
- Provider SDKs must be official or actively maintained open-source packages whenever possible.
- Heavy or cloud-specific SDKs must live behind optional plugin extras, not the core runtime.
- `uv.lock` stays committed, and release candidates should include vulnerability checks, license review, and SBOM generation.
Executable commands in this repository:
- `make coverage`: enforces 100% line coverage for the hard core scope: `db`, `repositories`, `ingestion`, `parsers`, `chunkers`, `embeddings`, `indexing`, `plugins`, `retrieval.py`, `config.py`, and `health.py`.
- `make plugins-check`: prints the plugin registry discovery payload as offline JSON.
- `make export-object-storage-check`: runs an opt-in object storage export smoke command and defaults to `dry_run` unless explicitly overridden.
- `make licenses`: fails on GPL, AGPL, SSPL, or source-available third-party packages.
- `make sbom`: writes a CycloneDX JSON SBOM to `docs/operations/artifacts/sbom.cyclonedx.json`.
- `make audit`: runs a vulnerability audit of the local environment and writes `docs/operations/artifacts/pip-audit.json`.
- `make dependency-inventory`: refreshes `docs/operations/dependency-inventory.md`.
- `make supply-chain-check`: runs the license check, SBOM export, and vulnerability audit together.
`sink.object_storage` now exports a minimal governed artifact set to S3-compatible object storage using optional `boto3`, with opt-in Parquet export support via optional `pyarrow`.
Current runtime-ready targets:
- AWS S3
- Cloudflare R2
- MinIO
- Ceph RGW
- Wasabi
- Backblaze B2 S3 API
- Tencent COS S3 API
- Alibaba OSS in S3-compatible mode
Contract-only targets in this phase:
- Google Cloud Storage
- Azure Blob Storage
Example config:

```json
{
"bucket": "exports",
"prefix": "team-a",
"endpoint_url": "http://localhost:9000",
"region": "us-east-1",
"use_path_style": true,
"verify_tls": true,
"access_key": "env:AWS_ACCESS_KEY_ID",
"secret_key": "env:AWS_SECRET_ACCESS_KEY",
"session_token": "env:AWS_SESSION_TOKEN",
"path_template": "{knowledge_base}/{run_id}/{artifact}.{format}",
"overwrite": false,
"dry_run": true,
"include_retrieval_artifact": true,
"include_markdown_summary": true,
"parquet_export": false,
"object_metadata": {
"environment": "dev"
}
}
```

Behavior notes:
- JSONL artifacts use `application/x-ndjson`.
- Markdown summaries use `text/markdown; charset=utf-8`.
- Parquet artifacts use `application/vnd.apache.parquet` when `parquet_export=true`.
- Install with `uv sync --dev --extra parquet` to enable local Parquet export and validation.
- Existing objects are skipped when `overwrite=false`.
- `dry_run=true` computes the export plan without uploading objects (smoke example below).
- Retrieval and evaluation exports are explicitly marked unsupported/degraded until dedicated runtimes exist.
- `retrieval_status.parquet` is emitted only when `include_retrieval_artifact=true`; schema-only Parquet remains typed.
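The export smoke path can be exercised against the same local MinIO profile used by `make s3-check`; it stays in `dry_run` by default:

```bash
# Computes and prints the export plan; nothing is uploaded unless the
# dry_run default is explicitly overridden.
docker compose --profile minio up -d minio
make export-object-storage-check
```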
Hard-scope omissions are explicit rather than hidden by a broad exclude:
- `src/ragrig/main.py`: app wiring only
- `src/ragrig/web_console.py`: Web Console adapter layer, outside this issue's hard scope
- `src/ragrig/cleaners/*` and `src/ragrig/vectorstore/*`: placeholder packages with no shipped behavior
See the local-first, quality, and supply chain policy for the SDK inventory and supply chain rules. See core coverage and supply chain gates, supply chain operations, and the dependency inventory for the executable gate details.
RAGRig now includes a GitHub Actions baseline workflow named RAGRig CI, running on Python 3.11 and 3.12.
What it covers on `pull_request` and `push` to `main`:
- frozen dependency install from `uv.lock` with `uv sync --dev --frozen`
- formatting check with `uv run ruff format --check .`
- lint with `uv run ruff check .`
- repository test suite with `make test`
- hard-scope coverage gate with `make coverage`
- Web Console smoke contract with `make web-check`
What it does not cover yet:
- shared-environment runtime validation on 192.168.3.100
- Docker Compose deployment checks
- supply-chain, SBOM, license, or vulnerability gates, which are still intentionally excluded from default GitHub CI
- any workflow that depends on secrets, cloud accounts, GPUs, Ollama, LM Studio, or model downloads
Validation boundary:
- GitHub CI proves the fresh-clone lint and test baseline inside GitHub Actions.
- Local developer validation still covers targeted repro, iterative debugging, and pre-PR confirmation.
- Shared-environment validation remains a separate requirement for issues that explicitly require 192.168.3.100 evidence.
After the first successful GitHub Actions run exists, the repository owner may still need to configure branch protection required checks in GitHub settings.
```
.
├── alembic/
│ ├── env.py
│ └── versions/
│ └── 20260503_0001_phase_1a_metadata_schema.py
├── assets/
│ ├── ragrig-icon.png
│ └── ragrig-icon.svg
├── .github/
│ └── workflows/
│ └── ci.yml
├── docs/
│ ├── operations/
│ ├── prototypes/
│ ├── roadmap.md
│ └── specs/
│ ├── ragrig-github-ci-checks-spec.md
│ ├── ragrig-mvp-spec.md
│ ├── ragrig-phase-1a-metadata-db-spec.md
│ ├── ragrig-phase-1a-scaffold-spec.md
│ ├── ragrig-phase-1b-local-ingestion-spec.md
│ ├── ragrig-phase-1c-chunking-embedding-spec.md
│ ├── ragrig-phase-1d-retrieval-api-spec.md
│ ├── ragrig-local-first-quality-supply-chain-policy.md
│ ├── ragrig-web-console-plugin-source-wizard-spec.md
│ └── ragrig-web-console-spec.md
├── scripts/
│ ├── db_check.py
│ ├── index_check.py
│ ├── index_local.py
│ ├── ingest_check.py
│ ├── ingest_local.py
│ ├── retrieve_check.py
│ └── init-db.sql
├── src/
│ └── ragrig/
│ ├── db/
│ │ ├── engine.py
│ │ ├── models/
│ │ └── session.py
│ ├── chunkers/
│ ├── cleaners/
│ ├── embeddings/
│ ├── retrieval.py
│ ├── indexing/
│ ├── ingestion/
│ ├── parsers/
│ ├── repositories/
│ ├── vectorstore/
│ ├── config.py
│ └── main.py
├── tests/
│ ├── fixtures/
│ ├── test_alembic_sql.py
│ ├── test_db_check.py
│ ├── test_db_config.py
│ ├── test_db_models.py
│ ├── test_db_runtime_url.py
│ ├── test_db_session.py
│ ├── test_health.py
│ ├── test_indexing_pipeline.py
│ ├── test_ingestion_pipeline.py
│ ├── test_parsers.py
│ ├── test_retrieval.py
│ └── test_scanner.py
├── .env.example
├── alembic.ini
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── pyproject.toml
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── README.zh-CN.md
└── SECURITY.md
```
RAGRig is licensed under the Apache License 2.0. See LICENSE.
