End-to-end RAG pipeline that takes the Weaviate documentation (MDX/MD files from GitHub), resolves cross-file references into self-contained documents, chunks them, embeds them, stores them in Weaviate, and answers questions using Claude. Evaluation uses an LLM-as-judge approach with results reported to Google Sheets.
GitHub Repo ──> Parse & Resolve ──> Chunk ──> Embed ──> Store in Weaviate ──> Retrieve ──> Generate
(weaviate-io)   (MDX/MD files)      (text)    (vectors) (embedded Weaviate)   (query)      (Claude)
# Install dependencies
uv sync
# Configure environment
cp .env.example .env # then fill in REPO_URL, REPO_DIR, ANTHROPIC_API_KEY
# Run the full pipeline
uv run python main.py
# Force rebuild all caches
uv run python main.py --rebuild
# Run tests
uv run pytest tests/ -v
# Lint and format
uv run ruff check .
uv run ruff format .

| Variable | Description |
|---|---|
| REPO_URL | GitHub URL for the Weaviate docs repo |
| REPO_DIR | Local directory for the cloned repo |
| ANTHROPIC_API_KEY | For Claude generation and LLM-judge evaluation |
| SPREADSHEET_ID | Google Sheets ID for benchmark data and run results |

The Weaviate docs use Docusaurus (MDX). Pages import content from other files up to 3 levels deep. Two patterns dominate:
- MDX includes -- import Intro from './_intro.mdx' rendered as <Intro />
- FilteredTextBlock -- !!raw-loader! imports that extract code between comment markers
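
To make the two patterns concrete, here is a minimal sketch of how import lines could be told apart. The regexes, the `classify_imports` helper, and the sample paths are illustrative assumptions; the project's actual parsing code may match differently.

```python
import re

# Hypothetical patterns, written for illustration only.
# A raw-loader import has the "!!raw-loader!" prefix inside the quotes.
RAW_LOADER_RE = re.compile(
    r"""^import\s+(\w+)\s+from\s+['"]!!raw-loader!([^'"]+)['"]""", re.MULTILINE
)
# An MDX include points at a .md or .mdx file and has no loader prefix.
MDX_INCLUDE_RE = re.compile(
    r"""^import\s+(\w+)\s+from\s+['"]([^'"!]+\.mdx?)['"]""", re.MULTILINE
)


def classify_imports(source: str) -> dict[str, dict[str, str]]:
    """Map imported names to paths, grouped by import kind."""
    return {
        "mdx": dict(MDX_INCLUDE_RE.findall(source)),
        "raw": dict(RAW_LOADER_RE.findall(source)),
    }


# Sample page; the file paths are made up for this example.
page = """\
import Intro from './_intro.mdx';
import PyCode from '!!raw-loader!/_includes/code/howto/search.py';
import Tabs from '@theme/Tabs';

<Intro />
"""
result = classify_imports(page)
print(result)
```

Component imports such as `@theme/Tabs` match neither pattern, so they are left out of both groups.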
The pipeline builds an import graph, identifies top-level pages (not imported by others), then resolves each page by inlining includes, resolving code blocks, and stripping JSX syntax. Result: 329 source files become clean, self-contained markdown documents.
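
The "top-level page" step can be sketched as a set difference over the import graph. This assumes the graph is a plain mapping from each file to the files it imports; the real `ImportGraph` dataclass may be shaped differently, and the file names below are invented.

```python
def find_top_level(import_graph: dict[str, list[str]]) -> list[str]:
    """Return files that no other file imports, in sorted order."""
    imported = {target for targets in import_graph.values() for target in targets}
    return sorted(set(import_graph) - imported)


# Hypothetical graph: search.mdx pulls in two partials, one of which
# imports a further snippet; install.md imports nothing.
graph = {
    "search.mdx": ["_intro.mdx", "_examples.mdx"],
    "_intro.mdx": [],
    "_examples.mdx": ["_snippet.mdx"],
    "_snippet.mdx": [],
    "install.md": [],
}
top_level = find_top_level(graph)
print(top_level)
```

Only the files returned here would be resolved into standalone documents; the partials exist solely to be inlined into them.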
Six chunking strategies are implemented. The active strategy is markdown-optimized:
- Split at header boundaries with breadcrumb trails (Title > H2 > H3)
- Child chunks target ~1500 chars, split at paragraph boundaries
- Each child stores the full parent section text for richer LLM context
- Small adjacent children are merged; trailing runts are merged backward
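
The paragraph packing and runt merging described above might look roughly like the sketch below. It is not the project's implementation: `split_section` and the `min_size` threshold of 200 characters are assumptions, and header splitting and breadcrumbs are left out.

```python
def split_section(text: str, target: int = 1500, min_size: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of roughly `target` characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > target:
            # Adding this paragraph would overshoot: close the current chunk.
            chunks.append(current)
            current = para
        else:
            # Small adjacent paragraphs accumulate into one chunk.
            current = candidate
    if current:
        chunks.append(current)
    # Merge a trailing runt backward into the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        chunks[-1] = f"{chunks[-1]}\n\n{tail}"
    return chunks


section = "\n\n".join(["a" * 700, "b" * 700, "c" * 1400, "d" * 150])
chunks = split_section(section)
print([len(c) for c in chunks])
```

Because splits happen only at paragraph boundaries, a chunk can exceed the target slightly, which is why the size is a target rather than a hard limit.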
Chunks are embedded using a local sentence-transformers model (BAAI/bge-base-en-v1.5 on MPS) and batch-inserted into an embedded Weaviate instance with vectorizer=none.
Each query is embedded (or optionally transformed via HYDE), searched via near_vector or hybrid (BM25 + vector fusion), optionally reranked with a cross-encoder, then passed to Claude Sonnet for generation. The LLM receives deduplicated parent-section text rather than individual child chunks.
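
As a sketch of the last step, parent-section deduplication can be done by walking hits in score order and keeping each parent text once. The `Hit` dataclass, its field names, the `max_sections` cap, and the separator are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical retrieval hit; field names are illustrative."""

    chunk_id: str
    parent_text: str
    score: float


def build_context(hits: list[Hit], max_sections: int = 5) -> str:
    """Join unique parent sections, best-scoring first."""
    seen: set[str] = set()
    sections: list[str] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.parent_text in seen:
            continue  # A sibling chunk already contributed this section.
        seen.add(hit.parent_text)
        sections.append(hit.parent_text)
        if len(sections) == max_sections:
            break
    return "\n\n---\n\n".join(sections)


hits = [
    Hit("a1", "Section A", 0.9),
    Hit("b1", "Section B", 0.8),
    Hit("a2", "Section A", 0.7),
]
context = build_context(hits)
print(context)
```

Two child chunks from the same section therefore cost the prompt one copy of that section, not two.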
Each query is scored on two dimensions:
- Source matching -- do retrieved chunk paths match expected source documents?
- Fact recall (LLM-as-judge) -- Claude extracts atomic key facts from benchmark answers, then checks which facts appear in retrieved chunks and in the generated answer
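
Both dimensions reduce to simple ratios once the judgements exist. The sketch below assumes source matching is scored as the fraction of expected documents retrieved and that the judge returns one boolean per key fact; the function names and the exact formulas are assumptions, not the project's scoring code.

```python
def source_match(retrieved_paths: list[str], expected_paths: list[str]) -> float:
    """Fraction of expected source documents present among retrieved paths."""
    if not expected_paths:
        return 0.0
    retrieved = set(retrieved_paths)
    found = sum(1 for path in expected_paths if path in retrieved)
    return found / len(expected_paths)


def fact_recall(judgements: dict[str, bool]) -> float:
    """Fraction of key facts the judge marked as present."""
    if not judgements:
        return 0.0
    return sum(judgements.values()) / len(judgements)


# Made-up paths and judgements for illustration.
match = source_match(
    ["docs/a.md", "docs/b.md", "docs/a.md"],
    ["docs/a.md", "docs/c.md"],
)
recall = fact_recall({"fact-1": True, "fact-2": True, "fact-3": False, "fact-4": True})
print(match, recall)
```

The same `fact_recall` ratio can be computed twice per query, once against the retrieved chunks and once against the generated answer.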
Results go to Google Sheets: a per-run tab with full metadata and a summary grid comparing configurations.
main.py loops over the Cartesian product of axes defined in config.py:
| Axis | Current Default |
|---|---|
| Chunking strategy | markdown-optimized |
| Embedding model | BAAI/bge-base-en-v1.5 |
| Retrieval k | 250 |
| Reranker model | BAAI/bge-reranker-base |
| Search type | vector |
| Query transform | none |
12 benchmark queries are evaluated per configuration.
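
A loop over the Cartesian product of axes can be sketched with `itertools.product`. The `AXES` mapping and its key names are hypothetical stand-ins for whatever config.py defines; here two axes carry a second value (hybrid search and HYDE, both described above) to show how the grid grows.

```python
from itertools import product

# Hypothetical axis definitions; the real names in config.py may differ.
AXES = {
    "chunking_strategy": ["markdown-optimized"],
    "embedding_model": ["BAAI/bge-base-en-v1.5"],
    "retrieval_k": [250],
    "reranker_model": ["BAAI/bge-reranker-base"],
    "search_type": ["vector", "hybrid"],
    "query_transform": ["none", "hyde"],
}


def iter_configs(axes: dict[str, list]):
    """Yield one configuration dict per combination of axis values."""
    keys = list(axes)
    for values in product(*(axes[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(AXES))
print(len(configs))
```

With one value per axis, as in the defaults table, the product contains a single configuration.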
Each pipeline stage caches independently under cache/:
cache/
├── documents.json          # Resolved documents
├── key_facts.json          # LLM-extracted key facts
└── {strategy}/
    ├── chunks.json         # Chunked documents
    └── {model-slug}/
        └── embeddings.npy  # Embeddings as numpy array
Use --rebuild to force regeneration.
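
The layout above could be derived from a strategy name and a model name as in this sketch. The slug scheme (`/` replaced by `--`, lowercased) and the helper names are assumptions; the project may slugify model names differently.

```python
from pathlib import Path


def model_slug(model_name: str) -> str:
    """Hypothetical slug scheme: make the model name safe as a directory."""
    return model_name.replace("/", "--").lower()


def cache_paths(strategy: str, model_name: str, root: Path = Path("cache")) -> dict[str, Path]:
    """Return the cache file for each pipeline stage."""
    strategy_dir = root / strategy
    return {
        "documents": root / "documents.json",
        "key_facts": root / "key_facts.json",
        "chunks": strategy_dir / "chunks.json",
        "embeddings": strategy_dir / model_slug(model_name) / "embeddings.npy",
    }


def needs_build(path: Path, rebuild: bool) -> bool:
    """A stage runs when a rebuild is forced or its cache file is missing."""
    return rebuild or not path.exists()


paths = cache_paths("markdown-optimized", "BAAI/bge-base-en-v1.5")
print(paths["embeddings"])
```

Nesting embeddings under the strategy directory means changing the chunking strategy never reuses vectors computed for a different set of chunks.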
rag/
├── src/rag/ # Package (15 modules, ~1590 lines)
│ ├── models.py # Document, ImportGraph, Chunk dataclasses
│ ├── git.py # clone_or_pull (repo cloning)
│ ├── parsing.py # MDX parsing, resolution, import graph building
│ ├── load.py # Document loading & JSON caching
│ ├── config.py # All pipeline configuration constants
│ ├── chunking.py # 6 chunking strategies
│ ├── embedding.py # Local (sentence-transformers) & API (litellm) embeddings
│ ├── retrieval.py # Weaviate batch indexing
│ ├── reranker.py # Cross-encoder reranking
│ ├── pipeline.py # Pipeline orchestration
│ ├── llm.py # Claude generation + HYDE
│ ├── llm_judge.py # LLM-based fact extraction & recall scoring
│ ├── eval.py # Source matching + evaluation dispatch
│ └── sheets.py # Google Sheets benchmark I/O
├── main.py # Entry point
├── tests/ # 82 tests across 5 files
├── cache/ # Cached artifacts (gitignored)
└── weaviate-docs/ # Cloned Weaviate docs repo
Core: pyyaml, python-dotenv, requests, markdown
ML/Embedding: sentence-transformers, optimum[onnxruntime], onnxruntime, litellm
Database: weaviate-client
Google Sheets: google-api-python-client, google-auth-oauthlib, google-auth-httplib2
Visualization: matplotlib
Dev: pytest, ruff
The codebase has three functions that parse import statements, each serving a different consumer:
| | parse_file_imports | parse_raw_imports | parse_mdx_imports |
|---|---|---|---|
| Used by | build_import_graph | resolve_filtered_text_blocks | inline_mdx_includes |
| Returns | list[Path] | dict[str, str] (var -> path) | dict[str, Path] (name -> path) |
| Matches | All file imports | Only !!raw-loader! | Only .md/.mdx files |
See architecture.md for the full module reference, data flow diagrams, regex inventory, and known issues.