A minimal Retrieval-Augmented Generation (RAG) pipeline written in pure C, targeting Apple M1. No C++, no Python at runtime, no heavy frameworks.
This project implements a complete RAG system that:
- Chunks documents into overlapping text segments
- Embeds chunks using GTE-Small (384-dimensional embeddings)
- Stores and retrieves chunks through a swappable storage backend (flat binary file in v1; hash table, SQLite, and a custom index are on the roadmap)
- Generates responses by calling a local LLM server with retrieved context
The codebase is designed for learning—every component is implemented from first principles with readable, antirez-style C code.
- macOS with Apple Silicon (M1/M2/M3, etc.)
- Xcode Command Line Tools:
xcode-select --install
- libcurl (usually pre-installed on macOS)
- llama.cpp for the LLM server:
brew install llama.cpp
# Download GTE-Small embedding model
pip install huggingface_hub safetensors
python -c "from huggingface_hub import snapshot_download; \
snapshot_download('thenlper/gte-small', local_dir='./models/gte-small-hf')"
# Convert to .gtemodel format (convert_model.py is included in the repo)
python convert_model.py ./models/gte-small-hf ./models/gte-small.gtemodel
# Download TinyLlama LLM (optional, for generation)
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir ./models

# Start the LLM server (optional, for generation)
llama-server -m ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --port 8080

# Build the executable
make
# Embed a document and save to disk
./rag --model ./models/gte-small.gtemodel --file doc.txt --save store.bin
# Query the saved store
./rag --model ./models/gte-small.gtemodel --load store.bin --query "What is this about?"
# Or: embed and query in-memory (no disk store)
./rag --model ./models/gte-small.gtemodel --file doc.txt --query "What is this about?"

v1 has three modes of operation:
Mode 1: embed a document and save the store to disk

./rag --model <path.gtemodel> --file <document.txt> --save <store.bin>

- Reads the document
- Splits it into overlapping chunks (1000 chars, 200 char overlap)
- Embeds each chunk using GTE-Small
- Writes chunks and embeddings to <store.bin> in the v1 binary format

Mode 2: query a saved store

./rag --model <path.gtemodel> --load <store.bin> --query "your question" [--k <int>] [--server <url>]

- Loads the binary store from disk
- Embeds the query using GTE-Small
- Scans all stored chunks, scores each by cosine similarity
- Prints the top-k matches (default k=3)
- If --server is given, sends the top-k context to the LLM and prints the generated answer

Mode 3: embed and query in memory

./rag --model <path.gtemodel> --file <document.txt> --query "your question" [--k <int>] [--server <url>]

- Same as Mode 2, but chunks and embeds the document in memory; nothing is written to disk
- Useful for one-off queries or testing
| Flag | Required | Description |
|---|---|---|
| --model <path> | always | Path to the .gtemodel embedding model |
| --file <path> | Mode 1, 3 | Document to chunk and embed |
| --save <path> | Mode 1 | Write binary store to this file |
| --load <path> | Mode 2 | Load binary store from this file |
| --query <text> | Mode 2, 3 | Query string to search for |
| --k <int> | optional | Number of top results to return (default: 3) |
| --server <url> | optional | llama.cpp server URL for LLM generation (e.g. http://localhost:8080) |
--save and --load are mutually exclusive.
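
The flag rules above can be turned into a small validation routine. The sketch below is only an illustration of the table, not the code in main.c: the names `rag_mode` and `pick_mode` are hypothetical, and rejecting combinations the table does not list (for example --file with both --save and --query) is an assumption.

```c
#include <stdbool.h>

/* Hypothetical mode identifiers, numbered to match the table above. */
typedef enum {
    MODE_INVALID = 0,
    MODE_EMBED_AND_SAVE = 1,   /* --file + --save  */
    MODE_LOAD_AND_QUERY = 2,   /* --load + --query */
    MODE_EMBED_AND_QUERY = 3   /* --file + --query */
} rag_mode;

/* Decide the mode from which flags were present on the command line.
 * Assumption: any combination not listed in the table is rejected. */
rag_mode pick_mode(bool has_model, bool has_file, bool has_save,
                   bool has_load, bool has_query) {
    if (!has_model) return MODE_INVALID;            /* --model is always required */
    if (has_save && has_load) return MODE_INVALID;  /* mutually exclusive */
    if (has_file && has_save && !has_query) return MODE_EMBED_AND_SAVE;
    if (has_load && has_query && !has_file) return MODE_LOAD_AND_QUERY;
    if (has_file && has_query && !has_save && !has_load)
        return MODE_EMBED_AND_QUERY;
    return MODE_INVALID;
}
```

Keeping this decision in one function means the usage message only needs to be printed in one place when `MODE_INVALID` comes back.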
rag_c/
├── README.md
├── Plan.md # Architecture and version roadmap
├── CLAUDE.md # Learning principles
│
├── Core Dependencies (3rd-party)
├── gte.c / gte.h # GTE embedding model (antirez)
├── sds.c / sds.h # Dynamic strings (antirez)
├── cJSON.h # JSON parsing (DaveGamble)
├── third_party/gte/ # GTE model source
│
├── Shared Source (you write)
├── main.c # CLI entry point
├── chunk.c / chunk.h # Document chunking
├── embed.c / embed.h # Embedding wrapper
├── llm.c / llm.h # LLM HTTP calls
│
├── Storage Backends
├── store_v1.c / store_v1.h # Flat binary file (ACTIVE)
├── store_v2.c / store_v2.h # Hash table (antirez dict)
├── store_v3.c / store_v3.h # SQLite + vector index
├── store_v4.c / store_v4.h # Custom vector index (planned)
│
└── models/
├── gte-small.gtemodel # Embedding model
└── tinyllama-*.gguf # LLM model
The project is planned around four progressively more advanced storage/retrieval backends:
| Version | Storage | Retrieval | When to Use |
|---|---|---|---|
| v1 | Flat binary file | Linear scan + cosine similarity | Learning, small datasets (<5k chunks) |
| v2 | Hash table (dict) | Linear scan with fast ID lookup | Want Redis-style data structures |
| v3 | SQLite + sqlite-vec | HNSW approximate KNN | Production, larger datasets (10k+ chunks) |
| v4 | Custom C index | Pure C HNSW or IVF | Full control, zero dependencies |
Currently only v1 is implemented. See Plan.md for the roadmap.
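
As a rough illustration of the v1 row in the table (linear scan, keep the best k), here is a minimal sketch. It assumes the embeddings sit in one contiguous row-major float array and that they are already L2-normalized, so the dot product serves as the score. The names `hit` and `top_k_scan` are hypothetical and not taken from store_v1.c.

```c
#include <stddef.h>

/* One retrieval result (hypothetical type). */
typedef struct {
    size_t index;  /* position of the chunk in the store */
    float score;   /* dot product with the query */
} hit;

/* Scan n embeddings of width dim and keep the k best in `results`,
 * sorted by descending score. Returns the number of hits written,
 * which is the smaller of n and k. */
size_t top_k_scan(const float *query, const float *embeddings, size_t n,
                  size_t dim, hit *results, size_t k) {
    size_t count = 0;
    if (k == 0) return 0;
    for (size_t i = 0; i < n; i++) {
        const float *row = embeddings + i * dim;
        float score = 0.0f;
        for (size_t j = 0; j < dim; j++) score += query[j] * row[j];

        /* Results are full and this score does not beat the worst kept. */
        if (count == k && score <= results[k - 1].score) continue;

        /* Insertion step: shift lower scores down to make room. */
        size_t pos = (count < k) ? count : k - 1;
        while (pos > 0 && results[pos - 1].score < score) {
            results[pos] = results[pos - 1];
            pos--;
        }
        results[pos].index = i;
        results[pos].score = score;
        if (count < k) count++;
    }
    return count;
}
```

For small k such as the default of 3, an insertion-sorted array like this is simpler than a heap and costs at most k moves per chunk.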
The project uses GTE-Small, a 127MB embedding model that produces 384-dimensional vectors. Since GTE embeddings are L2-normalized, the dot product between two embeddings equals their cosine similarity—no special normalization needed.
similarity = dot_product(query_embedding, chunk_embedding, 384);

Documents are split into overlapping windows:
- Chunk size: 1000 characters
- Overlap: 200 characters
- Simple byte-based splitting (no fancy tokenization)
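
The windowing described above can be sketched as a function that computes byte ranges without copying any text. This is an illustration under assumptions, not the contents of chunk.c: `chunk_span` and `chunk_spans` are hypothetical names, and the size and overlap are passed as parameters rather than hard-coded.

```c
#include <stddef.h>

/* A chunk is described as a view into the original buffer (hypothetical type). */
typedef struct {
    size_t offset;  /* byte offset where the chunk starts */
    size_t len;     /* chunk length in bytes */
} chunk_span;

/* Fill `out` with up to max_out spans and return how many chunks the text
 * needs (which may exceed max_out). Requires overlap < chunk_size. */
size_t chunk_spans(size_t text_len, size_t chunk_size, size_t overlap,
                   chunk_span *out, size_t max_out) {
    if (chunk_size == 0 || overlap >= chunk_size) return 0;
    size_t stride = chunk_size - overlap;  /* 800 for sizes 1000 and 200 */
    size_t count = 0;
    for (size_t start = 0; start < text_len; start += stride) {
        size_t len = text_len - start;
        if (len > chunk_size) len = chunk_size;
        if (count < max_out) {
            out[count].offset = start;
            out[count].len = len;
        }
        count++;
        if (start + len >= text_len) break;  /* this chunk reached the end */
    }
    return count;
}
```

With a chunk size of 1000 and an overlap of 200, a 2,500-byte document yields three chunks starting at offsets 0, 800, and 1600. Because the split is byte-based, a boundary can fall inside a multi-byte UTF-8 character.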
A custom binary format stores chunks and embeddings efficiently:
[uint32 num_chunks]
[uint32 dim]
[chunk_0: uint32 text_len | uint8 text[text_len] | float embedding[dim]]
[chunk_1: ...]
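
A writer and reader for this layout might look like the sketch below. It assumes values are written in host byte order with 4-byte IEEE 754 floats, which the README does not specify. `store_chunk`, `store_write`, `store_read`, and `store_free` are hypothetical names rather than the API of store_v1.c.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint32_t text_len;
    char *text;        /* text_len bytes; the reader appends a NUL */
    float *embedding;  /* dim floats */
} store_chunk;

/* Write the header, then one record per chunk. Returns 0 on success. */
int store_write(const char *path, const store_chunk *chunks,
                uint32_t num_chunks, uint32_t dim) {
    FILE *fp = fopen(path, "wb");
    if (!fp) return -1;
    int ok = fwrite(&num_chunks, sizeof num_chunks, 1, fp) == 1 &&
             fwrite(&dim, sizeof dim, 1, fp) == 1;
    for (uint32_t i = 0; ok && i < num_chunks; i++) {
        const store_chunk *c = &chunks[i];
        ok = fwrite(&c->text_len, sizeof c->text_len, 1, fp) == 1 &&
             fwrite(c->text, 1, c->text_len, fp) == c->text_len &&
             fwrite(c->embedding, sizeof(float), dim, fp) == dim;
    }
    if (fclose(fp) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Release everything store_read allocated. */
void store_free(store_chunk *chunks, uint32_t num_chunks) {
    if (!chunks) return;
    for (uint32_t i = 0; i < num_chunks; i++) {
        free(chunks[i].text);
        free(chunks[i].embedding);
    }
    free(chunks);
}

/* Read a store written by store_write. Returns NULL on any error. */
store_chunk *store_read(const char *path, uint32_t *num_chunks, uint32_t *dim) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;
    store_chunk *chunks = NULL;
    uint32_t n = 0, d = 0;
    if (fread(&n, sizeof n, 1, fp) != 1 || fread(&d, sizeof d, 1, fp) != 1)
        goto fail;
    if (d == 0) goto fail;  /* an embedding must have at least one value */
    chunks = calloc(n ? n : 1, sizeof *chunks);
    if (!chunks) goto fail;
    for (uint32_t i = 0; i < n; i++) {
        store_chunk *c = &chunks[i];
        if (fread(&c->text_len, sizeof c->text_len, 1, fp) != 1) goto fail;
        c->text = malloc((size_t)c->text_len + 1);
        c->embedding = malloc((size_t)d * sizeof(float));
        if (!c->text || !c->embedding) goto fail;
        if (fread(c->text, 1, c->text_len, fp) != c->text_len) goto fail;
        if (fread(c->embedding, sizeof(float), d, fp) != d) goto fail;
        c->text[c->text_len] = '\0';
    }
    fclose(fp);
    *num_chunks = n;
    *dim = d;
    return chunks;
fail:
    fclose(fp);
    store_free(chunks, n);
    return NULL;
}
```

Length-prefixed text lets the reader allocate exactly once per field and stream through the file without seeking. A store written this way is only portable between machines that share byte order and float representation.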
Brute-force linear scan over all chunks, scoring each with cosine similarity:
for (each chunk) {
score = dot_product(query_embedding, chunk_embedding, 384)
if (score in top-k) insert into results
}
return top-k sorted by score

The Makefile handles everything:
make # build
make clean # remove objects and binary

It compiles: main.c chunk.c embed.c store_v1.c llm.c third_party/gte/gte.c
with flags: -O3 -march=native -ffast-math -Wall -I third_party/gte
To enable BLAS-backed dot products, add -DUSE_BLAS -framework Accelerate to CFLAGS in the Makefile.
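
One way such a compile-time switch could be wired up is shown below. How the project's own sources react to USE_BLAS is not stated here, so treat this as a sketch under that assumption. `cblas_sdot` is the standard single-precision BLAS dot product, which Accelerate provides on macOS.

```c
#include <stddef.h>

#ifdef USE_BLAS
#include <Accelerate/Accelerate.h>  /* macOS only; provides cblas_sdot */
#endif

/* Dot product with an optional BLAS fast path, chosen at compile time. */
float dot_product(const float *a, const float *b, size_t n) {
#ifdef USE_BLAS
    /* Both vectors are contiguous, so the stride is 1 for each. */
    return cblas_sdot((int)n, a, 1, b, 1);
#else
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
#endif
}
```

Without -DUSE_BLAS the portable loop is used, so the same file still builds on systems that lack Accelerate.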
The codebase follows antirez principles:
- Readable, explicit C with minimal macros
- No over-engineering or premature abstraction
- Short, focused functions with clear error handling
- Comments explaining why, not what
- CLAUDE.md: Philosophy and rules for this learning project
- Plan.md: Detailed architecture and future versions
- Decisions.md: Design decisions and tradeoffs
- dev_log.md: Development progress and notes
This is a learning project. Each component is designed to teach one concept deeply:
- How embeddings work
- Binary file formats
- Cosine similarity
- C memory management
- HTTP communication
See CLAUDE.md for the project's learning principles.
This project includes code from:
- antirez: gte.c, sds.c (original Redis/single-purpose implementations)
- DaveGamble: cJSON
- asg017: sqlite-vec (for v3)
See individual files for license details.
- ✅ v1: Flat binary file storage (complete)
- 🔄 v2-v4: In progress (see Plan.md)
Run ./rag with no arguments for usage help.