RAG in Pure C

A minimal Retrieval-Augmented Generation (RAG) pipeline written in pure C, targeting Apple Silicon Macs. No C++, no Python at runtime, no heavy frameworks.

Overview

This project implements a complete RAG system that:

Chunks documents into overlapping text segments
Embeds chunks using GTE-Small (384-dimensional embeddings)
Stores and retrieves chunks through a storage backend (a flat file today; hash table, SQLite, and custom index backends are planned)
Generates responses by calling a local LLM server with retrieved context

The codebase is designed for learning: every component is implemented from first principles in readable, antirez-style C.

Quick Start

Prerequisites

macOS with Apple Silicon (M1/M2/M3, etc.)
Xcode Command Line Tools: xcode-select --install
libcurl (usually pre-installed on macOS)
llama.cpp for the LLM server: brew install llama.cpp

1. Download & Convert Models

# Download GTE-Small embedding model
pip install huggingface_hub safetensors
python -c "from huggingface_hub import snapshot_download; \
           snapshot_download('thenlper/gte-small', local_dir='./models/gte-small-hf')"

# Convert to .gtemodel format (the conversion script is included in the repo)
python convert_model.py ./models/gte-small-hf ./models/gte-small.gtemodel

# Download TinyLlama LLM (optional, for generation)
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
    tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir ./models

2. Start the LLM Server (in a separate terminal)

llama-server -m ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --port 8080

3. Build & Run

# Build the executable
make

# Embed a document and save to disk
./rag --model ./models/gte-small.gtemodel --file doc.txt --save store.bin

# Query the saved store (add --server http://localhost:8080 to also generate an answer)
./rag --model ./models/gte-small.gtemodel --load store.bin --query "What is this about?"

# Or: embed and query in-memory (no disk store)
./rag --model ./models/gte-small.gtemodel --file doc.txt --query "What is this about?"

Usage

v1 has three modes of operation:

Mode 1: Embed and save to disk

./rag --model <path.gtemodel> --file <document.txt> --save <store.bin>

Reads the document
Splits it into overlapping chunks (1000 chars, 200 char overlap)
Embeds each chunk using GTE-Small
Writes chunks and embeddings to <store.bin> in the v1 binary format

Mode 2: Load from disk and query

./rag --model <path.gtemodel> --load <store.bin> --query "your question" [--k <int>] [--server <url>]

Loads the binary store from disk
Embeds the query using GTE-Small
Scans all stored chunks, scores each by cosine similarity
Prints the top-k matches (default k=3)
If --server is given, sends the top-k context to the LLM and prints the generated answer
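
The last step above, handing the top-k context to the LLM, can be sketched as building a JSON request body. This is a minimal sketch and not the project's llm.c: it assumes the llama.cpp server's `/completion` endpoint, which accepts `prompt` and `n_predict` fields, and the prompt wording, the `buf` type, and `build_completion_body` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Growable string buffer (hypothetical; the project lists sds for this role). */
typedef struct { char *s; size_t len, cap; } buf;

static int buf_append(buf *b, const char *p) {
    size_t n = strlen(p);
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 256;
        while (cap < b->len + n + 1) cap *= 2;
        char *s = realloc(b->s, cap);
        if (!s) return -1;
        b->s = s;
        b->cap = cap;
    }
    memcpy(b->s + b->len, p, n + 1); /* copies the terminating NUL too */
    b->len += n;
    return 0;
}

/* Append text with JSON string escaping: quotes, backslashes, control bytes. */
static int buf_append_json(buf *b, const char *text) {
    for (const unsigned char *p = (const unsigned char *)text; *p; p++) {
        char esc[8];
        if (*p == '"' || *p == '\\') snprintf(esc, sizeof esc, "\\%c", *p);
        else if (*p == '\n') snprintf(esc, sizeof esc, "\\n");
        else if (*p < 0x20) snprintf(esc, sizeof esc, "\\u%04x", (unsigned)*p);
        else snprintf(esc, sizeof esc, "%c", *p);
        if (buf_append(b, esc) != 0) return -1;
    }
    return 0;
}

/* Build {"prompt":"...","n_predict":N} from k context chunks and a question.
 * Returns a malloc'd string the caller frees, or NULL on allocation failure. */
char *build_completion_body(const char **chunks, int k, const char *question,
                            int n_predict) {
    assert(chunks && question);
    buf b = {0};
    char tail[64];
    int ok = buf_append(&b, "{\"prompt\":\"") == 0 &&
             buf_append_json(&b, "Answer using only this context:\n") == 0;
    for (int i = 0; ok && i < k; i++)
        ok = buf_append_json(&b, chunks[i]) == 0 &&
             buf_append_json(&b, "\n---\n") == 0;
    ok = ok && buf_append_json(&b, "Question: ") == 0 &&
         buf_append_json(&b, question) == 0 &&
         buf_append_json(&b, "\nAnswer:") == 0;
    snprintf(tail, sizeof tail, "\",\"n_predict\":%d}", n_predict);
    ok = ok && buf_append(&b, tail) == 0;
    if (!ok) { free(b.s); return NULL; }
    return b.s;
}
```

The returned string could then be sent with libcurl by setting `CURLOPT_URL` to the server URL plus `/completion` and passing the body through `CURLOPT_POSTFIELDS`. Escaping matters here because retrieved chunks routinely contain quotes and newlines that would otherwise produce invalid JSON.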

Mode 3: In-memory (no disk store)

./rag --model <path.gtemodel> --file <document.txt> --query "your question" [--k <int>] [--server <url>]

Same as Mode 2, but the document is chunked and embedded in memory; nothing is written to disk
Useful for one-off queries or testing

Flags

Flag	Required	Description
`--model <path>`	always	Path to the `.gtemodel` embedding model
`--file <path>`	Mode 1, 3	Document to chunk and embed
`--save <path>`	Mode 1	Write binary store to this file
`--load <path>`	Mode 2	Load binary store from this file
`--query <text>`	Mode 2, 3	Query string to search for
`--k <int>`	optional	Number of top results to return (default: 3)
`--server <url>`	optional	llama.cpp server URL for LLM generation (e.g. `http://localhost:8080`)

--save and --load are mutually exclusive.
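
The flag rules above can be sketched with `getopt_long`. This is an illustrative sketch, not the project's main.c: `rag_opts`, `rag_mode`, and `parse_args` are hypothetical names, and rejecting combinations the table does not list (for example `--file` together with `--load`, or `--save` together with `--query`) is an assumption.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

typedef enum { MODE_INVALID, MODE_EMBED_SAVE, MODE_LOAD_QUERY, MODE_MEMORY_QUERY } rag_mode;

typedef struct {
    const char *model, *file, *save, *load, *query, *server;
    int k; /* number of results, default 3 */
} rag_opts;

/* Parse argv into *o and return the selected mode, or MODE_INVALID. */
rag_mode parse_args(int argc, char **argv, rag_opts *o) {
    static const struct option longopts[] = {
        {"model",  required_argument, NULL, 'm'},
        {"file",   required_argument, NULL, 'f'},
        {"save",   required_argument, NULL, 's'},
        {"load",   required_argument, NULL, 'l'},
        {"query",  required_argument, NULL, 'q'},
        {"k",      required_argument, NULL, 'k'},
        {"server", required_argument, NULL, 'u'},
        {0, 0, 0, 0}
    };
    assert(argv && o);
    memset(o, 0, sizeof(*o));
    o->k = 3;
    optind = 1; /* restart scanning so the function can be called again */

    int c;
    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        switch (c) {
        case 'm': o->model = optarg; break;
        case 'f': o->file = optarg; break;
        case 's': o->save = optarg; break;
        case 'l': o->load = optarg; break;
        case 'q': o->query = optarg; break;
        case 'u': o->server = optarg; break;
        case 'k': o->k = atoi(optarg); break;
        default:  return MODE_INVALID; /* unknown flag or missing argument */
        }
    }

    if (!o->model || o->k <= 0) return MODE_INVALID; /* --model is always required */
    if (o->save && o->load) return MODE_INVALID;     /* mutually exclusive */
    if (o->file && o->save && !o->query) return MODE_EMBED_SAVE;   /* Mode 1 */
    if (o->load && o->query && !o->file) return MODE_LOAD_QUERY;   /* Mode 2 */
    if (o->file && o->query && !o->save) return MODE_MEMORY_QUERY; /* Mode 3 */
    return MODE_INVALID;
}
```

Deciding the mode once, right after parsing, keeps the rest of the program free of flag checks: each mode can then be a separate function that trusts its inputs.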

Project Structure

rag_c/
├── README.md
├── Plan.md                    # Architecture and version roadmap
├── CLAUDE.md                  # Learning principles
│
├── Core Dependencies (3rd-party)
├── gte.c / gte.h              # GTE embedding model (antirez)
├── sds.c / sds.h              # Dynamic strings (antirez)
├── cJSON.h                    # JSON parsing (DaveGamble)
├── third_party/gte/           # GTE model source
│
├── Shared Source (you write)
├── main.c                     # CLI entry point
├── chunk.c / chunk.h          # Document chunking
├── embed.c / embed.h          # Embedding wrapper
├── llm.c / llm.h              # LLM HTTP calls
│
├── Storage Backends
├── store_v1.c / store_v1.h    # Flat binary file (ACTIVE)
├── store_v2.c / store_v2.h    # Hash table (antirez dict)
├── store_v3.c / store_v3.h    # SQLite + vector index
├── store_v4.c / store_v4.h    # Custom vector index (planned)
│
└── models/
    ├── gte-small.gtemodel     # Embedding model
    └── tinyllama-*.gguf       # LLM model

Versions

The project is planned around four progressively more advanced storage/retrieval backends:

Version	Storage	Retrieval	When to Use
v1	Flat binary file	Linear scan + cosine similarity	Learning, small datasets (<5k chunks)
v2	Hash table (dict)	Linear scan with fast ID lookup	Want Redis-style data structures
v3	SQLite + sqlite-vec	HNSW approximate KNN	Production, larger datasets (10k+ chunks)
v4	Custom C index	Pure C HNSW or IVF	Full control, zero dependencies

Currently v1 is implemented. See Plan.md for roadmap.

How It Works

Embeddings

The project uses GTE-Small, a 127MB embedding model that produces 384-dimensional vectors. Since GTE embeddings are L2-normalized, the dot product between two embeddings equals their cosine similarity, so no extra normalization step is needed at query time.

similarity = dot_product(query_embedding, chunk_embedding, 384);
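
The reason the plain dot product is enough follows from the definition of cosine similarity. Writing q for the query embedding and c for a chunk embedding (symbols introduced here for illustration), both of unit length:

```latex
\begin{aligned}
\cos(q, c) &= \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}
            = \frac{\sum_{i=1}^{384} q_i c_i}{1 \cdot 1}
            = \sum_{i=1}^{384} q_i c_i ,
\qquad \text{since } \lVert q \rVert = \lVert c \rVert = 1 .
\end{aligned}
```

For example, the unit vectors (0.6, 0.8) and (0.8, 0.6) have dot product 0.48 + 0.48 = 0.96, which is also their cosine similarity.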

Chunking

Documents are split into overlapping windows:

Chunk size: 1000 characters
Overlap: 200 characters
Simple byte-based splitting (no tokenization): sizes are counted in bytes, so a multi-byte UTF-8 character can be split at a chunk boundary
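
The windowing above can be sketched as follows. This is a minimal sketch rather than the project's chunk.c: `chunk_span` and `chunk_text` are hypothetical names, and stopping as soon as a window reaches the end of the text (instead of emitting a final fragment that lies entirely inside the previous overlap) is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* A chunk is a view into the original text; nothing is copied. */
typedef struct { const char *start; size_t len; } chunk_span;

/* Split text[0..len) into windows of `size` bytes that overlap by `overlap`
 * bytes. Returns a malloc'd array (caller frees) and stores its length in
 * *count. Returns NULL only if allocation fails. */
chunk_span *chunk_text(const char *text, size_t len, size_t size,
                       size_t overlap, size_t *count) {
    assert(text && count);
    assert(size > 0 && overlap < size); /* otherwise the window never advances */

    size_t step = size - overlap;       /* distance between window starts */
    size_t max = len / step + 1;        /* upper bound on the number of windows */
    chunk_span *out = malloc(max * sizeof(*out));
    *count = 0;
    if (!out) return NULL;

    for (size_t pos = 0; pos < len; pos += step) {
        size_t n = len - pos < size ? len - pos : size;
        out[(*count)++] = (chunk_span){ text + pos, n };
        if (pos + n == len) break;      /* this window already covers the tail */
    }
    return out;
}
```

With the documented settings (1000 bytes, 200 overlap), a 2500-byte document yields three chunks starting at offsets 0, 800, and 1600, the last one 900 bytes long.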

Storage (v1)

A custom binary format stores each chunk's text next to its embedding, after a two-field header:

[uint32 num_chunks]
[uint32 dim]
[chunk_0: uint32 text_len | uint8 text[text_len] | float embedding[dim]]
[chunk_1: ...]
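
A writer and reader for this layout might look like the sketch below. It is not the project's store_v1.c: the `store` and `store_chunk` types and the function names are hypothetical, and writing integers and floats in the host's native byte order (little-endian on Apple Silicon) is an assumption, since the format description does not state one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint32_t text_len; char *text; float *embedding; } store_chunk;
typedef struct { uint32_t num_chunks, dim; store_chunk *chunks; } store;

/* Free everything store_load allocated. */
void store_free(store *s) {
    for (uint32_t i = 0; i < s->num_chunks; i++) {
        free(s->chunks[i].text);
        free(s->chunks[i].embedding);
    }
    free(s->chunks);
    s->chunks = NULL;
    s->num_chunks = 0;
}

/* Write header, then text_len | text | embedding for each chunk. */
int store_save(const store *s, const char *path) {
    assert(s && path);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&s->num_chunks, sizeof(uint32_t), 1, f) == 1 &&
             fwrite(&s->dim, sizeof(uint32_t), 1, f) == 1;
    for (uint32_t i = 0; ok && i < s->num_chunks; i++) {
        const store_chunk *c = &s->chunks[i];
        ok = fwrite(&c->text_len, sizeof(uint32_t), 1, f) == 1 &&
             fwrite(c->text, 1, c->text_len, f) == c->text_len &&
             fwrite(c->embedding, sizeof(float), s->dim, f) == s->dim;
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Read a store written by store_save; returns 0 on success, -1 on any error. */
int store_load(store *s, const char *path) {
    assert(s && path);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    uint32_t n = 0;
    s->chunks = NULL;
    s->num_chunks = 0;
    int ok = fread(&n, sizeof(uint32_t), 1, f) == 1 &&
             fread(&s->dim, sizeof(uint32_t), 1, f) == 1;
    if (ok) {
        s->chunks = calloc(n ? n : 1, sizeof(store_chunk));
        ok = s->chunks != NULL;
    }
    for (uint32_t i = 0; ok && i < n; i++) {
        store_chunk *c = &s->chunks[i];
        if (fread(&c->text_len, sizeof(uint32_t), 1, f) != 1) { ok = 0; break; }
        c->text = malloc((size_t)c->text_len + 1);      /* +1 for a NUL terminator */
        c->embedding = malloc((size_t)s->dim * sizeof(float));
        s->num_chunks = i + 1;  /* lets store_free release a partial chunk */
        ok = c->text && c->embedding &&
             fread(c->text, 1, c->text_len, f) == c->text_len &&
             fread(c->embedding, sizeof(float), s->dim, f) == s->dim;
        if (ok) c->text[c->text_len] = '\0';
    }
    fclose(f);
    if (!ok) { store_free(s); return -1; }
    return 0;
}
```

Because every record carries its own `text_len`, the reader never has to scan for delimiters, and a truncated file shows up as a short `fread` rather than as silently corrupted text.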

Retrieval

Brute-force linear scan over all chunks, scoring each with cosine similarity:

for (each chunk) {
    score = dot_product(query_embedding, chunk_embedding, 384)
    if (score in top-k) insert into results
}
return top-k sorted by score
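
The pseudocode above can be made concrete with a small sorted array of size k. This is a sketch under assumptions, not the project's code: the `hit` type and `top_k` are hypothetical names, and the embeddings are assumed to sit in one contiguous row-major array of n * dim floats.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t index; float score; } hit;

/* Dot product; equals cosine similarity for L2-normalized inputs. */
float dot_product(const float *a, const float *b, size_t dim) {
    float sum = 0.0f;
    for (size_t i = 0; i < dim; i++) sum += a[i] * b[i];
    return sum;
}

/* Score all n embeddings against the query and keep the k best in `out`,
 * highest score first. `out` must have room for k hits. Returns the number
 * of hits written, which is min(n, k). */
size_t top_k(const float *query, const float *embeddings, size_t n,
             size_t dim, size_t k, hit *out) {
    assert(query && embeddings && out);
    size_t found = 0;
    if (k == 0) return 0;
    for (size_t i = 0; i < n; i++) {
        float score = dot_product(query, embeddings + i * dim, dim);
        /* Results are full and this score does not beat the current worst. */
        if (found == k && score <= out[k - 1].score) continue;
        /* Take a new slot while filling up, otherwise overwrite the worst. */
        size_t j = found < k ? found++ : k - 1;
        /* Insertion step: shift lower-scoring hits down by one. */
        while (j > 0 && out[j - 1].score < score) {
            out[j] = out[j - 1];
            j--;
        }
        out[j] = (hit){ i, score };
    }
    return found;
}
```

Keeping the results sorted costs O(k) per insertion, which is negligible for the default k=3. A heap would only pay off for much larger k.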

Compilation

The Makefile handles everything:

make        # build
make clean  # remove objects and binary

It compiles: main.c chunk.c embed.c store_v1.c llm.c third_party/gte/gte.c with flags: -O3 -march=native -ffast-math -Wall -I third_party/gte

With Apple Accelerate (faster matrix ops)

To enable BLAS-backed dot products, add -DUSE_BLAS -framework Accelerate to CFLAGS in the Makefile.
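
As an illustration of what such a switch usually looks like, the sketch below selects between Accelerate's `cblas_sdot` and a portable loop at compile time. How the project's own sources use `USE_BLAS` is not shown in this README, so treat the function below as a hypothetical example of the pattern rather than the actual implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#ifdef USE_BLAS
#include <Accelerate/Accelerate.h> /* provides cblas_sdot on macOS */
#endif

/* Dot product of two float vectors, BLAS-backed when USE_BLAS is defined. */
float dot_product(const float *a, const float *b, size_t dim) {
    assert(a && b && dim <= INT_MAX); /* cblas takes an int length */
#ifdef USE_BLAS
    /* Strides of 1: both vectors are contiguous. */
    return cblas_sdot((int)dim, a, 1, b, 1);
#else
    float sum = 0.0f;
    for (size_t i = 0; i < dim; i++) sum += a[i] * b[i];
    return sum;
#endif
}
```

Built without the define, the same file compiles on any platform, which keeps the non-Accelerate path easy to test. Results from the two paths can differ in the last few bits because BLAS may sum the terms in a different order.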

Code Style

The codebase follows antirez principles:

Readable, explicit C with minimal macros
No over-engineering or premature abstraction
Short, focused functions with clear error handling
Comments explaining why, not what

Learning Resources

CLAUDE.md: Philosophy and rules for this learning project
Plan.md: Detailed architecture and future versions
Decisions.md: Design decisions and tradeoffs
dev_log.md: Development progress and notes

Contributing

This is a learning project. Each component is designed to teach one concept deeply:

How embeddings work
Binary file formats
Cosine similarity
C memory management
HTTP communication

See CLAUDE.md for the project's learning principles.

License

This project includes code from:

antirez: gte.c, sds.c (original Redis/single-purpose implementations)
DaveGamble: cJSON
asg017: sqlite-vec (for v3)

See individual files for license details.

Status

✅ v1: Flat binary file storage (complete)
🔄 v2-v4: In progress (see Plan.md)

Questions?

Run ./rag with no arguments for usage help.
