
Code RAG System

Ask questions about your codebase in plain English. Get answers with exact file and line references.

Two ways to use:

  • 🖥️ Web UI: dark-themed dashboard in your browser
  • ⌨️ CLI: command-line scripts for automation

What This Does

┌───────────────────────────────────────────────────────┐
│  1. Point it at your code folder                      │
│  2. It reads and indexes all files                    │
│  3. Ask questions like "How does login work?"         │
│  4. Get answers with exact file paths + line numbers  │
└───────────────────────────────────────────────────────┘

Example Output:

┌──────────────────────────────────────────────────────────────┐
│ ANSWER                                                       │
├──────────────────────────────────────────────────────────────┤
│ User authentication is handled in src/auth.py.               │
│ The authenticate_user() function (lines 24-56) validates     │
│ credentials using bcrypt...                                  │
├──────────────────────────────────────────────────────────────┤
│ SOURCES                                                      │
├────────────────────────┬─────────────┬───────────────────────┤
│ File                   │ Lines       │ Symbol                │
├────────────────────────┼─────────────┼───────────────────────┤
│ src/auth.py            │ 24-56       │ authenticate_user     │
│ src/auth.py            │ 58-72       │ create_token          │
│ src/models/user.py     │ 1-35        │ User                  │
│ src/middleware.py      │ 10-28       │ require_auth          │
└────────────────────────┴─────────────┴───────────────────────┘

Prerequisites

Before starting, make sure you have:

| Requirement | How to Check | How to Install |
|---|---|---|
| Docker | docker --version | Install Docker |
| Docker Compose | docker-compose --version | Included with Docker Desktop |
| Python 3.10+ | python3 --version | Install Python |
| API Key | You have an OpenAI or Anthropic account | OpenAI or Anthropic |

Setup (One-Time)

Step 1: Download the Project

git clone https://github.com/Drepheus/RAG-codebase.git
cd RAG-codebase

Step 2: Set Your API Key

You have two options. Choose one:

Option A: Using a .env file (Recommended)

cp .env.example .env

Open .env in a text editor (Notepad, VS Code, nano, etc.) and set your key:

For OpenAI:

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

For Anthropic (Claude):

LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

Option B: Using Environment Variables

Linux / macOS (Terminal):

export LLM_PROVIDER="openai"
export OPENAI_API_KEY="sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Windows (PowerShell):

$env:LLM_PROVIDER = "openai"
$env:OPENAI_API_KEY = "sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Windows (Command Prompt):

set LLM_PROVIDER=openai
set OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

Note: Environment variables are temporary (lost when you close the terminal). The .env file is permanent.
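Whichever option you choose, the application just needs the variables to be visible at runtime. If you want to sanity-check your setup before running anything, here is a minimal sketch of how a .env file can be read with only the standard library (the project itself may use a library such as python-dotenv; this simplified parser is only for illustration, and it never overwrites variables already set in your shell):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env parser: KEY=VALUE lines, '#' comments ignored.

    Simplified sketch for illustration; values already present in the
    environment are kept (setdefault), matching the usual .env convention.
    """
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()
provider = os.getenv("LLM_PROVIDER", "openai")
key_var = "OPENAI_API_KEY" if provider == "openai" else "ANTHROPIC_API_KEY"
if not os.getenv(key_var):
    print(f"Warning: {key_var} is not set")
```

Running this in the project folder tells you immediately whether the key the queries will need is actually loaded.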


Step 3: Install Python Packages

pip install -r requirements.txt

Step 4: Start the Database

docker-compose up -d

⏳ Wait 60 seconds for the AI model to download and load (first time only).

Check if it's ready:

curl http://localhost:8080/v1/.well-known/ready

Expected response:

{"status":"READY"}
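If you script the startup (for CI or a wrapper), the same readiness check can be polled instead of waiting a fixed 60 seconds. A standard-library sketch, assuming only the ready endpoint shown above (the timeout and interval values are arbitrary defaults):

```python
import time
import urllib.request
import urllib.error

READY_URL = "http://localhost:8080/v1/.well-known/ready"

def wait_for_weaviate(url=READY_URL, timeout=90.0, interval=2.0):
    """Poll the readiness endpoint until it returns HTTP 2xx or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if 200 <= resp.status < 300:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container not up yet; keep polling
        time.sleep(interval)
    return False
```

Call wait_for_weaviate() before ingesting; it returns True as soon as the database answers, so first-time model downloads just take a few more polls instead of failing.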

Step 5: Make Scripts Executable

Linux / macOS:

chmod +x scripts/ingest.sh scripts/query.sh scripts/ui.sh

Windows: No action needed. Use python -m src.ingest, python -m src.query, and python -m uvicorn src.app:app instead.


Usage

Choose your interface:

| Mode | Command | Best For |
|---|---|---|
| Web UI | ./scripts/ui.sh | Interactive exploration, visual feedback |
| CLI | ./scripts/query.sh "question" | Automation, scripting, quick queries |

Option 1: Web UI (Recommended for Beginners)

Start the web dashboard:

./scripts/ui.sh

Windows (PowerShell):

python -m uvicorn src.app:app --host 0.0.0.0 --port 8000

Then open http://localhost:8000 in your browser.

Features:

  • 🌙 Dark theme interface
  • 📊 Live status (connection, indexed repos, chunk count)
  • 🔍 Query with example prompts
  • 📁 Ingest codebases via form
  • 📋 Results with clickable sources

Option 2: Command Line

Ingest a Codebase

./scripts/ingest.sh /full/path/to/your/code

Copy-paste examples:

# Index a project in your home folder
./scripts/ingest.sh /home/username/projects/my-app

# Index with a custom name
./scripts/ingest.sh /home/username/projects/my-app my-app

# Index the current folder
./scripts/ingest.sh $(pwd)

Windows (PowerShell):

python -m src.ingest C:\Users\YourName\projects\my-app

Sample output:

Ingesting repository: my-app
Path: /home/user/projects/my-app

Connecting to Weaviate...
✓ CodeChunk collection already exists
✓ Deleted 0 existing chunks for repo: my-app
  ✓ src/main.py (3 chunks)
  ✓ src/auth.py (2 chunks)
  ✓ src/api/routes.py (5 chunks)

✓ Ingestion complete!
  Files processed: 3
  Chunks created: 10

Ask Questions

./scripts/query.sh "your question here"

Copy-paste examples:

./scripts/query.sh "How does user authentication work?"
./scripts/query.sh "What does the User class do?"
./scripts/query.sh "Explain the database connection logic"
./scripts/query.sh "How are API routes organized?"
./scripts/query.sh "Where is error handling implemented?"

Windows (PowerShell):

python -m src.query "How does user authentication work?"
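Because the CLI has no side effects, it lends itself to batching. A hypothetical wrapper that runs several questions in sequence via subprocess (the script path matches the one above; the function names and dry_run flag are illustrative, not part of the project):

```python
import subprocess

QUESTIONS = [
    "How does user authentication work?",
    "Where is error handling implemented?",
]

def build_command(question, script="./scripts/query.sh"):
    """Build the argv list for one CLI query."""
    return [script, question]

def run_all(questions, dry_run=False):
    """Run each question through the CLI; dry_run only prints the commands."""
    for q in questions:
        cmd = build_command(q)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_all(QUESTIONS, dry_run=True)
```

On Windows, swap the script path for ["python", "-m", "src.query", question] in build_command.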

Sample output:

Query: How does user authentication work?

Searching for relevant code...
Found 8 relevant chunks

Calling OPENAI...
============================================================
ANSWER
============================================================
User authentication is implemented in `src/auth.py`. The main function
`authenticate_user()` (lines 24-56) accepts a username and password,
hashes the password using bcrypt, and compares it against the stored
hash in the database...

============================================================
SOURCES
============================================================
  β€’ src/auth.py (lines 24-56) [authenticate_user]
  β€’ src/auth.py (lines 58-72) [create_token]
  β€’ src/models/user.py (lines 1-35) [User]
  β€’ src/middleware.py (lines 10-28) [require_auth]

Quick Reference

| Task | Command |
|---|---|
| Start database | docker-compose up -d |
| Stop database | docker-compose down |
| View database logs | docker-compose logs -f |
| Start Web UI | ./scripts/ui.sh → open http://localhost:8000 |
| Ingest code (CLI) | ./scripts/ingest.sh /path/to/code |
| Ask a question (CLI) | ./scripts/query.sh "your question" |

✅ Safe to Re-Run (Idempotent)

Both scripts are safe to run multiple times:

| Script | What Happens on Re-Run |
|---|---|
| ingest.sh | Deletes old chunks for that repo, then re-indexes. Other repos are untouched. |
| query.sh | Just asks a new question. No side effects. |

You cannot break anything by running these commands multiple times.


Switching Between OpenAI and Anthropic

You can switch LLM providers at any time by editing your .env file:

To use OpenAI:

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

To use Anthropic (Claude):

LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxxxxxxxxx

No need to restart anything. The change takes effect on the next query.

Available Models (set in .env):

| Provider | Default Model | Alternatives |
|---|---|---|
| OpenAI | gpt-4o-mini | gpt-4o, gpt-4-turbo, gpt-3.5-turbo |
| Anthropic | claude-3-haiku-20240307 | claude-3-5-sonnet-20241022, claude-3-opus-20240229 |

To change the model, add to .env:

OPENAI_MODEL=gpt-4o

or

ANTHROPIC_MODEL=claude-3-5-sonnet-20241022
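The selection logic described above amounts to a small lookup: read LLM_PROVIDER, then use that provider's default model unless its override variable is set. A sketch with the defaults copied from the table (the function name is illustrative, not the project's actual API):

```python
import os

# Defaults copied from the table above.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-3-haiku-20240307",
}

def resolve_model(env=os.environ):
    """Return (provider, model): the provider default unless
    OPENAI_MODEL / ANTHROPIC_MODEL overrides it."""
    provider = env.get("LLM_PROVIDER", "openai")
    override_var = f"{provider.upper()}_MODEL"
    return provider, env.get(override_var, DEFAULT_MODELS[provider])
```

Because the values are read per query, editing .env changes the result of resolve_model() without restarting anything, matching the behavior noted above.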

Troubleshooting

❌ "Connection refused" or "Weaviate is not running"

Problem: The database isn't running.

Solution:

docker-compose up -d

Wait 60 seconds, then try again.


❌ "OPENAI_API_KEY not set" or "ANTHROPIC_API_KEY not set"

Problem: Your API key is missing or not loaded.

Solution:

  1. Check that .env file exists:

    ls -la .env
  2. Check that it contains your key:

    cat .env
  3. Verify the key format:

    • OpenAI keys start with: sk-proj- or sk-
    • Anthropic keys start with: sk-ant-
  4. If using environment variables instead of .env, make sure you exported them in the same terminal session:

    echo $OPENAI_API_KEY

❌ "No relevant code found"

Problem: The codebase hasn't been indexed yet.

Solution:

./scripts/ingest.sh /path/to/your/code

❌ Database won't start / out of memory

Problem: The embedding model needs ~2GB RAM.

Solution:

Check available memory:

free -h

If low on memory, close other applications or increase VM memory.


❌ Scripts don't run on Windows

Problem: .sh scripts are for Linux/macOS.

Solution: Use Python directly:

python -m src.ingest C:\path\to\code
python -m src.query "your question"

🔄 Re-index a codebase

Just run ingest again. It's safe and replaces old data automatically:

./scripts/ingest.sh /path/to/code

πŸ” View database status

# Check if containers are running
docker ps

# Check Weaviate logs
docker-compose logs weaviate

# Check embedding model logs
docker-compose logs t2v-transformers

πŸ—‘οΈ Reset everything (delete all data)

docker-compose down -v
docker-compose up -d

Wait 60 seconds for restart.


Multiple Repositories

You can index multiple codebases. Each one is stored separately:

# Index first project
./scripts/ingest.sh /path/to/project-a project-a

# Index second project
./scripts/ingest.sh /path/to/project-b project-b

Queries search across all indexed repositories.

To re-index just one project, run ingest again with the same name. Only that project's chunks are replaced.


Project Structure

.
β”œβ”€β”€ docker-compose.yml      # Database configuration
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ .env.example           # Template for your API key
β”œβ”€β”€ .env                   # Your actual API key (not in git)
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ ingest.sh          # Ingest command wrapper
β”‚   └── query.sh           # Query command wrapper
└── src/
    β”œβ”€β”€ config.py          # Settings
    β”œβ”€β”€ schema.py          # Database schema
    β”œβ”€β”€ chunker.py         # Code splitting logic
    β”œβ”€β”€ ingest.py          # Ingestion logic
    └── query.py           # Query logic

How It Works

  1. Ingestion: Reads all code files, splits them into chunks (by function/class when possible), and stores them in Weaviate with metadata (file path, line numbers, language).

  2. Query: Converts your question into a vector, finds the 8 most similar code chunks, sends them to the LLM with your question, and returns the answer with sources.
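The retrieval in step 2 is a vector similarity search: the question and every chunk live in the same embedding space, and the closest chunks win. A self-contained toy sketch of that idea using cosine similarity (in the real system the vectors come from Weaviate's transformer module; the tiny hand-made vectors and metadata here are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=8):
    """Return the k chunks most similar to the query vector.

    Each chunk is (vector, metadata); in the real system the metadata
    carries file path, line range, and symbol for the SOURCES table.
    """
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return ranked[:k]

chunks = [
    ([1.0, 0.0], {"file": "src/auth.py", "lines": "24-56"}),
    ([0.0, 1.0], {"file": "src/models/user.py", "lines": "1-35"}),
    ([0.9, 0.1], {"file": "src/auth.py", "lines": "58-72"}),
]
best = top_k([1.0, 0.05], chunks, k=2)
```

The default of 8 retrieved chunks corresponds to TOP_K_RESULTS in the Configuration section.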


Supported Languages

Python, JavaScript, TypeScript, Java, Go, Rust, C, C++, C#, Ruby, PHP, Swift, Kotlin, Scala, Vue, Svelte, SQL, Shell, YAML, JSON, TOML, Markdown, HTML, CSS, SCSS


Configuration

Edit src/config.py to customize:

| Setting | Default | Description |
|---|---|---|
| MAX_LINES | 160 | Maximum lines per chunk |
| OVERLAP_LINES | 30 | Overlap between chunks |
| TOP_K_RESULTS | 8 | Number of chunks to retrieve |
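MAX_LINES and OVERLAP_LINES govern the fallback splitting when a file has no clean function/class boundaries. A sketch of line-based chunking with overlap using those two settings (default values from the table; the function itself is illustrative, not the project's chunker):

```python
MAX_LINES = 160
OVERLAP_LINES = 30

def split_lines(lines, max_lines=MAX_LINES, overlap=OVERLAP_LINES):
    """Split a list of source lines into overlapping chunks.

    Each chunk is (start_line, end_line, text) with 1-based, inclusive
    line numbers, so answers can cite exact ranges.
    """
    chunks = []
    step = max_lines - overlap  # advance less than a full chunk, creating overlap
    start = 0
    while start < len(lines):
        end = min(start + max_lines, len(lines))
        chunks.append((start + 1, end, "\n".join(lines[start:end])))
        if end == len(lines):
            break
        start += step
    return chunks
```

The overlap means a function straddling a chunk boundary still appears whole in at least one chunk, which keeps retrieved context coherent.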
