This repository contains a complete RAG (Retrieval-Augmented Generation) application with web scraping capabilities, embedding processing and storage, and query functionality. The web UI is a convenient interface for users to enter queries to the RAG application. Note that questions should relate specifically to the content you scraped from the web; otherwise there is no point in having a RAG application, and you could simply query the LLM directly.
- Web Scraping, Embedding, and Vector Storage: Extract content starting from any website URL, process and embed the text, and store the content and embeddings in a local PostgreSQL database.
- Query Interface: Supports natural language queries, with domain-specific knowledge retrieved from the scraped content.
- Web Interface: Web-based UI for interaction, with streaming responses.
${basedir}/scraper/
├── .env # Environment configuration file
├── adhoc_query.py # Test vector query
├── requirements.txt # Python dependencies
├── web_app.py # Main web application
├── env/ # Virtual environment
├── scraper/ # Core scraping functionality
└── static/ # Web server static files

- Python 3.12 or higher
- pip (Python package manager)
- Virtual environment tools
- PostgreSQL 18 + pgvector extension 0.8.1
- An account/API key for an LLM. You can create a free account at OpenRouter and use a free LLM model.
I have everything set up on my MacBook; if you are a Windows user, please find the equivalent installation procedure on the web.
- PostgreSQL database 18.0 installation:
  brew search postgresql
  brew install postgresql@18
- pgvector extension 0.8.1 installation:
  brew install pgvector
- Link the pgvector extension to the PostgreSQL installation. This works on a MacBook:
  ln -s /usr/local/Cellar/pgvector/0.8.1/share/postgresql@18/extension/* /usr/local/Cellar/postgresql@18/18.0/share/postgresql/extension/
  On Windows, find where pgvector is installed, and link or copy the contents of its extension directory into the extension directory of your PostgreSQL 18.0 installation.
- Hard-coded setup:
  database: "myprojdb"
  host: "localhost"
  port: 5432
  schema: "scraper"
  table: "pages"
  user: "myuser"
  password: "mypassword"
  User myuser shall be the owner of myprojdb, with full admin and read/write permissions.
- Verify pgvector is supported in your database:
To enable authentication on your database, edit /usr/local/var/postgresql@18/pg_hba.conf and change 'trust' to 'md5' for your database connections from localhost or remote hosts.
Log in to your database:
  psql -U myuser -d myprojdb
Then run:
  set search_path to scraper, public;
  create extension vector;
You should see CREATE EXTENSION, or ERROR: extension "vector" already exists if you run it more than once. You can also run \dx to list installed extensions; it should include vector (0.8.1).
You can optionally create the pages table yourself (the program will create it automatically if it is not there). Table "pages" has columns:
  id integer
  url text
  title text
  content text
  embedding vector(384)
  chunk_id integer
pages_pkey is on id; pages_url_chunk_id is a unique key on (url, chunk_id).
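The hard-coded settings above combine into the libpq connection URL that the application uses; a minimal sketch for assembling it (the dict and function names here are illustrative, not from the project code):

```python
# Sketch: build the PostgreSQL connection URL from the hard-coded settings
# above. DB_CONFIG and make_dsn are illustrative names, not project code.
DB_CONFIG = {
    "user": "myuser",
    "password": "mypassword",
    "host": "localhost",
    "port": 5432,
    "database": "myprojdb",
}

def make_dsn(cfg: dict) -> str:
    """Render the settings as a postgresql:// connection URL."""
    return (f"postgresql://{cfg['user']}:{cfg['password']}"
            f"@{cfg['host']}:{cfg['port']}/{cfg['database']}")
```

The resulting string matches the PGVECTOR_DB_URL entry in the .env file shown below.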
- LLM model/Database interaction:
  .env:
  PGVECTOR_DB_URL=postgresql://myuser:mypassword@localhost:5432/myprojdb
  OPENROUTER_API_KEY=sk-or-v1-7256xxxxxx (use your own OpenRouter API key here)
  DEEPSEEK_FREE_MODEL=deepseek/deepseek-chat-v3.1:free
  TOKENIZERS_PARALLELISM=false
- Scraper/Spider Configuration: settings.py
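As a sketch of what the crawl-limiting portion of settings.py looks like: only DEPTH_LIMIT and CLOSESPIDER_PAGECOUNT are taken from this README; the other entries are assumptions (typical Scrapy boilerplate), not the project's actual values.

```python
# scraper/settings.py (fragment) -- a sketch; only DEPTH_LIMIT and
# CLOSESPIDER_PAGECOUNT are quoted from this README, the rest are assumptions.
BOT_NAME = "scraper"
ROBOTSTXT_OBEY = True           # respect the site's robots.txt (bot crawler policy)
DOWNLOAD_DELAY = 1.0            # assumed rate limit between requests
DEPTH_LIMIT = 1                 # crawl only one level deep from the start URL
CLOSESPIDER_PAGECOUNT = 500     # stop after roughly 500 pages
```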
- Dependencies:
requirements.txt
I recommend setting up an account at https://openrouter.ai/ and getting an API key (sk-or-v1-7256.....). The benefit of an OpenRouter account is that it supports all major LLM models without the need to register with each model vendor and hold a separate API key for each. One OpenRouter API key gives you access to any model you choose, and OpenRouter handles the billing and payment for your usage. There are free LLM models you can configure via OpenRouter, and you pay nothing if you choose to use one of them. I have configured deepseek/deepseek-chat-v3.1:free for this project.
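As a sketch of how the key and model name get used: OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so a minimal call can be made with only the standard library. The function names and prompt wording below are illustrative, and the actual project code may use a client library instead.

```python
import json
import os
import urllib.request

# OpenRouter's OpenAI-compatible chat-completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(question: str, context: str,
                  model: str = "deepseek/deepseek-chat-v3.1:free") -> dict:
    """Assemble a chat request that grounds the answer in retrieved context
    (illustrative prompt wording, not the project's actual prompt)."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

def ask_openrouter(question: str, context: str) -> str:
    """POST the request to OpenRouter and return the model's reply text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(question, context)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```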
- Navigate to the project directory:
  git clone ....
  cd ${basedir}/scraper
- Set up the virtual environment:
  python -m venv env
  source env/bin/activate
  pip install -r requirements.txt
- Scraping/Embedding/Store:
For scraping web content, a Spider is used that respects the rate limits and bot-crawler policies of the site. The configuration for the scraper is in the settings.py file; you can adjust the settings to fit your own preferences. Two configuration items are of particular importance:

  DEPTH_LIMIT = 1
  CLOSESPIDER_PAGECOUNT = 500

These mean that we crawl only one level deep, and that we stop crawling after roughly 500 URLs, assuming there are enough depth-one referenced URLs from the starting page. You can adjust either or both to get the amount of data you are comfortable with in your local database. You can crawl different starting URLs across multiple runs. You can also repeat the crawl of the same starting URL more than once; the data will be saved only once, because the database has constraints that prevent duplicate data from being saved twice.

To scrape, embed, and store website content, run the following:

  export SCRAPY_SETTINGS_MODULE=scraper.settings
  cd ${basedir}/scraper
  python -m scraper.embed_and_store {url}

e.g. url = "https://en.wikivoyage.org/wiki/Driving_in_China"

You should see records being stored in the pages table while this is running. Run the SQL below to verify:

  select count(*) from scraper.pages;

Note on embedding: We use the embedding model all-MiniLM-L6-v2, which has a dimension of 384. You can try other embedding models with higher dimensions for potentially better results; for this demo project, it is good enough to serve the purpose. Since each page can be arbitrarily long, we use 1500-character chunks (about 300 words) per embedding; a URL (page) usually has multiple chunks. Our vector search is over page chunks, not over a page as a whole.
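The chunking step described above can be sketched as a fixed-size splitter. This is a minimal illustration; the project's actual splitter may differ, e.g. by respecting word or sentence boundaries before handing each chunk to all-MiniLM-L6-v2.

```python
def chunk_text(text: str, size: int = 1500) -> list[str]:
    """Split page text into fixed-size character chunks (~300 words each).
    Each chunk is embedded separately and stored as its own row in the
    pages table, keyed by (url, chunk_id)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Each chunk would then be embedded, e.g. with sentence-transformers:
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vectors = model.encode(chunk_text(page_text))   # 384-dim vectors
```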
Note on the vector operator: v1 <=> v2 is defined as

$$\text{cosineDistance}(v_1, v_2) = 1 - \cos(\theta) = 1 - \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$

Range: [0, 2]
  0 = perfect match (cos(θ) = 1)
  1 = orthogonal
  2 = opposite

KNN vector search is based on cosine distance, meaning the smaller the distance, the closer the vectors are:

  SELECT id, url, title, content FROM pages ORDER BY embedding <=> %s::vector LIMIT 3;
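The operator's definition above can be checked with a few lines of Python (a plain re-implementation for illustration; pgvector computes this natively inside PostgreSQL):

```python
import math

def cosine_distance(v1, v2):
    """Equivalent of pgvector's <=> operator: 1 - cos(theta), range [0, 2]."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / norm

# cosine_distance([1, 0], [1, 0])   -> 0.0  (perfect match)
# cosine_distance([1, 0], [0, 1])   -> 1.0  (orthogonal)
# cosine_distance([1, 0], [-1, 0])  -> 2.0  (opposite)
```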
- Vector Search Query Testing:
This verification step is not needed for the RAG application to function; it was an intermediate stage during development, before the web UI was integrated. You can skip this and go directly to the web UI.

  python adhoc_query.py {your query}

Your query should be related to your scraped content, e.g. use "what is the max speed limit on Expressway in China?" or "what is the max speed of bullet train in China?" for {your query}. You should see the top 3 records returned.
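Under the hood, adhoc_query.py presumably embeds the query text and runs a KNN SELECT like the one shown in the vector-operator note; a sketch of that step (the helper names are illustrative, and `cur` is assumed to be a psycopg2 cursor):

```python
# Illustrative sketch of the KNN retrieval step, not the project's actual code.
TOP_K_SQL = """
    SELECT id, url, title, content, embedding <=> %s::vector AS distance
    FROM scraper.pages
    ORDER BY embedding <=> %s::vector
    LIMIT %s;
"""

def to_pgvector(vec) -> str:
    """Render a Python list in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

def top_k(cur, query_vec, k: int = 3):
    """Return the k chunks nearest to the query embedding by cosine distance.
    `cur` is assumed to be an open psycopg2 cursor."""
    v = to_pgvector(query_vec)
    cur.execute(TOP_K_SQL, (v, v, k))
    return cur.fetchall()
```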
- Using the Web UI:
Start the FastAPI server in ${basedir}/scraper:

  uvicorn web_app:app --reload --port=8080

then open in a browser: http://localhost:8080
Below are some screenshots showing the database setup, the web UI of the RAG application, and some sample query results, both from the RAG UI and from verification against the database.
- Database table "pages":
- WEB UI:
- Scraped content vs RAG answers:
The following two screenshots show data captured in the table corresponding to the RAG query responses. Carefully compare the contents of table "pages" in the two screenshots below with the RAG application's responses to the queries in the web UI screenshot above. What do you see? The scraped content shows a maximum driving speed of 120 km/h on expressways; the RAG response cited the source as chunk 7 of the "Driving in China" page. The scraped content shows a maximum bullet train speed of 350 km/h; the RAG response cited the source as chunk 2 of the "Rail travel in China" page. These two screenshots are evidence that the responses from the RAG application were indeed retrieved from our local knowledge base, and were not part of the general answers from the LLM model itself.
This project is licensed under the MIT License.



