Tax Legal RAG System

A full-stack, knowledge-graph-based Retrieval-Augmented Generation (RAG) system for Vietnamese tax law documents. It compares vector search with graph-based retrieval.

Architecture

Document-Graph-Representation/
├── api/                       # FastAPI backend
│   ├── routers/               # API endpoints (graph, rag, health)
│   ├── services/              # Business logic
│   └── schemas/               
├── frontend/                  # React frontend
│   ├── src/
│   │   ├── components/        
│   │   ├── pages/             
│   │   ├── services/          
│   │   └── stores/            # Zustand state
├── rag_model/                 # ML pipeline
│   ├── model/                 # NER, RE, document processing
│   └── retrieval_pipeline/    # Retrieval strategies
├── shared_functions/          # Utilities (Neo4j, S3, eval)
└── docs/                      # Documentation

Tech Stack

Backend

Component	Technology
Framework	FastAPI 0.115.6
Graph DB	Neo4j 5.27.0 (AuraDB)
Storage	AWS S3
Embeddings	sentence-transformers 3.3.1
NLP	Underthesea (Vietnamese)

Frontend

Component	Technology
Framework	React 18.3.1
Language	TypeScript 5.8.3
Build	Vite 5.4.19
State	Zustand 5.0.8
Data	TanStack Query 5.83.0
UI	shadcn/ui + Tailwind CSS
Graph Viz	react-force-graph

Quick Start

Prerequisites

Python 3.8+
Node.js 18+
Neo4j database (local or AuraDB), required by the backend

Step 1: Clone & Setup Environment

# Clone repo
git clone https://github.com/GinHikat/Document-Graph-Representation.git
cd Document-Graph-Representation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirement.txt
pip install -r requirements-api.txt

Step 2: Configure Environment Variables

# Copy example config
cp .env.example .env

# Ask team lead for the actual credentials to fill in:
# - NEO4J_URI, NEO4J_AUTH (Neo4j database)
# - GOOGLE_API_KEY (Gemini API for RAG answers)

Note: The project uses a shared Neo4j database. Contact the team for credentials; do not create a new database.

Step 3: Run Backend Server

# From project root (NOT from api/ folder)
uvicorn api.main:app --reload --port 8000

# Verify: Open http://localhost:8000/api/health
# Should return {"status": "healthy", ...}
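
The same check can be scripted, for example in a startup or CI step. This is a minimal sketch using only the standard library; it assumes the health endpoint returns a JSON object with a "status" field as shown above, and the helper names `is_healthy` and `check_health` are hypothetical, not part of the project.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Return True when the health payload reports status == "healthy"."""
    return payload.get("status") == "healthy"


def check_health(url: str = "http://localhost:8000/api/health", timeout: float = 5.0) -> bool:
    """Fetch the health endpoint and interpret its JSON body.

    Connection errors, HTTP errors, and non-JSON bodies all count as "not healthy".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
    return isinstance(payload, dict) and is_healthy(payload)
```

Calling `check_health()` while the backend is stopped returns False rather than raising, which keeps it easy to use in a retry loop.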

Step 4: Run Frontend

# In a new terminal
cd frontend

npm install

# Configure environment
cp .env.example .env
# Default VITE_API_URL=http://localhost:8000/api is correct

# Run dev server
npm run dev

Frontend runs at http://localhost:8080, API at http://localhost:8000.

API Endpoints

Graph API (`/api/graph`)

Method	Endpoint	Description
GET	`/nodes`	Fetch graph nodes with optional filters
POST	`/execute`	Execute Cypher queries
GET	`/schema`	Get graph schema
GET	`/stats`	Graph statistics
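
As an illustration of calling the Graph API from Python, the sketch below builds a POST request for `/execute` with the standard library. The JSON body shape `{"query": ...}` is an assumption (the request schema is not documented here), and `build_execute_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_execute_request(cypher: str, base_url: str = "http://localhost:8000/api") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /graph/execute endpoint.

    ASSUMPTION: the endpoint accepts a JSON body of the form {"query": "<cypher>"}.
    """
    body = json.dumps({"query": cypher}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/graph/execute",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_execute_request("MATCH (n) RETURN count(n) AS total")
# To send it against a running backend:
#   with urllib.request.urlopen(request) as response:
#       print(json.loads(response.read()))
```

Check the schemas under api/schemas/ for the actual field names before relying on this.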

RAG API (`/api/rag`)

Method	Endpoint	Description
POST	`/query`	RAG query with SSE streaming
POST	`/retrieve`	Retrieve relevant context
POST	`/rerank`	Rerank retrieved results
GET	`/tools`	List available tools
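
Since `/query` streams its answer over SSE, a client has to split the response into events. The sketch below is a minimal parser for the standard `data:` framing; the `{"token": ...}` payload in the sample is an assumption for illustration only, and `iter_sse_data` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of lines.

    An event is a run of "field: value" lines ended by a blank line; several
    "data:" lines in one event are joined with newlines. Lines starting with
    ":" are comments and are skipped. Other fields (event, id, retry) are ignored.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # Lenient choice: emit a trailing event even without a closing blank line.
    if buffer:
        yield "\n".join(buffer)


# Sample stream; the JSON shape is assumed, not taken from the API.
sample = [
    'data: {"token": "Thuế"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": " thu nhập"}\n',
    "\n",
]
tokens = [json.loads(chunk)["token"] for chunk in iter_sse_data(sample)]
```

In practice the `lines` argument would be the decoded line iterator of the HTTP response body.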

Health API

Method	Endpoint	Description
GET	`/health`	Health check

Retrieval Modes

modes = {
    1: "default",                  # Standard embedding search
    2: "traverse_embed",           # Embeddings + graph traversal
    3: "traverse_exact",           # Exact match + graph traversal
    4: "exact_match",              # Exact match
    5: "exact_match_with_rerank",  # Exact match, then rerank with embeddings
    6: "hybrid_search",            # Top k by both embeddings and exact match
}
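
Because `mode` is passed as an integer, a reverse lookup by name can make calling code easier to read. The sketch below copies the table above; `mode_id` is a hypothetical helper, not something the project is known to provide.

```python
RETRIEVAL_MODES = {
    1: "default",
    2: "traverse_embed",
    3: "traverse_exact",
    4: "exact_match",
    5: "exact_match_with_rerank",
    6: "hybrid_search",
}


def mode_id(name: str) -> int:
    """Return the numeric mode for a retrieval mode name (case-insensitive)."""
    by_name = {label: number for number, label in RETRIEVAL_MODES.items()}
    key = name.strip().lower()
    if key not in by_name:
        raise ValueError(f"Unknown retrieval mode {name!r}; expected one of {sorted(by_name)}")
    return by_name[key]
```

For example, `mode_id("hybrid_search")` gives the value to pass as `mode=` for hybrid search.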

Embedding Models

models = {
    0: "paraphrase-multilingual-MiniLM-L12-v2",
    1: "distiluse-base-multilingual-cased-v2",
    2: "all-mpnet-base-v2",
    3: "all-MiniLM-L12-v2",
    4: "vinai/phobert-base",   # Vietnamese-specific
    5: "BAAI/bge-m3"           # Evaluation only
}

Evaluation Modes

mode_map = {
    1: 'embedding',
    2: 'jaccard', 
    3: 'combined'
}
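
To give a feel for what these modes measure, here is a sketch of token-level Jaccard similarity and one possible way to combine it with an embedding score. The Jaccard definition is standard; the weighted mix in `combined_score` is an assumption about how a scaling factor might be applied, not the project's confirmed formula.

```python
def jaccard_similarity(reference: str, retrieved: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens: |A ∩ B| / |A ∪ B|."""
    a = set(reference.lower().split())
    b = set(retrieved.lower().split())
    if not a and not b:
        return 0.0  # Convention chosen here for two empty texts.
    return len(a & b) / len(a | b)


def combined_score(embedding_score: float, jaccard_score: float, scaling_factor: float = 0.5) -> float:
    """ASSUMED combination: a convex mix weighted by scaling_factor."""
    return scaling_factor * embedding_score + (1.0 - scaling_factor) * jaccard_score


# "thuế thu nhập" vs "thu nhập cá nhân": 2 shared tokens out of 5 distinct ones.
overlap = jaccard_similarity("thuế thu nhập", "thu nhập cá nhân")
```

Whitespace tokenization is a simplification; a Vietnamese word segmenter such as Underthesea would group multi-syllable words differently.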

Environment Variables

Backend (.env)

# Neo4j
NEO4J_URI=neo4j+s://xxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password

# AWS S3
AWS_ACCESS_KEY_ID=your-key
AWS_SECRET_ACCESS_KEY=your-secret
AWS_BUCKET_NAME=your-bucket
AWS_REGION=ap-southeast-1

# Optional
OPENAI_API_KEY=your-openai-key
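
A quick pre-flight check can catch missing settings before the server starts. The sketch below assumes the values have already been loaded into the process environment and treats the three Neo4j variables listed above as required; adjust the tuple if your setup uses different names (the Quick Start mentions NEO4J_AUTH). `missing_vars` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Sequence

# Assumed required set, taken from the Backend (.env) listing above.
REQUIRED_BACKEND_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_vars(env: Mapping[str, str], required: Sequence[str] = REQUIRED_BACKEND_VARS) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


# Typical use at startup:
#   missing = missing_vars(os.environ)
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Passing the environment in as a mapping keeps the function easy to test with a plain dict.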

Frontend (.env)

VITE_API_URL=http://localhost:8000/api
VITE_ENABLE_GRAPH_VIEW=true
VITE_ENABLE_ANNOTATIONS=true

Usage Examples

Python SDK

from shared_functions.batch_retrieval_neo4j import Neo4j_retriever

retriever = Neo4j_retriever()

# Single query
result = retriever.query_neo4j(
    text="Thuế thu nhập cá nhân",  # "Personal income tax"
    mode=6,                  # Hybrid search (see Retrieval Modes)
    graph=True,              # Use graph embeddings; pass None to use only text embeddings
    chunks=None,             # Include chunk nodes (only available in the GraphSAGE-integrated database)
    hop=2,                   # Number of hops in graph traversal
    namespace="Test_rel_3",  # Namespace (node label used for filtering)
)

# Batch query: df must include a "question" column
df = retriever.batch_query(df, mode=2, graph=True, chunks=True, hop=2, namespace="Test")

Evaluation

from shared_functions.eval import Evaluator
from shared_functions.batch_retrieve_neo4j import Neo4j_retriever

retriever = Neo4j_retriever()
evaluator = Evaluator(embedding_as_judge=5)  # Named evaluator to avoid shadowing the built-in eval

# Combined evaluation
result = evaluator.combined_evaluator(
    referenced_context="...",
    retrieved_context="...",
    embedding_threshold=0.7,
    jaccard_threshold=0.3,
    scaling_factor=0.5,
)

# Batch evaluation: df must have supporting_context and retrieved_context columns holding lists
retriever.str_to_list(df, "supporting_context")
retriever.str_to_list(df, "retrieved_context")
# Thresholds are example values; mode is assumed to be a key of mode_map (3 = 'combined')
evaluator.run_evaluation(df, embedding_threshold=0.7, jaccard_threshold=0.3, scaling_factor=0.5, mode=3)

# RAGAS evaluation
evaluator.ragas(df)  # df columns: question, answer, retrieved_contexts
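
Context columns often arrive as stringified lists after a CSV round trip, which is why they need converting before batch evaluation. The sketch below shows one way such a conversion can work for a single cell; it is an illustration under that assumption and not the actual implementation of `str_to_list`. `parse_list_cell` is a hypothetical helper.

```python
import ast


def parse_list_cell(cell) -> list:
    """Convert a stringified list such as "['a', 'b']" into a real list.

    Values that are already lists pass through unchanged.
    """
    if isinstance(cell, list):
        return cell
    # literal_eval only evaluates Python literals, so it is safe on untrusted text.
    parsed = ast.literal_eval(cell)
    if not isinstance(parsed, list):
        raise ValueError(f"Expected a list literal, got {type(parsed).__name__}")
    return parsed


# With pandas, this could be applied column-wise:
#   df["supporting_context"] = df["supporting_context"].apply(parse_list_cell)
```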

Frontend Pages

Page	Route	Description
Home	`/`	Dashboard overview
Documents	`/documents`	Document management
Q&A	`/qa`	Query interface
Graph	`/graph`	Knowledge graph visualization
Annotate	`/annotate`	Document annotation

Development

Run Tests

Backend Tests

# Run all tests (40 tests, 100% pass rate)
pytest api/tests/ -v

# Run with coverage (69% coverage)
pytest api/tests/ --cov=api --cov-report=html

# Run specific test file
pytest api/tests/test_auth.py -v      # 13 auth tests
pytest api/tests/test_documents.py -v # 17 document tests
pytest api/tests/test_rag.py -v       # 10 RAG tests

# See docs/testing.md for detailed testing documentation

Frontend Tests

# Linting
npm run lint

# Type checking
npm run build

Build for Production

# Frontend
cd frontend
npm run build
# Output in frontend/dist/

# Serve with backend
# Configure FastAPI to serve static files

Project Structure Details

See docs/ for detailed documentation:

docs/system-architecture.md - Architecture diagrams
docs/codebase-summary.md - Component details
docs/code-standards.md - Coding conventions
docs/project-roadmap.md - Development roadmap
docs/testing.md - Testing guide and coverage reports

Troubleshooting

"Cannot connect to backend server. Is it running on localhost:8000?"

Cause: Backend server is not running or crashed on startup.

Solutions:

Make sure you're running the backend from project root:

# Correct (from project root)
uvicorn api.main:app --reload --port 8000

# Wrong (from api/ folder)
cd api && uvicorn main:app --reload --port 8000

Check if .env has valid credentials (ask team for credentials):

# Required in .env
NEO4J_URI=<get from team>
NEO4J_AUTH=<get from team>

Verify backend health:

curl http://localhost:8000/api/health
# Should return {"status": "healthy", ...}

"Backend Disconnected" in UI

The frontend can't reach the API. Check:

Is backend running on port 8000?
Is VITE_API_URL=http://localhost:8000/api set in frontend/.env?
Any CORS errors in browser console?

Port Already in Use

# Find process on port 8000
lsof -i :8000

# Kill it
kill -9 <PID>
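
Where `lsof` is not available (for example on Windows), a short Python check can at least confirm whether something is listening on the port. This is a sketch using the standard `socket` module; `port_in_use` is a hypothetical helper and it does not identify the owning process.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return probe.connect_ex((host, port)) == 0


# Example: port_in_use(8000) before starting uvicorn
```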

License

MIT

Contributors

Tax Legal RAG Team
