Production-grade Retrieval-Augmented Generation (RAG) system for enterprise document Q&A. Combines semantic search with Claude AI to provide accurate, source-cited answers from large document collections. Built for high-accuracy information retrieval with transparent source attribution.
Enterprise document intelligence platform that indexes PDFs, Word documents, and text files for semantic search and AI-powered question answering. Uses vector embeddings for retrieval and Claude Sonnet 4 for answer generation with confidence scoring and source citation.
Key Capabilities:
- Multi-format document processing (PDF, DOCX, TXT, Markdown)
- Intelligent text chunking with semantic boundaries
- Vector embeddings for semantic search (sentence-transformers)
- Claude AI integration for context-aware answer generation
- Source attribution with page numbers and relevance scores
- Confidence scoring for answer reliability
- Real-time document indexing and querying
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ INDEXING PIPELINE โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Document Upload (PDF/DOCX/TXT)
โ
Text Extraction
(PyPDF2, python-docx)
โ
Intelligent Chunking
(512 tokens, 50 overlap)
โ
Embedding Generation
(sentence-transformers)
โ
Vector Storage
(In-memory index)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ QUERY PIPELINE โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
User Question
โ
Query Embedding
โ
Semantic Search โโโ Cosine Similarity โโโ Top-K Chunks
โ
Context Construction
(Chunk + Page + Relevance)
โ
Claude AI Generation
(Answer + Confidence + Citations)
โ
Response with Sources
Technology Stack:
- Embeddings: sentence-transformers (all-MiniLM-L6-v2, 384-dim)
- LLM: Claude Sonnet 4 (Anthropic API)
- Document Processing: PyPDF2, python-docx
- Vector Search: NumPy + scikit-learn (cosine similarity)
- API: FastAPI with async Python
- Tokenization: tiktoken (Claude-compatible)
Multi-Format Support:
- PDF documents (text extraction with page boundaries)
- Word documents (paragraph structure preservation)
- Plain text and Markdown files
- Automatic format detection
Smart Chunking:
- Semantic boundary detection (sentence-based)
- Configurable chunk size (default: 512 tokens)
- Overlap between chunks (default: 50 tokens)
- Page number tracking for source attribution
Token-Aware Processing:
- Uses tiktoken for accurate Claude-compatible token counting
- Prevents context window overflow
- Optimizes chunk sizes for retrieval quality
Embedding Model:
- sentence-transformers/all-MiniLM-L6-v2
- 384-dimensional dense vectors
- Fast inference (~50ms per query)
- Strong semantic understanding
Search Capabilities:
- Cosine similarity ranking
- Top-K retrieval (configurable)
- Document filtering (query specific collections)
- Batch embedding for efficient indexing
Claude AI Integration:
- Sonnet 4 for accurate, nuanced answers
- Low temperature (0.1) for factual consistency
- Structured prompts for reliable output
- Source citation in responses
Confidence Scoring:
- Self-assessed confidence (0.0-1.0)
- Based on context quality and information sufficiency
- Transparent uncertainty communication
Source Attribution:
- Citations with [Source X] notation
- Page numbers from original documents
- Relevance scores for each source
- Full chunk context available
Async Processing:
- Non-blocking document uploads
- Parallel embedding generation
- Background indexing tasks
Scalability:
- Batch processing for large documents
- Efficient vector operations (NumPy)
- Stateless API design
Monitoring:
- Structured logging (JSON format)
- Processing time tracking
- Health check endpoints
def _create_chunks(text, page_map):
"""
Creates overlapping chunks with semantic boundaries
Preserves sentence structure for context coherence
"""
sentences = split_sentences(text)
current_chunk = []
current_tokens = 0
for sentence in sentences:
sentence_tokens = count_tokens(sentence)
if current_tokens + sentence_tokens > MAX_CHUNK_TOKENS:
# Save chunk
save_chunk(current_chunk)
# Create overlap
overlap = get_last_sentences(current_chunk, OVERLAP_TOKENS)
current_chunk = overlap + [sentence]
else:
current_chunk.append(sentence)
current_tokens += sentence_tokensWhy overlapping chunks?
- Prevents information loss at boundaries
- Maintains context across chunks
- Improves retrieval quality for cross-boundary queries
async def search(query: str, top_k: int = 5):
"""
Semantic search using cosine similarity
Returns most relevant chunks with scores
"""
# Generate query embedding
query_vector = model.encode([query])[0]
# Compute cosine similarity with all chunks
similarities = cosine_similarity(
query_vector.reshape(1, -1),
chunk_vectors
)[0]
# Get top-k results
top_indices = np.argsort(similarities)[-top_k:][::-1]
return [
RetrievalResult(
chunk=chunks[idx],
similarity_score=similarities[idx],
rank=i+1
)
for i, idx in enumerate(top_indices)
]Cosine Similarity:
- Measures semantic similarity between vectors
- Range: [-1, 1], where 1 = identical meaning
- Robust to document length differences
prompt = f"""You are a helpful AI assistant answering questions based on provided documents.
CONTEXT FROM DOCUMENTS:
{context_with_sources}
USER QUESTION:
{question}
INSTRUCTIONS:
1. Answer using ONLY information from the provided context
2. Cite sources using [Source X] notation
3. Be precise and factual
4. Express uncertainty if context insufficient
5. Provide confidence score (0.0-1.0)
Format:
ANSWER: [Your answer]
CONFIDENCE: [0.0-1.0]
REASONING: [Why this confidence]"""Key Prompt Elements:
- Clear role definition (factual assistant)
- Explicit grounding in context (no hallucination)
- Structured output format (parseable)
- Confidence self-assessment
- Source citation requirement
# Batch embedding for efficiency
texts = [chunk["content"] for chunk in chunks]
embeddings = model.encode(texts, show_progress_bar=False)
for text, embedding in zip(texts, embeddings):
store_vector(text, embedding.tolist())Why batch encoding?
- 10-50x faster than sequential encoding
- Better GPU utilization
- Consistent normalization across batch
- Python 3.11+
- Anthropic API key
- 4GB+ RAM (for embedding model)
git clone https://github.com/yourusername/rag-document-qa.git
cd rag-document-qa
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download embedding model (first run)
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"
# Set API key
export ANTHROPIC_API_KEY=your_key_here
# Run server
python src/main.pyDependencies:
fastapi>=0.104.0
uvicorn>=0.24.0
anthropic>=0.8.0
sentence-transformers>=2.2.0
numpy>=1.24.0
scikit-learn>=1.3.0
PyPDF2>=3.0.0
python-docx>=1.0.0
tiktoken>=0.5.0
asyncpg>=0.29.0
redis>=5.0.0curl -X POST http://localhost:8000/documents/upload \
-F "file=@research_paper.pdf"Response:
{
"status": "success",
"document_id": "doc_a1b2c3d4e5f6g7h8",
"title": "research_paper.pdf",
"chunks_indexed": 47,
"total_pages": 12
}curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{
"question": "What are the main findings of the study?",
"top_k": 5
}'Response:
{
"question": "What are the main findings of the study?",
"answer": "The study found three main results: [Source 1] First, the intervention improved outcomes by 34% compared to control. [Source 2] Second, effects were sustained at 6-month follow-up. [Source 3] Third, cost-effectiveness ratio was favorable at $12,000 per QALY.",
"confidence": 0.92,
"sources": [
{
"chunk": {
"content": "Results showed a 34% improvement...",
"page_number": 8
},
"similarity_score": 0.87,
"rank": 1
}
],
"processing_time_ms": 1847,
"model_used": "claude-sonnet-4-20250514"
}curl http://localhost:8000/documentsResponse:
{
"total_documents": 5,
"document_ids": [
"doc_a1b2c3d4e5f6g7h8",
"doc_i9j8k7l6m5n4o3p2"
]
}curl http://localhost:8000/health| Operation | Latency (p95) | Notes |
|---|---|---|
| Document Upload | 2-5s | Depends on size (10-page PDF ~3s) |
| Embedding Generation | 50-100ms | Per batch of 10 chunks |
| Vector Search | 10-20ms | For 1000 chunks |
| Claude AI Generation | 1-2s | Depends on answer length |
| End-to-End Query | 1.5-3s | Search + generation |
Scalability:
- 1000 documents: <100ms search
- 10,000 chunks: ~200ms search
- Memory: ~1GB per 10,000 chunks
This project demonstrates:
RAG Architecture:
- Document chunking strategies
- Vector embedding generation
- Semantic similarity search
- Context-aware answer generation
- Source attribution and confidence scoring
AI/ML Engineering:
- Claude Sonnet 4 API integration
- Sentence-transformers for embeddings
- Prompt engineering for factual accuracy
- Structured output parsing
- Temperature tuning for consistency
Document Processing:
- Multi-format text extraction (PDF, DOCX)
- Intelligent chunking with overlaps
- Page number tracking
- Token-aware processing
Production Python:
- Async/await for non-blocking I/O
- FastAPI for high-performance API
- Type hints and dataclasses
- Structured logging
- Error handling and validation
Vector Operations:
- NumPy for efficient similarity computation
- Batch processing for throughput
- In-memory vector index (scalable to Pinecone)
- Hybrid search (BM25 + vector search)
- Multi-language support
- Image and table extraction from PDFs
- Conversation memory for multi-turn Q&A
- Fine-tuned embeddings for domain-specific docs
- PostgreSQL pgvector integration
- Streaming responses for long answers
- Document update and versioning
- Internal documentation Q&A
- Policy and procedure lookup
- Training material search
- Contract analysis
- Regulatory document search
- Precedent research
- Literature review assistance
- Paper summarization
- Citation tracking
- Product documentation Q&A
- Troubleshooting guide search
- FAQ automation
MIT License
Built for accurate, source-cited document Q&A at scale