You now have a complete, production-ready semantic search system with extensive educational materials. This is a Week 2 learning track project covering:
- ✅ Core concepts (embeddings, similarity, chunking, vector DB)
- ✅ Working implementation using Ollama + ChromaDB
- ✅ Interactive Streamlit web app
- ✅ Comprehensive documentation
- ✅ Educational Jupyter notebook with experiments
- ✅ Sample documents and queries
| File | Purpose | Key Classes |
|---|---|---|
| config.py | Configuration management | Config |
| ingestion.py | Load PDF, TXT, and MD files | DocumentIngester, Document |
| chunking.py | Text chunking strategies | FixedSizeChunker, SemanticChunker, ChunkerFactory |
| embeddings.py | Generate embeddings via Ollama | OllamaEmbeddings, EmbeddingFactory |
| similarity.py | Similarity metrics (cosine, dot, L2) | CosineSimilarity, DotProduct, EuclideanDistance |
| vector_store.py | ChromaDB integration | VectorStore |
| search_engine.py | Main orchestrator | SemanticSearchEngine |
| File | Content |
|---|---|
| README.md | Project overview, setup, usage, troubleshooting |
| LEARNING_GUIDE.md | Detailed explanations of concepts (embeddings, chunking, etc.) |
| QUICK_REFERENCE.md | Cheat sheet for quick lookup |
| EXAMPLE_QUERIES.md | Sample queries with expected results |
| File | Purpose |
|---|---|
| app.py | Streamlit web interface |
| Semantic_Search_Complete_Learning.ipynb | Interactive Jupyter notebook with experiments |
| data/documents/ | Sample documents (Markdown) |
| data/chroma_db/ | Vector database storage (auto-created) |
| File | Purpose |
|---|---|
| requirements.txt | Python dependencies |
| .env.example | Configuration template |
| .gitignore | Git exclusions |
```bash
# Setup
pip install -r requirements.txt

# Terminal 1: Start Ollama
ollama serve

# Terminal 2: Pull model
ollama pull nomic-embed-text

# Terminal 3: Run app
streamlit run app.py
```
Then open http://localhost:8501
Best for: Interactive exploration, demos, non-technical users
```bash
# Setup
pip install -r requirements.txt jupyter

# Terminal 1: Start Ollama
ollama serve

# Terminal 2: Start Jupyter
jupyter notebook

# Open: Semantic_Search_Complete_Learning.ipynb
```
Best for: Understanding concepts, experiments, hands-on learning
```python
from src.search_engine import SemanticSearchEngine

engine = SemanticSearchEngine()
engine.index_documents("./data/documents")
results = engine.search("Your query here")
```
Best for: Integration into other projects, custom workflows
What you learn:
- How embeddings work (text → vectors)
- Similarity metrics (cosine, dot product, L2)
- Chunking strategies and trade-offs
- Vector databases and indexing
- Building a complete search pipeline
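The three similarity metrics above can be compared with a few lines of NumPy. This toy example is independent of the project's similarity.py; it only illustrates how the metrics treat direction versus magnitude.

```python
# Compare cosine similarity, dot product, and Euclidean (L2) distance
# on two vectors that point in the same direction but differ in length.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    return float(np.dot(a, b))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])   # same direction as a, twice the length

print(cosine_similarity(a, b))   # ~1.0: direction identical, length ignored
print(dot_product(a, b))         # 4.0: grows with magnitude
print(euclidean_distance(a, b))  # ~1.414: penalizes the length difference
```

This is why cosine similarity is the usual default for embeddings: two texts with the same meaning should match regardless of how "long" their vectors happen to be.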
How to learn:
- Read LEARNING_GUIDE.md for deep concepts
- Run Jupyter notebook for hands-on experiments
- Try web app to see it in action
- Adapt the QUICK_REFERENCE.md examples in your own Python scripts
- Day 1: Read LEARNING_GUIDE.md (concepts)
- Day 2: Run Jupyter notebook, do experiments
- Day 3: Use web app, try different configurations
- Day 4: Extend with custom documents
- Day 5: Integrate into your own project
RAG_101/
├── src/ (7 modules, ~1300 lines)
│ ├── config.py (~40 lines)
│ ├── ingestion.py (~120 lines)
│ ├── chunking.py (~200 lines)
│ ├── embeddings.py (~130 lines)
│ ├── similarity.py (~150 lines)
│ ├── vector_store.py (~250 lines)
│ └── search_engine.py (~120 lines)
│
├── data/
│ ├── documents/ (3 markdown files, ~25KB)
│ └── chroma_db/ (auto-created, ~5-10MB when indexed)
│
├── app.py (~350 lines, Streamlit app)
├── Semantic_Search_Complete_Learning.ipynb (~500 lines, 10 cells)
│
├── Documentation (~2000 lines total)
│ ├── README.md (~500 lines)
│ ├── LEARNING_GUIDE.md (~800 lines)
│ ├── QUICK_REFERENCE.md (~200 lines)
│ ├── EXAMPLE_QUERIES.md (~300 lines)
│ └── PROJECT_SUMMARY.md (this file)
│
└── Configuration
├── requirements.txt (6 packages)
├── .env.example (7 settings)
└── .gitignore
- Each component (embeddings, chunking, similarity) is independent
- Easy to swap implementations (e.g., different embedding models)
- Clean separation of concerns
- Well-documented with docstrings
- Web UI: Streamlit app for interactive use
- Python API: Direct use in code
- Jupyter Notebook: Interactive learning
- Chunk size (200-1000 characters)
- Chunk overlap (0-400 characters)
- Number of results (1-20)
- Embedding model (Ollama models)
- Similarity metric (cosine, dot product, L2)
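These settings would typically live in the `.env` file. The variable names below are assumptions for illustration, not the project's actual .env.example keys:

```ini
# Illustrative .env sketch — key names are assumed, check .env.example
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
CHUNK_SIZE=500
CHUNK_OVERLAP=100
TOP_K_RESULTS=5
SIMILARITY_METRIC=cosine
CHROMA_DB_PATH=./data/chroma_db
```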
- Persistent vector database (ChromaDB)
- Metadata tracking
- Error handling
- Logging
- Configuration management
The Jupyter notebook includes these hands-on experiments:
- Embedding Generation: See how text becomes vectors
- Similarity Metrics: Compare cosine, dot product, L2
- Chunking Impact: How chunk size affects results
- Search Pipeline: Index and search documents
- Query Variations: Try different query phrasings
- Embedding Visualization: Plot embeddings in 2D space
- Similarity Analysis: Understand similarity scores
- Limitations: Explore where semantic search struggles
- Customization: Add your own documents
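The "Embedding Visualization" experiment boils down to projecting high-dimensional vectors onto two principal components. Here is a self-contained sketch using PCA via NumPy's SVD; the notebook itself may use a library such as scikit-learn instead.

```python
# Project (n, d) embeddings onto their first two principal components,
# using only NumPy. Random vectors stand in for real embeddings here.
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Return an (n, 2) projection of the embeddings for plotting."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 768))  # 10 "documents", 768-dim vectors
points = pca_2d(fake_embeddings)
print(points.shape)  # (10, 2) — ready for a scatter plot
```

With real embeddings, semantically related documents tend to cluster together in the resulting 2D scatter plot.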
After working through this project, you'll understand:
- ✅ How embeddings capture semantic meaning
- ✅ Why similarity metrics work (geometry of vectors)
- ✅ How chunking affects retrieval quality
- ✅ What vector databases do and why
- ✅ How to build semantic search from scratch
- ✅ How to use Ollama for local embeddings
- ✅ How to chunk text strategically
- ✅ How to store and search embeddings
- ✅ How to evaluate search results
- ✅ How to extend the system
- ✅ Trade-offs in chunk size
- ✅ Precision vs. recall in search
- ✅ Limitations of semantic search
- ✅ When to use semantic vs. keyword search
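The precision-versus-recall trade-off above can be made concrete with a toy computation (this helper is not part of the project's code):

```python
# Precision and recall for a single query: precision is the fraction of
# retrieved documents that are relevant; recall is the fraction of
# relevant documents that were retrieved.
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}
# Retrieving more results tends to raise recall but lower precision.
print(precision_recall(["d1", "d2"], relevant))                    # (1.0, 0.5)
print(precision_recall(["d1", "d2", "d5", "d6", "d3"], relevant))  # (0.6, 0.75)
```

Raising the "number of results" setting in the app shifts you along exactly this trade-off.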
- Try different chunk sizes
- Experiment with overlap values
- Test different Ollama models
- Add more sample documents
- Add semantic chunking
- Implement reranking
- Add metadata filtering
- Create custom similarity metrics
- Add API endpoints (FastAPI)
- Implement caching
- Add authentication
- Scale to production (Pinecone)
- Add LLM answer generation
- Compare embedding models
- Study chunking strategies
- Analyze false positive patterns
- Implement hybrid search
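One simple way to approach the hybrid-search idea is weighted score fusion: blend a semantic similarity score with a keyword-overlap score. Production systems usually pair an embedding score with BM25 or reciprocal-rank fusion; the sketch below (with an assumed default weight `alpha`) only shows the principle.

```python
# Hybrid scoring sketch: combine a semantic score (e.g. cosine similarity
# from the vector store) with a crude keyword-overlap score.
def keyword_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(semantic: float, keyword: float, alpha: float = 0.7) -> float:
    """Blend the two scores; alpha weights the semantic side."""
    return alpha * semantic + (1 - alpha) * keyword

doc = "ollama serves local embedding models"
score = hybrid_score(semantic=0.82, keyword=keyword_score("local embedding models", doc))
print(round(score, 3))  # all 3 query words match: 0.7*0.82 + 0.3*1.0 = 0.874
```

The keyword component rescues exact-match queries (names, error codes) where pure semantic search often struggles.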
- Week 3: Multi-hop retrieval and query expansion
- Week 4: Reranking and relevance optimization
- Week 5: RAG (Retrieval Augmented Generation)
- Week 6: Production deployment and scaling
- Transformer models and attention
- Vector database comparison (Pinecone, Weaviate, Milvus)
- Information retrieval metrics (NDCG, MRR, MAP)
- Advanced NLP techniques
- Start Simple: Use the web app first, then code
- Read Docstrings: Every function has detailed documentation
- Check Examples: EXAMPLE_QUERIES.md has sample queries
- Experiment: Modify configurations and observe impacts
- Monitor: Watch the Ollama terminal to see embeddings being generated
- Read Code: Source files have inline comments explaining "why"
- Ollama running? (`ollama serve`)
- Model pulled? (`ollama list` shows nomic-embed-text)
- Dependencies installed? (`pip install -r requirements.txt`)
- Documents in folder? (`ls ./data/documents/`)
- Index created? (Use Stats tab in Streamlit)
- Queries working? (Try example queries first)
See README.md for detailed troubleshooting.
- Concept questions: → LEARNING_GUIDE.md
- Setup issues: → README.md (Troubleshooting section)
- Usage questions: → QUICK_REFERENCE.md
- Code examples: → app.py or Jupyter notebook
- Expected results: → EXAMPLE_QUERIES.md
- Complete implementation (7 Python modules)
- Streamlit web application
- Jupyter notebook with experiments
- Sample documents (3 markdown files)
- Configuration system
- Comprehensive documentation (4 guides)
- Example queries with expected results
- Quick reference guide
- Error handling and logging
- Git-ready project structure
Everything is ready to use! 🎉
You've successfully completed this project if you can:
- Setup: Run the system in < 10 minutes
- Understand: Explain embeddings, chunking, similarity, vector DB
- Use: Index documents and search via web app
- Code: Run Jupyter notebook and modify examples
- Extend: Add new features (reranking, metadata filtering, etc.)
- Teach: Explain to others how semantic search works
```bash
cd /path/to/RAG_101

# Quick start
pip install -r requirements.txt
streamlit run app.py

# OR for learning
jupyter notebook Semantic_Search_Complete_Learning.ipynb
```
Happy learning! 🚀