This document provides example queries you can try with the sample documents included in this project.
- machine_learning_intro.md - Introduction to machine learning, neural networks, applications
- embeddings_guide.md - Understanding embeddings, semantic similarity, embedding models
- vector_databases.md - Vector databases, ChromaDB, indexing strategies
These queries will find highly relevant results because they match content directly:
Query: "What are embeddings?"
Expected: embeddings_guide.md - Direct explanation of embeddings
Similarity: 0.85-0.95
Query: "How does machine learning work?"
Expected: machine_learning_intro.md - Core ML concepts
Similarity: 0.80-0.92
Query: "What is ChromaDB?"
Expected: vector_databases.md - Specific section on ChromaDB
Similarity: 0.82-0.88
Query: "Neural networks"
Expected: machine_learning_intro.md - Section on neural networks
Similarity: 0.78-0.90
Query: "Semantic similarity"
Expected: embeddings_guide.md - Detailed explanation
Similarity: 0.85-0.95
These queries will find related but not exact matches:
Query: "How do vectors help with text?"
Expected: Mix of embeddings_guide.md and vector_databases.md
Similarity: 0.65-0.80
Why: Related, but doesn't use the exact same words
Query: "Deep learning applications"
Expected: machine_learning_intro.md (has both concepts)
Similarity: 0.70-0.82
Why: Both terms present but not always together
Query: "Storage and indexing"
Expected: vector_databases.md
Similarity: 0.68-0.78
Why: Related to indexing but different focus
Query: "Comparing different models"
Expected: embeddings_guide.md (compares Word2Vec, BERT, etc.)
Similarity: 0.65-0.75
Why: General concept, not specific phrase
These queries are harder because they require more inference:
Query: "What's the opposite of clustering?"
Expected: May find clustering content, but not its opposite
Similarity: 0.45-0.65
Challenge: Negation is hard for embeddings
Query: "NOT machine learning"
Expected: Will still find ML content (can't handle NOT)
Similarity: 0.50-0.70
Challenge: Negative queries confuse embedding-based search
Query: "Advantages and disadvantages of using APIs"
Expected: May find relevant content (the embeddings guide mentions API-based services)
Similarity: 0.55-0.75
Challenge: Specific domain concept, not directly covered
Query: "How fast is semantic search?"
Expected: Will find vector database content about speed
Similarity: 0.60-0.75
Challenge: Performance is mentioned but not deeply discussed
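You can see the negation problem directly. The sketch below assumes the EmbeddingFactory and SimilarityMetricFactory APIs shown in the code examples later in this guide; the passage text is an invented stand-in for an indexed chunk. The two queries typically score close to each other:

```python
from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedder = EmbeddingFactory.create()
cosine = SimilarityMetricFactory.create("cosine")

# Illustrative stand-in for an indexed chunk about clustering
passage_vec = embedder.embed("Clustering groups similar data points together.")

# Negation barely moves the query vector, so both queries
# retrieve the clustering passage with similar scores.
for query in ["clustering", "NOT clustering"]:
    score = cosine.similarity(embedder.embed(query), passage_vec)
    print(f"{query!r}: {score:.3f}")
```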
These queries require combining concepts from multiple documents:
Query: "How do embeddings enable semantic search?"
Expected: Content from both embeddings_guide.md and vector_databases.md
Similarity: 0.75-0.85
Why: Combines concepts across documents
Query: "Machine learning with vector databases"
Expected: Both machine_learning_intro.md and vector_databases.md
Similarity: 0.70-0.82
Why: Bridges two documents
Query: "Comparing different similarity metrics for search"
Expected: similarity.py comments + vector_databases.md
Similarity: 0.68-0.80
Why: Requires understanding from multiple sources
How to interpret similarity scores:
0.90-1.00 → Excellent match (direct answer)
0.80-0.89 → Very good match (highly relevant)
0.70-0.79 → Good match (relevant)
0.60-0.69 → Fair match (somewhat related)
0.50-0.59 → Weak match (tangentially related)
< 0.50 → Poor match (likely irrelevant)
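If you want to attach these labels programmatically, a minimal helper like the hypothetical label_score below (not part of the project) encodes the same bands:

```python
def label_score(score: float) -> str:
    """Map a similarity score to the quality bands above (hypothetical helper)."""
    bands = [
        (0.90, "Excellent match (direct answer)"),
        (0.80, "Very good match (highly relevant)"),
        (0.70, "Good match (relevant)"),
        (0.60, "Fair match (somewhat related)"),
        (0.50, "Weak match (tangentially related)"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Poor match (likely irrelevant)"

print(label_score(0.83))  # Very good match (highly relevant)
```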
Common reasons a score comes out lower than expected:
- Semantic gap: Query meaning differs from document content
- Vocabulary mismatch: Different words used for same concept
- Abstraction level: Very general vs specific queries
- Negation: "NOT" and "except" are hard for embeddings
- Ambiguity: Query could mean multiple things
Try rephrasing the same question and comparing results:
Query A: "What are embeddings?"
Query B: "Explain vector representations of text"
Query C: "How do you convert words to numbers?"
Expected: All should find similar results with similar scores
This demonstrates semantic understanding!
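One way to check this yourself, reusing the project's factories (same API as in the snippets below):

```python
from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedder = EmbeddingFactory.create()
cosine = SimilarityMetricFactory.create("cosine")

queries = [
    "What are embeddings?",
    "Explain vector representations of text",
    "How do you convert words to numbers?",
]
vectors = [embedder.embed(q) for q in queries]

# High pairwise scores mean the model treats the three
# phrasings as near-synonyms.
for i in range(len(queries)):
    for j in range(i + 1, len(queries)):
        print(f"Query {i} vs {j}: {cosine.similarity(vectors[i], vectors[j]):.3f}")
```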
Try the same query with different chunk sizes:

```bash
# Edit .env to try different sizes
CHUNK_SIZE=300   # Smaller, more chunks
CHUNK_SIZE=500   # Default
CHUNK_SIZE=1000  # Larger, fewer chunks

# Reindex and search.
# Do the results change? How do the scores change?
```

Then try different overlap settings:

```bash
CHUNK_OVERLAP=0    # No overlap
CHUNK_OVERLAP=50   # 10% overlap
CHUNK_OVERLAP=200  # 40% overlap

# Search for the same query.
# Does overlap affect which results are found?
# Does it improve coverage of concepts?
```

You can also compare similarity metrics. In the Streamlit UI you'd modify the code to support this; for manual testing:

```python
from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedder = EmbeddingFactory.create()
query_vec = embedder.embed("What are embeddings?")

# Stand-in for an indexed chunk; in practice, score real chunks
chunk_vec = embedder.embed("Embeddings are dense vector representations of text.")

# Compare results with different metrics
for metric_name in ["cosine", "dot_product", "euclidean"]:
    metric = SimilarityMetricFactory.create(metric_name)
    print(f"{metric_name}: {metric.similarity(query_vec, chunk_vec):.3f}")
```
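Keep in mind that the three metrics aren't directly comparable: cosine similarity ignores vector magnitude, dot product does not, and Euclidean is a distance (smaller means closer), so the project's metric classes presumably normalize it into a higher-is-better score. Expect the rankings to mostly agree while the absolute numbers differ.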
If you get lower scores than expected:

- Check the document exists
  - Verify the file is in ./data/documents/
  - Check the file extension (.md, .txt, .pdf)
- Verify indexing worked
  - Check that the Stats tab shows chunks
  - Look at the size of the ./data/chroma_db/ folder
- Try simpler queries
  - Start with exact phrases from the documents
  - Then try more abstract concepts
- Check chunk boundaries
  - A concept might be split across chunks
  - Try searching for parts of the phrase
- Inspect embeddings directly:

```python
from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedder = EmbeddingFactory.create()

# Get embeddings for comparison
vec1 = embedder.embed("machine learning")
vec2 = embedder.embed("artificial intelligence")

# Calculate similarity
cosine = SimilarityMetricFactory.create("cosine")
score = cosine.similarity(vec1, vec2)
print(f"Similarity: {score}")  # Should be high since ML is a type of AI
```
Try these realistic queries:
- "Explain how embeddings work"
- "What are the benefits of deep learning?"
- "How do vector databases improve search?"
- "List advantages of supervised learning"
- "When should I use semantic search instead of keyword search?"
- "Tell me about BERT and transformers"
- "What is HNSW indexing?"
Create a comparison table:
Query: "What are embeddings?"
| Setting | Chunks | Avg Score | Top Result |
|---------|--------|-----------|------------|
| Default | 45 | 0.82 | embeddings_guide.md chunk 1 |
| Size 300 | 89 | 0.79 | embeddings_guide.md chunk 2 |
| Size 1000 | 23 | 0.85 | embeddings_guide.md chunk 1 |
This helps you understand the trade-offs!
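The project's chunker may split on characters, tokens, or sentences, but even a naive character-based splitter (a sketch, not the project's implementation) shows why smaller sizes produce more chunks:

```python
def chunk(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive character-based chunker (illustrative sketch only)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "Embeddings map text to vectors. " * 40  # 1,280 characters
for size in (300, 500, 1000):
    print(f"CHUNK_SIZE={size}: {len(chunk(doc, size, overlap=50))} chunks")
```

More, smaller chunks give finer-grained matches but less context per hit; fewer, larger chunks keep related context together but can dilute the score when only part of a chunk is relevant.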
To get started:
- Run the app: streamlit run app.py
- Index documents: use the "Index Documents" tab
- Try queries: use the "Search" tab with the examples above
- Experiment: change settings and observe how the results shift
- Learn: read LEARNING_GUIDE.md to understand why the results look the way they do
Happy experimenting! 🚀