Example Queries and Expected Results

This document provides example queries you can try with the sample documents included in this project.

Sample Documents Included

machine_learning_intro.md - Introduction to machine learning, neural networks, applications
embeddings_guide.md - Understanding embeddings, semantic similarity, embedding models
vector_databases.md - Vector databases, ChromaDB, indexing strategies

Example Queries

✅ Good Matches (High Similarity Expected)

These queries will find highly relevant results because they match content directly:

Query: "What are embeddings?"
Expected: embeddings_guide.md - Direct explanation of embeddings
Similarity: 0.85-0.95

Query: "How does machine learning work?"
Expected: machine_learning_intro.md - Core ML concepts
Similarity: 0.80-0.92

Query: "What is ChromaDB?"
Expected: vector_databases.md - Specific section on ChromaDB
Similarity: 0.82-0.88

Query: "Neural networks"
Expected: machine_learning_intro.md - Section on neural networks
Similarity: 0.78-0.90

Query: "Semantic similarity"
Expected: embeddings_guide.md - Detailed explanation
Similarity: 0.85-0.95

🟡 Moderate Matches (Medium Similarity)

These queries will find related but not exact matches:

Query: "How do vectors help with text?"
Expected: Mix of embeddings_guide.md and vector_databases.md
Similarity: 0.65-0.80
Why: Related, but doesn't use the exact same words

Query: "Deep learning applications"
Expected: machine_learning_intro.md (has both concepts)
Similarity: 0.70-0.82
Why: Both terms present but not always together

Query: "Storage and indexing"
Expected: vector_databases.md
Similarity: 0.68-0.78
Why: Related to indexing but different focus

Query: "Comparing different models"
Expected: embeddings_guide.md (compares Word2Vec, BERT, etc.)
Similarity: 0.65-0.75
Why: General concept, not specific phrase

🔴 Challenging Queries (Lower Similarity)

These queries are harder because they require more inference:

Query: "What's the opposite of clustering?"
Expected: May find clustering content, but not its opposite
Similarity: 0.45-0.65
Challenge: Negation is hard for embeddings

Query: "NOT machine learning"
Expected: Will still find ML content (can't handle NOT)
Similarity: 0.50-0.70
Challenge: Negative queries confuse embedding-based search

Query: "Advantages and disadvantages of using APIs"
Expected: May find relevant content (the embeddings guide mentions API-based services)
Similarity: 0.55-0.75
Challenge: Specific domain concept, not directly covered

Query: "How fast is semantic search?"
Expected: Will find vector database content about speed
Similarity: 0.60-0.75
Challenge: Performance is mentioned but not deeply discussed

🎯 Cross-Document Queries

These queries require combining concepts from multiple documents:

Query: "How do embeddings enable semantic search?"
Expected: Content from both embeddings_guide.md and vector_databases.md
Similarity: 0.75-0.85
Why: Combines concepts across documents

Query: "Machine learning with vector databases"
Expected: Both machine_learning_intro.md and vector_databases.md
Similarity: 0.70-0.82
Why: Bridges two documents

Query: "Comparing different similarity metrics for search"
Expected: similarity.py comments + vector_databases.md
Similarity: 0.68-0.80
Why: Requires understanding from multiple sources

Understanding the Results

Similarity Score Interpretation

0.90-1.00  → Excellent match (direct answer)
0.80-0.89  → Very good match (highly relevant)
0.70-0.79  → Good match (relevant)
0.60-0.69  → Fair match (somewhat related)
0.50-0.59  → Weak match (tangentially related)
< 0.50     → Poor match (likely irrelevant)
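
If you log scores from a script, the bands above can be turned into a small lookup. This is a minimal sketch: `interpret_score` is a hypothetical helper, not part of the project, and its thresholds simply mirror the table.

```python
def interpret_score(score: float) -> str:
    """Map a similarity score to the label used in the table above."""
    # (lower bound, label), checked from the highest band down
    bands = [
        (0.90, "Excellent match (direct answer)"),
        (0.80, "Very good match (highly relevant)"),
        (0.70, "Good match (relevant)"),
        (0.60, "Fair match (somewhat related)"),
        (0.50, "Weak match (tangentially related)"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Poor match (likely irrelevant)"


for s in (0.93, 0.72, 0.41):
    print(f"{s:.2f} -> {interpret_score(s)}")
```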

Why Some Queries Return Lower Scores

Semantic gap: Query meaning differs from document content
Vocabulary mismatch: Different words used for same concept
Abstraction level: A very general query against specific content, or the reverse
Negation: "NOT" and "except" are hard for embeddings
Ambiguity: Query could mean multiple things

Experiments to Try

1. Same Meaning, Different Words

Query A: "What are embeddings?"
Query B: "Explain vector representations of text"
Query C: "How do you convert words to numbers?"

Expected: All should find similar results with similar scores
This demonstrates semantic understanding!

2. Chunk Size Impact

Try the same query with different chunk sizes:

# Edit .env and set ONE of these values per run
CHUNK_SIZE=300   # Smaller, more chunks
CHUNK_SIZE=500   # Default
CHUNK_SIZE=1000  # Larger, fewer chunks

# Reindex after each change, then rerun the same search
# Do the results change? How do the scores change?
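
To see why the chunk count moves with the setting, here is a rough sketch of fixed-size chunking. It assumes chunks are plain character windows; the project's own chunker may count tokens or respect sentence boundaries instead. `chunk_text` is a hypothetical name used only for illustration.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive character windows of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Stand-in for a real document: 3,000 characters of filler text
document = "x" * 3000

for size in (300, 500, 1000):
    chunks = chunk_text(document, size)
    print(f"CHUNK_SIZE={size}: {len(chunks)} chunks")
```

Smaller chunks give more, narrower targets for a query to match. Larger chunks carry more context per result but mix more topics into a single embedding.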

3. Overlap Impact

# Try one overlap setting per run (percentages assume CHUNK_SIZE=500)
CHUNK_OVERLAP=0    # No overlap
CHUNK_OVERLAP=50   # 10% overlap
CHUNK_OVERLAP=200  # 40% overlap

# Reindex, then search for the same query
# Does overlap affect which results are found?
# Does it improve coverage of concepts?
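
The effect of overlap is easiest to see on a phrase that straddles a chunk boundary. The sketch below again assumes character-based windows, which may not match the project's chunker. `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window starts after the previous one
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "aaaaaaa semantic search bbbbbbb"
phrase = "semantic search"

for overlap in (0, 8):
    chunks = chunk_with_overlap(text, chunk_size=16, overlap=overlap)
    intact = any(phrase in chunk for chunk in chunks)
    print(f"overlap={overlap}: {len(chunks)} chunks, phrase intact: {intact}")
```

With no overlap the phrase is cut between two chunks. With an overlap of 8, one chunk contains it whole, at the cost of more chunks to embed and store.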

4. Similarity Metric Comparison

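# The Streamlit UI would need to be modified to support switching metrics.
# For manual testing:

from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedding = EmbeddingFactory.create()
query_vec = embedding.embed("What are embeddings?")

# Any chunk text works here; this sentence is only a stand-in
chunk_vec = embedding.embed("Embeddings are vector representations of text.")

# Compare scores for the same pair under different metrics
for metric in ["cosine", "dot_product", "euclidean"]:
    similarity = SimilarityMetricFactory.create(metric)
    # Assumes every metric exposes .similarity(), as the cosine metric does
    score = similarity.similarity(query_vec, chunk_vec)
    print(f"{metric}: {score}")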
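
To build intuition for how the three metrics differ, this standalone sketch computes them by hand on toy vectors. It does not use the project's `SimilarityMetricFactory`. The functions here are textbook definitions, and the project's "euclidean" metric may convert distance into a similarity score rather than returning raw distance.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors; depends only on direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; lower means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 2.0, 3.0]
candidates = {
    "same_direction": [2.0, 4.0, 6.0],  # the query scaled by 2
    "different": [3.0, 2.0, 1.0],       # same length as the query, other direction
}

for name, vec in candidates.items():
    print(
        f"{name}: cosine={cosine_similarity(query, vec):.3f} "
        f"dot={dot_product(query, vec):.1f} "
        f"euclidean={euclidean_distance(query, vec):.3f}"
    )
```

Cosine and dot product both rank `same_direction` first, while Euclidean distance places `different` closer to the query. Rankings can therefore change with the metric unless the vectors are normalized.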

Debugging Lower-Than-Expected Scores

If you get lower scores than expected:

1. Check the document exists
   - Verify the file is in ./data/documents/
   - Check file extension (.md, .txt, .pdf)
2. Verify indexing worked
   - Check Stats tab shows chunks
   - Look at ./data/chroma_db/ folder size
3. Try simpler queries
   - Start with exact phrases from documents
   - Then try more abstract concepts
4. Check chunk boundaries
   - A concept might be split across chunks
   - Try searching for parts of the phrase
5. Inspect embeddings directly

from src.embeddings import EmbeddingFactory
from src.similarity import SimilarityMetricFactory

embedder = EmbeddingFactory.create()

# Get embeddings for comparison
vec1 = embedder.embed("machine learning")
vec2 = embedder.embed("artificial intelligence")

# Calculate similarity
cosine = SimilarityMetricFactory.create("cosine")
score = cosine.similarity(vec1, vec2)
print(f"Similarity: {score}")
# Should be relatively high, since ML is a subfield of AI

Real-World Query Examples

Try these realistic queries:

"Explain how embeddings work"
"What are the benefits of deep learning?"
"How do vector databases improve search?"
"List advantages of supervised learning"
"When should I use semantic search instead of keyword search?"
"Tell me about BERT and transformers"
"What is HNSW indexing?"

Comparing Results Across Settings

Record your own results in a comparison table like this one:

Query: "What are embeddings?"

| Setting | Chunks | Avg Score | Top Result |
|---------|--------|-----------|------------|
| Default | 45 | 0.82 | embeddings_guide.md chunk 1 |
| Size 300 | 89 | 0.79 | embeddings_guide.md chunk 2 |
| Size 1000 | 23 | 0.85 | embeddings_guide.md chunk 1 |

This helps you understand the trade-offs!
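
If you collect results in a script, a few lines of Python can render them in the same layout. This is only a sketch: `format_comparison_table` is a hypothetical helper, and the rows below are copied from the example table above rather than measured.

```python
def format_comparison_table(rows):
    """Render (setting, chunks, avg_score, top_result) tuples as a Markdown table."""
    lines = [
        "| Setting | Chunks | Avg Score | Top Result |",
        "|---------|--------|-----------|------------|",
    ]
    for setting, chunks, avg_score, top_result in rows:
        lines.append(f"| {setting} | {chunks} | {avg_score:.2f} | {top_result} |")
    return "\n".join(lines)


# Values taken from the example table above; replace them with your own runs
rows = [
    ("Default", 45, 0.82, "embeddings_guide.md chunk 1"),
    ("Size 300", 89, 0.79, "embeddings_guide.md chunk 2"),
    ("Size 1000", 23, 0.85, "embeddings_guide.md chunk 1"),
]

print(format_comparison_table(rows))
```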

Next Steps

1. Run the app: streamlit run app.py
2. Index documents: Use "Index Documents" tab
3. Try queries: Use "Search" tab with examples above
4. Experiment: Change settings and observe results
5. Learn: Read LEARNING_GUIDE.md to understand why results look the way they do

Happy experimenting! 🚀
