A project demonstrating semantic caching for LLM responses using Redis Vector DB, LangChain, and HuggingFace embeddings. The pipeline parses a PDF document, generates FAQ pairs using a Groq LLM, and stores them in a Redis semantic cache for fast, similarity-based retrieval.
- PDF Parsing — Uses the LlamaCloud SDK (`llama-cloud>=1.0`) to upload a PDF and parse it into structured markdown via an agentic OCR workflow.
- FAQ Generation — Uses a Groq LLM (Llama-3.1-8B) via LangChain to extract question/answer pairs from each document section.
- Semantic Caching — Embeds FAQ prompts with a HuggingFace sentence-transformer model and stores them in Redis using RedisVL's `SemanticCache`.
- Cache Lookup — Queries the cache with natural language; returns cached responses for semantically similar questions without calling the LLM again.
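The core of the project is RedisVL's `SemanticCache`: store a prompt/response pair once, then retrieve it for any sufficiently similar question. A minimal sketch of that idea (the index name, embedding model, and FAQ text below are illustrative, and import paths may vary slightly between RedisVL versions):

```python
# Minimal sketch of semantic caching with RedisVL (illustrative values only).
from redisvl.extensions.llmcache import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer

vectorizer = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")

cache = SemanticCache(
    name="faq_cache",
    redis_url="redis://localhost:6379",   # or your Redis Cloud URL
    vectorizer=vectorizer,
    distance_threshold=0.2,               # lower = stricter matching
)

# Store one FAQ pair...
cache.store(
    prompt="What engines are available for the 2022 Chevrolet Colorado?",
    response="A 2.5L four-cylinder, a 3.6L V6, and a 2.8L Duramax turbo-diesel.",
)

# ...then a semantically similar (not identical) question is served from Redis.
hits = cache.check(prompt="Which engine options does the Colorado offer?")
if hits:
    print(hits[0]["response"])   # no LLM call needed
```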
- Python 3.9+
- A Redis Cloud instance (or local Redis Stack)
- A Groq API key
- A LlamaCloud API key
```bash
pip install -r requirements.txt
```

Create a `credentials.py` file in the project root with the following variables:

```python
LLAMA_CLOUD_API_KEY = "<your-llamacloud-api-key>"
GROQ_API_KEY = "<your-groq-api-key>"
HF_TOKEN = "<your-huggingface-token>"
REDIS_HOST = "<your-redis-host>"
REDIS_PORT = <your-redis-port>
REDIS_PASSWORD = "<your-redis-password>"
```

| Variable | Description |
|---|---|
| `LLAMA_CLOUD_API_KEY` | LlamaCloud API key for PDF parsing |
| `GROQ_API_KEY` | Groq API key for the LLM |
| `HF_TOKEN` | HuggingFace token for embedding model access |
| `REDIS_HOST` | Redis instance hostname |
| `REDIS_PORT` | Redis instance port |
| `REDIS_PASSWORD` | Redis instance password |
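One possible way to wire these values in (a sketch; the notebook's exact setup cell may differ) is to export them as environment variables and build a single Redis connection URL:

```python
# Sketch: export credentials as environment variables and build a Redis URL.
import os
import credentials

os.environ["LLAMA_CLOUD_API_KEY"] = credentials.LLAMA_CLOUD_API_KEY
os.environ["GROQ_API_KEY"] = credentials.GROQ_API_KEY
os.environ["HF_TOKEN"] = credentials.HF_TOKEN

# redis-py and RedisVL both accept a single connection URL.
REDIS_URL = (
    f"redis://:{credentials.REDIS_PASSWORD}"
    f"@{credentials.REDIS_HOST}:{credentials.REDIS_PORT}"
)
```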
Open and run semantic-cache-project.ipynb:
- Set credentials and environment variables
- Initialize the Groq LLM and HuggingFace vectorizer
- Download the sample PDF (2022 Chevrolet Colorado brochure)
- Connect to Redis and optionally flush the database
- Upload and parse the PDF using the LlamaCloud SDK
- Split the parsed markdown into chunks with `MarkdownNodeParser`
- Generate FAQ prompt/response pairs with the Groq LLM (sketched after this list)
- Embed FAQ prompts and store them in the Redis semantic cache
- Query the cache with natural language questions
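Steps 6 and 7 might look roughly like the sketch below; the prompt wording and the `faq_pairs` parsing are illustrative assumptions, not the notebook's exact code:

```python
# Sketch of steps 6-7: chunk the parsed markdown, then ask the Groq LLM for FAQ pairs.
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser
from langchain_groq import ChatGroq

# Stand-in for the markdown returned by the LlamaCloud parse (step 5).
parsed_markdown = "## Engines\nThe 2022 Colorado offers a 2.5L I-4, a 3.6L V6, and a 2.8L turbo-diesel."

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([Document(text=parsed_markdown)])

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)  # reads GROQ_API_KEY

faq_pairs = []
for node in nodes:
    reply = llm.invoke(
        "Extract up to 3 FAQ pairs from this section, one per line, "
        "formatted as 'Q: ... | A: ...':\n\n" + node.get_content()
    )
    for line in reply.content.splitlines():
        if "| A:" in line:
            question, answer = line.split("| A:", 1)
            faq_pairs.append((question.replace("Q:", "").strip(), answer.strip()))
```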
```bash
streamlit run app.py
```

Features:
- Type any question and get an instant answer from the Redis semantic cache
- Cache misses fall back to the Groq LLM and can optionally be stored for next time
- Sidebar shows live Redis connection status, hit/miss stats, and a hit-rate progress bar
- Adjustable similarity threshold slider (tighter or looser matching)
- Populate cache panel — upload a new PDF or use the existing one to re-run the full parse → FAQ → embed → store pipeline
- Upload your own PDF directly from the UI — the app parses it with LlamaCloud, automatically extracts FAQ prompt/response pairs using the Groq LLM, and pre-populates the semantic cache, with no notebook or command-line steps required
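As a rough sketch of the sidebar wiring (widget labels and session-state keys here are assumptions, not necessarily `app.py`'s exact code):

```python
# Sketch of the sidebar controls: threshold slider plus hit/miss stats.
import streamlit as st

threshold = st.sidebar.slider(
    "Similarity threshold (vector distance)",
    min_value=0.05, max_value=0.5, value=0.2, step=0.05,
)
# In the app, changing this value clears and recreates the SemanticCache
# with the new distance_threshold before the next query.

hits = st.session_state.get("hits", 0)
misses = st.session_state.get("misses", 0)
total = hits + misses

st.sidebar.metric("Cache hits", hits)
st.sidebar.metric("Cache misses", misses)
st.sidebar.progress(hits / total if total else 0.0)   # hit-rate bar
```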
```
semantic-cache/
├── app.py                          # Streamlit web app
├── semantic-cache-project.ipynb    # Notebook walkthrough
├── credentials.py                  # API keys and secrets
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── data/                           # Downloaded PDF files
```
- Streamlit — Web app framework for the interactive UI
- llama-cloud — Official LlamaCloud Python SDK for agentic PDF parsing (`llama-cloud>=1.0`)
- llama-index-core — `Document` and `MarkdownNodeParser` for document chunking
- RedisVL — Vector layer for Redis, provides `SemanticCache`
- LangChain — LLM orchestration framework
- LangChain-Groq — Groq LLM integration
- sentence-transformers — HuggingFace embedding models
Both the notebook and the app use the same underlying SemanticCache from RedisVL, but they apply fundamentally different caching strategies.
The notebook follows a write-then-read pattern. The entire cache is built before any queries are made:
- A PDF is parsed into structured markdown with LlamaCloud.
- The markdown chunks are passed to the Groq LLM to generate FAQ prompt/response pairs in bulk.
- All FAQ prompts are embedded in a single batch with `redisvl_vectorizer.embed_many(...)`.
- Every entry is stored in Redis up-front via `cache.store(...)`.
- Queries are then made with `cache.check(...)` — the cache is fully populated and never written to again during querying.
- A cache miss simply returns an empty list; there is no LLM fallback at query time.
- The distance threshold is fixed at `0.2` and cannot be changed without re-running cells.
This is best described as static pre-generated: the cache is a snapshot derived entirely from a known document, and its contents do not change based on what users ask.
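In code, the batch population pattern looks roughly like this (a sketch reusing `cache`, `redisvl_vectorizer`, and the `faq_pairs` list from the sketches above; the notebook's exact cells may differ):

```python
# Sketch of the notebook's write-then-read pattern: populate everything first.
prompts = [question for question, _ in faq_pairs]
vectors = redisvl_vectorizer.embed_many(prompts)    # one batched embedding call

for (question, answer), vector in zip(faq_pairs, vectors):
    cache.store(prompt=question, response=answer, vector=vector)

# After population, querying only reads from the cache:
hits = cache.check(prompt="How much can the Colorado tow?")
answer = hits[0]["response"] if hits else None      # a miss returns an empty list
```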
app.py follows a read-through with write-on-miss pattern. The cache evolves dynamically during user interactions:
- On every query the cache is checked first (`cache.check(question)`).
- A cache hit returns the stored answer instantly — no LLM call is made.
- A cache miss falls through to the Groq LLM, and the response is optionally written back into the cache (`cache.store(...)`) so the same (or semantically similar) question hits the cache next time.
- The cache grows organically as users ask new questions; it is not limited to content derived from a single document.
- The distance threshold is adjustable at runtime via the sidebar slider — changing it clears and reinitialises the `SemanticCache` instance without restarting the server.
- Hit/miss statistics and a hit-rate progress bar give immediate visibility into how well the cache is performing.
- The app also exposes an optional "Populate cache from PDF" panel that re-runs the same batch pipeline as the notebook (parse → FAQ generation → embed → store), allowing you to seed the cache from a document before live queries begin.
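The query path can be summarised in a few lines (a sketch; `cache` and `llm` stand for the `SemanticCache` and `ChatGroq` instances created at startup, and the Streamlit wiring is omitted):

```python
# Sketch of the app's read-through with write-on-miss pattern.
def answer_question(question: str, store_on_miss: bool = True) -> str:
    hits = cache.check(prompt=question)
    if hits:                           # hit: answer comes straight from Redis
        return hits[0]["response"]

    reply = llm.invoke(question)       # miss: fall back to the Groq LLM
    if store_on_miss:                  # optionally write back for next time
        cache.store(prompt=question, response=reply.content)
    return reply.content
```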
| | Notebook | Streamlit App (app.py) |
|---|---|---|
| Cache population | Batch, upfront, before any queries | Organic, on-miss writes during live use |
| LLM at query time | Never called during querying | Called on cache miss as a fallback |
| Cache growth | Static after population step | Grows with every uncached question |
| Distance threshold | Hardcoded (`0.2`) | Adjustable via UI slider at runtime |
| Pre-generated from PDF | Primary workflow | Optional — upload any PDF via the sidebar UI; the app parses it and pre-generates FAQs into the cache automatically |
| Observability | Raw `cache.check()` output | Live hit/miss counters and hit-rate bar |
| Use case | Exploring and validating the pipeline | Production-ready interactive assistant |