A project demonstrating semantic caching for LLM responses using Redis Vector DB, LangChain, and HuggingFace embeddings. The pipeline parses a PDF document, generates FAQ pairs using a Groq LLM, and stores them in a Redis semantic cache for fast, similarity-based retrieval.
- PDF Parsing — Uses the LlamaCloud SDK (`llama-cloud>=1.0`) to upload a PDF and parse it into structured markdown via an agentic OCR workflow.
- FAQ Generation — Uses a Groq LLM (Llama-3.1-8B) via LangChain to extract question/answer pairs from each document section.
- Semantic Caching — Embeds FAQ prompts with a HuggingFace sentence-transformer model and stores them in Redis using RedisVL's `SemanticCache`.
- Cache Lookup — Queries the cache with natural language; returns cached responses for semantically similar questions without calling the LLM again.
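The core of the project is RedisVL's `SemanticCache`: store a prompt/response pair once, then retrieve it for any sufficiently similar question. A minimal sketch of that idea (the index name, embedding model, and FAQ text below are illustrative, and import paths may vary slightly between RedisVL versions):

```python
# Minimal sketch of semantic caching with RedisVL (illustrative values only).
from redisvl.extensions.llmcache import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer

vectorizer = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")

cache = SemanticCache(
    name="faq_cache",
    redis_url="redis://localhost:6379",   # or your Redis Cloud URL
    vectorizer=vectorizer,
    distance_threshold=0.2,               # lower = stricter matching
)

# Store one FAQ pair...
cache.store(
    prompt="What engines are available for the 2022 Chevrolet Colorado?",
    response="A 2.5L four-cylinder, a 3.6L V6, and a 2.8L Duramax turbo-diesel.",
)

# ...then a semantically similar (not identical) question is served from Redis.
hits = cache.check(prompt="Which engine options does the Colorado offer?")
if hits:
    print(hits[0]["response"])   # no LLM call needed
```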
- Python 3.9+
- A Redis Cloud instance (or local Redis Stack)
- A Groq API key
- A LlamaCloud API key
```bash
pip install -r requirements.txt
```

Create a `credentials.py` file in the project root with the following variables:

```python
LLAMA_CLOUD_API_KEY = "<your-llamacloud-api-key>"
GROQ_API_KEY = "<your-groq-api-key>"
HF_TOKEN = "<your-huggingface-token>"
REDIS_HOST = "<your-redis-host>"
REDIS_PORT = <your-redis-port>
REDIS_PASSWORD = "<your-redis-password>"
```

| Variable | Description |
|---|---|
| `LLAMA_CLOUD_API_KEY` | LlamaCloud API key for PDF parsing |
| `GROQ_API_KEY` | Groq API key for the LLM |
| `HF_TOKEN` | HuggingFace token for embedding model access |
| `REDIS_HOST` | Redis instance hostname |
| `REDIS_PORT` | Redis instance port |
| `REDIS_PASSWORD` | Redis instance password |
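One possible way to wire these values in (a sketch; the notebook's exact setup cell may differ) is to export them as environment variables and build a single Redis connection URL:

```python
# Sketch: export credentials as environment variables and build a Redis URL.
import os
import credentials

os.environ["LLAMA_CLOUD_API_KEY"] = credentials.LLAMA_CLOUD_API_KEY
os.environ["GROQ_API_KEY"] = credentials.GROQ_API_KEY
os.environ["HF_TOKEN"] = credentials.HF_TOKEN

# redis-py and RedisVL both accept a single connection URL.
REDIS_URL = (
    f"redis://:{credentials.REDIS_PASSWORD}"
    f"@{credentials.REDIS_HOST}:{credentials.REDIS_PORT}"
)
```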
Open and run semantic-cache-project.ipynb:
- Set credentials and environment variables
- Initialize the Groq LLM and HuggingFace vectorizer
- Download the sample PDF (2022 Chevrolet Colorado brochure)
- Connect to Redis and optionally flush the database
- Upload and parse the PDF using the LlamaCloud SDK
- Split the parsed markdown into chunks with `MarkdownNodeParser`
- Generate FAQ prompt/response pairs with the Groq LLM (sketched after this list)
- Embed FAQ prompts and store them in the Redis semantic cache
- Query the cache with natural language questions
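Steps 6 and 7 might look roughly like the sketch below; the prompt wording and the `faq_pairs` parsing are illustrative assumptions, not the notebook's exact code:

```python
# Sketch of steps 6-7: chunk the parsed markdown, then ask the Groq LLM for FAQ pairs.
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser
from langchain_groq import ChatGroq

# Stand-in for the markdown returned by the LlamaCloud parse (step 5).
parsed_markdown = "## Engines\nThe 2022 Colorado offers a 2.5L I-4, a 3.6L V6, and a 2.8L turbo-diesel."

parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([Document(text=parsed_markdown)])

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)  # reads GROQ_API_KEY

faq_pairs = []
for node in nodes:
    reply = llm.invoke(
        "Extract up to 3 FAQ pairs from this section, one per line, "
        "formatted as 'Q: ... | A: ...':\n\n" + node.get_content()
    )
    for line in reply.content.splitlines():
        if "| A:" in line:
            question, answer = line.split("| A:", 1)
            faq_pairs.append((question.replace("Q:", "").strip(), answer.strip()))
```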
```bash
streamlit run app.py
```

Features:
- Type any question and get an instant answer from the Redis semantic cache
- Cache misses fall back to the Groq LLM and can optionally be stored for next time
- Sidebar shows live Redis connection status, hit/miss stats, and a hit-rate progress bar
- Adjustable similarity threshold slider (tighter or looser matching)
- Populate cache panel — upload a new PDF or use the existing one to re-run the full parse → FAQ → embed → store pipeline
- Upload your own PDF directly from the UI — the app parses it with LlamaCloud, automatically extracts FAQ prompt/response pairs using the Groq LLM, and pre-populates the semantic cache, with no notebook or command-line steps required
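As a rough sketch of the sidebar wiring (widget labels and session-state keys here are assumptions, not necessarily `app.py`'s exact code):

```python
# Sketch of the sidebar controls: threshold slider plus hit/miss stats.
import streamlit as st

threshold = st.sidebar.slider(
    "Similarity threshold (vector distance)",
    min_value=0.05, max_value=0.5, value=0.2, step=0.05,
)
# In the app, changing this value clears and recreates the SemanticCache
# with the new distance_threshold before the next query.

hits = st.session_state.get("hits", 0)
misses = st.session_state.get("misses", 0)
total = hits + misses

st.sidebar.metric("Cache hits", hits)
st.sidebar.metric("Cache misses", misses)
st.sidebar.progress(hits / total if total else 0.0)   # hit-rate bar
```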
```
semantic-cache/
├── app.py                          # Streamlit web app
├── semantic-cache-project.ipynb    # Notebook walkthrough
├── credentials.py                  # API keys and secrets
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── data/                           # Downloaded PDF files
```
- Streamlit — Web app framework for the interactive UI
- llama-cloud — Official LlamaCloud Python SDK for agentic PDF parsing (`llama-cloud>=1.0`)
- llama-index-core — `Document` and `MarkdownNodeParser` for document chunking
- RedisVL — Vector layer for Redis, provides `SemanticCache`
- LangChain — LLM orchestration framework
- LangChain-Groq — Groq LLM integration
- sentence-transformers — HuggingFace embedding models
Both the notebook and the app use the same underlying SemanticCache from RedisVL, but they apply fundamentally different caching strategies.
The notebook follows a write-then-read pattern. The entire cache is built before any queries are made:
- A PDF is parsed into structured markdown with LlamaCloud.
- The markdown chunks are passed to the Groq LLM to generate FAQ prompt/response pairs in bulk.
- All FAQ prompts are embedded in a single batch with `redisvl_vectorizer.embed_many(...)`.
- Every entry is stored in Redis up-front via `cache.store(...)`.
- Queries are then made with `cache.check(...)` — the cache is fully populated and never written to again during querying.
- A cache miss simply returns an empty list; there is no LLM fallback at query time.
- The distance threshold is fixed at `0.2` and cannot be changed without re-running cells.
This is best described as static pre-generated: the cache is a snapshot derived entirely from a known document, and its contents do not change based on what users ask.
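In code, the batch population pattern looks roughly like this (a sketch reusing `cache`, `redisvl_vectorizer`, and the `faq_pairs` list from the sketches above; the notebook's exact cells may differ):

```python
# Sketch of the notebook's write-then-read pattern: populate everything first.
prompts = [question for question, _ in faq_pairs]
vectors = redisvl_vectorizer.embed_many(prompts)    # one batched embedding call

for (question, answer), vector in zip(faq_pairs, vectors):
    cache.store(prompt=question, response=answer, vector=vector)

# After population, querying only reads from the cache:
hits = cache.check(prompt="How much can the Colorado tow?")
answer = hits[0]["response"] if hits else None      # a miss returns an empty list
```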
app.py follows a read-through with write-on-miss pattern. The cache evolves dynamically during user interactions:
- On every query the cache is checked first (`cache.check(question)`).
- A cache hit returns the stored answer instantly — no LLM call is made.
- A cache miss falls through to the Groq LLM, and the response is optionally written back into the cache (`cache.store(...)`) so the same (or semantically similar) question hits the cache next time.
- The cache grows organically as users ask new questions; it is not limited to content derived from a single document.
- The distance threshold is adjustable at runtime via the sidebar slider — changing it clears and reinitialises the `SemanticCache` instance without restarting the server.
- Hit/miss statistics and a hit-rate progress bar give immediate visibility into how well the cache is performing.
- The app also exposes an optional "Populate cache from PDF" panel that re-runs the same batch pipeline as the notebook (parse → FAQ generation → embed → store), allowing you to seed the cache from a document before live queries begin.
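The query path can be summarised in a few lines (a sketch; `cache` and `llm` stand for the `SemanticCache` and `ChatGroq` instances created at startup, and the Streamlit wiring is omitted):

```python
# Sketch of the app's read-through with write-on-miss pattern.
def answer_question(question: str, store_on_miss: bool = True) -> str:
    hits = cache.check(prompt=question)
    if hits:                           # hit: answer comes straight from Redis
        return hits[0]["response"]

    reply = llm.invoke(question)       # miss: fall back to the Groq LLM
    if store_on_miss:                  # optionally write back for next time
        cache.store(prompt=question, response=reply.content)
    return reply.content
```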
| | Notebook | Streamlit App (app.py) |
|---|---|---|
| Cache population | Batch, upfront, before any queries | Organic, on-miss writes during live use |
| LLM at query time | Never called during querying | Called on cache miss as a fallback |
| Cache growth | Static after population step | Grows with every uncached question |
| Distance threshold | Hardcoded (`0.2`) | Adjustable via UI slider at runtime |
| Pre-generated from PDF | Primary workflow | Optional — upload any PDF via the sidebar UI; the app parses it and pre-generates FAQs into the cache automatically |
| Observability | Raw `cache.check()` output | Live hit/miss counters and hit-rate bar |
| Use case | Exploring and validating the pipeline | Production-ready interactive assistant |