A chatbot that lets you upload any CSV file and ask natural language questions about the data. It returns a structured JSON response with both a human-readable answer and the relevant metadata.
Given a CSV dataset (in this case, a books dataset), build an API-based chatbot that:
- Accepts a user's question in plain English
- Analyzes the data and returns an answer
- Outputs the result in a strict JSON format:
{
"response": "the humane response",
"metadata": [
{"column1": "value", "column2": "value"},
{"column1": "value", "column2": "value"}
]
}

The bot should handle edge cases like missing data, special characters in queries (apostrophes, etc.), and malformed CSV rows.
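The contract above is small enough to pin down with a tiny serializer. This is an illustrative sketch, not the app's actual code (the real app returns this from a FastAPI route; the helper name is made up):

```python
import json

def build_payload(answer: str, rows: list[dict]) -> str:
    """Assemble the strict response contract: a human-readable
    answer plus the raw rows that back it."""
    return json.dumps({"response": answer, "metadata": rows}, ensure_ascii=False)

payload = build_payload(
    "Here are 2 highly rated books...",
    [{"title": "Dune", "average_rating": 4.25},
     {"title": "Emma", "average_rating": 3.99}],
)
print(payload)
```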
I went with a Text-to-SQL pipeline instead of the more common approach of loading the entire CSV into the LLM's context or writing ad-hoc Python/Pandas code at runtime.
Here's how it works:
- User uploads a CSV through the API
- The file gets loaded into an in-memory DuckDB database
- The schema (column names and types) is extracted and sent to the LLM
- When a user asks a question, the LLM generates a SQL query based on the schema
- DuckDB executes the query and returns the results
- The LLM then takes those results and writes a natural, human-friendly response
- Both the response and the raw data are returned as JSON
I considered a few different architectures before settling on this one:
Option A: Load CSV into LLM context
- Sounds simple, but CSV files can be huge. A 2000-row file easily exceeds token limits. You'd either truncate the data (losing accuracy) or spend a lot on API costs sending massive prompts. Not practical.
Option B: Let the LLM write Python/Pandas code
- This is what a lot of tutorials do. The LLM generates Python code, you `exec()` it, and return the result. The problem? Running arbitrary LLM-generated code is a security nightmare. One bad generation and you're executing `os.remove()` or worse. You'd need a sandbox, which adds complexity.
Option C: Text-to-SQL (what I chose)
- The LLM only generates SELECT queries. I validate that before execution — if it's not a SELECT, it gets rejected. DuckDB runs the query in memory, so there's no persistent database to corrupt. The LLM never sees the actual data during query generation, only the schema, which keeps the prompt small and focused. And SQL is deterministic — the same query always returns the same results, unlike LLM-generated Python which can be unpredictable.
The tradeoff is that SQL can't do everything (like generating charts), but for a data Q&A chatbot, it covers 95% of use cases and does it safely.
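The "reject anything that isn't a SELECT" gate described above could look something like this. It's a deliberately strict sketch, not the app's exact check; the keyword blocklist and function name are illustrative:

```python
import re

def validate_sql(query: str) -> str:
    """Allow only a single read-only statement through to DuckDB."""
    q = query.strip().rstrip(";")
    if ";" in q:
        raise ValueError("multiple statements are not allowed")
    # Must start with SELECT (or WITH, for CTEs).
    if not re.match(r"(?i)^\s*(select|with)\b", q):
        raise ValueError("only SELECT queries are allowed")
    # Belt-and-braces: reject write/DDL keywords anywhere in the string.
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|create|attach|copy)\b", q):
        raise ValueError("write/DDL keywords are not allowed")
    return q

print(validate_sql("SELECT title FROM data ORDER BY average_rating DESC LIMIT 5"))
```

Anything the gate rejects never reaches DuckDB, which is what makes the "no sandbox needed" tradeoff work.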
- Python — core language
- FastAPI — API server
- DuckDB — in-memory SQL engine for querying CSV data (fast, no setup, no external DB needed)
- Pandas — CSV loading with encoding fallback and bad-line handling
- Groq — LLM provider (using Llama 3.3 70B for SQL generation and response synthesis)
- Streamlit — frontend UI
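The "encoding fallback and bad-line handling" mentioned for Pandas can be sketched like this. The encoding list and helper name are assumptions about the approach, not the app's exact code:

```python
import os
import tempfile

import pandas as pd

def load_csv(path: str) -> pd.DataFrame:
    """Try UTF-8 first, fall back to latin-1, and skip malformed rows
    instead of failing the whole upload."""
    for enc in ("utf-8", "latin-1"):
        try:
            return pd.read_csv(path, encoding=enc, on_bad_lines="skip")
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode CSV")

# Demo: latin-1 bytes (the é) that are invalid UTF-8, plus a malformed row.
raw = "title,rating\nCafé,4.1\nbad,row,too,many,fields\nEmma,3.99\n".encode("latin-1")
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
tmp.write(raw)
tmp.close()

df = load_csv(tmp.name)       # UTF-8 fails, latin-1 succeeds, bad row skipped
print(len(df), list(df.columns))
os.unlink(tmp.name)
```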
csv_analysis_bot/
app.py # backend API (FastAPI + DuckDB + Groq)
frontend/
app.py # frontend UI (Streamlit)
uploads/ # uploaded CSVs stored here at runtime
.env # GROQ_API_KEY goes here
run.bat # starts both backend and frontend
Everything important lives in app.py. I deliberately kept it as a single file instead of splitting into services/models/orchestrators — for a project this size, splitting just adds indirection without real benefit.
- Install dependencies:
pip install fastapi uvicorn pandas duckdb groq python-dotenv streamlit requests
- Create a .env file:
GROQ_API_KEY=your_key_here
Get a free key from console.groq.com
- Run the project:
# Option 1: double-click run.bat
# Option 2: manual (two terminals)
python -m uvicorn app:app --port 8000
python -m streamlit run frontend/app.py
- Open http://localhost:8501, upload a CSV, and start asking questions.
Upload a CSV:
curl -X POST http://localhost:8000/upload -F "file=@books_cleaned.csv"

Ask a question:
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "Give me top 5 books with best ratings"}'Response:
{
"response": "Here are the top 5 books with the best ratings...",
"metadata": [
{"title": "The Complete Calvin and Hobbes", "average_rating": 4.82},
{"title": "Harry Potter Boxed Set", "average_rating": 4.78}
]
}- "Give me top 5 books with best ratings"
- "Recommend some books written by Agatha Christie"
- "Find a book that has 'One Woman's Search for Everything' in its subtitle"
- "Which author has written the most books in this dataset?"
- "Are there any books about Indian history?"
- "Show me the most popular books based on ratings count"