A chatbot that lets you upload any CSV file and ask natural language questions about the data. It returns a structured JSON response with both a human-readable answer and the relevant metadata.
Given a CSV dataset (in this case, a books dataset), build an API-based chatbot that:
- Accepts a user's question in plain English
- Analyzes the data and returns an answer
- Outputs the result in a strict JSON format:
{
"response": "the humane response",
"metadata": [
{"column1": "value", "column2": "value"},
{"column1": "value", "column2": "value"}
]
}

The bot should handle edge cases like missing data, special characters in queries (apostrophes, etc.), and malformed CSV rows.
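The contract above is small enough to pin down with a tiny serializer. This is an illustrative sketch, not the app's actual code (the real app returns this from a FastAPI route; the helper name is made up):

```python
import json

def build_payload(answer: str, rows: list[dict]) -> str:
    """Assemble the strict response contract: a human-readable
    answer plus the raw rows that back it."""
    return json.dumps({"response": answer, "metadata": rows}, ensure_ascii=False)

payload = build_payload(
    "Here are 2 highly rated books...",
    [{"title": "Dune", "average_rating": 4.25},
     {"title": "Emma", "average_rating": 3.99}],
)
print(payload)
```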
I went with a Text-to-SQL pipeline instead of the more common approach of loading the entire CSV into the LLM's context or writing ad-hoc Python/Pandas code at runtime.
Here's how it works:
- User uploads a CSV through the API
- The file gets loaded into an in-memory DuckDB database
- The schema (column names and types) is extracted and sent to the LLM
- When a user asks a question, the LLM generates a SQL query based on the schema
- DuckDB executes the query and returns the results
- The LLM then takes those results and writes a natural, human-friendly response
- Both the response and the raw data are returned as JSON
I considered a few different architectures before settling on this one:
Option A: Load CSV into LLM context
- Sounds simple, but CSV files can be huge. A 2000-row file easily exceeds token limits. You'd either truncate the data (losing accuracy) or spend a lot on API costs sending massive prompts. Not practical.
Option B: Let the LLM write Python/Pandas code
- This is what a lot of tutorials do. The LLM generates Python code, you `exec()` it, and return the result. The problem? Running arbitrary LLM-generated code is a security nightmare. One bad generation and you're executing `os.remove()` or worse. You'd need a sandbox, which adds complexity.
Option C: Text-to-SQL (what I chose)
- The LLM only generates SELECT queries. I validate that before execution — if it's not a SELECT, it gets rejected. DuckDB runs the query in memory, so there's no persistent database to corrupt. The LLM never sees the actual data during query generation, only the schema, which keeps the prompt small and focused. And SQL is deterministic — the same query always returns the same results, unlike LLM-generated Python which can be unpredictable.
The tradeoff is that SQL can't do everything (like generating charts), but for a data Q&A chatbot, it covers 95% of use cases and does it safely.
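The "reject anything that isn't a SELECT" gate described above could look something like this. It's a deliberately strict sketch, not the app's exact check; the keyword blocklist and function name are illustrative:

```python
import re

def validate_sql(query: str) -> str:
    """Allow only a single read-only statement through to DuckDB."""
    q = query.strip().rstrip(";")
    if ";" in q:
        raise ValueError("multiple statements are not allowed")
    # Must start with SELECT (or WITH, for CTEs).
    if not re.match(r"(?i)^\s*(select|with)\b", q):
        raise ValueError("only SELECT queries are allowed")
    # Belt-and-braces: reject write/DDL keywords anywhere in the string.
    if re.search(r"(?i)\b(insert|update|delete|drop|alter|create|attach|copy)\b", q):
        raise ValueError("write/DDL keywords are not allowed")
    return q

print(validate_sql("SELECT title FROM data ORDER BY average_rating DESC LIMIT 5"))
```

Anything the gate rejects never reaches DuckDB, which is what makes the "no sandbox needed" tradeoff work.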
- Python — core language
- FastAPI — API server
- DuckDB — in-memory SQL engine for querying CSV data (fast, no setup, no external DB needed)
- Pandas — CSV loading with encoding fallback and bad-line handling
- Groq — LLM provider (using Llama 3.3 70B for SQL generation and response synthesis)
- Streamlit — frontend UI
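The "encoding fallback and bad-line handling" mentioned for Pandas can be sketched like this. The encoding list and helper name are assumptions about the approach, not the app's exact code:

```python
import os
import tempfile

import pandas as pd

def load_csv(path: str) -> pd.DataFrame:
    """Try UTF-8 first, fall back to latin-1, and skip malformed rows
    instead of failing the whole upload."""
    for enc in ("utf-8", "latin-1"):
        try:
            return pd.read_csv(path, encoding=enc, on_bad_lines="skip")
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode CSV")

# Demo: latin-1 bytes (the é) that are invalid UTF-8, plus a malformed row.
raw = "title,rating\nCafé,4.1\nbad,row,too,many,fields\nEmma,3.99\n".encode("latin-1")
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
tmp.write(raw)
tmp.close()

df = load_csv(tmp.name)       # UTF-8 fails, latin-1 succeeds, bad row skipped
print(len(df), list(df.columns))
os.unlink(tmp.name)
```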
csv_analysis_bot/
app.py # backend API (FastAPI + DuckDB + Groq)
frontend/
app.py # frontend UI (Streamlit)
uploads/ # uploaded CSVs stored here at runtime
.env # GROQ_API_KEY goes here
run.bat # starts both backend and frontend
Everything important lives in app.py. I deliberately kept it as a single file instead of splitting into services/models/orchestrators — for a project this size, splitting just adds indirection without real benefit.
- Install dependencies:
pip install fastapi uvicorn pandas duckdb groq python-dotenv streamlit requests
- Create a .env file:
GROQ_API_KEY=your_key_here
Get a free key from console.groq.com
- Run the project:
# Option 1: double-click run.bat
# Option 2: manual (two terminals)
python -m uvicorn app:app --port 8000
python -m streamlit run frontend/app.py
- Open http://localhost:8501, upload a CSV, and start asking questions.
Upload a CSV:
curl -X POST http://localhost:8000/upload -F "file=@books_cleaned.csv"

Ask a question:
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "Give me top 5 books with best ratings"}'Response:
{
"response": "Here are the top 5 books with the best ratings...",
"metadata": [
{"title": "The Complete Calvin and Hobbes", "average_rating": 4.82},
{"title": "Harry Potter Boxed Set", "average_rating": 4.78}
]
}- "Give me top 5 books with best ratings"
- "Recommend some books written by Agatha Christie"
- "Find a book that has 'One Woman's Search for Everything' in its subtitle"
- "Which author has written the most books in this dataset?"
- "Are there any books about Indian history?"
- "Show me the most popular books based on ratings count"