
CSV Chatbot

A chatbot that lets you upload any CSV file and ask natural language questions about the data. It returns a structured JSON response with both a human-readable answer and the relevant metadata.


Problem Statement

Given a CSV dataset (in this case, a books dataset), build an API-based chatbot that:

  • Accepts a user's question in plain English
  • Analyzes the data and returns an answer
  • Outputs the result in a strict JSON format:
{
    "response": "the humane response",
    "metadata": [
        {"column1": "value", "column2": "value"},
        {"column1": "value", "column2": "value"}
    ]
}

The bot should handle edge cases like missing data, special characters in queries (apostrophes, etc.), and malformed CSV rows.
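
That contract can be modeled with a small stdlib dataclass. This is an illustrative sketch of the shape only, not the project's actual code — the class and field names are assumptions:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ChatResponse:
    """Illustrative container for the strict JSON contract above."""
    response: str
    metadata: list[dict] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts the dataclass (including nested dicts) to plain types
        return json.dumps(asdict(self), indent=4)


reply = ChatResponse(
    response="Found 1 matching book.",
    metadata=[{"title": "Dune", "average_rating": "4.25"}],
)
print(reply.to_json())
```

Keeping the metadata as a list of row dicts means the API can return however many rows the query produced without changing the envelope.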


My Approach — Text-to-SQL

I went with a Text-to-SQL pipeline instead of the more common approach of loading the entire CSV into the LLM's context or writing ad-hoc Python/Pandas code at runtime.

Here's how it works:

  1. User uploads a CSV through the API
  2. The file gets loaded into an in-memory DuckDB database
  3. The schema (column names and types) is extracted and sent to the LLM
  4. When a user asks a question, the LLM generates a SQL query based on the schema
  5. DuckDB executes the query and returns the results
  6. The LLM then takes those results and writes a natural, human-friendly response
  7. Both the response and the raw data are returned as JSON
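
The steps above can be sketched end to end. This toy version uses sqlite3 (stdlib) in place of DuckDB so it runs with no extra installs, and stubs the LLM call — treat `fake_llm_sql` as a placeholder for the real Groq request, not the repo's code:

```python
import csv
import io
import sqlite3

# 1-2. "Upload": load CSV rows into an in-memory SQL database
#      (DuckDB in the real project; sqlite3 here for a dependency-free sketch).
csv_text = "title,average_rating\nDune,4.25\nEmma,3.99\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (title TEXT, average_rating REAL)")
con.executemany("INSERT INTO data VALUES (:title, :average_rating)", rows)

# 3. Extract the schema for the LLM prompt: column names and types, never data.
schema = [(name, typ) for _, name, typ, *_ in con.execute("PRAGMA table_info(data)")]

# 4. The LLM would turn (question, schema) into SQL; stubbed here.
def fake_llm_sql(question: str, schema: list) -> str:
    return "SELECT title FROM data ORDER BY average_rating DESC LIMIT 1"

# 5. Execute the generated query against the in-memory database.
sql = fake_llm_sql("Which book has the best rating?", schema)
result = con.execute(sql).fetchall()

# 6-7. A second LLM call would phrase the rows naturally; here we just format.
answer = {
    "response": f"The top-rated book is {result[0][0]}.",
    "metadata": [{"title": r[0]} for r in result],
}
print(answer)
```

The key property is in step 3: the prompt contains only the schema, so its size stays constant no matter how large the CSV is.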

Why Text-to-SQL over other approaches?

I considered a few different architectures before settling on this one:

Option A: Load CSV into LLM context

  • Sounds simple, but CSV files can be huge. A 2000-row file easily exceeds token limits. You'd either truncate the data (losing accuracy) or spend a lot on API costs sending massive prompts. Not practical.

Option B: Let the LLM write Python/Pandas code

  • This is what a lot of tutorials do. The LLM generates Python code, you exec() it, and return the result. The problem? Running arbitrary LLM-generated code is a security nightmare. One bad generation and you're executing os.remove() or worse. You'd need a sandbox, which adds complexity.

Option C: Text-to-SQL (what I chose)

  • The LLM only generates SELECT queries. I validate that before execution — if it's not a SELECT, it gets rejected. DuckDB runs the query in memory, so there's no persistent database to corrupt. The LLM never sees the actual data during query generation, only the schema, which keeps the prompt small and focused. And SQL is deterministic — the same query always returns the same results, unlike LLM-generated Python which can be unpredictable.

The tradeoff is that SQL can't do everything (like generating charts), but for a data Q&A chatbot, it covers 95% of use cases and does it safely.
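
The SELECT-only gate mentioned for Option C can be as small as a prefix check plus a keyword denylist. This is a hedged sketch of the idea — the repo's actual validator may differ in detail:

```python
import re

# Hypothetical denylist; extend to match whatever the SQL engine supports.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|ATTACH|PRAGMA|COPY)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept only single SELECT/WITH statements with no write/DDL keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject stacked statements like "SELECT 1; DELETE ..."
        return False
    if not re.match(r"(?i)^\s*(SELECT|WITH)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)


print(is_safe_select("SELECT title FROM data LIMIT 5"))   # True
print(is_safe_select("DROP TABLE data"))                  # False
print(is_safe_select("SELECT 1; DELETE FROM data"))       # False
```

Rejecting anything that fails the check (rather than trying to sanitize it) keeps the failure mode simple: the bot just asks the LLM to regenerate.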


Tech Stack

  • Python — core language
  • FastAPI — API server
  • DuckDB — in-memory SQL engine for querying CSV data (fast, no setup, no external DB needed)
  • Pandas — CSV loading with encoding fallback and bad-line handling
  • Groq — LLM provider (using Llama 3.3 70B for SQL generation and response synthesis)
  • Streamlit — frontend UI
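
The "encoding fallback and bad-line handling" mentioned for Pandas typically looks something like the sketch below. The function name and fallback order are assumptions, not the repo's code:

```python
import pandas as pd


def load_csv_resilient(path: str) -> pd.DataFrame:
    """Try UTF-8 first, then a permissive fallback; skip malformed rows."""
    for encoding in ("utf-8", "latin-1"):
        try:
            # on_bad_lines="skip" drops rows whose column count doesn't match
            return pd.read_csv(path, encoding=encoding, on_bad_lines="skip")
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with any known encoding")
```

latin-1 decodes any byte sequence, so the loop always terminates with a DataFrame; malformed rows are silently dropped rather than failing the whole upload.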

Project Structure

csv_analysis_bot/
    app.py               # backend API (FastAPI + DuckDB + Groq)
    frontend/
        app.py           # frontend UI (Streamlit)
    uploads/             # uploaded CSVs stored here at runtime
    .env                 # GROQ_API_KEY goes here
    run.bat              # starts both backend and frontend

Everything important lives in app.py. I deliberately kept it as a single file instead of splitting into services/models/orchestrators — for a project this size, splitting just adds indirection without real benefit.


Setup

  1. Install dependencies:
pip install fastapi uvicorn pandas duckdb groq python-dotenv streamlit requests
  2. Create a .env file:
GROQ_API_KEY=your_key_here

Get a free key from console.groq.com

  3. Run the project:
# Option 1: double-click run.bat

# Option 2: manual (two terminals)
python -m uvicorn app:app --port 8000
python -m streamlit run frontend/app.py
  4. Open http://localhost:8501, upload a CSV, and start asking questions.

API Usage

Upload a CSV:

curl -X POST http://localhost:8000/upload -F "file=@books_cleaned.csv"

Ask a question:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"question": "Give me top 5 books with best ratings"}'

Response:

{
    "response": "Here are the top 5 books with the best ratings...",
    "metadata": [
        {"title": "The Complete Calvin and Hobbes", "average_rating": 4.82},
        {"title": "Harry Potter Boxed Set", "average_rating": 4.78}
    ]
}

Sample Questions

  • "Give me top 5 books with best ratings"
  • "Recommend some books written by Agatha Christie"
  • "Find a book that has 'One Woman's Search for Everything' in its subtitle"
  • "Which author has written the most books in this dataset?"
  • "Are there any books about Indian history?"
  • "Show me the most popular books based on ratings count"
