heyitsgautham/data-analyst

AI Data Analyst 🤖📊

An intelligent data analysis system that combines web scraping, multi-format data extraction, and AI-powered analysis to answer complex data questions automatically. 🚀✨

✨ Features Included

📥 Multi-Source Data Ingestion

  • 🌐 Web scraping with Playwright (JavaScript-rendered pages)
  • 📋 HTML table extraction
  • 📄 PDF data extraction (text and tables)
  • 🖼️ Image OCR (extract data from screenshots/charts)
  • 📊 CSV, JSON, Excel, SQL file processing
  • 📦 Archive support (ZIP, TAR, TAR.GZ)

🧠 Intelligent Data Processing

  • 🔢 Automatic numeric field detection and cleaning
  • 💰 Currency, percentage, and scientific notation handling
  • 🔗 Multi-table relationship detection
  • 🗃️ Database query generation (DuckDB)
  • 📅 Datetime parsing and standardization
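
The numeric cleaning listed above (currency, percentage, and scientific notation) might look roughly like the following. This is a minimal sketch, not the project's actual code: the helper name `clean_numeric`, the set of currency symbols, and the choice to turn percentages into fractions are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper, not taken from the repository.
_CURRENCY = re.compile(r"[$€£¥₹]")
# Plain decimal or scientific notation, e.g. "12", "-3.5", ".5", "1.2e3".
_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def clean_numeric(raw: str) -> Optional[float]:
    """Convert strings such as "$1,234.50", "45%" or "1.2e3" to a float.

    Returns None when the text is not recognisably numeric.
    """
    text = raw.strip()
    is_percent = text.endswith("%")
    # Drop currency symbols, thousands separators, and the percent sign.
    text = _CURRENCY.sub("", text).replace(",", "").rstrip("%").strip()
    # Accounting style: "(123)" is read as -123.
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]
    if not _NUMBER.fullmatch(text):
        return None
    value = float(text)
    # Assumption: percentages are stored as fractions (45% -> 0.45).
    return value / 100 if is_percent else value


if __name__ == "__main__":
    for sample in ["$1,234.50", "45%", "1.2e3", "(123)", "n/a"]:
        print(repr(sample), "->", clean_numeric(sample))
```

Returning `None` for unparseable text, rather than raising, keeps a column-wide pass simple: the caller can count the `None` values to decide whether a column is numeric at all.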

🤖 AI-Powered Analysis

  • 🎯 Multiple LLM support (GPT, Claude, Gemini, Grok)
  • 📝 Task breakdown and execution planning
  • 💬 Natural language question answering
  • ⚙️ Automatic code generation for data analysis
  • 🔄 Error recovery and retry mechanisms

🚀 Advanced Features

  • 🧹 Automatic file cleanup after processing
  • 📂 Support for multiple concurrent files
  • 🛡️ Comprehensive error handling
  • 📊 Progress tracking and detailed logging

🚀 How to Use It

📋 Prerequisites

  • 🐍 Python 3.11+
  • 🐳 Docker (optional)

💻 Installation

  1. 📥 Clone the repository
     git clone https://github.com/heyitsgautham/data-analyst.git
     cd data-analyst
  2. 📦 Install dependencies
     pip install -r requirements.txt
     playwright install
  3. 🔐 Set up environment variables. Create a .env file with your API keys:
     API_KEY=your_openai_api_key
     CLAUDE_API_KEY=your_claude_api_key
     gemini_api=your_gemini_api_key
     gemini_api_2=your_backup_gemini_key
     grok_api=your_grok_api_key
     OCR_API_KEY=your_ocr_space_key
  4. ▶️ Run the application
     uvicorn app:app --host 0.0.0.0 --port 8000
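
Before starting the server, it can help to confirm that the keys from the .env example above are actually set. The sketch below uses only the standard library. The variable names are copied from the example, but the helper name `missing_keys` is hypothetical, and this README does not say which keys are strictly required, so treat the full list as an assumption.

```python
import os
from typing import Iterable, List, Mapping

# Variable names copied from the .env example above.
EXPECTED_KEYS = (
    "API_KEY",
    "CLAUDE_API_KEY",
    "gemini_api",
    "gemini_api_2",
    "grok_api",
    "OCR_API_KEY",
)


def missing_keys(env: Mapping[str, str],
                 expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected variable names that are unset or blank in env."""
    return [name for name in expected if not env.get(name, "").strip()]


if __name__ == "__main__":
    absent = missing_keys(os.environ)
    if absent:
        print("Missing environment variables:", ", ".join(absent))
    else:
        print("All expected environment variables are set.")
```

A .env file is not read into `os.environ` automatically, so run this in a shell where the values have been exported, or load the file first (for example with `load_dotenv()` from python-dotenv, which is listed in the tech stack).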

🐳 Using Docker

docker build -t ai-data-analyst .
docker run -p 8000:8000 --env-file .env ai-data-analyst

📡 API Usage

Endpoint: POST /api/

Request: Upload files via multipart/form-data

  • 📝 Questions file (.txt) - your data questions
  • 📊 Data sources: CSV, JSON, Excel, PDF, HTML, images, archives
  • 🔢 Multiple files supported simultaneously

Example with cURL:

curl -X POST http://localhost:8000/api/ \
  -F "questions=@questions.txt" \
  -F "data=@data.csv" \
  -F "image=@chart.png"

Response: JSON with analysis results, generated visualizations, and insights 📈✅
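
The same request can be sent from Python. The sketch below mirrors the cURL example using the third-party `requests` library, which is an assumption on my part (it is not listed in the tech stack). The form field names are copied from the cURL command; whether the server treats those names as significant is not stated here. The helper names `build_files` and `ask` and the timeout value are hypothetical.

```python
import mimetypes
from pathlib import Path
from typing import Dict, Tuple

API_URL = "http://localhost:8000/api/"  # endpoint and port from the examples above


def build_files(paths: Dict[str, str]) -> Dict[str, Tuple[str, bytes, str]]:
    """Map form field name -> (filename, content, MIME type) for upload."""
    files = {}
    for field, path in paths.items():
        p = Path(path)
        # Fall back to a generic binary type when the extension is unknown.
        mime = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
        files[field] = (p.name, p.read_bytes(), mime)
    return files


def ask(paths: Dict[str, str], timeout: float = 300.0) -> dict:
    """POST the files as multipart/form-data and return the decoded JSON."""
    # Imported lazily so build_files can be used without requests installed.
    import requests

    response = requests.post(API_URL, files=build_files(paths), timeout=timeout)
    response.raise_for_status()
    return response.json()
```

With the server running, `ask({"questions": "questions.txt", "data": "data.csv", "image": "chart.png"})` would send the same three files as the cURL command. The long timeout is an arbitrary choice, on the assumption that LLM-backed analysis can take a while.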

🏗️ Solution Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Client Request                         │
│            (Files: TXT, CSV, PDF, Images, etc.)             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                  FastAPI Application Layer                  │
│                  (app.py - Main Router)                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
          ┌───────────┴───────────┐
          ▼                       ▼
┌──────────────────┐    ┌──────────────────┐
│  File Processing │    │   Data Scraping  │
│   & Extraction   │    │  (data_scrape.py)│
│                  │    │                  │
│ • Archive extract│    │ • Web scraping   │
│ • OCR/Image      │    │ • HTML parsing   │
│ • PDF parsing    │    │ • Table extract  │
│ • Format detect  │    │ • Numeric clean  │
└────────┬─────────┘    └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
         ┌─────────────────────┐
         │   Data Unification  │
         │   (DuckDB Engine)   │
         │                     │
         │ • Multi-table join  │
         │ • SQL generation    │
         │ • Query execution   │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   AI Analysis Layer │
         │                     │
         │ • Task breakdown    │
         │ • Code generation   │
         │ • LLM orchestration │
         │ • Multi-model retry │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   Response Builder  │
         │                     │
         │ • Visualization     │
         │ • JSON formatting   │
         │ • File cleanup      │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   Client Response   │
         │   (JSON + Charts)   │
         └─────────────────────┘

🔑 Key Components:

  • 🌐 FastAPI Backend: Handles HTTP requests and orchestrates processing
  • 🕷️ Data Scraper: Fetches and extracts data from various sources
  • 🗄️ DuckDB Engine: In-memory database for complex queries
  • 🤖 AI Orchestrator: Manages multiple LLM providers with fallback
  • 🧹 Cleanup Manager: Tracks and removes temporary files
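
The provider fallback described for the AI Orchestrator could be sketched as follows. This is an illustration only, not the repository's code: `Provider`, `ask_with_fallback`, and the retry count are hypothetical, and real provider calls are replaced with plain callables so the block runs without API keys.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, callable) pair. The callable takes a prompt and
# returns the model's text, raising an exception on failure.
Provider = Tuple[str, Callable[[str], str]]


def ask_with_fallback(prompt: str, providers: Sequence[Provider],
                      retries_per_provider: int = 2) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, answer).

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors: List[str] = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Stand-in providers for demonstration only.
def flaky(prompt: str) -> str:
    raise TimeoutError("simulated timeout")


def echo(prompt: str) -> str:
    return f"answer to: {prompt}"


if __name__ == "__main__":
    print(ask_with_fallback("How many rows?", [("primary", flaky), ("backup", echo)]))
```

Collecting every error message before raising makes it easier to see, from a single log line, which providers were tried and why each one failed.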

🛠️ Tech Stack

🎯 Core Framework

  • FastAPI - High-performance async web framework 🚀
  • Python 3.11 - Programming language 🐍

📊 Data Processing

  • Pandas - Data manipulation and analysis 🐼
  • NumPy - Numerical computing 🔢
  • DuckDB - In-memory analytical database 🦆
  • Tabula - PDF table extraction 📄
  • pdfplumber - PDF text extraction 📖
  • openpyxl - Excel file handling 📗

🌐 Web Scraping

  • Playwright - Browser automation 🎭
  • playwright-stealth - Anti-detection 🥷
  • BeautifulSoup4 - HTML parsing 🍜
  • httpx - Async HTTP client ⚡
  • Selenium - Alternative browser automation 🌐

🤖 AI & ML

  • OpenAI GPT - Language model 🧠
  • Claude (Anthropic) - Language model 💭
  • Google Gemini - Language model ✨
  • Grok (xAI) - Language model 🚀
  • OCR.space API - Optical character recognition 👁️

📈 Visualization

  • Matplotlib - Chart generation 📊
  • Seaborn - Statistical visualizations 🎨
  • NetworkX - Graph/network visualizations 🕸️

🔧 Utilities

  • python-dotenv - Environment management 🔐
  • python-multipart - File upload handling 📤
  • chardet - Character encoding detection 🔤

🚢 Deployment

  • Docker - Containerization 🐳
  • Uvicorn - ASGI server ⚡

🙏 Thank You!

Thank you for using AI Data Analyst! This project aims to democratize data analysis by making it accessible through natural language. 💡

👨‍💻 Created by: @heyitsgautham

🤝 Contributions Welcome! Feel free to open issues or submit pull requests.

📜 License: See LICENSE file

⭐ Star this repo if you find it helpful!

Made with ❤️ and AI
