AI Data Analyst 🤖📊

An intelligent data analysis system that combines web scraping, multi-format data extraction, and AI-powered analysis to answer complex data questions automatically. 🚀✨

✨ Features Included

📥 Multi-Source Data Ingestion

🌐 Web scraping with Playwright (JavaScript-rendered pages)
📋 HTML table extraction
📄 PDF data extraction (text and tables)
🖼️ Image OCR (extract data from screenshots/charts)
📊 CSV, JSON, Excel, SQL file processing
📦 Archive support (ZIP, TAR, TAR.GZ)
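As an illustration of the archive support listed above, here is a minimal sketch using only the Python standard library. The function names `extract_archive` and `_check_members` are hypothetical and are not taken from the project's code; the path check is an added safety assumption, not a documented behaviour of this project.

```python
import tarfile
import zipfile
from pathlib import Path


def _check_members(names: list[str], dest: Path) -> None:
    """Reject member names that would land outside the destination directory."""
    for name in names:
        target = (dest / name).resolve()
        if not target.is_relative_to(dest):
            raise ValueError(f"Unsafe path in archive: {name}")


def extract_archive(archive_path: str, dest_dir: str) -> list[Path]:
    """Extract a ZIP, TAR, or TAR.GZ archive and return the extracted files."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            names = zf.namelist()
            _check_members(names, dest)
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive_path):
        # Mode "r:*" lets tarfile detect gzip compression on its own.
        with tarfile.open(archive_path, "r:*") as tf:
            names = [member.name for member in tf.getmembers()]
            _check_members(names, dest)
            tf.extractall(dest)
    else:
        raise ValueError(f"Unsupported archive format: {archive_path}")

    # Report only regular files that came out of this archive.
    return sorted(dest / name for name in names if (dest / name).is_file())
```

Each returned path could then be handed to the format-specific readers (CSV, PDF, image OCR, and so on).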

🧠 Intelligent Data Processing

🔢 Automatic numeric field detection and cleaning
💰 Currency, percentage, and scientific notation handling
🔗 Multi-table relationship detection
🗃️ Database query generation (DuckDB)
📅 Datetime parsing and standardization
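To make the numeric cleaning step concrete, here is a minimal sketch of how currency, percentage, and scientific-notation strings might be normalized to floats. The helper `clean_numeric` is hypothetical, and the sketch assumes commas are thousands separators and that parentheses mark negative amounts; the project's actual rules may differ.

```python
import re
from typing import Optional

# Currency symbols stripped before parsing (an illustrative, incomplete set).
_CURRENCY = re.compile(r"[$€£¥₹]")


def clean_numeric(raw: str) -> Optional[float]:
    """Convert strings such as "$1,234.50", "45%", or "1.5e3" to a float.

    Returns None when the text does not look numeric.
    """
    text = raw.strip()
    if not text:
        return None

    # Accounting-style negatives: "(200)" means -200.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]

    is_percent = text.endswith("%")
    text = _CURRENCY.sub("", text).replace(",", "").rstrip("%").strip()

    try:
        value = float(text)  # float() already understands "1.5e3"
    except ValueError:
        return None

    if is_percent:
        value /= 100  # "45%" becomes 0.45
    return -value if negative else value
```

Values that fail to parse come back as `None`, so a column can be judged numeric by the share of cells that convert successfully.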

🤖 AI-Powered Analysis

🎯 Multiple LLM support (GPT, Claude, Gemini, Grok)
📝 Task breakdown and execution planning
💬 Natural language question answering
⚙️ Automatic code generation for data analysis
🔄 Error recovery and retry mechanisms
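The multi-LLM support and retry behaviour listed above can be pictured as a loop over providers. The sketch below is an assumption about the general pattern, not the project's implementation: `ask_with_fallback` is a hypothetical name, and each provider is represented by a plain callable rather than a real SDK client.

```python
import time
from typing import Callable, Sequence

# A provider is a (name, callable) pair; the callable takes a prompt
# and returns the model's text, raising an exception on failure.
Provider = tuple[str, Callable[[str], str]]


def ask_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    retries_per_provider: int = 2,
    backoff_seconds: float = 1.0,
) -> tuple[str, str]:
    """Try each provider in order, retrying with exponential backoff.

    Returns (provider_name, answer) from the first call that succeeds.
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except Exception as exc:  # broad on purpose: any failure triggers fallback
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                # Wait before retrying the same provider, but not before switching.
                if attempt < retries_per_provider - 1:
                    time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("All providers failed:\n" + "\n".join(errors))
```

In practice each callable would wrap one vendor's client (GPT, Claude, Gemini, or Grok), so the order of the list sets the fallback priority.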

🚀 Advanced Features

🧹 Automatic file cleanup after processing
📂 Support for multiple concurrent files
🛡️ Comprehensive error handling
📊 Progress tracking and detailed logging

🚀 How to Use It

📋 Prerequisites

🐍 Python 3.11+
🐳 Docker (optional)

💻 Installation

📥 Clone the repository

git clone https://github.com/heyitsgautham/data-analyst.git
cd data-analyst

📦 Install dependencies

pip install -r requirements.txt
playwright install

🔐 Set up environment variables

Create a .env file with your API keys:

API_KEY=your_openai_api_key
CLAUDE_API_KEY=your_claude_api_key
gemini_api=your_gemini_api_key
gemini_api_2=your_backup_gemini_key
grok_api=your_grok_api_key
OCR_API_KEY=your_ocr_space_key
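A quick way to see which keys are actually set is sketched below. The variable names are copied from the .env example above; the mapping `PROVIDER_KEYS` and the helper `available_providers` are hypothetical and only illustrate the idea. The sketch reads from `os.environ`, so it assumes the .env file has already been loaded into the environment (for example by python-dotenv's `load_dotenv()`, or by Docker's `--env-file`).

```python
import os
from typing import Mapping, Optional

# Environment variable names as they appear in the .env example.
# The short provider labels on the left are made up for this sketch.
PROVIDER_KEYS = {
    "openai": "API_KEY",
    "claude": "CLAUDE_API_KEY",
    "gemini": "gemini_api",
    "gemini_backup": "gemini_api_2",
    "grok": "grok_api",
    "ocr": "OCR_API_KEY",
}


def available_providers(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return {provider_label: key} for every variable that is set and non-empty."""
    env = os.environ if env is None else env
    return {
        label: env[var]
        for label, var in PROVIDER_KEYS.items()
        if env.get(var)
    }


if __name__ == "__main__":
    found = available_providers()
    missing = sorted(set(PROVIDER_KEYS) - set(found))
    print("configured:", sorted(found))
    print("missing:", missing)
```

Note that the variable names are case-sensitive: `gemini_api` and `grok_api` are lowercase in the example, unlike the other keys.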

▶️ Run the application

uvicorn app:app --host 0.0.0.0 --port 8000

🐳 Using Docker

docker build -t ai-data-analyst .
docker run -p 8000:8000 --env-file .env ai-data-analyst

📡 API Usage

Endpoint: POST /api/

Request: Upload files via multipart/form-data

📝 Questions file (.txt) - your data questions
📊 Data sources: CSV, JSON, Excel, PDF, HTML, images, archives
🔢 Multiple files supported simultaneously

Example with cURL:

curl -X POST http://localhost:8000/api/ \
  -F "questions=@questions.txt" \
  -F "data=@data.csv" \
  -F "image=@chart.png"

Response: JSON with analysis results, generated visualizations, and insights 📈✅
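For callers who prefer Python over cURL, the same upload can be sketched with the standard library alone. The form field names (`questions`, `data`) and the URL mirror the cURL example above; the helper names `build_multipart` and `ask` are hypothetical, and the shape of the returned JSON is not specified here, so the sketch simply returns whatever the server sends.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(files: dict[str, str]) -> tuple[bytes, str]:
    """Encode {field_name: file_path} as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, path in files.items():
        p = Path(path)
        mime = mimetypes.guess_type(p.name)[0] or "application/octet-stream"
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{p.name}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        )
        parts.append(header.encode() + p.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing boundary
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ask(url: str, files: dict[str, str]):
    """POST the files to the API and return the decoded JSON response."""
    body, content_type = build_multipart(files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the server running locally, a call might look like `ask("http://localhost:8000/api/", {"questions": "questions.txt", "data": "data.csv"})`.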

🏗️ Solution Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Client Request                         │
│            (Files: TXT, CSV, PDF, Images, etc.)             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│                  FastAPI Application Layer                  │
│                  (app.py - Main Router)                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
          ┌───────────┴───────────┐
          ▼                       ▼
┌──────────────────┐    ┌──────────────────┐
│  File Processing │    │   Data Scraping  │
│   & Extraction   │    │  (data_scrape.py)│
│                  │    │                  │
│ • Archive extract│    │ • Web scraping   │
│ • OCR/Image      │    │ • HTML parsing   │
│ • PDF parsing    │    │ • Table extract  │
│ • Format detect  │    │ • Numeric clean  │
└────────┬─────────┘    └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     ▼
         ┌─────────────────────┐
         │   Data Unification  │
         │   (DuckDB Engine)   │
         │                     │
         │ • Multi-table join  │
         │ • SQL generation    │
         │ • Query execution   │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   AI Analysis Layer │
         │                     │
         │ • Task breakdown    │
         │ • Code generation   │
         │ • LLM orchestration │
         │ • Multi-model retry │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   Response Builder  │
         │                     │
         │ • Visualization     │
         │ • JSON formatting   │
         │ • File cleanup      │
         └──────────┬──────────┘
                    ▼
         ┌─────────────────────┐
         │   Client Response   │
         │   (JSON + Charts)   │
         └─────────────────────┘

🔑 Key Components:

🌐 FastAPI Backend: Handles HTTP requests and orchestrates processing
🕷️ Data Scraper: Fetches and extracts data from various sources
🗄️ DuckDB Engine: In-memory database for complex queries
🤖 AI Orchestrator: Manages multiple LLM providers with fallback
🧹 Cleanup Manager: Tracks and removes temporary files
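The Cleanup Manager's role of tracking and removing temporary files could look roughly like the sketch below. The class `CleanupManager` and the context manager `tracked_files` are hypothetical names chosen for illustration; the project's own implementation is not shown in this README.

```python
import os
from contextlib import contextmanager


class CleanupManager:
    """Remember temporary file paths and delete them on request."""

    def __init__(self) -> None:
        self._paths: list[str] = []

    def track(self, path: str) -> str:
        """Register a file for later removal and return the path unchanged."""
        self._paths.append(path)
        return path

    def cleanup(self) -> int:
        """Delete every tracked file; return how many were actually removed."""
        removed = 0
        while self._paths:
            path = self._paths.pop()
            try:
                os.remove(path)
                removed += 1
            except FileNotFoundError:
                pass  # already gone, nothing to do
        return removed


@contextmanager
def tracked_files():
    """Yield a CleanupManager and clean up even if processing raises."""
    manager = CleanupManager()
    try:
        yield manager
    finally:
        manager.cleanup()
```

Wrapping request handling in `with tracked_files() as files:` means uploads are removed whether the analysis succeeds or fails.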

🛠️ Tech Stack

🎯 Core Framework

FastAPI - High-performance async web framework 🚀
Python 3.11+ - Programming language 🐍

📊 Data Processing

Pandas - Data manipulation and analysis 🐼
NumPy - Numerical computing 🔢
DuckDB - In-memory analytical database 🦆
Tabula - PDF table extraction 📄
pdfplumber - PDF text extraction 📖
openpyxl - Excel file handling 📗

🌐 Web Scraping

Playwright - Browser automation 🎭
playwright-stealth - Anti-detection 🥷
BeautifulSoup4 - HTML parsing 🍜
httpx - Async HTTP client ⚡
Selenium - Alternative browser automation 🌐

🤖 AI & ML

OpenAI GPT - Language model 🧠
Claude (Anthropic) - Language model 💭
Google Gemini - Language model ✨
Grok (xAI) - Language model 🚀
OCR.space API - Optical character recognition 👁️

📈 Visualization

Matplotlib - Chart generation 📊
Seaborn - Statistical visualizations 🎨
NetworkX - Graph/network visualizations 🕸️

🔧 Utilities

python-dotenv - Environment management 🔐
python-multipart - File upload handling 📤
chardet - Character encoding detection 🔤

🚢 Deployment

Docker - Containerization 🐳
Uvicorn - ASGI server ⚡

🙏 Thank You!

Thank you for using AI Data Analyst! This project aims to democratize data analysis by making it accessible through natural language. 💡

👨‍💻 Created by: @heyitsgautham

🤝 Contributions Welcome! Feel free to open issues or submit pull requests.

📜 License: See LICENSE file

⭐ Star this repo if you find it helpful!

Made with ❤️ and AI
