An intelligent data analysis system that combines web scraping, multi-format data extraction, and AI-powered analysis to answer complex data questions automatically.
- Web scraping with Playwright (JavaScript-rendered pages)
- HTML table extraction
- PDF data extraction (text and tables)
- Image OCR (extract data from screenshots/charts)
- CSV, JSON, Excel, SQL file processing
- Archive support (ZIP, TAR, TAR.GZ)
- Automatic numeric field detection and cleaning
- Currency, percentage, and scientific notation handling
- Multi-table relationship detection
- Database query generation (DuckDB)
- Datetime parsing and standardization
- Multiple LLM support (GPT, Claude, Gemini, Grok)
- Task breakdown and execution planning
- Natural language question answering
- Automatic code generation for data analysis
- Error recovery and retry mechanisms
- Automatic file cleanup after processing
- Support for multiple concurrent files
- Comprehensive error handling
- Progress tracking and detailed logging
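As an illustration of the numeric cleaning listed above (currency, percentages, scientific notation), here is a minimal Python sketch. The `clean_numeric` helper is hypothetical and not taken from this project's code; it assumes US-style formatting (comma as thousands separator, dot as decimal point) and treats parenthesized values as negatives.

```python
import math
import re
from typing import Optional

# Hypothetical helper for illustration; not the project's implementation.
_CURRENCY = re.compile(r"[$€£¥₹]")


def clean_numeric(raw: str) -> Optional[float]:
    """Convert strings such as '$1,234.50', '45%' or '1.2e3' to a float.

    Returns None when the value does not look like a finite number.
    """
    text = raw.strip()
    if not text:
        return None
    # Accounting-style negatives: "(123)" means -123.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    is_percent = text.endswith("%")
    text = text.rstrip("%")
    # Drop currency symbols, thousands separators and spaces.
    text = _CURRENCY.sub("", text).replace(",", "").replace(" ", "")
    try:
        value = float(text)  # float() already understands "1.2e3"
    except ValueError:
        return None
    if not math.isfinite(value):  # reject "nan" and "inf"
        return None
    if is_percent:
        value /= 100
    return -value if negative else value
```

A column could then be cleaned with `[clean_numeric(v) for v in column]`, leaving `None` wherever a cell is not numeric.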
- Python 3.11+
- Docker (optional)
- Clone the repository
git clone https://github.com/heyitsgautham/data-analyst.git
cd data-analyst
- Install dependencies
pip install -r requirements.txt
playwright install
- Set up environment variables
Create a .env file with your API keys:
API_KEY=your_openai_api_key
CLAUDE_API_KEY=your_claude_api_key
gemini_api=your_gemini_api_key
gemini_api_2=your_backup_gemini_key
grok_api=your_grok_api_key
OCR_API_KEY=your_ocr_space_key
- Run the application
uvicorn app:app --host 0.0.0.0 --port 8000
- Run with Docker (optional)
docker build -t ai-data-analyst .
docker run -p 8000:8000 --env-file .env ai-data-analyst
Endpoint: POST /api/
Request: Upload files via multipart/form-data
- Questions file (.txt): your data questions
- Data sources: CSV, JSON, Excel, PDF, HTML, images, archives
- Multiple files supported simultaneously
Example with cURL:
curl -X POST http://localhost:8000/api/ \
-F "questions=@questions.txt" \
-F "data=@data.csv" \
-F "image=@chart.png"
Response: JSON with analysis results, generated visualizations, and insights
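The same request can be sent from Python. The sketch below uses only the standard library and mirrors the cURL example; `build_multipart` and `ask` are hypothetical helper names, and the form field names (`questions`, `data`, `image`) are simply the ones used in the cURL call, so adjust them if the server expects something else.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "http://localhost:8000/api/"  # endpoint from the cURL example


def build_multipart(files: dict[str, Path]) -> tuple[bytes, str]:
    """Encode {form_field: path} as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, path in files.items():
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{path.name}"\r\n'
            f"Content-Type: {mime}\r\n\r\n"
        )
        parts.append(header.encode() + path.read_bytes() + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def ask(files: dict[str, Path], url: str = API_URL, timeout: float = 300.0) -> bytes:
    """POST the files and return the raw response body."""
    body, content_type = build_multipart(files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the server running, `ask({"questions": Path("questions.txt"), "data": Path("data.csv")})` returns the response bytes, which can be decoded with `json.loads`. The generous default timeout is a guess, since analysis requests that call an LLM may take a while.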
┌─────────────────────────────────────────────────────────────┐
│                       Client Request                        │
│            (Files: TXT, CSV, PDF, Images, etc.)             │
└──────────────────────────────┬──────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  FastAPI Application Layer                  │
│                   (app.py - Main Router)                    │
└──────────────────────────────┬──────────────────────────────┘
                               │
                   ┌───────────┴───────────┐
                   ▼                       ▼
         ┌───────────────────┐   ┌───────────────────┐
         │ File Processing   │   │ Data Scraping     │
         │ & Extraction      │   │ (data_scrape.py)  │
         │                   │   │                   │
         │ • Archive extract │   │ • Web scraping    │
         │ • OCR/Image       │   │ • HTML parsing    │
         │ • PDF parsing     │   │ • Table extract   │
         │ • Format detect   │   │ • Numeric clean   │
         └─────────┬─────────┘   └─────────┬─────────┘
                   │                       │
                   └───────────┬───────────┘
                               ▼
                    ┌─────────────────────┐
                    │ Data Unification    │
                    │ (DuckDB Engine)     │
                    │                     │
                    │ • Multi-table join  │
                    │ • SQL generation    │
                    │ • Query execution   │
                    └──────────┬──────────┘
                               ▼
                    ┌─────────────────────┐
                    │ AI Analysis Layer   │
                    │                     │
                    │ • Task breakdown    │
                    │ • Code generation   │
                    │ • LLM orchestration │
                    │ • Multi-model retry │
                    └──────────┬──────────┘
                               ▼
                    ┌─────────────────────┐
                    │ Response Builder    │
                    │                     │
                    │ • Visualization     │
                    │ • JSON formatting   │
                    │ • File cleanup      │
                    └──────────┬──────────┘
                               ▼
                    ┌─────────────────────┐
                    │ Client Response     │
                    │ (JSON + Charts)     │
                    └─────────────────────┘
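The Data Unification stage in the diagram mentions multi-table joins. One simple way to find join candidates, shown here as a rough sketch only, is to look for columns that share a name and have overlapping values. The `candidate_join_keys` function and its `min_overlap` threshold are illustrative assumptions, not the project's actual detection logic, and tables are modeled as plain dictionaries rather than DuckDB relations.

```python
from itertools import combinations

Table = dict[str, list]  # column name -> column values


def candidate_join_keys(tables: dict[str, Table], min_overlap: float = 0.5):
    """Yield (table_a, table_b, column, overlap) for likely join columns.

    Two tables are considered related through a column when both have a
    column of that name and at least `min_overlap` of the smaller column's
    distinct values also appear in the other one.
    """
    for (name_a, a), (name_b, b) in combinations(tables.items(), 2):
        for column in a.keys() & b.keys():
            values_a, values_b = set(a[column]), set(b[column])
            smaller = min(len(values_a), len(values_b))
            if smaller == 0:
                continue  # an empty column cannot support a join
            overlap = len(values_a & values_b) / smaller
            if overlap >= min_overlap:
                yield name_a, name_b, column, overlap
```

Each result could then feed a generated query such as `SELECT ... FROM a JOIN b USING (column)`.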
Key Components:
- FastAPI Backend: Handles HTTP requests and orchestrates processing
- Data Scraper: Fetches and extracts data from various sources
- DuckDB Engine: In-memory database for complex queries
- AI Orchestrator: Manages multiple LLM providers with fallback
- Cleanup Manager: Tracks and removes temporary files
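To make the "multiple LLM providers with fallback" idea concrete, here is a minimal sketch of the pattern. It is an assumption about how such an orchestrator could work, not the project's code: `ask_with_fallback`, `AllProvidersFailed`, and the retry parameters are hypothetical names, and each provider is just a callable that would wrap a real SDK call in practice.

```python
import time
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real providers would wrap the OpenAI, Anthropic, Gemini or xAI clients.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider has exhausted its retries."""


def ask_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, answer)."""
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # sketch: real code should catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
                if backoff_seconds:
                    time.sleep(backoff_seconds * attempt)
    raise AllProvidersFailed("; ".join(errors))
```

Ordering the `providers` list sets the preference: the first entry is the primary model and later entries are only tried after the earlier ones fail.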
- FastAPI - High-performance async web framework
- Python 3.11 - Programming language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- DuckDB - In-memory analytical database
- Tabula - PDF table extraction
- pdfplumber - PDF text extraction
- openpyxl - Excel file handling
- Playwright - Browser automation
- playwright-stealth - Anti-detection
- BeautifulSoup4 - HTML parsing
- httpx - Async HTTP client
- Selenium - Alternative browser automation
- OpenAI GPT - Language model
- Claude (Anthropic) - Language model
- Google Gemini - Language model
- Grok (xAI) - Language model
- OCR.space API - Optical character recognition
- Matplotlib - Chart generation
- Seaborn - Statistical visualizations
- NetworkX - Graph/network visualizations
- python-dotenv - Environment management
- python-multipart - File upload handling
- chardet - Character encoding detection
- Docker - Containerization
- Uvicorn - ASGI server
Thank you for using AI Data Analyst! This project aims to democratize data analysis by making it accessible through natural language.
Created by: @heyitsgautham
Contributions welcome! Feel free to open issues or submit pull requests.
License: See LICENSE file