Chaitya44/aria-webscraper


ARIA

Adaptive Retrieval & Intelligent Analysis

AI-powered universal web extraction platform built for structured data intelligence.


✨ Overview

ARIA is an AI-powered web extraction engine that turns unstructured webpages into clean, structured, machine-readable data using semantic analysis.

Instead of relying on fragile CSS selectors or hardcoded scraping logic, ARIA uses intelligent page understanding powered by Google Gemini AI combined with a stealth-capable extraction pipeline.

Designed for:

  • E-commerce extraction
  • Market research
  • AI agents
  • Knowledge aggregation
  • Structured dataset generation
  • Research automation

🚀 Core Features

🌐 Universal Extraction

Extract structured data from virtually any website:

  • E-commerce stores
  • Documentation pages
  • Blogs & articles
  • Forums
  • News websites
  • Wikis
  • Dynamic JavaScript applications

No manual selectors required.


🧠 AI Semantic Structuring

ARIA uses a multi-pass AI analysis pipeline to:

  • Understand page context
  • Identify relevant entities
  • Extract structured information
  • Generate schema-clean JSON outputs
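
The passes listed above can be sketched as a small function. This is a minimal illustration, not ARIA's actual implementation: `structure_page`, the prompts, and the `fake_llm` stand-in are all hypothetical, and the model call is passed in as a plain callable so the sketch runs offline.

```python
import json
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]


def structure_page(page_text: str, llm: LLM) -> dict:
    """Two-pass sketch: classify the page, then extract entities as JSON."""
    # Pass 1: ask the model what kind of page this is.
    page_type = llm(f"Classify this page in one word:\n{page_text}").strip().lower()

    # Pass 2: ask for a JSON object, using the page type as context.
    raw = llm(f"Page type: {page_type}. Return the key entities as a JSON object.\n{page_text}")

    # Models can return malformed or non-object JSON; fall back to an empty dict.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {"page_type": page_type, "data": data}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for demonstration only)."""
    if prompt.startswith("Classify"):
        return "Product\n"
    return '{"name": "Desk Lamp", "price": 19.99}'


result = structure_page("Desk Lamp - $19.99", fake_llm)
print(result)
```

Swapping `fake_llm` for a function that calls a real model keeps the rest of the flow unchanged, which is the main reason for injecting the callable.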

🕶️ Stealth Rendering Engine

Built-in headless browsing system with:

  • JavaScript rendering
  • Anti-bot handling
  • Dynamic page loading
  • Fallback extraction pipelines
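
One way to picture a fallback extraction pipeline is an ordered list of fetchers that are tried until one returns usable HTML. The sketch below is an assumption about the general pattern, not ARIA's code: `fetch_with_fallback`, `headless_browser`, and `plain_http` are hypothetical names, and the two fetchers are offline stand-ins.

```python
from typing import Callable, Optional

# Hypothetical type: a fetcher takes a URL and returns HTML (or None).
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, fetchers: list[tuple[str, Fetcher]]) -> tuple[str, str]:
    """Try each fetcher in order; return (fetcher_name, html) for the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception as exc:  # any failure moves on to the next fetcher
            errors.append(f"{name}: {exc}")
            continue
        if html and html.strip():
            return name, html
        errors.append(f"{name}: empty response")
    raise RuntimeError("all fetchers failed: " + "; ".join(errors))


# Offline stand-ins (assumptions): the first simulates a render timeout,
# the second simulates a plain HTTP fetch that succeeds.
def headless_browser(url: str) -> Optional[str]:
    raise TimeoutError("render timed out")


def plain_http(url: str) -> Optional[str]:
    return "<html><body>Hello</body></html>"


used, html = fetch_with_fallback(
    "https://example.com",
    [("headless", headless_browser), ("http", plain_http)],
)
print(used)
```

Collecting the per-fetcher errors makes the final exception useful for debugging when every strategy fails.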

📦 Structured Output

Export clean output as:

  • JSON
  • CSV
  • Typed structured objects

Optimized for:

  • AI workflows
  • APIs
  • Databases
  • Analytics pipelines
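
As a rough illustration of the JSON and CSV formats, the following sketch serializes a list of extracted records with the Python standard library only. The helper names and sample records are hypothetical; ARIA's real export code may differ.

```python
import csv
import io
import json


def to_json(records: list[dict]) -> str:
    """Serialize records as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list[dict]) -> str:
    """Serialize records as CSV, using the union of keys as the header."""
    # Preserve first-seen key order across all records.
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Sample data (made up for this example).
records = [
    {"name": "Desk Lamp", "price": 19.99},
    {"name": "Chair", "price": 54.0, "in_stock": True},
]
csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
```

Building the header from the union of keys keeps records with optional fields in one table instead of failing on the first unexpected key.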

🔒 Privacy Focused

ARIA runs on your own infrastructure and calls external services (Gemini, Firecrawl) with your own API keys.

Your data remains under your control.


🏗️ System Architecture

ARIA follows a modular dual-service architecture:

Frontend (Next.js)
        ↓
FastAPI Backend
        ↓
Extraction Engine
        ↓
Gemini AI Structuring
        ↓
Structured JSON Output

Frontend

Modern glassmorphism interface built with:

  • Next.js 14
  • React
  • TailwindCSS
  • Framer Motion

Backend

High-performance FastAPI service responsible for:

  • scraping orchestration
  • rate limiting
  • AI classification
  • semantic extraction
  • fallback handling
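
Rate limiting, one of the responsibilities above, can be illustrated with a sliding-window limiter. This is a generic sketch under stated assumptions: the class name, limits, and injectable clock are hypothetical, and the source does not say which algorithm the backend actually uses.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` within any `window_seconds` interval."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls: deque = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(max_calls=2, window_seconds=60)
decisions = [limiter.allow() for _ in range(3)]
print(decisions)  # [True, True, False]
```

In a web service, one limiter would typically be kept per client key (for example per user or per IP address), and a rejected call would map to an HTTP 429 response.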

Detailed architecture diagrams and flows are available in:

docs/architecture.md

⚡ Quick Start

1. Clone Repository

git clone https://github.com/Chaitya44/aria-webscraper.git
cd aria-webscraper

2. Backend Setup

cd backend

python -m venv venv

# Windows
.\venv\Scripts\Activate.ps1

# Mac/Linux
source venv/bin/activate

pip install -r requirements.txt

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Backend runs on:

http://localhost:8000

3. Frontend Setup

cd frontend/nexus-scraper-ui

npm install

npm run dev

Frontend runs on:

http://localhost:3000

🔑 Environment Variables

Backend

Create a .env file inside backend/ with the following keys:

FIRECRAWL_API_KEY=
GEMINI_API_KEY=
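
A startup check along these lines can fail fast when a key is missing. It is a minimal sketch, not the backend's actual code: `load_settings` is a hypothetical helper, and it assumes the values from .env have already been loaded into the process environment (for example by the shell or a loader such as python-dotenv).

```python
import os
from typing import Mapping, Optional

# The two keys named in the backend .env file above.
REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "GEMINI_API_KEY")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the required keys, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demo with placeholder values; real keys should never be hardcoded.
settings = load_settings({"FIRECRAWL_API_KEY": "fc-demo", "GEMINI_API_KEY": "gm-demo"})
print(sorted(settings))
```

Calling `load_settings()` with no argument reads from `os.environ`, so the same function works in local development and in deployment.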

Frontend

Create a .env.local file inside frontend/nexus-scraper-ui/ with the following keys:

NEXT_PUBLIC_FIREBASE_API_KEY=
NEXT_PUBLIC_FIREBASE_AUTH_DOMAIN=
NEXT_PUBLIC_FIREBASE_PROJECT_ID=
NEXT_PUBLIC_FIREBASE_STORAGE_BUCKET=
NEXT_PUBLIC_FIREBASE_MESSAGING_SENDER_ID=
NEXT_PUBLIC_FIREBASE_APP_ID=

🌍 Deployment

Frontend

Recommended platforms:

  • Vercel
  • Netlify

Deploy directory:

frontend/nexus-scraper-ui

Backend

Recommended platforms:

  • Render
  • Railway
  • Ubuntu VPS
  • Docker

Start production server:

uvicorn main:app --host 0.0.0.0 --port 8000

🛠️ Tech Stack

Frontend

  • Next.js 14
  • React
  • TypeScript
  • TailwindCSS
  • Framer Motion

Backend

  • FastAPI
  • Python
  • Gemini AI
  • Firecrawl
  • AsyncIO

🧭 Roadmap

  • Scheduled scraping
  • Workflow automation
  • Browser extension
  • Team workspaces
  • API dashboard
  • Scraping templates
  • Real-time extraction monitoring

📁 Project Structure

aria-webscraper/
│
├── backend/
├── frontend/
├── docs/
│
├── README.md
├── LICENSE
└── .gitignore

📄 License

MIT License


Built with AI-driven extraction intelligence.