yossufyahya2000/scraper

Web Scraping API

A FastAPI-based web scraping service, optimized for Vercel deployment, that renders JavaScript-driven pages with Playwright.

Features

  • 🚀 FastAPI - Modern, fast web framework
  • 🎭 Playwright Integration - JavaScript rendering support for dynamic content
  • Vercel Optimized - Configured for serverless deployment
  • 🛡️ Security - URL validation, rate limiting, and content sanitization
  • 📊 Structured Output - Clean JSON responses with metadata
  • ⏱️ Timeout Handling - Optimized for Vercel's 10-second execution limit

API Endpoints

POST /scrape

Scrape content from a given URL.

Request Body:

{
  "url": "https://example.com",
  "wait_for_js": true,
  "timeout": 8,
  "extract_links": false,
  "extract_images": false
}

Parameters:

  • url (required): The URL to scrape
  • wait_for_js (optional, default: true): Whether to wait for JavaScript rendering
  • timeout (optional, default: 8, max: 8): Request timeout in seconds
  • extract_links (optional, default: false): Extract all links from the page
  • extract_images (optional, default: false): Extract all images from the page
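
The documented defaults and the 8-second ceiling can be mirrored on the client so that invalid requests fail before they are sent. The following is a minimal sketch, not the service's own request model: the ScrapeRequest class and its to_payload method are hypothetical names, and the lower bound of 1 second is an assumption, since the documentation only states the maximum.

```python
from dataclasses import asdict, dataclass

MAX_TIMEOUT_SECONDS = 8  # documented maximum for `timeout`


@dataclass
class ScrapeRequest:
    """Hypothetical client-side mirror of the documented POST /scrape body."""

    url: str                      # required
    wait_for_js: bool = True      # documented default
    timeout: int = 8              # documented default and maximum
    extract_links: bool = False   # documented default
    extract_images: bool = False  # documented default

    def to_payload(self) -> dict:
        """Validate the fields and return a JSON-serializable dict."""
        if not self.url:
            raise ValueError("url is required")
        # The lower bound of 1 is an assumption; only the maximum is documented.
        if not 1 <= self.timeout <= MAX_TIMEOUT_SECONDS:
            raise ValueError(
                f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds"
            )
        return asdict(self)
```

The resulting dict can be passed directly as the json argument of requests.post.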

Response:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "This domain is for use in illustrative examples...",
  "meta_description": "Example domain for documentation",
  "links": ["https://example.com/page1", "https://example.com/page2"],
  "images": ["https://example.com/image1.jpg"],
  "processing_time": 2.34,
  "timestamp": 1703123456.789
}

GET /health

Health check endpoint.

Response:

{
  "status": "healthy",
  "timestamp": 1703123456.789,
  "scraper_ready": true
}
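
A client or deployment script might check this endpoint before sending scrape requests. The sketch below uses only the standard library; the function names fetch_health and is_ready are illustrative, and treating anything other than "healthy" with scraper_ready set to true as not ready is an assumption about how the fields should be interpreted.

```python
import json
from urllib.request import urlopen


def is_ready(body: object) -> bool:
    """Interpret a parsed /health response (assumed semantics)."""
    return (
        isinstance(body, dict)
        and body.get("status") == "healthy"
        and body.get("scraper_ready") is True
    )


def fetch_health(base_url: str, timeout: float = 5.0) -> bool:
    """Call GET /health and report whether the service looks ready."""
    try:
        with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Network failures and HTTP errors are OSError subclasses;
        # malformed JSON raises a ValueError subclass.
        return False
    return is_ready(body)
```

For example, fetch_health("http://localhost:8000") could gate a local smoke test.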

Local Development

  1. Install dependencies:
pip install -r requirements.txt
  2. Install Playwright browsers:
playwright install chromium
  3. Run the application:
python main.py

The API will be available at http://localhost:8000 with interactive docs at http://localhost:8000/docs.

Deployment to Vercel

  1. Install Vercel CLI:
npm install -g vercel
  2. Deploy:
vercel

The application is configured with vercel.json for optimal serverless deployment.

Usage Examples

Python

import requests

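# Allow a little more than the API's 10-second execution limit before giving up.
response = requests.post(
    "https://your-app.vercel.app/scrape",
    json={
        "url": "https://example.com",
        "wait_for_js": True,
        "extract_links": True,
    },
    timeout=15,
)
response.raise_for_status()  # surface 4xx/5xx responses instead of parsing them

data = response.json()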
print(f"Title: {data['title']}")
print(f"Content length: {len(data['content'])}")

JavaScript

const response = await fetch('https://your-app.vercel.app/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    wait_for_js: true,
    extract_links: true
  })
});

const data = await response.json();
console.log('Title:', data.title);
console.log('Content length:', data.content.length);

cURL

curl -X POST "https://your-app.vercel.app/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_for_js": true,
    "timeout": 8
  }'

Security Features

  • URL Validation: Blocks private IPs, localhost, and suspicious domains
  • Rate Limiting: 20 requests per minute per IP
  • Content Sanitization: Removes harmful content and limits response size
  • Timeout Protection: Hard limits to prevent long-running requests
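
To illustrate the kind of check that URL validation involves, here is a minimal sketch using the standard library. It is not the project's actual validator: the function name is_allowed_url and the resolve flag are hypothetical, it does not attempt the "suspicious domains" check, and resolving a hostname at validation time does not by itself protect against DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def _is_public_ip(address: str) -> bool:
    """True if the address is not private, loopback, link-local, etc."""
    # Drop any IPv6 zone id ("fe80::1%eth0") before parsing.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )


def is_allowed_url(url: str, resolve: bool = True) -> bool:
    """Reject non-HTTP(S) URLs, localhost, and non-public IP targets."""
    try:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        return _is_public_ip(host)  # host is a literal IP address
    except ValueError:
        pass  # host is a name, not an IP literal
    if not resolve:
        return True
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    # Every resolved address must be public.
    return all(_is_public_ip(info[4][0]) for info in infos)
```

With this sketch, is_allowed_url("http://127.0.0.1/admin") and is_allowed_url("http://localhost:8000") are both rejected, while a public literal such as http://8.8.8.8/ is accepted.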

Limitations

  • Vercel Timeout: 10-second maximum execution time (8-second scraping timeout + buffer)
  • Memory Limit: 1024MB on Vercel Pro, 512MB on Hobby
  • Response Size: Content limited to 500KB to ensure fast responses
  • Rate Limiting: 20 requests per minute per IP address
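
A limit of 20 requests per minute per IP can be pictured as a sliding window over recent request times. The sketch below is one common way to implement such a limit, not necessarily the one this service uses; the SlidingWindowLimiter class is a hypothetical name. Note that in-memory state like this is not shared between serverless instances, so a real deployment would likely need an external store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests=20, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                # injectable for testing
        self._hits = defaultdict(deque)    # client IP -> request timestamps

    def allow(self, client_ip: str) -> bool:
        now = self._clock()
        hits = self._hits[client_ip]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller would respond with HTTP 429
        hits.append(now)
        return True
```

A request handler would call limiter.allow(client_ip) first and return a 429 response when it yields False.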

Error Handling

The API returns structured error responses:

{
  "success": false,
  "error": "Error description",
  "status_code": 400,
  "timestamp": 1703123456.789
}

Common error codes:

  • 400: Invalid URL or parameters
  • 408: Request timeout
  • 429: Rate limit exceeded
  • 500: Internal server error
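
On the client side, the structured error body can be turned into an exception that carries the status code. This is a minimal sketch: ScrapeError and parse_scrape_response are hypothetical names, and treating 408, 429, and 500 as retryable is an assumption about sensible client behavior, not something the API specifies.

```python
# Assumption: timeouts, rate limits, and server errors are worth retrying;
# a 400 indicates a bad request that will fail again unchanged.
RETRYABLE_STATUS = {408, 429, 500}


class ScrapeError(Exception):
    """Raised when the API reports success == false."""

    def __init__(self, status_code: int, message: str):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message
        self.retryable = status_code in RETRYABLE_STATUS


def parse_scrape_response(body: dict) -> dict:
    """Return the body on success, otherwise raise ScrapeError."""
    if body.get("success"):
        return body
    raise ScrapeError(
        body.get("status_code", 500),
        body.get("error", "Unknown error"),
    )
```

A caller can catch ScrapeError and inspect its retryable attribute to decide whether to back off and try again.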

Environment Variables

  • PLAYWRIGHT_BROWSERS_PATH: Browser installation path (set automatically on Vercel)
  • PYTHONPATH: Python module path (set automatically on Vercel)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

MIT License - see LICENSE file for details.
