Web Scraping API

A FastAPI-based web scraping service that renders JavaScript with Playwright and is optimized for deployment on Vercel.

Features

🚀 FastAPI - Modern, fast web framework
🎭 Playwright Integration - JavaScript rendering support for dynamic content
⚡ Vercel Optimized - Configured for serverless deployment
🛡️ Security - URL validation, rate limiting, and content sanitization
📊 Structured Output - Clean JSON responses with metadata
⏱️ Timeout Handling - Scraping is capped at 8 seconds to fit within Vercel's 10-second execution limit

API Endpoints

POST `/scrape`

Scrape content from a given URL.

Request Body:

{
  "url": "https://example.com",
  "wait_for_js": true,
  "timeout": 8,
  "extract_links": false,
  "extract_images": false
}

Parameters:

url (required): The URL to scrape
wait_for_js (optional, default: true): Whether to wait for JavaScript rendering
timeout (optional, default: 8, max: 8): Request timeout in seconds
extract_links (optional, default: false): Extract all links from the page
extract_images (optional, default: false): Extract all images from the page
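As a rough illustration of the parameters above, a client could validate a request before sending it. This is a minimal sketch, not the service's own model: the `ScrapeRequest` class, the `to_payload` helper, the minimum timeout of 1 second, and the http/https scheme check are assumptions made for the example. Only the field names, defaults, and the 8-second maximum come from the parameter list.

```python
from dataclasses import dataclass, asdict

MAX_TIMEOUT_SECONDS = 8  # documented maximum for the `timeout` parameter


@dataclass
class ScrapeRequest:
    """Hypothetical client-side mirror of the /scrape request body."""

    url: str
    wait_for_js: bool = True
    timeout: int = 8
    extract_links: bool = False
    extract_images: bool = False

    def __post_init__(self) -> None:
        # Assumption: only http(s) URLs are accepted by the service.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must start with http:// or https://")
        # Assumption: the lower bound of 1 second is illustrative only.
        if not 1 <= self.timeout <= MAX_TIMEOUT_SECONDS:
            raise ValueError(
                f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds"
            )

    def to_payload(self) -> dict:
        """Return a dict suitable for use as the JSON request body."""
        return asdict(self)


payload = ScrapeRequest(url="https://example.com", extract_links=True).to_payload()
print(payload)
```

Validating on the client avoids spending a rate-limited request on a payload the API would reject with a 400.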

Response:

{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "This domain is for use in illustrative examples...",
  "meta_description": "Example domain for documentation",
  "links": ["https://example.com/page1", "https://example.com/page2"],
  "images": ["https://example.com/image1.jpg"],
  "processing_time": 2.34,
  "timestamp": 1703123456.789
}
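A response like the one above can be reduced to a short summary on the client. The sketch below assumes that `timestamp` is a Unix epoch value in seconds (the sample value is consistent with that, but the source does not state it) and that `links` and `images` may be missing or null when they were not requested. The `summarize` helper is hypothetical.

```python
import json
from datetime import datetime, timezone

# Sample response body copied from the documentation above.
SAMPLE_RESPONSE = """
{
  "success": true,
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "This domain is for use in illustrative examples...",
  "meta_description": "Example domain for documentation",
  "links": ["https://example.com/page1", "https://example.com/page2"],
  "images": ["https://example.com/image1.jpg"],
  "processing_time": 2.34,
  "timestamp": 1703123456.789
}
"""


def summarize(response_text: str) -> dict:
    """Hypothetical helper: condense a /scrape response into a few fields."""
    data = json.loads(response_text)
    return {
        "title": data.get("title"),
        "content_chars": len(data.get("content") or ""),
        # Assumption: links/images may be absent or null when not requested.
        "link_count": len(data.get("links") or []),
        "image_count": len(data.get("images") or []),
        # Assumption: timestamp is seconds since the Unix epoch.
        "scraped_at": datetime.fromtimestamp(data["timestamp"], tz=timezone.utc),
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary["title"], summary["link_count"], summary["scraped_at"].isoformat())
```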

GET `/health`

Health check endpoint.

Response:

{
  "status": "healthy",
  "timestamp": 1703123456.789,
  "scraper_ready": true
}
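One possible use of this endpoint is to wait for the scraper before sending work. The following is a sketch under assumptions: `wait_until_ready` is a hypothetical helper, and the caller supplies `fetch_health`, any function that performs the GET request and returns the parsed JSON body. Only the `status` and `scraper_ready` fields come from the response shown above.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_health: Callable[[], dict],
    attempts: int = 5,
    delay: float = 1.0,
) -> bool:
    """Poll a health check until it reports a ready scraper.

    `fetch_health` is supplied by the caller and should return the parsed
    JSON body of GET /health.
    """
    for attempt in range(attempts):
        try:
            health = fetch_health()
        except OSError:
            # Network-level failures are treated as "not ready yet".
            health = {}
        if health.get("status") == "healthy" and health.get("scraper_ready"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example with a stand-in fetcher that becomes ready on the second call.
_responses = iter([
    {"status": "healthy", "scraper_ready": False},
    {"status": "healthy", "scraper_ready": True},
])
print(wait_until_ready(lambda: next(_responses), attempts=3, delay=0.0))
```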

Local Development

Install dependencies:

pip install -r requirements.txt

Install Playwright browsers:

playwright install chromium

Run the application:

python main.py

The API will be available at http://localhost:8000 with interactive docs at http://localhost:8000/docs.

Deployment to Vercel

Install Vercel CLI:

npm install -g vercel

Deploy:

vercel

The application is configured with vercel.json for optimal serverless deployment.

Usage Examples

Python

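import requests

response = requests.post(
    "https://your-app.vercel.app/scrape",
    json={
        "url": "https://example.com",
        "wait_for_js": True,
        "extract_links": True,
    },
    timeout=15,  # client-side timeout, above the API's 10-second limit
)
response.raise_for_status()  # raise an exception on 4xx/5xx responses

data = response.json()
print(f"Title: {data['title']}")
print(f"Content length: {len(data['content'])}")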

JavaScript

const response = await fetch('https://your-app.vercel.app/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    wait_for_js: true,
    extract_links: true
  })
});

const data = await response.json();
console.log('Title:', data.title);
console.log('Content length:', data.content.length);

cURL

curl -X POST "https://your-app.vercel.app/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "wait_for_js": true,
    "timeout": 8
  }'

Security Features

URL Validation: Blocks private IPs, localhost, and suspicious domains
Rate Limiting: 20 requests per minute per IP
Content Sanitization: Removes harmful content and limits response content to 500KB
Timeout Protection: Hard limits to prevent long-running requests
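To make the URL validation rule more concrete, here is a minimal sketch of how private IPs and localhost could be rejected using only the standard library. It is not the service's actual validator: `is_url_allowed` and `BLOCKED_HOSTNAMES` are hypothetical names, the "suspicious domains" check is not modeled, and hostnames are not resolved, so a public name that points at a private address would pass this check.

```python
import ipaddress
from urllib.parse import urlparse

# Assumption: an illustrative blocklist; the real service may block more.
BLOCKED_HOSTNAMES = {"localhost"}


def is_url_allowed(url: str) -> bool:
    """Return False for non-HTTP(S) URLs, localhost, and non-public IP literals."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 literal
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname.lower()
    if host in BLOCKED_HOSTNAMES or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. DNS resolution is deliberately not checked here.
        return True
    return not (ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved)


for candidate in ("https://example.com", "http://localhost:8000", "http://192.168.1.10/"):
    print(candidate, is_url_allowed(candidate))
```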

Limitations

Vercel Timeout: 10-second maximum execution time (8-second scraping timeout + buffer)
Memory Limit: 1024MB on Vercel Pro, 512MB on Hobby
Response Size: Content limited to 500KB to ensure fast responses
Rate Limiting: 20 requests per minute per IP address
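The limit of 20 requests per minute per IP can be pictured as a sliding window. The sketch below is one common way to implement such a limit and is not taken from this project: `SlidingWindowLimiter` is a hypothetical class, and its in-memory state lives in a single process, so a serverless deployment would likely need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Hypothetical per-IP limiter: at most `max_requests` per `window_seconds`."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, now: Optional[float] = None) -> bool:
        """Record a request and return True, or return False if over the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # the caller would respond with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("203.0.113.5", now=0.0) for _ in range(21)]
print(results.count(True), results.count(False))
```

The optional `now` argument exists only to make the behavior easy to test with fixed timestamps.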

Error Handling

The API returns structured error responses:

{
  "success": false,
  "error": "Error description",
  "status_code": 400,
  "timestamp": 1703123456.789
}

Common error codes:

400: Invalid URL or parameters
408: Request timeout
429: Rate limit exceeded
500: Internal server error
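On the client side, the structured error body can be turned into an exception. This is a sketch under assumptions: `ScrapeError`, `unwrap`, and the choice of which status codes count as retryable are illustrative and not part of the API. Only the `success`, `error`, and `status_code` fields come from the error format above.

```python
# Assumption: timeouts, rate limits, and server errors are worth retrying;
# a 400 indicates a bad request that a retry would not fix.
RETRYABLE_STATUS_CODES = {408, 429, 500}


class ScrapeError(Exception):
    """Hypothetical exception carrying the API's structured error fields."""

    def __init__(self, message: str, status_code: int, retryable: bool) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.retryable = retryable


def unwrap(payload: dict) -> dict:
    """Return the payload on success, otherwise raise ScrapeError."""
    if payload.get("success"):
        return payload
    status_code = payload.get("status_code", 500)
    raise ScrapeError(
        payload.get("error", "Unknown error"),
        status_code,
        retryable=status_code in RETRYABLE_STATUS_CODES,
    )


try:
    unwrap({"success": False, "error": "Rate limit exceeded", "status_code": 429})
except ScrapeError as exc:
    print(exc, exc.status_code, exc.retryable)
```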

Environment Variables

PLAYWRIGHT_BROWSERS_PATH: Browser installation path (set automatically on Vercel)
PYTHONPATH: Python module path (set automatically on Vercel)

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

License

MIT License - see LICENSE file for details.
