A FastAPI-based web scraping service optimized for Vercel deployment with JavaScript rendering support using Playwright.
- 🚀 FastAPI - Modern, fast web framework
- 🎭 Playwright Integration - JavaScript rendering support for dynamic content
- ⚡ Vercel Optimized - Configured for serverless deployment
- 🛡️ Security - URL validation, rate limiting, and content sanitization
- 📊 Structured Output - Clean JSON responses with metadata
- ⏱️ Timeout Handling - Optimized for Vercel's 10-second execution limit
Scrape content from a given URL.
Request Body:
{
"url": "https://example.com",
"wait_for_js": true,
"timeout": 8,
"extract_links": false,
"extract_images": false
}Parameters:
url(required): The URL to scrapewait_for_js(optional, default: true): Whether to wait for JavaScript renderingtimeout(optional, default: 8, max: 8): Request timeout in secondsextract_links(optional, default: false): Extract all links from the pageextract_images(optional, default: false): Extract all images from the page
Response:
{
"success": true,
"url": "https://example.com",
"title": "Example Domain",
"content": "This domain is for use in illustrative examples...",
"meta_description": "Example domain for documentation",
"links": ["https://example.com/page1", "https://example.com/page2"],
"images": ["https://example.com/image1.jpg"],
"processing_time": 2.34,
"timestamp": 1703123456.789
}Health check endpoint.
Response:
{
"status": "healthy",
"timestamp": 1703123456.789,
"scraper_ready": true
}- Install dependencies:
pip install -r requirements.txt- Install Playwright browsers:
playwright install chromium- Run the application:
python main.pyThe API will be available at http://localhost:8000 with interactive docs at http://localhost:8000/docs.
- Install Vercel CLI:
npm install -g vercel- Deploy:
vercelThe application is configured with vercel.json for optimal serverless deployment.
import requests
response = requests.post("https://your-app.vercel.app/scrape", json={
"url": "https://example.com",
"wait_for_js": True,
"extract_links": True
})
data = response.json()
print(f"Title: {data['title']}")
print(f"Content length: {len(data['content'])}")const response = await fetch('https://your-app.vercel.app/scrape', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
url: 'https://example.com',
wait_for_js: true,
extract_links: true
})
});
const data = await response.json();
console.log('Title:', data.title);
console.log('Content length:', data.content.length);curl -X POST "https://your-app.vercel.app/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"wait_for_js": true,
"timeout": 8
}'- URL Validation: Blocks private IPs, localhost, and suspicious domains
- Rate Limiting: 20 requests per minute per IP
- Content Sanitization: Removes harmful content and limits response size
- Timeout Protection: Hard limits to prevent long-running requests
- Vercel Timeout: 10-second maximum execution time (8-second scraping timeout + buffer)
- Memory Limit: 1024MB on Vercel Pro, 512MB on Hobby
- Response Size: Content limited to 500KB to ensure fast responses
- Rate Limiting: 20 requests per minute per IP address
The API returns structured error responses:
{
"success": false,
"error": "Error description",
"status_code": 400,
"timestamp": 1703123456.789
}Common error codes:
400: Invalid URL or parameters408: Request timeout429: Rate limit exceeded500: Internal server error
PLAYWRIGHT_BROWSERS_PATH: Browser installation path (set automatically on Vercel)PYTHONPATH: Python module path (set automatically on Vercel)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details.