A TypeScript-based microservice for capturing web pages as self-contained HTML files using Playwright. The service fully renders each page (running its JavaScript and loading its images), then saves it as a complete HTML document that can be viewed offline.
- ✅ Full page rendering with JavaScript execution
- ✅ JavaScript removal (captured pages are static and secure)
- ✅ Smart image optimization with lightweight blurred placeholders
- ✅ CSS and resource inlining
- ✅ Auto-scrolling for lazy-loaded content
- ✅ Enhanced Nuxt.js support (handles delay hydration with mouse simulation)
- ✅ Dynamic content detection and stability verification
- ✅ Persistent file storage (captures saved to disk)
- ✅ Memory management and cleanup
- ✅ Automatic old file cleanup
- ✅ Storage statistics and monitoring
- ✅ Security validation (blocks private IPs, localhost)
- ✅ Size limits and timeouts
- ✅ TypeScript with strict typing
- ✅ Comprehensive error handling
- 🌐 Interactive HTML Interface with element selector
- 🎯 Visual element picker with hover highlighting and CSS selector generation
- 📊 Real-time capture progress and statistics
- 📱 Responsive design for desktop and mobile
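
The request and response of the capture endpoint documented below can be written down as TypeScript types. This is a sketch inferred from the example payloads only: the field names come from the examples, while the units, the element type of `warnings`, and the helpers `buildCaptureRequest` and `isCaptureResponse` are assumptions for illustration, not part of the service's code.

```typescript
// Shapes inferred from the example request and response in this README.
interface CaptureRequest {
  url: string;
}

interface CaptureMetrics {
  loadTime: number;   // presumably milliseconds (assumption)
  imageCount: number;
  memoryUsed: number; // presumably bytes (assumption)
}

interface CaptureResponse {
  html: string;
  size: number;
  capturedAt: string; // ISO 8601 timestamp in the example
  status: string;     // "complete" in the example; other values are not documented
  warnings: string[]; // element type is an assumption, the example array is empty
  filePath: string;
  metrics: CaptureMetrics;
}

// Hypothetical helper: builds the JSON body for POST /api/capture-webpage.
function buildCaptureRequest(url: string): CaptureRequest {
  return { url };
}

// Hypothetical helper: runtime check for an untyped JSON payload.
function isCaptureResponse(value: unknown): value is CaptureResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const m = v.metrics as Record<string, unknown> | null | undefined;
  return (
    typeof v.html === "string" &&
    typeof v.size === "number" &&
    typeof v.capturedAt === "string" &&
    typeof v.status === "string" &&
    Array.isArray(v.warnings) &&
    typeof v.filePath === "string" &&
    typeof m === "object" && m !== null &&
    typeof m.loadTime === "number" &&
    typeof m.imageCount === "number" &&
    typeof m.memoryUsed === "number"
  );
}
```

A client could pass the parsed response body through `isCaptureResponse` before reading `html` or `metrics`, so that an error payload is not mistaken for a capture.
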
POST /api/capture-webpage
Content-Type: application/json

{
  "url": "https://example.com"
}

Response:
{
  "html": "<!DOCTYPE html>...",
  "size": 1048576,
  "capturedAt": "2025-09-18T10:30:00.000Z",
  "status": "complete",
  "warnings": [],
  "filePath": "/Users/aaa/Work/seo/back/captures/capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
  "metrics": {
    "loadTime": 3500,
    "imageCount": 12,
    "memoryUsed": 157286400
  }
}

GET /api/captures

Response:
{
  "captures": ["capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"],
  "total": 1,
  "stats": {
    "totalFiles": 1,
    "totalSizeMB": 1.5,
    "oldestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
    "newestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"
  }
}

# Install dependencies
npm install
# Install Playwright browsers
npm run install-browsers

# Start in development mode
npm run dev
# Build for production
npm run build
# Start production server
npm start

- PORT - Server port (default: 3000)
- NODE_ENV - Environment (development/production)

- Storage Directory: ./captures/ (created automatically)
- File Naming: capture-{timestamp}-{url-hash}.html
- Max Stored Files: 1000 (older files auto-deleted)
- Cleanup Interval: Every hour
- File Permissions: Standard filesystem permissions
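
The naming and retention rules above can be sketched as two small functions. The timestamp format follows the example file names in this README. The hash is an assumption: the README does not say which algorithm produces the `{url-hash}` part, so an FNV-1a style hash is used purely as a stand-in that yields 8 hex characters. `captureFileName` and `filesToPrune` are hypothetical names.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. Stand-in only: the actual
// hash used by the service is not documented.
function shortHash(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Builds capture-{timestamp}-{url-hash}.html. ":" and "." in the ISO timestamp
// are replaced with "-", matching the example file names.
function captureFileName(url: string, at: Date): string {
  const timestamp = at.toISOString().replace(/[:.]/g, "-");
  return `capture-${timestamp}-${shortHash(url)}.html`;
}

const MAX_STORED_FILES = 1000;

// Returns the file names to delete so that at most `max` files remain.
// Names sort chronologically because the timestamp prefix is fixed-width.
function filesToPrune(names: string[], max: number = MAX_STORED_FILES): string[] {
  const sorted = [...names].sort();
  return sorted.slice(0, Math.max(0, sorted.length - max));
}
```

An hourly cleanup job could list `./captures/`, pass the names to `filesToPrune`, and delete whatever comes back.
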
- Maximum file size: 100MB
- Maximum image size: 10MB per image
- Maximum images: 200 per page
- Request timeout: 10 seconds
- Memory threshold: 80%
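
These limits could live in a single configuration object with a couple of guard functions. The sketch below assumes 1 MB means 1024 * 1024 bytes and that the memory threshold is a used/total ratio; the README specifies neither. All names (`LIMITS`, `isImageAllowed`, `isMemoryOverThreshold`) are hypothetical.

```typescript
// Values taken from the list above; the byte conversion is an assumption.
const LIMITS = {
  maxFileSizeBytes: 100 * 1024 * 1024,
  maxImageSizeBytes: 10 * 1024 * 1024,
  maxImagesPerPage: 200,
  requestTimeoutMs: 10 * 1000,
  memoryThreshold: 0.8,
} as const;

// True if another image of `imageBytes` may be inlined, given how many
// images have already been accepted for this page.
function isImageAllowed(imageBytes: number, imagesSoFar: number): boolean {
  return imageBytes <= LIMITS.maxImageSizeBytes && imagesSoFar < LIMITS.maxImagesPerPage;
}

// True if memory use is above the 80% threshold. What "total" refers to
// (heap or system memory) is not stated in the README.
function isMemoryOverThreshold(usedBytes: number, totalBytes: number): boolean {
  return totalBytes > 0 && usedBytes / totalBytes > LIMITS.memoryThreshold;
}
```
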
- Blocks localhost and private IP ranges
- Only allows HTTP/HTTPS protocols
- Validates URL format
- Complete JavaScript removal from captured pages
- Memory usage monitoring
- Resource cleanup
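
The URL checks listed above might look roughly like the following. This is a minimal sketch, not the service's actual validator: `validateCaptureUrl` and `isPrivateIPv4` are hypothetical names, only IPv4 ranges and the IPv6 loopback are covered, and a hostname that resolves to a private address through DNS is not caught here.

```typescript
// Private, loopback, and link-local IPv4 ranges.
function isPrivateIPv4(host: string): boolean {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}

type UrlCheck = { ok: true } | { ok: false; reason: string };

function validateCaptureUrl(raw: string): UrlCheck {
  let parsed: URL;
  try {
    parsed = new URL(raw); // throws on malformed input
  } catch {
    return { ok: false, reason: "invalid URL format" };
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return { ok: false, reason: "only HTTP/HTTPS are allowed" };
  }
  const host = parsed.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost") || host === "[::1]") {
    return { ok: false, reason: "localhost is blocked" };
  }
  if (isPrivateIPv4(host)) {
    return { ok: false, reason: "private IP ranges are blocked" };
  }
  return { ok: true };
}
```

Returning a reason string alongside the verdict makes it straightforward to turn a rejection into a 400 response.
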
The service automatically removes all JavaScript from captured pages for security and static rendering:
- Removes all <script> tags and content
- Strips JavaScript event handlers (onclick, onload, etc.)
- Removes javascript: protocol URLs
- Cleans inline JavaScript attributes
- Converts <noscript> content to regular HTML
This ensures captured pages are completely static and safe to view offline.
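
The removal steps can be illustrated with a string-based pass. This is a deliberately simplified sketch: regular expressions are used for brevity and can misfire on unusual markup (for example, any attribute whose name starts with "on"), so a real implementation would more likely operate on the parsed DOM. `stripJavaScript` is a hypothetical name.

```typescript
function stripJavaScript(html: string): string {
  return html
    // Remove <script> elements together with their content.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // Unwrap <noscript> so its content becomes regular HTML.
    .replace(/<\/?noscript\b[^>]*>/gi, "")
    // Drop inline event handler attributes such as onclick="..." or onload='...'.
    .replace(/\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    // Replace javascript: URLs in href/src attributes with "#".
    .replace(/\b(href|src)\s*=\s*(["'])\s*javascript:[^"']*\2/gi, "$1=$2#$2");
}
```

For instance, `<a href="javascript:void(0)" onclick="go()">x</a>` becomes `<a href="#">x</a>`.
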
GET /health

Returns server status and memory usage information.
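
The README does not document the response body of the health endpoint, so the payload shape below is purely illustrative. It shows one way to derive a status and memory figures from a memory snapshot; `buildHealthPayload`, the field names, and the "degraded" status are all assumptions.

```typescript
// Subset of the fields returned by Node's process.memoryUsage().
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  heapTotal: number;
}

// Hypothetical response shape for GET /health.
interface HealthPayload {
  status: "ok" | "degraded";
  memory: { rssMB: number; heapUsedMB: number; heapTotalMB: number; heapUsage: number };
}

// Bytes to megabytes, rounded to one decimal place.
const toMB = (bytes: number): number => Math.round((bytes / (1024 * 1024)) * 10) / 10;

function buildHealthPayload(mem: MemorySnapshot, threshold: number = 0.8): HealthPayload {
  const heapUsage = mem.heapTotal > 0 ? mem.heapUsed / mem.heapTotal : 0;
  return {
    status: heapUsage > threshold ? "degraded" : "ok",
    memory: {
      rssMB: toMB(mem.rss),
      heapUsedMB: toMB(mem.heapUsed),
      heapTotalMB: toMB(mem.heapTotal),
      heapUsage,
    },
  };
}
```

In a route handler, the snapshot would come from `process.memoryUsage()`.
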
/src
  /types - TypeScript interfaces
  /routes - Express route handlers
  /services - Business logic services
  /middleware - Express middleware
  /utils - Utility functions
  /schemas - Zod validation schemas
  /config - Configuration constants
  server.ts - Main server file
- Build the application: npm run build
- Set environment variables
- Start with process manager (PM2, etc.)
- Monitor memory usage
- Set up log rotation
The service includes comprehensive error handling:
- Validation errors (400)
- Security violations (400)
- Memory limits (503)
- Processing failures (500)
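
One common way to get the status codes above is a small error hierarchy plus a single mapping function. The class names and the response body shape are hypothetical; only the categories and status codes come from the list.

```typescript
// Base class carrying the HTTP status code for an error category.
class ServiceError extends Error {
  constructor(message: string, readonly statusCode: number) {
    super(message);
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = new.target.name;
  }
}

class ValidationError extends ServiceError {
  constructor(message: string) { super(message, 400); }
}
class SecurityError extends ServiceError {
  constructor(message: string) { super(message, 400); }
}
class MemoryLimitError extends ServiceError {
  constructor(message: string) { super(message, 503); }
}
class ProcessingError extends ServiceError {
  constructor(message: string) { super(message, 500); }
}

interface HttpError {
  status: number;
  body: { error: string; message: string };
}

// Maps any thrown value to a status code and JSON body. Unknown errors are
// treated as processing failures and their message is not exposed.
function toHttpError(err: unknown): HttpError {
  if (err instanceof ServiceError) {
    return { status: err.statusCode, body: { error: err.name, message: err.message } };
  }
  return { status: 500, body: { error: "ProcessingError", message: "Unexpected processing failure" } };
}
```

An Express error-handling middleware could call `toHttpError` and send the result with `res.status(...).json(...)`.
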
- Uses Playwright with Chromium
- Headless mode for security
- Proper cleanup and resource management
- Stealth configuration to avoid detection