abaskalov/page-capturer

Web Page Capture Service

A TypeScript-based microservice for capturing web pages as self-contained HTML files using Playwright. The service renders each page fully, including JavaScript execution and image loading, and then saves it as a complete HTML document that can be viewed offline.

✨ Features

Core Capture Features

  • ✅ Full page rendering with JavaScript execution
  • ✅ JavaScript removal (captured pages are static and secure)
  • ✅ Smart image optimization with lightweight blurred placeholders
  • ✅ CSS and resource inlining
  • ✅ Auto-scrolling for lazy-loaded content
  • ✅ Enhanced Nuxt.js support (handles delay hydration with mouse simulation)
  • ✅ Dynamic content detection and stability verification

Storage & Management

  • ✅ Persistent file storage (captures saved to disk)
  • ✅ Memory management and cleanup
  • ✅ Automatic old file cleanup
  • ✅ Storage statistics and monitoring

Security & Reliability

  • ✅ Security validation (blocks private IPs, localhost)
  • ✅ Size limits and timeouts
  • ✅ TypeScript with strict typing
  • ✅ Comprehensive error handling

User Interface

  • 🌐 Interactive HTML Interface with element selector
  • 🎯 Visual element picker with hover highlighting and CSS selector generation
  • 📊 Real-time capture progress and statistics
  • 📱 Responsive design for desktop and mobile

API Specification

Capture Webpage

POST /api/capture-webpage
Content-Type: application/json

{
  "url": "https://example.com"
}

Response:

{
  "html": "<!DOCTYPE html>...",
  "size": 1048576,
  "capturedAt": "2025-09-18T10:30:00.000Z",
  "status": "complete",
  "warnings": [],
  "filePath": "/Users/aaa/Work/seo/back/captures/capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
  "metrics": {
    "loadTime": 3500,
    "imageCount": 12,
    "memoryUsed": 157286400
  }
}
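
A client call to this endpoint might look like the sketch below. It assumes the service runs at http://localhost:3000 (the default PORT) and that the runtime provides a global fetch (Node 18+). The CaptureResponse interface is inferred from the example response above, and buildCaptureRequest and captureWebpage are hypothetical helper names, not part of the service.

```typescript
// Shape inferred from the example response above; not confirmed against the service's own types.
interface CaptureResponse {
  html: string;
  size: number;
  capturedAt: string;
  status: string;
  warnings: string[];
  filePath: string;
  metrics: { loadTime: number; imageCount: number; memoryUsed: number };
}

// Hypothetical helper: builds the fetch options for POST /api/capture-webpage.
function buildCaptureRequest(url: string): {
  method: string;
  headers: Record<string, string>;
  body: string;
} {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  };
}

// Hypothetical helper: baseUrl is an assumption based on the default PORT of 3000.
async function captureWebpage(
  url: string,
  baseUrl = "http://localhost:3000",
): Promise<CaptureResponse> {
  const res = await fetch(`${baseUrl}/api/capture-webpage`, buildCaptureRequest(url));
  if (!res.ok) {
    throw new Error(`Capture failed with HTTP ${res.status}`);
  }
  return (await res.json()) as CaptureResponse;
}
```

Calling captureWebpage("https://example.com") would then resolve to an object whose html field holds the captured document.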

List Stored Captures

GET /api/captures

Response:

{
  "captures": ["capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"],
  "total": 1,
  "stats": {
    "totalFiles": 1,
    "totalSizeMB": 1.5,
    "oldestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
    "newestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"
  }
}

Installation

# Install dependencies
npm install

# Install Playwright browsers
npm run install-browsers

Development

# Start in development mode
npm run dev

# Build for production
npm run build

# Start production server
npm start

Configuration

Environment Variables

  • PORT - Server port (default: 3000)
  • NODE_ENV - Environment (development/production)

Storage Configuration

  • Storage Directory: ./captures/ (created automatically)
  • File Naming: capture-{timestamp}-{url-hash}.html
  • Max Stored Files: 1000 (older files auto-deleted)
  • Cleanup Interval: Every hour
  • File Permissions: Standard filesystem permissions
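
The naming and retention rules above can be sketched as follows. This is an illustration under assumptions: the example filenames are consistent with an ISO timestamp whose ":" and "." are replaced by "-", but the hash algorithm (here the first 8 hex characters of SHA-256) and the names captureFileName, StoredFile, and filesToDelete are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumption: timestamp is the ISO-8601 string with ":" and "." replaced by "-",
// and the URL hash is the first 8 hex characters of a SHA-256 digest.
function captureFileName(url: string, capturedAt: Date): string {
  const timestamp = capturedAt.toISOString().replace(/[:.]/g, "-");
  const urlHash = createHash("sha256").update(url).digest("hex").slice(0, 8);
  return `capture-${timestamp}-${urlHash}.html`;
}

// Hypothetical record describing a stored capture.
interface StoredFile {
  name: string;
  modifiedMs: number; // last-modified time in milliseconds since the epoch
}

// Returns the names of the oldest files that push the store over maxFiles.
function filesToDelete(files: StoredFile[], maxFiles = 1000): string[] {
  const excess = files.length - maxFiles;
  if (excess <= 0) return [];
  return [...files]
    .sort((a, b) => a.modifiedMs - b.modifiedMs) // oldest first
    .slice(0, excess)
    .map((f) => f.name);
}
```

A cleanup job running every hour could list ./captures/, pass the entries to filesToDelete, and unlink the returned names.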

Limits

  • Maximum file size: 100MB
  • Maximum image size: 10MB per image
  • Maximum images: 200 per page
  • Request timeout: 10 seconds
  • Memory threshold: 80%
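
These limits could be expressed as configuration constants along the following lines. The LIMITS object and isMemoryOverThreshold are hypothetical names; the sketch assumes 1 MB means 1024 × 1024 bytes and that the memory threshold is a used/total ratio, neither of which is confirmed by the service itself.

```typescript
// Hypothetical constants mirroring the documented limits.
const LIMITS = {
  maxFileSizeBytes: 100 * 1024 * 1024, // 100MB (assumes binary megabytes)
  maxImageSizeBytes: 10 * 1024 * 1024, // 10MB per image
  maxImagesPerPage: 200,
  requestTimeoutMs: 10_000, // 10 seconds
  memoryThreshold: 0.8, // 80%
} as const;

// Assumption: the threshold compares used memory to total memory as a ratio.
function isMemoryOverThreshold(usedBytes: number, totalBytes: number): boolean {
  return usedBytes / totalBytes > LIMITS.memoryThreshold;
}
```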

Security

  • Blocks localhost and private IP ranges
  • Only allows HTTP/HTTPS protocols
  • Validates URL format
  • Complete JavaScript removal from captured pages
  • Memory usage monitoring
  • Resource cleanup
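
A minimal sketch of the URL checks listed above, using only the standard URL class. The function names isPrivateIPv4 and isAllowedUrl are hypothetical, and the sketch is deliberately narrow: it inspects only the literal hostname, so hostnames that resolve to private addresses via DNS and most IPv6 ranges are not covered here.

```typescript
// Checks whether a dotted-quad host falls in a loopback, private, or link-local IPv4 range.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 10 || // 10.0.0.0/8
    a === 127 || // 127.0.0.0/8 loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // 169.254.0.0/16 link-local
  );
}

// Returns true only for well-formed HTTP/HTTPS URLs that do not target localhost or private IPv4 hosts.
function isAllowedUrl(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // invalid URL format
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") return false;
  const host = parsed.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost") || host === "[::1]") return false;
  return !isPrivateIPv4(host);
}
```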

JavaScript Removal

The service automatically removes all JavaScript from captured pages for security and static rendering:

  • Removes all <script> tags and content
  • Strips JavaScript event handlers (onclick, onload, etc.)
  • Removes javascript: protocol URLs
  • Cleans inline JavaScript attributes
  • Converts <noscript> content to regular HTML

This ensures captured pages are completely static and safe to view offline.
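
The removal steps above can be illustrated with a small string-based sketch. This is not the service's implementation: stripJavaScript is a hypothetical name, and regular expressions are used here only to keep the example short. They can also match look-alike text outside of tags, so a DOM-based pass is the more robust choice for real input.

```typescript
// Illustrative only: regex-based cleanup, not a hardened sanitizer.
function stripJavaScript(html: string): string {
  return html
    // Remove <script> elements together with their content.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // Unwrap <noscript> so its fallback content renders as regular HTML.
    .replace(/<\/?noscript\b[^>]*>/gi, "")
    // Strip inline event-handler attributes such as onclick="..." or onload='...'.
    .replace(/\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    // Replace javascript: URLs in href/src attributes with a harmless "#".
    .replace(/\b(href|src)\s*=\s*(["'])\s*javascript:[^"']*\2/gi, "$1=$2#$2");
}
```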

Health Check

GET /health

Returns server status and memory usage information.
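
The exact response body is not documented here. As a rough sketch, a handler could build its payload from process.memoryUsage() and process.uptime() like this; the field names, the HealthPayload shape, and buildHealthPayload are all assumptions for illustration.

```typescript
// Subset of the fields returned by process.memoryUsage().
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  heapTotal: number;
}

// Hypothetical response shape; the real /health payload may differ.
interface HealthPayload {
  status: "ok";
  uptimeSeconds: number;
  memory: { rssMB: number; heapUsedMB: number; heapTotalMB: number };
}

function buildHealthPayload(mem: MemorySnapshot, uptimeSeconds: number): HealthPayload {
  // Convert bytes to megabytes, rounded to one decimal place.
  const toMB = (bytes: number): number => Math.round((bytes / (1024 * 1024)) * 10) / 10;
  return {
    status: "ok",
    uptimeSeconds: Math.round(uptimeSeconds),
    memory: {
      rssMB: toMB(mem.rss),
      heapUsedMB: toMB(mem.heapUsed),
      heapTotalMB: toMB(mem.heapTotal),
    },
  };
}
```

In an Express route this could be used as res.json(buildHealthPayload(process.memoryUsage(), process.uptime())).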

Architecture

/src
  /types          - TypeScript interfaces
  /routes         - Express route handlers
  /services       - Business logic services
  /middleware     - Express middleware
  /utils          - Utility functions
  /schemas        - Zod validation schemas
  /config         - Configuration constants
  server.ts       - Main server file

Production Deployment

  1. Build the application: npm run build
  2. Set environment variables
  3. Start with a process manager (PM2, etc.)
  4. Monitor memory usage
  5. Set up log rotation

Error Handling

The service maps each class of failure to an HTTP status code:

  • Validation errors (400)
  • Security violations (400)
  • Memory limits (503)
  • Processing failures (500)
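
One way to model this mapping is a typed error class plus a lookup table, sketched below. CaptureError, CaptureErrorKind, and statusFor are hypothetical names; only the status codes come from the list above. Unknown errors fall back to 500 in this sketch.

```typescript
type CaptureErrorKind = "validation" | "security" | "memory" | "processing";

// Hypothetical error type carrying the failure category.
class CaptureError extends Error {
  readonly kind: CaptureErrorKind;

  constructor(kind: CaptureErrorKind, message: string) {
    super(message);
    this.kind = kind;
    this.name = "CaptureError";
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, CaptureError.prototype);
  }
}

// Status codes as documented above.
const STATUS_BY_KIND: Record<CaptureErrorKind, number> = {
  validation: 400,
  security: 400,
  memory: 503,
  processing: 500,
};

// Anything that is not a CaptureError is treated as a processing failure.
function statusFor(err: unknown): number {
  return err instanceof CaptureError ? STATUS_BY_KIND[err.kind] : 500;
}
```

An Express error middleware could then call res.status(statusFor(err)) before sending the error body.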

Browser Management

  • Uses Playwright with Chromium
  • Headless mode for security
  • Proper cleanup and resource management
  • Stealth configuration to avoid detection
