Web Page Capture Service

A TypeScript-based microservice for capturing web pages as self-contained HTML files using Playwright. The service fully renders each page (running JavaScript and loading images), then saves it as a complete HTML document that can be viewed offline.

✨ Features

Core Capture Features

✅ Full page rendering with JavaScript execution
✅ JavaScript removal (captured pages are static and secure)
✅ Smart image optimization with lightweight blurred placeholders
✅ CSS and resource inlining
✅ Auto-scrolling for lazy-loaded content
✅ Enhanced Nuxt.js support (handles delayed hydration by simulating mouse input)
✅ Dynamic content detection and stability verification

Storage & Management

✅ Persistent file storage (captures saved to disk)
✅ Memory management and cleanup
✅ Automatic old file cleanup
✅ Storage statistics and monitoring

Security & Reliability

✅ Security validation (blocks private IPs, localhost)
✅ Size limits and timeouts
✅ TypeScript with strict typing
✅ Comprehensive error handling

User Interface

🌐 Interactive HTML Interface with element selector
🎯 Visual element picker with hover highlighting and CSS selector generation
📊 Real-time capture progress and statistics
📱 Responsive design for desktop and mobile

API Specification

Capture Webpage

POST /api/capture-webpage
Content-Type: application/json

{
  "url": "https://example.com"
}

Response:

{
  "html": "<!DOCTYPE html>...",
  "size": 1048576,
  "capturedAt": "2025-09-18T10:30:00.000Z",
  "status": "complete",
  "warnings": [],
  "filePath": "/Users/aaa/Work/seo/back/captures/capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
  "metrics": {
    "loadTime": 3500,
    "imageCount": 12,
    "memoryUsed": 157286400
  }
}
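
A client for this endpoint might look like the following sketch. The `CaptureResponse` interface mirrors the example response above, but its field types are inferred from that single example. The `FetchLike` type, `buildCaptureRequest`, `captureWebpage`, and the `baseUrl` parameter are hypothetical names introduced for illustration and are not part of the service.

```typescript
// Shape of the capture response, inferred from the example above (assumed, not authoritative).
interface CaptureResponse {
  html: string;
  size: number;
  capturedAt: string;
  status: string;
  warnings: string[];
  filePath: string;
  metrics: { loadTime: number; imageCount: number; memoryUsed: number };
}

// Minimal subset of the Fetch API that this sketch relies on (hypothetical helper type).
type FetchLike = (
  input: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the request for POST /api/capture-webpage without sending it.
function buildCaptureRequest(baseUrl: string, targetUrl: string) {
  return {
    endpoint: new URL("/api/capture-webpage", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ url: targetUrl }),
    },
  };
}

// Send the request and return the parsed body, throwing on a non-2xx status.
async function captureWebpage(
  fetchImpl: FetchLike,
  baseUrl: string,
  targetUrl: string,
): Promise<CaptureResponse> {
  const { endpoint, init } = buildCaptureRequest(baseUrl, targetUrl);
  const res = await fetchImpl(endpoint, init);
  if (!res.ok) {
    throw new Error(`Capture failed with HTTP ${res.status}`);
  }
  return (await res.json()) as CaptureResponse;
}
```

With Node 18 or later, the global `fetch` can be passed as `fetchImpl`, for example `captureWebpage(fetch, "http://localhost:3000", "https://example.com")`. Taking the fetch function as a parameter keeps the sketch easy to exercise with a stub.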

List Stored Captures

GET /api/captures

Response:

{
  "captures": ["capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"],
  "total": 1,
  "stats": {
    "totalFiles": 1,
    "totalSizeMB": 1.5,
    "oldestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html",
    "newestFile": "capture-2025-09-18T10-30-00-000Z-a1b2c3d4.html"
  }
}
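
The `stats` object above could be derived from the storage directory's entries roughly as follows. This is a sketch under assumptions: `StoredFile`, `StorageStats`, and `computeStorageStats` are hypothetical names, file age is taken from a modification time, and `totalSizeMB` is rounded to two decimals using 1024 * 1024 bytes per MB. The service may compute these differently.

```typescript
// One entry in the captures directory (hypothetical shape for this sketch).
interface StoredFile {
  name: string;
  sizeBytes: number;
  modifiedAtMs: number;
}

// Mirrors the "stats" object in the example response above.
interface StorageStats {
  totalFiles: number;
  totalSizeMB: number;
  oldestFile: string | null;
  newestFile: string | null;
}

function computeStorageStats(files: StoredFile[]): StorageStats {
  if (files.length === 0) {
    return { totalFiles: 0, totalSizeMB: 0, oldestFile: null, newestFile: null };
  }
  // Sort a copy by modification time, oldest first, leaving the input untouched.
  const byAge = [...files].sort((a, b) => a.modifiedAtMs - b.modifiedAtMs);
  const totalBytes = files.reduce((sum, file) => sum + file.sizeBytes, 0);
  return {
    totalFiles: files.length,
    // ASSUMPTION: 1 MB = 1024 * 1024 bytes, rounded to two decimal places.
    totalSizeMB: Math.round((totalBytes / (1024 * 1024)) * 100) / 100,
    oldestFile: byAge[0].name,
    newestFile: byAge[byAge.length - 1].name,
  };
}
```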

Installation

# Install dependencies
npm install

# Install Playwright browsers
npm run install-browsers

Development

# Start in development mode
npm run dev

# Build for production
npm run build

# Start production server
npm start

Configuration

Environment Variables

PORT - Server port (default: 3000)
NODE_ENV - Environment (development/production)

Storage Configuration

Storage Directory: ./captures/ (created automatically)
File Naming: capture-{timestamp}-{url-hash}.html
Max Stored Files: 1000 (older files auto-deleted)
Cleanup Interval: Every hour
File Permissions: Standard filesystem permissions
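
The naming scheme and the file cap above can be sketched as two small functions. The timestamp format follows the example file names in this README (an ISO timestamp with `:` and `.` replaced by `-`). The hash algorithm is not documented, so the first 8 hex characters of a SHA-256 digest are used here purely as an assumption. `captureFileName` and `filesToDelete` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Build a name of the form capture-{timestamp}-{url-hash}.html.
function captureFileName(url: string, capturedAt: Date): string {
  // Replace ":" and "." so the ISO timestamp is safe on every filesystem.
  const timestamp = capturedAt.toISOString().replace(/[:.]/g, "-");
  // ASSUMPTION: the real hash function is unknown; SHA-256 truncated to 8 hex chars is a stand-in.
  const urlHash = createHash("sha256").update(url).digest("hex").slice(0, 8);
  return `capture-${timestamp}-${urlHash}.html`;
}

// Pick the oldest files to delete so that at most maxFiles remain.
// Names embed a fixed-width ISO timestamp, so lexicographic order is chronological order.
function filesToDelete(names: string[], maxFiles: number): string[] {
  const oldestFirst = [...names].sort();
  return oldestFirst.slice(0, Math.max(0, oldestFirst.length - maxFiles));
}
```

A cleanup job running on the hourly interval could call `filesToDelete(names, 1000)` and remove the returned files.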

Limits

Maximum file size: 100MB
Maximum image size: 10MB per image
Maximum images: 200 per page
Request timeout: 10 seconds
Memory threshold: 80%
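
Expressed as constants, the limits above might look like the sketch below. The names `LIMITS`, `isMemoryOverThreshold`, and `imageLimitWarnings` are hypothetical, "MB" is assumed to mean 1024 * 1024 bytes, and the README does not say which memory figure the 80% threshold applies to, so the check simply takes a used/total pair.

```typescript
// ASSUMPTION: "MB" is treated as 1024 * 1024 bytes; the service may use 10^6 instead.
const MB = 1024 * 1024;

// Limits from the list above, expressed as constants.
const LIMITS = {
  maxFileSizeBytes: 100 * MB,
  maxImageSizeBytes: 10 * MB,
  maxImagesPerPage: 200,
  requestTimeoutMs: 10000,
  memoryThreshold: 0.8,
} as const;

// True when used / total is strictly above the threshold.
// ASSUMPTION: which memory measurement the service uses is not documented.
function isMemoryOverThreshold(usedBytes: number, totalBytes: number): boolean {
  return totalBytes > 0 && usedBytes / totalBytes > LIMITS.memoryThreshold;
}

// Collect human-readable warnings for images that break the per-page limits.
function imageLimitWarnings(imageSizesBytes: number[]): string[] {
  const warnings: string[] = [];
  if (imageSizesBytes.length > LIMITS.maxImagesPerPage) {
    warnings.push(`Page has ${imageSizesBytes.length} images; limit is ${LIMITS.maxImagesPerPage}`);
  }
  const oversized = imageSizesBytes.filter((size) => size > LIMITS.maxImageSizeBytes).length;
  if (oversized > 0) {
    warnings.push(`${oversized} image(s) exceed ${LIMITS.maxImageSizeBytes} bytes`);
  }
  return warnings;
}
```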

Security

Blocks localhost and private IP ranges
Only allows HTTP/HTTPS protocols
Validates URL format
Complete JavaScript removal from captured pages
Memory usage monitoring
Resource cleanup
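
The URL checks listed above could be sketched as follows. This is an illustration, not the service's actual validator: `validateTargetUrl` and `isPrivateIPv4` are hypothetical names, and the set of blocked ranges is an assumption. The sketch only inspects literal hostnames, so a public DNS name that resolves to a private address would pass; a production check would also need to validate the resolved address.

```typescript
import { isIP } from "node:net";

type ValidationResult = { ok: true; url: URL } | { ok: false; reason: string };

// Loopback, RFC 1918 private, link-local, and "this network" IPv4 ranges.
function isPrivateIPv4(ip: string): boolean {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function validateTargetUrl(raw: string): ValidationResult {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: "Invalid URL format" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "Only HTTP and HTTPS are allowed" };
  }
  // URL.hostname keeps the brackets around IPv6 literals, so strip them first.
  const host = url.hostname.replace(/^\[|\]$/g, "");
  if (host === "localhost" || host.endsWith(".localhost")) {
    return { ok: false, reason: "localhost is blocked" };
  }
  const version = isIP(host);
  if (version === 4 && isPrivateIPv4(host)) {
    return { ok: false, reason: "Private IPv4 address is blocked" };
  }
  // Loopback (::1), unique local (fc00::/7), and link-local (fe80::/10) IPv6 addresses.
  if (version === 6 && (host === "::1" || /^f[cd]/i.test(host) || /^fe[89ab]/i.test(host))) {
    return { ok: false, reason: "Private IPv6 address is blocked" };
  }
  return { ok: true, url };
}
```

Because the WHATWG URL parser normalizes IPv4 hosts, inputs such as `http://2130706433/` are seen as `127.0.0.1` by the time the range check runs.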

JavaScript Removal

The service automatically removes all JavaScript from captured pages for security and static rendering:

Removes all <script> tags and content
Strips JavaScript event handlers (onclick, onload, etc.)
Removes javascript: protocol URLs
Cleans inline JavaScript attributes
Converts <noscript> content to regular HTML

This ensures captured pages are completely static and safe to view offline.
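
To make the removal steps concrete, here is a deliberately simplified, string-based sketch. It is not the service's implementation: `stripJavaScript` is a hypothetical name, and regular expressions cannot handle every HTML edge case (for example, text content that happens to look like an attribute). A DOM-based pass inside the browser is likely more robust.

```typescript
// Simplified illustration of the removal steps; not a hardened sanitizer.
function stripJavaScript(html: string): string {
  return html
    // 1. Remove <script> elements together with their content.
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    // 2. Unwrap <noscript> so its fallback content renders as regular HTML.
    .replace(/<\/?noscript\b[^>]*>/gi, "")
    // 3. Drop inline event handler attributes such as onclick="..." or onload='...'.
    .replace(/\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, "")
    // 4. Replace quoted javascript: URLs in href/src attributes with "#".
    .replace(/\b(href|src)\s*=\s*(["'])\s*javascript:[^"']*\2/gi, "$1=$2#$2");
}
```

For example, `<a href="javascript:void(0)" onclick="go()">x</a>` becomes `<a href="#">x</a>`.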

Health Check

GET /health

Returns server status and memory usage information.

Architecture

/src
  /types          - TypeScript interfaces
  /routes         - Express route handlers
  /services       - Business logic services
  /middleware     - Express middleware
  /utils          - Utility functions
  /schemas        - Zod validation schemas
  /config         - Configuration constants
  server.ts       - Main server file

Production Deployment

Build the application: npm run build
Set environment variables
Start with process manager (PM2, etc.)
Monitor memory usage
Set up log rotation

Error Handling

The service maps each class of failure to an HTTP status code:

Validation errors (400)
Security violations (400)
Memory limits (503)
Processing failures (500)
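
The status codes above could be produced by a mapping like the one sketched here. `CaptureError`, `CaptureErrorKind`, and `toHttpError` are hypothetical names, and the response body shape is an assumption; only the status codes come from this README.

```typescript
type CaptureErrorKind = "validation" | "security" | "memory" | "processing";

// Status codes taken from the list above.
const STATUS_BY_KIND: Record<CaptureErrorKind, number> = {
  validation: 400,
  security: 400,
  memory: 503,
  processing: 500,
};

// Error subclass that carries the failure category (hypothetical).
class CaptureError extends Error {
  readonly kind: CaptureErrorKind;

  constructor(kind: CaptureErrorKind, message: string) {
    super(message);
    this.name = "CaptureError";
    this.kind = kind;
    // Keeps instanceof working if the code is compiled to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Convert any thrown value into a status code and a JSON-serializable body.
function toHttpError(err: unknown): { status: number; body: { error: string; kind: CaptureErrorKind } } {
  if (err instanceof CaptureError) {
    return { status: STATUS_BY_KIND[err.kind], body: { error: err.message, kind: err.kind } };
  }
  // Unknown errors are reported as processing failures without leaking details.
  return { status: 500, body: { error: "Internal server error", kind: "processing" } };
}
```

An Express error-handling middleware could call `toHttpError(err)` and send the result with `res.status(status).json(body)`.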

Browser Management

Uses Playwright with Chromium
Headless mode for security
Proper cleanup and resource management
Stealth configuration to avoid detection
