How to Build a Scalable Web Scraping Pipeline with ByteNite

Table of Contents

  1. Introduction
  2. Use Cases
  3. What is ByteNite?
  4. Project Structure
  5. Prerequisites
  6. Installing the ByteNite Developer CLI
  7. Web Scraping Pipeline Components
  8. Running a Web Scraping Job on ByteNite
  9. Troubleshooting & FAQ
  10. References

Introduction

This repository provides a robust, scalable web scraping pipeline designed to run on ByteNite's distributed, serverless container platform. It enables high-performance Amazon product data extraction at scale with minimal infrastructure configuration. The pipeline uses Playwright for modern web automation, handling JavaScript-heavy pages and anti-bot measures while finding the cheapest products across multiple Amazon search results.

The web scraper specializes in Amazon product data extraction, gathering product titles, prices, and buyer counts to perform price comparison analysis. Its distributed architecture allows parallel processing of multiple URLs with intelligent error handling and automatic result aggregation.

Use Cases

This web scraping pipeline is ideal for:

  • Price Comparison: Finding the cheapest products across multiple Amazon search results
  • Market Research: Analyzing product prices and popularity trends across categories
  • Competitive Analysis: Monitoring pricing strategies and product positioning
  • Data Collection: Gathering structured product data for analytics and reporting
  • E-commerce Intelligence: Building datasets for pricing algorithms and recommendations
  • Inventory Monitoring: Tracking product availability and pricing changes over time

What is ByteNite?

ByteNite is a serverless container platform for stateless, compute-intensive workloads. It abstracts away cloud infrastructure, letting you focus on your application logic. ByteNite provides:

  • Near-instant startup times and flexible compute
  • Distributed execution fabric (native fan-in/fan-out logic)
  • Modular building blocks: Partitioners, Apps, Assemblers
  • Simple job submission via CLI or API

Project Structure

web-scraper/
├── fanout-urls/                     # Partitioner engine
│   ├── manifest.json                # Engine manifest
│   └── app/main.py                  # URL partitioning logic
├── scraper-engine/                  # Scraper app
│   ├── manifest.json                # App manifest
│   ├── Dockerfile                   # Playwright container setup
│   └── app/
│       ├── main.py                  # Playwright scraper logic
│       └── requirements.txt         # Python dependencies
├── data-assembler/                  # Assembler engine
│   ├── manifest.json                # Engine manifest
│   └── app/main.py                  # Price analysis logic
├── templates/                       # Job templates
│   └── web-scraper-simple-template.json
├── test_final.py                    # Complete local test
└── README.md

Prerequisites

Already onboarded to ByteNite? If you have already created an account, set up payment, and installed the CLI for a previous app, you can skip this section and jump to the next step you still need, such as Installing the ByteNite Developer CLI.

👤 Create an account

You will need to Request an Access Code and fill out the resulting form with your contact information. Once you receive your access code, you can sign up on the computing platform.

💳 Add a payment method

Once logged into the platform, go to the Billing Page (also reachable from the Billing tab in the sidebar). Locate the Payment Info card and navigate to the Customer Portal. Add a payment method to your account through Stripe. Your payment info is used for manual and automatic top-ups. Ensure you have enough funds to avoid service interruptions.

🪙 Redeem ByteChips

If you have a coupon code, redeem it on your billing page to add ByteChips (credits) to your balance. Go to the Account Balance card and click "Redeem". Enter your coupon code and complete the process. Refresh the page to confirm the new balance. We'd love to get you started with free credits to test our platform; contact ByteNite support to request some.

For CLI/SDK Users (Recommended)

Most users should use the CLI/SDK for the easiest experience:

  1. Download and install the ByteNite Developer CLI (see below for instructions by OS).
  2. Authenticate by running:
    bytenite auth
    This will open a browser window for secure login.
  3. Once authenticated, you can use all bytenite CLI commands to manage apps, engines, templates, and jobs.

For API Users (Advanced/Programmatic Access)

If you plan to use the ByteNite API directly (e.g., with Postman or custom scripts), you'll need an API key and access token:

๐Ÿ” Get an API key

Go to your ByteNite profile or click your profile avatar (top right). Click New API Key, configure its settings, and enter the confirmation code sent to your email. Copy your API key immediately and store it securely. You will not be able to view it again. If a key is no longer needed or is compromised, revoke it from your profile.

🔑 Get an access token

An access token is required to authenticate all requests to the ByteNite API (including Postman). Request an access token from the Access Token endpoint using your API key. Tokens last 1 hour by default. See the API Reference for details and example requests.

๐Ÿ› ๏ธ Set up development tools

  • Python 3.8+ for local development or running scripts.
  • Playwright for web scraping (required for local testing):
    pip install playwright
    playwright install chromium
  • Git (to clone this repository)
  • (Optional) Docker if you plan to build custom containers.

Installing the ByteNite Developer CLI

Linux (Ubuntu/Debian)

Add the ByteNite repository:

echo "deb [trusted=yes] https://storage.googleapis.com/bytenite-prod-apt-repo/debs ./" | sudo tee /etc/apt/sources.list.d/bytenite.list

Update package lists:

sudo apt update

Install the ByteNite CLI:

sudo apt install bytenite

Troubleshooting:

  • Update your system: sudo apt update && sudo apt upgrade
  • Verify repository: cat /etc/apt/sources.list.d/bytenite.list
  • Check package: apt search bytenite

macOS

Add the ByteNite tap:

brew tap ByteNite2/bytenite-dev-cli https://github.com/ByteNite2/bytenite-dev-cli.git

Install the CLI:

brew install bytenite

Update Homebrew:

brew update

Upgrade ByteNite CLI:

brew upgrade bytenite

Windows

Download and run the latest Windows release from the ByteNite CLI GitHub page.

Verify Installation

Check the CLI version:

bytenite version

Authenticate

Authenticate with OAuth2:

bytenite auth

This opens a browser for login. Credentials are stored at:

  • Linux: $HOME/.config/bytenite-cli/auth-prod.json
  • Mac: /Users/[user]/Library/Application Support/bytenite-cli/auth-prod.json
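If a script needs to check whether you have already authenticated, one option is to look for the credentials file at the locations listed above. The following is a minimal sketch; the function name `credentials_path` is hypothetical, and the location on other operating systems is not documented here, so the function returns `None` for them.

```python
import sys
from pathlib import Path
from typing import Optional


def credentials_path(platform: str = sys.platform,
                     home: Optional[Path] = None) -> Optional[Path]:
    """Return the credentials file location listed in this README, if known."""
    home = home or Path.home()
    if platform.startswith("linux"):
        return home / ".config" / "bytenite-cli" / "auth-prod.json"
    if platform == "darwin":
        return home / "Library" / "Application Support" / "bytenite-cli" / "auth-prod.json"
    # The location on other systems is not documented in this README.
    return None


if __name__ == "__main__":
    path = credentials_path()
    if path is not None and path.exists():
        print(f"Found credentials at {path}")
    else:
        print("No credentials found; run `bytenite auth` first.")
```

The check only confirms that the file exists, not that the stored credentials are still valid.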

Quick Start

Follow these steps to get up and running with your own ByteNite web scraping pipeline:

  1. Clone this repository to your own machine:

    git clone https://github.com/ByteNite2/web-scraper.git && cd web-scraper

    (Optional but recommended) Fork this repo to your own GitHub account.

  2. Install the ByteNite Developer CLI (see instructions above for your OS).

  3. Authenticate with ByteNite:

    bytenite auth
  4. Push the engines and apps to your ByteNite account:

    cd fanout-urls && bytenite engine push . && \
    cd ../scraper-engine && bytenite app push . && \
    cd ../data-assembler && bytenite engine push .
  5. Activate the engines and apps:

    bytenite engine activate fanout-urls && \
    bytenite app activate scraper-engine && \
    bytenite engine activate data-assembler
  6. Push the job template:

    cd .. && bytenite template push ./templates/web-scraper-simple-template.json
  7. Launch a job using the methods described below.
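If you redeploy often, steps 4 to 6 can be scripted. The sketch below runs the same CLI commands shown above, in the same order, using Python's `subprocess` module. It assumes the `bytenite` CLI is on your PATH and that you have already authenticated; the names `STEPS` and `deploy` are illustrative and not part of the repository.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple

# (working directory relative to the repo root, command) pairs, in order.
STEPS: List[Tuple[str, List[str]]] = [
    ("fanout-urls", ["bytenite", "engine", "push", "."]),
    ("scraper-engine", ["bytenite", "app", "push", "."]),
    ("data-assembler", ["bytenite", "engine", "push", "."]),
    (".", ["bytenite", "engine", "activate", "fanout-urls"]),
    (".", ["bytenite", "app", "activate", "scraper-engine"]),
    (".", ["bytenite", "engine", "activate", "data-assembler"]),
    (".", ["bytenite", "template", "push",
           "./templates/web-scraper-simple-template.json"]),
]


def deploy(repo_root: str, dry_run: bool = False) -> List[str]:
    """Run each step in order, stopping at the first failure.

    Returns the commands as strings. With dry_run=True nothing is executed.
    """
    root = Path(repo_root)
    commands = []
    for subdir, argv in STEPS:
        commands.append(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError if the CLI exits non-zero.
            subprocess.run(argv, cwd=root / subdir, check=True)
    return commands
```

Calling `deploy(".", dry_run=True)` from the repository root returns the planned commands without running them, which is a cheap way to review the sequence first.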

ByteNite Dev CLI: Commands & Usage

Run the help command to see all options:

bytenite --help

Most users only need these commands in order:

  • bytenite engine push [engine_folder]
  • bytenite engine activate [engine_tag]
  • bytenite app push [app_folder]
  • bytenite app activate [app_tag]
  • bytenite template push [template_filepath]
  • bytenite engine status [engine_tag] (to check engine status)
  • bytenite app status [app_tag] (to check app status)

For more commands, run bytenite --help or see the ByteNite documentation.

Web Scraping Pipeline Components

Partitioner (fanout-urls)

  • Purpose: Splits URL lists into processable chunks for parallel scraping
  • Type: Engine component
  • Functionality: Validates Amazon URLs, creates JSON chunks for distributed processing
  • Parameters: urls (list of Amazon URLs), chunk_size (URLs per chunk), num_replicas (number of replicas)
  • Container: Uses default ByteNite runtime
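To make the partitioning step concrete, here is a minimal sketch of URL validation and chunking. It is not the code in fanout-urls/app/main.py; the function names are hypothetical, and the validation rule (http or https on amazon.com or one of its subdomains) is an assumption about what "validates Amazon URLs" means.

```python
from typing import List
from urllib.parse import urlparse


def is_amazon_url(url: str) -> bool:
    """Return True for http(s) URLs on amazon.com or one of its subdomains."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and (
        host == "amazon.com" or host.endswith(".amazon.com")
    )


def chunk_urls(urls: List[str], chunk_size: int = 1) -> List[List[str]]:
    """Drop non-Amazon URLs, then split the rest into lists of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    valid = [u for u in urls if is_amazon_url(u)]
    return [valid[i:i + chunk_size] for i in range(0, len(valid), chunk_size)]
```

With chunk_size set to 1, each URL becomes its own chunk, so every URL can be scraped by a separate task.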

App (scraper-engine)

  • Purpose: Scrapes Amazon product data using Playwright browser automation
  • Type: App component
  • Functionality: Extracts product titles, prices, and buyer counts from Amazon search pages
  • Parameters: headless, delay_between_requests, size
  • Container: Custom Docker image vyomapatel12/web-scraper:v1.5 with Playwright pre-installed

Configurable App Parameters

  • headless: Run browser in headless mode (default: true)
  • delay_between_requests: Delay between URL requests in seconds (default: 5)
  • size: Processing batch size (default: 2)
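As an illustration of how these parameters might be applied, the sketch below overlays user-supplied values on the defaults listed above and uses `headless` and `delay_between_requests` in a simple Playwright loop. It is not the code in scraper-engine/app/main.py: the function names are hypothetical, the loop only reads page titles rather than product data, and `size` is carried through unused because this README does not describe how it is consumed.

```python
import time
from typing import Any, Dict, List

# Defaults as listed in this README.
DEFAULTS: Dict[str, Any] = {"headless": True, "delay_between_requests": 5, "size": 2}


def resolve_params(user_params: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user-supplied values on the defaults, rejecting unknown keys."""
    unknown = set(user_params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **user_params}


def fetch_titles(urls: List[str], user_params: Dict[str, Any]) -> List[str]:
    """Visit each URL in turn and return the page titles (illustrative only)."""
    # Imported lazily so resolve_params works without Playwright installed.
    from playwright.sync_api import sync_playwright

    params = resolve_params(user_params)
    titles = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=params["headless"])
        page = browser.new_page()
        for index, url in enumerate(urls):
            if index > 0:
                # Pause between requests to reduce the chance of being blocked.
                time.sleep(params["delay_between_requests"])
            page.goto(url, wait_until="domcontentloaded")
            titles.append(page.title())
        browser.close()
    return titles
```

Rejecting unknown keys catches typos such as `delay_between_request` early instead of silently falling back to the default.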

Assembler (data-assembler)

  • Purpose: Analyzes scraped results to find the cheapest products
  • Type: Engine component
  • Functionality: Compares prices across all scraped items, identifies cheapest product
  • Parameters: None required (uses default empty parameters)
  • Container: Uses default ByteNite runtime
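The price comparison can be sketched as follows. This is not the code in data-assembler/app/main.py; the record fields (`title`, `price`) and the US-dollar price format are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def parse_price(text: str) -> Optional[float]:
    """Convert a price string such as "$1,299.99" to a float; None if unparseable."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def find_cheapest(items: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the item with the lowest parseable price, or None if there is none."""
    priced = [(parse_price(item.get("price", "")), item) for item in items]
    priced = [(price, item) for price, item in priced if price is not None]
    if not priced:
        return None
    return min(priced, key=lambda pair: pair[0])[1]
```

Items whose price cannot be parsed are skipped rather than treated as zero, so a missing price never wins the comparison.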

Templates

Templates define the complete job configuration linking all components:

  • web-scraper-simple-template: Configured for Amazon product price comparison

Running a Web Scraping Job on ByteNite

To launch a web scraping job, create a new ByteNite job with the web-scraper-simple-template and provide your desired Amazon URLs and scraping parameters. The pipeline will process URLs in parallel, extract product data, and identify the cheapest items, with results saved to the output directory for easy access via the ByteNite UI or API.

Launching via GUI

Follow these steps to launch a job using the ByteNite GUI:

  1. Go to https://computing.bytenite.com and log in.
  2. Navigate to the Templates section in the sidebar.
  3. Select web-scraper-simple-template and click on it to create a new job.
  4. In the job configuration form, fill in the required parameters:
    • Partitioner: Configure urls (list of Amazon search URLs), chunk_size (default: 1), and num_replicas (default: 2)
    • App: Configure scraping parameters (headless, delay_between_requests, size)
    • Assembler: Leave parameters empty (uses defaults)
  5. For Data Source, select Bypass.
  6. For Data Destination, select Temporary Bucket.
  7. Review your configuration and click Start Job.
  8. Monitor job progress and logs from the job overview page. Once complete, download the results directly from the interface.

Launching via API

Get an access token:

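import requests

# Exchange the API key for a short-lived access token.
response = requests.post(
    "https://api.bytenite.com/v1/auth/access_token",
    json={"apiKey": "<YOUR_API_KEY>"},
    timeout=30,
)
response.raise_for_status()  # Fail early on an invalid key or a server error
token = response.json()["token"]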
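Because tokens last 1 hour by default, a long-running script may want to reuse a token and only request a new one shortly before it expires. The following is a minimal sketch of that idea; the `TokenCache` class is hypothetical, and the `fetch` callable stands in for whatever code you use to request a token.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Cache an access token and refetch it shortly before it expires."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float = 3600,
                 margin_seconds: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch            # Callable that returns a fresh token
        self._ttl = ttl_seconds        # Assumed lifetime (1 hour by default)
        self._margin = margin_seconds  # Refresh this long before expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one if it is missing or stale."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

The cache tracks the expiry locally from the assumed lifetime. If your token lifetime differs from the default, pass the matching `ttl_seconds`.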

Web Scraping Job Submission

{
  "templateId": "web-scraper-simple-template",
  "description": "Amazon product price comparison job",
  "params": {
    "partitioner": {
      "urls": [
        "https://www.amazon.com/s?k=xbox+controller",
        "https://www.amazon.com/s?k=ps5+controller",
        "https://www.amazon.com/s?k=8bit+do+controller"
      ],
      "chunk_size": 1,
      "num_replicas": 2
    },
    "app": {
      "headless": true,
      "delay_between_requests": 5,
      "size": 2
    },
    "assembler": {}
  },
  "dataSource": {
    "dataSourceDescriptor": "bypass"
  },
  "dataDestination": {
    "dataSourceDescriptor": "bucket"
  },
  "config": {
    "isTestJob": true,
    "jobTimeout": 3600,
    "taskTimeout": 3600
  }
}
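The request body above can also be built and sent from Python. The sketch below is illustrative only. This README does not state the job-creation endpoint or the exact Authorization header scheme, so `jobs_url` is a required argument that you must take from the API Reference, and sending the raw token in the Authorization header is an assumption to confirm there. The function names are hypothetical.

```python
from typing import Any, Dict, List


def build_job_payload(urls: List[str], chunk_size: int = 1,
                      num_replicas: int = 2) -> Dict[str, Any]:
    """Build the job body shown above for a given list of Amazon search URLs."""
    return {
        "templateId": "web-scraper-simple-template",
        "description": "Amazon product price comparison job",
        "params": {
            "partitioner": {
                "urls": urls,
                "chunk_size": chunk_size,
                "num_replicas": num_replicas,
            },
            "app": {"headless": True, "delay_between_requests": 5, "size": 2},
            "assembler": {},
        },
        "dataSource": {"dataSourceDescriptor": "bypass"},
        "dataDestination": {"dataSourceDescriptor": "bucket"},
        "config": {"isTestJob": True, "jobTimeout": 3600, "taskTimeout": 3600},
    }


def submit_job(jobs_url: str, token: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload to jobs_url (endpoint and header scheme are assumptions)."""
    # Imported lazily so build_job_payload works without requests installed.
    import requests

    response = requests.post(
        jobs_url,
        headers={"Authorization": token},  # Assumption: confirm in the API Reference
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Keeping payload construction separate from the network call lets you inspect or log the body before anything is sent.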

Local Testing

Before deploying to ByteNite, test the complete pipeline locally:

# Install dependencies
pip install playwright
playwright install chromium

# Run the complete test
python3 test_final.py

The test script will:

  1. Create test URLs and parameters
  2. Run the partitioner to create chunks
  3. Execute the scraper on each chunk
  4. Run the assembler to find the cheapest item
  5. Display the final results

Troubleshooting & FAQ

  • App fails to start: Check your container image and manifest.json for correct dependencies and entrypoint.
  • No scraped data: Ensure the Amazon URLs are accessible and not blocked. Try increasing the delay_between_requests parameter.
  • Resource errors: Increase min_cpu/min_memory in manifest.json or check Docker image compatibility.
  • Authentication issues: Regenerate your API key and access token.
  • Scraping blocked: Amazon may block automated requests. Try increasing delay_between_requests or using different URLs.
  • Engine vs App confusion: fanout-urls and data-assembler are engines; scraper-engine is an app. Use bytenite engine commands for the former and bytenite app commands for the latter.

See ByteNite Docs FAQ for more.

References

For questions or support, please open an issue or contact the ByteNite team via the official docs.
