- Introduction
- Use Cases
- What is ByteNite?
- Project Structure
- Prerequisites
- Installing the ByteNite Developer CLI
- Web Scraping Pipeline Components
- Running a Web Scraping Job on ByteNite
- Troubleshooting & FAQ
- References
This repository provides a robust, scalable web scraping pipeline designed to run on ByteNite's distributed, serverless container platform. It enables high-performance Amazon product data extraction at scale with minimal infrastructure configuration. The pipeline uses Playwright for modern web automation, handling JavaScript-heavy pages and anti-bot measures while finding the cheapest products across multiple Amazon search results.
The web scraper specializes in Amazon product data extraction, gathering product titles, prices, and buyer counts to perform price comparison analysis. Its distributed architecture allows parallel processing of multiple URLs with intelligent error handling and automatic result aggregation.
This web scraping pipeline is ideal for:
- Price Comparison: Finding the cheapest products across multiple Amazon search results
- Market Research: Analyzing product prices and popularity trends across categories
- Competitive Analysis: Monitoring pricing strategies and product positioning
- Data Collection: Gathering structured product data for analytics and reporting
- E-commerce Intelligence: Building datasets for pricing algorithms and recommendations
- Inventory Monitoring: Tracking product availability and pricing changes over time
ByteNite is a serverless container platform for stateless, compute-intensive workloads. It abstracts away cloud infrastructure, letting you focus on your application logic. ByteNite provides:
- Near-instant startup times and flexible compute
- Distributed execution fabric (native fan-in/fan-out logic)
- Modular building blocks: Partitioners, Apps, Assemblers
- Simple job submission via CLI or API
web-scraper/
├── fanout-urls/                          # Partitioner engine
│   ├── manifest.json                     # Engine manifest
│   └── app/main.py                       # URL partitioning logic
├── scraper-engine/                       # Scraper app
│   ├── manifest.json                     # App manifest
│   ├── Dockerfile                        # Playwright container setup
│   └── app/
│       ├── main.py                       # Playwright scraper logic
│       └── requirements.txt              # Python dependencies
├── data-assembler/                       # Assembler engine
│   ├── manifest.json                     # Engine manifest
│   └── app/main.py                       # Price analysis logic
├── templates/                            # Job templates
│   └── web-scraper-simple-template.json
├── test_final.py                         # Complete local test
└── README.md
Already onboarded to ByteNite? If you've already created an account, set up payment, and installed the CLI for a previous app, you can skip this section and jump straight to Installing the ByteNite Developer CLI or the next relevant step.
You will need to Request an Access Code and fill out the resulting form with your contact info. Once you receive your access code, you can sign up on the computing platform.
Once logged into the platform, go to the Billing Page (also reachable from the Billing tab in the sidebar). Locate the Payment Info card and navigate to the Customer Portal. Add a payment method to your account through Stripe. Your payment info is used for manual and automatic top-ups. Ensure you have enough funds to avoid service interruptions.
If you have a coupon code, redeem it on your billing page to add ByteChips (credits) to your balance. Go to the Account Balance card and click "Redeem". Enter your coupon code and complete the process, then refresh the page to confirm the new balance. We'd love to get you started with free credits to test our platform; contact ByteNite support to request some.
Most users should use the CLI/SDK for the easiest experience:
- Download and install the ByteNite Developer CLI (see below for instructions by OS).
- Authenticate by running:
  bytenite auth
  This will open a browser window for secure login.
- Once authenticated, you can use all bytenite CLI commands to manage apps, engines, templates, and jobs.
If you plan to use the ByteNite API directly (e.g., with Postman or custom scripts), you'll need an API key and access token:
Go to your ByteNite profile or click your profile avatar (top right). Click New API Key, configure its settings, and enter the confirmation code sent to your email. Copy your API key immediately and store it securely. You will not be able to view it again. If a key is no longer needed or is compromised, revoke it from your profile.
An access token is required to authenticate all requests to the ByteNite API (including Postman). Request an access token from the Access Token endpoint using your API key. Tokens last 1 hour by default. See the API Reference for details and example requests.
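Because tokens expire after about an hour, a long-running script may want to reuse a token until shortly before expiry instead of requesting a new one on every call. The following is a minimal sketch of that idea, not part of this repository: `TokenCache`, `fetch_token`, and `refresh_margin` are hypothetical names, and the 3600-second lifetime is taken from the default stated above.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Reuses an access token and refreshes it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], str],      # hypothetical: any function returning a fresh token
        ttl_seconds: float = 3600.0,         # assumed lifetime: "1 hour by default"
        refresh_margin: float = 60.0,        # refresh this many seconds early
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._ttl = ttl_seconds
        self._margin = refresh_margin
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, fetching a new one when it is missing or near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()
            self._expires_at = now + self._ttl
        return self._token
```

In practice, `fetch_token` would wrap the call to the Access Token endpoint; injecting it as a function keeps the cache independent of any particular HTTP client.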
- Python 3.8+ for local development or running scripts.
- Playwright for web scraping (required for local testing):
  pip install playwright
  playwright install chromium
- Git (to clone this repository)
- (Optional) Docker if you plan to build custom containers.
Add the ByteNite repository:
echo "deb [trusted=yes] https://storage.googleapis.com/bytenite-prod-apt-repo/debs ./" | sudo tee /etc/apt/sources.list.d/bytenite.list
Update package lists:
sudo apt update
Install the ByteNite CLI:
sudo apt install bytenite
Troubleshooting:
- Update your system: sudo apt update && sudo apt upgrade
- Verify repository: cat /etc/apt/sources.list.d/bytenite.list
- Check package: apt search bytenite
Add the ByteNite tap:
brew tap ByteNite2/bytenite-dev-cli https://github.com/ByteNite2/bytenite-dev-cli.git
Install the CLI:
brew install bytenite
Update Homebrew:
brew update
Upgrade ByteNite CLI:
brew upgrade bytenite
Download and run the latest Windows release from the ByteNite CLI GitHub page.
Check the CLI version:
bytenite version
Authenticate with OAuth2:
bytenite auth
This opens a browser for login. Credentials are stored at:
- Linux: $HOME/.config/bytenite-cli/auth-prod.json
- Mac: /Users/[user]/Library/Application Support/bytenite-cli/auth-prod.json
Follow these steps to get up and running with your own ByteNite web scraping pipeline:
- Clone this repository to your own machine:
  git clone https://github.com/ByteNite2/web-scraper.git && cd web-scraper
  (Optional but recommended) Fork this repo to your own GitHub account.
- Install the ByteNite Developer CLI (see instructions above for your OS).
- Authenticate with ByteNite:
  bytenite auth
- Push the engines and apps to your ByteNite account:
  cd fanout-urls && bytenite engine push . && \
  cd ../scraper-engine && bytenite app push . && \
  cd ../data-assembler && bytenite engine push .
- Activate the engines and apps:
  bytenite engine activate fanout-urls && \
  bytenite app activate scraper-engine && \
  bytenite engine activate data-assembler
- Push the job template:
  cd .. && bytenite template push ./templates/web-scraper-simple-template.json
- Launch a job using the methods described below.
Run the help command to see all options:
bytenite --help
Most users only need these commands in order:
- bytenite engine push [engine_folder]
- bytenite engine activate [engine_tag]
- bytenite app push [app_folder]
- bytenite app activate [app_tag]
- bytenite template push [template_filepath]
- bytenite engine status [engine_tag] (to check engine status)
- bytenite app status [app_tag] (to check app status)
For more commands, run bytenite --help or see the ByteNite documentation.
- Purpose: Splits URL lists into processable chunks for parallel scraping
- Type: Engine component
- Functionality: Validates Amazon URLs, creates JSON chunks for distributed processing
- Parameters: urls (list of Amazon URLs), chunk_size (URLs per chunk), num_replicas (number of replicas)
- Container: Uses default ByteNite runtime
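The validate-then-chunk behavior described above can be sketched in a few lines. This is an illustrative approximation rather than the code in fanout-urls/app/main.py: the helper names `is_amazon_url` and `partition_urls`, the restriction to amazon.com hosts, and the `{"urls": [...]}` chunk layout are all assumptions.

```python
import json
from typing import List
from urllib.parse import urlparse


def is_amazon_url(url: str) -> bool:
    """Accept only http(s) URLs on amazon.com or one of its subdomains (assumed rule)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_amazon = host == "amazon.com" or host.endswith(".amazon.com")
    return parsed.scheme in ("http", "https") and on_amazon


def partition_urls(urls: List[str], chunk_size: int = 1) -> List[str]:
    """Drop invalid URLs, then serialize the rest as JSON chunks of chunk_size URLs each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    valid = [url for url in urls if is_amazon_url(url)]
    return [
        json.dumps({"urls": valid[start:start + chunk_size]})  # hypothetical chunk layout
        for start in range(0, len(valid), chunk_size)
    ]
```

With `chunk_size` set to 1, each URL becomes its own chunk, so each URL can be scraped by a separate task.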
- Purpose: Scrapes Amazon product data using Playwright browser automation
- Type: App component
- Functionality: Extracts product titles, prices, and buyer counts from Amazon search pages
- Parameters: headless, delay_between_requests, size
- Container: Custom Docker image vyomapatel12/web-scraper:v1.5 with Playwright pre-installed
- headless: Run browser in headless mode (default: true)
- delay_between_requests: Delay between URL requests in seconds (default: 5)
- size: Processing batch size (default: 2)
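To show how delay_between_requests might shape the scraping loop, here is a hedged sketch that paces requests and records failures per URL instead of aborting the chunk. It is not the repository's scraper: `scrape_chunk` and `fetch_products` are hypothetical names, and the page-fetching step (Playwright in the real app) is passed in as a function so the loop can be read and tested on its own.

```python
import time
from typing import Callable, Dict, List


def scrape_chunk(
    urls: List[str],
    fetch_products: Callable[[str], List[Dict]],   # hypothetical: returns product dicts for one URL
    delay_between_requests: float = 5.0,           # seconds, matching the documented default
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[Dict]]:
    """Scrape each URL in turn, pausing between requests and collecting per-URL errors."""
    products: List[Dict] = []
    errors: List[Dict] = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_between_requests)  # pause before every request except the first
        try:
            products.extend(fetch_products(url))
        except Exception as exc:  # one blocked or broken page should not fail the whole chunk
            errors.append({"url": url, "error": str(exc)})
    return {"products": products, "errors": errors}
```

Keeping the errors alongside the products lets a downstream step report which URLs were blocked while still using the data that was collected.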
- Purpose: Analyzes scraped results to find the cheapest products
- Type: Engine component
- Functionality: Compares prices across all scraped items, identifies cheapest product
- Parameters: None required (uses default empty parameters)
- Container: Uses default ByteNite runtime
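The price comparison the assembler performs can be illustrated as follows. This is a sketch under assumptions, not the code in data-assembler/app/main.py: `find_cheapest` is a hypothetical name, and each scraped item is assumed to be a dict with a numeric `price` field, with items lacking a usable price skipped.

```python
from typing import Dict, List, Optional


def find_cheapest(task_results: List[List[Dict]]) -> Optional[Dict]:
    """Flatten the per-task result lists and return the item with the lowest price."""
    candidates = [
        item
        for result in task_results
        for item in result
        if isinstance(item.get("price"), (int, float))  # skip items whose price was not parsed
    ]
    # default=None covers the case where no task produced a priced item
    return min(candidates, key=lambda item: item["price"], default=None)
```

Returning `None` when nothing was scraped gives the caller a clear signal to report an empty result rather than fail.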
Templates define the complete job configuration linking all components:
- web-scraper-simple-template: Configured for Amazon product price comparison
To launch a web scraping job, create a new ByteNite job with the web-scraper-simple-template and provide your desired Amazon URLs and scraping parameters. The pipeline will process URLs in parallel, extract product data, and identify the cheapest items, with results saved to the output directory for easy access via the ByteNite UI or API.
Follow these steps to launch a job using the ByteNite GUI:
- Go to https://computing.bytenite.com and log in.
- Navigate to the Templates section in the sidebar.
- Select web-scraper-simple-template and click on it to create a new job.
- In the job configuration form, fill in the required parameters:
  - Partitioner: Configure urls (list of Amazon search URLs), chunk_size (default: 1), and num_replicas (default: 2)
  - App: Configure scraping parameters (headless, delay_between_requests, size)
  - Assembler: Leave parameters empty (uses defaults)
- For Data Source, select Bypass.
- For Data Destination, select Temporary Bucket.
- Review your configuration and click Start Job.
- Monitor job progress and logs from the job overview page. Once complete, download the results directly from the interface.
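For API-based submissions, the same fields the form collects can be assembled into a request body in code. The sketch below builds a dictionary using the field names from this README's job request example; the function name `build_job_request` and its defaults are assumptions for illustration, and it does not send anything to ByteNite.

```python
import json
from typing import Dict, List


def build_job_request(
    urls: List[str],
    chunk_size: int = 1,
    num_replicas: int = 2,
    headless: bool = True,
    delay_between_requests: int = 5,
    size: int = 2,
    is_test_job: bool = True,
) -> Dict:
    """Assemble a job request body for web-scraper-simple-template (hypothetical helper)."""
    return {
        "templateId": "web-scraper-simple-template",
        "description": "Amazon product price comparison job",
        "params": {
            "partitioner": {"urls": urls, "chunk_size": chunk_size, "num_replicas": num_replicas},
            "app": {
                "headless": headless,
                "delay_between_requests": delay_between_requests,
                "size": size,
            },
            "assembler": {},  # the assembler takes no parameters
        },
        "dataSource": {"dataSourceDescriptor": "bypass"},
        "dataDestination": {"dataSourceDescriptor": "bucket"},
        "config": {"isTestJob": is_test_job, "jobTimeout": 3600, "taskTimeout": 3600},
    }


if __name__ == "__main__":
    body = build_job_request(["https://www.amazon.com/s?k=xbox+controller"])
    print(json.dumps(body, indent=2))
```

Building the body from a function keeps the partitioner, app, and assembler parameters in one place when launching several jobs with different URL lists.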
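Get an access token:
import requests

# Exchange the API key for a short-lived access token
response = requests.post(
    "https://api.bytenite.com/v1/auth/access_token",
    json={"apiKey": "<YOUR_API_KEY>"}
)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error body
token = response.json()["token"]
Example job request body:
{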
  "templateId": "web-scraper-simple-template",
  "description": "Amazon product price comparison job",
  "params": {
    "partitioner": {
      "urls": [
        "https://www.amazon.com/s?k=xbox+controller",
        "https://www.amazon.com/s?k=ps5+controller",
        "https://www.amazon.com/s?k=8bit+do+controller"
      ],
      "chunk_size": 1,
      "num_replicas": 2
    },
    "app": {
      "headless": true,
      "delay_between_requests": 5,
      "size": 2
    },
    "assembler": {}
  },
  "dataSource": {
    "dataSourceDescriptor": "bypass"
  },
  "dataDestination": {
    "dataSourceDescriptor": "bucket"
  },
  "config": {
    "isTestJob": true,
    "jobTimeout": 3600,
    "taskTimeout": 3600
  }
}
Before deploying to ByteNite, test the complete pipeline locally:
# Install dependencies
pip install playwright
playwright install chromium
# Run the complete test
python3 test_final.py
The test script will:
- Create test URLs and parameters
- Run the partitioner to create chunks
- Execute the scraper on each chunk
- Run the assembler to find the cheapest item
- Display the final results
- App fails to start: Check your container image and manifest.json for correct dependencies and entrypoint.
- No scraped data: Ensure Amazon URLs are accessible and not blocked. Try increasing delay_between_requests parameter.
- Resource errors: Increase min_cpu/min_memory in manifest.json or check Docker image compatibility.
- Authentication issues: Regenerate your API key and access token.
- Scraping blocked: Amazon may block automated requests. Try increasing delay_between_requests or using different URLs.
- Engine vs App confusion: fanout-urls and data-assembler are engines, while scraper-engine is an app. Use the bytenite engine commands for the former and the bytenite app commands for the latter.
See ByteNite Docs FAQ for more.
- ByteNite Documentation
- ByteNite Dev CLI
- API Reference
- Playwright Documentation
- Web Scraping Best Practices
For questions or support, please open an issue or contact the ByteNite team via the official docs.