A production-ready Python scraper for collecting product data from Sainsbury's UK grocery website.
- Scrapes product data from Sainsbury's grocery API
- Automatic category discovery and traversal
- JSON file storage with metadata
- Checkpoint/resume capability - interrupt and resume anytime
- Built-in rate limiting
- Progress tracking and statistics
- SQL query interface using DuckDB for data analysis
- Defunct category handling - HTTP 400 errors won't block progress
- Test mode for development

Requirements:
- Python 3.8+
- uv - Fast Python package manager
To install:

```bash
# Clone the repository
git clone https://github.com/danclark-codes/sainsburys-scrape.git
cd sainsburys-scrape

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt
```

You need to capture authentication credentials from your browser:
- Open Sainsbury's website in Chrome/Firefox
- Open Developer Tools (F12)
- Go to the Network tab
- Navigate to any product page
- Find requests to `groceries-api/gol-services/product/v1/product`
- Copy:
  - the Authorization header (the full "Bearer ..." token)
  - the Cookie header (all cookies)
```bash
# Create auth config template
uv run python scrape.py --create-auth

# Edit config/auth.json with your credentials
nano config/auth.json
```

Update the file with your captured credentials:
```json
{
  "authorization": "Bearer YOUR_ACTUAL_TOKEN_HERE",
  "cookie": "YOUR_COOKIES_HERE",
  "headers": { ... }
}
```

See `auth.example.json` for a complete template with instructions.
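For orientation, the sketch below shows roughly how the captured credentials could be attached to a product-API request. It is illustrative only: the function names, host, and query parameter are assumptions, not the scraper's actual code (see `sainsburys/client.py` for that).

```python
# Illustrative sketch only - the real client lives in sainsburys/client.py.
import json
import requests

def load_auth(path="config/auth.json"):
    """Build request headers from the captured credentials."""
    with open(path) as f:
        auth = json.load(f)
    headers = dict(auth.get("headers", {}))
    headers["Authorization"] = auth["authorization"]  # the "Bearer ..." token
    headers["Cookie"] = auth["cookie"]                # the full cookie string
    return headers

def fetch_product(product_uid, headers):
    # Endpoint path as seen in the browser's Network tab; the host and the
    # "uids" query parameter are assumptions for illustration.
    url = ("https://www.sainsburys.co.uk/groceries-api/gol-services/"
           "product/v1/product")
    resp = requests.get(url, params={"uids": product_uid},
                        headers=headers, timeout=30)
    resp.raise_for_status()  # 401/403 usually means expired credentials
    return resp.json()
```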
To run the scraper:

```bash
# Test mode - scrape 1 category, 2 pages
uv run python scrape.py --test

# Full scrape - all categories
uv run python scrape.py --full

# List available categories
uv run python scrape.py --list

# Resume from checkpoint after interruption
uv run python scrape.py --resume
```

```bash
# Scrape specific categories by index
uv run python scrape.py --categories 0 1 2 --max-pages 5
# Scrape single category by name
uv run python scrape.py --category-name "gb/groceries/fruit-veg" --max-pages 10
# Scrape without subcategory discovery
uv run python scrape.py --categories 0 --no-subcategories
# Custom data directory
uv run python scrape.py --test --data-dir /path/to/data
# Adjust rate limiting (seconds between requests)
uv run python scrape.py --test --rate-limit 5.0
```

The scraper supports graceful interruption and resume:
```bash
# Start scraping
uv run python scrape.py --full

# Press Ctrl+C to stop gracefully...
# Progress will be saved automatically

# Later, resume from checkpoint:
uv run python scrape.py --full --resume
```

Query your scraped data using SQL with DuckDB:
```bash
# Show all tables
uv run python query.py "SHOW TABLES"
# Find products by name
uv run python query.py "SELECT name, price_value, brand FROM products WHERE name LIKE '%chocolate%' LIMIT 20"
# Analyze prices by brand
uv run python query.py "SELECT * FROM price_analysis"
# Show top-rated products
uv run python query.py "SELECT name, brand, avg_rating, review_count FROM products WHERE review_count > 10 ORDER BY avg_rating DESC LIMIT 10"
# Find most expensive products
uv run python query.py "SELECT name, brand, price_value FROM products ORDER BY price_value DESC LIMIT 10"
# Rebuild database from JSON files
uv run python query.py --rebuild "SELECT COUNT(*) as total FROM products"
# Export to different formats
uv run python query.py --format markdown "SELECT * FROM product_summary LIMIT 5"
```

`scrape.py` options:

| Option | Description |
|---|---|
| `--test` | Test mode (1 category, 2 pages, no subcategories) |
| `--full` | Full scrape (all categories) |
| `--categories INDEX ...` | Scrape specific categories by index |
| `--category-name NAME` | Scrape specific category by name |
| `--max-pages N` | Maximum pages per category |
| `--no-subcategories` | Don't discover/scrape subcategories |
| `--resume` | Resume from checkpoint if available |
| `--list` | List available categories |
| `--data-dir PATH` | Output directory (default: data) |
| `--config-dir PATH` | Config directory (default: config) |
| `--rate-limit SECONDS` | Delay between requests (default: 3.0) |
| `--create-auth` | Create auth.json template |
`query.py` options:

| Option | Description |
|---|---|
| `--data-dir PATH` | Data directory (default: data) |
| `--format FORMAT` | Output format: grid, simple, html, latex, markdown |
| `--rebuild` | Force rebuild database from JSON files |
| `--no-views` | Skip creating convenience views |
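For a sense of what `--rebuild` involves: DuckDB can ingest the product JSON files directly. The sketch below is a simplified, assumption-laden version - the table and column names may not match what `query.py` actually creates.

```python
# Simplified sketch of rebuilding the database from the scraped JSON files.
# Table/view names and the flattened columns are assumptions.
import duckdb

con = duckdb.connect("data/sainsburys.duckdb")

# read_json_auto infers a schema across all matching product files.
con.execute("""
    CREATE OR REPLACE TABLE products_raw AS
    SELECT * FROM read_json_auto('data/products/product_*.json')
""")

# Expose a flat view with a few of the column names used in the query
# examples above.
con.execute("""
    CREATE OR REPLACE VIEW products AS
    SELECT
        product_uid,
        name,
        retail_price.price AS price_value,
        _metadata.scraped_at AS scraped_at
    FROM products_raw
""")

print(con.execute("SELECT COUNT(*) AS total FROM products").fetchone())
```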
Project layout:

```
sainsburys-scrape/
├── sainsburys/              # Main package
│   ├── client.py            # API client
│   ├── scraper.py           # Scraper orchestration
│   ├── storage.py           # Data storage
│   ├── config.py            # Configuration management
│   └── checkpoint.py        # Checkpoint/resume functionality
├── config/                  # Configuration files
│   ├── auth.json            # Authentication (gitignored)
│   └── categories.json      # Categories list
├── data/                    # Scraped data
│   ├── products/            # Individual product JSONs
│   ├── categories/          # Category metadata
│   ├── checkpoint.json      # Resume checkpoint (if interrupted)
│   └── sainsburys.duckdb    # DuckDB database for queries
├── scrape.py                # CLI entry point
├── query.py                 # SQL query interface
├── auth.example.json        # Authentication template
└── requirements.txt         # Dependencies
```
Each product is saved as `data/products/product_{id}.json`:
```json
{
  "product_uid": "7968098",
  "name": "Product Name",
  "retail_price": {
    "price": 2.50,
    "measure": "unit"
  },
  "breadcrumbs": [...],
  "_metadata": {
    "scraped_at": "2025-08-17T10:30:00",
    "scraped_timestamp": 1755340200.0,
    "scraper_version": "2.0.0"
  }
}
```

The DuckDB database also provides convenience views (one is sketched below):

- `product_summary` - Essential product information with prices and ratings
- `price_analysis` - Price statistics grouped by brand
- `category_summary` - Category scraping statistics and duplicates
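As a rough idea of what one of these views could look like, here is a sketch of a `price_analysis`-style view (price statistics grouped by brand, using the `brand` and `price_value` columns from the query examples above). The actual SQL in `query.py` may differ.

```python
# Sketch of a price_analysis-style convenience view; the real definition
# in query.py may differ.
import duckdb

con = duckdb.connect("data/sainsburys.duckdb")
con.execute("""
    CREATE OR REPLACE VIEW price_analysis AS
    SELECT
        brand,
        COUNT(*)                   AS product_count,
        MIN(price_value)           AS min_price,
        ROUND(AVG(price_value), 2) AS avg_price,
        MAX(price_value)           AS max_price
    FROM products
    GROUP BY brand
    ORDER BY avg_price DESC
""")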
Notes on authentication:

- Authentication tokens expire - refresh from browser if you get 401/403 errors
- Cookies typically last for a browser session
- The scraper saves credentials locally - never commit `config/auth.json`
To verify your setup:

```bash
# Run in test mode
uv run python scrape.py --test

# Test specific category
uv run python scrape.py --category-name "gb/groceries/bakery" --max-pages 1

# Test SQL queries
uv run python query.py "SELECT COUNT(*) FROM products"
```

The scraper automatically tracks what has been scraped and will skip already processed items on subsequent runs. To force a re-scrape, delete the relevant files from `data/`.
Performance notes:

- Default rate limit: 3 seconds between requests (a minimal limiter sketch follows this list)
- Typical scraping speed: ~20 products/minute
- Full scrape estimate: Several hours for all categories
- DuckDB queries: Instant on thousands of products
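A minimal sketch of the fixed-delay rate limiting described above (the class is hypothetical; the scraper's actual implementation may differ):

```python
# Hypothetical fixed-delay rate limiter: keep at least `delay` seconds
# between consecutive requests.
import time

class RateLimiter:
    def __init__(self, delay=3.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

# Usage: call limiter.wait() before every API request.
# limiter = RateLimiter(delay=3.0)
```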
Error handling:

- HTTP 400 errors: Defunct categories are marked as processed with 0 products
- Graceful shutdown: Press Ctrl+C to save progress and resume later (see the sketch after this list)
- Duplicate detection: Products already scraped are automatically skipped
- Connection errors: Graceful failure with error reporting
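For illustration, graceful shutdown can be as simple as trapping Ctrl+C (SIGINT), finishing the current item, and writing `data/checkpoint.json` before exiting. The class and checkpoint fields below are assumptions, not the scraper's actual format (see `sainsburys/checkpoint.py` for that):

```python
# Hypothetical sketch of Ctrl+C handling with a checkpoint write.
import json
import signal

class GracefulShutdown:
    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        # Don't exit immediately; let the main loop finish the current item.
        print("Interrupt received - will save checkpoint and stop...")
        self.stop_requested = True

def save_checkpoint(state, path="data/checkpoint.json"):
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

# In the scraping loop:
# shutdown = GracefulShutdown()
# for category in categories:
#     if shutdown.stop_requested:
#         save_checkpoint({"completed_categories": completed})
#         break
#     scrape_category(category)
```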
Troubleshooting:

- Authentication expired - capture fresh credentials from the browser and update `config/auth.json` with the new token and cookies
- Check that the category URL is correct using `--list`
- Verify authentication is working with `--test`
- Increase the delay with `--rate-limit 5.0` if you are getting blocked; the default 3 seconds is usually safe
- Run `query.py --rebuild` to recreate the database from the JSON files
- Check that `data/sainsburys.duckdb` exists and has read/write permissions
All dependencies are managed with uv:
```bash
# Install/update dependencies
uv pip install -r requirements.txt

# Key dependencies:
# - requests: HTTP client
# - duckdb: SQL database engine
# - tabulate: Table formatting for query results
```

For personal use only. Respect Sainsbury's terms of service.