Sainsbury's Grocery Scraper

A production-ready Python scraper for collecting product data from Sainsbury's UK grocery website.

Features

  • πŸ›’ Scrapes product data from Sainsbury's grocery API
  • πŸ“¦ Automatic category discovery and traversal
  • πŸ’Ύ JSON file storage with metadata
  • πŸ”„ Checkpoint/Resume capability - interrupt and resume anytime
  • ⏱️ Built-in rate limiting
  • πŸ“Š Progress tracking and statistics
  • πŸ—‚οΈ SQL query interface using DuckDB for data analysis
  • ⚠️ Defunct category handling - HTTP 400 errors won't block progress
  • πŸ§ͺ Test mode for development

Installation

Prerequisites

  • Python 3.8+
  • uv - Fast Python package manager

Setup

# Clone the repository
git clone https://github.com/danclark-codes/sainsburys-scrape.git
cd sainsburys-scrape

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Configuration

1. Capture Authentication Credentials

You need to capture authentication credentials from your browser:

  1. Open Sainsbury's website in Chrome/Firefox
  2. Open Developer Tools (F12)
  3. Go to Network tab
  4. Navigate to any product page
  5. Find requests to groceries-api/gol-services/product/v1/product
  6. Copy:
    • Authorization header (the full "Bearer ..." token)
    • Cookie header (all cookies)

2. Configure Authentication

# Create auth config template
uv run python scrape.py --create-auth

# Edit config/auth.json with your credentials
nano config/auth.json

Update the file with your captured credentials:

{
  "authorization": "Bearer YOUR_ACTUAL_TOKEN_HERE",
  "cookie": "YOUR_COOKIES_HERE",
  "headers": { ... }
}

See auth.example.json for a complete template with instructions.
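For reference, here is a minimal sketch of how the captured credentials might be attached to a request. The real request logic lives in sainsburys/client.py; the product UID and query parameter below are placeholders, not the project's actual call.

import json
import requests

# Load the credentials captured from the browser (see steps above).
with open("config/auth.json") as f:
    auth = json.load(f)

headers = {
    "Authorization": auth["authorization"],
    "Cookie": auth["cookie"],
    **auth.get("headers", {}),  # any extra headers from the template
}

# Hypothetical product lookup against the endpoint observed in DevTools.
url = "https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product"
resp = requests.get(url, headers=headers, params={"uids": "7968098"}, timeout=30)
resp.raise_for_status()
print(resp.json())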

Usage

Quick Start

# Test mode - scrape 1 category, 2 pages
uv run python scrape.py --test

# Full scrape - all categories
uv run python scrape.py --full

# List available categories
uv run python scrape.py --list

# Resume from checkpoint after interruption
uv run python scrape.py --resume

Advanced Usage

# Scrape specific categories by index
uv run python scrape.py --categories 0 1 2 --max-pages 5

# Scrape single category by name
uv run python scrape.py --category-name "gb/groceries/fruit-veg" --max-pages 10

# Scrape without subcategory discovery
uv run python scrape.py --categories 0 --no-subcategories

# Custom data directory
uv run python scrape.py --test --data-dir /path/to/data

# Adjust rate limiting (seconds between requests)
uv run python scrape.py --test --rate-limit 5.0
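The --rate-limit flag controls the pause between API calls. Internally, a simple time-based throttle along these lines is enough (a sketch of the idea, not the exact implementation):

import time

class RateLimiter:
    """Sleep so consecutive calls are at least `delay` seconds apart."""
    def __init__(self, delay: float = 3.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(delay=5.0)  # matches --rate-limit 5.0
# limiter.wait() would be called before each request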

Checkpoint & Resume

The scraper supports graceful interruption and resume:

# Start scraping
uv run python scrape.py --full

# Press Ctrl+C to stop gracefully...
# Progress will be saved automatically

# Later, resume from checkpoint:
uv run python scrape.py --full --resume
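Conceptually, the checkpoint is a small JSON snapshot of progress written to data/checkpoint.json. A minimal sketch of save/load (the real schema is defined in sainsburys/checkpoint.py and may differ):

import json
from pathlib import Path

CHECKPOINT = Path("data/checkpoint.json")

def save_checkpoint(completed_categories, current_category, current_page):
    """Persist enough state to pick up where we left off."""
    CHECKPOINT.write_text(json.dumps({
        "completed_categories": sorted(completed_categories),
        "current_category": current_category,
        "current_page": current_page,
    }, indent=2))

def load_checkpoint():
    """Return previous progress, or None if there is nothing to resume."""
    if not CHECKPOINT.exists():
        return None
    return json.loads(CHECKPOINT.read_text())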

SQL Query Interface

Query your scraped data using SQL with DuckDB:

# Show all tables
uv run python query.py "SHOW TABLES"

# Find products by name
uv run python query.py "SELECT name, price_value, brand FROM products WHERE name LIKE '%chocolate%' LIMIT 20"

# Analyze prices by brand
uv run python query.py "SELECT * FROM price_analysis"

# Show top-rated products
uv run python query.py "SELECT name, brand, avg_rating, review_count FROM products WHERE review_count > 10 ORDER BY avg_rating DESC LIMIT 10"

# Find most expensive products
uv run python query.py "SELECT name, brand, price_value FROM products ORDER BY price_value DESC LIMIT 10"

# Rebuild database from JSON files
uv run python query.py --rebuild "SELECT COUNT(*) as total FROM products"

# Export to different formats
uv run python query.py --format markdown "SELECT * FROM product_summary LIMIT 5"
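Under the hood, DuckDB can build the products table straight from the scraped JSON files. A sketch of the kind of rebuild that query.py --rebuild performs (table and column layout are assumptions based on the examples above):

import duckdb

con = duckdb.connect("data/sainsburys.duckdb")

# read_json_auto infers the schema from the product JSON files on disk.
con.execute("""
    CREATE OR REPLACE TABLE products AS
    SELECT *
    FROM read_json_auto('data/products/product_*.json')
""")

total = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"{total} products loaded")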

Command Options

scrape.py

Option                   Description
--test                   Test mode (1 category, 2 pages, no subcategories)
--full                   Full scrape (all categories)
--categories INDEX ...   Scrape specific categories by index
--category-name NAME     Scrape specific category by name
--max-pages N            Maximum pages per category
--no-subcategories       Don't discover/scrape subcategories
--resume                 Resume from checkpoint if available
--list                   List available categories
--data-dir PATH          Output directory (default: data)
--config-dir PATH        Config directory (default: config)
--rate-limit SECONDS     Delay between requests (default: 3.0)
--create-auth            Create auth.json template

query.py

Option             Description
--data-dir PATH    Data directory (default: data)
--format FORMAT    Output format: grid, simple, html, latex, markdown
--rebuild          Force rebuild database from JSON files
--no-views         Skip creating convenience views

Project Structure

sainsburys-scrape/
β”œβ”€β”€ sainsburys/          # Main package
β”‚   β”œβ”€β”€ client.py        # API client
β”‚   β”œβ”€β”€ scraper.py       # Scraper orchestration
β”‚   β”œβ”€β”€ storage.py       # Data storage
β”‚   β”œβ”€β”€ config.py        # Configuration management
β”‚   └── checkpoint.py    # Checkpoint/resume functionality
β”œβ”€β”€ config/              # Configuration files
β”‚   β”œβ”€β”€ auth.json        # Authentication (gitignored)
β”‚   └── categories.json  # Categories list
β”œβ”€β”€ data/                # Scraped data
β”‚   β”œβ”€β”€ products/        # Individual product JSONs
β”‚   β”œβ”€β”€ categories/      # Category metadata
β”‚   β”œβ”€β”€ checkpoint.json  # Resume checkpoint (if interrupted)
β”‚   └── sainsburys.duckdb # DuckDB database for queries
β”œβ”€β”€ scrape.py           # CLI entry point
β”œβ”€β”€ query.py            # SQL query interface
β”œβ”€β”€ auth.example.json   # Authentication template
└── requirements.txt    # Dependencies

Data Output

Product JSON Structure

Each product is saved as data/products/product_{id}.json:

{
  "product_uid": "7968098",
  "name": "Product Name",
  "retail_price": {
    "price": 2.50,
    "measure": "unit"
  },
  "breadcrumbs": [...],
  "_metadata": {
    "scraped_at": "2025-08-17T10:30:00",
    "scraped_timestamp": 1755340200.0,
    "scraper_version": "2.0.0"
  }
}
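Reading a product file back in Python is straightforward; for example (field access mirrors the structure above):

import json
from pathlib import Path

for path in sorted(Path("data/products").glob("product_*.json"))[:5]:
    product = json.loads(path.read_text())
    price = product.get("retail_price", {}).get("price")
    print(f"{product['product_uid']}: {product['name']} @ £{price}")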

Available SQL Views

  • product_summary - Essential product information with prices and ratings
  • price_analysis - Price statistics grouped by brand
  • category_summary - Category scraping statistics and duplicates
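These views are created by query.py when the database is built. A view like price_analysis might be defined roughly as follows; column names such as brand and price_value are assumptions drawn from the query examples above, not the project's exact definition:

import duckdb

con = duckdb.connect("data/sainsburys.duckdb")
con.execute("""
    CREATE OR REPLACE VIEW price_analysis AS
    SELECT brand,
           COUNT(*)         AS product_count,
           MIN(price_value) AS min_price,
           AVG(price_value) AS avg_price,
           MAX(price_value) AS max_price
    FROM products
    GROUP BY brand
    ORDER BY avg_price DESC
""")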

Authentication Notes

  • Authentication tokens expire - refresh from browser if you get 401/403 errors
  • Cookies typically last for a browser session
  • The scraper saves credentials locally - never commit config/auth.json

Development

Testing

# Run in test mode
uv run python scrape.py --test

# Test specific category
uv run python scrape.py --category-name "gb/groceries/bakery" --max-pages 1

# Test SQL queries
uv run python query.py "SELECT COUNT(*) FROM products"

Resume Capability

The scraper automatically tracks what has been scraped and will skip already processed items on subsequent runs. To force a re-scrape, delete the relevant files from data/.
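The duplicate check amounts to looking for an existing output file before fetching; roughly (a sketch of the idea, not the exact code):

from pathlib import Path

def already_scraped(product_uid: str, data_dir: str = "data") -> bool:
    """A product is skipped if its JSON file already exists on disk."""
    return (Path(data_dir) / "products" / f"product_{product_uid}.json").exists()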

Performance

  • Default rate limit: 3 seconds between requests
  • Typical scraping speed: ~20 products/minute
  • Full scrape estimate: Several hours for all categories
  • DuckDB queries: Instant on thousands of products

Error Handling

  • HTTP 400 errors: Defunct categories are marked as processed with 0 products
  • Graceful shutdown: Press Ctrl+C to save progress and resume later
  • Duplicate detection: Products already scraped are automatically skipped
  • Connection errors: Graceful failure with error reporting
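The behaviours above map to ordinary exception handling around each category. A condensed sketch of the idea (the function and method names here are illustrative, not the project's actual API):

import requests

def scrape_category(client, category, checkpoint):
    try:
        products = client.fetch_products(category)   # hypothetical client call
    except requests.HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 400:
            # Defunct category: record it as processed with 0 products and move on.
            checkpoint.mark_done(category, product_count=0)
            return
        raise  # other HTTP errors are reported and re-raised
    except KeyboardInterrupt:
        # Graceful shutdown: persist progress so --resume can pick it up later.
        checkpoint.save()
        raise
    checkpoint.mark_done(category, product_count=len(products))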

Troubleshooting

401/403 Errors

  • Authentication expired - capture fresh credentials from browser
  • Update config/auth.json with new token and cookies

No Products Found

  • Check if category URL is correct using --list
  • Verify authentication is working with --test

Rate Limiting

  • Increase delay with --rate-limit 5.0 if getting blocked
  • Default 3 seconds is usually safe

Database Issues

  • Run query.py --rebuild to recreate database from JSON files
  • Check data/sainsburys.duckdb exists and has read/write permissions

Dependencies

All dependencies are managed with uv:

# Install/update dependencies
uv pip install -r requirements.txt

# Key dependencies:
# - requests: HTTP client
# - duckdb: SQL database engine
# - tabulate: Table formatting for query results

License

For personal use only. Respect Sainsbury's terms of service.
