Sainsbury's Grocery Scraper

A production-ready Python scraper for collecting product data from Sainsbury's UK grocery website.

Features

  • πŸ›’ Scrapes product data from Sainsbury's grocery API
  • πŸ“¦ Automatic category discovery and traversal
  • πŸ’Ύ JSON file storage with metadata
  • πŸ”„ Checkpoint/Resume capability - interrupt and resume anytime
  • ⏱️ Built-in rate limiting
  • πŸ“Š Progress tracking and statistics
  • πŸ—‚οΈ SQL query interface using DuckDB for data analysis
  • ⚠️ Defunct category handling - HTTP 400 errors won't block progress
  • πŸ§ͺ Test mode for development

Installation

Prerequisites

  • Python 3.8+
  • uv - Fast Python package manager

Setup

# Clone the repository
git clone https://github.com/danclark-codes/sainsburys-scrape.git
cd sainsburys-scrape

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Configuration

1. Capture Authentication Credentials

You need to capture authentication credentials from your browser:

  1. Open Sainsbury's website in Chrome/Firefox
  2. Open Developer Tools (F12)
  3. Go to Network tab
  4. Navigate to any product page
  5. Find requests to groceries-api/gol-services/product/v1/product
  6. Copy:
    • Authorization header (the full "Bearer ..." token)
    • Cookie header (all cookies)

2. Configure Authentication

# Create auth config template
uv run python scrape.py --create-auth

# Edit config/auth.json with your credentials
nano config/auth.json

Update the file with your captured credentials:

{
  "authorization": "Bearer YOUR_ACTUAL_TOKEN_HERE",
  "cookie": "YOUR_COOKIES_HERE",
  "headers": { ... }
}

See auth.example.json for a complete template with instructions.
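For reference, here is a minimal sketch of how the captured credentials might be attached to a request. The real request logic lives in sainsburys/client.py; the product UID and query parameter below are placeholders, not the project's actual call.

import json
import requests

# Load the credentials captured from the browser (see steps above).
with open("config/auth.json") as f:
    auth = json.load(f)

headers = {
    "Authorization": auth["authorization"],
    "Cookie": auth["cookie"],
    **auth.get("headers", {}),  # any extra headers from the template
}

# Hypothetical product lookup against the endpoint observed in DevTools.
url = "https://www.sainsburys.co.uk/groceries-api/gol-services/product/v1/product"
resp = requests.get(url, headers=headers, params={"uids": "7968098"}, timeout=30)
resp.raise_for_status()
print(resp.json())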

Usage

Quick Start

# Test mode - scrape 1 category, 2 pages
uv run python scrape.py --test

# Full scrape - all categories
uv run python scrape.py --full

# List available categories
uv run python scrape.py --list

# Resume from checkpoint after interruption
uv run python scrape.py --resume

Advanced Usage

# Scrape specific categories by index
uv run python scrape.py --categories 0 1 2 --max-pages 5

# Scrape single category by name
uv run python scrape.py --category-name "gb/groceries/fruit-veg" --max-pages 10

# Scrape without subcategory discovery
uv run python scrape.py --categories 0 --no-subcategories

# Custom data directory
uv run python scrape.py --test --data-dir /path/to/data

# Adjust rate limiting (seconds between requests)
uv run python scrape.py --test --rate-limit 5.0
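The --rate-limit flag controls the pause between API calls. Internally, a simple time-based throttle along these lines is enough (a sketch of the idea, not the exact implementation):

import time

class RateLimiter:
    """Sleep so consecutive calls are at least `delay` seconds apart."""
    def __init__(self, delay: float = 3.0):
        self.delay = delay
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(delay=5.0)  # matches --rate-limit 5.0
# limiter.wait() would be called before each request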

Checkpoint & Resume

The scraper supports graceful interruption and resume:

# Start scraping
uv run python scrape.py --full

# Press Ctrl+C to stop gracefully...
# Progress will be saved automatically

# Later, resume from checkpoint:
uv run python scrape.py --full --resume
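Conceptually, the checkpoint is a small JSON snapshot of progress written to data/checkpoint.json. A minimal sketch of save/load (the real schema is defined in sainsburys/checkpoint.py and may differ):

import json
from pathlib import Path

CHECKPOINT = Path("data/checkpoint.json")

def save_checkpoint(completed_categories, current_category, current_page):
    """Persist enough state to pick up where we left off."""
    CHECKPOINT.write_text(json.dumps({
        "completed_categories": sorted(completed_categories),
        "current_category": current_category,
        "current_page": current_page,
    }, indent=2))

def load_checkpoint():
    """Return previous progress, or None if there is nothing to resume."""
    if not CHECKPOINT.exists():
        return None
    return json.loads(CHECKPOINT.read_text())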

SQL Query Interface

Query your scraped data using SQL with DuckDB:

# Show all tables
uv run python query.py "SHOW TABLES"

# Find products by name
uv run python query.py "SELECT name, price_value, brand FROM products WHERE name LIKE '%chocolate%' LIMIT 20"

# Analyze prices by brand
uv run python query.py "SELECT * FROM price_analysis"

# Show top-rated products
uv run python query.py "SELECT name, brand, avg_rating, review_count FROM products WHERE review_count > 10 ORDER BY avg_rating DESC LIMIT 10"

# Find most expensive products
uv run python query.py "SELECT name, brand, price_value FROM products ORDER BY price_value DESC LIMIT 10"

# Rebuild database from JSON files
uv run python query.py --rebuild "SELECT COUNT(*) as total FROM products"

# Export to different formats
uv run python query.py --format markdown "SELECT * FROM product_summary LIMIT 5"
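Under the hood, DuckDB can build the products table straight from the scraped JSON files. A sketch of the kind of rebuild that query.py --rebuild performs (table and column layout are assumptions based on the examples above):

import duckdb

con = duckdb.connect("data/sainsburys.duckdb")

# read_json_auto infers the schema from the product JSON files on disk.
con.execute("""
    CREATE OR REPLACE TABLE products AS
    SELECT *
    FROM read_json_auto('data/products/product_*.json')
""")

total = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"{total} products loaded")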

Command Options

scrape.py

Option                   Description
--test                   Test mode (1 category, 2 pages, no subcategories)
--full                   Full scrape (all categories)
--categories INDEX ...   Scrape specific categories by index
--category-name NAME     Scrape specific category by name
--max-pages N            Maximum pages per category
--no-subcategories       Don't discover/scrape subcategories
--resume                 Resume from checkpoint if available
--list                   List available categories
--data-dir PATH          Output directory (default: data)
--config-dir PATH        Config directory (default: config)
--rate-limit SECONDS     Delay between requests (default: 3.0)
--create-auth            Create auth.json template

query.py

Option             Description
--data-dir PATH    Data directory (default: data)
--format FORMAT    Output format: grid, simple, html, latex, markdown
--rebuild          Force rebuild database from JSON files
--no-views         Skip creating convenience views

Project Structure

sainsburys-scrape/
β”œβ”€β”€ sainsburys/          # Main package
β”‚   β”œβ”€β”€ client.py        # API client
β”‚   β”œβ”€β”€ scraper.py       # Scraper orchestration
β”‚   β”œβ”€β”€ storage.py       # Data storage
β”‚   β”œβ”€β”€ config.py        # Configuration management
β”‚   └── checkpoint.py    # Checkpoint/resume functionality
β”œβ”€β”€ config/              # Configuration files
β”‚   β”œβ”€β”€ auth.json        # Authentication (gitignored)
β”‚   └── categories.json  # Categories list
β”œβ”€β”€ data/                # Scraped data
β”‚   β”œβ”€β”€ products/        # Individual product JSONs
β”‚   β”œβ”€β”€ categories/      # Category metadata
β”‚   β”œβ”€β”€ checkpoint.json  # Resume checkpoint (if interrupted)
β”‚   └── sainsburys.duckdb # DuckDB database for queries
β”œβ”€β”€ scrape.py           # CLI entry point
β”œβ”€β”€ query.py            # SQL query interface
β”œβ”€β”€ auth.example.json   # Authentication template
└── requirements.txt    # Dependencies

Data Output

Product JSON Structure

Each product is saved as data/products/product_{id}.json:

{
  "product_uid": "7968098",
  "name": "Product Name",
  "retail_price": {
    "price": 2.50,
    "measure": "unit"
  },
  "breadcrumbs": [...],
  "_metadata": {
    "scraped_at": "2025-08-17T10:30:00",
    "scraped_timestamp": 1755340200.0,
    "scraper_version": "2.0.0"
  }
}
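Reading a product file back in Python is straightforward; for example (field access mirrors the structure above):

import json
from pathlib import Path

for path in sorted(Path("data/products").glob("product_*.json"))[:5]:
    product = json.loads(path.read_text())
    price = product.get("retail_price", {}).get("price")
    print(f"{product['product_uid']}: {product['name']} @ £{price}")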

Available SQL Views

  • product_summary - Essential product information with prices and ratings
  • price_analysis - Price statistics grouped by brand
  • category_summary - Category scraping statistics and duplicates
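These views are created by query.py when the database is built. A view like price_analysis might be defined roughly as follows; column names such as brand and price_value are assumptions drawn from the query examples above, not the project's exact definition:

import duckdb

con = duckdb.connect("data/sainsburys.duckdb")
con.execute("""
    CREATE OR REPLACE VIEW price_analysis AS
    SELECT brand,
           COUNT(*)         AS product_count,
           MIN(price_value) AS min_price,
           AVG(price_value) AS avg_price,
           MAX(price_value) AS max_price
    FROM products
    GROUP BY brand
    ORDER BY avg_price DESC
""")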

Authentication Notes

  • Authentication tokens expire - refresh from browser if you get 401/403 errors
  • Cookies typically last for a browser session
  • The scraper saves credentials locally - never commit config/auth.json

Development

Testing

# Run in test mode
uv run python scrape.py --test

# Test specific category
uv run python scrape.py --category-name "gb/groceries/bakery" --max-pages 1

# Test SQL queries
uv run python query.py "SELECT COUNT(*) FROM products"

Resume Capability

The scraper automatically tracks what has been scraped and will skip already processed items on subsequent runs. To force a re-scrape, delete the relevant files from data/.
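The duplicate check amounts to looking for an existing output file before fetching; roughly (a sketch of the idea, not the exact code):

from pathlib import Path

def already_scraped(product_uid: str, data_dir: str = "data") -> bool:
    """A product is skipped if its JSON file already exists on disk."""
    return (Path(data_dir) / "products" / f"product_{product_uid}.json").exists()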

Performance

  • Default rate limit: 3 seconds between requests
  • Typical scraping speed: ~20 products/minute
  • Full scrape estimate: Several hours for all categories
  • DuckDB queries: Instant on thousands of products

Error Handling

  • HTTP 400 errors: Defunct categories are marked as processed with 0 products
  • Graceful shutdown: Press Ctrl+C to save progress and resume later
  • Duplicate detection: Products already scraped are automatically skipped
  • Connection errors: Graceful failure with error reporting
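The behaviours above map to ordinary exception handling around each category. A condensed sketch of the idea (the function and method names here are illustrative, not the project's actual API):

import requests

def scrape_category(client, category, checkpoint):
    try:
        products = client.fetch_products(category)   # hypothetical client call
    except requests.HTTPError as exc:
        if exc.response is not None and exc.response.status_code == 400:
            # Defunct category: record it as processed with 0 products and move on.
            checkpoint.mark_done(category, product_count=0)
            return
        raise  # other HTTP errors are reported and re-raised
    except KeyboardInterrupt:
        # Graceful shutdown: persist progress so --resume can pick it up later.
        checkpoint.save()
        raise
    checkpoint.mark_done(category, product_count=len(products))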

Troubleshooting

401/403 Errors

  • Authentication expired - capture fresh credentials from browser
  • Update config/auth.json with new token and cookies

No Products Found

  • Check if category URL is correct using --list
  • Verify authentication is working with --test

Rate Limiting

  • Increase delay with --rate-limit 5.0 if getting blocked
  • Default 3 seconds is usually safe

Database Issues

  • Run query.py --rebuild to recreate database from JSON files
  • Check data/sainsburys.duckdb exists and has read/write permissions

Dependencies

All dependencies are managed with uv:

# Install/update dependencies
uv pip install -r requirements.txt

# Key dependencies:
# - requests: HTTP client
# - duckdb: SQL database engine
# - tabulate: Table formatting for query results

License

For personal use only. Respect Sainsbury's terms of service.
