EasyScrape Tutorial

A comprehensive, step-by-step guide to mastering web scraping with EasyScrape.


Table of Contents

  1. Introduction
  2. Installation & Setup
  3. Your First Scraper
  4. CSS Selectors Deep Dive
  5. Structured Data Extraction
  6. Configuration & Customisation
  7. Handling Pagination
  8. JavaScript-Rendered Pages
  9. Asynchronous Scraping
  10. Sessions & Authentication
  11. Error Handling
  12. Data Export
  13. Best Practices
  14. Real-World Project

1. Introduction

What is Web Scraping?

Web scraping is the automated extraction of data from websites. Instead of manually copying information, you write code that:

  1. Fetches web pages (like a browser)
  2. Parses the HTML structure
  3. Extracts the specific data you need
  4. Stores it in a useful format (CSV, JSON, database)
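
In EasyScrape, those four steps collapse into a few calls. Here is a minimal preview sketch using functions covered later in this tutorial (the selector and output file are illustrative):

from easyscrape import scrape, to_csv

result = scrape("https://example.com")      # 1. Fetch the page
heading = result.css("h1")                  # 2-3. Parse the HTML and extract the data you need
to_csv([{"heading": heading}], "data.csv")  # 4. Store it in a useful format (CSV)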

Why EasyScrape?

EasyScrape was designed with three principles:

  1. Simplicity: Common tasks should be one-liners
  2. Safety: Security features built-in, not bolted on
  3. Speed: Async support for high-performance scraping

Prerequisites

  • Python 3.9 or higher
  • Basic Python knowledge (variables, functions, loops)
  • Understanding of HTML (tags, attributes, classes)

2. Installation & Setup

Basic Installation

pip install easyscrape-py

Verify Installation

# test_install.py
import easyscrape
print(f"EasyScrape version: {easyscrape.__version__}")
print("Installation successful!")

Run it:

python test_install.py
# Output: EasyScrape version: 0.1.0
# Output: Installation successful!

Optional Dependencies

# For JavaScript-rendered pages
pip install easyscrape[browser]

# For stealth mode (bypass bot detection)
pip install easyscrape[stealth]

# For Excel/Parquet export
pip install easyscrape[export]

# Everything
pip install easyscrape[all]

3. Your First Scraper

Let's scrape a real website step by step.

Step 1: Import and Fetch

from easyscrape import scrape

# Fetch a web page
result = scrape("https://example.com")

That's it! One line to fetch a page. The result object contains everything you need.

Step 2: Check the Response

# Did it work?
print(f"Status code: {result.status_code}")  # 200 = success
print(f"OK: {result.ok}")                     # True if status < 400
print(f"URL: {result.url}")                   # Final URL (after redirects)

Step 3: View the Content

# See the raw HTML
print(result.text[:500])  # First 500 characters

# Or just the title
print(f"Page title: {result.title()}")

Step 4: Extract Data

# Get specific elements
heading = result.css("h1")
print(f"Main heading: {heading}")

# Get a paragraph
paragraph = result.css("p")
print(f"First paragraph: {paragraph}")

Complete Example

"""
my_first_scraper.py - A complete beginner example
"""
from easyscrape import scrape

def main():
    # Fetch the page
    print("Fetching https://example.com...")
    result = scrape("https://example.com")
    
    # Check if successful
    if not result.ok:
        print(f"Error: {result.status_code}")
        return
    
    # Extract data
    title = result.title()
    heading = result.css("h1")
    paragraph = result.css("p")
    links = result.links()
    
    # Display results
    print(f"\nPage Title: {title}")
    print(f"Main Heading: {heading}")
    print(f"First Paragraph: {paragraph[:100]}...")
    print(f"Number of Links: {len(links)}")

if __name__ == "__main__":
    main()

4. CSS Selectors Deep Dive

CSS selectors are patterns that identify HTML elements. Master these to extract any data.

Basic Selectors

Selector     Meaning                   Example
tag          Element by tag name       h1, p, div
.class       Element by class          .price, .title
#id          Element by ID             #header, #main
[attr]       Element with attribute    [href], [src]
[attr=val]   Attribute equals value    [type="text"]

Combinators

Selector   Meaning                    Example
A B        B inside A (any level)     div p
A > B      B directly inside A        ul > li
A + B      B immediately after A      h1 + p
A, B       A or B                     h1, h2, h3

Pseudo-Selectors (EasyScrape Extensions)

Selector       Returns            Example
::text         Text content       p::text
::attr(name)   Attribute value    a::attr(href)
::html         Inner HTML         div::html

Practical Examples

from easyscrape import scrape

result = scrape("https://books.toscrape.com")

# Get all book titles (attribute value)
titles = result.css_list("h3 a::attr(title)")

# Get all prices (text content)
prices = result.css_list(".price_color::text")

# Get star ratings (class name contains rating)
ratings = result.css_list(".star-rating::attr(class)")

# Get book URLs (combine with base URL)
urls = result.css_list("h3 a::attr(href)")
full_urls = [result.urljoin(url) for url in urls]

# Print first 3 books
for i in range(3):
    print(f"{titles[i]}: {prices[i]}")

Finding the Right Selector

  1. Open browser DevTools (F12 or right-click > Inspect)
  2. Select the element (Ctrl+Shift+C, then click)
  3. Look at the HTML - note the tag, classes, and structure
  4. Build your selector - start simple, add specificity if needed

Pro tip: In Chrome DevTools, right-click an element > Copy > Copy selector
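
Once you have a candidate selector, test it in a short script before building around it. A minimal sketch (the long selector stands in for whatever DevTools copied):

from easyscrape import scrape

result = scrape("https://books.toscrape.com")

# Try the selector copied from DevTools first...
print(result.css("section ol.row li article.product_pod h3 a"))

# ...then simplify it until it still matches what you want
print(result.css(".product_pod h3 a"))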


5. Structured Data Extraction

Instead of extracting one field at a time, extract complete records.

Single Item Extraction

from easyscrape import scrape

result = scrape("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

# Extract multiple fields at once
book = result.extract({
    "title": "h1",
    "price": ".price_color",
    "availability": ".availability::text",
    "description": "#product_description + p",
    "upc": "tr:nth-child(1) td",
})

print(book)
# {
#     "title": "A Light in the Attic",
#     "price": "GBP 51.77",
#     "availability": "In stock (22 available)",
#     "description": "It's hard to imagine a world without...",
#     "upc": "a897fe39b1053632"
# }

Multiple Items Extraction

from easyscrape import scrape

result = scrape("https://books.toscrape.com")

# Extract all books on the page
books = result.extract_all(".product_pod", {
    "title": "h3 a::attr(title)",
    "price": ".price_color::text",
    "rating": ".star-rating::attr(class)",
    "url": "h3 a::attr(href)",
})

print(f"Found {len(books)} books")
for book in books[:3]:
    print(f"  - {book['title']}: {book['price']}")

Nested Extraction

# For complex structures, use nested schemas
result = scrape("https://example.com/products")

products = result.extract_all(".product", {
    "name": ".name",
    "price": ".price",
    "specs": {
        "_selector": ".specifications",  # Container
        "weight": ".weight",
        "dimensions": ".dimensions",
    },
    "reviews": {
        "_selector": ".reviews .review",
        "_multiple": True,
        "author": ".author",
        "rating": ".rating",
    }
})
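
Assuming the page matches those selectors, each product comes back as a nested dictionary: the specs block is a single nested dict, while reviews (marked _multiple: True) is a list of dicts. A sketch of the expected shape, with illustrative values:

# products[0] would look roughly like:
# {
#     "name": "Example Product",
#     "price": "19.99",
#     "specs": {"weight": "1.2 kg", "dimensions": "10 x 5 x 3 cm"},
#     "reviews": [
#         {"author": "Alice", "rating": "5"},
#         {"author": "Bob", "rating": "4"},
#     ],
# }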

6. Configuration & Customisation

Creating a Configuration

from easyscrape import scrape, Config

config = Config(
    timeout=60.0,       # Wait longer for slow sites
    max_retries=5,      # Retry more times
    rate_limit=1.0,     # Be polite: 1 request/second
)

result = scrape("https://example.com", config=config)

Common Configuration Patterns

Development Mode

dev_config = Config(
    cache_enabled=True,   # Don't re-download pages
    cache_ttl=86400,      # Cache for 24 hours
    timeout=60.0,         # Patient timeouts
)

Production Mode

prod_config = Config(
    max_retries=5,
    retry_delay=2.0,
    backoff_factor=2.0,   # 2s, 4s, 8s, 16s, 32s
    rate_limit=2.0,       # 2 requests/second
    rotate_ua=True,       # Vary User-Agent
    respect_robots=True,  # Honour robots.txt
)

Stealth Mode

stealth_config = Config(
    use_stealth=True,     # TLS fingerprint bypass
    rotate_ua=True,       # Random User-Agent
    headers={
        "Accept-Language": "en-GB,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    },
)

Custom Headers

config = Config(
    headers={
        "Authorization": "Bearer your-token",
        "X-Custom-Header": "value",
        "Referer": "https://google.com",
    }
)

Using Proxies

config = Config(
    proxies=[
        "http://user:pass@proxy1.com:8080",
        "http://user:pass@proxy2.com:8080",
    ],
    proxy_mode="round-robin",  # or "random"
)

7. Handling Pagination

Most websites split content across multiple pages. Here are four ways to handle pagination.

Method 1: Follow "Next" Links

from easyscrape import paginate

all_items = []

for page in paginate(
    "https://books.toscrape.com",
    next_selector=".next a",
    max_pages=10,
):
    items = page.css_list("h3 a::attr(title)")
    all_items.extend(items)
    print(f"Page {page.url}: {len(items)} items")

print(f"Total: {len(all_items)} items")

Method 2: Parameter-Based Pagination

from easyscrape import paginate_param

for page in paginate_param(
    "https://example.com/search",
    param="page",
    start=1,
    end=10,
):
    results = page.css_list(".result")
    print(f"Page {page}: {len(results)} results")

Method 3: Offset-Based Pagination

from easyscrape import paginate_offset

for page in paginate_offset(
    "https://example.com/api/items",
    offset_param="offset",
    limit_param="limit",
    limit=20,
    max_offset=200,
):
    items = page.json()["items"]
    print(f"Offset {page}: {len(items)} items")

Method 4: Manual Control

from easyscrape import scrape

page_num = 1
all_books = []

while True:
    url = f"https://books.toscrape.com/catalogue/page-{page_num}.html"
    result = scrape(url)
    
    if not result.ok:
        break  # No more pages
    
    books = result.css_list("h3 a::attr(title)")
    if not books:
        break  # Empty page
    
    all_books.extend(books)
    print(f"Page {page_num}: {len(books)} books")
    
    page_num += 1
    if page_num > 50:  # Safety limit
        break

print(f"Total: {len(all_books)} books")

8. JavaScript-Rendered Pages

Many modern websites use JavaScript to load content. EasyScrape handles this with Playwright.

Installation

pip install easyscrape[browser]
playwright install chromium

Basic Usage

from easyscrape import scrape, Config

config = Config(javascript=True)
result = scrape("https://quotes.toscrape.com/js/", config=config)

quotes = result.css_list(".quote .text")
print(f"Found {len(quotes)} quotes")

Wait for Content

config = Config(
    javascript=True,
    js_wait=3.0,              # Wait 3 seconds after load
    js_wait_for=".quote",     # Or wait for this selector
)

Advanced Browser Control

from easyscrape import Browser

async def scrape_dynamic_page():
    async with Browser(headless=True) as browser:
        page = await browser.goto("https://example.com")
        
        # Wait for specific element
        await page.wait_for(".content-loaded")
        
        # Click a button
        await page.click("#load-more")
        
        # Wait for new content
        await page.wait(1.0)
        
        # Extract data
        items = await page.css_list(".item")
        return items
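
Because scrape_dynamic_page is a coroutine, it has to be run with asyncio from synchronous code. A minimal sketch:

import asyncio

items = asyncio.run(scrape_dynamic_page())
print(f"Extracted {len(items)} items")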

9. Asynchronous Scraping

For scraping many pages quickly, use async.

Why Async?

Synchronous (100 pages, 1s each):  100 seconds
Asynchronous (100 pages, 10 concurrent): ~10 seconds

Basic Async

import asyncio
from easyscrape import async_scrape

async def main():
    result = await async_scrape("https://example.com")
    print(result.title())

asyncio.run(main())

Scraping Many Pages

import asyncio
from easyscrape import async_scrape_many, Config

async def scrape_all_books():
    urls = [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, 51)
    ]
    
    config = Config(
        concurrent_limit=10,  # Max 10 at a time
        rate_limit=5.0,       # 5 requests/second
    )
    
    results = await async_scrape_many(urls, config=config)
    
    all_books = []
    for result in results:
        if result.ok:
            books = result.css_list("h3 a::attr(title)")
            all_books.extend(books)
    
    return all_books

books = asyncio.run(scrape_all_books())
print(f"Scraped {len(books)} books")

With Progress Tracking

import asyncio
from easyscrape import async_scrape, Config

async def scrape_with_progress(urls):
    config = Config(rate_limit=5.0)
    results = []
    
    for i, url in enumerate(urls, 1):
        result = await async_scrape(url, config=config)
        results.append(result)
        print(f"Progress: {i}/{len(urls)} ({100*i/len(urls):.1f}%)")
    
    return results
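
The loop above awaits each URL in turn, so requests run one at a time. To keep requests concurrent while still reporting progress, one option is to combine async_scrape with asyncio.as_completed; a sketch, assuming async_scrape behaves as in the earlier examples (results arrive in completion order, not input order):

import asyncio
from easyscrape import async_scrape, Config

async def scrape_with_concurrent_progress(urls):
    config = Config(rate_limit=5.0, concurrent_limit=10)
    tasks = [async_scrape(url, config=config) for url in urls]
    results = []

    # as_completed yields each request as soon as it finishes,
    # so progress updates while the others are still in flight
    for i, task in enumerate(asyncio.as_completed(tasks), 1):
        results.append(await task)
        print(f"Progress: {i}/{len(urls)} ({100*i/len(urls):.1f}%)")

    return results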

10. Sessions & Authentication

Maintaining Cookies

from easyscrape import Session

with Session() as session:
    # First request sets cookies
    session.get("https://example.com")
    
    # Subsequent requests include those cookies
    result = session.get("https://example.com/dashboard")

Login Flow

from easyscrape import Session

with Session() as session:
    # Step 1: Get the login page (may set CSRF token)
    login_page = session.get("https://example.com/login")
    
    # Step 2: Extract CSRF token if needed
    csrf = login_page.css("input[name='csrf']::attr(value)")
    
    # Step 3: Submit login form
    response = session.post(
        "https://example.com/login",
        data={
            "username": "myuser",
            "password": "mypass",
            "csrf": csrf,
        }
    )
    
    # Step 4: Check if login worked
    if "Welcome" in response.text:
        print("Login successful!")
        
        # Step 5: Access protected content
        profile = session.get("https://example.com/profile")
        print(profile.css(".user-name"))

11. Error Handling

The Exception Hierarchy

EasyScrapeError (catch all)
+-- NetworkError (connection issues)
|   +-- RequestTimeout
+-- HTTPError (4xx, 5xx responses)
+-- InvalidURLError
+-- RateLimitHit (429 Too Many Requests)
+-- RetryExhausted
+-- ExtractionError

Basic Error Handling

from easyscrape import scrape
from easyscrape.exceptions import EasyScrapeError

try:
    result = scrape("https://example.com")
except EasyScrapeError as e:
    print(f"Scraping failed: {e}")

Specific Error Handling

import time

from easyscrape import scrape
from easyscrape.exceptions import (
    NetworkError,
    HTTPError,
    RateLimitHit,
    RequestTimeout,
)

def safe_scrape(url):
    try:
        return scrape(url)
    except RateLimitHit:
        print("Rate limited! Waiting...")
        time.sleep(60)
        return scrape(url)  # Retry
    except RequestTimeout:
        print("Timeout - skipping")
        return None
    except HTTPError as e:
        print(f"HTTP {e.status_code}")
        return None
    except NetworkError as e:
        print(f"Network error: {e}")
        return None

12. Data Export

Export to CSV

from easyscrape import scrape, to_csv

result = scrape("https://books.toscrape.com")
books = result.extract_all(".product_pod", {
    "title": "h3 a::attr(title)",
    "price": ".price_color::text",
})

to_csv(books, "books.csv")

Export to JSON

from easyscrape import to_json

to_json(books, "books.json", indent=2)

Export to Excel

from easyscrape import to_excel

to_excel(books, "books.xlsx")

Export to DataFrame

from easyscrape import to_dataframe

df = to_dataframe(books)
print(df.head())
print(df.describe())

13. Best Practices

1. Rate Limiting

# Always limit your request rate
config = Config(rate_limit=1.0)  # 1 request/second

2. Respect robots.txt

config = Config(respect_robots=True)

3. Identify Yourself

config = Config(
    headers={"User-Agent": "MyBot/1.0 (contact@example.com)"}
)

4. Handle Errors

# Never let one error crash your whole scrape
for url in urls:
    try:
        result = scrape(url)
        process(result)
    except EasyScrapeError:
        continue

5. Cache During Development

config = Config(cache_enabled=True, cache_ttl=86400)

6. Use Async for Large Jobs

# 100 pages: 10x faster with async
await async_scrape_many(urls, config=Config(concurrent_limit=10))

14. Real-World Project

Let's build a complete book scraper that:

  1. Scrapes all 50 pages of books.toscrape.com
  2. Extracts title, price, rating, and availability
  3. Handles errors gracefully
  4. Exports to CSV and JSON
"""
complete_book_scraper.py

A production-ready scraper for books.toscrape.com
"""

import asyncio
from pathlib import Path

from easyscrape import (
    async_scrape_many,
    Config,
    to_csv,
    to_json,
)
from easyscrape.exceptions import EasyScrapeError


def create_urls(num_pages: int) -> list[str]:
    """Generate URLs for all pages."""
    return [
        f"https://books.toscrape.com/catalogue/page-{i}.html"
        for i in range(1, num_pages + 1)
    ]


def parse_rating(class_name: str) -> int:
    """Convert 'star-rating Three' to 3."""
    ratings = {
        "One": 1, "Two": 2, "Three": 3,
        "Four": 4, "Five": 5
    }
    for word, num in ratings.items():
        if word in class_name:
            return num
    return 0


def parse_price(price_str: str) -> float:
    """Convert 'GBP 51.77' to 51.77."""
    return float(price_str.replace("GBP ", "").replace("$", ""))


async def scrape_books(num_pages: int = 50) -> list[dict]:
    """Scrape all books from the website."""
    
    # Configuration
    config = Config(
        timeout=30.0,
        max_retries=3,
        rate_limit=5.0,
        concurrent_limit=10,
        cache_enabled=True,
        rotate_ua=True,
    )
    
    # Generate URLs
    urls = create_urls(num_pages)
    print(f"Scraping {len(urls)} pages...")
    
    # Fetch all pages
    try:
        results = await async_scrape_many(urls, config=config)
    except EasyScrapeError as e:
        print(f"Fatal error: {e}")
        return []
    
    # Parse results
    all_books = []
    errors = []
    
    for i, result in enumerate(results, 1):
        if not result.ok:
            errors.append({"page": i, "status": result.status_code})
            continue
        
        books = result.extract_all(".product_pod", {
            "title": "h3 a::attr(title)",
            "price": ".price_color::text",
            "rating": ".star-rating::attr(class)",
            "availability": ".availability::text",
            "url": "h3 a::attr(href)",
        })
        
        # Clean and transform data
        for book in books:
            book["price_numeric"] = parse_price(book["price"])
            book["rating_numeric"] = parse_rating(book["rating"])
            book["availability"] = book["availability"].strip()
            book["url"] = f"https://books.toscrape.com/catalogue/{book['url']}"
        
        all_books.extend(books)
        
        # Progress
        if i % 10 == 0:
            print(f"Processed {i}/{len(urls)} pages...")
    
    # Report
    print(f"\nComplete!")
    print(f"  Books scraped: {len(all_books)}")
    print(f"  Errors: {len(errors)}")
    
    if errors:
        print(f"  Failed pages: {[e['page'] for e in errors]}")
    
    return all_books


def export_data(books: list[dict]) -> None:
    """Export books to multiple formats."""
    
    # Ensure the output directory exists (the export helpers may not create it)
    Path("output").mkdir(exist_ok=True)

    # CSV
    to_csv(books, "output/books.csv")
    print("Exported to output/books.csv")
    
    # JSON
    to_json(books, "output/books.json", indent=2)
    print("Exported to output/books.json")
    
    # Summary
    if books:
        prices = [b["price_numeric"] for b in books]
        print(f"\nSummary:")
        print(f"  Total books: {len(books)}")
        print(f"  Price range: GBP {min(prices):.2f} - {max(prices):.2f}")
        print(f"  Average price: GBP {sum(prices)/len(prices):.2f}")


async def main():
    """Main entry point."""
    print("=" * 60)
    print("Book Scraper - books.toscrape.com")
    print("=" * 60)
    
    books = await scrape_books(num_pages=50)
    
    if books:
        export_data(books)
    else:
        print("No books scraped!")


if __name__ == "__main__":
    asyncio.run(main())
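
Run it:

python complete_book_scraper.py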

Conclusion

You've learned:

  • Basic scraping with scrape()
  • CSS selectors for data extraction
  • Structured data extraction with schemas
  • Configuration and customisation
  • Pagination handling
  • JavaScript rendering
  • Async scraping for speed
  • Sessions and authentication
  • Error handling
  • Data export

Next Steps

  1. Practice: Scrape a website you're interested in
  2. Read: Check the API Reference for all available methods
  3. Explore: Try the Cookbook recipes for specific tasks
  4. Contribute: Found a bug? Want a feature? Open an issue!

Happy scraping!