irfanalidv/scrapeflow-py

ScrapeFlow

An opinionated scraping workflow engine built on Playwright



ScrapeFlow is a production-ready Python library that transforms Playwright into a powerful, enterprise-grade web scraping framework. It handles the common challenges of web scraping: retries, rate limiting, anti-detection, error recovery, and workflow orchestration.

🚀 Features

  • 📋 Specification-Driven Extraction: Declarative Pydantic models define fields, types, and validation, decoupling field definitions from page structure
  • 🤖 robots.txt Compliance: Built-in robots.txt parsing and enforcement; ethical crawling by design
  • ⚖️ Ethical Crawling (GDPR/CCPA): Configurable data retention, anonymization, and consent options in the specification layer
  • 📦 Component Registry: Shared, versioned selectors, pagination handlers, and login flows; platform thinking over one-off scrapers
  • 🔔 Monitoring & Alerting: Alert callbacks on failure thresholds; rollback hooks for failed extraction runs
  • 🔌 MCP Extensibility: Pluggable backends for Scrapy MCP Server, Playwright MCP, or LLM-based semantic extraction
  • 🤖 Mistral LLM Extraction: Generate schemas from natural language and extract without selectors (uses MISTRAL_API_KEY)
  • 🔄 Hybrid Extraction: Selector-first, with LLM fallback when validation fails, for self-healing spiders
  • 🛡️ Selector Fallback Chain: Try multiple selectors per field ([".price", ".cost"]) for resilience
  • 🫙 Session Persistence: Save/load cookies and localStorage via save_storage_state() and storage_state_path
  • 🧹 Content Cleaning: Strip scripts/styles before LLM extraction for better accuracy and token usage
  • 🔄 Intelligent Retry Logic: Automatic retries with exponential backoff and jitter
  • ⚡ Rate Limiting: Token bucket algorithm to respect server limits
  • 🕵️ Anti-Detection: Stealth mode, user agent rotation, and proxy support
  • 📊 Workflow Engine: Define complex scraping workflows with steps and conditions
  • 📈 Monitoring & Metrics: Built-in performance monitoring and logging
  • 🛠️ Data Extraction: Powerful utilities for extracting structured data
  • 🔧 Error Handling: Comprehensive error classification and recovery
  • 📝 Type Hints: Full type support for better IDE experience
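The selector fallback chain in the list above can be pictured as trying each candidate selector in order and keeping the first non-empty match. The sketch below is not ScrapeFlow's implementation: `first_match` and the `lookup` callable are hypothetical names, and a plain dict stands in for a rendered page.

```python
from typing import Callable, Optional, Sequence


def first_match(
    selectors: Sequence[str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the first non-empty value found by trying selectors in order."""
    for selector in selectors:
        value = lookup(selector)
        if value is not None and value.strip():
            return value.strip()
    return None  # every selector in the chain missed


# A dict stands in for a page here: selector -> text content (made-up data).
fake_page = {".cost": " 12.99 "}
price = first_match([".price", ".cost"], fake_page.get)
print(price)  # 12.99
```

Because the chain is ordered, the most specific or most trusted selector should come first, with looser alternatives after it.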

ScrapeFlow Architecture

📦 Installation

pip install scrapeflow-py

Or install from source:

git clone https://github.com/irfanalidv/scrapeflow-py.git
cd scrapeflow-py
pip install -e .

Note: After installation, install Playwright browsers:

playwright install

🎯 Real-World Use Cases

ScrapeFlow is used in production for:

  • 💰 E-commerce Price Monitoring - Track competitor prices, monitor deals, and optimize pricing strategies
  • 📰 News & Content Aggregation - Collect articles from multiple sources for content platforms
  • 💼 Job Listings Scraping - Aggregate job postings from various job boards
  • 🏠 Real Estate Data Collection - Monitor property listings, prices, and market trends
  • ⭐ Product Review Analysis - Extract and analyze customer reviews for market research
  • 📊 Market Research - Gather competitor data, customer sentiment, and industry trends
  • 🔍 Lead Generation - Extract contact information from business directories
  • 📈 Financial Data Collection - Monitor stock prices, cryptocurrency data, and market indicators

🚀 Quick Start

Use Case 1: Scraping Quotes with Retry & Rate Limiting

Real-world scenario: Collecting inspirational quotes from quotes.toscrape.com - a real website designed for scraping practice.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig, RetryConfig
from scrapeflow.extractors import Extractor

async def main():
    # Configure for production scraping
    config = ScrapeFlowConfig(
        rate_limit=RateLimitConfig(requests_per_second=2.0),  # Respect server limits
        retry=RetryConfig(max_retries=3, initial_delay=1.0),  # Auto-retry on failures
    )

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Extract all quotes from the page
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()

        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()
            tags = await Extractor.extract_texts(quote_elem, ".tag")

            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "tags": tags
            })

        print(f"Scraped {len(quotes)} quotes")
        for quote in quotes[:3]:  # Show first 3
            print(f"\n{quote['quote']}\n- {quote['author']}")

asyncio.run(main())

Real Output:

Scraped 10 quotes from quotes.toscrape.com

1. Quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
   Author: Albert Einstein
   Tags: ['change', 'deep-thoughts', 'thinking', 'world']

2. Quote: "It is our choices, Harry, that show what we truly are, far more than our abilities."
   Author: J.K. Rowling
   Tags: ['abilities', 'choices']

3. Quote: "There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
   Author: Albert Einstein
   Tags: ['inspirational', 'life', 'live', 'miracle', 'miracles']

Use Case 2: E-commerce Book Scraping Workflow

Real-world scenario: Scraping book data from books.toscrape.com - a real e-commerce site designed for scraping practice.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import ScrapeFlowConfig
from scrapeflow.extractors import StructuredExtractor

async def scrape_books(page, context):
    """Extract book listings from the page."""
    schema = {
        "books": {
            "items": "article.product_pod",
            "schema": {
                "title": "h3 a",
                "price": ".price_color",
                "availability": ".instock.availability"
            }
        }
    }
    extractor = StructuredExtractor(schema)
    return await extractor.extract(page)

async def check_affordable_books(data, context):
    """Callback to find affordable books."""
    for book in data.get("books", []):
        price_str = book.get("price", "").replace("£", "").strip()
        try:
            price = float(price_str)
            if price < 20.0:  # Books under £20
                print(f"💰 Affordable: {book['title'][:50]}... - £{price}")
        except ValueError:
            pass

async def main():
    config = ScrapeFlowConfig()
    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_scraper")

        # Step 1: Navigate to books page
        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        # Register the steps in order: navigate first, then extract
        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", scrape_books, on_success=check_affordable_books)

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"✅ Scraped {len(result.final_data.get('books', []))} books")

asyncio.run(main())

Real Output:

Workflow 'book_scraper' completed. Success: True, Steps: 2/2
✅ Scraped 20 books
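The callback in this use case converts price strings with `float`, which is fine for a threshold check. If the prices are later summed or compared exactly, `Decimal` avoids binary floating-point rounding. The following is a minimal sketch; `parse_price` is a hypothetical helper, not part of ScrapeFlow, and the sample strings are made up.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Parse a price string such as '£51.77' into a Decimal, or None if malformed."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


prices = ["£51.77", "£13.99", "n/a"]
affordable = [
    p for p in map(parse_price, prices)
    if p is not None and p < Decimal("20")
]
print(affordable)  # [Decimal('13.99')]
```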

Use Case 3: Quote Aggregation with Anti-Detection

Real-world scenario: Collecting quotes from quotes.toscrape.com while avoiding detection using stealth mode and user agent rotation.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig
)
from scrapeflow.extractors import Extractor

async def scrape_quotes_with_stealth():
    """Scrape quotes with anti-detection enabled."""
    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,  # Rotate user agents
            stealth_mode=True,        # Remove automation indicators
            viewport_width=1920,
            viewport_height=1080
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0)  # Be respectful
    )

    async with ScrapeFlow(config) as scraper:
        # Navigate to quotes site
        await scraper.navigate("https://quotes.toscrape.com/")

        # Verify stealth mode is working
        user_agent = await scraper.page.evaluate("() => navigator.userAgent")
        page_title = await scraper.page.title()

        # Extract quote data
        quotes = []
        quote_elements = scraper.page.locator(".quote")
        count = await quote_elements.count()

        for i in range(count):
            quote_elem = quote_elements.nth(i)
            text = await quote_elem.locator(".text").text_content()
            author = await quote_elem.locator(".author").text_content()

            quotes.append({
                "quote": text.strip() if text else "",
                "author": author.strip() if author else "",
                "url": scraper.page.url  # page.url is a plain property, not awaitable
            })

        return quotes, user_agent, page_title

# Run the scraper
quotes, ua, title = asyncio.run(scrape_quotes_with_stealth())
print(f"📰 Collected {len(quotes)} quotes from {title}")
print(f"🕵️ User Agent: {ua[:60]}...")
print(f"\nFirst quote: {quotes[0]['quote'][:80]}...")

Real Output:

📰 Collected 10 quotes from Quotes to Scrape
🕵️ User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101...

First quote: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."

Use Case 4: Multi-Page Scraping with Error Handling & Metrics

Real-world scenario: Scraping multiple pages from quotes.toscrape.com with comprehensive error handling and performance monitoring.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig
from scrapeflow.exceptions import (
    ScrapeFlowBlockedError,
    ScrapeFlowTimeoutError,
    ScrapeFlowRetryError
)
from scrapeflow.extractors import Extractor

async def scrape_multiple_pages():
    config = ScrapeFlowConfig(
        retry=RetryConfig(max_retries=5, initial_delay=2.0),
        log_level="INFO"
    )

    try:
        async with ScrapeFlow(config) as scraper:
            # Scrape multiple pages
            all_quotes = []
            pages = [
                "https://quotes.toscrape.com/",
                "https://quotes.toscrape.com/page/2/",
            ]

            for url in pages:
                await scraper.navigate(url)

                # Extract quotes
                quote_elements = scraper.page.locator(".quote")
                count = await quote_elements.count()

                for i in range(count):
                    quote_elem = quote_elements.nth(i)
                    text = await quote_elem.locator(".text").text_content()
                    author = await quote_elem.locator(".author").text_content()

                    all_quotes.append({
                        "quote": text.strip() if text else "",
                        "author": author.strip() if author else "",
                    })

            # Get performance metrics
            metrics = scraper.get_metrics()
            print(f"📊 Success rate: {metrics.get_success_rate():.2f}%")
            print(f"📊 Total requests: {metrics.total_requests}")
            print(f"📊 Average response time: {metrics.average_response_time:.2f}s")

            return all_quotes

    except ScrapeFlowBlockedError as e:
        print(f"🚫 Blocked! Retry after {e.retry_after} seconds")
        return []
    except ScrapeFlowTimeoutError:
        print("⏱️ Request timed out")
        return []
    except ScrapeFlowRetryError as e:
        print(f"❌ Failed after {e.retry_count} retries")
        return []

quotes = asyncio.run(scrape_multiple_pages())
print(f"💼 Found {len(quotes)} quotes across pages")

Real Output:

📊 Success rate: 100.00%
📊 Total requests: 2
📊 Average response time: 1.10s
💼 Found 20 quotes across pages

Use Case 5: LLM-Powered Extraction (No Selectors)

Real-world scenario: Extract structured data using Mistral AI with no CSS selectors, just natural language field descriptions. Ideal for pages with inconsistent markup or when you want semantic understanding.

Requires: MISTRAL_API_KEY in .env or environment.

import asyncio
import os

try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

from scrapeflow import ScrapeFlow, create_mcp_backend
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig
from scrapeflow.llm_extract import generate_schema_from_prompt, extract_with_schema

async def run_llm_extraction():
    if not os.environ.get("MISTRAL_API_KEY"):
        print("Set MISTRAL_API_KEY in .env to run this example")
        return

    config = ScrapeFlowConfig(
        ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True),
    )

    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Option 1: Generate schema from natural language
        schema = generate_schema_from_prompt(
            "Extract quote text, author name, and list of tags"
        )
        print("Generated schema:", schema)

        # Option 2: Extract using schema + page content (no selectors!)
        content = await scraper.page.evaluate("() => document.body.innerText")
        data = extract_with_schema(
            content,
            schema,
            prompt="Extract all quotes from this page with text, author, and tags",
        )
        print("LLM extracted:", data)

        # Option 3: MistralLLMBackend, using field descriptions only
        backend = create_mcp_backend("mistral")
        semantic_data = await backend.extract_with_semantics(
            scraper.page,
            field_descriptions={
                "quote": "The quote text",
                "author": "Author name",
                "tags": "Comma-separated tags",
            },
            context="Extract the first quote on the page",
        )
        print("Semantic extraction:", semantic_data)

asyncio.run(run_llm_extraction())

Real Output:

Generated schema: {'type': 'object', 'properties': {'quote_text': {...}, 'author_name': {...}, 'tags': {...}}, 'required': ['quote_text', 'author_name', 'tags']}

LLM extracted: [{'quote_text': 'The world as we have created it is a process of our thinking...', 'author_name': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}, ...]

Semantic extraction: {'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'author': 'Albert Einstein', 'tags': 'change, deep-thoughts, thinking, world'}

What you get: Schema generation from prompts, extraction without brittle selectors, and semantic understanding of page content, which is useful for dynamic or inconsistent sites.
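The feature list mentions stripping scripts and styles before LLM extraction. The sketch below shows one way to do that with only the standard library; it is an illustration, not the code in scrapeflow.content_utils, and `TextOnlyParser` and `clean_html` are hypothetical names.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside any skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p { color: red; }</style>"
    "<script>var tracking = true;</script></head>"
    "<body><p>Hello</p><span>world</span></body></html>"
)
print(clean_html(sample))  # Hello world
```

Sending only the cleaned text keeps the prompt shorter, which is the token-usage benefit the feature list refers to.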

Use Case 6: Login + Session Persistence

Real-world scenario: Log in once, save cookies, then reuse the session on future runs, with no re-login needed.

import asyncio
import os
from scrapeflow import ScrapeFlow, SpecificationExtractor, get_registry, register_quotes_login_handler
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig, BrowserConfig
from scrapeflow.specifications import FieldSpec, ItemSpec
from pydantic import BaseModel

class QuoteItem(BaseModel):
    text: str
    author: str

class QuotesPage(BaseModel):
    quotes: list[QuoteItem]

async def run():
    state_file = "/tmp/scrapeflow_session.json"
    schema = {"quotes": ItemSpec(items_selector=".quote", fields={"text": ".text", "author": ".author"})}

    # Step 1: Login and save (if first run)
    if not os.path.exists(state_file):
        async with ScrapeFlow(ScrapeFlowConfig(ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True))) as scraper:
            register_quotes_login_handler(get_registry())
            handler = get_registry().get_login("quotes_login")
            await scraper.login_with_handler("https://quotes.toscrape.com/login", "admin", "admin", handler)
            await scraper.save_storage_state(state_file)

    # Step 2: Use saved session (no login)
    config = ScrapeFlowConfig(ethical_crawling=EthicalCrawlingConfig(respect_robots_txt=True),
                             browser=BrowserConfig(storage_state_path=state_file))
    async with ScrapeFlow(config) as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")
        data = await SpecificationExtractor(QuotesPage, schema=schema).extract(scraper.page)
        print(f"Extracted {len(data.quotes)} quotes (authenticated)")

asyncio.run(run())

Real Output:

Extracted 10 quotes (authenticated)

(On first run: login + save; on subsequent runs: load session and extract without re-login.)

What you get: login_with_handler, save_storage_state, and storage_state_path for reusable authenticated sessions.
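The example above re-logs in only when the state file is missing, but saved cookies can also expire. The sketch below checks a saved file for at least one unexpired cookie. It assumes that save_storage_state writes Playwright's storage-state JSON (a "cookies" list whose "expires" field is a Unix timestamp in seconds, or -1 for session cookies); that is an assumption about ScrapeFlow, and `session_is_usable` is a hypothetical helper.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Optional


def session_is_usable(state_path: str, now: Optional[float] = None) -> bool:
    """Return True if the saved state holds at least one cookie that has not expired."""
    path = Path(state_path)
    if not path.exists():
        return False
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False
    if not isinstance(state, dict):
        return False
    now = time.time() if now is None else now
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        # Assumed format: -1 marks a session cookie with no expiry time.
        if expires == -1 or expires > now:
            return True
    return False


# Demo with a hand-written state file (all values are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "state.json"
    demo.write_text(
        json.dumps({"cookies": [{"name": "session", "expires": 100}], "origins": []}),
        encoding="utf-8",
    )
    fresh = session_is_usable(str(demo), now=50)
    stale = session_is_usable(str(demo), now=200)
print(fresh, stale)  # True False
```

A check like this could replace the `os.path.exists(state_file)` test so that an expired session triggers a fresh login.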

📚 Documentation

Configuration

Use Case: Setting up a production-ready scraper for monitoring competitor prices across multiple sites.

from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig,
    BrowserConfig,
    BrowserType,
)

# Production configuration for price monitoring
config = ScrapeFlowConfig(
    browser=BrowserConfig(
        browser_type=BrowserType.CHROMIUM,
        headless=True,  # Run in background
        timeout=30000,   # 30 second timeout
    ),
    retry=RetryConfig(
        max_retries=5,           # Retry up to 5 times
        initial_delay=1.0,       # Start with 1 second delay
        max_delay=60.0,          # Cap at 60 seconds
        exponential_base=2.0,     # Double delay each retry
        jitter=True,              # Add randomness
    ),
    rate_limit=RateLimitConfig(
        requests_per_second=2.0,  # Max 2 requests/second
        burst_size=5,              # Allow bursts of 5
    ),
    anti_detection=AntiDetectionConfig(
        rotate_user_agents=True,   # Rotate user agents
        stealth_mode=True,          # Remove automation traces
        viewport_width=1920,
        viewport_height=1080,
    ),
    log_level="INFO",  # Log important events
)

Specification-Driven Extraction (Pydantic)

Use Case: Declarative extraction with validation, where fields, types, and rules live in specs rather than fragile XPaths.

import asyncio

from pydantic import BaseModel
from scrapeflow import ScrapeFlow, SpecificationExtractor
from scrapeflow.specifications import FieldSpec, ItemSpec, ProductPriceSpec

# Model for list of products
class BookListing(BaseModel):
    books: list[ProductPriceSpec]

# Schema maps fields to selectors
schema = {
    "books": ItemSpec(
        items_selector="article.product_pod",
        fields={
            "title": FieldSpec(selector="h3 a"),
            "price": FieldSpec(selector=".price_color"),
            "availability": FieldSpec(selector=".instock.availability", default=""),
            "url": FieldSpec(selector="h3 a", type="attribute", attribute="href"),
        },
    )
}

async def main():
    async with ScrapeFlow() as scraper:
        await scraper.navigate("https://books.toscrape.com/")
        extractor = SpecificationExtractor(BookListing, schema=schema)
        # Extract and validate in one step
        data = await extractor.extract(scraper.page)
        for book in data.books:
            print(f"{book.title}: {book.price}")

asyncio.run(main())

Ethical Crawling & robots.txt

Use Case: GDPR/CCPA compliance and robots.txt respect built into the specification layer.

import asyncio

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, EthicalCrawlingConfig

config = ScrapeFlowConfig(
    ethical_crawling=EthicalCrawlingConfig(
        respect_robots_txt=True,      # Check robots.txt before each request
        user_agent_for_robots="ScrapeFlow",
        anonymize_ip=True,            # GDPR: minimize personal data
        data_retention_days=30,       # Document retention policy
    )
)

async def main():
    async with ScrapeFlow(config) as scraper:
        # navigate() automatically checks robots.txt
        await scraper.navigate("https://example.com/page")

asyncio.run(main())
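To make the robots.txt check concrete, the sketch below uses Python's standard `urllib.robotparser` on an inline rules string. It illustrates the kind of decision made before a request; whether ScrapeFlow uses this module internally is not stated here, and the rules and URLs are made up.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(ROBOTS)
allowed = parser.can_fetch("ScrapeFlow", "https://example.com/page")
blocked = parser.can_fetch("ScrapeFlow", "https://example.com/private/data")
print(allowed, blocked)  # True False
```

The first argument to `can_fetch` plays the same role as `user_agent_for_robots` in the configuration above: rules are matched against that agent name, falling back to the `*` entry.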

Anti-Detection

Use Case: Scraping protected e-commerce sites that block automated access.

import asyncio

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, AntiDetectionConfig

# Configure anti-detection for protected sites
config = ScrapeFlowConfig(
    anti_detection=AntiDetectionConfig(
        # Rotate user agents to appear as different browsers
        rotate_user_agents=True,
        user_agents=[
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Firefox/120.0",
        ],
        # Enable stealth mode to remove automation indicators
        stealth_mode=True,  # Removes webdriver property, mocks plugins, etc.
        # Use realistic viewport sizes
        viewport_width=1920,
        viewport_height=1080,
        # Optional: Rotate proxies for additional protection
        rotate_proxies=True,
        proxies=[
            {"server": "http://proxy1.example.com:8080"},
            {"server": "http://proxy2.example.com:8080"},
        ],
    )
)

async def main():
    async with ScrapeFlow(config) as scraper:
        # This will use stealth techniques automatically
        await scraper.navigate("https://protected-site.com")
        # Your scraping code here...

asyncio.run(main())

Rate Limiting

Use Case: Respecting API rate limits when scraping multiple pages to avoid getting blocked.

import asyncio

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RateLimitConfig

config = ScrapeFlowConfig(
    rate_limit=RateLimitConfig(
        requests_per_second=1.0,    # Max 1 request per second
        requests_per_minute=60.0,   # Or 60 requests per minute
        burst_size=5,                # Allow bursts of 5 requests
    )
)

async def main():
    async with ScrapeFlow(config) as scraper:
        # Scrape multiple pages - rate limiter ensures we don't exceed limits
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
            "https://quotes.toscrape.com/page/3/",
        ]

        for url in urls:
            await scraper.navigate(url)  # Automatically rate-limited
            # Extract data...
            print(f"Scraped: {url}")
            # Rate limiter ensures proper delays between requests

asyncio.run(main())
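For readers unfamiliar with the token bucket algorithm named in the feature list, here is a minimal sketch of how `requests_per_second` and `burst_size` could map onto a refill rate and a capacity. It is an illustration under that assumption, not ScrapeFlow's rate limiter; `TokenBucket` and its injectable `clock` are hypothetical, and the fake clock only exists so the demo runs without sleeping.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the bucket is empty."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        return max(0.0, (1.0 - self._tokens) / self.rate)


# Demo with a fake clock: 2 tokens/second, bursts of 5.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=5, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(6)]
fake_now[0] += 0.5  # half a second refills exactly one token
after_wait = bucket.try_acquire()
print(burst, after_wait)  # [True, True, True, True, True, False] True
```

An async limiter would typically sleep for `wait_time()` seconds instead of returning False, which is the behaviour the comments in the example above describe.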

Retry Logic

Use Case: Handling network failures and temporary server errors when scraping unreliable sources.

import asyncio

from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, RetryConfig

config = ScrapeFlowConfig(
    retry=RetryConfig(
        max_retries=5,              # Retry up to 5 times
        initial_delay=1.0,          # Start with 1 second delay
        max_delay=60.0,             # Cap at 60 seconds
        exponential_base=2.0,       # Double delay each retry (1s, 2s, 4s, 8s...)
        jitter=True,                 # Add randomness to avoid thundering herd
    )
)

async def main():
    async with ScrapeFlow(config) as scraper:
        # If this fails, it will automatically retry with exponential backoff
        await scraper.navigate("https://unreliable-site.com/products")
        # Retry logic handles:
        # - Network timeouts
        # - 500/502/503 server errors
        # - Connection errors
        # - Temporary blocks

asyncio.run(main())
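The retry parameters above imply a delay schedule of `initial_delay * exponential_base ** attempt`, capped at `max_delay`. The sketch below computes that schedule. The jitter strategy shown (a uniform draw between 0 and the capped delay, often called full jitter) is an assumption, since the exact strategy ScrapeFlow uses is not described here; `backoff_delays` is a hypothetical helper.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    initial_delay: float,
    max_delay: float,
    exponential_base: float,
    jitter: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the delay (in seconds) to wait before each retry attempt."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(max_delay, initial_delay * exponential_base ** attempt)
        if jitter:
            # Assumed strategy: pick uniformly between 0 and the capped delay.
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5, 1.0, 60.0, 2.0))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With these settings the cap only matters from the seventh retry onward, where the uncapped delay would be 64 seconds.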

Pagination

Use Case: Scrape multiple pages (e.g. search results, product listings) with limits.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.config import ScrapeFlowConfig, PaginationConfig
from scrapeflow.pagination import paginate
from scrapeflow.registry import get_registry

async def extract_quotes(page, context):
    quotes = []
    for i in range(await page.locator(".quote").count()):
        q = page.locator(".quote").nth(i)
        text = await q.locator(".text").text_content()
        author = await q.locator(".author").text_content()
        quotes.append({"text": text or "", "author": author or ""})
    return quotes

async def main():
    get_registry().register_pagination("quotes", next_selector="li.next a", has_next="li.next")
    handler = get_registry().get_pagination("quotes")
    config = PaginationConfig(max_pages=3, max_results=25)

    async with ScrapeFlow(ScrapeFlowConfig()) as scraper:
        async for page_data in paginate(scraper, "https://quotes.toscrape.com/", handler, extract_quotes, config):
            print(f"Page: {len(page_data)} quotes")
            for q in page_data[:2]:
                print(f"  - {q['author']}: {q['text'][:40]}...")

asyncio.run(main())
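The two limits in PaginationConfig interact: iteration stops at whichever cap is hit first. The sketch below shows one plausible reading, in which the final page is trimmed to fit `max_results`. Whether ScrapeFlow trims the last page or simply stops is an assumption; `limit_pages` is a hypothetical helper and the pages are made-up lists.

```python
from typing import Iterable, Iterator, List


def limit_pages(
    pages: Iterable[List[dict]],
    max_pages: int,
    max_results: int,
) -> Iterator[List[dict]]:
    """Yield pages until either the page cap or the result cap is reached."""
    seen = 0
    for page_number, items in enumerate(pages, start=1):
        if page_number > max_pages or seen >= max_results:
            break
        remaining = max_results - seen
        batch = items[:remaining]  # trim the final page to fit the result cap
        seen += len(batch)
        yield batch


# Five fake pages of ten items each.
fake_pages = [[{"id": i} for i in range(p * 10, p * 10 + 10)] for p in range(5)]
sizes = [len(batch) for batch in limit_pages(fake_pages, max_pages=3, max_results=25)]
print(sizes)  # [10, 10, 5]
```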

Data Extraction

Use Case: Extracting structured data from quotes.toscrape.com and books.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.extractors import Extractor, StructuredExtractor

async def main():
    async with ScrapeFlow() as scraper:
        await scraper.navigate("https://quotes.toscrape.com/")

        # Method 1: Simple extraction
        page_title = await Extractor.extract_text(scraper.page, "h1")
        all_links = await Extractor.extract_links(scraper.page, "a")

        # Method 2: Structured extraction with schema (Best for complex pages)
        schema = {
            "page_title": "h1",
            "quotes": {
                "items": ".quote",  # Find all quote elements
                "schema": {
                    "text": ".text",           # Extract quote text
                    "author": ".author",       # Extract author
                    "tags": ".tag",            # Extract all tags
                },
            },
        }
        extractor = StructuredExtractor(schema)
        structured_data = await extractor.extract(scraper.page)

        print(f"Page: {structured_data['page_title']}")
        print(f"Quotes found: {len(structured_data['quotes'])}")
        if structured_data['quotes']:
            first = structured_data['quotes'][0]
            print(f"First quote: {first['text'][:60]}...")
            print(f"Author: {first['author']}")
            print(f"Tags: {first['tags']}")

asyncio.run(main())

Real Output:

Page: Quotes to Scrape
Quotes found: 10
First quote: "The world as we have created it is a process of our thinkin...
Author: Albert Einstein
Tags: ['change', 'deep-thoughts', 'thinking', 'world']

Workflows

Use Case: Building a multi-step scraper that navigates, extracts, and processes data with error handling.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.extractors import Extractor

async def login_step(page, context):
    """Step 1: Login to the site"""
    # Scraper is automatically available in context
    scraper = context["scraper"]
    await scraper.navigate("https://example.com/login")
    await scraper.fill("#username", context["username"])
    await scraper.fill("#password", context["password"])
    await scraper.click("button[type='submit']")
    await scraper.wait_for_selector(".dashboard", timeout=10000)
    context["logged_in"] = True  # Lets the "extract" step's condition pass

async def extract_products(page, context):
    """Step 2: Extract product data"""
    products = []
    product_elements = page.locator(".product")
    count = await product_elements.count()

    for i in range(count):
        product = product_elements.nth(i)
        products.append({
            "name": await Extractor.extract_text(product, ".name"),
            "price": await Extractor.extract_text(product, ".price"),
        })
    return products

async def save_to_database(data, context):
    """Callback: Save extracted data"""
    print(f"💾 Saving {len(data)} products to database...")
    # Your database save logic here

async def handle_error(error, context):
    """Callback: Handle errors"""
    print(f"❌ Error in workflow: {error}")
    # Your error handling logic here

async def main():
    async with ScrapeFlow() as scraper:
        workflow = Workflow(name="product_scraper")

        # Step 1: Login (required - stops workflow if fails)
        workflow.add_step(
            name="login",
            func=login_step,
            required=True,
            retryable=True,
        )

        # Step 2: Extract products (only if login succeeded)
        workflow.add_step(
            name="extract",
            func=extract_products,
            retryable=True,
            on_success=save_to_database,
            on_error=handle_error,
            condition=lambda ctx: ctx.get("logged_in", False),  # Conditional
        )

        # Set context
        workflow.set_context("username", "user@example.com")
        workflow.set_context("password", "secret123")

        # Execute workflow
        result = await scraper.run_workflow(workflow)
        print(f"✅ Workflow completed: {result.success}")

asyncio.run(main())

Monitoring & Metrics

Use Case: Monitoring scraping performance when scraping multiple pages from quotes.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow

async def main():
    async with ScrapeFlow() as scraper:
        # Perform multiple scraping operations
        urls = [
            "https://quotes.toscrape.com/",
            "https://quotes.toscrape.com/page/2/",
        ]

        for url in urls:
            await scraper.navigate(url)
            # Extract data...

        # Get comprehensive metrics
        metrics = scraper.get_metrics()

        print("📊 Performance Metrics:")
        print(f"   Success Rate: {metrics.get_success_rate():.2f}%")
        print(f"   Total Requests: {metrics.total_requests}")
        print(f"   Successful: {metrics.successful_requests}")
        print(f"   Failed: {metrics.failed_requests}")
        print(f"   Retries: {metrics.retry_count}")
        print(f"   Avg Response Time: {metrics.average_response_time:.2f}s")
        print(f"   Total Duration: {metrics.total_duration:.2f}s")

        # Reset metrics for next batch
        scraper.reset_metrics()

asyncio.run(main())

Real Output:

📊 Performance Metrics:
   Success Rate: 100.00%
   Total Requests: 2
   Successful: 2
   Failed: 0
   Retries: 0
   Avg Response Time: 1.10s
   Total Duration: 2.20s
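As a rough guide to how the reported figures relate to each other, the sketch below derives a success rate and an average response time from per-request records. It mirrors the attribute names printed above, but `RunMetrics` is a hypothetical stand-in and not the object returned by get_metrics(); the recorded timings are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunMetrics:
    """Hypothetical per-run counters, for illustration only."""

    successful_requests: int = 0
    failed_requests: int = 0
    response_times: List[float] = field(default_factory=list)

    def record(self, ok: bool, seconds: float) -> None:
        """Record one request outcome and how long it took."""
        if ok:
            self.successful_requests += 1
        else:
            self.failed_requests += 1
        self.response_times.append(seconds)

    @property
    def total_requests(self) -> int:
        return self.successful_requests + self.failed_requests

    def get_success_rate(self) -> float:
        """Percentage of successful requests; 0.0 when nothing was recorded."""
        if self.total_requests == 0:
            return 0.0
        return 100.0 * self.successful_requests / self.total_requests

    @property
    def average_response_time(self) -> float:
        if not self.response_times:
            return 0.0
        return sum(self.response_times) / len(self.response_times)


metrics = RunMetrics()
metrics.record(True, 1.0)
metrics.record(True, 1.5)
metrics.record(False, 0.5)
print(f"{metrics.get_success_rate():.2f}%")  # 66.67%
print(metrics.average_response_time)  # 1.0
```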

Error Handling

Use Case: Gracefully handling different types of errors when scraping quotes.toscrape.com.

import asyncio
from scrapeflow import ScrapeFlow
from scrapeflow.exceptions import (
    ScrapeFlowError,
    ScrapeFlowRetryError,
    ScrapeFlowTimeoutError,
    ScrapeFlowBlockedError,
)

async def scrape_with_error_handling():
    async with ScrapeFlow() as scraper:
        try:
            await scraper.navigate("https://quotes.toscrape.com/")
            title = await scraper.page.title()
            print(f"✅ Successfully scraped: {title}")

        except ScrapeFlowBlockedError as e:
            # Site blocked us - wait and retry later
            print(f"🚫 Blocked! Retry after {e.retry_after} seconds")

        except ScrapeFlowTimeoutError:
            # Request took too long
            print("⏱️ Request timed out - site may be slow")

        except ScrapeFlowRetryError as e:
            # All retries exhausted
            print(f"❌ Failed after {e.retry_count} retries")

        except ScrapeFlowError as e:
            # Generic ScrapeFlow error
            print(f"⚠️ ScrapeFlow error: {e}")

        except Exception as e:
            # Other unexpected errors
            print(f"💥 Unexpected error: {e}")

asyncio.run(scrape_with_error_handling())

Real Output:

✅ Successfully scraped: Quotes to Scrape

📋 Feature Coverage: Nothing Hidden

Every ScrapeFlow feature is demonstrated somewhere. Quick reference:

| Feature | Where to See It |
| --- | --- |
| Extractors | Use Case 1, basic_usage.py, advanced_example.py |
| StructuredExtractor | Use Case 2, workflow_example.py, advanced_example.py |
| SpecificationExtractor | Use Case 6, Doc section, specification_driven_example.py |
| HybridExtractor | hybrid_extraction_example.py |
| FieldSpec, ItemSpec | Doc section, specification_driven_example.py, hybrid_extraction_example.py |
| Selector fallback [".a", ".b"] | hybrid_extraction_example.py |
| LLM extraction | Use Case 5, llm_extraction_example.py |
| Login | Use Case 6, authenticated_quotes_example.py, session_persistence_example.py |
| Session persistence | Use Case 6, session_persistence_example.py |
| Registry, LoginHandler | authenticated_quotes_example.py, specification_driven_example.py |
| Workflow | Use Case 2, workflow_example.py |
| Rate limit, Retry | Use Cases 1, 4, basic_usage.py, advanced_example.py |
| Anti-detection | Use Case 3, advanced_example.py |
| Ethical crawling, robots.txt | specification_driven_example.py, authenticated_quotes_example.py |
| Pagination | Doc section, scrapeflow.pagination.paginate, PaginationConfig |
| Content cleaning | Used internally by MistralLLMBackend; scrapeflow.content_utils |

🎨 Complete Examples

Check out the examples/ directoryβ€”each example showcases specific features:

| Example | Features Demonstrated |
|---|---|
| basic_usage.py | ScrapeFlow, Extractor, rate limit, retry, metrics |
| workflow_example.py | Workflow, steps, on_success/on_error, StructuredExtractor |
| advanced_example.py | Anti-detection, rate limit, retry, StructuredExtractor |
| specification_driven_example.py | SpecificationExtractor, FieldSpec, ItemSpec, ethical crawling, registry |
| authenticated_quotes_example.py | Login, login_with_handler, registry, LoginHandler |
| llm_extraction_example.py | Mistral LLM, generate_schema_from_prompt, extract_with_schema, MistralLLMBackend (requires MISTRAL_API_KEY) |
| hybrid_extraction_example.py | HybridExtractor, selector fallback chain [".a", ".b"], LLM fallback |
| session_persistence_example.py | save_storage_state, storage_state_path, skip login on reuse |

Real Outputs from Examples

Outputs below are from running each example against live sites (quotes.toscrape.com, books.toscrape.com).

basic_usage.py

Page title: Quotes to Scrape

Extracted 5 quotes:

1. "The world as we have created it is a process of our thinkin...
   Author: Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. "It is our choices, Harry, that show what we truly are, far ...
   Author: J.K. Rowling
   Tags: abilities, choices

3. "There are only two ways to live your life. One is as though...
   Author: Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. "The person, be it gentleman or lady, who has not pleasure i...
   Author: Jane Austen
   Tags: aliteracy, books, classic, humor

5. "Imperfection is beauty, madness is genius and it's better t...
   Author: Marilyn Monroe
   Tags: be-yourself, inspirational

📊 Metrics:
   Success rate: 100.00%
   Total requests: 1
   Average response time: 1.72s

advanced_example.py

📝 Extracted quotes:

1. "The world as we have created it is a process of our thinkin...
   Author: Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. "It is our choices, Harry, that show what we truly are, far ...
   Author: J.K. Rowling
   Tags: abilities, choices

3. "There are only two ways to live your life. One is as though...
   Author: Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. "The person, be it gentleman or lady, who has not pleasure i...
   Author: Jane Austen
   Tags: aliteracy, books, classic, humor

5. "Imperfection is beauty, madness is genius and it's better t...
   Author: Marilyn Monroe
   Tags: be-yourself, inspirational

Scraping completed:
  Success rate: 100.00%
  Total requests: 1
  Retries: 0
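
This run needed no retries. When a request does fail, the feature list says retries use exponential backoff with jitter. A minimal sketch of such a delay schedule follows. The function name `backoff_delays`, the `base` and `cap` parameters, and the "full jitter" variant (a uniform draw between zero and the exponential ceiling) are assumptions made for illustration, not ScrapeFlow's actual retry implementation.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay per retry: uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... up to cap
        delays.append(rng.uniform(0.0, ceiling))   # "full jitter"
    return delays


# Seeded generator so repeated runs print the same schedule.
delays = backoff_delays(max_retries=3, rng=random.Random(42))
for attempt, delay in enumerate(delays, start=1):
    print(f"retry {attempt}: sleep {delay:.2f}s")
```

The jitter spreads out retries from many clients so they do not hit the server at the same instant, while the cap keeps the wait bounded.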

workflow_example.py

💾 Saving 20 books to database...
   - A Light in the ...... - £51.77
   - Tipping the Velvet... - £53.74
   - Soumission... - £50.10

✅ Workflow completed successfully!
📚 Extracted 20 books

📊 Metrics:
   Total requests: 3
   Success rate: 100.00%
   Average response time: 5.85s

specification_driven_example.py

Extracted 10 items (validated via Pydantic)
  1. "The world as we have created it is a process of o... | Albert Einstein
  2. "It is our choices, Harry, that show what we truly... | J.K. Rowling
  3. "There are only two ways to live your life. One is... | Albert Einstein

Metrics: 1 requests, 100.0% success

authenticated_quotes_example.py

Authenticated login: True
Quotes extracted: 10
1. Albert Einstein: "The world as we have created it is a process of our thinking. It cannot be chan...
2. J.K. Rowling: "It is our choices, Harry, that show what we truly are, far more than our abilit...
3. Albert Einstein: "There are only two ways to live your life. One is as though nothing is a miracl...

hybrid_extraction_example.py

Extracted (selector or LLM):
  Title: A Light in the Attic
  Price: £51.77
  Availability: In stock (22 available)
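
The "selector or LLM" label reflects the hybrid strategy: each field tries its selectors in order, and a fallback extractor is consulted only when no selector yields a usable value. A minimal sketch of that control flow follows. `extract_field`, `query`, and `fallback` are hypothetical names, and a plain dictionary stands in for a parsed page, so this is an illustration of the idea and not how `HybridExtractor` is implemented.

```python
from typing import Callable, Optional, Sequence


def extract_field(
    selectors: Sequence[str],
    query: Callable[[str], Optional[str]],
    fallback: Optional[Callable[[], Optional[str]]] = None,
) -> Optional[str]:
    """Try each selector in order; use the fallback only if all of them miss."""
    for selector in selectors:
        value = query(selector)
        if value is not None and value.strip():
            return value.strip()
    return fallback() if fallback is not None else None


# A dict stands in for the page: selector -> text content.
fake_page = {".cost": " £51.77 ", "h1": "A Light in the Attic"}

# ".price" is absent, so the second selector in the chain is used.
price = extract_field([".price", ".cost"], fake_page.get)
# No selector matches, so the fallback (standing in for an LLM call) runs.
stock = extract_field([".stock"], fake_page.get, fallback=lambda: "In stock")

print(price)
print(stock)
```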

llm_extraction_example.py (requires MISTRAL_API_KEY)

Generated schema: {'type': 'object', 'properties': {'quote_text': {'type': 'string', 'description': 'The text of the quote'}, 'author_name': {'type': 'string', 'description': 'The name of the author of the quote'}, 'tags': {'type': 'array', 'description': 'A list of tags associated with the quote'}}, 'required': ['quote_text', 'author_name', 'tags']}

LLM extracted: [{'quote_text': 'The world as we have created it is a process of our thinking...', 'author_name': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}, ...]

Semantic extraction: {'quote': 'The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', 'author': 'Albert Einstein', 'tags': 'change, deep-thoughts, thinking, world'}
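
A generated schema like the one above can be used to sanity-check extracted records before they are stored. The sketch below is a deliberately small validator written for illustration: it covers only `required` and a few basic types, and `check_record` is a hypothetical helper. A real project would more likely rely on Pydantic models or a full JSON Schema validator.

```python
# Map the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def check_record(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if key in record and expected and not isinstance(record[key], expected):
            problems.append(f"{key}: expected {spec['type']}")
    return problems


schema = {
    "type": "object",
    "properties": {
        "quote_text": {"type": "string"},
        "author_name": {"type": "string"},
        "tags": {"type": "array"},
    },
    "required": ["quote_text", "author_name", "tags"],
}

good = {
    "quote_text": "The world as we have created it",
    "author_name": "Albert Einstein",
    "tags": ["change", "world"],
}
bad = {"quote_text": "The world as we have created it", "tags": "change, world"}

print(check_record(good, schema))
print(check_record(bad, schema))
```

The second record fails twice: `author_name` is missing, and `tags` is a comma-separated string where the schema asks for an array.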

session_persistence_example.py

Session saved to /tmp/scrapeflow_quotes_session.json
Loaded session: extracted 10 quotes (authenticated)

Example: Complete Book Scraper with All Features

Real-world scenario: an end-to-end example that scrapes books from books.toscrape.com, combining anti-detection, rate limiting, retries, a two-step workflow, structured extraction, and metrics.

import asyncio
from scrapeflow import ScrapeFlow, Workflow
from scrapeflow.config import (
    ScrapeFlowConfig,
    AntiDetectionConfig,
    RateLimitConfig,
    RetryConfig
)
from scrapeflow.extractors import StructuredExtractor

async def scrape_books_complete():
    """Complete book scraping solution with all ScrapeFlow features."""

    config = ScrapeFlowConfig(
        anti_detection=AntiDetectionConfig(
            rotate_user_agents=True,
            stealth_mode=True,
        ),
        rate_limit=RateLimitConfig(requests_per_second=1.0),
        retry=RetryConfig(max_retries=3),
    )

    async with ScrapeFlow(config) as scraper:
        workflow = Workflow(name="book_monitor")

        async def extract_books(page, context):
            schema = {
                "books": {
                    "items": "article.product_pod",
                    "schema": {
                        "title": "h3 a",
                        "price": ".price_color",
                    }
                }
            }
            extractor = StructuredExtractor(schema)
            return await extractor.extract(page)

        async def navigate_to_books(page, context):
            scraper = context["scraper"]
            await scraper.navigate("https://books.toscrape.com/")
            await scraper.wait_for_selector("article.product_pod", timeout=10000)

        workflow.add_step("navigate", navigate_to_books, required=True)
        workflow.add_step("extract", extract_books)

        result = await scraper.run_workflow(workflow)

        # Get metrics
        metrics = scraper.get_metrics()
        books = result.final_data.get("books", [])
        print(f"✅ Scraped {len(books)} books")
        print(f"📊 Success rate: {metrics.get_success_rate():.2f}%")
        print(f"📊 Average response time: {metrics.average_response_time:.2f}s")

        if books:
            print(f"\n📚 Sample books:")
            for book in books[:3]:
                print(f"   - {book.get('title', '')[:40]}... - {book.get('price', '')}")

        return result.final_data

asyncio.run(scrape_books_complete())

Real Output:

Workflow 'book_monitor' completed. Success: True, Steps: 2/2
✅ Scraped 20 books
📊 Success rate: 100.00%
📊 Average response time: 1.15s

📚 Sample books:
   - A Light in the Attic... - £51.77
   - Tipping the Velvet... - £53.74
   - Soumission... - £50.10
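
The `RateLimitConfig(requests_per_second=1.0)` setting in this example throttles requests, and the feature list describes the limiter as a token bucket. As a rough illustration of that algorithm (not the code in ScrapeFlow's `rate_limiter.py`), the sketch below uses a hypothetical `TokenBucket` class; its `capacity` parameter and injectable clock are assumptions made so the example runs deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so the first request is immediate
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until one token is available (0.0 if one is ready now)."""
        self._refill()
        return max(0.0, (1.0 - self.tokens) / self.rate)


# Drive the bucket with a fake clock so the example is deterministic.
now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=1.0, clock=lambda: now[0])

print(bucket.try_acquire())  # True: the bucket starts full
print(bucket.try_acquire())  # False: no token left yet
print(bucket.wait_time())    # 1.0: one second until the next token
now[0] += 1.0                # simulate one second passing
print(bucket.try_acquire())  # True: the token has been refilled
```

A larger `capacity` would allow short bursts above the steady rate, which is the usual reason to prefer a token bucket over a fixed delay between requests.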

🏗️ Architecture

ScrapeFlow is built with a modular architecture:

scrapeflow/
├── engine.py            # Main ScrapeFlow engine
├── ports.py             # Protocols for dependency inversion
├── browser_runtime.py   # Playwright runtime adapter
├── workflow.py          # Workflow definition entities
├── workflow_executor.py # Workflow execution service
├── config.py            # Configuration (EthicalCrawling, Pagination, etc.)
├── specifications.py    # SpecificationExtractor, HybridExtractor, FieldSpec
├── schema_library.py    # Reusable schema definitions
├── extractors.py        # Extractor, StructuredExtractor
├── llm_extract.py       # Mistral LLM schema + extraction
├── content_utils.py     # HTML cleaning for LLM
├── mcp_backend.py       # MCPBackend, MistralLLMBackend
├── pagination.py        # paginate() helper
├── robots.py            # robots.txt parsing and enforcement
├── registry.py          # Selectors, login handlers, pagination
├── anti_detection.py    # Stealth mode, user agent rotation
├── rate_limiter.py      # Rate limiting implementation
├── retry.py             # Retry logic and error classification
├── monitoring.py        # Metrics, logging, alerting
└── exceptions.py        # Custom exceptions

For deeper design details, see ARCHITECTURE.md.
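
The `ports.py` entry mentions protocols for dependency inversion: the engine depends on an abstract interface, and a concrete adapter (such as the Playwright runtime) is plugged in from outside. The sketch below shows the general pattern with Python's `typing.Protocol`. `PageRuntime`, `Engine`, and `FakeRuntime` are hypothetical names chosen for illustration and do not reflect the actual contents of `ports.py`.

```python
from typing import Protocol


class PageRuntime(Protocol):
    """Hypothetical port: the only thing the engine needs from a browser."""

    def fetch_title(self, url: str) -> str: ...


class Engine:
    """Depends on the protocol, not on a concrete browser library."""

    def __init__(self, runtime: PageRuntime) -> None:
        self.runtime = runtime

    def describe(self, url: str) -> str:
        return f"{url} -> {self.runtime.fetch_title(url)}"


class FakeRuntime:
    """Test double; satisfies PageRuntime structurally, without inheriting from it."""

    def fetch_title(self, url: str) -> str:
        return "Quotes to Scrape"


engine = Engine(FakeRuntime())
print(engine.describe("https://quotes.toscrape.com/"))
```

Because the engine only sees the protocol, it can be unit-tested with a fake like this one, with no browser launched.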

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built on top of Playwright - an amazing browser automation library
  • Inspired by the need for production-ready scraping solutions

📧 Contact

Irfan Ali - GitHub

Project Link: https://github.com/irfanalidv/scrapeflow-py


Made with ❤️ for the scraping community
