
ai-crawler

AI-driven network-first crawler compiler for authorized workflows.

ai-crawler turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with curl-cffi.

Browser is not the crawler. Browser is the probe.
AI is not the request loop. AI is the planner/debugger/recipe author.

What it is

ai-crawler is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.

It focuses on:

  • Network-first API discovery and replay
  • Recipe generation, testing, repair, and deterministic execution
  • Simple CLI defaults for humans and AI harnesses
  • Python SDK facade for application integrations
  • stdio MCP server for Hermes, Claude Code, Codex, and other agents
  • Local-first tests with fake transports and fixture sites
  • Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic

Install for local development

git clone https://github.com/cafitac/ai-crawler.git
cd ai-crawler
uv sync --extra dev --extra http --extra mcp

If you are already inside a local checkout:

uv sync --extra dev --extra http --extra mcp

npm wrapper

For npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:

npx @cafitac/ai-crawler --help
npx @cafitac/ai-crawler auto evidence.json --json
npx @cafitac/ai-crawler mcp

Wrapper behavior:

  • inside the repo checkout: runs the local Python core with uv run --project <repo> ai-crawler ...
  • outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes gitHead, otherwise falls back to uvx --from "git+https://github.com/cafitac/ai-crawler.git[all]" ai-crawler ...
  • override the published Python package spec with AI_CRAWLER_PYTHON_SPEC
  • override the uvx Python version with AI_CRAWLER_UVX_PYTHON

Quick start

The one-command path from URL to crawler artifacts is:

uv sync --extra browser --extra http
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json

compile opens the page briefly, records normalized network response events into evidence.json, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly fetch/xhr 2xx/3xx responses and drops static assets, failed responses, and other browser noise.

If you want to inspect or edit evidence before compiling, split the flow:

uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products"
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document
uv run --extra http ai-crawler auto evidence.json --json

If you already have an evidence file, the main AI-harness command is:

ai-crawler auto evidence.json --json

With a local checkout:

uv run --extra http ai-crawler auto evidence.json --json

This writes default artifacts:

evidence.json            # browser probe evidence, if generated by probe
recipe.yaml              # initial generated recipe
repaired.recipe.yaml     # repaired/final recipe
test.jsonl               # initial diagnostic crawl output
crawl.jsonl              # final crawl output
auto.report.json         # stable machine-readable report
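
The JSONL outputs can be consumed line by line once a run finishes. A minimal sketch in Python (the fields of each record depend on the generated recipe, so the summary below is only illustrative):

import json
from pathlib import Path

# Each non-empty line in crawl.jsonl is one extracted record.
records = [
    json.loads(line)
    for line in Path("crawl.jsonl").read_text().splitlines()
    if line.strip()
]
print(f"collected {len(records)} records")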

The JSON report includes:

  • final success/failure status
  • command_type (compile or auto)
  • failure_phase for quick triage (probe, generate, final_test, or empty on success)
  • ordered phase_diagnostics for probe -> generate -> initial_test -> repair -> final_test
  • recipe/output paths
  • initial and final crawl results
  • bounded/redacted diagnostic samples
  • failure classifications such as success, extraction_failed, http_error, no_response, challenge_detected, probe_failed, and no_endpoint_candidates

In --json mode, stdout is reserved for a single machine-readable JSON object, and human-readable failures are written to stderr. A failing run (exit code 2) still writes auto.report.json so agents can inspect the failure.
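
For an agent harness driving this contract from code, a minimal sketch (assuming the CLI is on PATH and that the stdout JSON mirrors the report) might look like:

import json
import subprocess
from pathlib import Path

proc = subprocess.run(
    ["ai-crawler", "auto", "evidence.json", "--json"],
    capture_output=True,
    text=True,
)

if proc.returncode == 0:
    report = json.loads(proc.stdout)  # one machine-readable JSON object on stdout
else:
    # exit code 2 still writes the report file, so read it for triage
    report = json.loads(Path("auto.report.json").read_text())
    print(proc.stderr)  # human-readable failure text goes to stderr

print(report.get("failure_phase", ""))  # empty on success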

Evidence format

Create evidence with a short browser probe:

uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --output evidence.json

The probe tuning options are available on both probe and compile:

  • --wait-ms: browser settle time after network idle (default: 1000)
  • --max-events: maximum replay candidates retained after filtering (default: 200)
  • --include-resource-type: comma-separated Playwright resource types to retain (default: fetch,xhr)

Minimal evidence JSON:

{
  "target_url": "https://example.com/products",
  "goal": "collect products",
  "events": [
    {
      "method": "GET",
      "url": "https://example.com/api/products?page=1",
      "status_code": 200,
      "resource_type": "fetch"
    }
  ]
}
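
If you already know which endpoint to replay (for example from your own devtools capture), you can write this minimal evidence file from code instead of running the probe. A small sketch using only the fields shown above:

import json

evidence = {
    "target_url": "https://example.com/products",
    "goal": "collect products",
    "events": [
        {
            "method": "GET",
            "url": "https://example.com/api/products?page=1",
            "status_code": 200,
            "resource_type": "fetch",
        }
    ],
}

with open("evidence.json", "w") as f:
    json.dump(evidence, f, indent=2)

Feed the result to ai-crawler auto evidence.json --json as shown earlier.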

Generate and run manually:

uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json

Or run each artifact step yourself:

uv run --extra http ai-crawler generate-recipe evidence.json
uv run --extra http ai-crawler test-recipe recipe.yaml
uv run --extra http ai-crawler repair-recipe recipe.yaml
uv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl

MCP usage

Generate client config snippets for local uv-project usage. For copy-paste examples across CLI/MCP/SDK flows, also see docs/harness-examples.md.

uv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler

Generate npm-first snippets for the published wrapper:

uv run ai-crawler mcp-config --client hermes --launcher npm

Run as a stdio MCP server:

uv run --extra mcp --extra http ai-crawler mcp

Exposed tools:

  • compile_url
  • auto_compile
  • generate_recipe
  • test_recipe
  • repair_recipe

If you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:

npx @cafitac/ai-crawler mcp

Hermes development snippet shape:

mcp_servers:
  ai-crawler:
    command: "uv"
    args: ["run", "--project", "/path/to/ai-crawler", "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60

Hermes npm-first snippet shape:

mcp_servers:
  ai-crawler:
    command: "npx"
    args: ["-y", "@cafitac/ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60

Python SDK

The Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. See docs/harness-examples.md for copy-paste SDK, MCP, and published-wrapper examples.

npm publishing is automated with .github/workflows/npm-publish.yml.

  • push a tag matching the package version, for example npm-v0.1.2
  • or run the workflow manually with workflow_dispatch
  • the workflow validates that package.json, pyproject.toml, and src/ai_crawler/__init__.py agree on the release version before publish
  • tag-triggered publishes also validate that the pushed tag matches npm-v<package.json version>
  • use docs/release-runbook.md for the full version bump, tagging, and post-publish smoke checklist
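
Before tagging, you can replicate the version-agreement check locally. A rough sketch (it assumes pyproject.toml declares project.version statically and that __init__.py exposes a quoted __version__ string, which may not match the repo's actual layout):

import json
import re
import tomllib  # Python 3.11+
from pathlib import Path

npm_version = json.loads(Path("package.json").read_text())["version"]
py_version = tomllib.loads(Path("pyproject.toml").read_text())["project"]["version"]
init_src = Path("src/ai_crawler/__init__.py").read_text()
init_version = re.search(r'__version__\s*=\s*["\']([^"\']+)["\']', init_src).group(1)

# All three must agree before publishing; tag-triggered publishes also require the matching tag name.
assert npm_version == py_version == init_version, (npm_version, py_version, init_version)
print(f"versions agree, tag as npm-v{npm_version}")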

Example tag flow:

git tag npm-v0.1.2
git push origin npm-v0.1.2

Basic SDK usage:

from ai_crawler import AICrawler

crawler = AICrawler()
result = crawler.auto("evidence.json")
print(result.ok)
print(result.exit_code)
print(result.report)

compile_result = crawler.compile_url("https://example.com/products", goal="collect products")
print(compile_result.report["command_type"])

For tests or embedded usage, inject a fake fetcher:

crawler = AICrawler(fetcher=my_fake_fetcher)
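
The fetcher protocol itself is defined by the library and not documented in this README, so treat the following purely as a hypothetical shape for a canned-response fake rather than the real interface:

# Hypothetical fake fetcher shape: the real interface ai-crawler expects may differ.
class FakeFetcher:
    def fetch(self, method, url, **kwargs):
        # Return a canned response instead of touching the network.
        return {
            "status_code": 200,
            "body": '{"products": [{"id": 1, "name": "example"}]}',
        }

crawler = AICrawler(fetcher=FakeFetcher())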

Verification

Fast local lint/type checks while iterating:

bash scripts/check-python.sh

Full project verification:

bash scripts/verify-ai-harness.sh

MCP auto_compile fixture smoke test:

uv run --extra http python scripts/smoke-mcp-auto-compile.py

This starts a local fixture HTTP site and verifies the generate -> test -> repair -> retest flow without external network access, a real browser, or a real LLM.

Security and compliance boundary

ai-crawler is intended for authorized crawling, internal QA/testing, research, monitoring of owned or permitted web properties, and data portability workflows.

It does not implement:

  • CAPTCHA solving
  • MFA bypass
  • Cloudflare/bot-challenge bypass
  • stealth fingerprint manipulation
  • proxy rotation for evasion

Challenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.

Sensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.
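
As an illustration of the kind of masking this implies (a simplified sketch, not the library's actual redaction code or pattern list):

import re

# Simplified illustration: mask common credential-bearing patterns before surfacing diagnostics.
PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)((?:cookie|set-cookie):\s*)[^\r\n]+"),
    re.compile(r'(?i)("(?:token|api_key|session_id)"\s*:\s*")[^"]+'),
]

def redact(text: str) -> str:
    for pattern in PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text

print(redact('Authorization: Bearer abc123 {"token": "secret"}'))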

Documentation

Development docs live under .dev/:

  • .dev/README.md
  • .dev/03-ai/auto-harness-contract.md
  • .dev/04-mcp/server.md
  • .dev/08-operations/security-and-compliance.md
  • .dev/08-operations/challenge-handling-policy.md

Status

Alpha. The deterministic recipe compiler, one-command compile flow, browser probe, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally left as optional/future layers behind adapter boundaries.

License

MIT
