AI-driven network-first crawler compiler for authorized workflows.
ai-crawler turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with curl-cffi.
Browser is not the crawler. Browser is the probe.
AI is not the request loop. AI is the planner/debugger/recipe author.
ai-crawler is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.
It focuses on:
- Network-first API discovery and replay
- Recipe generation, testing, repair, and deterministic execution
- Simple CLI defaults for humans and AI harnesses
- Python SDK facade for application integrations
- stdio MCP server for Hermes, Claude Code, Codex, and other agents
- Local-first tests with fake transports and fixture sites
- Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic
```
git clone https://github.com/cafitac/ai-crawler.git
cd ai-crawler
uv sync --extra dev --extra http --extra mcp
```

If you are already inside a local checkout:

```
uv sync --extra dev --extra http --extra mcp
```

For npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:
```
npx @cafitac/ai-crawler --help
npx @cafitac/ai-crawler auto evidence.json --json
npx @cafitac/ai-crawler mcp
```

Wrapper behavior:
- inside the repo checkout: runs the local Python core with `uv run --project <repo> ai-crawler ...`
- outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes `gitHead`, otherwise falls back to `uvx --from "git+https://github.com/cafitac/ai-crawler.git[all]" ai-crawler ...`
- override the published Python package spec with `AI_CRAWLER_PYTHON_SPEC` (see the example after this list)
- override the uvx Python version with `AI_CRAWLER_UVX_PYTHON`
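For example, to point the wrapper at a fork, you could override the spec inline; this is a sketch, `<your-fork>` is a placeholder, and the spec syntax follows the fallback form shown above:

```
AI_CRAWLER_PYTHON_SPEC="git+https://github.com/<your-fork>/ai-crawler.git[all]" npx @cafitac/ai-crawler --help
```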
The one-command path from URL to crawler artifacts is:
```
uv sync --extra browser --extra http
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```

`compile` opens the page briefly, records normalized network response events into `evidence.json`, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly fetch/xhr 2xx/3xx responses and drops static assets, failed responses, and other browser noise.
If you want to inspect or edit evidence before compiling, split the flow:
```
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products"
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document
uv run --extra http ai-crawler auto evidence.json --json
```

If you already have an evidence file, the main AI-harness command is:

```
ai-crawler auto evidence.json --json
```

With a local checkout:

```
uv run --extra http ai-crawler auto evidence.json --json
```

This writes default artifacts:
```
evidence.json         # browser probe evidence, if generated by probe
recipe.yaml           # initial generated recipe
repaired.recipe.yaml  # repaired/final recipe
test.jsonl            # initial diagnostic crawl output
crawl.jsonl           # final crawl output
auto.report.json      # stable machine-readable report
```
The JSON report includes:
- final success/failure status
- `command_type` (`compile` or `auto`)
- `failure_phase` for quick triage (`probe`, `generate`, `final_test`, or empty on success)
- ordered `phase_diagnostics` for `probe -> generate -> initial_test -> repair -> final_test`
- recipe/output paths
- initial and final crawl results
- bounded/redacted diagnostic samples
- failure classifications such as `success`, `extraction_failed`, `http_error`, `no_response`, `challenge_detected`, `probe_failed`, and `no_endpoint_candidates`
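The exact schema is defined by the tool; the sketch below is illustrative only, using the field names from the list above plus assumed names (`success`, `classification`, `recipe_path`, `output_path`, and the `phase_diagnostics` entry shape) for everything this README does not spell out:

```json
{
  "command_type": "auto",
  "success": true,
  "failure_phase": "",
  "classification": "success",
  "recipe_path": "repaired.recipe.yaml",
  "output_path": "crawl.jsonl",
  "phase_diagnostics": [
    {"phase": "generate", "ok": true},
    {"phase": "initial_test", "ok": true},
    {"phase": "repair", "ok": true},
    {"phase": "final_test", "ok": true}
  ]
}
```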
In `--json` mode, stdout is reserved for one machine-readable JSON object, and human-readable failures are written to stderr. A run that fails with exit code 2 still writes `auto.report.json` so agents can inspect the failure.
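A harness can therefore parse stdout on success and fall back to the report file on failure. A minimal sketch, assuming the report is read as plain JSON (top-level keys beyond those listed above are not guaranteed):

```python
import json
import subprocess

# Run the auto command in machine-readable mode; stdout carries one JSON object.
proc = subprocess.run(
    ["ai-crawler", "auto", "evidence.json", "--json"],
    capture_output=True,
    text=True,
)
if proc.returncode == 0:
    report = json.loads(proc.stdout)
else:
    # Per the contract above, exit code 2 still writes auto.report.json,
    # so fall back to the report file for triage.
    with open("auto.report.json") as fh:
        report = json.load(fh)

print(report.get("failure_phase") or "ok")
```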
Create evidence with a short browser probe:
```
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --output evidence.json
```

The probe tuning options are available on both `probe` and `compile`:

- `--wait-ms`: browser settle time after network idle (default: `1000`)
- `--max-events`: maximum replay candidates retained after filtering (default: `200`)
- `--include-resource-type`: comma-separated Playwright resource types to retain (default: `fetch,xhr`)
Minimal evidence JSON:
```json
{
  "target_url": "https://example.com/products",
  "goal": "collect products",
  "events": [
    {
      "method": "GET",
      "url": "https://example.com/api/products?page=1",
      "status_code": 200,
      "resource_type": "fetch"
    }
  ]
}
```

Generate and run manually:
```
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json
```

Or run each artifact step yourself:

```
uv run --extra http ai-crawler generate-recipe evidence.json
uv run --extra http ai-crawler test-recipe recipe.yaml
uv run --extra http ai-crawler repair-recipe recipe.yaml
uv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl
```

Generate client config snippets for local uv-project usage. For copy-paste examples across CLI/MCP/SDK flows, also see docs/harness-examples.md.
```
uv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler
```

Generate npm-first snippets for the published wrapper:

```
uv run ai-crawler mcp-config --client hermes --launcher npm
```

Run as a stdio MCP server:

```
uv run --extra mcp --extra http ai-crawler mcp
```

Exposed tools: `compile_url`, `auto_compile`, `generate_recipe`, `test_recipe`, `repair_recipe`.
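Over stdio the server speaks ordinary MCP JSON-RPC, so a raw `tools/call` request looks roughly like the sketch below. The envelope is standard MCP; the argument name (`evidence_path`) is an assumption, not the documented tool schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "auto_compile",
    "arguments": {"evidence_path": "evidence.json"}
  }
}
```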
If you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:

```
npx @cafitac/ai-crawler mcp
```

Hermes development snippet shape:
```yaml
mcp_servers:
  ai-crawler:
    command: "uv"
    args: ["run", "--project", "/path/to/ai-crawler", "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60
```

Hermes npm-first snippet shape:
```yaml
mcp_servers:
  ai-crawler:
    command: "npx"
    args: ["-y", "@cafitac/ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60
```

The Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. See docs/harness-examples.md for copy-paste SDK, MCP, and published-wrapper examples.
npm publishing is automated with `.github/workflows/npm-publish.yml`:

- push a tag matching the package version, for example `npm-v0.1.2`
- or run the workflow manually with `workflow_dispatch`
- the workflow validates that `package.json`, `pyproject.toml`, and `src/ai_crawler/__init__.py` agree on the release version before publish
- tag-triggered publishes also validate that the pushed tag matches `npm-v<package.json version>`
- use `docs/release-runbook.md` for the full version bump, tagging, and post-publish smoke checklist
Example tag flow:
```
git tag npm-v0.1.2
git push origin npm-v0.1.2
```

Python SDK usage:

```python
from ai_crawler import AICrawler

crawler = AICrawler()
result = crawler.auto("evidence.json")
print(result.ok)
print(result.exit_code)
print(result.report)

compile_result = crawler.compile_url("https://example.com/products", goal="collect products")
print(compile_result.report["command_type"])
```

For tests or embedded usage, inject a fake fetcher:
```python
crawler = AICrawler(fetcher=my_fake_fetcher)
```
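The fetcher contract is not spelled out in this README, so the following is a hypothetical sketch only: it assumes a callable that maps a request to a canned response mapping, and every name it introduces (`FakeFetcher`, `status_code`, `body`) is illustrative rather than the library's actual interface.

```python
from ai_crawler import AICrawler


# Hypothetical sketch: the real fetcher contract may differ.
class FakeFetcher:
    """Returns canned responses keyed by URL, for offline tests."""

    def __init__(self, responses):
        # responses: dict mapping URL -> (status_code, body) pairs
        self.responses = responses

    def __call__(self, method, url, **kwargs):
        status_code, body = self.responses.get(url, (404, ""))
        return {"status_code": status_code, "body": body}


my_fake_fetcher = FakeFetcher({
    "https://example.com/api/products?page=1": (200, '{"items": []}'),
})
crawler = AICrawler(fetcher=my_fake_fetcher)
```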
Fast local lint/type checks while iterating:

```
bash scripts/check-python.sh
```

Full project verification:

```
bash scripts/verify-ai-harness.sh
```

MCP auto_compile fixture smoke test:

```
uv run --extra http python scripts/smoke-mcp-auto-compile.py
```

This starts a local fixture HTTP site and verifies generate -> test -> repair -> retest without external internet, a real browser, or a real LLM.
ai-crawler is intended for authorized crawling, internal QA/testing, research, owned or allowed web property monitoring, and data portability workflows.
It does not implement:
- CAPTCHA solving
- MFA bypass
- Cloudflare/bot-challenge bypass
- stealth fingerprint manipulation
- evasion-oriented proxy rotation
Challenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.
Sensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.
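As an illustration only (the real placeholder text and field coverage may differ), a redacted diagnostic sample could look like:

```json
{
  "request_headers": {
    "Authorization": "[REDACTED]",
    "Cookie": "[REDACTED]"
  },
  "response_sample": {"api_key": "[REDACTED]", "items_count": 20}
}
```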
Development docs live under `.dev/`:

- `.dev/README.md`
- `.dev/03-ai/auto-harness-contract.md`
- `.dev/04-mcp/server.md`
- `.dev/08-operations/security-and-compliance.md`
- `.dev/08-operations/challenge-handling-policy.md`
Alpha. The deterministic recipe compiler, one-command compile flow, browser probe, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally optional/future layers behind adapter boundaries.
MIT