Releases: beyhangl/evalcraft
v0.2.1 — scrub unregistered evalcraft.dev domain
Patch release: removes all references to the unregistered evalcraft.dev domain.
evalcraft.dev was never registered, but the cloud feature defaulted to a non-existent api.evalcraft.dev endpoint and the landing page linked dead @evalcraft.dev emails. There is no public hosted service — the cloud client now requires an explicit self-hosted dashboard URL (base_url=, EVALCRAFT_BASE_URL, or ~/.evalcraft/config.json) and raises a clear self-host error when unconfigured, instead of failing against a dead host.
Core capture / replay / eval is unaffected (fully offline).
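The lookup described above can be pictured roughly as follows. This is a minimal sketch, not evalcraft's actual client code: the function name `resolve_base_url`, the exception `SelfHostRequiredError`, the `"base_url"` key inside config.json, and the precedence order (argument, then environment variable, then config file) are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path
from typing import Optional


class SelfHostRequiredError(RuntimeError):
    """Hypothetical error type; the real exception name is not given in these notes."""


def resolve_base_url(base_url: Optional[str] = None,
                     config_path: Optional[Path] = None) -> str:
    """Return the first configured dashboard URL, or raise if none is set."""
    # 1. An explicit argument wins (assumed precedence).
    if base_url:
        return base_url.rstrip("/")
    # 2. Environment variable named in the release notes.
    env_url = os.environ.get("EVALCRAFT_BASE_URL")
    if env_url:
        return env_url.rstrip("/")
    # 3. Config file; the "base_url" key is an assumption.
    path = config_path or Path.home() / ".evalcraft" / "config.json"
    if path.is_file():
        file_url = json.loads(path.read_text()).get("base_url")
        if file_url:
            return file_url.rstrip("/")
    # Nothing configured: fail loudly instead of calling a dead default host.
    raise SelfHostRequiredError(
        "No dashboard URL configured. Set base_url=, EVALCRAFT_BASE_URL, "
        "or ~/.evalcraft/config.json to point at a self-hosted dashboard."
    )
```

Failing at resolution time keeps the error local and readable, rather than surfacing later as a DNS or connection failure.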
pip install --upgrade evalcraft
Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md
v0.2.0 — VCR for AI agents
VCR for AI agents — record an agent run once, replay it deterministically in CI for $0.
This release ships the full backlog that accumulated since 0.1.0: the complete evaluation suite, new drift-catching + determinism tooling, an honest positioning re-scope, several bug fixes, and a clean lint/type pass. Backward-compatible with 0.1.0 cassettes.
pip install --upgrade evalcraft
Highlights
- Live-eval mode — run scorers against the real model over a golden input set and gate CI on score regressions (run_live_eval / compare_to_baseline / evalcraft live-eval). The layer that catches model / prompt / retrieval drift, which replay can't.
- Full eval suite — LLM-as-Judge, RAG metrics, pairwise A/B, multi-judge jury, statistical eval (Wilson CIs), hallucination detection.
- Cassette provenance — model set, prompt hash, SDK/Python versions, record time (for staleness reasoning); surfaced in evalcraft info.
- Opt-in judge cache — record/replay LLM-judge responses for deterministic, $0 judge scoring in CI.
- More adapters — Gemini, Pydantic AI (Python); Gemini + Vercel AI (JS).
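For readers unfamiliar with the Wilson CIs mentioned under statistical eval, here is a sketch of the standard Wilson score interval for a pass rate. The function name `wilson_interval` is hypothetical and this is not evalcraft's implementation; it only shows the textbook formula, with z = 1.96 assumed for a roughly 95% interval.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials                      # observed pass rate
    denom = 1 + z * z / trials               # shared normalising term
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

Unlike the naive normal approximation, this interval stays inside [0, 1] and behaves sensibly for small eval sets or pass rates near 0 or 1, which is the usual reason to prefer it when gating on a handful of golden inputs.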
Fixed
- LangGraph adapter: two NameErrors in the LLM/chain end-callbacks.
- NetworkGuard: Python 3.9/3.10 crash from a hard-coded all_errors kwarg (3.11-only).
- De-flaked the JS fingerprint-determinism test.
- Repointed the dead evalcraft.dev docs link to GitHub Pages.
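The NetworkGuard fix is the familiar pattern of gating a keyword argument on the interpreter version. The sketch below assumes the kwarg in question is the all_errors parameter of socket.create_connection, which the standard library added in Python 3.11; the helper names `connection_kwargs` and `create_connection_compat` are hypothetical and not taken from evalcraft.

```python
import socket
import sys
from typing import Any, Dict, Optional, Tuple


def connection_kwargs(all_errors: bool,
                      version_info: Tuple[int, ...] = tuple(sys.version_info)) -> Dict[str, Any]:
    """Build kwargs for socket.create_connection that are valid on this Python."""
    kwargs: Dict[str, Any] = {}
    # all_errors only exists on Python 3.11+; passing it earlier raises TypeError.
    if tuple(version_info[:2]) >= (3, 11):
        kwargs["all_errors"] = all_errors
    return kwargs


def create_connection_compat(address: Tuple[str, int],
                             timeout: Optional[float] = None,
                             all_errors: bool = False) -> socket.socket:
    """Call socket.create_connection without breaking on Python 3.9/3.10."""
    kwargs = connection_kwargs(all_errors)
    if timeout is not None:
        kwargs["timeout"] = timeout
    return socket.create_connection(address, **kwargs)
```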
Internal
- ruff 325 → 0; mypy made runnable + clean across the package; 803 Python tests and 145 JS tests passing.
📖 Docs: https://beyhangl.github.io/evalcraft/ · 📝 Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md
v0.1.0 — The pytest for AI agents
Capture, replay, mock, and evaluate AI agent behavior with deterministic, fast, cost-free tests.
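As a rough illustration of the capture-and-replay idea, the sketch below records model replies to a JSON file and serves them back without calling the model. `MiniCassette` is a hypothetical stand-in written for this example; evalcraft's real cassette format and API are not described in these notes and are certainly richer than this.

```python
import json
from pathlib import Path
from typing import Callable, Dict


class MiniCassette:
    """Hypothetical stand-in for a cassette: maps each prompt to its recorded reply."""

    def __init__(self, path: Path, record: bool) -> None:
        self.path = path
        self.record = record
        # In replay mode, load the previously recorded calls from disk.
        self.calls: Dict[str, str] = {} if record else json.loads(path.read_text())

    def call(self, prompt: str, live_fn: Callable[[str], str]) -> str:
        if self.record:
            # Record mode: hit the real model once and remember the reply.
            self.calls[prompt] = live_fn(prompt)
            return self.calls[prompt]
        # Replay mode: never touch live_fn, so the run is deterministic and free.
        if prompt not in self.calls:
            raise KeyError(f"no recorded reply for prompt: {prompt!r}")
        return self.calls[prompt]

    def save(self) -> None:
        self.path.write_text(json.dumps(self.calls, indent=2))
```

In replay mode an unrecorded prompt raises instead of falling through to the live model, which is what keeps CI runs from silently spending money.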
Get Started
pip install evalcraft
evalcraft init
📖 Documentation · 🐙 GitHub · 💬 Discussions
✨ Core SDK (Phase 1)
Cassette-Based Recording
- Capture & replay agent interactions — like VCR for HTTP, but for AI agents
- Deterministic, fast, cost-free test execution against recorded cassettes
- Golden-set management for baseline comparisons
- Regression detection with configurable thresholds
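Threshold-based regression detection against a golden baseline can be sketched as below. The function name `find_regressions`, the score dictionaries, and the default tolerance of 0.05 are illustrative assumptions, not evalcraft's API or defaults.

```python
from typing import Dict


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.05) -> Dict[str, float]:
    """Return {scorer_name: drop} for every score that fell by more than tolerance."""
    regressions: Dict[str, float] = {}
    for name, base_score in baseline.items():
        # A scorer missing from the current run counts as a score of 0.0.
        drop = base_score - current.get(name, 0.0)
        if drop > tolerance:
            regressions[name] = drop
    return regressions
```

A CI step would typically exit non-zero when the returned dictionary is non-empty, and print the offending scorer names so the failure is actionable.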
8 Built-in Scorers & Assertions
- Semantic similarity, exact match, JSON schema validation, latency, cost, token count, custom scoring functions, and composite assertions
6 Framework Adapters
- OpenAI · Anthropic · LangGraph · CrewAI · AutoGen · LlamaIndex
pytest Plugin
- 6 fixtures for seamless test integration
- 3 markers for test categorization and filtering
- 2 CLI flags for record/replay mode control
CLI with 12+ Commands
- evalcraft replay · evalcraft diff · evalcraft inspect · evalcraft run · evalcraft init and more
Comprehensive Test Suite
- 390 Python tests passing
📚 Launch Prep (Phase 2)
Documentation
- MkDocs site with 14 pages covering quickstart, adapters, scorers, CI integration, and advanced usage
CI/CD Integration
- GitHub Actions reusable action with automatic PR comments
- CI gate for regression detection in pull requests
Example Projects
- 4 complete example projects with pre-recorded cassettes demonstrating real-world usage patterns
🚀 SaaS & Scale (Phase 3)
SaaS Backend
- FastAPI backend with JWT + API key authentication
- Multi-tenant architecture with team and project support
React Dashboard
- 8-page dashboard with dark theme
- Recharts-powered analytics and visualizations
- Cassette browsing, golden-set management, regression tracking
TypeScript SDK
- Full-featured TypeScript client (145 tests passing)
- Vercel AI SDK adapter for Next.js integration
Alert Integrations
- Slack · Email · Webhook notifications
- Configurable alert rules and thresholds (33 tests)
Cloud Upload Client
- Automatic cassette upload to SaaS backend
- Offline queue with retry logic (23 tests)
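One way to picture an offline queue with retry logic is the sketch below. `UploadQueue`, its `max_attempts` parameter, and the choice to treat OSError as "offline" are assumptions for illustration; the notes do not describe how evalcraft's client queues, persists, or backs off.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class UploadQueue:
    """Hypothetical in-memory queue that retries failed uploads on the next flush."""

    def __init__(self, upload: Callable[[str], None], max_attempts: int = 3) -> None:
        self.upload = upload
        self.max_attempts = max_attempts
        self._pending: Deque[Tuple[str, int]] = deque()  # (item, failed attempts)

    def enqueue(self, item: str) -> None:
        self._pending.append((item, 0))

    def __len__(self) -> int:
        return len(self._pending)

    def flush(self) -> int:
        """Try each pending item once; return how many uploads succeeded."""
        sent = 0
        for _ in range(len(self._pending)):
            item, attempts = self._pending.popleft()
            try:
                self.upload(item)
                sent += 1
            except OSError:
                # Requeue until the attempt budget is spent, then drop the item.
                if attempts + 1 < self.max_attempts:
                    self._pending.append((item, attempts + 1))
        return sent
```

A production client would likely persist the queue to disk and add a delay between flushes; both are omitted here to keep the sketch short.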
🔒 Trust & Security
- Input sanitization — automatic PII scrubbing from recorded cassettes
- Network blocking — prevent real API calls during replay mode
- evalcraft init — scaffolding with secure defaults
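PII scrubbing of recorded text is often done with pattern-based redaction. The sketch below is an assumption about the general technique, not evalcraft's sanitizer: the function name `scrub`, the two patterns, and the "sk-" key prefix are all chosen for illustration, and real sanitizers cover many more categories.

```python
import re

# Illustrative patterns only; neither is taken from evalcraft.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY_RE = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # assumes an "sk-" style key


def scrub(text: str) -> str:
    """Replace email addresses and key-like tokens with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return API_KEY_RE.sub("[API_KEY]", text)
```

Using fixed placeholders rather than deleting matches keeps the cassette readable and keeps replayed prompts structurally similar to the originals.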