
Releases: beyhangl/evalcraft

v0.2.1 — scrub unregistered evalcraft.dev domain

30 May 10:05
034fe31


Patch release: removes all references to the unregistered evalcraft.dev domain.

The evalcraft.dev domain was never registered, yet the cloud feature defaulted to a non-existent api.evalcraft.dev endpoint and the landing page linked to dead @evalcraft.dev email addresses. There is no public hosted service. The cloud client now requires an explicit self-hosted dashboard URL (base_url=, EVALCRAFT_BASE_URL, or ~/.evalcraft/config.json) and raises a clear self-host error when none is configured, instead of failing against a dead host.
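As a rough illustration of the lookup described above, the sketch below resolves a dashboard URL from the three sources the notes name. It is not the library's code: the function name `resolve_base_url`, the `SelfHostRequiredError` class, the precedence order (argument, then environment, then file), and the `"base_url"` key inside config.json are all assumptions made for the example.

```python
import json
import os
from pathlib import Path
from typing import Optional


class SelfHostRequiredError(RuntimeError):
    """Hypothetical error type; the real exception class is not named in these notes."""


def resolve_base_url(base_url: Optional[str] = None,
                     config_path: Optional[Path] = None) -> str:
    """Return a dashboard URL or raise; precedence order is an assumption."""
    # 1. An explicit argument wins.
    if base_url:
        return base_url.rstrip("/")

    # 2. Then the environment variable named in the release notes.
    env_url = os.environ.get("EVALCRAFT_BASE_URL")
    if env_url:
        return env_url.rstrip("/")

    # 3. Then the config file. The "base_url" key is assumed, not documented here.
    path = config_path or Path.home() / ".evalcraft" / "config.json"
    if path.is_file():
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = {}
        file_url = data.get("base_url") if isinstance(data, dict) else None
        if file_url:
            return str(file_url).rstrip("/")

    # Nothing configured: fail early with an actionable message
    # rather than sending requests to a host that does not exist.
    raise SelfHostRequiredError(
        "No dashboard URL configured. Pass base_url=, set EVALCRAFT_BASE_URL, "
        "or add a URL to ~/.evalcraft/config.json."
    )
```

Failing at configuration time keeps the error local and readable, whereas a default pointing at an unregistered domain would only surface later as a DNS or connection failure.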

Core capture, replay, and eval run fully offline and are unaffected.

pip install --upgrade evalcraft

Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md

v0.2.0 — VCR for AI agents

30 May 09:48
ca0b585


VCR for AI agents — record an agent run once, replay it deterministically in CI for $0.

This release ships the full backlog that accumulated since 0.1.0: the complete evaluation suite, new drift-catching + determinism tooling, an honest positioning re-scope, several bug fixes, and a clean lint/type pass. Backward-compatible with 0.1.0 cassettes.

pip install --upgrade evalcraft

Highlights

  • Live-eval mode: run scorers against the real model over a golden input set and gate CI on score regressions (run_live_eval / compare_to_baseline / evalcraft live-eval). This is the layer that catches model, prompt, and retrieval drift, which replay cannot.
  • Full eval suite: LLM-as-Judge, RAG metrics, pairwise A/B, multi-judge jury, statistical eval (Wilson CIs), hallucination detection.
  • Cassette provenance: model set, prompt hash, SDK and Python versions, and record time (for reasoning about staleness), all surfaced in evalcraft info.
  • Opt-in judge cache: record and replay LLM-judge responses for deterministic, $0 judge scoring in CI.
  • More adapters: Gemini and Pydantic AI (Python); Gemini and Vercel AI (JS).
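For readers unfamiliar with the Wilson confidence intervals mentioned under statistical eval, here is a minimal standalone sketch of the standard Wilson score interval for a pass rate. The function name `wilson_interval` is chosen for this example and is not taken from evalcraft; how the library computes or exposes its intervals is not stated in these notes.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # Centre is the observed rate shrunk toward 0.5.
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(8, 10)
    print(f"8/10 passes: {low:.3f} to {high:.3f}")
```

Unlike the naive normal approximation, the Wilson interval stays within [0, 1] and behaves sensibly for small golden sets, which is why it suits pass rates measured over a handful of eval cases.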

Fixed

  • LangGraph adapter: two NameErrors in the LLM/chain end-callbacks.
  • NetworkGuard: crash on Python 3.9/3.10 caused by a hard-coded all_errors kwarg, which exists only on Python 3.11+.
  • De-flaked the JS fingerprint-determinism test.
  • Repointed the dead evalcraft.dev docs link to GitHub Pages.
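The NetworkGuard fix above is an instance of a common pattern: pass a keyword argument only on interpreter versions that accept it. The sketch below assumes the kwarg in question is the all_errors parameter of the standard library's socket.create_connection, which was added in Python 3.11. The helper names are hypothetical and this is not the project's actual patch.

```python
import socket
import sys
from typing import Any, Dict, Optional, Tuple


def create_connection_kwargs(timeout: Optional[float] = None,
                             collect_all_errors: bool = False) -> Dict[str, Any]:
    """Build kwargs for socket.create_connection that are valid on this interpreter."""
    kwargs: Dict[str, Any] = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    # all_errors exists only on Python 3.11+; passing it on 3.9 or 3.10
    # raises TypeError for an unexpected keyword argument.
    if sys.version_info >= (3, 11):
        kwargs["all_errors"] = collect_all_errors
    return kwargs


def connect(address: Tuple[str, int], timeout: Optional[float] = None) -> socket.socket:
    """Open a TCP connection using only the kwargs this Python version supports."""
    return socket.create_connection(address, **create_connection_kwargs(timeout))
```

Building the kwargs in a separate function keeps the version check testable without opening a socket.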

Internal

  • ruff findings reduced from 325 to 0; mypy now runs cleanly across the package; 803 Python tests and 145 JS tests passing.

📖 Docs: https://beyhangl.github.io/evalcraft/ · 📝 Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md

v0.1.0 — The pytest for AI agents

06 Mar 08:01



Capture, replay, mock, and evaluate AI agent behavior with deterministic, fast, cost-free tests.

Get Started

pip install evalcraft
evalcraft init



✨ Core SDK (Phase 1)

Cassette-Based Recording

  • Capture & replay agent interactions — like VCR for HTTP, but for AI agents
  • Deterministic, fast, cost-free test execution against recorded cassettes
  • Golden-set management for baseline comparisons
  • Regression detection with configurable thresholds
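To make the record-once, replay-later idea concrete, here is a deliberately small sketch of a cassette. The `Cassette` class, its `call` and `save` methods, and the JSON layout are hypothetical stand-ins written for this example; they are not evalcraft's API or file format.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List


class Cassette:
    """Toy cassette: records call outputs to JSON, then replays them in order."""

    def __init__(self, path: Path, mode: str) -> None:
        if mode not in ("record", "replay"):
            raise ValueError("mode must be 'record' or 'replay'")
        self.path = path
        self.mode = mode
        self._cursor = 0
        self._interactions: List[Dict[str, Any]] = []
        if mode == "replay":
            self._interactions = json.loads(path.read_text(encoding="utf-8"))

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        """Run fn in record mode; return the stored output in replay mode."""
        if self.mode == "record":
            output = fn(**kwargs)  # the only place a real (paid) call happens
            self._interactions.append({"name": name, "input": kwargs, "output": output})
            return output

        # Replay: never invoke fn, and insist the call matches what was recorded.
        if self._cursor >= len(self._interactions):
            raise LookupError(f"no recorded interaction left for {name!r}")
        entry = self._interactions[self._cursor]
        if entry["name"] != name or entry["input"] != kwargs:
            raise LookupError(f"call to {name!r} does not match the recording")
        self._cursor += 1
        return entry["output"]

    def save(self) -> None:
        """Write recorded interactions to disk (inputs and outputs must be JSON-serializable)."""
        self.path.write_text(json.dumps(self._interactions, indent=2), encoding="utf-8")
```

Because replay returns stored outputs and raises on any mismatch, a test run against the cassette is deterministic and makes no model calls, which is what allows it to run in CI at no cost.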

8 Built-in Scorers & Assertions

  • Semantic similarity, exact match, JSON schema validation, latency, cost, token count, custom scoring functions, and composite assertions

6 Framework Adapters

  • OpenAI · Anthropic · LangGraph · CrewAI · AutoGen · LlamaIndex

pytest Plugin

  • 6 fixtures for seamless test integration
  • 3 markers for test categorization and filtering
  • 2 CLI flags for record/replay mode control

CLI with 12+ Commands

  • evalcraft replay · evalcraft diff · evalcraft inspect · evalcraft run · evalcraft init, and more

Comprehensive Test Suite

  • 390 Python tests passing

📚 Launch Prep (Phase 2)

Documentation

  • MkDocs site with 14 pages covering quickstart, adapters, scorers, CI integration, and advanced usage

CI/CD Integration

  • GitHub Actions reusable action with automatic PR comments
  • CI gate for regression detection in pull requests

Example Projects

  • 4 complete example projects with pre-recorded cassettes demonstrating real-world usage patterns

🚀 SaaS & Scale (Phase 3)

SaaS Backend

  • FastAPI backend with JWT + API key authentication
  • Multi-tenant architecture with team and project support

React Dashboard

  • 8-page dashboard with dark theme
  • Recharts-powered analytics and visualizations
  • Cassette browsing, golden-set management, regression tracking

TypeScript SDK

  • Full-featured TypeScript client (145 tests passing)
  • Vercel AI SDK adapter for Next.js integration

Alert Integrations

  • Slack · Email · Webhook notifications
  • Configurable alert rules and thresholds (33 tests)

Cloud Upload Client

  • Automatic cassette upload to SaaS backend
  • Offline queue with retry logic (23 tests)

🔒 Trust & Security

  • Input sanitization — automatic PII scrubbing from recorded cassettes
  • Network blocking — prevent real API calls during replay mode
  • evalcraft init — scaffolding with secure defaults
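As an illustration of what scrubbing PII from a recorded payload can look like, the sketch below redacts email addresses from nested data before it would be written to disk. The `scrub` function, the regex, and the placeholder string are assumptions for this example; the notes do not say which PII categories or patterns evalcraft actually handles.

```python
import re
from typing import Any

# Simple email pattern for illustration; real-world PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(value: Any) -> Any:
    """Return a copy of value with email addresses in any nested string redacted."""
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    # Numbers, booleans, and None pass through unchanged.
    return value


if __name__ == "__main__":
    payload = {"messages": ["mail alice@example.com now"], "n": 3}
    print(scrub(payload))
```

Scrubbing at record time, rather than at read time, means the sensitive value never reaches the cassette file that gets committed to the repository.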

Full Changelog

https://github.com/beyhangl/evalcraft/commits/v0.1.0