30 May 10:05

beyhangl

034fe31

v0.2.1 — scrub unregistered evalcraft.dev domain Latest


Patch release: removes all references to the unregistered evalcraft.dev domain.

The evalcraft.dev domain was never registered, yet the cloud feature defaulted to a non-existent api.evalcraft.dev endpoint and the landing page linked to dead @evalcraft.dev email addresses. There is no public hosted service. The cloud client now requires an explicit self-hosted dashboard URL (base_url=, EVALCRAFT_BASE_URL, or ~/.evalcraft/config.json) and raises a clear self-host error when none is configured, instead of failing against a dead host.
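The lookup described above can be pictured with a minimal sketch. This is not evalcraft's actual client code: the function name `resolve_base_url`, the error class `SelfHostRequiredError`, the `base_url` key inside config.json, and the precedence order (argument, then environment variable, then config file) are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


class SelfHostRequiredError(RuntimeError):
    """Hypothetical error for a cloud client with no dashboard URL configured."""


def resolve_base_url(
    base_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> str:
    """Return the dashboard URL, or raise instead of guessing a default host."""
    env = os.environ if env is None else env
    config_path = config_path or Path.home() / ".evalcraft" / "config.json"

    # 1. An explicit argument wins (assumed precedence).
    if base_url:
        return base_url.rstrip("/")

    # 2. Then the environment variable named in the release notes.
    from_env = env.get("EVALCRAFT_BASE_URL")
    if from_env:
        return from_env.rstrip("/")

    # 3. Then the config file; the "base_url" key is an assumption.
    if config_path.is_file():
        try:
            data = json.loads(config_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            data = {}
        from_file = data.get("base_url") if isinstance(data, dict) else None
        if from_file:
            return str(from_file).rstrip("/")

    # No fallback host: fail early with an actionable message.
    raise SelfHostRequiredError(
        "No dashboard URL configured. Pass base_url=, set EVALCRAFT_BASE_URL, "
        "or add it to ~/.evalcraft/config.json to point at a self-hosted dashboard."
    )
```

Failing at configuration time, rather than at the first upload, keeps the error message close to its cause.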

Core capture / replay / eval is unaffected (fully offline).

pip install --upgrade evalcraft

Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md


30 May 09:48

beyhangl

v0.2.0

ca0b585

v0.2.0 — VCR for AI agents

VCR for AI agents — record an agent run once, replay it deterministically in CI for $0.

This release ships the full backlog that accumulated since 0.1.0: the complete evaluation suite, new drift-catching + determinism tooling, an honest positioning re-scope, several bug fixes, and a clean lint/type pass. Backward-compatible with 0.1.0 cassettes.

pip install --upgrade evalcraft

Highlights

Live-eval mode — run scorers against the real model over a golden input set and gate CI on score regressions (run_live_eval / compare_to_baseline / evalcraft live-eval). The layer that catches model / prompt / retrieval drift, which replay can't.
Full eval suite — LLM-as-Judge, RAG metrics, pairwise A/B, multi-judge jury, statistical eval (Wilson CIs), hallucination detection.
Cassette provenance — model set, prompt hash, SDK/Python versions, record time (for staleness reasoning); surfaced in evalcraft info.
Opt-in judge cache — record/replay LLM-judge responses for deterministic, $0 judge scoring in CI.
More adapters — Gemini, Pydantic AI (Python); Gemini + Vercel AI (JS).
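For readers unfamiliar with the Wilson confidence intervals mentioned under statistical eval, here is a minimal sketch of the standard Wilson score interval for a pass rate. The function name `wilson_interval` is hypothetical and this is not evalcraft's implementation; it only shows the textbook formula.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")

    p = passes / total
    z2 = z * z
    denom = 1 + z2 / total
    # The centre is pulled toward 0.5, which matters for small samples.
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With 8 passes out of 10 runs the interval is roughly 0.49 to 0.94, which shows how little a small golden set can say about the true pass rate.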

Fixed

LangGraph adapter: two NameErrors in the LLM/chain end-callbacks.
NetworkGuard: Python 3.9/3.10 crash from a hard-coded all_errors kwarg (3.11-only).
De-flaked the JS fingerprint-determinism test.
Repointed the dead evalcraft.dev docs link to GitHub Pages.
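The NetworkGuard fix concerns a keyword that older interpreters reject: `socket.create_connection` only accepts `all_errors` from Python 3.11 onward. A minimal sketch of one way to guard such a call follows. The helper names are hypothetical and this is not necessarily how evalcraft fixed it.

```python
import socket
import sys
from typing import Any, Dict, Optional, Tuple


def create_connection_kwargs(
    all_errors: bool = False,
    version_info: Optional[Tuple[int, ...]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments that are safe for the running interpreter."""
    version = sys.version_info if version_info is None else version_info
    kwargs: Dict[str, Any] = {}
    # all_errors was added to socket.create_connection in Python 3.11;
    # passing it on 3.9 or 3.10 raises TypeError.
    if version >= (3, 11):
        kwargs["all_errors"] = all_errors
    return kwargs


def open_connection(address: Tuple[str, int], timeout: Optional[float] = None) -> socket.socket:
    """Hypothetical wrapper that forwards only the supported keywords."""
    return socket.create_connection(address, timeout, **create_connection_kwargs())
```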

Internal

ruff findings reduced from 325 to 0; mypy now runs cleanly across the package; 803 Python tests and 145 JS tests pass.

📖 Docs: https://beyhangl.github.io/evalcraft/ · 📝 Full changelog: https://github.com/beyhangl/evalcraft/blob/main/CHANGELOG.md


06 Mar 08:01

beyhangl

v0.1.0

d342d72

v0.1.0 — The pytest for AI agents

evalcraft v0.1.0 — The pytest for AI agents

Capture, replay, mock, and evaluate AI agent behavior with deterministic, fast, cost-free tests.

Get Started

pip install evalcraft
evalcraft init

📖 Documentation · 🐙 GitHub · 💬 Discussions

✨ Core SDK (Phase 1)

Cassette-Based Recording

Capture & replay agent interactions — like VCR for HTTP, but for AI agents
Deterministic, fast, cost-free test execution against recorded cassettes
Golden-set management for baseline comparisons
Regression detection with configurable thresholds
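The record-once, replay-many idea behind cassettes can be illustrated with a minimal sketch. The `Cassette` class below is hypothetical and deliberately tiny; it is not evalcraft's cassette format or API, only an illustration of keying recorded results by call name and arguments.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Optional


def _key(name: str, kwargs: Dict[str, Any]) -> str:
    """Stable key for a call: hash of its name plus sorted JSON arguments."""
    payload = json.dumps({"name": name, "kwargs": kwargs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Cassette:
    """Hypothetical in-memory cassette holding JSON-serializable results."""

    def __init__(self, recordings: Optional[Dict[str, Any]] = None) -> None:
        self.recordings: Dict[str, Any] = dict(recordings or {})

    def record(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        """Call the real function once and store its result."""
        result = fn(**kwargs)
        self.recordings[_key(name, kwargs)] = result
        return result

    def replay(self, name: str, **kwargs: Any) -> Any:
        """Return the stored result without calling anything."""
        key = _key(name, kwargs)
        if key not in self.recordings:
            raise KeyError(f"no recording for {name} with these arguments")
        return self.recordings[key]

    def dumps(self) -> str:
        return json.dumps(self.recordings, sort_keys=True, indent=2)

    @classmethod
    def loads(cls, text: str) -> "Cassette":
        return cls(json.loads(text))
```

Raising on a missing recording, rather than falling back to a live call, is what keeps replayed tests deterministic and free.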

8 Built-in Scorers & Assertions

Semantic similarity, exact match, JSON schema validation, latency, cost, token count, custom scoring functions, and composite assertions

6 Framework Adapters

OpenAI · Anthropic · LangGraph · CrewAI · AutoGen · LlamaIndex

pytest Plugin

6 fixtures for seamless test integration
3 markers for test categorization and filtering
2 CLI flags for record/replay mode control

CLI with 12+ Commands

evalcraft replay · evalcraft diff · evalcraft inspect · evalcraft run · evalcraft init and more

Comprehensive Test Suite

390 Python tests passing

📚 Launch Prep (Phase 2)

Documentation

MkDocs site with 14 pages covering quickstart, adapters, scorers, CI integration, and advanced usage

CI/CD Integration

GitHub Actions reusable action with automatic PR comments
CI gate for regression detection in pull requests
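A regression gate of this kind usually reduces to comparing current scores against a stored baseline with some tolerance. The sketch below assumes scores are plain floats keyed by scorer name; `find_regressions` and the default `tolerance` are hypothetical and do not reflect the actual GitHub Action.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List scorers whose score dropped by more than `tolerance`."""
    regressions: List[str] = []
    for name, base_score in sorted(baseline.items()):
        new_score = current.get(name)
        if new_score is None:
            # A scorer that vanished is treated as a failure, not a pass.
            regressions.append(f"{name}: missing from current run")
        elif new_score < base_score - tolerance:
            regressions.append(f"{name}: {base_score:.3f} -> {new_score:.3f}")
    return regressions
```

A CI job could exit non-zero whenever the returned list is non-empty and post the entries as a pull request comment.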

Example Projects

4 complete example projects with pre-recorded cassettes demonstrating real-world usage patterns

🚀 SaaS & Scale (Phase 3)

SaaS Backend

FastAPI backend with JWT + API key authentication
Multi-tenant architecture with team and project support

React Dashboard

8-page dashboard with dark theme
Recharts-powered analytics and visualizations
Cassette browsing, golden-set management, regression tracking

TypeScript SDK

Full-featured TypeScript client (145 tests passing)
Vercel AI SDK adapter for Next.js integration

Alert Integrations

Slack · Email · Webhook notifications
Configurable alert rules and thresholds (33 tests)

Cloud Upload Client

Automatic cassette upload to SaaS backend
Offline queue with retry logic (23 tests)

🔒 Trust & Security

Input sanitization — automatic PII scrubbing from recorded cassettes
Network blocking — prevent real API calls during replay mode
evalcraft init — scaffolding with secure defaults
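As an illustration of what scrubbing a recorded payload can look like, here is a minimal sketch that redacts email addresses from nested data. The regular expression, the `scrub` function, and the placeholder text are assumptions. Real PII scrubbing covers far more than emails, and this is not evalcraft's sanitizer.

```python
import re
from typing import Any

# Deliberately simple pattern; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub(value: Any, placeholder: str = "[REDACTED_EMAIL]") -> Any:
    """Return a copy of `value` with email addresses replaced in all strings."""
    if isinstance(value, str):
        return EMAIL_RE.sub(placeholder, value)
    if isinstance(value, dict):
        return {key: scrub(item, placeholder) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item, placeholder) for item in value]
    # Numbers, booleans and None pass through unchanged.
    return value
```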

Full Changelog

https://github.com/beyhangl/evalcraft/commits/v0.1.0


Releases: beyhangl/evalcraft
