diff --git a/README.md b/README.md index d1ee74c..4006112 100644 --- a/README.md +++ b/README.md @@ -1,135 +1,389 @@ # Evaliphy (Beta)
- Test Your AI Features Like The Rest Of Your Product + AI Evaluation Framework — Assertions for LLM-as-Judge
--- -Evaliphy is a end-to-end testing solution for evaluating AI aplications. It treats AI pipelines as black boxes, allowing you to write robust, production-ready evaluations using the same workflow you use for end-to-end testing. +Evaliphy is an AI evaluation framework that treats your AI system as a black box. Write assertions against your real API, get structured results, and catch regressions in CI — without touching your pipeline internals or writing prompt engineering from scratch. -If you can write a Playwright or Vitest test, you can evaluate AI. +Built-in LLM-as-Judge assertions handle the hard parts. You focus on writing evaluations, not wiring up models. -[Documentation](https://evaliphy.com) + -## ✨ Key Features +--- + +## Prerequisites -- **Playwright-Style API**: Fluent, chainable assertions that feel natural to QA engineers. -- **Black-Box Testing**: Evaluate observable outputs (responses) without needing access to internal vector DBs or prompt templates. -- **LLM-as-a-Judge**: Built-in, production-grade evaluators for Faithfulness, Relevance, Groundedness, and more. -- **CI/CD Ready**: Runs in your existing pipelines and produces structured reports your whole team can read. -- **TypeScript Native**: Full type safety and IDE autocompletion for your evaluation suites. +- Node JS 24.0.0 or higher +- An OpenAI API key or any OpenAI-compatible provider +- A running AI application with an HTTP endpoint + +--- -## 🚀 Quick Start +## Quick start -### 1. Install & Initialize +### 1. Install and initialise ```bash -npm install -g evaliphy -npx evaliphy init my-eval-project +npm install -g @evaliphy/sdk +evaliphy init my-eval-project cd my-eval-project npm install ``` -### 2. Write Your First Eval +### 2. Set your environment variables -Create a file like `chat.eval.ts`: - -```typescript -import { evaluate, expect } from 'evaliphy'; - -evaluate("Customer Support Bot", async ({ httpClient }) => { - const query = "What is the return policy?"; - - // 1. 
Hit your real RAG endpoint - const res = await httpClient.post('/api/chat', { message: query }); - const { answer, context } = await res.json(); - - // 2. Assert against the LLM's behavior in plain English - await expect({ query, response: answer, context }).toBeFaithful(); - await expect({ query, response: answer, context }).toBeRelevant({threshold: 0.9}); -}); +```bash +cp .env.example .env ``` -### 3. Run the Evals +Add your API key to `.env`: -```bash -npx evaliphy eval +``` +OPENAI_API_KEY=your-api-key-here ``` -## ⚙️ Configuration +### 3. Configure Evaliphy -Evaliphy is configured via an `evaliphy.config.ts` file in your project root. Use the `defineConfig` helper for full TypeScript support and autocompletion. +Open `evaliphy.config.ts` and point it at your AI application: ```typescript -import { defineConfig } from 'evaliphy'; +import { defineConfig } from "@evaliphy/sdk"; export default defineConfig({ - // 1. Configure your RAG API http: { - baseUrl: 'https://api.your-service.com', + baseUrl: "https://api.your-service.com", + timeout: 10_000, headers: { - 'Authorization': `Bearer ${process.env.API_KEY}` - } + Authorization: `Bearer ${process.env.API_KEY}`, + }, }, - - // 2. Setup the LLM Judge llmAsJudgeConfig: { - model: 'gpt-4o-mini', + model: "gpt-4o-mini", provider: { - type: 'openai', + type: "openai", apiKey: process.env.OPENAI_API_KEY, - } + }, }, + reporters: ["console", "html"], +}); +``` + +### 4. Write your first evaluation + +Create `evals/chat.eval.ts`: - // 3. Reporting - reporters: ['console', 'html'] +```typescript +import { evaluate, expect } from "@evaliphy/sdk"; + +const sample = { + query: "What is the return policy?", + expectedContext: "Items can be returned within 30 days." +}; + +evaluate("Return Policy Chat", async ({ httpClient }) => { + // 1. Hit your RAG endpoint + const res = await httpClient.post('/api/chat', { message: sample.query }); + const data = await res.json(); + + // 2. 
Assert in plain English + await expect({ + query: sample.query, + response: data.answer, + context: sample.expectedContext + }).toBeFaithful(); + + await expect({ + query: sample.query, + response: data.answer, + context: sample.expectedContext + }).toBeRelevant({threshold:0.7}); }); ``` -## 🧠 Why Evaliphy? +### 5. Run your evaluations + +```bash +evaliphy eval +``` + +--- + +## Assertions + +### LLM assertions + +Scored 0.0 to 1.0 by a configurable judge model. Pass if the score meets or exceeds the threshold. + +| Assertion | What it checks | +| ---------------- | --------------------------------------------- | +| `toBeFaithful()` | Response is grounded in the retrieved context | +| `toBeRelevant()` | Response addresses the query | +| `toBeGrounded()` | Claims are supported by source documents | +| `toBeCoherent()` | Response is logically consistent | +| `toBeHarmless()` | Response contains no harmful or toxic content | + +All LLM assertions accept an optional config object: + +```typescript +await expect({ query, response, context }).toBeFaithful({ + threshold: 0.9, // override global threshold for this assertion +}); +``` + +### Deterministic assertions + +Coming in v1. Fast, free, no LLM call required. 
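The pass rule described for LLM assertions (a score from 0.0 to 1.0 that must meet or exceed the threshold, with a per-assertion `threshold` overriding the global one) can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not Evaliphy's actual implementation: `AssertionOptions`, `resolveThreshold`, and `passes` are hypothetical names, and the 0.7 fallback is taken from the default documented for `llmAsJudgeConfig.threshold`.

```typescript
// Hypothetical helpers for illustration only; they are NOT part of the Evaliphy API.

// Fallback taken from the documented default for llmAsJudgeConfig.threshold.
const DEFAULT_THRESHOLD = 0.7;

interface AssertionOptions {
  // Per-assertion override, e.g. toBeFaithful({ threshold: 0.9 }).
  threshold?: number;
}

// The per-assertion threshold wins, then the global one, then the default.
function resolveThreshold(
  options: AssertionOptions = {},
  globalThreshold?: number,
): number {
  return options.threshold ?? globalThreshold ?? DEFAULT_THRESHOLD;
}

// A judge score passes when it meets or exceeds the threshold.
function passes(score: number, threshold: number): boolean {
  if (score < 0 || score > 1) {
    throw new RangeError(`score must be between 0.0 and 1.0, got ${score}`);
  }
  return score >= threshold;
}
```

With these helpers, `passes(0.85, resolveThreshold({ threshold: 0.9 }, 0.7))` is `false`, while the same score passes against the global 0.7. How Evaliphy itself treats out-of-range scores is not stated in the README; throwing here is an assumption of the sketch.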
+ +--- + +## Configuration reference + +| Field | Type | Default | Description | +| ----------------------------- | ------ | ------------- | ------------------------------- | +| `http.baseUrl` | string | — | Base URL of your AI application | +| `http.timeout` | number | `10000` | Request timeout in ms | +| `http.headers` | object | `{}` | Headers sent with every request | +| `llmAsJudgeConfig.model` | string | `gpt-4o-mini` | Judge model | +| `llmAsJudgeConfig.threshold` | number | `0.7` | Global pass threshold | +| `llmAsJudgeConfig.promptsDir` | string | — | Path to custom prompt directory | +| `reporters` | array | `['console']` | Output formats | + +--- + +## Supported LLM Providers + +Evaliphy uses the [Vercel AI SDK](https://sdk.vercel.ai) under the hood, which means it supports a wide range of LLM providers out of the box. Configure your provider once in `evaliphy.config.ts` and Evaliphy handles the rest. + +| Provider | Type key | Required field | +|---|---|---| +| OpenAI | `openai` | `apiKey` | +| Anthropic | `anthropic` | `apiKey` | +| Azure OpenAI | `azure` | `apiKey`, `resourceName` | +| Google Gemini | `google` | `apiKey` | +| Mistral | `mistral` | `apiKey` | +| OpenAI-compatible gateway | `gateway` | `apiKey`, `url` | + +### OpenAI + +```typescript +llmAsJudgeConfig: { + model: 'gpt-4o-mini', + provider: { + type: 'openai', + apiKey: process.env.OPENAI_API_KEY, + } +} +``` + +### Anthropic + +```typescript +llmAsJudgeConfig: { + model: 'claude-3-5-haiku-20241022', + provider: { + type: 'anthropic', + apiKey: process.env.ANTHROPIC_API_KEY, + } +} +``` + +### OpenAI-compatible gateway (OpenRouter, LiteLLM, etc.) 
+ +```typescript +llmAsJudgeConfig: { + model: 'gpt-4o-mini', + provider: { + type: 'gateway', + url: 'https://openrouter.ai/api/v1', + apiKey: process.env.OPENROUTER_API_KEY, + } +} +``` + +### Azure OpenAI + +```typescript +llmAsJudgeConfig: { + model: 'gpt-4o-mini', + provider: { + type: 'azure', + resourceName: process.env.AZURE_RESOURCE_NAME, + apiKey: process.env.AZURE_API_KEY, + } +} +``` + +Any provider supported by the Vercel AI SDK can be used with Evaliphy. See the [Vercel AI SDK provider documentation](https://sdk.vercel.ai/providers/ai-sdk-providers) for the full list. + +--- + +## Custom prompts + +Evaliphy ships with built-in prompts for every assertion. Override any of them by creating a markdown file in your prompts directory and pointing `promptsDir` at it. + +``` +my-eval-project/ + prompts/ + faithfulness.md ← overrides built-in faithfulness prompt +``` + +```typescript +llmAsJudgeConfig: { + promptsDir: "./prompts", +} +``` + +Each prompt file uses frontmatter to declare its input variables: + +```markdown +--- +name: faithfulness +input_variables: + - question + - context + - response +--- + +You are evaluating a RAG system for a UK e-commerce company. +Faithfulness means every claim traces back to the retrieved context. + +## Question + +{{question}} + +## Context + +{{context}} + +## Response + +{{response}} +``` + +See the [custom prompts guide](https://evaliphy.com/docs/llm-as-judge#using-custom-prompts) for full documentation. -### It fits where your tests already live. -Eval files sit in your repo alongside your other tests. No Python notebooks, no complex ML metrics, and no brittle manual testing. +--- -### You test your real API. -Evaliphy makes HTTP calls to your actual running service. If your RAG system breaks in production, Evaliphy catches it the same way your E2E tests catch a broken UI. +## CI integration -### The judges are built-in. 
-Faithfulness, relevance, groundedness — the assertions that matter are shipped with the framework. No prompt writing or LLM wiring required. +Evaliphy exits with a non-zero code when any assertion fails, making it compatible with any CI pipeline. -## 🛠 How it Works +### GitHub Actions -Evaliphy uses an **LLM-as-a-Judge** workflow to provide objective, repeatable scores for subjective AI outputs. +```yaml +name: Evaliphy -1. **Data Submission**: Your query, response, and context are sent to a high-capability judge model. -2. **Scoring**: The judge evaluates the input against a specialized rubric (0.0 - 1.0). -3. **Thresholding**: If the score meets your threshold (default 0.7), the test passes. +on: [push, pull_request] - +jobs: + eval: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-node@v4 + with: + node-version: 24 -## 🤝 Join the Beta + - run: npm ci + - run: evaliphy eval + env: + OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} + API_KEY: ${{ secrets.API_KEY }} +``` + +--- -We are currently in open beta and looking for feedback from QA teams and engineers building RAG applications. +## Reporters -- ✅ **Free** for commercial use during Beta. -- ✅ **Influence** the v1.0 roadmap. -- ✅ **Contribute** to our growing library of matchers. +| Reporter | Output | Description | +| --------- | ------------ | --------------------------------------------- | +| `console` | Terminal | Streams results as tests run | +| `json` | `.json` file | Machine-readable, good for CI pipelines | +| `html` | `.html` file | Self-contained visual report | +| `csv` | `.csv` file | Coming Soon | +| `xlsx` | `.xlsx` file | Coming Soon | -[Documentation](https://evaliphy.com) | [GitHub](https://github.com/evaliphy/evaliphy) | [Submit Feedback](https://forms.gle/9ztrqUCXUg2YGSJJA) +Configure reporters through the `reporters` array in `evaliphy.config.ts`. -## 🚀 Built by the Community +--- + +## How it works + +1. Your eval file makes an HTTP call to your real running API +2. 
The response and context are passed to the assertion +3. The assertion sends a rendered prompt to the judge model +4. The judge scores the response 0.0 to 1.0 +5. The score is compared against the threshold — pass or fail +6. Results are written to all configured reporters + +--- + +## Why Evaliphy + +**It fits where your tests already live.** Eval files are TypeScript files that sit in your repo alongside your other tests. No Python notebooks, no complex setup, no new workflow to learn. + +**You test your real API.** Evaliphy makes HTTP calls to your actual running service — not a mocked response or an offline dataset. If your AI system breaks in production, Evaliphy catches it. + +**The judges are built in.** Faithfulness, relevance, groundedness — the assertions that matter are shipped with the framework. No prompt writing or LLM wiring required. + +**Configurable when you need it.** Sensible defaults out of the box. Override the judge model globally, per file, or per assertion. Bring your own prompts for domain-specific evaluation. + +--- + +## Project structure + +After running `evaliphy init`, your project looks like this: + +``` +my-eval-project/ + evals/ + example.eval.ts — sample evaluation to get you started + prompts/ — optional custom prompt overrides + evaliphy.config.ts — main configuration file + .env.example — environment variable template + package.json + tsconfig.json +``` + +--- + +## Beta + +Evaliphy is in open beta. The API may change between versions. We are looking for feedback from engineers and teams building AI applications. + +- Free for commercial use during beta +- Influence the v1.0 roadmap directly +- Contribute to the growing assertion library + +[Submit feedback](https://forms.gle/9ztrqUCXUg2YGSJJA) + +--- + +## Contributing + +Contributions are welcome. Please read the [contributing guide](./CONTRIBUTING.md) before opening a pull request. 
+ +--- + +## Built by the community
+ Evaliphy is an AI evaluation framework that treats your AI system + as a black box. Write assertions against your real API, get + structured results, and catch regressions in CI, without touching + the internals of your AI system or writing prompts from + scratch. +
++ Built-in LLM-as-Judge assertions handle the hard parts. You focus + on writing evaluations, not wiring up models. +
+
- Forget {"\""}Contextual Precision{"\""} and {"\""}Cosine Similarity.{"\""} Assert
- against what actually matters:
+ Forget {'"'}Contextual Precision{'"'} and {'"'}Cosine
+ Similarity.{'"'} Assert against what actually matters:
toBeFaithful()
@@ -228,8 +273,8 @@ export default async function Home() {
We spent hundreds of hours benchmarking LLM-as-a-judge prompts - so you don{"'"}t have to. Just provide your API key, and Evaliphy - handles the prompting, parsing, and retry logic. + so you don{"'"}t have to. Just provide your API key, and + Evaliphy handles the prompting, parsing, and retry logic.
- Evaliphy is the only evaluation framework that treats RAG as a black box. + Evaliphy is the only evaluation framework that treats RAG as a black + box.
| Feature | Evaliphy | DeepEval / Ragas |
|---|---|---|
| Primary Audience | QA & Software Engineers | Data Scientists |
| Language | TypeScript / Node.js | Python |
| Testing Style | Black-box (API-driven) | White-box (Pipeline-driven) |
| Integration | CI/CD Ready (npx) | Notebooks / Python Scripts |