Complete API documentation for @gleanwork/mcp-server-tester.
- Fixtures
- Authentication
- Eval Functions
- Programmatic Validators
- Playwright Matchers
- Text Utilities
- Judge Functions
- Conformance Functions
Raw MCP SDK client from @modelcontextprotocol/sdk.

test('use raw client', async ({ mcpClient }) => {
  const tools = await mcpClient.listTools();
  const result = await mcpClient.callTool({ name: 'tool_name', arguments: { ... } });
});

High-level test API with helper methods.

export interface MCPFixtureApi {
/**
* The underlying MCP client (for advanced usage)
*/
client: Client;
/**
* Authentication type used for this test session
*/
authType: AuthType;
/**
* Playwright project name for this test session
*/
project?: string;
/**
* Lists all available tools from the MCP server
*
* @returns Array of tool definitions
*/
listTools(): Promise<Array<Tool>>;
/**
* Calls a tool on the MCP server
*
* @param name - Tool name
* @param args - Tool arguments
* @returns Tool call result
*/
callTool<TArgs extends Record<string, unknown> = Record<string, unknown>>(
name: string,
args: TArgs
): Promise<CallToolResult>;
/**
* Gets information about the connected server
*/
getServerInfo(): {
name?: string;
version?: string;
} | null;
}

List all tools available from the MCP server.

Returns: Promise<Array<Tool>>

const tools = await mcp.listTools();
console.log(tools.map((t) => t.name));

Call a tool by name with arguments.

Parameters:
- name: string - Tool name
- args: TArgs - Tool arguments

Returns: Promise<CallToolResult>

const result = await mcp.callTool('get_weather', { city: 'London' });

Get server information (name, version).

Returns: { name?: string; version?: string } | null

const info = mcp.getServerInfo();
console.log(info?.name, info?.version);

Creates an MCPFixtureApi wrapper around a raw MCP Client. Use this when you need manual fixture setup, for example in custom fixture hierarchies, with non-Playwright test runners (Vitest, Jest), or when composing with other lifecycle logic.
For the standard Playwright use case, prefer importing test and mcp from @gleanwork/mcp-server-tester/fixtures/mcp, which wires this up automatically.
Parameters:
- client: Client - MCP client created via createMCPClientForConfig()
- testInfo?: TestInfo - Optional Playwright TestInfo. When provided, operations are wrapped in test.step() and attachments are created for the MCP reporter
- options?: MCPFixtureOptions - Optional configuration

MCPFixtureOptions:

| Field | Type | Default | Description |
|---|---|---|---|
| authType | 'oauth' \| 'api-token' \| 'none' | 'none' | Authentication type for this session |
| project | string | - | Playwright project name (for filtering/grouping in the reporter) |
| callTimeoutMs | number | 30000 | Timeout in milliseconds for MCP operations |

Returns: MCPFixtureApi
import {
createMCPFixture,
createMCPClientForConfig,
closeMCPClient,
} from '@gleanwork/mcp-server-tester';
import { test as base } from '@playwright/test';
import type { MCPFixtureApi } from '@gleanwork/mcp-server-tester';
const test = base.extend<{ mcp: MCPFixtureApi }>({
mcp: async ({}, use, testInfo) => {
const client = await createMCPClientForConfig(config);
const api = createMCPFixture(client, testInfo, { authType: 'api-token' });
await use(api);
await closeMCPClient(client);
},
});
// Non-Playwright usage (no reporter attachments)
const client = await createMCPClientForConfig(config);
const api = createMCPFixture(client);
const tools = await api.listTools();

For comprehensive authentication documentation, see the Authentication Guide.

import {
createTokenAuthHeaders,
validateAccessToken,
isTokenExpired,
isTokenExpiringSoon,
} from '@gleanwork/mcp-server-tester';

createTokenAuthHeaders creates HTTP headers containing an Authorization header.

Parameters:
- accessToken: string - Access token
- tokenType?: string - Token type (default: 'Bearer')

Returns: Record<string, string>
const headers = createTokenAuthHeaders(process.env.MCP_ACCESS_TOKEN);
// { Authorization: 'Bearer eyJ...' }

validateAccessToken validates that an access token is present and non-empty.

Parameters:
- accessToken: string | undefined - Token to validate

Throws: Error if token is missing or empty

isTokenExpired checks whether a JWT appears to be expired.

Parameters:
- accessToken: string - JWT to inspect

Returns: boolean
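
An expiry check on a JWT is usually based on the standard `exp` claim, which holds seconds since the Unix epoch. The sketch below illustrates that idea only. The helper names `jwtLooksExpired` and `makeDemoToken` are hypothetical, and the handling of malformed tokens or a missing `exp` claim is an assumption, not the library's documented behavior.

```typescript
// Hypothetical helper, not the library's implementation: reads the standard
// `exp` claim (seconds since the Unix epoch) from a JWT payload.
function jwtLooksExpired(token: string, nowMs: number = Date.now()): boolean {
  const parts = token.split('.');
  if (parts.length !== 3) return true; // assumption: malformed tokens count as expired
  const payloadPart = parts[1] ?? '';
  // Convert base64url to standard base64 and restore padding.
  const base64 = payloadPart.replace(/-/g, '+').replace(/_/g, '/');
  const padded = base64 + '='.repeat((4 - (base64.length % 4)) % 4);
  try {
    const payload = JSON.parse(atob(padded)) as { exp?: unknown };
    // Assumption: a token without a numeric `exp` claim is treated as not expired.
    if (typeof payload.exp !== 'number') return false;
    return payload.exp * 1000 <= nowMs;
  } catch {
    return true; // payload was not valid base64 or JSON
  }
}

// Builds an unsigned demo token (header.payload.signature) for illustration only.
function makeDemoToken(exp: number): string {
  const encode = (value: object): string =>
    btoa(JSON.stringify(value))
      .replace(/\+/g, '-')
      .replace(/\//g, '_')
      .replace(/=+$/, '');
  return `${encode({ alg: 'none' })}.${encode({ exp })}.signature`;
}
```

A check like this reads the claim without verifying the signature, so it can only say that a token "appears" expired; it is not a validity check.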

isTokenExpiringSoon checks whether a token will expire within the buffer time.

Parameters:
- expiresAt: number | undefined - Expiration timestamp in milliseconds
- bufferMs?: number - Buffer time (default: 60000 = 1 minute)

Returns: boolean
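
The buffer comparison can be pictured as follows. This is a minimal sketch with a hypothetical name (`expiresWithinBuffer`). How the library treats an undefined `expiresAt` is not stated here, so returning `false` in that case is an assumption.

```typescript
// Hypothetical stand-in for the documented check. A token counts as
// "expiring soon" when its remaining lifetime is at most `bufferMs`.
function expiresWithinBuffer(
  expiresAt: number | undefined,
  bufferMs: number = 60_000,
  nowMs: number = Date.now()
): boolean {
  // Assumption: without a known expiry there is nothing to compare against.
  if (expiresAt === undefined) return false;
  return expiresAt - nowMs <= bufferMs;
}
```

Passing `nowMs` explicitly keeps the check deterministic in tests.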

import { PlaywrightOAuthClientProvider } from '@gleanwork/mcp-server-tester';

Implements the MCP SDK's OAuthClientProvider interface with file-based storage.

const provider = new PlaywrightOAuthClientProvider({
storagePath: 'playwright/.auth/mcp-oauth-state.json',
redirectUri: 'http://localhost:3000/oauth/callback',
clientId: process.env.MCP_OAUTH_CLIENT_ID,
clientSecret: process.env.MCP_OAUTH_CLIENT_SECRET,
});

import { test } from '@gleanwork/mcp-server-tester/fixtures/mcpAuth';
test('uses auth provider', async ({ mcpAuthProvider }) => {
// mcpAuthProvider is configured from environment variables
});

interface MCPAuthConfig {
accessToken?: string;
oauth?: MCPOAuthConfig;
}
interface MCPOAuthConfig {
serverUrl: string;
scopes?: string[];
resource?: string;
authStatePath?: string;
clientId?: string;
clientSecret?: string;
redirectUri?: string;
}

Load an eval dataset from a JSON file.

Parameters:
- path: string - Path to dataset JSON file
- options?: object
  - schemas?: Record<string, ZodSchema> - Zod schemas for validation

Returns: Promise<EvalDataset>
const dataset = await loadEvalDataset('./data/evals.json', {
schemas: {
'weather-response': z.object({
city: z.string(),
temperature: z.number(),
}),
},
});

Run an eval dataset. Expectations are defined per-case in the dataset's expect blocks.

Parameters:
- options: EvalRunnerOptions
  - dataset: EvalDataset - Dataset to run
  - schemas?: Record<string, ZodType> - Schema registry for expect.schema validation by name
  - stopOnFailure?: boolean - Stop on first failure (default: false)
  - onCaseComplete?: (result: EvalCaseResult) => void - Callback after each case completes
  - concurrency?: number - Max parallel cases (default: 1 = sequential)
  - defaultLlmIterations?: number - Default iteration count for mcp_host cases (default: 1)
  - defaultJudgeReps?: number - Default judge evaluation count per case (default: 1)
  - filterTags?: string[] - Only run cases whose tags contain at least one match
  - saveResultsTo?: string - Save run results to file for baseline comparison
  - omitResponsesFromBaseline?: boolean - Strip responses from saved baseline (default: true)
  - baselineResultsFrom?: string - Load baseline file for regression detection
  - toolOverrides?: ToolOverrideVariant - Runtime tool metadata overrides for variant experiments
  - mcpHostModel?: string - Model identifier recorded in run metadata
  - judgeModel?: string - Judge model identifier recorded in run metadata
- context: EvalContext
  - mcp: MCPFixtureApi - MCP fixture API
  - testInfo?: TestInfo - Playwright test info (required for snapshot support)
  - expect?: ExpectType - Playwright expect function (required for snapshot support)

Returns: Promise<EvalRunnerResult>
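
The `filterTags` rule ("at least one match") can be illustrated with a small standalone filter. This is a sketch, not the runner's code: `TaggedCase` and `filterCasesByTags` are hypothetical names, and treating an empty or missing filter as "keep everything" is an assumption.

```typescript
// Minimal shape for illustration; real eval cases carry more fields.
interface TaggedCase {
  id: string;
  tags?: string[];
}

// Keeps a case when it shares at least one tag with the filter.
// Assumption: an empty or missing filter keeps every case.
function filterCasesByTags<T extends TaggedCase>(cases: T[], filterTags?: string[]): T[] {
  if (!filterTags || filterTags.length === 0) return cases;
  const wanted = new Set(filterTags);
  return cases.filter((c) => (c.tags ?? []).some((tag) => wanted.has(tag)));
}
```

Under this rule a case without any tags is never selected once a filter is supplied.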
const result = await runEvalDataset(
{ dataset }, // options — what to run and how
{ mcp, testInfo } // context — Playwright fixtures from your test
);
console.log(`Passed: ${result.passed}/${result.total}`);

Runtime tool overrides let you test alternate tool descriptions or input schemas without editing the eval dataset or MCP server source. Tool names are canonical server tool names; v1 does not support renames.

const variant = {
id: 'search-description-v2',
tools: {
search: {
description:
'Search internal company documents, policies, wiki pages, and announcements.',
inputSchema: {
type: 'object',
properties: {
query: {
type: 'string',
description: 'Natural language document or policy query.',
},
},
required: ['query'],
},
},
},
};
const baseline = await runEvalDataset(
{ dataset, defaultLlmIterations: 10 },
{ mcp, testInfo }
);
const candidate = await runEvalDataset(
{
dataset,
defaultLlmIterations: 10,
toolOverrides: variant,
},
{ mcp, testInfo }
);
console.log(candidate.metadata?.toolOverrideVariantId);

Use compareEvalRuns() to summarize the completed baseline and candidate runs:

import { compareEvalRuns } from '@gleanwork/mcp-server-tester';
const comparison = compareEvalRuns({
baseline,
candidate,
labels: {
baseline: 'baseline',
candidate: variant.id,
},
});
console.log(`Pass-rate delta: ${comparison.deltaPassRate}`);
console.log(`Improved cases: ${comparison.improvedCases.length}`);
console.log(`Regressed cases: ${comparison.regressedCases.length}`);

interface ToolOverrideVariant {
id: string;
description?: string;
tools: Record<
string,
{
description?: string;
inputSchema?: Record<string, unknown>;
}
>;
}

Compare two completed eval runs. This is a pure utility: it does not run evals, read or write baselines, call LLMs, or mutate datasets.

Parameters:
- options: CompareEvalRunsOptions
  - baseline: EvalRunnerResult - Baseline run result
  - candidate: EvalRunnerResult - Candidate run result
  - labels?: { baseline?: string; candidate?: string } - Optional display labels

Returns: EvalRunComparisonResult
const comparison = compareEvalRuns({
baseline,
candidate,
});

The result includes pass-rate deltas, optional tool precision/recall/F1 deltas, and case buckets:
- improvedCases - failed in baseline, passed in candidate
- regressedCases - passed in baseline, failed in candidate
- unchangedPasses - passed in both runs
- unchangedFailures - failed in both runs
- missingFromBaseline - case exists only in candidate
- missingFromCandidate - case exists only in baseline

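
The bucket definitions above amount to a small decision table keyed by case id. The sketch below reproduces that table on simplified inputs. `CaseOutcome`, `CaseBuckets`, and `bucketCases` are hypothetical names, and the buckets here hold plain ids, which may differ from what `compareEvalRuns()` actually stores.

```typescript
type CaseOutcome = { id: string; pass: boolean };

interface CaseBuckets {
  improvedCases: string[];
  regressedCases: string[];
  unchangedPasses: string[];
  unchangedFailures: string[];
  missingFromBaseline: string[];
  missingFromCandidate: string[];
}

// Sorts every case id into exactly one bucket, following the documented rules.
function bucketCases(baseline: CaseOutcome[], candidate: CaseOutcome[]): CaseBuckets {
  const basePass = new Map<string, boolean>();
  for (const c of baseline) basePass.set(c.id, c.pass);
  const candidateIds = new Set<string>();

  const buckets: CaseBuckets = {
    improvedCases: [],
    regressedCases: [],
    unchangedPasses: [],
    unchangedFailures: [],
    missingFromBaseline: [],
    missingFromCandidate: [],
  };

  for (const c of candidate) {
    candidateIds.add(c.id);
    const before = basePass.get(c.id);
    if (before === undefined) buckets.missingFromBaseline.push(c.id);
    else if (!before && c.pass) buckets.improvedCases.push(c.id);
    else if (before && !c.pass) buckets.regressedCases.push(c.id);
    else if (c.pass) buckets.unchangedPasses.push(c.id);
    else buckets.unchangedFailures.push(c.id);
  }
  for (const c of baseline) {
    if (!candidateIds.has(c.id)) buckets.missingFromCandidate.push(c.id);
  }
  return buckets;
}
```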
External result storage persists eval runs, reporter runs, and comparison artifacts as JSON. GCS is the first built-in cloud provider.
type StoredArtifactKind =
| 'eval-runner-result'
| 'reporter-run'
| 'eval-run-comparison'
| 'server-comparison';
interface EvalResultStore {
saveArtifact<T>(artifact: StoredEvalArtifact<T>): Promise<void>;
loadArtifact<T>(
kind: StoredArtifactKind,
id: string
): Promise<StoredEvalArtifact<T>>;
loadLatestArtifact<T>(
kind: StoredArtifactKind
): Promise<StoredEvalArtifact<T> | null>;
listArtifacts(
kind: StoredArtifactKind,
options?: { limit?: number }
): Promise<StoredArtifactSummary[]>;
}

Create a store from config:

import { createEvalResultStore } from '@gleanwork/mcp-server-tester';
const store = createEvalResultStore({
provider: 'gcs',
bucket: 'my-mcp-eval-results',
prefix: 'my-server/main',
});

runEvalDataset() accepts store-backed baseline references in addition to local file paths:

await runEvalDataset(
{
dataset,
resultStore: store,
baselineResultsFrom: { store: true, ref: 'latest' },
saveResultsTo: { store: true, ref: { id: 'candidate-run' } },
},
{ mcp, testInfo }
);

Stored runs can be used with compareEvalRuns():

import {
compareEvalRuns,
loadStoredEvalRunnerResult,
saveEvalRunComparison,
} from '@gleanwork/mcp-server-tester';
const baseline = await loadStoredEvalRunnerResult(store, { id: 'baseline' });
const candidate = await loadStoredEvalRunnerResult(store, { id: 'candidate' });
const comparison = compareEvalRuns({ baseline, candidate });
await saveEvalRunComparison({ store, comparison, id: 'candidate-comparison' });

Result Structure:

/**
* Overall result of running an eval dataset
*/
export interface EvalRunnerResult {
/**
* Total number of cases
*/
total: number;
/**
* Number of passing cases
*/
passed: number;
/**
* Number of failing cases
*/
failed: number;
/**
* Individual case results
*/
caseResults: Array<EvalCaseResult>;
/**
* Overall execution time in milliseconds
*/
durationMs: number;
/**
* Difference between current pass rate and baseline pass rate.
* Positive = improvement, negative = regression.
* Only present when `baselineResultsFrom` was provided.
*/
deltaPassRate?: number;
/**
* Number of cases that regressed: passed in baseline, failed now.
* Only present when `baselineResultsFrom` was provided.
*/
regressions?: number;
/**
* Number of cases that improved: failed in baseline, passed now.
* Only present when `baselineResultsFrom` was provided.
*/
improvements?: number;
/**
* Average tool precision across all mcp_host cases that have a
* `toolsTriggered` expectation (precision = fraction of called tools
* that were expected). Only present when at least one such case ran.
*/
datasetToolPrecision?: number;
/**
* Average tool recall across all mcp_host cases that have a
* `toolsTriggered` expectation (recall = fraction of required tools
* that were actually called). Only present when at least one such case ran.
*/
datasetToolRecall?: number;
/**
* Harmonic mean of `datasetToolPrecision` and `datasetToolRecall`.
* Only present when at least one case contributes precision/recall data.
*/
datasetToolF1?: number;
/**
* Experiment tracking metadata captured at run time.
*/
metadata?: EvalRunMetadata;
}

Run a tool-metadata variant experiment: establish a baseline, inject each candidate variant via toolOverrides, compare against the baseline, rank by a metric, guard against regressions, and emit a structured improvement proposal. This is the high-level API that wraps the manual baseline → candidate → compareEvalRuns loop.

Parameters:
- options: VariantExperimentOptions
  - dataset: EvalDataset - The dataset to run (never mutated)
  - variants?: ToolOverrideVariant[] - Static candidates tried in round 0
  - proposeVariants?: (ctx: ProposeVariantsContext) => Promise<ToolOverrideVariant[]> - Callback returning the next candidates from prior-round evidence; return [] to stop
  - metric?: 'passRate' | 'toolF1' | 'toolPrecision' | 'toolRecall' - Ranking metric (default 'passRate')
  - maxRounds?: number - Round budget (default 1)
  - minImprovement?: number - Stop when a round's best gain is below this (default 0)
  - allowRegressions?: boolean - Allow winners that regress cases (default false)
  - Plus runEvalDataset passthrough: defaultLlmIterations, defaultJudgeReps, concurrency, filterTags, schemas, mcpHostModel, judgeModel
- context: EvalContext - { mcp, testInfo? } from your test

Returns: VariantExperimentResult
- baseline - The original no-override run
- rounds - Every round's candidates with per-candidate result, comparison, metricValue, metricDelta, disqualified
- winner - Best non-disqualified candidate across all rounds
- proposal - VariantImprovementProposal with recommendation: 'apply' | 'reject' | 'inconclusive', metric values, toolChanges, and improved/regressed case ids
- reason - Why the experiment stopped: 'no-variants' | 'no-improvement' | 'max-rounds' | 'threshold-met'

import { runVariantExperiment } from '@gleanwork/mcp-server-tester';
const result = await runVariantExperiment(
{
dataset,
variants: [variant],
metric: 'passRate',
defaultLlmIterations: 10,
},
{ mcp, testInfo }
);
if (result.proposal?.recommendation === 'apply') {
console.log(result.winner?.variant.id, result.proposal.delta);
}

A candidate that regresses any case is disqualified from winning unless allowRegressions: true; the best attempt is still surfaced in proposal with recommendation: 'reject' so an agent can see what broke. See MCP Host Simulation for the full agent-loop example.
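
The disqualification rule can be sketched as a filter followed by a ranking step. This is an illustration under stated assumptions, not the library's selection logic: `CandidateSummary` and `pickWinner` are hypothetical, candidates are ranked by `metricDelta` alone, and thresholds such as `minImprovement` are ignored.

```typescript
// Simplified view of one candidate; the real result objects carry more data.
interface CandidateSummary {
  variantId: string;
  metricDelta: number;
  regressedCaseIds: string[];
}

// Drops candidates that regress any case (unless allowed), then picks the
// one with the largest metric delta. Returns undefined when none qualify.
function pickWinner(
  candidates: CandidateSummary[],
  allowRegressions: boolean = false
): CandidateSummary | undefined {
  const eligible = candidates.filter(
    (c) => allowRegressions || c.regressedCaseIds.length === 0
  );
  return eligible.reduce<CandidateSummary | undefined>(
    (best, c) => (best === undefined || c.metricDelta > best.metricDelta ? c : best),
    undefined
  );
}
```

Filtering before ranking means a large gain never outweighs a regression by default, which matches the guard described above.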
Run a single eval case. Useful when you want fine-grained control over individual cases outside of a dataset, or when building custom eval orchestration.
Parameters:
- evalCase: EvalCase - The eval case to run
- context: EvalContext
  - mcp: MCPFixtureApi - MCP fixture API
  - testInfo?: TestInfo - Playwright test info (for reporter integration)
  - expect?: Expect - Playwright expect (for snapshot support)
- options?: EvalCaseOptions
  - datasetName?: string - Dataset name for the result (default: 'single-case')
  - schemas?: Record<string, ZodType> - Schema registry for named schema validation

Returns: Promise<EvalCaseResult>
import { runEvalCase } from '@gleanwork/mcp-server-tester';
test('single eval case', async ({ mcp }, testInfo) => {
const result = await runEvalCase(
{
id: 'search-check',
mode: 'direct',
toolName: 'search',
args: { query: 'planning' },
expect: { containsText: ['result'] },
},
{ mcp, testInfo }
);
expect(result.pass).toBe(true);
});

When evalCase.iterations > 1, the case is run multiple times and result.assertionPassRate is populated with the fraction of passing iterations.
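
The relationship between iterations, `assertionPassRate`, and `accuracyThreshold` can be sketched as below. `summarizeIterations` is a hypothetical helper. The `>=` comparison follows the description of `accuracyThreshold` as a minimum, and the result for an empty list is an assumption.

```typescript
// Computes the fraction of passing iterations and compares it against the
// minimum accuracy required to pass (default 1.0, i.e. all must pass).
function summarizeIterations(
  iterationPasses: boolean[],
  accuracyThreshold: number = 1.0
): { assertionPassRate: number; pass: boolean } {
  const passed = iterationPasses.filter(Boolean).length;
  // Assumption: with no iterations the rate is reported as 0.
  const assertionPassRate =
    iterationPasses.length === 0 ? 0 : passed / iterationPasses.length;
  return { assertionPassRate, pass: assertionPassRate >= accuracyThreshold };
}
```

With the default threshold a single failing iteration fails the case, so flaky cases usually need an explicit `accuracyThreshold`.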
Pure validation functions that power both Playwright matchers and the eval runner. Each returns a ValidationResult with pass, message, and optional details. Use these when you need validation logic outside of Playwright's expect() — for example in Vitest/Jest tests, eval datasets, or custom pipelines.
import { validateText, validateSchema } from '@gleanwork/mcp-server-tester';
interface ValidationResult {
pass: boolean;
message: string;
details?: Record<string, unknown>;
metrics?: { precision?: number; recall?: number };
}

Checks that the response contains all expected text substrings.

Parameters:
- response: unknown - The response to validate
- expected: string | string[] - Substring(s) to find
- options?: TextValidatorOptions - { caseSensitive?: boolean } (default: true)

const result = validateText(response, ['temperature', 'conditions']);
const result2 = validateText(response, 'hello', { caseSensitive: false });

Checks that the response matches all expected regex patterns.

Parameters:
- response: unknown - The response to validate
- patterns: string | RegExp | (string | RegExp)[] - Pattern(s) to match
- options?: PatternValidatorOptions - { caseSensitive?: boolean } (default: true)

const result = validatePattern(response, /temperature: \d+/);
const result2 = validatePattern(response, ['\\d+ degrees', /humidity: \d+%/]);

Checks that the response is (or is not) an error, optionally with a specific message.

Parameters:
- response: unknown - The response to validate
- expected?: boolean | string | string[] - true = expect any error, false = expect no error, string = expect error containing text (default: true)

const result = validateError(response, true); // any error
const result2 = validateError(response, false); // no error
const result3 = validateError(response, 'not found'); // error with message

Checks that the response size in bytes is within bounds.

Parameters:
- response: unknown - The response to validate
- options: SizeValidatorOptions - { minBytes?: number; maxBytes?: number } (at least one required)

const result = validateSize(response, { maxBytes: 10_000 });
const result2 = validateSize(response, { minBytes: 100, maxBytes: 50_000 });

Validates the response against a Zod schema. Automatically parses JSON text responses.

Parameters:
- response: unknown - The response to validate
- schema: ZodType - Zod schema to validate against
- options?: SchemaValidatorOptions - { strict?: boolean } (default: false)

import { z } from 'zod';
const WeatherSchema = z.object({
temperature: z.number(),
conditions: z.string(),
});
const result = validateSchema(response, WeatherSchema);

Deep equality comparison using JSON serialization.

Parameters:
- actual: unknown - The actual response
- expected: unknown - The expected response

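
Equality via JSON serialization has one well-known subtlety: plain `JSON.stringify` output depends on key insertion order. The sketch below shows both the naive comparison and a key-order-insensitive variant. All function names are hypothetical, and whether the library normalizes key order is not stated in this document.

```typescript
// Naive comparison: sensitive to the order in which keys were inserted.
function jsonEqual(actual: unknown, expected: unknown): boolean {
  return JSON.stringify(actual) === JSON.stringify(expected);
}

// Recursively rebuilds objects with sorted keys so that serialization
// no longer depends on insertion order. Arrays keep their order.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const record = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(record).sort()) {
      sorted[key] = canonicalize(record[key]);
    }
    return sorted;
  }
  return value;
}

function jsonEqualIgnoringKeyOrder(actual: unknown, expected: unknown): boolean {
  return JSON.stringify(canonicalize(actual)) === JSON.stringify(canonicalize(expected));
}
```

Either way, values that JSON cannot represent (such as `undefined` properties or functions) are dropped during serialization and do not take part in the comparison.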
const result = validateResponse(response, { status: 'ok', count: 42 });

Validates tool calls from an MCP host simulation result. Only applicable to mcp_host mode.

Parameters:
- response: unknown - Must be an MCPHostSimulationResult
- expectation: ToolCallExpectation - Expected tool call specification

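
Precision and recall for tool calls follow the definitions given for `EvalRunnerResult`: precision is the fraction of called tools that were expected, recall is the fraction of expected tools that were actually called, and F1 is their harmonic mean. The sketch below works on tool names only. `toolCallMetrics` is hypothetical, and deduplicating names and returning 0 for an empty denominator are assumptions.

```typescript
interface ToolMetrics {
  precision: number;
  recall: number;
  f1: number;
}

// Computes name-level precision, recall, and F1 for one simulation.
function toolCallMetrics(calledTools: string[], expectedTools: string[]): ToolMetrics {
  // Assumption: repeated calls to the same tool count once.
  const called = Array.from(new Set(calledTools));
  const expected = Array.from(new Set(expectedTools));
  const hits = called.filter((name) => expected.includes(name)).length;

  // Assumption: an empty denominator yields 0 rather than NaN.
  const precision = called.length === 0 ? 0 : hits / called.length;
  const recall = expected.length === 0 ? 0 : hits / expected.length;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

For example, calling `search` and `fetch` when only `search` was expected gives full recall but a precision of 0.5, which is the signal that the model over-called tools.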
import type { ToolCallExpectation } from '@gleanwork/mcp-server-tester';
const expectation: ToolCallExpectation = {
calls: [{ name: 'search', required: true }],
order: 'any',
exclusive: false,
};
const result = validateToolCalls(simulationResult, expectation);
// result.metrics contains { precision, recall }

Validates the number of tool calls from an MCP host simulation result. Only applicable to mcp_host mode.

Parameters:
- response: unknown - Must be an MCPHostSimulationResult
- options: ToolCallCountOptions - { min?: number; max?: number; exact?: number }

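
The three count options can be read as independent bounds that must all hold. The sketch below assumes inclusive `min` and `max` and AND semantics when several options are set. `CountBounds` and `countWithinBounds` are hypothetical names.

```typescript
interface CountBounds {
  min?: number;
  max?: number;
  exact?: number;
}

// Returns true when the observed count satisfies every bound that is set.
// Assumption: min and max are inclusive.
function countWithinBounds(count: number, { min, max, exact }: CountBounds): boolean {
  if (exact !== undefined && count !== exact) return false;
  if (min !== undefined && count < min) return false;
  if (max !== undefined && count > max) return false;
  return true;
}
```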
const result = validateToolCallCount(simulationResult, { min: 1, max: 3 });

Evaluates a response using an LLM-as-a-judge. Returns a Promise<ValidationResult>.

Parameters:
- response: unknown - The response to evaluate
- config: JudgeValidatorConfig - Judge configuration

JudgeValidatorConfig:

| Field | Type | Default | Description |
|---|---|---|---|
| rubric | RubricSpec | - | Evaluation rubric (required unless judge is set) |
| judge | string | - | Name of a registered custom judge |
| reference | unknown | - | Reference response to compare against |
| threshold | number | 0.7 | Minimum score to pass (0–1) |
| reps | number | 1 | Number of evaluations to run (scores averaged) |
| provider | ProviderKind | 'anthropic' | Judge LLM provider |
| model | string | - | Model override |

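
The interaction between `reps` and `threshold` can be sketched as a mean followed by a comparison: scores from repeated judge runs are averaged, and the mean must meet the threshold. `aggregateJudgeScores` is a hypothetical helper that takes already-collected scores. It does not call any judge.

```typescript
// Averages judge scores from repeated evaluations and applies the threshold.
function aggregateJudgeScores(
  scores: number[],
  threshold: number = 0.7
): { mean: number; pass: boolean } {
  if (scores.length === 0) {
    throw new Error('at least one judge score is required');
  }
  const mean = scores.reduce((sum, score) => sum + score, 0) / scores.length;
  return { mean, pass: mean >= threshold };
}
```

Averaging means one low score can be offset by higher ones, which is why more reps reduce variance without making the check stricter.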
const result = await validateJudge(response, {
rubric: 'Does the response accurately describe the weather?',
threshold: 0.8,
});

Custom Playwright matchers for writing inline assertions against MCP tool responses. Import expect from the package or its fixtures:

import { expect } from '@gleanwork/mcp-server-tester';
// or, when using fixtures:
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

Assert that the tool response exactly deep-equals the expected value.

test('exact response', async ({ mcp }) => {
const result = await mcp.callTool('calculate', { a: 2, b: 3 });
expect(result).toMatchToolResponse({ result: 5 });
});

For eval datasets, use the expect.response field:

{
"id": "calc-test",
"toolName": "calculate",
"args": { "a": 2, "b": 3 },
"expect": {
"response": { "result": 5 }
}
}

Assert that the tool response text contains the given substring(s).

test('text contains', async ({ mcp }) => {
const result = await mcp.callTool('get_weather', { city: 'London' });
expect(result).toContainToolText('temperature');
expect(result).toContainToolText(['London', 'temperature', 'humidity']);
});

Assert that the tool response text matches the given regex pattern(s).

test('pattern match', async ({ mcp }) => {
const result = await mcp.callTool('get_weather', { city: 'London' });
expect(result).toMatchToolPattern('Temperature: \\d+°[CF]');
expect(result).toMatchToolPattern(['^## Weather', '\\d{4}-\\d{2}-\\d{2}']);
});

Assert that the tool response validates against a Zod schema.

import { z } from 'zod';
const WeatherSchema = z.object({
city: z.string(),
temperature: z.number(),
conditions: z.string(),
});
test('schema validation', async ({ mcp }) => {
const result = await mcp.callTool('get_weather', { city: 'London' });
expect(result).toMatchToolSchema(WeatherSchema);
});

Assert that the tool response matches a saved Playwright snapshot. Use sanitizers to normalize variable fields (timestamps, UUIDs, etc.) before comparison.

test('snapshot', async ({ mcp }, testInfo) => {
const result = await mcp.callTool('help', {});
expect(result).toMatchToolSnapshot('help-output');
});
// With sanitizers
expect(result).toMatchToolSnapshot('user-profile', ['uuid', 'iso-date']);

Assert that the tool response is an error (or is not an error when negated). Optionally assert on the error message.

test('error handling', async ({ mcp }) => {
const result = await mcp.callTool('nonexistent_tool', {});
expect(result).toBeToolError();
// Assert specific error message substring
expect(result).toBeToolError('not found');
// Assert response is NOT an error
const good = await mcp.callTool('get_weather', { city: 'London' });
expect(good).not.toBeToolError();
});

Assert that the tool response passes an LLM-as-a-judge evaluation. Requires a judge client to be configured.

test('semantic quality', async ({ mcp }) => {
const result = await mcp.callTool('search_docs', { query: 'authentication' });
expect(result).toPassToolJudge(
{
text: 'The results should be relevant to the query about authentication. Score 0-1.',
},
{ threshold: 0.7 }
);
});

Assert that the tool response size is within specified byte bounds.

test('response size', async ({ mcp }) => {
const result = await mcp.callTool('list_files', {});
expect(result).toHaveToolResponseSize({ minBytes: 10, maxBytes: 50000 });
});

Assert that the tool response satisfies a custom predicate function.

test('custom predicate', async ({ mcp }) => {
const result = await mcp.callTool('list_files', {});
expect(result).toSatisfyToolPredicate(
(r) => Array.isArray(r.content) && r.content.length > 0,
'response should contain at least one file'
);
});

Assert that the LLM made specific tool calls when given a natural language prompt. Only meaningful in mcp_host mode.

test('tool discovery', async ({ mcp }) => {
const result = await mcp.callTool('search', { query: 'find recent docs' });
expect(result).toHaveToolCalls({
calls: [{ name: 'search', required: true }],
order: 'any',
exclusive: false,
});
});

Assert that the LLM made a specific number of tool calls. Only meaningful in mcp_host mode.

test('call count', async ({ mcp }) => {
const result = await mcp.callTool('search', { query: 'find docs' });
expect(result).toHaveToolCallCount({ min: 1, max: 5 });
});

Extract text content from various MCP response formats.

Parameters:
- response: CallToolResult - MCP tool call result

Returns: string
const result = await mcp.callTool('get_info', {});
const text = extractText(result);

Normalize whitespace for consistent comparison.

Parameters:
- text: string - Text to normalize

Returns: string
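
A common way to get this behavior is to collapse every run of whitespace into a single space and trim the ends. The sketch below assumes that approach. `collapseWhitespace` is a hypothetical stand-in, not the library's `normalizeWhitespace` implementation.

```typescript
// Collapses runs of spaces, tabs, and newlines into one space and trims
// leading and trailing whitespace.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}
```

Normalizing both sides of a comparison this way makes text assertions robust to formatting differences such as wrapped lines or indentation.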
const normalized = normalizeWhitespace(' hello\n\n world ');
// Returns: "hello world"

Create an LLM judge for semantic evaluation of tool responses.

Parameters:
- config?: JudgeConfig (all fields optional)
  - provider?: 'anthropic' | 'openai' | 'google' - LLM provider (default: 'anthropic')
  - model?: string - Model name (default: 'claude-sonnet-4-20250514')
  - temperature?: number - Temperature 0–1 (default: 0.0)
  - maxTokens?: number - Maximum tokens for response (default: 1000)
  - maxBudgetUsd?: number - Maximum budget in USD (default: 0.10)
  - maxToolOutputSize?: number - Fail if response exceeds this byte count

Returns: Judge
Default (Claude):
import { createJudge } from '@gleanwork/mcp-server-tester';
const judge = createJudge();
// Requires: ANTHROPIC_API_KEY environment variable

With configuration:

const judge = createJudge({
provider: 'openai',
model: 'gpt-4o',
temperature: 0.0,
});
// Requires: OPENAI_API_KEY environment variable

The following utilities are available for checking whether optional LLM provider packages are installed. They are useful for debugging provider configuration issues but are not part of the typical test-writing path.

Check whether the npm package required for a given mcp_host provider is installed in the current environment.
import { isProviderAvailable } from '@gleanwork/mcp-server-tester';
if (!isProviderAvailable('anthropic')) {
console.warn('Install @anthropic-ai/sdk to use the anthropic provider');
}

Return a human-readable message describing the missing dependency for a provider, suitable for displaying in error output or test skip conditions.

import { getMissingDependencyMessage } from '@gleanwork/mcp-server-tester';
const message = getMissingDependencyMessage('openai');
// e.g. "Provider 'openai' requires the 'openai' package. Run: npm install openai"

See LLM Host Guide for full details on configuring mcp_host mode.

Run MCP protocol conformance checks.
Parameters:
- mcp: MCPFixtureApi - MCP fixture API
- options?: object
  - requiredTools?: string[] - Tools that must be present
  - validateSchemas?: boolean - Validate tool input schemas (default: false)

Returns: Promise<MCPConformanceResult>
const result = await runConformanceChecks(mcp, {
requiredTools: ['get_weather', 'search_docs'],
validateSchemas: true,
});
expect(result.pass).toBe(true);

Result Structure:

interface MCPConformanceResult {
pass: boolean;
checks: Array<{
name: string;
pass: boolean;
message: string;
}>;
}

export interface EvalExpectBlock {
/**
* Exact response match (toMatchToolResponse)
*/
response?: unknown;
/**
* Name of schema to validate against (toMatchToolSchema)
*/
schema?: string;
/**
* Text substring(s) that must be present (toContainToolText)
*/
containsText?: string | string[];
/**
* Regex pattern(s) that must match (toMatchToolPattern)
*/
matchesPattern?: string | string[];
/**
* Snapshot name for comparison (toMatchToolSnapshot)
*/
snapshot?: string;
/**
* Snapshot sanitizers to apply
*/
snapshotSanitizers?: SnapshotSanitizer[];
/**
* Error expectation (toBeToolError)
* - true: expects any error
* - false: expects no error
* - string: expects error containing this message
*/
isError?: boolean | string | string[];
/**
* LLM-as-judge evaluation (toPassToolJudge)
*
* Accepts a single judge config or an array for multi-judge evaluation.
* When an array is provided, all judges must pass (AND semantics).
*/
passesJudge?: JudgeExpectConfig | JudgeExpectConfig[];
/**
* Response size validation (toHaveToolResponseSize)
*/
responseSize?: {
/** Maximum allowed size in bytes */
maxBytes?: number;
/** Minimum required size in bytes */
minBytes?: number;
};
/**
* Asserts which tools the LLM called during a mcp_host simulation.
* Only meaningful for mcp_host mode — direct mode has no tool call trace.
*/
toolsTriggered?: {
/** Expected tool calls */
calls: Array<{
/** Tool name */
name: string;
/** Expected arguments (partial match — extra keys are allowed) */
arguments?: Record<string, unknown>;
/** Whether this call MUST have been made (default: true) */
required?: boolean;
}>;
/**
* 'strict': calls must appear in the exact order listed
* 'any': calls can appear in any order (default)
*/
order?: 'strict' | 'any';
/** If true, no tool calls outside the `calls` list are allowed */
exclusive?: boolean;
};
/**
* Asserts the number of tool calls made during a mcp_host simulation.
*/
toolCallCount?: {
/** Minimum number of tool calls */
min?: number;
/** Maximum number of tool calls */
max?: number;
/** Exact number of tool calls */
exact?: number;
};
}

export interface EvalCase {
/**
* Unique identifier for this test case
*/
id: string;
/**
* Human-readable description of what this test case validates
*/
description?: string;
/**
* Evaluation mode
* - 'direct': Direct API calls to MCP tools (default)
* - 'mcp_host': LLM-driven tool selection via natural language
*
* @default 'direct'
*/
mode?: EvalMode;
/**
* Name of the MCP tool to call (required for 'direct' mode, optional for 'mcp_host' mode)
*/
toolName?: string;
/**
* Arguments to pass to the tool (required for 'direct' mode, optional for 'mcp_host' mode)
*/
args?: Record<string, unknown>;
/**
* Natural language scenario for the LLM to execute (required for 'mcp_host' mode)
*
* @example "Get the weather for London and tell me if I need an umbrella"
*/
scenario?: string;
/**
* MCP host configuration (optional for 'mcp_host' mode)
*
* If not specified, uses default configuration from test environment
*/
mcpHostConfig?: MCPHostConfig;
/**
* Additional metadata for this test case
*
* For 'mcp_host' mode, can include 'expectedToolCalls' for validation
*/
metadata?: Record<string, unknown>;
/**
* Number of times to run this case and compute an assertion pass rate.
* When > 1, `EvalCaseResult.assertionPassRate` is populated and `pass` is determined
* by `accuracyThreshold` rather than a single run.
* @default 1
*/
iterations?: number;
/**
* Minimum accuracy (0–1) required to pass when `iterations > 1`.
* @default 1.0 (all iterations must pass)
*/
accuracyThreshold?: number;
/**
* Number of times to invoke the LLM judge per `passesJudge` assertion.
* Scores are averaged; the mean must meet the threshold to pass.
* Reduces judge variance caused by non-determinism.
* Per-assertion `passesJudge.reps` overrides this value.
* @default 1
*/
judgeReps?: number;
/**
* Golden/expected answer for this case.
* When set, automatically passed as `reference` to the LLM judge
* (unless passesJudge.reference is explicitly provided).
* Mirrors EvalV2's `canonical_answer` field.
*/
canonicalAnswer?: string;
/**
* Arbitrary string labels for this case.
* Use for filtering eval runs with `EvalRunnerOptions.filterTags`
* and for slicing results by category.
*
* @example ['tool-finding', 'multi-hop', 'search']
*/
tags?: string[];
/**
* Expectations to validate against the tool response
*
* Multiple expectations can be combined and will all be validated.
*
* @example
* ```json
* {
* "id": "weather-london",
* "toolName": "get_weather",
* "args": { "city": "London" },
* "expect": {
* "containsText": ["temperature", "conditions"],
* "schema": "WeatherResponse",
* "responseSize": { "maxBytes": 10000 },
* "isError": false
* }
* }
* ```
*/
expect?: EvalExpectBlock;
}

interface EvalDataset {
name: string;
description?: string;
cases: EvalCase[];
metadata?: Record<string, unknown>;
schemas?: Record<string, ZodSchema>; // Zod schemas for toMatchToolSchema assertions
}

- See the Authentication Guide for OAuth and token auth
- See the Expectations Guide for detailed expectation usage
- Check out the Quick Start Guide for getting started
- Explore Examples for real-world usage patterns