API Reference

Complete API documentation for @gleanwork/mcp-server-tester.

Fixtures

mcpClient: Client

Raw MCP SDK client from @modelcontextprotocol/sdk.

test('use raw client', async ({ mcpClient }) => {
  const tools = await mcpClient.listTools();
  const result = await mcpClient.callTool({ name: 'tool_name', arguments: { ... } });
});

mcp: MCPFixtureApi

High-level test API with helper methods.

export interface MCPFixtureApi {
  /**
   * The underlying MCP client (for advanced usage)
   */
  client: Client;

  /**
   * Authentication type used for this test session
   */
  authType: AuthType;

  /**
   * Playwright project name for this test session
   */
  project?: string;

  /**
   * Lists all available tools from the MCP server
   *
   * @returns Array of tool definitions
   */
  listTools(): Promise<Array<Tool>>;

  /**
   * Calls a tool on the MCP server
   *
   * @param name - Tool name
   * @param args - Tool arguments
   * @returns Tool call result
   */
  callTool<TArgs extends Record<string, unknown> = Record<string, unknown>>(
    name: string,
    args: TArgs
  ): Promise<CallToolResult>;

  /**
   * Gets information about the connected server
   */
  getServerInfo(): {
    name?: string;
    version?: string;
  } | null;
}

Methods

listTools()

List all tools available from the MCP server.

Returns: Promise<Array<Tool>>

const tools = await mcp.listTools();
console.log(tools.map((t) => t.name));

callTool<TArgs>(name, args)

Call a tool by name with arguments.

Parameters:

  • name: string - Tool name
  • args: TArgs - Tool arguments

Returns: Promise<CallToolResult>

const result = await mcp.callTool('get_weather', { city: 'London' });

getServerInfo()

Get server information (name, version).

Returns: { name?: string; version?: string } | null

const info = mcp.getServerInfo();
console.log(info?.name, info?.version);

createMCPFixture(client, testInfo?, options?)

Creates an MCPFixtureApi wrapper around a raw MCP Client. Use this when you need manual fixture setup — for example in custom fixture hierarchies, non-Playwright test runners (Vitest, Jest), or when composing with other lifecycle logic.

For the standard Playwright use case, prefer importing test and mcp from @gleanwork/mcp-server-tester/fixtures/mcp, which wires this up automatically.

Parameters:

  • client: Client — MCP client created via createMCPClientForConfig()
  • testInfo?: TestInfo — Optional Playwright TestInfo. When provided, operations are wrapped in test.step() and attachments are created for the MCP reporter
  • options?: MCPFixtureOptions — Optional configuration

MCPFixtureOptions:

Field          Type                             Default  Description
authType       'oauth' | 'api-token' | 'none'   'none'   Authentication type for this session
project        string                           -        Playwright project name (for filtering/grouping in the reporter)
callTimeoutMs  number                           30000    Timeout in milliseconds for MCP operations

Returns: MCPFixtureApi

import {
  createMCPFixture,
  createMCPClientForConfig,
  closeMCPClient,
} from '@gleanwork/mcp-server-tester';
import { test as base } from '@playwright/test';
import type { MCPFixtureApi } from '@gleanwork/mcp-server-tester';

const test = base.extend<{ mcp: MCPFixtureApi }>({
  mcp: async ({}, use, testInfo) => {
    const client = await createMCPClientForConfig(config);
    const api = createMCPFixture(client, testInfo, { authType: 'api-token' });
    await use(api);
    await closeMCPClient(client);
  },
});

// Non-Playwright usage (no reporter attachments)
const client = await createMCPClientForConfig(config);
const api = createMCPFixture(client);
const tools = await api.listTools();

Authentication

For comprehensive authentication documentation, see the Authentication Guide.

Token Utilities

import {
  createTokenAuthHeaders,
  validateAccessToken,
  isTokenExpired,
  isTokenExpiringSoon,
} from '@gleanwork/mcp-server-tester';

createTokenAuthHeaders(accessToken, tokenType?)

Create HTTP headers with Authorization header.

Parameters:

  • accessToken: string - Access token
  • tokenType?: string - Token type (default: 'Bearer')

Returns: Record<string, string>

const headers = createTokenAuthHeaders(process.env.MCP_ACCESS_TOKEN);
// { Authorization: 'Bearer eyJ...' }

validateAccessToken(accessToken)

Validate that an access token is present and non-empty.

Parameters:

  • accessToken: string | undefined - Token to validate

Throws: Error if token is missing or empty

isTokenExpired(accessToken)

Check if a JWT token appears to be expired.

Parameters:

  • accessToken: string - JWT token

Returns: boolean
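As an illustration of what "appears to be expired" can mean for a JWT, the sketch below reads the `exp` claim (seconds since the epoch) from the token payload and compares it with the current time. `decodeJwtExp` and `looksExpired` are hypothetical names, not the library's implementation, and treating an unreadable token as "not expired" is an assumption made here for simplicity.

```typescript
// Hypothetical helpers for illustration only; not part of the package API.

// Read the numeric `exp` claim from a JWT payload without verifying the signature.
function decodeJwtExp(token: string): number | undefined {
  const parts = token.split('.');
  const payloadPart = parts[1];
  if (parts.length !== 3 || payloadPart === undefined) return undefined;

  // Convert base64url to base64 and restore padding.
  let b64 = payloadPart.replace(/-/g, '+').replace(/_/g, '/');
  while (b64.length % 4 !== 0) b64 += '=';

  try {
    const payload: unknown = JSON.parse(atob(b64));
    if (typeof payload === 'object' && payload !== null) {
      const exp = (payload as { exp?: unknown }).exp;
      return typeof exp === 'number' ? exp : undefined;
    }
  } catch {
    // Malformed base64 or JSON: fall through and report "unknown".
  }
  return undefined;
}

// Assumption: a token with no readable `exp` claim is treated as not expired.
function looksExpired(token: string, nowMs: number = Date.now()): boolean {
  const exp = decodeJwtExp(token);
  return exp !== undefined && exp * 1000 <= nowMs;
}
```

A check like this only inspects the claim; it does not validate the signature, so it is suitable for deciding when to refresh a token, not for trusting one.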

isTokenExpiringSoon(expiresAt, bufferMs?)

Check if a token will expire within the buffer time.

Parameters:

  • expiresAt: number | undefined - Expiration timestamp in milliseconds
  • bufferMs?: number - Buffer time (default: 60000 = 1 minute)

Returns: boolean
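The buffer comparison can be sketched as follows. `expiresWithin` is a hypothetical stand-in, and returning `false` when `expiresAt` is undefined is an assumption; the library may handle a missing timestamp differently.

```typescript
// Hypothetical stand-in for illustration; not the package's implementation.
function expiresWithin(
  expiresAt: number | undefined,
  bufferMs: number = 60_000,
  nowMs: number = Date.now()
): boolean {
  // Assumption: an unknown expiry is treated as "not expiring soon".
  if (expiresAt === undefined) return false;
  // True when the remaining lifetime is no longer than the buffer.
  return expiresAt - nowMs <= bufferMs;
}

// 50 s of lifetime left with a 60 s buffer: refresh now.
const shouldRefresh = expiresWithin(100_000, 60_000, 50_000);
```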

OAuth Client Provider

import { PlaywrightOAuthClientProvider } from '@gleanwork/mcp-server-tester';

Implements the MCP SDK's OAuthClientProvider interface with file-based storage.

const provider = new PlaywrightOAuthClientProvider({
  storagePath: 'playwright/.auth/mcp-oauth-state.json',
  redirectUri: 'http://localhost:3000/oauth/callback',
  clientId: process.env.MCP_OAUTH_CLIENT_ID,
  clientSecret: process.env.MCP_OAUTH_CLIENT_SECRET,
});

Auth Fixture

import { test } from '@gleanwork/mcp-server-tester/fixtures/mcpAuth';

test('uses auth provider', async ({ mcpAuthProvider }) => {
  // mcpAuthProvider is configured from environment variables
});

Auth Configuration Types

interface MCPAuthConfig {
  accessToken?: string;
  oauth?: MCPOAuthConfig;
}

interface MCPOAuthConfig {
  serverUrl: string;
  scopes?: string[];
  resource?: string;
  authStatePath?: string;
  clientId?: string;
  clientSecret?: string;
  redirectUri?: string;
}

Eval Functions

loadEvalDataset(path, options?)

Load an eval dataset from a JSON file.

Parameters:

  • path: string - Path to dataset JSON file
  • options?: object
    • schemas?: Record<string, ZodSchema> - Zod schemas for validation

Returns: Promise<EvalDataset>

const dataset = await loadEvalDataset('./data/evals.json', {
  schemas: {
    'weather-response': z.object({
      city: z.string(),
      temperature: z.number(),
    }),
  },
});

runEvalDataset(options, context)

Run an eval dataset. Expectations are defined per-case in the dataset's expect blocks.

Parameters:

  • options: EvalRunnerOptions
    • dataset: EvalDataset - Dataset to run
    • schemas?: Record<string, ZodType> - Schema registry for expect.schema validation by name
    • stopOnFailure?: boolean - Stop on first failure (default: false)
    • onCaseComplete?: (result: EvalCaseResult) => void - Callback after each case completes
    • concurrency?: number - Max parallel cases (default: 1 = sequential)
    • defaultLlmIterations?: number - Default iteration count for mcp_host cases (default: 1)
    • defaultJudgeReps?: number - Default judge evaluation count per case (default: 1)
    • filterTags?: string[] - Only run cases whose tags contain at least one match
    • saveResultsTo?: string - Save run results to file for baseline comparison
    • omitResponsesFromBaseline?: boolean - Strip responses from saved baseline (default: true)
    • baselineResultsFrom?: string - Load baseline file for regression detection
    • toolOverrides?: ToolOverrideVariant - Runtime tool metadata overrides for variant experiments
    • mcpHostModel?: string - Model identifier recorded in run metadata
    • judgeModel?: string - Judge model identifier recorded in run metadata
  • context: EvalContext
    • mcp: MCPFixtureApi - MCP fixture API
    • testInfo?: TestInfo - Playwright test info (required for snapshot support)
    • expect?: ExpectType - Playwright expect function (required for snapshot support)

Returns: Promise<EvalRunnerResult>

const result = await runEvalDataset(
  { dataset }, // options — what to run and how
  { mcp, testInfo } // context — Playwright fixtures from your test
);

console.log(`Passed: ${result.passed}/${result.total}`);
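The `filterTags` rule ("at least one match") can be illustrated with a small predicate. `matchesTags` is a hypothetical helper, and running every case when no filter is given is an assumption about the default behavior.

```typescript
// Hypothetical helper for illustration; not part of the package API.
function matchesTags(
  caseTags: string[] | undefined,
  filterTags: string[] | undefined
): boolean {
  // Assumption: no filter means every case runs.
  if (filterTags === undefined || filterTags.length === 0) return true;
  const tags = caseTags ?? [];
  // A case qualifies when it shares at least one tag with the filter.
  return tags.some((tag) => filterTags.indexOf(tag) !== -1);
}

const runsSmoke = matchesTags(['smoke', 'search'], ['smoke']);
const runsUntagged = matchesTags(undefined, ['smoke']);
```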

Runtime tool overrides let you test alternate tool descriptions or input schemas without editing the eval dataset or MCP server source. Tool names are canonical server tool names; v1 does not support renames.

const variant = {
  id: 'search-description-v2',
  tools: {
    search: {
      description:
        'Search internal company documents, policies, wiki pages, and announcements.',
      inputSchema: {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Natural language document or policy query.',
          },
        },
        required: ['query'],
      },
    },
  },
};

const baseline = await runEvalDataset(
  { dataset, defaultLlmIterations: 10 },
  { mcp, testInfo }
);

const candidate = await runEvalDataset(
  {
    dataset,
    defaultLlmIterations: 10,
    toolOverrides: variant,
  },
  { mcp, testInfo }
);

console.log(candidate.metadata?.toolOverrideVariantId);

Use compareEvalRuns() to summarize the completed baseline and candidate runs:

import { compareEvalRuns } from '@gleanwork/mcp-server-tester';

const comparison = compareEvalRuns({
  baseline,
  candidate,
  labels: {
    baseline: 'baseline',
    candidate: variant.id,
  },
});

console.log(`Pass-rate delta: ${comparison.deltaPassRate}`);
console.log(`Improved cases: ${comparison.improvedCases.length}`);
console.log(`Regressed cases: ${comparison.regressedCases.length}`);

interface ToolOverrideVariant {
  id: string;
  description?: string;
  tools: Record<
    string,
    {
      description?: string;
      inputSchema?: Record<string, unknown>;
    }
  >;
}

compareEvalRuns(options)

Compare two completed eval runs. This is a pure utility: it does not run evals, read or write baselines, call LLMs, or mutate datasets.

Parameters:

  • options: CompareEvalRunsOptions
    • baseline: EvalRunnerResult - Baseline run result
    • candidate: EvalRunnerResult - Candidate run result
    • labels?: { baseline?: string; candidate?: string } - Optional display labels

Returns: EvalRunComparisonResult

const comparison = compareEvalRuns({
  baseline,
  candidate,
});

The result includes pass-rate deltas, optional tool precision/recall/F1 deltas, and case buckets:

  • improvedCases - failed in baseline, passed in candidate
  • regressedCases - passed in baseline, failed in candidate
  • unchangedPasses - passed in both runs
  • unchangedFailures - failed in both runs
  • missingFromBaseline - case exists only in candidate
  • missingFromCandidate - case exists only in baseline
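The bucket definitions above can be sketched as a pure function over per-case pass flags. The `PassById` input shape and the `bucketCases` name are assumptions made for this example; the real function takes full `EvalRunnerResult` objects.

```typescript
// Simplified input for illustration: case id -> whether the case passed.
type PassById = Record<string, boolean>;

interface CaseBuckets {
  improvedCases: string[];
  regressedCases: string[];
  unchangedPasses: string[];
  unchangedFailures: string[];
  missingFromBaseline: string[];
  missingFromCandidate: string[];
}

// Hypothetical sketch of the bucketing rules; not the package's implementation.
function bucketCases(baseline: PassById, candidate: PassById): CaseBuckets {
  const buckets: CaseBuckets = {
    improvedCases: [],
    regressedCases: [],
    unchangedPasses: [],
    unchangedFailures: [],
    missingFromBaseline: [],
    missingFromCandidate: [],
  };
  for (const id of Object.keys(candidate)) {
    if (!(id in baseline)) {
      buckets.missingFromBaseline.push(id);
      continue;
    }
    const before = baseline[id] === true;
    const after = candidate[id] === true;
    if (!before && after) buckets.improvedCases.push(id);
    else if (before && !after) buckets.regressedCases.push(id);
    else if (before) buckets.unchangedPasses.push(id);
    else buckets.unchangedFailures.push(id);
  }
  for (const id of Object.keys(baseline)) {
    if (!(id in candidate)) buckets.missingFromCandidate.push(id);
  }
  return buckets;
}
```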

External Result Storage

External result storage persists eval runs, reporter runs, and comparison artifacts as JSON. Google Cloud Storage (GCS) is the first built-in cloud provider.

type StoredArtifactKind =
  | 'eval-runner-result'
  | 'reporter-run'
  | 'eval-run-comparison'
  | 'server-comparison';

interface EvalResultStore {
  saveArtifact<T>(artifact: StoredEvalArtifact<T>): Promise<void>;
  loadArtifact<T>(
    kind: StoredArtifactKind,
    id: string
  ): Promise<StoredEvalArtifact<T>>;
  loadLatestArtifact<T>(
    kind: StoredArtifactKind
  ): Promise<StoredEvalArtifact<T> | null>;
  listArtifacts(
    kind: StoredArtifactKind,
    options?: { limit?: number }
  ): Promise<StoredArtifactSummary[]>;
}

Create a store from config:

import { createEvalResultStore } from '@gleanwork/mcp-server-tester';

const store = createEvalResultStore({
  provider: 'gcs',
  bucket: 'my-mcp-eval-results',
  prefix: 'my-server/main',
});

runEvalDataset() accepts store-backed baseline references in addition to local file paths:

await runEvalDataset(
  {
    dataset,
    resultStore: store,
    baselineResultsFrom: { store: true, ref: 'latest' },
    saveResultsTo: { store: true, ref: { id: 'candidate-run' } },
  },
  { mcp, testInfo }
);

Stored runs can be used with compareEvalRuns():

import {
  compareEvalRuns,
  loadStoredEvalRunnerResult,
  saveEvalRunComparison,
} from '@gleanwork/mcp-server-tester';

const baseline = await loadStoredEvalRunnerResult(store, { id: 'baseline' });
const candidate = await loadStoredEvalRunnerResult(store, { id: 'candidate' });
const comparison = compareEvalRuns({ baseline, candidate });

await saveEvalRunComparison({ store, comparison, id: 'candidate-comparison' });

Result Structure:


/**
 * Overall result of running an eval dataset
 */
export interface EvalRunnerResult {
  /**
   * Total number of cases
   */
  total: number;

  /**
   * Number of passing cases
   */
  passed: number;

  /**
   * Number of failing cases
   */
  failed: number;

  /**
   * Individual case results
   */
  caseResults: Array<EvalCaseResult>;

  /**
   * Overall execution time in milliseconds
   */
  durationMs: number;

  /**
   * Difference between current pass rate and baseline pass rate.
   * Positive = improvement, negative = regression.
   * Only present when `baselineResultsFrom` was provided.
   */
  deltaPassRate?: number;

  /**
   * Number of cases that regressed: passed in baseline, failed now.
   * Only present when `baselineResultsFrom` was provided.
   */
  regressions?: number;

  /**
   * Number of cases that improved: failed in baseline, passed now.
   * Only present when `baselineResultsFrom` was provided.
   */
  improvements?: number;

  /**
   * Average tool precision across all mcp_host cases that have a
   * `toolsTriggered` expectation (precision = fraction of called tools
   * that were expected). Only present when at least one such case ran.
   */
  datasetToolPrecision?: number;

  /**
   * Average tool recall across all mcp_host cases that have a
   * `toolsTriggered` expectation (recall = fraction of required tools
   * that were actually called). Only present when at least one such case ran.
   */
  datasetToolRecall?: number;

  /**
   * Harmonic mean of `datasetToolPrecision` and `datasetToolRecall`.
   * Only present when at least one case contributes precision/recall data.
   */
  datasetToolF1?: number;

  /**
   * Experiment tracking metadata captured at run time.
   */
  metadata?: EvalRunMetadata;
}
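To make the precision, recall, and F1 definitions concrete, here is a worked sketch. The `ToolCase` shape and function names are hypothetical, and two choices are assumptions rather than documented behavior: tool names are de-duplicated before counting, and an empty denominator yields 0.

```typescript
// Hypothetical shapes for illustration; not the package's types.
interface ToolCase {
  called: string[]; // tool names the LLM actually called
  expected: string[]; // tool names the case expected
}

interface ToolMetrics {
  precision: number;
  recall: number;
  f1: number;
}

// Remove duplicate names while preserving order.
function unique(names: string[]): string[] {
  return names.filter((name, i) => names.indexOf(name) === i);
}

function harmonicMean(p: number, r: number): number {
  return p + r === 0 ? 0 : (2 * p * r) / (p + r);
}

function caseMetrics(c: ToolCase): ToolMetrics {
  const called = unique(c.called);
  const expected = unique(c.expected);
  const hits = called.filter((name) => expected.indexOf(name) !== -1).length;
  // precision: fraction of called tools that were expected
  const precision = called.length === 0 ? 0 : hits / called.length;
  // recall: fraction of expected tools that were actually called
  const recall = expected.length === 0 ? 0 : hits / expected.length;
  return { precision, recall, f1: harmonicMean(precision, recall) };
}

// Dataset level: average per-case precision and recall, then take the harmonic mean.
function datasetMetrics(cases: ToolCase[]): ToolMetrics {
  const mean = (xs: number[]): number =>
    xs.length === 0 ? 0 : xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const perCase = cases.map(caseMetrics);
  const precision = mean(perCase.map((m) => m.precision));
  const recall = mean(perCase.map((m) => m.recall));
  return { precision, recall, f1: harmonicMean(precision, recall) };
}
```

For a case where the model called `search` and `fetch` but only `search` was expected, precision is 0.5 and recall is 1.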

runVariantExperiment(options, context)

Run a tool-metadata variant experiment: establish a baseline, inject each candidate variant via toolOverrides, compare against the baseline, rank by a metric, guard against regressions, and emit a structured improvement proposal. This is the high-level API that wraps the manual baseline → candidate → compareEvalRuns loop.

Parameters:

  • options: VariantExperimentOptions
    • dataset: EvalDataset - The dataset to run (never mutated)
    • variants?: ToolOverrideVariant[] - Static candidates tried in round 0
    • proposeVariants?: (ctx: ProposeVariantsContext) => Promise<ToolOverrideVariant[]> - Callback returning the next candidates from prior-round evidence; return [] to stop
    • metric?: 'passRate' | 'toolF1' | 'toolPrecision' | 'toolRecall' - Ranking metric (default 'passRate')
    • maxRounds?: number - Round budget (default 1)
    • minImprovement?: number - Stop when a round's best gain is below this (default 0)
    • allowRegressions?: boolean - Allow winners that regress cases (default false)
    • Plus runEvalDataset passthrough: defaultLlmIterations, defaultJudgeReps, concurrency, filterTags, schemas, mcpHostModel, judgeModel
  • context: EvalContext - { mcp, testInfo? } from your test

Returns: VariantExperimentResult

  • baseline - The original no-override run
  • rounds - Every round's candidates with per-candidate result, comparison, metricValue, metricDelta, disqualified
  • winner - Best non-disqualified candidate across all rounds
  • proposal - VariantImprovementProposal with recommendation: 'apply' | 'reject' | 'inconclusive', metric values, toolChanges, and improved/regressed case ids
  • reason - Why the experiment stopped: 'no-variants' | 'no-improvement' | 'max-rounds' | 'threshold-met'

import { runVariantExperiment } from '@gleanwork/mcp-server-tester';

const result = await runVariantExperiment(
  {
    dataset,
    variants: [variant],
    metric: 'passRate',
    defaultLlmIterations: 10,
  },
  { mcp, testInfo }
);

if (result.proposal?.recommendation === 'apply') {
  console.log(result.winner?.variant.id, result.proposal.delta);
}

A candidate that regresses any case is disqualified from winning unless allowRegressions: true; the best attempt is still surfaced in proposal with recommendation: 'reject' so an agent can see what broke. See MCP Host Simulation for the full agent-loop example.
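The regression guard can be pictured as a filter followed by a ranking step. The sketch below is a simplified illustration: `CandidateSummary`, `regressedCount`, and `pickWinner` are hypothetical names, and whether the library additionally requires a positive delta before declaring a winner is not covered here.

```typescript
// Hypothetical summary of one candidate; not the package's types.
interface CandidateSummary {
  id: string;
  metricDelta: number; // candidate metric minus baseline metric
  regressedCount: number; // cases that passed in baseline but fail now
}

// Pick the candidate with the largest delta among those that are not disqualified.
function pickWinner(
  candidates: CandidateSummary[],
  allowRegressions: boolean = false
): CandidateSummary | undefined {
  const eligible = candidates.filter(
    (c) => allowRegressions || c.regressedCount === 0
  );
  let best: CandidateSummary | undefined = undefined;
  for (const c of eligible) {
    if (best === undefined || c.metricDelta > best.metricDelta) best = c;
  }
  return best;
}
```

With the guard on, a candidate that gains more but breaks a case loses to a smaller, regression-free gain.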

runEvalCase(evalCase, context, options?)

Run a single eval case. Useful when you want fine-grained control over individual cases outside of a dataset, or when building custom eval orchestration.

Parameters:

  • evalCase: EvalCase - The eval case to run
  • context: EvalContext
    • mcp: MCPFixtureApi - MCP fixture API
    • testInfo?: TestInfo - Playwright test info (for reporter integration)
    • expect?: Expect - Playwright expect (for snapshot support)
  • options?: EvalCaseOptions
    • datasetName?: string - Dataset name for the result (default: 'single-case')
    • schemas?: Record<string, ZodType> - Schema registry for named schema validation

Returns: Promise<EvalCaseResult>

import { runEvalCase } from '@gleanwork/mcp-server-tester';
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

test('single eval case', async ({ mcp }, testInfo) => {
  const result = await runEvalCase(
    {
      id: 'search-check',
      mode: 'direct',
      toolName: 'search',
      args: { query: 'planning' },
      expect: { containsText: ['result'] },
    },
    { mcp, testInfo }
  );

  expect(result.pass).toBe(true);
});

When evalCase.iterations > 1, the case is run multiple times and result.assertionPassRate is populated with the fraction of passing iterations.


Programmatic Validators

Pure validation functions that power both Playwright matchers and the eval runner. Each returns a ValidationResult with pass, message, and optional details. Use these when you need validation logic outside of Playwright's expect() — for example in Vitest/Jest tests, eval datasets, or custom pipelines.

import { validateText, validateSchema } from '@gleanwork/mcp-server-tester';

interface ValidationResult {
  pass: boolean;
  message: string;
  details?: Record<string, unknown>;
  metrics?: { precision?: number; recall?: number };
}

validateText(response, expected, options?)

Checks that the response contains all expected text substrings.

Parameters:

  • response: unknown — The response to validate
  • expected: string | string[] — Substring(s) to find
  • options?: TextValidatorOptions - { caseSensitive?: boolean } (default: true)

const result = validateText(response, ['temperature', 'conditions']);
const result2 = validateText(response, 'hello', { caseSensitive: false });
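The "contains all substrings" rule, including the case-insensitive option, can be sketched on plain strings. `containsAll` is a hypothetical helper; the real validator accepts an arbitrary response and returns a `ValidationResult`.

```typescript
// Hypothetical helper for illustration; not the package's implementation.
interface TextCheck {
  pass: boolean;
  missing: string[]; // expected substrings that were not found
}

function containsAll(
  text: string,
  expected: string | string[],
  caseSensitive: boolean = true
): TextCheck {
  const needles = Array.isArray(expected) ? expected : [expected];
  const haystack = caseSensitive ? text : text.toLowerCase();
  const missing = needles.filter((needle) => {
    const target = caseSensitive ? needle : needle.toLowerCase();
    return haystack.indexOf(target) === -1;
  });
  // Every expected substring must be present for the check to pass.
  return { pass: missing.length === 0, missing };
}

const check = containsAll('Temperature: 12, conditions: rain', ['temperature', 'conditions'], false);
```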

validatePattern(response, patterns, options?)

Checks that the response matches all expected regex patterns.

Parameters:

  • response: unknown — The response to validate
  • patterns: string | RegExp | (string | RegExp)[] — Pattern(s) to match
  • options?: PatternValidatorOptions - { caseSensitive?: boolean } (default: true)

const result = validatePattern(response, /temperature: \d+/);
const result2 = validatePattern(response, ['\\d+ degrees', /humidity: \d+%/]);

validateError(response, expected?)

Checks that the response is (or is not) an error, optionally with a specific message.

Parameters:

  • response: unknown — The response to validate
  • expected?: boolean | string | string[] - true = expect any error, false = expect no error, string = expect error containing text (default: true)

const result = validateError(response, true); // any error
const result2 = validateError(response, false); // no error
const result3 = validateError(response, 'not found'); // error with message

validateSize(response, options)

Checks that the response size in bytes is within bounds.

Parameters:

  • response: unknown — The response to validate
  • options: SizeValidatorOptions - { minBytes?: number; maxBytes?: number } (at least one required)

const result = validateSize(response, { maxBytes: 10_000 });
const result2 = validateSize(response, { minBytes: 100, maxBytes: 50_000 });

validateSchema(response, schema, options?)

Validates the response against a Zod schema. Automatically parses JSON text responses.

Parameters:

  • response: unknown — The response to validate
  • schema: ZodType — Zod schema to validate against
  • options?: SchemaValidatorOptions - { strict?: boolean } (default: false)

import { z } from 'zod';

const WeatherSchema = z.object({
  temperature: z.number(),
  conditions: z.string(),
});

const result = validateSchema(response, WeatherSchema);

validateResponse(actual, expected)

Deep equality comparison using JSON serialization.

Parameters:

  • actual: unknown — The actual response
  • expected: unknown — The expected response

const result = validateResponse(response, { status: 'ok', count: 42 });
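Comparison by JSON serialization has a few properties worth knowing. The sketch below uses plain `JSON.stringify` through a hypothetical `jsonEqual` helper; whether the library normalizes key order before comparing is not stated here, so treat the key-order result as a property of raw serialization rather than of `validateResponse` itself.

```typescript
// Hypothetical helper showing raw JSON-serialization equality.
function jsonEqual(a: unknown, b: unknown): boolean {
  return JSON.stringify(a) === JSON.stringify(b);
}

// Same keys, same order: equal.
const same = jsonEqual({ status: 'ok', count: 42 }, { status: 'ok', count: 42 });

// Same keys, different order: raw serialization differs.
const reordered = jsonEqual({ status: 'ok', count: 42 }, { count: 42, status: 'ok' });

// Properties set to undefined are dropped by JSON.stringify, so these compare equal.
const undefinedDropped = jsonEqual({ status: 'ok', extra: undefined }, { status: 'ok' });
```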

validateToolCalls(response, expectation)

Validates tool calls from an MCP host simulation result. Only applicable to mcp_host mode.

Parameters:

  • response: unknown — Must be an MCPHostSimulationResult
  • expectation: ToolCallExpectation — Expected tool call specification

import type { ToolCallExpectation } from '@gleanwork/mcp-server-tester';

const expectation: ToolCallExpectation = {
  calls: [{ name: 'search', required: true }],
  order: 'any',
  exclusive: false,
};

const result = validateToolCalls(simulationResult, expectation);
// result.metrics contains { precision, recall }

validateToolCallCount(response, options)

Validates the number of tool calls from an MCP host simulation result. Only applicable to mcp_host mode.

Parameters:

  • response: unknown — Must be an MCPHostSimulationResult
  • options: ToolCallCountOptions - { min?: number; max?: number; exact?: number }

const result = validateToolCallCount(simulationResult, { min: 1, max: 3 });

validateJudge(response, config) (async)

Evaluates a response using an LLM-as-a-judge. Returns a Promise<ValidationResult>.

Parameters:

  • response: unknown — The response to evaluate
  • config: JudgeValidatorConfig — Judge configuration

JudgeValidatorConfig:

Field      Type          Default      Description
rubric     RubricSpec    -            Evaluation rubric (required unless judge is set)
judge      string        -            Name of a registered custom judge
reference  unknown       -            Reference response to compare against
threshold  number        0.7          Minimum score to pass (0–1)
reps       number        1            Number of evaluations to run (scores averaged)
provider   ProviderKind  'anthropic'  Judge LLM provider
model      string        -            Model override

const result = await validateJudge(response, {
  rubric: 'Does the response accurately describe the weather?',
  threshold: 0.8,
});
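How `reps` and `threshold` interact can be shown with plain arithmetic: the per-run scores are averaged and the mean is compared with the threshold. `judgePasses` is a hypothetical helper, and using `>=` for the comparison is an assumption based on the "minimum score to pass" wording.

```typescript
// Hypothetical helper for illustration; no LLM call is made here.
function judgePasses(
  scores: number[],
  threshold: number = 0.7
): { score: number; pass: boolean } {
  if (scores.length === 0) {
    throw new Error('at least one judge score is required');
  }
  // Average the scores from each repetition.
  const score = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  // Assumption: a mean equal to the threshold counts as a pass.
  return { score, pass: score >= threshold };
}

// Three repetitions averaging 0.75: passes at 0.7, fails at 0.8.
const lenient = judgePasses([0.9, 0.6, 0.75], 0.7);
const strictJudge = judgePasses([0.9, 0.6, 0.75], 0.8);
```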

Playwright Matchers

Custom Playwright matchers for writing inline assertions against MCP tool responses. Import expect from the package or its fixtures:

import { expect } from '@gleanwork/mcp-server-tester';
// or, when using fixtures:
import { test, expect } from '@gleanwork/mcp-server-tester/fixtures/mcp';

toMatchToolResponse(expected)

Assert that the tool response exactly deep-equals the expected value.

test('exact response', async ({ mcp }) => {
  const result = await mcp.callTool('calculate', { a: 2, b: 3 });
  expect(result).toMatchToolResponse({ result: 5 });
});

For eval datasets, use the expect.response field:

{
  "id": "calc-test",
  "toolName": "calculate",
  "args": { "a": 2, "b": 3 },
  "expect": {
    "response": { "result": 5 }
  }
}

toContainToolText(text | text[])

Assert that the tool response text contains the given substring(s).

test('text contains', async ({ mcp }) => {
  const result = await mcp.callTool('get_weather', { city: 'London' });
  expect(result).toContainToolText('temperature');
  expect(result).toContainToolText(['London', 'temperature', 'humidity']);
});

toMatchToolPattern(pattern | pattern[])

Assert that the tool response text matches the given regex pattern(s).

test('pattern match', async ({ mcp }) => {
  const result = await mcp.callTool('get_weather', { city: 'London' });
  expect(result).toMatchToolPattern('Temperature: \\d+°[CF]');
  expect(result).toMatchToolPattern(['^## Weather', '\\d{4}-\\d{2}-\\d{2}']);
});

toMatchToolSchema(schema)

Assert that the tool response validates against a Zod schema.

import { z } from 'zod';

const WeatherSchema = z.object({
  city: z.string(),
  temperature: z.number(),
  conditions: z.string(),
});

test('schema validation', async ({ mcp }) => {
  const result = await mcp.callTool('get_weather', { city: 'London' });
  expect(result).toMatchToolSchema(WeatherSchema);
});

toMatchToolSnapshot(name, sanitizers?)

Assert that the tool response matches a saved Playwright snapshot. Use sanitizers to normalize variable fields (timestamps, UUIDs, etc.) before comparison.

test('snapshot', async ({ mcp }, testInfo) => {
  const result = await mcp.callTool('help', {});
  expect(result).toMatchToolSnapshot('help-output');
});

// With sanitizers
expect(result).toMatchToolSnapshot('user-profile', ['uuid', 'iso-date']);
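To show what normalizing variable fields before a snapshot comparison involves, here is a regex-based sketch. The patterns, the `[UUID]` and `[ISO_DATE]` placeholders, and the `sanitizeForSnapshot` name are assumptions for this example; the built-in 'uuid' and 'iso-date' sanitizers may match and replace differently.

```typescript
// Hypothetical patterns and placeholders; the built-in sanitizers may differ.
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const ISO_DATE_RE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;

// Replace values that change between runs with stable placeholders.
function sanitizeForSnapshot(text: string): string {
  return text.replace(UUID_RE, '[UUID]').replace(ISO_DATE_RE, '[ISO_DATE]');
}

const stable = sanitizeForSnapshot(
  'id=123e4567-e89b-12d3-a456-426614174000 at 2024-01-02T03:04:05Z'
);
// stable === 'id=[UUID] at [ISO_DATE]'
```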

toBeToolError(expected?)

Assert that the tool response is an error (or is not an error when negated). Optionally assert on the error message.

test('error handling', async ({ mcp }) => {
  const result = await mcp.callTool('nonexistent_tool', {});
  expect(result).toBeToolError();

  // Assert specific error message substring
  expect(result).toBeToolError('not found');

  // Assert response is NOT an error
  const good = await mcp.callTool('get_weather', { city: 'London' });
  expect(good).not.toBeToolError();
});

toPassToolJudge(rubric, options?)

Assert that the tool response passes an LLM-as-a-judge evaluation. Requires a judge client to be configured.

test('semantic quality', async ({ mcp }) => {
  const result = await mcp.callTool('search_docs', { query: 'authentication' });
  await expect(result).toPassToolJudge(
    {
      text: 'The results should be relevant to the query about authentication. Score 0-1.',
    },
    { threshold: 0.7 }
  );
});

toHaveToolResponseSize(options)

Assert that the tool response size is within specified byte bounds.

test('response size', async ({ mcp }) => {
  const result = await mcp.callTool('list_files', {});
  expect(result).toHaveToolResponseSize({ minBytes: 10, maxBytes: 50000 });
});

toSatisfyToolPredicate(fn, desc?)

Assert that the tool response satisfies a custom predicate function.

test('custom predicate', async ({ mcp }) => {
  const result = await mcp.callTool('list_files', {});
  expect(result).toSatisfyToolPredicate(
    (r) => Array.isArray(r.content) && r.content.length > 0,
    'response should contain at least one file'
  );
});

toHaveToolCalls(expectation) (mcp_host mode only)

Assert that the LLM made specific tool calls when given a natural language prompt. Only meaningful in mcp_host mode.

test('tool discovery', async ({ mcp }) => {
  const result = await mcp.callTool('search', { query: 'find recent docs' });
  expect(result).toHaveToolCalls({
    calls: [{ name: 'search', required: true }],
    order: 'any',
    exclusive: false,
  });
});

toHaveToolCallCount(options) (mcp_host mode only)

Assert that the LLM made a specific number of tool calls. Only meaningful in mcp_host mode.

test('call count', async ({ mcp }) => {
  const result = await mcp.callTool('search', { query: 'find docs' });
  expect(result).toHaveToolCallCount({ min: 1, max: 5 });
});

Text Utilities

extractText(response)

Extract text content from various MCP response formats.

Parameters:

  • response: CallToolResult - MCP tool call result

Returns: string

const result = await mcp.callTool('get_info', {});
const text = extractText(result);
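For the common case where a tool result carries an array of content items, text extraction can be sketched as below. `textOf` and the local `ContentItem` type are simplified stand-ins, joining parts with a newline is an assumption, and the real `extractText` handles more response formats than this.

```typescript
// Simplified stand-in types; the MCP SDK's CallToolResult has more fields.
interface ContentItem {
  type: string;
  text?: string;
}

// Hypothetical sketch: collect the text of every 'text' content item.
function textOf(result: { content?: ContentItem[] }): string {
  const parts: string[] = [];
  for (const item of result.content ?? []) {
    if (item.type === 'text' && typeof item.text === 'string') {
      parts.push(item.text);
    }
  }
  // Assumption: multiple text parts are joined with a newline.
  return parts.join('\n');
}

const joined = textOf({
  content: [
    { type: 'text', text: 'first' },
    { type: 'image' },
    { type: 'text', text: 'second' },
  ],
});
```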

normalizeWhitespace(text)

Normalize whitespace for consistent comparison.

Parameters:

  • text: string - Text to normalize

Returns: string

const normalized = normalizeWhitespace('  hello\n\n  world  ');
// Returns: "hello world"
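One way to get the behavior shown above is to collapse every whitespace run to a single space and trim the ends. `collapseWhitespace` is a hypothetical equivalent written for this example, not the library's source.

```typescript
// Hypothetical equivalent for illustration.
function collapseWhitespace(text: string): string {
  // \s+ matches runs of spaces, tabs, and newlines.
  return text.replace(/\s+/g, ' ').trim();
}

const collapsed = collapseWhitespace('  hello\n\n  world  ');
// collapsed === 'hello world'
```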

Judge Functions

createJudge(config?)

Create an LLM judge for semantic evaluation of tool responses.

Parameters:

  • config?: JudgeConfig (all fields optional)
    • provider?: 'anthropic' | 'openai' | 'google' - LLM provider (default: 'anthropic')
    • model?: string - Model name (default: 'claude-sonnet-4-20250514')
    • temperature?: number - Temperature 0–1 (default: 0.0)
    • maxTokens?: number - Maximum tokens for response (default: 1000)
    • maxBudgetUsd?: number - Maximum budget in USD (default: 0.10)
    • maxToolOutputSize?: number - Fail if response exceeds this byte count

Returns: Judge

Default (Claude):

import { createJudge } from '@gleanwork/mcp-server-tester';

const judge = createJudge();
// Requires: ANTHROPIC_API_KEY environment variable

With configuration:

const judge = createJudge({
  provider: 'openai',
  model: 'gpt-4o',
  temperature: 0.0,
});
// Requires: OPENAI_API_KEY environment variable

LLM Host Diagnostic Utilities

The following utilities are available for checking whether optional LLM provider packages are installed. They are useful for debugging provider configuration issues but are not part of the typical test-writing path.

isProviderAvailable(provider)

Check whether the npm package required for a given mcp_host provider is installed in the current environment.

import { isProviderAvailable } from '@gleanwork/mcp-server-tester';

if (!isProviderAvailable('anthropic')) {
  console.warn('Install @anthropic-ai/sdk to use the anthropic provider');
}

getMissingDependencyMessage(provider)

Return a human-readable message describing the missing dependency for a provider, suitable for displaying in error output or test skip conditions.

import { getMissingDependencyMessage } from '@gleanwork/mcp-server-tester';

const message = getMissingDependencyMessage('openai');
// e.g. "Provider 'openai' requires the 'openai' package. Run: npm install openai"

See LLM Host Guide for full details on configuring mcp_host mode.

Conformance Functions

runConformanceChecks(mcp, options?)

Run MCP protocol conformance checks.

Parameters:

  • mcp: MCPFixtureApi - MCP fixture API
  • options?: object
    • requiredTools?: string[] - Tools that must be present
    • validateSchemas?: boolean - Validate tool input schemas (default: false)

Returns: Promise<MCPConformanceResult>

const result = await runConformanceChecks(mcp, {
  requiredTools: ['get_weather', 'search_docs'],
  validateSchemas: true,
});

expect(result.pass).toBe(true);

Result Structure:

interface MCPConformanceResult {
  pass: boolean;
  checks: Array<{
    name: string;
    pass: boolean;
    message: string;
  }>;
}
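When a conformance run fails, the `checks` array says which checks failed and why. The helper below is a consumer-side sketch that turns a result into a readable failure summary; `describeFailures` is a hypothetical name and the local interfaces mirror the structure shown above.

```typescript
// Local copies of the documented result shape, so the sketch is self-contained.
interface ConformanceCheck {
  name: string;
  pass: boolean;
  message: string;
}

interface ConformanceResult {
  pass: boolean;
  checks: ConformanceCheck[];
}

// Hypothetical helper: list each failed check as "name: message".
function describeFailures(result: ConformanceResult): string {
  const failed = result.checks.filter((check) => !check.pass);
  if (failed.length === 0) return 'all conformance checks passed';
  return failed.map((check) => `${check.name}: ${check.message}`).join('\n');
}

const summary = describeFailures({
  pass: false,
  checks: [
    { name: 'lists-tools', pass: true, message: 'ok' },
    { name: 'required-tools', pass: false, message: 'missing search_docs' },
  ],
});
```

A string like this can be passed as the custom message of an assertion so a failing test reports the specific check.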

Type Definitions

EvalExpectBlock

export interface EvalExpectBlock {
  /**
   * Exact response match (toMatchToolResponse)
   */
  response?: unknown;

  /**
   * Name of schema to validate against (toMatchToolSchema)
   */
  schema?: string;

  /**
   * Text substring(s) that must be present (toContainToolText)
   */
  containsText?: string | string[];

  /**
   * Regex pattern(s) that must match (toMatchToolPattern)
   */
  matchesPattern?: string | string[];

  /**
   * Snapshot name for comparison (toMatchToolSnapshot)
   */
  snapshot?: string;

  /**
   * Snapshot sanitizers to apply
   */
  snapshotSanitizers?: SnapshotSanitizer[];

  /**
   * Error expectation (toBeToolError)
   * - true: expects any error
   * - false: expects no error
   * - string: expects error containing this message
   */
  isError?: boolean | string | string[];

  /**
   * LLM-as-judge evaluation (toPassToolJudge)
   *
   * Accepts a single judge config or an array for multi-judge evaluation.
   * When an array is provided, all judges must pass (AND semantics).
   */
  passesJudge?: JudgeExpectConfig | JudgeExpectConfig[];

  /**
   * Response size validation (toHaveToolResponseSize)
   */
  responseSize?: {
    /** Maximum allowed size in bytes */
    maxBytes?: number;
    /** Minimum required size in bytes */
    minBytes?: number;
  };

  /**
   * Asserts which tools the LLM called during an mcp_host simulation.
   * Only meaningful for mcp_host mode — direct mode has no tool call trace.
   */
  toolsTriggered?: {
    /** Expected tool calls */
    calls: Array<{
      /** Tool name */
      name: string;
      /** Expected arguments (partial match — extra keys are allowed) */
      arguments?: Record<string, unknown>;
      /** Whether this call MUST have been made (default: true) */
      required?: boolean;
    }>;
    /**
     * 'strict': calls must appear in the exact order listed
     * 'any': calls can appear in any order (default)
     */
    order?: 'strict' | 'any';
    /** If true, no tool calls outside the `calls` list are allowed */
    exclusive?: boolean;
  };

  /**
   * Asserts the number of tool calls made during an mcp_host simulation.
   */
  toolCallCount?: {
    /** Minimum number of tool calls */
    min?: number;
    /** Maximum number of tool calls */
    max?: number;
    /** Exact number of tool calls */
    exact?: number;
  };
}
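The "partial match" rule for `toolsTriggered.calls[].arguments` means that only the keys you list are compared, and extra keys in the actual call are ignored. The sketch below illustrates that rule. `argsPartiallyMatch` is a hypothetical helper, and comparing values by their JSON serialization is an assumption made for brevity; the library's own comparison may differ (for example, in how it treats nested objects).

```typescript
// Hypothetical helper (not the library's implementation): returns true when every
// key in `expected` is present in `actual` with an equal JSON serialization.
// Keys that appear only in `actual` are ignored.
function argsPartiallyMatch(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>
): boolean {
  return Object.entries(expected).every(
    ([key, value]) =>
      key in actual && JSON.stringify(actual[key]) === JSON.stringify(value)
  );
}

const expectedArgs = { city: 'London' };

// The extra `units` key is allowed, so this matches.
const matchesWithExtraKey = argsPartiallyMatch(expectedArgs, {
  city: 'London',
  units: 'metric',
});

// A listed key with a different value does not match.
const matchesWrongValue = argsPartiallyMatch(expectedArgs, { city: 'Paris' });

// A listed key that is absent from the actual call does not match.
const matchesMissingKey = argsPartiallyMatch(expectedArgs, { units: 'metric' });
```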

EvalCase

export interface EvalCase {
  /**
   * Unique identifier for this test case
   */
  id: string;

  /**
   * Human-readable description of what this test case validates
   */
  description?: string;

  /**
   * Evaluation mode
   * - 'direct': Direct API calls to MCP tools (default)
   * - 'mcp_host': LLM-driven tool selection via natural language
   *
   * @default 'direct'
   */
  mode?: EvalMode;

  /**
   * Name of the MCP tool to call (required for 'direct' mode, optional for 'mcp_host' mode)
   */
  toolName?: string;

  /**
   * Arguments to pass to the tool (required for 'direct' mode, optional for 'mcp_host' mode)
   */
  args?: Record<string, unknown>;

  /**
   * Natural language scenario for the LLM to execute (required for 'mcp_host' mode, otherwise optional)
   *
   * @example "Get the weather for London and tell me if I need an umbrella"
   */
  scenario?: string;

  /**
   * MCP host configuration (optional for 'mcp_host' mode)
   *
   * If not specified, uses default configuration from test environment
   */
  mcpHostConfig?: MCPHostConfig;

  /**
   * Additional metadata for this test case
   *
   * For 'mcp_host' mode, can include 'expectedToolCalls' for validation
   */
  metadata?: Record<string, unknown>;

  /**
   * Number of times to run this case and compute an assertion pass rate.
   * When > 1, `EvalCaseResult.assertionPassRate` is populated and `pass` is determined
   * by `accuracyThreshold` rather than a single run.
   * @default 1
   */
  iterations?: number;

  /**
   * Minimum accuracy (0–1) required to pass when `iterations > 1`.
   * @default 1.0 (all iterations must pass)
   */
  accuracyThreshold?: number;

  /**
   * Number of times to invoke the LLM judge per `passesJudge` assertion.
   * Scores are averaged; the mean must meet the threshold to pass.
   * Reduces judge variance caused by non-determinism.
   * Per-assertion `passesJudge.reps` overrides this value.
   * @default 1
   */
  judgeReps?: number;

  /**
   * Golden/expected answer for this case.
   * When set, automatically passed as `reference` to the LLM judge
   * (unless passesJudge.reference is explicitly provided).
   * Mirrors EvalV2's `canonical_answer` field.
   */
  canonicalAnswer?: string;

  /**
   * Arbitrary string labels for this case.
   * Use for filtering eval runs with `EvalRunnerOptions.filterTags`
   * and for slicing results by category.
   *
   * @example ['tool-finding', 'multi-hop', 'search']
   */
  tags?: string[];

  /**
   * Expectations to validate against the tool response
   *
   * Multiple expectations can be combined and will all be validated.
   *
   * @example
   * ```json
   * {
   *   "id": "weather-london",
   *   "toolName": "get_weather",
   *   "args": { "city": "London" },
   *   "expect": {
   *     "containsText": ["temperature", "conditions"],
   *     "schema": "WeatherResponse",
   *     "responseSize": { "maxBytes": 10000 },
   *     "isError": false
   *   }
   * }
   * ```
   */
  expect?: EvalExpectBlock;
}
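The interaction between `iterations` and `accuracyThreshold` can be pictured as a pass-rate calculation. This is a sketch under stated assumptions, not the library's code: `passesThreshold` is a hypothetical helper, and the use of `>=` follows from the field being described as a minimum accuracy.

```typescript
// Hypothetical helper (not the library's code): derives a pass rate from
// per-iteration outcomes and compares it with the threshold.
// Assumes at least one iteration result is supplied.
function passesThreshold(
  iterationResults: boolean[],
  accuracyThreshold = 1.0
): { assertionPassRate: number; pass: boolean } {
  const passed = iterationResults.filter(Boolean).length;
  const assertionPassRate = passed / iterationResults.length;
  return { assertionPassRate, pass: assertionPassRate >= accuracyThreshold };
}

// Four of five iterations passed.
const outcomes = [true, true, false, true, true];

// With a threshold of 0.8 the case passes.
const lenient = passesThreshold(outcomes, 0.8);

// With the default threshold of 1.0 every iteration must pass, so it fails.
const strict = passesThreshold(outcomes);
```

A threshold below 1.0 is mainly useful for `mcp_host` cases, where the LLM's tool selection can vary between runs.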

EvalDataset

interface EvalDataset {
  name: string;
  description?: string;
  cases: EvalCase[];
  metadata?: Record<string, unknown>;
  schemas?: Record<string, ZodSchema>; // Zod schemas for toMatchToolSchema assertions
}
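As an illustration, a dataset might combine a `direct` case and an `mcp_host` case and use `tags` to slice results by category. The types below are trimmed local stand-ins for the documented interfaces, the dataset contents are invented, and `caseIdsByTag` is a hypothetical helper rather than a library function.

```typescript
// Minimal local stand-ins for the documented types (only the fields used here).
interface EvalCase {
  id: string;
  mode?: 'direct' | 'mcp_host';
  toolName?: string;
  args?: Record<string, unknown>;
  scenario?: string;
  tags?: string[];
  expect?: Record<string, unknown>;
}

interface EvalDataset {
  name: string;
  description?: string;
  cases: EvalCase[];
}

// Invented example data: one direct case and one mcp_host case.
const dataset: EvalDataset = {
  name: 'weather-evals',
  cases: [
    {
      id: 'weather-london',
      toolName: 'get_weather',
      args: { city: 'London' },
      tags: ['weather', 'direct'],
      expect: { containsText: ['temperature', 'conditions'], isError: false },
    },
    {
      id: 'umbrella-advice',
      mode: 'mcp_host',
      scenario: 'Get the weather for London and tell me if I need an umbrella',
      tags: ['weather', 'multi-hop'],
      expect: {
        toolsTriggered: {
          calls: [{ name: 'get_weather', arguments: { city: 'London' } }],
        },
        toolCallCount: { min: 1 },
      },
    },
  ],
};

// Hypothetical helper: group case ids by tag for per-category reporting.
function caseIdsByTag(ds: EvalDataset): Map<string, string[]> {
  const byTag = new Map<string, string[]>();
  for (const evalCase of ds.cases) {
    for (const tag of evalCase.tags ?? []) {
      byTag.set(tag, [...(byTag.get(tag) ?? []), evalCase.id]);
    }
  }
  return byTag;
}

const grouped = caseIdsByTag(dataset);
```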

Next Steps