classifier-evals

Offline classifier evaluation harness for intent classification systems. Provides dataset loading, confusion matrices, LLM-as-judge with cost accounting, regression gates for CI, and Phoenix/Langfuse exporters.

Quick Start

# Install
npm install -g classifier-evals

# Run evaluation on a dataset
classifier-evals eval --dataset test-set.csv --format json --output results.json

# Check regression gates
classifier-evals gates --results results.json --gates gates.yaml

Features

Multi-format dataset loading — CSV, JSON, JSONL
Confusion matrix analysis — Multi-class matrices with per-class metrics
Classification metrics — Accuracy, precision, recall, F1 (macro/micro/weighted), MCC, Cohen's kappa
LLM-as-judge — Cost-aware evaluation with consensus voting
Regression gates — CI-integrated quality gates with baseline comparison
Observability — Phoenix and Langfuse exporters, OpenTelemetry tracing
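To make the metrics above concrete, here is a minimal sketch of how a confusion matrix, accuracy, and macro F1 relate to each other. The names `Pair`, `confusionMatrix`, `accuracy`, and `macroF1` are hypothetical helpers written for illustration, not the package's exported API, and the convention of counting labels that appear only as predictions is an assumption.

```typescript
// Hypothetical helper type for illustration; not the package's API.
type Pair = { label: string; predicted: string };

// Build a square matrix: rows are true labels, columns are predicted labels.
function confusionMatrix(pairs: Pair[]): { labels: string[]; matrix: number[][] } {
  const seen = new Set<string>();
  for (const p of pairs) {
    seen.add(p.label);
    seen.add(p.predicted); // assumption: prediction-only labels get a row too
  }
  const labels = Array.from(seen).sort();
  const index = new Map<string, number>();
  labels.forEach((l, i) => index.set(l, i));
  const matrix = labels.map(() => labels.map(() => 0));
  for (const p of pairs) {
    matrix[index.get(p.label)!][index.get(p.predicted)!] += 1;
  }
  return { labels, matrix };
}

// Accuracy is the diagonal sum divided by the total count.
function accuracy(matrix: number[][]): number {
  let correct = 0;
  let total = 0;
  matrix.forEach((row, i) =>
    row.forEach((n, j) => {
      total += n;
      if (i === j) correct += n;
    })
  );
  return total === 0 ? 0 : correct / total;
}

// Macro F1 is the unweighted mean of per-class F1 scores.
function macroF1(matrix: number[][]): number {
  const k = matrix.length;
  if (k === 0) return 0;
  let sum = 0;
  for (let c = 0; c < k; c++) {
    const tp = matrix[c][c];
    const actual = matrix[c].reduce((a, b) => a + b, 0); // row sum
    let predicted = 0; // column sum
    for (let r = 0; r < k; r++) predicted += matrix[r][c];
    const precision = predicted === 0 ? 0 : tp / predicted;
    const recall = actual === 0 ? 0 : tp / actual;
    const denom = precision + recall;
    sum += denom === 0 ? 0 : (2 * precision * recall) / denom;
  }
  return sum / k;
}

// The two rows from the CSV example in this README.
const exampleMatrix = confusionMatrix([
  { label: "password_reset", predicted: "password_reset" },
  { label: "cancel_subscription", predicted: "refund_request" },
]);
console.log(accuracy(exampleMatrix.matrix)); // 0.5
console.log(macroF1(exampleMatrix.matrix)); // 1/3: only password_reset has a nonzero F1
```

Macro averaging treats every class equally, so a rare class that is always misclassified pulls the score down as much as a common one would.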

Dataset Format

CSV

text,label,predicted_label,confidence
"Reset my password",password_reset,password_reset,0.95
"Cancel my subscription",cancel_subscription,refund_request,0.72

JSONL

{"text": "Reset my password", "label": "password_reset", "predicted_label": "password_reset", "confidence": 0.95}

Required Fields

Field	Required	Description
`text`	yes	Input text that was classified
`label`	yes	Ground truth label
`predicted_label`	yes	Model's predicted label
`confidence`	no	Model's confidence score (0-1)
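The field rules in the table can be expressed as a small validator. This is a sketch only: `Sample` and `parseSampleLine` are hypothetical names chosen for illustration and are not confirmed to match the package's own loader or types.

```typescript
// Hypothetical record type mirroring the table above; not the package's exported type.
interface Sample {
  text: string;
  label: string;
  predicted_label: string;
  confidence?: number;
}

// Parse one JSONL line and enforce the required fields. Throws on invalid input.
function parseSampleLine(line: string): Sample {
  const raw: unknown = JSON.parse(line);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("each line must be a JSON object");
  }
  const rec = raw as Record<string, unknown>;
  for (const field of ["text", "label", "predicted_label"]) {
    if (typeof rec[field] !== "string") {
      throw new Error("missing or non-string required field: " + field);
    }
  }
  const sample: Sample = {
    text: rec["text"] as string,
    label: rec["label"] as string,
    predicted_label: rec["predicted_label"] as string,
  };
  // confidence is optional, but when present it must lie in [0, 1].
  const confidence = rec["confidence"];
  if (confidence !== undefined) {
    if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
      throw new Error("confidence must be a number between 0 and 1");
    }
    sample.confidence = confidence;
  }
  return sample;
}

const parsed = parseSampleLine(
  '{"text": "Reset my password", "label": "password_reset", "predicted_label": "password_reset", "confidence": 0.95}'
);
console.log(parsed.label); // password_reset
```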

CLI Commands

eval

Run a full evaluation pipeline:

classifier-evals eval \
  --dataset datasets/test-set.csv \
  --format json \
  --output results.json

compare

Compare two model evaluations:

classifier-evals compare \
  --baseline results/model-v1.json \
  --candidate results/model-v2.json \
  --output comparison.json
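Conceptually, a comparison reduces to per-metric deltas between two runs. The sketch below assumes each results file carries a flat map of metric names to numbers (in line with `evalRun.metrics.accuracy` under Library Usage); `compareMetrics` and `MetricDelta` are hypothetical names, and the real layout of comparison.json may differ.

```typescript
// Assumption: metrics are a flat map of metric name to numeric value.
type Metrics = Record<string, number>;

interface MetricDelta {
  metric: string;
  baseline: number;
  candidate: number;
  delta: number; // candidate minus baseline; positive means the candidate scored higher
}

// Compare only the metrics that both runs report, in a stable (sorted) order.
function compareMetrics(baseline: Metrics, candidate: Metrics): MetricDelta[] {
  return Object.keys(baseline)
    .filter((m) => m in candidate)
    .sort()
    .map((m) => ({
      metric: m,
      baseline: baseline[m],
      candidate: candidate[m],
      delta: candidate[m] - baseline[m],
    }));
}

// Illustrative numbers only, not real results.
const deltas = compareMetrics(
  { accuracy: 0.8, f1_macro: 0.75 },
  { accuracy: 0.85, f1_macro: 0.7 }
);
for (const d of deltas) {
  console.log(d.metric, d.delta > 0 ? "improved" : d.delta < 0 ? "regressed" : "unchanged");
}
```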

gates

Check regression gates for CI:

classifier-evals gates \
  --results results/latest.json \
  --gates gates.yaml

judge

Run LLM-as-judge on samples:

classifier-evals judge \
  --samples misclassifications.jsonl \
  --model claude-opus \
  --budget 50.00
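The idea behind consensus voting under a budget can be sketched as follows. Everything here is an assumption made for illustration: the `Judge` function type, the two-valued `Verdict`, and the rule of stopping before the next call once spending reaches the budget. The sketch makes no real API calls and does not describe how the `judge` command is actually implemented.

```typescript
type Verdict = "correct" | "incorrect";

interface JudgeCall {
  verdict: Verdict;
  costUsd: number; // cost of this single call, known only after it returns
}

// Hypothetical judge signature: any async function that rates a prediction.
type Judge = (text: string, predictedLabel: string) => Promise<JudgeCall>;

interface ConsensusResult {
  verdict: Verdict | "no_consensus";
  votes: Record<Verdict, number>;
  spentUsd: number;
}

// Ask judges in order, stop once the budget is used up, then take the majority.
async function judgeWithConsensus(
  sample: { text: string; predicted_label: string },
  judges: Judge[],
  budgetUsd: number
): Promise<ConsensusResult> {
  const votes: Record<Verdict, number> = { correct: 0, incorrect: 0 };
  let spentUsd = 0;
  for (const judge of judges) {
    // The check runs before each call, so the final call may overshoot slightly.
    if (spentUsd >= budgetUsd) break;
    const call = await judge(sample.text, sample.predicted_label);
    votes[call.verdict] += 1;
    spentUsd += call.costUsd;
  }
  let verdict: ConsensusResult["verdict"] = "no_consensus"; // ties and zero votes
  if (votes.correct > votes.incorrect) verdict = "correct";
  else if (votes.incorrect > votes.correct) verdict = "incorrect";
  return { verdict, votes, spentUsd };
}
```

Using an odd number of judges avoids ties when every judge gets to vote; a tight budget can still cut the panel short, which is why the result reports the vote counts alongside the verdict.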

export

Generate a report or send results to an exporter:

classifier-evals export \
  --results results/latest.json \
  --format html \
  --output reports/eval-report.html

Regression Gates

Configure quality gates in YAML:

# gates.yaml
gates:
  - name: overall-accuracy
    type: threshold
    metric: accuracy
    operator: ">="
    threshold: 0.85

  - name: macro-f1
    type: threshold
    metric: f1_macro
    operator: ">="
    threshold: 0.80

  - name: no-regression
    type: baseline-comparison
    baseline_path: results/baseline.json
    metric: f1_macro
    allow_regression_in: 0

CI Integration

# .github/workflows/eval.yml
name: Classifier Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run evaluation
        run: |
          npx classifier-evals eval \
            --dataset datasets/test-set.csv \
            --output results.json
      
      - name: Check gates
        run: |
          npx classifier-evals gates \
            --results results.json \
            --gates gates.yaml

Library Usage

import { loadDataset, createEvalRunFromSamples } from 'classifier-evals';

// Load dataset
const dataset = await loadDataset('test-set.csv');

// Create evaluation run (computes confusion matrix and all metrics)
const evalRun = createEvalRunFromSamples({
  samples: dataset.samples,
});

// Access results
console.log(`Accuracy: ${evalRun.metrics.accuracy}`);
console.log(`Macro F1: ${evalRun.metrics.f1_macro}`);
console.log(`Confusion matrix labels: ${evalRun.confusion_matrix.labels}`);

Environment Variables

Variable	Description
`OPENAI_API_KEY`	OpenAI API key for LLM judge
`ANTHROPIC_API_KEY`	Anthropic API key for LLM judge
`LANGFUSE_PUBLIC_KEY`	Langfuse public key
`LANGFUSE_SECRET_KEY`	Langfuse secret key
`OTEL_EXPORTER_OTLP_ENDPOINT`	OpenTelemetry endpoint

Documentation

AGENTS.md — Agent development guide
ARCHITECTURE.md — System design deep dive
DEV_PLAN.md — Development checklist

Performance Testing

Run the opt-in large-dataset performance suite separately from the default unit tests:

npm run test:perf

The performance suite uses deterministic synthetic datasets and validates the real loader, metrics, and regression gate path on 10k+ samples.
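One way to obtain deterministic synthetic data is a seeded pseudo-random generator, sketched below. This illustrates the general technique and is not the suite's actual generator: `seededRandom` (a small mulberry32-style PRNG), `syntheticSamples`, and the `errorRate` parameter are all hypothetical.

```typescript
// Small seeded PRNG (mulberry32-style) returning floats in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface SyntheticSample {
  text: string;
  label: string;
  predicted_label: string;
}

// Generate n samples; with probability errorRate the prediction is redrawn at
// random, so it may or may not match the true label.
function syntheticSamples(
  n: number,
  labels: string[],
  errorRate: number,
  seed: number
): SyntheticSample[] {
  const rand = seededRandom(seed);
  const pick = () => labels[Math.floor(rand() * labels.length)];
  const out: SyntheticSample[] = [];
  for (let i = 0; i < n; i++) {
    const label = pick();
    const predicted_label = rand() < errorRate ? pick() : label;
    out.push({ text: "synthetic sample " + i, label, predicted_label });
  }
  return out;
}

// The same seed always yields the same dataset, which keeps perf runs comparable.
const perfData = syntheticSamples(10000, ["password_reset", "cancel_subscription", "refund_request"], 0.2, 42);
console.log(perfData.length); // 10000
```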

Deployment

Infrastructure as Code

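The infra/ directory contains Terraform modules and environment configurations for deploying classifier-evals to multiple cloud platforms. Each environment below follows the same terraform init, plan, apply sequence with a prod.tfvars variable file.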

AWS (Amazon Web Services)

Deploy to AWS using ECS Fargate:

cd infra/environments/aws-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

ECS Cluster with Fargate service
Application Load Balancer
ECR repository for container images
CloudWatch logs and alarms

Azure

Deploy to Azure using Container Apps:

cd infra/environments/azure-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

Container Apps Environment
Container App with autoscaling
Container Registry
Application Insights for monitoring

GCP (Google Cloud Platform)

Deploy to GCP using Cloud Run:

cd infra/environments/gcp-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

Cloud Run service
Cloud Build for CI/CD
Artifact Registry
Cloud Monitoring and Logging

OCI (Oracle Cloud Infrastructure)

Deploy to OCI using OKE (Kubernetes):

cd infra/environments/oci-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

OKE Kubernetes cluster
Load Balancer
Container Engine for Kubernetes
Monitoring and Logging

Netlify

Deploy to Netlify for serverless functions:

cd infra/environments/netlify-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

Netlify site
Serverless functions
Custom domain configuration
Build hooks for CI/CD

Vercel

Deploy to Vercel for edge functions:

cd infra/environments/vercel-production
terraform init
terraform plan -var-file="prod.tfvars"
terraform apply -var-file="prod.tfvars"

Resources created:

Vercel project
Edge functions
Preview deployments
Custom domain configuration

Docker Deployment

# Build the Docker image
docker build -t classifier-evals .

# Run locally
docker run -p 3000:3000 \
  -e OPENAI_API_KEY=your-key \
  -e ANTHROPIC_API_KEY=your-key \
  classifier-evals

# Push to registry
docker tag classifier-evals registry.example.com/classifier-evals:latest
docker push registry.example.com/classifier-evals:latest

Docker Compose (Development)

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f classifier-evals

# Stop all services
docker-compose down

Template Repository

This repository can be used as a GitHub template. The "Template repository" toggle itself is a GitHub repository setting and is not stored in source control; what the repository does track is the project structure and documentation that template consumers need.

License

MIT
