Alignmenter


Persona-aligned evaluation for conversational AI

📚 Documentation • Overview • Quickstart • Quick Start Guide • Contributing • License


Overview

Alignmenter is a production-ready evaluation toolkit for teams shipping AI copilots and chat experiences. It verifies that your AI stays on-brand, safe, and stable across model updates.

Three-Dimensional Evaluation

  • 🎨 Authenticity – Does the AI match your brand voice? Measures semantic similarity, linguistic traits, and lexicon compliance (see the sketch after this list).
  • 🛡️ Safety – Does it avoid harmful outputs? Combines keyword rules, LLM judges, and offline ML classifiers.
  • ⚖️ Stability – Are responses consistent? Detects semantic drift and variance across sessions.
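
For intuition, here is a minimal sketch of the embedding-similarity piece of an authenticity check. It is illustrative only (the real scorer also weighs linguistic traits and lexicon compliance, and may aggregate differently) and assumes the same all-MiniLM-L6-v2 model the quickstart uses:

# Illustrative sketch, not Alignmenter's internals.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

persona_examples = [
    "Our baseline analysis indicates a 15% improvement.",
    "The signal-to-noise ratio suggests this approach is viable.",
]
reply = "Baseline metrics point to a 15% improvement in throughput."

# Compare the reply to the centroid of the persona examples
# (one plausible aggregation among several).
persona_vecs = model.encode(persona_examples, convert_to_tensor=True)
reply_vec = model.encode(reply, convert_to_tensor=True)
centroid = persona_vecs.mean(dim=0)
print(f"semantic similarity: {util.cos_sim(reply_vec, centroid).item():.2f}")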

Why Alignmenter?

Unlike generic LLM evaluation frameworks, Alignmenter is purpose-built for persona alignment:

  • Persona packs: Define your brand voice in YAML with examples, lexicon, and traits
  • Offline-first: Works without constant API calls (optional LLM judge for higher accuracy)
  • Budget-aware: Built-in cost tracking and guardrails
  • Reproducible: Deterministic scoring, full audit trails
  • Privacy-focused: Local models available, sanitize production data before evaluation

Quickstart

Installation

Option 1 · PyPI (recommended for most users)

pip install "alignmenter[safety]"
alignmenter init
alignmenter run --config configs/run.yaml --embedding sentence-transformer:all-MiniLM-L6-v2

Use this path when you want to try Alignmenter quickly, run it in CI, or install it inside a production environment.

Option 2 · From Source (for case studies & contributing)

git clone https://github.com/justinGrosvenor/alignmenter.git
cd alignmenter
pip install -e ./alignmenter[dev,safety]

This installs the CLI plus the case-study assets under alignmenter/case-studies/, which are excluded from the PyPI wheel. Ideal when you want to reproduce the Wendy's walkthrough or contribute code.

Run Your First Evaluation

# Set API key (for embeddings and optional judge)
export OPENAI_API_KEY="your-key-here"

# Run demo evaluation (regenerates transcripts via the selected provider)
alignmenter run \
  --model openai:gpt-4o-mini \
  --dataset datasets/demo_conversations.jsonl \
  --persona configs/persona/default.yaml \
  --embedding sentence-transformer:all-MiniLM-L6-v2

# Reuse recorded transcripts (default behavior)
alignmenter run --config configs/run.yaml --embedding sentence-transformer:all-MiniLM-L6-v2

# View interactive report
alignmenter report --last

# Sanitize a dataset (dry run shows sample output)
alignmenter dataset sanitize datasets/demo_conversations.jsonl --dry-run

# Generate fresh transcripts (requires provider access + API keys)
alignmenter run --config configs/run.yaml --generate-transcripts

Output:

Loading dataset: 60 turns across 10 sessions
✓ Brand voice score: 0.82 (range: 0.78-0.86)
✓ Safety score: 0.97
✓ Consistency score: 0.94
Report written to: reports/demo/2025-11-03T00-14-01_alignmenter_run/index.html
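
The --dataset flag expects JSONL with one conversation per line. The record below is a hypothetical shape for illustration only (field names are assumptions, wrapped here for readability); datasets/demo_conversations.jsonl is the authoritative reference:

# Hypothetical record; real field names may differ.
{"session_id": "demo-001",
 "turns": [
   {"role": "user", "content": "Can you summarize last week's metrics?"},
   {"role": "assistant", "content": "Our baseline analysis indicates a 15% improvement."}]}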

📚 Documentation

Full documentation available at docs.alignmenter.com



Case Studies

  • Wendy's Twitter Voice - Full calibration walkthrough, reproduction steps, and diagnostics for a high-sass social persona. (Requires installing from this repo so the case-studies/ assets are present.)

Repository Structure

alignmenter/
├── alignmenter/           # 🐍 Main Python package (CLI, scorers, reporters)
│   ├── src/alignmenter/   # Source code
│   ├── tests/             # Test suite (69+ tests)
│   ├── configs/           # Example configs and persona packs
│   ├── datasets/          # Demo conversation data
│   ├── scripts/           # Utility scripts (bootstrap, calibrate, sanitize)
│   └── README.md          # 📖 Complete CLI documentation
│
├── docs/                  # 📚 Documentation and specifications
│   ├── persona_annotation.md      # Annotation workflow guide
│   ├── offline_safety.md          # Offline safety classifier docs
│   ├── alignmenter_requirements.md # Product specification
│   └── competitive_landscape.md   # vs OpenAI Evals, LangSmith
│
├── assets/                # 🎨 Branding assets
│   ├── alignmenter-banner.png
│   ├── alignmenter-transparent.png
│   └── alignmenter.png
│
├── marketing/             # 🌐 Next.js marketing website
│
└── LICENSE                # Apache 2.0

Package Overview

The core evaluation toolkit lives in alignmenter/:

Component   Description
---------   -----------------------------------------------------------
CLI         alignmenter run, calibrate-persona, bootstrap-dataset, etc.
Scorers     Authenticity, safety, and stability metric engines
Providers   OpenAI, Anthropic, local (vLLM, Ollama) integrations
Reporters   HTML report cards, JSON exports, CSV downloads
Datasets    Demo conversations, sanitization tools
Personas    Brand voice definitions (YAML format)

Key Features

🎯 Persona-First Design

Define your brand voice declaratively:

# configs/persona/mybot.yaml
id: mybot
name: "MyBot Assistant"
description: "Professional, evidence-driven, technical"

voice:
  tone: ["professional", "precise", "measured"]
  formality: "business_casual"

  lexicon:
    preferred:
      - "baseline"
      - "signal"
      - "alignment"
    avoided:
      - "lol"
      - "bro"
      - "hype"

examples:
  - "Our baseline analysis indicates a 15% improvement."
  - "The signal-to-noise ratio suggests this approach is viable."

📊 Interactive Reports

  • Report cards with overall grades (A/B/C)
  • Interactive charts (Chart.js visualizations)
  • Calibration diagnostics (bootstrap confidence intervals, judge agreement)
  • Reproducibility section (Python version, model, timestamps)
  • Export to CSV/JSON for custom analysis (see the sketch after this list)
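
Exports plug into standard tooling. A post-processing sketch (the path and column names are assumptions for illustration, not the documented schema):

# Hypothetical post-processing of a CSV export; path and columns are assumed.
import pandas as pd

df = pd.read_csv("reports/demo/latest/scores.csv")
print(df.groupby("session_id")["authenticity"].mean().sort_values())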

🔧 Production-Ready

  • Multi-provider support: OpenAI, Anthropic, vLLM, Ollama
  • Budget guardrails: Halts runs at 90% of the judge API budget (see the sketch after this list)
  • Cost projection: Estimate expenses before execution
  • PII sanitization: Built-in scrubbing with alignmenter dataset sanitize
  • Offline mode: Works without internet using local models
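
The halt rule is simple to reason about. A minimal sketch of the 90% threshold (illustrative only, not Alignmenter's internals):

# Illustrative only: halt judge calls once spend reaches 90% of budget.
def should_halt(spent_usd: float, budget_usd: float, threshold: float = 0.90) -> bool:
    return spent_usd >= threshold * budget_usd

assert should_halt(9.0, 10.0)      # at 90%: halt
assert not should_halt(8.9, 10.0)  # under 90%: keep going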

🧪 Developer Experience

  • CLI-first: Simple commands for all workflows
  • Python API: Programmatic access for custom pipelines
  • Type-safe: Full type hints throughout
  • Well-tested: 69+ unit tests with pytest
  • CI/CD ready: GitHub Actions examples included (see the workflow sketch after this list)
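
The bundled examples are authoritative; the workflow below is a minimal sketch that reuses the quickstart invocation and assumes an OPENAI_API_KEY repository secret:

# .github/workflows/alignmenter.yml (illustrative sketch, not the bundled example)
name: persona-eval
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "alignmenter[safety]"
      - run: alignmenter run --config configs/run.yaml --embedding sentence-transformer:all-MiniLM-L6-v2
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}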

Use Cases

🏢 Enterprise AI Teams

  • Pre-deployment testing: Verify brand voice before shipping
  • Regression testing: Catch drift when updating models
  • A/B testing: Compare GPT-4 vs Claude vs fine-tuned models
  • Compliance audits: Generate safety scorecards for regulators

🚀 Startups Building AI Products

  • Rapid iteration: Test persona changes in CI/CD
  • Budget constraints: Use offline classifiers to reduce API costs
  • Multi-tenant: Different personas for different customers
  • Quality assurance: Automated checks on every release

🎓 Research & Academia

  • Persona fidelity studies: Measure alignment with human raters
  • Safety benchmarks: Compare classifier performance
  • Ablation studies: Test impact of different scoring components
  • Reproducible results: Deterministic scoring with fixed seeds

Roadmap

Completed ✅

  • Three-dimensional scoring (authenticity, safety, stability)
  • Multi-provider support (OpenAI, Anthropic, local models)
  • HTML report cards with interactive charts
  • Offline safety classifier (distilled-safety-roberta)
  • LLM judges for qualitative analysis
  • Budget guardrails and cost tracking
  • PII sanitization tools
  • Calibration workflow and diagnostics

In Progress 🚧

  • Multi-language support (non-English personas)
  • Batch processing optimizations
  • Additional embedding providers

Future Considerations 💭

  • Synthetic test case generation
  • Custom metric plugins
  • Advanced trait models (neural networks)

Contributing

We welcome contributions from the community!

Ways to Contribute

  • πŸ› Bug Reports: File issues with reproducible examples
  • ✨ Feature Requests: Propose new scorers, providers, or workflows
  • πŸ“ Documentation: Improve guides, add examples
  • πŸ§ͺ Tests: Expand test coverage
  • 🎨 Persona Packs: Share brand voice configs for common use cases

Development Workflow

# Fork and clone
git clone https://github.com/justinGrosvenor/alignmenter.git
cd alignmenter/alignmenter

# Install with dev dependencies
pip install -e .[dev,safety]

# Run tests
pytest

# Run linter
ruff check src/ tests/

# Format code
black src/ tests/

# Submit PR
# - Keep functions small and composable
# - Add tests for new features
# - Update documentation


Citation

If you use Alignmenter in research, please cite:

@software{alignmenter2025,
  title={Alignmenter: A Framework for Persona-Aligned Conversational AI Evaluation},
  author={Alignmenter Contributors},
  year={2025},
  url={https://github.com/justinGrosvenor/alignmenter},
  license={Apache-2.0}
}

Open Source Model

Alignmenter is built as open core:

Open Source (Apache 2.0):

  • CLI and all evaluation tools
  • Scorers, reporters, and providers
  • Persona packs and datasets
  • Documentation and examples

Proprietary (Hosted Service):

  • Web dashboard and team features
  • Audit trails and compliance reports
  • Managed infrastructure
  • Enterprise support

💡 Get Started: Use the open-source CLI today. Contact us for hosted features.


License

Apache License 2.0

The CLI, scorers, and supporting libraries are licensed under the Apache License 2.0. This includes all code in the alignmenter/ directory.

Hosted and proprietary cloud components are not part of this repository and are subject to separate commercial terms.

See LICENSE for the full text.


Support

  • Documentation: docs.alignmenter.com
  • Get Help: file an issue on GitHub


⭐ Star us on GitHub • 🐦 Follow on Twitter • 🌐 Visit Website
