saralabiswal/agentops-eval-llmops

agentic-eval

Enterprise evaluation framework for LLM-powered agentic systems.

agentic-eval benchmarks whether an agentic workflow is grounded, relevant, stable, and operationally practical before it is trusted in governed business processes. It is built for pipelines that follow the pattern:

retrieve context -> reason with an LLM -> propose a typed action -> evaluate evidence

The project is mock-first, local-first, and production-minded. A clean checkout can run deterministic benchmarks with no API keys, then graduate to local Ollama or cloud-backed model evaluation when a team is ready.

Author: Sarala Biswal

Framework Scope

agentic-eval is the evaluation framework for governed agentic systems. It validates whether system outputs are faithful to retrieved context, relevant to the business scenario, consistent across repeated runs, and practical within latency expectations.

The banking-agentic-ai-platform integration is a supported system-under-test adapter for real multi-layer banking pipeline evaluation.

Business Problem

Enterprises are adopting LLM agents for workflows such as payment risk intervention, billing dispute resolution, churn prevention, customer servicing, and policy-guided decision support. These systems can produce well-structured answers while still being wrong in ways that matter:

  • The answer may not be grounded in the retrieved policy or customer context.
  • The model may invent facts that are not present in source systems.
  • The response may be relevant in tone but miss the actual business decision.
  • Retrieval may pass noisy documents that confuse the agent or waste tokens.
  • The same input may produce inconsistent decisions across repeated runs.
  • A model may look accurate but violate latency expectations for production use.

Traditional schema validation proves only that the response shape is correct. It does not show that the reasoning is faithful, stable, relevant, or efficient enough for a governed enterprise workflow.

How This App Solves It

agentic-eval turns governed business scenarios into repeatable benchmark cases. Each case defines the customer context, retrieved policy evidence, expected decision signals, hallucination guards, and score thresholds. The app then runs the system under test and evaluates the output across five dimensions: faithfulness, answer relevance, context precision, consistency, and latency/quality tradeoff.

The result is an evidence-backed benchmark report that helps teams answer:

  • Can this agent explain its decision from approved context?
  • Did the response address the business scenario that was asked?
  • Which policy documents were actually useful?
  • Does the same case produce the same decision over repeated runs?
  • Which model/backend gives the best quality within the latency budget?
  • Are there unsupported claims that should block release?

In practice, this gives product, engineering, risk, and governance teams a shared release gate for agentic systems instead of relying on manual review or one-off prompt testing.
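The model-selection question above (best quality within the latency budget) can be sketched as a small filter-then-rank step. This is a minimal illustration, not the project's implementation: `ModelResult`, `best_within_budget`, the model names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical summary of one model's benchmark run."""
    model: str
    quality: float     # aggregate quality score in [0, 1]
    latency_ms: float  # observed latency, e.g. a mean or p95


def best_within_budget(results: list[ModelResult], budget_ms: float) -> Optional[ModelResult]:
    """Return the highest-quality model whose latency fits the budget."""
    eligible = [r for r in results if r.latency_ms <= budget_ms]
    if not eligible:
        return None
    # Prefer higher quality; break ties with lower latency.
    return max(eligible, key=lambda r: (r.quality, -r.latency_ms))


# Illustrative numbers only.
results = [
    ModelResult("model-a", quality=0.91, latency_ms=4200.0),
    ModelResult("model-b", quality=0.86, latency_ms=1800.0),
    ModelResult("model-c", quality=0.80, latency_ms=900.0),
]
choice = best_within_budget(results, budget_ms=2000.0)
print(choice.model if choice else "no model fits the budget")
```

With a 2000 ms budget, the most accurate model is excluded for being too slow, which is the tradeoff the latency/quality dimension is meant to surface.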

What It Evaluates

| Dimension | Purpose | Enterprise risk addressed |
| --- | --- | --- |
| Faithfulness | Verifies factual claims against retrieved context | Unsupported claims and hallucinations |
| Answer relevance | Confirms the response addresses the scenario | Plausible but off-task answers |
| Context precision | Measures whether retrieved chunks were useful | Noisy retrieval and wasted tokens |
| Consistency | Re-runs the same case and compares outputs | Unstable decisions for identical inputs |
| Latency / quality | Compares quality against response time | Model choices that miss service-level expectations |
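As a rough sketch of how per-dimension scores might be turned into a pass/fail gate, the example below compares each score with a threshold. The dimension keys, the `gate_case` helper, and the numbers are assumptions for illustration; the real evaluator may combine scores differently.

```python
# Hypothetical keys for the five dimensions listed above.
DIMENSIONS = (
    "faithfulness",
    "answer_relevance",
    "context_precision",
    "consistency",
    "latency_quality",
)


def gate_case(scores: dict[str, float], thresholds: dict[str, float]) -> dict[str, object]:
    """Compare each dimension score with its threshold and summarize pass/fail."""
    failures = {}
    for name in DIMENSIONS:
        score = scores[name]
        minimum = thresholds[name]
        if score < minimum:
            failures[name] = {"score": score, "threshold": minimum}
    return {"passed": not failures, "failures": failures}


# Illustrative scores and thresholds, not real benchmark output.
scores = {
    "faithfulness": 0.92,
    "answer_relevance": 0.81,
    "context_precision": 0.55,
    "consistency": 0.97,
    "latency_quality": 0.70,
}
thresholds = {
    "faithfulness": 0.85,
    "answer_relevance": 0.75,
    "context_precision": 0.60,
    "consistency": 0.90,
    "latency_quality": 0.60,
}
verdict = gate_case(scores, thresholds)
print(verdict["passed"], sorted(verdict["failures"]))
```

Reporting every failing dimension, rather than stopping at the first, keeps the evidence useful for reviewers deciding whether a release should be blocked.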

Core Capabilities

  • Case-file-backed benchmark scenarios for repeatable governed evaluation.
  • Separate judge model and system-under-test model configuration.
  • Mock, Ollama, API, and banking platform execution modes.
  • FastAPI service for benchmark execution, runtime settings, results, test cases, and server-sent live progress events.
  • React dashboard for running benchmarks, browsing cases, inspecting results, comparing models, and understanding the architecture.
  • HTML and JSON report generation for human review and machine processing.
  • Startup report cleanup that keeps the latest generated report set manageable.
  • No secret values returned through configuration APIs.
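The startup report cleanup listed above could look something like the sketch below. It assumes reports are plain files in one directory and that "latest" means most recently modified; `prune_reports` and the `keep` parameter are hypothetical names, not the project's API.

```python
import os
import tempfile
from pathlib import Path


def prune_reports(report_dir: Path, keep: int) -> list[Path]:
    """Delete all but the `keep` most recently modified report files.

    Returns the removed paths. The tracked .gitkeep placeholder is never touched.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    reports = [p for p in report_dir.iterdir() if p.is_file() and p.name != ".gitkeep"]
    reports.sort(key=lambda p: p.stat().st_mtime, reverse=True)  # newest first
    stale = reports[keep:]
    for path in stale:
        path.unlink()
    return stale


# Demo in a throwaway directory with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / ".gitkeep").touch()
    for i in range(4):
        report = demo_dir / f"run-{i}.json"
        report.write_text("{}")
        os.utime(report, (1000 + i, 1000 + i))  # run-3 is the newest
    removed_names = sorted(p.name for p in prune_reports(demo_dir, keep=2))
    remaining_names = sorted(p.name for p in demo_dir.iterdir())
    print(removed_names, remaining_names)
```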

Run Modes

Mode Judge System under test Best for
Mock Deterministic mock judge Deterministic mock system CI, UI checks, repeatable local development
Local Ollama qwen2.5:7b llama3.2 No-key local model evaluation
API OpenAI via LiteLLM OpenAI via LiteLLM Higher-quality final evaluation
Hybrid API judge Ollama system Strong external judge over local model outputs
Banking Platform qwen2.5:7b or API judge banking-agentic-ai-platform pipeline Real multi-layer pipeline evaluation

The judge and system under test should remain separate. Asking the same model to grade itself can hide quality issues and introduce self-evaluation bias.

Architecture

[Figure: agentic-eval logical architecture]

Runtime Flow

  1. A user selects a benchmark preset and test case group in the dashboard.
  2. The UI saves runtime model settings through POST /config/runtime.
  3. The UI starts a run through POST /benchmark/run.
  4. FastAPI creates a run id and background task.
  5. BenchmarkEngine loads matching case files from test_cases/.
  6. The selected runner calls the system under test and captures raw response metadata.
  7. CompositeEvaluator scores the output across all evaluation dimensions.
  8. Reporters persist JSON and HTML outputs under reports/.
  9. The UI streams progress from /benchmark/events/{run_id} and reads results from /results.
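Step 9 relies on server-sent events. The sketch below shows one way a client might decode such a stream, following the standard `event:` / `data:` line format with blank-line delimiters. The event names and JSON payload fields in the sample are hypothetical; the actual events emitted by /benchmark/events/{run_id} are not documented here.

```python
import json
from collections.abc import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per server-sent event, decoding the data field as JSON."""
    event_name = "message"  # default event type when no "event:" line is sent
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield {"event": event_name, "data": json.loads("\n".join(data_lines))}
            event_name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].removeprefix(" "))


# Hypothetical stream content for illustration.
sample_stream = [
    "event: case_started\n",
    'data: {"case_id": "PR-001"}\n',
    "\n",
    ": keep-alive\n",
    "event: run_completed\n",
    'data: {"passed": true}\n',
    "\n",
]
events = list(iter_sse_events(sample_stream))
for event in events:
    print(event["event"], event["data"])
```

Any iterable of text lines works as input, so the same function can wrap a streamed HTTP response body once the run has been started through POST /benchmark/run.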

Code Walkthrough

This is the end-to-end path for one benchmark case.

| Step | Code path | What happens |
| --- | --- | --- |
| 1 | ui/src/pages/BenchmarkRunner.tsx | The user selects a preset and a case group, then clicks Run Benchmark. |
| 2 | ui/src/api/client.ts | updateRuntimeConfig() saves the selected judge and system models; startBenchmark() posts the run request. |
| 3 | eval/api/routers/config.py | POST /config/runtime validates the model/backend pairing and updates in-memory runtime settings. |
| 4 | eval/api/routers/benchmark.py | POST /benchmark/run creates the run id, event queue, and background task. |
| 5 | eval/benchmark.py | BenchmarkEngine.run() loads matching case files, chooses the runner, emits live events, and coordinates scoring. |
| 6 | eval/runners/ollama_runner.py, eval/runners/api_runner.py, or eval/runners/mock_runner.py | The selected runner calls the system under test and returns agent output, latency, model name, and raw metadata. |
| 7 | eval/evaluators/composite.py | CompositeEvaluator invokes each dimension evaluator and combines the scores. |
| 8 | eval/judge/client.py and eval/judge/prompts.py | Faithfulness and context precision use the configured judge client and versioned prompts. |
| 9 | eval/evaluators/answer_relevance.py and eval/evaluators/consistency.py | Embedding similarity checks relevance and repeated-run stability. |
| 10 | eval/reporters/json_reporter.py and eval/reporters/html_reporter.py | Reports are written to disk for machine processing and human review. |
| 11 | eval/api/routers/results.py and ui/src/pages/ResultDetail.tsx | The dashboard reads completed reports and shows scores, evidence, policy documents, and consistency outputs. |
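Step 9 mentions embedding similarity for repeated-run stability. One plausible way to score that, shown here as an assumption rather than the project's actual formula, is the mean pairwise cosine similarity between embeddings of the repeated outputs. The toy vectors stand in for real embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_score(embeddings: list[list[float]]) -> float:
    """Mean pairwise cosine similarity across embeddings of repeated runs."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy vectors: two runs agree, the third points in an unrelated direction.
stable_runs = [[0.2, 0.9, 0.1], [0.2, 0.9, 0.1], [0.2, 0.9, 0.1]]
drifting_runs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(round(consistency_score(stable_runs), 3), round(consistency_score(drifting_runs), 3))
```

A per-case threshold on this score would then flag cases where identical inputs lead to materially different outputs.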

Quick Start

Prerequisites

  • Python 3.12+
  • Node.js 20+
  • uv
  • corepack with pnpm
  • Docker and the Ollama CLI (needed only for local Ollama runs)

Deterministic Local Run

git clone https://github.com/saralabiswal/agentic-eval
cd agentic-eval
make install
cp .env.example .env
make demo

Default behavior uses mock judge + mock system. It requires no API key and is designed for repeatable development and CI checks.

Fully Local Ollama Run

make docker-up
make ollama-models
make ollama-smoke

Recommended local .env pairing:

EVAL_JUDGE_BACKEND=ollama
EVAL_JUDGE_MODEL=qwen2.5:7b

SUT_BACKEND=ollama
SUT_MODEL=llama3.2

This keeps evaluation local and keyless while preserving judge/system separation: Qwen judges Llama outputs.
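A small check along these lines could confirm that a given environment keeps the two roles apart. The variable names come from the .env pairing above; `RoleConfig`, `load_roles`, `is_self_evaluation`, and the "mock" fallback values are hypothetical and only illustrate the idea.

```python
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    """Backend and model for one role (judge or system under test)."""
    backend: str
    model: str


def load_roles(env: Mapping[str, str]) -> tuple[RoleConfig, RoleConfig]:
    """Read judge and system-under-test settings; unset values fall back to mock (assumed default)."""
    judge = RoleConfig(env.get("EVAL_JUDGE_BACKEND", "mock"), env.get("EVAL_JUDGE_MODEL", "mock"))
    sut = RoleConfig(env.get("SUT_BACKEND", "mock"), env.get("SUT_MODEL", "mock"))
    return judge, sut


def is_self_evaluation(judge: RoleConfig, sut: RoleConfig) -> bool:
    """True when a real (non-mock) model would grade its own output."""
    return judge.backend != "mock" and judge == sut


local_env = {
    "EVAL_JUDGE_BACKEND": "ollama",
    "EVAL_JUDGE_MODEL": "qwen2.5:7b",
    "SUT_BACKEND": "ollama",
    "SUT_MODEL": "llama3.2",
}
judge, sut = load_roles(local_env)
print(judge, sut, is_self_evaluation(judge, sut))
```

In a real process, `os.environ` can be passed in place of the literal dictionary.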

API-Backed Run

cp .env.example .env

Set:

EVAL_JUDGE_BACKEND=api
EVAL_JUDGE_MODEL=gpt-4o
SUT_BACKEND=api
SUT_MODEL=gpt-4o-mini
OPENAI_API_KEY=your-key

Use different models for the judge and the system under test. The judge assesses the system's output, so a normal API setup should not point both roles at the same model.

Then run:

make demo

Banking Platform Integration

Use this path when you want agentic-eval to evaluate the real banking-agentic-ai-platform pipeline instead of a direct mock, Ollama, or API model runner.

What this integration does:

  • Keeps agentic-eval on port 8001.
  • Calls banking-agentic-ai-platform on port 8000.
  • Sends each benchmark case to POST /pipeline/run.
  • Passes customer_id, scenario, and an uppercase blueprint.
  • Stores the platform response in benchmark raw metadata.
  • Scores the returned agent output with the configured judge model.

Prerequisites:

  • Clone and start banking-agentic-ai-platform.
  • Confirm the banking platform API is listening at http://localhost:8000.
  • Keep this evaluation API on http://localhost:8001.
  • Use a judge that is separate from the banking platform system under test.

In the banking platform repo, start the API with that repo's documented command. Then verify the pipeline endpoint:

curl -s -X POST http://localhost:8000/pipeline/run \
  -H "Content-Type: application/json" \
  -d '{"customer_id":"C001","scenario":"payment_risk_intervention","blueprint":"PAYMENT_RISK_INTERVENTION"}'
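The same request can be built from Python with only the standard library. This sketch assumes, based on the curl example, that the blueprint is simply the scenario name in uppercase; `build_pipeline_request` is a hypothetical helper and nothing is sent until `urlopen` is called.

```python
import json
import urllib.request


def build_pipeline_request(base_url: str, customer_id: str, scenario: str) -> urllib.request.Request:
    """Build (but do not send) a POST /pipeline/run request for one benchmark case."""
    payload = {
        "customer_id": customer_id,
        "scenario": scenario,
        # Assumption: the blueprint is the uppercased scenario name, as in the curl example.
        "blueprint": scenario.upper(),
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/pipeline/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_pipeline_request("http://localhost:8000", "C001", "payment_risk_intervention")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a running platform:
#   with urllib.request.urlopen(request, timeout=30) as response:
#       body = json.load(response)
```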

In this repo, configure .env:

BANKING_PLATFORM_ENABLED=true
BANKING_PLATFORM_URL=http://localhost:8000

EVAL_JUDGE_BACKEND=ollama
EVAL_JUDGE_MODEL=qwen2.5:7b

SUT_BACKEND=platform
SUT_MODEL=banking-platform

Start agentic-eval:

make dev
cd ui
corepack pnpm dev

Open http://localhost:5173, go to Settings, and use Banking Platform Integration -> Test Connection. Then go to Run Benchmark and choose the Banking Platform preset.

You can also run one case from the terminal:

uv run python -m eval.demo --backend platform --cases PR-001

The adapter lives in eval/runners/platform_runner.py, and benchmark dispatch is handled by eval/benchmark.py. If BANKING_PLATFORM_ENABLED=false, platform runs fail fast instead of silently falling back to mock output.
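The fail-fast behavior can be pictured with a guard like the one below. It is a sketch under assumptions: `PlatformDisabledError`, `parse_bool`, and `require_platform` are hypothetical names, and the accepted truthy strings are a guess at how the flag might be parsed.

```python
from collections.abc import Mapping


class PlatformDisabledError(RuntimeError):
    """Hypothetical error raised when a platform run is requested but disabled."""


def parse_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else counts as false."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def require_platform(env: Mapping[str, str]) -> str:
    """Return the platform URL, or raise instead of silently falling back to mock output."""
    if not parse_bool(env.get("BANKING_PLATFORM_ENABLED", "false")):
        raise PlatformDisabledError(
            "BANKING_PLATFORM_ENABLED is not true; refusing to run platform cases"
        )
    return env.get("BANKING_PLATFORM_URL", "http://localhost:8000")


enabled_env = {"BANKING_PLATFORM_ENABLED": "true", "BANKING_PLATFORM_URL": "http://localhost:8000"}
print(require_platform(enabled_env))

try:
    require_platform({"BANKING_PLATFORM_ENABLED": "false"})
except PlatformDisabledError as exc:
    print(f"blocked: {exc}")
```

Raising early keeps a misconfigured run from producing a report that looks valid but was scored against mock output.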

Running the Application

Start the API:

make dev

Start the UI:

cd ui
corepack pnpm dev

Open:

http://localhost:5173

The app opens on the About page so business users see the problem and solution first. The Evaluate navigation then follows the workflow: run a benchmark, review results, compare models, and monitor the dashboard.

The API listens on port 8001 so it can run beside the banking platform on port 8000.

API Surface

| Endpoint | Purpose |
| --- | --- |
| POST /benchmark/run | Start a benchmark in the background |
| GET /benchmark/{run_id} | Read run status or completed report |
| GET /benchmark/events/{run_id} | Stream live benchmark events |
| GET /results | List completed benchmark reports |
| GET /results/{run_id} | Fetch a full report |
| GET /cases | List case-file-backed test cases |
| GET /cases/{case_id} | Fetch one test case |
| GET /config | Return non-secret effective runtime config |
| POST /config/runtime | Update non-secret runtime model selection |
| POST /config/test-connection | Check backend reachability without echoing secrets |

Repository Layout

eval/
  api/          FastAPI routers, state, report loading, startup cleanup
  core/         Pydantic schemas, settings, runtime config, exceptions
  evaluators/   Faithfulness, relevance, precision, consistency, latency
  judge/        Mock, Ollama, and LiteLLM judge clients
  reporters/    JSON, HTML, and terminal report generation
  runners/      Mock, Ollama, API, platform, and direct runners

test_cases/     Benchmark case definitions
tests/          Unit and integration coverage
ui/             React + Vite dashboard
reports/        Generated reports; only .gitkeep is tracked

Planning materials are intentionally kept under docs/planning/ and ignored by git. Generated reports, build outputs, caches, and local virtual environments are also ignored.

Quality Gates

Run these before handing off a branch:

make test
make typecheck
make lint
cd ui && corepack pnpm typecheck

Optional release confidence checks:

make demo
make ollama-smoke
cd ui && corepack pnpm build

Security and Governance Notes

  • .env is ignored and must not be committed.
  • API keys are never returned from /config or connection-test responses.
  • Mock mode keeps tests deterministic and avoids accidental cloud calls.
  • Ollama mode calls the configured local OLLAMA_BASE_URL.
  • API mode requires an explicit key and should be used deliberately.
  • Judge and system settings are independent to avoid self-evaluation by default.
  • Every benchmark result records evidence, hallucination flags, raw response metadata, latency, and pass/fail status.

Adding Test Cases

Add case definition files under test_cases/<scenario>/.

Each case should include:

  • Customer profile input.
  • Retrieved policy chunks.
  • Scenario context.
  • Expected risk/action signals.
  • Hallucination guards.
  • Per-dimension thresholds.

Keep test cases as data. Evaluator behavior belongs in Python modules with focused unit or integration tests.
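To make the list of required case contents concrete, here is a rough in-memory shape for one case with a basic sanity check. The field names and sample values are hypothetical, and a standard-library dataclass is used only to keep the sketch dependency-free; the project itself states that it uses Pydantic models for cross-boundary schemas.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkCase:
    """Hypothetical in-memory form of one case definition file."""
    case_id: str
    customer_profile: dict[str, object]
    policy_chunks: list[str]
    scenario_context: str
    expected_signals: list[str]
    hallucination_guards: list[str]
    thresholds: dict[str, float] = field(default_factory=dict)

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the case looks usable."""
        problems = []
        if not self.policy_chunks:
            problems.append("at least one retrieved policy chunk is required")
        for name, value in self.thresholds.items():
            if not 0.0 <= value <= 1.0:
                problems.append(f"threshold for {name} must be in [0, 1], got {value}")
        return problems


# Illustrative values only; real cases live under test_cases/<scenario>/.
case = BenchmarkCase(
    case_id="PR-001",
    customer_profile={"customer_id": "C001"},
    policy_chunks=["Example policy text retrieved for this scenario."],
    scenario_context="payment_risk_intervention",
    expected_signals=["example_expected_action"],
    hallucination_guards=["example claim that must not appear"],
    thresholds={"faithfulness": 0.85, "consistency": 1.2},
)
print(case.validate())
```

Keeping validation separate from the data mirrors the guidance above: the case file stays declarative, and checks live in testable Python code.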

Development Standards

  • Use uv for Python dependency management.
  • Use pnpm through corepack for the UI.
  • Add type hints to Python function signatures.
  • Use Pydantic models for cross-boundary schemas.
  • Keep public classes and methods documented.
  • Prefer async I/O for runtime evaluation paths.
  • Never commit generated reports, cache directories, local planning files, virtual environments, or API keys.

License

MIT, as declared in pyproject.toml.

About

Evaluation framework for governed LLM agents — YAML-backed benchmark cases, independent judge and SUT backends (mock/Ollama/API/platform), faithfulness and relevance scoring, SSE streaming, and HTML/JSON reports
