This directory contains the MCP servers and infrastructure for the AssetOpsBench project.
- Prerequisites
- Quick Start
- Environment Variables
- MCP Servers — full reference in docs/mcp-servers.md
- Example queries
- Agents
- Observability
- Evaluation
- Running Tests
- Architecture
- Python 3.12+: required by pyproject.toml
- uv: dependency and environment manager
  curl -LsSf https://astral.sh/uv/install.sh | sh   # macOS / Linux
  # or: brew install uv
- Docker: for running CouchDB (IoT data store)
Run from the repo root:
uv sync

uv sync creates a virtual environment at .venv/, installs all dependencies, and registers the CLI entry points (plan-execute, *-mcp-server). You can either prefix commands with uv run (no activation needed) or activate the venv once for your shell session:

source .venv/bin/activate   # macOS / Linux

Copy .env.public to .env and fill in the required values (see Environment Variables):
cp .env.public .env
# Then edit .env and set WATSONX_APIKEY, WATSONX_PROJECT_ID
# CouchDB defaults work out of the box with the Docker setup

Start CouchDB:

docker compose -f src/couchdb/docker-compose.yaml up -d

Verify CouchDB is running:

curl -X GET http://localhost:5984/

Servers are stdio processes spawned on demand by the agent CLIs, so no manual startup is needed. Pick a runner and pass it a question:

uv run plan-execute "What sensors are on Chiller 6?"

See MCP Servers for available tools and docs/mcp-servers.md for launching a server directly.
CouchDB (iot, wo, and vibration servers)
| Variable | Default | Description |
|---|---|---|
| COUCHDB_URL | http://localhost:5984 | CouchDB connection URL |
| COUCHDB_USERNAME | admin | CouchDB admin username |
| COUCHDB_PASSWORD | password | CouchDB admin password |
| IOT_DBNAME | iot | IoT sensor database name |
| WO_DBNAME | workorder | Work order database name |
| VIBRATION_DBNAME | vibration | Vibration sensor database name |
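The defaults in the table above can be expressed as a small settings loader. This is only a sketch: `CouchDBSettings` and `load_couchdb_settings` are hypothetical names used for illustration, and the servers' actual configuration code may be organised differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CouchDBSettings:
    """Hypothetical container for the CouchDB variables in the table."""
    url: str
    username: str
    password: str
    iot_dbname: str
    wo_dbname: str
    vibration_dbname: str


def load_couchdb_settings(env=None):
    """Read each variable, falling back to the documented default."""
    env = os.environ if env is None else env
    return CouchDBSettings(
        url=env.get("COUCHDB_URL", "http://localhost:5984"),
        username=env.get("COUCHDB_USERNAME", "admin"),
        password=env.get("COUCHDB_PASSWORD", "password"),
        iot_dbname=env.get("IOT_DBNAME", "iot"),
        wo_dbname=env.get("WO_DBNAME", "workorder"),
        vibration_dbname=env.get("VIBRATION_DBNAME", "vibration"),
    )


if __name__ == "__main__":
    print(load_couchdb_settings())
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in unit tests without touching the real environment.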
WatsonX — plan-execute runner (when --model-id starts with watsonx/)
| Variable | Default | Description |
|---|---|---|
| WATSONX_APIKEY | (required) | IBM WatsonX API key |
| WATSONX_PROJECT_ID | (required) | IBM WatsonX project ID |
| WATSONX_URL | https://us-south.ml.cloud.ibm.com | WatsonX endpoint (optional) |
LiteLLM proxy — used by every runner whenever --model-id carries the litellm_proxy/ prefix (the default for claude-agent, openai-agent, deep-agent)
| Variable | Default | Description |
|---|---|---|
| LITELLM_API_KEY | (required) | LiteLLM proxy API key |
| LITELLM_BASE_URL | (required) | LiteLLM proxy base URL, e.g. https://your-litellm-host.example.com |
TokenRouter — OpenAI-compatible gateway, used whenever --model-id carries the tokenrouter/ prefix (the default for direct-llm-agent)
| Variable | Default | Description |
|---|---|---|
| TOKENROUTER_API_KEY | (tokenrouter/* models) | TokenRouter API key |
| TOKENROUTER_BASE_URL | (tokenrouter/* models) | TokenRouter base URL, e.g. https://api.tokenrouter.com/v1 |
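Because the required credentials depend on the --model-id prefix, a pre-flight check can catch a missing variable before a run starts. The sketch below is an illustration, not the runners' own logic: `REQUIRED_ENV_BY_PREFIX` and `missing_env_for_model` are hypothetical names, and it assumes both TokenRouter variables are needed for tokenrouter/* models.

```python
import os

# Prefix -> variables the tables above mark as required for that provider.
# Assumption: both TOKENROUTER_* variables are needed for tokenrouter/* models.
REQUIRED_ENV_BY_PREFIX = {
    "watsonx/": ("WATSONX_APIKEY", "WATSONX_PROJECT_ID"),
    "litellm_proxy/": ("LITELLM_API_KEY", "LITELLM_BASE_URL"),
    "tokenrouter/": ("TOKENROUTER_API_KEY", "TOKENROUTER_BASE_URL"),
}


def missing_env_for_model(model_id, env=None):
    """Return the required variables that are unset or empty for model_id.

    Model strings without a known prefix return an empty list, because
    this sketch has no requirements recorded for them.
    """
    env = os.environ if env is None else env
    for prefix, names in REQUIRED_ENV_BY_PREFIX.items():
        if model_id.startswith(prefix):
            return [name for name in names if not env.get(name)]
    return []


if __name__ == "__main__":
    model = "litellm_proxy/aws/claude-opus-4-6"
    missing = missing_env_for_model(model)
    if missing:
        print(f"{model}: set {', '.join(missing)} in .env")
```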
Stirrup code track — stirrup-agent with --code-backend docker
| Variable | Default | Description |
|---|---|---|
| STIRRUP_CODE_IMAGE | python:3.12-slim | Docker image for the code sandbox (build assetops-code for numpy/pandas/scipy) |
| DOCKER_HOST | (SDK default) | Daemon socket if non-default (e.g. Rancher: unix:///<home>/.rd/docker.sock) |
Six FastMCP servers cover IoT data, time-series ML, work orders, vibration diagnostics, failure-mode reasoning, and utility tools. They speak MCP over stdio and are spawned on-demand by the agent runners — no manual startup needed.
| Server | Tools | Categories | Backing service |
|---|---|---|---|
| iot | 4 | read | CouchDB |
| utilities | 3 | read | none |
| fmsr | 2 | read, LLM-use | LiteLLM + failure_modes.yaml |
| wo | 14 | read, write | CouchDB |
| tsfm | 6 | read, write, cpu-centric | IBM Granite TinyTimeMixer (torch) |
| vibration | 8 | read, cpu-centric | CouchDB + numpy/scipy DSP |
Tool signatures, required env vars, and how to launch a server directly: docs/mcp-servers.md.
The CLI examples below use a $query shell variable so you can swap in any question without editing the commands. Pick one of these to get started:
# Simple single-server queries
query="What sensors are on Chiller 6?"
query="Is LSTM model supported in TSFM?"
query="Get the work order of equipment CWC04013 for year 2017."
# Multi-step / multi-server queries
query="What is the current date and time? Also list assets at site MAIN. Also get sensor list and failure mode list for any of the chiller at site MAIN."

Six runners are available as CLIs registered by uv sync. Each takes a single positional question argument. Five use MCP tools and spawn the MCP servers as stdio subprocesses on demand; direct-llm-agent is a model-only baseline that makes a direct LiteLLM call without MCP tools, planning, retrieval, or code execution.
| Runner | Source | Loop | Default model |
|---|---|---|---|
| plan-execute | src/agent/plan_execute/ | Custom plan → execute → summarise (no SDK) | watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8 |
| claude-agent | src/agent/claude_agent/ | claude-agent-sdk agentic loop | litellm_proxy/aws/claude-opus-4-6 |
| openai-agent | src/agent/openai_agent/ | openai-agents SDK Runner | litellm_proxy/azure/gpt-5.4 |
| deep-agent | src/agent/deep_agent/ | LangChain deep-agents (LangGraph), MCP bridged via langchain-mcp-adapters | litellm_proxy/aws/claude-opus-4-6 |
| stirrup-agent | src/agent/stirrup_agent/ | Stirrup agent loop (in-process), MCP via its MCPToolProvider; code-capable (writes/runs Python) | watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8 |
| direct-llm-agent | src/agent/direct_llm_agent/ | Single direct LLM call, no MCP tools, planning, retrieval, or code execution | litellm_proxy/Azure/gpt-5-mini-2025-08-07 |
Stirrup specifics are covered in docs/stirrup-agent.md.
uv run plan-execute "$query"
uv run claude-agent "$query"
uv run openai-agent "$query"
uv run deep-agent "$query"
uv run stirrup-agent "$query"
uv run direct-llm-agent "$query"

| Flag | Description |
|---|---|
| --model-id MODEL_ID | Provider-prefixed model string (defaults in the runner table above) |
| --show-trajectory | Print each turn / step (text, tool calls, token usage) |
| --json | Emit the trajectory as JSON |
| --verbose | Show INFO-level logs on stderr |
| --run-id ID | Persist the run under this ID (auto-UUID4 if omitted); see Observability |
| --scenario-id ID | Tag the run for benchmark grouping |

Runner-specific flags:

| Flag | Runner | Description |
|---|---|---|
| --show-plan | plan-execute | Print the generated plan before execution |
| --max-turns N | claude-agent, openai-agent | Max agentic-loop turns (default: 30) |
| --recursion-limit N | deep-agent | Max LangGraph recursion steps (default: 100) |
| --code-enabled / --no-code | stirrup-agent | Enable (default) / disable code execution; selects the code track |
| --code-backend B | stirrup-agent | Code sandbox: docker (default), local, or e2b |
| --max-tokens N | stirrup-agent | Max output tokens per call; keep under provider limit (default: 16384) |
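When driving the runners from a script (for example, looping over benchmark questions), it can help to assemble the command line programmatically. The following is a minimal sketch using only flags documented in the tables above; `build_runner_command` is a hypothetical helper, not part of the project.

```python
import shlex

# Runner names as listed in the runner table.
RUNNERS = (
    "plan-execute", "claude-agent", "openai-agent",
    "deep-agent", "stirrup-agent", "direct-llm-agent",
)


def build_runner_command(runner, query, *, model_id=None, run_id=None,
                         scenario_id=None, json_output=False):
    """Build an argv list for one runner invocation (hypothetical helper)."""
    if runner not in RUNNERS:
        raise ValueError(f"unknown runner: {runner!r}")
    argv = ["uv", "run", runner]
    if model_id is not None:
        argv += ["--model-id", model_id]
    if run_id is not None:
        argv += ["--run-id", run_id]
    if scenario_id is not None:
        argv += ["--scenario-id", str(scenario_id)]
    if json_output:
        argv.append("--json")
    argv.append(query)  # the single positional question argument
    return argv


if __name__ == "__main__":
    argv = build_runner_command(
        "deep-agent", "What sensors are on Chiller 6?",
        run_id="bench-001", scenario_id=304,
    )
    # shlex.join quotes the question so the line can be pasted into a shell.
    print(shlex.join(argv))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with questions that contain spaces or punctuation.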
# Inspect the plan-execute plan before running
uv run plan-execute --show-plan --model-id watsonx/ibm/granite-3-3-8b-instruct "$query"
# Stream a claude-agent run and pipe to jq
uv run claude-agent --json "$query" | jq .turns
# Direct Anthropic API (no proxy) for claude-agent
uv run claude-agent --model-id claude-opus-4-6 "$query"
# Persist a deep-agent run for benchmark evaluation
AGENT_TRAJECTORY_DIR=./traces/trajectories OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id bench-001 --scenario-id 304 "$query"
# Stirrup tools-only run (comparable to the other runners), native watsonx route
uv run stirrup-agent --no-code --show-trajectory \
--model-id watsonx/meta-llama/llama-4-maverick-17b-128e-instruct-fp8 "$query"
# Stirrup code track in a Docker sandbox (writes and runs Python)
STIRRUP_CODE_IMAGE=assetops-code \
uv run stirrup-agent --code-backend docker "$query"
# Direct model-only baseline, no MCP tools
uv run direct-llm-agent --model-id litellm_proxy/Azure/gpt-5-mini-2025-08-07 \
'Return only JSON: {"test": 1}'

Each agent run can persist two artifacts joined by run_id:
- Trace — OpenTelemetry root span with metadata + aggregate metrics (runner, model, IDs, span duration, token totals, turn / tool-call counts).
- Trajectory — per-run JSON with per-turn content (text, tool inputs/outputs, per-turn tokens and timing).
Install the optional deps and set either / both / neither env var:
uv sync --group otel
AGENT_TRAJECTORY_DIR=./traces/trajectories \
OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id bench-001 --scenario-id 304 "$query"

--run-id (auto-UUID4 if omitted) and --scenario-id are available on every runner. With nothing set, runs work normally with zero persistence overhead.
See docs/observability.md for span attribute reference, trajectory layout, jq recipes, log rotation, and optional Jaeger / Collector replay.
Offline scoring of saved trajectories against ground-truth scenarios. Three-stage flow:
agent run → trajectory (run_id) → uv run evaluate → reports/<run_id>.json
End-to-end against a ground-truth file:
# 1. Persist trajectories
export AGENT_TRAJECTORY_DIR=$(pwd)/traces/trajectories
uv run claude-agent "List all failure modes of asset Chiller." --scenario-id 101
# 2. Score with LLM-As-Judge
uv run evaluate \
--trajectories traces/trajectories \
--scenarios groundtruth/101.json \
--scorer-default llm_judge \
--judge-model litellm_proxy/azure/gpt-5.4

Output lands under reports/: one <run_id>.json per trajectory plus _aggregate.json for the rollup.
Note: If llm_judge is used, --judge-model must differ from the model that produced each evaluated trajectory. The evaluator rejects self-judging rows with a clear error.
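The self-judging rule can be illustrated with a small guard. This is a sketch under stated assumptions, not the evaluator's code: the `run_id` and `model` keys and both function names are hypothetical, and it compares model strings exactly, whereas the real evaluator may normalise them first.

```python
def find_self_judged_runs(trajectories, judge_model):
    """Return the run IDs whose generating model equals the judge model.

    Each trajectory is assumed to be a dict with hypothetical
    "run_id" and "model" keys.
    """
    return [t["run_id"] for t in trajectories if t["model"] == judge_model]


def check_judge_model(trajectories, judge_model):
    """Raise ValueError if any trajectory would be judged by its own model."""
    offenders = find_self_judged_runs(trajectories, judge_model)
    if offenders:
        raise ValueError(
            f"judge model {judge_model!r} also produced runs: "
            + ", ".join(offenders)
        )


if __name__ == "__main__":
    runs = [
        {"run_id": "bench-001", "model": "litellm_proxy/aws/claude-opus-4-6"},
        {"run_id": "bench-002", "model": "litellm_proxy/azure/gpt-5.4"},
    ]
    print(find_self_judged_runs(runs, "litellm_proxy/azure/gpt-5.4"))
```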
Scorer families follow MLflow's evaluator/scorer split: llm_judge is wired up; exact_string_match, numeric_match, and semantic_similarity ship as skeletons (raise NotImplementedError).
Full reference — scenario schema, report layout, custom scorers, looping over ground-truth: docs/evaluation.md.
uv run pytest src/ -k "not integration" # unit tests only — no services required
uv run pytest src/ # full suite: integration tests auto-skip if their service is unavailable

Each integration suite is gated by a skipif mark, so a missing service causes the suite to be skipped rather than failed:
| Suite | Skip unless |
|---|---|
| iot, wo, vibration | CouchDB reachable — docker compose -f src/couchdb/docker-compose.yaml up -d |
| fmsr | WATSONX_APIKEY, WATSONX_PROJECT_ID set in .env |
| tsfm | PATH_TO_MODELS_DIR, PATH_TO_DATASETS_DIR set in .env |
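A skipif gate of this kind needs a cheap condition to evaluate at collection time. The sketch below shows one possible reachability probe for the CouchDB-backed suites, using only the standard library; `couchdb_reachable` is a hypothetical name, and the project's own test modules may check availability differently.

```python
import http.client
import os
import urllib.error
import urllib.request


def couchdb_reachable(url=None, timeout=1.0):
    """Return True if a GET on the CouchDB root URL answers with HTTP 200.

    Any connection failure, timeout, HTTP error, or malformed URL
    counts as "not reachable".
    """
    url = url or os.environ.get("COUCHDB_URL", "http://localhost:5984")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        return False


# In a test module, the probe could feed a module-level mark such as:
#   pytestmark = pytest.mark.skipif(
#       not couchdb_reachable(), reason="CouchDB not reachable")

if __name__ == "__main__":
    print("CouchDB reachable:", couchdb_reachable())
```

A short timeout keeps test collection fast when the service is down.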
Narrow scope by path or name pattern:
uv run pytest src/servers/wo/tests/ # one package's full suite
uv run pytest src/servers/wo/tests/test_integration.py -v # one file
uv run pytest src/ -k "integration" # only files / tests with "integration" in the name

┌──────────────────────────────────────────────────────────────┐
│                           agent/                             │
│                                                              │
│  PlanExecuteRunner   ClaudeAgentRunner   StirrupAgentRunner  │
│  OpenAIAgentRunner   DeepAgentRunner                         │
│                                                              │
└──────────────────────────┬───────────────────────────────────┘
                           │ MCP protocol (stdio)
         ┌─────────────────┼───────────┬──────────┬──────┬───────────┐
         ▼                 ▼           ▼          ▼      ▼           ▼
        iot            utilities     fmsr       tsfm    wo       vibration
      (tools)           (tools)     (tools)   (tools) (tools)     (tools)