Measuring what makes AI agents reliable in production-like scenarios. Open benchmarks and controlled harness experiments.
We study one question:
What makes an AI agent remain useful when tasks become long, tools become risky, context becomes crowded, and execution gets interrupted?
Longer context windows expand what an agent can access. They do not decide what belongs in the active working set, what should be persisted or forgotten, when an action needs permission, how sub-agents should hand work back, or whether execution can recover after interruption.
Agent Reliability Lab turns those choices into reproducible system and product evaluations.
| Project | Layer | Research question | Status |
|---|---|---|---|
| Chinese Long-Context LLM Benchmark V2 | Model measurement | How reliably do Chinese LLMs retrieve and reason across long contexts? | Complete — v2.0.1 |
| Deep Research Harness Eval | Agent reliability infrastructure | How do compaction, permission gates, sub-agents, and recovery affect quality and cost? | In progress — spec v0.2 |
| Agent Memory Systems Benchmark | Persistent memory | How do memory systems differ under controlled write, update, conflict, and deletion tasks? | Planned after Harness Eval |
The Deep Research Harness Eval compares four cumulative configurations. Each one keeps everything in the configuration before it and adds a single component:
- ReAct baseline
- Baseline + context compaction
- Compaction + permission gate
- Permission gate + structured sub-agents
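The cumulative ladder above can be sketched as a set of feature flags, where each step enables exactly one more component than the previous step. This is a minimal illustration only: `HarnessConfig`, its field names, and `build_ladder` are hypothetical names chosen for this sketch and are not taken from the harness spec.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical feature flags for one harness configuration."""
    name: str
    compaction: bool = False
    permission_gate: bool = False
    sub_agents: bool = False

    def enabled(self) -> set[str]:
        """Return the names of the components switched on in this configuration."""
        flags = ("compaction", "permission_gate", "sub_agents")
        return {flag for flag in flags if getattr(self, flag)}


def build_ladder() -> list[HarnessConfig]:
    """Build four configurations, each adding one component to the previous one."""
    c0 = HarnessConfig(name="react_baseline")
    c1 = replace(c0, name="plus_compaction", compaction=True)
    c2 = replace(c1, name="plus_permission_gate", permission_gate=True)
    c3 = replace(c2, name="plus_sub_agents", sub_agents=True)
    return [c0, c1, c2, c3]


ladder = build_ladder()
for prev, curr in zip(ladder, ladder[1:]):
    added = curr.enabled() - prev.enabled()
    # Adjacent configurations differ by exactly one component,
    # and nothing enabled earlier is ever switched off.
    assert len(added) == 1 and prev.enabled() <= curr.enabled()
```

Keeping the difference between adjacent configurations to one flag is what lets a change in quality or cost be attributed to that component rather than to a bundle of changes.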
The primary metrics are:
- Evidence-Grounded Task Success Rate
- Cost per Successful Task
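This overview does not spell out the metric definitions, so the sketch below assumes one plausible reading: a task counts as an evidence-grounded success only if it is both completed and supported by its cited evidence, and cost per successful task divides the total spend across all attempts (including failures) by the number of such successes. The `TaskResult` fields and the sample values are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    task_id: str
    success: bool     # task judged complete
    grounded: bool    # claims supported by cited evidence (assumed criterion)
    cost_usd: float   # total spend for this task attempt


def count_grounded_successes(results: list[TaskResult]) -> int:
    return sum(1 for r in results if r.success and r.grounded)


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that are both successful and evidence-grounded."""
    if not results:
        return 0.0
    return count_grounded_successes(results) / len(results)


def cost_per_successful_task(results: list[TaskResult]) -> Optional[float]:
    """Total spend divided by grounded successes; None when nothing succeeded."""
    wins = count_grounded_successes(results)
    if wins == 0:
        return None
    return sum(r.cost_usd for r in results) / wins


# Placeholder data for illustration only.
sample = [
    TaskResult("t1", success=True, grounded=True, cost_usd=0.5),
    TaskResult("t2", success=True, grounded=False, cost_usd=0.25),
    TaskResult("t3", success=False, grounded=False, cost_usd=0.25),
    TaskResult("t4", success=True, grounded=True, cost_usd=1.0),
]
rate = success_rate(sample)
cost = cost_per_successful_task(sample)
```

Charging failed attempts to the numerator means a configuration cannot look cheap simply by failing quickly, which is one reason to report the two metrics together.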
The primary evaluation uses a frozen twenty-task suite. A separate cross-model subset tests whether the configuration rankings transfer to other models, which keeps model effects out of the main causal claim.
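How ranking transfer is scored is not specified here. One common option, shown purely as an assumption, is a rank correlation between the per-configuration scores obtained with two different models. The sketch below computes Kendall's tau-a by hand (tied pairs count as neither concordant nor discordant); the configuration keys and scores are made-up placeholders.

```python
from itertools import combinations


def kendall_tau_a(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall's tau-a between two score tables keyed by configuration name."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both models must be scored on the same configurations")
    configs = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(configs, 2):
        # Same sign means both models order this pair the same way.
        product = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(configs) * (len(configs) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Placeholder success rates for four configurations under two models.
model_a = {"c0": 0.2, "c1": 0.4, "c2": 0.5, "c3": 0.7}
model_b = {"c0": 0.1, "c1": 0.3, "c2": 0.6, "c3": 0.5}
tau = kendall_tau_a(model_a, model_b)
```

A value near 1 would suggest the ordering of configurations holds across models, while a value near 0 or below would suggest the ranking is model-specific.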
Every evaluation follows the same principles:
- Freeze inputs and version source snapshots.
- Separate model capability from infrastructure failure.
- Measure systems from raw traces, not screenshots.
- Change one architectural decision at a time.
- Publish badcases and failure metadata.
- Report quality and cost together.
- State limitations explicitly.
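Two of these principles, measuring from raw traces and separating model capability from infrastructure failure, can be illustrated with a small trace classifier. This is a sketch under stated assumptions: the JSON Lines layout, the `success` and `error_code` fields, and the set of error codes treated as infrastructure failures are all hypothetical and do not describe the lab's actual trace format.

```python
import json
from collections import Counter

# Hypothetical error codes attributed to infrastructure rather than the model.
INFRA_ERRORS = {"timeout", "rate_limit", "provider_5xx"}


def classify(trace: dict) -> str:
    """Label one trace record as success, infrastructure failure, or model failure."""
    if trace.get("success"):
        return "success"
    if trace.get("error_code") in INFRA_ERRORS:
        return "infrastructure_failure"
    return "model_failure"


def summarize(jsonl_text: str) -> Counter:
    """Count outcome labels across a JSON Lines string of trace records."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[classify(json.loads(line))] += 1
    return counts


# Placeholder trace records for illustration only.
raw = "\n".join([
    json.dumps({"task_id": "t1", "success": True}),
    json.dumps({"task_id": "t2", "success": False, "error_code": "timeout"}),
    json.dumps({"task_id": "t3", "success": False, "error_code": None}),
])
summary = summarize(raw)
```

Reporting the two failure classes separately keeps a provider outage from being read as a weakness of the configuration under test.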
| Project | Phase | Latest milestone |
|---|---|---|
| Long-Context V2 | Complete | v2.0.1 frozen |
| Deep Research Harness | In progress | v0.2 provider qualification gate |
| Agent Memory Systems | Planned | Begins after the harness evaluation |
Built by Melody Ling.