Agent Reliability Lab

Measuring what makes AI agents reliable in production-like scenarios. Open benchmarks and controlled harness experiments.

We study one question:

What makes an AI agent remain useful when tasks become long, tools become risky, context becomes crowded, and execution gets interrupted?

Longer context windows expand what an agent can access. They do not decide what belongs in the active working set, what should be persisted or forgotten, when an action needs permission, how sub-agents should hand work back, or whether execution can recover after interruption.

Agent Reliability Lab turns those choices into reproducible system and product evaluations.

Active projects

Project	Layer	Research question	Status
Chinese Long-Context LLM Benchmark V2	Model measurement	How reliably do Chinese LLMs retrieve and reason across long contexts?	Complete — v2.0.1
Deep Research Harness Eval	Agent reliability infrastructure	How do compaction, permission gates, sub-agents, and recovery affect quality and cost?	In progress — spec v0.2
Agent Memory Systems Benchmark	Persistent memory	How do memory systems differ under controlled write, update, conflict, and deletion tasks?	Planned after Harness Eval

Current experiment

The Deep Research Harness Eval compares four cumulative configurations:

ReAct baseline
Baseline + context compaction
Compaction + permission gate
Permission gate + structured sub-agents
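A minimal sketch of how the four cumulative configurations could be represented, so that each step differs from the previous one by exactly one component. The `HarnessConfig` class, its field names, and the configuration names are hypothetical illustrations, not the harness's actual schema.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical feature flags for one harness configuration."""
    name: str
    compaction: bool = False
    permission_gate: bool = False
    sub_agents: bool = False


def cumulative_configs() -> list[HarnessConfig]:
    """Build the four configurations, each adding one component to the last."""
    baseline = HarnessConfig(name="react_baseline")
    with_compaction = replace(
        baseline, name="baseline_plus_compaction", compaction=True
    )
    with_gate = replace(
        with_compaction, name="compaction_plus_permission_gate", permission_gate=True
    )
    with_sub_agents = replace(
        with_gate, name="permission_gate_plus_sub_agents", sub_agents=True
    )
    return [baseline, with_compaction, with_gate, with_sub_agents]


def changed_flags(a: HarnessConfig, b: HarnessConfig) -> list[str]:
    """Return the names of the feature flags that differ between two configs."""
    return [
        f.name
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


if __name__ == "__main__":
    configs = cumulative_configs()
    for prev, cur in zip(configs, configs[1:]):
        print(cur.name, "adds", changed_flags(prev, cur))
```

Because each configuration is derived from its predecessor, any difference in outcome between adjacent configurations can be attributed to the single flag that changed.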

The primary metrics are:

Evidence-Grounded Task Success Rate
Cost per Successful Task
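The exact metric definitions are not given here, so the following is only a sketch under stated assumptions: a task counts as successful when it both completes and its answer is grounded in evidence, and cost per successful task divides the total cost of all runs (including failed ones) by the number of successes. The `TaskResult` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record extracted from a run trace."""
    task_id: str
    completed: bool          # the agent produced a final answer
    evidence_grounded: bool  # assumed: the answer's claims trace back to sources
    cost_usd: float          # assumed: total spend for this task


def is_success(result: TaskResult) -> bool:
    """Assumed success criterion: completed and evidence-grounded."""
    return result.completed and result.evidence_grounded


def success_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks that meet the assumed success criterion."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_success(r) for r in results) / len(results)


def cost_per_successful_task(results: Sequence[TaskResult]) -> Optional[float]:
    """Total cost of all runs divided by the number of successes.

    Returns None when nothing succeeded, since the ratio is undefined.
    """
    successes = sum(is_success(r) for r in results)
    if successes == 0:
        return None
    return sum(r.cost_usd for r in results) / successes
```

Charging failed runs to the numerator is a design choice: it keeps a configuration from looking cheap simply because it fails quickly.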

The primary evaluation uses a frozen twenty-task suite. A separate cross-model subset tests whether the configuration rankings transfer to other models; it is kept apart from the primary evaluation so that model effects do not mix into the main causal claim.
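One way a frozen suite could be enforced is by recording a content fingerprint and refusing to run if it changes. The sketch below assumes tasks are JSON-serializable dictionaries with a `task_id` key; the function names and task schema are hypothetical, not the project's actual tooling.

```python
import hashlib
import json


def suite_fingerprint(tasks: list[dict]) -> str:
    """Return a SHA-256 hex digest of the task suite.

    Tasks are sorted by task_id and serialized with sorted keys, so the
    digest does not depend on list order or dictionary key order.
    """
    canonical = json.dumps(
        sorted(tasks, key=lambda t: t["task_id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_frozen(tasks: list[dict], expected_fingerprint: str) -> None:
    """Raise if the suite no longer matches the recorded fingerprint."""
    actual = suite_fingerprint(tasks)
    if actual != expected_fingerprint:
        raise RuntimeError(
            f"task suite changed: expected {expected_fingerprint}, got {actual}"
        )
```

Recording the fingerprint alongside each result makes it possible to check later that two runs were scored against the same inputs.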

Evaluation principles

Freeze inputs and version source snapshots.
Separate model capability from infrastructure failure.
Measure systems from raw traces, not screenshots.
Change one architectural decision at a time.
Publish badcases and failure metadata.
Report quality and cost together.
State limitations explicitly.

Status — updated June 21, 2026

Project	Phase	Latest milestone
Long-Context V2	Complete	v2.0.1 frozen
Deep Research Harness	In progress	v0.2 provider qualification gate
Agent Memory Systems	Planned	Begins after the harness evaluation

Built by Melody Ling.
