Agent Reliability Lab

Measuring what makes AI agents reliable in production-like scenarios. Open benchmarks and controlled harness experiments.

We study one question:

What makes an AI agent remain useful when tasks become long, tools become risky, context becomes crowded, and execution gets interrupted?

Longer context windows expand what an agent can access. They do not decide what belongs in the active working set, what should be persisted or forgotten, when an action needs permission, how sub-agents should hand work back, or whether execution can recover after interruption.

Agent Reliability Lab turns those choices into reproducible system and product evaluations.

Active projects

| Project | Layer | Research question | Status |
| --- | --- | --- | --- |
| Chinese Long-Context LLM Benchmark V2 | Model measurement | How reliably do Chinese LLMs retrieve and reason across long contexts? | Complete (v2.0.1) |
| Deep Research Harness Eval | Agent reliability infrastructure | How do compaction, permission gates, sub-agents, and recovery affect quality and cost? | In progress (spec v0.2) |
| Agent Memory Systems Benchmark | Persistent memory | How do memory systems differ under controlled write, update, conflict, and deletion tasks? | Planned after Harness Eval |

Current experiment

The Deep Research Harness Eval compares four cumulative configurations:

  1. ReAct baseline
  2. Baseline + context compaction
  3. Compaction + permission gate
  4. Permission gate + structured sub-agents
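
The cumulative design above can be sketched as a chain of configurations in which each step switches on exactly one additional component. This is a minimal illustration only: `HarnessConfig`, its flag names, and the configuration names are hypothetical and are not taken from the harness repository.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical flag names; the real harness may model these differently.
FLAGS = ("compaction", "permission_gate", "sub_agents")


@dataclass(frozen=True)
class HarnessConfig:
    name: str
    compaction: bool = False
    permission_gate: bool = False
    sub_agents: bool = False


def cumulative_configs() -> List[HarnessConfig]:
    """Build the four configurations, each adding one component to the previous."""
    baseline = HarnessConfig(name="react-baseline")
    with_compaction = replace(baseline, name="compaction", compaction=True)
    with_gate = replace(with_compaction, name="permission-gate", permission_gate=True)
    with_sub_agents = replace(with_gate, name="sub-agents", sub_agents=True)
    return [baseline, with_compaction, with_gate, with_sub_agents]


def changed_flags(a: HarnessConfig, b: HarnessConfig) -> List[str]:
    """Return the flags that differ between two configurations."""
    return [f for f in FLAGS if getattr(a, f) != getattr(b, f)]


if __name__ == "__main__":
    configs = cumulative_configs()
    for prev, cur in zip(configs, configs[1:]):
        print(f"{prev.name} -> {cur.name}: {changed_flags(prev, cur)}")
```

Checking that adjacent configurations differ by a single flag is one way to enforce the "change one architectural decision at a time" principle before any run starts.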

The primary metrics are:

  • Evidence-Grounded Task Success Rate
  • Cost per Successful Task
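
As a rough sketch of how the two metrics might be computed from per-task results, the example below assumes that success is a binary, evidence-grounded judgment per task and that Cost per Successful Task divides total spend (failed runs included) by the number of successes. Both definitions, along with `TaskResult` and its fields, are assumptions for illustration; the evaluation spec may define them differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    task_id: str
    success: bool    # assumed: task judged successful with claims grounded in evidence
    cost_usd: float  # assumed: total spend for this task run


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks judged successful."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.success for r in results) / len(results)


def cost_per_successful_task(results: List[TaskResult]) -> Optional[float]:
    """Total cost across all runs divided by the number of successes.

    Returns None when nothing succeeded, since the ratio is undefined.
    """
    successes = sum(r.success for r in results)
    if successes == 0:
        return None
    return sum(r.cost_usd for r in results) / successes


if __name__ == "__main__":
    demo = [
        TaskResult("t1", True, 1.0),
        TaskResult("t2", False, 2.0),
        TaskResult("t3", True, 3.0),
        TaskResult("t4", False, 2.0),
    ]
    print(success_rate(demo), cost_per_successful_task(demo))
```

Charging failed runs to the numerator means a configuration cannot look cheap simply by failing quickly, which keeps quality and cost reported together.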

The primary evaluation uses a frozen twenty-task suite. A separate cross-model subset tests whether configuration rankings transfer without mixing model effects into the main causal claim.

Evaluation principles

  1. Freeze inputs and version source snapshots.
  2. Separate model capability from infrastructure failure.
  3. Measure systems from raw traces, not screenshots.
  4. Change one architectural decision at a time.
  5. Publish badcases and failure metadata.
  6. Report quality and cost together.
  7. State limitations explicitly.
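
Principle 1 could be enforced with a content-hash manifest, sketched below under the assumption that source snapshots are available as named byte strings. The function names and the manifest layout are hypothetical and do not describe the lab's actual tooling.

```python
import hashlib
from typing import Dict, List


def snapshot_digest(content: bytes) -> str:
    """SHA-256 hex digest of one snapshot's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(snapshots: Dict[str, bytes]) -> Dict[str, str]:
    """Map each snapshot name to its digest, in sorted name order."""
    return {name: snapshot_digest(data) for name, data in sorted(snapshots.items())}


def drifted(snapshots: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Names that were added, removed, or changed relative to the frozen manifest."""
    current = build_manifest(snapshots)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


if __name__ == "__main__":
    frozen = build_manifest({"task-01/source.html": b"<html>v1</html>"})
    print(drifted({"task-01/source.html": b"<html>v2</html>"}, frozen))
```

Running a check like `drifted` before each evaluation run would flag any input that changed since the suite was frozen.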

Status — updated June 21, 2026

| Project | Phase | Latest milestone |
| --- | --- | --- |
| Long-Context V2 | Complete | v2.0.1 frozen |
| Deep Research Harness | In progress | v0.2 provider qualification gate |
| Agent Memory Systems | Planned | Begins after the harness evaluation |

Built by Melody Ling.

Pinned

  1. deep-research-harness (Python): Controlled ablation of deep-research agent reliability: compaction, permission gates, sub-agents, recovery.
  2. llm-long-context-eval-zh-V2 (HTML): Chinese long-context LLM benchmark V2 with harder NIAH variants, 10 repeats, and efficiency metrics for DeepSeek, Kimi, and Qwen.
