A high-performance, general-purpose agent runtime: plans with capabilities, uses skills and tools on demand, runs inside a real workspace, and ships inspectable deliverables -- not just answer strings.
Agent Harness turns an open-ended request into a main deliverable plus the evidence, artifacts, and runtime trace needed to review or continue the work.
The best general agents are usually not the most complicated ones.
What matters is a small number of strong primitives:
- one persistent thread
- one planner that reasons over capabilities instead of fixed flows
- one workspace where actions can create real files
- one main deliverable that the user can actually open
- one artifact/evidence rail for audit, retry, and follow-up
That is the design center of Agent Harness.
Depending on the task, the main deliverable can be:
- a research report with evidence-backed findings
- an architecture design with trade-off analysis
- a comparison matrix with concrete recommendations
- a patch draft or engineering handoff memo
- a slide deck plan or webpage blueprint
- a delivery bundle with linked artifacts
Supporting artifacts can include:
- evidence bundles (Tavily > public web search > live_search > static catalog)
- workspace findings
- validation plans or execution traces
- source matrices
- interoperability exports for external skill ecosystems
Request -> Thread -> Capability Planner -> Skills/Tools/Workspace Actions -> Main Deliverable -> Evidence + Artifact Bundle
This is deliberately simpler than a task-specific workflow catalog.
The runtime decides how to use skills, tools, and workspace actions based on the task -- not because the task was shoved into a hard-coded funnel.
The runtime is built around one short loop (a minimal sketch follows the list):
- open or resume one thread
- infer the main deliverable and missing channels from the task
- inspect skills, tools, web context, or workspace only when the task calls for them
- execute a small task graph inside the thread workspace
- publish one primary deliverable plus reviewable evidence and follow-up artifacts
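As a minimal sketch of that loop in plain Python -- every name here is an illustrative stand-in, not the real API; the actual entry point is `HarnessEngine.run`, shown in the quickstart below:

```python
# Illustrative only: a toy rendering of the loop above.
# None of these names are the real API; the real entry point is HarnessEngine.run().
from dataclasses import dataclass, field


@dataclass
class Thread:
    """One persistent thread: events and artifacts stay attached to it."""
    events: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)


def run_loop(thread: Thread, task: str) -> dict:
    deliverable = {"kind": "research_report", "task": task}  # inferred from the task
    plan = ["collect_evidence", "synthesize", "package"]     # small task graph, not a fixed funnel
    for step in plan:
        result = f"{step}: done"                             # stand-in for skill/tool/workspace actions
        thread.events.append({"step": step, "result": result})
        thread.artifacts[step] = result                      # evidence/artifact rail for audit and retry
    return {"deliverable": deliverable, "artifacts": thread.artifacts, "trace": thread.events}


run = run_loop(Thread(), "Summarize the key risks of deploying LLMs in production")
print(run["deliverable"]["kind"], len(run["trace"]))
```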
The runtime now uses a multi-backend evidence stack with clear priority:
- Tavily (if `TAVILY_API_KEY` is set)
- Public search fallback:
  - DuckDuckGo HTML search (no key)
  - SearXNG (if `SEARXNG_BASE_URL` is set)
- LLM-powered live_search (uses the configured model to produce reference candidates)
- Static catalog + local dossiers
Priority is enforced in scoring, so stronger backends win when multiple sources return relevant evidence.
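One way to picture that enforcement -- the weights and field names below are illustrative assumptions, not the runtime's actual values:

```python
# Illustrative sketch: fold backend priority into evidence scoring so stronger
# backends win when multiple sources return relevant results.
BACKEND_PRIORITY = {
    "tavily": 1.0,         # keyed, highest priority
    "public_search": 0.8,  # DuckDuckGo / SearXNG fallback
    "live_search": 0.6,    # LLM-generated reference candidates
    "static_catalog": 0.4, # local dossiers, always available
}

def score(evidence: dict) -> float:
    relevance = evidence.get("relevance", 0.0)  # e.g. query/term overlap in [0, 1]
    priority = BACKEND_PRIORITY.get(evidence["backend"], 0.0)
    return relevance * priority

candidates = [
    {"backend": "static_catalog", "relevance": 0.9, "title": "Local dossier"},
    {"backend": "tavily", "relevance": 0.7, "title": "Web result"},
]
best = max(candidates, key=score)
print(best["title"])  # the keyed backend wins despite slightly lower raw relevance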
When a live model API is configured, the runtime adds a 4-stage reasoning chain:
- Analysis -- decompose the task, map available evidence, identify gaps
- Synthesis -- produce a structured answer grounded in analysis findings
- Critique -- peer-review the synthesis: flag red flags, blind spots, unsupported claims
- Revision -- fix every issue the critique identified, with mandatory fix targets injected into the prompt
This chain means the final output has been through analysis, writing, review, and revision -- not just one-shot generation.
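A minimal sketch of how such a chain can be wired, assuming a hypothetical `call_model(system, user)` wrapper around the configured chat endpoint (the real orchestration lives in `app/harness/`):

```python
# Illustrative sketch of the 4-stage chain; call_model is a hypothetical wrapper
# around the configured chat endpoint, not the runtime's real API.
def call_model(system: str, user: str) -> str:
    raise NotImplementedError("wire this to your chat endpoint")

def run_chain(task: str, evidence_digest: str) -> str:
    analysis = call_model(
        "Decompose the task, map the available evidence, and list gaps.",
        f"Task: {task}\n\nEvidence:\n{evidence_digest}",
    )
    synthesis = call_model(
        "Write a structured answer grounded strictly in the analysis findings.",
        f"Task: {task}\n\nAnalysis:\n{analysis}",
    )
    critique = call_model(
        "Peer-review the draft: flag red flags, blind spots, unsupported claims.",
        f"Draft:\n{synthesis}",
    )
    revision = call_model(
        "Revise the draft. Every critique item below is a mandatory fix target.",
        f"Draft:\n{synthesis}\n\nCritique:\n{critique}",
    )
    return revision
```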
The current demo set compares three execution modes of the same runtime (a configuration sketch for each mode follows the list):
"Compare Redis, Valkey, and Memcached for low-latency caching in a high-traffic API."
- Full output
- Uses DuckDuckGo / SearXNG fallback when no Tavily key is present
- No live-agent chain, deterministic evidence-first synthesis only
- Good baseline for no-key environments
"Summarize the key risks of deploying LLMs in production and give 3 concrete mitigations."
- Full output
- Uses a single cheap-model synthesis pass when `live_model` is configured but full live-agent mode is off
- Better than the old template-based fallback, but cheaper than the 4-stage chain
"Compare React, Vue, and Svelte for building a real-time dashboard."
- Full output
- Uses the full 4-stage path: analysis -> synthesis -> critique -> revision
- Highest quality path when a live model is configured
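Expressed with the constructor arguments documented below, the three modes differ only in what is configured. This is a sketch: queries and endpoint values are placeholders, and it assumes `live_model` can be passed independently of `enable_live_agent` (it can also be supplied via environment variables, as shown later).

```python
# The same runtime, three configurations (values are placeholders).
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
live_model = {"base_url": "https://your-endpoint/v1", "api_key": "your_api_key", "model_name": "gpt-4o"}

# 1. No keys: deterministic, evidence-first synthesis only.
baseline = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
)

# 2. live_model configured, live-agent mode off: single cheap-model synthesis pass.
cheap = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
    live_model=live_model,
)

# 3. live_model + enable_live_agent: full analysis -> synthesis -> critique -> revision chain.
full = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4, enable_live_agent=True, max_live_agent_calls=4),
    live_model=live_model,
)
```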
The first-class result is the deliverable the user asked for -- not the scorecard, not the bundle, not the planner trace.
The runtime is not "the research flow" or "the code flow". It starts from a task spec, available capabilities, current evidence gaps, and workspace state. That keeps the system usable across code, research, operations, and mixed tasks.
The live agent doesn't just generate -- it reviews its own output. Critique findings (red flags, blind spots, improvement items) are injected as mandatory fix targets in the revision pass. This is measurably better than one-shot generation.
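A sketch of how critique findings can become mandatory fix targets in the revision prompt -- the field names are illustrative, not the runtime's internal schema:

```python
# Illustrative: format critique findings as mandatory fix targets for the revision pass.
# The structure of the critique dict is an assumption, not the runtime's real schema.
def build_revision_prompt(draft: str, critique: dict) -> str:
    fix_targets = []
    for kind in ("red_flags", "blind_spots", "improvements"):
        for item in critique.get(kind, []):
            fix_targets.append(f"- [{kind}] {item}")
    targets = "\n".join(fix_targets) or "- (no issues reported)"
    return (
        "Revise the draft below. Every fix target is mandatory; "
        "do not drop supported content.\n\n"
        f"Mandatory fix targets:\n{targets}\n\nDraft:\n{draft}"
    )

prompt = build_revision_prompt(
    "Redis vs Valkey vs Memcached ...",
    {"red_flags": ["latency numbers lack a source"], "blind_spots": ["no persistence trade-off"]},
)
print(prompt)
```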
Evidence collection now prefers the strongest available backend in order:
- Tavily (keyed, highest priority)
- public web search fallback (DuckDuckGo / optional SearXNG)
- LLM-powered live_search
- static catalog + local dossiers
The evidence digest flows directly into synthesis and revision prompts.
Each task runs inside a persistent thread with resumable execution, retry/interrupt/recovery, workspace artifacts, and event stream export.
Skills are packaged capabilities, not the entire product. The runtime can export an interoperability catalog so external OpenAI/Anthropic-style ecosystems can consume the project's capabilities.
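The export format is not reproduced here; as an illustration, one exported catalog entry might look roughly like an OpenAI/Anthropic-style tool definition. This is a sketch, not the actual schema produced by `skills-interop-export`, and the skill name is hypothetical.

```python
# Illustrative only: a plausible shape for one interop catalog entry.
# The real schema emitted by `python -m app.main skills-interop-export` may differ.
catalog_entry = {
    "name": "evidence_brief",
    "description": "Produce an evidence-backed brief for a research query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The research question."},
            "max_sources": {"type": "integer", "description": "Upper bound on cited sources."},
        },
        "required": ["query"],
    },
}
```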
```bash
pip install -r requirements.txt
```

```python
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
run = engine.run(
    query="Summarize the key risks of deploying LLMs in production",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
)
print(run.final_answer[:2000])
```

```python
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
run = engine.run(
    query="Design a caching strategy for a microservices architecture",
    constraints=HarnessConstraints(
        max_steps=5,
        max_tool_calls=4,
        enable_live_agent=True,
        max_live_agent_calls=4,
    ),
    live_model={
        "base_url": "https://your-endpoint/v1",
        "api_key": "your_api_key",
        "model_name": "gpt-4o",
    },
)
print(run.final_answer[:3000])
```

Or use environment variables:
```bash
export AGENT_HARNESS_MODEL_BASE_URL=https://your-endpoint/v1
export AGENT_HARNESS_MODEL_API_KEY=your_api_key
export AGENT_HARNESS_MODEL_NAME=gpt-4o

# Optional higher-quality real web search
export TAVILY_API_KEY=your_tavily_key

# Optional self-hosted public metasearch
export SEARXNG_BASE_URL=http://localhost:8080

python -m app.main harness-live "Write an evidence-backed research brief"
```

```bash
pytest -q
# 183 tests passing
```

| Goal | Command |
|---|---|
| Create thread | `python -m app.main agent-thread-create "My Task"` |
| List threads | `python -m app.main agent-threads` |
| Run generic super-agent | `python -m app.main agent-thread-run <thread_id> "<query>" --target auto` |
| Execute task graph | `python -m app.main agent-thread-exec-task <thread_id> "<query>" --target general` |
| Export workspace view | `python -m app.main agent-thread-workspace-view <thread_id>` |
| Run harness with live model | `python -m app.main harness-live "<query>"` |
| Build code mission pack | `python -m app.main harness-code-pack "<query>" --workspace .` |
| Run showcase | `python -m app.main studio-showcase "<query>" --tag demo` |
| Export skills interop | `python -m app.main skills-interop-export` |
- `app/harness/`: planner, live orchestration, evidence, evaluation, reporting
- `app/agents/`: thread runtime, workspace action mapper, scheduler, sandbox
- `app/skills/`: built-in skills, packages, interop export
- `app/core/`: task spec, capability graph, contracts, shared state
- `app/studio/`: showcase and release packaging
- `docs/demo/`: live demo outputs generated with real API calls
- `reports/`: runtime outputs generated locally
- `tests/`: 183 tests covering runtime, showcase, planner, and live orchestration
```
                  +--------------------+
                  |    User Request    |
                  +---------+----------+
                            |
                  +---------v----------+
                  | Capability Planner |  (configurable limits, language-agnostic)
                  +---------+----------+
                            |
        +-------------------+-------------------+
        |                   |                   |
+-------v-------+   +-------v-------+   +-------v-------+
|  Skill Router |   |  Tool Engine  |   |   Workspace   |
|  (26 skills)  |   |  (15+ tools)  |   |  (file I/O)   |
+-------+-------+   +-------+-------+   +-------+-------+
        |                   |                   |
        +-------------------+-------------------+
                            |
                  +---------v----------+
                  |  Evidence Bundle   |  (live search + static catalog)
                  +---------+----------+
                            |
                  +---------v----------+
                  |     Live Agent     |  (analysis -> synthesis -> critique -> revision)
                  +---------+----------+
                            |
                  +---------v----------+
                  |  Main Deliverable  |  (task-adaptive format)
                  +--------------------+
```
Agent Harness is strongest today when a task benefits from:
- evidence-aware synthesis (not just generation)
- structured multi-pass reasoning (analysis + critique + revision)
- persistent thread with inspectable artifacts
- task-adaptive output format (not forced templates)
What comes next:
- connect real web search APIs for live evidence collection
- strengthen long-horizon thread execution for complex multi-step tasks
- improve evidence grounding so every claim traces to a named source
- reduce the gap between live-agent-enhanced and base-only output quality