A high-performance, general-purpose agent runtime: plans with capabilities, uses skills and tools on demand, runs inside a real workspace, and ships inspectable deliverables -- not just answer strings.
Agent Harness turns an open-ended request into a main deliverable plus the evidence, artifacts, and runtime trace needed to review or continue the work.
The best general agents are usually not the most complicated ones.
What matters is a small number of strong primitives:
- one persistent thread
- one planner that reasons over capabilities instead of fixed flows
- one workspace where actions can create real files
- one main deliverable that the user can actually open
- one artifact/evidence rail for audit, retry, and follow-up
That is the design center of Agent Harness.
Depending on the task, the main deliverable can be:
- a research report with evidence-backed findings
- an architecture design with trade-off analysis
- a comparison matrix with concrete recommendations
- a patch draft or engineering handoff memo
- a slide deck plan or webpage blueprint
- a delivery bundle with linked artifacts
Supporting artifacts can include:
- evidence bundles (Tavily > public web search > live_search > static catalog)
- workspace findings
- validation plans or execution traces
- source matrices
- interoperability exports for external skill ecosystems
Request -> Thread -> Capability Planner -> Skills/Tools/Workspace Actions -> Main Deliverable -> Evidence + Artifact Bundle
This is deliberately simpler than a task-specific workflow catalog.
The runtime decides how to use skills, tools, and workspace actions based on the task -- not because the task was shoved into a hard-coded funnel.
The runtime is built around one short loop (a minimal sketch follows the list):
- open or resume one thread
- infer the main deliverable and missing channels from the task
- inspect skills, tools, web context, or workspace only when the task calls for them
- execute a small task graph inside the thread workspace
- publish one primary deliverable plus reviewable evidence and follow-up artifacts
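As a minimal sketch of that loop in plain Python -- every name here is an illustrative stand-in, not the real API; the actual entry point is `HarnessEngine.run`, shown in the quickstart below:

```python
# Illustrative only: a toy rendering of the loop above.
# None of these names are the real API; the real entry point is HarnessEngine.run().
from dataclasses import dataclass, field


@dataclass
class Thread:
    """One persistent thread: events and artifacts stay attached to it."""
    events: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)


def run_loop(thread: Thread, task: str) -> dict:
    deliverable = {"kind": "research_report", "task": task}  # inferred from the task
    plan = ["collect_evidence", "synthesize", "package"]     # small task graph, not a fixed funnel
    for step in plan:
        result = f"{step}: done"                             # stand-in for skill/tool/workspace actions
        thread.events.append({"step": step, "result": result})
        thread.artifacts[step] = result                      # evidence/artifact rail for audit and retry
    return {"deliverable": deliverable, "artifacts": thread.artifacts, "trace": thread.events}


run = run_loop(Thread(), "Summarize the key risks of deploying LLMs in production")
print(run["deliverable"]["kind"], len(run["trace"]))
```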
The runtime now uses a multi-backend evidence stack with clear priority:
- Tavily (if `TAVILY_API_KEY` is set)
- Public search fallback:
  - DuckDuckGo HTML search (no key)
  - SearXNG (if `SEARXNG_BASE_URL` is set)
- LLM-powered live_search (uses the configured model to produce reference candidates)
- Static catalog + local dossiers
Priority is enforced in scoring, so stronger backends win when multiple sources return relevant evidence.
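One way to picture that enforcement -- the weights and field names below are illustrative assumptions, not the runtime's actual values:

```python
# Illustrative sketch: fold backend priority into evidence scoring so stronger
# backends win when multiple sources return relevant results.
BACKEND_PRIORITY = {
    "tavily": 1.0,         # keyed, highest priority
    "public_search": 0.8,  # DuckDuckGo / SearXNG fallback
    "live_search": 0.6,    # LLM-generated reference candidates
    "static_catalog": 0.4, # local dossiers, always available
}

def score(evidence: dict) -> float:
    relevance = evidence.get("relevance", 0.0)  # e.g. query/term overlap in [0, 1]
    priority = BACKEND_PRIORITY.get(evidence["backend"], 0.0)
    return relevance * priority

candidates = [
    {"backend": "static_catalog", "relevance": 0.9, "title": "Local dossier"},
    {"backend": "tavily", "relevance": 0.7, "title": "Web result"},
]
best = max(candidates, key=score)
print(best["title"])  # the keyed backend wins despite slightly lower raw relevance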
When a live model API is configured, the runtime adds a 4-stage reasoning chain:
- Analysis -- decompose the task, map available evidence, identify gaps
- Synthesis -- produce a structured answer grounded in analysis findings
- Critique -- peer-review the synthesis: flag red flags, blind spots, unsupported claims
- Revision -- fix every issue the critique identified, with mandatory fix targets injected into the prompt
This chain means the final output has been through analysis, writing, review, and revision -- not just one-shot generation.
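A minimal sketch of how such a chain can be wired, assuming a hypothetical `call_model(system, user)` wrapper around the configured chat endpoint (the real orchestration lives in `app/harness/`):

```python
# Illustrative sketch of the 4-stage chain; call_model is a hypothetical wrapper
# around the configured chat endpoint, not the runtime's real API.
def call_model(system: str, user: str) -> str:
    raise NotImplementedError("wire this to your chat endpoint")

def run_chain(task: str, evidence_digest: str) -> str:
    analysis = call_model(
        "Decompose the task, map the available evidence, and list gaps.",
        f"Task: {task}\n\nEvidence:\n{evidence_digest}",
    )
    synthesis = call_model(
        "Write a structured answer grounded strictly in the analysis findings.",
        f"Task: {task}\n\nAnalysis:\n{analysis}",
    )
    critique = call_model(
        "Peer-review the draft: flag red flags, blind spots, unsupported claims.",
        f"Draft:\n{synthesis}",
    )
    revision = call_model(
        "Revise the draft. Every critique item below is a mandatory fix target.",
        f"Draft:\n{synthesis}\n\nCritique:\n{critique}",
    )
    return revision
```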
The current demo set compares three execution modes of the same runtime (a configuration sketch for each mode follows the list):
"Compare Redis, Valkey, and Memcached for low-latency caching in a high-traffic API."
- Full output
- Uses DuckDuckGo / SearXNG fallback when no Tavily key is present
- No live-agent chain, deterministic evidence-first synthesis only
- Good baseline for no-key environments
"Summarize the key risks of deploying LLMs in production and give 3 concrete mitigations."
- Full output
- Uses a single cheap-model synthesis pass when `live_model` is configured but full live-agent mode is off
- Better than the old template-based fallback, but cheaper than the 4-stage chain
"Compare React, Vue, and Svelte for building a real-time dashboard."
- Full output
- Uses the full 4-stage path: analysis -> synthesis -> critique -> revision
- Highest quality path when a live model is configured
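Expressed with the constructor arguments documented below, the three modes differ only in what is configured. This is a sketch: queries and endpoint values are placeholders, and it assumes `live_model` can be passed independently of `enable_live_agent` (it can also be supplied via environment variables, as shown later).

```python
# The same runtime, three configurations (values are placeholders).
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
live_model = {"base_url": "https://your-endpoint/v1", "api_key": "your_api_key", "model_name": "gpt-4o"}

# 1. No keys: deterministic, evidence-first synthesis only.
baseline = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
)

# 2. live_model configured, live-agent mode off: single cheap-model synthesis pass.
cheap = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
    live_model=live_model,
)

# 3. live_model + enable_live_agent: full analysis -> synthesis -> critique -> revision chain.
full = engine.run(
    query="...",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4, enable_live_agent=True, max_live_agent_calls=4),
    live_model=live_model,
)
```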
The first-class result is the deliverable the user asked for -- not the scorecard, not the bundle, not the planner trace.
The runtime is not "the research flow" or "the code flow". It starts from a task spec, available capabilities, current evidence gaps, and workspace state. That keeps the system usable across code, research, operations, and mixed tasks.
The live agent doesn't just generate -- it reviews its own output. Critique findings (red flags, blind spots, improvement items) are injected as mandatory fix targets in the revision pass. This is measurably better than one-shot generation.
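A sketch of how critique findings can become mandatory fix targets in the revision prompt -- the field names are illustrative, not the runtime's internal schema:

```python
# Illustrative: format critique findings as mandatory fix targets for the revision pass.
# The structure of the critique dict is an assumption, not the runtime's real schema.
def build_revision_prompt(draft: str, critique: dict) -> str:
    fix_targets = []
    for kind in ("red_flags", "blind_spots", "improvements"):
        for item in critique.get(kind, []):
            fix_targets.append(f"- [{kind}] {item}")
    targets = "\n".join(fix_targets) or "- (no issues reported)"
    return (
        "Revise the draft below. Every fix target is mandatory; "
        "do not drop supported content.\n\n"
        f"Mandatory fix targets:\n{targets}\n\nDraft:\n{draft}"
    )

prompt = build_revision_prompt(
    "Redis vs Valkey vs Memcached ...",
    {"red_flags": ["latency numbers lack a source"], "blind_spots": ["no persistence trade-off"]},
)
print(prompt)
```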
Evidence collection now prefers the strongest available backend in order:
- Tavily (keyed, highest priority)
- public web search fallback (DuckDuckGo / optional SearXNG)
- LLM-powered live_search
- static catalog + local dossiers
The evidence digest flows directly into synthesis and revision prompts.
Each task runs inside a persistent thread with resumable execution, retry/interrupt/recovery, workspace artifacts, and event stream export.
Skills are packaged capabilities, not the entire product. The runtime can export an interoperability catalog so external OpenAI/Anthropic-style ecosystems can consume the project's capabilities.
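The export format is not reproduced here; as an illustration, one exported catalog entry might look roughly like an OpenAI/Anthropic-style tool definition. This is a sketch, not the actual schema produced by `skills-interop-export`, and the skill name is hypothetical.

```python
# Illustrative only: a plausible shape for one interop catalog entry.
# The real schema emitted by `python -m app.main skills-interop-export` may differ.
catalog_entry = {
    "name": "evidence_brief",
    "description": "Produce an evidence-backed brief for a research query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The research question."},
            "max_sources": {"type": "integer", "description": "Upper bound on cited sources."},
        },
        "required": ["query"],
    },
}
```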
```bash
pip install -r requirements.txt
```

```python
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
run = engine.run(
    query="Summarize the key risks of deploying LLMs in production",
    constraints=HarnessConstraints(max_steps=5, max_tool_calls=4),
)
print(run.final_answer[:2000])
```

```python
from app.harness.engine import HarnessEngine
from app.harness import HarnessConstraints

engine = HarnessEngine()
run = engine.run(
    query="Design a caching strategy for a microservices architecture",
    constraints=HarnessConstraints(
        max_steps=5,
        max_tool_calls=4,
        enable_live_agent=True,
        max_live_agent_calls=4,
    ),
    live_model={
        "base_url": "https://your-endpoint/v1",
        "api_key": "your_api_key",
        "model_name": "gpt-4o",
    },
)
print(run.final_answer[:3000])
```

Or use environment variables:
```bash
export AGENT_HARNESS_MODEL_BASE_URL=https://your-endpoint/v1
export AGENT_HARNESS_MODEL_API_KEY=your_api_key
export AGENT_HARNESS_MODEL_NAME=gpt-4o

# Optional higher-quality real web search
export TAVILY_API_KEY=your_tavily_key

# Optional self-hosted public metasearch
export SEARXNG_BASE_URL=http://localhost:8080

python -m app.main harness-live "Write an evidence-backed research brief"
```

```bash
pytest -q
# 183 tests passing
```

| Goal | Command |
|---|---|
| Create thread | `python -m app.main agent-thread-create "My Task"` |
| List threads | `python -m app.main agent-threads` |
| Run generic super-agent | `python -m app.main agent-thread-run <thread_id> "<query>" --target auto` |
| Execute task graph | `python -m app.main agent-thread-exec-task <thread_id> "<query>" --target general` |
| Export workspace view | `python -m app.main agent-thread-workspace-view <thread_id>` |
| Run harness with live model | `python -m app.main harness-live "<query>"` |
| Build code mission pack | `python -m app.main harness-code-pack "<query>" --workspace .` |
| Run showcase | `python -m app.main studio-showcase "<query>" --tag demo` |
| Export skills interop | `python -m app.main skills-interop-export` |
- `app/harness/`: planner, live orchestration, evidence, evaluation, reporting
- `app/agents/`: thread runtime, workspace action mapper, scheduler, sandbox
- `app/skills/`: built-in skills, packages, interop export
- `app/core/`: task spec, capability graph, contracts, shared state
- `app/studio/`: showcase and release packaging
- `docs/demo/`: live demo outputs generated with real API calls
- `reports/`: runtime outputs generated locally
- `tests/`: 183 tests covering runtime, showcase, planner, and live orchestration
```
                  +--------------------+
                  |    User Request    |
                  +---------+----------+
                            |
                  +---------v----------+
                  | Capability Planner |  (configurable limits, language-agnostic)
                  +---------+----------+
                            |
        +-------------------+-------------------+
        |                   |                   |
+-------v-------+   +-------v-------+   +-------v-------+
|  Skill Router |   |  Tool Engine  |   |   Workspace   |
|  (26 skills)  |   |  (15+ tools)  |   |  (file I/O)   |
+-------+-------+   +-------+-------+   +-------+-------+
        |                   |                   |
        +-------------------+-------------------+
                            |
                  +---------v----------+
                  |  Evidence Bundle   |  (live search + static catalog)
                  +---------+----------+
                            |
                  +---------v----------+
                  |     Live Agent     |  (analysis -> synthesis -> critique -> revision)
                  +---------+----------+
                            |
                  +---------v----------+
                  |  Main Deliverable  |  (task-adaptive format)
                  +--------------------+
```
Agent Harness is strongest today when a task benefits from:
- evidence-aware synthesis (not just generation)
- structured multi-pass reasoning (analysis + critique + revision)
- persistent thread with inspectable artifacts
- task-adaptive output format (not forced templates)
What comes next:
- connect real web search APIs for live evidence collection
- strengthen long-horizon thread execution for complex multi-step tasks
- improve evidence grounding so every claim traces to a named source
- reduce the gap between live-agent-enhanced and base-only output quality