Skip to content

AweAI-Team/AweAgent

Repository files navigation

AweAgent logo  AweAgent

Make Agent Research Systematic.

A unified, composable framework to build, evaluate, and train agents.

Python 3.11+ License: Apache-2.0

Agent research is fragmented across domains and stages: each task type tends to come with its own agent stack — code, search, terminal — and moving to RL usually requires rebuilding the rollout pipeline. Nothing carries over. AweAgent brings these pieces into one composable framework for building, evaluating, and training agents.

AweAgent's core capabilities:

  • Unified across task types — search, code, and terminal agents run on the same execution core, with task-specific behavior composed through reusable interfaces instead of separate stacks.
  • Composable agent harnesses — an agent is split into a step(ctx) -> action policy, the loop that runs it, and a context bus that carries state and dependencies; build new agents by recomposing these parts instead of forking the engine.
  • Protocol-centered extensibility — LLM backends, tools, runtime sandboxes, agent scaffolds, tool backends, and evaluators are exposed through small protocols and entry-point registries; register a new component instead of patching the core engine.
  • Evaluation & trajectories as first-class data — every run emits a structured result plus the full trajectory; code tasks can be evaluated in isolated Docker runtimes, and the experimental training path can collect token-level rollout data (loss mask · logprobs · weight versions).

📰 News

  • [2026-06-10] 🎉 Added Long-horizon & DeNovoSWE scaffolds support.
  • [2026-06-04] 🎉 Added DeepSearch & IterResearch scaffolds + BrowseComp support.
  • [2026-05-10] 🎉 Added NL2Repo and SWE-bench Pro task support.
  • [2026-03-16] 🎉 Added unified LLM backends (openai/azure/response/ark/anthropic/sglang) with multi-provider reasoning support (docs).
  • [2026-03-15] 🎉 Added Terminus-2 scaffold with Terminal-Bench 2.0 support.
  • [2026-03-01] 🎉 Initial release with SearchSWE scaffold with BeyondSWE & ScaleSWE.

🧩 Scaffolds

Reference agents shipped in-tree, all on the shared core.

Scaffold Type Highlight Resources
OpenHands-style coding CodeAct-XML coding agent, behavior-compatible with OpenHands (search off) code
SearchSWE coding SWE coding agent with web search & fetch — fixes repo issues, pulls in external docs code
DeepSearch deep search Base web-research QA agent; retry-until-answerable loop policy code
IterResearch deep search Deep search + interaction scaling for long, multi-step research code
Terminus-2 terminal tmux terminal agent driven by raw JSON keystrokes, on the standard loop code

OpenHands-style and SearchSWE are the same scaffold (search_swe), one enable_search flag apart — listed separately because they behave differently.

📋 Datasets & Benchmarks

Training sets — large-scale data for training / distilling agents:

Dataset Description Scaffold Resources
ScaleSWE large-scale SWE-bench-style data SearchSWE / OpenHands data · guide
DeNovoSWE doc2repo — implement a package from a natural-language spec SearchSWE / OpenHands data · guide

Test sets — evaluation benchmarks:

Benchmark Description Scaffold Evaluation Resources
BeyondSWE Doc2Repo · CrossRepo · DepMigrate · DomainFix SearchSWE / OpenHands isolated Docker patch test data · guide
SWE-bench-Pro extended SWE-bench code tasks SearchSWE / OpenHands isolated Docker patch test data · guide
SWE-bench Verified 500-instance human-verified SWE-bench split SearchSWE / OpenHands official SWE-bench harness † data · guide
NL2Repo build a repo from a natural-language spec SearchSWE / OpenHands isolated Docker (artifact + golden tests) data · guide
Terminal-Bench 2.0 terminal tasks in containers Terminus-2 same-container reward repo · guide
BrowseComp web-search QA DeepSearch / IterResearch LLM-as-Judge guide

SWE-bench Verified is a reproduction recipe (recipes/scale_swe/swebench_verified/), not a framework-native task: the agent's patches are exported as predictions and scored by the public SWE-bench harness (with documented eval-side compatibility patches), reproducing the published Scale-SWE-Agent result. The other benchmarks run end-to-end through AweAgent's own isolated evaluator.

🗺️ Roadmap

Long-term goal: practical, general-purpose agents optimized with reinforcement learning. Shipped so far — the four scaffolds plus the datasets & benchmarks above. Next:

  • Multi-agent — multi-agent collaboration and orchestration on the shared core
  • RL training — reinforcement-learning rollouts via Slime with an SGLang rollout engine (experimental today)

🚀 Installation

Requires Python 3.11+ and Docker (for sandboxed execution and isolated evaluation).

uv (recommended)

curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
uv venv --python 3.11 && source .venv/bin/activate
uv pip install -e .

pip

git clone https://github.com/AweAI-Team/AweAgent.git && cd AweAgent
python -m venv .venv && source .venv/bin/activate
pip install -e .

A single pip install -e . runs every scaffold and benchmark, and all LLM backends except Volcengine Ark, out of the box. Optional extras: ".[ark]" (Volcengine Ark backend) · ".[dev]" (pytest · ruff · mypy).

Why editable (-e)? You're installing from source and will likely tweak agents, tools, or configs — -e makes changes take effect without reinstalling. Verify everything is registered with awe-agent info.

▶️ Running a Benchmark

Download data

Datasets download through one script, run from the repo root. Data lands under datasets/<task>/ — the path each task config defaults to — so afterward you can run with no extra env vars.

bash datasets/download.sh beyond_swe              # one task
bash datasets/download.sh all                     # everything wired
FORCE=true bash datasets/download.sh beyond_swe   # re-download

Wired today: BeyondSWE · BrowseComp · Terminal-Bench 2.0. See datasets/ for HF token / mirror options and per-task notes; other datasets are covered in each benchmark's guide.

Run

# point at your LLM
export OPENAI_API_KEY="sk-..."

# sanity-check what's registered (backends, runtimes, agents, tools)
awe-agent info

# list instances — no Docker needed
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode dry-run

# batch run
python recipes/beyond_swe/run.py --data-file datasets/beyond_swe/beyond_swe.jsonl --mode batch

See each benchmark's guide for full setup, CLI arguments, and output format.

🏗️ Architecture

AweAgent architecture

Four layers driven by a shared core — the figure maps 1:1 onto the modules below.

Module descriptions

  • TaskRunner (core/task) — the batch engine: loads a Task, provisions its runtime, drives each instance through the loop, routes the result to an evaluator, and writes structured output (concurrency · retries · per-instance isolation).
  • AgentContext (core/agent) — the shared bus: all rollout state (messages, trajectory, stats) plus every injected dependency (LLM, tools, tool-call format, runtime) and an optional training field. The single seam between the agent and the outside world.
  • AgentLoop (core/agent) — the rollout engine: runs the step loop, branches only on the kind of action (finish · message · tool call), dispatches tools by name, and records the trajectory + RL tokens — agnostic to whether it's driving a search, code, or terminal agent.
  • Agent scaffold (scaffold/) — the policy: a near-stateless step(ctx) → action. Built-ins: SearchSWE · DeepSearch · Terminus-2 · IterResearch.
  • Interaction layer (core/llm · core/tool · core/runtime) — the pluggable dependencies the loop injects: LLM backends, tools, tool-call formats, and runtime sandboxes.
  • Evaluation & Data (core/eval · tasks/ · integrations/) — turns a finished run into a score (isolated Docker patch test · LLM-as-judge · in-container reward) or token-level RL rollout data (Slime bridge).
  • Config / Registry (core/config · plugins/) — layered YAML config + entry-point registries that wire every part by name.

⚙️ Configuration

Configs are YAML files with environment-variable substitution (${VAR}, ${VAR:-default}) and !include support.

LLM Backends

Backend Config File Required Env Vars
OpenAI configs/llm/openai.yaml OPENAI_API_KEY
OpenAI (Responses) configs/llm/openai_response.yaml OPENAI_API_KEY
Azure OpenAI configs/llm/azure.yaml AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT
Anthropic configs/llm/anthropic.yaml ANTHROPIC_API_KEY
Volcengine Ark configs/llm/ark.yaml ARK_API_KEY, ARK_MODEL_ID
SGLang configs/llm/sglang.yaml (self-hosted endpoint)

Environment Variables

Copy .env.example to .env and fill in your values:

cp .env.example .env

Sections: LLM Backend (pick one — API key + endpoint), Task Data (DATA_FILE), and Search Tools (optional — SERPAPI_API_KEY, JINA_API_KEY, only for search mode). See each benchmark's recipe guide (linked in the table above) for the full list.

🤝 Contributing

Issues and PRs are welcome. AweAgent is built to be extended — adding an LLM backend, tool, runtime, agent scaffold, or evaluator means implementing a small protocol and registering one entry point, with no changes to the core engine. For development tooling, install with pip install -e ".[dev]" (pytest · ruff · mypy).

📜 Citation

If AweAgent is useful in your work, please consider citing it and giving the repo a ⭐.

@misc{aweagent2026,
  title        = {AweAgent: A Unified, Composable Framework to Build, Evaluate, and Train Agents},
  author       = {AweAI Team},
  year         = {2026},
  howpublished = {\url{https://github.com/AweAI-Team/AweAgent}}
}

📄 License

Released under the Apache-2.0 License.

📨 Contact

Questions or feedback? Open an issue or email gx.chen.chn@gmail.com.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors