Bare-Agent Coding Benchmark

Not a code-generation benchmark. An agentic benchmark.
An agent evaluation paradigm that does not inflate scores by piling on prompts and tools.

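What it is

Most coding benchmarks measure how good the code a model writes is. We measure something else:

Given only the four most basic capabilities (Read, Write, Edit, Shell), can a model work like a real developer, continuously understanding, acting on, and correcting its work in a file system until the task is done?

No multi-agent collaboration. No dozens of specialized tools. No lengthy system prompts. Back to the agent itself.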

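Four tools, zero magic

Tool	Description
`Read`	Read a text file inside the workspace
`Write`	Create or overwrite a file
`Edit`	Exact text replacement (single occurrence / all occurrences)
`Shell`	Run a shell command inside the workspace

There is no grep tool, no lint, no web_search, no plan. If the model needs those capabilities, it must compose them itself through Shell.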
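To make the four-tool surface concrete, here is a minimal sketch of what such tools could look like in Python. It is not the project's implementation: the function names, signatures, the workspace-escape check, and the rule that a non-global `edit` fails on ambiguous matches are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def _resolve(workspace: Path, rel_path: str) -> Path:
    """Resolve rel_path inside workspace and refuse paths that escape it (assumed policy)."""
    root = workspace.resolve()
    target = (root / rel_path).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"path escapes workspace: {rel_path}")
    return target


def read(workspace: Path, rel_path: str) -> str:
    """Read: return the contents of a text file in the workspace."""
    return _resolve(workspace, rel_path).read_text(encoding="utf-8")


def write(workspace: Path, rel_path: str, content: str) -> None:
    """Write: create or overwrite a file, creating parent directories as needed."""
    target = _resolve(workspace, rel_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")


def edit(workspace: Path, rel_path: str, old: str, new: str, replace_all: bool = False) -> int:
    """Edit: exact text replacement; returns the number of replacements made."""
    if not old:
        raise ValueError("old text must not be empty")
    target = _resolve(workspace, rel_path)
    text = target.read_text(encoding="utf-8")
    count = text.count(old)
    if count == 0:
        raise ValueError("old text not found")
    if count > 1 and not replace_all:
        # Assumed behavior: a single-occurrence edit must match exactly once.
        raise ValueError("old text is ambiguous; pass replace_all=True")
    target.write_text(text.replace(old, new), encoding="utf-8")
    return count


def shell(workspace: Path, command: str, timeout: float = 60.0) -> tuple[int, str]:
    """Shell: run a command with the workspace as cwd; return (exit code, combined output)."""
    proc = subprocess.run(
        command, shell=True, cwd=workspace,
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.returncode, proc.stdout + proc.stderr
```

With only these four functions, a search is something the model has to build itself, for example by calling `shell(workspace, "grep -rn TODO src")` and reading the returned output.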

Quick install

curl -fsSL https://raw.githubusercontent.com/DWG-ShowMaker/Bare-Agent-Coding-Benchmark/main/install.sh | bash

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc

Install from source

git clone https://github.com/DWG-ShowMaker/Bare-Agent-Coding-Benchmark
cd Bare-Agent-Coding-Benchmark
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

Get started in 30 seconds

codebench init                     # interactive setup, writes ~/.codebench/config.json
codebench chat --theme dark        # start the interactive shell
codebench run --prompt "explain the architecture of this repo"
codebench compare --profile gpt41 --profile sonnet

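The interactive shell is built on prompt_toolkit and supports command history, fuzzy completion, Markdown rendering of output, and display of the model's thinking process.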

Three commands close the evaluation loop

codebench init         → configure the model
codebench run          → debug a single task
codebench compare      → compare models head to head
codebench benchmark    → run the batch benchmark

Configuration

Configuration precedence (highest first): --config <path> > ./codebench.json > ~/.codebench/config.json
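The precedence order above can be sketched as a small lookup. This is an illustration only: `resolve_config_path` is a hypothetical helper, and it assumes the first file found wins outright (no merging between files), which the README does not confirm.

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(
    cli_path: Optional[str] = None,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the config file to load, following the documented precedence."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()

    # 1. An explicit --config <path> always wins.
    if cli_path is not None:
        return Path(cli_path)

    # 2. Project-level file, then 3. user-level file.
    for candidate in (cwd / "codebench.json", home / ".codebench" / "config.json"):
        if candidate.is_file():
            return candidate

    # Nothing configured yet (assumed: the caller would suggest `codebench init`).
    return None
```

Passing `cwd` and `home` explicitly keeps the lookup easy to test without touching the real home directory.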

{
  "version": 1,
  "default": {
    "provider": "openai",
    "api_key": "sk-...",
    "model": "gpt-4.1",
    "max_steps": 60,
    "temperature": 0.0
  },
  "compare": {
    "parallel": 2,
    "profiles": [
      { "name": "gpt41", "provider": "openai", "model": "gpt-4.1" },
      { "name": "sonnet", "provider": "anthropic", "model": "claude-sonnet-4-20250514" }
    ]
  }
}
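As a rough illustration of how the `compare.profiles` entries might map onto `--profile` flags, the sketch below looks profiles up by their `name` field. `select_profiles` is a hypothetical helper, not part of the codebench API, and whether profiles inherit missing fields from `default` is left open here because the README does not say.

```python
from typing import Any

# Trimmed copy of the example configuration shown above.
CONFIG: dict[str, Any] = {
    "version": 1,
    "compare": {
        "parallel": 2,
        "profiles": [
            {"name": "gpt41", "provider": "openai", "model": "gpt-4.1"},
            {"name": "sonnet", "provider": "anthropic", "model": "claude-sonnet-4-20250514"},
        ],
    },
}


def select_profiles(config: dict[str, Any], names: list[str]) -> list[dict[str, Any]]:
    """Return the named compare profiles in the order requested."""
    profiles = config.get("compare", {}).get("profiles", [])
    by_name = {profile["name"]: profile for profile in profiles}
    missing = [name for name in names if name not in by_name]
    if missing:
        raise KeyError(f"unknown profile(s): {', '.join(missing)}")
    return [by_name[name] for name in names]


if __name__ == "__main__":
    for profile in select_profiles(CONFIG, ["gpt41", "sonnet"]):
        print(profile["name"], profile["provider"], profile["model"])
```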

Full configuration docs →

Benchmark tasks

Each task lives in its own directory; task.json describes the inputs and the verification rules:

{
  "id": "fix-broken-tests",
  "prompt_file": "prompt.md",
  "workspace_dir": "workspace",
  "setup": ["pip install -r requirements.txt"],
  "verify": ["pytest"]
}

tasks/
  fix-broken-tests/
    task.json
    prompt.md
    workspace/         ← starting point of the agent's workspace
      src/
      tests/

The task workspace is copied into .runs/ at run time; the original files are never modified.
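The task lifecycle described here (copy the workspace, run setup, let the agent work, run verify) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's runner: `run_task`, the `agent` callback, the run-directory naming scheme, and the rule that a task passes only if every verify command exits with code 0 are all hypothetical.

```python
import json
import shutil
import subprocess
import uuid
from pathlib import Path
from typing import Callable


def run_task(task_dir: Path, runs_root: Path, agent: Callable[[str, Path], None]) -> bool:
    """Run one task in a fresh copy of its workspace; True if every verify command exits 0."""
    task = json.loads((task_dir / "task.json").read_text(encoding="utf-8"))
    prompt = (task_dir / task["prompt_file"]).read_text(encoding="utf-8")

    # Copy the pristine workspace so the original task files stay untouched.
    # The "<id>-<random suffix>" directory name is an assumption.
    run_dir = runs_root / f"{task['id']}-{uuid.uuid4().hex[:8]}"
    shutil.copytree(task_dir / task["workspace_dir"], run_dir)

    # Setup commands prepare the copied workspace; a failing setup raises.
    for command in task.get("setup", []):
        subprocess.run(command, shell=True, cwd=run_dir, check=True, capture_output=True)

    # The agent gets the prompt and the copied workspace, nothing else.
    agent(prompt, run_dir)

    # Assumed pass criterion: all verify commands succeed.
    for command in task.get("verify", []):
        result = subprocess.run(command, shell=True, cwd=run_dir, capture_output=True)
        if result.returncode != 0:
            return False
    return True
```

Because each call works on its own copy, the same task can be run repeatedly, or against several models, without the runs interfering with one another.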

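Benchmark design doc →

Why we do it this way

Not	a general-purpose coding assistant
Not	a prompt-engineering competition platform
Not	a contest over the number of tools
Is	a clean test of an agent's core capabilities

Documentation

Doc	Description
Configuration	All configuration options, load precedence, path resolution
Interactive Shell	Command list, keyboard interaction, completion
Compare & Benchmark	Multi-model comparison, batch evaluation, task format
TUI	Nothing-style terminal interface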

License

MIT

README available in 中文 · English · 日本語 · 한국어
