- Quality: When you use AI models for coding, or even just for brainstorming, many answers are vague, unchecked, and underdeveloped, as if the model hoped you would accept low-quality work. Solution: automated feedback. The same model can play the role of an auditor that pushes back, forces proper planning, and checks the testing evidence.
- Lack of control: You do not want to hand your entire machine to an AI model that can run unpredictable commands on your system. Instead, let it work inside a workspace directory that is the only host path visible from a Docker container, so it does the work for you but sees only what you want it to see. Solution: a light script that starts two Docker containers, one serving a model and the other using it, with everything else locked down.
agenticFeedbackCoding runs a Docker-isolated local AI coding workflow: edit one JSON prompt, start an OpenAI-compatible model server, and let one agent implement while a second agent reviews every step with tests, git diffs, command output, file evidence, and screenshots/reports when available.
The main tested setup uses Qwen3.6-27B Q4_K_M GGUF served by llama.cpp/Vulkan through an OpenAI-compatible endpoint on AMD Ryzen AI Max+ 395 / Strix Halo. A second comparison run used Gemma4-26B-A4B Q4_K_M through the same endpoint shape. Other OpenAI-compatible local or remote models can be configured, but the evidence in this README comes from those local GGUF runs. Normal work uses two Docker containers for safety and reproducibility: one model-server container and one agent container on a shared Docker network. Only the generated project workspace is mounted out to the host.
The project is intentionally config-driven. One JSON file defines the model endpoint, workspace, review strictness, allowed tools, web/offline mode, context-safety limits, and the project prompt.
The normal local setup keeps model serving and agent execution in separate containers. The CLI script starts the workflow from one JSON config, the agent container talks to the model container through an OpenAI-compatible REST API, and only the generated workspace is mounted back to the host.
flowchart LR
User["User terminal"]
Config["config.*.json<br/>project prompt + knobs"]
CLI["scripts/build_and_run.sh<br/>scripts/run_agent.sh"]
subgraph Host["Host filesystem"]
Workspace["Mounted output workspace<br/>workspaces/my-project"]
end
subgraph Net["Docker network: agentic-feedback-net"]
subgraph ModelContainer["Model server container"]
Model["GGUF model<br/>Qwen3.6 / Gemma4"]
API["llama.cpp server<br/>OpenAI-compatible REST API<br/>:8161/v1"]
Model --> API
end
subgraph AgentContainer["Agent container"]
Impl["Implementation agent"]
Review["Feedback/review agent"]
Tools["Tools<br/>terminal, git, Python Playwright,<br/>optional web research"]
Impl <--> Review
Impl --> Tools
Review --> Tools
end
end
User --> Config
Config --> CLI
CLI --> AgentContainer
CLI -. starts .-> ModelContainer
AgentContainer <-->|REST API| API
AgentContainer -->|writes files, transcripts, evidence| Workspace
Clone the repo, start the local model server, then run a real benchmark through Docker:
# Ubuntu host prerequisites if you do not already have them:
# sudo apt-get update && sudo apt-get install -y git docker.io python3 curl ca-certificates
git clone https://github.com/krzyszsz/agenticFeedbackCoding.git
cd agenticFeedbackCoding
MODEL_ROOT=$HOME/hf/models bash scripts/start_default_model_server.sh
bash scripts/build_and_run.sh --config config.real-palindrome.json
That run starts the model server on the agentic-feedback-net Docker network, builds the agent container, mounts only the configured workspace, asks the local model to build the project, and stores the full transcript plus review evidence under workspaces/real-palindrome/.agent_state/.
The live transcript is printed while the run is active, so a long job should visibly move through requirements, plan review, implementation attempts, and feedback. While a model call is in flight, the terminal also prints a heartbeat with elapsed time and a lightweight REST health check. Those heartbeat lines are human-facing only; they are not written into the reusable agent transcripts under .agent_state/. The final terminal output is compact by default; the full evidence is written under .agent_state/.
Copy a real config and edit the prompt/workspace:
cp config.example.json config.my-project.json
The important fields are usually enough:
{
"implementation_model": {
"base_url": "http://127.0.0.1:8161/v1",
"model": "local-gguf",
"context_window": 76800,
"max_tokens": 32768,
"temperature": 0.25,
"request_timeout_seconds": 21600,
"retry_attempts": 20,
"retry_sleep_seconds": 30,
"request_heartbeat_seconds": 30,
"preserve_reasoning": true
},
"feedback_model": null,
"mcp_tools": {
"terminal": true,
"web_scraping": false,
"web_interaction": true
},
"runtime": {
"docker_isolation": true,
"docker_user": "host",
"workspace": "workspaces/my-new-project",
"plan_file": "PLAN.md",
"requirements_file": "REQUIREMENTS.md",
"research_file": "RESEARCH.md",
"command_timeout_seconds": 120,
"max_command_timeout_seconds": 21600,
"color_transcript": true,
"live_turn_max_chars": 0,
"final_summary": "compact",
"feedback_response_max_tokens": 4096
},
"web_research": { "enabled": false },
"project_design": {
"title": "My new project",
"prompt": "Build a browser game with tests and documentation."
}
}
For a very small config, start from config.minimal.json or override the prompt and workspace from the command line:
bash scripts/build_and_run.sh \
--config config.minimal.json \
--workspace workspaces/my-project \
--prompt "Build a small Python CLI with tests and a README."
For longer prompts, keep the prompt in the JSON file. The command-line override is a convenience, not a replacement for versioned task configs.
command_timeout_seconds is only the default timeout for one terminal command. It is not the model response timeout. If a generated test or build step needs longer, the agent can request it per command:
{"cmd": ["python", "long_running_check.py"], "timeout_seconds": 7200}
That request is clamped by runtime.max_command_timeout_seconds. Model calls use implementation_model.request_timeout_seconds, which is set high by default for long local-model runs.
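The clamping rule can be sketched in a few lines of Python. This is a minimal illustration rather than the harness's actual code: the function name `effective_timeout` is hypothetical, the defaults simply mirror the example config values, and treating a non-positive request as "not set" is an assumption.

```python
from typing import Optional


def effective_timeout(
    requested: Optional[float],
    default_seconds: float = 120,    # runtime.command_timeout_seconds
    max_seconds: float = 21600,      # runtime.max_command_timeout_seconds
) -> float:
    """Return the timeout to apply to one terminal command (hypothetical helper)."""
    # Assumption: a missing or non-positive request falls back to the default.
    if requested is None or requested <= 0:
        return min(default_seconds, max_seconds)
    # A per-command request is honored, but never beyond the configured ceiling.
    return min(requested, max_seconds)
```

With these defaults, a request for 7200 seconds is granted as-is, while a request above 21600 seconds is cut back to the ceiling.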
runtime.final_summary controls only the final stdout block after the live transcript:
- compact prints status, step counts, and evidence paths.
- full prints the full nested summary.json object.
- none suppresses the final block.
runtime.live_turn_max_chars controls live console output for each transcript turn. The default 0 prints each turn fully as it happens. Set a positive value, for example 30000, if you want progress to remain visible without a single huge tool payload flooding the terminal. Saved transcript files are not truncated by this setting.
Normal agentic work runs inside Docker. scripts/run_agent.sh refuses to run the workflow directly on the host unless ALLOW_HOST_AGENT_RUN=1 is explicitly set for harness development.
The standard setup uses two containers on one Docker network:
- scripts/start_default_model_server.sh creates/uses agentic-feedback-net, starts the llama.cpp/Vulkan server as agentic-qwen36-server, and publishes 127.0.0.1:8161 for host-side checks.
- scripts/run_agent.sh starts the agent container on the same network and overrides the in-container model URL to http://agentic-qwen36-server:8161/v1.
The agent container gets one writable mount: the configured runtime.workspace, mapped to /workspace/project. The config file is mounted read-only. The Docker socket is not mounted. Host networking is no longer required for the normal two-container path; keep it only as an explicit compatibility mode with DOCKER_NETWORK=host AGENT_DOCKER_NETWORK=host.
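To make the mount policy concrete, here is a hedged Python sketch that assembles a `docker run` argument list with exactly one writable mount and a read-only config. The function name `agent_docker_args` is hypothetical and the real scripts/run_agent.sh may build its command differently; the container paths /workspace/project and /app/config.json follow the prebuilt-image example elsewhere in this README.

```python
from pathlib import Path
from typing import List


def agent_docker_args(
    workspace: Path,
    config: Path,
    image: str,
    network: str = "agentic-feedback-net",
) -> List[str]:
    """Build a docker run command for the agent container (illustrative only).

    Pass absolute host paths, because docker -v requires them for bind mounts.
    """
    return [
        "docker", "run", "--rm",
        "--network", network,
        # The only writable mount: the configured workspace.
        "-v", f"{workspace}:/workspace/project",
        # The config is visible inside the container but read-only.
        "-v", f"{config}:/app/config.json:ro",
        # Deliberately no Docker socket mount and no host networking.
        image, "--config", "/app/config.json",
    ]
```

Keeping the argument list in one place makes it easy to check that nothing beyond the workspace and the config is ever exposed to the container.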
The agent container includes Python, Python Playwright with a preinstalled Chromium browser, system Chromium, pytest, curl, git, jq, requests, and beautifulsoup4, so generated projects can run tests, browser checks, and scraping-style tasks without installing those tools into the host project folder.
Browser validation is intentionally Python-first by default. The container does not include Node, npm, npx, or @playwright/test, so generic browser/UI validation should use from playwright.sync_api import sync_playwright. That is a weak preference, not a technology lock: if a task explicitly requires another SDK/runtime, make dependency discovery and container-local installation an explicit plan step, usually with runtime.docker_user=root, bounded timeouts, and clear evidence of what was installed.
runtime.docker_user defaults to host, so generated files are owned by the host user. Set it to root only for tasks that intentionally need package-manager access inside the disposable agent container, for example a workflow that checks disk space, installs a small diagnostic tool with apt-get, runs it, and writes a report into the mounted workspace. That still does not grant access to the host filesystem outside the configured workspace.
Direct host execution is deliberately awkward:
ALLOW_HOST_AGENT_RUN=1 bash scripts/run_agent.sh --config config.my-project.json
Use that only for harness development. For normal agentic coding, Docker isolation is the supported path.
If you did not already clone it in Quick Start, clone and enter the repo:
git clone https://github.com/krzyszsz/agenticFeedbackCoding.git
cd agenticFeedbackCoding
The normal Docker-isolated path needs only a small host toolchain: git to clone the repo, Docker to run the model/agent containers, Python 3 for the wrapper scripts, and curl for model-server readiness checks.
sudo apt-get update
sudo apt-get install -y git docker.io python3 curl ca-certificates
sudo usermod -aG docker "$USER" # log out/in afterwards, or use sudo docker
The agent runtime dependencies live inside the agent Docker image. You do not need to install Playwright, pytest, Python packages, or project-specific SDKs on the host for normal use.
For the tested Qwen3.6 profile, download the model if needed and build/start the llama.cpp/Vulkan model-server image:
HF_TOKEN_FILE=$HOME/hf.key MODEL_ROOT=$HOME/hf/models \
bash scripts/download_default_model.sh
MODEL_ROOT=$HOME/hf/models REBUILD_SERVER_IMAGE=1 \
bash scripts/start_default_model_server.sh
hf.key is a plain-text file that holds a Hugging Face token and lives outside this repo. Create it only if you need authenticated Hugging Face access:
printf '%s' 'hf_your_token_here' > "$HOME/hf.key"
chmod 600 "$HOME/hf.key"
Default model paths are defined in scripts/env.sh:
HF_ROOT=$HOME/hf
MODEL_ROOT=$HF_ROOT/models
HF_TOKEN_FILE=$HOME/hf.key
Start the default llama.cpp/Vulkan server:
MODEL_ROOT=$HOME/hf/models bash scripts/start_default_model_server.sh
By default this starts llama.cpp with CTX_SIZE=76800, PARALLEL=1,
MEM_LIMIT=75g, MEMORY_SWAP=75g, GPU_LAYERS=999, reasoning enabled,
REASONING_BUDGET=1024, REASONING_FORMAT=deepseek, and port 8161.
It also creates/uses the agentic-feedback-net Docker network, names the
server container agentic-qwen36-server, and publishes the API on the host at
127.0.0.1:8161 for quick checks.
Override REASONING_MODE=off, REASONING_BUDGET=..., or
REASONING_FORMAT=none|deepseek|deepseek-legacy if a model needs different
thinking behavior.
PARALLEL=1 keeps one server slot instead of multiplying the long context
across several idle slots. Override these values in the shell if you need a
smaller context, more concurrent slots, or CPU fallback.
Set REBUILD_SERVER_IMAGE=1 when you want to force a fresh llama.cpp/Vulkan
server image build after changing docker/llama-cpp-run.sh or its Dockerfile.
The agent runner similarly reuses agentic-feedback-coding:local after the
first build; set REBUILD_AGENT_IMAGE=1 after changing the harness Dockerfile
or Python code copied into that image.
If you already have a prebuilt agent image, run without rebuilding it:
docker pull ghcr.io/krzyszsz/agentic-feedback-coding:latest
AGENT_IMAGE=ghcr.io/krzyszsz/agentic-feedback-coding:latest \
SKIP_AGENT_IMAGE_BUILD=1 \
bash scripts/build_and_run.sh --config config.minimal.json --prompt "Build a tiny checked project."
Model serving is intentionally separate from the agent image. You can use the
provided llama.cpp/Vulkan container, another machine on your LAN, or any
OpenAI-compatible cloud/local endpoint by setting implementation_model.base_url
or AGENT_IMPLEMENTATION_BASE_URL.
If you do not want to clone the repo at all, the prebuilt-agent path only needs a config file and an output directory. The config may be minimal because the harness has defaults for all other knobs:
mkdir -p agent-output
cat > config.json <<'JSON'
{
"runtime": {"workspace": "/workspace/project"},
"project_design": {
"title": "Tiny checked project",
"prompt": "Build a tiny Python CLI with tests and a README."
}
}
JSON
docker run --rm --network agentic-feedback-net \
-e AGENT_IMPLEMENTATION_BASE_URL=http://agentic-qwen36-server:8161/v1 \
-e AGENT_WORKSPACE=/workspace/project \
-v "$PWD/agent-output:/workspace/project" \
-v "$PWD/config.json:/app/config.json:ro" \
ghcr.io/krzyszsz/agentic-feedback-coding:latest --config /app/config.json
That simplified path assumes the model endpoint already exists. Setting up the model server remains hardware-specific, especially for GPU/Vulkan/driver paths.
The checked-in configs keep a host-friendly endpoint:
http://127.0.0.1:8161/v1
When the agent itself runs in Docker, scripts/run_agent.sh automatically
overrides that URL inside the container to:
http://agentic-qwen36-server:8161/v1
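The URL rewrite can be illustrated with the standard library. This is a sketch under assumptions, not the logic of scripts/run_agent.sh: the function name `container_base_url` is hypothetical, and giving an explicit override (such as a value taken from AGENT_IMPLEMENTATION_BASE_URL) precedence over the automatic rewrite is assumed here.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def container_base_url(
    host_url: str,
    container_name: str = "agentic-qwen36-server",  # MODEL_SERVER_CONTAINER
    port: int = 8161,                               # MODEL_SERVER_PORT
    explicit_override: Optional[str] = None,        # assumed to take precedence
) -> str:
    """Map a host-friendly endpoint to the in-network DNS name (illustrative)."""
    if explicit_override:
        return explicit_override
    parts = urlsplit(host_url)
    # Keep scheme and path, swap only the host:port part.
    return urlunsplit(
        (parts.scheme, f"{container_name}:{port}", parts.path, parts.query, parts.fragment)
    )
```

For example, `container_base_url("http://127.0.0.1:8161/v1")` yields the in-container URL shown above.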
Useful networking overrides:
DOCKER_NETWORK=agentic-feedback-net # model-server container network
AGENT_DOCKER_NETWORK=agentic-feedback-net # agent container network
MODEL_SERVER_CONTAINER=agentic-qwen36-server # DNS name used inside the network
MODEL_SERVER_PORT=8161
AGENT_IMPLEMENTATION_BASE_URL=http://my-model:9000/v1
Use DOCKER_NETWORK=host AGENT_DOCKER_NETWORK=host only if you deliberately
want the older host-network behavior.
The validated local path for this project is Vulkan, not ROCm. On the AMD Ryzen AI Max+ 395 / Strix Halo machine used during development, llama.cpp with Vulkan was more reliable than ROCm for GGUF serving.
Optional AMD/Vulkan host diagnostics use these packages:
libvulkan1 mesa-vulkan-drivers vulkan-tools clinfo
Useful checks:
vulkaninfo --summary
ls -l /dev/dri
clinfo | head -80
If Vulkan causes instability, try CPU fallback for the model server:
USE_DRI=0 GPU_LAYERS=0 MODEL_ROOT=$HOME/hf/models bash scripts/start_default_model_server.sh
That is much slower, but it avoids GPU-driver paths. ROCm is not required and is intentionally not automated here.
The workflow is deliberately more structured than one-pass code generation:
- Requirements refinement fills gaps in the user prompt, records assumptions, and drafts a plan.
- Plan validation checks whether the plan is feasible, clear, ordered, and verifiable before implementation starts.
- Per-step implementation loops run one plan item at a time.
- Feedback reviews inspect requirements, code, files, command output, reports, screenshots, git diffs, and previous critique before accepting a step.
- A workspace-local git repository records the accepted baseline and each accepted plan step.
- A final whole-project review checks the complete result after all individual steps are done.
- Context compaction preserves durable memory when the transcript approaches the configured context window.
Both agents share one durable chat history. New feedback is appended at the end; previous requirements, implementation attempts, reviews, and correction requests stay visible until compaction is needed. When compaction does run, the harness pins the current requirements, plan, research notes, step status, and recent plan notes into the compacted active context so the agents do not have to rediscover what they are supposed to be doing.
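The compaction idea can be sketched as follows. This is a rough illustration under stated assumptions, not the harness's implementation: the function names are hypothetical, token counts are taken as given, the way the threshold combines with the response budget is assumed, and the summary is a placeholder string where the real harness would produce an actual summary.

```python
from typing import List


def needs_compaction(
    history_tokens: int,
    next_prompt_tokens: int,
    response_budget: int,
    context_window: int = 76800,
    threshold_ratio: float = 0.8,
) -> bool:
    """Assumed rule: compact when the projected request crosses the threshold."""
    projected = history_tokens + next_prompt_tokens + response_budget
    return projected > context_window * threshold_ratio


def compact(turns: List[str], pinned: List[str], keep_recent_turns: int = 6) -> List[str]:
    """Keep pinned documents and recent turns; replace older turns with a summary."""
    if keep_recent_turns <= 0:
        raise ValueError("keep_recent_turns must be positive")
    older = turns[:-keep_recent_turns]
    recent = turns[-keep_recent_turns:]
    result = list(pinned)  # requirements, plan, research notes, step status
    if older:
        # Placeholder only: a real summary would condense the older turns.
        result.append(f"[summary of {len(older)} earlier turns]")
    return result + recent
```

Pinning the requirements and plan ahead of the summary is what lets the agents resume without rediscovering the task.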
The harness can work on an existing codebase instead of creating a new project from scratch. Point runtime.workspace at the project folder and, if the project already has its own PLAN.md or REQUIREMENTS.md, give the harness separate state filenames:
"runtime": {
"docker_isolation": true,
"workspace": "workspaces/existing-bugfix-demo",
"plan_file": "AGENT_PLAN.md",
"requirements_file": "AGENT_REQUIREMENTS.md",
"research_file": "AGENT_RESEARCH.md"
}
The checked example seeds a small invoice calculator with a syntax error and a logic bug, then asks the agents to diagnose and repair it without rebuilding the project:
bash scripts/seed_existing_bugfix_fixture.sh
bash scripts/build_and_run.sh --config config.real-existing-bugfix.json
In the verified run, the reviewer first pushed back on vague investigation evidence, then caught that the implementation had fixed only the syntax error while leaving the tax calculation bug. The accepted result fixed both issues, preserved the public API, added BUGFIX_NOTES.md, and passed python -m unittest discover -v.
When runtime.print_transcript=true, the implementation and feedback turns print live so a long run shows progress instead of going silent. If stdout is a TTY and runtime.color_transcript=true, implementation turns use one color and feedback turns another; redirected logs stay plain text.
Long model calls also emit terminal-only heartbeat lines controlled by implementation_model.request_heartbeat_seconds, for example:
[model-call] still waiting for qwen3.6-27b-q4km: 30s elapsed; health=ok http=200.
The heartbeat probes the model server's OpenAI-compatible /models endpoint. It is intentionally not appended to conversation.jsonl or conversation.full.jsonl, so it cannot pollute later context when a run is resumed or inspected by a model.
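A heartbeat of this kind can be approximated with the standard library alone. The sketch below is illustrative: `probe_models` and `heartbeat_line` are hypothetical names, the `opener` parameter exists only to make the probe testable, and the health labels other than `ok` (and printing `http=None` when the server is unreachable) are assumptions rather than the harness's real output.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def probe_models(
    base_url: str,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> Optional[int]:
    """Return the HTTP status of GET {base_url}/models, or None if unreachable."""
    url = base_url.rstrip("/") + "/models"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None


def heartbeat_line(model: str, elapsed_seconds: int, http_status: Optional[int]) -> str:
    """Format one terminal-only heartbeat line (labels besides 'ok' are assumed)."""
    if http_status == 200:
        health = "ok"
    elif http_status is None:
        health = "unreachable"
    else:
        health = "error"
    return (
        f"[model-call] still waiting for {model}: "
        f"{elapsed_seconds}s elapsed; health={health} http={http_status}."
    )
```

Because the line is only printed and never appended to the transcript files, the probe stays a purely human-facing signal.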
The feedback agent does not only read the implementation agent's claims. Before each step review, the harness gives the feedback phase its own evidence:
- a fresh snapshot of generated workspace files
- an independent run of the current plan step's validation_commands
- return codes, stdout/stderr tails, and timeout flags from those commands
- git status --short
- meaningful changed paths, ignoring harness bookkeeping files such as the configured plan/requirements/research documents and .agent_state/
- git diff --stat
- a truncated git diff
The automatic evidence gate uses that feedback-side evidence first. In hard-pushback mode it rejects a step if validation is missing, fails, times out, or if the implementation claims completion without meaningful git changes.
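The gate's decision rule, as described, fits in a small function. The following is a minimal sketch with hypothetical type and function names (`CommandResult`, `StepEvidence`, `gate`); the real harness may track more fields and apply the checks in a different order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CommandResult:
    returncode: Optional[int]        # None when the command never finished
    timed_out: bool = False
    expected_returncode: int = 0     # lets a plan declare an expected failure


@dataclass
class StepEvidence:
    validation: List[CommandResult] = field(default_factory=list)
    changed_paths: List[str] = field(default_factory=list)  # meaningful paths only
    claims_complete: bool = False


def gate(evidence: StepEvidence) -> Optional[str]:
    """Return a rejection reason, or None when the step may go to review."""
    if not evidence.validation:
        return "validation missing"
    for result in evidence.validation:
        if result.timed_out:
            return "validation timed out"
        if result.returncode != result.expected_returncode:
            return "validation failed"
    if evidence.claims_complete and not evidence.changed_paths:
        return "completion claimed without meaningful git changes"
    return None
```

Running this check on reviewer-side evidence, rather than on the implementation agent's own report, is what keeps the gate independent of the implementer's claims.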
The plan checker also rejects validation commands that would accidentally loop
the agents, such as a raw python -c "mean([])" command for an expected
exception path. Expected failures must use expected_returncode or a wrapper
assertion that exits 0 only when the intended error occurs.
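A wrapper assertion of that kind might look like the sketch below. The helper name `expect_error` is hypothetical, and `statistics.mean` stands in for whatever function the generated project actually exposes.

```python
from statistics import StatisticsError, mean
from typing import Callable, Type


def expect_error(action: Callable[[], object], error_type: Type[BaseException]) -> int:
    """Return 0 only when `action` raises `error_type`; return 1 otherwise."""
    try:
        action()
    except error_type:
        return 0   # the intended error occurred: validation passes
    except Exception:
        return 1   # a different error is still a failure
    return 1       # no error at all is also a failure


# A validation script would typically end with something like:
#   sys.exit(expect_error(lambda: mean([]), StatisticsError))
```

The validation command then exits 0 on the expected exception path, so the reviewer does not see a failing return code and loop on a step that is actually correct.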
The harness has three separate context-protection layers:
- Conversation compaction runs before model calls and also accounts for the next prompt plus the configured response budget.
- Tool evidence is bounded before it can enter the live transcript. Command stdout/stderr are drained with a bounded tail buffer, workspace file snapshots keep capped excerpts, and git diffs are capped.
- Bounded reviewer evidence remains available in local run summaries, but the feedback pasted back into the next implementation turn uses a compact evidence summary instead of the raw file/output/diff payload.
This matters because a single noisy command, giant generated file, or huge git diff can otherwise overflow the next local-model request even when ordinary chat-history compaction is enabled.
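One way to drain a command while keeping only a bounded tail is shown below. This is a simplified sketch, not the harness's code: the function name `run_with_bounded_tail` is hypothetical, stderr is merged into stdout for brevity, and timeout handling is left out.

```python
import subprocess
from typing import List, Tuple


def run_with_bounded_tail(cmd: List[str], max_chars: int = 4000) -> Tuple[int, str]:
    """Run `cmd`, drain its output continuously, and keep only the last max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    tail = ""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # simplification: one merged stream
        text=True,
        errors="replace",
    ) as proc:
        assert proc.stdout is not None
        # Read in chunks so the pipe never fills up, but retain only the tail.
        for chunk in iter(lambda: proc.stdout.read(8192), ""):
            tail = (tail + chunk)[-max_chars:]
    # Leaving the with-block waits for the process, so returncode is set here.
    return proc.returncode, tail
```

Memory use stays proportional to `max_chars` plus one chunk, regardless of how much the command prints.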
When git_policy.enabled=true, the generated workspace is initialized as a git repository. After requirements and plan validation, the harness creates a baseline commit. Accepted steps are committed only after feedback returns a resolved status. The final whole-project review also creates an acceptance commit.
If you want the final project left as uncommitted changes for manual inspection, set:
"git_policy": {
"enabled": true,
"commit_completed_steps": true,
"require_step_diff": true,
"leave_final_changes_uncommitted": true,
"final_reset_mode": "soft"
}
With that mode, the harness still uses commits internally during review, then resets to the baseline at the end so the final project appears in git as uncommitted/staged changes.
Web research is optional. The project can run fully locally/offline if you disable it:
"mcp_tools": {
"web_scraping": false
},
"web_research": {
"enabled": false
}
When enabled, web research only runs if the prompt explicitly asks to search/research/browse, look up current/latest information, or includes source URLs. The harness then fetches pages, writes the configured research file, appends the research result to the transcript, injects compact research notes into later prompts, and asks the generated project to cite/apply source URLs when sources were actually fetched.
Fetched non-text responses, such as PDFs or binary downloads, are recorded as unsupported text sources instead of being decoded into the model prompt. That keeps web research generic and avoids flooding the context window with binary noise.
Each run creates or updates the configured workspace, usually under workspaces/, and writes:
- a workspace-local .git/ repository with baseline, accepted-step, and final-review commits when git_policy.enabled=true
- the configured research file, normally RESEARCH.md, when web research was requested and enabled
- the configured requirements file, normally REQUIREMENTS.md, with refined requirements and assumptions
- the configured plan file, normally PLAN.md, with ordered tasks, acceptance criteria, validation commands, and status
- .agent_state/conversation.full.jsonl with the append-only full machine-readable agent chat
- .agent_state/conversation.full.md with the append-only transcript in readable Markdown
- .agent_state/conversation.jsonl with the active model context, which may be compacted during long runs
- .agent_state/conversation.md with the active model context in readable Markdown
- .agent_state/summary.json with step results, review statuses, and feedback evidence
When llama.cpp exposes thinking as reasoning_content, the client preserves it
in those transcripts by default. Recent thinking stays in active chat context;
older raw thinking is summarized during compaction so long runs keep useful
decisions without filling the context window with every internal token.
Generated workspaces, logs, reports, transcripts, and test evidence are ignored by git. They are useful locally, but they should not be published by accident.
| Field | Purpose | Typical values |
|---|---|---|
| implementation_model.name | Human-readable model profile name. | qwen3.6-27b-q4km |
| implementation_model.base_url | OpenAI-compatible endpoint used by the implementation agent. The Docker runner can override it with AGENT_IMPLEMENTATION_BASE_URL, which is how the agent container reaches the model-server container by DNS. | http://127.0.0.1:8161/v1 |
| implementation_model.model | Model id sent to the endpoint. llama.cpp accepts local-gguf. | local-gguf |
| implementation_model.context_window | Context budget used by compaction logic. The default server script starts llama.cpp with CTX_SIZE=76800. | 76800 |
| implementation_model.max_tokens | Max response length per model call. This is an upper bound, not a target; prompts ask for structured JSON, not artificially short answers. | 32768 |
| implementation_model.temperature | Generation randomness. Lower is usually better for coding. | 0.1 to 0.3 |
| implementation_model.request_timeout_seconds | HTTP timeout for one model response. This is separate from terminal command timeouts. | 21600 |
| implementation_model.retry_attempts | Model HTTP retry budget for temporary server/network failures. Retry progress is printed to stderr. | 20 |
| implementation_model.retry_sleep_seconds | Delay between model HTTP retries. Use 0 only for tests. | 30 |
| implementation_model.request_heartbeat_seconds | Prints terminal-only elapsed-time and model REST health lines while a model response is in flight. Set 0 to disable it. | 30 |
| implementation_model.preserve_reasoning | Preserves server-provided thinking/reasoning in the chat transcript as a <think>...</think> block before final content. Disable only if the extra context makes a model less stable. | true |
| feedback_model | Optional separate reviewer model. null reuses the implementation model. | null or another model block |
| mcp_tools.terminal | Allows command execution for implementation and reviewer validation. | true |
| mcp_tools.web_scraping | Allows web research/scraping when a task asks for it. | true or false |
| mcp_tools.web_interaction | Adds browser-validation guidance and reviewer expectations. The tested container path is Python Playwright with preinstalled Chromium. | true or false |
| runtime.docker_isolation | Runs generated project work in a container. Normal use should keep this true. | true |
| runtime.docker_image | Agent container image tag. | agentic-feedback-coding:local |
| runtime.docker_user | User used inside the agent container. host maps to the host UID/GID; root is useful only for deliberate container-local package installs. | host, root |
| runtime.workspace | Host-visible output folder for generated project files. | workspaces/my-task |
| runtime.plan_file | Harness-owned plan filename inside the workspace. Use a custom name when editing an existing repo that already has PLAN.md. | PLAN.md, AGENT_PLAN.md |
| runtime.requirements_file | Harness-owned requirements filename inside the workspace. | REQUIREMENTS.md, AGENT_REQUIREMENTS.md |
| runtime.research_file | Harness-owned research filename inside the workspace. | RESEARCH.md, AGENT_RESEARCH.md |
| runtime.command_timeout_seconds | Default timeout for one terminal command. Commands can override it with {"cmd": [...], "timeout_seconds": N}. | 60 to 300 |
| runtime.max_command_timeout_seconds | Maximum accepted per-command override. Prevents accidental unbounded terminal commands. | 3600 to 21600 |
| runtime.print_transcript | Prints the live agent conversation. | true for debugging |
| runtime.color_transcript | Uses ANSI colors for live transcript roles when stdout is a terminal. Redirected logs stay plain text. | true |
| runtime.live_turn_max_chars | Optional per-turn cap for live terminal printing only. Saved full transcripts remain append-only and untruncated. | 0 for unlimited, or 30000 |
| runtime.final_summary | Final stdout summary mode after the live transcript. Full evidence is always written to .agent_state/summary.json. | compact, full, none |
| runtime.feedback_response_max_tokens | Separate reviewer output cap. Keep this lower than implementation max_tokens because feedback should be structured review JSON rather than generated project content. Set 0 to use the model's full ceiling. | 4096 |
| context_compaction.enabled | Enables transcript compaction near context limits. | true |
| context_compaction.threshold_ratio | Trigger compaction at this fraction of context. | 0.8 |
| context_compaction.keep_recent_turns | Recent turns kept verbatim during compaction. | 6 to 12 |
| context_compaction.tool_output_max_chars | Max stdout/stderr tail kept from each terminal command. The process is drained continuously so verbose tools cannot flood memory/context. | 4000 |
| context_compaction.workspace_file_max_bytes | Max bytes read per workspace file for reviewer evidence. Larger files are represented by first/last excerpts plus size metadata. | 20000 |
| context_compaction.git_diff_max_chars | Max git diff text retained for reviewer evidence. | 20000 |
| context_compaction.transcript_review_max_chars | Max compact review payload pasted back into the live implementation chat. | 24000 |
| phases.requirements_refinement.max_iterations | Requirement refinement retry budget. | 2 |
| phases.plan_validation.max_iterations | Plan validation retry budget. | 2 |
| phases.implementation.max_iterations | Per-step implementation retry budget. | 7 |
| review_policy.hard_pushback_iterations | Strict review attempts before compromise. | 3 |
| review_policy.compromise_iterations | Bounded compromise attempts after strict review. | 4 |
| review_policy.final_review_iterations | Whole-project review attempts. | 1 or 2 |
| quality_policy.assume_code_quality_when_unspecified | Adds default structure/tests/docs requirement unless prompt overrides it. | true |
| quality_policy.require_research_and_structure_step | Requires a first research/architecture step. | true |
| web_research.enabled | Enables harness-owned web research before requirements refinement. | true or false |
| git_policy.enabled | Initializes a workspace-local git repository and records git evidence. | true |
| git_policy.commit_completed_steps | Commits each accepted plan step after feedback resolves it. | true |
| git_policy.require_step_diff | Rejects step acceptance when there are no meaningful implementation changes to review. | true |
| project_design.title | Short task title. | Any string |
| project_design.prompt | Actual task prompt. | Detailed project brief |
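As an illustration of the context_compaction.workspace_file_max_bytes behavior described in the table, a large file can be reduced to first/last excerpts plus size metadata. The sketch below is an assumption about one reasonable shape for that: the function name `excerpt`, the even head/tail split, and the dictionary keys are all hypothetical.

```python
from typing import Dict, Union


def excerpt(data: bytes, max_bytes: int = 20000) -> Dict[str, Union[int, bool, str]]:
    """Represent file content within a byte budget (illustrative layout)."""
    size = len(data)
    if size <= max_bytes:
        return {
            "size": size,
            "truncated": False,
            "text": data.decode("utf-8", errors="replace"),
        }
    half = max_bytes // 2  # assumed: split the budget evenly between head and tail
    return {
        "size": size,
        "truncated": True,
        "head": data[:half].decode("utf-8", errors="replace"),
        "tail": data[size - half:].decode("utf-8", errors="replace"),
    }
```

Keeping both ends of a file tends to preserve imports at the top and the most recent additions at the bottom, while the size field tells the reviewer how much was left out.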
These configs are intended to run against a real local model endpoint. The table below records the latest successful evidence runs and keeps a few reusable stress configs for future checks.
- config.example.json - starter task tracker project.
- config.minimal.json - tiny override-only config showing that defaults fill in the rest.
- config.real-palindrome.json - verified CLI benchmark used as the current evidence run.
- config.real-arithmetic.json - focused arithmetic package task, useful for quick prompt/regression checks.
- config.real-website.json - static website plus browser interaction task.
- config.gemma4-palindrome.json - same CLI benchmark using Gemma4-26B-A4B.
- config.gemma4-website.json - same static website/browser benchmark using Gemma4-26B-A4B and a bounded live transcript.
- config.real-existing-bugfix.json - existing-project repair benchmark using separate agent-owned state files.
- config.real-dotnet-dependency.json - dependency-discovery benchmark where the agent installs .NET inside the disposable container without changing this harness Dockerfile.
- config.real-jsonl-stats.json - Qwen JSONL statistics stress benchmark; the latest long run timed out and is kept as a reusable hard case, not as successful evidence.
- config.gemma4-jsonl-stats.json - fresh JSONL statistics CLI benchmark using Gemma4-26B-A4B.
- config.real-interest-rate-research.json - web-research analysis benchmark using Gemma4-26B-A4B.
- config.real-city-research.json - web-research manifest task.
- config.real-platformer.json - browser platformer task with Playwright validation requirements.
- config.gpx-editor.json - GPX editor task with browser/map-style interaction requirements.
| Script | Purpose |
|---|---|
| scripts/bootstrap_ubuntu.sh | Optional convenience bootstrap for local development. The Quick Start shows the minimal host packages explicitly so users can see what is installed. |
| scripts/install_ubuntu.sh | Compatibility wrapper around scripts/bootstrap_ubuntu.sh. |
| scripts/download_default_model.sh | Downloads and verifies the default Qwen3.6 GGUF model and mmproj files. |
| scripts/start_default_model_server.sh | Builds if needed and starts the default llama.cpp/Vulkan model server on agentic-feedback-net, with host port 8161 published for checks. |
| scripts/build_and_run.sh | Convenience wrapper to build/run the agent harness from a config. |
| scripts/run_agent.sh | Lower-level runner that re-enters Docker when runtime.docker_isolation=true and joins the agent container to the model-server network. |
| scripts/seed_existing_bugfix_fixture.sh | Creates the existing-project repair fixture with planted syntax and logic bugs. |
| scripts/env.sh | Shared path/model defaults. Override values in the shell. |
The table below keeps successful real Docker-isolated workload evidence. The newer Qwen runs were deliberately varied so the harness does not become over-fitted to one task shape: a dependency-heavy .NET task, an existing-project bug fix, and a focused Python package task. The prompts keep browser work friendly to Python Playwright because that is preinstalled in the agent container, but dependency installation remains a normal plan step when the project asks for a different stack.
The Qwen server used the default script and port:
HF_ROOT=/mnt/hf MODEL_ROOT=/mnt/hf/models bash scripts/start_default_model_server.sh
bash scripts/build_and_run.sh --config config.real-palindrome.json
bash scripts/build_and_run.sh --config config.real-website.json

The Gemma comparison used the same llama.cpp/Vulkan server wrapper with model overrides:
HF_ROOT=/mnt/hf MODEL_ROOT=/mnt/hf/models \
MODEL_PATH=/mnt/hf/models/gemma4-26b-a4b-it-gguf/gemma-4-26B-A4B-it-Q4_K_M.gguf \
MMPROJ_PATH=/mnt/hf/models/gemma4-26b-a4b-it-gguf/mmproj-gemma-4-26B-A4B-it-f16.gguf \
CONTAINER=agentic-gemma4-server PORT=8161 CTX_SIZE=76800 \
bash scripts/start_default_model_server.sh
bash scripts/build_and_run.sh --config config.gemma4-palindrome.json
bash scripts/build_and_run.sh --config config.gemma4-website.json

The model server was configured for CTX_SIZE=76800, PARALLEL=1, and the configs used max_tokens=32768 as a response ceiling. The local models usually returned shorter structured JSON than the ceiling because the task contracts are parseable JSON schemas, but the prompts do not ask the implementation agent to be brief when the work needs more detail.
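
As a rough worked example of what those two numbers imply, the sketch below computes how many tokens remain for the prompt if a response uses the full ceiling. It assumes prompt and completion share one context window in the single server slot (PARALLEL=1); `prompt_budget` is an illustrative helper, not harness code.

```python
CTX_SIZE = 76800    # server context size from the commands above
MAX_TOKENS = 32768  # response ceiling used by the configs


def prompt_budget(ctx_size, max_tokens):
    """Tokens left for the prompt when the reply may use the full ceiling.

    Assumes prompt and completion share a single context window.
    """
    if max_tokens >= ctx_size:
        raise ValueError("response ceiling leaves no room for the prompt")
    return ctx_size - max_tokens


if __name__ == "__main__":
    print(prompt_budget(CTX_SIZE, MAX_TOKENS))  # 44032
```

Under that assumption, a conversation history larger than about 44k tokens could not coexist with a maximum-length reply, which is one reason context-safety limits matter in long feedback loops.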
| Model | Workload | Config | Result | Time | Step attempts | Notes |
|---|---|---|---|---|---|---|
| Qwen3.6-27B Q4_K_M | Palindrome CLI with unit tests and docs | config.real-palindrome.json | resolved | 3,615s | S1=2, S2=1, S3=2, S4=3 | Slower, but disciplined. The reviewer caught a case-sensitive validation mismatch, ambiguous empty-string CLI output, and a too-shallow documentation validator before accepting the corrected project. |
| Qwen3.6-27B Q4_K_M | Three-page website with JS interaction and Playwright validation | config.real-website.json | resolved | 4,889s | S1=3, S2=3, S3=1, S4=2 | Most robust complex run. The reviewer rejected shallow validation and required browser/runtime evidence before final acceptance. |
| Qwen3.6-27B Q4_K_M | .NET todo analyzer with container-local SDK install | config.real-dotnet-dependency.json | resolved | 6,321s | S1=2, S2=2, S3=4, S4=1, S5=1 | Confirmed the harness can follow a non-Python stack request without changing its Dockerfile. The agent installed .NET SDK 8.0.420 under /tmp/.dotnet, installed ICU in the disposable container, built the solution, ran 15 tests, and showed CLI output for overdue tasks. |
| Qwen3.6-27B Q4_K_M | Existing invoice project bug fix | config.real-existing-bugfix.json | resolved | 1,998s | S1=2, S2=1, S3=1, S4=1 | Confirmed the harness can repair an existing project instead of starting from a blank directory. The reviewer forced concrete failure reproduction, separated syntax and tax-logic fixes, then accepted after unittest passed and BUGFIX_NOTES.md documented the repair. |
| Qwen3.6-27B Q4_K_M | Focused arithmetic module with tests and docs | config.real-arithmetic.json | resolved | 3,840s | S1=1, S2=3, S3=1, S4=1, S5=2 | Good compact regression run. The reviewer rejected shallow import-only evidence, caught an invalid inline python -c validation shape, and accepted only after generated validation scripts and 20 unit tests passed. |
| Gemma4-26B-A4B Q4_K_M | Palindrome CLI with unit tests and docs | config.gemma4-palindrome.json | resolved | 195s | S1=1, S2=1, S3=1 | Much faster on the small task and completed without reviewer rework. |
| Gemma4-26B-A4B Q4_K_M | Three-page website with JS interaction and Python Playwright validation | config.gemma4-website.json | resolved | 715s | S1=1, S2=3, S3=3 | Fast overall, but more incremental. The reviewer caught incomplete website files, then forced working Python Playwright validation with dynamic port handling before final acceptance. |
| Gemma4-26B-A4B Q4_K_M | JSONL statistics CLI from scratch | config.gemma4-jsonl-stats.json | resolved | 974s | S1=4, S2=4, S3=1, S4=1 | Confirmed a fresh project run after the recent harness changes. The reviewer caught missing tests, syntax/runtime errors, missing sample data, and insufficient final evidence. |
One Qwen JSONL statistics stress run using config.real-jsonl-stats.json was intentionally not counted above: it timed out after 7,200s while still in the feedback loop. That run was still useful because it exposed a generic issue with stale early artifacts and overly clever one-line Python validation commands; both are now covered by deterministic tests.
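
The "Step attempts" column can be read mechanically. The sketch below parses a cell such as `S1=2, S2=1, S3=2, S4=3` and counts attempts beyond the first per step, on the assumption that each extra attempt corresponds to one reviewer-driven retry. The helper names are illustrative; the two sample cells are copied from the table above.

```python
import re


def parse_attempts(cell):
    """Parse a 'Step attempts' cell like 'S1=2, S2=1' into {'S1': 2, 'S2': 1}."""
    return {step: int(count) for step, count in re.findall(r"(S\d+)=(\d+)", cell)}


def rework_count(attempts):
    """Attempts beyond the first for every step (assumed to be reviewer retries)."""
    return sum(count - 1 for count in attempts.values())


# Cells copied from the results table above.
ROWS = {
    "config.real-palindrome.json": "S1=2, S2=1, S3=2, S4=3",
    "config.gemma4-palindrome.json": "S1=1, S2=1, S3=1",
}

if __name__ == "__main__":
    for config, cell in ROWS.items():
        attempts = parse_attempts(cell)
        print(config, "steps:", len(attempts), "rework:", rework_count(attempts))
```

For these two rows the count is 4 for the Qwen palindrome run and 0 for the Gemma palindrome run, which matches the note that the Gemma run completed without reviewer rework.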
Latest generic-regression retest:
| Model | Workload | Config | Result | Time | Step attempts | Notes |
|---|---|---|---|---|---|---|
| Qwen3.6-27B Q4_K_M | Minimal Python CLI from prompt override | config.minimal.json + CLI prompt/workspace override | resolved | 5,445s | S1=1, S2=1, S3=1, S4=2 | Confirmed the minimal config/default merge path still works. Final review ended with a labelled compromise note rather than hidden evidence gaps. |
| Qwen3.6-27B Q4_K_M | Existing invoice project bug fix | config.real-existing-bugfix.json | resolved | 3,539s | S1=2, S2=2, S3=1, S4=1 | Confirmed existing-project repair still works with agent-owned state files separated from project files. |
| Gemma4-26B-A4B Q4_K_M | Palindrome CLI with unit tests and docs | config.gemma4-palindrome.json | resolved | 889s | S1=3, S2=1, S3=2, S4=1 | Slower than the earlier Gemma run after relaxed prompts and thinking preservation, but the reviewer forced clearer plan and CLI evidence. |
| Gemma4-26B-A4B Q4_K_M | Three-page website with Python Playwright validation | config.gemma4-website.json | resolved | 837s | S1=1, S2=1, S3=2, S4=1 | Confirmed complex browser validation still works inside the agent container. |
| Gemma4-26B-A4B Q4_K_M | Web-researched interest-rate impact package | config.real-interest-rate-research.json | resolved | 672s | S1=1, S2=1, S3=2, S4=1 | Confirmed the web-research path stays generic. This run exposed and then verified fixes for malformed grep -m validation commands and binary/PDF research payload handling. |
Observed Qwen simple workload result:
- The run entered Docker via scripts/run_agent.sh because runtime.docker_isolation=true.
- The generated project was written to workspaces/real-palindrome through the /workspace/project mount.
- Feedback-side validation independently ran the generated unit tests and CLI checks, including positive and negative examples.
- The latest rerun passed 18 generated unittest cases plus CLI subprocess checks covering core palindrome behavior, case-insensitivity, punctuation handling, non-palindromes, empty strings, Unicode, and command-line integration.
- Workspace git recorded a baseline commit, one accepted commit per completed plan step, and a final review commit.
Observed Qwen complex workload result:
- The run entered Docker via the same isolated /workspace/project mount and wrote workspaces/real-website.
- Requirements and plan review rejected shallow planning until the steps were independently verifiable.
- The model created a static website, JavaScript interaction, README/notes, and a Playwright validation script.
- Feedback required runtime proof that navigation and the JavaScript interaction worked, not only file-existence checks.
- The accepted project produced browser evidence under the generated workspace, including out/results.json and screenshots.
Observed model behavior from these runs:
- Gemma4 was dramatically faster than Qwen on the two workloads both models ran (palindrome CLI and three-page website), especially the simple CLI task.
- It handled the simple CLI workload cleanly.
- On browser work it needed more explicit environmental guidance and more feedback. In the latest rerun, it first produced incomplete website/browser validation, then the reviewer forced concrete Playwright evidence and a dynamic-port validation script before final acceptance.
- On the existing-project repair run, configurable state filenames kept the fixture's own files separate from the harness plan/requirements documents. Qwen repaired syntax and logic bugs only after the reviewer forced concrete failure evidence.
- On the Gemma JSONL CLI run, the feedback loop did useful work: it caught broken generated code and missing validation evidence before accepting the project.
- On the .NET run, Qwen followed an explicit non-Python technology request and treated dependency setup as project work inside the disposable agent container. That is the intended behavior: Python Playwright is a convenience for browser validation, not a harness-wide technology requirement.
- On the focused arithmetic run, Qwen was slower than Gemma but the feedback loop stayed useful: it pushed for behavioral evidence instead of superficial imports and recovered from invalid validation-command syntax.
- This is not a universal model ranking. It only says that in this harness and with these prompts, Qwen behaved more conservatively on complex coding, while Gemma was much faster and good enough when the tool environment was described tightly.
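
To put "much faster" into numbers, the wall-clock times from the first results table can be compared directly. This is plain arithmetic on single runs, so it describes these runs only and not a general speed ratio; `speedup` is an illustrative helper.

```python
# (Qwen seconds, Gemma seconds) from the first results table above.
TIMES = {
    "palindrome CLI": (3615, 195),
    "three-page website": (4889, 715),
}


def speedup(slow_seconds, fast_seconds):
    """Ratio of the slower wall-clock time to the faster one."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    for workload, (qwen, gemma) in TIMES.items():
        print(f"{workload}: Gemma finished {speedup(qwen, gemma):.1f}x faster")
```

On these figures the ratio is roughly 18.5x for the palindrome CLI and roughly 6.8x for the website, so the gap narrows on the more complex task.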
The evidence is stored locally in ignored generated workspaces:
workspaces/real-palindrome/.agent_state/summary.json
workspaces/real-palindrome/.agent_state/conversation.full.md
workspaces/real-website/.agent_state/summary.json
workspaces/real-website/.agent_state/conversation.full.md
workspaces/gemma4-palindrome/.agent_state/summary.json
workspaces/gemma4-palindrome/.agent_state/conversation.full.md
workspaces/gemma4-website/.agent_state/summary.json
workspaces/gemma4-website/.agent_state/conversation.full.md
workspaces/existing-bugfix-demo/.agent_state/summary.json
workspaces/existing-bugfix-demo/.agent_state/conversation.full.md
workspaces/real-dotnet-dependency/.agent_state/summary.json
workspaces/real-dotnet-dependency/.agent_state/conversation.full.md
workspaces/real-arithmetic/.agent_state/summary.json
workspaces/real-arithmetic/.agent_state/conversation.full.md
workspaces/gemma4-jsonl-stats/.agent_state/summary.json
workspaces/gemma4-jsonl-stats/.agent_state/conversation.full.md
workspaces/real-interest-rate-research/.agent_state/summary.json
workspaces/real-interest-rate-research/.agent_state/conversation.full.md
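
A small script can walk those evidence files after a run. The sketch below only relies on the path layout listed above (`workspaces/<name>/.agent_state/summary.json`); the schema of summary.json is not documented here, so it just reports the top-level keys it finds. `find_summaries` and `describe` are illustrative names.

```python
import json
from pathlib import Path


def find_summaries(root):
    """Return every <root>/<workspace>/.agent_state/summary.json, sorted."""
    return sorted(Path(root).glob("*/.agent_state/summary.json"))


def describe(path):
    """Return (workspace name, sorted top-level keys) for one summary file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # The schema is unknown here, so only the top-level keys are reported.
    keys = sorted(data) if isinstance(data, dict) else []
    return path.parent.parent.name, keys


if __name__ == "__main__":
    for summary in find_summaries("workspaces"):
        name, keys = describe(summary)
        print(f"{name}: {', '.join(keys) or '(no top-level keys)'}")
```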
Run the harness unit tests without Docker:
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. python3 -m unittest discover -s tests -v

Run a real Docker-isolated benchmark. The first command starts the model server in its own container, and the second command starts the agent container on the same Docker network:
MODEL_ROOT=$HOME/hf/models bash scripts/start_default_model_server.sh
bash scripts/build_and_run.sh --config config.real-palindrome.json

If your model cache lives outside $HOME/hf, override both roots:
HF_ROOT=/mnt/hf MODEL_ROOT=/mnt/hf/models bash scripts/start_default_model_server.sh