Skip to content

Slice 8 — Real Agent loop: ModelClient + tool use replaces the fake PR #8

@safayavatsal

Description

@safayavatsal

What to build

Replace the stub edit from slice 7 with a real Agent loop. External behavior is unchanged ("assign Issue → PR appears"), but internally the orchestrator now drives a model with tool use to read code, edit files, run tests, and produce the PR.

This is HITL because prompt strategy, tool interface, and eval setup are product-defining and need design care. ADR-0006 fixes Claude as the default; this slice is where that comes alive.

  • ModelClient interface lands here as a deep module: a single surface for model calls; routing/retry/BYOK lookup live behind it. No model SDK imports outside this module.
  • Agent tools at MVP (capability boundary code-only per ADR-0002): read_file, list_files, write_file, run_shell (constrained to git, package manager, test runner), final_answer (signals readiness to open the PR).
  • Prompt assembly: system prompt explains the Agent is a contributor on a Forge Repository; user prompt is the Issue title + body + repo tree summary.
  • Token accounting per Run, persisted on the Run row.
  • A small offline eval harness: a fixed set of toy Issues with expected outcomes (test passes, file edited, etc.) — used in CI to catch obvious regressions in prompt or tool changes.

Decisions to nail in this slice

  • Tool schema and boundaries (what run_shell allows).
  • Failure-recovery policy (model retries vs. Run-level retries).
  • Eval set composition and how a regression blocks merges.

Acceptance criteria

  • Assigning an Agent to a real Issue (e.g., "fix this failing test in lib/foo.py") produces a PR that actually fixes the test on at least the eval set.
  • Token usage is recorded on the Run row and visible in logs.
  • Tools cannot escape the Sandbox (no outbound internet, no host secrets).
  • All model calls go through ModelClient; lint/CI rejects direct SDK imports outside that module.
  • A recorded-response fake of ModelClient is used by tests covering Agent loop control flow.
  • The eval harness runs in CI and fails the build on regression beyond a tolerated threshold.

Blocked by

Metadata

Metadata

Assignees

No one assigned

    Labels

    hitlHuman-in-the-loop — needs design call before implementationtracer-bulletVertical slice through all integration layers

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions