[Feature]: Add Vakra LangGraph agent as a comparison target in compare.sh #57

@haroldship

Feature Request

Add Vakra's built-in simple LangGraph agent as a first-class comparison target in compare.sh (and the per-benchmark compare.sh scripts), alongside the existing cuga and react agents.

Parent epic: cuga-project/cuga-agent#239


Motivation / Problem

Today, compare.sh accepts only --agent cuga or --agent react (enforced by a hard validation check in scripts/compare.sh). Evaluators who want to benchmark cuga-agent against Vakra's own LangGraph-based agent must run Vakra separately and reconcile results by hand. There is no automated, reproducible head-to-head comparison between cuga and the Vakra LangGraph agent inside the cuga-eval harness.


Use Case

An evaluator running the M3 or BPO benchmark wants to understand where cuga-agent adds value over a plain LangGraph ReAct loop. They run:

./scripts/compare.sh --benchmark m3 --agents cuga,langgraph --runs 5

and get a single comparison report that shows pass-rate, token usage, and tool-call breakdown for both agents side by side — exactly as today's cuga vs react comparison works.
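A minimal sketch of the side-by-side layout described above, assuming the report is a Markdown table with one column per agent and one row per metric. The function name `render_comparison_table` and the metric keys are hypothetical and are not taken from benchmarks/helpers/compare_report.py.

```python
from typing import Dict, List


def render_comparison_table(results: Dict[str, Dict[str, str]], metrics: List[str]) -> str:
    """Render one Markdown column per agent and one row per metric.

    `results` maps an agent label to its already-formatted metric values.
    A metric that an agent did not report is rendered as "n/a".
    """
    agents = list(results)  # dict insertion order decides the column order
    header = "| Metric | " + " | ".join(agents) + " |"
    divider = "|" + " --- |" * (len(agents) + 1)
    rows = [
        "| " + metric + " | "
        + " | ".join(results[agent].get(metric, "n/a") for agent in agents)
        + " |"
        for metric in metrics
    ]
    return "\n".join([header, divider, *rows])


if __name__ == "__main__":
    # No measured values are supplied here, so every cell prints "n/a".
    empty = {"cuga": {}, "langgraph": {}}
    print(render_comparison_table(empty, ["pass_rate", "tokens", "tool_calls"]))
```

Because the columns are driven by whatever agents appear in `results`, adding langgraph would not need a table-specific change under this assumption.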


Proposed Solution

  1. Extend agent validation — remove the hard cuga|react guard in scripts/compare.sh (line ~68) and replace it with a list that includes langgraph. Propagate this to the per-benchmark compare.sh scripts (benchmarks/bpo/compare.sh, benchmarks/m3/compare.sh, etc.).

  2. Add a LangGraph agent runner — create benchmarks/helpers/langgraph_agent.py (analogous to benchmarks/helpers/react_agent.py) that wraps Vakra's simple LangGraph agent and exposes the same interface used by the existing eval scripts (eval_m3.py, eval_bench_sdk_react.py, etc.).

  3. Wire into eval scripts — add langgraph as a valid --agent choice in benchmarks/*/eval_*.py (currently hard-coded to choices=["cuga", "react"]) and route to the new runner.

  4. Update --compare-agents shorthand — decide whether --compare-agents expands to cuga,react,langgraph or stays as cuga,react with a separate --compare-all-agents flag. A new --compare-all-agents flag is the lower-risk option.

  5. Setup / dependencies — Vakra is already cloned and vendored by setup_m3.sh. Document that the langgraph agent target requires setup_m3.sh to have been run first.

  6. Report integration — ensure benchmarks/helpers/compare_report.py renders a langgraph column in the comparison Markdown table.
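Steps 1, 3 and 4 could be sketched roughly as follows, assuming a single registry of agent runners drives both the argparse choices and the shorthand expansion. Everything here is illustrative: `AGENT_RUNNERS`, `selected_agents`, the stub runners, and the default of cuga when no flag is passed are assumptions, and the real --compare-agents expansion lives in shell (benchmarks/helpers/common.sh), not Python.

```python
import argparse
from typing import Callable, Dict, List, Optional

# Hypothetical runner signature: takes a task prompt, returns the agent's answer.
AgentRunner = Callable[[str], str]


def _stub(label: str) -> AgentRunner:
    """Stand-in for a real runner such as the proposed langgraph_agent.py."""
    def run(task: str) -> str:
        return f"[{label}] {task}"
    return run


# One registry replaces the hard-coded choices=["cuga", "react"] list.
AGENT_RUNNERS: Dict[str, AgentRunner] = {
    "cuga": _stub("cuga"),
    "react": _stub("react"),
    "langgraph": _stub("langgraph"),
}

# --compare-agents keeps its current meaning; the new flag covers everything.
DEFAULT_COMPARE = ["cuga", "react"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="agent selection sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--agent", choices=list(AGENT_RUNNERS))
    group.add_argument("--compare-agents", action="store_true")
    group.add_argument("--compare-all-agents", action="store_true")
    return parser


def selected_agents(argv: Optional[List[str]] = None) -> List[str]:
    """Return the agent labels to run for the given command line."""
    args = build_parser().parse_args(argv)
    if args.compare_all_agents:
        return list(AGENT_RUNNERS)
    if args.compare_agents:
        return list(DEFAULT_COMPARE)
    return [args.agent or "cuga"]  # assumed default when no flag is given


if __name__ == "__main__":
    for name in selected_agents(["--compare-all-agents"]):
        print(AGENT_RUNNERS[name]("example task"))
```

Keeping --compare-agents unchanged and adding --compare-all-agents means existing invocations produce the same agent set as before, which is the lower-risk property called out in step 4.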


Alternatives Considered

  • Keep Vakra comparison out-of-band — evaluators continue to run Vakra separately. Rejected because it prevents automated, reproducible, apples-to-apples comparisons inside cuga-eval's reproducibility bundles.
  • Reuse react agent label — map langgraph internally to the react harness. Rejected because Vakra's LangGraph agent differs from the bare react_agent.py implementation and conflating them would obscure meaningful performance differences.

Priority

Medium


Additional Context

  • Vakra is already cloned and installed by setup_m3.sh into vendor/vakra; the M3 benchmark uses it as a scorer/judge today, not as an agent under test.
  • Existing analysis comparing cuga vs react on M3 (Vakra): docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
  • Relevant files to modify:
    • scripts/compare.sh: agent validation list
    • benchmarks/helpers/common.sh: --compare-agents expansion
    • benchmarks/m3/compare.sh, benchmarks/bpo/compare.sh, etc.
    • benchmarks/helpers/react_agent.py: reference implementation
    • benchmarks/*/eval_*.py: choices=["cuga", "react"] argparse args
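Since the langgraph target depends on setup_m3.sh having populated vendor/vakra, the new runner could fail fast with a clear message when that directory is missing. This is only a sketch: `require_vakra` is a hypothetical helper, and the check assumes the presence of the directory is a good enough signal that setup ran.

```python
from pathlib import Path

# Location stated in this issue; setup_m3.sh clones Vakra here.
VAKRA_DIR = Path("vendor") / "vakra"


def require_vakra(repo_root: Path) -> Path:
    """Return the Vakra checkout path, or raise if setup_m3.sh has not been run."""
    vakra = repo_root / VAKRA_DIR
    if not vakra.is_dir():
        raise FileNotFoundError(
            f"{vakra} not found; run setup_m3.sh before using the langgraph agent"
        )
    return vakra


if __name__ == "__main__":
    try:
        print(require_vakra(Path.cwd()))
    except FileNotFoundError as exc:
        print(exc)
```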

Status

Backlog