Feature Request
Add Vakra's built-in simple LangGraph agent as a first-class comparison target in compare.sh (and the per-benchmark compare.sh scripts), alongside the existing cuga and react agents.
Parent epic: cuga-project/cuga-agent#239
Motivation / Problem
Today, compare.sh accepts only --agent cuga or --agent react (enforced by a hard validation check in scripts/compare.sh). Evaluators who want to benchmark cuga-agent against Vakra's own LangGraph-based agent must run Vakra separately and reconcile results by hand. There is no automated, reproducible head-to-head comparison between cuga and the Vakra LangGraph agent inside the cuga-eval harness.
Use Case
An evaluator running the M3 or BPO benchmark wants to understand where cuga-agent adds value over a plain LangGraph ReAct loop. They run:
./scripts/compare.sh --benchmark m3 --agents cuga,langgraph --runs 5
and get a single comparison report that shows pass-rate, token usage, and tool-call breakdown for both agents side by side — exactly as today's cuga vs react comparison works.
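The side-by-side report described above could be rendered roughly as follows. This is a minimal sketch, not the actual benchmarks/helpers/compare_report.py code: the `AgentSummary` fields, the `render_comparison` function, and the demo numbers are all hypothetical placeholders chosen for illustration, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentSummary:
    """Hypothetical per-agent aggregate; field names are assumptions."""
    agent: str
    passed: int
    total: int
    total_tokens: int
    tool_calls: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so the report never divides by zero.
        return self.passed / self.total if self.total else 0.0


def render_comparison(summaries: List[AgentSummary]) -> str:
    """Render one Markdown table with a column per agent."""
    header = "| Metric | " + " | ".join(s.agent for s in summaries) + " |"
    divider = "| --- | " + " | ".join("---" for _ in summaries) + " |"
    rows = [
        ("Pass rate", [f"{s.pass_rate:.1%}" for s in summaries]),
        ("Total tokens", [f"{s.total_tokens:,}" for s in summaries]),
        ("Tool calls", [str(s.tool_calls) for s in summaries]),
    ]
    lines = [header, divider]
    for label, cells in rows:
        lines.append(f"| {label} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder values only; they do not come from any real benchmark run.
demo = [
    AgentSummary("cuga", passed=8, total=10, total_tokens=12000, tool_calls=40),
    AgentSummary("langgraph", passed=6, total=10, total_tokens=15000, tool_calls=55),
]
print(render_comparison(demo))
```

Because the columns are derived from the list of summaries rather than hard-coded, adding a third agent would only require passing one more summary.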
Proposed Solution
- Extend agent validation: remove the hard cuga|react guard in scripts/compare.sh (line ~68) and replace it with a list of allowed agents that includes langgraph. Propagate this change to the per-benchmark compare.sh scripts (benchmarks/bpo/compare.sh, benchmarks/m3/compare.sh, etc.).
- Add a LangGraph agent runner: create benchmarks/helpers/langgraph_agent.py (analogous to benchmarks/helpers/react_agent.py) that wraps Vakra's simple LangGraph agent and exposes the same interface used by the existing eval scripts (eval_m3.py, eval_bench_sdk_react.py, etc.).
- Wire into eval scripts: add langgraph as a valid --agent choice in benchmarks/*/eval_*.py (currently hard-coded to choices=["cuga", "react"]) and route it to the new runner.
- Update the --compare-agents shorthand: decide whether --compare-agents expands to cuga,react,langgraph or stays as cuga,react with a separate --compare-all-agents flag. A new --compare-all-agents flag is the lower-risk option.
- Setup / dependencies: Vakra is already cloned and vendored by setup_m3.sh. Document that the langgraph agent target requires setup_m3.sh to have been run first.
- Report integration: ensure benchmarks/helpers/compare_report.py renders a langgraph column in the comparison Markdown table.
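The validation and routing steps above might look something like the sketch below. It assumes a single registry that drives both the argparse choices and the dispatch. The runner functions, `AGENT_RUNNERS`, `parse_agents`, and the runner signature are hypothetical stand-ins, not the existing eval-script interface. The real compare.sh guard is a shell check, so this only mirrors the logic on the Python side.

```python
import argparse
from typing import Callable, Dict, List, Optional

# Hypothetical runner signature: takes a task prompt, returns the agent's answer.
AgentRunner = Callable[[str], str]


def run_cuga(task: str) -> str:
    return f"[cuga] {task}"  # placeholder for the real cuga runner


def run_react(task: str) -> str:
    return f"[react] {task}"  # placeholder for react_agent.py


def run_langgraph(task: str) -> str:
    return f"[langgraph] {task}"  # placeholder for the proposed langgraph_agent.py


# Single source of truth: adding an agent here updates validation and routing.
AGENT_RUNNERS: Dict[str, AgentRunner] = {
    "cuga": run_cuga,
    "react": run_react,
    "langgraph": run_langgraph,
}


def parse_agents(value: str) -> List[str]:
    """Parse a comma-separated agent list, rejecting unknown names."""
    names = [n.strip() for n in value.split(",") if n.strip()]
    if not names:
        raise argparse.ArgumentTypeError("at least one agent is required")
    unknown = [n for n in names if n not in AGENT_RUNNERS]
    if unknown:
        raise argparse.ArgumentTypeError(
            f"unknown agent(s): {', '.join(unknown)}; "
            f"valid choices: {', '.join(AGENT_RUNNERS)}"
        )
    return list(dict.fromkeys(names))  # drop duplicates, keep order


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Agent selection sketch")
    parser.add_argument("--agent", choices=sorted(AGENT_RUNNERS), default="cuga")
    parser.add_argument("--agents", type=parse_agents, default=None)
    return parser


def resolve_agents(args: argparse.Namespace) -> List[str]:
    # A comma-separated --agents list takes precedence over a single --agent.
    return args.agents if args.agents else [args.agent]


def main(argv: Optional[List[str]] = None) -> None:
    args = build_parser().parse_args(argv)
    for name in resolve_agents(args):
        print(AGENT_RUNNERS[name]("example task"))


main(["--agents", "cuga,langgraph"])
```

Deriving `choices` from the registry avoids the current situation where the allowed agent names are hard-coded in several places and can drift apart.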
Alternatives Considered
- Keep the Vakra comparison out-of-band: evaluators continue to run Vakra separately. Rejected because it prevents automated, reproducible, apples-to-apples comparisons inside cuga-eval's reproducibility bundles.
- Reuse the react agent label: map langgraph internally to the react harness. Rejected because Vakra's LangGraph agent differs from the bare react_agent.py implementation, and conflating them would obscure meaningful performance differences.
Priority
Medium
Additional Context
- Vakra is already cloned and installed by setup_m3.sh into vendor/vakra; the M3 benchmark uses it as a scorer/judge today, not as an agent under test.
- Existing analysis comparing cuga vs react on M3 (Vakra): docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
- Relevant files to modify:
  - scripts/compare.sh: agent validation list
  - benchmarks/helpers/common.sh: --compare-agents expansion
  - benchmarks/m3/compare.sh, benchmarks/bpo/compare.sh, etc.
  - benchmarks/helpers/react_agent.py: reference implementation
  - benchmarks/*/eval_*.py: choices=["cuga", "react"] argparse args