fix(m3): repair harness bugs that artificially zeroed CUGA M3 pass rate#3
The M3 harness in this repo had several mutually-reinforcing bugs that caused CUGA's M3 (vakra) pass rate to be near zero on PF cases (cases where ReAct passes and CUGA fails), independent of agent quality. This commit fixes the cluster:

1. Tool-name app-prefix breaks vakra's _match_live_name (groundedness=0 on otherwise-correct answers). Registry now uses bare-domain app names; m3_data_loader and m3_vakra_score gain backward-compat suffix-match paths for legacy bundles. Collision guard added because de-prefixing collapses cross-task duplicates (books, mondial_geo, soccer_2016 in tasks 2 and 3); expand_registry_config gains a capability_filter to pre-narrow.
2. M3Evaluator.evaluate_all multiturn branch checked self.task_id (singular, set only when N=1) instead of self.task_ids, so `--task <uuid> <uuid> <uuid> <uuid>` silently ran all ~46 samples in the capability. Switched to plural and lowercase-membership.
3. eval_m3.py:start_registry_server hardcoded port 8001 in three places. Now reads REGISTRY_PORT or DYNACONF_SERVER_PORTS__REGISTRY.
4. eval.sh started an outer registry that collided with eval_m3.py's per-service registry on the same port. Every code path through eval.sh uses --from-config and self-manages its registry, so the outer-start is dead weight; gated behind SKIP_SERVER_START=false now.
5. Per-domain CugaAgent constructors were re-loading policies from disk and writing back the conflict-resolver's culled set, so the .cuga folder count decreased monotonically across domains. Pass auto_load_policies=False, filesystem_sync=False; load policies once via a new _load_m3_policies helper.
6. --capability / --task with nargs="*" overwrote on a second invocation; switched both to action="extend", default=[]. UUID detection now strips non-UUID items before passing the filter down.
7. M3 policy bundle (P-OF-1, P-PB-1..4, P-TG-1..2) authored in benchmarks/m3/policies/ as markdown with YAML frontmatter; compiled to policies.json by scripts/policies_md_to_json.py. eval.sh and compare.sh grew --no-policies and --compare-policies flags mirroring bpo. DYNACONF_POLICY__ENABLED flipped to true in m3.env. Bundle dir naming annotates the policy mode.
8. scripts/check_no_task_prefix.py smoke-tests result files to ensure no tool call still carries the legacy task_<n>_<domain>_ prefix.

Headline result on a 4-PF x 5-runs x 2-configs sweep:
- baseline: 0/10
- + tool-prefix removal: 5/10 (no-policies) / 4/10 (policies)
- + remaining fixes (4 PFs x 5 runs): 81.2% no-policies / 50.0% policies

The policy bundle is net-negative on these 4 cases once the tool-prefix root cause is fixed; one task improves (75 -> 100%), one regresses (75 -> 0%). Re-running policies against the full 200-case M3 set is the next step.

Full implementation notes appended to the analysis report under "Post-analysis: what we actually changed and what each change did".

Out of scope (deferred): nested-arg sandbox codegen bug (cuga-agent), movie_platform/professional_basketball MCP-client health, and splitting results/ per compare.sh invocation to prevent parallel-eval contamination.

Refs: #99
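Fix 6 above can be illustrated with a minimal sketch. It assumes only the standard-library `argparse` and `uuid` modules; `build_parser`, `is_uuid`, and `select_task_uuids` are hypothetical names for illustration, not the helpers in eval_m3.py.

```python
import argparse
import uuid


def is_uuid(value: str) -> bool:
    """Return True if the token parses as a UUID (case-insensitive)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # action="extend" accumulates values across repeated flags; the default
    # "store" action would keep only the values from the last occurrence.
    parser.add_argument("--task", nargs="*", action="extend", default=[])
    parser.add_argument("--capability", nargs="*", action="extend", default=[])
    return parser


def select_task_uuids(tokens: list[str]) -> set[str]:
    """Keep only UUID-shaped tokens, lowercased for membership checks."""
    return {token.lower() for token in tokens if is_uuid(token)}


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(select_task_uuids(args.task), args.capability)
```

With this shape, `--task A --task B C` yields `["A", "B", "C"]` rather than only the last group, and lowercasing both sides keeps the membership test insensitive to UUID casing.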
The MD→JSON compiler skips files with the .disabled suffix, so the policy was already inactive, but the frontmatter still said `enabled: true`. Misleading to a reader; CodeRabbit flagged the inconsistency in #100. The intent of the .disabled suffix here is "P-OF-2 conflicts with P-OF-1 in the natural-language conflict resolver and shouldn't be loaded for now." Flipping the frontmatter to `enabled: false` makes that state visible without relying on the suffix alone — and means removing the suffix later won't silently re-enable a policy whose trade-offs haven't been re-evaluated.
Add get_registry_port() backed by cuga settings so start_registry_server listens on the same port the agent uses via get_registry_base_url(). Sync REGISTRY_PORT and DYNACONF in eval.sh/compare.sh; stop hardcoding 8001 in eval_m3_react port cleanup. Co-authored-by: Cursor <cursoragent@cursor.com>
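A rough sketch of the port lookup follows. The real get_registry_port() is backed by cuga settings, which are not shown here; this version reads the two environment variables directly, and both the precedence order and the name `resolve_registry_port` are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_REGISTRY_PORT = 8001  # the value that used to be hardcoded


def resolve_registry_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Pick the registry port from the environment, else the old default.

    Assumption: REGISTRY_PORT takes precedence over
    DYNACONF_SERVER_PORTS__REGISTRY; the actual helper may order them
    differently or go through the settings object instead.
    """
    env = os.environ if env is None else env
    for key in ("REGISTRY_PORT", "DYNACONF_SERVER_PORTS__REGISTRY"):
        raw = env.get(key, "").strip()
        if raw:
            return int(raw)
    return DEFAULT_REGISTRY_PORT


if __name__ == "__main__":
    print(resolve_registry_port())
```

Resolving the port in one place is what lets the server side and the agent side agree without either of them hardcoding 8001.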
…bility Bare-domain registry names (from the vakra tool-name fix) collide when both m3_task_2 and m3_task_3 expand into one yaml. Run one capability at a time automatically so full small_train.zip evals work without --capability on CLI.
Strip inline env comments so CUGA mode parses correctly, lazy-load helpers so bundle creation works without importing the agent, and add a utility to assemble bundles from existing results after long eval runs. Co-authored-by: Cursor <cursoragent@cursor.com>
Inline trajectory lookup so the utility does not import the benchmarks package before subprocess calls. Co-authored-by: Cursor <cursoragent@cursor.com>
Load benchmark env via dotenv inside bundle.py and invoke it with -m from create_eval_bundle so direct script execution does not require sys.path hacks. Co-authored-by: Cursor <cursoragent@cursor.com>
Reload benchmark env in root conftest before cuga imports and add nosec annotations for fixed /tmp log paths in create_eval_bundle. Co-authored-by: Cursor <cursoragent@cursor.com>
The auto-capability-pass logic ran each capability (m3_task_2, m3_task_3) as a separate recursive run_config_mode and each pass saved its own result file. With --m3-data covering 2 capabilities, one eval.sh invocation emitted two 100-task files instead of one 200-task file, so compare_report counted each capability file as a separate "run" (inflated run count) and every run column only showed 100 tasks (the other capability's tasks rendered as "—").

- eval_m3.py: capability passes now run with defer_save=True and return their results; the parent aggregates them and writes ONE result file (+ ground-truth dump and summary) per eval.sh run via the new _finalize_and_save_results helper. Single-capability/--capability paths use the same helper. Also removes a duplicate _write_single_service_yaml definition.
- compare.sh / bundle.py: trajectory folders (cuga writes one per domain) are now grouped per eval.sh run and merged into a single runN/trajectories dir, so one bundle run holds all 200 trajectories instead of one run per domain. The bundle CLI still accepts the legacy flat trajectory-dirs shape.

Net effect: `compare.sh --runs 3 --compare-policies` produces 6 runs (3 policies + 3 no-policies), each covering all 200 tasks/trajectories.

Co-authored-by: Cursor <cursoragent@cursor.com>
… port
Sequential mode starts/stops a registry per domain on the same port. stop only
waited on the `uv` wrapper, so the uvicorn worker could still hold the port when
the next domain started, and start_registry_server hard-failed immediately on a
busy port ("Port 18001 is already in use" mid-run, e.g. on talkingdata).
- eval_m3.py: add _port_in_use/_wait_for_port_free/_kill_port_listeners helpers.
start_registry_server now frees + waits for the port (up to 20s) instead of
failing on first check; stop_registry_server waits for the port to actually be
released (and force-kills stray listeners) before returning.
- eval.sh: only start the "outer" registry for multiturn. Single-turn and
--m3-data flows (cuga/react) manage their own per-service registries, so the
outer registry (which compare.sh forces on via SKIP_SERVER_START=false on its
first run) only collided on $REGISTRY_PORT. Fixes both direct eval and compare.
Co-authored-by: Cursor <cursoragent@cursor.com>
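The wait-for-port behavior described in this commit can be sketched with the standard `socket` module. This is a simplified stand-in, not the code in eval_m3.py: `port_in_use` and `wait_for_port_free` are hypothetical counterparts of _port_in_use/_wait_for_port_free, and the force-kill step (_kill_port_listeners) is left out.

```python
import socket
import time


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def wait_for_port_free(port: int, timeout: float = 20.0, interval: float = 0.25) -> bool:
    """Poll until the port is released or the timeout elapses.

    Returns True once the port is free, False if it is still busy
    when the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not port_in_use(port):
            return True
        time.sleep(interval)
    return not port_in_use(port)


if __name__ == "__main__":
    print(wait_for_port_free(18001, timeout=2.0))
```

Polling with a deadline, rather than failing on the first busy check, tolerates the short window in which a worker process still holds the socket after its wrapper has exited.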
…logs
- compare_report: per-task comparison table now leads with Capability, Domain, and # (1-based input task number), sorted by (capability, domain, task#); non-M3 benchmarks keep the legacy table.
- thread task_number from m3_data_loader through eval_m3/eval_m3_react into results so the report can render it.
- bundle.assemble_compare_bundle accepts per-run grouped logs (runs/<config>_run<i>/logs) while staying backward-compatible with flat/shared logs.
- m3/compare.sh snapshots each run's console + live registry_server.log (fixing the stale /tmp/m3_registry.log) and emits per-run grouped --log-files; appworld and other benchmarks keep shared logs.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reorganize m3.env with documented settings, disable Evolve, cap cuga_lite_max_steps at 35 (GT M=3 in small_train, formula uses M=4 padding), and inject special_instructions so CugaLite probes unknown tool shapes then accesses results with isinstance checks. Co-authored-by: Cursor <cursoragent@cursor.com>
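The "probe, then access with isinstance checks" guidance could translate into agent-side code roughly like the sketch below. The actual special_instructions text is not shown in this PR page, so this is only an assumed illustration; `as_records` is a hypothetical helper name.

```python
import json
from typing import Any


def as_records(result: Any) -> list[dict]:
    """Coerce a tool result of unknown shape into a list of dicts.

    Handles JSON strings, single dicts, lists of dicts or scalars, and
    bare scalars, so later code can call .get() without type errors.
    """
    if isinstance(result, str):
        try:
            result = json.loads(result)
        except json.JSONDecodeError:
            # Not JSON: keep the raw text under a generic key.
            return [{"value": result}]
    if isinstance(result, dict):
        return [result]
    if isinstance(result, list):
        return [item if isinstance(item, dict) else {"value": item} for item in result]
    return [{"value": result}]


if __name__ == "__main__":
    for probe in ('{"goals": 3}', [{"goals": 1}, 2], None):
        print(as_records(probe))
```

Normalizing once at the boundary means every downstream access works on the same shape, which is the failure mode the instructions aim to avoid when MCP tool outputs are undocumented.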
Use should_trace_langfuse_task() instead of setup_langfuse() on the evaluator; write trace_id whenever a per-task trace is created; route error-path invokes through _invoke_agent_for_eval; flush when tracing is enabled in settings.
Parse --no-policies in compare.sh (config label + eval.sh args). Pass EVAL_M3_EXTRA on all eval_m3 entrypoints. Disable policy engine via DYNACONF_POLICY__ENABLED=false when baselining. Export should_trace_langfuse_task from benchmarks.helpers.
Ready for review.
Please focus review on harness-only scope (no Vakra judge / MCP / React changes).
📝 Walkthrough
Adds M3 policy markdowns and a compiler, centralizes Langfuse tracing/invocation, normalizes registry naming/port handling, propagates task_number, stages per-run trajectory/log groups for compare bundles, and extends Vakra scoring with judge extraction and Langfuse pushes.
Changes: M3 Evaluation Harness and Vakra Scoring
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/m3/m3_vakra_score.py (1)
102-104: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Legacy prefix stripping misses underscore domains.
The regex only strips task_<n>_<domain>_ when <domain> is a single alnum token. Legacy names like task_2_codebase_comments_... won’t be stripped, which can break the legacy-bundle compatibility path.
Suggested fix
    - _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_[A-Za-z0-9]+_(.+)$")
    + # Domain boundaries are ambiguous because both domain and operation ids
    + # can contain underscores; strip only the deterministic task prefix.
    + _REGISTRY_PREFIX_RE = _re.compile(r"^task_\d+_(.+)$")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benchmarks/m3/m3_vakra_score.py` around lines 102 - 104, The current regex _REGISTRY_PREFIX_RE only allows a single alphanumeric token for the domain and thus fails to strip legacy names with underscores (e.g. task_2_codebase_comments_...). Update the pattern used to compile _REGISTRY_PREFIX_RE to accept domains containing underscores (for example include "_" inside the character class), then keep the same match/group logic (use m.group(1) if m else name) so names like task_2_codebase_comments_... correctly return the suffix; ensure you only modify the regex expression and not the subsequent matching logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmarks/helpers/common.sh`:
- Around line 233-256: parse_common_args reads CLI overrides
(--model-name/--openai-base-url) but some call sites call
apply_model_profile_if_set directly which skips the CLI overrides; update those
call sites to call finalize_model_config (which runs apply_model_profile_if_set
then apply_model_cli_overrides_if_set) instead of apply_model_profile_if_set,
and ensure no early returns in apply_model_profile_if_set prevent later CLI
overrides (leave finalize_model_config's sequence as apply_model_profile_if_set
|| return 1; apply_model_cli_overrides_if_set). Also confirm parse_common_args
sets CLI_MODEL_NAME/CLI_OPENAI_BASE_URL so apply_model_cli_overrides_if_set can
export them to MODEL_NAME/OPENAI_BASE_URL.
In `@benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md`:
- Around line 75-76: Inline code spans contain extra spaces inside backticks;
remove the inner padding so the symbols render as
`get_forks_to_stars_percentage`, `get_repo_forks`, `get_repo_stars` and the
inline fragment "0 forks / 1 star = 0.00%, confirmed by
`get_forks_to_stars_percentage`" use backticks without leading/trailing spaces
(e.g. change ` get_forks_to_stars_percentage ` to
`get_forks_to_stars_percentage`). Update those inline code spans in the markdown
so there are no spaces immediately inside the backticks.
In `@scripts/check_no_task_prefix.py`:
- Line 25: The LEGACY_RE pattern only allows lowercase letters and underscores
in the domain segment, causing false negatives for domain names with digits;
update the regex used in LEGACY_RE (and the other occurrence around lines 90-91)
to permit digits as well (e.g., change the domain class from [a-z_]+ to
[a-z0-9_]+ or \w+ depending on desired constraints) so names like
"task_3_soccer_2016_..." are matched; ensure you update both occurrences (the
LEGACY_RE definition and the similar pattern at 90-91) and run tests/linters
after the change.
⛔ Files ignored due to path filters (1)
- uv.lock is excluded by !**/*.lock, !uv.lock
📒 Files selected for processing (30)
- benchmarks/helpers/__init__.py
- benchmarks/helpers/bundle.py
- benchmarks/helpers/common.sh
- benchmarks/helpers/compare_report.py
- benchmarks/helpers/load_env.sh
- benchmarks/helpers/sdk_eval_helpers.py
- benchmarks/helpers/tests/test_invoke_agent_for_eval.py
- benchmarks/m3/compare.sh
- benchmarks/m3/config/m3.env
- benchmarks/m3/eval.sh
- benchmarks/m3/eval_m3.py
- benchmarks/m3/eval_m3_react.py
- benchmarks/m3/m3_data_loader.py
- benchmarks/m3/m3_vakra_score.py
- benchmarks/m3/policies/P-OF-1-single-tool-fact-citation.md
- benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled
- benchmarks/m3/policies/P-PB-1-no-enumeration.md
- benchmarks/m3/policies/P-PB-2-one-composite-tool-no-corroboration.md
- benchmarks/m3/policies/P-PB-3-no-idempotent-retries.md
- benchmarks/m3/policies/P-PB-4-validation-error-recovery.md
- benchmarks/m3/policies/P-TG-1-mountain-count-disambiguation.md
- benchmarks/m3/policies/P-TG-2-country-with-most-umpires-returns-id.md
- benchmarks/m3/policies/policies.json
- benchmarks/m3/tests/test_vakra_langfuse_scores.py
- conftest.py
- docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
- scripts/check_no_task_prefix.py
- scripts/create_eval_bundle.py
- scripts/model_profiles.sh
- scripts/policies_md_to_json.py
👮 Files not reviewed due to content moderation or server errors (8)
- benchmarks/m3/policies/policies.json
- benchmarks/m3/config/m3.env
- benchmarks/m3/eval.sh
- benchmarks/m3/compare.sh
- benchmarks/helpers/bundle.py
- scripts/create_eval_bundle.py
- benchmarks/m3/m3_data_loader.py
- benchmarks/m3/eval_m3.py
- check_no_task_prefix.py: LEGACY_RE now matches digit-containing domains (e.g. soccer_2016) instead of only [a-z_]+
- eval.sh / compare.sh: call finalize_model_config instead of apply_model_profile_if_set so --model-name / --openai-base-url CLI overrides are actually applied
- P-PB-2 policy doc: fix mismatched backtick that broke code-span rendering in the "Wrong" example
|
Reply to CodeRabbit's "Legacy prefix stripping misses underscore domains" finding on Pushing back on this one — the current regex is intentional, not a bug. The docstring on Traced through
So multi-underscore domains are already handled — just via the suffix matcher rather than the prefix regex, exactly as the docstring describes. Adopting the suggested (Findings on |
|
Likely bugs
|
|
Concurrency
|
|
Cross-test leakage |
|
Maintainability / naming
|
|
benchmarks/m3/policies/P-OF-2-strip-hedging.md.disabled — file already has |
Sergey-Zeltyn
left a comment
See comments above.
No remaining issue is a merge blocker. The two correctness items worth fixing in this PR are the CRLF frontmatter and non-atomic policies.json write; the rest can land as follow-ups.
| run_label = f"{config_key.replace(':', '_')}" | ||
| _copy_logs(bundle_dir, lf_list, dest_subdir=f"runs/{run_label}/logs") | ||
| for config_key, lf_val in log_files.items(): | ||
| if lf_val and isinstance(lf_val[0], list): |
The code tries to detect whether log files are organized as "one list per run" or "one flat list" by checking if the first element is itself a list. If someone accidentally passes an empty nested structure like [[]], the code won't crash immediately but might behave unexpectedly—it'll try to copy logs from an empty group, which could silently fail or do nothing.
Checked this — [[]] doesn't actually misbehave: _copy_logs early-returns False on a falsy/empty log_files list (line 250: if not log_files: return False), so the inner empty group is just a no-op — no silent partial copy, no crash. The shape-detection branch correctly routes it to the grouped path and then does nothing for the empty group. Leaving as-is; the existing guard already covers this.
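The shape detection discussed in this thread can be sketched as follows. This is an illustrative reduction, not the code in bundle.py: `normalize_log_groups` and `non_empty_groups` are hypothetical names, and the second function stands in for the `if not log_files: return False` guard attributed to _copy_logs.

```python
from typing import Sequence, Union

LogFiles = Union[Sequence[str], Sequence[Sequence[str]]]


def normalize_log_groups(log_files: LogFiles) -> list[list[str]]:
    """Return one list of paths per run for either accepted shape.

    A list whose first element is itself a list is treated as already
    grouped per run; anything else is a single flat, shared group.
    """
    if not log_files:
        return []
    if isinstance(log_files[0], list):
        return [list(group) for group in log_files]
    return [list(log_files)]


def non_empty_groups(log_files: LogFiles) -> list[list[str]]:
    """Drop empty groups, mirroring an early return on an empty list."""
    return [group for group in normalize_log_groups(log_files) if group]


if __name__ == "__main__":
    print(non_empty_groups([[]]))
    print(non_empty_groups(["console.log", "registry_server.log"]))
```

Under these assumptions `[[]]` is routed to the grouped path and then contributes nothing, which matches the behavior described in the reply.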
| data_dir = PROJECT_ROOT / "benchmarks" / benchmark / "data" | ||
| if benchmark == "m3": | ||
| if result_file.name.startswith("multiturn_"): | ||
| candidate = data_dir / "olympics_mutliturn.json" |
There's a typo here: mutliturn instead of multiturn.
Actually, looking again, I see the error is in the file name itself. Your choice whether you would like to fix this; not very important.
Right — that's not a typo introduced here, it's the actual filename of the existing data file (benchmarks/m3/data/olympics_mutliturn.json, pre-existing in the repo). Renaming the reference without renaming the data file would break loading. Renaming the data file itself is a separate, slightly-bigger cleanup (touches whatever else references it) — leaving it out of this PR's scope, agreed it's not important.
- policies_md_to_json.py: rstrip() instead of rstrip("\n") so CRLF
frontmatter delimiters parse correctly; write policies.json via
tmp-file + os.replace so a crash mid-write can't leave a truncated
file for _load_m3_policies to read
- eval_m3.py: rename M3Evaluator.langfuse_handler -> langfuse_enabled
(it stores a bool gate, not a handler; kwarg names passed to
sdk_eval_helpers stay langfuse_handler since that's their declared
legacy-gate parameter name)
- m3_vakra_score.py: log when _match_live_name's length-based
tie-breaks have multiple candidates, so non-deterministic scoring
drift from live-tool-list ordering is debuggable
- benchmarks/helpers/__init__.py: guard that every _LAZY_EXPORTS entry
is declared in __all__
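The two policies_md_to_json.py changes in this commit follow a common pattern, sketched below under stated assumptions. `write_json_atomic` and `is_frontmatter_delimiter` are hypothetical names; the script's real functions are not shown on this page.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, payload: Any) -> None:
    """Write JSON to a temp file in the same directory, then swap it in.

    os.replace is atomic when source and destination are on the same
    filesystem, so a reader sees either the old file or the new one.
    """
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def is_frontmatter_delimiter(line: str) -> bool:
    """Bare rstrip() also removes '\r', so CRLF files parse correctly."""
    return line.rstrip() == "---"


if __name__ == "__main__":
    write_json_atomic(Path("policies.json"), {"policies": []})
```

The temp file is created next to the target on purpose: a rename across filesystems is not atomic, so placing it in the system temp directory would weaken the guarantee.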
|
@Sergey-Zeltyn — addressed the items below in 4049fb0 (and 790518c for the CodeRabbit batch). Replying to each of your threads: Likely bugs (both fixed, as you flagged as the priority pair):
Maintainability / naming:
Pushing back / leaving as follow-up (matches your own triage that these aren't blockers):
The conftest.py overlapping-keys point in particular ( |
…nsion eval_m3_react.py's _load_m3_registry_services called expand_registry_config without a capability_filter, expanding the entire registry (m3_task_2 and m3_task_3) and tripping the cross-task domain-name collision guard for shared bare-domain names (books, mondial_geo, soccer_2016). The cuga path in eval_m3.py already pre-filters by --capability before expansion; mirror that here so --compare-agents runs with --capability narrowing don't crash the react configuration.
The P-PB-2 markdown fix in 790518c (closing an unmatched backtick) wasn't reflected in the compiled policies.json. Regenerated via the eval harness's compile step so the artifact matches its source.
M3 eval/compare validation (small_train.zip, m3_task_2/hockey, max-samples=2)
Ran the README's documented recipes against this branch to validate the harness fixes end-to-end.
Eval (
|
Related issues (sub-issues of #37)
--no-policies / --compare-policies, compiler, bundle metadata
_match_live_name compat + check_no_task_prefix.py; (2) M3_SPECIAL_INSTRUCTIONS for undocumented MCP outputs
Closes #38
Closes #39
Part of #37.
Summary
Harness-only changes so CUGA M3 (Vakra) scores are not artificially zeroed. Vakra judges, ground truth, MCP definitions, and the ReAct baseline are untouched.
This PR delivers three tracks:
Tool calling (#39)
1) Vakra tool-name prefix (task_<n>_<domain>_)
- Bare-domain names (hockey_get_… not task_2_hockey_get_…)
- FilteredToolProvider uses app_name=<domain>
- _match_live_name suffix fallback for legacy result bundles
- scripts/check_no_task_prefix.py: fail if any saved tool call still has legacy prefix
Files: benchmarks/m3/eval_m3.py (expand_registry_config, evaluate_single_task), benchmarks/m3/m3_data_loader.py, benchmarks/m3/m3_vakra_score.py, scripts/check_no_task_prefix.py
2) Undocumented tool outputs
Injected eval-only M3_SPECIAL_INSTRUCTIONS via SDK special_instructions: probe shape on first tool use, then defensive isinstance / normalization before .get() or indexing.
Files: benchmarks/m3/eval_m3.py (M3_SPECIAL_INSTRUCTIONS, passed into CugaAgent)
Policies (#38)
- scripts/policies_md_to_json.py → benchmarks/m3/policies/policies.json
- _load_m3_policies(); auto_load_policies=False, filesystem_sync=False on domain agents
- --no-policies, --compare-policies on benchmarks/m3/eval.sh, benchmarks/m3/compare.sh
- benchmarks/helpers/bundle.py
4-PF × 5-runs × 2-configs (codebase_comments, 4 UUIDs) after prefix fix:
Policies are not declared net-positive on that slice; full 200-case re-run is follow-up work.
Additional harness fixes (same PR)
--task filter ignored in multiturn path
benchmarks/m3/eval_m3.py: self.task_ids vs singular self.task_id
eval_m3.py: REGISTRY_PORT / DYNACONF_SERVER_PORTS__REGISTRY
eval.sh registry collides with per-service registry
benchmarks/m3/eval.sh: default SKIP_SERVER_START=true
--capability / --task overwritten on second CLI token
eval_m3.py: action="extend"
eval_m3.py, sdk_eval_helpers.py, should_trace_langfuse_task
# columns
benchmarks/helpers/compare_report.py
benchmarks/m3/compare.sh, eval_m3.py
eval_m3.py, compare.sh
--m3-data without --capability
eval_m3.py
benchmarks/helpers/bundle.py, scripts/create_eval_bundle.py
benchmarks/m3/tests/test_vakra_langfuse_scores.py, benchmarks/helpers/tests/test_invoke_agent_for_eval.py
docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md
Test results: full small_train.zip run (no policies)
Bundle: benchmarks/m3/evaluation_bundles/20260603_220001_default/
Branch @ run: fix/m3-harness-bugs (7676764) · policies_enabled: false
Summary (from report.md)
Per-domain pass rate (Vakra)
m3_task_2 — 45/100
m3_task_3 — 22/100
Changed files (vs main)
Key paths: benchmarks/m3/eval_m3.py, eval.sh, compare.sh, m3_vakra_score.py, m3_data_loader.py, benchmarks/m3/policies/*, benchmarks/helpers/{bundle,compare_report,sdk_eval_helpers}.py, scripts/{check_no_task_prefix,policies_md_to_json,create_eval_bundle}.py, docs/m3-vakra-analysis-20260428/…, tests under benchmarks/m3/tests/ and benchmarks/helpers/tests/.
Verification
- just ci / GitHub CI green on branch
- scripts/check_no_task_prefix.py on latest result JSON
- small_train.zip eval → bundle above
- --compare-policies (follow-up)
Out of scope
- Nested-arg sandbox codegen bug (cuga-agent)
- movie_platform / professional_basketball MCP health (infra)
- compare.sh runs sharing benchmarks/m3/results/ (glob contamination)
References
- docs/m3-vakra-analysis-20260428/cuga_vs_react_full_analysis.md (post-implementation section)
Summary by CodeRabbit
New Features
Improvements
Tests
Documentation