feat(databases-on-aws): loosen explainability grader for natural phrasings #149
Draft
anwesham-lab wants to merge 1 commit into awslabs:main from
Conversation
feat(databases-on-aws): loosen explainability grader for natural phrasings

The v11 eval run showed the agent producing correct outputs that the grader rejected on phrasing grounds. Three specific misses were agent-correct, grader-wrong:

1. "The fix landed as expected" / "you're good to ship" — correct match-vs-expected commentary, but the regex only accepted literal "matches expected", "exceeds expected", etc. Broaden to include "landed as (expected|predicted|promised)", "hit the target", "delivered as promised", "the fix worked as expected", and variants using prediction/target as synonyms for expected.

2. "Finding 1 ... RESOLVED", "no new findings", "you're good to ship" — all valid success signals for the next-hypothesis branch, but the old regex only matched "success(fully)" / "as expected" / "resolved". Broaden the success-signal set; broaden the shortfall-signal set (e.g. "short of prediction", "did not match prediction"); broaden the next-hypothesis set.

3. The "Identifies the composite index is now being used" assertion was hitting the 0.8-keyword fallback matcher. Added a dedicated branch that accepts any of: the literal index name, an Index Scan line mentioning the new columns, a prose claim that "the planner picked/selected/chose the new index", or a Finding-resolved statement on the original finding.

Does not change skill content or eval prompts — only grader regexes. Targeted at eval 6 (Phase 5 reassessment) where the agent's natural phrasing was being penalized despite matching the intent of the assertion.

Co-authored-by: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
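The broadened signal sets in points 1 and 2 could look roughly like the sketch below. The pattern names, exact alternations, and grading helper are illustrative assumptions, not the grader's actual source:

```python
import re

# Illustrative broadened signal sets (assumed names and alternations,
# not the real grader regexes).
SUCCESS_SIGNALS = re.compile(
    r"(success(fully)?"
    r"|as expected"
    r"|resolved"
    r"|landed as (expected|predicted|promised)"
    r"|hit the target"
    r"|delivered as promised"
    r"|worked as (expected|predicted)"
    r"|no new findings"
    r"|good to ship)",
    re.IGNORECASE,
)

SHORTFALL_SIGNALS = re.compile(
    r"(short of (the )?(prediction|target|expected)"
    r"|did not match (the )?(prediction|expected)"
    r"|fell short)",
    re.IGNORECASE,
)

def grade_success(text: str) -> bool:
    """Pass when the commentary carries a success signal and no shortfall signal."""
    return bool(SUCCESS_SIGNALS.search(text)) and not SHORTFALL_SIGNALS.search(text)
```

Under this shape, "The fix landed as expected; you're good to ship" and "Finding 1 ... RESOLVED" both pass, while "fell short of prediction" routes to the shortfall branch instead.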
Summary
Grader-only improvement on the Aurora DSQL query-plan-explainability eval suite introduced by #141. After running the harness post-merge, three assertions were consistently reporting agent-correct outputs as failures because the grader regexes expected specific phrasings the agent rarely uses. This PR loosens those regexes without changing skill content or eval prompts.
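The dedicated branch for the index-usage assertion, replacing the generic 0.8-keyword fallback, could be sketched as follows. The index name, column names, and helper are hypothetical stand-ins for whatever the eval suite actually uses:

```python
import re

# Assumed identifiers for illustration only.
INDEX_NAME = "idx_orders_customer_created"      # hypothetical index name
INDEX_COLUMNS = ("customer_id", "created_at")   # hypothetical columns

def index_usage_asserted(answer: str) -> bool:
    """Accept any of the four evidence forms described in the PR."""
    text = answer.lower()
    # 1. Literal index name anywhere in the answer.
    if INDEX_NAME in text:
        return True
    # 2. An Index Scan plan line mentioning the new columns.
    if "index scan" in text and all(col in text for col in INDEX_COLUMNS):
        return True
    # 3. A prose claim that the planner picked/selected/chose the new index.
    if re.search(r"planner (picked|selected|chose) the new index", text):
        return True
    # 4. A Finding-resolved statement on the original finding.
    if re.search(r"finding 1\b.*resolved", text, re.DOTALL):
        return True
    return False
```

The design point is that an explicit disjunction over known-good evidence forms is easier to audit than a fuzzy keyword-overlap threshold, and it cannot drift into rewarding unrelated answers that happen to share vocabulary.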
Concrete phrasings the old grader rejected
Scope
tools/evals/databases-on-aws/scripts/run_query_explainability_evals.py only

Test plan
mise run build passes

Split out from #141 (merged). Targeted at the class of failures James's review cycle surfaced: agent does the right thing, grader says otherwise.
Generated with Claude Code
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.