
feat(databases-on-aws): loosen explainability grader for natural phrasings#149

Draft
anwesham-lab wants to merge 1 commit into awslabs:main from
anwesham-lab:feat/dsql-explainability-grader-improvements

Conversation

@anwesham-lab
Member

Summary

Grader-only improvement on the Aurora DSQL query-plan-explainability eval suite introduced by #141. After running the harness post-merge, three assertions were consistently reporting agent-correct outputs as failures because the grader regexes expected specific phrasings the agent rarely uses. This PR loosens those regexes without changing skill content or eval prompts.

Concrete phrasings the old grader rejected

| Agent output | Old grader behavior | New grader behavior |
| --- | --- | --- |
| "The fix landed as expected" | Fail (no literal "matches expected") | Pass — added "landed as (expected\|predicted\|promised)" |
| "You're good to ship" / "Finding 1 ... RESOLVED" | Fail on next-hypothesis branch | Pass (vacuously) — broadened success-signal set to include these |
| "Index Only Scan using idx_user_account_tenant_valid_from" | Fell to 0.8-keyword fallback | Pass — new dedicated branch matches literal index name or plan-tree Index Scan |
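To illustrate the kind of loosening described in the first table row, here is a minimal sketch of a broadened "matches expected" signal. The pattern and helper name are illustrative assumptions, not the actual code in `run_query_explainability_evals.py`:

```python
import re

# Hypothetical sketch: a success-signal regex broadened beyond the literal
# "matches expected" / "exceeds expected" phrasings to the natural variants
# listed in this PR. The real grader's pattern may be structured differently.
MATCH_EXPECTED_RE = re.compile(
    r"(matches|exceeds)\s+(expected|prediction|the target)"
    r"|landed as (expected|predicted|promised)"
    r"|hit the target"
    r"|delivered as promised"
    r"|worked as (expected|predicted)",
    re.IGNORECASE,
)

def matches_expected_signal(text: str) -> bool:
    """Return True if the agent output claims the fix matched expectations."""
    return bool(MATCH_EXPECTED_RE.search(text))
```

Keeping the original literal phrasings in the alternation means previously-passing outputs cannot regress; the new branches only add accepted phrasings.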

Scope

  • tools/evals/databases-on-aws/scripts/run_query_explainability_evals.py only
  • No skill content changes
  • No eval prompt changes
  • No fixture changes

Test plan

  • mise run build passes
  • Follow-up: rerun the full 9-eval suite against the updated grader to confirm eval 6 moves from 5/8 → 8/8, and that no previously-passing assertion regresses

Split out from #141 (merged). Targeted at the class of failures James's review cycle surfaced: agent does the right thing, grader says otherwise.

Generated with Claude Code


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.

…ral phrasings

The v11 eval run showed the agent producing correct outputs that the grader
rejected on phrasing grounds. Three specific misses were agent-correct,
grader-wrong:

1. "The fix landed as expected" / "you're good to ship" — correct match-vs-
   expected commentary, but the regex only accepted literal "matches expected",
   "exceeds expected", etc. Broaden to include "landed as (expected|predicted|
   promised)", "hit the target", "delivered as promised", "the fix worked as
   expected", and variants using prediction/target as synonyms for expected.

2. "Finding 1 ... RESOLVED", "no new findings", "you're good to ship" — all
   valid success signals for the next-hypothesis branch, but the old regex
   only matched "success(fully)" / "as expected" / "resolved". Broaden the
   success-signal set; broaden the shortfall-signal set (e.g. "short of
   prediction", "did not match prediction"); broaden the next-hypothesis set.

3. "Identifies the composite index is now being used" assertion was hitting
   the 0.8-keyword fallback matcher. Added a dedicated branch that accepts
   any of: the literal index name, an Index Scan line mentioning the new
   columns, a prose claim that "the planner picked/selected/chose the new
   index", or a Finding-resolved statement on the original finding.
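The dedicated branch in point 3 could be sketched as follows. The index name comes from the PR description; the helper name and exact patterns are illustrative assumptions rather than the project's actual grader code:

```python
import re

# From the PR description; the real grader presumably reads this from the eval.
INDEX_NAME = "idx_user_account_tenant_valid_from"

# Hypothetical sketch of the dedicated branch: any one of the four signals
# described in this PR counts as identifying the composite index in use.
_INDEX_USAGE_PATTERNS = [
    re.escape(INDEX_NAME),                               # literal index name
    r"Index (Only )?Scan using \S+",                     # plan-tree scan line
    r"planner (picked|selected|chose) the new index",    # prose claim
    r"Finding 1\b.*\bRESOLVED",                          # finding-resolved statement
]

def index_usage_recognized(text: str) -> bool:
    """Return True if the agent output shows the composite index is now used."""
    return any(
        re.search(pattern, text, re.IGNORECASE)
        for pattern in _INDEX_USAGE_PATTERNS
    )
```

Because the branch short-circuits on the first matching signal, the assertion no longer falls through to the 0.8-keyword fallback when the agent quotes the plan tree verbatim.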

Does not change skill content or eval prompts — only grader regexes. Targeted
at eval 6 (Phase 5 reassessment) where the agent's natural phrasing was being
penalized despite matching the intent of the assertion.

Co-authored-by: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
