feat(databases-on-aws): loosen explainability grader for natural phrasings #149
Draft
anwesham-lab wants to merge 1 commit into awslabs:main from
Conversation
feat(databases-on-aws): loosen explainability grader for natural phrasings

The v11 eval run showed the agent producing correct outputs that the grader rejected on phrasing grounds. Three specific misses were agent-correct, grader-wrong:

1. "The fix landed as expected" / "you're good to ship" — correct match-vs-expected commentary, but the regex only accepted literal "matches expected", "exceeds expected", etc. Broaden to include "landed as (expected|predicted|promised)", "hit the target", "delivered as promised", "the fix worked as expected", and variants using prediction/target as synonyms for expected.

2. "Finding 1 ... RESOLVED", "no new findings", "you're good to ship" — all valid success signals for the next-hypothesis branch, but the old regex only matched "success(fully)" / "as expected" / "resolved". Broaden the success-signal set; broaden the shortfall-signal set (e.g. "short of prediction", "did not match prediction"); broaden the next-hypothesis set.

3. The "Identifies the composite index is now being used" assertion was hitting the 0.8-keyword fallback matcher. Added a dedicated branch that accepts any of: the literal index name, an Index Scan line mentioning the new columns, a prose claim that "the planner picked/selected/chose the new index", or a Finding-resolved statement on the original finding.

Does not change skill content or eval prompts — only grader regexes. Targeted at eval 6 (Phase 5 reassessment) where the agent's natural phrasing was being penalized despite matching the intent of the assertion.

Co-authored-by: anwesham-lab <64298192+anwesham-lab@users.noreply.github.com>
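The broadened signal sets in points 1 and 2 could look roughly like the sketch below. The pattern names, exact alternations, and grading helper are illustrative assumptions, not the grader's actual source:

```python
import re

# Illustrative broadened signal sets (assumed names and alternations,
# not the real grader regexes).
SUCCESS_SIGNALS = re.compile(
    r"(success(fully)?"
    r"|as expected"
    r"|resolved"
    r"|landed as (expected|predicted|promised)"
    r"|hit the target"
    r"|delivered as promised"
    r"|worked as (expected|predicted)"
    r"|no new findings"
    r"|good to ship)",
    re.IGNORECASE,
)

SHORTFALL_SIGNALS = re.compile(
    r"(short of (the )?(prediction|target|expected)"
    r"|did not match (the )?(prediction|expected)"
    r"|fell short)",
    re.IGNORECASE,
)

def grade_success(text: str) -> bool:
    """Pass when the commentary carries a success signal and no shortfall signal."""
    return bool(SUCCESS_SIGNALS.search(text)) and not SHORTFALL_SIGNALS.search(text)
```

Under this shape, "The fix landed as expected; you're good to ship" and "Finding 1 ... RESOLVED" both pass, while "fell short of prediction" routes to the shortfall branch instead.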
Summary
Grader-only improvement on the Aurora DSQL query-plan-explainability eval suite introduced by #141. After running the harness post-merge, three assertions were consistently reporting agent-correct outputs as failures because the grader regexes expected specific phrasings the agent rarely uses. This PR loosens those regexes without changing skill content or eval prompts.
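The dedicated branch for the index-usage assertion, replacing the generic 0.8-keyword fallback, could be sketched as follows. The index name, column names, and helper are hypothetical stand-ins for whatever the eval suite actually uses:

```python
import re

# Assumed identifiers for illustration only.
INDEX_NAME = "idx_orders_customer_created"      # hypothetical index name
INDEX_COLUMNS = ("customer_id", "created_at")   # hypothetical columns

def index_usage_asserted(answer: str) -> bool:
    """Accept any of the four evidence forms described in the PR."""
    text = answer.lower()
    # 1. Literal index name anywhere in the answer.
    if INDEX_NAME in text:
        return True
    # 2. An Index Scan plan line mentioning the new columns.
    if "index scan" in text and all(col in text for col in INDEX_COLUMNS):
        return True
    # 3. A prose claim that the planner picked/selected/chose the new index.
    if re.search(r"planner (picked|selected|chose) the new index", text):
        return True
    # 4. A Finding-resolved statement on the original finding.
    if re.search(r"finding 1\b.*resolved", text, re.DOTALL):
        return True
    return False
```

The design point is that an explicit disjunction over known-good evidence forms is easier to audit than a fuzzy keyword-overlap threshold, and it cannot drift into rewarding unrelated answers that happen to share vocabulary.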
Concrete phrasings the old grader rejected
Scope
tools/evals/databases-on-aws/scripts/run_query_explainability_evals.py only

Test plan
mise run build passes

Split out from #141 (merged). Targeted at the class of failures James's review cycle surfaced: agent does the right thing, grader says otherwise.
Generated with Claude Code
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.