fix(benchmark): drop invalid focus token and phantom model IDs in corpus.yml #162

Merged
jeff-atriumn merged 1 commit into main from worktree-cheerful-mapping-curry on May 30, 2026

Conversation

@jeff-atriumn

Member

Summary

  • Removes does_it_work from matrix.focus — it is not a key in FOCUS_AREAS and causes ValueError: Unknown focus area at runtime
  • Updates the comment above focus: to stop describing does_it_work
  • Removes gemini-2.5-flash-lite and gemini-3.1-pro-preview — neither has a MODEL_PRICING entry
  • Corrects gemini-3-flash → gemini-3-flash-preview (actual pricing key)
  • Corrects gpt-5.2 → gpt-5.4 (actual pricing key)
  • Adds TestCorpusYmlFocusTokens and TestCorpusYmlModelIds in tests/test_benchmark.py to guard against regressions
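
The kind of check these guards perform can be sketched as follows. This is a minimal illustration, not the tests added in this PR: the registry contents, the `security` and `performance` focus names, the `models` key, and the helper names `unknown_entries` and `validate_matrix` are all hypothetical stand-ins, since the real `FOCUS_AREAS`, `MODEL_PRICING`, and `corpus.yml` layout are not shown here.

```python
# Hypothetical stand-ins for the project's registries. The real FOCUS_AREAS and
# MODEL_PRICING live in the codebase; these keys and prices are placeholders.
FOCUS_AREAS = {"security": "...", "performance": "..."}
MODEL_PRICING = {
    "gemini-3-flash-preview": {"input": 0.0, "output": 0.0},
    "gpt-5.4": {"input": 0.0, "output": 0.0},
}


def unknown_entries(values, registry):
    """Return the values that are not keys of the registry, preserving order."""
    return [v for v in values if v not in registry]


def validate_matrix(matrix):
    """Raise ValueError if the matrix lists an unregistered focus or model."""
    bad_focus = unknown_entries(matrix.get("focus", []), FOCUS_AREAS)
    bad_models = unknown_entries(matrix.get("models", []), MODEL_PRICING)
    problems = []
    if bad_focus:
        problems.append(f"Unknown focus area: {bad_focus}")
    if bad_models:
        problems.append(f"No pricing entry for models: {bad_models}")
    if problems:
        raise ValueError("; ".join(problems))


# Assumed shape of the parsed matrix section, before and after this fix.
before = {"focus": ["security", "does_it_work"],
          "models": ["gemini-3-flash", "gpt-5.2"]}
after = {"focus": ["security"],
         "models": ["gemini-3-flash-preview", "gpt-5.4"]}
```

Calling `validate_matrix(before)` raises `ValueError`, while `validate_matrix(after)` returns without error. Checking against the registries themselves, rather than a hard-coded allow-list, means the guard stays in sync when focus areas or pricing entries are added later.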

Test plan

  • pytest tests/test_benchmark.py — all 24 tests pass, including the 3 new corpus validation tests
  • Full pytest — 444/444 pass

Closes #159

…pus.yml

- Remove `does_it_work` from matrix.focus (not a registered FOCUS_AREAS key)
- Update trailing comment above `focus:` to remove reference to does_it_work
- Remove `gemini-2.5-flash-lite` and `gemini-3.1-pro-preview` (no MODEL_PRICING entries)
- Correct `gemini-3-flash` → `gemini-3-flash-preview` (actual pricing key)
- Correct `gpt-5.2` → `gpt-5.4` (actual pricing key)
- Add TestCorpusYmlFocusTokens and TestCorpusYmlModelIds to test_benchmark.py
  to guard against regressions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jeff-atriumn force-pushed the worktree-cheerful-mapping-curry branch from d7c496b to 60f37c2 on May 30, 2026 17:01
@jeff-atriumn merged commit ec85982 into main on May 30, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Fix benchmark corpus.yml: drop invalid 'does_it_work' focus + phantom models
