Skip to content

feat: add StringCheckGrader support for OpenAI Evals backend#102

Open
wiliyam wants to merge 4 commits intoagentevals-dev:mainfrom
wiliyam:feat/string-check-grader-95
Open

feat: add StringCheckGrader support for OpenAI Evals backend#102
wiliyam wants to merge 4 commits intoagentevals-dev:mainfrom
wiliyam:feat/string-check-grader-95

Conversation

@wiliyam
Copy link
Copy Markdown

@wiliyam wiliyam commented Apr 1, 2026

Summary

Closes #95

Adds support for OpenAI's string_check grader type alongside the existing text_similarity grader.

Changes

  • config.py: Added _VALID_STRING_CHECK_OPERATIONS set with all supported operations (eq, ne, like, ilike, contains, not_contains, starts_with, ends_with). Updated _validate_grader to validate string_check configs.
  • openai_eval_backend.py: Added string_check case in _build_testing_criteria that maps to the OpenAI testing criteria format.

Usage

evaluators:
  - name: response_check
    type: openai_eval
    grader:
      type: string_check
      operation: contains
      reference: "expected keyword"

Copy link
Copy Markdown
Contributor

@krisztianfekete krisztianfekete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, added some review comments!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will reject all grader types with this conditional, but string_check uses a static reference from config and doesn't need them.

Can you gate this on grader_type?

"actual_response": {"type": "string"},
"expected_response": {"type": "string"},
},
"required": ["actual_response", "expected_response"],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expected_response is no longer required as string_checker does not use it. Maybe we should make the schema grader-aware.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSONL items contain a field not declared in the schema. Please make this builder grader-aware too

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return None for string_check graders. Please make this conditional, or include grader-relevant keys, e.g. operation instead.

raise ValueError("'operation' is required for string_check grader")
if operation not in _VALID_STRING_CHECK_OPERATIONS:
raise ValueError(f"Unknown operation '{operation}'. Valid: {sorted(_VALID_STRING_CHECK_OPERATIONS)}")
if "reference" not in v:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do what we do for the other branch here as well with if not metric?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still relevant.

if "reference" not in v:
raise ValueError("'reference' is required for string_check grader")
else:
supported = "'text_similarity', 'string_check'"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use something like _SUPPORTED_GRADER_TYPES constant for all supported graders?

…ocations on grader type, use _SUPPORTED_GRADER_TYPES constant
@wiliyam
Copy link
Copy Markdown
Author

wiliyam commented Apr 2, 2026

Thanks for the detailed review @krisztianfekete! Addressed all 5 points:

  1. Grader type check — moved the grader_type not in _SUPPORTED_GRADER_TYPES check to the top, so unsupported types are rejected immediately regardless of other conditions
  2. Grader-aware schema — added _ACTUAL_ONLY_SCHEMA for graders that don't need expected_response (like string_check), and _get_item_schema(grader_type) helper to select the right schema
  3. expected_invocations gating — now only required for non-string_check graders since string_check uses a static reference from config
  4. operation in error context — the string_check testing criteria now correctly uses operation from config
  5. _SUPPORTED_GRADER_TYPES constant — added, used in both the validator and the unsupported-type error message

Copy link
Copy Markdown
Contributor

@krisztianfekete krisztianfekete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please take a closer look, EValRunConfig most definitely shouldn't been deleted, and many review feedback haven't been addressed. Also keep our guidelines in mind when contributing: https://github.com/agentevals-dev/agentevals/blob/main/CONTRIBUTING.md#responsible-ai-usage

BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef | OpenAIEvalDef,
Field(discriminator="type"),
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be reverted.

@wiliyam
Copy link
Copy Markdown
Author

wiliyam commented Apr 2, 2026

Apologies for the sloppy rewrite @krisztianfekete — I accidentally deleted EvalRunConfig when rewriting config.py. Fixed in this push:

  1. EvalRunConfig restored — exactly as it was in upstream
  2. Validator order reverted — type-specific checks first, unsupported type raises at the bottom (original pattern)
  3. if not metric style — matches other branch
  4. Grader-relevant keys in detailsoperation for string_check, evaluation_metric for text_similarity instead of always returning None
  5. _SUPPORTED_GRADER_TYPES constant — kept, used in the final else raise
  6. Grader-aware schema_ACTUAL_ONLY_SCHEMA for string_check, _TEXT_PAIR_SCHEMA for text_similarity
  7. expected_invocations gating — only required for non-string_check graders

Sorry again for the noise!

@wiliyam
Copy link
Copy Markdown
Author

wiliyam commented Apr 3, 2026

Addressed latest comments @krisztianfekete:

  1. JSONL builder grader-aware_build_jsonl_items now accepts grader_type and only includes expected_response for non-string_check graders — matching the item schema exactly
  2. if not v.get("reference") — changed from if "reference" not in v to match the if not metric pattern used in the text_similarity branch

}
)

_VALID_STRING_CHECK_OPERATIONS = frozenset(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not all valid, please fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add StringCheckGrader OpenAI Grader

2 participants