Hi AutoEvals folks,
We just added a small Assay-side sample around ExactMatch, and I wanted to share the narrow version rather than open with a broader integration ask.
The sample is here:
https://github.com/Rul1an/assay/tree/main/examples/autoevals-exactmatch-evidence
We kept it intentionally small. The probe runs `ExactMatch` on `autoevals==0.2.0`, stores the compared `output` / `expected` values separately as discovery context, and then reduces only the returned `Score` object. In the Python path, the useful public shape we observed was `name`, `score`, an empty `metadata`, and `error=None`.
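For concreteness, here is roughly what the probe does. This is a minimal sketch assuming the import path and call shape we ran against `autoevals==0.2.0`; the commented values are what we observed in our runs, not a documented contract:

```python
# Minimal probe sketch against autoevals==0.2.0 (observed behavior, not spec).
from autoevals import ExactMatch

output, expected = "42", "42"

# The raw compared values are kept only as discovery context,
# deliberately outside the canonical artifact.
discovery_context = {"output": output, "expected": expected}

# Reduce only the returned Score object.
score = ExactMatch()(output=output, expected=expected)

print(score.name)      # "ExactMatch" in our runs
print(score.score)     # integer 1 here (0 on mismatch)
print(score.metadata)  # {} (empty in our runs)
print(score.error)     # None
```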
For the Assay fixture, the canonical artifact keeps only the scorer name from `name`, the integer 0/1 score from `score`, and a `target_kind` recording that this is an output-vs-expected comparison rather than a stable target id. We deliberately leave raw compared values, metadata, error state, scorer config, Braintrust run/experiment wrappers, and broader AutoEvals scorer-family semantics out of the artifact.
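The reduction itself is small enough to sketch inline. The artifact keys and the `target_kind` value here are Assay-side conventions from the sample, not anything AutoEvals defines:

```python
# Sketch of the Assay-side reduction; the field names and the target_kind
# value are our conventions, not part of the AutoEvals surface.
def to_evidence_artifact(score) -> dict:
    """Reduce a returned Score to the minimal canonical artifact.

    Raw compared values, metadata, error state, scorer config, and
    Braintrust run/experiment wrappers are intentionally dropped.
    """
    return {
        "scorer": score.name,                 # scorer name from Score.name
        "score": int(score.score),            # integer 0/1 from Score.score
        "target_kind": "output_vs_expected",  # comparison kind, not a stable target id
    }
```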
The question is: does the returned `Score` object seem like the right minimal public surface for an external evidence consumer, or would you rather point consumers at a different returned/result boundary? If there is a seam that is more stable across Python and TypeScript, we're happy to tighten the sample around that.
Thanks for maintaining AutoEvals. `ExactMatch` is exactly the kind of small deterministic scorer that makes this boundary easy to reason about.