ekon15/braintrust-multi-score-demo

Braintrust Multi-Score Online Scorer Demo

Reproduction for Pylon #11423 / Linear BRA-4013: a Python online scorer that returns a list of scores fails when its rule runs automatically, but works on the "Test" page and in SDK offline eval.

The Issue

A Python scorer returns a list of scores (needed when scores must be computed sequentially, where score B depends on score A):

from braintrust import Score

# compute_from is a placeholder for any logic that needs score1's result.
def handler(input, output, expected, metadata) -> list[Score]:
    score1 = Score(name="has_output", score=1.0 if output else 0.0)
    score2 = Score(name="conciseness", score=compute_from(score1))  # depends on score1
    return [score1, score2]

| Context | Result |
| --- | --- |
| SDK offline eval (braintrust.Eval) | ✅ works |
| Online scorer "Test" page (manual trigger on past logs) | ✅ works |
| Online scorer automatic rule (runs on new logs) | fails: cannot log ... as a score |

Why

The "Test" page calls the scorer function and renders its return value directly, with no score-logging step. Because it only displays what the function returns, a list works fine.

The automatic online rule path calls the scorer, then passes the result to the span logging layer, which expects scores as a flat {name: float} dict. When the result is a list, the logging call fails.

This is a backend feature gap: the online rule execution runtime needs to handle list returns by converting them to a score dict before logging. The test-page path does not have this requirement.
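
The difference between the two paths can be illustrated with a small self-contained sketch. Everything here is a stand-in, not Braintrust's actual code: the Score dataclass mimics braintrust.Score so the snippet runs without the SDK, and render_on_test_page and log_scores are hypothetical names for the "display only" path and the logging layer. The error text is illustrative as well and is not the message the backend produces.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Stand-in for braintrust.Score so the sketch runs without the SDK."""
    name: str
    score: float


def render_on_test_page(result):
    """Hypothetical test-page path: display whatever the scorer returned."""
    return repr(result)


def log_scores(scores):
    """Hypothetical logging layer: accepts only a flat {name: float} dict."""
    if not isinstance(scores, dict):
        raise TypeError(
            f"scores must be a dict of name -> number, got {type(scores).__name__}"
        )
    for name, value in scores.items():
        if not isinstance(value, (int, float)):
            raise TypeError(f"score {name!r} must be a number")
    return scores


result = [Score("has_output", 1.0), Score("conciseness", 0.5)]

# Rendering a list is fine: nothing inspects its shape.
print(render_on_test_page(result))

# The automatic rule path hands the same list to the logger, which rejects it.
try:
    log_scores(result)
except TypeError as exc:
    print(f"automatic rule path failed: {exc}")
```

The same return value succeeds on one path and fails on the other only because the second path validates the shape of the scores before logging them.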

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export BRAINTRUST_API_KEY=your_key_here

Reproduce

# Shows SDK offline eval works with list returns
python test_offline_eval.py

# Shows the test-page path vs automatic rule path behavior
python test_online_scorer.py

Fix (BRA-4013)

The online scorer rule runtime needs to handle list-of-Score returns by converting to a score dict before calling span.log(scores=...):

# Before (broken):
span.log(scores=scorer_result)

# After (fixed):
if isinstance(scorer_result, list):
    scores = {s.name: s.score for s in scorer_result}
else:
    scores = {scorer_result.name: scorer_result.score}
span.log(scores=scores)
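
As a rough sketch, the conversion above could be factored into a helper that can be unit-tested without a live span. The name to_score_dict is hypothetical, and the Score dataclass is again a local stand-in for braintrust.Score. The duplicate-name check is an optional addition that is not part of the proposed fix: a plain dict comprehension would let a later score silently overwrite an earlier one with the same name.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Score:
    """Stand-in for braintrust.Score so the sketch runs without the SDK."""
    name: str
    score: float


def to_score_dict(result: Union[Score, List[Score]]) -> Dict[str, float]:
    """Convert a single Score or a list of Scores to a flat {name: score} dict."""
    items = result if isinstance(result, list) else [result]
    scores: Dict[str, float] = {}
    for item in items:
        # Optional safeguard (an assumption, not in the original fix):
        # refuse to silently overwrite a score that shares a name.
        if item.name in scores:
            raise ValueError(f"duplicate score name: {item.name!r}")
        scores[item.name] = item.score
    return scores


print(to_score_dict([Score("has_output", 1.0), Score("conciseness", 0.5)]))
print(to_score_dict(Score("has_output", 1.0)))
```

The runtime would then call span.log(scores=to_score_dict(scorer_result)) in place of the inline branch.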
