Reproduction for Pylon #11423 / Linear BRA-4013: Python online scorer returning a list of scores fails when the rule runs automatically, but works on the "Test" page and in SDK offline eval.
A Python scorer returns a list of scores (needed when scores must be computed sequentially, where score B depends on score A):

    from braintrust import Score

    def handler(input, output, expected, metadata) -> list[Score]:
        score1 = Score(name="has_output", score=1.0 if output else 0.0)
        score2 = Score(name="conciseness", score=compute_from(score1))  # depends on score1
        return [score1, score2]

| Context | Result |
|---|---|
| SDK offline eval (`braintrust.Eval`) | ✅ works |
| Online scorer "Test" page (manual trigger on past logs) | ✅ works |
| Online scorer automatic rule (runs on new logs) | ❌ cannot log ... as a score |
The "Test" page calls the scorer function and renders its return value directly, with no score-logging step, so a list works fine.
The automatic online rule path calls the scorer, then passes the result to the span logging layer, which expects scores as a flat {name: float} dict. When the result is a list, the logging call fails.
This is a backend feature gap: the online rule execution runtime needs to handle list returns by converting them to a score dict before logging. The test-page path does not have this requirement.
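The difference between the two paths can be illustrated with a small simulation. This is only a sketch: `render_test_page` and `log_scores` are hypothetical stand-ins for the test-page renderer and the span logging layer, the `Score` dataclass stands in for `braintrust.Score`, the error text is invented for illustration and is not the platform's actual message, and the `conciseness` computation is a placeholder.

```python
from dataclasses import dataclass
from numbers import Real


@dataclass
class Score:
    """Stand-in for braintrust.Score, reduced to the two fields used here."""
    name: str
    score: float


def handler(input, output, expected=None, metadata=None):
    """Scorer that returns a list of scores, as in the reproduction."""
    has_output = Score(name="has_output", score=1.0 if output else 0.0)
    # Placeholder dependency: the second score is derived from the first.
    conciseness = Score(name="conciseness", score=has_output.score * 0.5)
    return [has_output, conciseness]


def render_test_page(result):
    """Hypothetical test-page path: display whatever the scorer returned."""
    return repr(result)


def log_scores(scores):
    """Hypothetical logging layer: accepts only a flat {name: number} dict."""
    if not isinstance(scores, dict):
        raise TypeError(f"scores must be a dict, got {type(scores).__name__}")
    for name, value in scores.items():
        if not isinstance(name, str) or not isinstance(value, Real):
            raise TypeError(f"invalid score entry: {name!r}")
    return scores


result = handler(input="question", output="answer")

# Test-page path: no validation, so the list is displayed as-is.
rendered = render_test_page(result)

# Automatic-rule path: the list reaches the logging layer and is rejected.
try:
    log_scores(result)
    rule_path_error = None
except TypeError as exc:
    rule_path_error = str(exc)
```

In this model the same return value succeeds on one path and fails on the other purely because only one of them validates the shape of the scores.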
Set up the environment:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    export BRAINTRUST_API_KEY=your_key_here

Run the reproduction scripts:

    # Shows SDK offline eval works with list returns
    python test_offline_eval.py
    # Shows the test-page path vs automatic rule path behavior
    python test_online_scorer.py

The online scorer rule runtime needs to handle list-of-Score returns by converting them to a score dict before calling `span.log(scores=...)`:

    # Before (broken):
    span.log(scores=scorer_result)

    # After (fixed):
    if isinstance(scorer_result, list):
        scores = {s.name: s.score for s in scorer_result}
    else:
        scores = {scorer_result.name: scorer_result.score}
    span.log(scores=scores)
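The conversion step can also be factored into a standalone helper that is easy to unit test. The following is a minimal sketch, not the runtime's actual code: `to_score_dict` is a hypothetical name, and the `Score` dataclass stands in for `braintrust.Score`. It also assumes that an already-flat dict should pass through unchanged and that a duplicate score name should raise an error. The dict comprehension in the fix above would instead keep the last value silently, so the stricter behavior here is a design choice the source does not specify.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Stand-in for braintrust.Score (name and score fields only)."""
    name: str
    score: float


def to_score_dict(result):
    """Normalize a scorer result into a flat {name: score} dict.

    Accepts a single Score, a list or tuple of Scores, or an existing dict.
    """
    if isinstance(result, dict):
        # Already in the shape the logging layer expects; return a copy.
        return dict(result)
    items = result if isinstance(result, (list, tuple)) else [result]
    scores = {}
    for item in items:
        if item.name in scores:
            # Assumption: surface name collisions instead of overwriting.
            raise ValueError(f"duplicate score name: {item.name!r}")
        scores[item.name] = item.score
    return scores
```

With such a helper, the rule path would reduce to `span.log(scores=to_score_dict(scorer_result))`, and list, single-score, and dict returns would all go through the same code.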