feat(byob): port multiple-choice loglikelihood + few-shot onto shared metric contract by wprazuch · Pull Request #955 · NVIDIA-NeMo/Evaluator

wprazuch · 2026-05-04T08:01:32Z

Summary

Ports the multiple-choice loglikelihood + few-shot functionality from kanishks-23/Evaluator#1 onto Sandy's #950 shared-metric-contract branch.

Stacked on schapman/feat/shared-metric-contract — should not merge until #950 merges. Targeting Sandy's branch so the diff is purely the integration work, not the contract types.

Why this PR exists

This is the integration proof that non-trivial benchmark machinery composes with the shared metric contract without protocol changes. Multi-score scorers with per-row solver-emitted payloads (logprobs, ranking metadata) are the trickiest case the contract has to handle, and they fit on top of it with zero modifications to MetricInput, MetricResult, MetricDescriptor, or MetricOutputSpec.

What's in the diff (1554 LOC across 13 files)

Scoring (Tier 1 — protocol-fit proof)

- scoring/multiple_choice.py (new): multiple_choice_acc and mcq_letter_extract, both opting into @scorer(metric_type=..., outputs=[...]). Choices and logprobs flow through MetricInput.candidate.metadata, exactly the slot the contract designates for solver-emitted payloads.
- scoring/metric.py: ScorerFunctionMetric.compute_scores merges candidate.metadata into the legacy ScorerInput.metadata so legacy scorers see solver-emitted keys. (This is the C5 fix from the earlier review on #950.)
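
The metadata merge described above can be pictured with a small standalone sketch. It does not use the real ScorerFunctionMetric or MetricInput types. The helper names, the `_mc_logprobs` key, and the rule that solver-emitted keys win on a name clash are all assumptions made for illustration; only `_mc_choices` is a key named in this PR.

```python
from typing import Any, Mapping, Sequence


def merge_scorer_metadata(
    row_metadata: Mapping[str, Any],
    candidate_metadata: Mapping[str, Any],
) -> dict:
    """Build the metadata dict a legacy scorer would receive (hypothetical helper)."""
    merged = dict(row_metadata)
    # Assumption: solver-emitted keys take precedence over row-level keys.
    merged.update(candidate_metadata)
    return merged


def predicted_choice_index(choice_logprobs: Sequence[float]) -> int:
    """Return the index of the highest-scoring choice (argmax)."""
    if not choice_logprobs:
        raise ValueError("no choice logprobs to rank")
    return max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])


row_metadata = {"subject": "physics"}
candidate_metadata = {
    "_mc_choices": ["A", "B", "C", "D"],
    "_mc_logprobs": [-4.0, -1.5, -3.0, -2.5],  # hypothetical key name
}
merged = merge_scorer_metadata(row_metadata, candidate_metadata)
predicted = predicted_choice_index(merged["_mc_logprobs"])
print(merged["_mc_choices"][predicted])
```

Because the legacy scorer only ever reads one flat metadata dict, it needs no knowledge of where each key came from, which is the property the translator merge is meant to preserve.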

Solver + Environment (Tier 2 — full functional port)

- solvers/logprob.py (new): LogprobRankingSolver ranks candidate continuations via /completions with max_tokens=0, echo=true, logprobs=1. Continuation span is located via text_offset. Per-choice calls run concurrently behind max_concurrent_choices.
- environments/custom.py:
  - BenchmarkDefinition extensions: choices, choices_field (dotted-path: choices.text), num_fewshot, fewshot_split, fewshot_template, fewshot_separator, fewshot_seed.
  - @benchmark decorator threads the new kwargs through.
  - ByobEnvironment.seed() renders the few-shot prefix and populates metadata["_mc_choices"].
  - _metric_input_from_verify lifts _mc_*/_solver_* namespaced keys onto MetricInput.candidate.metadata rather than row.data.
- engine/eval_loop.py: forwards solve_result.scoring_details to env.verify(...) as additional kwargs, giving solvers a per-row payload channel into the scorer.
- environments/custom.py:_load_hf: path-segment URI parsing (hf://ns/name/config[/split]) and row filters (?filter_field=...&filter_value=... with _1/_2 suffixes). Required for namespaced multilingual datasets such as CohereForAI/Global-MMLU-Lite/en?split=test.
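
The URI shape handled by _load_hf can be sketched as a standalone parser. This is not the code in the PR: `HfDatasetRef` and `parse_hf_uri` are hypothetical names, and the rules that a `split` query parameter overrides a path-segment split and that each `filter_field` must have a matching `filter_value` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import parse_qs


@dataclass
class HfDatasetRef:
    """Parsed pieces of an hf:// dataset URI (hypothetical container)."""
    repo_id: str
    config: Optional[str] = None
    split: Optional[str] = None
    filters: Dict[str, str] = field(default_factory=dict)


def parse_hf_uri(uri: str) -> HfDatasetRef:
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    path, _, query = uri[len(prefix):].partition("?")
    segments = [s for s in path.split("/") if s]
    if not 2 <= len(segments) <= 4:
        raise ValueError("expected hf://ns/name/config[/split]")
    repo_id = "/".join(segments[:2])
    config = segments[2] if len(segments) > 2 else None
    split = segments[3] if len(segments) > 3 else None

    # Keep the last value when a query key is repeated.
    params = {key: values[-1] for key, values in parse_qs(query).items()}
    # Assumption: an explicit ?split= overrides the path-segment split.
    split = params.get("split", split)

    filters: Dict[str, str] = {}
    for key, column in params.items():
        if key.startswith("filter_field"):
            suffix = key[len("filter_field"):]  # "", "_1", "_2", ...
            value_key = "filter_value" + suffix
            if value_key not in params:
                raise ValueError(f"{key} has no matching {value_key}")
            filters[column] = params[value_key]
    return HfDatasetRef(repo_id, config, split, filters)


ref = parse_hf_uri("hf://CohereForAI/Global-MMLU-Lite/en?split=test")
print(ref.repo_id, ref.config, ref.split)
```

Pairing filters by suffix keeps each field next to its value even when several filters appear in one query string.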

Tests (74 new tests, all green)

| File | Tests |
| --- | --- |
| `tests/test_scoring/test_multiple_choice.py` | 27 (decorator surface, semantics, gold-index resolution, end-to-end via translator, `validate_metric_result` enforcement) |
| `tests/test_environments/test_byob_mc_integration.py` | 13 (`_resolve_mc_choices`, `_metric_input_from_verify` namespacing, MMLU/ARC-style end-to-end, few-shot prefix, decorator wiring) |
| `tests/test_environments/test_dataset_uri_parsing.py` | 11 (path-segment configs, query overrides, row filters, num_examples slicing) |
| `tests/test_solvers/test_logprob_solver.py` | 9 (response parser, token-straddling fallback, ranking, error paths) |

All 25 of Sandy's existing tests still pass. 149 tests green across the touched modules.

Protocol-fit proof — what's NOT changed

Zero modifications to:

- MetricInput, MetricResult, MetricDescriptor, MetricOutputSpec, MetricOutput, Metric
- ContinuousScore, DiscreteScore, Label, BooleanValue
- ScorerInput field set (still response/target/metadata/config/sandbox; the new fields stay on MetricInput.candidate.metadata)
- The @scorer(metric_type=..., outputs=...) signature

Mapping from Kanishk's V1 PR to this V2 port

| Kanishk V1 (`packages/nemo-evaluator/contrib/byob/`) | V2 (`src/nemo_evaluator/`) |
| --- | --- |
| `decorators.py:ScorerInput` extra fields | Not ported. They live on `MetricInput.candidate.metadata` via the `_mc_*` namespace. |
| `eval_logic.py:MultipleChoiceStrategy` | Split: per-call inference → `LogprobRankingSolver`; per-row scoring stays in `ByobEnvironment.verify`. |
| `runner.py:call_model_loglikelihood` + parser | `solvers/logprob.py:_score_continuation` + `_parse_loglikelihood_response`. |
| `scorers.py:multiple_choice_acc` | `scoring/multiple_choice.py:multiple_choice_acc`, typed via `@scorer(outputs=...)`. |
| `scorers.py:mcq_letter_extract` | `scoring/multiple_choice.py:mcq_letter_extract`, typed. |
| `decorators.py` few-shot fields on `BenchmarkDefinition` | `environments/custom.py:BenchmarkDefinition` |
| `eval_logic.py:build_fewshot_prefix` | `ByobEnvironment._fewshot_prefix` |
| `dataset.py` HF URI parsing | `environments/custom.py:_load_hf` (enriched in-place) |
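
For readers unfamiliar with the few-shot fields that moved across in this mapping, here is a minimal sketch of how num_fewshot, fewshot_template, fewshot_separator, and fewshot_seed could combine into a prompt prefix. It is not ByobEnvironment._fewshot_prefix. The function signature, the use of str.format for the template, the default separator, and the trailing separator are assumptions.

```python
import random
from typing import Any, Mapping, Sequence


def build_fewshot_prefix(
    pool: Sequence[Mapping[str, Any]],
    num_fewshot: int,
    template: str,
    separator: str = "\n\n",
    seed: int = 0,
) -> str:
    """Render a reproducible few-shot prefix from a pool of example rows."""
    if num_fewshot <= 0 or not pool:
        return ""
    # A dedicated Random instance keeps sampling reproducible for a given
    # seed without touching the global random state.
    rng = random.Random(seed)
    shots = rng.sample(list(pool), min(num_fewshot, len(pool)))
    rendered = [template.format(**row) for row in shots]
    # Assumption: the prefix ends with the separator so the evaluated
    # question can be appended directly after it.
    return separator.join(rendered) + separator


pool = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "3 + 5 = ?", "answer": "8"},
    {"question": "7 - 4 = ?", "answer": "3"},
]
prefix = build_fewshot_prefix(
    pool, num_fewshot=2, template="Q: {question}\nA: {answer}", seed=1234
)
print(prefix)
```

Fixing the seed means every evaluation row sees the same demonstrations, which keeps scores comparable across runs.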

Test plan

- All 27 multiple_choice tests pass (protocol-fit proof)
- All 13 BYOB integration tests pass (end-to-end seed → verify)
- All 11 dataset URI parsing tests pass
- All 9 logprob solver tests pass (mocked HTTP)
- All 25 existing #950 tests still pass (no regression in the contract layer)
- Real MMLU smoke test against a vLLM endpoint (deferred to follow-up)
- Sandy reviews the ScorerFunctionMetric.compute_scores translator merge (C5 fix) and the eval_loop.py solver-payload-forward addition

🤖 Generated with Claude Code


Demonstrates that non-trivial benchmark machinery (multi-score scorers with per-row solver-emitted payloads) fits the shared metric contract (#950) without modifying any protocol type. Functionality ported from kanishks-23#1 (V1 layout) to V2 layout against schapman/feat/shared-metric-contract.

Protocol-fit proof:

* MetricInput, MetricResult, MetricDescriptor, MetricOutputSpec untouched.
* multiple_choice_acc opts into @scorer(metric_type=..., outputs=...) like any typed scorer; the choices/logprobs payload flows through MetricInput.candidate.metadata, exactly the slot the contract designates.
* validate_metric_result enforces declared-vs-emitted at runtime.

Functional port (Tier 2):

* LogprobRankingSolver: ranks candidate continuations via /completions with max_tokens=0, echo=true, logprobs=1; parses continuation spans via text_offset; concurrent per-choice calls.
* @benchmark extensions: choices, choices_field (dotted-path), num_fewshot, fewshot_split, fewshot_template, fewshot_separator, fewshot_seed.
* ByobEnvironment.seed() renders few-shot prefix and populates _mc_choices.
* _load_hf URI parsing: path-segment configs (hf://ns/name/cfg[/split]) and row filters (?filter_field=...&filter_value=...), required for Sovereign-style multilingual datasets like CohereForAI/Global-MMLU-Lite/en.
* Eval loop forwards solve_result.scoring_details to env.verify kwargs; ByobEnvironment.verify lifts _mc_*/_solver_* namespaced keys onto MetricInput.candidate.metadata.
* ScorerFunctionMetric translator merges candidate.metadata into legacy ScorerInput.metadata so legacy scorers see solver-emitted payloads.

Tests: 85 new tests across multiple_choice, BYOB MC integration, dataset URI parsing, and the logprob solver; all green alongside Sandy's existing 25 contract tests.

Stacked on schapman/feat/shared-metric-contract; should not merge until #950 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>

copy-pr-bot · 2026-05-04T08:01:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


coderabbitai · 2026-05-04T08:01:39Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9d9ac144-a9ca-4223-a560-b772c1179556


…ons server

End-to-end script that spins up an aiohttp /v1/completions returning OpenAI-shape responses with deterministic logprobs (gold continuation gets the highest score), then runs the full pipeline:

ByobEnvironment.seed
→ LogprobRankingSolver.solve (real HTTP)
→ ByobEnvironment.verify (with merged seed + scoring_details)
→ multiple_choice_acc.compute_scores
→ acc=1.0 assertion

This validates the wire format the solver emits (max_tokens=0, echo=true, logprobs=1), the text_offset-based continuation parsing, concurrent per-choice ranking + argmax selection, and the verify-meta plumbing that lifts solver-emitted _mc_* keys onto MetricInput.candidate.metadata.

Run: python scripts/smoketest_logprob_solver.py

Real-model testing (a vLLM-served model on SLURM) is a follow-up via nel-assistant or local-vllm-eval skills; this script proves the wire format and end-to-end orchestration work without a model server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
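
The text_offset-based continuation parsing that this smoke test validates can be sketched on a hand-built response dict. This is not the PR's _parse_loglikelihood_response. The function names are hypothetical, the response follows the OpenAI-style completions shape with echo enabled, and the sketch counts only tokens that start at or after the context boundary, so it does not reproduce the token-straddling fallback mentioned in the test list.

```python
from typing import Any, Mapping, Sequence


def continuation_logprob(response: Mapping[str, Any], context: str) -> float:
    """Sum the logprobs of echoed tokens that belong to the continuation."""
    logprobs = response["choices"][0]["logprobs"]
    boundary = len(context)  # character offset where the continuation starts
    total = 0.0
    for offset, token_logprob in zip(
        logprobs["text_offset"], logprobs["token_logprobs"]
    ):
        # The first echoed token has no conditional logprob (None).
        if offset >= boundary and token_logprob is not None:
            total += token_logprob
    return total


def rank_choices(scores: Sequence[float]) -> int:
    """Return the index of the choice with the highest total logprob."""
    return max(range(len(scores)), key=lambda i: scores[i])


def fake_response(token_logprobs, text_offset):
    # Hand-built stand-in for a /completions reply; no HTTP involved.
    return {
        "choices": [
            {"logprobs": {"token_logprobs": token_logprobs, "text_offset": text_offset}}
        ]
    }


context = "Q: 2+2? A:"
# Each echoed prompt is context + " 4" or context + " 5"; offsets are in characters.
resp_four = fake_response([None, -0.5, -0.25], [0, 2, 10])
resp_five = fake_response([None, -0.5, -2.0], [0, 2, 10])
scores = [continuation_logprob(r, context) for r in (resp_four, resp_five)]
print(scores, rank_choices(scores))
```

Summing only the tokens past the boundary means the shared context contributes nothing to the comparison, so choices are ranked purely on their continuations.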

SandyChapman and others added 2 commits April 30, 2026 13:07

feat: add shared metric contract for scorer functions

62efcfa

Expose MetricInput -> MetricResult types and adapt decorated scorers via to_metric() so Evaluator OSS scorers can share a runtime contract with platform integrations while preserving BYOB scorer compatibility.

github-actions Bot added the tests label May 4, 2026

github-actions Bot added the scripts label May 4, 2026

SandyChapman force-pushed the schapman/feat/shared-metric-contract branch from 62efcfa to 2f6b8d9 Compare May 4, 2026 18:58

SandyChapman force-pushed the schapman/feat/shared-metric-contract branch 2 times, most recently from 8e20f8e to 3ef469f Compare May 15, 2026 12:21

wprazuch wants to merge 3 commits into schapman/feat/shared-metric-contract from wprazuch/feat/mc-logprob-on-shared-contract
