
feat(byob): port multiple-choice loglikelihood + few-shot onto shared metric contract #955

Draft
wprazuch wants to merge 3 commits into schapman/feat/shared-metric-contract from wprazuch/feat/mc-logprob-on-shared-contract

Conversation

@wprazuch
Contributor

@wprazuch wprazuch commented May 4, 2026

Summary

Ports the multiple-choice loglikelihood + few-shot functionality from kanishks-23/Evaluator#1 onto Sandy's #950 shared-metric-contract branch.

Stacked on schapman/feat/shared-metric-contract — should not merge until #950 merges. Targeting Sandy's branch so the diff is purely the integration work, not the contract types.

Why this PR exists

This is the integration proof that non-trivial benchmark machinery composes with the shared metric contract without protocol changes. Multi-score scorers with per-row solver-emitted payloads (logprobs, ranking metadata) — the trickiest case the contract has to handle — drop on top with zero modifications to MetricInput, MetricResult, MetricDescriptor, or MetricOutputSpec.

What's in the diff (1554 LOC across 13 files)

Scoring (Tier 1 — protocol-fit proof)

  • scoring/multiple_choice.py (new): multiple_choice_acc and mcq_letter_extract, both opting into @scorer(metric_type=..., outputs=[...]). Choices/logprobs flow through MetricInput.candidate.metadata, exactly the slot the contract designates for solver-emitted payload.
  • scoring/metric.py: ScorerFunctionMetric.compute_scores merges candidate.metadata into the legacy ScorerInput.metadata so legacy scorers see solver-emitted keys. (This is the C5 fix from the earlier review on #950, feat: add shared metric contract for scorer functions.)
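
A minimal sketch of the scoring path described in the two bullets above, assuming the solver emits one summed log-probability per choice. The key name `_mc_choice_logprobs` and both helper functions are hypothetical and only illustrate the idea; the PR itself names just `_mc_choices` and the `_mc_*` namespace. Letting candidate keys win on conflict is also an assumption.

```python
from typing import Any, Dict, Mapping, Sequence


def merge_metadata(
    row_metadata: Mapping[str, Any],
    candidate_metadata: Mapping[str, Any],
) -> Dict[str, Any]:
    """Overlay solver-emitted keys on the row metadata.

    Assumption: candidate keys win when both sides define the same key.
    """
    merged = dict(row_metadata)
    merged.update(candidate_metadata)
    return merged


def choice_accuracy(metadata: Mapping[str, Any], gold_index: int) -> float:
    """Return 1.0 if the highest-logprob choice is the gold one, else 0.0."""
    logprobs: Sequence[float] = metadata["_mc_choice_logprobs"]  # hypothetical key
    predicted = max(range(len(logprobs)), key=lambda i: logprobs[i])
    return 1.0 if predicted == gold_index else 0.0


merged = merge_metadata(
    {"subject": "physics"},
    {"_mc_choice_logprobs": [-4.2, -1.3, -3.0]},
)
print(choice_accuracy(merged, gold_index=1))
```

Because the legacy scorer only reads a flat metadata mapping, a merge of this kind keeps older scorers working without giving ScorerInput any new fields.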

Solver + Environment (Tier 2 — full functional port)

  • solvers/logprob.py (new): LogprobRankingSolver ranks candidate continuations via /completions with max_tokens=0, echo=true, logprobs=1. Continuation span is located via text_offset. Per-choice calls run concurrently behind max_concurrent_choices.
  • environments/custom.py:
    • BenchmarkDefinition extensions: choices, choices_field (dotted-path: choices.text), num_fewshot, fewshot_split, fewshot_template, fewshot_separator, fewshot_seed.
    • @benchmark decorator threads the new kwargs through.
    • ByobEnvironment.seed() renders the few-shot prefix and populates metadata["_mc_choices"].
    • _metric_input_from_verify lifts _mc_*/_solver_* namespaced keys onto MetricInput.candidate.metadata rather than row.data.
  • engine/eval_loop.py: forwards solve_result.scoring_details to env.verify(...) as additional kwargs, giving solvers a per-row payload channel into the scorer.
  • environments/custom.py:_load_hf: path-segment URI parsing (hf://ns/name/config[/split]) and row filters (?filter_field=...&filter_value=... with _1/_2 suffixes). Required for namespaced multilingual datasets such as CohereForAI/Global-MMLU-Lite/en?split=test.
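
The URI shape in the last bullet can be illustrated with a small parser. This is a sketch under stated assumptions, not the `_load_hf` implementation: `HfDatasetRef` and `parse_hf_uri` are hypothetical names, a `?split=` query value is assumed to override a path-segment split, and the `_1`/`_2` suffixes are assumed to pair `filter_field_N` with `filter_value_N`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
from urllib.parse import parse_qs


@dataclass
class HfDatasetRef:
    repo: str                      # "namespace/name"
    config: Optional[str] = None   # third path segment, if present
    split: Optional[str] = None    # fourth path segment or ?split=
    filters: Dict[str, str] = field(default_factory=dict)


def parse_hf_uri(uri: str) -> HfDatasetRef:
    """Parse hf://ns/name/config[/split]?filter_field=...&filter_value=..."""
    if not uri.startswith("hf://"):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    path, _, query = uri[len("hf://"):].partition("?")
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected at least namespace/name in {uri!r}")

    ref = HfDatasetRef(repo="/".join(parts[:2]))
    if len(parts) > 2:
        ref.config = parts[2]
    if len(parts) > 3:
        ref.split = parts[3]

    # Keep the first value of each query parameter.
    params = {k: v[0] for k, v in parse_qs(query).items()}
    if "split" in params:
        ref.split = params["split"]  # assumed: query overrides path segment

    # Pair filter_field[_N] with filter_value[_N] by their shared suffix.
    for key, column in params.items():
        if key.startswith("filter_field"):
            value_key = "filter_value" + key[len("filter_field"):]
            if value_key in params:
                ref.filters[column] = params[value_key]
    return ref


print(parse_hf_uri("hf://CohereForAI/Global-MMLU-Lite/en?split=test"))
```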

Tests (74 new tests, all green)

| File | Tests |
|---|---|
| tests/test_scoring/test_multiple_choice.py | 27 (decorator surface, semantics, gold-index resolution, end-to-end via translator, validate_metric_result enforcement) |
| tests/test_environments/test_byob_mc_integration.py | 13 (_resolve_mc_choices, _metric_input_from_verify namespacing, MMLU/ARC-style end-to-end, few-shot prefix, decorator wiring) |
| tests/test_environments/test_dataset_uri_parsing.py | 11 (path-segment configs, query overrides, row filters, num_examples slicing) |
| tests/test_solvers/test_logprob_solver.py | 9 (response parser, token-straddling fallback, ranking, error paths) |

All 25 of Sandy's existing tests still pass. 149 tests green across the touched modules.

Protocol-fit proof — what's NOT changed

Zero modifications to:

  • MetricInput, MetricResult, MetricDescriptor, MetricOutputSpec, MetricOutput, Metric
  • ContinuousScore, DiscreteScore, Label, BooleanValue
  • ScorerInput field set (still response/target/metadata/config/sandbox — the new fields stay on MetricInput.candidate.metadata)
  • The @scorer(metric_type=..., outputs=...) signature
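
To illustrate how new fields can stay off ScorerInput, here is a sketch of partitioning verify-time metadata by key prefix. The `_mc_` and `_solver_` prefixes come from this PR; the function name, the return shape, and the example key `_solver_latency_ms` are hypothetical.

```python
from typing import Any, Dict, Mapping, Tuple

# Prefixes named in the PR description for solver-emitted keys.
SOLVER_PREFIXES = ("_mc_", "_solver_")


def split_verify_metadata(
    meta: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split metadata into (row_data, candidate_metadata) by key prefix."""
    candidate = {k: v for k, v in meta.items() if k.startswith(SOLVER_PREFIXES)}
    row = {k: v for k, v in meta.items() if not k.startswith(SOLVER_PREFIXES)}
    return row, candidate


row, candidate = split_verify_metadata(
    {"question": "2+2?", "_mc_choices": ["3", "4"], "_solver_latency_ms": 12}
)
print(row)
print(candidate)
```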

Mapping from Kanishk's V1 PR to this V2 port

| Kanishk V1 (packages/nemo-evaluator/contrib/byob/) | V2 (src/nemo_evaluator/) |
|---|---|
| decorators.py:ScorerInput extra fields | Not ported. They live on MetricInput.candidate.metadata via the _mc_* namespace. |
| eval_logic.py:MultipleChoiceStrategy | Split: per-call inference → LogprobRankingSolver; per-row scoring stays in ByobEnvironment.verify. |
| runner.py:call_model_loglikelihood + parser | solvers/logprob.py:_score_continuation + _parse_loglikelihood_response. |
| scorers.py:multiple_choice_acc | scoring/multiple_choice.py:multiple_choice_acc, typed via @scorer(outputs=...). |
| scorers.py:mcq_letter_extract | scoring/multiple_choice.py:mcq_letter_extract, typed. |
| decorators.py few-shot fields on BenchmarkDefinition | environments/custom.py:BenchmarkDefinition |
| eval_logic.py:build_fewshot_prefix | ByobEnvironment._fewshot_prefix |
| dataset.py HF URI parsing | environments/custom.py:_load_hf (enriched in-place) |
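
The few-shot rows in the mapping above can be pictured with a short sketch of prefix rendering. It assumes `fewshot_template` is a `str.format`-style template over row fields and that `fewshot_seed` drives a seeded sample from the few-shot split; neither detail is confirmed by this PR, and `sketch_fewshot_prefix` is a hypothetical name, not the signature of `_fewshot_prefix`.

```python
import random
from typing import Any, Mapping, Sequence


def sketch_fewshot_prefix(
    pool: Sequence[Mapping[str, Any]],
    num_fewshot: int,
    template: str,
    separator: str = "\n\n",
    seed: int = 0,
) -> str:
    """Render a deterministic few-shot prefix from a pool of example rows."""
    if num_fewshot <= 0 or not pool:
        return ""
    rng = random.Random(seed)  # local RNG so global random state is untouched
    shots = rng.sample(list(pool), k=min(num_fewshot, len(pool)))
    rendered = [template.format(**row) for row in shots]
    # Trailing separator so the test question can be appended directly.
    return separator.join(rendered) + separator


pool = [{"question": f"Q{i}", "answer": f"A{i}"} for i in range(5)]
template = "Question: {question}\nAnswer: {answer}"
print(sketch_fewshot_prefix(pool, num_fewshot=2, template=template, seed=7))
```

Using a dedicated `random.Random(seed)` instance keeps the selected shots reproducible across runs regardless of other random calls in the process.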

Test plan

  • All 27 multiple_choice tests pass — protocol-fit proof
  • All 13 BYOB integration tests pass — end-to-end seed → verify
  • All 11 dataset URI parsing tests pass
  • All 9 logprob solver tests pass (mocked HTTP)
  • All 25 existing tests from #950 (feat: add shared metric contract for scorer functions) still pass, with no regression in the contract layer
  • Real MMLU smoke test against a vLLM endpoint (deferred to follow-up)
  • Sandy reviews the ScorerFunctionMetric.compute_scores translator merge (C5 fix) and eval_loop.py solver-payload-forward addition

🤖 Generated with Claude Code

SandyChapman and others added 2 commits April 30, 2026 13:07
Expose MetricInput -> MetricResult types and adapt decorated scorers via to_metric() so Evaluator OSS scorers can share a runtime contract with platform integrations while preserving BYOB scorer compatibility.
Demonstrates that non-trivial benchmark machinery — multi-score scorers
with per-row solver-emitted payloads — fits the shared metric contract
(#950) without modifying any protocol type. Functionality ported from
kanishks-23#1 (V1 layout) to V2 layout against
schapman/feat/shared-metric-contract.

Protocol-fit proof:
* MetricInput, MetricResult, MetricDescriptor, MetricOutputSpec untouched.
* multiple_choice_acc opts into @scorer(metric_type=..., outputs=...) like
  any typed scorer; the choices/logprobs payload flows through
  MetricInput.candidate.metadata, exactly the slot the contract designates.
* validate_metric_result enforces declared-vs-emitted at runtime.

Functional port (Tier 2):
* LogprobRankingSolver: ranks candidate continuations via /completions
  with max_tokens=0, echo=true, logprobs=1; parses continuation spans
  via text_offset; concurrent per-choice calls.
* @benchmark extensions: choices, choices_field (dotted-path), num_fewshot,
  fewshot_split, fewshot_template, fewshot_separator, fewshot_seed.
* ByobEnvironment.seed() renders few-shot prefix and populates _mc_choices.
* _load_hf URI parsing: path-segment configs (hf://ns/name/cfg[/split]) and
  row filters (?filter_field=...&filter_value=...) — required for
  Sovereign-style multilingual datasets like CohereForAI/Global-MMLU-Lite/en.
* Eval loop forwards solve_result.scoring_details to env.verify kwargs;
  ByobEnvironment.verify lifts _mc_*/_solver_* namespaced keys onto
  MetricInput.candidate.metadata.
* ScorerFunctionMetric translator merges candidate.metadata into legacy
  ScorerInput.metadata so legacy scorers see solver-emitted payloads.

Tests: 85 new tests across multiple_choice, BYOB MC integration, dataset
URI parsing, and the logprob solver — all green alongside Sandy's existing
25 contract tests.

Stacked on schapman/feat/shared-metric-contract; should not merge until
#950 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 4, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9d9ac144-a9ca-4223-a560-b772c1179556

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@github-actions github-actions Bot added the tests label May 4, 2026
…ons server

End-to-end script that spins up an aiohttp /v1/completions returning
OpenAI-shape responses with deterministic logprobs (gold continuation
gets the highest score), then runs the full pipeline:

  ByobEnvironment.seed → LogprobRankingSolver.solve (real HTTP) →
  ByobEnvironment.verify (with merged seed + scoring_details) →
  multiple_choice_acc.compute_scores → acc=1.0 assertion

This validates the wire format the solver emits (max_tokens=0, echo=true,
logprobs=1), the text_offset-based continuation parsing, concurrent
per-choice ranking + argmax selection, and the verify-meta plumbing
that lifts solver-emitted _mc_* keys onto MetricInput.candidate.metadata.
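
The wire format and continuation parsing named above can be sketched as follows. This is an illustration, not the code in solvers/logprob.py: it assumes the OpenAI-style completions response carries parallel `text_offset` and `token_logprobs` lists, counts only tokens that start at or after the context boundary, and does not handle a token that straddles the boundary. The function names and the `model` argument are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Sequence, Tuple


def build_request(model: str, context: str, continuation: str) -> Dict[str, Any]:
    """Score-only request: echo the prompt with logprobs, generate nothing."""
    return {
        "model": model,
        "prompt": context + continuation,
        "max_tokens": 0,
        "echo": True,
        "logprobs": 1,
    }


def continuation_logprob(logprobs: Mapping[str, Any], context_len: int) -> float:
    """Sum logprobs of echoed tokens that start at or after the context end."""
    total = 0.0
    for offset, lp in zip(logprobs["text_offset"], logprobs["token_logprobs"]):
        # The first echoed token has no logprob (None); skip it.
        if offset >= context_len and lp is not None:
            total += lp
    return total


def rank_choices(
    context: str, per_choice_logprobs: Sequence[Mapping[str, Any]]
) -> Tuple[int, List[float]]:
    """Return (argmax index, per-choice scores) over continuation logprobs."""
    scores = [continuation_logprob(lp, len(context)) for lp in per_choice_logprobs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Two fabricated responses for context "ab" with continuations "cd" and "e".
responses = [
    {"text_offset": [0, 1, 2, 3], "token_logprobs": [None, -1.0, -0.25, -0.5]},
    {"text_offset": [0, 1, 2], "token_logprobs": [None, -1.0, -2.0]},
]
print(rank_choices("ab", responses))
```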

Run: python scripts/smoketest_logprob_solver.py

Real-model testing (a vLLM-served model on SLURM) is a follow-up via
nel-assistant or local-vllm-eval skills; this script proves the wire
format and end-to-end orchestration work without a model server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
@SandyChapman SandyChapman force-pushed the schapman/feat/shared-metric-contract branch from 62efcfa to 2f6b8d9 Compare May 4, 2026 18:58
@SandyChapman SandyChapman force-pushed the schapman/feat/shared-metric-contract branch 2 times, most recently from 8e20f8e to 3ef469f Compare May 15, 2026 12:21