feat(byob): port multiple-choice loglikelihood + few-shot onto shared metric contract#955
Draft
wprazuch wants to merge 3 commits into
Draft
Conversation
Expose MetricInput -> MetricResult types and adapt decorated scorers via to_metric() so Evaluator OSS scorers can share a runtime contract with platform integrations while preserving BYOB scorer compatibility.
Demonstrates that non-trivial benchmark machinery — multi-score scorers with per-row solver-emitted payloads — fits the shared metric contract (#950) without modifying any protocol type. Functionality ported from kanishks-23#1 (V1 layout) to V2 layout against schapman/feat/shared-metric-contract. Protocol-fit proof: * MetricInput, MetricResult, MetricDescriptor, MetricOutputSpec untouched. * multiple_choice_acc opts into @scorer(metric_type=..., outputs=...) like any typed scorer; the choices/logprobs payload flows through MetricInput.candidate.metadata, exactly the slot the contract designates. * validate_metric_result enforces declared-vs-emitted at runtime. Functional port (Tier 2): * LogprobRankingSolver: ranks candidate continuations via /completions with max_tokens=0, echo=true, logprobs=1; parses continuation spans via text_offset; concurrent per-choice calls. * @benchmark extensions: choices, choices_field (dotted-path), num_fewshot, fewshot_split, fewshot_template, fewshot_separator, fewshot_seed. * ByobEnvironment.seed() renders few-shot prefix and populates _mc_choices. * _load_hf URI parsing: path-segment configs (hf://ns/name/cfg[/split]) and row filters (?filter_field=...&filter_value=...) — required for Sovereign-style multilingual datasets like CohereForAI/Global-MMLU-Lite/en. * Eval loop forwards solve_result.scoring_details to env.verify kwargs; ByobEnvironment.verify lifts _mc_*/_solver_* namespaced keys onto MetricInput.candidate.metadata. * ScorerFunctionMetric translator merges candidate.metadata into legacy ScorerInput.metadata so legacy scorers see solver-emitted payloads. Tests: 85 new tests across multiple_choice, BYOB MC integration, dataset URI parsing, and the logprob solver — all green alongside Sandy's existing 25 contract tests. Stacked on schapman/feat/shared-metric-contract; should not merge until #950 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…ons server End-to-end script that spins up an aiohttp /v1/completions returning OpenAI-shape responses with deterministic logprobs (gold continuation gets the highest score), then runs the full pipeline: ByobEnvironment.seed → LogprobRankingSolver.solve (real HTTP) → ByobEnvironment.verify (with merged seed + scoring_details) → multiple_choice_acc.compute_scores → acc=1.0 assertion This validates the wire format the solver emits (max_tokens=0, echo=true, logprobs=1), the text_offset-based continuation parsing, concurrent per-choice ranking + argmax selection, and the verify-meta plumbing that lifts solver-emitted _mc_* keys onto MetricInput.candidate.metadata. Run: python scripts/smoketest_logprob_solver.py Real-model testing (a vLLM-served model on SLURM) is a follow-up via nel-assistant or local-vllm-eval skills; this script proves the wire format and end-to-end orchestration work without a model server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Wojciech Prazuch <wprazuch@nvidia.com>
62efcfa to
2f6b8d9
Compare
8e20f8e to
3ef469f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ports the multiple-choice loglikelihood + few-shot functionality from kanishks-23/Evaluator#1 onto Sandy's #950 shared-metric-contract branch.
Stacked on
schapman/feat/shared-metric-contract— should not merge until #950 merges. Targeting Sandy's branch so the diff is purely the integration work, not the contract types.Why this PR exists
This is the integration proof that non-trivial benchmark machinery composes with the shared metric contract without protocol changes. Multi-score scorers with per-row solver-emitted payloads (logprobs, ranking metadata) — the trickiest case the contract has to handle — drop on top with zero modifications to
MetricInput,MetricResult,MetricDescriptor, orMetricOutputSpec.What's in the diff (1554 LOC across 13 files)
Scoring (Tier 1 — protocol-fit proof)
scoring/multiple_choice.py(new):multiple_choice_accandmcq_letter_extract, both opting into@scorer(metric_type=..., outputs=[...]). Choices/logprobs flow throughMetricInput.candidate.metadata, exactly the slot the contract designates for solver-emitted payload.scoring/metric.py:ScorerFunctionMetric.compute_scoresmergescandidate.metadatainto the legacyScorerInput.metadataso legacy scorers see solver-emitted keys. (This is the C5 fix from earlier review on feat: add shared metric contract for scorer functions #950.)Solver + Environment (Tier 2 — full functional port)
solvers/logprob.py(new):LogprobRankingSolverranks candidate continuations via/completionswithmax_tokens=0, echo=true, logprobs=1. Continuation span is located viatext_offset. Per-choice calls run concurrently behindmax_concurrent_choices.environments/custom.py:BenchmarkDefinitionextensions:choices,choices_field(dotted-path:choices.text),num_fewshot,fewshot_split,fewshot_template,fewshot_separator,fewshot_seed.@benchmarkdecorator threads the new kwargs through.ByobEnvironment.seed()renders the few-shot prefix and populatesmetadata["_mc_choices"]._metric_input_from_verifylifts_mc_*/_solver_*namespaced keys ontoMetricInput.candidate.metadatarather thanrow.data.engine/eval_loop.py: forwardssolve_result.scoring_detailstoenv.verify(...)as additional kwargs, giving solvers a per-row payload channel into the scorer.environments/custom.py:_load_hf: path-segment URI parsing (hf://ns/name/config[/split]) and row filters (?filter_field=...&filter_value=...with_1/_2suffixes). Required for namespaced multilingual datasets such asCohereForAI/Global-MMLU-Lite/en?split=test.Tests (74 new tests, all green)
tests/test_scoring/test_multiple_choice.pyvalidate_metric_resultenforcement)tests/test_environments/test_byob_mc_integration.py_resolve_mc_choices,_metric_input_from_verifynamespacing, MMLU/ARC-style end-to-end, few-shot prefix, decorator wiring)tests/test_environments/test_dataset_uri_parsing.pytests/test_solvers/test_logprob_solver.pyAll 25 of Sandy's existing tests still pass. 149 tests green across the touched modules.
Protocol-fit proof — what's NOT changed
Zero modifications to:
MetricInput,MetricResult,MetricDescriptor,MetricOutputSpec,MetricOutput,MetricContinuousScore,DiscreteScore,Label,BooleanValueScorerInputfield set (stillresponse/target/metadata/config/sandbox— the new fields stay onMetricInput.candidate.metadata)@scorer(metric_type=..., outputs=...)signatureMapping from Kanishk's V1 PR to this V2 port
packages/nemo-evaluator/contrib/byob/)src/nemo_evaluator/)decorators.py:ScorerInputextra fieldsMetricInput.candidate.metadatavia the_mc_*namespace.eval_logic.py:MultipleChoiceStrategyLogprobRankingSolver; per-row scoring stays inByobEnvironment.verify.runner.py:call_model_loglikelihood+ parsersolvers/logprob.py:_score_continuation+_parse_loglikelihood_response.scorers.py:multiple_choice_accscoring/multiple_choice.py:multiple_choice_acc— typed via@scorer(outputs=...).scorers.py:mcq_letter_extractscoring/multiple_choice.py:mcq_letter_extract— typed.decorators.pyfew-shot fields onBenchmarkDefinitionenvironments/custom.py:BenchmarkDefinitioneval_logic.py:build_fewshot_prefixByobEnvironment._fewshot_prefixdataset.pyHF URI parsingenvironments/custom.py:_load_hf(enriched in-place)Test plan
multiple_choicetests pass — protocol-fit proofScorerFunctionMetric.compute_scorestranslator merge (C5 fix) andeval_loop.pysolver-payload-forward addition🤖 Generated with Claude Code