feat(models): DoublyRobustModel for sparse benchmark correction (#40) by pranilraichura · Pull Request #48 · aims-foundations/torch_measure

pranilraichura · 2026-06-03T20:09:16Z

Summary

Implements DoublyRobustModel for issue #40. The model wraps a fitted base IRT model and learns an additive correction layer, trained with an inverse-propensity-weighted (IPW) loss, to correct for missing-not-at-random (MNAR) bias in sparse benchmark matrices.

Background: in the Fantastic Bugs setting, not every LLM is evaluated on every benchmark task. When missingness is informative (frontier models skip easy benchmarks, cheap models skip expensive ones), naive IRT fits are biased. The DR model corrects for this at training time.
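To make the bias concrete, here is a small deterministic toy example. The numbers are hypothetical and are not taken from the PR or from the Fantastic Bugs data. It compares the naive mean over observed outcomes with a Horvitz-Thompson inverse-propensity-weighted mean for a single model that is rarely evaluated on easy tasks.

```python
import numpy as np

# Hypothetical toy data: one frontier model, 8 tasks (4 easy, then 4 hard).
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)  # true outcomes on all tasks
propensity = np.array([0.25] * 4 + [1.0] * 4)         # assumed P(task is evaluated)
observed = np.array([1, 0, 0, 0, 1, 1, 1, 1], dtype=bool)  # 1 of 4 easy, 4 of 4 hard

# Ground truth, which is unavailable in practice because of the missing cells.
true_mean = y.mean()

# Naive estimate: average only the observed outcomes.
naive_mean = y[observed].mean()

# Horvitz-Thompson estimate: up-weight each observed outcome by 1 / propensity
# and divide by the total number of tasks, observed or not.
ipw_mean = (observed * y / propensity).sum() / y.size

print(true_mean, naive_mean, ipw_mean)
```

With these numbers the naive mean understates the model's accuracy because the easy tasks it would pass are mostly missing, while the weighted mean recovers the full-data value.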

Design

final_prediction(i, j) = clamp( base_model(i, j) + correction(i, j) )

DoublyRobustModel(base_model) freezes the base model's parameters
Adds correction_ability and correction_difficulty (residual Rasch layer, initialized to zero so predictions start identical to base)
fit() estimates propensity scores via logistic regression on the observation pattern, then trains the correction via mle_fit with IPW-weighted Bernoulli loss
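predict() returns clamp(base_pred + sigmoid(alpha_i - beta_j) - 0.5), where alpha_i is correction_ability and beta_j is correction_difficulty; subtracting 0.5 makes the correction exactly zero at initialization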
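The prediction rule above can be sketched in a few lines. This is a minimal NumPy illustration, not the PyTorch code in doubly_robust.py: the function name dr_predict and the clamp margin eps are hypothetical, and the clamp bounds are an assumption, since the description does not state them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dr_predict(base_pred, alpha, beta, eps=1e-6):
    """Combine frozen base predictions with a residual Rasch correction.

    base_pred: (n_models, n_items) probabilities from the frozen base model.
    alpha:     (n_models,) correction abilities.
    beta:      (n_items,) correction difficulties.
    eps:       hypothetical clamp margin; the PR does not state the bounds.
    """
    # sigmoid(0) - 0.5 == 0, so zero-initialized parameters add no correction.
    correction = sigmoid(alpha[:, None] - beta[None, :]) - 0.5
    return np.clip(base_pred + correction, eps, 1.0 - eps)

base_pred = np.array([[0.9, 0.4], [0.6, 0.2]])
zeros = np.zeros(2)

# At initialization the output matches the base model.
start = dr_predict(base_pred, zeros, zeros)

# Raising the first model's correction ability shifts only its row upward.
shifted = dr_predict(base_pred, np.array([1.0, 0.0]), zeros)
```

Because the correction lies in (-0.5, 0.5), it can move a base prediction by at most half the probability range before the clamp applies.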

Files

src/torch_measure/models/doubly_robust.py
src/torch_measure/models/__init__.py
tests/test_models/test_doubly_robust.py (11 tests)
tutorials/doubly_robust_sparse_benchmarks.ipynb

Questions for Sang

Currently only method='mle' is supported in fit() -- happy to wire up others
The IPW weighting is Horvitz-Thompson; full AIPW would need the outcome model term in the loss too -- worth discussing if that's in scope
Went with two-stage (fit base, freeze, fit correction) -- let me know if you had joint end-to-end training in mind instead
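For reference in the discussion of the second question, the Horvitz-Thompson style weighting can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the loss passed to mle_fit: the function name ipw_bernoulli_loss and the min_propensity floor are hypothetical, and the propensities are taken as given rather than estimated by logistic regression.

```python
import numpy as np

def ipw_bernoulli_loss(pred, y, observed, propensity, min_propensity=0.05):
    """Inverse-propensity-weighted Bernoulli negative log-likelihood.

    pred:       (n, m) predicted success probabilities, strictly inside (0, 1).
    y:          (n, m) binary outcomes; unobserved cells may hold NaN.
    observed:   (n, m) boolean mask, True where the cell was evaluated.
    propensity: (n, m) estimated P(observed); assumed given in this sketch.
    min_propensity: hypothetical floor that keeps the weights bounded.
    """
    # Observed cells get weight 1 / propensity; unobserved cells get weight 0.
    weights = observed / np.maximum(propensity, min_propensity)
    nll = -(y * np.log(pred) + (1.0 - y) * np.log(1.0 - pred))
    # Zero out unobserved cells so NaN outcomes cannot leak into the sum.
    nll = np.where(observed, nll, 0.0)
    # Horvitz-Thompson form: normalize by the total cell count,
    # not by the number of observed cells or the sum of weights.
    return float((weights * nll).sum() / observed.size)

pred = np.full((2, 2), 0.5)
y = np.array([[1.0, 0.0], [1.0, np.nan]])
observed = np.array([[True, True], [True, False]])
propensity = np.array([[1.0, 1.0], [0.5, 0.5]])

loss = ipw_bernoulli_loss(pred, y, observed, propensity)
```

In this toy case the single observed cell with propensity 0.5 counts twice, standing in for the missing cell in its row, so the loss equals the full-data value log(2). A full AIPW objective would additionally include an outcome-model term for every cell, which is the gap the second question raises.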

Test plan

pytest tests/test_models/test_doubly_robust.py -v
Verify DoublyRobustModel importable from torch_measure.models
Run tutorial notebook end-to-end

pranilraichura added 6 commits May 30, 2026 12:14

50424dd fix(tutorials): execute GSM8K notebook with outputs embedded for GitHub rendering
f577b45 fix(tutorials): clean GSM8K notebook outputs for GitHub renderer
a15ed78 feat(models): add DoublyRobustModel for sparse benchmark correction (aims-foundations#40)
91ebe92 fix(lint): sort imports and remove unused numpy in doubly robust
dff0bd3 style: ruff format doubly robust files
0dc0a8e docs(tutorials): clean up DoublyRobustModel notebook with outputs

pranilraichura wants to merge 6 commits into aims-foundations:main from pranilraichura:feat/doubly-robust-model
