Recent work on general scales for AI evaluation introduces assessors: external models that predict whether a subject AI system will succeed on a particular task instance, without running the subject model on that instance. In the Nature paper “General scales unlock AI evaluation with explanatory and predictive power,” the strongest lightweight assessor uses an item-level demand vector as input and predicts the probability of success for a fixed subject model. It would be useful to add a PyTorch-native demand-based assessor that can predict response probabilities from structured item features, such as cognitive demand annotations, benchmark metadata, or other task-level descriptors.
Goal
Implement a model that predicts:
P(response = 1 | subject_idx, item_features)
where:
- subject_idx identifies the model/system being evaluated
- item_features contains demand annotations or other structured item-level features
- the output is a calibrated probability of success for each subject-item query
This should support the use case where each subject model has its own response pattern, but the predictor generalizes across items through interpretable item features.
To address this issue, we should add a new model under torch_measure.models, such as:
DemandAssessor(
    n_subjects: int,
    item_feature_dim: int,
    subject_embedding_dim: int = 16,
    hidden_dim: int = 128,
    n_layers: int = 2,
    dropout: float = 0.0,
    device: str = "cpu",
)
The model should inherit from the existing Predictor abstraction and implement:
predict(query: dict[str, torch.Tensor]) -> torch.Tensor
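As a starting point for discussion, here is a minimal sketch of how the proposed constructor and `predict` method could fit together. It is not the final design. The class name `DemandAssessorSketch`, the `logits` helper, the MLP-over-concatenated-inputs architecture, and the query keys `"subject_idx"` and `"item_features"` are all assumptions made for illustration. The sketch subclasses plain `torch.nn.Module` because the interface of the existing Predictor abstraction is not shown in this issue; a real implementation would inherit from Predictor instead.

```python
import torch
from torch import nn


class DemandAssessorSketch(nn.Module):
    """Hypothetical stand-in for the proposed DemandAssessor.

    Assumption: subclasses nn.Module rather than torch_measure's Predictor,
    whose interface is not reproduced here.
    """

    def __init__(
        self,
        n_subjects: int,
        item_feature_dim: int,
        subject_embedding_dim: int = 16,
        hidden_dim: int = 128,
        n_layers: int = 2,
        dropout: float = 0.0,
        device: str = "cpu",
    ) -> None:
        super().__init__()
        # One learned vector per subject captures its own response pattern.
        self.subject_embedding = nn.Embedding(n_subjects, subject_embedding_dim)

        # MLP over [subject embedding, item features]; n_layers hidden layers.
        layers: list[nn.Module] = []
        in_dim = subject_embedding_dim + item_feature_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout)]
            in_dim = hidden_dim
        layers.append(nn.Linear(in_dim, 1))  # single logit per query
        self.mlp = nn.Sequential(*layers)
        self.to(device)

    def logits(self, query: dict[str, torch.Tensor]) -> torch.Tensor:
        # Assumed query keys: "subject_idx" (batch,) and "item_features" (batch, dim).
        subject = self.subject_embedding(query["subject_idx"].long())
        features = query["item_features"].float()
        return self.mlp(torch.cat([subject, features], dim=-1)).squeeze(-1)

    def predict(self, query: dict[str, torch.Tensor]) -> torch.Tensor:
        # Returns P(response = 1 | subject_idx, item_features), shape (batch,).
        return torch.sigmoid(self.logits(query))


# Usage with made-up sizes: 3 subjects, 5 demand features per item.
model = DemandAssessorSketch(n_subjects=3, item_feature_dim=5)
model.eval()
query = {
    "subject_idx": torch.tensor([0, 2]),
    "item_features": torch.rand(2, 5),
}
with torch.no_grad():
    probs = model.predict(query)
```

Exposing `logits` separately would let training use `torch.nn.functional.binary_cross_entropy_with_logits`, which is numerically more stable than applying a loss to sigmoid outputs. Note that a sigmoid output is a valid probability but is not guaranteed to be calibrated, so calibration would still need to be measured on held-out responses.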