Hi! Following the registering-a-benchmark docs, I'm requesting allow-list inclusion for SkillsBench.
Traction
- 91 tasks across financial analysis, code-patch, multimodal output, scientific computation, infrastructure, and other professional workflows.
- 40 co-authors on the arxiv paper (academic + industry).
- 1,109 stars / 277 forks / 226 merged PRs / 56 contributors on `benchflow-ai/skillsbench` at time of writing.
- 8 contributor namespaces with multi-experiment trial bundles in `benchflow-ai/skillsbench-trajectories`.
What's already in place
`eval.yaml` drafted and validated against the spec — will be pushed to the dataset root the moment the framework PR (huggingface/huggingface.js#2139) merges. Preview:
```yaml
name: SkillsBench
description: >
SkillsBench measures how well AI agents leverage Skills — structured
packages of procedural knowledge — to complete realistic professional
workflows across many domains. The headline metric is the with-skills
vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
- id: skillsbench
config: default
split: train
```
Paired leaderboard dataset already public at `benchflow/skillsbench-leaderboard` with submission protocol + example.
Followup
I'll add `.eval_results/skillsbench.yaml` to the relevant model repos as soon as the framework PR merges. Happy to provide more context on the benchmark's evaluation methodology or to wait on additional model evals before allow-list inclusion.
Hi! Following the registering-a-benchmark docs, I'm requesting allow-list inclusion for SkillsBench.
benchflow— registered in Add benchflow evaluation framework huggingface.js#2139 (pending review).Traction
What's already in place
`eval.yaml` drafted and validated against the spec — will be pushed to the dataset root the moment the framework PR (huggingface/huggingface.js#2139) merges. Preview:
```yaml
name: SkillsBench
description: >
SkillsBench measures how well AI agents leverage Skills — structured
packages of procedural knowledge — to complete realistic professional
workflows across many domains. The headline metric is the with-skills
vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
config: default
split: train
```
Paired leaderboard dataset already public at `benchflow/skillsbench-leaderboard` with submission protocol + example.
Followup
I'll add `.eval_results/skillsbench.yaml` to the relevant model repos as soon as the framework PR merges. Happy to provide more context on the benchmark's evaluation methodology or to wait on additional model evals before allow-list inclusion.