Request benchmark allow-list — benchflow/skillsbench

Hi! Following the [registering-a-benchmark](https://huggingface.co/docs/hub/eval-results#registering-a-benchmark) docs, I'm requesting allow-list inclusion for **SkillsBench**.

- **Dataset:** https://huggingface.co/datasets/benchflow/skillsbench
- **Paper:** [arXiv:2602.12670](https://arxiv.org/abs/2602.12670)
- **Framework:** `benchflow` — registered in https://github.com/huggingface/huggingface.js/pull/2139 (pending review).
- **Leaderboard / trajectory submissions:** https://huggingface.co/datasets/benchflow/skillsbench-leaderboard

## Traction

- 91 tasks across financial analysis, code-patch, multimodal output, scientific computation, infrastructure, and other professional workflows.
- 40 co-authors on the arXiv paper (academic + industry).
- 1,109 stars / 277 forks / 226 merged PRs / 56 contributors on [`benchflow-ai/skillsbench`](https://github.com/benchflow-ai/skillsbench) at time of writing.
- 8 contributor namespaces with multi-experiment trial bundles in [`benchflow-ai/skillsbench-trajectories`](https://github.com/benchflow-ai/skillsbench-trajectories).

## What's already in place

`eval.yaml` is drafted and validated against the spec; it will be pushed to the dataset root as soon as the framework PR (huggingface/huggingface.js#2139) merges. Preview:

```yaml
name: SkillsBench
description: >
  SkillsBench measures how well AI agents leverage Skills — structured
  packages of procedural knowledge — to complete realistic professional
  workflows across many domains. The headline metric is the with-skills
  vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
  - id: skillsbench
    config: default
    split: train
```
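
For reviewers unfamiliar with the headline metric named in the description, here is a rough sketch of a with-skills vs. without-skills delta. It assumes the delta is the difference in mean pass rate over tasks that appear in both runs; the paper's exact definition may differ. The task IDs, outcomes, and function names are hypothetical and are not part of the `benchflow` API.

```python
from statistics import mean


def pass_rate(outcomes):
    """Fraction of passed tasks, given an iterable of booleans."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)


def skills_delta(with_skills, without_skills):
    """Pass-rate difference over tasks present in both runs.

    Both arguments map a task ID to True (passed) or False (failed).
    Restricting to shared tasks keeps the comparison paired.
    """
    shared = sorted(set(with_skills) & set(without_skills))
    if not shared:
        raise ValueError("the two runs have no tasks in common")
    rate_with = pass_rate(with_skills[t] for t in shared)
    rate_without = pass_rate(without_skills[t] for t in shared)
    return rate_with - rate_without


# Hypothetical outcomes for illustration only, not real SkillsBench results.
with_skills = {"task-a": True, "task-b": True, "task-c": False, "task-d": True}
without_skills = {"task-a": True, "task-b": False, "task-c": False, "task-d": False}

delta = skills_delta(with_skills, without_skills)
print(f"with-skills vs. without-skills delta: {delta:+.2f}")
```

A positive value here would mean the agent passed more of the shared tasks when Skills were available.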

The paired leaderboard dataset is already public at [`benchflow/skillsbench-leaderboard`](https://huggingface.co/datasets/benchflow/skillsbench-leaderboard), with a submission protocol and an example.

## Follow-up

I'll add `.eval_results/skillsbench.yaml` to the relevant model repos as soon as the framework PR merges. Happy to provide more context on the benchmark's evaluation methodology, or to wait for additional model evals before allow-list inclusion.
