Request benchmark allow-list — benchflow/skillsbench #2456

@xdotli

Hi! Following the registering-a-benchmark docs, I'm requesting allow-list inclusion for SkillsBench.

Traction

  • 91 tasks across financial analysis, code-patch, multimodal output, scientific computation, infrastructure, and other professional workflows.
  • 40 co-authors on the arXiv paper (academic + industry).
  • 1,109 stars / 277 forks / 226 merged PRs / 56 contributors on `benchflow-ai/skillsbench` at time of writing.
  • 8 contributor namespaces with multi-experiment trial bundles in `benchflow-ai/skillsbench-trajectories`.

What's already in place

`eval.yaml` is drafted and validated against the spec. I'll push it to the dataset root as soon as the framework PR (huggingface/huggingface.js#2139) merges. Preview:

```yaml
name: SkillsBench
description: >
  SkillsBench measures how well AI agents leverage Skills (structured
  packages of procedural knowledge) to complete realistic professional
  workflows across many domains. The headline metric is the with-skills
  vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
  - id: skillsbench
    config: default
    split: train
```
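For reviewers who want to sanity-check the shape of the preview above, here is a minimal sketch in Python. It is not the official validator: the required-key lists are an assumption taken only from the fields visible in the preview, and `check_eval_config` is a hypothetical helper name, not part of any library.

```python
# Hypothetical shape check for the eval.yaml preview.
# ASSUMPTION: the key lists below mirror the fields shown in the preview,
# not the official eval.yaml spec.
REQUIRED_TOP_LEVEL = ("name", "description", "evaluation_framework", "tasks")
REQUIRED_TASK_KEYS = ("id", "config", "split")


def check_eval_config(cfg: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in cfg:
            problems.append(f"missing top-level key: {key}")

    if "tasks" in cfg:
        tasks = cfg["tasks"]
        if not isinstance(tasks, list) or not tasks:
            problems.append("tasks must be a non-empty list")
        else:
            for i, task in enumerate(tasks):
                if not isinstance(task, dict):
                    problems.append(f"tasks[{i}] must be a mapping")
                    continue
                for key in REQUIRED_TASK_KEYS:
                    if key not in task:
                        problems.append(f"tasks[{i}] missing key: {key}")
    return problems


# The preview, written out as the dict a YAML parser would produce
# (description shortened here for brevity).
preview = {
    "name": "SkillsBench",
    "description": "SkillsBench measures how well AI agents leverage Skills.",
    "evaluation_framework": "benchflow",
    "tasks": [{"id": "skillsbench", "config": "default", "split": "train"}],
}

problems = check_eval_config(preview)
print("OK" if not problems else "\n".join(problems))
```

The dict is hand-written so the sketch has no third-party dependencies. With a YAML parser available, the same function could be run on the parsed file instead.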

The paired leaderboard dataset is already public at `benchflow/skillsbench-leaderboard`, with a submission protocol and an example.

Followup

I'll add `.eval_results/skillsbench.yaml` to the relevant model repos as soon as the framework PR merges. Happy to provide more context on the benchmark's evaluation methodology or to wait on additional model evals before allow-list inclusion.
