Leaderboard missing Qwen 3.5 — an embarrassing omission given its agentic performance

## Qwen 3.5 is absent from the leaderboard (making the benchmark embarrassingly selective rather than objective)

The leaderboard currently includes Qwen3 variants (4B, 30B, 235B) but is missing the **Qwen 3.5** family, which has been publicly available since **February 16, 2026** — predating the benchmark's March 2026 publication.

This is a meaningful gap rather than a routine missing-model request:

- Qwen3.5-9B scores **81.7 on GPQA Diamond**, the first sub-30B model to break 80 on that benchmark, beating GPT-OSS-120B (80.1) at a fraction of the size
- On **Tau2-Bench** (the closest public proxy for agentic tool use), the flagship sits at **86.7** — second only to Claude Opus 4.6 (91.6) and ahead of every other model currently on your leaderboard
- The 9B model outperforms GPT-OSS-120B on MMMLU multilingual (81.2 vs 78.2)
- Its small-model efficiency would directly stress-test the oracle-vs-no-oracle gap your paper identifies as the core bottleneck

Because the leaderboard is clearly being actively maintained (Nemotron 3 Super was added recently), the absence of Qwen 3.5 is noticeable. A benchmark that omits the current open-source agentic frontier risks understating how much the field has moved since the paper's evaluation snapshot. 

This is actually **embarrassing**: both the paper and the leaderboard currently read as **misleading, selective advocacy** rather than objective science.

**Request:** please evaluate and add at least the flagship (Qwen3.5-397B-A17B) and one mid-size variant (e.g. Qwen3.5-35B-A3B or Qwen3.5-72B) so the leaderboard reflects the current state of open-source models.
