Qwen 3.5 is absent from the leaderboard (making the benchmark embarrassingly selective rather than objective).
The leaderboard currently includes Qwen3 variants (4B, 30B, 235B) but is missing the Qwen 3.5 family, which has been publicly available since February 16, 2026 — predating the benchmark's March 2026 publication.
This is a meaningful gap rather than a routine missing-model request:
- Qwen3.5-9B scores 81.7 on GPQA Diamond, the first sub-30B model to break 80 on that benchmark, beating GPT-OSS-120B (80.1) at a fraction of the size
- On Tau2-Bench (the closest public proxy for agentic tool use), the flagship sits at 86.7 — second only to Claude Opus 4.6 (91.6) and ahead of every other model currently on your leaderboard
- The 9B model outperforms GPT-OSS-120B on MMMLU multilingual (81.2 vs 78.2)
- The small-model efficiency story it tells would directly stress-test the oracle-vs-no-oracle gap your paper identifies as the core bottleneck
Because the leaderboard is clearly being actively maintained (Nemotron 3 Super was added recently), the absence of Qwen 3.5 is noticeable. A benchmark that omits the current open-source agentic frontier risks understating how much the field has moved since the paper's evaluation snapshot.
This is actually embarrassing: both the paper and the leaderboard currently read as misleading, selective advocacy rather than objective science.
Request: please evaluate and add at least the flagship (Qwen3.5-397B-A17B) and one mid-size variant (e.g. Qwen3.5-35B-A3B or Qwen3.5-72B) so the leaderboard reflects the current state of open-source models.