Score variance of different runs

Dear Authors,

Thank you so much for your benchmark. I am testing Qwen3.5-35B-A3B with Opus-4.5 as the judge. I ran the benchmark three times, and the mean scores were 65.71%, 69.13%, and 74.26%, respectively. The token usage also varies between runs. Is this level of variance considered normal, and are there any methods to reduce it? I am also curious how you calculate the AVG score on your leaderboard.

Here are the summary logs from the three runs:

2026-04-03 19:14:12,124 - INFO -    Total tokens used: 1,662,453 (input: 1,625,709, output: 36,744)
2026-04-03 19:14:12,124 - INFO -    Total API requests: 115
2026-04-03 19:14:12,124 - INFO -    Avg tokens/task: 69,269
2026-04-03 19:14:12,124 - INFO -    Mean score: 0.6571
2026-04-03 19:14:12,124 - INFO -    Score per 1K tokens: 0.0095 (higher = more efficient)

2026-04-03 20:29:21,516 - INFO -    Total tokens used: 1,549,510 (input: 1,514,401, output: 35,109)
2026-04-03 20:29:21,516 - INFO -    Total API requests: 100
2026-04-03 20:29:21,516 - INFO -    Avg tokens/task: 64,563
2026-04-03 20:29:21,516 - INFO -    Mean score: 0.6913
2026-04-03 20:29:21,521 - INFO -    Score per 1K tokens: 0.0107 (higher = more efficient)

2026-04-03 21:02:03,418 - INFO -    Total tokens used: 1,524,989 (input: 1,497,458, output: 27,531)
2026-04-03 21:02:03,418 - INFO -    Total API requests: 104
2026-04-03 21:02:03,418 - INFO -    Avg tokens/task: 63,541
2026-04-03 21:02:03,418 - INFO -    Mean score: 0.7426
2026-04-03 21:02:03,418 - INFO -    Score per 1K tokens: 0.0117 (higher = more efficient)
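For reference, here is a rough sketch of how the spread across the three runs can be summarized. It only uses the mean scores from the logs above. The variable names are my own, and this is not necessarily how the leaderboard AVG is computed.

```python
import statistics

# Mean scores copied from the three run logs above.
run_scores = [0.6571, 0.6913, 0.7426]

# Average of the per-run mean scores.
mean_score = statistics.mean(run_scores)

# Sample standard deviation (n - 1 denominator), since three runs is a small sample.
sample_std = statistics.stdev(run_scores)

# Gap between the best and worst run.
score_range = max(run_scores) - min(run_scores)

print(f"mean={mean_score:.4f} std={sample_std:.4f} range={score_range:.4f}")
```

The gap between the lowest and highest run is about 8.5 percentage points, which is what prompted the question.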

Thank you!


Score variance of different runs #103
