Dear Authors,
Thank you so much for your benchmark. I am testing my Qwen3.5-35B-A3B with Opus-4.5 as the judge. I ran the benchmark 3 times, and the mean scores were 65.71%, 69.13%, and 74.26%, respectively. The token usage also varies between runs, from about 1.52M to 1.66M total tokens. Is this level of variation considered normal, and are there any methods to reduce the variance? I am also curious how you calculate the AVG score on your leaderboard.
Here are the logs of my run:
2026-04-03 19:14:12,124 - INFO - Total tokens used: 1,662,453 (input: 1,625,709, output: 36,744)
2026-04-03 19:14:12,124 - INFO - Total API requests: 115
2026-04-03 19:14:12,124 - INFO - Avg tokens/task: 69,269
2026-04-03 19:14:12,124 - INFO - Mean score: 0.6571
2026-04-03 19:14:12,124 - INFO - Score per 1K tokens: 0.0095 (higher = more efficient)
2026-04-03 20:29:21,516 - INFO - Total tokens used: 1,549,510 (input: 1,514,401, output: 35,109)
2026-04-03 20:29:21,516 - INFO - Total API requests: 100
2026-04-03 20:29:21,516 - INFO - Avg tokens/task: 64,563
2026-04-03 20:29:21,516 - INFO - Mean score: 0.6913
2026-04-03 20:29:21,521 - INFO - Score per 1K tokens: 0.0107 (higher = more efficient)
2026-04-03 21:02:03,418 - INFO - Total tokens used: 1,524,989 (input: 1,497,458, output: 27,531)
2026-04-03 21:02:03,418 - INFO - Total API requests: 104
2026-04-03 21:02:03,418 - INFO - Avg tokens/task: 63,541
2026-04-03 21:02:03,418 - INFO - Mean score: 0.7426
2026-04-03 21:02:03,418 - INFO - Score per 1K tokens: 0.0117 (higher = more efficient)
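To put a number on the spread, here is a minimal sketch that summarizes the three runs. It uses only the values from the logs above; the variable names are my own, and I am assuming the sample standard deviation is a reasonable measure for just three runs.

```python
import statistics

# Values copied from the three log excerpts above.
mean_scores = [0.6571, 0.6913, 0.7426]
avg_tokens_per_task = [69_269, 64_563, 63_541]

# Spread of the mean score across runs.
score_mean = statistics.mean(mean_scores)
score_stdev = statistics.stdev(mean_scores)  # sample standard deviation (n - 1)
score_range = max(mean_scores) - min(mean_scores)

# Spread of the token usage across runs.
tokens_mean = statistics.mean(avg_tokens_per_task)
tokens_stdev = statistics.stdev(avg_tokens_per_task)

print(f"score: mean={score_mean:.4f} stdev={score_stdev:.4f} range={score_range:.4f}")
print(f"tokens/task: mean={tokens_mean:.0f} stdev={tokens_stdev:.0f}")
```

By this calculation the mean score across the runs is about 0.697, with a sample standard deviation of about 0.043 and a gap of 0.0855 between the best and worst run.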
Thank you!