Hi, thanks for open-sourcing Pinchbench.
We observed that results from our local benchmark runs, using a locally deployed SGLang instance with the qwen3.5-9b model, do not match many of the officially reported results.
This is not limited to a single task: we found multiple cases where the official result and the actual execution behavior appear inconsistent. One example is task_05_summary. The verbose log from our local run:
2026-04-23 16:28:47,591 - INFO - [VERBOSE] Workspace files after task:
2026-04-23 16:28:47,592 - INFO - .openclaw/workspace-state.json (70 bytes)
2026-04-23 16:28:47,592 - INFO - AGENTS.md (7874 bytes)
2026-04-23 16:28:47,592 - INFO - BOOTSTRAP.md (1470 bytes)
2026-04-23 16:28:47,592 - INFO - HEARTBEAT.md (168 bytes)
2026-04-23 16:28:47,592 - INFO - IDENTITY.md (636 bytes)
2026-04-23 16:28:47,592 - INFO - SOUL.md (1673 bytes)
2026-04-23 16:28:47,592 - INFO - TOOLS.md (860 bytes)
2026-04-23 16:28:47,592 - INFO - USER.md (477 bytes)
2026-04-23 16:28:47,592 - INFO - summary_output.txt (1910 bytes)
2026-04-23 16:28:47,592 - INFO - summary_source.txt (4082 bytes)
2026-04-23 16:28:47,592 - INFO - [VERBOSE] Grading task task_05_summary with type: llm_judge
2026-04-23 16:28:47,592 - INFO - [VERBOSE] Execution status: success
2026-04-23 16:28:47,592 - INFO - [VERBOSE] Transcript summary for judge (first 1000 chars):
User: {'type': 'text', 'text': '[Thu 2026-04-23 16:28 GMT+8] Read the document in summary_source.txt and write a concise 3-paragraph summary to summary_output.txt.'}
Tool: read({"path": "summary_source.txt"})
Result: {'type': 'text', 'text': 'The Rise of Artificial Intelligence in Modern Healthcare\n\nArtificial intelligence (AI) has emerged as a transformative force in healthcare, revolutionizing how medical prof
Tool: write({"path": "summary_output.txt", "content": "Artificial Intelligence has transformed modern healthcare by revolutionizing disease diagnosis, treatment planning, and patient care management. Through machine learning algorithms, AI systems can now...[truncated]"})
Result: {'type': 'text', 'text': 'Successfully wrote 1908 bytes to summary_output.txt'}
Assistant: Done. I've written a 3-paragraph summary to `summary_output.txt` covering:
1. **AI in medical imaging**: detection accuracy matching/exceeding radiologists, DeepMind's 94% accuracy for eye diseases, pattern re
2026-04-23 16:28:47,592 - INFO - [VERBOSE] Workspace files passed to judge (first 500 chars):
### File: summary_output.txt
Artificial Intelligence has transformed modern healthcare by revolutionizing disease diagnosis, treatment planning, and patient care management. Through machine learning algorithms, AI systems can now analyze medical imaging with remarkable accuracy—often matching or surpassing experienced radiologists. Notable achievements include Google's DeepMind AI detecting over 50 eye diseases from retinal scans with 94% accuracy, and deep learning models identifying cancerous
2026-04-23 16:29:02,486 - INFO - [VERBOSE] Judge execution status: success
2026-04-23 16:29:02,486 - INFO - [VERBOSE] Judge raw response parsed: {'scores': {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}, 'total': 0.91, 'notes': 'Comprehensive 3-paragraph summary covering all major themes. ~250 words, slightly verbose but within acceptable range. Well-structured with clear logical flow.'}
2026-04-23 16:29:02,486 - INFO - [VERBOSE] Normalized judge response: {'scores': {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}, 'total': 0.91, 'notes': 'Comprehensive 3-paragraph summary covering all major themes. ~250 words, slightly verbose but within acceptable range. Well-structured with clear logical flow.'}
2026-04-23 16:29:02,486 - INFO - [VERBOSE] LLM judge breakdown: {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}
2026-04-23 16:29:02,486 - INFO - ⚠️ Task task_05_summary: 0.9/1.0 (91%) - llm_judge
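For reference, the reported total (0.91) is consistent with a plain unweighted mean of the five rubric scores. This is our guess at the grader's aggregation, not something we have confirmed in the Pinchbench source; a minimal sketch under that assumption:

```python
# Rubric scores copied from the judge breakdown in the log above.
scores = {
    "accuracy_and_completeness": 0.95,
    "conciseness": 0.7,
    "structure_and_coherence": 1.0,
    "writing_quality": 0.9,
    "task_completion": 1.0,
}

# Assumed aggregation: equal-weight average, rounded to two decimals.
total = round(sum(scores.values()) / len(scores), 2)
print(total)  # 0.91, matching the 'total' field in the judge response
```

The numbers line up exactly, so if the grader instead uses per-criterion weights, we would expect at least one official score to diverge from this simple mean.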
The SGLang run script we used:

#!/bin/bash
set -e

# Defaults can be overridden via positional arguments.
MODEL_PATH="${1:-/xxxx/Qwen/Qwen3.5-4B}"
SERVED_MODEL_NAME="${2:-qwen3.5-4b}"
GPU_SIZE="${3:-8}"

python3 -m sglang.launch_server \
  --model-path "${MODEL_PATH}" \
  --served-model-name "${SERVED_MODEL_NAME}" \
  --tp-size 1 \
  --dp-size "${GPU_SIZE}" \
  --api-key xxx \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --dist-timeout 3600 \
  --mem-fraction-static 0.85 \
  --watchdog-timeout 3600 \
  --host 0.0.0.0 \
  --port 8000
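For completeness, this is roughly how we sanity-check that the served model answers under the expected name before running the benchmark. The model name, port, and API key below simply follow the launch script's defaults; adjust them if you launch with different arguments:

```python
import json

# Minimal OpenAI-compatible chat-completions payload for the local endpoint.
# Model name matches SERVED_MODEL_NAME from the launch script's defaults.
payload = {
    "model": "qwen3.5-4b",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}
body = json.dumps(payload)

# Send it with, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer xxx" \
#     -d "$body"
print(body)
```

If this request fails or returns a different model name, benchmark results are not comparable to the official runs in the first place, which is why we rule it out before filing discrepancies.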
For this submission, the official result (Submission ID: f650395e-16de-405a-8bd0-1648dcb6014f) says: "The agent read the source file but failed to create summary_output.txt." As the log above shows, however, our local run did create summary_output.txt (1910 bytes) and the task scored 0.9/1.0. At the moment, it is unclear where this discrepancy comes from.

Could you please help clarify how the official results were produced, and whether there are known sources of mismatch between the published results and reproducible local runs?

Thanks.