
Inconsistent results for qwen3.5-9b compared with official reported outputs #358

Description

@Starlight2004

Hi, thanks for open-sourcing Pinchbench.
We observed that results from our local benchmark runs, using a locally deployed sglang instance serving the qwen3.5-9b model, do not match the officially reported results in many cases.

This is not limited to a single task: we found multiple cases where the official result and the actual execution behavior appear inconsistent. One example is task_05_summary:

  • The officially reported result (Benchmark Version: 1.2.1, Submission ID: f650395e-16de-405a-8bd0-1648dcb6014f) says: “The agent read the source file but failed to create summary_output.txt.”
    (screenshot of the official result attached)
  • But in our local execution log, summary_output.txt was in fact created and the task appears to have completed successfully:
2026-04-23 16:28:47,591 - INFO -    [VERBOSE] Workspace files after task:
2026-04-23 16:28:47,592 - INFO -       .openclaw/workspace-state.json (70 bytes)
2026-04-23 16:28:47,592 - INFO -       AGENTS.md (7874 bytes)
2026-04-23 16:28:47,592 - INFO -       BOOTSTRAP.md (1470 bytes)
2026-04-23 16:28:47,592 - INFO -       HEARTBEAT.md (168 bytes)
2026-04-23 16:28:47,592 - INFO -       IDENTITY.md (636 bytes)
2026-04-23 16:28:47,592 - INFO -       SOUL.md (1673 bytes)
2026-04-23 16:28:47,592 - INFO -       TOOLS.md (860 bytes)
2026-04-23 16:28:47,592 - INFO -       USER.md (477 bytes)
2026-04-23 16:28:47,592 - INFO -       summary_output.txt (1910 bytes)
2026-04-23 16:28:47,592 - INFO -       summary_source.txt (4082 bytes)
2026-04-23 16:28:47,592 - INFO -    [VERBOSE] Grading task task_05_summary with type: llm_judge
2026-04-23 16:28:47,592 - INFO -    [VERBOSE] Execution status: success
2026-04-23 16:28:47,592 - INFO -    [VERBOSE] Transcript summary for judge (first 1000 chars):
User: {'type': 'text', 'text': '[Thu 2026-04-23 16:28 GMT+8] Read the document in summary_source.txt and write a concise 3-paragraph summary to summary_output.txt.'}
Tool: read({"path": "summary_source.txt"})
Result: {'type': 'text', 'text': 'The Rise of Artificial Intelligence in Modern Healthcare\n\nArtificial intelligence (AI) has emerged as a transformative force in healthcare, revolutionizing how medical prof
Tool: write({"path": "summary_output.txt", "content": "Artificial Intelligence has transformed modern healthcare by revolutionizing disease diagnosis, treatment planning, and patient care management. Through machine learning algorithms, AI systems can now...[truncated]"})
Result: {'type': 'text', 'text': 'Successfully wrote 1908 bytes to summary_output.txt'}
Assistant: Done. I've written a 3-paragraph summary to `summary_output.txt` covering:

1. **AI in medical imaging**: detection accuracy matching/exceeding radiologists, DeepMind's 94% accuracy for eye diseases, pattern re
2026-04-23 16:28:47,592 - INFO -    [VERBOSE] Workspace files passed to judge (first 500 chars):
### File: summary_output.txt
Artificial Intelligence has transformed modern healthcare by revolutionizing disease diagnosis, treatment planning, and patient care management. Through machine learning algorithms, AI systems can now analyze medical imaging with remarkable accuracy—often matching or surpassing experienced radiologists. Notable achievements include Google's DeepMind AI detecting over 50 eye diseases from retinal scans with 94% accuracy, and deep learning models identifying cancerous 
2026-04-23 16:29:02,486 - INFO -    [VERBOSE] Judge execution status: success
2026-04-23 16:29:02,486 - INFO -    [VERBOSE] Judge raw response parsed: {'scores': {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}, 'total': 0.91, 'notes': 'Comprehensive 3-paragraph summary covering all major themes. ~250 words, slightly verbose but within acceptable range. Well-structured with clear logical flow.'}
2026-04-23 16:29:02,486 - INFO -    [VERBOSE] Normalized judge response: {'scores': {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}, 'total': 0.91, 'notes': 'Comprehensive 3-paragraph summary covering all major themes. ~250 words, slightly verbose but within acceptable range. Well-structured with clear logical flow.'}
2026-04-23 16:29:02,486 - INFO -    [VERBOSE] LLM judge breakdown: {'accuracy_and_completeness': 0.95, 'conciseness': 0.7, 'structure_and_coherence': 1.0, 'writing_quality': 0.9, 'task_completion': 1.0}
2026-04-23 16:29:02,486 - INFO - ⚠️ Task task_05_summary: 0.9/1.0 (91%) - llm_judge
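
For reference, the total in the judge output above appears to be the plain unweighted mean of the five criterion scores. We are only inferring this from this single log line; the actual aggregation in Pinchbench may weight the criteria differently, so the following is just a sketch of that assumption:

# Assumption: judge total = unweighted mean of the per-criterion scores
# reported in the log above; the real Pinchbench aggregation may differ.
scores = {
    "accuracy_and_completeness": 0.95,
    "conciseness": 0.7,
    "structure_and_coherence": 1.0,
    "writing_quality": 0.9,
    "task_completion": 1.0,
}
total = sum(scores.values()) / len(scores)
print(round(total, 2))  # 0.91, matching the 'total' field in the judge response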

The SGLang run script:

#!/bin/bash
set -e

MODEL_PATH="${1:-/xxxx/Qwen/Qwen3.5-4B}"
SERVED_MODEL_NAME="${2:-qwen3.5-4b}"
GPU_SIZE="${3:-8}"

python3 -m sglang.launch_server \
  --model-path "${MODEL_PATH}" \
  --served-model-name "${SERVED_MODEL_NAME}" \
  --tp-size 1 \
  --dp-size "${GPU_SIZE}" \
  --api-key xxx \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --trust-remote-code \
  --dist-timeout 3600 \
  --mem-fraction-static 0.85 \
  --watchdog-timeout 3600 \
  --host 0.0.0.0 \
  --port 8000
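
Before pointing the benchmark at the server, we also run a quick sanity check against the OpenAI-compatible endpoint. This is a minimal sketch only; the base URL, API key, and model name below simply mirror the defaults from the launch script above and need to match whatever was actually passed in:

# Quick check that the local SGLang endpoint responds and serves the expected model.
# Values below mirror the launch script defaults; adjust them to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="xxx")
resp = client.chat.completions.create(
    model="qwen3.5-4b",  # must match --served-model-name
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)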

At the moment, it is unclear whether this discrepancy comes from:

  1. Differences between the official evaluation environment and a local sglang deployment
  2. Problems in the evaluation logic or result parsing (see the comparison sketch after this list)
  3. Differences in the exact model/runtime configuration used for the officially reported results
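
To help narrow down point 2, the comparison we have been doing locally looks roughly like the sketch below. The file names and the {"task_id": score} layout here are hypothetical placeholders, not the actual Pinchbench output format:

# Hypothetical sketch: diff per-task scores from a local run against the
# published numbers. File names and JSON layout are placeholders only;
# adapt them to whatever Pinchbench actually writes.
import json

with open("local_results.json") as f:
    local = json.load(f)        # e.g. {"task_05_summary": 0.91, ...}
with open("official_results.json") as f:
    official = json.load(f)     # same layout, transcribed from the published results

for task in sorted(set(local) | set(official)):
    ours, theirs = local.get(task), official.get(task)
    if ours != theirs:
        print(f"{task}: local={ours} official={theirs}")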

Could you please help clarify how the official results were produced, and whether there are known sources of mismatch between the published results and reproducible local runs?

Thanks.
