
Add thinking-level benchmarking support (off/minimal/low/medium/high/xhigh/adaptive)#76

Closed
ForceConstant wants to merge 7 commits into pinchbench:main from ForceConstant:continue-thinking-levels

Conversation

@ForceConstant
Contributor

This is a rebase of #12 by @jb510

Note: I haven't been able to test this yet, but it looks valid; I'm not currently set up to run it.

OpenClaw Agent and others added 5 commits March 25, 2026 17:12
- Add --thinking CLI argument to specify comma-separated thinking levels
- Pass thinking level to OpenClaw agent via --thinking flag
- Run each task across all specified thinking levels
- Include thinking_level in task results
- Add thinking_aggregates section with per-level statistics
- Support levels: off, minimal, low, medium, high
- Update SKILL.md and README.md with documentation

Closes pinchbench#9
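The `--thinking` argument described above could be wired up roughly as follows. This is a hypothetical sketch: the function name `parse_thinking_levels` and the argparse wiring are illustrative assumptions, not necessarily the PR's actual code.

```python
import argparse

# Levels supported by this first commit (xhigh/adaptive come later in the PR).
VALID_THINKING_LEVELS = ["off", "minimal", "low", "medium", "high"]

def parse_thinking_levels(raw: str) -> list:
    """Split a comma-separated --thinking value and reject unknown levels."""
    levels = [part.strip() for part in raw.split(",") if part.strip()]
    invalid = [lvl for lvl in levels if lvl not in VALID_THINKING_LEVELS]
    if invalid:
        raise argparse.ArgumentTypeError(
            f"invalid thinking level(s): {', '.join(invalid)}"
        )
    return levels

parser = argparse.ArgumentParser()
parser.add_argument(
    "--thinking",
    type=parse_thinking_levels,
    default=["off"],
    help="comma-separated thinking levels to benchmark",
)
```

Each task would then be run once per parsed level, with `thinking_level` recorded in the result so the per-level aggregates can be computed afterwards.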
- Add xhigh and adaptive to valid thinking levels (matching OpenClaw)
- Add model-aware xhigh validation (only GPT-5.x models support it)
- Validate thinking levels before passing to OpenClaw subprocess
- Document model-specific restrictions in help text and docs
- Follow existing code style (Optional[str] instead of str | None)
- No unnecessary changes to existing code
- Add strict xhigh model matching (provider-aware)
- Add adaptive support detection (Anthropic Claude 4.6 family)
- Deduplicate requested thinking levels while preserving order
- Fail fast when --thinking is provided but no valid levels remain
- Keep subprocess input constrained to validated levels
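The validation rules in the bullets above (order-preserving dedup, model-aware `xhigh`/`adaptive` gating, fail-fast when nothing valid remains) might look something like this sketch. The model-name patterns and the function name are assumptions for illustration, not the PR's exact implementation.

```python
VALID_THINKING_LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh", "adaptive"]

def validate_thinking_levels(levels, model):
    """Return only levels valid for this model, deduplicated in request order."""
    # Deduplicate while preserving the order the user requested.
    seen = set()
    ordered = [l for l in levels if not (l in seen or seen.add(l))]

    valid = []
    for level in ordered:
        if level not in VALID_THINKING_LEVELS:
            continue
        # Per the PR: xhigh is restricted to GPT-5.x models.
        if level == "xhigh" and not (model or "").startswith("gpt-5"):
            continue
        # Per the PR: adaptive is restricted to the Anthropic Claude 4.6 family.
        if level == "adaptive" and "claude-4.6" not in (model or ""):
            continue
        valid.append(level)

    # Fail fast when --thinking was provided but nothing valid remains.
    if not valid:
        raise SystemExit("--thinking was provided but no valid levels remain")
    return valid
```

Only the returned list would ever be passed to the OpenClaw subprocess, keeping its input constrained to validated levels.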
@jb510
Contributor

jb510 commented Mar 25, 2026

That's why I didn't do the rebase: I uninstalled PinchBench, so I couldn't test :D. It was tested and working when I originally PR'd it. I'm dealing with other OpenClaw issues at the moment so I can't test; I hope someone can.

Member

@olearycrew olearycrew left a comment


@ForceConstant can I trouble you for one more rebase? We have a lot incoming with the pending v2 release this week.

@ForceConstant
Contributor Author

@olearycrew OK, I've updated the branch.

@ScuttleBot
Contributor

Superseded by #378 — a clean implementation on current main. The original had 6 conflict blocks across 3 files due to main branch evolution (parallel judging, categories, incremental results, etc.).

@ScuttleBot ScuttleBot closed this May 5, 2026
pull Bot pushed a commit to Stars1233/skill that referenced this pull request May 5, 2026
Add --thinking flag to benchmark.py to configure reasoning depth for
models that support it. Valid levels: off, minimal, low, medium, high,
xhigh, adaptive.

Changes:
- Add VALID_THINKING_LEVELS constant in lib_agent.py
- Add thinking_level parameter to execute_openclaw_task()
- Pass --thinking to openclaw agent command
- Add --thinking argument to benchmark.py with validation
- Update README command reference
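The `thinking_level` parameter on `execute_openclaw_task()` described above would presumably translate into an extra flag on the agent command. A minimal sketch, assuming the command shape (the real invocation in the PR may differ):

```python
def execute_openclaw_task(task, thinking_level=None):
    """Build the openclaw agent command, adding --thinking when requested."""
    cmd = ["openclaw", "agent", "run", task]
    if thinking_level and thinking_level != "off":
        cmd += ["--thinking", thinking_level]
    # In the real script this list would be handed to subprocess.run(...)
    return cmd
```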

This is a clean rebase of PR pinchbench#76 (originally from #12 by @jb510)
against the current main branch which has significantly evolved.

Co-authored-by: ForceConstant <ForceConstant@users.noreply.github.com>
Co-authored-by: jb510 <jb510@users.noreply.github.com>
@ScuttleBot ScuttleBot mentioned this pull request May 6, 2026
