feat: Add thinking-level benchmarking support by ScuttleBot · Pull Request #378 · pinchbench/skill

ScuttleBot · 2026-05-05T16:29:22Z

Clean rebase of #76 (originally from #12 by @jb510) against current main.

Summary

Adds a --thinking flag to configure reasoning depth for models that support it.

Changes

Add VALID_THINKING_LEVELS constant: off, minimal, low, medium, high, xhigh, adaptive
Add thinking_level parameter to execute_openclaw_task()
Pass --thinking to openclaw agent command
Add --thinking argument to benchmark.py with validation
Update README command reference
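
The CLI side of these changes can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `build_parser` and `parse_args` helpers, the `--model` argument wiring, and the choice of an explicit membership check (rather than argparse `choices=`) are assumptions. Only the `--thinking` flag name, the `VALID_THINKING_LEVELS` name, and the seven level values come from the description above.

```python
import argparse
from typing import List, Optional

# Level names as listed in the PR description. The PR places this constant
# in scripts/lib_agent.py; it is redefined here so the sketch is self-contained.
VALID_THINKING_LEVELS = ("off", "minimal", "low", "medium", "high", "xhigh", "adaptive")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real benchmark.py defines many more arguments.
    parser = argparse.ArgumentParser(description="benchmark runner (sketch)")
    parser.add_argument("--model", required=True, help="model identifier")
    parser.add_argument(
        "--thinking",
        default=None,
        help="reasoning depth, one of: " + ", ".join(VALID_THINKING_LEVELS),
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Validate after parsing so the error message can list the allowed levels.
    # parser.error() prints usage to stderr and exits with status 2.
    if args.thinking is not None and args.thinking not in VALID_THINKING_LEVELS:
        parser.error(
            f"invalid --thinking level {args.thinking!r}; "
            f"expected one of: {', '.join(VALID_THINKING_LEVELS)}"
        )
    return args
```

Leaving the default as `None` means a run without `--thinking` behaves as it did before the flag existed, which matches the review's note about `None` defaults.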

Usage

# Run with high reasoning depth
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4 --thinking high

# Run with minimal thinking for speed
./scripts/run.sh --model openrouter/anthropic/claude-sonnet-4 --thinking minimal

Why a new PR?

The original PR #76 had extensive merge conflicts due to main branch evolution (parallel judging, categories, incremental results, etc.). Rather than resolve 6 conflict blocks across 3 files, this is a clean implementation of the same feature on current main.

Supersedes #76.

Co-authored-by: ForceConstant <ForceConstant@users.noreply.github.com>
Co-authored-by: jb510 <jb510@users.noreply.github.com>

Add --thinking flag to benchmark.py to configure reasoning depth for models that support it. Valid levels: off, minimal, low, medium, high, xhigh, adaptive.

Changes:
- Add VALID_THINKING_LEVELS constant in lib_agent.py
- Add thinking_level parameter to execute_openclaw_task()
- Pass --thinking to openclaw agent command
- Add --thinking argument to benchmark.py with validation
- Update README command reference

This is a clean rebase of PR #76 (originally from #12 by @jb510) against the current main branch, which has significantly evolved.

Co-authored-by: ForceConstant <ForceConstant@users.noreply.github.com>
Co-authored-by: jb510 <jb510@users.noreply.github.com>

kilo-code-bot · 2026-05-05T16:30:24Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Clean implementation of thinking-level support. The thinking_level value is validated against a whitelist before use and passed to subprocess.run as a list argument (not shell-interpolated), so there's no injection risk. The optional parameter flows correctly through the call chain with proper None defaults.
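
The pattern the review describes (whitelist check, then list-form arguments) might look something like the sketch below. The helper name `build_agent_command` and its `base_cmd` parameter are hypothetical and not taken from the PR; the actual logic lives inside `execute_openclaw_task()`, whose full signature is not shown here.

```python
from typing import List, Optional

# Same level names as in the PR description; redefined so the sketch stands alone.
VALID_THINKING_LEVELS = ("off", "minimal", "low", "medium", "high", "xhigh", "adaptive")


def build_agent_command(base_cmd: List[str], thinking_level: Optional[str] = None) -> List[str]:
    """Return a copy of base_cmd with --thinking appended when a level is given.

    base_cmd is whatever argument list the caller already builds for the
    agent command (hypothetical parameter, used only for illustration).
    """
    cmd = list(base_cmd)  # copy so the caller's list is not mutated
    if thinking_level is None:
        return cmd  # flag omitted: the command is unchanged
    if thinking_level not in VALID_THINKING_LEVELS:
        raise ValueError(f"unsupported thinking level: {thinking_level!r}")
    # Each list element becomes exactly one argv entry when the list is passed
    # to subprocess.run() without shell=True, so the value is never parsed by a shell.
    cmd += ["--thinking", thinking_level]
    return cmd
```

Because the value must match one of seven fixed strings and travels as a single list element, a string such as `"high; rm -rf /"` is rejected by the whitelist, and even an unvalidated value would reach the child process as one literal argument rather than being interpreted by a shell.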

Files Reviewed (3 files)

scripts/lib_agent.py — VALID_THINKING_LEVELS constant + thinking_level param in execute_openclaw_task()
scripts/benchmark.py — --thinking CLI arg with validation
README.md — docs update

Reviewed by claude-4.6-sonnet-20260217 · 99,364 tokens

ScuttleBot mentioned this pull request May 5, 2026

Add thinking-level benchmarking support (off/minimal/low/medium/high/xhigh/adaptive) #76

Closed

olearycrew merged commit 28ddf62 into main May 5, 2026
2 checks passed

ScuttleBot mentioned this pull request May 6, 2026

Thinking Levels #9

Closed

olearycrew merged 1 commit into main from feat/thinking-levels-rebased
