Problem
The run subcommand has two conceptually different axes, but their flag names are currently easy to confuse:
- Pack-set selectors (the what): --quick / --medium / --full / --reasoning
- Thinking-mode flags (the how): --enable-thinking / --no-thinking
--reasoning is a pack-set selector (it runs the reasoning-heavy suite: HE+, LCB v6, GPQA, GSM-Symbolic), but its name reads like a mode ("evaluate with reasoning"), and it sits right next to --enable-thinking/--no-thinking. In practice this is genuinely ambiguous. When you want a with-reasoning vs. without-reasoning comparison, what you actually want is --full --enable-thinking vs. --full --no-thinking, not --reasoning. The shared word makes that easy to get wrong.
Proposal
Rename the flag to make it clearly a pack category, parallel to --full:
- Add --reasoning-packs as the primary flag.
- Keep --reasoning as a hidden, deprecated alias (for back-compat, so anything scripted against it keeps working), with a one-line deprecation note in --help.
- Update --help so the pack-set group reads --quick | --medium | --full | --reasoning-packs and stays visually separate from the --enable-thinking/--no-thinking mode group.
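The three bullets can be sketched as follows. The issue doesn't say what the CLI is built on, so this assumes a Python argparse front end; build_parser, DeprecatedPackAlias, the group titles and the help strings are all hypothetical. Each axis becomes its own argument group with a mutually exclusive set of flags, and the legacy --reasoning is registered with help=argparse.SUPPRESS so it still parses but no longer shows up in --help.

```python
import argparse
import sys


class DeprecatedPackAlias(argparse.Action):
    """Hypothetical action for a hidden legacy flag: warn on stderr, then
    store the same value the replacement flag would store."""

    def __init__(self, option_strings, dest, replacement, **kwargs):
        self.replacement = replacement
        super().__init__(option_strings, dest, nargs=0, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        print(f"warning: {option_string} is deprecated, use {self.replacement}",
              file=sys.stderr)
        setattr(namespace, self.dest, self.const)


def build_parser():
    parser = argparse.ArgumentParser(prog="run")

    # Axis 1, the "what": at most one pack set per invocation.
    packs = parser.add_argument_group("pack set")
    pack_choice = packs.add_mutually_exclusive_group()
    for name in ("quick", "medium", "full"):
        pack_choice.add_argument(f"--{name}", dest="packs", action="store_const",
                                 const=name, help=f"run the {name} pack set")
    pack_choice.add_argument(
        "--reasoning-packs", dest="packs", action="store_const", const="reasoning",
        help="run the reasoning-heavy packs (--reasoning is a deprecated alias)")

    # Legacy spelling: still accepted, hidden from --help, same stored value.
    parser.add_argument("--reasoning", dest="packs", action=DeprecatedPackAlias,
                        const="reasoning", replacement="--reasoning-packs",
                        help=argparse.SUPPRESS)

    # Axis 2, the "how": thinking mode, shown as a separate help group.
    thinking = parser.add_argument_group("thinking mode")
    mode = thinking.add_mutually_exclusive_group()
    mode.add_argument("--enable-thinking", dest="thinking", action="store_const",
                      const=True, help="evaluate with thinking enabled")
    mode.add_argument("--no-thinking", dest="thinking", action="store_const",
                      const=False, help="evaluate with thinking disabled")
    return parser
```

With this layout the usage line groups the flags as [--quick | --medium | --full | --reasoning-packs] [--enable-thinking | --no-thinking]. In this sketch the hidden alias sits outside the mutually exclusive group, so a combination like --full --reasoning isn't rejected and the last flag on the command line wins. Python 3.13+ also accepts deprecated=True on add_argument, which emits its own warning if you'd rather not maintain a custom action.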
This keeps the mental model crisp: pick the packs, then pick the mode. The canonical quality eval stays --full --no-thinking + --full --enable-thinking; --reasoning-packs is the separate, deliberate deep-dive.
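The "pick the packs, then pick the mode" model amounts to a cross product of the two axes. Here is a small sketch, assuming a hypothetical helper eval_matrix that only builds argument lists for the run subcommand and does not invoke the tool. The tool's executable name isn't given in the issue, so it is left out. A flag passed on the wrong axis (for example, a pack-set flag in the mode slot) is rejected instead of silently running.

```python
from itertools import product

# Flag spellings from the proposal; one tuple per axis.
PACK_SETS = ("--quick", "--medium", "--full", "--reasoning-packs")
THINKING_MODES = ("--no-thinking", "--enable-thinking")


def eval_matrix(packs=("--full",), modes=THINKING_MODES):
    """Return one `run` argument list per (pack set, thinking mode) pair.

    Hypothetical helper: it only builds argument lists, it runs nothing.
    """
    for p in packs:
        if p not in PACK_SETS:
            raise ValueError(f"{p!r} is not a pack-set flag")
    for m in modes:
        if m not in THINKING_MODES:
            raise ValueError(f"{m!r} is not a thinking-mode flag")
    return [["run", p, m] for p, m in product(packs, modes)]
```

With the defaults, eval_matrix() returns [["run", "--full", "--no-thinking"], ["run", "--full", "--enable-thinking"]], which is the canonical quality A/B described above.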
Downstream: the quality-test.sh wrapper (in noonghunna/club-3090) forwards the same token, so mirror the rename there and in its usage block.
Optional (bigger, later)
If we ever want the split to be structural rather than by convention, make the axes explicit: --packs {quick,medium,full,reasoning} + --thinking {on,off}. That's a breaking change, though. The --reasoning → --reasoning-packs alias rename gets most of the clarity for a fraction of the churn, so start there.
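For comparison, here is a minimal sketch of the structural variant, again assuming an argparse front end (build_structural_parser is a hypothetical name). Each axis is a single flag with a closed set of choices. Making both flags required is an assumption here, not part of the proposal; it forces every run to state its pack set and mode explicitly.

```python
import argparse


def build_structural_parser():
    """Sketch of the optional breaking redesign: one explicit flag per axis."""
    parser = argparse.ArgumentParser(prog="run")
    # Axis 1 (the what): which pack set to run.
    parser.add_argument("--packs", choices=["quick", "medium", "full", "reasoning"],
                        required=True, help="pack set to run")
    # Axis 2 (the how): thinking mode. Required here by assumption.
    parser.add_argument("--thinking", choices=["on", "off"], required=True,
                        help="thinking mode")
    return parser


args = build_structural_parser().parse_args(["--packs", "full", "--thinking", "off"])
enable_thinking = args.thinking == "on"  # map the on/off choice to a boolean
```

In this form "reasoning" only appears as a value of --packs, so it can't be mistaken for a mode. A value like --thinking reasoning fails argparse's choices check with a usage error.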
Context
Hit this while running a KV-cache quality A/B: reached for --medium + --sandboxed-only (which, separately, means all sandboxed packs, including the slow HE+/LCB ones) when the right call was simply --full in both thinking modes. The naming nudged toward the wrong flag.