K2-Think UHead offline best-of-N on MBPP+ (medium/low budgets) by smirnovlad · Pull Request #257 · IINemo/thinkbooster

smirnovlad · 2026-06-28T12:02:24Z

Summary

Adds the offline best-of-N experiment on MBPP+ for the CSCS Clariden cluster, with K2-Think-V2 generating and the new step_reasoning UHead scoring. Covers the medium and low thinking-budget variants, matching the prompts/budgets the UHead was trained on.

Two framework fixes were required because the thinking-mode path previously assumed Qwen-style output.

Framework

K2 reasoning-budget thinking tags. </think> was hardcoded across the vLLM generator (completion detection, stop-token registration, mid-step leak truncation, answer-phase close) and the offline-BoN stop override. K2-Think emits a budget-dependent close tag (reasoning_effort=medium -> </think_fast>, low -> </think_faster>, else </think>). Generalized via a per-instance think_close_tag + THINK_CLOSE_TAGS / _find_think_close. Default </think> behaviour is unchanged.
<answer>...</answer> extraction. extract_answer only matched <Answer>: / \boxed{} and collapsed to the first line, dropping K2's multi-line fenced-code answers entirely. Added XML-tag handling (multi-line preserved so code blocks survive; \boxed{} wrapper still cleaned).

Experiment

config/experiments/offline_best_of_n/mbpp_plus/offline_bon_vllm_k2_think_v2_mbpp_plus_uhead_{medium,low}.yaml — new UHead checkpoint rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs, budgets via model.reasoning_effort, <answer> prompt wrapper (config/prompts/k2_think_answer.txt).
scripts/slurm/install_step_reasoning_head.sh — ports the author's step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning) into luh, which carries the vLLM hidden-state path but not that head type.
scripts/slurm/run_k2_uhead_mbpp_clariden.sh + README — CSCS Clariden launcher (uenv + venv, shared a0142 HF cache).

Validation

Smoke test (subset=8, N=4, medium) on a GH200: 6/8 correct (75%), 0 unextracted answers (no_answer_rate: 0.0), real EvalPlus-graded code. Full path exercised: generation -> step_reasoning UHead scoring -> best-of-N selection -> MBPP+ grading. Answer-extraction regression covered (math <Answer>:/\boxed{}, XML code, XML boxed, empty-tag skip).

Review

codex review against main: PASS (no P1). Two P2 findings addressed in the final commit — the offline-BoN stop_tokens_override now uses the budget close tag (so medium/low stop at end-of-thinking instead of running to EOS and re-generating the answer phase), and XML answers get \boxed{} cleanup.

K2-Think emits a budget-dependent thinking close tag: reasoning_effort=medium gives <think_fast>...</think_fast>, low gives <think_faster>...</think_faster>, high/default gives <think>...</think>. The vLLM generator hardcoded </think> for thinking-completion detection, stop-token registration, the mid-step leak truncation, and the answer-phase closing step, so the fast/faster budgets never split the answer out (steps collapsed to 1, no answer extracted). Generalize via a per-instance think_close_tag derived from reasoning_effort plus tolerant multi-tag detection (THINK_CLOSE_TAGS / _find_think_close); default </think> behaviour is unchanged. Also extend extract_answer to handle <answer>...</answer> XML tags, keeping multi-line content so fenced code blocks survive for code benchmarks. The previous extractor only matched <Answer>: and \boxed{} and collapsed to the first line, which dropped K2-Think's code answers entirely.

Medium/low configs using the new UHead checkpoint rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs and the prompts/thinking budgets it was trained on, reproduced config-only via model.reasoning_effort (medium -> <think_fast>, low -> <think_faster>) plus the k2_think_answer wrapper. install_step_reasoning_head.sh ports the author's step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning) into luh, which has the vLLM hidden-state path but not that head. Includes the CSCS Clariden launcher (uenv + venv, shared a0142 HF cache) and a README.

- Offline best-of-N: stop at the generator's thinking close tag (K2-Think </think_fast> / </think_faster>) instead of a hardcoded </think> in the stop_tokens_override, so medium/low budgets stop at end-of-thinking rather than running to EOS and wastefully re-generating the answer phase.
- extract_answer: clean a \boxed{} wrapper from XML <answer> content too, so the <answer>...</answer> path matches the default/boxed paths (otherwise <answer>\boxed{42}</answer> would keep the wrapper and fail exact-match).
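A minimal sketch of the XML answer path with \boxed{} cleanup, assuming a regex-based extractor. The function names extract_xml_answer and strip_boxed are hypothetical; the real extract_answer in the repository may be structured differently.

```python
import re

# Non-greedy match so each <answer>...</answer> pair is found separately;
# DOTALL keeps multi-line content (for example fenced code) intact.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_BOXED_RE = re.compile(r"^\\boxed\{(.*)\}$", re.DOTALL)


def strip_boxed(text: str) -> str:
    """Remove a single outer \\boxed{...} wrapper if the whole text is wrapped."""
    stripped = text.strip()
    match = _BOXED_RE.match(stripped)
    return match.group(1).strip() if match else stripped


def extract_xml_answer(text: str) -> str:
    """Return the first non-empty <answer> body, or '' when none is present."""
    for match in _ANSWER_RE.finditer(text):
        content = match.group(1).strip()
        if content:  # skip empty tags
            return strip_boxed(content)
    return ""
```

With this shape, a boxed answer inside the tags reduces to its bare value, while a code answer passes through unchanged because it does not match the wrapper pattern.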

Two problems only the full 378-problem runs surfaced (the subset=8 smokes passed):
- OOM: generate_trajectories defaults checkpoint_batch_size to len(dataset), so offline best-of-N sent all 378 x N trajectories to vLLM in one chunk and the native hidden-state capture OOM'd a 2x GH200. Set checkpoint_batch_size=16 in the configs (~64 sequences/chunk, plus progressive saves).
- Extraction: with the low budget K2 emits valid ```python code but usually without an <answer> wrapper (only ~20% used it), so extract_answer returned empty for 86.5% of low trajectories. Added a fenced-code-block fallback; on the real low run this drops no-answer from 87% to 4% (75% now yield runnable code).
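The fenced-code-block fallback could look roughly like this. It is a sketch under assumptions: the helper name fenced_code_fallback is hypothetical, and taking the last block (rather than the first) is a choice made here for illustration, not something the PR states.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Optional python/py language tag, then the block body up to the closing fence.
_FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def fenced_code_fallback(text: str) -> str:
    """Return the body of the last non-empty fenced code block, or ''."""
    blocks = [body.strip() for body in _FENCE_RE.findall(text)]
    blocks = [body for body in blocks if body]
    return blocks[-1] if blocks else ""
```

Such a fallback would only run after the <answer> and \boxed{} paths return nothing, so outputs that already use the wrapper are unaffected.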

…pu_mem)

The full runs cleared chunks 1-2 then OOM'd on chunk 3: across chunks the native HS capture leaves reserved-but-unallocated GPU memory that fragments until a large alloc fails (the error itself suggests expandable_segments). Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the launcher and drop gpu_memory_utilization 0.87 -> 0.78 to leave headroom on GPU 0 for HS + UHead.

…nk size

The expandable_segments OOM fix hung vLLM during model loading (logs frozen at 'Starting to load model' for 20+ min, zero shard progress, on two nodes). Remove it and instead handle the cross-chunk fragmentation OOM with headroom: gpu_memory_utilization 0.78->0.72 and checkpoint_batch_size 16->8 (~32 seqs/chunk, the smoke-validated size).
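The chunk sizes quoted in these commits follow from simple arithmetic, sketched below. N=4 is inferred from the "~64 sequences/chunk" and "~32 seqs/chunk" figures (and matches the smoke test); iter_chunks and sequences_per_chunk are hypothetical helpers, not the repository's generate_trajectories.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def sequences_per_chunk(batch_size: int, n_samples: int) -> int:
    """Each problem in a chunk is sampled n_samples times."""
    return batch_size * n_samples


problems = list(range(378))  # MBPP+ problem count from the PR
N = 4                        # assumed samples per problem (inferred, see lead-in)

one_chunk = sequences_per_chunk(len(problems), N)   # default: everything at once
chunks_16 = list(iter_chunks(problems, 16))
chunks_8 = list(iter_chunks(problems, 8))
```

Under these assumptions the default sends 1512 sequences in a single chunk, while batch sizes of 16 and 8 give 24 and 48 chunks of at most 64 and 32 sequences respectively.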

The shared-store K2-Think-V2 snapshot had only the 62 safetensors (no config.json / tokenizer). Online, vLLM stalled trying to fetch the missing config from HF on a flaky compute-node network (looked like a load hang); offline it failed fast. With the config/tokenizer now in the cache, HF_HUB_OFFLINE=1 makes weight loading fully local and deterministic.
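A pre-flight check along these lines could catch an incomplete snapshot before vLLM starts. This is a sketch only: the required file list and both function names are assumptions, and the snapshot path is whatever the launcher resolves. HF_HUB_OFFLINE is the standard Hugging Face Hub environment variable; it generally needs to be set before the Hugging Face libraries are imported.

```python
import os
from pathlib import Path
from typing import List, Sequence

# Files assumed to be needed next to the safetensors shards (illustrative list).
REQUIRED_FILES = ("config.json", "tokenizer_config.json")


def missing_snapshot_files(
    snapshot_dir: str, required: Sequence[str] = REQUIRED_FILES
) -> List[str]:
    """Return the required file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in required if not (root / name).is_file()]


def enable_offline_if_complete(snapshot_dir: str) -> bool:
    """Set HF_HUB_OFFLINE=1 only when the snapshot has all required files."""
    if missing_snapshot_files(snapshot_dir):
        return False
    os.environ["HF_HUB_OFFLINE"] = "1"
    return True
```

Failing fast on a missing config is the behaviour the commit describes for offline mode; the check simply makes the cause explicit instead of surfacing it as a load error.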

…to 16384

Model loaded but KV-cache init failed: 'Available KV cache memory: -8.89 GiB'. Lowering gpu_memory_utilization to 0.72 starved KV (non-KV profiling peak ~77 GiB > 0.72 budget). The real lever is max_model_len: it sets both the profiling peak (KV init) and per-sequence generation memory (the chunk-3 OOM). Medium/low gens are short (~2745/~500 tokens), so cut max_context_budget 32768->16384 (and max_new_tokens 32000->15000) and set gpu_memory_utilization 0.82.
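The negative KV figure can be reproduced with a simplified budget model: KV memory is roughly the utilization-scaled device memory minus the non-KV profiling peak. The sketch below is an assumption about how the numbers relate, not vLLM's exact accounting. The total device memory is back-computed from the quoted values rather than taken from the PR, and the ~77 GiB peak is approximate.

```python
def kv_cache_budget_gib(
    total_gib: float, gpu_memory_utilization: float, non_kv_peak_gib: float
) -> float:
    """Simplified model: memory left for KV cache after the non-KV peak."""
    return total_gib * gpu_memory_utilization - non_kv_peak_gib


def implied_total_gib(
    gpu_memory_utilization: float, non_kv_peak_gib: float, reported_kv_gib: float
) -> float:
    """Invert the model to recover the device memory the figures imply."""
    return (non_kv_peak_gib + reported_kv_gib) / gpu_memory_utilization


# Values quoted in the commit message: utilization 0.72, peak ~77 GiB, KV -8.89 GiB.
total = implied_total_gib(0.72, 77.0, -8.89)

# Raising utilization alone, with the same (32768-context) profiling peak.
kv_at_082_same_peak = kv_cache_budget_gib(total, 0.82, 77.0)
```

In this model, utilization 0.82 with an unchanged peak would leave well under 1 GiB for KV cache, which is consistent with the commit's point that reducing max_model_len (and with it the profiling peak) is the lever that matters.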

smirnovlad added 9 commits June 28, 2026 13:18

Add CLAUDE.md with project overview and mandatory codex review gate

7c7283e

smirnovlad wants to merge 9 commits into main from exp/k2think-uhead-mbpp-offline-bon
