K2-Think UHead offline best-of-N on MBPP+ (medium/low budgets) #257
Open
smirnovlad wants to merge 9 commits into
K2-Think emits a budget-dependent thinking close tag: reasoning_effort=medium
gives <think_fast>...</think_fast>, low gives <think_faster>...</think_faster>,
high/default gives <think>...</think>. The vLLM generator hardcoded </think> for
thinking-completion detection, stop-token registration, the mid-step leak
truncation, and the answer-phase closing step, so the fast/faster budgets never
split the answer out (steps collapsed to 1, no answer extracted). Generalize via
a per-instance think_close_tag derived from reasoning_effort plus tolerant
multi-tag detection (THINK_CLOSE_TAGS / _find_think_close); default </think>
behaviour is unchanged.
Also extend extract_answer to handle <answer>...</answer> XML tags, keeping
multi-line content so fenced code blocks survive for code benchmarks. The
previous extractor only matched <Answer>: and \boxed{} and collapsed to the
first line, which dropped K2-Think's code answers entirely.
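
The tag handling described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: THINK_CLOSE_TAGS and the effort-to-tag mapping come from the commit message, while think_close_tag_for, find_think_close and split_thinking are hypothetical names and signatures chosen for the example.

```python
# Hypothetical sketch; the mapping mirrors the commit message, the functions are illustrative.
THINK_CLOSE_TAGS = ("</think_faster>", "</think_fast>", "</think>")

_EFFORT_TO_CLOSE_TAG = {"medium": "</think_fast>", "low": "</think_faster>"}


def think_close_tag_for(reasoning_effort=None):
    """Return the close tag expected for a reasoning_effort (default: </think>)."""
    return _EFFORT_TO_CLOSE_TAG.get(reasoning_effort, "</think>")


def find_think_close(text):
    """Return (index, tag) for the earliest known close tag in text, or None."""
    best = None
    for tag in THINK_CLOSE_TAGS:
        idx = text.find(tag)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, tag)
    return best


def split_thinking(text):
    """Split generated text into (thinking, answer) at the first close tag."""
    hit = find_think_close(text)
    if hit is None:
        # No close tag seen: everything is still thinking, no answer phase yet.
        return text, ""
    idx, tag = hit
    return text[:idx], text[idx + len(tag):].strip()
```

Searching for every known tag rather than only the configured one is what makes the detection tolerant: a generation that closes with a different budget's tag is still split correctly.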
Medium/low configs using the new UHead checkpoint rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs and the prompts/thinking budgets it was trained on, reproduced config-only via model.reasoning_effort (medium -> <think_fast>, low -> <think_faster>) plus the k2_think_answer wrapper. install_step_reasoning_head.sh ports the author's step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning) into luh, which has the vLLM hidden-state path but not that head. Includes the CSCS Clariden launcher (uenv + venv, shared a0142 HF cache) and a README.
- Offline best-of-N: stop at the generator's thinking close tag (K2-Think
</think_fast> / </think_faster>) instead of a hardcoded </think> in the
stop_tokens_override, so medium/low budgets stop at end-of-thinking rather
than running to EOS and wastefully re-generating the answer phase.
- extract_answer: clean a \boxed{} wrapper from XML <answer> content too, so the
<answer>...</answer> path matches the default/boxed paths (otherwise
<answer>\boxed{42}</answer> would keep the wrapper and fail exact-match).
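
A minimal sketch of the XML path with \boxed{} cleanup, under stated assumptions: the helper names (strip_boxed, extract_xml_answer) are hypothetical, and taking the last non-empty <answer> block is a choice made for this example, not something the commit messages specify.

```python
import re

# Hypothetical helper names; only extract_answer is named in the commit messages.
_ANSWER_XML = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
_BOXED_PREFIX = "\\boxed{"


def strip_boxed(text):
    r"""Return the content of a sole \boxed{...} wrapper, else the stripped text."""
    s = text.strip()
    if not (s.startswith(_BOXED_PREFIX) and s.endswith("}")):
        return s
    depth = 0
    for i in range(len(_BOXED_PREFIX) - 1, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                # Only unwrap when the opening brace closes at the very end.
                return s[len(_BOXED_PREFIX):-1].strip() if i == len(s) - 1 else s
    return s


def extract_xml_answer(text):
    """Return the last non-empty <answer> body (multi-line kept), or None."""
    bodies = [m.strip() for m in _ANSWER_XML.findall(text)]
    bodies = [b for b in bodies if b]  # skip empty tags
    if not bodies:
        return None
    return strip_boxed(bodies[-1])
```

Because re.DOTALL lets the body span lines, a fenced code block inside the tags comes back intact, while a plain boxed value is reduced to its content for exact-match grading.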
Two problems only the full 378-problem runs surfaced (the subset=8 smokes passed):
- OOM: generate_trajectories defaults checkpoint_batch_size to len(dataset), so offline best-of-N sent all 378 x N trajectories to vLLM in one chunk and the native hidden-state capture OOM'd a 2x GH200. Set checkpoint_batch_size=16 in the configs (~64 sequences/chunk, plus progressive saves).
- Extraction: with the low budget K2 emits valid ```python code but usually without an <answer> wrapper (only ~20% used it), so extract_answer returned empty for 86.5% of low trajectories. Added a fenced-code-block fallback; on the real low run this drops no-answer from 87% to 4% (75% now yield runnable code).
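
The fenced-code fallback could look roughly like this. It is a sketch only: fenced_code_fallback is a hypothetical name, the regex handles just unlabeled and python/py fences, and returning the last block is an assumption made for the example.

```python
import re

FENCE = "`" * 3  # built at runtime so this example does not contain a literal fence

# Matches an opening fence (optionally labeled python/py) up to the next closing fence.
_FENCED = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def fenced_code_fallback(text):
    """Return the last non-empty fenced code block in text, or None if there is none."""
    blocks = [b.strip("\n") for b in _FENCED.findall(text)]
    blocks = [b for b in blocks if b.strip()]
    return blocks[-1] if blocks else None
```

A caller would try the <answer> tag path first and use this only when that path returns nothing, so wrapped answers keep their existing behaviour.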
…pu_mem)
The full runs cleared chunks 1-2 then OOM'd on chunk 3: across chunks the native HS capture leaves reserved-but-unallocated GPU memory that fragments until a large alloc fails (the error itself suggests expandable_segments). Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the launcher and drop gpu_memory_utilization 0.87 -> 0.78 to leave headroom on GPU 0 for HS + UHead.
…nk size
The expandable_segments OOM fix hung vLLM during model loading (logs frozen at 'Starting to load model' for 20+ min, zero shard progress, on two nodes). Remove it and instead handle the cross-chunk fragmentation OOM with headroom: gpu_memory_utilization 0.78->0.72 and checkpoint_batch_size 16->8 (~32 seqs/chunk, the smoke-validated size).
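
To make the chunk-size arithmetic concrete, here is a small sketch. iter_chunks and sequences_per_chunk are hypothetical helpers, not the framework's generate_trajectories; N=4 trajectories per problem is inferred from the "16 -> ~64" and "8 -> ~32" figures quoted in the commit messages.

```python
def iter_chunks(items, checkpoint_batch_size=None):
    """Yield consecutive slices of items; None mimics the 'whole dataset' default."""
    if not items:
        return
    size = checkpoint_batch_size or len(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sequences_per_chunk(checkpoint_batch_size, n_trajectories):
    """Number of sequences sent to the generator for one chunk of problems."""
    return checkpoint_batch_size * n_trajectories


problems = list(range(378))  # stand-in for the 378 MBPP+ problems
chunks_of_8 = list(iter_chunks(problems, 8))
print(len(chunks_of_8), sequences_per_chunk(8, 4))
```

With the default (no chunk size) all 378 problems land in one chunk, which is the situation that triggered the original OOM; smaller chunks trade more generator calls for a lower peak and allow progressive saves.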
The shared-store K2-Think-V2 snapshot had only the 62 safetensors (no config.json / tokenizer). Online, vLLM stalled trying to fetch the missing config from HF on a flaky compute-node network (looked like a load hang); offline it failed fast. With the config/tokenizer now in the cache, HF_HUB_OFFLINE=1 makes weight loading fully local and deterministic.
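
A pre-flight check along these lines would catch an incomplete snapshot before vLLM starts. It is a sketch under assumptions: missing_snapshot_files is a hypothetical helper, and the tokenizer file names are common Hugging Face conventions rather than the verified contents of the K2-Think-V2 snapshot.

```python
from pathlib import Path

# Assumed file names; adjust to whatever the model repository actually ships.
REQUIRED_FILES = ("config.json",)
TOKENIZER_CANDIDATES = ("tokenizer.json", "tokenizer.model", "tokenizer_config.json")


def missing_snapshot_files(snapshot_dir):
    """Return a list of missing pieces in a local model snapshot directory."""
    root = Path(snapshot_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in TOKENIZER_CANDIDATES):
        missing.append("tokenizer files")
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing
```

Running this in the launcher before setting HF_HUB_OFFLINE=1 would turn a silent network stall into an immediate, readable error.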
…to 16384
Model loaded but KV-cache init failed: 'Available KV cache memory: -8.89 GiB'. Lowering gpu_memory_utilization to 0.72 starved KV (non-KV profiling peak ~77 GiB > 0.72 budget). The real lever is max_model_len: it sets both the profiling peak (KV init) and per-sequence generation memory (the chunk-3 OOM). Medium/low gens are short (~2745/~500 tokens), so cut max_context_budget 32768->16384 (and max_new_tokens 32000->15000) and set gpu_memory_utilization 0.82.
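
The negative KV figure follows from a simple budget calculation, sketched here with a simplified model: the KV budget is the utilization fraction of total GPU memory minus the non-KV profiling peak. The 95 GiB total is an assumed value for illustration (it is not stated in the commit message), and the ~77 GiB peak is the rounded figure quoted above, so the result only approximates the logged -8.89 GiB.

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, non_kv_peak_gib):
    """Simplified model: memory left for KV cache after the non-KV profiling peak."""
    return total_gib * gpu_memory_utilization - non_kv_peak_gib


def min_utilization(total_gib, non_kv_peak_gib):
    """Smallest utilization fraction at which the KV budget stops being negative."""
    return non_kv_peak_gib / total_gib


ASSUMED_TOTAL_GIB = 95.0  # assumption, not taken from the PR
NON_KV_PEAK_GIB = 77.0    # rounded value from the commit message

print(kv_cache_budget_gib(ASSUMED_TOTAL_GIB, 0.72, NON_KV_PEAK_GIB))  # negative: KV starved
print(min_utilization(ASSUMED_TOTAL_GIB, NON_KV_PEAK_GIB))
```

Under these assumed numbers, raising utilization alone would leave very little room for KV cache, which is consistent with the commit's point that lowering the peak through max_model_len is the more effective lever.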
Summary
Adds the offline best-of-N experiment with K2-Think-V2 generating + the new step_reasoning UHead scoring, on MBPP+, for the CSCS Clariden cluster. Medium and low thinking-budget variants, matching the prompts/budgets the UHead was trained on.
Two framework fixes were required because the thinking-mode path previously assumed Qwen-style output.
Framework
- </think> was hardcoded across the vLLM generator (completion detection, stop-token registration, mid-step leak truncation, answer-phase close) and the offline-BoN stop override. K2-Think emits a budget-dependent close tag (reasoning_effort=medium -> </think_fast>, low -> </think_faster>, else </think>). Generalized via a per-instance think_close_tag + THINK_CLOSE_TAGS / _find_think_close. Default </think> behaviour is unchanged.
- <answer>...</answer> extraction. extract_answer only matched <Answer>: / \boxed{} and collapsed to the first line, dropping K2's multi-line fenced-code answers entirely. Added XML-tag handling (multi-line preserved so code blocks survive; \boxed{} wrapper still cleaned).
Experiment
- config/experiments/offline_best_of_n/mbpp_plus/offline_bon_vllm_k2_think_v2_mbpp_plus_uhead_{medium,low}.yaml: new UHead checkpoint rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs, budgets via model.reasoning_effort, <answer> prompt wrapper (config/prompts/k2_think_answer.txt).
- scripts/slurm/install_step_reasoning_head.sh: ports the author's step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning) into luh, which carries the vLLM hidden-state path but not that head type.
- scripts/slurm/run_k2_uhead_mbpp_clariden.sh + README: CSCS Clariden launcher (uenv + venv, shared a0142 HF cache).
Validation
Smoke test (subset=8, N=4, medium) on a GH200: 6/8 correct (75%), 0 unextracted answers (no_answer_rate: 0.0), real EvalPlus-graded code. Full path exercised: generation -> step_reasoning UHead scoring -> best-of-N selection -> MBPP+ grading. Answer-extraction regression covered (math <Answer>: / \boxed{}, XML code, XML boxed, empty-tag skip).
Review
codex review against main: PASS (no P1). Two P2 findings addressed in the final commit: the offline-BoN stop_tokens_override now uses the budget close tag (so medium/low stop at end-of-thinking instead of running to EOS and re-generating the answer phase), and XML answers get \boxed{} cleanup.