K2-Think UHead offline best-of-N on MBPP+ (medium/low budgets)#257

Open: smirnovlad wants to merge 9 commits into main from exp/k2think-uhead-mbpp-offline-bon

@smirnovlad (Collaborator)

Summary

Adds an offline best-of-N experiment on MBPP+ for the CSCS Clariden cluster: K2-Think-V2 generates the candidates and the new step_reasoning UHead scores them. Includes medium and low thinking-budget variants that match the prompts and budgets the UHead was trained on.

Two framework fixes were required because the thinking-mode path previously assumed Qwen-style output.

Framework

  • K2 reasoning-budget thinking tags. </think> was hardcoded across the vLLM generator (completion detection, stop-token registration, mid-step leak truncation, answer-phase close) and the offline-BoN stop override. K2-Think emits a budget-dependent close tag (reasoning_effort=medium -> </think_fast>, low -> </think_faster>, else </think>). Generalized via a per-instance think_close_tag + THINK_CLOSE_TAGS / _find_think_close. Default </think> behaviour is unchanged.
  • <answer>...</answer> extraction. extract_answer only matched <Answer>: / \boxed{} and collapsed to the first line, dropping K2's multi-line fenced-code answers entirely. Added XML-tag handling (multi-line preserved so code blocks survive; \boxed{} wrapper still cleaned).
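The close-tag generalization in the first bullet can be sketched roughly as below. This is an illustrative re-implementation, not the code in this PR: `think_close_tag_for`, `find_think_close` and `split_thinking` are hypothetical names, and only the tag strings and the `reasoning_effort` mapping come from the description above.

```python
from typing import Optional, Tuple

# Close tags named in the PR description.
THINK_CLOSE_TAGS: Tuple[str, ...] = ("</think>", "</think_fast>", "</think_faster>")


def think_close_tag_for(reasoning_effort: Optional[str]) -> str:
    """Map a reasoning budget to the close tag K2-Think is described as emitting."""
    if reasoning_effort == "medium":
        return "</think_fast>"
    if reasoning_effort == "low":
        return "</think_faster>"
    return "</think>"  # high / default / unset


def find_think_close(text: str) -> Optional[Tuple[int, str]]:
    """Return (index, tag) for the earliest close tag in text, or None."""
    best: Optional[Tuple[int, str]] = None
    for tag in THINK_CLOSE_TAGS:
        idx = text.find(tag)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, tag)
    return best


def split_thinking(text: str) -> Tuple[str, Optional[str]]:
    """Split generated text into (thinking, answer_phase) at the close tag.

    answer_phase is None when thinking has not been closed yet.
    """
    hit = find_think_close(text)
    if hit is None:
        return text, None
    idx, tag = hit
    return text[:idx], text[idx + len(tag):]
```

Detection accepts any of the three tags rather than only the configured one, which is one way to read "tolerant multi-tag detection"; the configured tag from `think_close_tag_for` would still be the one registered as a stop token.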

Experiment

  • config/experiments/offline_best_of_n/mbpp_plus/offline_bon_vllm_k2_think_v2_mbpp_plus_uhead_{medium,low}.yaml — new UHead checkpoint rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs, budgets via model.reasoning_effort, <answer> prompt wrapper (config/prompts/k2_think_answer.txt).
  • scripts/slurm/install_step_reasoning_head.sh — ports the author's step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning) into luh, which carries the vLLM hidden-state path but not that head type.
  • scripts/slurm/run_k2_uhead_mbpp_clariden.sh + README — CSCS Clariden launcher (uenv + venv, shared a0142 HF cache).

Validation

Smoke test (subset=8, N=4, medium) on a GH200: 6/8 correct (75%), 0 unextracted answers (no_answer_rate: 0.0), real EvalPlus-graded code. Full path exercised: generation -> step_reasoning UHead scoring -> best-of-N selection -> MBPP+ grading. Answer-extraction regression covered (math <Answer>:/\boxed{}, XML code, XML boxed, empty-tag skip).

Review

codex review against main: PASS (no P1). Two P2 findings addressed in the final commit — the offline-BoN stop_tokens_override now uses the budget close tag (so medium/low stop at end-of-thinking instead of running to EOS and re-generating the answer phase), and XML answers get \boxed{} cleanup.

K2-Think emits a budget-dependent thinking close tag: reasoning_effort=medium
gives <think_fast>...</think_fast>, low gives <think_faster>...</think_faster>,
high/default gives <think>...</think>. The vLLM generator hardcoded </think> for
thinking-completion detection, stop-token registration, the mid-step leak
truncation, and the answer-phase closing step, so the fast/faster budgets never
split the answer out (steps collapsed to 1, no answer extracted). Generalize via
a per-instance think_close_tag derived from reasoning_effort plus tolerant
multi-tag detection (THINK_CLOSE_TAGS / _find_think_close); default </think>
behaviour is unchanged.

Also extend extract_answer to handle <answer>...</answer> XML tags, keeping
multi-line content so fenced code blocks survive for code benchmarks. The
previous extractor only matched <Answer>: and \boxed{} and collapsed to the
first line, which dropped K2-Think's code answers entirely.
Medium/low configs using the new UHead checkpoint
rediska0123/uhead_hs_K2-Think-V2_mixed_code10K_steps_vllm_10epochs and the
prompts/thinking budgets it was trained on, reproduced config-only via
model.reasoning_effort (medium -> <think_fast>, low -> <think_faster>) plus the
k2_think_answer wrapper. install_step_reasoning_head.sh ports the author's
step_reasoning UHead head (from cant-access-rediska0123/uncertainty4reasoning)
into luh, which has the vLLM hidden-state path but not that head. Includes the
CSCS Clariden launcher (uenv + venv, shared a0142 HF cache) and a README.
- Offline best-of-N: stop at the generator's thinking close tag (K2-Think
  </think_fast> / </think_faster>) instead of a hardcoded </think> in the
  stop_tokens_override, so medium/low budgets stop at end-of-thinking rather
  than running to EOS and wastefully re-generating the answer phase.
- extract_answer: clean a \boxed{} wrapper from XML <answer> content too, so the
  <answer>...</answer> path matches the default/boxed paths (otherwise
  <answer>\boxed{42}</answer> would keep the wrapper and fail exact-match).
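A minimal sketch of the XML path described above, assuming the first non-empty `<answer>` tag wins (the PR does not say which tag is chosen when there are several). `extract_xml_answer` and `strip_boxed` are hypothetical helper names, not the functions in this repository.

```python
import re
from typing import Optional

# DOTALL keeps multi-line content, so fenced code inside the tag survives.
ANSWER_TAG_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
BOXED_PREFIX = "\\boxed{"


def strip_boxed(text: str) -> str:
    """Remove a single outer \\boxed{...} wrapper if it spans the whole string."""
    s = text.strip()
    if not s.startswith(BOXED_PREFIX):
        return s
    depth = 0
    # Start at the opening brace of the wrapper and find its matching close.
    for i in range(len(BOXED_PREFIX) - 1, len(s)):
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
            if depth == 0:
                # Only unwrap when the matching brace is the last character.
                return s[len(BOXED_PREFIX):i].strip() if i == len(s) - 1 else s
    return s  # unbalanced braces: leave untouched


def extract_xml_answer(text: str) -> Optional[str]:
    """Return the content of the first non-empty <answer> tag, or None."""
    for match in ANSWER_TAG_RE.finditer(text):
        content = match.group(1).strip()
        if content:  # skip empty tags
            return strip_boxed(content)
    return None
```

With this shape, `<answer>\boxed{42}</answer>` yields `42`, so exact-match grading sees the same string as the plain boxed path.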
Two problems only the full 378-problem runs surfaced (the subset=8 smokes
passed):

- OOM: generate_trajectories defaults checkpoint_batch_size to len(dataset), so
  offline best-of-N sent all 378 x N trajectories to vLLM in one chunk and the
  native hidden-state capture OOM'd a 2x GH200. Set checkpoint_batch_size=16 in
  the configs (~64 sequences/chunk, plus progressive saves).

- Extraction: with the low budget K2 emits valid ```python code but usually
  without an <answer> wrapper (only ~20% used it), so extract_answer returned
  empty for 86.5% of low trajectories. Added a fenced-code-block fallback; on the
  real low run this drops no-answer from 87% to 4% (75% now yield runnable code).
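The fenced-code-block fallback could look roughly like the sketch below. It is an assumption-laden illustration: `fallback_code_answer` is a hypothetical name, and both choosing the last block and returning the bare code without fences are guesses about the behaviour, not something stated in this PR.

```python
import re
from typing import Optional

# Built programmatically so the sketch can itself live inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def fallback_code_answer(text: str) -> Optional[str]:
    """Return the last non-empty fenced code block in text, or None.

    Intended to run only after the <answer> tag path found nothing.
    """
    blocks = [block.strip() for block in CODE_BLOCK_RE.findall(text)]
    blocks = [block for block in blocks if block]
    return blocks[-1] if blocks else None
```

Fences tagged with other languages are not handled here and could confuse the match, so a real implementation would likely need stricter parsing.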
…pu_mem)

The full runs cleared chunks 1-2 then OOM'd on chunk 3: across chunks the native
HS capture leaves reserved-but-unallocated GPU memory that fragments until a
large alloc fails (the error itself suggests expandable_segments). Set
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the launcher and drop
gpu_memory_utilization 0.87 -> 0.78 to leave headroom on GPU 0 for HS + UHead.
…nk size

The expandable_segments OOM fix hung vLLM during model loading (logs frozen at
'Starting to load model' for 20+ min, zero shard progress, on two nodes).
Remove it and instead handle the cross-chunk fragmentation OOM with headroom:
gpu_memory_utilization 0.78->0.72 and checkpoint_batch_size 16->8 (~32 seqs/chunk,
the smoke-validated size).
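For reference, the chunk arithmetic behind these numbers, as a small sketch. `iter_chunks` and `sequences_per_chunk` are hypothetical helpers, not the `generate_trajectories` implementation, and N=4 is inferred from the "16 -> ~64 sequences/chunk" figure in the earlier commit.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], checkpoint_batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most checkpoint_batch_size problems."""
    if checkpoint_batch_size <= 0:
        raise ValueError("checkpoint_batch_size must be positive")
    for start in range(0, len(items), checkpoint_batch_size):
        yield items[start:start + checkpoint_batch_size]


def sequences_per_chunk(checkpoint_batch_size: int, n_trajectories: int) -> int:
    """Upper bound on sequences sent to the generator in one chunk."""
    return checkpoint_batch_size * n_trajectories


if __name__ == "__main__":
    problems = list(range(378))  # MBPP+ size quoted in this PR
    chunks = list(iter_chunks(problems, 8))
    print(len(chunks), len(chunks[-1]), sequences_per_chunk(8, 4))
```

Smaller chunks trade throughput for a lower peak in hidden-state capture memory and more frequent progressive saves.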
The shared-store K2-Think-V2 snapshot had only the 62 safetensors (no
config.json / tokenizer). Online, vLLM stalled trying to fetch the missing
config from HF on a flaky compute-node network (looked like a load hang);
offline it failed fast. With the config/tokenizer now in the cache,
HF_HUB_OFFLINE=1 makes weight loading fully local and deterministic.
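A pre-flight check along these lines would have caught the incomplete snapshot before launch. This is a hedged sketch: both helper names are hypothetical, and `tokenizer_config.json` is an assumed stand-in for "the tokenizer files" (the actual file set for this model may differ).

```python
import os
from pathlib import Path
from typing import List, MutableMapping, Optional, Sequence

# Assumed minimum set of non-weight files; adjust to the model's real layout.
REQUIRED_FILES: Sequence[str] = ("config.json", "tokenizer_config.json")


def missing_snapshot_files(
    snapshot_dir: str, required: Sequence[str] = REQUIRED_FILES
) -> List[str]:
    """List required files that are absent from a local model snapshot."""
    root = Path(snapshot_dir)
    return [name for name in required if not (root / name).is_file()]


def enable_offline_if_complete(
    snapshot_dir: str, env: Optional[MutableMapping[str, str]] = None
) -> bool:
    """Set HF_HUB_OFFLINE=1 only when the snapshot has everything needed.

    Returns False, and leaves env untouched, if any file is missing.
    """
    env = os.environ if env is None else env
    if missing_snapshot_files(snapshot_dir):
        return False
    env["HF_HUB_OFFLINE"] = "1"
    return True
```

The variable generally needs to be set before the Hugging Face libraries are imported, which is why exporting it in the launcher script is the safer place.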
…to 16384

Model loaded but KV-cache init failed: 'Available KV cache memory: -8.89 GiB'.
Lowering gpu_memory_utilization to 0.72 starved the KV cache: the non-KV
profiling peak (~77 GiB) exceeds the memory budget that 0.72 allows. The real
lever is max_model_len: it sets both the profiling peak (KV init) and
per-sequence generation memory (the chunk-3 OOM). Medium/low generations are
short (~2745/~500 tokens), so cut max_context_budget 32768->16384 (and
max_new_tokens 32000->15000) and set gpu_memory_utilization 0.82.
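The negative KV figure follows from simple budget arithmetic, sketched below under a simplified model: memory left for KV cache is the utilization budget minus the non-KV peak measured during profiling. The function names are hypothetical and the 100 GiB total is a round illustrative number, not the measured capacity of the GPUs used here.

```python
def kv_cache_budget_gib(
    total_gpu_gib: float,
    gpu_memory_utilization: float,
    non_kv_peak_gib: float,
) -> float:
    """Memory left for KV cache under a simplified budget model.

    budget = total * utilization; KV gets whatever the non-KV peak
    (weights plus profiling activations) does not consume.
    """
    return total_gpu_gib * gpu_memory_utilization - non_kv_peak_gib


def min_utilization_for_kv(
    total_gpu_gib: float, non_kv_peak_gib: float, kv_needed_gib: float
) -> float:
    """Smallest utilization that still leaves kv_needed_gib for KV cache."""
    return (non_kv_peak_gib + kv_needed_gib) / total_gpu_gib


if __name__ == "__main__":
    # Illustrative only: a ~77 GiB non-KV peak on a hypothetical 100 GiB device.
    print(kv_cache_budget_gib(100.0, 0.72, 77.0))  # negative: KV init would fail
    print(min_utilization_for_kv(100.0, 77.0, 0.0))
```

Because the non-KV peak itself grows with max_model_len, shrinking the context budget lowers the subtracted term, whereas lowering utilization only shrinks the budget.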