
[codex] add sampled-token OPD recipe#391

Open
renfeichen-fw wants to merge 3 commits into main from codex/opd-loop


renfeichen-fw (Contributor) commented Apr 25, 2026

Result

This PR adds a privileged-prompt OPD loop and validates it end-to-end on a real GSM8K-style privileged-context run with the same Qwen 3.5 9B model used as both student and frozen teacher.

The live run completed all requested OPD steps and showed the main teacher-trace logprob gap moving in the right direction:

| Metric | Start | End | Result |
| --- | --- | --- | --- |
| Teacher-trace NLL gap (student minus teacher) | 0.2137 | 0.1684 | 21.2% lower |
| Prompt groups sampled | 200 | 200 | 0 filtered, 0 failed |
| OPD optimizer steps | 25 | 25 | completed |
| Active OPD tokens | - | 1,371,235 | trained over packed response tokens |

The key signal is the teacher-trace gap: the frozen teacher sees the privileged prompt, the student sees the normal prompt, and we score the same teacher trace under both prompts. Lower means the online student is getting closer to the privileged teacher on the target response distribution.
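To make the metric concrete, the gap can be sketched as below. This is a minimal illustration assuming per-token logprobs are already available from scoring the trace; the function name and signature are hypothetical, not the recipe's actual API.

```python
def teacher_trace_nll_gap(student_logprobs, teacher_logprobs):
    """Mean per-token NLL gap on the *same* teacher trace.

    teacher_logprobs: trace scored by the frozen teacher (privileged prompt)
    student_logprobs: same trace scored by the online student (normal prompt)
    Lower is better: the student is matching the privileged teacher.
    """
    assert len(student_logprobs) == len(teacher_logprobs) > 0
    n = len(student_logprobs)
    student_nll = -sum(student_logprobs) / n
    teacher_nll = -sum(teacher_logprobs) / n
    return student_nll - teacher_nll
```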

What Changed

  • The teacher path now supports privileged prompts instead of scoring the exact student prompt.
  • Teacher and rollout logprobs are validated on active response tokens instead of silently padding missing values.
  • OPD sampling/eval helpers were split out of the recipe so opd_loop.py is easier to audit.
  • The example flow uses a real GSM8K-style privileged-context dataset path instead of synthetic memorization rows.
  • Teacher deployment can be created without hot load, while the online student deployment keeps the normal weight-sync path.

Live Run

  • Run id: gsm8k-opd-qwen3p5-9b-256k-202604250928-attached
  • Training shape: accounts/fireworks/trainingShapes/qwen3p5-9b-256k
  • Base model for both student and teacher: Qwen 3.5 9B
  • Teacher: frozen deployment with privileged prompt
  • Student: online deployment synced from trainer
  • Batch shape: 8 prompt groups per step, 4 completions per prompt, 25 OPD steps
  • Completion cap: 2048 tokens, reasoning preserved
  • Launch path: SDK / pyroworks-attached flow, no firectl launch/kill mutation commands

Teacher-Trace Gap

Lower is better. This is the strongest OPD validation signal because it compares student vs frozen privileged teacher logprobs on the same teacher trace.

```mermaid
xychart-beta
  title "Teacher-trace gap: student NLL - teacher NLL"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "NLL gap, lower is better" 0.16 --> 0.31
  line [0.2137, 0.3015, 0.2907, 0.2715, 0.2714, 0.2729, 0.2624, 0.2510, 0.2440, 0.2406, 0.2416, 0.2392, 0.2351, 0.2312, 0.2251, 0.2202, 0.2150, 0.2118, 0.2086, 0.2033, 0.1978, 0.1909, 0.1841, 0.1778, 0.1726, 0.1684]
```

Final-Answer Token Gap

This isolates the final-answer token span on the teacher traces. It moves below zero early, meaning the trained student assigns at least as much probability as the privileged teacher to the final-answer tokens on these traces.

```mermaid
xychart-beta
  title "Final-answer token gap"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "NLL gap, lower is better" -0.05 --> 0.22
  line [0.1019, 0.0758, 0.2063, 0.0051, -0.0365, -0.0371, -0.0372, -0.0372, -0.0372, -0.0371, -0.0371, -0.0371, -0.0371, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0371, -0.0370, -0.0369, -0.0366]
```
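Restricting the gap to the answer span is straightforward. A minimal sketch, assuming per-token logprobs for the full trace plus a known half-open `[start, end)` token span for the final answer (the helper name is illustrative, not the recipe's API):

```python
def span_nll_gap(student_logprobs, teacher_logprobs, answer_span):
    """NLL gap restricted to the final-answer token span [start, end).

    A negative value means the student assigns *more* probability than the
    privileged teacher to those answer tokens on this trace.
    """
    start, end = answer_span
    s = student_logprobs[start:end]
    t = teacher_logprobs[start:end]
    assert len(s) == len(t) > 0, "span must be non-empty and in range"
    n = len(s)
    return (-sum(s) / n) - (-sum(t) / n)
```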

Greedy Generation Accuracy

Exact-match generation is noisy on this small validation set, so this should not be treated as the primary OPD metric. It does show intermittent improvement, but the final point is not stable enough to claim deployed generation quality from this run alone.

```mermaid
xychart-beta
  title "Student greedy exact-match accuracy"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "accuracy" 0 --> 1
  line [0.500, 0.000, 0.125, 1.000, 0.000, 0.000, 0.000, 0.125, 0.000, 0.000, 0.000, 0.000, 0.125, 0.250, 0.375, 0.125, 0.000, 0.125, 0.500, 0.250, 0.125, 0.250, 0.500, 0.000, 0.000, 0.000]
```

On-Policy Sampled Reverse KL

This is the training-time OPD loss signal from sampled online completions. It is noisier than teacher-trace eval because the sampled responses change every step.

```mermaid
xychart-beta
  title "Sampled OPD reverse KL"
  x-axis "step" [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "reverse KL" 0.08 --> 0.16
  line [0.0894, 0.1503, 0.1561, 0.1266, 0.1015, 0.1201, 0.1331, 0.1214, 0.1221, 0.1027, 0.1090, 0.0978, 0.0906, 0.0979, 0.0965, 0.1078, 0.0966, 0.0976, 0.0974, 0.0995, 0.0911, 0.0969, 0.0860, 0.0884, 0.0886]
```
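For reference, a per-token Monte Carlo estimate of reverse KL on sampled tokens looks like the sketch below. Since tokens are sampled from the online student, the single-sample estimate at each position is `log p_student - log p_teacher`, averaged over active response tokens only. Names and the mask convention are assumptions for illustration, not the recipe's implementation.

```python
def sampled_reverse_kl(student_logprobs, teacher_logprobs, active_mask):
    """Estimate KL(student || teacher) from tokens sampled by the student.

    active_mask: 1 for active response tokens, 0 for prompt/padding tokens.
    Prompt and padding positions are excluded from the average.
    """
    total, count = 0.0, 0
    for s, t, m in zip(student_logprobs, teacher_logprobs, active_mask):
        if m:
            total += s - t
            count += 1
    return total / count if count else 0.0
```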

Active OPD Tokens

Each optimizer step receives packed datums rather than a single sample. The run trained over 1.37M active response tokens across the 25 steps.

```mermaid
xychart-beta
  title "Active OPD tokens per step"
  x-axis "step" [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "tokens" 35000 --> 66000
  bar [51383, 65095, 58537, 37109, 65536, 53392, 54190, 52371, 57527, 53531, 60109, 52201, 55755, 51058, 56166, 48333, 62156, 61324, 58282, 56972, 54009, 52330, 52431, 46363, 55075]
```
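Packing of this kind can be sketched as a greedy first-fit-decreasing bin pack over response token counts. This is illustrative only; the 65,536 capacity is an assumption read off the per-step token chart, not a confirmed config value, and the real packer also has to carry token ids and masks, not just counts.

```python
def pack_responses(token_counts, capacity=65536):
    """Greedy first-fit-decreasing: pack response token counts into datums.

    Each bin stands for one packed datum fed to an optimizer step; the
    per-step active-token totals are the bin sums.
    """
    bins = []  # each bin: {"used": total tokens, "items": member lengths}
    for n in sorted(token_counts, reverse=True):
        if n > capacity:
            raise ValueError(f"response of {n} tokens exceeds capacity")
        for b in bins:
            if b["used"] + n <= capacity:
                b["used"] += n
                b["items"].append(n)
                break
        else:
            bins.append({"used": n, "items": [n]})
    return bins
```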

Validation

  • Unit suite: 483 passed, 32 skipped
  • Focused OPD tests: 28 passed
  • Static diff check: git diff --check clean
  • Live OPD run: completed 25/25 steps with 200/200 prompt groups sampled and 0 failed/filtered

@renfeichen-fw renfeichen-fw marked this pull request as ready for review April 25, 2026 22:12