
[codex] add sampled-token OPD recipe#391

Open
renfeichen-fw wants to merge 3 commits into main from codex/opd-loop


renfeichen-fw (Contributor) commented Apr 25, 2026

Result

This PR adds a privileged-prompt OPD loop and validates it end-to-end on a real GSM8K-style privileged-context run with the same Qwen 3.5 9B model used as both student and frozen teacher.

The live run completed all requested OPD steps and showed the main teacher-trace logprob gap moving in the right direction:

| Metric | Start | End | Result |
| --- | --- | --- | --- |
| Teacher-trace NLL gap (student minus teacher) | 0.2137 | 0.1684 | 21.2% lower |
| Prompt groups sampled | 200 | 200 | 0 filtered, 0 failed |
| OPD optimizer steps | 25 | 25 | completed |
| Active OPD tokens | - | 1,371,235 | trained over packed response tokens |

The key signal is the teacher-trace gap: the frozen teacher sees the privileged prompt, the student sees the normal prompt, and we score the same teacher trace under both prompts. Lower means the online student is getting closer to the privileged teacher on the target response distribution.
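To make the metric concrete, the gap can be sketched as below. This is a minimal illustration assuming per-token logprobs are already available from scoring the trace; the function name and signature are hypothetical, not the recipe's actual API.

```python
def teacher_trace_nll_gap(student_logprobs, teacher_logprobs):
    """Mean per-token NLL gap on the *same* teacher trace.

    teacher_logprobs: trace scored by the frozen teacher (privileged prompt)
    student_logprobs: same trace scored by the online student (normal prompt)
    Lower is better: the student is matching the privileged teacher.
    """
    assert len(student_logprobs) == len(teacher_logprobs) > 0
    n = len(student_logprobs)
    student_nll = -sum(student_logprobs) / n
    teacher_nll = -sum(teacher_logprobs) / n
    return student_nll - teacher_nll
```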

What Changed

  • The teacher path now supports privileged prompts instead of scoring the exact student prompt.
  • Teacher and rollout logprobs are validated on active response tokens instead of silently padding missing values.
  • OPD sampling/eval helpers were split out of the recipe so opd_loop.py is easier to audit.
  • The example flow uses a real GSM8K-style privileged-context dataset path instead of synthetic memorization rows.
  • Teacher deployment can be created without hot load, while the online student deployment keeps the normal weight-sync path.

Live Run

  • Run id: gsm8k-opd-qwen3p5-9b-256k-202604250928-attached
  • Training shape: accounts/fireworks/trainingShapes/qwen3p5-9b-256k
  • Base model for both student and teacher: Qwen 3.5 9B
  • Teacher: frozen deployment with privileged prompt
  • Student: online deployment synced from trainer
  • Batch shape: 8 prompt groups per step, 4 completions per prompt, 25 OPD steps
  • Completion cap: 2048 tokens, reasoning preserved
  • Launch path: SDK / pyroworks-attached flow, no firectl launch/kill mutation commands

Teacher-Trace Gap

Lower is better. This is the strongest OPD validation signal because it compares student vs frozen privileged teacher logprobs on the same teacher trace.

```mermaid
xychart-beta
  title "Teacher-trace gap: student NLL - teacher NLL"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "NLL gap, lower is better" 0.16 --> 0.31
  line [0.2137, 0.3015, 0.2907, 0.2715, 0.2714, 0.2729, 0.2624, 0.2510, 0.2440, 0.2406, 0.2416, 0.2392, 0.2351, 0.2312, 0.2251, 0.2202, 0.2150, 0.2118, 0.2086, 0.2033, 0.1978, 0.1909, 0.1841, 0.1778, 0.1726, 0.1684]
```

Final-Answer Token Gap

This isolates the final-answer token span on the teacher traces. It moves below zero early, meaning the trained student assigns at least as much probability as the privileged teacher to the final-answer tokens on these traces.

```mermaid
xychart-beta
  title "Final-answer token gap"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "NLL gap, lower is better" -0.05 --> 0.22
  line [0.1019, 0.0758, 0.2063, 0.0051, -0.0365, -0.0371, -0.0372, -0.0372, -0.0372, -0.0371, -0.0371, -0.0371, -0.0371, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0372, -0.0371, -0.0370, -0.0369, -0.0366]
```
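Restricting the gap to the answer span is straightforward. A minimal sketch, assuming per-token logprobs for the full trace plus a known half-open `[start, end)` token span for the final answer (the helper name is illustrative, not the recipe's API):

```python
def span_nll_gap(student_logprobs, teacher_logprobs, answer_span):
    """NLL gap restricted to the final-answer token span [start, end).

    A negative value means the student assigns *more* probability than the
    privileged teacher to those answer tokens on this trace.
    """
    start, end = answer_span
    s = student_logprobs[start:end]
    t = teacher_logprobs[start:end]
    assert len(s) == len(t) > 0, "span must be non-empty and in range"
    n = len(s)
    return (-sum(s) / n) - (-sum(t) / n)
```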

Greedy Generation Accuracy

Exact-match generation is noisy on this small validation set, so this should not be treated as the primary OPD metric. It does show intermittent improvement, but the final point is not stable enough to claim deployed generation quality from this run alone.

```mermaid
xychart-beta
  title "Student greedy exact-match accuracy"
  x-axis "step" [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "accuracy" 0 --> 1
  line [0.500, 0.000, 0.125, 1.000, 0.000, 0.000, 0.000, 0.125, 0.000, 0.000, 0.000, 0.000, 0.125, 0.250, 0.375, 0.125, 0.000, 0.125, 0.500, 0.250, 0.125, 0.250, 0.500, 0.000, 0.000, 0.000]
```

On-Policy Sampled Reverse KL

This is the training-time OPD loss signal from sampled online completions. It is noisier than teacher-trace eval because the sampled responses change every step.

```mermaid
xychart-beta
  title "Sampled OPD reverse KL"
  x-axis "step" [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "reverse KL" 0.08 --> 0.16
  line [0.0894, 0.1503, 0.1561, 0.1266, 0.1015, 0.1201, 0.1331, 0.1214, 0.1221, 0.1027, 0.1090, 0.0978, 0.0906, 0.0979, 0.0965, 0.1078, 0.0966, 0.0976, 0.0974, 0.0995, 0.0911, 0.0969, 0.0860, 0.0884, 0.0886]
```
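For reference, a per-token Monte Carlo estimate of reverse KL on sampled tokens looks like the sketch below. Since tokens are sampled from the online student, the single-sample estimate at each position is `log p_student - log p_teacher`, averaged over active response tokens only. Names and the mask convention are assumptions for illustration, not the recipe's implementation.

```python
def sampled_reverse_kl(student_logprobs, teacher_logprobs, active_mask):
    """Estimate KL(student || teacher) from tokens sampled by the student.

    active_mask: 1 for active response tokens, 0 for prompt/padding tokens.
    Prompt and padding positions are excluded from the average.
    """
    total, count = 0.0, 0
    for s, t, m in zip(student_logprobs, teacher_logprobs, active_mask):
        if m:
            total += s - t
            count += 1
    return total / count if count else 0.0
```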

Active OPD Tokens

Each optimizer step receives packed datums rather than a single sample. The run trained over 1.37M active response tokens across the 25 steps.

```mermaid
xychart-beta
  title "Active OPD tokens per step"
  x-axis "step" [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
  y-axis "tokens" 35000 --> 66000
  bar [51383, 65095, 58537, 37109, 65536, 53392, 54190, 52371, 57527, 53531, 60109, 52201, 55755, 51058, 56166, 48333, 62156, 61324, 58282, 56972, 54009, 52330, 52431, 46363, 55075]
```
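Packing of this kind can be sketched as a greedy first-fit-decreasing bin pack over response token counts. This is illustrative only; the 65,536 capacity is an assumption read off the per-step token chart, not a confirmed config value, and the real packer also has to carry token ids and masks, not just counts.

```python
def pack_responses(token_counts, capacity=65536):
    """Greedy first-fit-decreasing: pack response token counts into datums.

    Each bin stands for one packed datum fed to an optimizer step; the
    per-step active-token totals are the bin sums.
    """
    bins = []  # each bin: {"used": total tokens, "items": member lengths}
    for n in sorted(token_counts, reverse=True):
        if n > capacity:
            raise ValueError(f"response of {n} tokens exceeds capacity")
        for b in bins:
            if b["used"] + n <= capacity:
                b["used"] += n
                b["items"].append(n)
                break
        else:
            bins.append({"used": n, "items": [n]})
    return bins
```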

Validation

  • Unit suite: 483 passed, 32 skipped
  • Focused OPD tests: 28 passed
  • Static diff check: git diff --check clean
  • Live OPD run: completed 25/25 steps with 200/200 prompt groups sampled and 0 failed/filtered

@renfeichen-fw renfeichen-fw marked this pull request as ready for review April 25, 2026 22:12