
feat(opd): add on-policy distillation metrics and recipe (arXiv:2604.13016)#390

Draft
renfeichen-fw wants to merge 2 commits into main from renfei/opd-metrics

Conversation


@renfeichen-fw renfeichen-fw commented Apr 24, 2026

Description

Fixes: N/A

  • Correct the pure OPD metric helpers in training/utils/opd_metrics.py to match arXiv:2604.13016:
    • overlap_advantage now computes the paper-style overlap KL term on renormalized shared top-k support.
    • overlap_mass_student / overlap_mass_teacher now use the probabilities implied by the original top-k logprobs instead of renormalizing the top-k mass to 1.
  • Update unit tests to cover the corrected overlap-advantage and overlap-mass behavior.
  • Keep the Qwen3 Section 3.2 repro script local/untracked in the Fireworks checkout; this Cookbook PR contains only reusable metric logic and tests.
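For illustration, here is a minimal sketch of what the corrected helpers compute, assuming per-position top-k logprobs keyed by token. Function names follow the PR description; the real signatures in `training/utils/opd_metrics.py` may differ.

```python
import math

def overlap_mass(logprobs: dict[str, float], shared: set[str]) -> float:
    # Sum of *original* top-k probabilities over the shared support,
    # without renormalizing the top-k mass to 1 (the behavior this PR
    # corrects for overlap_mass_student / overlap_mass_teacher).
    return sum(math.exp(logprobs[t]) for t in shared)

def overlap_advantage(student: dict[str, float], teacher: dict[str, float]) -> float:
    # Paper-style overlap KL term on the renormalized shared top-k support:
    # renormalize both distributions over the intersection of the two top-k
    # sets, then take KL(student || teacher) on that support.
    shared = student.keys() & teacher.keys()
    if not shared:
        return 0.0
    zs = sum(math.exp(student[t]) for t in shared)
    zt = sum(math.exp(teacher[t]) for t in shared)
    kl = 0.0
    for t in shared:
        ps = math.exp(student[t]) / zs
        pt = math.exp(teacher[t]) / zt
        kl += ps * math.log(ps / pt)
    return kl
```

When student and teacher agree on the shared support the overlap advantage is zero, and the overlap mass stays below 1 whenever the top-k sets diverge.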

Companion Fireworks PR: https://github.com/fw-ai/fireworks/pull/23529

Architecture / Code Overview Diagram

flowchart LR
    A["Student top-k logprobs"] --> C["training/utils/opd_metrics.py"]
    B["Teacher top-k logprobs"] --> C
    C --> D["overlap_ratio"]
    C --> E["overlap_advantage"]
    C --> F["entropy_gap"]
    C --> G["overlap_mass_student / overlap_mass_teacher"]
    C --> H["per_position_entropy/q1-q4"]
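As a rough illustration of the entropy-side metrics in the diagram, assuming the top-k logprobs are renormalized before the entropy is taken (the real helpers may treat tail mass differently):

```python
import math

def topk_entropy(logprobs: dict[str, float]) -> float:
    # Shannon entropy of the renormalized top-k distribution.
    # Whether the real helper renormalizes is an assumption here.
    z = sum(math.exp(lp) for lp in logprobs.values())
    probs = [math.exp(lp) / z for lp in logprobs.values()]
    return -sum(p * math.log(p) for p in probs)

def entropy_gap(student: dict[str, float], teacher: dict[str, float]) -> float:
    # Signed gap between student and teacher per-position entropies.
    return topk_entropy(student) - topk_entropy(teacher)
```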

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation
  • Infrastructure/DevOps

Testing

  • Added/updated tests
  • Tested manually
  • No testing needed

Commands run:

PYTHONPATH=. pytest -q training/tests/unit/test_opd_metrics.py
PYTHONPATH=train-firetitan-py pytest -q train-firetitan-py/tests/test_kl_distillation.py  # from companion Fireworks checkout
python -m py_compile train-firetitan-py/scripts/test_opd_metrics_local.py train-firetitan-py/firetitan/train/nn/kl_distillation.py

Results:

  • training/tests/unit/test_opd_metrics.py: 25 passed
  • train-firetitan-py/tests/test_kl_distillation.py: 31 passed
  • py_compile: passed

Surface Consistency

  • No customer-facing surface impact
  • Related surfaces checked — all consistent or follow-up filed
  • Inline "keep in sync" comments followed

Deployment Notes

  • Requires database migration
  • Requires config/env changes
  • Requires Terraform/K8s changes
  • No special deployment considerations

Change Size

  • Small (< 200 LOC)
  • Medium (200–999 LOC)
  • Large (≥ 1,000 LOC) — Design plan attached below

Design Plan (required for large changes)

N/A

Checklist

  • Agent-reviewed the diff before committing
  • Self-reviewed my code
  • Change is the minimum necessary diff
  • Added tests for my changes
  • Updated relevant documentation
  • No new linter warnings/errors
  • No secrets or credentials in the diff
  • Checked surface consistency for customer-facing changes
  • Visual diagram included (or change is cosmetic-only)

Additional Context

This Qwen3 repro targets Section 3.2 of the paper, which cautions that higher scores do not imply new knowledge. Thinking-pattern consistency is controlled by using Qwen3 non-thinking students and teachers throughout; the contrast is whether the teacher comes from the same pipeline or carries additional RL-Math capability.

Local model paths used:

  • /shared/text-models/Qwen3-1.7B
  • /shared/text-models/Qwen3-4B
  • /shared/text-models/Qwen3-4B-Non-Thinking-RL-Math-Step500

Small local Section 3.2 result:

| Setup | Final avg@n | Final pass@n | Overlap | Reverse KL |
| --- | --- | --- | --- | --- |
| Base Qwen3-1.7B (eval only) | 0.250 | 0.300 | 0.7387 | 0.1217 |
| OPD from Qwen3-4B (same pipeline) | 0.300 | 0.400 | 0.7397 | 0.1419 |
| OPD from RL-Math Qwen3-4B | 0.400 | 0.400 | 0.6919 | 0.3843 |

Interpretation: the same-pipeline 4B teacher gives little movement over the pretrained 1.7B baseline, while the RL-Math teacher gives a clearer accuracy lift despite lower overlap and higher reverse KL. This is a bounded local repro, not a full paper-scale run with the full DAPO-17K dataset, rollout 4, avg@16, and 7168/31744-token response limits.

Per-Step Overlap Plot

16-step local trace used for the overlap trend. Legend: first line = same-pipeline Qwen3-4B; second line = RL-Math Qwen3-4B.

| Step | Same-pipeline Qwen3-4B | RL-Math Qwen3-4B |
| --- | --- | --- |
| 0 | 0.7511 | 0.6966 |
| 8 | 0.7495 | 0.6904 |
| 16 | 0.7373 | 0.6893 |
xychart-beta
    title "Qwen3 Section 3.2 local OPD overlap"
    x-axis "step" [0, 8, 16]
    y-axis "overlap_ratio" 0.65 --> 0.78
    line [0.7511, 0.7495, 0.7373]
    line [0.6966, 0.6904, 0.6893]

Renfei added 2 commits April 24, 2026 08:18
…13016)

Add OPD dynamic metrics that predict whether on-policy distillation will
succeed or fail early in training, plus a production recipe using
deployment-based teacher scoring.

New files:
- training/utils/opd_metrics.py: core metrics (Eqs. 6-10) — overlap
  ratio, overlap advantage, entropy gap, overlap mass, per-position
  entropy. Pure functions on top-k logprobs.
- training/recipes/opd_loop.py: production OPD recipe with
  sampled-token reverse-KL loss. Teacher is a Fireworks deployment
  scored via echo+logprobs.
- training/tests/unit/test_opd_metrics.py: 22 CPU unit tests covering
  identical/disjoint/partial overlaps, edge cases, and all-in-one
  metric dict validation.
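A hedged sketch of the sampled-token reverse-KL estimator the recipe describes: at each token the student actually sampled, the difference of student and teacher logprobs is a Monte Carlo term for KL(student || teacher) along the student's own rollouts. The actual `training/recipes/opd_loop.py` loss (and the echo+logprobs teacher scoring path) may differ.

```python
def sampled_token_reverse_kl(student_logprobs: list[float],
                             teacher_logprobs: list[float]) -> float:
    # Per-token reverse-KL estimator at the sampled tokens:
    # E_{y ~ student}[log p_student(y) - log p_teacher(y)].
    # Averaging over the student's sampled tokens gives a Monte Carlo
    # estimate of KL(student || teacher) on the student's own rollouts.
    assert len(student_logprobs) == len(teacher_logprobs)
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)
```

In a training loop this quantity would be minimized per position; here the teacher logprobs would come from scoring the student's completion against the teacher deployment.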

Made-with: Cursor
@renfeichen-fw renfeichen-fw marked this pull request as draft April 25, 2026 00:11
