feat(opd): add on-policy distillation metrics and recipe (arXiv:2604.13016) #390
Draft
renfeichen-fw wants to merge 2 commits into main
added 2 commits April 24, 2026 08:18
…13016)

Add OPD dynamic metrics that predict, early in training, whether on-policy distillation will succeed or fail, plus a production recipe using deployment-based teacher scoring.

New files:
- training/utils/opd_metrics.py: core metrics (Eqs. 6-10): overlap ratio, overlap advantage, entropy gap, overlap mass, per-position entropy. Pure functions on top-k logprobs.
- training/recipes/opd_loop.py: production OPD recipe with sampled-token reverse-KL loss. Teacher is a Fireworks deployment scored via echo+logprobs.
- training/tests/unit/test_opd_metrics.py: 22 CPU unit tests covering identical/disjoint/partial overlaps, edge cases, and all-in-one metric dict validation.

Made-with: Cursor
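For readers unfamiliar with the "sampled-token reverse-KL" term in the commit message, here is a minimal sketch of the underlying estimator. It is not the code in training/recipes/opd_loop.py, which is not shown on this page. The function name, the `mask` argument, and the plain-list inputs are all hypothetical. The only fact relied on is that, for tokens sampled from the student, the mean of `student_logprob - teacher_logprob` is a Monte Carlo estimate of KL(student || teacher).

```python
def sampled_token_reverse_kl(student_logprobs, teacher_logprobs, mask=None):
    """Monte Carlo estimate of KL(student || teacher) over sampled tokens.

    student_logprobs[i]: student log-prob of the i-th token the student sampled.
    teacher_logprobs[i]: teacher log-prob of that same token.
    mask[i]: truthy for tokens to count (hypothetical; e.g. response tokens only).
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher logprobs must align token by token")
    if mask is None:
        mask = [1] * len(student_logprobs)
    if len(mask) != len(student_logprobs):
        raise ValueError("mask must have one entry per token")

    total = 0.0
    count = 0
    for s_lp, t_lp, keep in zip(student_logprobs, teacher_logprobs, mask):
        if keep:
            # Per-token log-ratio: positive when the student is more
            # confident in its own sample than the teacher is.
            total += s_lp - t_lp
            count += 1
    # Return 0.0 for an empty selection rather than dividing by zero.
    return total / count if count else 0.0
```

In a real training loop this quantity would be computed on tensors and wired into the gradient in whatever way the recipe chooses; the sketch only shows the scalar being estimated.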
Description
Fixes: N/A
Updates training/utils/opd_metrics.py to match arXiv:2604.13016:
- overlap_advantage now computes the paper-style overlap KL term on renormalized shared top-k support.
- overlap_mass_student / overlap_mass_teacher now use the original top-k logprob probabilities instead of renormalizing top-k mass to 1.

Companion Fireworks PR: https://github.com/fw-ai/fireworks/pull/23529
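The two changes above can be illustrated with a small sketch for a single token position. This is one plausible reading of the description, not the implementation in opd_metrics.py. The function name, the dict-based inputs, the `|shared| / k` form of the overlap ratio, and the KL direction (student relative to teacher) are all assumptions. How the KL term maps onto the sign and scale of `overlap_advantage` is not specified here, so the sketch reports it under the neutral key `overlap_kl`.

```python
import math


def overlap_metrics(student_topk, teacher_topk):
    """Compare one position's top-k logprobs from student and teacher.

    Each argument maps token id -> logprob and holds only that model's top-k
    entries, so each dict's probabilities may sum to less than 1.
    """
    shared = sorted(set(student_topk) & set(teacher_topk))
    k = max(len(student_topk), len(teacher_topk))
    ratio = len(shared) / k if k else 0.0

    # Overlap mass: sum the original probabilities on the shared tokens,
    # with no renormalization of the top-k mass to 1.
    mass_student = sum(math.exp(student_topk[t]) for t in shared)
    mass_teacher = sum(math.exp(teacher_topk[t]) for t in shared)

    # Overlap KL: renormalize both models over the shared support only,
    # then take KL(student || teacher). The direction is an assumption.
    overlap_kl = 0.0
    if mass_student > 0.0 and mass_teacher > 0.0:
        log_z_student = math.log(mass_student)
        log_z_teacher = math.log(mass_teacher)
        for t in shared:
            log_p = student_topk[t] - log_z_student
            log_q = teacher_topk[t] - log_z_teacher
            overlap_kl += math.exp(log_p) * (log_p - log_q)

    return {
        "overlap_ratio": ratio,
        "overlap_mass_student": mass_student,
        "overlap_mass_teacher": mass_teacher,
        "overlap_kl": overlap_kl,
    }
```

The point of the separation is that the mass terms answer "how much of each model's probability sits on tokens both models rank highly", while the KL term answers "given the shared tokens, how differently do the two models rank them". Renormalizing before summing the mass would make the first question trivially return 1.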
Architecture / Code Overview Diagram
flowchart LR
    A["Student top-k logprobs"] --> C["training/utils/opd_metrics.py"]
    B["Teacher top-k logprobs"] --> C
    C --> D["overlap_ratio"]
    C --> E["overlap_advantage"]
    C --> F["entropy_gap"]
    C --> G["overlap_mass_student / overlap_mass_teacher"]
    C --> H["per_position_entropy/q1-q4"]

Type of Change
Testing
Commands run:
PYTHONPATH=. pytest -q training/tests/unit/test_opd_metrics.py
PYTHONPATH=train-firetitan-py pytest -q train-firetitan-py/tests/test_kl_distillation.py  # from companion Fireworks checkout
python -m py_compile train-firetitan-py/scripts/test_opd_metrics_local.py train-firetitan-py/firetitan/train/nn/kl_distillation.py

Results:
- training/tests/unit/test_opd_metrics.py: 25 passed
- train-firetitan-py/tests/test_kl_distillation.py: 31 passed
- py_compile: passed

Surface Consistency
Deployment Notes
Change Size
Design Plan (required for large changes)
N/A
Checklist
Additional Context
This Qwen3 repro targets Section 3.2 of the paper: higher scores do not imply new knowledge. Thinking-pattern consistency is controlled by using Qwen3 non-thinking student/teachers in all cases; the contrast is whether the teacher comes from the same pipeline or has additional RL-Math capability.
Local model paths used:
- /shared/text-models/Qwen3-1.7B
- /shared/text-models/Qwen3-4B
- /shared/text-models/Qwen3-4B-Non-Thinking-RL-Math-Step500

Small local Section 3.2 result:
Interpretation: the same-pipeline 4B teacher gives little movement over the pretrained 1.7B baseline, while the RL-Math teacher gives a clearer accuracy lift despite lower overlap and higher KL. This is a bounded local repro, not a full paper-scale run with full DAPO-17K, rollout 4, avg@16, and long 7168/31744 response limits.
Per-Step Overlap Plot
16-step local trace used for the overlap trend. Legend: first line = same-pipeline Qwen3-4B; second line = RL-Math Qwen3-4B.
xychart-beta
    title "Qwen3 Section 3.2 local OPD overlap"
    x-axis "step" [0, 8, 16]
    y-axis "overlap_ratio" 0.65 --> 0.78
    line [0.7511, 0.7495, 0.7373]
    line [0.6966, 0.6904, 0.6893]
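As a quick arithmetic check on the trend, the snippet below recomputes the net change and the teacher-to-teacher gap from the six values plotted above. The variable and function names are illustrative only; the numbers are copied from the chart and nothing else is assumed.

```python
steps = [0, 8, 16]
same_pipeline = [0.7511, 0.7495, 0.7373]  # first line: same-pipeline Qwen3-4B
rl_math = [0.6966, 0.6904, 0.6893]        # second line: RL-Math Qwen3-4B


def net_change(series):
    """Change in overlap_ratio from the first logged step to the last."""
    return series[-1] - series[0]


same_delta = net_change(same_pipeline)
rl_delta = net_change(rl_math)

# Gap between the two teachers at each logged step (same-pipeline minus RL-Math).
gaps = [s - r for s, r in zip(same_pipeline, rl_math)]

for step, gap in zip(steps, gaps):
    print(f"step {step:2d}: gap = {gap:.4f}")
print(f"same-pipeline net change: {same_delta:+.4f}")
print(f"RL-Math net change:       {rl_delta:+.4f}")
```

Both traces drift down slightly over the 16 steps (by about 0.0138 and 0.0073), and the same-pipeline teacher stays roughly 0.05 to 0.06 higher in overlap_ratio at every logged step, consistent with the "lower overlap" noted for the RL-Math teacher.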