
feat(opd): add on-policy distillation metrics and recipe (arXiv:2604.13016)#390

Draft
renfeichen-fw wants to merge 2 commits into main from renfei/opd-metrics

Conversation


@renfeichen-fw renfeichen-fw commented Apr 24, 2026

Description

Fixes: N/A

  • Correct the pure OPD metric helpers in training/utils/opd_metrics.py to match arXiv:2604.13016:
    • overlap_advantage now computes the paper-style overlap KL term on renormalized shared top-k support.
    • overlap_mass_student / overlap_mass_teacher now use the probabilities implied by the original top-k logprobs instead of renormalizing the top-k mass to 1.
  • Update unit tests to cover the corrected overlap-advantage and overlap-mass behavior.
  • Keep the Qwen3 Section 3.2 repro script local/untracked in the Fireworks checkout; this Cookbook PR contains only reusable metric logic and tests.
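For illustration, here is a minimal sketch of what the corrected helpers compute, assuming per-position top-k logprobs keyed by token. Function names follow the PR description; the real signatures in `training/utils/opd_metrics.py` may differ.

```python
import math

def overlap_mass(logprobs: dict[str, float], shared: set[str]) -> float:
    # Sum of *original* top-k probabilities over the shared support,
    # without renormalizing the top-k mass to 1 (the behavior this PR
    # corrects for overlap_mass_student / overlap_mass_teacher).
    return sum(math.exp(logprobs[t]) for t in shared)

def overlap_advantage(student: dict[str, float], teacher: dict[str, float]) -> float:
    # Paper-style overlap KL term on the renormalized shared top-k support:
    # renormalize both distributions over the intersection of the two top-k
    # sets, then take KL(student || teacher) on that support.
    shared = student.keys() & teacher.keys()
    if not shared:
        return 0.0
    zs = sum(math.exp(student[t]) for t in shared)
    zt = sum(math.exp(teacher[t]) for t in shared)
    kl = 0.0
    for t in shared:
        ps = math.exp(student[t]) / zs
        pt = math.exp(teacher[t]) / zt
        kl += ps * math.log(ps / pt)
    return kl
```

When student and teacher agree on the shared support the overlap advantage is zero, and the overlap mass stays below 1 whenever the top-k sets diverge.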

Companion Fireworks PR: https://github.com/fw-ai/fireworks/pull/23529

Architecture / Code Overview Diagram

flowchart LR
    A["Student top-k logprobs"] --> C["training/utils/opd_metrics.py"]
    B["Teacher top-k logprobs"] --> C
    C --> D["overlap_ratio"]
    C --> E["overlap_advantage"]
    C --> F["entropy_gap"]
    C --> G["overlap_mass_student / overlap_mass_teacher"]
    C --> H["per_position_entropy/q1-q4"]
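As a rough illustration of the entropy-side metrics in the diagram, assuming the top-k logprobs are renormalized before the entropy is taken (the real helpers may treat tail mass differently):

```python
import math

def topk_entropy(logprobs: dict[str, float]) -> float:
    # Shannon entropy of the renormalized top-k distribution.
    # Whether the real helper renormalizes is an assumption here.
    z = sum(math.exp(lp) for lp in logprobs.values())
    probs = [math.exp(lp) / z for lp in logprobs.values()]
    return -sum(p * math.log(p) for p in probs)

def entropy_gap(student: dict[str, float], teacher: dict[str, float]) -> float:
    # Signed gap between student and teacher per-position entropies.
    return topk_entropy(student) - topk_entropy(teacher)
```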

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation
  • Infrastructure/DevOps

Testing

  • Added/updated tests
  • Tested manually
  • No testing needed

Commands run:

PYTHONPATH=. pytest -q training/tests/unit/test_opd_metrics.py
PYTHONPATH=train-firetitan-py pytest -q train-firetitan-py/tests/test_kl_distillation.py  # from companion Fireworks checkout
python -m py_compile train-firetitan-py/scripts/test_opd_metrics_local.py train-firetitan-py/firetitan/train/nn/kl_distillation.py

Results:

  • training/tests/unit/test_opd_metrics.py: 25 passed
  • train-firetitan-py/tests/test_kl_distillation.py: 31 passed
  • py_compile: passed

Surface Consistency

  • No customer-facing surface impact
  • Related surfaces checked — all consistent or follow-up filed
  • Inline "keep in sync" comments followed

Deployment Notes

  • Requires database migration
  • Requires config/env changes
  • Requires Terraform/K8s changes
  • No special deployment considerations

Change Size

  • Small (< 200 LOC)
  • Medium (200–999 LOC)
  • Large (≥ 1,000 LOC) — Design plan attached below

Design Plan (required for large changes)

N/A

Checklist

  • Agent-reviewed the diff before committing
  • Self-reviewed my code
  • Change is the minimum necessary diff
  • Added tests for my changes
  • Updated relevant documentation
  • No new linter warnings/errors
  • No secrets or credentials in the diff
  • Checked surface consistency for customer-facing changes
  • Visual diagram included (or change is cosmetic-only)

Additional Context

This Qwen3 repro targets Section 3.2 of the paper, which cautions that higher scores do not imply new knowledge. Thinking-pattern consistency is controlled by using Qwen3 non-thinking students and teachers throughout; the contrast is whether the teacher comes from the same pipeline or carries additional RL-Math capability.

Local model paths used:

  • /shared/text-models/Qwen3-1.7B
  • /shared/text-models/Qwen3-4B
  • /shared/text-models/Qwen3-4B-Non-Thinking-RL-Math-Step500

Small local Section 3.2 result:

| Setup | Final avg@n | Final pass@n | Overlap | Reverse KL |
| --- | --- | --- | --- | --- |
| Base Qwen3-1.7B (eval only) | 0.250 | 0.300 | 0.7387 | 0.1217 |
| OPD from Qwen3-4B (same pipeline) | 0.300 | 0.400 | 0.7397 | 0.1419 |
| OPD from RL-Math Qwen3-4B | 0.400 | 0.400 | 0.6919 | 0.3843 |

Interpretation: the same-pipeline 4B teacher gives little movement over the pretrained 1.7B baseline, while the RL-Math teacher gives a clearer accuracy lift despite lower overlap and higher reverse KL. This is a bounded local repro, not a full paper-scale run with the full DAPO-17K dataset, rollout 4, avg@16, and 7168/31744-token response limits.

Per-Step Overlap Plot

16-step local trace used for the overlap trend. Legend: first line = same-pipeline Qwen3-4B; second line = RL-Math Qwen3-4B.

| Step | Same-pipeline Qwen3-4B | RL-Math Qwen3-4B |
| --- | --- | --- |
| 0 | 0.7511 | 0.6966 |
| 8 | 0.7495 | 0.6904 |
| 16 | 0.7373 | 0.6893 |
xychart-beta
    title "Qwen3 Section 3.2 local OPD overlap"
    x-axis "step" [0, 8, 16]
    y-axis "overlap_ratio" 0.65 --> 0.78
    line [0.7511, 0.7495, 0.7373]
    line [0.6966, 0.6904, 0.6893]

Renfei added 2 commits April 24, 2026 08:18
…13016)

Add OPD dynamic metrics that predict whether on-policy distillation will
succeed or fail early in training, plus a production recipe using
deployment-based teacher scoring.

New files:
- training/utils/opd_metrics.py: core metrics (Eqs. 6-10) — overlap
  ratio, overlap advantage, entropy gap, overlap mass, per-position
  entropy. Pure functions on top-k logprobs.
- training/recipes/opd_loop.py: production OPD recipe with
  sampled-token reverse-KL loss. Teacher is a Fireworks deployment
  scored via echo+logprobs.
- training/tests/unit/test_opd_metrics.py: 22 CPU unit tests covering
  identical/disjoint/partial overlaps, edge cases, and all-in-one
  metric dict validation.
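A hedged sketch of the sampled-token reverse-KL estimator the recipe describes: at each token the student actually sampled, the difference of student and teacher logprobs is a Monte Carlo term for KL(student || teacher) along the student's own rollouts. The actual `training/recipes/opd_loop.py` loss (and the echo+logprobs teacher scoring path) may differ.

```python
def sampled_token_reverse_kl(student_logprobs: list[float],
                             teacher_logprobs: list[float]) -> float:
    # Per-token reverse-KL estimator at the sampled tokens:
    # E_{y ~ student}[log p_student(y) - log p_teacher(y)].
    # Averaging over the student's sampled tokens gives a Monte Carlo
    # estimate of KL(student || teacher) on the student's own rollouts.
    assert len(student_logprobs) == len(teacher_logprobs)
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)
```

In a training loop this quantity would be minimized per position; here the teacher logprobs would come from scoring the student's completion against the teacher deployment.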

Made-with: Cursor
@renfeichen-fw renfeichen-fw marked this pull request as draft April 25, 2026 00:11
