feat: add RLSD (Self-Distilled RLVR) credit assignment #333

Open

morgendave wants to merge 1 commit into main from cursor/rlsd-credit-assignment-19da

Conversation

@morgendave
Contributor

Adds composable per-token credit weighting from self-distillation evidence ratios (arXiv 2604.03128). RLSD modulates the per-token advantage magnitude using the teacher-student log-prob ratio:

w_t = clip(exp(sign(A_t) · (log P_teacher - log P_student)), 1 - ε_w, 1 + ε_w)
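A minimal sketch of that rule in PyTorch, assuming per-token tensors and a hypothetical signature (the actual `compute_rlsd_weights()` lives in training/utils/rl/rlsd.py and may differ):

```python
import torch

def compute_rlsd_weights(
    teacher_logprobs: torch.Tensor,  # log P_teacher per token, shape [T]
    student_logprobs: torch.Tensor,  # log P_student per token, shape [T]
    advantages: torch.Tensor,        # per-token advantages A_t, shape [T]
    eps_w: float = 0.2,              # clip half-width ε_w (illustrative default)
) -> torch.Tensor:
    # Evidence ratio in log space, signed by the advantage direction:
    # for A_t > 0, teacher-favored tokens get weight > 1 (boosted credit);
    # for A_t < 0 the sign flips, so teacher-favored tokens get weight < 1
    # (their penalty is shrunk).
    log_ratio = teacher_logprobs - student_logprobs
    weights = torch.exp(torch.sign(advantages) * log_ratio)
    # Clip to [1 - ε_w, 1 + ε_w] so the weight stays a bounded modulation.
    return torch.clamp(weights, 1.0 - eps_w, 1.0 + eps_w)
```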

Key design: RLSD is a weight that composes with any existing loss (GRPO, DAPO, CISPO, etc.) — just like TIS. No new loss function.
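Because it is just a per-token multiplier, wiring it in is one elementwise multiply inside the existing loss loop. A hedged sketch with illustrative names (the PR's real plumbing goes through SampleContext.rlsd_weight and run_loss_loop instead):

```python
import torch

def grpo_loss_with_rlsd(student_lp, old_lp, teacher_lp, advantages,
                        eps: float = 0.2, eps_w: float = 0.2) -> torch.Tensor:
    # Standard clipped surrogate (GRPO/PPO-style); names are illustrative.
    ratio = torch.exp(student_lp - old_lp)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    )
    # RLSD composes multiplicatively with the surrogate, just like TIS.
    rlsd_weight = compute_rlsd_weights(teacher_lp, student_lp, advantages, eps_w)
    return -(rlsd_weight * surrogate).mean()
```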

Files:

- training/utils/rl/rlsd.py: RLSDConfig + compute_rlsd_weights()
- training/utils/rl/common.py: SampleContext.rlsd_weight + run_loss_loop plumbing
- training/utils/rl/grpo.py: multiply rlsd_weight into surrogate loss
- training/utils/rl/__init__.py: export RLSDConfig
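For completeness, a hypothetical end-to-end toggle, under the assumption that RLSDConfig carries an enable flag and the clip half-width (the real dataclass fields are defined in rlsd.py):

```python
import torch
from dataclasses import dataclass

@dataclass
class RLSDConfig:          # assumed shape; see training/utils/rl/rlsd.py for the real fields
    enabled: bool = False
    eps_w: float = 0.2     # ε_w clip half-width

cfg = RLSDConfig(enabled=True, eps_w=0.2)
rlsd_weight = (
    compute_rlsd_weights(teacher_lp, student_lp, advantages, eps_w=cfg.eps_w)
    if cfg.enabled
    else torch.ones_like(advantages)   # a weight of 1 leaves the base loss unchanged
)
```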
