docs(diffusion RL v0.1): add single-prompt multi-generation guide #25

Draft
MikukuOvO wants to merge 2 commits into diffusion_RL_v0.1 from docs/diffusion_RL_v0.1_fengliny

@MikukuOvO
Collaborator

Summary

This draft PR adds documentation for the diffusion RL v0.1 branch:

  • Adds an advanced guide for single-prompt multi-generation in Miles Diffusion.
  • Explains the relationship between prompt groups, n_samples_per_prompt, diffusion_microgroup_size, and SGLang-D num_outputs_per_prompt.
  • Documents current OCR and PickScore recipe settings, the SGLang-D dependency, known limitations, validation checklist, and practical guidance.

Validation

  • Ran git diff --check rockdu/diffusion_RL_v0.1...rockdu/docs/diffusion_RL_v0.1_fengliny.
  • Confirmed the PR-style diff adds docs/advanced/single_prompt_multi_generation.md only.
  • Confirmed there was no existing open PR from Rockdu:docs/diffusion_RL_v0.1_fengliny into diffusion_RL_v0.1 before opening this PR.

Notes

This is intentionally opened as a draft so the same PR can continue to receive follow-up documentation updates without creating duplicate PRs.


- Algorithmically, Miles Diffusion compares multiple sampled outputs within a
prompt group, so each prompt should have multiple sampled outputs.
- System-wise, packing multiple samples from the same prompt into one SGLang-D
Owner

Maybe we should highlight the most critical info here. For example, where does the optimization come from? (Because the workload becomes compute-bound after batching, and the encoder embedding is reused.)

Collaborator Author

Thanks for the pointer. I checked the current SGLang-D path and updated the doc to explain the optimization more concretely: the prompt is text-encoded once, Qwen-Image conditioning is expanded in the fixed path, and denoising runs as batched DiT forwards across the timestep loop.
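
To make the reply above concrete, here is a minimal toy sketch of the "encode once, expand conditioning, batch the denoising loop" pattern. It is not the SGLang-D implementation: `encode_prompt`, `dit_forward`, and `generate_group` are hypothetical stand-ins invented for illustration, and the arithmetic inside them is a placeholder.

```python
import random

# Call counters so the reuse pattern is observable.
calls = {"encode": 0, "dit": 0}

def encode_prompt(prompt):
    """Hypothetical stand-in for the text encoder; returns a toy embedding."""
    calls["encode"] += 1
    return [float(ord(c) % 7) for c in prompt[:4]]

def dit_forward(latents, cond, t):
    """Hypothetical stand-in for one batched DiT forward over all samples."""
    calls["dit"] += 1
    return [[x - 0.1 * c for x, c in zip(lat, cnd)]
            for lat, cnd in zip(latents, cond)]

def generate_group(prompt, num_outputs, num_steps, seed=0):
    rng = random.Random(seed)
    emb = encode_prompt(prompt)        # prompt is text-encoded once
    cond = [emb] * num_outputs         # conditioning expanded across the batch
    # Each sample starts from its own noise, so outputs differ.
    latents = [[rng.gauss(0, 1) for _ in emb] for _ in range(num_outputs)]
    for t in range(num_steps):         # one batched forward per timestep
        latents = dit_forward(latents, cond, t)
    return latents

outputs = generate_group("a cat", num_outputs=4, num_steps=3)
```

In this toy, four samples cost one encoder call and three batched forwards, rather than four encoder calls and twelve single-sample forwards.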

same prompt in one rollout group. In Miles Diffusion, this is both an algorithmic
requirement and a performance knob:

- Algorithmically, Miles Diffusion compares multiple sampled outputs within a
Owner

More specifically, GRPO-style RL requires multiple samples from a single prompt.

Collaborator Author

Thanks, addressed this by explicitly stating that GRPO-style RL needs multiple samples from the same prompt to compute group-relative advantages.
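
For readers new to the term, a minimal sketch of group-relative advantages follows. It uses the commonly described GRPO normalization (reward minus group mean, divided by group standard deviation); whether Miles Diffusion uses exactly this form is an assumption, and `group_relative_advantages` and the reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt group (assumed GRPO-style form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rewards for four samples generated from the same prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# With a single sample the group mean equals the reward, so the advantage
# carries no learning signal.
single = group_relative_advantages([0.7])
```

The single-sample case shows why each prompt needs several outputs: the advantage is only informative relative to the other samples in the group.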
