forked from radixark/miles
docs(diffusion RL v0.1): add quick start, OCR example, accuracy guide, and roadmap #23
Draft: zhihengy wants to merge 3 commits into diffusion_RL_v0.1 from docs/diffusion_RL_v0.1_zhihengy
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| # Rollout Parallel Accuracy | ||
|
|
||
| sglang-diffusion supports several rollout-side parallel strategies. These | ||
| strategies are important for throughput and memory, but they can also change the | ||
| numeric path of the diffusion forward pass and rollout log-prob computation. | ||
| For RL post-training, those differences matter: the trainer consumes rollout | ||
| trajectories, rewards, and log-probs, so a parallel configuration should be | ||
| chosen with a clear understanding of its accuracy behavior. | ||
|
|
||
| This document summarizes the currently relevant rollout parallel strategies and | ||
| their observed precision impact. | ||
|
|
||
| ## Parallel Strategies | ||
|
|
||
| | Strategy | Meaning | Typical purpose | | ||
| | --- | --- | --- | | ||
| | SP / Ulysses | Sequence parallelism over latent/image tokens. Each rank handles a shard of the sequence dimension and uses collectives inside attention. | Increase max resolution or reduce per-GPU activation memory. | | ||
| | TP | Tensor parallelism inside the DiT / transformer layers. | Split model compute and parameters across GPUs. | | ||
| | CFGP | Classifier-free-guidance parallelism. Conditional and unconditional branches are computed on different ranks and then combined. | Reduce wall-clock cost of CFG when both branches are required. | | ||
|
|
||
| The main tensors to watch are: | ||
|
|
||
| | Tensor | Why it matters | | ||
| | --- | --- | | ||
| | `model_output` | Direct output of the DiT denoiser. Differences here affect the denoising trajectory. | | ||
| | `prev_sample_mean` | Scheduler mean update before adding SDE/CPS variance noise. | | ||
| | `variance_noise` | Random noise used by SDE/CPS rollout. | | ||
| | `noise_std_dev` | Scheduler noise scale. | | ||
| | `rollout_log_probs` | Per-step rollout log-prob consumed by RL training. This is the most important rollout-side scalar for policy-gradient correctness. | | ||
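To make the relationship between these tensors concrete, here is a minimal sketch of a per-step log-prob. It assumes a diagonal Gaussian transition with mean `prev_sample_mean` and standard deviation `noise_std_dev`, with the log-density averaged over elements. That form is common in diffusion RL but is an assumption here, not the confirmed sglang-diffusion implementation. The helper `gaussian_step_log_prob` and all values are hypothetical.

```python
import math


def gaussian_step_log_prob(prev_sample, prev_sample_mean, noise_std_dev):
    """Mean element-wise log-density of prev_sample under N(mean, std^2).

    Assumption: a diagonal Gaussian transition, averaged over elements.
    """
    log_norm = math.log(noise_std_dev) + 0.5 * math.log(2.0 * math.pi)
    per_element = [
        -((x - m) ** 2) / (2.0 * noise_std_dev**2) - log_norm
        for x, m in zip(prev_sample, prev_sample_mean)
    ]
    return sum(per_element) / len(per_element)


# Hypothetical flattened values for a single denoising step.
prev_sample_mean = [0.5, -0.25, 1.0]
variance_noise = [0.0, 1.0, -1.0]
noise_std_dev = 0.5

# The stochastic step adds scaled noise to the scheduler mean.
prev_sample = [m + noise_std_dev * n for m, n in zip(prev_sample_mean, variance_noise)]

step_log_prob = gaussian_step_log_prob(prev_sample, prev_sample_mean, noise_std_dev)
print(f"step log-prob: {step_log_prob:.6f}")
```

Under this assumed form, the log-prob of a freshly drawn sample depends on the noise and the noise scale rather than on the absolute value of the mean. That is one possible reading of why the tables below can report exact log-probs alongside drifting DiT-side tensors.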
|
|
||
| ## Tested Scope | ||
|
|
||
| The rollout-parallel accuracy checks were run on: | ||
|
|
||
| | Model | Resolution | Steps | GPUs | Reference | | ||
| | --- | ---: | ---: | ---: | --- | | ||
| | `Qwen/Qwen-Image` | 1024 x 1024 | 50 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP | | ||
| | `Tongyi-MAI/Z-Image-Turbo` | 1024 x 1024 | 9 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP | | ||
|
|
||
| ## Accuracy Summary | ||
|
|
||
| | Parallel strategy | `rollout_log_probs` vs single-GPU reference | DiT-side tensors vs single-GPU reference | Practical interpretation | | ||
| | --- | --- | --- | --- | | ||
| | SP / Ulysses | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Safest tested rollout parallel mode for accuracy-sensitive log-prob replay. | | ||
| | TP | Bit-exact in the tested rollout log-prob path. | Qwen-Image was bit-exact in the tested TP2-SP1 run; Z-Image-Turbo showed DiT-side drift in `model_output` / `prev_sample_mean`. | Log-prob can remain exact even when the model forward path has small architecture-dependent reduction-order drift. | | ||
| | CFGP | Bit-exact in the tested rollout log-prob path. | `model_output` / `prev_sample_mean` can drift from the serial CFG reference because cond/uncond branches are combined through CFG-parallel collectives. | Useful for CFG throughput, but do not assume full tensor bit-exactness vs serial CFG. | | ||
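The "reduction-order drift" mentioned for TP comes from a general property of floating-point arithmetic: addition is not associative, so summing the same values in a different grouping can change the low bits of the result. The sketch below illustrates only that property, using Python doubles. It is not sglang-diffusion code, and the two "ranks" are purely hypothetical.

```python
# Illustration only: floating-point addition is not associative, so the order
# in which partial results are reduced can change the final bits.
values = [0.1, 0.2, 0.3]

# Serial reduction: accumulate left to right, as a single device might.
serial_sum = (values[0] + values[1]) + values[2]

# Sharded reduction: a hypothetical second rank sums its shard first,
# then the partial results are combined.
sharded_sum = values[0] + (values[1] + values[2])

drift = abs(serial_sum - sharded_sum)
print(f"serial={serial_sum!r} sharded={sharded_sum!r} drift={drift:.3e}")
```

The drift is tiny for a single sum, but it can compound across layers and denoising steps, which is why the tables below report DiT-side tensors separately from `rollout_log_probs`.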
|
|
||
| ## Detailed Results | ||
|
|
||
| ### SDE Rollout | ||
|
|
||
| | Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` | | ||
| | --- | --- | --- | --- | --- | | ||
| | SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs | | ||
| | TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo | | ||
| | CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` | | ||
|
|
||
| ### CPS Rollout | ||
|
|
||
| | Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` | | ||
| | --- | --- | --- | --- | --- | | ||
| | SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs | | ||
| | TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo | | ||
| | CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` | | ||
|
|
||
| ### ODE Rollout | ||
|
|
||
| | Parallel strategy | `rollout_log_probs` | `model_output` / deterministic update | | ||
| | --- | --- | --- | | ||
| | SP / Ulysses | 0 max abs diff | 0 max abs diff in tested runs | | ||
| | TP | 0 max abs diff | Model-dependent drift can appear in the DiT forward path | | ||
| | CFGP | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` | | ||
|
|
||
| ODE has a special precision contract: the rollout branch should preserve | ||
| bit-exactness with the non-rollout deterministic scheduler step. For this | ||
| reason, SGLang keeps the ODE branch dtype-preserving instead of applying the | ||
| same fp32 entry cast used by SDE/CPS. | ||
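The "0 max abs diff" entries in the tables above can be checked with a comparison along the following lines. This is a minimal sketch that assumes the tensors have been dumped and flattened into plain Python sequences. The helper `max_abs_diff` and the sample values are hypothetical and are not measurements from the tested runs.

```python
def max_abs_diff(reference, candidate):
    """Return the largest element-wise absolute difference of two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


# Hypothetical flattened dumps: single-GPU reference vs. a parallel run.
reference_log_probs = [-1.25, -0.75, -2.5]
parallel_log_probs = [-1.25, -0.75, -2.5]
reference_model_output = [0.5, -0.25, 1.0]
parallel_model_output = [0.5, -0.25, 1.0009765625]

report = {
    "rollout_log_probs": max_abs_diff(reference_log_probs, parallel_log_probs),
    "model_output": max_abs_diff(reference_model_output, parallel_model_output),
}

for name, diff in report.items():
    # For finite values, a max abs diff of exactly 0 means the values match.
    status = "exact" if diff == 0 else "drift"
    print(f"{name}: max abs diff = {diff} ({status})")
```

Running the same comparison for every tensor in the tables gives a quick per-backbone check before a parallel configuration is trusted for training.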
|
|
||
| ## Practical Guidance | ||
|
|
||
| - Prefer SP / Ulysses when the main goal is scaling rollout resolution while | ||
| preserving rollout log-prob accuracy. It is the cleanest tested path for | ||
| bit-exact log-prob replay. | ||
| - Use TP when model memory or compute requires it, but validate DiT-side tensor | ||
| drift for the specific backbone. The tested Qwen-Image path was bit-exact; | ||
| the tested Z-Image-Turbo path still showed model-output drift. | ||
| - Use CFGP when CFG throughput matters, but treat it as a numerically different | ||
| forward path from serial CFG for `model_output` and `prev_sample_mean`. | ||
| `rollout_log_probs` were still bit-exact in the tested rollout path. | ||
| - For SDE/CPS, expect fp32 rollout log-prob computation. For ODE, preserve the | ||
| native deterministic scheduler path. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,221 @@ | ||
| # Qwen-Image OCR with 2 GPUs | ||
|
|
||
| This example runs miles-diffusion with Qwen-Image, FSDP training, LoRA updates, | ||
| the built-in diffusion rollout path, and the OCR reward. | ||
|
|
||
| ## Environment Setup | ||
|
|
||
| First complete the base environment setup in | ||
| [Quick Start](../get_started/quick_start.md). | ||
|
|
||
| Then activate the environment before installing the OCR task dependencies: | ||
|
|
||
| ```bash | ||
| conda activate miles-diffusion | ||
| cd /path/to/miles | ||
| ``` | ||
|
|
||
| Install the OCR dependencies by following [Task Dependencies: OCR Dependencies](../get_started/task_dependencies.md#ocr-dependencies). | ||
| Then verify that the imports succeed: | ||
|
|
||
| ```bash | ||
| python -c "from paddleocr import PaddleOCR; from Levenshtein import distance; import miles.rollout.rm_hub.ocr; print('OCR deps OK')" | ||
| ``` | ||
|
|
||
| The example uses 2 NVIDIA GPUs. It downloads the training and evaluation data | ||
| from Hugging Face during startup, so the machine must be able to access Hugging | ||
| Face. | ||
|
|
||
| Optionally enable Weights & Biases logging: | ||
|
|
||
| ```bash | ||
| export WANDB_API_KEY=... | ||
| ``` | ||
|
|
||
| ## Run Training | ||
|
|
||
| Execute the 2-GPU script: | ||
|
|
||
| ```bash | ||
| cd /path/to/miles | ||
| conda activate miles-diffusion | ||
| bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh | ||
| ``` | ||
|
|
||
| By default, the script uses: | ||
|
|
||
| ```bash | ||
| CUDA_VISIBLE_DEVICES=2,3 | ||
| ``` | ||
|
|
||
| Override it if your available GPUs are different: | ||
|
|
||
| ```bash | ||
| CUDA_VISIBLE_DEVICES=0,1 bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh | ||
| ``` | ||
|
|
||
| The script writes checkpoints under: | ||
|
|
||
| ```bash | ||
| logs/diffusion_grpo_ocr_2gpu_flowgrpo_aligned_<timestamp>/ckpt | ||
| ``` | ||
|
|
||
| ## Data and Model | ||
|
|
||
| The script downloads the OCR dataset to: | ||
|
|
||
| ```bash | ||
| /root/datasets/miles-diffusion-datasets | ||
| ``` | ||
|
|
||
| using: | ||
|
|
||
| ```bash | ||
| hf download --repo-type dataset rockdu/miles-diffusion-datasets \ | ||
| --include "flowgrpo_ocr/**" \ | ||
| --local-dir /root/datasets/miles-diffusion-datasets | ||
| ``` | ||
|
|
||
| Training reads: | ||
|
|
||
| ```bash | ||
| /root/datasets/miles-diffusion-datasets/flowgrpo_ocr/train.jsonl | ||
| ``` | ||
|
|
||
| Evaluation reads: | ||
|
|
||
| ```bash | ||
| /root/datasets/miles-diffusion-datasets/flowgrpo_ocr/test.jsonl | ||
| ``` | ||
|
|
||
| The model is loaded from: | ||
|
|
||
| ```bash | ||
| Qwen/Qwen-Image | ||
| ``` | ||
|
|
||
| ## Parameter Overview | ||
|
|
||
| This section briefly describes the main parts of | ||
| `scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh`. | ||
|
|
||
| ### Diffusion Rollout | ||
|
|
||
| The script uses the diffusion rollout function: | ||
|
|
||
| ```bash | ||
| --train-backend fsdp | ||
| --rollout-function-path miles.rollout.sglang_diffusion_rollout.generate_rollout | ||
| --hf-checkpoint Qwen/Qwen-Image | ||
| --diffusion-model Qwen/Qwen-Image | ||
| ``` | ||
|
|
||
| ### LoRA Training | ||
|
|
||
| The example trains LoRA weights instead of full model weights: | ||
|
|
||
| ```bash | ||
| --use-lora | ||
| --lora-rank 64 | ||
| --lora-alpha 128 | ||
| --diffusion-init-lora-weight gaussian | ||
| ``` | ||
|
|
||
| ### OCR Reward | ||
|
|
||
| The reward is configured through both the diffusion reward string and the miles | ||
| reward type: | ||
|
|
||
| ```bash | ||
| --diffusion-reward ocr:1.0 | ||
| --rm-type ocr | ||
| --advantage-estimator grpo | ||
| ``` | ||
|
|
||
| ### Colocated Resources | ||
|
|
||
| Training and rollout share the same 2 GPUs: | ||
|
|
||
| ```bash | ||
| --actor-num-gpus-per-node 2 | ||
| --rollout-num-gpus 2 | ||
| --rollout-num-gpus-per-engine 1 | ||
| --num-gpus-per-node 2 | ||
| --colocate | ||
| ``` | ||
|
|
||
| ## Batch and Step Math | ||
|
|
||
| The 2-GPU script is scaled down from the 4-GPU FlowGRPO-aligned OCR recipe. | ||
| Per rollout: | ||
|
|
||
| ```text | ||
| rollout_batch_size = 16 prompts | ||
| n_samples_per_prompt = 16 samples per prompt | ||
| samples_per_rollout = 16 * 16 = 256 samples | ||
| num_steps_per_rollout = 2 optimizer steps | ||
| global_batch_size = 256 / 2 = 128 samples per optimizer step | ||
| ``` | ||
|
|
||
| With 2 training GPUs, each rank receives: | ||
|
|
||
| ```text | ||
| 128 / 2 = 64 samples per optimizer step | ||
| ``` | ||
|
|
||
| The DiT forward is tiled as: | ||
|
|
||
| ```bash | ||
| --micro-batch-size-sample 4 | ||
| --micro-batch-size-tstep 2 | ||
| ``` | ||
|
|
||
| so one forward tile covers `4 * 2 = 8` sample/timestep cells. | ||
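The batch arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic stated in this section. The variable names mirror the script flags for readability, and nothing here is read from the actual script.

```python
# Values taken from the 2-GPU recipe described above.
rollout_batch_size = 16        # prompts per rollout
n_samples_per_prompt = 16      # samples generated per prompt
num_steps_per_rollout = 2      # optimizer steps per rollout
num_train_gpus = 2             # training ranks
micro_batch_size_sample = 4    # samples per forward tile
micro_batch_size_tstep = 2     # timesteps per forward tile

samples_per_rollout = rollout_batch_size * n_samples_per_prompt

# Each rollout is split evenly across the optimizer steps.
global_batch_size, leftover = divmod(samples_per_rollout, num_steps_per_rollout)
assert leftover == 0, "samples per rollout must divide evenly across optimizer steps"

# Each optimizer step is split evenly across the training ranks.
samples_per_rank, leftover = divmod(global_batch_size, num_train_gpus)
assert leftover == 0, "global batch must divide evenly across training ranks"

cells_per_tile = micro_batch_size_sample * micro_batch_size_tstep

print(f"samples_per_rollout = {samples_per_rollout}")
print(f"global_batch_size   = {global_batch_size}")
print(f"samples_per_rank    = {samples_per_rank}")
print(f"cells_per_tile      = {cells_per_tile}")
```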
|
|
||
| For a deeper explanation of these batch-shape parameters, see | ||
| [Batch sizes in miles-diffusion](../developer_guide/batch_sizes_in_miles_d.md). | ||
|
|
||
| ## Diffusion Sampling Settings | ||
|
|
||
| The example mirrors the Qwen-Image OCR FlowGRPO settings: | ||
|
|
||
| ```bash | ||
| --diffusion-num-steps 10 | ||
| --diffusion-eval-num-steps 50 | ||
| --diffusion-guidance-scale 4.0 | ||
| --diffusion-true-cfg-scale 4.0 | ||
| --diffusion-noise-level 1.2 | ||
| --diffusion-step-strategy-path miles.rollout.step_strategy_hub.sde_window | ||
| --diffusion-sde-window-size 2 | ||
| --diffusion-sde-window-range 3,5 | ||
| --diffusion-height 512 | ||
| --diffusion-width 512 | ||
| ``` | ||
|
|
||
| The active SDE training window has size 2. The `3,5` range selects the same | ||
| effective window used by the aligned FlowGRPO recipe. | ||
|
|
||
| ## 4-GPU Variant | ||
|
|
||
| If you have 4 GPUs, use: | ||
|
|
||
| ```bash | ||
| CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/run-diffusion-grpo-ocr-4gpu-flowgrpo-aligned.sh | ||
| ``` | ||
|
|
||
| The 4-GPU script doubles `--rollout-batch-size` from 16 to 32 while keeping the | ||
| per-rank training load at 64 samples per optimizer step. | ||
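The per-rank claim can be sanity-checked with the same arithmetic as the 2-GPU case. The sketch below assumes that `n_samples_per_prompt = 16` and `num_steps_per_rollout = 2` are unchanged in the 4-GPU script. That is an assumption based on the description above, not something read from the script. The helper `samples_per_rank_per_step` is hypothetical.

```python
def samples_per_rank_per_step(rollout_batch_size, num_gpus,
                              n_samples_per_prompt=16, num_steps_per_rollout=2):
    """Samples each training rank processes per optimizer step.

    The defaults are assumed to be shared by the 2-GPU and 4-GPU scripts.
    """
    samples_per_rollout = rollout_batch_size * n_samples_per_prompt
    global_batch_size = samples_per_rollout // num_steps_per_rollout
    return global_batch_size // num_gpus


two_gpu_load = samples_per_rank_per_step(rollout_batch_size=16, num_gpus=2)
four_gpu_load = samples_per_rank_per_step(rollout_batch_size=32, num_gpus=4)

print(f"2-GPU per-rank load: {two_gpu_load}")
print(f"4-GPU per-rank load: {four_gpu_load}")
```

Doubling both the prompt count and the number of ranks leaves the per-rank load unchanged, while the global batch per optimizer step doubles.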
|
|
||
| ## Expected Result | ||
|
|
||
| A successful launch should: | ||
|
|
||
| 1. download `flowgrpo_ocr/**` if it is not already present; | ||
| 2. start the colocated FSDP actor and sglang-diffusion rollout engine; | ||
| 3. generate Qwen-Image OCR rollouts; | ||
| 4. compute OCR rewards; | ||
| 5. begin GRPO LoRA updates; | ||
| 6. save checkpoints under the run-specific `logs/` directory. | ||
|
|
||
| If the run fails before training starts, first check GPU visibility, Hugging Face | ||
| access, the base environment, and the OCR task dependencies. | ||
We can refer to the "batch sizes in miles-d" doc here.