Rockdu · zhihengy · May 5, 2026
diff --git a/docs/advanced/rollout_parallel_accuracy.md b/docs/advanced/rollout_parallel_accuracy.md
@@ -0,0 +1,91 @@
+# Rollout Parallel Accuracy
+
+sglang-diffusion supports several rollout-side parallel strategies. These
+strategies are important for throughput and memory, but they can also change the
+numeric path of the diffusion forward pass and rollout log-prob computation.
+For RL post-training, those differences matter: the trainer consumes rollout
+trajectories, rewards, and log-probs, so a parallel configuration should be
+chosen with a clear understanding of its accuracy behavior.
+
+This document summarizes the currently relevant rollout parallel strategies and
+their observed precision impact.
+
+## Parallel Strategies
+
+| Strategy | Meaning | Typical purpose |
+| --- | --- | --- |
+| SP / Ulysses | Sequence parallelism over latent/image tokens. Each rank handles a shard of the sequence dimension and uses collectives inside attention. | Increase max resolution or reduce per-GPU activation memory. |
+| TP | Tensor parallelism inside the DiT / transformer layers. | Split model compute and parameters across GPUs. |
+| CFGP | Classifier-free-guidance parallelism. Conditional and unconditional branches are computed on different ranks and then combined. | Reduce wall-clock cost of CFG when both branches are required. |
+
+The main tensors to watch are:
+
+| Tensor | Why it matters |
+| --- | --- |
+| `model_output` | Direct output of the DiT denoiser. Differences here affect the denoising trajectory. |
+| `prev_sample_mean` | Scheduler mean update before adding SDE/CPS variance noise. |
+| `variance_noise` | Random noise used by SDE/CPS rollout. |
+| `noise_std_dev` | Scheduler noise scale. |
+| `rollout_log_probs` | Per-step rollout log-prob consumed by RL training. This is the most important rollout-side scalar for policy-gradient correctness. |
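+
+As a rough illustration of how these tensors relate, the sketch below assumes
+the common Gaussian form of a stochastic rollout step: the next sample is
+`prev_sample_mean + noise_std_dev * variance_noise`, and the per-step log-prob
+is the Gaussian log-density of that sample, averaged over elements. The
+function name, the flat-list layout, and the averaging choice are assumptions
+made for illustration; this is not taken from the SGLang implementation.
+
+```python
+import math
+
+
+def gaussian_step_log_prob(prev_sample_mean, noise_std_dev, variance_noise):
+    """Mean Gaussian log-density of one stochastic step (illustrative only)."""
+    # Assumed update rule: sample = mean + std * noise, element by element.
+    samples = [
+        m + noise_std_dev * e for m, e in zip(prev_sample_mean, variance_noise)
+    ]
+    # Normalization term shared by every element.
+    log_norm = math.log(noise_std_dev) + 0.5 * math.log(2.0 * math.pi)
+    per_element = [
+        -((x - m) ** 2) / (2.0 * noise_std_dev**2) - log_norm
+        for x, m in zip(samples, prev_sample_mean)
+    ]
+    return sum(per_element) / len(per_element)
+```
+
+In this form the log-prob depends on every tensor in the table above except
+`model_output`, which enters only indirectly through `prev_sample_mean`.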
+
+## Tested Scope
+
+The rollout-parallel accuracy checks were run on:
+
+| Model | Resolution | Steps | GPUs | Reference |
+| --- | ---: | ---: | ---: | --- |
+| `Qwen/Qwen-Image` | 1024 x 1024 | 50 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |
+| `Tongyi-MAI/Z-Image-Turbo` | 1024 x 1024 | 9 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |
+
+## Accuracy Summary
+
+| Parallel strategy | `rollout_log_probs` vs single-GPU reference | DiT-side tensors vs single-GPU reference | Practical interpretation |
+| --- | --- | --- | --- |
+| SP / Ulysses | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Safest tested rollout parallel mode for accuracy-sensitive log-prob replay. |
+| TP | Bit-exact in the tested rollout log-prob path. | Qwen-Image was bit-exact in the tested TP2-SP1 run; Z-Image-Turbo showed DiT-side drift in `model_output` / `prev_sample_mean`. | Log-probs can remain exact even when the model forward path shows small, architecture-dependent reduction-order drift. |
+| CFGP | Bit-exact in the tested rollout log-prob path. | `model_output` / `prev_sample_mean` can drift from the serial CFG reference because cond/uncond branches are combined through CFG-parallel collectives. | Useful for CFG throughput, but do not assume full tensor bit-exactness vs serial CFG. |
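+
+Reduction-order drift of the kind mentioned for TP is possible because
+floating-point addition is not associative: combining per-rank partial sums can
+round differently from one sequential sum over the same values. The sketch
+below is framework-free; the two-way split only stands in for a two-rank
+reduction, and both helper names are hypothetical.
+
+```python
+def sequential_sum(values):
+    """Add values left to right, as a single device would."""
+    total = 0.0
+    for v in values:
+        total += v
+    return total
+
+
+def sharded_sum(values, num_shards=2):
+    """Sum each contiguous shard first, then combine the partial sums."""
+    shard_size = (len(values) + num_shards - 1) // num_shards
+    partials = [
+        sequential_sum(values[i : i + shard_size])
+        for i in range(0, len(values), shard_size)
+    ]
+    return sequential_sum(partials)
+
+
+# Same four numbers, two summation orders.
+values = [1e16, 1.0, -1e16, 1.0]
+single = sequential_sum(values)
+sharded = sharded_sum(values)
+print(single == sharded)
+```
+
+The inputs are chosen to exaggerate the effect. In a real forward pass the
+per-element differences are typically far smaller, which is why drift has to be
+measured per backbone rather than assumed.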
+
+## Detailed Results
+
+### SDE Rollout
+
+| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
+| --- | --- | --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
+| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
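+
+The "0 max abs diff" entries can be read as the following kind of comparison
+between a reference tensor and the tensor from a parallel run. This sketch
+works on flat Python lists of floats; the helper names are hypothetical, and
+the actual checks presumably ran on framework tensors.
+
+```python
+import struct
+
+
+def max_abs_diff(reference, candidate):
+    """Largest element-wise absolute difference between two equal-length lists."""
+    if len(reference) != len(candidate):
+        raise ValueError("shape mismatch between reference and candidate")
+    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
+
+
+def bit_exact(reference, candidate):
+    """True only if every pair of values has the same float64 bit pattern."""
+    if len(reference) != len(candidate):
+        return False
+    return all(
+        struct.pack("<d", r) == struct.pack("<d", c)
+        for r, c in zip(reference, candidate)
+    )
+```
+
+The two checks are not equivalent: `0.0` and `-0.0` have a max abs diff of 0
+but different bit patterns, so a bit-pattern comparison is the stricter test.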
+
+### CPS Rollout
+
+| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
+| --- | --- | --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
+| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
+
+### ODE Rollout
+
+| Parallel strategy | `rollout_log_probs` | `model_output` / deterministic update |
+| --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | Model-dependent drift can appear in the DiT forward path |
+| CFGP | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
+
+ODE has a special precision contract: the rollout branch should preserve
+bit-exactness with the non-rollout deterministic scheduler step. For this
+reason, SGLang keeps the ODE branch dtype-preserving instead of applying the
+same fp32 entry cast used by SDE/CPS.
+
+## Practical Guidance
+
+- Prefer SP / Ulysses when the main goal is scaling rollout resolution while
+  preserving rollout log-prob accuracy. It is the cleanest tested path for
+  bit-exact log-prob replay.
+- Use TP when model memory or compute requires it, but validate DiT-side tensor
+  drift for the specific backbone. The tested Qwen-Image path was bit-exact;
+  the tested Z-Image-Turbo path still showed model-output drift.
+- Use CFGP when CFG throughput matters, but treat it as a numerically different
+  forward path from serial CFG for `model_output` and `prev_sample_mean`.
+  `rollout_log_probs` were still bit-exact in the tested rollout path.
+- For SDE/CPS, expect fp32 rollout log-prob computation. For ODE, preserve the
+  native deterministic scheduler path.
diff --git a/docs/examples/qwen_image_ocr_demo.md b/docs/examples/qwen_image_ocr_demo.md
@@ -0,0 +1,221 @@
+# Qwen-Image OCR with 2 GPUs
+
+This example runs miles-diffusion with Qwen-Image, FSDP training, LoRA updates,
+the built-in diffusion rollout path, and the OCR reward.
+
+## Environment Setup
+
+First complete the base environment setup in
+[Quick Start](../get_started/quick_start.md).
+
+Then install the OCR task dependencies:
+
+```bash
+conda activate miles-diffusion
+cd /path/to/miles
+```
+
+Follow [Task Dependencies: OCR Dependencies](../get_started/task_dependencies.md#ocr-dependencies).
+Then confirm that the OCR imports resolve:
+
+```bash
+python -c "from paddleocr import PaddleOCR; from Levenshtein import distance; import miles.rollout.rm_hub.ocr; print('OCR deps OK')"
+```
+
+The example uses 2 NVIDIA GPUs. It downloads the training and evaluation data
+from Hugging Face during startup, so the machine must be able to access Hugging
+Face.
+
+Optionally enable Weights & Biases logging:
+
+```bash
+export WANDB_API_KEY=...
+```
+
+## Run Training
+
+Execute the 2-GPU script:
+
+```bash
+cd /path/to/miles
+conda activate miles-diffusion
+bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
+```
+
+By default, the script uses:
+
+```bash
+CUDA_VISIBLE_DEVICES=2,3
+```
+
+Override it if your available GPUs are different:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
+```
+
+The script writes checkpoints under:
+
+```bash
+logs/diffusion_grpo_ocr_2gpu_flowgrpo_aligned_<timestamp>/ckpt
+```
+
+## Data and Model
+
+The script downloads the OCR dataset to:
+
+```bash
+/root/datasets/miles-diffusion-datasets
+```
+
+using:
+
+```bash
+hf download --repo-type dataset rockdu/miles-diffusion-datasets \
+  --include "flowgrpo_ocr/**" \
+  --local-dir /root/datasets/miles-diffusion-datasets
+```
+
+Training reads:
+
+```bash
+/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/train.jsonl
+```
+
+Evaluation reads:
+
+```bash
+/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/test.jsonl
+```
+
+The model is loaded from:
+
+```bash
+Qwen/Qwen-Image
+```
+
+## Parameter Introduction
+
+This section briefly describes the main argument groups in
+`scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh`.
+
+### Diffusion Rollout
+
+The script uses the diffusion rollout function:
+
+```bash
+--train-backend fsdp
+--rollout-function-path miles.rollout.sglang_diffusion_rollout.generate_rollout
+--hf-checkpoint Qwen/Qwen-Image
+--diffusion-model Qwen/Qwen-Image
+```
+
+### LoRA Training
+
+The example trains LoRA weights instead of full model weights:
+
+```bash
+--use-lora
+--lora-rank 64
+--lora-alpha 128
+--diffusion-init-lora-weight gaussian
+```
+
+### OCR Reward
+
+The reward is configured through both the diffusion reward string and the miles
+reward type; the same block selects GRPO as the advantage estimator:
+
+```bash
+--diffusion-reward ocr:1.0
+--rm-type ocr
+--advantage-estimator grpo
+```
+
+### Colocated Resources
+
+Training and rollout share the same 2 GPUs:
+
+```bash
+--actor-num-gpus-per-node 2
+--rollout-num-gpus 2
+--rollout-num-gpus-per-engine 1
+--num-gpus-per-node 2
+--colocate
+```
+
+## Batch and Step Math
+
+The 2-GPU script is scaled down from the 4-GPU FlowGRPO-aligned OCR recipe.
+Per rollout:
+
+```text
+rollout_batch_size = 16 prompts
+n_samples_per_prompt = 16 samples per prompt
+samples_per_rollout = 16 * 16 = 256 samples
+num_steps_per_rollout = 2 optimizer steps
+global_batch_size = 256 / 2 = 128 samples per optimizer step
+```
+
+With 2 training GPUs, each rank receives:
+
+```text
+128 / 2 = 64 samples per optimizer step
+```
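+
+The arithmetic above can be reproduced with a small helper. This is a sketch
+for checking a configuration by hand: the function and its dictionary keys are
+hypothetical and are not part of the miles CLI.
+
+```python
+def batch_shapes(
+    rollout_batch_size, n_samples_per_prompt, num_steps_per_rollout, num_train_gpus
+):
+    """Derive per-rollout, per-step, and per-rank sample counts."""
+    samples_per_rollout = rollout_batch_size * n_samples_per_prompt
+    if samples_per_rollout % num_steps_per_rollout != 0:
+        raise ValueError("samples per rollout must divide evenly into steps")
+    global_batch_size = samples_per_rollout // num_steps_per_rollout
+    if global_batch_size % num_train_gpus != 0:
+        raise ValueError("global batch must divide evenly across training GPUs")
+    return {
+        "samples_per_rollout": samples_per_rollout,
+        "global_batch_size": global_batch_size,
+        "samples_per_rank_per_step": global_batch_size // num_train_gpus,
+    }
+
+
+# Values from the 2-GPU script described above.
+shapes = batch_shapes(
+    rollout_batch_size=16,
+    n_samples_per_prompt=16,
+    num_steps_per_rollout=2,
+    num_train_gpus=2,
+)
+print(shapes)
+```
+
+The divisibility checks are a design choice of this sketch; they flag
+configurations where the per-rank load would not be an integer.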
+
+The DiT forward is tiled as:
+
+```bash
+--micro-batch-size-sample 4
+--micro-batch-size-tstep 2
+```
+
+so one forward tile covers `4 * 2 = 8` sample/timestep cells.
+
+For a deeper explanation of these batch-shape parameters, see
+[Batch sizes in miles-diffusion](../developer_guide/batch_sizes_in_miles_d.md).
+
+## Diffusion Sampling Settings
+
+The example mirrors the Qwen-Image OCR FlowGRPO settings:
+
+```bash
+--diffusion-num-steps 10
+--diffusion-eval-num-steps 50
+--diffusion-guidance-scale 4.0
+--diffusion-true-cfg-scale 4.0
+--diffusion-noise-level 1.2
+--diffusion-step-strategy-path miles.rollout.step_strategy_hub.sde_window
+--diffusion-sde-window-size 2
+--diffusion-sde-window-range 3,5
+--diffusion-height 512
+--diffusion-width 512
+```
+
+The active SDE training window has size 2. The `3,5` range selects the same
+effective window used by the aligned FlowGRPO recipe.
+
+## 4-GPU Variant
+
+If you have 4 GPUs, use:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/run-diffusion-grpo-ocr-4gpu-flowgrpo-aligned.sh
+```
+
+The 4-GPU script doubles `--rollout-batch-size` from 16 to 32 while keeping the
+per-rank training load at 64 samples per optimizer step.
+
+## Expected Result
+
+A successful launch should:
+
+1. download `flowgrpo_ocr/**` if it is not already present;
+2. start the colocated FSDP actor and sglang-diffusion rollout engines;
+3. generate Qwen-Image OCR rollouts;
+4. compute OCR rewards;
+5. begin GRPO LoRA updates;
+6. save checkpoints under the run-specific `logs/` directory.
+
+If the run fails before training starts, first check GPU visibility, Hugging Face
+access, the base environment, and the OCR task dependencies.
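+
+Part of that checklist can be scripted. The following is a minimal preflight
+sketch using only the Python standard library; the function names are
+hypothetical, and the module list simply mirrors the import check from the
+environment setup. It reports whether each module can be located and which GPUs
+are exposed through `CUDA_VISIBLE_DEVICES`; it does not test Hugging Face
+access or prove that an import will succeed.
+
+```python
+import importlib.util
+import os
+
+
+def module_available(name):
+    """Return True if the import system can locate the module."""
+    try:
+        # For dotted names, find_spec imports the parent packages first.
+        return importlib.util.find_spec(name) is not None
+    except (ImportError, ValueError):
+        return False
+
+
+def preflight(modules=("paddleocr", "Levenshtein", "miles.rollout.rm_hub.ocr")):
+    """Collect module availability and the GPU visibility setting."""
+    report = {name: module_available(name) for name in modules}
+    report["CUDA_VISIBLE_DEVICES"] = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
+    return report
+
+
+if __name__ == "__main__":
+    for key, value in preflight().items():
+        print(f"{key}: {value}")
+```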