91 changes: 91 additions & 0 deletions docs/advanced/rollout_parallel_accuracy.md
@@ -0,0 +1,91 @@
# Rollout Parallel Accuracy

sglang-diffusion supports several rollout-side parallel strategies. These
strategies are important for throughput and memory, but they can also change the
numeric path of the diffusion forward pass and rollout log-prob computation.
For RL post-training, those differences matter: the trainer consumes rollout
trajectories, rewards, and log-probs, so a parallel configuration should be
chosen with a clear understanding of its accuracy behavior.

This document summarizes the currently relevant rollout parallel strategies and
their observed precision impact.

## Parallel Strategies

| Strategy | Meaning | Typical purpose |
| --- | --- | --- |
| SP / Ulysses | Sequence parallelism over latent/image tokens. Each rank handles a shard of the sequence dimension and uses collectives inside attention. | Increase max resolution or reduce per-GPU activation memory. |
| TP | Tensor parallelism inside the DiT / transformer layers. | Split model compute and parameters across GPUs. |
| CFGP | Classifier-free-guidance parallelism. Conditional and unconditional branches are computed on different ranks and then combined. | Reduce wall-clock cost of CFG when both branches are required. |

The main tensors to watch are:

| Tensor | Why it matters |
| --- | --- |
| `model_output` | Direct output of the DiT denoiser. Differences here affect the denoising trajectory. |
| `prev_sample_mean` | Scheduler mean update before adding SDE/CPS variance noise. |
| `variance_noise` | Random noise used by SDE/CPS rollout. |
| `noise_std_dev` | Scheduler noise scale. |
| `rollout_log_probs` | Per-step rollout log-prob consumed by RL training. This is the most important rollout-side scalar for policy-gradient correctness. |
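
As a rough illustration of how these tensors relate, the sketch below assumes the per-step log-prob of an SDE-style step is the mean per-element log-density of a diagonal Gaussian centered at `prev_sample_mean` with standard deviation `noise_std_dev`. That formula is an assumption borrowed from common flow-matching RL samplers, not something confirmed for sglang-diffusion, and the function name `gaussian_step_log_prob` is hypothetical. It uses Python floats (fp64) on flattened lists instead of real tensors.

```python
import math


def gaussian_step_log_prob(prev_sample, prev_sample_mean, noise_std_dev):
    """Mean per-element log-density of N(prev_sample_mean, noise_std_dev**2).

    Assumed formula for illustration only; the real rollout code may reduce
    over different axes or use a different parameterization.
    """
    if noise_std_dev <= 0:
        raise ValueError("noise_std_dev must be positive for a stochastic step")
    if len(prev_sample) != len(prev_sample_mean):
        raise ValueError("sample and mean must have the same length")
    log_norm = math.log(noise_std_dev) + 0.5 * math.log(2.0 * math.pi)
    total = 0.0
    for x, mu in zip(prev_sample, prev_sample_mean):
        z = (x - mu) / noise_std_dev
        total += -0.5 * z * z - log_norm
    return total / len(prev_sample)


# Hypothetical flattened values for one denoising step.
mean = [0.10, -0.25, 0.40]   # plays the role of prev_sample_mean
noise = [0.5, -1.0, 0.0]     # plays the role of variance_noise
std = 0.3                    # plays the role of noise_std_dev
sample = [m + std * n for m, n in zip(mean, noise)]
print(gaussian_step_log_prob(sample, mean, std))
```

Under this assumed form, the log-prob depends on the sample, the mean, and the noise scale together, which is why all three are worth comparing against the reference.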

## Tested Scope

The rollout-parallel accuracy checks were run on:

| Model | Resolution | Steps | GPUs | Reference |
| --- | ---: | ---: | ---: | --- |
| `Qwen/Qwen-Image` | 1024 x 1024 | 50 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |
| `Tongyi-MAI/Z-Image-Turbo` | 1024 x 1024 | 9 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |

## Accuracy Summary

| Parallel strategy | `rollout_log_probs` vs single-GPU reference | DiT-side tensors vs single-GPU reference | Practical interpretation |
| --- | --- | --- | --- |
| SP / Ulysses | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Safest tested rollout parallel mode for accuracy-sensitive log-prob replay. |
| TP | Bit-exact in the tested rollout log-prob path. | Qwen-Image was bit-exact in the tested TP2-SP1 run; Z-Image-Turbo showed DiT-side drift in `model_output` / `prev_sample_mean`. | Log-prob can remain exact even when the model forward path has small architecture-dependent reduction-order drift. |
| CFGP | Bit-exact in the tested rollout log-prob path. | `model_output` / `prev_sample_mean` can drift from the serial CFG reference because cond/uncond branches are combined through CFG-parallel collectives. | Useful for CFG throughput, but do not assume full tensor bit-exactness vs serial CFG. |
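
The "reduction-order drift" mentioned for TP comes from floating-point addition not being associative: summing per-rank partial results can round differently from summing everything on one device. The toy sketch below shows the effect with plain Python floats. The helper names and the uneven two-rank split are hypothetical and chosen only to make the effect visible; they do not model how sglang-diffusion shards a layer.

```python
def sum_in_order(values):
    """Accumulate left to right, as a single device would."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_across_ranks(shards):
    """Each rank reduces its own shard, then the partial sums are combined."""
    partials = [sum_in_order(shard) for shard in shards]
    return sum_in_order(partials)


values = [0.1, 0.2, 0.3]
serial = sum_in_order(values)                          # (0.1 + 0.2) + 0.3
two_rank = sum_across_ranks([[0.1], [0.2, 0.3]])       # 0.1 + (0.2 + 0.3)

print(serial == two_rank)        # the two orders round differently
print(abs(serial - two_rank))    # the gap is on the order of one ulp
```

Whether a given backbone shows this drift depends on where its reductions fall, which matches the model-dependent TP results in the table.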

## Detailed Results

### SDE Rollout

| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
| --- | --- | --- | --- | --- |
| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
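
A "0 max abs diff" entry can be reproduced with a comparison like the minimal sketch below. It works on flattened Python lists to stay dependency-free; a real check would compare dumped tensors from the reference and parallel runs. The function names are hypothetical, and the fp32 bit comparison is an added stricter check (it also distinguishes `0.0` from `-0.0`), not something the tables above report.

```python
import struct


def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("tensors must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def bit_exact_fp32(reference, candidate):
    """True when both sequences have identical fp32 bit patterns."""
    if len(reference) != len(candidate):
        return False

    def pack(values):
        # Little-endian IEEE-754 single precision, one entry per element.
        return struct.pack(f"<{len(values)}f", *values)

    return pack(reference) == pack(candidate)


# Hypothetical per-step log-probs from a reference run and a parallel run.
reference_log_probs = [-1.25, -0.5, -2.0]
parallel_log_probs = [-1.25, -0.5, -2.0]
print(max_abs_diff(reference_log_probs, parallel_log_probs))
print(bit_exact_fp32(reference_log_probs, parallel_log_probs))
```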

### CPS Rollout

| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
| --- | --- | --- | --- | --- |
| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |

### ODE Rollout

| Parallel strategy | `rollout_log_probs` | `model_output` / deterministic update |
| --- | --- | --- |
| SP / Ulysses | 0 max abs diff | 0 max abs diff in tested runs |
| TP | 0 max abs diff | Model-dependent drift can appear in the DiT forward path |
| CFGP | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |

ODE has a special precision contract: the rollout branch should preserve
bit-exactness with the non-rollout deterministic scheduler step. For this
reason, SGLang keeps the ODE branch dtype-preserving instead of applying the
same fp32 entry cast used by SDE/CPS.

## Practical Guidance

- Prefer SP / Ulysses when the main goal is scaling rollout resolution while
preserving rollout log-prob accuracy. It is the cleanest tested path for
bit-exact log-prob replay.
- Use TP when model memory or compute requires it, but validate DiT-side tensor
drift for the specific backbone. The tested Qwen-Image path was bit-exact;
the tested Z-Image-Turbo path still showed model-output drift.
- Use CFGP when CFG throughput matters, but treat it as a numerically different
forward path from serial CFG for `model_output` and `prev_sample_mean`.
`rollout_log_probs` were still bit-exact in the tested rollout path.
- For SDE/CPS, expect fp32 rollout log-prob computation. For ODE, preserve the
native deterministic scheduler path.
221 changes: 221 additions & 0 deletions docs/examples/qwen_image_ocr_demo.md
@@ -0,0 +1,221 @@
# Qwen-Image OCR with 2 GPUs

This example runs miles-diffusion with Qwen-Image, FSDP training, LoRA updates,
the built-in diffusion rollout path, and the OCR reward.

## Environment Setup

First complete the base environment setup in
[Quick Start](../get_started/quick_start.md).

Then activate the environment before installing the OCR task dependencies:

```bash
conda activate miles-diffusion
cd /path/to/miles
```

Install the packages listed in
[Task Dependencies: OCR Dependencies](../get_started/task_dependencies.md#ocr-dependencies),
then verify the installation with this import check:

```bash
python -c "from paddleocr import PaddleOCR; from Levenshtein import distance; import miles.rollout.rm_hub.ocr; print('OCR deps OK')"
```
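
If the one-line check fails, it can help to see which module is missing. The sketch below is a hypothetical helper (not part of miles) that reports every module it cannot locate. It only locates modules with `importlib.util.find_spec` and does not execute them, so it will not catch errors raised while importing a package that is present.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A missing parent package raises instead of returning None.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Module names taken from the import check above.
OCR_MODULES = ["paddleocr", "Levenshtein", "miles.rollout.rm_hub.ocr"]

if __name__ == "__main__":
    absent = missing_modules(OCR_MODULES)
    print("OCR deps OK" if not absent else f"Missing modules: {absent}")
```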

The example uses 2 NVIDIA GPUs. It downloads the training and evaluation data
from Hugging Face during startup, so the machine must be able to access Hugging
Face.

Optionally enable Weights & Biases logging:

```bash
export WANDB_API_KEY=...
```

## Run Training

Execute the 2-GPU script:

```bash
cd /path/to/miles
conda activate miles-diffusion
bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
```

By default, the script uses:

```bash
CUDA_VISIBLE_DEVICES=2,3
```

Override it if your available GPUs are different:

```bash
CUDA_VISIBLE_DEVICES=0,1 bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
```
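
Before launching with an override, a small check like the following can confirm that the value lists at least two devices. This is a hypothetical pre-flight helper, not part of the script. It only parses the environment variable and does not query the driver, so it cannot tell whether the listed GPUs actually exist.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [token.strip() for token in raw.split(",") if token.strip()]


def has_enough_gpus(env=None, required=2):
    """True when the override lists at least `required` devices."""
    ids = visible_gpu_ids(env)
    # None means no override is set, so this check has nothing to validate.
    return ids is None or len(ids) >= required


print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))
print(has_enough_gpus({"CUDA_VISIBLE_DEVICES": "0"}))
```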

The script writes checkpoints under:

```text
logs/diffusion_grpo_ocr_2gpu_flowgrpo_aligned_<timestamp>/ckpt
```

## Data and Model

The script downloads the OCR dataset to:

```text
/root/datasets/miles-diffusion-datasets
```

using:

```bash
hf download --repo-type dataset rockdu/miles-diffusion-datasets \
--include "flowgrpo_ocr/**" \
--local-dir /root/datasets/miles-diffusion-datasets
```

Training reads:

```text
/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/train.jsonl
```

Evaluation reads:

```text
/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/test.jsonl
```

The model is loaded from:

```text
Qwen/Qwen-Image
```

## Parameter Introduction

This section walks through the main argument groups in
`scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh`.

### Diffusion Rollout

The script uses the diffusion rollout function:

```bash
--train-backend fsdp
--rollout-function-path miles.rollout.sglang_diffusion_rollout.generate_rollout
--hf-checkpoint Qwen/Qwen-Image
--diffusion-model Qwen/Qwen-Image
```

### LoRA Training

The example trains LoRA weights instead of full model weights:

```bash
--use-lora
--lora-rank 64
--lora-alpha 128
--diffusion-init-lora-weight gaussian
```

### OCR Reward

The reward is configured through both the diffusion reward string and the miles
reward type:

```bash
--diffusion-reward ocr:1.0
--rm-type ocr
--advantage-estimator grpo
```

### Colocated Resources

Training and rollout share the same 2 GPUs:

```bash
--actor-num-gpus-per-node 2
--rollout-num-gpus 2
--rollout-num-gpus-per-engine 1
--num-gpus-per-node 2
--colocate
```

## Batch and Step Math

The 2-GPU script is scaled down from the 4-GPU FlowGRPO-aligned OCR recipe.
Per rollout:

```text
rollout_batch_size = 16 prompts
n_samples_per_prompt = 16 samples per prompt
samples_per_rollout = 16 * 16 = 256 samples
num_steps_per_rollout = 2 optimizer steps
global_batch_size = 256 / 2 = 128 samples per optimizer step
```

Review comment (Owner): we can refer to "batchsizes in miles-d" in docs here

With 2 training GPUs, each rank receives:

```text
128 / 2 = 64 samples per optimizer step
```

The DiT forward is tiled as:

```bash
--micro-batch-size-sample 4
--micro-batch-size-tstep 2
```

so one forward tile covers `4 * 2 = 8` sample/timestep cells.
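
The arithmetic above can be reproduced with a short helper. This is a sketch for checking a configuration by hand; the function name `batch_shapes` and the returned keys are hypothetical and only mirror the quantities named in this section.

```python
def batch_shapes(rollout_batch_size, n_samples_per_prompt, num_steps_per_rollout,
                 num_train_gpus, micro_batch_size_sample, micro_batch_size_tstep):
    """Derive the per-rollout, per-step, per-rank, and per-tile sizes."""
    samples_per_rollout = rollout_batch_size * n_samples_per_prompt
    if samples_per_rollout % num_steps_per_rollout:
        raise ValueError("samples per rollout must divide evenly into optimizer steps")
    global_batch_size = samples_per_rollout // num_steps_per_rollout
    if global_batch_size % num_train_gpus:
        raise ValueError("global batch must divide evenly across training ranks")
    return {
        "samples_per_rollout": samples_per_rollout,
        "global_batch_size": global_batch_size,
        "samples_per_rank_per_step": global_batch_size // num_train_gpus,
        "cells_per_forward_tile": micro_batch_size_sample * micro_batch_size_tstep,
    }


# Values from the 2-GPU script described above.
shapes = batch_shapes(
    rollout_batch_size=16,
    n_samples_per_prompt=16,
    num_steps_per_rollout=2,
    num_train_gpus=2,
    micro_batch_size_sample=4,
    micro_batch_size_tstep=2,
)
print(shapes)
```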

For a deeper explanation of these batch-shape parameters, see
[Batch sizes in miles-diffusion](../developer_guide/batch_sizes_in_miles_d.md).

## Diffusion Sampling Settings

The example mirrors the Qwen-Image OCR FlowGRPO settings:

```bash
--diffusion-num-steps 10
--diffusion-eval-num-steps 50
--diffusion-guidance-scale 4.0
--diffusion-true-cfg-scale 4.0
--diffusion-noise-level 1.2
--diffusion-step-strategy-path miles.rollout.step_strategy_hub.sde_window
--diffusion-sde-window-size 2
--diffusion-sde-window-range 3,5
--diffusion-height 512
--diffusion-width 512
```

The active SDE training window has size 2. The `3,5` range selects the same
effective window used by the aligned FlowGRPO recipe.
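
One plausible reading of these two flags, stated here as an assumption and not confirmed against `miles.rollout.step_strategy_hub.sde_window`, is that the range is a half-open interval of step indices and the window start is drawn so the whole window fits inside it. Under that reading, a size-2 window in the range `3,5` can only start at step 3. The function below is a hypothetical sketch of that interpretation.

```python
import random


def sample_sde_window(window_range, window_size, rng=random):
    """Pick `window_size` consecutive step indices inside [low, high).

    Assumed semantics for illustration; the real step strategy may differ.
    """
    low, high = window_range
    last_start = high - window_size
    if last_start < low:
        raise ValueError("window does not fit inside the range")
    start = rng.randint(low, last_start)  # randint is inclusive on both ends
    return list(range(start, start + window_size))


# With range 3,5 and size 2 there is exactly one valid start.
print(sample_sde_window((3, 5), 2))
```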

## 4-GPU Variant

If you have 4 GPUs, use:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/run-diffusion-grpo-ocr-4gpu-flowgrpo-aligned.sh
```

The 4-GPU script doubles `--rollout-batch-size` from 16 to 32 while keeping the
per-rank training load at 64 samples per optimizer step.
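
The per-rank claim can be checked numerically. The sketch below assumes the 4-GPU script keeps 16 samples per prompt and 2 optimizer steps per rollout, and trains on all 4 GPUs; those values are assumptions carried over from the 2-GPU script, not read from the 4-GPU script itself. The function name is hypothetical.

```python
def samples_per_rank_per_step(rollout_batch_size, n_samples_per_prompt,
                              num_steps_per_rollout, num_train_gpus):
    """Samples each training rank processes in one optimizer step."""
    samples_per_rollout = rollout_batch_size * n_samples_per_prompt
    global_batch_size = samples_per_rollout // num_steps_per_rollout
    return global_batch_size // num_train_gpus


two_gpu = samples_per_rank_per_step(16, 16, 2, 2)
# Assumed: only the rollout batch size and GPU count change in the 4-GPU script.
four_gpu = samples_per_rank_per_step(32, 16, 2, 4)
print(two_gpu, four_gpu)
```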

## Expected Result

A successful launch should:

1. download `flowgrpo_ocr/**` if it is not already present;
2. start the colocated FSDP actor and sglang-diffusion rollout engine;
3. generate Qwen-Image OCR rollouts;
4. compute OCR rewards;
5. begin GRPO LoRA updates;
6. save checkpoints under the run-specific `logs/` directory.

If the run fails before training starts, first check GPU visibility, Hugging Face
access, the base environment, and the OCR task dependencies.