diff --git a/docs/advanced/rollout_parallel_accuracy.md b/docs/advanced/rollout_parallel_accuracy.md
new file mode 100644
index 0000000000..e83054a6d9
--- /dev/null
+++ b/docs/advanced/rollout_parallel_accuracy.md
@@ -0,0 +1,91 @@
+# Rollout Parallel Accuracy
+
+sglang-diffusion supports several rollout-side parallel strategies. These
+strategies are important for throughput and memory, but they can also change the
+numeric path of the diffusion forward pass and rollout log-prob computation.
+For RL post-training, those differences matter: the trainer consumes rollout
+trajectories, rewards, and log-probs, so a parallel configuration should be
+chosen with a clear understanding of its accuracy behavior.
+
+This document summarizes the currently relevant rollout parallel strategies and
+their observed precision impact.
+
+## Parallel Strategies
+
+| Strategy | Meaning | Typical purpose |
+| --- | --- | --- |
+| SP / Ulysses | Sequence parallelism over latent/image tokens. Each rank handles a shard of the sequence dimension and uses collectives inside attention. | Increase max resolution or reduce per-GPU activation memory. |
+| TP | Tensor parallelism inside the DiT / transformer layers. | Split model compute and parameters across GPUs. |
+| CFGP | Classifier-free-guidance parallelism. Conditional and unconditional branches are computed on different ranks and then combined. | Reduce wall-clock cost of CFG when both branches are required. |
+
+The main tensors to watch are:
+
+| Tensor | Why it matters |
+| --- | --- |
+| `model_output` | Direct output of the DiT denoiser. Differences here affect the denoising trajectory. |
+| `prev_sample_mean` | Scheduler mean update before adding SDE/CPS variance noise. |
+| `variance_noise` | Random noise used by SDE/CPS rollout. |
+| `noise_std_dev` | Scheduler noise scale. |
+| `rollout_log_probs` | Per-step rollout log-prob consumed by RL training. This is the most important rollout-side scalar for policy-gradient correctness. |
+
+## Tested Scope
+
+The rollout-parallel accuracy checks were run on:
+
+| Model | Resolution | Steps | GPUs | Reference |
+| --- | ---: | ---: | ---: | --- |
+| `Qwen/Qwen-Image` | 1024 x 1024 | 50 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |
+| `Tongyi-MAI/Z-Image-Turbo` | 1024 x 1024 | 9 | 1-2 | diffusers, single GPU, TP1 SP1, no CFGP |
+
+## Accuracy Summary
+
+| Parallel strategy | `rollout_log_probs` vs single-GPU reference | DiT-side tensors vs single-GPU reference | Practical interpretation |
+| --- | --- | --- | --- |
+| SP / Ulysses | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Bit-exact in the tested Qwen-Image and Z-Image-Turbo runs. | Safest tested rollout parallel mode for accuracy-sensitive log-prob replay. |
+| TP | Bit-exact in the tested rollout log-prob path. | Qwen-Image was bit-exact in the tested TP2-SP1 run; Z-Image-Turbo showed DiT-side drift in `model_output` / `prev_sample_mean`. | Log-prob can remain exact even when the model forward path has small architecture-dependent reduction-order drift. |
+| CFGP | Bit-exact in the tested rollout log-prob path. | `model_output` / `prev_sample_mean` can drift from the serial CFG reference because cond/uncond branches are combined through CFG-parallel collectives. | Useful for CFG throughput, but do not assume full tensor bit-exactness vs serial CFG. |
+
+## Detailed Results
+
+### SDE Rollout
+
+| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
+| --- | --- | --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
+| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
+
+### CPS Rollout
+
+| Parallel strategy | `variance_noise` | `noise_std_dev` | `rollout_log_probs` | `model_output` / `prev_sample_mean` |
+| --- | --- | --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Model-dependent: exact for tested Qwen-Image; drift observed for tested Z-Image-Turbo |
+| CFGP | 0 max abs diff | 0 max abs diff | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
+
+### ODE Rollout
+
+| Parallel strategy | `rollout_log_probs` | `model_output` / deterministic update |
+| --- | --- | --- |
+| SP / Ulysses | 0 max abs diff | 0 max abs diff in tested runs |
+| TP | 0 max abs diff | Model-dependent drift can appear in the DiT forward path |
+| CFGP | 0 max abs diff | Drift observed in CFG-parallel `model_output` / `prev_sample_mean` |
+
+ODE has a special precision contract: the rollout branch should preserve
+bit-exactness with the non-rollout deterministic scheduler step. For this
+reason, SGLang keeps the ODE branch dtype-preserving instead of applying the
+same fp32 entry cast used by SDE/CPS.
+
+## Practical Guidance
+
+- Prefer SP / Ulysses when the main goal is scaling rollout resolution while
+  preserving rollout log-prob accuracy. It is the cleanest tested path for
+  bit-exact log-prob replay.
+- Use TP when model memory or compute requires it, but validate DiT-side tensor
+  drift for the specific backbone. The tested Qwen-Image path was bit-exact;
+  the tested Z-Image-Turbo path still showed model-output drift.
+- Use CFGP when CFG throughput matters, but treat it as a numerically different
+  forward path from serial CFG for `model_output` and `prev_sample_mean`.
+  `rollout_log_probs` were still bit-exact in the tested rollout path.
+- For SDE/CPS, expect fp32 rollout log-prob computation. For ODE, preserve the
+  native deterministic scheduler path.
diff --git a/docs/examples/qwen_image_ocr_demo.md b/docs/examples/qwen_image_ocr_demo.md
new file mode 100644
index 0000000000..1214cfe5ae
--- /dev/null
+++ b/docs/examples/qwen_image_ocr_demo.md
@@ -0,0 +1,221 @@
+# Qwen-Image OCR with 2 GPUs
+
+This example runs miles-diffusion with Qwen-Image, FSDP training, LoRA updates,
+the built-in diffusion rollout path, and the OCR reward.
+
+## Environment Setup
+
+First complete the base environment setup in
+[Quick Start](../get_started/quick_start.md).
+
+Then install the OCR task dependencies:
+
+```bash
+conda activate miles-diffusion
+cd /path/to/miles
+```
+
+Follow [Task Dependencies: OCR Dependencies](../get_started/task_dependencies.md#ocr-dependencies).
+The important check is:
+
+```bash
+python -c "from paddleocr import PaddleOCR; from Levenshtein import distance; import miles.rollout.rm_hub.ocr; print('OCR deps OK')"
+```
+
+The example uses 2 NVIDIA GPUs. It downloads the training and evaluation data
+from Hugging Face during startup, so the machine must be able to access Hugging
+Face.
+
+Optionally enable Weights & Biases logging:
+
+```bash
+export WANDB_API_KEY=...
+```
+
+## Run Training
+
+Execute the 2-GPU script:
+
+```bash
+cd /path/to/miles
+conda activate miles-diffusion
+bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
+```
+
+By default, the script uses:
+
+```bash
+CUDA_VISIBLE_DEVICES=2,3
+```
+
+Override it if your available GPUs are different:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1 bash scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh
+```
+
+The script writes checkpoints under:
+
+```bash
+logs/diffusion_grpo_ocr_2gpu_flowgrpo_aligned_/ckpt
+```
+
+## Data and Model
+
+The script downloads the OCR dataset to:
+
+```bash
+/root/datasets/miles-diffusion-datasets
+```
+
+using:
+
+```bash
+hf download --repo-type dataset rockdu/miles-diffusion-datasets \
+  --include "flowgrpo_ocr/**" \
+  --local-dir /root/datasets/miles-diffusion-datasets
+```
+
+Training reads:
+
+```bash
+/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/train.jsonl
+```
+
+Evaluation reads:
+
+```bash
+/root/datasets/miles-diffusion-datasets/flowgrpo_ocr/test.jsonl
+```
+
+The model is loaded from:
+
+```bash
+Qwen/Qwen-Image
+```
+
+## Parameter Introduction
+
+Here, we briefly introduce the main parts of
+`scripts/run-diffusion-grpo-ocr-2gpu-flowgrpo-aligned.sh`.
+
+### Diffusion Rollout
+
+The script uses the diffusion rollout function:
+
+```bash
+--train-backend fsdp
+--rollout-function-path miles.rollout.sglang_diffusion_rollout.generate_rollout
+--hf-checkpoint Qwen/Qwen-Image
+--diffusion-model Qwen/Qwen-Image
+```
+
+### LoRA Training
+
+The example trains LoRA weights instead of full model weights:
+
+```bash
+--use-lora
+--lora-rank 64
+--lora-alpha 128
+--diffusion-init-lora-weight gaussian
+```
+
+### OCR Reward
+
+The reward is configured through both the diffusion reward string and the miles
+reward type:
+
+```bash
+--diffusion-reward ocr:1.0
+--rm-type ocr
+--advantage-estimator grpo
+```
+
+### Colocated Resources
+
+Training and rollout share the same 2 GPUs:
+
+```bash
+--actor-num-gpus-per-node 2
+--rollout-num-gpus 2
+--rollout-num-gpus-per-engine 1
+--num-gpus-per-node 2
+--colocate
+```
+
+## Batch and Step Math
+
+The 2-GPU script is scaled down from the 4-GPU FlowGRPO-aligned OCR recipe.
+Per rollout:
+
+```text
+rollout_batch_size    = 16 prompts
+n_samples_per_prompt  = 16 samples per prompt
+samples_per_rollout   = 16 * 16 = 256 samples
+num_steps_per_rollout = 2 optimizer steps
+global_batch_size     = 256 / 2 = 128 samples per optimizer step
+```
+
+With 2 training GPUs, each rank receives:
+
+```text
+128 / 2 = 64 samples per optimizer step
+```
+
+The DiT forward is tiled as:
+
+```bash
+--micro-batch-size-sample 4
+--micro-batch-size-tstep 2
+```
+
+so one forward tile covers `4 * 2 = 8` sample/timestep cells.
+
+For a deeper explanation of these batch-shape parameters, see
+[Batch sizes in miles-diffusion](../developer_guide/batch_sizes_in_miles_d.md).
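
The batch arithmetic in this section can be checked with a short script. This is a minimal sketch and not code from miles: the `BatchShape` dataclass and its field names are hypothetical and only mirror the quantities listed above. The 4-GPU values assume that only the rollout batch size and GPU count differ from the 2-GPU script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchShape:
    """Hypothetical container mirroring the batch quantities in this section."""

    rollout_batch_size: int        # prompts per rollout
    n_samples_per_prompt: int      # samples generated per prompt
    num_steps_per_rollout: int     # optimizer steps per rollout
    num_train_gpus: int            # training ranks
    micro_batch_size_sample: int = 4
    micro_batch_size_tstep: int = 2

    @property
    def samples_per_rollout(self) -> int:
        return self.rollout_batch_size * self.n_samples_per_prompt

    @property
    def global_batch_size(self) -> int:
        # Samples consumed by one optimizer step.
        if self.samples_per_rollout % self.num_steps_per_rollout:
            raise ValueError("samples per rollout must split evenly across optimizer steps")
        return self.samples_per_rollout // self.num_steps_per_rollout

    @property
    def samples_per_rank(self) -> int:
        # Samples each training rank receives per optimizer step.
        if self.global_batch_size % self.num_train_gpus:
            raise ValueError("global batch must split evenly across training ranks")
        return self.global_batch_size // self.num_train_gpus

    @property
    def cells_per_forward_tile(self) -> int:
        # One DiT forward tile covers sample x timestep cells.
        return self.micro_batch_size_sample * self.micro_batch_size_tstep


two_gpu = BatchShape(rollout_batch_size=16, n_samples_per_prompt=16,
                     num_steps_per_rollout=2, num_train_gpus=2)
# Assumption: the 4-GPU recipe changes only the prompt count and GPU count.
four_gpu = BatchShape(rollout_batch_size=32, n_samples_per_prompt=16,
                      num_steps_per_rollout=2, num_train_gpus=4)

for label, shape in (("2-GPU", two_gpu), ("4-GPU", four_gpu)):
    print(label, shape.samples_per_rollout, shape.global_batch_size,
          shape.samples_per_rank, shape.cells_per_forward_tile)
```

Under these assumptions both configurations give 64 samples per rank per optimizer step, which is the invariant the scaled-down recipe is meant to preserve.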
+
+## Diffusion Sampling Settings
+
+The example mirrors the Qwen-Image OCR FlowGRPO settings:
+
+```bash
+--diffusion-num-steps 10
+--diffusion-eval-num-steps 50
+--diffusion-guidance-scale 4.0
+--diffusion-true-cfg-scale 4.0
+--diffusion-noise-level 1.2
+--diffusion-step-strategy-path miles.rollout.step_strategy_hub.sde_window
+--diffusion-sde-window-size 2
+--diffusion-sde-window-range 3,5
+--diffusion-height 512
+--diffusion-width 512
+```
+
+The active SDE training window has size 2. The `3,5` range selects the same
+effective window used by the aligned FlowGRPO recipe.
+
+## 4-GPU Variant
+
+If you have 4 GPUs, use:
+
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/run-diffusion-grpo-ocr-4gpu-flowgrpo-aligned.sh
+```
+
+The 4-GPU script doubles `--rollout-batch-size` from 16 to 32 while keeping the
+per-rank training load at 64 samples per optimizer step.
+
+## Expected Result
+
+A successful launch should:
+
+1. download `flowgrpo_ocr/**` if it is not already present;
+2. start the colocated FSDP actor and sglang-diffusion rollout engine;
+3. generate Qwen-Image OCR rollouts;
+4. compute OCR rewards;
+5. begin GRPO LoRA updates;
+6. save checkpoints under the run-specific `logs/` directory.
+
+If the run fails before training starts, first check GPU visibility, Hugging Face
+access, the base environment, and the OCR task dependencies.
diff --git a/docs/get_started/quick_start.md b/docs/get_started/quick_start.md
new file mode 100644
index 0000000000..14f183af16
--- /dev/null
+++ b/docs/get_started/quick_start.md
@@ -0,0 +1,201 @@
+# Quick Start
+
+This document describes the recommended base environment for
+`miles-diffusion`. It covers the common runtime needed by the diffusion training
+entrypoint, the pinned sglang-diffusion fork, and the miles package.
+
+Task-specific reward dependencies are intentionally kept out of the base setup.
+After the base environment is ready, install the dependencies required by your
+target recipe from [Task Dependencies](task_dependencies.md).
+
+## Basic Environment Setup
+
+Miles-diffusion depends on a custom sglang-diffusion fork for multimodal rollout
+and RL weight synchronization. The sglang branch can move over time, so the
+environment should pin the exact sglang commit instead of installing from a
+floating branch tip.
+
+Run the following block from the repository root:
+
+```bash
+set -euo pipefail
+
+ENV_NAME="${ENV_NAME:-miles-diffusion}"
+PY_VER="${PY_VER:-3.11}"
+CUDA_VER="${CUDA_VER:-12.9}"
+TORCH_VER="${TORCH_VER:-2.9.1}"
+SGLANG_REPO="${SGLANG_REPO:-https://github.com/Rockdu/sglang.git}"
+SGLANG_BRANCH="${SGLANG_BRANCH:-sglang-diffusion-rollout-test}"
+SGLANG_COMMIT="${SGLANG_COMMIT:-0372158dd66bc7cb0740c733bd60047db790ec7d}"
+
+PIP_VER="${PIP_VER:-26.0.1}"
+WHEEL_VER="${WHEEL_VER:-0.45.1}"
+SETUPTOOLS_VER="${SETUPTOOLS_VER:-82.0.1}"
+TORCH_MEMORY_SAVER_VER="${TORCH_MEMORY_SAVER_VER:-0.0.9}"
+
+REPO_DIR="$(pwd)"
+SGLANG_DIR="${SGLANG_DIR:-$(dirname "$REPO_DIR")/sglang}"
+
+if command -v mamba >/dev/null 2>&1; then
+  CONDA_BIN=mamba
+elif command -v conda >/dev/null 2>&1; then
+  CONDA_BIN=conda
+else
+  echo "conda/mamba not found. Install miniforge first: https://github.com/conda-forge/miniforge" >&2
+  exit 1
+fi
+
+source "$($CONDA_BIN info --base)/etc/profile.d/conda.sh"
+
+if conda env list | awk '{print $1}' | grep -qx "$ENV_NAME"; then
+  echo "[install] conda env '$ENV_NAME' exists; reusing"
+else
+  echo "[install] creating conda env '$ENV_NAME'"
+  "$CONDA_BIN" create -y -n "$ENV_NAME" "python=$PY_VER"
+fi
+conda activate "$ENV_NAME"
+
+python -m pip install "pip==$PIP_VER" "wheel==$WHEEL_VER" "setuptools==$SETUPTOOLS_VER"
+
+CU_TAG="cu$(echo "$CUDA_VER" | tr -d .)"
+if python -c "import torch" 2>/dev/null; then
+  CUR_TORCH="$(python -c 'import torch; print(torch.__version__)')"
+  if [[ "$CUR_TORCH" == "${TORCH_VER}+${CU_TAG}" || "$CUR_TORCH" == "$TORCH_VER" ]]; then
+    echo "[install] torch: $CUR_TORCH"
+  else
+    echo "[install] reinstalling torch==$TORCH_VER from $CU_TAG"
+    pip install --force-reinstall "torch==$TORCH_VER" --index-url "https://download.pytorch.org/whl/$CU_TAG"
+  fi
+else
+  echo "[install] installing torch==$TORCH_VER from $CU_TAG"
+  pip install "torch==$TORCH_VER" --index-url "https://download.pytorch.org/whl/$CU_TAG"
+fi
+
+if [[ ! -d "$SGLANG_DIR" ]]; then
+  echo "[install] cloning $SGLANG_REPO -> $SGLANG_DIR"
+  git clone --branch "$SGLANG_BRANCH" "$SGLANG_REPO" "$SGLANG_DIR"
+fi
+
+pushd "$SGLANG_DIR" >/dev/null
+if ! git remote get-url rockdu >/dev/null 2>&1; then
+  git remote add rockdu "$SGLANG_REPO"
+fi
+if ! git cat-file -e "$SGLANG_COMMIT^{commit}" 2>/dev/null; then
+  git fetch rockdu "$SGLANG_BRANCH"
+fi
+CUR_SGLANG_COMMIT="$(git rev-parse HEAD)"
+if [[ "$CUR_SGLANG_COMMIT" != "$SGLANG_COMMIT" ]]; then
+  git checkout --detach "$SGLANG_COMMIT"
+fi
+pip install -e "python[all]"
+popd >/dev/null
+
+cd "$REPO_DIR"
+pip install -r requirements.txt
+pip install -e . --no-deps
+
+pip install "torch_memory_saver==$TORCH_MEMORY_SAVER_VER" || true
+
+if command -v nvidia-smi >/dev/null 2>&1; then
+  nvidia-smi -L
+else
+  echo "[warn] nvidia-smi not found; GPU visibility was not checked"
+fi
+
+python -c "import train_diffusion; from miles.utils.arguments import parse_args; from miles.backends.fsdp_utils import FSDPTrainRayActor; import sglang.multimodal_gen; print('miles-diffusion import OK')"
+```
+
+The block is idempotent. Re-running it reuses the conda environment, the sglang
+checkout, and already installed packages when they match the configured
+versions.
+
+## What the Setup Creates
+
+By default, the setup creates this layout:
+
+```bash
+/path/to/miles    # this repository
+/path/to/sglang   # Rockdu/sglang checked out at the pinned commit
+```
+
+It performs the following steps:
+
+1. creates or reuses a conda environment named `miles-diffusion`;
+2. installs pinned Python build tooling;
+3. installs pinned PyTorch from the selected CUDA wheel index;
+4. clones `Rockdu/sglang` and checks out the pinned sglang-diffusion commit;
+5. installs sglang in editable mode with `python[all]`;
+6. installs miles dependencies from `requirements.txt`;
+7. installs miles itself in editable mode;
+8. optionally installs `torch_memory_saver`;
+9. runs a Python import smoke test.
+
+Activate the environment after installation:
+
+```bash
+conda activate miles-diffusion
+python -c "import train_diffusion; import sglang.multimodal_gen; print('OK')"
+```
+
+If the import command succeeds, the base environment can load the miles
+diffusion training entrypoint and the sglang multimodal rollout module.
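
When the one-line smoke test fails, it stops at the first broken import and does not show whether the remaining modules are affected. The sketch below checks each module separately. The helper name `check_imports` is hypothetical, and the module list is copied from the import check in the setup block. It assumes you run it from the repository root, as the setup block does, so that `train_diffusion` can be resolved.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and return {name: error message or None}."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # report any import-time failure, not only ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
        else:
            results[name] = None
    return results


# Modules taken from the smoke test in the setup block.
BASE_MODULES = [
    "train_diffusion",
    "miles.utils.arguments",
    "miles.backends.fsdp_utils",
    "sglang.multimodal_gen",
]

if __name__ == "__main__":
    for name, error in check_imports(BASE_MODULES).items():
        print(f"{name}: {'OK' if error is None else error}")
```

A failure in `sglang.multimodal_gen` alone points at the sglang checkout, while a failure in the `miles.*` modules points at the editable miles install or `requirements.txt`.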
+
+## Version Pins
+
+The base setup keeps the key environment choices explicit:
+
+| Component | Default pin | Override variable |
+| --- | --- | --- |
+| Conda env | `miles-diffusion` | `ENV_NAME` |
+| Python | `3.11` | `PY_VER` |
+| pip | `26.0.1` | `PIP_VER` |
+| wheel | `0.45.1` | `WHEEL_VER` |
+| setuptools | `82.0.1` | `SETUPTOOLS_VER` |
+| PyTorch | `torch==2.9.1` | `TORCH_VER` |
+| CUDA wheel index | `cu129` | `CUDA_VER=12.9` |
+| sglang repo | `https://github.com/Rockdu/sglang.git` | `SGLANG_REPO` |
+| sglang branch | `sglang-diffusion-rollout-test` | `SGLANG_BRANCH` |
+| sglang commit | `0372158dd66bc7cb0740c733bd60047db790ec7d` | `SGLANG_COMMIT` |
+| torch_memory_saver | `0.0.9` | `TORCH_MEMORY_SAVER_VER` |
+
+Miles package dependencies are pinned in `requirements.txt`, including:
+
+```text
+accelerate==1.12.0
+datasets==4.4.2
+pillow==11.3.0
+ray[default]==2.53.0
+sglang-router==0.3.0
+transformers==5.5.4
+wandb==0.23.1
+```
+
+The sglang source revision is pinned by commit SHA. This is important because
+miles-diffusion relies on the sglang-diffusion fork for multimodal rollout and
+weight synchronization; installing from only the branch name is not reproducible
+enough for debugging or sharing results.
+
+## Configurable Setup
+
+You can override the defaults before running the setup block:
+
+```bash
+export ENV_NAME=miles-diffusion
+export PY_VER=3.11
+export CUDA_VER=12.9
+export TORCH_VER=2.9.1
+export SGLANG_DIR=/path/to/sglang
+export SGLANG_REPO=https://github.com/Rockdu/sglang.git
+export SGLANG_BRANCH=sglang-diffusion-rollout-test
+export SGLANG_COMMIT=0372158dd66bc7cb0740c733bd60047db790ec7d
+```
+
+Only override `SGLANG_COMMIT` when intentionally testing a new
+sglang-diffusion revision.
+
+## Task Dependencies
+
+The base setup intentionally does not install task-specific reward dependencies.
+Before running a recipe, install the dependency set required by that task:
+
+- [Task Dependencies](task_dependencies.md)
diff --git a/docs/get_started/task_dependencies.md b/docs/get_started/task_dependencies.md
new file mode 100644
index 0000000000..01794c9475
--- /dev/null
+++ b/docs/get_started/task_dependencies.md
@@ -0,0 +1,70 @@
+# Task Dependencies
+
+The quick start installs the common miles-diffusion runtime. Some tasks require
+extra reward or evaluation dependencies. This page records those task-scoped
+dependency sets only.
+
+## OCR Dependencies
+
+Use this section for any task that enables the OCR reward, for example through
+`--rm-type ocr` or `--diffusion-reward ocr:...`.
+
+The dependency boundary is the OCR reward implementation:
+
+```text
+miles.rollout.rm_hub.ocr
+```
+
+The OCR reward depends on PaddleOCR, PaddlePaddle, OpenCV runtime libraries, and
+string-distance packages.
+
+Start from the base environment:
+
+```bash
+conda activate miles-diffusion
+cd /path/to/miles
+```
+
+Install the system libraries required by PaddleOCR and OpenCV:
+
+```bash
+sudo apt-get update
+sudo apt-get install -y libglib2.0-0 libgl1
+```
+
+In a root container, use `apt-get` directly if `sudo` is unavailable:
+
+```bash
+apt-get update
+apt-get install -y libglib2.0-0 libgl1
+```
+
+Install the Python dependencies from the pinned `flow_grpo` setup file:
+
+```bash
+cd /path/to/miles/flow_grpo
+grep -v '^apt-get install ' setup.sh | bash
+```
+
+The relevant pins include:
+
+```text
+diffusers==0.37.0
+peft==0.18.1
+bitsandbytes==0.48.0
+opencv-python==4.11.0.86
+opencv-python-headless==4.10.0.84
+opencv-contrib-python==4.11.0.86
+paddlepaddle-gpu==2.6.2
+paddleocr==2.9.1
+python-Levenshtein==0.27.3
+levenshtein==0.27.3
+rapidfuzz==3.14.3
+```
+
+Verify the OCR reward stack:
+
+```bash
+cd /path/to/miles
+python -c "from paddleocr import PaddleOCR; from Levenshtein import distance; import miles.rollout.rm_hub.ocr; print('OCR deps OK')"
+```
diff --git a/docs/roadmap.md b/docs/roadmap.md
new file mode 100644
index 0000000000..fb4edb0e91
--- /dev/null
+++ b/docs/roadmap.md
@@ -0,0 +1,56 @@
+# Roadmap
+
+This roadmap describes planned directions for miles-diffusion. It is intended
+for external developers and users, and does not represent a strict release
+commitment. Priorities may change as the project evolves.
+
+## Text-to-Video RL
+
+We plan to extend miles-diffusion from text-to-image RL to text-to-video RL. The
+overall training loop is similar to T2I, but video rollout has higher memory,
+compute, and IO pressure.
+
+An initial direction is to start with short videos or few-frame settings, then
+extend to longer video generation as the runtime and training recipes mature.
+This also makes it possible to study whether policies trained on short temporal
+contexts can transfer to longer video generation settings.
+
+## Image Editing and TI2I
+
+We plan to support text-guided image-to-image and image editing workflows. This
+includes edit models, condition images, negative images, and other multimodal
+conditioning inputs.
+
+Compared with pure T2I, TI2I introduces additional conditioning paths and
+model-specific preprocessing. Supporting these workflows will require extending
+the diffusion rollout interface and training recipes while keeping the user
+experience close to the existing T2I path.
+
+## More Diffusion Backbones
+
+We plan to expand support for more mainstream diffusion and DiT backbones. New
+model support should include rollout compatibility, scheduler support, and
+log-prob validation against a trusted reference path.
+
+For supported recipes, we aim to keep rollout log-prob drift within an
+acceptable tolerance when compared with the reference implementation. The exact
+tolerance may depend on the model, scheduler, dtype, and rollout mode.
+
+## Mixed Resolution Training
+
+We plan to explore mixed resolution training and rollout support. Mixed
+resolution can make data construction more flexible and may improve hardware
+utilization for workloads that naturally contain prompts or tasks at different
+image sizes.
+
+This direction may include NaViT-style batching, resolution-aware sampling, and
+batch construction strategies that work across T2I, TI2I, and T2V workloads.
+
+## More Tasks and Rewards
+
+We plan to add more task and reward coverage for diffusion RL. The current
+direction includes OCR-style rewards, preference or aesthetic rewards, and
+custom reward functions provided by users.
+
+The goal is to make it straightforward to attach new reward signals without
+rewriting the core rollout and training loop.