Official cleaned release for the paper MAR-GRPO.
This repository contains the training code used to apply Group Relative Policy Optimization (GRPO) to masked autoregressive image generation models, with experiments on NOVA and Harmon.
Masked autoregressive image generation models couple an autoregressive backbone with a diffusion-style decoder head. While GRPO can improve these models, naive end-to-end RL introduces severe instability: the diffusion head changes the decoding mapping during training, and stochastic denoising trajectories add high-variance gradients to the autoregressive policy.
MAR-GRPO addresses this with three simple ideas:
- Fix the diffusion head during RL, which removes a major source of non-stationarity.
- Estimate the policy with multiple diffusion trajectories, which reduces variance caused by diffusion stochasticity.
- Apply multi-trajectory estimation selectively on uncertain tokens, improving stability without over-smoothing all updates.
Together, these changes make RL training for masked autoregressive image generation substantially more stable and more effective.
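As a rough illustration of the second and third ideas, selective multi-trajectory estimation can be sketched as averaging per-token log-probability estimates over several denoising trajectories, but only for tokens whose uncertainty exceeds a threshold. This is a hypothetical sketch: the function names, the uncertainty signal, and the placeholder log-probability are illustrative, not the repository's actual API.

```python
import random

def token_logprob(token_id: int, trajectory_seed: int) -> float:
    """Stand-in for one denoising pass of the (frozen) diffusion head.

    In the real trainer this would evaluate a token's log-probability along
    a single stochastic denoising trajectory; here we return a noisy
    placeholder value that varies with the trajectory seed."""
    rng = random.Random(token_id * 10007 + trajectory_seed)
    return -1.0 + 0.2 * rng.uniform(-1.0, 1.0)

def estimate_logprobs(token_ids, uncertainties, k=4, threshold=0.5):
    """Selective multi-trajectory estimation (hypothetical sketch).

    Confident tokens use a single trajectory; uncertain tokens average k
    trajectories, reducing the variance that diffusion stochasticity
    injects into the autoregressive policy gradient."""
    estimates = []
    for tok, unc in zip(token_ids, uncertainties):
        n_traj = k if unc > threshold else 1
        samples = [token_logprob(tok, seed) for seed in range(n_traj)]
        estimates.append(sum(samples) / len(samples))
    return estimates
```

The design point is that the extra denoising passes are spent only where they reduce the most variance, instead of uniformly smoothing every token's update.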
This cleaned release keeps the main research code paths while removing private local paths and obvious workspace-specific clutter.
It includes:
- GRPO training entrypoints for NOVA and Harmon
- core trainer implementations used by the paper
- reward modules used in experiments, including HPS, GIT, GroundingDINO, OCR, Geneval, and ORM
- local third-party source trees that some rewards depend on
- example/debug launchers that preserve the original experiment style, but now use generic placeholders instead of personal paths
It does not include:
- model checkpoints
- reward checkpoints
- private datasets
- a fully frozen environment image
```
.
├── README.md
├── Install.md
├── data/        # data utilities and small local assets
├── jobs/        # original job launch helpers
├── scripts/     # original debug / experiment launchers
└── src/
    └── grpo/
        └── src/
            ├── open_r1/   # RL entrypoints and trainers
            ├── diffnext/  # NOVA-side generation code
            ├── harmon/    # Harmon-side generation code
            ├── infer/     # evaluation / inference helpers
            ├── nova/      # NOVA-side components
            └── utils/     # reward modules and local third-party deps
```
The main training entrypoints used in this release are:
- `src/grpo/src/open_r1/grpo_v3.py`: MAR-GRPO training for NOVA
- `src/grpo/src/open_r1/grpo_v1_harmon.py`: MAR-GRPO training for Harmon
The corresponding core trainers are:
- `src/grpo/src/open_r1/trainer/grpo_trainer_nova_v3.py`
- `src/grpo/src/open_r1/trainer/grpo_trainer_harmon_v1_2.py`
Many original experiment scripts in the research workspace contained hard-coded private paths. In this release, those paths have been replaced with generic placeholders.
The most common placeholders are:
- `$PROJECT_ROOT`: the root of this repository
- `$CHECKPOINT_ROOT`: your local directory containing base model checkpoints
- `$REWARD_MODEL_ROOT`: your local directory containing reward-model checkpoints
- `$WORKSPACE_ROOT`: your own experiment output / scratch directory
- `$ENV_ROOT`: your environment root, if you want to reference tools or interpreters there
- `$USER_ROOT`: a generic placeholder for a personal storage root
- `$GENEVAL_ROOT`, `$DPG_BENCH_ROOT`, `$T2I_COMPBENCH_ROOT`: roots for external evaluation toolkits, if you use them
- `$LEGACY_PROJECT_ROOT`: a placeholder used in a few legacy helper scripts that were originally written against an older codebase layout
These are documentation placeholders only. Python will not expand $PROJECT_ROOT automatically unless you explicitly do so in your own shell or script.
If you want a script to work directly, replace these placeholders with either:
- real absolute paths in your own environment, or
- shell environment variables that you export before launch and read via `os.environ[...]`.
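For the second option, `os.path.expandvars` can substitute the placeholders at runtime. A minimal sketch (the default path below is illustrative):

```python
import os

# os.path.expandvars substitutes $VARS from the process environment;
# unset variables are left as literal text, so verify the result.
os.environ.setdefault("PROJECT_ROOT", "/path/to/mar_grpo_release")

dataset_path = os.path.expandvars("$PROJECT_ROOT/data/your_train_metadata.json")
if "$" in dataset_path:
    raise RuntimeError(f"unexpanded placeholder in: {dataset_path}")
```

Checking for a leftover `$` catches the common failure mode where a variable was never exported.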
The cleanest way to use this release is:
- keep the repository path fixed as your own `$PROJECT_ROOT`
- keep all model checkpoints under one `$CHECKPOINT_ROOT`
- keep reward checkpoints under one `$REWARD_MODEL_ROOT`
- keep outputs under one `$WORKSPACE_ROOT`
- pass all runtime-specific paths through CLI flags or environment variables
For example, suppose your environment looks like this:
```
/home/you/projects/mar_grpo_release
/home/you/checkpoints
/home/you/reward_models
/home/you/outputs
```

Then you can mentally map:

```
$PROJECT_ROOT      -> /home/you/projects/mar_grpo_release
$CHECKPOINT_ROOT   -> /home/you/checkpoints
$REWARD_MODEL_ROOT -> /home/you/reward_models
$WORKSPACE_ROOT    -> /home/you/outputs
```
See Install.md for dependency installation.
After installation:
```shell
cd /path/to/mar_grpo_release
export PROJECT_ROOT=$PWD
export PYTHONPATH=$PROJECT_ROOT/src/grpo/src:${PYTHONPATH}
```

This PYTHONPATH is important because the main code lives under `src/grpo/src`.
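If you work in an interactive Python session instead of exporting PYTHONPATH, an equivalent fallback (assuming you start from the repository root) is to extend `sys.path` directly:

```python
import os
import sys

# Put <repo>/src/grpo/src on sys.path so the packages under it
# (open_r1, diffnext, harmon, ...) are importable without PYTHONPATH.
project_root = os.environ.get("PROJECT_ROOT", os.getcwd())
pkg_dir = os.path.join(project_root, "src", "grpo", "src")
if pkg_dir not in sys.path:
    sys.path.insert(0, pkg_dir)
```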
Before you launch training, decide the following:
- which base model you want to train: NOVA or Harmon
- where that base model checkpoint lives
- which dataset JSON / JSONL file you want to use
- which reward functions you want to enable
- where your output directory should be written
- which reward-model checkpoints are needed for those rewards
At minimum, most runs need:
- base model checkpoint path
- dataset path
- output directory
- HPS checkpoint path if using HPS reward
- CLIP checkpoint path if using HPS / Geneval style scoring
Depending on your reward set, you may also need:
- `GIT_CKPT_PATH`
- `GDINO_CKPT_PATH`
- `GDINO_CONFIG_PATH`
- `ORM_CKPT_PATH`
- `GENEVAL_MMDET_CONFIG`
- `GENEVAL_MMDET_CKPT_DIR`
- `GENEVAL_CLIP_CKPT`
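Before launching, it can save time to fail fast on unset variables or missing checkpoints. The helper below is a hypothetical convenience, not part of the repository; the variable names follow the conventions above, and you should adjust the list to the rewards you actually enable:

```python
import os

def preflight(required_env_vars):
    """Fail early if a required variable is unset or its path is missing.

    OUTPUT_DIR is exempt from the existence check because the trainer
    typically creates it on first run.
    """
    missing = [v for v in required_env_vars if not os.environ.get(v)]
    if missing:
        raise SystemExit(f"unset environment variables: {missing}")
    absent = [v for v in required_env_vars
              if v != "OUTPUT_DIR" and not os.path.exists(os.environ[v])]
    if absent:
        raise SystemExit(f"paths do not exist: {absent}")

# Example: a minimal NOVA + HPS run might check
# preflight(["MODEL_PATH", "DATASET_PATH", "OUTPUT_DIR", "HPS_CKPT_PATH"])
```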
A practical way to adapt the cleaned scripts is to define a small set of shell variables first.
Example:
```shell
export PROJECT_ROOT=/path/to/mar_grpo_release
export CHECKPOINT_ROOT=/path/to/checkpoints
export REWARD_MODEL_ROOT=/path/to/reward_models
export WORKSPACE_ROOT=/path/to/outputs
export PYTHONPATH=$PROJECT_ROOT/src/grpo/src:${PYTHONPATH}
```

Then map the model-specific paths:

```shell
export MODEL_PATH=$CHECKPOINT_ROOT/BAAI/nova-d48w1024-sd512
export DATASET_PATH=$PROJECT_ROOT/data/your_train_metadata.json
export OUTPUT_DIR=$WORKSPACE_ROOT/nova_run_001
export HPS_CKPT_PATH=$REWARD_MODEL_ROOT/HPS_v2.1_compressed.pt
export CLIP_CKPT_PATH=$CHECKPOINT_ROOT/timm/vit_large_patch14_clip_224.openai/open_clip_pytorch_model.bin
export ORM_CKPT_PATH=$REWARD_MODEL_ROOT/your_orm_checkpoint
```

For Geneval reward:

```shell
export GENEVAL_MMDET_CONFIG=/path/to/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco.py
export GENEVAL_MMDET_CKPT_DIR=/path/to/mmdet_ckpts
export GENEVAL_CLIP_CKPT=$CLIP_CKPT_PATH
```

For GroundingDINO reward:

```shell
export GDINO_CONFIG_PATH=/path/to/GroundingDINO_config.py
export GDINO_CKPT_PATH=/path/to/groundingdino_swint_ogc.pth
```

For Harmon runs that need an extra transformer checkpoint:

```shell
export TRANSFORMER_PATH=/path/to/harmon_transformer_checkpoint
```

A simple pattern is:
```shell
cd $PROJECT_ROOT
export PYTHONPATH=$PROJECT_ROOT/src/grpo/src:${PYTHONPATH}
export MODEL_PATH=/path/to/nova_checkpoint
export DATASET_PATH=/path/to/train_metadata_flow_grpo.json
export OUTPUT_DIR=/path/to/output_nova
export HPS_CKPT_PATH=/path/to/HPS_v2.1_compressed.pt
export CLIP_CKPT_PATH=/path/to/open_clip_pytorch_model.bin
export ORM_CKPT_PATH=/path/to/orm_checkpoint

python src/grpo/src/open_r1/grpo_v3.py \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_PATH \
    --dataset_name $DATASET_PATH \
    --reward_funcs hps orm \
    --hps_ckpt_path $HPS_CKPT_PATH \
    --clip_ckpt_path $CLIP_CKPT_PATH \
    --orm_ckpt_path $ORM_CKPT_PATH
```

If you prefer the original experiment-style launcher scripts under `scripts/`, replace the placeholder strings in those scripts with your own paths or environment-variable lookups.
```shell
cd $PROJECT_ROOT
export PYTHONPATH=$PROJECT_ROOT/src/grpo/src:${PYTHONPATH}
export MODEL_PATH=/path/to/harmon_checkpoint
export TRANSFORMER_PATH=/path/to/harmon_transformer_checkpoint
export DATASET_PATH=/path/to/train_metadata_flow_grpo.json
export OUTPUT_DIR=/path/to/output_harmon
export HPS_CKPT_PATH=/path/to/HPS_v2.1_compressed.pt
export CLIP_CKPT_PATH=/path/to/open_clip_pytorch_model.bin

python src/grpo/src/open_r1/grpo_v1_harmon.py \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $MODEL_PATH \
    --transformer_path $TRANSFORMER_PATH \
    --dataset_name $DATASET_PATH \
    --reward_funcs hps \
    --hps_ckpt_path $HPS_CKPT_PATH \
    --clip_ckpt_path $CLIP_CKPT_PATH
```

The repository still contains many original `debug_entry_*.py` scripts because they document the experiment style used during research. These scripts are useful references, but they are not meant to be plug-and-play in a fresh environment.
To adapt one of them:
- open the script you want to reuse
- replace placeholder paths such as `$PROJECT_ROOT/...` or `$CHECKPOINT_ROOT/...`
- remove any experiment-specific options you do not need
- confirm that the reward checkpoints referenced in that script actually exist in your environment
- run it from the repository root so relative paths remain valid
A good practice is to copy one script into your own launcher file and make your local edits there.
This release is conservative: it keeps the original trainer structure and many helper scripts intact, while only cleaning obvious clutter and personal path references.
That means you may still see:
- legacy experiment launchers
- helper evaluation scripts
- ablation-oriented code branches
- reward-specific utilities that are only needed in some settings
This is intentional. The goal of this release is to preserve the research code path while making it portable and safe to share.
If imports such as `open_r1` fail with a `ModuleNotFoundError`, PYTHONPATH is usually missing or incorrect.
Use:

```shell
export PYTHONPATH=$PROJECT_ROOT/src/grpo/src:${PYTHONPATH}
```

If a script fails on a literal placeholder path such as `$PROJECT_ROOT/...`, that script is still a template-like launcher. Replace the placeholder with your own path, or change it to read from `os.environ`.
Many reward modules need extra packages. See Install.md.
The usual missing packages are:
- hpsv2
- open_clip_torch
- mmdet
- mmcv
- clip_benchmark
- paddleocr
If the Geneval reward fails to load, double-check:
- `GENEVAL_MMDET_CONFIG`
- `GENEVAL_MMDET_CKPT_DIR`
- `GENEVAL_CLIP_CKPT`
The cleaned release does not ship checkpoints. Most failures are simply caused by missing or mismatched local checkpoint paths.
Private path roots from the original workspace have been replaced with generic placeholders. Hard-coded personal storage paths and the original hard-coded WandB API key were removed from this release.
If you see WANDB_API_KEY in some launcher scripts, it now refers only to the environment variable name, not to a bundled secret.
This release builds on several excellent open-source projects, including the base model code and reward-related toolchains used in the paper experiments.
If you use this repository, please also consider citing the corresponding upstream projects such as NOVA, Harmon, T2I-R1, GroundingDINO, and LLaVA-NeXT.