We are currently reimplementing this excellent work, including AWM, DiffusionNFT, and FlowGRPO.
However, we cannot reproduce the eval curve reported in the paper. Could you share the config you used to train Flux.1-Dev?
We noticed that the current config sets the KL beta to 0, whereas the AWM paper sets it to 0.01.
We have tested both values and still cannot reproduce the paper's results. We would greatly appreciate any help in resolving this.
Our config is as follows:
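For context, here is a minimal sketch of how we understand the kl_beta term to enter a clipped GRPO-style objective. All function and variable names below are our own illustrative assumptions, not the repository's actual API; the real AWM loss may differ:

```python
import torch

def awm_style_loss(logp_new, logp_old, logp_ref, advantages,
                   kl_beta=0.01, clip_range=1.0, adv_clip_range=1.0):
    """Illustrative clipped policy-gradient loss with a KL penalty.

    logp_new / logp_old / logp_ref: per-sample log-probabilities under the
    current, behavior, and frozen reference policies (names assumed).
    """
    adv = advantages.clamp(-adv_clip_range, adv_clip_range)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_range, 1 + clip_range) * adv
    pg_loss = -torch.min(unclipped, clipped).mean()
    # k3 estimator of KL(new || ref); with kl_beta = 0 this term vanishes,
    # which is why the 0 vs. 0.01 choice changes the regularization.
    log_ratio_ref = logp_ref - logp_new
    kl = (torch.exp(log_ratio_ref) - 1 - log_ratio_ref).mean()
    return pg_loss + kl_beta * kl
```

With identical current and reference policies the KL term is zero, so kl_beta = 0 and kl_beta = 0.01 coincide; they only diverge once the policy drifts from the reference.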
```yaml
# >>>>>>>>>> Environment configuration (adjust for your machine) <<<<<<<<<<
launcher: "accelerate"
config_file: deepspeed/deepspeed_zero2.yaml  # Relative to syf_exp/
num_processes: 8  # Number of GPUs; adjust to your setup
main_process_port: 29500
mixed_precision: "bf16"

# >>>>>>>>>> Data configuration (must be changed) <<<<<<<<<<
data:
  dataset_dir: "dataset/pickscore"  # Relative to syf_exp/; place the data here or create a symlink
  preprocessing_batch_size: 8
  dataloader_num_workers: 16
  force_reprocess: true
  cache_dir: "cache/datasets"  # Relative to syf_exp/
  max_dataset_size: 1024
  sampler_type: "auto"

# >>>>>>>>>> Model configuration (must be changed) <<<<<<<<<<
model:
  finetune_type: 'lora'
  lora_rank: 64
  lora_alpha: 128
  target_modules: "default"
  model_name_or_path: "./models/FLUX.1-dev"  # HuggingFace ID or local absolute path
  model_type: "flux1"
  resume_path: null
  resume_type: null

# >>>>>>>>>> Logging and saving <<<<<<<<<<
log:
  run_name: null  # Auto-generated when null; a meaningful name is recommended
  project: "PERL-SYF-TEST"
  logging_backend: "swanlab"  # Options: wandb, swanlab, tensorboard, none
  save_dir: "saves/"  # Relative to syf_exp/
  save_freq: 20
  save_model_only: true

# >>>>>>>>>> Training configuration <<<<<<<<<<
train:
  # Trainer settings
  trainer_type: 'awm'
  advantage_aggregation: 'sum'  # Options: 'sum', 'gdpo'
  off_policy: false
  awm_weighting: 'ghuber'
  ghuber_power: 0.25
  # Training timestep distribution
  num_train_timesteps: 4  # Set null to train on all steps
  time_sampling_strategy: discrete  # Options: uniform, logit_normal, discrete, discrete_with_init, discrete_wo_init
  time_shift: 3.0
  timestep_range: 0.7  # Fraction of timesteps to train on
  # Clipping
  clip_range: 1  # PPO/GRPO clipping range
  adv_clip_range: 1.0  # Advantage clipping range
  # KL divergence
  kl_weight: 'Uniform'
  kl_type: 'v-based'
  kl_beta: 0.01  # KL divergence beta
  ref_param_device: 'cuda'  # Options: cpu, cuda
  # EMA
  ema_kl_beta: 0.1  # Coefficient of the KL loss between the current policy and the EMA policy, used to stabilize training
  ema_decay_schedule: "linear"  # EMA decay schedule. Options: ['constant', 'power', 'linear', 'piecewise_linear', 'cosine', 'warmup_cosine']
  ema_decay: 0.3  # EMA decay rate (0 to disable)
  ema_update_interval: 1  # EMA update interval (in epochs)
  warmup_steps: 300
  ema_device: "cuda"  # Device on which to store the EMA model (options: cpu, cuda)
  # Sampling
  resolution: 512  # Can be an int or [height, width]
  num_inference_steps: 10  # Number of timesteps
  guidance_scale: 3.5  # Guidance scale for sampling
  # Batch and sampling
  per_device_batch_size: 8  # Batch size per device
  group_size: 16  # Group size for GRPO sampling
  global_std: true  # Use global std for advantage normalization
  unique_sample_num_per_epoch: 48  # Unique samples per group
  gradient_step_per_epoch: 1  # Gradient steps per epoch; the first step is on-policy, the rest are off-policy
  # Optimization
  learning_rate: 3.0e-4  # Initial learning rate
  adam_weight_decay: 1.0e-4  # AdamW weight decay
  adam_betas: [0.9, 0.999]  # AdamW betas
  adam_epsilon: 1.0e-8  # AdamW epsilon
  max_grad_norm: 1.0  # Max gradient norm for clipping
  # Gradient checkpointing
  enable_gradient_checkpointing: false  # Enable gradient checkpointing to save memory at extra compute cost
  # Seed
  seed: 42  # Random seed
  # Scheduler configuration
  scheduler:
    dynamics_type: "ODE"  # Options: Flow-SDE, Dance-SDE, CPS, ODE

# Evaluation settings
eval:
  resolution: 1024  # Evaluation resolution
  per_device_batch_size: 1  # Eval batch size
  guidance_scale: 3.5  # Guidance scale for sampling
  num_inference_steps: 20  # Number of eval timesteps
  eval_freq: 20  # Eval frequency in epochs (0 to disable)
  seed: 42  # Eval seed (defaults to the training seed)

# Reward model configuration
rewards:
  - name: "pick_score"
    reward_model: "PickScore"
    batch_size: 16
    device: "cuda"
    dtype: bfloat16

# Optional evaluation reward models
eval_rewards:
  - name: "pick_score"
    reward_model: "PickScore"
    batch_size: 32
    device: "cuda"
    dtype: bfloat16
```
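To make sure we interpret the `group_size` and `global_std` options correctly, here is a sketch of the GRPO-style advantage normalization we assumed. The function name and the exact semantics of `global_std` are our own guesses from the config, not the repository's actual implementation:

```python
import torch

def group_normalized_advantages(rewards, group_size=16, global_std=True):
    """Illustrative GRPO-style advantage normalization.

    rewards: 1-D tensor laid out as consecutive groups of `group_size`
    samples sharing a prompt. With global_std=True the mean is still
    per-group but the std is taken over the whole batch, which is how we
    read the `global_std: true` option (an assumption on our part).
    """
    groups = rewards.view(-1, group_size)
    mean = groups.mean(dim=1, keepdim=True)
    if global_std:
        std = rewards.std()  # one std across all groups
    else:
        std = groups.std(dim=1, keepdim=True)  # per-group std
    adv = (groups - mean) / (std + 1e-8)
    return adv.view(-1)
```

One practical consequence: with a batch-wide std, a group whose rewards are all identical simply gets zero advantages instead of a near-zero denominator.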
The resulting curve is:
