In fully async training, saving a checkpoint after a global step requires saving three pieces of state:

1. the training engine checkpoint (model weights and optimizer state)
2. the torch stateful dataloader checkpoint
3. the TransferQueue checkpoint
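The three saves above can be sketched as one routine that writes every component under the same global-step directory, so the states always correspond to the same step. This is a minimal sketch, not the actual implementation: the `Stateful` protocol, `save_checkpoint`, `load_checkpoint`, the `global_step_<N>` directory layout, and the use of `pickle` are all assumptions made for illustration. In particular, it assumes each component (including TransferQueue) can export its state through a `state_dict()`-style method, which the source does not confirm.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, Protocol


class Stateful(Protocol):
    """Hypothetical interface: anything that can export and restore its state."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state: Dict[str, Any]) -> None: ...


def save_checkpoint(root: str, global_step: int, components: Dict[str, Stateful]) -> str:
    """Write one file per component, then publish the directory with a single rename.

    Writing into a temporary directory first means a crash mid-save never
    leaves a half-written global_step_<N> directory behind.
    """
    os.makedirs(root, exist_ok=True)
    final_dir = os.path.join(root, f"global_step_{global_step}")  # assumed layout
    tmp_dir = tempfile.mkdtemp(prefix=f".tmp_step_{global_step}_", dir=root)
    for name, component in components.items():
        with open(os.path.join(tmp_dir, f"{name}.pkl"), "wb") as f:
            pickle.dump(component.state_dict(), f)
    # Fails if final_dir already exists and is non-empty, which guards
    # against silently overwriting an earlier checkpoint for the same step.
    os.rename(tmp_dir, final_dir)
    return final_dir


def load_checkpoint(ckpt_dir: str, components: Dict[str, Stateful]) -> None:
    """Restore every component from the files written by save_checkpoint."""
    for name, component in components.items():
        with open(os.path.join(ckpt_dir, f"{name}.pkl"), "rb") as f:
            component.load_state_dict(pickle.load(f))


class ToyComponent:
    """Stand-in for the engine, dataloader, or TransferQueue in this sketch."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def state_dict(self) -> Dict[str, Any]:
        return {"value": self.value}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.value = state["value"]
```

A caller would pass a mapping such as `{"engine": ..., "dataloader": ..., "transfer_queue": ...}`. Real engine state is usually sharded across ranks and far too large for a single pickle file, so the engine entry would normally delegate to the framework's own checkpoint routine; only the "all three under one step" structure is the point here.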