Thanks for your paper and code. I'm confused because the loss calculation of DDPO_IS appears to differ between the paper and the code.

There are mainly two differences in the code:
- The parameters are updated at each timestep, rather than once over all timesteps 0 to T.
- The code uses `unclipped_loss = -advantage * ratio`. I see no `log_prob` term in `unclipped_loss`, where $$\text{LogProb} = \log p_\theta(x_{t-1} \mid x_t, c)$$
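
To make the second point concrete, here is a minimal sketch of how I currently read the code. It rests on an assumption I have not confirmed: that `ratio` is the importance weight computed as `exp(log_prob - old_log_prob)`. The function names and the numbers are hypothetical, chosen only for illustration. Under that assumption, `log_prob` enters the loss only through `ratio`, and the derivative of the loss with respect to `log_prob` is `-advantage * ratio`.

```python
import math

def unclipped_loss(log_prob, old_log_prob, advantage):
    # Assumption (not confirmed from the repository): ratio is the importance
    # weight p_theta / p_theta_old, computed from log-probabilities.
    ratio = math.exp(log_prob - old_log_prob)
    return -advantage * ratio

def grad_wrt_log_prob(log_prob, old_log_prob, advantage, eps=1e-6):
    # Central finite difference of the loss with respect to log_prob.
    hi = unclipped_loss(log_prob + eps, old_log_prob, advantage)
    lo = unclipped_loss(log_prob - eps, old_log_prob, advantage)
    return (hi - lo) / (2 * eps)

# Hypothetical values for a single timestep.
log_prob, old_log_prob, advantage = -1.2, -1.0, 0.5

ratio = math.exp(log_prob - old_log_prob)
numeric = grad_wrt_log_prob(log_prob, old_log_prob, advantage)
# d/d(log_prob) [-A * exp(log_prob - old_log_prob)] = -A * ratio
analytic = -advantage * ratio

print(abs(numeric - analytic) < 1e-6)
```

If this reading is right, then when `log_prob == old_log_prob` the ratio is 1 and the derivative reduces to `-advantage`, which is the coefficient I would expect in front of the gradient of LogProb. Could you confirm whether this is the intended connection to the formula in the paper?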