About the equation of DDPO_IS in the paper and the code #13

@PkuDavidGuan

Thanks for your paper and code. I'm confused because the loss calculation of DDPO_IS in the paper differs from the one in the code:
[image]

  • There are two main differences in the code:
    • The parameters are updated at each timestep instead of once over all timesteps 0 to T.
    • unclipped_loss = -advantage * ratio, and I see no log_prob term in unclipped_loss, where $$\text{LogProb} = \log p_\theta(x_{t-1} \mid x_t, c)$$
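
One possible reading of the second point, offered as a sketch rather than a statement about the repository: if ratio is computed as exp(log_prob - old_log_prob) (an assumption, since that line is not quoted above), then $\nabla_\theta \text{ratio} = \text{ratio} \cdot \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)$, so the log-prob term would reappear once the loss is differentiated. The toy check below uses a hypothetical 1-D Gaussian as a stand-in for the denoising step and finite differences in place of autograd; every name other than unclipped_loss, advantage, ratio and log_prob is illustrative.

```python
import math

def log_prob(theta, x, sigma=1.0):
    # Toy stand-in for log p_theta(x_{t-1} | x_t, c): log-density of N(theta, sigma^2) at x.
    return -0.5 * ((x - theta) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def ratio(theta, theta_old, x):
    # Assumed form of the importance ratio: exp(log_prob - old_log_prob).
    return math.exp(log_prob(theta, x) - log_prob(theta_old, x))

def unclipped_loss(theta, theta_old, x, advantage):
    # Same expression as in the issue: -advantage * ratio.
    return -advantage * ratio(theta, theta_old, x)

def finite_diff(f, theta, eps=1e-6):
    # Central finite difference, used here instead of autograd.
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta_old, theta, x, advantage = 0.3, 0.5, 1.2, 2.0

# Gradient of the loss as written in the code.
grad_loss = finite_diff(lambda th: unclipped_loss(th, theta_old, x, advantage), theta)

# Score-function form: -advantage * ratio * d/dtheta log_prob.
grad_log_prob = finite_diff(lambda th: log_prob(th, x), theta)
score_form = -advantage * ratio(theta, theta_old, x) * grad_log_prob

print(grad_loss, score_form)
```

If the two printed values agree, the loss without an explicit log_prob factor and the score-function estimator with one have the same gradient in this toy setting. Whether that matches the equation in the paper depends on how the repository actually computes ratio, which the sketch does not confirm.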
