About the equation of DDPO_IS in the paper and the code #13

Thanks for your paper and code. I'm confused because the DDPO_IS loss calculation in the paper appears to differ from the one in the [code](https://github.com/jannerm/ddpo/blob/f0b6ca76516809b9534ad51bd4511117e8eb3682/ddpo/training/policy_gradient.py#L124):
<img width="685" alt="image" src="https://github.com/jannerm/ddpo/assets/18279248/90a76d38-a37e-4c1c-87c3-07f9c0eea4e8">

+ There are two main differences in the code:
    - The parameters are updated at each timestep, instead of once over the sum from t = 0 to T as in the paper.
    - The code computes `unclipped_loss = -advantage * ratio`, and I see no `log_prob` term in `unclipped_loss`: $$\text{LogProb} = \log p_\theta(x_{t-1} \mid x_t, c)$$
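
To make the second point concrete, here is a minimal sketch of how the two forms might be related. It assumes (I have not confirmed this in the linked file) that `ratio` is computed as `exp(log_prob - old_log_prob)`, the usual PPO-style convention. Under that assumption, the gradient of `-advantage * ratio` would equal `-advantage * ratio * grad(log_prob)`, so the log-probability term would enter through the ratio rather than appearing explicitly. The 1-D Gaussian policy, `SIGMA`, and all function names below are hypothetical stand-ins for illustration, not code from the repository.

```python
import math

SIGMA = 0.5  # fixed std of the toy Gaussian policy (assumption for illustration)


def log_prob(theta, x):
    """Log-density of x under N(theta, SIGMA^2).

    Stands in for log p_theta(x_{t-1} | x_t, c) at a single timestep.
    """
    return -0.5 * ((x - theta) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))


def unclipped_loss(theta, theta_old, x, advantage):
    """Loss in the form quoted from the code: -advantage * ratio."""
    ratio = math.exp(log_prob(theta, x) - log_prob(theta_old, x))
    return -advantage * ratio


def score_function_grad(theta, theta_old, x, advantage):
    """Gradient written in the paper's style: ratio * grad(log_prob) * advantage."""
    ratio = math.exp(log_prob(theta, x) - log_prob(theta_old, x))
    grad_log_prob = (x - theta) / SIGMA ** 2  # d/dtheta of the Gaussian log-density
    return -advantage * ratio * grad_log_prob


def finite_difference_grad(theta, theta_old, x, advantage, eps=1e-6):
    """Central-difference estimate of d(unclipped_loss)/d(theta)."""
    up = unclipped_loss(theta + eps, theta_old, x, advantage)
    down = unclipped_loss(theta - eps, theta_old, x, advantage)
    return (up - down) / (2.0 * eps)


theta_old, theta, x, advantage = 0.0, 0.1, 0.3, 2.0
analytic = score_function_grad(theta, theta_old, x, advantage)
numeric = finite_difference_grad(theta, theta_old, x, advantage)
print(analytic, numeric)  # the two values should agree up to numerical error
```

If this reading is right, the absence of an explicit `log_prob` factor in `unclipped_loss` would not by itself be a discrepancy with the paper's equation. It does not address the first point about per-timestep updates.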
