
Question about the monotonic improvement guarantee of MAT. #36

@CrazySssst

Great work!

I am very interested in why MAT can retain the monotonic improvement guarantee while avoiding sequential updates.

To guarantee monotonic improvement, HAPPO updates the policies one by one during training, with each update leveraging the results of the previous ones. That means that to update ${\pi}^2_{old}$, we have to wait for ${\pi}^1_{new}$.

There is only a rough discussion about this issue in the paper:
image

After carefully checking the HAPPO paper, I found that MAT's Eq. 5 is not the same as Eq. 11 in the HAPPO paper. Specifically, MAT's Eq. 5 ignores the first term of $M^{i_{1:m}}$, which depends on previous update results, e.g., ${\pi}^1_{new}$.
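To make the difference concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions, not code from the MAT or HAPPO repositories: I assume $M^{i_{1:m}}$ is the joint advantage estimate multiplied by the probability ratios of the agents already updated earlier in the order, and the function names `happo_style_weights` and `shared_advantage_weights` are hypothetical.

```python
import numpy as np


def happo_style_weights(ratios, advantages):
    """Per-agent weights in the style of HAPPO's M term (assumed form).

    ratios:     (n_agents, batch) array; ratios[j] is assumed to be
                pi_new^j / pi_old^j evaluated AFTER agent j was updated.
    advantages: (batch,) array of joint advantage estimates.

    Returns an (n_agents, batch) array whose row m is
    prod_{j < m} ratios[j] * advantages.
    """
    # Exclusive cumulative product over the agent axis: the first agent
    # gets a factor of 1, later agents get the product of earlier ratios.
    prefix = np.ones_like(ratios)
    prefix[1:] = np.cumprod(ratios[:-1], axis=0)
    return prefix * advantages[None, :]


def shared_advantage_weights(ratios, advantages):
    """Every agent weighted by the same joint advantage, i.e. the
    ratio prefix is dropped (how I read MAT's Eq. 5)."""
    return np.broadcast_to(advantages, ratios.shape).copy()


# Toy numbers, chosen only for illustration.
ratios = np.array([[2.0, 0.5],
                   [1.5, 1.0],
                   [3.0, 2.0]])
advantages = np.array([1.0, -2.0])

m_happo = happo_style_weights(ratios, advantages)
m_shared = shared_advantage_weights(ratios, advantages)
```

In this sketch, row `m` of `m_happo` can only be computed once the ratios of all earlier agents are known, which is exactly the sequential dependency; `m_shared` has no such dependency.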

Can you explain why Eq. 5 can guarantee monotonic improvement?

This question has been bothering me for a long time and I look forward to getting your reply.

image

image

image
