Great work!
I am very interested in why MAT can retain the monotonic improvement guarantee while avoiding sequential updates.
To guarantee monotonic improvement, HAPPO updates the policies one by one during training, with each update leveraging the results of the previous ones. That means if we want to update ${\pi}^2_{old}$, we have to wait for ${\pi}^1_{new}$.
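To make the dependence concrete, here is a minimal sketch of how I understand the sequential scheme. It is not code from either paper: `sequential_weights` and its arguments are hypothetical names, and I assume each agent's advantage is weighted by the product of the probability ratios of the agents updated before it.

```python
import math

def sequential_weights(old_logps, new_logps, advantage):
    """Return the advantage weight each agent sees, in update order.

    Hypothetical helper: old_logps[i] / new_logps[i] are the log-probabilities
    of agent i's sampled action under its old / already-updated policy.
    """
    weights = []
    m = advantage
    for old_lp, new_lp in zip(old_logps, new_logps):
        # Agent i is optimized against the weight built from agents before it.
        weights.append(m)
        # Fold agent i's ratio pi_new / pi_old into the weight for the next agent.
        m *= math.exp(new_lp - old_lp)
    return weights
```

In this sketch the weight for the second agent cannot be computed until the first agent's new policy is known, which is exactly the waiting I am asking about.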
There is only a rough discussion about this issue in the paper:

After carefully checking the HAPPO paper, I found that MAT's Eq. 5 is not the same as Eq. 11 in the HAPPO paper. Specifically, MAT's Eq. 5 ignores the first term of $M^{i_{1:m}}$, which depends on previous update results, e.g., ${\pi}^1_{new}$.
Can you explain why Eq. 5 can guarantee monotonic improvement?
This question has been bothering me for a long time and I look forward to getting your reply.


