Background:
For fp8 with sequence parallel/context parallel, we need to pad the last sequence while retaining the seqoffsets. However, the output is allocated via empty_like(q), so it is uninitialized, and I found that the padded region of the output can sometimes contain inf or nan. This can affect the backward pass: the NaN can propagate to dgrad and wgrad (nan * 0 is still nan).
I cannot provide a reliably reproducible script at the moment, but I do hit this occasionally.
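
To illustrate the "nan * 0 is still nan" point, here is a minimal sketch in plain Python. It is not a reproduction of the issue. The buffer contents, valid_len, and both helper functions are hypothetical and only stand in for an output whose padded tail holds leftover garbage.

```python
import math

def mask_by_multiply(values, valid_len):
    """Multiply padded positions by 0.0; this does NOT clear NaN or inf."""
    return [v * (1.0 if i < valid_len else 0.0) for i, v in enumerate(values)]

def mask_by_overwrite(values, valid_len):
    """Overwrite padded positions with 0.0, regardless of what they held."""
    return [v if i < valid_len else 0.0 for i, v in enumerate(values)]

# Hypothetical output buffer: 3 valid entries followed by 2 padded slots
# containing garbage, as an uninitialized allocation might.
out = [0.5, -1.25, 2.0, math.nan, math.inf]
valid_len = 3

multiplied = mask_by_multiply(out, valid_len)    # padded slots are still NaN
overwritten = mask_by_overwrite(out, valid_len)  # padded slots are exactly 0.0
```

Under IEEE 754 arithmetic both nan * 0 and inf * 0 evaluate to nan, so a zero mask or zero gradient applied by multiplication cannot clean up the padded region. Explicitly writing zeros into it (or allocating the output zero-initialized) is one possible way to avoid the problem, at the cost of an extra write.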