Do not set padding output to abnormal values #3

@JacoCheung

Description

Background:

For fp8 with sequence parallel/context parallel, we need to pad the last sequence while retaining the seqoffsets. However, the output is allocated via empty_like(q), and I found that the padded region of the output can sometimes contain inf or NaN. This can affect the backward pass: the NaN can propagate to dgrad and wgrad (NaN * 0 is still NaN).
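A minimal sketch of why multiplying by a zero mask does not help, in plain Python so it runs without a GPU. The buffer contents, `valid_len`, and both helper functions are hypothetical and only stand in for the padded rows of the real output. In PyTorch, the corresponding fix would presumably be to allocate with `torch.zeros_like(q)` or to explicitly zero the padded rows of the output. That is an assumption about where the fix belongs, not a description of the current kernel.

```python
import math


def mask_by_multiply(values, valid_len):
    """Multiply the padded tail by 0.0; NaN/Inf in the tail survive as NaN."""
    return [v * (1.0 if i < valid_len else 0.0) for i, v in enumerate(values)]


def mask_by_overwrite(values, valid_len):
    """Overwrite the padded tail with 0.0, whatever it held before."""
    return [v if i < valid_len else 0.0 for i, v in enumerate(values)]


# Hypothetical output buffer: 3 valid entries, then 2 padded entries that
# hold garbage, as uninitialized memory can.
out = [0.5, -1.25, 2.0, math.nan, math.inf]
valid_len = 3

multiplied = mask_by_multiply(out, valid_len)
overwritten = mask_by_overwrite(out, valid_len)

print([math.isnan(v) for v in multiplied])         # [False, False, False, True, True]
print(all(math.isfinite(v) for v in overwritten))  # True
```

Both `nan * 0.0` and `inf * 0.0` evaluate to NaN under IEEE 754, so only the overwrite leaves the padded region finite.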

I cannot provide a reliably reproducible script at the moment, but I do hit this occasionally.
