Have a better way to track capacity reservations used in CI #713

Description

@huydhn

@wdvr and I have been manually tracking all the capacity reservations used by OSDC in this sheet. The manual process is error-prone, so we should find a better way to renew soon-to-expire capacity reservations, given that OSDC now serves production traffic, not only for pytorch/pytorch but also for the rest of the org.

There are a couple of steps we can take:

  • Have a way to distinguish between capacity reservations used in CI vs. those used by devgpu
  • Create a SEV if a capacity reservation has expired and is still referred to in OSDC
  • Notify oncalls when a replacement is ready
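
To make the first two steps concrete, here is a rough sketch of the check, not a proposal for the actual implementation. Everything in it is an assumption for illustration: the `Reservation` record, the `usage` field with `"ci"` / `"devgpu"` values (for example, sourced from a tag), the `referenced_ids` set (the reservation IDs OSDC currently points at), and the 14-day renewal window are all hypothetical. It only classifies reservations; filing the SEV and notifying oncalls would hang off the two lists it returns.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reservation:
    # Hypothetical record; field names are illustrative only.
    reservation_id: str
    usage: str                 # assumed tag value: "ci" or "devgpu"
    end: Optional[datetime]    # None means the reservation has no end date


def triage(
    reservations: Iterable[Reservation],
    referenced_ids: set,
    now: datetime,
    renew_window: timedelta = timedelta(days=14),  # assumed threshold
):
    """Return (expired_in_use, expiring_soon) IDs for CI reservations only."""
    expired_in_use, expiring_soon = [], []
    for r in reservations:
        # Skip devgpu reservations and reservations that never expire.
        if r.usage != "ci" or r.end is None:
            continue
        if r.end <= now:
            # Expired and still referred to in OSDC: SEV candidate.
            if r.reservation_id in referenced_ids:
                expired_in_use.append(r.reservation_id)
        elif r.end - now <= renew_window:
            # Still valid but inside the renewal window: renew soon.
            expiring_soon.append(r.reservation_id)
    return expired_in_use, expiring_soon


if __name__ == "__main__":
    now = datetime(2025, 1, 15, tzinfo=timezone.utc)
    sample = [
        Reservation("cr-a", "ci", now - timedelta(days=1)),
        Reservation("cr-b", "ci", now + timedelta(days=3)),
        Reservation("cr-c", "devgpu", now - timedelta(days=1)),
    ]
    print(triage(sample, {"cr-a", "cr-c"}, now))
```

Something like this could run on a schedule, with the reservation list and the referenced IDs pulled from wherever they end up living instead of the sheet.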

More suggestions are appreciated. IMO, we want a working process for renewing capacity reservations used in CI, but it doesn't need to be overly complicated.

Metadata

    Projects

    Status
    Todo

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions