@wdvr and I have been manually tracking all the capacity reservations used by OSDC in this sheet. The process is manual and error-prone, so we should see if there is a better way to renew soon-to-expire capacity reservations, given that OSDC is now serving production traffic, not only for pytorch/pytorch but for the rest of the org.
There are a couple of steps we can take:
- Have a way to distinguish between capacity reservations used in CI vs. those used by devgpu
- Create a SEV if a capacity reservation has expired and is still referred to in OSDC
- Notify oncalls when a replacement is ready
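
As a rough illustration of the first two steps, here is a minimal sketch, not a proposed implementation. It assumes each reservation carries a usage label (`"ci"` or `"devgpu"`, e.g. from a tag) and an optional end date, and that we can get the set of reservation IDs OSDC currently refers to. The `Reservation` type, the `classify` helper, the 14-day renewal window, and the way the inputs are obtained are all hypothetical; notifying oncalls is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Reservation:
    """Hypothetical record for one capacity reservation."""
    reservation_id: str
    usage: str                      # assumed label: "ci" or "devgpu"
    end_date: Optional[datetime]    # None means the reservation has no expiry


def classify(
    reservations: Iterable[Reservation],
    referenced_ids: Set[str],
    now: datetime,
    renew_window: timedelta = timedelta(days=14),  # assumed threshold
) -> Dict[str, List[str]]:
    """Return CI reservation IDs that need a SEV or a renewal.

    "sev":   expired and still referred to in OSDC.
    "renew": not yet expired, but ending within renew_window.
    """
    actions: Dict[str, List[str]] = {"sev": [], "renew": []}
    for r in reservations:
        # Only CI reservations are in scope; devgpu ones are skipped.
        if r.usage != "ci" or r.end_date is None:
            continue
        if r.end_date <= now:
            if r.reservation_id in referenced_ids:
                actions["sev"].append(r.reservation_id)
        elif r.end_date - now <= renew_window:
            actions["renew"].append(r.reservation_id)
    return actions
```

Something like this could run on a schedule, with the `sev` list feeding SEV creation and the `renew` list feeding whatever renewal flow we settle on.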
More suggestions are appreciated. IMO, we want a working process for renewing the capacity reservations used in CI, but it doesn't need to be overly complicated.