update GPU partition grouping logic #118
Conversation
@biluriuday Is the bug purely an ordering issue, i.e. all partitions are still returned correctly, just potentially interleaved across physical GPUs, or is there a case where the wrong grouping key actually causes partitions to be dropped or duplicated?
@bhatturu All the partitions are returned. The problem is that the grouping and GPU ID numbering are not ordered correctly. The GPU IDs (0, 1, 2, etc.) we assign to the partitions do not match those in the amd-smi and rocm-smi output, which might confuse the end user.
Pull request overview
Updates AMDGPU partition grouping to be resilient to ROCm 7.0.1+ behavior where unique_id can differ per partition, by deriving a stable parent device key from KFD topology (domain + location_id).
Changes:
- Switch `GetDevIdsFromTopology` to compute a parent devID from `domain`/`location_id` instead of `unique_id`.
- Add `GetUniqueIdsFromTopology` to preserve the previous `unique_id` mapping behavior where still needed.
- Update topology test fixtures and unit tests to include/expect `domain` and `location_id`.
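To make the devID derivation concrete, here is a minimal sketch of the idea behind the new `GetDevIdsFromTopology` behavior: parse a KFD topology `properties` file and build a parent-device key from `domain` and `location_id`. The helper name `parentDevID`, the key format, and the sample input are illustrative assumptions, not the PR's actual code.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parentDevID derives a stable parent-device key from the text of a KFD
// topology properties file (e.g. /sys/class/kfd/kfd/topology/nodes/N/properties).
// Hypothetical helper: the PR's GetDevIdsFromTopology uses the same fields
// but its exact signature and key format may differ.
func parentDevID(properties string) string {
	var domain, locationID string
	sc := bufio.NewScanner(strings.NewReader(properties))
	for sc.Scan() {
		// Each properties line is "<name> <value>".
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		switch fields[0] {
		case "domain":
			domain = fields[1]
		case "location_id":
			locationID = fields[1]
		}
	}
	// All partitions of one physical GPU share this key, even when their
	// unique_id values differ (ROCm 7.0.1+).
	return domain + ":" + locationID
}

func main() {
	props := "unique_id 123\ndomain 0\nlocation_id 768\n"
	fmt.Println(parentDevID(props)) // 0:768
}
```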
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/amdgpu/amdgpu.go | Implements new devID derivation from domain/location_id and adds a helper to still fetch unique_id mappings. |
| internal/amdgpu/amdgpu_test.go | Updates expected mappings for the new devID format in topology-related unit tests. |
| tests/amdgpu/topology/nodes/0/properties | Adds location_id/domain fields to topology fixture. |
| tests/amdgpu/topology/nodes/1/properties | Adds location_id/domain fields to topology fixture. |
| tests/amdgpu/topology/nodes/2/properties | Adds location_id/domain fields to topology fixture. |
shiv-tyagi left a comment
Do the location_id and domain fields exist for pre-ROCm-7.0.1 driver versions? Just to make sure that we are not breaking the systems using those driver versions.
Yes, these fields are present in ROCm 6.x versions as well.
Motivation
Group GPU partitions by (domain, location_id) instead of unique_id
The logic that groups partitions belonging to the same physical GPU previously relied on the `unique_id` field from the KFD topology properties file. In earlier versions of the amdgpu driver, all partitions of a GPU shared the same `unique_id`, which made it a reliable grouping key. Starting with ROCm 7.0.1, this is no longer the case: each partition can report a distinct `unique_id`, breaking the existing grouping.

This change updates the grouping logic to use the `domain` and `location_id` fields instead, which together uniquely identify the parent PCI device and therefore correctly group all partitions of the same GPU.

Technical Details
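The grouping described above, plus the ID-ordering concern raised in the conversation, can be sketched as follows. The `partition` struct, the sample node values, and the zero-padded key format are illustrative assumptions; only the use of `domain` and `location_id` as the grouping key comes from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// partition models the per-node fields read from KFD topology
// (hypothetical struct for illustration).
type partition struct {
	node       int
	domain     uint32
	locationID uint32
}

// groupByParent buckets KFD nodes by (domain, location_id), the pair that
// identifies the parent PCI device even when unique_id differs per
// partition, as with ROCm 7.0.1+.
func groupByParent(parts []partition) map[string][]int {
	groups := map[string][]int{}
	for _, p := range parts {
		// Zero-padded hex so lexicographic key order matches PCI order.
		key := fmt.Sprintf("%04x:%08x", p.domain, p.locationID)
		groups[key] = append(groups[key], p.node)
	}
	return groups
}

func main() {
	// Three KFD nodes: nodes 0 and 1 are partitions of one GPU, node 2
	// belongs to another (illustrative values, not from the PR).
	parts := []partition{
		{node: 0, domain: 0, locationID: 0x300},
		{node: 1, domain: 0, locationID: 0x300},
		{node: 2, domain: 0, locationID: 0x400},
	}
	groups := groupByParent(parts)

	// Assign GPU IDs in sorted-key order so the numbering lines up with
	// amd-smi/rocm-smi output, addressing the ordering concern above.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for gpuID, k := range keys {
		fmt.Printf("GPU %d -> nodes %v\n", gpuID, groups[k])
	}
	// GPU 0 -> nodes [0 1]
	// GPU 1 -> nodes [2]
}
```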
Test Plan
Test Result
Verified CDI generation manually on an AMD Instinct MI308X node. Modified UTs to use the new fields `domain` and `location_id` for grouping.

Submission Checklist