
update GPU partition grouping logic #118

Merged
sgopinath1 merged 3 commits into ROCm:main from biluriuday:partition-grouping
Apr 27, 2026

Conversation

@biluriuday (Contributor)

Motivation

Group GPU partitions by (domain, location_id) instead of unique_id

The logic that groups partitions belonging to the same physical GPU previously relied on the unique_id field from the KFD topology properties file. In earlier versions of the amdgpu driver, all partitions of a GPU shared the same unique_id, which made it a reliable grouping key. Starting with ROCm 7.0.1, this is no longer the case — each partition can report a distinct unique_id, breaking the existing grouping.

This change updates the grouping logic to use the domain and location_id fields instead, which together uniquely identify the parent PCI device and therefore correctly group all partitions of the same GPU.
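The grouping idea can be sketched as follows. This is a minimal illustration, not the actual container-toolkit code: the property names (domain, location_id) match the KFD topology properties file described above, but the parsing, types, and key format are simplified assumptions.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// node holds only the subset of KFD topology properties relevant to
// grouping; real properties files contain many more fields.
type node struct {
	id         int
	domain     uint64
	locationID uint64
}

// parentKey identifies the physical GPU a partition belongs to.
// domain and location_id together identify the parent PCI device, so
// all partitions of one GPU share the same key even when each
// partition reports a distinct unique_id (as on ROCm 7.0.1+).
func parentKey(n node) string {
	return fmt.Sprintf("%04x:%08x", n.domain, n.locationID)
}

// parseNode extracts domain and location_id from "key value" lines,
// mimicking the properties file layout.
func parseNode(id int, props string) node {
	n := node{id: id}
	sc := bufio.NewScanner(strings.NewReader(props))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		switch fields[0] {
		case "domain":
			fmt.Sscanf(fields[1], "%d", &n.domain)
		case "location_id":
			fmt.Sscanf(fields[1], "%d", &n.locationID)
		}
	}
	return n
}

func main() {
	// Two partitions of the same GPU (same domain/location_id) and one
	// partition of a different GPU; the values are hypothetical.
	nodes := []node{
		parseNode(0, "domain 0\nlocation_id 8960\n"),
		parseNode(1, "domain 0\nlocation_id 8960\n"),
		parseNode(2, "domain 0\nlocation_id 12544\n"),
	}
	groups := map[string][]int{}
	for _, n := range nodes {
		k := parentKey(n)
		groups[k] = append(groups[k], n.id)
	}
	for k, ids := range groups {
		fmt.Println(k, ids)
	}
}
```

With this key, partitions 0 and 1 land in one group and partition 2 in another, regardless of what each partition reports as its unique_id.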

Technical Details

Test Plan

Test Result

Verified CDI generation manually on an AMD Instinct MI308X node. Updated the unit tests to use the new domain and location_id fields for grouping.

uday@uday:~/src/github.com/biluriuday/container-toolkit/internal/amdgpu$ go test -v .
=== RUN   TestGetAMDGPUs
=== RUN   TestGetAMDGPUs/no_amdgpu_driver
=== RUN   TestGetAMDGPUs/single_GPU_device
=== RUN   TestGetAMDGPUs/GPU_with_partition
=== RUN   TestGetAMDGPUs/multiple_GPUs
=== RUN   TestGetAMDGPUs/unordered_partitions
--- PASS: TestGetAMDGPUs (0.00s)
    --- PASS: TestGetAMDGPUs/no_amdgpu_driver (0.00s)
    --- PASS: TestGetAMDGPUs/single_GPU_device (0.00s)
    --- PASS: TestGetAMDGPUs/GPU_with_partition (0.00s)
    --- PASS: TestGetAMDGPUs/multiple_GPUs (0.00s)
    --- PASS: TestGetAMDGPUs/unordered_partitions (0.00s)
=== RUN   TestParseTopologyProperties
=== RUN   TestParseTopologyProperties/valid_property
=== RUN   TestParseTopologyProperties/property_not_found
--- PASS: TestParseTopologyProperties (0.00s)
    --- PASS: TestParseTopologyProperties/valid_property (0.00s)
    --- PASS: TestParseTopologyProperties/property_not_found (0.00s)
=== RUN   TestParseTopologyPropertiesString
=== RUN   TestParseTopologyPropertiesString/valid_property
=== RUN   TestParseTopologyPropertiesString/property_not_found
--- PASS: TestParseTopologyPropertiesString (0.00s)
    --- PASS: TestParseTopologyPropertiesString/valid_property (0.00s)
    --- PASS: TestParseTopologyPropertiesString/property_not_found (0.00s)
=== RUN   TestGetAMDGPUWithFS
=== RUN   TestGetAMDGPUWithFS/valid_GPU_device
=== RUN   TestGetAMDGPUWithFS/valid_render_device
=== RUN   TestGetAMDGPUWithFS/non-existent_device
--- PASS: TestGetAMDGPUWithFS (0.00s)
    --- PASS: TestGetAMDGPUWithFS/valid_GPU_device (0.00s)
    --- PASS: TestGetAMDGPUWithFS/valid_render_device (0.00s)
    --- PASS: TestGetAMDGPUWithFS/non-existent_device (0.00s)
=== RUN   TestGetDevIdsFromTopology
=== RUN   TestGetDevIdsFromTopology/single_GPU_topology
=== RUN   TestGetDevIdsFromTopology/GPU_with_partition_topology
=== RUN   TestGetDevIdsFromTopology/multiple_GPUs_topology
=== RUN   TestGetDevIdsFromTopology/unordered_partitions_topology
--- PASS: TestGetDevIdsFromTopology (0.00s)
    --- PASS: TestGetDevIdsFromTopology/single_GPU_topology (0.00s)
    --- PASS: TestGetDevIdsFromTopology/GPU_with_partition_topology (0.00s)
    --- PASS: TestGetDevIdsFromTopology/multiple_GPUs_topology (0.00s)
    --- PASS: TestGetDevIdsFromTopology/unordered_partitions_topology (0.00s)
=== RUN   TestGetUniqueIdToDeviceIndexMapWithFS
=== RUN   TestGetUniqueIdToDeviceIndexMapWithFS/single_GPU_UUID_mapping
=== RUN   TestGetUniqueIdToDeviceIndexMapWithFS/GPU_with_partition_UUID_mapping
=== RUN   TestGetUniqueIdToDeviceIndexMapWithFS/multiple_GPUs_UUID_mapping
=== RUN   TestGetUniqueIdToDeviceIndexMapWithFS/unordered_partitions_UUID_mapping
--- PASS: TestGetUniqueIdToDeviceIndexMapWithFS (0.00s)
    --- PASS: TestGetUniqueIdToDeviceIndexMapWithFS/single_GPU_UUID_mapping (0.00s)
    --- PASS: TestGetUniqueIdToDeviceIndexMapWithFS/GPU_with_partition_UUID_mapping (0.00s)
    --- PASS: TestGetUniqueIdToDeviceIndexMapWithFS/multiple_GPUs_UUID_mapping (0.00s)
    --- PASS: TestGetUniqueIdToDeviceIndexMapWithFS/unordered_partitions_UUID_mapping (0.00s)
PASS
ok      github.com/ROCm/container-toolkit/internal/amdgpu       (cached)

Submission Checklist

@bhatturu (Collaborator)

@biluriuday Is the bug purely an ordering issue, i.e. all partitions are still returned correctly, just potentially interleaved across physical GPUs, or is there a case where the wrong grouping key actually causes partitions to be dropped or duplicated?

@biluriuday (Contributor, Author)

@biluriuday Is the bug purely an ordering issue, i.e. all partitions are still returned correctly, just potentially interleaved across physical GPUs, or is there a case where the wrong grouping key actually causes partitions to be dropped or duplicated?

@bhatturu All partitions are returned. The problem is that the grouping and GPU ID numbering are not ordered correctly: the GPU IDs (0, 1, 2, etc.) we assign to the partitions do not match the amd-smi and rocm-smi output, which can confuse the end user.

Copilot AI left a comment

Pull request overview

Updates AMDGPU partition grouping to be resilient to ROCm 7.0.1+ behavior where unique_id can differ per partition, by deriving a stable parent device key from KFD topology (domain + location_id).

Changes:

  • Switch GetDevIdsFromTopology to compute a parent devID from domain/location_id instead of unique_id.
  • Add GetUniqueIdsFromTopology to preserve the previous unique_id mapping behavior where still needed.
  • Update topology test fixtures and unit tests to include/expect domain and location_id.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • internal/amdgpu/amdgpu.go: Implements the new devID derivation from domain/location_id and adds a helper to still fetch unique_id mappings.
  • internal/amdgpu/amdgpu_test.go: Updates expected mappings for the new devID format in topology-related unit tests.
  • tests/amdgpu/topology/nodes/0/properties: Adds location_id/domain fields to the topology fixture.
  • tests/amdgpu/topology/nodes/1/properties: Adds location_id/domain fields to the topology fixture.
  • tests/amdgpu/topology/nodes/2/properties: Adds location_id/domain fields to the topology fixture.


@shiv-tyagi (Member) left a comment

Do the location_id and domain fields exist for pre-ROCm 7.0.1 driver versions? Just to make sure that we are not breaking systems using those driver versions.

@biluriuday (Contributor, Author)

Do the location_id and domain fields exist for pre-ROCm 7.0.1 driver versions? Just to make sure that we are not breaking systems using those driver versions.

Yes, these fields are present in ROCm 6.x versions as well.

@shiv-tyagi (Member) left a comment

LGTM. Thanks @biluriuday.

@sgopinath1 sgopinath1 merged commit 52aaff5 into ROCm:main Apr 27, 2026
3 checks passed
@biluriuday biluriuday deleted the partition-grouping branch April 28, 2026 05:17