Skip to content

[CRL] Support pool-only registration and optimize regpool containers#482

Open
MC952-arch wants to merge 4 commits into
flagos-ai:mainfrom
MC952-arch:register-opt
Open

[CRL] Support pool-only registration and optimize regpool containers#482
MC952-arch wants to merge 4 commits into
flagos-ai:mainfrom
MC952-arch:register-opt

Conversation

@MC952-arch
Copy link
Copy Markdown
Collaborator

@MC952-arch MC952-arch commented May 27, 2026

PR Category

CRL

PR Types

Optimizations

PR Description

This PR aims to (1) allow flagcxCommRegister/flagcxCommDeregister to be called with a null comm (pool-only registration), and (2) optimize the registration pool’s internal containers by switching to std::unordered_map and std::vector, introducing a GLOBAL_POOL_KEY for null-comm registrations.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to (1) allow flagcxCommRegister/flagcxCommDeregister to be called with a null comm (pool-only registration), and (2) optimize the registration pool’s internal containers by switching to std::unordered_map and std::vector, introducing a GLOBAL_POOL_KEY for null-comm registrations.

Changes:

  • Allow null comm in flagcxCommRegister / flagcxCommDeregister and skip backend registration steps when comm == nullptr.
  • Refactor flagcxRegPool to use a global pool key and unordered containers; switch handle storage from std::list to std::vector and optimize erase loops.
  • Update reg-pool public types/signatures to reflect new container choices.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
flagcx/flagcx.cc Permits null-comm registration/deregistration and adjusts regKey selection + backend-step skipping.
flagcx/core/reg_pool.cc Refactors reg-pool internals (global pool behavior, unordered containers, vector handle storage, updated cleanup logic).
flagcx/core/include/register.h Switches flagcxRegItem::handles from std::list to std::vector.
flagcx/core/include/reg_pool.h Introduces GLOBAL_POOL_KEY and changes reg-pool container types to unordered maps.
Comments suppressed due to low confidence (1)

flagcx/flagcx.cc:1031

  • flagcxCommRegister() treats regItem->refCount > 1 as a re-registration and skips backend-specific steps (CCL registration / IPC handle creation / one-sided MR). After this PR, refCount is incremented in a global pool shared across comms, so a second different comm registering the same buffer can hit this fast path and return success without initializing comm-specific backend state. This can lead to missing homoRegHandle/ipcHandleData for that comm and subsequent failures. You likely need refCounts (and the re-registration shortcut) to remain per-comm registration context, not global across comms, or store backend state per-comm instead of on the shared flagcxRegItem.
  // Re-registration: backend handle + IPC handle already set up
  if (regItem->refCount > 1) {
    return flagcxSuccess;
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flagcx/core/include/reg_pool.h
Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/core/reg_pool.cc Outdated
@MC952-arch MC952-arch closed this May 28, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/flagcx.cc
Comment thread flagcx/core/include/register.h
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

test/script/_gpu_check.sh:19

  • The added || true makes the script continue when nvidia-smi fails, but the subsequent arithmetic expansions ($((${memory_usage_array[$i]})), etc.) will hit $(()) with an empty value and can error out (or compute incorrect values) when the arrays are empty. Consider explicitly detecting nvidia-smi failure/empty output (and gpu_count==0) and sleep/continue rather than proceeding with empty arrays.
    IFS=$'\n' read -d '' -r -a memory_usage_array <<< "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits)" || true
    IFS=$'\n' read -d '' -r -a memory_total_array <<< "$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits)" || true

    need_wait=false

    for ((i=0; i<$gpu_count; i++)); do

        memory_usage_i=$((${memory_usage_array[$i]}))
        memory_total_i=$((${memory_total_array[$i]}))
        memory_remin_i=$(($memory_total_i-$memory_usage_i))

Comment thread flagcx/flagcx.cc
Comment thread flagcx/core/p2p.cc
Comment thread flagcx/core/reg_pool.cc
Comment thread flagcx/core/reg_pool.cc
@MC952-arch MC952-arch requested a review from Copilot May 28, 2026 09:33
@MC952-arch MC952-arch changed the title [CRL] regpool: support null comm registration and optimize containers [CRL] Support pool-only registration and optimize regpool containers May 28, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comment thread flagcx/core/reg_pool.cc Outdated
Comment thread test/unittest/core/test_reg_pool.cpp
Comment thread test/script/_gpu_check.sh Outdated
Comment thread .github/workflows/test.yml
Comment thread flagcx/core/reg_pool.cc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants