[Bug] Group eligible endpoint reuse can leave scheduler requeueing forever #1105

@puddingfjz

Platform

All / Unknown

Runtime Variant

All / Unknown

Description

On current simpler main, a group task can be requeued forever when the final
eligible endpoint sets require the same endpoint to be reused for multiple
automatically selected group members.

This is based on origin/main at:

26b7b1507476024d6c97dbf97e52545853d44bd6

The problematic shape is:

eligible_endpoint_ids = {{0}, {0}};

For a group of size 2, if endpoint 0 exists and both members have no explicit
worker affinity, this submit shape can pass validation. Scheduler dispatch then
cannot assign the second member because automatic selection excludes endpoints
already selected for earlier members in the same group.

Current Main Code Example

In src/common/hierarchical/orchestrator.cpp, current main only checks that
each eligible endpoint set is non-empty. If a member has no explicit affinity,
validation skips the rest of the checks:

for (size_t i = 0; i < args_count; ++i) {
    const auto &eligible =
        eligible_endpoint_ids.empty() ? std::vector<int32_t>{} : eligible_endpoint_ids[i];
    if (!eligible_endpoint_ids.empty() && eligible.empty()) {
        throw std::invalid_argument(
            "Orchestrator: final eligible endpoint set is empty for member " + std::to_string(i)
        );
    }
    int8_t affinity = affinities.empty() ? int8_t(-1) : affinities[i];
    if (affinity < 0) continue;

    ...
}

So eligible_endpoint_ids = {{0}, {0}} is not rejected when both group members
are unconstrained by explicit affinity.

In src/common/hierarchical/types.h, current main stores and exposes
per-member eligible endpoint sets:

const std::vector<int32_t> &eligible_endpoints_for(int32_t i) const {
    static const std::vector<int32_t> empty;
    if (eligible_endpoint_ids.empty()) return empty;
    if (i < 0 || static_cast<size_t>(i) >= eligible_endpoint_ids.size()) return empty;
    return eligible_endpoint_ids[static_cast<size_t>(i)];
}

In src/common/hierarchical/scheduler.cpp, current main uses all-or-nothing
group dispatch. It first selects workers for all group members, and only
dispatches after every member has a selected worker:

std::vector<WorkerThread *> workers(static_cast<size_t>(N), nullptr);
bool ok = true;

// Pass 2: fill unconstrained slots from idle pool
if (ok) {
    for (int i = 0; i < N; i++) {
        if (workers[static_cast<size_t>(i)] != nullptr) continue;
        auto *wt =
            cfg_.manager->pick_idle_excluding_eligible(
                s.worker_type, workers, s.eligible_endpoints_for(i));
        if (!wt) {
            ok = false;
            break;
        }
        workers[static_cast<size_t>(i)] = wt;
    }
}

if (!ok) {
    q->push(slot);
    break;
}

s.state.store(TaskState::RUNNING, std::memory_order_release);

The exclusion happens inside
src/common/hierarchical/worker_manager.cpp::pick_idle_excluding_eligible():

bool excluded = false;
for (auto *ex : exclude) {
    if (ex == wt.get()) {
        excluded = true;
        break;
    }
}
if (!excluded) return wt.get();

For eligible_endpoint_ids = {{0}, {0}}, dispatch behaves like this:

  1. member 0 tentatively selects endpoint 0 and stores it in workers[0];
  2. member 1 is also restricted to endpoint 0;
  3. pick_idle_excluding_eligible() sees endpoint 0, but it is already in the
    exclude list;
  4. no endpoint is returned for member 1;
  5. ok = false;
  6. the whole group slot is pushed back to the ready queue;
  7. no member is dispatched, so the same state can repeat forever.
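
The steps above can be sketched as a small standalone model. This is not the scheduler code: `pick_excluding`, `try_select_group`, and `count_successful_attempts` are hypothetical names, endpoints are reduced to plain ids, and every endpoint is assumed to be idle.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for worker selection; not the real WorkerManager API.
// Returns the first eligible endpoint id that an earlier member of the same
// group has not already taken, or -1 if none is left.
int32_t pick_excluding(const std::vector<int32_t> &eligible,
                       const std::vector<int32_t> &taken) {
    for (int32_t id : eligible) {
        bool excluded = false;
        for (int32_t t : taken) {
            if (t == id) {
                excluded = true;
                break;
            }
        }
        if (!excluded) return id;
    }
    return -1;
}

// One all-or-nothing selection attempt for the whole group.
// Returns true only if every member received an endpoint.
bool try_select_group(const std::vector<std::vector<int32_t>> &eligible_endpoint_ids) {
    std::vector<int32_t> taken;
    for (const auto &eligible : eligible_endpoint_ids) {
        int32_t id = pick_excluding(eligible, taken);
        if (id < 0) return false;  // models ok = false followed by a requeue
        taken.push_back(id);
    }
    return true;
}

// Retries the same group max_attempts times and counts the successes.
// Nothing is dispatched between attempts, so every retry sees the same state.
int count_successful_attempts(const std::vector<std::vector<int32_t>> &eligible_endpoint_ids,
                              int max_attempts) {
    int successes = 0;
    for (int i = 0; i < max_attempts; ++i) {
        if (try_select_group(eligible_endpoint_ids)) ++successes;
    }
    return successes;
}
```

In this model, `try_select_group` fails for `{{0}, {0}}` on every attempt, while `{{0}, {1}}` succeeds on the first one. The retry count is bounded here only so that the model terminates.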

Steps to Reproduce

1. Register one NEXT_LEVEL endpoint with endpoint id 0.
2. Submit a NEXT_LEVEL group task with two members and no explicit worker
   affinity.
3. Set both members' final eligible endpoint set to endpoint 0:

   orch.submit_next_level_group(callable, {args0, args1}, cfg, {}, {{0}, {0}});

4. Run the scheduler/drain path.

Expected Behavior

The scheduler should not requeue forever. The project should choose and
document one contract:

  • allow endpoint reuse by dispatching both group members to endpoint 0, where
    the WorkerThread queue runs them sequentially, or
  • reject this shape at submit time with a clear invalid_argument if group
    members are required to occupy distinct endpoints.
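
If the distinct-endpoint contract is chosen, one possible submit-time check is to test whether the eligible sets admit any assignment of distinct endpoints at all. The sketch below does this with augmenting paths (Kuhn's bipartite matching algorithm). `try_assign` and `has_distinct_assignment` are hypothetical helpers, not part of the orchestrator, and the sketch ignores explicit affinities and idle state.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Tries to give member m an endpoint, moving earlier members to another
// eligible endpoint when necessary (one augmenting-path search).
// owner maps an endpoint id to the member currently assigned to it.
bool try_assign(std::size_t m,
                const std::vector<std::vector<int32_t>> &eligible,
                std::unordered_map<int32_t, std::size_t> &owner,
                std::unordered_set<int32_t> &visited) {
    for (int32_t id : eligible[m]) {
        if (!visited.insert(id).second) continue;  // already tried in this search
        auto it = owner.find(id);
        if (it == owner.end() || try_assign(it->second, eligible, owner, visited)) {
            owner[id] = m;
            return true;
        }
    }
    return false;
}

// Returns true if every member can be given its own endpoint from its
// eligible set. A caller enforcing the distinct-endpoint contract could
// throw std::invalid_argument when this returns false.
bool has_distinct_assignment(const std::vector<std::vector<int32_t>> &eligible) {
    std::unordered_map<int32_t, std::size_t> owner;
    for (std::size_t m = 0; m < eligible.size(); ++m) {
        std::unordered_set<int32_t> visited;
        if (!try_assign(m, eligible, owner, visited)) return false;
    }
    return true;
}
```

Under these assumptions, `{{0}, {0}}` is rejected and `{{0, 1}, {0}}` is accepted, because member 0 can move to endpoint 1. A simple duplicate check on the sets would not distinguish these two shapes.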

Actual Behavior

The submit can succeed, but scheduler dispatch cannot complete worker
selection. The whole group slot is pushed back to the ready queue and retried.
Since no member is dispatched, the slot can remain undrained.

Git Commit ID

26b7b15

CANN Version

N/A - scheduler logic issue, not hardware-specific

Driver Version

N/A - scheduler logic issue, not hardware-specific

Host Platform

Linux (aarch64)

Additional Context

This was found while reviewing PR #1011's remote L3 worker-id cleanup. PR #1011 should only reject unknown eligible endpoint/worker ids at submit time. It should not force a distinct-endpoint contract for {{0}, {0}}, because endpoint reuse may be a valid scheduler behavior. The broader scheduler contract issue should be tracked separately here.
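
To illustrate the narrower check described above, here is a minimal sketch that rejects only unknown endpoint ids and says nothing about reuse. It is not the code from PR #1011: `validate_known_endpoints` and the `registered_ids` set are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: throws if any member lists an endpoint id that is not
// registered. Repeated ids across members are deliberately allowed, so the
// reuse question stays with the scheduler contract.
void validate_known_endpoints(const std::vector<std::vector<int32_t>> &eligible_endpoint_ids,
                              const std::unordered_set<int32_t> &registered_ids) {
    for (std::size_t i = 0; i < eligible_endpoint_ids.size(); ++i) {
        for (int32_t id : eligible_endpoint_ids[i]) {
            if (registered_ids.count(id) == 0) {
                throw std::invalid_argument(
                    "unknown eligible endpoint id " + std::to_string(id) +
                    " for member " + std::to_string(i));
            }
        }
    }
}
```

With only endpoint 0 registered, this accepts `{{0}, {0}}` and throws for any set that names another id.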

Metadata

Labels

bug: Something isn't working
