[Bug] Group eligible endpoint reuse can leave scheduler requeueing forever #1105

### Platform

All / Unknown

### Runtime Variant

All / Unknown

### Description

Current `simpler` main can leave a group task requeued forever when the final
eligible endpoint sets require reusing the same endpoint for multiple
automatically selected group members.

This is based on `origin/main` at:

```text
26b7b1507476024d6c97dbf97e52545853d44bd6
```

The problematic shape is:

```cpp
eligible_endpoint_ids = {{0}, {0}};
```

For a group of size 2, if endpoint 0 exists and both members have no explicit
worker affinity, this submit shape can pass validation. Scheduler dispatch then
cannot assign the second member because automatic selection excludes endpoints
already selected for earlier members in the same group.

### Current Main Code Example

In `src/common/hierarchical/orchestrator.cpp`, current main only checks that
each eligible endpoint set is non-empty. If a member has no explicit affinity,
validation skips the rest of the checks:

```cpp
for (size_t i = 0; i < args_count; ++i) {
    const auto &eligible =
        eligible_endpoint_ids.empty() ? std::vector<int32_t>{} : eligible_endpoint_ids[i];
    if (!eligible_endpoint_ids.empty() && eligible.empty()) {
        throw std::invalid_argument(
            "Orchestrator: final eligible endpoint set is empty for member " + std::to_string(i)
        );
    }
    int8_t affinity = affinities.empty() ? int8_t(-1) : affinities[i];
    if (affinity < 0) continue;

    ...
}
```

So `eligible_endpoint_ids = {{0}, {0}}` is not rejected when both group members
are unconstrained by explicit affinity.

In `src/common/hierarchical/types.h`, current main stores and exposes
per-member eligible endpoint sets:

```cpp
const std::vector<int32_t> &eligible_endpoints_for(int32_t i) const {
    static const std::vector<int32_t> empty;
    if (eligible_endpoint_ids.empty()) return empty;
    if (i < 0 || static_cast<size_t>(i) >= eligible_endpoint_ids.size()) return empty;
    return eligible_endpoint_ids[static_cast<size_t>(i)];
}
```

In `src/common/hierarchical/scheduler.cpp`, current main uses all-or-nothing
group dispatch. It first selects workers for all group members, and only
dispatches after every member has a selected worker:

```cpp
std::vector<WorkerThread *> workers(static_cast<size_t>(N), nullptr);
bool ok = true;

// Pass 2: fill unconstrained slots from idle pool
if (ok) {
    for (int i = 0; i < N; i++) {
        if (workers[static_cast<size_t>(i)] != nullptr) continue;
        auto *wt =
            cfg_.manager->pick_idle_excluding_eligible(
                s.worker_type, workers, s.eligible_endpoints_for(i));
        if (!wt) {
            ok = false;
            break;
        }
        workers[static_cast<size_t>(i)] = wt;
    }
}

if (!ok) {
    q->push(slot);
    break;
}

s.state.store(TaskState::RUNNING, std::memory_order_release);
```

The exclusion happens inside
`src/common/hierarchical/worker_manager.cpp::pick_idle_excluding_eligible()`:

```cpp
bool excluded = false;
for (auto *ex : exclude) {
    if (ex == wt.get()) {
        excluded = true;
        break;
    }
}
if (!excluded) return wt.get();
```

For `eligible_endpoint_ids = {{0}, {0}}`, dispatch behaves like this:

1. member 0 tentatively selects endpoint 0 and stores it in `workers[0]`;
2. member 1 is also restricted to endpoint 0;
3. `pick_idle_excluding_eligible()` sees endpoint 0, but it is already in the
   exclude list;
4. no endpoint is returned for member 1;
5. `ok = false`;
6. the whole group slot is pushed back to the ready queue;
7. no member is dispatched, so the same state can repeat forever.
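The steps above can be reproduced in isolation with a small model. This is a minimal sketch, not the code in `scheduler.cpp` or `worker_manager.cpp`: `Endpoint`, `pick_idle_excluding`, `try_select_group`, and `first_successful_attempt` are hypothetical names, and the picker assumes every member lists its eligible endpoints explicitly. It only illustrates why retrying cannot help when nothing changes between attempts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a worker endpoint; not the real WorkerThread.
struct Endpoint {
    int32_t id;
    bool idle;
};

// Simplified picker: first idle endpoint that is listed in `eligible` and is
// not already present in `exclude` (null entries in `exclude` never match).
Endpoint *pick_idle_excluding(std::vector<Endpoint> &pool,
                              const std::vector<Endpoint *> &exclude,
                              const std::vector<int32_t> &eligible) {
    for (auto &ep : pool) {
        if (!ep.idle) continue;
        if (std::find(eligible.begin(), eligible.end(), ep.id) == eligible.end()) continue;
        if (std::find(exclude.begin(), exclude.end(), &ep) != exclude.end()) continue;
        return &ep;
    }
    return nullptr;
}

// One all-or-nothing selection attempt. Returns true only if every member got
// an endpoint; on failure a scheduler would requeue the whole group.
bool try_select_group(std::vector<Endpoint> &pool,
                      const std::vector<std::vector<int32_t>> &eligible_endpoint_ids) {
    std::vector<Endpoint *> workers(eligible_endpoint_ids.size(), nullptr);
    for (size_t i = 0; i < workers.size(); ++i) {
        workers[i] = pick_idle_excluding(pool, workers, eligible_endpoint_ids[i]);
        if (workers[i] == nullptr) return false;
    }
    return true;
}

// Retry selection up to `max_attempts` times, as a requeue loop would.
// Returns the 1-based attempt that succeeded, or -1 if none did.
int first_successful_attempt(std::vector<Endpoint> &pool,
                             const std::vector<std::vector<int32_t>> &eligible_endpoint_ids,
                             int max_attempts) {
    assert(max_attempts >= 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_select_group(pool, eligible_endpoint_ids)) return attempt;
    }
    return -1;
}
```

In this model, a pool containing only endpoint 0 and eligible sets `{{0}, {0}}` never produces a successful attempt, no matter how large `max_attempts` is, because the pool state is identical on every retry.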


### Steps to Reproduce

```markdown
1. Register one NEXT_LEVEL endpoint with endpoint id 0.
2. Submit a NEXT_LEVEL group task with two members and no explicit worker
   affinity.
3. Set both members' final eligible endpoint set to endpoint 0:

       orch.submit_next_level_group(callable, {args0, args1}, cfg, {}, {{0}, {0}});

4. Run the scheduler/drain path.
```

### Expected Behavior

The scheduler should not requeue forever. The project should choose and
document one contract:

- allow endpoint reuse by dispatching both group members to endpoint 0, where
  the `WorkerThread` queue runs them sequentially, or
- reject this shape at submit time with a clear `invalid_argument` if group
  members are required to occupy distinct endpoints.
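If the distinct-endpoint contract is chosen, a submit-time check would need to decide whether the eligible sets allow every member its own endpoint. Comparing sets pairwise is not enough in general (for example, three members that all list `{0, 1}`), so the sketch below uses a standard augmenting-path bipartite matching. This is only an illustration under assumptions: `has_distinct_assignment` is a hypothetical helper that does not exist in the repository, it assumes every member has an explicit, non-empty eligible set, and it ignores explicit affinities and whether endpoints are idle.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns true if each group member can be given its own
// endpoint drawn from that member's eligible set.
bool has_distinct_assignment(const std::vector<std::vector<int32_t>> &eligible_endpoint_ids) {
    // endpoint id -> index of the member currently assigned to it
    std::unordered_map<int32_t, size_t> owner;

    // Try to place member `m`. If a wanted endpoint is taken, try to move its
    // current owner to another endpoint (augmenting path search).
    std::function<bool(size_t, std::unordered_set<int32_t> &)> place =
        [&](size_t m, std::unordered_set<int32_t> &visited) {
            for (int32_t ep : eligible_endpoint_ids[m]) {
                if (!visited.insert(ep).second) continue;  // already tried in this search
                auto it = owner.find(ep);
                if (it == owner.end() || place(it->second, visited)) {
                    owner[ep] = m;
                    return true;
                }
            }
            return false;
        };

    for (size_t m = 0; m < eligible_endpoint_ids.size(); ++m) {
        std::unordered_set<int32_t> visited;  // fresh per member being placed
        if (!place(m, visited)) return false;
    }
    return true;
}
```

Under the distinct-endpoint contract, a submit path could throw `std::invalid_argument` when this check fails. Under the reuse contract no such check is needed, and the change would instead be in how already-selected endpoints are excluded during dispatch.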


### Actual Behavior

The submit can succeed, but scheduler dispatch cannot complete worker
selection. The whole group slot is pushed back to the ready queue and retried.
Because no member is dispatched and endpoint 0 is excluded for the second
member on every retry, the slot never drains.


### Git Commit ID

26b7b1507476024d6c97dbf97e52545853d44bd6

### CANN Version

N/A - scheduler logic issue, not hardware-specific

### Driver Version

N/A - scheduler logic issue, not hardware-specific

### Host Platform

Linux (aarch64)

### Additional Context

This was found while reviewing PR #1011's remote L3 worker-id cleanup. PR #1011 should only reject unknown eligible endpoint/worker ids at submit time. It should not force a distinct-endpoint contract for `{{0}, {0}}`, because endpoint reuse may be valid scheduler behavior. The broader question of the scheduler contract should be tracked separately in this issue.
