[Bug] runtime_fatal_codes a5 ST: CANN driver exit-time SIGSEGV (libascend_trace/HDC/URMA race) after test passes #1209

### Platform

a5 (Ascend 950 hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

The `runtime_fatal_codes` ST suite (`test_device_error_class_reaches_host_log`,
the onboard a5 negative-path cases) intermittently fails CI with the subprocess
exiting on **SIGSEGV (`rc=-11`)** even though pytest itself reports `1 passed`.

The crash is **not** in simpler code and **not** in the test logic. It is a
**process-exit-time use-after-free inside the closed-source CANN / Ascend driver
stack** (`libascend_trace` → `libascend_hal` → `liburma` → `libummu`). The test
assertions complete successfully *first* (`1 passed`); the segfault happens
afterward, during interpreter teardown, in a CANN background thread racing the
main thread's `exit()`.

This issue is a **record / tracking entry**: the triggering parametrized case
has already been removed from mainline (the suite was thinned). It is filed to
preserve the full root-cause analysis, the native backtrace, and the
reproduction conditions for (a) reporting upstream to the CANN/driver team and
(b) anyone who re-introduces these negative-path cases later.

### Steps to Reproduce

```markdown
The crash is a probabilistic process-exit race; reproduction rate scales with
**concurrency**, not with the number of cases run.

Faithful CI repro (reproduces ~3 of 5 rounds on an a5 box):

1. Hold 2 a5 devices via task-submit (CI runs the a5 onboard st suite on 2
   devices — `--device 4-5` on this runner — with scheduler
   `max_parallel = device_count = 2`, so two case subprocesses run AND exit
   concurrently; the specific device numbers do not matter, only that two
   processes are concurrent):

   task-submit --device auto --device-num 2 \
     --run "for r in 1 2 3 4 5; do \
              python -m pytest tests/st/runtime_fatal_codes \
                --platform a5 --device \$TASK_DEVICE -q; \
            done"

2. Watch for `FAIL rc=-11` / a subprocess killed by signal 11. pytest still
   prints `1 passed` for the case body; only the process *exit* segfaults.

Negative control (does NOT reproduce — confirms concurrency is the amplifier):
running the cases one-by-one, single device, serially gave 0 SIGSEGV in 30
consecutive process exits on the same (idle) box.

Capturing the native stack (core is owned by root via taskqueue.service):

  coredumpctl dump <pid> --output=/tmp/c.core
  gdb -q <python> /tmp/c.core --batch -ex 'thread apply all bt'
```
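
The "two processes run and exit concurrently" condition from step 1 can be approximated by a small driver that starts every child before waiting on any of them, then tallies signal deaths. This is a minimal sketch, not the CI scheduler: `run_round`, `tally_signal_exits`, and `STAND_IN` are hypothetical names, and the command lists are whatever pytest invocations the runner would normally launch.

```python
import signal
import subprocess
import sys
from collections import Counter

# Placeholder child that exits cleanly; replace with the real pytest command line.
STAND_IN = [sys.executable, "-c", "pass"]


def run_round(commands):
    """Start every command before waiting on any, so the children overlap."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def tally_signal_exits(commands, rounds):
    """Count child exits caused by a signal, keyed by signal name."""
    hits = Counter()
    for _ in range(rounds):
        for returncode in run_round(commands):
            # POSIX: subprocess reports "killed by signal N" as returncode -N.
            if returncode < 0:
                hits[signal.Signals(-returncode).name] += 1
    return hits
```

Passing one pytest command line per held device with `rounds=5` would mirror the five-round loop above; a non-zero `SIGSEGV` count corresponds to the `rc=-11` failures.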

### Expected Behavior

The fatal-code negative-path cases assert that the device error class reaches
the host log, then the worker tears down cleanly and the process exits 0.
Process teardown after a successful assertion should not segfault.

### Actual Behavior

pytest reports the case passed, then the **process** dies with SIGSEGV during
exit. The scheduler records `rc=-11` and marks the case FAIL purely on the
non-zero exit code.
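
For readers unfamiliar with the `rc=-11` notation: Python's `subprocess` reports a child killed by signal N as return code -N, and signal 11 on Linux is SIGSEGV. The sketch below decodes that and shows how a "passed, then crashed at exit" outcome could be told apart from an ordinary failure. The function names and outcome labels are illustrative assumptions, not the scheduler's actual code.

```python
import signal


def describe_exit(returncode):
    """Translate a subprocess returncode into a readable exit description."""
    if returncode >= 0:
        return f"exit status {returncode}"
    # Negative values mean the child was terminated by signal -returncode.
    try:
        return f"killed by {signal.Signals(-returncode).name}"
    except ValueError:
        return f"killed by signal {-returncode}"


def classify(saw_passed_summary, returncode):
    """Hypothetical three-way outcome for one case subprocess."""
    if returncode == 0:
        return "PASS"
    if saw_passed_summary and returncode == -signal.SIGSEGV:
        # pytest reported success; only process teardown crashed.
        return "PASSED_THEN_SEGV_AT_EXIT"
    return "FAIL"
```

With the failure described here, `classify(True, -11)` yields `PASSED_THEN_SEGV_AT_EXIT`, whereas the current scheduler behaviour collapses it into `FAIL`.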

Native backtrace from the core dump (crash thread is a CANN background thread,
**not** the main thread; simpler appears in zero frames):

```
Thread 1 (CANN teardown thread):
#0  std::_Hashtable<unsigned int, ...>::find()   from libummu.so   ★ SIGSEGV
#1  hashmap_get                                  from libummu.so
#5  udma_u_unregister_seg                         from liburma-udma.so
#6  urma_unregister_seg                           from liburma.so
#7  hdc_unregister_own_urma_seg                   from libascend_hal.so   (driver)
#9  hdc_delete_ub_context                         from libascend_hal.so
#11 hdc_ub_session_close                          from libascend_hal.so
#13 halHdcSessionCloseEx                          from libascend_hal.so
#14 drvHdcSessionClose                            from libascend_hal.so
#16 AdxDestroyCommHandle                          from libascend_trace.so (CANN)
#17 ...                                           from libascend_trace.so (thread entry)

Thread 8 (main thread, concurrently):
#0  close()                                       from libc
#1  ??                                            from libummu.so
#5  exit()                                        from libc
```

Mechanism: `libascend_trace.so` is the CANN **device-log relay channel** (it
reads `ASCEND_LOG_DEVICE_FLUSH_TIMEOUT` / `ASCEND_TRACE_RECORD_NUM`; it is NOT a
profiling/ADX feature and is NOT enabled by simpler — the driver brings it up
whenever the device emits logs). At process exit, its background thread runs
`AdxDestroyCommHandle → halHdcSessionCloseEx → urma_unregister_seg`, walking the
`libummu` global hashtable, **while the main thread is already in `exit()`**
running C++/atexit teardown. The two race to tear down the same HDC/URMA
resources; the background thread dereferences a hashtable the main thread has
freed → use-after-free → SIGSEGV.

Why only the fatal-code suite: these cases deliberately drive the device into a
FATAL state (e.g. device log shows `aicpu_orchestration_entry "FATAL(code=9):
st injected fatal"`, `PTO2 runtime failed with rc=-9`), so the device-log relay
channel is *busy right up to exit* — the teardown race window is at its widest.
Normal cases emit little/no device log and almost never hit it.

Why CI fails often but a local single run usually passes (**measured**): serial
single-process exits on an idle box crashed 0 times in 30 (0% observed; a sample
of 30 only bounds the true per-exit rate below roughly 10% at 95% confidence),
whereas with CI's 2-device concurrency (two subprocesses running and exiting at
once, higher load, wider scheduling jitter) 3 of 5 rounds (~60%) hit the crash.
Concurrency widens the `exit()`-vs-background-thread race window; the specific
device numbers and device independence are irrelevant because the race is
**intra-process**.
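
As a sanity check on how much a run of 30 clean exits can establish, the sketch below computes the one-sided upper bound on the per-exit crash rate after zero hits. It assumes each exit is an independent trial with the same crash probability, which is a simplification; the function names are illustrative.

```python
def zero_hit_upper_bound(trials, confidence=0.95):
    """Largest per-exit crash rate still consistent with 0 crashes in `trials` exits.

    Solves (1 - p) ** trials == 1 - confidence for p (one-sided bound).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


def rule_of_three(trials):
    """Common shortcut for the same 95% bound: roughly 3 / trials."""
    return 3.0 / trials


print(f"exact bound for 0/30: {zero_hit_upper_bound(30):.3f}")
print(f"rule of three:        {rule_of_three(30):.3f}")
```

Both figures land near 10%, so the serial control shows the single-process rate is low but does not prove it is zero. The 3-of-5 concurrent figure is counted per round rather than per exit, so the two numbers are not directly comparable.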

### Git Commit ID

6d938bf85e68239fbb0f5802e093f2d515336822

### CANN Version

CANN 9.1.T500

### Driver Version

25.6.rc1.b108 (ascendhal_version 7.35.23)

### Host Platform

Linux (aarch64)

### Additional Context

**Root cause ownership.** Every crash frame is in closed-source CANN/driver
libraries; the defect is a missing synchronization between `libascend_trace`'s
exit-time background thread and process teardown. It cannot be *fixed* at the
application layer (we cannot lock the driver's internal hashtable or control its
thread lifecycle) — only mitigated. A true fix must come from the CANN/driver
team (make `AdxDestroyCommHandle` exit-safe / mutually exclusive with atexit, or
join the relay thread before teardown).
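
The "join the relay thread before teardown" idea can be illustrated with a toy model. This is not CANN code and says nothing about how `libascend_trace` is structured internally; `RelayChannel` and its members are hypothetical stand-ins for a background thread that walks shared state during shutdown.

```python
import threading


class RelayChannel:
    """Toy stand-in for a background relay thread that owns shared state."""

    def __init__(self):
        # Stands in for the shared segment table the background thread walks.
        self._segments = {1: "seg-a", 2: "seg-b"}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="relay", daemon=True)
        self._thread.start()

    def _run(self):
        # Background side: wait for the stop request, then unregister everything.
        self._stop.wait()
        for key in list(self._segments):
            self._segments.pop(key)

    def close(self):
        """Exit-safe shutdown: signal, join, and only then release shared state."""
        self._stop.set()
        self._thread.join()    # the background walk is finished past this point
        self._segments = None  # safe: no other thread can still be reading it
```

The ordering inside `close()` is the whole point: the owner releases the table only after the join returns, so the walker can never observe freed state. The crash described in this issue is what happens when that join is missing.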

**Mitigation options (for if these cases are re-introduced):**
1. Report upstream to CANN with this backtrace + repro (the only real fix).
2. Have the fatal-code suite's process skip native teardown via `os._exit()`
   after assertions complete (a local conftest `pytest_sessionfinish`) — removes
   one leg of the race. Onboard work holds an exclusive task-submit device lock,
   so skipping the graceful device reset is acceptable here. (New exit behavior —
   needs sign-off per `.claude/rules/env-macro-gating.md`.)
3. CI-level: treat "pytest passed but process exited `-11`" as a known
   driver-flaky outcome (attach the core stack) rather than a hard FAIL.
4. Lowering concurrency (serial / single device) reduces but does not eliminate
   the hit rate — not recommended as the final fix.
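
Option 2 could look roughly like the following `conftest.py`, placed only in the fatal-code suite's directory. It is a sketch under assumptions: it records the status in `pytest_sessionfinish` and defers the actual `os._exit()` to `pytest_unconfigure`, on the expectation that this lets pytest print its terminal summary first. That ordering, and the effect on any other plugins' cleanup, should be verified against the pytest version in use.

```python
# conftest.py (sketch) for the fatal-code suite directory only
import os
import sys

_exit_status = None  # filled in by pytest_sessionfinish


def pytest_sessionfinish(session, exitstatus):
    """Remember pytest's own verdict so the hard exit can report it."""
    global _exit_status
    _exit_status = int(exitstatus)


def pytest_unconfigure(config):
    """Leave the process without running atexit handlers or native destructors."""
    if _exit_status is None:
        return  # session never finished; fall back to the normal exit path
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(_exit_status)
```

Because `os._exit()` skips atexit handlers and C++ static destructors, the main thread no longer tears down the HDC/URMA resources, which removes its leg of the race. The explicit flushes matter because `os._exit()` does not flush Python's stdio buffers.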

**Status.** The triggering case was removed from mainline when the
`runtime_fatal_codes` suite was thinned (latest local commit touching it:
`5d4785e4`). This issue exists for the record and for upstream reporting.

Related: #1197 (teardown-ordering segfault where RTS-using destructors run after
`aclFinalize` on a5) — a different teardown bug in our own `DeviceRunner` member
ordering; this issue is the closed-source CANN device-log-relay thread race, not
fixable in simpler.

