Platform
a5 (Ascend 950 hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
The runtime_fatal_codes ST suite (test_device_error_class_reaches_host_log,
the onboard a5 negative-path cases) intermittently fails CI with the subprocess
exiting on SIGSEGV (rc=-11) even though pytest itself reports 1 passed.
The crash is not in simpler code and not in the test logic. It is a
process-exit-time use-after-free inside the closed-source CANN / Ascend driver
stack (libascend_trace → libascend_hal → liburma → libummu). The test
assertions complete successfully first (1 passed); the segfault happens
afterward, during interpreter teardown, in a CANN background thread racing the
main thread's exit().
This issue is a record / tracking entry: the triggering parametrized case
has already been removed from mainline (the suite was thinned). It is filed to
preserve the full root-cause analysis, the native backtrace, and the
reproduction conditions for (a) reporting upstream to the CANN/driver team and
(b) anyone who re-introduces these negative-path cases later.
Steps to Reproduce
The crash is a probabilistic process-exit race; reproduction rate scales with
**concurrency**, not with the number of cases run.
Faithful CI repro (reproduces ~3 of 5 rounds on an a5 box):
1. Hold 2 a5 devices via task-submit (CI runs the a5 onboard st suite on 2
devices — `--device 4-5` on this runner — with scheduler
`max_parallel = device_count = 2`, so two case subprocesses run AND exit
concurrently; the specific device numbers do not matter, only that two
processes are concurrent):
       task-submit --device auto --device-num 2 \
         --run "for r in 1 2 3 4 5; do \
           python -m pytest tests/st/runtime_fatal_codes \
             --platform a5 --device \$TASK_DEVICE -q; \
         done"
2. Watch for `FAIL rc=-11` / a subprocess killed by signal 11. pytest still
prints `1 passed` for the case body; only the process *exit* segfaults.
Negative control (does NOT reproduce — confirms concurrency is the amplifier):
running the cases one-by-one, single device, serially gave 0 SIGSEGV in 30
consecutive process exits on the same (idle) box.
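The concurrent-versus-serial comparison above can be approximated with a small harness. This is a minimal sketch, not the CI scheduler: `run_round` and `count_signal_exits` are hypothetical names, and it assumes each case can be launched as its own command line. It starts all commands before waiting on any of them, so the children run and exit concurrently, then counts rounds where a child was killed by a signal.

```python
import subprocess


def run_round(cmds):
    """Start every command at once, wait for all, return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    # Wait only after all children are started so they run and exit concurrently.
    return [p.wait() for p in procs]


def count_signal_exits(cmds, rounds):
    """Count rounds in which at least one child was killed by a signal."""
    hits = 0
    for _ in range(rounds):
        codes = run_round(cmds)
        # On POSIX, subprocess reports death by signal N as returncode -N,
        # so SIGSEGV shows up as -11.
        if any(code < 0 for code in codes):
            hits += 1
    return hits
```

Passing one command reproduces the serial negative control; passing two (one per held device) approximates the CI condition. How closely this matches the real scheduler's process layout is an assumption.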
Capturing the native stack (core is owned by root via taskqueue.service):
    coredumpctl dump <pid> --output=/tmp/c.core
    gdb -q <python> /tmp/c.core --batch -ex 'thread apply all bt'
Expected Behavior
The fatal-code negative-path cases assert that the device error class reaches
the host log, then the worker tears down cleanly and the process exits 0.
Process teardown after a successful assertion should not segfault.
Actual Behavior
pytest reports the case passed, then the process dies with SIGSEGV during
exit. The scheduler records rc=-11 and marks the case FAIL purely on the
non-zero exit code.
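To make the "passed, then crashed at exit" outcome visible rather than folding it into a plain FAIL, a scheduler could look at the exit code together with the captured pytest summary. The sketch below is illustrative only: `classify` and its labels are hypothetical, and it assumes the scheduler launches cases through Python's `subprocess` (where a negative return code means death by that signal) and has the pytest output as text.

```python
import re
import signal


def classify(returncode, pytest_output):
    """Classify one case subprocess from its exit code and captured pytest output."""
    summary_passed = re.search(r"\b\d+ passed\b", pytest_output) is not None
    summary_failed = re.search(r"\b\d+ (failed|error)", pytest_output) is not None
    if returncode == 0:
        return "PASS"
    # rc == -11 means the child was killed by SIGSEGV; only treat it as the
    # exit-time crash when pytest itself reported passes and no failures.
    if returncode == -signal.SIGSEGV and summary_passed and not summary_failed:
        return "PASS_THEN_EXIT_SEGV"
    return "FAIL"
```

For example, `classify(-11, "1 passed in 0.10s")` returns `"PASS_THEN_EXIT_SEGV"`, while the same exit code with no pass summary stays `"FAIL"`.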
Native backtrace from the core dump (crash thread is a CANN background thread,
not the main thread; simpler appears in zero frames):
    Thread 1 (CANN teardown thread):
    #0  std::_Hashtable<unsigned int, ...>::find()  from libummu.so  ★ SIGSEGV
    #1  hashmap_get                   from libummu.so
    #5  udma_u_unregister_seg         from liburma-udma.so
    #6  urma_unregister_seg           from liburma.so
    #7  hdc_unregister_own_urma_seg   from libascend_hal.so (driver)
    #9  hdc_delete_ub_context         from libascend_hal.so
    #11 hdc_ub_session_close          from libascend_hal.so
    #13 halHdcSessionCloseEx          from libascend_hal.so
    #14 drvHdcSessionClose            from libascend_hal.so
    #16 AdxDestroyCommHandle          from libascend_trace.so (CANN)
    #17 ...                           from libascend_trace.so (thread entry)
    Thread 8 (main thread, concurrently):
    #0  close()                       from libc
    #1  ??                            from libummu.so
    #5  exit()                        from libc
Mechanism: libascend_trace.so is the CANN device-log relay channel (it
reads ASCEND_LOG_DEVICE_FLUSH_TIMEOUT / ASCEND_TRACE_RECORD_NUM). It is NOT a
profiling/ADX feature and is NOT enabled by simpler; the driver brings it up
whenever the device emits logs. At process exit, its background thread runs
AdxDestroyCommHandle → drvHdcSessionClose → halHdcSessionCloseEx →
urma_unregister_seg, walking the libummu global hashtable, while the main
thread is already in exit() running C++/atexit teardown. The two race to tear
down the same HDC/URMA resources; the background thread dereferences a
hashtable the main thread has freed → use-after-free → SIGSEGV.
Why only the fatal-code suite: these cases deliberately drive the device into a
FATAL state (e.g. device log shows aicpu_orchestration_entry
"FATAL(code=9): st injected fatal", PTO2 runtime failed with rc=-9), so the
device-log relay channel is busy right up to exit and the teardown race window
is at its widest. Normal cases emit little or no device log and almost never
hit it.
Why CI fails often but a local single run usually passes: this was measured.
On an idle box, single-process exits never hit the crash in 30 tries (0/30);
with CI's 2-device concurrency (two subprocesses running and exiting at once,
higher load, wider scheduling jitter) the hit rate jumps to ~60% (3/5 rounds).
Both samples are small, so the rates are rough. Concurrency widens the
exit()-vs-background-thread race window; the specific device numbers and
device independence are irrelevant because the race is intra-process.
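Because 0/30 and 3/5 are small samples, it may help to put rough uncertainty bounds on them. The sketch below uses the Wilson score interval, a standard approximation for a binomial proportion; it is an added illustration, not part of the original measurement. Note the two counts use different units (process exits versus rounds), so the comparison is indicative only.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


serial = wilson_interval(0, 30)      # single-process exits on an idle box
concurrent = wilson_interval(3, 5)   # CI-style rounds with 2 concurrent devices
print(f"serial:     {serial[0]:.2f} to {serial[1]:.2f}")
print(f"concurrent: {concurrent[0]:.2f} to {concurrent[1]:.2f}")
```

Under these assumptions the serial interval tops out near 0.11 and the concurrent interval starts near 0.23, so the two do not overlap, which is consistent with concurrency being the amplifier.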
Git Commit ID
6d938bf
CANN Version
CANN 9.1.T500
Driver Version
25.6.rc1.b108 (ascendhal_version 7.35.23)
Host Platform
Linux (aarch64)
Additional Context
Root cause ownership. Every crash frame is in closed-source CANN/driver
libraries; the defect is a missing synchronization between libascend_trace's
exit-time background thread and process teardown. It cannot be fixed at the
application layer (we cannot lock the driver's internal hashtable or control its
thread lifecycle) — only mitigated. A true fix must come from the CANN/driver
team (make AdxDestroyCommHandle exit-safe / mutually exclusive with atexit, or
join the relay thread before teardown).
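The "join the relay thread before teardown" ordering can be illustrated with a toy Python model. This is not the driver's code and makes no claim about its internals: `LogRelay`, its table, and its methods are all hypothetical stand-ins. The point is only the order in `close()`: signal the thread, wait for it, and free the shared state last.

```python
import threading


class LogRelay:
    """Toy stand-in for a relay thread that keeps touching a shared table."""

    def __init__(self):
        self.table = {"seg": 1}          # models the shared segment table
        self.errors = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            try:
                _ = self.table["seg"]    # models the lookup during unregister
            except (KeyError, TypeError) as exc:
                self.errors.append(exc)  # Python's safe analogue of the crash
                return

    def close(self):
        # Exit-safe order: stop the thread, wait for it, and only then free.
        self._stop.set()
        self._thread.join()
        self.table = None


relay = LogRelay()
relay.close()
print(relay.errors)
```

With the join in place the worker can never observe the freed table, so the error list stays empty. Freeing the table before the join would reintroduce the race this issue describes.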
Mitigation options (should these cases be re-introduced):
- Report upstream to CANN with this backtrace + repro (the only real fix).
- Have the fatal-code suite's process skip native teardown via os._exit()
after assertions complete (a local conftest pytest_sessionfinish). This removes
one leg of the race. Onboard work holds an exclusive task-submit device lock,
so skipping the graceful device reset is acceptable here. (New exit behavior;
needs sign-off per .claude/rules/env-macro-gating.md.)
- CI-level: treat "pytest passed but process exited -11" as a known
driver-flaky outcome (attach the core stack) rather than a hard FAIL.
- Lowering concurrency (serial / single device) reduces but does not eliminate
the hit rate, so it is not recommended as the final fix.
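The os._exit() option could look roughly like the following conftest.py sketch for the fatal-code suite. It is a sketch under stated assumptions, not an agreed design: the environment variable name `SIMPLER_ST_HARD_EXIT` is hypothetical (the real gating would follow the repo's rules), and only `pytest_sessionfinish(session, exitstatus)`, `os._exit()`, and the stdio flushes are real APIs.

```python
import os
import sys

# Hypothetical opt-in switch; the real gating name would follow the repo's rules.
HARD_EXIT_ENV = "SIMPLER_ST_HARD_EXIT"


def pytest_sessionfinish(session, exitstatus):
    """Skip interpreter and native teardown once the session has finished."""
    if os.environ.get(HARD_EXIT_ENV) != "1":
        return
    # os._exit() does not flush stdio buffers or run atexit handlers,
    # so flush explicitly before leaving.
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(int(exitstatus))
```

One thing to verify before adopting this: as far as I can tell, pytest's terminal reporter prints its summary from a wrapper around this same hook, so a plain implementation may exit before the `1 passed` line is written. A wrapper-style hook or a later hook such as `pytest_unconfigure` might be needed if the summary line matters to the scheduler.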
Status. The triggering case was removed from mainline when the
runtime_fatal_codes suite was thinned (latest local commit touching it:
5d4785e4). This issue exists for the record and for upstream reporting.
Related: #1197 (teardown-ordering segfault where RTS-using destructors run after
aclFinalize on a5) — a different teardown bug in our own DeviceRunner member
ordering; this issue is the closed-source CANN device-log-relay thread race, not
fixable in simpler.