[Bug] runtime_fatal_codes a5 ST: CANN driver exit-time SIGSEGV (libascend_trace/HDC/URMA race) after test passes #1209

@doraemonmj

Platform

a5 (Ascend 950 hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

The runtime_fatal_codes ST suite (test_device_error_class_reaches_host_log,
the onboard a5 negative-path cases) intermittently fails CI with the subprocess
exiting on SIGSEGV (rc=-11) even though pytest itself reports 1 passed.

The crash is not in simpler code and not in the test logic. It is a
process-exit-time use-after-free inside the closed-source CANN / Ascend driver
stack (libascend_trace → libascend_hal → liburma → libummu). The test
assertions complete successfully first (1 passed); the segfault happens
afterward, during interpreter teardown, in a CANN background thread racing the
main thread's exit().

This issue is a record / tracking entry: the triggering parametrized case
has already been removed from mainline (the suite was thinned). It is filed to
preserve the full root-cause analysis, the native backtrace, and the
reproduction conditions for (a) reporting upstream to the CANN/driver team and
(b) anyone who re-introduces these negative-path cases later.

Steps to Reproduce

The crash is a probabilistic process-exit race; reproduction rate scales with
**concurrency**, not with the number of cases run.

Faithful CI repro (reproduces ~3 of 5 rounds on an a5 box):

1. Hold 2 a5 devices via task-submit (CI runs the a5 onboard st suite on 2
   devices — `--device 4-5` on this runner — with scheduler
   `max_parallel = device_count = 2`, so two case subprocesses run AND exit
   concurrently; the specific device numbers do not matter, only that two
   processes are concurrent):

   task-submit --device auto --device-num 2 \
     --run "for r in 1 2 3 4 5; do \
              python -m pytest tests/st/runtime_fatal_codes \
                --platform a5 --device \$TASK_DEVICE -q; \
            done"

2. Watch for `FAIL rc=-11` / a subprocess killed by signal 11. pytest still
   prints `1 passed` for the case body; only the process *exit* segfaults.

Negative control (does NOT reproduce — confirms concurrency is the amplifier):
running the cases one-by-one, single device, serially gave 0 SIGSEGV in 30
consecutive process exits on the same (idle) box.
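
The concurrent-versus-serial comparison can be approximated outside the CI scheduler with a small harness. This is a minimal sketch, not the scheduler's actual logic: `pytest_command`, `run_concurrently` and `count_signal_exits` are hypothetical helpers, and launching one pytest process per held device is an assumption about how to make two exits overlap.

```python
import signal
import subprocess
import sys


def pytest_command(device):
    """Build the per-device pytest command line used in the repro steps."""
    return [
        sys.executable, "-m", "pytest", "tests/st/runtime_fatal_codes",
        "--platform", "a5", "--device", str(device), "-q",
    ]


def run_concurrently(commands):
    """Start every command before waiting on any, so their exits can overlap."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    return [proc.wait() for proc in procs]


def count_signal_exits(returncodes, signum=signal.SIGSEGV):
    """Count children killed by `signum`; on POSIX that shows up as -signum."""
    return sum(1 for rc in returncodes if rc == -signum)
```

Passing two command lines (one per device) models the CI case; passing a single command repeatedly models the serial negative control. Comparing `count_signal_exits` across the two modes gives the same kind of tally as the rates quoted in this issue.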

Capturing the native stack (core is owned by root via taskqueue.service):

  coredumpctl dump <pid> --output=/tmp/c.core
  gdb -q <python> /tmp/c.core --batch -ex 'thread apply all bt'

Expected Behavior

The fatal-code negative-path cases assert that the device error class reaches
the host log, then the worker tears down cleanly and the process exits 0.
Process teardown after a successful assertion should not segfault.

Actual Behavior

pytest reports the case passed, then the process dies with SIGSEGV during
exit. The scheduler records rc=-11 and marks the case FAIL purely on the
non-zero exit code.
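
For reference, rc=-11 is how Python's `subprocess` reports a child killed by signal 11 (SIGSEGV). A scheduler could tell "passed, then crashed at exit" apart from a real test failure by looking at both the exit code and the pytest summary. The sketch below is an assumption about how that might look, not the scheduler's existing code; `classify` and its three outcome labels are hypothetical.

```python
import re
import signal


def classify(returncode, pytest_output):
    """Classify one case subprocess from its exit code and captured output."""
    passed = re.search(r"\b\d+ passed\b", pytest_output) is not None
    failed = re.search(r"\b\d+ (failed|errors?)\b", pytest_output) is not None
    if returncode == 0:
        return "pass"
    # Killed by SIGSEGV, yet pytest reported only passes: the exit-time crash.
    if returncode == -signal.SIGSEGV and passed and not failed:
        return "driver-flaky"
    return "fail"
```

Anything other than a clean pass followed by SIGSEGV still counts as `fail`, so a segfault during the test body (before the summary is printed) is not masked.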

Native backtrace from the core dump (crash thread is a CANN background thread,
not the main thread; simpler appears in zero frames):

Thread 1 (CANN teardown thread):
#0  std::_Hashtable<unsigned int, ...>::find()   from libummu.so   ★ SIGSEGV
#1  hashmap_get                                  from libummu.so
#5  udma_u_unregister_seg                         from liburma-udma.so
#6  urma_unregister_seg                           from liburma.so
#7  hdc_unregister_own_urma_seg                   from libascend_hal.so   (driver)
#9  hdc_delete_ub_context                         from libascend_hal.so
#11 hdc_ub_session_close                          from libascend_hal.so
#13 halHdcSessionCloseEx                          from libascend_hal.so
#14 drvHdcSessionClose                            from libascend_hal.so
#16 AdxDestroyCommHandle                          from libascend_trace.so (CANN)
#17 ...                                           from libascend_trace.so (thread entry)

Thread 8 (main thread, concurrently):
#0  close()                                       from libc
#1  ??                                            from libummu.so
#5  exit()                                        from libc

Mechanism: libascend_trace.so is the CANN device-log relay channel (it
reads ASCEND_LOG_DEVICE_FLUSH_TIMEOUT / ASCEND_TRACE_RECORD_NUM; it is NOT a
profiling/ADX feature and is NOT enabled by simpler — the driver brings it up
whenever the device emits logs). At process exit, its background thread runs
AdxDestroyCommHandle → drvHdcSessionClose → halHdcSessionCloseEx →
urma_unregister_seg, walking the libummu global hashtable, while the main thread
is already in exit()
running C++/atexit teardown. The two race to tear down the same HDC/URMA
resources; the background thread dereferences a hashtable the main thread has
freed → use-after-free → SIGSEGV.
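
The ordering problem can be illustrated with a toy model. This is only a sketch of the race pattern described above, not driver code: `SegmentTable` is a hypothetical stand-in for the libummu hashtable, Python raises an exception where the real process would segfault, and an `Event` forces the losing interleaving that is merely probabilistic in the real process.

```python
import threading


class SegmentTable:
    """Toy stand-in for the driver's global segment hashtable (hypothetical)."""

    def __init__(self):
        self._segments = {1: "seg-1"}
        self._freed = False

    def free(self):
        self._freed = True
        self._segments = None

    def find(self, key):
        if self._freed:
            raise RuntimeError("use-after-free: table already destroyed")
        return self._segments.get(key)


def run_teardown(join_before_free):
    """Return the errors seen by the background thread during teardown."""
    table = SegmentTable()
    start_close = threading.Event()
    errors = []

    def relay_thread():
        start_close.wait()          # session close begins at process exit
        try:
            table.find(1)           # models urma_unregister_seg's lookup
        except RuntimeError as exc:
            errors.append(str(exc))

    worker = threading.Thread(target=relay_thread)
    worker.start()
    if join_before_free:
        start_close.set()
        worker.join()               # relay thread finishes first
        table.free()
    else:
        table.free()                # main thread's exit-time teardown wins
        start_close.set()
        worker.join()
    return errors
```

`run_teardown(False)` reports the use-after-free, while `run_teardown(True)` returns no errors. The second ordering corresponds to joining the relay thread before teardown, which only the owner of that thread (the driver) can do.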

Why only the fatal-code suite: these cases deliberately drive the device into a
FATAL state (e.g. device log shows aicpu_orchestration_entry
"FATAL(code=9): st injected fatal", PTO2 runtime failed with rc=-9), so the
device-log relay channel is busy right up to exit and the teardown race window
is at its widest. Normal cases emit little/no device log and almost never hit it.

Why CI fails often but a local single run usually passes: the measured rates
differ sharply. Serial single-process exits on an idle box hit the crash 0
times in 30 (0/30, below the roughly 3% resolution of that sample); with CI's
2-device concurrency (two subprocesses running and exiting at once, higher
load, wider scheduling jitter) the rate jumps to ~60% (3/5 rounds). Concurrency
widens the exit()-vs-background-thread race window; the specific device numbers
and device independence are irrelevant because the race is intra-process.
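
Both measurements are small samples, so it helps to state what 0/30 can and cannot rule out. The sketch below computes a one-sided upper bound on the per-exit crash rate after zero observed crashes. It assumes exits are independent, and the 95% confidence level is our choice, not something measured; `zero_failure_upper_bound` is a hypothetical helper.

```python
def zero_failure_upper_bound(trials, confidence=0.95):
    """Upper bound on the per-trial rate after 0 hits in `trials` tries.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


def observed_rate(hits, trials):
    """Plain observed fraction, e.g. 3 crashing rounds out of 5."""
    return hits / trials
```

For the 30-exit negative control the bound comes out a little under 10%, so the serial rate is low but not shown to be zero. That is consistent with the note under mitigation options that lowering concurrency reduces the hit rate without eliminating it.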

Git Commit ID

6d938bf

CANN Version

CANN 9.1.T500

Driver Version

25.6.rc1.b108 (ascendhal_version 7.35.23)

Host Platform

Linux (aarch64)

Additional Context

Root cause ownership. Every crash frame is in closed-source CANN/driver
libraries; the defect is a missing synchronization between libascend_trace's
exit-time background thread and process teardown. It cannot be fixed at the
application layer (we cannot lock the driver's internal hashtable or control its
thread lifecycle) — only mitigated. A true fix must come from the CANN/driver
team (make AdxDestroyCommHandle exit-safe / mutually exclusive with atexit, or
join the relay thread before teardown).

Mitigation options (for if these cases are re-introduced):

  1. Report upstream to CANN with this backtrace + repro (the only real fix).
  2. Have the fatal-code suite's process skip native teardown via os._exit()
    after assertions complete (a local conftest pytest_sessionfinish) — removes
    one leg of the race. Onboard work holds an exclusive task-submit device lock,
    so skipping the graceful device reset is acceptable here. (New exit behavior —
    needs sign-off per .claude/rules/env-macro-gating.md.)
  3. CI-level: treat "pytest passed but process exited -11" as a known
    driver-flaky outcome (attach the core stack) rather than a hard FAIL.
  4. Lowering concurrency (serial / single device) reduces but does not eliminate
    the hit rate — not recommended as the final fix.
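
Option 2 could look roughly like the following `conftest.py` local to the fatal-code suite. This is a sketch under assumptions, not a reviewed change: `flush_and_hard_exit` is a hypothetical helper, and the sign-off noted above still applies. `pytest_sessionfinish` is a standard pytest hook and `os._exit` is the standard-library call that ends the process without running atexit handlers.

```python
import os
import sys


def flush_and_hard_exit(exitstatus):
    """Flush Python-level buffers, then exit without atexit/native teardown."""
    sys.stdout.flush()
    sys.stderr.flush()
    os._exit(int(exitstatus))


def pytest_sessionfinish(session, exitstatus):
    """pytest hook called once the whole run has finished.

    Preserves pytest's own exit status, so real test failures still
    produce a non-zero code.
    """
    flush_and_hard_exit(exitstatus)
```

One thing to verify before adopting this: pytest's terminal reporter wraps the same hook, so exiting here may also suppress the final summary line (the `1 passed` text). Anything that parses that line would need to be checked against the new behavior.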

Status. The triggering case was removed from mainline when the
runtime_fatal_codes suite was thinned (latest local commit touching it:
5d4785e4). This issue exists for the record and for upstream reporting.

Related: #1197 (teardown-ordering segfault where RTS-using destructors run after
aclFinalize on a5) — a different teardown bug in our own DeviceRunner member
ordering; this issue is the closed-source CANN device-log-relay thread race, not
fixable in simpler.
