Windows: clean ZMQ teardown — drop TerminateProcess workaround in tests #436

@lukemartinlogan

Background

On Windows, every test binary that initializes the CLIO runtime aborts
during static-destructor / atexit teardown with:

Assertion failed: Successful WSASTARTUP not yet performed [10093]
(C:\...\zeromq\src\v4.3.5\src\signaler.cpp:163)

The assertion is wsa_assert(nbytes != SOCKET_ERROR) after a send() on
the signaler's wakeup socket pair. It reproduces from the per-socket
destructor path (zmq_close → mailbox::send → signaler::send), not the
context-level destruction we initially suspected.

All tests pass before the abort fires; the abort just sets a non-zero
exit code and ctest sees that as failure.

What we tried

  1. Pin WSAStartup refcount. socket_win.cc does WSAStartup at
    static-init time and never calls WSACleanup, plus a second
    WSAStartup per InitSocketLib(). Even with the refcount pumped to
    1024, the assert still fires. This suggests the failure isn't a
    refcount race: either the signaler socket itself is in a bad
    state, or WSANOTINITIALISED (10093) is being returned for a
    different reason that libzmq's error formatter misattributes.
  2. Skip zmq_ctx_destroy on Windows in ZeroMqTransport::CtxOwner.
    No effect — the abort fires from a per-socket close path that runs
    well before the ctx-owner destructor.
  3. Skip zmq_ctx_shutdown too. Same result. The libzmq context
    destruction itself is fine; the trigger is a regular socket close
    somewhere in the runtime's Finalize path.
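
For reference, attempt 1 amounts to something like the sketch below. It is not the code in socket_win.cc: PinWinsockRefs is a hypothetical name, the 2.2 version request is an assumption, and the non-Windows branch is a no-op stub so the snippet builds everywhere.

```cpp
#include <cstddef>

#ifdef _WIN32
#include <winsock2.h>
// MSVC-only convenience; link ws2_32 explicitly with other toolchains.
#pragma comment(lib, "ws2_32.lib")
#endif

// Hypothetical helper: bump the process-wide Winsock reference count
// `extra_refs` times and deliberately never call WSACleanup, so that a
// balanced WSAStartup/WSACleanup pair elsewhere cannot drop the count
// to zero. Returns the number of WSAStartup calls that succeeded.
inline std::size_t PinWinsockRefs(std::size_t extra_refs) {
#ifdef _WIN32
  std::size_t succeeded = 0;
  for (std::size_t i = 0; i < extra_refs; ++i) {
    WSADATA data;
    if (::WSAStartup(MAKEWORD(2, 2), &data) == 0) {
      ++succeeded;
    }
  }
  return succeeded;
#else
  (void)extra_refs;  // POSIX sockets need no library initialization.
  return 0;
#endif
}
```

If a call like PinWinsockRefs(1024) at static-init time reports 1024 successes and the assert still fires, an unbalanced WSACleanup is an unlikely cause, which is what we observed.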

Current workaround

SIMPLE_TEST_PROCESS_EXIT(code) in context-runtime/test/simple_test.h:
on Windows it calls TerminateProcess(GetCurrentProcess(), code) instead
of returning from main, which skips every cleanup path (atexit, static
dtors, CRT cleanup). The OS reclaims sockets at process exit. Applied to
SIMPLE_TEST_MAIN() and the custom mains in test_task_archive.cc and
test_compose.cc; the in-test _exit(0) calls in test_runtime_cleanup
and test_ipc_errors ZZZ-Final-Cleanup cases route through the same
macro for consistency.

Result on Windows: 90/90 tests pass (was 0/37 cr_* before this fix
landed). POSIX is unaffected: there the macro is a plain "return code"
in SIMPLE_TEST_MAIN() and ::_exit(code) in the explicit
worker-join-skip cases, matching pre-fix behaviour.
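
The hard-exit branches of the workaround can be sketched roughly as follows. This is an illustration, not the contents of simple_test.h: HardExitSketch, OverrideStatus and ObservedExitCode are hypothetical names, and the POSIX-only helper exists just to show that atexit handlers really are skipped.

```cpp
#include <cstdlib>

#ifdef _WIN32
#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
#else
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#endif

// Hypothetical stand-in for the hard-exit branches of
// SIMPLE_TEST_PROCESS_EXIT: end the process without running atexit
// handlers, static destructors, or CRT cleanup.
[[noreturn]] inline void HardExitSketch(int code) {
#ifdef _WIN32
  ::TerminateProcess(::GetCurrentProcess(), static_cast<UINT>(code));
  // TerminateProcess is documented as asynchronous, so park defensively
  // instead of falling through into normal teardown.
  for (;;) {
    ::Sleep(INFINITE);
  }
#else
  ::_exit(code);
#endif
}

#ifndef _WIN32
// Would replace the exit status with 99 if atexit handlers were run.
inline void OverrideStatus() { ::_exit(99); }

// POSIX-only demonstration: fork a child that registers OverrideStatus,
// hard-exits with `code`, and return the status the parent observes
// (or -1 if fork/waitpid fails or the child dies abnormally).
inline int ObservedExitCode(int code) {
  const pid_t pid = ::fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    std::atexit(OverrideStatus);
    HardExitSketch(code);
  }
  int status = 0;
  if (::waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
#endif
```

On POSIX, ObservedExitCode(7) reports the child's own code rather than 99, because the registered handler never runs. The same property is what keeps libzmq's socket teardown from executing on Windows, and it is also why the workaround hides the bug instead of fixing it.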

What needs to happen long-term

This is a real teardown bug. The runtime should be able to close its
ZMQ transports cleanly at exit on Windows in production, not just in
tests, and the workaround above only covers test binaries.
Two angles to investigate:

  1. Is the runtime calling zmq_close while the process's Winsock
    state is stale?
    Winsock initialization is per-process and reference-counted rather
    than per-thread, so every ZMQ socket op has to run while at least
    one WSAStartup is still outstanding. We do it process-wide in
    socket_win.cc, but the libzmq thread that uses the signaler might
    predate our init or run after a refcount imbalance.
  2. Upgrade libzmq. Check whether libzmq master / 4.3.6+ has fixed
    the signaler send during destruction (the relevant wsa_assert may
    be relaxed or the socket pair may be drained more carefully).
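
The first angle can be checked empirically by probing Winsock state right before each zmq_close in the Finalize path. A minimal sketch, assuming a hypothetical helper named WinsockInitializedProbe (nothing like it exists in the tree today); on non-Windows builds it is a stub that always reports true:

```cpp
#ifdef _WIN32
#include <winsock2.h>
// MSVC-only convenience; link ws2_32 explicitly with other toolchains.
#pragma comment(lib, "ws2_32.lib")
#endif

// Hypothetical diagnostic: returns false only when Winsock reports
// WSANOTINITIALISED (10093) for a fresh socket() call, i.e. when no
// successful WSAStartup is outstanding for this process.
inline bool WinsockInitializedProbe() {
#ifdef _WIN32
  const SOCKET s = ::socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (s == INVALID_SOCKET) {
    // Any other error means Winsock is up and socket creation failed
    // for an unrelated reason.
    return ::WSAGetLastError() != WSANOTINITIALISED;
  }
  ::closesocket(s);
  return true;
#else
  return true;  // POSIX has no WSAStartup equivalent.
#endif
}
```

Logging the probe result together with the calling thread id just before zmq_close would separate the two explanations from attempt 1: a false result points at Winsock having been torn down early, while a true result followed by the same assert points at the signaler socket itself.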

Once the underlying issue is fixed, drop SIMPLE_TEST_PROCESS_EXIT and
let static destructors run normally on Windows.

Acceptance criteria

  • Identify the actual cause of the WSANOTINITIALISED from
    signaler.cpp:163 during normal runtime teardown.
  • Remove the TerminateProcess workaround from
    context-runtime/test/simple_test.h.
  • Remove the SIMPLE_TEST_PROCESS_EXIT calls from the custom mains
    (test_task_archive.cc, test_compose.cc) and the ZZZ-Final-Cleanup
    test cases.
  • Tests pass on both Linux and Windows without forcing process exit.

Refs

Branch: windows-compat (commit 65d54605). Related: #435.
