Background
On Windows, every test binary that initializes the CLIO runtime aborts
during static-destructor / atexit teardown with:
Assertion failed: Successful WSASTARTUP not yet performed [10093]
(C:\...\zeromq\src\v4.3.5\src\signaler.cpp:163)
The assertion is wsa_assert(nbytes != SOCKET_ERROR) after a send() on
the signaler's wakeup socket pair. Reproduces from the per-socket
destructor path (zmq_close → mailbox::send → signaler::send) — not the
context-level destruction we initially suspected.
All tests pass before the abort fires; the abort just sets a non-zero
exit code and ctest sees that as failure.
What we tried
- Pin WSAStartup refcount. socket_win.cc does WSAStartup at
static-init time and never calls WSACleanup, plus a second
WSAStartup per InitSocketLib(). Even with refcount pumped to 1024,
the assert still fires. Suggests the failure isn't a refcount race —
either the signaler socket itself is in a bad state, or
WSANOTINITIALISED (10093) is being returned for a different reason
that libzmq's error formatter misattributes.
- Skip zmq_ctx_destroy on Windows in ZeroMqTransport::CtxOwner.
No effect — the abort fires from a per-socket close path that runs
well before the ctx-owner destructor.
- Skip zmq_ctx_shutdown too. Same result. The libzmq context
destruction itself is fine; the trigger is a regular socket close
somewhere in the runtime's Finalize path.
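For reference, the refcount-pinning experiment can be sketched roughly as follows. This is a minimal illustration, not the code in socket_win.cc: the helper name PinWinsockRefcount is hypothetical, and the non-Windows branch is a stub so the snippet compiles everywhere. It relies only on the documented behaviour that each successful WSAStartup increments a process-wide count that WSACleanup decrements.

```cpp
#ifdef _WIN32
#include <winsock2.h>  // link against Ws2_32.lib
#endif

// Hypothetical helper (not from the codebase): calls WSAStartup up to
// `extra_refs` times and never calls WSACleanup, so the process-wide
// Winsock reference count stays well above zero during teardown.
// Returns the number of successful WSAStartup calls.
int PinWinsockRefcount(int extra_refs) {
  int pinned = 0;
#ifdef _WIN32
  for (int i = 0; i < extra_refs; ++i) {
    WSADATA data;
    if (WSAStartup(MAKEWORD(2, 2), &data) != 0) {
      break;  // stop at the first failure
    }
    ++pinned;
  }
#else
  // No Winsock on POSIX; nothing to pin.
  (void)extra_refs;
#endif
  return pinned;
}
```

Calling something like PinWinsockRefcount(1024) at static-init time corresponds to the "pumped to 1024" experiment above, which did not stop the assert.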
Current workaround
SIMPLE_TEST_PROCESS_EXIT(code) in context-runtime/test/simple_test.h:
on Windows it calls TerminateProcess(GetCurrentProcess(), code) instead
of returning from main, which skips every cleanup path (atexit, static
dtors, CRT cleanup). The OS reclaims sockets at process exit. Applied to
SIMPLE_TEST_MAIN() and the custom mains in test_task_archive.cc and
test_compose.cc; the in-test _exit(0) calls in test_runtime_cleanup
and test_ipc_errors ZZZ-Final-Cleanup cases route through the same
macro for consistency.
Result on Windows: 90/90 tests pass (was 0/37 cr_* before this fix
landed). POSIX is unaffected — the macro is a plain return code for
SIMPLE_TEST_MAIN() and ::_exit(code) for the explicit
worker-join-skip cases, matching pre-fix behaviour.
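The shape of the workaround is roughly the following. This is a sketch under assumptions, not the contents of simple_test.h: the macro name SKETCH_PROCESS_EXIT and the function RunTestsAndExit are hypothetical, only the plain-return POSIX variant is shown, and the explicit stdio flush is an addition here because TerminateProcess skips CRT cleanup and would otherwise drop buffered output.

```cpp
#include <cstdio>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical stand-in for SIMPLE_TEST_PROCESS_EXIT(code).
#ifdef _WIN32
// Windows: end the process immediately, skipping atexit handlers, static
// destructors, and CRT cleanup. Flush stdio first so test output survives.
#define SKETCH_PROCESS_EXIT(code)                                   \
  do {                                                              \
    std::fflush(nullptr);                                           \
    TerminateProcess(GetCurrentProcess(), static_cast<UINT>(code)); \
    return (code); /* not reached in practice */                    \
  } while (0)
#else
// POSIX: a plain return, so normal teardown still runs.
#define SKETCH_PROCESS_EXIT(code) return (code)
#endif

// Hypothetical test-main body: maps a failure count to an exit code and
// leaves through the macro.
int RunTestsAndExit(int failures) {
  const int code = (failures == 0) ? 0 : 1;
  SKETCH_PROCESS_EXIT(code);
}
```

A test main would then end with `return RunTestsAndExit(failures);`, so that the POSIX path behaves exactly like an ordinary return from main.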
What needs to happen long-term
This is a real teardown bug — the runtime should be able to close its
ZMQ transports cleanly at exit on Windows, not just in production tests.
Two angles to investigate:
- Is the runtime calling zmq_close from a thread whose Winsock state
is stale? Every ZMQ socket op needs Winsock to be initialized at the
time it runs. WSAStartup is reference-counted per process, and we call
it process-wide in socket_win.cc, but the signaler thread inside libzmq
might predate our init or run after a refcount imbalance.
- Upgrade libzmq. Check whether libzmq master / 4.3.6+ has fixed
the signaler send during destruction (the relevant wsa_assert may
be relaxed or the socket pair may be drained more carefully).
Once the underlying issue is fixed, drop SIMPLE_TEST_PROCESS_EXIT and
let static destructors run normally on Windows.
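One way to narrow down the first angle is to probe Winsock state immediately before each socket close in the Finalize path. The sketch below is a hypothetical diagnostic, not existing code: the name WinsockLooksInitialized is made up, and it relies only on the documented behaviour that socket() fails with WSANOTINITIALISED when no successful WSAStartup is in effect.

```cpp
#ifdef _WIN32
#include <winsock2.h>  // link against Ws2_32.lib
#endif

// Hypothetical diagnostic: returns true if Winsock appears to be
// initialized for this process at the moment of the call.
bool WinsockLooksInitialized() {
#ifdef _WIN32
  SOCKET probe = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
  if (probe == INVALID_SOCKET) {
    // Any error other than WSANOTINITIALISED means Winsock itself is up.
    return WSAGetLastError() != WSANOTINITIALISED;
  }
  closesocket(probe);
  return true;
#else
  // No Winsock on POSIX; sockets need no explicit initialization.
  return true;
#endif
}
```

Logging the result together with the current thread id just before zmq_close would show whether Winsock is already gone by the time the failing close runs, or whether 10093 is being reported for some other reason.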
Acceptance criteria
- No WSANOTINITIALISED from signaler.cpp:163 during normal runtime
teardown.
- Remove the TerminateProcess workaround from
context-runtime/test/simple_test.h.
- Remove SIMPLE_TEST_PROCESS_EXIT calls from the custom mains
(test_task_archive.cc, test_compose.cc) and the ZZZ-Final-Cleanup
test cases.
Refs
Branch: windows-compat (commit 65d54605). Related: #435.