Flaky crash in ManyClientsOneServerDeallocateBlockingTest: multithreaded peer-teardown race in RakPeer #7

Description

@Segfaultd

Summary

ManyClientsOneServerDeallocateBlockingTest crashes intermittently during connection teardown. The cause is a pre-existing, flaky, multithreaded race in RakPeer's peer/connection teardown. It is unrelated to any RPC4 work and only surfaced while investigating CI on #5.

Symptoms

  • Release (CI Linux, Docker): SIGSEGV, process exits 139. Reproduced locally in the exact CI container in 5 of 5 runs.
  • Debug + ASan (-DMAFIANET_SANITIZER=address+undefined): RakAssert fires or the process dies with SIGBUS (exit 134/138).
  • Always in ManyClientsOneServerDeallocateBlockingTest, around the connect/disconnect/deallocate churn and the "Verifying connections..." stage. It only manifests in the full suite, where accumulated state and timing matter; the test passes in isolation and under gdb/lldb because the changed timing masks the race. Classic heisenbug.
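For readers mapping the exit codes above back to signals, here is a minimal sketch of the usual shell convention (a process killed by signal N is reported as 128 + N). The helper name SignalFromExitStatus is hypothetical and is not part of the test suite.

```cpp
#include <csignal>

// Hypothetical helper, for illustration only.
// Shell convention: a process killed by signal N is reported as status 128 + N.
// Returns the signal number, or 0 if the status does not look like a signal death.
inline int SignalFromExitStatus(int status) {
    if (status > 128 && status < 256) {
        return status - 128;
    }
    return 0;
}
```

Under this convention 139 decodes to signal 11 (SIGSEGV) and 134 to signal 6 (SIGABRT). SIGBUS numbering is platform-specific (7 on Linux, 10 on macOS/BSD), so 138 would correspond to SIGBUS only on the latter; which platform the ASan runs used is not stated here.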

Root cause

The test destroys and immediately recreates client peers while their connections and network threads are still live:

// ManyClientsOneServerDeallocateBlockingTest.cpp:325
RakPeerInterface::DestroyInstance(clientList[i]);
clientList[i]=RakPeerInterface::GetInstance();

Tearing a RakPeer down mid-flight races with its internal update/network thread and with the peer's connection cleanup. One concrete null dereference on this path was in RakPeer::CloseConnection (RakPeer.cpp:1659), where remoteSystemList[index].rakNetSocket can be null because the index-0 fallback lands on a free slot. RakAssert is a no-op in release builds, so the code went on to dereference the null pointer and the process exited with 139. That specific dereference is now guarded in #5, but the suite still crashes further along the same teardown path, so at least one more race remains.
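To illustrate the kind of guard described above, here is a minimal sketch using stand-in types. FakeSocket, FakeRemoteSystem and CloseConnectionGuarded are hypothetical names invented for this example; this is not the actual change in #5 and does not mirror the real RakPeer data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; these are NOT the real RakPeer types.
struct FakeSocket {
    int sendCount = 0;
    void SendDisconnect() { ++sendCount; }
};

struct FakeRemoteSystem {
    bool isActive = false;
    FakeSocket* rakNetSocket = nullptr;  // stays null on a free slot
};

// Returns true if a disconnect was sent, false if the slot was skipped
// because it is out of range, inactive, or has no socket.
inline bool CloseConnectionGuarded(std::vector<FakeRemoteSystem>& remoteSystemList,
                                   std::size_t index) {
    if (index >= remoteSystemList.size()) {
        return false;
    }
    FakeRemoteSystem& rs = remoteSystemList[index];
    // A debug-only assert disappears in release builds, so the null check
    // has to be a real branch rather than an assertion.
    if (!rs.isActive || rs.rakNetSocket == nullptr) {
        return false;
    }
    rs.rakNetSocket->SendDisconnect();
    rs.isActive = false;
    return true;
}
```

A guard like this only prevents the crash at one site. It does not remove the underlying race, which is why the suite can still fail later in teardown.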

Pre-existing (not from #5)

Confirmed on clean master (aa9af6a9, no RPC4 changes): the full suite under ASan aborts at the same CloseConnection:1659 assertion. (Release-Docker reproduction on master is in progress.) Recent master CI runs are green only because the race is timing-dependent and those runs happened not to hit it.

Repro

# release, like CI
docker build -t mafianet-test . && for i in $(seq 5); do docker run --rm mafianet-test; echo "exit=$?"; done
# debug + ASan, full suite
cmake -B build-asan -DCMAKE_BUILD_TYPE=Debug -DMAFIANET_SANITIZER=address+undefined -DMAFIANET_BUILD_SAMPLES=ON
cmake --build build-asan --target Tests && CI=1 ./build-asan/Samples/Tests/Tests

Interim mitigation

Quarantined under CI in #5 (the test skips when CI is set but still runs locally for debugging) so unrelated PRs aren't blocked. This issue tracks the real fix: make RakPeer teardown safe against in-flight connections and threads. Concretely, join the network thread before tearing down connection state, and remove the // #med index-0 fallback in CloseConnection.
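The teardown ordering proposed above (stop and join the network thread first, then release connection state) can be sketched with the standard library. SketchPeer and all of its members are hypothetical names for illustration; the real RakPeer threading and shutdown code is more involved and is not reproduced here.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical peer; names are illustrative, not the RakPeer API.
class SketchPeer {
public:
    SketchPeer() : connections_(8, 1), worker_([this] { UpdateLoop(); }) {}
    ~SketchPeer() { Shutdown(); }

    // Safe to call more than once.
    void Shutdown() {
        stop_.store(true);
        // Step 1: wait for the worker, so no thread touches state afterwards.
        if (worker_.joinable()) {
            worker_.join();
        }
        // Step 2: only now release connection state.
        std::lock_guard<std::mutex> lock(mutex_);
        connections_.clear();
    }

    std::size_t ConnectionCount() {
        std::lock_guard<std::mutex> lock(mutex_);
        return connections_.size();
    }

    bool WorkerRunning() const { return worker_.joinable(); }

private:
    // Stand-in for the update/network thread: keeps touching connection state.
    void UpdateLoop() {
        while (!stop_.load()) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                for (int& c : connections_) {
                    ++c;
                }
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    std::atomic<bool> stop_{false};
    std::mutex mutex_;
    std::vector<int> connections_;
    std::thread worker_;  // declared last so it starts after the other members exist
};
```

With this ordering, a destroy-then-recreate sequence like the one in the test cannot leave a worker running against freed state, because the destructor does not return until the thread has exited.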

Labels: bug (Something isn't working)