test(gateway): E2E test for session takeover — issue #492#493
Merged
Conversation
pedrosakuma
pushed a commit
that referenced
this pull request
May 27, 2026
Two bugs found by code review on PR #493: 1. (High) CloseLocked called SaveStateSnapshotSafe() for the evicted session (kind=SessionTakeOver), overwriting the snapshot that the new session had just persisted with the higher sessionVerId. Fixed by excluding SessionTakeOver from the 'else' persist branch — the new session already owns the durable state; the evicted session must not touch it. 2. (Medium) On TrySaveStateSnapshot() failure in the takeover path, the rollback only released the new session's claim but did not restore the evicted session's claim. Left the old live TCP unclaimed with _lastSessionVerId advanced to the new verId, making the old session unable to re-negotiate. Fixed by adding SessionClaimRegistry. TryRestoreTakeOver() and calling it from the rollback path before Release(), atomically reinstating the evicted session's claim and reverting _lastSessionVerId. Adds two tests: - Negotiate_HigherVerid_WhileOldSessionStillConnected_AcceptsViaTakeOver (E2E, no persister — covers the core #492 takeover acceptance path) - TakeOver_WithStatePersister_FinalSnapshotHasNewVerid (persister-wired — catches the snapshot overwrite regression) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pedrosakuma
pushed a commit
that referenced
this pull request
Jun 23, 2026
Two bugs found by code review on PR #493: 1. (High) CloseLocked called SaveStateSnapshotSafe() for the evicted session (kind=SessionTakeOver), overwriting the snapshot that the new session had just persisted with the higher sessionVerId. Fixed by excluding SessionTakeOver from the 'else' persist branch — the new session already owns the durable state; the evicted session must not touch it. 2. (Medium) On TrySaveStateSnapshot() failure in the takeover path, the rollback only released the new session's claim but did not restore the evicted session's claim. Left the old live TCP unclaimed with _lastSessionVerId advanced to the new verId, making the old session unable to re-negotiate. Fixed by adding SessionClaimRegistry. TryRestoreTakeOver() and calling it from the rollback path before Release(), atomically reinstating the evicted session's claim and reverting _lastSessionVerId. Adds two tests: - Negotiate_HigherVerid_WhileOldSessionStillConnected_AcceptsViaTakeOver (E2E, no persister — covers the core #492 takeover acceptance path) - TakeOver_WithStatePersister_FinalSnapshotHasNewVerid (persister-wired — catches the snapshot overwrite regression) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
80b2061 to
6e2a03f
Compare
Owner
Author
|
Rebaseado em Rodei um code-review (gpt-5.5) sobre o diff e ele apontou um defeito real de High: no rollback por falha de persistência durante um takeover, o claim era restaurado para a sessão evicta, mas o Corrigido no commit
CI verde (Build & Test Release/Debug, Format). Build local: 0 warnings; 575 testes do Gateway passam. |
Adds FixpSessionTakeoverTests with an end-to-end integration test that exercises the exact scenario described in issue #492: - Client 1 establishes a session (sessionId=1, verId=2) over TCP and remains connected. - Client 2 connects on a new transport and sends Negotiate with the same sessionId but a strictly-greater verId=3, simulating a fast reconnect after a crash before the exchange idle-timeout fires. - Asserts the exchange accepts the takeover (NegotiateResponse, not NegotiateReject) and evicts the stale session from ActiveSessions. - Asserts the claim registry records verId=3 for sessionId=1. The test confirms that PR #491 code (TryForceTakeOver path) is correct. The issue was a log-level or deployment concern on the reporter's side, not a code defect. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two bugs found by code review on PR #493: 1. (High) CloseLocked called SaveStateSnapshotSafe() for the evicted session (kind=SessionTakeOver), overwriting the snapshot that the new session had just persisted with the higher sessionVerId. Fixed by excluding SessionTakeOver from the 'else' persist branch — the new session already owns the durable state; the evicted session must not touch it. 2. (Medium) On TrySaveStateSnapshot() failure in the takeover path, the rollback only released the new session's claim but did not restore the evicted session's claim. Left the old live TCP unclaimed with _lastSessionVerId advanced to the new verId, making the old session unable to re-negotiate. Fixed by adding SessionClaimRegistry. TryRestoreTakeOver() and calling it from the rollback path before Release(), atomically reinstating the evicted session's claim and reverting _lastSessionVerId. Adds two tests: - Negotiate_HigherVerid_WhileOldSessionStillConnected_AcceptsViaTakeOver (E2E, no persister — covers the core #492 takeover acceptance path) - TakeOver_WithStatePersister_FinalSnapshotHasNewVerid (persister-wired — catches the snapshot overwrite regression) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llback When a session takeover's Negotiate persist fails, the rollback restored the SessionClaimRegistry claim to the evicted old session but left it unregistered in SessionRegistry (UpdateIdentityAfterNegotiate had overwritten its entry with the failed new session, and RollbackIdentity then removed it). The old, still-live owner therefore stopped receiving routed execution reports. Re-register the evicted session via a new onTakeOverRollback hook, gated on TryRestoreTakeOver (now bool) so the registry stays in lock-step with the claim registry and a racing concurrent takeover is not clobbered. Adds unit coverage for TryRestoreTakeOver's restore and racing-loser paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6e2a03f to
77230e8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds
FixpSessionTakeoverTeststo confirm that PR #491'sTryForceTakeOverimplementation correctly handles the scenario from issue #492.What the test does
sessionId=1, verId=2) and keeps the TCP connection alive.Negotiate(sessionId=1, verId=3), simulating a trading-host crash + fast restart before the exchange detects the dead TCP.TryForceTakeOver), not NegotiateReject.ActiveSessions.Findings
The test passes in 107 ms, confirming the production code in PR #491 is correct. The issue reporter's observation ("no 'taking over sessionId' log appears") is most likely a log-level filtering problem — the log statement is
LogInformationand production may be configured atWarningor above.Closes #492