test(gateway): E2E test for session takeover (issue #492) #493

Merged
pedrosakuma merged 3 commits into main from test/session-takeover-e2e-492
Jun 23, 2026

@pedrosakuma
Owner

Summary

Adds FixpSessionTakeoverTests to confirm that PR #491's TryForceTakeOver implementation correctly handles the scenario from issue #492.

What the test does

  1. Client 1 establishes a full FIXP session (sessionId=1, verId=2) and keeps the TCP connection alive.
  2. Client 2 connects on a new transport and sends Negotiate(sessionId=1, verId=3), simulating a trading-host crash + fast restart before the exchange detects the dead TCP.
  3. Asserts the exchange responds with NegotiateResponse (takeover accepted via TryForceTakeOver), not NegotiateReject.
  4. Asserts the stale old session (verId=2) is evicted from ActiveSessions.
  5. Asserts the claim registry holds verId=3 for sessionId=1.
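
The acceptance rule the test exercises can be modelled in a few lines. This is a minimal sketch in Python, not the gateway's code: `ClaimRegistry` and `negotiate` are hypothetical names, and the rule that an equal or lower verId is rejected while a claim is held is an assumption drawn from the "strictly greater" wording used elsewhere in this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ClaimRegistry:
    """Hypothetical stand-in for the gateway's claim registry."""
    # sessionId -> sessionVerId of the current owner
    claims: Dict[int, int] = field(default_factory=dict)

    def negotiate(self, session_id: int, ver_id: int) -> str:
        current: Optional[int] = self.claims.get(session_id)
        if current is None or ver_id > current:
            # First claim, or a strictly greater verId: accept. When an older
            # claim exists this is the takeover case and the old claim is evicted.
            self.claims[session_id] = ver_id
            return "NegotiateResponse"
        # Assumed rule: same or lower verId while a claim is held is rejected.
        return "NegotiateReject"


registry = ClaimRegistry()
first = registry.negotiate(session_id=1, ver_id=2)   # client 1
second = registry.negotiate(session_id=1, ver_id=3)  # client 2, new transport
replay = registry.negotiate(session_id=1, ver_id=3)  # same verId presented again
```

After the second call the registry maps sessionId 1 to verId 3, which is the state step 5 asserts.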

Findings

The test passes in 107 ms, confirming the production code in PR #491 is correct. The issue reporter's observation ("no 'taking over sessionId' log appears") is most likely a log-level filtering problem — the log statement is LogInformation and production may be configured at Warning or above.
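
The log-level explanation is easy to reproduce in isolation. The sketch below uses Python's standard `logging` module as an analogy for the gateway's logger; the logger name and the exact message text are made up for illustration. An Information-level statement produces no output while the threshold is Warning, and appears once the threshold is lowered.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)

logger = logging.getLogger("gateway.takeover.demo")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # keep output confined to the in-memory stream

# Production-like configuration: Warning and above only.
logger.setLevel(logging.WARNING)
logger.info("taking over sessionId=1")  # filtered out before reaching the handler
suppressed = stream.getvalue()

# Lowering the threshold to Information makes the same statement visible.
logger.setLevel(logging.INFO)
logger.info("taking over sessionId=1")
visible = stream.getvalue()
```

If the reporter's deployment filters at Warning, the absence of the message says nothing about whether the takeover path ran.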

Closes #492

pedrosakuma pushed a commit that referenced this pull request May 27, 2026
Two bugs found by code review on PR #493:

1. (High) CloseLocked called SaveStateSnapshotSafe() for the evicted
   session (kind=SessionTakeOver), overwriting the snapshot that the new
   session had just persisted with the higher sessionVerId. Fixed by
   excluding SessionTakeOver from the 'else' persist branch — the new
   session already owns the durable state; the evicted session must not
   touch it.

2. (Medium) On TrySaveStateSnapshot() failure in the takeover path, the
   rollback only released the new session's claim but did not restore the
   evicted session's claim. This left the old, still-live TCP connection
   unclaimed with _lastSessionVerId advanced to the new verId, so the old
   session could not re-negotiate. Fixed by adding
   SessionClaimRegistry.TryRestoreTakeOver() and calling it from the
   rollback path before Release(), atomically reinstating the evicted
   session's claim and reverting _lastSessionVerId.
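
The rollback ordering in item 2 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the C# implementation: the class shape, the `(owner, verId)` tuple, the owner handles, and the boolean return value are all hypothetical. It shows only the sequence: restore the evicted claim first, then release the failed one.

```python
import threading
from typing import Dict, Optional, Tuple

Claim = Tuple[str, int]  # (owner handle, sessionVerId); owner handles are hypothetical


class ClaimRegistry:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.claims: Dict[int, Claim] = {}
        self.last_ver_id: Dict[int, int] = {}

    def force_take_over(self, session_id: int, new: Claim) -> Optional[Claim]:
        """Install the new claim and return the evicted one, if any."""
        with self._lock:
            evicted = self.claims.get(session_id)
            self.claims[session_id] = new
            self.last_ver_id[session_id] = new[1]
            return evicted

    def restore_take_over(self, session_id: int, evicted: Claim) -> None:
        """Reinstate the evicted claim and revert the last verId under one lock."""
        with self._lock:
            self.claims[session_id] = evicted
            self.last_ver_id[session_id] = evicted[1]

    def release(self, session_id: int, owner: str) -> None:
        """Drop the claim only if this owner still holds it."""
        with self._lock:
            current = self.claims.get(session_id)
            if current is not None and current[0] == owner:
                del self.claims[session_id]


def negotiate_with_persist(registry: ClaimRegistry, session_id: int,
                           new: Claim, persist_ok: bool) -> bool:
    evicted = registry.force_take_over(session_id, new)
    if persist_ok:
        return True
    # Persist failed: put the old owner back before releasing the new claim.
    if evicted is not None:
        registry.restore_take_over(session_id, evicted)
    registry.release(session_id, new[0])  # no-op once the old claim is back
    return False


registry = ClaimRegistry()
registry.force_take_over(1, ("old-transport", 2))
accepted = negotiate_with_persist(registry, 1, ("new-transport", 3), persist_ok=False)
```

With the restore step omitted, `release` would remove the only claim and leave the last verId at 3, which is the stuck state the commit message describes.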

Adds two tests:
- Negotiate_HigherVerid_WhileOldSessionStillConnected_AcceptsViaTakeOver
  (E2E, no persister — covers the core #492 takeover acceptance path)
- TakeOver_WithStatePersister_FinalSnapshotHasNewVerid
  (persister-wired — catches the snapshot overwrite regression)
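
The snapshot-overwrite regression that the persister-wired test guards against can be reduced to a small model. The sketch below is hypothetical Python, not the gateway's code: `CloseKind`, `Session`, and the `skip_on_take_over` switch exist only to contrast the fixed and the buggy close path.

```python
from enum import Enum, auto
from typing import Dict


class CloseKind(Enum):
    NORMAL = auto()             # hypothetical name for an ordinary close
    SESSION_TAKE_OVER = auto()  # mirrors kind=SessionTakeOver in the commit message


class Session:
    def __init__(self, session_id: int, ver_id: int, store: Dict[int, int]) -> None:
        self.session_id = session_id
        self.ver_id = ver_id
        self._store = store  # sessionId -> persisted sessionVerId

    def save_snapshot(self) -> None:
        self._store[self.session_id] = self.ver_id

    def close(self, kind: CloseKind, skip_on_take_over: bool = True) -> None:
        # An evicted session must not write: the new session owns the durable state.
        if kind is CloseKind.SESSION_TAKE_OVER and skip_on_take_over:
            return
        self.save_snapshot()


def run_take_over(skip_on_take_over: bool) -> int:
    store: Dict[int, int] = {}
    old = Session(1, 2, store)
    old.save_snapshot()
    new = Session(1, 3, store)
    new.save_snapshot()  # the new session persists verId=3 first
    old.close(CloseKind.SESSION_TAKE_OVER, skip_on_take_over)  # then the old one is evicted
    return store[1]


fixed_ver_id = run_take_over(skip_on_take_over=True)
buggy_ver_id = run_take_over(skip_on_take_over=False)
```

In this model the fixed path leaves verId 3 in the store, while the buggy path lets the evicted session write verId 2 back over it.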

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pedrosakuma pedrosakuma marked this pull request as ready for review May 27, 2026 23:14
@pedrosakuma pedrosakuma enabled auto-merge (squash) May 28, 2026 02:04
pedrosakuma pushed a commit that referenced this pull request Jun 23, 2026
@pedrosakuma pedrosakuma force-pushed the test/session-takeover-e2e-492 branch from 80b2061 to 6e2a03f Compare June 23, 2026 19:10
@pedrosakuma
Owner Author

Rebased onto main (commit e6c6b46 had already landed via another PR, so only the E2E test and the snapshot anti-rollback fix remained).

I ran a code review (gpt-5.5) over the diff and it flagged a real High-severity defect: in the rollback after a persistence failure during a takeover, the claim was restored to the evicted session but the SessionRegistry entry was not. UpdateIdentityAfterNegotiate had overwritten the old session's entry with the new (failed) one, and RollbackIdentity then removed it, leaving the old owner (TCP still alive) with no route for execution reports.

Fixed in commit 6e2a03f:

  • SessionClaimRegistry.TryRestoreTakeOver now returns bool.
  • A new optional onTakeOverRollback hook (wired to SessionRegistry.Register) re-registers the evicted session during rollback, after RollbackIdentity and only when TryRestoreTakeOver succeeded. This keeps the claim and the registry in lock-step and avoids clobbering a concurrent takeover that won the race.
  • Unit tests for the restore path and the racing-loser path.
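
The gating described in the bullets above can be sketched as a small model. This is illustrative Python under stated assumptions, not the C# code: the routing table is a plain dict, the owner handles are invented, and `roll_back` stands in for the real rollback path, with its first step playing the role of RollbackIdentity.

```python
from typing import Callable, Dict, Optional, Tuple

Claim = Tuple[str, int]  # (owner handle, sessionVerId); handles are hypothetical


class ClaimRegistry:
    def __init__(self) -> None:
        self.claims: Dict[int, Claim] = {}

    def try_restore_take_over(self, session_id: int, failed: Claim, evicted: Claim) -> bool:
        # Restore only if the failed session still holds the claim; otherwise a
        # concurrent takeover has won and its claim must stay in place.
        if self.claims.get(session_id) != failed:
            return False
        self.claims[session_id] = evicted
        return True


def roll_back(claims: ClaimRegistry, routes: Dict[int, str], session_id: int,
              failed: Claim, evicted: Claim,
              on_take_over_rollback: Optional[Callable[[int, str], None]] = None) -> bool:
    # Stand-in for RollbackIdentity: drop the failed session's routing entry.
    if routes.get(session_id) == failed[0]:
        del routes[session_id]
    restored = claims.try_restore_take_over(session_id, failed, evicted)
    # Re-register the evicted session only when the claim really went back to it.
    if restored and on_take_over_rollback is not None:
        on_take_over_rollback(session_id, evicted[0])
    return restored


def scenario(current_claim: Claim, current_route: str):
    claims = ClaimRegistry()
    claims.claims[1] = current_claim
    routes = {1: current_route}
    restored = roll_back(claims, routes, 1, failed=("new", 3), evicted=("old", 2),
                         on_take_over_rollback=routes.__setitem__)
    return restored, claims.claims[1], routes[1]


normal = scenario(("new", 3), "new")        # failed session still holds the claim
raced = scenario(("winner", 4), "winner")   # a concurrent takeover already won
```

In the normal case both the claim and the route return to the old owner; in the raced case the restore reports failure and the hook never runs, so the winner's claim and route are left untouched.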

CI green (Build & Test Release/Debug, Format). Local build: 0 warnings; 575 Gateway tests pass.

PedroTravi and others added 3 commits June 23, 2026 19:40
Adds FixpSessionTakeoverTests with an end-to-end integration test that
exercises the exact scenario described in issue #492:

- Client 1 establishes a session (sessionId=1, verId=2) over TCP and
  remains connected.
- Client 2 connects on a new transport and sends Negotiate with the same
  sessionId but a strictly-greater verId=3, simulating a fast reconnect
  after a crash before the exchange idle-timeout fires.
- Asserts the exchange accepts the takeover (NegotiateResponse, not
  NegotiateReject) and evicts the stale session from ActiveSessions.
- Asserts the claim registry records verId=3 for sessionId=1.

The test confirms that PR #491 code (TryForceTakeOver path) is correct.
The issue was a log-level or deployment concern on the reporter's side,
not a code defect.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two bugs found by code review on PR #493:

1. (High) CloseLocked called SaveStateSnapshotSafe() for the evicted
   session (kind=SessionTakeOver), overwriting the snapshot that the new
   session had just persisted with the higher sessionVerId. Fixed by
   excluding SessionTakeOver from the 'else' persist branch — the new
   session already owns the durable state; the evicted session must not
   touch it.

2. (Medium) On TrySaveStateSnapshot() failure in the takeover path, the
   rollback only released the new session's claim but did not restore the
   evicted session's claim. This left the old, still-live TCP connection
   unclaimed with _lastSessionVerId advanced to the new verId, so the old
   session could not re-negotiate. Fixed by adding
   SessionClaimRegistry.TryRestoreTakeOver() and calling it from the
   rollback path before Release(), atomically reinstating the evicted
   session's claim and reverting _lastSessionVerId.

Adds two tests:
- Negotiate_HigherVerid_WhileOldSessionStillConnected_AcceptsViaTakeOver
  (E2E, no persister — covers the core #492 takeover acceptance path)
- TakeOver_WithStatePersister_FinalSnapshotHasNewVerid
  (persister-wired — catches the snapshot overwrite regression)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llback

When a session takeover's Negotiate persist fails, the rollback restored
the SessionClaimRegistry claim to the evicted old session but left it
unregistered in SessionRegistry (UpdateIdentityAfterNegotiate had
overwritten its entry with the failed new session, and RollbackIdentity
then removed it). The old, still-live owner therefore stopped receiving
routed execution reports.

Re-register the evicted session via a new onTakeOverRollback hook, gated
on TryRestoreTakeOver (now bool) so the registry stays in lock-step with
the claim registry and a racing concurrent takeover is not clobbered.

Adds unit coverage for TryRestoreTakeOver's restore and racing-loser paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pedrosakuma pedrosakuma force-pushed the test/session-takeover-e2e-492 branch from 6e2a03f to 77230e8 Compare June 23, 2026 19:40
@pedrosakuma pedrosakuma merged commit dfa2ff9 into main Jun 23, 2026
6 checks passed
@pedrosakuma pedrosakuma deleted the test/session-takeover-e2e-492 branch June 23, 2026 19:43

Development

Successfully merging this pull request may close these issues.

Session takeover rejects reconnection with DUPLICATE_SESSION_CONNECTION despite incremented sessionVerId