fix(cgka-engine): make EpochManager::begin_pending atomic (#146)#408
fix(cgka-engine): make EpochManager::begin_pending atomic (#146)#408agent-p1p wants to merge 1 commit into
Conversation
begin_pending removed the group's EpochState from the states map BEFORE running the fallible prev.begin_pending(...) transition, re-inserting only on success. When prev was not Stable the inner transition returned InvalidTransition, the ? propagated, and the entry was left removed and never re-inserted -- orphaning the group to UnknownGroup, flipping can_ingest to true, and stranding the just-allocated pending_ref as UnknownPending. This is the remove-before-transition non-atomicity the Sm1 audit fixed for confirm_publish/rollback_publish but never applied here. Mirror the Sm1 clone-before-transition fix: clone the prior state and run the transition before mutating states/committed_from/pending, so a failing transition leaves every map untouched. Also guard the auto-commit ingest arm to require Stable (via new EpochState::is_stable) before staging a commit. The engine accepts ingest while Recovering (can_ingest is true for Recovering), making the SelfRemove auto-commit path reachable in a non-Stable state; without the guard it would stage an OpenMLS commit, snapshot, and sent-message record only for begin_pending to reject the transition, leaving a dangling commit. The proposal is left queued until the group returns to Stable. Adds epoch_manager regression tests and updates the Sm1 AGENTS.md note. Closes #146
|
Warning Review limit reached
More reviews will be available in 23 minutes and 25 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Ready to review this PR? Stage has broken it down into 4 individual chapters for you:
Chapters generated by Stage for commit cf9dde7 on Jun 14, 2026 7:29pm UTC. |
|
Pip adversarial review for #146 / PR #408. Verdict: no blocking findings. What I checked:
Local verification:
Non-blocking suggestion:
Sensitive paths touched: |
Summary
EpochManager::begin_pendingremoved the group'sEpochStatefrom thestatesmap before running the fallibleprev.begin_pending(...)transition, re-inserting only on success. Whenprevwas notStable, the inner transition returnedInvalidTransition, the?propagated, and the map entry was left removed and never re-inserted.After that:
epoch(group_id)returnsNone→Engine::epochyieldsUnknownGroupcan_ingestflips totrueconfirm_published/publish_failedfor the just-allocatedpending_reffail withUnknownPending(the pending meta is inserted only after the failing line)This is the exact remove-before-transition non-atomicity the Sm1 audit fixed for
confirm_publish/rollback_publishbut never applied tobegin_pending.Reachability
The auto-commit ingest arm accepts proposals while
Recovering(becausecan_ingest()is true forRecovering). When an inboundSelfRemovearrives for which this client is the selected committer, the engine stages an OpenMLS commit, records a sent message, creates a fork-recovery snapshot, then callsbegin_pendingwithprev = Recovering→InvalidTransition→ orphaned state + dangling staged commit + untracked snapshot.Fix
begin_pending— mirror the Sm1 clone-before-transition pattern: clone the prior state and runprev.begin_pending(...)?before mutatingstates/committed_from/pending. A failing transition now leaves every map untouched.Stable(newEpochState::is_stable) before staging a commit in theProposalMessageauto-commit arm. A non-Stablegroup leaves the SelfRemove proposal queued (it was already stored) instead of staging a commitbegin_pendingwould reject. Emits anAutoCommitDecisionaudit row with reasongroup_not_stable.Tests
epoch_manager::tests::begin_pending_failure_leaves_state_intact— drives a group intoRecovering, asserts a failedbegin_pendingleavesstates/committed_from/pendingintact (no orphan).epoch_manager::tests::begin_pending_success_records_all_bookkeeping— happy path still records all bookkeeping.Verification
cargo fmt --all --check✅cargo clippy -p cgka-traits -p cgka-engine --all-targets -- -D warnings✅cargo test -p cgka-traits -p cgka-engine✅ (incl. publish-lifecycle + SelfRemove auto-commit tests)RUSTFLAGS='-D warnings' cargo check --workspace --all-targets✅Sensitive paths
Touches CGKA group-state core (
cgka-engineepoch state machine + ingest). Flagged for merge-gate escalation; not auto-merged.Closes #146