
fix(cgka-engine): make EpochManager::begin_pending atomic (#146) #408

Open
agent-p1p wants to merge 1 commit into master from pip/darkmatter-146


@agent-p1p agent-p1p commented Jun 14, 2026

Contributor

Summary

EpochManager::begin_pending removed the group's EpochState from the states map before running the fallible prev.begin_pending(...) transition, re-inserting only on success. When prev was not Stable, the inner transition returned InvalidTransition, the ? propagated, and the map entry was left removed and never re-inserted.

After that:

  • epoch(group_id) returns None → Engine::epoch yields UnknownGroup
  • can_ingest flips to true
  • confirm_published / publish_failed for the just-allocated pending_ref fail with UnknownPending (the pending meta is inserted only after the failing line)

This is the exact remove-before-transition non-atomicity the Sm1 audit fixed for confirm_publish / rollback_publish but never applied to begin_pending.
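The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the crate's code: the EpochState variants, the EpochError type, and the begin_pending_remove_first helper are simplified stand-ins assumed for illustration.

```rust
use std::collections::HashMap;

// Simplified stand-ins; the variants and fields are assumptions, not the crate's types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum EpochState {
    Stable { epoch: u64 },
    Pending { epoch: u64 },
    Recovering { epoch: u64 },
}

#[derive(Debug, PartialEq)]
enum EpochError {
    UnknownGroup,
    InvalidTransition,
}

impl EpochState {
    // Fallible transition: only Stable may move to Pending.
    fn begin_pending(self) -> Result<EpochState, EpochError> {
        match self {
            EpochState::Stable { epoch } => Ok(EpochState::Pending { epoch }),
            _ => Err(EpochError::InvalidTransition),
        }
    }
}

// Pre-fix shape: the entry leaves the map before the fallible step runs.
fn begin_pending_remove_first(
    states: &mut HashMap<String, EpochState>,
    group_id: &str,
) -> Result<(), EpochError> {
    let prev = states.remove(group_id).ok_or(EpochError::UnknownGroup)?;
    // An early return here drops `prev`; nothing puts the entry back.
    let next = prev.begin_pending()?;
    states.insert(group_id.to_string(), next);
    Ok(())
}
```

Calling begin_pending_remove_first on a Recovering group returns InvalidTransition and leaves the map without that group, so a second call on the same group reports UnknownGroup.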

Reachability

The auto-commit ingest arm accepts proposals while Recovering (because can_ingest() is true for Recovering). When an inbound SelfRemove arrives for which this client is the selected committer, the engine stages an OpenMLS commit, records a sent message, creates a fork-recovery snapshot, then calls begin_pending with prev = Recovering → InvalidTransition → orphaned state + dangling staged commit + untracked snapshot.

Fix

  1. Atomic begin_pending — mirror the Sm1 clone-before-transition pattern: clone the prior state and run prev.begin_pending(...)? before mutating states / committed_from / pending. A failing transition now leaves every map untouched.
  2. Guard the auto-commit path — require Stable (new EpochState::is_stable) before staging a commit in the ProposalMessage auto-commit arm. A non-Stable group leaves the SelfRemove proposal queued (it was already stored) instead of staging a commit begin_pending would reject. Emits an AutoCommitDecision audit row with reason group_not_stable.
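A rough sketch of the clone-before-transition ordering from item 1 follows. The field types, the String group ids, and the next_ref counter are assumptions made to keep the example self-contained; only the names states, committed_from, pending, and begin_pending come from this PR.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types; the real EpochManager tracks more than this.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum EpochState {
    Stable { epoch: u64 },
    Pending { epoch: u64, pending_ref: u64 },
    Recovering { epoch: u64 },
}

#[derive(Debug, PartialEq)]
enum EpochError {
    UnknownGroup,
    InvalidTransition,
}

impl EpochState {
    fn begin_pending(self, pending_ref: u64) -> Result<EpochState, EpochError> {
        match self {
            EpochState::Stable { epoch } => Ok(EpochState::Pending { epoch, pending_ref }),
            _ => Err(EpochError::InvalidTransition),
        }
    }
}

#[derive(Default)]
struct EpochManager {
    states: HashMap<String, EpochState>,
    committed_from: HashMap<u64, EpochState>, // pending_ref -> state to roll back to
    pending: HashMap<u64, String>,            // pending_ref -> group id
    next_ref: u64,
}

impl EpochManager {
    fn begin_pending(&mut self, group_id: &str) -> Result<u64, EpochError> {
        // Clone instead of remove: the map keeps its entry if the transition fails.
        let prev = self.states.get(group_id).cloned().ok_or(EpochError::UnknownGroup)?;
        let pending_ref = self.next_ref;
        // Fallible step runs first; `?` returns before any map is touched.
        let next = prev.clone().begin_pending(pending_ref)?;
        // Infallible bookkeeping only from here on.
        self.next_ref += 1;
        self.states.insert(group_id.to_string(), next);
        self.committed_from.insert(pending_ref, prev);
        self.pending.insert(pending_ref, group_id.to_string());
        Ok(pending_ref)
    }
}
```

The cost is one extra clone of the prior state per call. In exchange, every mutation sits after the last point where the function can return an error, so no re-insert-on-failure path is needed.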

Tests

  • epoch_manager::tests::begin_pending_failure_leaves_state_intact — drives a group into Recovering, asserts a failed begin_pending leaves states/committed_from/pending intact (no orphan).
  • epoch_manager::tests::begin_pending_success_records_all_bookkeeping — happy path still records all bookkeeping.

Verification

  • cargo fmt --all --check
  • cargo clippy -p cgka-traits -p cgka-engine --all-targets -- -D warnings
  • cargo test -p cgka-traits -p cgka-engine ✅ (incl. publish-lifecycle + SelfRemove auto-commit tests)
  • RUSTFLAGS='-D warnings' cargo check --workspace --all-targets

Sensitive paths

Touches CGKA group-state core (cgka-engine epoch state machine + ingest). Flagged for merge-gate escalation; not auto-merged.

Closes #146



begin_pending removed the group's EpochState from the states map BEFORE
running the fallible prev.begin_pending(...) transition, re-inserting only
on success. When prev was not Stable the inner transition returned
InvalidTransition, the ? propagated, and the entry was left removed and
never re-inserted -- orphaning the group to UnknownGroup, flipping
can_ingest to true, and stranding the just-allocated pending_ref as
UnknownPending. This is the remove-before-transition non-atomicity the Sm1
audit fixed for confirm_publish/rollback_publish but never applied here.

Mirror the Sm1 clone-before-transition fix: clone the prior state and run
the transition before mutating states/committed_from/pending, so a failing
transition leaves every map untouched.

Also guard the auto-commit ingest arm to require Stable (via new
EpochState::is_stable) before staging a commit. The engine accepts ingest
while Recovering (can_ingest is true for Recovering), making the SelfRemove
auto-commit path reachable in a non-Stable state; without the guard it
would stage an OpenMLS commit, snapshot, and sent-message record only for
begin_pending to reject the transition, leaving a dangling commit. The
proposal is left queued until the group returns to Stable.
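The ordering of the guard relative to the side effects can be sketched as below. This is an illustration under assumptions: Group, Outcome, and ingest_self_remove are hypothetical names, the staged_commits counter stands in for the OpenMLS commit, snapshot, and sent-message record, and treating Pending as closed to ingest is a guess rather than something this PR states.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum EpochState {
    Stable,
    Pending,
    Recovering,
}

impl EpochState {
    fn is_stable(&self) -> bool {
        matches!(self, EpochState::Stable)
    }

    // Recovering accepts ingest per the PR text; Pending being closed is an assumption.
    fn can_ingest(&self) -> bool {
        matches!(self, EpochState::Stable | EpochState::Recovering)
    }
}

// Hypothetical outcome type for this sketch only.
#[derive(Debug, PartialEq)]
enum Outcome {
    Rejected,
    Queued { reason: &'static str },
    AutoCommitStaged,
}

#[derive(Debug)]
struct Group {
    state: EpochState,
    queued_proposals: Vec<String>,
    staged_commits: usize, // stands in for commit + snapshot + sent-message record
}

fn ingest_self_remove(group: &mut Group, proposal: &str) -> Outcome {
    if !group.state.can_ingest() {
        return Outcome::Rejected;
    }
    // The proposal is stored whether or not an auto-commit follows.
    group.queued_proposals.push(proposal.to_string());
    // Guard sits before every staging side effect.
    if !group.state.is_stable() {
        return Outcome::Queued { reason: "group_not_stable" };
    }
    group.staged_commits += 1;
    group.state = EpochState::Pending;
    Outcome::AutoCommitStaged
}
```

With a Recovering group the proposal is stored and nothing is staged, so there is no commit left to unwind when the group later returns to Stable.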

Adds epoch_manager regression tests and updates the Sm1 AGENTS.md note.

Closes #146

@stage-review

stage-review Bot commented Jun 14, 2026


Ready to review this PR? Stage has broken it down into 4 individual chapters for you:

  1. Add is_stable helper to EpochState
  2. Make EpochManager::begin_pending atomic
  3. Guard auto-commit ingest with stability check
  4. Update AGENTS documentation for Sm1 audit

Chapters generated by Stage for commit cf9dde7 on Jun 14, 2026 7:29pm UTC.

@agent-p1p

Contributor Author

Pip adversarial review for #146 / PR #408.

Verdict: no blocking findings.

What I checked:

  • EpochManager::begin_pending now clones the prior state, performs the fallible prev.begin_pending(...) transition first, and only mutates states, committed_from, and pending after success. Failed non-Stable transitions leave the group state intact.
  • The inbound proposal auto-commit path now checks epoch_manager.state(&group_id).is_none_or(|s| s.is_stable()) before creating the recovery snapshot, OpenMLS pending commit, sent-message record, or pending ref. When the group is not Stable it leaves the proposal queued and returns Processed without staging the auto-commit. That addresses the reachable Recovering/SelfRemove side-effect chain described in #146 (EpochManager::begin_pending orphans group epoch state when the inner transition fails: remove-before-transition non-atomicity, Sm1 fix not applied here).
  • Existing SelfRemove/deferred-publish behavior still passes.
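One detail of the is_none_or guard worth spelling out: a group with no tracked state passes the check, the same as a Stable one. The sketch below shows the three cases; the HashMap lookup and the may_stage_auto_commit helper are assumptions standing in for epoch_manager.state(&group_id), and Option::is_none_or requires Rust 1.82 or newer.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum EpochState {
    Stable,
    Pending,
    Recovering,
}

impl EpochState {
    fn is_stable(&self) -> bool {
        matches!(self, EpochState::Stable)
    }
}

// Hypothetical helper: true when the state is absent or Stable.
fn may_stage_auto_commit(states: &HashMap<String, EpochState>, group_id: &str) -> bool {
    states.get(group_id).is_none_or(|s| s.is_stable())
}
```

An untracked group therefore falls through to the staging path rather than being reported as not stable; whether that is the intended behaviour for an unknown group is a question for the ingest code, not something this sketch settles.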

Local verification:

  • git diff --check origin/master...HEAD passed.
  • PATH=/home/jeff/.cargo/bin:$PATH cargo test -p cgka-traits -p cgka-engine begin_pending -- --nocapture passed.
  • PATH=/home/jeff/.cargo/bin:$PATH cargo test -p cgka-engine selfremove -- --nocapture passed: 4 selected integration tests passed (selfremove_full_flow_with_auto_commit, selfremove_auto_commit_publish_failed_rolls_back_projection, leave_produces_selfremove_proposal, reopen_preserves_deferred_selfremove_auto_commit).
  • GitHub PR checks are green.

Non-blocking suggestion:

Sensitive paths touched: crates/cgka-engine/src/epoch_manager.rs, crates/cgka-engine/src/message_processor/ingest.rs, crates/traits/src/engine_state.rs.

@agent-p1p agent-p1p marked this pull request as ready for review June 15, 2026 06:57

Development

Successfully merging this pull request may close these issues.

EpochManager::begin_pending orphans group epoch state when the inner transition fails (remove-before-transition non-atomicity, Sm1 fix not applied here)
