Parent: #726
Depends on: #727 (site-wide actions must be dispatched outside the per-resource FSM so the workload-scoped lock has a stable target).
Background
Replication mutating endpoints (POST / PUT / DELETE on /ops/replication/{wl}/{name}) are registered with ProxyTarget: false in api/ops_replication.go:31-41. A request lands on whichever microceph node received it. There is no in-process or cross-node coordination preventing two simultaneous mutations against the same Ceph resource.
Failure modes:
- Two concurrent enable requests on the same RBD pool both run PreFill, which snapshots PoolInfo.Mode and PoolStatus.State from Ceph (ceph/replication_rbd.go:85-109). Both believe the pool is not yet enabled. Both fire the FSM (each with its own state machine instance). Both call handlePoolEnablement, whose idempotency guard (ceph/replication_rbd.go:324-326) is read-then-act and does not interlock with the parallel request. The second request may issue a duplicate rbd mirror pool peer add or partially configure schedules.
- Concurrent disable and enable on the same resource produce ordering-dependent outcomes the operator cannot predict.
- Concurrent enable requests landing on different microceph nodes serialize at the Ceph mon level only as far as Ceph itself enforces; node-local state (keyrings written to disk, IsRemoteConfiguredForRbdMirror checks) is not coordinated.
The per-request FSM provides no help: each request gets its own state machine, so there is no shared "this resource is currently being modified" signal.
Proposed approach
Two layers, both required.
1. Route mutating requests through the cluster leader
Flip mutating replication endpoints to ProxyTarget: true so that microcluster routes them to the leader. Reads stay non-proxied for parallelism.
This serializes all mutating replication ops onto a single node, which is a precondition for in-process locking to be meaningful.
2. Per-resource named mutex on the executing node
Introduce a process-local lock keyed by {workload}/{resource-id}:
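// ceph/replication_lock.go

// repLocks holds one mutex per resource key (logically map[string]*sync.Mutex).
var repLocks sync.Map

// AcquireReplicationLock blocks until the caller holds the lock for the
// request's {workload}/{resource-id} key, then returns the release function.
func AcquireReplicationLock(req types.ReplicationRequest) func() {
	key := fmt.Sprintf("%s/%s", req.GetWorkloadType(), req.GetAPIObjectID())
	m, _ := repLocks.LoadOrStore(key, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	return mu.Unlock
}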
Wired in api/ops_replication.go::handleReplicationRequest around PreFill + FireCtx:
release := ceph.AcquireReplicationLock(req)
defer release()
err := rh.PreFill(ctx, req)
// ...
err = repFsm.FireCtx(ctx, event, rh, &resp, ...)
Reads (GET endpoints) skip lock acquisition. Writes block until the prior write on the same resource releases.
The sync.Map of mutexes accumulates one entry per unique resource ever touched, so it is bounded by the total resource count. Eviction is unnecessary at the expected scale.
Site-wide actions
Workload-level promote and demote (dispatched outside the per-resource FSM after #727) run against many resources. They acquire a workload-scoped lock (one mutex per workload) that excludes per-resource mutations during a site-wide action.
Tradeoff: site-wide action blocks all per-resource ops on the workload for its duration. Acceptable given site-wide actions are rare and operator-driven.
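One possible way to realize this exclusion, offered as an assumption rather than the settled design: make the workload-scoped lock a sync.RWMutex. Per-resource mutations take it in shared mode before taking their own resource mutex, and a site-wide action takes it exclusively. All function names below are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	workloadLocks sync.Map // logically map[string]*sync.RWMutex
	resourceLocks sync.Map // logically map[string]*sync.Mutex
)

func workloadLock(wl string) *sync.RWMutex {
	m, _ := workloadLocks.LoadOrStore(wl, &sync.RWMutex{})
	return m.(*sync.RWMutex)
}

// acquireResource takes a shared hold on the workload, then the exclusive
// per-resource mutex. The order is always workload first, resource second.
func acquireResource(wl, id string) func() {
	w := workloadLock(wl)
	w.RLock()
	m, _ := resourceLocks.LoadOrStore(wl+"/"+id, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	return func() {
		mu.Unlock()
		w.RUnlock()
	}
}

// acquireWorkload takes the workload lock exclusively for a site-wide action.
func acquireWorkload(wl string) func() {
	w := workloadLock(wl)
	w.Lock()
	return w.Unlock
}

// siteWideExcludes reports whether a per-resource operation would be kept
// out while a site-wide action holds the workload lock.
func siteWideExcludes(wl string) bool {
	release := acquireWorkload(wl)
	defer release()
	w := workloadLock(wl)
	if w.TryRLock() { // TryRLock requires Go 1.18 or later
		w.RUnlock()
		return false
	}
	return true
}

func main() {
	fmt.Println("per-resource ops excluded during site-wide action:", siteWideExcludes("rbd")) // true

	// Outside a site-wide action, different resources proceed independently.
	releaseA := acquireResource("rbd", "pools/a")
	releaseB := acquireResource("rbd", "pools/b")
	releaseB()
	releaseA()
}
```

With this shape, per-resource operations on different resources still run concurrently with each other, and only the site-wide action pays the cost of waiting for all of them to drain.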
Out of scope
- rbd / ceph CLI usage by an admin running raw commands outside microceph. The lock cannot coordinate with an external process. Documented as a limitation.
Acceptance
- Two parallel POST /ops/replication/rbd/pools/foo requests serialize: the second observes the first's mutation in its PreFill.
- Read endpoints remain unblocked by writes on the same resource.
- Site-wide promote and demote exclude concurrent per-resource mutations on the same workload.
- Mutating endpoints route to the cluster leader regardless of which node received the request.