Serialize concurrent mutating replication requests on the same resource #729

Parent: #726
Depends on: #727 (site-wide actions must be dispatched outside the per-resource FSM so the workload-scoped lock has a stable target).

## Background

Replication mutating endpoints (`POST` / `PUT` / `DELETE` on `/ops/replication/{wl}/{name}`) are registered with `ProxyTarget: false` in `api/ops_replication.go:31-41`. A request lands on whichever microceph node received it. There is no in-process or cross-node coordination preventing two simultaneous mutations against the same Ceph resource.

Failure modes:

- Two concurrent `enable` requests on the same RBD pool both run `PreFill`, which snapshots `PoolInfo.Mode` and `PoolStatus.State` from Ceph (`ceph/replication_rbd.go:85-109`). Both believe the pool is not yet enabled. Both fire the FSM (each with its own state machine instance). Both call `handlePoolEnablement`, whose idempotency guard (`ceph/replication_rbd.go:324-326`) is read-then-act and does not interlock with the parallel request. The second request may issue a duplicate `rbd mirror pool peer add` or leave schedules partially configured.
- Concurrent `disable` and `enable` on the same resource produce ordering-dependent outcomes the operator cannot predict.
- Concurrent `enable` requests landing on different microceph nodes serialize at the Ceph mon level only as far as Ceph itself enforces; node-local state (keyrings written to disk, `IsRemoteConfiguredForRbdMirror` checks) is not coordinated.

The per-request FSM provides no help: each request gets its own state machine, so there is no shared `this resource is currently being modified` signal.

## Proposed approach

Two layers, both required.

### 1. Route mutating requests through the cluster leader

Flip mutating replication endpoints to `ProxyTarget: true` so that microcluster routes them to the leader. Reads stay non-proxied for parallelism.

This funnels all mutating replication ops onto a single node. Routing alone does not order them, but it is a precondition for in-process locking to be meaningful.

### 2. Per-resource named mutex on the executing node

Introduce a process-local lock keyed by `{workload}/{resource-id}`:

```go
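// ceph/replication_lock.go
package ceph

import (
    "fmt"
    "sync"
    // plus the package that defines types.ReplicationRequest
)

// repLocks maps "{workload}/{resource-id}" to its *sync.Mutex.
var repLocks sync.Map

// AcquireReplicationLock blocks until req's resource lock is held; call the result to release.
func AcquireReplicationLock(req types.ReplicationRequest) func() {
    key := fmt.Sprintf("%s/%s", req.GetWorkloadType(), req.GetAPIObjectID())
    m, _ := repLocks.LoadOrStore(key, &sync.Mutex{})
    mu := m.(*sync.Mutex)
    mu.Lock()
    return mu.Unlock
}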
```

Wired in `api/ops_replication.go::handleReplicationRequest` around `PreFill` + `FireCtx`:

```go
release := ceph.AcquireReplicationLock(req)
defer release()
err := rh.PreFill(ctx, req)
// ...
err = repFsm.FireCtx(ctx, event, rh, &resp, ...)
```

Reads (`GET` endpoints) skip lock acquisition. Writes block until the prior write on the same resource releases.

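The `sync.Map` of mutexes accumulates one entry per unique resource ever touched, so it grows with the number of distinct resources seen over the process lifetime (entries for deleted resources are never removed). Eviction is unnecessary at the expected scale.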

### Site-wide actions

Workload-level promote and demote (dispatched outside the per-resource FSM after #727) run against many resources. They acquire a workload-scoped lock — one mutex per workload — that excludes per-resource mutations during a site-wide action.

Tradeoff: a site-wide action blocks all per-resource ops on the workload for its duration. Acceptable given that site-wide actions are rare and operator-driven.
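
One possible way to get that exclusion, sketched under assumptions the issue does not fix: the workload lock is a `sync.RWMutex`, per-resource mutations hold it in shared mode before taking their resource mutex, and site-wide actions hold it exclusively. All names here (`acquireResource`, `acquireSiteWide`, `admitsResourceOp`) are hypothetical. `TryRLock` and `TryLock` require Go 1.18 or later and are used only to probe lock state without blocking.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	workloadLocks sync.Map // map[string]*sync.RWMutex
	resourceLocks sync.Map // map[string]*sync.Mutex
)

func workloadLock(wl string) *sync.RWMutex {
	m, _ := workloadLocks.LoadOrStore(wl, &sync.RWMutex{})
	return m.(*sync.RWMutex)
}

// acquireResource takes the workload lock in shared mode, then the resource
// mutex. Lock order is always workload first, resource second.
func acquireResource(wl, id string) func() {
	w := workloadLock(wl)
	w.RLock()
	m, _ := resourceLocks.LoadOrStore(wl+"/"+id, &sync.Mutex{})
	mu := m.(*sync.Mutex)
	mu.Lock()
	return func() {
		mu.Unlock()
		w.RUnlock()
	}
}

// acquireSiteWide takes the workload lock exclusively, so it waits for
// in-flight per-resource mutations and blocks new ones.
func acquireSiteWide(wl string) func() {
	w := workloadLock(wl)
	w.Lock()
	return w.Unlock
}

// admitsResourceOp reports whether a per-resource mutation could enter now.
func admitsResourceOp(wl string) bool {
	w := workloadLock(wl)
	if !w.TryRLock() {
		return false
	}
	w.RUnlock()
	return true
}

func main() {
	release := acquireSiteWide("rbd")
	fmt.Println("resource op admitted during site-wide action:", admitsResourceOp("rbd"))
	release()
	fmt.Println("resource op admitted afterwards:", admitsResourceOp("rbd"))
}
```

Per-resource mutations on different resources still run in parallel because they share the workload lock. Site-wide actions take only the workload lock, so the fixed acquisition order avoids deadlock between the two layers.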

## Out of scope

- External `rbd` / `ceph` CLI usage by an admin running raw commands outside microceph. The lock cannot coordinate with an external process. Documented as a limitation.
- FSM internals and transition cleanup (#727).
- REST URL redesign (#728).

## Acceptance

- Two parallel `POST /ops/replication/rbd/pools/foo` requests serialize: second observes the first's mutation in its `PreFill`.
- Read endpoints remain unblocked by writes on the same resource.
- Site-wide promote and demote exclude concurrent per-resource mutations on the same workload.
- Mutating endpoints route to the cluster leader regardless of which node received the request.
