fix(pipeline): publisher fencing/WAL + agent gossip + network-sim parity (4/4) #334
Fixes the data-pipeline layer of the bug-fix backlog from PR #327. Slice 4 of 4 (split by architectural layer for reviewability); the heaviest slice, because the publisher↔agent gossip flow cross-cuts the most fixtures.

Publisher bug fixes
- P-2: fencing token in dkg-publisher.ts. Concurrent publish requests could step on each other's WAL entries; now each publish acquires a monotonic fencing token, and the WAL recovery path rejects entries whose token is older than the latest known.
- chain-event-poller.ts: transient-error classification. Network blips are now retried with backoff instead of bubbling up as fatal; permanent errors still surface.
- async-lift-publisher-impl.ts: tighter key-bound subtraction so the lift accumulator can't drift across cg boundaries.
- update-handler.ts: r23-4 fix; don't double-emit on the publish retry path.

Agent bug fixes
- A-5: gossip envelope signing in dkg-agent.ts (the canonicalisation step previously dropped optional fields, causing signature mismatches on round-trip through the network-sim).
- A-7: DID-format validation hardened in endorse.ts; accept the full did:* spec rather than the v9 short-form only.
- A-12 / A-13: workspace-config validation now catches missing cg-private-key and asymmetric per-cg overrides at config-load time rather than at first publish (much better operator UX).
- A-15: signed-gossip-publish egress now includes the cg-id binding so a malicious peer can't replay a signed envelope to a different cg.

Network-sim
- K-4 / K-5: deterministic RNG + libp2p parity in sim-engine.ts. The sim previously seeded its RNG from Math.random(), making test failures non-reproducible. The gossip-fanout topology had also diverged from real libp2p peer selection; it now matches the pubsub.gossipsub heartbeat semantics, so sim-only passes can't lie about real-network behaviour. (These are the K-4/K-5 sentinels documented in .test-audit/BUGS_FOUND.md.)
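The P-2 fencing-token idea described above can be sketched as follows. This is a hedged, minimal illustration, not the repo's actual API: `FencedWal`, `acquireToken`, and `recover` are illustrative names, and the real dkg-publisher.ts WAL certainly carries more state.

```typescript
// Hypothetical sketch of P-2: each publish acquires a monotonically
// increasing fencing token; appends with a stale token are rejected, and
// recovery skips entries fenced off by a newer writer (out-of-order replay).
class FencedWal {
  private latestToken = 0;
  private entries: Array<{ token: number; payload: string }> = [];

  // Each publish attempt takes the next token before writing.
  acquireToken(): number {
    return ++this.latestToken;
  }

  append(token: number, payload: string): void {
    if (token < this.latestToken) {
      // A newer writer has already fenced this one off.
      throw new Error(`stale fencing token ${token} < ${this.latestToken}`);
    }
    this.latestToken = token;
    this.entries.push({ token, payload });
  }

  // Recovery path: replay in order, rejecting any entry whose token is
  // older than the highest token already seen during replay.
  recover(): string[] {
    let highest = 0;
    const replayed: string[] = [];
    for (const e of this.entries) {
      if (e.token < highest) continue; // stale writer's entry, drop it
      highest = e.token;
      replayed.push(e.payload);
    }
    return replayed;
  }
}
```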
Tests added
- publisher/test/wal-recovery.test.ts (new, +1320 lines): full P-2 fencing-token lifecycle including crash/restore, concurrent-writer starvation, and out-of-order replay rejection.
- publisher/test/chain-event-poller-r24-4.test.ts (new, +478 lines): transient-vs-fatal classification matrix.
- publisher/test/chain-event-poller-transient-classifier.test.ts (new, +214 lines): unit-level coverage for the classifier itself.
- publisher/test/async-lift-subtraction-key-bound.test.ts (new, +167): the cg-boundary regression for async-lift.
- publisher/test/per-cg-quorum-state.test.ts (new, +245).
- publisher/test/update-handler-r23-4.test.ts (new, +132).
- agent/test/op-wallets-and-workspace-config.test.ts (new, +503): A-12/A-13 operator-config validation matrix.
- agent/test/per-cg-quorum-rpc-failure-extra.test.ts (new, +177): RPC-failure isolation per cg quorum.
- agent/test/signed-gossip-publish-egress.test.ts (new, +150): A-15 cg-id binding on egress.
- agent/test/strict-gossip-envelope-extra.test.ts (new, +106): A-5 envelope canonicalisation regression.
- agent/test/finalization-handler.test.ts (new, +103).
- agent/test/wm-multi-agent-isolation-extra.test.ts (+393): expanded isolation coverage.
- network-sim/test/network-sim-extra.test.ts (+187): K-4/K-5 deterministic-RNG + parity assertions.

Coverage ratchet
- tornadoAgentCoverage 67/68/57/66 → 75/78/63/74.

Sequencing: this is PR 4 of 4, the final and largest slice. PR 1 (#331) contracts+chain, PR 2 (#332) storage+query, PR 3 (#333) operator surface. This PR sits on top of those layers and may need a clean rebase if any of them merge first.

Made-with: Cursor
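For the K-4 deterministic-RNG fix described above, a seeded PRNG of the kind a sim engine typically swaps in for `Math.random()` looks like the following. This is a hedged sketch using the well-known mulberry32 generator; the actual sim-engine.ts implementation may use a different algorithm.

```typescript
// Hypothetical sketch for K-4: a deterministic, seedable PRNG (mulberry32)
// so sim runs are reproducible from a logged seed, unlike Math.random().
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // force to uint32
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Map the 32-bit result into [0, 1), same contract as Math.random().
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

With this shape, a failing sim run can print its seed and any later run with the same seed replays the identical event order.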
Two AA-2 assertions in attested-assets-extra.test.ts used a fixed number of `setTimeout(r, 0)` yields between an async gossip publish (which enqueues an async ed25519 verification on the receiver) and the assertion observing the resulting event. On slower CI runners the verification didn't resolve before the assertions ran, so the test failed spuriously. Switched to a 10ms-poll `waitFor(predicate, timeoutMs)` helper that polls the observable we are about to assert against: faster on the happy path and immune to the race. This belongs in this PR because the race is on the agent/publisher gossip-publish path covered by slice 4. Made-with: Cursor
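The `waitFor(predicate, timeoutMs)` helper described above could look like this; a minimal sketch, assuming the real helper in the test suite has the same polling semantics even if its signature differs.

```typescript
// Hypothetical sketch of the waitFor helper: poll the predicate every
// `pollMs` until it returns true or the timeout elapses. Unlike a fixed
// number of setTimeout(r, 0) yields, this cannot race slow async work.
async function waitFor(
  predicate: () => boolean | Promise<boolean>,
  timeoutMs = 2000,
  pollMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await predicate()) return; // happy path resolves immediately
    if (Date.now() >= deadline) {
      throw new Error(`waitFor: condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```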
const message = err instanceof Error ? err.message : String(err);
const isTransientHeadRace =
  /block range extends beyond current head block/i.test(message)
  || /code=UNKNOWN_ERROR.*32602/i.test(message);
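A classification like `isTransientHeadRace` above typically feeds a retry loop with backoff, as the PR description says the poller now does. The following is a hedged sketch of that wiring; `retryTransient`, `maxAttempts`, and `baseDelayMs` are illustrative names, not the actual chain-event-poller.ts API.

```typescript
// Hypothetical sketch: retry transient failures with exponential backoff,
// rethrowing immediately on permanent errors or when attempts are exhausted.
async function retryTransient<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors and exhausted budgets still surface as fatal.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```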
&& opts.agentAddress
&& this.localAgents.size > 1
&& !callerSelfReadsOwnWm
&& !opts.adminAuthenticated
🔴 Bug: this bypass never fires today because /api/query still only passes callerAgentAddress into agent.query() and does not set adminAuthenticated. On multi-agent nodes, valid node-level admin tokens will now hit the new WM signature gate and get empty results for legitimate cross-agent reads. Either plumb the flag from the daemon in this PR or derive the admin case here from the authenticated caller context.
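One way the second option the reviewer mentions (deriving the admin case from the authenticated caller context) might look, as a hedged sketch: `CallerContext`, `QueryOpts`, and `withAdminFlag` are illustrative names invented here, not identifiers from this codebase.

```typescript
// Hypothetical sketch: derive adminAuthenticated from the caller context
// the daemon already authenticated, so node-level admin tokens keep working
// on multi-agent nodes without plumbing a new flag through /api/query.
interface CallerContext {
  agentAddress?: string;
  isNodeAdmin: boolean; // assumed to be set by the daemon's token auth
}

interface QueryOpts {
  agentAddress?: string;
  adminAuthenticated?: boolean;
}

function withAdminFlag(opts: QueryOpts, caller: CallerContext): QueryOpts {
  // Preserve an explicitly set flag; otherwise fall back to the caller's
  // authenticated admin status.
  return {
    ...opts,
    adminAuthenticated: opts.adminAuthenticated ?? caller.isNodeAdmin,
  };
}
```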
  throw new Error('workspace config: root must be an object');
}
const obj = raw as Record<string, unknown>;
const contextGraph = obj.contextGraph;
🔴 Bug: this parser hard-requires contextGraph, but the existing workspace config contract still accepts project as a backward-compatible alias (packages/mcp-dkg/src/config.ts already does contextGraph ?? project). Any repo that still uses project: in .dkg/config.* or dkg-config fences will start failing to load here. Accept both keys to avoid breaking existing workspaces.
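The fix the reviewer asks for could be as small as the following sketch, mirroring the `contextGraph ?? project` fallback said to exist in packages/mcp-dkg/src/config.ts. `readContextGraph` is an illustrative wrapper, not the parser's real shape.

```typescript
// Hypothetical sketch: accept both the current `contextGraph` key and the
// backward-compatible `project` alias when loading workspace config.
function readContextGraph(raw: unknown): unknown {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("workspace config: root must be an object");
  }
  const obj = raw as Record<string, unknown>;
  const contextGraph = obj.contextGraph ?? obj.project; // legacy alias
  if (contextGraph === undefined) {
    throw new Error(
      "workspace config: missing contextGraph (or legacy project) key",
    );
  }
  return contextGraph;
}
```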
| ); | ||
| if (bData.bindings.length > 0) break; | ||
| const confirmedAsk = await nodeB.query( | ||
| `ASK { GRAPH ?g { ?kc <http://dkg.io/ontology/status> "confirmed" } }`, |
🟡 Issue: this ASK is unscoped, so any unrelated confirmed KC already present in node B will make bHasConfirmed true and let the loop pass even if ENTITY_1 only arrived via sync. Scope the check to the batch under test (for example by UAL/root entity in the target _meta graph) so the test actually proves FinalizationHandler ran for this publish.
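A scoped variant of the ASK could be built like this hedged sketch. The `_meta` graph naming convention and the `confirmedAskFor` helper are assumptions for illustration; the test should use whatever graph scheme the repo actually derives from the UAL.

```typescript
// Hypothetical sketch: bind the confirmed-status check to the batch under
// test via its UAL-derived _meta graph, instead of matching any graph ?g.
function confirmedAskFor(ual: string): string {
  const metaGraph = `<${ual}/_meta>`; // assumed graph-naming convention
  return (
    `ASK { GRAPH ${metaGraph} ` +
    `{ ?kc <http://dkg.io/ontology/status> "confirmed" } }`
  );
}
```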
Summary
PR 4 of 4 — the final and heaviest slice in the split of #327. This carries the data-pipeline layer where the publisher↔agent gossip flow cross-cuts the most fixtures, which is why all the related changes ride together.
Sequencing context
This is slice 4 of 4 from the #327 split: PR 1 (#331) contracts+chain, PR 2 (#332) storage+query, PR 3 (#333) operator surface.
This PR sits on top of the other layers and may need a clean rebase if any of them merge first. No code-level conflicts are expected; only fixture conflicts if storage adapter contracts change.