Leios: late-join support #2040
      -- ^ protocol version
    , hbMayEbAnnouncement :: Maybe EbAnnouncement
      -- ^ Leios EB announcement
    , hbMayCertifiedEb :: Maybe LeiosPoint
What is this used for (right now)? In the CIP we speculatively put it (as a single bit) as it may help synchronizing nodes to know the size of what they request.
Currently it tells ChainSel "this ranking block certifies EB X" and this is what we use to check whether the EB closure is available and, if not, defer chain selection until it arrives (concretely we use the LeiosPoint to look in the LeiosDB if the EB closure is available).
The ranking-blocks spec mentions the addition of a hash32 but not a (slotNo, EbHash), so I don't know if we want to change hbMayCertifiedEb accordingly.
I see it now, we need it to have a proof that the node that gave us the header actually has the EB. Of course, this means that we must also check whether there is a cert in the body if the header claims it.
I'd propose following what I implemented earlier but removed because it wasn't used.
I introduced a notion of a BodyType into the PraosHeader, which is imo a better name for its purpose (a Header is associated with Bodies of many types).
Combined with the LeiosState that's part of PraosState, one gets the answer to what's being certified: just look into leiosStatePreviousAnnouncement and check whether the Header is associated with a LeiosCertificate BodyType.
It seems there's a contradiction between CIP and CDDL:
- CIP-0164 (prose + appendix CDDL): https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md
  - Prose §"Ranking Blocks", `certified_eb` as single bit: https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md#L695-L697
  - Appendix B CDDL, `? certified_eb : bool`: https://github.com/cardano-scaling/CIPs/blob/leios/CIP-0164/README.md#L2903
- Linear ranking-blocks CDDL diff (`? certified_eb : hash32`): https://github.com/input-output-hk/ouroboros-leios/blob/main/docs/cddl/diffs/linear/ranking-blocks.md#L41
Before deciding on this we should agree which spec we should follow.
After a bit of git archeology, I see that Will wrote ranking-blocks.md as part of the process of writing the CIP. So I think that's a "showing your work" file. Perhaps we could just add a comment to its certified_eb field saying the CIP now uses a bit for this.
Parameterise runThreadNet over NodeJoinPlan and add a new property that starts node 3 at a random slot while nodes 0–2 run from slot 0. This demonstrates the crash in resolveLeiosBlock when a late-joining node encounters a CertRB referencing an EB it never received.
Add a Maybe LeiosPoint field to the Praos HeaderBody that records which EB a certifying ranking block certifies. The CBOR codec uses length-switching (10/11/12) to stay backwards compatible with non-Leios headers. Forging passes the certificate's EB point for CertRBs and Nothing for regular transaction blocks. The field propagates through HeaderView, mkHeader, and all construction sites (generators, examples).
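The length-switching idea can be illustrated with a toy codec. This is only a sketch under stated assumptions: `Term` is a hypothetical stand-in for a CBOR array item, the ten legacy fields are plain `Int`s, and the use of a null placeholder for the announcement slot at length 12 is a guess at how the (Nothing, Just) case might be disambiguated, not a description of the real encoder.

```haskell
import Control.Monad (guard)

-- Hypothetical stand-ins; the real EbAnnouncement and LeiosPoint are richer.
newtype EbAnnouncement = EbAnnouncement Int deriving (Eq, Show)
newtype LeiosPoint = LeiosPoint Int deriving (Eq, Show)

-- Toy header body: ten legacy fields (plain Ints here) plus the two optionals.
data HeaderBody = HeaderBody
  { hbLegacy :: [Int]
  , hbMayEbAnnouncement :: Maybe EbAnnouncement
  , hbMayCertifiedEb :: Maybe LeiosPoint
  }
  deriving (Eq, Show)

-- Toy wire format: a flat list of terms standing in for a CBOR array.
data Term = TInt Int | TNull deriving (Eq, Show)

-- The number of emitted terms (10, 11 or 12) tells the decoder which
-- optional fields are present.
encodeHeaderBody :: HeaderBody -> [Term]
encodeHeaderBody hb = map TInt (hbLegacy hb) ++ optionals
 where
  optionals = case (hbMayEbAnnouncement hb, hbMayCertifiedEb hb) of
    (Nothing, Nothing) -> [] -- length 10: legacy header
    (Just (EbAnnouncement a), Nothing) -> [TInt a] -- length 11
    (mAnn, Just (LeiosPoint p)) -> [annTerm mAnn, TInt p] -- length 12
  -- Assumption: a null placeholder keeps the announcement slot when it is absent.
  annTerm = maybe TNull (\(EbAnnouncement a) -> TInt a)

-- Decode by switching on how many terms follow the ten legacy fields.
decodeHeaderBody :: [Term] -> Maybe HeaderBody
decodeHeaderBody ts = do
  let (legacyTerms, optionals) = splitAt 10 ts
  guard (length legacyTerms == 10)
  legacy <- traverse fromInt legacyTerms
  case optionals of
    [] -> Just (HeaderBody legacy Nothing Nothing)
    [TInt a] -> Just (HeaderBody legacy (Just (EbAnnouncement a)) Nothing)
    [ann, TInt p] -> do
      mAnn <- case ann of
        TNull -> Just Nothing
        TInt a -> Just (Just (EbAnnouncement a))
      Just (HeaderBody legacy mAnn (Just (LeiosPoint p)))
    _ -> Nothing
 where
  fromInt (TInt n) = Just n
  fromInt TNull = Nothing
```

A header with neither optional encodes to exactly ten terms, which is why an old decoder that only knows the ten-field layout would still accept it.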
When a CertRB arrives whose EB closure is not in the LeiosDB, record it in cdbPendingEBs and skip chain selection. Subsequent chain selections filter pending hashes from both lookupBlockInfo (predecessor tracing) and succsOf (successor enumeration), making the CertRB invisible until its EB closure arrives. Adds certifiedEbFromHeader to ResolveLeiosBlock so ChainSel can inspect the header without reaching into block-type-specific layers.
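A minimal sketch of the filtering described above, assuming the pending set maps each certified EB point to the hash of the CertRB waiting on it. The type synonyms and the two wrapper names (`hidePendingLookup`, `hidePendingSuccs`) are hypothetical; the real `lookupBlockInfo` and `succsOf` live in the VolatileDB API and have richer types.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical simplified types for illustration only.
type HeaderHash = String
type LeiosPoint = (Int, String) -- (slot, EB hash)

-- Assumed shape of cdbPendingEBs: certified EB point -> hash of the CertRB.
type PendingEBs = Map.Map LeiosPoint HeaderHash

-- All CertRB hashes currently waiting for an EB closure.
pendingHashes :: PendingEBs -> Set.Set HeaderHash
pendingHashes = Set.fromList . Map.elems

-- Predecessor tracing: a pending CertRB looks as if it were not stored at all.
hidePendingLookup ::
  PendingEBs -> (HeaderHash -> Maybe info) -> HeaderHash -> Maybe info
hidePendingLookup pending lookupInfo h
  | h `Set.member` pendingHashes pending = Nothing
  | otherwise = lookupInfo h

-- Successor enumeration: pending CertRBs are dropped from the successor set.
hidePendingSuccs ::
  PendingEBs ->
  (HeaderHash -> Set.Set HeaderHash) ->
  HeaderHash ->
  Set.Set HeaderHash
hidePendingSuccs pending succsOf h =
  succsOf h `Set.difference` pendingHashes pending
```

Filtering both directions matters: hiding only successors would still let a candidate be traced back through a pending CertRB, and hiding only lookups would still let it be proposed as the next block.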
Assert that all nodes converge to the same chain. Fails as expected: the late node's chain is shorter (1 block vs 10) because CertRBs with missing EB closures are permanently excluded from ChainSel.
Add ChainSelReprocessBlock message type that re-runs chain selection for a single CertRB whose EB closure was previously missing. A new background thread (ebCompletionRunner) subscribes to LeiosDB notifications and enqueues ChainSelReprocessBlock when an EB becomes complete. The chain-consistency assertion still fails: the re-trigger fires correctly but most EB closures never complete on the late node because the fetch mechanism doesn't deliver historical EB bodies and txs.
When ChainSel filters a CertRB because its EB closure is missing, drive a fetch through LeiosFetch using each peer's ChainSync candidate fragment as a fallback peer source.

* Expose cdbPendingEBs via ChainDB.getPendingCertRBs.
* pendingEbReconciler in NodeKernel mirrors the pending set into Leios missingEbBodies with size 0; it never overwrites offer-supplied sizes and only removes its own size-0 entries.
* leiosFetchLogic walks per-peer ChainSync candidate fragments, extracts certified EB hashes via certifiedEbFromHeader, and passes a per-peer Set EbHash to leiosFetchLogicIteration.
* choosePeerEb and choosePeerTx fall back to candidate-derived peers when no peer has offered the EB body / tx-closure. A peer whose candidate contains the CertRB must have validated the closure locally, so it must also hold both the body and the txs.
* Relax the response-size check in msgLeiosBlock when the expected size is 0; the hash check remains authoritative.
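The peer-selection fallback might look roughly like the following. This is a sketch only: `choosePeersFor` is a hypothetical name, and the real `choosePeerEb` / `choosePeerTx` take more context (in-flight limits, sizes) than the two maps assumed here.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type EbHash = String -- hypothetical simplification

-- Prefer peers that explicitly offered the EB; only when nobody has offered
-- it, fall back to peers whose ChainSync candidate contains a CertRB
-- certifying it.
choosePeersFor ::
  EbHash ->
  Map.Map peer (Set.Set EbHash) -> -- EBs each peer has offered
  Map.Map peer (Set.Set EbHash) -> -- EBs certified on each peer's candidate
  [peer]
choosePeersFor eb offered fromCandidates
  | not (null offering) = offering
  | otherwise = holders fromCandidates
 where
  holders = Map.keys . Map.filter (Set.member eb)
  offering = holders offered
```

Keeping offers first preserves the existing behaviour for nodes that are caught up; the candidate-derived set is consulted only for historical EBs that no peer will announce again.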
The previous range allowed the late node to join as late as numSlots-1, leaving insufficient catch-up time for the chain-consistency assertion to hold for reasons unrelated to the late-join logic under test.
Closes a TOCTOU window where a CertRB could remain stranded in cdbPendingEBs after its EB closure arrived: if the closure completed between ChainSel's closure-query and its cdbPendingEBs insert, the ebCompletionRunner notification fired against an empty pending set and was dropped. The sweep re-enqueues any pending CertRB whose closure is now in LeiosDB, covering this race and other missed-notification scenarios (subscription gap at startup, etc.). Adds addReprocessBlock to the ChainDB API record so the fetch loop can trigger ChainSel reprocessing without holding ChainDbEnv.
If the EB closure completes between ChainSel's first 'is the closure present?' query and its cdbPendingEBs insert, ebCompletionRunner fires against an empty pending set and drops the notification. The previous commit's leiosFetchLogic sweep covers this race on its iteration cadence; this inline recheck closes the immediate window so the CertRB is processed in-place rather than waiting for the next tick. Cross-references between the two sites: ChainSel.hs points at the sweep as the load-bearing fix, NodeKernel.hs points at the recheck as the local optimization for the immediate race.
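The insert-then-recheck pattern can be sketched in a single-threaded simulation. Everything here is an assumption made for illustration: the `IORef`s stand in for LeiosDB and cdbPendingEBs, `deferOrProcess` is a hypothetical name, and the `raceWindow` hook exists only so the race can be reproduced deterministically.

```haskell
import Data.IORef (IORef, modifyIORef', readIORef)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type HeaderHash = String
type LeiosPoint = (Int, String) -- hypothetical (slot, EB hash) pair

data Outcome = Processed | Deferred deriving (Eq, Show)

deferOrProcess ::
  IORef (Set.Set LeiosPoint) -> -- stand-in for LeiosDB: complete closures
  IORef (Map.Map LeiosPoint HeaderHash) -> -- stand-in for cdbPendingEBs
  IO () -> -- test hook run between the first check and the insert
  LeiosPoint ->
  HeaderHash ->
  IO Outcome
deferOrProcess closures pending raceWindow point hdr = do
  present <- Set.member point <$> readIORef closures
  if present
    then pure Processed
    else do
      raceWindow -- the closure may complete here; its notification finds nothing pending
      modifyIORef' pending (Map.insert point hdr)
      -- Inline recheck: catch a closure that arrived during the window.
      presentNow <- Set.member point <$> readIORef closures
      if presentNow
        then do
          modifyIORef' pending (Map.delete point)
          pure Processed
        else pure Deferred
```

Without the second query, a closure arriving inside `raceWindow` would leave the entry in the pending map until the periodic sweep picks it up.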
Drop unused Data.Set and EbHash imports, and rename the shadowing 'handle' binding to 'csHandle' in the candidateCertEbs computation.
Async cancellation on shutdown was not releasing the connection. The in-memory backend's close is a no-op so tests were unaffected, but the SQLite backend leaked the database handle.
Move the size-0 / offer-coexistence Map discipline out of the inline reconciler and into 'applyPendingAdded' / 'applyPendingRemoved' next to 'LeiosOutstanding'. Add unit tests for the two invariants: pending entries round-trip cleanly, and offer-supplied entries survive a pending add/remove.
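The two invariants can be stated compactly over a plain `Map`. The signatures below are assumptions chosen for illustration (the real helpers operate on `LeiosOutstanding`, whose layout is not shown here), but the size-0 discipline is the one described above.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type EbHash = String -- hypothetical simplification
type BytesSize = Word

-- Add pending EBs with size 0, never overwriting a size learned from an offer.
applyPendingAdded ::
  Set.Set EbHash -> Map.Map EbHash BytesSize -> Map.Map EbHash BytesSize
applyPendingAdded added missing =
  Set.foldr (\h -> Map.insertWith keepExisting h 0) missing added
 where
  -- insertWith passes the new value first and the existing value second.
  keepExisting _new old = old

-- Remove only the entries the reconciler itself created (size 0).
applyPendingRemoved ::
  Set.Set EbHash -> Map.Map EbHash BytesSize -> Map.Map EbHash BytesSize
applyPendingRemoved removed missing =
  Set.foldr (Map.update dropIfOwn) missing removed
 where
  dropIfOwn size = if size == 0 then Nothing else Just size
```

Using size 0 as the ownership marker avoids a second bookkeeping structure, at the cost of requiring the response-size check to tolerate an expected size of 0.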
The HeaderBody generators were hardcoding hbMayCertifiedEb to Nothing, so the len-12 CBOR branch (and the (Nothing, Just) / (Just, Just) combinations of the two optional fields) was never exercised by roundtrip property tests. Add an Arbitrary LeiosPoint and let both HeaderBody generators sample the optionals.
Previously the new ChainSelReprocessBlock message reused the LoE event, conflating two unrelated reprocessing mechanisms on the operator side. Add AddedReprocessBlockToQueue / PoppedReprocessBlockFromQueue constructors, both carrying the CertRB hash so the events are correlatable across logs, and thread the tracer through addReprocessBlock.
Insertion into cdbPendingEBs is keyed by LeiosPoint, so removal should be too. Carrying the point on the reprocess message replaces an O(n) Map.filter (/= hash) with an O(log n) Map.delete and removes the implicit value-equals-header-hash invariant. The header hash stays on the message because the consumer still needs it to look up the header in the VolatileDB.
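The difference between the two removal strategies, in a small sketch with hypothetical simplified types:

```haskell
import qualified Data.Map.Strict as Map

type HeaderHash = String
type LeiosPoint = (Int, String) -- hypothetical (slot, EB hash) pair

-- Old shape: only the header hash is known, so every entry is inspected, O(n).
-- Correct only if each stored value really is the CertRB's header hash.
removeByHash ::
  HeaderHash -> Map.Map LeiosPoint HeaderHash -> Map.Map LeiosPoint HeaderHash
removeByHash hash = Map.filter (/= hash)

-- New shape: the reprocess message carries the key, so removal is O(log n)
-- and no longer depends on what the values happen to be.
removeByPoint ::
  LeiosPoint -> Map.Map LeiosPoint HeaderHash -> Map.Map LeiosPoint HeaderHash
removeByPoint = Map.delete
```

The two agree whenever each header hash appears under exactly one key; keying removal by point means that assumption no longer has to hold.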
"Step 2" and "step 3" were private references to the late-join implementation plan. Replace them with cross-references to the mechanisms themselves (ChainSel filter, ebCompletionRunner).
The project is moving away from RecordWildCards. Project to explicit field accessors at the two sites this branch introduced new wildcard uses (ebCompletionRunner and the ChainSelReprocessBlock equation of chainSelSync). The pre-existing wildcards elsewhere in the file are left intact.
Pull the per-peer scan into a named 'certifiedEbsFromCandidate' helper, and add inline comments explaining the mapKeysMonotonic safety justification and the singleton-list-generator pattern. No behaviour change.
The voting-key hack in protocolInfoCardano did `credssShelleyBased !! 0`, which crashes a relay node (no leader credentials) with "Prelude.!!: index too large" before diffusion starts. Use `listToMaybe` so a node without Shelley-based credentials gets `topLevelConfigVotingKey = Nothing` instead.
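For illustration, the total alternative to `!! 0` on a toy credentials type (`Credentials` and `votingKeyFor` are hypothetical names, not the ones used in `protocolInfoCardano`):

```haskell
import Data.Maybe (listToMaybe)

-- Hypothetical stand-in for the Shelley-based leader credentials.
newtype Credentials = Credentials String deriving (Eq, Show)

-- `creds !! 0` throws on an empty list; listToMaybe returns Nothing instead,
-- which is the case a relay node without leader credentials hits.
votingKeyFor :: [Credentials] -> Maybe Credentials
votingKeyFor = listToMaybe
```

When the list is non-empty the result is the same first element that `!! 0` would have selected, so block-producing nodes see no change.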
Action points after discussing this PR with @nfrisby (to be tackled in no particular order):
Summary
Closes: input-output-hk/ouroboros-leios#890
- `prop_leios_late_join` test: 4 nodes, node 3 joins at a random slot. Demonstrates the `resolveLeiosBlock` crash when a late node encounters a CertRB referencing an EB it never saw.
- Add `hbMayCertifiedEb :: Maybe LeiosPoint` to the Praos `HeaderBody` so CertRBs carry the certified EB point in the header (length-switching CBOR 10/11/12).
- CertRBs whose EB closure is missing are recorded in `cdbPendingEBs` and made invisible to chain selection (both successor enumeration and predecessor tracing via `lookupBlockInfo`).

Late-join test passes 100 runs (no crash).
Still WIP — CertRBs are permanently excluded on the late node (no re-trigger yet).
Remaining steps: