Leios: late-join support #2040

Draft
dnadales wants to merge 20 commits into leios-prototype from leios-late-join

Member

dnadales commented May 13, 2026

Summary

Closes: input-output-hk/ouroboros-leios#890

  • Add prop_leios_late_join test: 4 nodes, node 3 joins at a random slot.
    Demonstrates the resolveLeiosBlock crash when a late node encounters a CertRB referencing an EB it never saw.
  • Add hbMayCertifiedEb :: Maybe LeiosPoint to the Praos HeaderBody so CertRBs carry the certified EB point in the header (length-switching CBOR 10/11/12).
  • Filter pending CertRBs from ChainSel: CertRBs whose EB closure hasn't arrived are recorded in cdbPendingEBs and made invisible to chain selection (both successor enumeration and predecessor tracing via lookupBlockInfo).

Late-join test passes 100 runs (no crash).
Still WIP — CertRBs are permanently excluded on the late node (no re-trigger yet).

Remaining steps:

  • Chain-consistency assertion (all nodes converge to same chain)
  • Re-trigger ChainSel when EB closure arrives
  • Fetch mechanism for missing EB closures

dnadales changed the title from "Leios: late-join support (steps 0–2)" to "Leios: late-join support" May 13, 2026
-- ^ protocol version
, hbMayEbAnnouncement :: Maybe EbAnnouncement
-- ^ Leios EB announcement
, hbMayCertifiedEb :: Maybe LeiosPoint

What is this used for (right now)? In the CIP we speculatively put it (as a single bit) as it may help synchronizing nodes to know the size of what they request.

Member Author

Currently it tells ChainSel "this ranking block certifies EB X" and this is what we use to check whether the EB closure is available and, if not, defer chain selection until it arrives (concretely we use the LeiosPoint to look in the LeiosDB if the EB closure is available).

The ranking-blocks spec mentions the addition of a hash32 but not a (slotNo, EbHash), so I don't know if we want to change hbMayCertifiedEb accordingly.

I see it now, we need it to have a proof that the node that gave us the header actually has the EB. Of course, this means that we must also check whether there is a cert in the body if the header claims it.

Contributor

I'd propose to follow what I implemented but removed because it wasn't used...

b7f6f6c

I introduced a notion of a BodyType which is imo better naming for what its purpose is (Header is associated with Bodies of many types) into the PraosHeader.

Combined with LeiosState that's part of PraosState

data LeiosState = LeiosState
  { leiosStatePreviousAnnouncement :: Maybe EbHash
  , leiosStateCanCertify :: Bool
  }

One gets the answer to what's being certified, i.e. just look at leiosStatePreviousAnnouncement and check whether the Header is associated with a LeiosCertificate BodyType.

Member Author

It seems there's a contradiction between CIP and CDDL:

Before deciding on this we should agree which spec we should follow.

Contributor

After a bit of git archeology, I see that Will wrote ranking-blocks.md as part of the process of writing the CIP. So I think that's a "showing your work" file. Perhaps we could just add a comment to its certified_eb field saying the CIP now uses a bit for this.

dnadales added 6 commits May 14, 2026 19:49
Parameterise runThreadNet over NodeJoinPlan and add a new property that
starts node 3 at a random slot while nodes 0–2 run from slot 0.
This demonstrates the crash in resolveLeiosBlock when a late-joining
node encounters a CertRB referencing an EB it never received.

Add a Maybe LeiosPoint field to the Praos HeaderBody that records which
EB a certifying ranking block certifies.
The CBOR codec uses length-switching (10/11/12) to stay backwards
compatible with non-Leios headers.
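
The length-switching idea can be sketched with a toy term type instead of a real CBOR library. This is not the PR's codec. The mapping assumed here (length 10 = no Leios fields, 11 = announcement only, 12 = both slots, with the announcement slot nullable) is a guess made for illustration, and `Term`, `LeiosFields`, `encodeHeaderBody` and `decodeHeaderBody` are hypothetical names.

```haskell
import Control.Monad (guard)

-- Toy stand-in for a CBOR term; a real codec would use a CBOR library.
data Term = TInt Int | TNull | TList [Term]
  deriving (Eq, Show)

-- Hypothetical, simplified payloads (plain Ints instead of the real types).
newtype EbAnnouncement = EbAnnouncement Int deriving (Eq, Show)
newtype LeiosPoint = LeiosPoint Int deriving (Eq, Show)

data LeiosFields = LeiosFields
  { mayEbAnnouncement :: Maybe EbAnnouncement
  , mayCertifiedEb :: Maybe LeiosPoint
  }
  deriving (Eq, Show)

-- Number of pre-Leios header body fields (the "10" in 10/11/12).
baseLen :: Int
baseLen = 10

-- Assumes the caller passes exactly baseLen base fields.
encodeHeaderBody :: [Int] -> LeiosFields -> Term
encodeHeaderBody base (LeiosFields ann cert) =
  TList (map TInt base ++ extras)
 where
  annTerm = maybe TNull (\(EbAnnouncement a) -> TInt a) ann
  extras = case (ann, cert) of
    (Nothing, Nothing) -> []                       -- length 10: non-Leios header
    (Just _, Nothing) -> [annTerm]                 -- length 11: announcement only
    (_, Just (LeiosPoint p)) -> [annTerm, TInt p]  -- length 12: both slots

-- The decoder picks its branch from the list length alone.
decodeHeaderBody :: Term -> Maybe ([Int], LeiosFields)
decodeHeaderBody (TList ts) = do
  let (baseTs, extras) = splitAt baseLen ts
  guard (length baseTs == baseLen)
  base <- traverse fromInt baseTs
  fields <- case extras of
    [] -> Just (LeiosFields Nothing Nothing)
    [TInt a] -> Just (LeiosFields (Just (EbAnnouncement a)) Nothing)
    [annT, TInt p] -> do
      ann <- fromNullable annT
      Just (LeiosFields ann (Just (LeiosPoint p)))
    _ -> Nothing
  Just (base, fields)
 where
  fromInt (TInt n) = Just n
  fromInt _ = Nothing
  fromNullable TNull = Just Nothing
  fromNullable (TInt a) = Just (Just (EbAnnouncement a))
  fromNullable _ = Nothing
decodeHeaderBody _ = Nothing
```

Under this assumed layout an old decoder that only knows length 10 still accepts non-Leios headers unchanged, which is the backwards-compatibility property the commit message describes.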

Forging passes the certificate's EB point for CertRBs and Nothing for
regular transaction blocks.
The field propagates through HeaderView, mkHeader, and all construction
sites (generators, examples).

When a CertRB arrives whose EB closure is not in the LeiosDB, record it
in cdbPendingEBs and skip chain selection.
Subsequent chain selections filter pending hashes from both
lookupBlockInfo (predecessor tracing) and succsOf (successor
enumeration), making the CertRB invisible until its EB closure arrives.
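
A minimal sketch of the filtering described above, assuming the pending set is available as a plain `Set` of header hashes. `HeaderHash`, `filterSuccsOf` and `filterLookupBlockInfo` are hypothetical names, and the real ChainSel functions have richer types.

```haskell
import Data.Set (Set)
import qualified Data.Set as Set

-- Hypothetical stand-in for the real header hash type.
type HeaderHash = String

-- Successor enumeration: drop successors that are pending CertRBs.
filterSuccsOf ::
  Set HeaderHash -> (HeaderHash -> Set HeaderHash) -> HeaderHash -> Set HeaderHash
filterSuccsOf pending succsOf h = succsOf h `Set.difference` pending

-- Predecessor tracing: a pending CertRB looks like an unknown block.
filterLookupBlockInfo ::
  Set HeaderHash -> (HeaderHash -> Maybe info) -> HeaderHash -> Maybe info
filterLookupBlockInfo pending lookupBlockInfo h
  | h `Set.member` pending = Nothing
  | otherwise = lookupBlockInfo h
```

Wrapping both lookups with the same pending set is what keeps a pending CertRB from being reached either forwards or backwards.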

Adds certifiedEbFromHeader to ResolveLeiosBlock so ChainSel can inspect
the header without reaching into block-type-specific layers.

Assert that all nodes converge to the same chain. Fails as expected:
the late node's chain is shorter (1 block vs 10) because CertRBs with
missing EB closures are permanently excluded from ChainSel.

Add ChainSelReprocessBlock message type that re-runs chain selection
for a single CertRB whose EB closure was previously missing. A new
background thread (ebCompletionRunner) subscribes to LeiosDB
notifications and enqueues ChainSelReprocessBlock when an EB becomes
complete.
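
The decision the runner makes for each notification could look roughly like the pure step below. It assumes the pending set is a map from certified EB point to CertRB header hash. The message constructor's payload and the function name `onEbComplete` are hypothetical, and the real runner is a background thread rather than a pure function.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified identifiers.
newtype LeiosPoint = LeiosPoint Int deriving (Eq, Ord, Show)
type HeaderHash = String

-- Hypothetical message shape; the PR's constructor may carry other fields.
newtype ChainSelMessage = ChainSelReprocessBlock HeaderHash
  deriving (Eq, Show)

-- One step of the runner: an EB completed, so enqueue a reprocess message
-- if (and only if) some CertRB is pending on that EB.
onEbComplete :: Map LeiosPoint HeaderHash -> LeiosPoint -> Maybe ChainSelMessage
onEbComplete pending point = ChainSelReprocessBlock <$> Map.lookup point pending
```

Note that a notification for a point with no pending entry produces nothing, so a notification that arrives before the entry is recorded is lost in this model.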

The chain-consistency assertion still fails: the re-trigger fires
correctly but most EB closures never complete on the late node because
the fetch mechanism doesn't deliver historical EB bodies and txs.

When ChainSel filters a CertRB because its EB closure is missing,
drive a fetch through LeiosFetch using each peer's ChainSync
candidate fragment as a fallback peer source.

* Expose cdbPendingEBs via ChainDB.getPendingCertRBs.
* pendingEbReconciler in NodeKernel mirrors the pending set into
  Leios missingEbBodies with size 0; it never overwrites
  offer-supplied sizes and only removes its own size-0 entries.
* leiosFetchLogic walks per-peer ChainSync candidate fragments,
  extracts certified EB hashes via certifiedEbFromHeader, and passes
  a per-peer Set EbHash to leiosFetchLogicIteration.
* choosePeerEb and choosePeerTx fall back to candidate-derived peers
  when no peer has offered the EB body / tx-closure. A peer whose
  candidate contains the CertRB must have validated the closure
  locally, so it must also hold both the body and the txs.
* Relax the response-size check in msgLeiosBlock when the expected
  size is 0; the hash check remains authoritative.
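
The size-0 discipline from the reconciler bullet above can be sketched on a plain map from EB hash to expected size. The function names match those used later in this PR, but the signatures and the `Map EbHash Int` representation are assumptions made for illustration.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-in for the real EB hash type.
type EbHash = String

-- Mirror a newly pending EB with size 0, unless an entry already exists
-- (for example one whose size was supplied by a peer's offer).
applyPendingAdded :: EbHash -> Map EbHash Int -> Map EbHash Int
applyPendingAdded eb = Map.insertWith (\_new old -> old) eb 0

-- Remove only entries the reconciler itself could have added (size 0);
-- entries with an offer-supplied size are left untouched.
applyPendingRemoved :: EbHash -> Map EbHash Int -> Map EbHash Int
applyPendingRemoved = Map.update keepOffered
 where
  keepOffered size = if size == 0 then Nothing else Just size
```

Using size 0 as the "added by the reconciler" marker is also why the response-size check has to be relaxed when the expected size is 0.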
dnadales added 14 commits May 14, 2026 19:57
The previous range allowed the late node to join as late as numSlots-1,
leaving insufficient catch-up time for the chain-consistency assertion
to hold for reasons unrelated to the late-join logic under test.

Closes a TOCTOU window where a CertRB could remain stranded in
cdbPendingEBs after its EB closure arrived: if the closure completed
between ChainSel's closure-query and its cdbPendingEBs insert, the
ebCompletionRunner notification fired against an empty pending set and
was dropped. The sweep re-enqueues any pending CertRB whose closure is
now in LeiosDb, covering this race and other missed-notification
scenarios (subscription gap at startup, etc.).
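
A minimal sketch of such a sweep, assuming the pending set is a map from certified EB point to CertRB header hash and that closure availability can be queried with a predicate. `sweepPending` and `hasClosure` are hypothetical names; the real sweep runs inside the fetch loop.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical, simplified identifiers.
newtype LeiosPoint = LeiosPoint Int deriving (Eq, Ord, Show)
type HeaderHash = String

-- Return every pending CertRB whose EB closure is available by now.
-- The caller would re-enqueue each of them for chain selection.
sweepPending ::
  (LeiosPoint -> Bool) -> Map LeiosPoint HeaderHash -> [(LeiosPoint, HeaderHash)]
sweepPending hasClosure pending =
  Map.toList (Map.filterWithKey (\point _ -> hasClosure point) pending)
```

Because the sweep re-derives its work from the current state rather than from notifications, it does not matter how a notification was missed.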

Adds addReprocessBlock to the ChainDB API record so the fetch loop can
trigger ChainSel reprocessing without holding ChainDbEnv.

If the EB closure completes between ChainSel's first 'is the closure
present?' query and its cdbPendingEBs insert, ebCompletionRunner fires
against an empty pending set and drops the notification. The previous
commit's leiosFetchLogic sweep covers this race on its iteration
cadence; this inline recheck closes the immediate window so the
CertRB is processed in-place rather than waiting for the next tick.

Cross-references between the two sites: ChainSel.hs points at the
sweep as the load-bearing fix, NodeKernel.hs points at the recheck
as the local optimization for the immediate race.

Drop unused Data.Set and EbHash imports, and rename the shadowing
'handle' binding to 'csHandle' in the candidateCertEbs computation.

Async cancellation on shutdown was not releasing the connection. The
in-memory backend's close is a no-op so tests were unaffected, but
the SQLite backend leaked the database handle.

Move the size-0 / offer-coexistence Map discipline out of the inline
reconciler and into 'applyPendingAdded' / 'applyPendingRemoved' next to
'LeiosOutstanding'. Add unit tests for the two invariants: pending
entries round-trip cleanly, and offer-supplied entries survive a
pending add/remove.

The HeaderBody generators were hardcoding hbMayCertifiedEb to Nothing,
so the len-12 CBOR branch (and the (Nothing, Just) / (Just, Just)
combinations of the two optional fields) was never exercised by
roundtrip property tests. Add an Arbitrary LeiosPoint and let both
HeaderBody generators sample the optionals.

Previously the new ChainSelReprocessBlock message reused the LoE
event, conflating two unrelated reprocessing mechanisms on the
operator side. Add AddedReprocessBlockToQueue / PoppedReprocessBlockFromQueue
constructors, both carrying the CertRB hash so the events are
correlatable across logs, and thread the tracer through
addReprocessBlock.

Insertion into cdbPendingEBs is keyed by LeiosPoint, so removal should
be too. Carrying the point on the reprocess message replaces an
O(n) Map.filter (/= hash) with an O(log n) Map.delete and removes the
implicit value-equals-header-hash invariant. The header hash stays on
the message because the consumer still needs it to look up the header
in the VolatileDB.

"Step 2" and "step 3" were private references to the late-join
implementation plan. Replace them with cross-references to the
mechanisms themselves (ChainSel filter, ebCompletionRunner).

The project is moving away from RecordWildCards. Project to explicit
field accessors at the two sites this branch introduced new wildcard
uses (ebCompletionRunner and the ChainSelReprocessBlock equation of
chainSelSync). The pre-existing wildcards elsewhere in the file are
left intact.

Pull the per-peer scan into a named 'certifiedEbsFromCandidate'
helper, and add inline comments explaining the mapKeysMonotonic
safety justification and the singleton-list-generator pattern. No
behaviour change.

The voting-key hack in protocolInfoCardano did `credssShelleyBased !! 0`,
which crashes a relay node (no leader credentials) with "Prelude.!!: index
too large" before diffusion starts. Use `listToMaybe` so a node without
Shelley-based credentials gets `topLevelConfigVotingKey = Nothing` instead.
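
The difference between the two lookups can be shown in isolation. `Credentials`, `votingKey` and `topLevelVotingKey` below are hypothetical stand-ins, not the types used in protocolInfoCardano.

```haskell
import Data.Maybe (listToMaybe)

-- Hypothetical, simplified credentials record.
newtype Credentials = Credentials {votingKey :: String}

-- Total replacement for `creds !! 0`: an empty list yields Nothing
-- instead of throwing an index error at runtime.
topLevelVotingKey :: [Credentials] -> Maybe String
topLevelVotingKey creds = votingKey <$> listToMaybe creds
```

A relay node corresponds to the empty-list case, which now maps to `Nothing` rather than crashing.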
@dnadales dnadales moved this to 🏗 In progress in Consensus Team Backlog May 18, 2026
@dnadales dnadales self-assigned this May 18, 2026

@dnadales
Member Author

Action points after discussing this PR with @nfrisby (to be tackled in no particular order):

  1. Keep the Leios fetch logic as it was. Assume that if a peer offers us a CertRB then we can consider this an implicit offer for the certified EB (both the body and the closure).
  2. Try to get rid of cdbPendingEBs. We could keep an announcementsMap :: hash RB -> hash EB as part of the VolDB functionality. In ChainSel we still require the set of downloaded EB closures, which the Leios fetch logic should take care of tracking in an in-memory data structure. This results in a better separation of concerns. We don't want to change the on-disk format, to avoid interfering with Mithril, so the announcementsMap has to live in memory and be reconstructed from scratch on node startup.
    • NOTE: The other option is not to filter the candidates based on EB closure availability and to let block application fail dynamically (which might require changing the types involved in block application). On this path, we need to check EB closure availability before setting the tentative header (otherwise the assumption that an (RB) header offer implies EB closure availability breaks).
      2.1 We need to re-trigger chain selection when an EB closure is downloaded. We should do this on the announcing RB, because there's a 1-1 mapping between announcing RB and EB. Thus for a given EB we'd only have to call chain selection once, and chain selection would take care of sorting out the candidate chains that include the certifying RBs.
  3. Change hbMayCertifiedEb to a single bit to comply with the CIP, and consider using the PraosState (ChainDepState) to get the certified EB.
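
One possible shape for the in-memory announcementsMap from action point 2, sketched under the assumption that startup can enumerate (RB hash, announced EB hash) pairs from the stored headers. All names and types here are hypothetical; nothing below exists in the PR.

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

-- Hypothetical stand-ins for the real hash types.
type RbHash = String
type EbHash = String

-- Rebuild the map on startup from headers already on disk; RBs that
-- announce no EB contribute nothing.
rebuildAnnouncements :: [(RbHash, Maybe EbHash)] -> Map RbHash EbHash
rebuildAnnouncements headers =
  Map.fromList [(rb, eb) | (rb, Just eb) <- headers]

-- When an EB closure finishes downloading, find the announcing RB on
-- which chain selection should be re-triggered (relies on the 1-1 mapping).
announcingRb :: Map RbHash EbHash -> EbHash -> Maybe RbHash
announcingRb announcements eb =
  case Map.keys (Map.filter (== eb) announcements) of
    [rb] -> Just rb
    _ -> Nothing
```

The reverse lookup here is a linear scan. If it turned out to be hot, an inverse map could be maintained alongside, at the cost of keeping the two in sync.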
