[CRUTCH -- REVIEW CAREFULLY] Prover simulation logs (stderr capture)#1933

Open
evgenyzdanovich wants to merge 2 commits into main from sim-logs

Contributor

@evgenyzdanovich evgenyzdanovich commented Jun 3, 2026

Description

Currently, if the (built-in) simulation fails, we:
(1) do not send the proof request (which is correct IMO: the request would be unfulfillable anyway)
(2) have no idea why the simulation failed (and, transitively, why exactly we did not send the proof request): for example, a panic on an assert of some proof statement within the guest simply does not appear in our logs

This PR is an attempt to alleviate (2) in some form.
It's intended to be a very hacky temporary solution to ease debugging and investigation (and I'd advocate removing it once we stabilize things).

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature/Enhancement (non-breaking change which adds functionality or enhances an existing one)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactor
  • New or updated tests
  • Dependency Update

Notes to Reviewers

PLEASE ASSESS CRITICALLY THE LOGIC INSIDE THE PR AND PROS/CONS AS DESCRIBED IN THE DOCSTRINGS OF STDERR_CAPTURE
I am absolutely fine if reviewers decide it's not worth it in this form and we don't merge.

Is this PR addressing any specification, design doc or external reference document?

  • Yes
  • No

If yes, please add relevant links:

Checklist

  • I have performed a self-review of my code.
  • I have commented my code where necessary.
  • I have updated the documentation if needed.
  • My changes do not introduce new warnings.
  • I have added (where necessary) tests that prove my changes are effective or that my feature works.
  • New and existing tests pass with my changes.
  • I have disclosed my use of AI in the body of this PR.

Related Issues

tokio::task::spawn_blocking runs the closure on a blocking-pool thread,
not the async task's thread. tracing's span dispatch is thread-local,
so the prove{task=...} span we set up via .instrument(span) is not
active inside the closure. Every event emitted by the strategy
(zkaleido logs, SP1 SDK logs, any future guest-stderr tee) loses the
task tag.

Capture the active span before spawn_blocking and re-enter it on the
blocking thread via _guard = parent_span.enter().
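A minimal, std-only sketch of the mechanism described above. It does not use tokio or tracing: a plain `thread_local!` stands in for the per-thread "current span", and `std::thread::spawn` stands in for `spawn_blocking`. The names `CURRENT_TASK`, `enter`, `current_tag`, and `tags_seen_on_worker` are hypothetical and exist only for illustration. In the actual change the captured value would be the parent span, re-entered on the blocking thread via `parent_span.enter()`.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Stand-in for the per-thread "current span" (assumption: modelled as a tag).
    static CURRENT_TASK: RefCell<Option<String>> = RefCell::new(None);
}

/// Guard that restores the previous tag when dropped, similar to a span guard.
struct Entered {
    previous: Option<String>,
}

impl Drop for Entered {
    fn drop(&mut self) {
        let previous = self.previous.take();
        CURRENT_TASK.with(|c| *c.borrow_mut() = previous);
    }
}

/// Makes `tag` the current tag on this thread until the guard is dropped.
fn enter(tag: &str) -> Entered {
    let previous = CURRENT_TASK.with(|c| c.borrow_mut().replace(tag.to_string()));
    Entered { previous }
}

/// Returns the tag that is current on the calling thread, if any.
fn current_tag() -> Option<String> {
    CURRENT_TASK.with(|c| c.borrow().clone())
}

/// Returns (tag seen on a worker without propagation, tag seen with propagation).
fn tags_seen_on_worker(tag: &str) -> (Option<String>, Option<String>) {
    let _outer = enter(tag);

    // Without propagation: the new thread starts with an empty thread-local.
    let lost = thread::spawn(current_tag).join().unwrap();

    // With propagation: capture before spawning, re-enter on the worker.
    let captured = current_tag();
    let kept = thread::spawn(move || {
        let _guard = captured.as_deref().map(enter);
        current_tag()
    })
    .join()
    .unwrap();

    (lost, kept)
}
```

Calling `tags_seen_on_worker("prove-1")` yields `None` for the first worker and `Some("prove-1")` for the second, which mirrors why events emitted inside the blocking closure lose the task tag unless the span is re-entered there.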

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf5c0a8bc5

// Capture the guest output SP1 writes to host stderr during
// simulation/proving and re-emit it under the prove span. See
// `stderr_capture` for why fd-level capture is the only seam.
let (result, captured) = stderr_capture::capture(|| strategy.prove(&input, ctx));

P1: Limit stderr capture to local simulation

In production alpen-client builds both EE provers with .remote(...), and RemoteStrategy::prove submits then polls until the remote proof completes. Wrapping the entire strategy.prove call here means each remote proof holds the process-global CAPTURE_LOCK and keeps fd 2 redirected for the whole polling lifetime, so concurrent chunk proofs submitted as a fan-out cannot even enter strategy.prove until the previous remote proof finishes, and any real stderr from other threads is delayed/mislabelled for that duration. Please avoid using this fd capture around remote polling (or restrict it to the native/local simulation window that actually emits SP1 guest stderr).

Contributor Author

Yeah, that's a "known" caveat described in the docstring.

A more proper solution would shrink the capture window and capture "more surgically" on the zkaleido side (only the execute step, not the full proving path).

Member

Exactly, why not add the capture in zkaleido-sp1-host's execute function in https://github.com/alpenlabs/zkaleido/blob/1d75e67f7d9bf389fcbaca40acbc1af9d0066401/adapters/sp1/host/src/prover.rs?plain=1#L73-L77 ?

Capture only around SP1's execute, and tee only when execution fails or ensure_clean_exit rejects a non-zero guest exit
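A rough sketch of the "tee only on failure" idea, under stated assumptions: the captured bytes are already available as a byte slice, the run outcome is a `Result`, and `emit` is whatever sink the host uses (for example a logging call). The helper name `tee_on_failure` is hypothetical and is not part of zkaleido or SP1.

```rust
/// Re-emits captured guest stderr line by line, but only when the run failed.
/// Returns the number of lines emitted.
fn tee_on_failure<T, E>(
    result: &Result<T, E>,
    captured: &[u8],
    mut emit: impl FnMut(&str),
) -> usize {
    // A successful run stays quiet: the captured output is dropped.
    if result.is_ok() {
        return 0;
    }

    // Guest output is not guaranteed to be valid UTF-8, so decode lossily.
    let text = String::from_utf8_lossy(captured);
    let mut emitted = 0;
    for line in text.lines().filter(|l| !l.trim().is_empty()) {
        emit(line);
        emitted += 1;
    }
    emitted
}
```

With this shape, a clean execution produces no extra log lines, while a failed one surfaces the guest's panic message next to the error that rejected it.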

Contributor Author

@evgenyzdanovich evgenyzdanovich Jun 3, 2026

It involves bumping zkaleido to beta.4+, and that in turn requires broader changes to alpen.

On top of that, I'm not even sure we will want to keep it long term (say, by the day before mainnet). That's why I'm leaning towards an immediate hack at a higher level that gets delivered now and removed soon (right after we stabilize tn3). That's the rationale for this hacky PR in alpen.

Contributor

What if we move stderr_capture::capture around submit_proof inside RemoteStrategy, then restore stderr before poll_until_done? That should still catch SP1 request() simulation failures without capturing the polling path.
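A small sketch of the difference between the wide and the narrow capture window. This models only the process-global lock, not the fd 2 redirection itself. `submit_proof` and `poll_until_done` are stubs named after the steps mentioned in this thread, and their signatures here are assumptions. `capture`, `capture_active`, `prove_wide`, and `prove_narrow` are hypothetical helpers.

```rust
use std::sync::{Mutex, TryLockError};

// Stand-in for the process-global lock that serializes captures.
static CAPTURE_LOCK: Mutex<()> = Mutex::new(());

/// Runs `f` while holding the global capture lock.
fn capture<T>(f: impl FnOnce() -> T) -> T {
    let _held = CAPTURE_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    f()
}

/// Reports whether some capture currently holds the lock.
fn capture_active() -> bool {
    matches!(CAPTURE_LOCK.try_lock(), Err(TryLockError::WouldBlock))
}

// Stub: pretend submission runs the local simulation and returns a request id.
fn submit_proof() -> u64 {
    7
}

// Stub: pretend to poll; returns whether a capture was active while polling.
fn poll_until_done(_request_id: u64) -> bool {
    capture_active()
}

/// Wide window: the lock is held across submission and polling.
fn prove_wide() -> bool {
    capture(|| {
        let id = submit_proof();
        poll_until_done(id)
    })
}

/// Narrow window: the lock is released before polling starts.
fn prove_narrow() -> bool {
    let id = capture(submit_proof);
    poll_until_done(id)
}
```

In this model `prove_wide()` reports that the capture is still active during polling, while `prove_narrow()` does not, so concurrent submissions would only contend for the lock during the short submit step.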

Contributor

github-actions Bot commented Jun 3, 2026

Commit: f641ceb

SP1 Execution Results

| program | cycles | gas |
| --- | --- | --- |
| EVM EE Chunk | 824,732 | 969,394 |
| EVM EE Account | 404,056 | 498,593 |
| Checkpoint | 2,602,415 | 3,010,370 |


codecov Bot commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 84.76190% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.27%. Comparing base (097b2d8) to head (bf5c0a8).
⚠️ Report is 55 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/prover-core/src/stderr_capture.rs | 83.50% | 16 Missing ⚠️ |
@@            Coverage Diff             @@
##             main    #1933      +/-   ##
==========================================
+ Coverage   79.68%   84.27%   +4.58%     
==========================================
  Files         673      634      -39     
  Lines       74661    75785    +1124     
==========================================
+ Hits        59495    63867    +4372     
+ Misses      15166    11918    -3248     
| Flag | Coverage Δ |
| --- | --- |
| functional | 65.80% <80.72%> (+5.65%) ⬆️ |
| unit | 69.70% <73.33%> (+4.13%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| crates/prover-core/src/prover.rs | 62.81% <100.00%> (+0.74%) ⬆️ |
| crates/prover-core/src/stderr_capture.rs | 83.50% <83.50%> (ø) |

... and 331 files with indirect coverage changes

return (f(), Vec::new());
};

let result = f();
Contributor

@irnb irnb Jun 3, 2026

Is it possible for f() to panic?
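One common way to make such a wrapper robust against a panicking `f()` is a drop guard, sketched below under assumptions. The `REDIRECTED` flag stands in for "fd 2 is currently redirected"; `RestoreOnDrop`, `capture_with_guard`, and `redirected_after_panic` are hypothetical names and not the PR's code. The sketch relies on the default `panic = "unwind"` strategy; with `panic = "abort"` no destructors run.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for the "stderr is redirected" state.
static REDIRECTED: AtomicBool = AtomicBool::new(false);

/// Undoes the redirection when dropped, including during unwinding.
struct RestoreOnDrop;

impl Drop for RestoreOnDrop {
    fn drop(&mut self) {
        REDIRECTED.store(false, Ordering::SeqCst);
    }
}

/// Runs `f` with the redirection in place; the guard restores it on any exit path.
fn capture_with_guard<T>(f: impl FnOnce() -> T) -> T {
    REDIRECTED.store(true, Ordering::SeqCst);
    let _restore = RestoreOnDrop;
    f()
}

/// Runs a panicking closure under the guard and reports the state afterwards.
fn redirected_after_panic() -> bool {
    let outcome = panic::catch_unwind(AssertUnwindSafe(|| {
        capture_with_guard(|| -> u32 { panic!("guest assert failed") })
    }));
    assert!(outcome.is_err());
    REDIRECTED.load(Ordering::SeqCst)
}
```

Because the restore step lives in `Drop`, it runs whether `f()` returns normally or unwinds, so `redirected_after_panic()` observes the flag already reset.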
