[CRUTCH -- REVIEW CAREFULLY] Prover simulation logs (stderr capture)#1933
[CRUTCH -- REVIEW CAREFULLY] Prover simulation logs (stderr capture)#1933evgenyzdanovich wants to merge 2 commits into
Conversation
tokio::task::spawn_blocking runs the closure on a blocking-pool thread,
not the async task's thread. tracing's span dispatch is thread-local,
so the prove{task=...} span we set up via .instrument(span) is not
active inside the closure. Every event emitted by the strategy
(zkaleido logs, SP1 SDK logs, any future guest-stderr tee) loses the
task tag.
Capture the active span before spawn_blocking and re-enter it on the
blocking thread via _guard = parent_span.enter().
…o tracing under the task span
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf5c0a8bc5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Capture the guest output SP1 writes to host stderr during | ||
| // simulation/proving and re-emit it under the prove span. See | ||
| // `stderr_capture` for why fd-level capture is the only seam. | ||
| let (result, captured) = stderr_capture::capture(|| strategy.prove(&input, ctx)); |
There was a problem hiding this comment.
Limit stderr capture to local simulation
In production alpen-client builds both EE provers with .remote(...), and RemoteStrategy::prove submits then polls until the remote proof completes. Wrapping the entire strategy.prove call here means each remote proof holds the process-global CAPTURE_LOCK and keeps fd 2 redirected for the whole polling lifetime, so concurrent chunk proofs submitted as a fan-out cannot even enter strategy.prove until the previous remote proof finishes, and any real stderr from other threads is delayed/mislabelled for that duration. Please avoid using this fd capture around remote polling (or restrict it to the native/local simulation window that actually emits SP1 guest stderr).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
yeah, that's a "known" caveat described in the docstring.
A more proper solution would reduce the capture window and instead capture it "more surgically" on the zkaleido side (only the execute, not a full proving path)
There was a problem hiding this comment.
Exactly, why not add the capture in zkaleido-sp1-host's execute function in https://github.com/alpenlabs/zkaleido/blob/1d75e67f7d9bf389fcbaca40acbc1af9d0066401/adapters/sp1/host/src/prover.rs?plain=1#L73-L77 ?
Capture only around SP1's execute, and tee only when execution fails or ensure_clean_exit rejects a non-zero guest exit
There was a problem hiding this comment.
it involves bumping zkaleido to beta.4+, and that in turn requires broader changes to alpen.
on top of that, Im not even sure myself we end up having it long term (let's say, at the day before mainnet), that's why I'm leaning towards "an immediate hack at a higher level" that gets delivered now and removed sooner (right after we stabilize tn3). those are rationales, hence this hacky pr at alpen
There was a problem hiding this comment.
What if we move stderr_capture::capture around submit_proof inside RemoteStrategy, then restore stderr before poll_until_done? That should still catch SP1 request() simulation failures without capturing the polling path.
|
Commit: f641ceb SP1 Execution Results
|
Codecov Report❌ Patch coverage is
@@ Coverage Diff @@
## main #1933 +/- ##
==========================================
+ Coverage 79.68% 84.27% +4.58%
==========================================
Files 673 634 -39
Lines 74661 75785 +1124
==========================================
+ Hits 59495 63867 +4372
+ Misses 15166 11918 -3248
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 331 files with indirect coverage changes 🚀 New features to boost your workflow:
|
| return (f(), Vec::new()); | ||
| }; | ||
|
|
||
| let result = f(); |
There was a problem hiding this comment.
Is it possible for f() to panic?
Description
Currently, if (built-in) simulation fails, we:
(1) do not send the proof request (which is correct IMO -- the request anyway would be unfulfillable)
(2) do not have any idea whatsoever why simulation fails (and, transitively, why precisely we haven't sent the proof request): for example, a panic on the assert of some proof statement within guest simply does not appear in our logs
The current PR is an attempt to alleviate (2) in some form.
It's intended to be very hacky temporary solution to ease debugging and investigation (and I'd advocate to remove it once we stabilize stuff).
Type of Change
Notes to Reviewers
PLEASE ASSESS CRITICALLY THE LOGIC INSIDE THE PR AND PROS/CONS AS DESCRIBED IN THE DOCSTRINGS OF
STDERR_CAPTUREI am absolutely fine if reviewers decide its not worth it in such a form and we don't merge.
Is this PR addressing any specification, design doc or external reference document?
If yes, please add relevant links:
Checklist
Related Issues