Conversation
d7a71fa to
c09e714
Compare
| [features] | ||
| default = [] | ||
| cuda = ["zisk-sdk/gpu"] | ||
| cuda = [] |
There was a problem hiding this comment.
In v0.17.0 ZisK has cuda enabled by default (by detecting if nvcc is available), and a cpu-only feature flag is available to force cuda disabled. If we try to initialize gpu prover without nvcc available, it'd show a runtime error, so we add a build.rs to turn that into a compile time error.
| fn validate_format(&self) -> Result<(), Error> { | ||
| if self.0.len() != PROOF_WORDS { | ||
| return Err(Error::InvalidProofFormat(format!( | ||
| "proof has {} u64 words, expected to be {PROOF_WORDS}", | ||
| self.0.len(), | ||
| ))); | ||
| } | ||
| if self.0[0] != PROOF_PREFIX_WORDS as u64 { | ||
| return Err(Error::InvalidProofFormat(format!( | ||
| "proof n_publics is {}, expected to be {PROOF_PREFIX_WORDS}", | ||
| self.0[0], | ||
| ))); | ||
| } | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Always validates the constant proof size and metadata, otherwise the verify function in proofman-verifier might be tricked to panic.
There was a problem hiding this comment.
Uhm, panicking on verifying untrusted bytes sounds more like a bug to me. Prob we should ask them to avoid this eventually.
| libclang-dev \ | ||
| clang \ |
There was a problem hiding this comment.
Added for ZisK's zkvm-interface crate compilation.
| clang \ | ||
| # Used to kill the ASM services when prover panics | ||
| psmisc \ | ||
| procps \ |
There was a problem hiding this comment.
We use pkill for defensive ASM service cleanup now (previously using fuser)
| let result = panic::catch_unwind(AssertUnwindSafe(|| { | ||
| self.prover | ||
| .prove(&self.program, stdin) | ||
| .wrap_proof(ProofKind::VadcopFinalMinimal) |
There was a problem hiding this comment.
Wrap into the minimal stark proof for smaller proof size (~256 KiB instead of ~329 KiB), this is also the default proof kind.
jsign
left a comment
There was a problem hiding this comment.
LGTM, left just some comments/questions.
| ))); | ||
| fn parse_proof(bytes: &[u8]) -> Result<ZiskProof, Error> { | ||
| #[derive(Default, Serialize, Deserialize)] | ||
| pub struct Proof<'a> { |
There was a problem hiding this comment.
Are all these structs not exported by Zisk?
There was a problem hiding this comment.
Yes, there is structure for the proof, but the proof is eventually flatten into the format added in this PR, and in the verifier they don't check the size, so a malformed proof will trigger index out of bounds.
Will file an issue in upstream as well.
| fn cycle_scope_start(name: &str) { | ||
| let tag = SCOPE_REGISTRY.get_or_assign_tag(name); | ||
| dispatch_profile!(start, tag); | ||
| fn cycle_scope_start(_name: &str) { |
There was a problem hiding this comment.
@jsign This breaks the zisk profilng flow in zkevm-benchmark-workload, we probably need a custom build for the guests to do profiling, or we adapt to the Function-Level Profiling that works for prover and enabled by default like:
Uhm, function level profiling is a bit annoying since doesn't allow to undesrtand more cleanly where the cost is comming from in the app level logic.
By "custom build" you mean compiling the guest program or running it with Zisk with some feature enabled, or you mean actually re-implementing all the logic to have custom scopes as they had before?
Got a bit lost since you also mention they now support string scope names instead of constants, so "scope profiling" is still supported in some official way?
There was a problem hiding this comment.
By "custom build" you mean compiling the guest program or running it with Zisk with some feature enabled, or you mean actually re-implementing all the logic to have custom scopes as they had before?
custom build I mean build with ziskos::ziskos_syscall!(ziskos::SYSCALL_PROFILE_ID, ...) uncommented, it's commented out becuase with these syscalls, the ELF can't be proved (but can be used with ziskemu to do the profiling we used to do)
Got a bit lost since you also mention they now support string scope names instead of constants, so "scope profiling" is still supported in some official way?
Scope profiling is supported by ziskemu, but the ELF can't be proved anymore.
There was a problem hiding this comment.
Oh, I see. so the ziskos crate that teh guest program needs to add these scope markers needs a fork that uncomment some code. And we can use normal ere code (i.e non-modified) to emulate the elf run?
There was a problem hiding this comment.
If that's the case, ere and workload isn't strictly affected but more like the guest program ziskos dependency used?
This assuming that we put the markers manually in guest with ziskos.
If we rely on ere for cycle-scoping, I agree that prob means custom ere which maybe is the most elegant thing now that strings names are supported?
There was a problem hiding this comment.
This assuming that we put the markers manually in guest with
ziskos.
I thought this might be the solution, but I just had another idea: Add a feature flag in ere-platform-zisk to enabling these profiling, so we can keep the same guest code and no forking, just need to enable a feature flag to generate the ELF for profiling only.
Will add a commit soon
There was a problem hiding this comment.
Added in 1810ac4, so I think in ere-guests we can release another ELF with the feature on for the profiling (need to change the ere-compiler to accept feature flags first, will add a follow up PR soon)
| /// Clear the program cache so the next `setup` spawns fresh ASM services, then SIGTERM any orphan | ||
| /// ASM child still bound to our `ZISK_{pid}_*` shmem prefix. |
There was a problem hiding this comment.
Tangent question: do you feel the cluster setup is more elegant in how setup/teardown works?
There was a problem hiding this comment.
I checked the teardown logic in zisk and it looks like there are still some chance the ASM services get dangling, and the start function doesn't cleanup them, so I would prefer to keep the pkill for now
There was a problem hiding this comment.
Ah but I was wondering in cluster mode -- that setup also needs these kind of services teardown stuff? (mainly asking for curiosity)
There was a problem hiding this comment.
Ah I see, just tested and it seems if the worker ASM services die (guest program panics), it'll remain in bad status and coordinator doesn't remove the worker from the pool, so the whole cluster is also paralyzed...
Will need to figure out a way to check if the worker is still healthy so the supervisor can restart it if needed (e.g. docker-compose health check by grepping the asm services).
| fn validate_format(&self) -> Result<(), Error> { | ||
| if self.0.len() != PROOF_WORDS { | ||
| return Err(Error::InvalidProofFormat(format!( | ||
| "proof has {} u64 words, expected to be {PROOF_WORDS}", | ||
| self.0.len(), | ||
| ))); | ||
| } | ||
| if self.0[0] != PROOF_PREFIX_WORDS as u64 { | ||
| return Err(Error::InvalidProofFormat(format!( | ||
| "proof n_publics is {}, expected to be {PROOF_PREFIX_WORDS}", | ||
| self.0[0], | ||
| ))); | ||
| } | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Uhm, panicking on verifying untrusted bytes sounds more like a bug to me. Prob we should ask them to avoid this eventually.
Update ZisK to
v0.17.0, a patch is still needed to recover from prover errorSet complete for the ASM service so there is no dangling threads that causes memory leak, without this the prover restart will end up out of memory
Expose a function to cleanup ASM services when it fails and exits (e.g. guest program panics)
Expose a function to compute program vk without starting the ASM services
Patch the
ziskfloat.elfby extracting from the releasedcargo-ziskin upstream repo, and disable thebuild.rsscript oflib-floatcrate to avoidziskfloat.elfgets updated (which makes the program vk different). This ensures the ZisK prover in Ere to compute the same program vk as the official releasedcargo-ziskdoes.Update the
ZiskProofformat to match the format expected byproofman-verifier, and do strict constant size check to avoid the verifier panicsUpdate the
ZiskPlatformimplementation, now the profile syscall reads string as naming instead of requiring a constant tag. However, the AOT compilation process doesn't support ELF with the profile syscall (depended by prover), so currently thecycle_scope_*functions are left no-op.@jsign This breaks the zisk profilng flow in
zkevm-benchmark-workload, we probably need a custom build for the guests to do profiling, or we adapt to the Function-Level Profiling that works for prover and enabled by default like:ziskemu -e <elf-path> -i <input-path> -X -S -DUpdate the
ZiskClusterClientAPI, adds functioncreate_prove_job,wait_prove_job, andcancel_prove_job, andproveis built by composing the async endpoints with an optional deadline.