feat: update zisk to v0.17.0 #351

Open
han0110 wants to merge 15 commits into master from han/feature/zisk-v0.17.0

Conversation

@han0110
Collaborator

@han0110 han0110 commented May 4, 2026

  • Update ZisK to v0.17.0. A patch is still needed to recover from prover errors:

    • han0110/zisk@9499da6
      Set complete for the ASM service so that no dangling threads are left to leak memory; without this, restarting the prover eventually runs out of memory.
    • han0110/zisk@4c38059
      Expose a function to clean up the ASM services when it fails and exits (e.g. when the guest program panics).
    • han0110/zisk@6d09f8e
      Expose a function to compute program vk without starting the ASM services
    • han0110/zisk@4019fa3
      Patch ziskfloat.elf by extracting it from the released cargo-zisk in the upstream repo, and disable the build.rs script of the lib-float crate so that ziskfloat.elf does not get regenerated (which would change the program vk). This ensures that the ZisK prover in Ere computes the same program vk as the officially released cargo-zisk does.
  • Update the ZiskProof format to match the format expected by proofman-verifier, and add a strict constant-size check so that the verifier does not panic.

  • Update the ZiskPlatform implementation. The profile syscall now reads a string as the scope name instead of requiring a constant tag. However, the AOT compilation process (which the prover depends on) doesn't support ELFs that contain the profile syscall, so for now the cycle_scope_* functions are left as no-ops.

    @jsign This breaks the zisk profiling flow in zkevm-benchmark-workload. We probably need a custom build of the guests for profiling, or we adapt to the Function-Level Profiling, which works with the prover and is enabled by default, like:

    ziskemu -e <elf-path> -i <input-path> -X -S -D
    ...
    
      TOP STEP FUNCTIONS (STEPS, % STEPS, CALLS, STEPS/CALL, FUNCTION)
    ----------------------------------------------------------------
                475   3.93%          1             475 <ere_platform_zisk::platform::ZiskPlatform as ere_platform_core::platform::Platform>::write_whole_output
                474   3.92%          1             474 <ere_util_test::program::basic::BasicProgramOutput<ere_util_test::codec::BincodeLegacy> as ere_codec::encode::Encode>::encode_to_vec
                470   3.89%          1             470 ziskos::io::commit_slice
                443   3.67%          1             443 <ere_util_test::program::basic::BasicProgramOutput<…> as serde_core::ser::Serialize>::serialize::<bincode::features::serde::ser::SerdeEncoder<…>>
                291   2.41%          3              97 <alloc::raw_vec::RawVecInner<_>>::reserve::do_reserve_and_handle::<alloc::alloc::Global>
                241   1.99%          1             241 <ere_util_test::program::basic::BasicProgramInput<ere_util_test::codec::BincodeLegacy> as ere_codec::decode::Decode>::decode_from_slice
                206   1.70%          1             206 <bincode::features::serde::de_borrowed::SerdeDecoder<…> as serde_core::de::Deserializer>::deserialize_struct::<<…>::deserialize::__Visitor<…>>
                184   1.52%          3              61 <alloc::raw_vec::RawVecInner>::finish_grow
                 63   0.52%          1              63 <ere_platform_zisk::platform::ZiskPlatform as ere_platform_core::platform::Platform>::read_whole_input
                 42   0.35%         14               3 memcpy
                 40   0.33%          2              20 __rustc::__rust_realloc
                 40   0.33%          1              40 ziskos::io::read_input_slice
                 32   0.26%          2              16 __rustc::__rust_alloc
                 16   0.13%          1              16 __rustc::__rust_alloc_zeroed
                  3   0.02%          3               1 __rustc::__rust_dealloc
                  3   0.02%          3               1 __rustc::__rust_no_alloc_shim_is_unstable_v2
                  2   0.02%          2               1 zkvm_init
    
    ...
    
  • Update the ZiskClusterClient API: add the functions create_prove_job, wait_prove_job, and cancel_prove_job; prove is now built by composing these async endpoints with an optional deadline.
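
    To illustrate the last bullet, here is a minimal sketch of how prove could be composed from the three job endpoints with an optional deadline. It is not the code in this PR: the trait, the JobId type, the error type, and the mock are hypothetical, and the sketch polls synchronously with std only, whereas the real endpoints are async.

```rust
use std::time::{Duration, Instant};

/// Hypothetical job handle; the real client's types are not shown in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct JobId(pub u64);

#[derive(Debug, PartialEq, Eq)]
pub enum ProveError {
    DeadlineExceeded,
}

/// Hypothetical, synchronous stand-in for the three endpoints named above.
pub trait ProveJobs {
    fn create_prove_job(&mut self, input: &[u8]) -> JobId;
    /// Waits up to `timeout`; returns `None` if the job is still running.
    fn wait_prove_job(&mut self, job: JobId, timeout: Duration) -> Option<Vec<u8>>;
    fn cancel_prove_job(&mut self, job: JobId);
}

/// `prove` composed from the job endpoints; the job is cancelled once the
/// optional deadline has passed.
pub fn prove<C: ProveJobs>(
    client: &mut C,
    input: &[u8],
    deadline: Option<Duration>,
) -> Result<Vec<u8>, ProveError> {
    let job = client.create_prove_job(input);
    let started = Instant::now();
    let poll = Duration::from_millis(10);
    loop {
        let timeout = match deadline {
            None => poll,
            Some(limit) => {
                let remaining = limit.saturating_sub(started.elapsed());
                if remaining.is_zero() {
                    // Do not leave an abandoned job running on the cluster.
                    client.cancel_prove_job(job);
                    return Err(ProveError::DeadlineExceeded);
                }
                remaining.min(poll)
            }
        };
        if let Some(proof) = client.wait_prove_job(job, timeout) {
            return Ok(proof);
        }
    }
}

/// Mock used only to exercise the sketch: finishes after `polls_needed` waits.
pub struct MockClient {
    pub polls_needed: u32,
    pub cancelled: Vec<JobId>,
}

impl ProveJobs for MockClient {
    fn create_prove_job(&mut self, _input: &[u8]) -> JobId {
        JobId(1)
    }

    fn wait_prove_job(&mut self, _job: JobId, timeout: Duration) -> Option<Vec<u8>> {
        if self.polls_needed == 0 {
            return Some(vec![0xAB]);
        }
        self.polls_needed -= 1;
        std::thread::sleep(timeout);
        None
    }

    fn cancel_prove_job(&mut self, job: JobId) {
        self.cancelled.push(job);
    }
}
```

    Keeping create, wait, and cancel as separate calls lets a caller own the job lifecycle directly, while prove stays a thin convenience wrapper on top of them.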

@han0110 han0110 marked this pull request as ready for review May 5, 2026 04:36
  [features]
  default = []
- cuda = ["zisk-sdk/gpu"]
+ cuda = []
Collaborator Author

In v0.17.0 ZisK has cuda enabled by default (by detecting whether nvcc is available), and a cpu-only feature flag is available to force cuda off. If we try to initialize the gpu prover without nvcc available, it shows a runtime error, so we add a build.rs to turn that into a compile-time error.
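
For readers unfamiliar with the pattern, a build script check of this kind could look roughly like the sketch below. This is an illustration, not the build.rs added in this PR: the function names and the error message are hypothetical. The only facts relied on are that Cargo sets `CARGO_FEATURE_<NAME>` for each enabled feature and that a panicking build script fails the build.

```rust
use std::process::{Command, Stdio};

/// Returns true if an `nvcc` binary can be spawned from `PATH`.
pub fn nvcc_available() -> bool {
    Command::new("nvcc")
        .arg("--version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Pure decision logic, split out so it can be checked without a CUDA toolchain.
pub fn check_cuda(cuda_feature_enabled: bool, nvcc_found: bool) -> Result<(), String> {
    if cuda_feature_enabled && !nvcc_found {
        return Err("feature `cuda` is enabled but `nvcc` was not found in PATH".to_string());
    }
    Ok(())
}

/// Body of a hypothetical `build.rs` entry point (call this from `main`).
pub fn run_build_check() {
    // Cargo sets `CARGO_FEATURE_<NAME>` for every enabled feature of the crate being built.
    let cuda = std::env::var_os("CARGO_FEATURE_CUDA").is_some();
    if let Err(msg) = check_cuda(cuda, nvcc_available()) {
        // A panicking build script fails the build, so the problem surfaces at compile time.
        panic!("{msg}");
    }
}
```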

Comment on lines +90 to +104
fn validate_format(&self) -> Result<(), Error> {
    if self.0.len() != PROOF_WORDS {
        return Err(Error::InvalidProofFormat(format!(
            "proof has {} u64 words, expected to be {PROOF_WORDS}",
            self.0.len(),
        )));
    }
    if self.0[0] != PROOF_PREFIX_WORDS as u64 {
        return Err(Error::InvalidProofFormat(format!(
            "proof n_publics is {}, expected to be {PROOF_PREFIX_WORDS}",
            self.0[0],
        )));
    }
    Ok(())
}
Collaborator Author

Always validate the constant proof size and metadata; otherwise the verify function in proofman-verifier might be tricked into panicking.

Collaborator

Uhm, panicking on verifying untrusted bytes sounds more like a bug to me. Prob we should ask them to avoid this eventually.
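
As a self-contained illustration of why the length check has to come first, here is a small sketch of the same validation wrapped in a constructor. The constant values are made up for the example (the real PROOF_WORDS and PROOF_PREFIX_WORDS are defined in the PR), and the `new` constructor is hypothetical.

```rust
/// Hypothetical sizes for illustration only; the real values live in the PR.
pub const PROOF_PREFIX_WORDS: usize = 4;
pub const PROOF_WORDS: usize = 16;

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    InvalidProofFormat(String),
}

pub struct ZiskProof(Vec<u64>);

impl ZiskProof {
    /// Hypothetical constructor: only hands out proofs that passed validation.
    pub fn new(words: Vec<u64>) -> Result<Self, Error> {
        let proof = ZiskProof(words);
        proof.validate_format()?;
        Ok(proof)
    }

    fn validate_format(&self) -> Result<(), Error> {
        // The length check runs first, so the `self.0[0]` access below can
        // never go out of bounds, even for an empty input.
        if self.0.len() != PROOF_WORDS {
            return Err(Error::InvalidProofFormat(format!(
                "proof has {} u64 words, expected to be {PROOF_WORDS}",
                self.0.len(),
            )));
        }
        if self.0[0] != PROOF_PREFIX_WORDS as u64 {
            return Err(Error::InvalidProofFormat(format!(
                "proof n_publics is {}, expected to be {PROOF_PREFIX_WORDS}",
                self.0[0],
            )));
        }
        Ok(())
    }
}
```

With this shape, untrusted input of any length yields an `Err` rather than a panic before it ever reaches the verifier.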

Comment on lines +24 to +25
libclang-dev \
clang \
Collaborator Author

Added for ZisK's zkvm-interface crate compilation.

clang \
# Used to kill the ASM services when prover panics
psmisc \
procps \
Collaborator Author

We use pkill for defensive ASM service cleanup now (previously using fuser)
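
A rough sketch of what such a pkill-based cleanup could look like, using the `ZISK_{pid}_*` shmem prefix named in this PR's doc comment. The function names are hypothetical, and the sketch assumes the orphaned ASM services carry that prefix somewhere in their command line (which is what `pkill -f` matches against); the actual implementation in the PR may select processes differently.

```rust
use std::process::Command;

/// Shared-memory name prefix for a given prover process id.
pub fn shmem_prefix(pid: u32) -> String {
    format!("ZISK_{pid}_")
}

/// Builds (without running) a `pkill` invocation that sends SIGTERM to every
/// process whose full command line matches the prefix.
pub fn asm_cleanup_command(pid: u32) -> Command {
    let mut cmd = Command::new("pkill");
    cmd.arg("-TERM").arg("-f").arg(shmem_prefix(pid));
    cmd
}

/// Runs the cleanup for the current process.
/// Returns `Ok(true)` if at least one process was signalled.
pub fn kill_orphan_asm_services() -> std::io::Result<bool> {
    let status = asm_cleanup_command(std::process::id()).status()?;
    // pkill exits with 0 when at least one process matched and 1 when none did.
    Ok(status.success())
}
```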

let result = panic::catch_unwind(AssertUnwindSafe(|| {
self.prover
.prove(&self.program, stdin)
.wrap_proof(ProofKind::VadcopFinalMinimal)
Collaborator Author

Wrap into the minimal stark proof for a smaller proof size (~256 KiB instead of ~329 KiB). This is also the default proof kind.
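
The catch_unwind wrapper in the snippet above follows a common pattern: run the prover inside `panic::catch_unwind`, turn a panic into an ordinary error, and run cleanup on any failure. A minimal standalone sketch, with a hypothetical closure-based signature and error type rather than the real prover API:

```rust
use std::panic::{self, AssertUnwindSafe};

#[derive(Debug, PartialEq, Eq)]
pub enum ProveError {
    Failed(String),
    Panicked(String),
}

/// Runs `prove`, converting a panic into an error; `cleanup` runs on any failure.
/// Only effective when the crate is built with `panic = "unwind"` (the default).
pub fn prove_guarded<P, C>(prove: P, cleanup: C) -> Result<Vec<u8>, ProveError>
where
    P: FnOnce() -> Result<Vec<u8>, String>,
    C: FnOnce(),
{
    let error = match panic::catch_unwind(AssertUnwindSafe(prove)) {
        Ok(Ok(proof)) => return Ok(proof),
        Ok(Err(msg)) => ProveError::Failed(msg),
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic".to_string());
            ProveError::Panicked(msg)
        }
    };
    // E.g. clear the program cache and terminate orphaned ASM services.
    cleanup();
    Err(error)
}
```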

@han0110 han0110 requested a review from jsign May 5, 2026 11:04
Collaborator

@jsign jsign left a comment

LGTM, left just some comments/questions.

)));
fn parse_proof(bytes: &[u8]) -> Result<ZiskProof, Error> {
#[derive(Default, Serialize, Deserialize)]
pub struct Proof<'a> {
Collaborator

Are all these structs not exported by Zisk?

Collaborator Author

@han0110 han0110 May 5, 2026

Yes, there is a structure for the proof, but the proof is eventually flattened into the format added in this PR, and the verifier doesn't check the size, so a malformed proof will trigger an index out of bounds.

Will file an issue in upstream as well.

- fn cycle_scope_start(name: &str) {
-     let tag = SCOPE_REGISTRY.get_or_assign_tag(name);
-     dispatch_profile!(start, tag);
+ fn cycle_scope_start(_name: &str) {
Collaborator

> @jsign This breaks the zisk profiling flow in zkevm-benchmark-workload, we probably need a custom build for the guests to do profiling, or we adapt to the Function-Level Profiling that works for prover and enabled by default like:

Uhm, function-level profiling is a bit annoying since it doesn't let us understand cleanly where the cost is coming from in the app-level logic.

By "custom build" you mean compiling the guest program or running it with Zisk with some feature enabled, or you mean actually re-implementing all the logic to have custom scopes as they had before?

Got a bit lost since you also mention they now support string scope names instead of constants, so "scope profiling" is still supported in some official way?

Collaborator Author

> By "custom build" you mean compiling the guest program or running it with Zisk with some feature enabled, or you mean actually re-implementing all the logic to have custom scopes as they had before?

By custom build I mean a build with ziskos::ziskos_syscall!(ziskos::SYSCALL_PROFILE_ID, ...) uncommented. It's commented out because with these syscalls the ELF can't be proved (but it can still be used with ziskemu to do the profiling we used to do).

> Got a bit lost since you also mention they now support string scope names instead of constants, so "scope profiling" is still supported in some official way?

Scope profiling is supported by ziskemu, but the ELF can't be proved anymore.

Collaborator

Oh, I see. So the ziskos crate that the guest program needs in order to add these scope markers needs a fork that uncomments some code. And we can use normal ere code (i.e. non-modified) to emulate the ELF run?

Collaborator

If that's the case, ere and the workload aren't strictly affected; it's more the ziskos dependency used by the guest program?

This assuming that we put the markers manually in guest with ziskos.

If we rely on ere for cycle-scoping, I agree that prob means a custom ere, which maybe is the most elegant thing now that string names are supported?

Collaborator Author

> This assuming that we put the markers manually in guest with ziskos.

I thought this might be the solution, but I just had another idea: add a feature flag in ere-platform-zisk to enable this profiling, so we can keep the same guest code with no forking; we just need to enable the feature flag to generate an ELF for profiling only.

Will add a commit soon

Collaborator Author

Added in 1810ac4, so I think in ere-guests we can release another ELF with the feature on for profiling (we need to change ere-compiler to accept feature flags first; will add a follow-up PR soon).

Collaborator

Sounds good 🙏
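
For reference, the feature-flag approach agreed on in this thread can be sketched as below. This is not the code from 1810ac4: the feature name `profile` and the `cycle_scope_end` function are assumptions, and a counter stands in for the profile syscall so that the sketch compiles outside a guest.

```rust
use std::sync::atomic::AtomicUsize;

/// Stand-in for the profile syscall: counts the markers that would be emitted.
pub static MARKERS_EMITTED: AtomicUsize = AtomicUsize::new(0);

// Hypothetical feature name; the flag added in the PR may be named differently.
#[cfg(feature = "profile")]
fn emit_marker(_name: &str) {
    // In a real guest this is where the profile syscall would be issued.
    MARKERS_EMITTED.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
}

#[cfg(not(feature = "profile"))]
fn emit_marker(_name: &str) {
    // No-op: the default ELF carries no profile syscalls, so it stays provable.
}

pub fn cycle_scope_start(name: &str) {
    emit_marker(name);
}

pub fn cycle_scope_end(name: &str) {
    emit_marker(name);
}
```

The guest code calls the same two functions in both builds; only the ELF compiled with the feature enabled contains the markers, and that ELF is meant for ziskemu profiling rather than proving.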

Comment on lines +163 to +164
/// Clear the program cache so the next `setup` spawns fresh ASM services, then SIGTERM any orphan
/// ASM child still bound to our `ZISK_{pid}_*` shmem prefix.
Collaborator

Tangent question: do you feel the cluster setup is more elegant in how setup/teardown works?

Collaborator Author

I checked the teardown logic in zisk and it looks like there is still some chance that the ASM services are left dangling, and the start function doesn't clean them up, so I would prefer to keep the pkill for now.

Collaborator

Ah, but I was wondering about cluster mode: does that setup also need this kind of service teardown? (mainly asking out of curiosity)

Collaborator Author

@han0110 han0110 May 5, 2026

Ah I see, just tested and it seems that if the worker's ASM services die (e.g. the guest program panics), the worker remains in a bad state and the coordinator doesn't remove it from the pool, so the whole cluster is also paralyzed...

Will need to figure out a way to check if the worker is still healthy so the supervisor can restart it if needed (e.g. docker-compose health check by grepping the asm services).
