
feat(server): add global watchdog for hard prove timeouts #352

Draft
Andrurachi wants to merge 1 commit into eth-act:master from Andrurachi:feat-ere-prove-timeout-253


@Andrurachi

Fixes #253

Opening as a draft to request feedback on the use of std::process::exit(1) for hardware recovery.

Problem:

Airbender deadlocks on heavy blocks. Because the heavy proving runs in-process via FFI, deadlocked GPU threads cannot be safely cancelled from Rust. The existing ERE_PROVE_TIMEOUT_MS handles API health (503s) but cannot free the locked GPU.

Solution:

  • Added ERE_PROVE_TIMEOUT_SEC for hard timeouts.
  • Implemented a detached tokio::spawn global watchdog in server.rs that checks ProveState::started_at every 5 seconds.
  • On timeout, the watchdog calls std::process::exit(1). This forces Docker to restart the container, wiping the deadlocked GPU memory.
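
For reviewers who want to see the shape of the watchdog described above, here is a minimal sketch. It is not the code in this PR: it uses `std::thread` instead of `tokio::spawn` so it stays dependency-free, and the `ProveState` struct, `prove_timed_out`, and `spawn_watchdog` are hypothetical names (only the `started_at` field is mentioned in the description).

```rust
use std::process;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the server's ProveState.
/// Only the `started_at` field is named in the PR description.
#[derive(Default)]
pub struct ProveState {
    /// Set when a prove starts, cleared when it finishes.
    pub started_at: Option<Instant>,
}

/// Returns true when a prove has been running for longer than `limit`.
/// An idle server (`started_at == None`) never times out.
pub fn prove_timed_out(started_at: Option<Instant>, now: Instant, limit: Duration) -> bool {
    match started_at {
        Some(start) => now.saturating_duration_since(start) > limit,
        None => false,
    }
}

/// Spawns a detached watchdog that polls the shared state every `poll`
/// and terminates the whole process once the hard limit is exceeded.
/// Assumption: an external supervisor (e.g. Docker) restarts the process.
pub fn spawn_watchdog(
    state: Arc<Mutex<ProveState>>,
    limit: Duration,
    poll: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(poll);
        // Copy the timestamp out so the lock is not held during the check.
        // A poisoned lock is treated as "no prove running" in this sketch.
        let started_at = state.lock().map(|s| s.started_at).unwrap_or(None);
        if prove_timed_out(started_at, Instant::now(), limit) {
            eprintln!("prove exceeded hard timeout of {limit:?}; exiting");
            process::exit(1);
        }
    })
}
```

Keeping the timeout decision in a pure function makes it testable without ever calling `process::exit`. A 5-second poll as described above would correspond to `spawn_watchdog(state, limit, Duration::from_secs(5))`.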

Minor Fix:

  • Added the missing .inherit_env("ERE_PROVE_TIMEOUT_MS") in crates/dockerized/src/prover.rs so the soft timeout properly passes to the container.

Is this crash-only approach acceptable for handling these deadlocks?

@han0110
Collaborator

han0110 commented May 5, 2026

I'd prefer to let the supervisor (e.g. docker-compose) restart the container when it's stuck. Currently, DockerizedzkVM has a client-side timeout configured and already restarts the container when it times out. I think the Airbender issue should be resolved separately in the Airbender crate, but so far I couldn't reproduce it.
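
To illustrate the general pattern of a client-side timeout that triggers a restart, here is a rough sketch. It is not DockerizedzkVM's actual implementation: `run_with_client_timeout`, `Supervised`, and the `restart` hook are hypothetical, and the worker thread only stands in for a request to the container.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a supervised call (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Supervised<T> {
    Finished(T),
    TimedOut,
}

/// Runs `work` on a worker thread and waits at most `limit` for its result.
/// If no result arrives in time, calls `restart` (a hypothetical hook,
/// e.g. one that restarts the container) and reports a timeout.
pub fn run_with_client_timeout<T, W, R>(work: W, limit: Duration, restart: R) -> Supervised<T>
where
    T: Send + 'static,
    W: FnOnce() -> T + Send + 'static,
    R: FnOnce(),
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver is gone if the caller gave up.
        let _ = tx.send(work());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Supervised::Finished(value),
        // A timeout and a worker panic are treated the same in this sketch.
        Err(_) => {
            restart();
            Supervised::TimedOut
        }
    }
}
```

The point of the design is that recovery lives outside the stuck process: the caller gives up waiting and restarts the worker, so the worker itself never needs to detect its own deadlock.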

@Andrurachi
Author

Ah, that makes total sense. I didn't realize DockerizedzkVM already had the client-side kill switch configured.

I'll close this draft out since the global watchdog isn't the right path. However, I noticed that ERE_PROVE_TIMEOUT_MS wasn't actually being inherited by the docker runner in crates/dockerized/src/prover.rs. Should I convert this PR to just fix that small missing environment variable injection, or would you prefer I just close it entirely?

@han0110
Collaborator

han0110 commented May 6, 2026

The server timeout is mainly for Docker usage. The client timeout will take effect earlier and restart the container, so I think there is no need to inherit the env.



Development

Successfully merging this pull request may close these issues.

airbender: deadlocks on heavy blocks

2 participants