feat(server): add global watchdog for hard prove timeouts by Andrurachi · Pull Request #352 · eth-act/ere

Andrurachi · 2026-05-05T03:20:43Z

Fixes #253

Opening as a draft to request feedback on the use of std::process::exit(1) for hardware recovery.

Problem:

Airbender deadlocks on heavy blocks. Because the heavy proving runs in-process via FFI, deadlocked GPU threads cannot be safely cancelled from Rust. The existing ERE_PROVE_TIMEOUT_MS handles API health (503s) but cannot free the locked GPU.

Solution:

Added ERE_PROVE_TIMEOUT_SEC for hard timeouts.
Implemented a detached tokio::spawn global watchdog in server.rs that checks ProveState::started_at every 5 seconds.
On timeout, the watchdog calls std::process::exit(1). This forces Docker to restart the container, wiping the deadlocked GPU memory.
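The watchdog described above could look roughly like the sketch below. It is not the code from this PR: it uses a plain `std::thread` instead of `tokio::spawn` so that it stays dependency-free, and the `ProveState` struct, `parse_timeout_sec`, `is_timed_out`, and `spawn_watchdog` are hypothetical names for illustration. Only `ProveState::started_at`, `ERE_PROVE_TIMEOUT_SEC`, the 5-second poll interval, and `std::process::exit(1)` come from the description.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the server's `ProveState`; only the
/// `started_at` field is taken from the PR description.
#[derive(Default)]
pub struct ProveState {
    /// `Some(t)` while a proof is running, `None` when idle.
    pub started_at: Option<Instant>,
}

/// Parses the value of `ERE_PROVE_TIMEOUT_SEC`. An unset, zero, or
/// malformed value disables the watchdog (an assumption of this sketch).
pub fn parse_timeout_sec(raw: Option<&str>) -> Option<Duration> {
    let secs: u64 = raw?.trim().parse().ok()?;
    (secs > 0).then(|| Duration::from_secs(secs))
}

/// Returns true when a proof has been running for longer than `limit`.
pub fn is_timed_out(started_at: Option<Instant>, now: Instant, limit: Duration) -> bool {
    match started_at {
        Some(start) => now.saturating_duration_since(start) > limit,
        None => false,
    }
}

/// Spawns a detached thread that polls the state every 5 seconds and
/// terminates the whole process once the hard timeout is exceeded.
pub fn spawn_watchdog(state: Arc<Mutex<ProveState>>, limit: Duration) {
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(5));
        // Copy the timestamp out so the lock is released before the check.
        let started_at = match state.lock() {
            Ok(guard) => guard.started_at,
            Err(poisoned) => poisoned.into_inner().started_at,
        };
        if is_timed_out(started_at, Instant::now(), limit) {
            eprintln!("prove exceeded hard timeout of {limit:?}; exiting");
            // Crash-only recovery: the container supervisor is expected to restart us.
            std::process::exit(1);
        }
    });
}
```

A caller would read the variable once at startup, e.g. `parse_timeout_sec(std::env::var("ERE_PROVE_TIMEOUT_SEC").ok().as_deref())`, and only call `spawn_watchdog` when it returns `Some`. Keeping the timeout check in a pure function makes it testable without actually exiting the process.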

Minor Fix:

Added the missing .inherit_env("ERE_PROVE_TIMEOUT_MS") in crates/dockerized/src/prover.rs so the soft timeout properly passes to the container.
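To illustrate what forwarding a host variable into the container amounts to, here is a minimal sketch. It does not reproduce the repository's `.inherit_env` builder method, whose signature is not shown here; `inherit_env_args` is a hypothetical helper that relies only on the documented `docker run --env NAME` behaviour, where omitting `=value` copies the host's value into the container.

```rust
/// Builds the `docker run` arguments that forward `name` into the container.
/// `lookup` abstracts the host environment so the function can be exercised
/// without modifying the real process environment.
pub fn inherit_env_args<F>(name: &str, lookup: F) -> Vec<String>
where
    F: Fn(&str) -> Option<String>,
{
    match lookup(name) {
        // `--env NAME` with no `=value` tells Docker to take the value from the host.
        Some(_) => vec!["--env".to_string(), name.to_string()],
        // Variable not set on the host: add nothing, so the container default applies.
        None => Vec::new(),
    }
}
```

In real use the lookup would be `|k| std::env::var(k).ok()`, and the returned arguments would be appended to the `docker run` command line.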

Is this crash-only approach acceptable for handling these deadlocks?

closes eth-act#253

han0110 · 2026-05-05T05:59:50Z

I'd prefer to let the supervisor (e.g. docker-compose) restart the container when it's stuck. DockerizedzkVM already has a client-side timeout configured, and it restarts the container on timeout. I think the Airbender issue should be resolved separately in the Airbender crate, but so far I haven't been able to reproduce it.

Andrurachi · 2026-05-05T14:59:41Z

Ah, that makes total sense. I didn't realize DockerizedzkVM already had the client-side kill switch configured.

I'll close this draft out since the global watchdog isn't the right path. However, I noticed that ERE_PROVE_TIMEOUT_MS wasn't actually being inherited by the docker runner in crates/dockerized/src/prover.rs. Should I convert this PR to just fix that small missing environment variable injection, or would you prefer I just close it entirely?

han0110 · 2026-05-06T03:19:10Z

The server timeout is mainly for docker usage. The client timeout will take effect earlier and restart the container, so I think there is no need to inherit the env.

feat(server): add global watchdog for hard prove timeouts

116a5ec

closes eth-act#253

Andrurachi wants to merge 1 commit into eth-act:master from Andrurachi:feat-ere-prove-timeout-253
