feat(server): add global watchdog for hard prove timeouts #352
Andrurachi wants to merge 1 commit into eth-act:master from
I'd prefer to let the supervisor restart the container (e.g.
Ah, that makes total sense. I didn't realize `DockerizedzkVM` already had the client-side kill switch configured. I'll close this draft out since the global watchdog isn't the right path. However, I noticed that `ERE_PROVE_TIMEOUT_MS` wasn't actually being inherited by the docker runner in `crates/dockerized/src/prover.rs`. Should I convert this PR to just fix that small missing environment variable injection, or would you prefer I just close it entirely?
The server timeout is mainly for docker usage. The client timeout will take effect earlier and restart the container, so I think there is no need to inherit the env.
Fixes #253
Opening as a draft to request feedback on the use of `std::process::exit(1)` for hardware recovery.

Problem:
Airbender deadlocks on heavy blocks. Because the heavy proving runs in-process via FFI, deadlocked GPU threads cannot be safely cancelled from Rust. The existing `ERE_PROVE_TIMEOUT_MS` handles API health (503s) but cannot free the locked GPU.

Solution:
- `ERE_PROVE_TIMEOUT_SEC` for hard timeouts.
- `tokio::spawn` global watchdog in `server.rs` that checks `ProveState::started_at` every 5 seconds.
- `std::process::exit(1)`. This forces Docker to restart the container, wiping the deadlocked GPU memory.

Minor Fix:
- `.inherit_env("ERE_PROVE_TIMEOUT_MS")` in `crates/dockerized/src/prover.rs` so the soft timeout properly passes to the container.

Is this crash-only approach acceptable for handling these deadlocks?
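
For discussion, here is a minimal sketch of the watchdog idea from the Solution list. It is not the code in this PR. It swaps `tokio::spawn` for a plain `std::thread` so it runs without external crates. `ProveState` is a hypothetical stand-in that carries only a `started_at` field, and the helpers `hard_timeout_exceeded`, `hard_timeout_from_env`, and `spawn_watchdog` are illustrative names, not existing APIs in this repository.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the server's prove state (assumption, not the real type).
#[derive(Default)]
struct ProveState {
    /// Set when a prove request starts, cleared when it finishes.
    started_at: Option<Instant>,
}

/// Returns true when a prove has been running for longer than `limit`.
fn hard_timeout_exceeded(started_at: Option<Instant>, now: Instant, limit: Duration) -> bool {
    match started_at {
        Some(start) => now.saturating_duration_since(start) > limit,
        None => false, // no prove in flight, nothing to time out
    }
}

/// Reads the hard timeout from ERE_PROVE_TIMEOUT_SEC.
/// Returns None when the variable is unset or not a valid integer,
/// which this sketch treats as "watchdog disabled".
fn hard_timeout_from_env() -> Option<Duration> {
    std::env::var("ERE_PROVE_TIMEOUT_SEC")
        .ok()?
        .parse::<u64>()
        .ok()
        .map(Duration::from_secs)
}

/// Spawns a background thread that checks the prove state every `poll`
/// and exits the whole process once the hard timeout is exceeded.
fn spawn_watchdog(
    state: Arc<Mutex<ProveState>>,
    limit: Duration,
    poll: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        thread::sleep(poll);
        // Copy the timestamp out so the lock is not held during the check.
        let started_at = state.lock().map(|s| s.started_at).unwrap_or(None);
        if hard_timeout_exceeded(started_at, Instant::now(), limit) {
            eprintln!("hard prove timeout exceeded, exiting so the container is restarted");
            std::process::exit(1);
        }
    })
}

fn main() {
    let state = Arc::new(Mutex::new(ProveState::default()));
    if let Some(limit) = hard_timeout_from_env() {
        // 5-second poll interval, matching the description above.
        spawn_watchdog(Arc::clone(&state), limit, Duration::from_secs(5));
    }
    // The server would start here and update `state.started_at` around each prove call.
}
```

Keeping the timeout check in a pure function makes the policy testable without waiting on real time. Exiting the process only helps when something outside it, such as a Docker restart policy or a client-side supervisor, brings the container back up.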