Summary
Add a poisoning mechanism to the Servo plugin's shared renderer thread so that instances which panic repeatedly are automatically disabled rather than retried indefinitely. Currently, catch_unwind recovers from panics and returns the last good frame, but a pathological page (e.g. one that reliably triggers a Servo bug) will panic on every tick() call, flooding logs with backtraces.
Related PRs: #326 (hardening — added catch_unwind recovery)
Parent issue: #167
Motivation
From PR #326 review (Devin Review Comment 10): the current hardening wraps each work item in catch_unwind(AssertUnwindSafe(...)), which is good for resilience but has no circuit-breaker. A node hitting a deterministic Servo bug will:
- Panic every render tick (~30 times/sec)
- Log a full backtrace each time
- Always return the stale cached frame (or blank if no frame was ever cached)
- Never recover (since the same URL/state causes the same panic)
Proposed Design
Per-instance panic counter
    struct InstanceState {
        // ... existing fields ...
        /// Consecutive panic count (reset on successful render).
        consecutive_panics: u32,
        /// If true, this instance is poisoned and all renders return the
        /// cached frame without attempting Servo calls.
        poisoned: bool,
    }
Poisoning threshold
After N consecutive panics (suggested: const POISON_THRESHOLD: u32 = 5), mark the instance as poisoned:
    if state.consecutive_panics >= POISON_THRESHOLD {
        state.poisoned = true;
        tracing::error!(
            node_id = %node_id,
            panics = state.consecutive_panics,
            "Instance poisoned after {POISON_THRESHOLD} consecutive panics; \
             renders will return cached frame until URL is changed"
        );
    }
Recovery path
An UpdateConfig with a new URL should reset the poison state, since the new page may not trigger the same bug:
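    fn handle_update_config(...) {
        // ... navigate to new URL ...
        // The new page may not hit the same bug, so clear the breaker.
        if state.poisoned {
            tracing::info!(node_id = %node_id, "Poison state reset after URL change");
        }
        state.consecutive_panics = 0;
        state.poisoned = false;
    }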
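The reset could also be gated on the URL actually changing, so that re-sending the same configuration does not un-poison an instance that would immediately panic again. The following is only a sketch under assumptions: the url field and the apply_url_update helper are hypothetical, and InstanceState is again a reduced stand-in.

```rust
/// Reduced stand-in for InstanceState; `url` is a hypothetical field.
struct InstanceState {
    url: String,
    consecutive_panics: u32,
    poisoned: bool,
}

/// Hypothetical helper: applies a new URL and clears the breaker state.
/// Returns true if a poisoned instance was un-poisoned, so the caller can
/// emit its tracing::info! log at that point.
fn apply_url_update(state: &mut InstanceState, new_url: &str) -> bool {
    if state.url == new_url {
        // Same page, same deterministic panic: keep the breaker state.
        return false;
    }
    let was_poisoned = state.poisoned;
    state.url = new_url.to_owned();
    state.consecutive_panics = 0;
    state.poisoned = false;
    was_poisoned
}
```

Whether an unchanged URL should leave the instance poisoned is a design choice the proposal does not settle. The sketch follows the "until URL is changed" wording of the error message.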
Render behavior when poisoned
    fn handle_render(...) {
        if state.poisoned {
            // Return cached frame without touching Servo
            let frame = state.last_good_frame.clone().unwrap_or_default();
            let _ = state.result_tx.send(ServoThreadResult::Frame(frame));
            return;
        }
        // ... normal render path with catch_unwind ...
    }
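To make the short-circuit concrete, here is a self-contained sketch that can be compiled in isolation. Frame, the single-variant ServoThreadResult, and the render closure are stand-ins invented for illustration; the real plugin types and the catch_unwind wrapping are omitted.

```rust
use std::sync::mpsc::Sender;

/// Stand-in frame type; the real plugin's frame type will differ.
#[derive(Clone, Debug, Default, PartialEq)]
struct Frame(Vec<u8>);

/// Stand-in result enum with only the variant used here.
#[derive(Debug, PartialEq)]
enum ServoThreadResult {
    Frame(Frame),
}

/// Reduced stand-in for InstanceState.
struct InstanceState {
    poisoned: bool,
    last_good_frame: Option<Frame>,
    result_tx: Sender<ServoThreadResult>,
}

/// Returns true if the render was short-circuited because the instance is
/// poisoned. `render` stands in for the Servo call.
fn handle_render(state: &mut InstanceState, render: impl FnOnce() -> Frame) -> bool {
    if state.poisoned {
        // Serve the cached frame (or a blank default) without calling render.
        let frame = state.last_good_frame.clone().unwrap_or_default();
        let _ = state.result_tx.send(ServoThreadResult::Frame(frame));
        return true;
    }
    // Normal path: render, cache the frame, and send it.
    let frame = render();
    state.last_good_frame = Some(frame.clone());
    let _ = state.result_tx.send(ServoThreadResult::Frame(frame));
    false
}
```

Passing the render step as a closure makes it easy to check in a unit test that a poisoned instance never invokes it.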
Complexity
Effort: S (half day)
Straightforward state machine addition to existing InstanceState.
Notes
- The AssertUnwindSafe wrapper is a known risk (Devin Review Comment 8): poisoning mitigates the worst case (infinite panic loops) but does not solve potential state corruption from unwinding through !UnwindSafe types. A deeper fix would be to restart the WebView on panic, but that is a separate concern.
- Consider also emitting a metric (servo.instance.poisoned counter) for OTel dashboards.
Acceptance Criteria
- A new URL (via UpdateConfig) resets poison state
- tracing::error! emitted when an instance becomes poisoned
- tracing::info! emitted when poison state is reset