Bug Description
mpr_refine_portal() in genesis/engine/solvers/rigid/collider/mpr.py appears to be able to run indefinitely because its portal-refinement loop has no explicit iteration cap:
    @ti.func  # @qd.func in newer Genesis versions
    def mpr_refine_portal(...):
        ret = 1
        while True:
            direction = mpr_portal_dir(...)
            ...
            mpr_expand_portal(...)
        return ret
This looks inconsistent with nearby collision code:
- mpr_find_penetration() stops when iterations > mpr_info.CCD_ITERATIONS[None].
- mpr_discover_portal() has a num_trials == 15 cap for a documented rare deadlock condition.
- GJK/EPA paths also use explicit iteration limits.
I found this while investigating a rare, non-deterministic hang where scene.step() stops returning during GPU rigid-body simulation. Once triggered, the call does not recover on its own. In some runs, the machine later ends up in a blue screen, which is consistent with a GPU kernel that never returns and eventually trips the driver/system.
Tracing narrowed the last known simulation stage to convex-vs-convex narrow-phase collision detection, which led me to inspect this MPR path. As an additional validation point, after locally changing this loop to be finite, the hang stopped occurring in my reproduction runs. This makes the unbounded mpr_refine_portal() loop the strongest current suspect.
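The local change amounts to the pattern below, shown as a plain-Python sketch rather than the actual Taichi/Quadrants kernel code. The names `refine_with_cap`, `step`, and `max_iterations` are hypothetical; in Genesis the cap could presumably reuse the same limit that mpr_find_penetration() checks, but that choice is an assumption on my part.

```python
def refine_with_cap(step, max_iterations):
    """Call `step` until it reports convergence or the cap is reached.

    `step` is a hypothetical callable standing in for one portal-refinement
    iteration; it returns True once the portal is good enough.
    Returns (converged, iterations_used).
    """
    iterations = 0
    while iterations < max_iterations:  # bounded, unlike `while True`
        iterations += 1
        if step():
            return True, iterations
    # Cap reached without convergence: report failure instead of spinning.
    return False, iterations


# A step that never converges now terminates after the cap.
never_converges = refine_with_cap(lambda: False, 50)

# A step that converges on its third call exits early.
outcomes = iter([False, False, True])
converges_third = refine_with_cap(lambda: next(outcomes), 50)
```

With a cap in place, a pathological contact pair costs at most `max_iterations` iterations and the caller can decide how to treat the non-converged result.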
I also reproduced a non-returning GPU synchronization hang in a temporary test environment using genesis-world==0.4.7 with the newer Quadrants backend. That version still has the same unbounded mpr_refine_portal() loop.
Steps to Reproduce
I cannot currently provide a compact guaranteed reproducer. My setup is project-specific, and the issue occurs probabilistically during large batched reinforcement-learning runs.
The cases where I observe the hang are consistent with sustained, nearly planar contacts during batched simulation. The contacting geometry includes convex mesh feet with relatively flat contact surfaces against flat stage geometry. This may produce degenerate or non-unique support points during MPR portal refinement.
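To illustrate what "non-unique support points" means here, the following sketch computes the support set of a unit cube for two directions. It is a standalone illustration with hypothetical names (`support_points`, `cube`), not code taken from Genesis, and it does not by itself demonstrate the hang.

```python
def support_points(vertices, direction, tol=1e-9):
    """Return every vertex whose projection onto `direction` is maximal within `tol`."""
    dots = [sum(v_i * d_i for v_i, d_i in zip(v, direction)) for v in vertices]
    best = max(dots)
    return [v for v, d in zip(vertices, dots) if best - d <= tol]


# Vertices of a unit cube, standing in for a flat-bottomed convex foot.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]

# Direction exactly along the face normal: the whole bottom face ties.
ties_flat = support_points(cube, (0.0, 0.0, -1.0))

# Slightly tilted direction: the tie breaks and one vertex wins.
ties_tilted = support_points(cube, (0.1, 0.05, -1.0))
```

When the search direction is aligned with a flat face, four vertices are equally valid support points, so which one a support function returns can depend on floating-point noise. That is the kind of degeneracy I suspect could keep a refinement loop from making progress.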
The important localized observations are:
- scene.step() stops returning during GPU rigid-body simulation.
- The last known simulation stage is convex-vs-convex narrow-phase collision detection.
- The active path reaches MPR convex collision.
- mpr_refine_portal() has an unbounded while True loop.
- Making this loop finite stopped the hang in my reproduction runs.
Expected Behavior
Convex collision detection should not be able to enter a non-returning device loop. scene.step() should either complete normally or fail/abort gracefully instead of hanging indefinitely.
Screenshots/Videos
No response
Relevant log output
No useful error log is emitted because the issue is a non-returning hang. Once the hang is triggered, `scene.step()` does not return, so no Python exception or later diagnostic log is produced.
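One way to get at least a host-side trace for this kind of hang is Python's standard `faulthandler` module, sketched below. The `hang_watchdog` helper and the log file name are hypothetical, and the `sum(...)` call is only a stand-in for `scene.step()`. This records Python tracebacks only; it cannot inspect or interrupt the GPU kernel.

```python
import faulthandler
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_path):
    """Write all Python thread tracebacks to `log_path` if the block exceeds `timeout_s`."""
    with open(log_path, "w") as log:
        # The dump is written by a watchdog thread, so it can fire even while
        # the main thread is blocked inside a native call.
        faulthandler.dump_traceback_later(timeout_s, file=log, exit=False)
        try:
            yield
        finally:
            # Disarm the timer when the block finishes in time.
            faulthandler.cancel_dump_traceback_later()


log_path = os.path.join(tempfile.mkdtemp(), "hang_trace.log")

with hang_watchdog(30.0, log_path):
    result = sum(range(10))  # stand-in for scene.step()
```

If the guarded call returns in time, the log file stays empty. If it does not, the file should contain the Python stack at the moment the timeout expired, which at least confirms where the host was blocked.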
Environment
- OS: Windows
- Backend: GPU / CUDA
- GPU: NVIDIA GeForce RTX 4090
- Local original Genesis version: 0.3.13
- Local newer Genesis test version: genesis-world==0.4.7
- Quadrants in the 0.4.7 test: version 0.8.0, commit a22cc2de
- Relevant upstream file checked on 2026-05-20:
Release version or Commit ID
Observed locally with:
- genesis==0.3.13
- genesis-world==0.4.7
The latest public main branch still appeared to have the same unbounded loop as of 2026-05-20.
Additional Context
No response