[Bug]: Possible non-returning GPU kernel: mpr_refine_portal() has no iteration cap #2815

@kobayashi0921

Bug Description

mpr_refine_portal() in genesis/engine/solvers/rigid/collider/mpr.py can apparently run indefinitely, because its portal-refinement loop has no explicit iteration cap:

@ti.func  # @qd.func in newer Genesis versions
def mpr_refine_portal(...):
    ret = 1
    while True:
        direction = mpr_portal_dir(...)
        ...
        mpr_expand_portal(...)
    return ret

This looks inconsistent with nearby collision code:

  • mpr_find_penetration() stops when iterations > mpr_info.CCD_ITERATIONS[None].
  • mpr_discover_portal() has a num_trials == 15 cap for a documented rare deadlock condition.
  • GJK/EPA paths also use explicit iteration limits.

I found this while investigating a rare, non-deterministic hang where scene.step() stops returning during GPU rigid-body simulation. Once triggered, the call does not recover on its own. In some runs, the machine later ends up in a blue screen, which is consistent with a GPU kernel that never returns and eventually trips the driver/system.

Tracing narrowed the last known simulation stage to convex-vs-convex narrow-phase collision detection, which led me to inspect this MPR path. As an additional validation point, after locally changing this loop to be finite, the hang stopped occurring in my reproduction runs. This makes the unbounded mpr_refine_portal() loop the strongest current suspect.
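
For illustration only, the general shape of a capped refinement loop can be sketched in plain Python. This is not the actual local patch and not Genesis code: `refine_with_cap`, `NOT_CONVERGED`, `MAX_REFINE_ITERATIONS`, and the `step` callback are hypothetical names, and the cap of 50 is an arbitrary assumed value. The real fix would have to express the same pattern inside the device function itself.

```python
from typing import Callable, Optional

# Hypothetical names and values, chosen only for this illustration.
NOT_CONVERGED = 1           # status returned when the cap is hit without a decision
MAX_REFINE_ITERATIONS = 50  # assumed cap; not a value taken from Genesis


def refine_with_cap(step: Callable[[], Optional[int]],
                    max_iterations: int = MAX_REFINE_ITERATIONS) -> int:
    """Call step() until it reports a status or the iteration cap is reached.

    step() returns None to request another iteration, or an int status to stop.
    """
    ret = NOT_CONVERGED
    for _ in range(max_iterations):
        status = step()
        if status is not None:
            ret = status
            break
    return ret


# A step that never decides: the loop still terminates after the cap.
calls = {"n": 0}


def stalled_step() -> Optional[int]:
    calls["n"] += 1
    return None


stalled_status = refine_with_cap(stalled_step)

# A step that decides on its third call: the loop exits early with that status.
results = iter([None, None, 0])
converged_status = refine_with_cap(lambda: next(results))
```

The point of the sketch is that a stalled refinement becomes a bounded amount of work with a distinguishable status, which the caller can treat as "no contact found" or as an error, instead of a kernel that never returns.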

I also reproduced a non-returning GPU synchronization hang in a temporary test environment using genesis-world==0.4.7 with the newer Quadrants backend. That version still has the same unbounded mpr_refine_portal() loop.

Steps to Reproduce

I cannot currently provide a compact guaranteed reproducer. My setup is project-specific, and the issue occurs probabilistically during large batched reinforcement-learning runs.

The cases where I observe the hang are consistent with sustained, nearly planar contacts during batched simulation. The contacting geometry includes convex mesh feet with relatively flat contact surfaces against flat stage geometry. This may produce degenerate or non-unique support points during MPR portal refinement.
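
To make "non-unique support points" concrete, here is a small plain-Python sketch, independent of Genesis, that lists every vertex of a convex shape tied for the maximum along a direction. `support_candidates` and `unit_box` are hypothetical names used only here. Whether such ties are what actually stalls the refinement loop is unverified.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def support_candidates(vertices: Sequence[Vec3], direction: Vec3,
                       tol: float = 1e-9) -> List[Vec3]:
    """Return all vertices whose dot product with direction is within tol of the maximum."""
    dots = [sum(v_i * d_i for v_i, d_i in zip(v, direction)) for v in vertices]
    best = max(dots)
    return [v for v, d in zip(vertices, dots) if best - d <= tol]


# Eight corners of an axis-aligned unit box centred at the origin.
unit_box = [(x, y, z)
            for x in (-0.5, 0.5)
            for y in (-0.5, 0.5)
            for z in (-0.5, 0.5)]

# Direction exactly normal to the bottom face: all four bottom corners tie.
flat = support_candidates(unit_box, (0.0, 0.0, -1.0))

# Slightly tilted direction: the tie is broken and one corner wins.
tilted = support_candidates(unit_box, (0.1, 0.2, -1.0))
```

With a direction exactly along the face normal, any of the four bottom corners is a valid support point, so the one actually returned depends on tie-breaking and floating-point noise. A flat foot resting on flat stage geometry keeps the search direction close to this case for many consecutive steps.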

The key localized observations are:

  1. scene.step() stops returning during GPU rigid-body simulation.
  2. The last known simulation stage is convex-vs-convex narrow-phase collision detection.
  3. The active path reaches MPR convex collision.
  4. mpr_refine_portal() has an unbounded while True loop.
  5. Making this loop finite stopped the hang in my reproduction runs.

Expected Behavior

Convex collision detection should not be able to enter a non-returning device loop. scene.step() should either complete normally or fail/abort gracefully instead of hanging indefinitely.

Screenshots/Videos

No response

Relevant log output

No useful error log is emitted because the issue is a non-returning hang. Once the hang is triggered, `scene.step()` does not return, so no Python exception or later diagnostic log is produced.

Environment

Release version or Commit ID

Observed locally with:

  • genesis==0.3.13
  • genesis-world==0.4.7

The latest public main branch still appeared to have the same unbounded loop as of 2026-05-20.

Additional Context

No response
