[Bug]: Possible non-returning GPU kernel: mpr_refine_portal() has no iteration cap #2815

### Bug Description

`mpr_refine_portal()` in `genesis/engine/solvers/rigid/collider/mpr.py` appears to be able to run indefinitely because its portal-refinement loop has no explicit iteration cap:

```python
@ti.func  # @qd.func in newer Genesis versions
def mpr_refine_portal(...):
    ret = 1
    while True:
        direction = mpr_portal_dir(...)
        ...
        mpr_expand_portal(...)
    return ret
```

This looks inconsistent with nearby collision code:

- `mpr_find_penetration()` stops when `iterations > mpr_info.CCD_ITERATIONS[None]`.
- `mpr_discover_portal()` has a `num_trials == 15` cap for a documented rare deadlock condition.
- GJK/EPA paths also use explicit iteration limits.

I found this while investigating a rare, non-deterministic hang in which `scene.step()` stops returning during GPU rigid-body simulation. Once triggered, the call never recovers on its own. In some runs the machine later blue-screens, which is consistent with a GPU kernel that never returns and eventually takes down the driver or the system.

Tracing narrowed the last known simulation stage to convex-vs-convex narrow-phase collision detection, which led me to inspect this MPR path. As an additional validation point, after locally changing this loop to be finite, the hang stopped occurring in my reproduction runs. This makes the unbounded `mpr_refine_portal()` loop the strongest current suspect.
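For reference, here is a minimal plain-Python sketch of the general pattern I mean by "finite". It is not the actual patch and not Genesis code: `refine_with_cap`, `MAX_REFINE_ITERATIONS`, the value 64, and the return codes are all hypothetical. In the real kernel the cap could presumably reuse `mpr_info.CCD_ITERATIONS[None]`, as `mpr_find_penetration()` already does.

```python
from typing import Callable, Tuple

# Hypothetical cap, chosen only for illustration.
MAX_REFINE_ITERATIONS = 64


def refine_with_cap(
    step: Callable[[int], bool],
    max_iterations: int = MAX_REFINE_ITERATIONS,
) -> Tuple[int, int]:
    """Call `step` until it reports completion or the cap is reached.

    `step(i)` stands in for one portal-refinement pass and returns True
    when refinement is finished. Returns (ret, iterations), where ret is
    0 on completion and -1 if the cap was hit (hypothetical codes).
    """
    ret = -1
    iterations = 0
    while iterations < max_iterations:  # bounded, unlike `while True`
        iterations += 1
        if step(iterations):
            ret = 0
            break
    return ret, iterations
```

With a step that never finishes, `refine_with_cap(lambda i: False)` returns `(-1, 64)` instead of spinning forever. A step that finishes on its third pass returns `(0, 3)`. How the caller should treat the cap-hit case (report no contact, or fall back to another algorithm) is a design decision for the maintainers.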

I also reproduced a non-returning GPU synchronization hang in a temporary test environment using `genesis-world==0.4.7` with the newer Quadrants backend. That version still has the same unbounded `mpr_refine_portal()` loop.

### Steps to Reproduce

I cannot currently provide a compact guaranteed reproducer. My setup is project-specific, and the issue occurs probabilistically during large batched reinforcement-learning runs.

The cases where I observe the hang are consistent with sustained, nearly planar contacts during batched simulation. The contacting geometry includes convex mesh feet with relatively flat contact surfaces against flat stage geometry. This may produce degenerate or non-unique support points during MPR portal refinement.
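To illustrate what I mean by non-unique support points, here is a small plain-Python sketch. It is independent of Genesis; the `sole` quad and the helper names are hypothetical stand-ins for a flat convex-mesh face. When the search direction is exactly normal to the face, every vertex is an equally valid support point, and a tiny change in direction moves the chosen vertex across the whole face.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def support_index(vertices: Sequence[Vec3], direction: Vec3) -> int:
    """Index of the first vertex that is farthest along `direction`."""
    return max(range(len(vertices)), key=lambda i: dot(vertices[i], direction))


def count_maximal(vertices: Sequence[Vec3], direction: Vec3, tol: float = 1e-12) -> int:
    """Number of vertices within `tol` of the maximum projection."""
    projections = [dot(v, direction) for v in vertices]
    best = max(projections)
    return sum(1 for p in projections if best - p <= tol)


# Hypothetical flat foot sole: a 2 x 2 square lying in the plane z = 0.
sole = [
    (-1.0, -1.0, 0.0),
    (1.0, -1.0, 0.0),
    (1.0, 1.0, 0.0),
    (-1.0, 1.0, 0.0),
]
```

Here `count_maximal(sole, (0.0, 0.0, -1.0))` is 4, so the support point is ambiguous. `support_index(sole, (1e-6, 0.0, -1.0))` is 1, while `support_index(sole, (-1e-6, 0.0, -1.0))` is 0, so a direction change of about 2e-6 moves the support vertex by 2 units. Whether this is what keeps `mpr_refine_portal()` from terminating is only my hypothesis.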

The key observations that localize the problem are:

1. `scene.step()` stops returning during GPU rigid-body simulation.
2. The last known simulation stage is convex-vs-convex narrow-phase collision detection.
3. The active path reaches MPR convex collision.
4. `mpr_refine_portal()` has an unbounded `while True` loop.
5. Making this loop finite stopped the hang in my reproduction runs.
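Since I have no compact reproducer, a process-level watchdog is one way to at least detect the hang automatically in long runs. The sketch below uses only the Python standard library. `finishes_within` is a hypothetical helper, and the script passed to it would be whatever project-specific code drives `scene.step()`. I am assuming the child process can still be killed after a timeout, which may not hold once a GPU kernel is truly stuck.

```python
import subprocess
import sys


def finishes_within(script: str, timeout_s: float) -> bool:
    """Run `script` in a fresh Python process.

    Returns True if the process exits within `timeout_s` seconds,
    False if it had to be killed after the timeout.
    """
    try:
        # subprocess.run kills the child and re-raises on timeout.
        subprocess.run([sys.executable, "-c", script], timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

For example, `finishes_within("while True: pass", 1.0)` returns `False`, while a script that exits promptly returns `True`. Running each training chunk this way would turn a silent hang into a detectable timeout, though it does nothing to fix the underlying loop.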


### Expected Behavior

Convex collision detection should not be able to enter a non-returning device loop. `scene.step()` should either complete normally or fail/abort gracefully instead of hanging indefinitely.

### Screenshots/Videos

_No response_

### Relevant log output

No useful error log is emitted because the issue is a non-returning hang. Once the hang is triggered, `scene.step()` does not return, so no Python exception or later diagnostic log is produced.

### Environment

- OS: Windows
- Backend: GPU / CUDA
- GPU: NVIDIA GeForce RTX 4090
- Local original Genesis version: 0.3.13
- Local newer Genesis test version: `genesis-world==0.4.7`
- Quadrants in the 0.4.7 test: version `0.8.0`, commit `a22cc2de`
- Relevant upstream file checked on 2026-05-20:
  - https://raw.githubusercontent.com/Genesis-Embodied-AI/Genesis/main/genesis/engine/solvers/rigid/collider/mpr.py
  - https://raw.githubusercontent.com/Genesis-Embodied-AI/genesis-world/main/genesis/engine/solvers/rigid/collider/mpr.py


### Release version or Commit ID

Observed locally with:

- `genesis==0.3.13`
- `genesis-world==0.4.7`

The latest public `main` branch still appeared to have the same unbounded loop as of 2026-05-20.

### Additional Context

_No response_
