[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations. by hughperkins · Pull Request #2830 · Genesis-Embodied-AI/genesis-world

hughperkins · 2026-05-25T13:08:51Z

Description

Cooperative warp-per-env optimization of the link-pair contact dedup kernel (func_prune_contacts), plus auto-enable / gating logic.

This branch is rebased onto #2831 (Alexis's fix_contact_pruning_batched). #2831's three changes are incorporated verbatim into the cooperative kernel:

Memory-fence scratch write between the lower-hull and upper-hull passes of Andrew's monotone chain. Without it, on Quadrants Metal _B >= 2 the upper-hull reads of contact_hull_stack don't observe the lower-hull writes and every candidate is kept (hull size == bucket size, no pruning). Standalone reproduction: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py.
New prune_hull_collinear_tol tunable (default 1e-3) separated from the depth coplanarity tol. Lets users tune the hull-popper cross-product slop independently of the depth gate.
Deep-penetration restore now uses MAX of hull penetrations instead of average.

Once #2831 merges, this branch's base should be moved back to origin/main and #2831's commit will be absorbed.

Cooperative-kernel optimizations (from previous iteration of this branch)

32 threads per env (qd.loop_config(name="prune_contacts_coop", block_dim=32)).
Coop reductions for the per-bucket mean-normal / centroid sum and the coplanarity max-depth / max-in-plane-r2 check (qd.simt.subgroup.reduce_all_add_tiled / reduce_all_max_tiled).
Stride-tid parallel writes for contact_keep init, phase-1a key+idx init, projection (sort_key + proj_v), mark-drop, lex_idx init.
Lane-0 serial only where required: phase-1a insertion sort, phase-2 lex sort, Andrew monotone-chain hull build, the deep-penetration restore, and the fused phase 3 (compact + spatial sort + 9-field cycle-permute).
qd.simt.subgroup.sync() between coop and serial sections.

Auto-enable gating

Dedup is on by default whenever the scene has multi-geom links, nonconvex non-terrain geoms, or terrain.
CPU gated off (cooperative kernel adds overhead without parallel speedup on a serial backend).
n_envs > 8192 gated off (GPU already grid-saturated; per-env serial sub-phases dominate).
Metal works now (post-[BUG FIX] Fix contact point pruning on Apple Metal. #2831). The previous Metal exclusion is removed.

`test_contact_pruning`

No longer skipped on darwin (depended on the Metal-exclude path that #2831 makes unnecessary). Parametrized on gs.gpu only.

Related Issue

Builds on #2829 (Alexis's initial dedup PR) and #2831 (Alexis's Metal codegen bugfix).

How Has This Been / Can This Be Tested?

Cluster unit tests on 8 GPUs (cmp-tooling/unit_tests_cluster.py, ndarray + field backends).
Cluster benchmarks on 8 GPUs against WandB baseline (cmp-tooling/bench_cluster_wandb.py).
Genesis CI: ubuntu + macOS + windows + production.
test_contact_pruning on gs.gpu (CUDA + Metal).
Standalone Quadrants subgroup primitive reproductions for the Metal bug under perso_hugh/prot/qd_*_repro.py.

Checklist

I followed the contribution guidelines.
I tagged the title correctly.
I tested my changes and added instructions.

…is-Embodied-AI#2831 Rebase of my cooperative-warp dedup work onto Alexis's Genesis-Embodied-AI#2831 (fix_contact_pruning_batched) which fixes the Quadrants Metal codegen bug that was blocking dedup on macOS. PR Genesis-Embodied-AI#2831's contact.py changes are incorporated verbatim into my cooperative kernel: 1. Memory-fence scratch write between the lower-hull and upper-hull passes of Andrew's monotone chain (contact_hull_stack[max_contact_pairs - 1, i_b] = 0). Without this, on Metal _B >= 2 the upper-hull reads of contact_hull_stack don't observe the lower-hull writes and every candidate is kept (hull size == bucket size, no pruning). Standalone repro: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py 2. New tunable prune_hull_collinear_tol (1e-3 default) separated from the depth coplanarity tol. Lets users tune the hull-popper cross- product slop independently of the depth gate. 3. Deep-penetration restore threshold changed from average to MAX of hull penetrations. A non-hull contact whose penetration substantially exceeds the deepest hull vertex represents a distinct physical support (deep middle of a long body) that the support-polygon argument doesn't authorize dropping. Side cleanups now possible because Genesis-Embodied-AI#2831 fixes the Metal bug: * Removed the gs.backend not in (gs.cuda, gs.amdgpu) guard from collider.py. Dedup is now enabled on Metal too; only CPU is still excluded (the cooperative kernel adds overhead without parallel speedup on a serial backend). * Removed the @pytest.mark.skipif(darwin) from test_contact_pruning; the test now runs on every GPU backend. The cooperative-warp structure (32 threads/env, qd.simt.subgroup reductions for the per-bucket mean-normal / centroid, coop projection and mark-drop, fused phase 3 compact+spatial-sort) is preserved from the previous branch. Note: this branch is currently based on duburcqa:fix_contact_pruning_batched (Genesis-Embodied-AI#2831). Once Genesis-Embodied-AI#2831 merges, the branch base should be moved back to origin/main and the Genesis-Embodied-AI#2831 commit will simply be absorbed. Backup of the pre-rebase tip: hp/dedup-alexis-perf-9d-stage3-r2-pre-2831-rebase-backup.

hughperkins · 2026-05-25T15:53:25Z

Rebased onto #2831 (commit 6582830). #2831's three contact.py changes (memory-fence write, prune_hull_collinear_tol, hull_pen_max) are incorporated. Removed the Metal-only exclusion in collider.py and the darwin skipif on test_contact_pruning since #2831 fixes the Metal codegen issue.

Minimal standalone reproduction for the Quadrants Metal codegen bug: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py — 8-point Andrew's monotone chain with B=4 envs. Without the PR #2831 fence, hull size = bucket size on Metal; with the fence, hull size = 4 (the 4 corners). Passes on CUDA for both kernels.

hughperkins · 2026-05-25T16:46:39Z

Status update after rebasing onto upstream #2831:

Benchmarks (GPU, 4096 envs, vs WandB main@b28fdb8c):

dex_hand: +4.83%
g1_fall: +3.92%
box_pyramid_3..6 / duck_in_box_*: within noise (-1.2% to +1.4%)
No GPU regressions

Unit tests (cluster, --backend ndarray, GPU+CPU): 2 failed, 712 passed

The 2 failures are test_smooth_box_no_drift[False-capsule-prim-prim] and [False-capsule-prim-mesh]. I traced these to the underlying #2831 PR itself, not to this PR's deskai6 changes:

Ran the failing tests on duburcqa/fix_contact_pruning_batched directly (zero deskai6 commits on top) → same 2 failures, byte-identical numerical drift (Max abs diff 0.01361053).
The failing scenes are single-link / single-geom (a primitive capsule on a primitive or convex mesh box), so dedup is gated off (link_pair_pruning_supported = False) and func_prune_contacts is never invoked.

Reported upstream at #2831 (see comment thread there). They will hit main as soon as #2831 merges.

CI production-unit_tests-ndarray and production-unit_tests-field show these same 2 failures here, confirming the chain.

…backends + n_envs Adopt Alexis's fused func_clamp_prune_and_sort_contacts kernel (single per-env serial pass that does clamp + optional pruning + optional spatial sort, gated at compile time via qd.static). This supersedes the deskai6 cooperative warp-per-env split into func_prune_contacts + func_clamp_and_sort_contacts: the fusion removes one kernel launch + one full 11-field permute pass per step, which on net matches or beats the cooperative-warp approach without needing any backend-specific gating. Remove the deskai6 gates that previously disabled dedup on CPU and at n_envs > 8192. Both were tied to the cooperative-kernel approach (warp-cooperative loops serialise on CPU; n_envs > 8192 oversaturates the GPU grid). The fused per-env serial kernel has no such overhead, so dedup now runs unconditionally whenever the scene is eligible (link_pair_pruning_supported and not autodiff). Also remove _effective_pruning_tolerance / _DEDUP_DEFAULT_TOLERANCE / _DEDUP_MAX_N_ENVS: the kernel now reads contact_pruning_tolerance directly from the user option (upstream default 0.1 post Genesis-Embodied-AI#2831), matching the fused kernel's expectation. Remove the @pytest.mark.parametrize backend=gs.gpu restriction on test_contact_pruning so it runs on CPU + GPU per matrix.

…s collapsed it into tol*max_in_plane_r2)

…ernel for CPU Re-add `func_prune_contacts_coop`, the 32-lane warp-cooperative variant of the fused clamp+prune+sort path. Same pruning logic as Alexis's serial `func_clamp_prune_and_sort_contacts` (depth coplanarity gate, Andrew's monotone chain, hull-pen-max deep-penetration restore, Metal memory-fence workaround), but the per-env work is spread across 32 warp lanes: - PARALLEL: contact_keep init, phase-1a (link-pair) key+idx init, phase-2 mean-normal / centroid reduction via reduce_all_add_tiled, coplanarity check reduction via reduce_all_max_tiled, in-plane projection writes. - SERIAL on lane 0: insertion sorts (phase 1a + lex), bucket walk, monotone chain, hull mark + deep-penetration restore, and the final fused compact + spatial sort with sentinel +inf for dropped contacts. Dispatched only when gs.backend != gs.cpu AND scene is dedup-eligible AND not requires_grad AND contact_pruning_tolerance > 0. Everything else (CPU, autodiff, ineligible scenes) falls back to the serial fused kernel, which has internal qd.static gates that no-op the prune phases when they don't apply. So dedup still runs on every backend as before; the change just upgrades the GPU path to coop.

The cooperative warp-per-env dedup kernel (func_prune_contacts_coop) only beats the serial fused kernel while the GPU has spare occupancy. Once _B exceeds half the GPU's CUDA-core count the coop launch fills the device, the warp-cooperation overhead stops paying off, and Alexis's serial fused kernel (one thread per env) is faster. The half-core threshold also leaves headroom for the other kernels already sharing the SMs (narrowphase MPR/GJK, constraint solver) so we don't push them into second-wave scheduling on tight scenes like dex_hand and g1_fall. Concrete thresholds (gpu_cores / 2): - RTX 5090 (170 SMs x 128): _B <= 10880 - RTX 6000 Blackwell (188 x 128): _B <= 12032 - MI300X (304 CUs x 64): _B <= 9728 - Fallback (16384): _B <= 8192 Factored the per-backend gpu_cores computation out of the _use_split_narrowphase branch into a single Collider._gpu_cores field so both the contact0 chunking heuristic and the new dispatch gate share the same number. No behavior change for split-narrowphase chunking.

github-actions · 2026-05-25T20:45:25Z

🔴 Benchmark Regression Detected ➡️ Report

duburcqa and others added 2 commits May 25, 2026 17:33

Fix contact point pruning in batched sim.

1957db6

hughperkins force-pushed the hp/dedup-alexis-perf-9d-stage3-r2 branch from a8230f6 to 6582830 Compare May 25, 2026 15:51

chore: fix F541 f-string without placeholders in CPU dedup-gate log

edec990

hughperkins mentioned this pull request May 25, 2026

[BUG FIX] Fix contact point pruning on Apple Metal. #2831

Merged

duburcqa and others added 5 commits May 25, 2026 19:41

Fuse pruning and clamp + sort.

c5a3579

Remove stale prune_hull_collinear_tol after fused-kernel merge (Alexi…

ec50db0

…s collapsed it into tol*max_in_plane_r2)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations.#2830

[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations.#2830
hughperkins wants to merge 8 commits into
Genesis-Embodied-AI:mainfrom
hughperkins:hp/dedup-alexis-perf-9d-stage3-r2

hughperkins commented May 25, 2026 •

edited

Loading

Uh oh!

hughperkins commented May 25, 2026

Uh oh!

hughperkins commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hughperkins commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Cooperative-kernel optimizations (from previous iteration of this branch)

Auto-enable gating

test_contact_pruning

Related Issue

How Has This Been / Can This Be Tested?

Checklist

Uh oh!

hughperkins commented May 25, 2026

Uh oh!

hughperkins commented May 25, 2026

Uh oh!

github-actions Bot commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hughperkins commented May 25, 2026 •

edited

Loading

`test_contact_pruning`