Skip to content

[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations.#2830

Draft
hughperkins wants to merge 8 commits into
Genesis-Embodied-AI:mainfrom
hughperkins:hp/dedup-alexis-perf-9d-stage3-r2
Draft

[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations.#2830
hughperkins wants to merge 8 commits into
Genesis-Embodied-AI:mainfrom
hughperkins:hp/dedup-alexis-perf-9d-stage3-r2

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

@hughperkins hughperkins commented May 25, 2026

Description

Cooperative warp-per-env optimization of the link-pair contact dedup kernel (func_prune_contacts), plus auto-enable / gating logic.

This branch is rebased onto #2831 (Alexis's fix_contact_pruning_batched). #2831's three changes are incorporated verbatim into the cooperative kernel:

  1. Memory-fence scratch write between the lower-hull and upper-hull passes of Andrew's monotone chain. Without it, on Quadrants Metal _B >= 2 the upper-hull reads of contact_hull_stack don't observe the lower-hull writes and every candidate is kept (hull size == bucket size, no pruning). Standalone reproduction: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py.
  2. New prune_hull_collinear_tol tunable (default 1e-3) separated from the depth coplanarity tol. Lets users tune the hull-popper cross-product slop independently of the depth gate.
  3. Deep-penetration restore now uses MAX of hull penetrations instead of average.

Once #2831 merges, this branch's base should be moved back to origin/main and #2831's commit will be absorbed.

Cooperative-kernel optimizations (from previous iteration of this branch)

  • 32 threads per env (qd.loop_config(name="prune_contacts_coop", block_dim=32)).
  • Coop reductions for the per-bucket mean-normal / centroid sum and the coplanarity max-depth / max-in-plane-r2 check (qd.simt.subgroup.reduce_all_add_tiled / reduce_all_max_tiled).
  • Stride-tid parallel writes for contact_keep init, phase-1a key+idx init, projection (sort_key + proj_v), mark-drop, lex_idx init.
  • Lane-0 serial only where required: phase-1a insertion sort, phase-2 lex sort, Andrew monotone-chain hull build, the deep-penetration restore, and the fused phase 3 (compact + spatial sort + 9-field cycle-permute).
  • qd.simt.subgroup.sync() between coop and serial sections.

Auto-enable gating

  • Dedup is on by default whenever the scene has multi-geom links, nonconvex non-terrain geoms, or terrain.
  • CPU gated off (cooperative kernel adds overhead without parallel speedup on a serial backend).
  • n_envs > 8192 gated off (GPU already grid-saturated; per-env serial sub-phases dominate).
  • Metal works now (post-[BUG FIX] Fix contact point pruning on Apple Metal. #2831). The previous Metal exclusion is removed.

test_contact_pruning

No longer skipped on darwin (depended on the Metal-exclude path that #2831 makes unnecessary). Parametrized on gs.gpu only.

Related Issue

Builds on #2829 (Alexis's initial dedup PR) and #2831 (Alexis's Metal codegen bugfix).

How Has This Been / Can This Be Tested?

  • Cluster unit tests on 8 GPUs (cmp-tooling/unit_tests_cluster.py, ndarray + field backends).
  • Cluster benchmarks on 8 GPUs against WandB baseline (cmp-tooling/bench_cluster_wandb.py).
  • Genesis CI: ubuntu + macOS + windows + production.
  • test_contact_pruning on gs.gpu (CUDA + Metal).
  • Standalone Quadrants subgroup primitive reproductions for the Metal bug under perso_hugh/prot/qd_*_repro.py.

Checklist

  • I followed the contribution guidelines.
  • I tagged the title correctly.
  • I tested my changes and added instructions.

duburcqa and others added 2 commits May 25, 2026 17:33
…is-Embodied-AI#2831

Rebase of my cooperative-warp dedup work onto Alexis's Genesis-Embodied-AI#2831
(fix_contact_pruning_batched) which fixes the Quadrants Metal codegen
bug that was blocking dedup on macOS. PR Genesis-Embodied-AI#2831's contact.py changes are
incorporated verbatim into my cooperative kernel:

1. Memory-fence scratch write between the lower-hull and upper-hull
   passes of Andrew's monotone chain (contact_hull_stack[max_contact_pairs - 1, i_b] = 0).
   Without this, on Metal _B >= 2 the upper-hull reads of
   contact_hull_stack don't observe the lower-hull writes and every
   candidate is kept (hull size == bucket size, no pruning).
   Standalone repro: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py

2. New tunable prune_hull_collinear_tol (1e-3 default) separated from
   the depth coplanarity tol. Lets users tune the hull-popper cross-
   product slop independently of the depth gate.

3. Deep-penetration restore threshold changed from average to MAX of
   hull penetrations. A non-hull contact whose penetration substantially
   exceeds the deepest hull vertex represents a distinct physical
   support (deep middle of a long body) that the support-polygon
   argument doesn't authorize dropping.

Side cleanups now possible because Genesis-Embodied-AI#2831 fixes the Metal bug:

* Removed the gs.backend not in (gs.cuda, gs.amdgpu) guard from
  collider.py. Dedup is now enabled on Metal too; only CPU is still
  excluded (the cooperative kernel adds overhead without parallel
  speedup on a serial backend).
* Removed the @pytest.mark.skipif(darwin) from test_contact_pruning;
  the test now runs on every GPU backend.

The cooperative-warp structure (32 threads/env, qd.simt.subgroup
reductions for the per-bucket mean-normal / centroid, coop projection
and mark-drop, fused phase 3 compact+spatial-sort) is preserved from
the previous branch.

Note: this branch is currently based on duburcqa:fix_contact_pruning_batched (Genesis-Embodied-AI#2831).
Once Genesis-Embodied-AI#2831 merges, the branch base should be moved back to origin/main and
the Genesis-Embodied-AI#2831 commit will simply be absorbed.

Backup of the pre-rebase tip: hp/dedup-alexis-perf-9d-stage3-r2-pre-2831-rebase-backup.
@hughperkins hughperkins force-pushed the hp/dedup-alexis-perf-9d-stage3-r2 branch from a8230f6 to 6582830 Compare May 25, 2026 15:51
@hughperkins
Copy link
Copy Markdown
Collaborator Author

Rebased onto #2831 (commit 6582830). #2831's three contact.py changes (memory-fence write, prune_hull_collinear_tol, hull_pen_max) are incorporated. Removed the Metal-only exclusion in collider.py and the darwin skipif on test_contact_pruning since #2831 fixes the Metal codegen issue.

Minimal standalone reproduction for the Quadrants Metal codegen bug: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py — 8-point Andrew's monotone chain with B=4 envs. Without the PR #2831 fence, hull size = bucket size on Metal; with the fence, hull size = 4 (the 4 corners). Passes on CUDA for both kernels.

@hughperkins
Copy link
Copy Markdown
Collaborator Author

Status update after rebasing onto upstream #2831:

Benchmarks (GPU, 4096 envs, vs WandB main@b28fdb8c):

  • dex_hand: +4.83%
  • g1_fall: +3.92%
  • box_pyramid_3..6 / duck_in_box_*: within noise (-1.2% to +1.4%)
  • No GPU regressions

Unit tests (cluster, --backend ndarray, GPU+CPU): 2 failed, 712 passed

The 2 failures are test_smooth_box_no_drift[False-capsule-prim-prim] and [False-capsule-prim-mesh]. I traced these to the underlying #2831 PR itself, not to this PR's deskai6 changes:

  • Ran the failing tests on duburcqa/fix_contact_pruning_batched directly (zero deskai6 commits on top) → same 2 failures, byte-identical numerical drift (Max abs diff 0.01361053).
  • The failing scenes are single-link / single-geom (a primitive capsule on a primitive or convex mesh box), so dedup is gated off (link_pair_pruning_supported = False) and func_prune_contacts is never invoked.

Reported upstream at #2831 (see comment thread there). They will hit main as soon as #2831 merges.

CI production-unit_tests-ndarray and production-unit_tests-field show these same 2 failures here, confirming the chain.

duburcqa and others added 5 commits May 25, 2026 19:41
…backends + n_envs

Adopt Alexis's fused func_clamp_prune_and_sort_contacts kernel (single per-env
serial pass that does clamp + optional pruning + optional spatial sort, gated
at compile time via qd.static). This supersedes the deskai6 cooperative
warp-per-env split into func_prune_contacts + func_clamp_and_sort_contacts:
the fusion removes one kernel launch + one full 11-field permute pass per
step, which on net matches or beats the cooperative-warp approach without
needing any backend-specific gating.

Remove the deskai6 gates that previously disabled dedup on CPU and at
n_envs > 8192. Both were tied to the cooperative-kernel approach
(warp-cooperative loops serialise on CPU; n_envs > 8192 oversaturates the
GPU grid). The fused per-env serial kernel has no such overhead, so dedup
now runs unconditionally whenever the scene is eligible
(link_pair_pruning_supported and not autodiff).

Also remove _effective_pruning_tolerance / _DEDUP_DEFAULT_TOLERANCE /
_DEDUP_MAX_N_ENVS: the kernel now reads contact_pruning_tolerance directly
from the user option (upstream default 0.1 post Genesis-Embodied-AI#2831), matching the fused
kernel's expectation. Remove the @pytest.mark.parametrize backend=gs.gpu
restriction on test_contact_pruning so it runs on CPU + GPU per matrix.
…ernel for CPU

Re-add `func_prune_contacts_coop`, the 32-lane warp-cooperative variant of
the fused clamp+prune+sort path. Same pruning logic as Alexis's serial
`func_clamp_prune_and_sort_contacts` (depth coplanarity gate, Andrew's
monotone chain, hull-pen-max deep-penetration restore, Metal memory-fence
workaround), but the per-env work is spread across 32 warp lanes:

  - PARALLEL: contact_keep init, phase-1a (link-pair) key+idx init,
    phase-2 mean-normal / centroid reduction via reduce_all_add_tiled,
    coplanarity check reduction via reduce_all_max_tiled, in-plane
    projection writes.
  - SERIAL on lane 0: insertion sorts (phase 1a + lex), bucket walk,
    monotone chain, hull mark + deep-penetration restore, and the final
    fused compact + spatial sort with sentinel +inf for dropped contacts.

Dispatched only when gs.backend != gs.cpu AND scene is dedup-eligible AND
not requires_grad AND contact_pruning_tolerance > 0. Everything else
(CPU, autodiff, ineligible scenes) falls back to the serial fused kernel,
which has internal qd.static gates that no-op the prune phases when they
don't apply. So dedup still runs on every backend as before; the change
just upgrades the GPU path to coop.
The cooperative warp-per-env dedup kernel (func_prune_contacts_coop)
only beats the serial fused kernel while the GPU has spare occupancy.
Once _B exceeds half the GPU's CUDA-core count the coop launch fills
the device, the warp-cooperation overhead stops paying off, and
Alexis's serial fused kernel (one thread per env) is faster. The
half-core threshold also leaves headroom for the other kernels
already sharing the SMs (narrowphase MPR/GJK, constraint solver) so
we don't push them into second-wave scheduling on tight scenes like
dex_hand and g1_fall.

Concrete thresholds (gpu_cores / 2):
- RTX 5090 (170 SMs x 128):       _B <= 10880
- RTX 6000 Blackwell (188 x 128): _B <= 12032
- MI300X (304 CUs x 64):          _B <=  9728
- Fallback (16384):               _B <=  8192

Factored the per-backend gpu_cores computation out of the
_use_split_narrowphase branch into a single Collider._gpu_cores
field so both the contact0 chunking heuristic and the new dispatch
gate share the same number. No behavior change for split-narrowphase
chunking.
@github-actions
Copy link
Copy Markdown

🔴 Benchmark Regression Detected ➡️ Report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants