[MISC] Enable link-level dedupe for gpu and n_nenvs <= 8192 and add optimizations.#2830
Conversation
…is-Embodied-AI#2831 Rebase of my cooperative-warp dedup work onto Alexis's Genesis-Embodied-AI#2831 (fix_contact_pruning_batched) which fixes the Quadrants Metal codegen bug that was blocking dedup on macOS. PR Genesis-Embodied-AI#2831's contact.py changes are incorporated verbatim into my cooperative kernel: 1. Memory-fence scratch write between the lower-hull and upper-hull passes of Andrew's monotone chain (contact_hull_stack[max_contact_pairs - 1, i_b] = 0). Without this, on Metal _B >= 2 the upper-hull reads of contact_hull_stack don't observe the lower-hull writes and every candidate is kept (hull size == bucket size, no pruning). Standalone repro: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py 2. New tunable prune_hull_collinear_tol (1e-3 default) separated from the depth coplanarity tol. Lets users tune the hull-popper cross- product slop independently of the depth gate. 3. Deep-penetration restore threshold changed from average to MAX of hull penetrations. A non-hull contact whose penetration substantially exceeds the deepest hull vertex represents a distinct physical support (deep middle of a long body) that the support-polygon argument doesn't authorize dropping. Side cleanups now possible because Genesis-Embodied-AI#2831 fixes the Metal bug: * Removed the gs.backend not in (gs.cuda, gs.amdgpu) guard from collider.py. Dedup is now enabled on Metal too; only CPU is still excluded (the cooperative kernel adds overhead without parallel speedup on a serial backend). * Removed the @pytest.mark.skipif(darwin) from test_contact_pruning; the test now runs on every GPU backend. The cooperative-warp structure (32 threads/env, qd.simt.subgroup reductions for the per-bucket mean-normal / centroid, coop projection and mark-drop, fused phase 3 compact+spatial-sort) is preserved from the previous branch. Note: this branch is currently based on duburcqa:fix_contact_pruning_batched (Genesis-Embodied-AI#2831). Once Genesis-Embodied-AI#2831 merges, the branch base should be moved back to origin/main and the Genesis-Embodied-AI#2831 commit will simply be absorbed. Backup of the pre-rebase tip: hp/dedup-alexis-perf-9d-stage3-r2-pre-2831-rebase-backup.
a8230f6 to
6582830
Compare
|
Rebased onto #2831 (commit 6582830). #2831's three contact.py changes (memory-fence write, prune_hull_collinear_tol, hull_pen_max) are incorporated. Removed the Metal-only exclusion in collider.py and the darwin skipif on test_contact_pruning since #2831 fixes the Metal codegen issue. Minimal standalone reproduction for the Quadrants Metal codegen bug: perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py — 8-point Andrew's monotone chain with B=4 envs. Without the PR #2831 fence, hull size = bucket size on Metal; with the fence, hull size = 4 (the 4 corners). Passes on CUDA for both kernels. |
|
Status update after rebasing onto upstream #2831: Benchmarks (GPU, 4096 envs, vs WandB main@b28fdb8c):
Unit tests (cluster, --backend ndarray, GPU+CPU): 2 failed, 712 passed The 2 failures are
Reported upstream at #2831 (see comment thread there). They will hit main as soon as #2831 merges. CI |
…backends + n_envs Adopt Alexis's fused func_clamp_prune_and_sort_contacts kernel (single per-env serial pass that does clamp + optional pruning + optional spatial sort, gated at compile time via qd.static). This supersedes the deskai6 cooperative warp-per-env split into func_prune_contacts + func_clamp_and_sort_contacts: the fusion removes one kernel launch + one full 11-field permute pass per step, which on net matches or beats the cooperative-warp approach without needing any backend-specific gating. Remove the deskai6 gates that previously disabled dedup on CPU and at n_envs > 8192. Both were tied to the cooperative-kernel approach (warp-cooperative loops serialise on CPU; n_envs > 8192 oversaturates the GPU grid). The fused per-env serial kernel has no such overhead, so dedup now runs unconditionally whenever the scene is eligible (link_pair_pruning_supported and not autodiff). Also remove _effective_pruning_tolerance / _DEDUP_DEFAULT_TOLERANCE / _DEDUP_MAX_N_ENVS: the kernel now reads contact_pruning_tolerance directly from the user option (upstream default 0.1 post Genesis-Embodied-AI#2831), matching the fused kernel's expectation. Remove the @pytest.mark.parametrize backend=gs.gpu restriction on test_contact_pruning so it runs on CPU + GPU per matrix.
…s collapsed it into tol*max_in_plane_r2)
…ernel for CPU
Re-add `func_prune_contacts_coop`, the 32-lane warp-cooperative variant of
the fused clamp+prune+sort path. Same pruning logic as Alexis's serial
`func_clamp_prune_and_sort_contacts` (depth coplanarity gate, Andrew's
monotone chain, hull-pen-max deep-penetration restore, Metal memory-fence
workaround), but the per-env work is spread across 32 warp lanes:
- PARALLEL: contact_keep init, phase-1a (link-pair) key+idx init,
phase-2 mean-normal / centroid reduction via reduce_all_add_tiled,
coplanarity check reduction via reduce_all_max_tiled, in-plane
projection writes.
- SERIAL on lane 0: insertion sorts (phase 1a + lex), bucket walk,
monotone chain, hull mark + deep-penetration restore, and the final
fused compact + spatial sort with sentinel +inf for dropped contacts.
Dispatched only when gs.backend != gs.cpu AND scene is dedup-eligible AND
not requires_grad AND contact_pruning_tolerance > 0. Everything else
(CPU, autodiff, ineligible scenes) falls back to the serial fused kernel,
which has internal qd.static gates that no-op the prune phases when they
don't apply. So dedup still runs on every backend as before; the change
just upgrades the GPU path to coop.
The cooperative warp-per-env dedup kernel (func_prune_contacts_coop) only beats the serial fused kernel while the GPU has spare occupancy. Once _B exceeds half the GPU's CUDA-core count the coop launch fills the device, the warp-cooperation overhead stops paying off, and Alexis's serial fused kernel (one thread per env) is faster. The half-core threshold also leaves headroom for the other kernels already sharing the SMs (narrowphase MPR/GJK, constraint solver) so we don't push them into second-wave scheduling on tight scenes like dex_hand and g1_fall. Concrete thresholds (gpu_cores / 2): - RTX 5090 (170 SMs x 128): _B <= 10880 - RTX 6000 Blackwell (188 x 128): _B <= 12032 - MI300X (304 CUs x 64): _B <= 9728 - Fallback (16384): _B <= 8192 Factored the per-backend gpu_cores computation out of the _use_split_narrowphase branch into a single Collider._gpu_cores field so both the contact0 chunking heuristic and the new dispatch gate share the same number. No behavior change for split-narrowphase chunking.
|
🔴 Benchmark Regression Detected ➡️ Report |
Description
Cooperative warp-per-env optimization of the link-pair contact dedup kernel (
func_prune_contacts), plus auto-enable / gating logic.This branch is rebased onto #2831 (Alexis's
fix_contact_pruning_batched). #2831's three changes are incorporated verbatim into the cooperative kernel:_B >= 2the upper-hull reads ofcontact_hull_stackdon't observe the lower-hull writes and every candidate is kept (hull size == bucket size, no pruning). Standalone reproduction:perso_hugh/prot/qd_metal_hull_chain_visibility_repro.py.prune_hull_collinear_toltunable (default 1e-3) separated from the depth coplanaritytol. Lets users tune the hull-popper cross-product slop independently of the depth gate.MAXof hull penetrations instead of average.Once #2831 merges, this branch's base should be moved back to
origin/mainand #2831's commit will be absorbed.Cooperative-kernel optimizations (from previous iteration of this branch)
qd.loop_config(name="prune_contacts_coop", block_dim=32)).qd.simt.subgroup.reduce_all_add_tiled/reduce_all_max_tiled).contact_keepinit, phase-1a key+idx init, projection (sort_key + proj_v), mark-drop, lex_idx init.qd.simt.subgroup.sync()between coop and serial sections.Auto-enable gating
test_contact_pruningNo longer skipped on darwin (depended on the Metal-exclude path that #2831 makes unnecessary). Parametrized on
gs.gpuonly.Related Issue
Builds on #2829 (Alexis's initial dedup PR) and #2831 (Alexis's Metal codegen bugfix).
How Has This Been / Can This Be Tested?
cmp-tooling/unit_tests_cluster.py, ndarray + field backends).cmp-tooling/bench_cluster_wandb.py).test_contact_pruningongs.gpu(CUDA + Metal).perso_hugh/prot/qd_*_repro.py.Checklist