Apply GPU optimizations to TLSPH#1139
Apply GPU optimizations to TLSPH#1139efaulhaber wants to merge 19 commits intotrixi-framework:mainfrom
Conversation
69f3560 to
006d4f5
Compare
006d4f5 to
7518a8c
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors Total Lagrangian SPH (TLSPH) neighbor interactions to better match GPU-friendly execution patterns (per-particle threading, reduced memory traffic, fewer repeated loads), following the newer PointNeighbors neighbor-iteration approach.
Changes:
- Refactor TLSPH deformation gradient and RHS assembly to use per-particle
@threadedloops withforeach_neighbor, accumulating intoRefs to reduce global writes. - Optimize penalty force and viscosity kernels for GPU performance (preload deformation gradients, use
div_fast, and usesmoothing_kernel_unsafeafter cutoff filtering). - Add a SIMD-based fast path for extracting 2×2 matrices (
extract_smatrix) and wire in theSIMDdependency; update tests’ mock systems accordingly.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/systems/tlsph_system.jl | Updates TLSPH test mock system to a concrete struct (GPU-friendly) and adjusts numeric literals. |
| test/schemes/structure/total_lagrangian_sph/rhs.jl | Updates RHS tests’ mock system layout and adds deformation_gradient stub required by new RHS path. |
| src/schemes/structure/total_lagrangian_sph/viscosity.jl | Threads deformation gradient through viscosity path and applies div_fast in hot divisions. |
| src/schemes/structure/total_lagrangian_sph/system.jl | Reworks deformation gradient assembly into per-particle neighbor loops with reduced memory writes. |
| src/schemes/structure/total_lagrangian_sph/rhs.jl | Reworks RHS assembly similarly; passes deformation gradients into penalty/viscosity for fewer loads. |
| src/schemes/structure/total_lagrangian_sph/penalty_force.jl | Converts penalty force to an in-place accumulator API and switches to unsafe kernel + fast divisions. |
| src/schemes/boundary/wall_boundary/system.jl | Adds smoothing_kernel_unsafe specialization for wall boundary systems. |
| src/general/neighborhood_search.jl | Adds foreach_neighbor wrapper around PointNeighbors neighbor iteration. |
| src/general/abstract_system.jl | Adds extract_smatrix Val-specialization and a SIMD 2D fast path. |
| src/TrixiParticles.jl | Imports SIMD module for use in extract_smatrix. |
| Project.toml | Adds SIMD dependency + compat; minor reordering of weakdeps/extensions entries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # The compiler is smart enough to optimize this away if no penalty force is used | ||
| F_b = @inbounds deformation_gradient(system, neighbor) |
There was a problem hiding this comment.
Does this have to be placed here in order to be optimized by the compiler?
| # The compiler is smart enough to optimize this away if no penalty force is used | |
| F_b = @inbounds deformation_gradient(system, neighbor) | |
| # F_b is used here only for the penalty force. The compiler is smart enough to optimize this away if no penalty force is used. | |
| F_b = @inbounds deformation_gradient(system, neighbor) |
Does this have to be placed here in order to be optimized by the compiler?
There was a problem hiding this comment.
Unfortunately, yes. Moving this into the penalty force makes it slower when penalty force is used. Having it here does not make it slower when no penalty force is used.
| eps_sum = (F_a + F_b) * initial_pos_diff - 2 * current_pos_diff | ||
| delta_sum = dot(eps_sum, current_pos_diff) * inv_current_distance | ||
| eps_a = F_a * initial_pos_diff - current_pos_diff | ||
| eps_b = -(F_b * initial_pos_diff - current_pos_diff) |
There was a problem hiding this comment.
This is not the same as the line that is replaced? Why now with a minus for eps_b?
There was a problem hiding this comment.
You're right. The new version is the correct epsilon from the paper, but then the delta is missing a minus.
I'm surprised that the tests are not failing because of this, only the validation. Apparently, this incorrect penalty force and be countered by a small time step. I added another oscillating beam example test with more penalty force and a tightly tuned CFL as a regression test, which correctly catches this error.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1139 +/- ##
===========================================
- Coverage 89.06% 67.26% -21.81%
===========================================
Files 128 128
Lines 9868 9857 -11
===========================================
- Hits 8789 6630 -2159
- Misses 1079 3227 +2148
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/run-gpu-tests |
|
/run-gpu-tests |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…rticles.jl into tlsph-gpu-performance
Based on trixi-framework/PointNeighbors.jl#154. Tests will pass once PointNeighbors 0.6.6 is released.