bugfix: kernel packing of send buffers must complete before initiating message passing #105
jeremyfirst22 wants to merge 1 commit into OPM:master from …

Conversation
(comment on the diff, around these lines:)

```cpp
        &Aq[Component * 7 * N], N);
ScaLBL_DeviceBarrier();
req1[0] =
```
You should be able to:
- make all the calls to ScaLBL_D3Q19_Pack in a row
- make a single call to ScaLBL_DeviceBarrier()
- make all of the calls to MPI_COMM_SCALBL.Isend in a row

This way there is only one synchronization point, which should be faster. A generic sketch of this ordering follows.
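For illustration, a minimal generic sketch of that ordering. This is not LBPM's actual API: the kernel, variable names, and launch parameters are hypothetical, and it assumes a CUDA-aware MPI so device buffers can be handed to MPI_Isend directly.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical pack kernel: gather one value per send site into sendbuf.
__global__ void pack(const int *list, int count, double *sendbuf,
                     const double *dist) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count)
        sendbuf[idx] = dist[list[idx]];
}

// Pack every face buffer, synchronize once, then post all of the sends.
void send_all(double **sendbuf, int **list, const int *count,
              const int *neighbor, int nfaces, const double *dist,
              MPI_Request *req) {
    // Phase 1: queue all pack kernels (asynchronous on the device).
    for (int f = 0; f < nfaces; f++)
        pack<<<(count[f] + 255) / 256, 256>>>(list[f], count[f],
                                              sendbuf[f], dist);
    // Phase 2: the single synchronization point -- all packing finishes here.
    cudaDeviceSynchronize();
    // Phase 3: the buffers are now safe to hand to MPI.
    for (int f = 0; f < nfaces; f++)
        MPI_Isend(sendbuf[f], count[f], MPI_DOUBLE, neighbor[f], 0,
                  MPI_COMM_WORLD, &req[f]);
}
```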
I agree. This is the way that SendD3Q19AA, TriSendD3Q7AA, and SendHalo are structured.
I wasn't sure whether there was a reason that SendD3Q7AA (from this commit) and BiSendD3Q7AA (from fe6f38a) use the interleaved packing/sending structure, and I didn't want to introduce bugs while trying to fix this one.
If you're confident that these two routines can separate packing from sending, I'll update this commit to use that structure for both this function and BiSendD3Q7AA. That would put all of the sending routines in the same structure.
I'm confident that this will work.
The D3Q19 distributions all have buffers that are explicitly created to hold the data needed for each timestep. If you use the same communicator (ScaLBL_Comm) to send multiple distributions (as in BiSendD3Q7AA), then you need to synchronize so that you don't overwrite the first distribution with the second. In the multi-component diffusion cases you might have an arbitrarily large number of D3Q7 distributions, for example.
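To make that concrete, a hypothetical extension of the sketch above (again not LBPM's API, and assuming the declarations from that sketch plus Ncomp, N, and Aq): when several components share one set of send buffers, each component's sends must complete before the next component's packing may overwrite the buffers.

```cpp
// Hypothetical: Ncomp D3Q7 components share the same send buffers, so a
// wait is required between components in addition to the device barrier.
for (int c = 0; c < Ncomp; c++) {
    // Pack component c (the &Aq[c * 7 * N] offset mirrors the diff above).
    for (int f = 0; f < nfaces; f++)
        pack<<<(count[f] + 255) / 256, 256>>>(list[f], count[f],
                                              sendbuf[f], &Aq[c * 7 * N]);
    cudaDeviceSynchronize();  // packing must finish before the buffers are sent
    for (int f = 0; f < nfaces; f++)
        MPI_Isend(sendbuf[f], count[f], MPI_DOUBLE, neighbor[f], 0,
                  MPI_COMM_WORLD, &req[f]);
    // Sends must complete before the next component repacks the same buffers.
    MPI_Waitall(nfaces, req, MPI_STATUSES_IGNORE);
}
```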
In general, the fastest way to catch communication errors is to run the following test:
https://github.com/OPM/LBPM/blob/master/tests/TestCommD3Q19.cpp
If you ever build with a new version of MPI, it is a good idea to run this test first, until you have tuned the compile flags and the launcher flags. Certain things aren't part of the MPI standard, so you can't design software behavior around them. The GPU conventions are almost always configurable at compile time and/or runtime based on flags, so you can't get around twiddling these.
This test was used to debug LBPM communications on many large supercomputers with several thousand GPUs (both AMD and NVIDIA).
Bugfix
There is relevant discussion of where this fix is most appropriate in #103.