bugfix: kernel packing of send buffers must complete before initiating message passing #105
jeremyfirst22 wants to merge 1 commit into OPM:master from …

Conversation
(comment on the diff, around these lines:)

```cpp
        &Aq[Component * 7 * N], N);
ScaLBL_DeviceBarrier();
req1[0] =
```
You should be able to:
- make all the calls to ScaLBL_D3Q19_Pack in a row
- make a single call to ScaLBL_DeviceBarrier()
- make all of the calls to MPI_COMM_SCALBL.Isend in a row

This way there is only one synchronization point, which should be faster. A generic sketch of this ordering follows.
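For illustration, a minimal generic sketch of that ordering. This is not LBPM's actual API: the kernel, variable names, and launch parameters are hypothetical, and it assumes a CUDA-aware MPI so device buffers can be handed to MPI_Isend directly.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical pack kernel: gather one value per send site into sendbuf.
__global__ void pack(const int *list, int count, double *sendbuf,
                     const double *dist) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < count)
        sendbuf[idx] = dist[list[idx]];
}

// Pack every face buffer, synchronize once, then post all of the sends.
void send_all(double **sendbuf, int **list, const int *count,
              const int *neighbor, int nfaces, const double *dist,
              MPI_Request *req) {
    // Phase 1: queue all pack kernels (asynchronous on the device).
    for (int f = 0; f < nfaces; f++)
        pack<<<(count[f] + 255) / 256, 256>>>(list[f], count[f],
                                              sendbuf[f], dist);
    // Phase 2: the single synchronization point -- all packing finishes here.
    cudaDeviceSynchronize();
    // Phase 3: the buffers are now safe to hand to MPI.
    for (int f = 0; f < nfaces; f++)
        MPI_Isend(sendbuf[f], count[f], MPI_DOUBLE, neighbor[f], 0,
                  MPI_COMM_WORLD, &req[f]);
}
```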
I agree. This is the way that SendD3Q19AA, TriSendD3Q7AA, and SendHalo are structured.
I wasn't sure whether there was a reason that SendD3Q7AA (from this commit) and BiSendD3Q7AA (from fe6f38a) use the interleaved packing/sending structure, and I didn't want to introduce bugs while trying to fix this one.
If you're confident that these two routines can separate packing from sending, I'll update this commit to use that structure for both this function and BiSendD3Q7AA. That would put all of the sending routines in the same structure.
I'm confident that this will work.
The D3Q19 distributions all have buffers that are explicitly created to hold the data needed for each timestep. If you use the same communicator (ScaLBL_Comm) to send multiple distributions (as in BiSendD3Q7AA), then you need to synchronize so that you don't overwrite the first distribution with the second. In the multi-component diffusion cases you might have an arbitrarily large number of D3Q7 distributions, for example.
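To make that concrete, a hypothetical extension of the sketch above (again not LBPM's API, and assuming the declarations from that sketch plus Ncomp, N, and Aq): when several components share one set of send buffers, each component's sends must complete before the next component's packing may overwrite the buffers.

```cpp
// Hypothetical: Ncomp D3Q7 components share the same send buffers, so a
// wait is required between components in addition to the device barrier.
for (int c = 0; c < Ncomp; c++) {
    // Pack component c (the &Aq[c * 7 * N] offset mirrors the diff above).
    for (int f = 0; f < nfaces; f++)
        pack<<<(count[f] + 255) / 256, 256>>>(list[f], count[f],
                                              sendbuf[f], &Aq[c * 7 * N]);
    cudaDeviceSynchronize();  // packing must finish before the buffers are sent
    for (int f = 0; f < nfaces; f++)
        MPI_Isend(sendbuf[f], count[f], MPI_DOUBLE, neighbor[f], 0,
                  MPI_COMM_WORLD, &req[f]);
    // Sends must complete before the next component repacks the same buffers.
    MPI_Waitall(nfaces, req, MPI_STATUSES_IGNORE);
}
```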
In general, the fastest way to catch communication errors is to run the following test:
https://github.com/OPM/LBPM/blob/master/tests/TestCommD3Q19.cpp
If you ever build with a new version of MPI, it is a good idea to run this test first, until you have tuned the compile flags and the launcher flags. Certain things aren't part of the MPI standard, so you can't design software behavior around them. The GPU conventions are almost always configurable at compile time and/or runtime based on flags, so you can't get around twiddling these.
This test was used to debug LBPM communications on many large supercomputers with several thousand GPUs (both AMD and NVIDIA).
Bugfix
There is relevant discussion of where this fix is most appropriate in #103.