A collection of simple load balancers for MPI, designed in the manner of OpenMP scheduling.
All load balancers are available via the SLB4MPI.hpp header, which provides <Load-Balancer-Type>LoadBalancer types.
Each <Load-Balancer-Type>LoadBalancer implements the abstract interface defined in abstract_load_balancer.hpp.
See the possible Load-Balancer-Types in the next section.
Initialization can be done in the following manner, for example (here, StaticLoadBalancer is used):

```cpp
#include <SLB4MPI.hpp>
...
StaticLoadBalancer slb(mpi_comm, 1, 100, 2, 4);
```

This initializes the load balancer slb, which operates on the communicator mpi_comm, with a range from the lower bound (1) to the upper bound (100) and with optional min (2) and max (4) chunk sizes.
Simpler initialization is also possible with default min and max chunk sizes:

```cpp
StaticLoadBalancer slb(mpi_comm, 1, 100);
```

Construction of the load balancer is a barrier for the whole communicator.
Then, one can request a range to compute with the get_range method, which returns the range's start and end via its arguments; its boolean return value signals whether there is anything left to compute.
So, the usage pattern for this routine is:

```cpp
if (slb.get_range(range_start, range_end)) { // returns [ range_start, range_end ]
  ...
}
```

A single call may return only a part of the whole range, so a proper call should be placed inside a while loop.
After the loop finishes, the load balancer slb must be destroyed.
Internally, it uses RAII, so in most cases you do not need to think about destroying it.
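For illustration, a complete usage sketch could look as follows (the integer type of the range bounds and the process function are assumptions made for this example, not part of the library's interface):

```cpp
#include <SLB4MPI.hpp>
#include <mpi.h>

void process(long long i);  // placeholder for the actual per-iteration work

void compute(MPI_Comm mpi_comm) {
  // distribute iterations 1..100 with min chunk size 2 and max chunk size 4
  StaticLoadBalancer slb(mpi_comm, 1, 100, 2, 4);
  long long range_start, range_end;  // integer type assumed for illustration
  // keep asking for sub-ranges until the whole range is exhausted
  while (slb.get_range(range_start, range_end)) {
    for (long long i = range_start; i <= range_end; ++i) {
      process(i);
    }
  }
}  // slb is destroyed here via RAII; this is a synchronization point
```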
For the runtime load balancer, the desired load balancer must be specified via the first argument during initialization:

```cpp
std::string rlbtype = std::string("dynamic");
RuntimeLoadBalancer rlb(rlbtype, mpi_comm, 1, 100, 2, 4);
```

See the possible values in the next section.
All load balancers are available via the SLB4MPI module, which provides <load-balancer-type>_load_balancer_t types.
Each <load-balancer-type>_load_balancer_t implements the abstract interface defined in abstract_load_balancer.f90.
See the possible load-balancer-types in the next section.
Type definition can be done in the following manner, for example:
```fortran
use SLB4MPI
...
type(static_load_balancer_t) :: lb
```

Here, the static load balancer will be used.
Before using the load balancer lb, one must initialize it with a range from a lower bound to an upper bound and with optional min and max chunk sizes; the load balancer will operate on the communicator comm.
For the range from 1 to 1000, with a min chunk size of 6 and a max chunk size of 12, the initialization looks as follows:

```fortran
call lb%initialize(comm, 1_8, 1000_8, 6_8, 12_8)
```

A simpler call is also possible, where the min and max chunk sizes are determined automatically:

```fortran
call lb%initialize(comm, 1_8, 1000_8)
```

This call is a barrier for the whole communicator.
Then, one can request a range to compute with the get_range call, which returns the range's start and end via its arguments; its logical return value signals whether there is anything left to compute.
So, the usage pattern for this routine is:

```fortran
if (lb%get_range(range_start, range_end)) then
  ...
endif
```

A single call may return only a part of the whole range, so a proper call should be placed inside a do loop.
After the loop finishes, the load balancer lb must be destroyed before its next usage via the clean call:

```fortran
call lb%clean()
```

For the runtime load balancer, the default load balancer can be changed with the SLB4MPI_set_schedule call before its initialization, in the following way:
```fortran
call SLB4MPI_set_schedule("guided")
```

The available load balancer types are: static, local_static, dynamic, guided, work_stealing, and runtime.
Typenames of the different balancers in different languages:
| load balancer | Typename in Fortran | Typename in C++ |
|---|---|---|
| static | static_load_balancer_t | StaticLoadBalancer |
| local_static | local_static_load_balancer_t | LocalStaticLoadBalancer |
| dynamic | dynamic_load_balancer_t | DynamicLoadBalancer |
| guided | guided_load_balancer_t | GuidedLoadBalancer |
| work_stealing | work_stealing_load_balancer_t | WorkStealingLoadBalancer |
| runtime | runtime_load_balancer_t | RuntimeLoadBalancer |
`static`: Iterations are divided into chunks of size max_chunk_size, and the chunks are assigned to the MPI ranks in the communicator in a round-robin fashion, in the order of the MPI ranks.
Each chunk contains max_chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations.
If min_chunk_size is not specified, it defaults to 1.
If max_chunk_size is not specified, the iteration range is divided into chunks that are approximately equal in size, but no less than min_chunk_size.
`local_static`: Iterations are divided into chunks of size max_chunk_size, and the chunks are assigned to the MPI ranks in the communicator in a contiguous fashion, in the order of the MPI ranks.
Each chunk contains max_chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations.
If min_chunk_size is not specified, it defaults to 1.
If max_chunk_size is not specified, the iteration range is divided into chunks that are approximately equal in size, but no less than min_chunk_size.
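To make the two assignment policies above concrete, the following standalone sketch (an illustration only, not library code; the exact contiguous layout used by local_static is an assumption) prints which rank would own each chunk under static (round-robin) and local_static (contiguous) assignment:

```cpp
#include <cstdint>
#include <cstdio>

// round-robin assignment: chunk c goes to rank c mod nranks
int owner_static(std::int64_t c, int nranks) {
  return static_cast<int>(c % nranks);
}

// contiguous assignment (assumed layout): the first ceil(nchunks/nranks)
// chunks go to rank 0, the next block to rank 1, and so on
int owner_local_static(std::int64_t c, std::int64_t nchunks, int nranks) {
  const std::int64_t per_rank = (nchunks + nranks - 1) / nranks;
  return static_cast<int>(c / per_rank);
}

int main() {
  const std::int64_t iterations = 10, max_chunk_size = 2;
  const int nranks = 3;
  const std::int64_t nchunks = (iterations + max_chunk_size - 1) / max_chunk_size;
  for (std::int64_t c = 0; c < nchunks; ++c) {
    std::printf("chunk %lld: static -> rank %d, local_static -> rank %d\n",
                static_cast<long long>(c),
                owner_static(c, nranks),
                owner_local_static(c, nchunks, nranks));
  }
  return 0;
}
```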
`dynamic`: Each MPI rank executes a chunk, then requests another chunk, until no chunks remain to be assigned.
Each chunk contains min_chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations.
If min_chunk_size is not specified, it defaults to 1.
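One way such on-demand chunk assignment can be realized over MPI is a shared counter advanced with one-sided atomics, sketched below. This is purely illustrative and is not the library's implementation; the struct name and the window layout are assumptions.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdint>

// Shared iteration counter hosted on rank 0 and advanced atomically.
struct DynamicCounter {
  MPI_Win win = MPI_WIN_NULL;
  std::int64_t *next = nullptr;

  DynamicCounter(MPI_Comm comm, std::int64_t lower) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    const MPI_Aint size = (rank == 0) ? sizeof(std::int64_t) : 0;
    MPI_Win_allocate(size, sizeof(std::int64_t), MPI_INFO_NULL, comm, &next, &win);
    if (rank == 0) {               // initialize the counter on its host rank
      MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
      *next = lower;
      MPI_Win_unlock(0, win);
    }
    MPI_Barrier(comm);             // everyone waits for the initial value
    MPI_Win_lock_all(0, win);      // passive-target epoch for the whole run
  }

  // Atomically claim `chunk` iterations; returns the first claimed index.
  std::int64_t fetch_add(std::int64_t chunk) {
    std::int64_t start = 0;
    MPI_Fetch_and_op(&chunk, &start, MPI_INT64_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
    return start;
  }

  ~DynamicCounter() {
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
  }
};

// Each rank repeatedly claims `chunk` iterations until [lower, upper] is done.
void dynamic_loop(MPI_Comm comm, std::int64_t lower, std::int64_t upper,
                  std::int64_t chunk) {
  DynamicCounter counter(comm, lower);
  for (;;) {
    const std::int64_t start = counter.fetch_add(chunk);
    if (start > upper) break;
    const std::int64_t end = std::min(start + chunk - 1, upper);
    // ... work on [start, end] ...
    (void)end;
  }
}
```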
`guided`: Each MPI rank executes a chunk, then requests another chunk, until no chunks remain to be assigned.
For a min_chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of MPI ranks in the communicator, decreasing to 1, but with no more than max_chunk_size iterations.
For a min_chunk_size with value k > 1, the size of each chunk is determined in the same way, with the restriction that the chunks contain no fewer than k iterations and no more than max_chunk_size iterations
(except for the chunk that contains the sequentially last iteration, which may have fewer than k iterations).
If min_chunk_size is not specified, it defaults to 1.
If max_chunk_size is not specified, the value is determined as the average number of iterations per MPI rank, but no less than min_chunk_size.
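The decreasing chunk sizes of the guided policy can be visualized with the following arithmetic sketch (an illustration of the rule above, not the library's exact formula; the concrete numbers are arbitrary):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const std::int64_t lower = 1, upper = 1000;        // iteration range
  const std::int64_t min_chunk = 6, max_chunk = 120; // chunk-size bounds
  const int nranks = 4;                              // ranks in the communicator

  std::int64_t next = lower;
  std::int64_t remaining = upper - lower + 1;
  while (remaining > 0) {
    // chunk ~ remaining / nranks, clamped to [min_chunk, max_chunk]
    // and never larger than what is actually left
    std::int64_t chunk = std::max(remaining / nranks, min_chunk);
    chunk = std::min(chunk, max_chunk);
    chunk = std::min(chunk, remaining);
    std::printf("assign [%lld, %lld]\n",
                static_cast<long long>(next),
                static_cast<long long>(next + chunk - 1));
    next += chunk;
    remaining -= chunk;
  }
  return 0;
}
```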
`work_stealing`: Iterations are divided into chunks of size max_chunk_size, and the chunks are assigned to the MPI ranks in the communicator in a contiguous fashion, in the order of the MPI ranks.
Each chunk contains max_chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations.
Once an MPI rank has finished its chunks, it starts to steal work from other MPI ranks, in a round-robin fashion in the order of the MPI ranks.
The stolen chunks contain min_chunk_size iterations, except for the chunk that contains the victim rank's sequentially last iteration, which may have fewer iterations.
If min_chunk_size is not specified, it defaults to 1.
If max_chunk_size is not specified, the iteration range is divided into chunks that are approximately equal in size, but no less than min_chunk_size.
`runtime`: The actual load balancer is selected by the SLB4MPI_set_schedule procedure (in C++, by the first constructor argument of RuntimeLoadBalancer).
The possible values, passed as a string, are:
- `env` (default): selects the slicing algorithm according to the SLB4MPI_LOAD_BALANCER environment variable;
- `static`: selects the `static` load balancer;
- `local_static`: selects the `local_static` load balancer;
- `dynamic`: selects the `dynamic` load balancer;
- `guided`: selects the `guided` load balancer;
- `work_stealing`: selects the `work_stealing` load balancer.
The SLB4MPI_LOAD_BALANCER environment variable accepts the following values:
- not set: the runtime load balancer will use the `static` load balancer;
- empty string: the runtime load balancer will use the `static` load balancer;
- `static`: the runtime load balancer will use the `static` load balancer;
- `local_static`: the runtime load balancer will use the `local_static` load balancer;
- `dynamic`: the runtime load balancer will use the `dynamic` load balancer;
- `guided`: the runtime load balancer will use the `guided` load balancer;
- `work_stealing`: the runtime load balancer will use the `work_stealing` load balancer;
- other values: the runtime load balancer will use the `static` load balancer.
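As a C++ usage sketch (mirroring the RuntimeLoadBalancer example above; the "env" value defers the choice to the SLB4MPI_LOAD_BALANCER environment variable, as described in the list above):

```cpp
#include <SLB4MPI.hpp>
#include <mpi.h>
#include <string>

void run(MPI_Comm mpi_comm) {
  // "env" (the default schedule) reads SLB4MPI_LOAD_BALANCER at run time,
  // so the schedule can be changed without recompiling
  RuntimeLoadBalancer rlb(std::string("env"), mpi_comm, 1, 100);
  long long range_start, range_end;  // integer type assumed for illustration
  while (rlb.get_range(range_start, range_end)) {
    // ... work on [range_start, range_end] ...
  }
}
```

Launching the application with, for example, SLB4MPI_LOAD_BALANCER=guided then selects the guided load balancer at run time.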
The collection is assumed to be compiled within the source tree of a parent project, with all of its compiler flags passed through. Additional flags required for compiling the collection are handled internally. Currently, only the CMake build system is supported.
To use SLB4MPI from your project, add the following lines to your CMakeLists.txt to fetch the library:
```cmake
include(FetchContent)
FetchContent_Declare(SLB4MPI
  GIT_REPOSITORY https://github.com/foxtran/SLB4MPI.git
  GIT_TAG 0.0.2
)
# options must be set before FetchContent_MakeAvailable to take effect
set(SLB4MPI_ENABLE_CXX ON)     # or OFF
set(SLB4MPI_ENABLE_Fortran ON) # or OFF
set(SLB4MPI_WITH_MPI ON)       # or OFF
FetchContent_MakeAvailable(SLB4MPI)
```

After this, you can link the library with your Fortran application or library (SLB4MPI_ENABLE_Fortran must be ON):
```cmake
target_link_libraries(<TARGET> PUBLIC SLB4MPI::SLB4MPI_Fortran)
```

or with a C++ application or library (SLB4MPI_ENABLE_CXX must be ON):
```cmake
target_link_libraries(<TARGET> PUBLIC SLB4MPI::SLB4MPI_CXX)
```

Useful flags that change the behaviour of the library:
- `SLB4MPI_ENABLE_CXX` enables/disables MPI load balancers for the C++ language (requires C++14)
- `SLB4MPI_ENABLE_Fortran` enables/disables MPI load balancers for the Fortran language (requires Fortran 2003)
- `SLB4MPI_WITH_MPI` enables/disables MPI support for MPI/non-MPI builds
See examples for Fortran and C++.
NOTE: The library does not provide ILP64 support.
To build the documentation, you need to have doxygen installed.
With doxygen installed, run:

```sh
doxygen DOXYGEN
```

to build the documentation in HTML and LaTeX formats.
- In Fortran, the `clean` call is a synchronization point. In C++, it is the end of the scope containing the load balancer(s).
- Load balancers `dynamic`, `guided`, and `work_stealing`, as well as `runtime`, may have significant performance issues. Report on Intel Forum
- `MPI_Accumulate` is used instead of `MPI_Put`. Report on Intel Forum
- `I_MPI_ASYNC_PROGRESS=1` leads to runtime failures. Report on Intel Forum
- NOTE: no ILP64 support.
Not tested yet.