Fast linear algebra routines for batches of small matrices.
Batmat is used as the linear algebra backend for the Cyqlone solver, where it is used to perform vectorized operations across multiple stages in an optimal control problem.
To enable vectorization, batmat stores batches of small matrices in an interleaved “compact” format in memory, where the corresponding elements of all matrices in a batch are stored together, as shown in the figure below (for a batch size of two). Custom linear algebra routines then operate on all matrices in a batch simultaneously using SIMD instructions. These routines are built on top of highly optimized micro-kernels.
| Routine | Operation | Notes |
|---|---|---|
gemm(A, B, D) |
||
gemm_neg(A, B, D) |
||
gemm_add(A, B, C, D) |
||
gemm_sub(A, B, C, D) |
||
syrk(A, D) |
|
|
syrk_neg(A, D) |
|
|
syrk_add(A, C, D) |
|
|
syrk_sub(A, C, D) |
|
|
trmm(A, B, D) |
|
|
trmm_neg(A, B, D) |
|
|
trmm_add(A, B, C, D) |
|
|
trmm_sub(A, B, C, D) |
|
|
potrf(C, D) |
|
|
syrk_add_potrf(A, C, D) |
|
|
syrk_sub_potrf(A, C, D) |
|
|
trsm(A, B, D) |
|
|
trtri(A, D) |
|
|
gemm_diag(A, B, D, d) |
||
gemm_diag_add(A, B, C, D, d) |
||
syrk_diag_add(A, C, D, d) |
|
|
symm_add(A, B, C, D) |
|
|
copy(A, B) |
||
fill(a, B) |
A selection of these routines also support masking, shifting, or rotating the arguments (for example,
#include <batmat/linalg/copy.hpp>
#include <batmat/linalg/gemm.hpp>
#include <batmat/linalg/potrf.hpp>
#include <batmat/matrix/matrix.hpp>
#include <guanaqo/print.hpp>
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <random>
using batmat::index_t;
using batmat::real_t;
namespace la = batmat::linalg;
int main() {
using batch_size = std::integral_constant<index_t, 4>;
constexpr auto storage_order = batmat::matrix::StorageOrder::ColMajor;
// Class representing a batch of four matrices.
using Mat = batmat::matrix::Matrix<real_t, index_t, batch_size, batch_size, storage_order>;
// Allocate some batches of matrices (initialized to zero).
index_t n = 3, m = n + 5;
Mat C{{.rows = n, .cols = n}}, A{{.rows = n, .cols = m}};
// Fill A with random values.
std::mt19937 rng{12345};
std::uniform_real_distribution<real_t> uni{-1.0, 1.0};
std::ranges::generate(A, [&] { return uni(rng); });
// Compute C = AAᵀ to make it symmetric positive definite (lower triangular part only).
la::syrk(A, la::tril(C));
// Allocate L for the Cholesky factors.
Mat L{{.rows = n, .cols = n}, batmat::matrix::uninitialized};
// Compute the Cholesky factors L of C (lower triangular).
la::fill(0, la::triu(L));
la::potrf(la::tril(C), la::tril(L));
// Print the results.
for (index_t l = 0; l < C.depth(); ++l) {
guanaqo::print_python(std::cout << "C[" << l << "] =\n", C(l));
guanaqo::print_python(std::cout << "L[" << l << "] =\n", L(l));
}
// Compute LLᵀ (in-place).
la::syrk(la::tril(L));
// Check that LLᵀ == C.
int errors = 0;
const auto eps = std::numeric_limits<real_t>::epsilon();
for (index_t l = 0; l < C.depth(); ++l)
for (index_t c = 0; c < C.cols(); ++c)
for (index_t r = c; r < C.rows(); ++r)
errors += std::abs(C(l, r, c) - L(l, r, c)) < 10 * eps ? 0 : 1;
return errors;
}Dependencies: guanaqo
Batmat can be installed using the standard CMake workflow. To install the necessary dependencies, it is recommended to use the Conan package manager:
git clone https://github.com/tttapa/conan-recipes.git
conan remote add tttapa-conan-recipes "$PWD/conan-recipes"
conan install . --build=missing
cmake --fresh --preset conan-release
cmake --build --preset conan-releaseTip
Batmat makes extensive use of vectorization (in fact, that's kind of the whole point). Be sure to enable SIMD ISA extension support in your Conan profile's compiler flags for the best performance. See scripts/dev/profiles/laptop for an example.
GCC 14 or later is required (and can be installed through Conan).
Clang is also supported, but this requires the -o\&:with_gsi_hpc_simd=True option to be passed to Conan
to enable support for the GSI-HPC SIMD library (Clang does not support libstdc++'s <experimental/simd> header).
Batmat performs exceptionally well on matrices smaller than around 100×100 (that fit in the L2 cache), where it outperforms traditional scalar linear algebra libraries such as Intel MKL, OpenBLAS, and BLASFEO (especially for triangular or symmetric matrices).
. ~/intel/oneapi/setvars.sh # For the Intel MKL
python3 -m pip install -r benchmarks/scripts/requirements.txt
conan install . --build=missing -o guanaqo/\*:with_mkl=True -o\&:with_benchmarks=True -o\&:with_blasfeo=True
cmake --fresh --preset conan-release -DBATMAT_WITH_ACCURATE_BUILD_TIME=Off
cmake --build --preset conan-release -t viz-benchmark-potrf # or gemm, syrk, trsm, syrk-potrf, trmm, trtri, hyh