by Cody Rivera [[email protected]], Jieyang Chen [[email protected]], and Dingwen Tao [[email protected]]
This repository contains an implementation of two irregular-shape matrix-matrix
multiplication algorithms, TSM2R and TSM2L. TSM2R is designed to efficiently
multiply a large square (or near-square) matrix by a tall-and-skinny matrix, or
more specifically, an (m * k) and (k * n) matrix-matrix multiplication where
m and k are approximately equal, and n is much smaller than k. TSM2L is designed
to efficiently multiply a tall-and-skinny matrix by a small square matrix, or
more specifically, an (m * k) and (k * n) matrix-matrix multiplication where
k is much smaller than m, and k and n are approximately equal.
We propose TSM2R and TSM2L in our preprint,
"TSM2X: High-Performance Tall-and-Skinny Matrix-MatrixMultiplication on GPUs." [1].
Our work extends an ICS conference paper [2], which introduces TSM2R, by expanding
its techniques for different matrix sizes as well as porting the algorithm to the Nvidia
Tesla V100.
We have implemented the kernels as templates, with the parameters t1, t2, and t3 as
template variables [1]. The program will select an optimal kernel depending on the
size of the input matrices. This repository currently provides a set of optimal kernels for
the Nvidia V100 GPU only.
This implementation is designed for Unix platforms, and can be built using
make. The usage of this program is:
./multiply [-d] [-i] a.mtx b.mtx c.mtx,
where a.mtx and b.mtx are input matrices and c.mtx is an output matrix.
-d indicates that the matrices are double-precision, while -i indicates
that TSM2L (instead of TSM2R) is to be used.
The format of the matrices is binary, with a structure as follows:
template <typename FloatType>
struct matrixFormat {
uint32_t rows, cols;
FloatType values[rows * cols];
};The matrix is stored in column-major format. All multibyte values are little-endian.
You may use the provided gen.cpp program to generate input
matrices. The usage is ./gen [-d] -r ROW_COUNT -c COL_COUNT file,
where -d signifies double precision.
You may also use the provided print.cpp program to print matrices.
The usage is ./print [-d] file.
To evaluate performance across a range of inputs, a Python3 script
test.py is provided. The script can be invoked with
python3 test.py. The program requires that ../multiply and
../gen exist, and writes its output to CSV files.
[1] Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, and Dingwen Tao. "TSM2X: High-Performance Tall-and-Skinny Matrix-MatrixMultiplication on GPUs." 2020. arXiv:2002.03258 [cs.DC].
[2] Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben, Qiang Guan, and Zizhong Chen. "TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs." In Proceedings of the ACM International Conference on Supercomputing (ICS), pp. 106-116. ACM, 2019. https://doi.org/10.1145/3330345.3330355