BLAS/LAPACK Benchmarks
Test Setup
Algorithms
The following H-matrix arithmetic operations are tested:
H-mm-eager | eager H matrix multiplication |
H-mm-accu | H matrix multiplication using accumulators |
H-lu-eager | eager H-LU factorization |
H-lu-accu | H-LU factorization using accumulators |
UniH-mm | Uniform-H matrix multiplication |
UniH-lu | Uniform-H LU factorization |
Applications
The following applications are used for testing:
laplace : Laplace SLP
command line:
--app laplaceslp --adm std --cluster h --ntile 64 --grid sphere2-6 -e 1e-6
problem size: 65536
truncation accuracy: 1e-6
BLAS/LAPACK Libraries
OpenBLAS
version: 0.3.20
compilation options:
make TARGET=<...> USE_THREAD=0 USE_LOCKING=1 NO_LAPACK=1
with targets ZEN and SKYLAKEX
with Reference-LAPACK v3.10.0
BLIS
version: 0.8.1
compilation options:
./configure --disable-threading --enable-mixed-dt --enable-sup-handling <...>
with targets zen2 and skx
with Reference-LAPACK v3.10.0
Intel MKL
version: 2020.0
on AMD Epyc: MKL_DEBUG_CPU_TYPE=5 is set to enable the AVX2 code path
LAPACK: also provided by Intel MKL
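The environment setting above can be applied per run instead of globally. The following is a minimal Python sketch, assuming the benchmark is launched as a child process; the helper name run_with_mkl_avx2 and the probe command are illustrative and not part of the original setup. Only the variable name and value come from the description above.

```python
import os
import subprocess
import sys


def run_with_mkl_avx2(cmd):
    """Run cmd with MKL_DEBUG_CPU_TYPE=5 added to its environment (hypothetical helper)."""
    env = dict(os.environ)           # copy, so the parent environment stays untouched
    env["MKL_DEBUG_CPU_TYPE"] = "5"  # value taken from the setup described above
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable;
    # replace it with the actual benchmark command line.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['MKL_DEBUG_CPU_TYPE'])"]
    print(run_with_mkl_avx2(probe).stdout.strip())
```

Setting the variable only for the child process keeps runs with other BLAS libraries unaffected by it.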
Results
Each benchmark is run 10 times and the best (shortest) runtime is reported.
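The best-of-10 measurement can be sketched as follows. This is a minimal illustration, not the harness used for the numbers reported here; the function best_of and the placeholder workload are assumptions made for the example.

```python
import time


def best_of(func, repeats=10):
    """Call func `repeats` times and return the smallest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep only the fastest run
    return best


def workload():
    # Placeholder that stands in for one benchmark run.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(f"best of 10: {best_of(workload):.6f}s")
```

Taking the minimum rather than the mean reduces the influence of runs disturbed by other activity on the machine.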
Laplace
Algorithm | AMD Epyc 7702 (Rome): OpenBLAS | AMD Epyc 7702 (Rome): BLIS | AMD Epyc 7702 (Rome): Intel MKL | Intel Xeon 8360Y (IceLake): OpenBLAS | Intel Xeon 8360Y (IceLake): BLIS | Intel Xeon 8360Y (IceLake): Intel MKL |
---|---|---|---|---|---|---|
H-mm-accu | 144.0s | 167.9s | 127.3s | 160.4s | 288.9s | 152.7s |
UniH-mm | 173.1s | 169.5s | 147.1s | 161.7s | 304.0s | 154.3s |
H-lu-eager | 111.6s | 145.2s | 106.0s | 136.5s | 211.4s | 128.6s |
H-lu-accu | 60.84s | 70.14s | 53.91s | 73.94s | 111.4s | 68.80s |
UniH-lu | 86.23s | 90.23s | 75.95s | 95.31s | 141.0s | 90.50s |
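To compare the libraries at a glance, the runtimes can be normalized per row to the fastest library on each CPU. The sketch below does this with the values copied from the table above; the names RUNTIMES and relative_to_fastest are illustrative only.

```python
LIBS = ("OpenBLAS", "BLIS", "Intel MKL")

# Runtimes in seconds, copied from the table above, in the order given by LIBS.
RUNTIMES = {
    "AMD Epyc 7702": {
        "H-mm-accu": (144.0, 167.9, 127.3),
        "UniH-mm": (173.1, 169.5, 147.1),
        "H-lu-eager": (111.6, 145.2, 106.0),
        "H-lu-accu": (60.84, 70.14, 53.91),
        "UniH-lu": (86.23, 90.23, 75.95),
    },
    "Intel Xeon 8360Y": {
        "H-mm-accu": (160.4, 288.9, 152.7),
        "UniH-mm": (161.7, 304.0, 154.3),
        "H-lu-eager": (136.5, 211.4, 128.6),
        "H-lu-accu": (73.94, 111.4, 68.80),
        "UniH-lu": (95.31, 141.0, 90.50),
    },
}


def relative_to_fastest(times):
    """Return each runtime divided by the smallest runtime in the row."""
    fastest = min(times)
    return tuple(t / fastest for t in times)


if __name__ == "__main__":
    for cpu, rows in RUNTIMES.items():
        print(cpu)
        for algo, times in rows.items():
            ratios = relative_to_fastest(times)
            cells = ", ".join(f"{lib} {r:.2f}x" for lib, r in zip(LIBS, ratios))
            print(f"  {algo}: {cells}")
```

A value of 1.00x marks the fastest library for that algorithm and CPU; larger values give the slowdown relative to it.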