BLAS/LAPACK Benchmarks¶
Test Setup¶
Algorithms¶
The following H-matrix arithmetics are tested:
H-LU Factorization:
recursive (sequential) or DAG-based (parallel) LU factorization with eager updates
Uniform H-LU Factorization:
accumulator-based LU factorization for uniform-H matrices with lazy updates
In both cases, the sizes of the dense blocks forming the lowrank factors may vary significantly with the size of the H-matrix blocks. In the Uniform-H case, however, the majority of the memory is stored in the \(k \times k\) coupling matrices and only the shared bases hold the larger dense blocks; the load therefore shifts more towards dense arithmetic on small blocks.
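To make the difference in memory layout concrete, the following is a minimal, hypothetical sketch of the two lowrank block representations; the type and member names are purely illustrative and do not correspond to HLR's actual classes.

```cpp
#include <vector>

// Standard H lowrank block: each block owns its (possibly large) dense factors
//   A ≈ U · V^T  with  U ∈ R^{n×k}, V ∈ R^{m×k}.
struct lowrank_block {
    int                 n = 0, m = 0, k = 0;
    std::vector<double> U;   // n × k factor
    std::vector<double> V;   // m × k factor
};

// Uniform-H lowrank block: only a small k × k coupling matrix per block,
//   A ≈ W · S · X^T,  with the (larger) bases W and X shared per cluster.
struct uniform_lowrank_block {
    int                        k = 0;
    std::vector<double>        S;            // k × k coupling matrix (per block)
    const std::vector<double>* W = nullptr;  // shared row cluster basis
    const std::vector<double>* X = nullptr;  // shared column cluster basis
};
```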
In the parallel case, TBB is used to execute a precomputed task graph (DAG). A sequential BLAS implementation is strongly recommended here, since the parallelization of the compute workload is already performed within the DAG execution. For the benchmarks below, this is ensured either during compilation or with the corresponding linking flags.
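As an illustration of DAG-based execution with TBB, the following sketch uses TBB's flow graph; the node bodies only print placeholders and merely stand in for the actual block LU, triangular solve and update tasks.

```cpp
#include <iostream>
#include <tbb/flow_graph.h>

int main()
{
    using namespace tbb::flow;

    graph g;

    // Placeholder tasks standing in for the block operations of one H-LU step.
    continue_node<continue_msg> lu(g,      [](const continue_msg&) { std::cout << "LU(A00)\n"; });
    continue_node<continue_msg> solve_l(g, [](const continue_msg&) { std::cout << "solve L·X = A01\n"; });
    continue_node<continue_msg> solve_u(g, [](const continue_msg&) { std::cout << "solve X·U = A10\n"; });
    continue_node<continue_msg> update(g,  [](const continue_msg&) { std::cout << "A11 -= A10·A01\n"; });

    // Dependencies: both solves wait for the diagonal LU, the update waits for both solves.
    make_edge(lu, solve_l);
    make_edge(lu, solve_u);
    make_edge(solve_l, update);
    make_edge(solve_u, update);

    lu.try_put(continue_msg());  // start the DAG
    g.wait_for_all();
    return 0;
}
```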
Applications¶
The following applications are used for testing:
Laplace SLP
command line:
--app laplaceslp --adm std --cluster h --ntile 64 --grid sphere-<n>
problem size: 32768 (sequential) and 524288 (parallel)
truncation accuracy: 1e-4 (sequential) and 1e-6 (parallel)
Matérn Covariance
command line:
--app materncov --adm weak --cluster h --ntile 64 --grid randcube-<n>
problem size: 32768 (sequential) and 131072 (parallel)
truncation accuracy: 1e-4 (sequential) and 1e-6 (parallel)
BLAS/LAPACK Libraries¶
With the exception of oneMKL, Reference-LAPACK v3.11 is used to provide LAPACK functionality.
Reference BLAS¶
version: 3.11
compilation options:
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_Fortran_FLAGS_RELEASE="-O3 -march=native -funroll-loops -ffast-math" ..
Please note that the CMake configuration was changed to permit -O3.
OpenBLAS¶
version: 0.3.23
compilation options:
make TARGET=<...> USE_THREAD=0 USE_LOCKING=1 NO_LAPACK=1
BLIS/AOCL-BLIS¶
version: 0.9.0
compilation options:
./configure --disable-threading --enable-sup-handling auto
AMD-BLIS¶
only used for AMD CPUs
version: 4.0
compilation options:
./configure --disable-threading --enable-sup-handling auto
oneMKL¶
version: 2022.0
sequential MKL libraries (mkl_sequential)
LAPACK also via Intel MKL
HLR contains the function
int mkl_serv_intel_cpu_true () { return 1; }
to permit fully optimized code paths for AMD CPUs.
Processors¶
AMD Epyc 7601
Naples generation
32 cores, 2.2 - 3.2 GHz
2 processors
2x8x64 GB DDR4-2666 RAM
AMD Epyc 7702
Rome generation
64 cores, 2.0 - 3.35 GHz
2 processors
2x8x32 GB DDR4-3200 RAM
AMD Epyc 9554
Genoa generation
64 cores, 3.1 - 3.75 GHz
2 processors
2x12x32 GB DDR5-4800 RAM
Intel Xeon E7-8867v4
Broadwell generation
18 cores, 2.4 - 3.3 GHz
4 processors
Intel Xeon Platinum 8360Y
Icelake generation
36 cores, 2.4 - 3.5 GHz
2 processors
Results¶
Each benchmark is run 10 times and the best runtime is used. Shown is the runtime in seconds, i.e., less is better.
In addition to the comparison of the different BLAS libraries per processor, the best runtimes of the individual processors are also compared against each other.
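The best-of-ten scheme corresponds to the following minimal sketch; the benchmarked callable itself is assumed to be given.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Run the benchmark N times and keep the minimal wall-clock time in seconds.
template <typename Func>
double best_of(int nruns, Func&& benchmark)
{
    double best = std::numeric_limits<double>::max();

    for (int i = 0; i < nruns; ++i) {
        auto tic = std::chrono::steady_clock::now();
        benchmark();
        auto toc = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(toc - tic).count());
    }
    return best;  // runtime in seconds, less is better
}
```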
Sequential H-LU Factorization¶
[Plots: runtime per BLAS library for the Laplace SLP and Matérn Covariance problems; below: best runtime per processor for both problems]
With the exception of the Epyc 7601 CPU, MKL shows the best performance. Especially notable is the large performance advantage of MKL on the Xeon 8360Y.
Parallel H-LU Factorization¶
[Plots: runtime per BLAS library for the Laplace SLP and Matérn Covariance problems; below: best runtime per processor for both problems]
Here, the effect of the global lock used in OpenBLAS and BLIS/AMD-BLIS becomes visible: it severely limits the usability of these BLAS implementations within H-arithmetic on many-core processors. The problem is less pronounced for the Matérn covariance matrix, since the ranks are typically larger than for the Laplace SLP problem; the individual BLAS calls therefore take longer and the probability of collisions between tasks is smaller.
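The contention pattern can be illustrated with a simple, hypothetical stress test (not HLR's actual code): many independent tasks, each issuing a small GEMM, as is typical for H-arithmetic. With a global lock inside the BLAS, these calls serialize.

```cpp
#include <vector>
#include <tbb/parallel_for.h>
#include <cblas.h>   // CBLAS interface of OpenBLAS/BLIS (for MKL: mkl.h)

int main()
{
    const int n      = 32;      // small block size, typical for coupling matrices
    const int ntasks = 100000;  // many independent updates

    // Each task performs one small GEMM; a global lock in the BLAS would
    // serialize these calls and limit scalability on many-core CPUs.
    tbb::parallel_for(0, ntasks, [=](int) {
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    });
    return 0;
}
```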
Sequential Uniform-H-LU Factorization¶
[Plots: runtime per BLAS library for the Laplace SLP and Matérn Covariance problems; below: best runtime per processor for both problems]
Here we see the same behaviour as for the H-LU factorization, i.e., MKL yields the best performance except on the Epyc Naples CPU.
Unfortunately, no parallel benchmarks are available so far, as the Uniform-H-LU factorization is not yet stable enough for multi-threaded execution.