BLAS/LAPACK Benchmarks

Test Setup

Algorithms

The following H-matrix arithmetics are tested:

  • H-LU Factorization:
    recursive (sequential) or DAG-based (parallel) LU factorization with eager updates (see the block-LU sketch after this list)

  • Uniform-H-LU Factorization:
    accumulator-based LU factorization for uniform-H matrices with lazy updates
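
To make the recursive scheme concrete, the following is a minimal sketch of a recursive block LU factorization on a plain dense matrix. It is illustrative only: simple dense kernels take the place of H-arithmetic, there are no low-rank blocks and no truncation, none of HLR's actual types appear, and the matrix size is assumed to be a power of two so that all blocks remain square:

    #include <cstdio>
    #include <vector>

    // minimal row-major matrix view (illustrative, not HLR's API)
    struct view
    {
        double *  a;   // pointer to first entry
        int       n;   // block is n×n
        int       ld;  // leading dimension of the full matrix
        double &  operator() ( int i, int j ) { return a[ i*ld + j ]; }
    };

    // solve L·X = B with L unit lower triangular; X overwrites B ("trsm")
    void solve_lower ( view L, view B )
    {
        for ( int i = 0; i < B.n; ++i )
            for ( int k = 0; k < i; ++k )
                for ( int j = 0; j < B.n; ++j )
                    B(i,j) -= L(i,k) * B(k,j);
    }

    // solve X·U = B with U upper triangular; X overwrites B ("trsm")
    void solve_upper ( view U, view B )
    {
        for ( int j = 0; j < B.n; ++j )
            for ( int i = 0; i < B.n; ++i )
            {
                for ( int k = 0; k < j; ++k )
                    B(i,j) -= B(i,k) * U(k,j);
                B(i,j) /= U(j,j);
            }
    }

    // C ← C − A·B ("gemm" update)
    void update ( view A, view B, view C )
    {
        for ( int i = 0; i < C.n; ++i )
            for ( int j = 0; j < C.n; ++j )
                for ( int k = 0; k < A.n; ++k )
                    C(i,j) -= A(i,k) * B(k,j);
    }

    // recursive block LU: A is overwritten by L (unit lower) and U
    void block_lu ( view A )
    {
        if ( A.n == 1 )
            return;                               // 1×1 block: U = A, L = 1

        const int  m = A.n / 2;
        view  A00{ A.a,                m, A.ld };
        view  A01{ A.a + m,            m, A.ld };
        view  A10{ A.a + m * A.ld,     m, A.ld };
        view  A11{ A.a + m * A.ld + m, m, A.ld };

        block_lu( A00 );                          // factorize diagonal block
        solve_lower( A00, A01 );                  // A01 ← U01
        solve_upper( A00, A10 );                  // A10 ← L10
        update( A10, A01, A11 );                  // eager Schur complement update
        block_lu( A11 );                          // recurse on the remainder
    }

    int main ()
    {
        std::vector< double >  data = { 4, 1, 2, 1,
                                        1, 5, 1, 2,
                                        2, 1, 6, 1,
                                        1, 2, 1, 7 };
        view  A{ data.data(), 4, 4 };

        block_lu( A );                            // A now holds L and U in place

        for ( int i = 0; i < 4; ++i, std::printf( "\n" ) )
            for ( int j = 0; j < 4; ++j )
                std::printf( " %8.4f", A(i,j) );
        return 0;
    }

In H-LU, the two solves and the update act on H-matrix blocks with low-rank truncation, and in the DAG-based variant this recursion is unrolled into tasks with exactly these dependencies.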

In both cases, the sizes of the dense blocks used as low-rank factors may vary significantly with the size of the H-matrix blocks. In the uniform-H case, however, the majority of the memory resides in the \(k \times k\) coupling matrices, and only the shared bases hold the larger dense factors. The load therefore shifts more towards dense arithmetic on small blocks.
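
As a brief illustration (using the standard uniform-H representation, which the text above only implies), each admissible block \(t \times s\) is stored as

\[ M|_{t \times s} \approx U_t \, S_{t,s} \, V_s^H, \qquad U_t \in \mathbb{R}^{n_t \times k_t}, \quad S_{t,s} \in \mathbb{R}^{k_t \times k_s}, \quad V_s \in \mathbb{R}^{n_s \times k_s}, \]

so each individual block contributes only the small coupling matrix \(S_{t,s}\), while the larger factors \(U_t\) and \(V_s\) are shared by all blocks in the same block row or column.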

In the parallel case, TBB is used to execute a precomputed task graph (DAG); a toy version of this execution model is sketched below. A sequential BLAS implementation is strongly recommended here, as the parallelization of the compute workload is already performed by the DAG execution. For the benchmarks below, this is ensured either during compilation or with the correct linking flags.
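
As a rough illustration of the DAG-based execution model, the following toy program builds a three-task graph with the TBB flow-graph API; the tasks and their kernel names are purely illustrative and not HLR's actual DAG code:

    #include <cstdio>
    #include <tbb/flow_graph.h>

    int main ()
    {
        using namespace tbb::flow;

        graph  g;

        // three tasks of a toy DAG; in HLR, each node would call
        // sequential BLAS/LAPACK kernels on individual matrix blocks
        continue_node< continue_msg >  factorize( g, []( const continue_msg & ) { std::printf( "getrf\n" ); } );
        continue_node< continue_msg >  solve(     g, []( const continue_msg & ) { std::printf( "trsm\n"  ); } );
        continue_node< continue_msg >  update(    g, []( const continue_msg & ) { std::printf( "gemm\n"  ); } );

        // edges encode the dependencies of the precomputed DAG;
        // independent nodes are executed in parallel by TBB
        make_edge( factorize, solve  );
        make_edge( solve,     update );

        factorize.try_put( continue_msg() );  // trigger the root task
        g.wait_for_all();                     // execute the whole DAG

        return 0;
    }

Since each node already runs on some TBB worker thread, any parallelism inside the BLAS library would only oversubscribe the cores, which is why the sequential variants of the libraries are used throughout.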

Applications

The following applications are used for testing:

  • Laplace SLP

    • command line: --app laplaceslp --adm std --cluster h --ntile 64 --grid sphere-<n>

    • problem size: 32768 (sequential) and 524288 (parallel)

    • truncation accuracy: 1e-4 (sequential) and 1e-6 (parallel)

  • Matérn Covariance

    • command line: --app materncov --adm weak --cluster h --ntile 64 --grid randcube-<n>

    • problem size: 32768 (sequential) and 131072 (parallel)

    • truncation accuracy: 1e-4 (sequential) and 1e-6 (parallel)

BLAS/LAPACK Libraries

With the exception of oneMKL, Reference-LAPACK v3.11 is used to provide LAPACK functionality.

Reference BLAS

  • version: 3.11

  • compilation options:
    cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_Fortran_FLAGS_RELEASE="-O3 -march=native -funroll-loops -ffast-math" ..

Please note that the CMake configuration was changed to permit -O3.

OpenBLAS

  • version: 0.3.23

  • compilation options:
    make TARGET=<...> USE_THREAD=0 USE_LOCKING=1 NO_LAPACK=1

BLIS

  • version: 0.9.0

  • compilation options:
    ./configure --disable-threading --enable-sup-handling auto

AMD-BLIS

  • only used for AMD CPUs

  • version: 4.0

  • compilation options:
    ./configure --disable-threading --enable-sup-handling auto

oneMKL

  • version: 2022.0

  • sequential MKL libraries (mkl_sequential)

  • LAPACK also via Intel MKL

HLR contains the function

int mkl_serv_intel_cpu_true () { return 1; }

to permit fully optimized code paths for AMD CPUs.
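
This presumably works via link-time symbol resolution: a definition provided by the application takes precedence over the one inside MKL, so the CPU dispatcher selects the optimized kernels as if running on a genuine Intel CPU. Note that mkl_serv_intel_cpu_true is an internal, undocumented MKL symbol; the sketch below describes the widely used workaround, not a supported interface. In a C++ code base it might be embedded as:

    // linked into the application itself so that MKL's internal CPU
    // check resolves to this definition; extern "C" keeps the symbol
    // name unmangled (relies on undocumented MKL behaviour)
    extern "C" int mkl_serv_intel_cpu_true ()
    {
        return 1;  // unconditionally report an Intel CPU
    }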

Processors

  • AMD Epyc 7601

    • Naples generation

    • 32 cores, 2.2 - 3.2 GHz

    • 2 processors

    • 2x8x64 GB DDR4-2666 RAM

  • AMD Epyc 7702

    • Rome generation

    • 64 cores, 2.0 - 3.35 GHz

    • 2 processors

    • 2x8x32 GB DDR4-3200 RAM

  • AMD Epyc 9554

    • Genoa generation

    • 64 cores, 3.1 - 3.75 GHz

    • 2 processors

    • 2x12x32 GB DDR5-4800 RAM

  • Intel Xeon E7-8867v4

    • Broadwell generation

    • 18 cores, 2.4 - 3.3 GHz

    • 4 processors

  • Intel Xeon Platinum 8360Y

    • Ice Lake generation

    • 36 cores, 2.4 - 3.5 GHz

    • 2 processors

Results

Each benchmark is run 10 times and the best runtime is used; see the measurement sketch below. Shown is the runtime in seconds, i.e., lower is better.
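
For illustration, a minimal sketch of this best-of-10 measurement (run_benchmark is a hypothetical stand-in for one complete factorization):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    // hypothetical stand-in for one complete H-LU factorization run
    void run_benchmark () { /* ... */ }

    int main ()
    {
        double  best = 1.0e300;

        for ( int i = 0; i < 10; ++i )  // 10 runs, keep the fastest
        {
            const auto  tic = std::chrono::steady_clock::now();

            run_benchmark();

            const auto  toc = std::chrono::steady_clock::now();

            best = std::min( best, std::chrono::duration< double >( toc - tic ).count() );
        }

        std::printf( "best runtime: %.3f s\n", best );  // lower is better
        return 0;
    }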

In addition to comparing the different BLAS libraries per processor, the best runtimes of each processor are also compared against each other.

Sequential H-LU Factorization

(figures: runtime plots for Laplace SLP (laplace--approx-lu--seq) and Matérn Covariance (materncov--approx-lu--seq), plus the corresponding best-per-processor plots)

With the exception of the Epyc 7601 CPU, MKL shows the best performance. Especially notable is the large performance advantage of MKL on the Xeon 8360Y.

Parallel H-LU Factorization

(figures: runtime plots for Laplace SLP (laplace--approx-lu--tbb) and Matérn Covariance (materncov--approx-lu--tbb), plus the corresponding best-per-processor plots)

Here, the effect of the global lock used in OpenBLAS and BLIS/AMD-BLIS can be observed: it severely limits the usability of these BLAS implementations within H-arithmetic on many-core processors (see the illustration below). This is less of a problem for the Matérn covariance matrix, since its ranks are typically larger than for the Laplace SLP problem; individual BLAS calls therefore run longer, which reduces the probability of collisions between tasks.
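
The following toy program illustrates the effect; the mutex merely stands in for a library-wide lock around each kernel call and does not depict the actual library internals:

    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex  blas_lock;  // stands in for a global lock inside the BLAS library

    // stand-in for a short BLAS call on a small block: with many such
    // calls, tasks spend most of their time waiting for the lock
    void small_gemm ()
    {
        std::lock_guard< std::mutex >  guard( blas_lock );
        // ... the actual compute kernel would run here ...
    }

    int main ()
    {
        std::vector< std::thread >  workers;

        // many parallel tasks, each issuing many short BLAS calls:
        // all of them serialize on the single global lock
        for ( int i = 0; i < 8; ++i )
            workers.emplace_back( []
            {
                for ( int j = 0; j < 100000; ++j )
                    small_gemm();
            } );

        for ( auto &  t : workers )
            t.join();

        std::printf( "done\n" );
        return 0;
    }

With larger blocks, as for the Matérn covariance problem, each call holds the lock longer but there are far fewer calls, so the relative locking overhead shrinks.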

Sequential Uniform-H-LU Factorization

(figures: runtime plots for Laplace SLP (laplace--uniform-lu--seq) and Matérn Covariance (materncov--uniform-lu--seq), plus the corresponding best-per-processor plots)

Here we see the same behaviour as for H-LU factorization, i.e., MKL yields the best performance except on the Epyc Naples CPU.

Unfortunately, no parallel benchmarks are available so far, as the Uniform-H-LU factorization is not yet stable enough for multi-threaded execution.