BLAS/LAPACK Benchmarks

Test Setup

Algorithms

The following H-matrix arithmetic operations are tested:

  • H-mm-eager : eager H-matrix multiplication

  • H-mm-accu : H-matrix multiplication using accumulators

  • H-lu-eager : eager H-LU factorization

  • H-lu-accu : H-LU factorization using accumulators

  • UniH-mm : Uniform-H matrix multiplication

  • UniH-lu : Uniform-H LU factorization

Applications

The following applications are used for testing:

  • laplace : Laplace SLP

    • command line: --app laplaceslp --adm std --cluster h --ntile 64 --grid sphere2-6 -e 1e-6

    • problem size: 65536

    • truncation accuracy: 1e-6

BLAS/LAPACK Libraries

OpenBLAS

  • version: 0.3.20

  • compilation options: make TARGET=<...> USE_THREAD=0 USE_LOCKING=1 NO_LAPACK=1 with targets ZEN and SKYLAKEX

  • with Reference-LAPACK v3.10.0

BLIS

  • version: 0.8.1

  • compilation options: ./configure --disable-threading --enable-mixed-dt --enable-sup-handling <...> with targets zen2 and skx

  • with Reference-LAPACK v3.10.0

Intel MKL

  • version: 2020.0

  • on AMD Epyc with MKL_DEBUG_CPU_TYPE=5 to enable the AVX2 code path

  • LAPACK also via Intel MKL

Results

Each benchmark is run 10 times and the best (shortest) runtime is reported.
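The best-of-10 rule can be sketched as a small timing harness. This is only an illustration, not the benchmark driver used here: the names `best_of` and `dummy_workload` are hypothetical, and measuring wall-clock time with `time.perf_counter` is an assumption, since the timer is not specified above.

```python
import time


def best_of(func, repeats=10):
    """Call func() `repeats` times and return the shortest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        # keep the minimum: it is the run least disturbed by system noise
        best = min(best, elapsed)
    return best


def dummy_workload():
    # hypothetical stand-in for an H-matrix multiplication or LU factorization
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(f"best of 10: {best_of(dummy_workload):.4f}s")
```

Taking the minimum rather than the mean is a common choice for benchmarks like these, because interference from the operating system or other processes can only make a run slower, never faster.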

Laplace

            AMD Epyc 7702 (Rome)            Intel Xeon 8360Y (IceLake)
Algorithm   OpenBLAS  BLIS      Intel MKL   OpenBLAS  BLIS      Intel MKL
H-mm-accu   144.0s    167.9s    127.3s      160.4s    288.9s    152.7s
UniH-mm     173.1s    169.5s    147.1s      161.7s    304.0s    154.3s
H-lu-eager  111.6s    145.2s    106.0s      136.5s    211.4s    128.6s
H-lu-accu   60.84s    70.14s    53.91s      73.94s    111.4s    68.80s
UniH-lu     86.23s    90.23s    75.95s      95.31s    141.0s    90.50s