# BLAS/LAPACK Benchmarks

## Test Setup

### Algorithms

The following H-matrix arithmetic variants are tested:

**H-LU Factorization**: recursive (sequential) or DAG-based (parallel) LU factorization with eager updates

**Uniform H-LU Factorization**: accumulator-based LU factorization for uniform-H matrices with lazy updates

In both cases, the sizes of the dense blocks forming the low-rank factors may vary significantly with the size of the H-matrix blocks. In the Uniform-H case, however, the majority of the memory is held in the \(k \times k\) coupling matrices and only the shared bases store the larger dense blocks. The load therefore shifts more towards dense arithmetic on small blocks.

In the parallel case, TBB is used to execute a precomputed task graph (DAG). A sequential BLAS implementation is strongly recommended here, as the parallelization of the compute workload is performed *within* the DAG execution. For the benchmarks below, this is ensured either during compilation or via the correct linking flags.

### Applications

The following applications are used for testing:

**Laplace SLP**

command line: `--app laplaceslp --adm std --cluster h --ntile 64 --grid sphere-<n>`

problem size: *32768* (sequential) and *524288* (parallel)

truncation accuracy: *1e-4* (sequential) and *1e-6* (parallel)

**Matérn Covariance**

command line: `--app materncov --adm weak --cluster h --ntile 64 --grid randcube-<n>`

problem size: *32768* (sequential) and *131072* (parallel)

truncation accuracy: *1e-4* (sequential) and *1e-6* (parallel)

### BLAS/LAPACK Libraries

With the exception of oneMKL, Reference-LAPACK v3.11 is used to provide LAPACK functionality.

#### Reference BLAS

version: 3.11

compilation options:

`cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_Fortran_FLAGS_RELEASE="-O3 -march=native -funroll-loops -ffast-math" ..`

Please note that the CMake configuration was changed to permit *-O3*.

#### OpenBLAS

version: 0.3.23

compilation options:

`make TARGET=<...> USE_THREAD=0 USE_LOCKING=1 NO_LAPACK=1`

#### BLIS/AOCL-BLIS

version: 0.9.0

compilation options:

`./configure --disable-threading --enable-sup-handling auto`

#### AMD-BLIS

only used for AMD CPUs

version: 4.0

compilation options:

`./configure --disable-threading --enable-sup-handling auto`

#### oneMKL

version: 2022.0

sequential MKL libraries (*mkl_sequential*)

LAPACK also via Intel MKL

HLR contains the function

```
int mkl_serv_intel_cpu_true () { return 1; }
```

to override MKL's internal CPU vendor check and thereby permit fully optimized code paths on AMD CPUs.

### Processors

**AMD Epyc 7601** (*Naples* generation)

32 cores, 2.2 - 3.2 GHz

2 processors

2x8x64 GB DDR4-2666 RAM

**AMD Epyc 7702** (*Rome* generation)

64 cores, 2.0 - 3.35 GHz

2 processors

2x8x32 GB DDR4-3200 RAM

**AMD Epyc 9554** (*Genoa* generation)

64 cores, 3.1 - 3.75 GHz

2 processors

2x12x32 GB DDR5-4800 RAM

**Intel Xeon E7-8867v4** (*Broadwell* generation)

18 cores, 2.4 - 3.3 GHz

4 processors

**Intel Xeon Platinum 8360Y** (*Ice Lake* generation)

36 cores, 2.4 - 3.5 GHz

2 processors

## Results

Each benchmark is run **10 times** and the **best** runtime is used. Shown is the runtime in **seconds**, i.e., less is better.

In addition to comparing the different BLAS libraries per processor, the best runtimes of the processors are compared with each other.

### Sequential H-LU Factorization

*(plots: runtime per BLAS library for Laplace SLP and Matérn Covariance, and best runtime per processor)*
With the exception of the Epyc 7601 CPU, MKL shows the best performance. Especially notable is the large performance advantage of MKL on the Xeon 8360Y.

### Parallel H-LU Factorization

*(plots: runtime per BLAS library for Laplace SLP and Matérn Covariance, and best runtime per processor)*

Here, the effect of the **global lock** used in OpenBLAS and BLIS/AMD-BLIS can be observed; it severely limits the usability of these BLAS implementations on many-core processors within H-arithmetic. This is less of a problem for the Matérn covariance matrix, as its ranks are typically larger than for the Laplace SLP problem, making individual BLAS calls longer-running and lowering the collision probability between tasks.

### Sequential Uniform-H-LU Factorization

*(plots: runtime per BLAS library for Laplace SLP and Matérn Covariance, and best runtime per processor)*

Here we see the same behaviour as for H-LU factorization, i.e., MKL yields the best performance except on the Epyc Naples CPU.

Unfortunately, no parallel benchmarks are available so far, as the Uniform-H-LU factorization is not yet stable enough for multi-threaded execution.