# Binary Compression

Making use of the *memory accessor* concept from [1], different precisions are used for data storage and for
computation. For H-matrices, dense and low-rank data are represented in different, *compressed* formats while the
arithmetic is still performed in double precision (FP64).

See also https://arxiv.org/abs/2308.10960.

## Compression Schemes

HLR uses *direct compression*, where a given accuracy is applied to the full data block, and *APLR* compression, which
only works for low-rank data.

### Direct Compression

Choose one of the following compressors at compile time via

```
> scons compressor=<...>
```

In addition, **none** is supported, which disables direct compression.

#### afl

The mantissa bit length is chosen as \(m := \lceil - \log_2 \varepsilon \rceil\) with \(\varepsilon\) being the (relative) low-rank approximation error. For the exponent bit length \(e\), the minimal and maximal (absolute) values \(d_{\min}\) and \(d_{\max}\) of the given data \(D\) are determined. Then \(e := \lceil \log_2 \log_2 ( d_{\max} / d_{\min} ) \rceil\).

The total bit size \(n\) is then defined as

\(n = 1 + e + m\)

and may be between 3 and 64.

Furthermore, all values are scaled by \(1/d_{\min}\) and shifted by \(1\) such that only the lowest exponent bits need to be stored.

All compressed data is then stored and loaded bit-wise to minimize storage.
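As an illustration, the bit-length selection above can be sketched as follows (a minimal sketch in Python; the function name and structure are illustrative, not the actual HLR implementation):

```python
import math

def afl_bit_lengths(data, eps):
    # mantissa bits from the requested (relative) accuracy: m = ceil(-log2(eps))
    m = math.ceil(-math.log2(eps))
    # exponent bits from the dynamic range of the data
    d_min = min(abs(v) for v in data if v != 0.0)
    d_max = max(abs(v) for v in data)
    ratio = d_max / d_min
    e = 1 if ratio <= 2.0 else math.ceil(math.log2(math.log2(ratio)))
    n = 1 + e + m                      # sign + exponent + mantissa
    return e, m, max(3, min(n, 64))    # total size is clamped to [3, 64]
```

For example, with \(\varepsilon = 10^{-4}\) and a dynamic range of \(2^{16}\), this yields \(m = 14\), \(e = 4\) and \(n = 19\) bits per value.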

#### aflp

This compression method works identically to *afl*, but the mantissa size \(m\) is increased such that \(n = 1 + e + m\) is a
multiple of \(8\). This results in much more efficient memory I/O at the cost of some storage efficiency.
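The byte alignment can be sketched as (illustrative helper, not HLR's API):

```python
def aflp_bit_length(n_afl):
    # round the afl bit size up to the next multiple of 8,
    # so each value occupies whole bytes
    return ((n_afl + 7) // 8) * 8
```

An afl size of 19 bits would thus be stored in 24 bits (3 bytes).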

#### bfl (1-8-m)

This format generalizes BF16/FP32 to arbitrary precision: it uses a fixed exponent size of 8 bits but chooses the
mantissa length as in *afl*, again rounding up for byte-aligned storage.

For 16 bits the format is identical to BF16, for 32 bits identical to FP32, and for 64 bits identical to FP64 (then with an 11-bit exponent!).

#### dfl (1-11-m)

This works in the same way as *bfl* but uses 11 exponent bits, as in the FP64 format.
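Under these rules, the total bit sizes of *bfl* and *dfl* can be sketched as (illustrative, assuming the afl mantissa rule and byte alignment described above):

```python
import math

def fixed_exponent_bits(eps, e_bits):
    # e_bits = 8 for bfl (as in BF16/FP32), e_bits = 11 for dfl (as in FP64)
    m = math.ceil(-math.log2(eps))     # mantissa as in afl
    n = 1 + e_bits + m                 # sign + fixed exponent + mantissa
    n = ((n + 7) // 8) * 8             # round up to whole bytes
    return min(n, 64)
```

For \(\varepsilon = 10^{-4}\), *bfl* needs \(1 + 8 + 14 = 23\) bits, rounded up to 24, while *dfl* needs \(1 + 11 + 14 = 26\) bits, rounded up to 32.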

#### Posits

Uses the posit number format as implemented by the universal
library. Precision bits are chosen as in *afl*, with a constant 2 exponent bits. All data is scaled by the maximal
entry. Storage is bit-wise.

#### ZFP

Uses the floating point compression library ZFP with a *constant bit rate* chosen
depending on the accuracy.

Better compression results could be achieved with ZFP's *fixed accuracy mode*, but error control then proved problematic
for different problem sizes.

#### BLOSC

Uses BLOSC for compression with mantissa truncation, with the bit length chosen as in
*afl*.

### APLR Compression

Combines a direct compression scheme with the mixed precision approach from [2].

Select the APLR method via

```
> scons aplr=<...>
```

with one of the following or **default**, in which case the same compressor as for direct compression is used.

In addition to *afl*, *aflp*, *bfl*, *dfl*, *zfp* and *blosc*, two mixed precision methods are implemented.

#### mp3

Uses a mixed precision approach using FP64, FP32 and BF16 as in [2].

#### mp2

Uses a mixed precision approach using FP64 and FP32 as in [3].
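The common idea of both methods can be sketched as follows: each singular triplet of a low-rank block is stored in the coarsest precision whose unit roundoff, weighted by the corresponding singular value, stays below the requested accuracy. A minimal sketch (the selection criterion is a simplification of the scheme in [2], and all names are illustrative):

```python
def mp_partition(sigma, eps, roundoffs=(2.0**-52, 2.0**-23, 2.0**-8)):
    # roundoffs: unit roundoffs of FP64, FP32 and BF16 (mp3);
    # drop the last entry for the two-precision variant (mp2)
    s0 = sigma[0]
    precisions = []
    for s in sigma:  # sigma sorted in descending order
        # pick the coarsest precision keeping the perturbation below eps * s0
        chosen = 0
        for p in range(len(roundoffs) - 1, -1, -1):
            if roundoffs[p] * s <= eps * s0:
                chosen = p
                break
        precisions.append(chosen)  # 0 = FP64, 1 = FP32, 2 = BF16
    return precisions
```

Large singular values thus keep full precision while the tail of the spectrum is stored ever more coarsely, which is what makes the approach effective for low-rank data.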

## H-Matrix Construction

### Program

The corresponding programs are *compress* and *mixedprec*, built as

```
scons programs=compress,mixedprec frameworks=seq,tbb compressor=<...> aplr=<...>
```

Set *aplr* to *default* to use the same method as for *compressor*.

### Results

All experiments are performed on a 2-socket AMD Epyc 7702 (Rome) system with 2×64 cores and 2×8×32 GB DDR4-3200 DIMMs. All cores of the system are used, though without hyper-threading. Parallelization is performed using Intel TBB.

For runtimes, the median out of 10 runs is used.

### Matrix Storage

| | Laplace SLP | Matérn Covariance |
| --- | --- | --- |
| Direct Compression | *(plot omitted)* | *(plot omitted)* |
| APLR Compression | *(plot omitted)* | *(plot omitted)* |

BLOSC gives the best compression results, with Posits, ZFP and AFL following behind. The rounded-up mantissa and fixed exponent size of AFLP, BFL and DFL increase memory usage. The mixed precision approach in particular shows little compression for low accuracies, as no dense blocks are compressed.

### Error

For a given H-matrix \(M\), the relative error \(\frac{\lVert M-\tilde M \rVert_2}{\lVert M \rVert_2}\) of the compressed version \(\tilde M\) is computed and compared to the initial low-rank error \(\varepsilon\), shown as a black line. Results are only shown for APLR as this is the more demanding case.
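The error measure above corresponds to the following computation (sketch using NumPy; `np.linalg.norm(·, 2)` computes the spectral norm):

```python
import numpy as np

def rel_error(M, M_tilde):
    # relative spectral-norm error ||M - M~||_2 / ||M||_2
    return np.linalg.norm(M - M_tilde, 2) / np.linalg.norm(M, 2)
```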

| Laplace SLP | Matérn Covariance |
| --- | --- |
| *(plot omitted)* | *(plot omitted)* |

All methods are (by construction) close to the requested error, with BLOSC and AFL slightly above. The rounded-up methods (aflp, bfl and dfl) result in an overly accurate compression, thereby losing storage efficiency.

### Time for Compression

The runtime as a percentage of the matrix construction time is shown and hence depends on the cost of setting up the matrix, which is higher for Laplace than for the Matérn covariance. As compression is performed independently for each block of the matrix, the parallel efficiency equals that of construction (near perfect speedup).

Again, only results for APLR compression are shown since this takes more time.

| Laplace SLP | Matérn Covariance |
| --- | --- |
| *(plot omitted)* | *(plot omitted)* |

BLOSC shows a high runtime for compression, followed by ZFP. All other methods are much faster and do not increase the total construction time.

Note that Posits are excluded due to their inefficient conversion and storage implementation within HLR.

## H-Matrix-Vector Multiplication

The time for decompression has a direct influence on the mat-vec performance. Shown again is the relative runtime compared to the uncompressed version (black line).

| | Laplace SLP | Matérn Covariance |
| --- | --- | --- |
| Direct Compression | *(plot omitted)* | *(plot omitted)* |
| APLR Compression | *(plot omitted)* | *(plot omitted)* |

Of particular interest are aflp, bfl and dfl, which, for lower accuracies, are **faster** than the uncompressed
mat-vec. This is due to less matrix data being loaded/stored, which more than compensates for the additional
decompression cost in a thread-parallel setting.

Decompression with BLOSC is much faster than compression.

ZFP shows the highest decompression cost, resulting in a much increased runtime. Due to its inefficient memory I/O, afl is up to three times slower than the uncompressed mat-vec.

## References

1. Anzt, Flegar, Grützmacher, Quintana-Ortí: *“Toward a modular precision ecosystem for high-performance computing”*, Int. J. of HPC Applications, 33(6), 1069–1078, 2019.
2. Amestoy, Boiteau, Buttari, Gerest, Jézéquel, L’Excellent, Mary: *“Mixed precision low-rank approximations and their application to block low-rank LU factorization”*, IMA J. of Num. Analysis, 2022.
3. Ooi, Iwashita, Fukaya, Ida, Yokota: *“Effect of Mixed Precision Computing on H-Matrix Vector Multiplication in BEM Analysis”*, Proceedings of HPCAsia2020, 2020.