# Binary Compression¶

Making use of the *memory accessor* concept from [1], different precisions are used for data storage and for
computation. For H-matrices, dense and low-rank data is here represented in different, *compressed* formats while the
arithmetic is still performed in double precision (FP64).

## Program¶

The corresponding programs are *compress* and *compress-lu* and built as

```
scons programs=compress,compress-lu frameworks=seq,tbb
```

## Compressors¶

### ZFP¶

Uses the floating point compression library ZFP with *constant bit rate*, chosen
depending on the accuracy.

Better compression result could be achived with ZFP and *fixed accuracy mode* but then error control was problematic
for different problem sizes.

### SZ/SZ3¶

Uses SZ and SZ3 for compression. However, SZ resulted in issues with multi-threaded usage and SZ3 had runtime issues for larger data. So, only compression rates are presented.

### afloat¶

The motivation for this format is, that most (standard) floating point representation support a huge dynamic range of the data, which typically is not needed in practise. Instead of the 8 bits (BF16/24/FP32) or 11 bits (FP64) often only 1-6 bits are needed (observed during H-LU factorization for different applications). As with the other formats, this is in addition to the savings in mantissa size due to reduced precision required for the data (especially for low-rank approximation with its user-defined accuracy).

For given data \(D\) the minimal and maximal values \(d_{\min}\) and \(d_{\max}\) are determined and the exponent bit range \(e\) chosen to represent the dynamic range of \(D\). The mantissa length \(m\) is chosen based on the required precision.

The total bit size \(n\) is then defined as

\(n = 1 + e + m\)

and may be between 3 and 64.

Furthermore, all values are scaled by \(1/d_{\min}\) and shifted by \(1\) such that only the lowest exponent bits need to be stored.

All compressed data is then stored and loaded bit-wise to minimize storage.

### apfloat¶

This compression method works similar to *afloat* but the mantissa size \(m\) is padded such that \(n = 1 + e + m\) is a
multiple of \(8\). This results in much more efficient memory I/O with a loss in storage efficiency.

### bfloat¶

This format generalizes BF16/FP32 to arbitrary precision and uses a fixed exponent size of 8 bit but chooses the mantissa length depending on the required precision such that in multiples of 8 in total bit size are achieved, i.e.,

1-8-N

such that \(1+8+N = 8·p\) for some integer p.

For 16 bit, the format is identical to BF16, for 32 bit identical to FP32 and for 64 bit identical to FP64 (also with 11 bit exponent!).

### dfloat¶

This works in the same way as *bfloat* but uses 11 exponent bits as with the FP64 format.

## Results¶

All experiments are performed on a 2-socket AMD Epyc 7702 (Rome) system with 2×64 cores and 2x8x32 GB DDR4-3200 DIMMs. All cores of the system are used, though without hyper threading. Parallelization is performed using Intel TBB.

For runtimes, the median out of 5 runs is used.

### Matrix Storage¶

Laplace SLP | Matérn Covariance |
---|---|

Shown is the percentage of the storage needed for the original H-matrix (black line).

Best compression is achieved by ZFP, followed by afloat, apfloat, bfloat and float.

### Error¶

For a given H-matrix \(M\) the relative error \(\frac{\lVert M-\tilde M \rVert_2}{\lVert M \rVert_2}\) is computed for the compressed version \(\tilde M\).

Shown is also the chosen accuracy (black line).

Laplace SLP | Matérn Covariance |
---|---|

All methods are (by construction) below the requested error boundary, though the padded methods (apfloat, bfloat and dfloat) usually result in a too accurate compression, thereby loosing storage efficiency.

### Time for Compression¶

The *relative* runtime compared to matrix construction time (black line) is shown and hence depends on the cost for
setting up the matrix, which is higher for Laplace compared to Matérn covariance. As compression is performed for each
block of the matrix independently, parallelization efficiency is equally good as for construction (near perfect speedup).

Laplace SLP | Matérn Covariance |
---|---|

ZFP is slightly slower compared to the other compression schemes, but for both applications the runtime is small compared to matrix construction itself.

### Time for Mat-Vec¶

The time for decompression has a direct influence on the mat-vec performance. Shown again is the relative runtime compared to the uncompressed version (black line).

Laplace SLP | Matérn Covariance |
---|---|

Of particular interest are apfloat, bfloat and dfloat, which, for accuracies above \(10^{-8}\) (Laplace) and
\(10^{-7}\) (Matérn covariance) are **faster** compared to the uncompressed mat-vec. This is due to less matrix data being
loaded/stored which more than compensates the additional decompression costs in a thread-parallel setting.

ZFP showed the highest decompression cost resulting in up to 10× higher runtime. Due to inefficient memory I/O afloat is up to 3 times slower compared to the uncompressed mat-vec.

## References¶

Anzt, Flegar, Grützmacher, Quintana-Ortí:

*“Toward a modular precision ecosystem for high-performance computing”*, Int. J. of HPC Applications, 33(6), 1069–1078, 2019