Binary Compression

Making use of the memory accessor concept from [1], different precisions are used for data storage and for computation. For H-matrices, dense and low-rank data are represented in different compressed formats, while the arithmetic is still performed in double precision (FP64).

See also https://arxiv.org/abs/2308.10960.


Compression Schemes

HLR uses two compression schemes: direct compression, where a given accuracy is applied to the full data block, and APLR compression, which only works for lowrank data.

Direct Compression

Select one of the following compressors at compile time via

> scons compressor=<...>

In addition, none is supported, which disables direct compression.
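
For example, a build using the afl compressor described below would be configured as

> scons compressor=afl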

afl

The mantissa bit length is chosen as \(m := \lceil - \log_2 \varepsilon \rceil\) with \(\varepsilon\) being the (relative) lowrank approximation error. For the exponent bit length \(e\), the minimal and maximal absolute values \(d_{\min}\) and \(d_{\max}\) of the given data \(D\) are determined. Then \(e := \lceil \log_2 \log_2 ( d_{\max} / d_{\min} ) \rceil\).

The total bit size \(n\) is then defined as

\(n = 1 + e + m\)

and may be between 3 and 64.

Furthermore, all values are scaled by \(1/d_{\min}\) and shifted by \(1\) such that only the lowest exponent bits need to be stored.

All compressed data is then stored and loaded bit-wise to minimize storage.
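
The following is a minimal sketch (not the actual HLR implementation) of how the afl bit lengths could be derived from the accuracy \(\varepsilon\) and the data range; all function and variable names are illustrative only.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// illustrative sketch of the afl bit length selection: mantissa bits from the
// accuracy eps, exponent bits from the dynamic range of the data
struct afl_config { uint32_t m, e, n; };

afl_config
afl_bits ( const double  eps,     // relative lowrank approximation error
           const double  dmin,    // minimal absolute value in the data
           const double  dmax )   // maximal absolute value in the data
{
    // mantissa bits: m = ceil( -log2(eps) )
    const auto  m = uint32_t( std::ceil( -std::log2( eps ) ) );

    // exponent bits: enough to cover the exponent range log2( dmax / dmin )
    const auto  range = std::max( std::log2( dmax / dmin ), 1.0 );
    const auto  e     = uint32_t( std::ceil( std::log2( range ) ) );

    // total bits: sign + exponent + mantissa, between 3 and 64
    const auto  n = std::min< uint32_t >( 1 + e + m, 64 );

    return { m, e, n };
}
```

For example, \(\varepsilon = 10^{-4}\) gives \(m = \lceil 13.29 \rceil = 14\) mantissa bits.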

aflp

This compression method works identically to afl but the mantissa size \(m\) is increased such that \(n = 1 + e + m\) is a multiple of \(8\). This results in much more efficient memory I/O at the cost of some storage efficiency.
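
A sketch of the byte alignment, with \(e\) and \(m\) as chosen for afl above (illustrative only):

```cpp
#include <cstdint>

// aflp: enlarge the mantissa such that the total bit size 1 + e + m
// becomes a multiple of 8 (full bytes)
inline uint32_t
aflp_bits ( const uint32_t e, const uint32_t m )
{
    const uint32_t  n_afl = 1 + e + m;

    return 8 * ( ( n_afl + 7 ) / 8 );   // rounded up to full bytes
}
```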

bfl (1-8-m)

This format generalizes BF16/FP32 to arbitrary precision: it uses a fixed exponent size of 8 bits but chooses the mantissa length as in afl, again rounded up for byte-aligned storage.

At 16 bits the format is identical to BF16, at 32 bits to FP32 and at 64 bits to FP64 (in that case with an 11-bit exponent!).
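
For the 16-bit case this coincides with the usual FP32 → BF16 conversion; a rough sketch using simple truncation (the actual implementation may round) is:

```cpp
#include <cstdint>
#include <cstring>

// bfl with n = 16: keep sign, the 8 exponent bits and the top 7 mantissa bits
// of an FP32 value (sketch with truncation instead of rounding)
uint16_t to_bf16 ( float f )
{
    uint32_t  bits;

    std::memcpy( &bits, &f, sizeof(bits) );
    return uint16_t( bits >> 16 );
}

float from_bf16 ( uint16_t h )
{
    const uint32_t  bits = uint32_t(h) << 16;
    float           f;

    std::memcpy( &f, &bits, sizeof(f) );
    return f;
}
```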

dfl (1-11-m)

This works in the same way as bfl but uses 11 exponent bits as with the FP64 format.

Posits

Uses the posit number format as implemented by the universal library. The precision bits are chosen as in afl, with a constant two exponent bits. All data is scaled by the maximal entry and stored bit-wise.
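
A minimal sketch of storing a value via a fixed-size posit from the universal library; the header path and the 16-bit width are assumptions for illustration, since the actual HLR code chooses the posit size at runtime from the accuracy:

```cpp
#include <universal/number/posit/posit.hpp>   // header path may differ between universal versions

// sketch: represent a (pre-scaled) value as a 16-bit posit with 2 exponent bits
using posit16 = sw::universal::posit< 16, 2 >;

inline posit16  compress16   ( double  v ) { return posit16( v ); }  // store
inline double   decompress16 ( posit16 p ) { return double( p );  }  // load
```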

ZFP

Uses the floating-point compression library ZFP with a constant bit rate, chosen depending on the accuracy.

Better compression results could be achieved with ZFP's fixed-accuracy mode, but error control then proved problematic for different problem sizes.
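
As an illustration of the fixed-rate mode (not the actual HLR wrapper; function name, array shape and buffer handling are placeholders), a compression call with the ZFP C API could look like this:

```cpp
#include <zfp.h>

// sketch: compress an nx × ny array of doubles with a fixed bit rate
// (bits per value); returns the compressed size in bytes (0 on failure)
size_t
compress_fixed_rate ( double * data, size_t nx, size_t ny,
                      double rate, void * buffer, size_t bufsize )
{
    zfp_field *   field  = zfp_field_2d( data, zfp_type_double, nx, ny );
    zfp_stream *  zfp    = zfp_stream_open( NULL );
    bitstream *   stream = stream_open( buffer, bufsize );

    // fixed-rate mode: <rate> bits per value for 2D double data
    zfp_stream_set_rate( zfp, rate, zfp_type_double, 2, 0 );
    zfp_stream_set_bit_stream( zfp, stream );
    zfp_stream_rewind( zfp );

    const size_t  csize = zfp_compress( zfp, field );

    zfp_field_free( field );
    zfp_stream_close( zfp );
    stream_close( stream );

    return csize;
}
```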

BLOSC

Uses BLOSC for compression combined with mantissa truncation, with the bit length chosen as in afl.
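
A rough sketch using the c-blosc2 context API with its precision truncation filter; the concrete filter pipeline, codec settings and the meaning of the precision parameter are assumptions and may differ from the HLR wrapper:

```cpp
#include <blosc2.h>

// sketch: compress an array of doubles, truncating the mantissa to prec_bits
// significant bits before the actual compression
int
compress_blosc ( const double * src, int32_t nvalues, int prec_bits,
                 void * dest, int32_t destsize )
{
    blosc2_cparams  cparams = BLOSC2_CPARAMS_DEFAULTS;

    cparams.typesize        = sizeof(double);
    cparams.filters[0]      = BLOSC_TRUNC_PREC;   // mantissa truncation filter
    cparams.filters_meta[0] = prec_bits;          // bits to keep (assumed semantics)

    blosc2_context *  ctx   = blosc2_create_cctx( cparams );
    const int         csize = blosc2_compress_ctx( ctx, src, nvalues * int32_t(sizeof(double)),
                                                   dest, destsize );

    blosc2_free_ctx( ctx );

    return csize;   // compressed size in bytes, <= 0 on error
}
```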

APLR Compression

Combines a direct compression scheme with the mixed precision approach from [2].

Define the APLR method via

> scons aplr=<...>

with one of the following or default, in which case the same compressor as for direct compression is used.

In addition to afl, aflp, bfl, dfl, zfp and blosc, two mixed precision methods are implemented.

mp3

Uses a mixed precision approach using FP64, FP32 and BF16 as in [2].

mp2

Uses a mixed precision approach using FP64 and FP32 as in [3].
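
To illustrate the underlying idea of the mixed precision splitting from [2] and [3] (the criterion and the names below are a simplification, not the exact HLR code): a column \(i\) of the lowrank factors, scaled by its singular value \(\sigma_i\), may be stored in a precision with unit roundoff \(u_p\) as long as \(u_p \, \sigma_i \lesssim \varepsilon \, \sigma_1\), with \(\sigma_1\) being the largest singular value.

```cpp
#include <cstdint>
#include <vector>

// sketch of the mixed precision rank partitioning (mp3 variant): column i of
// the lowrank factors is stored in the lowest precision whose unit roundoff
// u_p still satisfies u_p * sigma[i] <= eps * sigma[0]
// (sigma assumed sorted in descending order; mp2 simply omits BF16)
enum class prec : uint8_t { fp64, fp32, bf16 };

std::vector< prec >
choose_precisions ( const std::vector< double > &  sigma, const double  eps )
{
    const double  u_fp32 = 0x1p-24;   // unit roundoff of FP32
    const double  u_bf16 = 0x1p-8;    // unit roundoff of BF16
    const double  bound  = eps * sigma.front();

    std::vector< prec >  p( sigma.size(), prec::fp64 );

    for ( size_t  i = 0; i < sigma.size(); ++i )
    {
        if      ( u_bf16 * sigma[i] <= bound ) p[i] = prec::bf16;
        else if ( u_fp32 * sigma[i] <= bound ) p[i] = prec::fp32;
    }

    return p;
}
```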


H-Matrix Construction

Program

The corresponding programs are compress and mixedprec and are built as

> scons programs=compress,mixedprec frameworks=seq,tbb compressor=<...> aplr=<...>

Set aplr to default to use the same method as for compressor.
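
For example, building both programs with the bfl compressor and the mp3 mixed precision scheme would look like

> scons programs=compress,mixedprec frameworks=seq,tbb compressor=bfl aplr=mp3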

Results

All experiments are performed on a 2-socket AMD Epyc 7702 (Rome) system with 2×64 cores and 2×8×32 GB DDR4-3200 DIMMs. All cores of the system are used, though without hyper-threading. Parallelization is performed using Intel TBB.

For runtimes, the median out of 10 runs is used.

Matrix Storage

(Figures: matrix storage for Laplace SLP and Matérn covariance, with direct and APLR compression.)

BLOSC gives the best compression results, with Posits, ZFP and AFL following behind. The rounded-up mantissa and fixed exponent sizes of AFLP, BFL and DFL increase memory usage. The mixed precision approach in particular shows little compression for low accuracies, as no dense blocks are compressed.

Error

For a given H-matrix \(M\) and its compressed version \(\tilde M\), the relative error \(\frac{\lVert M-\tilde M \rVert_2}{\lVert M \rVert_2}\) is computed and compared against the initial lowrank error \(\varepsilon\), shown as a black line. Results are only shown for APLR compression as this is the more demanding case.

(Figures: compression error for Laplace SLP and Matérn covariance.)

All methods are (by construction) close to the requested error, with BLOSC and AFL being slightly above. The rounded-up methods (aflp, bfl and dfl) result in an overly accurate compression, thereby losing storage efficiency.

Time for Compression

The runtime is shown as a percentage of the matrix construction time and hence depends on the cost of setting up the matrix, which is higher for Laplace than for the Matérn covariance. As compression is performed independently for each block of the matrix, the parallelization efficiency is as good as for construction (near perfect speedup).

Again, only results for APLR compression are shown since this takes more time.

(Figures: compression time for Laplace SLP and Matérn covariance.)

BLOSC shows a high runtime for compression, followed by ZFP. All other methods are much faster and do not increase the total construction time.

Please note that Posits are excluded due to their inefficient conversion and storage implementation within HLR.

H-Matrix-Vector Multiplication

The time for decompression has a direct influence on the mat-vec performance. Shown again is the relative runtime compared to the uncompressed version (black line).

(Figures: mat-vec runtime for Laplace SLP and Matérn covariance, with direct and APLR compression.)

Of particular interest are aflp, bfl and dfl, which, for lower accuracies, are faster than the uncompressed mat-vec. This is due to less matrix data being loaded/stored, which more than compensates for the additional decompression cost in a thread-parallel setting.

Decompression with BLOSC is much faster than compression.

ZFP shows the highest decompression cost, resulting in a much increased runtime. Due to its inefficient memory I/O, afl is up to 3 times slower than the uncompressed mat-vec.

References

  1. Anzt, Flegar, Grützmacher, Quintana-Ortí: “Toward a modular precision ecosystem for high-performance computing”, Int. J. of HPC Applications, 33(6), 1069–1078, 2019

  2. Amestoy, Boiteau, Buttari, Gerest, Jézéquel, L’Excellent, Mary: “Mixed precision low-rank approximations and their application to block low-rank LU factorization”, IMA J. of Num. Analysis, 2022

  3. Ooi, Iwashita, Fukaya, Ida, Yokota: “Effect of Mixed Precision Computing on H-Matrix Vector Multiplication in BEM Analysis”, Proceedings of HPCAsia2020, 2020