Binary Compression

Making use of the memory accessor concept from [1], different precisions are used for data storage and for computation: for H-matrices, the dense and low-rank data is here represented in different compressed formats while the arithmetic is still performed in double precision (FP64).

Program

The corresponding programs are compress and compress-lu; they are built via

scons programs=compress,compress-lu frameworks=seq,tbb

Compressors

ZFP

Uses the floating-point compression library ZFP in fixed-rate (constant bit rate) mode, with the rate chosen depending on the requested accuracy.

Better compression results could be achieved with ZFP's fixed-accuracy mode, but then error control proved problematic across different problem sizes.
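
For illustration, below is a minimal sketch of how ZFP's fixed-rate mode is set up via its C API; the function compress_fixed_rate and the 1D field layout are assumptions of this sketch, not the actual code of the programs.

    #include <cstdlib>
    #include <zfp.h>

    // sketch: compress n doubles with ZFP at a fixed rate (bits per value)
    size_t  compress_fixed_rate ( const double *  data, size_t  n, double  rate, void **  buffer )
    {
        zfp_field *   field = zfp_field_1d( const_cast< double * >( data ), zfp_type_double, n );
        zfp_stream *  zfp   = zfp_stream_open( nullptr );

        zfp_stream_set_rate( zfp, rate, zfp_type_double, 1, 0 );  // fixed-rate mode
        // fixed-accuracy mode would instead use: zfp_stream_set_accuracy( zfp, tol );

        const size_t  bufsize = zfp_stream_maximum_size( zfp, field );

        *buffer = std::malloc( bufsize );

        bitstream *   stream = stream_open( *buffer, bufsize );

        zfp_stream_set_bit_stream( zfp, stream );
        zfp_stream_rewind( zfp );

        const size_t  size = zfp_compress( zfp, field );  // 0 signals failure

        zfp_field_free( field );
        zfp_stream_close( zfp );
        stream_close( stream );

        return size;
    }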

SZ/SZ3

Uses the SZ and SZ3 libraries for compression. However, SZ showed issues with multi-threaded usage and SZ3 had runtime problems for larger data sets, so only compression rates are presented.

afloat

The motivation for this format is that standard floating-point representations support a huge dynamic range, which is typically not needed in practice. Instead of the 8 exponent bits of BF16/BF24/FP32 or the 11 bits of FP64, often only 1-6 bits are needed (as observed during H-LU factorization for different applications). As with the other formats, this comes in addition to the savings in mantissa size due to the reduced precision required for the data (especially for low-rank approximation with its user-defined accuracy).

For given data \(D\), the minimal and maximal absolute values \(d_{\min}\) and \(d_{\max}\) are determined, and the number of exponent bits \(e\) is chosen to cover the dynamic range of \(D\). The mantissa length \(m\) is chosen based on the required precision.

The total bit size \(n\) is then defined as

\(n = 1 + e + m\)

and may be between 3 and 64.

Furthermore, all values are scaled by \(1/d_{\min}\) and shifted by \(1\) such that only the lowest exponent bits need to be stored.

All compressed data is then stored and loaded bit-wise to minimize storage.
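
The following is a minimal sketch of how \(e\) and \(m\) could be chosen under these rules; choose_afloat is a hypothetical helper and the actual implementation may round and bias differently.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    struct afloat_config { uint32_t e, m, n; };

    // sketch: choose afloat parameters for data with minimal/maximal
    // absolute values dmin/dmax and requested precision eps
    afloat_config  choose_afloat ( double  dmin, double  dmax, double  eps )
    {
        // after scaling by 1/dmin the values lie in [1, dmax/dmin], so the
        // exponents range over 0..exp_range and need only e bits
        const auto  exp_range = uint32_t( std::ceil( std::log2( dmax / dmin ) ) );
        const auto  e         = std::max( 1u, uint32_t( std::ceil( std::log2( double(exp_range) + 1.0 ) ) ) );

        // mantissa bits from the requested precision (eps ≈ 2^(-m))
        const auto  m         = uint32_t( std::ceil( -std::log2( eps ) ) );

        return { e, m, 1 + e + m };  // n = 1 + e + m, between 3 and 64
    }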

apfloat

This compression method works similarly to afloat, but the mantissa size \(m\) is padded such that \(n = 1 + e + m\) is a multiple of \(8\). This results in much more efficient memory I/O at the price of some storage efficiency.
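
As a small illustration of the padding rule (apfloat_bits is a hypothetical helper):

    #include <cstdint>

    // sketch: pad the afloat bit size n = 1 + e + m up to whole bytes
    uint32_t  apfloat_bits ( uint32_t  e, uint32_t  m )
    {
        const uint32_t  n_bytes = ( 1 + e + m + 7 ) / 8;  // round up to full bytes

        return 8 * n_bytes;  // the mantissa grows accordingly to 8 * n_bytes - 1 - e
    }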

bfloat

This format generalizes BF16/FP32 to arbitrary precision: it uses a fixed exponent size of 8 bits but chooses the mantissa length \(N\) depending on the required precision such that the total bit size is a multiple of 8, i.e., a sign-exponent-mantissa layout

\(1\text{-}8\text{-}N\)

such that \(1 + 8 + N = 8 \cdot p\) for some integer \(p\).

For 16 bits the format is identical to BF16, for 32 bits identical to FP32, and for 64 bits identical to FP64 (then also with an 11-bit exponent!).
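
An illustrative computation of \(N\) from a requested precision \(\varepsilon\) (bfloat_mantissa is a hypothetical helper):

    #include <cmath>
    #include <cstdint>

    // sketch: fixed 8-bit exponent, mantissa length N padded such that
    // the total bit size 1 + 8 + N is a whole number of bytes
    uint32_t  bfloat_mantissa ( double  eps )
    {
        const uint32_t  m_min = uint32_t( std::ceil( -std::log2( eps ) ) );  // eps ≈ 2^(-m)
        const uint32_t  p     = ( 1 + 8 + m_min + 7 ) / 8;                   // bytes per value

        return 8 * p - 9;  // N; p = 2: 1-8-7 (BF16), p = 3: 1-8-15, p = 4: 1-8-23 (FP32)
    }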

dfloat

This works in the same way as bfloat but uses 11 exponent bits as with the FP64 format.

Results

All experiments are performed on a 2-socket AMD Epyc 7702 (Rome) system with 2×64 cores and 2×8×32 GB DDR4-3200 DIMMs. All cores of the system are used, though without simultaneous multithreading (SMT). Parallelization is performed using Intel TBB.

For runtimes, the median out of 5 runs is used.

Matrix Storage

[Figure: Laplace SLP (left) and Matérn covariance (right)]

Shown is the storage of the compressed matrix as a percentage of the storage of the original H-matrix (black line).

Best compression is achieved by ZFP, followed by afloat, apfloat, bfloat and dfloat.

Error

For a given H-matrix \(M\) the relative error \(\frac{\lVert M-\tilde M \rVert_2}{\lVert M \rVert_2}\) is computed for the compressed version \(\tilde M\).
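
Both spectral norms can be estimated by power iteration using only matrix-vector products; the following generic sketch assumes a square matrix and user-supplied callables for \(x \mapsto Bx\) and \(y \mapsto B^{T}y\) (with \(B = M - \tilde M\) formed from two mat-vecs) and is not the actual interface of the programs.

    #include <cmath>
    #include <random>
    #include <vector>

    // sketch: estimate ||B||_2 via power iteration on B^T B
    template < typename matvec_t, typename matvecT_t >
    double  spectral_norm ( matvec_t &&  mv, matvecT_t &&  mvt, size_t  n, int  steps = 30 )
    {
        std::mt19937                        rng( 1 );
        std::normal_distribution< double >  dist;
        std::vector< double >               x( n );

        for ( auto &  x_i : x ) x_i = dist( rng );

        double  lambda = 0;

        for ( int  i = 0; i < steps; ++i )
        {
            x = mvt( mv( x ) );  // x <- (B^T B) x

            lambda = 0;
            for ( auto  x_i : x ) lambda += x_i * x_i;
            lambda = std::sqrt( lambda );  // converges to lambda_max(B^T B)

            for ( auto &  x_i : x ) x_i /= lambda;  // normalize
        }

        return std::sqrt( lambda );  // ||B||_2 = sqrt(lambda_max(B^T B))
    }

The relative error is then the quotient of the estimates for \(M - \tilde M\) and \(M\).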

Shown is also the chosen accuracy (black line).

[Figure: Laplace SLP (left) and Matérn covariance (right)]

All methods stay (by construction) below the requested error bound, though the padded methods (apfloat, bfloat and dfloat) usually compress more accurately than required, thereby losing storage efficiency.

Time for Compression

Shown is the runtime relative to the matrix construction time (black line); it hence depends on the cost of setting up the matrix, which is higher for Laplace than for Matérn covariance. As compression is performed independently for each block of the matrix, the parallelization efficiency is as good as for construction (near-perfect speedup).

[Figure: Laplace SLP (left) and Matérn covariance (right)]

ZFP is slightly slower than the other compression schemes, but for both applications the runtime is small compared to the matrix construction itself.

Time for Mat-Vec

The time for decompression has a direct influence on the mat-vec performance. Shown again is the relative runtime compared to the uncompressed version (black line).

[Figure: Laplace SLP (left) and Matérn covariance (right)]

Of particular interest are apfloat, bfloat and dfloat, which, for accuracies above \(10^{-8}\) (Laplace) and \(10^{-7}\) (Matérn covariance), are faster than the uncompressed mat-vec. This is because less matrix data is loaded/stored, which more than compensates for the additional decompression cost in a thread-parallel setting.

ZFP showed the highest decompression cost, resulting in up to 10× higher runtime. Due to its inefficient memory I/O, afloat is up to 3× slower than the uncompressed mat-vec.

References

  1. Anzt, Flegar, Grützmacher, Quintana-Ortí: “Toward a modular precision ecosystem for high-performance computing”, Int. J. of HPC Applications, 33(6), 1069–1078, 2019