Uniform H-Matrices

Program

The programs are uniform, uniform-mm and uniform-lu (see also below) and can be compiled via

scons programs=uniform,uniform-mm,uniform-lu frameworks=seq,tbb

Compression is also supported for construction and matrix-vector multiplication. Choose the wanted compressor via

scons programs=uniform,uniform-mm,uniform-lu frameworks=seq,tbb compressor=<...>

Introduction

Standard H-matrices store the low-rank factors \(U,V\) for each admissible block, which is equivalent to have individual row- and column-bases for each such block.

This has the advantage that for each low-rank block an optimal rank approximation can be used. However, for many application the row- and column spaces are very close per block row/column. Therefore, this storage scheme is not optimal.

Uniform H-matrices use shared bases, i.e., each block row (or column) uses a shared basis \(\mathcal{U_{\tau}}\) (or \(\mathcal{V_{\sigma}}\)) for all blocks in that row (or column) with \(k \times k\) coupling matrices \(S_{\tau,\sigma}\) for each block.

The storage requirements for the actual matrix data is thereby reduced to \(\mathcal{O}(n)\) with \(n\) being the number of rows/columns. However, the cluster bases still need \(\mathcal{O}(n \log n)\) storage, though in practise the majority of the memory is still used by the matrix itself.

It should be noted that the number of low-rank blocks sharing a particular basis is limited by the constant \(c\_{sp}\).

Finally, H²-matrices represent the cluster bases in a nested way, only storing bases only for the leaf clusters and all other bases only indirectly by transfer matrices.

Though this brings down the total storage to \(\mathcal{O}(n)\) it also may result in several problems. If a relaxed (weak) admissibility is used, rank-problems, e.g., larger than expected rank, in a single block induces a significant rank growth in all blocks because of the sharing and nestedness of the bases. For uniform H-matrices, only a single block row/column would be affected and in H-matrices only the specific block.

Furthermore, parallelization is far more problematic because an update to a single block affects all blocks sharing data, i.e., in all blocks in the block rows/columns on all levels with a non-disjoint indexset. In contrast to this, all updates within an H-matrix are completely independent from each other. For uniform H-matrices, only the specific block row/column is affected.

Uniform H-matrices are therefore an option to optimize memory storage while still keeping algorithmic properties close to standard H-arithmetic and we will investigate this here.

Matrix-Construction (uniform)

Two different algorithms are implemented for constructing the cluster bases for uniform H-matrices.

The first algorithm recursively goes through the block tree and constructs standard low-rank blocks \(A_{\tau,\sigma} = U_{\tau,\sigma} V_{\tau,\sigma}^H\). By applying QR factorization to the low-rank factors, one can construct and local representation with orthogonal bases \(\mathcal{W},\mathcal{X}\) :

\[\begin{split} \begin{aligned} \mathcal{W} R_w & = U_{\tau,\sigma} \\ \mathcal{X} R_x & = V_{\tau,\sigma} \end{aligned} \end{split}\]

and a coupling matrix \(S_{\tau,\sigma} = R_w R_x^H\).

Assuming that this is not the first low-rank block (in which case \(\mathcal{W},\mathcal{X}\) are the corresponding block row/column bases), we extend the current row as

\[\begin{split} \begin{aligned} & \begin{pmatrix} \mathcal{U}_{\tau} S_{\tau,\sigma_1} \mathcal{V}_{\sigma_1}^H & \cdots & \mathcal{U}_{\tau} S_{\tau,\sigma_i} \mathcal{V}_{\sigma_i}^H & \mathcal{W} T_{\tau,\sigma} \mathcal{X}^H \end{pmatrix} \\ & = \begin{pmatrix} \mathcal{U}_{\tau} & \mathcal{W} \end{pmatrix} \begin{pmatrix} S_{\tau,\sigma_1} & \cdots & 0 \\ 0 & \cdots & T_{\tau,\sigma} \end{pmatrix} \begin{pmatrix} \mathcal{V}_{\sigma_1} & & \\ & \ddots & \\ & & \mathcal{X} \\ \end{pmatrix}^H \end{aligned} \end{split}\]

As the right-most factor forms an orthogonal matrix (all bases are assumed to be orthogonal!), we can omit it for the computation of a new row basis, yielding

\[\begin{split} \begin{pmatrix} \mathcal{U}_{\tau} & \mathcal{W} \end{pmatrix} \begin{pmatrix} S_{\tau,\sigma_1} & \cdots & 0 \\ 0 & \cdots & T_{\tau,\sigma} \end{pmatrix} \end{split}\]

By applying QR factorization to the right factor, one obtains

\[ \begin{pmatrix} \mathcal{U}_{\tau} & \mathcal{W} \end{pmatrix} R^H Q^H \]

of which again the factor \(Q^H\) can be omitted. For the remaining \(k \times 2k\) matrix \(\begin{pmatrix} \mathcal{U}_{\tau} & \mathcal{W} \end{pmatrix} R^H\) a new row basis is computed by any of the standard techniques, e.g., SVD, RRQR, RandSVD, etc…

Finally, after updating the coupling matrices of the existing uniform blocks, the new block was integrated into the uniform H-matrix.

By applying this procedure immediately after constructing standard low-rank blocks, the memory overhead can be minimized as the standard low-rank factors can be deleted afterwards, with only the uniform H-matrix data remaining.

However, as the update of the cluster basis is performed multiple times per cluster, additional computational costs arise.

In the same way, an existing H-matrix can be converted into the uniform H-format by adding one low-rank block after the other.

This version of the construction algorithm is denoted as *rec due to the recursive nature of the implementation

Alternatively, the lvl version implements the construction in a level-wise way. All low-rank blocks on a particular level of the H-tree are constructed and the all low-rank blocks are converted in into the uniform format simultaneously. While this increases the memory costs it should safe computation work as cluster bases are only computed once.

The program is compiled as

scons programs=uniform

For the (Laplace SLP)[/apps] application the following results are obtained for the pure matrix storage and for the cluster basis data, both per index:

Matrix Data Cluster Bases

One clearly see the constant memory for uniform H-matrices and H²-matrices due to \(\mathcal{O}(n)\) memory complexity.

For the cluster basis, the \(\mathcal{O}(n \log n)\) complexity of uniform \mcH-matrices is clearly visible.

Combining both, one gets the following picture

Uniform H-matrices only consume a little more memory compared to H²-matrices for typical problem sizes.

Within this program is also a benchmark of the matrix-vector multiplication \(Ax = y\), which, for uniform H-matrices, works similar to H²-matrices, e.g., with a foward transformation phase which computes the representation of \(x\) for each cluster in the column basis of \(A\) followed by the multiplication with only the coupling matrices of \(A\) and finally the update of \(y\) with the obtained basis coefficients per cluster, the backward transformation. The total runtime complexity is identical to H-matrices, i.e., \(\mathcal{O}(n \log n)\) because of the log-linear storage of the cluster basis.

The runtime results for the Laplace SLP example are

As can be seen, uniform H-matrices are close to H² in terms of runtime, with both transformation phases requiring slightly more runtime (the coupling matrix multiplication is identical to H²).

Matrix Multiplication (uniform-mm)

Standard H-matrix arithmetic works in an eager way by applying updates as soon as possible. This poses a problem for uniform H-arithmetic as each update induces a recomputation of the corresponding cluster bases, which affects up to \(2 c_{sp}\) blocks.

An alternative formulation uses so called accumulator matrices. Here, the algorithms goes from root to the leaves in the order defined by the arithmetic operation, e.g., multiplication or LU factorization. All updates to a block are hereby collected in the accumulator matrices and only applied to the destination block after all updates are available. By this, only a single cluster basis update is needed for uniform H-matrices. Therefore, this version is used in the following.

For uniform H-matrix multiplication \(C := C + A \times B\) the algorithm looks as

procedure UniHmul ( C, A_C, P_C )
  for all (A_i,B_i) in P_C
    if (A_i,B_i) is computable # Update Accumulator
      A_C := A_C + A_i · B_i   # optimized

  if C has sub-blocks
    for all sub-blocks C_ij
      A_ij := A_C restricted to C_ij
      P_ij := P_C restricted to C_ij
      UniHmul( C_ij, A_ij, P_ij )
  else
    C := C + A_C
    Update Bases
end

Here, \(A_C\) is the accumulator matrix and \(P_C\) the set of pending updates, which need further recursion to be computable, e.g., all factors being structured.

Evaluation of computable updates can be optimized by exploiting the shared bases in the uniform H-format. The update of cluster basis works exactly like for the recursive matrix construction above.

The runtime of matrix multiplication for the Laplace SLP example compared to standard H-matrices is

As can be seen, both algorithms are very close in terms of runtime. However uniform H-matrices still maintain the memory advantage.

LU factorization (uniform-lu)

LU factorization follows the same principles as matrix multiplication. However, while during matrix multiplication the factors (\(A\) and \(B\)) stay constant during LU factorization \(L\) and \(U\) are modified during updates.

Below are the runtime and the memory consumption of standard H, uniform H and H²-matrices. For the latter format, the result of the H-LU was converted into the H²-format (no H²-arithmetic available). Just for comparison, the same was done with uniform H-matrices, i.e., the result of the H-LU converted into the uniform H-format.

Runtime Memory

While the runtime of uniform H-LU is slightly slower compared to H-LU, the memory of the result is again almost halved, though still the H²-format is more efficient.