# DAG-Generation¶

Task based H-arithmetic is usually more efficient on many-core systems compared to loop parallelization of the recursive H-arithmetic. In the following figures, this can be seen by the percentage of idle cores during H-LU factorization. There, red are dense factorizations, blue are matrix solves and green matrix updates.

While the identification of the actual tasks in an H-arithmetic operation is easy by using the recursive H-functions and following the recursion until a leaf is encounterd, setting up the dependencies between the tasks is much more difficult as these are on a global scope whereas recursive algorithms only work on a purely local scope, i.e., on direct sub-blocks.

A previous (cite!) diverted from the standard H-arithmetic function for H-LU by using a global approach to directly identify dependencies per level of the H-tree. However, this was not portable to other H-functions and difficult to set up error-free.

Therefore, an alternative strategy was implemented based on the block index sets of each matrix block involved in an H-operation together with the standard recursive H-arithmetic. During a recursive call, existing dependencies of the parent blocks are automatically divided into dependencies for the sub-blocks based on their index sets.

Root One Recursion Two Recursions

The advantage of this approach is, that the user can use existing arithmetic functions with little augmentation, i.e., also new arithmetic functions can easliy exploit task-based parallelization.

H-function DAG-Generator
void  hlu ( TMatrix &  A ) {

if ( is_blocked( A ) ) {
for ( uint i = 0; i < n; ++i ) {
hlu( A(i,i) );
for ( uint j = i+1; j < n; j++ ){
htrsu( A(i,i), A(j,i) );
htrsl( A(i,i), A(i,j) );
}
for ( uint j = i+1; j < n; j++ )
for ( uint l = i+1; l < n; l++ )
hmul( -1, A(j,i), A(i,l), A(j,l) );
}
}
else
{
// handle leaf
}
}

struct hlu : public node {
TMatrix &  A;

local_graph  refine_  () {
local_graph  g;
if ( is_blocked( A ) ) {
for ( uint i = 0; i < n; ++i ) {
g.alloc_node<hlu>( A(i,i) );
for ( uint j = i+1; j < n; j++ ) {
g.alloc_node<htrsu>( A(i,i), A(j,i) );
g.alloc_node<htrsl>( A(i,i), A(i,j) );
}
for ( uint j = i+1; j < n; j++ )
for ( uint l = i+1; l < n; l++ )
g.alloc_node<hmul>( ... );
} }
return  g;
}

void run_ ( TTruncAcc &  acc ) { /* handle leaf */ }

};


The DAG construction can also be easily parallelized compared to the previous approach, and for different frameworks implementations are available in libHLR. The same holds for DAG execution. Both may also be mixed. As an example, the following DAG for H-LU is constructed using sequential DAG construction (seq::dag::refine) but executed via TBB (tbb::dag::run):

Furthermore, as the DAG typically has more dependencies than necessary, e.g., direct and indirect dependencies, various optimizations are applied to reduce the number of edges in the task graph (edge sparsification).

auto  dag = hlr::dag::gen_dag_lu_rec( *C, seq::dag::refine ); // sequential construction

tbb::dag::run( dag, acc );                                    // parallel execution


## H-LU factorization (dag-lu)¶

In the following the (parallel) runtime and the number of edges compared to the previous approach are shown showing the advantages of the new DAG generation routine.

Runtime Number of Edges

Furthermore, different variations of the H-LU DAG are implemented

ip

assuming in-place computation, i.e., blocks of $$A$$ are overwriten by $$L$$ and $$U$$ with more restricted data dependencies.

oop

assuming out-of-place computation, i.e., $$L$$ and $$U$$ use separate storage, relaxing data dependencies.

lvl

the previous level-wise approach for comparison.

oop-coarse

constructs task graph for a coarser H-matrix, i.e., not following recursion to the leaves, thereby saving execution and memory costs, and during task graph execution constructs DAGs for the sub-blocks on-the-fly for further execution.

oop-accu

incorporates accumulated H-arithmetic into H-LU.

## dag-hodlr¶

Construct a full task graph for the LU factorization of an H-matrix based on HODLR clustering with low-rank blocks stored using tiles, i.e., each factor $$U$$ (and $$V$$) is decomposed into

$\begin{split} U = \begin{pmatrix} U_1 \\ U_2 \\ U_3 \\ U_4 \\ \cdots \end{pmatrix} \end{split}$

with $$U_i$$ having at most ntile rows.

Tasks are constructed based on these tiles with corresponding dependencies, even down to low-rank truncation.

The idea is that all tasks will then have comparable costs, which may lead to better scheduling.

## dag-inv¶

Implements H inversion using H-LU, triangular inversion and triangular matrix multiplication.