# DAG-Generation¶

Task based H-arithmetic is usually more efficient on many-core systems compared to loop parallelization of the recursive H-arithmetic. In the following figures, this can be seen by the percentage of idle cores during H-LU factorization. There, red are dense factorizations, blue are matrix solves and green matrix updates.

recursive | task based |
---|---|

While the identification of the actual tasks in an H-arithmetic operation is easy by using the recursive H-functions and following the recursion until a leaf is encounterd, setting up the dependencies between the tasks is much more difficult as these are on a global scope whereas recursive algorithms only work on a purely local scope, i.e., on direct sub-blocks.

Tasks | Dependencies in H-tree |
---|---|

A previous (cite!) diverted from the standard H-arithmetic function for H-LU by using a global approach to directly identify dependencies per level of the H-tree. However, this was not portable to other H-functions and difficult to set up error-free.

Therefore, an alternative strategy was implemented based on the block index sets of each matrix block involved in an H-operation together with the standard recursive H-arithmetic. During a recursive call, existing dependencies of the parent blocks are automatically divided into dependencies for the sub-blocks based on their index sets.

Root | One Recursion | Two Recursions |
---|---|---|

The advantage of this approach is, that the user can use existing arithmetic functions with little augmentation, i.e., also new arithmetic functions can easliy exploit task-based parallelization.

H-function | DAG-Generator |
---|---|

```
void hlu ( TMatrix & A ) {
if ( is_blocked( A ) ) {
for ( uint i = 0; i < n; ++i ) {
hlu( A(i,i) );
for ( uint j = i+1; j < n; j++ ){
htrsu( A(i,i), A(j,i) );
htrsl( A(i,i), A(i,j) );
}
for ( uint j = i+1; j < n; j++ )
for ( uint l = i+1; l < n; l++ )
hmul( -1, A(j,i), A(i,l), A(j,l) );
}
}
else
{
// handle leaf
}
}
``` |
```
struct hlu : public node {
TMatrix & A;
local_graph refine_ () {
local_graph g;
if ( is_blocked( A ) ) {
for ( uint i = 0; i < n; ++i ) {
g.alloc_node<hlu>( A(i,i) );
for ( uint j = i+1; j < n; j++ ) {
g.alloc_node<htrsu>( A(i,i), A(j,i) );
g.alloc_node<htrsl>( A(i,i), A(i,j) );
}
for ( uint j = i+1; j < n; j++ )
for ( uint l = i+1; l < n; l++ )
g.alloc_node<hmul>( ... );
} }
return g;
}
void run_ ( TTruncAcc & acc ) { /* handle leaf */ }
};
``` |

The DAG construction can also be easily parallelized compared to the previous approach,
and for different frameworks implementations are available in libHLR. The same holds for
DAG execution. Both may also be mixed. As an example, the following DAG for H-LU is
constructed using sequential DAG construction (`seq::dag::refine`

) but executed via
TBB (`tbb::dag::run`

):

Furthermore, as the DAG typically has more dependencies than necessary, e.g., direct
and indirect dependencies, various optimizations are applied to reduce the number of edges
in the task graph (*edge sparsification*).

```
auto dag = hlr::dag::gen_dag_lu_rec( *C, seq::dag::refine ); // sequential construction
tbb::dag::run( dag, acc ); // parallel execution
```

## H-LU factorization (`dag-lu`

)¶

In the following the (parallel) runtime and the number of edges compared to the previous approach are shown showing the advantages of the new DAG generation routine.

Runtime | Number of Edges |
---|---|

Furthermore, different variations of the H-LU DAG are implemented

**ip**assuming

*in-place*computation, i.e., blocks of \(A\) are overwriten by \(L\) and \(U\) with more restricted data dependencies.**oop**assuming

*out-of-place*computation, i.e., \(L\) and \(U\) use separate storage, relaxing data dependencies.**lvl**the previous

*level-wise*approach for comparison.**oop-coarse**constructs task graph for a coarser H-matrix, i.e., not following recursion to the leaves, thereby saving execution and memory costs, and during task graph execution constructs DAGs for the sub-blocks on-the-fly for further execution.

**oop-accu**incorporates accumulated H-arithmetic into H-LU.

`dag-hodlr`

¶

Construct a full task graph for the LU factorization of an H-matrix based on HODLR
clustering with low-rank blocks stored using *tiles*, i.e., each factor \(U\) (and \(V\))
is decomposed into

with \(U_i\) having at most `ntile`

rows.

Tasks are constructed based on these tiles with corresponding dependencies, even down to low-rank truncation.

The idea is that all tasks will then have comparable costs, which may lead to better scheduling.

`dag-inv`

¶

Implements H inversion using H-LU, triangular inversion and triangular matrix multiplication.