DAG-Generation¶
Task-based H-arithmetic is usually more efficient on many-core systems than loop parallelization of the recursive H-arithmetic. In the following figures, this can be seen in the percentage of idle cores during H-LU factorization; there, red denotes dense factorizations, blue matrix solves, and green matrix updates.
*(Figures: core utilization during H-LU factorization — recursive (left) vs. task-based (right))*
While identifying the actual tasks of an H-arithmetic operation is easy, by using the recursive H-functions and following the recursion until a leaf is encountered, setting up the dependencies between the tasks is much more difficult, as these have a global scope whereas recursive algorithms work on a purely local scope, i.e., on direct sub-blocks.
*(Figures: tasks (left) and their dependencies in the H-tree (right))*
A previous approach (cite!) deviated from the standard H-arithmetic function for H-LU by using a global view to directly identify dependencies per level of the H-tree. However, this was not portable to other H-functions and was difficult to implement error-free.
Therefore, an alternative strategy was implemented, based on the block index sets of each matrix block involved in an H-operation together with the standard recursive H-arithmetic: during a recursive call, existing dependencies of the parent blocks are automatically divided into dependencies of the sub-blocks based on their index sets.
*(Figures: dependency refinement at the root, after one recursion, and after two recursions)*
The advantage of this approach is that existing arithmetic functions can be used with little augmentation, i.e., new arithmetic functions can also easily exploit task-based parallelization.
**H-function:**
```cpp
void
hlu ( TMatrix &  A )
{
    if ( is_blocked( A ) )
    {
        // n = number of block rows/columns of A
        for ( uint  i = 0; i < n; ++i )
        {
            hlu( A(i,i) );                               // factorize diagonal block

            for ( uint  j = i+1; j < n; j++ )
            {
                htrsu( A(i,i), A(j,i) );                 // triangular solves with
                htrsl( A(i,i), A(i,j) );                 // off-diagonal blocks
            }

            for ( uint  j = i+1; j < n; j++ )
                for ( uint  l = i+1; l < n; l++ )
                    hmul( -1, A(j,i), A(i,l), A(j,l) );  // update remaining blocks
        }
    }
    else
    {
        // handle leaf
    }
}
```
**DAG-Generator:**
```cpp
struct hlu : public node
{
    TMatrix &  A;

    // refine this node into a local graph of sub-block tasks
    local_graph  refine_ ()
    {
        local_graph  g;

        if ( is_blocked( A ) )
        {
            for ( uint  i = 0; i < n; ++i )
            {
                g.alloc_node< hlu >( A(i,i) );

                for ( uint  j = i+1; j < n; j++ )
                {
                    g.alloc_node< htrsu >( A(i,i), A(j,i) );
                    g.alloc_node< htrsl >( A(i,i), A(i,j) );
                }

                for ( uint  j = i+1; j < n; j++ )
                    for ( uint  l = i+1; l < n; l++ )
                        g.alloc_node< hmul >( ... );
            }
        }

        return g;
    }

    // executed for leaf nodes only
    void  run_ ( TTruncAcc &  acc ) { /* handle leaf */ }
};
```
Compared to the previous approach, the DAG construction itself can also easily be parallelized, and implementations for different frameworks are available in libHLR. The same holds for DAG execution, and both may be mixed. As an example, the following DAG for H-LU is constructed sequentially (`seq::dag::refine`) but executed via TBB (`tbb::dag::run`):
Furthermore, as the generated DAG typically has more dependencies than necessary, e.g., a direct edge between two tasks that are already connected by an indirect path, various optimizations are applied to reduce the number of edges in the task graph (edge sparsification).
```cpp
auto  dag = hlr::dag::gen_dag_lu_rec( *C, seq::dag::refine );  // sequential construction

tbb::dag::run( dag, acc );                                     // parallel execution
```
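The edge sparsification mentioned above can be sketched as a plain transitive reduction on a small adjacency structure (this is an illustration, not libHLR's internal graph representation): a direct edge is removed whenever its target is still reachable via an indirect path.

```cpp
#include <cassert>
#include <set>
#include <vector>

using graph = std::vector< std::set< int > >;  // adjacency: node -> successors

// DFS reachability from src to dst, ignoring the direct edge skip_from -> skip_to
bool
reachable ( const graph &  g, int  src, int  dst, int  skip_from, int  skip_to )
{
    std::vector< int >  stack{ src };
    std::set< int >     visited;

    while ( ! stack.empty() )
    {
        const int  u = stack.back();

        stack.pop_back();

        if ( ! visited.insert( u ).second )
            continue;

        for ( int  w : g[u] )
        {
            if (( u == skip_from ) && ( w == skip_to ))
                continue;  // skip the direct edge itself

            if ( w == dst )
                return true;

            stack.push_back( w );
        }
    }

    return false;
}

// remove every direct edge that is implied by an indirect path (DAG assumed)
graph
sparsify ( graph  g )
{
    for ( int  u = 0; u < int(g.size()); ++u )
    {
        const auto  succ = g[u];  // copy, as g[u] is modified below

        for ( int  v : succ )
            if ( reachable( g, u, v, u, v ) )
                g[u].erase( v );
    }

    return g;
}
```

For a DAG, removing redundant edges one at a time preserves reachability at every step, so the result is the unique transitive reduction regardless of removal order.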
H-LU factorization¶
**Program:** `dag-lu`, frameworks: `seq`, `tbb`, `omp`, `tf`, `hpx`
In the following, the (parallel) runtime and the number of edges are compared with the previous approach, demonstrating the advantages of the new DAG generation routine.
*(Figures: runtime (left) and number of edges (right) compared with the previous approach)*
Furthermore, different variations of the H-LU DAG are implemented:

- `ip`: assumes in-place computation, i.e., blocks of \(A\) are overwritten by \(L\) and \(U\), with more restrictive data dependencies.
- `oop`: assumes out-of-place computation, i.e., \(L\) and \(U\) use separate storage, relaxing data dependencies.
- `lvl`: the previous level-wise approach, for comparison.
- `oop-coarse`: constructs the task graph for a coarser H-matrix, i.e., does not follow the recursion down to the leaves, thereby saving execution time and memory; during task graph execution, DAGs for the sub-blocks are constructed on the fly for further execution.
- `oop-accu`: incorporates accumulated H-arithmetic into H-LU.
LU factorization for HODLR matrices¶
**Program:** `dag-hodlr`, frameworks: `seq`, `tbb`
Construct a full task graph for the LU factorization of an H-matrix based on HODLR clustering, with low-rank blocks stored in tiles, i.e., each factor \(U\) (and \(V\)) is decomposed into

\[ U = \begin{pmatrix} U_1 \\ U_2 \\ \vdots \\ U_p \end{pmatrix} \]

with each \(U_i\) having at most `ntile` rows.
Tasks are constructed based on these tiles with corresponding dependencies, even down to low-rank truncation.
The idea is that all tasks will then have comparable costs, which may lead to better scheduling.
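A minimal sketch of such a tiling (hypothetical helper, not libHLR's tile layout) splits a factor with a given number of rows into consecutive tiles of at most `ntile` rows:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// split a factor with nrows rows into consecutive tiles of at most ntile rows;
// returns half-open row ranges [first,last) per tile
std::vector< std::pair< int, int > >
tile_ranges ( int  nrows, int  ntile )
{
    std::vector< std::pair< int, int > >  ranges;

    for ( int  first = 0; first < nrows; first += ntile )
        ranges.emplace_back( first, std::min( first + ntile, nrows ) );

    return ranges;
}
```

For example, 10 rows with `ntile = 4` yields the tiles [0,4), [4,8), and [8,10); only the last tile may be smaller, so per-tile task costs stay comparable.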
H-Inversion¶
**Program:** `dag-inv`, frameworks: `seq`, `tbb`
Implements H-inversion using H-LU factorization, triangular matrix inversion, and triangular matrix multiplication.
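The underlying inversion route can be illustrated with a plain dense 2×2 example (ordinary arithmetic, not H-arithmetic): factor \(A = LU\), invert the triangular factors, and multiply, giving \(A^{-1} = U^{-1} L^{-1}\).

```cpp
#include <array>
#include <cassert>
#include <cmath>

using mat2 = std::array< double, 4 >;  // row-major 2x2 matrix

// invert A via LU factorization, triangular inversion and triangular
// matrix multiplication (no pivoting; assumes A[0] != 0 and det(A) != 0)
mat2
inv_via_lu ( const mat2 &  A )
{
    // LU factorization: L unit lower triangular, U upper triangular
    const double  l21 = A[2] / A[0];
    const double  u11 = A[0];
    const double  u12 = A[1];
    const double  u22 = A[3] - l21 * A[1];

    // triangular inversions
    const double  li21 = -l21;                  // L^{-1} = [[1, 0], [-l21, 1]]
    const double  ui11 = 1.0 / u11;             // U^{-1} = [[1/u11, -u12/(u11*u22)],
    const double  ui12 = -u12 / ( u11 * u22 );  //           [0,      1/u22         ]]
    const double  ui22 = 1.0 / u22;

    // triangular matrix multiplication: A^{-1} = U^{-1} * L^{-1}
    return { ui11 + ui12 * li21, ui12,
             ui22 * li21,        ui22 };
}
```

The H-version follows the same structure, with each scalar step replaced by the corresponding blocked H-operation.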