\( \newcommand{\blu}[1]{{\color{blue}#1}} \newcommand{\red}[1]{{\color{red}#1}} \newcommand{\grn}[1]{{\color{green!50!black}#1}} \newcommand{\local}{_i} \newcommand{\inv}{^{-1}} % Index for interface and interior \newcommand{\G}{\Gamma} \newcommand{\Gi}{\Gamma_i} \newcommand{\I}{{\cal I}} \newcommand{\Ii}{\I_i} % Matrix A \newcommand{\A}{{\cal A}} \newcommand{\Ai}{\A\local} \newcommand{\Aj}{\A_j} \newcommand{\Aib}{\bar{\A}\local} \newcommand{\AII}{\A_{\I\I}} \newcommand{\AIG}{\A_{\I\G}} \newcommand{\AGI}{\A_{\G\I}} \newcommand{\AGG}{\A_{\G\G}} \newcommand{\AIiIi}{\A_{\Ii\Ii}} \newcommand{\AIiGi}{\A_{\Ii\Gi}} \newcommand{\AGiIi}{\A_{\Gi\Ii}} \newcommand{\AGiGi}{\A_{\Gi\Gi}} \newcommand{\AGiGiw} {{\ensuremath{\A_{\Gi\Gi}^w}}} \newcommand{\AIIi}{\AII\local} \newcommand{\AIGi}{\AIG\local} \newcommand{\AGIi}{\AGI\local} \newcommand{\AGGi}{\AGG\local} \newcommand{\Ab}{\bar{\A}} \newcommand{\Ah}{{\widehat{\A}}} \newcommand{\Aih}{{\Ah\local}} \newcommand{\At}{{\widetilde{\A}}} \newcommand{\Ait}{{\At\local}} \newcommand{\Ao}{\A_0} \newcommand{\Aot}{\At_0} \newcommand{\Aob}{\Ab_0} \newcommand{\AiNN}{\Ai^{(NN)}{}} \newcommand{\AitNN}{\Ait^{(NN)}{}} \newcommand{\AiAS}{\Ai^{(AS)}{}} \newcommand{\AitAS}{\Ait^{(AS)}{}} % Matrix S \renewcommand{\S}{{\cal S}} \newcommand{\Si}{\S\local} \newcommand{\Sb}{\bar{\S}} \newcommand{\Sib}{\Sb\local} \newcommand{\Sh}{{\widehat{\S}}} \newcommand{\Sih}{{\Sh\local}} \newcommand{\St}{{\widetilde{\S}}} \newcommand{\Sit}{{\St\local}} \newcommand{\So}{\S_0} \newcommand{\Soi}{\S_0^i} \newcommand{\Sot}{\St_0} \newcommand{\Sob}{\Sb_0} \newcommand{\SiNN}{\Si^{(NN)}{}} \newcommand{\SitNN}{\Sit^{(NN)}{}} \newcommand{\SiAS}{\Si^{(AS)}{}} \newcommand{\SitAS}{\Sit^{(AS)}{}} % Matrix K \newcommand{\K}{{\cal K}} \newcommand{\Ki}{\K\local} \newcommand{\Kb}{\bar{\K}} \newcommand{\Kib}{\Kb\local} \newcommand{\Kh}{{\widehat{\K}}} \newcommand{\Kih}{{\Kh\local}} \newcommand{\Kt}{{\widetilde{\K}}} \newcommand{\Kit}{{\Kt\local}} \newcommand{\Ko}{\K_0} \newcommand{\Kot}{\Kt_0} \newcommand{\Kob}{\Kb_0} \newcommand{\KiNN}{\Ki^{(NN)}{}} \newcommand{\KitNN}{\Kit^{(NN)}{}} \newcommand{\KiAS}{\Ki^{(AS)}{}} \newcommand{\KitAS}{\Kit^{(AS)}{}} \newcommand{\KII}{\K_{\I\I}} \newcommand{\KIG}{\K_{\I\G}} \newcommand{\KGI}{\K_{\G\I}} \newcommand{\KGG}{\K_{\G\G}} \newcommand{\KIiIi}{\K_{\Ii\Ii}} \newcommand{\KIiGi}{\K_{\Ii\Gi}} \newcommand{\KGiIi}{\K_{\Gi\Ii}} \newcommand{\KGiGi}{\K_{\Gi\Gi}} \newcommand{\KIIi}{\KII\local} \newcommand{\KIGi}{\KIG\local} \newcommand{\KGIi}{\KGI\local} \newcommand{\KGGi}{\KGG\local} \newcommand{\KGiGiw} {{\ensuremath{\K_{\Gi\Gi}^w}}} % Matrix B \newcommand{\B}{{\cal B}} \newcommand{\Bi}{\B\local} \newcommand{\Bib}{\widehat{\B}\local} \newcommand{\Bob}{\widehat{\B}_0} % Matrix C \newcommand{\C}{{\cal C}} % Matrix T \newcommand{\T}{{\cal T}} \newcommand{\Ti}{{\T\local}} % Vectors \newcommand{\uI}{u_\I} \newcommand{\uG}{u_\G} \newcommand{\xI}{x_\I} \newcommand{\xG}{x_\G} \newcommand{\bI}{b_\I} \newcommand{\bG}{b_\G} \newcommand{\fI}{f_\I} \newcommand{\fG}{f_\G} \newcommand{\ftG}{\widetilde f_\G} \newcommand{\ftGi}{\widetilde f_\Gi^{(i)}} \newcommand{\ftGj}{\widetilde f_\Gj^{(j)}} \)

Compose Tutorial


1. A minimal example

1.1. Requirements

For this tutorial we shall use Compose (a.k.a. Maphys++) as a header-only C++ library, so the required tools are limited. You only need git, cmake and a C++ compiler supporting the C++20 standard (g++ or clang++ for instance).

Later on, when performance becomes important, optimized libraries such as BLAS, LAPACK or direct sparse solvers should be used. Interfaces for those are already present in Compose and you will be able to take advantage of them with little effort. For small prototypes such as this tutorial, or when performance is not important, the header-only version is sufficient.

Now let's get the Compose C++ headers. The following commands clone the source code (with git) and set the variables MAPHYSPP_INCLUDES and TLAPACK_INCLUDES to the paths of the Compose and <T>LAPACK headers.

git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
MAPHYSPP_INCLUDES=`realpath maphyspp/include`
TLAPACK_INCLUDES=`realpath maphyspp/tlapack/include`

The --recursive parameter is important: we do need the submodules. If you forgot it, you can use the following commands to get and update them:

# In the repository of compose
git submodule init
git submodule update

1.2. A simple example

Let's start with the following C++ example in the file minimal.cpp:

#include <maphys.hpp>

int main(){

  using namespace maphys;

  DenseMatrix<double> A(3, 3);

  A(0, 0) = 4;
  A(1, 0) = -1;

  A(0, 1) = A(1, 0);
  A(1, 1) = 3;
  A(2, 1) = -0.5;

  A(1, 2) = A(2, 1);
  A(2, 2) = A(0, 0);

  A.display("A");

  return 0;
}

1.3. Compiling and running

Compose is written in C++20. In this tutorial we use the GNU compiler g++. The beginning of the command line will be:

g++ -std=c++20

Since we are using header-only C++ libraries, all we need to do is to include the correct headers to compile. We need to indicate where the headers of Compose and <T>LAPACK are:

-I${MAPHYSPP_INCLUDES} -I${TLAPACK_INCLUDES}

We need an extra flag to tell Compose that we want the header-only implementation, forgoing the dependencies that require compilation and linking. In particular, we can't use MPI or a compiled BLAS implementation with this flag.

-DMAPHYSPP_HEADER_ONLY

Remark: if it's more convenient for you, it is also possible to add the definition in the code before including maphys.hpp rather than using the flag on the command line:

#include <iostream>
#include <vector>
#define MAPHYSPP_HEADER_ONLY
#include <maphys.hpp>

//...

So the command line to compile our example minimal.cpp is:

g++ -std=c++20 -I${MAPHYSPP_INCLUDES} -I${TLAPACK_INCLUDES} -DMAPHYSPP_HEADER_ONLY minimal.cpp -o minimal

And we can run the code with ./minimal. The output is:

A
m: 3 n: 3
Symmetry: general; Storage: full

4	-1	0
-1	3	-0.5
0	-0.5	4


1.4. Explanation of the program

The only feature used here is the DenseMatrix class implemented in Compose. It corresponds to a matrix where all coefficients are stored in a 2D array. In this example we created a 3 by 3 matrix \(A\) with the constructor DenseMatrix<double> A(3, 3), filled with zeros. The scalar type of the coefficients of the matrix is given as a template parameter; here we use double precision floating-point arithmetic for real numbers, which corresponds to the type double. Then, we can access and assign the coefficients \(a_{i, j}\) of the matrix with A(i, j). Here we filled \(A\) with some values. Finally, we displayed the matrix on the standard output using the display method of the class. Optionally we can pass the name of the matrix we display as a parameter of the method, so that we can easily identify it in the output (here we passed "A").

In the output, we can see the name of the matrix "A", followed by its description: \(m\) is the number of rows, \(n\) the number of columns. We can see that no assumption is made on the properties of the matrix: although the matrix is clearly symmetric, the program doesn't have this piece of information unless it is specified by the user. For this example it is not necessary to make this effort.

1.5. Another example with a vector

Let's write another simple example to perform a matrix-vector product.

#include <maphys.hpp>

int main(){

  using namespace maphys;

  DenseMatrix<double> A({ 4,   -1,    0,
                         -1,    3, -0.5,
                          0, -0.5,    4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  u.display("u");
  b.display("b");

  return 0;
}
u
m: 3 (vector)

1
2
3

b
m: 3 (vector)

2
3.5
11

First we introduced a new constructor for the DenseMatrix. The first parameter is a brace-enclosed list of the matrix coefficients, followed by its number of rows \(m\) and number of columns \(n\). The coefficients are given in column-major order, meaning the first \(m\) values describe the first column, the next \(m\) the second column, and so on. Here the matrix is symmetric so it does not make a difference, but you can try to modify an off-diagonal coefficient and display the matrix to see the effect, as in the sketch below.
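As a quick illustration of that experiment, here is a minimal sketch (using only the constructor and the display method introduced above) where one value of the list is changed:

#include <maphys.hpp>

int main(){

  using namespace maphys;

  // Same list as before, except that the second value is now 7 instead of -1.
  // Since the list is read column by column, the 7 becomes the coefficient of
  // row 1, column 0, i.e. B(1, 0), and not B(0, 1).
  DenseMatrix<double> B({ 4,    7,    0,
                         -1,    3, -0.5,
                          0, -0.5,    4}, 3, 3);

  B.display("B");

  return 0;
}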

Secondly, we declared \(u\), a Vector of double. Actually Vector is a special case of DenseMatrix where the number of columns is fixed to 1. The constructor is therefore very similar: you only need to give the list of the coefficients of the vector, and its size is deduced from the size of the list.

Then we perform a matrix-vector product of the matrix \(A\) and \(u\) and store the result in a new Vector \(b\). We display \(u\) and \(b\) in the same way we displayed the matrix \(A\) before, using the display method.

1.6. Solving a linear system

1.6.1. With a Jacobi iterative solver

Now let's try to solve a linear system. The matrix \(A\) is diagonally dominant, so we can try to solve the system with a Jacobi iterative solver. In the following example, we use this iterative method to find \(x\) such that \(Ax = b\). We can display \(x\) and make sure it has the same values as \(u = (1, 2, 3)^T\).

#include <maphys.hpp>
#include <maphys/solver/Jacobi.hpp>

int main(){

  using namespace maphys;

  DenseMatrix<double> A({ 4,  -1,    0,
                          -1,   3, -0.5,
                          0, -0.5,   4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  Jacobi<DenseMatrix<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;

  x.display("x");

  return 0;
}
Starting Jacobi. Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 11.7154
---
0 -	||B-AX||/||B||	1
1 -	||B-AX||/||B||	0.194964
2 -	||B-AX||/||B||	0.067276
3 -	||B-AX||/||B||	0.0203088
4 -	||B-AX||/||B||	0.00700792
5 -	||B-AX||/||B||	0.0021155
6 -	||B-AX||/||B||	0.000729992
7 -	||B-AX||/||B||	0.000220364
8 -	||B-AX||/||B||	7.60408e-05
9 -	||B-AX||/||B||	2.29546e-05
10 -	||B-AX||/||B||	7.92092e-06
11 -	||B-AX||/||B||	2.39111e-06
12 -	||B-AX||/||B||	8.25095e-07
13 -	||B-AX||/||B||	2.49073e-07
14 -	||B-AX||/||B||	8.59474e-08
15 -	||B-AX||/||B||	2.59452e-08
16 -	||B-AX||/||B||	8.95286e-09
x
m: 3 (vector)

1
2
3

We include the header of the Jacobi solver, maphys/solver/Jacobi.hpp, to have access to it. We need to indicate to the solver, as template parameters, the types of matrix and vector used: DenseMatrix<double> for the matrix, and Vector<double> for the vector.

Many parameters can be given to the solver, under the namespace parameters. Here we are giving the most useful ones.

  • parameters::A, which is mandatory, is the input square matrix, or the operator of the system;
  • parameters::tolerance is the targeted precision of the solution, more specifically, the iterative solver considers the convergence achieved when the backward error \(||b - Ax||_2 / ||b||_2\) is lower than this value;
  • parameters::max_iter is the maximum number of iterations. When reached, the solver stops even if it has not converged to the required tolerance;
  • parameters::verbose is a boolean value, false by default, printing information on the standard output when true.

Those parameters can be passed to the setup method in any order and have default values if not given (although the defaults may not be relevant in your context).

Once set up, solving the linear system amounts to a simple multiplication: solver * rhs, which returns the solution vector. One may consider the solver instance as the \(A^{-1}\) operator, even though no explicit inversion of a matrix is ever performed.
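For instance, the solver instance set up above can be reused for another right-hand side without any new setup; a small sketch (to be appended to the previous example, reusing its A and solver objects):

  Vector<double> u2{3, 2, 1};
  Vector<double> b2 = A * u2;

  // The solver acts as the A^{-1} operator: no new setup is required
  Vector<double> x2 = solver * b2;
  x2.display("x2"); // should match u2 = (3, 2, 1)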

1.6.2. With a conjugate gradient

The matrix \(A\) is actually not only diagonally dominant, it's also symmetric positive definite. Therefore we can use a conjugate gradient to solve the system. Since iterative solvers share the same set of basic parameters, we can take the same code and just replace Jacobi by ConjugateGradient to solve the system.

#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>

int main(){

  using namespace maphys;

  DenseMatrix<double> A({ 4,  -1,    0,
                          -1,   3, -0.5,
                          0, -0.5,   4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  ConjugateGradient<DenseMatrix<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;

  x.display("x");

  return 0;
}
Starting Conjugate Gradient. Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 11.7154
---
0 -	||R||/||B||	1
1 -	||R||/||B||	0.248805
2 -	||R||/||B||	0.048057
3 -	||R||/||B||	2.6488e-18
        ||B-AX||/||B||	0
x
m: 3 (vector)

1
2
3

As expected, we only need 3 iterations (the size of the linear system) to converge with the conjugate gradient.

2. Sparse matrices

We propose 3 types of sparse matrices in Compose:

  • SparseMatrixCOO: a matrix in coordinate format (also known as IJV format);
  • SparseMatrixCSC: a matrix in Compressed Sparse Column format;
  • SparseMatrixLIL: a matrix in list of lists format, a special format for coefficient insertion.

2.1. Create a sparse matrix with SparseMatrixLIL

The easiest way to create a sparse matrix is to insert its coefficients one by one. The only format that allows this kind of insertion is SparseMatrixLIL, which is implemented as an array of \(n\) lists (one per column). Thus insertion can be performed in linear time.

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(3, 3);

  lilmat.insert(0, 0, 4);

  lilmat.insert(1, 0, -1);
  lilmat.insert(0, 1, -1);

  lilmat.insert(1, 1, 3);

  lilmat.insert(1, 2, -0.5);
  lilmat.insert(2, 1, -0.5);

  lilmat.insert(2, 2, 4);

  lilmat.display("lilmat");

  lilmat.to_dense().display("lilmat converted to dense");

  return 0;
}
lilmat
m: 3 n: 3 nnz: 7
Symmetry: general; Storage: full

Column 0
i	v
0	4
1	-1
Column 1
i	v
0	-1
1	3
2	-0.5
Column 2
i	v
1	-0.5
2	4

lilmat converted to dense
m: 3 n: 3
Symmetry: general; Storage: full

4	-1	0
-1	3	-0.5
0	-0.5	4


We included the header for the LIL matrix: maphys/loc_data/SparseMatrixLIL.hpp. First we construct the sparse matrix with its dimensions \((m, n)\), and then we add the coefficients \((i, j, v)\) (row index, column index and value of the coefficient respectively) one by one. This matrix type is very well suited when you want to construct a matrix but do not know in advance the final number of nonzero elements it contains.

2.2. Using half storage for symmetric matrices

As we fill this symmetric sparse matrix, we can see an obvious optimization: half storage. Indeed, if we could indicate that the matrix is symmetric, we would not need to insert the same value \(v\) twice, at \((i, j)\) and at \((j, i)\).

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(3, 3);

  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(1, 1, 3);
  lilmat.insert(2, 1, -0.5);
  lilmat.insert(2, 2, 4);

  lilmat.set_property(MatrixSymmetry::symmetric, MatrixStorage::lower);

  lilmat.display("lilmat");

  lilmat.to_dense().display("lilmat converted to dense");

  return 0;
}
lilmat
m: 3 n: 3 nnz: 5
Symmetry: symmetric; Storage: lower

Column 0
i	v
0	4
1	-1
Column 1
i	v
1	3
2	-0.5
Column 2
i	v
2	4

lilmat converted to dense
m: 3 n: 3
Symmetry: symmetric; Storage: lower

4	-	-
-1	3	-
0	-0.5	4


Here we filled only the lower triangular part of the matrix. We can indicate with set_property(MatrixSymmetry::symmetric) that the matrix is symmetric, and with set_property(MatrixStorage::lower) that only its lower part is stored. These methods can be called on any type of matrix in Compose, and with any kind of matrix property. As shown in the example, it is possible to set several properties at once when calling set_property.

Notice that for the DenseMatrix constructed at the end, the conversion propagates the properties, and only the lower triangular part of the matrix is displayed. The other coefficients, whatever their values, won't be used in any operation performed with the matrix.

Remark: some shortcuts exist to fill the properties of a matrix. Here in particular, we have a symmetric positive definite matrix, so we could (should?) have called lilmat.set_spd(MatrixStorage::lower) to set all properties at the same time, as illustrated below.
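As an illustration, here is a hedged sketch of the same shortcut applied to a DenseMatrix (it only assumes the initializer-list constructor and set_spd shown in this tutorial):

#include <maphys.hpp>

int main(){
  using namespace maphys;

  // Coefficients in column-major order; only the lower triangle is filled
  DenseMatrix<double> A({4, -1,  0,    // column 0
                         0,  3, -0.5,  // column 1
                         0,  0,  4},   // column 2
                        3, 3);

  // Shortcut: marks the matrix as symmetric positive definite
  // with only its lower triangular part stored
  A.set_spd(MatrixStorage::lower);

  A.display("A (SPD, lower storage)");

  return 0;
}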

2.3. Dealing with duplicates

What happens when we insert a value twice at the same location \((i, j)\)? Well, it's entirely up to you. The default behavior is to sum the two values, but a lambda function can be passed to override this. The function takes two parameters: the first one is a reference to the value already stored in the matrix, and the second is a const reference to the value you are inserting.

Let's see an example:

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);

  std::cout << "Default behavior: add coefficients\n";
  std::cout << "Insert (0, 0) -> 1\n";
  lilmat.insert(0, 0, 1);
  std::cout << "Insert (0, 0) -> 0.5\n";
  lilmat.insert(0, 0, 0.5);
  lilmat.to_dense().display("lilmat converted to dense");

  auto replace_coeff = [](double& present_v, const double& insert_v){ present_v = insert_v; };

  std::cout << "\nNew behavior: replace coefficients\n";
  std::cout << "Insert (0, 1) -> 1\n";
  lilmat.insert(0, 1, 1, replace_coeff);
  std::cout << "Insert (0, 1) -> 0.5\n";
  lilmat.insert(0, 1, 0.5, replace_coeff);
  lilmat.to_dense().display("lilmat updated 1");

  auto mean_coeff = [](double& present_v, const double& insert_v){ present_v = (present_v + insert_v) / 2; };

  std::cout << "\nNew behavior: compute the mean of the coefficients\n";
  std::cout << "Insert (1, 1) -> 1\n";
  lilmat.insert(1, 1, 1, mean_coeff);
  std::cout << "Insert (1, 1) -> 0.5\n";
  lilmat.insert(1, 1, 0.5, mean_coeff);
  lilmat.to_dense().display("lilmat updated 2");

  return 0;
}
Default behavior: add coefficients
Insert (0, 0) -> 1
Insert (0, 0) -> 0.5
lilmat converted to dense
m: 2 n: 2
Symmetry: general; Storage: full

1.5	0
0	0



New behavior: replace coefficients
Insert (0, 1) -> 1
Insert (0, 1) -> 0.5
lilmat updated 1
m: 2 n: 2
Symmetry: general; Storage: full

1.5	0.5
0	0



New behavior: compute the mean of the coefficients
Insert (1, 1) -> 1
Insert (1, 1) -> 0.5
lilmat updated 2
m: 2 n: 2
Symmetry: general; Storage: full

1.5	0.5
0	0.75


Remark: after the construction of your matrix, you may have zero or nearly-zero values stored in it. You can remove them with the .drop() method. Optionally, this method takes a real tolerance value, and all stored values whose absolute value is lower than this tolerance will be removed. For example, you can write lilmat.drop(1e-12), as in the sketch below.
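A short sketch of such a clean-up (only the insert, drop, to_dense and display methods used above are needed):

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);
  lilmat.insert(0, 0, 1);
  lilmat.insert(1, 1, 1e-15); // numerical noise

  // Remove every stored value whose absolute value is below 1e-12
  lilmat.drop(1e-12);

  lilmat.to_dense().display("lilmat after drop");

  return 0;
}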

2.4. Conversion to other matrix types

At present, the LIL format cannot be used for any operation other than insertion. Once you are satisfied with the LIL matrix you have built, you can convert it to the format you want to use: DenseMatrix, SparseMatrixCOO or SparseMatrixCSC.

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);

  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(1, 1, 3);
  lilmat.display("lilmat");

  const SparseMatrixCOO<double> coomat(lilmat.to_coo());
  coomat.display("coomat");

  const SparseMatrixCSC<double> cscmat(lilmat.to_csc());
  cscmat.display("cscmat");

  const DenseMatrix<double> densemat(lilmat.to_dense());
  densemat.display("densemat");

  return 0;
}
lilmat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

Column 0
i	v
0	4
1	-1
Column 1
i	v
1	3

coomat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

i	j	v

0	0	4
1	0	-1
1	1	3

cscmat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

Column 0
i	v
0	4
1	-1
Column 1
i	v
1	3

densemat
m: 2 n: 2
Symmetry: general; Storage: full

4	0
-1	3


Remark: notice that the header file SparseMatrixLIL.hpp includes the headers of the other matrix types (DenseMatrix, SparseMatrixCOO and SparseMatrixCSC).

2.5. COO and CSC formats

While constructing a SparseMatrixLIL by adding the coefficients one by one can be convenient, it is not optimal if you already know the number of non-zero values beforehand. In that case it is simpler to pass the \(i\), \(j\) and \(v\) arrays directly to build the COO or CSC matrix.

Several constructors are available.

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

int main(){
  using namespace maphys;

  const int M = 2; // Number of rows
  const int N = 2; // Number of columns
  const int NNZ = 3; // Number of non-zero values

  // Pass 3 initializer lists i, j and v, then dimensions M and N
  SparseMatrixCOO<double> coomat_1({0, 0, 1}, {0, 1, 1}, {0.5, 1.5, 2}, M, N);
  coomat_1.display("coomat built with initializer lists");

  // Pass dimensions M, N then NNZ and 3 pointers to the arrays for i, j and v
  // The arrays are not destroyed after the operation: values are copied
  int v_i[] = {0, 0, 1};
  int v_j[] = {0, 1, 1};
  double v_v[] = {0.5, 1.5, 2};
  SparseMatrixCOO<double> coomat_2(M, N, NNZ, v_i, v_j, v_v);

  coomat_2.display("coomat built from pointers");

  // Use maphys++ IndexArray class (an extended std::vector) We move the values
  // into the SparseMatrixCOO class, the IndexArray are empty after the
  // operation but no copy is performed
  IndexArray<int> ia_i{0, 0, 1};
  IndexArray<int> ia_j{0, 1, 1};
  IndexArray<double> ia_v{0.5, 1.5, 2};
  SparseMatrixCOO<double> coomat_3(M, N, NNZ, std::move(ia_i), std::move(ia_j), std::move(ia_v));

  coomat_3.display("coomat built with IndexArray");

  return 0;
}
coomat built with initializer lists
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

i	j	v

0	0	0.5
0	1	1.5
1	1	2

coomat built from pointers
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

i	j	v

0	0	0.5
0	1	1.5
1	1	2

coomat built with IndexArray
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

i	j	v

0	0	0.5
0	1	1.5
1	1	2

The same example can be written with SparseMatrixCSC instead of SparseMatrixCOO. We do not describe here how the CSC format works, but in our case the arrays for \(i\) and \(v\) are the same as for the coordinate format; only the \(j\) array is different: it has size \(N + 1\) and gives, for each column, the index in the \(i\)/\(v\) arrays where that column starts (here {0, 1, 3}: column 0 holds entry 0, column 1 holds entries 1 and 2).

#include <iostream>
#include <vector>
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixCSC.hpp>

int main(){
  using namespace maphys;

  const int M = 2; // Number of rows
  const int N = 2; // Number of columns
  const int NNZ = 3; // Number of non-zero values

  IndexArray<int> ia_i{0, 0, 1};
  IndexArray<int> ia_j{0, 1, 3};
  IndexArray<double> ia_v{0.5, 1.5, 2};
  SparseMatrixCSC<double> cscmat(M, N, NNZ, std::move(ia_i), std::move(ia_j), std::move(ia_v));

  cscmat.display("csc built with IndexArray");

  return 0;
}
csc built with IndexArray
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full

Column 0
i	v
0	0.5
Column 1
i	v
0	1.5
1	2

Remark: these formats also support conversion, like the LIL matrix, with the methods .to_coo(), .to_csc() and .to_dense().
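For example, continuing the CSC listing above (a minimal sketch reusing its cscmat variable and only the conversion and display methods already introduced):

  // Convert the CSC matrix to the other formats
  const SparseMatrixCOO<double> coo_from_csc(cscmat.to_coo());
  coo_from_csc.display("coo converted from csc");

  const DenseMatrix<double> dense_from_csc(cscmat.to_dense());
  dense_from_csc.display("dense converted from csc");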

2.6. Solving the system

Now let's take a look at the system we wanted to solve: \(A\) is a symmetric positive definite matrix with 7 non-zero values (5 if we use half storage). Let's build this matrix as a COO sparse matrix with lower storage and use the conjugate gradient to solve the system:

#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

int main(){

  using namespace maphys;

  SparseMatrixCOO<double> A({0, 1, 1, 2, 2},
                            {0, 0, 1, 1, 2},
                            {4, -1, 3, -0.5, 4},
                            3, 3);
  A.set_spd(MatrixStorage::lower);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  ConjugateGradient<SparseMatrixCOO<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;

  x.display("x");

  return 0;
}
Starting Conjugate Gradient. Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 11.7154
---
0 -	||R||/||B||	1
1 -	||R||/||B||	0.248805
2 -	||R||/||B||	0.048057
3 -	||R||/||B||	2.6488e-18
        ||B-AX||/||B||	0
x
m: 3 (vector)

1
2
3

If we look at the modifications compared to the dense version, we only build a SparseMatrixCOO instead of a DenseMatrix, and we indicate to the ConjugateGradient solver that our operator \(A\) is given in the SparseMatrixCOO format. The rest of the code did not change.

One can write very generic code using C++ features such as templates, decltype and auto:

#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

using namespace maphys;

template<typename Mat, typename Vect>
void solve_with_cg(const Mat& A, const Vect& b){
  ConjugateGradient<Mat, Vect> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  auto x = solver * b;
  x.display("x");
}

int main(){
  SparseMatrixCOO<double> A_coo({0, 1, 1, 2, 2},
                                {0, 0, 1, 1, 2},
                                {4, -1, 3, -0.5, 4},
                                3, 3);
  A_coo.set_spd(MatrixStorage::lower);

  const Vector<double> u{1, 2, 3};
  const Vector<double> b = A_coo * u;

  std::cout << "Solve CG with COO matrix\n";
  solve_with_cg(A_coo, b);

  const DenseMatrix<double> A_dense = A_coo.to_dense();
  std::cout << "Solve CG with dense matrix\n";
  solve_with_cg(A_dense, b);

  return 0;
}
Solve CG with COO matrix
Starting Conjugate Gradient. Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 11.7154
---
0 -	||R||/||B||	1
1 -	||R||/||B||	0.248805
2 -	||R||/||B||	0.048057
3 -	||R||/||B||	2.6488e-18
        ||B-AX||/||B||	0
x
m: 3 (vector)

1
2
3

Solve CG with dense matrix
Starting Conjugate Gradient. Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 11.7154
---
0 -	||R||/||B||	1
1 -	||R||/||B||	0.248805
2 -	||R||/||B||	0.048057
3 -	||R||/||B||	2.6488e-18
        ||B-AX||/||B||	0
x
m: 3 (vector)

1
2
3

3. A domain decomposition example

Introduction

In this tutorial we shall compile and run a small example of Compose on the plafrim supercomputer or on a simple Unix laptop.

3.1. Environment setup

3.1.1. guix setup

The simplest way to install Compose is to use Guix. This piece of software is available on plafrim and can be easily installed on a laptop. The Compose package is distributed in the guix-hpc repository. You may also be interested in the guix-hpc-non-free repository, which gives access to additional non open-source packages such as the Intel MKL.

To check that your installation is correct, the following command should give you an output:

guix search maphys++

and you should be able to find the Compose package. If it is not the case, refer to the documentation of the repositories linked above.

You can then check that the Compose package compiles:

guix build maphys++

3.1.2. guix environment

To obtain a clean environment with Compose, we launch a bash in a pure guix shell:

guix shell --pure maphys++

In this environment, only Compose and its dependencies are available. Furthermore, environment variables are set so that every library can be easily found by a compilation tool such as CMake (you can check the values of C_INCLUDE_PATH, LIBRARY_PATH, PKG_CONFIG_PATH…)

In particular, you will find the c++ headers of Compose:

#NB: add coreutils to use "ls"
guix shell --pure maphys++ coreutils
ls $GUIX_ENVIRONMENT/include/maphys
  1. Add other packages in the environment

    When using guix shell --pure maphys++, we obtain a very clean environment, containing only maphys++ and its required dependencies. Let's add some other packages necessary to modify and compile the code. Feel free to add other packages you need (your favorite text editor for instance). The following line should provide all the necessary tools to compile.

    guix shell --pure maphys++ coreutils emacs cmake make pkg-config gcc-toolchain@12.3 gfortran-toolchain
    
  2. Using MKL

    If you have guix-hpc-non-free set up and you want to use the Intel MKL, you need to tell Guix to replace openblas (the default provider for blas and lapack) with mkl. MUMPS, a dependency of Compose, also has a specific package for this purpose, as does the Python package numpy.

    guix shell --pure maphys++ coreutils emacs cmake \
                      make pkg-config gcc-toolchain@12.3 gfortran-toolchain \
                      --with-input=mumps-openmpi=mumps-mkl-openmpi \
                      --with-input=openblas=mkl \
                      --with-input=python-numpy=python-numpy-mkl
    

3.2. Compile an example

3.2.1. Example description

  1. Domain decomposition

    In the example presented, the adjacency graph of the global matrix consists of 7 vertices, which are assigned to the following subdomains:

    Node          0     1     2     3        4     5     6
    Subdomain(s)  0, 2  0     0, 1  0, 1, 2  1     1, 2  2

    The MPI ranks are numbered from 0 to \(N\) - 1, consistently with the default MPI numbering. However, subdomains are numbered from 1 to \(N\) to stay consistent with previous chapters, where index 0 is reserved for the coarse space. In the code, the subdomain \(\Omega_i\) is therefore identified by its unique ID \(i\). A local numbering of the unknowns is used in each subdomain: in the example, the local indices [0, 1, 2, 3] in \(\Omega_0\) correspond to the same global indices [0, 1, 2, 3]. The global indices corresponding to local indices [0, 1, 2, 3] of \(\Omega_1\) and \(\Omega_2\) are [3, 2, 4, 5] and [5, 6, 0, 3], respectively. Each subdomain has to know which unknowns it shares with which other subdomains (dotted lines in figure (c)). In our example, for instance, in local ordering:

    • \(\Omega_0\) shares [2, 3] with \(\Omega_1\) and [0, 3] with \(\Omega_2\).
    • \(\Omega_1\) shares [1, 0] with \(\Omega_0\) and [0, 3] with \(\Omega_2\).
    • \(\Omega_2\) shares [2, 3] with \(\Omega_0\) and [3, 0] with \(\Omega_1\).
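      The following sketch is purely illustrative (plain C++ arrays, not part of the Compose API): it restates these local-to-global index maps, which the example code below encodes through its neighbour maps.

      // Illustration only: local-to-global index maps of the three subdomains.
      // Local index k of subdomain i corresponds to global index loc2glob[i][k].
      const int loc2glob[3][4] = { {0, 1, 2, 3},   // Omega_0
                                   {3, 2, 4, 5},   // Omega_1
                                   {5, 6, 0, 3} }; // Omega_2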

      Figure example_subdomains.svg: The global domain in (a) is partitioned into three subdomains \(\Omega_0\), \(\Omega_1\) and \(\Omega_2\) in (b). Each subdomain can use a local ordering of its unknowns as in (c); duplicate unknowns in different subdomains are linked by a dotted line.

      Figure example_sd_intrf.svg: The interior unknowns can be eliminated to keep only the interface unknowns that are shared by two subdomains or more (a). A global interface numbering is introduced (b). The global interface can be expressed as the union of the three local interfaces \(\G_0\), \(\G_1\) and \(\G_2\) in (c).

  2. Input matrix

    The system we are solving is \(K . u = f\), where \(K\) is a partitioned sparse SPD matrix, \(u\) the unknown vector and \(f\) the right hand side.

    For the sake of simplicity, we put the same local matrix on each subdomain. Notice that the sparsity structure of the matrix corresponds to the domain decomposition described above.

    Local matrices: \[\begin{pmatrix} 3 &-1 & 0 &-1 \\ -1 & 4 &-1 &-1 \\ 0 &-1 & 3 &-1 \\ -1 &-1 &-1 & 4 \end{pmatrix}\]

    Which results in this global matrix:

    Global matrix: \[\K = \begin{pmatrix} 6 & -1 & 0 & -2 & 0 & 0 & -1 \\ -1 & 4 & -1 & -1 & 0 & 0 & 0 \\ 0 & -1 & 7 & -2 & -1 & -1 & 0 \\ -2 & -1 & -2 & 11 & 0 & -2 & -1 \\ 0 & 0 & -1 & 0 & 3 & -1 & 0 \\ 0 & 0 & -1 & -2 & -1 & 7 & -1 \\ -1 & 0 & 0 & -1 & 0 & -1 & 4 \end{pmatrix}\]

    The solution vector we expect to find is simply: \[u = \begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{pmatrix}\]

    Thus, we shall use \(f = \K u\) for the right hand side vector:

    Global vector: \[f = \begin{pmatrix} -13 \\ -1 \\ -2 \\ 12 \\ 5 \\ 17 \\ 16 \end{pmatrix}\]

    The vectors are always assembled. So the local right hand side vectors are respectively:

    \[\begin{pmatrix} -13 \\ -1 \\ -2 \\12 \end{pmatrix}, \begin{pmatrix} 12 \\ -2 \\ 5 \\ 17 \end{pmatrix} \text{, and } \begin{pmatrix} 17 \\ 16 \\ -13 \\ 12 \end{pmatrix}\]
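    To check these numbers, here is a small stand-alone sketch (sequential and dense, using only the DenseMatrix and Vector classes from the first part of the tutorial) that multiplies the global matrix by the expected solution and should reproduce the right-hand side above:

    #include <maphys.hpp>

    int main(){
      using namespace maphys;

      // Global matrix K (symmetric, given in full; column-major order)
      DenseMatrix<double> K({ 6, -1,  0, -2,  0,  0, -1,
                             -1,  4, -1, -1,  0,  0,  0,
                              0, -1,  7, -2, -1, -1,  0,
                             -2, -1, -2, 11,  0, -2, -1,
                              0,  0, -1,  0,  3, -1,  0,
                              0,  0, -1, -2, -1,  7, -1,
                             -1,  0,  0, -1,  0, -1,  4}, 7, 7);

      Vector<double> u{0, 1, 2, 3, 4, 5, 6};
      Vector<double> f = K * u;

      f.display("f"); // expected: (-13, -1, -2, 12, 5, 17, 16)

      return 0;
    }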

3.2.2. Example code

First we need an example of Compose usage. Let's take the example in src/examples/simple_part.cpp of the Compose repository:

#include <iostream>
#include <vector>
#include <mpi.h>
#include <maphys.hpp>
#include <maphys/part_data/PartMatrix.hpp>
#include <maphys/solver/Pastix.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/solver/PartSchurSolver.hpp>

int main(){

  using namespace maphys;
  using Scalar = double;

  MMPI::init();
  {
    const int rank = MMPI::rank();

    const int M = 4;
    const int NNZ = 9;
    const int N_DOFS = 4;
    const int N_SUBDOMAINS = 3;

    // Bind subdomains to MPI process
    std::shared_ptr<Process> p = bind_subdomains(N_SUBDOMAINS);

    // Define topology for each subdomain
    std::vector<Subdomain> sd;
    if(p->owns_subdomain(0)){
      std::map<int, IndexArray<int>> nei_map_0 {{1, {2, 3}},
                                                {2, {0, 3}}};
      sd.emplace_back(0, N_DOFS, std::move(nei_map_0));
    }
    if(p->owns_subdomain(1)){
      std::map<int, IndexArray<int>> nei_map_1 {{0, {1, 0}},
                                                {2, {0, 3}}};
      sd.emplace_back(1, N_DOFS, std::move(nei_map_1));
    }
    if(p->owns_subdomain(2)){
      std::map<int, IndexArray<int>> nei_map_2 {{0, {2, 3}},
                                                {1, {3, 0}}};
      sd.emplace_back(2, N_DOFS, std::move(nei_map_2));
    }
    p->load_subdomains(sd);

    // Local matrices (here the same matrix on every subdomain)
    std::vector<int> a_i{0, 0, 0, 1, 1, 1, 2, 2, 3};
    std::vector<int> a_j{0, 1, 3, 1, 2, 3, 2, 3, 3};
    std::vector<Scalar> a_v{3, -1, -1, 4, -1, -1, 3, -1, 4};
    SparseMatrixCOO<Scalar, int> A_loc(M, M, NNZ, a_i.data(), a_j.data(), a_v.data());
    A_loc.set_spd(MatrixStorage::upper);

    // Create the partitioned matrix
    std::map<int, SparseMatrixCOO<Scalar, int>> A_map;
    for(int k = 0; k < 3; ++k) if(p->owns_subdomain(k)) A_map[k] = A_loc;
    const PartMatrix<SparseMatrixCOO<Scalar>> A(p, std::move(A_map));

    // Local rhs vectors
    std::map<int, Vector<Scalar>> b_map;
    if(p->owns_subdomain(0)) b_map[0] = Vector<Scalar>{-13, -1, -2, 12};
    if(p->owns_subdomain(1)) b_map[1] = Vector<Scalar>{12, -2, 5, 17};
    if(p->owns_subdomain(2)) b_map[2] = Vector<Scalar>{17, 16, -13, 12};

    // Create the partitioned vector
    const PartVector<Vector<Scalar>> b(p, std::move(b_map));

    // Solver definitions
    using Solver_K = Pastix<SparseMatrixCOO<Scalar>, Vector<Scalar>>;
    using Solver_S = ConjugateGradient<PartMatrix<DenseMatrix<Scalar>>, PartVector<Vector<Scalar>>>;
    using Solver = PartSchurSolver<PartMatrix<SparseMatrixCOO<Scalar>>, PartVector<Vector<Scalar>>, Solver_K, Solver_S>;

    // Setup
    Solver schur_solver;
    schur_solver.setup(A);

    // Set parameters for the solver S (ConjugateGradient)
    Solver_S& iter_solver = schur_solver.get_solver_S();
    iter_solver.setup(parameters::max_iter{50},
                      parameters::tolerance{1e-8},
                      parameters::verbose{(rank == 0)});

    // Solve
    auto X = schur_solver * b;

    X.display_centralized("X found");
    if(rank == 0) std::cout << "Niter: " << iter_solver.get_n_iter() << '\n';

  }
  MMPI::finalize();

  return 0;
}

It is an example that solves a distributed sparse linear system on 3 subdomains, using Pastix to compute the Schur complement and a conjugate gradient to solve the resulting interface system.

3.2.3. Compilation

Now let's create a directory tutorial and copy this file simple_part.cpp inside:

mkdir tutorial && cd tutorial
#create or copy simple_part.cpp

In order to get Compose and all its dependencies, we use the following CMakeLists.txt file:

cmake_minimum_required(VERSION 3.12)

project(MAPHYS_EXAMPLE CXX C Fortran)

find_package(maphyspp REQUIRED)

# compile your example
add_executable(simple_part simple_part.cpp)

# link to maphyspp
target_link_libraries(simple_part PRIVATE MAPHYSPP::maphyspp)

Now we can compile our example using the standard CMake way:

mkdir build && cd build
cmake ..
make

3.2.4. Running the example

Let's run the example. It is written so that it can be run with any number of MPI processes. However, the optimal number of processes is one per subdomain, which is 3 in this example:

mpirun -np 3 ./simple_part

The output should look like:

Starting Conjugate Gradient. Max iterations: 50
Stopping criterion ||R||/||B|| < 1e-08
with ||B|| = 29.116
---
0 -	||R||/||B||	0.97707
1 -	||R||/||B||	0.441184
2 -	||R||/||B||	0.220872
3 -	||R||/||B||	0.0646697
4 -	||R||/||B||	2.80055e-17
        ||B-AX||/||B||	1.49027e-16
X found
m: 7 (vector)

2.22045e-16
1
2
3
4
5
6

Niter: 4

4. Dense matrices

Introduction

In this tutorial we show examples of dense matrix operations on a laptop and on the supercomputer plafrim, using either Blas/Lapack (e.g. Intel MKL, OpenBLAS) or Chameleon. The Compose class used is DenseMatrix, for which there are associated kernels (scal, axpy, norm2, gemm, symm, eig, svd, etc.) and solvers (getrf/s, potrf/s, geqrf/s, gelqf/s, trsm). Note that operations on dense matrices are limited to a single machine node for now (no distributed MPI), but with Chameleon it is possible to exploit GPUs in addition to CPUs.

4.1. Source code

The source code is a C++ file demo_dense.cpp which relies on Compose to

  1. perform a dense matrix multiplication \(C = A * B\),
  2. solve a linear system \(A * x = b\) with \(A\) symmetric positive definite (SPD) using Cholesky factorization \(A=LL^T\),
  3. solve an overdetermined system \(A * x = b\) with \(A\) rectangular using least squares and a factorization \(A=QR\).

The Chameleon C library is also used to generate random matrices, cf. CHAMELEON_dplrnt (general matrix) and CHAMELEON_dplgsy (SPD matrix). The Chameleon context must be initialized with Chameleon chameleon;. The number of CPUs and GPUs to be used by Chameleon can be chosen with Chameleon::ncpus and Chameleon::ngpus. The size of the sub-blocks (tiles), on which sequential kernels operate, can be set with Chameleon::tile_size. Those parameters have to be set before the Chameleon initialization. If they are not set, default values are chosen internally: all CPUs, no GPUs and a tile size of 320.

#include <maphys.hpp>

using namespace maphys;

int main(int argc, char *argv[]) {

  using Scalar = double;

  /* Initialize Chameleon context */
  Chameleon::ncpus = 4;
  Chameleon::ngpus = 0;
  Chameleon::tile_size = 640;
  Chameleon chameleon;

  /* Matrix product GEMM: C=AB */

  // Fill the matrix with random values
  int M=1000;
  DenseMatrix<Scalar> A(M, M);
  DenseMatrix<Scalar> B(M, M);
  DenseMatrix<Scalar> C(M, M);

  int seedA = 10;
  CHAMELEON_dplrnt( M, M, get_ptr(A), M, seedA );
  int seedB = 20;
  CHAMELEON_dplrnt( M, M, get_ptr(B), M, seedB );

  // Compute the product (GEMM)
  chameleon_kernels::gemm(A, B, C);
  // or one can directly write: C = A * B;


  /* Linear system solve (Cholesky): Ax=b */

  // Fill the matrix A with random values so that it is square and SPD
  CHAMELEON_dplgsy( (Scalar)M, ChamLower, M, get_ptr(A), M, seedA );
  A.set_spd(MatrixStorage::full);
  Vector<Scalar> b(M);
  CHAMELEON_dplrnt( M, 1, get_ptr(b), M, seedB );

  // Declare the Solver
  ChameleonSolver<DenseMatrix<Scalar>, Vector<Scalar>> SCcham;
  // if A is given it is copied in the solver class
  // giving A as rvalue allows to avoid copy
  SCcham.setup(std::move(A));

  // Solve the linear system
  Vector<Scalar> x = SCcham * b;

  // Can also call SCcham.factorize(); to perform factorization before solving


  /* Overdetermined system solve (QR): Ax=b */

  // clear previous A and realloc
  int N=500;
  A = DenseMatrix<Scalar>(M, N);

  // Fill the rectangular matrix A with random values
  CHAMELEON_dplrnt( M, N, get_ptr(A), M, seedA );

  // Declare the solver
  ChameleonSolver<DenseMatrix<Scalar>, Vector<Scalar>> SLcham;
  SLcham.setup(std::move(A));

  // Solve the least squares problem
  Vector<Scalar> x_ls = SLcham * b; // new name: x is already declared above

  return 0;
}

A more complete version of this example can be found in src/examples/demo_dense.cpp in the Compose repository, with CPU time comparisons between Blas/Lapack and Chameleon.

There are of course other routines available to perform dense matrix operations, but there is not always a Chameleon implementation of each Blas/Lapack function.

We summarize the Blas/Lapack operations available in Compose through the kernel calls in the following table. All routines can be called from blas_kernels:: or chameleon_kernels::, but if there is no Chameleon implementation the blas/lapack kernel is actually called. The X symbol in the table means there is a Chameleon implementation which can accelerate the operation.

Operation Chameleon
\(\text{scal}\)  
\(\text{axpy}\)  
\(\text{dot}\)  
\(\text{norm2}\)  
\(\text{gemv}\)  
\(\text{gemm}\) X
\(\text{symm}\) X
\(\text{hemm}\) X
\(\text{eig}\)  
\(\text{eigvals}\)  
\(\text{eigh}\)  
\(\text{eigvalsh}\)  
\(\text{eigh_select}\)  
\(\text{eigh_smallest}\)  
\(\text{eigh_largest}\)  
\(\text{gen_eigh_select}\)  
\(\text{gen_eigh_smallest}\)  
\(\text{gen_eigh_largest}\)  
\(\text{svd}\)  
\(\text{svd_left_sv}\)  
\(\text{svd_right_sv}\)  
\(\text{svdvals}\)  
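For instance, the matrix product of the example above could equally be computed with the Blas/Lapack backend; a one-line sketch assuming the A, B and C matrices of the GEMM part of the previous listing:

  // Same operation as chameleon_kernels::gemm(A, B, C), executed by the
  // blaspp/lapackpp backend instead of Chameleon
  blas_kernels::gemm(A, B, C); // C = A * B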

We summarize the solver routines available in Compose through the solver classes (BlasSolver, ChameleonSolver) in the following table.

Operation Chameleon
LU (\(\text{getrf/s}\))  
Cholesky (\(\text{potrf/s}\)) X
Symmetric complex (\(\text{sytrf/s}\)) X
Least Squares (\(\text{geqrf/s}\)) X
Minimum norm (\(\text{gelqf/s}\)) X
Triangular Solve (\(\text{trsm}\)) X
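Similarly, the Cholesky solve of the example could rely on Lapack simply by swapping the solver class; a sketch of the same pattern, assuming an SPD DenseMatrix A and a Vector b as in the Cholesky part of the listing:

  // Same setup/solve pattern as ChameleonSolver, but backed by Lapack (potrf/potrs)
  BlasSolver<DenseMatrix<Scalar>, Vector<Scalar>> SClapack;
  SClapack.setup(A); // A must carry its SPD property (set_spd) for Cholesky to be used
  Vector<Scalar> x_lapack = SClapack * b;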

4.2. Execution on a Linux Ubuntu laptop

4.2.1. Install Compose with CMake

Install system packages requirements

sudo apt-get update -y
sudo apt-get install -y git cmake build-essential gfortran python wget tar curl pkg-config libmkl-dev

Notice one can replace libmkl-dev by libopenblas-dev.

Install specific packages with CMake

# install blaspp, lapackpp and arpack-ng
git clone https://github.com/icl-utk-edu/blaspp.git
git clone https://github.com/icl-utk-edu/lapackpp.git
git clone https://github.com/opencollab/arpack-ng.git

cd blaspp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j5
sudo cmake --install build
cd ../

cd lapackpp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j5
sudo cmake --install build
cd ../

cd arpack-ng
cmake -B build -DICB=ON
cmake --build build -j5
sudo cmake --install build
cd ../

# Install Starpu and Chameleon
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
./configure --prefix=/usr/local --disable-build-doc --disable-build-test --disable-opencl --disable-fortran --disable-socl --disable-starpufft
make -j5
sudo make install
cd ../

git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
cmake -B build -DCHAMELEON_USE_MPI=OFF -DCMAKE_INSTALL_PREFIX=/usr/local -DBUILD_SHARED_LIBS=ON
# if starpu is installed in a specific path add the following to the configuration: -DCMAKE_PREFIX_PATH="path to starpu"
cmake --build build -j5
sudo cmake --install build
cd ../

Install Compose

git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build -j5
sudo cmake --install build

4.2.2. Compile and run the example

To link a C++ executable with Compose from a file demo_dense.cpp the straightforward way is to use CMake, example of CMakeLists.txt file:

cmake_minimum_required(VERSION 3.5)
project(DEMO_DENSE CXX C Fortran)

find_package(maphyspp REQUIRED)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(demo_dense demo_dense.cpp)
target_link_libraries(demo_dense PRIVATE MAPHYSPP::maphyspp)
target_compile_definitions(demo_dense PRIVATE MAPHYSPP_USE_CHAMELEON)

Then, in the same directory where demo_dense.cpp and CMakeLists.txt files are located use cmake command as follows:

cmake -B build
# if dependencies are installed in specific directories, add them to the configuration: -DCMAKE_PREFIX_PATH="path to chameleon;path to starpu"
cmake --build build
# Run the example
./build/demo_dense

4.3. Execution using GNU Guix

We assume that Guix is up to date and that the guix-hpc-non-free channel is set up. We use the complete example here, located in src/examples/demo_dense.cpp. Let's compare the CPU times given by Lapack (Intel MKL here) and Chameleon using four threads, i.e. four CPUs.

Compile the example using Guix

git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
guix shell --pure -D maphys++-mkl -- /bin/bash --norc
cmake -B build-guix  -DMAPHYSPP_COMPILE_EXAMPLES=ON -DMAPHYSPP_COMPILE_TESTS=OFF -DMAPHYSPP_DRIVERS=OFF -DMAPHYSPP_C_DRIVER=OFF -DMAPHYSPP_Fortran_DRIVER=OFF -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build-guix --target mpp_demo_dense

Arguments which can be used with the executable:

  1. 0 for Blas/Lapack (blaspp/lapackpp+Intel MKL MT), 1 for Chameleon (+Intel MKL sequential)
  2. Size M (size of the GEMM and linear system, number of rows and columns)
  3. Size N (number of columns of the rectangular matrix during QR and least squares problem)
  4. Chameleon block/tile size
  5. Chameleon number of CPU workers
  6. Chameleon number of GPU workers

Execution with Blas/Lapack from the MKL using 4 threads

MKL_NUM_THREADS=4 ./build-guix/src/examples/mpp_demo_dense 0 10000 5000

Example results

Least squares system time, 1, 4.74672
Linear system (Cholesky) time, 1, 2.64744
Gemm time, 1, 14.4328

Execution with Chameleon using 4 threads

MKL_NUM_THREADS=1 ./build-guix/src/examples/mpp_demo_dense 1 10000 5000 640 4 0

Example results

Least squares system time, 1, 3.77979
Linear system (Cholesky) time, 1, 2.73428
Gemm time, 1, 15.419

Chameleon delivers performance close to MKL's, knowing that Intel MKL is a Blas/Lapack reference for a single machine with Intel CPUs.

Notice that we have set MKL_NUM_THREADS=1 in the Chameleon case since all the parallelism is directly handled by Chameleon, which relies on sequential kernels scheduled by the Starpu runtime system.

4.4. Execution on the Plafrim supercomputer using GPUs

Let's compare the CPU times given by Lapack (Intel MKL here) and Chameleon on a plafrim sirocco node (2x16 cores + 2 P100 GPUs).

Compile the example using Guix

ssh plafrim
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
guix shell --pure -D maphys++-mkl --with-input=chameleon-mkl-mt-nompi=chameleon-cuda-mkl-mt-nompi slurm -- /bin/bash --norc
cmake -B build-guix -DMAPHYSPP_COMPILE_EXAMPLES=ON -DMAPHYSPP_COMPILE_TESTS=OFF -DMAPHYSPP_DRIVERS=OFF -DMAPHYSPP_C_DRIVER=OFF -DMAPHYSPP_Fortran_DRIVER=OFF -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build-guix --target mpp_demo_dense

Notice that we replace the default Chameleon package with a version built with CUDA to be able to use GPUs.

Reserve one p100 node and expose the location of the CUDA driver in the Guix environment (can be different depending on the supercomputer)

salloc --nodes=1 --time=01:00:00 --constraint p100 --exclusive --ntasks-per-node=1 --threads-per-core=1
export LD_PRELOAD="/usr/lib64/libcuda.so"

Execution with Blas/Lapack from the MKL using 32 threads

export MKL_NUM_THREADS=32
srun -n 1 -c 32 ./build-guix/src/examples/mpp_demo_dense 0 50000 10000

Example results

Least squares system time, 1, 24.2582
Linear system (Cholesky) time, 1, 95.9617
Gemm time, 1, 354.629

Execution with Chameleon using 29 threads for CPU kernels, 2 threads for GPU kernels and 1 core reserved for Starpu

export MKL_NUM_THREADS=1
srun -n 1 -c 32 ./build-guix/src/examples/mpp_demo_dense 1 50000 10000 1240 29 2

Example results

Least squares system time, 1, 12.3178
Linear system (Cholesky) time, 1, 24.4654
Gemm time, 1, 86.5949

Chameleon outperforms MKL here thanks to the GPU kernels (cuBLAS in this case) it is able to exploit.

Date: 18/01/2024 at 18:51:41

Author: HiePACS Gilles Marait

Created: 2024-01-18 Thu 18:51
