Compose
Tutorial
1. A minimal example
1.1. Requirements
For this tutorial we shall use Compose (a.k.a. Maphys++) as a header-only C++ library, so the required tools are limited. You only need git, cmake and a C++ compiler (g++ or clang++ for instance) supporting the C++20 standard.
Later on, when performance becomes important, optimized libraries such as BLAS, LAPACK, direct sparse solvers… must be used. Interfaces for those are already present in Compose and you will be able to take advantage of them with little effort. For small prototypes such as this tutorial, or when performance is not important, the header-only version is sufficient.
Now let's get the Compose C++ headers. The following commands clone the source code (with git) and set the variables MAPHYSPP_INCLUDES and TLAPACK_INCLUDES to the paths of those headers.
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
MAPHYSPP_INCLUDES=`realpath maphyspp/include`
TLAPACK_INCLUDES=`realpath maphyspp/tlapack/include`
The --recursive parameter is important: we do need the submodules. If you forgot it, you can use the following commands to get and update them:
# In the repository of compose
git submodule init
git submodule update
1.2. A simple example
Let's start with the following C++ example in the file minimal.cpp:
#include <maphys.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A(3, 3);
  A(0, 0) = 4;
  A(1, 0) = -1;
  A(0, 1) = A(1, 0);
  A(1, 1) = 3;
  A(2, 1) = -0.5;
  A(1, 2) = A(2, 1);
  A(2, 2) = A(0, 0);

  A.display("A");

  return 0;
}
A
m: 3 n: 3
Symmetry: general; Storage: full
 4    -1     0
-1     3    -0.5
 0    -0.5   4
1.3. Compiling and running
Compose is written in C++20. In this tutorial we use the GNU compiler g++. The beginning of the command line will be:
g++ -std=c++20
Since we are using header-only C++ libraries, all we need to do to compile is include the correct headers. We need to indicate where the headers of Compose and <T>LAPACK are:
-I${MAPHYSPP_INCLUDES} -I${TLAPACK_INCLUDES}
We need an extra flag to tell Compose that we want to use the header-only implementation, forgoing the dependencies that require compilation and linking. In particular, we can't use MPI or a compiled BLAS implementation with this flag.
-DMAPHYSPP_HEADER_ONLY
Remark: if it's more convenient for you, it is also possible to add the definition in the code before including maphys.hpp rather than using the flag on the command line:
#include <iostream>
#include <vector>

#define MAPHYSPP_HEADER_ONLY
#include <maphys.hpp>
//...
So the command line to compile our example minimal.cpp is:
g++ -std=c++20 -I${MAPHYSPP_INCLUDES} -I ${TLAPACK_INCLUDES} -DMAPHYSPP_HEADER_ONLY minimal.cpp -o minimal
And we can run the code with ./minimal. The output is:
A
m: 3 n: 3
Symmetry: general; Storage: full
 4    -1     0
-1     3    -0.5
 0    -0.5   4
1.4. Explanation of the program
The only feature used here is the class DenseMatrix implemented in Compose. It corresponds to a matrix where all coefficients are stored in a 2D array. In this example we created a 3 by 3 matrix \(A\) with the constructor DenseMatrix<double> A(3, 3), filled with zeros. The scalar type of the coefficients of the matrix is given as a template parameter; here we use double precision floating-point arithmetic for real numbers, which corresponds to the type double. Then, we can access and assign the coefficients \(a_{i, j}\) of the matrix with A(i, j). Here we filled \(A\) with some values. Finally, we displayed the matrix on the standard output using the display method of the class. Optionally we can pass the name of the matrix we display as a parameter of the method, so that we can easily identify it in the output (here we passed "A").
In the output, we can see the name of the matrix "A", followed by its description. \(m\) is the number of rows, \(n\) the number of columns. We can see that no assumption is made on the properties of the matrix: although the matrix is clearly symmetric, the program doesn't have this piece of information unless the user specifies it. For this example it is not necessary to make this effort, but the sketch below shows how it could be done.
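For completeness, here is a minimal sketch declaring the symmetry explicitly; it relies on the set_property call presented later in the sparse matrix section, which the tutorial states can be used on any matrix type:

#include <maphys.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A(3, 3);
  A(0, 0) = 4;
  A(1, 0) = -1;
  A(0, 1) = A(1, 0);
  A(1, 1) = 3;
  A(2, 1) = -0.5;
  A(1, 2) = A(2, 1);
  A(2, 2) = A(0, 0);

  // Tell the library that A is symmetric so it can exploit that property.
  A.set_property(MatrixSymmetry::symmetric);

  A.display("A");
  return 0;
}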
1.5. Another example with a vector
Let's write another simple example to perform a matrix-vector product.
#include <maphys.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A({ 4, -1,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  u.display("u");
  b.display("b");

  return 0;
}
u
m: 3 (vector)
1
2
3
b
m: 3 (vector)
2
3.5
11
First we introduced a new constructor for the DenseMatrix. The first parameter is a braced list of the matrix coefficients, then its number of rows \(m\) and number of columns \(n\). The coefficients are given in column-major format, meaning the first \(m\) values describe the first column, the next \(m\) the second column, etc. Here the matrix is symmetric so it does not make a difference, but you can modify a non-diagonal coefficient and display the matrix to see the effect, as in the sketch below.
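A minimal sketch of that experiment: only the second value of the list is changed, and because of the column-major ordering it ends up at \(A(1, 0)\), below the diagonal.

#include <maphys.hpp>

int main(){
  using namespace maphys;

  // Same list as before, except the second value (-1 becomes -9).
  // With column-major ordering this is row 1 of column 0, i.e. A(1, 0).
  DenseMatrix<double> A({ 4, -9,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);
  A.display("A");
  return 0;
}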
Secondly, we declared \(u\), a Vector of double. Actually, Vector is a special case of DenseMatrix where the number of columns is fixed to 1. The constructor is therefore very similar: you only need to give the list of coefficients of the vector, and the size of the vector is deduced from the size of the list.
Then we perform a matrix-vector product of the matrix \(A\) with \(u\) and store the result in a new Vector \(b\). We display \(u\) and \(b\) in the same way we displayed the matrix \(A\) before, using the display method.
1.6. Solving a linear system
1.6.1. With a Jacobi iterative solver
Now let's try to solve a linear system. The matrix \(A\) is diagonally dominant, so we can try to solve it with a Jacobi iterative solver. In the following example, we use this iterative method to find \(x\) such that \(Ax = b\). We can display \(x\) and make sure it has the same values as \(u = (1, 2, 3)^T\).
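As a reminder, the Jacobi method splits \(A = D + R\), where \(D\) is the diagonal of \(A\) and \(R\) contains the off-diagonal entries, and iterates \[x^{(k+1)} = D^{-1}\left(b - R\,x^{(k)}\right),\] which converges for strictly diagonally dominant matrices such as this one.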
#include <maphys.hpp>
#include <maphys/solver/Jacobi.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A({ 4, -1,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  Jacobi<DenseMatrix<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;
  x.display("x");

  return 0;
}
Starting Jacobi.
Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 11.7154
---
0 - ||B-AX||/||B|| 1
1 - ||B-AX||/||B|| 0.194964
2 - ||B-AX||/||B|| 0.067276
3 - ||B-AX||/||B|| 0.0203088
4 - ||B-AX||/||B|| 0.00700792
5 - ||B-AX||/||B|| 0.0021155
6 - ||B-AX||/||B|| 0.000729992
7 - ||B-AX||/||B|| 0.000220364
8 - ||B-AX||/||B|| 7.60408e-05
9 - ||B-AX||/||B|| 2.29546e-05
10 - ||B-AX||/||B|| 7.92092e-06
11 - ||B-AX||/||B|| 2.39111e-06
12 - ||B-AX||/||B|| 8.25095e-07
13 - ||B-AX||/||B|| 2.49073e-07
14 - ||B-AX||/||B|| 8.59474e-08
15 - ||B-AX||/||B|| 2.59452e-08
16 - ||B-AX||/||B|| 8.95286e-09
x
m: 3 (vector)
1
2
3
We include the header maphys/solver/Jacobi.hpp to have access to the Jacobi solver. We need to indicate the types of matrix and vector used by the solver as template parameters: DenseMatrix<double> for the matrix, and Vector<double> for the vector.
Many parameters can be given to the solver, under the namespace parameters. Here we are giving the most useful ones:
- parameters::A, which is mandatory, is the input square matrix, or the operator of the system;
- parameters::tolerance is the targeted precision of the solution; more specifically, the iterative solver considers convergence achieved when the backward error \(||b - Ax||_2 / ||b||_2\) is lower than this value;
- parameters::max_iter is the maximum number of iterations. When reached, the solver stops even if it has not converged to the required tolerance;
- parameters::verbose is a boolean value, false by default, printing information on the standard output when true.

Those parameters can be passed to the setup method in any order (as in the sketch below) and have default values if not given (although they may not be relevant values in your context).

Once set up, solving the linear system consists of a simple multiplication: solver * rhs, returning the solution vector. One may consider the solver instance as the \(A^{-1}\) operator, even if no explicit inversion of a matrix is ever performed.
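As an illustration, here is a minimal sketch, a variation of the example above (the right-hand side \(b = (2, 3.5, 11)^T\) is the vector \(Au\) computed previously), where the parameters are passed in a different order and verbose is left to its default value:

#include <maphys.hpp>
#include <maphys/solver/Jacobi.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A({ 4, -1,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);
  Vector<double> b{2, 3.5, 11};

  Jacobi<DenseMatrix<double>, Vector<double>> solver;
  // Same setup as before, but the named parameters are given in another
  // order and verbose keeps its default value (false).
  solver.setup(parameters::tolerance{1e-8},
               parameters::max_iter{20},
               parameters::A{A});

  Vector<double> x = solver * b;
  x.display("x");
  return 0;
}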
1.6.2. With a conjugate gradient
The matrix \(A\) is actually not only diagonally dominant, it is also symmetric positive definite. Therefore we can use a conjugate gradient to solve the system. Since iterative solvers share the same set of basic parameters, we can take the same code and just replace Jacobi with ConjugateGradient.
#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>

int main(){
  using namespace maphys;

  DenseMatrix<double> A({ 4, -1,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  ConjugateGradient<DenseMatrix<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;
  x.display("x");

  return 0;
}
Starting Conjugate Gradient.
Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 11.7154
---
0 - ||R||/||B|| 1
1 - ||R||/||B|| 0.248805
2 - ||R||/||B|| 0.048057
3 - ||R||/||B|| 2.6488e-18
||B-AX||/||B|| 0
x
m: 3 (vector)
1
2
3
As expected, the conjugate gradient needs only 3 iterations (the size of the linear system) to converge.
2. Sparse matrices
We propose 3 types of sparse matrices in Compose:

- SparseMatrixCOO: a matrix in coordinate format (also known as IJV format);
- SparseMatrixCSC: a matrix in Compressed Sparse Column format;
- SparseMatrixLIL: a matrix in list-of-lists format, a special format for coefficient insertion.
2.1. Create a sparse matrix with SparseMatrixLIL
The easiest way to create a sparse matrix is to insert the coefficients of the matrix one by one. The only format that allows this kind of insertion is SparseMatrixLIL, which is implemented with an array of \(n\) (the number of columns) lists. Thus insertion can be performed in linear time.
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(3, 3);
  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(0, 1, -1);
  lilmat.insert(1, 1, 3);
  lilmat.insert(1, 2, -0.5);
  lilmat.insert(2, 1, -0.5);
  lilmat.insert(2, 2, 4);

  lilmat.display("lilmat");
  lilmat.to_dense().display("lilmat converted to dense");

  return 0;
}
lilmat
m: 3 n: 3 nnz: 7
Symmetry: general; Storage: full
Column 0
i    v
0    4
1    -1
Column 1
i    v
0    -1
1    3
2    -0.5
Column 2
i    v
1    -0.5
2    4
lilmat converted to dense
m: 3 n: 3
Symmetry: general; Storage: full
 4    -1     0
-1     3    -0.5
 0    -0.5   4
We included the header for the LIL matrix: maphys/loc_data/SparseMatrixLIL.hpp. First we construct the sparse matrix with its dimensions \((m, n)\), and then we add the coefficients \((i, j, v)\) (row index, column index and value of the coefficient respectively) one by one. This matrix type is well suited when you want to construct a matrix but you don't know the final number of nonzero elements it contains.
2.2. Using half storage for symmetric matrices
As we fill this symmetric sparse matrix, we can see an obvious optimization: half storage. Indeed, if we could indicate that the matrix is symmetric, we would not need to add the same value \(v\) twice, at \((i, j)\) and \((j, i)\).
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(3, 3);
  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(1, 1, 3);
  lilmat.insert(2, 1, -0.5);
  lilmat.insert(2, 2, 4);

  lilmat.set_property(MatrixSymmetry::symmetric, MatrixStorage::lower);

  lilmat.display("lilmat");
  lilmat.to_dense().display("lilmat converted to dense");

  return 0;
}
lilmat
m: 3 n: 3 nnz: 5
Symmetry: symmetric; Storage: lower
Column 0
i    v
0    4
1    -1
Column 1
i    v
1    3
2    -0.5
Column 2
i    v
2    4
lilmat converted to dense
m: 3 n: 3
Symmetry: symmetric; Storage: lower
 4     -      -
-1      3     -
 0     -0.5   4
Here we filled only the lower triangular part of the matrix. We can indicate with set_property(MatrixSymmetry::symmetric) that the matrix is symmetric, and with lilmat.set_property(MatrixStorage::lower) that only the lower part of the matrix is stored. These methods can be called on any type of matrix in Compose, and with any kind of matrix property. As shown in the example, it's possible to set several properties at once when calling set_property.
Notice that for the DenseMatrix constructed at the end, the conversion propagates the properties, and only the lower triangular part of the matrix is displayed. The other coefficients, whatever their values, won't be used in any operation performed with the matrix.
Remark: some shortcuts exist to fill the properties of a matrix. Here in particular, we have a symmetric positive definite matrix, so we could (should?) have used lilmat.set_spd(MatrixStorage::lower) to set all the properties at the same time, as in the sketch below.
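A minimal sketch using this shortcut (the same set_spd call appears in the sparse conjugate gradient example later in this tutorial):

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(3, 3);
  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(1, 1, 3);
  lilmat.insert(2, 1, -0.5);
  lilmat.insert(2, 2, 4);

  // One call declares the matrix SPD and stored as its lower triangle,
  // instead of setting each property separately with set_property.
  lilmat.set_spd(MatrixStorage::lower);

  lilmat.display("lilmat");
  return 0;
}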
2.3. Dealing with duplicates
What happens when we add two values at the same location \((i, j)\)? Well, it's entirely up to you. The default behavior is to sum the two values, but a lambda function can be passed to override this. The function takes two parameters: the first one is a reference to the value already stored in the matrix, and the second is a const reference to the value you are inserting.
Let's see an example:
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);

  std::cout << "Default behavior: add coefficients\n";
  std::cout << "Insert (0, 0) -> 1\n";
  lilmat.insert(0, 0, 1);
  std::cout << "Insert (0, 0) -> 0.5\n";
  lilmat.insert(0, 0, 0.5);
  lilmat.to_dense().display("lilmat converted to dense");

  auto replace_coeff = [](double& present_v, const double& insert_v){
    present_v = insert_v;
  };
  std::cout << "\nNew behavior: replace coefficients\n";
  std::cout << "Insert (0, 1) -> 1\n";
  lilmat.insert(0, 1, 1, replace_coeff);
  std::cout << "Insert (0, 1) -> 0.5\n";
  lilmat.insert(0, 1, 0.5, replace_coeff);
  lilmat.to_dense().display("lilmat updated 1");

  auto mean_coeff = [](double& present_v, const double& insert_v){
    present_v = (present_v + insert_v) / 2;
  };
  std::cout << "\nNew behavior: compute the mean of the coefficients\n";
  std::cout << "Insert (1, 1) -> 1\n";
  lilmat.insert(1, 1, 1, mean_coeff);
  std::cout << "Insert (1, 1) -> 0.5\n";
  lilmat.insert(1, 1, 0.5, mean_coeff);
  lilmat.to_dense().display("lilmat updated 2");

  return 0;
}
Default behavior: add coefficients
Insert (0, 0) -> 1
Insert (0, 0) -> 0.5
lilmat converted to dense
m: 2 n: 2
Symmetry: general; Storage: full
1.5   0
0     0

New behavior: replace coefficients
Insert (0, 1) -> 1
Insert (0, 1) -> 0.5
lilmat updated 1
m: 2 n: 2
Symmetry: general; Storage: full
1.5   0.5
0     0

New behavior: compute the mean of the coefficients
Insert (1, 1) -> 1
Insert (1, 1) -> 0.5
lilmat updated 2
m: 2 n: 2
Symmetry: general; Storage: full
1.5   0.5
0     0.75
Remark: after the construction of your matrix, you may have zero or nearly zero values stored in it. You can remove them with the .drop() method. Optionally, this method takes a real value tolerance, and all values lower than this tolerance in absolute value will be removed. For example, you can write lilmat.drop(1e-12), as in the sketch below.
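A minimal sketch of this cleanup, assuming (as described above) that drop removes all stored values whose absolute value is below the tolerance:

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);
  lilmat.insert(0, 0, 1);
  lilmat.insert(1, 0, 1e-15); // numerical noise we want to get rid of
  lilmat.insert(1, 1, 2);

  lilmat.drop(1e-12); // removes the (1, 0) entry

  lilmat.display("lilmat after drop");
  return 0;
}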
2.4. Conversion to other matrix types
At present, the LIL format cannot be used for any operation other than insertion. Once you are satisfied with the LIL matrix you have built, you can convert it to the format you want to use: DenseMatrix, SparseMatrixCOO or SparseMatrixCSC.
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixLIL.hpp>

int main(){
  using namespace maphys;

  SparseMatrixLIL<double> lilmat(2, 2);
  lilmat.insert(0, 0, 4);
  lilmat.insert(1, 0, -1);
  lilmat.insert(1, 1, 3);
  lilmat.display("lilmat");

  const SparseMatrixCOO<double> coomat(lilmat.to_coo());
  coomat.display("coomat");

  const SparseMatrixCSC<double> cscmat(lilmat.to_csc());
  cscmat.display("cscmat");

  const DenseMatrix<double> densemat(lilmat.to_dense());
  densemat.display("densemat");

  return 0;
}
lilmat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
Column 0
i    v
0    4
1    -1
Column 1
i    v
1    3
coomat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
i    j    v
0    0    4
1    0    -1
1    1    3
cscmat
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
Column 0
i    v
0    4
1    -1
Column 1
i    v
1    3
densemat
m: 2 n: 2
Symmetry: general; Storage: full
 4    0
-1    3
Remark: notice that the header file SparseMatrixLIL includes the headers of the other types of matrices (DenseMatrix, SparseMatrixCOO and SparseMatrixCSC).
2.5. COO and CSC formats
While constructing a SparseMatrixLIL by adding the coefficients one by one can be convenient, it is not optimal if you already know the number of non-zero values beforehand. In that case it is simpler to pass the \(i\), \(j\) and \(v\) arrays directly to build the COO or CSC matrix.
Several constructors are available.
#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

int main(){
  using namespace maphys;

  const int M = 2;   // Number of rows
  const int N = 2;   // Number of columns
  const int NNZ = 3; // Number of non-zero values

  // Pass 3 initializer lists i, j and v, then dimensions M and N
  SparseMatrixCOO<double> coomat_1({0, 0, 1}, {0, 1, 1}, {0.5, 1.5, 2}, M, N);
  coomat_1.display("coomat build with initializer lists");

  // Pass dimensions M, N then NNZ and 3 pointers to the arrays for i, j and v
  // The arrays are not destroyed after the operation: values are copied
  int v_i[] = {0, 0, 1};
  int v_j[] = {0, 1, 1};
  double v_v[] = {0.5, 1.5, 2};
  SparseMatrixCOO<double> coomat_2(M, N, NNZ, v_i, v_j, v_v);
  coomat_2.display("coomat build from pointers");

  // Use maphys++ IndexArray class (an extended std::vector). We move the values
  // into the SparseMatrixCOO class; the IndexArray are empty after the
  // operation but no copy is performed
  IndexArray<int> ia_i{0, 0, 1};
  IndexArray<int> ia_j{0, 1, 1};
  IndexArray<double> ia_v{0.5, 1.5, 2};
  SparseMatrixCOO<double> coomat_3(M, N, NNZ, std::move(ia_i), std::move(ia_j), std::move(ia_v));
  coomat_3.display("coomat build with IndexArray");

  return 0;
}
coomat build with initializer lists
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
i    j    v
0    0    0.5
0    1    1.5
1    1    2
coomat build from pointers
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
i    j    v
0    0    0.5
0    1    1.5
1    1    2
coomat build with IndexArray
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
i    j    v
0    0    0.5
0    1    1.5
1    1    2
The same example can be written with SparseMatrixCSC instead of SparseMatrixCOO. We do not describe here how the csc format works, but in our case the arrays for \(i\) and \(v\) are the same as the ones for the coordinate format, only the \(j\) array (of size \(N + 1\)) is different.
#include <iostream>
#include <vector>

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixCSC.hpp>

int main(){
  using namespace maphys;

  const int M = 2;   // Number of rows
  const int N = 2;   // Number of columns
  const int NNZ = 3; // Number of non-zero values

  IndexArray<int> ia_i{0, 0, 1};
  IndexArray<int> ia_j{0, 1, 3};
  IndexArray<double> ia_v{0.5, 1.5, 2};
  SparseMatrixCSC<double> cscmat(M, N, NNZ, std::move(ia_i), std::move(ia_j), std::move(ia_v));
  cscmat.display("csc build with IndexArray");

  return 0;
}
csc build with IndexArray
m: 2 n: 2 nnz: 3
Symmetry: general; Storage: full
Column 0
i    v
0    0.5
Column 1
i    v
0    1.5
1    2
Remark: these formats also support conversion, like the LIL matrices, with the methods .to_coo(), .to_csc() and .to_dense(), as sketched below.
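A minimal sketch of such a conversion, reusing the COO constructor and conversion methods shown above:

#include <maphys.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>
#include <maphys/loc_data/SparseMatrixCSC.hpp>

int main(){
  using namespace maphys;

  SparseMatrixCOO<double> coomat({0, 1, 1}, {0, 0, 1}, {4, -1, 3}, 2, 2);

  const SparseMatrixCSC<double> cscmat(coomat.to_csc());
  cscmat.display("coomat converted to csc");

  const DenseMatrix<double> densemat(coomat.to_dense());
  densemat.display("coomat converted to dense");

  return 0;
}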
2.6. Solving the system
Now let's take a look at the system we wanted to solve: \(A\) is a symmetric positive definite matrix with 7 non-zero values (5 if we use half storage). Let's build this matrix as a COO sparse matrix in lower storage and use the conjugate gradient to solve the system:
#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

int main(){
  using namespace maphys;

  SparseMatrixCOO<double> A({0, 1, 1, 2, 2}, {0, 0, 1, 1, 2}, {4, -1, 3, -0.5, 4}, 3, 3);
  A.set_spd(MatrixStorage::lower);

  Vector<double> u{1, 2, 3};
  Vector<double> b = A * u;

  ConjugateGradient<SparseMatrixCOO<double>, Vector<double>> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  Vector<double> x = solver * b;
  x.display("x");

  return 0;
}
Starting Conjugate Gradient.
Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 11.7154
---
0 - ||R||/||B|| 1
1 - ||R||/||B|| 0.248805
2 - ||R||/||B|| 0.048057
3 - ||R||/||B|| 2.6488e-18
||B-AX||/||B|| 0
x
m: 3 (vector)
1
2
3
If we look at the modifications compared to the dense version, we only build a SparseMatrixCOO instead of a DenseMatrix, and we indicate to the ConjugateGradient solver that our operator \(A\) is given in the SparseMatrixCOO matrix format. The rest of the code did not change.
One could write very generic code using the C++ tools: templates, decltype, auto…
#include <maphys.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/loc_data/SparseMatrixCOO.hpp>

using namespace maphys;

template<typename Mat, typename Vect>
void solve_with_cg(const Mat& A, const Vect& b){
  ConjugateGradient<Mat, Vect> solver;
  solver.setup(parameters::A{A},
               parameters::verbose{true},
               parameters::max_iter{20},
               parameters::tolerance{1e-8});

  auto x = solver * b;
  x.display("x");
}

int main(){
  SparseMatrixCOO<double> A_coo({0, 1, 1, 2, 2}, {0, 0, 1, 1, 2}, {4, -1, 3, -0.5, 4}, 3, 3);
  A_coo.set_spd(MatrixStorage::lower);

  const Vector<double> u{1, 2, 3};
  const Vector<double> b = A_coo * u;

  std::cout << "Solve CG with COO matrix\n";
  solve_with_cg(A_coo, b);

  const DenseMatrix<double> A_dense = A_coo.to_dense();
  std::cout << "Solve CG with dense matrix\n";
  solve_with_cg(A_dense, b);

  return 0;
}
Solve CG with COO matrix
Starting Conjugate Gradient.
Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 11.7154
---
0 - ||R||/||B|| 1
1 - ||R||/||B|| 0.248805
2 - ||R||/||B|| 0.048057
3 - ||R||/||B|| 2.6488e-18
||B-AX||/||B|| 0
x
m: 3 (vector)
1
2
3
Solve CG with dense matrix
Starting Conjugate Gradient.
Max iterations: 20
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 11.7154
---
0 - ||R||/||B|| 1
1 - ||R||/||B|| 0.248805
2 - ||R||/||B|| 0.048057
3 - ||R||/||B|| 2.6488e-18
||B-AX||/||B|| 0
x
m: 3 (vector)
1
2
3
3. A domain decomposition example
Introduction
In this tutorial we shall compile and run a small example of Compose on the plafrim supercomputer or on a simple Unix laptop.
3.1. Environment setup
3.1.1. guix setup
The simplest way to install Compose is to use guix. This piece of software is available on plafrim and can be easily installed on a laptop. The Compose package is distributed in the guix-hpc repository. You may also be interested in the guix-hpc-non-free repository, which gives access to additional non-open-source packages such as the Intel MKL.
To check that your installation is correct, the following command should give you an output:
guix search maphys++
and you should be able to find the Compose package. If it is not the case, refer to the documentation of the repositories linked above.
You can then check that the Compose package compiles:
guix build maphys++
3.1.2. guix environment
To obtain a clean environment with Compose, we launch a bash in a pure guix shell:
guix shell --pure maphys++
In this environment, only Compose and its dependencies are available. Furthermore, environment variables are set so that every library can be easily found by a compilation tool such as CMake (you can check the values of C_INCLUDE_PATH, LIBRARY_PATH, PKG_CONFIG_PATH…).
In particular, you will find the C++ headers of Compose:
# NB: add coreutils to use "ls"
guix shell --pure maphys++ coreutils
ls $GUIX_ENVIRONMENT/include/maphys
- Add other packages in the environment
When using guix shell --pure maphys++, we obtain a very clean environment, containing only maphys++ and its required dependencies. Let's add some other packages necessary to modify and compile the code. Feel free to add other packages you need (your favorite text editor for instance). The following line should provide all the necessary tools to compile.

guix shell --pure maphys++ coreutils emacs cmake make pkg-config gcc-toolchain@12.3 gfortran-toolchain
- Using MKL
If you have guix-hpc-non-free set up and you want to use the Intel MKL, you need to tell guix to replace openblas (the default provider for blas and lapack) with mkl. MUMPS, a dependency of Compose, also has a specific package for this purpose, as does the Python package numpy.

guix shell --pure maphys++ coreutils emacs cmake \
  make pkg-config gcc-toolchain@12.3 gfortran-toolchain \
  --with-input=mumps-openmpi=mumps-mkl-openmpi \
  --with-input=openblas=mkl \
  --with-input=python-numpy=python-numpy-mkl
3.2. Compile an example
3.2.1. Example description
- Domain decomposition
In the example presented, the adjacency graph of the global matrix consists of 7 vertices, which are assigned to the following subdomains:
Node | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Subdomain(s) | 0, 2 | 0 | 0, 1 | 0, 1, 2 | 1 | 1, 2 | 2 |

The MPI ranks are numbered from 0 to \(N\) - 1 consistently with the default MPI numbering. However, subdomains are numbered from 1 to \(N\) to stay consistent with previous chapters where index 0 is reserved for the coarse space. In the code, the subdomain \(\Omega_i\) is therefore identified by its unique ID \(i\). A local numbering of the unknowns is used in each subdomain: in the example, the local indices [0, 1, 2, 3] in \(\Omega_0\) correspond to the same global indices [0, 1, 2, 3]. The global indices corresponding to local indices [0, 1, 2, 3] of \(\Omega_1\) and \(\Omega_2\) are [3, 2, 4, 5] and [5, 6, 0, 3], respectively. Each subdomain has to know which unknowns it shares with which other subdomains (dotted lines in figure (c)). In our example, for instance, in local ordering:
- \(\Omega_0\) shares [2, 3] with \(\Omega_1\) and [0, 3] with \(\Omega_2\).
- \(\Omega_1\) shares [1, 0] with \(\Omega_0\) and [0, 3] with \(\Omega_2\).
- \(\Omega_2\) shares [2, 3] with \(\Omega_0\) and [3, 0] with \(\Omega_1\).
The global domain in (a) is partitioned into three subdomains \(\Omega_0\), \(\Omega_1\) and \(\Omega_2\) in (b). Each subdomain can use a local ordering of its unknowns as in (c); duplicate unknowns in different subdomains are linked by a dotted line.
The interior unknowns can be eliminated to keep only the interface unknowns that are shared by two subdomains or more (a). A global interface numbering is introduced (b). The global interface can be expressed as the union of the three local interfaces \(\Gamma_0\), \(\Gamma_1\) and \(\Gamma_2\) in (c).
- Input matrix
The system we are solving is \(K u = f\), where \(K\) is a partitioned sparse SPD matrix, \(u\) the unknown vector and \(f\) the right hand side.
For the sake of simplicity, we put the same local matrix on each subdomain. Notice that the sparse structure of the matrix corresponds to the domain decomposition described above.
Local matrices: \[\begin{pmatrix} 3 &-1 & 0 &-1 \\ -1 & 4 &-1 &-1 \\ 0 &-1 & 3 &-1 \\ -1 &-1 &-1 & 4 \end{pmatrix}\]
Which results in this global matrix:
Global matrix: \[K = \begin{pmatrix} 6 & -1 & 0 & -2 & 0 & 0 & -1 \\ -1 & 4 & -1 & -1 & 0 & 0 & 0 \\ 0 & -1 & 7 & -2 & -1 & -1 & 0 \\ -2 & -1 & -2 & 11 & 0 & -2 & -1 \\ 0 & 0 & -1 & 0 & 3 & -1 & 0 \\ 0 & 0 & -1 & -2 & -1 & 7 & -1 \\ -1 & 0 & 0 & -1 & 0 & -1 & 4 \end{pmatrix}\]
The solution vector we expect to find is simply: \[u = \begin{pmatrix} 0 \\ 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{pmatrix}\]
Thus, we shall use \(f = K u\) for the right hand side vector:
Global vector: \[f = \begin{pmatrix} -13 \\ -1 \\ -2 \\ 12 \\ 5 \\ 17 \\ 16 \end{pmatrix}\]
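As a quick sanity check, applying the first row of \(K\) to \(u\) gives \[f_1 = 6 \cdot 0 - 1 \cdot 1 - 2 \cdot 3 - 1 \cdot 6 = -13,\] which matches the first entry of \(f\) above.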
The vectors are always assembled. So the local right hand side vectors are respectively:
\[\begin{pmatrix} -13 \\ -1 \\ -2 \\12 \end{pmatrix}, \begin{pmatrix} 12 \\ -2 \\ 5 \\ 17 \end{pmatrix} \text{, and } \begin{pmatrix} 17 \\ 16 \\ -13 \\ 12 \end{pmatrix}\]
3.2.2. Example code
First we need an example of Compose usage. Let's take the example in src/examples/simple_part.cpp of the Compose repository:
#include <iostream>
#include <vector>
#include <mpi.h>

#include <maphys.hpp>
#include <maphys/part_data/PartMatrix.hpp>
#include <maphys/solver/Pastix.hpp>
#include <maphys/solver/ConjugateGradient.hpp>
#include <maphys/solver/PartSchurSolver.hpp>

int main(){
  using namespace maphys;
  using Scalar = double;

  MMPI::init();
  {
    const int rank = MMPI::rank();
    const int M = 4;
    const int NNZ = 9;
    const int N_DOFS = 4;
    const int N_SUBDOMAINS = 3;

    // Bind subdomains to MPI process
    std::shared_ptr<Process> p = bind_subdomains(N_SUBDOMAINS);

    // Define topology for each subdomain
    std::vector<Subdomain> sd;
    if(p->owns_subdomain(0)){
      std::map<int, IndexArray<int>> nei_map_0 {{1, {2, 3}}, {2, {0, 3}}};
      sd.emplace_back(0, N_DOFS, std::move(nei_map_0));
    }
    if(p->owns_subdomain(1)){
      std::map<int, IndexArray<int>> nei_map_1 {{0, {1, 0}}, {2, {0, 3}}};
      sd.emplace_back(1, N_DOFS, std::move(nei_map_1));
    }
    if(p->owns_subdomain(2)){
      std::map<int, IndexArray<int>> nei_map_2 {{0, {2, 3}}, {1, {3, 0}}};
      sd.emplace_back(2, N_DOFS, std::move(nei_map_2));
    }
    p->load_subdomains(sd);

    // Local matrices (here the same matrix on every subdomain)
    std::vector<int> a_i{0, 0, 0, 1, 1, 1, 2, 2, 3};
    std::vector<int> a_j{0, 1, 3, 1, 2, 3, 2, 3, 3};
    std::vector<Scalar> a_v{3, -1, -1, 4, -1, -1, 3, -1, 4};
    SparseMatrixCOO<Scalar, int> A_loc(M, M, NNZ, a_i.data(), a_j.data(), a_v.data());
    A_loc.set_spd(MatrixStorage::upper);

    // Create the partitioned matrix
    std::map<int, SparseMatrixCOO<Scalar, int>> A_map;
    for(int k = 0; k < 3; ++k)
      if(p->owns_subdomain(k)) A_map[k] = A_loc;
    const PartMatrix<SparseMatrixCOO<Scalar>> A(p, std::move(A_map));

    // Local rhs vectors
    std::map<int, Vector<Scalar>> b_map;
    if(p->owns_subdomain(0)) b_map[0] = Vector<Scalar>{-13, -1, -2, 12};
    if(p->owns_subdomain(1)) b_map[1] = Vector<Scalar>{12, -2, 5, 17};
    if(p->owns_subdomain(2)) b_map[2] = Vector<Scalar>{17, 16, -13, 12};

    // Create the partitioned vector
    const PartVector<Vector<Scalar>> b(p, std::move(b_map));

    // Solver definitions
    using Solver_K = Pastix<SparseMatrixCOO<Scalar>, Vector<Scalar>>;
    using Solver_S = ConjugateGradient<PartMatrix<DenseMatrix<Scalar>>, PartVector<Vector<Scalar>>>;
    using Solver = PartSchurSolver<PartMatrix<SparseMatrixCOO<Scalar>>, PartVector<Vector<Scalar>>, Solver_K, Solver_S>;

    // Setup
    Solver schur_solver;
    schur_solver.setup(A);

    // Set parameters for the solver S (ConjugateGradient)
    Solver_S& iter_solver = schur_solver.get_solver_S();
    iter_solver.setup(parameters::max_iter{50},
                      parameters::tolerance{1e-8},
                      parameters::verbose{(rank == 0)});

    // Solve
    auto X = schur_solver * b;
    X.display_centralized("X found");

    if(rank == 0) std::cout << "Niter: " << iter_solver.get_n_iter() << '\n';
  }
  MMPI::finalize();

  return 0;
}
It is an example that solves a distributed sparse linear system on 3 subdomains, using Pastix to compute the Schur complement and a conjugate gradient to solve the reduced interface system.
3.2.3. Compilation
Now let's create a directory tutorial and copy this file simple_part.cpp inside:
mkdir tutorial && cd tutorial
# create or copy simple_part.cpp
In order to get Compose and all its dependencies, we use the following CMakeLists.txt file:
cmake_minimum_required(VERSION 3.12)
project(MAPHYS_EXAMPLE CXX C Fortran)

find_package(maphyspp REQUIRED)

# compile your example
add_executable(simple_part simple_part.cpp)

# link to maphyspp
target_link_libraries(simple_part PRIVATE MAPHYSPP::maphyspp)
Now we can compile our example in the standard CMake way:
mkdir build && cd build
cmake ..
make
3.2.4. Running the example
Let's run the example. It is written so that it can be run with any number of MPI processes. However, the optimal number of processes is one per subdomain, which is 3 in this example:
mpirun -np 3 ./simple_part
The output should look like:
Starting Conjugate Gradient.
Max iterations: 50
Stopping criterion ||R||/||B|| < 1e-08 with ||B|| = 29.116
---
0 - ||R||/||B|| 0.97707
1 - ||R||/||B|| 0.441184
2 - ||R||/||B|| 0.220872
3 - ||R||/||B|| 0.0646697
4 - ||R||/||B|| 2.80055e-17
||B-AX||/||B|| 1.49027e-16
X found
m: 7 (vector)
2.22045e-16
1
2
3
4
5
6
Niter: 4
4. Dense matrices
Introduction
In this tutorial we show an example of dense matrix operations on a laptop and on the supercomputer plafrim, using either Blas/Lapack (e.g. Intel MKL, OpenBLAS) or Chameleon. The Compose class used is DenseMatrix, for which there are associated kernels (scal, axpy, norm2, gemm, symm, eig, svd, etc.) and solvers (getrf/s, potrf/s, geqrf/s, gelqf/s, trsm). Note that operations on dense matrices are limited to a single machine node for now (no distributed MPI), but with Chameleon it is possible to exploit GPUs in addition to CPUs.
4.1. Source code
The source code is a C++ file demo_dense.cpp which relies on Compose to
- perform a dense matrix multiplication \(C = A * B\),
- solve a linear system \(A * x = b\) with \(A\) symmetric positive definite (SPD) using Cholesky factorization \(A=LL^T\),
- solve an overdetermined system \(A * x = b\) with \(A\) rectangular using least squares and a factorization \(A=QR\).
The Chameleon C library is also used to generate random matrices, cf. CHAMELEON_dplrnt (general matrix) and CHAMELEON_dplgsy (SPD matrix). The Chameleon context must be initialized with Chameleon chameleon;. The number of CPUs and GPUs to be used by Chameleon can be chosen with Chameleon::ncpus and Chameleon::ngpus. The size of the sub-blocks (tiles), on which sequential kernels operate, can be set with Chameleon::tile_size. Those parameters have to be set before the Chameleon initialization. If they are not set, default values are chosen internally: all CPUs, no GPUs and a tile size of 320.
#include <maphys.hpp>

using namespace maphys;

int main(int argc, char *argv[]) {
  using Scalar = double;

  /* Initialize Chameleon context */
  Chameleon::ncpus = 4;
  Chameleon::ngpus = 0;
  Chameleon::tile_size = 640;
  Chameleon chameleon;

  /* Matrix product GEMM: C=AB */
  // Fill the matrix with random values
  int M = 1000;
  DenseMatrix<Scalar> A(M, M);
  DenseMatrix<Scalar> B(M, M);
  DenseMatrix<Scalar> C(M, M);
  int seedA = 10;
  CHAMELEON_dplrnt( M, M, get_ptr(A), M, seedA );
  int seedB = 20;
  CHAMELEON_dplrnt( M, M, get_ptr(B), M, seedB );

  // Compute the product (GEMM)
  chameleon_kernels::gemm(A, B, C);
  // or one can directly write: C = A * B;

  /* Linear system solve (Cholesky): Ax=b */
  // Fill the matrix A with random values so that it is square and SPD
  CHAMELEON_dplgsy( (Scalar)M, ChamLower, M, get_ptr(A), M, seedA );
  A.set_spd(MatrixStorage::full);
  Vector<Scalar> b(M);
  CHAMELEON_dplrnt( M, 1, get_ptr(b), M, seedB );

  // Declare the Solver
  ChameleonSolver<DenseMatrix<Scalar>, Vector<Scalar>> SCcham;
  // if A is given it is copied in the solver class
  // giving A as rvalue allows to avoid copy
  SCcham.setup(std::move(A));

  // Solve the linear system
  Vector<Scalar> x = SCcham * b;
  // Can also call SCcham.factorize(); to perform factorization before solving

  /* Overdetermined system solve (QR): Ax=b */
  // clear previous A and realloc
  int N = 500;
  A = DenseMatrix<Scalar>(M, N);
  // Fill the rectangular matrix A with random values
  CHAMELEON_dplrnt( M, N, get_ptr(A), M, seedA );

  // Declare the solver
  ChameleonSolver<DenseMatrix<Scalar>, Vector<Scalar>> SLcham;
  SLcham.setup(std::move(A));

  // Solve the least squares problem (reusing the x declared above)
  x = SLcham * b;

  return 0;
}
A more complete version of this example can be found in src/examples/demo_dense.cpp of the Compose repository, with a CPU time comparison between Blas/Lapack and Chameleon.
There are of course other routines available to perform dense matrix operations, but there is not always a Chameleon implementation of each Blas/Lapack function.
We summarize the Blas/Lapack operations available in Compose through the kernel calls in the following table. All routines can be called from blas_kernels:: or chameleon_kernels::, but if there is no Chameleon implementation the call falls back to the blas/lapack kernel. The X symbol in the table means there is a Chameleon implementation which can accelerate this operation.
Operation | Chameleon |
---|---|
\(\text{scal}\) | |
\(\text{axpy}\) | |
\(\text{dot}\) | |
\(\text{norm2}\) | |
\(\text{gemv}\) | |
\(\text{gemm}\) | X |
\(\text{symm}\) | X |
\(\text{hemm}\) | X |
\(\text{eig}\) | |
\(\text{eigvals}\) | |
\(\text{eigh}\) | |
\(\text{eigvalsh}\) | |
\(\text{eigh_select}\) | |
\(\text{eigh_smallest}\) | |
\(\text{eigh_largest}\) | |
\(\text{gen_eigh_select}\) | |
\(\text{gen_eigh_smallest}\) | |
\(\text{gen_eigh_largest}\) | |
\(\text{svd}\) | |
\(\text{svd_left_sv}\) | |
\(\text{svd_right_sv}\) | |
\(\text{svdvals}\) |
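For instance, the gemm kernel can be called explicitly through the Blas/Lapack backend. A minimal sketch, assuming blas_kernels::gemm takes the same (A, B, C) arguments as the chameleon_kernels::gemm call in the example above:

#include <maphys.hpp>

int main(){
  using namespace maphys;

  // 2x2 identity times B: C should equal B.
  DenseMatrix<double> A({1, 0, 0, 1}, 2, 2);
  DenseMatrix<double> B({1, 2, 3, 4}, 2, 2);
  DenseMatrix<double> C(2, 2);

  // C = A * B through the Blas/Lapack backend (argument order assumed
  // identical to the chameleon_kernels::gemm call shown earlier).
  blas_kernels::gemm(A, B, C);
  C.display("C");
  return 0;
}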
We summarize the solver routines available in Compose through the solver classes (BlasSolver, ChameleonSolver) in the following table.
Operation | Chameleon |
---|---|
LU (\(\text{getrf/s}\)) | |
Cholesky (\(\text{potrf/s}\)) | X |
Symmetric complex (\(\text{sytrf/s}\)) | X |
Least Squares (\(\text{geqrf/s}\)) | X |
Minimum norm (\(\text{gelqf/s}\)) | X |
Triangular Solve (\(\text{trsm}\)) | X |
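As a small sketch of these solver classes, here is an SPD system solved with the Blas/Lapack backend; BlasSolver is assumed to follow the same setup / operator* interface as the ChameleonSolver in the example above:

#include <maphys.hpp>

int main(){
  using namespace maphys;
  using Scalar = double;

  // Small SPD matrix taken from the first part of the tutorial.
  DenseMatrix<Scalar> A({ 4, -1,  0,
                         -1,  3, -0.5,
                          0, -0.5, 4}, 3, 3);
  A.set_spd(MatrixStorage::full);
  Vector<Scalar> b{2, 3.5, 11};

  // BlasSolver interface assumed analogous to ChameleonSolver.
  BlasSolver<DenseMatrix<Scalar>, Vector<Scalar>> solver;
  solver.setup(A);               // A is copied into the solver

  Vector<Scalar> x = solver * b; // factorize and solve
  x.display("x");                // expect (1, 2, 3)
  return 0;
}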
4.2. Execution on a Linux Ubuntu laptop
4.2.1. Install Compose with CMake
Install system package requirements
sudo apt-get update -y
sudo apt-get install -y git cmake build-essential gfortran python wget tar curl pkg-config libmkl-dev
Notice one can replace libmkl-dev with libopenblas-dev.
Install specific packages with CMake
# install blaspp, lapackpp and arpack-ng
git clone https://github.com/icl-utk-edu/blaspp.git
git clone https://github.com/icl-utk-edu/lapackpp.git
git clone https://github.com/opencollab/arpack-ng.git

cd blaspp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j5
sudo cmake --install build
cd ../

cd lapackpp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build -j5
sudo cmake --install build
cd ../

cd arpack-ng
cmake -B build -DICB=ON
cmake --build build -j5
sudo cmake --install build
cd ../

# Install Starpu and Chameleon
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
./configure --prefix=/usr/local --disable-build-doc --disable-build-test --disable-opencl --disable-fortran --disable-socl --disable-starpufft
make -j5
sudo make install
cd ../

git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
cmake -B build -DCHAMELEON_USE_MPI=OFF -DCMAKE_INSTALL_PREFIX=/usr/local -DBUILD_SHARED_LIBS=ON
# if starpu is installed in a specific path add the following to the configuration: -DCMAKE_PREFIX_PATH="path to starpu"
cmake --build build -j5
sudo cmake --install build
cd ../
Install Compose
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr/local -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build -j5
sudo cmake --install build
4.2.2. Compile and run the example
To link a C++ executable with Compose from a file demo_dense.cpp, the straightforward way is to use CMake. Here is an example of a CMakeLists.txt file:
cmake_minimum_required(VERSION 3.5)
project(DEMO_DENSE CXX C Fortran)

find_package(maphyspp REQUIRED)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

add_executable(demo_dense demo_dense.cpp)
target_link_libraries(demo_dense PRIVATE MAPHYSPP::maphyspp)
target_compile_definitions(demo_dense PRIVATE MAPHYSPP_USE_CHAMELEON)
Then, in the same directory where the demo_dense.cpp and CMakeLists.txt files are located, use the cmake command as follows:
cmake -B build
# if dependencies are installed in specific directories, add them to the configuration: -DCMAKE_PREFIX_PATH="path to chameleon;path to starpu"
cmake --build build

# Run the example
./build/demo_dense
4.3. Execution using GNU Guix
We assume Guix is up to date and that guix-hpc-non-free is set up. We use the complete example here, located in src/examples/demo_dense.cpp. Let's compare the CPU times given by Lapack (Intel MKL here) and Chameleon using four threads, i.e. four CPUs.
Compile the example using Guix
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
guix shell --pure -D maphys++-mkl -- /bin/bash --norc
cmake -B build-guix -DMAPHYSPP_COMPILE_EXAMPLES=ON -DMAPHYSPP_COMPILE_TESTS=OFF -DMAPHYSPP_DRIVERS=OFF -DMAPHYSPP_C_DRIVER=OFF -DMAPHYSPP_Fortran_DRIVER=OFF -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build-guix --target mpp_demo_dense
Arguments which can be used with the executable:
- 0 for Blas/Lapack (blaspp/lapackpp+Intel MKL MT), 1 for Chameleon (+Intel MKL sequential)
- Size M (size of the GEMM and linear system, number of rows and columns)
- Size N (number of columns of the rectangular matrix during QR and least squares problem)
- Chameleon block/tile size
- Chameleon number of CPU workers
- Chameleon number of GPU workers
Execution with Blas/Lapack from the MKL using 4 threads
MKL_NUM_THREADS=4 ./build-guix/src/examples/mpp_demo_dense 0 10000 5000
Example results
Least squares system time, 1, 4.74672
Linear system (Cholesky) time, 1, 2.64744
Gemm time, 1, 14.4328
Execution with Chameleon using 4 threads
MKL_NUM_THREADS=1 ./build-guix/src/examples/mpp_demo_dense 1 10000 5000 640 4 0
Example results
Least squares system time, 1, 3.77979
Linear system (Cholesky) time, 1, 2.73428
Gemm time, 1, 15.419
Chameleon delivers performance close to MKL's, knowing that Intel MKL is a Blas/Lapack reference on a single machine with Intel CPUs.
Notice that we have set MKL_NUM_THREADS=1 in the Chameleon case since all the parallelism is directly handled by Chameleon, which relies on sequential kernels scheduled by the Starpu runtime system.
4.4. Execution on the Plafrim supercomputer using GPUs
Let's compare the CPU times given by Lapack (Intel MKL here) and Chameleon on a plafrim sirocco node (2x16 cores + 2 P100 GPUs).
Compile the example using Guix
ssh plafrim
git clone --recursive https://gitlab.inria.fr/solverstack/maphys/maphyspp.git
cd maphyspp
guix shell --pure -D maphys++-mkl --with-input=chameleon-mkl-mt-nompi=chameleon-cuda-mkl-mt-nompi slurm -- /bin/bash --norc
cmake -B build-guix -DMAPHYSPP_COMPILE_EXAMPLES=ON -DMAPHYSPP_COMPILE_TESTS=OFF -DMAPHYSPP_DRIVERS=OFF -DMAPHYSPP_C_DRIVER=OFF -DMAPHYSPP_Fortran_DRIVER=OFF -DMAPHYSPP_USE_CHAMELEON=ON -DMAPHYSPP_USE_MUMPS=OFF -DMAPHYSPP_USE_PASTIX=OFF
cmake --build build-guix --target mpp_demo_dense
Notice we replace the default chameleon package with a version built with CUDA to be able to use GPUs.
Reserve one p100 node and expose the location of the CUDA driver in the Guix environment (can be different depending on the supercomputer)
salloc --nodes=1 --time=01:00:00 --constraint p100 --exclusive --ntasks-per-node=1 --threads-per-core=1
export LD_PRELOAD="/usr/lib64/libcuda.so"
Execution with Blas/Lapack from the MKL using 32 threads
export MKL_NUM_THREADS=32
srun -n 1 -c 32 ./build-guix/src/examples/mpp_demo_dense 0 50000 10000
Example results
Least squares system time, 1, 24.2582
Linear system (Cholesky) time, 1, 95.9617
Gemm time, 1, 354.629
Execution with Chameleon using 29 threads for CPU kernels, 2 threads for GPU kernels and 1 thread reserved for Starpu
export MKL_NUM_THREADS=1
srun -n 1 -c 32 ./build-guix/src/examples/mpp_demo_dense 1 50000 10000 1240 29 2
Example results
Least squares system time, 1, 12.3178
Linear system (Cholesky) time, 1, 24.4654
Gemm time, 1, 86.5949
Chameleon outperforms MKL here thanks to the GPU kernels (cuBLAS) it is able to exploit.