Table of Contents

This is the user's guide to Chameleon. It presents the software ecosystem, gives detailed installation instructions, and walks through some usage examples. For more information about the application programming interface, please refer to the Doxygen documentation.

1 Version

This manual documents the usage of Chameleon version 1.3.0. It was last updated on 2024-04-17.

2 Authors

  • Inria
  • University of Tennessee
  • University of Colorado Denver
  • King Abdullah University of Science and Technology

3 Copying

  • Copyright © 2024 Inria
  • Copyright © 2014 The University of Tennessee
  • Copyright © 2014 King Abdullah University of Science and Technology

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer listed in this license in the documentation and/or other materials provided with the distribution.
  • Neither the name of the copyright holders nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

4 Introduction to Chameleon

4.1 MORSE project

chameleon_header.png

Chameleon is linear algebra software created jointly by several research teams as part of the MORSE associate team: ICL, Inria, KAUST, and the University of Colorado Denver.

4.1.1 MORSE Objectives

When processor clock speeds flatlined in 2004, after more than fifteen years of exponential increases, the era of near automatic performance improvements that the HPC application community had previously enjoyed came to an abrupt end. To develop software that will perform well on petascale and exascale systems with thousands of nodes and millions of cores, the list of major challenges that must now be confronted is formidable:

  1. dramatic escalation in the costs of intrasystem communication between processors and/or levels of memory hierarchy;
  2. increased heterogeneity of the processing units (mixing CPUs, GPUs, etc. in varying and unexpected design combinations);
  3. high levels of parallelism and more complex constraints mean that cooperating processes must be dynamically and unpredictably scheduled for asynchronous execution;
  4. software will not run at scale without much better resilience to faults and far more robustness; and
  5. new levels of self-adaptivity will be required to enable software to modulate process speed in order to satisfy limited energy budgets.

The MORSE associate team will tackle the first three challenges by orchestrating work between research groups specialized respectively in sparse linear algebra, dense linear algebra and runtime systems. The overall objective is to develop robust linear algebra libraries relying on innovative runtime systems that can fully benefit from the potential of those future large-scale complex machines. Challenges 4) and 5) will also be investigated by the different teams in the context of other partnerships, but they will not be the main focus of the associate team as they are much more prospective.

4.1.2 Research fields

The overall goal of the MORSE associate team is to enable advanced numerical algorithms to be executed on a scalable unified runtime system for exploiting the full potential of future exascale machines. We expect advances in three directions, based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused but still joint research in both fields.

  1. Fine interaction between linear algebra and runtime systems

    On parallel machines, HPC applications need to take care of data movement and consistency, which can be either explicitly managed at the level of the application itself or delegated to a runtime system. We adopt the latter approach in order to better keep up with hardware trends whose complexity is growing exponentially. One major task in this project is to define a proper interface between HPC applications and runtime systems in order to maximize productivity and expressivity. As mentioned in the next section, a widely used approach consists in abstracting the application as a DAG that the runtime system is in charge of scheduling. Scheduling such a DAG over a set of heterogeneous processing units introduces a lot of new challenges, such as predicting accurately the execution time of each type of task over each kind of unit, minimizing data transfers between memory banks, performing data prefetching, etc. Expected advances: In a nutshell, a new runtime system API will be designed to allow applications to provide scheduling hints to the runtime system and to get real-time feedback about the consequences of scheduling decisions.

  2. Runtime systems

    A runtime environment is an intermediate layer between the system and the application. It provides low-level functionality not provided by the system (such as scheduling or management of the heterogeneity) and high-level features (such as performance portability). In the framework of this proposal, we will work on the scalability of runtime environments. Achieving scalability requires avoiding all centralization. Here, the main problem is the scheduling of the tasks. In many task-based runtime environments the scheduler is centralized and becomes a bottleneck as soon as too many cores are involved. It is therefore necessary to distribute the scheduling decisions, or to compute a data distribution that imposes the mapping of tasks using, for instance, the so-called ``owner-compute'' rule. Expected advances: We will design runtime systems that enable an efficient and scalable use of thousands of distributed multicore nodes enhanced with accelerators.

  3. Linear algebra

    Because of its central position in HPC and of the well understood structure of its algorithms, dense linear algebra has often pioneered new challenges that HPC had to face. Again, dense linear algebra has been in the vanguard of the new era of petascale computing with the design of new algorithms that can efficiently run on a multicore node with GPU accelerators. These algorithms are called ``communication-avoiding'' since they have been redesigned to limit the amount of communication between processing units (and between the different levels of the memory hierarchy). They are expressed through Directed Acyclic Graphs (DAGs) of fine-grained tasks that are dynamically scheduled. Expected advances: First, we plan to investigate the impact of these principles in the case of sparse applications (whose algorithms are slightly more complicated but often rely on dense kernels). Furthermore, both in the dense and sparse cases, the scalability on thousands of nodes is still limited; new numerical approaches need to be found. We will specifically design sparse hybrid direct/iterative methods that represent a promising approach.

4.1.3 Research papers

Research papers about MORSE can be found here.

4.2 Chameleon

4.2.1 Chameleon software

The main purpose of Chameleon is to address the performance shortcomings of the LAPACK and ScaLAPACK libraries on multicore processors and multi-socket systems of multicore processors, and their inability to efficiently utilize accelerators such as Graphics Processing Units (GPUs).

Chameleon is a framework written in C which provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real and complex arithmetic are supported in both single and double precision. It supports Linux and Mac OS X machines (mainly tested on Intel x86-64 and IBM Power architectures).

Chameleon is based on the PLASMA source code but is not limited to shared-memory environments and can exploit multiple GPUs. Chameleon is interfaced in a generic way with the StarPU, PaRSEC, OpenMP and QUARK runtime systems. This feature makes it possible to analyze, in a unified framework, how sequential task-based algorithms behave on top of different runtime system implementations. Using Chameleon with the StarPU runtime system makes it possible to exploit GPUs through kernels provided by cuBLAS, as well as clusters of interconnected nodes with distributed memory (using MPI). Computation of very large systems with dense matrices on a cluster of nodes is still being experimented with and stabilized; do not expect stable performance with the current version when using MPI.
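
As a first glimpse of the library, the sketch below solves a symmetric positive definite system through the LAPACK-like interface. It is a minimal, hedged example: the CHAMELEON_Init, CHAMELEON_dposv and CHAMELEON_Finalize prototypes are assumed from recent releases, and the test matrix is simply made diagonally dominant by hand; refer to the Doxygen documentation for the exact API.

#include <stdlib.h>
#include <chameleon.h>

int main(void)
{
    int N = 1000, NRHS = 1, i;
    double *A = malloc((size_t)N * N    * sizeof(double));
    double *B = malloc((size_t)N * NRHS * sizeof(double));

    /* Build a symmetric, diagonally dominant (hence SPD) matrix in
       column-major order, and a right-hand side full of ones */
    for (i = 0; i < N * N; i++) A[i] = 1.0 / (1.0 + i % N + i / N);
    for (i = 0; i < N; i++)     A[i * N + i] += N;
    for (i = 0; i < N; i++)     B[i] = 1.0;

    CHAMELEON_Init(4, 0);                            /* 4 CPU workers, 0 GPUs */
    CHAMELEON_dposv(ChamLower, N, NRHS, A, N, B, N); /* Cholesky solve: B <- A^{-1} B */
    CHAMELEON_Finalize();

    free(A); free(B);
    return 0;
}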

4.2.2 PLASMA's design principles

Chameleon was originally based on PLASMA, so their design principles are very similar. The content of this section has been copied from the Design principles section of the PLASMA User's Guide.

  1. Tile Algorithms

    Tile algorithms are based on the idea of processing the matrix by square tiles of relatively small size, such that a tile fits entirely in one of the cache levels associated with one core. This way a tile can be loaded to the cache and processed completely before being evicted back to the main memory. Of the three types of cache misses, compulsory, capacity and conflict, the use of tile algorithms minimizes the number of capacity misses, since each operation loads the amount of data that does not ``overflow'' the cache.

    For some operations such as matrix multiplication and Cholesky factorization, translating the classic algorithm to the tile algorithm is trivial. In the case of matrix multiplication, the tile algorithm is simply a product of applying the technique of loop tiling to the canonical definition of three nested loops. It is very similar for the Cholesky factorization. The left-looking definition of Cholesky factorization from LAPACK is a loop with a sequence of calls to four routines: xSYRK (symmetric rank-k update), xPOTRF (Cholesky factorization of a small block on the diagonal), xGEMM (matrix multiplication) and xTRSM (triangular solve). If the xSYRK, xGEMM and xTRSM operations are expressed with the canonical definition of three nested loops and the technique of loop tiling is applied, the tile algorithm results. Since the algorithm is produced by simple reordering of operations, neither the number of operations nor numerical stability of the algorithm are affected.
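
    To make the reordering concrete, below is a minimal sketch of the resulting tile Cholesky algorithm (lower triangular, right-looking variant), written with the sequential LAPACKE/CBLAS kernels named above. The tile storage is an assumption made for brevity: an array of NT x NT tile pointers, each tile being a contiguous NB x NB column-major block.

    #include <cblas.h>
    #include <lapacke.h>

    /* Tile Cholesky of an SPD matrix stored as tiles: A[m + n*NT] points
       to tile (m,n); only the lower triangle is referenced. */
    void tile_dpotrf(int NT, int NB, double **A)
    {
        for (int k = 0; k < NT; k++) {
            /* xPOTRF: factorize the diagonal tile */
            LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', NB, A[k + k*NT], NB);
            /* xTRSM: triangular solves on the tiles below the diagonal */
            for (int m = k+1; m < NT; m++)
                cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                            CblasNonUnit, NB, NB, 1.0, A[k + k*NT], NB,
                            A[m + k*NT], NB);
            /* xSYRK and xGEMM: update of the trailing submatrix */
            for (int n = k+1; n < NT; n++) {
                cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, NB, NB,
                            -1.0, A[n + k*NT], NB, 1.0, A[n + n*NT], NB);
                for (int m = n+1; m < NT; m++)
                    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                                NB, NB, NB, -1.0, A[m + k*NT], NB,
                                A[n + k*NT], NB, 1.0, A[m + n*NT], NB);
            }
        }
    }

    Each kernel call touches only NB x NB tiles, which is precisely what makes every task cache-friendly and independently schedulable.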

    The situation becomes slightly more complicated for LU and QR factorizations, where the classic algorithms factorize an entire panel of the matrix (a block of columns) at every step of the algorithm. One can observe, however, that the process of matrix factorization is synonymous with introducing zeros in appropriate places, and a tile algorithm can be thought of as one that zeroes one tile of the matrix at a time. This process is referred to as updating of a factorization or incremental factorization. The process is equivalent to factorizing the top tile of a panel, then placing the upper triangle of the result on top of the tile below and factorizing again, then moving to the next tile and so on. Here, the tile LU and QR algorithms perform slightly more floating point operations and require slightly more memory for auxiliary data. Also, the tile LU factorization applies a different pivoting pattern and, as a result, is less numerically stable than classic LU with partial pivoting. Numerical stability is not an issue in case of the tile QR, which relies on orthogonal transformations (Householder reflections), which are numerically stable.

    tile_lu.jpg

    Figure 2: Schematic illustration of the tile LU factorization (kernel names for real arithmetic in double precision), courtesy of the PLASMA team.

  2. Tile Data Layout

    Tile layout is based on the idea of storing the matrix by square tiles of relatively small size, such that each tile occupies a contiguous memory region. This way a tile can be loaded to the cache memory efficiently and the risk of evicting it from the cache memory before it is completely processed is minimized. Of the three types of cache misses, compulsory, capacity and conflict, the use of tile layout minimizes the number of conflict misses, since a contiguous region of memory will completely fill out a set-associative cache memory before an eviction can happen. Also, from the standpoint of multithreaded execution, the probability of false sharing is minimized. It can only affect the cache lines containing the beginning and the ending of a tile.

    On standard cache-based architectures, tiles laid out contiguously in memory maximize the benefit of automatic prefetching. Tile layout is also beneficial in situations involving the use of accelerators, where explicit communication of tiles through DMA transfers is required, such as moving tiles between the system memory and the local store of the Cell B.E. or moving tiles between the host memory and the device memory of GPUs. In most circumstances tile layout also minimizes the number of TLB misses and conflicts to memory banks or partitions. With the standard (column-major) layout, access to each column of a tile is much more likely to cause a conflict miss, a false sharing miss, a TLB miss or a bank or partition conflict. The use of the standard layout for dense matrix operations is a performance minefield. Although occasionally one can pass through it unscathed, the risk of hitting a spot deadly to performance is very high.

    Another property of the layout utilized in PLASMA is that it is ``flat'', meaning that it does not involve a level of indirection. Each tile stores a small square submatrix of the main matrix in a column-major layout. In turn, the main matrix is an arrangement of tiles immediately following one another in a column-major layout. The offset of each tile can be calculated through address arithmetic and does not involve pointer indirection. Alternatively, a matrix could be represented as an array of pointers to tiles, located anywhere in memory. Such a layout would be a radical and unjustifiable departure from LAPACK and ScaLAPACK. Flat tile layout is a natural progression from LAPACK's column-major layout and ScaLAPACK's block-cyclic layout.
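
    As an illustration of this address arithmetic, the following sketch (assuming a square matrix of NT x NT tiles of size NB x NB, without padding) locates a tile and an element with pure pointer arithmetic, without any indirection:

    #include <stddef.h>

    /* Start of tile (m,n): tiles follow one another in column-major order */
    static inline double *tile_addr(double *A, int NB, int NT, int m, int n)
    {
        return A + ((size_t)n * NT + m) * NB * NB;
    }

    /* Element (i,j) of the full matrix: locate its tile, then the entry
       inside it (elements are column-major within each tile) */
    static inline double *elem_addr(double *A, int NB, int NT, int i, int j)
    {
        double *tile = tile_addr(A, NB, NT, i / NB, j / NB);
        return tile + (size_t)(j % NB) * NB + (i % NB);
    }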

    Another related property of PLASMA's tile layout is that it includes provisions for padding of tiles, i.e., the actual region of memory designated for a tile can be larger than the memory occupied by the actual data. This makes it possible to force a certain alignment of tile boundaries, while using the flat organization described in the previous paragraph. The motivation is that, at the price of a small memory overhead, alignment of tile boundaries may prove beneficial in multiple scenarios involving memory systems of standard multicore processors, as well as accelerators. The issues that come into play are, again, the use of TLBs and memory banks or partitions.

    tile_layout.jpg

    Figure 3: Schematic illustration of the tile layout with column-major order of tiles, column-major order of elements within tiles and (optional) padding for enforcing a certain alignment of tile boundaries, courtesy of the PLASMA team.

  3. Dynamic Task Scheduling

    Dynamic scheduling is the idea of assigning work to cores based on the availability of data for processing at any given point in time, and is also referred to as data-driven scheduling. The concept is closely related to the idea of expressing computation through a task graph, often referred to as the DAG (Directed Acyclic Graph), and the flexibility of exploring the DAG at runtime. Thus, to a large extent, dynamic scheduling is synonymous with runtime scheduling. An important concept here is the one of the critical path, which defines the upper bound on the achievable parallelism and needs to be pursued at the maximum speed. This is in direct opposition to the fork-and-join or data-parallel programming models, where artificial synchronization points expose serial sections of the code in which multiple cores are idle while sequential processing takes place. The use of dynamic scheduling introduces a trade-off, though. The more dynamic (flexible) the scheduling is, the more centralized (and less scalable) the scheduling mechanism is. For that reason, PLASMA currently uses two scheduling mechanisms, one which is fully dynamic and one where work is assigned statically and dependency checks are done at runtime.

    The first scheduling mechanism relies on unfolding a sliding window of the task graph at runtime and scheduling work by resolving data hazards: Read After Write (RAW), Write After Read (WAR) and Write After Write (WAW), a technique analogous to instruction scheduling in superscalar processors. It also relies on work-stealing for balancing the load among all cores. The second scheduling mechanism relies on statically designating a path through the execution space of the algorithm to each core and following a cycle: transition to a task, wait for its dependencies, execute it, update the overall progress. Tasks are identified by tuples, and task transitions are done through locally evaluated formulas. Progress information can be centralized, replicated or distributed (currently centralized).

    trace_qr.jpg

    Figure 4: A trace of the tile QR factorization executing on eight cores without any global synchronization points (kernel names for real arithmetic in single precision), courtesy of the PLASMA team.

5 Installing Chameleon

Chameleon is written in C and depends on a couple of external libraries that must be installed on the system.

Chameleon can be built and installed on UNIX systems (Linux) by the standard means of CMake. General information about CMake, as well as installation binaries and CMake source code are available from here.

To install a full distribution of Chameleon together with its dependencies, we encourage users to use GNU Guix or Spack.

Finally, some packages are also available for Debian/Ubuntu and Mac OS X.

5.1 Getting Chameleon

The latest official release tarballs of Chameleon sources are available for download from the gitlab tags page.

The latest development state is available on gitlab. You need Git:

git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git

5.2 Prerequisites for installing Chameleon

To install Chameleon's libraries, header files, and executables, one needs:

  • CMake (version 3.3 minimum, 3.17 to compile with the H-Mat support): the build system
  • C and Fortran compilers: GNU compiler suite, Clang, Intel or IBM can be used
  • python: to generate files in the different precisions
  • external libraries: this depends on the configuration, by default the required libraries are
    • runtimes: StarPU or PaRSEC or QUARK or OpenMP
    • kernels: CBLAS, LAPACKE (with TMG). These are C interfaces to the Fortran kernels BLAS and LAPACK. Several providers can be used with Chameleon (Intel MKL, Netlib, OpenBLAS, BLIS/FLAME)

Optional libraries:

These packages must be installed on the system before trying to configure/build chameleon. Please look at the distrib/ directory which gives some hints for the installation of dependencies for Unix systems.

We give here some examples for a Debian system:


# Update Debian packages list
sudo apt-get update
# Install BLAS/LAPACK, can be OpenBLAS, Intel MKL, Netlib LAPACK
sudo apt-get install -y libopenblas-dev liblapacke-dev
# or sudo apt-get install -y libmkl-dev
# or sudo apt-get install -y liblapack-dev liblapacke-dev
# Install OpenMPI
sudo apt-get install -y libopenmpi-dev
# Install StarPU
sudo apt-get install libstarpu-dev

# Optionally, for some specific developments, the following may be installed
# Install hwloc (used by StarPU or QUARK, already a dependency of OpenMPI)
sudo apt-get install -y libhwloc-dev
# Install EZTrace, useful to export some nice execution traces with all runtimes
sudo apt-get install -y libeztrace-dev
# Install FxT, useful to export some nice execution traces with StarPU
sudo apt-get install -y libfxt-dev
# Install CUDA and cuBLAS: only if you have a CUDA-compatible GPU
sudo apt-get install -y nvidia-cuda-toolkit nvidia-cuda-dev
# Install HIP and hipBLAS: only if you have a HIP-compatible AMD GPU
curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/rocm-keyring.gpg
echo 'deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] https://repo.radeon.com/rocm/apt/5.4.2 jammy main' | sudo tee /etc/apt/sources.list.d/rocm.list
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600
sudo apt-get update
sudo apt-get install -y rocm-hip-sdk

# If you prefer a specific version of StarPU, install it yourself, e.g.
# Install StarPU (with MPI and FxT enabled)
mkdir -p $HOME/install
cd $HOME/install
wget https://files.inria.fr/starpu/starpu-1.3.10/starpu-1.3.10.tar.gz
tar xvzf starpu-1.3.10.tar.gz
cd starpu-1.3.10/
./configure --prefix=/usr/local --with-fxt=/usr/lib/x86_64-linux-gnu/
make -j5
sudo make install

# Install PaRSEC: to be used in place of StarPU
mkdir -p $HOME/install
cd $HOME/install
git clone https://bitbucket.org/mfaverge/parsec.git
cd parsec
git checkout mymaster
git submodule update
mkdir -p build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr/local -DBUILD_SHARED_LIBS=ON
make -j5
sudo make install

# Install QUARK: to be used in place of StarPU
mkdir -p $HOME/install
cd $HOME/install
git clone https://github.com/ecrc/quark
cd quark/
sed -i -e "s#prefix=.*#prefix=/usr/local#g" make.inc
sed -i -e "s#CFLAGS=.*#CFLAGS= -O2 -DADD_ -fPIC#g" make.inc
make
sudo make install

See also our script example in the distrib/debian sub-directory.

5.2.1 Known issues

  • we need the LAPACKE interface to the TMG routines, so symbols like LAPACKE_dlatms_work must be defined in the LAPACKE library. Make sure the Debian packages libopenblas-dev and liblapacke-dev provide the TMG interface (there is no problem with Intel MKL). If not, you can possibly update your distribution, or install the LAPACKE interface library another way: by yourself from source, with Spack, or with Guix-HPC,…

5.2.2 Some details about dependencies

  1. BLAS implementation

    BLAS (Basic Linear Algebra Subprograms) is a de facto standard for basic linear algebra operations such as vector and matrix multiplication. A FORTRAN implementation of BLAS is available from Netlib, and a C implementation of BLAS is included in GSL (GNU Scientific Library). Both are reference implementations of BLAS: they are not optimized for modern processor architectures and provide an order of magnitude lower performance than optimized implementations. Highly optimized implementations of BLAS are available from many hardware vendors, such as Intel MKL, IBM ESSL and AMD ACML. Fast implementations are also available as academic packages, such as ATLAS and OpenBLAS. The standard interface to BLAS is the FORTRAN interface.

    Caution about the compatibility: Chameleon has been mainly tested with the reference BLAS from NETLIB, OpenBLAS and Intel MKL.

  2. CBLAS

    CBLAS is a C language interface to BLAS. Most commercial and academic implementations of BLAS also provide CBLAS. Netlib provides a reference implementation of CBLAS on top of FORTRAN BLAS (Netlib CBLAS). Since GSL is implemented in C, it naturally provides CBLAS.

    Caution about the compatibility: Chameleon has been mainly tested with the reference CBLAS from NETLIB, OpenBLAS and Intel MKL.

  3. LAPACK implementation

    LAPACK (Linear Algebra PACKage) is a software library for numerical linear algebra, a successor of LINPACK and EISPACK and a predecessor of Chameleon. LAPACK provides routines for solving linear systems of equations, linear least squares problems, eigenvalue problems and singular value problems. Most commercial and academic BLAS packages also provide some LAPACK routines.

    Caution about the compatibility: Chameleon has been mainly tested with the reference LAPACK from NETLIB, OpenBLAS and Intel MKL.

  4. LAPACKE

    LAPACKE is a C language interface to LAPACK (or CLAPACK). It is produced by Intel in coordination with the LAPACK team and is available in source code from Netlib in its original version (Netlib LAPACKE) and from the Chameleon website in an extended version (LAPACKE for Chameleon). In addition to implementing the C interface, LAPACKE also provides routines which automatically handle workspace allocation, making the use of LAPACK much more convenient.

    Caution about the compatibility: Chameleon has been mainly tested with the reference LAPACKE from NETLIB, OpenBLAS and Intel MKL. In addition, the LAPACKE library must be configured to provide the interface to the TMG routines, so that symbols like LAPACKE_dlatms_work are defined.

  5. libtmg

    libtmg is a component of the LAPACK library, containing routines for generation of input matrices for testing and timing of LAPACK. The testing and timing suites of LAPACK require libtmg, but not the library itself. Note that the LAPACK library can be built and used without libtmg.

    Caution about the compatibility: Chameleon has been mainly tested with the reference TMGLIB from NETLIB, OpenBLAS and Intel MKL.

  6. StarPU

    StarPU is a task programming library for hybrid architectures. StarPU handles run-time concerns such as:

    • Task dependencies
    • Optimized heterogeneous scheduling
    • Optimized data transfers and replication between main memory and discrete memories
    • Optimized cluster communications

    StarPU can be used to benefit from GPUs and distributed-memory environments. Note that StarPU is enabled by default.

    Caution about the compatibility: Chameleon has been mainly tested with StarPU-1.1, 1.2 and 1.3 releases.

  7. PaRSEC

    PaRSEC is a generic framework for architecture aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures.

    Caution about the compatibility: Chameleon is compatible with this version https://bitbucket.org/mfaverge/parsec/branch/mymaster.

  8. QUARK

    QUARK (QUeuing And Runtime for Kernels) provides a library that enables the dynamic execution of tasks with data dependencies in a multi-core, multi-socket, shared-memory environment. When Chameleon is linked with QUARK or OpenMP, it is not possible to exploit either CUDA (for GPUs) or MPI (distributed-memory environments). You can use StarPU to do so.

    Caution about the compatibility: Chameleon has been mainly tested with the QUARK library coming from https://github.com/ecrc/quark.

  9. EZTrace

    This library provides efficient modules for recording traces. Chameleon can trace kernel execution on CPU workers thanks to EZTrace and produce .paje files. EZTrace also provides integrated modules to trace MPI calls and/or memory usage. See how to use this feature in Execution trace using EZTrace. To trace kernel execution on all kinds of workers, such as CUDA, we recommend using the internal tracing support of the underlying runtime system. See how to use this feature in Execution trace using StarPU/FxT.

  10. hwloc

    hwloc (Portable Hardware Locality) is a software package for accessing the topology of a multicore system, including components like cores, sockets, caches and NUMA nodes. Using the topology discovery library hwloc through the runtime system is strongly recommended: it helps increase performance and enables some topology-aware scheduling. hwloc is available in major distributions and for most OSes and can be downloaded from http://www.open-mpi.org/software/hwloc.

    Caution about the compatibility: hwloc should be compatible with the runtime system used.

  11. OpenMPI

    OpenMPI is an open source Message Passing Interface implementation for execution on multiple nodes in distributed-memory environments. MPI can be enabled only if the runtime system chosen is StarPU (default). To use MPI through StarPU, it is necessary to compile StarPU with MPI enabled.

    Caution about the compatibility: OpenMPI should be built with the --enable-mpi-thread-multiple option.

  12. Nvidia CUDA Toolkit

    Nvidia CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. Chameleon can use a set of low level optimized kernels coming from cuBLAS to accelerate computations on GPUs. The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the Nvidia CUDA runtime. cuBLAS is normally distributed with the Nvidia CUDA Toolkit. CUDA/cuBLAS can be enabled in Chameleon only if the runtime system chosen is StarPU (default). To use CUDA through StarPU, it is necessary to compile StarPU with CUDA enabled.

    Caution about the compatibility: your compiler must be compatible with CUDA.

  13. HIP

    HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code.

5.3 Build and install Chameleon with CMake

Compilation of Chameleon libraries and executables is done with CMake (http://www.cmake.org/). This version has been tested with CMake 3.10.2, but any version from 3.3 on should be fine, unless you enable the H-Mat support, which raises the minimum required version to 3.17.

Here are the steps to configure, build, test and install:

  1. configure:

    cmake path/to/chameleon -DOPTION1= -DOPTION2= ...
    # see the "Configuration options" section to get list of options
    # see the "Dependencies detection" for details about libraries detection
    
  2. build:

    make
    # do not hesitate to use the -j[ncores] option to speed up the compilation
    
  3. test (optional, requires CHAMELEON_ENABLE_TESTING=ON):

    make test
    # or
    ctest
    
  4. install (optional):

    make install
    

    Do not forget to specify the install directory with -DCMAKE_INSTALL_PREFIX at configure time.

    cmake /home/jdoe/chameleon -DCMAKE_INSTALL_PREFIX=/home/jdoe/install/chameleon
    

    Note that the install process is optional. You are free to use Chameleon binaries compiled in the build directory.

5.3.1 Configuration options

You can optionally activate some options at cmake configure (like CUDA, MPI, …) by invoking cmake path/to/your/CMakeLists.txt -DOPTION1= -DOPTION2= ...

cmake /home/jdoe/chameleon/ -DCMAKE_BUILD_TYPE=Debug \
			    -DCMAKE_INSTALL_PREFIX=/home/jdoe/install/ \
			    -DCHAMELEON_USE_CUDA=ON \
			    -DCHAMELEON_USE_MPI=ON \
			    -DBLA_VENDOR=Intel10_64lp_seq

You can get the full list of options with the -L[A][H] option of the cmake command:

cmake -LH /home/jdoe/chameleon/

You can also set the options thanks to the ccmake interface.

  1. Native CMake options (non-exhaustive list)
    • CMAKE_BUILD_TYPE=Debug|Release|RelWithDebInfo|MinSizeRel: level of compiler optimization, enables/disables debug information.
    • CMAKE_PREFIX_PATH="path1;path2": where the paths denote the roots of dependencies that may be installed with CMake.
    • CMAKE_INSTALL_PREFIX=path/to/your/install/dir: where headers, libraries, executables, etc, will be copied when invoking make install.
    • BUILD_SHARED_LIBS=ON|OFF: indicates whether CMake has to build Chameleon static (OFF) or shared (ON) libraries.
    • CMAKE_C_COMPILER=gcc|icc|…: to choose the C compiler if several exist in the environment
    • CMAKE_Fortran_COMPILER=gfortran|ifort|…: to choose the Fortran compiler if several exist in the environment
  2. Related to specific modules (find_package) to find external libraries
    • BLA_VENDOR=All|OpenBLAS|Generic|Intel10_64lp|Intel10_64lp_seq|FLAME: to use Intel MKL for example; see the list of BLA_VENDOR values.

    Libraries detected with an official cmake module (see module files in CMAKE_ROOT/Modules/): BLAS - LAPACK - CUDA - MPI - OpenMP - Threads.

    Libraries detected with our cmake modules (see module files in the cmake_modules/morse_cmake/modules/find/ directory of Chameleon sources): CBLAS - EZTRACE - FXT - HWLOC - LAPACKE - PARSEC - QUARK - SIMGRID - STARPU.

  3. Chameleon specific options
    • CHAMELEON_SCHED=STARPU|PARSEC|QUARK|OPENMP (default STARPU): to link respectively with the StarPU, PaRSEC, Quark or OpenMP library (runtime system)
    • CHAMELEON_USE_MPI=ON|OFF (default OFF): to link with the MPI library (message passing implementation for use of multiple nodes with distributed memory); can only be used with StarPU
    • CHAMELEON_USE_CUDA=ON|OFF (default OFF): to link with the CUDA runtime (implementation paradigm for accelerated codes on Nvidia GPUs) and the cuBLAS library (optimized BLAS kernels on Nvidia GPUs); can only be used with StarPU and PaRSEC
    • CHAMELEON_USE_HIP_ROC=ON|OFF (default OFF): to link with the HIP runtime (implementation paradigm for accelerated codes on AMD GPUs) and the hipBLAS library (optimized BLAS kernels on AMD GPUs); can only be used with StarPU
    • CHAMELEON_USE_HIP_CUDA=ON|OFF (default OFF): to link with the HIP runtime (implementation paradigm for accelerated codes on Nvidia GPUs) and the hipBLAS library (optimized BLAS kernels on Nvidia GPUs); can only be used with StarPU
    • CHAMELEON_ENABLE_DOC=ON|OFF (default OFF): to control the build of the documentation contained in the doc/ sub-directory
    • CHAMELEON_ENABLE_EXAMPLE=ON|OFF (default ON): to control the build of the example executables (API usage) contained in the example/ sub-directory
    • CHAMELEON_ENABLE_PRUNING_STATS=ON|OFF (default OFF)
    • CHAMELEON_ENABLE_TESTING=ON|OFF (default ON): to control the build of the testing executables (timer and numerical checks) contained in the testing/ sub-directory
    • CHAMELEON_SIMULATION=ON|OFF (default OFF): to enable the simulation mode, meaning that Chameleon will not actually execute the tasks; see details in section Use simulation mode with StarPU-SimGrid. This option must be used with StarPU compiled with SimGrid, which makes it possible to predict the execution time on any architecture. This feature should be used to experiment on scheduler behaviors and performance, not to produce solutions of linear systems.
    • CHAMELEON_USE_MIGRATE=ON|OFF (default OFF): enables data migration in QR algorithms.
    • CHAMELEON_USE_MPI_DATATYPES=ON|OFF (default OFF): enables MPI datatypes whenever supported by the runtime.
    • CHAMELEON_USE_HMAT_OSS=ON|OFF (default OFF): enables Hmat-OSS kernels.
    • CHAMELEON_RUNTIME_SYNC=ON|OFF (default OFF): enables synchronous task submission when available, to debug the code without parallelism.
    • CHAMELEON_KERNELS_TRACE=ON|OFF (default OFF): enables kernel traces to debug the task execution order.
    • CHAMELEON_KERNELS_MT=ON|OFF (default OFF): use multithreaded kernels (e.g. Intel MKL MT)

5.3.2 Dependencies detection

You have different choices to detect dependencies on your system, either by setting some environment variables containing paths to the libs and headers or by specifying them directly at cmake configure. In any case, if the dependencies are installed in non standard directories, do not forget to use the PKG_CONFIG_PATH environment variable and the CMAKE_PREFIX_PATH environment (or CMake) variable. Different cases:

  1. detection of dependencies through environment variables:
    • LD_LIBRARY_PATH (DYLD_LIBRARY_PATH on Mac OS X) should contain the list of paths where to find the libraries:

      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:install/path/to/your/lib
      
    • INCLUDE (or CPATH, or C_INCLUDE_PATH) should contain the list of paths where to find the header files of libraries

      export INCLUDE=$INCLUDE:install/path/to/your/headers
      
  2. detection with user's given paths:
    • you can specify the path at cmake configure by invoking

      cmake path/to/your/CMakeLists.txt -DLIB_DIR=path/to/your/lib
      

      where LIB stands for the name of the lib to look for, e.g.

      cmake path/to/your/CMakeLists.txt -DQUARK_DIR=path/to/quarkdir \
      				  -DCBLAS_DIR= ...
      

      it is also possible to specify headers and library directories separately

      cmake path/to/your/CMakeLists.txt \
            -DQUARK_INCDIR=path/to/quark/include \
            -DQUARK_LIBDIR=path/to/quark/lib
      
  3. detection with custom environment variables: all variables like _DIR, _INCDIR, _LIBDIR can be set as environment variables instead of CMake options; they will be read
  4. using pkg-config for libraries that provide .pc files
    • update your PKG_CONFIG_PATH with the paths where the .pc files of installed external libraries like hwloc, starpu, some blas/lapack, etc., can be found
  5. using CMAKE_PREFIX_PATH for libraries that provide some CMake config files containing targets definitions (e.g. fooConfig.cmake).

Note that PaRSEC and StarPU are only detected through the pkg-config mechanism, because it is always provided and this avoids errors. The CMAKE_PREFIX_PATH variable can be used to indicate where the dependencies are installed.

5.4 Distribution Debian

Download one of the available packages for your distribution from https://gitlab.inria.fr/solverstack/chameleon/-/packages, then install it as follows:

sudo apt-get install ./chameleon_1.2.0-1_amd64.deb -y
# and to uninstall use
# sudo apt-get autoremove -y chameleon

Chameleon will be installed on your system, meaning you can use the testing drivers for performance tests:

export OPENBLAS_NUM_THREADS=1
mpiexec -n 2 chameleon_stesting -o gemm -n 6400

and use the Chameleon library in your own project:

# example usage: use chameleon library in your own cmake project (we provide a CHAMELEONConfig.cmake)
git clone https://gitlab.inria.fr/solverstack/distrib.git
cd distrib/cmake/test/chameleon && mkdir build && cd build && cmake .. && make && ./test_chameleon

# example usage: use chameleon library in your own non-cmake project
# use pkg-config to get compiler flags and linking
pkg-config --cflags chameleon
pkg-config --libs chameleon
# if there are static libraries use the --static option of pkg-config

Do not hesitate to send an email if you need a package for your Debian distribution.

5.5 Distribution of Chameleon using GNU Guix

We provide Guix packages to install Chameleon with its dependencies in a reproducible way on GNU/Linux systems. For Mac OS X, please refer to the next sections about Brew or Spack packaging.

If you are "root" on the system, you can install Guix and directly use it to install the libraries. On supercomputers where you are not root, you may still be able to use it if Docker or Singularity are available on the machine, because Chameleon can be packaged as Docker/Singularity images with Guix.

5.5.1 Installing Guix

Guix requires a running GNU/Linux system, GNU tar and Xz. Follow the installation instructions:

cd /tmp
wget https://git.savannah.gnu.org/cgit/guix.git/plain/etc/guix-install.sh
chmod +x guix-install.sh
sudo ./guix-install.sh

or on Debian

sudo apt install guix

The Chameleon packages are not official Guix packages. It is then necessary to add a channel to get additional packages. Create a ~/.config/guix/channels.scm file with the following snippet:

(cons (channel
    (name 'guix-hpc-non-free)
    (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
  %default-channels)

Update the Guix package definitions

guix pull

Make the new guix available in your PATH

PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
hash guix

For further shell sessions, add this to the ~/.bash_profile file

export PATH="$HOME/.config/guix/current/bin${PATH:+:}$PATH"
export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"

Chameleon packages are now available

guix search ^chameleon

Refer to the official documentation of Guix to learn the basic commands.

5.5.2 Installing Chameleon with Guix

Standard Chameleon, last release

guix install chameleon
# or use guix shell to get a new shell (isolated from the native environment) with chameleon available in the PATH
guix shell --pure chameleon -- /bin/bash --norc

Notice that there exist several build variants

  • chameleon (default) : with starpu - with mpi - with OpenBlas
  • chameleon-mkl-mt : default version but with Intel MKL multithreaded to replace OpenBlas
  • chameleon-mkl-mt-wompi : with Intel MKL multithreaded and without mpi
  • chameleon-cuda : with starpu - with mpi - with cuda
  • chameleon-cuda-wompi : with starpu - without mpi - with cuda
  • chameleon-cuda-mkl-mt : with starpu - with mpi - with cuda - with Intel MKL multithreaded to replace OpenBlas
  • chameleon-cuda-mkl-mt-wompi : with starpu - without mpi - with cuda - with Intel MKL multithreaded to replace OpenBlas
  • chameleon-simgrid : with starpu - with mpi - with simgrid
  • chameleon-openmp : with openmp - without mpi
  • chameleon-parsec : with parsec - without mpi
  • chameleon-quark : with quark - without mpi

Change the version

guix install chameleon --with-branch=chameleon=master
guix install chameleon --with-commit=chameleon=b31d7575fb7d9c0e1ba2d8ec633e16cb83778e8b
guix install chameleon --with-git-url=chameleon=https://gitlab.inria.fr/fpruvost/chameleon.git
guix install chameleon --with-source=chameleon=$HOME/git/chameleon

Notice also that the default MPI is OpenMPI and the default BLAS/LAPACK is OpenBLAS. This can be changed with a transformation option.

Change some dependencies

# install chameleon with intel mkl to replace openblas, nmad to replace openmpi and starpu with fxt
guix install chameleon --with-input=openblas=mkl --with-input=openmpi=nmad --with-input=starpu=starpu-fxt

5.5.3 Generate a Chameleon Docker image with Guix

To install Chameleon and its dependencies within a docker image (OpenMPI stack)

docker_chameleon=`guix pack -f docker chameleon chameleon --with-branch=chameleon=master --with-input=openblas=mkl mkl starpu hwloc openmpi openssh slurm bash coreutils inetutils util-linux procps git grep tar sed gzip which gawk perl emacs-minimal vim gcc-toolchain make cmake pkg-config -S /bin=bin --entry-point=/bin/bash`
# Load the generated tarball as a docker image
docker_chameleon_tag=`docker load --input $docker_chameleon | grep "Loaded image: " | cut -d " " -f 3-`
# Change tag name, see the existing image name with "docker images" command, then change to a more simple name
docker tag $docker_chameleon_tag guix/chameleon-tmp

Create a Dockerfile inheriting from the image (renamed guix/chameleon here):

FROM guix/chameleon-tmp

# Create a directory for user 1000
RUN mkdir -p /builds
RUN chown -R 1000 /builds

ENTRYPOINT ["/bin/bash", "-l"]

# Enter the image as user 1000 in /builds
USER 1000
WORKDIR /builds
ENV HOME /builds

Then create the final Docker image from this Dockerfile.

docker build -t guix/chameleon .

Test the image

docker run -it guix/chameleon
# test starpu
STARPU=`pkg-config --variable=prefix libstarpu`
mpiexec -np 4 $STARPU/lib/starpu/mpi/comm
# test chameleon
CHAMELEON=`pkg-config --variable=prefix chameleon`
mpiexec -np 2 $CHAMELEON/bin/chameleon_stesting -H -o gemm -P 2 -t 2 -m 2000 -n 2000 -k 2000

5.5.4 Generate a Chameleon Singularity image with Guix

To package Chameleon and its dependencies within a singularity image (OpenMPI stack)

# define reproducible guix environment
guix describe -f channels > guix-channels.scm
guix shell --export-manifest chameleon-cuda --with-branch=chameleon=master --with-input=openblas=mkl bash coreutils emacs gawk grep inetutils openmpi openssh procps sed time util-linux vim which > guix-manifests.scm
SINGULARITY_IMAGE=`guix time-machine -C guix-channels.scm -- pack -f squashfs -m guix-manifests.scm -S /bin=bin --entry-point=/bin/bash`
cp $SINGULARITY_IMAGE chameleon-cuda.gz.sif

# copy the singularity image on the supercomputer, e.g. 'supercomputer'
scp chameleon-cuda.gz.sif supercomputer:

On a machine where Singularity is installed, Chameleon can then be called as follows

# at least openmpi and singularity are required here, e.g. module add openmpi singularity
module add openmpi singularity
export SINGULARITY_CMD=`which singularity`
export SINGULARITY_IMAGE=$HOME/chameleon-cuda.gz.sif
# use LD_PRELOAD to give the location of the CUDA driver installed on the supercomputer
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so
# then in your allocation with Slurm or OAR, for example
mpirun $MPI_OPTIONS -x LD_PRELOAD $SINGULARITY_CMD exec --bind /usr/lib/x86_64-linux-gnu/:/usr/lib/x86_64-linux-gnu/ $SINGULARITY_IMAGE chameleon_stesting -o gemm -n 96000 -b 1600 --nowarmup -g 2

5.5.5 Generate a tar.gz package of Chameleon with Guix

One can generate a tar.gz archive the same way as the singularity image

guix describe -f channels > guix-channels.scm
guix shell --export-manifest chameleon-cuda --with-branch=chameleon=master --with-input=openblas=mkl bash coreutils emacs gawk grep inetutils openmpi openssh procps sed time util-linux vim which > guix-manifests.scm
PACKRR=`guix time-machine --channels=guix-channels.scm -- pack -RR --manifest=guix-manifests.scm -S /bin=bin`
cp $PACKRR chameleon-cuda.tar.gz
# copy the archive on the supercomputer, e.g. 'supercomputer'
scp chameleon-cuda.tar.gz supercomputer:

Then, on a supercomputer that has neither Guix nor Singularity, one can do the following

mkdir guixrr/
cd guixrr/
tar xvf $HOME/chameleon-cuda.tar.gz
chmod +w .
export GUIX_ROOT=$PWD
# then in your allocation with Slurm or OAR, for example
${GUIX_ROOT}/bin/mpirun --launch-agent ${GUIX_ROOT}/bin/orted -x GUIX_EXECUTION_ENGINE=performance -x LD_PRELOAD="/usr/lib64/libcuda.so" -x STARPU_SILENT=1 ${GUIX_ROOT}/bin/chameleon_stesting -o gemm -n 16000,32000,64000,96000,128000 -b 2000 -g 4 -P 2

5.6 Distribution of Chameleon using Spack

5.6.1 Installing Spack

We provide a Chameleon Spack package (with StarPU) for Linux or macOS. Please refer to the documentation for installation instructions.

# please read https://spack.readthedocs.io/en/latest/getting_started.html
git clone https://github.com/spack/spack.git
. spack/share/spack/setup-env.sh
cd spack
git checkout v0.19.1

Chameleon is then available

spack info chameleon
spack spec chameleon

Refer to the getting started guide and basic usage guide to learn how to use Spack properly.

5.6.2 Installing Chameleon with Spack

Standard Chameleon, last state on the 'master' branch

spack install -v chameleon
# chameleon is installed here:
spack location -i chameleon

Notice that there exist several build variants (see spack info chameleon)

  • chameleon (default) : with starpu - with mpi
  • tune the build type (CMake) with build_type=RelWithDebInfo|Debug|Release
  • enable/disable shared libraries with +/- shared
  • enable/disable mpi with +/- mpi
  • enable/disable cuda with +/- cuda
  • enable/disable fxt with +/- fxt
  • enable/disable simgrid with +/- simgrid
  • runtime=openmp : with openmp - without starpu

Change the version

spack install -v chameleon@master

Notice also that the default MPI is OpenMPI and the default BLAS/LAPACK is OpenBLAS. This can be changed by adding some constraints on virtual packages.

Change some dependencies

# see lapack providers
spack providers lapack
# see mpi providers
spack providers mpi
# install chameleon with intel mkl to replace openblas
spack install -v chameleon ^intel-mkl # or ^intel-oneapi-mkl

5.7 Distribution Brew for Mac OS X

We provide some Brew packages here: https://gitlab.inria.fr/solverstack/brew-repo (under construction).

5.8 Linking an external application with Chameleon libraries

Compiling and linking with the Chameleon libraries has been tested with the GNU compiler suite gcc/gfortran and the Intel compiler suite icc/ifort.

5.8.1 For CMake projects

A CHAMELEONConfig.cmake file is provided at installation, stored in <prefix>/lib/cmake/chameleon, so that users of CMake projects can find it through the variable CHAMELEON_ROOT (set it as an environment or CMake variable).

sudo apt-get update
sudo apt-get install -y libopenblas-dev liblapacke-dev libstarpu-dev
git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon && mkdir -p build && cd build
CHAMELEON_ROOT=$PWD/install
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT && make -j5 install
# chameleon is installed in $CHAMELEON_ROOT

# if you work in a cmake project you can use the CHAMELEONConfig.cmake file
# installed under <prefix>/lib/cmake/chameleon/ by setting your
# CMAKE_PREFIX_PATH with the path of installation. In your cmake project, use
# find_package(CHAMELEON) and link your libraries and/or executables with the
# library target CHAMELEON::chameleon
cmake . -DCMAKE_PREFIX_PATH=$CHAMELEON_ROOT

5.8.2 For non CMake projects

The compiler and linker flags necessary to build an application using Chameleon are given through the pkg-config mechanism.

sudo apt-get update
sudo apt-get install -y libopenblas-dev liblapacke-dev libstarpu-dev
git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon && mkdir -p build && cd build
CHAMELEON_ROOT=$PWD/install
cmake .. -DCMAKE_INSTALL_PREFIX=$CHAMELEON_ROOT && make -j5 install
# chameleon is installed in $CHAMELEON_ROOT

export PKG_CONFIG_PATH=$CHAMELEON_ROOT/lib/pkgconfig:$PKG_CONFIG_PATH
pkg-config --cflags chameleon
pkg-config --libs chameleon
pkg-config --libs --static chameleon

# use it in your configure/make

The .pc files required are located in the sub-directory lib/pkgconfig of your Chameleon install directory.

5.8.3 Static linking in C

Let's imagine you have a file main.c that you want to link against the Chameleon static libraries. Let's consider /home/yourname/install/chameleon to be the install directory of Chameleon, containing the sub-directories include/ and lib/. Your compilation command with the gcc compiler could be:

gcc -I/home/yourname/install/chameleon/include -o main.o -c main.c

Now if you want to link your application with Chameleon static libraries, you could do:

gcc main.o -o main                                         \
/home/yourname/install/chameleon/lib/libchameleon.a        \
/home/yourname/install/chameleon/lib/libchameleon_starpu.a \
/home/yourname/install/chameleon/lib/libcoreblas.a         \
-lstarpu-1.3 -Wl,--no-as-needed -lmkl_intel_lp64           \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

As you can see in this example, we also link with some dynamic libraries: starpu-1.3, the Intel MKL libraries (for BLAS/LAPACK/CBLAS/LAPACKE), pthread, m (math) and rt. These libraries will depend on the configuration of your Chameleon build. You can find these dependencies in the .pc files we generate during compilation, which are installed in the sub-directory lib/pkgconfig of your Chameleon install directory. Note also that you may need to specify where to find these libraries with the -L option of your compiler/linker.

Before running your program, make sure that the paths of all the shared libraries your executable depends on are known. Enter ldd main to check. If some shared library paths are missing, append them to the LD_LIBRARY_PATH (on Linux systems) environment variable (DYLD_LIBRARY_PATH on Mac).

5.8.4 Dynamic linking in C

For dynamic linking (you need to build Chameleon with the CMake option BUILD_SHARED_LIBS=ON), the process is similar to the static compilation/link, but instead of specifying the paths to your static libraries, you indicate the path to the dynamic libraries with the -L option and give the names of the libraries with the -l option, like this:

gcc main.o -o main \
-L/home/yourname/install/chameleon/lib \
-lchameleon -lchameleon_starpu -lcoreblas \
-lstarpu-1.3 -Wl,--no-as-needed -lmkl_intel_lp64 \
-lmkl_sequential -lmkl_core -lpthread -lm -lrt

Note that an update of your environment variable LD_LIBRARY_PATH (DYLD_LIBRARY_PATH on Mac) with the path of the libraries may be required before executing

export LD_LIBRARY_PATH=path/to/libs:path/to/chameleon/lib

6 Using Chameleon

6.1 Using Chameleon executables

Chameleon provides several test executables that are compiled and linked with Chameleon's dependencies. Instructions about the arguments to give to the executables are accessible through the --help (or -h) option. These binaries are separated into categories and can be found in different directories:

  • example: contains examples of API usage and more specifically the sub-directory lapack_to_chameleon/ provides a tutorial that explains how to use Chameleon functionalities starting from a full LAPACK code, see Tutorial LAPACK to Chameleon
  • testing: contains testing drivers to check numerical correctness and assess performance of Chameleon linear algebra routines with a wide range of parameters

    ./testing/chameleon_stesting -H -o gemm -t 2 -m 2000 -n 2000 -k 2000
    

    To get the list of parameters, use the -h or --help option.

    ./testing/chameleon_stesting -h
    

    Available algorithms for testing are:

    • gelshqr: Linear least squares with general matrix using hierarchical reduction trees
    • ormlqhqr: Q application with hierarchical reduction trees (LQ)
    • orglqhqr: Q generation with hierarchical reduction trees (LQ)
    • gelqfhqr: General LQ factorization with hierarchical reduction trees
    • ormqrhqr: Q application with hierarchical reduction trees (QR)
    • orgqrhqr: Q generation with hierarchical reduction trees (QR)
    • geqrfhqr: General QR factorization with hierarchical reduction trees
    • gels: Linear least squares with general matrix
    • ormlq: Q application (LQ)
    • orglq: Q generation (LQ)
    • gelqf: General LQ factorization
    • ormqr: Q application (QR)
    • orgqr: Q generation (QR)
    • geqrf: General QR factorization
    • gesv: General linear system solve (LU without pivoting)
    • getrs: General triangular solve (LU without pivoting)
    • getrf: General factorization (LU without pivoting)
    • potri: Symmetric positive definite matrix inversion
    • lauum: Triangular in-place matrix-matrix computation for Cholesky inversion
    • trtri: Triangular matrix inversion
    • posv: Symmetric positive definite linear system solve (Cholesky)
    • potrs: Symmetric positive definite solve (Cholesky)
    • potrf: Symmetric positive definite factorization (Cholesky)
    • trsm: Triangular matrix solve
    • trmm: Triangular matrix-matrix multiply
    • syr2k: Symmetric matrix-matrix rank-2k update
    • syrk: Symmetric matrix-matrix rank-k update
    • symm: Symmetric matrix-matrix multiply
    • gemm: General matrix-matrix multiply
    • lascal: General matrix scaling
    • tradd: Triangular matrix-matrix addition
    • geadd: General matrix-matrix addition
    • lantr: Triangular matrix norm
    • lansy: Symmetric matrix norm
    • lange: General matrix norm
    • lacpy: General matrix copy

6.1.1 Configuration through environment variables

Some parameters of the Chameleon library can be set to default values through the environment variables listed below. Note that the code itself can modify these values through calls to `CHAMELEON_Enable()`, `CHAMELEON_Disable()`, or `CHAMELEON_Set()` (see Options, and the sketch after the list below).

  • CHAMELEON_TILE_SIZE defines the default tile size value for all algorithms. The default value is 384.
  • CHAMELEON_INNER_BLOCK_SIZE defines the default inner blocking size value for algorithms that require it (mainly QR/LQ algorithms). The default value is 48.
  • CHAMELEON_HOUSEHOLDER_MODE changes the basic QR algorithm from a flat tree (1, ChamFlatHouseholder, or Flat) to a Householder tree (2, ChamTreeHouseholder, or Tree). The default value is ChamFlatHouseholder.
  • CHAMELEON_HOUSEHOLDER_SIZE defines the size of the local Householder trees if the Householder tree mode is set. The default value is 4.
  • CHAMELEON_TRANSLATION_MODE defines the translation used in the LAPACK API routines. 1, In, or ChamInPlace sets the in-place translation to avoid copies. 2, Out, or ChamOutOfPlace sets the out-of-place translation that uses a copy of the matrix. The default is ChamInPlace.
  • CHAMELEON_GENERIC, if ON, disables all the algorithms that are specialized for specific data distributions.
  • CHAMELEON_AUTOMINMAX, if ON, sets the minimal/maximal limits of tasks that can be submitted to the runtime system. These limits are computed per algorithm using the lookahead parameter. (StarPU specific, and currently only available for getrf)
  • CHAMELEON_LOOKAHEAD defines the number of steps that will be submitted in advance in algorithms using lookahead techniques. The default is 1.
  • CHAMELEON_WARNINGS enables/disables the warning output.
  • CHAMELEON_PARALLEL_KERNEL enables/disables the use of multi-threaded kernels. Available only for the StarPU runtime system.
  • CHAMELEON_GENERATE_STATS enables the profiling information of the kernels (StarPU specific).
  • CHAMELEON_PROGRESS enables the progress function to show the percentage of tasks completed.
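
For instance, to change the tile size and the lookahead depth for a single run of the testing driver, one could set the variables as follows (the values are illustrative):

export CHAMELEON_TILE_SIZE=320
export CHAMELEON_INNER_BLOCK_SIZE=32
export CHAMELEON_LOOKAHEAD=2
./testing/chameleon_dtesting -o potrf -n 9600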

6.1.2 Execution trace using EZTrace

EZTrace can be used by Chameleon to generate traces. Two modules are automatically generated as soon as EZTrace is detected on the system. The first one (which is recommended) is the chameleon_tcore module. It traces all the TCORE_...() functions that are called by the codelets of every runtime except PaRSEC. The second one is the chameleon_core module, which traces the lower-level CORE_...() functions. If using PaRSEC, you need to use this module to generate the traces.

To generate traces with EZTrace, you first need to compile with -DBUILD_SHARED_LIBS=ON. EZTrace uses weak symbols to overload function calls through LD_PRELOAD and enable trace generation. Then, either install the libeztrace-*.so files into the EZTrace install directory, or add the path of the modules to your environment:

export EZTRACE_LIBRARY_PATH=/path/to/your/modules

To check that the modules are available, you should have:

$ eztrace_avail
1	omp	Module for OpenMP parallel regions
2	pthread	Module for PThread synchronization functions (mutex, semaphore, spinlock, etc.)
3	stdio	Module for stdio functions (read, write, select, poll, etc.)
4	mpi	Module for MPI functions
5	memory	Module for memory functions (malloc, free, etc.)
6	papi	Module for PAPI Performance counters
128	chameleon_core	Module for Chameleon CORE functions
129	chameleon_tcore	Module for Chameleon TCORE functions

Then, you can restrict the modules used during the execution:

export EZTRACE_TRACE="mpi chameleon_tcore"

The mpi module is required if you want to run in a distributed setting.

The setup can be checked with eztrace_loaded:

$ eztrace_loaded
4	mpi	Module for MPI functions
129	chameleon_tcore	Module for Chameleon TCORE functions

To generate the traces, you need to run your binary through eztrace:

eztrace ./chameleon_dtesting -o gemm -n 1000 -b 200
mpirun -np 4 eztrace ./chameleon_dtesting -o gemm -n 1000 -b 200 -P 2

Then convert the binary files into a .trace file and visualize it:

eztrace_convert <username>_eztrace_log_rank_<[0-9]*>
vite eztrace_output.trace

For more information on EZTrace, please refer to its support page.

6.1.3 Execution trace using StarPU/FxT

StarPU can generate its own trace log files if it is compiled with the --with-fxt option at the configure step (you may have to specify the directory where you installed FxT by giving --with-fxt=... instead of --with-fxt alone). In addition, the environment variable STARPU_FXT_TRACE must be set to 1.

export STARPU_FXT_TRACE=1

By doing so, traces are generated after each execution of a program using StarPU, in the directory pointed to by the STARPU_FXT_PREFIX environment variable (if not set, the default path is /tmp).

export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/

When executing a ./testing/... Chameleon program, if tracing has been enabled (StarPU compiled with FxT), the program will generate trace files in the directory $STARPU_FXT_PREFIX.

To save only some specific types of events, set the STARPU_FXT_EVENTS variable.

Finally, to generate the trace file, which can be opened with the ViTE program, you can use the starpu_fxt_tool executable of StarPU. This tool should be in the bin directory of StarPU's installation. You can use it to generate the trace file like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename

There is one file per MPI process (prof_filename_0, prof_filename_1, …). To generate a trace of an MPI program, you can call it like this:

path/to/your/install/starpu/bin/starpu_fxt_tool -i prof_filename*

The trace file will be named paje.trace (use the -o option to specify an output name). Alternatively, for a non-MPI execution (only one process and one profiling file), you can set the environment variable STARPU_GENERATE_TRACE=1 to automatically generate the paje trace file.
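
Putting the pieces together, a typical non-MPI tracing session could look like this (paths and testing parameters are illustrative):

export STARPU_FXT_TRACE=1
export STARPU_FXT_PREFIX=/home/jdoe/fxt_files/
./testing/chameleon_dtesting -o gemm -n 1000 -b 200
path/to/your/install/starpu/bin/starpu_fxt_tool -i /home/jdoe/fxt_files/prof_filename
vite paje.trace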

6.1.4 Use simulation mode with StarPU-SimGrid

Simulation mode can be activated by setting the cmake option CHAMELEON_SIMULATION to ON. This mode allows you to simulate the execution of algorithms with StarPU compiled with SimGrid. To do so, we provide some perfmodels in the simucore/perfmodels/ directory of the Chameleon sources. To use these perfmodels, please set your STARPU_HOME environment variable to path/to/your/chameleon_sources/simucore/perfmodels. Finally, set your STARPU_HOSTNAME environment variable to the name of the machine to simulate.

The algorithms available for now are gemm, symm, potrf, potrs, potri, posv, getrf_nopiv, getrs_nopiv, geqrf, geqrf_hqr, gels, and gels_hqr, in single and double precision, on PlaFRIM nodes with GPUs. The tile size to use depends on the platform, i.e. on STARPU_HOSTNAME (choose a matrix size N that is a multiple of the tile size):

  • sirocco-k40m: 960
  • sirocco-p100: 1240
  • sirocco-v100: 1600
  • sirocco-a100: 1600
  • sirocco-rtx8000: 1600

In addition, the potrf algorithm is also available on the mirage and sirocco machines for the following tile sizes:

  • mirage: 320, 960
  • sirocco: 80, 440, 960, 1440, 1920

The database of models is subject to change.

export STARPU_HOME=/tmp/chameleon/simucore/perfmodels/
export STARPU_HOSTNAME=sirocco
./testing/chameleon_dtesting -o potrf -t 22 -g 2 -n 14400 -b 1440 --nowarmup
0;dpotrf;22;2;1;1;0;1440;121;14400;14400;1804289383;0.000000e+00;7.867404e-01;1.265261e+03

export STARPU_HOSTNAME=sirocco-k40m
./testing/chameleon_stesting -o gemm -t 38 -g 2 -n 64000 -b 1600 --nowarmup
0;sgemm;38;2;1;1;0;1600;111;111;64000;64000;64000;64000;64000;64000;4.892778e-01;-1.846424e-01;1649760492;596516649;1189641421;0.000000e+00;2.010660e+01;2.607541e+04

export STARPU_HOSTNAME=sirocco-p100
./testing/chameleon_dtesting -o geqrf -g 2 -t 30 -b 1240 -n 39680 --nowarmup
0;dgeqrf;30;2;1;1;0;1240;48;39680;39680;39680;4;1804289383;0.000000e+00;3.893336e+01;2.139677e+03

6.1.5 Use out of core support with StarPU

If the matrix cannot fit in main memory, StarPU can automatically evict tiles to the disk. The following variables need to be set (an example follows the list):

  • STARPU_DISK_SWAP environment variable to a place where to store evicted tiles, for example: STARPU_DISK_SWAP=/tmp
  • STARPU_DISK_SWAP_BACKEND environment variable to the I/O method, for example: STARPU_DISK_SWAP_BACKEND=unistd_o_direct
  • STARPU_LIMIT_CPU_MEM environment variable to the amount of memory that can be used in MBytes, for example: STARPU_LIMIT_CPU_MEM=1000
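
As a minimal sketch, an out-of-core run capped at 1000 MBytes of main memory could then be launched as follows (testing parameters are illustrative):

export STARPU_DISK_SWAP=/tmp
export STARPU_DISK_SWAP_BACKEND=unistd_o_direct
export STARPU_LIMIT_CPU_MEM=1000
./testing/chameleon_dtesting -o potrf -n 20000 -b 500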

6.2 Tutorial LAPACK to Chameleon

Chameleon provides routines to solve dense general systems of linear equations, symmetric positive definite systems of linear equations and linear least squares problems, using LU, Cholesky, QR and LQ factorizations. Real arithmetic and complex arithmetic are supported in both single precision and double precision. Routines that compute linear algebra are of the following form:

CHAMELEON_name[_Tile[_Async]]
  • all user routines are prefixed with CHAMELEON_
  • in the pattern CHAMELEON_name[_Tile[_Async]], name follows the BLAS/LAPACK naming scheme for algorithms (e.g. sgemm for general matrix-matrix multiply in single precision)
  • Chameleon provides three interface levels
    • CHAMELEON_name: simplest interface, very close to CBLAS and LAPACKE; matrices are given following the LAPACK data layout (1-D array column-major). It involves copies of data from the LAPACK layout to the tile layout and conversely (to update the LAPACK data), see Step1.
    • CHAMELEON_name_Tile: the tile interface avoids copies between the LAPACK and tile layouts. It is the standard interface of Chameleon and should achieve better performance than the previous, simplest interface. The data are given through a specific structure called a descriptor, see Step2.
    • CHAMELEON_name_Tile_Async: similar to the tile interface, it avoids the synchronization barrier normally called between Tile routines. At the end of an Async function, completion of tasks is not guaranteed and data are not necessarily up-to-date. To ensure that all tasks have been executed, a synchronization function has to be called after the sequence of Async functions, see Step4.

CHAMELEON routine calls have to be preceded by

CHAMELEON_Init( NCPU, NGPU );

to initialize CHAMELEON and the runtime system and followed by

CHAMELEON_Finalize();

to free some data and finalize the runtime and/or MPI.
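
As a minimal sketch (the header name and worker counts are illustrative), a Chameleon program therefore has the following shape:

#include <chameleon.h>

int main( void ) {
    int NCPU = 4;  /* number of CPU workers  */
    int NGPU = 0;  /* number of CUDA devices */

    /* Initialize CHAMELEON and the runtime system */
    CHAMELEON_Init( NCPU, NGPU );

    /* ... create descriptors and call CHAMELEON_name[_Tile[_Async]] routines ... */

    /* Free data, finalize the runtime and/or MPI */
    CHAMELEON_Finalize();
    return 0;
}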

This tutorial is dedicated to the API usage of Chameleon. The idea is to start from a simple code and explain, step by step, how to use Chameleon routines. The first step is a full BLAS/LAPACK code without dependencies on Chameleon, a code that most users should easily understand. Then, the different interfaces Chameleon provides are exposed, from the simplest API (step1) to the more complicated ones (up to step4). The way some important parameters are set is discussed in step5. step6 is an example of distributed computation with MPI. Finally, step7 shows how to let Chameleon initialize the user's data (matrices/vectors) in parallel.

Source files can be found in the example/lapack_to_chameleon/ directory. If the CMake option CHAMELEON_ENABLE_EXAMPLE is ON, then the source files are compiled with the project libraries. The arithmetic precision is double. To execute step X, enter the following command:

./stepX --option1 --option2 ...

Instructions about the arguments to give to the executables are accessible thanks to the option -[-]help or -[-]h. Note that default values exist for the options.

For all steps, the program solves a linear system \(Ax=B\). The matrix values are randomly generated but ensure that matrix \(A\) is symmetric positive definite, so that \(A\) can be factorized in an \(LL^T\) form using the Cholesky factorization.

The different steps of the tutorial are:

  • Step0: a simple Cholesky example using the C interface of BLAS/LAPACK
  • Step1: introduces the LAPACK equivalent interface of Chameleon
  • Step2: introduces the tile interface
  • Step3: indicates how to give your own tile matrix to Chameleon
  • Step4: introduces the tile async interface
  • Step5: shows how to set some important parameters
  • Step6: introduces how to benefit from MPI in Chameleon
  • Step7: introduces how to let Chameleon initialize the user's matrix data

6.2.1 Step0

The C interfaces of BLAS and LAPACK, that is, CBLAS and LAPACKE, are used to solve the system. The size of the system (matrix) and the number of right-hand sides can be given as arguments to the executable (be careful not to give huge numbers if you do not have an infinite amount of RAM!). As for every step, the correctness of the solution is checked by computing the norm \(||Ax-B||/(||A||||x||+||B||)\). The time spent in factorization+solve is recorded and, because we know exactly the number of operations of these algorithms, we deduce the number of operations processed per second (in GFlops/s). The important part of the code that solves the problem is:

/* Cholesky factorization:
 * A is replaced by its factor U or L depending on uplo */
LAPACKE_dpotrf( LAPACK_COL_MAJOR, 'U', N, A, N );
/* Solve:
 * B is stored in X on entry, X contains the result on exit.
 * Forward ...
 */
cblas_dtrsm(
    CblasColMajor,
    CblasLeft,
    CblasUpper,
    CblasConjTrans,
    CblasNonUnit,
    N, NRHS, 1.0, A, N, X, N);
/* ... and back substitution */
cblas_dtrsm(
    CblasColMajor,
    CblasLeft,
    CblasUpper,
    CblasNoTrans,
    CblasNonUnit,
    N, NRHS, 1.0, A, N, X, N);
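
The residual check mentioned above could be sketched as follows, assuming Acpy and Bcpy are copies of A and B saved before the factorization and solve (which overwrite A and X); these names are hypothetical:

/* ||Ax - B|| / (||A|| ||x|| + ||B||) with infinity norms */
double anorm = LAPACKE_dlange( LAPACK_COL_MAJOR, 'I', N, N,    Acpy, N );
double xnorm = LAPACKE_dlange( LAPACK_COL_MAJOR, 'I', N, NRHS, X,    N );
double bnorm = LAPACKE_dlange( LAPACK_COL_MAJOR, 'I', N, NRHS, Bcpy, N );
/* Bcpy <- A * X - Bcpy */
cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
             N, NRHS, N, 1.0, Acpy, N, X, N, -1.0, Bcpy, N );
double rnorm = LAPACKE_dlange( LAPACK_COL_MAJOR, 'I', N, NRHS, Bcpy, N );
double residual = rnorm / ( anorm * xnorm + bnorm );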

6.2.2 Step1

Step1 introduces the simplest Chameleon interface, which is equivalent to CBLAS/LAPACKE. The code is very similar to step0, but instead of calling CBLAS/LAPACKE functions, we call the equivalent Chameleon functions. The solving code becomes:

/* Factorization: */
CHAMELEON_dpotrf( UPLO, N, A, N );
/* Solve: */
CHAMELEON_dpotrs(UPLO, N, NRHS, A, N, X, N);

The API is almost the same, so it is easy to use for beginners. It is important to keep in mind that before any call to CHAMELEON routines, CHAMELEON_Init has to be invoked to initialize CHAMELEON and the runtime system. Example:

CHAMELEON_Init( NCPU, NGPU );

After all CHAMELEON calls have been done, a call to CHAMELEON_Finalize is required to free some data and finalize the runtime and/or MPI.

CHAMELEON_Finalize();

We use CHAMELEON routines with the LAPACK interface, which means the routines accept the same matrix format as LAPACK (1-D array column-major). Note that the matrix is copied into our own tile structures, see details about this format in Tile Data Layout. This means an overhead can come from these copies.

6.2.3 Step2

This program is a copy of step1, but instead of using the LAPACK interface, which leads to copying the LAPACK matrices inside the CHAMELEON routines, we use the tile interface. We will still use the standard matrix format, but we will see how to use this matrix to create a CHAMELEON descriptor, a structure wrapping the data on which we want to apply sequential task-based algorithms. The solving code becomes:

/* Factorization: */
CHAMELEON_dpotrf_Tile( UPLO, descA );
/* Solve: */
CHAMELEON_dpotrs_Tile( UPLO, descA, descX );

To use the tile interface, a specific structure CHAM_desc_t must be created. This can be achieved in several ways.

  1. Use the existing function CHAMELEON_Desc_Create: the matrix data are considered contiguous in memory, as in PLASMA (Tile Data Layout).
  2. Use the existing function CHAMELEON_Desc_Create_OOC: the matrix data are allocated on-demand in memory, tile by tile, and possibly pushed to disk if they do not fit in memory.
  3. Use the existing function CHAMELEON_Desc_Create_User: it is more flexible than Desc_Create because you can give your own way to access the tile data, so that your tiles can be allocated wherever you want in memory, see the next paragraph Step3.
  4. Create your own function to fill the descriptor. If you understand well the meaning of each item of CHAM_desc_t, you should be able to fill the structure correctly.

In Step2, we use the first way to create the descriptor:

CHAMELEON_Desc_Create(&descA, NULL, ChamRealDouble,
                  NB, NB, NB*NB, N, N,
                  0, 0, N, N,
                  1, 1);
  • descA is the descriptor to create.
  • The second argument is a pointer to existing data. The existing data must follow the LAPACK/PLASMA matrix layout Tile Data Layout (1-D array column-major) if CHAMELEON_Desc_Create is used to create the descriptor. The CHAMELEON_Desc_Create_User function can be used if you have data organized differently; this is discussed in the next paragraph Step3. Giving a NULL pointer means you let the function allocate the memory space. This requires copying your data into the memory allocated by *_Desc_Create. This can be done with

    CHAMELEON_Lapack_to_Tile(A, N, descA);
    
  • The third argument of Desc_Create is the datatype (used for memory allocation).
  • The fourth through eighth arguments stand for, respectively, the number of rows (NB) and columns (NB) in each tile, the total number of values in a tile (NB*NB), and the number of rows (N) and columns (N) of the entire matrix.
  • The ninth through twelfth arguments stand for, respectively, the beginning row (0) and column (0) indexes of the submatrix and the number of rows (N) and columns (N) of the submatrix. These arguments are specific and used in precise cases. If you do not consider submatrices, just use 0, 0, NROWS, NCOLS.
  • The last two arguments are the parameters of the 2-D block-cyclic distribution grid, see ScaLAPACK. To be able to use another data distribution over the nodes, the CHAMELEON_Desc_Create_User function should be used.
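
A sketch of the complete Step2 workflow, letting Chameleon allocate the tiles and copying LAPACK-layout arrays A and X in and out (error checking omitted), could be:

CHAM_desc_t *descA, *descX;

CHAMELEON_Desc_Create( &descA, NULL, ChamRealDouble,
                       NB, NB, NB*NB, N, N, 0, 0, N, N, 1, 1 );
CHAMELEON_Desc_Create( &descX, NULL, ChamRealDouble,
                       NB, NB, NB*NB, N, NRHS, 0, 0, N, NRHS, 1, 1 );

/* Copy the LAPACK-layout data into the tile descriptors */
CHAMELEON_Lapack_to_Tile( A, N, descA );
CHAMELEON_Lapack_to_Tile( X, N, descX );

CHAMELEON_dpotrf_Tile( UPLO, descA );
CHAMELEON_dpotrs_Tile( UPLO, descA, descX );

/* Copy the solution back into the LAPACK-layout array X */
CHAMELEON_Tile_to_Lapack( descX, X, N );

CHAMELEON_Desc_Destroy( &descA );
CHAMELEON_Desc_Destroy( &descX );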

6.2.4 Step3

This program makes use of the same interface as Step2 (the tile interface) but does not allocate LAPACK matrices anymore, so that no copies between the LAPACK matrix layout and the tile matrix layout are necessary to call CHAMELEON routines. To generate random right-hand sides you can use:

/* Allocate memory and initialize descriptor B */
CHAMELEON_Desc_Create(&descB,  NULL, ChamRealDouble,
                  NB, NB,  NB*NB, N, NRHS,
                  0, 0, N, NRHS, 1, 1);
/* generate RHS with random values */
CHAMELEON_dplrnt_Tile( descB, 5673 );

The other important point is that it is possible to create a descriptor, the structure necessary to call CHAMELEON efficiently, by giving your own pointers to tiles if your matrix is not organized as a 1-D array column-major. This can be achieved with the CHAMELEON_Desc_Create_User routine. Here is an example:

CHAMELEON_Desc_Create_User(&descA, matA, ChamRealDouble,
                       NB, NB, NB*NB, N, N,
                       0, 0, N, N, 1, 1,
                       user_getaddr_arrayofpointers,
                       user_getblkldd_arrayofpointers,
                       user_getrankof_zero, NULL);

The first arguments are the same as for the CHAMELEON_Desc_Create routine. The following arguments allow you to give pointers to functions that manage the access to tiles from the structure given as second argument. Here, for example, matA is an array containing addresses of tiles, see the function allocate_tile_matrix defined in step3.h. If you want the matrix to be allocated by Chameleon, you can use the CHAMELEON_MAT_ALLOC_GLOBAL or CHAMELEON_MAT_ALLOC_TILE variables to allocate, respectively, a single large memory block, or tile-by-tile allocations performed as late as possible. The three functions you have to define for Desc_Create_User are:

  • a function that returns the address of tile \(A(m,n)\), m and n standing for the indexes of the tile in the global matrix. Let's consider a \(4 \times 4\) matrix with tile size \(2 \times 2\); the matrix contains four tiles of indexes: \(A(m=0,n=0)\), \(A(m=0,n=1)\), \(A(m=1,n=0)\), \(A(m=1,n=1)\)
  • a function that returns the leading dimension of tile \(A(m,*)\)
  • a function that returns MPI rank of tile \(A(m,n)\)
  • a pointer to a structure that these three functions can use to store additional data.

Examples of these functions are visible in step3.h. Note that the way we define these functions is related to the tile matrix format and to the data distribution considered. This example should not be used with MPI since all tiles are assigned to process 0, which means a large amount of data would potentially be transferred between nodes. A simplified sketch of such callbacks is given below.
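
As a minimal sketch (the real versions are in step3.h), the three callbacks could look as follows, assuming a hypothetical storage where tiles are kept in an array of pointers and the descriptor's mat field points to that storage:

/* Hypothetical tile storage: tile (m, n) is at tiles[n * mt + m],
 * each tile being a column-major NB x NB block */
typedef struct {
    double **tiles;
    int      mt;   /* number of tile rows */
    int      nb;   /* tile size           */
} tile_storage_t;

static void *user_getaddr_arrayofpointers( const CHAM_desc_t *desc, int m, int n )
{
    tile_storage_t *s = (tile_storage_t *)desc->mat;
    return s->tiles[ n * s->mt + m ];
}

static int user_getblkldd_arrayofpointers( const CHAM_desc_t *desc, int m )
{
    tile_storage_t *s = (tile_storage_t *)desc->mat;
    (void)m;
    return s->nb;  /* every tile has leading dimension NB */
}

static int user_getrankof_zero( const CHAM_desc_t *desc, int m, int n )
{
    (void)desc; (void)m; (void)n;
    return 0;      /* all tiles on process 0: not suited to MPI runs */
}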

6.2.5 Step4

This program is a copy of step2, but instead of using the tile interface, it uses the tile async interface. The goal is to exhibit the runtime synchronization barriers. Keep in mind that when the tile interface is called, e.g. CHAMELEON_dpotrf_Tile, a synchronization function, waiting for the actual execution and termination of all tasks, is called to ensure the proper completion of the algorithm (i.e. data are up-to-date). The code shows how to exploit the async interface to pipeline subsequent algorithms so that fewer synchronizations are done. The code becomes:

/* Cham structure containing parameters and a structure to interact with
 * the Runtime system */
CHAM_context_t *chamctxt;
/* CHAMELEON sequence uniquely identifies a set of asynchronous function calls
 * sharing common exception handling */
RUNTIME_sequence_t *sequence = NULL;
/* CHAMELEON request uniquely identifies each asynchronous function call */
RUNTIME_request_t request = CHAMELEON_REQUEST_INITIALIZER;
int status;

...

chameleon_sequence_create(chamctxt, &sequence);

/* Factorization: */
CHAMELEON_dpotrf_Tile_Async( UPLO, descA, sequence, &request );

/* Solve: */
CHAMELEON_dpotrs_Tile_Async( UPLO, descA, descX, sequence, &request);

/* Synchronization barrier (the runtime ensures that all submitted tasks
 * have been terminated) */
RUNTIME_barrier(chamctxt);
/* Ensure that all data processed on the gpus we are depending on are back
 * in main memory */
RUNTIME_desc_getoncpu(descA);
RUNTIME_desc_getoncpu(descX);

status = sequence->status;

Here the sequence of dpotrf and dpotrs algorithms is processed without synchronization, so that some tasks of dpotrf and dpotrs can be executed concurrently, which can increase performance. The async interface is very similar to the tile one. It is only necessary to give two new objects, RUNTIME_sequence_t and RUNTIME_request_t, used to handle asynchronous function calls.
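
Note that the snippet above relies on internal helpers (chameleon_sequence_create, RUNTIME_barrier). A sketch of the same pipeline written only with the public routines listed later in the Sequences and Descriptor sections could be:

RUNTIME_sequence_t *sequence = NULL;
RUNTIME_request_t   request  = CHAMELEON_REQUEST_INITIALIZER;
int status;

CHAMELEON_Sequence_Create( &sequence );

CHAMELEON_dpotrf_Tile_Async( UPLO, descA, sequence, &request );
CHAMELEON_dpotrs_Tile_Async( UPLO, descA, descX, sequence, &request );

/* Flush the descriptors so that their data come back to main memory,
 * then wait for the completion of all submitted tasks */
CHAMELEON_Desc_Flush( descA, sequence );
CHAMELEON_Desc_Flush( descX, sequence );
CHAMELEON_Sequence_Wait( sequence );

status = sequence->status;
CHAMELEON_Sequence_Destroy( sequence );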

potri_async.png

Figure 5: POTRI (POTRF, TRTRI, LAUUM) algorithm with and without synchronization barriers, courtesy of the PLASMA team.

6.2.6 Step5

Step5 shows how to set some important parameters. This program is a copy of Step4 but some additional parameters are given by the user. The parameters that can be set are:

  • number of Threads
  • number of GPUs

    The number of workers can be given as arguments to the executable with the --threads= and --gpus= options. It is important to notice that we assign one thread per GPU to optimize data transfers between the main memory and the device memory. The numbers of workers of each type, CPU and CUDA, must be given at CHAMELEON_Init.

    if ( iparam[IPARAM_THRDNBR] == -1 ) {
        get_thread_count( &(iparam[IPARAM_THRDNBR]) );
        /* reserve one thread per cuda device to optimize memory transfers */
        iparam[IPARAM_THRDNBR] -=iparam[IPARAM_NCUDAS];
    }
    NCPU = iparam[IPARAM_THRDNBR];
    NGPU = iparam[IPARAM_NCUDAS];
    /* initialize CHAMELEON with main parameters */
    CHAMELEON_Init( NCPU, NGPU );
    
  • matrix size
  • number of right-hand sides
  • block (tile) size

    The problem size is given with the --n= and --nrhs= options. The tile size is given with the option --nb=. These parameters are required to create descriptors. The tile size NB is a key parameter for performance since it defines the granularity of tasks. If NB is too large compared to N, there are few tasks to schedule; if the number of workers is large, this limits parallelism. On the contrary, if NB is too small (i.e. many small tasks), workers may not be fed properly and the runtime system operations may represent a substantial overhead. A trade-off has to be found, depending on many parameters: problem size, algorithm (which drives the data dependencies), architecture (number of workers, workers speed, workers uniformity, memory bus speed). By default it is set to 128. Do not hesitate to play with this parameter and compare performance on your machine.

  • inner-blocking size

    The inner-blocking size is given with the option --ib=. This parameter is used by kernels (optimized algorithms applied on tiles) to perform subsequent operations with a data block size that fits the cache of the workers. The parameters NB and IB can be set with the CHAMELEON_Set function:

    CHAMELEON_Set(CHAMELEON_TILE_SIZE,        iparam[IPARAM_NB] );
    CHAMELEON_Set(CHAMELEON_INNER_BLOCK_SIZE, iparam[IPARAM_IB] );
    

6.2.7 Step6

This program is a copy of Step5 with some additional parameters to be set for the data distribution. To use this program properly, CHAMELEON must use the StarPU runtime system and the MPI option must be activated at configure time. The data distribution used here is 2-D block-cyclic, see for example ScaLAPACK for an explanation. The user can enter the parameters of the distribution grid at execution with the --p= option. Example using OpenMPI on four nodes with one process per node:

mpirun -np 4 ./step6 --n=10000 --nb=320 --ib=64 --threads=8 --gpus=2 --p=2

In this program we use the tile data layout from PLASMA so that the call

CHAMELEON_Desc_Create(&descA, NULL, ChamRealDouble,
                      NB, NB, NB*NB, N, N,
                      0, 0, N, N,
                      GRID_P, GRID_Q);

is equivalent to the following call

CHAMELEON_Desc_Create_User(&descA, NULL, ChamRealDouble,
                           NB, NB, NB*NB, N, N,
                           0, 0, N, N,
                           GRID_P, GRID_Q,
                           chameleon_getaddr_ccrb,
                           chameleon_getblkldd_ccrb,
                           chameleon_getrankof_2d, NULL);

the functions chameleon_getaddr_ccrb, chameleon_getblkldd_ccrb, and chameleon_getrankof_2d being the ones used in Desc_Create. It is interesting to notice that the code is almost the same as in Step5. The only additional information to give is the way tiles are distributed, through the third function given to CHAMELEON_Desc_Create_User. Here, because we have made experiments only with a 2-D block-cyclic distribution, we have the parameters P and Q in the interface of Desc_Create, but they make sense only for a 2-D block-cyclic distribution, i.e. when the chameleon_getrankof_2d function is used. Of course the interface could be used with other distributions, the two parameters then describing no longer a 2-D block-cyclic grid but another distribution. The last parameter, void* get_rankof_arg, of CHAMELEON_Desc_Create_User can be used to pass custom data to the get_rankof function. A sketch of such a function is given below.
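
As an illustration, a hypothetical get_rankof callback reproducing a P x Q 2-D block-cyclic distribution could look like this (grid_t and the way the grid is read back from get_rankof_arg are assumptions of this sketch):

typedef struct { int p, q; } grid_t;  /* hypothetical grid description */

static int my_getrankof_2d( const CHAM_desc_t *desc, int m, int n )
{
    /* desc->get_rankof_arg is assumed to carry the grid_t* given as the
     * last argument of CHAMELEON_Desc_Create_User */
    const grid_t *grid = (const grid_t *)desc->get_rankof_arg;

    /* tile (m, n) belongs to process (m mod P) * Q + (n mod Q) */
    return ( m % grid->p ) * grid->q + ( n % grid->q );
}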

6.2.8 Step7

This program is a copy of step6 with some additional calls to build a matrix from within Chameleon using a function provided by the user. This can be seen as a replacement for functions like CHAMELEON_dplgsy_Tile(), which can be used to fill the matrix with random data, CHAMELEON_dLapack_to_Tile(), which fills the matrix with data stored in a LAPACK-like buffer, or CHAMELEON_Desc_Create_User(), which can be used to describe an arbitrary tile matrix structure. In this example, the build callback functions are just wrappers around CORE_xxx() functions, so the output of the step7 program should be identical to that of step6. The difference is that the function used to fill the tiles is provided by the user, and this approach is therefore much more flexible.

The new function to understand is CHAMELEON_map_Tile, e.g.

struct data_pl data_A={(double)N, 51};
CHAMELEON_map_Tile(ChamW, ChamUpperLower, descA, Cham_map_plgsy, (void*)&data_A);

The idea here is to let Chameleon fill the matrix data in a task-based fashion (in parallel) by using a function given by the user. First, the user has to give the access mode to the matrix among ChamR, ChamW, and ChamRW, depending on the kind of operations the callback function needs to perform on the tiles. In our example here, we fill the matrix with random values for the first time, so we use the access mode ChamW. Second, the user should define whether all the blocks must be entirely filled or just the upper/lower part, with, e.g., ChamUpperLower. We still rely on the same structure CHAM_desc_t, which must be initialized with the proper parameters, by calling for example CHAMELEON_Desc_Create. Then comes the pointer to the user's function. Finally, the last parameter is an opaque pointer used to let the user pass some extra data to his function.

6.3 Using custom data distributions

6.3.1 Interface

It is possible to provide custom data distributions to Chameleon, to go beyond the 2D block-cyclic distributions. A generic interface is provided with the functions chameleon_getrankof_custom_init, chameleon_getrankof_custom_destroy and chameleon_getrankof_custom, with the following signatures:

int chameleon_getrankof_custom_init( custom_dist_t **custom_dist,
				     const char     *dist_file );

int chameleon_getrankof_custom_destroy( custom_dist_t **dist );

int chameleon_getrankof_custom( const CHAM_desc_t *desc, int m, int n );

The first function is used to read a custom distribution from an external file, whose name is provided in the dist_file argument. The file format is described below. The second function can be used to destroy the custom_dist_t pointer when it is no longer useful. The last function should be used as the get_rankof argument of CHAMELEON_Desc_Create_User, together with the custom distribution obtained from chameleon_getrankof_custom_init. The typical usage is the following:

custom_dist_t* custom_dist;
chameleon_getrankof_custom_init( &custom_dist, "filename" );
CHAMELEON_Desc_Create_User(&descA, NULL, ChamRealDouble,
			   NB, NB, NB*NB, N, N,
			   0, 0, N, N,
			   CHAMELEON_Comm_size(), 1,
			   chameleon_getaddr_ccrb,
			   chameleon_getblkldd_ccrb,
			   chameleon_getrankof_custom, custom_dist);
/* Use the descriptor */
CHAMELEON_Desc_Destroy(&descA);
chameleon_getrankof_custom_destroy(&custom_dist);

Since we do not use a 2D block-cyclic distribution, the values of P and Q have no importance in CHAMELEON_Desc_Create_User. However, make sure that the product of P and Q is equal to the number of processes by using the couple (CHAMELEON_Comm_size(), 1) as a replacement for (P, Q).

6.3.2 File format

The custom distribution is provided as a pattern that can have any dimension and which is repeated all over the matrix. The file format expected by chameleon_getrankof_custom_init is a simple text format with space-separated integer values. The first two values represent the size of the pattern (number of rows \(m_d\) and number of columns \(n_d\)). Then, the function expects \(m_d * n_d\) values, where each value is the index of the process that should handle the corresponding tile. For example, the following file content would result in a 2D block-cyclic distribution with P=2 and Q=3 (it is not necessary to skip lines, but it can make the file more readable):

2 3
0 1 2
3 4 5

6.4 List of available routines

6.4.1 Linear Algebra routines

We list the linear algebra routines of the form CHAMELEON_name[_Tile[_Async]] (name follows the LAPACK naming scheme, see http://www.netlib.org/lapack/lug/node24.html) that can be used with the Chameleon library. For details about these functions please refer to the doxygen documentation. name can be one of the following:

  • BLAS 2/3 routines
    • gemm: matrix matrix multiply and addition
    • hemm: gemm with A Hermitian
    • herk: rank k operations with A Hermitian
    • her2k: rank 2k operations with A Hermitian
    • lauum: computes the product U * U' or L' * L, where the triangular factor U or L is stored in the upper or lower triangular part of the array A
    • symm: gemm with A symmetric
    • syrk: rank k operations with A symmetric
    • syr2k: rank 2k with A symmetric
    • trmm: gemm with A triangular
  • Triangular solving routines
    • trsm: computes triangular solve
    • trsmpl: performs the forward substitution step of solving a system of linear equations after the tile LU factorization of the matrix
    • trsmrv:
    • trtri: computes the inverse of a complex upper or lower triangular matrix A
  • LL' (Cholesky) routines
    • posv: linear systems solving using Cholesky factorization
    • potrf: Cholesky factorization
    • potri: computes the inverse of a complex Hermitian positive definite matrix A using the Cholesky factorization A
    • potrimm:
    • potrs: linear systems solving using existing Cholesky factorization
    • sysv: linear systems solving using Cholesky decomposition with A symmetric
    • sytrf: Cholesky decomposition with A symmetric
    • sytrs: linear systems solving using existing Cholesky decomposition with A symmetric
  • LU routines
    • gesv_incpiv: linear systems solving with LU factorization and partial pivoting
    • gesv_nopiv: linear systems solving with LU factorization and without pivoting
    • getrf_incpiv: LU factorization with partial pivoting
    • getrf_nopiv: LU factorization without pivoting
    • getrs_incpiv: linear systems solving using existing LU factorization with partial pivoting
    • getrs_nopiv: linear systems solving using existing LU factorization without pivoting
  • QR/LQ routines
    • gelqf: LQ factorization
    • gelqf_param: gelqf with hqr
    • gelqs: computes a minimum-norm solution min || A*X - B || using the LQ factorization
    • gelqs_param: gelqs with hqr
    • gels: uses QR or LQ factorization to solve an overdetermined or underdetermined linear system with a full-rank matrix
    • gels_param: gels with hqr
    • geqrf: QR factorization
    • geqrf_param: geqrf with hqr
    • geqrs: computes a minimum-norm solution min || A*X - B || using the QR factorization
    • hetrd: reduces a complex Hermitian matrix A to real symmetric tridiagonal form S
    • geqrs_param: geqrs with hqr
    • tpgqrt: generates a partial Q matrix formed with a blocked QR factorization of a "triangular-pentagonal" matrix C, which is composed of an unused triangular block and a pentagonal block V, using the compact representation for Q. See tpqrt to generate V
    • tpqrt: computes a blocked QR factorization of a "triangular-pentagonal" matrix C, which is composed of a triangular block A and a pentagonal block B, using the compact representation for Q
    • unglq: generates an M-by-N matrix Q with orthonormal rows, which is defined as the first M rows of a product of the elementary reflectors returned by CHAMELEON_zgelqf
    • unglq_param: unglq with hqr
    • ungqr: generates an M-by-N matrix Q with orthonormal columns, which is defined as the first N columns of a product of the elementary reflectors returned by CHAMELEON_zgeqrf
    • ungqr_param: ungqr with hqr
    • unmlq: overwrites C with Q*C or C*Q, or equivalent operations with transposition or conjugation of C (see doxygen documentation)
    • unmlq_param: unmlq with hqr
    • unmqr: similar to unmlq (see doxygen documentation)
    • unmqr_param: unmqr with hqr
  • EVD/SVD
    • gesvd: singular value decomposition
    • heevd: eigenvalues/eigenvectors computation with A Hermitian
  • Specific Matrix transformation for Data Analysis
    • cesca: centered-scaled matrix transformation, pretreatment algorithm for Principal Component Analysis
    • gram: Gram matrix transformation, pretreatment algorithm for Multidimensional Scaling
  • Extra routines
    • Norms
      • lange: compute norm of a matrix (Max, One, Inf, Frobenius)
      • lanhe: lange with A Hermitian
      • lansy: lange with A symmetric
      • lantr: lange with A triangular
    • Random matrices generation
      • plghe: generate a random Hermitian matrix
      • plgsy: generate a random symmetric matrix
      • plgtr: generate a random trapezoidal matrix
      • plrnt: generate a random matrix
      • plrnk: generate a random matrix of rank K with K <= min(M,N)
    • Others
      • geadd: general matrix matrix addition
      • lacpy: copy matrix into another
      • lascal: scale a matrix
      • laset: copy the triangular part of a matrix into another, set a value for the diagonal and off-diagonal part
      • tradd: trapezoidal matrices addition
    • Map functions
      • map: apply a user operator on each tile of the matrix

    In addition, all BLAS 3 routines (gemm, hemm, her2k, herk, lauum, symm, syr2k, syrk, trmm, trsm) and the LAPACK routines lacpy, lange, lanhe, lansy, lantr, laset, posv, potrf, potri, potrs, trtri can be called using an equivalent of the (C)BLAS/LAPACK(E) API. The parameters are the same and the user just has to prepend CHAMELEON_ to the standard name of the routine. For example, in C:

    CHAMELEON_Init(4,0);
    CHAMELEON_cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
    		      N, NRHS, N, 1.0, A, N, X, N, -1.0, B, N);
    CHAMELEON_Finalize();
    

    In Fortran, the function names are, for example, CHAMELEON_blas_dgemm instead of DGEMM and CHAMELEON_lapack_dposv instead of DPOSV.

6.4.2 Options routines

Enable CHAMELEON feature.

int CHAMELEON_Enable  (CHAMELEON_enum option);

Features that can be enabled/disabled:

  • CHAMELEON_WARNINGS: printing of warning messages,
  • CHAMELEON_AUTOTUNING: autotuning for tile size and inner block size (inactive),
  • CHAMELEON_GENERATE_TRACE: enable/start the trace generation,
  • CHAMELEON_GENERATE_STATS: enable/start the kernel statistics,
  • CHAMELEON_PROGRESS: to print a progress status,
  • CHAMELEON_GEMM3M: to enable the use of the gemm3m BLAS function.

Disable CHAMELEON feature.

int CHAMELEON_Disable (CHAMELEON_enum option);

Symmetric to CHAMELEON_Enable.

Set CHAMELEON parameter.

int CHAMELEON_Set     (CHAMELEON_enum param, int  value);

Parameters to be set:

  • CHAMELEON_TILE_SIZE: matrix tile size,
  • CHAMELEON_INNER_BLOCK_SIZE: tile inner block size,
  • CHAMELEON_HOUSEHOLDER_MODE: type of Householder trees (FLAT or TREE),
  • CHAMELEON_HOUSEHOLDER_SIZE: size of the groups in Householder trees,
  • CHAMELEON_TRANSLATION_MODE: related to CHAMELEON_Lapack_to_Tile, see ztile.c.

Get value of CHAMELEON parameter.

int CHAMELEON_Get     (CHAMELEON_enum param, int *value);
  • Alternatively, Chameleon can also be configured through environment variables.
    • CHAMELEON_GEMM_ALGO gives the possibility to switch among multiple variants of the GEMM algorithm. These variants are GENERIC, the generic variant that should work with any configuration; SUMMA_C, which works for a 2D block-cyclic distribution of the matrices A, B, and C with a stationary C; and SUMMA_A and SUMMA_B, SUMMA variants of the algorithm that work for any distribution with, respectively, A or B stationary. Note that the last two variants are only available with the StarPU runtime backend.
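
As a brief sketch of these routines, one could for instance enable the progress output, tune the tile size, and read it back:

CHAMELEON_Enable( CHAMELEON_PROGRESS );     /* print a progress status */
CHAMELEON_Set( CHAMELEON_TILE_SIZE, 320 );  /* use 320x320 tiles       */

int nb;
CHAMELEON_Get( CHAMELEON_TILE_SIZE, &nb );  /* nb is now 320           */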

6.4.3 Auxiliary routines

Reports CHAMELEON version number.

int CHAMELEON_Version        (int *ver_major, int *ver_minor, int *ver_micro);

Initialize CHAMELEON: initialize some parameters, initialize the runtime and/or MPI.

int CHAMELEON_Init           (int nworkers, int ncudas);

Finalize CHAMELEON: free some data and finalize the runtime and/or MPI.

int CHAMELEON_Finalize       (void);

Suspend the CHAMELEON runtime from polling for new tasks, to avoid useless CPU consumption when no tasks have to be executed by the CHAMELEON runtime system.

int CHAMELEON_Pause          (void);

Symmetrical call to CHAMELEON_Pause, used to resume the workers polling for new tasks.

int CHAMELEON_Resume         (void);

Return the MPI rank of the calling process.

int CHAMELEON_My_Mpi_Rank    (void);

Return the size of the distributed computation.

int CHAMELEON_Comm_size( int *size )

Return the rank of the distributed computation.

int CHAMELEON_Comm_rank( int *rank )

Prepare the distributed processes for computation.

int CHAMELEON_Distributed_start(void)

Clean the distributed processes after computation.

int CHAMELEON_Distributed_stop(void)

Return the number of CPU workers initialized by the runtime.

int CHAMELEON_GetThreadNbr()

Conversion from LAPACK layout to tile layout.

int CHAMELEON_Lapack_to_Tile (void *Af77, int LDA, CHAM_desc_t *A);

Conversion from tile layout to LAPACK layout.

int CHAMELEON_Tile_to_Lapack (CHAM_desc_t *A, void *Af77, int LDA);

6.4.4 Descriptor routines

Create matrix descriptor, internal function.

int CHAMELEON_Desc_Create(CHAM_desc_t **desc, void *mat, cham_flttype_t dtyp,
		      int mb, int nb, int bsiz, int lm, int ln,
		      int i, int j, int m, int n, int p, int q);

Create matrix descriptor, user function.

int CHAMELEON_Desc_Create_User(CHAM_desc_t **desc, void *mat, cham_flttype_t dtyp,
			   int mb, int nb, int bsiz, int lm, int ln,
			   int i, int j, int m, int n, int p, int q,
			   void* (*get_blkaddr)( const CHAM_desc_t*, int, int),
			   int (*get_blkldd)( const CHAM_desc_t*, int ),
			   int (*get_rankof)( const CHAM_desc_t*, int, int ),
			   void* get_rankof_arg);

Create matrix descriptor for tiled matrix which may not fit memory.

int CHAMELEON_Desc_Create_OOC(CHAM_desc_t **descptr, cham_flttype_t dtyp, int mb, int nb, int bsiz,
			  int lm, int ln, int i, int j, int m, int n, int p, int q);

User's function version of CHAMELEON_Desc_Create_OOC.

int CHAMELEON_Desc_Create_OOC_User(CHAM_desc_t **descptr, cham_flttype_t dtyp, int mb, int nb, int bsiz,
			       int lm, int ln, int i, int j, int m, int n, int p, int q,
			       int (*get_rankof)( const CHAM_desc_t*, int, int ),
			       void* get_rankof_arg);

Destroys matrix descriptor.

int CHAMELEON_Desc_Destroy (CHAM_desc_t **desc);

Ensures that all data of the descriptor are up-to-date.

int CHAMELEON_Desc_Acquire (CHAM_desc_t  *desc);

Release the data of the descriptor acquired by the application. Should be called if CHAMELEON_Desc_Acquire has been called on the descriptor and you do not need to access its data anymore.

int CHAMELEON_Desc_Release (CHAM_desc_t  *desc);

Ensure that all data are up-to-date in main memory (even if some tasks have been processed on GPUs).

int CHAMELEON_Desc_Flush(CHAM_desc_t  *desc, RUNTIME_sequence_t *sequence);

6.4.5 Sequences routines

Create a sequence.

int CHAMELEON_Sequence_Create  (RUNTIME_sequence_t **sequence);

Destroy a sequence.

int CHAMELEON_Sequence_Destroy (RUNTIME_sequence_t *sequence);

Wait for the completion of a sequence.

int CHAMELEON_Sequence_Wait    (RUNTIME_sequence_t *sequence);

Terminate a sequence.

int CHAMELEON_Sequence_Flush(RUNTIME_sequence_t *sequence, RUNTIME_request_t *request)

6.5 Using the CHAMELEON_PARALLEL_WORKER interface.

The CHAMELEON_PARALLEL_WORKER interface is an extension, only available with the StarPU runtime system, that allows multi-threaded kernels to be run concurrently.

A StarPU parallel worker, previously called a cluster, is a set of workers which execute a single parallel task (see StarPU Documentation).

To use this functionality:

  • StarPU must be compiled with the configure option --enable-parallel-worker
  • Chameleon automatically detects whether the StarPU parallel workers are available and can exploit them, but you need to force Chameleon to be linked with a multi-threaded BLAS library if you want the parallel workers to be able to make parallel BLAS calls. To do that, you must add -DCHAMELEON_KERNELS_MT=ON to your cmake line.

Below are some examples of using the Chameleon/StarPU couple to run parallel tasks concurrently. For now, this is only available for a small subset of the tasks used in the Cholesky decomposition (POTRF, TRSM, SYRK, HERK, and GEMM), but all other algorithms using these kernels benefit from it.

6.5.1 Environment variables to configure the parallel workers

  • CHAMELEON_PARALLEL_WORKER_LEVEL=hardware-level[:number-of-parallel-workers]

Specify the number of parallel workers per hardware-level. The default value is 1. Note that hardware-level must correspond to a hwloc machine level type (hwloc_obj_type_t), e.g.: L2, L3, SOCKET, MACHINE.

  • CHAMELEON_PARALLEL_WORKER_SHOW: when defined, the contents of the parallel workers are displayed.

6.5.2 Limitations

For now, there is still an issue of bad performance when the lws scheduler is used with the parallel workers.

6.5.3 Examples

In the following examples, STARPU_MAIN_THREAD_BIND is set to 1 to bind the main thread of StarPU to a dedicated CPU, subtracted from the CPU workers. This avoids using a whole parallel worker for the task submission.

The machine has 64 CPUs. One is dedicated to the task submission, and two CPUs are dedicated to running the GPUs.

lstopo-sirocco24.png

Figure 6: lstopo-sirocco24

  1. Example 1: Define a parallel worker per L3 cache (sirocco24)
    • Here we ask StarPU to create 1 parallel worker per L3 cache. The last parallel worker does not have all the CPUs of the last L3 cache, as there are 3 dedicated CPUs.
    CHAMELEON_PARALLEL_WORKER_LEVEL=L3 \
    CHAMELEON_PARALLEL_WORKER_SHOW=1 \
    STARPU_MAIN_THREAD_BIND=1 \
    STARPU_CALIBRATE=1 \
    STARPU_SCHED=dmdar \
    STARPU_NWORKER_PER_CUDA=2 \
    STARPU_SILENT=1 \
    $PTCHAMELEON/chameleon/build/testing/chameleon_dtesting -o potrf -n 59520 -b 1440:3000:480  -g 2
    
    Number of parallel workers created: 8
    Parallel worker 0 contains the following logical indexes:
        0 1 2 3 4 5 6 7
    Parallel worker 1 contains the following logical indexes:
        8 9 10 11 12 13 14 15
    Parallel worker 2 contains the following logical indexes:
        16 17 18 19 20 21 22 23
    Parallel worker 3 contains the following logical indexes:
        24 25 26 27 28 29 30 31
    Parallel worker 4 contains the following logical indexes:
        32 33 34 35 36 37 38 39
    Parallel worker 5 contains the following logical indexes:
        40 41 42 43 44 45 46 47
    Parallel worker 6 contains the following logical indexes:
        48 49 50 51 52 53 54 55
    Parallel worker 7 contains the following logical indexes:
        56 57 58 59 60
    Id;Function;threads;gpus;P;Q;mtxfmt;nb;uplo;n;lda;seedA;tsub;time;gflops
    0;dpotrf;61;2;1;1;0;1440;121;59520;59520;846930886;0.000000e+00;3.282047e+00;2.141577e+04
    1;dpotrf;61;2;1;1;0;1920;121;59520;59520;1681692777;0.000000e+00;3.404408e+00;2.064605e+04
    2;dpotrf;61;2;1;1;0;2400;121;59520;59520;1714636915;0.000000e+00;3.427721e+00;2.050563e+04
    3;dpotrf;61;2;1;1;0;2880;121;59520;59520;1957747793;0.000000e+00;3.707147e+00;1.896001e+04
    
  2. Example 2: Define 2 parallel workers per socket (sirocco24)
    • Here we ask StarPU to create 2 parallel workers per socket. This ends up with workers 45 and 46 in different parallel workers even though they share the same L3 cache.

    CHAMELEON_PARALLEL_WORKER_LEVEL=socket:2 \
    CHAMELEON_PARALLEL_WORKER_SHOW=1 \
    STARPU_MAIN_THREAD_BIND=1 \
    STARPU_CALIBRATE=1 \
    STARPU_SCHED=dmdar \
    STARPU_NWORKER_PER_CUDA=2 \
    STARPU_SILENT=1 \
    $PTCHAMELEON/chameleon/build/testing/chameleon_dtesting -o potrf -n 59520 -b 1440:3000:480  -g 2
    
    Number of parallel workers created: 4
    Parallel worker 0 contains the following logical indexes:
        0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Parallel worker 1 contains the following logical indexes:
        16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
    Parallel worker 2 contains the following logical indexes:
        32 33 34 35 36 37 38 39 40 41 42 43 44 45
    Parallel worker 3 contains the following logical indexes:
        46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
    Id;Function;threads;gpus;P;Q;mtxfmt;nb;uplo;n;lda;seedA;tsub;time;gflops
    0;dpotrf;61;2;1;1;0;1440;121;59520;59520;846930886;0.000000e+00;3.256134e+00;2.158620e+04
    1;dpotrf;61;2;1;1;0;1920;121;59520;59520;1681692777;0.000000e+00;7.003285e+00;1.003637e+04
    2;dpotrf;61;2;1;1;0;2400;121;59520;59520;1714636915;0.000000e+00;8.816605e+00;7.972179e+03
    3;dpotrf;61;2;1;1;0;2880;121;59520;59520;1957747793;0.000000e+00;1.064581e+01;6.602370e+03
    

6.5.4 How-to for the plafrim users

# Root directory
PTCHAMELEON=~/PTCHAMELEON
mkdir $PTCHAMELEON
cd $PTCHAMELEON
git clone git@gitlab.inria.fr:starpu/starpu.git
git clone --recursive git@gitlab.inria.fr:solverstack/chameleon.git
  1. Setup on sirocco16 (2 cpu intel + 2 v100)
    module load build/cmake/3.15.3  \
           linalg/mkl/2022.0.2      \
           trace/eztrace/1.1-8      \
           hardware/hwloc/2.7.0     \
           compiler/gcc/11.2.0      \
           compiler/cuda/11.6       \
           mpi/openmpi/4.0.2        \
           trace/fxt/0.3.14         \
           trace/eztrace/1.1-9      \
           language/python
    
    # Build StarPU
    cd $PTCHAMELEON/starpu
    ./autogen.sh
    mkdir build && cd build
    # In case you want to debug take the first line
    #../configure --enable-debug --enable-verbose --enable-parallel-worker --disable-opencl \
    #             --disable-build-doc --enable-maxcpus=64 --disable-socl \
    #             --prefix=$PTCHAMELEON/starpu/build/install --enable-fxt
    #
    ../configure --enable-parallel-worker --disable-opencl --disable-build-doc \
                 --enable-maxcpus=64 --disable-socl \
                 --prefix=$PTCHAMELEON/starpu/build/install
    make -j install
    source $PTCHAMELEON/starpu/build/install/bin/starpu_env
    
    # Build Chameleon
    cd  $PTCHAMELEON/chameleon
    mkdir build && cd build
    cmake .. -DBLA_VENDOR=Intel10_64lp -DCHAMELEON_KERNELS_MT=ON \
             -DCHAMELEON_ENABLE_EXAMPLE=OFF -DCHAMELEON_USE_CUDA=ON
    make -j
    
    # test
    STARPU_SILENT=1                       \
      STARPU_SCHED=dmdar                  \
      CHAMELEON_PARALLEL_WORKER_LEVEL=L3  \
      CHAMELEON_PARALLEL_WORKER_SHOW=1    \
      STARPU_MAIN_THREAD_BIND=1           \
      STARPU_CUDA_PIPELINE=2              \
      STARPU_NWORKER_PER_CUDA=4           \
      STARPU_CALIBRATE=1                  \
      $PTCHAMELEON/chameleon/build/testing/chameleon_dtesting -o potrf -n 59520 -b 960:3000:480 -g 2
    
  2. Setup on sirocco24 (2 cpu amd + 2 a100)

    Identical to sirocco16 except for the Intel MKL library:

    module load build/cmake/3.15.3      \
                linalg/mkl/2020_update4 \
                trace/eztrace/1.1-8     \
                hardware/hwloc/2.7.0    \
                compiler/gcc/11.2.0     \
                compiler/cuda/11.6      \
                mpi/openmpi/4.0.2       \
                trace/fxt/0.3.14        \
                trace/eztrace/1.1-9     \
                language/python
    

    Strangely, the execution requires the creation of links:

    cd $PTCHAMELEON
    for lib in libmkl_gf_lp64.so libmkl_gnu_thread.so libmkl_intel_lp64.so libmkl_sequential.so ;
    do
        ln -s  /cm/shared/modules/amd/rome/compiler/intel/2020_update4/mkl/lib/intel64/$lib $lib.2
    done
    
    LD_LIBRARY_PATH=$PTCHAMELEON:$LD_LIBRARY_PATH
    

7 Chameleon Performances on PlaFRIM

Chameleon commit: 8d2fdd45af53339796f0d23184d6d909adb808e1.

This section shows performance obtained on the PlaFRIM supercomputer.

See the characteristics page to get details about the hardware.

See the script tools/bench/plafrim/run.sh to get details about the environment (Guix, Slurm, etc.) and the build.

Chameleon is run this way:

mpiexec -np $nmpi $CHAMELEON_BUILD/testing/chameleon_${precision}testing -o ${algorithm} -P $p -t $nthr -g $ngpu -m $m -n $n -k $k -b $b
  • runtime : starpu
  • precision : s or d for single or double precision
  • algorithm : gemm or potrf or geqrf_hqr
  • nmpi = p x p
  • nthr : depends on the node
  • ngpu : depends on the node
  • m = n = k
  • b : depends on the node

7.1 bora (36 CPUs) nodes

  • nmpi = 1, 4, 9
  • 2D block cyclic parameters : PxQ = 1x1, 2x2 and 3x3
  • Number of threads (t) = 34, one CPU being dedicated to the scheduler and another to MPI communications
  • Number of GPUs = 0
  • Tile Size (b) = 280

7.1.1 CPU times

chameleon_plafrim_bora_time_openmpi.png

Figure 7: Performance in CPU time of GEMM, POTRF and QR on bora nodes

7.1.2 GFLOP/s

chameleon_plafrim_bora_perf_openmpi.png

Figure 8: Performance in GFlop/s of GEMM, POTRF and QR on bora nodes

7.2 sirocco [14-17] (32 CPUs + 2 GPUs V100) nodes

  • nmpi = 1
  • 2D block cyclic parameters : PxQ = 1x1
  • Number of threads (t) = 29, one CPU being dedicated to the scheduler and two others to the 2 GPUs
  • Number of GPUs = 2
  • Tile Size (b) = 1600

    chameleon_plafrim_sirocco.png

    Figure 9: Performance in CPU time of GEMM, POTRF and QR on sirocco nodes
