FAQ


Table of Contents

  1. Getting started with StarPU and Chameleon
  2. Debugging StarPU and Chameleon within Guix using GDB
  3. Examples of installation and usage of Chameleon, PaStiX and Composyx using Guix on PlaFRIM
  4. Help CMake in finding specific BLAS/LAPACK libraries
  5. Getting started with Chameleon on Grid5000

Getting started with StarPU and Chameleon

  • StarPU: a task-based runtime system.
  • Chameleon: a dense linear algebra library built on top of StarPU.

Both are available as Guix (in the Guix-HPC channel) or Spack packages.
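For example, to get a shell with both packages through Guix (a sketch, assuming the Guix-HPC channel is configured; the Spack package name may differ depending on your Spack version):

guix shell starpu chameleon
# or, with Spack
spack install chameleon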

Building StarPU

sudo apt install build-essential autoconf automake libtool-bin libhwloc-dev libmkl-dev pkg-config
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
./autogen.sh
mkdir build
cd build
../configure --prefix=$HOME/dev/builds/starpu --disable-opencl --disable-cuda --disable-fortran
# see https://files.inria.fr/starpu/testing/master/doc/html/CompilationConfiguration.html
make -j20 install

Adjust environment variables (for example in your .bash_profile):

export PATH=$HOME/dev/builds/starpu/bin:${PATH}
export LD_LIBRARY_PATH=$HOME/dev/builds/starpu/lib/:${LD_LIBRARY_PATH}
export PKG_CONFIG_PATH=$HOME/dev/builds/starpu/lib/pkgconfig:${PKG_CONFIG_PATH}

After sourcing .bash_profile, you should be able to execute:

starpu_machine_display

This shows which hardware is available on your local machine.

Full information on how to build StarPU is available here.

Building Chameleon

sudo apt install cmake libmkl-dev
git clone --recurse-submodules https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/dev/builds/chameleon/ -DCHAMELEON_USE_MPI=OFF -DCHAMELEON_ENABLE_EXAMPLE=OFF -DCHAMELEON_ENABLE_TESTING=ON -DCHAMELEON_PREC_C=OFF -DCHAMELEON_PREC_Z=OFF -DBLA_VENDOR=Intel10_64lp_seq
make -j20 install

$HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H # should print some nice results

Distributed version

StarPU should have detected MPI during its build (check the configure output).
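
A quick way to check (a sketch, assuming a 1.4-series StarPU and the PKG_CONFIG_PATH set above):

pkg-config --exists starpumpi-1.4 && echo "StarPU was built with MPI support"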

For Chameleon, you have to add the options -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON to the cmake command line and build again.
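
For example, from the build directory used above (the other options are kept in the CMake cache):

cmake .. -DCHAMELEON_USE_MPI=ON -DCHAMELEON_USE_MPI_DATATYPES=ON
make -j20 install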

The common way of using distributed StarPU is to launch one MPI/StarPU process per compute node, and then StarPU takes care of feeding all available cores with tasks. You can run:

mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H

This will execute a Cholesky decomposition (potrf) with 4 MPI processes (-np 4) and present results in a human-readable way (-H; for a CSV-like output, omit this option).

You can measure performance for different matrix sizes with the option -n 3200:32000:3200 (from matrix size 3200 to 32000 with a step of 3200).

You can do several iterations of the same matrix size with --niter 2.
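
For example, combining these options:

mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H -n 3200:32000:3200 --niter 2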

Basic performance tuning

A good matrix distribution is a square 2D block-cyclic one; for this, add -P x, where x should be (close to) the square root of the number of MPI processes (i.e., you should use a square number of compute nodes).
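
For instance, with 4 MPI processes:

mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H -n 32000 -P 2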

To get better results, you should bind the main thread:

export STARPU_MAIN_THREAD_BIND=1

Set the number of workers (CPU cores executing tasks) to the number of cores available on the compute node minus one:

export STARPU_NCPU=15

You should not use hyperthreads.

To find a good matrix size range, just run with sizes of, say, 3200:50000:3200, plot the obtained GFlop/s, and see at which size you reach the plateau.
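
For example, keeping the CSV-like output (no -H) so that the GFlop/s column is easy to extract and plot:

mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -n 3200:50000:3200 > potrf.csv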

Misc

Have a look at Section 3 about thread binding in this presentation.

Debugging StarPU and Chameleon within Guix using GDB

Get debug information for Chameleon (not StarPU, because of a bug), without all the Chameleon sources

With Guix, we should normally get debugging information for Chameleon and StarPU by adding the "debug" outputs to the guix shell command:

guix shell --pure chameleon chameleon:debug starpu:debug coreutils gdb xterm -- /bin/bash --norc

The debug info should then be installed in $GUIX_ENVIRONMENT/lib/debug, so we can add it to the GDB debug-file directory as follows:

echo "set debug separate-debug-file on" > gdbcommands
echo "set debug-file-directory $GUIX_ENVIRONMENT/lib/debug" >> gdbcommands
echo "start" >> gdbcommands
echo "b chameleon_starpu_init" >> gdbcommands
echo "b starpu_mpi_init_conf" >> gdbcommands
echo "continue" >> gdbcommands

Then run with GDB:

gdb -ex 'source gdbcommands' --args chameleon_dtesting -o potrf

Here you will be able to debug Chameleon functions, inspect the stack, and print variables. Unfortunately, there are still difficulties getting debug info from the StarPU libraries because of a path mismatch in Guix, e.g.:

Looking for separate debug info (debug link) for /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3
  Trying /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.
  Trying /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/.debug/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.
  Trying /gnu/store/d5rd533nf3sn1hyq5ihxn6yw3liic6hk-profile/lib/debug//gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.

In addition, you don't have access to the source code. To add it, you can tell GDB where the sources are, for instance:

export SOURCES_CHAMELEON=`guix build --source chameleon`
guix shell --pure --preserve=SOURCES chameleon chameleon:debug starpu:debug coreutils gdb xterm -- /bin/bash --norc

echo "set debug separate-debug-file on" > gdbcommands
echo "set debug-file-directory $GUIX_ENVIRONMENT/lib/debug" >> gdbcommands
echo "set substitute-path /tmp/guix-build-chameleon-1.2.0.drv-0/source $SOURCES_CHAMELEON" >> gdbcommands
echo "start" >> gdbcommands
echo "b chameleon_starpu_init" >> gdbcommands
echo "continue" >> gdbcommands

gdb -ex 'source gdbcommands' --args chameleon_dtesting -o potrf

But unfortunately, many sources are still missing: all the files generated in the build directory (which has been dropped by Guix), which contain the algorithms for each specific precision (s, d, c, z).

Hence, in order to get full debug and source information, we suggest building Chameleon and StarPU yourself in a guix shell environment.

Build StarPU and Chameleon yourself in a guix shell

Build StarPU with Guix, adding debug symbols:

cd /tmp
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
guix shell --pure -D starpu -- /bin/bash --norc
./autogen.sh
rm -rf build && mkdir build && cd build
../configure --prefix=$PWD/install --enable-debug --disable-opencl
make -j20 install
exit

Build Chameleon with Guix, adding debug symbols:

cd /tmp
git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
guix shell --pure -D chameleon -- /bin/bash --norc
export PKG_CONFIG_PATH=/tmp/starpu/build/install/lib/pkgconfig:$PKG_CONFIG_PATH
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCHAMELEON_USE_MPI=ON
cmake --build build -j20
exit

Debug Chameleon using GDB

cd /tmp
guix shell --pure -D chameleon gdb xterm -- /bin/bash --norc

# to execute successive gdb commands
echo "b starpu_mpi_init_conf" > gdbcommands
echo "run" >> gdbcommands

# without mpi
gdb -ex 'source gdbcommands' --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf

# with mpi
mpiexec --oversubscribe -n 2 xterm -hold -e gdb -ex 'source gdbcommands' --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf -t 2

Example on PlaFRIM

 ssh -X plafrim

 # repeat the compilation steps of StarPU and Chameleon seen just above

 # reserve 2 nodes to use 4 MPI processes (1 per socket)
 salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=2 --threads-per-core=1

 # deploy the guix environment
 guix shell --pure --preserve='SLURM' -D chameleon gdb xterm slurm -- /bin/bash --norc

 # if OpenBLAS is used, you should set its number of threads to 1
 export OPENBLAS_NUM_THREADS=1

 # run with gdb: 4 xterm terminals will appear
 mpiexec --map-by socket xterm -hold -e gdb -ex run --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf -n 32000 -P 2

Examples of installation and usage of Chameleon, PaStiX and Composyx using Guix on PlaFRIM

See the COMPAS 2025 tutorial (in French).

Help CMake in finding specific BLAS/LAPACK libraries

Many solvers depend on BLAS/LAPACK functions and require linking with an already installed implementation, most of the time OpenBLAS (libopenblas.so), Intel MKL (libmkl_rt.so), BLIS/FLAME (libblis.so, libflame.so) or Netlib (libblas.so, liblapack.so). Usually, CMake finds them successfully when calling find_package(BLAS) and find_package(LAPACK), cf. FindBLAS, FindLAPACK.

If several implementations are installed on the system, set the CMake variable BLA_VENDOR to the preferred vendor.
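
For example, to force the sequential Intel MKL, as in the Chameleon build above:

cmake -B build -DBLA_VENDOR=Intel10_64lp_seq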

In some environments, you may have trouble finding the BLAS/LAPACK libraries, e.g. if they are not officially supported by the CMake module or because the module fails (find_package error). In this situation, try to specify the exact working link flags using the CMake variable BLAS_LIBRARIES.

For example, on the CINES Adastra supercomputer, you may want to use libsci from HPE:

cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
               -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn \
               -DBLAS_LIBRARIES="/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/lib/libsci_cray.so"

In some other cases, you may want to mix BLAS/LAPACK libraries, for example using BLIS as the optimized BLAS library and Netlib for all other interfaces (CBLAS, LAPACK, LAPACKE):

# We consider blis is already installed in $HOME/blis/install

# Install Netlib lapack, with C interfaces cblas and lapacke
git clone https://github.com/Reference-LAPACK/lapack.git
cd lapack
cmake -B build -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$PWD/install \
               -DCBLAS=ON -DLAPACKE=ON -DLAPACKE_WITH_TMG=ON -DUSE_OPTIMIZED_BLAS=ON \
               -DBLA_VENDOR=FLAME -DCMAKE_PREFIX_PATH="$HOME/blis/install"
cmake --build build -j5 --target install

# Configure Chameleon with this specific BLAS/LAPACK couple BLIS+Netlib
cd ~/chameleon
cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
  -DBLAS_LIBRARIES="$HOME/lapack/install/lib/liblapacke.so;$HOME/lapack/install/lib/liblapack.so;$HOME/lapack/install/lib/libcblas.so;$HOME/blis/install/lib/libblis.so"

# or you may also try as follows:
cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
               -DLAPACKE_LIBRARIES="$HOME/lapack/install/lib/liblapacke.so" \
               -DLAPACK_LIBRARIES="$HOME/lapack/install/lib/liblapack.so" \
               -DCBLAS_LIBRARIES="$HOME/lapack/install/lib/libcblas.so" \
               -DBLAS_LIBRARIES="$HOME/blis/install/lib/libblis.so"

Getting started with Chameleon on Grid5000

Getting started

Have a look at the official getting-started page: https://www.grid5000.fr/w/Getting_Started.

First, log in on a site, e.g. Grenoble:

ssh grenoble.g5k

Then, to reserve a node, use the oarsub command, with -I for an interactive shell session:

oarsub -q default -p dahu -l nodes=1,walltime=0:30:00 -I

  • -q default: the default queue, which can be used by all users.
  • -q abaca: a queue that may also be available, for example for Inria users.
  • -p dahu: specifies the cluster name; see https://www.grid5000.fr/w/Hardware for the list of available resources.
  • -l nodes=1,walltime=0:30:00: specifies the number of nodes and the duration of the reservation (default is 1h).
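
For example, a two-node, one-hour interactive reservation combining these options:

oarsub -q default -p dahu -l nodes=2,walltime=1:00:00 -I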

Here you are directly logged in on the first node of the reservation. You can try commands like:

lscpu # get hardware info
mpiexec hostname # one MPI process per core of the node, each displaying the hostname
mpiexec -n 2 hostname # limit the number of processes
mpiexec -npernode 1 hostname # only one process per node
exit

MPI tests

Read this page about MPI: https://www.grid5000.fr/w/Run_MPI_On_Grid%275000.

Interactive Mode:

ssh grenoble.g5k

oarsub -q default -p dahu -l nodes=2 -I

# Measure bandwidth with intel-mpi-benchmarks
guix shell --pure --preserve='OAR' openmpi@4 intel-mpi-benchmarks -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong

# the documentation suggests this to ensure proper network usage, but performance is similar -> not sure this is useful
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              -mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib \
                              $GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
exit

# Measure bandwidth and latency with osu-micro-benchmarks
guix shell --pure --preserve='OAR' openmpi@4 osu-micro-benchmarks -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
exit

Batch Script Mode

We can submit batch jobs as follows:

# Script Mode
oarsub -q default -p dahu -l nodes=2 -n test -O test.out -E test.out ~/guixenv.sh

oarstat -u # to check status of the job
cat test.out # result in test.out

guixenv.sh is a script to deploy the software environment:

guix shell openmpi@4 intel-mpi-benchmarks grep which -- ~/test.sh

test.sh contains the MPI programs to execute:

$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board $GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
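
Both scripts must be executable, e.g.:

chmod +x ~/guixenv.sh ~/test.sh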

Chameleon on grenoble.g5k for testing with MPI

In what follows, we will install Chameleon configured with Intel MKL for the BLAS/LAPACK kernels. To get this version, one needs to add the guix-hpc-non-free channel:

ssh grenoble.g5k

mkdir -p ~/.config/guix && cat > ~/.config/guix/channels.scm << 'EOF'
(cons (channel
        (name 'guix-hpc-non-free)
        (url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
      %default-channels)
EOF

then, update guix:

guix pull # may be faster adding --url=https://codeberg.org/guix/guix-mirror.git

and you will be able to see chameleon-mkl:

guix search chameleon-mkl

Run Chameleon+MKL GEMM on 2 nodes:

oarsub -q default -p dahu -l nodes=2 -I

guix shell --pure --preserve='OAR' chameleon-mkl -- /bin/bash --norc

# gemm C = A*B timing, matrix sizes m=n=25000, tile size b=500, 2 MPI processes (1 per node), 2D block-cyclic parameters p=1 q=2
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 25000 -b 500 -w -P 1
Id Function     threads gpus  P  Q mtxfmt  nb transA    transB        m     n     k   lda   ldb   ldc         alpha          beta       seedA       seedB       seedC          tsub          time        gflops
 0 dgemm             31    0  1  2      0 500 NoTrans   NoTrans   25000 25000 25000 25000 25000 25000 -1.935109e-01  3.817601e-01  1681692777  1714636915  1957747793  0.000000e+00  1.226693e+01  2.547499e+03
# i.e. 12.26s and 2547 GFlop/s

# the same but with 4 MPI processes, 1 per socket (Intel Xeon Gold 6130, 16 cores), 2D block-cyclic parameters p=2 q=2
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -machinefile $OAR_NODEFILE -npernode 2 --bind-to socket \
                              $GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 25000 -b 500 -w -P 2
Id Function     threads gpus  P  Q mtxfmt  nb transA    transB        m     n     k   lda   ldb   ldc         alpha          beta       seedA       seedB       seedC          tsub          time        gflops
 0 dgemm             15    0  2  2      0 500 NoTrans   NoTrans   25000 25000 25000 25000 25000 25000 -1.935109e-01  3.817601e-01  1681692777  1714636915  1957747793  0.000000e+00  1.213652e+01  2.574874e+03
# i.e. 12.13s and 2574 GFlop/s

Note that StarPU reserves one core for its own operations (scheduling); hence the 31 and 15 worker threads reported above.

Chameleon on lille.g5k for testing with GPUs

We will now experiment with GPU nodes, here chifflot nodes with 2 Nvidia P100 GPUs each. Remember to set up the guix-hpc-non-free channel in order to get the MKL and CUDA version of Chameleon.

ssh lille.g5k

oarsub -q default -p chifflot -l host=1 -I

# Using CUDA and sequential MKL
guix shell --pure --preserve='OAR' chameleon-cuda-mkl -- /bin/bash --norc
# i.e. guix shell --pure --preserve='OAR' chameleon-cuda --with-input=openblas=intel-oneapi-mkl -- /bin/bash --norc

$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so \
                              -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 4960,49600 -b 1240 -g 2
exit

# Using CUDA and multi-threaded MKL, see:
# https://solverstack.gitlabpages.inria.fr/chameleon/#interface-chameleon_parallel_worker
guix shell --pure --preserve='OAR' chameleon-cuda-mkl-mt -- /bin/bash --norc

$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so \
                              -x STARPU_MAIN_THREAD_BIND=1 -x MKL_NUM_THREADS=10 \
                              -x CHAMELEON_PARALLEL_WORKER_LEVEL=SOCKET -x CHAMELEON_PARALLEL_WORKER_SHOW=1 \
                              -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
                              $GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 4960,49600 -b 1240 -g 2
Number of parallel workers created: 2
Parallel worker 0 contains the following logical indexes:
        26 28 30 32 34 36 38 40 42 44
Parallel worker 1 contains the following logical indexes:
        4 6 8 10 12 14 16 18 20 22
 Id Function     threads gpus  P  Q mtxfmt  nb transA    transB        m     n     k   lda   ldb   ldc         alpha          beta       seedA       seedB       seedC          tsub          time        gflops
  0 dgemm             20    2  1  1      0 1240 NoTrans   NoTrans    4960  4960  4960  4960  4960  4960  4.892778e-01 -1.846424e-01  1649760492   596516649  1189641421  0.000000e+00  8.529406e-02  2.861253e+03
  1 dgemm             20    2  1  1      0 1240 NoTrans   NoTrans   49600 49600 49600 49600 49600 49600 -2.410269e-01  4.000452e-01   783368690  1102520059  2044897763  0.000000e+00  2.945459e+01  8.285564e+03