FAQ
Table of Contents
- Getting started with StarPU and Chameleon
- Debugging StarPU and Chameleon within Guix
- Examples of installation and usage of Chameleon, PaStiX and Composyx using Guix on PlaFRIM
- Help CMake in finding specific BLAS/LAPACK libraries
- Getting started with Chameleon on Grid5000
Getting started with StarPU and Chameleon
Both are available as Guix packages (from the Guix-HPC channel) or as Spack packages.
Building StarPU
sudo apt install libtool-bin libhwloc-dev libmkl-dev pkg-config
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
./autogen.sh
mkdir build
cd build
../configure --prefix=$HOME/dev/builds/starpu --disable-opencl --disable-cuda --disable-fortran
# see https://files.inria.fr/starpu/testing/master/doc/html/CompilationConfiguration.html
make -j20 install
Adjust environment variables (for example in your .bash_profile):
export PATH=$HOME/dev/builds/starpu/bin:${PATH}
export LD_LIBRARY_PATH=$HOME/dev/builds/starpu/lib/:${LD_LIBRARY_PATH}
export PKG_CONFIG_PATH=$HOME/dev/builds/starpu/lib/pkgconfig:${PKG_CONFIG_PATH}
After sourcing .bash_profile, you should be able to execute:
starpu_machine_display
which lists the hardware available on your local machine.
Full information on how to build StarPU is available here
Building Chameleon
sudo apt install cmake libmkl-dev
git clone --recurse-submodules https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/dev/builds/chameleon/ -DCHAMELEON_USE_MPI=OFF -DCHAMELEON_ENABLE_EXAMPLE=OFF -DCHAMELEON_ENABLE_TESTING=ON -DCHAMELEON_PREC_C=OFF -DCHAMELEON_PREC_Z=OFF -DBLA_VENDOR=Intel10_64lp_seq
make -j20 install
$HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H # should print some nice results
Distributed version
StarPU should have detected MPI at configure time.
For Chameleon, you have to add the options -DCHAMELEON_USE_MPI=ON
-DCHAMELEON_USE_MPI_DATATYPES=ON to the cmake command line and build
again.
The common way of using distributed StarPU is to launch one MPI/StarPU process per compute node; StarPU then takes care of feeding all available cores with tasks. You can run:
mpirun -np 4 $HOME/dev/builds/chameleon/bin/chameleon_stesting -o potrf -H
This will execute a Cholesky factorization (potrf) with 4 MPI
processes (-np 4) and present the results in a human-readable way
(-H; omit this option for a CSV-like output).
You can measure performance for different matrix sizes with the option
-n 3200:32000:3200 (from matrix size 3200 to 32000 with a step of
3200).
You can run several iterations for each matrix size with --niter 2.
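The start:stop:step range syntax can be previewed quickly with seq (GNU coreutils), which takes its arguments in start/step/stop order; a small sketch of the sizes the range above would cover:

```shell
# Expand the chameleon_stesting size range 3200:32000:3200 with seq
# to see which matrix sizes will actually be run.
seq 3200 3200 32000 | paste -sd' ' -
```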
Basic performance tuning
A good matrix distribution is the square 2D block-cyclic one; for this, add -P
x where x should be (close to) the square root of the number of MPI
processes (i.e., you should ideally use a square number of compute nodes).
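A small sketch of how -P could be derived from the process count (np=16 is just an example value; for non-square counts the integer square root gives a value "close to" the ideal):

```shell
# Pick -P as the integer square root of the MPI process count;
# the grid is P x Q with Q = np / P.
np=16
P=$(awk -v n="$np" 'BEGIN { printf "%d", sqrt(n) }')
Q=$(( np / P ))
echo "use -P $P (grid ${P}x${Q})"
```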
To get better results, you should bind the main thread:
export STARPU_MAIN_THREAD_BIND=1
Set the number of workers (CPU cores executing tasks) to the number of cores available on the compute node minus one:
export STARPU_NCPU=15
You should not use hyperthreads.
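Generalizing the hard-coded 15 above, a sketch deriving the value from the current machine with nproc (GNU coreutils); note that on a hyperthreaded node you would want the physical core count instead:

```shell
# Bind the main thread and leave one core free for StarPU's runtime.
# nproc reports the cores visible on the current machine.
export STARPU_MAIN_THREAD_BIND=1
export STARPU_NCPU=$(( $(nproc) - 1 ))
echo "STARPU_NCPU=$STARPU_NCPU"
```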
To find the relevant matrix size range, simply run with sizes such as
3200:50000:3200, plot the obtained Gflop/s, and see at which
size the curve reaches its plateau.
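If you prefer a quick textual check over plotting, a toy sketch with awk: given "size gflops" pairs (the numbers below are made up, not real measurements), report the first size where performance stops improving by more than 5% per step:

```shell
# Sample (fabricated) size/Gflop/s pairs, one per line.
cat > perf.dat <<'EOF'
3200 820
6400 1410
9600 1890
12800 2100
16000 2160
19200 2180
EOF
# Print the first size whose rate is within 5% of the previous one.
awk 'prev > 0 && $2 < prev*1.05 { print "plateau from size", $1; exit } { prev = $2 }' perf.dat
```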
Misc
Have a look at Section 3 about thread binding in this presentation
Debugging StarPU and Chameleon within Guix using GDB
Get debug information for Chameleon (not StarPU, because of a bug), without all the Chameleon sources
With Guix, we can normally get debugging information for Chameleon and StarPU by adding the "debug" outputs to the guix shell command:
guix shell --pure chameleon chameleon:debug starpu:debug coreutils gdb xterm -- /bin/bash --norc
The debug info is then installed in $GUIX_ENVIRONMENT/lib/debug, so we can add it to GDB's debug file directory as follows:
echo "set debug separate-debug-file on" > gdbcommands
echo "set debug-file-directory $GUIX_ENVIRONMENT/lib/debug" >> gdbcommands
echo "start" >> gdbcommands
echo "b chameleon_starpu_init" >> gdbcommands
echo "b starpu_mpi_init_conf" >> gdbcommands
echo "continue" >> gdbcommands
And run with gdb
gdb -ex 'source gdbcommands' --args chameleon_dtesting -o potrf
Here you will be able to debug Chameleon functions, inspect the stack, and print variables. Unfortunately, there are still difficulties getting debug info for the StarPU libraries because of a path mismatch coming from Guix, e.g.
Looking for separate debug info (debug link) for /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3
Trying /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.
Trying /gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/.debug/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.
Trying /gnu/store/d5rd533nf3sn1hyq5ihxn6yw3liic6hk-profile/lib/debug//gnu/store/3qd11j3xpi0vfqhdj9hqwl8fd5n28czj-starpu-1.4.7/lib/libstarpumpi-1.4.so.3.0.1.debug... no, unable to open.
In addition, you don't have access to the source code. To add it, you can tell GDB where the sources are, for instance:
export SOURCES_CHAMELEON=`guix build --source chameleon`
guix shell --pure --preserve=SOURCES_CHAMELEON chameleon chameleon:debug starpu:debug coreutils gdb xterm -- /bin/bash --norc
echo "set debug separate-debug-file on" > gdbcommands
echo "set debug-file-directory $GUIX_ENVIRONMENT/lib/debug" >> gdbcommands
echo "set substitute-path /tmp/guix-build-chameleon-1.2.0.drv-0/source $SOURCES_CHAMELEON" >> gdbcommands
echo "start" >> gdbcommands
echo "b chameleon_starpu_init" >> gdbcommands
echo "continue" >> gdbcommands
gdb -ex 'source gdbcommands' --args chameleon_dtesting -o potrf
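The repeated echo lines can equivalently be written as a single here-document (unquoted EOF, so $GUIX_ENVIRONMENT and $SOURCES_CHAMELEON expand when the file is created):

```shell
# Same gdbcommands file as above, written in one go.
cat > gdbcommands <<EOF
set debug separate-debug-file on
set debug-file-directory $GUIX_ENVIRONMENT/lib/debug
set substitute-path /tmp/guix-build-chameleon-1.2.0.drv-0/source $SOURCES_CHAMELEON
start
b chameleon_starpu_init
continue
EOF
```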
But unfortunately many sources are still missing: all those generated in the build directory (which Guix has dropped), i.e. the files implementing the algorithms for each specific precision (s, d, c, z).
Hence, to get full debugging and source information, we suggest building Chameleon and StarPU yourself inside a guix shell environment.
Build StarPU and Chameleon yourself in a guix shell
Build StarPU with Guix, adding debug symbols
cd /tmp
git clone https://gitlab.inria.fr/starpu/starpu.git
cd starpu
guix shell --pure -D starpu -- /bin/bash --norc
./autogen.sh
rm -rf build && mkdir build && cd build
../configure --prefix=$PWD/install --enable-debug --disable-opencl
make -j20 install
exit
Build Chameleon with Guix, adding debug symbols
cd /tmp
git clone --recursive https://gitlab.inria.fr/solverstack/chameleon.git
cd chameleon
guix shell --pure -D chameleon -- /bin/bash --norc
export PKG_CONFIG_PATH=/tmp/starpu/build/install/lib/pkgconfig:$PKG_CONFIG_PATH
cmake -B build -DCMAKE_BUILD_TYPE=Debug -DCHAMELEON_USE_MPI=ON
cmake --build build -j20
exit
Debug Chameleon using GDB
cd /tmp
guix shell --pure -D chameleon gdb xterm -- /bin/bash --norc
# write the successive gdb commands into a file
echo "b starpu_mpi_init_conf" > gdbcommands
echo "run" >> gdbcommands
# without mpi
gdb -ex 'source gdbcommands' --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf
# with mpi
mpiexec --oversubscribe -n 2 xterm -hold -e gdb -ex 'source gdbcommands' --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf -t 2
Example on Plafrim
ssh -X plafrim
# repeat the compilation steps of StarPU and Chameleon seen just above
# reserve 2 nodes to use 4 MPI processes (1 per socket)
salloc --nodes=2 --time=01:00:00 --constraint bora --exclusive --ntasks-per-node=2 --threads-per-core=1
# deploy the guix environment
guix shell --pure --preserve='SLURM' -D chameleon gdb xterm slurm -- /bin/bash --norc
# if OpenBLAS is used, you should set its number of threads to 1
export OPENBLAS_NUM_THREADS=1
# run with gdb: 4 xterm terminals will appear
mpiexec --map-by socket xterm -hold -e gdb -ex run --args /tmp/chameleon/build/testing/chameleon_stesting -o potrf -n 32000 -P 2
Examples of installation and usage of Chameleon, PaStiX and Composyx using Guix on PlaFRIM
See the COMPAS 2025 tutorial (in French).
Help CMake in finding specific BLAS/LAPACK libraries
Many solvers depend on BLAS/LAPACK functions and require linking with an already installed implementation, most of the time OpenBLAS (libopenblas.so), Intel MKL (libmkl_rt.so), BLIS/FLAME (libblis.so, libflame.so) or Netlib (libblas.so, liblapack.so).
Usually, CMake successfully finds them when calling find_package(BLAS) and find_package(LAPACK) cf. FindBLAS, FindLAPACK.
If several implementations are installed on the system, set the CMake variable BLA_VENDOR to the preferred vendor.
In some environments, you may have trouble finding the BLAS/LAPACK libraries, e.g. if the implementation is not officially supported by the CMake module or if the module fails (find_package error).
In this situation, try specifying the exact working link flags through the CMake variable BLAS_LIBRARIES.
For example, on the CINES Adastra supercomputer, you may want to use libsci from HPE:
cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
-DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=CC -DCMAKE_Fortran_COMPILER=ftn \
-DBLAS_LIBRARIES="/opt/cray/pe/libsci/24.07.0/CRAY/18.0/x86_64/lib/libsci_cray.so"
In some other cases, you may want to mix BLAS/LAPACK libraries, for example using BLIS as optimized BLAS library and Netlib for all other interfaces (CBLAS, LAPACK, LAPACKE):
# We assume BLIS is already installed in $HOME/blis/install
# Install Netlib lapack, with C interfaces cblas and lapacke
git clone https://github.com/Reference-LAPACK/lapack.git
cmake -B build -DBUILD_SHARED_LIBS=ON -DCMAKE_INSTALL_PREFIX=$PWD/install \
-DCBLAS=ON -DLAPACKE=ON -DLAPACKE_WITH_TMG=ON -DUSE_OPTIMIZED_BLAS=ON \
-DBLA_VENDOR=FLAME -DCMAKE_PREFIX_PATH="$HOME/blis/install"
cmake --build build -j5 --target install
# Configure Chameleon with this specific BLAS/LAPACK couple BLIS+Netlib
cd ~/chameleon
cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
-DBLAS_LIBRARIES="$HOME/lapack/install/lib/liblapacke.so;$HOME/lapack/install/lib/liblapack.so;$HOME/lapack/install/lib/libcblas.so;$HOME/blis/install/lib/libblis.so"
# or you may also try as follows:
cmake -B build -DBUILD_SHARED_LIBS=ON -DCHAMELEON_USE_MPI=ON \
-DLAPACKE_LIBRARIES="$HOME/lapack/install/lib/liblapacke.so" \
-DLAPACK_LIBRARIES="$HOME/lapack/install/lib/liblapack.so" \
-DCBLAS_LIBRARIES="$HOME/lapack/install/lib/libcblas.so" \
-DBLAS_LIBRARIES="$HOME/blis/install/lib/libblis.so"
Getting started with Chameleon on Grid5000
Getting started
Have a look at the official guide: https://www.grid5000.fr/w/Getting_Started.
First, log in on a site, e.g. grenoble:
ssh grenoble.g5k
Then, to reserve a node, use the oarsub command, with -I for an
interactive shell session:
oarsub -q default -p dahu -l nodes=1,walltime=0:30:00 -I
- -q default: the default queue, which every user can use
- -q abaca: another queue that may be available, for example for Inria users
- -p dahu: specifies the cluster name; see https://www.grid5000.fr/w/Hardware for the list of available resources
- -l nodes=1,walltime=0:30:00: specifies the number of nodes and the duration of the reservation (default is 1h)
Here you are directly logged in on the first node of the reservation. You can try commands like:
lscpu # get hardware info
mpiexec hostname # one MPI process per core of the node, each displaying its hostname
mpiexec -n 2 hostname # limit the number of processes
mpiexec -npernode 1 hostname # only one process per node
exit
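The machine file passed to mpiexec below ($OAR_NODEFILE) lists one line per reserved core; counting duplicate hostnames shows how many cores you got on each node. Sketched here with a fabricated sample file (dahu-3/dahu-7 are made-up names), since the real file only exists inside a reservation:

```shell
# Fake nodefile standing in for $OAR_NODEFILE: one line per core.
cat > nodefile.sample <<'EOF'
dahu-3
dahu-3
dahu-7
dahu-7
EOF
# Count cores per node.
sort nodefile.sample | uniq -c
```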
MPI tests
Read this page about MPI: https://www.grid5000.fr/w/Run_MPI_On_Grid%275000.
Interactive Mode:
ssh grenoble.g5k
oarsub -q default -p dahu -l nodes=2 -I
# Measure bandwidth with intel-mpi-benchmarks
guix shell --pure --preserve='OAR' openmpi@4 intel-mpi-benchmarks -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
# the documentation suggests this to ensure proper network usage, but performance is similar -> not sure this is useful
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
-mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib \
$GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
exit
# Measure bandwidth and latency with osu-micro-benchmarks
guix shell --pure --preserve='OAR' openmpi@4 osu-micro-benchmarks -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
exit
Batch Script Mode
We can submit batch jobs as follows:
# Script Mode
oarsub -q default -p dahu -l nodes=2 -n test -O test.out -E test.out ~/guixenv.sh
oarstat -u # to check status of the job
cat test.out # result in test.out
guixenv.sh is a script that deploys the software environment:
guix shell openmpi@4 intel-mpi-benchmarks grep which -- ~/test.sh
test.sh contains the MPI programs to execute:
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board $GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
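Putting the two files together, a sketch of how they can be created (same paths and packages as above; the quoted 'EOF' prevents $GUIX_ENVIRONMENT and $OAR_NODEFILE from expanding at creation time, since they only exist when the job runs):

```shell
# The wrapper submitted to oarsub: enters the Guix environment
# and runs the actual test script.
cat > ~/guixenv.sh <<'EOF'
#!/bin/sh
guix shell openmpi@4 intel-mpi-benchmarks grep which -- ~/test.sh
EOF
# The MPI program(s) to execute inside the environment.
cat > ~/test.sh <<'EOF'
#!/bin/sh
$GUIX_ENVIRONMENT/bin/mpiexec -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
  $GUIX_ENVIRONMENT/bin/IMB-MPI1 PingPong
EOF
chmod +x ~/guixenv.sh ~/test.sh
```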
Chameleon on grenoble.g5k for testing with MPI
In what follows, we will install Chameleon configured with Intel MKL for the BLAS/LAPACK kernels. To get this version, one needs to add the guix-hpc-non-free channel:
ssh grenoble.g5k
mkdir -p ~/.config/guix && cat > ~/.config/guix/channels.scm << 'EOF'
(cons (channel
(name 'guix-hpc-non-free)
(url "https://gitlab.inria.fr/guix-hpc/guix-hpc-non-free.git"))
%default-channels)
EOF
then, update guix:
guix pull # may be faster adding --url=https://codeberg.org/guix/guix-mirror.git
and you will be able to see chameleon-mkl:
guix search chameleon-mkl
Run Chameleon+MKL GEMM on 2 nodes:
oarsub -q default -p dahu -l nodes=2 -I
guix shell --pure --preserve='OAR' chameleon-mkl -- /bin/bash --norc
# gemm C = A*B timing, matrix sizes m=n=25000, tile size b=500, 2 MPI processes (1 per node), 2D block-cyclic parameters p=1 q=2
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 25000 -b 500 -w -P 1
Id Function threads gpus P Q mtxfmt nb transA transB m n k lda ldb ldc alpha beta seedA seedB seedC tsub time gflops
0 dgemm 31 0 1 2 0 500 NoTrans NoTrans 25000 25000 25000 25000 25000 25000 -1.935109e-01 3.817601e-01 1681692777 1714636915 1957747793 0.000000e+00 1.226693e+01 2.547499e+03
# i.e. 12.26s and 2547 GFlop/s
# the same but with 4 MPI processes, 1 per socket (a processor is an Intel Xeon Gold 6130, 16 cores), 2D block-cyclic parameters p=2 q=2
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -machinefile $OAR_NODEFILE -npernode 2 --bind-to socket \
$GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 25000 -b 500 -w -P 2
Id Function threads gpus P Q mtxfmt nb transA transB m n k lda ldb ldc alpha beta seedA seedB seedC tsub time gflops
0 dgemm 15 0 2 2 0 500 NoTrans NoTrans 25000 25000 25000 25000 25000 25000 -1.935109e-01 3.817601e-01 1681692777 1714636915 1957747793 0.000000e+00 1.213652e+01 2.574874e+03
# i.e. 12.13s and 2574 GFlop/s
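As a sanity check, the reported rate follows from the gemm flop count 2*m*n*k divided by the time; for the first run above (m=n=k=25000, 12.26693 s) this lands on the reported ~2547 Gflop/s:

```shell
# 2 * 25000^3 flops in 12.26693 s, converted to Gflop/s
# (the %d truncates the fractional part).
awk 'BEGIN { printf "%d Gflop/s\n", 2 * 25000^3 / 1.226693e1 / 1e9 }'
```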
Note that StarPU reserves 1 core per process for its own operations (scheduling).
Chameleon on lille.g5k for testing with GPUs
We will now experiment with GPU nodes, here chifflot with 2 Nvidia P100 GPUs. Remember to set up the guix-hpc-non-free channel in order to get the MKL and CUDA version of Chameleon.
ssh lille.g5k
oarsub -q default -p chifflot -l host=1 -I
# Using CUDA and sequential MKL
guix shell --pure --preserve='OAR' chameleon-cuda-mkl -- /bin/bash --norc
# i.e. guix shell --pure --preserve='OAR' chameleon-cuda --with-input=openblas=intel-oneapi-mkl -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so \
-machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 4960,49600 -b 1240 -g 2
exit
# Using CUDA and multi-threaded MKL, see:
# https://solverstack.gitlabpages.inria.fr/chameleon/#interface-chameleon_parallel_worker
guix shell --pure --preserve='OAR' chameleon-cuda-mkl-mt -- /bin/bash --norc
$GUIX_ENVIRONMENT/bin/mpiexec -x HFI_NO_CPUAFFINITY=1 -x LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libcuda.so \
-x STARPU_MAIN_THREAD_BIND=1 -x MKL_NUM_THREADS=10 \
-x CHAMELEON_PARALLEL_WORKER_LEVEL=SOCKET -x CHAMELEON_PARALLEL_WORKER_SHOW=1 \
-machinefile $OAR_NODEFILE -npernode 1 --bind-to board \
$GUIX_ENVIRONMENT/bin/chameleon_dtesting -H -o gemm -n 4960,49600 -b 1240 -g 2
Number of parallel workers created: 2
Parallel worker 0 contains the following logical indexes:
26 28 30 32 34 36 38 40 42 44
Parallel worker 1 contains the following logical indexes:
4 6 8 10 12 14 16 18 20 22
Id Function threads gpus P Q mtxfmt nb transA transB m n k lda ldb ldc alpha beta seedA seedB seedC tsub time gflops
0 dgemm 20 2 1 1 0 1240 NoTrans NoTrans 4960 4960 4960 4960 4960 4960 4.892778e-01 -1.846424e-01 1649760492 596516649 1189641421 0.000000e+00 8.529406e-02 2.861253e+03
1 dgemm 20 2 1 1 0 1240 NoTrans NoTrans 49600 49600 49600 49600 49600 49600 -2.410269e-01 4.000452e-01 783368690 1102520059 2044897763 0.000000e+00 2.945459e+01 8.285564e+03