PaStiX Handbook
6.3.2
After you have installed all the components listed in Installing PaStiX with GPU support, you can launch PaStiX with GPUs. The option -g defines the number of GPUs that you want to use, and the option -s selects the scheduler (-s 2 for PaRSEC, -s 3 for StarPU).
Note that GPUs have a significant impact only when the amount of computation is large enough. If it is not, the time spent transferring data to the GPU will exceed the computation time.
Note that the performance of the first factorization is largely impacted by the loading of the CUDA library. To better evaluate the performance on your matrices, we recommend doing multiple runs with the help of the bench_facto example ($PASTIX_SRC_DIR/example/bench_facto.c), or doing CUDA calls on fake data before calling the solver.
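Such a warm-up can be as simple as forcing the CUDA runtime to initialize before the solver is called. A minimal sketch, independent of PaStiX (the function name is ours; the `cudaFree(0)` idiom is the classic way to trigger context creation):

```c
#include <cuda_runtime.h>

/* Force CUDA runtime initialization and context creation so that the
 * library-loading cost is not charged to the first factorization. */
static void warmup_cuda( void )
{
    cudaFree( 0 );            /* triggers CUDA context initialization */
    cudaDeviceSynchronize();  /* waits until initialization completes */
}
```

Call it once at the beginning of your program, before the first call to the solver.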
You can create a ~/.parsec/mca-params.conf file in your home directory to better configure PaRSEC.
This file defines the MCA parameters of the PaRSEC library, one parameter per line. A few of them can be helpful to get better performance with PaStiX:

- The number of CUDA streams per device must be >= 3, as the first two streams are reserved for data transfers from/to the GPUs. For example, we recommend a value of 8 for an Nvidia V100; larger values do not seem to improve the performance.
- 4 is the common value that we usually recommend.
- 0 is the recommended value, as enabling it may cause communication issues.

Note that all other MCA parameters of PaRSEC can also be set through this file, such as the scheduler, some limits, and enabling/disabling profiling. These parameters can be found by looking for mca_param_reg_.* in the PaRSEC code.
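As an illustration of the file format, here is a sketch that creates the configuration file with one `name = value` pair per line. The parameter name `device_cuda_nb_streams` below is only an assumption for the number-of-streams parameter; look for `mca_param_reg_.*` in your PaRSEC sources for the exact names of your version.

```shell
# Create the PaRSEC MCA parameter file: one "name = value" pair per line.
mkdir -p ~/.parsec
cat > ~/.parsec/mca-params.conf <<'EOF'
# Number of CUDA streams per device (illustrative parameter name;
# check mca_param_reg_.* in the PaRSEC code for the real one).
device_cuda_nb_streams = 8
EOF
```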
All these parameters will then automatically be used by PaStiX when it runs with the PaRSEC scheduler, as in:
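A sketch of such a run, assuming the `simple` example shipped with the PaStiX sources; the program name and the `--mm` matrix option are illustrative, so adapt them to your build and input:

```shell
# Run with 1 GPU (-g 1) and the PaRSEC scheduler (-s 2); the number of
# threads is left to hwloc's topology discovery.
./example/simple -g 1 -s 2 --mm /path/to/matrix.mtx
```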
Note that this example uses 1 GPU (-g 1) and PaRSEC (-s 2), and that the total number of threads is left to the topology discovery performed by hwloc.
In the same way, you can obtain better performance with StarPU thanks to environment variables. You can find the complete list in the StarPU Handbook. We list here a subset of these variables that may have an impact on the PaStiX performance:

- 4 is the common value that we usually recommend.
- 8 for an Nvidia V100; larger values do not seem to improve the performance. Note that, as opposed to PaRSEC, the data transfer streams are not included in this value.

To use them, you can simply export these variables in your environment or put them at the beginning of your command line:
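For instance, a sketch assuming `STARPU_NWORKER_PER_CUDA` is one of the variables you want to set, with the illustrative `simple` example and `--mm` matrix option used above:

```shell
# Export the variable for the whole session:
export STARPU_NWORKER_PER_CUDA=4
./example/simple -g 1 -s 3 --mm /path/to/matrix.mtx
```

or set it for a single run only:

```shell
STARPU_NWORKER_PER_CUDA=4 ./example/simple -g 1 -s 3 --mm /path/to/matrix.mtx
```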
Note that these examples use a single GPU (-g 1) with StarPU (-s 3). The total number of threads is left to the topology discovery performed by hwloc.