PaStiX Handbook 6.4.0
Multiple levels of parallelism can be tricky to initialize correctly without creating conflicts between the different libraries involved. We summarize here the different usages you may have of the PaStiX library. However, we strongly recommend using only the pastixInit() function, with a correct linkage to HwLoc.
If you use the sequential library, no specific attention is required for the bindings.
If you use the multi-threaded library, the binding of the threads can be very important for the performance of your application. By default, the pastixInit() function binds all the threads (except the initial one) to the cores sequentially, grouping them on common memory banks when HwLoc is used.
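For reference, here is a minimal sketch of this default initialization, based on the usual pastix.h entry points (pastixInitParam(), pastixInit(), pastixFinalize()); the IPARM_THREAD_NBR setting and the thread count chosen here are illustrative assumptions, not requirements.

    #include <pastix.h>

    int main( int argc, char **argv )
    {
        pastix_data_t *pastix_data = NULL;   /* PaStiX internal data structure */
        pastix_int_t   iparm[IPARM_SIZE];    /* Integer parameters             */
        double         dparm[DPARM_SIZE];    /* Floating point parameters      */

        /* Set all iparm/dparm entries to their default values */
        pastixInitParam( iparm, dparm );

        /* Optional: request a given number of worker threads (illustrative value) */
        iparm[IPARM_THREAD_NBR] = 4;

        /* Default initialization: with HwLoc, the worker threads are bound to
         * cores, grouped by memory bank, and the initial thread is left alone. */
        pastixInit( &pastix_data, MPI_COMM_WORLD, iparm, dparm );

        /* ... analyze / factorize / solve ... */

        pastixFinalize( &pastix_data );
        return 0;
    }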
If HwLoc is not available and the numbering of the cores is non-linear, as in this example:
    $ hwloc-ls
    Machine (128GB total)
      Package L#0
        NUMANode L#0 (P#0 32GB)
          L3 L#0 (15MB)
            L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
            L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
            L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
            L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
            L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
            L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
        NUMANode L#1 (P#2 32GB) + L3 L#1 (15MB)
          L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
          L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
          L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
          L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
          L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#20)
          L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#22)
      Package L#1
        NUMANode L#2 (P#1 32GB)
          L3 L#2 (15MB)
            L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#1)
            L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#3)
            L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
            L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#7)
            L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#9)
            L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
        NUMANode L#3 (P#3 32GB) + L3 L#3 (15MB)
          L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#13)
          L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#15)
          L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
          L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#19)
          L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#21)
          L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
you should replace the call to pastixInit() with pastixInitWithAffinity(), and provide an array that linearizes the access to the cores:
    int bindtab[] = { -1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 };
Note that the first entry is set to -1 so that the initial thread is left unbound. This can be useful if you want to switch from sequential BLAS inside PaStiX to multi-threaded BLAS outside of it, or if you have OpenMP sections in your code. If thread 0 is bound, any thread later created from it inherits its properties, including the binding. Thus, if it is bound to a single core, you may have performance issues in other portions of your code that create their own threads. We therefore strongly recommend leaving bindtab[0] at -1, unless you know exactly what you are doing.
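Below is a minimal sketch of how such an array may be passed. It assumes that pastixInitWithAffinity() takes the same arguments as pastixInit() plus the bindtab array, and the IPARM_THREAD_NBR value is only an illustrative choice matching the number of entries; check the prototype in your pastix.h.

    #include <pastix.h>

    int main( int argc, char **argv )
    {
        pastix_data_t *pastix_data = NULL;
        pastix_int_t   iparm[IPARm_SIZE];
        double         dparm[DPARM_SIZE];

        /* Linearized core numbering for the topology above: thread 0 is left
         * unbound (-1), threads 1-11 go to the PUs of the first package,
         * threads 12-23 to the PUs of the second one. */
        int bindtab[24] = { -1,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22,
                             1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23 };

        pastixInitParam( iparm, dparm );
        iparm[IPARM_THREAD_NBR] = 24;    /* One thread per entry of bindtab */

        /* Same as pastixInit(), but the threads are bound according to bindtab */
        pastixInitWithAffinity( &pastix_data, MPI_COMM_WORLD, iparm, dparm, bindtab );

        /* ... analyze / factorize / solve ... */

        pastixFinalize( &pastix_data );
        return 0;
    }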
First of all, whatever configuration you use for PaStiX itself (sequential, multi-threaded, MPI+X), the first step is to correctly use the binding capabilities of your preferred MPI launcher to distribute the processes over the different parts of the machine. If you do this, the pastixInit() function call with HwLoc will handle the correct binding for you.
If you use the binding capabilities of the MPI launcher and HwLoc, you have nothing to do. Otherwise, it becomes tricky to bind your only thread to the single core you will get. This is explained in the next section with the example for multi-threaded PaStiX.
Once again, if you use the binding capabilities of the MPI launcher and HwLoc, you have nothing to do. Otherwise, you need to correctly bind each thread of your MPI processes. In the case of one MPI process per node, this is equivalent to Using multi-threaded PaStiX. If you have multiple MPI processes per node, the first thing to do is to detect which MPI processes are running on the same node. Once this is done, the resources of each node need to be shared correctly among its processes, as sketched below. We consider here a linear numbering of the cores, but you may face the same issue as in Using multi-threaded PaStiX.
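As a rough illustration of this detection step, the sketch below uses MPI-3's MPI_Comm_split_type() to group the ranks sharing a node and gives each of them a contiguous slice of cores. The number of cores per node, the bindtab capacity, and the homogeneous distribution of processes are assumptions made for the example; it is not a copy of the tool mentioned below.

    #include <mpi.h>
    #include <pastix.h>

    int main( int argc, char **argv )
    {
        pastix_data_t *pastix_data = NULL;
        pastix_int_t   iparm[IPARM_SIZE];
        double         dparm[DPARM_SIZE];
        MPI_Comm       nodecomm;
        int            local_rank, local_size, i;
        int            nbcores = 24;     /* Cores per node (assumed known)         */
        int            bindtab[64];      /* Assumed large enough for one process   */

        MPI_Init( &argc, &argv );

        /* Group the MPI processes that share the same node */
        MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                             MPI_INFO_NULL, &nodecomm );
        MPI_Comm_rank( nodecomm, &local_rank );
        MPI_Comm_size( nodecomm, &local_size );

        /* Give each process a contiguous slice of the node's cores
         * (linear core numbering assumed, see the remark above). */
        int nbthread = nbcores / local_size;
        bindtab[0] = -1;                 /* Keep the initial thread unbound */
        for( i = 1; i < nbthread; i++ ) {
            bindtab[i] = local_rank * nbthread + i;
        }

        pastixInitParam( iparm, dparm );
        iparm[IPARM_THREAD_NBR] = nbthread;

        pastixInitWithAffinity( &pastix_data, MPI_COMM_WORLD, iparm, dparm, bindtab );

        /* ... analyze / factorize / solve ... */

        pastixFinalize( &pastix_data );
        MPI_Comm_free( &nodecomm );
        MPI_Finalize();
        return 0;
    }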
The example tools/binding_for_multimpi.c shows how to split your resources among the MPI processes and generate the correct bindtab. Do not forget that bindtab[0] may still need to be set to -1 when the application mixes multiple threaded libraries.