PaStiX Handbook  6.3.2
How to correctly initialize your PaStiX library

Multiple levels of parallelism can be tricky to initialize correctly while avoiding conflicts between the different libraries involved. We summarize here the different usages you may have of the PaStiX library. However, we strongly recommend using only the pastixInit() function, with a correct linkage to HwLoc.

Using PaStiX in a shared memory environment

Using sequential PaStiX

If you use the sequential library, no specific attention is required for the bindings.

Using multi-threaded PaStiX

In that case, the binding of the threads can be very important for the performance of your application. If HwLoc is used, pastixInit() will, by default, bind all the threads (except the initial one) sequentially to the cores, grouping them on common memory banks.
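
For reference, a typical call to pastixInit() might look like the following sketch (assuming the PaStiX 6.x prototypes from pastix.h; the thread number set here is only an example value):

pastix_data_t *pastix_data = NULL;
pastix_int_t   iparm[IPARM_SIZE];   /* Integer parameters        */
double         dparm[DPARM_SIZE];   /* Floating-point parameters */

/* Fill iparm/dparm with the default values */
pastixInitParam( iparm, dparm );

/* Number of threads per process (example value) */
iparm[IPARM_THREAD_NBR] = 24;

/* Initialize the library; the computation threads are created and bound here */
pastixInit( &pastix_data, MPI_COMM_WORLD, iparm, dparm );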

If HwLoc is not supported and the numbering of the cores is non-linear, as in this example:

$ hwloc-ls
Machine (128GB total)
  Package L#0
    NUMANode L#0 (P#0 32GB)
      L3 L#0 (15MB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
        L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
        L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
        L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
        L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
        L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
    NUMANode L#1 (P#2 32GB) + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#16)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#18)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#20)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#22)
  Package L#1
    NUMANode L#2 (P#1 32GB)
      L3 L#2 (15MB)
        L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#1)
        L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#3)
        L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#5)
        L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#7)
        L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#9)
        L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#11)
    NUMANode L#3 (P#3 32GB) + L3 L#3 (15MB)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#13)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#15)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#17)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#19)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#21)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)

You should replace the call to pastixInit() with pastixInitWithAffinity(), and provide an array that linearizes the access to the cores:

int bindtab[] = { -1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22,
                   1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23 };

Note that the first thread is given -1 to avoid binding it. This can be useful if you want to switch between sequential BLAS inside PaStiX and multi-threaded BLAS outside PaStiX, or if you have OpenMP sections in your code. If thread 0 is bound to something, any thread later created from it will inherit its properties, including the binding. Thus, if it is bound to a single core, you may have performance issues in other portions of your code that create their own threads. So, we strongly recommend leaving bindtab[0] at -1, unless you know exactly what you are doing.
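
With the bindtab above, the initialization might look like the following sketch (assuming the PaStiX 6.x prototypes; iparm and dparm are the usual parameter arrays):

pastix_data_t *pastix_data = NULL;
pastix_int_t   iparm[IPARM_SIZE];
double         dparm[DPARM_SIZE];

pastixInitParam( iparm, dparm );
iparm[IPARM_THREAD_NBR] = 24;   /* One thread per core of this machine */

/* Same as pastixInit(), but uses bindtab to place the threads */
pastixInitWithAffinity( &pastix_data, MPI_COMM_WORLD,
                        iparm, dparm, bindtab );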

Using PaStiX in a distributed memory environment with MPI

First of all, whatever configuration you are using for PaStiX itself (sequential, multi-threaded, MPI+X), the first step is to correctly use the binding capabilities of your preferred MPI launcher to distribute the processes over distinct parts of the machine. If you do this, the pastixInit() function call with HwLoc will handle the correct binding for you.
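
For instance, with Open MPI (the exact options depend on your launcher), each process can be pinned to its own socket with something like the following, ./my_pastix_app being a placeholder for your executable:

$ mpiexec -np 4 --map-by socket --bind-to socket ./my_pastix_app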

Using sequential PaStiX in an MPI context, or PaStiX in an MPI+sequential context

If you use the binding capabilities of the MPI launcher and HwLoc, there is nothing more to do. Otherwise, it becomes tricky to bind your only thread to the single core you will get. This is explained in the next section with the example for multi-threaded PaStiX.

Using multi-threaded PaStiX in an MPI context, or PaStiX in an MPI+threads context

Once again, if you use the binding capabilities of the MPI launcher and HwLoc, there is nothing more to do. Otherwise, you need to correctly bind each thread of your MPI processes. With one MPI process per node, this is equivalent to the Using multi-threaded PaStiX case. If you have multiple MPI processes per node, the first step is to detect which MPI processes are running on the same node. Once this is done, the resources of a given node need to be shared correctly among them. We consider here a linear numbering of the cores, but you may face the same issue as in Using multi-threaded PaStiX.

The example tools/binding_for_multimpi.c describes how to split your resources among the MPI processes and generate the correct bindtab. Don't forget that bindtab[0] may still need to be set to -1 in a context where the application mixes multiple threaded libraries.
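
As an illustration of the idea (not a substitute for the shipped example), the following sketch relies on MPI-3's MPI_Comm_split_type() to group the ranks of a same node, assumes a linear numbering of the cores, and splits them evenly among the local ranks:

#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>
#include <pastix.h>

int main( int argc, char **argv )
{
    pastix_data_t *pastix_data = NULL;
    pastix_int_t   iparm[IPARM_SIZE];
    double         dparm[DPARM_SIZE];
    MPI_Comm       node_comm;
    int            local_rank, local_size, ncores, nthreads, i;
    int           *bindtab;

    MPI_Init( &argc, &argv );

    /* Group together the ranks that share the same node (MPI-3) */
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &node_comm );
    MPI_Comm_rank( node_comm, &local_rank );
    MPI_Comm_size( node_comm, &local_size );

    /* Share the cores of the node evenly among the local ranks,
       assuming a linear numbering of the cores                  */
    ncores   = sysconf( _SC_NPROCESSORS_ONLN );
    nthreads = ncores / local_size;
    bindtab  = malloc( nthreads * sizeof(int) );
    for ( i = 0; i < nthreads; i++ ) {
        bindtab[i] = local_rank * nthreads + i;
    }
    bindtab[0] = -1; /* Keep the main thread unbound (see above) */

    pastixInitParam( iparm, dparm );
    iparm[IPARM_THREAD_NBR] = nthreads;
    pastixInitWithAffinity( &pastix_data, MPI_COMM_WORLD,
                            iparm, dparm, bindtab );

    /* ... analyze, factorize, solve ... */

    pastixFinalize( &pastix_data );
    free( bindtab );
    MPI_Comm_free( &node_comm );
    MPI_Finalize();
    return 0;
}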