FABuLOuS numerical results

Go back to README.html.

1. Results Summary

2. Setup workstation

To be able to run the following test cases, you must have compiled fabulous.

I recommend using guix, as it is quite simple.

Set up the build system:

Finally, compile fabulous:

3. Get Test files

4. Cleanup results directory

5. Run Examples

5.1. Convergence

5.1.1. Influence of restart (m) parameter

  1. run test case
  2. plot the graphic

    This graphic seems to indicate that a larger restart parameter helps to speed up convergence. This is not necessarily true: a collection of references in IB-BGMRES-DR, Section 4.5, shows that the contrary is possible.

5.1.2. young1c (nrhs=6, m=90, k=5)

  1. run test case

    young1c.png

  2. plot the graphic

    In this experiment, I try to mimic the parameters of section 4.2, example 8. The results are different because the right-hand sides are generated randomly with a different random number generator, the kernels used may be different, and some algorithms may differ slightly.

    For the IB versions, the algorithms do differ: the experiment presented here implements an extension computing 'Inexact breakdown on R0', while the results presented in the article do not (Figure 2, Example 8, left side).

5.1.3. Influence of deflated restart (k) parameter (nrhs=6, m=90)

  1. run test case

    influence_dr.png

  2. plot the graphic

5.1.4. qr ib dr

  1. run test

    qribdr.png

  2. plot

5.1.5. young1c chameleon comparison QR (nrhs=6, m=90, k=5) (DEPRECATED)   chameleon

  1. run test case
  2. plot the graphic

5.1.6. GCR first results nrhs=10 m=300

  1. test case

    GCR_10_300.png

  2. graphic

    The max and min displayed here are the max and min over the "not yet discarded" directions. This is why the min may increase on this graphic: it means that a direction was discarded and its norm is no longer taken into account.
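
    The effect described above can be illustrated with a small sketch. The function name `active_min_max`, the discard threshold and the sample norms are all hypothetical; this is not how fabulous decides which directions to discard, only an illustration of why the minimum over the remaining directions can go up.

```python
def active_min_max(norms_per_iteration, threshold):
    """Track min/max norms over the directions that are not yet discarded.

    norms_per_iteration: list of lists, one norm per direction (made-up data).
    threshold: hypothetical tolerance below which a direction is discarded.
    """
    active = set(range(len(norms_per_iteration[0])))
    history = []
    for norms in norms_per_iteration:
        kept = [norms[i] for i in sorted(active)]
        history.append((min(kept), max(kept)))
        # Discard converged directions; they no longer count afterwards.
        active = {i for i in active if norms[i] > threshold}
        if not active:
            break
    return history


# Made-up norms for 3 directions over 3 iterations.
norms = [
    [1.0, 0.5, 0.8],
    [0.6, 1e-9, 0.7],   # direction 1 falls below the threshold
    [0.4, 1e-10, 0.5],  # direction 1 is ignored from now on
]
history = active_min_max(norms, threshold=1e-8)
```

    In this made-up data the minimum goes from 1e-9 back up to 0.4 as soon as direction 1 is dropped from the active set.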

5.1.7. BCG results

  1. test case

    BCG.png

  2. graphic

    The CG algorithm minimizes the A-norm of the error, not the residual directly. It is therefore not an error if the norm of the residual does not decrease monotonically.
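
    A minimal sketch of this property, using a textbook single right-hand-side CG in pure Python rather than the block CG of fabulous. The matrix, the exact solution and the helper names are assumptions chosen for illustration. The A-norm of the error is recorded next to the residual norm; only the former is guaranteed to decrease at every step.

```python
import math


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_history(A, b, x_exact, iterations):
    """Plain CG from x0 = 0; returns (A-norm of error, residual norm) per step."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    history = []
    for _ in range(iterations):
        e = [xe - xi for xe, xi in zip(x_exact, x)]
        a_norm = math.sqrt(max(0.0, dot(e, matvec(A, e))))
        history.append((a_norm, math.sqrt(dot(r, r))))
        rr = dot(r, r)
        if rr == 0.0:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r, r) / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return history


# Small symmetric positive definite tridiagonal matrix (illustrative only).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.5]]
x_exact = [1.0, 1.0, 1.0, 1.0]
b = matvec(A, x_exact)
history = cg_history(A, b, x_exact, 4)
```

    Plotting the second component of each pair would give the kind of residual curve shown in this section, while the first component is the quantity CG actually minimizes.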

5.1.8. IB maxkeptdirection parameter

  1. run test

    IB_mkd.png

  2. plot

5.1.9. BCG results 2

  1. test case

    When using the random loader "-l RD" with the BCG algorithm, the test program ensures that the matrix is SPD by computing

    A <- rand_mat()
    A <- A^{T} * A
    

    BCG_2.png
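
    The construction above can be sketched in pure Python as follows. The function names are hypothetical and the random generator is not the one used by the test program. Note that A^{T} * A is always symmetric positive semi-definite; it is positive definite when the random matrix is nonsingular, which is almost surely the case.

```python
import random


def random_spd(n, seed=0):
    """Return A^T * A for a random n x n matrix A (nested lists)."""
    rng = random.Random(seed)
    A = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    # (A^T A)[i][j] = sum_k A[k][i] * A[k][j]
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def quadratic_form(S, x):
    """Compute x^T S x."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))


S = random_spd(5)
ones = [1.0] * 5
value = quadratic_form(S, ones)  # equals ||A * ones||^2 up to rounding
```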

  2. graphic

5.1.10. BGCRO results check 3 consecutive solve

  1. test case

    Using the example file example\gcro.cpp

    gcro.png

  2. graphic

5.2. Timings

5.2.1. Influence of incremental QR factorization

  1. run test case
  2. plot the graphic

    The curve for "fullgels facto" is zero because there is no factorization part in "fullgels". It may be unexpected, but the curve for "incrementalgels solve" is also zero: since GMRES is not a short-term recurrence algorithm, computing the complete least-squares solution is not mandatory at each iteration. Only the residual norm is needed, and it can be obtained from the incremental factorization without computing the actual residual.

    gels1.png
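
    The remark about the residual norm can be illustrated with the classical single right-hand-side GMRES, where the Hessenberg matrix is reduced by Givens rotations. This is only an analogue: fabulous works on blocks of right-hand sides and uses different kernels, and the names and the small matrix below are assumptions. After the rotations, the last entry of the rotated right-hand side gives the least-squares residual norm, so no solve is needed to monitor convergence.

```python
import math


def givens_residual(H, beta):
    """Reduce the (m+1) x m Hessenberg matrix H with Givens rotations.

    Returns (R, g): R is upper triangular with a zero last row, g is the
    rotated right-hand side beta * e1. abs(g[m]) is the residual norm.
    """
    m = len(H[0])
    R = [row[:] for row in H]
    g = [beta] + [0.0] * m
    for j in range(m):
        a, b = R[j][j], R[j + 1][j]
        rho = math.hypot(a, b)
        c, s = a / rho, b / rho
        for k in range(j, m):
            t1, t2 = R[j][k], R[j + 1][k]
            R[j][k] = c * t1 + s * t2
            R[j + 1][k] = -s * t1 + c * t2
        g[j], g[j + 1] = c * g[j] + s * g[j + 1], -s * g[j] + c * g[j + 1]
    return R, g


def back_substitute(R, g, m):
    """Solve the m x m triangular system; only needed for the final solution."""
    y = [0.0] * m
    for i in reversed(range(m)):
        y[i] = (g[i] - sum(R[i][k] * y[k] for k in range(i + 1, m))) / R[i][i]
    return y


H = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0],
     [0.0, 0.0, 1.0]]
beta = 1.0
m = len(H[0])
R, g = givens_residual(H, beta)
cheap_norm = abs(g[m])  # residual norm, obtained without any solve

# Explicit residual beta*e1 - H*y, computed here only to check the claim.
y = back_substitute(R, g, m)
res = [(beta if i == 0 else 0.0) - sum(H[i][k] * y[k] for k in range(m))
       for i in range(m + 1)]
explicit_norm = math.sqrt(sum(v * v for v in res))
```

    Here `cheap_norm` and `explicit_norm` agree up to rounding, which is why the triangular solve can be postponed until convergence or restart.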

5.2.2. qr ib dr timing

  1. run test
  2. plot

    time_qribdr.png

    time_qribdr2.png

  3. papi (commit 20c31809be1f04a1b575008f9527abbec07a2f21 / branch papi) (DEPRECATED)

    DOES NOT WORK. Attempt to compare the PAPI hardware counters with the Gflop/s estimate we computed:

5.2.3. with chameleon (DEPRECATED)   chameleon

  1. basic-cham
    1. run test
    2. plot

      The following graph represents the relative percentage of time spent in each stage of the algorithm.

  2. QR big case
    1. runs
    2. graphics

5.2.4. big matrix test 'perf1'   parade

  1. Description

    This section contains big test cases, which may not work if your machine does not have enough memory.

    Reminder:

    • "-p" is the number of right-hand sides
    • "-m" is the maximum size of the Krylov space
    • "-M" is the maximum number of matrix-vector products

    "BAD" matrices are matrices that theoretically converge badly:

    • real case: tridiagonal with 4 on the diagonal, -1.0 under the diagonal and -2.99 over the diagonal
    • complex case: tridiagonal with 4 - 2i on the diagonal, -2.99 + i under the diagonal and -1 + i over the diagonal
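
    As a sketch, the two "BAD" matrices described above can be built as dense nested lists. The function name `bad_tridiagonal` is hypothetical and the test program may store these matrices differently; only the entry values come from the description.

```python
def bad_tridiagonal(n, complex_case=False):
    """Dense n x n 'BAD' tridiagonal matrix with the entries quoted above."""
    if complex_case:
        diag, sub, sup = 4 - 2j, -2.99 + 1j, -1 + 1j
    else:
        diag, sub, sup = 4.0, -1.0, -2.99
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = diag
        if i > 0:
            A[i][i - 1] = sub  # under the diagonal
        if i < n - 1:
            A[i][i + 1] = sup  # over the diagonal
    return A


A_real = bad_tridiagonal(4)
A_complex = bad_tridiagonal(4, complex_case=True)
```

    In both cases the matrix is non-symmetric, since the entries under and over the diagonal differ.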

    There are two kinds of test:

    • MAX_KRYLOV_SPACE = 2000 and DIMENSION = 5000
    • MAX_KRYLOV_SPACE = 5000 and DIMENSION = 20000

    MAX_MVP is set to a very large value (100000) so that all tests reach convergence.

    The ORTHOGONALIZATION_SCHEME used is CGS (the cheapest)

    There are two kinds of graphic:

    • Time, or percentage of time, spent in certain steps of an iteration with respect to the size of the Krylov space (data from different restarts are therefore grouped together by Krylov space size)
    • Time of steps, or total time of iterations, with respect to the global iteration number over the whole run of the algorithm
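
    The two ways of aggregating the timings can be sketched as below. The record layout (Krylov size, step time, iteration time) and the sample values are assumptions made for illustration; the actual format of the result files is not described here.

```python
from collections import defaultdict
from itertools import accumulate


def percentage_by_krylov_size(records):
    """Average share (in %) of one step, grouped by Krylov space size.

    records: iterable of (krylov_size, step_time, iteration_time) tuples,
    possibly coming from several restarts (hypothetical layout).
    """
    shares = defaultdict(list)
    for size, step_time, iteration_time in records:
        shares[size].append(100.0 * step_time / iteration_time)
    return {size: sum(v) / len(v) for size, v in sorted(shares.items())}


def cumulative_time(records):
    """Running total of iteration time, indexed by global iteration number."""
    return list(accumulate(t for _, _, t in records))


# Two restarts of two iterations each (made-up timings in seconds).
records = [(6, 1.0, 4.0), (12, 2.0, 5.0), (6, 3.0, 4.0), (12, 1.0, 5.0)]
by_size = percentage_by_krylov_size(records)
running = cumulative_time(records)
```

    The first function corresponds to the first kind of graphic (restarts merged by Krylov size), the second to the second kind (global iteration axis).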
  2. Batch script
    #SBATCH --job-name=fabulous_perf1
    #SBATCH --output=fabulous_perf1_out_1
    #SBATCH --error=fabulous_perf1_err_1
    #SBATCH --exclusive
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=32
    #SBATCH --time=04:00:00
    #SBATCH --partition=routage
    #SBATCH --constrain=bora
    
    #source /home/${SBATCH_ACCOUNT}/.bashrc
    #source /home/${SBATCH_ACCOUNT}/fabulous/.plafrim_module_used
    #export STARPU_FXT_PREFIX=/home/tmijieux/fabulous/build/
    #export STARPU_GENERATE_TRACE=1
    FABULOUS_DIR=/home/msimonin/Repositories/fabulous
    cd ${FABULOUS_DIR}/build/
    
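    # Quoted so that the whole option string is stored in GUIX_ENV; it is expanded unquoted below on purpose.
    GUIX_ENV="--pure fabulous --with-source=fabulous=${FABULOUS_DIR}"

    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -o perf1_std_1 -x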
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QR -o perf1_qr_1 -x
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A IB -o perf1_ib_1 -x
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QRDR -k 20 -r DEFLATED -o perf1_qrdr_1 -x
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QRIBDR -o perf1_qrib_1 -x
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A IB -o perf1_ib_2 -x
    guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A QRIBDR -o perf1_qrib_2 -x
    #./fabulous_test_cham -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A CHAMIB -o perf1_cham_ib_2 -x
    tar czvf perf1_results.tar.gz *.res *.kernel.txt
    mv perf1_results.tar.gz /home/$SBATCH_ACCOUNT
    
  3. IB and QR-IB
    1. runs
      1. little
      2. less little
    2. graphic
      1. little

        Here ib_1 is the INEXACT BREAKDOWN variant with the full least-squares solve, while qrib_1 is the IB variant with incremental factorization.

        The test cases ending with _1 have the following notable parameters: problem/vector size = 5000, maximum size of Krylov space = 2000.

        • gels perf1_ib_1 is a copy of the Hessenberg matrix plus the full least-squares kernel call.
        • facto perf1_ib_1 corresponds to nothing (it would be the factorization part, which does not exist here, so it is zero).
        • facto perf1_qrib_1 is the incremental factorization, with the factorization of the last block row and last block column.
        • gels perf1_qrib_1 is the solve in the QRIB case, which contains:
          • a copy of the Hessenberg matrix,
          • a piece of factorization with the last row (TSQRT L over H),
          • the application of all factorization updates to the right-hand sides (because of the double update: IB and incremental QR),
          • finally the 'solve' (triangular trsm kernel).

        The first graph shows the percentage of the time of an iteration taken by each of the parts presented above. The other important parts of an iteration in this case are the matrix-vector multiplication (roughly constant across iterations) and the basis orthogonalization step, whose duration depends on the Krylov space size (abscissa).

        ib1.png

        The second graph shows the raw duration of each step.

        ib2.png

        ib3.png

        The third graph shows the cumulative time of the iterations for the IB and QRinc-IB versions.

        ib4.png

      2. less little

        Same thing with a bigger test case: vector/problem size = 20000, maximum size of Krylov space = 5000.

        ib5.png

        ib6.png

        i7.png

  4. QR and Restarting
    1. runs
    2. graphics

      qr1.png

      qr2.png

      qr3.png

      qr4.png

      qr5.png

  5. IB and CHAM-IB (DEPRECATED)   chameleon

    The tests in this section compare badly against each other because the parts of the executable linked with chameleon that do not use chameleon do not benefit from multi-threaded kernels (chameleon requires linking against a sequential BLAS; see NOTES.org, Section "linking with lapacke/cblas kernels").

    1. runs

      perf1_ib_2 from less little

      IB CHAMIB

    2. graphic
  6. Kernel parameters analysis

    tsmqrt1.png

    tsmqrt2.png

    tsmqrt3.png

    tsmqrt4.png

    tsmqrt5.png

    tsmqrt6.png

    tsmqrt7.png

    tsmqrt8.png

    gels.png

  7. parade REMOTE

    Set up the connection with plafrim

    export TERM=xterm
    hostname
    echo $WORKDIR
    

    Launch the experiments

    cd ${WORKDIR}
    #git pull origin develop
    emacs -batch --load ~/.emacs.d/init.el RESULTS.org --funcall org-babel-tangle
    sbatch scripts/batch_perf1.sh
    squeue -u tmijieux
    
  8. parade RETRIEVE LOCAL

Author: Gilles Marait

Created: 2021-09-29 Wed 11:54
