FABuLOuS numerical results

1. Results Summary
2. Setup workstation
3. Get Test files
4. Cleanup results directory
5. Run Examples
- 5.1. Convergence
- 5.2. Timings


1. Results Summary

Summary

2. Setup workstation

To run the following test cases, you must first compile fabulous.

I recommend using guix, as it is quite simple.

Set up the build system:

Finally, compile fabulous:

3. Get Test files

young1c.mtx MatconeSpherePC_MAIN_MAIN_0 (binary) coneSpherePC_RHS_MAIN.0.res (binary)

sherman4.mtx

young4c.mtx bcsstk14.mtx sherman5.mtx

bidiagonalmatrix1.mtx bidiagonalmatrix2.mtx bidiagonalmatrix3.mtx bidiagonalmatrix3.mtx

4. Cleanup results directory

5. Run Examples

5.1. Convergence

5.1.1. Influence of restart(m) parameter

run test case
plot the graphic

This graphic seems to indicate that a larger restart parameter helps to speed up convergence. This is not necessarily true, however: a collection of references in IB-BGMRES-DR, Section 4.5, shows that the contrary is possible.

5.1.2. young1c (nrhs=6, m=90, k=5)

run test case
plot the graphic

In this experiment, I try to mimic the parameters of Section 4.2, Example 8. The results differ because the right-hand sides are generated randomly with a different random number generator, the kernels used may be different, and some algorithms may differ slightly.

For the IB versions, the algorithms do differ: the experiment presented here implements an extension that computes 'Inexact breakdown on R0', while the results presented in the article do not (Figure 2, Example 8, left side).

5.1.3. Influence of deflated restart (k) parameter (nrhs=6, m=90)

run test case
plot the graphic

5.1.4. qr ib dr

run test
plot

5.1.5. young1c chameleon comparison QR (nrhs=6, m=90, k=5) (DEPRECATED) chameleon

run test case
plot the graphic

5.1.6. GCR first results nrhs=10 m=300

test case
graphic

The max and min displayed here are the max and min over the "not yet discarded" directions. This is why the min may increase on this graphic: it means that a direction was discarded and its norm is no longer taken into account.
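The effect can be illustrated with a minimal sketch. The function name `active_min_max` and the residual norms below are made up for illustration; they are not taken from fabulous or from the actual run.

```python
def active_min_max(norms, discarded):
    """Return (min, max) of the residual norms of directions not yet discarded."""
    active = [norm for index, norm in enumerate(norms) if index not in discarded]
    return min(active), max(active)


# Made-up per-direction residual norms at two consecutive iterations.
before = active_min_max([1e-9, 4e-4, 2e-3], discarded=set())
# Direction 0 is now discarded; the remaining norms have decreased slightly.
after = active_min_max([1e-9, 3e-4, 1e-3], discarded={0})
print(before, after)
```

Every remaining direction has improved, yet the displayed min goes up because the smallest norm has left the set over which the min is taken.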

5.1.7. BCG results

test case
graphic

The CG algorithm minimizes the A-norm of the error, not the residual directly, so it is not an error if the norm of the residual does not decrease.
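To compare the two quantities on a small problem, here is a minimal sketch of plain (single right-hand side, unpreconditioned) CG in pure Python. It is not the block CG implemented in fabulous; the helper names and the diagonal test matrix are assumptions chosen for illustration. The sketch records both the A-norm of the error and the 2-norm of the residual at every iteration.

```python
import math


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def cg_history(a, b, x_exact, iterations):
    """Run plain CG from x0 = 0.

    Returns one (A-norm of the error, 2-norm of the residual) pair per iteration.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A*x0 with x0 = 0
    p = list(r)  # first search direction
    history = []
    for _ in range(iterations):
        ap = matvec(a, p)
        alpha = dot(r, r) / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, ap)]
        e = [xi - ei for xi, ei in zip(x, x_exact)]
        history.append((math.sqrt(dot(e, matvec(a, e))),
                        math.sqrt(dot(r_new, r_new))))
        beta = dot(r_new, r_new) / dot(r, r)
        p = [ri + beta * pi for ri, pi in zip(r_new, p)]
        r = r_new
    return history


# Illustrative SPD matrix: diagonal with widely spread eigenvalues.
eigenvalues = [1.0, 10.0, 100.0, 1000.0]
a = [[eigenvalues[i] if i == j else 0.0 for j in range(4)] for i in range(4)]
x_exact = [1.0] * 4
b = matvec(a, x_exact)
history = cg_history(a, b, x_exact, 3)
for error_a_norm, residual_norm in history:
    print(error_a_norm, residual_norm)
```

In exact arithmetic only the first column (the A-norm of the error) is guaranteed to be non-increasing; CG gives no such guarantee for the second column.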

5.1.8. IB max_kept_direction parameter

run test
plot

5.1.9. BCG results 2

test case
When using the random loader "-l RD" with the BCG algorithm, the test program ensures that the matrix is SPD by computing:
```
A <- rand_mat()
A <- A^{T} * A
```
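The pseudocode above can be sketched in pure Python as follows. This is an illustration of the construction for a real matrix, not the code of the test program; the function names are hypothetical. A^T * A is always symmetric positive semi-definite, and it is positive definite when A has full column rank, which a random matrix has with probability one.

```python
import random


def random_matrix(n, seed=0):
    """Return an n x n matrix (list of rows) with entries uniform in [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]


def gram(a):
    """Return A^T * A for a square real matrix A."""
    n = len(a)
    return [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def quadratic_form(m, x):
    """Return x^T * M * x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


a = random_matrix(5)
spd = gram(a)

# x^T (A^T A) x equals ||A x||^2, which is why the result cannot be negative.
x = [1.0, -2.0, 0.5, 3.0, -1.0]
ax = [sum(a[i][j] * x[j] for j in range(5)) for i in range(5)]
print(quadratic_form(spd, x), sum(v * v for v in ax))
```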
graphic

5.1.10. BGCRO results check 3 consecutive solve

test case

Using the example file example_gcro.cpp
graphic

5.2. Timings

5.2.1. Influence of incremental QR factorization

run test case
plot the graphic

The curve for "full_gels facto" is zero because "full_gels" has no factorization part. Perhaps unexpectedly, the curve for "incremental_gels solve" is also zero: since GMRES is not a short-term recurrence algorithm, computing the complete least-squares solution is not mandatory at each iteration. Only the residual norm is needed, and it can be obtained from the incremental factorization without computing the actual residual.
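The last point can be sketched for the simpler single right-hand side case, where the incremental QR factorization of the Hessenberg matrix is done with Givens rotations. This is an assumption-laden illustration, not the block kernel used in fabulous: the function names and the small Hessenberg matrix are made up, and no breakdown (zero column) is assumed. The residual norm at step j is read off as the absolute value of the last entry of the rotated right-hand side, and the explicit solve is only used here to check that value.

```python
import math


def incremental_residuals(h, beta):
    """Factor the (m+1) x m upper Hessenberg matrix h column by column.

    Returns (residual norms per step, triangular factor, rotated right-hand side).
    """
    m = len(h[0])
    r = [row[:] for row in h]  # working copy, becomes upper triangular
    g = [beta] + [0.0] * m     # right-hand side beta * e1
    rotations = []
    residuals = []
    for j in range(m):
        # Apply the rotations of the previous steps to the new column only.
        for i, (c, s) in enumerate(rotations):
            top, bottom = r[i][j], r[i + 1][j]
            r[i][j] = c * top + s * bottom
            r[i + 1][j] = -s * top + c * bottom
        # New rotation that zeroes the subdiagonal entry of column j.
        norm = math.hypot(r[j][j], r[j + 1][j])
        c, s = r[j][j] / norm, r[j + 1][j] / norm
        rotations.append((c, s))
        r[j][j], r[j + 1][j] = norm, 0.0
        g[j], g[j + 1] = c * g[j], -s * g[j]  # g[j + 1] was zero before
        residuals.append(abs(g[j + 1]))       # residual norm, no solve needed
    return residuals, r, g


def explicit_residual_norm(h, beta, r, g):
    """Solve the triangular system, then compute ||beta * e1 - h * y|| directly."""
    m = len(h[0])
    y = [0.0] * m
    for i in reversed(range(m)):
        y[i] = (g[i] - sum(r[i][k] * y[k] for k in range(i + 1, m))) / r[i][i]
    rhs = [beta] + [0.0] * m
    res = [rhs[i] - sum(h[i][k] * y[k] for k in range(m)) for i in range(m + 1)]
    return math.sqrt(sum(v * v for v in res))


# Made-up 4 x 3 upper Hessenberg matrix.
h = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 1.0],
     [0.0, 1.5, 2.0],
     [0.0, 0.0, 0.7]]
residuals, r, g = incremental_residuals(h, 1.0)
print(residuals[-1], explicit_residual_norm(h, 1.0, r, g))
```

Only the final, converged iteration of a cycle needs the triangular solve; every other iteration can stop at `residuals.append`.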

5.2.2. qr ib dr timing

run test
plot
papi (commit 20c31809be1f04a1b575008f9527abbec07a2f21 / branch papi) (DEPRECATED)

DOES NOT WORK. Attempt to compare the PAPI hardware counters with the Gflops/s estimate we computed:

5.2.3. with chameleon (DEPRECATED) chameleon

basic-cham
1. run test
2. plot
  
  The following graph represents the relative percentage of time spent in each stage of the algorithm
QR big case
1. runs
2. graphics

5.2.4. big matrix test 'perf1' parade

Description
This section may contain big test cases, which may not work if your machine does not have enough memory.

Reminder:
- "-p" is the number of right-hand sides
- "-m" is the maximum size of the Krylov space
- "-M" is the maximum number of matrix-vector products

"BAD" matrices are matrices that theoretically have bad convergence:
- real case: tridiagonal with 4 on the diagonal, -1.0 under the diagonal and -2.99 over the diagonal
- complex case: tridiagonal with 4 - 2i on the diagonal, -2.99 + i under the diagonal and -1 + i over the diagonal

There are two kinds of tests: MAX_KRYLOV_SPACE = 2000 with DIMENSION = 5000, and MAX_KRYLOV_SPACE = 5000 with DIMENSION = 20000.
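For reference, the "BAD" matrices described above can be written down as follows. This is a sketch built only from the description in this section, not the generator used by the test program; the function name `bad_tridiagonal` is hypothetical, and the dense storage is only reasonable for small sizes.

```python
def bad_tridiagonal(n, complex_case=False):
    """Return the n x n 'BAD' tridiagonal matrix as a dense list of rows."""
    if complex_case:
        diag, sub, sup = 4 - 2j, -2.99 + 1j, -1 + 1j
        zero = 0j
    else:
        diag, sub, sup = 4.0, -1.0, -2.99
        zero = 0.0
    a = [[zero] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = diag
        if i > 0:
            a[i][i - 1] = sub  # entry under the diagonal
        if i < n - 1:
            a[i][i + 1] = sup  # entry over the diagonal
    return a


for row in bad_tridiagonal(4):
    print(row)
```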

MAX_MVP is set to a very large value (100000) so that all tests reach convergence.

The ORTHOGONALIZATION_SCHEME used is CGS (the cheapest)

There are two kinds of graphics:
- Time, or percentage of time, of certain steps in an iteration with respect to the size of the Krylov space (data from different restarts are therefore grouped together by Krylov space size).
- Time of steps, or total time of iterations, with respect to the global iteration number of the algorithm.
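The two views of the same timing data can be sketched as follows. The record layout (global iteration number, Krylov space size, seconds), the function names and the values are assumptions made for illustration; they do not describe the actual output format of fabulous_test or the plotting scripts.

```python
from collections import defaultdict
from itertools import accumulate


def mean_time_by_krylov_size(records):
    """Group timings from all restarts by Krylov space size and average them."""
    buckets = defaultdict(list)
    for _iteration, krylov_size, seconds in records:
        buckets[krylov_size].append(seconds)
    return {size: sum(times) / len(times)
            for size, times in sorted(buckets.items())}


def cumulated_time_by_iteration(records):
    """Return (global iteration, cumulated seconds) pairs in iteration order."""
    ordered = sorted(records)
    iterations = [iteration for iteration, _size, _seconds in ordered]
    totals = accumulate(seconds for _iteration, _size, seconds in ordered)
    return list(zip(iterations, totals))


# Made-up records: a restart happens after global iteration 2,
# so the Krylov space sizes 10 and 20 each occur twice.
records = [(1, 10, 0.5), (2, 20, 1.0), (3, 10, 0.25), (4, 20, 1.5)]
print(mean_time_by_krylov_size(records))
print(cumulated_time_by_iteration(records))
```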

Batch script

#SBATCH --job-name=fabulous_perf1
#SBATCH --output=fabulous_perf1_out_1
#SBATCH --error=fabulous_perf1_err_1
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=04:00:00
#SBATCH --partition=routage
#SBATCH --constrain=bora

#source /home/${SBATCH_ACCOUNT}/.bashrc
#source /home/${SBATCH_ACCOUNT}/fabulous/.plafrim_module_used
#export STARPU_FXT_PREFIX=/home/tmijieux/fabulous/build/
#export STARPU_GENERATE_TRACE=1
FABULOUS_DIR=/home/msimonin/Repositories/fabulous
cd ${FABULOUS_DIR}/build/

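# Quoted so that the whole option string is assigned; ${GUIX_ENV} is left unquoted below so that it splits into separate arguments.
GUIX_ENV="--pure fabulous --with-source=fabulous=${FABULOUS_DIR}"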

guix environment ${GUIX_ENV} --  ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -o perf1_std_1 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QR -o perf1_qr_1 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A IB -o perf1_ib_1 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QRDR -k 20 -r DEFLATED -o perf1_qrdr_1 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 5000 -m 2000 -M 100000 -s CGS -A QRIBDR -o perf1_qrib_1 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A IB -o perf1_ib_2 -x
guix environment ${GUIX_ENV} -- ./src/test/cmd/fabulous_test -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A QRIBDR -o perf1_qrib_2 -x
#./fabulous_test_cham -S BAD -l BAD -n 20000 -m 5000 -M 100000 -s CGS -A CHAMIB -o perf1_cham_ib_2 -x
tar czvf perf1_results.tar.gz *.res *.kernel.txt
mv perf1_results.tar.gz /home/$SBATCH_ACCOUNT

IB and QR-IB
1. runs
  1. little
  2. less little
2. graphic
  1. little
    Here ib_1 is the INEXACT BREAKDOWN variant with the full least-squares solve, while qrib_1 is the IB variant with incremental factorization.
    
    The test cases ending with _1 have the following notable parameters: problem/vector size = 5000, maximum size of Krylov space = 2000.
    - gels perf1_ib_1 is the copy of the Hessenberg matrix plus the full least-squares kernel call.
    - facto perf1_ib_1 corresponds to nothing (it would be the factorization part, and is therefore null).
    - facto perf1_qrib_1 is the incremental factorization, with factorization of the last block row and last block column.
    - gels perf1_qrib_1 is the solve in the QRIB case, which contains:
      
      a copy of the Hessenberg matrix,
      
      a piece of factorization with the last row (TSQRT L over H),
      
      the application of all factorization updates to the right-hand sides (because of the double update: IB and incremental QR),
      
      finally the 'solve' (triangular trsm kernel).
    The first graph shows the percentage of the time of an iteration taken by the parts presented above. The other important parts of an iteration in this case are the matrix-vector multiplication (rather constant across iterations) and the basis orthogonalization step, whose duration depends on the Krylov space size (abscissa).
    
    The second graph shows the raw duration of each step.
    
    The third graph shows the cumulated times of the iterations, for the IB and QRinc-IB versions.
  2. less little
    
    Same thing with a bigger test case: vector/problem size = 20000, maximum size of Krylov space = 5000.
QR and Restarting
1. runs
2. graphics
IB and CHAM-IB (DEPRECATED) chameleon

The tests in this section compare poorly against each other because the parts of the chameleon-linked executable that do not use chameleon do not benefit from multi-threaded kernels (chameleon requires linking against a sequential BLAS; see NOTES.org, section "linking with lapacke/cblas kernels").
1. runs
  
  perf1_ib_2 from less little
  
  IB CHAM_IB
2. graphic
Kernel parameters analysis

parade REMOTE

Setup connection with plafrim

export TERM=xterm
hostname
echo $WORKDIR

Launch the experiments

cd ${WORKDIR}
#git pull origin develop
emacs -batch --load ~/.emacs.d/init.el RESULTS.org --funcall org-babel-tangle
sbatch scripts/batch_perf1.sh
squeue -u tmijieux

parade RETRIEVE LOCAL