GitHub - d-daniel/MPI: Some MPI Samples

Some MPI Samples

Blowfish results on VM cluster:

Brief description of how the benchmarks were performed:

Initialization: time to read the input file, calculate the offsets for splitting the file into chunks of data (if needed), allocate memory and initialize blowfish parameters;
Computation: time to encrypt, decrypt, write data to memory and send data to rank 0 (gather);
Total: total application time including memory deallocation and verification of correctness.
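
The offset calculation mentioned under Initialization can be sketched as follows. This is only an illustration, not the repository's code: the helper name `block_ranges` and the policy of giving leftover blocks to the lowest ranks are assumptions. The file size and the 64-bit (8-byte) block size come from the logs below.

```python
def block_ranges(file_size_bytes, num_ranks, block_bytes=8):
    """Return a (byte_offset, block_count) pair for each rank.

    Hypothetical helper: assumes the file size is a multiple of the
    block size, as in the runs reported here. Padding of a trailing
    partial block is not handled.
    """
    if file_size_bytes % block_bytes != 0:
        raise ValueError("file size must be a multiple of the block size")
    total_blocks = file_size_bytes // block_bytes
    base, extra = divmod(total_blocks, num_ranks)
    ranges = []
    offset = 0
    for rank in range(num_ranks):
        # Assumed policy: the first `extra` ranks take one additional block.
        count = base + (1 if rank < extra else 0)
        ranges.append((offset, count))
        offset += count * block_bytes
    return ranges


# 5MB file from the logs, split across 4 ranks.
for rank, (offset, count) in enumerate(block_ranges(5572792, 4)):
    print(f"rank {rank}: offset {offset} bytes, {count} blocks")
```

Splitting on block boundaries keeps every rank's chunk a whole number of 64-bit blocks, so no block straddles two ranks.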

SERIAL 5MB FILE

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f file.txt -k bioinformatics (family=openmpi3)
File size: 5572792 bytes split into 696599 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.038672s
  Computation:    0.063190s
  Total:          0.102423s

SERIAL 500MB FILE

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f book.txt -k bioinformatics (family=openmpi3)
File size: 500563104 bytes split into 62570388 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 1.820641s
  Computation:    5.905069s
  Total:          7.737537s

SERIAL 1GB FILE

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f tome.txt -k bioinformatics (family=openmpi3)
File size: 1001126208 bytes split into 125140776 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 3.826321s
  Computation:    11.841680s
  Total:          15.692811s

In the serial version, computation time dominates for all problem sizes, and verification time is negligible.

MPI 5MB -n 4 -N 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f file.txt -k bioinformatics (family=openmpi3)
File size: 5572792 bytes split into 696599 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.026710s
  Computation:    0.029338s
  Total:          0.056504s

MPI 500MB -n 4 -N 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f book.txt -k bioinformatics (family=openmpi3)
File size: 500563104 bytes split into 62570388 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 1.741193s
  Computation:    2.313624s
  Total:          4.065768s

MPI 1GB -n 4 -N 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f tome.txt -k bioinformatics (family=openmpi3)
File size: 1001126208 bytes split into 125140776 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 3.389017s
  Computation:    4.800958s
  Total:          8.211742s

With 4 MPI ranks (4 nodes) and a single thread per node, there is some speedup, although it falls short of the number of nodes: for the 500MB and 1GB files, computation time drops by a factor of roughly 2.5 and total time by roughly 1.9. Initialization times stay close to those of the serial version because rank 0 still reads the whole file for verification. There is no message passing in the initialization phase (no scatter). Computation time is not as dominant as in the serial version.
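
To make the speedup claim concrete, the sketch below computes speedup and parallel efficiency from the 1GB timings reported above for the VM cluster. The function names are illustrative, and efficiency is taken here as speedup divided by the number of ranks, which is one common definition.

```python
def speedup(serial_seconds, parallel_seconds):
    """Ratio of serial time to parallel time."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Speedup divided by the number of workers (1.0 would be ideal)."""
    return speedup(serial_seconds, parallel_seconds) / workers


# 1GB file on the VM cluster: serial run versus MPI with 4 ranks.
serial = {"computation": 11.841680, "total": 15.692811}
mpi_4_ranks = {"computation": 4.800958, "total": 8.211742}

for phase in ("computation", "total"):
    s = speedup(serial[phase], mpi_4_ranks[phase])
    e = efficiency(serial[phase], mpi_4_ranks[phase], workers=4)
    print(f"{phase}: speedup {s:.2f}, efficiency {e:.2f}")
```

For these runs the computation speedup comes out at roughly 2.5 and the total speedup at roughly 1.9 on 4 ranks.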

MPI+OpenMP 5MB -n 4 -N 4 -c 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f file.txt -k bioinformatics (family=openmpi3)
File size: 5572792 bytes split into 696599 blocks of 64 bits
Executing with 4 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 0.020857s
  Computation:    0.043846s
  Total:          0.065160s

MPI+OpenMP 500MB -n 4 -N 4 -c 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f book.txt -k bioinformatics (family=openmpi3)
File size: 500563104 bytes split into 62570388 blocks of 64 bits
Executing with 4 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 1.866346s
  Computation:    1.243591s
  Total:          3.121033s

MPI+OpenMP 1GB -n 4 -N 4 -c 4

[prun] Master compute host = compnode0
[prun] Resource manager = slurm
[prun] Launch cmd = mpirun ./blowfish -f tome.txt -k bioinformatics (family=openmpi3)
File size: 1001126208 bytes split into 125140776 blocks of 64 bits
Executing with 4 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 4.590790s
  Computation:    2.585831s
  Total:          7.198708s

With 4 MPI ranks (4 nodes) and 4 threads per node, the speedup relative to the serial version is good but not great for a hybrid code: on 16 threads in total, computation is roughly 4.6 to 4.7 times faster for the 500MB and 1GB files. The results make clear that this is an IO-bound application. For the 500MB and 1GB files, initialization time is now dominant.
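
One way to see why the application ends up IO-bound is an Amdahl-style bound. The sketch assumes that initialization is a fixed cost that does not shrink as ranks or threads are added, which the timings above suggest but do not prove. Under that assumption the total speedup can never exceed the serial total divided by the serial initialization time. `max_total_speedup` and `predicted_total` are hypothetical helper names.

```python
def max_total_speedup(serial_init, serial_total):
    """Upper bound on total speedup if initialization stays fixed
    and everything else could be made arbitrarily fast."""
    return serial_total / serial_init


def predicted_total(serial_init, serial_total, rest_speedup):
    """Estimated total time when only the non-initialization part
    is accelerated by `rest_speedup`."""
    return serial_init + (serial_total - serial_init) / rest_speedup


# Serial timings on the VM cluster (initialization, total), in seconds.
runs = {"500MB": (1.820641, 7.737537), "1GB": (3.826321, 15.692811)}

for name, (init, total) in runs.items():
    bound = max_total_speedup(init, total)
    print(f"{name}: total speedup bounded by about {bound:.2f}")
```

For the 1GB file the bound is about 4.1, so adding more threads beyond this point mostly shifts the balance toward initialization.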

Blowfish results on Nucleus:

SERIAL 5MB FILE

File size: 5572792 bytes split into 696599 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.056338s
  Computation:    0.088864s
  Total:          0.145710s

SERIAL 500MB FILE

File size: 500563104 bytes split into 62570388 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.548407s
  Computation:    7.980852s
  Total:          8.540638s

SERIAL 1GB FILE

File size: 1001126208 bytes split into 125140776 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 1.124930s
  Computation:    15.953582s
  Total:          17.101078s

Note the decrease in initialization times due to the parallel file system on Nucleus (the 5MB file is an exception). Compute times are worse, though.

MPI 5MB -n 4 -N 4

File size: 5572792 bytes split into 696599 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.009882s
  Computation:    0.025483s
  Total:          0.035448s

MPI 500MB -n 4 -N 4

File size: 500563104 bytes split into 62570388 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.449274s
  Computation:    2.200417s
  Total:          2.660147s

MPI 1GB -n 4 -N 4

File size: 1001126208 bytes split into 125140776 blocks of 64 bits
PASSED
  Execution times:
  Initialization: 0.944908s
  Computation:    4.377276s
  Total:          5.342119s

With 4 MPI ranks (4 nodes) and a single thread per node, good speedup is achieved: computation is about 3.6 times faster for the 500MB and 1GB files, close to the number of nodes. These values are better than those of the VM cluster, due to InfiniBand. For the larger files, initialization times remain close to those of the serial version because rank 0 reads the whole file for verification.

MPI+OpenMP 5MB -n 4 -N 4 -c 16

File size: 5572792 bytes split into 696599 blocks of 64 bits
Executing with 16 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 0.007641s
  Computation:    0.008256s
  Total:          0.015984s

MPI+OpenMP 500MB -n 4 -N 4 -c 16

File size: 500563104 bytes split into 62570388 blocks of 64 bits
Executing with 16 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 0.489728s
  Computation:    0.393478s
  Total:          0.893121s

MPI+OpenMP 1GB -n 4 -N 4 -c 16

File size: 1001126208 bytes split into 125140776 blocks of 64 bits
Executing with 16 OpenMP threads and 4 MPI ranks
PASSED
  Execution times:
  Initialization: 0.982534s
  Computation:    0.747577s
  Total:          1.749635s

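With 4 MPI ranks (4 nodes) and 16 threads per node, there is further speedup, although it is not ideal: on 64 threads in total, computation is roughly 20 to 21 times faster than in the serial version for the 500MB and 1GB files. Once again, the results show that this is an IO-bound application. For the 500MB and 1GB files, initialization time is now dominant.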
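
The shift toward initialization can be quantified by expressing each phase as a share of total time. The sketch below does this for the 1GB runs on Nucleus. `phase_shares` is an illustrative helper, and the "other" share is simply whatever the reported total leaves after initialization and computation (deallocation and verification, per the description at the top).

```python
def phase_shares(initialization, computation, total):
    """Return each phase as a fraction of the total run time."""
    other = total - initialization - computation
    return {
        "initialization": initialization / total,
        "computation": computation / total,
        "other": other / total,
    }


# 1GB file on Nucleus, timings in seconds as reported above.
runs = {
    "serial": (1.124930, 15.953582, 17.101078),
    "MPI+OpenMP (4 ranks, 16 threads)": (0.982534, 0.747577, 1.749635),
}

for name, times in runs.items():
    shares = phase_shares(*times)
    summary = ", ".join(f"{k} {v:.1%}" for k, v in shares.items())
    print(f"{name}: {summary}")
```

In the serial run initialization is under a tenth of the total, while in the hybrid run it is a little over half.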