PRDMA

Overview

PRDMA implements a restricted form of the persistent communication defined in the MPI standard in order to achieve better performance. It assumes that all senders and receivers have set up their communication with the MPI_Send_init and MPI_Recv_init primitives, and that the same communication patterns are reused without repeating the initialization. It also assumes that MPI_Startall and MPI_Waitall are issued by all participating processes, i.e., in the style of a global operation.
For example, in a stencil computation, the communication patterns are fixed during the main computation. The patterns can therefore be initialized with MPI_Send_init and MPI_Recv_init in all processes before entering the main computation, and MPI_Startall and MPI_Waitall are then issued during the main computation. PRDMA does not require users to modify their source code if it is written using this restricted form of MPI persistent communication. Users link their code with the PRDMA library, statically or dynamically, and the library replaces the MPI functions with its own versions.


Implementation: libprdma

libprdma is a library that implements the MPI persistent communication primitives to reduce communication latency and to improve the overlap between computation and communication on RDMA-enabled interconnects.

PRDMA stands for Persistent Remote Direct Memory Access.

On the K computer, libprdma runs on top of the Open MPI-based Fujitsu MPI and replaces the original API functions related to MPI persistent communication.


How to use the libprdma library

libprdma can run your MPI persistent communication program without any modifications.

To use libprdma on the K computer, the library must be dynamically loaded when your MPI application starts. This is done by setting the LD_PRELOAD environment variable.

Typically, change the following line in your job script:

mpiexec ./a.out ...

to

PRDMA_HOME=/opt/aics/prdma/current
mpiexec \
    -x LD_PRELOAD=${PRDMA_HOME}/lib64/libprdma.so \
    ./a.out ...

Here, the "-x NAME=VALUE" option of mpiexec runs the user's program on the remote nodes with the environment variable NAME set to VALUE.
The library is installed on both the "K" login nodes and the compute nodes at
/opt/aics/prdma/current/lib64/libprdma.so.

See also the sample job script /opt/aics/prdma/templates/run-templ-01.sh on the "K" login nodes.
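
The change to the job script can be sketched as a small wrapper. This is only a minimal sketch and not part of libprdma: the function name build_cmd and the fallback to a plain mpiexec when the library file is not readable are assumptions made for illustration. Only the install path and the "-x LD_PRELOAD=..." option are taken from the description above. The script prints the command line instead of running it, so it can be tried on a machine without mpiexec.

```shell
#!/bin/sh
# Sketch: build the mpiexec command line, preloading libprdma only if the
# library file is readable on this node.
# build_cmd is a hypothetical helper, not something shipped with libprdma.

PRDMA_HOME=${PRDMA_HOME:-/opt/aics/prdma/current}
PRDMA_LIB=${PRDMA_HOME}/lib64/libprdma.so

build_cmd() {
    # "$*" is the application followed by its arguments.
    if [ -r "${PRDMA_LIB}" ]; then
        echo "mpiexec -x LD_PRELOAD=${PRDMA_LIB} $*"
    else
        # Assumed fallback: run without PRDMA rather than fail.
        echo "mpiexec $*"
    fi
}

CMD=$(build_cmd ./a.out)
echo "${CMD}"
# In a real job script the command would be executed instead of printed.
```

Falling back to the unmodified command keeps the same script usable on systems where the library is not installed, which is a design choice of this sketch rather than a requirement of PRDMA.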


The following environment variables are used as options in PRDMA.

1)

PRDMA_NOSYNC
If the PRDMA_NOSYNC variable is set to 1, each process starts the MPI_Startall operation without synchronizing with the other processes. This option is useful if the programmer knows that all processes may start the MPI_Startall operation independently, for example because some synchronization, such as a reduction operation, has already been issued and all memory areas involved in the persistent communication are ready to send or receive after it.

2)

PRDMA_RDMASIZE
If the message size is smaller than the value of PRDMA_RDMASIZE (in bytes), the original persistent communication primitive is used. The default value is 13 Kbytes.

3)

PRDMA_STATISTIC
Writes statistics on persistent communications to the error log file.

4)

PRDMA_VERBOSE
Writes the options that have been specified to the error log file.

5)

The following environment variables are for debugging purposes.
PRDMA_TRACESIZE
PRDMA_TRACETYPE
PRDMA_NOTRUNK
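
The options above can be illustrated with a short shell sketch. It rests on several assumptions that the text does not confirm: that the options are passed to the remote processes with "mpiexec -x NAME=VALUE" in the same way as LD_PRELOAD, that PRDMA_RDMASIZE takes a plain integer number of bytes, that "13 Kbytes" means 13 * 1024 bytes, and that messages at or above the threshold take the PRDMA path. The helper names prdma_path and prdma_flags are hypothetical and exist only in this example.

```shell
#!/bin/sh
# Sketch: reason about the PRDMA_RDMASIZE threshold and build -x flags.
# prdma_path and prdma_flags are hypothetical helpers for illustration.

# Assumption: the default of "13 Kbytes" is read as 13 * 1024 bytes.
DEFAULT_RDMASIZE=$((13 * 1024))

# Prints which implementation a message of $1 bytes would use for the
# threshold $2 (default: DEFAULT_RDMASIZE). Messages smaller than the
# threshold use the original primitive; the rest are assumed to use PRDMA.
prdma_path() {
    size=$1
    threshold=${2:-${DEFAULT_RDMASIZE}}
    if [ "${size}" -lt "${threshold}" ]; then
        echo "original"
    else
        echo "prdma"
    fi
}

# Prints one "-x NAME=VALUE" flag for each NAME=VALUE argument.
prdma_flags() {
    flags=""
    for opt in "$@"; do
        flags="${flags}${flags:+ }-x ${opt}"
    done
    echo "${flags}"
}

prdma_path 4096           # smaller than the default threshold
prdma_path 65536          # not smaller than the default threshold
prdma_flags PRDMA_NOSYNC=1 PRDMA_RDMASIZE=65536
```

The flags printed by the last line would be placed on the mpiexec command line next to the LD_PRELOAD setting, provided the assumption about "-x" holds for these variables.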


References:

[ 1 ]

Yutaka Ishikawa, Kengo Nakajima, and Atsushi Hori, "Revisiting Persistent Communication in MPI," EuroMPI 2012: Recent Advances in the Message Passing Interface, LNCS 7490, pp. 296-297, Springer Netherlands, 2012 (poster)

[ 2 ]

Masayuki Hatanaka, Atsushi Hori, and Yutaka Ishikawa, "Optimization of MPI Persistent Communication," Submitted to EuroMPI 2013, 2013.