Kernel Level Blocking MPI

An Efficient Kernel-Level Blocking MPI Implementation

Blocking and Non-blocking

  • MPI communication functions
    • Blocking and non-blocking
    • .....
  • Blocking MPI functions are the focus
    • Blocking Receive Functions

Background

  • Blocking implementation in most MPI libraries
    • User-Level Blocking (ULB)
      • utilizing the user-level communication
        [U-Net by von Eicken, 1995]
      • to achieve high comm. performance
    • Kernel-Level Blocking (KLB)
      • believed to have low comm. performance

Current KLB Performance


  • Some MPI implementations offer a KLB option, but it exhibits poor performance.

Each node has two Nehalem processors (4 cores, 2.67 GHz).
The 16 nodes are connected by a QDR InfiniBand network.
In this evaluation, the number of processes is set to 64 (16x4).

klb01

Why is the KLB performance so poor?

  • Assumption
    • System call overhead is too high.
  • Possible Remedy
    • Two-phase blocking [Ousterhout 1982]
    • A busy-wait loop followed by a blocking system call
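
To make the two-phase idea concrete, here is a minimal sketch in C. It is not taken from any MPI implementation: the `channel` structure, the `ready` flag, the wake-up pipe, and the name `two_phase_wait` are all assumptions made for illustration. The receiver first spins on a user-space flag for at most `threshold` iterations and only then falls back to a blocking read() system call.

```c
#include <errno.h>
#include <stdatomic.h>
#include <unistd.h>

enum wait_result { WAIT_ERROR = -1, WAIT_SPIN = 1, WAIT_BLOCKED = 2 };

/* Hypothetical channel: 'ready' is polled in user space, and 'wake_fd'
 * is the read end of a pipe to which the sender writes one byte. */
struct channel {
    atomic_int ready;
    int        wake_fd;
};

/* Phase 1: busy-wait for up to 'threshold' iterations.
 * Phase 2: give up the CPU by blocking in the kernel. */
enum wait_result two_phase_wait(struct channel *ch, long threshold)
{
    for (long i = 0; i < threshold; i++) {
        if (atomic_load_explicit(&ch->ready, memory_order_acquire))
            return WAIT_SPIN;          /* arrived while spinning */
    }

    char byte;
    ssize_t n;
    do {
        n = read(ch->wake_fd, &byte, 1);   /* sleeps until the sender writes */
    } while (n < 0 && errno == EINTR);

    return n == 1 ? WAIT_BLOCKED : WAIT_ERROR;
}
```

A real implementation would also have to drain a wake-up byte that was written after the spin phase had already succeeded; the sketch leaves that out to keep the two phases visible.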

Two-phase Blocking May Not Help
  • Two-phase blocking (TPB) can improve performance somewhat.
klb02
Why two-phase blocking does not help
  • Performance can be improved (thr<100,000)
  • Performance is degraded (thr>100,000)
klb_03
The threshold (thr) is the number of iterations of the busy-wait loop.


How can the KLB problem be avoided?

Progress Routine
  • Typical progress routine
MPID_progress_function( ... ) {
   try_recv_message( recv_queue );
   if( send_queue ) send_message( send_queue );
}
  • Naive KLB progress routine
MPID_progress_function( ... ) {
   KLB_recv_message( recv_queue );
   if( send_queue ) send_message( send_queue );
}
  • Outgoing messages in the send_queue are postponed because the blocking receive does not return until a message arrives.
Context of calling the progress routine
  • The MPI progress routine can be called at any time in an MPI library
  • The blocking receive in the progress routine takes place ONLY when:
    • Incoming messages do not arrive immediately (Two-Phase Blocking)
    • There are no messages to be sent (send_queue is empty)
    • MPI functions (e.g. MPI_Wait) allow blocking (WaitEvent)
KLB Aware Progress Routine
  • Below is the KLB aware progress routine
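ProgressRoutine() {
   /* Block only when the caller may wait and nothing is queued for sending */
   if( isWaitingEvent() && isEmpty( send_queue ) ) {
        int iter = 0;
        /* Phase 1: poll at user level up to Threshold times */
        while( iter++ < Threshold ) {
           if( ULB_MsgRecv() ) return;
        }
        /* Phase 2: block in the kernel until a message arrives */
        KLB_MsgRecv();
   } else {
        ULB_MsgRecv();
        MsgSend( send_queue );
   }
}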
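
The decision logic of the routine above can be exercised with a small mock in C. Everything here is an assumption for illustration: the `progress_state` structure, its counters, and the stub functions stand in for the MPI library's internals, and the "blocking" receive only increments a counter instead of sleeping in the kernel. The point is to show that the kernel-level path is taken only when blocking is allowed, the send queue is empty, and the user-level polls have all failed.

```c
#include <stdbool.h>

#define THRESHOLD 1000   /* illustrative value, not a tuned setting */

/* Hypothetical state standing in for the MPI library's internals. */
struct progress_state {
    bool waiting_event;  /* caller (e.g. MPI_Wait) permits blocking */
    int  send_pending;   /* number of queued outgoing messages */
    int  recv_ready;     /* number of messages already delivered */
    int  ulb_polls;      /* counters used only to observe behaviour */
    int  klb_blocks;
    int  sends_done;
};

/* User-level poll: returns true if a message was consumed. */
static bool ulb_msg_recv(struct progress_state *s)
{
    s->ulb_polls++;
    if (s->recv_ready > 0) { s->recv_ready--; return true; }
    return false;
}

/* Stub: a real implementation would sleep in the kernel here. */
static void klb_msg_recv(struct progress_state *s)
{
    s->klb_blocks++;
}

/* Stub: flush every queued outgoing message. */
static void msg_send(struct progress_state *s)
{
    s->sends_done += s->send_pending;
    s->send_pending = 0;
}

void progress_routine(struct progress_state *s)
{
    if (s->waiting_event && s->send_pending == 0) {
        for (int iter = 0; iter < THRESHOLD; iter++) {
            if (ulb_msg_recv(s)) return;   /* arrived while polling */
        }
        klb_msg_recv(s);                   /* nothing arrived: block */
    } else {
        ulb_msg_recv(s);                   /* never block on this path */
        msg_send(s);
    }
}
```

With two sends pending, one call performs a single user-level poll, flushes both sends, and never blocks; with an empty queue and no incoming message, it polls `THRESHOLD` times and then blocks once.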

Multi-protocol Issue (1)
  • Multi-protocol
    • Inter-node and Intra-node communication
    • Quite common comm. architecture
  • Cascading blocking system calls must be avoided
klb_04
Multi-protocol Issue (2)
  • MPICH-SCore
    • PMX (low-level comm. layer) combines the InfiniBand (ibverbs) and shmem protocols into one abstracted network device.
    • To block on the shmem device, a Unix pipe is used.
    • The select() system call is used to block on the InfiniBand and shmem protocols at the same time.
klb_05
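
The single-blocking-point idea can be sketched with select() in C. This is not PMX code: the function names, the `READY_*` flags, and the assumption that the InfiniBand side exposes a readable file descriptor (for example, one associated with a completion event channel) are all illustrative. Both descriptors are handed to one select() call, so the process sleeps once and wakes up whichever protocol delivers first.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

#define READY_IB    0x1   /* hypothetical flag: InfiniBand fd is readable */
#define READY_SHMEM 0x2   /* hypothetical flag: shmem wake-up pipe is readable */

/* Block until at least one of the two descriptors is readable.
 * Returns a bitmask of READY_* flags, or -1 on error. */
int wait_any(int ib_fd, int shmem_fd)
{
    fd_set rfds;
    int maxfd = ib_fd > shmem_fd ? ib_fd : shmem_fd;
    int n;

    do {
        /* select() modifies the set, so rebuild it on every retry */
        FD_ZERO(&rfds);
        FD_SET(ib_fd, &rfds);
        FD_SET(shmem_fd, &rfds);
        n = select(maxfd + 1, &rfds, NULL, NULL, NULL);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;

    int ready = 0;
    if (FD_ISSET(ib_fd, &rfds))    ready |= READY_IB;
    if (FD_ISSET(shmem_fd, &rfds)) ready |= READY_SHMEM;
    return ready;
}

/* Consume one wake-up byte from the shmem pipe after select() reports it. */
int drain_wakeup(int shmem_fd)
{
    char byte;
    return read(shmem_fd, &byte, 1) == 1 ? 0 : -1;
}
```

Blocking in one place avoids the cascading case where the process sleeps on one protocol while a message is already waiting on the other.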

Evaluations of the improved KLB performance

Evaluation of KLB MPI
klb06
Evaluation Details
klb07


Why do we need KLB MPI?

Possible Benefits of KLB MPI
  • Do we need KLB MPI even if the performance is the same?
  • My answer is "YES," for the following reasons:
    • MPI programs spawning processes and/or threads may run more effectively
    • KLB MPI can decrease power consumption
    • KLB MPI can tolerate OS jitter

Summary

  • KLB MPI can exhibit the same performance as ULB MPI
  • The key to implementing KLB MPI lies in the progress routine
  • Future Work
    • Power consumption of KLB MPI
    • OS jitter tolerance of KLB MPI
    • Auto-tuning algorithm for the threshold