PVAS

New Task Model for Efficient Intra-node Communication on a Many-core Architecture

Background

  • Parallel applications on many-core architectures
    • The number of parallel processes within a computing node becomes larger
    • Intra-node communication between parallel processes can be an important issue
  • Key issues for intra-node communication
    • Performance
    • Memory footprint
      • The amount of per-core memory resources in many-core architectures is strictly limited

Problem

  • Memory mapping schemes (shared memory, XPMEM, etc.) incur extra costs
    • A double memory copy via a memory-mapped region results in high communication latency
    • Memory mappings among parallel processes result in an excessive memory footprint
      • O(N²) page table entries are created in kernel space, since each of the N processes maps regions exported by the other N-1 processes (see the rough count below)
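A back-of-the-envelope count makes the scaling concrete (a sketch only; N is the number of processes and S the size of the region mapped per pair, both generic symbols rather than measured values; 4 KiB pages and 8-byte page table entries are assumed):

```latex
% Rough page-table cost of all-to-all cross-process mappings
% (assumed: 4 KiB pages, 8-byte page table entries, one S-byte mapping per ordered pair)
\[
  \underbrace{N(N-1)}_{\text{mappings}}
  \times
  \underbrace{\frac{S}{4\,\mathrm{KiB}}}_{\text{pages per mapping}}
  \times
  \underbrace{8\,\mathrm{B}}_{\text{bytes per entry}}
  = \Theta(N^2)\ \text{bytes of page tables in kernel space}
\]
```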

Proposal

  • Removing address space boundaries between parallel processes by using the PVAS task model
    • PVAS allows multiple processes (PVAS tasks) to run in the same virtual address space
    • Each PVAS task occupies a partitioned region of a virtual address space instead of having an entire virtual address space of its own (see Figure 1)
    • PVAS tasks within the same virtual address space use the same page table tree
    • If the parallel processes are spawned as PVAS tasks within the same virtual address space, there are no address space boundaries among them (a layout sketch follows Figure 1)
Figure 1: Semantic views of the address space
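The partitioning in Figure 1 can be pictured with a small address calculation (a minimal sketch; the constants and the helper name are hypothetical and not the actual PVAS interface):

```c
#include <stdint.h>

/* Hypothetical layout constants, for illustration only: every PVAS task owns
 * a fixed-size slice of one shared virtual address space. */
#define PVAS_BASE       0x100000000000ULL  /* start of the PVAS region (assumed) */
#define PVAS_PART_SIZE  0x000010000000ULL  /* partition size per task (assumed)  */

/* Base address of the partition owned by PVAS task `id`.  Because all tasks
 * share one page table tree, this address is valid in every task without any
 * additional memory mapping. */
static inline uintptr_t pvas_partition_base(int id)
{
    return (uintptr_t)(PVAS_BASE + (uint64_t)id * PVAS_PART_SIZE);
}
```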

Intra-node Communication Using PVAS

  • Parallel processes can exchange messages simply by using load/store instructions
  • Memory mappings crossing the parallel processes are not required for intra-node communication (see the sketch below)
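As a minimal sketch of such an exchange (the mailbox layout, its placement in the receiver's partition, and the flag protocol are illustrative assumptions, not part of PVAS itself):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* A single-slot mailbox assumed to live in the receiver's partition. */
typedef struct {
    atomic_int ready;        /* 0 = empty, 1 = message available */
    size_t     len;
    char       payload[256];
} mailbox;

/* Sender: plain stores into the receiver's mailbox.  The address is already
 * valid because both tasks share one virtual address space. */
void send_msg(mailbox *peer_box, const void *buf, size_t len)
{
    memcpy(peer_box->payload, buf, len);
    peer_box->len = len;
    atomic_store_explicit(&peer_box->ready, 1, memory_order_release);
}

/* Receiver: plain loads from its own mailbox. */
size_t recv_msg(mailbox *my_box, void *buf)
{
    while (!atomic_load_explicit(&my_box->ready, memory_order_acquire))
        ;                                     /* spin until a message arrives */
    size_t len = my_box->len;
    memcpy(buf, my_box->payload, len);
    atomic_store_explicit(&my_box->ready, 0, memory_order_release);
    return len;
}
```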

Case Study: MPI Intra-node Communication

  • SM BTL*1
    • eager communication: The sender copies a message to an eager buffer allocated in the shared memory region. The receiver copies the message from the eager buffer. In eager communication, send and receive operations can be decoupled.
    • rendezvous communication: The sender copies a message to an intermediate buffer located in the shared memory region. The receiver copies the message from the intermediate buffer. The sender must wait until the receiver finishes its receive operation.
*1 Open MPI Byte Transfer Layer
  • PVAS BTL
    • eager communication: Eager communication is performed by memory copies via an eager buffer, as in the SM BTL. However, the eager buffer does not need to be allocated in the shared memory region.
    • rendezvous communication: The receiver copies the message directly from the sender's send buffer (a simplified sketch of both paths follows Figure 2).
Figure 2: MPI intra-node communication
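A minimal sketch of the two PVAS BTL copy paths (function and structure names are illustrative, not Open MPI internals; synchronization is simplified to a flag):

```c
#include <stddef.h>
#include <string.h>

/* Per-pair eager slot; with PVAS it can live in either task's partition
 * instead of a shared-memory segment. */
struct eager_slot {
    volatile int full;       /* simplified synchronization flag */
    size_t       len;
    char         data[8192];
};

/* Eager path: still two copies, as in the SM BTL. */
void eager_send(struct eager_slot *slot, const void *sbuf, size_t len)
{
    memcpy(slot->data, sbuf, len);        /* 1st copy: send buffer -> eager buffer */
    slot->len  = len;
    slot->full = 1;
}

void eager_recv(struct eager_slot *slot, void *rbuf)
{
    while (!slot->full)
        ;                                 /* wait for the sender */
    memcpy(rbuf, slot->data, slot->len);  /* 2nd copy: eager buffer -> recv buffer */
    slot->full = 0;
}

/* Rendezvous path: the sender only publishes the address of its send buffer,
 * and the receiver performs a single direct copy from it. */
void rendezvous_recv(void *rbuf, const void *sender_sbuf, size_t len)
{
    memcpy(rbuf, sender_sbuf, len);       /* one copy across the shared address space */
}
```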

Evaluation

  • Environment
    • Xeon Phi 5110P (60 cores / 240 hardware threads)
  • Performance (Figure 3)
    • PVAS BTL shows the best performance
*2 Kernel-Nemesis: OS kernel support for single-copy communication
Figure 3: IMB-Pingpong latency (rendezvous)

  • Memory footprint (Figure 4)
    • The total page table size is reduced by up to 256 MB
Figure 4: Memory footprint and total page table size when executing IMB-Alltoall (eager)