MPI-IO

Striping Layout Aware Data Aggregation for High Performance I/O on a Lustre File System
(* Part of this research work was supported by JSPS KAKENHI Grant Number 25330148.)

Outline

  • Research background
  • Objective of our research work
  • MPI-IO
  • Two-Phase I/O of ROMIO
  • Optimized aggregator layout in ROMIO
  • Performance evaluation
  • Summary

Research background

  • MPI-IO
    • High-performance (parallel) I/O using MPI-IO
    • Parallel I/O driver layer in application-oriented I/O libraries (e.g., HDF5, Parallel netCDF (PnetCDF))
    • Optimizing collective I/O has a big performance impact.
  • ROMIO: the most commonly used MPI-IO library
    • Available in MPICH and Open MPI
    • Two-Phase I/O: an optimization for collective I/O
      • Combination of file I/O and data exchanges
      • The file I/O task is given to some or all MPI processes (aggregators).
      • Multiple aggregators per node may incur a performance penalty because the aggregator layout is not optimized.
        • The current layout considers neither the process layout nor Lustre's striping access pattern.

Objective of Our Research Work

  • Aggregator layout optimization in Two-Phase I/O
    • Current aggregator layout scheme
      • blocked layout within each node
    • A mismatch arises when there are multiple aggregators per node:
      • contention in data exchanges
      • ineffective striping accesses to a Lustre file system

Striping-layout-aware aggregator layout that takes the computing-node configuration into account


MPI-IO Access Patterns

  • Four representative MPI-IO access patterns

MPI-IO on a Lustre File System

  • Example of collective write against a Lustre file system
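Lustre stripes a file across OSTs round-robin in stripe-size units, so the OST serving a given byte offset can be computed directly. A minimal sketch, assuming a 1 MiB stripe size and 4 OSTs (illustrative values, not necessarily the configuration in the figure):

```python
STRIPE_SIZE = 2**20   # 1 MiB stripe size (assumed for illustration)
STRIPE_COUNT = 4      # number of OSTs the file is striped over (assumed)

def ost_index(offset: int) -> int:
    """Lustre assigns byte offsets to OSTs round-robin in stripe-size units."""
    return (offset // STRIPE_SIZE) % STRIPE_COUNT

# Consecutive stripes cycle through the OSTs:
print([ost_index(i * STRIPE_SIZE) for i in range(6)])  # -> [0, 1, 2, 3, 0, 1]
```

This round-robin mapping is what makes the aggregator layout matter: aggregators handling consecutive stripes talk to different OSTs in the same round.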

ROMIO

  • Software stack of ROMIO
    • ADIO: Abstract-Device I/O layer
      • Provides file-system-dependent implementations (e.g., ad_lustre)

Collective I/O for 2-Dimensional Data

  • 2-Dimensional data accesses by 4 processes

Two-Phase I/O

  • Two-Phase I/O in a collective write
    • Repeated file I/O and data exchanges in units of the collective buffer
      • In the Lustre case, the collective buffer size equals the stripe size.
    • Aggregator: a process that performs the file I/O (some or all of the MPI processes)
    • Number of Two-Phase I/O cycles: N_TP-IO = ceil(L_data / (n_agg × S_CB))
      L_data  total data size
      n_agg   number of aggregators
      S_CB    collective buffer size
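The cycle-count formula above can be checked numerically. A minimal sketch (the function name is ours):

```python
import math

def two_phase_cycles(total_bytes: int, n_aggregators: int, cb_bytes: int) -> int:
    """Number of Two-Phase I/O cycles: ceil(L_data / (n_agg * S_CB)).

    Each cycle moves one collective buffer per aggregator, so the total
    data is divided by the aggregate buffer capacity per cycle.
    """
    return math.ceil(total_bytes / (n_aggregators * cb_bytes))

# Example: 8 GiB of data, 16 aggregators, 1 MiB collective buffer
print(two_phase_cycles(8 * 2**30, 16, 2**20))  # -> 512
```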

Aggregator Layout Problem

  • Ineffective aggregator layout with multiple aggregators per node
    • The current ROMIO uses a blocked layout.
    • Multiple aggregators per node lead to contention in
      • accesses to the Lustre file system
      • data exchanges
  • Result: performance degradation in collective I/O

Ineffective aggregator layout of ROMIO

  • Mismatch of the multiple-aggregator layout with respect to Lustre striping accesses
< Aggregator layout example of the current ROMIO (4 processes/node for 2 OSTs) >
  • Multiple aggregators per node act in the same striping round on the same node.
    • Lustre access throughput degrades due to shared use of the same network link.
    • Congestion occurs in data exchanges.

Our Approach for Performance Improvement

Rearrangement of the aggregator layout

  • Round-robin layout among nodes
    • Minimizes I/O access time in each striping round by eliminating shared use of network links
    • Minimizes contention in data exchanges
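The two layouts can be contrasted with a small sketch. We assume the common blocked rank placement (rank = node × processes-per-node + local index); the function names are ours, not ROMIO's:

```python
def blocked_aggregators(n_nodes: int, procs_per_node: int, n_agg: int) -> list:
    """Blocked choice in the style of the current ROMIO: the first n_agg
    ranks are taken in order, so node 0 fills up before node 1 is used."""
    return list(range(n_agg))

def round_robin_aggregators(n_nodes: int, procs_per_node: int, n_agg: int) -> list:
    """Proposed layout: pick aggregators round-robin across nodes, so that
    consecutive aggregators (which handle consecutive stripes in the same
    striping round) sit on different nodes and do not share a network link."""
    ranks = []
    for slot in range(n_agg):
        node = slot % n_nodes          # spread consecutive slots over nodes
        local = slot // n_nodes        # then go around again within nodes
        ranks.append(node * procs_per_node + local)
    return ranks

# 4 nodes x 4 processes/node, 4 aggregators:
print(blocked_aggregators(4, 4, 4))      # -> [0, 1, 2, 3] (all on node 0)
print(round_robin_aggregators(4, 4, 4))  # -> [0, 4, 8, 12] (one per node)
```

With the blocked choice, all four aggregators share node 0's network link in every striping round; the round-robin choice gives each round one aggregator per node.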

Aggregator Layout Optimization

  • Comparison with the current ROMIO
< Current scheme of ROMIO (orig.) >   < Proposed scheme >

< Number of communications per node >
                          Orig.   Proposed
  Inter-node comm./node     8        4
  Intra-node comm./node     6        3

Round-robin layout among nodes minimizes communication and I/O time.

Performance Evaluation

  • Evaluation of file I/O (POSIX-IO)
    • I/O time for striping accesses on a Lustre file system
  • Performance evaluation of ROMIO using HPIO benchmark
    • Collective write operations (MPI_File_write_all)
< Evaluation setup >
PC cluster (4 computing nodes)
  Computing node
    CPU          1 x Intel Xeon E3-1280 V2
    Memory       32 GiB
    Network      1 x Mellanox InfiniBand FDR
  Lustre ver. 2.6.54 (1 MDS, 2 OSSs, 2 OSTs/OSS)
  MDS/OSS
    CPU          1 x Intel Xeon E3-1270 V3
    Memory       32 GiB
    Network      1 x Mellanox InfiniBand FDR
  Storage server
    CPU          2 x Intel Xeon E5-2640 V2
    Memory       64 GiB
    Network      1 x Mellanox InfiniBand FDR (2 ports)
    Disk system  4 x RAID10 (~4 TiB), LSI MegaRAID SAS 9260-4i

Process Layout of MVAPICH2 in Performance Measurement

  • Process layout options in MVAPICH2
    • bunch, scatter
    • 4 nodes × 4 processes/node = 16 processes in total

< bunch >

blocked layout in each node


< scatter >

round-robin layout among nodes

< Examples of process layout by "bunch" (left) and "scatter" (right) >
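The two MVAPICH2 placements shown above amount to a simple rank-to-node mapping; a minimal sketch (function names are ours):

```python
def bunch_layout(n_nodes: int, procs_per_node: int) -> dict:
    """MVAPICH2 'bunch': consecutive ranks fill one node before the next."""
    return {r: r // procs_per_node for r in range(n_nodes * procs_per_node)}

def scatter_layout(n_nodes: int, procs_per_node: int) -> dict:
    """MVAPICH2 'scatter': consecutive ranks go round-robin across nodes."""
    return {r: r % n_nodes for r in range(n_nodes * procs_per_node)}

# Node of each of the first 8 ranks, for 4 nodes x 4 processes/node:
print([bunch_layout(4, 4)[r] for r in range(8)])    # -> [0, 0, 0, 0, 1, 1, 1, 1]
print([scatter_layout(4, 4)[r] for r in range(8)])  # -> [0, 1, 2, 3, 0, 1, 2, 3]
```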


Lustre Access Time

  • I/O time for accessing a Lustre file system (POSIX I/O: write)
    • Round-robin data layout with a 16 MiB unit in terms of MPI rank
    • Mean of 50 iterations
  • The scatter layout is better (read time in particular decreases drastically).
    • Lustre's read-ahead performance may depend on the process layout.
  • The bunch layout may perform poorly because of shared interconnect utilization.
  • Overall, the scatter layout is better with respect to network contention.
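The round-robin data layout used in this test can be sketched as follows. This is our reading of "round-robin data layout with a 16 MiB unit in terms of MPI rank" (unit k of rank r starts at (k × nprocs + r) × 16 MiB); the function name is ours:

```python
UNIT = 16 * 2**20  # 16 MiB round-robin unit

def rank_offsets(rank: int, nprocs: int, n_units_per_rank: int) -> list:
    """File offsets written by one rank when 16 MiB units are laid out
    round-robin over ranks: unit k of rank r starts at (k*nprocs + r)*UNIT."""
    return [(k * nprocs + rank) * UNIT for k in range(n_units_per_rank)]

# With 16 processes, rank 1 writes units 1, 17, 33, ...:
print([off // UNIT for off in rank_offsets(1, 16, 3)])  # -> [1, 17, 33]
```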

Performance Evaluation using HPIO Benchmark

  • Collective MPI-IO (MPI_File_write_all) evaluation using the HPIO benchmark
    • Original ROMIO (orig) vs. optimized ROMIO
      • Original ROMIO: no aggregator layout optimization
      • Optimized ROMIO: optimized aggregator layout
    • Number of processes: 16 (4 nodes x 4 processes/node)
      • All MPI processes act as aggregators.
    • Process layout: bunch and scatter
    • Lustre (ver. 2.6.54): 1 MDS/MDT + 4 OSTs (2 OSTs/OSS)
    • Data exchange modes
      • MPI_Isend/MPI_Irecv pairs
      • MPI_Alltoallv (based on the Blue Gene ADIO implementation)

HPIO parameters
Region size   5,744 B
Region space 256 B
Region count 48,000

< Data layout of HPIO benchmark >

In total, about 8.6 GiB of data were generated by the 16 MPI processes.


HPIO Benchmark Results (1)

  • MPI_Isend/MPI_Irecv version
< Bunch layout > < Scatter layout >
  • ORIG, agg_n=1: original ROMIO (1 aggregator/node)
  • ORIG, agg_n=4: original ROMIO (4 aggregators/node)
  • RR, agg_n=4: optimized ROMIO (4 aggregators/node)
  • The optimized ROMIO outperformed the original ROMIO with both 1 aggregator/node and 4 aggregators/node.

HPIO Benchmark Results (2)

  • MPI_Alltoallv version
< Bunch layout > < Scatter layout >
  • ORIG, agg_n=1: original ROMIO (1 aggregator/node)
  • ORIG, agg_n=4: original ROMIO (4 aggregators/node)
  • RR, agg_n=4: optimized ROMIO (4 aggregators/node)
  • The optimized ROMIO outperformed the original ROMIO with both 1 aggregator/node and 4 aggregators/node.

Internal Operation Times of Two-Phase I/O (1)

  • MPI_Isend/MPI_Irecv version
  • The three major operation times (read, write, and exch) increased as the number of aggregators grew (original version).
  • The proposed scheme minimized all three major operation times.

Internal Operation Times of Two-Phase I/O (2)

  • MPI_Alltoallv version
  • The three major operation times (read, write, and exch) decreased as the number of aggregators grew (original version).
  • Even so, the proposed scheme minimized all three major operation times.

Summary

  • The striping-access-aware aggregator layout outperformed the original ROMIO when there are multiple aggregators per node.
    • The biggest performance improvements were found in reads from Lustre and in data exchanges.
    • The optimized aggregator layout also helps to distribute the workload evenly among aggregators.

Future Work

  • Performance evaluation on a large scale of nodes
  • Optimizations in data exchanges
    • e.g., Nonblocking communications in order to overlap communications with file I/O

Related Work

  • K. Cha and S. Maeng (ICPP'11 workshop)
    • Also proposed an aggregator layout optimization (multiple aggregators/node) similar to ours.
    • The striping layout of the parallel file system was not considered.
  • LACIO (Y. Chen et al., IPDPS'11)
    • Striping-layout-aware aggregator layout optimization similar to our proposal
    • Our proposal additionally focuses on performance improvements with multiple aggregators per node.
  • J. Liu et al. (IEEE BigData'13)
    • Analyzes both structured data formats and the target parallel file system's data layout
    • Works at a higher-level user application layer
    • Our proposal addresses a lower user-space layer than this work.
      • Combining our work with this research may be interesting when using HDF5 or PnetCDF.