Fault Resilience

Sliding Substitution of Failed Nodes

Motivation

  • Having a set of spare nodes seems to be the last resort
    • "in such a case, spare nodes can be used."
  • Having spare nodes is not the answer, but a new research issue

Fault Resilience

  • Fault tolerance in the Exa-flops era
    • High failure rate
    • High I/O bandwidth requirement
  • User-level fault resilience
    • Less I/O bandwidth required
    • e.g., ULFM (User-Level Fault Mitigation)
  • We need a recovery strategy!!

Survival from Node Failure

  • Jobs with dynamic load balancing
    • e.g., Task bag, PIC, ...
    • Job shrinking to exclude failed nodes (see the ULFM shrink sketch below)
    • Tasks running on failed node(s) are migrated to live nodes
  • Jobs without dynamic load balancing
    • e.g., Stencil computation, ...
      • Very difficult to balance load
    • Having spare nodes seems to be the answer ...
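  • Illustrative sketch (my example, not the slides' code): with ULFM the
    "job shrinking" case above boils down to rebuilding a communicator that
    contains only the survivors; the snippet assumes Open MPI's ULFM
    extensions (mpi-ext.h).

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM extensions (MPIX_...), Open MPI naming */

    /* Shrink the job after a node failure; callers then redistribute the
     * task bag / particles over the ranks of the returned communicator.  */
    static MPI_Comm shrink_after_failure(MPI_Comm comm)
    {
        MPI_Comm shrunk;
        MPIX_Comm_failure_ack(comm);     /* acknowledge observed failures */
        MPIX_Comm_shrink(comm, &shrunk); /* keep only surviving processes */
        return shrunk;
    }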

Stencil Computation

  • Survival from a node failure
    • Load balancing
    • Preserving communication pattern
    • Less code modification

Spare Node

  • In an error handler (of ULFM, for example)
    • create a new MPI communicator to
      • exclude the failed node, and
      • include a spare node.
    • then migrate the task that was running on the failed node to the spare node (sketched below)
    • No change in the kernel part of the application
  • However, at the network level, the regular stencil communication pattern can be lost!
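  • Hedged sketch of the recovery step above (not from the slides): it assumes
    ULFM and the common pattern where spare processes idle inside
    MPI_COMM_WORLD; after shrinking, MPI_Comm_split renumbers the ranks so
    that a spare inherits the failed rank's number and the kernel is untouched.

    #include <mpi.h>
    #include <mpi-ext.h>                 /* ULFM extensions, Open MPI naming */

    /* Hypothetical recovery: rebuild a communicator in which the designated
     * spare adopts the failed rank's number (direct, "0D", substitution).   */
    static MPI_Comm substitute_spare(MPI_Comm world, int my_old_rank,
                                     int failed_rank, int i_take_over)
    {
        MPI_Comm shrunk, repaired;
        int new_rank = i_take_over ? failed_rank : my_old_rank;

        MPIX_Comm_failure_ack(world);
        MPIX_Comm_shrink(world, &shrunk);    /* exclude the failed node      */

        /* Survivors keep their rank, the spare takes the vacant one; idle
         * spares would pass MPI_UNDEFINED as the color to stay out.         */
        MPI_Comm_split(shrunk, 0, new_rank, &repaired);
        return repaired;
    }

  • At the MPI level the rank layout is restored, but the spare may sit far
    away in the physical network, which is exactly the penalty quantified on
    the following slides.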

Are spare nodes really the answer?

  • Our scope
    • Is there any penalty? If any, how much?
    • How should spare nodes be allocated?
    • How many spare nodes should be allocated?
    • How should failed nodes be substituted by spare nodes?
  • Out of scope
    • How (soft/hard) errors are detected
    • How checkpoints are taken
    • How tasks are migrated

Spare Node Penalty (1)

  • Spare node allocation and node utilization
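  • Back-of-the-envelope formula (my addition, not from the slides): with N
    compute nodes and S idle spares, failure-free node utilization is

        utilization = N / (N + S)      e.g., S = 0.01 * N  ->  about 99%

    so the spare set has to stay small relative to the whole machine.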


How many spare nodes?


Spare Node Allocation

  • Changing the spare node allocation method according to the number of nodes

Spare Node Penalty (2)

  • Possible degradation of communication performance
    • 5-point (5P) stencil communication pattern (halo exchange sketched below)
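  • Generic halo exchange sketch (not the authors' code) showing the regular
    5P pattern: a 2D periodic Cartesian topology in which every rank talks
    only to its four direct neighbours.

    #include <mpi.h>

    /* Build the 2D torus and exchange one halo with the neighbour pair in
     * dimension 0; dimension 1 (left/right) is handled the same way.        */
    void setup_and_exchange(int nprocs, double *top, double *bottom_ghost, int n)
    {
        int dims[2] = {0, 0}, periods[2] = {1, 1};
        int up, down;
        MPI_Comm comm2d;

        MPI_Dims_create(nprocs, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);
        MPI_Cart_shift(comm2d, 0, 1, &up, &down);  /* neighbours above/below */

        /* Send my top boundary row one way, receive my bottom ghost row
         * from the other neighbour.                                          */
        MPI_Sendrecv(top,          n, MPI_DOUBLE, up,   0,
                     bottom_ghost, n, MPI_DOUBLE, down, 0,
                     comm2d, MPI_STATUS_IGNORE);
    }

  • Substituting a failed node with a far-away spare can turn some of these
    one-hop neighbour messages into much longer routes.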

Sliding Substitution
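  • The slides give only the name here, so the following is a minimal 1D
    illustration of the idea (my sketch, not the authors' algorithm): instead
    of dropping the failed rank onto a distant spare, every rank from the
    failure onward shifts by one node, so neighbouring ranks stay on
    neighbouring nodes and the "hole" slides to where the spare sits.

    /* Hypothetical 1D sliding substitution: nodes 0..N-1 compute, node N is
     * a spare at the end of the row; node 'failed' has just died.           */
    int node_of_rank(int r, int failed)
    {
        if (r < failed) return r;      /* ranks before the failure stay put  */
        else            return r + 1;  /* later ranks slide one node toward  */
                                       /* the spare; rank N-1 lands on it    */
    }

  • Compared with direct ("0D") substitution, more ranks migrate, but the
    stencil communication pattern is largely preserved.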


5P Stencil on 2D Network

  • Simulated Results
  • Spare Allocation
    • 2D(2,1) > 2D(1,1)
  • Max. failures tolerated
    • 0D: up to #Spare
    • 1D: 3 (or more)
    • 2D: up to 2 (2D Cart. Topo.)
  • Comm. Perf.
    • 2D > 1D > 0D

5P Stencil Comm. Perf.


Collective Performance

  • On the K computer and BG/Q, collective operations are optimized for their networks.
  • Having spare nodes makes this optimization very difficult.
  • BG/Q's optimization works only with MPI_COMM_WORLD (see the note below)
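  • Illustrative note (my example, not from the slides): after substitution
    the application's collectives run on the repaired, derived communicator
    rather than on MPI_COMM_WORLD, so an implementation whose optimized path
    is tied to MPI_COMM_WORLD falls back to a generic algorithm.

    #include <mpi.h>

    /* 'repaired_comm' is the communicator produced by the recovery sketch.  */
    double global_sum(double local, MPI_Comm repaired_comm)
    {
        double global;
        /* Same collective call as before the failure, but not on
         * MPI_COMM_WORLD, hence no MPI_COMM_WORLD-only optimization.        */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, repaired_comm);
        return global;
    }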

Summary

  • The study of spare node substitution has just begun
  • Comm. perf. degradation is observed
    • 5P stencil:
      • Simulation: up to 100 times larger latency
      • Experiment: < 20 times larger latency
    • Collective: up to 12 times larger latency

Current and Future Work

  • Evaluations with real applications
  • Node-Rank re-mapping algorithms, or better substitution methods
  • Dragonfly and/or Fat-tree networks?
    • Experiments using Tsubame 2.5 (Fat-tree) are scheduled
  • At this moment, it is still unclear whether having spare nodes is a promising technique