User-Level Process

User-Level Process towards Exascale Systems

Background

  • Networks will become larger and more complex as systems scale toward Exascale
  • Latency hiding therefore becomes an important issue

Methods for Latency Hiding

  • Non-blocking communication
    • Overlapping communication and computation
  • Process oversubscription
    • Switching to another process when the running process blocks waiting for a communication to complete

Problems

  • Lightweight OS kernels for Exascale systems may no longer support OS task scheduling
    • OS task scheduling consumes resources and causes OS noise
    • E.g. McKernel, Argo
  • Process context switching is slow
    • The overhead of entering the kernel context cancels out the benefit of process oversubscription

Conventional Approach

  • Process oversubscription using user-level threads (e.g. FG-MPI)
    • Multiple user-level threads are invoked within a process
    • Each user-level thread plays the role of a parallel process
  • Pros and cons
    • Pros
      • OS task scheduling is not necessary
      • Fast context switch (switching between user-level threads is done in user-space)
    • Cons
      • The programming model of parallel applications is forced to change
        • Program code (text) and data (data, bss and heap) are shared among parallel processes
        • Programmers generally write parallel applications on the assumption that each parallel process has its own program code and data
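
The drawback listed above can be illustrated with a minimal sketch. It assumes POSIX `ucontext` contexts as a stand-in for user-level threads (FG-MPI's own mechanism is not shown here), and the names `worker` and `run_shared_counter_demo` are hypothetical. Two user-level threads each increment what they treat as their own counter, but the global exists only once per process. Note that glibc's `swapcontext` also saves the signal mask with a system call, so this illustrates data sharing, not switch cost.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)
#define STEPS 3

static ucontext_t main_ctx, ctx[2];
static char stacks[2][STACK_SIZE];
static int counter;   /* global data: one copy, visible to every user-level thread */

/* Body of one user-level thread standing in for a "parallel process". */
static void worker(int id)
{
    for (int i = 0; i < STEPS; i++) {
        counter++;                             /* meant as "my" counter */
        swapcontext(&ctx[id], &ctx[1 - id]);   /* yield to the other thread */
    }
    /* Returning resumes main_ctx through uc_link. */
}

/* Runs both user-level threads and returns the final counter value. */
int run_shared_counter_demo(void)
{
    counter = 0;
    for (int i = 0; i < 2; i++) {
        int rc = getcontext(&ctx[i]);
        assert(rc == 0);
        ctx[i].uc_stack.ss_sp = stacks[i];
        ctx[i].uc_stack.ss_size = STACK_SIZE;
        ctx[i].uc_link = &main_ctx;
        makecontext(&ctx[i], (void (*)(void))worker, 1, i);
    }
    swapcontext(&main_ctx, &ctx[0]);   /* start thread 0 */
    return counter;                    /* 2 * STEPS, not STEPS */
}
```

Calling `run_shared_counter_demo()` returns 6 rather than 3: code written as if each parallel process owned `counter` silently changes meaning once the processes become threads of one process.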

Our Solution

  • User-level process (ULP)
    • A ULP is a "process" that can be scheduled in user-space
      • The ULP has the beneficial features of the user-level thread
      • The ULP has its own program code and data (which is why we call it a "process")
    • Capabilities of the ULP
      • The ULP enables process oversubscription without OS task scheduling
      • The ULP enables high-performance process oversubscription
      • Process oversubscription using ULPs does not change the programming model of parallel applications

                                        Kernel-level Process   User-level Thread   User-level Process
  OS task scheduling                    Necessary              Not necessary       Not necessary
  Context switch                        Slow                   Fast                Fast
  Programming model forced to change?   No                     Yes                 No

Overview of User-level Process

  • The ULP can be scheduled in user-space
    • By giving each ULP the role of a parallel process, process oversubscription can be achieved without OS task scheduling
    • High-performance oversubscription can be achieved by avoiding the overhead of entering the kernel context
  • The ULP has its own program code and data
    • Process oversubscription using ULPs does not change the programming model of parallel applications

Address Space Design

Context Switch

  • Segment registers must be considered on x86_64 architectures
    • Segment registers cannot be updated freely from user-space (changing the fs base normally requires a system call)
    • The fs register is used for implementing Thread Local Storage (TLS)
    • Thread-safe functions must therefore be built without using TLS
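
The TLS concern can be shown with a small sketch. It assumes POSIX `ucontext` contexts as a stand-in for ULPs (the actual ULP switch routine is not shown here), and `task_a`, `task_b` and `run_tls_demo` are hypothetical names. Because a user-space switch leaves the fs register untouched, both contexts resolve a TLS variable to the same storage.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t tls_main, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];

/* TLS variable; volatile forces a fresh read after each swapcontext(). */
static _Thread_local volatile int tls_value;
static int seen_by_a;

static void task_a(void)
{
    tls_value = 1;                  /* A stores its "private" value */
    swapcontext(&ctx_a, &ctx_b);    /* user-space switch: fs is not changed */
    seen_by_a = tls_value;          /* A reads back what B stored */
}

static void task_b(void)
{
    tls_value = 2;                  /* B writes to the same TLS slot */
    swapcontext(&ctx_b, &ctx_a);    /* switch back to A */
}

static void init_ctx(ucontext_t *c, char *stack, void (*fn)(void))
{
    int rc = getcontext(c);
    assert(rc == 0);
    c->uc_stack.ss_sp = stack;
    c->uc_stack.ss_size = STACK_SIZE;
    c->uc_link = &tls_main;         /* resume the caller when fn returns */
    makecontext(c, fn, 0);
}

/* Returns the value task A observes after task B has run. */
int run_tls_demo(void)
{
    init_ctx(&ctx_a, stack_a, task_a);
    init_ctx(&ctx_b, stack_b, task_b);
    swapcontext(&tls_main, &ctx_a);
    return seen_by_a;               /* 2: A's value was overwritten by B */
}
```

`run_tls_demo()` returns 2, not 1. Any library function that keeps per-thread state in TLS would be affected in the same way when two contexts share one kernel-level thread.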

ULP API

  • pvas_ulp_create
    • Create an address space for ULPs
  • pvas_ulp_destroy
    • Destroy a created address space
  • pvas_ulp_spawn
    • Spawn a kernel-level process with a ULP
  • pvas_ulp_exec
    • Create and execute a new ULP
  • pvas_ulp_switch
    • Perform a context switch between ULPs

Compatibility Issues

  • The semantics of a ULP differ from those of a traditional kernel-level process
    • For example,
      • The process ID is shared among ULPs
      • Signals are delivered between kernel-level processes (not between ULPs)
    • These compatibility issues must be considered when embedding the ULP capability in runtimes that execute parallel applications
      • E.g. MPI runtime

Preliminary Evaluation (context switch performance)

  • Benchmark
    • Multiple parallel processes are invoked on a single CPU core
    • A parallel process may be a kernel-level process, a kernel-level thread, a user-level thread or a user-level process
    • The time elapsed until every parallel process has performed 1000 context switches is measured
  • The performance of the ULP is competitive with that of the user-level thread

Summary and Future Work

  • Summary
    • The ULP enables process oversubscription even if the OS kernel does not support task scheduling
    • The ULP enables high-performance process oversubscription by avoiding the overhead of entering the kernel context
    • Process oversubscription using ULPs does not change the programming model of parallel applications
  • Future work
    • Embedding the ULP capability in runtimes that execute parallel applications, and evaluating it
      • E.g. MPI over ULP