IHK/McKernel is a lightweight multi-kernel operating system designed specifically for
high-performance computing. It runs Linux and McKernel, a lightweight kernel (LWK),
side by side on compute nodes, with the following primary goals:
- Provide scalable and consistent execution of large-scale parallel applications while rapidly adapting to exotic hardware and new programming models
- Provide efficient memory and device management so that resource contention and data movement are minimized at the system level
- Eliminate OS noise by isolating OS services in Linux and provide jitter-free execution on the LWK
- Support the full POSIX/Linux APIs by selectively offloading system calls to Linux
With the growing complexity of high-end supercomputers, it has become evident that the current system software stack will face significant challenges as we look toward exascale and beyond. The need to deal with an extreme degree of parallelism, heterogeneous architectures, multiple levels of memory hierarchy, power constraints, and so on calls for operating systems that can rapidly adapt to new hardware requirements and support novel programming paradigms and runtime systems. At the same time, a new class of more dynamic and complex applications is on the horizon, with an increasing demand for application constructs such as in-situ analysis, workflows, and elaborate monitoring and performance tools. These applications rely not only on the rich features of POSIX, but also on Linux-specific APIs (such as the /proc and /sys filesystems).
Two Traditional Approaches
Traditionally, lightweight operating systems specialized for HPC have followed one of two approaches to handling a high degree of parallelism so that large-scale applications execute scalably. In the full-weight kernel (FWK) approach, a full Linux environment is taken as the basis, and features that inhibit HPC scalability are removed, that is, the kernel is made lightweight. The pure lightweight kernel (LWK) approach, on the other hand, starts from scratch and adds just enough functionality to provide a familiar API, typically something close to that of a general-purpose OS, while retaining the desired scalability and reliability attributes. Neither of these approaches yields a fully Linux-compatible environment.
The Multi-kernel Approach
A hybrid approach recently recognized by the system software community is to run Linux simultaneously with a lightweight kernel on compute nodes, and multiple research projects are now pursuing this direction. The basic idea is that simulations run on an HPC-tailored lightweight kernel, which ensures the isolation necessary for noiseless execution of parallel applications, while Linux is leveraged so that the full POSIX API is supported. Additionally, the small code base of the LWK can facilitate rapid prototyping for new, exotic hardware features. Nevertheless, the questions of how to share node resources between the two types of kernels, where device drivers execute, how exactly the two kernels interact with each other, and to what extent they are integrated remain subjects of ongoing debate.
Design and Architecture
At the heart of the stack is a low-level software infrastructure called Interface for Heterogeneous Kernels (IHK). IHK is a general framework that provides capabilities for partitioning resources in a many-core environment (e.g., CPU cores and physical memory) and enables management of lightweight kernels. IHK can allocate and release host resources dynamically, and no reboot of the host machine is required when the configuration is altered. IHK also provides a low-level inter-kernel messaging infrastructure, which we call the Inter-Kernel Communication (IKC) layer. An architectural overview of the main system components is shown in Figure 1.
Figure 1: The IHK/McKernel architecture
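The dynamic partitioning described above can be pictured with a toy model. The sketch below is not IHK's interface: the class `ToyPartitioner` and its `reserve` and `release` methods are hypothetical names invented for illustration. It only shows the bookkeeping idea of moving CPU cores and memory between the Linux side and the LWK side at run time, without restarting anything.

```python
class ToyPartitioner:
    """Toy bookkeeping model; names and units are assumptions, not IHK's API."""

    def __init__(self, cpus, memory_mb):
        # Initially every core and all memory belong to Linux.
        self.linux_cpus = set(cpus)
        self.lwk_cpus = set()
        self.linux_memory_mb = memory_mb
        self.lwk_memory_mb = 0

    def reserve(self, cpus, memory_mb):
        """Move cores and memory from Linux to the LWK partition."""
        cpus = set(cpus)
        if not cpus <= self.linux_cpus:
            raise ValueError("some requested cores are not owned by Linux")
        if memory_mb > self.linux_memory_mb:
            raise ValueError("not enough memory left on the Linux side")
        self.linux_cpus -= cpus
        self.lwk_cpus |= cpus
        self.linux_memory_mb -= memory_mb
        self.lwk_memory_mb += memory_mb

    def release(self, cpus, memory_mb):
        """Hand cores and memory back from the LWK partition to Linux."""
        cpus = set(cpus)
        if not cpus <= self.lwk_cpus:
            raise ValueError("some cores are not owned by the LWK")
        if memory_mb > self.lwk_memory_mb:
            raise ValueError("the LWK does not hold that much memory")
        self.lwk_cpus -= cpus
        self.linux_cpus |= cpus
        self.lwk_memory_mb -= memory_mb
        self.linux_memory_mb += memory_mb
```

In this model, a node with eight cores could hand cores 4 to 7 to the LWK with `reserve({4, 5, 6, 7}, 8192)` and later return two of them with `release({6, 7}, 4096)`, all while the Linux side keeps running.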
McKernel is a lightweight kernel written from scratch. It is designed for HPC and is booted from IHK. McKernel retains a binary-compatible ABI with Linux; however, it implements only a small set of performance-sensitive system calls, and the rest are delegated to Linux. Specifically, McKernel has its own memory management, supports processes and multi-threading with a simple round-robin cooperative (tick-less) scheduler, and implements signaling. It also allows inter-process memory mappings and provides interfaces to hardware performance counters.
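To make the scheduling policy concrete, here is a minimal sketch of round-robin cooperative scheduling, with Python generators standing in for threads. It is an illustration of the general technique, not McKernel's scheduler: the names `run_round_robin` and `worker` are hypothetical. There is no timer tick, so a task keeps the CPU until it yields voluntarily, and it then goes to the back of the ready queue.

```python
from collections import deque


def run_round_robin(tasks):
    """Run generator-based tasks cooperatively and return the execution trace.

    `tasks` maps a task name to a generator. Each `yield` in a task is a
    voluntary hand-off of the CPU; nothing preempts a running task.
    """
    ready = deque(tasks.items())
    trace = []
    while ready:
        name, task = ready.popleft()
        try:
            step = next(task)          # run until the task yields
        except StopIteration:
            continue                   # task finished, drop it
        trace.append((name, step))
        ready.append((name, task))     # back of the queue: round-robin order
    return trace


def worker(steps):
    """A toy task that does `steps` units of work, yielding after each."""
    for i in range(steps):
        yield i


if __name__ == "__main__":
    print(run_round_robin({"a": worker(2), "b": worker(3)}))
```

Because the scheduler only runs when a task gives up the CPU, an application thread is never interrupted by a periodic tick, which is one way to avoid scheduler-induced jitter.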
Functionality
An overview of some of the principal functionalities of the IHK/McKernel stack is provided below.
System Call Offloading
System call forwarding in McKernel is implemented as follows.
When an offloaded system call occurs, McKernel marshals the system call number along with its arguments
and sends a message to Linux via a dedicated IKC channel. The corresponding proxy process running on
Linux waits by default for system call requests in an ioctl() call into IHK's
system call delegator kernel module. The delegator kernel module's IKC interrupt handler wakes up the
proxy process, which returns to user space and simply invokes the requested system call.
Once the proxy process obtains the return value, it instructs the delegator module to send the result back to McKernel,
which subsequently passes the value to the application in user space.
The system call offloading mechanism is shown in Figure 2.
Figure 2: System call offloading in IHK/McKernel
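The round trip described above can be mimicked in user space. The sketch below is a simulation under stated assumptions, not the actual IKC or delegator code: two `queue.Queue` objects stand in for the IKC channel, a blocking `get()` stands in for the proxy waiting in ioctl(), and the wire format and the system call numbers `SYS_GETPID` and `SYS_ADD` are hypothetical.

```python
import os
import queue
import struct
import threading

# Hypothetical wire format: a call number followed by six 64-bit arguments.
REQUEST_FORMAT = "<q6q"
RESPONSE_FORMAT = "<q"

# Hypothetical call numbers used only in this sketch.
SYS_GETPID = 1
SYS_ADD = 2  # toy call whose result is easy to check

HANDLERS = {
    SYS_GETPID: lambda *args: os.getpid(),
    SYS_ADD: lambda a, b, *rest: a + b,
}


def marshal_request(number, *args):
    """LWK side: pack the call number and up to six arguments."""
    padded = list(args) + [0] * (6 - len(args))
    return struct.pack(REQUEST_FORMAT, number, *padded)


def proxy_loop(requests, responses):
    """Linux side: wait for a request, perform the call, send the result."""
    while True:
        message = requests.get()       # blocks, like the proxy in ioctl()
        if message is None:            # shutdown marker for the demo
            break
        number, *args = struct.unpack(REQUEST_FORMAT, message)
        result = HANDLERS[number](*args)
        responses.put(struct.pack(RESPONSE_FORMAT, result))


def offload(requests, responses, number, *args):
    """LWK side: send the request and wait for the return value."""
    requests.put(marshal_request(number, *args))
    (result,) = struct.unpack(RESPONSE_FORMAT, responses.get())
    return result


def run_demo():
    requests, responses = queue.Queue(), queue.Queue()
    proxy = threading.Thread(target=proxy_loop, args=(requests, responses))
    proxy.start()
    total = offload(requests, responses, SYS_ADD, 40, 2)
    pid = offload(requests, responses, SYS_GETPID)
    requests.put(None)
    proxy.join()
    return total, pid
```

Note that this simulation only passes integers. Arguments that are pointers into the application's memory need the unified address space so that the proxy can dereference them.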
Unified Address Space
The unified address space model in IHK/McKernel ensures that offloaded system calls can seamlessly
resolve their arguments even when those arguments are pointers. This mechanism is depicted in Figure 3 and is
implemented as follows.
Figure 3: Unified address space between McKernel processes and their corresponding proxy
First, the proxy process is compiled as a position-independent binary, which enables us to map the code and data segments specific to the proxy process to an address range that is explicitly excluded from McKernel's user space. The red box on the right side of Figure 3 indicates the excluded region. Second, the entire valid virtual address range of McKernel's application user space is covered by a special mapping in the proxy process, for which we use a pseudo file mapping in Linux. This mapping is indicated by the green box on the left side of Figure 3.
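A toy model may help to see why this layout lets the proxy dereference application pointers unchanged. Everything in the sketch is an assumption made for illustration: the address ranges, the class `ToyNode`, and in particular the lazy lookup in `proxy_load`, which imagines the pseudo mapping being filled on first access from the application's translations. The source text does not describe how the mapping is populated.

```python
PAGE_SIZE = 4096

# Hypothetical layout: application range and proxy-private range are disjoint.
APP_START, APP_END = 0x0000_0000_1000, 0x7000_0000_0000
PROXY_START, PROXY_END = 0x7000_0000_0000, 0x7FFF_0000_0000


def ranges_overlap(a_start, a_end, b_start, b_end):
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


class ToyNode:
    """Toy memory model shared by an application and its proxy."""

    def __init__(self):
        self.frames = {}       # physical frame number -> bytearray
        self.app_pages = {}    # application virtual page -> frame number
        self.proxy_pages = {}  # proxy virtual page -> frame number

    def app_store(self, vaddr, data):
        """Application writes data at a virtual address in its own range."""
        if not APP_START <= vaddr < APP_END:
            raise ValueError("address outside the application range")
        page, offset = divmod(vaddr, PAGE_SIZE)
        if offset + len(data) > PAGE_SIZE:
            raise ValueError("this sketch handles single-page writes only")
        frame = self.app_pages.setdefault(page, len(self.frames))
        buffer = self.frames.setdefault(frame, bytearray(PAGE_SIZE))
        buffer[offset:offset + len(data)] = data

    def proxy_load(self, vaddr, length):
        """Proxy reads the same virtual address through its pseudo mapping."""
        page, offset = divmod(vaddr, PAGE_SIZE)
        if page not in self.proxy_pages:
            # Assumed behaviour: resolve on first touch via the app's mapping.
            self.proxy_pages[page] = self.app_pages[page]
        frame = self.frames[self.proxy_pages[page]]
        return bytes(frame[offset:offset + length])
```

In this model, a path string that the application stores at some address can be read by the proxy at that very same address, so a pointer argument of an offloaded call needs no translation. Because the proxy's own segments live in a disjoint range, they can never collide with an application address.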
IHK/McKernel is open-source and available for download at the following URL:
Publications
Balazs Gerofi, Yutaka Ishikawa, Rolf Riesen, Robert W. Wisniewski, Yoonho Park and Bryan Rosenburg: "A Multi-Kernel Survey for High-Performance Computing", International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), held in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016, Kyoto, Japan (to appear)
Balazs Gerofi, Masamichi Takagi, Gou Nakamura, Tomoki Shirasawa, Atsushi Hori and Yutaka Ishikawa: "On the Scalability, Performance Isolation and Device Driver Transparency of the IHK/McKernel Hybrid Lightweight Kernel", IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, Chicago, USA
Masamichi Takagi, Norio Yamaguchi, Balazs Gerofi, Atsushi Hori and Yutaka Ishikawa: "Adaptive Transport Service Selection for MPI with InfiniBand Network", International Workshop on Exascale MPI (ExaMPI), held in conjunction with ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015, Austin, TX, USA
Balazs Gerofi, Masamichi Takagi and Yutaka Ishikawa: "Toward Operating System Support for Scalable Multithreaded Message Passing", 22nd European MPI Users' Group Meeting (EuroMPI), 2015, Bordeaux, France
Balazs Gerofi, Masamichi Takagi, Yutaka Ishikawa, Rolf Riesen, Evan Powers and Robert W. Wisniewski: "Exploring the Design Space of Combining Linux with Lightweight Kernels for Extreme Scale Computing", International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), held in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2015, Portland, USA [Best Paper Award]
Rolf Riesen, David N. Lombard, Kurt Ferreira, Robert W. Wisniewski, Arthur (Barney) Maccabe, John (Jack) Lange, Mike Lang, Ron Brightwell, Balazs Gerofi, Kevin Pedretti, Pardo Keppel, Todd Inglett, Yoonho Park and Yutaka Ishikawa: "What is a Lightweight Kernel?", International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), held in conjunction with ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2015, Portland, USA
Taku Shimosawa, Balazs Gerofi, Masamichi Takagi, Gou Nakamura, Tomoki Shirasawa, Yuji Saeki, Masaaki Shimizu, Atsushi Hori and Yutaka Ishikawa: "Interface for Heterogeneous Kernels: A Framework to Enable Hybrid OS Designs targeting High Performance Computing on Manycore Architectures", IEEE International Conference on High Performance Computing (HiPC), 2014, Goa, India [acceptance rate: 23%]
Balazs Gerofi, Masamichi Takagi and Yutaka Ishikawa: "Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs", International Workshop on Multi/Many-Core Computing Systems (MuCoCoS), held in conjunction with Euro-Par International European Conference on Parallel Processing, 2014, Porto, Portugal
Yuki Soma, Balazs Gerofi and Yutaka Ishikawa: "Revisiting Virtual Memory for High Performance Computing on Manycore Architectures: A Hybrid Segmentation Kernel Approach", International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), held in conjunction with ACM/SIGARCH International Conference on Supercomputing (ICS), 2014, Munich, Germany
Balazs Gerofi, Akio Shimada, Atsushi Hori, Masamichi Takagi and Yutaka Ishikawa: "CMCP: A Novel Page Replacement Policy for System Level Hierarchical Memory Management on Many-cores", ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2014, Vancouver, Canada [acceptance rate: 16%] [Best Paper Award]
Akio Shimada, Balazs Gerofi, Atsushi Hori and Yutaka Ishikawa: "Proposing a new Task Model towards Many-core Architecture", International Workshop on Many-core Embedded Systems (MES), co-located with ISCA'13, 2013, Tel Aviv, Israel
Balazs Gerofi, Akio Shimada, Atsushi Hori and Yutaka Ishikawa: "Partially Separated Page Tables for Efficient Operating System Assisted Hierarchical Memory Management on Heterogeneous Architectures", ACM/IEEE International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2013, Delft, Netherlands [acceptance rate: 21%] [Nominated for Best Paper]