Hello, good morning everyone, thanks for attending the session. My name is Mohak Chadha, and today I'll be talking about MPIWasm: executing WebAssembly on HPC systems. I'm a final-year PhD candidate at TUM, mostly working in distributed systems and cloud computing.

Let's begin with the structure of today's presentation. First, I will talk about the main motivation behind the work, followed by a brief introduction to WebAssembly and MPI. Then I'll cover what we did in this work, some results on a production HPC system, and how WebAssembly can be utilized in the HPC ecosystem in the future.

Recently there has been an influx of HPC-focused containerization solutions, such as Charliecloud, Shifter, Singularity, Podman, Sarus, and Docker. Docker is not entirely suited for HPC environments, but it is in some ways part of the ecosystem, especially for building OCI-compliant container images. So why do containers actually make sense for HPC? Containers allow developers to define custom software stacks for their scientific applications instead of relying only on the specific modules present on HPC systems. Despite their increasing popularity, the adoption of containers on HPC systems is still significantly limited. If you look at the workload data from NERSC, only 8% of all jobs actually use containers.

There are several challenges. One of the biggest is that most containerization solutions require root privileges for execution, which is not possible on HPC systems due to shared file systems. There are alternatives such as Podman or Singularity, which support rootless containers, but they don't support the parallel file systems commonly found on HPC systems. HPC nodes are also becoming more heterogeneous, with different processor architectures, and building aarch64 container images from generic x86_64 resources is significantly time-consuming. Most applications require special networking libraries or compilers, which are only found on HPC systems. And building high-performance scientific application container images is significantly non-trivial.

So what if there was an alternative that could address all these problems? I'm sure most of you are aware of the tweet from Solomon Hykes, the founder and former CTO of Docker, who said that if WebAssembly had existed in 2008, when Docker was created, there would have been no need to create Docker.

So what is WebAssembly? It originated as an alternative to JavaScript in web browsers. It's essentially a universal intermediate binary instruction format, with a set of instructions defined in the Wasm specification, and it's meant for sandboxed execution in a virtual machine. It has a 32-bit linear memory address space, so the maximum memory possible for an application is currently 4 gigabytes. Unlike containers, WebAssembly provides lightweight isolation at the application level, based on software fault isolation and control-flow integrity, so everything is unprivileged and in user space. It also has a capability-based security model, where you explicitly grant capabilities that define what a module can do and what it has access to.

And what is MPI? In short, it's the de facto standard for programming HPC systems and is used on all modern supercomputers today; more information can be found on the slide.

So what did we do? We proposed using WebAssembly as a distribution format for packaging MPI-based HPC applications.
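To make that concrete, here is a minimal C MPI program of the kind that could be packaged this way; it is a generic illustration, not code from the talk.

```c
/* Minimal C MPI application of the kind that could be compiled once to
 * WebAssembly and then executed on any platform with a supporting embedder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```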
The main idea is that you can compile any scientific HPC application to WebAssembly once, and then execute it on any platform with a supporting WebAssembly embedder. We implemented a tool to simplify this compilation process, and also an embedder which can execute MPI-based Wasm modules with good performance.

For compiling a normal MPI application to WebAssembly, we currently support C/C++. We implemented a custom header file that provides definitions for the different MPI data types and added it to the WASI SDK, which combines the Clang compiler and the WASI libc library for compiling C/C++ applications to WebAssembly. To automate this process, we also implemented a custom Python-based tool. On the right-hand side you see the WebAssembly text representation, and you can see that everything is an integer. This is because all data types are abstracted as integers from the perspective of the WebAssembly module and are translated to native types at runtime by MPIWasm. As a result, all Wasm modules for MPI applications are portable across different MPI libraries.

So what is MPIWasm? It is a WebAssembly embedder built on top of Wasmer, which is another popular WebAssembly embedder. We currently support the execution of C/C++ applications conforming to the MPI 2.2 standard, on different processor architectures. To facilitate its adoption, MPIWasm enables high-performance execution of MPI-based Wasm modules and has low overhead for MPI calls through zero-copy memory operations. This is accomplished by automatically translating from the linear memory address space of the Wasm module to the host memory address space. We also offer immediate support for high-performance network interconnects, such as Intel Omni-Path or InfiniBand, which are found on HPC systems, by directly linking against the target MPI library.

For compiling Wasm code to native machine code, we use an ahead-of-time compilation strategy. For this, we use the LLVM compiler infrastructure, where the Wasm instructions are first translated to LLVM IR and then to native instructions. To offset the larger compilation times, we also implement a caching mechanism in MPIWasm, which avoids the recompilation overhead before each execution.

So how does memory address translation work, which is one of the core contributions? Imagine you have an MPI application. The MPI API is based on the library being able to read and write directly to the memory of the application. However, the executing Wasm module can only provide memory addresses in its own linear memory address space, while the target MPI library requires addresses in the host memory address space. When you execute an MPI module in MPIWasm, it reserves a part of its own memory address space for use by the Wasm module. In addition, it records the base address of the Wasm module's linear memory. Using this, it is possible to directly convert 32-bit pointers that refer to the module's address space into 64-bit pointers that refer to MPIWasm's address space, and vice versa. This is the structure in Wasmer that actually allows you to do this. For implementing the different MPI functions, we combine this memory address translation with data type translation. And for directly utilizing the host MPI library, we use the rsmpi project, which provides MPI bindings for Rust.
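To make the translation concrete, here is a minimal C sketch. The names wasm_instance_t, wasm_to_host, translate_datatype, and wasm_mpi_send are hypothetical and only illustrate the idea; MPIWasm itself implements this in Rust on top of Wasmer and rsmpi.

```c
/* Hedged sketch with hypothetical names (not MPIWasm's actual code):
 * converting a 32-bit offset in the module's linear memory into a 64-bit
 * host pointer, and combining that with data type translation in an
 * embedder-side wrapper before calling the host MPI library. */
#include <mpi.h>
#include <stdint.h>

typedef struct {
    uint8_t *linear_mem_base;  /* host address where the module's linear memory starts */
    uint64_t linear_mem_size;  /* current size of the linear memory in bytes */
} wasm_instance_t;

/* 32-bit offset as seen by the Wasm module -> 64-bit host pointer */
static void *wasm_to_host(const wasm_instance_t *inst, uint32_t offset) {
    return (void *)(inst->linear_mem_base + offset);
}

/* Module-side integer code -> native MPI datatype (code values are made up) */
static MPI_Datatype translate_datatype(int32_t code) {
    switch (code) {
        case 0:  return MPI_INT;
        case 1:  return MPI_DOUBLE;
        default: return MPI_BYTE;
    }
}

/* Wrapper for an MPI_Send issued by the Wasm module: all arguments arrive as
 * plain integers; the buffer offset is translated to a host pointer
 * (zero-copy, the MPI library reads directly from the linear memory) and the
 * datatype code is translated to the native MPI type. */
static int32_t wasm_mpi_send(const wasm_instance_t *inst, uint32_t buf_offset,
                             int32_t count, int32_t dtype_code,
                             int32_t dest, int32_t tag) {
    void *host_buf = wasm_to_host(inst, buf_offset);
    MPI_Datatype dtype = translate_datatype(dtype_code);
    return (int32_t)MPI_Send(host_buf, count, dtype, dest, tag, MPI_COMM_WORLD);
}
```

The receive path works the same way in reverse: output pointers, such as the status, are written into the module's linear memory and handed back to the module as 32-bit offsets.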
So when a Wasm module calls a particular MPI function with a specific linear memory address, this address is translated to the 64-bit address expected by the host MPI library, and after the successful invocation of that MPI function, the status pointer is translated back to a 32-bit address referring to the Wasm module's linear address space.

We tested our implementation on a production HPC system based on Intel processors. We scaled applications up to 128 nodes, which is around 6,144 MPI processes, and also ran on Arm-based systems, such as the AWS Graviton 2 processor, and we experimented with different standardized HPC benchmarks used in the community.

Looking at some performance results: ping-pong is a common communication routine. The error bars in the graphs represent minimum and maximum values of the measured timings. For ping-pong, we observe around a 0.05 geometric mean slowdown, while for send-receive we observe a 0.06 geometric mean slowdown across all message sizes. HPCG is another common benchmark. Up to 192 MPI processes, we observe very similar performance compared to native execution, but when we scale to 128 nodes, we observe around 14% overhead. This is basically because HPCG performs a lot of communication, so the translation overhead adds up. This is the maximum overhead we observed for any application in our experiments.

The WebAssembly community is thriving, and there are a lot of proposals that target extending the Wasm specification. I just want to highlight one of them, which is the extended SIMD proposal. Currently, WebAssembly only supports 128-bit SIMD, but there are proposals to support flexible vector lengths in the future, and modern processors have 512-bit vector units. Just to show this, here is the DT (Data Traffic) benchmark from the NAS Parallel Benchmarks suite. When you enable 128-bit SIMD, you get 1.36x better throughput, but native is still better because you have 512-bit vectors on Intel processors due to AVX-512.

More details can be found in our PPoPP paper, which was published this year. Great, thank you. If you have any questions, please let me know. Yes?

[Audience question]

So the embedder is completely implemented in Rust, because Rust has really good support for WebAssembly. We use Rust for everything; Python is just for the compilation process to WebAssembly. It's a custom tool that can be installed on the system to simplify the compilation process, so you don't have to worry about installing the right SDK and everything like that. It's basically there to simplify the dependencies and the compilation of applications to WebAssembly. The actual embedder is completely independent and is implemented in Rust.

All right, I'm out of time. Thank you.