Hi everyone, I'm Mamy Ratsimbazafy and today I will talk about the design of a high-performance multithreading framework. The framework is implemented in Nim, but everything I'm talking about is something you can also do in C, Rust, Ada, any kind of language. My hope is that after this talk, you will be able to implement your own multithreading framework in a weekend. So let's see how we can do that.

A bit about myself: I've been using Nim for three years now, and I'm a blockchain developer during the day and a high-performance-computing developer and data scientist during the night. You have my Twitter and my GitHub accounts.

So how did I start writing a multithreading framework from scratch? Three years ago, I discovered Nim. I wanted to do some high-performance computing in Nim, and there were two threading models I could use. One is OpenMP: because Nim compiles to C, you can add OpenMP annotations to Nim, and that makes it very easy to write parallel for loops. The second threading model is a simple thread pool with just a spawn and a sync statement; I'll show a quick sketch of both in a moment. I used OpenMP at first for the multithreading in my high-performance-computing and tensor library. But a year ago, I was dissatisfied with a lot of the internals, including the threading, so I started to re-implement everything from scratch.

And the goal today is, as I said, to end up with a multithreading runtime that you can implement in a weekend. To get there, we need to first understand the design space, and second understand hardware and software multithreading: definitions, use cases, parallel APIs, the sources of overhead, how to benchmark them, and the design constraints those bring.

First thing: understanding the design space. One thing that you hear often is that concurrency is not parallelism. Here you have a coffee machine, or rather two coffee machines. Concurrency is the ability of the OS or the scheduler to interleave two threads of execution on a single resource. In the case of a parallel runtime, you have two threads of execution and two resources.

Another thing that you will see a lot is 1:1 threading, N:1 threading, M:N threading. This is about, on the left, the number of application threads, and on the right, the number of hardware threads. This is something that is often seen at the OS level: very old OSes sometimes had N:1 or 1:1 threading. But it is something that also happens at the language or runtime level. These are just definitions and we won't see them again. The problem for a multithreading runtime is how to schedule M tasks on N hardware threads.

Another axis of the design space is latency versus throughput. Latency: if you have N tasks, maybe you want scheduling to be fair, first in, first out. This is what happens when you have multiple clients and a single server that is supposed to serve all of them, say a web page: you don't want one client to wait an hour until everyone else is done. Or video decoding: if you have multiple frames and each frame has to be processed one after the other, you need some fairness. That is optimizing for latency. The other option is optimizing for throughput. In scientific simulations, for example, you don't care about any single computation; you want the whole work package to be done as fast as possible.
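Here is the sketch I promised of those two threading models. This is a minimal illustration using only Nim's standard library (the `||` OpenMP loop iterator and `std/threadpool`), not the framework's own API:

    import std/threadpool   # simple thread pool: spawn + sync

    proc work(x: int): int = x * x

    # Model 1: OpenMP annotation on a for loop.
    # Compile with: nim c --threads:on --passC:-fopenmp --passL:-fopenmp demo.nim
    var squares = newSeq[int](100)
    for i in 0 || 99:        # `||` emits `#pragma omp parallel for` in the C code
      squares[i] = work(i)

    # Model 2: thread pool with spawn and sync.
    let fv = spawn work(21)  # returns a FlowVar[int]
    sync()                   # wait for all outstanding spawns
    echo ^fv                 # `^` blocks and reads the FlowVar's result

Without the OpenMP flags the `||` loop still compiles and simply runs serially, since the C compiler ignores the pragma.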
So even if the first task waits for one week, as long as everything is done as fast as possible, that's fine.

Another design axis is cooperative versus preemptive. Cooperative threading: you have probably heard about coroutines, fibers, green threads, first-class continuations in Scheme, for example. The characteristic is that those are lightweight, and they don't use hardware threads directly. Second, preemptive threading: pthreads, which are what OpenMP, TBB, or Cilk use if you dive into those runtimes. Those are scheduled by the OS, they have heavier context switches, and you need synchronization primitives because those are real threads, let's say. Synchronization primitives can be locks and atomics, and also things that are less known, like transactional memory or message passing.

And you have IO tasks and CPU tasks. IO tasks: you're waiting for network connections or files, and you create tasks for those. They are usually latency-optimized and implemented via async/await, while CPU tasks are throughput-optimized, and the terms usually used there are spawn and sync. So there is a parallel, let's say, between both APIs, but the internals are completely different, the requirements are different, the skills needed for maintenance are different, and the OS APIs are completely different as well; I'll show a small sketch contrasting the two in a moment. For this talk, I will focus on CPU tasks, optimized for throughput, with preemptive scheduling, so on multiple hardware threads.

So now we have a bit of definitions; let's see the different forms of multithreading that exist. At the hardware level we have four main kinds (there are many more). ILP, instruction-level parallelism: a CPU, ARM or x86, has multiple execution ports. For example, to do an addition you have two or three ports available (they are called port 0, 5, 6, something like this), and the processor can schedule multiple additions in parallel, as long as an execution port is free. SIMD, single instruction multiple data: if you have heard about SSE, AVX, or ARM Neon, for example; those are also called vector instructions. You have one addition and it works on four floating-point values at the same time. SIMT, single instruction multiple threads: those are GPUs, basically. On an NVIDIA GPU you have groups of 32 threads, called a warp, and they have to execute the exact same instructions; if you have an if/else branch on the GPU, it will execute both branches. And the last one, simultaneous multithreading, also called hyper-threading in Intel speak: it's a way to use all the execution ports by having logical threads share the same execution resources and the same memory bandwidth, because it's usually quite hard to use all execution ports at the same time. And it's not always two sibling threads per core: on Xeon Phi, you have four siblings.

Now, let's talk about the forms of parallelism that you might want to implement or support in your runtime. First one: data parallelism. Easy, it's just a parallel for loop. If you have used OpenMP, or Intel TBB's parallel_for in C++, or Rayon in Rust, that's exactly that: the same instruction on multiple data. Use cases: scientific computing; you have vectors and matrices and you do a for loop over all the data. Challenges: how to support nested parallelism (OpenMP, for example, doesn't really support nested parallelism), and there are also load-balancing challenges.
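Here is the promised sketch contrasting the two kinds of tasks: a minimal illustration assuming Nim's stdlib `std/asyncdispatch` for the IO side and `std/threadpool` for the CPU side, compiled with `--threads:on`:

    import std/[asyncdispatch, threadpool]

    # IO task: latency-oriented, cooperative, expressed with async/await.
    proc fetchPage(): Future[string] {.async.} =
      await sleepAsync(100)            # stands in for waiting on the network
      return "page contents"

    # CPU task: throughput-oriented, runs on a hardware thread.
    proc crunch(n: int): int =
      for i in 1 .. n:
        result += i

    echo waitFor fetchPage()           # drives the event loop until the future completes
    let fv = spawn crunch(10_000_000)  # FlowVar: the CPU-side analogue of a future
    echo ^fv                           # blocks the caller until the task is done

The surface symmetry (future/await versus FlowVar/sync) is exactly the parallel mentioned above, while the machinery underneath (an event loop versus a pool of hardware threads) is completely different.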
So it might seem surprising, but splitting a loop for multithreading is actually complex. If you split before entering the loop, you might split the loop in 10 even though the loop is super small and doesn't need to be split at all. I won't go into detail, but there are three splitting strategies that you can research if you want to implement data parallelism.

The main one: task parallelism. This is spawn and sync. It's basically a function call that may or may not be executed on another hardware thread, and the "may or may not" is managed by the scheduling runtime. Examples: Intel TBB, or OpenMP tasks since OpenMP 3.0. Use cases: anywhere you want a parallel function call, for example parallel tree algorithms like depth-first or breadth-first search, or divide-and-conquer algorithms. And there are multiple challenges. The API: most of the runtimes are using futures; in Weave, in Nim, I'm using Flowvar to distinguish IO tasks, which use futures, from CPU tasks, which use Flowvar. It's just a name. Other challenges: synchronization, scheduling overhead, and memory management, because you need to store the tasks somewhere.

Okay, five minutes left, let's go fast. We have another kind of parallelism: data-flow parallelism. Other names: pipeline, graph, stream, or data-driven parallelism. The main thing is that you can express dependencies between tasks, with "in", "out", and "inout" dependency clauses.

To recap the parallel APIs: you have async (launch a task that may be parallelized) and await (wait for a result); that is for IO tasks. We can use spawn and sync for multithreaded CPU tasks. For data parallelism, I talked about parallel for loops. For data-flow parallelism, there is no established API, but you can use either a declarative one, where you create a flow graph explicitly before entering a parallel section, or you can pass a handle, like a promise, to set a task ready or not.

Sources of overhead and implementation details. You have scheduling overhead, because switching between tasks is costly, and switching to the kernel when you need to create or destroy threads is costly as well. Easy solution: use a thread pool. Memory overhead: if you run the Fibonacci task benchmark, Fibonacci of 40 for example, you create on the order of 2^40 tasks, meaning about a trillion tasks, and you need to deal with all that memory; a sketch of that benchmark follows below. So you will need some clever memory management with memory pools. Also on memory pools: sometimes you have one thread that produces all the tasks and another thread that consumes them all, and you cannot rely on a thread-local cache in that case, so you need to handle that as well. I'm skipping over cactus stacks and segmented stacks: it's a complex research topic; Go and Rust tried them, failed, and abandoned them, and GCC's new coroutines from three months ago are also stackless, to avoid cactus-stack issues.

Load balancing, the meat of the talk, let's say. The issue with a simple thread pool is that you usually have one global task queue and you dispatch a task to a ready thread, but you have a contention issue: if you have 10 threads all asking "give me a task, give me a task, give me a task", the task queue will be very busy, and the best way to scale a parallel program is to share nothing. This is Amdahl's law, and it tells you that if 95% of your program is parallel, the maximum speedup that you can get is only 20. So you need to avoid serial parts as much as possible.
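To double-check that maximum-speedup number: with a parallel fraction p running on n processors, Amdahl's law gives

    S(n) = 1 / ((1 - p) + p/n)
    lim n→∞: S = 1 / (1 - p) = 1 / 0.05 = 20   for p = 0.95

So even with infinitely many cores, the 5% serial part caps the speedup at 20.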
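And here is the Fibonacci benchmark mentioned above, as a minimal sketch on top of Nim's stdlib thread pool rather than a real task runtime. Note that nested spawns can exhaust the default pool, so keep n small with this sketch:

    import std/threadpool

    # The classic task-parallelism microbenchmark: each call spawns one
    # child task and computes the other branch itself, so large n floods
    # the runtime with tasks, which is exactly what stresses memory pools.
    proc fib(n: int): int =
      if n < 2:
        return n
      let x = spawn fib(n - 1)   # may or may not run on another hardware thread
      let y = fib(n - 2)         # the current thread keeps working meanwhile
      result = ^x + y            # `^` blocks until the FlowVar is ready

    echo fib(15)   # keep n modest: the stdlib pool is not built for this load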
And the sources of serialization are the single task queue, and the memory pool if it is also global, so you need to distribute everything over multiple threads. One way to do that is work stealing: you have multiple workers, one per core, and each one has its own task queue. A worker pushes and pops from one end of its own queue, and when it runs out of tasks it steals a task from another worker. This way, synchronization happens only when a queue is empty.

This is a summary of the research related to work stealing, and there are mathematical proofs of optimality, asymptotic optimality, which is why almost everyone except Julia is using work stealing. Julia is using something called parallel depth-first scheduling. It's also proven optimal, but it has a different performance profile and overhead profile. It's still in development, because it was only released in September, and you can look into it.

One thing that you should look into is memory models. I've linked the talk from Herb Sutter in the slides. If you want to use relaxed atomics or acquire-release, it's very important to watch it, because this is written nowhere else. I'm skipping over load balancing: you have multiple strategies to share tasks, steal-one, steal-half, adaptive, but I don't have the time to go into them.

So, to end: work stealing in a runtime, in a weekend. You need a task data structure with a function pointer and a blob for the task input, or a closure; a start, stop, and step for data parallelism, if you want to express a for loop; a next field for intrusive queues and deques; and a future pointer to send the result back to the caller. You need a work-stealing deque with head and tail fields and push-first, pop-first, and steal-last operations. For the API, you need init to create your thread pool, exit to shut it down, and spawn and sync to create tasks and retrieve the results. A sketch of these pieces follows at the end.

Some references, which are also on the FOSDEM website, and that's it. I think there is time for one question or two, and the next speaker should come up as soon as possible. Yes? Sorry? Yes. Currently, my runtime is actually at least as fast as, or faster than, any of the other runtimes: OpenMP, TBB, Rayon, HPX, Julia. So really, if you have a challenge on benchmark speed, I'm ready to take it.
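Here is the minimal sketch of those weekend-runtime pieces in Nim. All names, the field sizes, and the layout are illustrative rather than Weave's actual declarations (a prev link is added to keep the sketch simple), and the deque is the naive doubly linked version without the atomics a real runtime needs:

    type
      Task = ptr object
        fn: proc (data: pointer) {.nimcall.}  # function pointer to the work
        data: array[144, byte]                # blob for the task input or closure env
        start, stop, step: int                # loop bounds, for data parallelism
        prev, next: Task                      # intrusive links for queues and deques
        fut: pointer                          # future/Flowvar to send the result back

      WorkStealingDeque = object
        head, tail: Task    # the owner works at the head, thieves take the tail

    proc pushFirst(q: var WorkStealingDeque; t: Task) =
      ## Owner enqueues a freshly spawned task at its own end.
      t.prev = nil
      t.next = q.head
      if q.head != nil: q.head.prev = t
      q.head = t
      if q.tail == nil: q.tail = t

    proc popFirst(q: var WorkStealingDeque): Task =
      ## Owner dequeues from its own end (LIFO: best cache locality).
      result = q.head
      if result != nil:
        q.head = result.next
        if q.head == nil: q.tail = nil else: q.head.prev = nil

    proc stealLast(q: var WorkStealingDeque): Task =
      ## A thief takes from the opposite end (FIFO: old, big pieces of work).
      result = q.tail
      if result != nil:
        q.tail = result.prev
        if q.tail == nil: q.head = nil else: q.tail.next = nil

In a real runtime, popFirst and stealLast race against each other, which is exactly where the memory-model homework above comes in; a Chase-Lev deque is the usual lock-free answer.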