...when it gets started — yeah, we actually can start on time this time. So, who is interested in AI these days? All right, many of you. So our next speakers, Atanas and Andrew, are from Intel. Please join me in welcoming them to talk about Exploring Parallelism for AI Workloads with Wasm and OpenMP. Thank you.

Hi, my name is Andrew. I'm a software engineer at Intel. Aaron Dornie is not here with us, but he also worked on this project, and I'm here with Atanas, also a software engineer at Intel. Yep.

So today we're gonna give you a little bit of an overview of some of the kinds of things that Intel does when it comes to WebAssembly. We're trying to optimize WebAssembly for our hardware, and today we're gonna be talking about OpenMP. That's a parallel framework: you can run parallel programs with it. We're gonna try to do that in WebAssembly — but why? One reason is that OpenMP has a long history; it's a well-known framework, and it's quite complex. We also know there are edge computing use cases that need parallel execution, and OpenMP could fill that. And it also sort of tests WebAssembly as a suitable target for parallel programs. Also, it's a challenge, so let's see.

Let's talk about threads in WebAssembly. We had some comments earlier about multithreading in WebAssembly, and it is a challenge, right? It's still early days when it comes to threads in WebAssembly. What you have to do today, if you take some C code that uses pthreads, is either compile it with Emscripten with a certain target, generate your WebAssembly file, and run it in a browser, or generate a WebAssembly file with wasi-sdk, with a different target, and run it in a standalone engine like Wasmtime or WAMR or something like that. The second path, through wasi-sdk, is the one we'll actually be using today, but everything we're talking about here also applies to Emscripten. I should note that the story for threads in WebAssembly should get better in the future: the post-MVP threads proposal that we've been working on is now phase one in the core WebAssembly specification, so more to come there. But this work is using wasi-threads, which is a proposal we did last year. You wanna talk about OpenMP?

Yeah, so, OpenMP has, as we heard, a very long history in parallel programming, specifically in HPC, where it is very widely used. What is actually behind it is a fork-join model. You have a piece of code where, through a pragma statement, you can fork a group of threads working in parallel on a certain problem, and then when they're finished you join them, they get synchronized, and you get the result. They use a shared-memory architecture, basically on the same node — the whole parallelization really is on a single node; we're not talking about distributing across multiple nodes. Next slide.

Yeah, so in terms of OpenMP, it's a framework that's basically provided by a lot of compilers already, natively. If you think about Clang or GCC, those are compilers which support it just by adding a flag when compiling your code. And you have a bunch of features.
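As a quick illustration of the "just add a flag" point — a minimal sketch, not code from the talk; it only assumes a Clang or GCC build with OpenMP support:

    /* omp_hello.c -- trivial OpenMP program */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Fork a team of threads; each one runs the statement below. */
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

    /* Build natively with the OpenMP flag (same flag for GCC):
     *   clang -fopenmp omp_hello.c -o omp_hello
     */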
Basically, the main thing is a statement called parallel, just to fork those threads; then you can control how to share the work between them, and you have some primitives to do synchronization between the threads — barriers and locks and so on. But you also have more advanced functionality for heavier math, which is nice for the machines we have today, like SIMD instructions, and recently there is also support for device offloading. And yeah, that's displayed on the right: it's quite simple actually to parallelize code with that functionality. You need a pragma statement — you write #pragma omp parallel — and this will fork threads, and they will work in parallel on the for loop below. Basically, each thread will work on a chunk of it. What's this top language here, the top block? The top one you see is a Fortran example, so very similar.

Right, and in terms of architecture, in this work we looked more at the big block on the left, the OpenMP runtime, which gives us the basic functionality to run this parallel fork-join mechanism and to do the synchronization. We did not look into the target functionality of OpenMP, which is responsible for the offload. We are very interested in feedback on whether that would be of interest to the community.

So, if we're gonna compile the OpenMP runtime to WebAssembly, we need to start with libomp.a. There are many parts to OpenMP — the offload targets we just talked about, we're not gonna touch that here — we're just gonna talk about the runtime itself, libomp. There has been previous work, a changeset for LLVM, that sort of did some of this, but in this project we actually reworked that into a new PR, which you can see here, and if you like this work, please give it a thumbs up so we can get it reviewed.

It's not immediately trivial to compile OpenMP to WebAssembly. As we tried to do this, we ran into many kinds of problems: compilation failures, we got Clang to crash, and we even had deadlocks in the code that we emitted. All of those are bad things. So let me talk through those problems real quick.

The first one is compilation failures. Some of the APIs that OpenMP uses aren't available in WASI yet. For example, wasi-threads doesn't expose pthread_exit — any way to exit a thread early. We don't know how to do that yet; that would have to be in a future version of the spec. So in order to solve this, we need to conditionally compile different parts of OpenMP to say, well, in this case we're not gonna call that API.

We found this one in one of our examples: we used the omp critical pragma, and Clang crashed on us, and we thought, oh no, what do we do now? Looking into it a little further, we figured out that common symbols are not implemented for the WebAssembly backend of LLVM. Okay, that's not as big of an issue — it's not a bug, it's just unimplemented. So what we did is, instead of using common symbols, we used external symbols. These are just different kinds of symbols inside LLVM, you know, ELF kind of stuff. If you're interested in more details, we can talk more about it.

And then, you know, we finally compiled libomp.a. We had it, we compiled some examples, we thought, here we go, this is great. We ran it in Wasmtime, and we deadlocked. And so that's a bummer. So what we had to do is troubleshoot that deadlock.
And one of the ways we did it is we emitted, as logs, every wait and notify pair in the code, and by examining the logs we figured out which waits and notifies weren't matching up. We realized, well, we need a stack trace to figure out at which point that's happening, and so we proposed WASI backtrace. It's a way of printing a stack trace at whatever point you are in the WebAssembly program. And eventually we figured out, oh, it's just varargs. The vararg, you know, needed to be a pointer to the vararg list. Once we figured that out, we got through those issues, and we could execute a simple program. If you take a look at this code on the right, it just prints hello world from each of the threads that OpenMP spawns. So we're using all the special omp pragma stuff, but when we spawn it, you know, OpenMP takes it away and runs these WASI threads. It all works great in Wasmtime. Now I'm gonna go over to our friend Aaron and we're gonna run his video: OpenMP and WASI.

So here we have an example C program that just instantiates two square matrices of a given size with random numbers, and we have some functions here set up to multiply them. The first function is just a standard linear implementation of matrix multiplication, and we just collect the time taken by the function. The second one, though, is using the OpenMP parallel pragma notation, so this is a threaded implementation. We are going to run this program with eight threads, and we're going to run it using a Makefile that just generates a Wasm binary by setting a wasm32-wasi-threads target and linking in a customized version of OpenMP. So we're just going to run make, and it will then run the generated binary in Wasmtime. And as you can see, the linear program took about 14 seconds, whereas the parallel implementation only took 1.7.

Now, we noticed with larger matrix sizes that there was a bit of a difference in performance when compared with a standard native OpenMP program. We noticed as well that the Wasm programs were quite cache bound, and we came to the conclusion that we were missing out on certain cache optimizations. So we created another function to transpose one of the matrices to give us better cache performance, and we also used the OpenMP tile notation to create some memory blocks to also improve cache hits. I'm going to recompile with a larger matrix size. In the initial run, you can see that only about 0.7 seconds was shaved off of the execution time, whereas with a larger matrix size, the effect of cache performance becomes more apparent. I'm just going to disable the linear function because that can take a few minutes to run. And as you can see, the parallel function took about 17.6 seconds, whereas the transposed, blocked function only took 7.5.

Something we're also going to take a quick look at is SIMD vectorization instructions. We're going to enable those with some compiler flags, and we're going to decrease the matrix size again. Now, even the linear function here actually benefits from using SIMD instructions: in the initial run it takes about 14 seconds, whereas with SIMD enabled it's reduced to about nine. You can also see that the performance of the two parallel functions is improved. Now I'm going to disable the linear function.

All right, you want to take this? Yeah, so you saw the example for matrix multiplication. It was quite easy actually to get performance out of matrix multiply.
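For reference, a rough sketch of the kind of functions being compared in the demo — this is not the actual demo code; the function names, fixed matrix size, tile sizes, and build flags are assumptions (and the flags vary by wasi-sdk and Wasmtime version):

    /* matmul.c -- sketch of the demo's variants (all sizes/names illustrative) */
    #define N 1024  /* matrix dimension (assumption) */

    /* 1. Trivial, single-threaded version: three nested loops. */
    void matmul_linear(const float *A, const float *B, float *C) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    /* 2. Parallel version: fork a team and split the collapsed i/j loops. */
    void matmul_parallel(const float *A, const float *B, float *C) {
        #pragma omp parallel for collapse(2) shared(A, B, C)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
    }

    /* 3. Parallel + blocked version: the OpenMP 5.1 tile construct blocks the
       inner loop nest so the working set stays in cache (sizes need tuning,
       and this requires a compiler with OpenMP 5.1 tile support). */
    void matmul_tiled(const float *A, const float *B, float *C) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            #pragma omp tile sizes(64, 64)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    C[i*N + j] += A[i*N + k] * B[k*N + j];
        }
    }

    /* Build/run sketch (paths and engine flags are assumptions):
     *   $WASI_SDK_PATH/bin/clang --target=wasm32-wasi-threads -pthread \
     *       -fopenmp -O3 matmul.c /path/to/libomp.a -o matmul.wasm
     *   wasmtime run -S threads=y matmul.wasm
     */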
We thought it's a good proxy for a lot of workloads that have heavy compute, like in the AI space — if you think about convolutional networks, they are doing multiplication with a convolutional filter. So yeah, this was our initial kind of look.

And now some analysis of the approach we took. You see on the left basically the trivial matrix multiplication: three nested loops, like we know it from university or from school. And then a parallel version with OpenMP — it doesn't change too much. You still have the three nested loops, and you have these pragma statements, #pragma omp parallel. You can define private variables, which basically tells the application that i, j, k will be private for my threads, and then you have the shared A, B, C variables. Collapse is another nice bit of OpenMP syntax where you can collapse loops and split the work.

The main analysis here is that we actually run into a memory-bound problem if we write the code like that, and if we analyze the complexity of the code mathematically, it should actually be a compute-bound problem. So we are leaving performance on the table. One way to deal with that is to look at the machine architecture a little bit. What actually happens is that you are not using the cache correctly, and one technique to improve that is blocking. For matrix multiply that works very nicely, because the blocked version of matrix multiply is equivalent: we are just loading blocks of the matrix and doing the matrix multiply on the smaller problem. You see that it gets a little bit nasty — we will have five nested loops — but we turn the problem into a cache-bound one, which is far better. We're talking about speedups in the hundreds.

And then, if you are lazy, writing those four or five loops can be done a little bit more easily with OpenMP 5.1, which is actually supported automatically through our implementation. You write #pragma omp tile sizes(...) — the thing Aaron showed — and this lets you do this blocking very nicely with one line.

And here are some results for why this can make sense. It was a very synthetic example with matrix multiplication, but you see that if you do these optimizations and also use vectorization, you can get something like a 2000x performance improvement natively versus the trivially written matrix multiply. What happens with Wasm? We see a little bit less — 785x — which is still great. The difference is because of the vectorization: the vectorization used by Wasm is mapped to SSE on the x86 architecture, 128 bits, and the architecture can support more. And some numbers on what we gain from vectorization: 128-bit versus AVX-512, the code is roughly 3x slower. So basically the performance difference you see between a native execution of the OpenMP matrix multiply and the Wasm SIMD execution is only because of the vectorization. It looks quite good: if we get future vectorization capabilities in Wasm, we can get more performance.

And to close our examples, we looked at something a little bit more complex: convolutional networks — AI is quite a hot topic today. There was a very nice, small, academia-style application from Berkeley University which I took, and basically we compiled that to Wasm and applied optimizations similar to the ones we talked about today.
We applied the #pragma omp parallel, we applied the SIMD vectorization, and so on. First, some good news: if you have used profiling tools in the past with your favorite language, they actually work very nicely with Wasm too, with Wasmtime. And we saw in that example that convolution is the most expensive layer in that neural network, and, similarly to matrix multiply, it was actually memory bound. You see it in this picture, where we have the so-called roofline model of the machine. This basically shows, for a certain kernel, depending on its ratio of math operations to memory accesses, where it fits in the whole architecture — is it limited by memory speed or is it limited by compute? Further to the left means it's limited more by memory, and sitting low means performance is being left unused, so to optimize it we want to bring it up.

And again we do the nasty for-loop stuff: basically we put #pragma omp parallel around our batch for loop, and we could also do loop unrolling — if you're familiar with that concept, it can bring a lot of performance, basically by letting the compiler know the loop can execute those operations together, saving the compiler some work — and then we can do the SIMD reduction in the inner loop. This gave us roughly 14x performance — 13.46x in the Wasm case; natively we can get 40x with that application. The reason why we did not get the 40x with the Wasm code was most probably memory layout, as we did not look too deeply into how to align everything to SSE and 128 bits, but still, 14x was a good result.

Okay, so what did we show here today? We compiled the OpenMP runtime using the wasm32-wasi-threads target. We were able to compile some examples: this matrix multiplication kernel, which we used OpenMP to speed up, and we also optimized this image classification kernel, and we compared the native side to the WebAssembly side, using some SIMD in there as well to get some additional speedups. What we did not show: OpenMP has a vast array of features, like all the task parallelism and stuff; we didn't go into all the libomptarget stuff, so that'd be like running it on a GPU — we didn't show any of that today. That's what he was referring to; if you're interested in that, come talk to us, we've gotta figure that stuff out. And just a little disclaimer here: if you go and use this today and you use one of the features that we didn't use, I don't know if it works, so you're on your own.
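To make the convolution-loop pattern described above a bit more concrete, here is a rough sketch of the shape of it — parallel over the batch, with a SIMD reduction in the innermost accumulation. This is not the Berkeley code; the kernel is simplified (effectively a 1x1 convolution) and all names and bounds are assumptions:

    /* conv_sketch.c -- illustrative inner-loop pattern only */
    void conv1x1(int batch, int out_ch, int in_ch,
                 const float *x,   /* [batch][in_ch]  */
                 const float *w,   /* [out_ch][in_ch] */
                 float *y)         /* [batch][out_ch] */
    {
        /* Fork threads over the batch dimension, as described in the talk. */
        #pragma omp parallel for
        for (int b = 0; b < batch; b++) {
            for (int o = 0; o < out_ch; o++) {
                float acc = 0.0f;
                /* Ask the compiler to vectorize the accumulation; with SIMD
                   enabled this maps to 128-bit Wasm SIMD. */
                #pragma omp simd reduction(+:acc)
                for (int i = 0; i < in_ch; i++)
                    acc += x[b*in_ch + i] * w[o*in_ch + i];
                y[b*out_ch + o] = acc;
            }
        }
    }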
Another thing that we didn't really discuss today is that OpenMP has reduced-precision data types, which can be used for eking out even a little more performance — like this int8 and fp16 kind of stuff. Those data types aren't really available in WebAssembly yet, so we'll have to wait a while before we can use that.

Okay, so that's the end of our slides. I have a couple things to say before we take questions. Our contact information is here if you wanna contact us, and we would love to talk more about this stuff. If you're interested in this OpenMP stuff, take a look at the PR, give it a thumbs up or whatever, make some comments. If you wanna run the beginnings of what Atanas was sort of explaining here today, go to this link; there's a Makefile that should be able to guide you through getting this running on your machine. All right, thank you so much, that's our presentation.

All right, questions? Obi-Wan Trenobi — that's a name I haven't heard in a long time. All right, it's your question. Okay, so my question is about SIMD. The benchmark is against AVX-512 on native, and obviously the gaps in Wasm SIMD at 128 bits compared to that are well known. I guess my top question would be: would relaxed SIMD help this use case? You know, I think what would really help is the flexible vectors proposal, which is, I think, phase one, maybe trending towards phase two, and that would enable us to use 512-bit SIMD. So I think that's the biggest thing to work on. I guess a slightly related question: have you benchmarked on 256 bits at all, since that is kind of the least common denominator on native code — AVX2, basically? Yeah, you basically see 2x versus SSE, at least for matrix multiply, and 3x if you go to 512. Makes sense, thanks.

Got a question. Andrew, you worked on wasi-nn before this, which is a way to pull ML into WebAssembly through a host-guest boundary, and now you're pioneering, with the team here, AI inside of WebAssembly. Do you have any thoughts on when you should use one technique or another? The short answer would be: if you're looking for portability, try to compile it all into WebAssembly if you can; if you're looking for performance, jump outside of the sandbox. That would be the short answer today — I mean, there's nuance in that, but that's how I'd — sorry, also, if data movement is expensive, that'll save you data copies. And security. Other questions? Right, a huge round of applause for these guys.