All right. So my name is Han Lee. I'm from Intel, and I'm going to talk about Intel hardware intrinsics in .NET Core in particular, but also cover the concept of hardware intrinsics in .NET Core in general. I wrote the original proposal for Intel hardware intrinsics for .NET Core, which has been announced quite a bit by Microsoft and the open source community. What I want to do today is basically show you what it is, why you want to use it, and how you can use it. So let me start with this question. I've listed some computing domains and specific examples in those domains. What do they have in common? I think what Martin just described could fit in HPC or image processing. Then there are text and data processing and machine learning domains, and there are specific problems in those domains that are common. If you said anything about performance, pat yourself on the back. For performance-sensitive code, you can use Intel hardware intrinsics for a potential performance gain, and I'm going to tell you how you can do that. So what you should get out of this talk is why, what motivated the design, what the intrinsics are, and what you can do in your own code. Before we get into that, Martin talked about SIMD in one of his benchmarks just before this talk, right? Not everybody may be familiar with it, so I thought I'd do a very brief intro to SIMD, which stands for Single Instruction, Multiple Data. It's basically a way of doing the same operation on multiple sets of data using one instruction. So I have an example here where you want to add eight 32-bit integers together. One way to do that is the scalar version: you take x0 and y0 and add them together, and so forth, until you add x7 with y7. So you do eight scalar operations to get the result you want.
Another approach is to use a SIMD instruction. One example is Intel Advanced Vector Extensions 2, also known as AVX2, which operates on 256-bit vectors. What you can do is put those eight 32-bit integers into a 256-bit register; in the Intel ISA, that's a YMM register, so YMM1 and YMM2. And you can use one single instruction, in this case VPADDD, to add those two registers together to get the answer you want. Eight operations versus one operation: obvious advantage right there. How is SIMD being used in C# today? C# provides a very nice abstraction for SIMD in System.Numerics and its Vector variants. I'm going to give a specific example using Vector&lt;T&gt;. It's very nice because it abstracts the underlying hardware away from you. If you are a developer, you don't have to worry about whether you're running on an AVX2-capable machine, or on a Streaming SIMD Extensions 2 (SSE2) machine, which operates on 128-bit vectors. So if you have this code here, where you create two new vectors with eight elements and add them together, you get the resulting vector v. You don't have to worry about whether that was done using one operation, two operations, or eight operations. It's transparent to you, which is very nice. So what's the problem? Or is there a problem? A couple of things. Vector&lt;T&gt; abstracts SIMD operations for you, so if you want to access underlying hardware capabilities that are not SIMD related, you're out of luck, and I'm going to show you an example of that later. The other issue is that some of the operations you do on vectors are inherently difficult when you abstract the underlying hardware away. One example is shuffle operations on vectors: when Vector&lt;T&gt; abstracts the size, the Count information, away from you, that's kind of hard to do.
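To make the Vector&lt;T&gt; idea concrete, here's a minimal sketch of the kind of code I mean (my own variable names and values, not the slide's):

```csharp
using System;
using System.Numerics;

class VectorAddDemo
{
    // Adds two arrays element-wise through Vector<T> and checks every lane
    // against the scalar sum. Vector<int>.Count is 8 on an AVX2 machine and
    // 4 on an SSE2-only machine -- this code doesn't need to care which.
    static bool AddAndCheck()
    {
        int n = Vector<int>.Count;
        var xs = new int[n];
        var ys = new int[n];
        for (int i = 0; i < n; i++) { xs[i] = i + 1; ys[i] = (i + 1) * 10; }

        // One '+' in source; the JIT decides how many machine ops it takes.
        Vector<int> v = new Vector<int>(xs) + new Vector<int>(ys);

        for (int i = 0; i < n; i++)
            if (v[i] != xs[i] + ys[i]) return false;
        return true;
    }

    static void Main() => Console.WriteLine(AddAndCheck() ? "ok" : "mismatch");
}
```

Whether that addition compiles to one VPADDD, two 128-bit adds, or eight scalar adds is invisible here, which is exactly the point of the abstraction.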
And then there have been other issues on GitHub where people talked about their needs. They wanted to do specific things to accelerate their applications but could not do it in C#. So that prompted us to look at how we could enable these developers, and the idea is to use hardware intrinsics, also known as intrinsic functions, also known as platform-dependent intrinsics. These are basically special functions that map to specific hardware instructions. And they are not new: if you have been using C/C++ compiler intrinsics, intrinsic functions, it's basically the same idea. They are very useful when you have an algorithm that maps better to the underlying hardware than to the constructs the language provides for you, or when you want maximum control over code generation. The benefit of hardware intrinsics having been explored before is that they've been field-tested and matured, and there are specific use cases that are known to benefit from them. So what we did when designing Intel hardware intrinsics was to look at what's out there in C/C++, how it gets used, and how the experience of developers who already knew how to program with intrinsic functions could carry over to C#. That's what we did. As for the specifics of Intel hardware intrinsics: we wrote the proposal originally at Intel, but major enhancements were made with the help of the .NET Core community and Microsoft, and the actual implementation was done by Intel, by Microsoft, and by the community. I think this was the beauty of having an open source project: people could bring the issues they had and what they wanted from the project, but also contribute to it, both by making the APIs better and by implementing them.
Two namespaces have been introduced. System.Runtime.Intrinsics contains platform-agnostic data structures and the functions that operate over them. Here's an example: a 256-bit vector is represented by Vector256&lt;T&gt;, and there are operations on those. Then there is a specific namespace for the Intel ISAs: System.Runtime.Intrinsics.X86 contains the ISA classes that belong to the Intel architecture. This feature has been available since .NET Core 2.1 as an experimental feature, so we had a chance to improve the APIs based on usage and feedback from the community. It's available today in the .NET Core 3.0 previews, and it's going to be part of the .NET Core 3.0 release. A little bit more about the intrinsics design. Each ISA class contains an IsSupported property, so you can check the property before you actually use the ISA, plus the actual functions, or methods, that map to the underlying instructions. We tried to keep close mirroring between the C# hardware intrinsics and the C/C++ intrinsic functions, because, as I said, that makes it easier for those with experience programming C/C++ intrinsics to use the C# ones. And here's an example. I don't know how many of you are familiar with the popcount function, or instruction, but it's basically a way of counting the number of bits that are set in data, in this case in a uint. That translates into the POPCNT instruction on an Intel machine. It's an operation that happens frequently enough, and gets used enough, that there is a specific hardware instruction corresponding to it, and I'm going to do a demo using popcount. And then, as I mentioned, the majority of the intrinsics operate on SIMD, so as a result, the majority of the hardware intrinsics operate on Vector256 or Vector128. And then there are some more details about the functions themselves. Okay, so let's do a quick demo. Can you guys read this, or is it too small? Okay, can you read it? All right, cool. So let me just create a new console project.
Let me call it bit count. How about this? Can you read it? Okay, great. The first thing you have to do is import the namespace System.Runtime.Intrinsics.X86; this is where the class we are going to use resides. In the main body, I'm just creating an unsigned integer, and then I'm calling this CountSetBits function, which I'm going to implement. And here's the implementation. It's very simple: it takes in an unsigned integer, and it first checks whether the IsSupported property is true for the Popcnt class. What that basically means is that the underlying hardware supports POPCNT. If that's the case, I just call the PopCount function with n; otherwise I fall back to a software implementation. So this is one of the ways you can count the number of bits set. This one is by Kernighan, but there are a number of other methods to do it. So I'm just going to run this. Okay, and it says that I'm taking the hardware intrinsics path, and this particular number has 28 bits set. Right? You might wonder, okay, how can I test the software fallback path? Well, .NET Core provides a number of environment variables that you can set, specifically for each ISA, so you can disable an ISA this way. When I run .NET with this environment variable set to zero, it will take the software fallback path. So this is one way of testing your code, and it's available for all the ISAs: you can do something like EnableAVX2 or EnableSSE2 set to one or zero. So that's how you can use hardware intrinsics in your code. I've created another program here which basically does the same thing, except that I'm using BenchmarkDotNet for benchmarking. I don't have to introduce BenchmarkDotNet; previous talks have covered it, and it's a very nice tool for benchmarking your code. What I've done here is create, let's see, four ulongs, and then I'm going to measure how long it takes to count the number of bits.
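Putting that together, here's a sketch of the demo's shape; the variable names and the test value are mine, not necessarily what was on screen:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class BitCount
{
    // Counts the bits set in n: hardware POPCNT when the CPU supports it,
    // Kernighan's loop as the software fallback path.
    static int CountSetBits(uint n)
    {
        if (Popcnt.IsSupported)
            return (int)Popcnt.PopCount(n);

        // Kernighan's method: each iteration clears the lowest set bit,
        // so the loop runs once per set bit.
        int count = 0;
        while (n != 0)
        {
            n &= n - 1;
            count++;
        }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountSetBits(0xFFFF00FFu)); // 16 + 8 = 24 bits set
    }
}
```

Both paths return the same answer; the IsSupported check is resolved by the JIT at compile time, so only one branch survives in the generated code.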
And I provided three different methods of doing it here. Kernighan you saw already. LookupTable uses a pre-populated lookup table to count the number of bits. And then hardware intrinsics, you already saw that as well. The other thing is that I'm marking the hardware intrinsics method as the baseline, so once you run this, that will be used as the base. And for BenchmarkDotNet you have to use release mode, so that's what I'm going to do here. This takes about 30 seconds or so, so let me hop back out while that's running. This is the basic structure of a program that uses hardware intrinsics. You import the namespaces that you need. I didn't have to import System.Runtime.Intrinsics in my demo, because I wasn't using Vector256 or Vector128, so I didn't need it; but if you are working with SIMD, then you need to import that namespace as well. The general idea is that you check whether the ISA is supported, and then you provide the code for it; otherwise, you provide another implementation, a software fallback, in your code. The checks get optimized away, so you don't pay a penalty at run time. And if you just call an AVX2 method, or any ISA method, without checking whether the ISA is supported, there is a chance that you will run into a PlatformNotSupportedException. So check for that when you're using it. Okay, so it's finished. Three different methods. Hardware intrinsics was the baseline, so its ratio is one. We see that it's about five times faster than the lookup table method, and about 18 times faster than the Kernighan method. These numbers will depend on the size of the data and other factors, but it basically shows you the power of using hardware intrinsics for something like this. Okay, the other example, or demo, that I'm going to do is a structure-of-arrays-based ray tracer. How many of you are familiar with ray tracing? Most of you, that's great. So how do you vectorize a ray tracer?
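For reference, a lookup-table bit counter along the lines of the benchmark's second method might look like this; this is a sketch of the technique, not necessarily the exact code in the demo:

```csharp
using System;

class LookupPopCount
{
    // 256-entry table: BitsSet[i] holds the number of set bits in byte value i.
    static readonly byte[] BitsSet = BuildTable();

    static byte[] BuildTable()
    {
        var t = new byte[256];
        for (int i = 1; i < 256; i++)
            t[i] = (byte)(t[i >> 1] + (i & 1)); // bits(i) = bits(i/2) + low bit
        return t;
    }

    // Counts the bits in a ulong one byte at a time via the table.
    static int CountSetBits(ulong n)
    {
        int count = 0;
        for (int b = 0; b < 8; b++, n >>= 8)
            count += BitsSet[(byte)n];
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountSetBits(0xFFul));               // 8
        Console.WriteLine(CountSetBits(0xF0F0F0F0F0F0F0F0ul)); // 32
    }
}
```

The table trades 256 bytes of memory for loop-free per-byte counting, which is why it sits between Kernighan's loop and the single POPCNT instruction in the benchmark results.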
Vector, and vectorize, are kind of overloaded terms here, because when you're dealing with a ray tracer you're dealing with 3D vectors. Your XYZ is inherently 3D; even your color space, RGB, is 3D. So there are basically two ways of vectorizing it. One is using an array of structures, and the other is using a structure of arrays, and I've tried to illustrate that here. This is array-of-structures vectorization. The scalar version you're already familiar with, except that we are not dealing with eight 32-bit integers; we are dealing with three floats or doubles, so you have X, Y, Z for one point and X, Y, Z for another. One way of vectorizing that is: if you have an XMM register, which is 128 bits, you can put three floats in one register and three floats in the other, and you can execute one instruction. So instead of three scalar additions, you do one SIMD addition. But because you're vectorizing it as an array of structures, notice that you're wasting one of the four slots in those 128 bits. A better way to vectorize this is to use a structure of arrays. So this is the AoS you just saw, and this is the other way of doing it, using a structure of arrays. Instead of putting X, Y, Z in a single register, you put the X components, X1 through X8, in one register, and the same with the Y's and the Z's. Then you have three VADDPS instructions that add, say, YMM1 with YMM4, and so forth. So that's another way of doing it. The problem with the abstraction that Vector&lt;T&gt; provides is that it's very difficult to do it this way, because there's no way to shuffle easily between the actual 3D representation and the underlying implementation. At the end of the day, RGB has to be in RGB form and vectors have to be in vector form, so you have to have a way of converting between the two, and with Vector&lt;T&gt; that's kind of hard to do. So what we did when we introduced Intel hardware intrinsics was to add another test to CoreCLR.
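As a sketch of what structure-of-arrays addition looks like with the new intrinsics (my own example, not the slide's code; note it needs the unsafe compiler option for the pointer loads, and it keeps a scalar loop as both the fallback and the remainder handler):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class SoaAdd
{
    // SoA layout: all X components in one array, all Y's, all Z's. On AVX
    // hardware each iteration adds eight points per axis with one VADDPS,
    // so every SIMD lane is used -- no wasted slot as in the AoS layout.
    static unsafe void Add(float[] ax, float[] ay, float[] az,
                           float[] bx, float[] by, float[] bz,
                           float[] rx, float[] ry, float[] rz, int count)
    {
        int i = 0;
        if (Avx.IsSupported)
        {
            fixed (float* pax = ax, pay = ay, paz = az,
                          pbx = bx, pby = by, pbz = bz,
                          prx = rx, pry = ry, prz = rz)
            {
                for (; i <= count - 8; i += 8)
                {
                    Avx.Store(prx + i, Avx.Add(Avx.LoadVector256(pax + i), Avx.LoadVector256(pbx + i)));
                    Avx.Store(pry + i, Avx.Add(Avx.LoadVector256(pay + i), Avx.LoadVector256(pby + i)));
                    Avx.Store(prz + i, Avx.Add(Avx.LoadVector256(paz + i), Avx.LoadVector256(pbz + i)));
                }
            }
        }
        for (; i < count; i++) // scalar fallback and remainder
        {
            rx[i] = ax[i] + bx[i];
            ry[i] = ay[i] + by[i];
            rz[i] = az[i] + bz[i];
        }
    }

    static void Main()
    {
        const int n = 10;
        float[] ax = new float[n], ay = new float[n], az = new float[n];
        float[] bx = new float[n], by = new float[n], bz = new float[n];
        float[] rx = new float[n], ry = new float[n], rz = new float[n];
        for (int i = 0; i < n; i++)
        {
            ax[i] = i; ay[i] = 2 * i; az[i] = 3 * i;
            bx[i] = 1; by[i] = 2;     bz[i] = 3;
        }
        Add(ax, ay, az, bx, by, bz, rx, ry, rz, n);
        Console.WriteLine(rx[9] == 10 && ry[9] == 20 && rz[9] == 30 ? "ok" : "mismatch");
    }
}
```

This is exactly the layout conversion that Vector&lt;T&gt; can't express: the data has to live as separate X, Y, Z streams, not as packed XYZ triples.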
So I built this test here, but basically, maybe I already have it open. There are tests that are built into CoreCLR, and the original test is called SIMD RayTracer. We can take a look at Vector.cs: this is based on the Vector3 class in System.Numerics, and it's using array of structures to do the ray tracing. When we implemented the Intel hardware intrinsics, we added this test called PacketTracer, which does the same thing except that it's using structure of arrays. I've changed the program a little bit so that it runs for about five seconds at 512 by 512; same thing with the ray tracer, 512 by 512, running for about five seconds. So here I'm just going to run the test. This is not based on BenchmarkDotNet, but it does print out how many frames it processed. So this is PacketTracer, which is using structure of arrays, and it says about 100 frames per second were processed. Then if you go to the ray tracer, the original tracer, it's about 6 frames per second, which doesn't make sense to me. I know the Intel hardware intrinsics version is faster, but not this much faster. Okay, 43. That's more reasonable. We usually see about a six-times speedup, though not necessarily running on my laptop; we usually see between 6x and 7x for this particular application. So who's using Intel hardware... oh, thank you, that's a time reminder. Who's using Intel hardware intrinsics today? I mentioned one use inside the implementation; I showed you that. Another interesting use is in the CPU math operations in ML.NET.
So ML.NET is a machine learning library from the .NET team. They used to have, and still do have, a native implementation of the CPU math operations for their machine learning library, and one of the things they did was use Intel hardware intrinsics to port those to C#. This chart right here shows the performance of native versus C#, and the point of the chart is that the bars are very similar in height, which means the performance is pretty similar. And if that's the case, what's the point of bringing them to C#? One of the advantages of doing this in C# rather than in C++, based on what they said, is that they don't have to maintain platform-specific implementations for different OSs and different bitnesses. Before, they had to carry a 32-bit version for Linux, a 32-bit version for macOS, a 64-bit version for Windows, and so forth. Now, because they are using intrinsics in C#, the JIT takes care of compiling that into native code, so they don't have to worry about it; that was a big plus. So in addition to the performance benefits, if your application is already using a native implementation, there's the advantage that you no longer need platform-specific implementations in your code. The intrinsics are also being used in bit operations. Popcount is one example, and there are others, like trailing zero count and leading zero count, that are used not just in those specific microbenchmarks but in the context of image manipulation, string processing, string conversion, and so forth. Matrix4x4 has been optimized. There is a hashing algorithm called BLAKE2 which is taking advantage of intrinsics in C#. And if your application is performance-sensitive, then it may be one of the candidates, right? So how do you actually go about accelerating your application?
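Those bit operations follow the same IsSupported pattern as popcount. As a sketch (my own fallbacks and test values; in the x86 intrinsics, TZCNT lives in the Bmi1 class and LZCNT in the Lzcnt class):

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class ZeroCount
{
    // Trailing-zero count: hardware TZCNT when BMI1 is available,
    // otherwise shift the value right until the low bit is set.
    static int TrailingZeros(uint n)
    {
        if (Bmi1.IsSupported)
            return (int)Bmi1.TrailingZeroCount(n);
        if (n == 0) return 32;
        int count = 0;
        while ((n & 1) == 0) { n >>= 1; count++; }
        return count;
    }

    // Leading-zero count: hardware LZCNT when available,
    // otherwise shift left until the top bit is set.
    static int LeadingZeros(uint n)
    {
        if (Lzcnt.IsSupported)
            return (int)Lzcnt.LeadingZeroCount(n);
        if (n == 0) return 32;
        int count = 0;
        while ((n & 0x80000000u) == 0) { n <<= 1; count++; }
        return count;
    }

    static void Main()
    {
        Console.WriteLine(TrailingZeros(0x100u)); // only bit 8 set -> 8
        Console.WriteLine(LeadingZeros(0x100u));  // 31 - 8 -> 23
    }
}
```

These are the kinds of primitives that show up in string scanning and image code: finding the first set bit in a mask is one TZCNT instead of a loop.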
So the first thing you need to do is understand your application: what are the hotspots, and how can I improve them? I use VTune Amplifier, a great tool that works with .NET Core, for identifying hotspots. If you want to use intrinsics, there is a wealth of information about what they are and how to use them; I've linked a couple of resources here. There are also existing solutions that use intrinsic functions in C/C++. Just last year there was a talk at CppCon about accelerating UTF-8 conversion using C++, a DFA, and SSE intrinsics. I think one of the challenges we can take on is to change that C++ to C# and make it work in C#. After you optimize your application using intrinsics, you measure whether it's what you expected, or where you want it to be; otherwise you iterate the process. The last thing I wanted to mention is our experience working with .NET Core. Introducing a new feature, a set of APIs, and then going through the review process, enhancing it, and implementing it has been a pleasure. Microsoft has been very open about the project, and we have gotten a lot of help from the team as well as from the community. So if you have a project or a set of APIs that you want to implement, maybe related to hardware intrinsics, maybe not, I'd encourage you to work with the .NET Core team on making it a reality. That's it. Thank you very much.