What about the rest of you guys? AI? Deep learning? Okay. My talk is focused on deep learning, which is a very specific field of AI, and I go even more specific: how to accelerate inferencing, which is one of the important stages in deep learning. It is a fairly advanced talk, so I will tone it down a bit so that it's easier to digest.

So here is the outline. What is Intel Deep Learning Boost? It's a fancy new hardware technology. I'm going to highlight what it is and why it is super useful, explain the new vector instructions that we included in our latest hardware, which we released in April, and then show you in a live example why it is so nice.

All right. So what is Intel Deep Learning Boost? It is a set of new AVX-512 instructions in our latest hardware, code named Cascade Lake, the second generation of Xeon Scalable processors. These are what we call the big, fat Xeons: really powerful, really fast, and great for AI. We are really proud that this new set of vector instructions lets you do inferencing really fast. They are called Vector Neural Network Instructions, or VNNI for short, and VNNI by itself can give you close to a 2x boost on inferencing.

Now, what is inferencing? Inferencing is prediction. You have your deep neural network, and it has been trained on your data set. The next step is to make the neural network think and make decisions. Making those decisions is what we call doing predictions, or, the technical term, inferencing. We want to make the neural network decide faster, infer faster, and that is what we are bringing to you: hardware technology to speed up that decision making. That was the summary of what I just said.

Let's go back to the deep learning foundations. What is involved in deep learning? A lot of math. If you look at a typical convolutional neural network, where a filter loops over your image, there is a lot of math: cells being multiplied with each other and the results being added together. Lots of multiplications, lots of adds. This is heavy on a compute unit, and since we make processors, and what processors do is compute, we need to make sure we can compute really fast. In typical use cases, especially in high performance computing, the data type is usually floating point 32, so 32-bit numbers. We think we can do things faster if we change the data type; I will say more about this in a second.

So recall the convolutional network I mentioned. These are really popular. If you look at the animated part, this is a filter moving over an image, and as it loops over that image there are lots of muls, or multiplications, and lots of adds. We want to make this faster.

So why do we need Intel Deep Learning Boost? One of the key concepts is quantization. What is quantization? Consider this number: 96.1924. If you represent it in floating point 32, so 32 bits, you take a lot of space in memory and in your registers to represent that number. But the bulk of the information is actually in the 96. Maybe you don't care about the .1924; it's insignificant compared to the 96, right?
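To make that concrete, here is a minimal sketch of the idea in Python. This is not the actual OpenVINO quantization code; the scale of 1.0 is just an illustrative assumption, since real per-tensor scales come out of calibration:

```python
import numpy as np

value = np.float32(96.1924)   # the FP32 number: 4 bytes of storage
scale = np.float32(1.0)       # illustrative scale; real scales are chosen during calibration

# Round to the nearest integer and store it as int8: 1 byte instead of 4.
quantized = np.int8(np.round(value / scale))

print(value.nbytes, quantized.nbytes)   # 4 vs 1: a quarter of the memory and bandwidth
print(quantized)                        # 96 -- the ".1924" is dropped, a small loss of precision

# To use the value again you dequantize it: quantized * scale is roughly 96.0
```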
So if you represent this number as an integer instead, you need much less register space: only 8 bits to represent the integer 96. And that's great. What benefits come out of this? You use less power, so less CPU energy to process that number. You also lower the memory bandwidth, and you lower the storage: recall that earlier you had four boxes to represent a 32-bit value, and now you have only one. All of this adds up to higher performance. That's the key idea behind quantization. As you recall, we had the number 96.1924 and we are now reducing its precision to just 96, so we are losing a little bit of accuracy. What is important is that we don't lose too much accuracy.

Now, a quick intro to the Vector Neural Network Instructions, the hardware technology we introduced. Recall the convolution: a filter going over the image, doing lots of multiplications and lots of additions. We have vector instructions in the hardware to do those multiplications and additions faster. If you look at the first line over there, that's the first generation of Xeon Scalable processors, code named Skylake. If you multiply and add 32-bit floating point numbers, you use one instruction: you multiply the two floating point numbers and get a 32-bit output. But if you do this in lower precision, int8, so two int8 numbers being multiplied and then accumulated, you actually need three instructions to multiply the low-precision integers and add them into a 32-bit output. So we asked ourselves: you're using three instructions, spending a lot of CPU cycles to do that, can we do better? The answer is yes. In our second generation Cascade Lake processors, we combined those three instructions into one. That's the result: the same work that took three instructions now takes just one, so effectively fewer CPU cycles are spent multiplying and accumulating in low-precision int8.

Now, as a software developer, maybe you don't care: this is hardware, why do I need to know about it? Is there software out there that just does this for me, out of the box? The answer is yes. We thought about you software developers and brought to market a product that does it for you: enter the Intel Distribution of OpenVINO. OpenVINO is a tool that can handle int8 processing for you. A quick introduction: in a nutshell, once you have done your training, in TensorFlow, Caffe, MXNet, whatever framework you used, you have a trained model. OpenVINO takes that trained model and sends it through a component called the Model Optimizer, which makes the model more efficient and more CPU friendly, let's put it that way. The result is an intermediate representation, marked as IR: a combination of two files, an XML file and a BIN file, which contains the weights of your neural network. Traditionally, in plain OpenVINO, that model stays in floating point 32 and you do your inferencing on it. That's the traditional flow: first get the trained model, optimize it with the Model Optimizer to get the intermediate representation, and then do inferencing with the inference engine.
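Roughly, that offline conversion and the loading of the IR look like the sketch below. This is not the exact code from my demo: the model name, the paths, and the class names from the pre-2022 OpenVINO Python API are assumptions and may differ between releases.

```python
# Offline, once: convert the trained model to OpenVINO's intermediate representation (IR).
# Typical Model Optimizer invocation (exact path and flags depend on your OpenVINO install):
#   python mo.py --input_model resnet50.pb --data_type FP32 --output_dir ir/
# This produces ir/resnet50.xml (the topology) and ir/resnet50.bin (the weights).

# Online: load the IR with the inference engine and pick a target device.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ir/resnet50.xml", weights="ir/resnet50.bin")
exec_net = ie.load_network(network=net, device_name="CPU")  # or "GPU", "MYRIAD", "HETERO:FPGA,CPU"...
```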
That inference engine is the component of OpenVINO that lets you run inference on whatever hardware you have: a CPU, an integrated GPU, maybe an FPGA, or even the Movidius stick, that little blue stick you may have seen, which lets you do inferencing at the edge. For instance, a drone that's flying, or a little robot driving around: you can do inferencing on that mobile device. That's really great. That's how nice OpenVINO is.

Now, a step back: this talk is about low precision inferencing, so how does OpenVINO fit in? OpenVINO takes the 32-bit intermediate representation and runs it through a component called the Calibration Tool, which calibrates that 32-bit representation into an int8 model. Once you have this int8 intermediate representation, you do your inferencing in low precision. That's the big picture. The calibration part is done once; we call it the offline stage. You do it once, store the resulting intermediate representation on your robot or your drone, and the online stage is the live inferencing that runs on the device.

Anyway, enough said about that. Let me show you live results of the benefits of low precision inferencing. I will show you two cases: in one, I do inferencing with the floating point 32 model, and in the other, inferencing on the same data set but in low precision, integer 8. Let's see the difference. I have an example here, which I will move to the screen; my mouse went there. In this Jupyter notebook I'm going to show you inferencing in 32-bit floating point and see how fast that is, and then inferencing in int8 and how fast that is. Let me maximize the screen view.

This is a very simple setup. I'm just inferencing on cats and dogs, an open data set, and I'm using Intel OpenVINO for the inferencing. The Model Optimizer converts my ResNet model into the intermediate representation. That's done, great. The next step is to declare the network and so on; I do that, okay, it's done. Next, import matplotlib, which will take care of loading and plotting the results, and now I'm processing the images. Done. And then, as you can see, I'm proceeding with inferencing on my image data set. There are cats, there are dogs, really cute cats, really friendly dogs, but pay attention to the numbers. What is my rate? This is the 32-bit representation of my network, and I'm inferencing at approximately 300 images per second. Maybe you'd be happy with 300 images per second, but can we go faster? The answer is yes, by leveraging the hardware technology for that. The calibration tool, if you recall, is the part that converts the 32-bit model into the int8 model, and I've already done that, so in the interest of time I'll proceed to show you the int8 model. I define my target device, as you can see it's CPU, and the network is int8, cool. I'm loading the plugin, loading the model, and so on and so forth. Now let's proceed with inferencing. It's the same data set, same cats, same dogs, but as you can see I am inferencing faster, approximately 600, close to 700 images per second. We can put this in a table to show the difference between the two: inference speed in 32-bit, approximately 300 images per second; in int8, low precision, close to 700.
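The measurement itself is nothing fancy. Here is a minimal sketch of how you could time images per second with the same pre-2022 Python API; the random array standing in for the preprocessed cats-and-dogs batch, the paths, and the attribute names are assumptions, and the real notebook may do this differently:

```python
import time
import numpy as np
from openvino.inference_engine import IECore

# Load the (FP32 or int8) IR, as in the previous sketch.
ie = IECore()
net = ie.read_network(model="ir/resnet50.xml", weights="ir/resnet50.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))  # single input assumed; older releases call this net.inputs
images = np.random.rand(200, 1, 3, 224, 224).astype(np.float32)  # stand-in for preprocessed images

start = time.perf_counter()
for img in images:
    exec_net.infer(inputs={input_name: img})   # synchronous inference, one image at a time
elapsed = time.perf_counter() - start
print(f"{len(images) / elapsed:.0f} images per second")
```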
So what's the key message there? Leveraging low precision inferencing, boosted by our Vector Neural Network Instructions on Cascade Lake, we got almost twice the inferencing performance. Isn't that great? Software technology, low precision, boosted by hardware: twice as fast. I think that's nice, and that's the key idea I wanted to share with you. If you have the chance, use Cascade Lake. Cascade Lake is available on Amazon Web Services, and as of now AWS is the only cloud provider that offers Cascade Lake with the VNNI instruction set. And for the deep learning folks in the room: take a look at low precision inferencing. Start from 32-bit, the usual case, and consider int8. There are corner cases where int8 won't work, for instance when you really care about precision, say you're looking for cancer cells in an MRI image, where very specific details are really key and important; then maybe not. But in other cases, like images of cats and dogs, or maybe language processing, sound and so on, where precision may not be that critical, int8 is a great boost. It gets stuff done faster. That's it. Thank you.

Okay, we have some time for questions. Any questions? I'll pass by with the microphone.

Hi, thank you first for the talk. What's the trade-off in precision if you compare the floating point 32 model and the int8 model? Because you showed the increase in speed, but not the decrease in precision, right?

Yeah, there is some loss of precision, and it all depends on the nature of the task you're doing. In my case I was classifying cats and dogs, and I did lose some precision, yes, but it didn't affect my results. How much you lose depends on your use case; I can't give you general numbers, because that's very specific to the nature of the task.

Okay, because you could have some kind of validation metric that you... but okay.

Yeah, in my case, for the cats and dogs example, the loss was not that big, something like a 5% difference. For me that was totally okay. It depends on your use case.

Okay, thanks. If I understood correctly, in the demo you showed, you compared floating point without the special instructions against int8 with the special instructions. Have you tried benchmarking int8 without the special instructions, just to see if the lower memory footprint helps?

Yeah, of course. Let me show you this slide again. In gray, that's Skylake, the previous generation; still a great Xeon processor, just not the latest. And the one released this year is Cascade Lake, in yellow. The difference between the two is that Cascade Lake comes with the Vector Neural Network Instructions, VNNI. So the comparison you're referring to is Cascade Lake versus Skylake. Yes, and there, in my measurement results, I got close to a 2x performance boost. That was my real-world result; on paper, in theory, it should be 4x, if you do the math on how many instructions you need for the muls and adds versus how one instruction does it. So on paper 4x; on my machine, when I measured it, I got around 2x to 2.9x. I think I also have it somewhere here, the speedup. Okay, let me move that, hold on. You see the difference there: the FP32-to-int8 speedup was 2.3x, but that's comparing int8 against 32-bit. Comparing int8 on Skylake against int8 on Cascade Lake, in another example, it was also around 2x. Right now I'm comparing 32-bit on Cascade Lake against int8 on Cascade Lake; if you compared 32-bit on Skylake against int8 on Cascade Lake, then it's 4x.
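For the curious, here is where the on-paper 4x comes from, sketched in Python. A 512-bit register holds 16 FP32 values but 64 int8 values, and the VNNI instruction VPDPBUSD does in one step what took three instructions on Skylake: multiply pairs of 8-bit values (one unsigned, one signed), sum each group of four products, and accumulate into a 32-bit lane. The numpy code only models that arithmetic; it is not how you actually program the hardware.

```python
import numpy as np

# Peak multiply-accumulate operations per 512-bit instruction:
fp32_macs_per_instr = 512 // 32            # 16 FP32 lanes, one fused multiply-add each
int8_macs_per_instr = (512 // 32) * 4      # 16 int32 accumulator lanes x 4 byte pairs = 64
print(int8_macs_per_instr / fp32_macs_per_instr)   # 4.0 -> the "4x on paper"

# What one VNNI-style fused step computes for a single 32-bit accumulator lane:
a = np.array([3, 1, 7, 2], dtype=np.uint8)   # 4 unsigned 8-bit activations
b = np.array([5, 4, 1, -2], dtype=np.int8)   # 4 signed 8-bit weights
acc = np.int32(10)                           # running 32-bit accumulator
acc += np.sum(a.astype(np.int32) * b.astype(np.int32))  # multiply pairs, sum, accumulate
print(acc)   # 10 + (15 + 4 + 7 - 4) = 32
```

In practice you see less than 4x because memory bandwidth, preprocessing, and the layers that stay in FP32 all take their share, which is why my measured numbers are closer to 2x.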
Any other questions?

You skipped the calibration part here; you said you had done it already. How much time does it usually take, and could you share what it's based on?

Yeah, the calibration takes some time, around ten minutes or so depending on how big your model is, and because I don't have ten minutes here, that's why I did it offline. For a model that was around 50 to 60 megabytes it was under ten minutes, and you do it offline, one time, and then you're done. The goal is that at the edge, in real-life inferencing, decisions are made quickly; that part has to be fast.

Sorry, the mic. What is the calibration based on?

The calibration is basically taking the 32-bit representation of the weights and so on and representing them in int8. In a neural network you have the nodes and the weights, and these numbers are all represented in 32 bits, which takes a lot of space, and going through all of that, doing the calculations between the nodes, takes a lot of CPU cycles. So the idea is to convert this whole thing, well, certain layers, into int8. Actually, not all of the layers are calibrated to int8, only some of them.

Right, we have one question here. So the reduction from 32-bit floating point to 8 bits: is it always possible? Does the algorithm that does it always succeed? And if it does, is it up to the person that created the model to evaluate whether the result is good enough?

Yeah, excellent question, and it's a bit related to the last one. Some layers, not everything, but some; it depends. OpenVINO will figure out whether it can or cannot, and it also gives you a report: it has calibrated this layer successfully, others it didn't. But at the end of the day, when some parts have been converted to int8, you should, out of the box, see better performance compared to pure FP32.

So its answer can be that it cannot convert this?

It could be, when the network is very complex or OpenVINO doesn't understand your neural network, maybe. Thank you. Any other questions? Five minutes. I'm really happy there's so much interest in this; that's great, I didn't expect so many questions. You guys are great.

So I have a question. You said that this technology is available on Cascade Lake. Are they for sale right now? Can we get them right now? And will this be a technology that will also arrive on the desktop, or is it going to be just on the Xeons, the data center type of thing?

Okay. Right now, these Vector Neural Network Instructions, which are hardware technology, are only available on Cascade Lake and future Xeon processors. When it will come to desktop, I don't know; but it's on Cascade Lake. Now, hardware we sell; you mentioned sale, right? Hardware is something we sell. But OpenVINO is open source technology, you can read the source code online if you want, and our AI software tools are free and open source; we don't sell those.

Any other questions? Yes, one question. Are you planning to have this in MKL? Are you going to add this operation to the MKL library?
If you want to use it, I don't know, in standard Python, for the next generation of hardware?

Well, MKL can handle int8.

But do you have some kind of convolution operation directly implemented in MKL that is optimized using this set of instructions?

Michelle may want to add something on this one. You're asking about MKL; there's also MKL-DNN.

I'm sorry, I'm butting in, but I also work for Intel, and for this question I think I know the answer. MKL-DNN is the extension of the MKL library for deep neural network operations, and it already has support for int8 operations. We keep adding new algorithms as the architecture evolves, and the VNNI instructions are already in there. So these instructions can be used not only through your tool; for example, a compiler could generate assembly or binary code that uses them. OpenVINO is one of the software solutions we provide that is already taking advantage of int8, and its process of converting models to int8 is actually kind of special because it's almost automatic, almost like magic: you don't need a data scientist to tweak it manually. That part is only OpenVINO, but MKL-DNN is available to anybody, and there are other solutions we're also working on at Intel, for example a graph compiler called nGraph, which also takes advantage of these instructions through the MKL-DNN implementation. So yes, many frontends can use this. The direct optimizations Intel provides for libraries like TensorFlow will also use these instructions if you compile them with the VNNI architecture extensions.

One last very quick question? If not, we can thank the speaker again, the speakers. Thank you very much. Very interesting.