Thank you for coming to this session. My name is Guo Bin Cheng, from Intel. I know it can be tiring after going through so many topics since yesterday, so I hope I will not put you to sleep for the next half hour. So yeah, let's start. My topic is about enabling generative AI everywhere with ubiquitous hardware and open software. As we know, AI is very important today, and I believe it will be even more important and broadly used everywhere. So a simple question: it is so important, but how can we get on this boat simply? A quick answer is that you need some hardware and you need some software, right? When we talk about AI, or deep learning typically, we usually refer to a GPU. But actually, AI can also run on other hardware, like the XPU or even CPUs. From the Intel perspective, we provide all of these combinations: CPU, GPU, XPU. Especially for CPU, we provide both server CPUs like Xeon and desktop ones like Core. All of these can be used by our community to run your AI workloads.
And I believe that when we talk about computation, the most typical and simplest thing we have used over the past decade is the CPU, right? So let me show you a brief roadmap of what Intel has provided in Xeon to help with AI. Skylake, the first generation of the Xeon Scalable CPUs, provided AVX-512. We all know that. It does vectorized computation, so we can use these instructions for dot products and GEMM computation, this kind of compute-intensive workload. After that, we also provided the VNNI instructions for the INT8 data type. And most recently, we provided AMX, a tile-based matrix computation extension. It is a very specifically designed computation module inside the CPU that can quickly run typical workloads like matrix multiplication. And matrix multiplication is actually the core computation in deep learning and AI models, in operations like convolution and linear (GEMM) layers. That is why Intel provides this AMX accelerator inside the CPU; it can also be used to accelerate AI workloads. Compared with AVX-512, when we run in the INT8 data type, AMX can provide up to 8 times faster performance. I think that is a big improvement from the hardware perspective.
When we talk about software, you know that for AI and deep learning, the typical framework today is PyTorch, right? Here I show you the typical ecosystem based on PyTorch. In the middle is PyTorch itself. On top of PyTorch, the community typically uses Hugging Face, which provides a lot of deep learning models and modules, especially large language models, and also other libraries like TorchVision and TorchServe, these kinds of things. And on top of all of these, we have the typical models like Llama, GPT-J, and Falcon, which are large language models, and also Stable Diffusion, this kind of thing, for generative AI.
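As a quick aside, here is a minimal sketch, assuming a Linux system, of how you can check whether your own machine exposes the instruction sets mentioned above; the flag names (avx512f, avx512_vnni, amx_tile, amx_bf16, amx_int8) are the ones the Linux kernel reports in /proc/cpuinfo:

    # Sketch: list which of the Intel ISA features this Linux CPU reports.
    def cpu_flags():
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    # The "flags" line holds a space-separated feature list.
                    return set(line.split(":", 1)[1].split())
        return set()

    flags = cpu_flags()
    for feature in ("avx512f", "avx512_vnni", "amx_tile", "amx_bf16", "amx_int8"):
        print(feature, "yes" if feature in flags else "no")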
And here I also want to mention that, from the Intel perspective, we provide a framework-level library called Intel Extension for PyTorch (IPEX). In this extension we provide the latest optimizations for CPU and XPU, these kinds of things. You can think of it as a staging area for optimizations for Intel platforms: when we push an optimization into PyTorch itself, it takes some time to get merged into the upstream repo, so before it is fully landed in PyTorch, the optimization can already be used through Intel Extension for PyTorch.
Here is some more breakdown of what Intel Extension for PyTorch is. It is an extension library, open source in the community; you can get it from this link. What it does is try to run the typical deep learning models, and also generative AI models like Llama and GPT-J, as fast as possible. So you can use IPEX, I mean this Intel Extension for PyTorch, to get better performance. It also has a very similar API to stock PyTorch, so you only need very limited code modification at the application level to include this library; a short sketch follows below. And it has a very similar release cadence to PyTorch: if PyTorch releases 2.1, IPEX will also release a 2.1 version just one or two weeks after the PyTorch release, so there is a one-to-one mapping when you pick up this library.
Here is also some breakdown of the optimizations we have in PyTorch and IPEX. I will not go into too much detail, but generally, when we talk about optimization in PyTorch, we refer to three layers. The first is the operator layer: we need a good kernel design, with vectorization, maybe threading, these kinds of things. The second is the graph layer: we can do operator fusion, and we can use deep learning compiler based technology to build well-combined kernels that reduce overhead. The final one is the runtime layer, where we can have better control of the environment and fully utilize the power of your system.
And specifically for large language models: compared with other deep learning models, the typical thing about a large language model is that it has a very big size and a very special MHA (multi-head attention) design. So we have some special optimizations to make all of this perform as fast as possible, like a special GEMM design that utilizes AMX to get better GEMM computation efficiency. We also try to use low-precision computation, like BF16 and INT8, to make the model smaller when it runs on the hardware. And we try to use details like an indirect-access KV cache and paged attention to make the MHA part as efficient as possible, plus some graph fusion, these kinds of things.
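To make the "very limited code modification" point concrete, here is a minimal sketch of dropping the extension into an ordinary Hugging Face text-generation script. The model id and prompt are just placeholders, and newer IPEX releases also provide LLM-specific entry points (for example ipex.llm.optimize) whose exact usage can differ by version, so treat this as a sketch rather than the exact benchmark setup:

    import torch
    import intel_extension_for_pytorch as ipex
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal-LM checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()

    # The IPEX-specific line: apply the CPU optimizations (bf16 kernels,
    # operator fusion, AMX-friendly GEMMs) to the existing model.
    model = ipex.optimize(model, dtype=torch.bfloat16)

    prompt = "Explain AMX in one sentence."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
        output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

The only line that is specific to the extension is the ipex.optimize call; everything else is stock PyTorch and Transformers code, which is what the "very similar API" point means in practice.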
So with all these optimizations, we can get good performance even when we run on a single CPU. Here I show you some data we collected on an AWS instance. We can just look at the white bars: when we run a typical Llama 7-billion-parameter model, the average token latency is around 39 to 41 milliseconds. What does this kind of latency mean? The typical reference point is human reading speed, which is usually about 100 milliseconds per token: generally, when we as humans read an article or some sentences, we read about 10 tokens per second, which is that 100-milliseconds-per-token level. So for Llama 7 billion on a single CPU, we get roughly double the performance of human reading speed. And when we run a bigger model, about twice the size, like Llama 13 billion, the latency is around 72 to 76 milliseconds, still much faster than typical human reading speed. So when you build an application for your scenario, you can just run this kind of 7-billion or 13-billion model on the CPU and still be faster than your consumer's reading speed.
And here is another set of data, about Stable Diffusion. We know that Stable Diffusion is a generative model that generates pictures from text input, these kinds of things. In practice, Stable Diffusion will be fine-tuned to adapt to your scenario. So first, for fine-tuning: on a machine with four Sapphire Rapids nodes, we can do the fine-tuning in just five minutes. It means you can even fine-tune several times in just one hour. So you can support more scenarios: if you have a service where you want to do Stable Diffusion generation in a particular style, you can change the style with a different fine-tuning, and these kinds of things can be done in a few minutes. That can improve your response rate to your users. And regarding inference after you fine-tune your model: on a typical Sapphire Rapids CPU, you can do one round of image generation in just about five seconds. So it gives you some more opportunities when you want to run Stable Diffusion generation on one machine and have several users share that machine to do generation. Yeah, all these things can be done very quickly (there is a short sketch of such a pipeline at the very end).
So here, with all these examples, what I am trying to show you is that with this software and this hardware, AI, and especially large language models and generative AI models like Stable Diffusion, can all be done with just the CPU, together with our software, like PyTorch and IPEX, these kinds of things. So I believe this gives you an opportunity to try AI on a simple machine from AWS, from GCP, these kinds of cloud platforms, so you can easily try out AI solutions. OK, I think that is probably all of my presentation today. If you want to try these things, you can go to this link, or you can send me an email if you have any questions or want any follow-up materials. Any questions? OK, so I think we can end here today. Yeah.
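For the Stable Diffusion part, here is a similar minimal sketch of running one round of image generation on a CPU with the Hugging Face diffusers library. The checkpoint name stands in for whichever model (or your own fine-tuned one) you want to use, the prompt and step count are arbitrary, and depending on your diffusers/PyTorch version you may need float32 instead of bfloat16 on CPU:

    import torch
    from diffusers import StableDiffusionPipeline

    # Placeholder checkpoint: swap in your own fine-tuned model here.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.bfloat16
    )
    pipe = pipe.to("cpu")

    # One round of text-to-image generation on the CPU.
    image = pipe(
        "a watercolor painting of a lighthouse at sunset",
        num_inference_steps=25,
    ).images[0]
    image.save("lighthouse.png")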