So to begin with, just a quick introduction to what ONNX is. ONNX stands for Open Neural Network Exchange. It was built as a standard, an open-source format for representing neural networks. One of ONNX's core principles is to enable interoperability: ONNX acts as a bridge across different frameworks, so you can train a model in one framework and deploy it in another. Several frameworks are supported by ONNX, including very popular ones like PyTorch and TensorFlow. Over here you can see what that means in practice: you can train a model in something like TensorFlow or PyTorch, convert it into ONNX, and deploy that converted ONNX model on your desired target, whether that's TensorFlow, TensorRT, or TensorFlow Lite. In essence, ONNX is being used as an intermediate representation of sorts.

There's another key principle ONNX is built on: portability across different platforms. You can use the same ONNX file and deploy it across different devices without having to make any modifications. You could take a model that was run on a GPU and run the same model on a CPU without any significant modification at all. This lets you as a developer work with different devices without having to worry about where your application is going to be deployed.

If you look at the ONNX ecosystem, ONNX keeps growing, and it has a very diverse community that keeps adding new operators and support for the latest advances in deep learning. At a high level, ONNX defines special interest groups, or SIGs, and everything inside the ONNX umbrella, whether it's a tool or a library, falls under one SIG or another. The current ONNX SIGs are architecture and infrastructure, compilers, converters, models and tutorials, and optimization and operators. To talk a little more about ONNX, we have Krishna, who is the co-chair of the models and tutorials SIG. Thank you, Krishna.

Hey guys, this is Krishna. I'm an ML software engineer at AMD, and I'm super excited to be here to talk about TurnkeyML, which is the newest repository under ONNX, alongside the ONNX Model Zoo. Before I jump into what TurnkeyML is and talk about the architecture, let me start with a quick demo. What you see on the right is the Hugging Face page for BERT, and there's an example; let's say I want to run that example on some hardware and see how well it performs. With the turnkey tool, I can just copy the example code, throw it into my code editor, call turnkey, and ask it to run on the x86 device. You'll see the whole flow go through: the PyTorch model gets exported to ONNX and gets benchmarked on the device I asked for, and all of this happens with no changes to the code. I just had an example file that I threw into TurnkeyML and asked it to do a bunch of things.

So why do we do this? What's the background here? In the AI compute stack, the compute landscape is really three-dimensional. You have rapidly evolving model architectures, with transformers, LSTMs, CNNs, GNNs, and a lot more, and we have diverse software ecosystems today.
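Before moving on from the demo, here is roughly what that kind of example file looks like: a standard Hugging Face snippet of the sort you would copy off a model page. This is a minimal sketch assuming the transformers and torch packages and the bert-base-uncased checkpoint (none of these names come from the talk itself); the turnkey CLI is then pointed at a file like this, with the exact flags described in the TurnkeyML documentation.

```python
# bert_example.py -- a typical Hugging Face usage snippet; TurnkeyML's analysis
# tool discovers the PyTorch model and its example inputs from a file like this.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Example inputs, exactly as they would appear on the model page.
inputs = tokenizer("Hello, ONNX!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```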
On the software side we have ONNX Runtime, TensorFlow, TensorRT, PyTorch, and a lot more, and there is also diversity in hardware: CPUs and GPUs have always been there, and now we have a lot of accelerators showing up, FPGA-based accelerators, NPUs, and all three of these dimensions continue to grow. Let's say that as a developer or a user you live in some part of this stack; how do you know how good a particular component is? Say you're developing a runtime, a software stack for AI: how do you know how good it is? What's the efficacy of the stack? How broad is the support? Or say you're developing hardware: how good is that hardware? What are all the different domains of models it can run, how good is it in each domain, and where is it not good enough and should be improved? How do you find that out? What's the efficacy of your compute stack? That's the problem we're trying to address here.

So going into TurnkeyML: TurnkeyML is a suite of tools that seamlessly connect with each other. Some of the highlights: there is a PyTorch model analysis tool that can find and recognize models straight from an example Python file. I don't have to pull the model out and point to it; the tool can automatically recognize where the model is, what the inputs are, and what it looks like. There is a build tool for transforming one model into another; processes like quantizing a model, distilling a model, or compiling a model are all model-to-model transformations. And at the end there is an execution tool: you've built your model, and now you run it on whatever specific hardware you want. These tools work very well together, and they provide a standardized execution and reporting framework.

Double-clicking into the stack and getting into the architecture of the tool itself: we have models at the top, from all the repositories we know and love, Torch Hub, torchvision, Hugging Face, all of these open-source repositories with lots and lots of models. Then we come to the tool itself. The bigger box you see is the TurnkeyML framework, and access to it is through a CLI or an API. The first layer is called sequences, which performs model-to-model transformations, as I mentioned. You have a PyTorch model and you want to convert it to ONNX: that's a sequence. You want to quantize a model with your own quantizer: that's a sequence. You can chain your own set of sequences together to go from an input model to the model that you want. Then there are runtimes, which you can pick from a number of available options; a runtime is the software stack used to execute the model. If you want to run on the GPU, you could use TensorRT; if you're running on CPUs, you could use ONNX Runtime; and there are a number of other choices. And any time we run a model, we'd like to profile it: how long is it taking, what's happening in the hardware, how do you analyze what's going on under the hood? Some examples are Intel VTune, AMD uProf, and nvidia-smi; you can choose the right profiler for the job you're doing. And at the end of the day there's the device, which is the CPU, GPU, or any other accelerator you want to test on.
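To make the "sequence" idea concrete, the simplest transformation mentioned above, PyTorch to ONNX, is essentially a call to the standard exporter. Here is a minimal sketch on a stand-in model; this is illustrative rather than TurnkeyML's internal code, and the file name is hypothetical.

```python
import torch

# Stand-in model; in TurnkeyML the analysis tool discovers the real one from your file.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
example_input = torch.randn(1, 128)

# The heart of a Torch -> ONNX sequence: export with the standard exporter.
# Later sequences (quantization, graph optimization, ...) start from this file.
torch.onnx.export(
    model,
    example_input,
    "model_fp32.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)
```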
All of this sits on a common reporting infrastructure that measures a lot of metrics throughout the tool: latency and throughput of the model, even power consumption, or whatever output your profiler gave you.

Now I'm going a little deeper into the architecture. I gave you an example of each layer: sequences, runtimes, profilers, devices. What we have here is a really extensible framework, through a concept called TurnkeyML plugins. Let's say I have a specific piece of hardware and a software stack and I want to build my own plugin; it's as easy as plugging these components in. What we already support is ONNX on the x86 CPU: you have a model in Torch, you call a sequence that does the Torch-to-ONNX conversion using the standard Torch-to-ONNX exporter, and you can run that model on the CPU through ONNX Runtime. We also support running the same model on the CPU through Torch, or if you want to use the GPU, you can use TensorRT or ONNX Runtime with DirectML. And at the end you see a box in orange, because we also have the capability to keep certain plugins private, where we don't have to open them up to the public. Internally, we have a Ryzen AI plugin that we run on AMD's own NPU. What that looks like is: you have a Torch model that comes from all of those repositories at the top, you go from Torch to ONNX FP32, then we quantize that model with our own quantizer, we use ONNX Runtime with the Vitis AI execution provider as the runtime, use the AMD graphics manager as the profiler, and run it on Ryzen AI. So let's say you have a new piece of hardware and a runtime: you can build your own plugin like this, install it into TurnkeyML, and benchmark away on all of the models available in those repositories on your own compute stack.

So yeah, we saw a demo of BERT, but ResNet and BERT demos have gotten old, so let's try something new. The resolution is not that great, but what you have on the right is the Llama model from Meta that's available on Hugging Face, and what I've highlighted is just the example, just like we had with BERT before. The window on the left is my VS Code window, which is actually running on a 64 GB RAM Azure VM. All I'm doing here, again, is copy-pasting the code, getting some example inputs for Llama, importing Torch, and asking TurnkeyML to run this model. At this point it's going to go through the same flow: it converts the model to ONNX, because that's the sequence I've chosen, it optimizes the graph with the optimizer I've chosen, in this case the default optimizer, and then it saves the model to disk. Once the process is complete, all of the results and files are saved in the cache folder. Right here I'm going into the cache folder, and you can see that all of the files, including the ONNX files and the weights for these modules, are saved there. There is also a separate stats file that gives me in-depth information on everything that happened during this execution. It tells me how long each stage of the process took, and what all the different operations in this model are, the diversity of ops: we have so many MatMuls, so many Sigmoids; what does this model look like? It also tells me how many parameters there are and which packages were used for this particular execution, which helps me recreate the whole setup if needed.
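That "diversity of ops" statistic is easy to picture: it's essentially a count of node types in the exported ONNX graph. Here is a minimal sketch of the idea using the onnx Python package (not the actual TurnkeyML reporting code; model.onnx is a hypothetical path):

```python
from collections import Counter

import onnx

model = onnx.load("model.onnx")  # hypothetical path to an exported model
op_counts = Counter(node.op_type for node in model.graph.node)

# Prints something like: MatMul 96, Sigmoid 12, ... for a transformer-style graph.
for op, count in op_counts.most_common():
    print(f"{op:20s} {count}")
```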
Given the ability we have to process thousands of models from so many open-source repositories in a very automated way, we were able to convert a lot of the open-source PyTorch models into ONNX and revamp the entire ONNX Model Zoo with 2,000 new models. That nearly 10x's the number of models, and it's publicly available today. The whole infrastructure I spoke about has been open-sourced under the ONNX umbrella, and that repository is also publicly available today. I'd really appreciate anybody here taking a look, and while you're there, give us a star as well.

So that was TurnkeyML; moving on to talk about AMD's investments in open source. AMD is really committed to open source. We are contributors to major open-source projects like Hugging Face, PyTorch, and ONNX, and we see ONNX as a key piece of the puzzle for improving developer acceleration and developer experience. At AMD, we have different execution providers for all of the accelerators that support AI: we use the default MLAS CPU execution provider for x86 CPUs, we have AMD ROCm and MIGraphX for AMD GPUs, and we also have a separate EP for the Ryzen AI NPU from AMD. All of the ONNX models we saw can be accelerated on these accelerators directly through these ONNX Runtime execution providers. These execution providers are part of a rich ecosystem of EPs in ONNX Runtime, and as you can see here, they come from almost all of the vendors in the AI space; all of them contribute execution providers to ONNX Runtime. This also applies across different device classes: CPUs, GPUs, even mobile and edge, all the way to the cloud. That's all I had. I'll hand it over to Kishen to continue on the developer flow.

So, moving on to the developer flow: what does ONNX do for a developer? How can a developer take advantage of ONNX in their applications? Just to revisit what we talked about before, these, at a high level, are the core principles of ONNX. It is interoperable: as you can see, a model can move between any two supported frameworks. And it's also portable: you can generate an ONNX file and run it on any required device, whether it's a CPU, GPU, FPGA, or NPU. One thing that ONNX enables, given that it's interoperable: let's say as a developer you're planning to use some of the latest models in your application. ONNX gives you quick access to some of the latest research models that have come out. It acts as a bridge, so you can take a model developed in any framework and deploy it in your required framework. Essentially, ONNX removes the barriers between frameworks and accelerates the path from model training to model deployment. That speeds up innovation across the AI community, lets us deliver more models to customers more quickly, and from there we can build much more optimized models. Another thing ONNX has is the ONNX Model Zoo that Krishna talked about. If you want to use some of the most prevalent machine learning models that have already been developed and well studied, you don't have to do any additional work.
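A minimal sketch of what "no additional work" means in practice: load a Model Zoo model with ONNX Runtime and run it. The resnet50.onnx file name here is a hypothetical stand-in for a zoo download.

```python
import numpy as np
import onnxruntime as ort

# Load a model downloaded from the ONNX Model Zoo (the path is hypothetical).
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])

# Feed an input with the shape the model expects; ResNet-50 takes a 1x3x224x224 image.
input_name = session.get_inputs()[0].name
dummy_image = np.random.rand(1, 3, 224, 224).astype(np.float32)

logits = session.run(None, {input_name: dummy_image})[0]
print(logits.shape)
```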
There are a lot of models in the ONNX Model Zoo that you can take and plug into your application as is, exactly like that, and this reduces your application development time. Depending on your use case, there are models trained for general purposes, and there are also models fine-tuned for specific applications that you can take and plug in. The other principle ONNX stands on is portability: you can deploy across different devices, whether CPU, GPU, or NPU, using the same model file. As a developer, you don't need to worry about where your application is going to be deployed, whether it runs on a CPU or an NPU. This lets the developer focus on the application and make it quickly accessible to the customer without having to consider the environment restrictions for the application. Apart from this, ONNX also gives you easy access to a lot of optimization tools that can lead to better model performance; some of these are quantization and sparsity, and they can be applied to many industry-prevalent models that are already available, either in the ONNX Model Zoo or as converted ONNX models. You can take those converted models and run these optimization tools on top of them to get better performance.

Moving on, if we look at deployment scenarios for ONNX: ONNX is very versatile. We talked about it being interoperable as well as portable, and because of how flexible it is, you have deployment scenarios for ONNX on both the cloud and the edge. Each deployment comes with its own challenges; there are particular conditions you look at when you're deploying on the cloud versus when you're deploying on the edge. Intel is committed to both scenarios; we want to accelerate ONNX wherever it's deployed. On the cloud side, there are a few conditions you focus on more, such as throughput and latency. In a typical use case, this is what your deployment is going to look like: you train your model and convert it into an FP32 ONNX file. You can either use that directly to deploy your model across different applications on your required hardware device, or you can optimize it further; you can do some kind of model compression to get an INT8 model, get better performance, and use that in your deployment.

Now, the factors influencing deployment: we talked about throughput and latency, so why are they important? In cloud scenarios you have a large number of incoming requests that your model has to handle, and it has to handle them efficiently without taking too much time. That's where the throughput constraint comes in: you want your model deployed in such a way that it gets as much throughput as possible, so you can answer more queries in a given span of time, which also gives you better TCO savings. And you also need to look at latency, because in real-time applications you want your application to be responsive; you can't have your user wait too long to get an output for a given input. Those are a few of the constraints you look at in the cloud scenario.
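The FP32-to-INT8 compression step in that flow can be as small as a single call. Here is a minimal sketch using ONNX Runtime's built-in dynamic quantizer (Intel Neural Compressor, covered next, offers richer calibration and accuracy controls); both file paths are hypothetical.

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Compress the exported FP32 model to INT8 weights (paths are hypothetical).
quantize_dynamic(
    "model_fp32.onnx",
    "model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```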
What Intel helps ONNX do is this: we've optimized ONNX Runtime to be flexible enough to deploy wherever the requirements are. We have developed accelerated libraries, provided support for some of the latest hardware instructions, and made sure that whenever you're running something, you get the best performance on Intel hardware. As mentioned earlier, ONNX has access to a set of tools for model compression, tools like Intel Neural Compressor, that you can use to get better performance from models, and you get better cost savings because you have reduced resource requirements when you apply these optimizations. Some of these model compression techniques are quantization and sparsity. Quantization is a technique where you lower the precision at which your model executes, so FP32 compute can be reduced to INT8 compute; this also helps the model take advantage of the different hardware accelerators present across diverse devices. Sparsity is a technique where you zero out individual weights across the model, which helps reduce the memory pressure that some of these models put on devices. There are a few related optimizations as well, such as structured sparsity, where instead of zeroing out weights at random you zero them out in a targeted pattern, which also helps you take advantage of some of the hardware accelerators present on the devices. Another thing the Intel Neural Compressor (INC) does is make sure that the models we've optimized, quantized, or sparsified are readily available in the ONNX Model Zoo. INC has made an active commitment to making as many quantized, performant models as possible available in the ONNX Model Zoo, so that any developer can take them, use them in their applications, and see the performance benefit straight out of the box without any additional fine-tuning on their end.

So what does Intel do for the ONNX ecosystem on the cloud side? We have worked extensively with ONNX Runtime to make sure it takes advantage of the latest Intel instructions that have come out. We've optimized ONNX Runtime to run matrix multiplication 4x faster on 4th Gen Xeon compared with previous generations, and this was done through Intel Advanced Matrix Extensions, or AMX. AMX is a set of instructions and registers designed specifically to accelerate matrix multiplication on the latest Intel Xeon servers. We also have additional hardware-optimized libraries such as OpenVINO and oneDNN. These libraries let users take the models they developed or converted, run them across various Intel devices, and still obtain as high performance as possible without too much fine-tuning or tweaking of the model.

Now, if you look at the edge side, the set of problems you face there is going to be quite different from what you see on the cloud side. On the cloud side, we saw that throughput and latency were the things we looked at when deploying. On the edge side, the considerations are slightly different: we're looking at the responsiveness of the application, the memory footprint, the battery life, and the disk size. Responsiveness is pretty straightforward.
As users, you want your application to be quick; whatever inputs you give, you want the outputs generated quickly, so you don't want to wait every time you press something or provide an input. Next, memory footprint: on PCs, a lot of the time you're running multiple applications in parallel, and you have limited memory across all of them. You can't have one application taking up the majority of the memory and slowing down the others, so you need to make sure the applications you're building are frugal in their memory usage. Battery life, again: for edge devices that are portable, like PCs or tablets, you want to make sure there isn't too much power demand when running or deploying these models.

So what ONNX does here is use certain hardware acceleration features through ONNX Runtime, one of which is DirectML from Microsoft. What DirectML does is use DirectX 12 to run some of the kernels on the GPU, so you get better performance and you don't have to wait too long, or worry about your CPU taking too much time to run inference on the model. This is how it looks at a high level: your app calls into ONNX Runtime, and ONNX Runtime either calls into DirectML for the GPU or goes directly to the CPU. The application can choose at runtime, depending on the user's needs, whether to use the CPU for lower power or the GPU for higher throughput. And since we don't want our models or applications to put too much pressure on the resources available on a PC, we can also take advantage of Intel Neural Compressor for model compression to reduce the model size. The reason is that, compared with an FP32 model, an INT8 model uses less memory and fewer of the other resources on a resource-constrained edge device, so you get a better user experience because of the reduced pressure on device resources. And that's a link to the Intel Neural Compressor; if you want to check it out, please do. They do some amazing optimizations: sparsity, distillation, quantization, and a lot of other fun stuff.

So what does Intel do on the edge side? Again, Intel is extremely committed to making sure that ONNX deployment on the edge is optimal. Why do we do this? Because we've been seeing an increasing trend of wanting inference to be done on the edge. The advantage is that when you do inference on the edge instead of on the cloud, you can avoid the overhead of sending your data or your inputs to the cloud, so you avoid that data transfer time. Another advantage is that you don't have to be constantly connected to the network for your application to work. If you can run inference on the edge, it's more responsive, and you don't depend on your network bandwidth and other such factors. To that end, we at Intel have made sure that ONNX Runtime is optimized to make use of Intel DL Boost, so you can run these INT8 or otherwise compressed models more optimally; we take advantage of VNNI and other such accelerators.
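That runtime choice between GPU and CPU shows up directly in how the application creates its ONNX Runtime session. Here is a minimal sketch, assuming the onnxruntime-directml package is installed and a hypothetical model.onnx:

```python
import onnxruntime as ort

# Prefer DirectML (GPU) when it's available; otherwise fall back to the CPU provider.
available = ort.get_available_providers()
providers = ["DmlExecutionProvider"] if "DmlExecutionProvider" in available else []
providers.append("CPUExecutionProvider")

session = ort.InferenceSession("model.onnx", providers=providers)
print("Running with:", session.get_providers())
```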
We have also tuned ONNX Runtime to work well on hybrid core architectures. Hybrid core architecture is an interesting concept: Intel came up with a heterogeneous processor design where you have performance cores as well as efficient cores. A performance core, as the name suggests, is good for performance, but it draws more power than an efficient core. An efficient core delivers less performance, but it's optimal when you need to run applications that are not high priority: you can put them in the background, run them at low power, and save battery life as well. Intel has optimized ONNX Runtime to take advantage of this hybrid core architecture, so that when you deploy a workload there's a balance between how much runs on the performance cores and the efficient cores; you're not using too much power, but the application stays responsive and quick for the user. On the GPU side, Intel has fine-tuned DirectML so that you get higher throughput when running your models through it. And the interesting part, our latest work actually, is on the Intel NPU, the neural processing unit: in the upcoming Intel Core Ultra, we are able to run DirectML through the NPU. The NPU is a high-performance, low-power accelerator, so you can run some of your models using less power than you would on the GPU, while still seeing the performance benefits an Intel NPU provides.

As a summary: ONNX is very flexible. It's an open-source format, and the community has a great deal of interest in it; there are always new operators and new optimizations coming in. As you've seen, Krishna talked about a lot of new models that have been pushed into the ONNX Model Zoo, which makes it easier for developers to just pick a model, plug it into their application, and see the performance benefits. ONNX is still growing, it's always growing; there are new optimizations and new things coming in, and new work being done every day by the different SIGs. So yeah, do check out ONNX; it's very interesting, there's lots of documentation and lots of material out there. And if you have any questions, anything you want to talk about or follow up on, feel free to reach out to me or Krishna; we'd be happy to chat. Awesome, thank you everyone.