Hi, I'm Abhishek, and here are my co-speakers, Shama and Sangoon. Today we are going to talk about large language model fine-tuning and inferencing using ONNX Runtime. So let's get started. The agenda for today starts with some code execution: we want to kick off a live demo now, so that by the time we complete the presentation the code has finished and we can show you the results. The crux of the presentation addresses the scenario of training a large language model with existing technologies and environments, so that you can easily take any large language model and improve its training and inferencing time. The technologies we are going to use are ONNX Runtime, which helps you improve both training and inference time, along with DeepSpeed and LoRA for model training. The environment we are going to use is the Azure Container for PyTorch, and the large language model we are using today is the Mistral 7B model. Everything put together runs on Azure ML. Those are all the technologies we will be using today. After introducing them, I'll walk you through the code and the steps we have added, which you can reuse for your own use case to improve training as well as inference time. Then we'll do some performance comparisons against PyTorch and summarize what we did today. Finally, we'll do a UI walkthrough so that you can fine-tune an existing large language model for your own scenario. Just to point out: if there are any questions, feel free to interrupt me at any time. Okay, so starting off, I'm launching two scripts: the first trains the Mistral model and the second runs an inference benchmark. Here I am just submitting the command for training. To answer the question about where to find this: yes, I have provided the links on the slides. If you click on them, they take you to the exact page where we have added the instructions. We updated the slides today, but it was present on the previous deck as well; if you go to the page, you will be able to see the slides and the LLM file there. Going back to the code: I'm just submitting a script right now, and this step is not that important to follow, since you don't have access to the workspace; I'll go through the code later, zoomed in a bit more. I just want to show that we actually ran the code. So this is an example of running PyTorch with the Mistral model; the job has been submitted and is currently queued. This next one is the ORT job, and by the time we complete the presentation, hopefully it will have executed. Now I'm submitting the job for the inference session, which will benchmark the runtime for PyTorch eager mode as well as ONNX Runtime. I'll go through the code later; for now we have just submitted the runs. So this is the inference session we submitted, and the first two were the training jobs: one for PyTorch,
and one training with ONNX Runtime, using DeepSpeed ZeRO Stage 2 as well as LoRA. Now we'll get back to the presentation, and we will do a walkthrough of the code afterwards. Okay, so starting with ONNX Runtime. What is ONNX Runtime? It's a cross-platform machine learning model accelerator with a flexible interface to integrate hardware-specific libraries. It can be used with various model frameworks, including PyTorch, TensorFlow, TFLite, scikit-learn, and many more. You can train your model in Python and deploy it with whatever language you are using for your use case, for example C#, Java, JavaScript, and many more. To use ONNX Runtime, you just need to convert your model to the ONNX format, which only requires a few lines of code. This is the link to the ONNX Runtime page, where you can get more information. To give you an example of what ONNX Runtime does, let's take a very simple case in which we create a feed-forward network: we have an input X which we reshape to take a dimension out, then we apply a feed-forward layer with a ReLU on top of it, and a softmax at the end. The graph representation is what you see on the left of the slide. Now focus on the part on the right: ONNX Runtime optimizes the graph from left to right, and what it has done here is take that section of the graph and collapse it into a single node through fusion, so that at runtime your code executes faster. That is the crux of the technology. We also integrate optimized kernels to speed up your training as well as inference. This is just a simple example, but the optimizations can also include fusing many layers into a single fused node, and so on. That is what ONNX Runtime does for optimization.
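To make the conversion step concrete, here is a minimal sketch of the feed-forward example just described, exported with torch.onnx.export. The layer sizes, tensor names, and file name are illustrative, not the exact demo code; the point is that the export really is just a few lines, and ONNX Runtime applies fusions like the one on the slide when it loads the resulting graph.

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        def __init__(self, in_features=128, hidden=256, num_classes=10):
            super().__init__()
            self.fc1 = nn.Linear(in_features, hidden)
            self.fc2 = nn.Linear(hidden, num_classes)

        def forward(self, x):
            x = x.reshape(x.shape[0], -1)              # reshape: take a dimension out
            x = torch.relu(self.fc1(x))                # feed-forward layer + ReLU
            return torch.softmax(self.fc2(x), dim=-1)  # softmax at the end

    model = FeedForward().eval()
    dummy = torch.randn(1, 1, 128)  # the extra dimension is removed by the reshape
    torch.onnx.export(model, dummy, "feedforward.onnx",
                      input_names=["x"], output_names=["probs"])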
There are two parts to ONNX Runtime. The first is inference: ONNX Runtime Inference is a high-performance engine for deploying ONNX models to production. It is optimized for both cloud and edge use cases, and it works seamlessly on Linux, Windows, Mac, mobile, and the web. It is written in C++, but it has APIs for many languages, like C, Python, and C#, and it can be deployed in a variety of environments. It integrates well with various hardware accelerators, including CUDA on NVIDIA GPUs, OpenVINO on Intel processors, ROCm on AMD GPUs, DirectML on Windows, and many more, so you can use ONNX Runtime Inference for whatever use case you have. These are some results we have generated for the ORT inference scenario, comparing TorchScript, torch.compile, and ONNX Runtime. For the first few models you see here, orange is TorchScript, "Torch 2" is torch.compile, and the last bar is ONNX Runtime; as you can see, in all cases we get better inference performance for these models. Some of these are language models and some are vision models. For the last four models, where we compare Llama 2 and Mistral, orange is Torch eager mode, yellow is torch.compile, and blue is ONNX Runtime; yes, the last four are the large language models. So those are the comparisons for ONNX Runtime Inference. The second half is ONNX Runtime Training, which is comparatively newer than inference. It lets you train large models while accelerating training and reducing compute cost, which is very important nowadays given the increasing cost per GPU. It requires only a few lines of code, which we are going to cover later. ONNX Runtime Training supports CUDA and ROCm acceleration, and it can also be used for on-device training. It is built on the highly successful and proven technologies of ONNX Runtime and the ONNX format. The benefits of ONNX Runtime Training are faster training through optimized kernels, with up to 2x speedup on state-of-the-art models, and the ability to train larger models, because memory optimizations are applied to your model. For example, you are able to train GPT-2 on a 16 GB GPU where it runs out of memory with stock PyTorch, so you use your GPU memory more efficiently. And since it is part of the PyTorch ecosystem, it doesn't require a lot of setup: it is available via the torch-ort Python package, and it also ships in the Azure Container for PyTorch, which we are going to talk about later. These are some results for ONNX Runtime Training. The first two models, Llama 2 and Mistral, were generated using DeepSpeed ZeRO Stage 2 (I'm going to talk about the DeepSpeed stages very soon), and the rest of the models compare PyTorch with ONNX Runtime. In all cases you can see a good amount of performance improvement, and again these cover language as well as vision models. If there are no more questions, I can start with DeepSpeed. DeepSpeed is open-source deep learning optimization software. Its ZeRO optimizer has different stages for partitioning your optimizer states, gradients, and parameters so that you can use your GPU memory efficiently: Stage 1 partitions optimizer states, Stage 2 adds gradient partitioning on top of Stage 1, and Stage 3 adds parameter partitioning on top of Stage 2. Because those states are partitioned and shared across your different GPUs, you are able to train larger and larger models, with billions and even trillions of parameters, achieve excellent system throughput, and scale efficiently to thousands of GPUs. You can also train your models on very resource-constrained GPU systems, because the parameters are partitioned across devices. So with DeepSpeed you get really good speedups and the cost reduces significantly, and a lot of recent large language models are trained using the different DeepSpeed stages nowadays.
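Since the stages are easier to see in a config than in prose, here is a minimal sketch of a ZeRO Stage 2 configuration, written as a Python dict that can be passed to the Hugging Face TrainingArguments(deepspeed=...) integration. All values are illustrative, not the exact config used in the demo; changing "stage" to 3 would additionally partition the parameters.

    from transformers import TrainingArguments

    ds_config = {
        "zero_optimization": {
            "stage": 2,              # optimizer states (stage 1) + gradients (stage 2)
            "overlap_comm": True,    # overlap gradient reduction with the backward pass
            "contiguous_gradients": True,
        },
        # "auto" lets the Hugging Face integration fill these in from TrainingArguments
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
    }
    args = TrainingArguments(output_dir="out", deepspeed=ds_config)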
So next, we are going to talk about LoRA. LoRA stands for low-rank adaptation. It's a slightly newer technique: a fine-tuning method that you apply on top of an already-trained model. It significantly reduces the number of trainable parameters; you don't have to train the entire model, which takes a lot of memory. Instead, you freeze the base model and retrain only a small portion of it to make it usable for the different use cases you have. The benefits of LoRA are faster training, because the number of trainable parameters is reduced significantly, which also leads to lower memory requirements. You get similar inference time, because essentially you are not changing the total number of parameters. And you get easier task switching: the base model remains the same, and each fine-tuned artifact only contains the retrained adapter parameters for its task. So you don't need full replicas of a huge model for every task, which greatly reduces the model size for task switching. The next thing we are going to talk about is the Azure Container for PyTorch (ACPT), a container developed by Microsoft that is optimized for large distributed deep learning workloads. It is pre-packaged with some of the best Microsoft technologies for training acceleration, and it is primarily released for training PyTorch as well as ONNX Runtime Training workloads. It provides an optimized training framework, so you can develop and accelerate your PyTorch models on large workloads. Since we release containers on a regular cadence, you get an up-to-date stack with the latest compatible versions of Python, PyTorch, CUDA, and so on. It is very easy to use, since you are taking an already-validated environment that has been tested on a number of workloads. It also includes a lot of training optimization technologies: ONNX Runtime, which we talked about, ORT mixture of experts, DeepSpeed, Nebula checkpointing, MSCCL for communication between GPUs, and so on. The benefit is that you don't have to install these libraries and integrate them with your environment yourself; everything comes tested, and it integrates with Azure, so you can pull these images directly into Azure ML and use them. I'll show you an example of using ACPT in Azure later. A lot of customers, 1P as well as 3P, are using ACPT for their use cases. One example: a small company, Fashable, used ACPT to train their models across several nodes and got their results efficiently. They had been having a lot of problems setting up their environment, so this was really helpful for them and a nice win for ACPT. And the large language model we are going to talk about today is Mistral. Mistral 7B v0.1 is a recent pre-trained generative text transformer model; it uses grouped-query attention for faster inference, and sliding-window attention to handle a longer context. It's just the example we are using today: ONNX Runtime supports a lot of other models, including Llama 2 7B and 13B for training, Microsoft's Phi model, which is another large language model, several vision models we saw earlier like Google's ViT, and Stable Diffusion and the Falcon 7B model as well. To put this all together, I'll have Shama explain all the details here; she will put it much better than me. This next slide brings together all the different technologies we've just learned about.
We've learned about ONNX Runtime, we've learned about DeepSpeed, then LoRA, and ACPT, the Docker image that brings all of this together and makes it easily available to developers. Azure ML takes all this goodness and essentially provides a one-stop solution for you to do both model development, meaning fine-tuning, and inference, meaning deployment. If we start from the bottom, we have access to hardware: different NVIDIA SKUs, and we also support AMD. Moving up a level, to access these different SKUs you can launch training in different ways: through the SDK, through the CLI, or through the UI, which is Azure ML Studio. Moving up another level, as I said, the Azure Container for PyTorch package lets you use something that's already validated, already tested, and packaged with all the needed technologies, so that your fine-tuning is optimized, your training time is reduced, and you get your fine-tuned results faster and more efficiently. One step further up, the user code can be scripts, jobs, compute targets, all of those; it supports a variety of user code. And it supports a variety of models: of course the traditional transformer-based models, CNNs, and RNNs, but also LLMs, both OpenAI models and the open-source LLMs. This entire stack I just described is for training, for fine-tuning. Once that is done, you can deploy the resulting model for inference, or export it in the ONNX format and take it to a mobile device, or convert it to any other format that makes sense for you. So this is the end-to-end model development story that Azure ML supports as a solution. And with that, I'll hand it back to Abhishek for the code walkthrough. Yes, go ahead with the question first. Okay, so what I know about their case is that they were trying to train a model on a larger dataset with a larger model, but they were not able to parallelize it across nodes. So instead of setting up their own environment and debugging all the issues, they used the ACPT image, which contains the latest version of PyTorch, the CUDA version we want to use, and other packages; and if there are packages you need that are not installed, you can install them on top of it. Once you have the setup in Azure ML, or any other place you want to use it, you can train on multiple nodes, and that's how I believe they used the ACPT environment. I'm not sure what their exact model was, but it could be anything: a vision model, a language model, a multi-modal model. And yes, the nodes are connected with InfiniBand. Azure specifics I'll talk about later, yes. Okay, so let me start with the code walkthrough, going back to what we launched at the start. I'll begin with the ONNX Runtime training job. If you open your browser and search for "onnxruntime-training-examples", go to the onnxruntime-training-examples repository.
If you want to follow along, I can go through this slowly: just search for "onnxruntime-training-examples" and select the first repository that comes up. In it, we have added a Mistral fine-tuning code folder, and I'll go through the readme file, which explains what I did earlier: how to set up your Azure ML account and the required workspace to run the job. For today, we are using the Azure Container for PyTorch, which has all the accelerators and technologies we talked about, and the Mistral 7B model. To set up Azure ML, you need an account on Azure ML; once that's set up, you log in with the Azure CLI (az login) to ensure you are connected to the cloud, and then you need a JSON file describing the environment you are using: which workspace, what kind of GPUs it has. It's a small JSON file you need in order to run this code, and whatever JSON file you get, you rename it to ws_config.json. (There is another way to run this code, directly on a machine of your own; I have provided that below and will go through it later.) Once that's done, you can launch the demo I started at the beginning by running python aml_submit.py, and that starts the job. This job builds on top of the training environment we use, so let me quickly go through that environment. This is the Dockerfile we have for today. As you can see, we are using the ACPT image: it has Ubuntu 20.04, CUDA 11.8, Python 3.10, and Torch 2.1.1, which is the latest Torch release. It already comes with ONNX Runtime Training, but our model required a nightly build, so I had to uninstall onnxruntime-training, install the nightly version, and configure it. We'll be releasing ONNX Runtime Training 1.17.0 soon, in January, which will have all the optimizations for the Llama and Mistral models. Then it uses a requirements.txt file, because we don't want our images to contain everything; that would make them too bloated and slow for every use case. We keep the most important packages in the image, and you add whatever you want on top. In this case, the requirements.txt installs the evaluate packages, azureml-core, which is required for integration with Azure ML, and other packages like datasets, transformers, Optimum, and so on; those versions are installed on top of the ACPT image. That's how we create this environment. As you can see, by taking the image and installing on top of it, you just need to adapt it to your use case, and it doesn't take much time to figure out the package requirements, since the most important things, the CUDA version, Python version, and Torch version, are already set up for you. Once you have the environment, you submit with python aml_submit.py, and it generates the two run URLs we looked at earlier; those runs have already completed.
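For readers who want a rough idea of what a submission script like aml_submit.py does, here is a hedged sketch using the azureml-core SDK that the requirements file pulls in. The workspace config path, environment name, experiment name, and cluster name are all assumptions for illustration; the actual script in the repository may be structured differently.

    from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

    # Workspace details come from the small JSON file described above.
    ws = Workspace.from_config(path="ws_config.json")

    # Build the environment from the ACPT-based Dockerfile just discussed.
    env = Environment.from_dockerfile(name="acpt-mistral", dockerfile="Dockerfile")

    config = ScriptRunConfig(
        source_directory=".",
        command=["deepspeed", "run_clm.py",
                 "--model_name_or_path", "mistralai/Mistral-7B-v0.1"],
        compute_target="v100-cluster",   # hypothetical 8x V100 cluster name
        environment=env,
    )
    run = Experiment(ws, "mistral-ort-finetune").submit(config)
    print(run.get_portal_url())          # the run URL shown during the demo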
I'll come back to those runs later, during the performance comparison stage. If you don't have an Azure ML account, or you are already logged into your own compute, you can use the other way, which is running the code directly on the machine you have. You just need to go to the fine-tune CLM folder, which has two files: run_clm.py and the ZeRO Stage 2 config. The ZeRO Stage 2 file is the DeepSpeed configuration, which specifies which ZeRO stage you want for your optimization plus a few more parameters to tune for your use case. Before I go through run_clm.py, one thing is still left: to run the PyTorch model, you just run this command. It specifies that you are using the GPUs, which script to run, the model name (Mistral 7B v0.1), the dataset name you are using, the training batch size, and so on; the parameters you need to specify. To run ORT, you export a couple of environment variables. APPLY_ORT ensures that ONNX Runtime is applied, and there is another environment variable for the ORTModule fallback policy. By default, if your model cannot be converted to ONNX Runtime for any reason, say an import error, it falls back to PyTorch, and your code runs normally as it would on stock PyTorch: you might not get the optimization benefits, but your code still runs. Once you disable the fallback, it either runs on ONNX Runtime or it fails. For this demo we want to ensure it runs on ONNX Runtime, so we have disabled the fallback. The command itself remains the same for this case as well. Now let me quickly go through run_clm.py, which is the crux of all the code here. It's a script taken from the Hugging Face Transformers library, and only a very small number of changes are needed to make it work well with ONNX Runtime. You need to add a couple of things: the APPLY_ORT environment-variable flag, and a change to the argument parser. Instead of the default TrainingArguments parser, you pass it ORTTrainingArguments. The other change is the trainer: by default you use the Transformers Trainer, but to use ONNX Runtime you use ORTTrainer, and that's the only change needed. So it's less than 10 lines of code added to your code base, and you can get more than 10% throughput improvement compared to PyTorch. The other thing I talked about today was LoRA, and this is the code added for LoRA. Without it, the entire model is fine-tuned by default. We are using the PEFT package from Hugging Face, and you need to specify the target modules; those are the modules that get fine-tuned. Since the entire model is not fine-tuned, only the modules you specify, your trainable parameter count decreases significantly. That's mostly the crux of what we are doing here; a sketch of these changes follows.
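Putting the two ORT changes and the LoRA change together, here is a minimal sketch of what the modified training path looks like. It follows the public optimum.onnxruntime and peft APIs, but the dataset handling, the target modules, and the hyperparameters are illustrative; run_clm.py in the repository does this more carefully, and in the demo the script is launched with the DeepSpeed launcher with APPLY_ORT and the ORTModule fallback policy exported in the environment.

    import torch
    from datasets import load_dataset
    from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # LoRA: freeze the base model and train only low-rank adapters on the
    # attention projections (the target_modules list is illustrative).
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    model = get_peft_model(model, lora)

    # Tiny illustrative dataset; run_clm.py does proper chunking and grouping.
    def tokenize(batch):
        out = tokenizer(batch["text"], truncation=True,
                        padding="max_length", max_length=512)
        out["labels"] = [ids.copy() for ids in out["input_ids"]]
        return out

    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
    ds = ds.filter(lambda x: len(x["text"]) > 0)
    ds = ds.map(tokenize, batched=True, remove_columns=["text"])

    # The two ORT changes: ORTTrainingArguments and ORTTrainer replace the
    # stock Hugging Face TrainingArguments and Trainer.
    args = ORTTrainingArguments(
        output_dir="mistral-lora-ort",
        max_steps=500,
        per_device_train_batch_size=1,
        # deepspeed="zero_stage_2.json",  # enabled in the demo via the deepspeed launcher
    )
    trainer = ORTTrainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer)
    trainer.train()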
If there are no questions on the training side, we'll switch to the inference code example. Let's go to the onnxruntime-inference-examples repository; you can search for it and select the first result. In it, go to the python folder, then the models folder, and today we are looking at Mistral, so open the mistral folder. I have provided similar details there for benchmarking against Torch eager as well as torch.compile. The setup process is the same as I described for the training job; what we run here is the aml_submit_mistral_inference.py file, and it generates a single URL that contains the complete output for PyTorch as well as ONNX Runtime. And if you want to run it directly on your own compute, you can just go to the folder and run the bash script in the repository. The environment code for this case looks pretty similar, with fewer requirements: we install the nightly version of PyTorch, because we want the latest; instead of ONNX Runtime Training we install ort-nightly-gpu; and then Transformers, Optimum, and one more package. That's the only change needed here. The inference job itself basically clones the ONNX Runtime repository and then runs the benchmark. Here I want to point something out: there are two extra steps before you can run your ONNX Runtime code. The first is converting your model to the ONNX format; whatever PyTorch or other model you have, you convert it to ONNX. The second is optimizing that converted model; one example of such an optimization is the graph fusion we saw at the start of the session. So this step runs the optimization on the ONNX model, and after that you can run your benchmark for ONNX Runtime. Then this step runs Torch eager mode, and here you can specify what kind of benchmark you want, PyTorch eager or something else, as well as the batch sizes and the sequence lengths you want to test. Finally, there is another one for torch.compile. I disabled that for today's session, because it takes approximately 30 minutes to run: for each batch size and sequence length the model is recompiled, and that's why the full benchmarking script takes so long. But I have the torch.compile numbers generated earlier, which we'll look at later; it works fine otherwise, so you can try it on your own time. Now let me quickly go through the benchmarking script. When the ORT option is enabled, the converted ONNX model is used to create an ORT inference session, and if you're running ORT, the run-ORT-inference function prepares all the inputs for your use case. The difference between PyTorch and ORT here is that PyTorch takes Torch tensors as its input data, whereas ORT takes NumPy arrays or IO bindings; so the script creates the inputs with IO bindings and then runs the same evaluation code for both PyTorch and ORT.
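Here is a minimal sketch of the IO-binding difference just described. The model file name and the "input_ids"/"logits" tensor names are assumptions for illustration; the real names come from the conversion step, and the benchmark script in the repository handles many more inputs (attention mask, past key/values, and so on).

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("mistral_optimized.onnx",
                                providers=["CUDAExecutionProvider"])

    ids_np = np.ones((1, 32), dtype=np.int64)   # batch size 1, sequence length 32

    # Plain NumPy feed: inputs are copied host -> device on every call.
    # logits = sess.run(None, {"input_ids": ids_np})[0]

    # IO binding keeps tensors on the GPU and avoids the per-call copies.
    binding = sess.io_binding()
    ids_gpu = ort.OrtValue.ortvalue_from_numpy(ids_np, "cuda", 0)
    binding.bind_ortvalue_input("input_ids", ids_gpu)
    binding.bind_output("logits", "cuda")        # let ORT allocate the output on GPU
    sess.run_with_iobinding(binding)
    logits = binding.copy_outputs_to_cpu()[0]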
So that's pretty much what we are doing on the inference side. I've provided links to all the code we have here, so you can go through it on your own time, but if you have any questions, please ask them now. The question is about CUDA_VISIBLE_DEVICES: yes, that tells the script to use only GPU 0 for inference. For training, we used eight V100 GPUs for this use case, and for inference we used a single V100; these are the 32 GB variants of the V100. So let me show you the runs we did today. The first is the Mistral 7B PyTorch causal-language-model run using LoRA plus DeepSpeed ZeRO Stage 2. We submitted this job at 4:13; it took about seven minutes for compute allocation, and the run itself took 16 minutes. If you look at the output logs, this is what you see: it prints the iterations per second observed during the run, approximately 1.52 iterations per second for this training job, using the eight V100 32 GB GPUs, as I mentioned. Similarly, the ONNX Runtime run, also causal language modeling plus DeepSpeed ZeRO Stage 2 and LoRA, took 17 minutes overall; I'll explain in a minute why ONNX Runtime took a minute longer than PyTorch's 16. But the iterations per second you get here are 1.76, so the throughput has increased from 1.52 to 1.76; that's roughly a 16% improvement in performance. These next numbers we computed before: with ONNX Runtime we have seen 1.79 iterations per second versus 1.54 for PyTorch, approximately a 16% throughput increase when using ONNX Runtime. Note that we ran only 500 iterations today, which is mostly not enough for a real fine-tuning job; when you train for longer, you see more improvement in the overall runtime. And benchmarked on A100 with two GPUs, we see a similar increase in throughput: 3.09 iterations per second on PyTorch versus 3.59 with ORT training. The reason we saw a higher wall-clock time for ONNX Runtime is that ORT has an overhead: it needs to convert your model from PyTorch to ONNX and run some optimizations on top of it, which takes time. But it's a one-time cost, not recurring, so as training time grows, the overhead becomes insignificant and you soon see the benefit of the higher throughput. For example, I ran the same code with 10,000 iterations, and you can reproduce this on your own: with PyTorch the entire training took approximately 6,600 seconds, and with ORT training roughly 5,800 seconds, and both of those are less than two hours of training on V100s. A lot of real use cases train much longer than that, and even for this run we realized a wall-clock time gain of about 12.5%.
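As a quick sanity check on how the one-time overhead amortizes, the arithmetic below reproduces the two percentages from the quoted numbers (the 5,775-second figure is an approximation consistent with the quoted 12.5% gain):

    # Steady-state throughput gain from iterations/second (8x V100 32 GB):
    pytorch_its, ort_its = 1.54, 1.79
    print(f"throughput gain: {(ort_its / pytorch_its - 1) * 100:.1f}%")   # ~16%

    # Wall-clock gain over 10,000 steps, including the one-time ORT
    # export-and-optimize overhead at the start of the run:
    pytorch_sec, ort_sec = 6600, 5775
    print(f"wall-clock gain: {(1 - ort_sec / pytorch_sec) * 100:.1f}%")   # ~12.5%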
So if you run it for longer, you will see even better improvements in overall time, closer to the 16% throughput gain. That's the takeaway on the training side. Then for inferencing, let me go through the run we did today. Okay, hopefully this one has finished. Great. We started the run at 4:15, it took about nine minutes to get submitted, and the whole job took approximately 28 minutes to complete. The two longest steps in it were converting the model to ONNX and then optimizing the graph, which takes about 15 to 20 minutes; then the benchmark scripts for eager mode as well as for ONNX Runtime take approximately three to four minutes each. There are a lot of numbers here, but I can tell you what is being printed. What we ran today was batch sizes of one and two, with sequence lengths from 32 to 512. For each combination, say batch size one and sequence length 32, it prints the latency of the step that builds the past key/values, and the latency of generating the next token. In LLMs like the Mistral model, there are two phases: prompt processing and token generation. Prompt processing takes whatever context you have and builds the key/values for that sequence, and token generation produces the next token based on that input. So in the first step, it took 0.0248 seconds to process the prompt and 0.030 seconds to generate the next token, for batch size one and sequence length 32. Those are the ONNX Runtime results, and you will notice that the token generation time stays mostly constant, while the prompt processing time increases significantly as you go to higher sequence lengths. Let's compare one case: we started at about 24 milliseconds of prompt processing for sequence length 32, and at sequence length 512 it increased to 99 milliseconds; but the token generation time stayed mostly flat, going from 30 milliseconds to just 32 milliseconds. Below that are the results for Torch eager mode. Those are the V100 numbers; let me summarize the results that have already completed into a table. These are the numbers we have seen on a single A100 GPU, summarized across sequence lengths from 16 to 2047 for batch sizes one and two. There are four columns: the first is the prompt-processing-time improvement over eager mode; the second is the prompt-processing improvement over torch.compile; the third is the token-generation improvement over eager mode; and the last is the token-generation improvement over torch.compile. Overall, what you can see is that ONNX Runtime does better in all cases; that is one takeaway. The other thing to notice is that at higher sequence lengths PyTorch runs out of memory, so you are not even able to process your dataset. The A100 has 80 GB of GPU memory, and on V100 GPUs PyTorch runs out of memory even sooner, but it works fine with ONNX Runtime. So if you want to use higher sequence lengths, PyTorch is definitely not going to be worth it.
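To make the two phases concrete, here is a minimal sketch of the loop being timed, written with the Hugging Face API against Mistral. The prompt text and the number of generated tokens are illustrative, and the benchmark script itself runs against the exported ONNX model rather than calling the PyTorch model like this.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    prompt_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

    with torch.no_grad():
        # Phase 1, prompt processing: one pass over the whole prompt builds the
        # past key/values, so its cost grows with the sequence length.
        out = model(input_ids=prompt_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)

        # Phase 2, token generation: each step feeds one token plus the cached
        # key/values, so the per-token cost stays roughly constant.
        for _ in range(16):
            out = model(input_ids=next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1:].argmax(-1)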
And you can see we get a good amount of improvement: even against torch.compile, which is faster than Torch eager mode, we see more than 40% improvement in many cases. You can get these improvements just by switching to ONNX Runtime, which requires only a few lines of code change. Are there any questions on the results? Yes, the sequence length can be dynamic. What we are benchmarking here is fixed lengths, say running sequence length 2047 ten times and taking the average runtime, because we want to understand the performance at different sequence lengths; but dynamic sequence lengths definitely work. Another thing I wanted to mention: for both ORT and Torch, we do some warmup runs so that the initial loading time is excluded, and then we run the actual benchmark. And yes, as you go to higher batch sizes you still see gains compared to PyTorch; I don't have all the numbers right now, but we still see some improvement. And for the question about continuous batching, Sangoon can answer that one; thanks, Sangoon. If there are no more questions, I'm going to summarize what we talked about today. We started with the different technologies that can improve your training as well as inference scenarios, giving you better throughput and reducing your compute usage. We went through the environments we provide, how they give you an easier setup and more efficient use of your GPUs, and how they integrate with Azure ML. We also went through an example of LLM fine-tuning and inferencing with the Mistral model. And now Sangoon will do a UI walkthrough of the model catalog. Can you enable the other speaker? Okay, this is working. So far, for training and inference we needed some setup: write the script, and also write the configuration for DeepSpeed or ACPT. But Azure ML actually provides a pretty simple way to fine-tune a model: you just provide your own dataset and the model you want. So I'm going to show a simple demo. If you go to ml.azure.com, you can see the model catalog here. Azure ML provides a lot of models, from OpenAI, Meta, Hugging Face, and Mistral. The Azure ML team has done comprehensive evaluation to find the right training parameters, and all the parameters are predefined in the UI for each model. So let's pick the Mistral model here. It provides evaluate, fine-tune, and deploy options. Go to fine-tune, then select your dataset; pick your training data and click next, and it analyzes your dataset. All of this UI is predefined for every model in Azure ML, so you simply map the fields of your training data and go next. That's it: you select your compute instance and then start the training.
But if you want more control over the training parameters, you can go to the advanced settings, and it will show training parameters like enabling or disabling LoRA and the LoRA parameters, and even enabling ORT or DeepSpeed, these kinds of things. If you are fine with the predefined training parameters, you can simply skip all of that: select the data, click start, and that's it. That was the last part of the presentation. Thank you, everyone. I'll be here to answer any of your questions now, and it would be great if you could provide feedback at the link below. Thank you. Question: if you're converting a model from PyTorch, can you do that on CPU, or is it compute-intensive enough that you have to do it on a GPU? The conversion? There's a toolchain called SparseML that uses the ONNX format; I'm in the process of converting models over to ONNX for sparsification, and those run on CPU. The quantization is a one-time step, but I'm not sure whether a GPU can speed it up. So for training, the question is: if you want to convert a model to int8 or int4, is there a faster way to do it using a GPU or something else? I'm not sure about that. For the Mistral model we ran today, the inference-side conversion from ONNX format to the optimized and quantized format took approximately 10 minutes, and I was running it on an ND40 machine, which has 40 CPU cores. And actually, the weights are stored separately from the model graph, so in principle you can run the conversion on CPU. Load the model without the weights? No, the weight files have to be located in the same place as the ONNX model; the ONNX model holds references to the weights, and that's how it can load both the graph and the weights. If you find issues with anything like that, our repository is pretty active: if you are having any problem with ONNX Runtime or anything else, feel free to report it and we'll take a look. We support CUDA versions back to at least 11.3, though some of the newer features might not be available on older versions; and as Sangoon mentioned, you may need to build ONNX Runtime from source, which might help. Okay, thank you, everyone.