Good morning everyone. I'm excited to be here. I'm Chen Wang from IBM Research. I'm a research scientist working on cloud-native AI platforms, and I have been working on Kubernetes for more than five years. This is Abhishek. Hello folks, I'm a senior software engineer here at IBM, and excited to be here. So today we are going to share some of our experience and best practices on LLM serving. We all know that large language models have been attracting a lot of attention, and serving LLMs can help a lot in modernizing existing business use cases and applications. However, serving large language models is very expensive. First, it needs to run on high-end GPU accelerators such as the A100 or even the H100, and the sequential nature of large language model inference makes the processing time very long. On a single A100 accelerator, you can process fewer than one inference request per second. So if you are thinking about production or business use cases, you may have tons of inference requests going through, which means you need tons of GPUs, and that is obviously very costly.

The diagram shows a simple toy example of how the model performs inference iteratively. It generates output token by token, and every time it generates a new token, which is roughly a word, it needs to cache all the previous tokens in the KV cache. This sequential nature consumes a lot of resources and makes inference very slow. That's why, for production use cases, we really want techniques that improve the throughput and performance of inference. There are two popular techniques in the open-source and wider academic community for improving serving: one is called continuous batching, and the other is paged attention.

For the batching part, continuous batching is derived from static batching, which uses more memory to batch more requests together so the GPU is used more efficiently. Continuous batching continuously tracks how many requests are coming in and keeps slotting new candidate requests into the batch as earlier ones finish, so you maximize your memory utilization as well as your GPU utilization. The paged attention kernel technique is similar in spirit: it maps the logical blocks of the KV cache, which are needed to generate the next token, to physical KV cache blocks, so you get more efficient memory utilization and reduce fragmentation in memory allocation.
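To make the continuous batching idea more concrete, here is a minimal, hypothetical Python sketch of a scheduler loop. The `Request` class, the `generate_next_token` placeholder, and the batch size are illustrative and do not come from any specific serving engine; the point is only that finished requests leave the batch and queued requests join it after every decode step, instead of waiting for a whole static batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_tokens: int
    tokens: list = field(default_factory=list)  # tokens generated so far


def generate_next_token(request: Request) -> str:
    """Placeholder for one decode step of the model (illustrative only)."""
    return f"<tok{len(request.tokens)}>"


def continuous_batching_loop(waiting: deque, max_batch_size: int = 8) -> None:
    """Toy continuous-batching loop: refill the running batch after every
    decode step so GPU and memory slots never sit idle waiting for the
    slowest request in a static batch to finish."""
    running = []
    while waiting or running:
        # Pull queued requests into the batch up to capacity.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for every request currently in the batch.
        for req in running:
            req.tokens.append(generate_next_token(req))
        # Retire finished requests so their slots free up immediately.
        running = [r for r in running if len(r.tokens) < r.max_tokens]


queue = deque([Request("San Francisco is", 4), Request("Kubernetes is", 8)])
continuous_batching_loop(queue)
```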
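And a tiny sketch of the paged attention idea described above: logical KV-cache blocks of each sequence are mapped to scattered physical blocks through a block table, so memory grows in small pages rather than one big contiguous reservation. The block size and pool size are made-up numbers, and real kernels of course do this on the GPU rather than in Python.

```python
BLOCK_SIZE = 4              # tokens per KV-cache block (illustrative)
NUM_PHYSICAL_BLOCKS = 16    # size of the shared physical block pool (illustrative)

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))  # physical blocks not yet assigned
block_table = {}  # sequence id -> list of physical block ids (the "page table")


def append_token(seq_id: int, token_position: int) -> None:
    """Allocate a new physical block only when the current logical block is
    full, so each sequence consumes memory one small page at a time."""
    table = block_table.setdefault(seq_id, [])
    if token_position % BLOCK_SIZE == 0:     # crossed into a new logical block
        table.append(free_blocks.pop())      # grab any free physical block
    logical_block = token_position // BLOCK_SIZE
    physical_block = table[logical_block]
    offset = token_position % BLOCK_SIZE
    # (real code writes the key/value tensors to (physical_block, offset) here)


# Two sequences growing token by token share the same physical pool.
for pos in range(6):
    append_token(seq_id=0, token_position=pos)
    append_token(seq_id=1, token_position=pos)
print(block_table)  # e.g. {0: [15, 13], 1: [14, 12]}
```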
So in our case, we have research clusters, and we want to serve a lot of models on GPUs for the users in our research lab. We found that when we serve a wide range of models, some models are very popular while others may sit idle for a long time, yet researchers still want to use them. For example, if we want to serve 50 models and 30 of them are unpopular but still necessary to serve, and we dedicate one GPU to each of those unpopular models, we end up with a very long tail of underutilized GPUs. So how can we solve this problem? We started thinking about packing more of the unpopular models onto a smaller number of GPUs.

The GPU-sharing techniques available today are time-slicing, MPS, and MIG. Because of continuous batching and the memory optimizations from the paged attention and flash attention kernels in these inference servers, if you limit the memory allocation you cannot use the compute efficiently, and since MPS and time-slicing rely on dynamic sharing of the HBM space, that unpredictable memory allocation can easily lead to exceptions when bursty requests come in. So we first tried MIG partitioning, which statically partitions the memory space.

We did some simple benchmarking experiments, serving the model on MIG partitions of varying sizes. If we set the per-token generation latency target to 50 milliseconds per token, we found that when the load is low enough, say fewer than 32 concurrent users sending requests, we can meet that latency quite well using smaller MIG slices, such as a 4g profile. However, in practice we found that if we use the default NVIDIA GPU operator to enable MIG, every time we want to reconfigure the MIG partitions we need to evict all the workloads on the GPUs on that server. But the optimal MIG partition we need for serving a model really changes over time with the load, so we want a dynamic way to create MIG partitions. That's why Abhishek will talk more about how we use DRA.
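As a rough illustration of the kind of benchmark described above, here is a hedged Python sketch that measures average per-token latency at a given concurrency and compares it with the 50 ms/token target. The endpoint URL, payload fields, model name, and the simple token accounting are assumptions for the sake of the example, not the exact harness we used.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed OpenAI-style serving endpoint
SLO_MS_PER_TOKEN = 50  # per-token latency target from the experiment


def one_request(prompt: str, max_tokens: int = 64) -> float:
    """Send one completion request and return latency per generated token (ms)."""
    payload = {"model": "served-model", "prompt": prompt, "max_tokens": max_tokens}
    start = time.time()
    resp = requests.post(ENDPOINT, json=payload, timeout=300)
    resp.raise_for_status()
    elapsed_ms = (time.time() - start) * 1000
    # Rough approximation: assume the server generated max_tokens tokens.
    return elapsed_ms / max_tokens


def run_benchmark(concurrent_users: int) -> float:
    """Fire `concurrent_users` requests at once and return the mean ms/token."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(one_request, ["San Francisco is"] * concurrent_users))
    return sum(latencies) / len(latencies)


for users in (1, 8, 16, 32, 64):
    ms_per_token = run_benchmark(users)
    verdict = "meets" if ms_per_token <= SLO_MS_PER_TOKEN else "misses"
    print(f"{users:>3} users: {ms_per_token:6.1f} ms/token ({verdict} the 50 ms target)")
```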
Thank you, Chen, for walking us through the importance of using MIG slices for inference workloads. Let's quickly dive into DRA. DRA stands for Dynamic Resource Allocation, and it provides two new APIs, ResourceClaim and ResourceClass, to request GPU resources. DRA is a broad umbrella, but it solves a very important use case for us, which is the ability to create MIG slices incrementally on the GPUs. As we see, in the DRA world there is quite a lot of setup needed to enable GPU sharing. On the right-hand side of the screen, in the middle, we see the different claim parameters, such as GPU claim parameters and MIG device claim parameters, that need to be set up, and those claim parameters are then referenced from the OPT model workload that we see over here.

Let's quickly dive into the demo. What we do here is submit the same workload that we just saw on the previous slide. As we submit this workload, we see a few resources being created; the notable ones are the resource claims. Once the desired resources are created, the container comes up, and inside the container we have the vLLM server. We wait for some time for the vLLM server to come up and enable port forwarding so we can interact with it. Now we send a prompt, a sentence-completion request, to the model saying "San Francisco is", and finally we get a response from the OPT model workload that it's a great place to live in. Thank you for watching this demo. If you want to learn a bit more about DRA, we have another full talk here on Thursday. Questions?

Thank you. Okay, we do have the QR code for this talk and also for the other tutorial on how we deploy the vLLM server using DRA from the previous slides, and the demo link is also available.
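For reference, a sentence-completion request like the one in the demo could be reproduced with a small Python snippet once the port-forward to the vLLM server is in place. The port, route, and model name below are assumptions based on a typical vLLM deployment with the OpenAI-compatible API server, not the exact commands from the demo.

```python
import requests

# Assumes `kubectl port-forward` has exposed the vLLM server on localhost:8000
# and that it was launched with the OpenAI-compatible API server.
URL = "http://localhost:8000/v1/completions"

payload = {
    "model": "facebook/opt-125m",   # illustrative OPT checkpoint name
    "prompt": "San Francisco is",
    "max_tokens": 16,
    "temperature": 0.7,
}

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])  # e.g. " a great place to live in ..."
```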