Hello everyone, my name is Eun Kyung Lee. Maybe we can get started, since time is up. We are in a research group where we study sustainability and LLMs; I manage one of the research groups at IBM Research, in hybrid cloud infrastructure.

Hello everyone, I'm very excited to be here. I'm Chen Wang from IBM Research, a staff research scientist working on Kubernetes and cloud native AI platforms, and now on LLMs as well, especially inference. Looking forward to deeper discussions with all of you offline too.

Hi, hello everyone. My name is Bo Wen. I'm also a research scientist from IBM Research, from the digital health group. We use large language models for patient engagement, and today I will show a use case of how we do that, with the help of my colleagues.

Thank you everyone. My name is Huamin Chen. I'm from the Red Hat emerging technology group. My day-to-day work is on sustainability: how to make energy efficiency available for cloud native workloads. I'm very excited about this opportunity to present to you.

Since this is a tutorial session, I want you to leave with some tangible experience, and we share everything we show today on GitHub. You should be able to reproduce it at home, as long as you have NVIDIA GPUs at home. That is our expectation.

We will cover a lot, starting with the CNCF Cloud Native AI working group and the Environmental Sustainability TAG, with a short introduction to both. Then a cloud native LLM overview, cloud native LLMs in action (how we coded it and how we use LLMs on cloud infrastructure), and a real-world user experience, because I want you to see something end to end. Bo will show a really nice application, then we cover cloud native sustainability in general, and finally acknowledgements and takeaways.

We truly believe that sustainability is key, and this belief matters because of everything that is happening. I heard that today's weather is really abnormal in Paris: 73 degrees Fahrenheit, a very high temperature. Over time we are seeing abnormal temperatures more and more. Global warming is happening now, along with flooding and fires, and we want to stop it, or at least mitigate it over time for the next generation.

That is the humanity side. The other side is the company side: ESG, environmental responsibility, is coming to the enterprise. You may already know about ESG requirements and carbon taxes. Starting with some North American and Northern European countries, carbon taxes are in action: if you generate more carbon, you pay more tax. One example is the European energy efficiency directives, which ask for transparent data center energy efficiency.
Previously, data center energy efficiency was reported blindly: "our data centers use this much energy." Now regulators want more transparent reporting: if you use a given service, how much carbon are you using? That requires measurement methodology and verification methodology, so the numbers can be verified.

The other development is the AI Act, which started around 2022. When you train a model, you have to be transparent about how much energy you spent; it has to be reported, along with the model details, to the government.

Everyone knows that overall interest in LLMs and AI is skyrocketing; the chart on the right shows it taking off at some point. A lot of models have come out, GPT-4 among them, and many enterprises, Google, Facebook, even IBM, are developing models, so energy consumption for training in data centers is skyrocketing. You may have heard this story in a keynote: thousands of GPUs are used at the same time to train models like GPT-4 and GPT-3.5. Here is a ballpark number based on one report: measured against average household energy consumption per year, training a GPT-4-class model takes approximately the annual energy of 10,500 households. That is a lot of energy for one model. One disclaimer: that is an estimate, not an actual verified number, and the authors make that clear.

So what should we do? The CNCF Cloud Native AI working group works to review, promote, and educate the cloud native AI ecosystem, and at the same time we try to reduce energy use. The Environmental Sustainability TAG was founded maybe two years back, and we are actively involved in that TAG as well; that is where these discussions happen. It is one of the TAGs in cloud native, alongside App Delivery, Runtime, and Security.

What is the mission statement? Our goal is to advocate for, develop, support, and help evaluate environmental sustainability: we develop models and identify values and possible incentives for services and providers to reduce carbon consumption, energy consumption, and carbon footprint. If you want more information, take a look at the Environmental Sustainability TAG in CNCF; there is a lot of information there.

That was the overview. Since this is a tutorial session, we have the luxury of walking through the steps of a large language model.
The process starts with a lot of preprocessing: data cleaning, transformation, integration, and reduction. You massage a lot of data first, and then that data goes to training: matrix multiplications, forward passes, backward passes, a lot of computation. Once the model is built, you do inference, and you can also fine-tune and reuse the model: take the base model, fine-tune it, and reuse it. That is the basic life cycle, and of course you can add more cycles on top, such as prompt tuning and other kinds of tuning, but these are the big chunks of the LLM life cycle.

You might wonder how much energy each phase consumes. A recent Meta paper addressed this; in the chart at the bottom left, the yellow part is preprocessing, the gray part is training, and the black part is inferencing, the deployment phase: once you use the model, how much energy is consumed. The inferencing part is quite large. You might assume training takes most of the energy, but depending on the life cycle of the model, if you reuse it over time, more of its energy consumption goes to inference. That is what the graph shows.

What is the ecosystem? A lot of hardware vendors are working on this: NVIDIA, Intel, IBM, AMD. They have their own hardware, and large software stacks are built on top. NVIDIA is close to dominating inference and training, but many vendors are working on it. Most of these stacks are deployed in cloud, so they are very close to cloud native, and you may have heard of related cloud native projects at this conference as well.

What about the ecosystem for energy and carbon quantification in cloud? There is the Green Software Foundation, another ecosystem developing tools. For example, depending on the time of day, whenever there is a lot of sunlight, grid carbon intensity decreases significantly, like the blue curve on the left-hand side. If you use energy during that time, you can save a lot of carbon footprint, because your energy is given by the sun.
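To make that idea concrete, here is a minimal sketch of carbon-aware scheduling: defer a flexible batch job to the lowest-carbon hour of a forecast. The forecast values are made up for illustration; a real source would be a grid-intensity API such as the ones the Green Software Foundation tooling wraps.

```python
# Hypothetical sketch: defer a flexible job to the lowest-carbon hour.
forecast = {  # hour of day -> grid carbon intensity, gCO2e per kWh (made up)
    9: 420, 10: 350, 11: 260, 12: 180,  # midday solar pushes intensity down
    13: 170, 14: 210, 15: 300, 16: 390,
}

def best_start_hour(forecast: dict[int, float]) -> int:
    """Pick the hour with the lowest forecast carbon intensity."""
    return min(forecast, key=forecast.get)

if __name__ == "__main__":
    hour = best_start_hour(forecast)
    print(f"Schedule the batch job at {hour}:00 "
          f"({forecast[hour]} gCO2e/kWh)")
```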
There are SDKs that give you that kind of carbon intensity information, and repositories working on it. There is also an interesting project that estimates the carbon footprint of your Python code based on the lines of code and the execution paths. So some tools exist, but we still see a lot of opportunity for cloud native applications. For example, we have Kepler, a project Huamin will introduce later, for measuring power at the pod level: while your pods run, we estimate the power consumption of your application based on performance counter information.

Is it really easy? Of course not. There are a lot of things in the equation. If you only look at energy, then of course not using resources saves energy, but you compromise performance, and our aim is not to compromise too much performance. So what do you have to look at? Many other metrics: throughput, accuracy, latency, carbon footprint, energy of course, and also cost. Saving energy alone is not the whole story; sometimes you sacrifice some performance, but based on your SLAs, your service requirements or service level objectives, you can still save energy. Our aim is: given the SLA, how much energy or carbon footprint can you save? And of course we shouldn't do any greenwashing. Greenwashing is when you claim to save carbon but in fact you don't, so we have to be really transparent about all the reporting.

What kinds of technology make this possible? I tried to list some here. Accurate quantification is really important: when you use resources, you have to be accurate about how much energy you are using. Standardization matters too: if your workload runs in the cloud, you may never see power numbers. You may know what kind of hardware you are using, but at the cloud level you hardly ever see what hardware you are actually running on.
Standardization across cloud infrastructures, agreeing on how power numbers are defined and exposed across different cloud vendors, will be very important. Then comes identifying inefficiencies and optimizing, which leads to the techniques that come later. You have to identify the bubbles: when you do training with pipelining, you want to pack the pipelines so there are no bubbles, meaning idle cycles on your resources. Identify those bubbles, and then apply techniques such as dynamic scaling of resources, which you may have heard about in the keynote speeches, multiplexing, and power capping and frequency scaling to save energy.

Here are some examples as preliminary results. I already mentioned power capping, quantization, and enabling MIG, multi-instance GPUs. The right-hand chart shows two things at once: the bars show energy consumption under different power caps, and the line shows performance as latency, where lower is better. We tried to find a sweet spot. With power capping you can save energy by setting a lower cap: say your GPU can draw 400 watts, but if you cap it at 250 you can save energy. You sacrifice some performance, but we looked for a sweet spot where, even with the cap, performance stays good. That is from a paper we published last year. Multiplexing also gives us a lot of opportunity: once you multiplex the resources, you can save power because you can host many users at the same time.
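Going back to the power-capping knob: the experiment above can be driven with nvidia-smi, and the same limit can be set programmatically. Here is a minimal sketch through the NVML Python bindings (pynvml); it requires root and a driver that permits power-limit changes, and the 250 W value is simply the example from the talk, not a recommendation.

```python
# Sketch: cap GPU power with the NVML Python bindings (pip install nvidia-ml-py).
# Requires root; NVML expresses power limits in milliwatts.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# What range does this GPU allow?
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"power limit range: {min_mw // 1000}-{max_mw // 1000} W")

# Cap at 250 W (the sweet spot explored above) instead of, say, a 400 W default.
pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)

print("current draw:", pynvml.nvmlDeviceGetPowerUsage(handle) / 1000, "W")
pynvml.nvmlShutdown()
```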
All right, I will leave it to Chen now for the more practical, hands-on part. That was the overview.

(Chen Wang) Okay, now the interesting part: let's do some hands-on work going through the basics of LLMs. You all know large language models have been attracting a lot of attention and are revolutionizing our use cases, business use cases, and applications. Behind the scenes, every application needs to call an API for inference; that's why we are discussing large language model serving here. However, serving large language models is very expensive, both cost-wise and energy-wise. They run on high-end accelerators, GPUs like the A100. And consider the sequential nature of how a large language model generates tokens: these architectures are autoregressive, meaning you input some tokens and, based on all your input, the model generates tokens one by one. This sequential nature makes generation time long. On one A100 you can process less than one request per second, and in production you may have hundreds or thousands of applications needing to query the model; you can imagine how many GPUs you would spend on this and how much that costs.

I want to briefly introduce vLLM, an open source framework for production-scale large language model serving, because of two techniques it introduced to optimize cost and make serving fast enough for production. The first is continuous batching. It starts from static batching: you pre-allocate memory to batch the input tokens of multiple requests, so you process those requests in parallel and utilize the GPU compute better. From static batching they moved to dynamic batching, where memory is pre-allocated only for the requests you actually have, sized up to the maximum tokens you will generate across those requests. They also found that if you batch one very long request together with shorter ones, you waste a lot of memory in the long request's chunk. So they proposed continuous batching: newly arriving requests are concatenated into the empty slots of memory, so you fill the memory up and fully utilize it, as well as the compute. That is the first technique.

What about caching each block of tokens? That is the KV cache. The KV cache holds the intermediate results computed for all tokens after each layer, and to compute the next token you need all of those intermediate results cached in your GPU memory; a rough sizing sketch follows.
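To see why this cache dominates memory, here is a back-of-envelope sketch. The sizes assume a Llama-2-7B-like architecture (32 layers, 32 KV heads, head dimension 128, fp16); treat the numbers as illustrative, not exact.

```python
# Rough sketch: KV cache memory for a Llama-2-7B-like model in fp16.
def kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                   seq_len=2048, batch=1, bytes_per_val=2):
    # 2x because both the key and the value are cached per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val

per_token = kv_cache_bytes(seq_len=1)
print(f"~{per_token / 1024:.0f} KiB of KV cache per token")                # ~512 KiB
print(f"~{kv_cache_bytes() / 2**30:.1f} GiB for one 2048-token request")   # ~1.0 GiB
print(f"~{kv_cache_bytes(batch=16) / 2**30:.1f} GiB for a batch of 16")    # ~16 GiB
```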
So inference is not only compute-intensive but also very memory-intensive; that's why we need these high-end accelerators with large high-bandwidth memory. If you pre-allocate memory statically in GPU memory, you have to estimate the maximum length your requests will generate and pre-allocate for that, and you end up with a lot of fragmented, unused memory, because your prediction is never perfect. In practice a request might generate 20 tokens while the maximum possible might be 2048, so you can imagine how many memory segments you waste. That's why vLLM introduced its other key technique, the PagedAttention kernel: a logical memory space is mapped to physical KV cache blocks, reducing fragmentation between requests and the fragmentation from shorter generations. You can get more details from the paper; I took this diagram from the paper as well.

Now let's assume we are building a production-scale back-end serving engine. In our research cluster, one of our needs is not to build the cluster for just one model: our researchers want to experiment with a big set of models. For example, here we want to serve 50 different large language models for all our researchers, and we find that at any given period some models are very popular and some are less popular. Last year everybody wanted to try Llama; this year everybody wants to try Mistral. How do we deal with this shift of load over time? If we allocate at least one GPU per model, and more GPUs for the popular models, we end up with, say, 92 GPUs where at a certain time 74 out of 92 are idling. That's a huge waste of resources and a huge waste of energy. So what can we do about it?

The first thing, to reduce the energy cost (not yet the money cost), is that when GPUs are idling and their models are not popular, we can use the NVIDIA System Management Interface (nvidia-smi), which provides a GPU clock frequency tuning tool. You can use the command line on your GPU server to check which frequencies the GPU supports, and then just use nvidia-smi to change the frequency of your GPU.

What do we get if we try that during idle periods? When the GPU is idling, turning the GPU clock frequency down from 1410 MHz to 540 MHz reduces the GPU temperature from 39 °C to 35 °C and reduces the power usage from 55 watts to 35 watts. That's the idle period.
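For completeness, here is a hedged sketch of the pynvml equivalent of those nvidia-smi clock commands; locking clocks requires root, and the 540 MHz value is the one from the experiment above.

```python
# Sketch: NVML equivalent of `nvidia-smi -lgc` / `nvidia-smi -rgc`.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# List supported graphics clocks at one memory clock
# (what `nvidia-smi -q -d SUPPORTED_CLOCKS` prints).
mem_clock = pynvml.nvmlDeviceGetSupportedMemoryClocks(handle)[0]
clocks = pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_clock)
print(f"supported graphics clocks at {mem_clock} MHz mem:", clocks[:5], "...")

# Pin the GPU to 540 MHz while it idles (like `nvidia-smi -lgc 540,540`).
pynvml.nvmlDeviceSetGpuLockedClocks(handle, 540, 540)
temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print("temperature:", temp, "C, power:",
      pynvml.nvmlDeviceGetPowerUsage(handle) / 1000, "W")

# Release the lock when load returns (like `nvidia-smi -rgc`).
pynvml.nvmlDeviceResetGpuLockedClocks(handle)
pynvml.nvmlShutdown()
```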
What if it's busy? In this experiment we sent 16 concurrent requests to the LLM serving engine on this GPU. If we turn down the frequency again, we reduce the temperature from 74 °C to 61 °C, so you can imagine how much cooling cost you can save, and we reduce the peak power usage from 300 watts to 150 watts.

Now, suppose we do turn the clocks down and then, unluckily, a lot of requests come in: what happens to the latency the server delivers? This diagram varies the number of concurrent users, that is, we vary the load on the LLM server, and we also vary the GPU clock frequency from the top supported frequency down to 540 MHz, which is not the lowest; 540 is roughly double the lowest frequency you can set. The median per-output-token latency at the lowest of these frequencies stays well below 50 milliseconds per token. In this domain, 50 milliseconds per token is quite acceptable: when you type input and watch the output appear, 50 ms per token completely keeps up with your reading speed. So with a low load, below 16 concurrent users sending requests, even if you turn the frequency down you are not sacrificing the service level agreement of the end-user experience very much.

Let's also look at the 99th-percentile latency. We see the same thing: even tuned down to 540 MHz, with a load of 16 concurrent users, the per-token latency is still about 100 to 120 milliseconds. A little slower, but still keeping up.
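As a side note, the median and p99 per-output-token numbers in these charts can be computed from simple benchmark samples; here is a minimal sketch with made-up sample values, not our actual measurements.

```python
# Sketch: per-output-token latency percentiles from benchmark samples.
# Each sample is (request_latency_seconds, output_tokens); numbers are made up.
import statistics

samples = [(4.1, 96), (2.8, 60), (7.9, 150), (3.3, 70), (12.0, 110)]

per_token_ms = sorted(1000 * lat / toks for lat, toks in samples)
median = statistics.median(per_token_ms)
p99 = statistics.quantiles(per_token_ms, n=100)[98]  # 99th percentile

print(f"median: {median:.1f} ms/token, p99: {p99:.1f} ms/token")
print("within 50 ms/token SLA:", median <= 50.0)
```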
The next question: what if we want to reduce not only the energy cost but also the money cost, or even free those GPUs for something else, like the more popular models? The second option is to pack the small, lightly used models together into a smaller number of GPUs. If you look at the sizes of different models, there's an example in the lower-right chart: their memory demands and parameter counts vary a lot. They can spread across GPUs, and some can even fit into one eighth of a GPU, so there is a lot of opportunity here.

Which technology do we choose? These are the default options provided by NVIDIA GPUs. You can do time-sharing, as long as all your models fit into memory. You can do MPS. Or you can do MIG, which is static partitioning of the high-bandwidth memory along with space-multiplexing of the compute. We wanted to start with MIG, because all those memory optimizations on the server make memory allocation very unpredictable: when more requests arrive, the server batches more of them to use memory more efficiently, so if servers share the same memory they can easily run into out-of-memory exceptions. That's why we start with MIG, the static partitioning of the high-bandwidth memory.

MIG basically allows the GPU to be securely partitioned into up to seven separate GPU instances, each with separate and isolated paths through the entire memory system. It is supposed to be fast, and it is supported on bare metal, in containers, and on Kubernetes; I show the simple command to enable it here. This diagram shows all the possible MIG partitions: you can merge two small MIG partitions into a 2g.10gb, which is a bit bigger, and you can even merge further into a 3g.20gb as needed.

This is how we configure MIG partitions on Kubernetes. You define your own ConfigMap, and using the default GPU operator you can pre-define whether you want all small slices on one GPU or a balanced configuration with one 1g.5gb, one 2g.10gb, and one 3g.20gb. All you need is to define those profiles and label your nodes with the profile you want.

Here is a numerical analysis on our earlier research-cluster example: if we pack all the tail models into fewer GPUs, how many GPUs can we save? We analyzed 42 models among our 50. They used to need 42 GPUs, one GPU per model. At low load, if we use MIG to pack them together, we can pack them into 19 GPUs, saving 23 GPUs. Cost-wise that's a saving, or we can give those 23 GPUs to the popular models to reduce their latencies.

The next question is on the application side: do we really need very large models for all these LLM applications? What if small models can do the same job? Later we'll talk more about our experience using small models; here I just want to highlight some benefits. First, they are efficient. Second, they have very low cost. Third, you can easily tune a small model into a domain-specific model at much cheaper cost; you don't need a lot of GPUs to fine-tune a small model.

The last option: what if you still want to use a large model, but you don't have the very high-end hardware with large high-bandwidth memory to serve it? The option might be a quantized model. In this tutorial we demo models with GPTQ quantization. GPTQ is a layer-wise quantization algorithm that minimizes the objective shown here: W is the original weight matrix of the layer and X is the input to the layer, and quantization finds the quantized weights Ŵ that minimize the difference between the two, essentially minimizing ‖WX − ŴX‖² over Ŵ. That way you keep accuracy while making the model smaller: Ŵ can be just 4-bit integers while W is 16-bit floating point. The bottom-left figure, from the original GPTQ paper, shows how accurate the quantized models are compared to the originals. For the OPT family and the BLOOM family, the blue dotted line is the original floating-point models and the other is the 4-bit GPTQ models, and in the benchmarks their accuracies are really similar. What's interesting to us is that we can now use even smaller MIG partitions, for example 2g.10gb or 3g.20gb MIG slices, to fit those quantized models instead of using a whole GPU; a back-of-envelope sizing sketch follows.
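Here is a weights-only estimate of which MIG slice a quantized model might fit on; the flat overhead guess for KV cache and runtime is an assumption, and real headroom depends on batch size and context length.

```python
# Back-of-envelope sketch: will a quantized model fit a given MIG slice?
MIG_SLICES_GB = {"1g.5gb": 5, "2g.10gb": 10, "3g.20gb": 20, "full-a100": 40}

def weight_gb(n_params_b: float, bits: int) -> float:
    return n_params_b * 1e9 * bits / 8 / 2**30

def smallest_fit(n_params_b: float, bits: int, overhead_gb: float = 2.0) -> str:
    # overhead_gb is a crude stand-in for KV cache + runtime memory.
    need = weight_gb(n_params_b, bits) + overhead_gb
    for name, cap in sorted(MIG_SLICES_GB.items(), key=lambda kv: kv[1]):
        if need <= cap:
            return f"{name} (need ~{need:.1f} GB)"
    return f"does not fit (need ~{need:.1f} GB)"

print("Llama 2 7B, 4-bit: ", smallest_fit(7, 4))    # -> 2g.10gb
print("Llama 2 13B, 4-bit:", smallest_fit(13, 4))   # -> 2g.10gb/3g.20gb with headroom
print("Llama 2 7B, fp16:  ", smallest_fit(7, 16))   # -> 3g.20gb or bigger
```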
So when we use smaller models and quantized models, how much performance do we sacrifice, system-wise? Let's assume the application doesn't need to issue many requests at once. If the number of concurrent users or queries sending requests is less than eight, then switching to the quantized model, which as we said can run on one half or one eighth of a GPU, still works: against the 50 ms-per-token SLA line, all those latencies stay within the SLA and remain acceptable. Of course, if a lot of requests come in, you may want to switch back to the full A100 to serve those models. One thing to highlight in this chart: the quantized models' latency goes up very quickly, but that's because we squeeze them into smaller MIG partitions; on the same full A100 accelerator they behave similarly to the original model. So quantization and smaller models can really serve your goal if you don't have large devices or enough resources.

Next we'll show a quick demo of how to deploy our application. This is our whole setup. We have one small quantized model, a Llama 2 7B GPTQ model, served on a 2g.10gb MIG slice; one larger quantized model, Llama 2 13B GPTQ, served on a 3g.20gb MIG slice; and the original Llama 2 7B model served on a whole A100. In front of the servers we developed our vLLM router, which automatically routes each request to the corresponding model (a sketch of calling it appears below). Behind the vLLM router we have a load generator and our application, which we'll introduce later. In the observability stack, all the servers export their metrics to Prometheus, we use Grafana to visualize the results, and we use the NVIDIA DCGM exporter to export GPU metrics for visualization as well. Lastly, Huamin will introduce our enhancements in Kepler to export all the energy consumption. All of these are open source projects; you can scan the QR code to get to the tutorial as well.
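Since vLLM exposes an OpenAI-compatible API, a client call through a router like this is just an HTTP POST; in this sketch the router URL and model ID are placeholders for whatever your deployment uses.

```python
# Sketch: calling the serving stack through the router via the
# OpenAI-compatible /v1/completions endpoint that vLLM provides.
import json
import urllib.request

ROUTER_URL = "http://vllm-router.example.svc:8000/v1/completions"  # placeholder

payload = {
    "model": "llama-2-13b-gptq",   # the router picks the backend from this name
    "prompt": "Summarize chapter one of Alice in Wonderland in two sentences.",
    "max_tokens": 128,
    "temperature": 0.7,
}
req = urllib.request.Request(
    ROUTER_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["text"])
```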
Now I'll show a quick demo. Sorry, it takes some time to load the video; it seems we have a lot of people here, so the internet isn't great. Maybe in the meantime, do you have any questions? This should be a bit more interactive. Could we mirror the screen? Sorry for the trouble. Okay.

So, if you scan the QR code from the earlier slide, you get to this tutorial, and we go through the steps of deploying the servers and applications. There are two ways: one is a complete deployment YAML that deploys everything, including the router and all the ServiceMonitors needed to export the Prometheus metrics; the other is the Helm chart we provide. Here I show the quick steps for deploying the whole setup with Helm: you can configure the models you have, all the necessary parameters, even the model IDs. We open-sourced the vLLM router on GitHub as well, and Helm really gives you a one-step setup. Now you can see three vLLM servers running different models, our application, the Twilight chatbot, deployed, and the vLLM router.

Next I'll demonstrate how to quickly run the load generator behind all the experiment results shared in our slides today. This is an example dashboard for our vLLM servers; it shows performance metrics such as throughput, time-to-first-token latency, and per-output-token latency. And this is the DCGM dashboard, where you see all your GPU frequencies, temperatures, and energy consumption.

Make sure you set up your Hugging Face secret before these steps, and also set up the persistent volumes and persistent volume claims for the location of your models. Here we pre-configured our Hugging Face secret just to fetch the models. This Kepler demo tutorial is in the sustainable-computing-io org on GitHub. Here we just configure the persistent volume and persistent volume claim to keep the benchmarking results from the load generator.

Starting the load generator is very simple. You configure a .env file with the maximum concurrency, just like in the charts I showed before, and some other parameters such as the models you want to test and how many prompts to generate for the experiment. A ConfigMap just wraps the .env file you configured, and once the ConfigMap is ready you can go ahead and create the load-testing job; the job automatically grabs those parameters and generates the requests.
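The core of such a load generator is simple; here is a stdlib-only sketch of the loop, with the endpoint, model name, and prompt as placeholders (the real job reads these from the .env-backed ConfigMap).

```python
# Minimal sketch of a load generator: N concurrent workers posting
# completions and timing them.
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://vllm-router.example.svc:8000/v1/completions"  # placeholder
CONCURRENCY = 16
PROMPTS = ["Tell me a short story about a lighthouse."] * 64

def one_request(prompt: str) -> float:
    payload = {"model": "llama-2-7b", "prompt": prompt, "max_tokens": 64}
    req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, PROMPTS))

print(f"{len(latencies)} requests, mean {sum(latencies)/len(latencies):.2f}s")
```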
I think that's all of the demo. Next, Bo from IBM Research will introduce our application and how he developed it to connect to these large language model inference servers.

(Bo Wen) Okay, I'll first give some background context for this use case. It's about dementia. Dementia is one of the biggest health crises impacting our society and economy today. According to the World Health Organization, around 50 million people have dementia worldwide, and that number is projected to roughly triple by 2050, reaching about 150 million globally. The economic cost was around one trillion dollars a year as of 2018. We can also understand it in a more personal way: all of us middle-aged people, some of you probably have kids, and we all have parents. One of my grandparents actually had dementia in their later years. My parents are still healthy, so I'm lucky, but if one day their health starts to decline, I will have to take care of my kids while I take care of my parents; we become a candle burning at both ends.

This is why dementia matters, not just from the humanitarian point of view of loving and caring for our families, but from the society point of view: if this health crisis is not addressed, it becomes unsustainable for the whole society, because productivity goes down when all of us are caring for our loved ones and no one is available to do the jobs. Prevention is the key to staying sustainable here. A lot of research shows that you just need to use your brain more; that's the key to helping prevent dementia in old age. Reading is one of the things that keeps your brain active, as are other cognitively active tasks, like using your computer rather than just watching TV. Use your brain more; that's the key message to take away.

To put that into practice, one of the research projects that my group, IBM digital health, is working on with Harvard Medical School is a pilot study to get elderly people in assisted living facilities to engage in reading, which hopefully helps them maintain their cognitive activity. But these people are about 65 years old or older, and sometimes even walking around is difficult; how do you keep them engaged and reading every day? That itself becomes a challenge, and that's where a large language model becomes helpful. We developed this chatbot. The goal of the chatbot is not to test whether the reader understood the book; it's to make reading a more enjoyable experience, to talk about the chapter they just read, and maybe also about other stories and fun moments from their life, so that reading doesn't feel like being alone but like something interactive.

Okay, the next thing is the demo. Let's see if it works. Working, okay. I'm showing you two screens side by side: the one on the left is powered by GPT-4, and the one on the right is powered by the Llama 2 13B quantized model. We also have the Llama 2 7B, in non-quantized and quantized versions, which we'll show next. We start chatting with the bot. Of course different models respond a little differently, but the main idea is the same, and we try to respond in similar ways to see how each model reacts. The idea here is to show, as Chen mentioned before, that the bigger model may not be the best choice for the task, and you will see why very soon.

We are talking about the first chapter of Alice in Wonderland. The assumption is that the reader has already read the chapter, and now we're having a conversation with them about the book, trying to make them more engaged and make reading feel more fun. You can see the GPT-4 response is more sophisticated, simply using more words; bigger models tend to respond with more sophisticated answers. Now the smaller model: you can see it tends to give very short responses. In this use case, that's actually preferable, because for the elderly at this point, our medical team tells us to keep the reading level around fourth to fifth grade. They have already started to experience cognitive decline; if you give them a long question, they cannot keep their attention on it, they forget, they feel fatigued, and it becomes a bad experience chatting with the bot.
So the goal is to keep it short and sweet, like talking to kids. The smaller model's behavior is actually preferable in this use case to the larger model's, and that's why we chose this for the demo: to show that a bigger model sometimes uses more energy yet isn't what we're looking for from the user experience point of view.

Okay, that's the end of the demo; let's move on to how we built it. This is the architecture of the application. At the bottom, as Chen showed before, we use vLLM as the engine to host our Llama models, and GPT-4, of course, goes through its cloud API. In the middle we use LangChain as the prompt engineering framework to connect with the LLM, as well as the RAG part, which in this case retrieves the chapter of the book being discussed. There's also the necessary memory management, and the user database: we give the user's profile to the LLM to make the chat more personalized. On top of that we use Chainlit as the chat engine to host the front end, connecting to the web UI you just saw. In our repo you'll also see a voice interface through a Twilio integration, so the elderly can just call in and talk to the chatbot over the phone or by text message; it's about providing more ways for users to interact with the system.

Here is how Chainlit and LangChain fit together. vLLM is a very popular framework, so LangChain already provides a nice wrapper for it: you call one function, pass the URL of the server where you're serving the model, the model you want, and the other parameters that control the LLM's behavior. And here are some simple examples of the prompt engineering; there are ready-made prompt templates, and I'll show the source code later.

Oh, sorry, one thing I want to mention: the template differs a little between models; this is a caveat. Llama 2 Chat and ChatGPT are trained for chat behavior, so when you generate the prompt, the model expects a chat, a turn-by-turn interaction. But other models, like Mistral's, are fine-tuned for instruction following; if you want to chat with them, you use a special prompt engineering technique: you tell the model "here is the history of the chat" and ask what the next sentence of the response should be. You make it mimic chat behavior, but essentially the model is still doing completion: it just extends the input prompt you gave it.
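Here is a minimal sketch of both pieces: pointing LangChain at a vLLM server and wrapping chat history into a completion-style prompt for an instruction-tuned model. The import paths follow recent langchain-community releases and may differ in your version; the URL and model name are placeholders, and this is not the app's exact code.

```python
# Sketch: LangChain -> vLLM (OpenAI-compatible) plus a completion-style
# chat prompt for an instruction-tuned model.
from langchain_community.llms import VLLMOpenAI
from langchain_core.prompts import PromptTemplate

llm = VLLMOpenAI(
    openai_api_key="EMPTY",                          # vLLM does not check the key
    openai_api_base="http://vllm.example:8000/v1",   # placeholder server URL
    model_name="llama-2-13b-gptq",                   # placeholder model ID
    temperature=0.7,
    max_tokens=128,
)

# For a completion model, hand it the chat history and ask for the next turn.
template = PromptTemplate.from_template(
    "Here is the conversation so far between a reader and a friendly "
    "reading companion:\n{history}\nWrite the companion's next reply.\n"
    "Companion:"
)
history = "Reader: I just finished chapter one of Alice in Wonderland."
print(llm.invoke(template.format(history=history)))
```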
And Chainlit: Chainlit is actually a startup from here, from Paris; we saw the founders here, and they're doing a great job. Chainlit really makes building a chatbot very easy. It's built on top of FastAPI, so if you're familiar with FastAPI you'll recognize the decorators; the main signature idea comes from there. Most of the time you only need to define two functions: one to initialize your chatbot, and one for how you want to handle each turn of the interaction with your chatbot. After those two functions, you just run the server, and all the UI and everything else is taken care of for you.
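A minimal sketch of that two-function shape, with a stubbed model call (the placeholder below just echoes; the real app calls the vLLM stack). Run it with `chainlit run app.py`.

```python
# Sketch: the two-function Chainlit app described above.
import chainlit as cl

def ask_llm(prompt: str) -> str:
    return f"(model reply to: {prompt})"  # placeholder for the real LLM call

@cl.on_chat_start
async def start():
    # Initialize the chatbot: load the chapter, user profile, memory, etc.
    await cl.Message(content="Hi! Shall we talk about the chapter you read?").send()

@cl.on_message
async def main(message: cl.Message):
    # Handle one turn of the conversation.
    await cl.Message(content=ask_llm(message.content)).send()
```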
The demo you saw actually took only about half an hour to build for this part. Here is the URL to the repo if you want to check out the source code. We have a wide range of contributors; some are high school students who are concerned about their grandparents and contribute their ideas to this project, and of course there are also researchers from Harvard Medical School. Thank you.

(Huamin Chen) All right, let's come to sustainability: what we do here to measure the power consumption of large language models. You may have already seen the project, Kepler, mentioned multiple times at KubeCon before. Its wonderful maintainers are also here at the conference, and there are more talks later today and tomorrow, so you're welcome to meet the Kepler maintainers. The project is mostly about using software methodologies to give you an idea of how much energy is used by your containers, your processes, or your virtual machines. This is very useful information for tuning your applications and your deployment models so you can achieve your sustainability goals. Kepler is currently a CNCF sandbox project, and we are very glad to grow the community.

Here is a quick introduction to the framework; look at the different columns of the diagram. On the left are the metrics we collect from the operating system using eBPF. As you know, eBPF is small and versatile and able to collect information at different levels, operating system level and hardware level. Specifically, in Kepler we build multiple entry points, kprobes, and each probe function intercepts certain OS-level events: context switches, software IRQs, and dirty memory pages, so that we can collect a whole picture of how processes and containers behave inside the operating system. With that information we move to the middle column: here we correlate the eBPF-collected information with user-space information to paint a bigger picture of process ownership, since a process could be inside a container or inside a virtual machine. That is the mapping we create in this area. Once we have all this information, we associate energy consumption with the activities of these processes and containers using a ratio method. Currently, in bare-metal environments, we use CPU instructions to attribute energy.

Hypothetically, with simple numbers: say the whole server executes one hundred CPU instructions in a collection window, process A executes 30 of them and process B executes 70. By ratio, process A gets 30% of the energy consumed in that window and process B gets 70%.

The power information itself comes from different sources depending on the configuration and hardware architecture. On x86 we get power information at runtime from RAPL, the Running Average Power Limit. On certain ARM platforms we use hardware sensors to get CPU-level energy consumption. On virtual machines, where you don't have access to hardware counters for energy, we use machine learning models to estimate how much energy was used during a time window based on CPU activity; that is an area where we actively build models for different hardware architectures. Besides the CPU, we can also measure GPU and server-level energy using different libraries: at the server level we use RAPL and ACPI to get platform-level energy, and at the GPU level, which is coming up next, it varies by platform. We currently support NVIDIA GPUs through two libraries: NVML, the NVIDIA Management Library, and DCGM, the Data Center GPU Manager, which comes up next.

As we discussed before, the level of configuration in the GPU varies with the deployment model. If you deploy one model per GPU, you can read the power consumption from the GPU and attribute the total energy consumption entirely to that model; that's the simple case, where the NVIDIA NVML library gives us everything. But if you use MIG, multi-instance GPU, where a GPU is sliced into multiple slices and each slice serves one model, things get tricky, because NVIDIA does not give you precise per-slice energy consumption; it falls to us to do a certain level of estimation and modeling, using multiple levels of information.

Here is a snapshot of the NVML/nvidia-smi output. Highlighted in red are the identifiers of the different MIGs; we have three MIGs in this picture. Each has a GPU ID, a GPU instance ID (GI ID), a compute instance ID (CI ID), and a MIG device number; this is the information we have to fetch programmatically through the DCGM library. Highlighted in blue is the multiprocessor count.
We are using an A100 40GB NVIDIA GPU here, sliced into three slices as described before, roughly a 3g.20gb, a 2g.10gb, and a 1g.5gb. The biggest one has 42 multiprocessors, and the others have 28 and 14; this is information we can get from the NVIDIA library, NVML. The very last piece is the processes using the GPU: highlighted in green you see the process ID (PID), and on the left the GPU ID and the GPU instance ID, which is 2 in this case. So we can map the GPU and GPU instance ID to the process ID, to correlate which processes or containers are using which GPU slice. With that information we can estimate energy by using the slice's multiprocessor count relative to the whole GPU, to estimate how much energy was used by that slice, and therefore by certain processes or containers. That's the basic idea.

In short, to wrap up: we gather information from the GPU as well as from the operating system, match processes using information from the container runtime with information from the GPU libraries to get the mapping between GPU slices and the physical GPU ID, and then build models to estimate how much of the whole GPU's energy can be attributed to each slice. This is something we are still working on, with different formulas. One formula is based entirely on the compute unit counts. If you use NVIDIA's DCGM, there is a metric for tensor utilization: the percentage of active tensor pipes currently used by the GPU slice. We use that as the indicator for attribution. Suppose a process is using a certain MIG, and at a sampling time we find the tensor utilization is 20%, and the slice's processor ratio is 50%, meaning the biggest slice from the last slide, the 3g.20gb MIG. Then we can do simple math: if the whole GPU draws 250 watts, we multiply by the 50% processor ratio, and since only 20% of that slice is actively used, we multiply by 20% again, giving the power attributed to that slice. That amount of energy is attributed to the slice, and eventually to the containers.
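The spoken formula, written out as a sketch (the numbers mirror the example just given; the SM counts are the A100 values from the previous slide):

```python
# Sketch of the attribution formula: slice power ~= whole-GPU power
# x (slice SMs / total partitioned SMs) x tensor utilization of the slice.
def mig_slice_watts(gpu_watts: float, slice_sms: int, total_sms: int,
                    tensor_util: float) -> float:
    return gpu_watts * (slice_sms / total_sms) * tensor_util

# 3g.20gb slice: 42 of 84 partitioned SMs, DCGM tensor utilization 20%.
watts = mig_slice_watts(gpu_watts=250, slice_sms=42, total_sms=84,
                        tensor_util=0.20)
print(f"power attributed to the slice: {watts:.0f} W")  # 25 W
```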
We have all this information in Kepler as well as in vLLM, and we can use Prometheus to process it and put the whole picture on a dashboard. In Kepler you'll find a number of metrics available; one of interest here is the Kepler container metrics, which include a GPU joules total: an aggregation of how much energy has been consumed by the pod, specifically on the GPU. As you see here, before the workload starts, GPU consumption is almost zero, just some idle power. Once the workload kicks off, we run a single query against our language model on the vLLM backend, and you see the energy consumption ramp up and eventually stabilize. So that's one way to query how much energy a pod uses.

The next thing most people are interested in is how much energy you need to generate a single token. This is very interesting because, at the end of the day, the people managing these clusters and data centers care a lot about energy consumption in the data center. Once a GPU like an A100 or H100 is powered on, the power consumption can be anywhere from 500 to 700 or even 1000 watts, depending on the number of GPUs you have, and with NVIDIA's latest announcements, the B100 and the upcoming B200, the energy consumption is even higher; data centers may not be configured for that. So if you can reason in tokens, managing at the token level is very intuitive for people matching models and infrastructure to workloads. These are the metrics we believe give you such direction.

We get token throughputs from vLLM; again, this is just throughput, it has nothing to do with latency, it's more on the usability side of the story. So performance-wise we get throughputs, and energy-wise we get a couple of metrics: the container GPU energy in watts, since taking the rate of the joules counter gives you watt-level information. And because the GPU isn't the only resource used by the container, the CPUs work alongside it, you also want the whole picture of how much energy the container uses at the pod or service level. So we also aggregate all the resources used by the container in another metric, the Kepler container joules total, which includes GPU, CPU, and DRAM.

To make it more visible, we can visualize the whole thing in a Grafana dashboard. Again, this is just for information: it does not mean one model is more efficient or higher-performing than another, because I used the same concurrency, 10 queries, for all the backends, and a given backend, due to compute and memory constraints, may not support that batch configuration. Treat it as a hypothetical use case: you serve 10 concurrent queries from different large language models on different hardware configurations; what would a visually representative tokens-per-second and energy-per-token picture look like?
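One way to derive such a tokens-per-watt number is directly from Prometheus: the rate of a joules counter is watts, so tokens per second divided by joules per second gives tokens per watt-second. The metric names below are the ones the exporters used here expose (vLLM and Kepler) but may differ by version, and the Prometheus URL is a placeholder.

```python
# Sketch: tokens-per-watt from Prometheus' HTTP query API.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example:9090/api/v1/query"  # placeholder

QUERY = (
    'sum(rate(vllm:generation_tokens_total[5m]))'
    ' / sum(rate(kepler_container_gpu_joules_total[5m]))'
)

url = PROM + "?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url) as resp:
    result = json.load(resp)["data"]["result"]
if result:
    print(f"tokens per watt: {float(result[0]['value'][1]):.2f}")
```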
On the top panel, from vLLM, are the throughputs of the different models. The top one, in yellow, is Llama 2 7B unquantized: using the whole GPU, its throughput can go as high as 600 tokens per second. The second one is the Llama 2 13B quantized model, at about 200 tokens per second. The reason the bigger model does better here than the smaller models, in my opinion, is that the resources given to the models differ: the 7B unquantized model uses the whole GPU, and the 13B quantized model gets twice as much GPU resource as the 7B quantized version, so its throughput is higher than the 7B's even though 7B is the smaller model. The 7B quantized model gets half the resources of the 13B quantized one, so you get even lower throughput.

In the middle panel you see a breakdown of energy spent on CPU versus GPU. The takeaway is that the GPUs, and there are only two A100s here, use almost ten times more energy than the CPU, which is one reason GPU optimization gives the biggest bang for the buck in the whole picture.

At the very bottom panel you see the tokens-per-watt comparison. Again, I'm not saying one model is better than another in terms of sustainability; this is just a reference for when you configure a model to serve a certain workload in a certain time period: which configurations may be more efficient than others. The Llama 2 7B model using the whole GPU gets about three tokens per watt, which is actually high-performing in sustainability terms: more bang for the buck. The 13B quantized version, using half of that GPU, gives you about one token per watt; again, this has no relationship to latency, it's purely from the token perspective. The smallest configuration here, 7B quantized, generates almost half a token per watt. This just gives you a visual representation and some touch points: if you are managing language model serving on a given infrastructure with certain configurations, these are potential metrics to consider to save energy while still maintaining your service level agreements.

Right, so this was with 10 concurrent queries. We also have a recorded demo using just one query, one stream of queries, and the picture looks a little different: the tokens-per-watt numbers are not as pronounced. We had a technical issue, the browser actually crashed; it's coming back soon. Since it's a YouTube video, I might just leave it for you to watch offline after this talk. Sorry for the delay. Okay, we're going to skip this one.

Let's come to the acknowledgements, for all the people who actually made this work possible.
We appreciate the IBM Research team we collaborate closely with, some working on Kepler, some on the AI side; this is a very substantial lineup of talent sitting behind the scenes to make this happen. We also appreciate the people volunteering on the application development for the Twilight chatbot; we had a wonderful experience working with end users to identify the use cases for small models versus large models and the different techniques that make the end-user experience better.

For the takeaways: we have a lot of activities going on at this KubeCon. We have the CNCF Cloud Native AI working group, whose white paper was just published recently; you can find a lot of information there on configuration, deployments, the background of AI and sustainability, and how cloud native AI helps us work together. The Cloud Native AI working group holds biweekly meetings, and you are welcome to join. There are also biweekly meetings of the Environmental Sustainability TAG; the TAG lead is also here at the conference and has a talk on sustainability tomorrow, so you're welcome to attend that talk.

The significance of large language models as a utility for day-to-day life, and the challenge they pose to environmental sustainability, are both visible. Doing something to mitigate the risk while improving quality of life benefits all of us, and technically it is also a very encouraging path to take. So I welcome everybody to spend time looking at these technologies, making them available, and making the use cases more appropriate for your environmental considerations.

We do have good support for NVIDIA GPUs, where we have already identified certain use cases for measuring GPU power consumption and correlating it with your large language model performance, to provide interesting ways to choose the best model while conserving energy. We are also exploring ways to support other types of GPUs and accelerators. For vendors interested in working with Kepler and with different language model serving infrastructures, this could be a very exciting playground for everybody.

Very lastly, thank you all for coming to the session, and if you have questions, now is the time. Okay.
(Audience question) The question is about how to map the PID to the container ID.

(Huamin) There are different ways. The NVIDIA DCGM exporter does it one way; Kepler does it differently, using the cgroup file system and a namespace-to-ID conversion. You query and get the container ID along with the process ID; then, with the container ID, you query the cgroup file system, and eventually, by querying the Kubernetes API, you correlate the cgroup ID with the container ID. There are actually three entities: the cgroup ID, which is a 64-bit numeric value; the container ID, which is a hash; and from there you can map which processes are using the container.

The missing piece is eBPF: when eBPF hooks into the context-switch probe, you get the cgroup ID and the PID; that's the mapping from cgroup ID to PID. Then we come up to user space with the cgroup ID, get the hash, the container ID, and from the container ID we query the Kubernetes API to get the container and pod names. So there are two levels of resolution, and that's how we get it.

Thank you. So everybody gets everything, that is great. If there are no more questions, feel free to reach us offline; GitHub questions are also welcome.
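To make the cgroup step concrete, here is a simplified user-space sketch of the PID-to-container-ID lookup via /proc. The real Kepler path starts from the cgroup ID seen in eBPF, and the exact /proc layout varies by container runtime and cgroup version; this only handles common formats.

```python
# Simplified sketch: best-effort PID -> container ID via /proc/<pid>/cgroup.
import re
from pathlib import Path

CONTAINER_ID = re.compile(r"([0-9a-f]{64})")  # container IDs are 64-char hashes

def container_id_of(pid: int) -> str | None:
    try:
        text = Path(f"/proc/{pid}/cgroup").read_text()
    except FileNotFoundError:
        return None
    match = CONTAINER_ID.search(text)
    return match.group(1) if match else None

# From the container ID, the pod and container names come from the
# Kubernetes API (or the kubelet), completing the two-level resolution.
print(container_id_of(1234))  # placeholder PID
```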