Thank you very much for attending our session. Today we are going to talk about strategies for efficient LLM deployments in any cluster. My name is Angel, and I'm part of the AI and Advanced Services team at Broadcom. And today I'm here with Francisco. Hey, my name is Francisco Cabrera. I'm a Senior Technical Program Manager on the Azure Agent Platform team.

So just to set expectations for this talk, the goal is to learn different strategies to run, operate, and improve models while working on our specific use cases, and to set up the appropriate infrastructure so we can run them in different environments: not only in a big cluster in a huge cloud, but also in small deployments, on-premise, inside stores, or anywhere else you want to run them. For that, we will talk about local caching, OCI for models, and GPU usage; you may be familiar with these terms after being here at KubeCon for one day. We will focus on small and medium Kubernetes deployments, nothing about AI service providers or mega clusters for distributed training, but rather something you can deploy in your on-premise cluster.

But before starting with the strategies, the first thing you may ask yourself is why you would consider deploying LLMs yourself, when there is already a huge offering out there with a lot of really good models that you can access via API, so you don't need to host, install, and manage them. There are three good reasons to deploy and manage them yourself. The first is that you get more control and flexibility. If you use one of those services, you will most likely only have access to the proprietary models they offer; you won't have the ability to experiment with new models, go small, or try new things. That may also impact cost, because now you are tied to the pricing of the service you chose. The second reason is data. Any company can access those services and use those models; the real difference between the companies in this room is the data each one has. But certain regulations, for example if you are working with government, or simply a preference not to move your data to those services and to keep it private, may stand in the way. Deploying LLMs locally gives you access to data that you otherwise could not use. And the third reason is that, based on that data, you can also fine-tune the models. There is a huge offering on the open source side for machine learning models, and you can take advantage of your data, combine it with these models, and create new models that fit your use case and perform better for your users.

When we talk about efficiency, what we mean is, first, that you provision the minimal resources you need to complete a task without compromising the stability of the system; of course, you don't want to go so close to the minimum that one spike breaks the entire system. And second, that you actually use the resources you allocated, so you don't have idle hardware wasting energy and time.

But now let's put some numbers on energy consumption, which is one of the measures you may hear about around green energy and green code. As an example, I'm going to take the BLOOM large language model. BLOOM is a research project that, in the end, produced a 176 billion parameter model trained on 1.6 terabytes of data.
The interesting point about this model is that the research team was collecting all the energy data while they were developing and training it. So while they were training this huge model, they gathered all the metrics they needed. Just to give you some numbers: training this specific model took about 120 days, and the total energy spent was about 400,000 kilowatt-hours. To give you a comparison with data that may sound more familiar, now that we are in Paris: that is the same energy as the average annual consumption of 61 French homes that year. Which, to be honest, at least for me, and this is a personal perception, is not that big, given that there are millions of people here in France, and this is a one-time effort: you spend the time training the model once, and then it is ready to be used at inference, so you don't need to retrain it over and over.

That said, the important thing for all of us, because I think most of the people here are going to use language models rather than train them, is to optimize for energy efficiency at inference, so that the systems we build to provide value to our customers require fewer resources and less energy. In other words, they cost us less money.

So now let's talk about the strategies we can follow to get efficient LLM deployments. The first one is model selection. For sure, if we go for the biggest models we can find out there, like GPT-4 or Mixtral, those are huge models that give really good results and accuracy across many different kinds of questions, and the reason is that they are big and were trained on a lot of data. But on the other side, the bigger the model is, the more energy it consumes to run. There is a great paper where they measured inference energy consumption across the LLaMA family, comparing the 7 billion, 13 billion, and 65 billion parameter models, and as you can see, the biggest model can even triple the energy you need. So for many tasks you can go to the biggest model and it will perform well for sure, but you can also go in the opposite direction and try to find the smallest model that performs well for your use case. That is actually more efficient than reaching for the biggest model for everything. And you can combine them: a small model for certain tasks and big models for others.

The other strategy is quantization, which is a pretty popular technique for anyone who wants to try a machine learning model without access to a powerful NVIDIA GPU. Quantization is a technique that reduces the model size by reducing the numerical precision of its weights. In the table on the right there is a comparison across the LLaMA family showing perplexity, which is one of the scores you can use to measure the accuracy of a model; it's not perfect, but it's good enough. There you see the comparison between full precision at 16 bits and quantized models that reduce that precision to four bits, 4.1 bits, five bits, et cetera.
So if we compare the numbers, the perplexity of the 7 billion parameter model at full precision is 5.9 (and just in case, for this specific score, lower is better). If you compare that with the perplexity of the 13 billion parameter model quantized to 4.1 bits, the quantized model always beats the full-precision 7 billion model. And if we compare the sizes, the quantized models are pretty small, almost half the size, meaning that to run them you need less memory, and they are faster. This is a really hot ecosystem: Microsoft, I think two or three weeks ago, announced a new paper on 1.58-bit quantization, which is almost nothing. The good thing is that it reports pretty promising benchmarks, like being 2.7 times faster while keeping more or less the same accuracy, and using 3.5 times less GPU memory, which for me is really, really amazing.

The other thing to consider is that until now we have been talking about large language models, but there is also a new category called small language models. This is from a LinkedIn post by Clem, the CEO of Hugging Face, and he has mentioned, not only on LinkedIn but in many places, that smaller, cheaper, faster, customized models could be the future, because they will cover many, many use cases for different companies. Since last year, I think, we have had a pretty good number of small language models available to use. Just to show you some examples: we have Phi-2 from Microsoft, Gemma 2B from Google, and Qwen 1.5 from Alibaba, which actually comes in different sizes; 0.5 billion parameters, I think, is the smallest one, and the biggest one is around the size of the big Llama models. The cool thing about these models is that customizing them, or in machine learning terms, fine-tuning them, is pretty cheap: thanks to their size, you don't need huge, powerful graphics cards for the fine-tuning. So you can take them as a base model (not that they are generic and good enough for most tasks; for that you would need bigger models), take your private data, and retrain them by adding small extra layers, using techniques like LoRA layers. And the cool thing, and we will talk a bit more about LoRA layers later, is that LoRA layers can be independent from the base model, so they are really good for caching: you have the base model, which is Phi-2, Gemma 2B, or Qwen, and then a set of LoRA layers that can be fine-tuned for different use cases inside your company. It's a really good approach.

The other thing I wanted to mention here is that there are also advances in moving the workloads we currently run in clusters in the cloud onto the users' devices. This is not something new; we have seen Google, for example, using machine learning on Android devices for a long time. But what I wanted to highlight, with WebAssembly for instance, is that it simplifies a lot the way you can do that. Now you can have machine learning engines that run even in the browser or on your mobile phone, and you can use the exact same models you run in your cluster, but now on the user's device.
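Just to make that concrete, here is a minimal sketch, not from the talk itself, of loading a small base model in 4-bit precision and attaching a separate LoRA adapter, using the Hugging Face transformers, bitsandbytes, and peft libraries; the adapter path is a hypothetical placeholder.

```python
# Minimal sketch: 4-bit quantized base model plus a separate LoRA adapter.
# microsoft/phi-2 is a real Hub model id; the adapter path is hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization roughly quarters the memory needed to hold the weights.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", quantization_config=bnb, device_map="auto"
)

# LoRA adapters live in their own small files, so one cached base model can
# serve several fine-tuned variants (HR, marketing, ...) by swapping adapters.
model = PeftModel.from_pretrained(base, "./adapters/hr-extraction")
```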
You can even fine-tune them for a very specific user, because they are small enough to be easy to train and easy to distribute to your users. And with that said, we move on to how to do all of this on Kubernetes.

Well, thank you. I think that was a great introduction to the different types of models and how to use them. But now let's talk a bit about how we bring those models to Kubernetes. Once you have the model, the big issue is how you actually host it in these production deployments. You need to get the GPU, you need to get your resources, and they cost money, so you need to find a way to bring the models to the Kubernetes cluster in a secure and efficient way.

The first thing you need to do is define your architecture. Probably you'll go with more of a microservices approach. You want to make sure your GPU is only being used when you actually need it; if you're in the cloud, you probably want to go more serverless, so you pay for what you use. At the same time, think about how we all use these models: we ask a question, we get a response, we ask another question, probably about some code we don't understand, and we're done. So we want to scale up, and scale down when we're done; we want to use that scalability in a good way. And the third part is that a lot of these models may be part of critical applications, so we want a unified deployment between cloud and edge, and the ability to move from edge to cloud or from cloud to edge whenever needed.

So here is a really simple architecture: you have your ingress or gateway, your autoscaler, your proxy (again, depending on your architecture), your AI service, and then the interesting part around inference, with the model loader, the GPU operator, and of course the LLM storage. We're going to talk about all the blue pieces here, going from right to left.

So let's start with the models. In the end, models, as Angel was saying, are just big binary files. And because they are so big, the closer you get them to your Kubernetes cluster, the better. You can see some examples here: with small language models you're in the 3.75 GB range, which is big but still manageable. But as you grow into more complex models, a 7 billion parameter model goes to 13 or 14 GB, and the really big ones, like Llama 70 billion, are over 140 GB, so it gets pretty complicated. On top of that you need to add your container, your application, and the CUDA drivers, so it all gets really, really big. So you have to think about how you are going to distribute this in a secure and efficient way. There are different methods: you could use blob storage, you could use short-lived signed URLs. What we have been researching in particular is how you can leverage the infrastructure you already have, and there were great talks today about how you can leverage the OCI infrastructure.
So basically, what you can do is use the same container registries you are already using and put your model weights there. By doing that, you get a lot of benefits: you have already solved, for example, auth and RBAC for all your other containers, so you just reuse the same setup. You get layering, you get hashing for each of the chunks you upload, and you get the retry mechanism, which is really important on network-constrained devices. And because of the chunking, you get better performance when packing and unpacking these big files.

So how does it work? First of all, choose your model format. There are different model formats: plain binary, PyTorch, HDF5. Recently, Hugging Face open-sourced a new format called Safetensors. It has a bunch of advantages over the previous formats. It's open source and quite new, so there are probably some models you won't find in this specific format yet, but it's quite easy to convert from, say, PyTorch to Safetensors. As soon as you have the model, you need to define your optimal division and split it into smaller chunks. These Safetensors files are big, so you split them into chunks of maybe 500 MB or 200 MB, depending on your network; again, it depends on your scenario. And once you have that, you upload those chunks to your container registry.

In particular, here we're using ORAS. ORAS is an open source CNCF tool that allows you to use OCI container registries and upload these model weights as OCI artifacts. It's pretty simple: you use the same approach as with containers, so where you would do docker push, you do oras push, declaring your specific format, say .safetensors, and push it. Here's an example: I had a model that was about 5 GB, I chunked it into 500 MB pieces, then I uploaded it, and you can see it there in Azure Container Registry. Almost all container registries in the cloud already support these kinds of OCI artifacts, so you should be fine. And as you can see there, in the manifest JSON you have all the different layers for the different chunks of around 500 MB each.

Once your model is in the registry, you need to download it, and because we chunked it, we need to recreate it. But again, it's a really simple task. You can use the ORAS SDKs, which exist for Python, Go, and Rust. What you end up doing is: every time you download a chunk, you append it to the final model file, and as soon as you finish, you delete the chunks. And then, I also put here some sample code showing how easy it is to load the base model and then load the custom LoRA layers on top. For example, in our demo, we're going to load a base Phi-2 model and then apply the fine-tuned LoRA layers.
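As a rough illustration of that chunk-and-reassemble flow, here is a small sketch using only the Python standard library; the file names and the 500 MB chunk size are just the assumptions from the talk, and the upload itself would still be one oras push per set of chunks.

```python
# Sketch of the chunking described above: split a model file into pieces for
# upload, then (what the init container does) merge the downloaded pieces back.
from pathlib import Path

CHUNK = 500 * 1024 * 1024  # 500 MB per chunk, the size used in the talk

def split_model(model: Path, out_dir: Path) -> None:
    """Split one big .safetensors file into numbered chunks ready to push."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with model.open("rb") as src:
        idx = 0
        while chunk := src.read(CHUNK):
            (out_dir / f"{model.name}.part{idx:03d}").write_bytes(chunk)
            idx += 1

def reassemble(chunk_dir: Path, target: Path) -> None:
    """Append each downloaded chunk in order, deleting it once merged."""
    with target.open("wb") as dst:
        for part in sorted(chunk_dir.glob("*.part*")):
            dst.write(part.read_bytes())
            part.unlink()  # free disk space as soon as the chunk is merged
```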
Also, the good part of this is that you can have multiple LoRA layers. Let's say you have a specific LoRA layer for your marketing team; that's one layer. You have another layer for, say, the HR team, and they share the same base model, which is Phi-2.

The other good practice is local caching, or a peer-to-peer alternative. There are some really good projects out there like Spegel, which in the end does local mirroring of those artifacts. Basically, every time somebody asks for a container layer, or really any OCI artifact, it talks to the other nodes inside the cluster and checks whether any node already has that layer. If so, it serves it directly; if not, it goes to the container registry and gets it. Recently, I think it was yesterday, the ACR team announced a new open source project called Peerd, which does similar peer-to-peer communication and caching. One of its advantages is that it also works for other technologies, not only OCI but also, for example, blob storage, and it supports artifact streaming.

A bit of numbers here. I did this on my local cluster. I downloaded the CUDA container, which is about 3.8 GB. Downloading it first on node one took about 151 seconds. Then I did the same on node two, but because the image was already on node one, it was served by node one over my local network, and it took just 5.7 seconds, roughly 26 times faster. This will depend on your network infrastructure and your Kubernetes cluster, but you can generally see results between 10x and 50x.

Now, the other part here is how you get the GPU. When you're using GPUs, what do you need to take into account? First, you need to set up your node with the different pieces: the NVIDIA drivers, the container toolkit, the container runtime. One of the important pieces is also how you define your Kubernetes device allocation. If anyone was at the keynote today, the NVIDIA team talked a bit about the DRA driver, the benefits you get with it, and the limitations of the current device plugin: for example, you can't request more than one type of GPU for a specific node, and there are limitations when you want to share a GPU between containers. All of those are being solved right now with the Kubernetes DRA driver. It's still an alpha version, but if that's okay for you, you can go ahead and start testing it. Once you have defined your device allocation, you need to decide how you're going to access CUDA, or your GPU. And there are different ways: it can be a single process, but when it comes to concurrent access, you can use MPS, time slicing, or virtual GPUs. Again, it depends on your use cases which access model you define. And then the final thing is the application logic.
So for the application logic, you need to take into account: what are you going to do about GPU with fallback to CPU? If you're going to use LoRA layers, you can use, for example, Safetensors for lazy loading, loading only the layers you need. Also, how are you going to host the model? Are you going to use a framework? Among the frameworks to host the model you have Ollama, you have vLLM, you have OpenLLM, so make sure you define that logic based on these frameworks that are already available.

And then the final thing is custom scaling. Again, with these kinds of applications, you type something, you finish, and then you're not using them anymore, so you need good logic around how you scale up and scale down. You could use KEDA to get event-driven autoscaling. If you're going to use GPUs, make sure you also use the NVIDIA DCGM exporter so you can get all the GPU metrics. And finally, add your own logic and custom metrics: you can use metrics like latency, tokens per second, or GPU utilization. With all these metrics in place, you can go ahead and create your scaler for your Kubernetes cluster.

So now let's show you an end-to-end demo that tries to apply everything we've been talking about. To give you a bit of context: we wanted a demo that is simple but useful in the real world. Imagine you're working for an HR company that receives a lot of emails and resumes, from multiple countries, in multiple languages. What you want to do is some filtering: you want to extract information from those emails and put it, for example, in a database. So you'll get something like, "Hey, my name is Angel, I'm 32 years old, I live in Sevilla, and I work as a software engineer," probably in a resume, and what you want is a JSON document that captures that information. So based on this application, we went and tried the available models. We tried Mistral 7B, and it worked quite well. We tried Gemma, quite good. And then when we tried Phi-2 and Gemma 2B, they were okay but made a couple of mistakes. So we said, okay, let's see if we can fine-tune this: use the base Phi-2 model, create a LoRA layer, and see how it works. What we ended up doing was creating a synthetic dataset, I think it was 100,000 lines of these examples; then we fine-tuned this Phi-2 model to see what happened when we ran the fine-tuned model in our cluster. So now Angel is going to demonstrate live how this works.

Okay, let me clear this, cool. So we did the Phi-2 fine-tuning ourselves, but in a real environment, you may think of this as the work of a machine learning engineer. At this point, we fine-tuned this model, pretty simply, and we have it available in our OCI registry.
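The training code itself isn't shown in the talk, but a minimal sketch of this kind of LoRA fine-tune with the Hugging Face transformers, peft, and datasets libraries could look like the following; the dataset file, the target modules, the hyperparameters, and the output paths are all assumptions.

```python
# Sketch of a LoRA fine-tune of Phi-2 on a synthetic JSONL dataset; the file
# synthetic_emails.jsonl (with a "text" field) and all hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
tokenizer.pad_token = tokenizer.eos_token

# Freeze the base model and train only small low-rank LoRA matrices.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

data = load_dataset("json", data_files="synthetic_emails.jsonl")["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=data,
    # The causal-LM collator copies the inputs as labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("adapters/hr-extraction")  # saves only the small LoRA weights
```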
So we're going to start using it right away. The first thing we need to do is deploy it. As Francisco mentioned, there are many different frameworks and services you can use for this; in this case, we use vLLM. Here is how it looks in reality: if we go to the configuration and open the inference YAML file, you can see we define a pod that comes with an init container. This is because, as Francisco was explaining, with the OCI registry we need to download all the different layers and recompose the files; that's what this init container does. Then we load the model into a volume, and we use the vLLM container right away. We don't need to perform any modification; we just load the model from the folder that was downloaded before. In this case, we're running a local cluster using kind, and this is how we ask the cluster to allocate one specific GPU to run the example. We can see the result of the init container with kubectl logs on the vLLM pod: the init container just downloads all the layers and packs them back into the original model. As you can see, it's really simple Python code, just like Francisco said, and at the end you can see the final files after the reconstruction. This took time, so we ran it before the presentation. Once this is available, the vLLM service is ready to start serving responses. And this was based on a fine-tuned model, and it was pretty simple: we didn't have to run any custom code, we just used the same service we already had, but now with a fine-tuned model.

So now let's try to run it. For this, I have a pretty simple inference test Python file. In it, you can see we use the OpenAI library, because vLLM exposes an OpenAI-compatible API layer, so we don't need any custom SDK. Here, just for simplicity, we put the text inline: instead of a short description like before, a slightly longer one with some extra information around it. And then we have the prompt, where we add the instruction: extract this information, this is the format we want, skip everything before and after, and give me the output as the final JSON. So let's try it. I run the Python inference script against the service URL, and in the result we already have the JSON we wanted to extract, in the right format. After performing some tests, you can see it gives a bit more accuracy. Of course, this is synthetic data, so we can't expect really high accuracy on real data. But since you can now access the private data you have in your company, you can fine-tune on that data and improve over and over. And since fine-tuning is so cheap, you can do it iteratively and constantly.
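The inference script described here boils down to something like this sketch; the service URL and the served model name are assumptions, and the stock OpenAI client works only because vLLM exposes an OpenAI-compatible API.

```python
# Sketch of the demo's inference call against the vLLM OpenAI-compatible endpoint.
from openai import OpenAI

# URL and model name are assumptions; vLLM does not check the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

email = ("Hi, my name is Angel, I am 32 years old, I live in Sevilla "
         "and I work as a software engineer.")
prompt = ("Extract the name, age, city and job from the text below. "
          "Answer ONLY with a JSON object with those four keys.\n\n" + email)

resp = client.completions.create(model="phi2-hr", prompt=prompt, max_tokens=128)
print(resp.choices[0].text)  # expected: {"name": "Angel", "age": 32, ...}
```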
And just to finish the demo, the cool thing about this for me is that everything you saw here, from the fine-tuning to running this entire model, is actually running on an NVIDIA GeForce RTX 3060, which is a 300-dollar graphics card with 12 gigabytes of memory. So you don't need huge graphics cards to run these kinds of models, and you can use this for specific tasks inside stores or in any edge deployment you want, because you cannot put $1,000 graphics cards in every store your company owns. And with that, thank you so much. I don't know if we have any questions. Thank you. I think we have some time for questions; you can go to the microphone.

Of course, thank you for the talk. I have a quick question. When we put the model in the OCI-compatible registry and download it in our code, can existing libraries like Hugging Face or other PyTorch code download from the OCI registry directly, or do you have to use an init container to download it? So, as far as I know, there is no direct tool inside the Hugging Face transformers library to download using the OCI layers. But I believe in the future there will be some kind of adapter that you can use right away with the transformers library to download it directly. As of today, though, there is none. Okay, cool, thank you. You're welcome. Okay, so, cool. Thank you very much.