Good afternoon. My name is Arun. I am at Intel, part of the Open Ecosystem team. Hello, my name is Ezequiel Lanza, and I am an AI open source evangelist. Yeah, so today we're going to talk about how you deploy LLMs in cloud native using LangChain, and what the use case for that is specifically. We've always wanted an assistant. I want a system that can cook food for me; I can't get that with AI, but at least in my job, how can I get an assistant that makes me more productive? Whether it's writing my emails, summarizing my meeting minutes, simple assistants, bouncing ideas back and forth, conversational chatbots, all sorts of assistants could be very, very helpful. And if you put that in a business context, in a company there are different business units, so to say. There is finance, there is IT, there is legal, there is engineering, and their needs are very different. Their jurisdictions are very different: how much data can they keep within the business unit, can they send it outside the business unit, should it stay inside the company, can it go outside the company? So if each business unit wants to explore how to leverage an LLM, their needs are very different and very unique in that sense. So that's what we are going to talk about. There are many ways customers are deploying LLMs in cloud native, and we're going to talk about one specific way you could do it. Exactly, yes. And this is one way to do it. So basically, we define five main steps. The first one is the model definition: you need to define your model. The second one is the API. The third is the packaging, because you need to put everything in one package; you need to containerize that. And you need to use Kubernetes, of course. We will go deep on all of them. So let's start with the model definition, which is probably the most challenging part, because we need to know what our problem is. We need to pick which model we would like to use and what the use case is. Do we want to build a conversational chatbot, text summarization? You know that we can use LLMs for multiple things: classification, question answering, and so on. So the first thing to define is the problem. The second one is the model. We have tons of models, as you may have heard: Llama, Falcon, OpenAI, everything. And within those, you have foundation models and fine-tuned models. Do you want to use a generic LLM? Do you want to fine-tune it for your context? Do you want to use RAG, retrieval augmented generation? And the last point is that you need to define your tools. So you define your model, your strategy, and then the tools you have available to use it, like Hugging Face, LangChain, and so on. In our case, we'll be building a question answering application with a foundation model of 7 billion parameters, and we'll be using LangChain and Hugging Face, and explaining the benefits you get from using LangChain and also Hugging Face. So the first challenge is that you need to define the model. How can you make the decision to pick a model? We have three main things. The first one is the leaderboard, which is basically how your model behaves against a public benchmark.
So you go to the Hugging Face leaderboard and you see, OK, I would like to pick a model that does acceptably across the benchmarks, but it's not just performance. Of course, you need to see if the model is used by people: if there is a community around it, if there are tutorials, if there are forums where people are using it, because you know these models behave in mysterious ways, so the feedback from other people is very important. And the last one is the ethical considerations. You need to know if your model is biased, because you will probably need to mitigate that bias or do something about it, and you should also look at the training data. So that is the first part. Once you know the model, let's suppose we have defined a model, now we need to consume it. And on that model definition, if you think about it, Hugging Face, as they say, is like a GitHub for LLMs. Very much as when you go to GitHub and look at an open source project, you don't want to use a project that was created three years ago, has never been maintained, and potentially has CVEs; you can see the impact it could cause if a model has not been maintained for a long time. It could be malicious, it could have vulnerabilities. So the same philosophy you apply when picking a project from a GitHub repo is what you apply to Hugging Face as well. Absolutely. So now we need to consume the model. In a business unit, again, as we talked about, if you are in a legal business unit, maybe you cannot send the data outside of your business unit; you want to keep some data inside. And in certain cases, like if you're in engineering, you're more OK with sending the data out, maybe using OpenAI as a back end, which is open only in name, but at least you can call it using an API. So local means the model is within your jurisdiction, versus an external model, which is hosted somewhere externally. Those are the two options, essentially, that we are looking at. And if you think about the considerations between picking a local model versus an external model: a local model is within your app, part of your app. You want data privacy, you want offline usage, because an external model is hosted somewhere else and you don't have control over it. Cost could be a big issue, because when you're using an external model you are really paying per outbound token and inbound token, and we'll talk about that cost in a bit. With a local model, you have a better ability to customize; you can feed data back into the model and keep improving it. And when you're looking at an external model, what are the pros and the cons around that? Well, setting up a local model could take time; an external model is ready to go. You just make an API call and you get the data back. You get scalability, because scalability is no longer your concern, it's the external provider's concern. It's a less complex setup: essentially, you get a token and you're ready to start invoking it. Elasticity and high availability. And the way we look at this is, the comparison of local model versus external model is very much like asking, do you want to run a private data center, or do you want to be on a cloud? Those are the kinds of considerations between using a local model versus an external model. So now say we are using a local model, for example.
If you want to use a local model, say you download the model from Hugging Face and you're doing local inference, then there are two elements you have to think about. One is the storage, because if it's a 7 billion parameter model, it's around 26 gigabytes, and it has to be stored on an NFS server or EFS or some sort of internal storage. Then, when you're running the model as part of your compute, you need that compute capacity: a CPU, a GPU, and of course your app is running there as well. So that's the combination you have to think about. You could be running a 7 billion parameter model, which is about 26 gigabytes, or you could use a 7 billion parameter model that is optimized and potentially only 7 gigabytes. So all of that storage and runtime capacity is something you have to plan for, and how you scale it all comes into play, as opposed to external inference, where you are focused only on your app, and all you're doing, in order to engage with an LLM, is making an API call. That external model could be hosted by OpenAI or Gemini, or whatever that model is. You get your token, and whether you go local or external is a call you will need to make. Now, this is a wonderful tool that we found a couple of weeks ago. What you can do is put in your number of input tokens. At the very top, you see there are 100,000 input tokens, and then there are the output tokens. So, OK, I'm going to put in 100,000 input tokens, and for that, let's say I'm getting back 300,000 output tokens, and roughly, for that volume, maybe you're making around 8,000 API calls. So in terms of your LLM pricing, you put in all the relevant information, and for that input-output and API-call combination, it gives you an idea of the approximate price that model is going to cost. So based on your usage, the right column is the price, and based on that you should ask yourself: is that price giving me the value, or should I maybe look at local inference in that case?
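To make that back-of-the-envelope arithmetic concrete, here is a small Python sketch in the spirit of that calculator. The per-token prices are made-up placeholders, not real pricing from any provider, and the token counts are just the ones used in the example above.

```python
# Rough cost estimate for external inference, with assumed (not real) prices.
input_tokens = 100_000      # tokens sent to the external model
output_tokens = 300_000     # tokens received back
api_calls = 8_000           # rough number of requests making up that traffic

price_in_per_1k = 0.0005    # USD per 1K input tokens (placeholder value)
price_out_per_1k = 0.0015   # USD per 1K output tokens (placeholder value)

cost = (input_tokens / 1_000) * price_in_per_1k + (output_tokens / 1_000) * price_out_per_1k
print(f"Estimated cost: ~${cost:.2f} across {api_calls} API calls")
# Compare this number against the storage and compute cost of hosting the model yourself.
```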
All right, yes, let's go now to the packaging, because we have the model, we defined whether it's local or external, and we need to find a way to package it. But recapping the first slide, we need to find something that meets multiple business needs. So we have multiple models. You may decide, for instance, that you need to use multiple models, and those models commonly have different ways to be accessed and different compute or storage requirements, because you may want to use a big model, an optimized model, or a fine-tuned model. So you have multiple different kinds of models to use, and the requirements are also different, because you need to meet the business needs, of course. So what you need, as the IT or technical side, is to provide a unified way to interact with those models. We need a tool, which we'll showcase in the next slide, that allows us to use them in a unified way. And that tool is LangChain. I mean, we are not sponsored by LangChain, I should say. OK, good. So basically, LangChain is an open source project. It's a framework that makes it easy to build an LLM application, whether that's a chatbot, a conversational assistant, and so on. It's available in Python and JavaScript. But the important thing is that it has support for 80-plus LLM integrations and open source models. That's the beauty, because you can work with multiple models using the same tool, without worrying about how each model needs to be configured; you have everything in one simple tool. And the other thing is that it supports RAG. Most people are talking now about RAG, which is providing context to the model, and you can also do that with LangChain. One point I want to make on that slide as well: what you're really learning is the LangChain concept and what that abstraction looks like, and then the pluggable model fits in very well. And something else we missed on this one: there's a project called LangChain4j. So if you're a Java developer, for example, and Java is used quite heavily in the enterprise, there is a very similar abstraction done in LangChain4j, so you can start interacting with that project if you want to bring those pluggable LLMs into your Java projects. Yes, great. So what does the logic look like if you would like to use it? We will start with a very complicated question, which is: tell me about Kubernetes. I don't know if someone can answer that, but that will be the question we ask the LLM. And the first tool we will use is the prompt template. You know the challenges we have when prompting ChatGPT or any model: you need to know how to ask the question, because they are probably not as smart as we think. So LangChain has a tool called the prompt template, which basically helps you with your prompt engineering, instructing the model in a particular way. It says: you are a very smart and educated assistant, and so on, and if you don't know something, say so, and so on, and it adds the question into that template. This is very important, because of how you ask the question to the model: you need to provide as much context as possible or at least give some instructions on how to behave, and the user is not doing that when they are talking to a model, of course. That's why the prompt template is very important. Then there is the other part, which is the model. We need to use a model. Remember that this is a local deployment in this example, so we need to download the model from the hub, which is Hugging Face, mainly. If you are familiar with Hugging Face, it's where all the models live, basically. So if you'd like to download the model, you download it from Hugging Face; this is basically how you do it. And you need to put your model in a pipeline. Think of it this way: a model is not just the model. It's the model, the parameters, the tokenizer, and the multiple configurations you can set when you are using a model. So you put everything into something called a pipeline, which is provided by Hugging Face. And the beauty is that if you take the prompt template and the pipeline, LangChain provides an API called a chain, which puts together the prompt and puts together the pipeline. And how do you use it? Basically, chain.invoke. Invoke is one method; there are multiple methods, but as a developer, the way you interact with the model is a method called chain.invoke. You send your question, and you get the answer.
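As a rough illustration of that flow, here is a minimal LangChain sketch in Python: a prompt template, a Hugging Face pipeline for a local model, and the chain that ties them together. The model id, prompt wording, and generation settings are illustrative assumptions, not the exact code from the talk's repository.

```python
# Minimal local-model chain: prompt template + Hugging Face pipeline + chain.invoke.
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Prompt template: instructs the model how to behave and slots in the question.
template = """You are a smart and helpful assistant.
If you do not know the answer, say so instead of guessing.

Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)

# Download the model and tokenizer from the Hugging Face Hub and wrap them
# in a text-generation pipeline (the model id here is a placeholder 7B model).
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=pipe)

# Chain the prompt and the pipeline, then ask the question.
chain = prompt | llm
print(chain.invoke({"question": "Tell me about Kubernetes."}))
```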
So hopefully, the answer is correct, yes. But now, remember, as Arun said at the beginning, we have optimized models and we have normal models. What happens if we need to make it fit on a CPU or on a low-resource device? Well, the chain would be exactly the same; we would be using the prompt template, same as usual, but you need to do an extra step. Intel provides the Intel Extension for Transformers, which basically does quantization, an optimization method that works more like compression. If you're not familiar with that, it's a kind of compression that allows us to go from a model that weighs 26 gigabytes to a model that weighs 7 gigabytes, and this is the tool that runs that process; you can go to the GitHub and run it. And once you have that model, you use it in the same way as we did before: you put it in the pipeline, the pipeline is integrated with the chain, and you interact in the same way, chain.invoke, and you send the question. And you get the same answer, of course. So this is basically how you go from a very heavy model to a light model. But you can also do the same thing when you are using something external. Let's suppose you would like to use an external API, OpenAI, for example. This is something LangChain has: it has an API ready to be used when you are calling something external. So we have ChatOpenAI. The only thing you need to create is the connection, the model client, basically, and this model client holds the key that connects to OpenAI. And you interact in the same way: you have the same chain, you call chain.invoke, and you get the answer.
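Here is a minimal sketch of that swap, assuming the intel-extension-for-transformers and langchain-openai packages are installed; the model id, the 4-bit setting, and the assumption that the quantized model drops into a standard text-generation pipeline are all illustrative, not the exact setup from the talk. The point is that only the llm object changes; the prompt and chain.invoke stay the same.

```python
# Same chain, different backends: a quantized local model or an external API.
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from langchain_openai import ChatOpenAI
from transformers import AutoTokenizer, pipeline

prompt = PromptTemplate.from_template("Question: {question}\nAnswer:")

# Option 1: a weight-only quantized local model via Intel Extension for Transformers
# (assuming the quantized model behaves as a drop-in replacement for the stock one).
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model id
quantized = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
pipe = pipeline("text-generation", model=quantized,
                tokenizer=AutoTokenizer.from_pretrained(model_id),
                max_new_tokens=256)
local_llm = HuggingFacePipeline(pipeline=pipe)

# Option 2: an external hosted model, reached through its API
# (ChatOpenAI reads OPENAI_API_KEY from the environment).
external_llm = ChatOpenAI(model="gpt-3.5-turbo")

# Either backend plugs into the same chain; the calling code does not change.
for llm in (local_llm, external_llm):
    chain = prompt | llm
    print(chain.invoke({"question": "Tell me about Kubernetes."}))
```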
Now we need to containerize it and see how we put it together. Yeah, so the whole idea we showed in the three previous slides is that you can use whatever model you want, whether it's a local model, a quantized model, or an external model, and LangChain is the one abstraction you are learning. But let's think about what containerization means. First of all, let's take a look at why cloud native is the platform of choice for deploying LLMs. Last year, I live in the San Francisco Bay Area, we did TED AI, it's a TED event, and we were talking to several folks there about how they are deploying their LLMs. Kubernetes, that was the answer. Of course, Kubernetes is the de facto compute platform, but let's look at why. As we talked about, one key element is that the model by itself has several components. If you think about your app, the fact that you can containerize all of that, package it with a Dockerfile, with the dependency modules all packaged up together so it moves as a single unit, is a big deal. Scalability is a big part of it. As we saw in the demo this morning, you can use this in a kind cluster on your desktop, and the same concepts are very, very applicable when you are going into production. You can scale it on EKS, or on Azure, or GKE, or Intel Developer Cloud. It doesn't matter, because once you have done the experimentation on your local desktop, you just scale out from there. Another important part is resource management. These are, again, concepts we know very well in the cloud native landscape. Models are memory hungry; you don't want them to become a noisy neighbor. So that's where you can start putting CPU and memory limits: this is what it takes. You can start putting priority classes on which models get priority. So all of those usual execution concepts that you need for your model to run in a business context are available in cloud native; you don't need to learn a new language. Portability is a big deal. You have got a model up and running in your data center; now you are scaling it, you want to go to a cloud provider, now you want to run the model on the edge, now you want to run the model maybe on a client desktop. Lots of different ways, but Kubernetes, being the de facto compute platform, really brings that all along. And a big part of it is observability, because models are typically long running and they need a lot of observability, and OpenTelemetry-based observability groups are really engaging more and more. I was learning about a new project called OpenLLMetry, which is based on OpenTelemetry, and they are looking into how they can provide more observability into LLMs running in cloud native. So those are some of the benefits and why cloud native is a platform of choice for deploying LLMs. Nice. And now we need to put that thing together in a container. So it's the same thing as we showed before: the chain, the Hugging Face part, the pipeline, and the prompt template. LangChain also provides, basically, an API that exposes that method, the .invoke, so the way to interact with this small service is via a POST call, an API, basically; you send the question to it. And what I wanted to highlight here is that there are two pipelines, but the main idea is to showcase that you can use either local or external. If you're using local, you will not be storing your model in the image when you're building the container; you're just creating the space and pointing to the file server, because the model will be downloaded when the pod is launched. And the pipeline for the external case, of course, is a connection that you make to an external API.
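To picture what that POST interface in front of the chain could look like, here is a minimal FastAPI sketch; the /invoke route, the request shape, and the ChatOpenAI stand-in for the chain are assumptions for illustration, not the exact service from the talk's repository.

```python
# Minimal sketch: wrap a LangChain chain behind a POST endpoint inside the container.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Build the chain at start-up; in the local-model case this is the point where the
# model would be pulled from the file server into the pod and wrapped in a pipeline.
prompt = PromptTemplate.from_template("Question: {question}\nAnswer:")
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()  # or the local pipeline from earlier

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/invoke")
def invoke(query: Query):
    # The container's only contract with the outside world: question in, answer out.
    answer = chain.invoke({"question": query.question})
    return {"answer": answer}
```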
Oh, this is the animation that we love, yes. So when it starts working, let's suppose you have your models on your file server; that could be EFS, a file server, whatever. Basically, when the container lands, what happens is, I love it, the model is downloaded from the file server into the container, so the model is living in the container. Of course, the container is in a pod, and you need the service, and you need the PVC and the PV, the persistent volume claim and the persistent volume, which are basically how you connect to the file server. But what I wanted to highlight here is that the model is downloaded every time the pod launches; it doesn't live there. So you need to be aware, as we talked about in section two, I think, that you need to have a file server, so you need to think about the space where the model will be stored, and you also need to think about creating that connection between your pod and the file server. Serving multiple models: so now we have our pods, our containers, and we need to put them all together. So think about this: think about your intranet, where all of your Workday, pay, legal notices, et cetera, live inside the company. This is the model you want to think about; think of the UI as your intranet. At Intel, we have Intel Circuit. So we go to circuit.intel.com internally, and that's where my pay is available, where all my Workday is available. And those requests could go through an LLM proxy, and that LLM proxy is where you could do model provisioning, model governance, cost allocation; all of that could be done at that proxy layer. And from there, it goes down to the actual LLM that is hosted in the business unit. So that proxy layer is where you can do a lot of control and governance, essentially, and really unite all the elements. And if you are deploying all of that together in a Kubernetes architecture, the LLM proxy, where you'll probably have an LLM proxy adapter running as a pod, can scale horizontally; that scales really well. And at the back end, typically, when you are running the LLM backend workers, those are generally vertically scaled rather than horizontally scaled, because they need more memory and CPU and GPU. So that's the way to think about it. Now, there are products available in the market for this already. Take a look, for example: one of the companies that I advise is Katanemo, and they're building exactly this kind of LLM proxy adapter, which can route to the right model based on NLP, essentially. Yeah, and basically, the challenge with using an LLM proxy is that you need a filter, or something that makes the decision when you receive the question, when the user is prompting something. If you have something smart enough in the middle that's not just a proxy, let's suppose that in the proxy you have another model that is guessing or predicting the topic of the question, then you can add additional intelligence when you are using the LLM proxy, because, I mean, you know that we need something in the middle to interact with all the models. Yes, and now, how can we build that in a Kubernetes cluster? Because this is KubeCon, and we need to explain how we built it on Kubernetes. Basically, in the demo we deployed an NGINX ingress, and we exposed the front end and the proxy. The front end is built in React, and we need to expose the proxy, of course, because the browser will be talking to the proxy, but all the connections underneath, like the local models, the optimized one, the external APIs and everything, are pods that live internally in the cluster. And of course, you need a container registry, because when you are launching, you need to go to the container registry, and you need a PVC and a PV. All of those configurations are on the GitHub, so the step-by-step on how to build it is all explained in detail there. So let's do a quick recap before going to the short demo. What you will see is a front end, which, as I said, is in React. The front end sends POST calls to the proxy, and the proxy has a connection, of course, with multiple LLM models. And there's a detail there about how the API is created and named; if you look at the GitHub, you can see what the flow is, and there is a FastAPI app, which is what exposes it to the external world. And at the end, you have each particular LLM, and they send the answer back to the front end and the user.
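As a rough sketch of that proxy layer, the snippet below routes an incoming question to one of several in-cluster model services; the backend names, service URLs, and the /chat route are invented for illustration, and governance hooks (auth, cost accounting, topic classification) would sit in the same place.

```python
# Sketch of an LLM proxy: one front door, several model backends behind it.
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# In-cluster service names for the model pods (hypothetical names and ports).
BACKENDS = {
    "llama2-7b": "http://llm-7b:8000/invoke",
    "llama2-7b-optimized": "http://llm-7b-optimized:8000/invoke",
    "external-openai": "http://llm-external:8000/invoke",
}

class ChatRequest(BaseModel):
    model: str
    question: str

@app.post("/chat")
def chat(req: ChatRequest):
    # This is where provisioning, governance, and cost allocation can happen,
    # or where a small classifier could pick the backend from the question itself.
    url = BACKENDS.get(req.model)
    if url is None:
        raise HTTPException(status_code=400, detail=f"unknown model: {req.model}")
    resp = httpx.post(url, json={"question": req.question}, timeout=120.0)
    resp.raise_for_status()
    return resp.json()
```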
One detail, of course, that we have at the bottom of the slide is that the model is loaded when each container is launched; that's something we talked about before. And let's go to the demo; it will be very short. So this demo, as I have it, is built on Intel Kubernetes Service, which is the developer cloud Kubernetes service provided by Intel. It was recently launched, so we are using it, and we are suggesting that users try it as well. It's basically a platform where you get access to the newest hardware that Intel is launching, so you don't have to wait, for instance, five or six months for a CSP to adopt it; you can use it as soon as it's launched. And it's basically the same idea as, if you are familiar with EKS or something similar, we have IKS, the Intel Kubernetes Service. Basically, I have my cluster here, which is very easy to create, and you can also manage the nodes. In our case, we will be using two nodes: this small node, which has the front end, the LLM proxy, and the external OpenAI connection, because we decided to split things up based on the processing and storage needed, so we have this small node for that; and the other one is a big node, an XL node, using the latest generation of Xeon processors. And just to show you real quick, if you would like to create a node, it's very simple: you go to the nodes, and if you would like to create a node group, here you can select from all the instance types you can use, for instance a small, a large, or something else, and this is what I have here. If I go to my environment, it's okay, the fonts, yes. So first of all, this is the GitHub. This is the same thing we were talking about before, but basically you have the four files that describe how you deploy each particular container. This is a recipe; it's not something where you install it and it's working. It's more an educational GitHub, so you have to clone it, create your own containers, upload them to your container registry, and deploy your Kubernetes cluster. Everything is explained. Just to go into a bit of detail, for example, if you'd like to explore the deployment file, here it is: you create the deployment for the pod, the service, and we create a vertical pod autoscaler and also a horizontal pod autoscaler, which is something we showed in the previous slides, and this is how you create it, basically. The ingress is how you expose what we've seen before, like the local Llama, the proxy, and everything, using NGINX, and how you expose the front end; and the PVC, of course, will be different if you are using EKS or any other provider, but you can use it. So basically, this is what I have deployed here. Just to showcase it, let's run kubectl get pods, and you'll see the pods running: the front end, the Llama 2 7B non-optimized, the 7B optimized, the proxy, and the OpenAI model. Just one thing that I think is very important to mention: we talked about the weight of those models. If we look at the Llama 7B non-optimized, it weighs 26 gigabytes, and it's sometimes confusing, gigabits versus gigabytes, while the optimized one weighs 3 gigabytes.
So the point is that, because the model is running there, instead of using 26 gigabytes we are using just 3 gigabytes. So, how do we run the demo? We do a port forward, of course, just to forward what we are doing. This fails because I already have one running, so we need to kill the process. That's a live demo, right? So here we go, that's it. Now let's do the forwarding, and it should be working. So if we go to our browser, to localhost, we will see it. It's not a really fancy interface, to be honest, but we used React, because most people would want to use React for the kind of interactions you can build. We didn't want to use Gradio or something that you cannot turn into a real use case, so we used React because of that. I'm not a React developer, so I did my best. And it gives you the three options: the non-optimized version, the optimized version, and the external version. So the question will be the same question: tell me about Kubernetes. And this is running on the local non-optimized model, so it will take probably eight or ten seconds to give you the answer, because it's a non-optimized model, and you see the weight of that model, 26 gigabytes, so it's a lot of space for running that. So it took ten or twelve seconds. And you can start over, for instance, and now we would like to use the optimized model, and we will ask the same question: tell me about Kubernetes. The optimization we used was not chosen to make generation faster; we just wanted to make the model light. So we get the same results, a bit faster. About half the time. About half the time, but we are using just those 3 gigabytes. And of course, if we start again and we would like to talk to the external model, which in this case is OpenAI, you will see, tell me about Kubernetes, and you get the answer in probably two or three seconds. Of course it's faster, because we're using an external API, so we are not trying to compete against an external API, but basically what we are showing is that with a very light model you can have the same or similar results as something external. The optimizations can go further and further; you can make it even faster and stay very close to the same results. I'm not doing anything. Is that an indication that our time is up? We've got two more minutes. Okay, we can keep going if you like. And, I don't know, let's go to the conclusion slide. Yeah, let's go to the conclusion slide; we'll talk through that one. Go to the slides. Go to the conclusion slide. We'll talk through that one. Oh, there we go. Okay. It's working, it's working, right. So let's go to the conclusions. Yeah, so in terms of conclusions, really the choice of the model you want to use depends on your business need. But the whole idea here is that LangChain makes it simple to switch between multiple models. And we talked about why cloud native is the choice for running your LLM. And optimization, as you noticed, plays a significant role; it may tie you to particular hardware, but depending on the architecture you're using, I think it's an important consideration. Yes, and here you have all the QR codes. You can go to the GitHub and clone it. We have our site, which is Open at Intel, and we also have a podcast, which is very interesting to listen to. So these are the links.
So thank you, thank you for your time. Thank you, I think we are just out of time.