All right. Thank you very much. Hi, everybody. My name is Dave, Dave Southwell, with Def Computing. This is my co-speaker, Ann. We're going to split the presentation into two parts, which is why we're standing the way we are instead of trying to huddle together behind the lectern.

So what are we going to talk about today? It's a very long title, so I forgive you if you haven't read through it all, but you're probably familiar with all the terms: resource-aware, policy-based scheduling for production GenAI with RAG. We've heard a lot of other talks today about different ways to deploy your AI-powered workloads, but I don't think any of them have specifically addressed how you might do it in production. The whole idea behind this talk was: how could you develop a reference architecture for deploying your production AI workloads? That's what we're proposing here. As you'll see later on, we've tried to use open-source software wherever possible. We've also called out a couple of places where you might want to use a vendor for some portions of the architecture. But let's get into it, because we don't have too much time and we do have a demo later.

All right. Why might you want to use GenAI and RAG, or RAG with GenAI specifically? I think most folks here already know the answer to this question, but just in case, it bears repeating: layering RAG on top of GenAI reduces your LLM hallucination risk, and because it augments the model with recent, domain-specific information, it also avoids the cost of needing to retrain an LLM. You can layer a bunch of other components on top of it as well. Over on the right (I don't have a fancy pointer) you can see a generic, high-level diagram that's going to get more and more annotated as we go through.

Okay. You could obviously do a lot of this using paid-for software. However, self-hosting has a lot of advantages, the usual ones: cost, scaling, performance, control, privacy. Privacy seems to be becoming increasingly important for folks, but the other four are quite important too. I'm sure you know most of the big names up there at the top, but there are downsides to self-hosting, complexity being the biggest one, and handling that complexity is resource-intensive. The reference example we're going to share with you today combines deploying multiple Kubernetes clusters (sorry, I say "kubes" in other settings, so sometimes I'll throw that in instead of "Kubernetes clusters") along with a cluster scheduler, and that's going to take away a lot of the administrative overhead of deploying your own.

Let's take a look at the components of our reference architecture. Up at the top, we've got the ingestion job that creates and updates the vector database; the vector database itself can be any of a number of solutions. Towards the middle, we've got the prediction service that combines the model input and query with RAG context. We've got the user interface over here on the left, and then we've got LLM model serving at the bottom. Each one of these components has different resource and availability needs.
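To give you a feel for the ingestion piece, here is a minimal sketch of what a job like that boils down to: embed the documents and build a vector index. It's just an illustration, assuming sentence-transformers and FAISS for the index (we'll name our actual component choices in a moment); the embedding model and the documents here are placeholders, not the demo's exact choices.

```python
# Minimal ingestion sketch: embed a set of documents and build a FAISS index.
# Assumes sentence-transformers and faiss-cpu are installed; the embedding
# model and the documents are placeholders, not the demo's exact choices.
import json

import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Nova is a policy-based multi-cluster scheduler.",
    "Luna is a Kubernetes cluster autoscaler.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(documents, convert_to_numpy=True)  # shape (n_docs, dim)

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
index.add(vectors)

# Persist both the index and the raw texts for the serving layer to load later.
faiss.write_index(index, "docs.faiss")
with open("docs.json", "w") as f:
    json.dump(documents, f)
```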
The vector database piece can run on pretty generic CPU resources, and it's only needed periodically. That may change over time, and of course you can change it with the policy-based scheduler, which we're going to talk about later. The center section, with the LLM-plus-RAG serving and the user-facing part, has higher availability requirements, but it can also run on pretty generic hardware, really just CPUs. The LLM serving component is where you need something a little more exotic, most likely a GPU, although as we've heard earlier today there are options where you might be able to use an accelerated CPU for this too. The point is that each of these parts has different resource needs and different availability needs.

So for this, we're going to use a multi-cluster Kubernetes deployment across multiple clouds. Using multiple clouds means we can take advantage of the benefits you get from operating in a hyperscaler, and by deploying multiple Kubernetes clusters, we can define each one to have the kinds of resources our architecture needs: workloads that need GPUs get deployed to clusters that have GPUs, and those that don't, don't. We can also look for the most cost-effective clusters across the different clouds to deploy on. As we get into it, you'll see that Nova makes this easy to do.

There are a lot of different services you could use for the various components here, and we'd be remiss if we didn't mention some of them: you can automatically deploy your workloads on model rollout, there are a few services you could plug in for the vector database, and there are various cluster schedulers you could use. A couple have been talked about here today: Karmada is one of them, KCP, and Nova. This talk is going to focus on Nova.

All right, what is Nova? Nova is a policy-based cluster scheduler. Among the capabilities it provides is the ability to place workloads based on different spread constraints, label-based and capacity-based, including GPUs. You can migrate your workloads, you can autoscale, and you can do just-in-time clusters as well, across all the major managed Kubernetes services: EKS, GKE, and AKS.

All right, so let's put some specific names to the components we've talked about. This is what we're actually going to have in the demo. For the vector database, we've selected FAISS from Meta/Facebook. For the LLM-plus-RAG serving component, we've selected LangChain with FastAPI. And for the LLM serving component, we've got OpenLLM with a Hugging Face LLM model that we downloaded; again, trying to stick with the open-source spirit as much as possible. We'll skip through this pretty quickly because we don't have a whole lot of time today and we want to get to the demo, but if you're curious, all of these things are on GitHub. There's a link to all the prepackaged Docker containers, as well as the scripts and tooling we used to build this demo.
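Just to make that middle layer concrete: stripped way down, the prediction service looks roughly like the sketch below. This is only a sketch, not our exact code; the backend URL, path, and payload shape are placeholders rather than OpenLLM's actual API, and it calls sentence-transformers directly instead of going through LangChain to keep it short.

```python
# Rough sketch of the prediction layer: a FastAPI endpoint that pulls RAG
# context out of the FAISS index and forwards the augmented prompt to the
# backend LLM serving service. The backend URL, path, and payload shape are
# placeholders, not necessarily OpenLLM's real API.
import json

import faiss
import requests
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

LLM_URL = "http://llm-serving.model-v1.svc:3000/v1/generate"  # placeholder address

app = FastAPI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("docs.faiss")    # built by the ingestion job
documents = json.load(open("docs.json"))  # texts saved alongside the index


class Query(BaseModel):
    question: str


@app.post("/query")
def answer(query: Query):
    vec = embedder.encode([query.question], convert_to_numpy=True)
    _, ids = index.search(vec, k=3)  # top-3 nearest documents as RAG context
    context = "\n".join(documents[i] for i in ids[0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query.question}"
    resp = requests.post(LLM_URL, json={"prompt": prompt}, timeout=60)
    return {"answer": resp.json()}
```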
We talked a little bit earlier about the availability needs for these components. We're specifying one cluster for the vector database, three clusters across three regions for the LLM-plus-RAG serving components to facilitate higher availability, and two clusters for the LLM serving component, a middle ground that gives decently high availability without spending too much money.

All right, we're almost at the demo. This shows the steps Nova goes through when it's rolling out a new model, or really a new deployment, not just a new model. The first step is creating a new namespace and pushing the image pull secrets to all the different Kubernetes clusters. Then we bring up the LLM serving with its load balancer and the RAG model down here. Then we fire up the vector database ingestion on the vector database cluster, and last but not least, we place all of the LLM and RAG serving components across the three clusters to serve the API layer. So it's a four-step process to get all of this deployed across all of these clusters. And you may be wondering, what's the process to take it down? One step: just remove the namespace. Everything lives in one namespace across all the different kubes. Simple. And if you're thinking, well, maybe we could run multiples of these at the same time across different namespaces, you'd be right. You could.
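Since everything hangs off that one namespace, the rollout and teardown really do reduce to namespace operations against the Nova control plane. Here's a rough sketch using the official Kubernetes Python client; the kubeconfig context and namespace names are placeholders, and the actual deploy steps in between live in the repo scripts.

```python
# Rough sketch of the "one namespace in, one namespace out" lifecycle, using
# the official kubernetes Python client pointed at the Nova control plane.
# The context name and namespace name are placeholders.
from kubernetes import client, config

config.load_kube_config(context="nova-control-plane")  # placeholder context name
core = client.CoreV1Api()

# Rollout step 1: create the model namespace. Per the talk, a spread-and-
# duplicate policy then propagates it to every workload cluster under
# Nova's management.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name="model-v1"))
)

# ... image pull secrets, LLM serving, ingestion job, and front ends are
# deployed here by the repo scripts ...

# Teardown: the whole deployment lives in this one namespace, so removing it
# tells Nova to clean everything up on every workload cluster.
core.delete_namespace(name="model-v1")
```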
There are a few components we also wanted to call out as options. We mentioned using FAISS as the vector database, but there are alternatives for that as well: AstraDB is one, Pinecone is one, and Weaviate is another. So if you don't want to spend your time learning FAISS, or dealing with some of the headaches that come with it (Ann might talk a little about the headaches she ran into), you could use a SaaS offering instead. You could also spread your clusters across the different clouds differently than what we've outlined here. And you could run your model operations differently: you could trigger a rollout via GitOps, and you can run multiple versions of the model at the same time in a blue-green mode. Okay, I think this is where we get into the examples. Right, Ann?

Okay. Thanks a lot, Dave. I appreciate it. So, yeah, let's move into the examples. We have two examples that use the approach Dave just described, multi-cluster Kubernetes along with a cluster scheduler, and show how all of the components can be managed properly for running the model in production.

First, some choices we made for the examples. We chose to use AstraDB as the vector DB rather than FAISS. It's not that FAISS won't work (we have the same examples with FAISS in the backup slides of this talk), but we thought that in production, people may want a hosted database. We've already given you links to the packaging you need for all of the open-source components we've looked at, and here's the packaging where we replace FAISS with AstraDB; you'll see it's very easy to plug in a different vector database if there's one you'd rather use. We chose to run ingestion only at model deployment time, but you could obviously run continuous ingestion and it wouldn't change much about our example. We chose to run our examples on a single cloud, which is EKS. And we chose to enable a feature of the Nova cluster scheduler called cluster suspend-resume, or standby. This feature is not on by default, and you don't have to use it if you don't want to, but what it does is: if a cluster is idle for an extended period of time, Nova spins it down, meaning it cycles all of the cluster's non-control-plane parts down to zero resources. So it's basically making things a little greener. If you're already paying for those resources, you might not care whether they're powered up, but at least you're not burning that power. And finally, we trigger the operations manually for the purposes of the demo. I certainly agree that GitOps would be a great way to go if you were doing this in production, but this is just for the presentation.

One little disclaimer here: these examples focus on infrastructure management, not on mad ML skills. I'm not a data scientist. I chose a respectable embedding model for the vector DB and a respectable open-source LLM, but a different model might work better for your workload. Similarly, the resource sizing of the Docker containers we used, which you'll see in our repo, was chosen to be able to demonstrate something; it wasn't tuned heavily for optimal efficiency.

With those disclaimers out of the way, our first example is a series of operations you would do on a model in production: you roll out version 1 of the model, you roll out version 2 and make sure it's working as well as version 1, and then you retire version 1. Let's see how the approach we've been discussing handles this situation.

At the beginning of time, the Nova control plane, which just looks like a Kubernetes cluster you talk to, is managing six clusters in this example. The first three clusters, with the "cpu" prefix, are the clusters where we want to run the user-facing part of our LLM-plus-RAG serving. We located those in three different regions across the world; they need to be highly available, and they need to respond to the user even if they can't talk to other parts of the system. Then we have two clusters that have GPUs available. In this case each of them has only one A10G GPU, because I wanted to show resource availability and the selection of a cluster based on resources, and I wanted to do that in the simplest way possible. And finally, there's a cluster on which ingestion will run when the ingestion job runs. After we let those clusters sit for a while they're all idle, so they all go into standby; I just wanted to show that.

All right, so we're ready to roll out version 1 of the model. What do we do? It's basically a one-liner, and you'll see the script in our repo. The one-liner's arguments are: the model namespace I want to deploy the model into; the secrets all of the clusters need in order to pull the images for their part of the workload; "true", meaning populate the vector database on this run (we only populate it during the version 1 rollout and keep that vector database going after version 1); the type of input; the cluster on which to run the ingestion job, which is ingest-us-west-1; and then two arguments describing the raw data I want to ingest. For our particular example, we're going to ingest a sitemap, and that sitemap contains documentation from the company.
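For a sense of what that sitemap-driven ingestion amounts to, here's a rough sketch: fetch the sitemap, pull out the page URLs, and download the pages that will get chunked and embedded. The URL is a placeholder, and real pages would need HTML cleanup that I'm skipping here.

```python
# Rough sketch of turning a sitemap into raw documents for ingestion.
# The sitemap URL is a placeholder; real pages would also need HTML cleanup
# before being chunked and embedded into the vector database.
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://docs.example.com/sitemap.xml"  # placeholder

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).text)
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

pages = []
for url in urls:
    resp = requests.get(url, timeout=30)
    if resp.ok:
        pages.append({"url": url, "text": resp.text})

print(f"fetched {len(pages)} pages for embedding")
```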
The final three arguments are the vector DB credentials, used both for populating the vector DB and for querying it.

Step one, which the script executes, is to place the model namespace and the image pull secret on all the clusters being managed by the Nova control plane. First a policy is applied to the Nova control plane; that policy applies to the namespace, and it's a spread-and-duplicate policy: the namespace for the model will be duplicated and spread across all the clusters under management by Nova. The next step does the same thing for the secrets needed to pull the images.

Step two is to place the LLM serving layer, the layer that needs the GPU. We create a simple policy that says: place the things this policy pertains to on a cluster that has adequate resources for them. We then deploy the LLM serving along with its service, as well as the service for a front-end load balancer, and that resource-availability policy places those on one of the GPU clusters we saw. We wait for that deployment to finish, and we get an external IP address, which will then be used by our customer-facing layer to make calls to the backend GPU model.

Step three is to populate the vector DB with the vector DB ingester. Here we're using yet another policy, a specified-cluster policy: place this job on the cluster I specify. So when we run the data ingestion job, the policy applies to that job, and the data is ingested and put into the vector database.

And the final step is to put the front ends on the three clusters that we located in different geos across the world. In this case the service needs to be spread and duplicated, just like the namespace and image pull secrets, but not across all the clusters, just across the three CPU clusters we set up in the three geos. So we label those clusters, and our policy says: apply this policy to the labeled clusters, spread and duplicate. And since we spread and duplicate, we get back three external IPs for the three entry points to the system in the three geos.

So it's one step from the user's point of view, just running the script, and all the operations are executed against the Nova control plane's Kubernetes cluster, so you feel like you're talking to one cluster even though these disparate clusters are handling the workloads. At this point we execute a call against the model, ask it a question, and it responds.

So what's happening with our clusters now? We see that the three CPU clusters are active and ready to take requests, so they're no longer in standby, no longer idle, and one of the two GPU clusters is also busy, ready to serve when requests come in.
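And that "execute a call against the model" step is nothing fancy: it's just an HTTP request against one of the external IPs the script printed. Roughly like this sketch; the IP, path, and payload shape are placeholders that match the serving sketch from earlier, not necessarily the demo's exact interface.

```python
# Rough sketch of asking the deployed model a question: an HTTP call against
# one of the three external IPs returned by the rollout script. The IP, path,
# and payload shape are placeholders.
import requests

external_ip = "203.0.113.10"  # placeholder; use one of the returned IPs
resp = requests.post(
    f"http://{external_ip}/query",
    json={"question": "How do I place a workload on a cluster with a GPU?"},
    timeout=60,
)
print(resp.json())
```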
Okay, now we're going to deploy version 2 of the same model. In version 2 we're going to use a different namespace so that we can separate the two models in terms of the resources they're using. We're not going to populate the vector database this time; we want version 2 to use the same vector database that version 1 did. The rollout goes through very similar steps (I've left them out here for brevity), but again we get the three addresses, this time for version 2, and we execute a command against version 2 of the model and see that it responds with the answer we expect. At this point five of the six clusters are active: the two GPU clusters are both active, each of them with one GPU, so each of them running a separate copy of the LLM serving model, and the three front ends continue to be active.

So now, let's say we've done a lot of validation and we decide that version 2 is really good, better than version 1, and we're willing to retire version 1. As we said, the simple answer is to remove the namespace. Behind the scenes there's a lot of work to remove that namespace, because the Nova control plane has to make sure that all of the workload clusters remove it, but from the standpoint of you, the person administering this system, it's a very simple operation. Now we're back to one of the GPU clusters being idle, and we're done. That's example one of things you would do in production; you can imagine you have multiple models, each with multiple versions, so you would have a scaled-up version of what we just saw.

The second example has to do with: what if I sometimes want more resources than I have statically? In the second example we have version 1 and version 2 of the model, but what if we wanted a version 3? In my artificial setup I can't run a version 3 of the model, because I don't have enough GPU resources: I have just one A10G in one cluster and one A10G in the other cluster, so I can't have three copies of my model running at the same time. But the Nova cluster scheduler recognizes when a target workload cluster that it's managing has an autoscaler running in it; the autoscalers it currently recognizes are the Kubernetes Cluster Autoscaler and the Elotl Luna cluster autoscaler. If Nova can't find static resources to handle the workload it wants to deploy, it sends the workload to a workload cluster that has an autoscaler, with the idea that the autoscaler will be able to handle it.

So here we set up the Luna cluster autoscaler in one of the two GPU clusters. Model v1 and model v2 are deployed just as you previously saw, but when we get ready to run model v3, it just sits there pending, because it can't get the resources it needs. Because we care a lot about model v3 and we're willing to pay extra for it, we configure it to be placed by the autoscaler, and the autoscaler then scales up that cluster, the east-2 cluster, and we can now run version 3.

You might be thinking, well, that's pretty cool, but what about when I scale down by retiring version 1 or version 2? What should happen with version 3? Does it just sit there? The answer is: what do you want to have happen? Some people would say, I want version 3 to stay where it is, on the dynamic resources, because I don't want to disrupt anything. That's fine, and it's what happens by default. But if you would like rescheduling, because you've now freed up resources in your static capacity, you can make a change to the configuration and let the Nova control plane re-place that workload. So you can set the system up to do that.
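By the way, that "just sits there pending" state is easy to spot: the version 3 pods show up as Pending until the autoscaler adds capacity. Here's a rough sketch of checking for that with the Kubernetes Python client; the namespace and kubeconfig context are placeholders.

```python
# Rough sketch of spotting the "version 3 just sits there pending" state:
# list the pods in the model namespace and report any that are unscheduled.
# The kubeconfig context and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config(context="nova-control-plane")
core = client.CoreV1Api()

for pod in core.list_namespaced_pod("model-v3").items:
    if pod.status.phase == "Pending":
        print(f"{pod.metadata.name} is pending: "
              "no static GPU capacity, waiting for the autoscaler")
```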
So basically, we've shown you an easy and efficient approach to self-hosting production LLM-plus-RAG models. It combines multi-cloud Kubernetes clusters, for their aggregated, scalable resource management, with a policy-based, resource-aware cluster scheduler, Nova, to manage those multi-cloud Kubernetes clusters. Nova supports a variety of policies that provide the functionality you need to get this done; it can optionally place clusters in standby if they're lightly used and you're interested in that, and it can interoperate with the cluster autoscaler if you're interested in that. And we presented LLM-plus-RAG model rollout and retirement using this approach. So thank you very much. Please rate our talk here, and give the software a try: we have all of the scripts we used in a repo, and you can freely try both Nova and Luna if you'd like. Thanks very much, thanks for your time. We have time for questions, I don't know if there are any; I think we've got about a minute or so. I think that mic died... there we go.

So, the question: you've got GPU and CPU; can you define lots of other dimensions of stuff? Because Azure just has different stuff from EKS and Google Cloud. Can you define, depending on what resources a workload wants, which cloud it lands in, based on what's available, including services? Say I've got a service, but it needs one of the cloud-specific features, like, I don't know, Neptune or something, that runs in Amazon. Can you define that this workload has a bunch of requirements, so it needs to go to EKS and not to AKS?

A very good question. With respect to the things you can define as part of the policy: we put in a bunch of built-in stuff, like you said, CPU, GPU, memory, and so on, and then we put in the ability to label things. You saw me labeling the three CPU clusters that I wanted in the three geos, so that those workloads would only go there. So I would say, at least with the current system, you would label the clusters that have those specific things, and then make sure the policy that uses that label is applied to the workloads that have that constraint. I think that's the basic approach: that way we don't have to keep inventing an infinite number of configurations, but it lets you label things as you need to. Any others? Thank you.