All right, so let's start. The first talk today is building a multi-cluster, privately hosted LLM serving platform on Kubernetes. Our speakers are Noah Yoshida and Julian Brite from Predibase. So thanks a lot. Let's welcome them.

Thank you, everyone, and apologies for the slow start. It's not because I'm from Australia, it's just a technical issue. But yes, I arrived just yesterday from Australia, and it's great to be here. Chicago is a fantastic city, and I hope you all have a great time over the course of the next few days. So we're talking about privately hosted LLMs on Kubernetes. We're from Predibase; I'm a platform engineering manager there, and we've both been with the company pretty much since the early days. We're a startup about two years into our journey, and we're really excited about large language models. We've built our platform on Kubernetes, so we'll be talking a little more about that today, and Noah will be diving deep on some of the technical challenges and the ways we've approached solving them.

So briefly, I was a bit surprised to see LLMs so small on the topic slide there, because that's all we think and talk about these days. For those who aren't familiar, LLM stands for large language model. These started back in 2018, when the first generative pre-trained transformer model, or GPT, came out. Since then there's been real progress, in the last couple of years in particular, with GPT-3 from OpenAI leading to ChatGPT, which I'm sure you've all heard of. That very powerful, very large 175-billion-parameter model has made it possible to do some incredible things. These models are trained on vast amounts of data, but they're also very flexible and very capable in what they can do out of the box, in what's called zero-shot mode. Effectively, you provide an input, a prompt, and the model can do things like sentiment analysis, transcription, and many other text completion tasks, as well as writing prose and other kinds of content. They're also useful in enterprise contexts, which is where most of our customers are interested; in fact, taking an open source model, fine-tuning it, and serving inference from it back to the business is a really critical capability. What you can see on this slide is that since those large language models appeared, there's been a lot of open source investment, particularly in models like Meta's Llama and Llama 2, which are available in smaller sizes of 7, 13, and 70 billion parameters but are still very powerful. And when you fine-tune these models, you find you can get amazing results.

So very quickly, in terms of what we do at Predibase: we've got a platform that allows you to connect your enterprise data and then build, train, and deploy these large language models, at scale, in your own cloud as well as on our managed SaaS platform. In terms of the challenges involved in meeting these expectations of delivering inference, I won't go into them in too much detail, but the tools are complex. You're often dealing with low-level Python programs when fine-tuning these models, and it's very easy to run out of memory: particularly with very large models, on the order of 13 billion parameters or more, you can often hit out-of-memory exceptions.
And so making sure you do that efficiently can save you a lot of time and cost. Then in terms of serving, even the smaller models require GPU compute: NVIDIA T4, A10, or A100 GPUs are typically required. So in terms of what we do at Predibase, we make this easy. We provide an open platform where you can bring the expertise of your engineering and developer staff and basically just get started through our API and SDK. For efficient fine-tuning, we're built on an open source library called Ludwig (ludwig.ai), which has a lot of flexible capabilities, including training with LoRA weights, which Noah will talk about a little later in the presentation (a rough sketch of what this looks like through Ludwig's API follows below). And finally, on the serving side, we make it cost-effective with just-in-time serving infrastructure provisioning and scaling, using some of our own technology called LoRAX, which Noah will also touch on. So without further ado, I'll hand over to Noah to continue with some more technical details. Thank you.
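For a flavor of the LoRA-based fine-tuning just mentioned, here is a minimal sketch using Ludwig's Python API. The base model, feature names, and dataset path are illustrative assumptions, not Predibase's actual configuration.

```python
# A minimal sketch of LoRA fine-tuning with Ludwig's Python API. The base
# model, feature names, and dataset path below are assumptions for
# illustration, not Predibase's actual setup.
from ludwig.api import LudwigModel

config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",   # hypothetical base model
    "adapter": {"type": "lora"},                # train only small LoRA weights
    "input_features": [{"name": "prompt", "type": "text"}],
    "output_features": [{"name": "response", "type": "text"}],
    "trainer": {"type": "finetune", "epochs": 3},
}

model = LudwigModel(config)
model.train(dataset="fine_tuning_examples.csv")  # hypothetical dataset
```

The adapter section is the piece that restricts training to the small LoRA weights discussed later in the talk, rather than updating the full model.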
Cool, thank you, Julian. So one thing Julian didn't mention is that at Predibase, pretty much everyone came into the company with almost no Kubernetes experience. So this has been quite a journey for us, and I think it really speaks to the power of open source, Kubernetes, and cloud-native technologies that we were able to do all of this with a pretty small team. From here on, I'm going to be talking specifically about the LLM serving portion of Predibase, some of the challenges we encountered while building it out, and how we navigated those.

Once we started building this platform, there were three big challenges we faced: GPU availability, GPU cost, and finally privacy. Despite the many breakthroughs in fine-tuning and serving on commodity hardware, there are still a lot of limitations around what you can do unless you have a lot of GPU memory. Currently the standard for doing this kind of work with large language models is the NVIDIA A100, which is quite difficult to find at the moment, as you'll know if you've ever tried to acquire one in AWS or Azure. And even if you do manage to get them, they can be quite expensive. Finally, a lot of the customers we've talked to care deeply about having their models and training data remain in their control. They're not comfortable sending proprietary data and very important models off to be trained in someone else's cloud; they want all of that to remain in their private VPC, in their own account. So building around these challenges and overcoming them was our main goal at Predibase when building our LLM serving platform.

The Predibase serving stack can be broken down into several distinct layers, the first of which is our multi-cluster service mesh. So why multi-cluster, first of all? The first reason, as I alluded to earlier, is that customers care about privacy. Building Predibase with a multi-cluster service mesh has allowed us to deploy the data plane components, where the serving and fine-tuning happen, into the customer's VPC, so that they remain in control of their data. It has also allowed us to deploy into environments where we can find cheap and available GPUs, such as dedicated GPU clouds or on-prem providers. And then finally, using a service mesh, and I won't talk about this too much, has allowed us to scale our platform a lot more, because we don't have to deploy the control plane and data plane at the same time; we can deploy independent data planes for each customer, all managed by a single control plane.

Here's a high-level view of the Predibase architecture. It consists of multiple Kubernetes clusters, all registered into the same Istio service mesh. Each customer's data plane, if they choose to deploy this way, lives in their VPC, in their cloud account. All of these cloud accounts also have blob storage provisioned, which is where model artifacts go, and that has allowed us to alleviate some of the concerns around data governance. Alternatively, if customers are okay with us hosting their compute, they can register with the Predibase AI cloud platform. This is compute managed by us, with multiple Kubernetes clusters spread between both the public cloud and dedicated on-prem GPU clouds. When customers queue up jobs for fine-tuning or serving, we can route each job, based on heuristics, essentially whether it needs A100s, to either the public cloud or the on-prem cloud. Using Istio and this multi-cluster deployment has allowed us to really alleviate both the GPU availability and cost restrictions as well as the privacy concerns.

Going a bit more into that privacy part: for those familiar with Istio, this will look familiar. We configure authorization policies through a centrally managed Istio control plane, which lets us manage all of this centrally rather than in the customer data planes, so that we can restrict communication between customer deployments.
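To make that concrete, here is a minimal sketch, using the Kubernetes Python client, of the kind of Istio AuthorizationPolicy that enables this isolation. The namespace names are hypothetical, illustrating the mechanism rather than Predibase's actual policies.

```python
# A sketch of an Istio AuthorizationPolicy that only admits traffic into a
# customer's namespace from the control plane. Namespaces here are
# hypothetical; this illustrates the mechanism, not Predibase's exact policy.
from kubernetes import client, config

config.load_kube_config()

policy = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "AuthorizationPolicy",
    "metadata": {"name": "isolate-customer-a", "namespace": "customer-a"},
    "spec": {
        # Once an ALLOW policy matches a workload, Istio denies any request
        # that doesn't match one, so this fences the namespace off from
        # other customers' deployments.
        "action": "ALLOW",
        "rules": [{"from": [{"source": {"namespaces": ["control-plane"]}}]}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="security.istio.io",
    version="v1beta1",
    namespace="customer-a",
    plural="authorizationpolicies",
    body=policy,
)
```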
So next, going a bit deeper, is serverless inference. First of all, why would we want serverless inference at Predibase? The two main reasons, as you could probably guess, are cost and traffic. As I mentioned earlier, GPUs are quite expensive; even if you can find them, A100s aren't cheap. So we really don't want to be running models that aren't serving traffic. This matters both for us and for the customers, because if they deploy the data plane into their account, they're paying for all the compute. We also want to be able to scale the models up as traffic increases, because LLM traffic can be quite bursty, with people wanting to do a bunch of experimentation at one time and then forgetting about it and turning it off after a while.

Here's a slightly more detailed look at a Predibase control plane and data plane, which will help us understand how we first developed LLM serving at Predibase. Remember, of course, that there are multiple data planes, but just one is pictured here. First, when a customer wants to provision an LLM, the request comes in through our Istio ingress gateway in the control plane. It goes through our API gateway for authentication and then to our workflow orchestrator, where we kick off an asynchronous workflow that gets picked up by the Predibase agent, which lives in the data plane.

A side note: this is all application-level workflow management; it isn't using any Kubernetes machinery, in case you're wondering. The reason we do it this way is that we want some work executed on the data plane, where we need to talk to the Kubernetes API server, and we generally weren't comfortable handing out API server access universally across all these data planes. So instead, the Predibase agent takes care of creating deployments and so on within the cluster. Once the Predibase agent gets the request to provision an LLM, it creates the deployment, and the weights for that model are downloaded from that data plane's model store, blob storage that also lives in the customer's account or VPC. Then, during LLM inference, requests come in through the gateway, get authenticated, and are routed to the LLM in the data plane via an Istio virtual service.

So this is all well and good, but let's look at what we had to do to make this serverless, because so far it's pretty bare-bones: it doesn't autoscale or do anything like that. What we used to introduce autoscaling and serverless behavior into this LLM serving was KEDA, the Kubernetes Event-Driven Autoscaler. It's a project that allows you to scale Kubernetes deployments based on a variety of metrics or events in the cluster. In our case, we use the HTTP add-on, which lets us scale the deployment up and down based on the number of incoming requests, although we are looking at a variety of different scaling options for different needs.

Looking a bit more at what provisioning looks like with KEDA: instead of just creating the LLM service and deployment, as one might in a standard Kubernetes environment, we also create an HTTPScaledObject, a custom resource from the KEDA HTTP add-on. This configures the LLM deployment to scale up or down based on parameters you specify in the object, all managed by KEDA, and it lets us scale down to zero as well as up to one and beyond.
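Roughly, such an HTTPScaledObject looks like the sketch below, created here with the Kubernetes Python client. The names, host, and port are hypothetical, and the exact spec fields vary between versions of the KEDA HTTP add-on.

```python
# A hedged sketch of an HTTPScaledObject for an LLM deployment. Field names
# vary across KEDA HTTP add-on versions; names, host, and port are invented.
from kubernetes import client, config

config.load_kube_config()

scaled_object = {
    "apiVersion": "http.keda.sh/v1alpha1",
    "kind": "HTTPScaledObject",
    "metadata": {"name": "my-llm", "namespace": "serving"},
    "spec": {
        "hosts": ["my-llm.example.com"],   # requests for this host are counted
        "scaleTargetRef": {
            "deployment": "my-llm",        # the LLM deployment to scale
            "service": "my-llm",           # service fronting the deployment
            "port": 8080,
        },
        "replicas": {"min": 0, "max": 1},  # scale to zero when idle
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="http.keda.sh",
    version="v1alpha1",
    namespace="serving",
    plural="httpscaledobjects",
    body=scaled_object,
)
```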
So let's imagine we have an LLM deployment that has been provisioned but hasn't received traffic in some time, so KEDA has scaled it down to zero. When a request comes in, it enters the data plane through our Istio gateway and gets routed to the KEDA HTTP add-on's interceptor proxy. This routing is dynamic; we configure it with Istio virtual services. Once the proxy receives the request, it talks to the KEDA controller, which looks at the HTTPScaledObject and determines whether the deployment is ready to receive requests, or whether there are too many in the queue and it needs to scale up to more than one pod. Once it has determined all of that, it tries to scale up the LLM. While the scaling is happening, the proxy actually holds on to the requests. Then, once the LLM is scaled up, the proxy sends the requests along the normal path, through the Kubernetes service to the deployment and on to the pods. And that's how inference happens. This is pretty cool, and it makes for a fairly serverless experience: you can query an LLM that hasn't been used for a while and might have been scaled down, and you'll get a response back. Unfortunately, the scaling itself is quite difficult.

Scaling an LLM up from zero, we found, performed very badly. It would take anywhere from 15 to 20 minutes, which is obviously not a serverless experience at all; the request would time out and you'd have to wait a while. The problem is that we're not only scaling pods, we're also scaling nodes. At Predibase, for LLMs, we run only one pod on a dedicated node, essentially because the LLM needs so much GPU memory that it will use pretty much the entire GPU on the node, so it doesn't make sense to run much else there. These nodes are also very expensive, and the whole point of autoscaling was to save money, so we don't want them sitting around for a long time incurring cost, either on our side or on the customer's side.

This brings us to the next layer: cold start optimizations. As mentioned, it's very difficult to scale these LLMs up, and two of the reasons are that the containers are large and the LLM model weights are even larger. We also have to acquire the node into the cluster, but that's a different set of problems. The LLM containers are large because they have to ship with things like CUDA and custom kernels to actually serve the LLMs. We've done many different things to slim down our container images, but they're still many gigabytes in size. The second problem is the LLMs themselves being large: for those who are familiar, something like Llama 2 70B is about 150 GB on disk, so we have to download all of those weights and then load them into the GPU, which can take some time as well.

I'm going to split this into two sections, the first being container download optimizations. After a lot of trial and error, we landed on an open source project called Spegel. Spegel is an in-cluster container registry mirror: it mirrors a container registry that's external to the cluster, from inside the cluster. I'll talk a little about how this works. First, it runs as a DaemonSet on all of the nodes. When a pod is started on a node in Kubernetes, the image for the pod is downloaded onto the node, and this download happens layer by layer, not all at once. What Spegel does is modify the containerd settings on the node so that, instead of going to the default registry first, an image pull goes to Spegel when we try to download a layer. Spegel then asks all of the other Spegel pods in the cluster whether any of the nodes they live on has the layer we're looking for. If the layer is found in the cluster, it's downloaded from those nodes rather than from the external registry, and we found this dramatically speeds up the download process. Of course, if a layer isn't found, it has to be fetched from the remote registry, but that layer is then in the cluster, so any other pod that needs it to start up can download it from inside the cluster.
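For a sense of the mechanism, the snippet below writes the kind of containerd mirror configuration (a hosts.toml) that Spegel maintains on each node for you. You never do this by hand, and the local mirror address here is purely illustrative.

```python
# Illustrative only: the sort of containerd mirror config Spegel maintains.
# It tells containerd to try a node-local mirror before the upstream
# registry. The mirror address is hypothetical; Spegel automates all of this.
from pathlib import Path

HOSTS_TOML = """\
server = "https://registry-1.docker.io"

# Try the node-local mirror first for pulling and resolving layers;
# containerd falls back to the upstream server if the mirror misses.
[host."http://127.0.0.1:5000"]
  capabilities = ["pull", "resolve"]
"""

path = Path("/etc/containerd/certs.d/docker.io/hosts.toml")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(HOSTS_TOML)
```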
And another side note is that this does require containerd to be the container runtime in Kubernetes, but we found most of the cloud providers have switched over from Docker, so this hasn't been an issue for us. Overall, after adding Spegel with very minimal configuration, we saw a pretty dramatic decrease in container download time: for our larger GPU images, it went from about eight minutes down to four. This was with very, very little configuration, and it was pretty plug-and-play, which is why we ended up using it over some of the other things we tried.

Next, a little more LLM-specific, let's talk about weight download optimizations. Before we added anything, the weight download process at Predibase looked like this: when the LLM server started up, we would download the base model, usually from Hugging Face Hub, though we do support a few other sources as well. Downloading the weights could take anywhere from 1 to 15 minutes, depending on how large the model was, and then we'd have to convert them into the safetensors format if they weren't in it already, which took on the order of several minutes. Once they were converted, we could start up the LLM and begin serving traffic.

To fix this, because that was quite a bit of time, we added three things: an init container, a shared storage volume between the init container and the main server, and a private cache in the data plane's blob storage. When the cache is cold, we obviously still have to do what we did before and download the weights, but now that takes place in the init container. The init container handles converting the weights to safetensors format and then pushes them to the cache. Once they're in the cache, we don't have to do this again, so it only happens once per data plane for any given LLM. The safetensors weights are also mounted on the shared storage volume, so they're accessible to both the init container and the main serving container. Once the cache is warmed up, whenever we start an LLM in the data plane, we download the weights from this cache. We actually found a pretty fast way to do this, at least in S3, using the S3 CRT (AWS Common Runtime) library: a set of AWS client libraries written in C, available in most of the SDKs as well as the CLI (see the sketch below). By enabling this, we can do multi-threaded downloads, which is a lot faster. As an example, for something like Llama 2 13B, which is about 26 to 30 GB uncompressed, we were able to download the weights in less than a minute, where before it took about five minutes. We can do this partly because the nodes these LLMs run on have very fast networking and a lot of CPU and memory; they're high-performance nodes for GPU workloads. And if by chance we do serve multiple LLMs on a single node, the weights are cached on that node, so we can start up the LLM almost instantly.
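As a hedged sketch of that CRT-enabled download path: installing boto3 with its CRT extra lets transfers like download_file be handled by the AWS Common Runtime, which splits them into many parallel requests. The bucket, key, and paths below are hypothetical.

```python
# A hedged sketch of pulling cached safetensors weights from S3 with the
# CRT-based transfer path. With awscrt installed (pip install "boto3[crt]"),
# boto3 can hand this download to the AWS Common Runtime, which parallelizes
# it across connections. Bucket, key, and paths are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.download_file(
    Bucket="dataplane-model-cache",        # hypothetical cache bucket
    Key="llama-2-13b/model.safetensors",   # hypothetical weight object
    Filename="/models/llama-2-13b/model.safetensors",
)
```

The AWS CLI similarly lets you select the CRT-based transfer client through its s3 configuration, which is handy when the download step is a shell script rather than Python.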
So here's a timeline view of what serving looked like beforehand. As you can see, everything took quite a bit of time, all the way from acquiring the node to the weight conversion we had to do, which took several minutes. But after all these optimizations, we were able to get the container download down to about four minutes and the weight download to under a minute in the warm-cache scenario, and of course there's no conversion that needs to happen either. Another side note: a lot of the remaining time is now acquiring the node and downloading the container, which can be eliminated entirely if you don't autoscale your nodes. But again, we're coming from a perspective of having deployments in customer clouds as well as in our own cloud, and we don't want to keep nodes around for very long because they incur a lot of cost.

Finally, we have one more optimization around LLM serving, something Julian alluded to, called LoRAX. It's more of an application-level optimization, so I'm going to step away from Kubernetes a bit here, but I figured it would be interesting for everyone. When you fine-tune a model at Predibase, you can fine-tune using a method called LoRA. For those who are not familiar, with the LoRA method you train a very small set of additional weights instead of all of the weights in the model, and when you want to run inference with the fine-tuned model, you load in these LoRA weights and add them to the base model. So you actually have two distinct things: the LoRA weights, called the LoRA adapter, and the base model. At Predibase, we created a system that lets you dynamically load these LoRA weights at inference time. Because they're so small, this doesn't incur a lot of latency. So instead of deploying a whole fine-tuned model every time you want to query it, you can run just the base model and specify the adapter you want at query time (a rough sketch of such a request follows at the end). We've seen about 200 milliseconds of added latency for loading adapter weights. And this handles concurrent requests as well: if you have multiple users or multiple models you want inference from, you can send queries to the same base model with a different adapter in each request. The adapters are loaded into the model, and they can be loaded while other requests are in flight, so we can batch things together and increase throughput.

Yeah, so that is a very brief but, I'd say, pretty thorough look at all the serving happening at Predibase, all the cloud-native work and a bit of the specialized machine learning optimization we've done to make serving work. So yeah, thank you. If you want to learn more about LoRAX, I've linked a post about it on this slide, and if you want to connect with me and Julian, those are our LinkedIns. There are a few of us here from Predibase, so if you see us around, we'd love to chat. We can talk about anything from LLMs and AI all the way to every kind of Kubernetes thing we've done, because we've done a lot more than what was mentioned here. So thank you.
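As promised, here is a rough sketch of what the dynamic adapter loading described above can look like against a LoRAX-style /generate endpoint. The URL and adapter names are hypothetical.

```python
# A hedged sketch of querying one LoRAX-style base-model deployment with two
# different LoRA adapters, swapped per request via the adapter_id parameter.
# The endpoint URL and adapter identifiers are hypothetical.
import requests

LORAX_URL = "http://localhost:8080/generate"  # assumed LoRAX endpoint

for adapter in ["customer-a/sentiment-lora", "customer-b/summarize-lora"]:
    resp = requests.post(
        LORAX_URL,
        json={
            "inputs": "Review: the new release is fantastic!\nLabel:",
            "parameters": {
                "max_new_tokens": 16,
                "adapter_id": adapter,  # adapter loaded at request time
            },
        },
        timeout=60,
    )
    print(adapter, "->", resp.json())
```

Both requests hit the same base model deployment; only the small adapter weights differ, which is what lets one set of GPUs serve many fine-tunes concurrently.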