All right, fantastic. Well, thank you, everyone, and welcome — thanks for joining us today. We're going to be talking about overcoming the GPU shortage with Virtual Kubelet and a distributed cloud. I'm Kyle Dodson, head of engineering at Salad Technologies, and with me today I have Dean Troyer, a software engineer here at Salad as well. Good morning. Yeah, good morning.

Dean and I work at Salad, where we're building a distributed cloud computing environment focused specifically on GPU-based applications. Within that landscape we've run into plenty of challenges and ideas, and what we'd like to talk about today are some of those challenges of running GPUs at scale in a cloud-native environment. We'll talk about one particular CNCF-sponsored project — spoiler alert, it's in the title — and how it provides an interesting and unique answer. And, provided the demo gods are on our side, Dean will show it off and give us a walkthrough of what a project looks like using it.

To key it up, the first challenge is one we hear a lot about. This is from an Andreessen Horowitz article back in April: demand outstrips supply by a factor of ten, and access to compute resources at the lowest total cost has become a determining factor for the success of AI companies. Talking with numerous customers over the past few months, I've learned this has much further-reaching implications than just for the companies focused on generative AI and large language models. I also wish I could say that, half a year on from that article, things were looking better. Unfortunately — next slide — numerous headlines and industry reports continue to confirm that we're in the midst of another GPU shortage.

And I say "another": if you think back four or five years, GPU demand was booming, thanks in large part to cryptocurrency and proof-of-work projects. Then in 2020 there was a significant global event — you may all have heard of it — and demand for PC hardware reversed course; we actually saw positive double-digit growth on that side. Couple all of that with numerous supply chain issues, and a lot of us were no strangers to GPU shortages even a few years ago. And while the PC purchasing trends were largely temporary and proof-of-work demand has waned significantly, we still see forecasts today that the supply constraints will likely persist for another year and a half.

Moving on to the next slide, a second challenge we frequently see and talk about: running GPU-based apps on Kubernetes can be challenging. There were some interesting conversations at KubeCon a few weeks ago, and a couple of quotes I picked out of one of the keynote presentations: the complexity of setting up and maintaining device-specific drivers and runtimes, and the lack of expressivity and selection control — limitations around controlling exactly how much and what type of hardware is required for a given workload. A second theme in those challenges is the continuing shift to put compute closer to the edge, and some interesting callouts about how Kubernetes was set up with particular use cases in mind.
And as we move forward, how do we solve those challenges as these application trends continue to build and shift on us? Next slide. So, is there a magic solution? I wish I could say it was as easy as using one of these generative AI image generators to make a fancy picture, but apologies, no — we don't have a magic solution to NVIDIA GPU procurement issues, and we can't help anybody jump the queue. Ultimately, challenges like these will be solved in time. For example, also out of the KubeCon conversations, a number of talks focused on how Intel, NVIDIA, and others in the community collaborated to bring an alpha-quality API to Kubernetes 1.27 called Dynamic Resource Allocation. It uses a claims-based approach to modeling hardware requirements, and it shows a lot of promise as an alternative style for declaring more complex compute demands (there's a rough sketch of that claims-based style a little further down). So between supply chains and the developments that continue to happen in the community, things will improve over time. But how do we weather the storm while we wait for all of that to come together?

There are a number of interesting CNCF-sponsored projects developing frameworks and tools that focus on running workloads on nodes that are an extension of your Kubernetes cluster. They aim to run workloads closer to the edge, at a lower cost, or both. By leveraging these tools, we've been able to effectively scale workloads in the near term and optimize spend along the way. So today, as I noted, we're going to focus on one specific project, although there are several.

Next slide: Virtual Kubelet. There's a great definition on the website: Virtual Kubelet is an open-source Kubernetes kubelet implementation that masquerades as a kubelet — so we have to say kubelet three times there. The idea is that we create a virtual node in the cluster that becomes available for scheduling, which allows us to take advantage of any compute infrastructure we can talk to and scale out to. The super neat thing, in this landscape of access to GPUs, is that a tool like this lets us find and use the GPUs where they are available — it really doesn't matter where they physically exist.

On to the next slide. As we've focused on leveraging this, we're really thinking about optimizing for efficiency with the resources we have. A couple of themes come up as we go through it. First, put latent resources to work. It's obvious, but if you have hardware, put your on-prem hardware to use and fully maximize it. Once you have, and if you need other places to scale out to, a lot of people are familiar with hybrid cloud models — leverage that, scale out across these different environments, and maximize how you use the hardware that exists out there. The second big theme, as we talk with people, is to right-size your hardware selections. It also sounds obvious, but leverage benchmarks and compare your options. We've talked with a number of people who don't go through this exercise and end up massively overpaying, given the constraints we see on the supply side.
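(To make the Dynamic Resource Allocation mention above a bit more concrete, here is a rough sketch of the claims-based style in the 1.27-era alpha API. The alpha schema has changed shape between releases, and the resource class, image, and names below are purely illustrative, so treat this as the flavor of the approach rather than exact syntax.)

```yaml
# Rough sketch of the DRA claims-based style (alpha API circa Kubernetes 1.27).
# The class name and image are illustrative placeholders.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    resourceClassName: example-gpu-class      # provided by a DRA resource driver
---
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: registry.example.com/inference:latest   # placeholder image
    resources:
      claims:
      - name: gpu                              # the container consumes the claim by name
```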
The prices for a lot of GPU hardware don't scale linearly, so be sure to run through that exercise and make sure you select only what you need. That also goes a long way toward addressing how much supply is available across these different environments.

Shifting to the next slide: what is the Virtual Kubelet actually doing for us, and how does this virtual node get our applications running? Here's a picture. Imagine we have a couple of Kubernetes clusters on-prem, with nodes that have access to GPU resources. As we create a deployment and scale up in one cluster, we might find we've maxed out what we have there, while another cluster has underutilized compute. Virtual Kubelet presents a virtual node to the Kubernetes control plane, and pods can be scheduled onto it (there's a sketch of roughly what that node object looks like below). But under the hood, the compute doesn't run there. Instead, the Virtual Kubelet turns around and calls a remote API to run that workload on a different container orchestrator — in this example, just another Kubernetes cluster we have on-prem. That idea of shifting things around to use our resources most effectively is the big theme.

On the next slide, though: the really neat thing about the Virtual Kubelet architecture — the way they've designed the project — is that it's intended to be open and extensible through a provider-based approach. So like my example on the previous slide of using Virtual Kubelet to scale out into different Kubernetes clusters, those could be on-prem or cloud-managed, using that API. But there are also providers for other environments: Azure ACI if you want to go the hyperscaler route, StackPath more toward the edge, or SaladCloud — what we're working on — which is more of a globally distributed computing environment.

On the next slide, to hit exactly that theme of maximizing resource utilization: when we talk about a distributed cloud, we mean fully globally distributed. There are over 400 million high-end consumer GPUs that sit idle for 20-plus hours a day. We've looked at that and said, that's a lot of latent resource out there — as we keep staring at this supply shortage, what applications can we find that are compatible and can scale out into it? The exciting bit is that we've found we can do this at a much lower GPU cost for the applications that are compatible. This picture is a little dated — I think it's about a year, a year and a half old — but it's an example of what this geo-distributed GPU network looked like at the time, and it's quite extensive.

On the next slide: the thing I really like about Virtual Kubelet is the simplicity of what it does. It does one thing — take a pod that's been scheduled and run it somewhere — and it doesn't try to do more than that. As I noted, there are several projects in the CNCF landscape looking at how to shift compute, maximize resource utilization, and accomplish different goals, but Virtual Kubelet doesn't have a lot of bells and whistles. One of the things I see as pretty powerful with this model is its compatibility with the rest of the stack.
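(To picture what that virtual node looks like from the control plane's side: the provider registers a Node object that Kubernetes treats like any other. The label and taint below follow the convention several Virtual Kubelet providers use; the node name, taint value, and capacity figures are illustrative, not a specific provider's exact values.)

```yaml
# Illustrative sketch of the Node object a virtual kubelet provider registers.
# Names, values, and capacity figures are made up for illustration.
apiVersion: v1
kind: Node
metadata:
  name: virtual-kubelet-salad            # node name is provider-defined
  labels:
    type: virtual-kubelet
    kubernetes.io/role: agent
spec:
  taints:
  - key: virtual-kubelet.io/provider     # keeps ordinary workloads off the virtual node
    value: salad                         # provider-specific value
    effect: NoSchedule
status:
  capacity:                              # providers typically advertise large, mostly
    cpu: "1000"                          # nominal capacity, since the real compute
    memory: 4000Gi                       # lives behind the remote API
    pods: "1000"
```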
So this is a model of a real application, and we have customers who use Argo CD, External Secrets Operator, Harbor — managing our deployments, configuration, and application images, getting those scheduled into the cluster, keeping them up to date, and rolling them out. Nothing changes about that. We continue to express these applications as standard Deployments; there are no new nouns that creep in with this provider. Grafana and its stack, Prometheus, and even KEDA — I think that one's interesting, because it's not just extracting something out or pushing something in, but KEDA with its feedback loop: monitoring the application, looking at what's happening, and then scaling based on that (there's a short sketch of that at the end of this passage). All of these tools remain compatible with this approach because it doesn't ultimately change much about how these things run; it really just shifts where we get access to the GPUs. This has been really neat to work with, and I think it's one of the attractive, powerful things about the provider approach. But enough from me on the high-level ideas — it's always more fun to dig in with an example, see the real thing, and get a little lower level. So I'm going to turn it over to Dean, and he's going to take us through that.

Okay, thank you, Kyle. I think the best way to illustrate this is to walk through a relatively simple use case, and we're going to do that with slackersatwork.com, which is a mythical app. The app is called Slacker Tracker, and it's used to keep track of that guy — let's call him Joe Doolittle — who takes 30 minutes to get his cup of coffee in the morning, takes an hour and a half for lunch, and has to walk around and say hi to everybody at least twice a day. You know this guy; you've worked with him if you've been in an office of any size. Slacker Tracker is just a quick way to log where he's been. Forget about why — we just think it's fun. The product team has decided they want to use QR codes to track Joe, so we're going to add a QR code generator to the app. They've left it as an implementation detail how you attach the code to Joe. That aside, we need to figure out how we're going to do this.

Like many apps of this sort, there's no on-prem to speak of at all — it's all cloud-hosted to begin with — and the cloud provider they have right now doesn't really offer affordable GPUs without a large minimum spend. They don't need a hundred; they might need three or four, and even then not full time. And they're not already heavily invested in some external API or anything like that; they've got options, and they're Kubernetes-based to begin with.

So figuring out where you're going to get your GPUs really comes down to a couple of things. Of course, first you look at leveraging your existing relationships and investments — and like I say, those aren't working out so well here. Free credits don't last forever, if you're in that position; they're great for getting started, but they're not a long-term answer. And that minimum spend can really bite — it hurts a smaller operation. You've also got the problem that other external APIs may or may not have, say, a Terraform provider; they may not have the tooling you really need, and you're going to have to invent all of that yourself to take advantage of something external.
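(Picking up the KEDA point from a moment ago: because KEDA only ever scales a standard Deployment, the same feedback loop works whether the pods land on regular nodes or on the virtual node. A minimal sketch, assuming a hypothetical Prometheus metric for queued generation requests — the deployment name, server address, query, and threshold are all illustrative assumptions.)

```yaml
# Minimal KEDA ScaledObject sketch: scale a hypothetical qr-generator
# Deployment on a Prometheus-measured queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qr-generator
spec:
  scaleTargetRef:
    name: qr-generator                       # the Deployment to scale
  minReplicaCount: 0                         # scale to zero when idle
  maxReplicaCount: 4
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # illustrative
      query: sum(qr_requests_queued)         # hypothetical metric
      threshold: "10"
```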
So, back to our example: they've got an existing cluster, they've got Helm charts for their existing components — it's fully buzzword compatible. And what they want is to add, basically, Stable Diffusion with QR Code Monster to generate the QR codes for their new feature. The solution we have is a virtual kubelet that connects — obviously, in our case, to Salad — and lets them use just a handful of GPUs when they need them to generate these codes. And like Kyle said, you don't need 3090s for this; you can run it on a 3060 easily, or possibly something even smaller, because this isn't high speed — you don't need three-second generation times, you can get away with ten seconds in this kind of case.

The virtual kubelet itself, as Kyle said — in my words, it's basically a translation layer that takes Kubernetes control plane events and translates them to an external API that lets you schedule specific workloads. You've got all of the usual Kubernetes knobs available for targeting and selecting those; it's not like you're just adding them to a random pool. One of the cool things I like about this — let's go here to talk about that — is that it's a single binary. You can run it inside your cluster; it can run about anywhere, but you don't need anything special or dedicated. And since it's not critical to bootstrapping your cluster, you can run it inside the cluster and bring it up at the right time. Most important, I think, is the ability to continue to use the tooling you know — your DevOps skills, all of those things. And you've got the ability to add provider-specific metadata. For example, in our case — aside from authentication, which is handled in the kubelet itself — there are things that influence, say, what class of GPUs you want. You can add that to your spec and pass it right through to your back-end API, and get all of that to where it needs to go.

Some other benefits — the one I like here, besides avoiding resource contention, is the isolation. If you're running a workload, especially in a situation where you've got more than minimal security needs and you're running untrusted code, this lets you run those workloads at arm's length from your network. You don't have to worry that code you're pulling down is going to turn around, backdoor your network, and do something internally. In our example, we're going to put Stable Diffusion with QR Code Monster into a container, run it out somewhere else, send it a URL, and get an image back. That's all we need — we don't need it in our network, so we can run it at arm's length and not risk the internal exposure. And with scarce resources, even in the situation where you're running on-prem, you can put all of your GPUs into a single cluster — this is what Kyle was showing earlier — and keep your other, actual production clusters separate.
It lets you intelligently distribute your workloads and maximize utilization: instead of having three GPUs in three different clusters, each at a third utilization, you can put them in a single cluster and share them as needed.

This is kind of a repeat of that earlier diagram, but basically you have your typical Kubernetes cluster here, left of center, and the green box in the middle is the virtual kubelet node. We're essentially taking an entire cluster of machines running pods and presenting them back to your Kubernetes control plane as a single node with all of those pods on it.

Okay, so this is where I was supposed to talk about the details. The Virtual Kubelet project itself is just a Go module that implements the Kubernetes control plane API and defines a handful of interfaces that the external provider side needs to implement. For SaladCloud, for example, there are five interfaces; we implement them, and that's the conversion to the back-end SaladCloud API. As far as I know, every implementation of this is in Go — I'm not aware of anything that isn't. That means that most likely, if you're doing this, you've got a single Go binary, and like I said, it can run anywhere. Most commonly, though, you're probably going to run it in the cluster it's serving, as opposed to having a dedicated node or anything else; it's just easier to manage there, and there are so few actual resources required for that process. The particular implementation we have includes a Helm chart showing how to do this; there's no magic in it — it's pretty straightforward.

So with that, let's just see what it looks like. Okay — pray to the demo gods. There's my window. There we go. On the left I have the SaladCloud portal, so we can watch what's happening inside SaladCloud itself, and on the right is the shell window; I'm going to run this in Docker Desktop on my laptop. I've scripted some of this just to make it easier — "status" just runs a kubectl get node and get pod. So we've got the usual Docker Desktop control plane running and no active pods, and "start" basically runs a helm install command, which is at the top here. So fire it up — pretty straightforward, it uses that Helm chart to start it. And let's go look at it: in the node output we see the new virtual node show up, and down here, as a pod, we can see the pod that's actually running it. So it appears twice: one is the actual process, and the other is the emulated kubelet node.

At this point, let's do a kubectl apply. This is just a quick spec — tucked away here — that creates the deployment. In a second we should see it appear in the portal. Status: we have four pods starting. There we go — four pods starting, which correspond in our case to four container groups. I should note that's handled by the replicas setting in Kubernetes; unfortunately, SaladCloud uses the word "replica" for something slightly different, so a Kubernetes replica shows up for us as a SaladCloud container group. Let's look in one of them. What our replica means is that we could have multiple back-end instances running to satisfy it — that gives us load balancing. We provide this domain name, this DNS name, to access the image.
And that can be load-balanced across multiple instances; here we're just doing one each. I'm stalling a little because this can take a few minutes to get started. The app we're running was produced by Sean Reshevsky, and I believe he built it up from Stable Diffusion with QR Code Monster. The image is about six gigabytes fully baked, with the models and everything, so it takes a little while to download and deploy. Okay, so we don't have anything ready to go yet — excuse me — run the status again and we still show two up and two not.

Rather than wait on this, let me show you what it's going to do. Slackersatwork.com is probably the oldest domain that I actually own — and it even helps when you spell it right. This is a pre-running instance of the app. Joe Doolittle — one of the things that always helps for these is to up the error correction. Let's put Joe in a plaid flannel shirt, turn on validation, and generate an image. This always takes much longer than you think it does when you're waiting for it. That's running. Let's see if anything's updated here — no, we're still downloading. Is that what it's doing? Yeah, still downloading. Oh, come on. But really, I'm not going to complain, because as far as live demos go, this is as good as it's gone. And there's Joe in what it thinks is a plaid flannel shirt. The validator says this is good; let me check with my phone — you can check and see if you can scan it too. Yeah, my phone gets it. It turns out these things don't always produce usable QR codes the way you'd like, but anyway, our product team will be happy, and they'll let someone else figure out how to get those onto Joe so the users can track him around the office. I'm going to flip back to that, but that's essentially the demo. That's another one we generated earlier — of course, that's our own Salad image. I think that's it. Are there any questions?

Question there? Sure. Dean, do you mind jumping back? I know you applied that deployment, but could you just print it real quick? I'd like to see exactly what you set up in there. Oh, what's in the deployment? Yeah. Sure. Okay, so a standard deployment here — it's pretty standard. Let me show you one of the differences: the metadata I was talking about. This is the Salad-specific bit. We give it a country code where we want it to run, and set up the protocol and the port. We do have authentication available here, which uses JWTs. And then, unfortunately, right now we've got to use UUIDs for our GPU class, so you'll have to go look those up to set this up. But the rest of it is a pretty straightforward container spec. Down at the bottom is the magic: the tolerations that make it select our virtual kubelet node. So that's how we control which node — where we allow which applications to run, and whether we want them to scale out into another environment or not. We could even run multiple virtual kubelets with different authentication, like a multi-tenancy sort of thing, or divided up for whatever other reason — maybe one that defaults to a different GPU class. So this is how you would choose among those, or even among different back ends.
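(To make the spec Dean is walking through concrete, here is a stripped-down sketch of that shape. The toleration and nodeSelector follow the common virtual-kubelet convention; the salad.com annotation keys and values are illustrative stand-ins rather than the exact SaladCloud schema, and the image and UUID are placeholders.)

```yaml
# Stripped-down sketch of a Deployment that schedules onto the virtual
# kubelet node. Annotation keys/values, image, and UUID are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qr-generator
spec:
  replicas: 4                         # shows up as 4 container groups on the provider side
  selector:
    matchLabels:
      app: qr-generator
  template:
    metadata:
      labels:
        app: qr-generator
      annotations:
        salad.com/country-codes: "us"                # illustrative provider metadata
        salad.com/networking-protocol: "http"
        salad.com/networking-port: "8000"
        salad.com/gpu-classes: "00000000-0000-0000-0000-000000000000"  # placeholder UUID
    spec:
      containers:
      - name: stable-diffusion
        image: registry.example.com/sd-qr-monster:latest   # placeholder image
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
      nodeSelector:
        type: virtual-kubelet
      tolerations:
      - key: virtual-kubelet.io/provider             # allow scheduling onto the virtual node
        operator: Exists
        effect: NoSchedule
```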
And there's nothing that says you only have to use Salad. You might use Salad and ACI in the same cluster, or another on-prem cluster, or mix the models. These can be stacked and deployed quite easily. Yeah — limited only by your imagination there.

All right, I don't see any questions in the chat. We'll give it a minute, in case we have slow typers like myself. Oh, the other thing — have any of them come up yet? Oh, here we go, here's one running. The thing is, at one minute it may say running, but that just tells us the container is running, not that the app inside the container is quite ready yet. So this is what you'll get until it's actually fully up; it's still loading the model and whatever else it has to do. That's the only one so far — usually it takes a couple of minutes for these to come up. But then again, that's all due to the size of the images. If you're running, for example, not a GPU workload — not Stable Diffusion with a 4 GB model baked in — this goes very quickly. One more try. There we go. Okay, so it's the same app; this one is actually running from the deployment that we started. And just for grins — let's see, still just the one. It's always good to leave things better than you found them, or at least as good, so: status, we clean up after it. Still a couple deleting, but that cleans up the container groups.

Okay, here we go — a question: how do we choose the GPU type for the workload? That's in the spec file for the workload; I'll pull that back up. It's in the salad.com GPU classes annotation — that bit of metadata. I've been doing this with Postman, and I don't have Postman up to get what those UUIDs are. Do we have that?

Yeah, so this particular one is kind of Salad-specific. This annotation is just the way to pass in some arbitrary data, and it nicely highlights one of those examples of trying to map things into the existing Kubernetes models. As I was talking about at the beginning, it's challenging — some of these things don't currently have models we can map into very cleanly. So with the Virtual Kubelet providers, what we've done, and what we've seen with the other providers that exist, is that features and capabilities that aren't natively modeled by some standard Kubernetes schema definition just get stuffed, for now, into annotations, where we can model them with arbitrary key-value pairs. Specifically for Salad — and this gets a little lower level, as Dean was hinting — there's a UUID that references the different types of GPUs, so workloads can be finely targeted to the type and class of hardware you're looking for. In this case, yes, you get the right UUID and pull it from the API — there's a public API where those values can be obtained. They map to what's on the left there, Dean, in the user interface — if you scroll down a little to the GPU section. Yes, there. There are all sorts of interesting classes, as we call them, for the different types of hardware. Or, if there's a spread of hardware that may be compatible with an application and you're looking for the lowest cost, or maybe what's most available — oh, sorry, go ahead.
I was just going to say: for example, the one I was using here is the Stable Diffusion compatible class, so you don't have to pick every individual GPU — you can use that grouping.

Azure ACI — again, just taking another example in this Virtual Kubelet space — would be similar: special annotations you put on your deployment that the Virtual Kubelet provider for ACI then recognizes as, okay, that's the one you're looking for. Same thing with StackPath. Really, those just get mapped to the types of instances — the virtual machine instances — and whether or not those instances have a GPU attached and what it is. If you're doing it with your own Kubernetes cluster, managed or on-prem, then I think it would just be the normal method of scheduling and distributing — admittedly a little limited — using something like the NVIDIA or AMD node labelers to say, this is the hardware we have access to and how many, and to keep that updated and in sync. So adding those as labels on the nodes, and then using them in the selectors you define for your workload, to shift it as needed onto the right resources (there's a short sketch of that after this exchange). All right, I hope that answers your question, Mike.

Let's see — Taylor was asking about image sizes. I don't know; I'm by far the last expert you want to ask about this, but the trend I've seen is that the models are getting bigger. The good thing here is it's up to you. For this QR one, you could probably pick a smaller model depending on what you need to do. And realistically you're not going to be using QR Code Monster if you just need to generate a crap ton of plain QR codes for some reason — that would be a much smaller image to start with. So it's really going to be application dependent. The one thing we did find in our case is that it works a little cleaner to bake the models into the image rather than have a small container image and then download the models — only because once it's started, it's about ready to go, as opposed to "now it's started, now I have to download," plus more points of failure. Go ahead.

Yeah, and building on that: you also get the standard caching approaches with image layers and what you can do with distributing and managing those. I'd say low single-digit gigabytes is what we see on the low end; some get quite big. But exactly as you're hinting, Taylor, it's trending in both directions, I'd almost say: the new foundational models being trained keep getting bigger, but at the same time there are techniques and approaches that effectively compress them down and try to run them on lower-cost, more distributed hardware. The other thing we see is that foundational models can be used effectively with small fine-tunes applied on top — with some of the compression techniques that exist for that, the fine-tunes can be on the order of just a few hundred megabytes — so images don't necessarily have to grow much bigger, especially when you're targeting a particular application in the generative AI space.
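(As a side note on the node-labeler approach Kyle described for your own clusters: the NVIDIA device plugin plus GPU Feature Discovery publish node labels you can select on, roughly like this. The product label value and image are examples, not a recommendation.)

```yaml
# Sketch of targeting GPU hardware in your own cluster using node labels
# published by NVIDIA GPU Feature Discovery. Values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: qr-generator
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-GeForce-RTX-3060   # label from GPU Feature Discovery
  containers:
  - name: stable-diffusion
    image: registry.example.com/sd-qr-monster:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # device plugin resource: one whole GPU
```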
Yeah — and on image sizes, 26 gigabytes; we've seen some bigger ones, too. They do get fun. All right, maybe a last call for any other questions. Perfect. Well, thank you, Dean, it was nice to see you go through it. Oh, sorry — one more question here from Mike: do we have customers who use the service for inference? Yeah, that has been a very big, growing trend. We had that headline earlier in the presentation — LLMs and generative AI are touching everything around tech in 2023 — and we see that happening broadly as well. So the answer is yes: we do have a number of customers who are looking at Salad for inference-style applications that fit on the class of hardware that's globally available there, and people are scaling out quite effectively on that side right now.

All right. Again, thank you, Mike, and thank you, Dean — it was nice to see the demo and run through it. And thanks to everybody who joined us today. It was quick, and this was great. I hope you learned something about the Virtual Kubelet; check out the project on GitHub and on the CNCF website if you're interested.