Hi, welcome to KubeCon 2021 in Los Angeles. You're about to learn about some cool new tech that lets Kubernetes workloads consume virtualized GPU services from a pool of GPUs. A GPU does not have to be installed in the same physical worker node running the pod that consumes it. I'm Steve Wong, co-chair of the Kubernetes VMware user group. Also presenting is Myles Gray from the UK; because of travel difficulties he couldn't be here, but he composed a great recording with a deep dive and a demo of this technology.

A word about the user group: we're inclusive of all users running all forms of Kubernetes on VMware infrastructure, so what I'm talking about doesn't just apply to people running a VMware distribution. This should be useful to you if you're running distributions like Anthos, EKS Anywhere, OpenShift, Rancher, or pure upstream open source Kubernetes. We'll start by talking about why you might want to use a GPU in your Kubernetes workloads. Then Myles is going to show you how to set this up with a step-by-step demo. After that, you'll get details on how you can join the user group and a link to download this deck.

GPUs got the name graphics processing units because they were originally designed to manipulate pixels for images and video. People then realized that the same parallel processing capability could be applied to many other general problems, as long as those problems could be broken down to allow parallel algorithmic solutions. Machine learning is a common application, but there are many more. Supercomputer pioneer Seymour Cray is said to have coined the marketing line, "Which would you rather use: a couple of strong oxen or a thousand chickens?" He said this back in the 90s. Well, advancements in both hardware and software have changed the terrain, and today you want the chickens.

Now, before some nitpicker in the audience looks at that picture carefully and points out that those are actually ducks and not chickens, give me a little leeway and let me paint an even different picture; visualize this in your head. You can keep the ox, but lose the ducks or chickens; I want you to bring in a shark and a school of piranha. Now imagine somebody has given you the goal of creating a museum exhibit of an ox skeleton at the LA County Natural History Museum. That museum, by the way, is just a short distance that way, and I highly recommend it if you've got some spare time to kill in Los Angeles. But back to the mission: we start with an ox and we want an ox skeleton. You could take that shark... oops, that's my slide... you could take that shark, put it in a pool with the ox, and that shark is going to take big bites and digest it, bigger bites than a piranha could. But if you unleash that school of piranha, they can get at that ox from all sides of the pool, the top and the bottom. You'll just see a burst of activity and, poof, an ox skeleton. That is what the GPU scenario of parallel processing is like: one big serial worker versus thousands of small parallel ones. For the right kind of job, this is a big winner.

Another reason for GPUs: the growth rate of conventional CPU performance has been leveling off compared to GPUs. Maybe CPUs are just reaching maturity while GPUs are still young, so they're improving faster. And by the way, this isn't just about time to result; there are often energy advantages to using GPUs. If you can get the same job done while burning fewer kilowatt-hours, it helps the planet.
Another possible tie-in that might save energy: in something like a deep learning application, the training stage is usually very GPU intensive, but after training, when you take the model to runtime to actually execute it, you can get by on a fraction of a GPU's capacity. What if you could eliminate the wasted money and wasted energy by pooling capacity to increase your efficiency? So I'm going to move on to the demo by Myles.

I hope you're having a great KubeCon so far. Over the next couple of minutes, I'm going to demo our AI/ML framework for running those kinds of applications on top of Kubernetes, on top of Docker, even in VMs if you want, but we're focusing on the Kubernetes aspect today. Let's take a brief look at the lab this is running in. It's running on all GA, fully released code: vSphere 6.7 and vSphere 7.0 U2, with Tanzu Community Edition. And on top of Tanzu Community Edition, we're going to deploy this application and the Bitfusion integration for Kubernetes.

So let's have a look at the VMs, which is the base level we're at here. You can see that we have four Bitfusion servers. If I go in to edit one of these, you'll see that each Bitfusion server has a GPU attached; each one has an NVIDIA Tesla T4 attached to it. In case you're not aware of how Bitfusion works, it's a client-server model. The servers in this instance are VMs with a GPU passed through to them. Your clients can then dial into the servers and request GPU resources. Say you have an application and you'd like it to consume half of a GPU: the Bitfusion client, running either inside the container or on the node (we'll get into that in a second), requests half of a GPU from Bitfusion. If that request can be fulfilled, Bitfusion intercepts all the calls to the CUDA APIs that need GPU acceleration and instead sends those over the network to the Bitfusion server, which does the calculation on the GPU and sends back the results. So Bitfusion abstracts the GPU from the workload itself; there's no need to have a GPU mounted into every single VM that runs a workload that would like to consume one. Instead, you can consolidate your GPUs: here we've got one GPU in each host, so we have four Bitfusion server VMs, each with one GPU. That means we don't need a one-to-one mapping from GPU to application. Instead, we can slice up each GPU into an arbitrary number of frame buffer allocations. Each of these GPUs has 16 GB of VRAM (frame buffer, more accurately), so you can divide that up into however many slices you like, and Bitfusion will allow the client to claim that much of the GPU. You'll see as we get into this why that is particularly beneficial for Kubernetes workloads, given their non-deterministic placement, the fact that there are many of them, and that there's only a limited set of GPUs.

Anyway, we've got our Bitfusion servers; they're the things that will do the actual processing of the application code that requires acceleration. Inside this resource pool, you can see a TCE (Tanzu Community Edition) management cluster and a workload cluster. Steve set this up for me very kindly last week.
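Before moving on, a note for readers who want to make the fractional-allocation idea concrete: with the Kubernetes integration demonstrated later in this talk, "half a GPU" ends up expressed as ordinary resource limits on a pod spec. The following is a minimal sketch, not the demo's manifest; the annotation and bitfusion.io resource keys are reconstructed from the vmware/bitfusion-with-kubernetes-integration README and may differ between releases, and the image tag and script name are stand-ins.

```yaml
# Sketch only: a pod asking Bitfusion for one remote GPU, but only half of
# its frame buffer. Verify the annotation and resource keys against the
# release of the integration you actually deploy.
apiVersion: v1
kind: Pod
metadata:
  name: half-gpu-example
  annotations:
    auto-management/bitfusion: "yes"   # value per the demo; some README versions show "all"
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/tensorflow:21.07-tf2-py3   # example tag
      command: ["python", "train.py"]                  # hypothetical script
      resources:
        limits:
          bitfusion.io/gpu-amount: 1     # one remote GPU...
          bitfusion.io/gpu-percent: 50   # ...but only half of its frame buffer
```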
So you can see we've got our management cluster and our workload cluster here, and we've also got Rancher in here as well. Bitfusion is not specific to Tanzu; this is not something that only works with Tanzu. It will work with basically any distro of Kubernetes you have, as long as it's a conformant, vanilla-compatible distribution.

That's the VM view out of the way. Let's have a look at Bitfusion itself before we get into this. Here you can see the current total number of GPUs that are available: at the top here in purple, it says we have four GPUs available. Total allocation is zero; we're not actually running any workloads. On the left-hand side you see the servers, the same servers we just saw, and clients, where it says we have 177. Now you might look at that and go, hmm, I don't really understand why there are 177 clients but nothing is being consumed. The way Bitfusion works is it assumes a GPU-as-a-service model. We're assuming the GPUs are being leased out to tenants or internal customers, who are billed based on how much time they have used them. Each of these clients is kept as a historical record, so you can say, OK, this one consumed 30 minutes of GPU, this one consumed an hour, and do chargeback and that kind of thing. That's why we keep all of our clients there, whether or not they're actively consuming GPU.

The first thing we're going to do here is add Bitfusion, or the Bitfusion credentials more accurately, to our Kubernetes cluster, our TCE cluster. This is brand new in Bitfusion 4.0, and to be honest, I think this is a really, really killer feature. We go to Kubernetes clusters and click add. It asks for a kubeconfig, and I've got my TCE kubeconfig file here, so I click upload and we'll call it TCE-01. You can see it's grabbed the IP address from the kubeconfig file, automatically connected, and pulled back the namespaces, so it already knows what namespaces are in there. Now I want to add my Bitfusion credentials, which means: what are the server IP addresses? What is the CA cert for the servers, so I can validate them? And what are my client credentials for actually connecting to the servers? I'm going to add those to three namespaces: kube-system; flower-market, which is where we're actually going to deploy our application; and the Bitfusion with Kubernetes integration namespace as well. So we click add.

OK, so now these are our target clusters. You can see this is a one-to-many relationship: you can have many Kubernetes clusters, and you can even have different tokens for different namespaces inside the same cluster; it depends how you want to do your permissions model. We've added our cluster, and now we need to create a client token so that those workloads can actually connect to Bitfusion and consume GPU resources. We'll click create, call it flower-market-token, and add it to all three of those namespaces. You can pick and choose here, so you can have many tokens if you want individual permissions models and that kind of thing, but we're just going to pick all of them. It also says "activate token after creation"; that means it will deploy the token to the Kubernetes cluster in the form of some secrets. Actually, let me show you that before we do it.
So if I do k get secret, and hopefully we won't see... yep: there should eventually be three secrets in here, and you'll see them when they pop up. Basically those will be the CA cert, the server list, and the client configuration file. They do not exist here yet. So what it's asking when it says "activate token after creation" is: once it has populated those secrets into Kubernetes, is that token able to make requests of the Bitfusion server? You can disable it at any time as well. If you've decided you need the GPU resources for something else, you can deactivate the token without actually deleting it or changing the configuration; anytime someone connects, they'll just get a "sorry, your token has been deactivated."

So we're going to go ahead and click create. You can see it says it's activated and has been added to these namespaces. If we go back into our K8s cluster, we should see the three new secrets straight away, and we do: our CA cert, our client configuration, and our servers.conf. That's good; it means we can now mount those secrets into an integration that allows us to connect to the Bitfusion instance. The nice part about this is that it's dynamic: you just add the kubeconfig file, choose your namespaces, and it auto-populates. There's no "download credentials, create generic secret," all that kind of stuff, and you can rotate secrets a lot more easily with the way the integration is built.

The first thing we're going to do now is take a look at the Bitfusion with Kubernetes integration, and I'll pull up the GitHub page here. You can get this at github.com/vmware/bitfusion-with-kubernetes-integration, and the bit you're interested in is the Bitfusion device plugin. Kubernetes has a concept of device plugins; it's generally used for hardware with specific drivers that need to run on the guest OS of the node itself, alongside the kubelet, to make the hardware work. We have a similar sort of thing with Bitfusion here. It goes about things a little differently, but it does install some libraries and drivers that are required. Basically, it comes in two parts, and we'll look at the architecture to illustrate that. What you see here in green is your standard deployment, or the job for that deployment if you're doing job-based batch processing or whatever. And then you've got your Bitfusion device plugin, which is essentially making the connection between the worker node and the Bitfusion server: it creates the pipe between CUDA on the client and CUDA on the server, to pass those calls over the network. You can see it says the integration consists of two bits: the device plugin itself, which is the thing that actually handles the data path, and the Bitfusion webhook. The webhook looks for a certain type of object in Kubernetes with some annotations, and if it sees those annotations, it does some transformation on the object, because it's a mutating webhook, and rewrites it to the Kubernetes API. We'll have a look at that now as well. I've already deployed this, because it takes a little while to deploy, and I'm just going to show you how it actually mounts the secrets in, so you're aware of where the boundary between the Bitfusion 4.0 server integration and the Bitfusion Kubernetes integration is.
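For readers recreating this without the Bitfusion UI in front of them, here is roughly what those three secrets might look like once the token is activated. The names and keys below are illustrative guesses, not taken from the product docs; the point is simply that the CA, the server list, and the client credentials each land in the namespace as ordinary Kubernetes Secrets.

```yaml
# Illustrative only: hypothetical names and keys for the three secrets the
# Bitfusion 4.0 token activation pushes into each selected namespace.
apiVersion: v1
kind: Secret
metadata:
  name: bitfusion-servers        # hypothetical name
  namespace: flower-market
type: Opaque
stringData:
  # list of Bitfusion server addresses (example values)
  servers.conf: |
    10.0.0.21
    10.0.0.22
---
apiVersion: v1
kind: Secret
metadata:
  name: bitfusion-ca             # hypothetical name
  namespace: flower-market
type: Opaque
stringData:
  # CA certificate used to validate the Bitfusion servers (placeholder body)
  ca.crt: |
    -----BEGIN CERTIFICATE-----
    placeholder
    -----END CERTIFICATE-----
---
apiVersion: v1
kind: Secret
metadata:
  name: bitfusion-client         # hypothetical name
  namespace: flower-market
type: Opaque
stringData:
  # client credentials / token used to authenticate to the servers
  client.yaml: |
    token: "<issued-by-bitfusion>"
```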
So if we look at the actual deployment YAML file here, the configuration is passed in as a ConfigMap, and if we go down to the bottom, you can see it's mounting in the CA, client, and servers.conf. That tells the device plugin: here is how you connect to Bitfusion, here are the servers, here are your credentials, and this is how you authenticate the connection.

We've already deployed the Bitfusion with Kubernetes integration, so we're not going to dwell on that. What we're going to do is deploy an app, and I'm going to show you the Dockerfile just to show you there are no pre-baked bits in there. There's no Bitfusion client in there, nothing like that. It's a standard container that just runs Python, that's it. What the Bitfusion with Kubernetes integration is going to do is look for the annotations (we'll look at those in a second), and if it sees them, it will automatically inject an init container into that deployment, put the Bitfusion client into it, and then prepend the Bitfusion invocation to the entry point of that container, essentially proxying the entire container's connection to CUDA through the network, completely transparently. It requires no changes to your app code whatsoever.

So let's look at the Dockerfile to begin with, because it's really, really simple; we're making some very minimal changes. We take the NVIDIA TensorFlow container from upstream on nvcr.io. We then tell it we want to execute this benchmark Python script: a standard TensorFlow benchmark that uses GPU acceleration but isn't actually doing any real work, just demonstrating work on the GPU. We specify the number of batches, the batch size, data format, model, all that kind of stuff. This is essentially just setting up TensorFlow and telling it what benchmark to run, what model to use, and what to do the processing on: local parameter device GPU, so it's going to use a GPU. Coming down, we pip install some packages, because this exposes some information to Prometheus, so we install a Prometheus client. We then clone the repo down so that it can get the Python script, and then we execute it. You can see the entry point is python plus the name of the file, along with all the other model parameters we saw further up, as well as telling it to run on a GPU.

So that's the container we're working with. It's essentially the base NVIDIA TensorFlow container, just running Python; there are no other bits or pieces in there. It's like your standard machine learning container. Now, what I want to show you is that we can take that standard machine learning container, which would fail if we just ran it on a cluster without any GPUs attached (it's trying to use a GPU, and there's no GPU directly attached to the cluster; the GPU is mounted somewhere else in the infrastructure entirely), and whenever we deploy it with the Kubernetes integration for Bitfusion, it will dynamically inject the Bitfusion client and allow those GPU calls to succeed. So let's have a look at our manifests. The only one we're going to look at is the deployment; the rest is fairly uninteresting: namespace, service monitor, service, horizontal pod autoscaler, that kind of stuff.
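Backing up one step to the secret mounting described above: in YAML terms, pulling those three credential files into the device plugin's pod spec could look roughly like the fragment below. This is a rough sketch under assumed names, not the project's actual manifest (which lives in the vmware/bitfusion-with-kubernetes-integration repo); the namespace, image reference, and secret names are all hypothetical, matching the secrets sketch earlier.

```yaml
# Rough shape only: how the device plugin might receive the CA, client
# config, and servers.conf that the token activation created.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bitfusion-device-plugin
  namespace: bwki                # hypothetical integration namespace
spec:
  selector:
    matchLabels:
      app: bitfusion-device-plugin
  template:
    metadata:
      labels:
        app: bitfusion-device-plugin
    spec:
      containers:
        - name: device-plugin
          image: bitfusion-device-plugin:example   # hypothetical image
          volumeMounts:
            - name: bitfusion-credentials
              mountPath: /etc/bitfusion   # ca.crt, client.yaml, servers.conf
      volumes:
        - name: bitfusion-credentials
          projected:
            sources:
              - secret: { name: bitfusion-ca }       # hypothetical names,
              - secret: { name: bitfusion-client }   # matching the earlier
              - secret: { name: bitfusion-servers }  # secrets sketch
```

With that plumbing in place, the interesting part is what the application deployment itself needs, which turns out to be almost nothing.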
The bit that actually ties this to the Bitfusion integration is just some annotations here. You can see this is a standard deployment; it's called a-new-hope, with a wookie container. There's a whole other presentation behind that name, which you'll find if you Google the Dutch VMUG session from last year. Essentially, we've got the same container we were just looking at, the a-new-hope worker image, latest tag, and you can see the Python command is exactly the same as the one in the container itself. And we've added three annotations. First, auto-management/bitfusion is set to yes, which essentially says: please enable the Bitfusion integration for this deployment. Second: what OS would you like it to use to do the bootstrapping? I chose Ubuntu 20, because it's the latest one. And third: what version of the Bitfusion client would you like to use? This generally needs to match your server; my server is running 4.0.1, so my Bitfusion client version is 4.0.1. That's essentially it. We then add our limits to the container, under resources. You can see we're requesting a Bitfusion GPU amount of one, so that's the number of GPUs, and a percentage of that GPU; we're asking for 100%, so we're expecting a full GPU allocation here.

So I'm going to go ahead and do a k apply -f, and... I'm in the wrong folder, so let's fix that first; it's the a-new-hope app directory. And we'll do k apply -f against the app manifests. All right, you can see I had most of the rest of this already deployed out here (the horizontal pod autoscaler always says "configured" for some reason), and you can see our deployment has been created. If I do k get deploy, you can see that it's 15 seconds old and already up and running, and if I do a k get pod, you can see that the container is running.

Now, before we look at the container logs... or maybe we'll look at them really quick. I'll do a k logs on the a-new-hope pod and follow it. You can see here it says "adding visible GPU devices: 0." That might look weird, as if it added zero GPU devices; that's actually the index of the GPU, so it's the GPU with index zero, the first GPU. If we go a little further up, you should be able to see where it mounts the GPU in. There you go: the Bitfusion client bits have already been mounted into the container, all done by an init container that was injected into this deployment. So if we do a k describe po on the a-new-hope pod and have a look up here, you should see that there is a container called wookie, the standard one that is in the manifest, as you can see. But we should also see an init container as well. Here you go: there's our init container, and you can see I have not specified an init container anywhere in this manifest; it's been injected by the integration. It's doing some command-line foo here: it's copying in the Bitfusion client configuration, the servers.conf, as well as the Debian package with the bits to actually run the client, and copying those into the workload container. And then it adjusts the command: you can see the command for the wookie container is now not just the Python invocation we had specified, but rather it has injected the Bitfusion client and is now executing the command through Bitfusion.
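Pulling the pieces just described into one place, the deployment looks roughly like this. This is a condensed sketch, not the demo's exact manifest: the annotation keys and bitfusion.io resource names follow the vmware/bitfusion-with-kubernetes-integration README as best as can be reconstructed, and the image name is a stand-in.

```yaml
# Condensed sketch of the demo deployment (not copied verbatim).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: a-new-hope
  namespace: flower-market
spec:
  replicas: 1
  selector:
    matchLabels:
      app: a-new-hope
  template:
    metadata:
      labels:
        app: a-new-hope
      annotations:
        auto-management/bitfusion: "yes"   # enable the webhook ("all" in some README versions)
        bitfusion-client/os: "ubuntu20"    # OS flavor used to pick the client bits
        bitfusion-client/version: "4.0.1"  # should match the Bitfusion server version
    spec:
      containers:
        - name: wookie
          image: a-new-hope/worker:latest         # stand-in image name
          command: ["python", "benchmark.py"]     # same entry point as the Dockerfile
          resources:
            limits:
              bitfusion.io/gpu-amount: 1     # number of remote GPUs
              bitfusion.io/gpu-percent: 100  # share of each GPU's frame buffer
# After admission, the mutating webhook injects an init container that copies
# the Bitfusion client in, then rewrites the command to roughly:
#   bitfusion run -n 1 -p 1.0 -- python benchmark.py ...
```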
So you can see it's doing a bitfusion run -n 1, so the number of GPUs is one, and then how much of the GPU: -p is the partial flag, set to 1.0, meaning the whole GPU. So it has injected the Bitfusion client and is now running that benchmark through the GPU. If you look up at the top here, it says it's requesting GPUs with 15 GiB of memory; it's a 16 GB GPU, they're just using different units. You can see that's now running in the background.

If we go into our UI and into our cluster, you can now see that one of the Bitfusion servers is having its full GPU allocated out, and you can see the client here: one of one GPUs fully allocated. If we click on the client, it takes us through: currently allocated, one of one, and if we change the window to five minutes, you can see it's on the ramp up... there it's using a full GPU... and it must have finished, so it's ramping back down again. So you can see that has successfully mounted a remote GPU, over the network, from a Bitfusion server into a container that had no knowledge of Bitfusion or the Bitfusion client whatsoever. If we go back in here, once it's finished, we should see how many images per second it processed and that kind of thing from the benchmark script. But you can see that essentially this container has zero knowledge of Bitfusion; it's just a Python container, and the Bitfusion with Kubernetes integration, along with the Bitfusion 4.0 token copy mechanism, allows these containers to connect to a Bitfusion server without any knowledge of the client and without being adjusted in any way.

For clarity, I'll show you how I used to have to do this. If we go into my Dockerfile here, you can see a whole bunch of stuff that's been commented out: that's the way we used to have to do it with Bitfusion 3.5. I had to bake the Bitfusion client bits into the container, along with all their configuration: the number of GPUs, how much of the GPU, any variables, plus the servers file and the CA file. It made things really fragile. It worked, but it meant that any time the server list changed, you had to rebuild your container, or if you needed to issue a new token, you had to rebuild your container. It's just stuff you don't want to have to deal with, and it wasn't necessary. So you can see all of that complexity around installing the Bitfusion client is completely removed from the container. Instead, we can take off-the-shelf GPU-accelerated containers and pass them through the Bitfusion with Kubernetes integration that we've released (again, github.com/vmware/bitfusion-with-kubernetes-integration), and it will automatically bootstrap Bitfusion into any of them and allow you to allocate GPUs to containers without any modification whatsoever.

And I think this is particularly powerful with Kubernetes. With more steady-state workloads, like VMs, or even containers running in VMs under Docker or something like that, where things aren't being constantly spun up, spun down, and descheduled, and it's not job-based, you could just mount a GPU into a VM and it was fine to live there. With Kubernetes, placement is pseudo-non-deterministic: a workload could land anywhere in the cluster, or even on a different cluster entirely.
So instead of having to buy GPUs for every single node just in case, you can now have a consolidated set of GPUs and let the Bitfusion client do all of the connection work for you. It means you don't have to continually unmount and remount GPUs into things, or worry about whether a GPU is being underutilized over here, or whether that resource is actually being used at all. So that was our quick, high-level whistle-stop tour of the Bitfusion 4.0 integration with Kubernetes: the bit here under tokens, where you can now add Kubernetes clusters directly to Bitfusion, plus the open source Bitfusion with Kubernetes integration project that allows you to take standard, off-the-shelf, unmodified containers and run them with GPU acceleration. I hope that's been useful. If you need any more information, we'll put some links in the chat so that you can have a look at each of these repositories and recreate this on your own clusters. Thanks for watching.

Thanks, Myles. So these are just some links to some of the things Myles showed. I did add one thing he didn't cover, based on a conversation I had on the hallway track here. I wanted to practice my delivery of this talk, so I showed it to somebody with a lot of experience using GPU resources in an organization, and that person planted the seed for an aha moment: he pointed out that his organization has a lot of users running Jupyter notebooks, which are sort of a UI for data scientists. One way to support them is to provision individual users with GPUs in high-end laptops. But a Jupyter notebook is just a web application, so you could centrally deploy it in a server farm; the Jupyter project itself has published a Helm chart for running this at scale on Kubernetes, provisioning, say, a thousand Jupyter notebook users. And I think pooled GPUs make a natural back end for that, because the Jupyter notebook usage pattern is people typing things in, almost like a spreadsheet, with a lot of dead time: it's think time for the user, during which no GPU is used, until they kick off whatever the operation is. This technique of pooling could save money, energy, and maintenance overhead (a hypothetical configuration sketch follows below).

So if you liked this presentation: it was sponsored by the Kubernetes VMware user group. We have meetings, and I'll tell you on the next slide how to join. It's me, Myles, and a couple of user sponsors; one of them, Joe Searcy of T-Mobile... I don't see him in the room, but he's walking around the conference... oh, there he is right there. We have these monthly meetings. Sometimes there's a speaker giving a talk on a topic, like this one; other times we just declare them "bring us your problems," or presentations on best practices. Once again, these apply to using any form of Kubernetes on top of VMware infrastructure. Picture base vSphere as the equivalent of a public cloud platform: you can bring any Kubernetes distro you like to it. This isn't VMware product centric; whatever kind of Kubernetes you've got, if you're running it on vSphere on-prem, this group should be of interest to you. So that link is how you join the group. It's a mailing list as the first step; we actually gate access to the documents and things by membership in that group. We also have a Slack channel listed at the bottom, so you can ask questions there and join the Zoom meetings if you like.
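Returning to the JupyterHub idea for a moment: here is a hypothetical sketch of how fractional pooled GPUs could be offered to notebook users through the Zero to JupyterHub chart's profile list. None of this comes from the talk; the kubespawner_override keys are standard KubeSpawner options, while the annotation and bitfusion.io resource names are the same assumed keys used in the earlier sketches, so verify them against the integration release you deploy.

```yaml
# Hypothetical values.yaml fragment for the Zero to JupyterHub Helm chart:
# each notebook user picks a profile, and the Bitfusion webhook (assumed to
# be installed on the cluster) turns the limits into a remote fractional GPU.
singleuser:
  profileList:
    - display_name: "CPU only"
      default: true
    - display_name: "Quarter of a pooled GPU (Bitfusion)"
      kubespawner_override:
        extra_annotations:
          auto-management/bitfusion: "yes"   # assumed key, as in earlier sketches
          bitfusion-client/os: "ubuntu20"
          bitfusion-client/version: "4.0.1"
        extra_resource_limits:
          bitfusion.io/gpu-amount: "1"       # one remote GPU
          bitfusion.io/gpu-percent: "25"     # a quarter of its frame buffer
```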
Those Zoom meetings are recorded, and they're in a playlist under the Kubernetes user group channel. If you want to get hold of the speakers, these are our GitHub and Twitter handles, but to be honest, it's probably easier to reach us in that Slack channel. I think we only have a couple of minutes for Q&A, and Myles is not physically here, but he was going to be monitoring that Kubernetes Slack channel, which is our normal one. So even if you have questions after this is over, later today or next week, go ahead and ask them in that channel. And this is a link where you can download this deck; a lot of the things I showed up here on the screen have hot links behind them, so you should be able to conveniently get to the resources we just talked about. With that said, maybe we have a minute for a question or two; if we run out, I'll hang around in the hallway. Yes... maybe wait till you get the mic, right over there.

What are the performance implications of sending those requests to the GPU over the network versus using the system bus?

Well, you want a fast network, so I would not recommend this with less than 10 GbE; 40 is probably better, and 100 gigabit Ethernet better still. And it depends. What's going on is that you're really feeding a pipeline that goes to the GPU, and usually what comes back is relatively small compared to what has to go over there. There's a little latency in getting that pipeline filled. This probably wouldn't pay off for jobs that run 100 milliseconds or so, but it's my understanding that many of these applications run for minutes and hours even with the GPU, and some might run for a day or more. That initial pipeline effort is just a one-time thing. It depends on your algorithm and the ratio of GPU think time to the amount of data that has to be loaded into that fast-access frame buffer over on the GPU. So your mileage will vary, but it works for very many popular jobs. And it's my understanding that when you get GPU service from some of the public cloud providers, a similar thing may be going on too, depending on how they chose to implement it under the covers.

Thank you for the presentation. I had a question regarding functionality that might not be available, things like GPU profiling, CUDA profiling, or having a TensorBoard for your TensorFlow training. Is there any limitation when you're sharing GPUs this way?

I'm afraid I don't want to tell you more than I know, and I don't know that, but Myles might. So come back on the Slack channel and we'll answer that.

All right, thank you. We've got one over here; I think this will be the last question, by the way, because we're going over.

Are you using pre-trained models, or are you training on a data set to do inferencing?

Oh, this is up to you. You just write it as if you were running your machine learning or deep learning app inside Docker or inside a Kubernetes pod. This is just virtualizing a CUDA interface out of physical GPUs that aren't necessarily on your worker node. So in terms of what goes on in the stages of deep learning, this doesn't reach that level; that would have to be in the code you implement in the Kubernetes pod.

Can you also use other DL frameworks, like Pandas?

I think anything that can be accelerated through a CUDA interface will work.
So we are not virtualizing the physical hardware here; we're virtualizing the CUDA interface. Anything that works with and can take advantage of CUDA will work with this technology.

Are there any performance trade-offs with that?

I think, as the earlier question brought up, there is a potential cost to sending these data flows over a network versus having a physical card literally mounted in a PCI Express slot on the very box you're on. This isn't quite up to the same bandwidth, and there is going to be latency associated with the network. How big a deal that is is application dependent. For longer-running jobs, I think it might not show up so much, and the less data that has to flow in to fill the frame buffers, the better this is going to compare to using a physical GPU. There's not one answer for every application.

OK, and does this support batch processing?

Yeah, it should. I can't say I've personally tried it, but I think that's orthogonal to this as well.

Thank you. OK, I see we have one more question, but I think we're over time, so catch me in the hallway. Thanks for attending. Like I say, if you have more questions and can't meet me in the hallway, just reach out, join the group, join our monthly meetings. But thanks. Thank you.