My name is Fabian Dodge. I'm an engineering manager at Red Hat, and upstream I'm working on the KubeVirt project, which is why I'm more than happy to be here with Ryan. We're going to speak about KubeVirt, which is part of OpenShift Virtualization, but we're not going to talk about OpenShift Virtualization specifically right now; rather, about a different use case of KubeVirt. So, Ryan.

Sure. Hi, I'm Ryan Halasey. I work at NVIDIA, in NVIDIA's cloud division, specifically focused on building the infrastructure for GeForce NOW. So we're going to talk a little bit about that.

What is GeForce NOW? Probably some people in the crowd have heard about it or used the service. It's a service that NVIDIA offers for streaming games from the cloud. If you want to fire it up and play a game, this is a service you can use to get a desktop-like experience. If you're like me, with a ten-year-old PC sitting on your desk that you really love, and a new game comes along like Cyberpunk that you really want to play on a 3080, with RTX and 60 FPS, you can use GeForce NOW instead of replacing that desktop you like for whatever reason, because cards are expensive. So GeForce NOW is a cloud gaming service where you get that desktop-like experience for streaming all sorts of games. Just to give you a little picture of what this actually looks like: there are data centers all over the world, about 30-plus, maybe almost 40 by now, in your local region.

What I also want to talk about is the infrastructure behind this. We were faced with a problem a few years ago: the original architecture for GeForce NOW was very VM-based, focused on the monolithic model. We create VMs, we provide those as our control plane, and our workloads run in VMs. We wanted to move to a more microservice-based approach, and the issue was how to do that. We knew we wanted to go to Kubernetes, we knew we wanted to use containers, but how do we do this without completely abandoning our investment? This is where we looked at adopting KubeVirt. The next generation of the GeForce NOW infrastructure is actually based on KubeVirt and Kubernetes. We took that previous investment, our VM-based approach, and brought it forward to run those workloads, and some of our control plane, on KubeVirt. Now we have containers, and we have this ecosystem which is microservice-based for some of our services, alongside our traditional VM-based services that we continue to run. That gave us a nice, easy on-ramp for transitioning into this new world of microservices.

A few more details. We have a device plugin that we use; it's one we actually ship as part of the GPU Operator. A device plugin is how Kubernetes exposes resources, for example how you attach GPUs to a guest, and we use that heavily. We use DPUs for network offloading, for performance reasons and security reasons; there's lots of good stuff you get with DPUs.
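To make the device-plugin mechanism concrete, here is a minimal sketch of how a GPU advertised by a device plugin can be requested in a KubeVirt VirtualMachine and created with the Kubernetes Python client; the resource name, container disk image, and sizes are illustrative placeholders rather than GeForce NOW's actual configuration.

```python
# Sketch: attach a GPU exposed by a device plugin to a KubeVirt VM.
# "nvidia.com/GA102GL_A10" is a placeholder; the real resource name depends
# on how the device plugin advertises GPUs in your cluster.
from kubernetes import client, config

vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "gaming-vm", "namespace": "default"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {
                        # The gpus field maps a device-plugin resource into the guest.
                        "gpus": [{"name": "gpu1", "deviceName": "nvidia.com/GA102GL_A10"}],
                        "disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}],
                    },
                    "resources": {"requests": {"memory": "8Gi", "cpu": "4"}},
                },
                "volumes": [
                    {"name": "rootdisk",
                     "containerDisk": {"image": "quay.io/containerdisks/fedora:latest"}},
                ],
            }
        },
    },
}

config.load_kube_config()  # assumes a local kubeconfig with cluster access
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubevirt.io", version="v1", namespace="default",
    plural="virtualmachines", body=vm,
)
```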
For local storage throughout our data centers, we have a bunch of NVMe drives that we use; we don't use any sort of cloud storage, we have local storage. Scale-wise, I have here about 40 data centers; it's 30 to 40 data centers worldwide, hundreds of nodes per cluster, and then thousands of virtual machines and pods inside a cluster. We actually use the upstream release of Kubernetes, and we're making our way to 1.26 now. For KubeVirt we have a fork; we have a bunch of reasons for this, mainly a few things we needed to customize, like CPU pinning and vCPU-to-pCPU pinning, a few things we needed to do so that we can consume it. We're on 0.50 and we're eventually making our way to 0.59.

Cool, thank you very much, Ryan. Right, so before I proceed: how many of you actually know what KubeVirt is, or have had your hands on it? Mmm, that's a nice ratio. For the hands that were not lifted, a short reminder: KubeVirt is an extension to Kubernetes. Kubernetes helps us run and orchestrate containers, and with KubeVirt you're able to take your legacy, your 20-year-old legacy VMs, from OpenStack, from wherever you want, over to Kubernetes as well and run them there natively, alongside containers. Why is that helpful? We believe it's helpful because you can use Kubernetes as a single control plane, with a single mental model. Every platform has its own mental model: OpenStack, vSphere, Red Hat Virtualization, OpenShift, Kubernetes, all of them have their model. By converging the workloads on one platform, there's a reduction in mental load, so you can think about more fancy things than the platform. And you have your assets running on the same platform, so it's easier to connect them and to make the transition, like they did at NVIDIA: they said, we need an opportunity to keep our legacy, our investments, and still have room to evolve. That's what we want to enable for all of you, as I suspect many of you have got VMs, so that you gain the same benefits from Kubernetes.

While NVIDIA is using KubeVirt directly with their own on-premise deployment of Kubernetes, we have points of contact. We on the Red Hat side are using KubeVirt as part of OpenShift Virtualization, and some of the problems that NVIDIA, as one of our community members, has run into are interesting to us as well.

The first point that became relevant is scale. We saw that they run a lot of data centers with a lot of VMs and a lot of nodes, far more nodes than we usually run with OpenShift. So there was this scale initiative upstream, to add all the details: all the metrics, all the tunings, the small code changes that you need in order to really trace: can we run that many VMs, how do we run them, how well do they run, and what about regressions in an upcoming release? We created the scale effort in the KubeVirt community, initiated by NVIDIA, and through our collaboration we were able to add this regression testing to the KubeVirt release process, which is helping both sides.

One part of it was the phase transition metrics, in order to see how long it actually takes to launch a VM. It can be vastly different depending on the use case; different storage providers, different networking providers, all of them influence the timings. But with these metrics you are able to trace that down and see where the time is spent, a bit like systemd-analyze blame, which is the node-local variant of this.
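As a sketch of how those timings can be inspected, the snippet below queries Prometheus for a phase transition histogram; the metric name is an assumption of roughly what KubeVirt exposes, so check the metric catalog of your KubeVirt version, and the Prometheus endpoint is a placeholder.

```python
# Sketch: query Prometheus for p95 time-to-phase of KubeVirt VMIs.
# The metric name is an assumption; verify it against your KubeVirt version.
import requests

PROMETHEUS = "http://prometheus.example:9090"  # placeholder endpoint
QUERY = (
    "histogram_quantile(0.95, sum by (le, phase) ("
    "rate(kubevirt_vmi_phase_transition_time_seconds_bucket[5m])))"
)

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    phase = sample["metric"].get("phase", "unknown")
    seconds = float(sample["value"][1])
    print(f"p95 time to reach phase {phase}: {seconds:.2f}s")
```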
We also had an opportunity to collaborate on the networking side of the house. KubeVirt provides two mechanisms to connect your VM to a network. By default, a VM is connected to the pod network, so that it can speak to every pod on your cluster. You can limit or control access to the VM using network policies, just like you can for pods, and you can use it with services, ingress, everything you know. On the other side, KubeVirt also supports attaching multiple interfaces using Multus. Our collaboration at that stage primarily focused on the pod network: making it efficient, making sure the latency is low. There were some state-related problems with the bridge networking that we had for the pod network in KubeVirt, and that's where we added production-ready code to KubeVirt.

A major change that also reflects KubeVirt's maturity is KubeVirt's release cadence. KubeVirt has been out since 2016, and by now we think it's time to signal that we're mature; we're used in production not only at NVIDIA but also in other places. So we are aiming for a v1 release. One preparation to get to v1 was to align our release schedule with Kubernetes. Before, we had a monthly release schedule, which is pretty hectic: it's hectic for developers writing code, but it's also hectic for customers or adopters, because you're always chasing to deploy a new update, since KubeVirt releases are not supported forever, only for something like half a year or a year. So we tried to reduce the number of releases in order to provide well-tested, stable releases that are supported for a year. In preparation for this v1 release, releases are now aligned with Kubernetes, so that you usually get a Kubernetes update and a few weeks later the KubeVirt release follows. Why do we follow Kubernetes? Because we align a KubeVirt release to a specific Kubernetes release, so that you know it's a match: if you choose a specific Kubernetes version, it should be really clear which KubeVirt version you need in order to have the perfect combination of both. By the way, KubeVirt and Kubernetes alone are not enough, because in Kubernetes you have a lot of freedom about what your storage provider is and what your networking provider is. So from a KubeVirt perspective there are more degrees of freedom than just Kubernetes: it also depends on your storage provider and your network provider to understand the maturity and the functional scope of the overall system.
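To illustrate the two attachment mechanisms mentioned above, here is a minimal sketch of the interfaces and networks sections of a KubeVirt VM template: the default pod network plus a secondary interface attached through Multus. The NetworkAttachmentDefinition name is a placeholder.

```python
# Sketch: a KubeVirt VM template fragment with the default pod network plus a
# secondary interface attached via Multus. "gaming-overlay" is a placeholder
# NetworkAttachmentDefinition name, not an actual GeForce NOW network.
vm_template_spec = {
    "domain": {
        "devices": {
            "interfaces": [
                # Default pod network; masquerade is a common binding here,
                # bridge is the binding discussed in the talk.
                {"name": "default", "masquerade": {}},
                # Secondary interface, bridged onto the Multus-provided network.
                {"name": "secondary", "bridge": {}},
            ]
        }
    },
    "networks": [
        {"name": "default", "pod": {}},
        {"name": "secondary", "multus": {"networkName": "gaming-overlay"}},
    ],
}
```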
For example, CSI providers can support snapshots or not, so there are differences between them.

All right, but that is not where our collaboration ended. Something that's coming up is OVN-Kubernetes. OpenShift, for a while, was delivering OpenShift SDN to provide the pod network, but we see that this is not sufficient for all use cases. We want to provide a more modern networking experience in OpenShift, not only for pods but also for additional networks, and we saw a point of alignment in using OVN for it. OVN is well known from the OpenStack side of the house. So we're focusing on writing an operator and a CNI plugin to support L2 overlay networks with Multus, for Kubernetes and for KubeVirt as well, including all the fancy stuff like multi-network policies and services, to tie into all the other features like ingress and load balancers.

At the same time, but not surfaced in GeForce NOW, is our joint collaboration on the GPU Operator. NVIDIA is also creating graphics cards and other kinds of accelerators, and you can use them in OpenShift as well. NVIDIA has all the domain knowledge to empower those cards, and we jointly created the GPU Operator to make it easy to get these drivers onto the hosts where they belong, in order to slice an NVIDIA GPU up and allocate it to your container or your VM.

There are also some opportunities for future collaboration, for example dynamic resource allocation, which is just surfacing in Kubernetes itself. There is more CPU tuning we can do to increase the performance of our workloads. And, last but not least, there's support for Arm, not only for the control plane but also for workloads.

I think that's roughly the scope of our talk, and I think we're on time. We'd be happy to hear any questions about KubeVirt, NVIDIA, or OpenShift.

Wonderful, thank you for the questions already. We'll repeat the questions so everyone can hear. The question was whether UDP was implemented for Kubernetes as well, because Kubernetes is HTTP-based, so how the heck do we get UDP traffic in and out of the VMs?

We have dedicated interfaces specifically for game traffic, so we don't use the Kubernetes network for streaming any of the game traffic; we use external NICs for that. We attach them, offload to the DPU, and stream that way, so we don't need to implement it in Kubernetes.

By the way, that is also why we collaborate on OVN-Kubernetes upstream: it's related to network traffic. We acknowledge that Kubernetes is HTTP-based, and HTTPS obviously, but there's more than just HTTP, so a robust layer 2 solution is what we need. We want logically defined networks; we don't want to physically bridge them to some switch.
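As a sketch of what such a logically defined layer-2 network can look like, here is a Multus NetworkAttachmentDefinition that requests an OVN-Kubernetes layer-2 overlay; the names and subnet are placeholders, and the supported fields vary with the OVN-Kubernetes version.

```python
# Sketch: a NetworkAttachmentDefinition asking OVN-Kubernetes for an L2 overlay.
# Names and the subnet are placeholders; check your OVN-Kubernetes version for
# the exact set of supported fields.
import json
from kubernetes import client, config

nad = {
    "apiVersion": "k8s.cni.cncf.io/v1",
    "kind": "NetworkAttachmentDefinition",
    "metadata": {"name": "gaming-overlay", "namespace": "default"},
    "spec": {
        "config": json.dumps({
            "cniVersion": "0.4.0",
            "name": "gaming-overlay",
            "type": "ovn-k8s-cni-overlay",   # OVN-Kubernetes secondary-network CNI
            "topology": "layer2",            # flat L2 overlay, no physical switch bridging
            "subnets": "10.200.0.0/16",
            "netAttachDefName": "default/gaming-overlay",
        })
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="k8s.cni.cncf.io", version="v1", namespace="default",
    plural="network-attachment-definitions", body=nad,
)
```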
The question was: are we, as NVIDIA, contributing our changes upstream, or are there changes that are kept downstream or internal?

There are changes that we do keep internal. What we try to do is look at the changes and see whether there's value for the community. For example, we've got a way of doing CPU pinning that is very specific to our use case, and that's exactly one of the things we keep downstream; it's so specific to our use case that it doesn't really make a lot of sense for us to push it to the community. Maybe someday, when Kubernetes has expanded the way it does CPU pinning, there will be a way we can plug it in, but it's just not that way today. Those are cases where it just doesn't make sense. But there are a lot of cases where it does make sense for us to push our use cases and have them supported in the community, and there are tons of these that we've worked on. I can name a whole bunch. Like he was saying about scale: our scale is pretty large, larger than I think most people out there. So we've done a lot of work to take our work and our experience and bring it upstream, and to find bugs within the community. We found numerous bugs, we bring them upstream, we talk about them, we develop the fixes in the context of our scale, and then we fix them in the community. There are a number of these. Another good one was VM pools. This was a new API that was created out of a need, based on how we do the scheduling of our workloads: we want to have pools of different VMs. The way to think about this is like warming; even in the VDI use case, you want to have warm sessions available to allocate at any time. So the idea was that there could be an entire API in KubeVirt to handle this exact use case, and that was some of the work we did, creating this new API called VM pools. There's NUMA and there are others, and I'm probably not remembering half of them. So the answer to the question is: whenever it makes sense, when it's something that belongs in the community and that we want to have supported there so that others can use it, we try to push it upstream. In the cases where it doesn't, we hold on to it and continue to maintain it, until maybe the time comes that we can upstream it, or maybe it never does.

We've got a few minutes left. Oh, since the demo was recorded... I think the question is whether there's maybe a discount on GeForce NOW, to let the audience try it out? Yeah, okay, well, there is a free tier if you want to try it. I think you can play for two hours, or one hour. So yeah, you can try it, and it runs on KubeVirt.
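Since VM pools came up in that answer, here is a minimal sketch of what a VirtualMachinePool resource can look like; the API group and version follow the pool.kubevirt.io alpha API, and the names and sizes are placeholders, so treat the exact fields as an assumption to verify against your KubeVirt version.

```python
# Sketch: a VirtualMachinePool that keeps a set of "warm" VMs ready to hand out.
# API group/version and field names follow the pool.kubevirt.io alpha API and
# should be verified against your KubeVirt version; sizes are placeholders.
vm_pool = {
    "apiVersion": "pool.kubevirt.io/v1alpha1",
    "kind": "VirtualMachinePool",
    "metadata": {"name": "warm-sessions", "namespace": "default"},
    "spec": {
        "replicas": 10,  # number of warm VMs kept available
        "selector": {"matchLabels": {"pool": "warm-sessions"}},
        "virtualMachineTemplate": {
            "metadata": {"labels": {"pool": "warm-sessions"}},
            "spec": {
                "running": True,
                "template": {
                    "metadata": {"labels": {"pool": "warm-sessions"}},
                    "spec": {
                        "domain": {
                            "devices": {"disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]},
                            "resources": {"requests": {"memory": "4Gi", "cpu": "2"}},
                        },
                        "volumes": [{
                            "name": "rootdisk",
                            "containerDisk": {"image": "quay.io/containerdisks/fedora:latest"},
                        }],
                    },
                },
            },
        },
    },
}
```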
Okay, that's a good question. The question was: while there are active workloads, with people streaming a gaming session, are we doing live migration of their active game session? The answer is no, and there are various reasons behind it. Fabian talked about bridge networking; there's sort of a gap there in being able to do live migration without any interruption. Okay, there are more gaps than just bridge networking, but one of them is specifically that with the bridge networking binding we don't have the ability to support live migration. There are also problems, as you can imagine, with physical devices: how do you actually live migrate if you're passing through a physical device? So we don't do that today. Maybe in the future, with vGPUs and a little bit more work on bridge networking, we might be able to do it, but not today.

Thank you very much for the questions. We sadly only have a few seconds left before we get kicked out, but in case you're more curious, feel free to stop by and catch us somewhere. Thank you very much for your attention.