Test, test. Okay, all right, we're going to get started. There's still room at the front, people, if you want to come up; I promise we don't bite. It's early morning for me because I'm from the US. So, show of hands: who in the audience loves virtual machines? Okay, pretty good. All right, show of hands, who loves containers? A lot more people, okay. I saw some people who didn't raise their hands, so you must be here to see us talk about scale and performance. I appreciate that.

Okay, so I'm Ryan Hallisey. I work at NVIDIA as an engineer. This is my colleague Alay Patel, also from NVIDIA. We're here to talk about CI/CD-driven benchmarking, the scale and performance measurement that we do in KubeVirt, and we're excited to share with you why this matters and how you can do it too.

First, I'll talk at a high level about what KubeVirt is. Then I'll hand the mic over to Alay, and he'll go through scalability, how we measure it, and this idea of the control plane as a shared resource. That's something really important to keep in the back of your mind as we go through: the control plane is a shared resource. Then we'll talk a little bit about KubeVirt's performance and scale stack and benchmarks, the things we measure and the ways we measure them, and finally we'll share how you, as someone who writes an API or an operator, can do the same thing we're doing in KubeVirt.

Okay, a virtual machine is a custom resource. So what does that mean? It's an extension of the Kubernetes API. When you, as the user, ask for a virtual machine, you're going to the Kubernetes API server, and what are you going to get? You're actually going to get a container. This is interesting, so for all the people who raised their hands for both, who love virtual machines and containers, shout out to you; we also love virtual machines and containers in KubeVirt, because we rely on them very heavily. So we get this container, and the KubeVirt control plane does some work. Inside the container we've got a QEMU process and libvirt, and on the host we've got KVM, the hypervisor. The KubeVirt control plane does some work, generates a domain XML, and voila, we're in France. Voila, we have a virtual machine. If you think about it, it's a virtual machine running inside of a container. That's what KubeVirt is: a virtual machine running inside a container. KubeVirt provides the control plane and the API that extends the Kubernetes API, so you get the experience of running a virtual machine just like you would in other stacks, but in a Kubernetes-native way.
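To make the "it's just a custom resource" point concrete, here is a minimal sketch, not something the speakers showed, of creating a KubeVirt VirtualMachineInstance through the ordinary Kubernetes dynamic client in Go. The spec is trimmed to the bare minimum, and it assumes the kubevirt.io/v1 CRDs are installed in the cluster:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// To a client, a VMI is just another object behind the Kubernetes
	// API server; KubeVirt's controllers do everything else.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	vmiGVR := schema.GroupVersionResource{
		Group:    "kubevirt.io",
		Version:  "v1",
		Resource: "virtualmachineinstances",
	}

	vmi := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "kubevirt.io/v1",
		"kind":       "VirtualMachineInstance",
		"metadata":   map[string]interface{}{"name": "demo-vmi"},
		"spec": map[string]interface{}{
			// Spec heavily trimmed for illustration.
			"domain": map[string]interface{}{
				"devices": map[string]interface{}{},
				"resources": map[string]interface{}{
					"requests": map[string]interface{}{"memory": "128Mi"},
				},
			},
		},
	}}

	// The API server admits the CR; virt-controller then creates the
	// virt-launcher pod in which libvirt and QEMU run this VMI.
	_, err = client.Resource(vmiGVR).Namespace("default").Create(
		context.TODO(), vmi, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```

From here, everything described above, the launcher container, libvirt, QEMU, and KVM on the host, happens behind the API server without further involvement from the client.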
Okay, so we at NVIDIA use KubeVirt very heavily, and we rely on a lot of things in the ecosystem, not just KubeVirt. So what's our use case? We operate a cloud, and our use case is GPUs. We want to provide end users with GPUs, and in particular we want to stream graphics over the network to give users different experiences. We focus a lot on GeForce NOW, which is streaming graphics over the network so that you can play games. There are probably some people here who are familiar with GeForce NOW; you can play triple-A games on your phone. We do the rendering in the cloud and stream the graphics to you, wherever you are. All of that is powered by Kubernetes, powered by KubeVirt, and by a bunch of other things: we use OVN for networking, we've got Gatekeeper, Prometheus, Grafana, Flux, Fluent Bit. A lot goes into our stack. So something to keep in mind while Alay is talking, when he shows some examples of how we have measured and tested scale in our environments: these are the things we use; this is what our stack looks like.

Thank you, Ryan, for that introduction to our stack. When we run this stack in production, we need to make sure it scales well when there is high load on GeForce NOW. To think about the perf and scale of that stack, we rely on the great work that Kubernetes SIG Scale has done. SIG Scale has introduced something called scalability dimensions, which act as guidance on how to think about perf and scale. There are three major things in that guidance.

The first thing you need to think about is the environment the stack is running in. That includes things like which Kubernetes version you're running, what hardware resources you have for the control plane, and how it's configured. The environment greatly impacts how the stack scales.

The second thing, and the most important, is the scalability thresholds: the number of objects the cluster needs to scale to. As you can see in the diagram, these will differ for each cluster use case. For the stack we run, the thresholds that matter are object counts for pods, VMs, nodes, and PVCs; those are the key focus for us. It's really important to understand that the scalability envelope SIG Scale has provided gives you a great way of thinking about perf and scale, and that this diagram will look different for every Kubernetes cluster.

The third thing is extensions. Almost all production Kubernetes clusters use extensions in one form or another, for example mutating and validating webhooks, and all of them should be used wisely. If you're looking for high scale or low latency, a webhook should respond with low latency, for instance, and the CRDs and CRs of the extension should stay within the scalability thresholds of the envelope.

Once you have all three of these dimensions, you can determine the SLOs for the Kubernetes cluster you're running. It's also important to understand that, when thinking about the scalability thresholds, some of the upstream SIG Scale work can be reused. If you look at the diagram, upstream SIG Scale provides scalability guidance on nodes, pods, PVCs, and secrets. But once you bring in extensions, there are other things in the cluster that can impact scalability. What we have found is that the extensions we run, like KubeVirt, OVN, and Gatekeeper, compound the number of CRs and hence compound the scale of the cluster. And it's not only the number of CRs: they also create client load on the API server in the form of HTTP requests.
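One way to see that client load from inside your own controller is client-go's metrics hooks, which let you record the latency and result of every request the process sends to the API server. This is a sketch rather than anything shown in the talk; the metric names are invented, and the hook signatures follow recent client-go releases (older releases take no context argument):

```go
package main

import (
	"context"
	"net/url"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	clientmetrics "k8s.io/client-go/tools/metrics"
)

// Invented metric names: a histogram of request latency by verb and a
// counter of results by HTTP code, covering everything this process
// sends to the API server.
var (
	requestLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name: "myoperator_rest_request_duration_seconds",
		Help: "Latency of requests sent to the API server.",
	}, []string{"verb"})

	requestResult = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "myoperator_rest_requests_total",
		Help: "Requests sent to the API server by result code and method.",
	}, []string{"code", "method"})
)

type latencyAdapter struct{}

func (latencyAdapter) Observe(_ context.Context, verb string, _ url.URL, latency time.Duration) {
	requestLatency.WithLabelValues(verb).Observe(latency.Seconds())
}

type resultAdapter struct{}

func (resultAdapter) Increment(_ context.Context, code, method, _ string) {
	requestResult.WithLabelValues(code, method).Inc()
}

func main() {
	prometheus.MustRegister(requestLatency, requestResult)

	// Every rest.Config-based client built in this process now reports
	// through the adapters above.
	clientmetrics.Register(clientmetrics.RegisterOpts{
		RequestLatency: latencyAdapter{},
		RequestResult:  resultAdapter{},
	})

	// ... construct clients and run the controller as usual ...
}
```

Plotting these per verb is exactly the kind of view that exposes a controller issuing far more LIST or PATCH calls than expected.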
To reconcile those CRs, the controllers send HTTP requests to the API server, and again that has a compounding effect on how the control plane scales. The first four or five dimensions are covered by upstream SIG Scale, but the HTTP request load and KubeVirt's scalability with respect to CRs are what we in KubeVirt SIG Scale focus on. For the rest of the presentation I'll focus on how these things can be monitored and measured to get an idea of scale.

Just to showcase an example of how these things cause failures at scale: we were seeing frequent API server OOMs in our stack, and one thing we observed was that whenever we found an OOM, chances were there was a large number of secrets in the cluster. To root-cause the actual problem behind those API server OOMs, we ran an experiment. It involved creating around 5,000 secrets, 0.2 MB each, and then sending list requests against secrets, so a client just asking for a list of those resources. What we found, as you can see in the charts, is that at that scale, in our environment, two or more concurrent list calls caused an API server outage.

There is an inefficiency in the way the API server handles list calls. This is a well-known problem in the Kubernetes community, and it's being addressed by an enhancement, KEP-3157. If you want to learn more about which kinds of list calls are susceptible to these OOMs, it's explained very well in the KEP, and the KEP actually proposes a solution that works around the problem. Going back to our problem: you can see that in a large-scale distributed environment, you could have an extension client that makes those two concurrent list calls, and because it does, in our environment we were getting API server OOMs. So it's important to monitor what kind of load the extensions and other clients on top of Kubernetes create against the API server.

After that experiment we plotted a graph: on the y-axis, the number of concurrent requests; on the x-axis, the number of secrets multiplied by the size of each secret. Anything in the green was perfectly okay for us; if you're on the line or outside the green domain, that's where there were outages. So to scale up, say you want to support more than two concurrent requests in the cluster, you can either reduce the size of each secret or reduce the maximum number of secrets the cluster supports; either of those will increase the number of concurrent calls the API server can handle. Hopefully this sets the context: it's not just about the number of objects, it's also about the size of each object, and, on top of that, how many concurrent requests the API server is serving to clients.
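As a rough sketch of that experiment, assembled from the numbers in the talk rather than code the speakers showed, and strictly for a disposable test cluster: create around 5,000 secrets of roughly 0.2 MB each in a pre-existing namespace, then fire concurrent unpaginated LIST calls with client-go.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	numSecrets  = 5000
	secretBytes = 200 * 1024 // ~0.2 MB each, as in the experiment
	concurrency = 2          // two concurrent LISTs triggered the outage
	namespace   = "list-test" // assumed to already exist
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	// Step 1: fill the cluster with large secrets.
	payload := bytes.Repeat([]byte("x"), secretBytes)
	for i := 0; i < numSecrets; i++ {
		s := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("filler-%d", i)},
			Data:       map[string][]byte{"blob": payload},
		}
		if _, err := client.CoreV1().Secrets(namespace).Create(ctx, s, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}

	// Step 2: issue concurrent, unpaginated LIST calls. Each one asks
	// the API server to materialize roughly 1 GB of response at once.
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			list, err := client.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{})
			if err != nil {
				fmt.Printf("lister %d: %v\n", id, err)
				return
			}
			fmt.Printf("lister %d got %d secrets\n", id, len(list.Items))
		}(i)
	}
	wg.Wait()
}
```

That per-call memory spike on the server side is the behavior KEP-3157's streaming lists are designed to avoid.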
Okay, so with that background: if you're writing an operator, and you want to measure the performance and scaling characteristics of that operator, how do you do that? As Ryan explained earlier, a KubeVirt VM is just a CR, and there is a set of controllers behind the scenes that make that VM run. So in KubeVirt SIG Scale, we define the scalability threshold in terms of the number of VMs, and then we monitor the client-side load that KubeVirt generates in order to drive those VMs into the running state.

To do that, we in KubeVirt SIG Scale have developed a benchmarking stack that monitors KubeVirt's perf and scale metrics for each release, and at the end of each release the resulting graphs are shipped to users, so they have an idea of what kind of perf and scale they can expect.

Talking a little more about the benchmarking stack: it consists of three major layers. The first is a basic set of tools. It includes a workload generator, because in order to perf and scale test we need something that creates a bunch of VMIs in our test environment, and it includes metrics monitoring: while those VMIs are being reconciled by KubeVirt, a monitoring stack continuously collects the metrics that are interesting from our perspective. Once we have those basic building blocks, the second layer wires them up into CI: the end-to-end tests run daily through Prow, and the results from the monitoring are dumped into an S3 bucket. Every day, at a certain point, three jobs run, and their results land in that bucket. The third layer aggregates the results: it scrapes them out of the S3 bucket and plots them as graphs, and we publish those graphs with each KubeVirt release as benchmarks.

Okay, so the benchmarking stack produces a graph like this one. These graphs show the P50 of VMI creation-to-running, in seconds: how many seconds the KubeVirt stack takes to bring a VMI or VM into the running state. The graph is plotted over time, and at some point the trend line changed. At the green vertical line, we started to see that performance improved a little, and the observations became tightly concentrated, whereas previously they were more dispersed. To figure out what caused this, we started digging in, and we found that at that time we had changed the underlying Kubernetes provider from 1.25 to 1.27. That improved the way a pod gets to running, and because KubeVirt relies on a pod to provide you a VM, KubeVirt's metrics started to improve as well. So if something changes in the environment over time, it can lead to better performance, and having benchmarks like this helps you understand exactly which change caused the metrics to improve or degrade. All of the vertical lines in the graph denote some kind of change in the environment; apart from the green one, the other three vertical lines are KubeVirt release changes. We have plotted this for the last three releases, and we have shipped it as a benchmark for v1.2.

Not only that: as KubeVirt goes through its development cycle, SIG Scale uses these graphs to periodically check for regressions or improvements in the metrics and to find out exactly what caused them. For example, further along the timeline there was a point where the trend line improved a whole lot, basically by half. What we found was that a monitoring change had been introduced into the KubeVirt benchmarking stack which actually broke our metrics collection. So that second change was not a performance improvement, but a bug in the way we were collecting metrics.
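Since the headline metric in these graphs is creation-to-running latency, here is a simplified sketch of one way to sample it: watch VMIs, record how long each takes from creation until its phase first reports Running, and take the P50. This is an illustration under assumptions (the default namespace, a load generator creating VMIs while it runs), not the KubeVirt benchmarking stack's actual collector:

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	vmiGVR := schema.GroupVersionResource{
		Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances",
	}

	// Watch VMIs and record how long each one took from creation until
	// its status.phase first reports Running.
	w, err := client.Resource(vmiGVR).Namespace("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	var samples []time.Duration
	seen := map[string]bool{}
	deadline := time.After(10 * time.Minute) // fixed collection window

collect:
	for len(samples) < 100 { // sample size is arbitrary for the sketch
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				break collect
			}
			obj, ok := ev.Object.(*unstructured.Unstructured)
			if !ok {
				continue
			}
			phase, _, _ := unstructured.NestedString(obj.Object, "status", "phase")
			if phase != "Running" || seen[obj.GetName()] {
				continue
			}
			seen[obj.GetName()] = true
			samples = append(samples, time.Since(obj.GetCreationTimestamp().Time))
		case <-deadline:
			break collect
		}
	}

	if len(samples) == 0 {
		fmt.Println("no VMIs reached Running during the window")
		return
	}
	sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
	fmt.Printf("p50 creation-to-running: %v over %d VMIs\n", samples[len(samples)/2], len(samples))
}
```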
So this Second change that was identified was not a performance improvement But actually a bug in in the way we were collecting metrics. So these graphs are used to find out a bunch of problems in the cluster and Not not just in the cluster in Q-word development cycle and We continuously monitor that as part of the six scale effort Okay, so now that we have seen the performance benchmark I want to introduce the scalability benchmarks and the way these are different is that In the performance benchmarks, you get to see exactly how much performance is being Increased or decreased But in the scalability benchmarks, you get to see how much load Q-word stack Generates against the API server. So the idea here is that the exact cost of that call Is not something we could determine The user of Q-word will have to figure out if they their environment can scale based on this benchmarks So you can see in the graph above that the graph above shows a patch Podcounts for a VM. So what these what this means is that if Q-word if a user starts a hundred VMs how many patch calls does a Q-word stack makes against the API server and In in one of the releases we doubled the amount of patch calls that were made so we went from around one patch call to two patch calls and You can see that that is easily identifiable based on the benchmarks and At that time six scale figure out that The feature that was being shipped as part of this release was Causing this to increase and it was actually okay Because we were getting a good benefit from that feature Okay, so hopefully That showcases how to how we benchmark our Q-word stack Now the question is how do you benchmark your operator right? So there are three building blocks as I explained one is Having a set of tool set to start Some amount of load in the test cluster There are well-known Tools that are getting popular one of the tools is a cube burner cube burner allows you to declaratively specify a yaml and say okay, I want to create a hundred VMI's Put it in a yaml and then cube burner will start and and create those hundred VMI's for you There are other tools for example cluster loader Is something that is being maintained by Kubernetes six scale? So even that could be used to generate load Once you have the load generator the second part is monitoring Controller runtime has a well-defined set of metrics that you could monitor and if need be You can also add more metrics on the client side. So what we have done in Q-word is In production, we don't We don't collect a lot of metrics, but just for the end-to-end test We have an environment variable that would collect much more in-depth metrics During when the tests are executed. 
This helps make sure that Prometheus does not have to collect a lot of detailed metrics in production, so you keep the scalability of Prometheus the same while still understanding the scalability behavior of code changes through the end-to-end tests.

Then, once you have this set of tools, you can set up CI automation. Instead of putting things into an S3 bucket, which we had to do, you can dump the metrics directly into Prometheus or Thanos, if that option is available to you. As a next step, for your upstream or downstream project, you can wire up alerting rules so that when these metrics fall below or rise above a certain threshold, you get notified of a performance degradation.

Yeah, so that's how you can take advantage of the performance and monitoring stack that we have developed, and those are the ideas we're trying to share with folks. Thank you. If you have feedback, please share it with us, and we have a few minutes for questions.

How did you create the graph grouped by component on the first slide?

Okay, so repeating the question: how did we create that graph? That graph is just to explain that there are other things, the HTTP requests and the number of CRs coming from extensions, that Kubernetes SIG Scale does not monitor, and that it's important to look at those when you're running a production stack. It's not actually representative of real numbers; it's just illustrating the idea.

Okay, looks like there are no more questions. We'll be here for a few minutes afterwards if you want to come talk. Thank you very much, everybody.