Thanks, everybody. Everybody's in a bit of a meat coma today. Anyway, this is load testing Kubernetes, and I'm going to go through how to optimize your cluster resource allocation in production.

A little bit about me: my name is Harrison, and I'm a senior software engineer at Buffer. I focus on helping our product teams get more stuff done, and on some of the architecture work for our system.

I'm going to start with a little case study. We had a pre-existing endpoint in our monolith, written in PHP, that serves the number of times a link has been shared within our product, Buffer. Buffer is a social media tool. People like to share tweets, so they'll queue up a bunch of tweets, and sometimes those tweets contain a link. We keep track of the number of times those links have been shared. Bloggers will have the Buffer button displayed, so we serve a button that shows a count, and people can get an idea of how many times their link has been shared and gauge interest in what they're writing about.

We settled on a simple design, or so we thought, using Node and DynamoDB to back the counts. We deployed the service to Kubernetes with four replicas, and we manually verified with curl that the service was operational. We were running pretty much stock Kubernetes with the stock services, deployed on AWS.

We routed 1% of our traffic from the existing application onto our Kubernetes cluster, and things looked fantastic. So we scaled that up until 10% of all the traffic was being routed into the new setup. Then we moved up, as you'd expect, to 50%, and this is where things started to get interesting.

The first thing we did was kind of freak out and scale up our replicas 5x, to 20 pods. This helped, but the pods just kept dying, and it wasn't clear what was going on because we were really new to Kubernetes at the time. So we scaled the traffic back down to 0% and spent some time investigating what could be happening.

The first thing we found: I had copied and pasted the deployment from another service, and I think it was just something I found on the internet, so shame on me. The deployment included resource limits, and they weren't the right resource limits for the application and the load we were putting on it. With a little more investigation, we found that we were getting OOM killed, which means the container ran out of memory and Kubernetes killed it.

So let's talk a little bit about resource limits. These are constraints that can be set on both CPU and memory utilization, and without them set at all, containers can run unbounded in the CPU and memory they consume. There are defaults you can now place on namespaces, but I'm not going to cover that in this talk. When a memory limit is crossed, Kubernetes kills and restarts the container; a container that crosses its CPU limit gets throttled instead.

So how do we go about setting these things optimally? It's important to understand what optimal means here: each pod has enough resources to complete its task, and at the same time each node can run the maximum number of pods.
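To make that concrete, here's a minimal sketch of where those limits live in a Deployment manifest. The names, image, and numbers are purely illustrative assumptions, not the actual values from our service:

```yaml
# Hypothetical Deployment fragment showing where requests and limits are set.
# Names, image, and numbers are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: link-counts
spec:
  replicas: 4
  selector:
    matchLabels:
      app: link-counts
  template:
    metadata:
      labels:
        app: link-counts
    spec:
      containers:
      - name: link-counts
        image: example.com/link-counts:latest   # placeholder image
        resources:
          requests:            # what the scheduler reserves for the pod on a node
            cpu: 25m           # 25 millicores, i.e. 2.5% of one core
            memory: 64Mi
          limits:              # crossing these throttles (CPU) or kills (memory) the container
            cpu: 100m
            memory: 128Mi
```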
There are different ways this can go wrong, and ways it can go right. The first is under-allocation, which is what we were experiencing. This is where your limits aren't high enough, so when you apply the load you're getting from your traffic, Kubernetes kills the container because you haven't given it enough memory, or maybe it ran out of CPU.

Another one is over-allocation, where you set the limits too high. This is a trickier problem to spot because things aren't going to break in an obvious way, and it really becomes a problem when you start to scale up replicas. Say you waste 10 megabytes of memory for every replica and you scale up to 1,000 replicas: you've made your problem 1,000 times worse. It can make all the difference in how many pods fit on your nodes; in this case you'd be running an extra pod if you had set your constraints appropriately.

Then there's even allocation, which is what you should strive for. This would be a perfect allocation: you're utilizing 100% of your resources and your cost savings are maximized. It's something to work towards, but this is one of those cases where good enough is probably good enough, and striving for perfect might not be worth it for your use case.

The next thing to think about is the way Kubernetes does monitoring. This is to highlight the system as a whole; I'm going to go through what it looks like on an individual node, but it's important to note that there are multiple nodes, there's a master, and there's also a storage backend that connects to a component called Heapster. The very first layer is cAdvisor, which runs on each of the nodes. Its responsibility is to collect metrics from Docker for each of the pods that are running: CPU, memory, information about the file system, and a few other things as well. The important thing is that the kubelet makes decisions about what to do with pods based on what cAdvisor tells it. Heapster is an add-on, but if you add it to your cluster it aggregates all the metrics from the kubelets, and it has different backends for storage, the default being InfluxDB. Heapster lets you visualize what's going on at the cluster level.

When we're setting limits, and this is Buffer's approach, we're trying to understand what just one pod, one replica, can handle. We start with a conservative set of limits, then we run some tests, and I'll talk about what type of tests those are, and we adjust the limits until we find the ones that work for us. We only change one thing at a time and observe the changes, so you don't have too many variables.

There are a couple of different testing strategies that we employ. The first is where we slowly ramp up the traffic: we start from no traffic and slowly increase it until we find the point where it breaks. Once we find the breaking point, we run something called a duration test just under the breaking point, and at that stage we're looking for things like memory leaks, unpredictable behavior, maybe a queue that fills up. It depends on what you're building, but everything has its own failure modes.
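As an aside, if you don't have a dashboard handy, you can eyeball the same numbers that cAdvisor and Heapster collect straight from kubectl. A rough sketch, assuming Heapster (or, these days, metrics-server) is installed so `kubectl top` has data to read, and with a placeholder label and node name:

```sh
# Current CPU/memory usage per pod (needs Heapster or metrics-server behind kubectl top).
kubectl top pod -l app=link-counts

# How much of each node is already spoken for by requests and limits.
kubectl describe node <node-name> | grep -A 6 "Allocated resources"
```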
So I'm going to do a live demo here. I'm going to set limits on etcd, using a tool we've open sourced called KubeScope.

I did this demo last year, and back then, to get all this information, I had to use port forwarding and pull it from cAdvisor directly. This tool gives you a little more information. You can see here that I've got CPU: the red line is where the limit is currently set, and the blue line is the current utilization of the container. So I've got 25m of CPU, which is approximately 25 thousandths of a CPU core, and also 5M of memory.

Now I'm going to apply some load. We use a tool called loader.io, and before I get into this: loader.io needs a verification token served from the target, and etcd doesn't really have a way to do that itself. If you do decide to do this — let me increase the font here; can everybody see that? — coming into our cluster we have an Nginx server sitting in front of etcd, and it serves the loader.io tokens so we can authenticate with loader.io and run these load tests.

So I'm going to run this test, and what we should see here... you can already see that the memory utilization was pretty close, so the container died. That's exactly what I was expecting. Unfortunately I wasn't quick enough, but here's what ended up happening: if I describe the pod and take a look at the last state, you can see the last terminated state here, and the reason is that we actually got OOM killed. We ran out of memory in this case.

The next thing we'll do is edit the deployment and increase the memory. I'm going to set something up so we can just watch, and you can also see that the etcd container was restarted, which is another indication that something went wrong. So I'm going to edit the deployment, and at this stage I'm making macro adjustments, so I'm going to increase this by a factor of 10. We should see that container get restarted... the container's creating. I'm going to go here and find... okay, we've got the new etcd pod, and that's up and running. You can see that before this was 5, now it's set to 50, and when we rerun the test we should have much more room to breathe.

Instead of crashing immediately, this test is going to complete. You can see here that the green line is the amount of traffic being sent through loader.io, there's a spike in the CPU, and the blue line is the response time. As the number of simultaneous requests increases, so does the response time, which in this case means we've probably got another bottleneck, and with KubeScope you can tell we're definitely hitting a CPU bottleneck. So that'll be the next thing to adjust.

After this test completes, I'm going to increase the CPU by a factor of 10 and see what happens. I'm setting this to roughly a quarter of a CPU core, and I'm expecting this to break, because I've got a really, really small cluster here with one node and one CPU core. If I describe that pod again, you can see in the events log that the default scheduler is telling us the pod failed to schedule: there isn't enough CPU. So I've hit one of the natural limits of my system. I can't quite give it 250m, so I'll go back a little bit and edit the deployment again to something more reasonable for this system. Is that 25? Let's do 50. Okay, that's terminating.
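For reference, the adjust-and-retest loop in this demo is just a handful of kubectl commands. A rough sketch with placeholder names (the pod and deployment names, and the exact values, are whatever you're actually testing):

```sh
# Watch pods restart while the load test runs.
kubectl get pods -w

# Check why a container died; look for a last state of Terminated with reason OOMKilled,
# or a FailedScheduling event complaining about insufficient CPU.
kubectl describe pod <pod-name>

# Bump the limits, either interactively...
kubectl edit deployment <deployment-name>
# ...or directly, e.g. the factor-of-10 memory jump from the demo:
kubectl set resources deployment <deployment-name> --limits=memory=50Mi
```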
While that's starting up, let's take a look at the response time here. We're looking at something like one-second response times. That's not great, but if we're actually fixing the bottleneck, we should expect the response times to go down. Since we gave it roughly twice the CPU, and if that's really the bottleneck, we'd expect the improvement to be somewhat linear. It's not going to be perfect, and it really depends on the system. Okay, that's back up and running, so I'm going to run another load test. Remember, we were at about a second. You can visualize this while it's going on too: the CPU is creeping up again, and if I compare this side by side with the earlier response times... yeah, we're still creeping up because we've hit our bottleneck, but not quite as much. If all goes well, we should see an average right around 500 to 600 milliseconds per request. If I had more resources, I could keep repeating this process until I hit the point where the CPU isn't pegged anymore, and wherever that setting is, you've found the maximum amount of traffic that one container can handle.

Once you've found that maximum, the next thing to do is run a duration test. I'm just going to kick off an example duration test here. You run just under the breaking point for an extended period of time, and again, you're looking for things like memory leaks and queues filling up; variance in response times is an indication of those sorts of things too. A lot of interesting stuff can happen at this stage. While you're doing these tests, instead of modifying things by a factor of 10, you're making smaller adjustments. You wouldn't necessarily want to increase things by a factor of 10 unless you had a really good reason to, and you might want to go back to your ramp-up test when you make those changes.

So I'm going to jump back in. While you're doing all of this, it's important to keep a fail log, and that's something you're going to want to share with the team. It's both qualitative and quantitative information about how the thing broke, and it becomes really important when you're writing your runbooks, because you're not always going to be on call. Somebody else can look at your fail log and say, okay, we know this thing is failing in an expected way, I probably just need to scale up. But if it's failing in an unexpected way, first you want to update the fail log so other people know about this particular failure mode, and then you might want to start looking at other things: it could be an issue with the code, or an issue with the infrastructure or a related service. It's good to keep track of these things so you can understand them.

There are different failure modes you can observe. One is a memory leak, where memory slowly increases and you end up with that familiar sawtooth pattern. Another is the CPU staying pegged at 100% even after your load test has finished; sometimes that happens when a queue has filled up or some process has gotten hung. Another classic one is that you just see a bunch of 500s, or high response times. A stranger one is a large variance in response times, which can happen with queuing. And the last one is that requests just get dropped and you never get a response.
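All of those modes are exactly the kind of thing the fail log should capture. It can be as simple as a shared doc; a hypothetical entry, with made-up fields and values purely to illustrate the idea, might look something like this:

```yaml
# Hypothetical fail-log entry; the fields and values are illustrative only.
service: link-counts
test: ramp-up, 0 -> 250 requests/sec over 5 minutes
limits: { cpu: 50m, memory: 50Mi }
breaking_point: ~180 requests/sec
failure_mode: CPU pegged at 100%, response times climbed past 1s, no 500s observed
action_taken: raised the CPU limit and reran the ramp-up test
runbook_note: if latency climbs while CPU is pegged, scaling up is the expected fix
```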
So some of the stuff we learned through this process: scaling up your replicas isn't necessarily going to solve scaling issues, even for stateless services. We also learned that there are a lot of different ways applications can fail, and that keeping a fail log is a really good practice. It keeps teams closer together; it gives the systems and dev teams something to talk about, and it's a communication point between product and systems. And really it's about increasing the predictability of your system: if you're not setting limits on your containers, unexpected things can happen, and things can take more resources than you expected.

Looking ahead at Kubernetes, all the tools we have right now are a huge step forward for ops and cluster-wide operations; there's never been a better time to be an ops person. But I think there's still a huge opportunity for developers to get involved here. If kubectl is the new SSH, there are tools we can build to help developers visualize what's going on and set things like limits. That's kind of why I hacked together KubeScope. I'm going to be spending more time on it, and if it's something anybody here is interested in, I'm looking for pull requests, feedback, anything helps. I'd like to open it up for any questions. Thanks, everybody.

Yes. So the question was, and stop me if I don't quite get this right: how do you judge how to set the actual limits? Once I know where the breaking point is, maybe I run more pods at half of the breaking point, twice as many pods? I think that depends on your business case. For us, we want to run as few nodes as possible to keep costs down, so we're running things pretty close to the breaking point at all times, and that keeps the cost down for us. If you have more bursty traffic, it might make more sense to have twice as much capacity as you need, for instance.

What was that called? Okay, so the question was, am I familiar with the Vertical Pod Autoscaler? I'm not familiar with that one, I'm afraid.

Did I have the replicas set to three? Oh, yeah, I had watch hooked up. If there were three, that would have been weird.

So the question is, how do I know when I've gone too far? How do I know when I've given something too much? There are a couple of ways to tell. Even though you're giving a pod more CPU, for instance — we kind of ran out of resources here because the cluster was small, but if you had more — what you'll see is that your response times end up staying the same: you keep throwing more resources at it and the response times stay about the same as they were before.

Yes. So the question, and I might adjust it a little bit, is: what if I write an app that is inherently performant, but its resource requirements are extremely high? That's a trickier one, and I think it depends on the individual case. I'm thinking of compute-heavy stuff, highly parallelizable tasks; machine learning might be an example. I would start looking at GPU acceleration in that case, but it really depends on the individual app.
If it can't be split up, maybe it really does need that many resources, and that's just the state of the world you live in.

So to repeat that question: have we automated this, and could it be automated? We haven't automated this process yet, and we have been looking into how to automate it. The tricky part isn't actually running the tests, it's the starting point, and that requires some domain knowledge about what language you're using; Node versus Java containers are going to have different starting points. I suppose you could just start from a really small number and slowly increase it. loader.io, for instance, has an API you can use to trigger these tests, so you could probably use that API and some tooling to adjust the limits: run a test, look at the mean and variance of the response times, see how they change, and adjust. You could probably automate this with the APIs. The starting point is the tricky part, I think.

I don't quite follow what you're asking. So the question is, is there stuff I could do before putting this out on Kubernetes? Could I instrument this locally and make sure the application itself is performant? Yeah, I think that's the other side of this. This process starts after you've got a container, and there's an assumption there that the app is already relatively performant, but it would be very good to do that work before starting this process.

Okay, yes. So I think the question is, would you run this in production? We actually run this in production alongside the rest of our stuff, and the reason is that crosstalk between services is something we also want to observe. If this pod fails and causes other things to fail, maybe DNS is causing issues across the system, and we'd want to know that before putting more traffic onto it.

Yes. So the question is, what about something that's more non-trivial: maybe it does a lot of reads and writes, makes a request to a third-party service, and hits a database, all in the same endpoint, so you've got a complicated endpoint. We don't really do this on those. We try to avoid building things like that, because that's kind of where we started: our monolith is really that, and it's really difficult to load test without putting users onto it. It's much easier to focus on one of those endpoints, pull it out, and then load test that one endpoint, which is kind of what got us on the microservices journey in the first place.

Yeah, that's a good question. We actually don't test that. I think you could use this; you'd probably have to put something like a hello-world app at the place where the ingress terminates, but yeah, it would be cool to see what that looks like. Thanks, everybody. Thank you.