Gang, thanks for coming today. We've got a packed agenda, I think we're clocked in at like 34:30, so we're going to start right on time. We can't take any questions if we want to get through all of it, but I'll tell you what: we will stay as long as it takes to answer all your questions, so don't think we're fleeing the scene or anything like that. I promise we'll be with you. I think we're about two minutes out. Thank you very much for coming. You guys in the back: it's not like the Blue Man Group, we won't spray anything on you, so feel free, plenty of places up front. Thanks again, before they call the KubeCon police on you.

All right, ladies and gentlemen, my name's Shane, this is my good friend Will, and this whole week at KubeCon you're going to hear amazing success story after amazing success story, delivered by some of the most brilliant people in our industry. So Will and I are going to give you a break from all that. Right now we're going to tell you about a misadventure we had running a large-scale Kubernetes cluster, and the reason we think you're going to be interested is that we believe many of you in this room right now are on this misadventure with us; it's just that no one told you.

Our misadventure begins, as all great misadventures begin, with a misunderstanding. You see, we took the things we knew to be true in Kubernetes and misapplied those concepts to things that were actually governed by Linux performance rules, and it turns out that's kind of a big mistake to make, but it's an easy one to make. We were thinking in cores, because that's how Kubernetes thinks about stuff, right? There's a Node object with the number of CPUs on it, and every time you schedule something it decrements that object; cores are definitely a thing there. However, Linux has an abstraction layer, and it thinks in time, not cores. Now, I get it, I probably sound like the crazy guy on the Kubernetes subway yelling that the cores are a lie, so let me explain that for just a second. Okay, so I'm going to use the
least controversial feature in all of Kubernetes to explain this: Kubernetes limits. Now I know, I know. Take however you feel about Kubernetes limits and put that aside for me for just a second, because it is the best way to understand how Kubernetes thinks in time, and how dangerous it can be to think in cores. Just as a quick review: there is metadata attached to a specific container for how much CPU it can take in a given period of time. That period happens to be 100 milliseconds. File that number away for me real quick; we'll need it in just a few slides. And every period, this quota resets. Pretty straightforward, right? But what I didn't understand is that when I was configuring cores, what was really going on behind the scenes is that it was all converting into time. Let's explain how that works really quickly. In my mind I was configuring one core, and a core is broken up into 1,000 little chunks called millicores. If I configure 10 millicores, which is the minimum value, what I'm really saying is: take whatever value I configure, divided by that base value of one CPU, or 1,000 millicores. That's a percentage. But a percentage of what? It's one percent of this 100-millisecond period in time, or one millisecond, which is why that's the smallest value.

Now, how is that different? Someone would come up to me and say, "Shane, my application needs half a core." Sounds pretty reasonable, sure. But what's happening when I convert that into time? That is 50% of that 100-millisecond period, or 50 milliseconds in time. How is 50 milliseconds in time different from half a core? Well, you see, I didn't have the good sense to ask them anything about this app, and it turns out our fictitious app has four threads running on four cores. And if I add up the total bill, the total amount of CPU utilization in this 100 milliseconds of time that's passing in the real world, I get 400 milliseconds. And that
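A quick sketch of that conversion in plain Python; this is just the arithmetic from the slides (the 100 ms period is the CFS default), not anything that talks to Kubernetes itself:

```python
# Kubernetes CFS period: a limit of 1000m (one core) maps to 100 ms of
# CPU time per 100 ms scheduling period.
CFS_PERIOD_MS = 100

def quota_ms(millicores: int) -> float:
    """CPU time (ms) a limit grants per 100 ms period."""
    return millicores / 1000 * CFS_PERIOD_MS

def demand_ms(busy_threads: int) -> int:
    """CPU time (ms) fully busy threads consume per period."""
    return busy_threads * CFS_PERIOD_MS

print(quota_ms(10))   # 1.0  -> the 10m minimum is 1 ms per period
print(quota_ms(500))  # 50.0 -> "half a core" is 50 ms per period
print(demand_ms(4))   # 400  -> four busy threads want 400 ms per period
```

With a 50 ms quota against 400 ms of demand, the container spends most of every period throttled, which is exactly the mismatch being described here.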
was nowhere near the 50 that I thought. Thinking in cores, it sounds reasonable; but let's take a step back. Can anyone tell you the right way to set your limits if they don't know this information: how many threads you're running, how many parallel processes? Actually, one person can. It turns out Prometheus can. By using the familiar cAdvisor metric container_cpu_usage_seconds_total (this is in seconds, so we're going to do a little PromQL and convert it into periods; I promise I'll share all that with you at the end), when we measure that in time, what's the value? Exactly the amount we needed to set it to, to keep it from throttling. Did I need a Ouija board, a dreamcatcher, and five million blogs to set it? No, I just needed a five-second look at a chart in Prometheus. Was it really that hard? It's not, when you're thinking in time.

"Okay, Shane, is it really that easy?" All right, I left a few things out, so let's cover that. In our application we decided to keep the number of threads static, but we had the option to make it dynamic with user load. What would happen as the load goes up? Would the value be the same if the thread count was dynamic? In fact, it would not be. Whoops. Thinking in terms of static can be highly dangerous; more on that in just a second. But how does this affect things in the real world? Well, I've got four cores that I'm running in test, and I'm running a runtime, maybe Go or something, that says, "oh cool, you gave me four cores, I'll run four OS threads and schedule my goroutines on those." What happens when I go to production and give it an eight-core box? It's going to go, "oh, you gave me eight cores, great." Is the performance number going to measure out the same with eight OS threads? In fact, it's not. Whoops. So, a few things we have to think about when we do this sort of thing.

Cool. So you get this wrong, like I
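The measurement described here can be sketched like this; the cAdvisor metric name is real, but the conversion below is a simplified stand-in for the PromQL we actually used:

```python
# rate(container_cpu_usage_seconds_total[...]) gives CPU-seconds used
# per second of wall time. Multiplying by the 100 ms CFS period turns
# that into milliseconds per period: the limit, expressed in time,
# that just covers the observed demand.
CFS_PERIOD_MS = 100

def usage_per_period_ms(cpu_seconds_per_second: float) -> float:
    return cpu_seconds_per_second * CFS_PERIOD_MS

# Four fully busy threads show up as a rate of 4.0:
print(usage_per_period_ms(4.0))  # 400.0 ms/period -> the non-throttling limit
# The "half a core" guess only covers a rate of 0.5:
print(usage_per_period_ms(0.5))  # 50.0 ms/period
```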
got it wrong. I used to stare at these percentages of periods and all this type of stuff, and I wasn't thinking in time. So I'd go to Will, all judgy and stuff: "Will, you're 99% throttled on your application. How could you?" And he'd be like, "Dude, what did we set it to?" And I'd be like, "I don't know," because it's just some random percentage up there. And he'd look at me all disappointed, kind of like you're looking at me right now, sir. And I'd be like, man, there's got to be a better way to do this.

So I'd go back and do my engineering. Some engineering: "I've got this. We'll double it. He's at one core? We'll run two cores, because that sounds good to my ear." Oh, you've done it too; don't you judge me. So I do two cores now. That sounds like a lot, but when we think in time, immediately we see the problem. Converting between the two is pretty simple, just drop a zero going back and forth, and what is it? 200 milliseconds of time. What did I just do? Man, I just halved his application's performance. Sorry, buddy. And in that 100 milliseconds of time, if I didn't have another application that needed to run, guess what? It goes out as idle. So I've got 50% utilization, but I've got a huge performance problem that would be a monstrous thing to figure out, unless we were thinking in time.

Luckily, I have container_cpu_cfs_throttled_seconds_total, and if I just convert that to periods (again, I'll share this with you), lo and behold, what number is it? The exact number I need to raise the limit by. I had 200, I needed to add an extra 200 to get to 400, the exact value needed to stop the throttling. This stuff's actually quite simple. It was two little charts I needed to look at for five seconds, and I didn't need all the folklore that's been going around for years. But I know exactly what you're thinking right now. You're thinking: Shane, you didn't even cover per-CPU slice allocation from the global quota for the metadata of the
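The correction being described is just addition; a sketch, with the throttling metric already reduced to milliseconds denied per period (a simplification of that metric's real semantics):

```python
def corrected_limit_ms(current_limit_ms: float, throttled_ms_per_period: float) -> float:
    """Raise the per-period quota by the amount that was being denied."""
    return current_limit_ms + throttled_ms_per_period

# The example from the talk: a 200 ms limit, throttled 200 ms per period.
print(corrected_limit_ms(200.0, 200.0))  # 400.0 -> the value that stops throttling
```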
container. What kind of amateur night are you running up there, man? That's a little harsh, but I got you, fam. Don't worry, I wrote a whole blog post going into the brain-damaging details of all this. Now, I know the grandmaster of Kubernetes himself, Tim Hockin, tried to warn us off shooting ourselves in the foot with limits years ago, but there's actually some cool stuff coming: in a newer version of the Linux kernel there are rollover and burst values for this stuff, so limits might be cool again. More importantly, you're in a multi-tenant talk right now, and we as multi-tenant operators don't get to control what people do in their containers. We need a safety net, and if you need a safety net, this stuff's actually not too bad, if you think in time.

Now, some if not all of you are going to send me this: "oh, but you can do all this stuff with requests." It turns out that's actually a more fascinating topic, and it might not work quite like you think. To discuss that, let's go explore that universe for just a bit. The gentleman on the slide is Edwin Hubble, and about two hours from where I live in Los Angeles, back in the 1920s, Edwin Hubble provided direct evidence that the universe was expanding. Some mathematicians at the time were pretty bummed, because their whole life's work was built on the universe being static; overnight, it was just completely invalidated. This makes me feel a lot better, because so were all my math and all my performance calculations on Linux, because I thought the Linux universe was static.

Now, I don't want to brag, but when I was in the United States Marine Corps I was so good at basic math they gave me a special box of crayons to do hard things like ratios. So when it came time to take my CKA, I was feeling pretty confident, like, I've got this. Tell me if you learned the same thing: one CPU of requests equals 1,024 shares, and if I double the requests, I get double the shares. And just like shares at a company, whoever
gets the most shares gets a greater percentage of the company, right? So in my mind it worked like this: this application got 25% of the shares, this application got 50% of the shares. Except none of that is true. Enter my hero Brendan Gregg, who makes me feel like I should probably be working at Best Buy, not doing this, and he says: hey, crayon-eater, guess what, it's only the busy shares that count. Wait, what? So you mean if this container is idle, it doesn't count in the performance calculation, it's just these two? That completely changes my performance ratio on this node. Oh my goodness. Wait a minute, wait a minute: I have hundreds of pods, forget containers, I've got init containers too, and I'd rather try to figure out the heat death of the universe than try to figure out which containers are going to be busy at any given time. If I could do that kind of math, I would have gone into ML. And even if I could calculate it for one node, would it hold on all thousand nodes? No, it wouldn't. Oh, I'm getting a little anxious now.

But is this practical? You see, when I was thinking in cores, a developer would come up to me: "can I have a core?" "Sure, buddy, here you go." But what was actually happening? Let's say, just to stretch a point, we were working in stage. Would all the containers be busy in stage like they are in prod? No. So I do some performance regression. You all do performance regression on your apps, right? So we do a little performance regression, and what happens? It takes up that entire node, because nothing else is going on. If I was thinking in cores, I'd be convinced I had exactly one core, but I'm actually taking up the whole box. Whoops. What happens when I move to production? Would I get the same performance profile? Indeed I would not. I'd be like, what's going on? And would it be the same across all 1,000 boxes we were running? I think not. Oh my goodness. Okay, so now I'm just disillusioned, screaming at the Linux guys: why does this have to be so brain-damaging?
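That "only busy shares count" rule is easy to sketch; the share values here are illustrative, roughly what requests of 250m, 500m, and 250m would produce:

```python
# Idle containers drop out of the weight ratio entirely; only tasks
# that currently want the CPU are counted.
def cpu_fractions(shares, busy):
    """shares: name -> cpu.shares value; busy: names that are runnable."""
    total = sum(shares[name] for name in busy)
    return {name: shares[name] / total for name in busy}

shares = {"a": 256, "b": 512, "c": 256}

print(cpu_fractions(shares, {"a", "b", "c"}))  # a: 0.25, b: 0.5, c: 0.25
print(cpu_fractions(shares, {"b", "c"}))       # b: ~0.67, c: ~0.33, a idle
```

The moment "a" goes idle, "b" jumps from half the node to two-thirds of it, which is why a ratio computed from requests alone keeps lying to you.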
And it turns out it was just my mental model that was all wrong. You see, once these tasks, or threads, get put on a particular CPU, what we're really talking about with this weighted share system is who gets to run next on that particular CPU. Oh, well, that kind of makes sense now: if this task doesn't need to run on the CPU, it doesn't need to be in that share calculation of who gets to run next. Oh, thank goodness. Now I understand: I just misunderstood all of Kubernetes, no big deal, right?

Okay, to really understand this, though, we'll have to cover the Completely Fair Scheduler, and gang, we have 30 seconds per slide, so I need a little overconfidence from the crowd right now. We're going to learn the Completely Fair Scheduler in exactly 30 seconds. Are you feeling it this morning? Are you ready? Yes! That's the delusional overconfidence I'm talking about. All right, gang. Will, time us. Go! Sweet. All right, gang: this task is running, 100 milliseconds on the CPU, and it is hogging up all the CPU time. Meanwhile, this poor task is waiting patiently for that CPU; it has only had 10 milliseconds of runtime. That's not fair! Linux's fair scheduler to the rescue. It sees this and says: wait a minute, this task has the lowest runtime, let's put that on the CPU, until these two tasks have the same amount of time, their fair share of the CPU time. Bam. That's it, ladies and gentlemen: the Completely Fair Scheduler. Time? 39 seconds. Oh, come on. Come see my presentation on overconfidence tomorrow; I'm going to kill it, it'll be much better, I promise.

All right, all right. So why did I take you through that? What is a request, at the end of the day? A little hand-wavy here, but bear with me. My CPU hog is back: runtime, 100 milliseconds in time. What I'm actually saying with a request is: divide this runtime by the number of shares. This is advanced crayon math, hang with me, but when you divide a number, evidently it makes it smaller, right? And so when you have the shortest runtime, as
you guys, already experts now, know, the shortest time is the highest priority. Cool. And if I increase that share value, the runtime gets even shorter. Oh my goodness, we're messing with time again. Okay, so let's think about this for a minute: am I the master and commander of everything that's going onto the node, or is there maybe something else in play? We call that meeting the witness. Okay, here we go.

There was something I didn't tell you in all this: that 10-millisecond task was waiting on I/O, so it didn't even need to run on the CPU. Now, if I were running a real-time scheduler, what would happen? This CPU hog would get interrupted every so often: "hey, do I need to run something? No? No? No?" And I would waste all those CPU cycles. We don't want that, and that's where CFS shines. I want to let the CPU hog run as long as it wants, and when this packet comes back, I set a flag saying "I need to run now," because I want to run it as fast as possible. And you know what happens from here: lowest time wins, so it goes on. But it's an I/O task, so think about this: it's probably going to run on the CPU for a very short time and then go back to waiting. I/O tasks, by default, are probably going to have the smallest runtime. So what starts happening when I do aggressive requests and all this type of stuff on something that's already pretty aggressive? Oh wait, there's a lot going on here, evidently.

Cool, so how did I end up with this fictitious version of the universe? Well, it turns out I ran a CPU stress test in all the containers at the same time, and we know all containers being busy at once doesn't happen in the real world. I ran the node at a hundred percent busy; hopefully that's not happening in the real world. I had a fixed number of threads when they could be dynamic, and I had no I/O-bound tasks. So I got this fictitious version of the universe, and that's how that worked. Okay, so what would be this new mental
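A toy model of the pick-next behavior from that 30-second CFS tour; real CFS tracks a weighted virtual runtime with far more bookkeeping, but the "lowest weighted runtime wins" idea is the same:

```python
# Each task: (runtime_ms, weight). Dividing runtime by weight is how a
# bigger request (more shares) makes a task look "shorter" to the
# scheduler, raising its effective priority.
def pick_next(tasks):
    return min(tasks, key=lambda name: tasks[name][0] / tasks[name][1])

tasks = {"cpu_hog": (100.0, 1024), "io_task": (10.0, 1024)}
print(pick_next(tasks))  # io_task: lowest runtime wins

# Quadruple the hog's weight: its weighted runtime shrinks, but the
# just-woken I/O task is still shorter, so it still goes first.
tasks["cpu_hog"] = (100.0, 4096)
print(pick_next(tasks))  # still io_task (100/4096 > 10/1024)
```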
model that I'd have to paint, and what would it look like? In my head it looks something like this. I have two containers, call them task groups, each with four threads, or tasks, in them. There are a million places we could go with this, but I'm going to focus on this concept of saturation. You see, this is what my node really looks like. On this CPU I have one task; it's alone, so the cache is all warm, and it's at its peak performance. However, on this one, where I have two tasks going, it's going to take double the amount of time to get that same resource. Now, here's what I didn't appreciate: even when the CPU wasn't at a hundred percent, man, I was packing that node as dense as a neutron star. What's the big deal? Well, it turns out maybe that wasn't the best idea, and to understand why, I want you to do a thought experiment with me real quick, gang. Let's just pretend these two CPUs are 100% utilized. I'm going to stretch a point to make a point here. This task is actually perfect: it's 100% utilized, but that's fine, because it's perfectly optimized. But you let me bin-pack your nodes, and on this side, oh my goodness, I've got so many tasks context-switching that I'm not getting any real work done. Whoops. So would I see that I've saturated this node from a utilization metric? In fact, I wouldn't. But I would argue, gang, that for us multi-tenant people that need to pack things super dense, saturation is the more important metric of the two.

Now, I'm a little crestfallen; I'm not going to win the Nobel Prize for any of this. It turns out there has been a family of node metrics called Pressure Stall Information that tells you all of this wonderful data. We just hadn't turned it on. And when we did, the findings were pretty crazy. Sure, okay, there were like 50% of threads waiting and all that type of stuff; maybe we should use GOMAXPROCS or something to limit the thread counts. But interestingly enough, that wasn't the most
interesting thing we found. We found that we were stalled on memory up to 10% of the time; we were thrashing the memory on the box, it was saturated. And the most fascinating thing was the I/O stall: every single thread on the box was stalled up to 35% of the time. Whoops. We weren't even looking at this sort of thing. There's all kinds of amazing work by folks like Chris Down, who worked on cgroups v2 with the Meta guys, that I really recommend you watch. But the point is, I was super excited about all this, and I was thinking: who do I know that runs a large-scale multi-tenant cluster? My buddy Will. So I pick up the phone: "Hey Will, guess what? I don't know anything about Kubernetes!"

Oh, I know, man, I know. "Why, Shane? Tell me what's going on." And he's jabbering on like a monkey, and I'm like, "yeah, all right, all right, Shane," click. "Hey babe, I think Shane hit his head bungee jumping again." But in the morning I started thinking about some of the decisions we'd made at Acquia in a little bit of a different light. See, we were scaling our customer workloads on CPU utilization as a percent of requests, and I mean, that's a Kubernetes default, right? It practically comes out of the box. So could it really have bitten us that badly? Well, it turns out it could. This is one customer namespace over a period of about a week, and you can see we're churning through hundreds of pods an hour, thousands a day. And it turns out, after I looked into it a little bit, we did not need that; their load did not look like this. So we have all this churn, but why?

Well, what is "percent of requests"? I'd never really thought about it before. But when Shane tells us all this about requests, he's talking about these weights and the Linux fair scheduler and stuff like that. He's not talking about "using" a certain percent of your weight; what would that even mean? It turns out, in this context, percent of requests is simply the actual CPU usage per second
on average, and remember, Linux thinks in time. So this is the amount of time that something spent on the CPU. Well, not "something," actually; to be specific, it's all the processes and all the threads in that container, added up, on average, per second. And we just divide that by the requests. So if something ran for 500 milliseconds per second on average on the CPU, and it requested 1,000 millicores, we take those units, we throw them out the window, and we do 500 divided by 1,000 and get 50 percent of requests. Simple enough.

With that in mind, let's revisit an example Shane gave us a little earlier. Take two containers, and we're going to run them as hot as they can go; they're just going to eat up as much CPU as possible. The first container we're going to give one core as its request, and the second we're going to give three. It shouldn't surprise us at this point that on a four-core node, that first container is going to get one CPU-second per second on average, and the second one is going to get three. And when we do that division with Shane's special box of crayons, we do one second divided by one core and get a hundred percent. Simple enough, right? This is kind of intuitively how we'd expect all this stuff to work. But what happens if we change those requests around a little bit? See, Shane tells us it's just the ratio of these requests that matters. So as far as Linux is concerned, whether it's one core and three cores, or a hundred millicores and 300 millicores, it is, honest to goodness, the exact same situation. So we get the same performance: one second and three seconds. But when we do that math with our Kubernetes brains, we take one second divided by a hundred millicores and we get a thousand percent of requests. That's a little misleading, right? Wildly different numbers, a hundred percent and a thousand percent, that both describe the exact same situation as far as Linux sees it. But
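The division being described here, as a sketch:

```python
# Percent of requests: average CPU-seconds used per second of wall
# time, divided by the requested cores.
def percent_of_requests(cpu_seconds_per_second: float, request_cores: float) -> float:
    return cpu_seconds_per_second / request_cores * 100

print(percent_of_requests(0.5, 1.0))  # 50.0   -> 500m used vs a 1000m request
print(percent_of_requests(1.0, 1.0))  # 100.0  -> one core vs a 1-core request
print(percent_of_requests(1.0, 0.1))  # 1000.0 -> same usage, 100m request
```

The last two lines are the same workload getting the same CPU time; only the denominator changed.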
I wasn't quite convinced at this point. I mean, that's interesting and odd, maybe not how I expected it to work, but Shane's asking me to basically redo my entire scaling metric for all, you know, five or six thousand of our customers. So I needed to dive a little deeper. See, what we're really doing when we calculate this percent of requests, thinking this way, is comparing what's essentially a performance metric (how long something runs) with a number that, frankly, we're just kind of using to figure out how many pods we want to run on that node, how densely we want to pack it, which, as Shane talked about, is maybe a little more complicated than it looks. And we're making that comparison with really no regard to how Linux actually views that number. Don't get me wrong: sometimes that comparison is a perfectly valid one to make, and it gives us a lot of information. But by no means does it give us a complete and accurate view of what's actually going on in the system.

Let's look at a more extreme example. Say we have just one container, and every time it gets a request, like a web request, it has some work to churn through, and that work takes about eight seconds of CPU time. But say we weren't thinking in time. We say: okay, that's a pretty big container, it's doing a lot of work, I'll give it four cores. Four cores has to be enough, right? It ran on a four-core box before; four cores is plenty. Well, what happens when we run that on Kubernetes? We send it up to the API server and it says: great, you want four cores? I'll give you four cores, here you go. But let's say it schedules us on an eight-core node. When that container comes up on Linux with nothing else running on the box, it can really spread out over all eight cores, which means it can get its eight seconds of work done in just one second of real
time. Which is wonderful, right? We're getting a lot of work done as fast as we can. But when we look at our metrics, we're going to see eight seconds divided by four cores: that's 200% of requests. "Oh no," we say, "200% of requests! We must be overloaded; we must be putting too much work on this pod." So what do we do as Kubernetes aficionados when something's overloaded? We scale up; we add another pod. So let's say we do, let's say we just run kubectl scale with --replicas=2. The next pod comes up, and Kubernetes helps us out: it doesn't want to waste space, so it puts us on the same node and gives us another four cores. But now when that container comes up, it's not free-range anymore. Linux has to constrain both of those workloads equally, because they have the same request, to 50% of the time each. So now, in each one second of real time, they each get only four seconds on the CPU. But when we look at our percent of requests, we say: okay, four seconds over four cores, one hundred percent of requests. Hooray, we solved the problem, we're not overloaded anymore!

But is this a good thing? Did we help? We still have the same eight seconds of work; that didn't change. In the first case, with one container, it's getting done in one second, and by adding that second container we're doubling the amount of time it takes to get that work done. But if we just look at this percent of requests, which is all the HPA is doing, we could pretty easily mislead ourselves into thinking we're actually helping, that we're actually doing something good.

Okay, fine: requests. Maybe the metric is a little misleading, maybe it's not the whole picture. But maybe we should back up a little and ask a more fundamental question: what makes us think CPU is the right metric to scale our workloads on at all? Well, let's take these two
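The before-and-after arithmetic of that example, under the same assumptions (one 8-core node, perfectly parallel work, equal requests):

```python
# Eight CPU-seconds of work per web request, on an 8-core node.
WORK_CPU_SECONDS = 8.0

def wall_seconds(cores_available: float) -> float:
    """Real time to finish the work when it can use this many cores."""
    return WORK_CPU_SECONDS / cores_available

def percent_of_requests(cores_used: float, request_cores: float) -> float:
    return cores_used / request_cores * 100

# One pod alone: it spreads over all 8 cores.
print(wall_seconds(8), percent_of_requests(8, 4))  # 1.0 200.0 -> "overloaded!"
# Two pods on the same node: 4 cores each.
print(wall_seconds(4), percent_of_requests(4, 4))  # 2.0 100.0 -> "fixed!"
```

The metric went from 200% to a healthy-looking 100%, while every request got twice as slow; that's the trap being described.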
web applications. Say they're both saturated: they're both taking as many web requests as they possibly can. The one on the left, every web request it gets, it's just doing some data processing, number crunching, something like that, and it all shows up on the CPU. Wonderful. But more typical web applications don't necessarily do that. Say we're waiting on a database, or calling out to some external service; it doesn't really matter, none of that shows up as CPU utilization. So we have two wildly different CPU utilizations and the same saturation. It turns out CPU utilization just tells us what the app is doing, namely whether it's spending its time on the CPU or doing something else.

Fine. Okay. At this point I give Shane a call back and I say: "Shane, I don't say this enough, but you were right. I don't like percent of requests anymore. I can't trust it. CPU doesn't even work for my application. What do I do? What do I scale on? How do I solve this problem?" And he looks at me with that confident glint in his eyes, that 39-second confident glint, and he says: "requests per second." Which is what all the blogs say, right? Scale your applications on requests per second, because of that specific CPU problem. But it turns out requests per second has kind of exactly the same problem. See, we run a lot of different applications at Acquia. Some are really simple sites, maybe serving static content, not a complicated request to fulfill, so they can take a lot of requests per second. But the more complicated sites, spending a lot of time in the database, maybe with some crazy proprietary back-end stuff, or, let's be honest, maybe just not the most optimized in the world, can't take as many requests per second. No judgment. Requests per second, just like CPU, tells
us what the application is doing, namely how fast it's processing requests, but it doesn't give us the context to know how many more it can take before we actually need another replica, before it's going to fall over. So we really sat back and thought about it: okay, what do we want to scale on? And we decided that any metric worth scaling on needs, at the very least, to be highly correlated with what's actually going on inside the application. We also wanted something smooth; we didn't want these big spikes up and down, wasting all that time scheduling all those pods. And lastly, we wanted a signal early enough that we could actually respond to it. So not something like latency, which, sure, is very highly correlated, but if you're already seeing high latency, you've already failed; you're scaling too late.

See, every application has this kind of curve as you apply load. When the load is light, your application handles it no problem, but maybe it's not as utilized as it otherwise could be, and if you scale at that point, you end up just kind of scaling emptiness; you're not taking good advantage of the resources you have available. But if you push the load too far, you end up introducing latency, or drops, or errors, or crashes, or OOMs, or whatever. Hopefully not, but you end up with issues in your application. What we were looking for is a metric that told us where that inflection point was, right before we start introducing issues, and we wanted to set our HPA target a little before that, the idea being that as we scaled up and down, we'd hold the majority of our replicas in that optimized zone.

Now, for us, that ended up being essentially the number of threads active at any one time. It didn't tell us what those threads were doing, and it didn't need to. It just told us: hey, they're busy, they can't take another
request. And when you have a fixed number of threads, like Shane was saying, you know roughly how much of that you can take before you need another replica. Now, for you, this metric is very likely to be something different, depending on the application, the language, and all that. But the point is, the process we went through to find that metric forced us to learn a lot about our application: what happened when we applied load to it, and what it looked like when it was about to fall over. And I'm happy to report we started rolling this out to production, and we drastically reduced the churn. Turns out when you're scaling on the proper metric, maybe you don't need to be running 73,000 pods every single day. Now when our customers scale, it's because they actually need to, because they need the threads, which makes us really happy, because it greatly simplifies our cluster setup. We don't have this crazy churn and all the problems that come with it (that's a whole other talk). And it makes our customers happy, because a lot of customers aren't particularly CPU-heavy; sometimes they wait a long time on the database, and now they scale just as well as the high-CPU sites always did. Well, better, everybody scales better, but you get the idea.

There we go, our time together is at an end, but two things before you go. You see, we got caught up thinking that what we needed was WebAssembly, injected Envoy, eBPF, and it was the fundamentals we had the whole time. When we focused on those fundamentals, that's when we got the biggest gains. We kept thinking someone was going to show up with some best practice on a silver platter, and it was always wrong, because we had to measure what was going on to really understand what the best practices were for our cluster. It turns out Prometheus is the way to do that, but it also turns out everything we knew about all that stuff was wrong too, and that's an even more
fascinating universe, but that's a tale for another day. On a personal note, gang, I just want to give you a heartfelt thank-you. I got rejected from KubeCon three times in a row, and this was like my last big shot, so the fact that it's standing room only and 700 of you showed up, I can't tell you what that means to us. This was a big labor of love; we spent three months on it. We'd love to get your feedback, whether you dig this kind of stuff or you'd rather hear from, like, adults on this sort of thing in the future, so if you fill that survey out we'd be grateful. Gang, we will stay as long as it takes to answer all your questions; sorry we couldn't take them live. Have a wonderful rest of your KubeCon.