Hi everybody, my name is Marius Grigoriu. I'm a senior dev manager at Nordstrom in the spaces of Kubernetes, CI/CD, and build tools. And hello, my name is Emmanuel Gomez. I'm a principal engineer at Nordstrom. Just want to say there's a ton of really great content going on, so it's exciting to see you all here. Thanks for coming, and it's an honor to be here to present for you today.

Yes, and welcome to "101 Ways to Crash Your Cluster." The two of us have been heavily involved with the adoption of Kubernetes at Nordstrom, all the way from incubation to implementation and now to scaling. Nordstrom is a fashion specialty retailer founded in 1901, based over in Seattle, Washington. We've got hundreds of stores nationwide and recently opened some stores in Puerto Rico and up in Canada. We love Kubernetes, and we're serious about using Kubernetes at Nordstrom. We're using it for a wide gamut of microservices powering different parts of our business: reviews, gift cards, purchase orders, and so on, and even parts of Nordstrom.com that are a bit more recently updated. We also rely on Kubernetes to run a bunch of the dev tools, things that our engineers use, and not only the dev tools but also the enterprise logging and monitoring stack; parts of that are running on Kubernetes. So we use Kubernetes for a lot. But if you're interested in stories with happy endings, then you'd be better off going to some other talk. Don't get me wrong, Kubernetes is great, but just go to any other talk.
You'll hear about how wonderful it is. It is our solemn duty to bring to light tales of mishaps, bad luck, and calamity. We'll be telling a series of short stories, beginning with unfortunate events down at the node level, and we'll work our way up to disastrous cluster-wide incidents. And as we start off, don't discount those node-level events, because before you know it they can spread across your cluster and you'll have a major problem on your hands. You in the audience are under no obligation to remain; honestly, if you have a weak stomach, I advise you to turn away immediately and find something more pleasant instead.

So our first tale begins the way most tales and issues tend to begin: you see a node go NotReady. When that happens, the first thing we do is run kubectl describe node to see what's going on. In this case: "Kubelet stopped posting node status." Not uncommon at all. It just so happens that our end users also said, hey, our applications running on this node have stopped responding. So we decided we wanted to take a look at what was going on on that node firsthand. We SSH into it and just sit around and wait, wait, wait; looks like sshd is also unresponsive. Meanwhile we go digging through our logs, but they're scrolling by so fast, and we're not really sure what we're looking for. Eventually SSH gets through and everything is nice and good; the node seems to have become happy again by the time we got there. So unfortunately we couldn't see anything once we finally got onto the node.

But then we looked into the past. We collect our metrics into Prometheus, and what we found was that memory utilization on this node reached up into the 90s, and then, right before we were able to log back in, there was a crash and a lot of memory got freed up. Suddenly that got correlated, so we knew we were looking for
something related to memory utilization, or memory kills. Now we were able to query our logs for something more appropriate, and we found something like this in the kernel messages: OOM kills, and page allocation stalls on that node. We wanted to really understand the mechanism here, because we didn't run out of memory, not entirely. We went to Google about page allocation stalls, and back then there was nothing. If you actually do that now, you'll find some other unfortunate Kubernetes user running into the same problem. But there was not a lot of information, so we cracked open the Linux kernel source code, searched for that string, and read what I think is page_alloc.c. What that tells us is that when you go to allocate a page of RAM, the kernel is going to go look for it, and it may need to perform memory compaction. If your memory utilization is really high, then it could take a really long time to get that memory. Meanwhile, none of your applications are running, because everything is frozen. Eventually, if it takes too long, that's going to time out, and that's when the OOM killer gets involved.

Not a pleasant experience. It turns out the way this happened is that we were running Prometheus on our node, and Prometheus gobbles gobs of RAM. We also had a ton of Flink pods on the same node that were dormant, and one of our users would initiate a job and all the pods would wake up at the same time, burst in terms of memory, our memory utilization would go sky-high, and then the machine would freeze. That's what happened, and the way we got to that information was really just following a standard checklist of troubleshooting steps. Now, this is really pared down.
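The kind of Prometheus query we used to spot this can be sketched roughly as follows; the metric names here assume node_exporter and may differ in your metric set:

```promql
# Fraction of memory in use per node; values climbing into the 0.9s
# preceded the freeze and the OOM kills described above.
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```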
In the event of a node going NotReady with high memory utilization: we look at the node, we figure out what's going on, "Kubelet stopped posting node status," then we look for signs of high resource utilization, and finally the OOM kills.

So what was the fix? The fix was to set our kubelet flags correctly, primarily eviction thresholds, hard eviction actually. There's documentation that tells you all about this; I suggest you read it very closely, because it's easy to get wrong. What you want is for the kubelet to enforce a global memory limit across all pods, and once that's hit, you want it to evict immediately. You want to evict early, because it takes time for the kubelet to respond to memory pressure, and you don't want it bumping up against the Linux OOM killer and your node going unresponsive. Once you're doing that, I would also recommend you look into the kube-reserved, and maybe even system-reserved, flags, to make sure that your node agents and your system daemons all have enough resources to operate.

So that was our small story, and it was fortunate that this didn't spread to multiple nodes; it was fairly contained. But you could imagine that somebody could write a page-allocation-stall DaemonSet that would just rip through the cluster and cause havoc. Well, we didn't need that for something to rip through the cluster and cause havoc. We ended up seeing this one day. There were more nodes down there which actually were Ready; these are the NotReady ones, which is about half of them. The first thing you want to do is panic, and then the second thing you want to do is not panic. So we resort back to our runbook, simplified down. We take a look: what's the node up to? "Kubelet stopped posting node status," yet again.
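As a sketch, with illustrative values rather than our actual production numbers, the kubelet flags being described look roughly like this:

```
--eviction-hard=memory.available<500Mi
--kube-reserved=cpu=500m,memory=1Gi
--system-reserved=cpu=500m,memory=1Gi
```

The idea is that the hard eviction threshold fires while memory is still available, so the kubelet evicts pods before page allocation stalls and the kernel's OOM killer ever enter the picture.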
We want to look for signs of high resource utilization, and in this case we found that resource utilization was not consistent across the nodes that were affected; in fact, a lot of the nodes were not utilized at all. So it couldn't have been that. When you see a lot of nodes just pop out of existence all at the same time, it makes you wonder: maybe this is a networking problem, kind of like a netsplit from the IRC days. So we go to our cloud provider, which happens to be AWS, and we take a look. What's the networking status? Is it green? Yes, it's green, or they claim it's green. It's always the network, right? Always blame the network. There's only one way to find out, so we actually go try to log in to the machines that are currently posting NotReady. No problem, we're able to log in. Then we say, well, let's try to hit the API server from these machines, because maybe that connection is somehow down. No problem, we're able to reach the API server from these nodes. So what's going on? It doesn't look like a networking issue; it must be something else. We look through our kubelet logs, we look through our API server logs, and we just find a lot of warnings, a lot of information, a lot of red herrings, absolutely nothing that we could identify as a cause. And then eventually everything just kind of cleared up, and we could walk away and pat ourselves on the back.
Job well done. Except it happened again at some point, months later, on a different cluster. And then again and again. This was a really hard one to track down, but each time it happened we gathered clues, and eventually we were able to put something together, and it looks like this. Usually 50% of the nodes go NotReady all at the same time, but sometimes 33% or even 25%. Every time this happens, all the nodes pop at the same time, and they also happen to come back at the same time. And in fact, every time they go NotReady and then get Ready again, the window where they're NotReady is always somehow almost exactly 15 minutes. Now it's starting to feel like we're dealing with a timeout. Maybe it is network related after all, but maybe we looked at it the wrong way. So we scratch our heads: what did we not look at yet? It turns out what we didn't look at was our load balancer logs. So let's go to those ELB logs.

A little bit of our architecture: we have high-availability masters, which means we have multiple IP addresses for the API servers, and the kubelets need to go talk to one of them. We've got an ELB, a classic ELB, in between. We look at the ELB logs, and they said: oh, we're going to scale down one of the instances behind the scenes, transparent to you; you don't know about this until you know about it. And that preceded the incident every single time. So what is this all about? How is this happening? Well, it turns out there were actually multiple issues over on GitHub, but this is the more general case, because this is not an AWS-specific problem. The problem was that the kubelet didn't have a timeout on the heartbeat back to the API server. And so the 15 minutes?
Well, that was TCP retransmission. The quick fix for us was to use a Network Load Balancer; the claim is that NLBs can handle connections that stay open for months or years. I'm not sure how they tested the "open for years" part; maybe a time machine, time machines solve all problems. But it was finally fixed in 1.7.8, so if you're running a current version of Kubernetes, you probably don't have to worry about this instance of this happening. But you never know. I'm actually going to hand it over to Emmanuel now, who's going to tell us a story about the day that our robots turned against us.

Yeah, so here we move from node-level outages, well, multi-node-level outages, to cluster components going haywire. As Marius mentioned, we run on AWS as a cloud provider, and we run the cluster autoscaler component as a way to manage our spend and keep our infrastructure provisioned according to our workloads. One day we came to find that our cluster was shrinking; actually, we found out after the fact that our cluster had shrunk to something around a third of its original size. This came up as a user report of "my workloads aren't running." That's never a happy moment as a cluster operator. First stop, we obviously consulted the API server to see what nodes were being reported, found the shrunken size, and then proceeded to check the scaling history of our AWS autoscaling group, which helpfully reported that yes, those nodes had been terminated, and yes, that had been requested and happily complied with. The next stop was to check the cluster autoscaler, because this component was running and interacting with the autoscaling group APIs.
That was a logical culprit, and the autoscaler logs reported that utilization was observed at 0.0. We have the autoscaler set to both scale up and scale down, and when it observes zero utilization, it's going to happily scale down. Problem was, utilization was not zero. So we were scratching our heads about this for a while. We had this one incident, which was catastrophic, significantly impactful: our cluster was still operable, but a number of workloads were impacted. It didn't recur at the same scale, and this is truly a sad tale, because we have not been able to determine true root cause on this issue. A couple of things made that difficult. We had diagnostic data age out, in terms of the numeric telemetry and logs that were aggregated. We were trying to use that data while it was available to understand the issue, but we didn't get to the root cause during that period of time, and we didn't do the work to durably capture that data in a long-term way so we could go back and look at it. Furthermore, that means we didn't have the data to open an upstream issue about this and potentially save somebody else the trouble.

Now, a couple of things about how we worked around it. As it says here, we extended the smoothing function.
We slowed down its scale-in behavior, both reducing the number of nodes it would scale in at once and extending the interval between scale-in events. That gave us time to respond, so that we wouldn't get bulk-killed. This is clearly a workaround, but even if this is a middle-of-the-night incident, it at least gives us time to get somebody at a keyboard to intervene before we're in a catastrophic situation. In addition, we upgraded the cluster autoscaler component to a version which supports significantly better observability in terms of the metrics it exports and makes visible. But there are still a couple of metrics that we haven't yet been able to introduce to the cluster autoscaler. We'd like it to be able to report what will happen: it reports what's unneeded, but not the actual intended scale-in that it's going to perform. This is something we know how to do; we just haven't gotten a chance to get to it yet.

Along the way we learned some interesting things. Specifically, and this may be germane,
it's somewhat speculative because we couldn't conclusively prove this, but I suspect it's germane to what hit us that day. The first interesting thing we learned in this investigation concerns the Kubernetes service, the ClusterIP service that Kubernetes creates by default to be able to address the API server; it often shows up as kubernetes.default.svc.cluster.local. There is session affinity set on that service, implicitly. You don't create that service; it's created by the API server. If you're running in a high-availability configuration and you have multiple API servers present in your cluster, there are a number of different knobs and possible configurations you could be in, and you could be advertising a load balancer. But if you're connecting directly to an API server, your clients that address it that way will be pinned to one specific API server instance. On the face of it, that's not so surprising; there are a number of reasons why that's the case, and this issue explains a bit.

However, the second thing we learned is that the behavior of API servers when you're running with the --apiserver-count flag is surprising. In short, what happens is that readiness is not respected when the --apiserver-count flag is in play. In our case, we run a five-node control plane: five instances of etcd, five API servers, scheduler and controller manager in HA configurations, and we run with --apiserver-count set to five. Seemed logical. Now, it turns out that the behavior with that setting is that all five of those API servers remain in the endpoint set for that service at all times. So if you're in a maintenance event and you don't change the count value when you bring down one of your API servers, 20% of the traffic addressing those API servers over that cluster service is going to fail. And it could actually be more, because of the session affinity based on client IP. You can end up in situations where you have
hotspots, because during maintenance operations the clients could consolidate down to fewer than your total set of API servers. Now, the pull request referenced here on the --apiserver-count issue helpfully suggests that the behavior with --apiserver-count set greater than one is worse than just leaving it set to one. I tend to agree. Basically, there's upstream work to change this, but it's something to be aware of wherever you have components addressing the API server over that service. We've seen it in a couple of issues around the scheduler and controller manager as well, especially during maintenance operations. That's where it's come up for us: when we're maintaining the control plane nodes and have to take one of them out of service. We've developed our maintenance procedures to work around this, because we got hit by it a couple of times. So, something to be aware of.

And the further speculation, getting out past what I can strongly assert: one possible explanation we hypothesized for this cluster autoscaler incident, and the 0.0 utilization observed in the logs, is that under some set of circumstances, which I can't name, the cluster autoscaler, when talking to the API server's metrics API, hit some network connectivity problem, and the response was deserialized into in-memory representations in Go. Go represents uninitialized values as the zero value of the type, so, and again this is speculative, that could potentially have resulted in an uninitialized float being reported as 0.0, and then acted upon, resulting in the scale-in behavior we described. Don't let this happen to you.
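The Go zero-value behavior behind this speculation is easy to demonstrate. This is a hypothetical, pared-down sketch, not the autoscaler's actual types: if a field is absent from a JSON payload, encoding/json leaves it at its zero value, so a float64 silently reads as 0.0 with no error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeMetrics is a hypothetical stand-in for the kind of struct a
// metrics response might be decoded into; not the autoscaler's real type.
type nodeMetrics struct {
	Utilization float64 `json:"utilization"`
}

// parseUtilization decodes a JSON payload; if the field is missing,
// Go leaves it at its zero value, 0.0, and reports no error.
func parseUtilization(payload []byte) (float64, error) {
	var m nodeMetrics
	if err := json.Unmarshal(payload, &m); err != nil {
		return 0, err
	}
	return m.Utilization, nil
}

func main() {
	u, err := parseUtilization([]byte(`{}`))
	fmt.Println(u, err) // 0 <nil>: an absent field reads as zero utilization
	u, err = parseUtilization([]byte(`{"utilization":0.42}`))
	fmt.Println(u, err)
}
```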
As a side note on the smoothing-function slowdown: we've upgraded to the newest release of the cluster autoscaler, and we haven't been hit by this behavior at that scale again. But we have, once or twice, very occasionally, seen the cluster autoscaler logs report that observed utilization was zero when in fact it was not. So, to throw a little bit of fun around: I guess I don't personally completely understand the issue, but it may still be lurking out there in the woods.

Now I'm going to move on. We had a truly catastrophic outage, which I'm calling the split-personality etcd cluster. I was describing a bit about how our control plane maintenance procedures had grown to work around this interaction of --apiserver-count and session affinity on the API server communications from components that talk to the API server over the built-in ClusterIP service. We occasionally have to perform control plane node maintenance due to configuration updates. We treat our nodes, control plane nodes included, as immutable, and when we need to update the underlying node configuration, we will almost universally terminate nodes one by one, roll through the set, and replace them with nodes of the new configuration. Given what I just described, that means removing a member from the etcd quorum, and it also involves decrementing the --apiserver-count flag on all the running API servers. Then we start a new machine, manually add it to the etcd quorum, and then go back to all of the API servers and increment that --apiserver-count flag. That allows us to work around the negative behavior I described a moment ago. I'm telling you all this because it's going to come back later.

Now, in this incident we had a situation where nothing made sense.
We were deeply confused. We would run kubectl get pods and we would get two different sets of data, alternating back and forth. We would run kubectl get nodes and we would get two different sets of data, alternating back and forth. We couldn't make sense of it. It seemed like there were two opinions of what the state of the cluster was.

Here's how it came up to us. This is not a conversation you ever want to have as a cluster operator, and not because this user is a self-identified Kube fan, but because a pod starting up and vanishing is not something that's supposed to happen here. When we went and looked, we'd see this, and then this, and it flashed by pretty quickly. It would go back and forth between these two states, something like this, although obviously depending on what resources you were looking at, you'd see different things. This was confusing, to say the least. And it wasn't just about kubectl and our interactive querying of the state of the cluster: the control loops started misbehaving. This manifested as the controller manager acting on incorrect data, and that meant thousands of pods getting spun up and terminated. The controller manager would query the API servers, get alternating sets of data, and then helpfully, rapidly act on those alternating sets of data, trying to converge to some state. But it's hard to converge when you're getting conflicting views of the world. The next set of horrible symptoms: service endpoints were thrashing, and our ingress controller started to do very bad things to the traffic that should have been flowing through it. Or rather, the ingress controller itself was reasonably okay, but it was responding to these conflicting signals. So there's some bad news here.
This was a full cluster outage on our primary production cluster. We weren't simply out of service; it didn't just go away, it was violently wrong. And our time to resolution was brutal: about a four-hour outage until we were able to restore service by way of a replacement cluster. A number of contributing factors led to that significant a time to resolution. One was simply confusion, but also we were reluctant to replace the cluster and spent quite a while troubleshooting and diagnosing. We knew that replacement would mean redeploying applications on the new cluster, and there was a fairly distributed relationship in terms of teams utilizing features that are challenging to simply pick up and move from cluster to cluster. We did end up replacing the cluster, so all of our reluctance was for naught, and it contributed to the duration of our outage.

Some specific things here. Volumes are challenging: volumes are exclusively bound to one node, so when we brought up workloads on the new cluster, we had to go manually, one by one, and make sure those volumes were freed from the errant cluster in order to be able to bind them on the new cluster. But load balancers were actually somewhat more challenging. Although they were much fewer in number, a handful of the teams using load balancers had created DNS records pointing to the ephemeral name that Kubernetes creates the load balancer with, and they had to manually update those DNS records to point to the new load balancer names created on the new cluster. One could wish for a little better support on the Kubernetes side for claiming ownership of an existing load balancer. I think it's maybe possible, but it's not a supported use case.
So basically migrating load balancers is is not Realistic there's other ways to work around this in terms of managing the DNS records on the kubernetes side is is something that we've looked at But haven't gotten into place yet. It's There's some other things that make that challenging There's some good news though. So this happened during working hours. Hooray. Nobody got woke up in the middle of the night so we had full team presence and we were able to kind of bring our full resources to bear and Another bit of good news is that we were able to analyze this one and and actually get down to a root cause and And that root cause understanding did drive or did lead to significant improvements in our sort of understanding of the system It's behavior under this failure mode Our code and and our procedures around that Now you may be asking but wait, how could this possibly happen at CDs a consistent key value store, right? That's the whole point It is it is and I'm not making any claim We didn't observe split split brain as described which is a breakage of the consensus model itself but what we did see was stale data and it's Not super widely known. I don't think but it is indeed documented that at CD will return stale data there are known conditions under which that can occur and Typically, this is very small increments where the at CD members. 
I have you know some Small degree of replication lag in between them a leader will accept rights and then replicate those to followers But under certain sets of scenarios that lag can get very very significant and It can also get if you have very significant lag You can also in inter situations where you're having leader thrashing Because leader election timeouts can be repeatedly crossed causing multiple elections back to back to back And like I said, this is documented and in fact we had even read the manual and Knew that there was at least some theoretical possibility of this At the time which is to say earlier this year six months ago There wasn't widespread agreement that this that oh, sorry, let me back up for just a second so one piece of mitigation here resolution actually is to always ensure demand that at CD perform Quorum operation when reads are are happening. So stale data is In the read path not in the right path. So writes are always consistent You'll always get an act that reflects the quorum of the cluster the at CD cluster But on the read path at CD can return data that's only known to the local host the local at CD instance that you're querying And that's the case unless you expressly request a quorum read quorum read involves a quorum of the cluster and It happens to be the case that Kubernetes API server and its default configuration does not request a quorum read and It's not mentioned in the a day HA Docs or at least until October of this year. It wasn't mentioned in the HA Docs There's another male culpa here. We we After having observed this and then immediately going after resolving the issue immediately going to the HA Docs and scratching our head saying How did we miss that? Found that it wasn't there. It is now. 
So read the docs. Read the docs like it's Sunday at noon and the docs are church: they're changing, and you have to stay up to date. Some of the reason this wasn't the recommended configuration from the beginning is that there were concerns about the performance impact: when you perform a quorum read, the read path has to talk to a quorum, a majority of your etcd members. etcd 3.1 greatly reduces the impact of that, because reads don't have to hit disk anymore; previously a quorum read meant a disk hit on a majority of your etcd cluster. And in Kubernetes 1.9, quorum reads become the default behavior. So if you're running 1.9, you're safe; if you're not running 1.9, you may not be. This might be something to go look at if you're running your own API server configs.

In our case, we know what happened: we got hit by write latency. etcd is extremely sensitive to write latency. This is healthy write latency: we're talking single-digit milliseconds. This is a happy etcd instance, or actually a cluster of five etcd instances. This is not a happy etcd instance. If you start crossing double-digit or multiple hundreds of milliseconds of write latency, and this is specifically the database sync lag, you're entering a world of hurt. If you're not using quorum reads, you're potentially going to start to see behavior like we saw; even with quorum reads, you're going to see tremendously degraded performance, because many of your operations are going to see significant lag.

So how is it that we missed such an important thing on such a critical database? The answer is really that there's just so much to go figure out if you're running a Kubernetes cluster. And to kind of demonstrate what I'm trying to talk about:
Let's talk about what happens at the node level. This is from a great site published by Brendan Gregg, who posts a lot about Linux performance; if you need to troubleshoot performance issues on a Linux node, this is a pretty good roadmap. As you can see, there are a lot of different components you have to worry about, and for each component there's a different set of tools. Now, as cluster operators, we don't run one node; we run dozens of nodes, each in dozens of clusters. So you really need to stay on top of this. And this isn't even the big picture; this is zoomed in down to the node. If we zoom all the way back out to the ecosystem, this is what we have to deal with, and in fact Linux isn't even anywhere on this slide. So not only do you need to deal with the operating system; at the cloud and provisioning level, you need to know the quirks of your cloud providers, because maybe they're going to transparently replace or scale in some of the nodes behind your load balancer, doing crazy things to your network traffic. You need to know about your container runtime. You need to know about that overlay network plugin you're running: how is it performing for you? You need to know, obviously, about Kubernetes itself: how the scheduler interacts with the API server, how the kubelets talk to it, the controller managers, everything. And then don't forget that super-critical 300-megabyte database called etcd.
It's really sensitive to write latencies. As you can tell, the surface area is huge. Absolutely huge. But here's the thing: full-stack DevOps teams have had to deal with something like this for years, because it may not have been Kubernetes, but they still had to deal with their applications, the deployments, the cloud, the provisioning, the operating systems of virtual machines, the networking, the DNS, the load balancers, all of that stuff. What Kubernetes provides for the application teams, at least, is a simple interface through which to run their applications, pushing all that responsibility onto somebody else: the cluster operators. So that's the good news, good for them. As for us cluster operators, well, any number of small things can turn into a big issue. And so surely, out of this diagram and the one before, there must be 101 or more ways for your cluster to come tumbling down. Thank you.