Okay. Hello everyone. Thank you for taking your time and staying for one of the last sessions at KubeCon. I hope everyone had a great time, and hopefully I'll see you at the next one so we can discuss more. But today we have one of the last sessions, so let's finish it with some secrets of running etcd. I'm Marek Siarkowicz. I've been a maintainer of etcd for the last two years, and with the recent creation of SIG etcd, the special interest group in Kubernetes, I'm the tech lead of the SIG. I'm also working at GKE, making sure that if you run on Google, your Kubernetes clusters keep running and the etcd we run is the best we can bring you. Today I want to share some of my experience and my view on etcd reliability as I have seen it in production.

The agenda for today: we'll look at some simple cases of failures in distributed systems. I want to focus especially on cluster-scope failures, which in my experience are the biggest cause of the failures I have seen. I will propose a couple of mitigations for how you can avoid those problems, and we'll finish with the secret, though maybe some people already know it.

So, failures in distributed systems. etcd is really great at handling failures of single members. This is because it uses the Raft consensus algorithm, which allows it to survive the failure of one, maybe two nodes, depending on your cluster size. As long as the quorum, the majority of the members, is alive, the cluster can proceed, and this is great for things like network failures and short disconnections: temporary issues that you don't want to have to think about. This works because of Raft. Raft is an algorithm that takes any number of concurrent requests and gives us a single, ordered stream of requests, properly distributed to all members. This allows us to be sure that every member has the same data and every member ends up in the same state. Unfortunately, it also means that every time there is a developer mistake, an application issue, or even a corruption in the data, it is just as easy to replicate the failure to all members as it is to replicate correct behavior.

So today I want to give you a couple of examples of errors like this, where the whole cluster can suffer because of an issue that was either hard to predict in etcd, or because Kubernetes uses etcd in a way that makes it less reliable than it could be, or even issues that we could not predict at all and that live in the Go language itself.

I want to start with a pretty simple case. I think everyone that runs Kubernetes has heard about events and uses events. But as stable as Kubernetes is, a lot of people still have issues with storing and persisting events reliably. Here I want to discuss a case where the way Kubernetes uses events can easily lead to production downtime if you don't handle them properly. So what is a Kubernetes event? Kubernetes stores two distinct types of resources in etcd. The first is objects, which represent the state, the intent, and the status of what is happening in your cluster. They are of critical importance, so we need to persist them and we need to guarantee their delivery. They are never deleted until we want them deleted: either the intent of the developer or administrator changes, or they are garbage collected because they are no longer needed.
And if you run clusters of a certain size, you can easily predict the number of objects in the cluster, because you can multiply the number of nodes by the number of pods you run per node, add some deployments, and there is some multiplier that gives you an upper limit.

On the other hand, there are events. Events are a way for developers that deploy on Kubernetes to get access to debug information that usually only an administrator can see. So when the scheduler makes a decision and it fails because there is no node in the cluster that matches the pod selector, the selector that was chosen is the decision of the developer. Kubernetes makes it really easy for the developer to run kubectl describe and see why a pod was not scheduled. Having those logs is very useful for debugging, but some of their properties are different from the state. They are not critically important; they are best effort. Kubernetes will even intentionally drop them and avoid sending them if it runs out of resources or there is some network issue. So, as an FYI, you should not rely on events, because their delivery is not guaranteed. They are not critical, and you cannot always depend on them being available. Usually they are configured to be deleted after two hours; this can change depending on the release or distribution of Kubernetes you run. And they have a tendency to really explode when there is some failure, because if a couple of nodes get disconnected, you get a lot of events, not only about the nodes: you also get events about the pods that cannot be scheduled, and then about deployments that cannot get the number of pods they requested. So there is a big bump. They accumulate mostly during failures, one failure can easily lead to another, and they can really grow in size.

The problem with Kubernetes events is that they use etcd leases, and they use them in a somewhat incorrect way. Leases were designed for short durations, to allow etcd to provide distributed primitives like leader election, so for times like five or ten seconds. That requirement means etcd doesn't really persist their state. When a lease is created we save its TTL, but we don't persist any updates or checkpoints throughout the lifetime of the lease. During that time etcd members can go down and come back up, and the leader can change. And because etcd leases are only counted down by the leader, every time your leader gets disconnected from the cluster, it is easy for the remaining lease time to be reset. Kubernetes directly uses leases to provide the TTL for events. So if the leader changes and the lease timeout is reset, events that should have been deleted half an hour ago can still live for another hour. The two-hour TTL makes this really unfortunate. Every time there is a leader election or leader change, the number of live leases can grow dramatically, and a couple of unfortunate events can cause a full explosion and a domino effect on the size of etcd. This cannot really be avoided on default Kubernetes clusters; you need to make direct changes to how you architect your control plane and how you configure your etcd to handle it.
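To make the lease mechanics concrete, here is a minimal sketch using the etcd clientv3 Go API that binds a key to a lease, roughly in the spirit of how the apiserver attaches a TTL to events. The endpoint, key, and TTL value are illustrative assumptions, not the actual kube-apiserver code.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local etcd endpoint (hypothetical address).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Grant a lease with a TTL. 7200s is only an illustrative value;
	// the real event TTL depends on how your distribution configures
	// the apiserver.
	lease, err := cli.Grant(ctx, 7200)
	if err != nil {
		panic(err)
	}

	// Write a key bound to that lease. When the lease expires, etcd
	// deletes the key. The countdown happens only on the current leader,
	// which is why a leader change can reset the remaining TTL unless
	// lease checkpointing is enabled.
	_, err = cli.Put(ctx, "/registry/events/default/example", "event payload",
		clientv3.WithLease(lease.ID))
	if err != nil {
		panic(err)
	}
	fmt.Printf("key bound to lease %x with TTL %ds\n", lease.ID, lease.TTL)
}
```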
So because events are best effort and not really a durable, critical thing for your cluster, you should really think about separating the etcd instance that stores them, which also makes it much easier to run that instance as non-persistent. That way, even if the cluster goes down because of events, it can be rebooted without the load of leases causing a prolonged downtime. If you still want to persist the information that is available in events, you can easily watch them; there are ready-made solutions to export events to your logging solution, like Elasticsearch or your preferred cloud. You can reduce the TTL to something like five minutes, which makes etcd much more resilient to failures, and then have a separate process that watches events and exports them to your logging solution, so developers can still read them and debug historical issues. A newer thing that was introduced around a year ago is lease checkpointing. It's a direct fix in etcd to make the Kubernetes use case reliable, and it came in two iterations. The first prevents leader elections from causing the TTL to be reset, by having the leader send an update and checkpoint to the other members every five minutes: "hey, I counted down five minutes, please remember this and reduce your TTL." The second added persistence, so that if your whole etcd cluster goes down, the checkpoint is not only in memory on the members but also on disk. So even if you shut down your full cluster, the remaining time is persisted.

The second issue I want to discuss is the Kubernetes quota deadlock. If you ever looked into how etcd stores data on disk, you may have encountered a situation where Kubernetes cannot progress because etcd complains about being out of quota. This is the mechanism underlying that. etcd stores both the latest state and the history of all changes that happened, so there are two dimensions to the size of the database, and the quota is a mechanism to prevent performance regressions in etcd. If the state of your etcd grew too much, or the number of changes increased too much, you might run into etcd hitting the database size limit. In that situation, etcd raises an alarm and requires someone to remove the unnecessary data and release the quota alarm.

Let's now look at the two mechanisms driving the size of etcd. The first one is compaction. It's the mechanism responsible for cutting off and removing the long tail of changes in etcd that are no longer accessed or useful. It takes the full history of all resource versions in Kubernetes that might be used and marks everything before one revision as unavailable, which allows etcd to clean up the old revisions and reclaim space. The second mechanism that drives the size of the database is defragmentation. Unfortunately, etcd's algorithm for selecting and using disk pages is still from around the Windows 95 era, which means that from time to time you need to defrag. Based on my experience it's less about performance; it's more that the size of the database file never decreases on its own. So for your sanity, and to make sure usage doesn't creep toward the quota and you don't feel the pressure of getting close to it, you should defrag your etcd to clean up the unoptimized page layout. The two mechanisms work together to reduce the size of etcd.
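As a rough illustration of the compaction side, here is a minimal clientv3 sketch that compacts history up to the current revision. The endpoint and probe key are hypothetical, and in practice the Kubernetes apiserver or etcd's own auto-compaction does this for you.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Any read returns the current store revision in the response header.
	resp, err := cli.Get(ctx, "compaction-probe") // hypothetical key
	if err != nil {
		panic(err)
	}
	current := resp.Header.Revision

	// Compact away the history older than the chosen revision. Reads at
	// revisions below it will now fail with "required revision has been compacted".
	if _, err := cli.Compact(ctx, current); err != nil {
		panic(err)
	}
	fmt.Println("compacted up to revision", current)

	// Note: compaction only frees pages inside the database file; the file
	// itself does not shrink until a defragmentation rewrites it.
}
```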
Here is how the two play out. First, usually most of your storage is well utilized, with only a couple of empty pages and all the revisions in use. If you then compact, you can remove the revisions below the compacted one. So here, if we compact at revision 10, we remove the pages that hold the data for the older revisions, leaving even more empty space. And then we need to run a defrag so etcd rewrites the page layout and ends up at the minimal database size that is required.

Now to the problem: Kubernetes compaction is somewhat unaware of some etcd behaviors and can cause a lot of problems. The algorithm Kubernetes uses assumes that there are multiple API servers talking to etcd and that they race to do the compaction. To prevent unnecessary compactions and to keep the API servers from interfering with each other, they first race to change a key, so they make a write, and the one that wins gets to do the compaction. This has an obvious issue: what happens if you run out of quota? You cannot make the write, so the Kubernetes compaction just stops; it cannot proceed. So what should you do to prevent this? The Kubernetes algorithm is not perfect, but there has long been a solution in etcd that not only avoids the problem but also reduces the overall overhead of your etcd. By default, the Kubernetes algorithm keeps between 5 and 10 minutes of all historical changes, even though the compaction period is only set to five minutes, so by default the size of etcd can be double the data Kubernetes actually needs. If you use the etcd auto-compaction mechanism instead, you will only have around 10% of overhead. So you can double how much usable space you get from etcd just by setting a flag on etcd and disabling the compaction in Kubernetes.

The second recommendation I would make is on defrag. Defrag is very expensive, and you should avoid it if possible. Maybe in your experience you've seen some performance improvement from it, but in my experience it is not big enough to justify running defrag frequently. You should execute defrag only when it's appropriate: check pretty frequently, but avoid making the database unavailable, because defrag itself requires a full lock on the database and a rewrite of the storage. You can avoid this cost by adding a simple check before running it that verifies there is at least some space to be freed. By running the check frequently, you avoid the downside of locking the database too often while still ensuring that you don't run out of quota too fast. One thing to remember: at the end you should also disarm the alarm, because from time to time you will get into situations where you reach the quota, and you want to be sure that after the defrag the alarm is disarmed. You can safely automate that if you're automating defrag.
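Here is a minimal sketch, under the assumption that you automate defrag yourself with the clientv3 maintenance API, of what such a guarded defrag plus alarm disarm could look like. The endpoint and the 100 MiB threshold are arbitrary placeholders; tune them to your cluster, and run the defrag one member at a time.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// minReclaimable is an illustrative threshold: only defrag when at least
// this many bytes would be freed, so we don't lock the database for nothing.
const minReclaimable = 100 * 1024 * 1024 // 100 MiB, pick what fits your cluster

func main() {
	endpoint := "127.0.0.1:2379" // hypothetical endpoint
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Status reports both the file size on disk (DbSize) and the bytes
	// actually in use (DbSizeInUse); the difference is what defrag can reclaim.
	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		panic(err)
	}
	reclaimable := st.DbSize - st.DbSizeInUse
	if reclaimable < minReclaimable {
		fmt.Printf("only %d bytes reclaimable, skipping defrag\n", reclaimable)
		return
	}

	// Defrag locks and rewrites the database on this member.
	if _, err := cli.Defragment(ctx, endpoint); err != nil {
		panic(err)
	}

	// If the quota was hit earlier, the alarm stays armed until someone
	// disarms it; clear any alarms now that space has been freed.
	alarms, err := cli.AlarmList(ctx)
	if err != nil {
		panic(err)
	}
	for _, a := range alarms.Alarms {
		if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{
			MemberID: a.MemberID,
			Alarm:    a.Alarm,
		}); err != nil {
			panic(err)
		}
	}
	fmt.Println("defragmented and disarmed alarms")
}
```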
The third issue I want to discuss is the most recent critical issue in the etcd project: etcd watch starvation. To understand how watch starvation can happen, especially in Kubernetes, we need to look a little bit into Kubernetes.

What Kubernetes does is run a single etcd client per resource. So if you have multiple controllers, like the scheduler and the controller manager, they all talk to the Kubernetes API, but in the end this traffic is grouped per resource, each of which has its own separate storage structure, a Go structure, and each client runs an independent gRPC connection to etcd. The important part is that if there is a lot of traffic on a single resource, it is all sent through one connection.

The issue I encountered myself was a simple change, enabling TLS, that caused watch starvation. It was a pretty hard issue to discover, because we had been rolling out the change for a very long time, and only one cluster, after a year, hit it, to everyone's surprise. What was unique about this cluster was that there were a lot of agents running as DaemonSets, because if you want a logging setup that adds metadata about pods, you will have your Fluentd or Fluent Bit talk to the API server. And if there is a short downtime of those nodes, they all want to reconnect to the API server, and they start by making a list request. In a pretty big cluster, that is a lot of traffic on the wire. And if all of those requests are about pods, the traffic on the etcd client for pods will overwhelm its gRPC connection.

This happened because the etcd serving stack has separate paths for TLS and non-TLS. The underlying issue is that the HTTP/2 standard does not allow multiplexing on a TLS connection: during protocol negotiation you cannot even distinguish whether it is a plain HTTP request or a gRPC request, so you have to proxy through something. Normally, in the non-TLS case, we can check at the connection level and send HTTP requests to the HTTP server and gRPC requests to the gRPC server, which run as separate goroutines. But in the TLS case, because we cannot distinguish them, we have to send the TLS request to the HTTP server, and if there is a gRPC protocol header it passes the request on to the gRPC handler. And there is a big difference between those two paths. The HTTP/2 protocol is the same, but the implementations are totally different; even the gRPC documentation states that performance and features between those two serving paths can really vary. What happened in our case was that because HTTP/2 supports multiple streams per connection, the server needs an algorithm to decide which stream to respond to first. So if there are two list requests at the same time, an algorithm picks whether to respond to the first list or the second, and the same happens with the watch. So if there are a lot of list requests, the HTTP server needs some way to decide in which order to respond. And unfortunately, the main difference was that the HTTP server not only did not prioritize the watch, which is a much smaller and less frequent request, it even worked against it, so the watch always ended up at the end of the queue, causing a full starvation that could take minutes to clear.
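To make the traffic pattern easier to picture, here is a small clientv3 sketch in which a watch and a burst of expensive prefix reads share the same client, and therefore the same HTTP/2 connection. The endpoint and key prefix are assumptions for illustration; in a real cluster this traffic goes through the kube-apiserver rather than straight to etcd.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// One client means one gRPC (HTTP/2) connection; every request below
	// shares it, which is exactly where starvation can bite.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// A long-lived watch on the pods prefix, like the one the apiserver
	// keeps open for its watch cache.
	watchCh := cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix())
	go func() {
		for resp := range watchCh {
			for _, ev := range resp.Events {
				fmt.Printf("watch event: %s %s\n", ev.Type, ev.Kv.Key)
			}
		}
	}()

	// Meanwhile, many expensive full reads of the same prefix (the kind of
	// load a fleet of reconnecting agents can trigger) compete for the same
	// connection and can delay the watch responses above.
	for i := 0; i < 10; i++ {
		if _, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix()); err != nil {
			panic(err)
		}
	}
	time.Sleep(5 * time.Second) // give the watch goroutine a moment to print
}
```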
So if there were multiple list requests running concurrently and one watch, the watch could be hanging there, waiting for an event, and minutes could pass without it getting any updates. And because of how the Kubernetes reconciliation loop works, this causes total chaos: you create a pod, and nothing happens, because none of the controllers observe it. The fix, unfortunately, was pretty involved and required re-implementation and collaboration between the etcd, Kubernetes scalability, and Go teams to make the algorithm handle the watch properly. Only recently have we provided an etcd update that includes this fix. But because it took a long time and we knew it would not be an easy fix, we needed an immediate mitigation. So if you're running an older version of etcd, you can use the mitigation of serving gRPC and HTTP on separate servers, and thus skip the issue of gRPC requests going through the HTTP server. To detect whether you are hitting, or could hit, this issue in your clusters, I really recommend monitoring those two metrics.

So we discussed a couple of failures, and you can see the pattern: all of them are cluster-level. There was nothing etcd could have done without the help of either operators or proper SRE mitigations. For those issues, it's up to you to set up your cluster properly. And to mitigate, you should not assume that Kubernetes gives you 100% reliability and 100% protection from developer failures. Any change to your cluster can lead to a problem you have not yet discovered, so putting all your eggs in one basket can end pretty badly. My suggestion would be to separate your clusters to minimize your blast radius. If you hit an issue with one cluster that you've never experienced before, you should have either a second cluster that can take over your traffic, or an easy, documented way to mitigate and recreate a new cluster in its place. It's better to have partial downtime, and maybe some performance or latency degradation, than to be totally down.

The second way to handle such issues is a canary rollout. Just like with blue-green deployments, you should treat every change as a potential disruption to your cluster: any etcd upgrade, any Kubernetes upgrade, any upgrade of your application that maybe has some new traffic pattern that is abusive to etcd. If you treat all those changes as a black box and roll them out independently, you should be able to discover issues early and contain them. And the third iteration of the idea is to treat all changes as a disruption and to qualify and soak every change that is coming to your cluster. You can minimize the risk by separating your clusters into groups of different criticality, running a multi-phase rollout across them, and having direct qualification targets for each part of your fleet.
And this is exactly what GKE does, by separating the fleet into channels and allowing customers to pick which channel, meaning what kind of disturbance to the service they can tolerate. So you can run your test clusters in the first phase, we can qualify etcd on those test clusters first, cause only minimal issues, and avoid production impact. All of those solutions are standard mitigations for failures; there is nothing new in what I proposed. The main problem with them, maybe, is that not everyone can run multiple clusters, and not everyone can take the cost of running them. So what's the alternative approach? What is the secret?

For me, the observation is that all of these issues were discussed and available publicly. You could read about them and verify them. For example, the Kubernetes events issue has been there for years, and separating the etcd for events has been an available solution for years. And I still see people coming to me and asking how to handle events properly. So for me, there is one observation: people treat open source as a ready-made solution, as if you can skip gathering experience, take the free beer off the shelf, and treat it as the answer for your reliability. Unfortunately, this is not true. Open source is mostly about the freedom to exchange ideas, share them, and learn together. So when you think about running etcd, you should invest your time into knowing how the open source community runs it, and make sure you're avoiding common pitfalls when going outside of the common paths that etcd has been tested on for a long, long time. etcd is a production-grade key-value store, but only you can make it production ready. Most of the issues I discussed have solutions that cannot be baked into open source, because they require a full redesign or rearchitecture of the Kubernetes cluster. You should take that into account and think about how you make sure you're not hitting the common pitfalls.

For me, the best way to do that is to tap the collective community experience. Most of the people I talk to still run etcd versions that are known to have corruption issues; there have been multiple emails announcing problems with them, and there are still many people that don't follow them. So my recommendation, which will save you a lot of time, is to know what is coming and what is being discussed; we do full announcements about such issues, so you should follow the etcd mailing list. If you have any questions or problems, or you go outside of the tested and verified dimensions of etcd or Kubernetes, you should ask questions and double-check what the community's opinion is and what people have already done, because you're not unique in what you're doing. There are definitely people that have done it before; if you just ask them, they will be happy to share what they did. And be sure to share your own experience, file an issue, and talk about it, so we can collaboratively debug and help you get what you want. So in summary, cluster-scope failures are a thing, so you should plan for them.
You should understand what can go badly if you don't. You should assume that things can go wrong, have a plan for how you will react, and also test that plan. There are a lot of known issues, and most of them unfortunately require a mitigation on your side, either because of backward incompatibility or because they require you to change the architecture of how you run your cluster. And my main suggestion would be: don't try to figure out everything yourself; tap the collective community experience and talk to us, so we can learn together. That's all from me. Sorry for running over time. If you have any feedback, here's the QR code. There are still some stickers if you want them, and there are also some chocolates from the speaker. Thank you.