Let's get started. Our talk is called "Tales from on call: fun with operating etcd at scale." My name is Geeta, and I will be joined over video by Chao Chen. Both of us work at Amazon Web Services. Let me introduce a little bit about us. We work for EKS. EKS is a managed Kubernetes service. What that means is that EKS manages the control plane for you, all aspects of it: performance, scalability, and availability. So all the components here in the blue rectangle are owned and managed by EKS. Customers don't have access to those; they typically just manage their workloads and sometimes the worker nodes. Within the control plane, the team I am from is the etcd team, which focuses on operations and contributions to etcd. Now let's hear from Chao about our etcd environment.

Thank you, Geeta. Good afternoon, everyone. My name is Chao, a software engineer at Amazon. I've been actively working on the operation of etcd, the distributed key-value store that Kubernetes uses as its primary data store. In my virtual talk today, I'll be sharing some of my experiences and insights from operating etcd in Kubernetes clusters.

etcd at EKS. First, each EKS etcd cluster is a three-node cluster, evenly distributed across three Availability Zones in a region. Availability Zones are isolated data centers located within specific regions in which public cloud services originate and operate. Second, EKS etcd uses static IPs to advertise to etcd clients which endpoint they should connect to, and for peer communication. We also use static volumes to store the WAL and DB files. Static here means every new etcd node reuses the same IP and volume after the previous node is terminated. It guarantees data durability even if etcd quorum is lost permanently. It also means the etcd membership is static: there is no membership reconfiguration, which simplifies operations. Third, EKS etcd supports version downgrade from 3.5 to 3.4. It's an EKS private patch, and there is a reference published upstream. EKS etcd 3.4 and 3.5 do not have storage or API-layer schema incompatibility issues, unlike 3.3 and 3.4. For example, the difference between the 3.3 and 3.4 raft internal protocol buffer schema is the lease checkpoint request. If, on a 3.4 leader, the experimental lease checkpoint feature is enabled, once it starts to replicate that entry to a 3.3 follower, the follower's etcd server cannot understand the entry schema and will just panic during deserialization. Fourth, EKS etcd runs as a systemd daemon service. Unlike running etcd in a container, running etcd as a systemd service is a simpler approach with less overhead involved, like network isolation and container orchestration. Fifth, an EKS etcd operator agent runs on the same box as etcd. In EKS, we run the etcd operator agent on the same box as etcd to manage provisioning, health checking, taking periodic backups and storing them in persistent storage, automatic defrag, monitoring, and so on. The no-space alarm self-service will be touched upon later by Geeta. That's the introduction of etcd at EKS.
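To make the static three-member setup concrete, here is a minimal Go sketch using the etcd clientv3 API to list the members behind the static endpoints. The endpoint addresses are placeholders, not real EKS values, and this is only an illustration of the idea rather than any EKS tooling:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Hypothetical static client endpoints, one per Availability Zone.
	endpoints := []string{"10.0.1.10:2379", "10.0.2.10:2379", "10.0.3.10:2379"}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With a static membership there are always exactly three members, and their
	// peer/client URLs do not change when a node is replaced.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("member %x name=%s peers=%v clients=%v\n",
			m.ID, m.Name, m.PeerURLs, m.ClientURLs)
	}
}
```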
So now that you know who we are, let's get to the agenda. Today we want to talk about five operational issues that we see while operating etcd. The first one is where the storage quota provided by etcd is not enough for the workload. The second one is the revision divergence issue, where the nodes don't agree with each other. etcd can run out of memory; that's the third one. We sometimes see timeouts, mostly related to the maintenance workflows of etcd; Chao will talk about that. And sometimes we see requests that are larger than the limit etcd runs with. Let's get started.

The first issue is the database size quota being exceeded. Before I get to the production issue, let's recap a few concepts. This is a simplified view of etcd's layers: we have raft for consensus, and then we have the backend storage, which is BoltDB, the DB file, a big memory-mapped file. The concept of quota applies to the backend. This is a toy example where we show key A, shown in pink, and key B, shown in green, and those two keys have filled up the file. A1, A2, A3 represent updates to key A, and "A del" represents the key getting deleted. Same for B: B1, B2, B3 are updates to that key. The thing to note here is that when the file fills up, there are multiple revisions per key occupying space in that file. The other thing to notice is that the semantics are copy-on-write, so every update to a key takes up a new page in the file. Deletes work the same way: a deletion also adds to the file. When the file hits the quota, an alarm is raised, and that alarm needs to be explicitly disarmed. Today we run with an 8 GB quota, which is the maximum supported limit upstream. When the alarm hits, the cluster becomes read-only: any modify operation, such as a put, cannot get in.

The next concept is compaction. Compaction cleans up the old history. In our toy example, if we compacted everything that was in that file, only B4 would stay; everything else would get cleaned up, and the file would have free pages, or holes, if you will. Under normal circumstances, when there is no alarm, these holes are usable by etcd. But when the alarm has already triggered, put requests are declined even if there are holes in the file. In the EKS environment, compaction is run by the API server every five minutes.

The next concept is defragmentation. Chao will talk about this more when we get to the timeouts, but defragmentation basically removes all the holes and packs the live data into a brand-new file. So again, in our toy example, B4 would sit by itself in a brand-new file. At this point, the size of the file drops. However, like I said, the alarm will not clear by itself unless we call a specific disarm API.

All right. In production, we have seen workloads exceed our 8 GB quota multiple times, often unintentionally. When this limit is reached, the cluster becomes read-only and your on-call gets paged. The interesting thing is that compaction, as run by the Kubernetes API server, stops working once this alarm is raised. This is because the compaction workflow in the Kubernetes API server needs to do a put request before calling the compact API, and since that put request cannot get in, it never calls the compact API. The operator wakes up and typically increases the quota, and then it becomes a coordinated activity: we request the customer to delete objects before returning the quota back to 8 GB.
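The manual recovery steps map closely onto etcd's maintenance APIs. Here is a minimal Go sketch of that workflow using clientv3, assuming a single placeholder endpoint; it is an illustration of the sequence, not the actual EKS operator code:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// 1. Check the backend size and current revision.
	status, err := cli.Status(ctx, "127.0.0.1:2379")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("db size = %d bytes, revision = %d", status.DbSize, status.Header.Revision)

	// 2. Compact old revisions (the API server normally does this every five minutes,
	//    but it stops once the no-space alarm fires).
	if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
		log.Printf("compact: %v", err)
	}

	// 3. Defragment to actually shrink the DB file (in practice, one member at a time).
	if _, err := cli.Defragment(ctx, "127.0.0.1:2379"); err != nil {
		log.Printf("defrag: %v", err)
	}

	// 4. The alarm does not clear on its own; it must be disarmed explicitly.
	//    Passing an empty member disarms all raised alarms, as etcdctl does.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Printf("disarm: %v", err)
	}
}
```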
So why do we see this issue? There are three main factors: object count, object size, and multiple revisions due to fast updates. This is an example from production where the last key here, the admission reports, has 2 million objects. Even if the object size is just a few KB, this racks up GBs of space quickly. We have a blog article out about this, which you can find online. The second factor is object size. This is when the workload has a big key, typically a big binary blob in the pod spec, such as SSH keys. Because it's part of the pod spec, it gets replicated, and the pod spec becomes big, like 500 KB or 800 KB; those are the bigger ones we have seen. While this is supported, it is not optimal, and there is an easy way to optimize it by referencing the big blob instead of embedding it in the pod spec. We talk about that in the blog.

And the last factor is repeated updates. We often get questions like, "I have just 1,000 pods, how come they're taking up a GB of space?" Remember that when the quota hits, there are multiple revisions per key. So if something goes through fast updates, it takes up much more than the size of that object. Consider an 800 KB pod spec that goes through 10 updates: it is now taking up 8 MB per pod, so 1,000 of those will consume the 8 GB. Typically we see this together with large object sizes. Audit logs can help identify what's changing fast, and we have a CloudWatch query example in our blog. Other providers, and on-prem setups, can run a similar query against their audit logs. Here is an example of a fast-changing object: the scheduler is updating a pod repeatedly just to record the fact that it's not possible to schedule that pod on any node. These are unintentional side effects, but they do eat up the quota.

This can be monitored for. For monitoring it proactively, the API server has a metric for this, so there could be a monitor on that. For detecting it reactively, after it has already happened, there is a specific log message you can look for, which tells you that the database quota has been exceeded.

Mitigation. Today, as I said, when the DB size approaches the quota, it pages one of our team members who is on call, and they wake up and do this coordination to get the cluster back into operational mode. We have some work in progress to automate some of the workflows we do manually today. In the long run, we would like to work with the community to enable a self-service experience for this, similar to a file system: it fills up, you go and delete. Maybe in the case of etcd you wait a little bit, but then your cluster comes back to read-write. We have thought about increasing the quota. We used to run with 4 GB; now we run with 8 GB. Increasing it any further will need more testing and validation for performance, and we are at the limit supported by upstream.

All right, so that's the first issue. Now let's hear from Chao about the revision divergence.

Thanks, Geeta. Let's go to the next topic: etcd revision divergence in a cluster. Here is an animation of the revision progress in etcd. You can see we are watching the key foo to get a notification when the key-value pair is changed, updated, or deleted. Whenever we put a new version of the foo key-value pair, the revision increments, and when we delete the key, the revision also increases. etcd uses this global, monotonically growing revision number to keep track of changes to the data stored in the key-value store. When multiple nodes in the etcd cluster are applying changes to the data simultaneously, it's possible for the revision numbers to slightly diverge and then converge to the same revision eventually. However, diverging revisions could also mean that two or more nodes in the etcd cluster have made conflicting changes to the data, for example one node successfully completes a modification while the other nodes discard it or only partially commit it. In EKS etcd, this causes sustained, growing revision divergence across the etcd nodes.
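As a rough illustration of what revision divergence means operationally, here is a small Go sketch that polls each member's reported revision via clientv3 and computes the spread. The endpoints and polling interval are placeholders, and the real EKS alarm is built on emitted metrics rather than ad hoc polling like this:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints, one per member.
	endpoints := []string{"10.0.1.10:2379", "10.0.2.10:2379", "10.0.3.10:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(30 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

		var min, max int64
		first := true
		for _, ep := range endpoints {
			// Status is answered by the specific member we ask, so each member
			// reports its own view of the current revision.
			st, err := cli.Status(ctx, ep)
			if err != nil {
				log.Printf("status %s: %v", ep, err)
				continue
			}
			rev := st.Header.Revision
			if first || rev < min {
				min = rev
			}
			if first || rev > max {
				max = rev
			}
			first = false
		}
		cancel()

		// A small, transient gap is normal; a gap that keeps growing for a long
		// period (EKS alerts after about an hour) suggests data inconsistency.
		log.Printf("revision divergence = %d (max %d, min %d)", max-min, max, min)
	}
}
```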
Let's take a look at a real-world example. Recall that each EKS etcd cluster is a three-node cluster. The orange line represents the maximum revision across the three nodes, and the green line represents the minimum revision across the three nodes. The blue line represents the gap, or divergence, between those two lines. You can clearly see the divergence starts growing from 10 o'clock and then keeps growing indefinitely. We observed these failure symptoms and configured our alarm systems to alert on-call if the divergence keeps growing for an hour. This helps our team proactively mitigate data inconsistency problems before customers notice the impact. We also opened an etcd feature request to demonstrate an example of setting up this alert rule in Prometheus and Grafana. The link is shared below, and it's on my to-do list to complete it.

You may wonder how the revision divergence grows. This is the code snippet from the API server that updates objects. It uses an etcd transaction API: inside the transaction, if the key's mod revision stored by the etcd server is the same as the revision cached by the API server, the transaction results in a put request; otherwise it results in a get request. So, as you can see, if a member's stored mod revision for the key has already diverged, with this code pattern the revision divergence keeps growing.

The impact on Kubernetes: the core Kubernetes components fail to acquire their leases and stop functioning, for example kube-scheduler and kube-controller-manager, and many other leadership-based controllers fail to acquire leases and stop functioning as well. In other words, deployment scaling and pod scheduling fail. The following screenshot shows error messages from kube-controller-manager and kube-scheduler.

Next, mitigation. The mitigation can be simple once we detect this failure mode: we remove the member whose revision is lagging behind and join a fresh new member to get back in sync with the other members. This is an exception to our use of static volumes, since we basically replace the DB, and it follows the upstream operational guidance on handling corruption.

Next, root cause. The root causes can be classified into two categories. One is etcd and bbolt partial commits, followed by an etcd process panic or the process being killed. This is well summarized in the upstream data inconsistency summary, and the maintainers did a tremendous job preventing this in the current 3.5 release. If you're interested, you can take a look and contribute to the robustness test framework. The second is a bit more of an unknown: bbolt corruption, that is, corruption of the etcd backend. On the revision divergence root causes, there was a talk by maintainer Marek before this one; if you missed it, please check it out. He does a deep dive on how they found the inconsistencies and how there is now a robustness framework for testing.
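For reference, the conditional-update transaction pattern Chao walked through above looks roughly like this when expressed directly against the etcd clientv3 API. This is a simplified sketch of the idea, not the actual Kubernetes API server storage code, and the key and revision values are made up:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// updateIfUnchanged writes newValue only if the key's mod revision still matches
// the revision the caller cached earlier; otherwise it reads the current value back.
func updateIfUnchanged(ctx context.Context, cli *clientv3.Client, key, newValue string, cachedRev int64) error {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", cachedRev)).
		Then(clientv3.OpPut(key, newValue)). // cached revision still current: write
		Else(clientv3.OpGet(key)).           // stale cache: read back the latest revision
		Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded {
		log.Printf("cached revision %d was stale; retry with the value just read", cachedRev)
	}
	return nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Hypothetical key and cached revision, for illustration only.
	_ = updateIfUnchanged(context.Background(), cli, "/registry/pods/default/example", "{...}", 42)
}
```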
All right, the third issue: etcd can run out of memory under overload. This issue cascades, because if one node goes down, the same workload that brought that one down goes to the next one and brings that one down as well, and now we have quorum loss. Typically, our case studies suggest that the workloads that cause this are large unpaginated range requests, typically the same one (get pods) repeatedly. Why do we see this issue? These are typically unpaginated list requests, and like I said, it's basically a spiky workload: it spikes too much, too fast. Every new request is a new allocation in etcd; etcd never returns cached answers, and the mechanism that frees up that memory, the garbage collection, is asynchronous. So there is a window where etcd can run into memory pressure and can get killed.

We have two mitigations for this. The first one is a change on the API server side: when it sees an unpaginated list request, it paginates it before sending it to etcd, so the etcd client and server always see a paginated request. The pagination limit is configurable; the link to the KEP is on the slide. Even with this pagination, the request stream can still be spiky and can still cause the issue, depending on the workload. So we have a second layer of defense on the server side. This is a server-side throttler, implemented as a gRPC interceptor, so it watches every incoming and outgoing request. If it is not a range request, the throttler does nothing to it; the request simply goes through. If it is a range request, the throttler checks whether the box is under memory pressure by consulting the resident set size of the process as a percentage of the total box memory. If memory is under pressure, the request is admitted only after a delay; it is throttled.

This graph shows a test where you can see the throttling in action. The blue line here is the memory pressure; when it goes past 65 percent, which is the threshold it's running with, the throttler, which is the orange line, activates and starts throttling requests. This is the same test, except it shows that the throttler and the memory pressure feed into our scalability system. In this test, the orange line is still the throttler metric, but the green line is the total system memory. This box got scaled up from 4 GB to 16 GB because of the throttler signal and the memory pressure.

You might want to use the same changes we are using. You may not need them if you have a totally predictable workload, since you can always provision for the peak workload and use a bigger box. But if you are interested in trying out these changes, please do reach out to us.
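A very rough sketch of what such a server-side throttler could look like as a gRPC unary interceptor is shown below. It is not the EKS implementation: the method-name check, the /proc-based RSS read, and the fixed threshold and delay are all simplifying assumptions made for illustration.

```go
package throttle

import (
	"context"
	"os"
	"strconv"
	"strings"
	"time"

	"google.golang.org/grpc"
)

const (
	// Assumed gRPC method name for etcd range/list requests.
	rangeMethod  = "/etcdserverpb.KV/Range"
	rssThreshold = uint64(3 << 30)        // example: start throttling above ~3 GiB RSS
	delay        = 100 * time.Millisecond // example fixed delay per throttled request
)

// residentSetBytes reads the process resident set size from /proc/self/statm (Linux only).
func residentSetBytes() uint64 {
	data, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		return 0
	}
	fields := strings.Fields(string(data))
	if len(fields) < 2 {
		return 0
	}
	pages, _ := strconv.ParseUint(fields[1], 10, 64)
	return pages * uint64(os.Getpagesize())
}

// UnaryThrottler delays range requests while the process is under memory pressure.
// Non-range requests pass through untouched. A real throttler would compare RSS
// against the total box memory and use an adaptive delay instead of fixed values.
func UnaryThrottler() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if info.FullMethod == rangeMethod && residentSetBytes() > rssThreshold {
			select {
			case <-time.After(delay):
			case <-ctx.Done():
				return nil, ctx.Err()
			}
		}
		return handler(ctx, req)
	}
}
```

An interceptor like this would be wired into the server with the grpc.UnaryInterceptor server option when the gRPC server is constructed, which in etcd's case means a server-side code change rather than a configuration flag.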
That brings us to the fourth issue. Let's hear from Chao about the timeouts.

Thanks, Geeta. Let's go to the next topic: timeouts causing 5xx. One of the top contributors to Kubernetes API server 5xx errors is etcd online defrag, which is stop-the-world. As you can see from the diagram, when the API server tries to get an object using key1 and the request hits an etcd server that is defragging, the request hangs until it times out.

What is etcd defrag? The following is quoted from the etcd documentation: the key-value store is effectively immutable; its operations do not update the structure in place but instead always generate a new, updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time, the store may be compacted to discard the oldest versions of data. You can see from the DB file that there can be multiple revisions of a key-value pair, or just a single version, like k2. After we compact at revision 7, any key-value pair that has been deleted is compacted away, and any previous version of a key-value pair that has been superseded by a newer version is also compacted away, leaving free pages in the DB file.

Compacting the keyspace history drops all information about keys superseded prior to a given keyspace revision; the space used by those keys then becomes available for additional writes to the keyspace. So at this point, after compaction, the DB file does not shrink, but the free pages can be reused later: if you insert a new key-value pair, a free page can be reused. After compacting the keyspace, the backend database may exhibit internal fragmentation. Internal fragmentation is space that is free for the backend to use but still consumes storage space. Compacting old revisions internally fragments etcd by leaving gaps in the backend database. Fragmented space is available for use by etcd but unavailable to the host file system; in other words, deleting application data does not reclaim the space on disk. That's where defrag comes into play: after defrag, the DB file size shrinks.

So what's the mitigation EKS adopted? We reduced the frequency of online defrag to run once every 48 hours for each member, and to reduce the latency of defrag, we advise customers to keep the number of keys in etcd small, and we increase the disk throughput on demand if there are a lot of objects or key-values. Another mitigation could be an offline defrag. What offline defrag means is: take down the etcd server, defrag the DB file, which shrinks the file size and removes all of the unused pages, and then bring the etcd server back online. You could use this mitigation if the availability risk is acceptable. One note: remember to defrag one member at a time. Also, etcd upstream has some great proposals for graceful defrag, either from the client side or from the server side. The server-side proposal is to make defrag concurrent internally so that it does not block other requests. On the client side, if it's a multi-node cluster, the etcd server can notify the client, "I'm going to defrag, please fail over to other endpoints that are not defragging."

That brings us to the last issue, which is request size too large. etcd has a limit on the request size; the default is 1.5 MB, and that's what we run with. But sometimes we see workloads that want to push more data through a request. Why do we see this? Typically it's a workload issue. Here's an example where a customer was using Endpoints objects, and as they were updating the service, the object got bigger than 1.5 MB. It turned out to be a known upstream issue, and the solution was to use EndpointSlices instead of Endpoints. As for mitigation, we typically prefer to stay with the upstream limit for this issue, and we have helped customers change their workloads to match best practices and make the issue go away.
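As a small illustration of the limit just mentioned, the sketch below attempts a put whose value exceeds the default request size; against a default-configured server, the put is rejected rather than written. The endpoint and sizes are placeholders, and the exact limit is governed by the server's --max-request-bytes setting:

```go
package main

import (
	"context"
	"log"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Roughly 1.6 MiB value: larger than the default server-side request limit (~1.5 MiB).
	big := strings.Repeat("x", 1600*1024)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if _, err := cli.Put(ctx, "/example/oversized", big); err != nil {
		// With default settings this is expected to fail with a
		// "request is too large" style error from the server.
		log.Printf("put rejected: %v", err)
		return
	}
	log.Println("put accepted (the server limit must have been raised)")
}
```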
All right, to summarize, we talked about five operational issues that we see. The first one was the database size being exceeded, and the second one was revision divergence. We talked about panics due to out of memory, we talked about timeouts due to defrag, and we talked about oversized requests. That's it from us. If you want to share your experiences operating etcd, please reach out to us; we would love to learn. If you want to contribute, please check out the contributing guide. The etcd maintainers are here, so please check out the booth, and feel free to reach out to us on Slack. Any questions? Thank you. Maybe you can use the microphone over there.

Question from the audience: how much are GitOps controllers causing all these issues? Sorry, can you please repeat? How much are GitOps controllers causing all these issues, because they can hit etcd a lot, right? We can't put a number on it, but we do see list requests from Argo from time to time. Any other questions? All right, thank you, and have a great conference. Thank you.