Hi, welcome to our talk. I'm Lou, from Robinhood Markets. I'm a senior software engineer on the container orchestration team.

Hello, my name is Madhu. I'm on the same team as Lou, a software engineer on the container orchestration team at Robinhood Markets. Our team is responsible for building, offering, and operating the compute platform for all of Robinhood, and our compute platform of choice at the company today is Kubernetes. In this talk, we'll walk you through how we adopted Cilium in one of our environments, the challenges we ran into, the lessons we learned from those challenges and how we adapted, and how we now live happily ever after with Cilium in that environment. Next slide, please.

Before we get into any of the details, I want to clarify what "near production" means here. Like pretty much every other company, we have multiple environments. There is an actual production environment where all our user-facing production traffic is served and all the production services run. Then we have a bunch of other environments, but there is one specific environment where we run integration tests and personal development namespaces for products across the company. We call it near production because this environment is treated with the same level of seriousness and has the same response SLOs and SLAs as production, because it's critical for the entire company's engineering and development. If this environment stops working, development at the entire company grinds to a halt, so if there is an incident, we treat it with the same seriousness as a production incident. The only reason we might prioritize it lower is if we are simultaneously having a production incident, although that has almost never happened since I joined Robinhood.

With that context, I also want to explain how this environment differs from the production environment. It comes down to the nature of the workloads we run here, which are mostly integration tests. For an integration test, the infrastructure is brought up (pods are spun up), data is set up, the tests are run, and at the end everything is completely torn down, including the namespace that was created for that particular test run. Because of this, it's a very high-churn environment, and the challenges we face here are different from all our other environments. Next slide, please.

So what exactly is high churn? Before we get there, let's look at the SLO prerequisites the Kubernetes upstream project defines. If you look at the SLO definition page (slos.md) in one of the Kubernetes sig-scalability repositories, you will see the two lines shown on the left of this slide. The prerequisites state that for any of the SLOs to be applicable, the Kubernetes cluster must be available and serving, which makes sense. The second prerequisite describes the level of cluster churn that is assumed. Before looking at the number, let's look at the definition of churn. Kubernetes upstream defines churn as the number of Pod spec creations, updates, and deletions per second, plus the number of user-originated requests to the system per second, and it requires this churn to stay below 20 for the SLOs to be applicable.
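To make that prerequisite concrete, here is a small sketch of how the churn figure is put together. The exact wording and threshold live in the upstream sig-scalability SLO document, so treat this as illustrative only; the sample numbers below are made up.

```go
package main

import "fmt"

// churn follows the upstream definition as we read it: Pod spec creations,
// updates, and deletions per second, plus user-originated requests per
// second. Illustrative sketch only; see the sig-scalability SLO docs for
// the canonical wording.
func churn(podCreates, podUpdates, podDeletes, userRequests float64) float64 {
	return podCreates + podUpdates + podDeletes + userRequests
}

func main() {
	// Hypothetical one-second sample from a busy workday in this environment,
	// ignoring user-originated requests (as the graph on the next slide does).
	c := churn(35, 5, 30, 0)
	fmt.Printf("churn = %.0f, SLO prerequisite assumes churn <= 20\n", c)
}
```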
Now, if you turn to the right and look at the graph we are sharing here, you will see the churn pattern for one of the clusters in this environment over a typical 30-day period. You can also clearly tell which of these days are workdays and which are not, because, as we said, this environment is used for integration testing and personal development, so we expect traffic to be higher on workdays. You will see five peaks followed by troughs for a couple of days, and then five peaks again, which matches five workdays followed by two weekend days. You can clearly see that on workdays the peaks are typically over 30, many times over 40, and they reach almost as high as 70 pod creations and deletions per second. And in this graph we are not even counting user-originated requests; these are just Pod spec creations and deletions. That is the level of churn one cluster in this environment goes through. Next slide, please.

Given all that, what exactly motivated us to adopt Cilium, and specifically overlay-based networking, in this environment? Let's first look at what we had here before we adopted Cilium. To begin with, we ran the AWS VPC CNI as our CNI plugin in all environments, and that was the case in this environment as well. The AWS VPC CNI plugin gives us a flat networking model, which we still use in all the other environments and used to use in this one, because in many other environments we need cross-cluster networking, so it makes sense there. But in this particular environment, traffic is mostly isolated within namespaces: cross-cluster networking is not required, and even within the cluster, networking across namespaces is mostly not needed. So overlay networking works fine for this environment. In addition, the VPC CNI with flat networking works well for stable production environments, but it has limitations in burstier environments like this one, where we need higher pod density and have very high churn.

To go a little further: with the flat-networking AWS VPC CNI model, the number of pods per node is limited by the caps EC2 imposes on the number of secondary ENIs you can attach to an instance and the number of IPs you can attach to each ENI. Because we can't go over that number, and the pods we spin up here are for integration tests with low utilization, we end up with very low CPU and memory utilization per node; and because utilization is so low and we cannot pack pods more densely, the cost efficiency is very poor. So we evaluated various solutions, and we wanted one that would give us higher pod density, and we knew we wanted overlay networking for that. We also wanted to make sure we did not significantly sacrifice performance by moving to an overlay model. Even though these are integration tests, performance still matters in this environment, so not sacrificing performance was an important goal. We looked at eBPF and it looked like a very strong candidate, and the moment we knew we wanted to use eBPF, the solution of choice was Cilium, because it was the most widely adopted solution available. Next slide, please.
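To give a rough idea of why flat networking caps pod density, here is a sketch of the commonly cited formula for ENI-limited pod counts with the AWS VPC CNI. The instance limits used below are assumptions for illustration; check the EC2 documentation for your instance types.

```go
package main

import "fmt"

// eniLimitedPods sketches why pod density is capped under the flat
// networking model: each pod consumes a secondary IP on an ENI, so the
// ceiling is set by the instance type's ENI and per-ENI IP limits rather
// than by CPU or memory headroom.
func eniLimitedPods(maxENIs, ipsPerENI int) int {
	// One IP per ENI is the ENI's own primary address, and a couple of
	// extra slots are usually counted for host-networked pods.
	return maxENIs*(ipsPerENI-1) + 2
}

func main() {
	// Example: an instance type that allows 4 ENIs with 15 IPv4 addresses each.
	fmt.Println(eniLimitedPods(4, 15)) // 58 pods per node at most
}
```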
So with that, what did our migration actually look like? We started from Cilium 1.7.5 and then quickly had to jump to a much newer version; in total we went through three version bumps. The way we did this: we decided that making this kind of change in place was not viable, so we brought up a new cluster for personal development and another new cluster for integration tests, both with the Cilium overlay networking model, and then we ran a lot of load tests against those clusters. Once we reached a stage where we felt confident about the knobs we had to tune, we started cutting workloads over onto the new clusters. Interestingly, the personal development namespaces were more or less fine, but in the integration test environment, because of the high churn, we ran into a number of challenges we were not expecting. It caused quite a few incidents; the term we use at Robinhood, like many other companies in the industry, is SEVs. So we ran into a number of SEVs, and we had to go back and forth between our old clusters that used the flat networking model and the new clusters that used the overlay networking model with Cilium. Eventually, we were able to migrate all the workloads onto the Cilium clusters. Next slide, please.

Overall, in terms of the efficiency we achieved by adopting Cilium with overlay networking, in the end we were able to pack at least 2x more pods than we could with the flat networking model. In the flat networking model, we set the pod density to 110 pods per node, which is what Kubernetes generally recommends. When we adopted Cilium, we went all the way up to 250, but that caused a number of issues, so we slowly dialed it back down to 180. And although the limit was previously set to 110, we never actually hit 110; our actual density was more like 90 to 100 pods per node. Now we are comfortably able to reach 180, so we have roughly doubled the pod density of our clusters. With that, I'll hand off to Lou to talk about the specific challenges we ran into. Lou, take it away.

Thanks, Madhu. As Madhu mentioned, this migration came with a few surprises. For example, it revealed several previously hidden bottlenecks, because other factors became the new bottleneck once pod density on each node was much higher: excessive API server requests, and CPU, memory, and mount exhaustion on the nodes, with imbalanced workload distribution making things even worse. We also encountered a few Cilium stability issues. Version bumps of Cilium fixed most of them, but they were costly and tricky to root-cause and fix. Cilium throttling and timeouts in this high-churn environment are also a challenge, and we had to build custom scheduler extensions to work around them. In the rest of the presentation, we will pick a few interesting stories to share in more detail.

Let's take a look at the new bottlenecks first, which were actually the first surprises we got. With the new abundance of IP addresses, we decided to consolidate two integration test clusters into one to achieve better cost efficiency. Meanwhile, the kubelet on each node became much busier with more pods packed onto each node. As a result, the API server got overloaded and couldn't keep up.
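For reference, the pod-density change Madhu described boils down to a single kubelet setting. A minimal sketch using the typed kubelet configuration API might look like the following; the value 180 matches the number mentioned above, but everything else here is illustrative and left at defaults.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Raise the per-node pod cap from the default 110 to 180, the value we
	// eventually settled on after dialing back from 250.
	cfg := kubeletv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			Kind:       "KubeletConfiguration",
			APIVersion: "kubelet.config.k8s.io/v1beta1",
		},
		MaxPods: 180,
	}
	out, err := yaml.Marshal(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // emit the kubelet config file contents
}
```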
Because the API server couldn't keep up, we couldn't bring up new pods or new namespaces, which essentially means the integration test environment was broken, and we had to migrate back to the previous clusters. For the mitigation and fix, we set up another Cilium-based cluster for the integration test environment. On one side, that relieved the API server by cutting the pressure in half; on the other side, we got more resiliency and redundancy. We also moved from Deployment specs to plain Pod specs for integration test namespaces, because those Deployments were essentially single-pod deployments and we didn't really need the automatic reconciliation a Deployment provides for integration tests; that further reduced API server pressure. Last but not least, we routed requests from the kube-controller-manager to the network load balancer instead of to the local API server. That removed the stickiness, so requests became more evenly distributed across the API servers instead of concentrating on the one co-located with the leader controller, and it prevented the hottest API server from crashing.

Another bottleneck was data-plane node resource exhaustion. The nodes of the new Cilium cluster somehow became more likely to run into resource exhaustion, such as running out of CPU or memory or having too many mounts. Pods on such nodes were stuck creating or terminating. To make things worse, those nodes would be churned, and the pods on them would have to be rescheduled onto existing nodes, since new nodes have to go through an initialization and warm-up phase before joining the cluster. Migrating the pods from those churned nodes onto the existing nodes only made things worse, because it put more pressure on nodes that were already overloaded. Thus more nodes would run too hot and get churned in turn, which is essentially a cascading effect.

As a mitigation, we initially tried Kubernetes system resource reservation, basically reserving resources for the Kubernetes daemons and the system daemons. However, it turned out not to be a one-size-fits-all solution. It has its own implementation limitations, and as Madhu mentioned earlier, the way we use these clusters exceeds the prerequisites for the SLOs, so it also makes sense that simple resource reservation couldn't solve the entire problem, although it helped a lot. In addition, we audited the applications, especially the heavily used applications in the integration test environment, and right-sized their resource requests and limits, because in some cases usage had grown over time until the actual usage was far beyond the request, leaving the cluster heavily over-committed and, as a result, easy to burn out. We also made some performance optimizations to those applications to reduce the over-commitment problem. On top of that, we did scheduler tuning and added scheduler extension plugins that target a more even distribution of the workloads across the cluster. If you look to the left, the upper two graphs show the CPU and memory usage across the nodes on a typical day before any of these optimizations, and the bottom two show the newer distributions with both the scheduler tuning and the scheduler extension plugins in place. As you can see, the workload distribution for both CPU and memory is much more even than before.
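To show the shape of what we mean by a scheduler extension, here is a stripped-down sketch of a score plugin that spreads pods by preferring emptier nodes. This is illustrative only, not our actual plugin: the name, the hard-coded cap, and the wiring are assumptions, and the plugin factory signature differs slightly between scheduler versions.

```go
package podspread

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodSpread scores nodes inversely to how many pods they already run, so a
// burst of pod creations spreads across the cluster instead of piling onto a
// few already-busy nodes. Sketch only; not Robinhood's production plugin.
type PodSpread struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &PodSpread{}

func (p *PodSpread) Name() string { return "PodSpread" }

func (p *PodSpread) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	nodeInfo, err := p.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	// 180 mirrors the per-node pod cap mentioned earlier; a real plugin would
	// take this from its configuration instead of hard-coding it.
	const maxPodsPerNode = 180
	used := int64(len(nodeInfo.Pods))
	if used >= maxPodsPerNode {
		return 0, nil
	}
	// Emptier node => higher score, which nudges the scheduler to spread load.
	return framework.MaxNodeScore * (maxPodsPerNode - used) / maxPodsPerNode, nil
}

func (p *PodSpread) ScoreExtensions() framework.ScoreExtensions { return nil }

// New is what an out-of-tree scheduler build would register as the plugin
// factory; the exact factory signature varies between scheduler versions.
func New(obj runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &PodSpread{handle: h}, nil
}
```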
With this more even spread, we have fewer hot nodes, and far fewer nodes run into the resource exhaustion problem.

Switching to another topic: the Cilium stability issues we encountered. One of the most costly Cilium bugs we hit was a Cilium identity garbage collection bug. When we were running version 1.11.6, that bug caused the exhaustion of Cilium identities, so we couldn't bring up new pods, which blocked all integration tests. What happened was essentially that garbage collection was not functioning as expected, so none of the Cilium identities could be removed to make room for new ones, and once the count reached the maximum, no new pods could be created. What's tricky is that the problem only started to manifest almost a month after the initial deployment of that version. On the other side, the personal development cluster fortunately was not impacted: because its churn is much lower and it uses far fewer Cilium identities, even though the same bug was there and identities weren't being garbage collected, it still had months or even years of headroom before it would have been hit by the same behavior.

So what are the lessons from this bug? First of all, if we had understood the key components of Cilium better and had dug into how it operates while we were adopting it, it would have saved us a lot of time identifying the problem and either fixing the bug or simply upgrading to a newer Cilium. It is also important to properly set up monitoring and alerts, which could have helped us catch the problem before it actually started to impact the near-production environment. Cilium provides a lot of great metrics out of the box; however, it's overwhelming in the beginning and hard to know which ones really matter, so we didn't have alerts on the Cilium identity count. Coincidentally, with that version change there was a rename of the metric which removed the "total" suffix from the metric name, and we didn't notice that in the changelog during the update to 1.11.6. As a result, when the issue occurred and we looked at the garbage collection dashboard, it was empty, which made it harder to identify the problem. The graph on the left shows how the number of Cilium identities slowly accumulated over a month and eventually reached the limit, which is when it began to hit us.

Another problem we encountered was egress breaking on some nodes. It only happened on a few non-deterministic nodes when we scaled up, and it could always be mitigated by a Cilium agent restart, so it was a mystery to us. We filed an issue with the Cilium community and got the suggestion to update to the latest Cilium version. In the meantime, as a stop-gap mitigation, we implemented auto-cordoning of a node when the problem occurs on it. Interestingly, with the Kubernetes version update from 1.15 to 1.18, even before we actually updated the Cilium version, the problem stopped showing up; to be honest, the exact reason is still TBD.

One issue with Cilium that still bothers us is throttling and timeouts. It still happens from time to time, although much less frequently now. What we see are rate limit errors: in the logs we see a lot of 429s with a signature like "PUT endpoint ID: too many requests," so we understood it was due to the rate limiter. We initially tuned the adaptive rate limiter, and after that we indeed saw far fewer 429 errors.
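Coming back to the monitoring lesson for a moment, here is roughly the kind of guardrail we wish we had had in place: a small watchdog that scrapes the local agent's metrics and warns when the identity count creeps toward its ceiling. The metrics address, metric name, and ceiling below are all assumptions; check the metrics reference for the Cilium version you run.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

const (
	metricsURL = "http://127.0.0.1:9962/metrics" // assumed Cilium agent metrics address
	metricName = "cilium_identity"               // assumed metric name; verify for your version
	ceiling    = 65000.0                         // assumed identity cap; adjust to your allocation range
)

func main() {
	resp, err := http.Get(metricsURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Sum the gauge across its label combinations to get the total identity count.
	var total float64
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, metricName+"{") && !strings.HasPrefix(line, metricName+" ") {
			continue
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v
		}
	}
	if total > 0.8*ceiling {
		fmt.Printf("WARNING: %.0f identities in use, above 80%% of the assumed cap\n", total)
	} else {
		fmt.Printf("OK: %.0f identities in use\n", total)
	}
}
```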
Tuning the rate limiter, however, essentially just turned those 429s into Cilium API client "timeout exceeded" errors, which is even worse, because it clearly indicates Cilium couldn't handle the burst of requests. Even though it keeps serving, just slowly, it still causes test failures: in particular, it's difficult for Cilium to keep up during pod creation, and IP addresses are not assigned in time, so pods don't become healthy before the test times out. As a mitigation, we had to introduce a new module, burst control, in our custom scheduler extension to avoid scheduling too many pods onto a node at once.

To conclude our talk, here are some of our key takeaways. First, Cilium is a great technology that we love a lot. We achieved much higher cost efficiency, and eBPF opens doors to a lot of great possibilities around observability and security. However, just solving the networking problem isn't the golden ticket: to operate Cilium at scale, it's not sufficient to simply adopt it. You really need a good understanding of Cilium under the hood, specifically how it operates, and you need to set up monitoring and alerts appropriately. That's our presentation. Thank you for taking the time to come to our talk, and we appreciate your interest in this topic.