So we're from Expedia Group, and we're going to go over some of the problems we had in our Istio adoption. We're going to start from the page-out and go all the way through the diagnostic steps and how we got to a solution. Let's do some introductions first, because the way we diagnosed this between different engineers is how we got to the solution.

I'm Tim. I started at Orbitz in 2009, so I've been in the travel industry forever; Orbitz is now Expedia. I like to call myself OpsDev. I know the DevOps buzzword gets floated around, but I've always taken issue with that: I do ops things first and then dev.

Hi, everyone. I'm Fragof Grover, and I work with Tim at Expedia as a senior software engineer. Unlike Tim, I'm a dev-first guy, and I think that's how we complement each other. I've made some contributions to projects like Crossplane, Karpenter, and Istio.

All right, a brief overview. We're going to go over what the actual issue was with Istio when we adopted it and scaled up. We'll go over possible bottlenecks and dive a little bit into how Istio works in case you're not familiar. Then we'll get into the good stuff: remote debugging, the actual solution, and some things we added to help you out if you go down this road.

So here's the actual issue. This is where the page-out starts. We've been paged: 500 errors, uh-oh. It's probably our fault, we're not sure, and it's lasting a really long time. A pod terminates, and for nine minutes users are seeing 500 errors. This is a worst-case scenario for us.

At some point we determined the cause: pod churn. Every time a pod was deleted, we would see these 500 error codes. And our pod sets are really big, which is important to this story. This cluster had grown dramatically because we're moving onto it. We have pod sets of 1,200 pods, and others of 900, 600, and 300, so they're all really big, and they release very frequently.

So, Istio architecture. This is where we need to step into how this was all working. Here's the ingress: all your traffic comes in from the internet to the ingress, and it then flows to your pods. In our scenario, a pod gets deleted, and then the 500 errors begin. If you're familiar with Istio, what should be happening here is that istiod gets updates from the API server and pushes them down to everything, telling everything where everything is. That didn't seem to be happening, and the first proof we had of that was when somebody restarted istiod. Oh my gosh, that nine-minute window was gone. Everything gets fixed immediately on a restart of istiod.

So we started looking at suspects, and myself being an operations guy, I figured we hadn't given the API server enough horsepower, so it must be the API server. Unfortunately we can't adjust it; it's a hosted one. So we had to prove it another way.

Oh, I got a little bit ahead of myself. This is about the size we're at. There are some hypothetical numbers out there saying you can have 150,000 pods in Kubernetes, whatever; this seems to be the number where we hit a brick wall. We had tried everything at this point. We gave istiod as much horsepower as we could: we took off the limiters and gave it a crazy amount of CPU, and it took it all, 2,500 cores at full usage, and we were still seeing this problem.

So what could the bottleneck be? At this point it's either istiod, because the one fact we know for sure is that restarting istiod fixes it right away, or the API server, which is tricky to prove.
So finally, OpsDev Tim is gonna do some dev. If you're a Golang perfectionist, I'm sorry, this was very hastily written up; it was a page-out. I know I'm not catching the error and there's a squiggly line there. But essentially what we did was say, okay, how do we prove whether it's the API server or not? We create a pod every 20 seconds, and we put an informer in our software that catches it and reports how long it took to show up. Well, it definitely wasn't that: the pod showed up well under a second later. So at this point we're all doubting it's the API server.

Somebody pointed out there are a lot of informers in Istio. So we decided, okay, let's look at the endpoint controller. We didn't write code for this one, we just did some log queries, and again it was well under a second. So it's starting to look like it's a problem inside of Istio.

This is the fun part. If you're really familiar with istiod, you're probably already thinking this: did you do namespace isolation? Because those configs can get really, really big. No, we had not. So we started doing that. We took all those big pod sets, the ones with 1,200 pods that release 50 times a day, all the rapid-churn ones, and we did isolation on them. The config, I believe, was about 50 megabytes, and it shrank down to, I think, five or six. And the problem still wasn't solved. So at this point we've tried a lot of operational tricks to tune this thing and it's still not working. So I'm gonna turn it over to Rob, because I think that's when he jumped in on this.

Yeah, so this is one of my favorite memes, and it precisely depicts the situation we were in. We had thrown more CPU at istiod; it didn't work. We had downscaled our cluster horizontally and upscaled it vertically to increase the number of pods per node, so the number of DaemonSet pods would come down and address any scalability issues related to pod count. That didn't work. As Timothy mentioned, we implemented namespace isolation to optimize the Istio proxy config distribution and also reduce its size, but that didn't work either. And we checked for lag in the informer and the endpoint controller, and that also didn't point us to anything significant.

That's where we turned our eyes to istiod and started looking into its internals, because istiod is the control plane of Istio, the brain of Istio, so the chances that it is introducing delay somewhere along the path are significantly higher than for any other component in the cluster. So let's have a look at the internals of istiod and try to understand how it works.

The blue dashed box represents istiod, specifically Pilot inside istiod, and on the left-hand side you see the Kubernetes API server, which feeds events into a queue called the controller queue. These events represent the state of the cluster, and if you were around for the talk by John Howard, who is one of the maintainers of Istio, he said that Istio subscribes to around 45 different types of objects in Kubernetes and gets all those objects into this queue. This particular queue is, by design, non-concurrent, which means elements are dequeued and processed one at a time. In the next stage we have the debouncer, where events are made to wait and merge, because generating the Istio proxy config is a CPU-intensive task and distributing it over the XDS API requires a lot of bandwidth.
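Going back to Tim's API-server check for a moment: it boiled down to creating a canary pod on a timer and timing how long an informer takes to report it. The following is only a rough sketch of that kind of test, not the code written during the incident; the "latency-test" namespace, pause image, and 20-second interval are illustrative assumptions.

```go
// Sketch: create a canary pod every 20 seconds and measure how long an
// informer takes to observe it. Namespace, image, and interval are made up.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	var created sync.Map // pod name -> time the create call was issued

	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithNamespace("latency-test"))
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			if start, ok := created.Load(pod.Name); ok {
				// How long did the watch event take to reach us?
				fmt.Printf("%s seen by informer after %v\n", pod.Name, time.Since(start.(time.Time)))
			}
		},
	})
	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	for i := 0; ; i++ {
		name := fmt.Sprintf("canary-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "latency-test"},
			Spec: corev1.PodSpec{Containers: []corev1.Container{
				{Name: "pause", Image: "registry.k8s.io/pause:3.9"},
			}},
		}
		created.Store(name, time.Now())
		if _, err := client.CoreV1().Pods("latency-test").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
		}
		time.Sleep(20 * time.Second)
	}
}
```

In our case the gap between the create call and the informer callback was consistently sub-second, which is what ruled the API server out.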
Events sit in the debouncer for a specified amount of time so they can wait and merge, which optimizes things a little bit. After some context gathering and initialization, the events are converted into what are known as push requests and placed into the push queue, where, on the bottom right-hand side, you can see that by default there are 100 goroutines feeding on this push queue to generate and send the Istio proxy config for each and every sidecar proxy connected to that particular instance of istiod. The number of goroutines there can be controlled by a flag called PILOT_PUSH_THROTTLE, and later we'll see how changing that flag from its default value of 100 actually worked for us.

And then came the remote debugging. We came across this awesome documentation from Istio which described how to create a custom istiod image with Delve enabled in it. That document was really eye-opening for us, because it gave us the ability to debug an istiod pod running inside a remote EKS cluster and look at the internals of the code from our local IDE, and we were able to replicate the issue by creating more churn in one of our sandbox clusters.

So now that we had the ability to debug the Istio code, let's see what it pointed us to. This is a screenshot from our debugger when pod churn was at its peak. A few slides back we talked about the controller queue, which gets all those events from the API server; at that peak it had 20,000 elements in it, and they were taking a long time to get processed, which clearly signified there was some contention, some delay, along this path. But from an observability point of view, that controller queue had only error logging; it had no metrics that could tell us the number of elements in the queue or the distribution of object types in it.
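To see why a pile-up like that is possible, here is a toy model, not Istio's actual queue implementation, of a queue drained by exactly one worker: every callback waits behind whatever is in front of it, so a burst of slow items delays everything that follows.

```go
// Toy model of a single-consumer work queue, processed strictly one item
// at a time. Not Istio's code; the counts and sleep times are illustrative.
package main

import (
	"fmt"
	"time"
)

type task func() error

type controllerQueue struct {
	tasks chan task
}

// Push enqueues a callback; it does not run it.
func (q *controllerQueue) Push(t task) { q.tasks <- t }

// Run drains the queue with exactly one goroutine: non-concurrent by design.
func (q *controllerQueue) Run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case t := <-q.tasks:
			if err := t(); err != nil {
				fmt.Println("task failed:", err)
			}
		}
	}
}

func main() {
	q := &controllerQueue{tasks: make(chan task, 1000)}
	stop := make(chan struct{})
	go q.Run(stop)

	// Simulate a churn burst: 50 callbacks that each take 100ms to handle.
	start := time.Now()
	done := make(chan struct{})
	for i := 0; i < 50; i++ {
		i := i
		q.Push(func() error {
			time.Sleep(100 * time.Millisecond) // stand-in for heavy processing
			if i == 49 {
				close(done)
			}
			return nil
		})
	}
	<-done
	// The last event waits behind all the others: ~5s, not 100ms.
	fmt.Println("last event processed after", time.Since(start))
	close(stop)
}
```

Scale that up to tens of thousands of queued callbacks during churn and you get exactly the kind of backlog we saw in the debugger.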
So now we knew that something was piling up in that queue, but we didn't know the types of the objects, or whether specific objects were taking more time than others, which could have given us some evidence as to what was happening. Now we'll dive into some code. But before that: because there were no graphs and no metrics out of this controller queue, we had assumed everything was fine, and it turned out that was not the case for this particular piece of code.

What does the controller queue actually contain? It contains the callbacks generated from the informer events. If a pod terminates, there is a specific piece of code in istiod that handles that pod termination, and there are different callbacks subscribed for different types of objects. These callbacks are what get placed into the controller queue, and when you process the queue, you dequeue an element and invoke that callback, one at a time, in a non-concurrent way. But as I said, up to this point we had no clue whether the delay was being introduced by pod events, endpoint events, service events, or any of the other 40-odd object types that istiod listens to.

So we decided to modify some code inside istiod. Rather than just pushing the callback into the controller queue, we attached some metadata: the timestamp when the event was pushed into the queue, and the type of the object the change event relates to. This is roughly the change we made. At the top you can see that c.q is the controller queue we've been talking about; it has a Push function that takes a function with no arguments that returns an error. Because this controller queue is used in about 50 different places in the istiod source code, we converted that interface to push an enriched version of the callback: not just the callback, but also the start time and the type of the object the change event relates to. While processing the queue we added additional logs to lead us to further evidence, and we generated custom metrics from those logs to infer how long each object took to get dequeued and, especially, to get processed.

After all those fun things, we arrived at a very interesting point: EndpointSlices were the main culprit. They were taking the most time to process out of all those 40-odd object types, and at the peak of the pod churn it would take 800 milliseconds just to process one EndpointSlice event. If we assume that during the churn we have 1,000 pod terminations, and that leads to 500 EndpointSlice update events, that translates to a delay of almost six minutes, which means the cluster is operating on stale configuration for six minutes, and that's why the ingress keeps routing traffic to a pod that is long gone before it gets the updated information. All those additional logs and metrics pointed us to the fact that EndpointSlices were the culprit in this situation.
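The instrumentation idea looked roughly like the following. This is a simplified sketch of wrapping each callback with its enqueue timestamp and object kind, not the actual istiod patch; the type and field names here are invented for illustration.

```go
// Sketch: instead of pushing a bare callback, push it wrapped with the
// enqueue time and object kind, then record wait and processing time on
// dequeue. Names are illustrative, not Istio's.
package main

import (
	"fmt"
	"time"
)

type enrichedTask struct {
	kind     string       // e.g. "EndpointSlice", "Pod", "Service"
	enqueued time.Time    // when the informer event handler pushed it
	fn       func() error // the original callback
}

type queue struct{ tasks chan enrichedTask }

// Push now records metadata alongside the callback.
func (q *queue) Push(kind string, fn func() error) {
	q.tasks <- enrichedTask{kind: kind, enqueued: time.Now(), fn: fn}
}

// run is still a single consumer, but now it can tell us which object kinds
// are slow and how long events sat in the queue before being handled.
func (q *queue) run() {
	for t := range q.tasks {
		wait := time.Since(t.enqueued)
		start := time.Now()
		err := t.fn()
		fmt.Printf("kind=%s queueWait=%v processing=%v err=%v\n",
			t.kind, wait, time.Since(start), err)
	}
}

func main() {
	q := &queue{tasks: make(chan enrichedTask, 100)}
	go q.run()

	// Illustrative events: the EndpointSlice handler is the slow one.
	q.Push("Pod", func() error { time.Sleep(5 * time.Millisecond); return nil })
	q.Push("EndpointSlice", func() error { time.Sleep(200 * time.Millisecond); return nil })
	q.Push("Service", func() error { time.Sleep(5 * time.Millisecond); return nil })

	time.Sleep(500 * time.Millisecond) // let the consumer drain (demo only)
}
```

Logging per-kind wait and processing times like this is what let us single out EndpointSlice events from the other object types.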
When we generated CPU profiles of istiod, we noticed a similar thing: two code paths contending for the same lock, and the interesting part is that both of those paths are related to endpoint processing. The gray one is related to the controller queue, as we mentioned, and the red one is the path that pushes the XDS config out using the XDS API, and they are both fighting for the same lock.

When we shared all those metrics, logs, and istiod CPU profiles, the maintainers were very swift to acknowledge the issue and to fix it at the root cause: they moved some heavyweight processing that was happening inside the lock to outside it, to optimize things. All of this was happening on version 1.17.1, way back in June or May, I think, but the actual fix came out in September, in version 1.19.0.

Until that official fix was available, the Expedia reliability engineering team found a very interesting workaround that removed the majority of the lock contention, and the majority of our problem was gone. So let's see what that change was. A few slides back we talked about a flag called PILOT_PUSH_THROTTLE, which controls the number of goroutines. It turns out that when we changed that number from the default value of 100 to match the number of vCPUs, effectively a value of 16, we saw a huge drop in the processing time of EndpointSlice update events: from 450 milliseconds down to less than 50 milliseconds. The majority of the lock contention was gone for us, and we were happy.

But why did that change actually work? We've already shown you the CPU profile, and the interesting thing is that there is an imbalance in the number of goroutines trying to take the same lock. The gray path belongs to the controller queue, and as we mentioned, it operates on one single goroutine; it is non-concurrent. The red path, however, has those 100 goroutines trying to take the lock. When we changed that number from 100 to 16, the number of goroutines fighting for the lock came down, and the contention was gone.

We took this opportunity to add metrics to the controller queue: latency, which represents the time it took to dequeue an element; queue depth, which represents the number of elements in the queue; and processing time, which is the time it took to process the dequeued event. These metrics are all available in version 1.19.0, along with the fix from the maintainers, and they can be enabled by setting the ISTIO_ENABLE_CONTROLLER_QUEUE_METRICS environment flag. On the screen we have a QR code if anybody is interested in the link to the PR we opened against Istio.

I'd like to give a huge shout-out to the Istio maintainers, especially John Howard; you must have heard his talks. He was very swift to acknowledge the issue once we shared all the evidence we had gathered, and they addressed it in no time, so kudos to them and the community. I think Expedia owes them a few drinks and some pizzas.
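For readers who want to picture the locking change described above (and expanded on in the Q&A below), the pattern looks something like this. It is a sketch of the general technique, not the actual Istio code: do the expensive work outside the critical section and use a read-write mutex so concurrent readers don't serialize behind each other.

```go
// Sketch of the locking pattern change, not Istio's code. Before: heavy
// computation happens while holding an exclusive mutex, so the single
// controller goroutine and the many push goroutines all serialize behind it.
// After: compute outside the lock, hold it only to swap the result in, and
// use an RWMutex so concurrent readers don't block each other.
package main

import "sync"

type index struct {
	mu        sync.RWMutex
	endpoints map[string][]string // service -> endpoint addresses
}

// updateBefore illustrates the problematic shape: expensive work under the lock.
func (i *index) updateBefore(svc string, raw []string) {
	i.mu.Lock()
	defer i.mu.Unlock()
	i.endpoints[svc] = expensiveBuild(raw) // heavy work inside the critical section
}

// updateAfter moves the heavy work out and keeps the critical section tiny.
func (i *index) updateAfter(svc string, raw []string) {
	built := expensiveBuild(raw) // heavy work done with no lock held
	i.mu.Lock()
	i.endpoints[svc] = built // only the cheap swap is protected
	i.mu.Unlock()
}

// read is what the push goroutines do; with RLock they can proceed in parallel.
func (i *index) read(svc string) []string {
	i.mu.RLock()
	defer i.mu.RUnlock()
	return i.endpoints[svc]
}

// expensiveBuild stands in for building per-proxy endpoint state.
func expensiveBuild(raw []string) []string {
	out := make([]string, 0, len(raw))
	for _, r := range raw {
		out = append(out, r+":15001") // placeholder transformation
	}
	return out
}

func main() {
	idx := &index{endpoints: map[string][]string{}}
	idx.updateAfter("reviews", []string{"10.0.0.1", "10.0.0.2"})
	_ = idx.read("reviews")
}
```

With the old shape, lowering PILOT_PUSH_THROTTLE helps simply because fewer goroutines queue up on the mutex; with the new shape, the critical section is short enough that the contention largely disappears.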
Regarding the takeaways: as users of open source products like Istio, we should make our best efforts to contribute back to them, and not just through code, because contributing code is often seen as challenging. There are other ways, like developing an understanding of the internals of the project, improving the documentation, or something as small as fixing comments and typos; those go a long way. But I've often seen individuals and even companies treat open source tools like a black box: they don't understand the internals, yet when they hit a problem, help from the community is demanded. That perspective needs to change, because big projects can offer enterprise support, but for smaller projects it's difficult when that kind of support is demanded without anyone knowing the internals of the project.

Well, thank you all for listening. We have a Medium blog that describes this same journey in a little more detail. On the right we have the Istio GitHub issue that we opened with the community to get some help from them, and you can leave us feedback on whether you liked this session via the rightmost QR code. Thank you all for listening, and I hope it was fun and not too intense in terms of technical depth. Any questions?

Thank you. I think we have one.

I was curious: in the slides you mentioned the workaround you used before the Istio maintainers were able to fix it, but do you know what the maintainers did to fix it in the 1.19 version, and if so, could you explain that a little?

Yeah, so there was some heavyweight processing happening inside the lock that didn't actually need to be protected by the lock, so they moved that processing out of the lock, which improved things. They also turned the lock into a read-write mutex so that reads could proceed concurrently, and I think that was one more big factor that helped.

Any other questions? Last call. All right, thank you. Thank you all.