Hey there, welcome to our session. We're going to talk about what not to do when you are updating Istio in a critical environment. I am David DeLuca. I have been working at Itaú for four years, and currently I am part of the fraud prevention team as a staff software engineer. Hi, my name is Guilherme. I have been working here at Itaú for almost two years, and I'm currently part of the platform team in the role of senior platform engineer.

The history of Bank Itaú began in 1924, and since then it has been expanding its operations. It's the largest financial institution in the southern hemisphere, with millions of customers, operations across the Americas and Europe, and a diversified portfolio of financial products. This tells us that we can't stop our products: any failure could impact millions of customers in their mobile experience and cause losses for our company.

Let's recap some concepts about what we're going to talk about. What is a service mesh? A service mesh is an infrastructure layer that handles all communication between your services. Basically, it's a layer that allows you to transparently add new capabilities to your network, like monitoring, tracing, logging, traffic control and others. As we have hundreds of microservices running in our cluster, we need service-level observability and communication control over them. We chose Istio as our service mesh because we have a complex microservice architecture.

Talking about Istio's architecture, we can split it into the data plane and the control plane. The data plane is composed of the proxies, deployed as sidecars, which control all network communication between microservices. The control plane manages and configures those proxies. Istio has the following components: Envoy, which is a container deployed as a sidecar with the application (in our case, with the microservices), and istiod, which provides service discovery, configuration and certificate management. Istio also has an admission controller, a mutating webhook: when we create a new pod, the Kubernetes API server calls the istiod service to mutate the pod and inject the sidecar.

Bank Itaú is giant, and we manage one of the products from its portfolio. The Kubernetes cluster that we manage has more than 200 nodes, more than 5,000 pods and more than 400 microservices, and we receive more than 30 million requests per day.

As in every story, there is a context that makes it happen. In our case, it started with some organizational changes that resulted in a new team inheriting a fully provisioned infrastructure. This infrastructure had many components with outdated versions, and it was necessary to map out the order in which the updates would be carried out. Not doing the updates was not an option, because it could possibly impact the entire environment, making it totally unavailable.

As Guilherme said, we had to do some non-optional updates, including Istio. At that time, we were running version 1.6, when the current version was 1.7.0. From version 1.6 to version 1.7, we took the safe path: deploy a new control plane, create small groups of namespaces to be updated, update the labels on those namespaces, execute a rollout of the deployments in each group, and then update the Istio components. This process was very smooth, but it was very slow because we were deploying it in small windows.
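To give an idea of what that looks like in practice, here is a minimal sketch of that kind of grouped, revision-based flow, not our exact runbook: the namespace, revision and deployment names are hypothetical, and istioctl flags can vary between versions.

```bash
# Install the new control plane side by side, tagged with a revision
istioctl install --set revision=1-8 -y

# Point one small group of namespaces at the new revision
# (drop the old injection label, add the revision label)
kubectl label namespace payments istio-injection- istio.io/rev=1-8 --overwrite

# Restart the deployments in the group so the pods pick up the new sidecar
kubectl rollout restart deployment -n payments

# Wait for each rollout to finish before moving to the next group
kubectl rollout status deployment/checkout -n payments
```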
We tried to make it faster from version 1.7 to version 1.8: deploy a new control plane, update all the namespace labels, execute a full rollout on all deployment resources, and then update the Istio components. It worked, but we had a response time increase on the microservices, for all incoming requests, for a short period. Even with the response time increase, we were comfortable, because that situation looked like a self-recovery. Then we repeated the same process to update from version 1.8 to 1.10.

A few hours after the deployment, Murphy reminded us that if anything can go wrong, it will. We had run a lot of tests before going to production, but the scenario was a little bit different: in our tests we had a smaller number of pods, and the number of requests per second was incredibly lower. After the update to 1.10, we realized that istiod was running with a high number of replicas and the response time of the microservices was being degraded. istiod was consuming a lot of resources, and the HPA started to scale up new pods. As the pods were requesting more CPU, the cluster autoscaler started to launch more nodes, but we had used all the available IPs and it wasn't possible to keep launching new nodes.

Due to the increased response time, we started to have timeouts in the communication between the mobile application and the microservices. In some cases it was just a specific endpoint failing, but we didn't have circuit breakers at the code level, and this was resulting in an impact on the entire request. Microservice architecture is awesome, but you should implement circuit breakers to avoid this kind of scenario.

At that moment, we were desperate, seeing what was happening. As it was a critical environment, we needed to stabilize it as soon as possible to minimize the impact. This strategy was the same one we had carried out for the previous updates, but we were not successful on version 1.10. In fact, it was a catastrophe. We didn't have any observability of the mesh, and this directly impacted the troubleshooting of the problems that were occurring, because there was no way to measure everything that was happening during the update process. This scenario occurred on the first day, and it ended with our first rollback. After some days and some analysis, we tried the same process again, because we are stubborn and, rather than giving up, we thought: why don't we try again? This time it will work. That was our second attempt: not successful.

Here we started our investigation. During one of the troubleshooting sessions we carried out, we researched how each configuration was distributed to the sidecars. We saw that it's possible to access the configuration by running a curl from inside the istio-proxy container, and analyzing the results we had a bad surprise: the file was huge. The configuration file that was being sent to the sidecars had almost seven megabytes. After seeing it, we did some math (seven megabytes pushed to thousands of sidecars at once adds up to tens of gigabytes of configuration being distributed during a full rollout), got an idea of the problem that was occurring, and started to get prepared for another try.

We revisited all the Istio configuration to see what could be improved, and after some tests we adjusted the Istio components' resources and the HPA values. We also started to look at how other places were monitoring Istio. This process was very fast, because on the Grafana Labs page we found a lot of dashboards related to Istio, and the Istio official docs gave us some information on how to use them. We also configured our Alertmanager to be triggered if anything doesn't look good in our Kubernetes cluster.
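A check along the lines of the one that exposed the seven-megabyte file can be done against Envoy's local admin endpoint, which the sidecar exposes on port 15000. A minimal sketch, assuming the default (non-distroless) proxy image, which ships curl; the namespace and deployment names are hypothetical:

```bash
# From inside the istio-proxy sidecar, dump the Envoy configuration
# via the local admin endpoint and measure its size in bytes
kubectl exec -n payments deploy/checkout -c istio-proxy -- \
  curl -s localhost:15000/config_dump | wc -c
```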
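And on the circuit-breaker lesson from earlier: besides code-level circuit breakers, one mesh-level option is outlier detection in an Istio DestinationRule, which ejects failing endpoints from the load-balancing pool. A minimal sketch, with a hypothetical service and illustrative thresholds you would tune to your own traffic:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-circuit-breaker
  namespace: payments
spec:
  host: checkout.payments.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # queue limit before requests are rejected
    outlierDetection:
      consecutive5xxErrors: 5          # eject an endpoint after 5 consecutive 5xx
      interval: 30s                    # how often endpoints are scanned
      baseEjectionTime: 30s            # how long an ejected endpoint stays out
      maxEjectionPercent: 50           # never eject more than half the endpoints
```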
So after a lot of tests, we arrived at a new consolidated strategy. The first part was to install a new operator with the updated version and duplicate all the Istio components. An important point is that, even after the entire migration, we kept both operators running for a period of time to ensure that, if a problem occurred, we could do the rollback again. At this moment we had all the observability configured, we could see what was happening in both Istio versions, and we were prepared to take action quickly if something unexpected occurred.

In the new strategy, we also chose to use a policy enforcement tool to distribute configurations in the production environment. A policy enforcement tool allows us to validate, mutate, generate and clean up resources at the cluster level. We saw that doing this change manually would cost time that we didn't have, so we ended up choosing to use it to apply the Istio revision labels to the namespaces, because we have more than 400 namespaces, and also to configure the gateway in the virtual services, because we have more than 300 virtual services.

After two rollbacks, it was time to play it safe. We chose to combine a canary deployment at the API gateway level with the canary upgrade from the Istio official documentation. Using a canary deployment, only the requests on the canary could be impacted in case we had any problem. To make it work, there were a few steps: split the namespaces into groups based on the number of replicas; update the Istio revision label on each namespace in the group; execute a rollout for each deployment and wait for it to finish. Then we repeated these three steps until all the groups were updated. After that, it was time to include the new Istio gateway in all the virtual services and perform a canary deployment on the AWS API gateway to shift the traffic. This strategy proved to be very safe, and with it you are able to skip versions in the Istio update process.

In this journey, we had the opportunity to learn valuable lessons that became our golden rules when we talk about updating core components of a Kubernetes cluster, like Istio. The first golden rule is: never execute a rollout for all deployments in a single shot if you have thousands of microservices. If you are not using the Sidecar resource to restrict the traffic between services, your proxy configuration will be big, resulting in high CPU usage while it's being distributed across the service mesh (there's a minimal sketch of such a Sidecar right after our wrap-up). The second rule is: avoid one-size-fits-all. It's important to do something that fits your workload. Sometimes it's hard to have a test scenario super close to production, so it's better to go slowly. The third rule is: ensure that you have good observability. Without configuring all the Istio dashboards in our Grafana, we were totally blind; after configuring them, we had a clear view of what was happening in the environment. The improvements in observability allowed us to control the speed at which we carried out the updates and to see the effects of the changes on the environment. The last one is: share everything. Sharing knowledge brings up different points of view, which helps us to define a good solution. And finally, sharing the ideas can help teams from other companies that may be going through the same difficulties, and consequently foster the open source culture.

So that's it. Thank you for watching our session. We hope you have enjoyed it, and feel free to reach us using the QR code. Thank you, guys. As David said, our contact QR code is here, and if you have any doubts, let's talk about it.
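And, as promised, here is the kind of Sidecar resource the first golden rule refers to. A minimal sketch, assuming a hypothetical namespace whose workloads only call services in their own namespace plus the Istio control plane and gateways; with a scope like this, istiod pushes each proxy only the configuration for those hosts instead of the whole mesh:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: payments
spec:
  egress:
    - hosts:
        - "./*"            # services in this namespace only
        - "istio-system/*" # plus the Istio control plane and gateways
```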