Hey there everyone. Thanks for joining our session. Today we're going to be talking about pre-sail checks: do you need a checklist before sailing to production with Istio?

First off, a quick round of introductions. My name is Simon Green and I'm a field engineer here at Solo. I'm based in Boston, Massachusetts, and I'm very proud to say this is actually my first IstioCon; I haven't been part of the Istio community for that long. In my prior life I contributed to Apache Camel, ActiveMQ, and Kafka, and I've got roughly 25 years of experience working in the app dev world. And over to you, Ram.

Hey everyone. My name is Ram Vennam. I'm also at Solo. I lead one of the field engineering teams, called Special Operations, which focuses on providing mesh-based solutions for some of the more complex use cases, so I focus on enterprise customers and some of the larger multi-tenant environments. I've been working on Istio since almost the beginning of the project; before joining Solo almost three years ago I was at IBM, also working on Istio. So yeah, glad to be here, and glad to continue being able to speak and present to the Istio community. Happy to be here.

Perfect. Thanks, Ram. So first off, for those of you who don't know me, I've also been a pilot for roughly 15 or so years. As a pilot, using checklists was something that was drummed into me as far back as my primary training: not only pre-flight checklists, but checklists for all phases of flight, whether it's takeoff, cruise, descent, or landing. And of course we also use them for emergencies.

What we're looking at right now is my airplane. It's a highly modified 1949 Temco Swift, built in Dallas, Texas. What's unique about it is that it has over 150 airframe modifications, and because of that there was never a checklist for this unique bird. So what I had to do was go ahead and write my own. And here is the checklist I wrote. It covers everything I could possibly need to do, including what to do in an emergency.

In fact, I actually had an in-flight emergency once where my landing gear wouldn't extend. Luckily I had a passenger to help me. This is what we call crew resource management, or CRM. My passenger kindly took the checklist and read, line by line, the tasks I needed to accomplish in order to troubleshoot my production outage: he read and I verified. This resulted in a really positive outcome. I was able to identify an electrical fault, which could be worked around by cycling the flaps three to four times. And further extending CRM, I brought in the control tower to help verify that my landing gear was in fact extended and I was cleared to land. What does all of this mean? Checklists facilitate teamwork.

So the question is, why am I telling you all of this? Whether it's piloting an airplane, or a surgeon in the operating theater who doesn't want to forget a scalpel in their patient, or a platform SRE responsible for business-critical workloads, we need a clearly defined checklist. A checklist is intended for the pilot, though, not for the mechanic: we're focused on checking the oil, not changing it. In other words, we worry about solving the issue later, in a more controlled manner. Use the checklist to help identify shortcomings or gaps, either during production readiness or during an outage.
Taking what I've learned as an aviator, Ram and I have embarked on drafting an Istio readiness checklist, which you can see over here on the right-hand side. This checklist is intended for the platform ops team, those who are focused on day-two ops, not day zero. The intention is not to go into the internals of how Istio works, but more about identifying gaps or outliers. So what we're going to do in today's session is help you build your own checklist, just like I did for my highly modified airplane. Ram and I have put together a bunch of different customer case studies where we were able to observe the problem, check and validate what was wrong, and help the customer to a positive outcome.

And with that, our first example is the orphaned sidecar problem. For this particular case, we had a customer come to us with a problem that some of their services were not behaving well. They were seeing random, intermittent errors, and it just started happening out of the blue. So, as field engineers, Ram and I asked them what had changed in the system leading up to this event. They mentioned that they'd recently upgraded Istio, but didn't think it was part of the issue because things were working fine after the upgrade. We helped narrow the problem down to just a few microservices and observed in the sidecar logs that there were endpoint- and certificate-related errors.

From the customer's perspective, this was difficult to troubleshoot, mostly because of the intermittent and random nature of when it started occurring. It wasn't all of the services in the mesh, it was only a subset of them, and it wasn't all destinations or all endpoints that the mesh service was trying to route to; that was also intermittent. So the first thing we checked was that istiod was okay, and istiod didn't have any errors. We turned on access logging in the sidecars and looked at the access logs: most requests were working, and we were seeing errors in some of them but not all. There were a few warning messages, but not enough for the customer to quickly debug it themselves.

The issue was that some of these sidecars were orphaned from istiod. When they did the upgrade, they did a revision-based upgrade: they installed a new control plane and took out the old one, but they didn't complete the entire process of moving all of the sidecars over to the new one. The way a sidecar works, once it receives config from istiod, the control plane, it is self-sufficient for a while. With that configuration it will continue running, until things like the endpoints it's trying to reach go stale because no one is updating the config, and until the certificates it's no longer receiving from istiod expire. That's why the customer wasn't able to identify the problem for a while: if you're not making strict mTLS connections to other sidecars, it might be a couple of days before you notice a potential problem.

The solution for all this, in your checklist: obviously, make sure you run all of the steps Istio tells you to do before and after an upgrade, to make sure all of the sidecars are moved over. And to catch other orphaned sidecar problems, there are metrics exposed on the sidecar side: Envoy's cluster_manager update metrics, for example, or the CDS update time and LDS update time metrics that each sidecar emits.
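To make that concrete, here is a minimal sketch of the kind of alerting rule being described, written as a Prometheus rule. It assumes Prometheus is scraping the sidecars' merged Envoy stats with Istio's default stat prefixes exposed; the exact metric and label names, the six-hour threshold, and the rule names are illustrative placeholders to adapt, not a prescribed configuration.

```yaml
groups:
- name: istio-orphaned-sidecars
  rules:
  - alert: SidecarXdsConfigStale
    # envoy_cluster_manager_cds_update_time is the timestamp (ms since epoch)
    # of the last successful CDS update the sidecar received; if it is very
    # old, the sidecar may be orphaned from istiod.
    expr: (time() - envoy_cluster_manager_cds_update_time / 1000) > 6 * 3600
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: >-
        Sidecar {{ $labels.pod }} has not received a CDS update in over 6 hours;
        it may be orphaned from istiod (for example after a revision upgrade).
```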
Set up an alert along those lines: if those metrics show no update past a certain number of hours, you probably have an outage that's about to hit.

Perfect. Another example: one of our Istio support customers recently endured a production outage during an extremely busy trading weekend. After further investigation, they determined that all of their virtual services were missing. This is really strange, though, because the developers don't have access to delete all the virtual services across namespaces; there should have been proper boundaries in place around their namespaces to prevent this. So how did this happen?

A lot of customers we work with have large environments where a platform team is responsible for installing platform-related things, like Istio, onto the cluster, and then each individual development team, or tenant, operates in their own set of namespaces. When the platform team installed Istio, things like the Istio control plane and maybe the Istio gateways, they protected all of those resources and namespaces so the tenants couldn't modify them. That's the obvious part. And then they used Kubernetes RBAC to make sure that developers in each namespace could only create the resources they were deemed to be allowed to create. What they failed to protect were the cluster-scoped resources, things like the Istio CRDs, not the CRs, but the CRDs themselves. And what happens if you delete a CRD, like the Istio VirtualService CRD? All of the CRs get deleted with it.

So this is another checklist item: if you're running a multi-tenant environment, make sure you're using the proper Kubernetes RBAC controls, so that you're not only protecting the stuff that's inside the namespaces and namespace-scoped, but also protecting things like the CRDs. Just make sure you lock down access as much as possible. A lot of the time, what we've seen is that this is not malicious, it's purely accidental. One of the tenants couldn't get something working, so they deleted a CR, or a CRD, without entirely understanding what was happening, because they'll try anything to get something working, especially in a high-stress environment. So make sure you have proper checks in place to prevent that from happening.

Thanks, Ram. So the next example is traffic hijacking. For this case study, we worked with a platform team that, in production, noticed traffic routing to completely different upstream servers. There had been no recent configuration change in the cluster, but the routing behavior had changed. We simplified the problem down. The platform team had a shared Gateway resource in the istio-gateways namespace, meant to be shared across the cluster. One of the team leads, we'll call her Iris, has a virtual service that routes traffic with the prefix /app to her hello-world service. Some time in the past, Tim, from another team, had decided to create a virtual service in team B's namespace, which included a route for /app/login, routing traffic to his service. This didn't impact anything with the original application, and they went to production, but somewhat randomly, the routing in production changed. By default in Istio, configuration is exported to all namespaces; it's not scoped. So whatever virtual service you create in one namespace is exported to all the other ones, unless you explicitly scope it down.
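For reference, the situation just described might look roughly like the following two resources. This is a hedged illustration only; the namespaces, hostnames, gateway, and service names (team-a, team-b, app.example.com, shared-gateway, hello-world, login) are hypothetical placeholders, not the customer's actual config.

```yaml
# Iris's original route in team A's namespace:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: hello-world
  namespace: team-a
spec:
  hosts:
  - app.example.com
  gateways:
  - istio-gateways/shared-gateway
  http:
  - match:
    - uri:
        prefix: /app
    route:
    - destination:
        host: hello-world.team-a.svc.cluster.local
---
# Tim's later route in team B's namespace, for the same host:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: login
  namespace: team-b
spec:
  hosts:
  - app.example.com
  gateways:
  - istio-gateways/shared-gateway
  http:
  - match:
    - uri:
        prefix: /app/login
    route:
    - destination:
        host: login.team-b.svc.cluster.local
```

Both virtual services attach to the same shared gateway and the same host, so Istio has to merge them, and because /app also matches /app/login, whichever rule ends up ordered first decides where login traffic goes.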
And in a multi-tenant environment, we've often seen this, where for a single host there are multiple virtual services across different namespaces that people use to do their multi-tenancy. Istio does its best to merge all of these routing rules, these virtual services, together. But for practical purposes, that ordering is essentially undefined. When there is a conflict between two routes, Istio essentially just picks the first one based on a timestamp: whichever was created first gets higher up in the order. Same thing for catch-all rules: whichever one Istio picks up first is the one that takes effect. That's why you can get into really bad intermittent behavior.

In this example, team A was running fine. Team B comes along and applies conflicting config. They don't realize it's bad, because Istio doesn't reject it; the conflicting route just sits there dormant. Some time in the future, maybe in production, the original team A uninstalls and reinstalls their app. Now the timestamp has changed, and that dormant bad config is suddenly active. That's where we often see users get tripped up in the middle of production: the Istio routing has changed and they haven't actually changed any configuration. They've only changed the order in which particular resources get applied, or the timing is different when they scale to another cluster.

The solutions for this: ideally, avoid using multiple virtual services for a single host. But that's not always possible, depending on how your environment is architected. The second thing is to scope down all of your resources (there's a short sketch of this scoping a little further down). Your gateway can scope down exactly which namespaces it will pick up virtual services from: in the hosts field, you can specify which namespaces are allowed to bind to that gateway. Same thing for your virtual services: make sure the exportTo field is set on all of your virtual services, so that you're exporting them to the right gateway's namespace, and not, for example, from dev to test or staging by accident. istioctl analyze might catch some of these things, but we've seen a lot of cases where it doesn't, just because of the complexity of rule ordering. And the last piece of advice: run istioctl proxy-config routes against your gateway. It shows you all the routes bound to that gateway, and from there you can see whether you have multiple routes going to destinations you did not intend. So before going to production, run istioctl proxy-config routes to look at the configuration of your Istio ingress gateway and make sure it looks right.

Excellent. So for our next case study, we had a customer noticing a high number of cross-zone traffic charges from their cloud provider. After digging a little deeper, we noticed a lot of traffic between sidecars in one zone and istiod running in another zone, with a lot of chatter. This is not something people typically think about, but with Istio running in your environment, it can become a concern. The customer asked us if there was something we could do to help them reduce this cost.

Yeah. If you take a step back, the premise of this problem is that, again, by default, every sidecar, every mesh service, needs to know about every other mesh service. So if anything changes in your mesh, istiod has to take that updated information and blast it out to every other sidecar.
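Coming back for a moment to the scoping fixes mentioned for the hijacking case, here is a minimal sketch of what a namespace-scoped gateway and an exportTo-scoped virtual service might look like. All names are the same hypothetical placeholders as before, and this is only one way to express the scoping, not the only one.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: istio-gateways
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "team-a/app.example.com"   # only virtual services from team-a may bind this host
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: hello-world
  namespace: team-a
spec:
  exportTo:
  - "."               # visible inside team-a itself
  - "istio-gateways"  # and to the gateway namespace, nowhere else
  hosts:
  - app.example.com
  gateways:
  - istio-gateways/shared-gateway
  http:
  - route:
    - destination:
        host: hello-world.team-a.svc.cluster.local
```

With both sides scoped like this, a virtual service accidentally created in another namespace for the same host can no longer attach itself to the shared gateway.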
That mesh-wide fan-out can be scoped down, but a lot of users don't scope it down; what we've seen is that it's wide open. So istiod is doing a lot to orchestrate everything, and that becomes costly. We know it's costly at a resource level, but it's also costly at a networking level, especially when you have multi-zone clusters where your sidecars are spread across availability zones, and the istiod instances that all of these sidecars are connected to are not necessarily in the same zone.

The solutions for this: make sure you're scaling istiod to match your environment. Make sure you're running multiple istiod replicas and have the proper affinity set, so they're spread across your nodes and across your zones. You can take advantage of the topology-aware routing feature in Kubernetes to make sure that when sidecars connect to istiod, they connect to the closest available istiod, so that you reduce the cross-zone routing and chatter.

So istiod is doing a lot, and you can look at what istiod is doing: the amount of information it's sending to every single sidecar, to every single one of your workloads. From a workload perspective, just like you have metrics to observe how your application is doing, you should also observe the amount of configuration istiod is sending to your workload. If that's getting too high, or if you have a few workloads that are really high, then you know those are the ones to tackle first: create Istio Sidecar resources to scope things down properly, so that istiod isn't sending that much data to every one of your sidecars. So that's how much data.

Next, you should also monitor how often and how long it takes istiod to push this information to the sidecars. Typically, a lot of users monitor the pilot_xds_push_time metric to see how long it takes for config to get from the control plane to the sidecars. But it's worthwhile in production to understand that there's actually more that goes into istiod's XDS pushes. Airbnb did a great blog post on this, and this diagram is straight from that blog; I've put the link at the bottom. Essentially, whenever there's any change in the environment, a push request gets created for that change. That push request sits in a debounce period for a little while, to make sure it picks up all of the related information and all the merges are done. Once that period has passed, istiod picks up the push request, it gets added to the push queue, the XDS config is calculated, and then it's sent out. So there are multiple steps in this process, and when you're running Istio in production, make sure you're graphing every single one of those metrics so that you can tune things properly.

All of those various periods, the number of concurrent pushes, and so on are configurable with environment variables, and in production you might tune them differently than in your dev environment. For example, in production there aren't as many configuration changes happening; you're not constantly applying new config. What does change is endpoint information, the IPs, as your pods come up and down, and you want that information to get to the other services as fast as possible, because you don't want any downtime in production, even the slightest. So for that, you would reduce the debounce period to make sure any update gets sent quickly to the other sidecars.
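As a rough illustration of the kind of tuning being described, these istiod environment variables can be set on the pilot component of the install. The values shown are only examples to adapt per environment, not recommended numbers, and the defaults noted in the comments are the commonly documented ones.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        # How long istiod waits after a config event before pushing, so that
        # bursts of changes get collapsed into one push (default 100ms).
        - name: PILOT_DEBOUNCE_AFTER
          value: "50ms"
        # Upper bound on how long events can keep being debounced (default 10s).
        - name: PILOT_DEBOUNCE_MAX
          value: "5s"
        # Maximum number of concurrent XDS pushes (default 100).
        - name: PILOT_PUSH_THROTTLE
          value: "200"
```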
And you can also increase your push throttle. You don't want to do that in your dev environment, where you have a lot of configuration churn, because you'll just bottleneck istiod trying to keep up with all of those config changes. So there you might set a higher debounce period, maybe 10 seconds, whereas for production you might do 100 milliseconds.

Thanks, Ram. So the next example is what we call the bypassed sidecar. This example comes from a large consulting company we were working with, who had already built a highly secure app on top of Istio. They asked us, as the Istio experts, to help them with production readiness before go-live. After looking at their installation, we noticed a lot of unencrypted traffic between workloads, and also that authorization policies were ineffective. They asked us if we could help identify the root cause.

Yeah. So this is a case where a platform team installed Istio, they thought they had installed the right restriction policies, and they just expected everything to be secure. That was not the case, because developers had found ways to go around the sidecar. I don't know if it was malicious, or they were just trying to get something working and forgot about it. But developers can add annotations to their pods, like traffic.sidecar.istio.io/excludeOutboundPorts or excludeOutboundIPRanges, or they can add an annotation so the sidecar doesn't even get injected. There are various things they can do to bypass the Istio sidecar. The production checklist solution for this is to use something like OPA to ensure those annotations aren't present in your environment, and that developers aren't relying on these kinds of annotations to get around a problem.

Perfect. So, similar to our last example, I recently chatted with another Istio adopter who had issues with their SQL Server RDS database running on AWS. As relational databases tend to, it had become a massive bottleneck. Basically, all of their microservices running in EKS had uncontrolled access to stored procedures and functions, causing database records to lock up. This Wild West was tolerated until a major outage occurred, bringing down the entire platform.

This is a common problem we've seen: people don't think about egress controls at all until something bad happens. Egress is just not a concern. Every developer, every microservice, talks to whatever external endpoint it has access to, and that continues until something bad happens. Maybe that external endpoint that a lot of services depend on now has a rate limit set on it and starts rejecting connections, and now you're scrambling to figure out which microservice is hammering that external service. You don't have controls in place to restrict who has access to what, to add your own rate-limit policies to it, or, if the URL for that external service changes, to roll out the change seamlessly across your entire environment.

So the production checklist for this one: leverage Istio's egress controls. That can be as simple as creating ServiceEntry objects and setting the outbound traffic policy to registry-only; that's a good first step. But the right final step is to leverage an egress gateway, running on dedicated nodes, where you have full control over how traffic leaves your system. The same way you have an ingress gateway where you can control, secure, and observe traffic coming in, you need the same things for your external egress as well.
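As a hedged sketch of that first step, flipping the mesh default and registering an allowed external service might look roughly like this; the RDS hostname, namespace, and resource names are hypothetical placeholders.

```yaml
# Flip the mesh default so unregistered external destinations are blocked:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY
---
# Then explicitly register the external services that are allowed,
# for example the RDS SQL Server endpoint:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: rds-sqlserver
  namespace: team-a
spec:
  hosts:
  - mydb.abc123.us-east-1.rds.amazonaws.com
  ports:
  - number: 1433
    name: tcp-mssql
    protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```

From there, routing that registered traffic through a dedicated egress gateway gives you the single, observable exit point described above.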
Don't wait until it's too late to apply those policies.

And here we have another important concept, which is in fact nothing new; it dates back to the Middle Ages, and we call it defense in depth. If we look at the castle-and-moat system, the way it works is that we defend against any particular attack using several independent methods. In other words, we have layered tactics. In the castle's case, the moat is the first line of defense, preventing attackers from getting too close. Then we have the overhanging holes, platforms, and so on built into the castle, so those inside could hurl things down at the attackers below. The question is, how can we apply the same concepts to networking, and can Istio help?

Yeah, I mean, everything we've been talking about builds up to this. Essentially, you want a layered approach. Istio provides a lot of security for your applications, but as you've seen, there are ways to get around it, and there are times when a piece of configuration opens a hole through Istio. When that does happen, make sure you have another line of defense. You can do that by leveraging Kubernetes network policies around your nodes or around your namespaces, just to have them as a layer-3/layer-4 backup. And also leverage egress gateways to make sure traffic is funneled through the right nodes and you have controls in place so that, if you want to cut things off, you can.

We're running short on time, so this is the last one. We've helped many Istio customers with intermittent problems they hit during pod startup. Looking at the application container logs, we saw it was related to not being able to talk to the network. The application relies on the network to start properly, and if it doesn't have retries or resiliency built in, it ends up failing to start and going into CrashLoopBackOff. This is a container race condition, whereby the Istio sidecar fails to start in time to serve the network requests from the app.

Yeah, in production you want to get rid of these race conditions as much as possible and have everything be deterministic. To do that, you have to understand how the Istio startup process works for every one of your services. When a pod comes up, by default the istio-init container has to run first to do its iptables redirection. Once that's done, the istio-proxy container has to come up, and then your application container has the network set up so that it can make its network calls. A lot of the problems we've seen are customers with their own init containers that race with the istio-init container: they see different behavior depending on whether theirs runs before Istio's init container or after. We've also seen cases where the customer's container starts up faster than istio-proxy sometimes in production, but never did in their dev environment, so you run into problems where startup speed changes how your application behaves.

We're hoping the ambient deployment pattern will solve this particular set of problems cleanly, without you having to worry about it. But until then: avoid using init containers if possible, and use regular containers so they always run after the init containers. Use the Istio CNI plugin so you don't need the init container to do the traffic redirection. And leverage flags like holdApplicationUntilProxyStarts.
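A minimal sketch of that last flag, which can be set mesh-wide or per workload; the deployment name, namespace, and image below are hypothetical placeholders used only for illustration.

```yaml
# Mesh-wide default, set on the Istio install:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
---
# Or per workload, via the proxy config annotation on the pod template:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
  namespace: team-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  template:
    metadata:
      labels:
        app: hello-world
      annotations:
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
    spec:
      containers:
      - name: app
        image: example.registry.io/hello-world:1.0
```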
That option gives you control over exactly when your application container comes up in relation to the istio-proxy container.

Cool. So I know we covered a few different sets of use cases, but they were meant to help you build your own production checklist to cover these types of issues. Everything that Solo field engineers and our customer success engineers have hit with our customers over the last three to four years of helping them with Istio, we've collected that feedback and built analyzers, automatic analyzers, to check for these best practices. These are not just static config analyzers, but also best-practice, deployment-level analyzers. We've built them into our Gloo Mesh Core product, which essentially sits on top of your existing Istio and gives you these insights into your environment. It does lifecycle management and helps the operations team make sure they're running Istio smoothly, upgrading Istio smoothly, have the right checks in place, and so on. And that management plane also provides insights, dashboards, and so forth. So that's where the result of all this work went: into building that analyzer solution.

Anyway, if you have any questions, please reach out to Simon and me. You can find us on the Istio Slack, or you can reach us in the Solo.io (Gloo) Slack channel. My email address is ram@solo.io. And Simon, what's your email address? It's simon.green@solo.io. Awesome. It doesn't look like there are any questions in the chat, so I hope you found the session useful. Again, if you have any questions, please feel free to reach out to us. Have a great rest of your conference, everyone. Thanks, everyone.