So welcome to ServiceMush: debugging Istio deployments. My name is Sandeep Parikh. I work for Google Cloud, where I write sample code and best practices, and I work with a lot of customers and people who are implementing Istio — specifically DevOps, Ops, SecOps, SRE, that sort of thing — and we try to help them do Istio the right way in their environments. You can find me on Twitter or GitHub at CircusMonkey minus the vowels, so that's C-R-C-S-M-N-K-Y.

All right, so let's start with the basics. Microservices introduce a number of challenges into your infrastructure and environment, and a lot of it really comes down to cognitive overhead: there's just a lot more work that has to happen to coordinate everything. You get a lot of benefits — increased velocity, language independence, people get to choose the right tools for the right job — but you also introduce a lot of other components. There are way more services to keep track of. Everything is a lot chattier, because you've got to do a lot more coordination, which can lead to network congestion. You've got to worry about service reliability. You've got to figure out things like: what happens if I need to rate limit one service and fall back to another version, or if I want to direct traffic to particular versions or implementations? How do I handle that? How do I do encryption in transit — how do I make sure that every connection to every other service within that cluster, within that environment, is secure as well? And then you've got to worry about aggregating metrics, logs, tracing, telemetry.

In a lot of deployments, what we often see is that each team takes its own approach: each team runs its own monitoring stack or its own logging stack or its own infrastructure stack. The problem is that then no single central team has visibility into what's going on, so when there's a problem, there's no way to pinpoint whose fault it really is.

That's where Istio comes in. Think of Istio — and I'd say this broadly applies to service mesh in general — as automation. These are automation tools: they let you implement and apply policy at scale, whether your deployment has 10 microservices or a thousand. They let you do things like turn on encryption in transit and say: I want every service-to-service communication to be encrypted with mTLS, and I don't wanna have to manage those keys — I want them handed out automatically. I wanna be able to route traffic in fine-grained ways across my entire cluster and say things like: hey, if a particular user agent comes in for this one API, I want it to go to this version of the service instead of version A. Or: service A can talk to service B, and service B can talk to service C, but A can't talk to C — and you wanna implement that level of control. That's often what's required when you get to that 10, 100, 1,000-microservices sort of complexity, and that's where Istio applies that kind of automation. Because it examines everything that's happening within your cluster — all the network traffic — and mediates everything inbound and outbound, it can capture every important metric around latency, request counts, and how many bytes are being transferred across the wire, with little or no change to your application code.
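As a quick illustration of that fine-grained routing idea, here's a minimal sketch — every name in it (my-api, the v1/v2 subsets, the user-agent pattern) is hypothetical, not from the demo — of a VirtualService that steers matching user agents to a different version:

```bash
# Hypothetical sketch: requests whose User-Agent matches the regex go to
# subset v2 of my-api; everything else falls through to v1.
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-api
spec:
  hosts:
  - my-api
  http:
  - match:
    - headers:
        user-agent:
          regex: ".*Mobile.*"
    route:
    - destination:
        host: my-api
        subset: v2
  - route:
    - destination:
        host: my-api
        subset: v1
EOF
```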
And then along with all that telemetry, there's obviously security, and managing the flow of traffic into and within your clusters. So it's a pretty powerful mechanism, which is great. Everything's roses, right? Except that all of these technologies — and Istio is no exception — are not without inherent complexity. They take that bucket of microservices challenges and add more stuff to it.

The upside is that it centralizes that overhead to a smaller group of people. Instead of every single application team running its own monitoring stack, you run one, and that's run by the central ops team now, and Istio is covered by them. So it's more complexity, but spread across a smaller number of people — in theory — and that's a lot of what we actually see in practice with Istio. But you are adding additional control plane components to your Kubernetes cluster, and those generate lots of logs, so that's more to manage. It's got a pretty deep API — it's incredibly composable, you can build a lot of really cool things with it, but it's got a lot of moving parts. Policies, like security or encryption policy, are highly customizable, which means it's really easy to shoot yourself in the foot. And the last thing is that Istio's control plane sits on top of Kubernetes. You can add VMs to a mesh, you can add external services to a mesh, but ultimately the control plane components live on Kubernetes, which is complicated enough as it is. So you're adding layers of complexity here.

On top of that, Istio is a very fast-moving project. 1.2 was announced back in June, and they're moving to a quarterly release schedule, so we're expecting 1.3 to land about mid-September. If you look at the Istio community calendar right now, there are days of testing planned for next week, and there are doc-fix sprints starting next week to get ready for that mid-to-late September release. The documentation is constantly growing and getting reorganized, better and better, based on what we're seeing users ask for. And the tools ecosystem is still really young. There isn't a ton of stuff out there, and you can see the difference between Kubernetes and Istio: Kubernetes has a lot of tools for things like evaluating log files easily or tapping network connectivity between containers and pods. Istio still lacks some of that maturity — there are still a lot of basic tools we have to lean on. It's not fully fleshed out yet. We're seeing more and more, but it's still relatively new compared to Kubernetes.

So what I wanna cover today is how to diagnose and fix some Istio configuration problems. There are three scenarios I wanna talk through, and I'll demo a couple of them for you. One: traffic just not routing correctly. Two: what do you do when telemetry data is missing — you're not seeing metrics or monitoring data coming out. And three: what happens when you run into mTLS issues, where services can't talk to each other because they're trying to negotiate encrypted connectivity but something's not working correctly.
And then finally I wanna close with a good list of tools for your toolbox when you're trying to debug these deployments.

So, a quick recap of the architecture of Istio. On the top you see a typical couple of pods — two services, service A and service B. Each of those pods has the sidecar proxy, which is Envoy; I'll use Envoy and proxy kind of interchangeably, but the proxy is there mediating the traffic inbound and outbound. That proxy layer is called the data plane in Istio, and it handles a lot of the actual heavy lifting. The control plane components at the bottom are what take in the Istio APIs, turn them into Envoy configuration, and push it out. Pilot handles getting all the rules and configuration data out to those proxies. Mixer handles telemetry and any custom policy you might want — for example, all the telemetry data, the monitoring, tracing, and logging data, comes out of those proxies, goes through Mixer, and Mixer sends it to whatever backend you want: Datadog, Prometheus, et cetera. Citadel is basically there to handle keys — its job is to generate identities and make sure those identities get out to those services and pods. Oh, I'm sorry, I just skipped that, thanks — Galley is there to validate configuration. When you push configuration, Galley validates it before it goes out, but it's only validating for API conformance. It's not validating it logically within your cluster, which is why it's easy to end up with conflicting configuration that can be problematic.

So let's talk about debugging traffic routing. Real quick: Pilot's job is to observe the cluster topology, take Istio API resources — any traffic rules you want to apply, circuit breaking, rate limiting, or just traffic splitting — in the form of API objects, turn them into Envoy config, and push that out to all the proxies. Within the Istio traffic API there are really four main objects you'll typically work with. For north-south traffic — inbound and outbound to the mesh — you've got Gateway and ServiceEntry objects, respectively. For east-west traffic you've got VirtualService and DestinationRule, again for inbound and outbound.

The demo app I've got is just a really simple frontend with two backends: version one and version two. In this case version one is called single and version two is called multiple. It just gives me weather info for a few cities I wanted to know the weather about. Here's a screenshot of it, and I'll show it to you all in a second. The problem we're having — and this is the crux of the situation we'll walk through — is that what we want is on the left: a 90-10 split between single and multiple, or V1 and V2. But we're still seeing a 50-50 split. So how do we debug that? That is what we will do right now. You can see this deployment is up and running here: version one has data for a single city, in this case Austin, which is where I'm based, and version two — see if it shows up — has data for multiple cities. And as I look at this, refreshing a bunch here, I'm seeing what feels like 50-50. It's definitely not 90-10.
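For reference, the split we think we applied looks roughly like this — a reconstruction, since the exact demo manifests aren't shown here, and the subset labels (version: single/multiple) are assumptions:

```bash
# Approximation of the demo's traffic split: 90% of traffic to the
# "single" subset, 10% to "multiple". Subsets map to pod labels.
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: weather-backend
spec:
  host: weather-backend
  subsets:
  - name: single
    labels:
      version: single      # assumed label
  - name: multiple
    labels:
      version: multiple    # assumed label
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: weather-backend
spec:
  hosts:
  - weather-backend
  http:
  - route:
    - destination:
        host: weather-backend
        subset: single
      weight: 90
    - destination:
        host: weather-backend
        subset: multiple
      weight: 10
EOF
```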
So let's dig into it and see what's going on. Can everybody see that text okay? Yeah, okay. First I'll show you what's running in this cluster. I've just got a handful of pods here — I'll talk through what some of the other ones are in a minute — but if we take a quick look, everything appears to be configured correctly. I can look at the VirtualService objects; in this case it's the virtual service weather-backend. If we look at its configuration, down here we can see that we've got weight 90 and weight 10. Those are the two subsets — again, the single-city version and the multiple-city version. So I should be seeing the 90-10 split, because I was able to push all this stuff in, but we're still not seeing it.

One of the ways we can confirm that is with a tool one of my colleagues wrote called recipe. It measures the output of two different endpoints and aggregates the counts of the responses. First we'll get the recipe pod information, and then we'll execute the recipe command within the pod it's running in. What this command is doing now is issuing a thousand requests against the URL I gave it, and because of the way the transparent routing should happen, it should be balancing across single and multiple at about 90-10. And what we're seeing is that, yes, it's coming out at 50-50. Over a thousand requests, it's seeing the normal load balancing you'd expect from Kubernetes; it's not seeing the Istio rules I've implemented.

Now, when you have problems like this — and this could be true of circuit breaking or rate limiting too — there's kind of a tried-and-true path you end up going down. Usually you start with log files. Let's look at the logs for traffic going from the weather frontend to the weather backend and start from there. First we'll grab the frontend pod — I'm pretty sure I did this earlier, yep. We can take a look at the logs for the frontend, and you'll notice I have to specify the container, because that pod has two containers in it: my app container plus the Istio proxy. We wanna look at the Istio proxy's logs.

All right, that's quite a bit of stuff, and it's not great to read, so let me just quickly grep for traffic going to weather-backend. That'll be a little easier. You can see all these calls to weather-backend, and what's interesting is that it's successfully making them. It's making these outbound connections, it's doing it on the right port, which is port 5000, and it's got the right hostname in it — but there's a little field missing. And that turns out to be the subset name field: it should be saying whether the request is going to single or multiple. So what's happening is we're falling back to Kubernetes' round-robin load balancing. The proxy isn't directing the traffic to any particular subset of my service; it's just saying, hey, Kubernetes network infrastructure, I need to get to this service, route it appropriately — and that's why we get 50-50. So now we know there's a problem there, but we want to dig a little bit further and confirm it.
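Roughly, the log-digging steps so far look like this — pod names are placeholders, and the app label is an assumption:

```bash
# Find the frontend pod (label is an assumption from the demo's naming):
kubectl get pods -l app=weather-frontend

# Each pod runs two containers, so -c picks the sidecar's access logs:
kubectl logs weather-frontend-xxxxx -c istio-proxy

# Narrow the firehose down to calls headed for the backend:
kubectl logs weather-frontend-xxxxx -c istio-proxy | grep weather-backend
```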
So the next thing we'll do: Istio includes this great tool called istioctl that lets you push objects up to the cluster — specifically all of its API objects — but it also has a bunch of debugging utilities you can use to help assess the health of a cluster, or the health of your service mesh. The ones we're going to focus on are proxy-config and proxy-status; let's start there. We want to understand: I wrote this rule, it says 90-10 split, but it's not getting applied, or it's not showing up everywhere. Why is that? Maybe something's going on there.

So let's start with istioctl proxy-status. We've got a couple of pods that are slow to sync, but that's not usually a problem. Those columns — CDS, LDS, EDS, RDS — are the cluster discovery service, listener discovery service, endpoint discovery service, and route discovery service. So we know some of the information is getting out there, and the stuff I care about is the routes, so that data should be out there at this point. But we can dig a little further and say: tell me about the frontend specifically — tell me about that pod. How does it look compared to the configuration it's been given? And we'll see that, based on what the control plane knows and what this proxy knows, everything matches up. The clusters it's supposed to talk to match up, the listeners match up, and so do the routes. It'll even tell you the last time the routes were loaded. So as far as we can tell, we pushed the objects up successfully, but they still haven't made their way across the cluster for some reason.

And that's where the next command comes in, which is a little more interesting: istioctl proxy-config. proxy-config has a few subcommands where you can see things like the cluster configuration — what does this Envoy know about the rest of the topology of the cluster — or the endpoints, or the listeners, or the routes. So let's look at a couple of these and see if we can figure out what's happening. We'll start with istioctl proxy-config clusters on the frontend pod. All right, these are all the clusters — the infrastructure that this Envoy knows about. Right here we can see it knows there's a service at weather-backend, but it also knows there's a multiple and a single. So it knows the subsets exist; this Envoy proxy is aware of the subsets and the rules it should be following. We can do the same thing with endpoints and see what it knows about all the endpoints in the cluster — you can see them right there. Again, it knows there's a version with no subset listed, which is the generic one, and it also knows there's a single and a multiple. So this Envoy seems to know where it should be sending this traffic, or at least knows those things exist, but it's still not sending the traffic there. So we've got to dig even further down this rabbit hole. The last one we'll do — let me just get to the top — is routes. We're gonna look at the outbound routes from the frontend.
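The istioctl invocations in this section look roughly like this — the pod name is a placeholder:

```bash
POD=weather-frontend-xxxxx   # placeholder pod name

# Compare what Pilot has pushed vs. what each sidecar has acknowledged:
istioctl proxy-status

# Diff one pod's Envoy config against what Pilot thinks it should have:
istioctl proxy-status $POD.default

# What this sidecar actually knows about the mesh topology:
istioctl proxy-config clusters $POD
istioctl proxy-config endpoints $POD
istioctl proxy-config routes $POD
```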
All right, so now we can see all the routes, organized by port number — that's how the routes command presents them. We know weather-backend is hanging off of port 5000, so we can filter this down by 5000. And it says there are two virtual hosts. Okay. We can take a look at this as JSON, because it's a little more involved — and then this happens, right? You get a lot of stuff to weed through. As you can imagine, this is difficult to read, so I'm gonna filter it down with jq, a good tool for filtering JSON and making it more readable.

All right, this is a little better. So there are two virtual hosts it knows about, and looking at the list: it knows about weather-frontend, which is the default — itself; frontend knows about itself. And the last one is allow_any, which covers sending any other traffic. That's it. It only knows about two outbound routes: back to itself, and anything else. And "anything else" is basically the path traffic takes to go outside the mesh. So effectively, what we're finding out is that weather-frontend can communicate with weather-backend, but it's doing it outside of the mesh. It's sending that traffic and saying, hey, Kubernetes, handle this packet. And Kubernetes says, well, I know it's a service, I'll use DNS to figure it out, it's one of these two pods, and I'll just round-robin across them.

That leads us to the next question: why is my service not getting picked up by the mesh? I did my configuration correctly, I've technically got routes in there, but something is causing the service to not appear in this routing table, which means the mesh doesn't think that service is part of it. And that's where our problem probably lies. So let's take a look at that service again. Oops. Let's look at the weather-backend definition. Nothing fancy, right? This is all pretty run-of-the-mill stuff — except when you get to the name field under spec.ports. That's the problem. Istio requires that your service's port names follow a standard convention: you have to tell it the protocol first and foremost, and then optionally some identifier you wanna use. In this case the name I'm using is just "backend", and Istio doesn't know what to do with that, so it's just not including the service. In fact, you'll notice it actually added a protocol for me — protocol TCP — because it didn't know what to do with that traffic, so it treats it as unmarked traffic and doesn't even pick it up in the mesh. Because of that, my service is not being included in the whole group. And there's actually a good list of this right in the documentation: to be part of the mesh, the ports on your pods and services must satisfy this naming requirement — you have to name them based on the protocol they're gonna use, plus an optional suffix. In my case it's just HTTP, so I can name it http-backend and that should fix it, if we follow the example. So let's do that now.
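The jq filtering from a moment ago, plus roughly what the edited service looks like (the change we're about to apply) — the selector and port numbers here are assumptions based on the demo:

```bash
# Filter the route config for port 5000 down to just the virtual hosts:
istioctl proxy-config routes $POD --name 5000 -o json \
  | jq '.[].virtualHosts[] | {name: .name, domains: .domains}'

# The fix: rename the port to follow Istio's <protocol>[-<suffix>] convention.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: weather-backend
spec:
  selector:
    app: weather-backend      # assumed selector
  ports:
  - name: http-backend        # was "backend"; unnamed protocol = plain TCP
    port: 5000
    targetPort: 5000
EOF
```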
And I cheated and left this in here from earlier. All I've changed, if you look down here, is the name under spec.ports to http-backend — I fixed it to follow the convention they're asking for. I'm gonna apply that now, and then let's take another look and see if it came out right. So we'll get the service, weather-backend. All right — that protocol field is still in there, the incorrect TCP one, but it turns out that's actually not the problem. It was the naming, right? I wasn't telling the mesh what to do with that port.

And now, if we retrace some of our steps from earlier, we'll see what's going on. Let me just clear that. If I rerun istioctl proxy-config routes, there's actually a lot more data here, and I'm gonna filter it down so it's easier to read — we'll filter by the array, then virtualHosts, then name and domains, to get the name of each entry and the domains it corresponds to. And if we look, it turns out weather-backend is now in there. That's a good start: we're already seeing weather-backend as one of the outbound routes. If we scroll through this lengthy output — which is annoying sometimes — you'll actually start to see that within the weather-backend entry there's a routes section, and now we can see the weighted components: it's weighting 90% of the traffic to single, and, buried down here, 10% to multiple. So we've got the right configuration, and everything's getting picked up correctly. Now let's double-check with recipe what the balance of those outputs is and see if we're getting the right thing. We run that command again and see what we get with a thousand requests. All right — close enough. Istio's routing is not necessarily exact every single time; it evens out over the lifetime of a number of requests. But you can see we're trending toward the 90-10 weighting, which is what we wanted.

And that's really it — a quick whirlwind, but you have to dig through: start with log files, check the control plane components, and then start looking at istioctl and saying, tell me what the proxy thinks it knows, and use that to figure out what the problem ultimately is. So that's a quick run-through of a simple traffic example. Let me switch back to the slides here. All right, so that was traffic. Let me see how I'm doing on time — okay so far.

Missing telemetry data is actually pretty straightforward, so we'll talk through this really quickly. Mixer is the component that handles the telemetry piece. All the network mediation that captures monitoring, tracing, and logging data — the proxy sends it all through Mixer, and Mixer communicates with the backend of your choice: like I said, Datadog, Prometheus, Stackdriver, and a whole bunch of other systems. The problem is that it can break in a number of different ways, because it's a pretty complex mechanism. One of the first things we'll see is that the standard Istio metrics might be showing up, but your custom metrics might not — they're not getting scraped into Prometheus.
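That's usually just a scrape-configuration problem. Here's a hedged sketch, assuming the stock annotation-driven Prometheus setup that ships alongside Istio — the deployment name, port, and path are all assumptions:

```bash
# Assumed deployment name, port, and path. The Istio-bundled Prometheus
# config scrapes pods that carry these standard annotations, so adding
# them pulls your app's custom metrics in next to the Istio ones.
kubectl patch deployment weather-backend-single --patch '
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
        prometheus.io/path: "/metrics"
'
```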
So one of the easiest ways to amend that is to update your annotations — you may have to change the annotations in your pod spec or your deployment spec to make sure you're telling Prometheus where to grab those metrics from. Then you can have your custom metrics alongside your Istio-centric metrics in the same dashboard, or at least in the same monitoring backend. So that's one.

The second one is a little more complicated: what happens when Mixer is not working correctly? There's a lengthy doc within the Istio docs about how to debug Mixer issues, and they can range across a number of things. Mixer might not be reporting correctly; there could be underlying configuration problems; you may have to look at Mixer logs; you may have to look at individual metric configurations and figure out whether a handler is broken or the computation for your rule isn't working correctly. So there's quite a bit here, and in a lot of cases it's genuinely difficult to unpack and fix on a case-by-case basis. So what we've done in a couple of scenarios where we've run into this is: let's just regenerate the entirety of the Mixer configuration we were using, from a metrics perspective, and use that as a diff against what's in the cluster. The steps I'd typically take are: grab the latest release, generate the Helm chart output with Mixer on, generate it again with Mixer off, and take a diff of the two — what's left over is all the Mixer configuration that should be present (there's a sketch of this flow below). In a lot of cases I've just reapplied that configuration, and what I end up with is a working Mixer component again, because some handler had been overwritten or deleted by mistake, or some metric instance had been configured incorrectly. Sometimes that is the only way out of some of these really weird, hairy Mixer problems.

So that was just a little bit on telemetry. It's a deeper topic, and frankly I didn't wanna spend too much time on it, because some of the way the telemetry stuff works today is actually going away soon. If you remember from that architecture diagram, everything goes through Mixer — which makes it kind of a single point of failure in some cases, a bit of a choke point. So the plan is to do away with Mixer entirely, take it out of the equation, and push the work Mixer does today straight into Envoy. The proxies, instead of sending metrics and telemetry to Mixer, will just send them straight to the backend you've configured. Now, that does put some of the onus on the backend to handle all that load, but those backends — Stackdriver, Datadog — were made for high concurrency and lots of requests, so I think that'll be okay. And the benefit for Istio is that there isn't this single spot that can cause a lot of problems. Because as Mixer slows down, as there are more and more services, you've gotta scale it up — but you don't really know that until metrics start getting slow, and at that point things are falling behind, and you don't want that to happen.
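Here's that regenerate-and-diff flow, sketched under the assumption of an Istio 1.2-era Helm-based install; the chart path and value names follow that release's layout:

```bash
# Render the chart twice, once with Mixer telemetry enabled and once
# without, then diff to isolate the Mixer-related configuration:
helm template install/kubernetes/helm/istio --name istio \
  --namespace istio-system --set mixer.telemetry.enabled=true  > with-mixer.yaml
helm template install/kubernetes/helm/istio --name istio \
  --namespace istio-system --set mixer.telemetry.enabled=false > without-mixer.yaml
diff without-mixer.yaml with-mixer.yaml

# Everything unique to with-mixer.yaml is telemetry config that should
# exist in the cluster; reapplying it restores clobbered handlers,
# instances, and rules.
```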
So Mixer's gonna change in the near term: starting in 1.3, it'll begin to get backed out, and they're moving to a WebAssembly approach, where that's gonna be the mechanism that reports telemetry data out of the proxy and directly into the backend you've configured.

All right, so for the last one, I wanna talk about mTLS. mTLS is how we do authentication and encryption in transit between services — it's how services know who they're talking to, on some level. There's the handshake, there's a naming check, and then the connection gets established between service A and service B. And again, A and B have no idea this is happening; this is all happening at the proxy level. These applications were not written with mTLS in mind, or any authentication in mind. They were simply written as a really simple frontend-backend microservice combination.

So in this case, what Citadel does is create the certificates and keys for the service accounts. Those get handed out by Kubernetes to the Envoy proxies, and then users — operators — decide how they wanna enforce them. Whether you turn Citadel on or not — you don't actually have to configure it; it's on by default in all the production profiles — it's gonna hand those keys out anyway. But it's up to you whether you wanna enforce that level of authentication and encryption. You, as the operator, decide what you wanna do there. If you wanna implement that policy, you push that rule up to the control plane, Pilot turns it into policy configuration, and then it tells all the proxies: you're now using authentication, you're using mTLS across the board.

So what we're gonna do is push up two objects, in a pretty similar configuration to the frontend-backend one I showed earlier: we're gonna turn on mTLS at the backend. Policy and DestinationRule cover the two halves here. The Policy is saying: hey, weather-backend will only accept mTLS connections. The DestinationRule is saying: to any clients of weather-backend, you should use mTLS. So it lets you control which half of the equation is responsible, but it's typically both the client side and the server side. Policy controls the server; DestinationRule tells clients what to do. What we wanna get to is 200s — we wanna make sure the application still works once we've implemented this Policy and this DestinationRule. What we will see, unfortunately, is that we end up with 503s.

So let's switch over to that now. I'm gonna switch to a different cluster that I had configured for this earlier. All right, we can see what's in this one. Nothing fancy — a lot of the same stuff: weather-backend single, weather-backend multiple, and weather-frontend. The only difference in configuration is that weather-frontend is configured to send 100% of its traffic to the backend, because I'm just working with those two. And we can see that traffic is working fine — sorry for the switch — if we go to this tab. Everything's looking fine; I'm refreshing a few times. Now let's implement those first two components we talked about. All right: it's created that Policy and it's created that DestinationRule, great. Now let's start refreshing — this usually takes a few seconds for Pilot to get the rules and push them out to the proxies, and then the proxies implement it. So I'm just refreshing... there you go. And now we're seeing 503s.
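For reference, the intended shape of that pair of objects is roughly this — a reconstruction, not the demo's exact files (and as we're about to find out, what actually got applied differs in an important way):

```bash
kubectl apply -f - <<EOF
# Server half: weather-backend only accepts mTLS connections.
apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: weather-backend
spec:
  targets:
  - name: weather-backend
  peers:
  - mtls: {}
---
# Client half: anyone calling weather-backend should originate mTLS
# using the Istio-provisioned certificates.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: weather-backend-mtls   # hypothetical name
spec:
  host: weather-backend
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
EOF
```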
Weather-frontend is calling out to weather-backend and just getting 503s back. We've broken the connection between them. So how do you diagnose this? What happens when you see 503s in your cluster? Well, we start down a similar path. The first thing we'll do is look at weather-frontend and see what the traffic looks like there. Again, we're gonna look at the Istio proxy logs, and I'll go ahead and grep for weather-backend, because there'll be a lot of data in there. Let's see — here's an example, at the very bottom: it's trying to GET /api/weather from weather-backend, but it's getting 503s. You can see that right there in the response code. So weather-frontend is doing the right thing, but it's getting a bad response; something else is going wrong.

So now we gotta dig a little further. Typically what you might wanna do next is check things like control plane health. Did something break in the control plane? Well, if we look, there's a bunch of Pilot replicas running, so everything looks good there. Citadel's running — whoops, up there — so that's good. So we know keys have been handed out, in theory, and something else is going wrong. Maybe the keys got handed out but the keys themselves are bad. So let's check the keys that the frontend has and figure out if that's the problem. We're gonna exec into the frontend, specifically the istio-proxy container, and I've got a little shortcut here: I'm asking for the certificate chain, and I'm asking OpenSSL to decode it and show me the validity field of the cert it's got. All right, it says it's valid not before August 19th, which is a couple days ago, and not after November 17th. So the cert is technically valid; we're in the date range where it works.

Now let's try the same thing on the backend and see if that key is broken in some other way. Same thing: I'll get the backend pod, run that kubectl command, and cheat by just editing it inline for the backend. Let's see... all right. Okay, well, unfortunately, same thing — the keys are valid on the backend as well. So we've got some consistency there, which is good, but that's still not telling us what the problem is. If we take a look at the logs for the backend, let's see what's going on there. Nothing, really — we're getting inbound calls just fine. Nothing here is telling us there's any problem.

So now we've gotta dig further still. One of the ways you can do that in these cases is to turn up the level of logging on the proxy. Right now I think the default is set to info, so we're gonna turn it up to trace — we wanna see the trace-level logs coming out of the proxy and see if that tells us more about what's going on. So we'll do a kubectl exec on one of the backend proxies — oops — and in there we'll run a curl command that changes the logging level of that pod's Istio proxy to trace. And the output tells us that every logging component it's got is now set to trace. So we've just dialed up the logging a whole bunch.
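Those two steps look roughly like this — pod names are placeholders, and the cert path assumes the 1.2-era setup where Citadel-issued certs are mounted at /etc/certs:

```bash
# Check the validity window of the cert the sidecar is holding.
# (openssl runs locally here, decoding the PEM that cat streams out.)
kubectl exec weather-frontend-xxxxx -c istio-proxy -- \
  cat /etc/certs/cert-chain.pem | openssl x509 -noout -dates

# Dial the sidecar's logging up to trace via Envoy's admin port (15000):
kubectl exec weather-backend-xxxxx -c istio-proxy -- \
  curl -s -X POST localhost:15000/logging?level=trace
```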
So now let's refresh this a few times and generate some traffic in there — just a couple of hard refreshes. And now let's take a look at the logs again. And oh man, there's a lot more stuff in there. Okay. Typically, when you see this kind of thing with 503s, there's usually a problem with the handshake — that's the component we're worried about. We wanna see if there's some kind of key negotiation problem between the keys from service A and the keys from service B. So if we grep for the word handshake, we might see what's going on. Sure enough, there are a handful of handshake errors. So now we know there's a problem with something around the configuration of the keys we've given it. The keys are valid, but something is still conflicting and keeping the handshake from happening.

And this is where we start leaning on istioctl again. istioctl has a TLS-checking component, so we can actually check TLS here and it should show us. So I run istioctl authn tls-check, and what we see is that for clients of weather-backend, there is a CONFLICT. There's a problem with our configuration; something's broken. That typically means it's a problem in our rules, not in the infrastructure — because, again, the keys are valid and the negotiation is happening, but there's a handshake error. And remember what I mentioned earlier: the Policy object controls the server side, and DestinationRules control the client side. Well, what this column is telling us is that the conflict is on the client side, which means the problem is in the destination rules.

So let's take a look at the destination rules. All right, I've got two. Let me look at both of them — I'll just open them both as YAML. Okay, so this is the first one, which I put in as part of my original deployment: it's got a traffic policy with TLS mode DISABLE. That's what I originally deployed it as. Well, here's the problem: the second destination rule I put in has a traffic policy that says MUTUAL. I have two destination rule objects pointing at the same service, weather-backend, and I'm telling it to use two different TLS modes — one says use mutual TLS, and one says disable mutual TLS. And that's where the conflict is. Now clients of that service don't know what to do: do I use mTLS or not? But because I've already told the backend it will only accept mTLS connections, anybody that tries to connect without it gets 503s. So I've created this conflict through a bad rule configuration.

So the way to fix it: first, I'm gonna delete the old configuration — I'll get rid of the policy and get rid of that incorrectly written destination rule, the one that was conflicting with the first one. And I'm gonna apply a different one. Let me show you what that looks like really quickly. The policy one looks the same. The destination rule looks different because, one, I fixed the name, so now it's updating the existing rule, not adding a new one. So instead of having a couple of destination rules with conflicting TLS modes, I'm just updating the current one — the one with the same name — and making sure it says ISTIO_MUTUAL. And that's our fix — applying it now.
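Before moving on, a quick recap of this section's checks in command form — the pod name is a placeholder, and tls-check is the 1.2-era istioctl subcommand:

```bash
# Compare server-side Policy vs. client-side DestinationRules per host;
# CONFLICT in the status column means the two halves disagree:
istioctl authn tls-check weather-frontend-xxxxx

# Inspect the destination rules in play: two rules targeting the same
# host with different TLS modes is exactly the conflict hit here.
kubectl get destinationrules -o yaml

# The fix reuses the original rule's name with mode ISTIO_MUTUAL, so it
# updates the rule in place instead of adding a conflicting second one.
```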
You'll see it didn't create a new object; it configured the existing object. So effectively, I changed TLS mode DISABLE to ISTIO_MUTUAL. And now if we go back to our weather deployment, everything's rosy and it's working again. So that was looking at mTLS. And I know I'm keeping you folks here late, so I appreciate it — I'll try to run through the rest of it.

I talked about this a little already: this was the problem. I had an original destination rule, I added a new one which caused a conflict because they had conflicting modes, and I fixed it by putting in a third one with the right configuration — I just updated the first one with ISTIO_MUTUAL, and that fixed my problem for me. Note that if these destination rules don't exist at all, I can't actually send traffic to the backend either, because it's expecting mTLS, and if its clients aren't told to send mTLS, they don't — they just send plain traffic. So if I just delete the destination rule entirely, the policy is still there saying the backend will only accept those connections.

All right, we got through everything we wanted to get through. So, what we quickly covered: how to determine if a virtual service is working correctly, how to use istioctl, parsing Envoy logs to some degree, diagnosing some of the basics around Mixer rules and metrics — but really we dug into mTLS and traffic, the two areas I really wanted to get to.

Some of the tools you should have in your toolbox: obviously istioctl, which is part of the Istio release. kubectl exec, to run commands inside pods. Stern is a great tool for looking at Kubernetes logs, because you don't need the exact pod name — you can just say weather-frontend and it figures out the rest from there. It inspects the pods and works out which ones you want to look at; if I had said weather-backend and used it with Stern, it actually would have streamed logs from both pods, single and multiple, so both versions. So Stern's a really great tool. jq I use for filtering JSON output — that's an easy one. Sleep is a pod that's part of the Istio samples — it's in the samples directory of the repo — and it's a great debug pod because it's got curl installed, which you're going to want often for checking HTTP requests. It's just a quick little pod spec, easy to deploy, and very convenient as an entry point into your cluster. And finally, don't forget the Istio docs. A lot of this is covered there, just spread out across multiple areas, and we're trying to get more of these debug scenarios into the docs so people have end-to-end walkthroughs.

And that's it. The repo I use for all this stuff is on GitHub in my account, under ServiceMush, or you can find me on Twitter if you have any questions. That is it — thank you all for sitting through this. I'm going to be around for a few minutes; if anybody's got questions, I'm happy to answer them now, but feel free to go if you want to grab snacks or get to the next session. Please.

[Audience question] Yeah, so there actually is — so the question was... yeah. Right, no, it's really hard. So the problem is that validating configuration for API conformance is easy, and there's now a client-side tool to do that: istioctl has that subcommand built in to validate your config. But it's not... yeah, you need more of a logical or semantic checker, and that is still TBD, right?
The right way to do that is... it's one of those things that's debated and hasn't really gotten to a good place yet. Now, I will say the user experience working group is taking a much closer look at this kind of stuff — figuring out how to say: here's the state of your cluster before you apply your change, and here's what we think it's gonna look like after. Does that match what you expect? If so, we'll go ahead and apply it. So we can try to give you some sense of what's gonna happen when you implement a change, but it's still not exact. A lot of this, at least right now, is still knowing where to look on some level. Yeah, it's not great, but I would say: if you have ideas on how to do that, please, please chime in. Those meetings — that governance — are wide open, and we want more people to bring these use cases in. The community meeting that happens every other week is not meant for the Istio team; it's meant for users and operators to come in, share their use cases, and talk through them.

Any other questions I can answer? Please. [Audience question] No, it's a good question. It is not part of the CNCF. Envoy is — and Envoy is a component of Istio — but the Istio project is not. Now, there has been some discussion and talk that it could end up there at some point. We really don't know; that's way above me at this point. But I will tell you that the project is run very much like a CNCF project would be. We modeled the governance that Istio uses very much on how the CNCF runs: there are individual working groups, there's an oversight committee, there's a steering committee, and these meetings are all very public and open. All the schedules are published on a regular basis, and anybody can join them. In fact, all the working group meetings are recorded and posted on YouTube — they're just Zoom video conference calls. So all that happens wide out in the open. This is not owned by Google. Google is probably one of the top five or six contributors, but it's Google, Red Hat, VMware, Pivotal I think, and IBM, obviously — so there's a bunch there — and then there's a long tail of other contributors. I did the stats for the 1.2 release: there were 76 contributors, and they were not all from those companies; a good portion were from outside. So it's run in a pretty open way, but to your point, it is not in the CNCF at this time.

[Audience question] Yeah, they are. I often like to think that if you put five giant companies in a room together to make a ham sandwich, they wouldn't be able to do it, because I don't think they could all agree on the same things. And I think that's pretty much what it comes down to: everybody's got slightly different goals, and that causes a little bit of tension. That tension can happen either inside the CNCF or outside of it — right now it's happening outside of it. But to your point, yeah, they're all members of the CNCF. And I'll note that the CNCF also has an existing service mesh in Linkerd, so I don't know, maybe that could be part of it too — but they don't pay me to speculate on that stuff. I just talk about the things I know. But yeah, that's it. Thank you all for coming.