Hey folks, hello and welcome to ServiceMeshCon EU. Today we're going to be talking about debugging an application with your service mesh. My name is Jason Morgan, this person here, also that person there. I'm a technical evangelist over at Buoyant, which means it's my job to talk to folks about the Linkerd project, encourage them to use it and evaluate it, and help folks as they move from development through to production with Linkerd. You can find me on Twitter at @RJasonMorgan, on GitHub at JasonMorgan, and on the Linkerd Slack at jmo. That's about the end of our slides for today; we're going to do everything as a live demo, or as close to a live demo as we can get considering the circumstances. So let's talk about what we're going to cover. I've got two applications running in my Kubernetes cluster that are having some issues, and we want to diagnose and do what we can to remedy those problems with our mesh. One application is a web front end backed by two gRPC services, and the other is a series of web applications talking to each other over REST. Emojivoto is the gRPC application. It gives us a web front end that displays a number of emojis, lets us vote on them, view the leaderboard, and see the current state of the voting. It seems to be working as designed, but we're having a problem with it: I'm getting reports from my users and I want to fix it. So let's go ahead and take a look at the service map for Emojivoto. What I have here is a vote-bot which generates some traffic. It talks to our web service, and then the web service makes a gRPC call over to voting or emoji. And to be clear, I'm not getting that gRPC detail from looking at this graph; I just knew it already because I work with the Emojivoto app a lot. So I want to debug it and get things going.
So the first thing I'm going to do is just talk to my Kubernetes cluster and make sure I can communicate and that things are running as expected. I can ask about the nodes and see that I've got my three nodes in my k3s cluster. This k3s cluster is provided by the good folks over at Civo, who are actually going to be announcing something here at KubeCon, so please stay tuned for that. Now that we've checked on our nodes, we're going to check on the state of our mesh. The Linkerd CLI bundles in this check command, which will check the health of the control plane and all of its components, and it will also look at any installed Linkerd extensions and run their health checks as well. We see that the results are green check marks for both, which sure seems good to me. So we're going to move on and actually troubleshoot our application. The first thing I want to do to troubleshoot my app is take a look at the namespaces in the cluster and see what Linkerd sees in terms of the golden metrics for those namespaces: golden metrics being success rate, volume of requests, and latency. We can see we've got a bunch of namespaces here. PodInfo is seeing a ton of requests but is responding great and has a 100% success rate. The linkerd-viz and Linkerd dashboard namespaces both have a 100% success rate as well. One nice thing with Linkerd: the control plane components are also part of the data plane, so we can use the same debugging techniques we're going to use today on our various applications to check on the health of our mesh, or to debug issues should we run into anything. Now, emojivoto and booksapp are both seeing a sub-100% success rate. So let's dive into emojivoto and see if we can't get this problem fixed. Now that we know we have two namespaces with issues, booksapp and emojivoto, let's look at the deployments in emojivoto and see what their relative health is.
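For anyone following along at home, the checks described above map to a handful of CLI commands. This is a sketch, assuming Linkerd 2.x with the viz extension installed and the demo apps deployed in namespaces named `emojivoto` and `booksapp` as in the stock manifests:

```shell
# Confirm the cluster is reachable and the three k3s nodes are healthy
kubectl get nodes

# Check the health of the Linkerd control plane and its components;
# also runs the health checks of any installed extensions (e.g. viz)
linkerd check

# Golden metrics (success rate, request volume, latency) per namespace
linkerd viz stat namespaces

# Drill into the deployments in the problem namespace
linkerd viz stat deploy -n emojivoto
```

The `stat` output is where the sub-100% success rates for emojivoto and booksapp show up.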
So I can see right away that vote-bot and emoji seem to be succeeding all the time with low latency. That's great. But web and voting are both reporting some problems, so let's dive in a little further. Now that we've got the statistics on the deployment objects overall, we can actually dive into each one of those deployments and see what live calls are going in and what they can tell us about the health of the application. Now, to be clear, I haven't instrumented anything inside these apps. These are normal gRPC apps; I don't have tracing enabled or anything in particular. This is what the Linkerd proxy is able to tell me about my application traffic natively, with no configuration. I can see what pod a given request is coming from, what pod it's going to, the method, the path that's being called, and the number of requests coming in. I can see right away that my web pod is talking to emoji, and those should all be successful. We can see the calls ListAll and FindByShortcode, and if we scroll over to the right, we see that those are both 100% successful. Great, in line with our expectations based on the statistics we've seen. Vote-bot talking to web has two calls, /api/list and /api/vote, so let's see how they're doing. /api/list is at 100%, but /api/vote is at around 85 to 90%, so it's not succeeding all the way. And then web is talking to voting a lot, hitting these individual vote URLs, so let's look at their success rate. It looks good for most things, but vote doughnut, the doughnut, which should be our most popular emoji, is actually seeing a 0% success rate. So we probably have our problem diagnosed already, and what we'd like to do now is just triangulate a little. Let's go check out the voting service and see what it reports.
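The live-call view being described comes from the viz top command. A sketch, assuming the deployment names from the stock Emojivoto manifests:

```shell
# Watch live request aggregates for the web deployment:
# source pod, destination pod, method, path, and success rate
linkerd viz top deploy/web -n emojivoto
```

Because top aggregates in real time, rarely-called paths (like the failing vote) only appear once the traffic generator or a user actually hits them.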
So I do a top on that voting deployment. Now I can see that all the calls to voting, like we saw on our traffic map, are actually coming from the web service. We see the paths they're taking inside the API, and we can look at the success rate. So far everything seems to be going great; we don't have any failures, but let's give it a minute and see what else pops up. All right, there we go. We have vote doughnut coming in, starting to receive some traffic, and we see that it's actually hitting a 0% success rate. So right here, let's go ahead and check on that. I can go over to my Emojivoto app, click on the doughnut, and I see that I've got a 404. So we have a real problem, and I have more than enough to package over to my development team so they know where to begin looking for it. Great. But let's continue to dive a little deeper and see if we can't grab some of that live traffic to this voting service and see what we get. We've got the linkerd viz tap command here, and tap is going to snoop in on the calls between the two proxies and gather a bunch of metadata around them so that we can see the current state of our environment. We run our command and we can see calls like this vote for running man: it indicates that it's an mTLS call, which is the default for Linkerd; when you install it, you get mTLS between all services. We have a status code of 200, which is great, and a gRPC status of OK. So let's look for doughnut. See if I typed that right; scroll around here. There we go, we've got a doughnut call here, with a path of /emojivoto.v1.VotingService/VoteDoughnut.
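The tap invocation being run here looks roughly like this; a sketch, again assuming the stock Emojivoto deployment names:

```shell
# Snoop on live proxy-to-proxy calls for the voting deployment; each
# request/response pair shows the TLS status, HTTP status, and, for
# gRPC traffic, the grpc-status trailer
linkerd viz tap deploy/voting -n emojivoto
```

Note that a broken gRPC call still shows HTTP 200, because gRPC carries its error in the `grpc-status` trailer rather than the HTTP status code; that's why tap surfacing the gRPC status matters here.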
And we can see that while we still have TLS true and a status code of 200, we also get a gRPC status of Unknown, which is actually a gRPC error. Okay, so the nice thing we got out of that last call is the particular path to look at for vote doughnut errors. So let's update the path we're tapping, from just the base URL to the full /emojivoto.v1.VotingService/VoteDoughnut path, and see what calls come in. This is going to populate as soon as our traffic generator votes for doughnut, or I can pop over to my web app and try voting for the doughnut again. Great, we see some of that traffic coming in: still TLS true, status code 200, and that gRPC status of Unknown. I could save this off and pass it to my developers, which would be handy for them, but we can actually get way more detailed information. That same tap command, I'm going to run it again, but I'm going to change the output format from the default terminal output to full JSON. We'll see the same basic thing but with a ton more detail, and we could save this off, bundle it with the message we send to our developers on the voting service, and let them start solving the problem. So here we see the output of one of these calls, now as JSON. I've got a bunch of metadata about the source, the destination, the request information, and all the headers involved in this call. Great. If I look down a little further, I can look at the headers for another request and see that we've got our gRPC status and our gRPC error message right there. Cool. So I save that, send it to my devs, and I feel like I've discharged my duty to the team in terms of debugging Emojivoto; now it's on the developers to actually get it fixed.
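The filtered and JSON variants of tap described above would look something like this (a sketch; `--path` and `-o json` are standard tap flags):

```shell
# Only show requests whose path matches the failing gRPC method
linkerd viz tap deploy/voting -n emojivoto \
  --path /emojivoto.v1.VotingService/VoteDoughnut

# Same tap, but emit full request/response metadata as JSON,
# including headers, trailers, and the grpc-message error text
linkerd viz tap deploy/voting -n emojivoto \
  --path /emojivoto.v1.VotingService/VoteDoughnut -o json
```

The JSON output is what you'd save off and attach to the ticket for the voting-service developers.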
So that's been great, but this troubleshooting process actually took longer than it needed to, because I was looking around and waiting for things to come up. Y'all saw that when I went to the voting service, I didn't see anything about vote doughnut until a fair bit of time in, so I had to do that kind of aggregation myself. One of the things that's really nice about Linkerd is that you install it and you don't have to build or configure a bunch of custom resource definitions in order to make the mesh work; you get all the value with a very simple install and an inject on your applications. But one of the two custom resource definitions that Linkerd does use is the service profile. So let me just create this and we'll talk about what we did. I used the Linkerd CLI to look at the proto file for my emoji service. Emoji is a gRPC service, it uses these protobufs, and we can actually look at them and see what the valid calls for this service are. We get this ServiceProfile object, which will allow Linkerd to do some more intelligent things with the traffic for this application, like collecting and maintaining data about the given routes right there inside the service, so that we can more quickly debug issues like this when they come up. So I'm going to go ahead and create and apply a service profile for the emoji service, and do the same thing for the voting service, again just using that proto file. Once those are done I'm about ready, but there's a third service that I care about here, which is the web front end. Now the web front end is just a REST service, so I don't have a gRPC or proto file to work with, and I also don't have a swagger file.
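Generating ServiceProfiles from the protobuf definitions looks roughly like this; a sketch, assuming the proto files and service names (`Emoji.proto`, `Voting.proto`, `emoji-svc`, `voting-svc`) from the stock Emojivoto repo:

```shell
# Generate a ServiceProfile from the gRPC proto definition and apply it;
# each rpc in the proto becomes a named route in the profile
linkerd profile --proto Emoji.proto emoji-svc -n emojivoto \
  | kubectl apply -f -

# Same thing for the voting service
linkerd profile --proto Voting.proto voting-svc -n emojivoto \
  | kubectl apply -f -
```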
So I need to actually take a look: I can either write my own service profile based on what I know of the application, or I can use Linkerd's tap functionality to watch the traffic that's coming to this web service and decide how to build a service profile based on what we see. So for the next 10 seconds, we're going to watch the web service, see what comes in, and auto-generate a service profile. And now we have our new web service profile created. Great. I'm going to give myself a reminder to hop over to the dashboard, and we're going to take a look at how we could debug this even faster if we were using service profiles. So again, looking at the emojivoto namespace, I click on web and get my route metrics. I can see that as requests come in, /api/list is staying at 100%, and then there's /api/vote. Now again, it's only been a couple of seconds, but over time we're going to see some of those failed transactions come in, and this /api/vote route is going to start to degrade. If I left this running for hours, it would be a lot more obvious what the failure rate actually is. So again, it gives us an indication that we've got a problem with voting. So we look at our voting service, and instead of waiting for a vote doughnut call to come in (which it did right away this time, but you saw it doesn't necessarily do that), we can look at our route metrics filtered on success rate. Now imagine the situation: I've been paged because there's an issue, and I come in some amount of time later. I can immediately look at the paths inside my API, see what is either responding slowly or starting to see errors, and then immediately direct the right ticket to the right team so that we can get this thing fixed. So that is Emojivoto.
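The tap-based profile generation and the per-route metrics view would look something like this; a sketch, assuming the `web-svc` service name from the stock manifests:

```shell
# Watch live traffic to the web service for 10 seconds and
# auto-generate a ServiceProfile from the routes that were observed
linkerd viz profile web-svc -n emojivoto \
  --tap deploy/web --tap-duration 10s | kubectl apply -f -

# With profiles in place, golden metrics are aggregated per route
# instead of per deployment, so failing paths stand out immediately
linkerd viz routes deploy/web -n emojivoto
```

The key difference from the earlier top/tap workflow: routes metrics accumulate over time, so you don't have to sit and wait for a rare call to happen while you're watching.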
So that's as far as we're going to go, because we have some fundamental problem with the application itself. But now let's change scenarios a little bit and talk about not a gRPC application but an all-REST application: our Buoyant booksapp. Let's take a look at that inside the dashboard real quick, with a little refresh. Books is a bit more of a complicated service. We have a traffic generator, just like we did with Emojivoto, and three services, but webapp is talking to both authors and books, and books and authors also talk to each other, so there's some additional dependency there. And my failures with books are more intermittent, so it's a little bit harder to get a sense of where my problem really is. Now, luckily, in this case I've had the profiles set up this entire time. Because these are all REST APIs with swagger files, we're going to use those profiles to get a sense of where our problem is, and we're actually going to do a little bit to resolve it before our devs have to get involved. We start by looking at the routes for the webapp service, because that's our entry point to this app. We can see a bunch of the calls are seeing a 100% success rate, but POST to /books and PUT to an individual book ID for editing are problematic. Okay, so let's see if we can't get this a bit more triangulated. I'm going to see how webapp is doing in its conversations with both authors and books. When I look from webapp over to authors, it looks like every single call from the webapp to authors is succeeding 100% of the time, so we're pretty good on that route. Great. But we still have problems with webapp.
So clearly it's a problem with webapp talking to books. We look from webapp to books and, yep, there it is: POST /books.json and, sorry, PUT on a particular book ID, is failing somewhere around 50% of the time. So we're already getting a lot closer to our root cause, and our goal here is to drive down that mean time to detection so that we can get this resolved as quickly as possible. Now that we've seen this, we also have that dependency between books and authors, so let's just go in and check what books is talking to authors about and how those requests are doing. It turns out all books is doing is a HEAD request on authors by a particular ID, and we're seeing that at right about 50% successful. So we clearly have an issue. And that was very quick to diagnose with the routes in place, but you'll see it's a little bit harder when we look at the author service directly. If we look at the live calls for the author service, we can see that sometimes we're seeing failures, but not in a way that's clearly aggregated: I've got a particular author ID that's seeing some failures and some calls that are succeeding. But when I view it from a route perspective, where Linkerd has been aggregating the traffic based on the API calls defined in the swagger doc, I see that the HEAD request to any author ID is failing about 50% of the time. So I've got the problem identified, and that's good. I can go ahead and alert my authors folks that this HEAD request is failing half the time, and they need to look at the code that's responsible for that response.
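The route views used for this triangulation rely on the `--to` flag, which scopes the metrics to one hop of the dependency chain. A sketch, assuming the deployment and service names from the stock booksapp manifests:

```shell
# Per-route metrics for traffic from webapp to the books service;
# this is where POST /books.json and PUT /books/{id}.json show ~50%
linkerd viz routes deploy/webapp -n booksapp --to svc/books

# One hop further: traffic from books to authors, where the
# HEAD /authors/{id}.json route is failing about half the time
linkerd viz routes deploy/books -n booksapp --to svc/authors
```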
But because it's failing about half the time and succeeding about half the time, I can actually use my mesh to solve some of this problem. We're going to look at that service profile for the author service, and we're going to change it so that we fix the problem for tonight, and the team can take time to deal with it in the morning. Our service profile has the various routes for this service, and we're going to go into that HEAD request we saw before: HEAD /authors/{id}.json is failing. We know that it's safe to retry this call, so we're going to go in here and add a field that says isRetryable: true. I'm going to hope that I typed that correctly. So we're telling the mesh: hey proxy, when you see these calls, go ahead and just retry them. No app logic has changed, no changes to the code have been made; instead, I'm letting the mesh take responsibility for trying to solve some of this problem. Now we look at the routes from books to authors, and we can see that while the actual success rate is staying right around 50%, the effective success rate is going to steadily climb, all the way up to 100%, because through those retries each call is going to succeed eventually. We're also going to see our latency go up, but as long as it stays within reason, it's overnight; we'll page the team first thing in the morning and they can respond in a timely fashion. So we'll break out of this, and now I can look at the routes from the webapp to books, where we originally saw the issue, and see what this looks like. Our success rate has gone up to 100%, and we're feeling good about that result. So let's pop back into the slides. And that's the end of my talk.
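The edit being described lands in the ServiceProfile for the authors service. This is a sketch of what the relevant route looks like with the retry flag added, assuming the `authors.booksapp.svc.cluster.local` service name from the stock booksapp manifests:

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: authors.booksapp.svc.cluster.local
  namespace: booksapp
spec:
  routes:
  - name: HEAD /authors/{id}.json
    condition:
      method: HEAD
      pathRegex: /authors/[^/]*\.json
    # Tell the proxies this call is safe to retry on failure;
    # no application code changes are involved
    isRetryable: true
```

With this applied, `linkerd viz routes deploy/books -n booksapp --to svc/authors -o wide` shows both columns at once: the actual success rate holding near 50% while the effective success rate, the one clients experience after retries, climbs toward 100%.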
Thank you so much for staying to listen. Like I said earlier, you can find me on Twitter at @RJasonMorgan; if for some reason you want to see my GitHub contributions, you can find me there; and I'd love to hear from you over on the Linkerd Slack at jmo. Thanks so much and have a great day. Goodbye.