Hi there, my name is Flynn. I'm a tech evangelist for Buoyant, the makers of Linkerd. If you're not already familiar with Linkerd, it's the only CNCF-graduated service mesh. Linkerd's purpose in life is to arrange things so that every cloud native developer on the planet has access to the tools they need to build secure, reliable, easily observable cloud native applications, and for those tools to be freely available. My role as a tech evangelist is to make sure people know that, and also to make sure they have the knowledge and resources they need to really succeed at the whole cloud native thing.

To that end, today I'll be talking about observability using Linkerd. This is the classic problem in cloud native: it's very, very hard to see what's going on inside the cluster even when things are going well, and when things start going badly, it's even harder. Service meshes are well positioned to help with that, and for Linkerd there are two things in particular that make it really good at it: the Linkerd Viz extension and the ServiceProfile CRD.

Linkerd Viz is a tool that gives you easy visual access to a bunch of things within your cluster. Here we can see the topology of our application, along with a bunch of statistics: success rate, requests per second, and more. We'll look at some of that as we go through this presentation.

The ServiceProfile, on the other hand, is a CRD that defines how Linkerd should watch your application, and it also gives you ways in which Linkerd can help manage your application (we'll see a minimal sketch of one in a moment). For example, if a DELETE request comes in matching a given URL pattern, it gets bundled as a statistic into a bucket with a much more human-readable name, and all of the deletion requests, no matter what ID they carry, go into the same bucket. That makes it really easy for a human to figure out: OK, are deletions working, or are they broken? The really killer bit is that as soon as the ServiceProfile is created, Linkerd will watch it and aggregate per-route statistics on its own, without you having to do anything special, and you'll be able to look back in time at those statistics. It's a really, really wonderful tool for troubleshooting. I also said "manage": for example, you can use a ServiceProfile to configure retries automatically. We're going to see some of that in the rest of the demo.

And that's pretty much it for the slides; the rest of this presentation is a live demo. So let's get to it, shall we?

OK, so I have here a running Kubernetes cluster. I'm doing all this on a k3d cluster running on my laptop, just because I like having control over everything. The first thing you'll see is that we've got the booksapp demo and the emojivoto demo, both running in the cluster. This is the booksapp demo; you may have seen it before. You can click around and look at books and authors, and it's pretty simple. This is the emojivoto application, where you can vote for emoji and then view the leaderboard, and that's all there is to it. I should point out that there is a traffic generator in here; I am not just clicking endlessly on emoji all day. So, let's go back here.
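Since the slide itself isn't reproduced here, here's a minimal sketch of what a ServiceProfile like that might look like. The service name, namespace, and route are purely illustrative, not taken from the demo:

```yaml
# Illustrative ServiceProfile: every DELETE against /books/<id> is
# aggregated under one human-readable route name, whatever the ID.
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfiles are named for the service's fully qualified DNS
  # name; "books" and "booksapp" here are placeholders.
  name: books.booksapp.svc.cluster.local
  namespace: booksapp
spec:
  routes:
  - name: DELETE /books/{id}
    condition:
      method: DELETE
      pathRegex: /books/[^/]*
```

As soon as a profile like this exists, Linkerd starts recording per-route metrics under the "DELETE /books/{id}" name.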
We also have Linkerd running, and I've got an ingress running in the cluster as well; I just like to be able to use domain names instead of doing everything through port forwards and having to remember the ports. All of this I set up using the quick starts available in the Linkerd documentation. I'll have the link at the end of this presentation where you can see exactly how I set everything up; it's pretty standard and pretty easy to get going. The very important thing here is that we have run linkerd check, so we can see that Linkerd itself is running cleanly. We're good to go.

All right, now let's suppose it's Friday night and somebody calls up and says something is wrong with the emojivoto application. They don't know what's wrong; they just know it's not behaving. What should we do about this? Well, the first thing we can do is look over all the namespaces in the cluster using Linkerd Viz from the command line, and we can see immediately that, yeah, there's some challenging stuff here: the emojivoto application is not showing 100% success. Neither is booksapp, but we'll have to come back to that.

Given that we can see something is wrong in this namespace, let's drill in a little. Here we're going to look just at deployments in the emojivoto namespace, because that's where we already know there's a problem. If we do that, again we can immediately see that the web deployment and the voting deployment have some kind of unhappiness going on, so those seem like pretty natural places to look. We'll start with the web deployment, using linkerd viz top. That gives us a real-time rundown of the most common requests going to or from the web deployment. If we do that, we can see a bunch of things happening, but it's most interesting, I think, to look at the success rate column, which is all 100%. So far this is working out OK. That's... wait a minute. That's not a good sign. OK, so that tells us that at least this time, voting for a doughnut did not work, and come to think of it, it looks like it might not be working at all. That's a spot where the web deployment is talking to the voting deployment, so maybe we should go take a look at that.

Let's look at the voting deployment with linkerd viz top. Let's see... nope, that's not working. Yeah, it looks like we have a problem with voting for doughnuts. That's too bad; doughnuts are usually pretty popular. At this point, we could hand this off to the developers and say, hey, it looks like there's a problem voting for doughnuts. But we can probably do better than that.
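For reference, the command-line steps so far look roughly like this (output elided; the namespace name follows the standard emojivoto quick start):

```console
# Make sure the mesh itself is healthy before debugging anything else.
$ linkerd check

# Success rate, RPS, and latency for every meshed namespace.
$ linkerd viz stat namespaces

# Drill into the namespace that showed trouble.
$ linkerd viz stat deployments -n emojivoto

# Real-time view of the most common requests to and from a deployment.
$ linkerd viz top deployment/web -n emojivoto
$ linkerd viz top deployment/voting -n emojivoto
```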
So another thing we can do, instead of running linkerd viz top, is run linkerd viz tap. Tap shows us traffic in real time, request by request: it gives us a running list of everything going on, which is a really nice way to get a quick look at actual, real-life traffic. Here I see a bunch of things that are working: I see a POST, it gets a 200, it has a gRPC status of OK. So far so good. I don't actually see any doughnut requests yet... oh wait, here's one, all the way down at the bottom. You'll notice that this one says gRPC status "unknown". Very important to note here: status "unknown" does not mean that Linkerd doesn't know what the status is. It's a gRPC error code that means something went wrong, and the gRPC layer can't tell us anything more about what happened. But we do know that something is wrong here.

So maybe we should drill into just the doughnut votes, and only those. If we do this (and we might have to wait a little for somebody to actually vote for a doughnut), we should be able to see whether voting for a doughnut always fails or only fails part of the time. Looks like it's always failing: all of these show gRPC status "unknown". That definitely gives us enough to go back to the developers: we can now say we know it's a gRPC error when you vote for a doughnut. Is there anything else we can do here? There's one more thing. If we tack -o json onto that same command, then instead of the nice one-line-per-request summary, tap breaks everything out into a big JSON block, and we get full information on the requests and the responses. There we go, there's one. We can see this is a request going from the web deployment to the voting deployment, and it is in fact a vote for a doughnut. If we scroll down, we can see the response, and scrolling a little further: look, gRPC status 2. If everything were working, that would be gRPC status 0. We don't get anything particularly useful in the error message, but at least we can now go back to the developers and say: when you try to vote for a doughnut, you get a gRPC status of 2. That's a problem.

Overall, that worked out pretty well. On the other hand, it took a little while, and the reason is that we needed to watch the traffic and wait for somebody to vote for a doughnut before we could see the problem, and then wait for them to do it again to see whether the problem persisted. There should be a better way, and as I mentioned earlier, with service profiles there is a better way.

The emojivoto application is a gRPC application. gRPC means protobuf, and protobuf means that rather than writing service profiles by hand, we can ask linkerd profile to read the protobuf definition and write a service profile for us. So let's do that for the emoji proto. There you go; there's not much to it. They're both POSTs, which is kind of interesting: you can list all the emoji, and you can find an emoji by its shortcode. That's about it. If we needed to modify this, we could, but it's certainly a great way to get started. So let's apply that to the cluster: same command, just generate from the proto and pipe it into kubectl apply. That works out pretty well. Then we can do the same thing for the other gRPC service that's part of emojivoto, the voting proto. I'm not going to bother showing that one; it's pretty much more of the same. We'll just apply it.
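Again for reference, the tap invocations and the profile generation look roughly like this; the doughnut path and the service and proto file names follow the standard emojivoto manifests:

```console
# Stream live requests from the web deployment to the voting deployment.
$ linkerd viz tap deployment/web -n emojivoto --to deployment/voting

# Narrow the stream to just the doughnut votes.
$ linkerd viz tap deployment/web -n emojivoto --to deployment/voting \
    --path /emojivoto.v1.VotingService/VoteDoughnut

# The same stream, with full request/response detail as JSON.
$ linkerd viz tap deployment/web -n emojivoto --to deployment/voting \
    --path /emojivoto.v1.VotingService/VoteDoughnut -o json

# Generate ServiceProfiles from the protobuf definitions and apply them.
$ linkerd profile -n emojivoto --proto Emoji.proto emoji-svc | kubectl apply -f -
$ linkerd profile -n emojivoto --proto Voting.proto voting-svc | kubectl apply -f -
```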
Now we'd really like per-route visibility for the web deployment too, and that's problematic, because the web app here is not gRPC; it's plain old REST. We could write the service profile by hand, or, if we had a Swagger (OpenAPI) definition, we could have linkerd profile read that and write a profile for us. In this case we have neither. So instead we're going to use another trick: we can have Linkerd watch the traffic going by for a little while — in this case, ten seconds — and generate the profile based on what it actually sees, which is kind of cool. Let's go ahead and do that. It's going to take a few seconds, of course, because I told it ten seconds. Now let's take a look at the profile it wrote, and yeah, you can see the list route and the vote routes; it's pretty straightforward. So let's apply that, and now we should be set up to debug this problem with emojivoto much more quickly than we otherwise could. Let's go check this out in the dashboard.
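The traffic-watching trick is a single command; web-svc here is the service name from the standard emojivoto manifests, and ten seconds of watching is plenty for this demo:

```console
# Build a ServiceProfile for the REST web service by watching ten
# seconds of live traffic, then apply the result.
$ linkerd viz profile -n emojivoto web-svc --tap deploy/web --tap-duration 10s \
    | kubectl apply -f -

# Open the Linkerd Viz dashboard in a browser.
$ linkerd viz dashboard
```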
Here we are at the dashboard. We have the booksapp namespace and the emojivoto namespace, and as we saw from the command line, things are not working flawlessly here. Let's go into the emojivoto namespace. You can see from the topology graph that we've got this vote-bot generating traffic and sending it to the web service, and the web service in turn is talking to the emoji service and the voting service. All of this lines up with what we saw from the command line, which is nice.

Let's look at the web deployment itself. You'll see a different graph here, of which deployments are talking to which, but the really neat thing we can do now is click on the route metrics tab, where we can immediately see: oh hey, here's the route that's not doing so well, GET /api/vote. If we go over to the voting deployment in turn, we can scroll down and check out its route metrics. Let's sort by success rate, and we can instantly see, without waiting for anything, that it's the doughnut causing us problems. If we want to drill into that, we can go back to the live calls and click on the microscope icon for any of these. Let's see if we get a doughnut... yep, there's a doughnut; it moved up because this is a top view. Clicking on it fills in a tap page for us — you can see it's filled in with the doughnut route — and when we click start, requests for doughnuts populate here as they come in. There's one. Click on this tab, and there you go: the JSON view from before. So we get to have it both ways. Obviously this is much faster than running everything from the command line, but it's important to note that it's working on exactly the same information: everything you can do here, you can do from the command line.

All right, at this point we would hand this over to our developers, tell them there's a problem with the doughnuts, and move on. And of course, that would be the moment that something comes up with the booksapp, because that's the way life goes, right? Now, the nice thing about the booksapp is that it already has service profiles, so we don't have to build them. That means we can go straight to the route metrics from the command line and see what's going on with the webapp service in the booksapp. We can immediately see that, oh right, there are two routes in here that seem to be failing about half the time. That looks problematic.

We can also do the same trick we did before, where we drill down and say: OK, show me the webapp deployment talking to the authors service. How is that going? You can see there's a bit more detail here — there are calls in this list that don't show up at the higher level — but in particular, for our debugging purposes, all of these are working, so there's probably nothing to worry about there. Let's check the webapp talking to the books service, and what do we see? A couple of routes failing about half the time. That's probably not good. Finally, the top-level view won't show us traffic between the books service and the authors service, though there may well be some, so let's take a look at that too. And what do we see? There's one call — just a HEAD call — and it's failing about half the time. Kind of interesting.

Now, we could of course do all of this from the GUI as well, so let's. Back up to namespaces, then duck into the booksapp namespace. There's our topology graph again, which we saw at the beginning of the presentation: the traffic generator talking to the webapp, the webapp talking to books and authors, and books and authors talking to each other. If we look at one of these — let's look at the webapp — once again we get the neat little graph, and once again we can go down to route metrics and immediately see that these routes are a problem. If we look at the authors deployment and its route metrics, again we can immediately see: OK, this HEAD route has some trouble.

Here's an interesting thing: HEAD requests are idempotent, and they have no body, which makes them a great candidate for retries — and service profiles, if you remember, are the place where we configure retries. The service profile docs cover configuring retries on your own services: for routes that are idempotent and don't have bodies, you can edit the service profile and add isRetryable to the route. That's the only thing we have to do to enable retries down in the mesh; we don't have to change any application code.

So let's try that out. We'll do it the really simple way, using kubectl edit on the authors service profile. All the way at the bottom, you can see the HEAD route we're talking about retrying, and we are literally just going to add isRetryable: true, save, and quit. That's updated.
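Concretely, the drill-downs and the retry change look roughly like this; the route YAML follows the service profiles shipped with the standard booksapp demo:

```console
# Per-route metrics for the webapp service as a whole.
$ linkerd viz routes -n booksapp svc/webapp

# Per-route view of each deployment-to-service edge.
$ linkerd viz routes -n booksapp deploy/webapp --to svc/authors
$ linkerd viz routes -n booksapp deploy/webapp --to svc/books
$ linkerd viz routes -n booksapp deploy/books --to svc/authors

# Edit the authors ServiceProfile in place.
$ kubectl edit sp/authors.booksapp.svc.cluster.local -n booksapp
```

The edited route, with the one added line:

```yaml
  - name: HEAD /authors/{id}.json
    condition:
      method: HEAD
      pathRegex: /authors/[^/]*\.json
    isRetryable: true   # the single line added in the demo
```

And to watch the retries take effect:

```console
# -o wide adds an EFFECTIVE_SUCCESS column next to ACTUAL_SUCCESS;
# once retries kick in, the two diverge.
$ linkerd viz routes -n booksapp deploy/books --to svc/authors -o wide
```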
Now let's watch and see what happens. If we tack -o wide onto that linkerd viz routes command, it will show us the effective success rate as well as the actual success rate, and if this worked, we should see the two start to diverge: the effective success rate should climb even though the actual success rate isn't doing much. And yeah, we can see that it is going up — here we see 61%, almost 62%. Let's give it another few seconds... 68%. That looks like it's headed in the right direction.

What we don't know yet is whether this is the only problem with our books application right now, so let's take a look from... what shall we do? Let's look from the webapp's point of view, I think. Yeah, webapp talking to books, that'll probably do it. And here we can see that we've hit 100% on everything at this point. We're only seeing the effective success rate here, but the effective success rate is the one we care about from the user's point of view.

So at this point, we've been able to use Linkerd to figure out what was failing and to put in a mitigation so that the end user is no longer affected by the problem. We still have work to do with the developers, obviously: we'll have to figure out exactly why the thing was failing, which we don't know right now, and get a fix in place to really solve the problem. But remember, I started by saying it was Friday night, and putting in this quick change at the service mesh layer means we don't have to bug the developers on a Friday night. Everything is working from the user's point of view, and we can come back and tackle this Monday morning.

And that's about it for this demo. You can find more information about Linkerd at linkerd.io, or you're always welcome to join our Slack at slack.linkerd.io — I hope you do. The source code for Linkerd is in the github.com/linkerd organization, and you can also find us at @linkerd on Twitter. If you're curious about how this particular demo was set up and run, look in the Service Mesh Academy repo; you'll find all the details there. And you can always reach me by email at flynn@buoyant.io, or as @flynn on the Linkerd Slack. Hope to hear from you. Thanks!