My name is Jay Shaughnessy. I'm a developer at Red Hat. I've been working on Kiali since it started about five years ago. And I'm Nick Fox. I'm also a developer at Red Hat working on Kiali. So welcome to Kiali: Beyond the Graph. Today, for new users or people that aren't familiar with Kiali, we're going to try to make sure that you leave here with an impression of what it might be able to do for you. And for existing users, hopefully we'll show you something that you didn't know about, and you can leave here and try that out. So what is Kiali, for the people that don't know? First of all, just a show of hands: who here has used Kiali at all? Well, that's great. That's awesome. Thank you. For the people that don't know, Kiali is a project that gives you a console for Istio; basically, it's a graphical user interface for Istio. It lets you visualize your traffic graphically while your mesh is in operation. It gives you all the observability pillars you might expect, using metrics, logs, and traces. And one thing that a lot of people don't know, or don't utilize very much, is that it's also a resource control mechanism for your config. It's got wizards that allow you to create valid config, it will let you update config, and for config that you've already got, it will do validation that can help you avoid mistakes or fix them quickly. So, as Jay mentioned, the goal for this talk is really to show you how you can go beyond just staring at your traffic graph. And don't get me wrong, there's a lot of value in being able to just visualize your mesh. But there's more that the tool can do that can actually save you time or make you more efficient when using Istio. So we've got three short demos for you. This should be a good introduction for new users that haven't used Kiali before, or, if you've been using Kiali for a while, hopefully you'll learn something new that you can take away with you.
So the demos that we're going to go through are: first, a service that's responding slowly in your mesh, and how you might go about troubleshooting that; second, exposing a service through ingress, so trying to get traffic from outside of your mesh to a service inside of your mesh; and third, applying an authorization policy and testing it first with dry run. And I'll just emphasize that the goal here isn't to show you how to do each one of these things. In fact, what works well for a demo probably isn't what you want to do in production. It's really for you to take away different ways that you can approach these problems with Kiali, and hopefully make your day-to-day work with Istio a little easier. So, demo time. All right, the rest of the talk really is just demo, so wish us luck. Yeah, demo. All right, so for those of you who haven't seen it before, this is basically what your traffic graph is going to look like. I'm going to actually shrink it just a little bit. What we have here is our travels demo that we use a lot in Kiali. We like a couple of things about it. One, it's not Bookinfo, so it's a little bit different. I'd also like to invite you to visit the whole tutorial series on kiali.io that's built around this. If you're just learning Istio, it's a great way to learn Kiali and Istio, to learn the features of both at the same time. So I would recommend kiali.io and the tutorial that we have there. So here's the setup. Briefly, there are just two main namespaces in play. On the left here you've got the travel portal, which has three regional portals for a travel reservation system, and then you've got your travel agency. The portals make requests of a central travel service, which is backed by three versions of the travels workload. And those, in turn, make requests of the different reservation services: cars, hotels, etc., right?
What we can see, just looking at this, with all these green edges and some blue edges for TCP traffic, is that it looks like our mesh is working perfectly, right? We've got no availability problems; there are no red edges, and there are no orange edges for degradation and so forth. But that doesn't necessarily give us the whole story. So what I'm going to do up here, in what we call the graph find box, is see if I've got anything running slowly. A response time greater than 1000 milliseconds is very slow relative to the response times that we're usually seeing. And we can see that in the last couple of minutes we've got a couple of edges here that are running slow; in fact, four and a half seconds is a terrible response time, right? What can we do with Kiali to try and figure out what's wrong? Well, there are a lot of different ways that you can try to attack the problem, one of which is this: I'm going to select the travel service here. On the right side, any time you select a node on the graph, you're going to get information about it in the right-hand side panel. One of those tabs is called Traces, so this is a trace integration that you get with the graph. If we take a look down the right side, each of these is an individual trace. When you're looking at the graph as a whole, you're getting all the traffic aggregated together, but you're not seeing what the specific request paths are. If I just go to the top one, I can see it was an 11-millisecond request, and you now get a graph overlay, in this purple, showing you that particular trace on your overall graph. We can see that it went through the travel service, went to the version three workload, made a hotel reservation, and it was fast. So maybe that's not a problem. We can keep looking. If I can get this out of the way and scroll down, let's look for a slow one. I might have to refresh; let me just turn on some refreshing here. There's one.
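As an aside, the expression typed into the find box uses Kiali's find/hide syntax; matching edges by response time looks something like this (rt is response time in milliseconds; exact syntax may vary by Kiali version):

```
rt > 1000
```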
So here's a slow one at three seconds. We can see this one started over here in the French portal, came through the travel service, went through v3, and then made flights, hotel, and insurance reservations. Well, we're pretty sure hotels was okay. This makes us a little suspicious about flights and insurance. In general, we think this is helping us understand the problem, but it's not really doing everything we need. Let's dig in a little bit further and take a look at this workload, v3, in detail. So, Kiali gives us the ability to drill into detail on any workload, application, or service. It gives you a mini graph that's focused on the node of interest. And then, up top, it gives you a bunch of options: you can look at traffic, logs, metric charts, traces. Let's take a look at the outbound metrics, right, because we know response time is not great. So let's look for our request duration chart, and sure enough we can see that there's one thing here that seems a little out of whack. This green line on the chart is apparently looking at the flights service. So that may actually be our problem. If we were to take away the flights service from the chart, we can see that all the other reservation services actually seem to be behaving about the same, and they're all very fast; you know, 20 milliseconds is this line right here. Another thing that you can do in Kiali, in various places, is correlate information. So if you've got metric information giving you the chart, you've also got trace information: by clicking this option here, I can put traces for the same time period on top of my chart. It's a way to correlate things. I'm going to bring flights back, and we can see that there are some traces that actually mirror the slow response times, and a lot that are fast, which is kind of what we saw before: we were looking for a few things that were slow. You can click on one of those and drill into a trace scatter plot.
What I'm trying to emphasize here is that there are a lot of different ways you can approach the problem, right? We're trying to show you different things that Kiali can do to help you out. Here you've got a bunch of different traces on the chart. The different-sized dots reflect relative span counts, so the bigger dots have more spans and the smaller dots have fewer, and we can see up here that we've got some in the three-second range. And you can, of course, select one and start to dig into the detail. So I can see here this one had a three-second duration. You can see a lot of red. What Kiali does is help you identify spans or traces of interest by comparing a trace with the other traces in the population. When you have outlier traces, you'll see more red in this heat map kind of thing; when you have things that are very common, you'll see more green in the heat map. We're trying to help you identify traces of interest. Looking down at the span details, you can basically track things in time order. And as we go down, again we see everything's fast until we get to this one: three seconds, flights again. So we're pretty sure at this point flights is the problem, but we don't know why it's a problem. So let's jump over to Envoy. If you know a lot about the Envoy proxy, which I don't, you can use a lot of things here to investigate: you've got cluster information, listener information, route information. What I tend to do is just use this feature, which lets me look at the config, which, as John showed earlier this morning, can be thousands and thousands of lines. And I'm going to do something simple, which is the usual way I approach things: I'm just going to start searching for "flights" and see what I can find in the config, and see if it helps me explain my problem. I don't see anything too much there. Looking around, I can see retry information configured that looks like the default. And then I see this.
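For reference, the fault-injection rule surfaced here would look something like this in a VirtualService (resource and host names are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: flights
  namespace: travel-agency
spec:
  hosts:
  - flights
  http:
  - fault:
      delay:
        fixedDelay: 3s   # inject a fixed three-second delay
        percentage:
          value: 75      # on 75% of requests
    route:
    - destination:
        host: flights
```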
And this is definitely the problem. What I see is a fixed delay of three seconds being injected 75% of the time. That explains the problem. What I'd done here, right, is that I had fault injection configured using Istio, I forgot to get rid of it, and it slowed everything down, as it should. If I go to the graph, it actually makes a lot of sense if you look at where the delay is: it's actually at the source proxy. Right? When you have a fault injection delay, it actually happens at the source; those requests don't actually even launch, they just wait and then they go. So that kind of explains why we're seeing what we saw. I'm going to do one last thing before I hand it over to Nick, and that is get rid of that fault injection by just right-clicking. We can see that there was fault injection; I'm going to delete that and solve my problem. Eventually, when this thing refreshes, that will go away, but we're not going to wait for that. Nick? Right. So, our second demo: we're going to be exposing a service through ingress, and we're going to pick this voyages v1 service here in the travel portal namespace. What you don't see yet is my ingress gateway that's running in my istio-system namespace. So I'm just going to create a couple of objects to make this happen, and you can see what I'm creating. Here you've got a Gateway that's going to be referencing my ingress gateway. Then you've got a VirtualService that references my Gateway; it's got a route to my voyages service. And this is just a traffic generator that's going to be sending traffic through my ingress gateway, so I don't have to keep spamming curl requests while we're doing this. So that's basically it. We're going to go ahead and apply this and head back. Well, first let's just make a curl request to see if it works. So, if I can spell... and it doesn't work. So what now, right? Well, you could start poking at this in a number of different ways, but this is a demo about Kiali.
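For reference, the Gateway and VirtualService applied here look roughly like this (names, namespaces, and hosts are all illustrative):

```yaml
# Gateway, selecting the ingress gateway in istio-system
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: travel-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
# VirtualService routing ingress traffic to the voyages service
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: voyages
  namespace: travel-portal
spec:
  hosts:
  - "*"
  gateways:
  - travel-gateway/istio-system   # as applied in the demo; this reference turns out to be backwards
  http:
  - route:
    - destination:
        host: voyages
```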
So we're going to go back to the Kiali graph. You can see on here we've got ingress gateway traffic; it looks like it's hitting the gateway, but maybe it's not being routed correctly. You can click on this to get more information about the requests. That doesn't tell you a whole lot. Let's go over to Istio config. Right away you can see our Istio config has errors, so we'll click on our voyages VirtualService, which has the errors. There's a giant red bar there, and an error message, with an error code next to it, saying that our virtual service is pointing to a non-existent gateway. So that's a pretty big clue. But another way that you can get more information is these info icons. These info icons tell you that if you click on the adjacent field, you'll get more information about that field over here in your summary panel. So I'm just going to highlight the relevant text over here, and it says gateways in other namespaces may be referred to by namespace/name. And it looks like we've got that backwards. So we're just going to fix that reference and hit save. And now everything's green. We get a reference popping up to the service that we're referencing down there, and our gateway reference now looks good. You might now be thinking: is this whole demo just about getting a reference wrong? Yes, it is, because this kind of stuff can be a real pain to debug. And when you actually find out what it is, it isn't even satisfying to discover that, oh, this whole time I just had a reference wrong. So Kiali will validate your Istio objects for you, and it helps a lot for simple things like this. It won't always be this simple, obviously, but when it is, you want it to be obvious to you so that you can go in and fix the problem quickly. So let's try our curl request again. And we get back the 200 response. So, yay, it's working.
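To recap the fix: a Gateway in another namespace is referred to as <namespace>/<name>, so the corrected reference ends up roughly like this (resource names are illustrative):

```yaml
spec:
  gateways:
  - istio-system/travel-gateway   # <gateway namespace>/<gateway name>
```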
And now, if we go to our graph and zoom in a little bit: those were the requests that were failing earlier, but now we're seeing traffic flowing through our ingress gateway to our voyages service. So that's great. The next demo that we're going to show you is applying an authorization policy and testing it first with dry run. Dry run is an experimental feature, but it's pretty helpful whenever you want to test your authorization policies. What we want to do with this particular policy is make sure that traffic can continue coming in through our ingress gateway to our voyages service, but that nothing else can communicate with our voyages service. So we're going to go and create this all in the UI. We're going to go in here and create an authorization policy. Select the namespace that we want. We'll call it something like allow-from-ingress. Select our voyages app. And rather than type this out, I'm just going to use the service account for my ingress gateway as my principal identity; I have this saved in a text file. Add that to my list. All right, let's get a preview. So this all looks fine. We're going to add our dry-run annotation here. Again, spelling is hard. All right, so we've got our authorization policy created. Now we want to test it. So how do you do this in dry-run mode? There are a couple of different ways you can do that. What we're going to do is look at the logs for our proxy, and in those logs we'll see some messages that tell us whether or not our authorization policy matches and is being applied to the requests that we see. So we're going to head over to our voyages workload and go to our Logs tab. Here you can see, in white, my application logs; that's what my actual voyages service is logging. And the yellow are my proxy logs. So we're just going to take away our voyages logs. The yellow logs that have an info icon next to them are Envoy access logs.
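For reference, the dry-run policy assembled in the wizard corresponds to YAML roughly like this (the namespace, labels, and service-account name are illustrative):

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-from-ingress
  namespace: travel-portal
  annotations:
    istio.io/dry-run: "true"   # evaluate and log matches, but don't enforce yet
spec:
  selector:
    matchLabels:
      app: voyages
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
        - cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account
```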
So, because we know that structure, we can click on those and get some more information about them. It breaks things down for you by field, and you can get more information about a field by clicking on the individual fields themselves. So that can be helpful. But to see the logs that we actually want, we have to change the log level for our proxy. You can do that by going over here and setting your proxy log level to debug. Hit refresh. So now we're getting a bunch more logs; changing the log level through the UI like this just gives you all of your Envoy logs, which can be a lot. So we're going to filter for "rbac" and clean those up a little bit. And this is what we're looking for here: this tells us that our policy matched the request. So at this point we're pretty sure that our policy is working correctly. We could also test it from the inverse direction, right? You could, say, make a curl request from a different pod in your namespace to see if that request would get denied. I'm not going to do that for time's sake, but in that case you would look for a "shadow denied" log message. But this all looks good, so we're going to go back into our Istio config and remove our dry-run annotation, and now the policy will actually take effect. So, if everything went according to plan, you should still see traffic flowing through your mesh and nothing's broken. So I guess we did it right. So that was our last demo, but I think we have a little more to show you. One last nugget, one last little thing, in case you didn't know about this in Kiali. We probably should have called this "Kiali: The Graph and Beyond," because the graph is still a centerpiece of what we're doing. Usually it's where you kind of focus and start from.
And if there's something that you saw, or something that happened in your graph traffic that you found interesting and wanted to rewatch, instead of just watching it go by in real time, you can always click up here and hit replay. This is going to bring you to basically a recorder-type thing where you can literally go back and replay the graph over any span of time, anything you've got in your Prometheus store. So if you want to look at something from last week for five minutes, and compare it to the same five minutes this week or something, you could do that. For our demo here today, we could easily just jump to the last 30 minutes and start replaying some of the stuff that you saw; it's probably only been about 20 minutes, I hope. You can see at this point, by just scrolling back, we've still got the slow four-second responses from the first demo. But if we slide forward a little bit, or a lot, those should have gone away when I deleted that fault injection. There they go, right? So we just wanted to throw this in, because some people have actually never seen that button and have never clicked it before. But it's something you can do. And if you capture the URL, it's totally bookmarkable, so you can replay it, you can share it with a colleague. You can say, hey, look what happened, what's going on? And they can log in, replay it, and you can all figure it out together using all these debugging techniques. All right, so that's what we've got. Thank you. You can, of course, ask us questions; we'll be around. Join us on Istio Slack in the Kiali room, or on GitHub, Twitter, all those things. If you want to leave feedback, that's the QR code for the talk; we'd love to see that. And then there's the annual user survey, if you haven't done that yet for Istio. Thanks very much. Anybody have questions, or you can always catch us later? Yeah? Is that mic live over there?
Question mic? I'll tell you what, you can just tell me. So, the new ambient changes? Yes. How does Kiali work with it? Because I saw there's one more done. Yeah, great question. So the question is: what about ambient, and what's Kiali doing with ambient? What I can say about ambient is that I usually go to the weekly ambient meeting, and we are kind of tracking it as it goes, right? So one thing that you get with ambient right now: if you're running ztunnel, ztunnel will generate the TCP metrics today. So you can actually get a traffic graph; it will show you your mTLS, TCP-level, layer-4 graph. You'll see all blue edges in Kiali; you won't see the green edges right now, because those are going to be your HTTP metrics, and those are going to come from the waypoint. The waypoint metrics are still in flux a little bit; it hasn't totally been figured out, but it will be there, don't worry. And also, with ambient, this idea of whether you have a service inside or outside of the mesh is a little different, right? We have something in Kiali today that will say, oh, you're missing your sidecar proxy; it shows up on the graph, and it shows up in a variety of places. That doesn't quite work in ambient, right? Because in ambient you want to be missing your sidecar proxies. So we have this new idea where we'll basically tell you whether something's out of mesh. Is it in mesh? Is it out of mesh? Because the semantics are different with ambient. But it's emerging, right? And anybody that's using Kiali with ambient, please give us any feedback that you've got. One more. Do we have time for one more? Can you use an external tracing or metrics provider? Or do you have to use whatever it is? Yeah, so you can collect your metrics in different ways. Right now Kiali supports a Jaeger interface.
We're basically querying for traces using Jaeger but, as of the last sprint, we're also now supporting Tempo. So if anyone's using Grafana Tempo, we're integrating with that as well right now too. All right. Thanks, folks. Oh, sorry, one more. I'm new to Kiali. What kind of plug-in or extension mechanisms does Kiali support? It doesn't. So, that's a good question; we get that. The question is basically: how do you plug into Kiali? And the answer is that right now you don't plug into Kiali. Kiali is a dedicated Istio console. We are consuming Istio metrics natively and things like that, and we're looking at Istio config. So if you say, I want to run a different service mesh and use Kiali, that's not going to work. We are looking at maybe moving in a direction where we can make a more generic interface for pulling in metrics and so forth, but that doesn't exist yet, and that would be totally in the future.