Thank you for coming to this session. Today we're going to talk about observability of applications based on a service mesh. First of all, my name is Chere Canal. I'm a software engineer at Red Hat working on a project called Kiali, where we observe container-based applications, basically on Istio service meshes.

So today, what we are going to cover is mainly why we are even discussing service meshes, with a little bit of history, and also the architecture: how it works, how Istio is actually deployed, and why we get all those nice features. But most of this session I'm going to spend on a demo. My team built a specific application with a bug inside, or something that is not working nicely, and we are going to see, thanks to Istio, what observability tools we have. In this talk I'm not going to cover traffic management or security, but I am going to walk you through the three main pillars of observability that Istio gives to its users. And at the end, I'm going to do a really brief recap.

So why are we here, why are we even talking about service meshes? A lot of you know this story already: we are moving from monolithic applications to microservices, or maybe just services. What we were dealing with 20 years ago were application servers: huge servers, really huge servers, that handled a lot of responsibilities. The front end of an application, HTML, JavaScript, maybe dealing with emails, maybe CPU-intensive operations, GPU-intensive operations, disk-intensive operations. So we needed a huge server to deal with all of those.

But then someone decided: do I really have to pay for this huge server, with I don't know how many cores, when maybe only 10% of our requests are CPU-intensive? This is really expensive. What about splitting those responsibilities across small servers? For example, one server for the CPU-intensive operations, another server or two for some other responsibility, and maybe a few for the regular requests that don't need a lot of resources. So we needed to break that monolith down into different pieces, and we got the microservices part of the story.

What's the big difference here? Before, in the monolith, we shared information using memory: invoking a method, importing a library and calling some methods on it. Now, for sharing information, what you have is the network. You have all these cables connecting all the services.

So here's one example. In the monolithic application, we need to compute the benefits of a company, so we just return the subtraction of the income and the outcome. No network involved. Now let's make it microservices: both of those methods now go over the network. This is not precise, it's just for you to understand my point, but now we have to deal with the network, so we have to deal with timeouts. And, in this case, with how many retries we want to do, because the network is reliable, right? So we have to count how many retries we've done and decide whether to retry again.
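As a rough sketch of that point (hypothetical service names and URLs, not the actual slide code), here is the one-line monolith method next to its networked version with hand-rolled timeouts and retries:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BenefitsClient {

    // Monolith version: one line, in memory, no network involved.
    static long getBenefits(long income, long outcome) {
        return income - outcome;
    }

    private static final int MAX_RETRIES = 3; // because the network is "reliable", right?

    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2)) // now we have to deal with timeouts
            .build();

    // Microservice version: the same subtraction, but income and outcome
    // now live behind the network, so timeouts and retries are on us.
    long getBenefitsRemote() throws Exception {
        long income = fetchWithRetries("http://income-service/income");    // hypothetical URL
        long outcome = fetchWithRetries("http://outcome-service/outcome"); // hypothetical URL
        return income - outcome;
    }

    private long fetchWithRetries(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))
                .build();
        Exception last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                return Long.parseLong(response.body().trim());
            } catch (Exception e) {
                last = e; // count the retry and go again
            }
        }
        throw last; // all retries exhausted
    }
}
```

One in-memory subtraction has grown into a whole class, and we haven't even added the tracing or telemetry clients yet; those are the extra lines coming next.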
And also, before, in the monolith, we had one log, and every request started in one place and finished there. What we want now is, for example, distributed tracing. Because one request is no longer handled by one server (one request is made of maybe 10 different requests), we want to keep track of all of them. So we need to add yet one more client, import it, manage any exceptions during its initialization, and at the end of the request close the Jaeger tracing span. The same goes for telemetry: in a monolith you put an agent next to your application, and that agent went to your services and collected the information. Now you have to do that for every single service you have, so you add yet another client for all this telemetry. We end up with a method that has 30 lines of code when we just needed one before, right?

So developers in this scenario spend a lot of time dealing with the network, telemetry, distributed tracing, and so on. And security too: you have tons of services, so who is going to be in charge of implementing TLS for every one of them? A lot of extra code to deal with all this.

So here is the service mesh to the rescue. For every service, the blue box is what we call the business logic of your application, the part carrying the value for the business. But then you have the orange boxes, which are the libraries, modules, whatever you want to call them, in charge of dealing with the network, telemetry, and the new paradigm of microservices. There's a bunch of technology in charge of that. And because every service can be written in a different language (maybe the benefits service is in Java, another one is in Node, and another one is in C++), the developer ends up taking care of too many responsibilities for every single service.

And now, yes: service mesh to the rescue. Back at the start of Kubernetes, you have the platform, which is Kubernetes, and you deploy your service on top of it. What we want is to push those responsibilities down to the platform, and pushing them down to the platform is exactly what Istio and service meshes do. The service mesh moves us to a scenario where the service only contains the business logic operations: you just compute the benefits, income, or outcome data of your company. You don't have to deal with the rest; those responsibilities moved down to the container platform layer, to what we call the Istio service mesh. So now developers are free to write just business logic.

OK, this seems really, really cool. But how does it work? What's the magic here? Here is the architecture of Istio. Check this out: imagine a line dividing this diagram. In the upper part of the image is what we call the data plane. That's where all our services are living and running: the income service, the outcome service, get-benefits, and so on. And in the bottom part are the Istio modules, the control plane, which governs everything in your data plane. Imagine a castle on top of a mountain, and down the mountain it has its town.
From the castle you can govern; you can see everything happening below. So the castle is the control plane, and the town is the data plane.

Let's start with the data plane. How do we push those responsibilities down to the platform? All of them now live in a proxy. Let me explain a little better. When you deploy a service, Istio automatically injects a proxy in front of it. This proxy captures all the traffic going out of the service and also the incoming traffic. So if service A calls service B, what actually happens is that service A goes through its proxy, that proxy forwards the request to the next proxy, and that one forwards it to service B. In those proxies, things like authentication happen. That means security, TLS termination, lives there, and authorization too: the proxy can answer the question, should I forward this request to the service or not? The proxy also decides which version of a service the traffic should go to. In an A/B testing scenario, should I send it to version 1 or version 2? Is the system collapsing, so should I stop forwarding this request, because I know that if I send it, my database is going to blow up? So this is the data plane, where all the requests of your application happen.

The question is: who configures these proxies? The control plane. Here you have the four modules of Istio that manage all this configuration and make it happen. The first one is Galley. Galley is in charge of receiving configuration and checking its validity: is this valid, or is it going to break the service mesh? If everything is OK, it hands it to Pilot, which is the delivery guy: it takes all the configuration from the users and distributes it to every proxy in the data plane. Then you have Citadel; that's why we are on top of the mountain. Citadel detects when there is a new service and, if there is, sends down a pair of keys in order to enable TLS. So it secures all the services you have, or at least prepares them to be secured, and you can start using mTLS, mutual TLS, for all the connections in the data plane. And finally you have Mixer. Mixer is in charge of telemetry and policy. For telemetry, it registers every request happening in your data plane and sends it to Prometheus. And it also handles authorization: should I allow this request to reach the final service, service B, for example?

These are the pillars of Istio, and with them, what you get by default is telemetry, traffic management, and security. Today we're going to focus mostly on telemetry. One really, really important thing to say here: with Istio, you don't have to change a single line of code, because this proxy is maintained by the Istio developers. Actually, it's the opposite, right?
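As a minimal sketch of how that injection is usually switched on (namespace name hypothetical; the label itself is Istio's standard mechanism), it's just a label on the namespace, and every pod deployed there gets the Envoy proxy container added automatically:

```yaml
# Hypothetical namespace name; istio-injection is the standard label
# that tells Istio's sidecar injector to add the proxy container to
# every pod deployed into this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: travel-agency
  labels:
    istio-injection: enabled
```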
We get to remove those 30 extra lines we added before and just call the service we want over the network; Istio is going to do the retries, do the circuit breaking, and handle the TLS on both ends. So you are back to the monolithic era in terms of how much you can rely on the network, and you get telemetry and security on top.

OK, now that we know we have these three groups of functionality, let me introduce Kiali, and with it we are going to see observability in action. Kiali is the observability tool for applications based on the Istio service mesh. The question Kiali answers is: how are my microservices doing?

Let's go to the demo. I'm going to spend most of the remaining time here. What we prepared is a travel agency application. It's a really, really dummy application: you have three different portals, distributed all over the globe, that check for cars, hotels, and insurance in order to make reservations and book those services. Really simple. What we are going to do with this demo is troubleshoot it, understand what is happening, and get to the root cause of the problem that is making our application run a little sluggishly.

This page you see here is Kiali, the console, and this is the overview page. What we can see is everything we deployed in our Kubernetes or OpenShift cluster. Here we only care about these two namespaces, so let me filter them. Yes, only those two. We see that apparently everything is green, so it looks like there's no problem. The configuration is all right, the six applications we are seeing are pretty healthy, and traffic is stable, around 70 requests per second. Looks pretty cool.

Let me show you what this application looks like. This is one of the coolest features we have in Kiali: this is how our application is connected, how all the requests flow from one service to another. And this is thanks to Prometheus. Do you remember that proxy we have in front of every service? What it does is capture all the requests and send the telemetry to Prometheus, and with those metrics we are able to draw this graph.

Let me show one more cool thing, the traffic animation, and also whether the traffic is secure or not. Now you can see traffic actually flowing from one service to another. Let me stop here a bit. The triangles here are services, the same services you know from Kubernetes itself, and that node is a workload. A workload is a unit of a runtime application: it may be 100 pods, it may be one deployment creating 100 pods. It's one piece of the application, and you can have multiple versions of it. For example, here, every application has a version 1 and a version 2. So it looks like this application is running an A/B test: every request received by travels goes to v1 or v2, and from v1 it goes on to the respective services checking for insurance or hotels, where the traffic is again split between v1 and v2.
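To give an idea of where that graph data comes from: the proxies report counters such as istio_requests_total to Prometheus, and a query along these lines (labels from Istio's standard telemetry; exact names vary across Istio versions) yields roughly one result per edge of the graph:

```promql
# Requests per second between workloads, as reported by the
# destination-side proxies; roughly one graph edge per result.
sum(rate(istio_requests_total{reporter="destination"}[1m]))
  by (source_workload, destination_workload, response_code)
```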
And here in this example, we have three different applications that might live in different data centers; in this case, they are in different namespaces. This one is a portal based in London, this one is in Paris, and this one is in Rome. Those portals are the ones sending traffic to the engine. And we see there are two different kinds of customers: web customers, and VIP customers, the ones paying more money and probably getting better discounts on their bookings. So more or less, I think it's clear what this application does.

One important thing: I remember someone telling me this graph is pretty useful not only for seeing all the requests flowing, but also because "I finally know how many services I have." I talk to one developer and he says, no, no, my team's namespace only deals with three. But I talk to his colleague, and he says, no, no, we are dealing with four. OK, so how many things do I have to secure? How many things do I have to take care of? Pretty cool.

Let me show a few more things. The large boxes are applications, meaning a runtime (pods, deployments) that runs an application with the same behavior but possibly different versions of it. For example, here. [Audience: are those just deployments?] Nope, they are a deployment and pods, and here it's just a representation built out of the metrics. You see insurances here: it means v1 and v2 deal with the same responsibilities, for example fetch insurances, book insurances, or cancel insurances, but there are two versions of it. Essentially it's the same behavior, the same application.

OK, let's look at response time, because thanks to Prometheus, we also have the response times. We see that, for example, the traffic going to v1 takes 91 milliseconds, but the traffic going to version 2 takes 222. Here it's about the same: 90 milliseconds for v1 and 200 for version 2. And the same happens here with version 2. So it looks like version 2 has a higher latency. Here, exactly the same.

So now I want to introduce the first pillar of observability, which is the metrics, the golden signals. With Istio, by default, without changing a single line of code, you get something like this: the inbound and outbound metrics. We want to see what is happening with those latencies, and we want to answer a question: which users are suffering the most from this new version 2 with its higher response time? With Kiali you have the ability to group metrics by remote client; let me pick the grouping I wanted. You remember the travels service had as its clients the London, Paris, and Rome portals; travels is kind of the front end of our engine. So we want to know, for every portal, which users are most affected by this version 2. If I group the metrics, all the telemetry, by remote version, by the web and VIP flavors of the clients, you can see there are two distinct lines of response times. The VIP ones have a really high response time, 104 milliseconds, while for the web ones it's around, let's say, 50 or 60 milliseconds. So yes, we are observing that the VIP users, the ones paying extra money, the ones we most want to take care of, are the ones suffering from our new deployment. OK, let's see what's happening.
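Under the hood, that "group by remote version" view corresponds roughly to a query over Istio's request-duration histogram. This is a sketch assuming the Mixer-era metric and label names, which may differ in your Istio version:

```promql
# p95 response time of the travels service, split by the remote
# (client) app and version: one line per portal persona.
histogram_quantile(0.95,
  sum(rate(istio_request_duration_seconds_bucket{destination_service_name="travels"}[5m]))
    by (source_app, source_version, le))
```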
Let's see which pods are the ones having those problems. Do you see the little dots here, correlated with the chart? These dots are actually the traces that are automatically recorded by every proxy. At the same point in time you see on the chart, you can see the actual traces that lasted those milliseconds: there was a request to travels at that moment that took that long. So we have correlated not only the metrics but also the traces registered in our service mesh, and for every request we can see whether it sits at the top or the bottom of the chart. We clearly see that most of the slow traces come from the VIPs. So let's check that out; let's see what those traces say.

For that we go to the services, because Istio records traces at the service level, not at the workload level. This is the page of a service, with information related to it. We see there are two endpoints, we see which port is exposed, and we see the two versions that belong to the app flights, version 1 and version 2. But the most important thing we see is that travels, the front end, is the one sending traffic to this service, and the service then sends traffic to its pods.

Let's go to the tracing tab. The awesome thing here is that you can see all the traces related to that flights service. Quickly we can see there are three bands of traces: one band down here around 0 milliseconds, probably more like 5 milliseconds; another one around 50 milliseconds; and one more at the top. Let's check those traces, starting with the middle band. What you see for a trace is the spans that made up that request. Let's look at flights: this is the span, the request going through flights, with some extra information. And yes, this trace in the 50-millisecond band goes to v2. Let's look at the discounts span: it goes to v2 as well. All right, now let's check the traces that don't have this huge delay and see which versions they use. This one uses v1 for one span, and the other span, for flights, uses v2. So it also touches v2, yet it's only five milliseconds; v2 is not always introducing the delay. Then something is going really mad in the slowest band, so let's check that out. What we see there is that the request goes to v2 in flights and uses v1 in travels. OK, so it's not conclusive. We see that when v2 is involved, the latency might grow, but we are not sure; it's not all the time. It looks like v2 is introducing a delay randomly.

So let me introduce the third pillar of observability, which is the logs. With Istio, you can also check the logs of a specific application. Let's go to the logs of v2, which makes sense given our hypothesis. What you see here are the logs of one workload, not one pod; there can be multiple pods behind it. And you can also check the logs of your proxy and see what's happening there. This is the proxy we were talking about on the architecture slide. It looks like everything is kind of OK: info, info, no errors, nothing happened.
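By the way, outside of Kiali you can reach the same two log streams with plain kubectl, since the injected sidecar is just another container in the pod. A sketch, with hypothetical workload, namespace, and app container names (istio-proxy is the conventional sidecar container name):

```sh
# Application container logs vs. the injected Envoy sidecar's logs.
kubectl logs deploy/cars-v2 -n travel-agency -c cars
kubectl logs deploy/cars-v2 -n travel-agency -c istio-proxy
```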
Now let's see the logs of our application, v2 of cars. What you see here is, for example, one GET car for the city of London, but then you have a line saying "Chaos Monkey introduced 50 milliseconds of latency." Do you guys know what chaos engineering is? How many of you? OK. So it looks like someone just released a version 2 with a 50-millisecond delay on all the requests. Not that cool. But OK, we are on Istio, so we can amend this, right?

So far, we've seen metrics powering the graph and showing how healthy our applications are. Then we saw the metrics and response times in detail. And after that, we saw the telemetry for distributed tracing, and the logs. Now I want to show you how we can prevent our VIP users from getting this bad experience: how we can use Istio's traffic management to cut all the communication to the version 2 workloads that are introducing the latency.

And just for fun, let me check whether v1 also had some extra chaos introduced. It looks like no. So we were right: v2 is, for sure, the one introducing the latency here. OK, so here we have the really high response time, and we see that everything flows from cars to v1 and v2. What we should do is prevent all the traffic from flowing to v2. So let me introduce what we call the Kiali actions. You go to a service, which is the thing in charge of splitting traffic between versions, and we are going to suspend traffic on version 2. With this, Kiali creates for you the necessary Istio configuration, which goes through Galley, is sent to Pilot, and Pilot, the delivery guy, goes to the necessary proxies and installs the new configuration. Here you can see the new YAML it created for you, saying that 100% of the time, go to the v1 subset, and 0% of the time, go to the v2 subset.

Let's check that out on the graph. Let me highlight the app cars (there's a cool feature here for highlighting) and let me show the requests per second. Right now the graph says: here you have a virtual service, so there is traffic management going on. This should stabilize over time, and we should see it drop to 0 in a few seconds; if the demo gods allow, it will be 0. Requests per second, percentages... OK, it's getting a lower and lower percentage of requests, going down, 32, and so on. And now, for the last minute, there is no traffic here. We didn't change a single line of code; we didn't have to touch any configuration service. We just configured the platform.

We could do exactly the same for all the other services, and we would end up forwarding all the traffic to only v1. Let's do it, for example, for hotels. We could do exactly the same, or let's do something different: split the traffic, not 0/100. For example, the scenario is that we've talked to the guy who introduced this chaos monkey, and he convinced us to try only 10% of the traffic on one of the workloads. Because sometimes it's really useful to play with real traffic on a new version: we don't want to do performance testing with synthetic traffic, we want to use real traffic.
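To make that concrete, here is a sketch of the kind of routing rule Kiali generates for these actions, using Istio's VirtualService and DestinationRule APIs (service and namespace names hypothetical). With weights 100/0 it suspends v2 entirely; with 90/10 it gives the canary a trickle of real traffic, which is what we do next:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: hotels
  namespace: travel-agency
spec:
  hosts:
    - hotels
  http:
    - route:
        - destination:
            host: hotels
            subset: v1
          weight: 90   # a weight of 100 here suspends v2 entirely
        - destination:
            host: hotels
            subset: v2
          weight: 10
---
# The subsets map the pods' version labels to routable targets.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: hotels
  namespace: travel-agency
spec:
  host: hotels
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```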
So we can do exactly that: send 10% of the traffic here and the rest to the previous version that we know works. If we go to the graph now (I don't want you to just take my word for it), we should see it. I don't remember which one it was... hotels, thank you. Yeah, here we have it. Actually, I don't know why I asked, because I have a virtual service icon here telling me something is happening. And you see that, over time, the traffic going to version 2 settles into a 90/10% split.

Another cool thing: in the traffic animation you see the requests here. If you have a lot of requests, you're going to see a lot of dots, and the size of a dot is the size of the request. If you have, for example, a service uploading pictures, you're going to see huge balls, so it's going to be easy to say, OK, we have a bandwidth problem, let's go to that service and see what we can do in this area.

One thing I didn't show you before is that we also show security, because Istio by default uses something called mTLS. Do you guys know about mTLS? No? OK. mTLS is like TLS, but not only from the client's standpoint, asking, server, are you who you say you are? It's both ways: the server identifies itself to the client, and the client also proves it's a valid interlocutor. Especially for microservices, and especially for machine-to-machine communication, it's really important to identify both ends of the communication. So what you can see here really easily is all the edges, all the communications that use mTLS; at this point, that's all of the traffic. A really cool feature. And you can also ask: by response time, which edges take more than 200 milliseconds? We can see that version 2 still has the higher latency.

All right. And here, this is probably going to be a premiere, because we have this really nice tool, the replay; this is brand new. If you had an incident, a problem, you can replay it. It's like football: when someone scores a goal, they show you what happened again. So here you can replay what happened during the last five minutes: you can see all the traffic flowing and check all the response times over that window. Isn't it awesome that you can replay what was happening at the moment you had an incident? And you can still use the usual tools for highlighting things: seeing if the problem was because mTLS was introduced, or maybe replaying the moment you rolled out a new version and your system got really saturated, full of requests. That's awesome. And I think it's a premiere, something like a world premiere today.

OK, so that's most of what I have for you. But before finishing, let's do a recap, because sometimes I can be a little messy talking about things. What does Istio give you in terms of observability (not security and not traffic management, those are two different talks)? The four golden signals of metrics, and service discovery.
So when you introduce a new service, you automatically see it on this graph. You see the green communications there, healthy communication, and dashboards, meaning you can see all the response times, operations per second, even runtime metrics, and the security status. The second pillar is distributed tracing: we saw that from here you get a sneak peek of the distributed tracing, and if you want the full information, you have access to the Jaeger console. And then logs, not only for the application but also for the proxies, because since you introduced proxies in front of every service, you now need to know what is happening there.

One question. Yes? [Audience] Is the proxy in front of the service or in front of the pods? In front of the pods, sorry, in front of the pods. Every pod has two containers: the application and the proxy. And the proxy captures all the incoming and outgoing requests of the service and forwards them. So this one here is the pod: one container here, and in the same pod you have the proxy. The proxy is the one sending everything necessary for telemetry; that's how we get the graph and the metrics. It is also configured for mTLS and TLS termination, and it receives that configuration thanks to these control plane components here.

More questions? I think it's time for questions. Yes, please. [Audience] You mentioned that you build the traffic graph from the metrics. If you have distributed tracing, you could build the graph from there as well, I believe. Can you elaborate on why you use Prometheus? What is the advantage?

I'm going to be straight here. Before Kiali there used to be a tool called service graph, and it was based on Prometheus telemetry; the Istio folks designed all the telemetry precisely for this end, showing a representation of all the communication. So it was pretty straightforward. The second thing is that Istio, if I'm not wrong, puts a lot of metadata in every metric. It records, for example, the mTLS security status, and whether traffic is inside or outside of the cluster. There's a lot of metadata there that is really useful for the graph. Beyond that, I can't be accurate; I wasn't working in that area and I might be wrong, but more or less those are the highlights. I haven't done that much with distributed tracing, so I don't know the trade-off in detail, but if you want, maybe offline, after the talk, we can gather a couple of you and talk about it.

Yep? [Audience] Can Kiali visualize or show connections that go off the cluster? Yes. I mean, Kiali shows the traffic as long as there is a proxy in front of it and it's part of Istio. If something is outside of the cluster and outside of Istio, you will see the traffic going outside of the mesh and it ends there, but you will see everything that happened up to the edge of the mesh.

Yeah. [Audience] This is running in the cluster, not on a master node, but in the cluster, in Kubernetes, right? Yes, it's OpenShift, on the compute nodes. [Audience] And how is it deployed in the cluster? When you enable this, does it come with OpenShift
out of the box, or do you have to...? No, let me answer in two parts. First, this setup: this is one Kubernetes cluster, exactly this one, running on Amazon with four nodes. That's one part: Kubernetes on Amazon with four nodes. The second part: if you want to enable this, you have your deployments, your services, and what Istio and OpenShift Service Mesh do is automatically inject another container into your deployment. So automatically, when you deploy, you get the proxy alongside the code. Does that answer your question?

[Audience] OK, but the control plane, is it running within the cluster? Yes, the control plane runs in the cluster. It lives in one namespace, istio-system, for example, and there are probably six or seven pods running everything.

[Audience] And is there any performance overhead? How do you estimate it, and how do you scale your cluster if you want to use this? Yeah, the numbers depend on the scenario and on the features you enable. Of course, this is not free, because you're putting a proxy in front of every pod; that's a clear overhead. People on Istio and OpenShift Service Mesh are working really intensely to reduce this latency, especially for telemetry. How to approach it? Most people start in staging and try to measure the performance there, but I don't have a more detailed recipe. One idea is to add Istio to your cluster without putting the proxies on all of your deployments, just maybe two of them. See how it works, enable security, enable everything, see what you can get out of traffic management with just three or four of them. If you install Istio with no proxies, you just have the Istio applications sitting there; the impact comes when you put the proxies in front. So you can gradually move from a non-service-mesh application to a service-mesh one, and when you feel comfortable with three, maybe you go to four, five, six, until you have everything in the mesh.

Yes? [Audience] When you deploy it, do you choose it at the namespace level or at the service level? At the deployment level, yes. So it's not the whole namespace, only the services or deployments you want. [Audience] OK, thank you.

Yeah? I'm afraid we're running out of time. OK, so thank you very much, everyone.