Okay, so my name is Pavol Loffay and in this talk I will tell you something about observability for the Istio service mesh. On today's agenda there is a generic introduction to observability, then I will talk about the telemetry subsystem in Istio, and the last part will be a demo showing a soccer game deployed on Kubernetes. We will be changing some Istio configuration to actually change the behavior inside the game, and we will be monitoring the game in Kiali, which is a service mesh observability tool that is something of a standard for Istio, and also in Jaeger, where we will look at some distributed traces.

Before we start, just a couple of words about myself: I have been working on distributed tracing and observability for about the last four years. I am not an Istio developer; however, I work on the Jaeger and OpenTracing projects, which are distributed tracing projects that can be used with Istio. I also work on framework instrumentations and runtimes for OpenTracing, to make it very easy for developers to start using tracing in their technology.

When I started preparing for this presentation, I created a couple of microservices, and to verify that they worked I deployed them on Kubernetes and made a request to the cluster, the simplest thing I could do, right? And obviously the request failed, and from the error message it is obvious that something is wrong with the DNS. But my point here is that DNS is a distributed system, and if we want to be able to tell why the request failed, we need the telemetry information from all of the nodes; otherwise we don't know what really happened, as in this case: I wasn't sure whether it was my configuration issue or something wrong with Minikube.

Anyway, speaking about microservices: on this slide you can see a very large deployment of a microservice application, and it is actually a screenshot from Jaeger.
It shows a dependency diagram with maybe 2,000 microservices. It is actually a deployment from Uber, the ride-sharing company. If you are booking a ride, the request usually goes to dozens or hundreds of services, and if you are not able to correlate what happened in those services, you are not able to tell the story of the request. If you are monitoring each individual service, that is fine: you know what is happening inside that service. But if you are not correlating all this data, you are basically lost; you don't know what is happening with the request.

Traditionally we monitored our systems using logging and metrics; these are the baseline, and both of these monitoring techniques take an event-based approach: when an event happens in the system, you generate a log line or you increase a counter, a metric. If you want to use these two, you have to instrument your applications; there is no way around it. For logging, for example, you use instrumentation APIs like SLF4J or Log4j, or maybe the standard Java logging API. For metrics it is pretty much the same story: your favorite framework probably provides some integration with metric systems; for example, in Dropwizard you have Dropwizard Metrics, and in Spring Boot you have something like Micrometer. The problem is that in polyglot applications you potentially deal with different APIs in different languages. And even within a single language there are different APIs and different implementations of those APIs. So you deal with different APIs, and the different instrumentations produce different sets of data for what are basically the same events.
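To make the event-based approach concrete, here is a minimal sketch in plain Java, using only java.util.logging and an AtomicLong as a stand-in for a real metrics counter; the class name and the "order placed" event are illustrative, not from the talk:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

public class CheckoutService {
    private static final Logger LOG = Logger.getLogger(CheckoutService.class.getName());
    // Stand-in for a metrics counter; in practice this would be
    // Dropwizard Metrics, Micrometer, or a Prometheus client.
    private static final AtomicLong ORDERS_PLACED = new AtomicLong();

    public static long placeOrder(String orderId) {
        // Event-based monitoring: on each event we emit a log line...
        LOG.info("order placed: " + orderId);
        // ...and bump a counter. Two APIs, two instrumentation points,
        // for the very same event.
        return ORDERS_PLACED.incrementAndGet();
    }

    public static void main(String[] args) {
        placeOrder("o-1");
        placeOrder("o-2");
        System.out.println("orders placed so far: " + ORDERS_PLACED.get());
    }
}
```

Note that even this tiny example instruments the same event twice, once for each telemetry type, which is exactly the duplication the talk is pointing at.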
So you get inconsistent data. The other point is that some of those instrumentations may not support or export data to your favorite monitoring platform, for example Prometheus or Splunk. So that is the problem: a very inconsistent way of monitoring our applications. And here is where Istio can help you unify all the telemetry data that is produced.

Just quickly, on this slide you can see the Istio architecture, and there are basically two important parts. At the top of the slide there is the so-called data plane. It consists of your microservices but also of the proxies; a proxy is a process which lives very close to the application, and if the application wants to tell something to another application, all the traffic has to go through the proxy, so these proxies have full control over the traffic going between the services. The other part is the control plane, which is like the brain of the service mesh. There are three components: the first one is Pilot, which handles the configuration for the proxies; then there is Citadel, which handles authentication; and then Mixer, which is where all the telemetry happens.

As I said, all the traffic goes through the proxies, so the proxies together with Mixer can generate very uniform telemetry data for that traffic. In Mixer you configure what kind of telemetry you would like to extract from the traffic, for example the HTTP status code or maybe the payload size. The other important thing is that Mixer is pluggable, so you can use different monitoring solutions with it: by default there is Prometheus for the metrics and Jaeger for the traces, but you can configure basically any backend.
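As a sketch of what such Mixer configuration looks like, here is a metric instance in the Istio 1.x `config.istio.io` style; the instance name and the chosen dimensions are illustrative, not from the talk:

```yaml
# Mixer "instance": which attributes to extract from each request
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestpayloadsize
  namespace: istio-system
spec:
  compiledTemplate: metric
  params:
    value: request.size | 0                 # e.g. the payload size
    dimensions:
      response_code: response.code | 200    # e.g. the HTTP status code
      destination: destination.service.host | "unknown"
```

A separate handler (for example a Prometheus one) and a rule binding the instance to the handler would complete the picture; the `| 0` syntax supplies a default when the attribute is absent.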
The first telemetry data which Mixer can produce is metrics. As I said, by default it uses Prometheus and exports the metrics in the Prometheus format, which is text-based, so it is very easy to understand and read. I will not go into the details.

The next telemetry information is distributed tracing, and just quickly, for the people who don't know tracing, I would like to mention how it actually works. Imagine there are five microservices A, B, C, D, and E, and a request comes into this deployment, hitting service A first. The tracing integration generates a unique ID, which we call the trace ID, and we put that ID into a bigger context and propagate this context along all downstream calls. It is very important that the context doesn't change. Later, after the invocation, the tracing system can correlate all of the events with the same ID and show you something like a Gantt chart or a dependency diagram of what happened.

In Istio, this trace ID is actually generated inside the proxy, because the proxy is the first one that knows about the request. The proxy then puts the identifiers from the context into the request headers, and your application has to take them from the incoming request and propagate them to its outbound requests. That can be quite problematic, because it depends on the language. For example, in Java you can store the context in thread locals and, on the client side when you are making an outbound call, just read it from the thread locals: pretty much an easy thing to do. But we write quite complicated code with a lot of concurrency, thread pools, queues, and futures, and thread locals just do not work with these primitives. So you always have to be careful about propagating those headers inside your applications.

As I mentioned, the default tracing backend in Istio is Jaeger, but Envoy and Mixer actually produce traces in the Zipkin format.
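To illustrate why thread locals break down with thread pools, here is a minimal, self-contained Java sketch (class and variable names are mine, not from the talk): a trace ID stored in a ThreadLocal on the request thread is simply invisible from a pooled worker thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextLossDemo {
    // Stand-in for a tracing context stored per thread.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Returns what a pooled worker thread sees in the ThreadLocal
    // after the calling ("request") thread has set it.
    static String seenByPoolThread() throws Exception {
        TRACE_ID.set("4bf92f3577b34da6");   // set on the request thread
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // The worker thread has its own, empty ThreadLocal slot,
            // so the trace context is lost here.
            return pool.submit(TRACE_ID::get).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("pool thread sees: " + seenByPoolThread()); // null
    }
}
```

This is exactly why hand-rolled header propagation goes wrong once futures and executors enter the code path: the context has to be captured and re-attached explicitly around every thread hop.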
That is because Jaeger is an OpenTracing implementation, and OpenTracing does not define a standard wire data format for tracing. I think the Envoy project decided to use Zipkin because that format is well standardized, and Jaeger can consume Zipkin data, so there is no problem with that.

As for the headers: your application has to propagate a set of headers from the so-called B3 protocol, which comes from the Zipkin project, and there is also the additional x-ot-span-context header; I think that one comes from LightStep. Envoy initially used the LightStep tracer, LightStep being a commercial tracing vendor.

This context propagation can be very tricky, so my recommendation is to use a tracing instrumentation to propagate the context for you. For example, OpenTracing contrib hosts a lot of instrumentations for various frameworks, and if you plug in a tracer through them, you don't have to care about propagating those headers yourself. The other benefit of using instrumentation is that with the traces from Envoy alone you only get visibility at the endpoints: you see that there is an incoming request or an outgoing request, but you don't know what is happening inside your application; it is basically a black box. If you use instrumentation inside the process, you also get visibility into what is happening inside, for example a database call or a CDI invocation; it depends on what kind of instrumentation you use.

Just quickly about the future: you can see there are a lot of headers, right? We have the B3 protocol, which works for Zipkin, but we also carry LightStep headers which we may not really care about.
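If you do propagate by hand, the job is essentially copying a fixed set of headers from the inbound request to every outbound call. A minimal sketch (the header list matches the one Istio documents for Envoy's Zipkin-style tracing; the helper name and the Map-based request representation are mine):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class B3Propagation {
    // Headers Istio asks applications to forward from inbound to outbound requests.
    static final List<String> TRACING_HEADERS = List.of(
            "x-request-id",
            "x-b3-traceid",
            "x-b3-spanid",
            "x-b3-parentspanid",
            "x-b3-sampled",
            "x-b3-flags",
            "x-ot-span-context");

    // Copies just the tracing headers from the inbound request into a map
    // that should be attached to every outbound call made while handling it.
    static Map<String, String> extractForPropagation(Map<String, String> inbound) {
        Map<String, String> outbound = new HashMap<>();
        for (String name : TRACING_HEADERS) {
            String value = inbound.get(name);
            if (value != null) {
                outbound.put(name, value);
            }
        }
        return outbound;
    }

    public static void main(String[] args) {
        Map<String, String> inbound = Map.of(
                "x-b3-traceid", "463ac35c9f6413ad48485a3953bb6124",
                "x-b3-spanid", "a2fb4a1d1a96d312",
                "x-b3-sampled", "1",
                "content-type", "application/json");   // not a tracing header
        System.out.println(extractForPropagation(inbound));
    }
}
```

The hard part is not this copy but carrying the inbound map to the place where the outbound call happens, which is the thread-local problem described above.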
In the future there is a working group trying to standardize the wire protocol for tracing, and it introduces two tracing headers, traceparent and tracestate. The nice thing about this is the following: imagine you have instrumented your system with one tracer, for example Jaeger, but you are calling some managed service that uses a totally different tracing instrumentation and tracing backend. Traditionally these two would not talk to each other, but once we have these standard headers, the two different tracing deployments can actually correlate their traces. So imagine you have a problem with, I don't know, an AWS managed service: you can call them and say, give me the trace data for this trace ID, and they will actually be able to do it.

The last telemetry data we get from Istio is logging. Mixer sends unified logs to Fluentd, and Fluentd can be configured to use a lot of backend storages for logs; by default I think it is something like Elasticsearch. But my question is: is logging actually useful? It is probably too expensive, too verbose, and there is no causality, and in microservices we want to correlate the telemetry between the services and between the calls. Maybe it is useful for just a couple of events, something like application lifecycle events: when your application is starting up and booting, you would like to log the configuration and the initialization of the components. And I think that the instrumentation APIs we have for logs, metrics, and traces will somehow unify, so we will instrument our applications once and get observability for all of the telemetry data we can generate.
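The traceparent header mentioned above has a fixed shape in the W3C Trace Context draft: version, trace-id, parent-id, and flags, hex-encoded and separated by dashes. A minimal parser sketch (field and class names are mine):

```java
public class TraceParent {
    final String version;   // 2 hex chars, e.g. "00"
    final String traceId;   // 32 hex chars, shared by the whole trace
    final String parentId;  // 16 hex chars, the caller's span
    final String flags;     // 2 hex chars, e.g. "01" = sampled

    TraceParent(String version, String traceId, String parentId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    // Parses "version-traceid-parentid-flags"; returns null if malformed.
    static TraceParent parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4 || parts[0].length() != 2
                || parts[1].length() != 32 || parts[2].length() != 16
                || parts[3].length() != 2) {
            return null;
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("trace-id = " + tp.traceId);
    }
}
```

Because both deployments carry the same trace-id in this header, either side can look up its half of the trace by that ID, which is exactly the cross-vendor correlation scenario described above.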
Next is the demo. I will show you a soccer game deployed on Kubernetes. It is actually not my work; it is the work of Joel Takvorian, who works in the same team. I will start deploying the microservices into Kubernetes. The first service is the UI, so I will just port-forward the Istio gateway. Now we should see that there is nothing except a start-game button; if I hit the button, nothing happens, because we only have the UI service, which is responsible for refreshing the game and drawing it on the screen, and we don't yet have the microservices responsible for the stadium and the players. So next I will deploy the ball, the stadium, and the players. We see that something happened: we have the ball, there are players with a very simple AI, and they are going after the ball. There are two teams, the yellows and the blues, and they just go to the ball and try to score.

Now I will go to Kiali and have a look at what Kiali can show us. As I mentioned, Kiali is a service mesh observability tool for Istio, and it provides a high-level view of the namespaces and the applications that are deployed. We are working in the default namespace, so I will just go there, and we can see all the microservices: we have ai-locals, which is basically the local players, and then the visitors, the ball, the stadium, and the UI. Then there is also this graph, which is very interesting: we can see all the components in our deployment, the locals, the visitors, the ball, the stadium, and also the percentage of the traffic going to each individual service. You can change this view; there are a lot of different views you can choose from. For example, now you can see the AI visitors calling just the ball and the stadium, but you can also introduce service nodes, and we will see more of that when we have different versions of a
service, because then we can quickly see how much traffic is going to each individual version. I will zoom in on the visitors, and here we can see that the visitors are calling the ball (getting the position of the ball), calling the stadium, and calling the UI; there is also the Jaeger collector, so the service is reporting some traces to Jaeger.

As the next step I will deploy a second ball. We see the ball is here, although sometimes it takes time to get it onto the stadium. Okay, we have two balls, and we can see that the players are hesitating; they don't know which ball to take. In Kiali the observability is pretty much the same: we see the different balls, and after I refresh we see the traffic is split, something like 22 to 78. That is just because ball two was deployed later; if we wait, the traffic will become 50/50.

We don't have a lot of time, so I will quickly change the Istio rules to forward 75 percent of the traffic to one of the balls and 25 percent to the other. Look at the game: now the players should be playing mostly with the first ball, but sometimes they go to the other one, because it is 25 percent; if they are very close and the request goes to the second ball, they will just go there.

What is more interesting, we will bring in some better players: we have Messi and Mbappé, and they are much better players. We can see they are using the same strategy, going 75 percent of the time to the white ball and 25 percent of the time to the other ball. Note that I haven't changed any code; I am just changing the configuration in Istio. What is even more interesting is that we can actually enable two games in the stadium, so that Messi and Mbappé will be playing with the second ball and all the dummy players will be playing with the first ball. You can see Messi and Mbappé are playing only with the
pink ball and the other ones are playing with the white ball. If you go to Kiali, you can see what is happening, like how much traffic is going to each of the balls. You can also go to distributed tracing and see some traces; I think this will work. What is interesting here is that they instrumented the AI, the players, in such a way that you can actually see how they move around the stadium, so you can trace who scored and where the player was. It is pretty interesting, but if the game lasts long, you get an insane number of spans, for example 1,000. Also, in the UI, what you can do with tracing is this: if something is starting up and the startup process is complicated, you can instrument it to see what the steps were during the initialization. So you see the UI called the stadium, the stadium called the ball, and the ball called Mixer.

The other thing: in Kiali we have this nice graph, and we also have access to the metrics. If I go to services, I can go to the workloads, and I'll just show you, I think it's the ball. So you have access to all the metrics from Istio, and you can actually go to Grafana; there should be a link where you can get more of these graphs. Okay, that is everything.
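The kind of Istio rule used earlier to split traffic 75/25 between the two balls is a VirtualService with weighted routes. A sketch, assuming the two balls are subsets v1 and v2 of a `ball` service (the names are illustrative, and a matching DestinationRule defining those subsets is assumed):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ball
spec:
  hosts:
  - ball                  # the in-mesh service name
  http:
  - route:
    - destination:
        host: ball
        subset: v1        # the first ball
      weight: 75
    - destination:
        host: ball
        subset: v2        # the second ball
      weight: 25
```

Applying a rule like this with kubectl is the whole change; no application code is touched, which is the point the demo makes.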