But while you're here, grab your laptop, because we're definitely going to need that. Good news: you only have to download 23 gigs of stuff. So if you haven't started that yet, I really hope your 4G connection is great. If not, well, tough luck. Just kidding. All the stuff that we have for you is in a browser, so whoever was on the iPad: good choice, very good choice, because it is going to work for you.

And here we go. My colleague Nick is going to join us in a second; I'm just going to make sure we're all ready for the workshop. Before we get started, I know the keynotes were amazing, well, the kickoff over there. Who of you is using a service mesh right now? I like the hand-raising. Are we using a service mesh? Are we not? That is always the question, right? Too many options. Who's running Envoy? Consul? What are you using, shout it out. Sweet, that's going to be a fun workshop then.

So while you have your devices, head on over to hashitout.co/smc-22-workshop and you should land in a platform called Instruqt, which is, well, an online training environment. While we wait for that to happen: what you're going to see is a browser interface where we have some documentation for you, a couple of commands in the sidebar, and, among other things, a green button that says "click here to get started". Don't touch your green button yet. No, definitely touch the green button; I mean, if you can't open it up, it's not yours, right? Definitely touch the green button.

So let's switch over to the browser view. That's professional, I do that. What happened to a simple maximize? Life would be too simple with that. All right, there we go. If you hit the green button at the lower right (mine says Next right now because I'm one step ahead; yours should say Start), you should be presented with this very same interface. And if all of this looks good, do me a favor and click on "interactive terminal". For us,
it just says "interactive" because we zoomed in, I think. You should see this terminal, and in here I'm just going to need you to run one command, which is shipyard status, as my lovely assistant is doing right now. The URL is, if you switch over to Keynote... yeah. You can tell we take this seriously, because we aligned this to the slide; want to make sure all the stuff is working. Did the URL work for you? Good. Yeah, so this is the sound that plays when somebody opens a workshop for the first time. It is the sound of a thousand computers being slowly taught. Yes, it is. Sweet.

So if you run shipyard status like Nick did, you should see roughly 42 things that worked out. If not, raise your hand and we'll walk over to you and help you figure it out. In your case, open a new terminal, and if you can, in incognito mode, and we'll go from there.

So we have everything we're going to run through today. Well, we're going to run through this as a group, but the instructions are all here in a kind of self-led approach. So if we don't get through it all, or you want to come back and run through this at your leisure, you can feel absolutely free to do so. We've tried to build this, as I say, in a way that you can go through things yourself, so we can pick things up a little bit self-started as well.

The first thing that I want to say is that Karim uses natural scrolling on his Mac. There is nothing natural about natural scrolling. So I'm going to make some excuses. In terms of the service mesh that we're going to be using today: it is an Envoy-based service mesh. It's Consul service mesh, because that's the one that we know, but it doesn't matter whether you use Consul or Traefik or Kong or Istio or whichever one you're using, because with Envoy being the standard, which it is,
most of the things that we're going to cover will be easily transferable. Now, I do want to put a little disclaimer around that: it is possible to configure Envoy in a number of different ways, and specifically it's possible to configure the way that Envoy presents metrics and tags to you, directly through Envoy's configuration. So some of the metrics and their naming that we run through might not be exactly the same in your specific service mesh, but we're going to give you all of the information to be able to figure out what they are, and to be able to do a lot of the diagnostics that I think you'll want to run through.

So the first thing that I want to talk about is this: the topology of Envoy. What I've tried to do is pull out some of the core components, because all of the statistics inside of Envoy are deeply rooted in these concepts. So you have the concept of a listener, and this is just the TCP or UDP ingress into the Envoy proxy. Then from a listener you have a listener filter, and a listener filter might be something simple: it might be TCP, it might be HTTP.
You've got capabilities around gRPC, and also things around authentication; so if you're using a policy-based authentication flow, you'll probably find that it's implemented in an Envoy listener filter. Then Envoy has the routers, and again it depends on which listener filter you're using, but routers give you the capability to define things like timeouts and retries for anything that's HTTP-based. Ultimately you then start to get into load balancing and clusters. The clusters, that's your list of endpoints that your request is going to be routed to. Now, a cluster can be a local app, which is basically Envoy talking to your main application, or it can be an upstream, and that segues nicely into this, but let's dig into those in a little more depth. My glamorous assistant can scroll down for me; this is a bit flickery.

So, a listener. A listener is a named network location, predominantly your IP and port, or your Unix socket, that can be connected to by a downstream client. And we'll get on to the concept of downstream and upstream, because I think it's sometimes confusing: you think "downstream" and you think, well, a river flows, right? So you go downstream, you're going with the river. So you'd think, well, downstream, therefore, service A to service B is downstream, right?
No, it's not, because why make it easy with a mental model like that? Downstream is actually service B to service A. Now, that's a confusing concept, because mentally it didn't click for me either. But then you've got to think about where the term came from. The term was coined when people were starting to think about HTTP, and you're thinking about it in terms of a request and a response. The reason that downstream is B to A is because you're thinking about the flow of data. At the time these terms were coined, the bulk of the data was going downstream: you would make a request for data upstream to a service, and then all of those bytes, all of that information, would flow downstream back to the original requester. So that's the downstream and upstream distinction, and again, a lot of the metrics inside of Envoy are rooted in downstream and upstream, so it is an important concept to remember.

And finally we have the cluster, and a cluster is just a group of logically associated endpoints. In Envoy that can be quite disparate. For example, payments, which we have in our example application, consists of three pods, so there will be a cluster in Envoy which has three endpoints, each corresponding to one of the payments pods. A cluster in Envoy doesn't have to be that straightforward; you can have a cluster which is pulled from multiple different types, and there are concepts around being able to build virtual services. But ultimately a cluster is a collection of endpoints, including your local service. That's the way to think about it.

Now, the example application that we're going to run through is a three-tier application, and it's built of three different services. We have API, and API is a single instance.
It's an HTTP RESTful service, and it makes requests upstream to the payments service. Now, the payments service is actually three instances: it's comprised of two instances of v1 and one instance of v2. The v2 service has been deliberately misconfigured so that it reports 20% errors, because we want to be able to see some errors in our metrics. So we've got a deliberately badly configured application. API makes one request to payments for every request it receives, so it's a one-to-one relationship. Payments then talks to currency, and currency is a gRPC-based service. Again, one of the two instances, the v2 service, has been misconfigured so that 20% of any requests to the v2 instance of the gRPC service are going to result in an error, and that's going to return you gRPC error code 13. This is a key thing that we'll get to: I don't know how familiar you all are with gRPC-based services, but they don't use HTTP response codes, they use gRPC status codes, so it is a slightly different concept.

From a graphical perspective it looks like this, and look, I'm going to hold my hands up and say it's a bit of a weird way to lay out a flow diagram for service-to-service communication, but it's trying to fit it all on the page. You can see, as I say, we have three instances of payments, one of which is misbehaving. Somebody made a typo there. Yeah, that's mine; I'm not going to answer that right now. So we have two instances of currency, v1 and v2, with currency v2 at the same 20% error rate. So again, we'll be able to interrogate that and see those metrics. And this is an important bit.
We're going to be using Grafana most of the time. The metrics stack that we have in our example application is Prometheus and Grafana. Whether you're using Prometheus and Grafana, or Datadog, Lightstep, Honeycomb, whatever metrics platform you're using: other than the subtle differences around the language (we're going to be looking at PromQL in order to build our dashboards), it's going to be fairly translatable.

Now, Grafana. You're going to have to log in the first time you use Grafana; we used what we think is probably the most secure username and password, which is admin/admin. Grafana will ask you to change your password. Just say no, because if you do change your password, and then it logs you out and you forget what the password is, then I can't help you and you're going to have to start again. So, in the spirit of security: we'll have all this code available for you afterwards, still with the same passwords. This is a workshop environment; we do silly stuff so we can all learn from it.

Nick was kind enough to say that we deliberately made that service suck about one out of five times. That's my doing; I'm very proud of that. More importantly, though: when you deploy this, make sure your network is not open to the world. There are way better ways to set up those passwords. Don't use admin/admin, it has to be said, because then it's not our fault. Yeah, but who's honestly going to think... no hacker in the world is going to believe that people are going to be stupid enough to use admin/admin; therefore admin/admin is probably the greatest password. Depends on the dictionary attack that you're going through. Yeah, do we have the disclaimer in there that covers anybody getting sued for using admin/admin as the Grafana password?
Then we're not liable. We don't have that in the documentation, just so you know. Okay.

Now, the other thing that you're going to use: we do have the interactive terminal there that you used before, but we've also got a terminal which is built into the documentation. There are going to be some steps in the workshop where we're going to want to dig into Envoy and look at things like the config dump, because we want to understand some of the topology, and understand how we can identify metrics by actually looking at the Envoy configuration. For that, there's a terminal built into the documentation which is going to allow you to do that. For most things you don't need to copy-paste or type on your own; there will be a big blue button. And if you can click the big blue button there, Karim, please. My environment just timed out, so I'm going to reload this one real quick. Okay. Clicking the big blue button will automatically run the command in the terminal. Most of this stuff is literally just exec-ing into a pod and running a curl to get the config dump, or to look at the clusters and things like that. And with that, we're ready to begin.

We're just going to restart our environment. The environments will run for two hours. If you come back to this, say, tomorrow, it might take about three minutes for an environment to create, because we currently have them switched to hot start today. Just bear with it; it's doing a lot of stuff. It's creating a Kubernetes cluster, it's deploying a bunch of applications, it's installing Prometheus and Grafana and the service mesh and all of that. And as somebody who has worked in this industry for quite a while: three minutes is nothing short of a miracle, when it used to take three months to get an individual Apache server. No, not at all.
The pods should have come up. So that might just be... is everybody getting Pending on their pods? Okay, so yours is running now; you're still Pending. Yeah, let's have a look. I hope it's not getting blocked by Docker Hub or something like that. Let me... yeah, just do it. So while Nick looks at that, a quick note about this environment: Shipyard makes it easy to spin up a ton of services that you want to try out. We've got a couple of different templates, and when I say a couple I mean a couple hundred, for pretty much everything that you can imagine. For a repeatable local development environment this is very nice; it helps you out. Go ahead.

Have we changed the size of the machines? They're all eight- or twelve-gig ones. Okay, so we're getting some "insufficient CPU". Which pod is that? All right, I'm going to hot-fix this: I'm just going to basically remove the resource boundaries, because it doesn't actually use a great deal of CPU, and we can continue, because the first part is predominantly going to use the cluster for API and looking at the connections there. Is everybody pretty much getting that on the payments pod? All right, don't worry, it's not the end of the world.

So: connection metrics. The first thing we want to look at is connections. We're going to look at connections, we're going to look at requests, we're going to look at gRPC services and methods, and then we're going to dig into retries. There's limited time available, but what I want to be able to take you through is what I would see as a lot of the core things, but also give you the ability to start to understand where you can go and discover things further.

So, Envoy has very common connection metrics. What you have are active connections, total connections, and destroyed connections.
These are the core things, and they're comprised of a gauge and counters. Do you understand what a gauge is within metrics? Everybody understands that? Yeah. So a gauge is basically a snapshot: at present, there are n items of this thing. Whereas a counter continually increments. So with a gauge, as I say, it's a snapshot in time; with a counter you've got the ability to look at change over time, so requests per second or connections created per second. Now, "connections created" doesn't exist as an Envoy metric, but what you have is total and destroyed. So what you can look at is the increment of the total connections; that's going to give you the number of connections that have been created over time, and the connections destroyed is, obviously, the number of connections which have been destroyed.

Why is this an important thing to look at? I think it's fundamentally interesting for the health of your applications, because what we're going to see when we're looking through this is that it's not always good if connections are opened and closed all of the time. You've got to think about the way that the service mesh works: the service mesh is making an mTLS connection from proxy to proxy. The mTLS process requires a handshake; you've got to do the various key exchanges and things like that in order to start encrypting traffic. So in certain instances it's not ideal to have lots of connections coming up and going down. In an ideal world, what you want is a connection pool where you're reusing those connections, and I think for the most part that's actually okay for an application.
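To make the gauge versus counter distinction concrete, here is a rough PromQL sketch. The metric names follow the Prometheus-flattened form of Envoy's downstream connection stats as Consul's sidecars expose them; your mesh may name or label them slightly differently.

```promql
# Gauge: a point-in-time snapshot of currently open connections
envoy_listener_downstream_cx_active

# Counters: they only ever increase, so wrap them in rate() to get
# connections created (and destroyed) per second
rate(envoy_listener_downstream_cx_total[1m])
rate(envoy_listener_downstream_cx_destroy[1m])
```

If the created and destroyed rates are both high while the active gauge stays flat, connections are churning rather than being pooled, which is exactly the mTLS handshake cost described above.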
It is okay for different HTTP requests to be pushed over a connection that's been reused from a different request. Now, we're going to look at all of those things, and we're going to look at how we can get that information, and also how we can query that information.

So, first things first: downstream connections. The first thing that we're going to look at is active API downstream connections. These are the number of connections that are coming in to your service, and we're going to look at the API, which is the entry point. So with that, if you can click on your Grafana tab, what I would like you to do is go to Explore, which is this button here. We have two tabs anyway, so you've probably got a tab, I think the second one, which will take you direct to that. The password is admin/admin, and don't change it, because we're keeping things secure. Okay. If you just log in again there, it's just admin/admin. It's on this one here, the little wheel; click on that. Okay, cool.

So, if you've not used Grafana before, or even if you have and you weren't aware of this: the Explore capability is really good when you're doing some metric archaeology. It's a really easy way to start looking at some of the metrics in the system without having to go through and create a dashboard. And the first one we want to look at is active downstream connections. So this is what we're going to end up building, and there's a lot of information there, so we're going to look at this and we're going to filter through it. Can you scroll down a little bit for me, please?
All right, cool. First things first: if you click in your Explore tab and just start typing "envoy listener downstream", or even just "downstream", you're going to see that Grafana pulls out all of those metrics from Prometheus. The specific metric that we're looking for is envoy_listener_downstream_cx_active; "cx" is the shorthand for connection. So I'm going to do this with you: I'm going to grab this over here, log in as admin, get asked to change my password, and press "not now". So we're in our Explore tab.

There's a huge number of metrics available; Envoy is very granular in the stuff that it emits, and it can sometimes be quite confusing as to what you should use. But what I'm going to do is look at downstream connections active, and I'm also going to change the duration here to just the last five minutes. We're pushing some fake load through the system, approximately 12 requests per second. So let's have a look at this. Now, we've got a number of different metrics, and we've got connection metrics which are being pulled from a bunch of different services as well. We want to specifically look at API, but we're getting data here from all of the other applications and services. And what we have is connections active, so you can see here.
We've got another one, connections active, and here again there's another, and another; lots of different metrics. What we want to be able to do is filter this information, and to filter it we can use the labels that Envoy is providing for us. So let's add a label filter: the envoy_listener_address. If I go over here and I add it, the auto-complete shows HTTP and listener; what we want is the listener, so I'm expanding that, and I want envoy_listener_address. The envoy_listener_address is going to show me the connection metrics for each of the listeners that exists in the system, and you can see there's a lot of them; there's a bunch of different listeners present. For example, we've got a number of different ports, because you get one for every port that Envoy is listening on. In terms of Consul service mesh, the public listener, the inbound connections, is always going to be running on port 20000; that's just a standard that we use, and different service meshes will use different things. But I can apply that filter there, and what I can do is get the connections in this specific instance for that pod on port 20000.

Now that's great, and in the instance of API we actually only have a single instance, but what we want to be able to do is some sub-filtering, because you can actually have multiples of these pods. You can also use regular expressions. So, for example, let's make this more generic.
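The filter being built up here, first an exact listener address and then, as we are about to do, a regex across every port-20000 listener narrowed to a single service, might look roughly like this. The address is illustrative, and the label names are those emitted by Consul's integration; other meshes will differ.

```promql
# Exact match: one public listener on one pod (address is illustrative)
envoy_listener_downstream_cx_active{envoy_listener_address="10.42.0.5_20000"}

# Regex match: every public listener on port 20000, narrowed to one service
envoy_listener_downstream_cx_active{
  envoy_listener_address=~".*_20000",
  consul_source_service="api"
}
```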
Let's get all listeners that are running on port 20000: we can just change the PromQL query to use a regular expression. Now what we have is a bunch more series. This is all of the Envoy public listeners that are in our system, but you can see that you have the currency service and the payments service and all sorts of things in there. So we want to add another filter, and again we can use the labels which are attached to the metrics to do that filtering. I could filter by the pod. Now, pod is a label that isn't added by Envoy; Envoy has no idea of your pod. The pod label is actually added by Prometheus when it scrapes. So I could use pod, but I also have some Envoy-specific labels such as consul_source_service, and if you roll your mouse over here, you'll see all of those labels: the source datacenter, the namespace, the partitions, the listener addresses, the instance. All of this metadata is associated with your metrics. So we want to specifically look at the API service, and what we can do is filter using service, and again Grafana is going to auto-complete for me. So this is cool: now what I have is a bunch of metrics. My regex is not actually working for me here, so let me just... well, we figured this out; if you run into any trouble, just raise your hand or come over. There we go, and we've got that now.

What you're seeing is three series. Whereas we've only got a single API. So why are we seeing three series, basically what looks like
three different listeners? The reason for this is that this is actually the way that Envoy does its threading. Envoy, in terms of its connection statistics, is going to give you a line item, a data series, for each of the internal worker threads that it uses for its concurrency model. And again, depending on how your service mesh is configured, it might only have a single thread. So in our instance, what we're really interested in is the overall connections, so we've got two options: we can sum the worker threads, or we can just look at the main thread. The documentation explains why Envoy gives you these individual line items across each of the workers. So what we want to do is take a sub-filter. You see that envoy_worker_id equals 1, but we want to just get the main one, so we'll do a sub-filter: I'm going to add envoy_worker_id, and we're going to say envoy_worker_id is blank. There we go. That's now just giving us the main thread.

So let's grab that statistic. What we're going to do is create a new dashboard, and we're going to go through and start building up a panel. You can use your other tab there, so that you can keep your Explore window open, and we go over to Dashboards, New dashboard, and I'm going to add a new panel. Then what I can do is just grab this query from there and drop it into my dashboard. I'm going to give this a name: "API connections". So again, we've got the ability there. We're going to change the time; let's just look at the last five minutes. If you look at the data over a greater duration, you can see it does change over time.
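The two options mentioned a moment ago, summing the worker threads or keeping only the main aggregate series, might look like this in PromQL. This assumes, as in this environment, that the per-worker series carry an envoy_worker_id label and the aggregate series carries none; check your own series before relying on it.

```promql
# Option 1: sum just the per-worker series
sum(envoy_listener_downstream_cx_active{envoy_worker_id!=""})

# Option 2: keep the aggregate series, which has no worker ID label
envoy_listener_downstream_cx_active{envoy_worker_id=""}
```

In PromQL, matching a label against the empty string selects the series that do not have that label at all, which is why the second query isolates the aggregate.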
So the connection pool is expanding, but we've got that information there. The next thing is that we can give that a name, so we're going to call it "active downstream connections"; just paste that into the legend, and we've got our active downstream connections.

The next thing that we want to do is start looking at upstream connections, and this is where things are going to get interesting, because the payments service is not actually running. It's complaining that it doesn't have any CPU. So let me see if I can be creative. This is the payments deployment; it's a fairly straightforward Kubernetes deployment, with the resources specifying the amount of CPU for requests and limits. I apologize: the machines that we spun up to create the environments haven't got enough CPUs allocated to them, so the application is not starting. Yeah, it's going to depend on which pod has come up first. I'll tell you what, for now, let's just restart. So I apologize, but let's restart our environments, because we've just upped the CPU; it's going to be quicker than faffing around trying to remove the resource limitations. So if you just go back to... I apologize about this. It was fine yesterday, but somewhere between getting on a plane and getting off a plane... And it's the way that we've built the workshop: everything is very, very dependent upon this app having three tiers, because if the payments service isn't starting, then you're not going to get any requests through to the currency service, so I'm not going to be able to show
So I'm not gonna be able to show the GRPC stuff and Yes, so so I apologize about that But but again, you know what what I will what you can do is now We've we fixed the number of CPUs you will be able to kind of work through this at your own leisure as well So we'll you know, we'll run through these things and I'll explain them to you because we're a little bit kind of short on time I apologize that we're you know, we're not we won't have the the leisure to to kind of go through much in depth but We can go through will explain the concepts and you can go away and you can literally just spin this up when you get back home and run through it your yourself and obviously feel free to Reach out to me and we're just creating a new a new link for you which will Just be easier to I'm just getting a bit link for you. So can you go back to the The the main page there and we just get a link for you to get back there And we're just gonna stop those machines. We're just gonna restart them with one which has more CPUs at the start of your screen is This beautiful Stop button right next to continue track Remember that time in college when your teacher was basically like yeah, you got to keep working on this right now You don't just hit the stop button Terminate that instance like it hurt you And we'll get a new one because in fact it did hurt us It wasted our time by being too underpowered Life there we go to this URL So it's this one here So you just pop to that URL. Oh, so it's on the screen. 
There we go. Demystifying metrics is also very much about typing long strings into address bars and Grafana search query bars. Pretty much that. And as before, if you run into any trouble, raise your hand and we'll be right over to help you figure it out. Imagine in the olden times if it was that easy to get 16 gigs of RAM: just like that, a new machine. Well, you could argue we probably should have gone for the 16-gig option in the beginning, because it would have been such a pain in the backside to change it; we should have gone all in on the maximum.

Has everybody got that URL? Then I can just go back and click my Start button. Let me switch the ring light on again, because that's going to be easier. I forgot we make people actually read that: "start track". Actually, what I'm going to do is hit Stop, so everyone can see that, and then "start track". Let's see. There it pops in. Still getting a few that are Pending; that's the old one. Yeah, so let me switch to this one real quick. From a code perspective, it should give you a new environment. All right, let's not; we're going to run out of time.
So let's look at what we can do. We can start digging into HTTP requests, because I think we can still look at HTTP requests with regard to the API service, and we can start looking at some of those metrics. So with that, let me just log in here.

Previously, what we were looking at is connection metrics. Now, the thing about connection metrics is that they are useful, but they don't really tell you what's going on, because in the instance that you're using connection pooling, you have multiple requests which are using the same connection. A connection will also only tell you if a connection has succeeded or failed; there is no knowledge there of whether a request has succeeded or a request has failed. The only way that you can get this is by looking at a layer 7, or application-level, protocol as a metric, and Envoy has the capability to do this.

So, when we looked through and said we've got listeners and we've got listener filters and things like that: for Envoy to emit request-level rather than connection-level metrics, we need to use a filter. Generally, your service mesh is going to do this for you; you're not going to have to make this configuration yourself. In Istio, what you would do in order to configure something as HTTP is use the service declaration and then specify the app protocol; if you're using Consul, you use our configuration entries. So the API service has actually been configured as HTTP, and what that does is create this configuration.
I think it is useful to understand Envoy's internal configuration, which you can get from the config dump, because when you're starting to look at your metrics and you want to understand which particular listener you should look at, what a metric is, what a label is, and what it means, then the Envoy configuration is actually a good place to go digging. This is what would have been configured inside of Envoy: we have an HTTP connection manager, and the stat prefix there is the public listener, so that'll be some of the labels that we're adding, and we don't have any route configuration other than everything going to the local app. So that's the base-level configuration, but it means that for every connection made to Envoy, Envoy is going to go: right, I know these connections are going to be HTTP. What it does internally is decode the data flowing over the connection into its HTTP parts, which it can then use to define metrics, because it can look at things like the HTTP response codes, the payloads, and the data sizes in order to report richer information. Again, what we can do is look at downstream requests, so we're going to look at envoy_listener_http_downstream_rq_xx. In this naming convention, again, it's the listener; you'll see that because we have HTTP metrics it has http in the name; and again we're looking at downstream, but this time instead of connections it's requests, and the xx part is going to be the response code. So when I run that, you can see that I've got, well, a huge number of metrics in there. And the thing that's different from when you looked at the connection metrics before: a connection was a gauge.
It's a snapshot of a point in time. A request metric is actually a counter, and because it's a counter, what it shows is the change over time, which means I can run some functions to show requests per second and things like that, and we're going to look at how you do that. The first thing we're going to do is filter it. We want to use some of the filters: we're going to use the connection manager prefix tag again to get our public listener, and we're going to specify our service, which is going to be the API service. So if I go into my box there and filter that, what we can see is a graph which is increasing over time. The other thing you'll see is that it's giving the HTTP response code. Now, because we've got that payment service which hasn't deployed, well, every request to the API is resulting in an error, which is why we have, here, if you look at envoy_response_code_class: for every HTTP response code class, one through five, Envoy is going to add a label to the metric, which means you have the ability to have an individual series inside your dashboard. And this in itself is not particularly interesting, because you don't want to know the number of requests as they're increasing; what we really want is how many requests per second we have. In order to do that, we can use some functions. So in the connections tutorial, what we were showing you is how you can use the Prometheus function rate in order to convert that counter into requests per second.
So let's add the function rate; I'll add rate here. Now, what rate requires is basically a bucket size. The bucket size is based on how often Prometheus has scraped the metrics from your Envoy proxy. For example, if Envoy is only getting scraped every 60 seconds, then you can't have a resolution which is finer than 60 seconds, because the cardinality of your data is 60 seconds. So you specify the bucket size to set what you want the aggregation, and therefore the smoothing, to be. Now, Grafana has implemented this really handy variable called rate interval. Rate interval is not a Prometheus PromQL capability; it is something specific to Grafana. What this variable equates to is basically four times the minimum resolution: Grafana will figure out what the scrape duration is, multiply it by four, and that gives you, in most instances, a pretty good bucket size. Why does this hate me; this just killed my instance. So now what we can see here is the requests per second, because we've used that rate: it's converted the counter into requests per second. And if you look at the number of requests you're getting here, it's around about 10, which is approximately the same as the connection pool we were seeing earlier. It does seem to have, actually, has that restarted? Oh, my pods are all running. That's good news for me. So we've got that information; we can see requests per second. What you see here in the yellow line is anything that resulted in a response code 200; what you see on the red line is anything that resulted in a response code 500, or 5xx. The reason you're seeing the 5xx is because one of the payment service instances is reporting errors, so you're going to get errors coming back.
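The query built up above might be sketched like this; the listener prefix value is an assumption about how this environment names things, and the label names follow Envoy's default tag extraction:

```promql
# Requests per second on the public listener, split into one series per
# response code class via Envoy's envoy_response_code_class label.
# $__rate_interval is the Grafana variable discussed above.
rate(
  envoy_listener_http_downstream_rq_xx{
    envoy_http_conn_manager_prefix="public_listener"
  }[$__rate_interval]
)
```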
So this is actually as expected. What we can see here are also 400s; well, it's an empty line entry. The first thing we can do is take this and put it into our dashboard, and we'll look at how we can clean this up and create something a little bit more useful. From Explore I want to browse, new dashboard, new panel, and we can see that there. Now what we can do in terms of the legend is use that information in the legend. As you can see, we're using envoy_response_code_class; that's one of the labels from that metric, so we can take it and use it in our legend here. What this is going to show us is that all of those series are nicely named based on the metric label. We can also change the mode into Table and enable various other bits and pieces on there, so we can see some details; you'll be able to see the averages and things like that, and that's all configurable from those chart options on the right-hand side. Now, what we don't really want is to see these legend items for 1xx and 4xx when they don't really exist; it's just that Envoy is reporting a metric with a label which has a zero count. So what we can actually do is add a little clause there saying where it's not equal to zero, and what that does is just clean up our metrics.
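The zero-count cleanup described above is a small addition to the same query; the listener prefix value remains an assumption about this environment:

```promql
# Appending "!= 0" drops the flat series (the empty 1xx/4xx classes) so
# they no longer clutter the legend. For the legend itself, a format
# like {{envoy_response_code_class}} names each series by its class.
rate(
  envoy_listener_http_downstream_rq_xx{
    envoy_http_conn_manager_prefix="public_listener"
  }[$__rate_interval]
) != 0
```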
So we've got that information there. All right, as I said, the way this was all set up, barring not having enough CPU, is that you have the ability to run through this as a self-paced workshop, and what we're going to do now is move on a little bit so I can explain some more of the bits and pieces. One of the metrics that I think is really important, again among the common things you're looking at (you're looking at errors, you're looking at retries, you're looking at some of the reliability patterns that you've configured inside of Envoy), is latency. Latency tells you whether your applications are within normal bounds; it can tell you whether you've had a regression in your deployment, or, importantly, whether you have any outliers and things like that. Now, there is a blue button on there which will basically dump the cluster stats. Let me just run that and I'll show you what's going on. What I'm doing here is basically kubectl exec, and all I'm doing is calling Envoy's admin port, which in our system is running on 19000, so I can call /stats/prometheus to get the raw information, and I'm just running a grep on that to get the time-bucket metrics. If you look, you'll see that you're getting a bunch of different metrics. For example, here we've got 846, so the local cluster API is 846. What that means is: what Envoy has is a histogram. What Envoy is doing is grouping requests into buckets based on their duration.
So in this instance, it's less than or equal to half a millisecond, and then here we've got less than one millisecond. So the number of requests: anything here, 3000 took less than 10 milliseconds, and so on. By using these histograms, what we can actually do is represent this and get the data out of the buckets. So we can look at the 50th percentile, which is the median, or where the bulk of your requests are coming in, and then we want to look at, as I say, the outliers: the 95th percentile, maybe the 99th percentile, the 90th percentile. So let's just go through and put this into a chart, and we'll have a look at how we can use that histogram_quantile function. The raw data, well, it looks like this. What you're seeing is like 10, 12; now, this is not the duration, this is not milliseconds. What this is is requests per second that took less than, where is the line item in there, less than 50 milliseconds. So it's basically the histogram information, and reported like this it's not particularly useful. What we're going to do is transform it into something more interesting. Now, what Prometheus has is, oh yeah, we can do this, a function called histogram_quantile, and what histogram_quantile does is take the histogram metric, basically run a bunch of maths on it across the buckets, and pull out the 50th percentile, the median, and then report that as the actual duration. So this here is actually 38.8 milliseconds as a duration. So now, over the last five minutes, you can see that, within reason, our application is fairly static.
It's kind of going from, oh cool, it's going from a minimum here of around 38.2 to a maximum up there of around 40, and it's fairly consistent. But what the median doesn't really show you is the outliers, and the outliers are important. So let's put this into our dashboard: I'm going to add a panel, and I'm going to paste in there; we've got the 50th percentile, and we're also going to report, let's just do, the 95th percentile. You can see, when I change this, that the 95th percentile for my application is running at around 90 milliseconds, so five percent of requests in my system exceed 90 milliseconds, which is great, because now we can start to see the outliers. When you're looking at why this information is useful: you want to be looking at whether there is a really big difference. This, I would actually say, is fairly normal; it's just normal behavior. But in some instances, what you will see is errors. You might look at your latency, and latency could be fine, but you might be starting to see errors; there may be a spike in latency. Now, that spike in latency could just be increased traffic, it could just be that the system in general is slowing down, or it can also be an outlier.
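The two percentile panels described above can be sketched like this; the histogram metric name is an assumption based on Envoy's usual upstream_rq_time naming, and in practice you would filter by your service's labels as well:

```promql
# Median request duration (milliseconds) from Envoy's time histogram.
# Buckets are summed by their "le" bound before computing the quantile.
histogram_quantile(0.5,
  sum by (le) (rate(envoy_cluster_upstream_rq_time_bucket[$__rate_interval])))

# 95th percentile: 5% of requests take longer than this.
histogram_quantile(0.95,
  sum by (le) (rate(envoy_cluster_upstream_rq_time_bucket[$__rate_interval])))
```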
So if you think about something like a database lock: a database lock, which basically slows down the requests for every single individual in the system, will pull up your median. But if it's not a median increase that's spread across all your requests, it's a single request, right? So looking at spikes like that is really important when you start to do your performance tuning and when you're looking at your system's health. You always want to be reporting, as I say, probably the median, which is useful for day-to-day running, and your outliers: the 90th percentile, 95th, maybe 99th percentile. So now we've got the ability to see the requests per second, but also the time it's taking for each of those requests. Yes? Yeah, so that's a very good question. The question was: from the 95th percentile, is there a way to determine if it's a specific pod that is causing that? Because you might have a noisy neighbor on a machine or something along those lines. What we're actually showing here is only a single instance, but if you look, you've got the instance label there, and you have the pod label as well (oh yeah, I keep forgetting about the laser pointer), you've got the pod label as well there. So you can use all of that information. Now, ordinarily, what I tend to do is sum these: I group them together to look at a service as a whole, as opposed to an individual service instance. But to your point, when you find an outlier, it is actually possible that it's not an across-the-board problem.
It could be an individual instance, and the metrics are all there to let you dig in and do your forensics. Yeah, generally, too much information can be overwhelming. When you're using your dashboards on a day-to-day basis, you want to represent the metrics that give you enough information to understand that something's wrong, not necessarily the specifics of what is wrong, because you want to be able to take a quick glance and go: is this okay, is this not okay? You've always got the ability to dig deeper and look at other dashboards, maybe more specific ones, or you can just go in and use the metrics explorer to dig in and try to find the root cause. If you put too much information on your dashboards, it's just noise in the signal: you can look at it and go, well, I don't really know if there's anything there. So it's better to keep it simple. All right, so: gRPC services. We said that gRPC is HTTP in a sense. Well, it's HTTP/2, in the sense that gRPC requests are multiplexed over HTTP/2, but you're not using HTTP in a RESTful sense. You're not really using your HTTP verbs like GET and POST, and you're certainly not using your HTTP response codes in the same way. So in order to determine service problems and service traffic for gRPC-based services.
You've got to look at the specific protocol, and Envoy again has a filter for this: the gRPC stats filter. What that does is, for every request, Envoy will basically understand that this is a gRPC method call, not a plain HTTP request. It will decode it, pull out the service, the method name, and, importantly, the error code, and it'll be able to create an individual metric for that. What it will also do is attempt to create some generic metrics that you would use as if it were an HTTP request. It does that by default; you can disable it, and maybe your service mesh will or won't, it depends. But the beauty of looking at the gRPC-specific metrics in Envoy is that you get method-level and service-level information, so you can see things very granularly. If we look at some of the common metrics, you have the basics: method success, method failure, and method total. Now, success and failure: gRPC basically gives you an error code of zero for a success, and anything which is not zero is classified as a failure. Though that could be, well, it depends on how the service has been implemented; I've seen gRPC services which will return, say, an error code to indicate that an item doesn't exist, or something like that, so I think it's of somewhat limited use to just look at success and fail. But what you also have is the ability to look at more granular requests, and Envoy is going to create metrics which are rooted like this.
So we have envoy_cluster_grpc, then the name of the gRPC service, of which an application may have multiple, and then the gRPC method, which here is Handle. So let me just grab this and paste it into the Grafana Explorer here, and we'll see that you've got a number of different metrics. We've got metrics ending in zero, so successful requests, and failed requests, which end in 13. Now, the currency service returns gRPC error code 13 on a failure, so that's why we see those two things. But when you look back and think about the request metrics: what you were seeing with the request metrics was that you had a metric label which held the specific error code. We're not getting that with the gRPC metrics, so we need to get a little bit creative, which is probably the polite way of saying hacky. But we can do quite a lot with PromQL. The first thing we can do with PromQL is that every metric can be referenced by its actual metric name, so envoy_cluster_fake_service_handle_0, but you can also use a very generic way of addressing a metric, which is to use the braces with the generic label __name__, which is the metric name, and then use a regular expression. For example, here, if I run this, what I get are all of the fake-service Handle metrics for all of the different codes. So we can see all of the zero codes, we can see all of the 13 codes, and again, we've got three instances of the payment service calling two instances of currency, so we have an individual metric item for each pod. The next thing we want to do is report this and get that error code as a label.
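A sketch of that generic __name__ selector; the metric root here is an assumption about how this workshop environment names the Handle method's per-code counters:

```promql
# Select every per-status-code variant of the Handle metric by matching
# on the metric name itself, rather than listing each code by hand.
{__name__=~"envoy_cluster_fake_service_handle_[0-9]+"}
```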
So what we can do is actually extract the label. To extract a label, we can again use a PromQL query, this time using label_replace. What label_replace allows us to do is define a regular expression and say: from this existing label, add a new label. It uses this syntax here: the metric that you want to use, the label name that you want to add, the regular expression match group, the label that we want to search, and a regular expression which does the search. So you end up with this: we have label_replace, we're using that generic metric approach, we're saying we want to create a new label called code, we want to use match group one, we want a source label of __name__, and then we have a regular expression here which is basically pulling out the error code. What I'm also doing is wrapping that in a rate, so that I'm actually getting the number of requests per second, and you can see now that we have the code label. Okay, now, there's a little trick here, and I'll be 100% honest: I'm not entirely certain how this works.
It seems to be some sort of magic inside of PromQL, but by adding an additional colon to your duration (what you basically have is the precision), what I'm actually forcing Prometheus to do is run a subquery, so that label_replace returns the correct vector that rate needs to be able to use. It's an interesting one. So now, we don't necessarily want to look at these as individual line items; we can group them together, and to group them together it's just a simple matter of using the sum query. I can wrap all of that in a sum, and then I can say by, and what I'm going to use is code, which is the label that I've extracted. Now what we can see is a nice clean chart which is showing us that approximately 9.7 gRPC method calls per second resulted in a status OK, or a code zero, and about half a request per second resulted in a status code 13, or an error. So we've got that ability. We are out of time, but this Instruqt lab will be available for you to run through at your own pace, and it will have the correct number of CPUs, so you're not going to have to deal with painful payment pods failing. I apologize for that. We'll keep it up for, well, a couple of weeks, a few weeks; don't worry about that.
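Putting the gRPC pieces from this section together, the full query might be sketched like this; the metric root is still an assumption about this environment, and label_replace's arguments are, in order, the vector, the destination label, the replacement, the source label, and the regex:

```promql
# Extract the trailing gRPC status code from the metric name into a
# "code" label, convert the counters to per-second rates (the [1m:]
# subquery re-evaluates label_replace so rate receives a proper range
# vector), and group the result by code.
sum by (code) (
  rate(
    label_replace(
      {__name__=~"envoy_cluster_fake_service_handle_[0-9]+"},
      "code", "$1", "__name__", ".*_([0-9]+)$"
    )[1m:]
  )
)
```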
It'll be fine. You can also, and we'll put a link in the description of the Instruqt lab, actually just download this lab and run it locally. All you need is Docker on your local machine. If you've got Docker and a little application called Shipyard, you can spin everything up locally, including the documentation, to run through the lab on your own machine, and then you can of course play around with things, make some changes, and so on. We'll give you the link here where you'll be able to download that, and I will put a README in there. It's at hashicorp-dev-advocates/envoy-metrics-demystified, and we'll put the instructions there so that you'll be able to run it offline on your local machine. The only thing you need to remember is that on Sched, sched.com, Sched, Instruqt, no, well, wherever you find the schedule (I'm guessing that's Sched), we'll have the links in there, easy for you to click, either through the browser-based environment or by just cloning it yourself and working through it. And for fun, we also included the image that we use as a backend, so if you want to have some fun with that, go ahead. But yeah, we have additional labs in there as well. We go into looking at how retries work, and we kind of glossed over some of the connection-level stuff, but we look at things around the connection level, including how you can actually identify whether or not you're using connection pooling. All of that information is in that lab there. So thank you.
Thank you so much for bearing with us. Lunchtime! I hope you all have a lovely rest of your ServiceMeshCon and a great KubeCon, and if you've got any questions, just reach out to us; we're more than happy to dig in. Just out of interest, what service meshes are you all using? Predominantly Istio-based? Yeah. So the metrics are rooted the same; it's just Envoy, and the Envoy documentation is really good. You'll find we've got the links in here, but it covers the statistics for each of the different components: Envoy will tell you what each statistic does, and it goes into depth on the varying things, and there are a lot of them. I think the key thing is learning and understanding what the most important ones are, the high-level ones, because there is a lot of stuff that goes very, very granular but might not be for day-to-day usage. What we wanted to cover is the stuff that we feel is day-to-day, which is connections, requests, packet sizes, durations, gRPC, and some of the reliability patterns. Thank you. With that, thank you, and see you at the next one.