OK. Nǐ hǎo, dàjiā hǎo. I'm Nic Jackson. And... my Chinese is terrible. I try. I have a Duolingo subscription. I've been trying. Maybe next year. So I apologize. But it's a pleasure to be here. I always love coming to China, and it's wonderful to see you all. Today I want to talk to you about network observability with Envoy. Where's my clicker... OK.

So, I'm a developer advocate. I work at a company called HashiCorp, and I've been playing around with service mesh, Envoy and the whole observability sphere for about 12, maybe 18 months now. I've found that there are some concepts which are pretty difficult to grasp. So what I want to do today is give you an introduction to what observability is and how you can use the absolute myriad of metrics which come out of Envoy. We'll also take a look at tracing. Whether you're using tracing or not, it's interesting; my background is that I've used metrics for basic statistics for about five years, and tracing is a slightly different concept, and we'll take a look at that. And very briefly I want to look at logging, and why it's actually really important in addition to all the other stuff.

All right, why are we here? How have we got ourselves into this situation where we need to be thinking about things like service mesh and network observability? Well, we're here because we've moved from static infrastructure to dynamic infrastructure. The whole world has changed. We've changed the way that we're running services: we're running Kubernetes, we're running multi-tenanted, we're running smaller services. And these things cause problems as well as solving them. How do we get traffic east-west? How do we deal with things like the network perimeter? And, most importantly for this talk, how do we understand what's actually going on inside of that environment?

So the market trend has been in this direction. I wrote a book on microservices a couple of years ago, and I'm hoping to publish a new edition updating it this year, but one of the things that I see as a real benefit is this movement towards smaller units of work: smaller instances, smaller application instances, container-based services rather than huge monoliths. There are a number of reasons why that's beneficial. It's easier to deploy; you're deploying smaller instances, so there's lower risk. It allows collaboration. But it did require us to rethink pretty much completely how we manage and understand our systems.

So the benefit of this, as I said, is productivity. Microservices, as I see them, are a developer tool; the benefit is to those of us who are working on the systems and coding them, more so than to the business. But there's a business benefit too, in that if we can ship code quicker we can get benefits to the end users quicker. There's a cost as well, because when you're deploying these multiple instances of services you need many, many different load balancers and different components of infrastructure. I'm not going to go into all of that today.

Just a note on terminology before we begin, in case anybody isn't familiar. I'm going to be talking about Envoy, and Envoy is a component of a service mesh. A service mesh is generally built from two main components: a centralized control plane and a distributed data plane.
The data plane is where all of your traffic flows. Traffic doesn't flow independently from one service instance to another; it's all proxied in and out through the data plane. The benefit is that, because the traffic is flowing through the data plane, we can understand much better what's going on with our network communications.

And back to that dynamic infrastructure point: networks are not reliable. They're pretty unreliable, so we've got to think about ways of mitigating that unreliability. But before we can mitigate it, we've got to understand it. You can't do reliability unless you have observability.

But observability: it's all over the blog posts, all over Twitter. Is it just a buzzword? What does anybody actually mean when they say the word observability? By definition, observability comes from control theory. It's an engineering term, and it's a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. What we're trying to say is that if we measure inputs and outputs, we can infer the overall health of a system from them.

And what does it encompass? This is the key thing for me: observability is not just about metrics or Envoy statistics. Observability encompasses all of this: Envoy statistics, application statistics, Kubernetes statistics, tracing, logging, health checks, business analytics. Don't forget that last one. If you want to measure the performance of a system based on its external outputs, then what about sales? If a system isn't functioning correctly you're potentially going to see decreased sales, or decreased traffic and customer interaction. Those are business metrics; they're not necessarily system properties, but observability tries to measure all of them. What we're going to look at today is just a small part of that: mainly Envoy statistics and tracing.

Again, when we're thinking about observability, we've got to think about internal and external instrumentation. That means things like Nagios probes looking at disk space; all of these affect the behavior of the system. It means things like health checks, again an external check. But we're also going to be looking at application and network statistics, the things which are emitted from the internals of your system.

So, metrics. When you want to understand metrics with Envoy, you've really got to first understand the architecture of Envoy, because if you look at the documentation there are just thousands and thousands of different metrics, and where do they all come from? Ultimately there are five component parts: listeners, listener filters, routers, clusters, and the control plane. Let's break those down.

A listener is a named network location. This can be something which your downstream clients connect into, but also something which you make an outbound request to. Envoy is going to expose a number of different listeners in your setup; it will have a minimum of about two, one for the control plane and one for its internals.

This is another key piece of terminology: differentiating between what is downstream and what is upstream. A downstream request or response is one which comes from an end-user client, something external. I could be a downstream source when I'm making a request to your website; another service can also be a downstream source. An upstream request, on the other hand, is one which is made out to another service. It's important to understand the difference between these two when we start getting into metrics, because a failure in an upstream doesn't necessarily mean a failure response code is returned to your downstream customer or user, and that's because of the reliability patterns which are potentially implemented somewhere in the filters.

A cluster, in Envoy terms, is a collection of endpoints. That could be either automatically discovered or statically configured, but a cluster contains load balancers and it has knowledge of the endpoints to which it's going to route traffic.
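To make those terms concrete, here's a minimal sketch of a static Envoy configuration with one listener and one cluster. The names and addresses (service_a_listener, service_b, service-b.default.svc) are hypothetical, and the structure follows the v2 API style that was current around Envoy 1.10; in a mesh, the control plane generates something like this for you.

```yaml
static_resources:
  listeners:
  # A listener: a named network location that downstream clients connect to.
  - name: service_a_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 8080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            virtual_hosts:
            - name: local_routes
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: service_b }  # requests routed here become upstream calls
          http_filters:
          - name: envoy.router
  clusters:
  # A cluster: a named collection of endpoints plus a load-balancing policy.
  - name: service_b
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: service_b
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: service-b.default.svc, port_value: 9090 }
```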
So let's look at the configuration. Envoy metrics, depending on which kind of mesh you're using, or whether you're using a mesh at all, are pretty straightforward to configure. There are a couple of key things which I want to point out and really bring to your attention.

The first is the ability to add custom tags. We're going to look at this when we start looking at some of the metrics, but the key thing around observability is that you need to understand where a metric or a statistic comes from, and being able to add additional bits of metadata to a metric is incredibly beneficial when you start to build up your dashboards and your alerts.

The other important, and I think absolutely essential, configuration feature is using all the default tags. If you look at the raw statistics coming out of Envoy, they can be incredibly long; you'll have things like http.listener.192.168..._8080.downstream_cx_... and so on. It becomes incredibly difficult to manage those and build up dashboards when you're dealing with something that complex. Whereas if you use this configuration element, Envoy is going to extract things like the listener name or the cluster name and put them into a tag, which makes selection of metrics much, much easier.
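As a rough sketch, both of those settings live in the stats_config block of the Envoy bootstrap. The custom tag here (pod_name and its value) is hypothetical, and using a fixed_value tag specifier to inject it is my assumption of how you'd wire this by hand; a control plane would normally generate it for you.

```yaml
stats_config:
  # Extract the listener name, cluster name, response code and so on out of
  # the raw stat names and into tags, so metric names stay short and filterable.
  use_all_default_tags: true
  stats_tags:
  # A hypothetical custom tag attached to every metric this proxy emits.
  - tag_name: pod_name
    fixed_value: "payments-6f7dd8c9b5-xk2p9"
```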
On where we're going with metrics: something important to point out is that Envoy has Prometheus support; it has the ability to expose a Prometheus metrics endpoint. But there are a couple of caveats. You really need histograms, and those are only available on the Prometheus endpoint from Envoy 1.10. The other key thing, and when I last checked the changelog I don't believe this has changed in 1.10, is that the Prometheus metrics are exposed on the admin endpoint. The problem with that is that if you make Envoy's admin endpoint public, then not only can people get access to the metrics, they can reconfigure Envoy, which is a security hole. There are a couple of ways around that: you can set up an internal Envoy route which loops back and exposes only the Prometheus metrics path. But I believe that's going to be modified, or the Envoy admin endpoint is going to become an authenticated endpoint, at some point.

So, getting the metrics out. You've got a number of different ways of doing that with Envoy. The simplest is StatsD. StatsD has been around for a long, long time, originally created by Etsy. It's a push-based metrics format which uses the UDP protocol. The key thing about StatsD is that it doesn't support metadata, and we're going to look at why that matters. You'll find that whichever metrics system you're using, they tend to support these four types: counters, gauges, timings and sets. Predominantly the first three are the most used, and on the wire it looks something like the first line of the sketch below.

The inherent problem is that, because StatsD doesn't support tags, any metadata has to be encoded into the name of the metric itself, and that can make querying and filtering incredibly cumbersome. It's a known problem, and it's a solved problem: DataDog, who have a really, really good SaaS-based metrics platform, introduced an extension to the StatsD protocol called DogStatsD. It's the same push-based UDP format, but DogStatsD introduces tags. So now you can start adding metadata: rather than building up names like serviceA.myMethod.call..., I can put the service ID into a tag on the metric, as in the second line of the sketch below. And that's really important, because you need to be thinking about how you filter and break down information. It's generally not an entire system of services which is going to have problems; it can be particular instances. So you need to be able to filter down to a particular instance of a service, and tagging is essential for that.
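Here's a sketch of the difference on the wire, with hypothetical metric and tag names. The first line is plain StatsD, where the service and method can only be encoded into the metric name; the second is DogStatsD, where they move into tags after the # separator.

```text
serviceA.myMethod.call.count:1|c
service.call.count:1|c|#service:serviceA,method:myMethod,pod:payments-6f7dd8c9b5-xk2p9
```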
Prometheus, a great open-source project and a great CNCF project, takes a slightly different approach. Rather than you pushing metrics to the server, Prometheus will pull the metrics from your application, and that pull is done over HTTP. And importantly, it supports metadata. Again you get those key types: counters, simple things like the number of requests, where you want to see the breakdown over time; gauges, like the current CPU level, something which is a point-in-time value; and histograms, which are really, really essential for dealing with timing data. I don't want to be looking at averages; I need to be looking at things like quantiles and percentiles inside my data.

The Prometheus metrics format is pretty simple and straightforward. Here's an example of an Envoy metric, and you can see that in the envoy_http_conn_manager_prefix tag I've got the name of my connection manager. If I wasn't using tags, that would have been part of the metric name, so that use_all_default_tags setting is a really, really important feature.

When you're choosing a format, you've got to take all of those things into account. If you're using Prometheus, which I guess a lot of people do, then you can go straight to Prometheus, but think about how you're accessing those metrics, and that you're accessing them in a secure way, because of the Envoy admin endpoint issue. Ultimately you can also send data out to DogStatsD, DataDog and things like that, but you need to be thinking about metadata, and metadata needs to be more than just the service instance. You potentially want to capture things like the Kubernetes node, the pod name, the deployment ID: as much information as you can which will help you in the event that you need it. And this is the key thing: you've got to capture a little bit more than you think you need right now, because you never know what you need until you find out you don't have it. Right, OK.

So, listener metrics: things like downstream connections. I've pulled out some of the key elements here; there are a huge number, but these are some of the things that I like to monitor. And again, don't forget the metadata, because I need it when I want to build this up into a Prometheus query. I can look at something like the connection total. That's a counter, so I want to see maybe a bar chart or a graph of the number of connections. With counters I tend to like to see the actual number rather than a rate, so I tend to use increase instead of rate, but either way I'm breaking it into a 30-second bucket (the first query in the sketch below).

When you start to view that information, you might ask: are connections really that useful? I think so. The key thing you've got to remember is that internally Envoy uses connection pooling, so you do not have a one-to-one relationship between a connection and a request. But connections are really interesting to monitor because they should be somewhat static, since that connection pool is being maintained. If you're seeing a lot of connections created and destroyed, it can be an indication that something in the system is not allowing persistent connections. Persistent connections give you speed; that's why we use connection pooling in the first place.

There are also some key metrics around connections which I think are really good for alerts, and the key one which I've highlighted here is ssl.fail_verify_no_cert. Within the context of your service mesh, all of your traffic should be using mTLS, so it's using an encrypted transport and client-side certificates to manage authentication. If you get a lot of errors around certificate verification, that could be a problem where you need to think about the configuration of your service mesh. That's a good health warning.

So what about requests? With requests you've got to be thinking about layer 7, and again we're going to break requests into downstream and upstream; remember that difference. A downstream response goes back to your end consumer; an upstream request is something happening internally from Envoy. An upstream failure does not necessarily result in a downstream failure, because of the retry patterns and the reliability features. But downstream is really important, because that's what your end user sees: failures on the downstream mean that your end user is suffering.

Looking at some of those metrics, we've got things like downstream_rq_1xx, 2xx, 3xx and so on; there are a lot of them. When we're using stats tags, we can use a metric like this instead: because I've got use_all_default_tags enabled on Envoy, my metric name is downstream_rq_xx rather than that great big list, because the response code class has been extracted as a tag. Envoy understands the HTTP protocol, so it understands HTTP response codes, and it will inject the tag envoy_response_code_class. That allows me to differentiate: I can say I'm only interested in errors, or only interested in things which are not errors. Because is a 404 an error? Probably not; it's not an internal error, at least. Looking at those kinds of things on a chart, I can plot them and stack them as a bar if I want, and I can see the various different response codes, or split them up; I've got the flexibility to do either (the second query in the sketch below).
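Assuming the default Envoy-to-Prometheus naming, where stats are prefixed with envoy_ and the default tags become labels, the two queries I've just described might look something like this sketch:

```promql
# Number of new downstream connections per listener, in 30-second buckets
increase(envoy_listener_downstream_cx_total[30s])

# Downstream requests broken down by response code class (2 = 2xx, 5 = 5xx, ...)
sum by (envoy_response_code_class) (increase(envoy_http_downstream_rq_xx[30s]))
```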
Now, request errors. This kind of goes without saying, but it is really, really important to monitor request errors, and you should potentially have alerts around them as well. An error is something which is undesirable; it's your end user who is affected, especially on a downstream. So, for example, if I'm monitoring my request errors, I start a new version of my pod, and all of a sudden I go from zero request errors to a whole bunch of them, I can configure an alert which is looking at that percentage increase. That gives me early warning that the new version I've just deployed is not working as expected.

Timing. Timing is incredibly important, and what we want to look at are things like the histograms, because we want to be looking at, say, the 99th percentile, where 1% of the users are experiencing a response time of that level. Mostly you're going to be looking at the 50th percentile, where the majority of your users are, but you want to be able to break those things down. Means are somewhat useful, but histograms are so much more descriptive when you're looking at the technical performance of your system.

But what about gRPC? gRPC is HTTP, right? HTTP/2, yeah, but, well, kind of: it uses HTTP/2 as the transport, but gRPC doesn't honor things like HTTP status codes. With gRPC, the status code is encoded into the protobuf response. So when it comes to building your metrics, I could have a failed request, potentially an internal server error in my service, which is still going to be recorded as an HTTP 200. Luckily for us, Envoy understands the gRPC protocol, and Envoy can actually decode both the method called and the response code.

To enable that, it does have to be specifically switched on, using an HTTP filter in Envoy: the gRPC HTTP/1 bridge filter. Now, if you look at the documentation, it's pretty clear on what this thing does, but it's easy to misunderstand. A lot of people look at the HTTP/1 bridge filter and say, well, it's a method of translating an HTTP/1 request into a gRPC request. It does do that, but it also allows Envoy to understand the response codes and the methods encoded inside the protobuf, and to report those in your statistics.
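As a minimal sketch, and assuming the v2-era filter name envoy.grpc_http1_bridge, enabling it is just a matter of adding it to the HTTP connection manager's filter chain ahead of the router:

```yaml
- name: envoy.http_connection_manager
  config:
    stat_prefix: ingress_grpc
    # Decodes gRPC frames so Envoy can emit per-service, per-method and
    # per-status statistics; it also bridges HTTP/1.1 clients to gRPC.
    http_filters:
    - name: envoy.grpc_http1_bridge
    - name: envoy.router
    # route_config omitted for brevity
```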
With that in place, I can use metrics like these. You can see here I've got envoy_cluster_grpc as the metric, and then I've got the gRPC response code: a 0, which is the equivalent of an HTTP 200, and a response code 5 here, which is roughly the equivalent of a 404. gRPC status codes don't directly map to HTTP status codes, but building that up into a chart, I can make something which is really rich: I can filter on the status code, but I'm also seeing the method called, so I can isolate those individual requests, which is really nice and useful to see. I've just lumped them all into a single chart here, but you can break them up as you need to. Errors again: I'm using a different metric, but in the same way as I'm dealing with HTTP requests, I want to deal with gRPC errors. And remember, a gRPC request can complete with a 200 but still be an error because of the status code.

So, cluster metrics. Here we're getting into things like upstream requests, and upstream requests are internal to Envoy. We need to differentiate between upstream and downstream because, as I mentioned, if you're using a pattern like retries then you could have multiple upstream failures but a downstream success, so you need to monitor them independently. We can see here that we've got some retries and we've got some errors. I've applied a retry policy, so I can see that Envoy is recording my retries there. Now, these retries wouldn't result in a downstream error, but they are manifesting themselves as upstream errors.

Timeouts: we need to monitor timeouts. A timeout in the system is a big red flag. I tend to put alerts on timeouts, and whether you have those go straight through to PagerDuty or not is up to you, but I think a timeout is a big red flag and you need to be carefully monitoring this information.

If you're implementing those reliability patterns, you've also got to be thinking about monitoring outlier ejection. So what does outlier ejection do? As a pattern, if Envoy receives a number of status codes from an endpoint which indicate failure, it will remove that endpoint from the cluster temporarily. Here I'm monitoring the endpoints that have been ejected. I might have one particular service instance that is a little bit flaky; Envoy will remove it temporarily, and then you can see the gap, because Envoy is going to try it again, giving it time to recover. But then in the second block Envoy said, hey, I've got errors again, so it removes the endpoint for a longer period. Outlier ejection is something which I think should be both monitored and configured for alerting.

Let's see how authorization works. Again, two components: authentication and authorization. Authentication is done through the mTLS process with client-side certificates. Authorization is a facet of the control plane; it's an additional layer: this presented identity, is it allowed to connect to me? This is a metric that you really want to be monitoring for your control plane, because you've got to be thinking about why you would get these failures; you should never really see failures there. There's a bunch of metrics, such as the total responses, but ultimately I'm mainly interested in errors.

So why am I interested in those two things? The first reason is that when I start a new pod, I'm going to spawn a whole load of new connections, so I should see a little spike in authorizations. But they're cached: if you see constant authorizations for a service, then something is not behaving itself correctly. Authorization is going to slow down your service, so you don't want to see many authorization requests at any one time. And authorization failures are a massive red flag. You've got to be thinking about why you're getting those. There are two predominant reasons: the first is misconfiguration, where you haven't actually explicitly allowed communication between two services; or it could be somebody probing around in your system, trying to do something that they shouldn't.

And on to tracing. Tracing is interesting; I'm growing my appreciation for it. As I said, it's something that's newer to me. I just took this definition from the OpenTracing website, but it's a nice description, and the thing that's interesting about tracing is when you're looking at performance. The configuration of tracing in Envoy requires that you set up a cluster, and you've got to set up the cluster because Envoy needs to know where it is going to send the tracing data, the spans. Your control plane should handle this for you, but it's interesting to know. You also need the tracing configuration itself, because there are a number of different drivers which Envoy can use: there's a Zipkin driver, which is OpenTracing-compatible; there's some new stuff in 1.10 which allows pluggable OpenTracing; it'll also support OpenCensus, I believe; and you've got LightStep.
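Here's a minimal sketch of that wiring, assuming the Zipkin driver and the v2-era configuration format. The tracer points at a named cluster, and that cluster, hypothetically called zipkin here (a Jaeger collector speaking the Zipkin protocol works too), tells Envoy where to ship spans. Note that the HTTP connection manager also needs tracing enabled on it before spans are emitted.

```yaml
# Bootstrap-level tracer configuration: which driver to use,
# and which cluster to send spans to.
tracing:
  http:
    name: envoy.zipkin
    config:
      collector_cluster: zipkin         # must match the cluster name below
      collector_endpoint: /api/v1/spans

static_resources:
  clusters:
  # The cluster Envoy uses to reach the trace collector.
  - name: zipkin
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: zipkin
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: zipkin.default.svc, port_value: 9411 }
```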
When you start to look at an actual trace, here's an example of an HTTP POST. I can actually see, visually, the different paths, the different network hops, that have gone through my service: I've got the hop from the downstream into an upstream listener, and then I've got a gRPC service.

What you've got to be aware of when you're thinking about tracing is that all traces have a parent-child relationship. This example is for Zipkin, but whichever driver you use, you've got to be thinking about forwarding the headers: if you make an outbound call from your code, you've got to forward on the span IDs and the trace IDs for it to be registered as part of that graph. That's pretty straightforward in HTTP. With gRPC, again, we've got to think about things differently, because we're not using the HTTP protocol in the same way; what we do with gRPC is extract those headers from the inbound HTTP request and add them as metadata to the upstream context. For OpenTracing and gRPC there are some great frameworks which work as middleware to help you do this, but essentially that's what they're doing: they're taking the parent information and adding it to the child span in order for that nice chart to work.

Lastly, on logging: why is logging useful if you've got all of these wonderful metrics? Well, when I was putting together this presentation and working on my demo, I couldn't get my tracing to work, and my metrics weren't telling me anything. But when I looked in my logs, I could clearly see that I wasn't sending the spans. So you've got to think about all of these things together, and Envoy is going to give you some great information in its access logs.

I want to thank you so much for listening to me today. I'm going to be around the conference, so if there are any questions you've got, I'd be more than happy to answer them. Thank you so much. Oh, yes, a question:

Q: What's your recommended way to send tracing data: directly to Jaeger, or via the Istio Mixer?

A: So, what's my preferred way to send tracing, direct or via the Istio Mixer? I think, and correct me on this, there are going to be some changes here, because when you send everything through the Istio Mixer, one of the performance bottlenecks in Istio is Mixer, and I believe there's going to be a separation from Mixer as a centralized component in an upcoming release. So you can absolutely send spans direct to Zipkin; or rather, send them via a collector which acts as a proxy. I generally run a collector as a service, so I've just got a deployment with a number of instances. And I think it's very much dependent on the direction Istio goes with Mixer.

Q: About the Jaeger view you showed: are those HTTP methods?

A: Oh, sorry, this is a little bit misleading. I've actually named the gRPC methods put, get and exists, because the service in question is storing and retrieving data. In retrospect those are terrible names, but no, they are actually gRPC methods, not HTTP methods. All right, thank you so much.