Okay, now can you... yeah, that's better. Right, where was I? So, let's start with a bit of audience participation. Who here is already using Prometheus? Oh wow, okay, I'm going to have to make this a bit more interesting then. Right then. How about, I guess as we're here, we're probably all using Kubernetes? Hands up. So, more people using Kubernetes than Prometheus. And who's using Kubernetes and Prometheus together? Okay, that's not as much overlap as I was expecting, actually. Right then. As I said, my name's Tom. I started this company called Kausal recently; we do a hosted version of Prometheus, basically. I also make my own beer, which kind of explains that, and there's also the barbecue for lunch. I used to work for a company called Weaveworks, who are also here and also do a lot of work on Prometheus. Before that I was at Google for a few years, and before that, another startup.

So we're going to cover four things today, and we've kind of done the first already. The USE method: I know the talk was pitched as the RED method, but we're actually going to talk about the USE method first, because that's kind of my inspiration, at least for the naming scheme. Then the RED method, which is what we're mainly here to talk about, and then Google's four golden signals, if anyone here has read the Google SRE book.

So, quickly, the introduction. Why does this matter? Why do we even need patterns, and why should we even care about instrumenting our applications? I'll start with why I'm giving this talk. There was a recent Prometheus conference in Munich, and the RED method was alluded to in multiple different talks, but no one actually explained what it was. More and more I'm seeing mentions of it on Twitter and things like that, and no one's actually come along and said, this is what the RED method is. So, as I kind of tongue-in-cheek came up with the name a couple of years ago, I thought there should be an authoritative talk on what the RED method is, at least from my point of view.

So why are these patterns important? Well, if you're a software engineer, you probably don't want to spend your life thinking about the best way to instrument your code; it's probably more important what your code actually does. By having these patterns, these tried and tested methods, hopefully it reduces a bit of the cognitive load and lets you just go: well, this guy I saw at a conference said this is the best way to do it, so I'm just going to do it that way. There's a whole debate about white-box versus black-box monitoring; if you're already using Prometheus, which sounds like most of you are, then we've already won that debate and white-box monitoring it is. And if you want the debate about push versus pull, you're in the wrong place. That's a bit of an in-joke, sorry. All the examples I'll give will be Prometheus running on Kubernetes as well.

So, to start with, the USE method. This was invented, or at least coined, by a chap called Brendan Gregg, and it defines three things: for every resource in your system, you monitor the utilisation, the saturation, and the errors. Brendan has an absolutely excellent, de facto web page that describes this in incredible detail, and I'm just going to lightly skim through it here. Utilisation: how much time, wall-clock time, was the resource busy for, as a percentage? Saturation: this one doesn't have such a nice definition.
It's kind of: how much work did this thing have to do? And so, per resource, the units for this are going to be different. I like to try and find a way to normalise it to a percentage, because then I don't have to interpret the number. And then errors: just, what is the rate of errors for this thing? So, resources are CPUs, disks, memory, network, and the things people normally forget, like interconnects, and so on. This is very, very targeted at infrastructure, at real things. It would be quite challenging to apply this to your microservices, which is why we did the RED method. Anyway, moving on.

You end up with this kind of checklist approach, and when you've got a problem... I don't know about you, but I find my problems bisect quite nicely: half of my problems are very obvious after 10 minutes of fiddling, and then there's the other half, where I have no clue what the problem is. I don't know what I don't know. One of the nice things about these checklist-based approaches is they really help in the second case. When you don't know what's wrong, it's quite reassuring to go: look, if I just check all of these things in this order, I'm likely to find a problem that gives me the clue for the next thing. So you end up with this kind of checklist, and if you're like me, you turn it into a dashboard. You turn your hunting for problems into something a bit more methodical, a little less haphazard.

So, diving into the real stuff: how do you do this with Prometheus? Well, CPU utilisation. One of the reasons I love Prometheus is because the queries fit on a single page, so you get nice, neat queries like this. This is using the node exporter, which is a Prometheus exporter you run on every machine; if you're using Kubernetes, you run it as a DaemonSet. And then you say, for each CPU in the machine, how much time was it idle for, take that as a percentage of the total time, and take it away from one. So this is super easy, right?

CPU saturation is where it gets a little bit more interesting. CPU saturation is traditionally measured as load average, so I've picked the one-minute load average here. Load average, for those who aren't aware, is the number of processes on the run queue, or a sampled average of that. So you can quite easily have, you know, a 16-core box with 300 things waiting for CPU time, and then your load average would be 300. Because I don't like to interpret these numbers, like, what does 300 mean, I prefer to take it as a proportion of the number of CPUs, which is why I've divided by the number of CPUs. This one here is a handy little recording rule that I've written that gives me, per node, the number of CPUs per machine.

You'll notice I don't have errors on this slide, so this is really the US method so far. That's a terrible joke. CPU errors are really hard to measure, it turns out. I looked in a bunch of places, and the best advice I got was, like, grepping through dmesg for machine check exceptions, and they're kind of machine specific. The Intel processors have some nice counters, but I don't really have a way to get at them, so luckily CPU errors don't seem to be a thing I need to worry about.
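For reference, those two CPU queries look roughly like this. This is a sketch rather than the exact queries from the slides: metric names vary by node exporter version (older versions call the first metric node_cpu), and the talk uses a recording rule for the CPU count rather than counting series directly.

    # CPU utilisation: fraction of time the CPUs were busy (i.e. not idle), per instance.
    1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))

    # CPU saturation: one-minute load average as a proportion of the CPU count.
    # The denominator counts one idle-mode series per CPU per instance.
    sum by (instance) (node_load1)
      /
    count by (instance) (node_cpu_seconds_total{mode="idle"})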
This would get pretty dull if I went through every single query in this level of detail, so I'm not going to do that; I'm going to jump straight to the dashboards and show you what I've got. This memory one is another good example of where saturation is in some weird units. The best advice I've seen for how to measure memory saturation is actually to look at how much paging your machine is doing, because if it's doing a lot, that tells you your memory is oversubscribed, and if it's not doing any, then saturation is not a problem.

The list really does go on. I've already touched on CPU errors, and memory errors are another thing it's really hard to get a counter for. I also found it really hard to get a counter for hard disk errors. You'd think this would be quite obvious; it turns out there is a counter in sysfs, but the node exporter doesn't expose it yet. I was hoping to get that done before I gave this talk, but I'm going to plumb it through the node exporter so you can actually have a counter for it. Disk capacity versus disk IO: I don't know about you, but disk errors have never caused an outage for me, whereas running out of disk capacity is probably the root cause of most of my outages, and disk capacity doesn't really fit into this USE model very well. Maybe that's a criticism of the model, or of my understanding of it. Network utilisation is another interesting one, because I really want to express it as a percentage of some total, but I've not found a way to know the actual capacity of my link programmatically. Currently, in my dashboards, I know that for this machine type on AWS I have this much network bandwidth, and I use that as the denominator to get my percentage, but I'm not very happy with that. I also don't have any clue how to measure interconnects.

Let's see if this WiFi works. We've done a set of dashboards, so if we go to the... yes, good. I didn't mean to go to Prometheus, I meant to go to Grafana. If we go to our Grafana, I've published this set of dashboards that I'm calling, and I love this name, klumps, because it's the Kubernetes Linux USE Method with Prometheus. David, the co-founder of my company, absolutely hates this name and really didn't want me to include that joke, so if anyone has a better suggestion for a name, please do come forward and let me know, because klumps is a bit weak. Anyway, you end up with these USE method dashboards. This is a cluster-wide USE method dashboard, with every single one of those queries I talked about encoded into it. You can see my cluster is not particularly heavily loaded, and it's really straightforward: each row is one resource, each column is a different bit of the USE method.

This is all open source, if you're interested; it's on GitHub. Who else here does monorepos? Yeah, me too. We have a public and a private monorepo, and I'm not sure it's a good idea yet. All of our open source projects live in the Kausal public repo, and klumps is in there. The Grafana dashboards are generated with something called Jsonnet, which is this really cool templating language for merging JSON dictionaries and doing all sorts of really cool stuff, so you do have to generate them. But this means you can parameterise them: if you use slightly different job names, or your node exporters are named differently on your cluster, you can just bung that in and it'll generate the dashboards.
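For reference, here are hedged sketches of the memory-saturation and network-utilisation ideas described above. These are not the klumps queries: major page faults are just one proxy for paging activity, the metric names depend on your node exporter version, and the 125e6 bytes-per-second denominator is a hard-coded assumption of a 1 Gbit/s link, which is exactly the "I don't know my link capacity" problem mentioned above.

    # Memory saturation: paging activity as a rough proxy (major page faults per second).
    sum by (instance) (rate(node_vmstat_pgmajfault[1m]))

    # Network utilisation: received bytes as a fraction of an assumed 1 Gbit/s link.
    sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[1m])) / 125e6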
And that is pretty much the USE method dashboards. I want to iterate on these and get them to include the fixes I've discussed here, get them to include disk errors and so on, and maybe other people will find them useful. We're also going to add things like drill-downs by host and by service. Anyway, that is klumps. No, that's not it. I swear Keynote used to make switching between mirrored and not mirrored much easier. So yeah, if you want to do some further reading on this, the USE method website by Brendan Gregg is absolutely fantastic; it goes into a lot more detail, has a lot more pros and cons and side effects and so on, and you should definitely read it. And then there are the klumps dashboards I just mentioned.

The RED method, what we're really here to talk about. The RED method was kind of inspired by a question we had at work, which was: why aren't we using the USE method? And I found it really hard to explain to people that the USE method is for resources and the RED method is for my services, and they're two different things. We do use the USE method, but the USE method doesn't tell me if my services are healthy; it tells me if my machines are healthy. So the RED method is a bunch of stuff I learned at Google that I kind of misremembered. If you're familiar with Google's four golden signals, they include saturation, and when I came up with this, I forgot about that one. Also, USE has three letters, so this needed three letters, and I couldn't really find a way to fit saturation in; it could have been the REDS method, but that's a bit lame.

So the RED method is: for every service in your architecture, monitor the request rate, the rate of errors of those requests, and the duration of those requests, as some kind of distribution. Why? Well, this is a nice way of knowing, without knowing what the service does or how it works, whether the service is healthy-ish. This is really useful when you've got an on-call team that's on call for things they didn't write. If you want your DevOps to scale and to be something you do as a company, eventually you're going to have to be on call for stuff you didn't write; eventually someone else is going to have to be on call for your code and you're going to have to be on call for theirs, and that's how you build a scalable operations team. To do that, it's nice to have a uniform way of knowing what a service does and whether it's healthy. Hence the RED method.

It's also nice because these are the building blocks for being SLA-driven. You should really only care if your customers are happy; you shouldn't care if your computers are happy. So this starts getting towards high-level metrics that tell you whether your customers are happy. If things are slow, if your durations are really high, your customers probably aren't happy. If your errors are high, they're definitely not happy. If your rate's not high, that's an interesting one; maybe you're just not popular enough. I think this is the first mention I could find of the RED method (I had a different haircut at the time), and it caused a big storm, because it was described as an alternative, when really it should be an addition: a different method for monitoring a different thing. Anyway, how do we do this with Prometheus? That's what we're all here to look at. It turns out it's really easy.
You have a block of code like this. This declares the Prometheus metric; I don't have a laser pointer, but this declares the metric. We're going to use a single histogram to represent all three of the attributes of the RED method, and we're going to name it following best practices: we include the SI unit, we record things in seconds, not milliseconds, and we include various labels. The labels are how we differentiate between successful and failed requests, so we've got a status code in there. Then we register that metric; you need this once in your application, normally at the top of main.go. I prefer the approach of having a piece of middleware that I can wrap around my HTTP handlers to instrument them, instead of manually instrumenting every single handler. There's a nice set of generic middleware out there: there's a Prometheus one, and the Weaveworks guys have one in their common repo, which is the one I use. This is the one I've come up with here. It uses something called httpsnoop, which is a really cool little Go library if you've never seen it: it gives you a fake response writer that records a whole bunch of things as it's being used, what the status code was, how many bytes were sent, and so on. Then actually recording the metric is just observing the request duration with the labels method, path, code, and so on (a rough sketch of this follows below).

If anyone can spot the obvious mistake on this slide, they can win a Kausal sticker. Is there an extra paren? No, no, that's correct actually; that's closing the one after the WithLabelValues. They've got a different path? Two maps? No, I think that's fine. The obvious mistake here is that I've used the URL path as a label, and rule zero of Prometheus is: don't use high-cardinality values in your labels, because this will really overload the time series database underneath. So instead of using the raw URL path, with whatever random user IDs you've got in it, we tend to erase out the high-cardinality things. If you're using something like gorilla/mux and you've defined a pattern matcher for your path, we use that route template instead of the actual path. This reduces the cardinality and keeps your queries fast, basically. So I get to keep my sticker.

Anyway, then we use it: we serve the Prometheus metrics so Prometheus can come along and scrape them, and we serve our handler function, which won't compile as shown because it's got a three-dot ellipsis in it. Then, getting the rate of requests is super easy: you just look at request_duration_seconds_count. This is an internal implementation detail of how histograms are implemented in Prometheus: they don't export a single time series, they export a count, a sum, and buckets. So we use the count to see how many requests are coming in, and then we take the same count filtered by status codes that are not 200, and that tells us the error rate. And similarly, we use histogram_quantile, with a fairly complicated expression that I can never write right first time (and which is missing a quote on the slide), to get the duration distribution. So let's have a quick look at how this works. I would normally type it in here, but I find it much easier to use my product, of course; this is not a pitch, you can do all of this in plain Prometheus.
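Before the demo, here is a minimal sketch of the instrumentation pattern described above in Go. It is not the code from the slides: the handler, route name, label names and port are made up for illustration, but the libraries (prometheus/client_golang and felixge/httpsnoop) and the single-histogram approach are the ones discussed.

    package main

    import (
        "log"
        "net/http"
        "strconv"

        "github.com/felixge/httpsnoop"
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // One histogram covers all three RED signals: rate (the count), errors (the
    // status_code label) and duration (the buckets). Seconds, not milliseconds.
    var requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "request_duration_seconds",
        Help:    "Time spent serving HTTP requests.",
        Buckets: prometheus.DefBuckets,
    }, []string{"method", "route", "status_code"})

    func init() {
        // Register once, normally near the top of main.go.
        prometheus.MustRegister(requestDuration)
    }

    // instrument wraps a handler and records one observation per request.
    // Pass the route template (e.g. "/users/{id}"), not the raw URL path,
    // to keep label cardinality low.
    func instrument(route string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // httpsnoop captures the status code, bytes written and duration.
            m := httpsnoop.CaptureMetrics(next, w, r)
            requestDuration.
                WithLabelValues(r.Method, route, strconv.Itoa(m.Code)).
                Observe(m.Duration.Seconds())
        })
    }

    func main() {
        // Expose /metrics so Prometheus can scrape us.
        http.Handle("/metrics", promhttp.Handler())
        http.Handle("/hello", instrument("/hello",
            http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                w.Write([]byte("hello\n"))
            })))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }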
So first we'll do sum(rate(request_duration...)), and the reason I find it easier is because it's got this auto-complete. We're going to need a rate of that, and I'm going to filter it so you're only seeing the dev traffic. Okay, there we go. But we can start to do way more interesting stuff if we break this down by request... sorry, by status code. So here we go, we drill down into the status code, and now we can see that there are no errors per second and there are 36 QPS to our dev frontend. That's the first two bits of the RED method in a single query. And then this is where it gets way more interesting, because I have to write, and get right first time, a histogram_quantile, which I almost never get right. So we're going to do a sum and a rate, and we're going to do request_duration, and we're going to scroll, because my spelling is perfect. This time I use the _bucket suffix, and this gives us the set of cumulative buckets that the histogram has counted over, and then it will interpolate between those buckets to tell you where the 99th percentile is. Again, I'm just going to do it for dev because it's much faster than prod, and I'm going to do it over the last minute, and then the thing I always forget is to break it down by this le. This is really where the leaky abstraction comes in for Prometheus: le is the label we use to differentiate between the different buckets. And it worked. So that's 6 milliseconds, I think, for requests at the 99th percentile.

Just to give you a taste of what this looks like underneath, if I didn't do all of that jazz and we use this as a table, looking at the raw set of time series the query returns, you can see this le dimension has various numbers: all the requests end up in the plus-infinity bucket, and then only the ones that took one second or less are in the 1 bucket, then 2.5, and so on. And you can see it broken down by method and by route, and, oh, I got it right, the route doesn't include usernames. So there we go, that's how you do the RED method with Prometheus.

And I like this approach because it's just super simple. There's one thing you need to instrument all your apps with, and most companies probably have a common library they can put this into, so their developers don't even need to think about it. Where it gets a bit more interesting is your generic microservice architecture, which usually looks like lots of services that all talk to each other, and if you're really lucky there are no loops. Obviously I've just talked about how to instrument a single service, but generally you actually want to instrument your whole system. So if you think about this as a graph of services and you do a traversal of that graph, breadth-first if you like, and you build a dashboard that looks like this, where one column is QPS broken down by errors and the other column is latency, then you get these nice standardised dashboards that you can understand whether you've seen the service before or not. It also allows you to see where the errors are coming in: if you've got some errors at the top, you basically scroll down until the errors stop, and it's the service above that that's introducing them, if you get what I mean. This is me taking that graph and traversing it row by row. I find this a really nice way of building dashboards, to the extent that it's completely automated now; the system will just generate these for me using Jsonnet.
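Written out, the three queries from that demo look roughly like this. The job selector and the status_code label name are illustrative and need to match whatever your instrumentation actually exports.

    # Rate: requests per second, broken out by status code.
    sum by (status_code) (rate(request_duration_seconds_count{job="default/frontend"}[1m]))

    # Errors: just the requests whose status code is not 200.
    sum(rate(request_duration_seconds_count{job="default/frontend", status_code!="200"}[1m]))

    # Duration: 99th percentile latency, interpolated from the histogram buckets.
    histogram_quantile(0.99,
      sum by (le) (rate(request_duration_seconds_bucket{job="default/frontend"}[1m])))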
The other thing, and this is something I learned the hard way: when you're plotting your latency graphs, you should never plot the average, right? Never use the average; that's been drummed into me over multiple years of doing monitoring. Except you'll notice I've got the average plotted here, and there are two reasons for that. One is that latencies don't sum: if you've got one service with a really high latency and you look at the two services it calls, it quite often happens that they've both got really low latencies, and that doesn't help you, because latencies do not sum like that. Averages do sum, so if you've got one service with a very high average, then one of the services below it is going to have a very high average, and that points you in the right direction. The second reason is that averages can be used to sanity-check whether you've got your buckets right. The le buckets for histograms are an implementation detail, but they're something people sometimes get wrong. Sometimes you'll see a nice low latency, no problems at all, because you've only declared buckets going up to 100 milliseconds, and obviously the quantile can't report more than 100 milliseconds. Then you plot the average, which is just the sum divided by the count (sketched below), you see it's at tens of seconds, and you realise you've got your buckets wrong. So that's why the average is on here.

If you want to read more about the RED method, there's a bunch of stuff we've written, but it's really taken off with other companies now. Cindy's got this fantastic blog post about logs and metrics which really goes into this. The slides will be online and these are clickable links, so there's not much point in taking a photo. And Peter has good writings on this as well.
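For reference, the average mentioned above is just the histogram's sum over its count; a minimal sketch using the metric name from earlier (add job or route selectors as needed):

    # Average latency: not a substitute for quantiles, but it sums sensibly across
    # services and will expose badly chosen bucket boundaries.
    sum(rate(request_duration_seconds_sum[1m]))
      /
    sum(rate(request_duration_seconds_count[1m]))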
So finally, we're going to talk about the four golden signals. This is what I should have done when I did the RED method. And no, this is not a sticker you put on your mobile phone to improve your reception; how can that even work? It doesn't. You'll notice this is very similar to the RED method, it just adds saturation: latency is the same as duration, traffic is the same as requests, errors are the same. But this saturation thing, "how full your service is"... okay, is that even a definition, how full your service is? One of the reasons I excluded it is because, what does that even mean? It turns out that if you're using Kubernetes, there's a nice definition of this. Everyone puts resource requests on their pods, right? I'm not seeing many people nod. You should put resource requests on your pods; you should be saying this pod needs one CPU, or a tenth of a CPU, and that gives you a nice denominator to tell how full your service is.

This one I am going to cheat with, I'm afraid, because it's quite a handful to go into kube-state-metrics and cAdvisor and build this up from the actual primitives, but I've defined a bunch of recording rules, and it's under... I don't know, actually, I've forgotten the namespace. That's not a good idea. Let's do pod... here we go... it's not that one, it's none of these. Oh man, I've got it written in my slide notes. Not that slide. Okay, I'll cheat even more, because I've built this into a dashboard so I don't have to remember these queries. I have these cluster resource dashboards which tell me how full my cluster is, how committed my cluster is: how much of the CPU I've physically got is in use, and how much has been committed in requests. That's what these mean. This is a fork of Grafana we use, because with Grafana you can't have these multi-column tables with numbers in the columns, so we've got a fork; we're merging it all back upstream, of course, we're just currently running the fork.

But let's go and have a look at this. So we're going to take the container resource requests; I'll put this in here and see what it looks like. Actually, it'll be easier to see in Kausal, but slower; I blame the WiFi. So here we go: you can see it broken down by instance (actually the instance is the same), by node and namespace and container and so on, and it tells you basically how many CPUs are requested for each pod. So this pod needs two CPUs; pretty straightforward. And then this one is container CPU seconds by namespace; if we take this, you should see the same thing, and it gives us the ongoing CPU usage of the containers in my cluster. That graph was a bad idea. Now we need to divide out the two, and that will give me some idea of the CPU usage of all the pods in my cluster: if we divide the two, it should tell us how full my pods are. I'm really gutted that I can't find the query for this, because it's not as simple as just dividing the two. How long have I got, time-wise? Five minutes? Okay, I might not try and do it live, because it's not the most interesting thing to watch, sorry about that. You can divide the two, but the problem is the pod label is different: if you look here, this one's using pod and the other one's using pod_name, so you have to do a relabel, and I don't think that's in my recording rules. Anyway, this is all in the open source, so you can go and knock yourself out and show me how to do this better. Ah, here we go, this is the one I need. We take this one, which is the recording rule that's tidied it all up, and then we take the container CPU usage as a rate, that's the one, which has it all tidied up as well, and then we can just divide the two.
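The spirit of that division, as a hedged sketch rather than the actual recording rules from the repo: it assumes kube-state-metrics and cAdvisor metric names of roughly that era, and uses label_replace to paper over the pod versus pod_name mismatch just mentioned.

    # Saturation per namespace: CPU actually used, divided by CPU requested.
    sum by (namespace) (
      label_replace(
        rate(container_cpu_usage_seconds_total{container_name!=""}[5m]),
        "pod", "$1", "pod_name", "(.+)"))
      /
    sum by (namespace) (kube_pod_container_resource_requests_cpu_cores)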
There's a bug in our dev UI, but you can see how full the services are. If I do a sum of this by label name, it comes out in a much more meaningful way. Here we go: these are the various services we've got running in the cluster, and you can see how full they are. You know, 40% is a bit high, maybe, but it's fine. So finally you can get an answer for what the saturation of the services in my cluster is. If you want to read more about this, the Google SRE book is online and has an absolutely fantastic chapter on it, and there's another blog post that mentions it that I found on Google.

So, in summary: we've covered the USE method, monitoring utilisation, saturation and errors for every resource, which is really useful for infrastructure; the RED method, that is, requests, errors and duration for every service in your cluster; and we've taken a quick look at the four golden signals, which is RED plus saturation, and how you might do saturation on your Kubernetes cluster. And I've given you a couple of demos. There are a couple of other methods out there: there's another one of Brendan Gregg's that looks at threading, and Method R, which I have no idea about; I need to look into them.

For the rest of the day: you're already in the RED method talk, which means you're probably interested in Prometheus. There's a Meet the Maintainers session where you can come and argue with me about the stuff I've discussed here, and we're also going to have Frederick and Gal there if you want to ask them some more interesting questions. Ilya from Weaveworks is giving a talk, a practical guide to Prometheus for app developers, and the guys from the French FedEx company are giving a talk about how they use Prometheus. Then at 8pm, after the booth crawl, we've got some drinks at a local hotel sponsored by the FreshTracks guys; there's the link here if you want to sign up. I think we're probably getting close to capacity, so sign up quick. The Prometheus salon we did earlier today was so full, standing room only, that we're going to run it again on Friday at 2pm if anyone didn't manage to get through the door. Yes, we will repeat it, not this session, the Prometheus salon. Anyway, thank you very much. Any questions? There's a microphone behind you; if you talk into the microphone, everyone will be able to hear you. Form an orderly queue. The microphone's not turned on? Are the batteries flat? Oh right, just shout it and I'll repeat the question.

What do we alert on? We generally alert on error rate. The key thing to get right when you're alerting on an error rate is not to alert on the error rate for the entire service, because most services have more than one RPC that they accept, especially if it's your frontend: you might have 100 different paths in your frontend, and one route might be returning 100% errors, and you'll miss it if that route isn't called as much as another one. In Kausal we've got one route, the write path, that's called something like three orders of magnitude more often than the query path, so any errors on the query path get completely drowned out if we alert on the total error rate. So we do error rate per service, per route. And then latency as well: we alert on latency. We have some pretty strict targets that we don't always meet, so right now they don't actually page unless they're really, really bad, but hopefully as we start to meet our stricter targets we'll start alerting on breaches of those. But that's pretty much it. And then I tend to follow
the standard Google practice: don't alert on causes, alert on symptoms. You should alert on things that breach your SLA, and we don't alert on, you know, some random service over here not being able to elect a master, unless something at the top starts complaining that it's causing errors for a user. The only exception I make to that is disk capacity, because when you run out of disk capacity, everything stops; that's the only one where I alert on a cause rather than a symptom. Thank you, good question.

Do you instrument outgoing requests with the RED method? That's a really good question. I don't, except for when I'm talking to Amazon. Nothing wrong with Amazon, it works, it's just that I've been really dissatisfied with the metrics I can get out of CloudWatch. So what I tend to do, for instance when I'm talking to DynamoDB or S3 or any other service that I don't own, is instrument on the client side. But normally, when I'm talking to services that I do own, I think it's good enough to just instrument on the server side. I mean, you can do both, but then that kind of depth-first traversal of all your services starts to become really complicated, right? And when the network is broken, it normally becomes pretty obvious pretty quickly. So hopefully that answers your question. Any other questions? Anyone from over on stage right? Middle, middle? No, go on.

So the question was: in this dashboard, where do I get the graph of which service talks to which other service? Right now I just know it in my head, so that's just, you know, what I do. But there is a cool tool from my previous company, which I wrote, called Weave Scope. Do you use Weave Scope? Oh, great. I spent two years of my life writing that, and then I moved on to Prometheus. Weave Scope kind of gives you that; I don't have good screenshots here, but there are probably better ones out there. It gives you that kind of DAG of what talks to what, and you could use that, for instance, if you didn't know. But I tend to just know, so that's what I use. Any more questions? Okay, thank you very much.