So, here we have Tom Wilkie, who is going to be discussing the RED method, which I gather is something to do with raspberries.

Thank you, Brian. Hello, everybody. This is my first FOSDEM, and it's quite impressive; I'm really enjoying it. Thank you for having me. I'm here, as Brian said, to talk about the RED method. Patterns for instrumentation and monitoring is such a dry subject that I'm quite surprised to see so many of you, but I'll try and make it a little bit more interesting. God, the audio is terrible, isn't it? Oh, it's just my voice.

So, we'll start with two questions. This talk is mainly going to use Prometheus and Kubernetes for its examples, but the theory is general. So, who here is using Prometheus? Wow. Who here is using Kubernetes? And who is using Prometheus and Kubernetes? Ah, okay.

So, a quick introduction. My name's Tom. David, who's in the crowd, and I recently started a company called Kausal. It's just the two of us, we're bootstrapped, and we're running a hosted version of Prometheus, basically. Before that, I worked at a company called Weaveworks, which is where I really got into Prometheus and Kubernetes. Before that, I did a stint at Google. And before that, another startup called Acunu, where we worked on Cassandra.

This talk was going to be in four sections but, yet again, I'm short on time, so it's going to be in three, and we'll go really quickly: an introduction; the USE method, which kind of inspired the RED method; the RED method itself; and then the four golden signals, which is actually what inspired the RED method, I just forgot about one of them.

So, introduction. Why am I giving this talk? There's a Prometheus conference called PromCon that happened in August, I think, and this RED method thing was mentioned a whole bunch of times, and no one really explained what it was. We had a few people asking, what's this RED method? The RED method is a name I gave, back in about 2015, to three signals you should measure from your services. And I felt like somebody, you know, maybe me, should explain what I meant by it. So that's why I'm here and why I'm doing this talk.

So we'll start first with the USE method. The USE method is something that Brendan Gregg invented. It basically says that, for every resource in your cluster or your application, you should measure the utilization, the saturation, and the error rate. And it gives some basic definitions of these things. Utilization means: for what percentage of time was this thing busy? Saturation means: how much work does this thing have queued up? That one is kind of poorly defined, as you'll see later. And error rate means: how many errors are happening, per second, say. A resource in this case might be your CPU, your memory, your disk, your network; that's a good one. But it could also be things like your interconnect, or more virtual things; you could almost think of queues as resources.

The use case for this, I find, is generally when you're hunting performance issues. Oh, a phone's ringing a bit; that's mine, I'm using it to tether because the Wi-Fi here sucks. So this is more for hunting down issues. It turns an unknown unknown into a known unknown, and you now have a methodical approach to hunting down performance issues.
It answers the question: what do I look at when something's gone wrong? That's a tactical problem, which is why I like it. I'm now going to give you some brief examples of how you do this with Prometheus and Kubernetes.

So, this query will tell you the node CPU utilization. This is a particularly nice one; I actually picked it up from Brian. You basically average the per-CPU usage, because that deals very nicely with hosts having different numbers of CPUs. Saturation: this is really just your run queue length, your load average, so it's pretty easy to do on Kubernetes. Here I like to normalize the saturation by the number of CPUs, so that everything is unitless and I know 100% is bad, above 100% is terrible, below 100% is fine. That's why I divide it by the number of CPUs. You'll notice there's no way of measuring error rates on CPUs. We did have an idea of using mtail to grep through dmesg and look for things like machine check exceptions, and we did that on our machines, but we're running on Amazon, so there are just no errors; I guess if there were an error in the VM, we'd just disappear.

Similarly, you can do the same for memory utilization. This gets really dry, so I'm not going to do many more, but basically you construct these queries that go through and say: how much of my memory is in use? How saturated is my memory? Memory saturation is kind of a weird one. Brendan's USE method page suggests you maybe use your paging rate: a really high rate of paging means you've got a lot of contention on your memory. This is particularly troublesome because it doesn't have a natural unit; you can't say what 100% paging is. And again, no errors.

And, you know, the list goes on, and there are some really hard cases. On CPU errors and memory errors: it turns out that getting a list of hardware errors out of the Linux kernel is relatively difficult. I think we found one in sysfs, but the node exporter doesn't scrape it yet. I think, David, you're working on that one? Yes. For disks you could use SMART, but again, in a virtualized environment SMART is kind of meaningless, right? And a lot of these are kind of meaningless in a virtualized environment. The other thing it doesn't really capture: a very common problem I've had when running systems is that you run out of disk capacity. Where does that fit in this model? Network utilization is a really tough one as well, because how do you know what 100% is? You kind of have to know that this instance has gigabit networking. And I just had a really interesting conversation with a chap from Intel about exposing metrics from your interconnects and from your PCI bus and everything, and I kind of asked him to go and do all that work. But yeah, we don't have that yet.

So, a quick demo, and then we can get on to the actual thing we're here to talk about. There is one other thing I wanted to add whilst I'm mirroring my screen: I've built a little library with an awesome name, which we're open sourcing. It's called Clumps. Clumps stands for Kubernetes Linux USE Method with Prometheus; if anyone can think of a better name, that would be very welcome. But basically it's a set of Grafana dashboards for doing this kind of USE method with Prometheus on Kubernetes.
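(For reference, the node CPU queries described above, which these dashboards encode, look roughly like this. A sketch: it assumes the node exporter metric names of the time, node_cpu and node_load1; newer versions rename the first to node_cpu_seconds_total.)

    # Utilization: fraction of time the node's CPUs were not idle,
    # averaged across CPUs so hosts with different core counts compare.
    1 - avg by (instance) (rate(node_cpu{mode="idle"}[1m]))

    # Saturation: load average divided by the number of CPUs,
    # so that 100% means one runnable task per CPU.
    node_load1 / on (instance) count by (instance) (node_cpu{mode="system"})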
And it's all written using Jsonnet, which is a configuration language that came out of a 20% project at Google. And I believe the Prometheus folks are very keen on Jsonnet for the future. Yes? Yes, good. I'm very keen on Jsonnet, too. The dashboards kind of look like this: you write your dashboard, you say, here's a dashboard, I want a row in there, I'm going to add some panels, and then you put in, straight away, the Prometheus query we had earlier. Hopefully you can all see that. And so we've gone through and done this for every single resource we can think of: utilization, saturation, and error rate, where we've got one. And this is what it looks like. This is running on one of our production clusters right now. You get CPU utilization and saturation; there are no error rates. Memory utilization; here's the paging rate, nothing much being faulted in. There's no disk saturation, apparently. Network utilization. And then disk utilization. We can do this cluster-wide as well, which is kind of nice: there are cluster-wide USE method Clumps dashboards. It's getting a bit off-topic, but this is actually way more useful for... actually, I'll save that until the end. I'll just tease you with that one.

Right then. So that's the USE method in eight minutes; I'm not actually that far behind schedule. Oh, references for the USE method: there's really only one, which is Brendan Gregg's website. It's fantastic, there are loads of pages, loads of detail, and I haven't done it justice, so please go and read that if you're interested. And there's the link to Clumps.

So, the RED method. The RED method says, for every service (where before we were interested in resources, we're now interested in services; microservices): measure the rate, the number of requests you're getting per second; the errors, the number of those requests that are failing; and the duration of those requests. So: the rate of the requests, the errors of the requests, and the duration of the requests.

Why do this? Well, basically, someone joined Weaveworks when I worked there, and they asked us about our monitoring strategy. We were just launching our hosted version of another product called Scope, another open source project. And we were like, you know, yeah, we've got monitoring, don't worry about it. And he said, ah, but do you have a methodology? How do you know you're monitoring the right things? And I had no idea what he was talking about. He said, oh, you should use the USE method. So I went away and read about the USE method, and it was completely not what I wanted. I'd spent a few years at Google, and the USE method was very much what we did for resources, whereas I was very much interested in what we did for services. And at this point, I didn't know about Google's SRE book; I don't think it was even out yet. And so I kind of said, look, we need something like the USE method, but for services. And I called it the RED method, because, you know, it makes a word. And I completely forgot about the fourth signal. So basically, yeah, it came from the four golden signals, with a catchier name, and I forgot the fourth one. But what's useful about this is, if you do the RED method and you instrument all your applications so they expose these metrics, you can get much better operational scalability.
So an individual can be on call and can solve problems with your services without necessarily being the person who wrote them, without necessarily being the person who instantly knows the inner workings of that service. They can treat the services as black boxes, as long as you're measuring the RED method metrics for each of them. And this allows you to get some operational scalability, and you don't have to be on call for the code you've written all the time. At Weaveworks, I looked after code I hadn't written, people looked after my code, and it kind of works. Same at Google: I'd barely written any of the services I looked after.

It also relates very closely to the SLAs you should be offering your customers. Because really, this is what you should care about: you should care that your customers are happy. This is the problem I have with the USE method: it's kind of asking, are my computers happy? Are my CPUs not too stressed? Do they have enough free memory? And I don't really care if my computers are unhappy; in fact, it's probably making me happy when they're not. I care that my customers are happy. My customers are happy when there aren't many errors, less than 1% maybe, as we'll see, and when the duration, the latency of the requests, is low. That's when my customers are happy, I hope. So this allows me to build SLAs. Yeah, as I said, I came up with this in about 2015. I had a different haircut back then; I still have that T-shirt, though. And people thought I was trying to say that Brendan Gregg's USE method was not the right thing, and I'm not: these are two complementary things.

Anyway, it's super easy to do the RED method with Prometheus. You declare a metric; I promised to have code on these slides, so here we go. This is just boilerplate: you need it once in your application, or, for instance, in Cortex at Kausal or at Weaveworks, this just appeared once in a common library. It declares a histogram metric, and inside that we break the metric down by method, route, and status code. The method and route are kind of interesting, but the status code is the one that allows you to measure the error rate. One of the things you get with a histogram metric is a count of the number of requests and a sum of the times they take, and that allows you to get an average latency.

And then when you want to instrument things, this is, for instance, how you do it for HTTP services. I like the idea of having a little bit of middleware, so this wrap function takes an HTTP handler, instruments it, and returns another HTTP handler. The other thing I wanted to point out here is this really cool library called httpsnoop. It gives you a wrapper around a ResponseWriter that's instrumented: you give httpsnoop a ResponseWriter, it returns you an instrumented ResponseWriter, you pass that on to the actual HTTP handler, and when that's finished, you can go and ask it for the metrics it's gathered. It's pretty cool. Actually, if you look into how it's done, there's this auto-generated explosion of all the different interfaces a ResponseWriter might potentially implement, and there's a long thread about whether that's the right thing to do, but I like it because it's one line of code. And also, don't forget to expose the Prometheus metrics handler. And that's it.
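(Pulled together, the pattern described above looks roughly like the following. A minimal sketch rather than the exact slide code: the metric name, label names, and route are illustrative, while the prometheus client and felixge/httpsnoop calls are the real library APIs.)

    package main

    import (
        "net/http"
        "strconv"

        "github.com/felixge/httpsnoop"
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // Declared once per application (or once in a common library).
    var requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "request_duration_seconds",
            Help:    "Time (in seconds) spent serving HTTP requests.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "route", "status_code"},
    )

    // wrap is the middleware: it takes an HTTP handler, instruments it,
    // and returns another HTTP handler.
    func wrap(route string, next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // httpsnoop wraps the ResponseWriter and captures the status
            // code and duration of the inner handler in one line.
            m := httpsnoop.CaptureMetrics(next, w, r)
            requestDuration.
                WithLabelValues(r.Method, route, strconv.Itoa(m.Code)).
                Observe(m.Duration.Seconds())
        })
    }

    func main() {
        prometheus.MustRegister(requestDuration)

        hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello\n"))
        })
        http.Handle("/hello", wrap("/hello", hello))

        // Don't forget to expose the Prometheus metrics handler.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }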
You've now instrumented your services with the RED method. Unlike the USE method, all the queries for the RED method fit on a single page, and I will now type them in live, to prove I know what I'm doing. What? Or to prove I don't. I'm going to cheat a little bit and use Kausal's interface, which is basically the same as Prometheus's interface, just blue and white. Let's get the request rate for our dev cluster. So we type "request", and the reason I'm cheating will become obvious: I have this awesome tab completion, which is awesome when you've got so many metrics, so I don't even have to type out "count". So this is the number of requests we've had across all our services. If we look at the table view for this, you'll see there are literally hundreds of services, and that's not particularly useful. So I'm going to say I'm only interested in a particular job, and that's going to be our dev front-end, and I want a moving one-minute rate of that. There you go: we're handling about 416 requests a second on our development front-end. This is actually interesting, because the Wi-Fi at FOSDEM is exposing loads of Prometheus metrics, and they're all being sent to us; Ben and Richard have set that up. And, well, they must have turned something off.

Then, now that we know how many requests a second we're getting, we can take the number of errors and divide it by this. The way we do that is with the status code label that we put in the metric: we say the status code doesn't match the regex for 2xx. There were some errors: 0.03% errors. What, half an hour ago? Who knows. That was probably errors on our side.

And then the final thing, which I'm not going to type in because it takes ages (David's built this really cool auto-complete that lays the percentiles out nicely next to each other; it's a bit of a live feature): request duration seconds, there we go. This automatically sets up the queries you need to get your 90th percentile, your 50th, and your average. And then it takes literally seconds to calculate; it shouldn't be that slow. Anyway, there you go, there's our latency going into our dev front-end. Well, actually, this is going into both our front-ends, because it's not broken down. And we can see that our 90th percentile latency is... is this in seconds? Is that 20 milliseconds? That's not bad.

There you go; that's the RED method. I'm not quite finished with it, though, because of one of the things I particularly like. I talked a bit about how the RED method allows someone to be on call for your services when they haven't written those services. And you can imagine your microservice architecture as a kind of tree of services. If you've got this tree of services, maybe with your front-end at the top, then depending on the path the user goes to, they hit a different service, and maybe one of those services also has some supporting services that handle storage and so on. You can see how this model fits. And what I like about the RED method is that you can map this straight onto a set of dashboards that you can build, and might even be able to build automatically. They go depth first... no, breadth first through that tree, with, on the left, your request rate broken down by success or failure, which covers rate and errors, and on the right, the latency. And this is how we do all our dashboards, of course.
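(For reference, the queries from the demo, which are also what such dashboard panels contain, look roughly like this. A sketch: the metric name matches the instrumentation sketch above, and the job label value is illustrative.)

    # Rate: a moving one-minute rate of requests to the front-end.
    sum(rate(request_duration_seconds_count{job="frontend"}[1m]))

    # Errors: the fraction of those requests with a non-2xx status code.
    sum(rate(request_duration_seconds_count{job="frontend", status_code!~"2.."}[1m]))
      /
    sum(rate(request_duration_seconds_count{job="frontend"}[1m]))

    # Duration: the 90th percentile latency, from the histogram buckets,
    # and the average, from the _sum and _count series.
    histogram_quantile(0.90,
      sum by (le) (rate(request_duration_seconds_bucket{job="frontend"}[1m])))

    sum(rate(request_duration_seconds_sum{job="frontend"}[1m]))
      /
    sum(rate(request_duration_seconds_count{job="frontend"}[1m]))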
Well, give or take a bit of latency. And then you get this nice methodical approach: something's gone wrong, so what I do is start at the top and look for high latencies or high error rates, and I just keep going down until I find the service that's causing those high latencies and error rates. And then I've pinpointed, in a very unscientific way, the error. There's a company called VividCortex who wrote a blog post about a hierarchical RED method, and this is, I think, what they mean.

The other thing I just wanted to focus in on is the latency measurements. I think I actually picked this trick up from you, Brian. One of the things I always include in my latency measurements is the average. And you've probably seen talks where they say: never look at the average, averages are misleading. And it's true, they are. But I like the average for two reasons. One, it tells me whether my histogram metrics are correct. In Prometheus, histogram metrics are collected as a series of buckets: all requests below a millisecond, all requests below 10 milliseconds, all requests below 100 milliseconds, and all requests above that. And if you get your buckets wrong, for instance if all your requests land in the top bucket, Prometheus will tell you they're all taking 100 milliseconds when they could be taking 25 seconds. So I've got the average, and I always hope it's close to the 50th percentile, because if it's not, if, for instance, the average is above the 95th percentile, that tells me something's wrong with my buckets. The other reason I have the average is that averages sum: if I've got a service that makes a request to another service, which makes a request to another service, the average latency of the first service should be the sum of those of the two services below it. The same isn't true for percentiles. So I find averages useful, but yes, you should still use the 99th percentile, in this case, because that tells me that most of my customers are happy. If you've got any questions, by the way, feel free to interrupt me.

A bit more reading material; the slides will be online. I mean, the slides are already online. There are a whole bunch of other talks; I mean, this is like the fourth time I've given this talk. The good ones are Cindy's, on logs and metrics. Cindy organises the Prometheus San Francisco meetup and is a really good author, so I'd encourage you to read anything she writes. Peter's the same; he's a really good author as well. He mentions the RED method a couple of times; he was at Weaveworks when I came up with this.

Last one: the four golden signals. So this is really what the RED method was supposed to be; I just forgot about one of the signals. (What a "golden signal" actually is, is an antenna sticker you put on your mobile phone; that's what you get if you Google it.) So, for every service, you measure the latency, which is like the duration in the RED method. You measure traffic, which I guess is like the request rate. You measure the error rate, which is exactly the same; it literally stands for the same thing. And then you measure the one I forgot, called saturation: how full your service is. How full your service is... how do we measure that? I mean, I don't know about you guys.
I guess I should really go and do a load of capacity planning and load testing of my service, and know that with one CPU I can handle this many requests a second. But I have other things to do with my time, so I don't know how full my service is. And then someone suggested to me: well, the easiest thing to do would be to look at the CPUs, and compare how much the service is using against the CPU limit you put on that service. And that sounds actually really easy with Kubernetes and Prometheus. On our Kubernetes cluster, we give every pod a CPU request: we come along and say, this pod really shouldn't use more than one CPU. And we do this because we run our dev environment on the same cluster; we've got different users and different jobs all on the same cluster, and we don't want them to interfere with each other. And it turns out we can just look at the proportion of that request that a service is using, and that gives us a kind of indication of how full the service is.

So this one is actually a real demo. Here we go; I'm going to have to copy some things out. This is all in the Clumps dashboards, by the way, guys, if you want it. (It would be kind of useful if I could clear this all out, instead of having to reload the page. You're on a nice white slide.) So, in the Clumps dashboards there's this recording rule we've written, something like namespace_name:kube_pod_container_resource_requests:sum, which tells us, based on the namespace a job is in and the name of the job (it aggregates together jobs of the same name), how many CPUs they request. There you go; let's give this a second. Yeah, so we basically see our ingester in Cortex requests six CPUs. That's quite a lot; I thought it would take more than that, actually. And then we can take a different one, the rate of container CPU usage seconds, which tells us how much they're actually using. And we can just divide the two (a sketch of the full query is reproduced at the end of this transcript). And then we need to match on namespace and name. If that works... it doesn't work. Namespace, name. That's what's wrong. Oh, that's going to really annoy me now. What have we done wrong? We're probably not running the latest version. So, okay, I'll go there. That works. Good. So now let's divide them. There's a bug there somewhere. And let's take that. And this is how you know the demo's real: if it had worked the first time, it would have been far too polished. And here we go: we can now tell how full our services are. And because we're monitoring the FOSDEM Wi-Fi traffic with our system, you can see... ah, time's up. Luckily, this is the last thing I was going to say. You can see that our querier in dev is actually 20% over its capacity. So now we know how full it is. And that's it from me. There's more you can read, and there's a summary. Okay. Thank you very much.

The metrics are reported per service. Do you distinguish between the different operations of the service, since they have different characteristics, or does that not make sense? No, it very much does make sense. The question, if you didn't hear it, was that the metrics are reported from the service, but do you distinguish between different operations on the same service? Now, there are, like, two schools of thought. I guess in the microservice world, you're not supposed to have different operations; if it's a different operation, it should maybe be a different service. I don't necessarily subscribe to that.
So I do: in fact, as you saw in the first bit of code, we include the method in the metric. And one of the things that's particularly useful, especially when alerting on these, is alerting on them broken down by method, because, for instance, you might see low latency overall, but if you break that down by method, one method, which is really low traffic, might have really high latency. So we do actually alert by method as well. So yeah, it's a good point. Any other questions here? We've got time for another question.

How would you do that if you have something like Akka running? Because those aren't synchronous calls; would you still have this directed graph there? If you have what, sorry? Akka, the reactive framework. Akka? Yeah, no, I don't know anything about Akka. There was supposed to be a fourth section to this talk, which I cut because I didn't think I'd have time, on how you monitor stream processing platforms; maybe that's like Akka. But I would advise a kind of dye-testing approach, where you inject some synthetic events and then export histograms around each stage saying how long it took to process those events. That's how we did it. Any more questions there? Any more questions? Up at the back here.
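(For reference, the saturation calculation from the demo looks roughly like this. A sketch: the recording rule name and its labels are an assumption based on what was said, and in practice the cAdvisor series may need relabelling before the two sides share the namespace and name labels.)

    # How many CPUs each job is actually using, aggregated by the
    # namespace it runs in and the job's name...
    sum by (namespace, name) (rate(container_cpu_usage_seconds_total[1m]))
      /
    # ...divided by how many CPUs it requested, via the (assumed)
    # recording rule from the Clumps dashboards.
    namespace_name:kube_pod_container_resource_requests:sum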