So in this talk I just want to look at the word monitoring, because lots of people have different meanings for that word. Just so you know who I am: I'm one of the core developers of Prometheus. This is not a Prometheus talk, don't worry, there's plenty more of those on this track. I studied computer science and strategy, was at Google for many years, and have been contributing to open source for many years as well, not just in the sense of Prometheus, around which I'm trying to make money with consulting and support.

So: monitoring. Monitoring. Monitoring. When you see this word, is everyone here thinking the same ideas, the same notions? Because I've talked to a lot of people and I find we are often talking past each other. When I'm thinking monitoring, I'm thinking of a Prometheus-y solution. Other people are thinking the ELK stack. Other people are thinking you're the guy who checks whether or not I'm accessing Facebook during work time; I've had that one a few times. So lots of people have different meanings for this, and it leads to a situation where we're talking past each other. We're using the same word but talking about completely different systems, with completely different trade-offs and completely different constraints, so we can't actually have a proper meeting of minds, and everyone just leaves angry, thinking: yes, I'm right, they're wrong, everyone should use my solution. This is silly. We should all be thinking about the same ideas and understand where the other person is coming from, and that's what I want to address in this talk, so that we at least have a shared vocabulary and are aware of the other types of monitoring systems out there.

The other thing you'll notice is that a lot of the monitoring we're doing today is actually based on tools and techniques from the 80s and, largely, the 90s. In the last few years there's certainly been quite an upsurge in monitoring stuff, both commercial and open source; in the last five years or so you've got New Relic, AppDynamics, you've got InfluxDB, a whole pile of stuff came out. But we're still stuck in the past in a lot of ways. It comes from an era where, you know, you had the server, or the server rack, singular, managed by the sysadmins, plural, and each of them got personal attention, very artisan, loving care and attention. Special cases were the norm. Everything's a snowflake. Everything is a pet, not cattle. And this impacts how we approach monitoring, because if you come from a world where every machine is special, then all the monitoring is also special. We've moved beyond this in other places, so we should really do it for monitoring too. Because right now we are still often, to quote CMU, feeding the machine with human blood. We are feeding it with fatigue, with engineer burnout, all these things. And we don't need to; the system will work reasonably well without that.

To give an example: at a previous company I was at, the on-call was getting, I think, 2,300 alerts a week, of varying levels of utility. I went in and basically deleted all the alerts and added about five new ones to cover what we actually care about. So the volume of alerting went from about 2,300 a week down to, I think, five a week, with no loss of availability; customers never noticed.
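To make that shift a bit more concrete, here is a minimal sketch of what a symptom-based alert condition can look like. This is my illustration, not something from the talk: the thresholds, field names and request numbers are all invented, and a real system would pull these figures from its metrics store rather than from hard-coded lists.

```python
# Sketch: alert on user-facing symptoms for the service as a whole,
# instead of per-machine causes like load or CPU.
# All data below is made up for illustration.

def error_ratio(requests):
    """Fraction of requests that failed, aggregated across all replicas."""
    total = sum(r["count"] for r in requests)
    errors = sum(r["count"] for r in requests if r["status"] >= 500)
    return errors / total if total else 0.0

def latency_quantile(samples, q=0.99):
    """Crude empirical quantile of request latencies, in seconds."""
    ordered = sorted(samples)
    return ordered[int(q * (len(ordered) - 1))]

# Aggregated over the last five minutes, across every replica of the service.
requests = [
    {"status": 200, "count": 29_400},
    {"status": 500, "count": 320},
    {"status": 503, "count": 30},
]
latencies = [0.05, 0.07, 0.09, 0.11, 0.40, 0.95, 1.8]  # seconds

# A handful of symptom alerts like these can replace hundreds of cause alerts.
if error_ratio(requests) > 0.01:
    print("PAGE: user-facing error ratio above 1%")
if latency_quantile(latencies) > 1.0:
    print("PAGE: 99th percentile latency above 1 second")
```

The point is not the specific numbers, but that the condition is expressed in terms of what users experience, aggregated over the whole service rather than per machine.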
Think about what that means: we were doing all this work, and taking no small amount of burnout from all this alerting, from this old-school way of doing monitoring, for no real reason. Those are the sorts of things I want you to think about.

So let's start by looking at the past, and at some of the main tools you'll have come across. The oldest monitoring thing that's still in common use is basically RRD, which comes from the heritage of MRTG. It was started in '94 by Tobias Oetiker, and I hope I've pronounced that correctly. He created a Perl script because, you know, that was the fashion at the time, and that became MRTG in '95. It basically worked with SNMP and with external programs, and it just constantly rewrote an ASCII data file with all the data every time it scraped. I remember setting this stuff up 15 years ago on my home system and it was amazing; I'd never seen a graph of my load and my CPU usage before. It was really amazing, with one-minute resolution. In 1997 some of the code moved to C to make it more efficient, and, interestingly, that's also when RRD, the round-robin database, was started to improve performance; it came out in '99. RRD was the basis of many tools like Graphite, although that's using Whisper these days. But if you look at the heritage, where is it all coming from? It all started from a tool, a Perl script, written in 1994. So when you're talking about Graphite, or anything that's Graphite-like, it comes from 1994, which is, what, over 20 years ago? Nice modern software, lads. And this is an MRTG graph. You've all seen them; they're still very common in the networking world. But yeah, seeing that for the first time, isn't it awesome?

Then there's Nagios. In 1996 it actually started as an MS-DOS program to do pings, which is interesting. The actual project itself started in '98, became NetSaint, and was renamed for legal reasons in 2002 to Nagios, a name which we all know and love. And basically it runs a script, it looks at the exit code, and if it's non-zero, you get woken up. There are lots of things still following the Nagios model, so Icinga, Sensu, Zmon, which is basically a Python-based one, all trying to scale this up in various ways, but still following the same basic idea of: I run a script, I get an alert if it fails. That's what it does. So here's the Nagios dashboard.

So that's some of the history, and you can see the stuff we're basing things on is from '94, '96; we're talking 20-year-old ideas, systems that came from a different era in terms of how we manage machines. We're easily managing, what, three orders of magnitude more machines now; going from 10 to 10,000 is not unusual. But it left us with another legacy: because we already had this RRD-graphing versus Nagios-alerting split, we ended up in a world where graphing and alerting are separate, totally different concerns, which has some interesting effects as well. And it's also a world where, because everything's a pet, even the slightest deviance, the slightest increase in load, must be jumped on by a NOC. Because obviously the most efficient use of engineers is to have them staring at a screen and noticing if a line moves. Good news: it's 2017, we have computers to do that. But if we're talking about a cloud environment, or even an environment from 10 years ago, where hundreds or thousands of machines and applications is not unusual, this does not scale.
Unless, that is, you're going to throw a whole pile of engineers, or juniors, or operators, or whoever, at it, to get burned out managing "hey, the load is 4.1, you should do something about that". And, yeah, that matters.

So I think we need to step back and look at things from first principles, and answer the question of what monitoring actually is. We can't just have this alerting and graphing and then jump on everything that looks slightly off; there are just too many things that are slightly off, because real world. And we probably want to look at other things as well, like logs. Logs are kind of handy for debugging and finding problems; how else are you going to find out that your hard disk failed? Browser events too; the web has happened since all this started, what was it, '94? We'd have to check the exact dates, but MRTG is potentially about as old as the web. We need to look at these things, and look at what the actual problem we're trying to solve is, rather than what these tools from the 90s give us.

So I see monitoring as basically three things, plus one extra: knowing when things go wrong, being able to debug the thing that is wrong, trending over time, and, well, when all you've got is a monitoring system, everything looks like a monitoring problem. If you talk to people about monitoring, a lot of them will say it's just alerting. I have met people who think that monitoring is only Nagios, that's all you'll ever need, there's nothing else involved. And yes, certainly alerting is important. But the question we need to ask is: okay, we want to detect what's wrong, because wrongness is bad and we want to fix it, but what sort of wrongness is it? What level of wrongness? If we have a blip that lasts half a second and causes a problem, that's not something we can wake someone up for. But if it's a latency issue affecting users, or things are down, that's something you do want to wake someone up about. So we need to think about severity and where we best invest our time, where's the best return for sending an alert, rather than just saying, oh yeah, the load is higher.

Because fundamentally we're humans, and one analysis would say that realistically an on-call engineer can handle two events per shift. Because, well, it's going to take you four hours to debug it, do the post-mortem and go through all the steps, if it's a real problem. If you alert on every single thing, you're going to get pager fatigue. If you're ever in a situation where you receive an alert and go "meh" and put the pager back in your pocket, you have a problem, because it means that when there's a real problem you won't notice that, no, this time it's actually a true positive for once. So you need to care about your signal-to-noise ratio. And if that's happening in the middle of the night, it's ruining your sleep and you're not an effective engineer.

The thing is, though, you care about user-facing latency, for example, because that's important; it directly drives revenue in many systems. And there are hundreds of things that could cause an increase in user latency. You could have a deadlock. You could have the network being weird. You could have someone running a batch job on the same machine. Someone could have, you know, hit the network cable with their backside and now it's negotiated 10 megabits rather than 100. There are hundreds of things that could cause latency to go up.
Trying to alert on, or even think of, every single one of those is never going to happen. However, the thing we care about is not, you know, what the network interface negotiated. The thing we care about is user-facing latency. So, hey, let's alert on user-facing latency. And what you'll see is that if you have a system like, say, the Nagios-style systems, where all you can do is run a script, you can alert on CPU usage, that's not too hard, but you can't alert on user latency, because that requires math and analysis a script like that can't do. And even if you could, you're still only doing it on one machine, whereas these days a service might be on 10 or 100 machines. So you end up with false positives because, hey, logrotate ran too long and used too much CPU, and false negatives because, well, a deadlock is going to use no CPU and never trigger your CPU alert. The closest you could get to a user latency alert was a high CPU alert, so that's what you did. And the result is spamming alerts, waking people up in the middle of the night, and missing real problems. That kind of sucks.

So what I would say to you, whatever we're talking about with alerting and whatever the monitoring system is doing: alerts should require intelligent human action. Right? We're engineers, we're paid for our brains, not to do what a shell script can do; leave that to the shell scripts. And alerts should actually relate to real end-user problems. Your users do not care if your machine has load four. Your users care if they cannot get to their cat videos. That is the reality of the thing. Sure, you might get the warm fuzzies from saying, yes, the load is now lower, that minor weirdness is solved. But there are so many minor weirdnesses that you'll just end up fatigued and burned out if you try to do that.

Once you have your alert, you need to investigate it. Let's be honest: if you have an alert like "disk full", it's pretty obvious what you need to go looking at, or "high CPU". If you get an alert like "the latency is high", yep, you now need to figure out what's going on from there. So you need some way, with dashboards and whatnot, to be able to drill down through your system and figure out what the problem is. If you think about your stack of services and how they talk to each other, it's a tree, which is hopefully acyclic. So you can start at the top and say: right, there's high latency. Is this service overloaded? No. Its backends: have any of them increased in latency? Ah, that one has; look at it. Okay, this service, is it overloaded? No. Have any of its backends increased in latency? Ah, this one. This service, is it overloaded? Yes. Okay, why is it overloaded? Stepping through things logically, rather than looking at, you know, graphs and hoping you're going to get divine inspiration to, A, remember what the graphs mean, and B, find the right one. You need to think about this logically and methodically, because we need to scale out how we approach these problems.

And you don't just use graphs. You might have to use logs. You might have to use source code. You might have to do some end-user debugging. You might need tcpdump. Bring in more tools as needed; there is no one tool that will help you debug problems. You need a pile of tools. You're going to need some metrics,
logs, profiling tools like GDB, and the source code. Because if you're up against any non-trivial problem inside an application or whatnot, you're going to need all of these. As an example, recently I was optimizing Kubernetes itself, and I was using a mix of source code, profiling and metrics; logs were of no use in that particular case. But I couldn't just use one of those tools, I had to use them in concert, and without them I would never have made some rather nice performance improvements. When you're thinking about problems, you need to not lock yourself into one way of looking at the data; you want as many ways as possible.

The third use case for monitoring is trending and reporting. Alerting and debugging tend to be on the order of minutes to maybe days; trending and reporting is weeks to years. Much longer term, much more strategic. You might want to know: hey, how is my cache hit rate changing? Is anyone using some obscure feature? Will I need more machines? So it's more long term: being able to make engineering decisions, being able to make business decisions.

And the final one is plumbing, because sometimes people just want to use your monitoring system as a data bus. This is not monitoring, but it's done often, so you should be aware of it, because it's going to happen. Basically: what are the APIs in and what are the APIs out? Often it's for control systems of some form, whether business or technical. And to be honest, if someone needs to pass one piece of data through once a minute, that's basically free, why not, assuming all the reliability stuff is okay. If they want to push through lots of data, there's this thing called Kafka, or whatever your preferred message bus is.

So now that we've talked about the three general goals of monitoring, how do we actually do it? What data should we collect? How should we process it? What sort of things do we need to decide on? Because let's be honest, money doesn't grow on trees, resources aren't free, time is not free. We can't just save everything all the time; that doesn't work. At the core of all monitoring are events. An event is something that happens: I come in here, I sit down, I turn on the machine. More commonly, since we're talking about computers, it's going to be an HTTP request, a packet, a library call, a probe. And there's context: if I do that, it's that I did it, inside this room, at this time, in the context of FOSDEM. All that context adds up, so a single piece of data, a single event, with the call stack and everything else, might have hundreds of pieces of context when you consider everything, like what browser was I using. And there can be millions of events per second at the low level, because we're thinking about all the function calls being made, all the context switches, all the ins and outs, which you could potentially want data for.

So I see that there are basically four classes of monitoring, very roughly: profiling, metrics, logs, and distributed tracing. If you're talking to someone about monitoring, they normally mean one of the first three; the fourth one is newer. And they're usually only thinking of that one, not of the others. But you need all of them in a modern environment, because they're complementary; they all have different trade-offs. So, profiling is the one we all know and love: tcpdump, GDB, strace, DTrace. Basically, if Brendan Gregg has ever talked about it, it's profiling. That's the easy way to think about it.
Also read his stuff, it's awesome. And this class of tools is awesome because it gives you really detailed information about individual events: I can see how many times this kernel function was called, and, oh yeah, that caused whatever weirdness Brendan Gregg has found for the first time. But because this data is so detailed, you can't turn it on all the time. You wouldn't have the disk space for it, and it would hit your performance like nobody's business. So any use of it needs to be highly targeted, and it needs to be temporary. If you're turning this stuff on for more than a few minutes, or an hour, depending on what you're doing, you're going to run out of disk space or network capacity or whatnot. So profiling is great if you have an idea of what's wrong. Ah, I think it's kernel latency; grand, let's look at the kernel stuff. Or, oh, I think it's a networking problem; let me get a tcpdump on both sides. But you're not going to go grabbing kernel traces when you don't know what's going on in the first place, because what are you even looking for? So that's profiling, and it's kind of handy.

The second one is metrics. Profiling looks at all the events with all the context, but we can only do it for a very short period of time. With metrics, what we do is ignore the individual events, but keep the context, and aggregate based on that over time, temporally, maybe every minute or every 10 seconds. So with metrics, if you have an HTTP request coming in, we wouldn't track, you know, the exact URL it looked at. But we will tell you that there were 3,000 HTTP requests in the last minute, that 14 of them failed, that 17 of them hit cache, that one of them hit some weird obscure code path, and with that we can get a general overview of what's going on. We can't tell what's happening with individual events, although you can cross-correlate to kind of figure it out, but we get a nice big broad view of the system. That's what metrics give you, because you can track, practically speaking, tens of thousands of metrics about a single binary; don't try too many more than that. But you can't break them out by too many dimensions, because every single one of those, all that cardinality, is one more metric, one more time series. So if you broke out metrics by, for example, customer email address, and you have more than 10 customers, that's not going to end well, because that's what eats all your resources. So metrics are great for figuring out generally what's going on, but not so great for tracking individual events, because you've made the trade-off: I'm having breadth, not depth.

On the other side we have logs. Logs track individual events. So for that HTTP request from earlier: that was Mr. Foo visiting my endpoint, yesterday at 7, he got a response of this size, with this status code, at this endpoint. But because it's per event, you're limited bandwidth-wise, whether that be disk or network, to a hundred or so fields in your logs. And the data volumes involved typically mean there's a propagation delay in processing all this stuff. So the ELK stack and Graylog, that sort of solution is what you're looking at there, syslog, things like that. And logs are a little fractal here; another thing people don't realize is that we're talking about different types of logs when we say logs, because there's once again a pile of trade-offs.
At one end you've got your transaction logs, or audit logs, where you basically have to have them; you cannot lose a single piece of data, or something really bad happens. You can imagine, if you're in financial trading, everything's transactional: you can't lose it, and you have to keep it around for seven years for compliance. At the other end you have debug logs, which are basically profiling, and which you might keep around for a few hours if you have the disk space. The trade-offs involved are completely different. For one of them you basically need a CP system and everything absolutely perfect; the other one is, yeah, there's data there, it might be there in five minutes, it might not, and if we lose it, who cares? Those are different trade-offs. And in between you've got request logs and application logs. Application logs, for instance, need to be readable by a human, by eye, without any tooling, because that happens sometimes. Entirely different trade-offs again.

And the final class is distributed tracing, which is really a special case of logging. But in the more cloudy environments, with microservices and other distributed systems, you need this one, because trying to figure out certain problems with just logs and metrics doesn't work. What it does is: each request, as it comes into the system, gets an ID, and as it causes a chain reaction of further requests, that ID is propagated and logged with each of them. Then it pulls back together all the requests that resulted from the original one and says: right, here's the full tree, the timings, and what caused what. When weird stuff is happening, you kind of need this. So OpenTracing and OpenZipkin are the sort of thing to look at there. And you get this sort of result from it: the main request causes this, which causes this, which causes these parallel requests, and they responded in this much time, therefore such-and-such was slow. So it's an interesting use of logging.

So if we go back to the original question of what monitoring means: monitoring isn't just blackbox alerts going to a NOC. It's not just per-host graphs. It's not just aggregated whitebox application graphs. And it's not just the big TV screen on the wall. How many people have been in a place where it's: we need monitoring, so let's spend three grand, get a massive screen, put it up on the wall with graphs on it, and monitoring is solved? I'm seeing a lot of smiling faces from people who have been in companies where that's happened. Yep. And there's also the question, the big question: what should we monitor, and what does the word monitoring mean? Very generically, monitoring is the tools and techniques you have for keeping an eye on your systems, and that includes the humans and the culture and the policies around them. Just as important, as Jason Dixon was saying, there is no silver-bullet product. There's no product that's going to do everything for you. You can't grab all the data; you have to make some trade-off at the end of the day.

So, to summarize, in the four minutes I have left: the goal of monitoring, the thing we're always trying to do at a high level, is to know when things go wrong, to be able to debug those things, and then, more strategically rather than tactically, to be able to make business decisions based on trends. And, you know, to move data from A to B sometimes, because someone's going to want to do that.
And the sorts of monitoring systems you'll see are: profiling, which gives you lots of data in lots of detail, but which you can only turn on briefly and in a very targeted way; metrics, which are great for breadth; logs, which track individual events; and distributed tracing, which you need to debug distributed systems. And I will say as well, as we look at these things, that what made sense 20 years ago when you had a handful of machines does not make sense when you have a handful of data centres, each with a handful of services, each with a handful of replicas, across, you know, thousands of machines. In the same way that, say, 10 years ago we started using tools like Chef and Puppet to stop doing deployments by hand and to automate all that stuff, we need to take the same mindset and apply it to monitoring. We need to stop caring about individual machines being slightly slow, as long as the service in aggregate is doing okay, and make that culture shift. So think about it the same way: how do we scale up as we go from five machines to 10,000 containers that are recreated every hour? Because that is a reality we have to deal with. Incidentally, I hear Prometheus is good for metrics.

So if there are any questions, I think we've got a few minutes before I have to stop. Three minutes. Any questions? Is there one there? Yeah. Fabian, do you want to get hooked in? Wherever you are. Hello. I do not really understand the difference between profiling and logs. Okay, so profiling is so information-dense that you cannot turn it on all the time. Imagine tcpdump on anything that's not extremely low-traffic: if you turn that on, it's going to be a hit on the network and a hit on disk space that you just can't sustain. If you go and ask a network person, "I want to mirror all the traffic going through this switch onto this machine", it's: wait, wait, wait, the switch has far more aggregate bandwidth than this machine can take. And if you want to track every single kernel call that's being made, that's going to add maybe double, maybe triple the time it takes to do a kernel call. So you can do that for debugging, and hopefully it's not going to affect the thing you're debugging too much, but you can't turn it on all the time, because it's too much data and it's going to slow you down too much. Does that answer your question? Cool.

One more question maybe we can squeeze in. No? So, you said that a lot of alerting should be done based on user-visible aspects of the platform; I don't really agree with that. It's useful, of course, but there are lots of problems that can arise from easily identifiable issues that are not user-visible; for example, disk space filling up, where at some point you want to get an alert just to say we are over 90 percent. So the question is basically about saturation alerts: running out of disk space, or other situations where you're running out of some quota. If you know that running out of a resource is going to imminently cause an outage, and imminently in this context means about four hours, because that's about how long it takes to deal with an issue like that, then yes: running out of disk space, running out of any form of quota, whether that's file descriptors or something you've bought off your CDN provider, that should be an alert, because you're just preempting a user-visible alert that's going to happen in, let's say, four to 24 hours.
But if it's just, yeah, that machine filled up its disk, but that's it, we have another hundred of them which haven't, you know, you can deal with that in business hours. So it's about thinking: okay, is this a user impact right now, or in the very near future, such that it's worth getting a human out of bed to look at it immediately, or is it something we can deal with in business hours? That's the distinction, because it's silly not to prevent outages, but if we're just saying, yep, disk usage just went to 70 percent, it will take two years to fill, but it hit 70 percent so we need to page someone, that's silly too. That's where the line is. So, thank you very much.
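As a footnote on that last answer (my own illustration, not part of the talk): one way to express "will this saturation become user-visible before a human could reasonably respond" is a simple linear extrapolation of usage. The numbers, thresholds and helper function below are all invented for the sake of the example.

```python
# Sketch: page on predicted disk exhaustion only if it will happen within
# the ~4 hours it takes a human to respond; otherwise file a ticket or
# do nothing. All figures are invented for illustration.

def hours_until_full(samples, capacity_bytes):
    """Linear extrapolation from (time_in_hours, bytes_used) samples, oldest first.

    Returns float('inf') if usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_hour = (used1 - used0) / (t1 - t0)
    if growth_per_hour <= 0:
        return float("inf")
    return (capacity_bytes - used1) / growth_per_hour

capacity = 500 * 10**9                           # 500 GB disk
samples = [(0, 460 * 10**9), (6, 472 * 10**9)]   # usage measured 6 hours apart

remaining = hours_until_full(samples, capacity)
if remaining < 4:
    print("PAGE: disk predicted to fill in %.1f hours" % remaining)
elif remaining < 24:
    print("TICKET: disk predicted to fill in %.1f hours" % remaining)
else:
    print("No action: roughly %.0f hours of headroom" % remaining)
```

In a real system this check would run against a metrics store rather than hard-coded samples, and you would want a better fit than two points, but the shape of the decision, page only when the outage is imminent, is the same.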