Hello and welcome again to another OpenShift Commons briefing. It's a Wednesday in June, and it's been quite an interesting year — I think we're up to number 77 in our briefing series. This is yet another talk on monitoring, because monitoring seems to be one of the most popular topics, and maybe one of the most diverse, because it means so many different things to so many different people. A couple of weeks ago we had Brian Brazil from Robust Perception come on and talk a little bit about Prometheus and other aspects of using that technique for monitoring OpenShift and Kubernetes, and out of that conversation we decided we needed a higher-level conversation about what monitoring means for cloud natives, so that we could level the playing field. So we brought Brian back: we're going to get him to give a bit of an overview of the whole thing, and then we'll have some time for Q&A and conversation at the end. Without further ado, Brian, take it away.

Hello. As Diane said, I'm Brian Brazil, and I'm here to talk about what monitoring means, because a lot of the time when people are talking to each other about monitoring they're really talking past each other. To give you a bit of background about who I am: I'm one of the core developers of Prometheus, even though this is not a Prometheus talk at all — it's much more high level than that. I was at Google for a few years, I've contributed to a number of open source projects, and I work professionally in monitoring, doing consulting and support for Prometheus, in the course of which I interact with a lot of other monitoring systems because people want to integrate them. So you learn a lot about how they all work and all the differences.
So let's consider the word "monitoring". When you see this word, are you thinking the same sort of things that I am thinking? That's the question I want to look at and address here. I want to start off with some history, because if you talk to a lot of people today, what we do is really based on things that were really good ideas many decades ago. This is back in the day, well before things like OpenShift existed, when you had a handful of machines — never even a full rack — cared for by sysadmins who were more like artisans, giving loving, caring attention to everything, and special cases were the norm. The thing is that our tools have moved on greatly — we went from manual configuration to tools like Chef and now to Kubernetes and OpenShift — but many of our practices are still stuck back in those days. As someone else put it, we're still feeding the machine with human blood, doing lots of things which are basically just burning people out for no real gain in productivity for anyone.

So let's start by looking at some of the historical tools, many of which are actually still used today, beginning with MRTG and RRD. In 1994 Tobias Oetiker created a Perl script which became MRTG, released in 1995, and it was used to graph SNMP data — network stats. In fact there are many of these websites still around today with these graphs; any time you come to an internet exchange and they have graphs, they're probably still MRTG. It started off storing the data in a big ASCII file that was rewritten every time, which isn't exactly performant; it was moved to C a bit later, and later on came RRD, the round robin database, which greatly improved performance over rewriting an ASCII file. That was released in 1999, and RRD is actually the basis of a lot of more modern tools like Graphite, although Graphite has a more advanced system these days. If you've ever seen a graph in that classic style, it's from MRTG — very common for bandwidth, as I said. So that's your graphing.

On the other side you have your alerting, which is Nagios. It actually started out in 1996 as an MS-DOS application — not at all a Unix thing — and all it did was pings. It became a proper project in 1998 and was first released in 1999 as NetSaint, which got renamed to Nagios in 2002 for legal reasons. Basically what it does is run scripts on a regular basis, and if the exit code is not zero, it sends an alert. There are a lot of projects inspired by the Nagios way of doing things: for example you have Icinga, which is an extension of it, you have Sensu, and you have ZMON, a Python take on it which actually comes from a German clothing company called Zalando, who have offices here in Dublin as it happens. Nagios looks something like this: you have a machine, and it has checks, which are the scripts that pass or fail (a minimal check script in that style is sketched below).
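To make the Nagios model concrete, here is a minimal sketch in Python of a check script in that style — not an actual Nagios plugin, just an illustration. The exit-code convention (0 OK, 1 warning, 2 critical, 3 unknown) is the real plugin convention; the disk check and the thresholds are invented for the example.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Nagios-style check: a scheduler runs this script
periodically and interprets the exit code (0=OK, 1=WARNING, 2=CRITICAL,
3=UNKNOWN). The path and thresholds here are made up for illustration."""
import shutil
import sys

WARN_PCT = 80   # hypothetical warning threshold
CRIT_PCT = 90   # hypothetical critical threshold

def main() -> int:
    try:
        usage = shutil.disk_usage("/")
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        return 3
    used_pct = 100 * usage.used / usage.total
    if used_pct >= CRIT_PCT:
        print(f"DISK CRITICAL - {used_pct:.0f}% used")
        return 2
    if used_pct >= WARN_PCT:
        print(f"DISK WARNING - {used_pct:.0f}% used")
        return 1
    print(f"DISK OK - {used_pct:.0f}% used")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

A scheduler runs something like this every few minutes against each host and raises an alert whenever the exit code is non-zero.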
The thing is, we had this world where on one hand you had MRTG and on the other hand you had Nagios: graphing, which is your white box monitoring, looking inside the application, and alerting, which is black box, from outside the application — separate concerns, separate tools. This is all from a world where all machines were special and services only tended to live on one machine, because we didn't have the scale that required more than that. It was also a situation where, because things were small — not many machines, not many services — if there was even a slight deviation, there would be an engineer to just jump on that problem. This is not the world we have in a cloud native environment with a system like OpenShift, where we think of machines and services as cattle rather than pets, and where deviation is the norm, because we simply have so many machines that it's going to happen. So if we ask what monitoring is in the modern age, we need something that's not just alerting, graphing, and jumping on everything that might be a problem. We also need to look at things beyond the machine: we need to care about logs, and about browser events too, because lots of users access things through web browsers, and the JavaScript in those browsers is getting pretty complex and needs to be monitored in and of itself.

What I think we should do, rather than looking at what the tools give us, is look at the problem statement of what we're trying to do. I see four categories of monitoring: knowing when things go wrong, debugging, trending, and then plumbing.

Let's look at each in turn. The primary thing for monitoring is alerting — if you're in Nagios, that's almost all it's going to be doing. We're trying to detect when things go wrong, and the question is: what is the wrongness we want to detect? Do we want to look for a blip, because a packet was lost due to a solar ray, or do we actually want to focus on latency affecting end users? Because at the end of the day, what we care about is whatever product we're producing — whether that's a social good, something for money, or just a back office process that's handy to have — and we want to focus on that rather than everything. If you alert on everything, it just isn't going to work, because humans are very limited in what we can handle. We can't work 24/7, we need reasonable amounts of sleep, and if you get alerted on every single thing that could be a problem, you're going to burn out and get what's called alert fatigue or pager fatigue — there are a few other names for it, but it's basically burnout, and burnout is bad. That's not the best way to use people at all. We're engineers and should treat problems like engineers.

The key insight is that you care about, let's say, user-facing latency, and there are hundreds of things that could cause it: a machine could have failed, the rack switch could be slow, you could have lost some power, maybe routing changed and you're now going the long way around Europe. You cannot possibly alert on every single one of those — there are hundreds, maybe thousands of them, and you're going to miss some. However, what you care about at the end of the day isn't that a rack switch has failed; it's that the user is having a bad experience. So if you can alert directly on the user having a bad experience, that's just one alert, and it's going to catch a multitude of real problems (a rough sketch of such a symptom-based check follows below).
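As a rough illustration of alerting on the symptom rather than the cause, here is a sketch that checks user-facing latency against a simple objective. It assumes a Prometheus server on localhost:9090 and a latency histogram named http_request_duration_seconds exposed by the frontends — both are assumptions for the example, not something from the talk — and in practice you would express this as an alerting rule in your monitoring system rather than a standalone script.

```python
"""Rough sketch of alerting on the symptom (user-facing latency) rather than
on causes like CPU. Assumes a Prometheus server at localhost:9090 and a
hypothetical histogram named http_request_duration_seconds."""
import requests

PROM = "http://localhost:9090/api/v1/query"
# 99th percentile latency over the last 5 minutes, via PromQL.
QUERY = ('histogram_quantile(0.99, '
         'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))')
SLO_SECONDS = 0.5  # hypothetical latency objective

def p99_latency() -> float:
    resp = requests.get(PROM, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    latency = p99_latency()
    if latency > SLO_SECONDS:
        # One symptom-based alert covers hundreds of possible causes.
        print(f"ALERT: p99 latency {latency:.3f}s exceeds {SLO_SECONDS}s")
    else:
        print(f"OK: p99 latency {latency:.3f}s")
```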
Let's take the canonical counter-example: CPU usage. There are systems where you cannot directly alert on the latency of your servers, because that information just isn't exposed or useful, but you do have CPU usage — and if you have, say, a process running inside Apache and it's using all of your CPU, you can bet that things are slow. But there are also going to be false positives: if log rotation or whatever background processes happen to take too long, that gives you a false positive. And if there's a deadlock in your code, it's just going to lock up and use no CPU, so you won't detect that either. What this tends to result in is spammy alerts, and whoever receives them just tends to ignore them, which ultimately means, because of that alert fatigue, that they're going to miss real problems — they'll look at it, go "oh, it's that again", and carry on with whatever they were doing. The thing is that alerts should require intelligent human action; they're for engineers, not automatons. It relates to end-user problems as well: users do not care in the slightest if your machine has a load average of four — in fact in modern systems load average isn't even a particularly useful metric — but they do care if they can't view their cat videos. So think about cat videos, not load averages.

The second category is debugging: after you've got an alert, what are you going to do about it? If we follow the guidance above and all we have is a high-level symptom, that's great, but where do we start digging? If you've got a microservice architecture built on something like OpenShift, you're going to have some form of tree structure, and what you want to do is use the scientific method: start at the top and drill down into your problem. Say you have high latency and two backends. You might start at the very root of your tree, the front-end HTTP server, and ask: is it overloaded, yes or no? Then check its backends: has either of them gotten slow? Oh, this backend has gotten slow — right, same process again: is it overloaded? Yes it is — great, we've found our problem. But if its backends look fine, we dig in further with more tools. And there is no panacea, no one tool that will help you debug everything; you have to bring in a myriad of tools depending on how complicated the problem is and what its nature is. Metrics, for example, are great for figuring out roughly where the problem is, but they're never going to tell you exactly which request from the user is to blame — for that you need logs. Neither metrics nor logs are likely to tell you which line of code needs to be optimized — that's something for profiling. And all of those tie back to source code, so you can see what the code is actually doing. You're going to jump between these as you narrow down a problem.

The third case is trending and reporting. Alerting and debugging are short term — minutes to days, maybe weeks — whereas with trending you're talking months to years. For example, you want to notice that your cache hit rate is changing over time so you can add or remove machines: if your cache hit rate is dropping towards 1%, you can just get rid of those machines, they aren't helping you; on the other hand, if it's really up at 90%, maybe you can get rid of some other machines. Maybe you're instrumenting your features, so you know: hey, this is a feature we added, and we can check that no one's used it in two years, so it's pretty safe to remove that code, and it's no longer going to break your continuous integration. And the most common one of course is capacity planning, which is: when will I need more machines? You want to see how the cost per query is changing over time and how the number of queries is increasing over time (a toy extrapolation along these lines is sketched below). So there's lots of trending and reporting that you need for engineering purposes and for business purposes.
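As a toy illustration of that capacity-planning question, here is a sketch that fits a straight line to invented monthly traffic figures and extrapolates when they would cross an assumed capacity. The numbers are made up; real capacity planning uses real traffic data and usually more care than a linear fit.

```python
"""Toy capacity-planning sketch: fit a straight line to historical monthly
query rates and estimate when they cross current capacity. All numbers are
invented for illustration."""

# Hypothetical peak queries-per-second observed each month.
history = [1200, 1350, 1480, 1640, 1790, 1960]
capacity_qps = 3000  # what the current fleet can serve (assumed)

def linear_fit(ys):
    """Least-squares fit y = a*x + b for x = 0..len(ys)-1."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

if __name__ == "__main__":
    slope, intercept = linear_fit(history)
    # Months from "now" (the last data point) until projected traffic hits capacity.
    months_until_full = (capacity_qps - intercept) / slope - (len(history) - 1)
    print(f"Growth ~{slope:.0f} QPS/month; "
          f"capacity reached in ~{months_until_full:.1f} months")
```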
The fourth case I see for monitoring is plumbing, because no matter what you design your monitoring system to do, someone is going to look at it and go: hmm, I want to get this data from A to B, and you have a thing that already transports data — can I use your monitoring system? Now, this isn't monitoring, but someone is going to ask for it. If it's effectively free — they want to transfer a log line a minute, or a metric — sure, that's not necessarily even a bad idea rather than building a proper solution; but it can also get more expensive. So it's something to consider when designing your monitoring system: what happens when people start using it for other stuff, and at what point do you tell them that now is the time to break it out into a proper solution rather than piggybacking on your rather critical monitoring system?

So we have the three goals of monitoring, with plumbing on the side. What are we going to do about it? What data do we want, and how do we want to collect it? The thing is, monitoring isn't free: you can't collect everything all the time, because that would be more data than you're actually processing in the first place — if you tried to monitor everything about your monitoring, the data would explode exponentially. So we need to decide what trade-offs to make.

The core of all monitoring, I'd contend, is the event. An event might be an HTTP request, a packet, a function call — anything. There's also context: who is the user that made it, what IP address is it from, which machines does it hit on the path, what sort of business request is it, and how much data is involved. You can imagine that as an event works through the call stack it might touch 10 or 20 servers and hundreds of functions, so there might be thousands of pieces of context associated with it — and there might be millions of these events per second as requests cascade through the system. There are roughly four classes of ways of dealing with events, and when you're talking to someone and they say the word monitoring, they normally mean one of these classes, or alerting — and which one they mean is where a lot of the disconnects around the word come from. So let's look at each in turn.

Profiling is a pretty broad area; the short version is, if Brendan Gregg has ever talked about it, it's profiling. He's got a great blog. There are lots of tools — BPF, the Berkeley Packet Filter, and things like tcpdump, gdb, strace, dtrace — and what they have in common is that they give you a lot of detailed information about individual events. The problem is that because it's detailed information about individual events, it's expensive: you can't turn these tools on all the time, and you can't point them at everything. So you end up saying, right, I'm just going to monitor this process for 30 seconds and hope nobody notices — which is how CPU profiling tends to work in practice (see the sketch below). These are great tools if you already have an idea what's going wrong — if you know something weird is happening in the network with TCP, use tcpdump; if there are some weird syscalls, use strace — but you kind of need to know what you're looking for before you start using a profiling tool.
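As a small illustration of profiling only for a bounded window, here is a sketch using Python's built-in cProfile; the workload is a stand-in, and in practice you would attach whatever profiler suits your runtime (perf, py-spy, pprof, and so on) to the live process for a short period.

```python
"""Sketch of profiling a process for a short window only, because detailed
per-event data is too expensive to collect all the time. The workload is a
stand-in for whatever the service actually does."""
import cProfile
import pstats
import time

def busy_work():
    # Stand-in for real application work.
    return sum(i * i for i in range(50_000))

def profile_for(seconds: float):
    profiler = cProfile.Profile()
    profiler.enable()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:   # only profile for a bounded window
        busy_work()
    profiler.disable()
    # Show the top functions by cumulative time.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

if __name__ == "__main__":
    profile_for(2.0)  # "profile briefly and hope no one notices"
```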
Metrics take a different trade-off: they ignore individual events and instead track how often particular context shows up. Examples here are Prometheus, Graphite, and anything else in that space. You wouldn't be tracking each individual request and the time it came in, but you do know how many there were in the last while and how many failed. Metrics are always "how many" or "how much", not individual requests. You also have the case where a request hits an odd code path and increments a metric, but again you're not tracking individual events — just how many or how much. Metrics are handy because you can have lots and lots of these "how many" questions, but you can't break them out by too many dimensions: if you were to break out metrics by email address and you have millions of users, that's going to take out most monitoring systems. So what metrics are good for is breadth across your code base — knowing what's going on and doing your initial drill-down when you get an alert. But for tracking individual events and really getting down to the nitty gritty, you need something else.

Which brings us to logs, which are kind of the opposite of metrics: they track individual events, but only a limited set of information for each. So you know it was Mr. Foo who hit this endpoint at this time and got this response with this status code, but you can only have so many of those fields — say 50 to 100 — before you run into bandwidth issues, whether that's disk bandwidth or network bandwidth, just because you're recording it for every single event (a minimal structured request-log record is sketched below). Examples are the ELK stack or Graylog, and there are also commercial tools here like Splunk. But with logs it gets a little more complicated, because just as monitoring has five or six common meanings — and those are just the common ones — there are also a few different types of logs, and you need to be specific about which one you're talking about, because each has different reliability requirements, different volumes, and other considerations.

I'd say there are roughly four types of logs. You have your transactional logs: anything you cannot lose — it might be required for billing, it might be required for legal reasons, it might be passed on to other systems. Losing them is just not an option, and you must do everything in your power to stop that happening. You have request logs: if you're logging individual HTTP requests, or anything tied to a request, it's a request log. You have application logs, which are different from request logs: they tell you about the application and its background processes, what's generally going on in the application as a system, separate from the actual requests — things like garbage collection or SIGHUP handling, managementy stuff and background work happening inside an application. And then you have debug logs, which might be very detailed stuff like exact request information or individual function calls — basically profiling, with massive depth and volume, and because of that it's generally not practical to keep them for very long. These are all different types of logs: if someone says you can't ever lose logs, they're probably talking about transactional logs, and if they say you must log every detail ever, they're talking about debug logs — and trying to treat debug logs as transactional logs will not end well. Application logs, for example, you normally want to be readable by a human without any tooling, and low enough in volume for that to be practical.
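As a minimal illustration of a structured request log with a deliberately small set of fields, here is a sketch using only the Python standard library; the field names and example values are invented.

```python
"""Minimal sketch of a structured request log: one JSON record per event,
with a small, fixed set of fields (the breadth-vs-depth trade-off described
above). Field names are illustrative only."""
import json
import logging
import time

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("requests")

def log_request(user: str, path: str, status: int, duration_s: float):
    log.info(json.dumps({
        "ts": time.time(),            # when the request happened
        "user": user,                 # who made it
        "path": path,                 # which endpoint
        "status": status,             # response code
        "duration_s": round(duration_s, 4),
    }))

if __name__ == "__main__":
    log_request("foo@example.com", "/cat-videos", 200, 0.042)
```

Keeping the field set small and consistent is what keeps per-event logging affordable while still letting you find the individual request that a metric can only count.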
The fourth class is distributed tracing, which is really a special case of logging. Normally logs across different machines are independent; what distributed tracing does is give each end-user request a unique ID that's propagated to all of its sub-requests as they cascade through the system, and then something like OpenTracing or OpenZipkin stitches these all back together so you can see the entire history of the request (a minimal sketch of the ID-propagation idea appears at the end of this section). This is useful in distributed systems in particular, where you have many services and many interconnections, for figuring out what happened when weird stuff happens — so it's essential if you've got a microservice architecture on something like OpenShift. In an example from OpenZipkin you can see the big overall request, the MySQL part, and all the other servers it's talking to, traced over time. Very useful for debugging.

So if we go back to the question of what monitoring means: it's not just some black box alerts going to a NOC, it's not per-host graphs or aggregated white box graphs like you get from Prometheus, it's not just logs, it's not just metrics, and it's not just the big TV screen on a wall — which is where a lot of monitoring seems to end, unfortunately. The way I would see it, monitoring is the tools and techniques you use to keep an eye on your system, see what it's doing, and keep it functional. There is no silver bullet out there that is going to solve your monitoring problems: you need multiple different things — your metrics, your logs, your profiling, and distributed tracing as well. Keep in mind that culture and policy and people are also part of your monitoring system; it isn't just technology and computers. You need to think about how you're going to organize your on-call shifts so that people still get holidays and can see their families. All these things come together.

So to summarize: the goals of monitoring, as I see them, are knowing when things go wrong, being able to debug and gain insight, trending, and plumbing, because it's going to happen. And the classes of monitoring systems: profiling, which has lots of data — so much that you can only use it very briefly; metrics, which are great for breadth but don't work too well for depth; logs, which are great for depth but not for breadth; and distributed tracing, which you basically need for distributed systems, because some things you just can't practically capture with the others. The final thing I would say is: don't be held back by what we did 20 years ago. As you go to a cloud native world using great tools like OpenShift, take advantage of all the classes of monitoring systems, because we must scale ourselves as engineers. We no longer provision machines individually — we get Amazon to do it, we use Packer or whatever — and similarly we shouldn't care about a single machine being dead or slow. We should be caring about services and end users. And I hear Prometheus is kind of good. So, are there any questions?
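To make the request-ID idea from the tracing discussion concrete, here is a minimal sketch in which the "services" are just local functions and the header name is an assumption; a real system would use a tracing client such as OpenZipkin or OpenTracing-compatible libraries rather than hand-rolled propagation.

```python
"""Minimal sketch of the core idea behind distributed tracing: assign each
incoming request a unique ID and propagate it on every downstream call, so
that a tracing system can stitch the pieces back together. The services here
are simulated with local functions; the header name is an assumption."""
import uuid

TRACE_HEADER = "X-Request-ID"   # conventional, but use whatever your stack uses

def frontend(headers: dict):
    # Create a trace ID at the edge if the caller didn't supply one.
    headers.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    print(f"[frontend] trace={headers[TRACE_HEADER]}")
    backend(headers)

def backend(headers: dict):
    # Every downstream call carries the same ID, so logs and spans can be joined.
    print(f"[backend]  trace={headers[TRACE_HEADER]}")
    database(headers)

def database(headers: dict):
    print(f"[database] trace={headers[TRACE_HEADER]}")

if __name__ == "__main__":
    frontend({})   # simulate one end-user request cascading through services
```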
So we all hear Prometheus is kind of good — it's pretty awesome. We definitely do, and I think you should say so, as one of the core contributors to it. It's been pretty much an eye-opener, and its use with Grafana and other things has really made monitoring quite nice.

I'm wondering what you see as the future of where monitoring is going, and what I want to interject is a little bit of the idea of predictive modeling. I've been talking a lot with some of the Apache Spark folks who have been doing deep dives on OpenShift log files and things like that, and we talk so much about being event centric — this happened, and therefore we must be alerted, or the profiling things — but with all of this data, what's on your wish list of things you'd like to see happen for monitoring going forward?

So in relation to predictive stuff, I kind of wish people would stop thinking it's a panacea, because it isn't; in fact it barely works. The problem is that what you're talking about there is effectively building a machine learning model of some form, and if you have a system like Prometheus, which is used to getting, say, a thousand metrics per application instance — and you might have a thousand of those instances, sampled once a minute, with something like 150,000 samples per second coming into the system — there is just going to be so much noise in there that actually pulling out any useful information with a reasonable signal-to-noise ratio doesn't work. The way someone put it to me — someone who works on machine learning systems, the more traditional ones — is that it takes 18 months to train a model, which if you translate that to monitoring means it's going to take you 18 months to set up a single alert. That's not very good.

That's not very useful when you can't get to your cat videos.

Yeah, exactly. You could have someone figure out exact levels of traffic and so on, but it's going to require a lot of maintenance and tweaking, and I would take the approach that for most things — the vast majority — a simple static threshold is good enough, rather than spending a lot of time on something you're going to have to keep tweaking anyway.

Yeah, and I only say that because I've been playing with and looking at some of the Apache Spark stuff that's been coming out, and because we have such huge historical data files and lots of information coming in, it always looks like great data to train on.

The thing is, if you have logs, because there are just fewer fields, there's less potential for noise, so something based on that can also work. That's why, where you often will see some form of prediction is capacity planning, because there you're not looking at millions of metrics — you're looking at three or five, because you only care about global traffic or regional traffic and there are only so many regions — and it's a really important number for costing and for figuring out how many machines you need to buy. So for those it can make sense to spend, say, a permanent half engineer or half statistician or whatever to keep a proper model working for it.
And I also think, as we've changed over to this cloud native model, or just cloud-based things — I really love the metaphor you used at the very beginning about human blood in the machine, as well as the pets versus cattle conversation. Having been a sysadmin back in the early 80s, I remember the blood in the machine, and having pagers early on, and thinking we were special people because only doctors and sysadmins and DBAs had pagers back in the day — and then realizing what a pain that was, and the human cost of always being alerted every time any strange thing happened. I think that in this world where we're more concerned about the outcomes and the availability of the system, as opposed to why this one machine went down — because it's so quickly replaced — everything changes, and so do the tools. That doesn't mean throwing away your Nagios or other things, because some of that stuff is absolutely still necessary to keep things up and running.

Yes, though the newer approach scales better, because the Nagios style of system is still thinking about the machine, where we should — in the OpenShift sense — be thinking about the replica set.

And I think that's the funny thing about cloud: people always think of it as this ethereal, floaty thing out there in the ether, and they forget that there are actually server farms out there somewhere, heating up the environment — real machines. So there's this interplay between what we use as people who are developing apps and deploying clusters of pods onto clouds to monitor our systems, versus the people who are actually putting those machines and servers together — two sets of concerns. It's been a great talk today, breaking it out by the different aspects, so hopefully it's given everybody a good sense of not just the history but where we're going with the different aspects of monitoring. I really appreciate your time today, Brian. I'm not seeing any questions, but that may be because this was a high-level session and we didn't do a real demo — we will do lots more demos coming up, so stay tuned for some of our upcoming OpenShift Commons briefings, because monitoring is near and dear to all of our hearts. And you have an upcoming event in August, PromCon?

Yes, PromCon — in fact ticket sales started yesterday, and you can get those at promcon.io.

And what are the dates for that?

It's August 17 and 18; tickets are just 80 euro, or if you want a chance to win free tickets, my company is actually raffling some off, so if you sign up to our mailing list, which you can find a link to on our website, you've got a chance to win some.

Perfect, and it's in Munich — there's good beer. PromCon is always fun; it's a rather small event but it's growing, and I think this time it's in Munich?

Yes — last year we were 80 people, this year we're 200. It's definitely a big community and it's happening.

So do take advantage of that. And also, I'm sure there'll be a lot of Prometheus folks talking at KubeCon in Austin, where we're going to host the next OpenShift Commons Gathering in December, so hopefully we can get Brian there as well. I think Julius is going, so maybe we'll get some more Robust Perception insights there, I'm sure. All right, well, I'm going to let you get back to your day, and thank you very much everybody for coming on and listening in. We'll put this one up on the OpenShift blog shortly, along with the slides. Take care.