Good afternoon. Today I'll be talking about monitoring, so if you came for the nftables or Ansible talk, this isn't it. I'll be talking only about open source monitoring, and realize that this is really not an exhaustive survey of what's available out there, because there is a lot out there: small projects, big projects, well-known ones and lesser-known ones. Rather, we're going to talk about some of the different types of systems and the stacks that represent them, both the legacy ones that continue to be used today and the new breed that is starting to take over. So if you don't hear about your favorite stack or app, please understand that I intend to cover types, with some examples, rather than be exhaustive.

Before we get started, a little bit about me. My name is Tom King and I've been working in the embedded space for about 40 years. I started working in embedded Linux specifically in 2014 and got involved with different open source projects. Some of them needed system administration and help with infrastructure, so I started doing a lot of that and eventually ended up running build servers and build farms. At some point I realized we weren't getting the performance we wanted out of them, so I started working with monitoring, went through a number of different stacks to figure out what was going on, and finally migrated to what I'll call the new breed, the new approach people are taking these days. I'm going to turn off the camera here because I think it gets a little distracting; let me just check that everybody can see the screen I'm sharing.

So let's talk first about the goals of monitoring, and define what it is and what it isn't, which is harder than it sounds. The first goal is alerting: when you're monitoring something, most of the time you want to know if something is wrong with it, so the first thing we want to be able to do is notify. Most monitoring systems can do that, often over multiple channels; we'll see some in a few minutes. Common methods are old-style text messaging and email, and even business communication stacks; for example, some systems have connectors for Slack and the like. This is the responsive element of your monitoring system. The alert itself should contain information about what's wrong and where to find additional information, so that if you get a text saying something is wrong, you know exactly where to go to start solving that particular problem. Sometimes it might just be an alert that says "hey, it rebooted," and then you get to go look through the logs and figure out what happened.

It's a good idea to set up your monitoring with different priorities, such that the right people get the right alerts. You only want to wake people up for the serious things, the alarms, and let ordinary alerts notify without waking anyone. One of the things that drives system administrators nuts is getting information in the middle of the night that either doesn't require them to act, or that they can't do anything about anyway. "Oh, it rebooted." Well, I'm not going to be able to do anything about that, it's already done, so why is it waking me up?
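Just to make that concrete before we move on: in one of the newer stacks we'll get to later, severity-based routing is a small piece of configuration. This is only a minimal sketch of a Prometheus Alertmanager route, with made-up receiver names and addresses, not something from a real setup:

```yaml
# Sketch: page the on-call person only for critical alarms; everything
# else goes to email without waking anyone. Names/addresses are made up.
route:
  receiver: email-team            # default for ordinary alerts
  routes:
    - match:
        severity: critical        # only real alarms page someone
      receiver: pagerduty-oncall
receivers:
  - name: email-team
    email_configs:
      - to: 'ops@example.com'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: 'REPLACE-ME'
```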
Ideally, though, you want your monitoring system to alert you before the system or network or resource crashes. Yes, you do want to know that you're going to run out of disk space soon, but maybe you don't need to be told at three o'clock in the morning that you're going to run out three days from now; set it up appropriately so that doesn't become an issue. Most monitoring systems only notify: they'll tell us that something is wrong, but they won't take any corrective action. There are a few that can, though, and we'll see some of them in a minute.

The second goal is to collect metrics. For the network, we might want to know how many packets per second we're pushing through, what the actual throughput is, and how many packets are being dropped, so we can tell whether a DoS attack or something else is going on. For a router or firewall, again packets dropped and packets per second: am I seeing a DDoS attack at my router? For a server: is my CPU too slow, am I running out of disk space or RAM, how long does it take to service a request, is the web server overloaded, are we hitting swap? Those things are very important to monitor and collect so you can watch the trend over time. We're collecting for later analysis, and in some cases "later" may be just a few seconds, as we'll see with some of the monitoring software out there today. On the other hand, you may want to go back three or four months: "I think something happened about three or four months ago, let's find it." Oh, look at that, a big change in the graph right at that time. Pretty easy to visualize, then.

The last goal is visualization. This is how you look at the data, whether it's short- or long-term trends or resource usage. Am I seeing RAM usage increase constantly over time, indicating a possible memory leak in some application or daemon? Am I seeing a spike in load or memory usage at 4am every day, and why? You want to be able to investigate that and find out what's going on. Visualization is the final tool you'll use to see what the monitoring is telling you and take action.

So let's take a look at some monitoring systems that have been around for a while and are pretty mature. The oldest tool that I found and used is called monit. Years ago I had a Unix system, and we used monit for processes that got out of control: it would kill and restart them, which made life a lot easier for me. Now, it didn't really tell me what was going on. I would just notice that some application or daemon had restarted, so I still had to go back and do some analysis in the logs, and I might not even know it had occurred unless I happened to notice the uptime or timing had changed. A typical use case: say you have an embedded device such as a router, and dnsmasq, the daemon that provides DHCP and does some other work for you, crashes on your OpenWrt router. You want it restarted so things keep going; otherwise the router is dead, and we don't want the router to be dead.
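For reference, here's roughly what that looks like in monit's configuration language. This is a sketch, not my actual config; the paths and thresholds are illustrative and assume a typical init-script setup:

```
# Sketch of a monit stanza: watch dnsmasq, restart it if it dies or
# stops answering DNS. Paths and numbers are illustrative.
check process dnsmasq with pidfile /var/run/dnsmasq.pid
  start program = "/etc/init.d/dnsmasq start"
  stop program  = "/etc/init.d/dnsmasq stop"
  if failed port 53 type udp protocol dns then restart
  if 5 restarts within 5 cycles then alert
```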
systemd can also do some limited process monitoring, and it can restart ("bounce") services as well; that's a whole topic of its own that I won't cover in this talk.

All right, let's talk a little about RRDtool-based monitoring. This is a class of tools that originally came from the application MRTG, the Multi Router Traffic Grapher. Let me pull up an example; bear with me a second while I get the window sharing sorted out. Okay, this is an example of MRTG. The important thing to note is the graph in the middle: that's actually an RRDtool graph. On the X axis we have time, in this case in hours, and on the Y axis bits per second. Traffic rocks along steadily, then a whole bunch of traffic shows up starting around noon and just keeps climbing and climbing; looks like the site got Slashdotted or something. They all look this way: if you look at any of the monitoring tools in this family and see a graph like this, RRDtool is running in the background doing the graphing for you.

One nice thing about RRDtool-based tools is that they provide a great number of plugins: SNMP for networking equipment, because this family was originally designed for routers and switches and it understands how to talk SNMP very well; plugins for various OSes; and gatherer/collector plugins that send the collected data to an aggregator and display server. So there are two pieces: you put the gatherer/collector piece on multiple different nodes and aggregate centrally, so you can monitor servers, routers, whatever equipment you want. It's also cross-platform, which makes things a lot nicer. Some examples of RRDtool-based monitoring are MRTG, Cacti, and Munin. For many years I used Munin and looked at these graphs pretty much every day to figure out what was going on and whether there was an issue. I liked it pretty well, though it was a little heavy-duty sometimes: regenerating the graphs every couple of minutes was pretty intense on the server.

So let's talk about the advantages and disadvantages of RRD-based monitoring. It's fairly easy to implement: a typical node would take me about 15 minutes to bring up, and I could have a full system running in about an hour, including connecting all the nodes and seeing traffic flow and graphs being generated. It has consistent graphical output, so you know what to expect. And it has vertical integration of monitoring and alerting: Cacti and Munin will let you know if there's a problem, and you can set limits, say, if a value hits this threshold, email; if it hits that one, text. So that makes it a lot easier to stay asleep when you need to.
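As an example of those limits, Munin lets you attach warning and critical thresholds to individual graph fields in munin.conf on the master, and wire a contact to a command. This is just a sketch; the host name, device, addresses, and numbers are all made up:

```
# Sketch of munin.conf on the master: a mail contact plus per-field
# warning/critical thresholds. Host, device, and thresholds are made up.
contact.email.command mail -s "Munin alert on ${var:host}" ops@example.com

[fileserver.example.com]
    address 192.168.1.10
    df._dev_sda1.warning  80    # warn at 80% disk usage
    df._dev_sda1.critical 95    # go critical at 95%
```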
Now the disadvantages. The notifications are somewhat limited: pretty much email, plus some kind of text message if you wire it up yourself. The database size has to be pre-allocated, which is annoying. The presentation of time sequences is fixed: you basically get day, week, month, and year views. And probably the biggest disadvantage of this style is that averaging means data is lost. RRD files are circular buffers, so as data ages from the day view into the week, the week into the month, and the month into the year, it gets aggregated and averaged, and the detail is gone. I find that probably the most annoying thing about using RRDtool-based monitoring.

I couldn't go by without saying something about Nagios, which is now called Nagios Core. It's been around for about 18 years, has a large community of users, does distributed monitoring, has lots of plugins, is multi-platform, and has several forks, as happens with some projects. But one thing I found is that it was very difficult to configure and get working. We tried to use it at one point and just gave up because it was a little too daunting for what we were trying to do; even though we were only monitoring multiple servers, all Linux, it was still a bit annoying to try to make it work.
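To give a feel for why it gets daunting: every host and every check is its own stanza in Nagios Core's object files, roughly like this. A sketch only, assuming the stock linux-server and generic-service templates and an NRPE command already defined elsewhere:

```
# Sketch of Nagios Core object definitions: one host, one check.
# Multiply by dozens of hosts and checks and it adds up quickly.
define host {
    use        linux-server      # stock template
    host_name  builder1
    address    192.168.1.20
}

define service {
    use                  generic-service
    host_name            builder1
    service_description  CPU Load
    check_command        check_nrpe!check_load   # assumes NRPE is set up
}
```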
So let's talk a little bit about the new approach. I say "new" because, just as we're doing containers now and JavaScript front ends and distributing things through the cloud, monitoring has evolved the same way. The new approach can do the same things the RRDtool-based monitoring systems do, and most of what Nagios can do, Nagios being a little more full-featured in that regard. The thing I find most important is the ability to go back in time and do comparisons, with a scale that lets you zoom in on something that happened six months or a year ago. That's nice to be able to do, provided your database server has enough room, and it's the one thing I really find useful here.

One thing to note about the new approach is that it's actually the oldest approach. If you've worked around Linux and Unix for any amount of time, what's the first thing they tell you? Don't build some big, ugly application; use smaller daemons that do their job very well, and hook them together to create a flexible, replaceable system. If you start out with a particular collector and run into a wall because it doesn't do what you want, you can switch that piece out; we'll talk about how. If the database you started ingesting data into turns out not to suit you, change to a different one: there are several to choose from that are all time series databases, and we'll talk about those in a minute. For visualization there's essentially one standard right now, but in the future something might replace it; if you'd rather have something different, swap that piece out too.

So let's talk about the three pieces of the new approach. The first piece is the collector. Collectors go out, grab the data, and make it available to the database. Take collectd: it's designed for Unix-like systems, with read plugins to monitor various and sundry things, and it's about 14 years old. I used it when I first started doing optimizations on my build servers, to figure out how much CPU and RAM I was actually using and whether I was hitting swap, so I could tune the builds and get as much out of the machines as possible. One caveat: collecting is never free. Every time you collect information from a running process or a server, you impact performance at least slightly. Since then I've moved on to something called netdata, which is very low impact, about 1 to 2% of a CPU, with a very low disk impact; how little it touches the disk on the monitored device is one of the things I really like about it. It works on Linux, FreeBSD, and macOS, and at about six years old it's fairly mature without being too old. Also, many devices, especially network devices, still use SNMP, so if you want to monitor those, SNMP is still with us; it's not going away.

Next, the aggregators: things like InfluxDB and Prometheus, which are time series databases. Time series databases are different in that they're written specifically around timestamps and key-value pairs, which makes it easy to search within a time period, to expand out and look at a particular window. Both are written in Go, a fairly modern language with a lot of active development, which is nice. Something that's different, however, is that they are not SQL databases: they generally use their own query language. There is some movement toward standardization, but it's not moving very fast. These are about seven years old, so they're fairly mature and don't have many bugs; I haven't really run into any in Prometheus or InfluxDB, either one. They're really designed to answer not only "what data do you want" but "over what period of time do you want it," and that's the significant difference between a standard database and a time series database.

And then the visualizers. The thing right now is that almost everybody uses Grafana, because it just works. It hooks into the databases you have, either Prometheus or InfluxDB, and you write queries in Grafana that pull out the data set you want visualized over a particular time period. Grafana has been around since about 2014, so it's fairly new, also written in Go, and cross-platform. There's a lot of effort to make these tools work across many, many different platforms, and being written in Go makes that a lot easier.
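To give a taste of what those Grafana queries look like against Prometheus: here's the sort of PromQL a CPU panel might run. This assumes the standard node_exporter metric names rather than anything specific to my setup:

```
# Sketch of a PromQL panel query: CPU utilisation (%) per machine,
# i.e. 100 minus the average per-core idle rate over 5-minute windows.
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
```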
Grafana is built around the concept of a dashboard. A dashboard lets you create multiple visualizations and group them in such a way that you can scan them in order and find the one you want to look at. Each graph on a dashboard can have its own query for the information you need, possibly from multiple different sources; we'll see that in a minute. Grafana has become the de facto standard for monitoring visualization in systems that aren't already integrated, all-in-one stacks.

Now, in the future any of these pieces might be replaced, and that's the point of doing it the old way with relatively small components: if one doesn't work for you, or you want to upgrade or update it, go ahead and change it out. One caveat: because the databases use different query languages, you might have to rewrite all your Grafana dashboards. If you switch from InfluxDB to Prometheus, the query language isn't the same.

Okay, so let's talk about the last piece here: the notifiers. That's where we first started, right? The first thing we wanted monitoring to do was notify us. Once you've collected the data, aggregated it, and displayed it, you need something to actually tell you there's a problem. It turns out these tools have notifiers; I can't speak for collectd, since I never used it for that purpose, but the others do, and they can send things like email, Telegram, Slack, and text messages. Prometheus can send alerts using email, PagerDuty, WeChat, and more. Grafana, on the other hand, has a whole bunch of options: email, Hangouts Chat, Kafka, HipChat, Slack, Telegram, webhooks, and several more. It's really quite full-featured, and it understands the trends we're seeing nowadays: if your notifications live in a Slack channel, it will drop the alert into that channel for you, saying "hey, there's a problem here" to whoever is subscribed. I find those really useful, especially when working with groups, and especially with the distributed groups all over the world that we're seeing these days.
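And to tie the notifiers back to the earlier goal of being warned before the crash rather than after: in Prometheus, that's an alerting rule that the notifier machinery then delivers. A sketch only, using node_exporter's metric names and made-up thresholds:

```yaml
# Sketch of a Prometheus alerting rule: extrapolate the last 6 hours of
# free-space samples three days ahead, and raise a warning (not a page)
# if the prediction hits zero.
groups:
  - name: capacity
    rules:
      - alert: DiskFullWithin3Days
        expr: predict_linear(node_filesystem_avail_bytes[6h], 3 * 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning       # notify, don't wake anyone up
        annotations:
          summary: "{{ $labels.instance }} disk predicted full within 3 days"
```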
I have a couple of demos I've set up here. Actually, they're not really demos, they're live; this is what I'm doing most of the time. Let me switch over. I run a build farm for the OpenEmbedded project. What the OpenEmbedded project does is build operating systems for embedded processors, for embedded systems, and it's designed to build for multiple architectures simultaneously. In this particular case I have a number of builders set up so that I can kick them all off at pretty much the same time. So I'm going to log in, stop what's running, and start these world builds. There we go; the builds have started.

First, let's look at netdata. This is a live view of my file server, and you can see it's now seeing activity as the builders start to build; if I go back over here, you can see I've kicked off a whole number of builds that are all working simultaneously. Over in Grafana, looking at the same data, you can see a little bit of activity at the very right-hand side; let's give it a couple of seconds and go back to netdata. This view is just my file server, but I've also got netdata collectors on all of my builders, so I'm collecting data on what's going on with the build process. I really care about things like: what does the I/O wait time look like? What's the CPU load? What's the memory pressure being put on the builders, and how much memory is being consumed, both as tmpfs and overall? I look at those in order to optimize the builds and make them run faster. And because the file server is an NFS and iSCSI system, you periodically see activity across here while the build process runs.

Now, netdata is interesting because I'm only showing the big system overview here; there's a whole raft of charts down the right-hand side, including load, disk activity, RAM activity, and swap, and indeed all of those are being monitored for me by netdata. Periodically, about every 15 seconds, Prometheus, which is the database server we're using, comes by and scrapes that data and puts it into the database with a time series timestamp, so I can see exactly what's been going on with that particular server over time.

If we take a look back at the Grafana view, you can see there's a spike at the far right-hand side of every graph. The RAM chart at the very top shows how much memory is actually being used: we restarted everything, which cleared all of the RAM, so we're back to zero and starting fresh. The first thing that happens with these builds is that they use a lot of CPU, because every builder is parsing all the metadata to figure out what it's going to build and in what order, and every one of them does that same thing. When that process is over, they start the actual builds; you can see we got a very fast spike and then it fell off as time went on. In a few minutes, when the builds go into their next phase, they'll start grabbing source code and prebuilt pieces off the file server and assembling them as part of the build.

Let me change the time scale, and let's talk a little about that. Right now I'm showing a time scale from about 2:50 in the morning until 10:40 in the morning, my time. I'm going to change the query to say: give me the last 15 minutes. You can see that changes the scale; the data is still there, I can see those bumps of CPU, I can see where we dropped our memory usage, and now the memory usage is starting to creep up.
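Before we go further back in time, a quick aside on how that scraping is wired up: netdata exposes its metrics on an HTTP endpoint, and Prometheus just polls it on a schedule. Something like this prometheus.yml fragment; the host names here are placeholders, not my real machines:

```yaml
# Sketch: scrape each netdata instance every 15 seconds via its
# built-in Prometheus-format endpoint (netdata's default port is 19999).
scrape_configs:
  - job_name: netdata
    scrape_interval: 15s
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
    static_configs:
      - targets:
          - fileserver.example.com:19999
          - builder01.example.com:19999
```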
That's exactly the kind of data I'm trying to see. And over time I can also say: well, it's very nice to see this, but what happened yesterday? Let's go look at the 27th. You can see that yesterday we just did a short build: it came up at 11 o'clock in the morning and was done by one o'clock in the afternoon, and you can see it across the multiple builders I'm looking at. So I can zoom in: it started at 11:00 yesterday and ended at 13:00, so I'll change the Grafana query to exactly that time frame, 11:00 to 13:00. Now that I've expanded it out, you can see the start of the build at about 11 o'clock, then some activity with not very much CPU, and then some I/O spikes. These I/O spikes are when the builders start writing to disk, and there's another portion further on where they write to disk again. You'll notice the builders are almost identical and lined up, because at the same points in the build they do the same things. Except this one, which is a different architecture: it took a little longer, and you'll notice it also took a lot more RAM. You can see that RAM usage up here; where the other ones came up, said "I'm done," and quit, this one said "no, I've got a lot more to build," took more and more RAM, and then finished over here. Why did this one run longer when it shouldn't have? Well, then you go back and look at the build logs and see what went on. But the real ability is to visualize things not only as they're happening right now, but in the past, so you can analyze the difference between what we were doing then and what we're doing now. Let's go back to the last 15 minutes: all we're doing is taking more and more RAM as things get loaded up, so there's not a whole lot of interest right there right now.

So I'd like to open it up for any questions. I have a question from Ian, and the answer to that question is that I did not use multicast for collecting; it was mostly used on an individual-machine basis. I wanted to collect the data and analyze it on the build machine itself when we were doing build optimizations. Anyone else?