Okay. So I guess we don't have Lee today. I'm phoning it in because I can't seem to get things to work quite right. There. Hello, my name is Tom King and I'll be presenting today. We've had a few technical difficulties, but I'm hoping I can at least get about half of the presentation in. I did have a number of demos, but unfortunately those are not going to work today, so be aware of that. I also took a tumble off of a ladder and I'm recovering, so let's see how this goes. Today this talk will be about monitoring. So if you came for nftables or Ansible, this is not your talk. I'll only be talking about open source monitoring, not any of the proprietary offerings. And this is by no means an exhaustive survey of what's available out there. Rather, we're going to talk about some different types of monitoring applications and stacks that represent the space. So if you don't hear your favorite app or stack, understand that I intend to talk about types, with some examples. Next slide, please. So let's talk first about the goals of monitoring, and a little bit about what it is and what it isn't. Next slide, please. The first goal is alerting. Alerting means to notify through various methods, which may include text messaging or email, or even some of the newer communications platforms like Slack, Telegram, or Discord. This is the responsive element of your monitoring system. The alert itself should contain information about what's wrong and where to find additional information. It's a good idea to set up alerting with different priorities, so that the right people get the right alerts at the right time. There's no reason to wake somebody up if your hard drive is going to run out of space five days from now. But you do want to let somebody know if a server crashes and goes down.
So be mindful of that, because in this 24/7 world, it's really hard on the people who have to be responsible for that. Ideally, you want your monitoring system to alert you before the system, network, or resource crashes. So you want to set thresholds that make sense, so that you're alerted at a point where you can still prevent an outage. Most of the monitoring systems we're going to talk about today only notify. But there are a few that can actually take corrective action, and that's pretty much all they do; to see whether they worked, you have to go look at the system logs after the fact. We'll see a couple of those in just a minute. Next slide, please. The second goal is to collect metrics. You want the data in a form that can be displayed using visualization software, some of it integrated, some of it separate. We'll be talking about that in a bit. Some of the things you may want to collect are data from the network, for example a network switch or a router: what's the packets-per-second being processed? What's your actual data throughput? Your packets dropped? Where's all this data coming from? That might be a question you're asking. For a server, you might simply look at it and ask: is my server too slow? Do I need a faster CPU? Do I need to replace it? Do I need more RAM, because I'm running out and hitting swap all the time? How long does it take to service the requests that people are sending through the web browser? Those are the kinds of questions that can be answered through monitoring. Next slide, please. Now, once you've collected that data, you may want to alert on it first, but what you're usually going to do later on is come back and look at the metrics in a display. The typical way this is done is with what's called a time series graph.
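The priority idea described above can be sketched in a few lines. This is a minimal illustration, not any particular tool's configuration; the severity names and channel names are hypothetical.

```python
# Hypothetical sketch of priority-based alert routing: page a human only
# for urgent problems, and send slow-burn warnings somewhere quieter.

def route_alert(name, severity):
    """Return the channels an alert should go to, based on its severity."""
    routes = {
        "critical": ["pager", "slack"],  # wake somebody up: server down
        "warning":  ["slack", "email"],  # disk filling up over days
        "info":     ["email"],           # nobody needs to be woken for this
    }
    return routes.get(severity, ["email"])  # default to the quiet channel

print(route_alert("server-crashed", "critical"))   # ['pager', 'slack']
print(route_alert("disk-low-in-5-days", "warning"))  # ['slack', 'email']
```

Real systems (Prometheus Alertmanager, Grafana alerting) implement the same idea with routing rules matched against alert labels.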
In a time series graph, the x-axis is usually time, whether a day, a week, a month, or a year, and all of the data is displayed across that axis. This can be done in real time, or often you look at it days or weeks or even months later when you're trying to do an analysis. Next slide, please. So let's take a look at some of the monitoring systems that have been around for a while and are very mature. The first one we'll talk about, next slide, is Monit. Monit is an application, or a daemon, that's been around for a very long time. I used it on a System V UNIX machine because I only had 4 megs of RAM, and it wasn't unusual for me to run out and have the OOM killer go on a rampage and kill a bunch of stuff I needed running in order to log back in, like the telnet daemon. So I used Monit to monitor whether that daemon was running; if the telnet daemon was not running, it would restart it for me. A good application for this today would be a router running OpenWrt, for example, where you want dnsmasq restarted if it dies, because otherwise your devices won't get DHCP. There's one other thing that can help us out here: systemd has a limited process monitoring capability as well, and it can also bounce things, so you can have it try to do some rectification in the system. Next slide, please. Okay, we're going to talk a little bit about RRDtool-based monitoring. The first tool we talked about basically monitors and fixes the problem, and that's it: no visualization, no ability to store data or anything else. Now we're going to talk a little bit about RRDtool. All right, next slide, please. We're missing a slide. Go back one, please.
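The check-and-restart cycle that Monit performs can be sketched as below. This is an illustration of the idea, not Monit's actual implementation; the process table here is a simulated set, where a real watchdog would check a pidfile or the process list.

```python
# Minimal sketch of the cycle a Monit-style watchdog runs on every poll:
# for each daemon we want alive, restart it if it isn't running.

def watchdog_cycle(running, wanted, restart):
    """Restart any wanted daemon that is missing; return what was restarted."""
    actions = []
    for name in wanted:
        if name not in running:
            restart(name)          # in real life: shell out to an init script
            actions.append(name)
    return actions

restarted = []
watchdog_cycle(running={"sshd"}, wanted=["sshd", "dnsmasq"],
               restart=restarted.append)
print(restarted)  # ['dnsmasq']
```

The systemd capability mentioned above is the declarative version of this loop: `Restart=on-failure` in a unit file tells systemd to bounce the service itself.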
Yes, that'll be good enough. Originally there was an application called MRTG, and at the time MRTG was created, RRDtool didn't even exist. MRTG was designed to answer the questions: how much bandwidth am I using? How many packets per second are going through? MRTG stands for Multi Router Traffic Grapher; that's its name, that's what it means. After working with it for a while, the authors realized it was a little slow and they weren't happy with its performance. So they reimplemented the data storage and graphing engine, and it became what's known as RRDtool. Based on that work, several other groups created monitoring applications that extend into the server space and into monitoring other things as well. One of them is Cacti; it has a very pretty display. I had a screenshot of it, but I can't show you. The other one is Munin. I've used Munin extensively. It has a node agent that runs on whatever box you're trying to monitor. Periodically a collector daemon comes by, polls that node, grabs the metrics it has collected on the device, and stores them in RRDtool for later visualization. RRDtool is great because it's so old that there have been a lot of ports of it. It works cross-platform on Unix-like systems, and via SNMP it can poll network devices like switches and routers. It also works on Windows and macOS. So it's really quite robust. It's written in C, it's over 20 years old, and yet a lot of people are still using it today. Next slide, please. So let's talk a little bit about the advantages, the good, on this. It's fairly easy to implement, and when I say implement, I mean to set up for a user. It takes me about 15 minutes to set up a node.
That's making sure I have all of the appropriate modules Munin needs to get the information I want from a typical server: the web server, the system itself, and all the other little pieces you can monitor with it. It takes a bit at first. When we originally set up Munin, it took us about an hour and a half to get it running for the first couple of nodes, and then it was 15 minutes for every node thereafter. Very quick, very easy. It displays data really consistently; if you look at some of the websites I pointed to, you'll see a picture of it. It shows you day, week, month, or even year to date. The solutions we're talking about here all utilize RRDtool as the storage and display engine, with the rest being their own internal pieces. They've been around for 18 or 20 years, all of these applications, Munin and Cacti both. Now, a little bit of the bad. Because they're so old, they haven't incorporated the newer things we have today, so they don't have built-in hooks to Telegram or Discord or the like for alerting. Mostly they'll either use email, or a notify script where you have to provide the logic for talking to whatever platform you want. So you might have to write your own. Another disadvantage is that RRDtool uses a circular buffer for each metric collected, and averages the data before storing it into the next-longest time buffer. So when the week is over, the daily buffer starts over, and you lose the fine-grained data from the previous week. Now, that data has been averaged into the weekly, and then into the monthly, and then into the yearly, so it isn't completely lost, but it's been reduced considerably. The precision has really been reduced.
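The precision loss from that averaging scheme is easy to see in a toy example. This is a sketch of the consolidation idea, not RRDtool's actual code; in RRDtool terms, each coarser archive (RRA) stores consolidated values computed from the finer one.

```python
# Sketch of RRD-style consolidation: average every `step` consecutive
# fine-grained samples into one coarser data point. Detail is lost --
# a short spike survives only as a diluted average.

def consolidate(samples, step):
    """Average each block of `step` samples into one coarser sample."""
    return [sum(samples[i:i + step]) / step
            for i in range(0, len(samples) - step + 1, step)]

minute_data = [10, 30, 20, 40, 90, 10]      # six fine-grained samples
coarse      = consolidate(minute_data, 3)   # two 3-sample averages
print(coarse)  # [20.0, 46.66...] -- the 90 spike is smoothed away
```

This is exactly why, as noted above, you can't go back weeks later and see the original per-sample detail: only the averages survive in the longer buffers.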
And because of that, you can't go backwards or forwards in time and see any detail from the past. Next slide, please. I bring Nagios into this simply because it's been around for a long time. It's not really a visualization tool; it's more of an alarming and alerting tool. It has a large community of users and lots of plugins, and I found it very difficult to configure and operate. We tried to use it first when we were starting out, and we had a hard time with it. It is multi-platform, and there are a lot of users of this particular product. It is open source, and it also has a closed source component. Next slide, please. So let's talk a little bit about a newer approach that has been taken in the monitoring realm. This approach can do things that the RRDtool-based tools simply can't, and that Nagios wasn't designed for. It also has the advantage of being able to go back and forth in time to compare metrics with full precision. Next slide, please. It's actually the oldest approach. What is the oldest approach? Well, the oldest approach, next slide, is to run smaller daemons that each do their job very well. There are a number of daemons that can perform the various functions you need in order to create a full monitoring system. Next slide, please. You hook them all together to create a flexible, replaceable system. Don't like a particular collector, but you find another one that you do like? Substitute it and point it at the database. Don't like the visualizer you have right now, and a newer, better one comes along? Replace the visualizer. The different pieces can be swapped out, which makes it a lot easier to decide how you want to run your system. Next slide, please. So let's talk a little bit about collectors. I'm going to use the term "collectors" in quotes; they're not really called collectors, but that's the term I'll use here.
The first one we'll talk about is, next slide, please. Oh, sorry, go back. Okay, that's where the slide is missing. We'll talk a little bit about CollectD. It's not really a new thing; as a matter of fact, it's a fairly old thing, about 14 years old. It's for Unix-like systems. It's designed with read plugins for the different things you want to monitor, and write plugins for deciding which database you want to use; or it can write directly into RRD as well. I've used it; most of the time I use CollectD when I'm doing optimizations on a system. There's another one that I'm using right now called NetData. It's designed to have low resource usage, and CollectD is also relatively low on resource usage; I'm talking about on the device being monitored. One of the problems with a lot of the older tools was that they put high loads on the monitored device for short periods, and that would skew the data being collected. These newer tools like CollectD and NetData have very low usage: for example, 1 to 2% CPU for NetData, and very low disk impact, because instead of constantly writing to the disk in very small increments, it buffers things up a bit and writes bigger chunks. It works on Linux, macOS, and FreeBSD; I'm talking specifically about NetData now. And it's about six years old, so it's one of the youngest things out there. Next slide, please. All right, let's talk about the aggregators. We've all used databases, whether MySQL or Postgres or something else, and those are great relational, SQL databases. But time series databases are considerably different. For one thing, they store the data as ordered pairs, a timestamp plus a value, so every single data point in there has a timestamp associated with it.
This allows you to aggregate over time, do averaging over a period, or display an entire dataset from one time hack to another. Some of those in use nowadays are InfluxDB and Prometheus. Interestingly, they're both written in Go, so it's nice to see the newer languages in use. They use non-SQL query languages, and each of them currently has its own. It turns out the visualizers understand how to pass those queries through, so if you're using InfluxDB, Grafana knows how to deal with it by passing the query straight through to the database to get the pieces it wants to graph for you. They're only databases: they don't do any visualization and they don't do any collection on their own. Prometheus is about seven years old, and InfluxDB is about the same age. There is a project called OpenMetrics to standardize how time series metrics are exposed, but according to GitHub it doesn't look like it has made much progress since about 2018. Next slide, please. Okay, visualizers. The visualizers are just that: they make the graphs. You create a query in something like Grafana, which queries the database for certain fields along with the time hack data, and you say, this is the time period I want to look at, and it creates a graph of that period. If you want to change the time period, that's very easy to do in Grafana. You can tell it: instead of looking at the last 12 hours, go seven days back and show me the 12 hours before that, so you can compare the same window a week earlier. It's very easy to do from the graphical user interface right on the display. Again, it's written in Go and cross-platform, and it works very well. It runs in a web browser and uses a dashboard concept for its layout. You create a dashboard.
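The timestamp-plus-value storage model and the "from one time hack to another" query described above can be illustrated with a toy store. This is a sketch of the concept only; real time series databases add compression, indexing, and their own query languages on top.

```python
# Toy time-series store: every point is a (timestamp, value) pair, so a
# query is just "give me the points between two time hacks", and shifting
# the window (e.g. the same 12 hours, one week earlier) is just arithmetic.

class TimeSeries:
    def __init__(self):
        self.points = []                   # list of (timestamp, value)

    def append(self, ts, value):
        self.points.append((ts, value))

    def range(self, start, end):
        """Return all points with start <= timestamp <= end."""
        return [(t, v) for t, v in self.points if start <= t <= end]

cpu = TimeSeries()
for ts, v in [(100, 0.2), (160, 0.9), (220, 0.4)]:
    cpu.append(ts, v)

print(cpu.range(150, 230))  # [(160, 0.9), (220, 0.4)]
```

Because nothing is averaged away at write time, the same query run over last week's window returns last week's data at full precision, which is the key difference from the RRDtool model.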
You tell it what pieces you want on the dashboard: what graphs, what displays. It's got nice gauge-type and speedometer-type widgets that make it quite pretty and quite easy to read. Grafana has been around since about 2014, so again it's one of the younger things, and it's become the de facto standard for visualization in the newer monitoring systems. Next slide, please. Now, let's talk a little bit about notifiers. You'll notice that in this particular case, all three of NetData, Prometheus, and Grafana can do notifications. And because they're fairly modern, they also know about all the different platforms. Each one knows about a different set than the others: NetData, for example, knows how to deal with email, Telegram, Slack, and a few others, while Prometheus can use email, PagerDuty, WeChat, and several others. And Grafana, which has the richest set, starts with email, Discord, Google Hangouts Chat, Kafka, HipChat, Slack, Telegram, webhooks, and a whole bunch of others as well. All of them will tell you whether you've crossed a threshold, and some of them understand trends too: they look at the slope of the curve as well as the raw data value. Normally this is the point where I would do the demo. Next slide, please. But unfortunately, I don't think I'm going to be able to present those at this point. If there are any questions, go ahead and put them in the chat; I appreciate it. If you think of some questions after the talk is over, there's my email address. I'm looking to see if I see anything. Hold on a minute here; I'm going to get to the right place. Okay. Yeah, I'm sorry about the bit of echo in the sound. How are we making it, Tom? I'm done. Excellent. Yeah, if there are any extra questions, please send Tom an email; it's on the screen.
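The two alert styles just mentioned, a plain threshold check versus a trend check on the slope, can be sketched as below. This is an illustration of the idea, not any tool's actual algorithm; the linear projection is a deliberately naive assumption.

```python
# Sketch of threshold vs. trend alerting. A threshold alert fires only once
# the limit is crossed; a trend alert looks at the slope of recent samples
# and fires early if the current trajectory would cross the limit soon.

def threshold_alert(value, limit):
    return value > limit

def trend_alert(samples, limit, horizon):
    """Fire if a naive linear projection crosses `limit` within `horizon` steps."""
    if len(samples) < 2:
        return False
    slope = samples[-1] - samples[-2]           # change per step
    projected = samples[-1] + slope * horizon   # where we'll be in `horizon` steps
    return slope > 0 and projected > limit

disk_pct = [70, 74, 78]                        # disk usage climbing 4% per step
print(threshold_alert(disk_pct[-1], 90))       # False -- not there yet
print(trend_alert(disk_pct, 90, horizon=5))    # True  -- 78 + 4*5 = 98 > 90
```

The trend version is what lets a system warn you days before the disk actually fills, instead of the moment it does.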
I do apologize for the little bit of confusion starting up today. It does happen occasionally, and as I said, we'll do better, promise. Thank you, Tom, and thank everybody for hanging out and enjoying the session. Are there questions? I didn't see any questions in here. Oh, there are some questions. Do you see the questions off page one? Hold on a minute, let me look. Okay, I see there are a couple of extra pages. And thank you for the questions, by the way. Yes, if you can hear me: yes, Checkmk is a great Nagios wrapper. I understand that Nagios wrappers abound, and a lot of them solve the problems that Nagios itself has. It just didn't suit my particular purposes, which is the reason we ended up not using it. Munin did the job for us at the time, and we were looking for a little bit more trend analysis. Since then I've not been using either of those, because we're not actually doing any web hosting anymore, or any hosting. Instead what we're doing is builds, and I'm doing optimizations. I wanted to be able to make a change and then run a 12-hour build and know what effect the change had. We're building operating systems and doing cross builds, so we needed to look at large 12-hour chunks, or 24-hour chunks for some of the bigger builds. And unfortunately the RRDtool approach didn't allow us to go back and forth the way we wanted, to compare and see what changes were made and how they worked. Let's see. I can make this public, I guess; how do I do that? Just talk to them, Tom. Okay, I can talk to you. Okay, the question is: have you tried using NetData for that kind of troubleshooting? Yes, but because it's a real-time visualization and I'd have to kind of stop it and try to slide back in time, it really doesn't suit that.
When I'm watching things in real time, I watch directly in NetData; but when I'm doing the analysis of what changed between two runs, I use Grafana with the database information. Thank you. Let me read the next question, Tom; I don't know that our viewers can see them on the list here. Are you going to make the demos available offline? I'm going to try to do a video of the whole thing and upload it. Okay, thank you. And that should be with the video downloads. Excellent, right, that'll be our download. And the next question is about the links. The links that are in the slides are links you can use. I'll send you some more links; I'll try to upload a different slide deck that has better links for the slides I was going to show you. Okay, thank you. How much work is it to deploy and maintain all of these different tools? It depends on how much information you want and how much you want to visualize. Like I said, with Munin, it took us about three and a half to four hours to deploy the entire system. To get it dialed in the way I wanted, I probably put another 10 hours into it. And that was monitoring 35 different machines, so it worked very well. It did get a little bit boggy towards the end because we had so much data. The next person asked me about Datadog. Datadog is a great proprietary solution; again, I was not talking about anything in the proprietary space here, but I know people who work with it and on it, and it's a very nice product. Somebody asked about my recommendation of NetData, Prometheus, and Grafana. That is actually what I'm doing, what I'm personally running. I have a friend who's using CollectD with InfluxDB and Grafana, and he likes it as well. So, like I said, if you like one tool over another, just swap out the piece you don't like and continue on.
And it's a personal preference thing. We ended up landing on that combination based on what we were trying to monitor. Also, I'm able to monitor in real time using NetData, which is the first thing we installed; then we installed Prometheus, and obviously Grafana, which we needed to visualize things. NetData was already in place, and we chose Prometheus as the database because NetData alone did not allow me to get at the information I wanted at a later time. I tried to use Zabbix. I had a hard time with it and couldn't make it work the way I wanted. That was a case of not being able to get the specific data I wanted; there wasn't a Zabbix plugin for it. Otherwise it looked really nice, like a great product. I was pretty happy with it, but it just didn't do what we needed. Somebody asked me about observability versus monitoring and how it fits in; I'm not really prepared to speak on that subject right now. Do you have a copy of the last slide? The last slide actually is in the slides that I've uploaded, so you should see it there. Somebody asked me, other than the demos, what would I have done with more time? The primary thing is that I would have made it so I could actually demo the system and show some of the features: show how to set up the time windows you're looking at, do a few queries just to see how it works. It's very simple. Somebody asked me what the best solution is for monitoring, and I'm going to answer this: the one that gets you the data that you want, the visualization that you want, and the alerts that you want. I haven't really played with either of the ones mentioned. Somebody asked specifically about the push model versus the pull model.
I know that I'm using the pull model with Prometheus, and it works very well. It lets me know when it can't pull a node, for example, so I really like that; that's really kind of nice. I can't elaborate much further on pull versus push, though; that's a personal preference, I think. Somebody asked if the new approach is a mix of many different technologies. Not really; it's kind of old and new at the same time. You still have to collect the data, you still have to store the data, and you still have to figure out how to visualize it. Somebody asked whether this approach forces administration teams to deal with many diverse potential points of failure. Well, you build a LAMP stack, right? Linux, Apache, MySQL, and PHP, or whatever the latest versions are of the things you're using. This is one more stack just like anything else. It's pretty straightforward, it works very well, and I've not noticed a failure at all since we started working with it. Like I said, I have a different viewpoint on how I want to use monitoring: I want to use it for performance monitoring, based on making changes to the systems and how we do things. Sorry, it's just taking me a little while to go through these. What tools have become, or will become, the de facto standards for time series in the future? I think Grafana has pretty much captured most of the market for time series display and visualization. The databases, especially if they standardize the query language, I think will become interchangeable, and then it will be a personal preference thing, like whether you use MariaDB or Postgres or something else. And as far as collection tools go, I expect them to get better and to become much more multi-platform in the future. Somebody commented, about the list in the new approach, that SNMP is a protocol. That's correct.
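The pull model described in the answer above can be sketched as a scrape loop. This is an illustration of the concept only, not Prometheus code; the node names and the `fetch` function are hypothetical stand-ins for HTTP scrapes of each node's metrics endpoint.

```python
# Sketch of a pull-model scraper: the central server polls each node on a
# schedule, and an unreachable node shows up as a failed scrape -- which is
# how the server "lets you know when it can't pull a node".

def scrape_all(nodes, fetch):
    """Poll every node; return collected metrics and the unreachable nodes."""
    metrics, down = {}, []
    for node in nodes:
        try:
            metrics[node] = fetch(node)
        except OSError:            # connection refused, timeout, etc.
            down.append(node)      # in Prometheus this surfaces as up == 0
    return metrics, down

def fake_fetch(node):
    """Hypothetical fetch: pretend node2 is down."""
    if node == "node2":
        raise OSError("connection refused")
    return {"cpu": 0.3}

print(scrape_all(["node1", "node2"], fake_fetch))
```

A push model inverts this: each node sends its metrics to the server, so a dead node is detected only by the *absence* of data, which is one reason people end up preferring one model over the other.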
But these tools haven't forgotten about it just because they've gone to the new approach. They still use SNMP; it's still very popular, and all of them support it to one degree or another. That was actually the one piece that did not work correctly for me in Zabbix, getting the SNMP module working, so I wasn't too happy with that, and that's part of why we switched to NetData and went off of it. I'm not talking about logs or log files, because those are not part of this discussion. Sorry, Richard, I know you've probably submitted one; I just have to find it. Sorry. Okay, somebody asked which one to use for notifications, whether you use Grafana or Prometheus for it. I'd say the one that gives you the best information. Just try both of them and say, okay, this one gives me this, and that looks pretty good; it sets the priority right, or it doesn't. If you try the other one and it gives you a better alert or a better notification, then go with that. Again, it's a personal preference thing. On configuring the tools and pointing them at the resources: we simply used their setup pages, the how-tos, and some of the YouTube videos that are out there to do the setup. It was relatively quick. The hardest part was the number of nodes we have, which is relatively high; we've got about 40 nodes to monitor, so it took a fair amount of time to get all of those nodes into Prometheus and get them being polled. I don't know anything about pattern recognition. AI notifiers are available. Again, log analysis is unfortunately not part of this talk; it's a different thing. Okay, that's it. Thank you.