Welcome everyone. I'm giving the first talk today, I guess, so thanks for showing up early. I'm David, and I'll be talking about effective infrastructure monitoring for Linux with Grafana — because I'm from Grafana, and we also use a lot of Prometheus, Loki, and the node exporter, so those things will feature heavily in this talk.

Before we get started, let's do a quick catch-up on what monitoring looks like at Grafana itself, because we also run a couple of servers in a couple of clusters. Right now I think we have about eight clusters running 270 nodes — that was the count yesterday — for our hosted metrics and hosted logs SaaS offerings. We mostly run those on 16-core, 64 GB machines, so quite big boxes, which is quite exciting.

I'm also curious how much overlap there is with the technology you folks are using, so let's do a quick show of hands. Who here is using Grafana? Yeah, that's quite a bit. And Prometheus for metrics? Yeah, a few fewer. Anyone starting to use Loki already? Okay, early adopters, excellent. And Jaeger? Cool — tracing will be quite a big thing for us next year, getting better integration for it. Monitoring mixins — has anyone heard of these, or is using them already? No? Oh, Chris — you've heard of them. Okay, that's good, because a lot of the content in this talk comes from these mixins, but I'll talk about those later.

One other bit of philosophy we follow at Grafana: we make dashboarding software, but we don't want to look at dashboards all day. So we aim for monitoring by alerting — we use the time series we produce to write alerts, and only get notified when something is actually happening. That way most of our infra team can go off after work and make shakshuka or whatever, and only if something happens do they get an alert on their phone, or through other means, and only then do they have to act. Some of the alerting rules I'll show later also come from this monitoring mixin.

When it comes to monitoring Linux, for us there isn't really a way around Prometheus, with the help of the node exporter, so I'm going to do a quick tour of it. Initially I was looking at the node exporter's GitHub repo and its README, and it's basically just a long list of collectors — there's a bunch of collectors in there that offer you hardware- and OS-specific metrics, because the whole thing is written as a pluggable system. For me this was all a bit of a confusing cloud of modules and collectors. So, since I'm here in Berlin, and Ben from GitLab is also based in Berlin and happens to be the maintainer of the node exporter, I just met up with him at a Korean restaurant to talk through how we could build a better mental model of what all these different metrics in the node exporter do. To be honest, he mostly wanted to talk about kimchi the whole time, and how he once made 500 people's worth of kimchi — but he had a really good idea. You remember this graphic? Who has seen this graphic before?
Yeah, I guess we're at the Linux userspace conference, so that tracks. This is a really good map from Brendan Gregg of the Linux performance observability tools, and Ben uses a bunch of these in his day-to-day job. But the bigger task for him is always: how can we get the kind of information these tools provide as time series? Luckily there's a fair bit of overlap with what the node exporter gives you. So for us it became a challenge: how can we map the node exporter metrics onto the parts of a Linux system that this map covers? We cleaned up the graphic and then tried to map the first couple of metrics. If you look at the metrics the node exporter exports, they always start with node_, and the collector name is usually the next term in the metric name. Part of this talk is a tour around these different metrics.

The first and classic example is CPU utilization. What does CPU utilization even mean? People are used to just looking at top, for example, but what that gives you is a bit unclear. With time series and node_cpu_seconds_total it's much clearer what you're measuring: it gives you the seconds of CPU time spent, per wall-clock second, in the various modes the CPU supports — system, user, idle, guest, steal, and so on. Here we're using this metric to draw the utilization graph, where 100% means nothing is running in idle mode, i.e. 100% of the CPU is being used by processes in non-idle modes.

A similar metric that often comes up for CPU saturation is the load average, which we've traditionally used just because it's been around for a long time and people have a vague idea of what it means. It's not really ideal, because it isn't a purely CPU-bound metric: it also counts a lot of uninterruptible task code paths, which aren't necessarily a reflection of how CPU-bound your system is. And to get something like a zero-to-100% saturation scale, we've always had to normalize it by the number of CPU cores, which isn't ideal either.

So instead of the load average, what I've heard recommended a couple of times now is to use the pressure stall information (PSI), which is a newer way of measuring how much demand there is on the system from a CPU point of view. The difference is that we're no longer tracking just the number of threads waiting for the processor — we're tracking the time they spend waiting. That gives a much better sense of how contested the resource is. And as you'll notice, we no longer have to divide by the number of CPUs, because now we're just tracking the overall time spent waiting.
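To make that concrete, here's roughly what those three signals look like as PromQL. This is a sketch using the standard node exporter metric names, not the exact recording rules from our dashboards:

    # CPU utilization: percent of time spent in non-idle modes, per instance
    100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

    # Load average normalized by CPU count (the traditional saturation signal)
    node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

    # PSI: seconds per second that runnable tasks spent waiting for a CPU
    # (needs the pressure collector and a 4.20+ kernel)
    rate(node_pressure_cpu_waiting_seconds_total[5m])

The PSI query is the nice case: there's no division by core count, so the number is directly comparable across machines of different sizes.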
So, let's quickly go through the memory metrics. For utilization we use the fraction of available memory to total memory, subtracted from one — that gives us the used fraction. Similarly, for saturation we've decided to use pages swapped per second. If anyone has a better metric to describe memory saturation, I'm curious to hear feedback there.

For disk utilization we're tracking io_time — that is, how long the disk operations are taking, or how much time was spent in I/O. Another interesting thing here is that a lot of these metrics map directly to things that iostat gives you, so the overlap is actually quite big; there's an excellent article by Brian Brazil going into which of the fields that iostat prints can be reproduced from these metrics.

One more thing about disks — I think this features in every Prometheus training — which is the fear of disks getting full. You can use Prometheus queries to write some good alerts that fire when you're on a trajectory towards a full disk. In this example, the alert should fire when less than 40% of the disk space is available and, based on the trend over the last six hours, the available space is predicted to hit zero within the next 24 hours. I think this is a quite powerful way to express these sorts of constraints. The condition also has to be true for one hour, so as you've probably already noticed there's a drawback: if the disk fills up on a steeper trajectory, in less than an hour, you'll have a problem. So you need to combine this with other alert rules that track more aggressive disk-filling speeds, tag them with a different severity, and route them to, say, your pager instead of just creating a warning alert.

A bunch of these rules are defined in the node mixin, which is a Jsonnet-based library of alerts, recording rules, and dashboards. The node mixin has a lot of rules for disks getting full — not just disk space running out, but also running out of inodes, which can also happen. I really recommend taking a look at that mixin, and even if you don't use Jsonnet in your organization, I think it's still worth looking at the alerting rules — you can just copy them out, and if you think they can be improved you can open a GitHub issue on that same repo, so other people can benefit from your suggestions.
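As a sketch, the space-filling alert I described looks roughly like this as a Prometheus rule. This follows the shape of the node mixin's filesystem alerts, but treat the exact matchers and thresholds as illustrative rather than a verbatim copy:

    - alert: NodeFilesystemSpaceFillingUp
      expr: |
        (
          node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 24 * 60 * 60) < 0
        )
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: Filesystem predicted to run out of space within the next 24 hours.

The mixin pairs this with a second, more aggressive rule (shorter prediction window, critical severity) for exactly the fast-fill case I mentioned, plus equivalent rules on the inode metrics.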
Okay, just to round things out, these are some network metric queries we're also running. I always like these graphs where you plot the transmit rate on the negative y-axis, so you get a nice comparison of how your incoming traffic is shaped against your outgoing traffic.

Okay, and here's something that happened to us recently. This is another metric that gets exposed, about conntrack entries — it tracks how full the conntrack table is on your system, which the kernel needs to keep track of your connections. We run our nodes on GKE, and recently some of the nodes hit that limit and then couldn't establish any new connections. This was something we didn't have an alert for, and it's where I just want to quickly show how you can use Grafana to write the queries that will help you write an alert.

Let me see how this goes. I'm using the Explore view here, and I've already zoomed into the time range where this actually took place — between the 12th and the 13th, down here. As you can see, there was this big spike and we're hitting one; and if the fraction of entries to the limit is one, then the table is full, obviously. So this happened here.

What I then do is use a split view. I come over here, start off with the same query, and then begin adding comparisons. First, I only want the time series where the fraction is bigger than 50%, for example. Then I also only want to alert if they're bigger than the average of all the fractions — and another good practice is to add the standard deviation, or double the standard deviation, on top of that average, so that you can rule out some false positives. If everything is sitting at around 70%, for example, then maybe that's fine, maybe that's how GKE is supposed to run; but if some of them deviate too much from the average, they will show up here. So if we change this to, let's say, three standard deviations and run it again — okay, now we have fewer results.

So on this side I keep modifying the query, and I'm happy with it when I've identified the offending time series from the left-hand side: if they survive on the right, I know I have an alert rule that would have caught this incident. And that's what I would then copy out into my alerting rule.
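Pieced together, the expression I ended up iterating towards looks roughly like this. The conntrack metrics are the standard node exporter ones; the 50% floor and the three-standard-deviation band are just the illustrative thresholds from the demo, not a recommendation:

    # Instances whose conntrack table is more than half full
    # AND well above the fleet-wide average fill level
    node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.5
    and
    node_nf_conntrack_entries / node_nf_conntrack_entries_limit
      > scalar(
          avg(node_nf_conntrack_entries / node_nf_conntrack_entries_limit)
          + 3 * stddev(node_nf_conntrack_entries / node_nf_conntrack_entries_limit)
        )

In an actual alerting rule you'd put a "for:" duration on top of this, just like the disk example earlier.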
Okay, let's go back to the talk — I've lost this window for a second, I just have to find my notes. There we are. Perfect.

All right, so a couple of collector gotchas. It's good to be aware of how many time series the collectors produce. When you run a node exporter it listens on port 9100, and if you hit that on your system you'll see the huge list of time series it exposes. So it's a bit up to you to figure out which collectors you want to enable so that you don't end up with too many time series. Also, some legacy collectors actually run scripts or execute programs, which puts additional burden on your system on top of the volume of time series. You could obviously write relabeling rules to just drop a couple of these, but maybe that's a bit too tedious. So here's a pro tip from Richie, who builds data centers in his spare time: he tends to run two node exporters on each machine, one with a minimal collector set and one with the full set. The stuff that's scraped regularly only contacts the one with the minimal set, and if you need to investigate a bit further — the kind of things you'd otherwise have to log into the machine and dig out of the proc filesystem for — you can go to the second node exporter with the full set and look at the metrics exposed there. I'm told there are a lot of savings to be had with that. He's also a big fan of alerting on entropy, because when you run data centers, some hosts sometimes run out of entropy — which to me always sounds a bit like something from the future.

And a little bonus collector that someone here told me about: the textfile collector. This basically adds little text snippets, from files in a given directory, to the exported metrics — text in the same format that Prometheus parses. There are some nice examples for this. If you have maintenance jobs, like running backups or pushing out updates to your systems, you can track those by writing a little line recording when the last run was; then you can also imagine an alert that checks this timestamp against the current time, and if your backup hasn't run in, say, three days, you should probably alert. Another nice one: if you run a lot of software directly on the metal, maybe across your whole fleet you have some features enabled on some machines and not on others — a little metric like this helps you keep a general overview of how many parts of your fleet have a certain feature enabled. And another fun one is tracking how long your SSDs are going to last.
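As a sketch of how the backup example could look: the textfile collector reads *.prom files from the directory you point it at with --collector.textfile.directory, and the file contents are just the normal Prometheus exposition format. The metric name, path, and cron job here are made up for illustration:

    # At the end of your nightly backup script (running via cron), write e.g.:
    #   echo "backup_last_run_timestamp_seconds $(date +%s)" \
    #     > /var/lib/node_exporter/textfile/backup.prom
    #
    # The node exporter then exposes that line on /metrics alongside its own metrics:
    backup_last_run_timestamp_seconds 1.569e+09
    #
    # ...and an alert expression for "backup older than three days" could be:
    #   time() - backup_last_run_timestamp_seconds > 3 * 86400

In practice you'd write to a temporary file and rename it into place, so a scrape never sees a half-written file.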
Speaking of rolling out updates and managing a whole fleet: I'm always a bit jealous when I look at the GitLab dashboards, and I highly encourage you to look there too. They expose most of their infrastructure monitoring dashboards on the internet — you can go to dashboards.gitlab.com and just see how they're doing this. In this particular example they're tracking the kernel version distribution across all their nodes. In their production system they seem to have around 200 nodes, and the majority of nodes seems to be running a 4.10 kernel — which, if you were paying attention earlier, probably won't have the pressure stall information yet, right? So this is how you can get a sense of what sort of capabilities your fleet has. And then they have a really cool panel further down which tracks which hosts are deviating from that majority version. That gave me an idea — and I also heard this from Björn — that at SoundCloud they had these leaderboards for outdated systems: you can write queries that group by team or by cluster and show which team is running the most outdated machines.

So, you just click here... and this works. This is our cluster now. We have around 246 nodes, for example, and I'm doing a topk query here. Let's see if we have a couple more... There are some older ones — actually they're running roughly similar kernel versions, they're just not very consistently normalized; I guess that's one of the problems with this approach. And here on the other side I'm using a similar query: I take the top one and then count again, grouping by cluster, so I can see which of these are not running that majority kernel version. A similar query is what you could use in a dashboard to build the sort of leaderboard I showed you.
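The queries behind that demo look roughly like this. node_uname_info carries the kernel release as a label, so you can aggregate over it; the cluster (or team) label obviously depends on how your own labels and relabeling are set up, so treat that part as illustrative:

    # Kernel version distribution across the fleet (top 6 releases by node count)
    topk(6, count by (release) (node_uname_info))

    # Per cluster: how many nodes are NOT on the fleet-wide majority kernel
    count by (cluster) (
      node_uname_info
        unless on (release)
      topk(1, count by (release) (node_uname_info))
    )

Swap the cluster label for a team label and sort descending, and you have the SoundCloud-style leaderboard.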
Back to the slides. I guess the bigger question is also: how should you organize all of these views in Grafana? Here's an example from GitLab again. They do a really nice thing where, in their general fleet overview — which in this case includes all the hosts of the production system — each row shows one signal: on the first row, for example, the CPU utilization, and on the second row the load average. And they've grouped the columns by application tier: from the left you have the web workers, then the API, and then the git workers, which is roughly how you picture GitLab working in your head. Here you can already see quite easily how busy each of these tiers is right now, and I think that's really helpful for capacity planning, for example.

And this is how we do it internally at Grafana: we use the mixin again, because that gives us the dashboards as code right away. It has a cluster dashboard, as an aggregation of every node in the cluster, and also a template-query-driven node dashboard, which basically just runs a node exporter query and returns all the instances. That's our way of service discovery — or rather node discovery — because every system that runs a node exporter will expose this metric.

And that was it. Oh yeah — we're also hiring. But I'm really curious whether you have any questions about the node mixin or the dashboards, and I'm also curious whether there are any funny alerting rules you've had to write in the past. So, that's it.

I think your arm went up first. So the question was: what's the minimum Grafana version you need to be running to use the mixins? The answer is that it should already work with version 5, because the mixin is something you run in your build pipeline to output JSON that you then feed into your Grafana instance. I think if Tom were here he'd say you can go back to version 2, but I don't know if I can make that promise. Version 5 should be fine.

Thanks — Ed from Packet. I saw one metric that you pulled which was SSD related, drive health. Have you done any other bare-metal monitoring, of looking at the underlying hardware of the systems you're monitoring?

So, at Grafana we don't, but I would imagine this is commonly done, and if there's any little tool that can output something to a text file, then this is exactly one of the textfile collector use cases: you have a cron job running occasionally and producing the value — the cron job runs as root and has access to all these things — the node exporter exposes it, and Prometheus comes by occasionally and reads it. So yes, anything that's not directly included as a module in the node exporter you can probably easily replicate with the textfile collector. Good, thanks.

It's Casey — hello; he was at the Korean lunch as well. Do you know of any best practices for ignoring network interfaces that are useless? If you run containerized workloads you wind up with a million veths and bridges and things like that, and you don't care about packets or drops on those.

So, we actually have recording rules that ignore the loopback interface, for example, and you would write a similar recording rule for this — it would literally be named something like "ignoring useless", because that's a name we chose for the recording rule. So maybe not a really satisfying answer, but yes.
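For reference, the kind of exclusion Casey is asking about usually ends up as a regex on the device label — something along these lines, where the exact interface name patterns are whatever virtual interfaces show up on your own hosts:

    # Receive bandwidth, ignoring loopback, veth pairs, bridges and docker interfaces
    rate(node_network_receive_bytes_total{device!~"lo|veth.+|docker.+|br-.+"}[5m])

You'd typically bake that filter into a recording rule so that every dashboard and alert built on top of it inherits the exclusion for free.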
Next question: you create those alerting rules manually, right — are there any tools you could use to say "okay, we had a problem at this time, and these time series show that something is going wrong", and then get an automatic alert the next time the same thing happens?

So there are two parts to this. If you can answer it statically — say a certain value should never be higher than 70, or something — you can just use Grafana-based alerting: in the panel you can draw a threshold, and that then runs in the Grafana backend as a continuous alerting service. If it's more dynamic, and the threshold kind of depends on the rest of the traffic, then a Prometheus-based alert would be better. But there's currently no good interface for writing these, and I think I've shown you the way I would approach it, with that split view: on one side I can see all the results, and on the other I only keep the ones that would have alerted. And since Prometheus alerting rules are just queries — you only have to add how long the condition should be true for — they're pretty easy to copy and paste.

Okay, one more. You mentioned that one component, PSI, needed a 4.20 kernel — are there any other kernel versions of note? I know this is a userspace Linux conference.

Yeah, that's actually something I'm also curious about, and maybe people here know the answer. It's also a feature that — if you look back at our majority kernel version — we can't use yet either, since we're running on GKE, and it might take a good two years before this becomes available for anything that runs in the cloud, which is a bit sad.

Cool, I don't see any more questions. I'm here the rest of the day — I might pop out real quick to the Brandenburg Gate for the climate strike, but I'll be there for the evening event as well if you have any questions. Alright, thank you.