So we just had a talk by a non-GitLab person on GitLab. Now we have a talk by a GitLab person on non-GitLab, or something like that. The CCCHH hackerspace is now open from now on, if you want to go there; that's the announcement. And the next talk will be by Ben Kochie on metrics-based monitoring with Prometheus. Welcome.

There we go. That's better. All right. So black box monitoring is a probe: it just looks at your software from the outside, and it has no knowledge of the internals. It's really good for end-to-end testing. If you've got a fairly complicated service, you come in from the outside, you go through the load balancer, you hit the API server, the API server might hit a database, you go all the way through to the back of the stack and then all the way back out. So you know that everything is working end-to-end, but you only know about it for that one request.

To find out whether your service is working end-to-end for every single request, you need white box instrumentation. Basically, every event that happens inside your software and inside your serving stack gets collected and counted. So you know that every request hits the load balancer, every request hits your application server, every request hits the database, and you know that everything matches up. This is called white box, or metrics-based, monitoring.

There are different examples of software that does black box and white box monitoring. For black box you have software like Nagios, where you configure checks, or Pingdom, which will ping your website. And then there's metrics-based monitoring: things like Prometheus, the TICK stack from InfluxData, New Relic, and other commercial solutions. But of course I like to talk about the open source solution, so we're going to talk a little bit about Prometheus.

Prometheus came out of the idea that we needed a monitoring system that could collect all this white box metric data and do something useful with it: not just give us a pretty graph, but also let us alert on it. So we needed both data gathering and analytics in the same system. To build this, we looked at the way data was being generated by applications. There are advantages and disadvantages to push versus pull models for metrics, and we decided to go with the polling model because there are some slight advantages to polling over pushing.

With polling, you get a free black box check that the application is running: when you poll your application, you know the process is up. If you're doing push-based monitoring, you can't tell the difference between your application doing no work and your application not running, so you don't know if it's stuck or just has nothing to do. With polling, the polling system knows the state of your network: if you have a defined set of services, that inventory drives what should be there. Instead of wondering whether a silently missing process is dead or just idle, you know for a fact what processes should exist, and that's a real advantage. And with polling, testing is really easy. With push-based metrics, if you want to test a new version of the monitoring system or try something new, you have to tee off a copy of the data. With polling, you can just set up another instance of your monitoring and test against it.
It doesn't even have to be a monitoring system; you can just use curl to pull the metrics endpoint. So it's significantly easier to test. The other nice thing is that the client is really simple. The client doesn't have to know where the monitoring system is, and it doesn't have to know about HA; it just has to sit and collect data about itself. It doesn't need to know anything about the topology of the network. So as an application developer, if you're writing a DNS server or some other piece of software, you don't have to know anything about monitoring software. You just implement the instrumentation inside your application, and the monitoring software, whether it's Prometheus or something else, can come and collect that data from you. That's similar to the very old monitoring protocol SNMP, but SNMP has a significantly less friendly data model for developers.

So this is the basic layout of a Prometheus server. At the core there's the Prometheus server itself, which handles all of the data collection and analytics. It's all written in Go, as a single binary. It knows how to read from your inventory: there are a bunch of different service discovery methods, whether you've got a Kubernetes cluster, a cloud platform, or your own customized setup with Ansible. Ansible can take your layout, drop it into a config file, and Prometheus will pick that up. Once it has the layout, it goes out and collects all the data. It has a time series database that stores all that data locally, and it has a query language called PromQL, which is designed for metrics analytics. On top of PromQL you can add front ends: a simple API client to run reports, something like Grafana for creating dashboards, or the simple web UI that's built in. You can plug in anything you want on that side.

It also has the ability to continuously execute queries, called recording rules, and these rules work in two different modes. You can take a query and generate new data from it, or you can take a query and, if it returns results, fire an alert. That alert is a push message to the Alertmanager. This allows us to separate the generation of alerts from the routing of alerts: you can have one or hundreds of Prometheus servers all generating alerts, and they go into an Alertmanager cluster, which does the deduplication and the routing to a human. Because of course the problem was that we had dashboards with graphs, but to find out whether something was broken you had to have a human looking at the graph. With Prometheus we don't have to do that anymore; we can simply let the software tell us when we need to go investigate. We don't have to sit and stare at dashboards all day, because that's really boring.

So what does it look like to actually get data into Prometheus? This is a very basic example of a Prometheus metric. If you know much about the Linux kernel: the kernel tracks, in /proc/stat, the state of all the CPUs in your system. We express this with the name of the metric, node_cpu_seconds_total. It's a self-describing metric: you can read the metric name and understand a little bit about what's going on here.
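For example, here's roughly what pulling a node exporter's metrics endpoint with curl looks like. The host and port assume a default node_exporter install, and the numbers are invented; the exposition format itself is Prometheus's standard one:

```
$ curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 376742.33
node_cpu_seconds_total{cpu="0",mode="system"} 4128.67
node_cpu_seconds_total{cpu="0",mode="user"} 19316.42
```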
So the Linux kernel, and other kernels, track CPU usage as the number of seconds spent doing different things, whether that's in system or user space, handling IRQs, in I/O wait, or idle; the kernel actually tracks how much idle time it has. And it tracks this per CPU. Older monitoring systems used to model this with a tree structure, and that caused a lot of problems when you wanted to mix and match data. By switching from a tree structure to a tag-based structure, we can do some really interesting, powerful data analytics. So here's a nice example of taking those CPU-seconds counters and converting them into a graph using PromQL.

Now we can get into metrics-based alerting. Now that we have this graph, we can look and see, oh, there's a little spike there, we might want to know about that. I used to be a site reliability engineer, or I still am a site reliability engineer at heart, and we have this concept of a hierarchy of the things you need to run a service reliably. The most important thing, down at the bottom, is monitoring, because if you don't have monitoring of your service, how do you know it's even working?

We want to alert based on data, not just on those end-to-end tests, and there are a couple of techniques for that: the RED method and the USE method, and there are some nice links to blog posts about these. The RED method, for example, describes the requests your system is handling in terms of three things: the rate of requests, the number of errors, and how long requests take, the duration. With a combination of those three things you can determine most of what your users actually see: did my request go through, did it return an error, and was it fast? For most people, that's all they care about: I made a request to a website, it came back, and it was fast. It's a very simple method, but those are the important things for determining whether your site is healthy.

But we can also go back to some more traditional sysadmin-style alerts. This one takes the filesystem's available space divided by the filesystem's size, which gives the ratio of filesystem availability from zero to one; multiply by 100 and we have a percentage. And if that's less than or equal to 1% for 15 minutes, we should probably tell a sysadmin to go find out why that filesystem is full. Super nice and simple.

We can also tag alerts. Every alert includes all the extra labels that Prometheus adds to your metrics. If we go back and look at this metric, the metric itself only contains information about the internals of the application. Anything about which server it's on, whether it's running in a container, what cluster it comes from, what continent it's on: that's all extra annotation added by the Prometheus server at discovery time. Unfortunately, I don't have a good example of what those labels look like here, but every metric gets annotated with location information, and that location information also comes through as labels on the alert.
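A sketch of that filesystem alert as a Prometheus rule file. The expression follows the math described above; the metric names are node_exporter's current ones (the talk may predate a rename), and the severity value, label names in the summary, and runbook URL are placeholders of my own:

```yaml
groups:
  - name: node.rules
    rules:
      - alert: FilesystemAlmostFull
        # Available space as a percentage of total size,
        # below 1% for 15 minutes straight.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 <= 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is almost full"
          runbook: "https://example.com/runbooks/filesystem-full"
```

The severity label and the runbook annotation here are exactly the kind of extra routing information the Alertmanager discussion below is about.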
So when a message comes into your Alertmanager, the Alertmanager can look at it and go, oh, that's coming from this data center, and it can include that in the email or IRC message or SMS: "filesystem is out of space on this host in this data center." All those labels get passed through, and you can append additional labels, like severity critical, and include those in the message to the human. Because of course, this is how you get the message from the monitoring to the human. You can even include nice things like a link to your documentation as an annotation, and the Alertmanager can take that base URL and massage it into whatever it needs to look like to get the operator to the correct documentation.

We can also do more fun things. Since we're not just checking what the space is right now, we're tracking data over time, we can use predict_linear. predict_linear just does a simple linear regression. In this example it takes the filesystem's available space over the last hour, fits a line, and says: it's going that way, and four hours from now, based on that one hour of history, it's going to be less than zero, which means full. So we know that within the next four hours the disk is going to be full, and we can tell the operator ahead of time, not just once it's already full, so they have some window to fix it before it fails. This is really important, because if you're running a site, you want alerts that tell you your system is failing before it actually fails. If it fails, you're out of SLO or SLA and your users are going to be unhappy, and you don't want your users telling you that your site is down; you want to know about it before your users can even tell. This lets you do that. And also, Prometheus being a modern system, we support full UTF-8 in all of our labels.

Here's another one, a good example from the RED method. This is the rate of 500 errors coming from an application, and you can simply alert when there's more than one 500 error per second coming out of the application, if that's your threshold for pain. You can also convert that from a rate of errors to a percentage of errors. Say you have an SLA of three nines: you can say that if the rate of errors divided by the rate of requests is more than 0.001, that's a problem. So you can alert with that level of error granularity. If you were just doing a black box test, you wouldn't know any of this. You'd only fire an alert if you got an error from the system, and then another error from the system. But if those checks are one minute apart and you're serving 1,000 requests per second, you could serve 10,000 errors before you even get an alert. And you might miss it entirely: what if the check hits one random error, and the next time, while you're serving 25% errors, there's only a 25% chance of that check failing again? You really need these metrics to get a proper picture of the status of your system. And there are even options to slice and dice those labels.
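Hedged PromQL sketches of both ideas. predict_linear is the real function described above; http_requests_total with a code label is a hypothetical application metric I'm assuming for the error examples, following a common naming convention rather than anything shown in the talk:

```promql
# Alert ahead of time: based on the last hour of history,
# will the filesystem hit zero bytes within four hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

# More than one 500 error per second, averaged over 5 minutes.
rate(http_requests_total{code="500"}[5m]) > 1

# Error ratio against a three-nines target: errors divided by
# total requests, alerting above a 0.1% error rate.
rate(http_requests_total{code="500"}[5m])
  / rate(http_requests_total[5m]) > 0.001
```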
So if you have a label on all of your applications called service, you can send that service label through to the message and say, hey, this service is broken. You can include that service label in your alert messages. And that's it; I can go to a demo and Q&A. Any questions so far? Or does anybody want to see a demo? Microphone.

Hi, does Prometheus do metric discovery inside containers, or do I have to implement the metrics myself?

So, for metrics in containers: there are already things that expose the metrics of the container system itself. There's a utility called cAdvisor, and cAdvisor takes the Linux cgroup data and exposes it as metrics, so you can get data about how much CPU time and how much memory your container is using. But that's only about the container's resource usage, not about the application, because the container system has no idea whether your application is written in Ruby or Go or Python or whatever. To get application-level data, you have to build it into the application. For that, we've written Prometheus client libraries that can be included in your application directly, so you can get that data out. If you go to the Prometheus website, there's a whole series of client libraries covering a pretty good selection of popular languages.

Sorry, sorry. What is the current state of long-term data storage?

Very good question. There are actually several different methods for doing this. Prometheus stores all its data locally, in its own storage on the local disk, but that's only as durable as the server it's on. If you've got a really durable server, you can store as much data as you want; you could store years and years of data locally on a Prometheus server. That's not a problem. There are a bunch of misconceptions because of our defaults, and because the language on our website said it's not long-term storage; that's simply because we leave the durability problem up to the person running the server. The time series database that Prometheus includes is actually quite durable, but only as durable as the server underneath it.

So if you've got a very large cluster and you want really high durability, you need some kind of clustered storage. But because we want Prometheus to be very simple to deploy, very simple to operate, and very robust, we didn't want to include any clustering in Prometheus itself. Any time you have clustered software, what happens when your network gets a little wonky? The first thing that goes down is all of your distributed systems, and building distributed systems to be really robust is really hard. So Prometheus is what we call an uncoordinated distributed system. If you've got two Prometheus servers monitoring all of your targets in an HA pair and there's a split brain, each Prometheus can see its half of the network, sees the other half as down, and both try to get alerts out to the Alertmanager. That is a really, really robust way of handling split brains, bad network failures, and bad problems in a cluster. It's designed to be super robust, and the two Prometheus servers don't have to talk to each other to do it; they just do it independently.
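As a quick sketch of that uncoordinated HA pair: two Prometheus servers run the identical configuration, each scraping everything and each sending alerts to the same Alertmanagers. The config structure is standard prometheus.yml; the host names and ports are made up:

```yaml
# prometheus.yml: deployed identically on both HA replicas.
# Each server scrapes the full target set and fires alerts on
# its own; the Alertmanager cluster deduplicates the results.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-1:9093', 'alertmanager-2:9093']
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['web-1:9100', 'web-2:9100', 'db-1:9100']
```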
But if you want to be able to correlate data between many different Prometheus servers, you need an external data store. Also, you may not have very big servers: you might be running your Prometheus in a container with only a little bit of local storage space, and want to send all that data up to a big clustered data store. We have several different ways of doing this.

There's the classic way, called federation, where you have one Prometheus server pulling in summary data from each of the individual Prometheus servers. This is useful if you want to run alerts against data coming from multiple Prometheus servers. But federation is not replication; it can only pull a little bit of data from each server. If you've got a million metrics on each of your Prometheus servers, and you've got ten of those, you can't pull ten million metrics into one Prometheus server. It's just too much data.

So there are a couple of other nice options. There's a piece of software called Cortex. Cortex is a Prometheus server that stores its data in a distributed database, things based on the Google Bigtable model like Cassandra or, what's the Amazon one? Yeah, DynamoDB. If you have a DynamoDB or Cassandra cluster, or one of these other really big distributed storage clusters, your Prometheus servers will stream their data up to Cortex, and it keeps a copy of the data from all of your Prometheus servers. Because it's based on things like Cassandra, it's super scalable, but it's a little complex to run, and many people don't want to run that complex an infrastructure.

We also have another new one that was just blogged about yesterday, a thing called Thanos. Thanos is Prometheus at scale. Actually, why don't I bring that up? This was developed by a company called Improbable: they had billions of metrics coming from hundreds of Prometheus servers, and they developed this in collaboration with the Prometheus team to build a super highly scalable Prometheus setup. Prometheus itself stores incoming metrics data in a write-ahead log, and every two hours it runs a compaction cycle that creates an immutable block of time series data, the time series chunks themselves plus an index into that data. Those two-hour blocks are immutable. What Thanos does is run a little sidecar binary that watches for those new block directories and uploads them into a blob store, so you can put them in S3 or Minio or some other simple object storage. Now all of your data, index included, is sitting there ready to go, and the Thanos sidecars form a little mesh cluster that can read from all those blocks. So you get a global view, all stored in big bucket storage, and things like S3 and Minio are bucket stores, not databases, so they're operationally a little easier to run. Plus, now that all this data is in the bucket store and the Thanos sidecars can talk to each other, you have a single entry point: you can query Thanos, and Thanos will distribute your query across all of your Prometheus servers. So now you can do global queries across all of your servers. It's very new, they just released their first release candidate yesterday, but it's looking like the coolest thing ever for running large-scale Prometheus. And here's an example of how that's laid out.
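Roughly, the sidecar part looks like this. This is a sketch from memory of an early Thanos release; the flag names and the bucket config format may well differ in the version you run, and the paths, bucket name, and endpoint are invented:

```sh
# Run next to an existing Prometheus server. The sidecar watches
# the TSDB directory for finished two-hour blocks and uploads
# them to the object store described in bucket.yml.
thanos sidecar \
  --prometheus.url http://localhost:9090 \
  --tsdb.path /var/lib/prometheus \
  --objstore.config-file bucket.yml

# bucket.yml, pointing at an S3-compatible store such as Minio:
#   type: S3
#   config:
#     bucket: thanos-blocks
#     endpoint: s3.example.com
```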
And this will let you have a billion-metric Prometheus cluster. It's got a bunch of other cool features. Any more questions? All right, maybe I'll do a quick little demo.

So here's a Prometheus server, provided by a group that does Ansible deployments for Prometheus. You can simply query for something like node_cpu; this is actually the old name for that metric. And you can see, here are exactly the CPU metrics from some servers. There are actually two servers here: there's an influx Cloud Alchemy instance and a demo Cloud Alchemy instance. Oh yeah, sure. Whoops. So you can see all the extra labels.

We can also look at, say, the last 30 seconds. We just add a little time window, called a range request, and now you can see the individual samples. All Prometheus is storing is a sample value and a timestamp, and all the timestamps are epoch time in milliseconds, so it's super easy to manipulate. But looking at these individual samples: if we go back, take the raw data, and graph it. Oops, that's not it, that's a syntax error. And we look at this graph. Come on. There we go. Well, that's kind of boring: it's just a flat line, because it's just a counter going up very slowly. What we really want to do is apply a rate function to this counter. So let's look at the rate over the last one minute. And there we go, now we get a nice little graph. You can see that this is 0.6 CPU-seconds per second for that set of labels.

But this is pretty noisy; there are a lot of lines on this graph and still a lot of data here. So let's start filtering. One thing we see here is idle, and we don't really care about the machine being idle, so let's add a label filter: mode is the label name, and it's not equal to idle. Done. And if I could type: what did I miss? Oh, I erased my bracket. There we go. Now we've removed idle from the graph, and that looks a little more sane. Oh wow, look at that, that's a nice big spike in user space on the influx server.

Okay, that's pretty cool, but this is still quite a lot of lines. How much CPU is in use in total across all the servers we have? We can just sum up that rate, and we see there's a sum total of 0.6 CPU-seconds per second across the servers we have. But that's a little too coarse. What if we want to see it by instance? Now we can see the two servers: we're left with just the instance label, the influx instance and the demo instance. That's a super easy way to see it. But we can also go the other way around: we can say without mode comma cpu, dropping those two labels and keeping all the others, so we still see the environment label and the job label on all this data. You can go either way with the aggregation functions, and there's a whole bunch of different functions; it's all in our documentation. But what if we want to see which CPUs are in use? Well, now we can see that it's only CPU zero, because apparently these are one-core instances. So you can add or remove labels and do all these queries. Any other questions so far?
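For reference, here are the demo's queries reconstructed in PromQL, using the old node_cpu metric name that this demo server exposes:

```promql
node_cpu                                             # raw counters
node_cpu[30s]                                        # range request: individual samples
rate(node_cpu[1m])                                   # per-second rate over one minute
rate(node_cpu{mode!="idle"}[1m])                     # filter out idle time
sum(rate(node_cpu{mode!="idle"}[1m]))                # total across everything
sum by (instance) (rate(node_cpu{mode!="idle"}[1m])) # keep only the instance label
sum without (mode, cpu) (rate(node_cpu{mode!="idle"}[1m]))  # drop mode/cpu, keep the rest
```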
Yeah, I don't have a question, but I have something to add: Prometheus is really nice, but it's a lot better if you combine it with Grafana.

Yes, yes, it is. In the beginning, when we were creating Prometheus, we actually built a piece of dashboard software called Promdash. It was a simple little Ruby on Rails app for creating dashboards, with a bunch of JavaScript. And then Grafana came out, and we were like, oh, that's interesting, but it doesn't support Prometheus. So we asked, hey, can you support Prometheus? And they were like, yeah, you've got a REST API to get the data? Okay, here, boom, done. Now Grafana supports Prometheus, and we were like, well, Promdash, this is crap, delete. The Prometheus development team are all backend developers and SREs, and we have no JavaScript skills at all, so we were happy to let somebody else deal with that. That's one of the nice things about working on this kind of project: we can do the things we're good at. We don't have any marketing people; it's just an open source project, and there's no single company behind Prometheus. I work for GitLab, Improbable paid for the Thanos system, and Red Hat now pays people who used to work for CoreOS to work on Prometheus. So there's lots and lots of collaboration between many companies to build the Prometheus ecosystem. But yeah, Grafana is great. Grafana actually now has two full-time Prometheus developers. All right, that's it.