So how many people are familiar with Prometheus at all? All right, good. Well, this will be a good overview, which I think is the objective. I'll give you an overview of the different components that make up Prometheus, and then I'll give you a quick example of how to deploy it and how to pull data out of it. It's a pretty interesting tool. I used to work at a company called Orbitz in the US. They're an online travel site; I think Rates to Go was like an Asian version of what they used to do, at least back in the day. I don't know if they're still around. But we created something called Graphite. You might have heard of Graphite; it was an open source project that came out of Orbitz, and I was on the team. Actually, one of my teammates is the one who actually wrote Graphite, and my responsibility was getting the metrics out of Graphite and creating graphs and meaningful information. We also created a whole event processing stream around that as well. So whenever I dive back into the world of metrics and see how we've progressed, it's always really interesting to me, knowing that I have this knowledge or background from 11, 12 years ago of what we were trying to do to monitor our systems. So a little bit about me. I do all things open source at Sysdig. We have a couple of open source projects, and we also interact with a lot of open source projects, and Prometheus is one of them. We like Prometheus because it gives us a standard way to pull metrics out of applications. It also gives us a standard way to pull metrics from all of the components in your Kubernetes cluster and push them into our cloud-based SaaS monitoring tool, because the Kubernetes world, the cloud-native world, is starting to standardize on Prometheus metrics. I also used to work at Chef, so I've had a background in open source for a while now. I worked at Chef for about five years and did a number of roles for them.
And like Anton, I'm a DevOpsDays organizer as well. I organized DevOpsDays Ohio, and I also founded DevOpsDays Minneapolis and a few others. If you want to reach out to me, that's my GitHub and Twitter handle; the Twitter handle will also be down in the corner. So this is not picking on OpenShift; this is just the most complete diagram that I had. But this is your typical Kubernetes cluster and what it looks like. You have lots of different components. You're going to have storage. You're going to have the master node where Kubernetes runs. Then you have the infrastructure nodes where your applications actually run. And then you have a whole service layer, hardware, a cloud provider, and a whole bunch of other things that you still have to monitor and get information from. The other thing is that these infrastructure nodes are going to scale up and down depending upon what your workloads are and how many application nodes you need. And then, of course, on every single one of those nodes, you're going to have tens of containers running that are actually running your application. So it gets very complicated to figure out: how do I start pulling metrics back from all of these things? The other thing that makes it complicated is that these environments are dynamic. Machines are coming up and down, or containers are spinning up and down. So how can you pull back the metrics and information from things that are rapidly churning and going out from underneath you? So the question is: how do you monitor your core infrastructure services? How do you alert when there's an issue? How do you monitor the applications themselves and not just look at infrastructure metrics? While infrastructure metrics are important, the real golden signals that you want to look at in your environment tend to come from your application. And then, how can you give developers access to monitor their application?
Not only how can you give them a way to instrument their code, but how can you give them a way to actually look into those particular metrics? So let's talk about what Prometheus is. There are multiple meanings for the word. It can be a monitoring stack. It can be a way to instrument your code. It can be a metrics interface that Prometheus has defined. It can be a query language, which is actually called PromQL. Or it can be the actual metric server itself. There are several components to the Prometheus stack, and this is what it looks like. So you have where we get our metrics from. In the default scenario in the Prometheus world, metrics are pull-based, meaning that the Prometheus server goes out and scrapes metrics off of an HTTP endpoint. You can have short-lived jobs that push metrics: you push metrics to a push gateway, and then the Prometheus server pulls the metrics from the push gateway. Prometheus can use the information provided through the Kubernetes metadata API to pull back information about what's actually running in your Kubernetes cluster, and it also does things like auto discovery. So if you put annotations on your pods, those annotations can be used to go and scrape metrics off of the endpoints that you put in those annotations. It's an easy way for Prometheus to auto discover where it needs to start pulling metrics from. There's an alert manager, which allows you to send alerts. And you can use the same language that you use for PromQL queries to write what you'll be alerting on: you basically run a query, and if the value is in a particular range, then the alert gets sent out to the various locations. And then Grafana is kind of the default open source graphing solution. It can interface with a whole bunch of different databases, not just Prometheus.
It gives you a common place to do dashboarding and visualization, to give access to dashboards to your developers, and other things like that as well. And I'll deploy that quickly here as well. So let's talk about the first component: Prometheus as a way to instrument your code. Back in the day, and we still kind of have this problem, this is the problem that Prometheus is really trying to solve: if you wanted to pull Java metrics, you would have to pull metrics in the JMX format. If you have custom metrics, you might be pushing them to something like statsd, and then you need to pull metrics from statsd and store them in a database. You have things like expvars, and then you also have vendor APIs. So you might push to a vendor API, or you might have to pull metrics off of a vendor API. A good common one where I pull a lot of metrics from is the Docker Hub repository or registry. You can query the API, and you can find out information about the number of pulls. So as we're tracking the success of our open source projects, I actually have to scrape data back from that API to figure out how many times people are downloading our software. And the problem is, well, everyone knows what PITA means; it's a pain in the ass for everyone. You have to maintain this huge code base to figure out how you pull metrics back from all these various locations. You have different interfaces, especially if you're doing something with vendors. It's not always available for your language, so you might not have an SDK to actually go and instrument the thing that you need in your language. And it doesn't always work in this dynamic microservices world: as things are spinning up and down, auto discovery of metrics and things like that tend to be very hard with the traditional way we've done things. And so there's a whole bunch of different things that people have done to try and fix this problem.
Prometheus metrics is one of them. This slide was written a while ago, but my understanding is that Prometheus metrics and OpenMetrics are essentially merging, and the Prometheus metrics format is becoming the standard OpenMetrics format now. So Prometheus metrics, or OpenMetrics, has really become the de facto standard for how we expose metrics. A lot of the cloud-native applications already support exposing Prometheus metrics, as well as the projects or services that you'd be running on top of your Kubernetes platform. So things like Istio, Traefik, CoreDNS, and so forth all expose their own Prometheus metrics. And so once you install Prometheus and you have those services running on the cluster, you're able to start collecting information about how those application services are performing. The nice thing is that you can instrument once and support many different use cases. You can also support many different providers as well: if there's a commercial company that you want to use, most of the commercial vendors will pull back or discover Prometheus metrics, so with something like Datadog or even Sysdig, you could automatically have those metrics pulled back. So you just need to instrument once, and then you can support the Prometheus server, whatever commercial agent you might use, Grafana, whatever dashboarding the commercial vendor provides, and so forth. The goal behind this is really trying to get rid of all those custom exporters and scripts and everything that we used to do to massage JSON. And this is an example of massaging JSON; we'll skip that. So let's talk about the metrics format and what it looks like. You have the metric name, and then you have a series of labels; you can have more than one label if you wish. And then you have the value of the metric, and then the timestamp. And there's a format that this all follows.
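To make that concrete, here's a minimal sketch of what a scraped payload looks like and how the pieces break apart. The sample lines and values are hypothetical, and this simplified regex ignores escaping and commas inside label values, which the real exposition format allows; it's an illustration, not the official parser.

```python
import re

# A hypothetical scrape payload in the Prometheus text exposition format:
# metric name, optional {label="value", ...} pairs, the sample value, and
# an optional millisecond timestamp.
SAMPLE = '''
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'    # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'             # optional label set
    r'\s+(?P<value>\S+)'                      # sample value
    r'(?:\s+(?P<ts>\d+))?$'                   # optional timestamp (ms)
)

def parse_exposition(text):
    """Parse exposition-format lines into (name, labels, value, ts) tuples."""
    samples = []
    for line in text.strip().splitlines():
        if not line or line.startswith('#'):  # skip HELP/TYPE comment lines
            continue
        m = LINE_RE.match(line)
        raw = dict(
            part.split('=', 1)
            for part in (m.group('labels') or '').split(',') if part
        )
        labels = {k: v.strip('"') for k, v in raw.items()}
        samples.append((m.group('name'), labels, float(m.group('value')),
                        int(m.group('ts')) if m.group('ts') else None))
    return samples
```

Calling `parse_exposition(SAMPLE)` pulls each line apart into exactly the four pieces described above: name, label set, value, timestamp.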
There's also the data model that they talk about in the prometheus.io documentation as well. So you can see here we have http_requests_total. In this case, we're just looking at posts that were successful, and then we're looking at posts where the code is 400. The nice thing these labels allow us to do is that we can query on them later using PromQL to get only the information, only the metrics, that we want. So if we wanted to track errors, we could use the tilde regex matcher with something like "4.." to match anything whose status code matches that regular expression, and I'll show that here. All right, so there's also the metrics interface that's defined. The first thing that's important is understanding how metric names work, and there are a couple of things I'm going to highlight here because this is really long. Typically, the first thing that you have is the prefix, and then beyond the prefix, you have the actual metric itself. The other thing is that at the end, you're going to have the unit that you're measuring. So you can see here that we have seconds, we have bytes, or we have total. And total isn't necessarily a unit; something like HTTP requests doesn't necessarily have a unit that you're measuring. Requests per second could be another metric that you define, but an accumulating counter would just be a total. And then you can see here, when we combine them, where this is the total over a number of seconds as well. There's a ton of libraries that allow you to instrument your own code. And you want to instrument your own code around things like: how long is it taking your application to do the business logic type things? That was actually the problem we were trying to solve at Orbitz: we had hotel searches that were taking 60 seconds, and we had some hotel searches that were taking 10 seconds.
And so we needed to be able to see when those events were happening, when all of a sudden our hotel searches started to fail or started to take a long time. Then we could use that information to determine that maybe one of our providers was down or having their own issue, and we could turn that particular provider off. The other thing we really wanted to look at was air search metrics and other things like that. And so we actually wrote a custom format for our Java application to expose these metrics and push them into Graphite. That's an open source project called ERMA; the code was released but probably never touched again. I know it's still out there; I've seen it in a few places. But that was only for Java. The nice thing about what's happened in the world of exposing metrics is that no matter what language, even bash, you have a library that you can use to easily expose metrics in this Prometheus format. I would avoid trying to write your own metrics exporter for your language, and I would really encourage you to use one of the provided ones. You need to be careful when you're collecting metrics that you don't block your application, and a lot of these libraries actually take that into account. We needed to have multiple processes, one serving metrics collection and one serving the actual application code itself. There are also a lot of helpers in these libraries where you can create the data structures around the actual metrics and things like that; it makes it a ton easier. So there's what's called the metrics exporter. What you do in your application is use one of these SDKs to create the exporter, and all metrics are exposed via HTTP or HTTPS. So this gives you a common, standard way to instrument and scrape metrics.
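As a sketch of what such an exporter does under the hood, here's a toy /metrics endpoint using only the Python standard library. The metric name myapp_http_requests_total is made up for the example, and in practice you'd use the official client library for your language rather than hand-rolling this; the client libraries handle locking, metric types, and the exposition details for you. Serving from a background thread is the "don't block your application" point made above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Toy counter keyed by HTTP method; a real app would use a client library
# metric object instead of a bare dict.
REQUESTS_TOTAL = {"post": 0, "get": 0}

def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_http_requests_total Total HTTP requests handled.",
        "# TYPE myapp_http_requests_total counter",
    ]
    for method, count in sorted(REQUESTS_TOTAL.items()):
        lines.append('myapp_http_requests_total{method="%s"} %d' % (method, count))
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port=0):
    """Start the exporter on its own thread so it never blocks the app."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this running, a plain curl or browser visit to /metrics shows the current counter values, which is exactly the "just curl it to see if your metrics are incrementing" workflow described next.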
The other nice thing is that you can just open it up in your web browser, or do a curl against the metrics endpoint, to see if you're getting metrics, see if your metrics are incrementing, and other things like that. There are lots of well-known exporters; there are about 470 different ports allocated for the different exporters. These exporters cover even low-level things at the hardware layer. So if you're running Dell hardware with, what's that called, the iDRAC inside of the Dell, you can pull stats off of the actual hardware components. You can pull information from CloudWatch and other things like that as well. And then there are lots of other exporters beyond those. This is more a definitive list of the ports allocated than of actual exporters; for example, we've allocated a port for one of our open source projects, but we haven't started exposing metrics yet. So it kind of gives you the idea that a lot of people want to expose metrics in this format, and it shows you the popularity of them. There are a couple of different ways you can collect metrics as well. Some good examples are things like cAdvisor, Node Exporter, and kube-state-metrics, which will pull metrics back from your Kubernetes cluster and then expose them for Prometheus to pull. We also have a commercial agent, but I'm going to skip that because that's not relevant for us. So let's talk about the query language real quick. PromQL is a full-featured query language that allows you to analyze metrics in real time. You can filter metrics by labels. There's a whole language where you can use functions as well: you can do things like averages, standard deviations, square roots, and logarithms, if you need to smooth things out. You can also do deltas very easily. So it all depends upon what you're trying to create out of your metrics.
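To illustrate what a delta-style function like rate() is doing with a counter, here's a toy calculation over hypothetical samples. Real PromQL also handles counter resets and extrapolates at the window edges, which this sketch deliberately ignores; it only shows the core idea of per-second increase over a trailing window.

```python
def rate(samples, window_seconds):
    """Per-second increase of a counter over the trailing window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    latest = samples[-1][0]
    in_window = [s for s in samples if s[0] >= latest - window_seconds]
    if len(in_window) < 2:
        return 0.0  # real PromQL returns no data here; 0.0 keeps the toy simple
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)

# A counter scraped every 60 seconds, growing by 120 per scrape
# (a steady 2 requests per second):
samples = [(t, 2 * t) for t in range(0, 301, 60)]
```

Running `rate(samples, 300)` over that series recovers the steady 2-per-second rate, which is the kind of value you'd graph instead of the ever-growing raw counter.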
If you're trying to create a gauge, or if you're just trying to graph a counter, there are different options available to you. It also automatically creates histograms for you, just by leveraging this query language. The data you get back isn't the raw metrics data; instead you get the histogram with the bucketing already done, which makes it easier to expose to a UI like Grafana or something like that. You can also leverage regular expressions. So let me look at this real quick. (I keep moving this and it's rattling the mic, sorry.) So http_requests_total: this would just be every single request. It's not scoped by time, and it's also not scoped by any labels. Then, if we wanted to get more specific, we could say: show me the job "apiserver" and the handler "/api/comments". This is showing you how you can expose different metrics; inside of your application, it's really important to set the handler, so you can actually trace back into your application where that metric is coming from. You can see how you can use regular expressions as well. So maybe I want everything that starts, I'm sorry, everything that ends with "server". Or, for example, I want to not show anything that's a 400 error, so nothing that matches this expression. The dot is just like in grep or standard regular expressions, any character, and the star is zero or more repeats of the previous character. In here, since status codes are only three characters, we just put two dots. And then you can also do things like running functions. This is going to give us the rate over five minutes, and it's just going to return the data from 30 minutes ago to one minute ago, so we're getting just that set of data instead of all of the data. All right, so let's talk about the monitoring stack real quick and what it's all composed of.
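Before moving on, the selector semantics walked through above can be mimicked in a few lines. The series set here is hypothetical, and note that PromQL's =~ matcher is anchored at both ends of the label value, which is why two dots are enough to cover a three-character status code.

```python
import re

# Toy in-memory series set, mimicking the http_requests_total examples.
SERIES = [
    {"__name__": "http_requests_total", "method": "post", "code": "200"},
    {"__name__": "http_requests_total", "method": "post", "code": "400"},
    {"__name__": "http_requests_total", "method": "get",  "code": "404"},
]

def select(series, name, **matchers):
    """Select series by metric name and label matchers.

    A matcher value like ("=~", "4..") mimics PromQL's regex matcher;
    a plain string mimics an exact (=) match. PromQL anchors regexes at
    both ends, so re.fullmatch is used here.
    """
    out = []
    for s in series:
        if s.get("__name__") != name:
            continue
        ok = True
        for label, m in matchers.items():
            if isinstance(m, tuple) and m[0] == "=~":
                ok = ok and re.fullmatch(m[1], s.get(label, "")) is not None
            else:
                ok = ok and s.get(label) == m
        if ok:
            out.append(s)
    return out
```

So `select(SERIES, "http_requests_total", code=("=~", "4.."))` picks out both the 400 and 404 series, just like the error-tracking query described earlier.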
And then I'm going to do just a quick deployment, and I'll play around with how you set up Grafana and things like that very quickly. So you have the Prometheus server, you have the alert manager, you have Grafana for the UI, and then you have exporters and the push gateway. And I had already talked about this as well. By the way, if you were in the operator talk earlier, there is a Prometheus operator, and that operator allows you to set all of this up very easily. I'm going to use a Helm chart today. If you're not familiar with Helm, it's essentially kind of like apt or RPM, but for Kubernetes manifests. Or kind of like Terraform as well, where you can very easily just say "helm install" and the name of the thing you want to install, and it'll automatically set that up for you in your Kubernetes cluster. When you have one of these large deployments, using something like Helm or other tools, or even the operator, makes it a lot easier. That's the real idea behind the operator: if you want to install Prometheus, it's much easier to have one block of JSON instead of all the JSON or YAML I'm going to show you that you need to actually go and set up this entire stack. So there are a couple of different ways that you can do this; I kind of already talked about these as well. One thing that I'll point out is that you probably need lots of Prometheus servers. That's kind of one of the limitations of Prometheus: scalability at this level works in a horizontal fashion, and so you have lots of Prometheus servers that are all collecting and storing metrics. The other thing I'll point out, and I'm going to work around this in a different way, sorry, is that Prometheus doesn't do long-term storage. Or let me put it this way: Prometheus long-term storage is only as good as the durability of the disks that you provide to Prometheus.
And so if you have a SAN, if you're like, we're still doing SANs or something like that, or you're using EBS volumes and things like that, you can have some guarantee around persistence. But Prometheus itself gives you no ability to do clustering and other things like that around your metrics. So you need to use something like InfluxDB or Cortex; there's some other commercial software that you can use as well to actually go and store those metrics and have a durable cluster. And then, instead of connecting Grafana to Prometheus, you connect Grafana to one of these time series database tools. So let's talk about Grafana real quick. Grafana is an open source dashboard and graph editor. The other cool thing about Grafana is that there's a whole marketplace around data source plugins, or data provider plugins, and dashboards as well. There's a whole bunch of different data sources that you can use to query data from. So it's not just Prometheus focused; it's focused on lots of different databases as well, including simple things like MySQL, so just doing MySQL queries and pulling data out of MySQL if you want. All right. There are also, as I already said, vendor solutions: there's Grafana Cloud and Weave Cloud. We have our own backend that we run as well that lets you store these metrics in a common location. We also do things like implement PromQL so that you can actually run queries; that's more about long-term storage. The other thing is that you can use Grafana with us as well, if you choose to. So that's kind of how we try to interact with the open source community: by leveraging a lot of the de facto standards that the community has created, as well as our own open source projects. So let me just grab a chair here and we can do a quick demo. Hopefully my demo environment didn't crash. All right.
So if I do a kubectl get pods, I shouldn't have anything running here. Must be getting close to lunch, because now the Wi-Fi is not working. All right, there we go. That wasn't too bad. I'm just going to use my history because it's easier, because there's a flag I've got to set. So if you want to know how to get started with Helm, it's actually really easy. You just download the Helm binary and then you run a helm init, and what that helm init does is basically install Helm for you. This is actually a new cluster that I just set up, and I did all of this this morning. So if I just run a helm install, I specify the name, which is your unique identifier for it, and then I specify the chart. And then, just like in Terraform, there are values that you can set to change what actually happens when this gets deployed. So we're changing a variable that's defined in the Prometheus chart, rbac.create, and we're setting that to true. There are things like image names, image versions; there's a whole bunch of different things that you can change. I actually downloaded the values.yaml, and you can see here there are things like mounts that you can change, the upgrade strategy, and, if you have a service and you want to expose it via an IP or a particular load balancer, you can do that as well. All right, so I'll hit install. And then you get back information about how to actually get to the things that you want to get to. So I'm going to export this, and we're just going to connect directly to Prometheus and not use Grafana real quick. And so now, if I go to 127.0.0.1:9090, I can see that Prometheus is up and running. The other really interesting thing here is that as soon as I deployed it, I started getting metrics back as well. And so if I click execute, then I can actually see the metrics and the value. You can also click on graph, and you can actually see that data as well.
And so maybe I want to go back just 15 minutes, and you can see that when I did that deployment, it automatically started pulling metrics back for me. So it's really just that easy and just that quick to start pulling metrics back. And then let's install Grafana, once again using a Helm chart. And I'll do a kubectl real quick, kubectl get pods. This is the value that Helm provides: you can see all of those different components that Helm created for me that are running Prometheus, and then you can see that it created the Grafana container, or Grafana pod, as well. So it makes it a lot simpler to actually go and deploy things to your Kubernetes cluster. So now I need to go and get my secret. And now I'm going to go and export this again. All right, so grab the secret, and now I should be able to go to port 3000, and Grafana is up and running for me. Then I'm logged in. So the first thing you need to do is configure a data source, and you can see all of the different options that are available. There are different plugins that you can add in as well, so if there's not an option there in the default out of the box, you can download a plugin, and that plugin allows you to access one of these other data sources. And then I'm just going to add Prometheus, and I'm going to cheat a bit because it's a long URL. There we go. And all I need to do is save and test, and it tells me that the data source is working. So I should be able to go here now and say create dashboard. I'm going to create a graph and do edit, and then we should be able to do that node_load1. So that interface that I was looking at earlier, remember, that kind of query interface against the Prometheus server? You're not able to save anything; you're not able to keep these dashboards and other things like that. So this gives you a much more fully functional UI to actually go and create these graphs. You can resize things.
You can zoom in to any time range as well. You can turn things on and off if you only want to look at particular things. But the nice thing is that Prometheus, I'm sorry, Grafana, offers lots of different dashboards that you can import as well. So if you go here to import, and let's just go to Grafana and click on dashboards, you can see all of these different options that are available to me. So let's take this one here; all you have to do is copy this ID, and then I paste in the ID and click load. It's going to give me information about this dashboard, and I say import. And now I have the host metrics of my Kubernetes cluster. I can drill down into the dashboards that they provide, looking at things like interrupts, what version of the kernel I'm running, and so forth, and an overview panel as well. I can drill down into CPU or load and other things like that. So there's a whole host of different dashboards available that you can very easily insert into your environment and load up. With that, thank you for taking the time to listen. Hopefully you've learned something about Prometheus, and maybe you learned something about Kubernetes too. I'm happy to take questions if anyone has any. I'll just say that this is also a good way, and I know there are a lot of hardware hackers in this community, so if you have IoT devices and you need to expose metrics, this is a very good way to expose metrics from those devices. You could have a collector go and collect those metrics, or they can push them back to a push gateway or something like that. And then you have Prometheus and Grafana where you can easily start to graph those metrics and pull that information back.
I'll be around for a little bit if anyone wants to chat but thank you very much for attending.