Hello everyone, good afternoon. Thanks for joining me as I talk to you about M3 and Prometheus, and how these two open source solutions can work together to create a metrics monitoring solution at global scale. A little bit about myself before we get started: my name is Gibbs Cullen, and I am currently a developer advocate at Chronosphere, where we provide a hosted metrics monitoring solution built on top of M3. Before Chronosphere, I was a product manager at Amazon and AWS for a few years, and I've included my GitHub and Twitter handles here in case you want to connect. Going over the agenda for today's talk: we'll start with monitoring with Prometheus and some of the challenges it runs into at a certain scale. From there, we'll talk about monitoring with Prometheus and M3, followed by a quick demo on how to get started with M3, and then there will be plenty of time at the end for Q&A. All right, so monitoring with Prometheus. To give a quick overview, Prometheus is a single-binary metrics solution. It comes with a tag-based metric ingestion format and a query language called PromQL. It is a scrape-based metrics ingestion solution, so it is pull-based rather than push-based. It comes with efficient metric storage, so it's very good at storing metrics within itself. It also comes with the ability to visualize and graph metrics; many users use Grafana for this, but Prometheus does have its own out-of-the-box solution as well. And finally, you can create alerts using Prometheus Alertmanager and metric roll-up rules, so you can aggregate sets of metrics using Prometheus as well. As you can see, Prometheus has really risen in popularity over the past few years, and it has taken over from Graphite as the de facto, go-to open-source metrics monitoring solution. This is mostly due to the advantages we'll discuss here. The first is that it's very easy to get started: there's a single binary used for ingestion, storage, and query, and another binary used for alerting, so it's a very simple out-of-the-box setup. It is also the CNCF-recommended monitoring tool, which means most of the software projects in the CNCF ecosystem also expose metrics in the Prometheus format, making it really the de facto tool for getting started. It also has a wide range of discovery mechanisms, so it's easy to use on any platform you're running, including Kubernetes, which makes it easy to integrate and start discovering the metric client endpoints to scrape your metrics from. And finally, it has a wide ecosystem of exporters, which are basically custom integrations with particular software projects. Most major software projects, especially open-source ones, have these existing integrations, so there's a great community around them when you want to implement Prometheus. Getting started with Prometheus, as I mentioned, is very easy, and this is a pretty standard setup here. In this diagram, we have our services exposing metrics via their Prometheus endpoints. From there, we have our Prometheus instance set up to discover these endpoints and to start scraping and ingesting the metrics from these endpoints into the Prometheus instance.
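Just to make that concrete, a minimal scrape configuration might look something like the sketch below. The job name, scrape interval, and annotation-based filtering here are illustrative assumptions rather than the exact config used later in the demo:

```yaml
# prometheus.yml -- minimal sketch of scraping via Kubernetes service discovery.
global:
  scrape_interval: 30s          # how often targets are scraped

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod               # discover every pod in the cluster
    relabel_configs:
      # only keep pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```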
From there, you can choose to run an instance of Grafana and point it to the Prometheus instance to visualize your metrics. You can do the same for alerting with Alertmanager, pointing your Alertmanager at your Prometheus instance in order to get alerts on your data. All right, so now we're going to get into some of the pain points with Prometheus. These pain points arise as you start to scale out your Prometheus instances, and as you start caring more about reliability. From a reliability perspective, by default all of your data goes into a single Prometheus instance, as you can see here. So if it goes down, you not only lose all active, real-time monitoring of your services, you also lose access to all of the historical data for those services. That's a significant single point of failure with the out-of-the-box setup of Prometheus. The recommended way around this is to run two or more instances of Prometheus and have them both scrape the same client endpoints, so that if one Prometheus instance goes down, you still have a copy of your metrics. When it comes to alerting, since Alertmanager is also a single binary, you would need to run two instances of it as well, and each would trigger alerts for its respective Prometheus instance. Viewing data from a dashboard with this setup is where it gets a little trickier. Typically you would put a load balancer in front of your two Prometheus instances and point the Grafana instance at the load balancer. From there, all read requests get balanced between the two Prometheus instances, so if one goes down, you're still able to fulfill those requests. This generally works well for reliability in the sense that you keep at least one copy of your data. The problem is that if you are doing rolling restarts of your Prometheus instances for any reason, whether for maintenance or for upgrades, you'll end up with gaps in your data from while each instance was down or restarting. You can see that in this diagram: let's say we did a rolling restart of both of our Prometheus instances; each instance is missing metrics from when that instance was down, so neither instance has a complete picture of the CPU usage over the given time period. From the dashboard's standpoint, with a load balancer in front, you will only ever see one of the two copies of your data at a time, so if you refresh your graphs you'll get either one gap or the other, which can lead to inconsistent results. Unfortunately, there's currently no out-of-the-box way in Prometheus to merge the two sets of data. The second major pain point we'll discuss is around scaling up your Prometheus instances. A common case is that you're monitoring a set of services with a single Prometheus instance, and all of a sudden one of those services starts producing a lot more metrics, so that one instance can no longer handle the load and becomes overwhelmed.
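Before we get to the workaround for that, here's a rough sketch of what the HA pair I just described looks like in configuration terms. The target and Alertmanager addresses are assumptions for illustration; both replicas run the exact same scrape config and differ only in an external label:

```yaml
# prometheus-a.yml -- prometheus-b.yml would be identical apart from the replica label.
global:
  external_labels:
    replica: a                              # distinguishes this copy of every series
scrape_configs:
  - job_name: services
    static_configs:
      - targets: ['service-1:9100', 'service-2:9100']   # same targets on both replicas
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-a:9093']  # each replica points at its own Alertmanager
```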
The recommended way to get around this is to create a separate Prometheus instance and have it scrape and store metrics from that particular service, while the original instance scrapes and stores metrics from the other services. By doing this, you are manually sharding the load across your fleet of Prometheus instances. However, this gets a bit tricky for a few reasons. The first is from a dashboarding and alerting perspective: you need to tell each dashboard or alert which Prometheus instance to point to in order to get the data you're looking for. Originally, all the dashboards pointed to the first Prometheus instance, but now that the data is sharded across the two, you not only need to change the data source in Grafana when you want to look across the two instances, you also lose any of the historical data of service A when you spin up the second instance. The same thing happens with alerting. Then there's another scenario where a single dashboard or alert needs data from both of your Prometheus instances, for example summing a metric across your various services. In this case, you need to make sure all of that data lands in a single place, so you create a third instance through a process called federation, which allows you to view your data across both instances. By creating this federated node, you pull a subset of data from your original instances, and that allows you to query the federated, third instance for that subset of data from the various services. The problem, however, is that you still only have a subset of data stored in your federated instance. If you need more data than what's in that node, you also need to query, and point your Alertmanager to, the respective instances where that data is stored. As you can imagine, managing this, knowing which instance has which data and where the data overlaps, gets very hard at scale. And as you can see here, scaling this up leads to many more federated nodes, which can leave you with a very complicated Prometheus topology. It's similar to the previous slide: if you wanted to look across your instances, or in this case across regions or zones, you would need to federate the data into another Prometheus instance that combines both zones or regions, but you still need to maintain awareness of which Prometheus instance contains the data you're looking for. All right, the third pain point we're going to discuss is around efficiency. Prometheus is not the most efficient at storing long-term data, mostly because there's no built-in downsampling capability. For example, if you're storing data for six months at a scrape interval of 30 seconds, it ends up taking about 8,000 kilobytes. But if you could downsample that same dataset to a one-hour resolution for the six-month period, it would only use about 68 kilobytes. And that's just for one instance; if you were to scale up to 100 instances of Prometheus, the discrepancy becomes much larger. So as you store more and more long-term data, downsampling becomes a very valuable asset for efficiency.
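Those figures roughly check out if you assume something on the order of 16 bytes per raw sample (an 8-byte timestamp plus an 8-byte float value); the exact on-disk numbers depend on Prometheus's compression, so treat this as a back-of-the-envelope sanity check rather than a precise measurement:

```latex
% Six months of one series at a 30 s scrape interval, assuming ~16 B per raw sample:
\frac{6 \times 30 \times 24 \times 3600\,\mathrm{s}}{30\,\mathrm{s/sample}} \approx 518{,}400 \text{ samples}
  \quad\Rightarrow\quad 518{,}400 \times 16\,\mathrm{B} \approx 8{,}300\,\mathrm{KB}

% The same six months downsampled to a 1 h resolution:
6 \times 30 \times 24 = 4{,}320 \text{ samples}
  \quad\Rightarrow\quad 4{,}320 \times 16\,\mathrm{B} \approx 69\,\mathrm{KB}
```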
And one thing to note is that once you start looking at metrics over that six-month period, even in dashboards, there's no great way to view your data at a 30-second granularity because there aren't enough pixels on the screen, so it's much more common to view long-range data at a coarser granularity anyway. Regardless, Prometheus does not have a great out-of-the-box way of doing this. The way to get around it is to federate your data, so that you have a second Prometheus instance, as you see here, which reads the raw 30-second data, downsamples it, and stores it at a coarser resolution. However, by doing this, the downsampled data needs a new metric name, so when you query the data you not only need two separate queries to look across both resolutions, you also have to switch between dashboards in Grafana to look across the two resolutions of data. All right, so now let's recap the major pain points we've discussed with scaling up Prometheus. From a reliability perspective, Prometheus is not really designed to handle availability zone or region failures, especially with a high level of consistency. In terms of scalability, as we discussed, the management overhead of sharding the various data sources can become very painful over time, and the management of federation and configuration becomes difficult at scale as well. And finally, in terms of efficiency, while the platform is great at storing short-term metrics, without a downsampling capability there's no great solution for storing longer-term data. Not to worry, though: the Prometheus maintainers are very aware of these pain points. They intentionally built Prometheus to be a very easy out-of-the-box solution, so rather than building solutions for these pain points themselves, they introduced a concept called remote storage, which exposes remote write and remote read APIs. These APIs can be implemented by other technologies such as M3, Thanos, and Cortex, which can then be used to solve these pain points at scale. For the rest of the talk, we're going to dive into M3 in particular and how M3 and Prometheus can work together to solve the pain points we've discussed. Okay, so monitoring with Prometheus and M3. A quick overview of M3: what is M3? It is an open source metrics engine comprised of three main components. One is a custom-built time series database called M3DB, which is used to efficiently store all metrics data. Then there is an ingest and downsampling tier. And finally, there is a query tier that provides optimized queries and fetches for all of the data. M3 was built in open source from day one, with its first commit to GitHub around April 2016. It was originally built to solve the metrics and monitoring use cases at Uber, which it did successfully, and since then it has helped many other companies, such as Walmart, with their various metrics and monitoring use cases. It continues to be maintained by Uber as well as Chronosphere. And as I mentioned, it is designed to be Prometheus remote storage compatible.
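In practice, wiring Prometheus up to a remote storage backend like M3 is only a few lines of configuration. As a rough sketch, assuming an M3 coordinator reachable at the hostname shown here (check the M3 docs for the exact endpoint paths and port in your version):

```yaml
# prometheus.yml (excerpt) -- remote storage wiring; hostname and port are assumptions.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```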
So let's go over some of the key features of M3 as they tie back to the main Prometheus pain points we discussed. From a reliability perspective, M3 keeps consistent copies of all of its data, with a default replication factor of three, and it's designed to tolerate single-node, availability zone, and region failures out of the box. It's also designed to be highly scalable: each tier is horizontally scalable, and it has already been proven to store billions of time series at a time. In addition, it was built with simple operations in mind, to help reduce the need for complicated management overhead as you scale up. And finally, in terms of efficiency, it has a built-in downsampling capability, and all data is compressed with an algorithm designed specifically for metrics and time series data. Going into the architecture overview of M3, we have the three tiers we mentioned: the first is the ingest and downsampling tier, called the M3 coordinator; from there, we have the distributed time series database, M3DB; and finally there's the query tier, called M3 query, which is used to fetch all data. On the write side of things, we have our Prometheus instance pointing to M3 via the Prometheus remote write endpoint, which the coordinator implements. On the read side of things, we have an instance of Grafana set up and pointed directly at the M3 query tier, which has a built-in PromQL read endpoint and can execute any read requests by fetching data from the M3DB tier. Diving into the M3 coordinator and the write path: as you can see, we have a Prometheus instance here writing to the coordinator using Prometheus remote write. Every data point that goes into the coordinator gets replicated three times, and those data points are written to three different places across the M3DB nodes. When writing these metrics to the M3DB nodes, the coordinator uses quorum writes, which means at least two of the three copies have to be written successfully for the write to be considered a success; if that doesn't happen, the coordinator will retry until it succeeds. The coordinator is also in charge of downsampling, meaning the coordinator generates downsampled data inside itself. For example, when a raw data point comes in, the coordinator is in charge of not only writing that raw data point to M3DB but also downsampling that data point, and from there it writes the downsampled data to the various M3DB nodes, again using quorum writes. Zooming in a little inside one of the M3DB nodes, there are a couple of concepts to discuss. The first is a namespace, which is most similar to a table in most other databases. It's generally recommended that you store data with the same resolution and retention period within a single namespace, and that namespace is then spread across all M3DB nodes within a cluster.
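To give you a feel for what that looks like, the coordinator's cluster configuration lists the namespaces along with their retention and, for aggregated namespaces, their resolution. This is only a sketch; the namespace names and retention values here are illustrative, so refer to the M3 docs for the full config schema:

```yaml
# m3coordinator config (excerpt) -- namespace names and retentions are illustrative.
clusters:
  - namespaces:
      - namespace: default          # raw data, as written by Prometheus remote write
        type: unaggregated
        retention: 48h
      - namespace: metrics_1h_6mo   # downsampled copy kept for longer
        type: aggregated
        retention: 4320h            # roughly six months
        resolution: 1h
```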
Within a namespace, all data is sharded and replicated. As I mentioned, there will be three replicas of each data point, and the sharding occurs on the metric ID: metrics are randomly sharded based on their metric IDs, which are essentially a combination of the metric name and the metric's tag key-value pairs. Inside a shard, there are two components of the time series data. The first is an in-memory inverted metric index that uses a trie-like structure to fulfill queries, including regex and glob pattern queries. In addition, all of the metric time series data is stored against the metric IDs themselves, in blocks that go back over time; for example, in this diagram the metrics are stored in two-hour blocks, each compressed by our compression algorithm, with the blocks stacked up next to each other as time goes on. And then finally, on the query and read side of things, the query tier can handle both PromQL and Graphite/StatsD-style metrics. In this particular diagram, we have an instance of Grafana pointing to our M3 query tier, just as you would with a Prometheus instance. From there, the M3 query tier fetches all of the data being requested using quorum reads, which ensures at least two of the three copies are fetched successfully so that you get a consistent read. It then applies any functions on top of the data before returning it to Grafana. One of the nice things about the M3 query tier is that it is stateless, so you can put a load balancer in front of it and have any of the instances you have set up fetch the data for you. For example, when pointing your Grafana instance at the M3 query tier, you just point it at a single read endpoint, and you don't have to worry about which data source you're connected to within Grafana. One final thing to note about the M3 query tier is that it is designed to be 100% compatible with PromQL. So, scaling up M3: as you can see, we still have a similar setup here, where all metrics are being scraped by a single Prometheus instance, which then forwards the metrics along to M3 via remote write. Scaling up the various tiers is just a matter of putting a load balancer in front of them and adding additional instances, although the way of doing this varies across tiers. For the coordinator and query tiers, because they are stateless, it's much simpler: you basically just add additional instances. The M3DB tier, however, is stateful; it's designed to be aware of when you are adding or removing instances from the tier, so when you want to scale it up, you need to make sure the corresponding configuration changes are made. All right, so now we're going to look at how M3 works across multiple zones. In this example, we spread our M3 cluster across three zones: as you can see, the coordinator, query, and M3DB tiers are spread across these three zones. However, you will notice that for the M3DB tier we have kept a single replica of the data within each zone. This is intentional, so that if one of your zones does go down, you still have two copies of your data. Also in this example, the coordinators are writing to all three replicas of M3DB, and the query instances are reading from all replicas across the zones. On the write side of things, you can put a load balancer between your Prometheus instance and the coordinators, and have Prometheus remote write go to any of the coordinators.
As an alternative, if you didn't want this setup, you could also run your coordinator as a sidecar next to your Prometheus instance. Then, looking over to the read side of things, you would also put a load balancer in front of all of your M3 query instances and point your Grafana instance at that single endpoint, which would then evenly distribute your query requests across the zones. So basically, if you did lose an entire availability zone, the cluster would still be up and running and you would still have two copies of your data intact, so you can maintain the quorum read and quorum write logic we discussed earlier. In terms of the load balancers, if a zone did go down, they would simply reroute traffic to the zones that are still up and running, and they would do this until the zone that went down comes back online and recovers. To help with the management of this multi-zone setup, and because the M3DB tier is the hardest part to scale and manage, precisely because it is a stateful tier, as I mentioned, we do have a Kubernetes operator in open source. So if you want to run M3 on Kubernetes, all you have to do is tell the operator the number of instances you want in the cluster, and from there it takes care of scaling everything up and down for you. Now we're going to look across two regions. This example has a similar setup to the previous slide, with the various tiers spread across three zones; you just don't see the zones here because we've zoomed out a bit. In terms of this setup, the recommendation for multiple regions is to have all of the data that's produced in a particular region stay in that region. If you do want to look across regions, the recommended setup is to connect your M3 query tiers so they can talk to each other across regions, and that's how you can make fetch requests across your multiple regions. For example, if you wanted data from your various regions in a single Grafana dashboard, a query request would come in to your local query tier, which would fan the query out to the query tiers in all of the other regions you have set up. Those query tiers fetch data from their local M3DB clusters and send the results back to the query tier that originated the request. That original query tier then combines the data from all regions and performs any functions on that data locally, inside that query node, before sending the requested data back to the dashboard, or to your instance of Grafana. The main point to get across with this setup is that when you have multiple regions, it is not recommended to replicate data across them; instead, you store the data within its respective region and simply connect the query engines so they can fetch data across your regions as needed. Okay, so now we're going to get into a quick demo of getting started with M3 and M3DB. A couple of things to note before I switch my screen over to the demo: we did talk about Prometheus remote write, and we will show that in the demo.
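The cluster in the demo is created through the Kubernetes operator I just mentioned, with a single kubectl apply of a manifest along these lines. This is only a sketch: the image tag, shard count, and namespace preset are assumptions, and a real spec also needs etcd endpoints and storage settings, so check the M3DB operator docs for the exact fields your version expects:

```yaml
# m3db-cluster.yaml -- illustrative M3DBCluster spec for the M3DB operator.
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: demo-cluster
spec:
  image: quay.io/m3db/m3dbnode:latest   # assumed image; pin a specific version in practice
  replicationFactor: 3                  # three copies of every data point
  numberOfShards: 256
  isolationGroups:                      # one group per zone, one node each
    - name: group1
      numInstances: 1
    - name: group2
      numInstances: 1
    - name: group3
      numInstances: 1
  namespaces:
    - name: metrics-10s:2d              # operator-provided preset namespace
      preset: 10s:2d
  # etcd endpoints, storage class, and node affinity omitted for brevity
```

Applying a manifest like this with `kubectl apply -f m3db-cluster.yaml` is the single command you'll see in a moment.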
Just for the purposes of time, I already have a few instances up and running for the demo: two Prometheus instances and one M3DB cluster. In addition, make sure to check out the documentation we have around M3, as there are other ways of deploying M3 if you're interested in going down those routes as well. Okay, so now we have the demo up and running here on my screen. It is a pre-recorded demo, so I'll just talk through the different stages. You can see I've already cd'd into the repo where I have the various config files we'll be using to get the different instances up and running. From here, you can see all I did was run a single kubectl command, and that created the cluster for me, so it's super simple: one kubectl command to get the cluster up and running. And here is the configuration file for that cluster; you can see we have the replication factor of three. The operator is already up and running in the background, but I'll run this command here to show it to you: as you can see, the pod status says it is up and running. Now I'll do a quick kubectl get pods to show the various instances I mentioned that are already running. Here are the two Prometheus instances I mentioned, A and B, already up and running for me, and one thing to note is that I also already have an instance of Grafana up and running as well. All right, so now we're going to go to Grafana, so we'll port-forward to our Grafana instance on port 3000. Cool, so now we're in the Grafana UI, and I'm going to go to the dashboards section. I've gone ahead and selected the Prometheus A data source, which is set up to scrape the Prometheus B instance. Running the up query here, you can see that we're only getting data for that one instance, Prometheus B. I'm zooming in a little now to the 15-minute interval. Doing the same thing and switching over to the Prometheus B data source, again with the up query, you can see we're only seeing data here for the Prometheus A instance. Now, if we go in and add the remote write configuration to our Prometheus ConfigMap and apply that, and then switch over to our M3DB data source, you can see both Prometheus instances here in the same view. So rather than having to switch between data sources, you can see everything in one snapshot; this is with the up query. I'll show one more query here, just to get a different perspective, using the scrape_duration_seconds metric: again, you can see that the metrics across both instances are being shown together here in a single dashboard. So that concludes the demo. All right, now that we've concluded the demo, that wraps everything up for my presentation. Thanks, everyone, for joining. Before we get to Q&A, I did want to go over a few resources for M3. If you're interested in getting involved with our community, we have various active Slack channels, so make sure to sign up there.
If you have any additional questions on some of the topics or components I touched on today, you can dig into them further in our documentation. On top of that, we host monthly office hours, which you can sign up for with this link, and we also have a community meetup group where we try to host meetups every month. So yeah, thanks again, and now we'll open it up to Q&A.