All right, let's get started. Hello, everyone. Thank you for coming today. My name is Matt Schallert, and I'm here to talk to you about storing long-term metrics with M3 and Prometheus. A little bit about me real quick and why I'm talking about this. I work on infrastructure at Chronosphere, a software-as-a-service observability platform, mostly focused on large-scale, high-throughput use cases and on doing that at a more efficient cost. A lot of the way we do that is by building on top of M3, the open-source project I'm here to talk about. Before Chronosphere, I was an SRE at Uber, where I worked on their in-house observability platform during Uber's massive growth phase, so I learned a lot about what it takes to run a large-scale observability platform. For better or worse, I've spent the last half decade or so working on these sorts of systems.

Let's dive right in with a topic you might be familiar with: Prometheus. Prometheus is what many would consider the de facto open-source solution for getting a metrics stack up and running, and a large part of the way it has earned that title is by being an all-in-one solution. It encompasses a lot of aspects of a metrics platform: it's an exposition format for exposing metrics from your applications, a solution for gathering metrics from the targets exposing them, a query engine for the data you've gathered, and it has a built-in alerting component. One of the many great things about Prometheus is that it's really easy to get started with, and a large part of that is because it all runs on a single instance. By default, all of your data is gathered by, stored on, and alerted on from a single instance. That has a lot of operational benefits, but like everything, it comes with a few trade-offs that we'll take a look at.

To give you an idea of the main trade-off, imagine you're emitting metrics from multiple services: service A, service B, service C. You decide to have a single Prometheus instance that gathers and stores metrics for all of them. That's a totally reasonable setup, and it's how most people first get started. But over time, say one of your services grows. Maybe it's getting more traffic. Maybe you've added more metrics to the application, so you're emitting more data for the same traffic. Maybe you're running more instances of the service. For whatever reason, one service is emitting more metrics, and it's starting to overwhelm what a single Prometheus instance can handle — whether that's the size of the dataset or the number of targets it has to scrape. One thing that's really helpful is that Prometheus offers some knobs to shape your ingest and tune how much data you store, namely relabel rules and drop rules, and those can definitely buy you some headroom. But even then, you still have the computational overhead of scraping the data.
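To make those drop rules concrete, here's a minimal sketch of what one looks like in prometheus.yml. The job name, target, and the debug_.* metric pattern are all hypothetical; metric_relabel_configs with action: drop is the standard Prometheus mechanism for discarding series at ingest time:

```yaml
scrape_configs:
  - job_name: 'service-a'              # hypothetical job
    static_configs:
      - targets: ['service-a:8080']    # hypothetical target
    metric_relabel_configs:
      # Drop any series whose metric name matches the regex
      # (e.g. noisy debug metrics) before it is stored.
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop
```

Note that metric_relabel_configs runs after the scrape, which is why the scraping overhead remains even when you drop most of the samples.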
So at this point you've overwhelmed what a single instance can handle. What do you do now? One possible option is to split things up and send that one service's metrics to a dedicated Prometheus instance. You've bought yourself enough headroom and you're no longer overwhelming that one instance, but now you have a few additional challenges. First, from a configuration and operational perspective, you have to decide which Prometheus instance will contain data for which targets and how you're going to divide that. Maybe you split them up by service, like we've done here. Maybe you split by the team that owns the service, or some other part of your org chart, or completely at random. Whatever you decide, you now have multiple instances to query this data from. In this example, if I wanted metrics from service A, I'd have to query one instance; if I wanted metrics from either of the other two services, I'd have to look at the other. There are things you can do here, like sending your query to all instances and stitching the data together, but the source of truth for these different services is now different instances.

Then, thinking about the next level of scale: what do you do when that one service outgrows what any single Prometheus instance can handle? At this point your service is so large — maybe it's the monolithic entry point to your entire application — that you've scaled the Prometheus VM up to the largest instance size AWS offers and it's still not cutting it. Now you're faced with another challenge: are you going to redesign your services to emit less data? If so, how do you decide what data to throw away? These are approachable problems, but they're non-trivial.

And finally, even if you've found the ideal distribution of services (or whatever other dimension) to Prometheus instances, there's the question of what you do when an instance just totally fails. If you only have one Prometheus instance scraping your service, then while that instance is down — whether you screwed up the configuration or there's some sort of host error — you're losing the data from that service, or from whatever group of targets that instance was scraping. One option is to have more than one Prometheus instance scrape every target, so you get high availability by sheer number of instances. That helps with data redundancy, but now you have the problem of figuring out which one to query. Say you're scraping a target from two instances and one of them is down for some period of time: you have to know which one to query for that time range, or again stitch the data from multiple instances together. I should mention that a lot of this may not be a problem for your use case. You might be scraping some low-priority target and not really care about gaps in the data, because it's just not operationally worth it. On my home computers, I have one instance of Prometheus running and I would never go through the hassle of setting up some crazy high-availability scheme there.
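For the cases where it is worth it, the common pattern for scraping every target from more than one instance is to run identical Prometheus replicas that differ only in an external label, so whatever consumes the data downstream can tell the duplicate samples apart. A minimal sketch, where the cluster and replica label names and values are my own hypothetical choices:

```yaml
# prometheus-a.yml (prometheus-b.yml is identical except replica: prometheus-b)
global:
  external_labels:
    cluster: prod            # hypothetical cluster name
    replica: prometheus-a    # lets a downstream consumer dedupe the HA pair
scrape_configs:
  - job_name: 'service-a'
    static_configs:
      - targets: ['service-a:8080']   # both replicas scrape the same targets
```

This gives you the redundancy, but as mentioned, something still has to pick between or dedupe the two copies at query time.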
But what I've seen is that as Prometheus usage has grown, people are using it for more and more purposes, and increasingly mission-critical ones. You might be capturing business metrics that you surface to other parts of your organization. You might be capturing metrics that feed auto-scaling systems for the rest of your infrastructure, where you really don't want gaps in the data. Different people will have different priorities here, but in some of those cases it is going to be critical. So to recap: if you're starting out with Prometheus and expecting to scale your usage over time, the two questions you'll likely have to answer eventually are how you're going to handle data growth and how you're going to handle instance failures.

Luckily, people have thought about this. As I mentioned earlier, the Prometheus developers chose, as a design constraint, to focus on the single-instance use case for storage. That's not a bad thing: it has made operating Prometheus a lot easier, it lets the development community focus on the decisions that are best for that use case, and it has gotten the project incredibly far. For long-term or remote storage, the Prometheus project decided to provide explicit support through a set of APIs called the remote write and remote read APIs. These remote APIs let Prometheus act as a bridge between the targets it's scraping metrics from and the external systems that are actually responsible for storing that data. If you're implementing one of these solutions, you pretty much just implement this HTTP API and then configure Prometheus to remote-write to that API or remote-read from it. This greatly reduces the work Prometheus has to do, because it just acts as a bridge between the two systems. It still stores a minimal amount of data — enough to buffer it and make sure it's reliably sent to your remote storage — but again, far less work overall.

There are a lot of different remote storage systems in the ecosystem; the Prometheus docs list the known ones. Some offer both read and write support; some offer just remote read or just remote write. The one that's near and dear to my heart is the open-source project M3, which I've worked on for the past few years along with an incredibly talented and dedicated community. M3 is an open-source metrics platform with all the pieces you need to run a metrics stack: a distributed time series database and index, a query engine, an aggregation tier, and a few other bridge components between those systems. M3 implements both the Prometheus remote read and remote write interfaces, along with some other protocols you might be familiar with like Graphite and StatsD — but obviously the one to talk about here is Prometheus. Thinking back to the challenges that come up when you start scaling your Prometheus usage, the two main goals of the M3 project that make it well suited as a remote storage solution are that, one, it is highly available and, two, it is horizontally scalable. We'll take a look at what each of those looks like in practice.
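Concretely, once you have an M3 coordinator running, hooking Prometheus up to it via those remote APIs is a couple of lines in prometheus.yml. The m3coordinator hostname is an assumption for this sketch; the paths on port 7201 are the remote endpoints the M3 coordinator exposes:

```yaml
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/read"
```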
High availability in M3 is handled by M3DB, the core distributed time series database that sits at the center of the whole stack. M3DB takes care of sharding your data and replicating it across different failure domains. In the example I've put up here, we have a nine-node M3DB cluster spread across three different zones, with three nodes in each zone. M3 has a set of cluster membership algorithms that, when you register an instance, take into account things like the zone the instance is in, how much memory it has, and a few other properties to determine how many shards to assign to it, and the cluster membership APIs take care of distributing those shards across the failure domains. So in this example, with a replication factor of three, each zone holds a full copy of the entire dataset.

Within M3 there's a component called the coordinator, which implements a fat client that handles reads and writes. When a coordinator receives a write, it sends it to all replicas that own that shard and then, depending on the responses, decides whether the write has achieved quorum. Similarly, on reads it fans out to all instances that own the data, then dedupes the results and resolves quorum. An important point here is that this is in contrast to having the database itself handle replication between peers. If you've worked with systems like Cassandra, there's a concept called hinted handoff, where database nodes are also responsible for forwarding writes to other database nodes. Having operated my fair share of those, that adds a tough-to-estimate load to the databases. What's nice about M3's approach is that, with a stateless coordinator handling all of the quorum logic, you can horizontally scale that tier independently of your actual storage tier.

But back to high availability. Like I said, M3DB does quorum reads and writes, which means that in this cluster with a replication factor of three, you only need two replicas available at any given time to serve fully consistent reads and writes back to the client without any errors. So say we lose an entire zone, like AWS did a few days ago in us-east-1. It doesn't matter whether you lose the entire zone or just some nodes in it: clients writing to M3 won't see any read or write errors, and their data will remain fully consistent. That said, if you lose nodes across more than one failure domain — across more than one zone — it's harder to uphold those guarantees, because you no longer have two healthy replicas that can participate in quorum. You can still serve potentially inconsistent reads from the one node that's still up, but once you lose nodes across more than one failure domain, admittedly some guarantees are off.
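To give a feel for how those instances and failure domains get registered, here's a sketch of initializing a placement through the coordinator's placement API, following the shape of the M3 docs. The node ID, shard count, and isolation group values are hypothetical, and only one of the nine instances is shown:

```sh
curl -X POST http://m3coordinator:7201/api/v1/services/m3db/placement/init -d '{
  "num_shards": 64,
  "replication_factor": 3,
  "instances": [
    {
      "id": "m3db-node-1",
      "isolation_group": "us-east-1a",
      "zone": "embedded",
      "weight": 100,
      "endpoint": "m3db-node-1:9000",
      "hostname": "m3db-node-1",
      "port": 9000
    }
  ]
}'
```

The isolation_group is what tells the placement algorithm which failure domain each node lives in. Adding capacity later goes through a similar POST to /api/v1/services/m3db/placement, listing just the new instances.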
So that's an overview of how high availability is accomplished; let's talk about horizontal scale. Think back to the situation I mentioned earlier, where you're emitting metrics from many different services. Over time one of those services grows — it starts emitting more data, or you're getting more traffic — and you begin to overwhelm the capacity of your M3DB cluster. In this case, rather than making any application-level or configuration-level changes, all you have to do is add more nodes to your M3DB cluster. It will rebalance the data, you'll achieve a better balance, and you'll buy yourself a lot more headroom. So in this case, say we triple the number of instances, and we can keep doing that any time this happens.

It's worth talking a little about what that process looks like behind the scenes. I will admit it is not a transparent, automatically-handled-for-you process: it's a user-initiated process to kick off an expansion. There's currently no built-in auto-scaling in the system, and you have to actually assign more instances, whether they're VMs or Kubernetes pods or something else. So there's a non-zero operational overhead, and you have to make sure the cluster has enough capacity. What adding instances actually does under the hood is rebalance data through a process called peer streaming. Say we had that cluster of three nodes with 30 shards in it, so each node owned an entire copy of the dataset, and we started to overwhelm what the cluster could handle and decided to add more nodes. Once you register the new nodes and they initialize with the cluster, they begin peer streaming: the new nodes reach out to the old nodes that own the shards they're taking over, according to the new distribution the cluster membership APIs decided on, and fetch all of that data. Once they've persisted it to disk and confirmed their results are consistent with what the old nodes owned, the old shards are removed from the old nodes, which then drop all the data they no longer own. You've reclaimed a bunch of disk space and expanded your cluster. So that's a little bit about what the data rebalancing process looks like.

I've talked a lot about the write path so far — how you get data into M3 — all of which is, in my opinion, pretty useless if you can't actually get data out of the system. Like I said, there are these two Prometheus APIs, remote write and remote read, and remote read is the other side of the equation: getting data out of external systems through the Prometheus APIs. Implementing remote read is admittedly a little trickier than remote write. With remote write, you pretty much just have to store the data somewhere and acknowledge that you stored it. With remote read, you have to answer more complex questions. When you implement a remote read API, you take a variety of parameters from Prometheus, like the time range of the data being requested and other metadata about how to shape the response. The most important part, though, is a set of label matchers. Label matchers are things like "return all time series where tag A equals value A and tag B matches some regex." Your storage system is responsible for returning all the time series that match that request to Prometheus, and the Prometheus query engine running inside your Prometheus instance then actually crunches the numbers, compiles the response, and returns it to your client — say, a Grafana dashboard you're browsing.
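As a rough illustration, the equivalent of that matcher set written in PromQL selector syntax (with made-up label names and values) would be:

```
# all series where `service` matches a regex and `env` is not "staging"
{service=~"api-.*", env!="staging"}
```

The remote read request carries these matchers plus the time range; the storage system returns the raw matching series, and Prometheus's own engine does the actual evaluation.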
So what does this look like if you're using M3 specifically for remote read? You actually have two options, depending on your preference. One is to use the remote read API, like we have over on the left here: say you already have a Prometheus instance hooked up as a Grafana data source and you don't really want to change that, so you just reconfigure that Prometheus instance to read from your M3 cluster instead. The other option, for convenience's sake, is to hit the Prometheus APIs on the coordinator itself. Rather than routing queries through Prometheus, you can send them directly to the coordinator, because what we've done is embed the Prometheus query engine inside the coordinator process. You're getting the same engine Prometheus would use to process the data; you're just skipping the extra hop. And this is actually pretty important: to make the interoperability more predictable, and to align with the goals of the Prometheus community, we did not want to do some non-standard thing that's a different flavor of Prometheus. We really wanted this to be open-source Prometheus. There's actually a set of PromQL compliance tests that have been run, and M3 — since, as I said, it embeds the same Prometheus engine — achieved full 100% compatibility on them. That means if you had the same exact dataset on a Prometheus instance and on an M3DB cluster and you queried either one, you would get the same results. This concept is being taken a step further now. It's a relatively recent development and I admit I'm not super familiar with it, but I know the CNCF has recently kicked off a PromQL — or more generally, Prometheus — conformance program. I don't have any updates right now on the M3 project submitting its results, because it's pretty new, but it's definitely on the roadmap, I would say.

All right, I've done a lot of talking so far; let's look at a demo of what some of this stuff looks like. What I'm going to show you is two Prometheus instances, each scraping the other, which means we have two data sources in Grafana that we have to query from. Both of these Prometheus instances are configured to remote write to M3, and then we'll see what happens when we query M3. I'm going to be completely honest with you: I have trust issues with conference wifi, and this demo is running on my desktop at home, so it is not live. It is recorded, but recorded recently. So let's switch over to that. Okay, I'm running a Grafana instance here, configured with two different data sources for two different Prometheus instances: on the left Prometheus A, on the right Prometheus B. These are all running inside a Kubernetes cluster, so you can see there's a pod for Prometheus A and one for Prometheus B, running as StatefulSets, each with a unique configuration to scrape the other. It might be a little tough to see, but since these instances are scraping each other, they're exposing stats about the meta scraping process: when I query Prometheus A, it has stats about how long it took to scrape Prometheus B, and vice versa.
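For reference, the M3 data source you'll see in a moment is just a standard Prometheus data source pointed at the coordinator's query port — that second option from above. A sketch in Grafana's data source provisioning format, again with m3coordinator as an assumed hostname:

```yaml
# grafana provisioning file, e.g. provisioning/datasources/m3.yaml
apiVersion: 1
datasources:
  - name: M3
    type: prometheus               # the coordinator speaks the Prometheus query API
    access: proxy
    url: http://m3coordinator:7201
```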
So let's see what happens if we add M3DB into the mix. What I'm doing here is creating an M3DB cluster using the open-source M3DB Kubernetes operator. Under the hood, this creates three different StatefulSets, each corresponding to a failure domain — in a real production environment that would be a zone; on my desktop, each is a Minikube VM. Once the StatefulSets are created, the pods come up and go through an initial process of registering themselves with the cluster and getting up and running — call it bootstrapping. It doesn't take too long, so after a few seconds they should be up and running. And like I said, I have both of these Prometheus instances configured to remote write to M3DB. So if we switch over to the M3DB data source, we can see that from this one single data source we are getting data from both of the Prometheus instances. The main thing here is that whereas previously, if we wanted the metrics for Prometheus A we'd have to hit Prometheus A, and if we wanted Prometheus B we'd have to hit Prometheus B, in this case we just query M3DB, and since both are remote writing, we see a global view of the data. Prometheus was even nice enough to buffer some of the data and backfill it, so when we brought up the cluster we actually got data from a little bit back in time. And you can see here, if I look at the stats from Prometheus A for Prometheus B, you get the same values — the green series on the left and the yellow series on the right match — because Prometheus was processing the same samples and just sending them on. The main point being: it doesn't matter how many of these instances I have or where they're distributed; I get one place to query them all from. So that's it for this quick demo, and back to slides.

So like I said earlier, M3 is just one of many open-source Prometheus remote storage options. There are others you might be familiar with — honestly too many to do an in-depth comparison of every single one — but it's worth talking about some of the properties that make M3 relatively unique in the space. The primary differentiator is that M3 itself stores data on disk, as opposed to offloading storage to some other service like object storage or some other remote API. This has some really great benefits that we'll talk about, the main one being the performance you get from the combination of optimized in-memory index data structures and fast local disks. That said, it also comes with some trade-offs that I'll be upfront with you about. But first, let's look at what happens when you issue a query to M3DB. Say you're issuing a tag query like the one I mentioned earlier: tag A must match some regex and tag B must not equal some specific value. In this case, let's say we're looking at one isolation group, so each of these database nodes owns only a third of the dataset. What's really nice is that the query gets fanned out to all the database nodes, which each do a local computation to figure out their results and bubble them back up — almost like a time series MapReduce.
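To make that query path concrete: a tag query like that, issued against the coordinator's Prometheus-compatible API, looks something like the following. The matchers and time range are hypothetical; /api/v1/query_range on port 7201 is the standard Prometheus endpoint the coordinator serves:

```sh
curl -G 'http://m3coordinator:7201/api/v1/query_range' \
  --data-urlencode 'query={tag_a=~"foo.*", tag_b!="bar"}' \
  --data-urlencode 'start=2021-10-18T00:00:00Z' \
  --data-urlencode 'end=2021-10-18T01:00:00Z' \
  --data-urlencode 'step=30s'
```

From the caller's perspective it's a single request; the fan-out to the individual database nodes happens behind the coordinator.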
But the main thing here is that a lot of the computational responsibility is fanned out to each individual node, and those results get returned to the coordinator, which just crunches the numbers. Generally speaking, the two patterns I've seen with these systems are full-stack storage solutions like M3, where you're actually storing bytes on disk yourself, and solutions that offload data storage to an external service like S3 or GCS. The latter takes a bit more work to get similar performance, since your query instance has to pull down all the data itself. There are workarounds: you can lay out the objects in the object store more efficiently, or shard them and run semi-stateful components that know which objects hold which shards and do the fan-out there. But generally speaking, even with those optimizations, it often won't be as performant as fanning out to database nodes that have the data immediately on local disk, versus going through a remote API.

That said, it comes with trade-offs. I've talked a decent amount about the benefits — performance, full control of your data, and as few external dependencies as possible — but the main trade-off is that with M3DB, you own your data and you are responsible for it. You're responsible for backing it up and for ensuring the cluster has proper capacity. As I mentioned earlier, when it's time to expand, you have to know it needs expanding. You have to give it the right resources to perform well — if you back it with spinning hard disks instead of fast local SSDs, you're probably going to have a pretty bad time — and you carry the general overhead of running it yourself. If you offload the storage to something else, you get the benefit of not dealing with the things I just mentioned; you mostly just pay your bill. That's obviously a bit of an oversimplification, since even those solutions typically still require you to run some stateless compute components, and in some cases somewhat stateful ones, but the general idea stands.

So that's about all I have. I hope this has been helpful. Thank you so much for attending, and I'm happy to answer any questions — or if you just want to say that you hate M3 or something, that's fine too. Let me check the Slack as well to see if there were any virtual questions. All right, I think that's about it. You are now all Prometheus and M3 experts, and thank you for your time.