Hi everyone. Today Nick and I will be presenting about how we scaled the monitoring system at Databricks from Prometheus to M3. Some quick introductions: my name is YY, and I'm a software engineer on the observability team at Databricks.

For those of you who haven't heard of Databricks, we were founded in 2013 by the original creators of Apache Spark. We provide a unified data and AI analytics platform as a service, and we serve over 5,000 customers. We're still a startup, but we've grown pretty big: we have more than 1,500 employees, with more than 400 engineers, and an ARR of $400 million or more. We run across three cloud providers in more than 50 regions, and we launch millions of VMs per day to run data engineering and machine learning workloads, processing exabytes of data per day.

In this talk we'll cover how we monitored at Databricks before M3, then I'll talk about how we deploy M3 at Databricks, including architecture and migration. Then Nick will cover some of the lessons we learned in this process, including operational advice, important things to monitor in an M3 cluster, and how we do updates and upgrades.

So what was monitoring at Databricks like before M3? First I'm going to provide some context about the role that monitoring plays for Databricks engineers. We have two main metric sources. The first is our internal services that run on Kubernetes clusters that we manage in-house. Then we have external services running co-located with the VMs for customer Spark workloads, and these run in customer environments. We've been running a Prometheus-based monitoring system to monitor these targets since 2016, and all service teams rely heavily on the system. Service owners write and emit their own metrics from their services and use these metrics for dashboarding; we use Grafana for dashboarding, and they write their own queries. Most engineers are PromQL literate, and engineers maintain their own alerting rules. Some of the use cases we have are real-time alerting, debugging, SLO reporting and alerting, and automated event response. So, to sum it up, monitoring is a pretty critical workflow for engineers at Databricks to keep their services running smoothly, and it's core to our engineering culture.

This is what our original Prometheus monitoring system looked like in one region. For those of you who are not too familiar with Prometheus, it's a single-node monitoring solution used for metrics and alerting. It uses a pull-based model: it scrapes metrics from other services, stores them as time series data, and can then serve queries and alerts using this data. In each region we have two different Proms for the two types of monitoring targets that we have. The first is Prometheus "normal"; this is the instance that scrapes all the Kubernetes pods local to the cluster. Then we have a Prometheus "proxied" instance for proxied metrics from services that live outside our Kubernetes cluster. In particular, we use a push-based path through Kafka or Kinesis to get these metrics into our metrics proxy service, which runs in our Kubernetes cluster, and this is then scraped by the Prometheus proxied instance. For this Prometheus instance we maintain a whitelist to only ingest some metrics, so that we can reduce metrics volume, since the metrics from our customer environments are the higher-cardinality workloads.
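To make the whitelist idea concrete, here is a minimal sketch of what that kind of scrape-time allow-list can look like in Prometheus; the job name, target address, and metric names are placeholder assumptions, not our actual configuration.

```yaml
# Hypothetical sketch: scrape the metrics proxy and keep only whitelisted metrics.
scrape_configs:
  - job_name: metrics-proxy                  # placeholder job name
    scrape_interval: 30s
    static_configs:
      - targets: ['metrics-proxy:9090']      # placeholder address
    metric_relabel_configs:
      # Keep only metrics on the whitelist; drop everything else to control volume.
      - source_labels: [__name__]
        regex: 'spark_job_duration_seconds|cluster_launch_total'   # hypothetical metric names
        action: keep
```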
The reason we have two separate Prometheus servers instead of just one server per region is scaling limitations: we found that metrics from both of these sources would not fit on a single Prometheus server. Though it's not ideal to have separate servers in one region, there's a logical separation between these two metric sources, so engineers can still function pretty well with a sharded view of metrics per region. Each of these Prometheus servers is set up in a high-availability blue-green deployment so that we can update one of the Prometheus servers in place while the other is still running and serving alerts and queries. We also have a disk attached to each Prometheus server to store metrics.

We have a global Prometheus server which contains a subset of metrics federated from the regional Prometheus servers across all regions. Here we're just using the out-of-the-box federation feature from Prometheus, and we also maintain a whitelist so we only federate the metrics users need when they want an aggregate view of their services across regions.

Users interact with this monitoring system in two main ways. The first is alerting: each of these Prometheus servers evaluates alert rules and issues alerts to Alertmanager, and Alertmanager forwards the alerts to other channels like PagerDuty or Slack. The second workflow is querying and dashboarding: users can query the regional Prometheus servers to view metrics from their services, or any metrics underlying alerting incidents. Typically users mostly interact with just the regional Prometheus, but when they need a global view of services across regions, they use the global Prometheus server.

Here are some numbers to give you a picture of the scale we were running this Prometheus monitoring system at. We ran in more than 50 regions across different cloud providers, and we monitored 100-plus microservices with an infrastructure footprint of 4 million-plus VMs of Databricks services and customer Apache Spark workloads. And because of our architecture, where a single Prometheus server handled all the metrics from the customer environments, that server was huge: at its peak it was handling close to a million samples per second, it had a pretty high metric churn rate with a lot of metrics from short-lived Spark jobs persisting for less than 100 minutes, and its disk usage was pretty high at four terabytes for only 15 days of retention. We were also running it on a really big machine with 64 cores of CPU and two terabytes of RAM.

At this scale we found our Prometheus system extremely difficult to operate, and we found that it also impacted the user experience. For us as operators, we had to handle a lot of capacity issues like frequent OOMs and high-disk-usage situations where we had to resize the disk, and we had to deal with multi-hour Prometheus updates because of the long write-ahead log recovery process during restarts. On the user-experience side, users had to deal with a sharded view of metrics; they had to be aware of which metric source they wanted to query, whether it was from normal or from proxied. Users also had to deal with query slowness: big queries would take a long time, sometimes they would never even complete, and sometimes they would even cause the Prometheus server to OOM.
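For reference, the out-of-the-box federation setup described above looks roughly like this on the global Prometheus; the match expressions and regional endpoints are illustrative assumptions rather than the exact whitelist we ran.

```yaml
# Hypothetical sketch: global Prometheus federating a whitelisted subset from regional Proms.
scrape_configs:
  - job_name: federate-regional
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"slo:.*"}'            # hypothetical: pre-aggregated SLO series
        - 'up{job="critical-service"}'      # hypothetical: availability of key services
    static_configs:
      - targets:
          - 'prometheus-us-west.example:9090'     # placeholder regional endpoints
          - 'prometheus-eu-central.example:9090'
```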
We also had to build a custom querying UI, since the out-of-the-box Prometheus UI was too slow to handle high query volume. Users also had a shorter retention period: they could only see metrics from the last 15 days instead of the last 90 days, which would have been ideal, and they couldn't really see metrics spanning different release cycles, since our release cycle is only two weeks. Users also had to deal with the metric whitelist: they could only include a small subset of metrics in the whitelist, and occasionally, when we as operators were dealing with capacity issues, we would even have to actively remove metrics from the whitelist just to keep our Prometheus server running.

With these scaling bottlenecks and pain points, we really needed to find a more scalable monitoring solution. Some of our requirements: the system needed to handle high metric volume, cardinality, and churn rate, since Databricks was growing rapidly and we needed a system that could keep up. We also wanted 90-day retention so that engineers can monitor their service release over release. And, importantly, we needed it to be PromQL compatible, since everything was already built on top of Prometheus and we didn't want to migrate user workflows like alerting rules, queries, or dashboards to another system and teach users a different language. We also needed the system to have a global view of metrics so that engineers could get an aggregate view of important service metrics across all regions. And finally, the system needed to be highly available.

Here's what we considered nice-to-haves. First, ideally the system would have a good update and maintenance story without some of the pain points we experienced with our Prometheus setup, so no metric gaps during updates and fewer manual operations for updates. It would also be nice if the system had been battle-tested in a large-scale production environment. It would also be good if the system was open source, so that we'd have more freedom to manage it on our own and make it more suitable for our metric workload, and we felt there was more transparency into the cost of running an open source monitoring system.

Some of the alternatives we considered in mid-2019: first, we considered sharding Prometheus even more, but we were hesitant about this because it was more of a do-it-yourself situation and not an out-of-the-box scaling solution. We considered some open source solutions like Cortex and Thanos. We prototyped Thanos in late 2018; it wasn't that mature back then and we weren't really comfortable using it at our scale in production. We also prototyped Cortex, and we found that it wasn't really suited for metric workloads with a pretty high churn rate. We also considered some hosted solutions like Datadog and SignalFx, but they were too expensive.

Out of the alternatives we considered, why did we pick M3? M3 fulfilled all our hard requirements: it was designed for large-scale workloads and is horizontally scalable, it exposes Prometheus-compatible query endpoints, it's set up to be highly available with a multi-replica setup, and it's designed for a multi-region, multi-cloud setup, which is perfect for our use case and gives us a built-in global view. Another important point was that it had been battle-tested at high scale at Uber, in a production environment where it had been running for a couple of years.
An added bonus was that M3 has a Kubernetes operator for automated cluster operations, like scaling or updating the cluster. We also thought there were some pretty cool features that we'd be interested in using in the future, like aggregation of metrics on ingest and downsampling of metrics, which could potentially allow us to have a longer retention period.

Now I'm going to cover how we deploy M3 at Databricks, specifically the different decisions we made along the way and how we arrived at our final architecture. The initial plan was for M3DB to be a drop-in replacement for Prometheus' local disk storage. Instead of Prometheus servers storing metrics on an attached disk, they would remote-write the metrics into M3DB for storage. Prometheus servers would still evaluate alert rules; they would remote-read metrics from M3DB to evaluate these rules and then forward the alerts to Alertmanager. This setup would result in some improvements. First, since all metrics would be remote-written into the M3DB storage cluster in the region, we would get rid of the sharded view of metrics in one region, so users could have a consolidated view of both Prom normal and Prom proxied rather than having to query both separately. We would also not have to maintain a global Prometheus server anymore, since M3 has a global query feature and can connect with M3 clusters in other regions to provide a global view of metrics across all regions.

Though this architecture was really simple and would have been the least amount of work to incorporate M3DB into our system, we did find some trouble with remote writing from the Prometheus servers. Specifically, we couldn't remote-write at the scale that we needed, especially for the Prometheus proxied instance, which was handling all the higher-cardinality metrics from our customer environments. So now our question was: how do we scale up the write path successfully? To make it work, we needed to move away from having two giant Prometheus servers and instead make the metric remote-writing path a lot more lightweight, handled by many small components rather than fewer larger components.

The solution to this is the Grafana Agent, which is open-sourced by Grafana. It's a lightweight, Prometheus-compatible component that solely scrapes and remote-writes metrics. There are many instances of this running in our cluster, and we shard the scrape targets and assign them across the Grafana Agent instances. This makes it really easy for us to scale the metric scraping and remote-writing path: to do that, we just need to increase the number of agent replicas and reassign the scrape targets accordingly. For metrics coming from outside our Kubernetes cluster, from our customer environments, which were proxied through our metrics proxy service, we added the remote-write protocol to the metrics proxy service itself so that it directly remote-writes metrics into M3DB. This saved us from needing another set of Grafana Agents scraping that service, and it overall reduced the number of hops and the end-to-end latency of the metrics being proxied from outside our Kubernetes cluster.

We solved the challenge of making the write path scalable, but we still needed a way to evaluate alerting rules, since we weren't using Prometheus servers anymore.
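A rough sketch of what one of these lightweight scrape-and-remote-write agents can look like, assuming the Grafana Agent's static-mode config (newer versions use a `metrics:` block; older ones used `prometheus:`) and simple hash-based target sharding; the shard counts, selectors, and coordinator address are assumptions.

```yaml
# Hypothetical sketch of a Grafana Agent config that scrapes one shard of the
# targets and remote-writes straight into M3 via the write coordinators.
metrics:
  wal_directory: /var/lib/agent/wal
  global:
    scrape_interval: 30s
  configs:
    - name: kubernetes-pods
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Simple hash-based sharding: this replica keeps only its slice of targets.
            - source_labels: [__address__]
              modulus: 8                # total number of agent shards (assumption)
              target_label: __tmp_shard
              action: hashmod
            - source_labels: [__tmp_shard]
              regex: '3'                # this replica's shard index (assumption)
              action: keep
      remote_write:
        # Placeholder coordinator address; /api/v1/prom/remote/write is M3's
        # Prometheus remote-write endpoint on the coordinator.
        - url: http://m3coordinator-write.example:7201/api/v1/prom/remote/write
```

Scaling the write path then amounts to raising the replica count and the shard modulus together.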
Unfortunately, M3 doesn't have an out-of-the-box rule evaluation engine; it mainly serves as the metric storage database for writing to and querying from. This led us to build our own rule engine, which was in the spirit of designing an architecture with more lightweight and scalable components, each serving a narrower purpose. For this we ripped out the rule management code from open source Prometheus and deployed it as our Prometheus-compatible rule engine. We load our original Prometheus monitoring system's alerting rule configurations into it; it issues alert rule queries to M3, M3 evaluates the query, and then it returns the query result back to the rule engine. The rule engine does some extra processing, for example checking the "for" duration of the alert and adding any extra external labels, and then it issues the alert onward to Alertmanager.

So we've covered the components we set up to interact with M3: the scrape agents, the metrics proxy service with remote writing, and the rule engine. Now I'm going to focus more on what our M3 setup looks like. A basic M3 setup will typically have two main components: the storage cluster and the M3 coordinators. The storage cluster consists of multiple replicas, in our case three, and each replica has multiple pods. Each pod has a disk attached to store metrics; to scale up the cluster, we just increase the number of pods in each replica. Then we have the M3 coordinators, which allow us to interact with the storage cluster to read and write metrics. The coordinators have M3 Query embedded in them. For write requests, a coordinator receives the requests, unpacks them, and writes them into all replicas of the storage cluster. For read requests, it receives the request, fetches metrics from the storage cluster, evaluates the query, and returns the query result back to the client. We also run the M3DB operator. This is optional in an M3 system, but we use it because it's really useful for a Kubernetes setup: it automates scaling the cluster up and down, and it automates deleting and creating the storage cluster, so we don't have to manually manage the three different replicas.
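As a sketch of the operator-managed setup described above, an M3DBCluster resource for the m3db-operator looks roughly like this; the sizing, namespace preset, image tag, and etcd endpoints are illustrative assumptions, not our production values.

```yaml
# Hypothetical sketch of an m3db-operator managed storage cluster.
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: metrics-cluster
spec:
  image: quay.io/m3db/m3dbnode:latest      # pin a real version in practice
  replicationFactor: 3                     # three replicas, as described above
  numberOfShards: 256
  etcdEndpoints:
    - http://etcd-0.etcd:2379              # placeholder etcd endpoints
  isolationGroups:                         # one group per replica/zone
    - name: group1
      numInstances: 6                      # pods per replica; scale by raising this
    - name: group2
      numInstances: 6
    - name: group3
      numInstances: 6
  namespaces:
    - name: default
      preset: 10s:2d                       # built-in preset; see the operator docs
  dataDirVolumeClaimTemplate:              # attached disk per pod
    metadata:
      name: m3db-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 500Gi
```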
One issue we had initially with this basic setup was that the M3 coordinators had a lot of noisy-neighbor issues. For example, if a user submits a heavy query, it might take up all the CPU and memory on the coordinator, which might impact the write path and cause data loss, or impact the rule evaluation workload from the rule engine and affect the availability of alerts. So, to make our M3 deployment more stable and have more isolation between the writing, querying, and alert evaluation workloads, we decided to create separate deployments of coordinators for different purposes. The M3 read and write workloads are quite different for the coordinator: writing is more CPU intensive, since coordinators need to handle and unpack many incoming write requests, while reading is more memory intensive, since coordinators need to fetch and cache data in order to evaluate and serve queries. We created four different groups of coordinators. The first group handles writing; it consists of many small replicas of write coordinators that handle incoming write requests from our scrape agents and the metrics proxy service. Then we have a group designated for the rule engine; this handles a regular and more predictable workload of querying for rule evaluation, since the rule engine just submits the same set of alert rules at regular intervals to M3. Then we have two groups to handle ad hoc querying. The first is the regional group, which returns query results based on the metrics in the regional cluster. The second is the global query group, which reads from M3DB clusters across regions and provides a global view to our users. We wanted to separate the regional and global coordinator groups because the global group's configuration is really different: it requires setting up connectivity across clusters and has different security configurations. More importantly, our users mostly only need the regional view of metrics, and we knew that if we just exposed the global view, it was unlikely that users would make the extra effort to always add a region label filter to each query. So we wanted to provide a default region-only query endpoint to avoid creating extra cross-region traffic and costs for no reason. Now that we've separated the coordinator groups, our two most important stable workloads, the write path (where stability is important to prevent data loss) and the rule evaluation path (which is highly critical so that we always have alerts), are isolated from the more spiky and unpredictable workload of ad hoc queries, which is subject to bad behavior from users.

The final challenge was how to monitor all of these M3 components, since there are so many of them, while still avoiding a circular dependency where we'd be using the same system to monitor itself. For this we decided to set up a vanilla, lightweight Prometheus server. This Prometheus server only scrapes M3-related components, has a disk attached, and its retention is only a couple of hours. It's very easy to maintain, since we consider it to be effectively stateless and restarts happen really quickly. The metric retention period is short, but it's sufficient for us since we mainly use it to alert us if any M3 components are down; we rarely need to look back over long periods of metrics for this Prometheus server to issue alerts. It issues alerts straight to Alertmanager and is completely independent from the M3 system. We also have a global M3-monitoring Prom for a longer-term view of our M3-related metrics, for example to track disk usage, memory usage, or the number of reads per second. We use the Prometheus federation feature here to federate all metrics from the regional M3-monitoring Proms so they're persistently stored in this global Prom. This global Prom has a disk attached, and we also run it with the blue-green high-availability setup. The main reason we use Prometheus here, instead of running a separate M3 cluster to monitor M3, is that Prometheus is a really good all-in-one, out-of-the-box solution for monitoring, especially at small scale; it would be overkill to set up a separate M3 cluster with all of its small components just for this use case. Also, we're more familiar with Prometheus, having used it for a couple of years now, and Prometheus has been around longer in the community; we wanted to monitor M3, which is a relatively newer system, with something we're comfortable with.
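Here is a minimal sketch of what such a "monitor the monitor" Prometheus can look like, assuming a short retention flag and a pod-label selector for M3 components; the labels, addresses, and alert routing target are placeholders.

```yaml
# Hypothetical sketch: a small Prometheus whose only job is to watch M3.
# Run with something like: prometheus --storage.tsdb.retention.time=6h --config.file=...
global:
  scrape_interval: 30s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example:9093']   # placeholder Alertmanager
scrape_configs:
  - job_name: m3-components
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only M3-related pods (the label selector is an assumption).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: 'm3db|m3coordinator|rule-engine'
        action: keep
```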
Now that I've covered how we deploy M3 and our architecture, I'm going to share how we did the migration from our Prometheus system to the M3 system. There are four main steps in our migration. The first step was bringing up the whole M3 system as a shadow deployment. We dual-wrote metrics to both Prometheus and M3 storage, and we evaluated alerts in both the old Prometheus system and the new M3 system; we just sent all the alerts from the new M3 system to a black-hole receiver that didn't fire alerts to any real receivers. We also opened up a querying endpoint, but only internally to the observability team and not to the rest of engineering, so that we could do some behavior validation. The second step was that behavior validation: we compared alerts across the two systems using scripts, and we also did some query speed and correctness checks by comparing our more well-used, complex dashboards side by side. The third step was an incremental rollout of querying traffic and alerts. For ad hoc querying traffic, we staged it across environments and did a percentage-based rollout of traffic from Prometheus over to M3. For alerts, we did a per-service migration: we replaced alerts emitted by Prometheus with alerts emitted by M3 for less critical services first, and then moved on to more critical services in the later stages of the rollout. The final outcome of this rollout is that all ad hoc querying traffic and alerts are served from the M3 system, and we can safely deprecate Prometheus.

Here's a diagram to illustrate how we did the rollout of the ad hoc querying traffic. It's a pretty simple setup: we just put an NGINX in front of the query endpoints of Prometheus and M3 and split the query traffic across both, and over time we slowly increased the percentage of traffic directed to M3. This was pretty useful for giving us an idea of how ad hoc querying traffic affects utilization in the M3 system, and we did end up doing some tuning of query limits and provisioning extra resources up front as we went, to make sure the rest of the rollout would be smooth.

And this picture highlights how we did the per-service migration of alerts. Our alerts at Databricks are emitted with a service label, which denotes the service and the owning team they belong to. In addition to this label, we added a source label to indicate whether the alert was emitted by the old Prometheus system or the new M3 system. Then we made some routing configuration changes in Alertmanager so that when we wanted to roll out M3 alerts for a less critical service first, the M3 alerts from that service would go to the real receiver, while the equivalent Prometheus alerts for that service would go to the black-hole receiver that doesn't alert anyone. We found this rollout strategy to be really nifty because it was all controlled in Alertmanager at the routing level, and Alertmanager can hot-reload new config quickly without a restart. So it was very easy to advance the rollout forward, but more importantly, it was easy to roll back if we found any issues, which matters since alerting is a highly critical service and rollbacks should be able to happen quickly.
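A hedged sketch of the kind of Alertmanager routing that makes this per-service cutover work; the service name, receiver names, and PagerDuty key are placeholders, and the real routing tree was certainly more involved.

```yaml
# Hypothetical sketch of routing on the alert "source" label during migration.
route:
  receiver: blackhole                 # placeholder default; adjust to your real default route
  routes:
    # Migrated service: page on the M3-emitted alert, drop the Prometheus copy.
    - match: { service: featurestore, source: m3 }          # hypothetical service name
      receiver: team-pager
    - match: { service: featurestore, source: prometheus }
      receiver: blackhole
    # Everything not yet migrated: keep paging from Prometheus, drop the M3 copy.
    - match: { source: prometheus }
      receiver: team-pager
    - match: { source: m3 }
      receiver: blackhole
receivers:
  - name: team-pager
    pagerduty_configs:
      - routing_key: REPLACE_ME       # placeholder
  - name: blackhole                   # no notifier configs: alerts routed here go nowhere
```

Because Alertmanager hot-reloads this file, advancing or rolling back the migration is just a routing change, with no restarts.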
The outcome of this one-year migration is that M3 now runs as the sole metrics provider in all environments across clouds. The global querying endpoint is available via M3 for all metrics; this is still in beta, so we're still testing it and rolling it out. The user experience is largely unchanged: our users still use PromQL for alerting rules and dashboards, and we still use Alertmanager for all our alerts. Retention is now 90 days across all regions, unlike before, when our large Prometheus server had only two weeks of retention. Overall, we think the migration went pretty smoothly; we avoided any major outages since we had a good rollback strategy. Now we have much higher confidence that we can continue to scale the system in the upcoming years, since Databricks is continuing to grow rapidly as a company and will keep processing larger and larger workloads. And most importantly, the observability team doesn't have to deal with a giant Prometheus server anymore that runs on two terabytes of RAM and takes multiple hours to restart. Okay, I'm going to hand it over to Nick for lessons learned.

Yeah, thanks YY. I was having some sound issues earlier; hopefully it sounds okay now. Thanks, everybody. Cool. So, I'm going to cover some of our lessons learned in operating M3 over the past year or so. Some of the things I want to talk about are the system metrics you should be looking at if you're monitoring M3, some general operational advice, some things we found really helpful to alert on, and a little bit about how we do upgrades and updates. I want to give a brief overview first, because when you're talking from the trenches, the perspective can sometimes sound negative, since I will be talking about issues that we've run into. But I want to emphasize at the start that overall, M3 has been amazingly stable for us. I talked about how much trouble we had operating Prometheus; it was a constant source of alerts and trouble for us. We operate more than 50 different deployments of M3 and it's just really stable. We have a few places where we've run into issues, and I'll be talking about those; those are obviously the ones at the highest scale, where you're really pushing against the limits of what we can do. But overall, it's been an extremely stable thing for us, and honestly the biggest problem we have is just that we keep running out of disk space in places because our metric load keeps going up. And that's obviously not M3's fault; that's on us for needing to be better about how we deal with incoming metrics.

So that's the positive side, but we have had some problematic things, and I'm going to dive into those, because dealing with problems is obviously an interesting thing to hear about. But just a little bit more about how it runs: like I said, we have a large number of clusters, so we have to automate things. We use a combination of Spinnaker and Jenkins to do template applies to update things; that's where having the operator is really nice, because it makes it pretty easy to do those updates. In our bigger clusters we process close to a million samples per second and about 200,000 reads per second, so we are more write-heavy; that's definitely the workload that we have at Databricks.

Cool, so I wanted to jump into the top-level things that we found most important to keep an eye on while you're operating. We look at how much memory is being used in the M3DB pods.
We have seen that if you're steadily over 60% memory usage, that can be bad, mostly because certain things can cause memory spikes, and if you're consistently over 60%, those spikes can get you all the way up over 100%. It's nice that because M3 is distributed and highly available, if only one pod OOMs it's not a big deal: it recovers, nobody even notices, and we don't even get alerted when that happens to a single one. But if all of your pods are consistently over 60%, you have a good chance of multiple ones OOMing, and then things can be bad. So how can you resolve being steadily over 60%? You can scale up your cluster, or you can reduce the incoming metric load; those are the two primary ways we've found. Obviously, if you're in a more read-heavy workload, you may need to do something like reduce the amount of reads that are happening. One thing I'll mention here: the new version of M3 has all these nice limits for reading and writing, which are a great additional way to put a cap on memory use. We've already set those, so it's not how we resolve these situations, but if you haven't set the limits, that would be another way to reduce memory usage. I've included the queries here; you obviously don't need to memorize them, I've just put them in as a reference for the way we look at these things. So we look at this particular metric filtered for our pods.

We also alert on disk space. Like I mentioned, this is a problem for us, just because as things grow, the cluster gets bigger and bigger and your disks can fill up. We use predict_linear to look at disk usage; disk space usage is actually very easy to predict, so it's pretty accurate, and we alert very early, mostly to give ourselves lots of time to react. There are other things going on, and sometimes it takes a little bit of time to get to it. Also, we've found that scaling up can take a significant amount of time in the really big regions, so it's nice to give ourselves enough time to deal with it. Again, the ways you can fix running out of disk space are scaling up the cluster, reducing retention (which will obviously free up some disk space), or reducing the incoming metric load. And again, here's the query.

Cluster scale-up can be slow. There has been a good amount of work on improving the bootstrapping time in the newer versions of M3; we are a little bit behind on the update schedule, so we're hoping to see some improvements in our cluster scale-up time as we upgrade. But I would encourage you, if you're operating M3, to do some testing around how long a cluster scale-up takes, so that you can set your thresholds, know how far in the future you need to alert for these kinds of things, and know how to react to them.
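As a sketch of the memory and disk alerts described above, written against generic cAdvisor/kubelet metrics rather than the exact queries on the slide, with thresholds and label selectors as assumptions to adapt:

```yaml
# Hypothetical Prometheus-style rules for M3DB capacity; selectors are placeholders.
groups:
  - name: m3db-capacity
    rules:
      - alert: M3DBHighMemory
        # Working-set memory steadily above 60% of the container limit.
        expr: |
          avg_over_time(container_memory_working_set_bytes{pod=~"m3dbnode.*"}[30m])
            / container_spec_memory_limit_bytes{pod=~"m3dbnode.*"} > 0.60
        for: 1h
        labels:
          severity: warning
      - alert: M3DBDiskFillingUp
        # Disk predicted (linearly) to run out within two weeks; alert this early so
        # there is plenty of time to scale up, reduce retention, or shed load.
        expr: |
          predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"m3db.*"}[6h], 14 * 24 * 3600) < 0
        for: 6h
        labels:
          severity: warning
```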
So those were probably the most important things that we alert on; I'll get to some of the other, smaller things in a little bit. But I also wanted to give a little bit of general advice that we've accrued over our time operating M3. One thing is: try not to add a lot of custom annotations, labels, or configs on top of the deployments. The operator does have certain expectations about the stuff it's going to deploy, and we've shot ourselves in the foot a few times by trying to do weird custom things, and then the operator can get confused, or we can get confused. Anyway, if possible, avoid doing lots of custom things; let the operator just do its thing. Like I mentioned, do observe your query rates and set limits: look at how your cluster is being used in a good state and set limits so that you can prevent a bad state from occurring if a giant query comes in, or things like that.

One thing that I think we waited much too long to do at Databricks was to have a really good testing environment. We rolled this out in all of our clusters and it was working pretty well, but a monitoring system is something that everybody relies on all the time, which means that even our dev clusters, which are development environments for other engineers, are actually quite important for us to monitor well, because people care about observing how their test clusters are running. So we needed sort of a dev-dev, I guess, an M3 dev cluster, which we now have, but it took us too long. It's important to have a place where you can quickly iterate on rolling out new versions and testing load. It's important to be able to just throw away the data there, so that if you're testing some stuff out and it doesn't work, you're not scared of ruining your data. And I think it's also important to try to have that at scale. This is non-trivial, you know, having load generation and such, because if it's truly dev-dev you're not going to have a lot of stuff running there naturally. So there is some work in doing that, but I think it's very valuable to have, so that you can be aware of how your production clusters are going to behave without testing in production, because only testing high load in production is not a recipe for great success.

I'd also encourage you to have a look at some of the M3 dashboards that are out there and learn what these metrics mean; it can be really helpful. As Rob mentioned, M3 has a ton of features, and as a metrics system it also has a lot of its own internal metrics. The dashboard I've linked here is the one that the M3 web page mentions. I would say that the one linked on Grafana.com is somewhat developer-focused; I think it's built to help people who are working on M3 understand and debug issues, and it's useful for that. But I would suggest looking through it, understanding which of the metrics are more useful from an operational perspective, and making a dashboard with your own key metrics. I'm not going to cover all of this; it's here for future reference, but this is what one of our internal dashboards looks like. We look at a lot of the more high-level things that show how CPU and memory are doing. You can see in that memory panel how it goes up and down a bit, but we try to keep it at about the 50% to 60% level of what's available.
I'm not going to cover all the other stuff that's on here, but basically these are some of the things that we have found really useful for understanding what's happening from a more operational perspective, as opposed to the internal perspective of a developer working on M3.

Cool, so a few other things that we alert on that are worth looking at. We alert on high latency ingesting samples; this is the coordinator ingest latency bucket metric. That can mean that you're getting backed up in writing samples, or that there's some other problem with the incoming metrics. We do try to filter the incoming metrics and prevent them from coming in late, but I will say that although the Grafana Agent has been pretty good for us, one problem it does have is a tendency to sometimes try to write old data; it's just hard to get it to not do that. So you can sometimes see spikes in this from that, and then you need to go kick the agent to make it stop.

We look at the rates of both write errors and fetch errors. These are good ones to watch, mostly because they represent the user perspective: as a user of M3 trying to do something like write a metric in, or trying to issue a query, am I seeing errors? There can be all kinds of other errors happening under the covers, but if these two metrics are steady, then from a user perspective you're kind of okay; you're meeting your SLO. So we monitor those, and they issue high-priority alerts: if you're getting write errors, something's bad, you're not able to write metrics into the cluster; if you're getting fetch errors, something's bad, you're not able to query the cluster, and your users will be seeing issues.

I also wanted to mention this next one because for us it can be a really big issue, though it may not be, depending on your deployment: we look at how many out-of-order samples are being written. The reason we do this is that if you are double-scraping pods or services, that can cause all kinds of craziness in your metrics, because they can bounce around and the counter semantics of Prometheus make it think that crazy stuff is happening. And because we operate a largely pull-based architecture, this can cause a lot of false alerts for our customers. It is a little bit of a tricky one to monitor, because some amount of out-of-order merging is expected and doesn't mean there's a problem, so I've put an X in for the threshold here; you'll need to look at how it behaves in your own cluster. Another thing to be aware of is that during node startup there can be a bit more out-of-order merging, just because some data may have built up and then things are not coming in directly in order, so you'll want to inhibit this alert during node startup. But in normal operation you should be able to get a good baseline for what your out-of-order rate is, and if that spikes, it can be indicative that you've somehow messed up the configuration of your scraping. Yeah, so that's another one that has bitten us.
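A sketch of what these user-facing alerts can look like as Prometheus-style rules. The ingest-latency histogram name comes from the talk; the write/fetch error counter names are placeholders, so substitute whatever error metrics your M3 version actually exposes, and treat the thresholds and units as assumptions.

```yaml
# Hypothetical rules for the "user perspective" signals discussed above.
groups:
  - name: m3-user-facing
    rules:
      - alert: M3HighIngestLatency
        # p99 ingest latency on the write coordinators; threshold/units are assumptions.
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(coordinator_ingest_latency_bucket[5m]))) > 10
        for: 15m
        labels:
          severity: warning
      - alert: M3WriteErrors
        # Placeholder metric name: rate of failed writes as seen by clients.
        expr: rate(m3_write_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
      - alert: M3FetchErrors
        # Placeholder metric name: rate of failed queries as seen by clients.
        expr: rate(m3_fetch_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
```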
Great. So, talking a little bit about upgrades and updates. I think Rob mentioned that M3 really focuses on having good forward and backward compatibility, and we have experienced that: we are not scared of doing updates to new versions, and we have not really seen any issues. I only mention this because it's literally the only issue we've run into: there was a tiny query evaluation regression, where I think they changed something in the query engine. It was not a big deal and was fixed relatively quickly. We've been running M3 for a while, through a lot of the early pre-1.0 releases, so there was a relatively large number of updates we've gone through, and it's been very smooth for us, which has been great. We are now moving to 1.0 slowly throughout our clusters, and that has also been very smooth. Rob mentioned that there were some config changes and made it sound like a big deal, but honestly there weren't that many. We have a pretty complex programmatic system for generating our configs, so it was pretty easy for us; I suppose if you have manual configs it might be more work, but for us it really wasn't a big deal to update to 1.0. One thing to be aware of: there were some API changes. This is probably only relevant if you've built up institutional knowledge around M3 already, but for a lot of our runbooks we have to go through and change some of the API paths that we tell people to hit, for things like changing retention or updating placements. That's advanced usage, so as a normal user of M3 you probably won't run into any of that.

We manage all of our upgrades and updates via Spinnaker and Jenkins. The one minor issue here is that, up until now, there has been a lack of fully self-driving updates. We're very much a bought-in Kubernetes shop, and we count on our pipelines being able to do updates by just calling kubectl apply on a new template. This was not fully working until recently, but as Rob mentioned, with the 0.13 version of the operator this should now be available. That's a relatively recent release and we have not had time to fully test it, but we do believe we'll be able to move to this in the near future. One thing I do want to mention: if you're not doing fully self-driving updates, where you're just applying and then having to do some kind of manual intervention to get the operator to do the update, you have to be vigilant that the configs (which can be updated by just calling kubectl apply) stay in sync with the M3DB version. We've had some issues where our pipeline goes in and deploys a new version specification and a new config, but the old version is left running, and then when it restarts it doesn't understand the new config. So that's one thing to be careful of.

And one suggestion I have, which is probably generally good but which we found really important during the upgrade process, is to have a readiness check for your coordinators: try to make sure that your coordinators, as they come up, are able to talk to M3DB. I'm not going to cover update strategies for Kubernetes, but if you're familiar with Kubernetes, you want a rolling update strategy that doesn't restart too many coordinators at once.
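A minimal sketch of a write-coordinator Deployment with such a readiness check, assuming the coordinator serves a health endpoint on its default HTTP port (7201); the path, image, and sizing are assumptions to adjust for your setup.

```yaml
# Hypothetical coordinator Deployment: readiness gate plus a conservative rolling update.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: m3coordinator-write
spec:
  replicas: 50
  strategy:
    rollingUpdate:
      maxUnavailable: 1          # restart coordinators one at a time
      maxSurge: 1
  selector:
    matchLabels:
      app: m3coordinator-write
  template:
    metadata:
      labels:
        app: m3coordinator-write
    spec:
      containers:
        - name: m3coordinator
          image: quay.io/m3db/m3coordinator:latest   # pin a real version in practice
          readinessProbe:
            httpGet:
              path: /health      # assumption: adjust to the health path your version exposes
              port: 7201
            initialDelaySeconds: 5
            periodSeconds: 10
```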
What we found is that these coordinators are super lightweight, which is great, and we run a whole bunch of them. But that means that if they all restart quickly, which is what happens if you don't have a readiness check, there are so many services restarting that Kubernetes has a little bit of trouble dealing with the churn, and it can leave a number of the coordinators unable to connect to M3DB, just because it hasn't had time to update all the endpoints and make the various service updates it needs to. Then you have a little bit of downtime, and because the Kubernetes control loop is not super fast, it can actually take a while before it churns through and updates everything. Having a readiness check that verifies connectivity from the coordinator enforces that the coordinators restart slowly and re-establish their connection before the next one goes down, and you can have zero-downtime updates. Cool, so that's what I wanted to mention about upgrades and updates.

And just a little color about metric spikes: in any high-volume system that you operate, you're going to have to have a way to deal with spikes. An example of something we sometimes see is that somebody goes and adds a new label to a metric and it has an absurdly high cardinality, so suddenly your number of time series goes up by a huge amount. It happens; you don't control all the services that get deployed, and they can do crazy things. A great way to deal with this is, first, to be able to identify where it's coming from. So have some metrics you can look at that cover this; the Grafana Agents can expose this, and we have our own metrics inside our other systems that push directly into M3. Always have some metric that you can use to see who is producing all of these series. And then also be able to cut that source off easily; we have good ways of quickly blacklisting things. This is extremely preferable to OOMing your cluster: I'm much happier being able to go to an internal customer and say, "hey, your service currently doesn't have any metrics because you did something silly," rather than having to send an email out to the entire company saying, "hey, our metrics system is down in production right now." That's not a message you want to send. So yeah, that's something I would recommend.
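As an illustration of cutting off one offending source at the scrape layer (inside the relevant scrape job's config), with purely hypothetical metric and label names:

```yaml
# Hypothetical sketch: drop a runaway metric while the owning team fixes it.
metric_relabel_configs:
  # Drop the offending metric entirely.
  - source_labels: [__name__]
    regex: 'request_duration_seconds_bucket'     # hypothetical offender
    action: drop
  # Or drop only series that carry the high-cardinality label.
  - source_labels: [__name__, user_id]           # hypothetical high-cardinality label
    regex: 'api_calls_total;.+'
    action: drop
```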
So, just a little bit about how we do capacity. Obviously it's going to depend a little bit on your workload, but we've found that about one M3DB replica per 50,000 incoming time series works pretty well for us. We are very write-heavy, though, so that may be different depending on how your cluster looks. I saw a question about how many nodes we use on that big cluster: we're currently running about 18 nodes per replica, so you can do the math for how many we have all together. And we run about 50 write coordinators, and across two different deployments about 100 total. Like I said, we're happy to run lots and lots of these for stability; we probably could get away with fewer, but this gives us the buffer that we need, and it's okay. These are just numbers that we've come to organically; it would be nice if I had a formula like "this much incoming load needs this many nodes," but I think it's partially just testing, because it will depend on your workload.

Cool. I did want to talk a little bit about some of the things we want to do in the future. I would say at this point we have fully completed our migration away from Prometheus; we are removing Prometheus everywhere, since we don't need it anymore, aside from the lightweight monitoring instances that YY mentioned. We're now getting to the point where we can look towards the future and the nice new things that we can do. Previously we were in this bad state where Prometheus kept falling over and everything was unhappy; we've now had a very successful, if somewhat long, migration, and we have a stable set of monitoring clusters running. So now we can start to look forward and say, hey, now that we have this nifty new M3 thing, what can we do?

Some of the things we're looking to do in the future: start downsampling our older metrics. We expect to see significant savings in disk space there, and probably also in query performance over older data. We run with a 30-second scrape interval, and it's a little bit unreasonable to expect that when you query data that's 60 days old, you'll get it at 30-second resolution. We will also be looking at using different namespaces for metrics that have different retention requirements. Like I mentioned, we run a lot of test things in our dev clusters, and it's a bit unreasonable to expect that those test clusters will have really long retention on their metrics. A really nice feature of M3 is that you can put those metrics in a different namespace with a shorter retention, so your test clusters get five days or whatever of retention, and for things that matter more, you can have longer retention. Again, this is about needing less disk space and putting lower load on the system.
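A sketch of the coordinator-side namespace configuration that enables this kind of tiering, with an unaggregated namespace for recent raw data and an aggregated, downsampled namespace for longer retention; the namespace names, retentions, and resolution are assumptions, and the surrounding cluster/client config is omitted.

```yaml
# Hypothetical fragment of an M3 coordinator config with tiered namespaces.
clusters:
  - namespaces:
      - namespace: default
        type: unaggregated
        retention: 720h          # e.g. 30 days of raw, 30s-resolution data
      - namespace: metrics_agg_5m
        type: aggregated
        retention: 2160h         # e.g. 90 days of downsampled data
        resolution: 5m
```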
Another thing that we're excited about, which M3 enables because it supports pushing metrics in, is use cases where you're in a mode where it's actually very hard to be scraped. There are things like Databricks jobs, which we leverage a lot, where Spark clusters are running and processing lots of data. Our data science teams really want to track those jobs, and historically that's been very difficult because those jobs are running somewhere else, in an isolated cluster; how would Prometheus reach over there and scrape the metrics out of it? So being able to build a little proxy so that they can push metrics directly through into M3 is really great. Another one our dev tools team is looking at is monitoring things from developer laptops, to know how long certain developer operations take so they can improve the developer experience at Databricks; they also want to be able to push metrics in, because obviously your scraping system is not going to be able to reach out and scrape metrics from your developers' laptops. So these are features that M3 has that let us do new things. There obviously are ways to push metrics into a proxy for Prometheus, but we did operate some of those and found them to be less than reliable, just because of the nature of needing to cache the data and then re-scrape it. It's much less reliable to do that than to just directly push the metric in; caching is a difficult problem with metrics, and I'll leave it at that. We spent a lot of time fixing various metric caching issues.

Anyway, I want to leave a little bit of time for questions, so I'll just conclude and then we'll leave about 10 minutes. In conclusion, this has been a very successful migration for us; we're very happy with where we've landed overall. The community has been extremely helpful: we've worked a lot with the people at Chronosphere, who have been extremely helpful, and we've worked with the open community as well, and it's been a great experience. There are a lot of great new things on the horizon for us, and we're really excited to be shifting gears into making the overall metrics experience at Databricks better from a feature perspective, not just a stability perspective. So, cool, that's what I wanted to say, and I think we can shift to questions.

Yeah, I see a lot of them have been marked as answered, but okay, let's see. I see a question about why Cortex didn't suit the high churn rate. We didn't dive very deeply into this, other than that we did actually talk to, the name is escaping me right now, the main person who started Cortex at Grafana, and we did identify this problem that Cortex has with ingesting a lot of short-lived metrics. And I see that YY would like to answer this question live, but I just clicked it for you. Oh, okay. Thanks. So that is something that we ran into; I mean, we actually did try quite hard to deal with it.

Then: how many production issues are we facing daily, on over 50 clusters? If I exclude disk space issues, I would say we average maybe one or two a week of actual production issues across over 50 clusters, so like I said, it's really very stable at this point. I would say the main thing we are focusing on right now is getting a handle on the increasing load: Databricks is scaling very quickly in terms of how many people use the platform, so our metric loads just continue to go up. We get a lot of alerts that say, hey, you're going to be running out of disk space, and then we have to decide: are we going to reduce retention here, are we going to scale up the cluster, or are we going to figure out which are the worst offenders and make them reduce their metrics? Those are a separate set of issues, mostly because I wouldn't blame them on M3; they would be an issue no matter what system we were using. So in terms of actual M3 issues, I would say on the order of one or two a week, and it's usually something like high memory usage. With 50 clusters there are also inherently going to be some underlying infrastructure issues, so a lot of times it'll be some cluster where a node just can't schedule because there's some underlying issue with the cloud platform or something like that. But like I said, not a ton of issues overall.
So hopefully that answers that one. How do we communicate with the community — through the Slack channel? Yes. We have not taken up the Chronosphere office hours; we may start doing that. But currently our communication with the community has been through the Slack channel and through filing GitHub issues. I mentioned the operator self-driving thing; we reported that there were some bugs in there, and they've now fixed those, so that's been good. So I would say our primary ways of communicating are through the Slack channel, where people are pretty responsive, and through GitHub for code issues.

The node size we're using for the large cluster: I actually don't know off the top of my head. YY, do you know? Around 100 gigs of RAM and 16 CPUs. Cool. Oh, and someone noticed that I'm listening to Madlib, so yeah, that is a really good new album if you want to check it out. That's pretty awesome.

Cool. Let's see. I saw that somebody asked if we were wanting to open source our rule engine. The answer is yes, we do want to, at some point in the future. We do not have a roadmap for doing that right at the moment, but as soon as we're able to prioritize it, that is something we would definitely like to get open sourced; we are a company that likes to open source things if possible. So, yeah, let's see. I see YY answered a lot of these questions by typing, so thanks YY. Let's see if there are any other questions.

Are we using StatefulSet disks? Yes, we use a StatefulSet, and we use SSDs under the covers. Like I said, we run across multiple clouds; one of the nice things there is that Kubernetes can hide some of those details for us, so we just have persistent volume claims that let us get disks for the clusters, and then they magically have disks and it doesn't matter which cloud you're running in.

Cool. Well, I don't know if there are any other questions. Otherwise, I think... Great. I don't know what a "kill all Barbara" does, but that sounds fun. Yeah, that was me trying to stop sharing my screen, but I still can't.

Thank you both. I'll hand it back to us, but yeah, I hope that was, I mean, an extreme deep dive, and really valuable I think for everyone here, so thanks for putting so much work into it. It's really great to see under the covers for the folks out there running M3. So yeah, I just wanted to give a great thank you for all the detailed materials here. Awesome. Yeah, agreed, that was really great. Thanks both, YY and Nick.