Hello, everyone, and welcome. I hope you're all having a good week here at KubeCon. Today we're going to be talking about DoorDash's migration from StatsD to Prometheus at over 10 million metrics per second. Before we get started, I wanted to do a quick poll of the room. How many people here work at companies that use StatsD? Okay, what about Prometheus? All right, and what about both? Okay, all right, trying to catch some people in the middle of their migration, right?

Cool, so my name is Ben Raskin. I'm a Solutions Architect at Chronosphere, where I work on customer enablement and onboarding. Chronosphere is an observability company that focuses on cloud-native companies. Prior to Chronosphere, I was a software engineer at Uber working on M3, an open-source metrics platform. Hey, everyone, I'm Emma. I joined DoorDash about three and a half years ago, after graduating from Carnegie Mellon University. I'm now on our Core Infrastructure Observability team, and I helped the whole organization migrate from the StatsD platform to Prometheus-based monitoring. Yeah.

Cool, so just a bit of an agenda. We're first going to look at some of the challenges that DoorDash faced with StatsD, and we'll take a look at how Prometheus resolved those. Next, we'll talk about the client-side migration effort from StatsD to Prometheus. Then we'll talk about enablement: how we got the end users, engineers, and service teams up to speed. And finally, we'll do a bit of a retro, looking at some of the learnings and key results of this gigantic task.

So we wanted to begin by talking about some of the challenges that DoorDash faced with StatsD. A lack of naming standardization and limited support for tags makes it hard to give context and meaning to the underlying StatsD metrics. So we can see here, we have two metrics coming from two different services. They're both tracking the same thing, the number of page views. But unless you have intimate knowledge of these services and the metrics they're producing, it's hard to know that these two metrics are in fact tracking the same thing. We'll see in a bit that with tags and labels in Prometheus, it's much easier to give context to these particular underlying time series.

Next, the number of metrics scales with user traffic in StatsD. This means if the number of user requests or traffic to the overall business goes up, the number of metrics goes up. Not necessarily the cardinality, or unique number of metrics, but the total volume. This oftentimes requires an aggregation tier before actually storing the metrics in the time series database; otherwise these metrics can grow exponentially. Next, because StatsD is sent over UDP, there's the potential for packet loss. This is particularly problematic during high-traffic times, when the server can get overwhelmed and you could therefore be dropping important data points. Next, StatsD reports counters as deltas. The biggest issue with delta counters is that there's no way to interpolate missing data points. Once the server drops the metrics because of packet loss, or some other reason why you would lose those particular counters, there's no way to estimate or interpolate the missing data points. And finally, the lack of histograms in StatsD requires you to pre-compute percentiles for latencies. And the biggest issue with pre-computing percentiles is that when you want to aggregate them at query time, it's mathematically nearly impossible to get an accurate representation of the aggregated view.
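Here is a tiny worked example of why aggregating pre-computed percentiles goes wrong (all numbers are invented for illustration): two instances each report their own p99, and no combination of those two numbers recovers the real p99 of the merged traffic.

```python
# Two instances each pre-compute their own p99 latency (StatsD timer style).
instance_a = {"p99_ms": 120, "requests": 10_000}  # lots of mostly fast traffic
instance_b = {"p99_ms": 900, "requests": 100}     # a little slow traffic

# A naive average says 510 ms; a request-weighted average says ~128 ms.
naive = (instance_a["p99_ms"] + instance_b["p99_ms"]) / 2
weighted = (
    instance_a["p99_ms"] * instance_a["requests"]
    + instance_b["p99_ms"] * instance_b["requests"]
) / (instance_a["requests"] + instance_b["requests"])

print(naive, round(weighted, 1))
# Neither number is the true combined p99: computing that requires the full
# latency distributions, which the pre-computed percentiles already threw away.
```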
So why Prometheus? Based on the issues and pain points we experienced with the StatsD platform, we set some new observability requirements and principles. First, we have a strong preference for using open source. The advantages are, firstly, that it is more cost efficient compared to building and maintaining everything from scratch ourselves. Secondly, in terms of integration, we can make use of open source data formats and open source query languages. Lastly, using open source prevents vendor lock-in, which means we can consider self-hosting in the future. Second, we want more governance and control over the whole monitoring system. We want standard conventions, for example a standard for common tags and metric naming conventions. The common tags can be used for alerts and dashboards; for dashboards especially, they can be used for grouping and filtering. We also want to improve governance and control on the cost side. If we have the ability to break down usage by team, service, and organization, then we can set quotas and rate limit at the team and service level. The last requirement is about self-service and productivity empowerment. For new services, we want the metrics to be automatically discovered and collected by the existing metrics pipeline, without extra work. We also want to automate the process of generating basic dashboards and alerts. This is very important for accelerating the migration and onboarding process.

So why did DoorDash choose Prometheus? First of all, Prometheus has emerged as the dominant standard for open source metrics, and it aligned well with DoorDash's organizational strategy and requirements. The simple fact that there's a ton of community support out there really helped alleviate a lot of the pressure on the central observability team, since end users could now go to Stack Overflow and GitHub and all these other resources online to get answers to their questions, as opposed to the central observability team being the bottleneck. Next, the query language, PromQL, has become the standard for querying time series data. Again, this really helped with onboarding new engineers, since it's a standard query language that isn't proprietary at all, so you can bring those skills with you to different companies. The tag-based metric ingestion format allows you to give context and meaning to the underlying time series, and there are several advantages to this. One advantage I wanted to highlight with regards to DoorDash is that with a tag-based model, it's super easy to see how many metrics each service or each team is sending to the platform, and that makes it really easy to do cost accounting. Next, because Prometheus is a pull-based system, it allows for client-side aggregation, so you no longer have the worry of your metric traffic scaling with your actual business or user traffic. Next, there's strong support for the more fundamental tools at DoorDash: things like Kubernetes and Envoy play really nicely with Prometheus. And then finally, Prometheus obviously has native support for histograms, so you don't need to pre-compute percentiles and you can accurately aggregate histograms at query time.
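As a small illustration of that last point, this is roughly what instrumenting a latency histogram with the Prometheus Python client can look like; the metric name, labels, and buckets are made-up examples, and the percentile itself is only computed later at query time, so it can be aggregated accurately across instances.

```python
import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_latency_seconds",   # hypothetical metric name
    "HTTP request latency in seconds",
    ["service", "app", "path"],       # context goes in labels, not the name
    # Buckets are chosen up front; the client only exports per-bucket
    # counters, never pre-computed percentiles.
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def handle_request(path: str) -> None:
    start = time.time()
    # ... do the actual work ...
    REQUEST_LATENCY.labels(service="web", app="frontend", path=path).observe(
        time.time() - start
    )
```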
And one more thing: Prometheus reports counters as cumulative, or running, counters. I wanted to dive into this point a little bit more, talking about the visual representation differences between delta counters and running counters. We see here that with delta counters, when you lose a data point, it's lost forever; there's no way to know what that data point might have been. With running or cumulative counters, because they're monotonically increasing, you can apply a special rate function in PromQL to interpolate or estimate what that data point would have been.

Okay, let's see what we have done in the client-side migration. For the metric generation part, I will first introduce the custom metrics, then the Prometheus exporters that we have used. Lastly, I'll cover short-lived jobs. For custom metrics, including the application and service metrics, we suggest that service owners and microservices use the Prometheus native client libraries to generate the metrics. One interesting case that I want to share with you is the Python client with a multi-process pattern rather than a multi-thread pattern. As we know, the Prometheus native client presumes a multi-thread pattern, which means the metrics can be shared within the workers. However, with the multi-process pattern, we discovered that the metrics can be incomplete, or require a custom implementation with an unwanted performance impact. So to avoid that performance penalty, we use the StatsD exporter for this part of the metrics. Also, we provided some internal libraries for the developers to use, for their flexibility. For example, we provide an internal library to generate HTTP and gRPC metrics automatically. Also, the JVM Prometheus exporter is used in our Docker-based images to export the JVM metrics.

For some other cases and metrics, we also use the open source exporters. The Prometheus community is so big, and it provides and maintains so many useful open source exporters. For example, for the AWS CloudWatch metrics, there are mainly two exporters. The first one is the official Prometheus CloudWatch exporter and the second one is the YACE exporter. The difference between the two exporters is the different APIs used in their core functions, and we chose the latter because it has a better service discovery mechanism and also puts less load on the APIs. Also, we use the StatsD exporter, and StatsD is still alive in our monitoring system. There are mainly two cases where we use the StatsD exporter. The first one is for the legacy systems that cannot be migrated to Prometheus; we use the StatsD exporter for that part of the metrics. The second use case is, as I mentioned before, for the Python client with a multi-process pattern: we use the StatsD exporter to convert the metrics from StatsD to Prometheus format and then forward them to the backend. Also, we use the JVM exporter, node exporter, Kafka lag exporter, and the PgBouncer exporter. These are mainly used for the infrastructure and platform related metrics. For the short-lived jobs that cannot be scraped by the metrics collector, we use the Prometheus aggregation gateway, with a small modification, for that part of the metrics.

To better control the metrics, we have some standard tags for all the metrics. The common tags include, first, the service, which is the service name. Second, the app, which is the application name; one service can include multiple different apps. Also, we have the Kubernetes cluster, which indicates the cluster that the service is running on.
The environment label is used to differentiate the production and staging environments. Also, we have the sub-environment, which is used to differentiate the canary sandbox from other environments. The region and zone are AWS terminology, and they are used to indicate the region and zone ID that the service is deployed in. Lastly, we have cost origin, which is used to analyze the metrics' cost attribution. As you can see, there's a screenshot showing one of our dashboards: in the header, we have the common tags in drop-down lists for filtering. Yeah, next is about service discovery. Service discovery is used to find and discover the jobs or services for scraping. It is done through Kubernetes annotations, which tell Prometheus what endpoints to scrape. The central observability team created gold standard Kubernetes manifest templates, which the service teams can use. At DoorDash we have a service template, which helps developers generate the Kubernetes manifests automatically. So once the service owner or the service team adds the annotation, it means their jobs or services are ready for scraping. We also deployed the metrics collector as an agent in the Kubernetes cluster, which means that for every node there is one metrics collector that scrapes all the metrics on that node. Also, for services that are not deployed in the Kubernetes environment, we deployed the metrics collector as a sidecar with the service, so the metrics collector collects the metrics from that service automatically and forwards them to the backend. The default scrape options are also defined by the central observability team; scrape options like the scrape frequency, scrape timeout, and so on are included in the metrics collector configuration. We also populate the standard labels, as described on the previous slide, in the metrics collector for all the metrics.

This migration also coincided with hyper growth at DoorDash, mainly accelerated by the pandemic, and so cost control was a huge issue for the central observability team. Fortunately, with a Prometheus-based monitoring system, there's a bunch of different mechanisms that can help you reduce cardinality, control costs, and improve performance, so I wanted to talk about a couple of these mechanisms. Starting with relabel rules: relabel rules are a native Prometheus feature that allow you to drop labels or metrics client side. Typically, when you're using some of these open source exporters that Emma talked about, there are oftentimes metrics that you don't need; you never query them in dashboards or alerts, and so you can safely drop them client side. The next two mechanisms, rollup rules and mapping rules, are features of M3. M3, as I explained in the beginning, is an open source metrics platform that was developed at Uber, and it actually acts as the underlying metrics platform at Chronosphere. Rollup rules allow you to pre-aggregate metrics before actually storing them in the time series database. Again, just as with relabel rules, oftentimes with common exporters there are labels or tags that you don't necessarily need, and oftentimes these labels are very expensive: they have high cardinality and you typically don't need to break down your metrics by these particular labels. A common one is instance or pod ID. So with rollup rules, you can safely aggregate those away before actually storing the metrics in the time series database.
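As a rough illustration of the relabel-rules mechanism described above (this is not DoorDash's actual configuration; the job, metric, and label names are invented), dropping an unused metric family and an expensive label client side can look like this in a plain Prometheus scrape config:

```yaml
scrape_configs:
  - job_name: example-service        # hypothetical job
    static_configs:
      - targets: ["localhost:9091"]
    metric_relabel_configs:
      # Drop a metric family that is never used in dashboards or alerts.
      - source_labels: [__name__]
        regex: "go_gc_duration_seconds(_.*)?"
        action: drop
      # Drop an expensive, high-cardinality label that is never grouped by.
      - action: labeldrop
        regex: "request_id"
```

The rollup and mapping rules described next play a similar cost-control role, but they run in the aggregation tier (M3) rather than at scrape time.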
Mapping rules allow you to define the storage policies for each time series. This means you can downsample a subset of your metrics and store them at different resolutions. So let's say your default scrape interval is 30 seconds, but you have a subset of metrics that you want to store at one minute: you can use mapping rules to do that. Lastly, we have recording rules. Recording rules allow you to pre-compute expensive PromQL expressions and then store those results in the time series database. All of these greatly improve performance, especially at query time, since you're now querying fewer and fewer time series.

Cool, so I wanted to talk about enablement. I think this was probably one of the more interesting pieces. DoorDash obviously has a lot of engineers and a lot of teams, and so enabling the users on Prometheus and teaching them about PromQL was one of the more difficult but interesting aspects of this migration. The first thing that we needed to teach users is how to best use the Prometheus data model. Because it's a tag-based model, and most of the engineering team was used to StatsD, which is a name- or path-based model, we really needed to teach them all of the best practices. One of the main things was teaching them not to over-index on the metric name, but instead use tags and labels to give context and meaning to the underlying time series. PromQL, as I mentioned earlier, is the Prometheus query language, and it has become the standard in the industry for querying time series data, especially tag-based time series data. So we had to teach the engineering and service teams not only the syntax and how to actually write these queries, but also the quirks of PromQL. Two of these that I wanted to highlight are, first, how you query counters: with a running or cumulative counter, you need to apply a special rate function, and one of the benefits of the rate function is that it will actually interpolate or estimate missing data points. So we had to really pull back the covers a bit on the rate function, especially with the more technical or power users of the system. Second, we had to talk about how to query histograms versus timers. With histograms, as you'll recall, you do all of the percentile calculations at query time, and you do this through specialized PromQL functions.

Next, not everything needs to be built from scratch. Because we now had a standard data model throughout DoorDash, we could rely on open source dashboards, alerts, and recording rules that we could just pull into the system, and teams could get all of that visibility right out of the box. Even with service teams, we could build gold standard dashboards that other service teams could use as models for building their own dashboards. Lastly, the DoorDash central observability team decided to manage all of these resources through Terraform, so managing all the dashboards, alerts, aggregation rules, recording rules, things like that. And so we had to come up with a way for teams to easily onboard all of their different assets, and then, of course, we had to give them the tools to make it easy to do so.
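To make those two PromQL quirks concrete, here are example queries for a counter and for a histogram; the metric and label names are hypothetical and just follow the standard-tag scheme described earlier.

```promql
# Counters: always rate() the cumulative counter over a window before
# graphing or aggregating it; rate() handles counter resets and estimates
# across missed scrapes.
sum by (service, app) (rate(http_requests_total[5m]))

# Histograms: percentiles are computed at query time from the bucket
# counters, so they aggregate accurately across instances.
histogram_quantile(
  0.99,
  sum by (le, service) (rate(http_request_latency_seconds_bucket[5m]))
)
```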
The final part is about the timeline, the results, and the learnings. Let's look at our timeline of the migration. Our journey began in Q1 2020. That's when we decided to move away from StatsD to Prometheus. Then, in Q2 2020, we started the migration of the infrastructure and Kubernetes metrics, and it finished in Q3 2020. During that phase, we migrated all the Kubernetes and infrastructure metrics, including the cAdvisor metrics, the node exporter metrics, and also the AWS metrics. As I mentioned before, we used a lot of exporters for these, and we deployed the metrics collector as an agent in our Kubernetes clusters. For the services that are not deployed in a Kubernetes cluster, for example on EC2 instances or in ECS clusters, we deployed the metrics collector as a sidecar with the service to scrape their metrics automatically. Starting from Q4 2020, we began the migration of the service and application metrics. We finished all the metrics, dashboards, and alerts within one year for all the service teams. Then, in Q4 2021, we deprecated the original StatsD pipeline; we now only use some StatsD exporters and keep a lightweight version of the StatsD pipeline, yeah.

Cool, so just taking a look at some of the numbers of this massive migration. We started with more than 7,000 alerts, 1,500 dashboards, and metrics for over 130 services. Needless to say, this was a massive collaboration across all teams and engineers, especially Emma's team, the central observability team. One year post-migration, which also coincided with hyper growth in DoorDash's business, we now have over 27,000 alerts and 2,200 dashboards, and we're ingesting over 15 million metrics per second. Post-aggregation, we're persisting just over 10 million metrics per second today.

So, a couple of the learnings that we took away from this migration. The first one: switching from percentiles to histograms is tough. Not only was this a client-side change, where we had to teach engineers about buckets, which buckets to select, the appropriate number of buckets, things like that, but we also had to teach them how to actually query them through PromQL. And along the way, we obviously wanted to teach them about some of the advantages, so that engineering teams could now get accurate aggregations across instances. Next, the instance label is an important concern for overall volume. As I mentioned before, the instance label is one of those super high cardinality labels that gets added to every single Prometheus metric, and it's needed to keep the time series unique for the cumulative counters and histograms. Fortunately, with rollup rules, we can pre-aggregate that instance label away and then safely store a lot less cardinality in the time series database. Yeah, another important learning is that Prometheus is not friendly with short-lived jobs. We use the Prometheus aggregation gateway for that part of the metrics, and it can be dangerous because it's essentially a push model, which is something we want to avoid, so we limited this method to some special cases. The last learning listed here is that we really need to automate the monitoring onboarding process for teams. At DoorDash, the steps are: first, we provide the service teams with wikis which document how to generate metrics using the Prometheus native clients, and we give them some examples to follow. Then the service owner or the service team adds the annotations in the Kubernetes manifest, which means their service and its jobs are ready for scraping.
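As a sketch of what that annotation step can look like (the talk doesn't show DoorDash's exact manifests; the prometheus.io/* keys below are the common community convention, and everything else is invented):

```yaml
# Hypothetical pod template: the annotations advertise the metrics endpoint
# so the node-level metrics collector discovers and scrapes it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-service
spec:
  selector:
    matchLabels:
      app: example-service
  template:
    metadata:
      labels:
        app: example-service
      annotations:
        prometheus.io/scrape: "true"    # opt this pod in to scraping
        prometheus.io/port: "9091"      # port serving the metrics
        prometheus.io/path: "/metrics"  # metrics endpoint path
    spec:
      containers:
        - name: example-service
          image: example-service:latest
          ports:
            - containerPort: 9091
```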
After that, we provide them with some Grafana dashboards and Grafana templates to use and copy from. Also, at DoorDash we provide alerting tools for the service teams to use; they can use them to generate alerts automatically. And we provide alert templates for some common alerts, for example for Kubernetes container-level health: we provide alerts to monitor things like the Kubernetes container restart count, and CPU and memory utilization, yeah.

Awesome, so yeah, thanks for coming to our talk. I think we'll take a few questions now. And if you want to talk to myself or Emma or anyone from the Chronosphere team, we have a booth here at G15. I think we're giving away some pretty cool prizes today. We also created a video game just for this conference, so you should stop by and see if you can beat our high score.

Yeah, so my question is that you mentioned there are 2,200 dashboards. Why do you maintain that many dashboards? Is it because you think you have to standardize or have a similar set of dashboards? Yeah, so I think the original number was about 1,500, and a lot of those are probably still just kicking around. Of the Prometheus-based ones we probably don't have as many, which, as you can tell, is probably about 600 or 700 or so. But you also have to remember that DoorDash is a huge organization. I'm not sure exactly how many engineers there are, but I would guess probably in the thousands. And so you always have teams and engineers that go off and make test dashboards or personalize things. But yeah, I don't know if Emma has any other comments about that. Yeah, I think we have lots of dashboards there, in different buckets, and for each team we have some Kubernetes container-level metrics for them to use, but they also keep their own copies, yeah.

Yeah, hi, for someone who's just starting on this migration journey, what are some mistakes that you would recommend we avoid? I saw the learnings, but I'd love some more detail there. It's a good question. Emma can probably talk a little more from the DoorDash side, but I think the biggest thing that we found to be very helpful is really working with one or two pilot teams to get that model out there, especially around the use of the Prometheus clients. Having a couple of services that are using each of the different Prometheus clients, maybe one Go service, one Node service or something, and getting them fully across the line at the beginning can really help, because you learn a lot in that journey, and then other teams can tack on and follow that. Yeah, so from my perspective, I feel like we really need to set the common tags as soon as possible. It's difficult to change later if you end up with a different set of common labels. For example, we have a cost origin label that we really want to get rid of, because we want to use service and app, those two in combination, to identify a service uniquely, yeah.
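One way to make those common tags hard to get wrong is to stamp them in the collector configuration rather than in each service. Here is a rough plain-Prometheus sketch of that idea (DoorDash's actual collector setup isn't shown in the talk, and all values here are invented):

```yaml
# Hypothetical collector-side config: standard tags are applied centrally
# so service teams cannot misspell or omit them.
global:
  external_labels:            # attached when samples are forwarded to the backend
    kubernetes_cluster: example-cluster
    region: us-west-2
    zone: us-west-2a

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Copy the service/app identity from the pod's own labels.
      - source_labels: [__meta_kubernetes_pod_label_service]
        target_label: service
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
```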
Yeah, so basically you provide the whole gold standard, the templates and everything, but that's actually just for the first generation, right? So if the gold standard changes, how do you ensure that all the teams actually follow the updated gold standard that you've provided? For example, let's say cost origin is a new tag. How can I make sure that all the teams are actually going to include that tag as well? Oh, actually, labels like the cost origin label are generated from our side automatically. In the metrics collector, we have some shell scripts to get the AWS region and zone ID and also the service name, and set them automatically. So teams are not able to change that, and we populate all these labels for them automatically, yeah. I think that can avoid some mistakes, especially spelling mistakes from the service teams. Yeah, and honestly, to this gentleman's point, I think having a central observability team take the lead in this migration effort was super beneficial, because they have a view right across the company and they can set those standards in place early on. Obviously things are going to change in the future, but the more you can eliminate the unknown at the beginning, the better.

So I think, in general, you really want to look at what your scrape interval is; I think that's a pretty good standard to look at. If your scrape interval is 30 seconds, having jobs that run for under 30 seconds is probably not going to work, because those jobs will complete before Prometheus or the collectors can actually scrape them. It really depends, but if I'm going to make a blanket statement, I would say the scrape interval is probably a good boundary.

No worries, I heard it. The question was, do we have plans to correlate metrics and time series data with traces, specifically around OpenTelemetry? So at Chronosphere, last year we launched a tracing product where we do just this. I can't really talk about DoorDash's plans, but yes, we do have plans, and we do have customers actively using that and using it to correlate between metrics and traces. Thank you. Thank you.