Hi everyone. My name is Alash. How's everyone doing? Hopefully recovering after lunch. Excellent. As already mentioned, Raboon and I are going to talk about decentralized and centralized scraping architectures. I'll cover it in more general terms, and Raboon will give you some very interesting examples of how this is done at DoorDash. At Chronosphere I'm part of the solutions engineering team, and I've been in the observability space for the past five years. It's really exciting to see how this space keeps developing and growing and what kinds of challenges people are facing, so I'm fortunate to work in an area that's moving fast.

Hi everyone, I'm Raboon. I'm a software engineer at DoorDash and a tech lead on the observability team, and we're here to tell you about our experiences.

Excellent, thanks, Raboon. So let's start at the beginning: if you look at metric data, how do you actually collect it? Think about the life of a data point. I put an example up here that uses bees, so if anybody has a fear of bees, I apologize, but they lend themselves really well as an example of how we deal with these things in real life. If you think about metric data, what actually matters? Discovering where you're collecting it from, transporting it, and then storing it. Each of these areas has different problems you have to deal with. On the discovery and collection side: how do we scale this, and how do we collect reliably? Then, how do we transport that data at scale, and store it at scale so it's available to our consumers? If you look at the value of a data point as it progresses through this cycle, it's most valuable at the beginning: how do we get it, with as little latency as possible, from where it's produced, so we can alert on it and query it? The longer we store it, the less value it has, because it's not as operational anymore.

So why is this important? Because doing the collection piece reliably and at high scale really matters. If we can do it fast, performant, and reliable, we can feed it into a system that makes our whole observability discipline a lot better. That's why today we're going to talk about how the collection infrastructure actually works and what kinds of approaches we use here. Obviously, as the title says, centralized versus decentralized.

The centralized approach is essentially one large thing that runs, discovers all the targets it needs to pick up metrics from, collects them, processes them, and ships them off to wherever you store them. That massive strip-mining piece of equipment is actually a good example of it. Why? Because it's big and fairly expensive, but there's only one of it, so you can maintain it in one spot and control it in one spot. But it has a massive downside: if it breaks, you're not collecting anything, which can be really problematic, since we rely on this data more and more to make sure we reliably deliver to our consumers in the end. The right-hand side is the other approach: the decentralized approach.
I don't know if anybody has played StarCraft, but those are the drone collectors from StarCraft. You can have a gazillion of them; they go out, do their thing, collect whatever they collect, and bring it back to wherever you're storing the resources. If one of them fails and needs to be replaced, that process is usually quick. In the context of a collection agent, if it's small and it crashes, it comes back up in a couple of seconds, starts doing its job, and you're off to the races. That might not be true in a centralized environment, obviously. This approach presents its own challenges, mostly around command and control: how do you make sure that whole fleet of things is doing the right thing, and that you understand when something goes wrong, and so on.

Obviously this is complicated, and not all problems are best solved the same way. The obvious question is: what if we do both? Morpheus never said that, and Neo couldn't pick both pills, but in this case we actually can do something along those lines. Raboon has a really good example of how that was done at DoorDash, so I'll let him present it.

Thank you, Alash. Before we start, for those of you who don't know DoorDash: DoorDash is a company that does last-mile logistics and provides the infrastructure to support local commerce. Basically, we have merchants, we have consumers, and we have Dashers connecting them, so food delivery and other types of deliveries are possible through DoorDash. It's pretty popular, but what about DoorDash's own infrastructure? We are cloud native, we've been in the cloud from the get-go, and we're Kubernetes-based today. We have multiple clusters, now more than 20, and the largest has between 1,500 and 2,000 nodes depending on scaling and traffic, with more than 35,000 pods at peak. That means we have a lot of scrape targets: some pods expose multiple targets, so we have around 100,000 scrape targets in that one cluster alone.

We didn't start with Prometheus. We were a StatsD shop for a long time, but we had problems scaling the StatsD infrastructure, and if you check our blog posts you'll see the things we tried to scale it up to a certain point. Past that point we said, okay, this is not working, we need a new paradigm, and we shifted to Prometheus. I should say that was a tough migration, a multi-year migration with a lot of tricks, but the nice thing was that we learned a lot about how to use Prometheus, especially distributed versus centralized. Right now we're over 10 million metrics per second; by the way, we batch them in 30-second batches, which we feel is a good number, and we have more than 10,000 alerts running in real time. Our teams love metrics and love alerts, so it's a challenge.

So this is what we have right now. I'm going to talk just about metrics, no logging or tracing. Most of our infrastructure lives on Kubernetes, and our vendor, as you can see, is Chronosphere, and we use their collector.
It's an agent that we deploy as a DaemonSet in our Kubernetes clusters so it can do cloud-native discovery and metrics scraping in the Kubernetes environment. It works well, but it's not enough on its own. Sometimes, for whatever reason, you have something else: for example, our traffic team believes in dedicated traffic machines, so we have some EC2 instances (we're an AWS shop), and for those we don't use the DaemonSet approach; we deploy the collectors as services. It's more or less a similar concept, but it is different: those collectors have a single responsibility, whereas the collector in the Kubernetes DaemonSet is responsible for multiple endpoints, because of how pods are scheduled onto nodes, and it does discovery in real time. For the EC2 ones we generally use file-based discovery. It's simple, it's convenient, and it works well for that kind of thing.

However, we also have other workloads. First of all, you need your AWS metrics. We don't use CloudWatch that much; some teams do, but they want observability in a single place, available in Chronosphere, and the same goes for other vendors. So we have a separate deployment of the Chronocollector responsible for this, and we use intermediaries such as custom exporters or StatsD for batch jobs and scripts. In the end, you can see I've mentioned the Chronocollector in several different places, and they are all different types of deployments. Even though the binary is the same, we configure each one to be specialized for its task.

So these are the models we have. First, the DaemonSet. We keep the DaemonSet configuration as simple as possible; we want it to be generic and to cover many microservices. On top of that we just add some rewrite rules: we have a standard set of labels that we attach to all of those metrics, plus a few extra rewrite rules, and that's it. It's kept simple on purpose.

There's also the sidecar approach. Why would we need this? Because, unfortunately, some services are just too noisy. It might be a buggy thing, or it might be the nature of the service. For those, we run a sidecar: a dedicated container attached within the same pod, just for the purpose of scraping that particular service. The good thing is that you can allocate resources according to what that particular service needs, which is critical (I'll come back to why in a moment), and it can have its own custom configuration. A team can say, "I want my metrics scraped every 10 seconds," or "I don't need 30-second resolution, I'm fine with five minutes." To handle those cases, instead of finding tricks in the DaemonSet, we use a sidecar.

There is also the collector running as a plain service on the EC2 hosts. As I mentioned, it doesn't do Kubernetes-based discovery; it has its own specific configuration, and that's it.

And the last one is the deployment model: a separate set of collectors aimed at a single task. The example here is the one I just gave, pulling EC2 metrics from AWS and sending them to Chronosphere. These models have evolved over time; it wasn't designed like this from the start. We had problems, we worked through solutions, and right now it works better than I would have expected.
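To make that split concrete, here is a minimal sketch in plain Prometheus scrape-config terms. This is not the actual Chronocollector configuration; the job names, annotation, port, file path, and label values are all invented for illustration.

```yaml
scrape_configs:
  # Generic per-node job (the DaemonSet case): discover pods via the Kubernetes
  # API, keep the config simple, and attach the standard infra labels through
  # relabel rules so teams don't hand-roll their own env/prod/production variants.
  - job_name: kubernetes-pods
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - target_label: environment        # infra-owned label, added for everyone
        replacement: production

  # Sidecar-style exception for one noisy service: its own interval (and, in the
  # sidecar, its own resources) instead of tricks in the generic DaemonSet config.
  - job_name: noisy-service
    scrape_interval: 10s
    static_configs:
      - targets: ["localhost:8080"]      # the app container in the same pod

  # Dedicated EC2 traffic machines: no Kubernetes discovery, just a target file,
  # which is simple to generate and convenient for a mostly static fleet.
  - job_name: ec2-traffic
    scrape_interval: 30s
    file_sd_configs:
      - files:
          - /etc/collector/targets/ec2-traffic.json
        refresh_interval: 1m
```

The point is that the per-node job stays generic, while each exception gets its own small, dedicated configuration.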
So when we were making those choices, we had some important parameters. The first one: when you use a DaemonSet, resource consumption is critical, because you're deploying the same resource requirements across your whole fleet. That also means we need stability: collector stability is the key parameter for the DaemonSet. For the others it's okay if they have resource issues or bugs, but the collector deployed as a DaemonSet has to be stable.

The second one is obviously discovery. We don't want our collectors to lag behind discovery, because developers are pretty sensitive: they really want their metrics available from the get-go, and if there's a delay or a resource problem, they are not forgiving.

We also do enrichment. We add a couple of labels, for example the environment label. In the StatsD world this was chaos: we had "env" as a label name versus "environment", we had "prod" versus "production", uppercase, lowercase, all the different combinations, and it prevented us from building standard dashboards. So we said, okay, no more infra labels from teams; let us tag those for you, and then you can use them as if you had created them yourself. We also have a lot of rewrite rules in different places for various reasons, and that was another parameter.

And the last one is recording rules. Our SRE team loves these, because they do their SLO tracking with recording rules, so we have a lot of them, and we needed something that would scale and keep up with the ongoing SLO/SLA project.

With this setup we get the best of both worlds, right? We have the distributed one and the centralized one. Obviously, with a decentralized architecture you don't have a single point of failure, and this has saved us multiple times. There is always something: a bad neighbor, a bad machine, a bad configuration. Because our compute team also relies on the collectors to scale the cluster up and down, it's critical that we're able to survive those. It's pretty resilient and pretty scalable; I'm saying this from our point of view as a Chronosphere customer: we just ask for more resources and we get them, which is nice. It's flexible, like I mentioned, because we can have different combinations of these models. And because the rewrite rules are applied locally, there's a negligible performance penalty, since we distribute and scale them differently when there's a need. And finally, noisy neighbors only impact that particular node, not the whole cluster.

However, there are cons. Resource usage is a problem in the decentralized architecture, because with a DaemonSet you have to size the system for your worst-performing node; that's always your limit. You go up until you feel fine, and then most of the cluster isn't using the resources you've allocated, so you play with the numbers, you oversubscribe, you do all those tricks. The other issue is major changes: a version deployment, or some test you're doing, is a DaemonSet deploy. We try to be responsible engineers, so we roll it out slowly across multiple clusters, and it takes days.
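Both of those concerns show up directly on the DaemonSet spec itself. This is a minimal sketch rather than DoorDash's actual manifest; the names, image, and numbers are invented. The per-node resource requests are what has to be sized for the worst node, and the update strategy is what paces the rollout.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: metrics-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: metrics-collector
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # roll one node at a time; across ~2,000 nodes this takes a while
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      containers:
        - name: collector
          image: registry.example.com/metrics-collector:1.2.3   # placeholder image
          resources:
            requests:            # every node pays this, sized for the busiest node
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 1Gi
```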
We don't want to speed it up, because it's a critical piece and we want to make sure it's reliable, but that means you have to let it roll out really slowly, which can take some time and creates operational burden. And the last con: when you have an issue, when some of the collectors or agents are misbehaving, you have a problem, because you have to figure out what's going on in a dynamic Kubernetes environment. You have a lot of logs, because every log line is multiplied by thousands of nodes, so you have to pinpoint the problem; you have to isolate or cordon some nodes to see what's going on. You end up looking for a needle in a haystack. That was the original problem that pushed us to improvise all these different deployment types, and that's what we ended up with.

So these are my conclusions, by the way. Our experience told us that you need a distributed deployment after a certain scale. For us, that was when we passed the one million metrics per second mark, because at that point we started seeing some loss. Maybe we didn't do a good job of optimizing, but it felt like we needed something else. It also plays nicely with Kubernetes: we wanted to scale Kubernetes, observability is really critical for the compute team, and we had to scale better than what they asked for, so we moved to a distributed architecture. It's pretty stable. It can have its own issues, so you need to improvise: you take some workloads out, put them in a single deployment, use the sidecar model, or whatever. But once you understand and reconfigure your system, it's pretty stable; you don't have to do much, you just let deployments go through. You will need to work with a lot of exceptions, though. This is the DoorDash experience, so your experience can be a little different, but you know there will be some. And you will be creating a lot of Helm charts and a lot of changes. And with this, is this the last one? Yeah, with this, that concludes our presentation. Thank you, and any questions?

Hi. You didn't touch on storage, and I think you were going to talk about that. I'm curious where you put all these metrics after you collect them, once you went to your distributed model.

Yeah, like I mentioned, since we use Chronosphere, we're lucky: they store the metrics for us. I think Alash can talk about the numbers from their side. For us, we don't store anything; they have the API, we just hit the API, get the metrics, and we're done.

Yeah, we didn't necessarily mean to dig into that here, but we do have a stand out there, and we also have a stand at KubeCon throughout the conference. So feel free to come by, and we can show you what the solution does and how it does it, and discuss that further.

Okay, hello. You mentioned it was hard, or took a lot of work, to move from StatsD to Prometheus. How did you do that? Did you start with Chronosphere from the beginning, or did you try it by yourself and then move to them? Maybe you can tell us a little more about that process.

Yeah, that's a good question. I think on Friday we'll have a presentation explaining this, Friday morning, but I'll give you a short summary. The short summary is: you need to take it very slowly. You need to first understand that you're shifting the product. It's not just pull versus push.
It's also that you're moving from percentiles to histograms. You're moving away from "I can do whatever I want, I'll just push the metrics." You need to have libraries, you need to expose the metrics, you need to understand your observability. So what we did was start with a well-known set of services that had strong leads and strong engineering, and we used them as our beta customers. We worked with them to see what was failing and what we needed to do: is it just a library change, or is it a rewrite? Some teams used Micrometer and just switched it from StatsD to Prometheus, and that was a disaster for us; the metrics lost their reliability. So we had to revisit that and say, okay, let's use the native libraries instead of a facade like Micrometer. Then you need to understand how you do things with your storage systems: is your database exposing Prometheus metrics, and if not, how do you do that? You lay those stories out one by one and attack them.

The biggest problem is actually not that. The biggest problem is the tail end of the migration, because what you leave to the end is the most complex cases, and you have to spend a lot of time on them, because those are the real problems you haven't seen before. I mentioned some of them: the deployment model, using systemd services, using the StatsD exporter versus the Pushgateway. Those were all based on our experiences with other teams, and we had to improvise; a lot of improvisation. And in the end, even with the best intentions, we still have some StatsD metrics, and we do a lot of conversion, unfortunately. That means those metrics are not fully reliable, but at least that's better than having no visibility. The details, I think, my colleagues will be talking about on Friday. Thank you. It's Friday morning at 11 a.m., and I think it says Portside ballroom.

My turn to exercise. How did you deal with your missing metrics? You said you were having problems where the metrics would disappear. How do you detect those?

Yeah, you're talking about the StatsD loss, right? We had a nice way of measuring it: we used UDP packet loss as our guide, and the metric loss showed up elsewhere too. Our business operations teams didn't like the metrics at all; they were seeing a discrepancy, and it was consistent, so we knew we were losing metrics. And when we switched over to the distributed model and Prometheus, we saw we had been losing something like 25 percent, a massive amount. That's the reason we use the distributed model: we want to make sure we're not having any issues. The collector itself exposes a lot of metrics, its own errors and so on, which you didn't have with StatsD, because by the time a metric didn't arrive it was already too late, and agents like StatsD had no idea. That was the big shift. I think we're in a good position now; we don't have a lot of issues, it's stable. But with StatsD, we tried all of it: we tried scaling the machines, we tried increasing the cluster size, we tried splitting it into multiple clusters. There's only a certain scale you can reach that way, and we said, okay, that's it.
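On the conversion point mentioned a moment ago: a common way to bridge leftover StatsD metrics into Prometheus is the statsd_exporter with a mapping configuration. A minimal sketch, with invented metric names (these are not DoorDash's):

```yaml
# statsd_exporter mapping config: each StatsD name pattern becomes a Prometheus
# metric name plus labels extracted from the dotted segments.
mappings:
  - match: "delivery.*.request.count"     # e.g. delivery.checkout.request.count
    name: "delivery_request_count"
    labels:
      service: "$1"                        # first wildcard becomes a label value
      environment: "production"            # infra-added label, as described earlier
# StatsD metrics with no matching rule still come through, just with a sanitized
# name and no extra labels, which is part of why converted metrics stay second-class.
```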