My name is David Polk. I'm a senior software engineer at Datadog, and I'm passionate about how Cassandra solves our big data problems. But recently, I've been curious about what Datadog can give back to Cassandra. So first, let me share a little bit about myself. I've always loved the water, from competitive swimming to scuba diving, and when the water turns to snow, skiing. During a college internship, I learned that aerial images could be used to map underwater forests and clustering algorithms could be applied to quantify their degradation by human threats. Seeing how powerful the right pairing of data set and algorithm could be, I started my career developing a machine learning platform for Amazon's Alexa. But one thing I noticed was that the iterations on the AI fundamentally depended on how efficient our platform was at processing the speech data set. So within a couple of years, I found myself fascinated with Cassandra because of how it enabled products to interact with huge data sets while providing high throughput.

So how many of you here have experienced an outage and not immediately known the root cause? OK, a decent number of you. So you're not alone. The cause of a problem with Cassandra is not always immediately clear. Cassandra is complex. It's a distributed system that demands proper management of its resources, ranging from client connections across the cluster down to a node's heap memory. Understanding its resource management can be just as tricky. It requires knowing which cluster events are important, where the logs related to those important events are, how to monitor the metrics, and when to alarm in the monitors. There is so much data to process across the cluster, let alone in a single node. When you encounter a problem, you usually need to start from a high-level symptom and drill down to the root cause. As Yuki mentioned in his OpenTelemetry talk yesterday, this process can be time consuming, tracking down app teams and trying to uncover relevant query information. Datadog can assemble data from logs and other metrics to provide context that is helpful in minimizing incident response time. So for example, instead of manually grepping through the GC logs for pause times, you could collect those logs and extract their contents into Datadog. Or you can enable a Datadog integration, such as JMX, for other kinds of information.

So if you're here, I'm going to assume you run some Cassandra in production. Let's say you're new to a team responsible for operating a Cassandra cluster. You don't know much about it yet, but you definitely want to make sure your cluster is healthy, since it's on the critical path for your web app. Take a simple graph of request rate for the cluster. You can see that the cluster is servicing around 10,000 requests a second and has a cyclical traffic pattern. What kind of signal would you look for to determine if your cluster is not healthy? Well, here are three scenarios to walk you through what could happen.

Let's say you're seeing a few sporadic timeout errors. And then one evening, you get an alert that a timeout threshold has been breached. You acknowledge the page, get a cup of coffee, and start to troubleshoot. Since timeout errors can have different causes, how do you troubleshoot? Well, timeouts are triggered once latency is too high. So what does latency look like? You see that two days ago, latency was topping out around 60 milliseconds. And around the same time, the timeouts were pretty sparse.
But once latency started to creep up above 100 milliseconds, that's when timeouts really started to explode. So why did latency start creeping up to begin with? You decide to look at one of the fundamental resources a node needs for answers. High latency can occur when the node is strained for CPU resources. So what does CPU utilization look like? High CPU utilization. As the amount of CPU available decreased, the latency increased. So why the higher CPU utilization? A node naturally grows more strained for CPU resources as request traffic grows in the cluster. So what does request volume look like? Just as you suspected: more requests means fewer resources per request. So let's provide more capacity to handle the extra requests. To provide more CPU headroom, you scale the cluster. Since the nodes also have little disk headroom, you decide to scale the cluster out rather than up. As you can see from the graphs, the new nodes come up on the top graph, the CPU saturation is alleviated, and the timeouts go away. What's your takeaway? CPU and timeouts are not linearly proportional. Once CPU is fully saturated, high latency turns into timeouts, there are retries, and then you get a snowballing effect.

So the rest of the week passes quietly, and you catch up on your sleep. But the next week, you get another alert that a timeout threshold has been breached. You acknowledge the page, grab your cup of joe, and start again to troubleshoot. At first glance, you notice that the errors look a little more abrupt than last time. You pocket that information. Once again, you start by checking the CPU, ruling out the easy problems. Last time you had an issue, it was CPU saturation. But here, the CPU across the cluster looks pretty healthy. Even though you see a slight surge in CPU when the timeouts occurred, the nodes that timed out did not experience a CPU deficit beyond the other nodes. You think back to the abruptness you originally saw. Hmm, what else can be abrupt? Garbage collection.

So what is garbage collection? Memory management done by the JVM to remove unused objects from the heap and free up memory. There are three types of garbage collection used by G1GC, the collector that we use for our Cassandra 3 clusters. First, there are minor collections. During these, the G1 collector will remove dead objects in the young generation and move long-surviving objects to the old generation. These collections stop the Java app briefly. Second, there are mixed collections. During these, the G1 collector will find live objects in young regions while concurrently finding live objects in old regions, a search called concurrent marking. These can take a long time, but won't stop the Java app. Third, the feared full collections. These occur because the Java app allocates too many objects that can't be reclaimed quickly enough. They stop the Java app to clear the entire heap, both young and old generations.

So what about GC? Why would GC be relevant? As you know, the collections can pause the app, in this case Cassandra. G1 tries to keep a pause target for collections. The limit is 200 milliseconds by default. But when a full collection is forced, the JVM must stop the application from running for a short time, during which Cassandra cannot service requests. So what do collection times look like for your cluster?
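If you wanted to pull those numbers yourself rather than read them off a dashboard, here's a minimal sketch using the JVM's standard GarbageCollectorMXBean API. The one-second polling loop and the class name are illustrative assumptions, not how Datadog's JMX integration actually collects these.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: poll the JVM's GC beans and print the average pause per collection.
// With G1 these beans typically appear as "G1 Young Generation" and "G1 Old Generation".
public class GcWatch {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long count = gc.getCollectionCount();   // collections so far
                long timeMs = gc.getCollectionTime();   // cumulative pause time in ms
                double avg = count > 0 ? (double) timeMs / count : 0;
                System.out.printf("%s: %d collections, %d ms total, %.1f ms avg%n",
                        gc.getName(), count, timeMs, avg);
            }
            Thread.sleep(1000); // crude polling interval, purely for illustration
        }
    }
}
```

The same counters are exposed over JMX, which is how an agent running next to the node would scrape them instead of polling in-process like this sketch.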
So now that you see your garbage collection stats, you notice that before and after the timeout errors, the longest collection pause times were a lot shorter. But there were some times when the collection pause times went above 200 milliseconds, and as it turns out, these coincide with when you were seeing timeout errors. You don't have any direct signals here, but you're able to infer from the lack of CPU saturation, and also the long-tail GC latencies, that GC is not running smoothly. The timeouts correlated with the long pause times are enough of a signal for you to undertake GC log analysis. So when you look there, you see some detail about humongous objects.

So what are those? Objects that take up more than half of a heap region are deemed humongous. The regions that contain these are automatically marked as old generation. You start to think about how these could affect collections. G1 can't collect a region without liveness information during mixed collections. If the concurrent marking that's supposed to attend to these old regions was not able to complete, then these humongous objects remain in the old generation, because mixed collections require the concurrent mark to finish. So if G1 wasn't able to complete the mixed collections, could some of these old regions with humongous objects have gone unreclaimed again and again? And then you ask yourself, well, what would happen if many of your cluster's writes were deemed humongous objects? As the number of old regions grows, the number of free regions remaining on the heap would shrink. Pretty soon, there would be no more free regions left to utilize for the live objects in the young generation. And eventually, the compounded allocation of many humongous objects would only be alleviated by a full collection, which would get triggered.

To avoid this, you need to ensure that concurrent marking completes so mixed collections can do their job. Giving concurrent marking more time to complete would be one way to fix this. But that would really just be a band-aid on the humongous objects problem. Instead, you could find a way to decrease the number of humongous objects, and therefore the allocation rate into the old generation. So you read Oracle's docs, and you see that you should be picking a heap region size where the ratio between the heap size and the heap region size is approximately 2048, while the region size is also a power of 2. You see that you have a significantly smaller heap region size, confirming why you might have accumulated these humongous objects. So you increase the heap region size, deploy the update, wait for a few more collections to take place, and voila. You relieved G1 of instances where objects take over 50% of a region, in turn freeing the heap of accumulated old gen, as you can see in the top graph. Your long-tail GC latencies that triggered the timeouts have also disappeared, as you can see in the bottom graph.
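To make the arithmetic behind that region-size choice concrete, here's a minimal sketch. The roughly-2048-regions target and the over-half-a-region rule come from the G1 documentation; the 8 GB heap, the 2 MB write, and the 1 MB "before" region size are made-up numbers purely for illustration.

```java
// Minimal sketch of the G1 region-size arithmetic described above.
// Aim for roughly heapBytes / 2048 per region, rounded down to a power of 2
// between 1 MB and 32 MB; an allocation larger than half a region is humongous.
public class RegionSizing {
    static long suggestedRegionSize(long heapBytes) {
        long ideal = heapBytes / 2048;
        long size = 1L << 20;                              // 1 MB minimum
        while (size * 2 <= ideal && size < (32L << 20)) {  // 32 MB maximum
            size *= 2;
        }
        return size;
    }

    public static void main(String[] args) {
        long heap = 8L << 30;                   // assume an 8 GB heap, purely illustrative
        long region = suggestedRegionSize(heap);
        long payload = 2L << 20;                // a hypothetical 2 MB write
        long smallRegion = 1L << 20;            // a hypothetical undersized 1 MB region
        System.out.println("suggested -XX:G1HeapRegionSize = " + (region >> 20) + "M");
        System.out.println("2 MB write humongous at 1 MB regions?  " + (payload > smallRegion / 2));
        System.out.println("2 MB write humongous at suggested size? " + (payload > region / 2));
    }
}
```

With these illustrative numbers, an 8 GB heap lands on 4 MB regions, so a 2 MB write stops qualifying as humongous, which is the effect described above.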
Last, but not least, a story about the graceful degradation of capacity. So far, you've gained a better understanding of your cluster, and you've come to understand the behavior of your GC. You tuned its heap region size, so now the collection times are much more reasonable, and you're getting better at recognizing CPU overutilization before timeouts start occurring. One month later, you see that CPU is sitting at an acceptable level most of the time, but was abnormally high for a few brief periods. So if you look here, you see over on the left that during a rolling restart, the load in the cluster was almost flatlining.

In the CPU saturation scenario, there was no real mystery as to why the CPU utilization was high. In that scenario, the query rate was also growing. At the same time, there was also a deficit of resources, so you follow protocol. The first thought is to scale. So to add additional headroom, your two options are to either scale out and add more nodes or scale up the instance type. Vertically scaling is a relatively safe and normal operation, and you think you want to do that because you see that there is also around 5% disk utilization in the cluster, so vertically scaling would be a little less wasteful. So you run the vertical scale. The first node on the updated instance type comes up. It receives streamed data from the other nodes. As it joins, it starts to serve CQL requests, gossip settles, all the other nodes view the node as fully up, and oof, a surge in errors. Like I said earlier, this cluster is on the critical path to your web application. During the operation, once the node joined, your web app started to experience an incredibly high error rate. You hurry to cancel the workflow and stop the operation, and after a couple minutes, the errors subside. Why, then? Why did we start failing queries when the node joined?

So what happened? The new node joins and starts servicing requests, and all of a sudden, all the nodes in the cluster have their CPU completely saturated. So how could the CPU get more exhausted after providing the cluster with more CPU resources? It's crystal clear that something interesting is going on here. Well, you see a 4x increase in request rate when you look at the request graph. The cluster is normally sitting around 10,000 to 15,000 reads per second, but after the node joined, you see a spike up to almost 30k reads per second. So you ask yourself, was this a retry storm? When the new node joined, the backlog of requests it started to serve could quickly overload the node. If those requests were pushed to other nodes, it could have overloaded them as well, leading to a retry storm. But then you look at the retry graphs and see that that wasn't the case. But then you look at this dark green line over here on the bottom graph, which represents a second type of query. It caught your eye, since you started to see upwards of 10k of them as soon as the incident began. So you ask yourself, what are these other queries? You look at the request type: range slices. But what are range slices? Range slices are cross-partition queries. Here are a couple of examples: one is when there is no partition key in the WHERE clause; another is when the IN operator is used on a column which is not a partition key. Range slices are typically expensive, and as a result are a common cause of unexpected timeouts. This aligns with what you just saw: a large spike in range slices along with an increased number of timeouts.
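For reference, here's a minimal sketch of the two query shapes just described, against a hypothetical users table. These CQL strings are illustrative stand-ins, not the actual queries from this incident, and depending on your Cassandra version the filtering examples may require ALLOW FILTERING exactly as shown or be rejected outright.

```java
// Minimal sketch of the range slice shapes described above, against a hypothetical table:
//   CREATE TABLE users (user_id uuid PRIMARY KEY, email text, region text);
public class RangeSliceExamples {
    // Single-partition read: the partition key is pinned, so the coordinator
    // knows exactly which replicas own the data.
    static final String SINGLE_PARTITION =
            "SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000";

    // Range slice: no partition key in the WHERE clause, so every token range
    // (and therefore many nodes) has to be consulted.
    static final String NO_PARTITION_KEY =
            "SELECT * FROM users WHERE region = 'us-east' ALLOW FILTERING";

    // Range slice: IN on a non-partition-key column has the same cross-partition problem.
    static final String IN_ON_NON_KEY =
            "SELECT * FROM users WHERE email IN ('a@x.com', 'b@x.com') ALLOW FILTERING";

    public static void main(String[] args) {
        System.out.println("cheap:     " + SINGLE_PARTITION);
        System.out.println("expensive: " + NO_PARTITION_KEY);
        System.out.println("expensive: " + IN_ON_NON_KEY);
    }
}
```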
So you wonder, where are the range slices coming from? You're tracking the queries against the app teams' keyspaces, and you see that there aren't any range slices there. So you wonder if the queries could be happening in the system-level keyspaces, which you hadn't been tracking, since they are typically uneventful. In this case, you dig up some more fine-grained metrics on requests. You end up finding that the driver was making a number of calls to the system tables without specifying a partition. So now you've discovered the source of the range slice queries. But why would the driver need to make range slice queries?

You look further and find that these range slices coincide with driver refresh periods. As it turns out, cluster-level information, such as topology, schema, tokens, and status, is held in the system-level keyspaces. And the client driver queries the system tables for changes to that information. Why? It helps the driver gather information about the current state of the cluster so that it can efficiently route queries. Now what you're starting to think is: when you vertically scale, you perform a topology change. And when you perform a topology change, all the nodes announce to all the CQL clients that the topology has changed, and every client says, I'm going to redo all these queries. So you knew that you could get a spike when you do a rolling restart of the client nodes. But what you didn't know is that when you restart a single server, you could see a similar effect. So you took things down when you tried to vertically scale. You're probably thinking, the driver took down the cluster? What? Your company owns a number of Cassandra clusters and you haven't seen this behavior before. So why is it happening to this degree? And why haven't you seen it before?

So now you turn to the client connections graph. The total number of connections in this cluster has grown to be very large. The top graph here shows the client connections. And what you're seeing is that the number of these connections has grown to over 60,000 per node. So why didn't we see this before? This is a huge number. Over the course of the past six months, the number of connections in the cluster has been steadily rising, increasing from 20,000 connections early in the year to its peak of 60,000 connections. So it's been kind of a slow climb. So what's causing this increase in connections? Well, like I mentioned earlier, the cluster is accessed by workers, and the number of workers has grown, so hence the number of connections to the cluster has grown. So the driver makes these system-level queries when there are schema, status, or topology changes in the cluster. More specifically, range slice queries happen when the client is trying to connect to a host for the first time or reconnect to an existing host. So when the node went down, the cluster was perfectly fine. Each client received the down event and maybe tried to reconnect, but it stopped pretty quickly. However, when the new node joined the cluster and began to service CQL requests, every single client application immediately began performing range slice queries, interrogating the cluster for the most up-to-date information, which effectively DDoSed the entire cluster and caused the outage. So in summary, the range slice queries were triggered by a server-side event, but their volume scaled with the increasingly large number of clients.

So how do we solve this? There are two ways to solve this problem. The first is that you could entirely disable the metadata queries in the client driver. The effect of this would be the loss of token-aware routing. Cassandra knows which nodes are responsible for which token ranges, and so knows where to forward requests. In an application that values low-latency queries, transmitting that information to the client is very valuable, because it reduces the number of inter-node requests needed to serve a single query. So maybe we don't want to disable it. But if we don't want to disable it, we need to try and control it.
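One way to picture the control knob that helps here: give each client a random delay within a refresh window before it goes back to the system tables, so they don't all interrogate the cluster at once. Below is a minimal, driver-agnostic sketch of that jitter idea; onTopologyChange and refreshMetadata are hypothetical stand-ins, not real driver APIs. The refresh window settings described next are the driver-level version of this.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Minimal sketch: instead of re-reading the system tables the instant a topology
// event arrives, each client sleeps a random amount within a refresh window.
// refreshMetadata() is a hypothetical placeholder, not a real driver call.
public class JitteredRefresh {
    private static final long WINDOW_MS = TimeUnit.MINUTES.toMillis(5);
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void onTopologyChange() {
        long delayMs = ThreadLocalRandom.current().nextLong(WINDOW_MS); // random point in the window
        scheduler.schedule(this::refreshMetadata, delayMs, TimeUnit.MILLISECONDS);
    }

    void refreshMetadata() {
        // Placeholder: this is where a driver would re-query system.peers, schema, etc.
        System.out.println("refreshing cluster metadata now");
    }

    public static void main(String[] args) {
        // Print a few sampled delays to show how clients spread out over the window.
        for (int i = 0; i < 5; i++) {
            System.out.println("client " + i + " would refresh after "
                    + ThreadLocalRandom.current().nextLong(WINDOW_MS) / 1000 + " s");
        }
    }
}
```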
So there are a couple of refresh window settings, which default to zero, and that results in the immediate request of information on status changes. By increasing that refresh window from zero to, let's say, five minutes, each client will randomly be assigned some sleep period within that five-minute window, and it'll wait for that amount of time before interrogating the cluster for updated information. So after applying a client-side change to this five-minute refresh window, you can see here that this dark green line, which represents the range slice queries, has spread out, because the client requests are now spread over that five-minute window. And eventually, this will allow us to resume normal cluster operations and finally complete the scale-up.

So what are our takeaways from these three stories? Well, the first takeaway is the importance of practicality for discovering metrics. Over the past three scenarios, you saw how you could dive into a particular problem, but it extrapolates more broadly. You saw a few request graphs across the three scenarios. The left column of graphs here is a slightly broader selection of request information, which is still only a subset of the request detail that we monitor. And then on the right side, you have another lower-level signal that we use, disk usage. From these, we can answer important questions. Is a problem isolated to a single node, a replica set, a rack, a data center, or is it cluster-wide? Is a recent trend problematic, or is it only the peak of a cycle, such as compaction? If you were looking at the disk graph, you might find that. And so many more. Having a breadth of information at your fingertips allows you to eliminate possibilities and quickly narrow the focus of what you have to investigate more deeply. And that allows you to move more quickly in a time-sensitive situation.

OK, and here's a second key takeaway. When you have a lot of observability, it shows you a lot of interesting places where you can look deeper. Datadog provides you with the convenience of automatically correlating metrics. This enables you to navigate between correlated events, whether within a trace or across patterns that coincide. It also gives you a more granular look at events. Otherwise, you might be limited to surface-level details. SSH-ing into a node and browsing through current JMX request counts or a few GC log files would not necessarily give you the full picture of what's under the hood. So up here in the top graph, you can see how you could take some errors and then automatically find the correlated metrics and the containers which are associated with those errors. And then let's say you pick a container. Well, from there you can see that container within the entire span of the containers or the nodes in your cluster, looking at all sorts of tag details on them. So my second takeaway for you is the importance of navigability. Ease of navigation gives you the ability to grow your understanding of the relevant metrics of your cluster from a small subset to a broader one, as well as how to monitor those for outliers.

And that's all. Does anyone have any questions? OK, I think I see two. Would you like to go first? Or either way?

It's been a while since I worked with Datadog, but I remember you guys are very good with helpful widgets and everything. I'm wondering if you have any kind of tooling or whatnot to monitor client-side Cassandra metrics versus server-side.
Because there are scenarios where you can see a lot of timeouts on the application side, but the server looks healthy. That could be due to some conflicts, and unless you look at the throttling metrics, you wouldn't know that. So if you can tell us more about it, that would be great.

Sure, yeah, absolutely. And that's a good question. It's something that over the time of us running clusters, we've gotten better at. We didn't start there, but it's definitely an important piece of observability to have, client side and server side. And so where we've come from is a place where we've had silos, with the different application teams and then us each having our own dashboards. And we started to kind of bridge that difference by sharing dashboards around and things like that. But then on top of this, there are traces, right? So APM traces. And you can start from a query on the client side, and then you can drill into how that query breaks down. How long do the different parts of that query take? And we're not quite there yet where we have a complete link from the client side to the server side, where for, let's say, a particular query we can drill down. But that's definitely the direction that we're looking at and interested in going, so you can figure out if some query you thought should not take that long is actually maybe spanning multiple partitions or something like this.

Let me get to it. You mentioned that you have all these correlated metrics. How do you scale that? Because there could be thousands of metrics, and correlation is computationally fairly expensive. So you have all these correlated metrics, and then you know which ones are most suspect, and then you can start looking at that. That may give you clues of what's actually the root cause. Whereas in your example, when you went through that, you said the engineer noticed this dark green line himself. He didn't really use your correlation. Right. Can you talk a little bit about that?

Yeah, sure. So I think this is kind of like where we were at versus where we're going. I picked some of these stories because they were interesting at the time for us. But then I guess looking back, it's 20/20 hindsight. So as time has gone on, we've come to recognize all the tools which are readily available to us. And we follow this process called dogfooding to try and make our own observability better with what's already available and what's becoming available. So correlated metrics is a perfect example of that. You might see some trends that follow the same kind of shape as the signals that you're seeing when there's an error. But the starting point where we were at, and what we're building from, is just having a large dashboard of all these metrics where, as you grow your understanding of the cluster, you think, OK, well, in general this is what we tend to see, this is interesting, and also this is interesting. So it's kind of more of a manual process at first. But then as time has gone on, we've recognized and started to take advantage of some of these more automated tools that, as you correctly point out, are important for making this robust.

How did you know to focus on this specific one?

Yeah, so for example, you could look at this error, right? And you could see, this is on the server-side dashboard, the cluster's dashboard, and this is some sort of timeout error. And then from there, you have the ability to navigate towards other kinds of information with the same tags.
So maybe like the containers which are associated with that error event. And then from there, you can move on over there and say, OK, well, this is the container. And this is kind of where the manual step comes in, right? I mean, you could then look at that container from the perspective of the cluster it's in and see, OK, is this error something which is limited? Is it something that has to do with a resource limitation or something like this? And maybe a helpful visual way of doing that would be to look at the entire scope of the cluster and to see, OK, this is the node, this is the error. But in this particular case, this is kind of more just handcrafted together. So these are two disparate instances, as opposed to one of the earlier scenarios where those actually were connected step by step.

Textual data?

Yeah, well, I think with any sort of log or error or anything like this, you're going to be tagging it with all sorts of tags that kind of say that this event happened on this service, or on this machine, or during this event, maybe, if you wanted to tag it in a certain way. Right, exactly. And the labels, you can think of it as they support that correlation. They can kind of lift it up on stilts and say, OK, maybe there's some sort of trend. Maybe there's a trend that happened completely across the company in a different application. But just because it has the same shape doesn't necessarily mean that it's caused by the same problem or something like this. So what you can use is tags, to be like, OK, well, not only are these the same shape, but also these are closely related services. This service is the target, and this is the client. And so you can leverage that information to kind of tighten the confidence in that correlation.

Yeah, yeah, so I mean, this is all kind of in its own separate group within the company. But yes, there's all sorts of things related to ML forecasting of trends. Is this more of a seasonal or cyclical trend, or is this a permanent direction that we're going in? And then in this case, though, yeah, figuring out the correlations of events is statistically based. Yes, so in fact, if I were to, and I didn't have this on the slide, but if I were to click on "view the correlated events," you would see a dropdown of a bunch of other events that maybe are errors on different services or something like this, that occur around the same time, or maybe they have the same shape, or maybe they have a lot of the same tags. But these are the things. And then of course, they're ranked based off of some sort of similarity heuristic.

Uh, huh? Maybe after. One more?

So in Cassandra, are there any metrics that we can monitor to see network issues?

Yeah, absolutely. So that's another panel that we have on our dashboards that I didn't show there. But absolutely, if you have some sort of network partition that is preventing some of your nodes from connecting with each other, that would show up on some of our dashboard graphs that are related to ingress, egress, all sorts of network data.

Oh, OK, yeah. So yeah, everything comes through ultimately the Datadog agent, which is configured to basically stream metrics from different feeds, JMX being one of them. So it might listen in on JMX, but then some other things as well. Yeah, so these run as a container alongside, yeah.

Well, I think I'm just about out of time. But thank you for attending. If you want to chat more, feel free to reach out to me. Yeah.