Alright, hi everyone. So you just have me today. My name is Jordan West. Unfortunately, my colleague couldn't make it today; bringing people back into the office gets people sick, and I got lucky and he didn't. So it'll just be me. I've been with Netflix for about three years, but I've been a Cassandra committer, working on Cassandra, for about the last eight years. Databases are kind of a life sentence; I've been doing those for 10 to 12 years now. My colleague Chung couldn't join, but he's been at Netflix, and about half of the work here is work that he did, so I'll try to give him credit where it's due.

I work on a team at Netflix called Online Data Stores. If you were just in Joey's and Vidya's talk, we provide the foundational stores underneath the abstraction layer that Joey and Vidya talked about. Our small team manages a whole host of databases. Cassandra is one of them. We also do Elasticsearch, EVCache, a layer on top of Redis called Dynomite, and ZooKeeper, which honestly is the one that wakes me up the least at night. And then we have a whole host of relational data stores that we offer. Cassandra is the only one that we allow into the important paths of the streaming service at Netflix. Bad partitions are the things that tend to keep me up at night, along with schema disagreement. Even though we have all the wonderful abstractions, sometimes clients still do things that you're not expecting, and they cause problems. We'll talk a little bit about what we've built to identify them and deal with them.

A little bit about Cassandra at Netflix. If you saw this talk or a version of it at ApacheCon, I haven't updated the numbers; that's intentional, but it will still give you an idea. We have over 900 clusters in production; if you count test clusters, it's closer to 1,600 across environments. We store over 12 petabytes of data and handle about 12 million requests per second, roughly 60-40 read-write across the whole fleet, though individual clusters look very different. This was just me taking our graphs and zooming out to the whole fleet to figure out what we do. When I gave this talk a year ago, we were pretty much entirely on Cassandra 3.0 with a small contingent still on 2.1. I am proud to say that as of this Monday, we are over 85% on Cassandra 4.1. We've been upgrading a few hundred clusters a week, after about six to eight months of testing and qualifying the database and the release. If you were in Josh McKenzie's talk, he was talking about the frequency of releases; for us, we spend a lot of time qualifying a release before we roll it out, but once we qualify it, we move quickly. We were at 0% right before Thanksgiving, I would say.

So why are bad partitions bad? Because they have actual business impact. This is from a Cassandra cluster during an incident. You can see latency there going over millions of milliseconds for a short duration of time; we'll talk about why it's short in a moment. And when you zoom out to look at the API for playing videos, you can see that same latency spike at the same time. So this is preventing people from watching Netflix, which is, you know, bad. Netflix has this great multi-region architecture, so we're able to see this problem in a region and fail out of it. That doesn't mean the database itself has actually healed.
And so if we look at the replica latency, the wider box highlighted there on the left is the actual incident window, but then you see these spikes continuing whenever the data was still being read. That's because the bad partitions are still present and still being read; they're just not affecting customers anymore because we've failed out of the region that has the problem. So before we can fail back in and use our resources to their fullest, we have to actually fix the problem in the underlying database. In this case (and we'll talk about the different ways that bad partitions can show up) this is a much more zoomed-out graph, but it had to do with the number of SSTables that were included in every read. If you're using LCS, for example, you're really trying to keep the number of SSTables per read to a certain size. You can see that typically for this cluster it was somewhere between two and four. After we mitigated the issue, it went back down to one or two. But during the incident and leading up to it, we had hints that this was going to happen, because we started to get to eight, nine, ten SSTables per read. That's because this is a write-heavy cluster; it was starting to dump into L0 faster than compaction could keep up.

So what types of bad partitions do we have? The first one we don't really deal with at Netflix, but I did at my previous not-to-be-named employer: the total size of the partition. When you get into gigabytes per partition, you have problems with the column index and you start to get performance issues on read. At Netflix, we tend to have more of the bottom three. That is, having millions of small rows in a partition: maybe you have two or three columns per row, but you have five, ten, twenty million rows in the partition. Maybe you aren't fronted by the abstraction layer that we're talking about, or, again, sometimes clients do things with the abstraction layer that you don't expect, and this happens anyways. The same happens with tombstones, typically with TTLs, but you also get this queuing use case where people write a bunch to a partition and then delete from it really, really rapidly, and those tombstones cause a lot of pain. Or, again, writes happen faster than compaction, and like the incident we showed earlier, you have a problem where the SSTable count grows, so now you're reading off disk way more than you want to and latency spikes.

So what makes them bad? Besides latency, which is the obvious one we see at the application layer: CPU usage spikes, because now Cassandra is having to do comparisons over multiple rows. Memory usage grows because you're allocating more buffers, you have more object overhead, more GC, more GC pauses, you're reading more files, and compaction slows down or can't keep up, so it's kind of this self-exacerbating problem. And then again you have the read latency issue. If you let it go long enough, you'll actually start to see bands of latency across the cluster: the replica set that has the bad partition has the worst latency, then the one next to it is picking up traffic for it and has the next worst latency, and it continues around the ring. You get this cascading read latency problem if it's not addressed quickly enough.
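As an aside, if you want to spot-check that SSTables-per-read signal on a single node rather than from fleet-wide graphs, nodetool can show it per table. A minimal sketch, with the keyspace and table names as placeholders:

    # Per-table latency histograms, partition sizes, and, in the "SSTables"
    # column, how many SSTables each read touched, by percentile.
    nodetool tablehistograms my_keyspace my_table

    # If the high percentiles creep from the usual 2-4 up toward 8-10, that is
    # the same early warning as the graph above: compaction falling behind writes.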
So we sat down after having a few very serious incidents, including the one I showed the graphs of earlier, and we asked: how do we make this better? We broke it down into three areas. We need to know which partitions are the problem, we need to be able to block those partitions, we need to mitigate them, and in an ideal world we'd prevent them. We're not going to talk much about prevention today. Some of that was in the talk Joey and Vidya just gave, which is doing smart things at the application layer so these partitions never get created in the first place. You could probably dedicate a whole talk to prevention on its own, just like the 30 minutes Joey and Vidya spent talking about how they fix a good chunk of these problems with chunking and bucketing and that sort of thing.

So, identification: how do we figure out which partition is the problem? We know by looking at the graphs that we have a problem, but we don't know which key is the issue. When we started this project, all we had was what we called the wide row report, or the bad partition report. Basically, as compaction is running, it logs to your logging system that it has seen a partition over a certain size. We took that log message and built a report using Elasticsearch and Kibana. You can see we've broken out what cluster it came from, what keyspace, what table, the key, the time the message was logged, and then the data size on the right. Again, in our case these can be partitions in the tens of megabytes, but to get to tens of megabytes this data set had to have millions and millions of rows. This was fine, but you don't actually know that any of these are the ones causing the problem, because they could just be sitting on disk and not be the ones being read.

So how do you know which one is being read? Well, this is what we used to do. We sort of knew what key, or maybe the customer told us what key. We would ask which nodes own that key, then we would ask which SSTables own that key, and then we would literally dump the SSTable to prove to the customer: here are your millions of rows. And maybe you do some horrible bash things to get a count of the rows; there's that 'type: row' field in the JSON, so maybe you're just counting the number of times 'type: row' comes out for a specific key. And then maybe you're also looking at the SSTable metadata, because sometimes you want to figure out which levels the data is in, to show, hey, this is happening because writes can't keep up, because all the data is in level 0. So this is a lot of steps for someone to do on call, it's super error prone, and it's hard to share that output with a customer in an easy way. We wanted to make that better.

When we first gave this talk, very little of this was open source; what's nice now is that almost all of it is. The first thing we added was a little extension to nodetool getsstables, which is what we're using on the second line there. If you're using leveled compaction strategy, which we are in most cases, it dumps the level along with the SSTable name when you ask for it. That way, if a customer is asking me what's going on and I run getsstables -L and I see a bunch of level 0 in the output, I can quickly say that writes aren't keeping up with compaction. If the data is spread across the lower levels instead, we know we have a different problem.
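To make that concrete, here is roughly what that manual, on-call workflow looks like as shell commands. It's only a sketch: the keyspace, table, key, and file paths are placeholders, and the exact spelling of the level flag added in CASSANDRA-18023 (and of sstabledump's key filter and JSON spacing) may differ by version, so check nodetool help getsstables and the tool help on your build.

    # 1. Which replicas own the suspect key?
    nodetool getendpoints my_keyspace my_table suspect_key

    # 2. On one of those replicas: which SSTables contain the key?
    #    With the 4.1 extension, also print the LCS level per SSTable;
    #    a pile of L0 entries usually means compaction is behind on writes.
    nodetool getsstables my_keyspace my_table suspect_key
    nodetool getsstables -L my_keyspace my_table suspect_key

    # 3. The "horrible bash things": dump the partition from one of those
    #    SSTables and count how many rows it contains.
    sstabledump -k suspect_key /var/lib/cassandra/data/my_keyspace/my_table-*/nb-1-big-Data.db \
        | grep -c '"type" : "row"'

    # 4. Check the SSTable metadata, e.g. which level it sits in.
    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/nb-1-big-Data.db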
This also helps us see how spread out a partition is over the different levels, so you can see whether the update rate is causing pain or something like that. This was open sourced in Cassandra 4.1 in CASSANDRA-18023. The bigger thing we added for identification was what used to be called toppartitions; in 4.0 and above it's called profileload, and it's much more robust. If you're trying to figure out what's being read or written and causing you a problem, profileload is where you want to go. We added extensions to it: toppartitions would just tell you which partitions were read or written to the most, profileload added latency to that, and then we added SSTable count, row count, and tombstone count. So you can sort the profileload output by which partitions have the most SSTables being read from, which ones have the most rows being read, and which ones have the most tombstones. Again, that's open source in 4.1 and above, in CASSANDRA-18022.

Along the way we also found and fixed some issues if you're still running Cassandra 3.0. This was CASSANDRA-17254: basically we were using byte buffers wrong, so you would run toppartitions on 3.0 and just get an exception. As an on-call operator, the database telling you nothing besides "I'm broken" is probably one of the most frustrating things when your customer is breathing down your neck. So we fixed this along the way, and thankfully in 4.0, when the profileload code was rewritten, all of that went away too. So if there's one thing I can recommend from this, it's to get on Cassandra 4.1.

Because I'm solo (my colleague would usually run the demos here), I'm not going to be so bold as to run the demo live, so I'll just talk through a little of this output. This is using 3.0, so it says toppartitions and has some additional flags; in 4.0, again, it's profileload, and some of these flags don't necessarily need to be passed. You can just do nodetool help profileload and it'll tell you what to use instead.
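For reference, the shape of those commands is roughly the following. It's a sketch: the keyspace and table are placeholders, the sampling duration is arbitrary, and the options for ranking by tombstones, rows, or SSTables differ between versions, so lean on nodetool help profileload as I mentioned.

    # 3.x: sample the hottest partitions for one table (duration in milliseconds).
    nodetool toppartitions my_keyspace my_table 10000

    # 4.x: profileload replaces toppartitions; with the CASSANDRA-18022 extensions
    # the sampled partitions can also be ranked by SSTables read, rows read, and
    # tombstones read. Check the available options for your version first.
    nodetool help profileload
    nodetool profileload my_keyspace my_table 10000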
This was from a real production incident on a Cassandra cluster where the user was writing everything to one partition key and then deleting everything from it very, very rapidly. It started to cause production issues, and it was a pretty critical cluster. At the top, when we asked which partitions were being read, sorted by tombstones, we could see that 'started' key, literally one of three keys in the entire table, starting to grow; at this point it had 5,900 tombstones. That's not very much, but we had just seen an incident a week before where the count got up to over half a million within about 10 minutes, so we were watching this number grow and grow and grow. The way we used to mitigate this (and I'll talk about how you can do it better in a few slides) is we would just drop GC grace seconds for the whole table, and in this case, since there are only three keys, that kind of works anyways. So you can see we dropped GC grace seconds and ran a full compaction, which is a bit inefficient, and we'll talk about that as well. But when we ran the sampling again, we could instantly see that enough of those tombstones were over an hour old that they started to get dropped, and now we're down to 2,970. If we kept going, we would see the number go down and down until it stopped causing a problem. This insight was something we never had before. Before, you would just drop GC grace seconds and hope it would work. It's much nicer to be able to go to a customer, paste this output, and show them that the problem they're causing is going away. So that's one way to mitigate things.

But before we dive into the major mitigation, we'll talk about a feature that's new in 4.1 that we backported to 3.0 (but again, just upgrade to 4.1 and use it), which is to block partitions, or to denylist them. It used to be called the blocklist in early patches; now it's called the denylist. It was introduced in CASSANDRA-12106, and it prevents reads, range reads, and even writes to a given partition key. I will say, since the last time I gave this talk, I never thought there was a use case for blocking writes, and then we found it. We had a case where no matter what we did, the customer was writing so much that even as we mitigated it, they would just recreate the problem. So being able, as an operator, to turn off writes to a single partition at the application level was huge for bringing the rest of the database back online. To have one partition take down your whole distributed system is kind of a problem, and being able to have this control is great. Being able to have it fail range reads that read over the partition is also great. And all of this is configurable: by default we prevent reads and range reads, we don't prevent writes, and then we have a property we can set on the cluster to turn off writes as well.

You do this either by writing to a system table (you go into cqlsh and insert into the denylist table in system_distributed) or through JMX, where there are calls you can use. In our case we have some tooling built on top that calls JMX and writes to it. And hopefully, via some CEPs, this is something the Cassandra Management API can do going forward. But it brings that control back to the operator. What we used to do before was make our customers build this into their applications if the application was critical enough, and sometimes we would be in an incident saying, an action item for you is going to be to please go build this denylist into your application. Now we have this for every application that uses Cassandra, and we have control as operators to, again, prevent one key from breaking the whole system. There were some bugs in Cassandra 4.1 that we fixed as we started to deploy this, in CASSANDRA-18116. There was basically an infinite load loop if your cluster didn't start up completely healthy; it would cause so much load on the cluster, and so much logging, that the whole cluster would go down. But again, if you're on the most recent version of Cassandra 4.1, this isn't a problem, so upgrade.
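As a rough sketch of that CQL path: the table name and columns below are how I remember the 4.1 denylist schema, not something from the slides, so treat them as assumptions and verify against your version (the keyspace, table, and key values are placeholders).

    # Denylist one partition key by writing to the distributed system table.
    # Table and column names are assumptions; confirm with DESCRIBE on your cluster.
    cqlsh -e "INSERT INTO system_distributed.partition_denylist (ks_name, table_name, key)
              VALUES ('my_keyspace', 'my_table', textAsBlob('suspect_key'));"

    # Once the feature is enabled in cassandra.yaml, reads and range reads to the
    # key are rejected by default; blocking writes is behind a separate setting,
    # and the list can also be managed over JMX, which is how our tooling drives it.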
So the last thing we'll talk about today is mitigation, and mitigation is really about automating away that whole dance of setting GC grace seconds for the whole table and then running nodetool compact. Anyone in the audience know why lowering GC grace seconds for your whole table is dangerous? What did you say? Yeah, so GC grace seconds essentially controls how correct your data is when it comes to deletion. You typically set GC grace seconds to be higher than however long it takes to do a full repair of your cluster, which for us, since we're not running incremental repair yet, is on the order of 10 days. Customers typically don't set TTLs of 10 days; they set them much lower. So when you do what we did before, lowering GC grace seconds and running a compaction, you're actually running the risk of inconsistent deletes across the whole table. For the example I talked about with three keys, okay, that's not a problem, but when you have a hundred million keys, taking that risk of inconsistency to fix one key is a big risk. The second issue is using nodetool compact, which is really heavy-handed: it does what's called a full compaction. If you're using something like STCS, this can be really expensive and take a really long time. If you have, say, a dense node of about a terabyte and you run nodetool compact, it could take an hour while you compact a whole bunch of other keys before you get to the one that's actually the problem. So we wanted to be able to do a more targeted compaction.

This is where my colleague Chung, who couldn't be here today, came in. He wrote this tool. It's also open source, as of CASSANDRA-17711, but upgrade to 4.1 and you get it. What nodetool forcecompact does is take the keyspace and the table, and then a list of partition keys that you want to compact; you just lay them out on the command line, and it ignores GC grace seconds for only those keys and only compacts the SSTables that contain those keys. So again, if you have a terabyte of data but three partitions that are a problem, you're only going to touch the SSTables that have those partitions, and you're going to ignore GC grace just for those keys, effectively setting it to zero for them, clearing out all the tombstones or all the TTL'd data, and then resuming. We have used this a lot in anger in production, especially, again, when customers do things we don't expect and try to delete really rapidly, or set TTLs so low you wonder why they're using a database.

So again, I would have loved to show you a demo, but I wasn't prepared for this last-minute change to just me speaking, so we'll look at an example. On the left is what we did before, altering GC grace to an hour and then running compact, and on the right is what we did using forcecompact. There's a part cut off, but I would have passed the 'started' key as the last argument there. So 'mark and no one' is the keyspace, 'started annotation operation ID' is the table, and 'started' is the key. It has the same effect, but without all of the pain on the cluster.
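Side by side, the two approaches on that slide boil down to something like this. Again a sketch: keyspace, table, key, and the restored GC grace value are placeholders, and the forcecompact syntax is worth confirming with nodetool help on your version.

    # Old, heavy-handed mitigation: lower GC grace for the WHOLE table, run a full
    # compaction, then put GC grace back. This risks inconsistent deletes on every
    # other key, and on a dense STCS node the compaction itself can take an hour.
    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 3600;"
    nodetool compact my_keyspace my_table
    cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000;"

    # Targeted mitigation (CASSANDRA-17711): ignore GC grace for only the listed
    # keys and compact only the SSTables that contain them.
    nodetool forcecompact my_keyspace my_table suspect_key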
For some future work, we're looking at automating this: if we see a key cross some tombstone threshold as we're reading it, and we have settings that say it's okay, we'll just automatically force compact that key away, removing the need for an operator to even come do this after the customer tells us there's a problem or after we get paged in the middle of the night. Joey kind of hinted at it, but a lot of my goal is to sleep through the night. I like sleep; I really, really like sleep. Ideally the system would do this automatically for us, instead of us getting a latency page and then going through all these steps. As we're reading, we've already identified the partition, because it's in memory and we have the key, and we have the tools to just mitigate it away.

The other thing this has been very useful for: sometimes customers, or in this case sometimes us, introduce bugs into the client and we write what are called future tombstones, tombstones that are significantly in the future, to the point where they cover future writes. In this case we wrote into 2033 instead of 2023, which meant that any write up until 2033 would just disappear from the system. When this happens you have some very, very odd choices, like rewriting SSTables and so on, but if you know they were written in error, you may just want to drop them, and forcecompact will let you drop these tombstones as well. This is something that, without forcecompact, probably would have taken two or three senior engineers one or two weeks to fix, to be sure, and instead took two minutes. So that's a huge one for us, and more sleep for me.

So sorry today's a little short because we didn't have the demo, but y'all are troopers for staying till five o'clock and listening to me anyway, so I'll give you a few extra minutes back. With that, thank you all for being here; it's really awesome to see the Cassandra Summit again. If you have any questions, we have a little bit of time.

So the question was: is there a reason Cassandra doesn't block these future tombstones by default? I think it should, but there are people who think it's a valid use case. That's the use case: some people want to literally close out a key forever, in perpetuity. What you'll see is, when Cassandra does things that we disagree with, we put it behind an abstraction and prevent people from doing it that way. That's one of the reasons we rely on the abstraction layer so much, and I thank Vidya and Joey and team, and Vinay, for helping me sleep more at night.

The next question: you said during the talk that you're not using incremental repair; what's preventing you from moving to it? We had to move to 4.1 first. Our biggest task for next year is to get the fleet on incremental repair. Once you're on incremental repair, you can drop that GC grace considerably. We're pretty conservative, so our target is to drop it from 10 days by default to maybe a day or 12 hours, but even that is a huge, huge win in terms of tombstones. But yeah: adopt 4.1, adopt incremental repair. Anything else? If not, hopefully you all have some nice plans to socialize and meet and talk Cassandra tonight, and again, thank you for being here. It's really nice to see everybody.