Happy 4:10 p.m. slot, y'all. You are the real troopers, getting to the end of the conference, so thank you. My name is Isaac Reath. I lead the NoSQL Infrastructure team at Bloomberg, and with me today is Lindsay Serovjack, an engineer with me on the NoSQL Infrastructure team.

It's been almost seven years since Cassandra as a service came to Bloomberg. Over that time we've learned a lot about the many obscure ways that Cassandra can break, and this talk will present a few of the problems we've seen over the years, some of the warning signs Cassandra gives you for identifying these issues, and what you can do both to fix the issue in the immediate term and to prevent it from happening in the future.

Before I talk about all of that, a little about Bloomberg in case you aren't aware of what we do. Bloomberg is a technology company that provides financial data, primarily through software known as the Bloomberg Terminal. The Bloomberg Terminal has more than 350,000 subscribers and offers different financial analytics tools, which we call functions, to finance professionals around the globe who use the data and news we provide to do their jobs every day. We have over 21,000 employees in 70 countries, more than 8,000 of whom are engineers. Our engineers are active members of the open source community, both as users and contributors; if you've merged my PRs (Stefan's in the audience somewhere), thank you. We even have people who serve as PMC members. Over the last two decades we've transitioned to being an open-source-first company, and we use hundreds of open source projects to power our products and infrastructure.

A little bit about the NoSQL Infrastructure team: we offer Cassandra as a managed service to engineers at Bloomberg. Our goal is to make it easy for engineers at Bloomberg to use Cassandra to power their applications without needing to understand the complexities of managing a large-scale distributed database. We started in 2017 and we've been growing ever since. The scale at which we now operate is around 2.5 petabytes of data served by 3,000 Cassandra nodes, broken into 250 clusters. Operating at this scale has presented a number of interesting challenges, and we'll dive into those next through a bit of a role-playing exercise. Lindsay and I will both take the role of engineers doing an on-call shift for the NoSQL Infrastructure team, and we'll talk a little bit about the problems we see and how we fix them. But as with all media based on real life, we'll preface our story with a disclaimer: the events you're about to witness are fictitious and have been condensed into a single on-call shift for dramatization purposes. No engineers were woken up in the middle of the night in the making of this presentation.

So I start my on-call shift and I get my first issue. I log in to figure out what the issue is and assess the impact. I see that disk usage is getting pretty high on a number of our Cassandra nodes. It's not yet impacting our users, but we need to figure out what's going on before the disks fill up. Looking further at some of the Cassandra metrics for the cluster running on these machines, I see that there's a table with a large number of data files. And when I look at the droppable tombstone ratio, which is the ratio of data that Cassandra could clean up to live data, it's very high.
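For reference, a quick way to eyeball this from the command line; the paths and keyspace/table names here are placeholders, not our real ones:

    # per-SSTable estimate printed by the bundled tool
    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db | grep -i "droppable tombstones"

    # table-level view of SSTable counts and tombstone activity
    nodetool tablestats my_keyspace.my_table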
So I need to ask myself two questions: one, what is deleting the data and thus generating the tombstones, and two, why isn't compaction removing those tombstones from disk?

I take a look at the table schema and I can see that a TTL is very clearly what's generating the tombstones, and the table is compacting data using time window compaction strategy. On the surface, nothing seems too strange here. Then I ask myself, why isn't time window compaction strategy cleaning things up? Luckily Cassandra offers a tool called sstableexpiredblockers to help figure that out. Using that tool I discover the culprit, and I use the sstablemetadata tool to actually inspect that data file. One thing I see is that this SSTable is mostly expired data; however, the maximum expiry time of the data in this SSTable is much later. This is what's preventing the data file from being dropped and, in turn, letting other tombstones build up.

So what's actually causing the expiry times to differ? I look at some application code and I see the culprit: there are code paths with a USING TTL clause added to their insert statements, in this case using a TTL of two minutes. However, if you remember from the last slide, our default TTL was set to six months. That means some of our data expires in two minutes while the rest expires in six months.
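To make the mismatch concrete, here's a hypothetical sketch of what a schema and write path like that look like; the table, columns, and values are made up for illustration, not our actual data model:

    -- table-level default: everything written here expires after roughly six months
    CREATE TABLE my_keyspace.readings (
        sensor_id    text,
        reading_time timestamp,
        value        double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH default_time_to_live = 15552000   -- ~6 months, in seconds
      AND compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1'
      };

    -- but one application code path overrides that default with a much shorter TTL,
    -- so two-minute and six-month expiries end up mixed into the same time windows
    INSERT INTO my_keyspace.readings (sensor_id, reading_time, value)
    VALUES ('sensor-42', toTimestamp(now()), 21.5)
    USING TTL 120;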
To explain why this is a problem, I'll walk through a small example of how time window compaction strategy works. But first, a quick overview of TWCS for those who aren't familiar with it. TWCS is a compaction strategy whose objective is to create a single data file per time window. Data in a given time window will generally be queried together in a time series use case, so having a single data file that holds all the data for that window, and that expires all at the same time, is a really nice optimization. However, once a single SSTable has been created for a given time window, TWCS won't merge it with any other SSTables from that window; it instead relies on dropping fully expired SSTables, as well as tombstone compactions, to actually clean up the tombstones.

Now the example. In our very simplistic example, let's say we have a time window of one minute, and we start with no data on disk. We write some data and generate our first SSTable, which we'll call SSTable 1. The numbers you see on this SSTable represent partition keys; those become important in a minute. We continue writing data and flush another SSTable, SSTable 2, and then compaction runs and merges them together into an SSTable called SSTable 3. Now one minute has passed and a new time window has started; all the SSTables for the old time window have been merged together. SSTable 3 is actually a pretty nice SSTable: all the data in it expires within two minutes, and it all expires at the same time. We keep writing more data and create a new SSTable, SSTable 4, and then SSTable 5. This is where we start to run into problems. This data is not nearly as nicely organized: some of the data in each of these SSTables expires in two minutes, but some of it expires in six months. Compaction runs again and merges them together into a new SSTable, which we'll call SSTable 6.

Now that two minutes have passed since the original SSTable was created, you'll see that SSTable 3 is fully expired. It would be awesome if we could clean up this data; we don't need it anymore, and we know we have a TTL'd, immutable-data kind of use case. However, recall that Cassandra will no longer run a compaction to merge it with SSTable 6, because they're part of different time windows, so instead we need to rely on purging the fully expired SSTable. To do that, Cassandra first checks whether any partitions overlap with live data before dropping an SSTable, and if they do, that SSTable won't be dropped. What you'll see in this case is that, among other partitions, partition 0 and partition 1 overlap between SSTable 3 and SSTable 6. Even worse, partition 1 won't expire for six months, which means SSTable 3 is going to sit around for a long while before we're able to clean it up. This is our first issue, and we'll talk about how to fix it in a minute.

Continuing on, we flush two new SSTables, SSTable 7 and SSTable 8, which get compacted again into a new SSTable, SSTable 9. Now more time has passed, so the two-minute-TTL data that was written into SSTable 6 has expired as well. SSTable 6 is not fully expired, but it would be really nice if we could run a tombstone compaction to clean things up here. So why isn't Cassandra running one? Well, SSTable 6 will only have a tombstone compaction run on it if, again, there are no overlaps with other live SSTables; otherwise, by default, the tombstone compaction won't run.

So what can we do to actually get Cassandra to clean things up here? The first thing we can try is setting the unchecked_tombstone_compaction parameter in the compaction strategy. This lets Cassandra run a tombstone compaction even if the SSTable you want to compact overlaps with other live SSTables. In that case only the tombstones which do not overlap will be purged, and this could have helped clean up the data in SSTable 6. However, if unchecked tombstone compaction isn't enough, we can also set allow unsafe aggressive SSTable expiration. The name sounds really scary, and I think it's supposed to be, because it can end up deleting data you don't want deleted if you don't have a 100% append-only, TTL'd use case, so be very careful with it. You need to set both a JVM option and alter your table's compaction strategy to enable it, and it would have allowed us to drop SSTable 3, for example. A sketch of both settings follows this section.

That helps clean things up in the immediate term, but what could we do longer term to fix the problem? If you can change your data model to work better with time window compaction strategy by bucketing different expiry times into their own tables, that can prevent the issue in the first place. But this is only really possible if you know up front all the different TTLs you need, and we understand that isn't always feasible; sometimes you really do need fully dynamic row-level TTLs to solve your problem. In that case, if you can't break apart your TTLs, don't use time window compaction strategy. We'd probably suggest leveled compaction strategy instead; it will purge tombstones more aggressively for these kinds of use cases. And one thing we're really excited about in Cassandra 5 is unified compaction strategy, so hopefully we can all spend a lot less time thinking about compaction.
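Here's the sketch of those two knobs. Treat it as illustrative: the table name is hypothetical, and the exact option and JVM property names for the aggressive expiration come from CASSANDRA-13418 and can vary by version, so check the documentation for your release before enabling them.

    ALTER TABLE my_keyspace.readings WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1',
        -- allow tombstone compactions even when this SSTable overlaps live SSTables
        'unchecked_tombstone_compaction': 'true',
        -- drop fully expired SSTables without the overlap check; only safe for
        -- purely append-only, TTL'd data, and it also requires the JVM flag below
        'unsafe_aggressive_sstable_expiration': 'true'
    };

    -- and in jvm-server.options (jvm.options on older releases), something like:
    -- -Dcassandra.allow_unsafe_aggressive_sstable_expiration=true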
So after getting the tombstone issue under control, it's a really busy shift for me, and I have another problem to solve. All of a sudden we're starting to see write timeouts occur on a few nodes in one data center. We're not really sure why this is happening, but these writes always seem to be timing out in the same data center, and the errors indicate that three responses have been received in that data center. What this suggests is that a cross-data-center handoff is failing in Cassandra.

Looking at the remote data center where we see these timeouts, we see that one node had an out-of-memory error. But this is really strange, right? Shouldn't the node be down? Why is it still getting requests? Why is it still participating in anything? Well, if you use the default JVM options that ship with Cassandra, the exit-on-out-of-memory-error flag will be set, which causes Cassandra to exit when an out-of-memory error occurs. However, that flag only works when the out-of-memory is the result of heap space or metaspace exhaustion; on a native-thread out-of-memory error, the process will just sit there, still running.

You might then ask yourself: if that process is still running, why is it still reporting as up to the rest of the cluster? That has to do with gossip. If the gossip threads are still running, the node will keep responding to the pings that other nodes send it, so the cluster will think this node is healthy when it actually can't create any new threads. And why does this cause a timeout? When that node is chosen as the cross-data-center coordinator for a write, the forwarded write request will, most of the time, just sit there pending and never be processed. Because that request isn't processed, the cross-data-center coordinator can't forward it on to the replicas in that data center, and from the query coordinator's perspective you just see a timeout. The flag and the log message to watch for are sketched below.
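For context, this is roughly what's involved; the file name and the log wording differ a bit across Cassandra and JDK versions, so take this as a sketch of what to look for rather than exact strings:

    # In the JVM options file Cassandra ships (jvm-server.options on 4.x,
    # jvm.options on older releases) the exit-on-OOM behaviour comes from:
    -XX:+ExitOnOutOfMemoryError

    # That only fires for heap/metaspace exhaustion. A native-thread OOM leaves the
    # process half-alive, so watch the logs for it and take the node down yourself:
    grep -i "unable to create.*native thread" /var/log/cassandra/system.log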
So what can we do in this situation to make sure we don't have to deal with this problem? Well, first and foremost, you should identify that a native-thread out-of-memory has occurred, and when it happens you need to jump into action right away. Automation that makes sure the node is killed, or JVM agents like jvmquake, can be really good at handling this for you, but at the very least you need to make sure the node comes down. That mitigates the issue in the immediate term. Longer term, you need to figure out why you're not able to create new threads on that machine. A lot of the time at Bloomberg when we've seen this, it's because some configuration on the machine is wrong: nproc isn't set properly, or maybe you're running with systemd and forgot to set the TasksMax parameter (a sketch of both follows). One thing to note: even though we found this in Cassandra, it's really a problem that plagues all JVM software, so we definitely recommend taking the learnings from this problem and applying them to all of your JVM-based applications.
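As a sketch of the two settings we usually check (the values are illustrative, not tuned recommendations):

    # /etc/security/limits.d/cassandra.conf -- per-user process/thread limit (nproc)
    cassandra  -  nproc  32768

    # systemd drop-in for the Cassandra unit (systemctl edit cassandra):
    [Service]
    TasksMax=infinity    # or a concrete ceiling comfortably above your thread count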
This has been a pretty busy on-call shift for me, and Lindsay can see that I'm getting pretty tired, so she's going to take it over from here and I'm going to go get some rest.

Enjoy your sleep, Isaac. So I've barely been on call and I'm already getting paged; let's see what's happening. All of a sudden an application is seeing sporadic authorization errors, but we haven't made any updates to the authorization policies for this cluster. On the server side we see that we're under a lot of load, and we see a specific error message about a failure to authorize a user. What's happening under the hood?

To figure that out, we have to look at how Cassandra does authentication and authorization. Cassandra maintains internal caches of authorization policies, roles, and credentials on each node, and on a cache hit the request is authorized. However, if a cache entry has reached its expiry, or if the policy isn't present in the cache, then Cassandra performs an internal query against the system_auth roles and role_permissions tables to refresh it. So why is the failure happening? These authorization queries get added to the queue with all the other queries, so when a node is already overloaded, authorization queries can start to time out as well. If an auth query fails, there really isn't an exception that can be thrown other than UnauthorizedException or AuthenticationException, depending on which part of the pipeline fails. So when this error is sent back to the client, it comes back as an authorization failure instead of being categorized as a query timeout, which applications would typically retry.

So what can we do about this? Well, these exception types indicate that the failure is the result of a failure to authenticate or authorize, including when the system_auth table queries time out, so we know these failures should be retried. We highly recommend adding retries and error handling to your application for this case if you haven't already. This error isn't always intuitive, but the error message itself has details about the failure so you know you're hitting this case, and if you're curious and want to understand it more thoroughly, CASSANDRA-15041 has more information.

As more of a medium-term solution, increasing the validity of the caches with configs such as roles_validity_in_ms, permissions_validity_in_ms, and credentials_validity_in_ms results in fewer cache refreshes, which can further help reduce these failures. And as a long-term solution, if you're not on Cassandra 4.1, you'll want to upgrade so you can use this last feature: configuring Cassandra to asynchronously refresh these caches with the roles_cache_active_update, permissions_cache_active_update, and credentials_cache_active_update settings. You should have access to these slides, so you can look at this later; there's also a sketch of these settings after this section.

To emphasize how these last two solutions help, let's revisit the diagram. When we enable the async refresh, those additional system_auth queries no longer happen at query time. And while the validity and refresh configs both default to the same value of two seconds, a strategy could be to set the async refresh period to a value that's less than the cache validity, so you know you always hit the cache. Also, if you were at Mick's talk earlier, I believe he said you can raise the validity to something like 30 seconds, so there's that too.

It's also worth noting that while using async refresh helps avoid this type of error, it doesn't necessarily reduce load on your cluster. In times of high load you may still experience the load shedding that causes this problem, and the queries to the system_auth tables can still time out; the difference is that the error isn't sent back to the client, causing confusion. Additionally, you may now be reading from these tables more frequently, depending on the refresh rate you set, so be conscious of that as well.
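Putting those knobs together, here's a hedged cassandra.yaml sketch for a 4.1-era node; older releases spell these as roles_validity_in_ms and so on, and the right values depend on how quickly you need role or permission changes to take effect:

    # how long cached auth entries stay valid before needing a refresh
    roles_validity: 30s
    permissions_validity: 30s
    credentials_validity: 30s

    # refresh entries in the background shortly before they expire, so client
    # requests don't block on queries to the system_auth tables (Cassandra 4.1+)
    roles_cache_active_update: true
    permissions_cache_active_update: true
    credentials_cache_active_update: true

    # refresh more often than the validity window so reads keep hitting the cache
    roles_update_interval: 20s
    permissions_update_interval: 20s
    credentials_update_interval: 20s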
We get yet another call. It looks like some nodes are under higher load than others, which is causing timeouts. Let's see what's happening. In addition to the timeouts, we're also getting alerts that the disk is filling up on some of the nodes in this cluster. We look at nodetool status to help us understand a bit more about the topology, and we see that this cluster is very imbalanced: some nodes hold roughly twice as much data as others. What's causing this?

To figure that out, we need to look at how Cassandra handles data distribution. Cassandra distributes data around the cluster and determines which nodes own what data when nodes are initially added to the cluster. Assume these six nodes are in one data center; together they make up what's called the token ring, and each node is responsible for a subset of the data as determined by this ring. The partition key, which is defined in your data model, is hashed to determine the first replica where the data is stored. For example, in the query we see here, the partition key is the name "John Doe", and that value hashes to the token 7. Cassandra uses consistent hashing, so no matter how many nodes we add to this cluster, the partition key will always hash to the same token value. We see that node 1 is the node responsible for this data because it owns tokens 0 through 10. If we assume a replication factor of three, there are two more replicas that will hold this data. The replicas are determined by moving clockwise around the ring and selecting the first node that's in a different rack than the previously chosen replicas, until the replication factor is met. In this example, node 2, which is on rack 2, and node 4, which is on rack 3, have been selected as replicas.

Now imagine there's a rack with only one machine. That one node is going to be holding data that other racks split between several nodes, so this becomes a problem. The issue is even clearer when we look at the physical view: rack 1 has only one node in it and no more inventory to add to the cluster. So we'll need to get rack 1 out of the cluster entirely and add a few more nodes from a new rack, such as rack 0 here, which has more inventory. Now, in this cluster example we have four racks of nodes to choose from but a replication factor of three.
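For reference, the replication factor in this example is just the per-datacenter setting on the keyspace; a minimal, hypothetical sketch (keyspace and datacenter names are placeholders):

    -- three replicas per data center; the snitch's rack assignments then decide
    -- which nodes those replicas land on as we walk the ring
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'DC1': '3'
    };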
So how can the relationship between replication factor and rack count help us prevent this in the future? Having a replication factor that's less than the rack count makes it harder to tell which nodes are replicas for which data, since Cassandra can choose many combinations of racks depending on the token value of the partition key and the token ownership of the nodes in the cluster, as we saw before. One partition might be replicated across nodes in racks 1, 2, and 3, while another partition may land on nodes in racks 1, 3, and 4. This results in imbalance, and some racks may end up responsible for more data than others depending on the partitioning strategy, because, remember, the partition hash determines where we start in the ring. On the flip side, having a replication factor greater than the rack count results in placing replicas within the same rack, which Cassandra tries to avoid. That makes your cluster more susceptible to outages, because nodes within the same rack could all go down at the same time due to network or cooling issues, resulting in loss of quorum, which obviously we don't want.

So the solution here is to set the number of racks equal to your replication factor by creating a logical grouping of machines. This results in Cassandra placing a replica in each rack and spreading data between the racks. Let's visualize this. With a replication factor of three and a rack count of three, we know there's a replica in each rack and the machines within a rack split the ownership of the data. And since all machines are now part of one of three racks, the same set of machines results in more usable inventory for the cluster.

Even though we're no longer bound by the physical location of a machine when adding machines into the cluster, it's still important to remember the physical locations when creating these logical groups. When taking this approach, you should ensure that machines on the same physical rack are part of the same replication group, so that no two replicas are placed on the same physical rack. And if you're on AWS using the EC2 snitch or the EC2 multi-region snitch, it's important to align the number of availability zones you deploy in with the replication factor, since those snitches use the availability zone as the rack information.

Now, when all the nodes in the cluster begin to fill up, we can spread the data out more easily by adding more nodes in each rack. It's important to note that this strategy only really works when you enable the allocate_tokens_for_keyspace or allocate_tokens_for_local_replication_factor config, depending on which Cassandra version you're running; these configs make Cassandra distribute tokens more evenly around the cluster, taking your replication factor into account. A sketch of the relevant settings follows. And while this strategy as a whole will help you control data distribution around your cluster, it won't solve large partitions, and it won't solve imbalance caused by a bad partitioning strategy, so still be conscious of that; it's not a way out of those problems.
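A sketch of the pieces involved when bootstrapping nodes under this strategy; the option name depends on your version (allocate_tokens_for_keyspace on 3.x, allocate_tokens_for_local_replication_factor on 4.0 and later), and the datacenter and rack names here are placeholders:

    # cassandra-rackdc.properties -- the logical rack should track the physical rack / AZ
    dc=DC1
    rack=rack1

    # cassandra.yaml -- ask Cassandra to balance token ownership for the target RF
    num_tokens: 16
    allocate_tokens_for_local_replication_factor: 3
    # on 3.x instead: allocate_tokens_for_keyspace: my_keyspace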
We made it to the end of our on-call shift, so what did we learn? In addition to the solutions to the individual problems we saw, there are some themes across all of these problems that we should take forward. First, observability is immensely helpful in identifying issues in Cassandra. There are a lot of metrics available, as well as very thorough logging, so make sure to use them when working out what's going wrong. In addition to metrics and logs, Cassandra offers great tools such as nodetool, sstablemetadata, and sstableexpiredblockers that help diagnose these cluster issues.

Second, configuration is very important. It's really tempting to use the default configs, but there are a lot of opportunities to improve performance and prevent the types of issues we talked about if you're willing to dig a bit deeper. In addition to the official documentation, the ASF Jira is also very helpful for understanding what these configurations do, and it documents some of the more esoteric configurations that Cassandra offers.

And finally, I cannot stress this enough: be proactive when monitoring your systems. Learn the early warning signs, such as an increasing droppable tombstone ratio, and act when you see them to prevent the downfall of your application.

Okay, so we're going to take questions, but while we do that: we're hiring. So if anything we said today sounds interesting (I promise that's not a typical on-call shift), come talk to us; feel free to scan the QR code. Any questions?

Well, you're going to let us off easy. Nice, we'll take it.

[An audience member asks a question about the rack strategy.] Yeah, I'd say in the extreme case. That's a very real example we had, where we had a single node in a single rack. In those examples it didn't really help a whole lot, and we did need to be able to add inventory in there to spread out the load. Before we went with the strategy of racks equals replication factor, just setting that token allocation parameter did help a lot with rebalancing your data, but for an example like that you still need to be able to spread the load out a little bit more.

All right, well, if there are no other questions, we'll be right over here, so you can come ask them there; you know where to find us. Thank you all for making it this far into the conference. We really appreciate you coming to the talk, and we hope you enjoy the rest of your day.