So, for all of you who are running a Cassandra cluster in production, I am hoping that from this session you are going to take home a ton of things to think about and to investigate. The session's title was "Popular Cassandra Problems and How to Fix Them". I lied to you, I'm taking you for a ride: it's going to be popular Cassandra fixes, because the amount of content that I've got could fill a full-day workshop. So I'm going to run through it very quickly, and I won't have time to talk about how you verify that the problems behind these fixes exist, or how to verify that the fixes have actually worked for you, or how to make those fixes optimal. So do not take anything in this session, any of the hot list of fixes that I'm about to give you, and just make changes without doing your homework and your due diligence first. I have to repeat that: do not take any of my advice in this session and just make those changes.

As I said, there's a ton of detail, and that's because this is basically a dump of the health check report that we built up at The Last Pickle. We're talking about a decade, two decades' worth of Cassandra expertise accumulated, and hundreds, possibly a thousand, production clusters analyzed. This job is usually three days' worth of work per cluster, and the report is often up to a hundred pages long. So I'm going to try and skim through it as quickly as I can for you in 30 minutes. In the photo of The Last Pickle, Anthony and Radovan are missing, and I would also like to give a special shout-out to Andrew Hogg and Phil Miceley, who have taken this work on at DataStax and improved it even more.

Briefly about myself: I've always been involved in open source. It's just obvious, isn't it? I hope. What's interesting is that, as a kid, if you'd asked me why, I would have given you ideology or reasons of principle. Now, and especially looking back at what I was involved in over time, I can see the pragmatism. It was really always just about solving the problems that I had at work, faster. It was just: I've got a problem, let's look around for something and rip it in. That was it. So I think pragmatism is a far more valuable reason to do open source than we realize.

Okay, so down to business. I'm going to break this up into six sections, and within each of these sections I'm going to break things down into three further subsections: the fixes that come immediately, the fixes that you should be doing in the near term, that is, in a three to six month time frame, and the fixes that you should be doing long term. This isn't speaking to the severity of those problems; it's a trade-off between the severity and the cost of change, the risk involved, and the cost of verifying both the problem and the changes.

First up, infrastructure. Infrastructure has a base slide to it; these are just some of the base recommendations that you should be aware of. Cassandra nodes only really work in a certain range of specs.
So we say from 8 to 16 cores and 32 to 64 gigs of RAM is good for a production node. If you have nodes which are over 20 cores or over 120 gig of RAM, you're really not going to utilize those machines. So the recommendation there is to either use containers, use Kubernetes and the K8ssandra project for example, or, from 4.0 onwards, because all the ports can be configured, run multiple nodes per host. This may change, especially with what's happening with the trie memtables, and if they become properly off-heap this may change over time, so pay attention to it.

The other one is: use separate, dedicated disks for each node. Disks are usually in the category of about two to eight terabytes. That's because you don't want to fill them up more than 50%, and the popular size-tiered and levelled compaction strategies don't really want to work over four terabytes. We have seen time window compaction strategy tables go up to 16 terabytes and the nodes work, but it will still make your range movements and streaming incredibly painful. So that is the recommendation. Again, with 5.0, with unified compaction strategy and with the trie memtable and SSTable formats, this is going to change; we're going to see the ability to store a lot more data per node.

Okay, and don't use SAN. Don't use SAN, because an angel dies every time someone has to tell a Cassandra user "don't use SAN". I'm not going to explain it. A simple benchmark to run, if you have infra people who are trying to force a SAN disk onto you or something, is a 10,000 IOPS soak test on each node at the same time. That is your baseline: if you can't do that, if the controller starts to jam up, the production cluster is not going to work very well.

Okay, jumping into the immediate concerns. First up, upgrading: always be on the latest patch release and the latest Cassandra version. It's such obvious advice, but we jump into so many clusters and they're just behind. This should just be automatic for you. Use the kyber I/O scheduler; noop is also quite popular for a lot of people. It's hard to give absolutely definitive advice to everyone on this one, but we have seen kyber, again and again, just be the right choice. Disable swap. A lot of people ask about this: we always say, you know, swapoff, and vm.swappiness set to 1, not 0, because there was a kernel bug at one time where 0 would cause problems, and we say it's really important to do both. People say, why, one's enough? The problem is that kernel upgrades and other things can happen on the infrastructure side, often by different teams, and a swap setting can accidentally be undone. That can have a cluster-wide impact and bring the cluster down, so it's a line of defense. Otherwise, the recommended production settings for Cassandra 3 on the DataStax website are really good; I hope we can upstream them. That's a pretty good baseline. Yeah, I was going to keep moving.

Okay, so critical items here, the near-term items. Have an NTP client running on every node and every client application server. More than that, it has to be part of your monitoring, and you have to have alerts set up for clock drift. Trying to debug problems related to clock drift in a Cassandra cluster is absolutely awful.
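As a rough illustration of the clock-drift alerting just mentioned, here is a minimal sketch, assuming chrony is the NTP client; the 10 ms threshold and the way you hook it into your alerting are placeholders to adapt:

    # check the current NTP offset on this node and warn if it drifts too far
    offset=$(chronyc tracking | awk '/Last offset/ { o = $4 + 0; if (o < 0) o = -o; print o }')
    if awk -v o="$offset" 'BEGIN { exit !(o > 0.010) }'; then
      echo "WARN: $(hostname) clock offset is ${offset}s (over 10ms)"
    fi

Run the same check on every node and every client application server, and feed the warning into whatever alerting you already have.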
Also pay attention to your repair logs and to P99 latency changes.

Long-term infrastructure things to do: if you are using multiple data directories in Cassandra, stop. Replace that with LVM. There are so many bugs and limitations around multi-data-directory setups in Cassandra; just don't do it. Also, major Cassandra upgrades are more involved than people think. They require a lot of planning and a number of careful, sequential steps to do properly. There is great documentation on this on the DataStax website, which we are again hoping to upstream to the Cassandra website.

A quick note on the EOLs: the Cassandra community has recently agreed that it will only maintain the last three branches, so that effectively puts an EOL on that fourth branch. People panic about this, but, you know, I hate to tell you, open source doesn't give you support. If you're that worried about EOL branches, maybe you should go to a vendor and sign up for support. What we mean is that we simply recognize the number of contributors and the number of committers that we've got on the project; we really can't man more than three branches. So that is what we're committed to, and you're not going to see bug fixes from that group of people on older branches. CVEs? Maybe, it depends; again, you may be asking us to work on weekends, and we'll make that decision. Of course, vendors can come in and say they would like to support older branches, and if they bring in contributors who bring in fixes, that will change.

Okay, so moving on to configuration. Consistent config on all nodes. I guess so many of the things I'm going to list here are just so obvious, but the number of clusters that we've seen where every node's yaml is different... it's horrifying, and that's why The Scream is on the slide, because these are things that I'm embarrassed to be up here talking about.

Logging: it's not a big deal, but logging is by default configured to log to two places, to file and to stdout, and very few of you will need both places. Typically it's the stdout one you don't need; turn it off.

Garbage collection: just use G1, even on the newer JDKs. We see benchmarks where ZGC or Shenandoah perform better on certain use cases, sure, but what we're seeing is that G1 is actually really good, it's up there, it's just easier to use, and it's what you should have anyway. Your best setting for heap is about 20 to 24 gig; do not go over 31. For the best G1 settings, look at the trunk or 5.0 jvm options file. Also, make sure to set a floor for the new (young) generation size of the heap: set it to about one third, up to half, of the heap. G1 is supposed to adjust it with its ergonomics, but it just never gets there with the new gen, so you've got to give it this hint to get to the right place quickly. The other one, off-heap memtables: this gives you a 10% write throughput increase, and it's one to test. There are some cases where it won't work for you, though. The two main bugs that I know of that people hit, that prevented them from using off-heap memtables, have been addressed, so I think that advice comes in even stronger.

Moving on. Near-term configuration changes: you should have, for each DC, three seeds listed, and each node should list those three seeds from every DC consistently. The failure policies that you have in the yaml: never use best effort. It's really a choice between stop and die, and it's a choice between whether you have a monitoring system that can actually detect and flag a node that is down, sitting in that stopped, zombie mode, or whether you have a system that's going to automatically restart Cassandra if you do kill it.
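For reference, a minimal sketch of where those policies live; the host names are placeholders and the config path is assumed to be /etc/cassandra/cassandra.yaml, and the choice of stop versus die is yours, as just described:

    # In cassandra.yaml on every node (pick stop or die to match your monitoring/automation):
    #   disk_failure_policy: stop
    #   commit_failure_policy: stop
    # Quick check that every node agrees:
    for h in node1 node2 node3; do
      ssh "$h" 'grep -E "^(disk|commit)_failure_policy" /etc/cassandra/cassandra.yaml'
    done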
So, depending on how your system is set up, you've got to choose these ones correctly.

Once your cluster is operating well and you've got decent hardware, jump into the cassandra.yaml and disable the dynamic snitch. It gets in the way on well-performing clusters. And since Cassandra 4.1 we have guardrails; a big shout-out to them, they are a lifesaver for operators. There are heaps of them and they just keep on getting added to, from everybody, because everyone is seeing an immense amount of value in them.

Long-term changes: keep an eye out for ring imbalance. That can cause problems over time, and it's a little bit of a hard one to fix in large clusters or over-capacity clusters. There's a new setting, allocate_tokens_for_local_replication_factor, since 4.0; take advantage of it. And when you use it, don't worry about the small imbalances that you see, because the point of that allocation algorithm is to account for future nodes, so it's always a little bit off, but it makes sure that each step stays within a certain error margin. Also, we are currently working on a cassandra_latest.yaml file that should land, I think, in the next week or two. That is really, for new clusters, the set of options that you should be using. We can't put that into the cassandra.yaml file, because the cassandra.yaml file has to look after people who are upgrading, so it's got all the conservative defaults, keeping people who are upgrading in mind. The cassandra_latest.yaml is what we'll be pointing anyone who wants to do benchmarking, or to play around with Cassandra, to.

Okay, so moving on to data model. This is a big one. The two big issues in data modelling would be tombstones, and wide, large, or hot partitions. I'm not going to go into them as such, but a number of the issues will touch on them. So the hot, or critical, list is: remove unused tables. Tables have a memory overhead in Cassandra, and Cassandra is not a good place to park data. So if you have tables where you've just taken copies or something, and there are no longer any reads happening on those tables, get them out of your cluster.

Disable both the local and the DC-local read repair chance. This takes five to twenty percent of load off your cluster. That async read repair chance offers no operational guarantees, so there's nothing you can work with in a concrete way, and it adds a significant amount of load to the cluster. In the recent patch versions of all of our branches it has been removed.

Compression chunk length: bring it down to 16k, and if you've got skinny rows even bring it down to 4k. Then, once you've done that, run upgradesstables so that the new compression kicks in. It's only going to work if you've also put your disk read-ahead down to 4k; that was a really important one I skipped over, they go hand-in-hand.

Make sure your dev teams, or your applications, are syncing their schema changes. As an operator, dealing with schema disagreements can quickly turn into a huge nightmare. Yes, Cassandra is kind of supposed to be able to do concurrent schema changes, but in practice you're living on the edge. Get the dev teams to sync their schema changes. This changes in 5.1 with transactional cluster metadata, where you will be able to do unlimited concurrent schema changes.
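To make the chunk length and read-ahead pairing from a moment ago concrete, here is a hedged sketch; the keyspace, table and device names are placeholders, and 16k versus 4k depends on your row sizes:

    # drop the compression chunk length on the table
    cqlsh -e "ALTER TABLE my_ks.my_table
              WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};"
    # rewrite the existing SSTables so the new chunk length actually applies
    nodetool upgradesstables -a my_ks my_table
    # and bring the disk read-ahead down to 4k (blockdev counts in 512-byte sectors)
    sudo blockdev --setra 8 /dev/sdX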
Disable the row cache. It really never works, especially over time: people set it up, they tune it so it works, and six months, a year later, it just starts hurting. There are better ways to do that.

Also check your prepared statements in the clients. Again and again and again we see string literals sneak into those prepared statements, and that thrashes the prepared statement cache on the server. It's something to keep an eye out for, and there's no way to enforce it on the server side.

Near term; it's a long slide, I know. Don't do secondary indexes before SAI; favour denormalization instead. Try to avoid unfrozen collections and UDTs. The performance difference, and the problems that you encounter, between frozen and unfrozen are significant. So whenever you're using UDTs or collections, if you can freeze them, just do it. If you are using unfrozen lists, try to use a map or a set, or add a clustering key, instead; the unfrozen list collection type is not good. Don't use materialized views, ever. Sorry, I'm not going to go into that one. The whole section here on compaction strategies I'm not really going to go into either, because with 5.0 we've got unified compaction strategy, and you really should just upgrade and start using that.

Again, unlogged batches: they're only for a single partition. People think that they're a wonderful performance gain; they're not. Cassandra likes small, concurrent requests. This idea of trying to batch up writes, even within a single partition, does not work. Maybe between 10 and 50 rows max in an unlogged batch may give you a small benefit; don't go beyond that. Logged batches: they pretty much don't give you any guarantee, and they trip people up. There's a little bit more dev work and complexity in just doing it manually, ordering the writes and then ordering the reads in the opposite way, with retry logic in place, but that's usually a much better solution. Again, Cassandra 5.1 with Accord fixes that problem for you.

gc_grace_seconds is an interesting one. We see, again and again, people play with gc_grace_seconds to try and fix their tombstone problems. It is not the first thing for you to play with, and it's nearly always done wrong. gc_grace_seconds has a number of constraints, or odd interactions. First of all, you've got to be running repairs within your gc_grace_seconds; that's the easy one. But it also plays with the max hint window and with TTLs, and you get some odd behavior if your gc_grace_seconds goes above, or below, some of those values. We have an excellent blog post on The Last Pickle website that goes into that in more detail.

Also, configure your speculative retry, both server side and client side, to hard-coded millisecond values. Usually something between the P75 and the P90 latencies during normal operations is a decent starting place; it really depends on how much load overhead you're willing to sacrifice for a bit better performance. But when you do that, also, at some point, go and test that the cluster can still run when speculative retry is at ALWAYS; otherwise you have a cascading failure problem.
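A small sketch of pinning speculative retry to a fixed latency on the server side, as described above; the table name and the 85 ms figure (roughly a P75-P90 number) are placeholders, and the client-side equivalent lives in your driver's speculative execution policy:

    cqlsh -e "ALTER TABLE my_ks.my_table WITH speculative_retry = '85ms';"
    # see what tables are currently set to
    cqlsh -e "SELECT table_name, speculative_retry FROM system_schema.tables
              WHERE keyspace_name = 'my_ks';"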
Long-term issues: don't overuse lightweight transactions. Lightweight transactions hit contention, and they do time out. If you have a lot of lightweight transactions, I recommend you upgrade to 4.1. When you upgrade to 4.1, though, it does not automatically use the second version of Paxos. The second version of Paxos that came in 4.1 halved the number of network round trips, so you're going from four round trips down to two; that's a huge improvement on your cluster. But the problem is it doesn't automatically use it. There are some manual commands and, as far as I know, they're not documented anywhere, so good luck with that; reach out and ask.

Don't do more than a thousand tables. And design for collections with less than a thousand entries. Yes, the limit is 64k, but what we see is that if at design time you're already going over a thousand, then in production, in a year, you're going to be in pain. So if at design time you think those collections are going to need thousands of entries, try to find a different solution for that.

Okay, jumping into operations. If you're not familiar with the OODA loop, from the US Air Force, please read up on it. The basic idea is that you take these sequential steps before you take any action. You sit down and you collect observations in an unbiased manner; you get as many observations as you can. Then you start to orient, and then you allow your biases to come into the picture. Then you build hypotheses, you weigh multiple hypotheses up against each other, and then you make a decision. At any of these steps you eagerly go back to collect more observations. Once you're confident, you make that change, and you verify the change, and if it was no good you undo it and start again.

Your monitoring systems, your logging systems, your tracing systems, your backups, all of those emergency ops: they have to work when your platform is on fire. These infrastructure tools need better availability and redundancy, in many situations, than the production system itself, because when the production system is on fire you are dependent upon them to see what the problem is, so it does not happen again.

So, operations, near term. Keep an eye on wide, large, and hot partitions. It was interesting, a number of other sessions at this summit were saying 10 megs; I'd go with that. The hard limit, really, for a wide partition before 3.11 was 100 meg; the soft limit would be 10 meg, and that was accurate. From 3.11 upwards we can actually handle wide partitions a bit better, so the high limit then moves up to about a gig. This is not talking about large partitions, large columns or data cells, and it's not talking about hot partitions; there's a long list here, not everything I'm super interested in talking about now.

Okay, so two items here. What are your per-table compression ratios? One of the things that we see in all of our reports is that maybe 10 percent of the tables are not getting any benefit whatsoever from compression. It's super simple: just jump in and turn compression off on those tables. Otherwise, on maybe half of the remaining tables, simple adjustments could save 10 percent of disk usage for you. I know, we were speaking about this the other day, and the engineers in the room immediately jumped on it: yes, yes, and we should come up with this new compression algorithm, and we could have a custom dictionary based on the data you've already got and get even better compression ratios. For me that was a little bit like when you talk about climate change and people talk about the brand new technology they want to invent. Just stop it: all the solutions are around, just get to it and fix the low-hanging fruit. But people don't do it, because it's quite cumbersome and takes a bit of time.
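A rough way to go looking for those tables, as a sketch; ratios close to 1.0 mean compression is buying you little, and the table name below is a placeholder:

    # per-table compression ratios on this node
    nodetool tablestats | grep -E 'Table:|SSTable Compression Ratio'
    # for a table getting no benefit, turn compression off
    cqlsh -e "ALTER TABLE my_ks.my_table WITH compression = {'enabled': 'false'};"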
The other one: watch out for anti-compactions. Again and again and again we see clusters thrashing around with anti-compactions, and we say, "you're doing incremental repairs", and they say, "no". And sure enough, their scheduled repairs haven't got incremental repair anywhere. What happened was that some operator at some point ran one incremental repair, and that put the repairedAt timestamps on the SSTables, and from that moment onwards you had anti-compactions running in the background. You're basically doubling your repair load, and it's a bit cumbersome to get out of: you've got to turn off the node, unset all of those repairedAt timestamps, and turn it back on, to get out of that situation.

Long-term things to watch out for: test your hint window and the replay speed of it. It's one of those things that people don't find out about until they've had a node down for an hour or two and they can't actually get that node up and running again, and it's because the hint replay is coming in too fast; you've just got too much in your hints. So take the time, when you can, to figure out what the correct throttling rate for hint replays is, especially for when you have an hour or two worth of hints.

Okay, so security. Only one slide, no, a couple of slides, for security. Low-hanging fruit in the cassandra.yaml: there's a number of "update interval in milliseconds" and "validity in milliseconds" settings. I think there's six of them, there may be nine of them now, I can't remember. They're all set to quite low values; increase them all up to 30 seconds. Checking your auth tables and roles tables too often can cause problems. The other one is the system_auth keyspace. Increase its replication up to the number of nodes you've got, or just a higher number like five or even more, and use NetworkTopologyStrategy, so you make sure you've got copies of it in each data center.

And, yeah, do I have to say it? Don't expose JMX off the machine if you haven't done your security properly on it. Don't use the default cassandra superuser, and that is because, with that user, all internal requests around it use QUORUM. As soon as you create additional users and superusers, they then use LOCAL_ONE. So the default cassandra superuser can actually get you stuck and cause problems. Set up your own users and then disable that one.

Okay, long term. There is so much to go through here; these are just the basic highlights. The problem with security is there are no half measures: you really need to do everything here properly and carefully. Just go read Nate McCall's blog post on The Last Pickle; he goes through all the steps. On The Last Pickle there are also some additional blog posts, around requiring client authentication, which is really important, it's a must, and if you're using a common CA you should also do hostname verification.
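A hedged sketch of the system_auth and superuser changes above; the DC names, replication factors, role name and password are all placeholders:

    # raise system_auth replication and create your own superuser
    cqlsh -u cassandra -p cassandra -e "
      ALTER KEYSPACE system_auth
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5, 'dc2': 5};
      CREATE ROLE my_admin WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'change-me';"
    # repair so the new replicas actually hold the auth data
    nodetool repair system_auth
    # then, as the new superuser, lock the default one out
    cqlsh -u my_admin -p change-me -e "ALTER ROLE cassandra WITH LOGIN = false;"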
Okay, last slide of the deck. It's called troubleshooting, but I probably could have called it verification, because when I say verify the problem and verify the fix, the method is pretty similar to troubleshooting.

So one of my favorite things to tell people is: be familiar with the P95-to-P99 spread and its derivative. What I mean by that, and this applies to both read and write latencies, is that you look at the gap between your P95 and your P99 latencies, how that changes over your daily traffic shape, and how much it pulls apart in peak traffic, and then look at that over three, six, twelve months. That's going to give you some interesting insights as to the saturation of your cluster, and saturation growing over time, as well as potential hardware problems and other non-Cassandra problems that may pop in.

If you have any problems with CPU, with CPU increasing or getting saturated, take a flame graph before you do anything. Even if you're 100% confident what the problem is, please, for the love of God, just take a flame graph. Flame graphs are also a fantastic way of getting into the code of Cassandra. It's not the best code base to begin learning with, and to sit down and try to start reading it from scratch is pretty painful, but the flame graphs are actually a really fun way to get into the code base. Also be familiar with async-profiler for the flame graphs, and with tools like dstat and iotop; if you've got a newer kernel, start learning the BCC tools. And for your GC profiling: even though G1 is pretty good, and if you're using those latest JVM settings you probably don't have to, if you want to do GC tuning just go and use the GCeasy website. It's fantastic, and it's free.
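As a sketch of grabbing one of those CPU flame graphs, assuming async-profiler (2.x) is unpacked on the node; the 60-second duration and the output path are arbitrary choices:

    # attach to the Cassandra process and write an HTML flame graph
    PID=$(pgrep -f CassandraDaemon)
    ./profiler.sh -d 60 -e cpu -f /tmp/cassandra-cpu-flamegraph.html "$PID"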
It's been discussed a number of times I know people talk about this One way to solve that is if you have separate DCs So you have a DC which is analytics, and then you do it all ad hoc on all that stuff on there You can figure that differently, but Have a good use case for that and you like to write it up and submit it Like do that to the community because we need to hear more concrete use cases where we say yeah, that's legit And we'll start to form kind of more ideas on on ways to to solve that