All righty, why am I here? Great question. It's not because I'm awesome at AV, it's because I like putting out dumpster fires. So let's talk a little bit, just to frame where I'm coming from. For the last ten years I have been working pretty much exclusively with Apache Cassandra, with some other databases thrown in here and there, but I've done this a lot. I worked at DataStax for a little while on Patrick McFadin's team and had a great time over there. Then I went to The Last Pickle, a small, boutique consulting company. I was there for about three years, and I was added as a committer to the project during that time. I moved on to Apple for a little bit and worked on the Cassandra 4.0 upgrade, then went to Netflix, where I did a lot of the performance tuning for the Netflix fleet, including all the GC tuning for the entire thing. I put out a lot of dumpster fires there and did the performance tuning for some of the highest-profile clusters in the company, some of them 600-plus nodes. So I've done a lot of work there.

We do have to ask ourselves: why do we care about performance? I think this is one of those things where it's easy to misunderstand why we care. Throughput is affected by performance, and latency makes a big difference. We really care about performance because, first, it affects our end users. If people have to wait around for something, they're not happy; they get a little bit grouchy. Reason number two, and this is a really big one, is cost. Especially when you're running something like Cassandra on a fairly large cluster, it's not cost effective to double or triple the size of your cluster when you could instead do performance optimizations and achieve the same result. Because at the end of the day, what we really want is to get our latency down as low as we can while keeping our throughput as high as we can. And that's really what I've been focused on for a very long time.

When I was at The Last Pickle, we really focused on having an expert mindset and understanding all aspects of a problem before trying to solve it. It helps to have a methodology that you follow rigorously every time you approach a problem, so today I'm going to talk about a couple of different methodologies we can use for our problem solving. It also really helps to have a good amount of observability. By that I mean we need the right tools in place, we need to be able to ask the right questions as a result of the methodology we're going to develop, and we need the right tools to get the right answers. There are a lot of really bad tools out there and a lot of really good ones; some tools don't give us great information, and others do. So I'm going to talk a little bit about that. And lastly, much like playing a musical instrument, you need to practice. It doesn't matter what I say in this talk if you don't walk away and actually try these things on your own. You have to try them on your own. Otherwise, when it comes time to do performance tuning, or there's an incident, you won't be ready, because you haven't practiced. You couldn't read a book about playing the guitar and then go up on stage and play; you'd never do that, it makes no sense. You have to practice. This is like any other skill. So, speaking for a minute about methodology.
Without a methodology, without approaching a problem in some sort of sane way, we're literally doing the equivalent of closing our eyes and turning dials. I've seen a lot of people try to do performance tuning by making arbitrary changes to configuration files or to GC settings, hoping they'll do something correctly without understanding the impact of what they're doing. But in the case of Cassandra, or any database system in general, there are literally hundreds of changes you could possibly make. And if you need to make two changes and there are thousands of different parameters you could change, whether it's schema, something at the OS level, or Cassandra configuration, it becomes very, very difficult to guess the right thing, especially when multiple changes are needed. So what I'm going to teach you is an approach that builds the understanding that lets you tune the right parameters.

We're going to start with something called the OODA loop. This is something we used heavily at The Last Pickle. It was developed in the Air Force for fighter pilots, but it turns out that that kind of problem solving and that sort of methodology is exactly the type of thing we need when we're trying to put out a dumpster fire or just improve performance in general. OODA stands for observe, orient, decide, and act. Observe means we just need to take in information. Orient means we need to build a mental model to understand what's actually happening in the system we're looking at. Decide and act are pretty much self-explanatory. If you think about the example I just gave, where we change arbitrary configs, that's just act. We didn't make a decision based on our observations and a mental model; we just guessed at things. It doesn't work. We're going to stop doing that. What we're going to do, whenever we want to do some sort of performance tuning, is apply the USE method.

Now, one thing I should have noted, but I got a little caught off guard by my laptop issue here: there are QR codes on these slides. I don't know if they're going to work when you take a picture of them, but they do link to other helpful resources, whether it's blog posts, YouTube videos, things I've written, stuff like that. So maybe it works, if someone can try; hopefully it does. Yeah, it does, awesome, cool. So these QR codes are going to be helpful, because I'm not going to be able to go into the amount of detail that I would love to. I don't have six hours to speak to you today, and you probably don't want that. When you're ready, there's lots of really good information there.

The USE method is a way of understanding what's happening on a system by examining all the different resources. It was invented by Brendan Gregg. Basically, for each resource we check its utilization, saturation, and errors. Utilization is something we might be familiar with if we think about CPU utilization; it's fairly common terminology, but the definition is the average time that the resource was busy servicing work. So, for example, our CPU is at 90% utilization. When we talk about saturation, we want to know the degree to which the resource has extra work that it can't service yet. All of the components of your system have queues.
So, for instance, when Cassandra sends a request to the underlying I/O device, it first goes into a queue, and then it gets serviced. If we end up sending too many requests to an underlying device, we end up with a queue, and saturation is the measure of the length of that queue. For example, we might have an I/O queue of 100. Using a tool like iostat, this shows up as the average queue size. And errors: we definitely need to know how many errors a particular subsystem is throwing. This can be at the network level; for example, if we see TCP retransmits, we know we have packet loss along the way. So we need to be able to monitor each of the different components of the system and understand their utilization, saturation, and error rate. And we do that with good observability tools.

From a high level, you definitely want to make sure you have high-level dashboards that you can use to figure out when you have nodes that are anomalies, performing outside the normal expected behavior, or when the entire cluster has a problem. There are a few things you don't want to do. I'm going to go over a handful of anti-patterns, because I see these a lot and I want to make sure folks walk away with something they can work on.

Using averages is really terrible. I've seen a lot of dashboards where people plot average latency and use that as their indicator of whether or not their cluster's performance is okay. Average latency is worse than nothing, because it will actually mislead you into thinking the cluster is okay, and you'll be sleeping just fine while your customers are experiencing pretty terrible performance. The reason is that a lot of your requests, it turns out, fall outside the average, and if we're not doing a good job of identifying when those requests are actually happening, we get lulled into a false sense of security. We don't want to hide the outliers. In my opinion, it's silly to graph anything less than P99, and you want to make sure you're also graphing your max. So you need to know your max latency at any given time and your P99 latency; anything under that I would just ignore and not bother graphing.

Another common mistake is averaging summary data and only showing that. For instance, if we take the P99 latency from each node in a cluster, average those together, and just plot that on a graph, we're not getting useful information, because now we might have an outlier experiencing significantly worse latency and we've hidden it by averaging. And if we take a deep dive into statistics, which we're not going to do right now, you'll discover that averaging P99 latencies does not give you any useful information at all. It's completely worthless. So if you're going to show something like a P50, you want to make sure you're also showing your highs and your lows. Whatever you think your typical latency is for the cluster, you also want to make sure you understand what your worst latency is.
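To make that concrete: if you're graphing out of something like Prometheus and Grafana, the difference looks roughly like this. The metric names here are hypothetical, purely for illustration; substitute whatever your metrics exporter actually publishes.

    # Misleading: averaging each node's P99 hides the one slow outlier node
    avg(cassandra_read_latency_p99_seconds)

    # Better: aggregate the raw histogram buckets across nodes, then take the quantile,
    # and always plot the worst case alongside it
    histogram_quantile(0.99, sum by (le) (rate(cassandra_read_latency_seconds_bucket[5m])))
    max(cassandra_read_latency_max_seconds)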
And another anti-pattern is not understanding what your tools are actually telling you. It's really common, especially with a tool like iostat, to look at some of the output it gives you, like the average service time, and think that number is telling you something good or bad, when all it's doing is giving you, once again, an average that doesn't mean much on its own. So I really recommend taking a look at the link I have here with the QR code, where you'll learn quite a bit more. There's a guy named Gil Tene who gave a talk at Strange Loop, I think from 2016, called something like "Everything You Know About Latency Is Wrong." He's the CTO and co-founder of Azul and an incredibly bright guy, and it's an entire talk on this topic, so I highly recommend watching it.

Once we understand what's going on at a high level, I really like to use system tools. Observability tools like Grafana, Prometheus, and AxonOps are fantastic, but sometimes you need to get in and get more granular data, data that might not be available in those tools, so you definitely want to be familiar with this. At a high level, I think a lot of people are familiar with sysstat. Sysstat comes with useful tools like iostat, which I've mentioned a couple of times, mpstat, and a handful of other utilities. These are really nice for a first cursory look at a node, but they don't give you enough detailed information to help you narrow down problems quickly. That's where I really, really like eBPF and the BCC tools. eBPF is a technology that originally stood for extended Berkeley Packet Filter; now it's just its own name and no longer an acronym. It gives you the ability to tap into probes in the Linux kernel, so you can get a lot of detailed information about what's going on in the various subsystems of your server.

So let's suppose we want to answer a question; this is a pretty common exercise. What happens if we give Cassandra a bigger heap? There are a few things that can happen as a result. One is that we can improve our GC pause frequency, because a larger heap gets collected less often. But one of the side effects is that we take memory away from our page cache. If we only have 32 gigs of RAM and we give Cassandra 30 gigs of it, we're not going to have a lot of memory left over for the page cache. So this is an area where people are very cautious about making changes. Your page cache is going to serve requests a lot faster than your underlying drive can, so making sure you have a high cache hit rate is really important. But how do you measure that? Before and after you make any sort of change to your heap, you want to go in and take a look at a program called cachestat. Again, this is available in the BCC tools. cachestat can tell you your page cache hit ratio, and it can do so at any interval you want. I typically run it at a one-second interval over a long period of time and watch to get an idea of what kind of page cache hit rate I'm getting. That way, if I've made a change to the amount of memory allocated to Cassandra, I know whether that change has had the side effect of taking memory away from the page cache. And if it has, you would expect to see more I/O.
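To make that measurable, here's roughly what running cachestat from the BCC tools looks like; the install path and the exact output columns vary a bit by distribution and BCC version.

    # Print page cache hit/miss counters once per second
    $ sudo /usr/share/bcc/tools/cachestat 1
        HITS   MISSES  DIRTIES HITRATIO   BUFFERS_MB  CACHED_MB
       95231     1203      154   98.75%          312      24810

If that hit ratio drops noticeably after you grow the heap, you've traded GC pressure for extra I/O, and you'll want to confirm that at the file system and block layers.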
So then you have to ask yourself: how can I tell what's going on at the I/O level of my file system? That's where we use tools like ext4slower and ext4dist. The same tools exist for XFS, ZFS, Btrfs, and a whole slew of other file systems. ext4dist is really nice because it breaks things out by the type of operation you're doing, how many of those operations there were, and what the latency distribution looks like. These render as power-of-two histograms, which is really nice, because whenever you encounter a performance problem, one of the first things people blame is the file system or the underlying drive, and it really helps to be able to get numbers like this out of it. One thing that's a little tricky is that these numbers are going to be a bimodal distribution: you're going to see some requests that were served out of the page cache and some that came off your block device. So then you need to ask yourself: how do I get the information out of my block device? There's another great tool in BCC tools called biolatency. biolatency shows you the same type of histogram, but it looks at the underlying block device, times every request, and gives you a histogram of that latency distribution. I was able to use this at several organizations, one of which was Netflix, to prove that EBS was not a performance problem, and at another client of mine who had a SAN and was convinced the SAN was the problem; we were able to use this tool to determine that the SAN was actually not the underlying cause of their performance issues.

Now that we've taken a look at the underlying hardware and we know how to look at I/O, I think one of the next logical questions is: how can we tell what Cassandra itself is doing and spending time on? A lot of the time it's easy to think of software as a black box that we can't peek into and understand. But the reality is that modern profilers, especially sampling profilers, are safe to run in production and can give you incredible insight into your application. That's where I rely on flame graphs. Flame graphs are probably one of the most useful tools I've ever encountered, and I've spent the last several years using them to solve some of the most complex problems I've run into, problems that otherwise would have taken a lot longer to solve. I believe this QR code goes to an article on my blog where I really go into detail about how this is used and the different ways you can use it: CPU profiling, allocation profiling so you can see the root cause of allocations in addition to CPU time, and numerous other things you can look at.

But let's take a look at one example. This is a flame graph I captured while running my Cassandra benchmarking tool, tlp-stress, against a single node. One of the things you can see here is a spike up in the middle with a wide line. The levels of the flame graph represent the stack of the program: as functions are called, it gets higher and higher, and the width represents the amount of time spent. What you probably can't read is that at the top of the graph, where that larger red line is, it says LZ4 decompress. That corresponds to the LZ4 decompression algorithm that's used when we're reading SSTables off disk.
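If you want to capture a flame graph like this yourself, one way to do it, and this is just a sketch since the flags differ a bit between async-profiler versions, is to attach async-profiler to the Cassandra process:

    # Sample on-CPU stacks for 60 seconds and write an interactive flame graph
    $ ./profiler.sh -d 60 -f /tmp/cassandra-cpu.html $(pgrep -f CassandraDaemon)

    # Same idea, but profile allocations instead of CPU time
    $ ./profiler.sh -d 60 -e alloc -f /tmp/cassandra-alloc.html $(pgrep -f CassandraDaemon)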
So how does this actually help us? One of the things we should naturally ask ourselves is: if we're spending a lot of time in LZ4 decompression, is there a way we can make it faster? Is there a way we can do less of it? Compression is one of the table parameters that exists on every table, whether or not you specify it. In versions before Cassandra 4.0, that value on the slide that's kind of tough to read is going to say 64; on newer versions it's 16. Compression works like this: the chunk_length_in_kb value is used to create a buffer. When you're writing an SSTable, that buffer fills up, gets compressed, and gets written to disk. Whenever you do a read, the opposite flow happens: Cassandra reads the compressed buffer off disk, decompresses it, and returns it.

So the interesting question is: what does that mean, and what's the impact? Well, if we're reading really small values, like the small rows I have right here, where you can see the partition size is only 35 bytes because there's hardly anything in it, you start to wonder what's actually taking place. What we can logically deduce is that Cassandra is reading a whole bunch of data off disk, decompressing it, and only using a tiny bit of it to return to the client. So what we actually have here is an example of massive read amplification, where we're reading a lot of data off disk and returning hardly any of it. That results in higher disk utilization, and under enough load we'll eventually reach full saturation, where we end up with queuing on the disk and our requests take a long time because they're slowed down.

So one of the important things here is having the knowledge to look at a tool like nodetool tablehistograms and figure out, from the partition size, that it's okay to resize your compression chunk length. When you do that, in this case you'd want to resize it to 4, because that's the smallest possible value you can use, and your partitions are so small that what we want to do is minimize the amount of data that needs to be read off disk. So now, when we use 4, we're reading a block that was originally 4 KB and is now compressed. The trade-off is that you're going to use a bit more memory for compression metadata, so you might want to increase your heap size, and you may get a slightly worse compression ratio. However, if your problem is I/O, or your problem is related to GC pauses, this can help a lot, because read amplification on small values is one of the problems I've seen in almost every single cluster I've looked at.

And this doesn't just affect read-heavy workloads. It affects anything that does a read before a write. For example, with a lightweight transaction, we actually have to read the original data off disk, because you say IF NOT EXISTS or IF some value equals something. As a result of reading that data off disk, we have to think about how to tune our system, and how to tune Cassandra, to handle reads as performantly as possible. And typically with lightweight transactions these are very, very small records; in fact, I can only think of cases where you'd be working with a single row. So you're pretty much always going to want to resize your compression chunk length to 4 KB when you're using lightweight transactions, because the effect of read amplification is so significant.
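As a rough sketch of what that change looks like, with placeholder keyspace and table names, and keeping in mind that altering compression only applies to SSTables written afterwards, so you'd rewrite the existing ones too:

    # Check your partition sizes first (my_keyspace and my_table are placeholders)
    $ nodetool tablehistograms my_keyspace my_table

    -- Shrink the compression chunk size for small-partition, read-heavy tables (CQL)
    ALTER TABLE my_keyspace.my_table
      WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};

    # Rewrite the existing SSTables so they pick up the new chunk size
    $ nodetool upgradesstables -a my_keyspace my_table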
And it's the same thing with counters. Counters read the current value off disk and bring it back into memory. There is a counter cache, but the counter cache shouldn't be thought of as something that just papers over a slow path. We need to fully optimize the system so that even when we're writing to counters that aren't in the counter cache, we still get high performance. That's why in Cassandra 4.0 we made this change, lowering the default chunk_length_in_kb from 64 to 16: for so many workloads, this very simple change makes a significant impact.

And on the topic of read amplification, if any of you have seen anything I've written or spoken about at conferences before, I really want to make sure people understand this. One of the worst things you can have with Cassandra is one of the system defaults: read ahead. Read ahead is one of the most tragically broken features in modern systems, and it should basically be disabled in almost every circumstance, outside of something like a really slow SAN with very high latency. Read ahead works like this: when you request a particular block of a particular file, the kernel fetches extra data and puts it in the page cache. The intent is to make up for the slow latency of spinning disks and to help you keep more things in the page cache. But what it really does is put more pressure on the CPU and more pressure on your disk, and it results in overall poor performance for Cassandra. I've worked in several situations, both in my own benchmarking and in real environments, where I've gotten a 10x improvement on workloads like counters and lightweight transactions. I did the performance tuning for one of the largest gaming services on the planet, and this was the most important change they made; it helped them go from outages during major releases to everything humming along, silky smooth.

So if you're curious about whether you're impacted by this, and if you haven't made a change, you probably are: using blockdev --report, you can look at read ahead in the RA column. The RA column is generally going to say 256. That number is in 512-byte sectors, so whatever you see there, divide it by two and that's the size in kilobytes. You can also read it out of sysfs, under each block device's queue settings. To change it, you run blockdev --setra and pass your new read ahead value. I like to put a fairly small value here, like eight, which means 4 KB. The reason I do that is that you have to read at least 4 KB if you're reading off disk anyway, so it's okay to do this.
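Concretely, checking and changing it looks something like this; the device name is just an example, and note that blockdev --setra doesn't persist across reboots, so you'd typically also wire it into a udev rule or your config management.

    # RA is in 512-byte sectors: 256 sectors = 128 KB of read ahead
    $ sudo blockdev --report /dev/nvme0n1
    RO    RA   SSZ   BSZ   StartSec            Size   Device
    rw   256   512  4096          0    960197124096   /dev/nvme0n1

    # Drop it to 8 sectors, which is 4 KB
    $ sudo blockdev --setra 8 /dev/nvme0n1

    # The same setting, expressed in KB, lives in sysfs
    $ cat /sys/block/nvme0n1/queue/read_ahead_kb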
So the last thing I want to touch on is practice. Practice is really important. As you can probably tell, I've spent a lot of time looking at this stuff in detail, thinking about it, and trying to wrap my head around how it works. It's helped me get a really good understanding of Cassandra at a level that I think most people aren't willing to commit to, because I've spent so many hours doing it. The easiest way to get started is to set up a lab environment. I wrote a tool back at The Last Pickle, which DataStax now maintains, called tlp-cluster. You can use it to spin up lab environments in AWS in just a few minutes. I designed it to be pretty easy to use; I think you have to type maybe three commands and you'll end up with running Cassandra clusters that already have some additional software installed, like the flame graph tooling I just described. I'm in the process of rewriting some of this tool so that it builds an AMI with all the tools on it already, including all versions of Cassandra, so that we can get that start time down to about a minute. I'm really excited to be working on that, but it's totally broken right now, so don't use my version yet.

The other option is to set up AxonOps. AxonOps just announced their provisioning process, I think today or yesterday; you can talk to John back there. They have a really nice provisioning tool that can set up clusters and monitoring for you. So if you're just trying to tinker with something and get going, I know they're offering a small environment you can use for free. Pretty cool stuff. You just definitely need to have something you can mess with. If you've already got your own tooling, awesome, keep using it; you don't have to use anything else. But if you're looking for an easy way to get started, one of these tools is probably going to be your best bet.

And the second thing you need to do is benchmark. You need to break things. I can't say enough how helpful it's been for me to have worked on so many clusters that were broken. I know that might seem kind of crazy, but it means I had to fix problems. And when you have to fix problems and you use something like the OODA loop, you get put in a position where you ask the right questions, and then you can build the mental model to understand how things work. In my opinion, you get the best understanding of how a system works by breaking it and looking inside. So I wrote tlp-stress back when I was at The Last Pickle. It's one of the most widely used pieces of benchmarking software for Cassandra; I know of several hundred teams that are using it, and it's used to benchmark some of the largest environments in the world. There's also NoSQLBench, which is from DataStax, and NDBench, which the folks at Netflix have provided.
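To give you a feel for it, a tlp-stress run looks roughly like this; I'm writing these flags from memory, so check tlp-stress run --help for the exact names.

    # Key-value workload: 10 million partitions, 80% reads, capped at 10k ops/sec, for one hour
    $ tlp-stress run KeyValue \
        --host 10.0.0.10 \
        -p 10M \
        -r 0.8 \
        --rate 10000 \
        -d 1h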
So, in summary: it's really important that we have a good methodology and that we follow it whenever we're trying to do any sort of problem solving. It doesn't just apply to improving performance and fixing problems; it applies to dogfighting and anything else you can possibly think of. Next, we want to make sure we understand the full observability picture. That doesn't mean just your dashboards, but also the tools you want to use when you're looking at an individual machine and answering very specific questions. There's a lot more to the observability side that I didn't talk about; there was a slide on Brendan Gregg's work that I went by kind of quickly, and there was a QR code there, I hope you got it. And the last thing is practice. Practice is super important. If you don't practice, you won't be ready.

So one of the things I'm really excited to announce is that I'm now not just doing Cassandra consulting, I'm also doing training. I'm running hands-on training sessions starting Q2 of next year, where we're going to cover Cassandra fundamentals and a lot of this type of thing, figuring out what's going on. So if you're looking to learn pretty much everything about running Cassandra in production at really big scale, lots of requests, millions of requests a second, on some of the most complex systems in the world, I would love to share the information I have and help you become experts as well. Thank you.

Did I go over time-wise? Am I okay? Oh, it looks like there's one question.

Average latency is not useful for performance investigations whatsoever, because you're throwing away the worst part of the information. If you have an average latency of five milliseconds and a P99 latency of 200 milliseconds, those numbers are so far apart that you can't possibly use the average for anything meaningful. So whenever you're looking at latency, you want to be thinking about both your P99 latency, which is roughly what most customers are experiencing, and your max latency, which is your outliers.

Again, I still say that average latency is not useful. The QR code I posted on that slide leads to that talk by Gil Tene. The reason average latency isn't useful at the database level is because of birthday-problem-style math: if you look at P50 instead, as soon as one user makes two requests to the database, there's only a 25% likelihood that both of them come in at or under that P50 latency. So as soon as you start plotting averages, you're looking at a fictional world that your customers aren't actually experiencing. What you really want to do is focus on the P99.

Yes? Yeah. You're not going to want to go past 31 gigs, because as soon as you lose compressed object pointers, I think you have to go to around 45, maybe close to 50 gigs, before the effective memory is about the same again. So I never go above it. The only thing you really want to do is make sure you have enough space for the compression metadata, and you can limit that in size as well. So yeah, I wouldn't go above 30. But to your original question: yes, I want to see the effectiveness of my page cache, and whenever I make changes to my heap, I want to make sure I didn't do so to the detriment of the page cache. I think I'm out of time. Thank you, everyone.