Awesome. Thank you, Candice. Hi, John. Yeah, nice to meet you online. It's been a while. I guess we should start off with a quick intro. So do you want to go first, John — your experience around Cassandra?

Yeah, yeah, sure. So I first got started with Cassandra about 10 years ago, putting version 1.1 into production. I think I was the first person to use CQL in production. I went to DataStax for a little while as a technical evangelist — that's where we met. After that I went to The Last Pickle and did consulting for a few years, then to Apple, where I worked on the Cassandra 4.0 upgrade. I spent the last year and a half at Netflix, and now here I am, back consulting, this time for myself, just helping people with all sorts of Cassandra issues. How about you? What are you up to?

Thanks, John. That's a great intro. And I've just got to say to everyone that John has amazing experience around Cassandra — he's seen some of the biggest clusters in the world. He should be talking himself up a bit more, in my opinion. As for myself: software engineering background, moved into DevOps, and came across Cassandra in 2009, putting it into production. I joined DataStax for a little while, then went out and set up my own consultancy and managed services company around Cassandra. We also built tooling around Apache Cassandra called AxonOps, and I'm one of the co-founders.

All right, with that, let's move on to the five steps for rock solid Apache Cassandra. Yeah, man, let's dig in. Yeah, let's dig in. All right, absolutely. So there are potentially a lot of steps that someone could break this down into, but these five here are the core steps we want to get across.
We're going to talk a little bit about infrastructure assessment — give you a ballpark of what your hardware should look like and help you understand your hardware requirements in general. We'll talk a little bit about configuration. This is super important: understanding how your operating system and your applications are configured. We're going to be talking about minor changes in application behavior, and some of these things can have a huge impact. That of course leads to observability: how do we actually determine if we're using our hardware correctly, or if something is misconfigured or not tuned correctly? We're going to talk a little bit about automation, along with some examples of things you can automate — maybe even share a story about what happens when you don't automate and things go wrong. And of course we'll be touching on security, because this is super important and absolutely critical, especially when you have a system like Cassandra, which can go across multiple data centers — you might not necessarily just be dealing with a private network. Pretty important stuff. Did I miss anything on there?

No, no, that's perfect, John. And I'd say we're going to talk about security first. And I'm going to move on to the... oh, actually, we need to talk about why people are using Cassandra. What are the main reasons, John, that big enterprises, corporations and tech companies are using Cassandra out there?

Yeah, there are a few good reasons, it turns out. To start out, one of the things that's absolutely necessary is security — I just mentioned it, super important. You can't have a database if you're not able to lock it down and know that things are going to work properly and people aren't just logging in. That's not great. But I think the core thing about Cassandra that people really latch on to is its resilience. I already mentioned multiple data centers — being able to run in multiple data centers across the country.
And being able to have that resilience — to withstand network partitions and to make the choice between availability versus consistency — that is an incredibly powerful tool that other databases really don't offer. There are very few options when you have those uptime requirements.

On the performance side — this is something you may have seen in things I've written in the past or videos I've made; I'm a big performance guy, and we're going to talk about this a little bit — Cassandra has incredible performance if you know the few things you need to change out of the box. It's deceptive in the way it performs: if you know what you need to tune, it can be absolutely incredible.

The community is another great reason. The community is probably one of the major reasons why I didn't just get involved in Cassandra, but stayed involved in Cassandra. There was such a good community. Back when I started, the community used IRC for everything. I met a ton of people through IRC and made a whole bunch of friends in the community.

And it's proven. Cassandra is being used in some of the biggest database deployments in the world. Apple has something around 200,000 nodes. Netflix is running tens of thousands of nodes. There are other companies running thousands of nodes. Banks are using it; insurance, oil and gas, the energy sector — everybody is running Cassandra at this point. It's amazing tech, and it keeps getting better. Did I miss anything? I feel like I sold it — I hyped it up a lot. Did I miss anything there?

No, that's absolutely perfect. As you mentioned, some of the biggest tech companies are using Cassandra — that speaks for itself, right? We've been involved with some of these large companies and large deployments with Cassandra, and this tech just works. It reminds me of something I did when I first started using Cassandra.
Back in 2009, we set up a small cluster on a set of office PCs, simulating a data center, and then pulled power cables out just to test the resiliency. I was just gobsmacked, right? That's why I fell in love with the Cassandra technology at the time — and deployed it into production without telling anyone. I apologized later. That's the sort of technology it is.

I did something similar. I was just deleting data files, and I was like, it keeps working. That's weird.

That's it. Let's talk about security. This is one of the steps in implementing a rock solid Cassandra cluster. These settings are interesting because they're best done up front. I'm going to say right now that if you're thinking about using Cassandra, this is definitely something you want to do up front — it's a lot easier to set it up correctly in the first place than to change it later on.

Especially since point number one here is encryption. We have internode encryption — that's the Cassandra nodes communicating with each other. And we have client encryption, from your application's driver to Cassandra itself; that needs to be encrypted as well. Getting the driver to talk to Cassandra over an encrypted connection isn't the hardest thing in the world, but it is a little error prone. You do need to be careful, and you need a runbook. It's not something that's fun to do, so you definitely want to get this set up first.

Then, of course, there's authentication and authorization. That's standard database stuff. Best practice is to make sure you're always using authentication and authorization to control who is connecting to your database. And then there's auditing: we need to be able to go back in time and look at what happened — see who connected, see what's going on, get a little more information when we know that in the past something happened or didn't happen. Is there anything you want to add to this?
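As a concrete illustration of the two encryption layers just described, here is roughly what the relevant cassandra.yaml fragment looks like. Option names are as in recent 3.x/4.x releases, but check the reference cassandra.yaml for your version; the keystore paths and passwords here are placeholders, not recommendations:

```yaml
# Node-to-node (internode) encryption
server_encryption_options:
  internode_encryption: all            # encrypt traffic between every node
  keystore: /etc/cassandra/conf/server-keystore.jks
  keystore_password: "<keystore-password>"
  truststore: /etc/cassandra/conf/server-truststore.jks
  truststore_password: "<truststore-password>"
  require_client_auth: true            # mutual TLS between nodes

# Client-to-node (driver) encryption
client_encryption_options:
  enabled: true
  optional: false                      # reject unencrypted client connections
  keystore: /etc/cassandra/conf/server-keystore.jks
  keystore_password: "<keystore-password>"
```

The driver side then needs a matching truststore; doing this before go-live avoids the error-prone rolling reconfiguration mentioned above.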
I know we talked a little bit about some aspects of SSL when we talked earlier. It's very important — TLS version 1.2 or higher is the expected baseline these days. Going back to authentication and authorization: with authentication you're thinking about database drivers connecting to the database, or even users with a browser accessing the database, and you've got to have the right roles and permissions and so on. The latest versions of Cassandra now also have a feature for internode connectivity — the nodes themselves can have authentication enabled as well. That's something to think about.

The other side of things, as John mentioned, is audit logging, which is also quite a new feature in the latest versions of Cassandra. We always craved this feature in open source Cassandra in the past, because enterprises always need the DML, DCL and DDL type audits in the logs. It's now native to the later versions of Cassandra, so do enable it.

In terms of SSL, it's something you should implement, but there are some implications around it. I've come across customers whose SSL certificate expired, and I had to actually go into the Cassandra code and modify it to ignore the expiry in order to fix the cluster — because if you restart a node, it can't rejoin the other nodes anymore once the certificate has expired. Things like that should be monitored. Here's an example of a check — an SSL-certificate-expiry check on Cassandra. You should have this kind of monitoring in place for when your SSL certificate is about to expire.

That's definitely not a fun surprise, especially since I've found in the past that with these types of problems, very rarely do you do a rolling restart just for the sake of a rolling restart.
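The expiry check described above can be sketched in a few lines of Python. This is purely illustrative — not any product's implementation — and the host, port, and warning threshold are placeholders; it connects to a node's TLS port and reports how many days remain on the certificate it presents:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Days until a cert's notAfter timestamp (format used by ssl.getpeercert)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

def check_node_cert(host: str, port: int = 9042, warn_days: int = 30) -> int:
    """Fetch the TLS certificate from a node and warn if it expires soon."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()  # parsed peer certificate as a dict
    remaining = days_until_expiry(cert["notAfter"])
    if remaining < warn_days:
        print(f"WARNING: {host}:{port} certificate expires in {remaining} days")
    return remaining
```

Run something like this from cron against every node (9042 is the usual client port; internode TLS would need its own check) so an expiring certificate becomes an alert rather than a node that silently can't rejoin.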
Maybe you end up doing JVM tuning, and then you do a restart, and then the node can't connect to the rest of the cluster, and you're trying to figure out why it can't connect. And our brains — because we sat there and just did JVM tuning — are thinking, well, it must be that; nothing else changed. You're not thinking about the certificate expiring. So if you don't know where to look, and you don't think about it, it's really easy to just not notice it and not be aware that it happened. These types of things can chew up a lot of time because they're not obvious. It's weird sometimes, and you're just like, okay, what is going on here? So now you've got two problems: you're trying to do your JVM tuning, and you have to know when your certs are expiring.

All right, so let's go back a few slides — sorry about that — and back into the slideshow. I think that's a good takeaway there, John. So I'm going to move on to the next step, one of the five steps: infrastructure assessment. This is really important for databases, right?
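As a small taste of what an infrastructure assessment involves, storage latency can be probed even with a few lines of Python. This is a rough sketch only — it reads through the page cache, which a real benchmark must avoid; fio, which comes up below, is the proper tool:

```python
import os
import random
import time

def random_read_latency(path: str, block_size: int = 4096, samples: int = 200) -> dict:
    """Rough p50/p99 latency (microseconds) of random block reads from a file.

    Illustrative only: reads may be served from the page cache, which fio
    avoids with direct=1, so real device numbers will look very different.
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    latencies = []
    try:
        for _ in range(samples):
            # Pick a random aligned-ish offset and time a single read.
            offset = random.randrange(0, max(size - block_size, 1))
            start = time.perf_counter()
            os.pread(fd, block_size, offset)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    latencies.sort()
    return {
        "p50_us": latencies[len(latencies) // 2] * 1e6,
        "p99_us": latencies[min(int(samples * 0.99), samples - 1)] * 1e6,
    }
```

Point it at a large file on the volume you care about; the shape of the distribution (p50 vs p99) matters more than the absolute numbers, which is exactly the kind of thing fio reports properly.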
Remarkably important — databases tend to stress hardware resources quite a bit. And this is pretty interesting on the Cassandra side of things, because I've worked in a few places where teams have just kind of been given hardware, and they either don't know what it can do, or it might not be optimal for a Cassandra deployment. If you have the ability to determine what kind of hardware you're getting up front, you're going to be a lot happier.

Cassandra doesn't really perform that much better as CPUs scale up. If you throw 64 CPUs at Cassandra, it's not actually going to take advantage of them that well. You'd rather have multiple smaller boxes than try to do vertical scaling — and your money actually goes a lot further when you do that anyway. It's the same thing with memory: you don't necessarily need 500 gigs of memory on a box. You really just want to make sure you have enough memory for the JVM and to satisfy your reads out of cache. The better your cache hit rate, the faster your responses are going to be, and the more you get out of the machine, because you're going to do less IO.

And on the topic of IO, that's where we've got storage. The last couple of decades have seen huge advancements in storage. Nobody deploys spinning disks anymore — there's no point, they're horribly slow. At a minimum now we're talking solid state drives, and really everybody is shifting to NVMe drives. NVMe drives — Hayato, I know you've worked with these too — are absolutely phenomenal. When you can get two gigs a second of throughput at 100 microsecond p99 latency out of your storage, it completely removes storage as the bottleneck for a database. It's absolutely amazing. I know you had something you wanted to show on this — do you want to talk about it for a minute?

Yeah, absolutely. I think for the standard size of a Cassandra node that you might want to put in, generally speaking, in
production, for a database you're talking about a minimum of 8 cores, and possibly up to 32 cores. One of the problems with going bigger is that you've got way too many cores and you can't leverage them, right? So — let me try a screen share. There we go... my machine is freezing, unfortunately. Can you hear me? Yeah, but you're breaking up a little bit. I'm going to stop sharing for a minute. Can you see my screen? I just see you right now. All right — the latest release of Zoom has been quite problematic. Let's try to share again... there we go. How's that, can you see? Yep — infrastructure assessment. All right, I'll just stay on that slide for a minute, then.

So I've come across customers who bought really expensive kit — and when you're deploying a large cluster it does cost quite a lot — and essentially they were leveraging about 2% of the CPU. They'd bought this expensive kit just in case. So one of the things you need to do is benchmark your hardware before committing to spending a lot of money. Obviously cloud makes it a bit easier, because you can change instance types very quickly, but you still need to benchmark whatever hardware you've got. Now, do you want to talk a little bit about some of the benchmarking tools, John?

I think if we're going to talk about one — because typically the thing we benchmark is storage, right — this is probably one of the most useful tools I've ever used. It's called fio. You can write configuration files to simulate basically any workload; it's used to benchmark the IO subsystem. It's a very, very mature, excellent tool, and it's what I use whenever I'm setting up new hardware or a new environment. If I'm not familiar with an environment and I start taking a look at
it, I like to get an idea of what the hardware is actually capable of. I think people typically have a bit of a disconnect from what's actually going on at the hardware level when they don't know what the capacity is, or how to measure what it's actually doing. So bringing benchmarking to your hardware is super important, and you want to benchmark every step along the way — which is why we say benchmark everything. Benchmark your hardware, then benchmark the database, then benchmark your application and how all those things integrate. You can think of it as layers of benchmarking. Getting each of those is really important, because you'll never really understand what your hardware is capable of if you start with application-level benchmarking — it probably won't stress your hardware all the way. So it's important to do the hardware first.

On the network side — sorry, John, we've got a question from Pedro asking about the benchmarking tool. Yeah, it's called fio, and we've put a link to it in one of the slides at the end, but I'll type it. Yep, there it is. Hey, look at that — community, everyone's involved. fio is an incredibly useful tool. I've developed some configs for it which simulate Cassandra-like random IO — a configurable number of random readers, plus sequential readers and writers to simulate compaction. It's pretty cool; you can do a lot with it. And when you start running a database and you look at the numbers, you can see how they compare to what the hardware can actually do.

On the network side — we didn't really touch on this yet, and I won't go into it in super detail, but iperf has been a really useful tool for testing out networks. If you want to know what throughput and latency you're getting on a network, especially if you're looking at a WAN, it's really, really important, because I think a lot of people tend to underestimate how long it actually
takes data to move from DC to DC. They think things are completely instant, but in reality everything's not perfect — the network isn't that reliable. So it's a really good idea to test this stuff out ahead of time, before you get into trouble.

Yeah, absolutely. I've come across a Cassandra cluster deployed on a one-gig network, and it's fine for most circumstances, but when you're restoring from backups, or when you're repairing a node from other nodes, the bandwidth becomes the bottleneck. So it helps to have that extra bandwidth at restoration time and so on. The other thing I've seen is a Cassandra cluster deployed on VMs with a SAN backend. There was one instance quite recently where the cluster had just been deployed; they benchmarked one VM and said the hardware was fast enough, but the cluster just did not perform well with the IO. It turned out the SAN backend had a bottleneck with all the Cassandra nodes hitting it at once. I asked them: did you run the benchmark on all the nodes at the same time? And obviously the answer was no. It turns out that a problem with the SAN will always be a problem with Cassandra as well.

Yep. I just recently ran into some SAN questions myself. A team was using a SAN, and it was basically a similar type of issue — there wasn't any benchmarking done, there were no numbers whatsoever; it was a giant question mark. What can this do? Do we need to expand? There were a couple of other issues at play — GC issues and some misconfiguration and a whole bunch of other stuff. It can be tricky: if you don't know how to look at your hardware and understand it, you're basically just throwing darts with your eyes closed. It's really hard to get it right if you don't know how to measure it. Which actually brings us to our next topic, I believe.

Indeed — configuration planning. There's quite a lot around configuration that we like to talk about, and John and I have gone through many different configurations with our
customers, and from our experience these are some of the points we want to make here. Do you want to talk a little bit about the Cassandra configurations?

Yeah — and unfortunately this can be tricky. There are a lot of configuration tunables in Cassandra, and knowing which ones to tune can be really hard. It's especially tricky since Cassandra effectively ships with configs that are really good if you're going to be developing Cassandra itself — making changes to Cassandra on your laptop. At most, they're going to give you enough throughput that you're not going to completely lock up your machine. But you bring these to production and — we were talking about Cassandra performance; I've done a ton of Cassandra performance tuning, and the number of times I've seen the stock config put into production without taking full advantage of the hardware... it's almost every cluster I've ever had to look at. It has the same problems. The Cassandra configs aren't meant for prod.

Then there are the OS settings that come out of the box. If you just install Ubuntu, you're going to get a few settings in there which aren't optimized at all — they're actually terrible for Cassandra, and the biggest one is read ahead. When you issue a read to your disk, read ahead grabs way more data than you need. By default it's set to 128 KB, so any read operation you do to your disk pulls in extra data — that's called read amplification, and it can be a problem. Suppose you only need 4 KB off disk, or 1 KB — why do you need to read 128? You end up reading so much more off your disk, and you end up fully utilizing your disk without actually doing anything with that data — you're just churning it through, bringing it into memory for no reason.

Then the JVM configurations. I believe we just fixed this recently, in 4.1, but
Cassandra for the longest time shipped with JVM settings that were not great at all. It used ParNew and CMS with a really small new gen, which caused frequent GC pauses — and then the pauses would end up taking a long time, because there was a lot of memory that had to be copied around. It ends up being a terrible experience, and you don't want that. You definitely want to give it as much memory as you can, up to 31 gigs, and it works out much better.

Then there are the Cassandra configurations themselves — we were just talking about this, the concurrency options in the Cassandra config. I think you have something you can show us about where the defaults are with this?

Yeah, let me see if my Zoom is working again. So let's take a look at the Cassandra configuration — this is what's in the cassandra.yaml file, by the way, just presented in a nice tabular form. I'm just going to type "concurrent"... so there's a whole bunch of concurrency-based parameters in here, and the defaults are only suitable for a development environment where you're not going to put much load on it. Most people forget to change these parameters, and what happens is they push a lot of load onto Cassandra, but because these thread pool sizes are set to quite low values for development purposes, you're not leveraging that expensive kit — your CPU utilization isn't going high. Well, it's because you haven't let Cassandra do so. Things like concurrent_reads, concurrent_writes, native_transport_max_concurrent_connections — there's a whole bunch of parameters that you need to change for production, and as we said on the slide, the defaults are bad. Have you seen these kinds of occurrences, John?
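For reference, these are the kinds of cassandra.yaml knobs being discussed. The values below are illustrative starting points only, not recommendations — the right numbers depend on your core count and storage, so benchmark before and after changing them:

```yaml
# Thread pools for reads/writes -- the stock defaults (32) are sized for
# development, not production hardware.
concurrent_reads: 64                 # often raised well above 32 on NVMe
concurrent_writes: 128               # commonly sized relative to CPU core count
concurrent_counter_writes: 64

# Client connection handling on the native protocol
native_transport_max_threads: 256
native_transport_max_concurrent_connections: -1   # -1 = unlimited
```

The observability discussion that follows is exactly how you verify whether a change to any of these actually helped.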
Everywhere — this is 100% of the clusters I've worked on. Like I said, people are just putting the stock YAML into production without realizing it, and it's an easy mistake to make. I mean, everybody does it — that was my first cluster I put into production, I used the defaults. That's how we figured out the JVM settings didn't really work. I even have blog posts with fellow committer Blake Eggleston going back almost 10 years, talking about how to optimize ParNew and CMS for read-heavy workloads. CASSANDRA-8150, I think, was the old Jira about moving to G1 GC — people were talking about it forever. It's a contentious topic. Changing any config default becomes difficult, because how do you make sure you're not tanking someone's performance at the cost of the benefit of improving someone else's?

So yeah — going through and making sure these numbers actually make sense, and having a process to understand what the difference is when you make a change, is super important. And that's where we start talking about observability. That leads us into the bigger observability discussion: how you can tell whether the changes you're making in your config are actually having the expected effect.

One last thing I wanted to bring up here: table configurations. I talked a little bit about read ahead and how it pulls extra data off disk for you. In the broad category of read amplification, there's also a table-level configuration that matters to you. When you describe your schema, you'll see WITH compression = and then the compression class and the chunk length in KB. The chunk length in KB is the size of a buffer that gets filled up with data, compressed, and then written to disk along with some other metadata. Whenever you need to do a read, it has to read in that whole chunk and decompress it. By default, for the longest time, Cassandra shipped with a 64 KB chunk length — so no matter what, you always have to read that whole chunk. Again, my example from before: if you only need one or two K of data, whatever, you still have to deserialize the whole thing, so there's a lot of overhead associated with that.

What I found, which is really interesting to me — and I've been sharing this with everyone who will listen — is that fixing those two things, the default read ahead setting and the default compression setting, gets you a 10x improvement in throughput and latency on a lot of workloads. Add a bit of decent JVM tuning, and the improvement in cluster performance can be absolutely remarkable. And that directly translates to cost. I think when people hear performance tuning, they assume I'm trying to get from 10 milliseconds of latency to 8 milliseconds. I've had instances where we've gone from hundreds of milliseconds of latency down to single-digit milliseconds. It's that big of a deal. You go from storing 200 gigs of data per node to one or two terabytes per node, and now we're talking about a humongous cost reduction. So for me, what we want to get out of proper configuration and performance tuning is huge cost savings — and being woken up in the middle of the night less frequently.

Thanks, John. Yeah, so one takeaway from here is that the defaults are bad, right? Do go through them, and do take away these recommendations — the read ahead and the compression chunk size make a huge difference to your performance. All right, the next slide is about my favorite topic, which is observability. So, John, what do you say about observability?

I love observability — this is actually one of my favorite topics. And observability is interesting, right? It's one of those things where, with the idea of just having dashboards, I think sometimes people say, we have dashboards,
therefore we have observability — and it's not quite right. Just having them doesn't mean anything. Being able to interpret the information, being able to build a mental model and get a proper understanding of what's happening, starting with your hardware up to the application — that's really, really important. You don't just measure one thing and then determine everything that's going on from one measurement. In the same way that we talked about benchmarking having layers, observability has layers.

So it's really important to understand what your hardware is doing. When it comes to trying to do a performance analysis, the first thing you want to do is understand what your hardware is actually doing. In the same way that we benchmarked it before using fio — here, when we're looking at hardware, we want to be able to look at the latency and the throughput of the underlying storage, and we want to have a methodology that we follow to build that mental model in our heads of what the hardware is actually doing. This is where I usually recommend Brendan Gregg's USE method. Basically, for any given aspect of your hardware, you look at the Utilization — that's the percentage used, 0 to 100; the Saturation — is there a queue on it, are there requests sitting in your storage queue before they actually get submitted to the device; and the Error rate — because it doesn't matter if you're getting 100 microsecond p99 latency if 80% of your requests are errors. That's not useful. And I have actually seen that before: an old kernel with NVMe drives — it turns out they didn't really behave correctly, and so the team had Cassandra on NVMe drives and the performance was terrible, because they were using an old kernel version.

And then we're not just talking about
looking at the hardware — we actually want to see warnings and errors and be able to correlate those with the different events that happen. So you need a nice system that lets you search within periods of time, and that's where I like Elasticsearch or OpenSearch mixed with Kibana — and that's open source, which is fantastic. You have to have all your logs aggregated, and ideally you pre-build searches, so you don't have to think too much about the problem of how to get the information you need to solve it — you want it presented right there, so you're literally just putting information into your brain.

And one of the things I have found is that it's not enough just to have dashboards. Usually, if something's going wrong, we identify a node that's a problem, and that's where — I don't know about you, Hayato, but for me, I have to keep my brain in either programmer mode or sysadmin mode, and when you start talking about observability, what's going on in the Linux kernel, thinking about latency, you need to be in sysadmin mode. If you try to force a programmer mindset on this, it can be problematic, because a sysadmin knows to use the right tools. A sysadmin will use iostat to take a look at what's happening with IO in the background, mpstat to look at multi-processor stuff. And the newest tools, which are incredibly useful and I really want to stress, are the BCC tools. They use the BPF kernel machinery, and you can get a ton of really, really interesting information: what your device latency looks like at the block level, slow file system accesses — you can watch every file operation. Absolutely amazing tooling. These have really changed the way I work, because I've been able to spot issues that I couldn't before. And
Then I'll be honest with you, I have one thing that has been my secret sauce, although I'm not going to call it that because I'm pretty public about it: flame graphs. I have found that when it comes to understanding what's going on with any system, any database, or any application, I usually reach for async-profiler and pull up a CPU flame graph. It takes CPU samples of the running stacks and gives you a nice visualization of where the CPU is spending its time. If you haven't made any configuration changes, and you haven't tuned compression for example, you're going to see that compression is spending a ton of time on CPU. I think we have a flame graph that we can show, just to get an idea of the visuals here. Do you mind bringing that up? Yeah, there you go. So when you run into a performance problem, this is my go-to right now. I'll run it over 30 seconds or a minute, and you look for the big bar, because that's where the time goes. A lot of the time it's that easy to fix some of these issues, or at least to confirm your suspicions, because with compression a lot of folks will ask, how do I know that compression is really the problem? It's not enough to show that it's doing a lot of work; you need to be able to say, hey, I can demonstrably show you, because making this change is going to cause Cassandra to rewrite every SSTable. That's a lot to ask of somebody, and this helps give you the insight that you need in order to do the job.

And there's an amazing set of open source tools out there, like Grafana and Prometheus, to capture all of the system and Cassandra JMX metrics, as well as the ELK stack you were talking about, Elasticsearch, Logstash, and Kibana, to capture the events and logs. These are the sorts of things that you need to do. At AxonOps, our commercial tooling puts all of those pieces together into a single pane of glass, so we have laid out a very curated set of dashboards for people to be able to analyse the platform and how it's performing, and to identify where the problems are. I was showing one of the charts to John, and he said, I wish I had that to solve this problem. It's this chart here. John, what was the issue you had recently?

Yeah, I absolutely love this one. I was working with a team, and basically at some point in the middle of the night load would just appear out of nowhere. They'd get a huge spike of requests, and they weren't sure where it was coming from, and unfortunately the right monitoring wasn't in place. If you don't have the monitoring in place up front, there's not really a great way of going back and retroactively figuring out where the traffic came from. So then it's, okay, what do we do? Does it happen every night? What if it doesn't? Who's running this job? How do we track it down? What do we stay up to watch? You can tell what table or keyspace something is writing to, but in my case I was dealing with a multi-tenant cluster with hundreds of tables and dozens of teams all using the same set of data. As a best practice I generally tell people not to do that; I don't like having everybody connecting to the same thing, because it makes things really hard to understand. You could see there were a lot of reads on a table, but we had no way to determine where they were coming from. So this here would have been amazing: being able to look back at the IPs and track down where that traffic came from would have saved us an incredible amount of time. It was such
a headache. That kind of information is available; you just need to extract it and present it in a way that's easy to understand, and that's what we've done for our users out there. I'm just a little bit conscious of the time, so I'm going to move on to the next topic, which is automation, again another one of my favourite topics. John, have you got any key takeaways for automation?

Yeah, absolutely. The reason why automation is important: going from the first thing that you do all the way to the end, if you can automate everything, that's the utopia. If we're talking cloud provisioning, do you want somebody to go into the AWS console and click around to get you a new Cassandra box, and hope they did it with all the right settings? Or do you want tooling like Terraform, or home-written tooling, that will properly get you the right instances? It can look for spot instances, and it attaches the right drives with the right settings, with everything the way that you want. Your infrastructure has to be homogeneous, all the same, because otherwise it's really hard to figure out when things go wrong. It's the same thing with config management. I think a lot of folks will use tools like Puppet and Ansible, but what I've found is that there are actually a lot of teams that manually manage config files. I recently worked with a team that had several hundred nodes, and when they wanted to make a change, they actually raised a change ticket and someone would go in node by node and make the change. Sometimes they missed things, nodes ended up different, and it was problematic. You definitely don't want your configs to be wildly inconsistent; you have to have that stuff managed for you. And it's the same thing with your actual operations, like repairs. People used to put repair on cron, and you'd run into weird race conditions and timing issues. Now, if you want to manage repair, you can do it through the open source tool Reaper. When I was at The Last Pickle we adopted it from
Spotify, who had written it originally, and Reaper is amazing for managing repairs. It will make sure that it doesn't repair too aggressively; it repairs small chunks over time. It's really great, and I know AxonOps has something similar. Can you pull that up and show it to us?

Yeah, absolutely. We're talking about automation of repairs and backups. Repairs-wise, you just really don't want to be dealing with repairs by hand if you don't have to; there are well proven methods of repairing Cassandra clusters now. As John was saying, Reaper is the open source tooling that came out of Spotify. I actually used to visit the Spotify office, consulting on the Cassandra databases they had, and they told me they had developed this Reaper tool. I asked why it's called Reaper, and it turns out it was the Swedish pronunciation of "repair" that was misheard by an English speaker, and then it got named Reaper. So that's a bit of fun knowledge there for you. Anyway, repair can be quite invasive to your Cassandra cluster, comparing data across all of your nodes, but it's something that you still have to do. Adaptive repair in AxonOps takes care of that very gently, without affecting the performance of your Cassandra cluster, and slowly repairs across your tables in a scheduled way. It also detects whether your cluster is busy or free to do more repair work, so it speeds up and slows down accordingly; it throttles the repair process automatically for you.

The thing I like about the adaptive repair is that if you think about the work a cluster is doing, it typically isn't constant. There are always spikes, there are always valleys, and trying to plot your repairs around the spikes and valleys manually is really hard, especially if folks are deploying new things. I just talked about that random traffic that appeared that we weren't expecting. If that happens at the same time as repairs running, a really aggressive repair schedule combined with a huge
workload at the same time can really harm a cluster. It results in a lot more allocations, which causes more GC pauses and longer GC pauses, and then you add a big workload on top of that, and now you're basically fighting for resources. I really like the model where you don't have to think about when repair is running. It just runs when it can be fit in; it's almost like cooperative multitasking. The workload on the cluster has dropped, so now we'll do the repair. From a performance standpoint that's nice, because while you need the repair done, you certainly don't need it done this minute; it can be any time over the course of the day.

Same again with backups. Databases have to be backed up, and I've seen so many people who don't back up their databases. I came across one particular customer who didn't back up and ran a cleanup script against the production database, and they lost all their data. I actually helped out looking through the file system trying to recover the deleted files; that was an interesting challenge. But you don't want to get into that situation, so always back up your data, and automate it. There are many tools out there: Medusa is a great open source tool, and AxonOps has a graphical way of managing your backups and restores and so on. So do automate your backup process.

Yeah, and one thing to remember is that backups aren't just about environmental catastrophes. I think it's really easy to think about backups as, well, if the data center disappears in a tornado then I need a backup, but that's really not what most backups are for. There are very few complete DC destruction events; data centers are fairly reliable that way. But there are a lot of mistakes where you type something in. Let's take one from personal history. Before I got involved with Cassandra, I was working on a database upgrade, and I was SSHed into a whole bunch of machines, actually every production database, and my local machine, all in tabs
in iTerm. And I did "send to all tabs", where you type in one tab and the same command goes everywhere. A few minutes later I did an `rm -rf *` locally, because I had some files that I wanted to get rid of. Guess what? I deleted the data files from every single production database. Just idiotic, right? The moment you do it, your stomach drops, and mine did; I knew this was bad. Fortunately I had a backup, but I hadn't tested the restore strategy, so at that point I had to figure it out. I wish I had automated the restore, and this is super important: when we talk about automating operations and automating backups, you have to automate the restore too. From personal experience I will tell you, when you need to do a restore is not the time to be figuring out how to do a restore. That needs to be done ahead of time, and you need to practice it. You should ideally be restoring from one cluster to another. Say you have load tests that you want to run; that's a really great use for backup and restore, selectively doing a few tables. I think you were showing me that AxonOps can do individual table restores and backups, right?
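To make the idea concrete, here is a minimal sketch in Python of what an automated, verifiable backup and per-table restore can look like. This is not how AxonOps or Medusa work internally; it is a toy file-copy model with made-up names (`take_backup`, `restore_table`, a `manifest.json` of checksums) purely to illustrate the principle that a restore should be scripted and checked, not improvised under pressure.

```python
import hashlib
import json
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum a file so a restore can be verified against the backup."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def take_backup(data_dir: Path, backup_dir: Path) -> None:
    """Copy every table's files into the backup and record their checksums."""
    manifest = {}
    for f in data_dir.rglob("*"):
        if f.is_file():
            rel = f.relative_to(data_dir)
            dest = backup_dir / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(f.read_bytes())
            manifest[rel.as_posix()] = sha256(f)
    (backup_dir / "manifest.json").write_text(json.dumps(manifest))


def restore_table(backup_dir: Path, target_dir: Path, table: str) -> list[str]:
    """Restore only one table's files, verifying each file as it lands."""
    manifest = json.loads((backup_dir / "manifest.json").read_text())
    restored = []
    for rel, digest in manifest.items():
        if not rel.startswith(table + "/"):
            continue  # selective restore: skip every other table
        dest = target_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes((backup_dir / rel).read_bytes())
        if sha256(dest) != digest:
            # Fail loudly rather than silently restore corrupt data.
            raise ValueError(f"checksum mismatch for {rel}")
        restored.append(rel)
    return restored
```

Because the restore path is just a function, it can be exercised on a schedule, for example restoring last night's backup into a scratch cluster before a load test, so that the first real restore is never the one run during an incident.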
That's right, yeah. You go to restore, and you can see a history of all the backups. You click on one, pick a node, then pick a table, and do your restore. It's as easy as that.

Yeah, this would definitely have saved me. I would have much preferred to have that than figuring out restores on the fly, asking myself, now how do I do this? That was scary.

That's exactly why we built it. I think we're getting towards the end of the session, so I guess we can talk about the summary and takeaways. We did talk about hardware selection: benchmark your kit. Update your default settings, because generally the default settings are not optimized; the Linux you download from the internet is designed for a general workload, and the Cassandra you download is configured for a development environment, so you do need to fix those. Implement monitoring, absolutely; it will help you diagnose issues in the middle of the night, when the problems always seem to happen, and when you get woken up with blurry eyes you need to be able to identify issues very quickly. Automation of repairs, backups, and configuration management needs to be done to make your Cassandra platforms rock solid. And security: you've got to secure your data. It should be sitting in the deepest, darkest corner of your data center, where nobody's allowed to touch it, encrypted and so on. So secure your data.

Yeah, love it. Look at us, on time, basically. For this last bit of info, I think some of the links to the tools that we were talking about are here. I believe the slides are going to be shared, but feel free to take a screenshot. Anything else, John? Our contact details are there too. Yeah, I think we've just got Q&A left, so we'll hand over to Candice from the Linux Foundation.

Thank you so much, Hayato and John, for your time today, and thank you everyone for joining us. As a reminder, this
recording will be on the Linux Foundation's YouTube page later today. We hope you join us for future webinars, and have a wonderful day. So, we have some questions.