So, welcome everyone to my presentation about how Cassandra 4 helped us cut costs. We will be talking about money, because everyone wants to save some money, to get a pay rise or something like that, and to not let Bezos fly to space one more time and make, for example, another weird movie.

My name is Robert Struppio. I was at the first Cassandra Summit in 2014; my first Cassandra cluster was built around then too, along with my first Cassandra production failure and also my first success. I am a Java developer, architect, DBA, DevOps, and finally also a FinOps. So who is a FinOps? A FinOps is the person who plays with Google Cloud, with AWS, with Azure to pay as little as possible: looking for leftovers, cutting costs wherever possible, analyzing Cost Explorer day by day, advising on committed-use discounts, Savings Plans, Reserved Instances. And finally, a FinOps should save more than they earn. That is the general rule, though sometimes it works out differently. A FinOps also looks for weird correlations, like here: sometimes we find something, sometimes, like here, we don't. Okay, let's go further.

This presentation will focus only on AWS, because I have 25 minutes and I have to fit into that. I know everyone wants to go to lunch and eat something, so I will speed up and talk only about AWS; I don't have enough time to cover Azure or GCP, sorry to the GCP fans.

Cassandra can be deployed on-prem, in the cloud, or hybrid, everyone knows that, and we can also choose a database-as-a-service like Astra, Keyspaces or any other, or just self-host it on bare metal, virtual machines, or containers. We will focus only on self-hosted Cassandra. So where is the cost in self-hosted Cassandra? Mainly it is in compute, so CPUs and memory; in storage, which can be ephemeral or non-ephemeral; in the network, of course; and some other things.
And of course the DBA tools. So, CPU and RAM: why is it so important to go to Cassandra 4 to save some cost on CPU and memory? Cassandra 4, as everyone knows, supports Java 11, but why does Java 11 matter? What is new in Java 11? It is too long to read here, but in short: Graviton needs Java 11 to run at full speed. There is an article about that, the source is on the slides; on Java 8, Graviton does not reach full speed. So if we want to cut costs, just use Graviton, the ARM CPU made by Amazon. The third generation of Graviton has DDR5 RAM. We don't have the NUMA issue, non-uniform memory access, because it is just one socket; there is no Graviton 3 machine with more than one physical CPU.

If you want to go with Graviton, also go with a new OS, for example Amazon Linux 2 or Ubuntu 22.04, because on old legacy kernels some things, for example the CPU caches, are not visible. So if you want the full speed of Graviton, you should choose Cassandra 4, because that gives you Java 11, plus the newest Linux.

And finally, in our company we switched from old Intel machines to the newest Graviton. After that we realized that the third-generation Gravitons are so fast, thanks to the DDR5, that we could even cut the number of CPUs and the memory in half, and there is the final result of the cost on AWS. I had to blur the numbers, but you can see we are switching, for example, from r5d.2xlarge to r7g.xlarge, and so on.

Okay, so storage: where can we save some money on storage? Basically, on AWS we have two possibilities: ephemeral drives, which are local SSDs, and non-ephemeral, which is EBS, Elastic Block Store.
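To make the shape of that saving concrete, here is a rough back-of-the-envelope sketch. The on-demand prices, node count, and instance pair are illustrative assumptions for this example, not numbers from the talk; check the AWS pricing page for your region.

```python
# Rough sketch of the Graviton migration saving described above.
# All prices and the cluster size are ILLUSTRATIVE assumptions.

old_hourly = 0.576   # assumed on-demand price for an x86 node (e.g. r5d.2xlarge)
new_hourly = 0.2142  # assumed on-demand price for a half-size Graviton node (e.g. r7g.xlarge)

nodes = 12           # hypothetical cluster size
hours_per_month = 730

old_monthly = old_hourly * nodes * hours_per_month
new_monthly = new_hourly * nodes * hours_per_month
saving_pct = 100 * (1 - new_monthly / old_monthly)

print(f"old: ${old_monthly:,.0f}/mo, new: ${new_monthly:,.0f}/mo, "
      f"saving: {saving_pct:.0f}%")
```

The point is that the saving compounds: a cheaper per-vCPU instance family multiplied by halving the instance size.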
If we go with Elastic Block Store, we have three different options. It could be the standard drive, which is magnetic; it could be gp2 or gp3; or it could be io1 and io2. Don't use gp2: if someone still has gp2, for example for the root volume or anything else, move it to gp3, because gp3 is cheaper and much faster.

For example, if we compare two similar machines, an r7g.2xlarge plus a gp3 EBS volume against an r7gd.2xlarge, which contains the local SSD, the price is lower for the machine without the ephemeral drive. But of course the ephemeral drive is much faster. So if we choose the faster option, the local SSD, we also take on an additional cost: the cost of ephemeral risk. If we stop an instance that has an ephemeral drive, the drive is lost and we lose all the data on it. We cannot do any fancy things like an AMI rollout; we cannot, for example, detach the drive and attach it to another EC2 machine; we cannot easily scale the EC2 machine up and down. So we lose a lot of the capabilities we have when using an EBS drive. And we cannot easily extend or replace the drive: for example, after doing some cleanup on our tables, say removing the ones that are not used, we may want to save cost by making the drive smaller, but we cannot do that with an ephemeral drive; for that we have to go with EBS.

So what is new in Cassandra 4 here? There is another fancy new feature, which is compression. If we have tables that contain a lot of data but are rarely used, so a lot of disk space is used but the CPUs are not often busy, we can switch their compression from the default LZ4 to Zstandard. After that, we cut the disk space by about 25%, and after cutting the disk space by 25%, we can also switch the EBS volume to a smaller one and save some cost.
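In CQL, switching a rarely read table from LZ4 to Zstandard looks roughly like this. The keyspace and table names are hypothetical; the `ZstdCompressor` class is available from Cassandra 4.0.

```sql
-- Hypothetical keyspace/table; ZstdCompressor is available from Cassandra 4.0.
ALTER TABLE my_keyspace.rarely_read_events
  WITH compression = {'class': 'ZstdCompressor', 'compression_level': 3};

-- Existing SSTables keep their old codec until rewritten, e.g. with:
--   nodetool upgradesstables -a my_keyspace rarely_read_events
```

Note that the ALTER only affects newly written SSTables, which is why the `nodetool` rewrite step matters before you resize the volume.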
And that is exactly what we did. We dropped some tables that were not used, which we found out after a long investigation, and we switched the compression algorithm from the default LZ4 to Zstandard on the rarely used tables. After that, we cut our EBS cost in half.

Another thing: there is a readahead parameter on the Linux side. When we read some data from our EBS volume, the kernel reads a bit more and places it in the page cache. The general recommendation, which was on the cassandra.apache.org production recommendations page, says that if we are using SSDs, we should go with a 4-kilobyte readahead. But that is a setting for local SSDs. If we are on gp2, gp3, or io2 and this parameter is changed from the default of 256 sectors down to 4 kilobytes, it stresses the cluster so much that it performs two times worse. We tested this very thoroughly. So the general recommendation is fine, but only when using bare-metal machines with local SSD drives. If we are using EBS, this recommendation is not for us; just remember that. EBS volumes are, of course, network disks, so the latency is much higher, and when we read ahead from the disk and keep that data in the OS page cache, backed by very fast DDR5 memory, readahead actually has benefits.

So, network: where do we pay for the network? There is intra-zone traffic, which is free on AWS; there is traffic across availability zones; and of course traffic across regions. A service like S3 can also be free, but that depends on our configuration. For intra-zone traffic we pay nothing, so for clusters that are not critical and could even lose all their data, we could place everything, by topology, in only one availability zone. But that is very risky, and I know it goes against the general recommendation to give up the other availability zones.
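The readahead confusion is largely a units problem: `blockdev --setra` works in 512-byte sectors, so the Linux default of 256 sectors is 128 KiB, and the "4 kilobyte" advice for local SSD means `--setra 8`. A tiny sketch of the conversion (the device path in the comment is hypothetical):

```python
# blockdev --getra / --setra report and set readahead in 512-byte sectors.
SECTOR_BYTES = 512

def readahead_bytes(sectors: int) -> int:
    """Convert a blockdev sector count to bytes."""
    return sectors * SECTOR_BYTES

default_ra = readahead_bytes(256)  # Linux default: 128 KiB
ssd_ra = readahead_bytes(8)        # the "4 KB" recommendation for local SSD

print(default_ra // 1024, "KiB")   # 128
print(ssd_ra // 1024, "KiB")       # 4

# On EBS (gp3/io2), keep the 256-sector default, e.g.:
#   blockdev --setra 256 /dev/nvme1n1   # device path is hypothetical
```

So "set readahead to 4 KB" and "leave it at 256" differ by a factor of 32, which is why applying the local-SSD advice to a network-attached EBS volume hurts so badly.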
But it is something we could consider. Now, if we are thinking about the traffic that crosses availability zones, we have the possibility to enable compression. The first place is the traffic between the client and the nodes: for small pieces of data we should use the Snappy algorithm, which is disabled by default, for example in Spring Data Cassandra. The second place is the traffic between the nodes. In Cassandra 4 there is a new feature, the non-blocking inter-node connection; it uses Netty, unlike the previous implementation, which was blocking. So we should compress all of this data.

By default, inter-node compression in Cassandra is enabled only between data centers. But if we have different availability zones and just one Cassandra cluster in one region, we usually map one availability zone per rack, so we pay for the traffic between the racks. My recommendation is to enable compression between all the nodes, so also between the different availability zones. Of course, we did that too: after enabling client-to-node compression, and after enabling compression between the nodes, we got some cost cutting from that as well.

Basically, if we are sending data to S3, for example the incremental backups, or the backups or anything else, from our EC2 machines, we should not pay anything for that. But there is something called the NAT gateway: when traffic leaves our VPC for another service like S3, it can go either directly to S3 or via the NAT gateway.
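The inter-node part of this is a one-line change. A minimal cassandra.yaml sketch, assuming Cassandra 4.x, where the default value of this setting is `dc` (compress only between data centers):

```yaml
# cassandra.yaml (Cassandra 4.x) -- fragment, not a full config.
# Compress traffic between ALL nodes, not only between data centers,
# so rack-to-rack (i.e. cross-availability-zone) traffic is compressed too:
internode_compression: all
```

Client-to-node compression, by contrast, is configured on the driver side (the protocol compression option of your client), not in cassandra.yaml.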
If someone does not want to pay a lot of money for the NAT gateway, there is something called a gateway endpoint, which is free of charge. After setting it up, we are not paying for the connection from our EC2 machines to S3; you can see the result of enabling this endpoint.

Now, client anti-patterns. Cassandra 4 has a new capability called virtual tables. With them we can easily track the clients: which ones use an old protocol version, which ones do something weird, which ones generate a lot of data transfer that we want to avoid. If we go through the virtual tables and analyze that, we can remove some of the weird stuff, I don't have time to go deeper into it, and finally achieve better latency.

For example, and there is a slide missing here, people often use Spring Data Cassandra, and inside Spring Data Cassandra there was a critical bug, GitHub issue 1213; I think a lot of people know about it. It was in Spring Data Cassandra 3.x, and the bug creates a new prepared statement for every single query. So if we are happy that we are using a fancy library like Spring Data Cassandra, where everything is prepared, in reality every query creates a prepared statement but never reuses it: instead of binding new values, each query creates a brand-new prepared statement. That puts stress on the prepared statement cache and also generates additional traffic. If we go to the table that was introduced in Cassandra 3.11, not in 4.x, we saw that every query gets a new prepared statement, which of course causes latency issues and also generates additional traffic, which also costs money. So my general recommendation is to upgrade Spring Data Cassandra to the newest version.
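The client tracking described above can be done from cqlsh against the Cassandra 4 virtual tables; the second query below uses the regular `system.prepared_statements` table, which I believe is the pre-4.x table the talk refers to:

```sql
-- Cassandra 4 virtual table: one row per client connection,
-- including the driver and native protocol version it speaks.
SELECT address, driver_name, driver_version, protocol_version
FROM system_views.clients;

-- A runaway count here is a symptom of the per-query
-- prepared-statement bug described above:
SELECT COUNT(*) FROM system.prepared_statements;
```

Watching the prepared-statements count while a suspect microservice runs makes the leak very easy to confirm.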
That avoids the issue. Another example: if we have something like a Vault integration, and for each microservice we generate new credentials inside our Cassandra cluster, there are extremely important parameters in 4.x: we can set the auth read consistency level and the auth write consistency level to whatever we want. Basically, by default in Cassandra 2 and 3, I think even in 1, the consistency level for reading authentication data is just ONE. So imagine a situation like this: we have some microservices deployed, for example on Kubernetes or somewhere else, and we have Vault generating new credentials. One node goes down, which is normal in Cassandra, because there is a patch, there is an operation, or maybe someone did a kill -9 by accident, or something else weird. Any other node that acted as coordinator while the new credentials were created now holds hints for the node that is down; when the node comes back, the hints will be sent to it. But if we ask that node whether some user of our microservice has permissions, has credentials, that information may still be sitting in the hint queue, and if we hit that particular node which was down, we get authentication and permission errors. To avoid that, we have to increase the auth consistency level, or stop using a service like Vault that creates new accounts every five minutes and propagates them across the whole cluster.

Okay, and of course, if I want to convince people to move to Cassandra 4, there is also the fact that 3.x is end of life. If you look at the supported versions on the Cassandra download page, you can read that the end of life is around December 2023.
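The auth consistency knobs mentioned above can be sketched in cassandra.yaml. To my knowledge these option names landed in Cassandra 4.1, so treat them as an assumption to verify against your exact version:

```yaml
# cassandra.yaml (believed to be available from Cassandra 4.1) -- fragment.
# Raise auth reads/writes from the old effective default of ONE, so a
# freshly created credential is visible even while one replica is down
# and its hints have not been replayed yet:
auth_read_consistency_level: LOCAL_QUORUM
auth_write_consistency_level: LOCAL_QUORUM
```

With QUORUM-level auth, a single down node can no longer answer a login check from stale, hint-pending data on its own.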
So if we run a version that is out of support, we could face some non-financial risks, which also cost us money. For example, in the past there were major bugs and vulnerabilities like the JMX RCE, a remote code execution, which was of course fixed by releasing a new version. There was, for example, Log4Shell, another big major bug, or, for example, the 2038 overflow bug. If we stay on a legacy version that no longer receives upgrades, we are left with vulnerabilities that will never be fixed.

So, a summary before we go to lunch. On Cassandra 4, choose Java 11, even though it is still possible to use Java 8, and of course the newest OS. Go with Graviton. Always test every piece of good advice to see if it matches your case, because some advice might not; for example Zstandard: in our case, on 90% of tables, Zstandard is good enough, it does not increase latency but cuts a lot of disk space. Review Cost Explorer from time to time, if you have access, of course. If your CPUs have free cycles, enable compression. And many other pieces of good advice.

So, the final result of our cost-cutting mission. A question for the audience: does anyone want to guess by how many percent we cut our TCO, our total cost of ownership? You can guess. Okay, next shot. 60? Next shot. Close. For the one who gets it, I have one Bitcoin, not less. Okay, who said 70? It was 69, so a lot; there is a Bitcoin for you at the end of the presentation. So finally, we went with Cassandra 4. We switched because we had to, because Cassandra 3 is end of life; I don't think I have to say anything more about why it is worth being on a supported version, but it is important. So thank you for attending my presentation, and there is time for questions. Now, only that one.
Of course, we ran a lot of different scenarios: looking at the tables, looking at the table stats, going to our monitoring to figure out which tables are in use, which ones generate latency, where we can switch to a different compression, and which tables are not in use. There was one table that contained, if I remember correctly, 10 terabytes of data that was not used very often. So we went to the business side to ask whether we could remove this table, and when we showed them the calculation of our EBS usage and how much money it would save, they said, okay, remove it.

Which storage? Okay, so the question was which storage we are on. We are using EBS gp3. Of course, io2 is much faster and has performance very similar to a local SSD, but it is very expensive. If I used io2, I think this presentation would not be called cutting costs; it would be about changing the EBS drive type. Because with io2 you provision not only the disk space but also the number of IOPS and the throughput, and you pay for the provisioned IOPS. On gp3, by default, we get 3,000 IOPS included at no extra cost. Our biggest IOPS requirement, since we monitor it, comes during the peak, in prime time around 2 PM, when people come in and maybe draw some charts and things like that; it was about 2,000 IOPS per machine, so we are still inside the included baseline. Any time we observe a big peak of users, we can go to the AWS console or run Terraform and increase it. If I remember correctly, AWS allows us to change the IOPS once every six hours.
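The gp3 arithmetic in that answer can be sketched like this. The prices are illustrative assumptions in the style of us-east-1 list prices, not quotes from the talk; check the EBS pricing page for real numbers.

```python
# gp3 sketch: 3,000 IOPS are included in the base storage price;
# IOPS provisioned above that are billed per IOPS-month.
# Prices are ILLUSTRATIVE assumptions, not AWS quotes.

GB_MONTH = 0.08          # assumed $/GB-month for gp3 storage
EXTRA_IOPS_MONTH = 0.005 # assumed $/provisioned-IOPS-month above the baseline
INCLUDED_IOPS = 3000

def gp3_monthly(size_gb: int, provisioned_iops: int) -> float:
    """Monthly cost of one gp3 volume under the assumed prices."""
    extra = max(0, provisioned_iops - INCLUDED_IOPS)
    return size_gb * GB_MONTH + extra * EXTRA_IOPS_MONTH

# A ~2,000 IOPS peak fits inside the included 3,000 -> storage cost only:
print(gp3_monthly(1000, 2000))   # ~80: storage only, no IOPS surcharge
# Temporarily provisioning 10,000 IOPS pays for 7,000 extra:
print(gp3_monthly(1000, 10000))  # ~115: storage plus 7,000 extra IOPS
```

This is why staying on gp3 and bumping IOPS only during observed peaks (within the once-per-six-hours modification limit) is so much cheaper than provisioning io2 for the worst case.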