Good morning and welcome, everyone. I'm Nikolai Kalisnikov, and this is Praveen; we both work at AWS. Today we're — well, I wouldn't say excited — we're here to talk about Cassandra antipatterns. Our agenda: a quick intro to Cassandra, common antipatterns, use cases, and case studies — the case studies are essentially workarounds I've built with customers. We'll finish with a Q&A session.

We're sure many of you are familiar with Cassandra — this is Cassandra Summit, after all. It's a highly scalable wide-column store, and many of my customers use it daily for good reasons. First of all: no single point of failure and zero downtime, which is what everyone wants. Its peer-to-peer distributed architecture lets you read and write data from any node, which adds resiliency while improving performance. It's built to handle massive amounts of data — probably many of you run it with terabytes or even petabytes of storage, and it works well. Cassandra's horizontal scaling is a straightforward and cost-effective approach: you can easily add more nodes to the cluster to improve performance. The scalability is linear, which is great — want double the throughput, just double the number of nodes.

However, there are some common antipatterns we observe across our customers that can cause performance issues. Today we'll explore five such antipatterns but focus on only two — we don't have time to cover all of them, and some are quite complicated. We'll dive deep on too many tables and large partitions, but let me touch on a couple of the others here first.

For example, you're probably familiar with Spark usage: you have a large cluster, and on a daily basis you read data out of Cassandra, slice and dice it, aggregate it, and prepare a report — and sometimes the job fails. You see the cluster go into a down state because three or four nodes died for some reason. That's because of what Spark is doing under the hood — it uses ALLOW FILTERING, essentially scanning your whole cluster to pull the data out, and pulling everything out of the cluster that way is an antipattern in itself. For this there is a Cassandra Enhancement Proposal, CEP-28, "Reading and Writing Cassandra Data with Spark Bulk Analytics." It lets you skip the front door — the native protocol on port 9042 — and instead read the SSTables directly, with the help of the Cassandra Sidecar, to get data out and process it at high speed.

Collections. Unfortunately, collections are not really limited — you can probably store up to two gigabytes in one — but performance degrades drastically once you go over about 64 kilobytes. The reason is that a collection can't be sliced: it's read in its entirety and can't be paginated internally. So there is a problem with large collections: keep them small.
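The usual fix, as a minimal sketch (schema and names are illustrative, not from the talk; assumes the DataStax Python driver): instead of one unbounded map that must be read whole, model the entries as clustering rows, which the server can slice and the driver can page through.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# Antipattern: an unbounded map is read as a whole and cannot be paginated.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user_map (
        user_id text PRIMARY KEY,
        events  map<timestamp, text>   -- grows without bound
    )""")

# Fix: promote the map entries to clustering rows.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user (
        user_id  text,
        event_ts timestamp,
        payload  text,
        PRIMARY KEY ((user_id), event_ts)
    )""")

# The driver now pages this automatically, fetch_size rows at a time,
# instead of materializing one giant collection cell.
for row in session.execute(
        "SELECT event_ts, payload FROM events_by_user WHERE user_id = %s",
        ["user-42"]):
    print(row.event_ts, row.payload)
```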
Now let's talk about the improvements we can achieve with fewer tables. You're probably familiar with the situation where you have too many tables in your cluster and Cassandra complains once you've created more than 256 or 300 of them — and from my experience, I've seen customers create even over 300,000 tables. I'd like to call my colleague up: Praveen, the stage is yours.

Thank you, Nikolai — hey folks. Let's dive a little deeper into this specific antipattern and understand what happens to your cluster when you have too many tables, starting with what happens on the Cassandra server. For every table you create in a keyspace, Cassandra allocates roughly a megabyte of JVM heap for table metadata. So if you have 1,000 tables in your cluster, that's almost a gigabyte of heap gone just to storing table metadata. And that portion of the heap is scanned on every GC but never released — you're allocating a lot of space very inefficiently.

Another bottleneck with too many tables is memtable flushing. As we all know, every mutation in Cassandra is written to the commit log and then stored in a memtable, which lives in memory. With many tables, each table has its own memtable, on the heap by default (there are configurations to store it off-heap, but on-heap is the default). The total space allocated for memtables is shared across all tables, and when you cross that threshold it triggers a flush to disk. A couple of bottlenecks hit there. By default there are only two flush writers — the threads that flush memtables to disk — and while you can raise that to eight, all tables share this bottleneck, and it's a disk I/O operation, which as we all know is slow. Also, with many tables you have many small memtables, and Cassandra prioritizes flushing larger memtables first. With too many tables you fragment the shared space: you hit the threshold faster, trigger premature flushes, and end up doing many disk operations that each flush small fragments of memtables.

Another regression you see with too many tables is increased pressure on your monitoring stack. This isn't very Cassandra-specific — it's a general consideration when designing any large-scale system: your metric exporters and sinks must be able to keep up with the metrics emitted by your service. Cassandra emits quite a few metrics at the table level, and when you have too many tables, that imposes a lot of pressure on the monitoring stack. When the exporters can't keep up, metrics get dropped on the floor, and in the worst case your service flies blind — which, as we all know, operators hate.

There are a few other issues too. For example, client drivers retrieve and download the table schemas during startup, so too many tables increases your client startup time. There are also shared caches — both server-side and, depending on the implementation, client-side — that are shared across tables, so there's some impact on those as well.
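One way to check the flush path described above on a live node — a minimal sketch, assuming the Cassandra 4.0+ virtual tables in `system_views` (exact setting and column names can vary by version):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# How many flush writers is this node configured with?
row = session.execute(
    "SELECT value FROM system_views.settings WHERE name = %s",
    ["memtable_flush_writers"]).one()
print("flush writers:", row.value if row else "unknown")

# Is the shared flush pool backing up? Sustained pending or blocked
# tasks on the flush pool suggest the flush path is the bottleneck.
for tp in session.execute("SELECT * FROM system_views.thread_pools"):
    if "Flush" in tp.name:
        print(tp.name, "active:", tp.active_tasks,
              "pending:", tp.pending_tasks, "blocked:", tp.blocked_tasks)
```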
Now let's try to answer: how many tables are too many tables? This is pretty hard to answer deterministically, because it depends on a lot of parameters — the Cassandra version you're running, the specific hardware, other configuration. But our load tests do give some pointers. For example, here is one of our worst-case load tests. We ran it against a three-node Cassandra 4.0.11 cluster on t2.xlarge EC2 instances, generating load with cassandra-stress, the load-test tool packaged with the Cassandra binaries. We simulated a worst-case traffic pattern: mostly inserts, uniformly distributed across all tables in the cluster. As you can see here, as we decrease the number of tables we see a linear improvement in write throughput, and an even more significant win in tail latency — the aggregated P99.9 latency improves substantially as the table count drops.

Now, how can we design our systems to overcome this sort of antipattern? One recommendation for smaller or mid-sized workloads or migrations is to use NoSQL design patterns. A common one is the single-table approach, where you overload the partition key or clustering key to represent different types of rows, or different types of items, within the same table — that reduces the total number of tables in the system. Another approach — seen as an antipattern in the SQL world but fine to do in NoSQL — is denormalization, where you introduce some data duplication but cut down the number of tables (a short sketch of both follows below). A simple example is combining two tables into one at the cost of some duplicated data.

This sort of design pattern can be hard to put in place or enforce with larger teams or multi-tenant systems, and then you have to start thinking a little bigger. Here we have an abstraction layer built for distributing your actual tables, so the physical location of those tables on the clusters is abstracted away from the client side. In this case, for example, a routing fleet sits in front of several Cassandra clusters. To the client, virtually everything looks like it's in the same keyspace, but physically the tables are distributed across different clusters, and we keep routing metadata to help the routing fleet direct connections accordingly. This architecture lets you scale horizontally — you can add and remove, scale in and scale out, your backend Cassandra clusters completely transparently to the client. However, it comes with challenges. For example, how would you aggregate the system metadata tables for a keyspace whose tables are distributed across different clusters? In cases like that, a managed solution may be preferable.
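Here's that sketch — a minimal single-table layout with denormalized items (schema and names are illustrative, not from the talk; assumes the DataStax Python driver):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# One table holds several entity types; the clustering key is overloaded
# with a type prefix instead of creating one table per entity.
session.execute("""
    CREATE TABLE IF NOT EXISTS account_data (
        account_id text,
        item_key   text,   -- e.g. 'PROFILE', 'ORDER#1001', 'ADDRESS#home'
        payload    text,
        PRIMARY KEY ((account_id), item_key)
    )""")

insert = session.prepare(
    "INSERT INTO account_data (account_id, item_key, payload) VALUES (?, ?, ?)")
session.execute(insert, ["acct-1", "PROFILE", '{"name": "Ada"}'])
session.execute(insert, ["acct-1", "ORDER#1001", '{"total": 42}'])

# One partition read returns the profile and all orders together --
# denormalization trades some duplication for fewer tables and reads.
rows = session.execute(
    "SELECT item_key, payload FROM account_data WHERE account_id = %s",
    ["acct-1"])
for r in rows:
    print(r.item_key, r.payload)
```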
Now I'd like to hand it off to my colleague Nikolai to talk more about large partitions.

Before I jump to large partitions, one comment on the previous slide: we often see customers create too many tables when their engineers have experience with, say, MySQL or another RDBMS, and they try to replicate the same workload layout and recreate the same number of tables. If you have 100 entities, you go and create 100 tables and expect it to work. Maybe it does at the initial stage, when you have a little data, but later you'll see it behaves completely differently. Thank you very much.

So let's talk about large partitions. From my experience, I'd say 80% of the customers I work with experience this large-partition issue. It's not so much a bug as a design-thinking problem: it comes from how you initially built your system. You probably didn't test thoroughly up front, or didn't know enough about your data geometry — how the data will look in the future and how it will be distributed. Small startups especially build something quickly, and only after it grows into something bigger do they see the issue.

One example: a customer built a system where the partition key was simply the customer organization ID. For small tenants — private customers, say — that's maybe 10 or 15 rows per partition. But once they onboarded corporate clients, the structure didn't change, yet the number of rows and transactions against a single key started to grow, and grow very fast — not slowly, but, to use more precise words, linearly or even exponentially. That's dangerous.

We also notice that even customers who design with large partitions in mind sometimes issue SELECT statements without clustering columns. That maps the entire partition into memory on read and touches multiple SSTables on storage, so that can be an issue too.

So using a key like the one I mentioned — a customer ID or organization ID as the whole partition key — creates a problem immediately. Moreover, it affects the compaction process, because compacting large partitions takes more effort; there's a parameter in cassandra.yaml responsible for the warning threshold here. My experience suggests keeping partitions under 10 megabytes — don't go over. Ten megabytes is the sweet spot to design for. It can improve the performance of your cluster to a great extent, and you'll be able to scale Cassandra without surprises — and there are usually a lot of surprises on that road.

Additionally, I sometimes see customers storing binary payloads where the payload itself is quite large — say 10 or 15 megabytes — storing images, or medical records, which are usually quite large too. They end up trying to split the data at some point, but even at one or two megabytes per row you don't know how many rows will end up in a partition, so a partition can easily grow past 500 megabytes. In my experience I've seen partitions close to two gigabytes — extremely large — and those customers asked us to help redesign the complete solution and move to a new schema. If you have a large payload, it's worth asking whether Cassandra is even the right place to store it. Think about something different: maybe an in-memory store sitting close to Cassandra, or an object store where you keep the object itself and store just a reference link to it in your table.
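A minimal sketch of that reference pattern (assuming boto3 against an S3-style object store; bucket, keyspace, and table names are all illustrative): the large payload goes to object storage, and Cassandra keeps only a small row pointing at it.

```python
import uuid

import boto3
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS scans (
        patient_id text,
        scan_id    uuid,
        object_url text,      -- reference only; the payload lives elsewhere
        PRIMARY KEY ((patient_id), scan_id)
    )""")

def save_scan(patient_id: str, image: bytes) -> uuid.UUID:
    scan_id = uuid.uuid4()
    key = f"scans/{patient_id}/{scan_id}"
    # The multi-megabyte payload never touches a Cassandra partition.
    s3.put_object(Bucket="my-scan-bucket", Key=key, Body=image)
    # The Cassandra row stays tiny, so partitions stay small.
    session.execute(
        "INSERT INTO scans (patient_id, scan_id, object_url) VALUES (%s, %s, %s)",
        [patient_id, scan_id, f"s3://my-scan-bucket/{key}"])
    return scan_id
```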
Case study one. For small customers — mid-size clusters with one application per cluster — we can usually just redesign the schema. There's not a lot of effort there, and you can often find low-hanging fruit, especially in the regular columns: sometimes you can promote one into the partition key. Or look at the clustering columns: if you move a clustering column into the partition key, it can improve the data distribution, splitting partitions into smaller parts. You can make an easy improvement without hurting the overall infrastructure.

Case study two. We have a customer that runs Cassandra as one large shared cluster with a lot of applications on it — not one or three, but maybe a hundred — and they can affect each other, because some of them use large partitions. Moreover, teams kept creating things without knowing it would affect overall performance, and later it hits all your applications: everything starts slowing down because one application is hurting everything. One approach here is to build a unified data-storage layer on top of Cassandra that stores data in a unified format. In this case the primary key looks uniform: it's prefixed with a bucketing value — a bucket ID — and each bucket can technically be stored on a different cluster. So you have a fleet of clusters across which the buckets are split, reusing the same routing architecture Praveen described earlier, just with the additional bucket component. A bucket stores multiple partitions, but it should have limits on how many partitions it holds and at what size; when a bucket exceeds them, the system should automatically split it, and the workflow routes from the fleet directly to the proper cluster.

So let's conclude. My recommendation — and Praveen's — is to keep the number of tables low: keep them under 200, don't go over. Most products you'll find right now, even DataStax's Cassandra server, automatically complain when you exceed the table count — for instance, you get a warning past 256 tables, and past 512 there will be failures. Use guardrails — a nice new feature you'll find in Cassandra 4.1 — to set the number of tables that are allowed to be created in the cluster; it's a good way to manage the table count. Aim to keep partition sizes under about 10 megabytes, don't go over. And test your workload. Use the tools and frameworks available: out of the box there's the cassandra-stress tool — people don't always like it, but I do, and I keep using it. Another available tool is tlp-stress, developed by The Last Pickle, and there's also NoSQLBench. And sometimes I like to use a tool called Locust — I don't know if you're familiar with it. Locust is a load-testing framework that lets you imitate a realistic workload, traditionally with HTTP requests, so it looks like real clients attacking your Cassandra and you can see how it behaves. It's easy to write, because it's just plain Python, it's easy to deploy quickly, and it scales — it has a distributed architecture for generating load, so you can drive serious transactions per second.
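A minimal sketch of a Locust "user" that drives CQL instead of HTTP, assuming Locust 2.x's request event API and the DataStax Python driver (and reusing the illustrative events_by_user table from the earlier sketch):

```python
import time

from cassandra.cluster import Cluster
from locust import User, between, task

class CassandraUser(User):
    wait_time = between(0.001, 0.01)

    def on_start(self):
        self.session = Cluster(["127.0.0.1"]).connect("demo")
        self.read = self.session.prepare(
            "SELECT payload FROM events_by_user WHERE user_id = ?")

    @task
    def read_events(self):
        start = time.perf_counter()
        exc = None
        try:
            self.session.execute(self.read, ["user-42"])
        except Exception as e:  # report failures to Locust instead of dying
            exc = e
        # Locust aggregates whatever we report here into its RPS and
        # latency charts, exactly as it would for HTTP calls.
        self.environment.events.request.fire(
            request_type="CQL", name="read_events",
            response_time=(time.perf_counter() - start) * 1000.0,
            response_length=0, exception=exc)
```

Run it with the usual `locust -f <file>` workflow; workers can be added for distributed load, which is how you scale the generated transactions per second.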
For large partitions, you can use nodetool tablestats to identify them, or you can use the virtual tables — the statistics are available there, and you'll find the largest partition for each table.

Okay, thank you. If anyone has questions, we'd be more than happy to answer them, and afterwards you can find us here. Yes — we have a question. Yeah, we can hear you.

[Audience question, off-mic.] I wouldn't say it's a big antipattern, but it's an antipattern still — I've seen the couple of patterns you mention. What we usually recommend is to switch from laying your data out horizontally, as a separate column per value, to key/value rows — sometimes with the value compressed, for example when it stores JSON. So instead of a column per attribute, you have one row per attribute within the partition, holding key/value pairs. Plus, you can add a sharding value so you can still split the partition. The catch is that you have to keep track of the split count: when you want to read everything, the application has to know up front how the data is split — say this partition is stored as five buckets — and read each bucket one by one. So from that horizontal representation you can move to rows within a partition and split the partitions. You can have a lot of columns, and if it doesn't affect the partition size, that's fine — but I've seen four or five thousand columns, and eventually it does affect the size of the partition.
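A minimal sketch of that key/value-rows-plus-shard answer (all names are illustrative): each former column becomes a row, and a shard component in the partition key keeps any single partition bounded.

```python
import hashlib

from cassandra.cluster import Cluster

SHARDS = 8  # the application must know this split count up front

session = Cluster(["127.0.0.1"]).connect("demo")
session.execute("""
    CREATE TABLE IF NOT EXISTS wide_entity (
        entity_id text,
        shard     int,
        attr_name text,
        attr_val  text,     -- optionally compressed/serialized JSON
        PRIMARY KEY ((entity_id, shard), attr_name)
    )""")

def shard_of(attr_name: str) -> int:
    # A stable digest, not Python's hash(), so reads from any process
    # land on the same shard.
    return int(hashlib.md5(attr_name.encode()).hexdigest(), 16) % SHARDS

def read_all(entity_id: str):
    # "Read everything" means visiting each bucket one by one.
    for shard in range(SHARDS):
        yield from session.execute(
            "SELECT attr_name, attr_val FROM wide_entity "
            "WHERE entity_id = %s AND shard = %s", [entity_id, shard])
```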
Before we wrap up, here are some other events coming up that we think you might be interested in: we have a workshop and an AWS dinner, and we have a booth and other sessions. We look forward to seeing you there. Thank you very much.

[Audience question about storing large objects.] Store them somewhere else. You can have another storage system and just link to it from your table — the table becomes a ledger for your artifacts, while the large artifacts live in a different store. For example, Redis can store values of up to 512 megabytes in memory. If performance is an important part for you — if you need milliseconds — you can store the object there, keep the reference in Cassandra, and quickly fetch the data by that reference. Use Cassandra more as the data pointer: it's your main ledger, you store your records there, but the large objects live somewhere else. Yes, it could be Redis — it's an in-memory solution, and I'd recommend it even now. And then you join afterwards, so it becomes a multi-fetch pattern. On the multi-fetch pattern: if you use Java, the driver has custom codecs. You can develop a codec so that, behind the scenes, it looks like you're querying a Cassandra table, but the codec takes the stored link, makes a request to the other storage, and returns the data — you get a normal result set back even though the data is stored somewhere else. You can mimic this behavior easily with the DataStax custom codecs; it works well. I've tested it — not with super-large objects like 100 megabytes, but with objects around 10 megabytes it works perfectly. The codec covers all of this, and developers usually don't even see it: they just use a standard prepared SELECT statement against the table, and it returns the large object even though it isn't inside Cassandra.

Regarding Cassandra 5, I don't have a lot of information, but one of the presenters from DataStax has a session, probably next, and can answer whether it will be supported.

[On splitting large partitions.] You can split easily when you have a partition: add a bucketing value as a prefix on every key. Say you have a million rows and you're trying to store them inside one partition — not horizontally as columns, but as rows. You can decide to store, say, a thousand rows per partition and assign a bucket, so each bucket holds a thousand values and you know up front how many buckets you have. Then when you query, you supply the bucket ID to get the data back, since you know where it's stored. You can use hashing techniques — a hashing algorithm — to figure out the bucket ID: hashing your primary key gives you the number of the bucket that stores the data, and you prefix new data the same way. I'd recommend going this way: store the rows inside the partition and split the partition.

Yeah — it will put some pressure either on the client side, or you can push that pressure to your routing fleet if you have that extra abstraction layer. There you can do something more intelligent based on the queries being run and set up connections only when needed: for example, maybe only a query that aggregates across clusters needs connections to all of them; otherwise the connection setup can stay more focused. From the client's side it still looks like a single cluster, and you abstract the complication away.

Okay, thank you very much. Thank you. Thank you.
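To make the multi-fetch answer concrete: a read-side sketch at the application level (the Java driver's custom codecs hide this inside the driver; this helper does the same two fetches explicitly, reusing the illustrative scans table from the earlier sketch, with boto3 against an S3-style store):

```python
import boto3
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("demo")
get_ref = session.prepare(
    "SELECT object_url FROM scans WHERE patient_id = ? AND scan_id = ?")

def load_scan(patient_id, scan_id) -> bytes:
    # Fetch 1: the small reference row from Cassandra.
    row = session.execute(get_ref, [patient_id, scan_id]).one()
    bucket, _, key = row.object_url.removeprefix("s3://").partition("/")
    # Fetch 2: the large payload from object storage; the caller just
    # sees bytes, as if Cassandra had returned them.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```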