This talk is about memtable pluggability and the alternative memtable implementations that we have in Cassandra 5. Let's start with why we want to do anything about memtables at all.

The memtable is a very important piece of Cassandra's local storage. It is the first place every piece of data goes before being reorganized and written into SSTables, and it has a lot of work to do. It has to take data from the user, organize it in a way that's suitable for retrieval, and finally prepare it for being flushed into an SSTable. It also sits on the synchronous path of every write: a write has to be applied to the memtable before it can be acknowledged, before the user knows it can later be read. In this example, we have some writes happening, a read being done by user three, and finally something being flushed to an SSTable.

Why are memtables important? They are probably the most important thing for the performance of the database. First, being on the synchronous path of every write means the memtable directly determines the peak write throughput when you're ingesting a lot of data in a short period. We do more processing of every write, but that happens in the background, so it doesn't really slow down the rest of your processing, at least until things get saturated. The memtable also matters because the bigger it is, the bigger we can make the level-zero SSTables, and those SSTables determine the height of the compaction hierarchy, that is, how much work compaction has to do for you to be able to read things as quickly as you want. Another reason is that memtables stay in memory, and in large part this means they stay on heap. Because they stay on heap for a long time, they are not something the garbage collector can handle easily, and they have a lot of internal complexity that makes them hard to process.

To be able to improve the memtables in Cassandra, the best approach is to play with different implementations, try them out, and compare them against each other, and the best way to do that is by supporting some kind of pluggability of implementations. Since Cassandra 4.1, we have supported pluggable memtable implementations. There is an interface that a memtable implementation needs to support, and once it does, you can define memtable configurations in cassandra.yaml, like the skiplist, sharded and trie configurations in these examples. You can choose one configuration to be the default for the database, and you can also use a specific memtable implementation for specific tables. Some implementations may not be suitable for some types of data; in that case, you can choose the one you want for each table individually.

At the moment there are three different implementations provided with Cassandra. The first is the skip-list memtable, which is basically the legacy memtable implementation in Cassandra, the one we've been using since, I think, 3.0. It's implemented as a concurrent skip list of partitions, with a B-tree per partition that maps to the individual rows within it, and even more B-trees below that to find the actual individual cells.
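To make that layering concrete, here is a deliberately simplified Java sketch of the legacy layout. The class names are stand-ins, not Cassandra's real classes, and where Cassandra swaps in immutable B-trees copy-on-write, this sketch just uses ordinary sorted maps:

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Simplified model of the legacy memtable layout: a concurrent skip list
// of partitions; per partition a sorted row map (an immutable B-tree in
// Cassandra proper); per row another sorted map holding the cells.
final class LegacyMemtableSketch
{
    // Partition key -> partition, kept in sort order for scans and flushing.
    private final ConcurrentNavigableMap<byte[], Partition> partitions =
            new ConcurrentSkipListMap<>(LegacyMemtableSketch::compareUnsigned);

    static final class Partition
    {
        // Clustering key -> row. The real implementation replaces the whole
        // B-tree atomically on update; this sketch ignores that detail.
        final NavigableMap<byte[], Row> rows = new TreeMap<>(LegacyMemtableSketch::compareUnsigned);
    }

    static final class Row
    {
        // Column name -> cell value.
        final NavigableMap<String, byte[]> cells = new TreeMap<>();
    }

    void put(byte[] partitionKey, byte[] clusteringKey, String column, byte[] value)
    {
        partitions.computeIfAbsent(partitionKey, k -> new Partition())
                  .rows.computeIfAbsent(clusteringKey, k -> new Row())
                  .cells.put(column, value);
    }

    // Lexicographic comparison of unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b)
    {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; ++i)
        {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0)
                return c;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Note how every partition carries its own tree objects; that is exactly the pile of organizing structures discussed next.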
Something important about this is that while you can store the data for every cell off heap, all of these organizing structures, the skip list, the multiple levels of B-trees, and all the B-trees for each individual partition, live on heap, and together they form a very complex structure.

The second implementation we currently have, which was made as a kind of proof of concept for the pluggability work in 4.1, is the sharded skip-list memtable. At one point, when we started trying out Cassandra on nodes with lots and lots of cores, above 20, 40, 60, you start seeing that the memtable, or rather the concurrent skip list, becomes a bit of a bottleneck: it's contended, and the writes start slowing each other down. One way to improve on this is by sharding the memtable. We implemented a simple solution which basically uses several independent skip lists instead of one, and this way you can take advantage of many more cores than you could with just a single concurrent skip list. This implementation also supports a blocking mode, which serializes the writes within each individual shard. In some cases this can be beneficial: on one hand, it gives other pieces of the machinery, like compaction, more room to work; on the other hand, it saves some memory, because sometimes we may attempt to write something and then fail because the compare-and-swap didn't succeed. I'll give some examples of how this is beneficial a little later.

The third one is the most important new addition, the trie memtable, which is coming with Cassandra 5. It's a very different implementation: instead of using a skip list for the partition map, it uses a trie, which is a different data structure. It aims to solve quite a few of the problems of the legacy memtable, and if you switch to it, you're expected to get much better performance as well as garbage-collection efficiency.

Now let's talk a little more in depth about how this new memtable works. There's a key concept that these new memtables use, and it's called byte order. If you think about the legacy memtables, and basically every typical solution for storing maps in any database, they're usually comparison-based. You have your types, you have a function which can compare two values of the same type, and your structures are built on comparing values of that type. With this you build things like B-trees, where every time you make a decision, you compare two values and decide which way to go.

The problem with this is that because you need a comparison, you have to hand complete values to the comparing function. You can't take just part of a value and compare it to part of another value; it has to be the whole thing. So the keys can be short, like integers, but they can also be very long if you have strings or something like that. And it often happens that keys have repeated prefixes. If the key is a string which is a path to something, part of the string will repeat.
If it's a multi-component key where you have several clustering columns, for example, you will often have many of the leading components be the same and just the last component differ to point to the specific row. And if you don't take care to form a hierarchy between the maps, you're going to repeat a lot of the comparisons you do when walking this structure. So there's a balance to strike between repeating prefix comparisons and adding code to the system to make it work a little better and also save space.

Another thing about our implementations of these structures, partly because of this need to keep keys in comparable form, is that all of them have to live on heap, with a lot of overhead. This basically means that the on-heap size of a memtable is typically much larger than the data size, which can be put off heap. And of course, that structure is very complex, with all the trees for the different levels of the maps. When you modify it, you throw away some objects and create new ones, and the discarded objects are not immediately reclaimed. They stay around for a long time, they get promoted to the older generations of the garbage-collection hierarchy, so they're not easy at all for the garbage collector to reclaim, and this complicates the whole garbage-collection process.

So what did we do about these two problems? The approach we took is based on byte-comparable keys. Imagine, for example, that in a database we're not storing different types, but just strings as keys. For strings, you have lexicographical comparison, which means that when you compare two strings, you can start from the beginning, and the first time you find a difference you can immediately tell which of the two strings is smaller. So if you're trying to find some place that splits two strings, you can shorten it to some prefix rather than keeping the whole string or the whole key. That is an advantage in itself, but it can also be exploited in structures that are specifically made for this kind of key, and these are called tries. Tries have been around since the 1960s, but they have generally not been used in databases until recently. The key idea is that we get a little better understanding of the keys, because we can make decisions on prefixes, and that gives us better efficiency.

The problem with Cassandra is that we have more types than just strings. We also have integers, floating points, big integers, even big decimals, which are not directly comparable as bytes. What we can do to sort this out is translate each value into a byte-comparable form. For some things it's pretty easy. Take integers: they're not directly comparable as bytes because negative integers have their highest bit set to one and positive integers have it set to zero, so negative integers come after the positive ones if you compare them directly. But if you flip the top bit, they become byte-comparable. You can do a very similar trick for IEEE floating points: you flip the sign bit, and if the number is negative you also flip all the other bits, and then byte comparison works exactly as the normal numeric comparison would.
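As a quick illustration of those two tricks, here is a small sketch, my own simplified take rather than Cassandra's actual byte-comparable machinery:

```java
import java.nio.ByteBuffer;

// Sketch of byte-comparable translations for two fixed-size types. After
// translation, unsigned lexicographic comparison of the resulting bytes
// matches the type's own ordering.
final class ByteComparableSketch
{
    // Signed 32-bit int: flipping the sign bit moves negative values
    // below positive ones in unsigned byte order.
    static byte[] encodeInt(int v)
    {
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    // IEEE-754 float: flip the sign bit for non-negatives; flip all bits
    // for negatives, so that "more negative" sorts lower.
    static byte[] encodeFloat(float f)
    {
        int bits = Float.floatToIntBits(f);
        bits = bits >= 0 ? bits ^ 0x80000000 : ~bits;
        return ByteBuffer.allocate(4).putInt(bits).array();
    }

    // Lexicographic comparison of unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b)
    {
        for (int i = 0; i < Math.min(a.length, b.length); ++i)
        {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0)
                return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args)
    {
        // -3 < 2, and the encodings compare the same way: prints true
        System.out.println(compareUnsigned(encodeInt(-3), encodeInt(2)) < 0);
        // -1.5 < -0.5 < 2.25 survives the translation: prints true twice
        System.out.println(compareUnsigned(encodeFloat(-1.5f), encodeFloat(-0.5f)) < 0);
        System.out.println(compareUnsigned(encodeFloat(-0.5f), encodeFloat(2.25f)) < 0);
    }
}
```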
You can do that sort of translation for all of the types Cassandra has. You can also flatten translations of multiple components into one single byte-ordered sequence, which lets the code lower down in the database ignore the internal structure of the keys entirely and just deal with a flat sequence of bytes.

The trie memtable, which we introduced with CEP-19, is an implementation of such a data structure for the memtable. It uses key translation to turn all the partition keys into byte-comparable sequences, and then it simply replaces the skip list at the highest level of the memtable hierarchy with a trie. The reason it's just the partition map is that this work is not yet completely done; we have only done one step of the process. We're still getting huge improvements, but because it currently only replaces the partition map, it works especially well for key-value workloads at the moment.

The trie itself is something we built internally, because we needed some features that you couldn't really get anywhere else. One of the main structures people would use for something like this comes out of the adaptive radix tree paper from 2013. There are implementations of that, but we needed a few additional features they couldn't provide, for example being able to read concurrently with data being written to the trie, which is really important for the performance of the database. We also did a few extra tricks to make it efficient, which I'm not going to go into because we don't have the time. And finally, because these are single-writer structures, which were much easier to write and make correct than fully concurrent ones, you have to shard the memtable into a few independent tries so that you can use more cores and do more parallel writes at the same time; I'll show a rough sketch of this in a moment.

What did we get out of this? One of the first things we could see is that random accesses into the memtable are about twice as fast, 1.8 times to be precise. In this quick micro-benchmark we wrote 10 million entries; writing them was about two and a half times faster with the trie memtable than with the skip-list one, and querying them is again almost two times faster than with the legacy implementation.

The other important thing is the size of the resulting structure. If you're using heap buffers for everything, you can see a significant drop in the size of the memtable, from close to six gigabytes down to four and a half or so, for the same amount of data. If you're putting the data off heap, both become smaller still, because you can take a little more off heap and you need less overhead when your data lives in these off-heap objects rather than buffers. But notice how small the off-heap part is for the skip-list memtable, and how it almost doubles when you switch to a trie memtable, because now we can put the trie itself off heap as well. This has two effects. The first is that because the trigger for flushing is the amount of memory you use on heap, you reduce the frequency of flushing: you can store more data before you need to flush again, which is very helpful for performance later on.
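Before we look at longer-term performance, here is a rough sketch of that sharding and single-writer discipline. A concurrent map stands in for the in-memory trie, and tokens are hashed to shards, whereas the real implementation slices the token range into contiguous spans; the point is just that writes serialize per shard while reads take no lock:

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a sharded single-writer memtable: independent structures,
// one writer at a time per shard, lock-free reads.
final class ShardedMemtableSketch
{
    private final ConcurrentNavigableMap<String, String>[] shards;
    private final ReentrantLock[] writeLocks;

    @SuppressWarnings("unchecked")
    ShardedMemtableSketch(int shardCount)
    {
        shards = new ConcurrentNavigableMap[shardCount];
        writeLocks = new ReentrantLock[shardCount];
        for (int i = 0; i < shardCount; ++i)
        {
            shards[i] = new ConcurrentSkipListMap<>();
            writeLocks[i] = new ReentrantLock();
        }
    }

    // Hashing is a simplification; contiguous token spans are what keep
    // each shard a sorted slice of the whole token range.
    private int shardFor(long token)
    {
        return Math.floorMod(Long.hashCode(token), shards.length);
    }

    // Writes to the same shard serialize; writes to different shards
    // proceed in parallel, which is what lets many cores be used.
    void put(long token, String key, String value)
    {
        int s = shardFor(token);
        writeLocks[s].lock();
        try
        {
            shards[s].put(key, value);
        }
        finally
        {
            writeLocks[s].unlock();
        }
    }

    // Reads take no lock: the structure must stay readable while a
    // write to the same shard is in progress.
    String get(long token, String key)
    {
        return shards[shardFor(token)].get(key);
    }
}
```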
If you look at longer-term performance, we ran a few NoSQLBench tests filling a node with about one terabyte of data made of very small, roughly 100-byte payloads. What you see here is just the first hour of the test, and you can see that the burst, the peak write throughput, of the same database with only the memtable replaced by a trie memtable, can take almost twice as many writes per second. And this can be sustained for very long periods: in this case the test ran for something like a day and a half, where the blue graph is the one using trie memtables and the orange one is using skip lists, and it maintains over twice the throughput throughout the test. The reason this test doesn't use open-source Cassandra is that at the time we ran it, the other pieces necessary for this kind of performance weren't yet in open-source Cassandra. They are now; you're going to see that in a minute.

In addition to the throughput increase, we get a latency reduction of about 30% if you're doing fixed-rate reads and writes. You can also see that every new SSTable created on level zero is about 30% bigger, which means you may sometimes need one less level of the compaction hierarchy, and that can be a significant improvement. And something that was really important for me personally: we can definitely see the total garbage-collection time cut by more than a factor of two, for this test and for every other test we have actually done.

Something else I wanted to discuss is the effect of sharding, and whether sharding is basically a good or a bad thing. If your data is well distributed, if the partition hashing works well and is able to spread everything out evenly, then sharding definitely helps. Even if you're using just a simple skip list, and even if you're using the blocking variation that can only do one write at a time per skip list, your performance improves by something like 10 to 20% just by sharding the memtable, and the trie memtable is about twice as good as the skip list in that case. But once you start introducing skew, things start looking better for the skip-list memtable, until eventually, when every second write goes to the same partition in this specific test, the plain concurrent skip list beats all the other solutions. For anything realistic though, up to even 20% skew, you can see that the tries and the sharded solutions do at least as well.

Now, this was the part of the talk that Intel was supposed to cover. They started, what was it, three years ago or maybe even more, working on making Cassandra work well with their persistent-memory products. What they initially wanted to do was create a completely separate storage system for Cassandra, but it turns out you can get exactly the same end result by just using their engine as a memtable that never flushes: a memtable that never flushes is the same thing as a completely separate storage system. So in this graph there's our memtable interface, the one we implemented for pluggability.
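To give a feel for what that interface has to cover, here is a rough sketch. The method names are illustrative, chosen to mirror the description that follows; they are not the exact Cassandra API, and the nested types are placeholders:

```java
// Rough shape of a pluggable memtable, loosely following the talk's
// description of the control points an implementation gets to own.
interface MemtableSketch
{
    // Synchronous write path: must complete before the write is acked.
    void put(PartitionUpdate update);

    // Read path for the data currently held by this memtable.
    Partition get(byte[] partitionKey);

    // The implementation decides whether a flush request should really
    // switch to a new memtable; one that never fills up can decline.
    boolean shouldSwitch(FlushReason reason);

    // If the memtable's storage is itself durable (persistent memory,
    // for instance), the commit log can be skipped entirely.
    boolean writesAreDurable();

    // For streaming: dump a token range into an SSTable to send, while
    // a receiving node may apply streamed data straight to its memtable.
    SSTable flushRange(TokenRange range);

    enum FlushReason { MEMORY_PRESSURE, COMMITLOG_LIMIT, STREAMING, SNAPSHOT, SCHEMA_CHANGE }

    // Placeholder types so the sketch stands alone.
    interface PartitionUpdate {}
    interface Partition {}
    interface SSTable {}
    interface TokenRange {}
}
```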
You can use it with a skip-list memtable that flushes to disk from time to time, or a trie memtable that flushes to disk from time to time, or you can use a PMem, persistent-memory, memtable which never flushes and only stores the data in persistent memory. And because that memory is persistent, you also don't need any commit log: the next time you restart the machine, we just start with what you had. This project is not merged into the Cassandra mainline, because the persistent-memory products were discontinued and they didn't see any point in continuing and finishing the work, but let me give some details about how it worked.

The actual implementation was a direct implementation of the ideas of the ART paper. They used the same byte-ordered translation of the keys that I talked about a few slides ago. All of the data is in memory, so you do in-place updates, very similar to how you would do it in a classical SQL database. And because their processing was single-threaded, it's sharded and always blocking, so unlike our solution, they couldn't read a piece of data while the same shard was being written to, which is a bit of a disadvantage. They didn't give me any actual observed performance numbers to share, but they said they expected something between two and six times better performance than the normal memtables once you grow them to a big enough size.

One interesting thing about this is that the memtable pluggability interface actually lets us completely replace the storage engine of Cassandra. We have a memtable; the memtable controls whether or not it actually flushes when the database asks it to, and it doesn't trigger any flushes of its own, because it never fills up. It also controls whether you need a commit log, whether anything needs to be written to it or not, as well as some of the other things needed to work well with Cassandra, like streaming data from one node to another. You can grasp how this could be done much more easily if you think about it as just a memtable: when it's time to stream something, you can give a range to the memtable and say, please dump this into an SSTable so it can be streamed over; and the other side can say, I don't want this in an SSTable, I want to receive it directly into the memtable. And this works. So the next time anyone needs or wants to build a separate storage system, they can do it using this pluggability interface.

The last thing I want to talk about is work related to these memtables, or rather to tries and byte order in general. In Cassandra 5 we have two more pieces. One is the BTI SSTable format, where you basically take the same data format as the legacy SSTables and replace the primary index with something based on tries. Again we take the byte-ordered translation of the keys, and because the data file already holds the complete keys, we don't need to store full keys in the index; we just need to store prefixes that point to the exact place where each key should be. So this can be much more compact than the previous indexes, and it's much faster to read: depending on your workload, accessing a single SSTable is typically about twice as fast. And something really important and really nice about this format is that there's no need to manage any key cache or any index summary, because it doesn't use them.
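Here is a toy version of that prefix idea. A sorted map of strings stands in for the on-disk trie over byte-comparable keys, but the saving comes from the same property: storing just enough of each key to distinguish it from its sorted neighbours:

```java
import java.util.Map;
import java.util.TreeMap;

// Toy prefix index: map the shortest distinguishing prefix of each key
// to the key's position in the data file.
final class PrefixIndexSketch
{
    private final TreeMap<String, Long> index = new TreeMap<>();

    // keys must be sorted; offsets[i] is where keys[i] lives in the file.
    PrefixIndexSketch(String[] keys, long[] offsets)
    {
        for (int i = 0; i < keys.length; ++i)
        {
            int lcp = 0;
            if (i > 0) lcp = Math.max(lcp, commonPrefix(keys[i - 1], keys[i]));
            if (i < keys.length - 1) lcp = Math.max(lcp, commonPrefix(keys[i], keys[i + 1]));
            // One position past the longest shared prefix identifies the key.
            index.put(keys[i].substring(0, Math.min(keys[i].length(), lcp + 1)), offsets[i]);
        }
    }

    // Returns the candidate file offset; the reader then confirms the
    // full key at that position, since only a prefix was stored.
    Long lookup(String key)
    {
        Map.Entry<String, Long> e = index.floorEntry(key);
        return e != null && key.startsWith(e.getKey()) ? e.getValue() : null;
    }

    private static int commonPrefix(String a, String b)
    {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) ++i;
        return i;
    }

    public static void main(String[] args)
    {
        String[] keys = { "apple", "apricot", "banana" };
        PrefixIndexSketch idx = new PrefixIndexSketch(keys, new long[]{ 0, 100, 200 });
        System.out.println(idx.lookup("apricot")); // 100, via stored prefix "apr"
        System.out.println(idx.lookup("cherry"));  // null, no matching prefix
    }
}
```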
On the key cache: the trie is fast enough that putting a key cache in front of it would actually slow things down. And as for the index summary, because the structure is very page-friendly, it can be cached in memory efficiently just by the operating system's page caching.

Another thing that's really important, and we talked about rows per partition in the previous talk: this format can handle millions of rows. It doesn't have any problem with lots and lots of rows, because again it uses the same kind of structure, which you can index into very quickly to find what you're looking for. You don't have to copy the structure into memory, which you would need to do for the binary search in the legacy format; it's used on disk exactly as it is. We've tested it with millions of rows and it works well.

The other related work is SAI, which is also based heavily on tries and byte order. Not everything SAI does, but some of the types it handles are served by the same trie implementation, both in memtables and on disk.

Another thing I wanted to mention is that with Cassandra 5.0 we've opted to keep compatibility in the default configuration, which means that most of the things I'm talking about today, and most of what other people are going to talk about in the rest of the conference, are turned off by default. Until now, if you wanted to run your system, benchmark it, test it, and see what you could achieve with Cassandra, we didn't have a way to select all of these options. So we're going to be supplying a new configuration file, called cassandra_latest.yaml, which turns all of these things on. Why is this important? One of the reasons you can see on the right side: write performance is several times faster. These improvements can make your life easier.

So, this was the end of my talk. Here are a few pointers to papers that discuss tries and byte order being used in databases. The first is the so-called ART paper, on the adaptive radix tree; I think I actually have a mistake in the spelling on the slide. The second is another example of how you can use tries for indexes on disk. And the third is the paper written around exactly the work I discussed in this talk. I'm ready for questions.

So, the question is about the default value of the shard parameter. The sharded structures have a parameter for the number of shards you want to use. By default, if you don't supply it, the number of shards is set to the number of CPU cores. That's often sufficient, but if you want the highest possible write performance, you may want to increase it a little. There are metrics you can look at over JMX, as far as I remember, which tell you how much congestion you have, how many threads are waiting for another thread to free things up, and if you see those going up, you can increase the number of shards. Because it's a memtable parameter, you can change it at any time and it takes effect practically immediately.

Yeah, Derek. Yes, it is a variable-length trie; I actually have slides for that, but I didn't include them. The view on the trie is a byte trie, so you look at it as if all the transitions are bytes. The internal implementation can use multiple bytes per node. It uses fixed-size blocks, because that's easier to manage in terms of memory management: you can allocate and free blocks like that very easily. Here is a rough sketch of the idea.
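A minimal sketch of such a block pool, with illustrative constants rather than the real node layouts:

```java
import java.nio.ByteBuffer;

// Sketch of fixed-size block storage for trie nodes: one large buffer,
// on- or off-heap, carved into equal blocks, with nodes referenced by
// integer block offsets instead of object pointers.
final class BlockPoolSketch
{
    static final int BLOCK_SIZE = 32; // close to a cache line

    private final ByteBuffer buffer;
    private int nextFree = 0;

    BlockPoolSketch(int blocks, boolean offHeap)
    {
        buffer = offHeap ? ByteBuffer.allocateDirect(blocks * BLOCK_SIZE)
                         : ByteBuffer.allocate(blocks * BLOCK_SIZE);
    }

    // Equal-size blocks mean allocation is a bump of a counter, there is
    // no fragmentation, and the whole pool can be discarded at flush.
    int allocateBlock()
    {
        int pos = nextFree;
        nextFree += BLOCK_SIZE;
        return pos; // a 4-byte "pointer" to the node
    }

    // With 32-byte blocks and 4-byte pointers, a node holds at most
    // eight child slots, which is why wider branch points have to be
    // split into smaller transition steps.
    void writeChildPointer(int nodePos, int slot, int childPos)
    {
        buffer.putInt(nodePos + slot * Integer.BYTES, childPos);
    }

    int readChildPointer(int nodePos, int slot)
    {
        return buffer.getInt(nodePos + slot * Integer.BYTES);
    }

    public static void main(String[] args)
    {
        BlockPoolSketch pool = new BlockPoolSketch(1024, true);
        int root = pool.allocateBlock();
        int child = pool.allocateBlock();
        pool.writeChildPointer(root, 0, child);
        System.out.println(pool.readChildPointer(root, 0) == child); // true
    }
}
```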
Fixed-size blocks are also good because the cache is very happy when all the blocks are the same size, close to 32 or 64 bytes, something like a cache line. In a 32-byte block with 4-byte pointers, the widest you can go is eight transitions, so in some cases we do three-bit transitions. In other cases, if a point in the trie is not branching much, you just list all the possible transitions in one block; and where you have a long run of single-byte transitions going down a path, you can put up to 28 of them into the same block. The memory management of the blocks, the pointers and nodes of the trie, is done by us: we allocate the blocks using jemalloc for the off-heap case, and you can also do it on heap by just allocating a big on-heap buffer. Your memtable option for the data storage also selects where these are kept: if you select heap buffers, they'll be on heap; if it's any of the off-heap options, they'll be taken off heap.

Yes, have I seen situations where the on-disk trie doesn't work that well? Honestly, I haven't. It has actually been in production since DSE 6, which is at least five years ago, and I haven't seen anyone complain.

Which ones, the contention metrics, you mean? I think they're available via JMX; I'm not sure about that. If there's an automatic way to get to the JMX stats via virtual tables, you could get to them there, but I don't know.

Yes. This is just key-value: this version of the trie memtable only suits key-value stores, because it only replaces the partition map. In a few months, hopefully, we'll have a version that does more than this, and the main reason to do that is not more performance but actually getting more of the structures off heap and out of the reach of the garbage collector. I think it was G1 for the tests.

Yes? So the question is whether we can have one huge memtable for everything. That's something I actually haven't seriously considered, and you hinted at the reason: it would have to be much easier to drop the things that have already been flushed to disk. I think that's a few years away, at least, though it's not a bad idea in general.

Okay, if there are no further questions, thank you for being here.