Hello everyone, my name is Piotr Kołaczkowski and I'm giving a talk on storage-attached indexes (SAI), which are coming in Cassandra 5.0. I hope the feature I present will solve at least some of the data-modeling problems that were described in the previous talk from Verizon.

When I started my experience with Cassandra, I gave some trainings to people who mostly knew relational databases; Cassandra was new to them in those days. I'm talking about times well over ten years ago, when I was just starting to work for DataStax. I showed them the CQL language, and they were very excited, because the syntax is familiar, almost like SQL: you've got CREATE TABLE, SELECT, UPDATE, DELETE. They said, why do this training? We already know this, right? Then we started doing some exercises. They quickly created something similar to this, maybe the example was slightly different, but a simple table with a few columns. Then they tried a SELECT on a column, say color, that is not part of the primary key, and it turned out they got an error. It took me a while to explain ALLOW FILTERING, but this talk is of course not about ALLOW FILTERING and how it works; it is mostly about how to avoid it.

To fix this problem, Cassandra originally offers a few solutions. One solution is of course to do what the error message suggests: use ALLOW FILTERING. You can do this, but it is useful only in a very small subset of use cases. If you're filtering small amounts of data, that is probably fine. The problem with filtering is that it happens on the server side, on both the coordinator and the replica level, and if there are many, many rows to filter out, you might traverse thousands of rows before you even hit one live row that you need to return to the client.
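The situation the trainees ran into might be sketched like this; the table and column names here are illustrative, not taken from the actual slides:

```cql
-- Hypothetical reconstruction of the training example:
CREATE TABLE fruits (
    id    int PRIMARY KEY,
    name  text,
    color text,
    price int
);

-- color is not part of the primary key, so without an index
-- Cassandra rejects this query:
SELECT * FROM fruits WHERE color = 'yellow';
-- InvalidRequest: Cannot execute this query as it might involve data
-- filtering and thus may have unpredictable performance ...
-- use ALLOW FILTERING
```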
In that case it's very likely that you run into timeout issues, and even without timeouts, the queries would simply be slow and consume resources. This won't work very well.

Another idea, which was also presented in the previous talk from Verizon: you can denormalize by hand. You create separate tables, each storing the same data but in a different order, and then you direct your queries to the table laid out so that you're always querying by the primary key. In that case it is fast, but it also has some operational problems, like consistency, and you're duplicating a lot of data, so you need to provision space for it. There is a feature called materialized views that is supposed to automate some of that, but the current implementation also runs into consistency and operational issues, so it might work for you, but not always.

Then there is a bunch of indexing options in Cassandra. There are the legacy 2i secondary indexes, which unfortunately work in a very similar way to creating a separate table locally on each node. They also have some problems and are pretty limited in features: as I remember, there are no range queries, and you cannot use multiple indexes on multiple columns in the same query. Then there is SASI, which is a bit more modern but still a bit experimental, and it also has the limitation that you cannot put multiple indexed columns in the same expression of a single query. And now the feature I'm presenting: SAI, storage-attached indexing. This is a big improvement over the previous indexes. It probably doesn't solve all of the problems, but I think it should be good enough. Okay, the basic syntax is very simple.
CREATE INDEX: you just need to specify USING 'sai' to tell the database that you want this kind of index. There is also an option in cassandra.yaml to make SAI the default secondary index implementation; in that case your statement could simply be CREATE INDEX, and it would automatically use SAI.

Okay, let's see the earlier example again, now with an index. Of course it works; it returns all the yellow fruit. Returning yellow fruit is a very important problem: anyone who has kids probably knows that they have various preferences, not just about the taste of the food, but also the color. For example, my youngest son won't touch anything that is red; it doesn't matter if it's tasty or not. Red won't do. So this way we can get all the yellow fruit, but what if, well, a lemon is probably not a good result here. My son would say: okay, I like yellow, and lemons are yellow, but I don't like lemons, so please give me a meal without lemons. So we need to add another condition, for example that bananas are okay. This works with SAI: you can use multiple indexed columns, the system will intersect the two indexes, and it will return the proper results.

We also have range queries. Range queries currently work on numeric data types; there is a plan to support them on all data types in the future. And of course you can mix range queries with equality queries, as in the previous example; this all composes pretty well. Those queries are also pretty fast: the query goes directly to the index and fetches just a sub-range of the tree, and I'll show later how it's executed, but there is no filtering. That's the most important part.

Collections: SAI also works for collections. We know that Cassandra supports lists, maps, and sets. Maps are a bit more complex, because they have both keys and values. For lists and sets, it is easy.
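The query shapes described above might look like this in CQL; the table and index names are hypothetical, assuming a table like fruits(id int PRIMARY KEY, name text, color text, price int):

```cql
CREATE INDEX fruits_color_idx ON fruits (color) USING 'sai';
CREATE INDEX fruits_name_idx  ON fruits (name)  USING 'sai';
CREATE INDEX fruits_price_idx ON fruits (price) USING 'sai';

-- Single-column equality, served by one index:
SELECT * FROM fruits WHERE color = 'yellow';

-- Two indexed columns: SAI intersects the two indexes.
SELECT * FROM fruits WHERE color = 'yellow' AND name = 'banana';

-- A numeric range mixed with an equality condition:
SELECT * FROM fruits WHERE color = 'yellow' AND price < 100;
```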
You just create an index on the column as usual, and then you can ask: give me the rows where the collection contains a given value. For maps it's a bit more complex, because there is additional syntax to tell it which part of the map you want to index: keys, values, or both keys and values. If you create the proper index on the map, you can then run queries like: give me entries which contain a given key, or which contain a given value, or where a specific value is associated with a specific key. In that last case the index is actually constructed on key-value pairs; it is used to search for the given pair, and if the pair is found in the index, the row is returned. Here's an example: we have a bunch of maps, we create an index on the keys of those ingredients maps, and as you can see, the query with CONTAINS KEY returns just the items with the matching keys.

Okay, let's look at how it's implemented. Generally, a secondary index is a data structure that associates an indexed term with a list of rows; actually, a list of pointers to the rows. The index is not really duplicating the row data; it's just storing the row identifiers, and of course more than one row can be associated with a single indexed term, a single value. In the map example, for cheese we have two items, two rows; so in the index we have one term, cheese, and a posting list, which is a list of row IDs, with two rows associated with that term. That is of course a very simplified, high-level view; in reality, the data structures we use are a bit more complex.

So how do we do that physically in the database? Each SSTable and each memtable has an associated index that indexes just the data that was inserted into that SSTable or memtable. These are not global indexes.
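The map-indexing variants just described might look like this in CQL; the table, index names, and values are illustrative, not from the slides:

```cql
-- Hypothetical schema for illustration:
CREATE TABLE recipes (name text PRIMARY KEY, ingredients map<text, int>);

CREATE INDEX ingredient_keys    ON recipes (KEYS(ingredients))    USING 'sai';
CREATE INDEX ingredient_values  ON recipes (VALUES(ingredients))  USING 'sai';
CREATE INDEX ingredient_entries ON recipes (ENTRIES(ingredients)) USING 'sai';

SELECT * FROM recipes WHERE ingredients CONTAINS KEY 'cheese';  -- uses KEYS index
SELECT * FROM recipes WHERE ingredients CONTAINS 200;           -- uses VALUES index
SELECT * FROM recipes WHERE ingredients['cheese'] = 200;        -- uses ENTRIES index
```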
They are local to SSTables or to memtables. This makes it very easy for us to keep the data consistent, because the indexes have the same life cycle as the SSTables. Whenever Cassandra flushes a new SSTable, we build an index for it, and it's immutable like the SSTable itself. When the SSTable eventually gets compacted and deleted, the index gets deleted as well, and of course the index data is already in the new, compacted SSTable. So this is very easy to keep consistent.

Each index, and this is something unique to SAI, SASI didn't have this, is actually a set of files, not one file. There are two types of index components. One type is the per-SSTable files. We realized that some of the data we want to index is actually common to all the indexes on a particular SSTable: if you create three indexes on three columns of an SSTable, data like the primary keys would be common to all of them. There's no point in repeating and duplicating that data in each column index, so we store it once and all the column indexes use it. This is what we call the per-SSTable index files. And of course each indexed column also has its own components, typically the trie-based structure that allows us to quickly map between an indexed term and a row ID.

When we're writing data to an indexed table, the data goes as usual to the memtable and the commit log, and for the memtable we also maintain a memtable index. When the memtable fills up and it's time to flush, the memtable gets flushed into an SSTable, and the memtable index is also written out as an SSTable index component. This process is fairly straightforward and quite efficient. The whole flush happens synchronously, similar to how Cassandra traditionally does flushes. Okay, eventually the SSTables need to be compacted.
Here there are two ways we could achieve the same effect. Currently, after the SSTables get compacted, the compaction writes new rows to an output SSTable; we register a listener on those rows, feed them to the index builder, and the index builder builds a fresh index from them. We are not compacting the indexes. The alternative solution we are investigating, which we believe would be faster in the future, is to take just the indexes and compact them together, instead of building a fresh index from the SSTable data. Currently you might notice some slowdown because of that index building on compaction, so we don't recommend creating too many indexes per table: ten is okay, but don't go too far beyond that, because it might slow down compaction a bit. The index building is done synchronously, because we need a fully operational index before the SSTable is marked as usable in the system.

Now the read path. Because each index is local to a memtable or SSTable, we of course have to consult all of them. First the query goes to each index, which returns an iterator over primary keys. Again, this is a simplification; it's a multi-stage process to get the list of primary keys. One nice thing about this model Cassandra uses is that all SSTables are actually sorted, and sorted in the same way: by token and clustering. Because they are sorted, it's very easy to merge them, and after merging we get a list of all the matching primary keys on that node. Then we just go to the storage and fetch the rows by those primary keys.

The version of SAI that goes into Apache Cassandra as part of CEP-7 is row-aware, so it actually indexes exactly rows, not partitions. Earlier we had an internal version of SAI that indexed just partitions.
That was a bit easier, because we didn't have to store primary keys, just tokens, so in some use cases it was a bit faster; but if you had wide partitions it was not good, because it had to filter all the data of a partition. The row-aware version is much more precise, and this is what is going into Apache Cassandra.

One more thing: there is a post-filter block, so after retrieving the rows we still need to recheck that each row matches the selection criteria, because the indexes do not track deletions. Tombstones of course track deletions, but that works at the level of primary keys; because the index itself doesn't record deletions, we might hit some false positives, and that's why we post-filter to return correct results.

Okay, what if we have a query involving multiple columns? Say we have a column name and a column quantity. We open an iterator on the index for name and another iterator on the index for quantity, and after merging, there is another block that intersects the two iterators: it returns only the primary keys that are present in both of them. Again, because the sort order is the same, this operation is easy. However, there are some costs involved. Imagine that the indexes on both sides return just a very tiny subset of rows: we had one million rows, for example, and each index returned just ten. Whenever you get one item from the left index, you have to find whether there is a matching entry in the right iterator. So internally we have a skipping operation, it's actually called skipTo: instead of fetching everything from both iterators and comparing naively, we ask the second index, hey, can you move forward to the point with this primary key, and if this primary key is not found, give me the next available one.
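The skipTo-based intersection can be sketched in Python. This is a toy model, not SAI's actual implementation; plain integers stand in for the sorted primary keys:

```python
# Toy sketch of intersecting two sorted posting lists with a
# skipTo-style operation, as described above.
import bisect

class PostingIterator:
    """Iterates a sorted list of row IDs and supports skipping forward."""
    def __init__(self, row_ids):
        self.row_ids = row_ids   # must be sorted ascending
        self.pos = 0

    def current(self):
        return self.row_ids[self.pos] if self.pos < len(self.row_ids) else None

    def skip_to(self, target):
        # Jump forward to the first row ID >= target. Binary search means
        # we never scan skipped entries one by one.
        self.pos = bisect.bisect_left(self.row_ids, target, self.pos)
        return self.current()

def intersect(left, right):
    """Return row IDs present in both iterators."""
    result = []
    a, b = left.current(), right.current()
    while a is not None and b is not None:
        if a == b:
            result.append(a)
            a = left.skip_to(a + 1)
            b = right.skip_to(b + 1)
        elif a < b:
            a = left.skip_to(b)   # ask the left side to jump forward
        else:
            b = right.skip_to(a)  # ask the right side to jump forward
    return result

print(intersect(PostingIterator([3, 5, 9, 1000000]),
                PostingIterator([1, 5, 800, 1000000])))  # [5, 1000000]
```

Note how the non-selective side never enumerates the gap between 800 and 1000000 entry by entry; it jumps straight there, which is the point of skipTo.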
This operation is a bit costly, but it's still often faster than going through all the entries. Suppose one index is very selective and another is not, for example an index on a Boolean flag (you can index Booleans, right? true or false, so 50% might match). Without skipping, even a very selective second index wouldn't do any good, because you would still have to go through all the entries on the other side. That's why we have the skip operation, which in this case skips forward by many items much faster, without going through the data file. But be aware that if you put multiple columns in a query, those intersections do cost some performance, so it's not as fast as selecting on one column.

Now some detailed stuff. I said that the index returns a list of primary keys; that's not exactly how it's done, the process has multiple stages. We actually have two types of indexes. There is a trie-based index for string data, so for literal data, because tries have the nice property of compressing prefixes very well: if you have a lot of text, each shared prefix is stored only once in the trie. This is the same data structure Cassandra is using for trie-based memtables, which was already presented earlier today. For numeric indexes we use a kind of kd-tree; it's not exactly a kd-tree, but a one-dimensional variant, a balanced tree with blocks on the leaves, where each block packs around 1024 items, if I'm not mistaken, but that's a detail. From either the trie or the kd-tree we get just row IDs. A row ID is simply the sequential number of the row in the SSTable: because an SSTable is immutable, we can just count rows from zero.
It was designed like that because those numbers are very small, so they compress very well. We don't need to store very big numbers: if we stored full primary keys in the trie, that would take up much more space and would also be much slower to search. So we store just row IDs, and then we use a per-SSTable component, stored only once and shared by all the column indexes on the SSTable, to translate row IDs to actual primary keys. We have several maps: row ID to token, row ID to partition key, and row ID to clustering key. We have separate maps because often we don't need to map a row ID to a full primary key; sometimes it's enough to map it to the token, because when we're doing comparisons we can compare by token first. If the tokens don't match, there's no point in fetching the partition keys or clustering keys, which is also a costly operation. That's why those structures are separated. Once we have the primary key, we can of course go to storage and fetch the data.

Okay, why does SAI also improve over SASI and the previous indexing technology? We used a lot of data-compression techniques in how we designed the indexes. Using row IDs is one example; using per-SSTable components is another, not repeating the data, avoiding redundancy at all costs. But we also try to be very smart in data encoding. For example, in a posting list, instead of remembering all the row IDs as absolute values: if the IDs are small, we don't need long integers for them, so we can bit-pack them. If they're bigger, that's still not a problem, because we can apply delta encoding, and the deltas between consecutive row IDs are much smaller numbers than the absolute values themselves.
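As a rough illustration of why delta encoding helps, here is a toy sketch, not SAI's actual encoder, showing how many bits per entry a sorted posting list needs before and after taking deltas:

```python
# Toy illustration: deltas between sorted row IDs need far fewer
# bits than the absolute values.

def deltas(sorted_ids):
    """Delta-encode a sorted posting list: store gaps, not absolutes."""
    prev, out = 0, []
    for x in sorted_ids:
        out.append(x - prev)
        prev = x
    return out

def bits_needed(values):
    """Bit width required to store the largest value in a block."""
    return max(v.bit_length() for v in values) if values else 0

posting = [100000, 100003, 100010, 100040]
d = deltas(posting)            # [100000, 3, 7, 30]
print(bits_needed(posting))    # 17 bits per absolute row ID
print(bits_needed(d[1:]))      # 5 bits per delta (after the base value)
```

Storing the base value once and packing the remaining entries at the small shared bit width is essentially the frame-of-reference idea described next.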
So this way we can store much smaller integers and then use bit-packed encoding. All these things together play well, they amplify each other: using just a single technique maybe won't win a lot, but all of them together work very well. There is also block-packed frame-of-reference encoding: data is stored in blocks, each block has a base value from which the other entries are delta-encoded, and all the entries within a block use, for example, just 5-bit integers; that bit width is stored just once per block, in the block metadata. These things result in much better disk usage overall.

Here's a comparison with 10 column indexes on the same table, and you can see the difference in disk usage is massive. This also translates into a performance increase, because the less data you have to fetch from disk, the faster it is. As for throughput, it's of course very dependent on the type of query; as I said, if you do a lot of intersections it will be slower. I mean, you can't expect the same level of performance as you get from primary-key lookups in Cassandra; that's really hard to beat, Cassandra is so fast that not many databases can actually compete with it. So the performance here is slower than primary-key lookups, but it's still much faster than filtering. For simple reads, like the read-unique benchmark, which reads one row using just one indexed column, we can see it's still in the tens of thousands of operations per second, and this is a single-node benchmark. With range queries it's in the thousands, and with intersections, where we have many benchmark variants with different column cardinalities, low-cardinality and high-cardinality columns, the performance is of course slower, but it's still in the hundreds per second.
Overall, if you look at latency, you can still be below about 20 milliseconds most of the time, and often below 10 milliseconds. That's fairly good.

Okay, the last slide: what can come next? There is quite advanced work on adding NOT operators, negative queries: NOT IN, NOT CONTAINS, NOT CONTAINS KEY, not equals. That's currently in review; mostly just minor fixes are required. Then there's indexing frozen collections, and range queries on more value types, so general range queries on other data types, which, as I said in the beginning, is a limitation we want to remove; we know there are still some gaps there, so that might come in the future. Maybe also adding the OR operator: if we had NOT, OR, and AND, we could do arbitrary nested expressions. That already looks like an RDBMS, doesn't it? Another thing, and this is one missing point compared to SASI, because SASI has support for it and SAI currently doesn't, is an idea we might want to investigate in the future: full-text search, maybe prefix or suffix search.

Okay, thank you. If you want to collaborate on this, it's all open source, right? There's the ASF channel for Cassandra SAI, and there's also documentation already out there, so you can actually see a preview. Thank you. I'll take any questions if there's a minute.

Okay, so the question is about how extensively we tested the performance of this, and on what kinds of workloads. We have a set of synthetic tests for SAI; the slide I showed is a subset of those tests, and there are many more. So: loading a billion rows, yes, a billion rows exactly, loading that into a cluster and measuring latencies and throughput. We have this Fallout system; by the way, Fallout is also open source.
It's a distributed performance-testing system that we use, and those benchmarks run pretty regularly. And that's basically it; we haven't tested it on production workloads yet, but maybe that will come, and we have some ways to do that as well.

Yes, NoSQLBench, actually. Officially we use NoSQLBench, and those results are from NoSQLBench. I personally, because I'm also part of that team, I personally also use Latte, which is my own tool, for doing some checks, and it actually matched NoSQLBench very well, so we got confirmation from two different tools.

Yes, there is some impact on memory and on heap. If you really index a lot of rows, there are some data structures in the indexes that are loaded in memory; they are used for accelerating access, so we don't have to touch disk every time. So adding a bit more memory for SAI is recommended, but this should be just a fraction of the memory needed generally for memtables. I don't have any ballpark number at the moment, but I guess something like 10% should probably be enough. Sorry. It's less than a thousand. Yeah.

Okay, one more. Yeah, that's right, but vector search and indexing is a bit of a different thing. We of course use SAI in vector search, but there are still some areas that need to be solved, and we are currently working on this; that's why we try to be conservative and not let people run into problematic areas. Okay, we ran out of time, so thank you again for coming.