Databases, a database seminar series at Carnegie Mellon University, is recorded in front of a live studio audience. Funding for this program is made possible by OtterTune and Google.

So we're excited today to have Ben Hannel from Rockset come talk about Rockset; he's super awesome. Ben did his undergrad and master's at Stanford, and he's been at Rockset since 2019, over three years now. One of my best students, Yash, is at Rockset, so we were happy to invite Ben to come give a talk. So Ben, the floor is yours. Again, for the audience, if you have any questions for Ben, please unmute yourself, say who you are, and fire away at any time. We'd rather have a conversation than have Ben talking to himself for an hour. So Ben, go for it, thank you.

Thanks for having me, Andy, it's really exciting to give the talk. So I'll be talking about a few aspects of Rockset today, one of which is how we get high performance despite being dynamically typed and schema-less, and I'll be diving into some interesting aspects of how we index data. I know you've seen a lot of databases here, so I'm going to focus on what makes Rockset different from most other systems. We're schema-less and dynamically typed: every document in Rockset can have a different set of columns, it can have nested data where you have objects and arrays inside each other, and it can have columns or fields of mixed types. So despite supporting full SQL, it looks a lot like a document store in terms of what kind of data you can put into it. Rockset also indexes everything, so there is no manual configuration of which fields to index and of what type; we index all of your fields and all of your nested data by default. We also support real-time data: real-time ingest, so when you do a write, it's visible within seconds, and real-time queries, being able to handle operational queries with latencies in the tens or hundreds of milliseconds.

So before I dive into how we build a schema-less database, I want to dive into why. Why is that useful? Why is that something that people would want? There are all the traditional arguments you get from people building document stores, folks like MongoDB and DynamoDB: advantages like not having to manage a schema, adding columns whenever your application's needs change. It also matters if you have some kind of pipeline to get data into your database: if you want to use a more traditional relational database to analyze data from the real world, you're going to need an ETL (Extract, Transform, Load) pipeline first. If your database is more flexible, then you can either do that transformation in SQL, which is easier than doing it before you hit the database, or you might not even need to do it at all. That's often the case: Rockset is sufficiently performant that you can just take data as it comes, with whatever denormalization and nesting, and then run queries on it directly. On top of that, there's an advantage to being schema-less in that you can interact easily with all these other schema-less systems. So Rockset can ingest data from a traditional OLTP database with a schema, but it can also take data from a NoSQL document store, data streams with JSON strings coming through Kafka, or data lakes if you've got JSON or CSV in S3. It can ingest all of these sources, mirror them in real time by either continuously watching the data lakes or by ingesting a CDC stream from NoSQL or OLTP sources, and keep the data in Rockset in sync with the original source. Everything making sense so far?
And then you get fast SQL on multiple sources, basically joined together so that you can combine different sources of information into one place where you can analyze it and then build applications on top of it.

Sorry, that rocket, is that Kafka's KSQL? Or is this just a Rockset thing? I think this is something in our slide deck defaults. I don't think we took it from anywhere. I think this is our rocket. No, no, no, I mean, sorry, the Kafka guys use a similar rocket for KSQL. Hmm. Okay. I hope we didn't accidentally copy-paste the Kafka KSQL rocket.

All right, so that's why you want schema-less, dynamically typed data. Why is that hard to build? There are some interesting challenges. To explain how Rockset approaches getting good performance on dynamically typed data, a little bit of background on what a vectorized execution engine is and why Rockset uses it. It's a common technique in databases; I bet a lot of people have heard of it. To amortize away the interpretation overhead of doing an operation from your query, you operate on batches of data at a time instead of individual scalars. So you don't want to have to trace over the AST of your query for every single tuple in your input. Instead, you trace over the AST saying, oh, we're doing an addition, we can add this column of 4,000 ints to this column of 4,000 ints, and you pay that interpretation overhead once, and then your inner loop is really tight. So Rockset has a columnar vectorized execution engine. And the nice advantage of that is that not only can you use it to amortize away the overhead of interpreting the AST of the query, you can also leverage it to amortize away the overhead of dynamic typing.

So think about a column in a strongly typed, or rather a statically typed, database: it will be all of one type. So you can have a representation that's like eight bytes or four bytes for each of these integers, and your loops just look like iterating over an array of integers. A naive implementation of a dynamically typed database might look something like this, where each tuple has some tagged type: an indicator of what type it is and then some value, which will be in a different format depending on the type. The performance of this kind of in-memory data representation wouldn't be great, because for every single input tuple you're going to be dispatching on what type it is. So you'll have a branch in your inner loop, you're using all this extra space for your tags, and this isn't going to be a very compact memory layout; not a very performant representation. But in practice, even when you give people the ability to use dynamic typing, typically there is an inherent schema to the data. Any given block of data is probably going to be all of one type. So within a columnar block, which I think is 4K by default, if it is all of the same type, then we'll get this compact representation where there's one tag on the whole array and then individual values that are compactly encoded with a homogeneous representation. We use this on disk for more compact storage, we use it in memory so we can process over it without dispatching on the type, and we send it over the network like this so that we use less network bandwidth. Of course, if a user actually has mixed types, where every other document is an int versus a string, then you'll get the heterogeneous representation, with a bit of a performance overhead to that. So how often does that happen? In practice, it's relatively rare to see mixed types.
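To make that concrete, here is a minimal C++ sketch of the two representations; the type and struct names are made up for illustration and are not Rockset's actual code:

    #include <cstdint>
    #include <string>
    #include <variant>
    #include <vector>

    enum class Tag { Int64, Double, String };

    // Homogeneous block: one tag for the whole block, tightly packed values,
    // so inner loops never branch on the type of an individual tuple.
    struct HomogeneousInt64Block {
        static constexpr Tag tag = Tag::Int64;
        std::vector<int64_t> values;   // 8 bytes per value
    };

    // Heterogeneous fallback: every value carries its own tag (rarer, slower).
    struct TaggedValue {
        Tag tag;
        std::variant<int64_t, double, std::string> value;
    };
    using HeterogeneousBlock = std::vector<TaggedValue>;

    // A tight, dispatch-free inner loop is only possible on the homogeneous form.
    int64_t sumBlock(const HomogeneousInt64Block& block) {
        int64_t total = 0;
        for (int64_t v : block.values) total += v;   // no per-tuple type check
        return total;
    }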
The kind of mixed type you do see very often is one type plus null, which we have a special case for that is actually an extension of the homogeneous format. And if you do have some sort of mixed types, like ints and floats, we have ways to coerce to one or the other at ingest time so you can get the homogeneous advantages. This also happens over the course of a query. So if you read the heterogeneous format from disk and then cast it, the output of the cast will become homogeneous, so everything upstream of that will get the benefits.

What about, say, a key-value pair or an array where most of the time you expect to see a float, but then you also see a string? Will you fix it for them and cast it to a float right there, or do you take it as is? We preserve the exact semantics of the data as the user gives it to us. So if they put a string into their dataset, we will keep it as a string, and if they query it, they will get it out as a string. We're not going to do any magic conversions in the background. Yeah, even if what people think they have is stupid and wrong, I guess you just let them be stupid and wrong, right? Our approach is generally to let the user know they're being stupid and wrong: give them some kind of warning, some kind of error, that they're doing something on types that they really shouldn't be. But we find that silently converting generally causes more confusion in the long run. Yeah, thanks. Yeah, really good question. Users have a larger set of ways to shoot themselves in the foot when you give them dynamic typing. Type systems can be a blessing at times, but the advantage here is getting the flexibility.

So now we have many different homogeneous representations for many different types. I'm showing two here, but we actually have something like 16 types, 16 primitive types. We've got timestamps, datetimes, dates, wider integer types, a lot of really interesting stuff. And then you need code that's generic.

Other question, Jen? What if users have a JSON type in the schema, how do you handle it? She already asked your question. Yeah, so basically, I think Rockset maybe supports some schema-less ingestion. So if users just give you a JSON column, will you do some optimization when you ingest the data and convert it to some homogeneous representation, or do you just store the JSON as a string inside Rockset and parse it during the computation? So that's a really interesting question. It gets to something that a lot of other databases, at least in Rockset's opinion, have done wrong. They have JSON as a type in their database, where you have a column of integers, a column of floats, and a column of JSON. That approach doesn't work very well, because now you have this column with this weird type that doesn't act like any other type and that you can't really index properly, and even if you can, you're going to require weird custom extensions to the language to make it work. What Rockset says is that our notion of a document is a superset of JSON. So if you want to take an object parsed from JSON and say that's a column in Rockset, we will not just be able to store that and represent it fully, we will shred it and index it all the way down, all the way to the leaves of that JSON object. Even if there are arrays, we'll index into the arrays. We're not shoehorning JSON into one corner of one type. Rockset is expressive enough to store JSON as it is. Does that make sense? Yeah, this makes sense.
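To illustrate the shredding idea, here is a toy sketch, with a hypothetical Node type standing in for a real document representation, of how a nested document could be flattened into field-path/value pairs with array positions collapsed to a wildcard; this is an illustration only, not Rockset's implementation:

    #include <iostream>
    #include <string>
    #include <vector>

    // Very simplified document node: either a leaf value or a list of children.
    struct Node {
        std::string key;            // field name under the parent ("" for array elements)
        bool isArrayElement;
        std::string leafValue;      // set only when children is empty
        std::vector<Node> children; // C++17 allows a vector of the enclosing type
    };

    // Emit one (path, value) pair per leaf; each pair would become an index entry.
    void shred(const Node& node, const std::string& path) {
        if (node.children.empty()) {
            std::cout << path << " -> " << node.leafValue << "\n";
            return;
        }
        for (const Node& child : node.children) {
            const std::string part = child.isArrayElement ? "*" : child.key;
            shred(child, path.empty() ? part : path + "." + part);
        }
    }

    int main() {
        // {"name": {"first": "Jenny"},
        //  "contact_info": {"phone_numbers": ["8675309"]}}
        Node first{"first", false, "Jenny", {}};
        Node name{"name", false, "", {first}};
        Node phone{"", true, "8675309", {}};
        Node phoneNumbers{"phone_numbers", false, "", {phone}};
        Node contactInfo{"contact_info", false, "", {phoneNumbers}};
        Node doc{"", false, "", {name, contactInfo}};

        shred(doc, "");   // prints: name.first -> Jenny
                          //         contact_info.phone_numbers.* -> 8675309
        return 0;
    }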
I think I'll have later examples that will clear it up a bit more. Let's see. I was saying we have many different homogeneous types for all the types in Rockset. We want to write generic code that operates on them, but without paying the dispatch overhead in the inner loop of saying, okay, what type are we operating on right now? We do that using C++ templates. I'm going to show this as a code example because I think that's the clearest way to understand.

Say we want to write a function that says: if I duplicate the last element in some array of some type, and we don't know what type yet, how many bytes of memory will that use? So IValue is the dynamic type that we use as a holder for all of these different homogeneous and heterogeneous array types. Think of it as a tag and then some heap-allocated data that the tag tells you how to interpret. So the first thing we do is we invoke this dispatch function, arrayTypeDispatch, where we give it this dynamically typed IValue, and then we pass it a closure that we want to execute on this dynamically typed IValue. Following so far? So here's where things get interesting. The closure doesn't take an IValue as a parameter. It takes auto, which basically means an inferred templated type that will vary depending on the runtime type of the IValue. So this is runtime polymorphism based on some tag, but then this is compile-time polymorphism where we are generating a new version of this closure for every possible type of the array, which means any code we do inside here doesn't need some runtime dispatch on the type, because it's known at compile time. Each version of the closure only needs to know how to handle one type of array. So we can symbolically manipulate the type, saying here's the type of array we're working with, and then we can use, what is it, dependent types? Basically the array knows its own data type and it knows how wide it is. So for all fixed-width types, this is sufficient. This code handles every fixed-width type, saying how much space it will take to replicate the back element N times. And of course, we can handle special cases. We can have an IValue array of IValues, because you can have nested data; you can have an array of arrays or an array of objects. And in that case, you actually need to check the runtime byte size of the last element. But you actually don't pay the overhead of this check at runtime for anything except the actual dynamic IValue case. For all your homogeneous cases, this just compiles away and all you have to do is return this constant sizeof. Does that make sense? So you can write generic code that is high performance and pretty concise.

This is one way we dispatch across types. There are some more interesting mechanisms. For casting, we have a dynamic input type and then a constant output type determined by the query, so we can do a lookup in a 2D array of function pointers saying, what's our implementation of the cast from type A to type B? We can likewise do interesting constexpr stuff to analyze properties of a given type and do that all at compile time within one of these overloads. So we can, for instance, get a constexpr constant saying, is this type numeric, and then branch on that in a compile-time way. And all of these can be vectorized so that we can hoist all the dispatching logic out of the inner loop. So our inner loop is fast and all the overheads amortize away for homogeneous types.
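Here is a compressed sketch of that dispatch pattern; the names IValue, TypedArray, and arrayTypeDispatch are illustrative stand-ins rather than Rockset's real API, and only two array types are shown:

    #include <cstddef>
    #include <cstdint>
    #include <type_traits>
    #include <vector>

    // Homogeneous array of one element type; Element is known at compile time.
    template <typename T>
    struct TypedArray {
        using Element = T;
        std::vector<T> values;
    };

    using Int64Array  = TypedArray<int64_t>;
    using DoubleArray = TypedArray<double>;

    // Dynamically typed holder: a runtime tag plus one of the typed arrays.
    struct IValue {
        enum class Tag { Int64Array, DoubleArray } tag;
        Int64Array  int64Array;    // only the member matching `tag` is meaningful
        DoubleArray doubleArray;
    };

    // One runtime dispatch on the tag; the generic lambda is instantiated once
    // per concrete array type, so the code inside it is fully monomorphic.
    template <typename Fn>
    auto arrayTypeDispatch(const IValue& value, Fn&& fn) {
        if (value.tag == IValue::Tag::Int64Array) return fn(value.int64Array);
        return fn(value.doubleArray);
    }

    // "How many bytes to replicate the last element n times?" For fixed-width
    // element types this is just sizeof(Element) * n, with no per-call branch.
    std::size_t bytesToReplicateBack(const IValue& array, std::size_t n) {
        return arrayTypeDispatch(array, [n](const auto& typedArray) -> std::size_t {
            using Element = typename std::decay_t<decltype(typedArray)>::Element;
            return sizeof(Element) * n;
        });
    }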
If I can get esoteric: so are you sort of doing the VectorWise approach, where you have a bunch of these primitives and you're just stitching them together at runtime for the query plan, and assuming that the compiler's gonna vectorize everything for you? Yes, we definitely rely on vectorization over JIT. We've considered JIT, but haven't done it yet. We find that we can get a lot of the same advantages without a lot of the complexity. Does that answer your question? Yeah, I think so. Yeah, and I think there's another question, yes.

Okay, this is one of Andy's students here at CMU. Is there a sort of inflection point where there's a trade-off between all this template specialization at compile time, putting pressure on the instruction cache, as opposed to doing dynamic dispatch at runtime? That's an excellent question. We don't have a huge amount of data to justify this, but in practice it has not seemed to blow up our instruction cache in a way that doesn't scale. It's possible that's in part because a relatively small subset of those 16 types are actually hot, so we can fit all the relevant code into our instruction cache. But there is a trade-off in terms of how much code you put inside this closure that you're going to replicate N ways; a trade-off between how much code you want to generate and how many branches you want to hoist out of that loop. We can deal with this: we can also do partial specializations, where we can say, I know that this function is actually only supported for types X, Y, and Z, so only generate the inner loop for types X, Y, and Z to eliminate some of that bloat.

Nevin, do you have a question? Yeah, hi, I'm wondering what effect your approach has on compilation and build times. That's a good question, and it's what you'd expect. Our builds are fairly slow. We have a distributed build system, and thanks to the distributed build system we can get a full build in like four minutes. But without that, I think it's something around 25 minutes locally to build our entire C++ backend. It's tolerable, but yeah, you pay a price for this. All right, any other questions? Okay.

So in addition to performance challenges, there are also semantic questions around what happens when you do SQL on a dynamically typed system. For instance, does the integer 1 equal the float 1.0? There are a few behaviors you could have here. You could say, yeah, they're equal; they aren't equal; or it's a runtime error, you can't do comparison between unequal types and you're going to have to cast one side or the other. We decided it makes the most sense to say that these are equal. This is what Postgres does; Postgres allows implicit conversion here. There are some more pragmatic reasons too. If you just say this is false, you'll get really weird behavior when a user says, oh, select star from my table where x equals 1. If their column is a float but they thought it was an int, they're just going to get no results. No error, nothing. So really, true or runtime error are the only valid options, and runtime error just makes querying more cumbersome, and being non-Postgres-compatible is a pain. So that one's pretty easy. What about 0 equals 0.25? Do you cast one side to the other side? Do you do an implicit conversion? Do you do some exact arithmetic thing? What we decide here is they're not equal. Once again, consistent with Postgres, consistent with mathematics, and it generally produces intuitive behavior for queries.
Now things get interesting. Does the double 2 to the 62 equal the integer 2 to the 62 plus one? Oh, I'm just really curious about that right now. That's definitely not a typo in the slides. This is interesting because of the representation of 64-bit floating point. If you do what Postgres does and just convert both sides to, wait a minute, let me get this right, convert both sides to double, this integer will actually round to the nearest 53 bits of precision, and it will round to 2 to the 62. And Postgres will tell you, yeah, these two numbers are equal. Rockset doesn't do that, and there are very good reasons not to do that, mostly that it violates the transitive property. If you say that the double 2 to the 62 is equal to the integer 2 to the 62 plus one, you also have to say it's equal to the integer 2 to the 62 minus one. But 2 to the 62 minus one is definitely not equal to 2 to the 62 plus one if they're both integers; those are 64-bit integers that have different bits. Violating the transitive property isn't a huge problem for Postgres because you're not often doing set-style transitive operations with mixed types in Postgres; it's pretty much impossible to do. I would argue even in Postgres's case, it's kind of unintuitive that equality is not transitive. I have an example here where A is equal to B. Let's see. Actually, I think you need a third one to show that it's not transitive. But basically you construct three things where A equals B, B equals C, and A is not equal to C.

The reason Rockset cares so much about the transitive property is that you can have a group by or a join on mixed types and you need to do something sane. And kind of the only sane thing you can do is use exact mathematical equality, even across different numeric types. If you were to convert everything to int, then if you have floats that round or truncate to the same integer, they would go into the same group-by bucket and you would get wrong results. If you convert everything in the group by to float, then ints out of the range of negative 2 to the 53 to positive 2 to the 53 would truncate to the nearest float and you'd get wrong results. A runtime error is an option, basically saying you can't mix both of these types in a join or group by, but that could cause really confusing behavior: you write one new document to your system and all of a sudden all your queries are failing. There's also the fact that since we're doing distributed joins and distributed aggregations, actually determining whether it is mixed can be difficult, because you hash-scatter your grouping key and then you have to coordinate between those nodes so they all know: oh, I got some floats, I got some ints, is this actually a mixed-type operation? So we decided to use exact mathematical equality, basically the number that is being represented by the bits in this double, which is going to be like a one shifted over however many places, to compare ints and floats.

How often does anybody actually hit this issue? I mean, you have it up here, but across all the queries you've ever executed, and I'm sure you keep track of them, has anybody ever complained that, my query on Postgres got me this, and I'm getting a different answer from Rockset? Or is it analytics and therefore it's such a wash that nobody would know the exact numbers anyway? So there are subsets of the semantics we've discussed here that we did come to because customers care.
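For anyone who wants to check the rounding behind that example, here is a tiny standalone C++ snippet (an illustration, not Rockset code) showing that converting both sides to double makes both 2 to the 62 plus one and 2 to the 62 minus one compare equal to the double 2 to the 62, even though the two integers differ:

    #include <cstdint>
    #include <iostream>

    int main() {
        const double  d  = 4611686018427387904.0;        // 2^62 as a double
        const int64_t hi = (int64_t{1} << 62) + 1;       // 2^62 + 1
        const int64_t lo = (int64_t{1} << 62) - 1;       // 2^62 - 1

        // Doubles carry only 53 bits of mantissa, so both ints round to 2^62
        // when converted, which is exactly the transitivity violation above.
        std::cout << (d == static_cast<double>(hi)) << "\n";   // 1 (true)
        std::cout << (d == static_cast<double>(lo)) << "\n";   // 1 (true)
        std::cout << (hi == lo) << "\n";                       // 0 (false)
        return 0;
    }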
For instance, the case I was talking about where you search for select star where x equals 1 and you have x equals 1.0: if you return false there, users will get angry, they get really confused. That one's obvious, sure. Yes, but I'm saying, when you differentiate yourself, when you're doing something different from what Postgres does and you're going with the mathematical equality, I can't imagine anybody would notice that. It's rare. It's very rare for people to actually do it. But if you pick the wrong thing here, you're going to be giving wrong results in some subset of cases. Sure, but you're not doing approximate queries, right? But at a certain scale, would anybody actually notice? What I'm asking is, did anybody say anything before you implemented it this way? Or did you just say, we're gonna implement it this way first, and then it has worked out since then? In the case of how we're doing implicit conversion here, no, I don't think that was driven by a user complaint. I think we decided we wanted an internally consistent system and this was the way to get reasonable semantics. Okay, that makes sense. But all of these rules I'm describing actually apply similarly to dates, datetimes, and timestamps, and there users do care a lot, and sometimes do fuzzy things where they want to select something based on a date constant but they need an exact conversion to timestamp. Very similar rules. And yeah, if you don't allow implicit conversion, or if you have implicit conversion with unintuitive semantics, people will complain.

So how do we actually do this? For equality, we figure out which domain we can perform the equality in safely and then convert both sides to that domain. So small numbers will both go to double, large numbers will both go to int, and you can guarantee that you can go to at least one of the domains without rounding. For inequality, less than and greater than, we follow the same exact mathematical rules. It's a little more complicated, but similar in general flavor to equality. For hashing, we need to be able to do hash-scatter and hash table operations while treating 1 and 1.0 as equal. So basically, for any double that can be exactly converted to an int, we do convert it to an int before hashing, and this is a check you can do very quickly. We chose to have this check on the double path rather than the int path. We could have converted ints to doubles when they're convertible, but we thought a small performance penalty on the double path would be better than on the int path, since aggregating and joining on ints is more common. And all of these rules apply analogously for temporal types, which are also comparable and can be grouped together.

I'm assuming you don't support UDTs, but at some point you may need to. So in addition to being able to say whether one UDT equals another UDT within the same type domain, now they'd potentially have to provide all these casting rules as well. Yeah, user-defined types would get very interesting, definitely. I don't think we have any plans for them right now. I mean, JSON sort of gets you halfway there, yeah. Yes, you can have nested objects and decide on semantics for the keys of those objects. And we have user-defined functions; you can write them in JavaScript. So you could just have your user-defined type as an object, and then anywhere you want something other than the built-in equality, you write your own UDF. You could probably get almost all the same behavior. Fair enough.
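A minimal sketch of that hashing rule, assuming hypothetical hash helpers rather than Rockset's actual functions: a double that is exactly an integer in int64 range is hashed as that integer, so 1 and 1.0 land in the same bucket, while the common all-int path pays nothing extra:

    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <functional>

    // Stand-ins for the real hash functions.
    uint64_t hashInt64(int64_t v) { return std::hash<int64_t>{}(v); }
    uint64_t hashBits(uint64_t v) { return std::hash<uint64_t>{}(v); }

    // Hash used for doubles in group-by / join keys.
    uint64_t hashDouble(double v) {
        const bool integral = std::trunc(v) == v;              // no fractional part
        const bool inRange  = v >= -9223372036854775808.0 &&   // >= -2^63
                              v <   9223372036854775808.0;     // <  2^63
        if (integral && inRange) {
            // Exactly representable as int64: hash the integer instead,
            // so 1.0 collides with the integer 1.
            return hashInt64(static_cast<int64_t>(v));
        }
        uint64_t bits;
        std::memcpy(&bits, &v, sizeof(bits));                  // otherwise hash raw bits
        return hashBits(bits);
    }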
All right, now I'm going to move on to indexing. How does Rockset index everything? So we index everything, including nested data inside arrays and objects. I feel like it's easier to understand how that works with an example. Say that you have some dataset with contact information for people. You have nested data, where you have an object inside a top-level field, and you also have arrays: you have a contact info field which contains an array of phone numbers. So how do we index these phone numbers for fast retrieval?

The format of our index looks a lot like a map from the field name, which can potentially be nested if you have multiple levels of fields (name dot first means the field first inside the name top-level field), plus the value we're indexing on; that's going to be the value in the key-value pair here. And then the set of matches. These matches will be documents and potentially the indexes within the arrays where the value appears within a document, so, a location within a document. So this index is a lot like a more traditional database index: you just have a list of matches for this particular value. But then we also index data inside of arrays, and the way we do that is the path will contain both the field names, contact info dot phone numbers, but then also some representation to say, okay, now we're going inside of an array. And notice it's a star here rather than any particular index; I can dive in a little more in a minute on why that is a star, why we index every position in the array the same way. And then you have the value and you have a list of matches for that value. So if I want to say, find me all users that have a phone number equal to 1234567, you just look at the posting list for the field path of that array and the value you're looking for, and then read off the list. You can even intersect these indexes. So if you want to look for a user with a particular phone number, that's what we're using array contains for here, saying this array has the value we're looking for, and who also has these other properties, we get multiple lists from our index, we intersect them in an efficient way, and then we have a final list of documents that we can look up.

We could alternatively have done this where, instead of having a star here, we index at a specific position, with keys like contact info dot phone numbers dot 1, the value 8675309, and then map that to a specific document ID. But that makes array contains expensive, because now you need to look up every potential position in every document, which is O of the length of the longest document. The trade-off here is that if you have a lookup where you're looking for a phone number at a specific index with a specific value, you have to look at the wildcard index and filter out the ones that match the correct position. We thought this was the better trade-off, because if you want the other behavior, all you have to do is convert your phone numbers array into a phone numbers object with fields called 1, 2, 3, 4, 5, and you get the other behavior. And this is also the more sensible default, because generally the performance penalty involved with finding all matches at any position and then filtering for just the position you care about is relatively small, whereas the performance penalty for looking up an index entry for every single array position would be quite large.
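As a rough illustration, assuming a hypothetical in-memory layout rather than Rockset's on-disk format, the index described above can be pictured as a map from a (field path, value) key to a posting list, with array positions wildcarded in the path:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    struct Posting {
        int64_t docId;
        int32_t arrayIndex;   // position within the array; -1 for non-array fields
    };

    // Key = (field path, value), e.g. ("contact_info.phone_numbers.*", "8675309").
    using InvertedIndex = std::map<std::pair<std::string, std::string>,
                                   std::vector<Posting>>;

    // "Find all docs whose phone_numbers array contains 8675309": one lookup,
    // then read off the posting list.
    std::vector<int64_t> arrayContains(const InvertedIndex& index,
                                       const std::string& path,
                                       const std::string& value) {
        std::vector<int64_t> docs;
        auto it = index.find({path, value});
        if (it == index.end()) return docs;
        for (const Posting& p : it->second) docs.push_back(p.docId);
        return docs;
    }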
So this basically means if you delete this record, this document, you've got to go reverse and delete everything? I'll get into how we handle updates later. Updates to this index structure are actually quite cheap. We use a, well, I'll go into more detail later. Yeah, but you would have to update everything, all the indexes, for every delete. Got it, thanks.

Let's see. So I said that we can intersect multiple indexes at once. We have an interesting way of doing this. You essentially treat each index like an iterator, maintain a pointer into each of them, and then you advance them progressively until you find matches. So let's say we first advance this first iterator, name dot first equals Jenny, and we say advance this iterator to the first position greater than or equal to doc 42, which is the first match we found on the greatest of the three iterators. We can advance to the greatest of the three iterators because we know we're not going to find any match before doc 42, because this list doesn't have anything before doc 42. And these iterators support both a sequential next API and a seek-to-location API that's logarithmic. Okay, so we advance our second iterator, find a match. We advance our third iterator and immediately find a match. We have a match on all three iterators, so we emit that to the output. Once we emit this document to the output, we can look up in the row store to find the rest of the fields corresponding to this document.

There was a question I kind of glossed over here: how did we decide to advance the name dot first iterator before we advance the state iterator? In this case it doesn't matter too much, since we find the match immediately, but if we have really long iterators where some are sparse and some are dense, getting this right matters. You want to intersect the smallest lists first, and only once you get a match in the smaller lists do you advance the larger lists. Otherwise you could do a lot of work intersecting two dense lists only to find you could have skipped a million spots ahead if you had advanced a different list first. The way we do this is, every few thousand times we advance, we look at how far we're skipping with each seek operation and each next operation, and we estimate dynamically how dense each list is. And then we decide, okay, whichever ones are sparsest according to our estimates, we'll intersect those first, and only once the sparse ones match will we advance the dense ones. This means that it is correct and robust even with complicated distributions that change over the course of the doc ID space. These doc IDs are assigned sequentially, so if you have changing distributions over time, it will adapt as you advance and say, oh, this thing became very sparse in this time domain, so now we can use it to more efficiently skip the other indexes. Makes sense?

Another nice thing that falls out of indexing data in arrays is that you get text search for free. If you tokenize your data at ingest time, which Rockset provides tools for, and basically turn your string into an array of word tokens, you can now use array contains or our specialized text search syntax to perform text search on this token list. Basically you can say, oh, the array contains each of these three words I want, and I want them to be within this proximity, we have syntax for that, and now you have fast indexed text search. So the type of index I've discussed so far is our simple inverted index: you go from a value to the set of documents with that value.
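A simplified sketch of that intersection procedure, leaving out the density-based reordering just described; the iterator class is hypothetical, but the seek-to-the-largest-candidate loop is the same idea:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One posting list (sorted doc IDs) exposed with next-style access and a
    // logarithmic seek(), like the index iterators described above.
    class PostingIterator {
    public:
        explicit PostingIterator(const std::vector<int64_t>& docs) : docs_(docs) {}
        bool valid() const { return pos_ < docs_.size(); }
        int64_t doc() const { return docs_[pos_]; }
        // Advance to the first doc >= target.
        void seek(int64_t target) {
            pos_ = std::lower_bound(docs_.begin() + pos_, docs_.end(), target) -
                   docs_.begin();
        }
    private:
        const std::vector<int64_t>& docs_;
        std::size_t pos_ = 0;
    };

    // Intersect by repeatedly seeking every list to the current candidate doc;
    // a doc is emitted only when all lists contain it.
    std::vector<int64_t> intersect(std::vector<PostingIterator> its) {
        std::vector<int64_t> out;
        if (its.empty()) return out;
        for (PostingIterator& it : its) {
            if (!it.valid()) return out;            // any empty list: no results
        }
        int64_t candidate = its[0].doc();
        while (true) {
            std::size_t agreed = 0;
            for (PostingIterator& it : its) {
                it.seek(candidate);
                if (!it.valid()) return out;        // a list ran out: done
                if (it.doc() == candidate) {
                    ++agreed;
                } else {
                    candidate = it.doc();           // skip ahead to the larger doc
                    agreed = 1;
                }
            }
            if (agreed == its.size()) {             // every list had the candidate
                out.push_back(candidate);
                ++candidate;                        // move past it and keep going
            }
        }
    }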
The inverted index can be used to serve range queries, but it's not as efficient as it could be. If you just have the inverted index and you have a range query saying, get me everything where x is less than 20, you're going to have to look up a lot of distinct iterators in the inverted index and then merge them all together before you can intersect them with any other predicate. That merging process gets expensive when you're talking about hundreds or thousands of individual inverted index entries. So we also build a multi-level range index, where we take different fixed ranges of integers, timestamps, datetimes, those sorts of ordered types, and we group them. We say, for everything between values of zero and 15, give me all the docs that match, and we do this at multiple granularities, I think 2 to the 4, 2 to the 8, and 2 to the 12. We do it at multiple granularities so that most ranges can be constructed out of a small set of iterators that can be cheaply linked together.

Chi, what was your question? Yeah, so I'm curious, how do you know the data range in advance? So basically an integer has, for example, zero to 2 to the 32, this big range. So how do you know the user's data range in advance to construct such an index? We don't know the distribution in advance. We use a fixed set of rules where basically smaller numbers get indexed in smaller ranges and then it sort of scales exponentially, so that if a number is less than 2 to the 8, then we index it at the 2 to the 4, 2 to the 8, and 2 to the 12 granularity; if it's greater than 2 to the 8, then it goes into the 2 to the 8 size buckets and the 2 to the 12 size buckets. Some kind of rules like that. I think I'm forgetting the exact powers, but basically the further you are from the origin, the wider the buckets are, in such a way that for the vast majority of ranges you can construct them out of a small number of these static sub-ranges. That makes sense, thanks. I think this technique is called a static range tree, as opposed to a dynamic range tree where you move these boundaries around based on the data distribution, which can get better performance in some cases. Any other questions?

So in this case you would construct this by uniting together the zero-to-15 range as well as the individual inverted index iterators for 16, 17, 18, and 19. So you'll use a few individual iterators at the end of the range and then a series of ranges to capture the bulk of the middle.

So Rockset has three storage formats. I've been talking about the index store: the inverted index and range indexes. The index store produces these doc IDs, which we then look up in the row store to retrieve the rest of the document. And when I say doc one here, it really is just the locator that we're showing here, none of the fields inside the document; so we use the row store to look up the rest of the data. The column store is a columnar format for fast bulk scans that gets all the traditional advantages of a columnar store: good compression, good throughput when you only need some subset of columns, all of that. Are these three copies? Is this all stored, everything, in RocksDB? Yes, yes, RocksDB is the underlying storage engine for all three of these stores. Okay, okay.
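Here is a rough sketch of how a query range could be decomposed into pre-built aligned buckets plus individual values at the edges; the granularities and the greedy alignment rule here are illustrative, and the real rules also vary bucket width with magnitude as described above:

    #include <cstdint>
    #include <vector>

    struct Bucket {
        int64_t start;
        int64_t size;    // 1 means an individual inverted-index entry
    };

    // Break [lo, hi] (inclusive, non-negative values assumed) into the largest
    // aligned power-of-two buckets that fit, falling back to single values.
    std::vector<Bucket> decomposeRange(int64_t lo, int64_t hi) {
        const int64_t granularities[] = {int64_t{1} << 12, int64_t{1} << 8,
                                         int64_t{1} << 4, 1};
        std::vector<Bucket> out;
        int64_t cur = lo;
        while (cur <= hi) {
            for (int64_t g : granularities) {
                if (cur % g == 0 && cur + g - 1 <= hi) {   // aligned and fits
                    out.push_back({cur, g});
                    cur += g;
                    break;
                }
            }
        }
        return out;
    }

    // e.g. decomposeRange(0, 19) yields one 16-wide bucket covering 0..15 plus
    // the single values 16, 17, 18, 19, matching the example from the talk.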
So indexing everything seems like it would be expensive; it seems like it would take a lot of compute. And to a certain extent it does, but there are two things we do that mitigate this a lot. One is that our storage engine is a log-structured merge tree, which means that all of our writes are sequential. Even if we're updating a bunch of different indexes, the pattern on disk at the time you originally write it is one sequential series of files, and then compaction will merge those into lower layers of the tree in the background. If you were to try to build an index on every single column in, say, a B-tree-oriented database, the amount of fan-out you would get to your disk would be pretty insane, like the number of different point writes you would have to do to ingest a single document. That's not an issue with an LSM tree, because all writes are sequential.

We also use merge operands to reduce the cost of index mutability. Say you have an index where you have a list of a million doc IDs for a particular value, and then you add one new document with that value. You don't want to read in that index entry that's a million elements long, add one value, and then write it back; that would tremendously increase the cost of ingest. Instead, what we do is write a delta representing "I added this one document to this index", and then when compaction comes along to merge layers of the LSM tree, it will combine those deltas and amortize away many deltas in one read-modify-write operation. Is this a separate code path or a separate operation from, like, the traditional compaction in RocksDB? Yeah, the primitive RocksDB gives you is a merge operator. Basically, you can define custom semantics that you plug into RocksDB: it tells you the key it's operating on, and then you define the semantics for how to merge two values for that key.

Indexes also usually take a lot of space to store. How does Rockset deal with that? We have disaggregated storage, where we have separate, independently scaled tiers of compute nodes and storage nodes: storage nodes having big SSDs and not very many processors, compute nodes having a small local SSD for a cache and very fast processors. This independent scaling can somewhat mitigate the cost of having more data in storage, because that storage, when it's cold, can be a little further away and a little cheaper. There's also the fact that flash storage has generally been getting cheaper faster than compute has. So making this trade-off, where you store more data in exchange for using less compute at query time and getting faster queries, is often favorable from a total cost perspective. There's also the advantage of less human cost for configuring indexes, since you get them all by default.

Can you say, roughly, as a percentage across the entire fleet of Rockset, what is the percentage of data that gets served off of S3? So with the two storage tiers, all of the data is still in flash storage, but it's in flash storage that's attached to very little compute, so we pay less per gigabyte. Got it. So two questions on that: one is, why not jump to S3? And then two, what percentage of the entire data set that Rockset manages is on this cheaper storage? So all of the data in Rockset is on the disaggregated storage tier, but some of it also lives in the local cache; it's an inclusive cache. Let's see, I'm trying to think about what percentage of our total data set fits in this cache. I would have to look at the numbers to get something exact; I think it's gonna be something on the order of 10 or 15% maybe. Okay, so basically there's S3, and then there's like a cold cache, and there's a hot cache. Yes. Okay. We actually never want queries going directly to S3, at least today. S3 just has latency that's too high to serve queries from; a single get could take 100 or 200 milliseconds, and if you're doing an index lookup, every single one of those documents is going to turn into a different random access. We use S3 for persistence, though, for durability, to guarantee that even if our storage tier goes down, the data is still there. We just don't want to be serving queries from it.
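A conceptual sketch of the merge-operand idea, deliberately not using the real RocksDB MergeOperator interface: a write appends a tiny delta for an index key, and compaction later folds many deltas into the base posting list in one pass, so ingest never has to read-modify-write a million-entry list:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using PostingList = std::vector<int64_t>;   // sorted doc IDs for one index key

    // What a write produces: a tiny delta, cheap to append to the LSM tree.
    PostingList makeDelta(int64_t newDocId) { return PostingList{newDocId}; }

    // What compaction does: merge the base value with all accumulated deltas
    // in a single read-modify-write.
    PostingList applyDeltas(PostingList base, const std::vector<PostingList>& deltas) {
        for (const PostingList& delta : deltas) {
            base.insert(base.end(), delta.begin(), delta.end());
        }
        std::sort(base.begin(), base.end());
        base.erase(std::unique(base.begin(), base.end()), base.end());
        return base;
    }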
All right, I said I was gonna talk a little more about how we do writes. So I was talking about merge operands: how do we store a delta for an individual write on an already very large index? So we have a new insert that basically will go into this inverted list, which is already very long. Rather than reading the whole thing and writing it back, we store a merge operand. So now we have two values for the same key, which RocksDB allows. And at read time, we can basically treat them both as iterators and do the same thing we did for intersection, but now we're doing a union, where we emit a result when either of the iterators matches. And then once compaction comes along to merge these back together, we will end up with one longer list and we don't need to pay the cost of the streaming merge anymore. But by the time compaction comes along, we'll probably have seen several more updates to this index. So instead of doing a read-modify-write for a single entry out of a million, it might be a read-modify-write for a thousand updates out of a million.

So to sum up the advantages of this dynamic type system. Oh, do you have a question? So I have a question on this page, right? I can think of a way to encode those keys so that you don't need to do merge operands. Like, can you encode a key like state.active.doc1, state.active.doc2? Yeah, so you're actually describing an earlier version of our index, where we actually had a single key-value pair in RocksDB for every single match in the index. The issue with that is that a RocksDB key-value is a pretty heavyweight concept, both in terms of storage and compute. You've got several bytes of overhead for the key, and a RocksDB next is fairly expensive, I think it's going to be something on the order of maybe a hundred microseconds, or 10 microseconds, versus we really would rather be able to grab these matches just by iterating over an array of eight-byte integers, where we're talking about a nanosecond or a fraction of a nanosecond. Yeah, thanks. So yeah, this is actually a simplified picture: we actually group these into ranges, where every 64K range of docs is one RocksDB key-value pair, and then you'll have the next key-value pair for the next part of the range. So this would actually be not a million RocksDB key-values but maybe, what, a thousand, a few hundred.

All right, so since Rockset has a dynamic type system and is schema-less, it can ingest data from anywhere. Its expressivity is a superset of most other databases you'll see. So you can take data from DynamoDB, MongoDB, JSON or CSV sitting in S3, whatever you want. It indexes everything, so you can get fast queries without manually configuring indexes and figuring out the right database configuration for your workload. It handles real-time data: since the indexes are mutable and the columnar format is also mutable, you can do a write and it'll be visible to queries within seconds. No batch processing; every delta is visible as soon as it's written.
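As an illustration of how those deltas are visible before compaction runs, here is a small sketch of the union read path over the base posting list and its pending deltas; the code is hypothetical, not Rockset's:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // The base list and its not-yet-compacted deltas are all sorted; at read
    // time they are streamed together as a union, emitting each doc ID once
    // no matter how many of the inputs contain it.
    std::vector<int64_t> streamingUnion(const std::vector<std::vector<int64_t>>& lists) {
        std::vector<std::size_t> pos(lists.size(), 0);
        std::vector<int64_t> out;
        while (true) {
            int64_t smallest = std::numeric_limits<int64_t>::max();
            for (std::size_t i = 0; i < lists.size(); ++i) {
                if (pos[i] < lists[i].size()) {
                    smallest = std::min(smallest, lists[i][pos[i]]);
                }
            }
            if (smallest == std::numeric_limits<int64_t>::max()) return out;  // exhausted
            out.push_back(smallest);
            for (std::size_t i = 0; i < lists.size(); ++i) {   // consume that doc everywhere
                if (pos[i] < lists[i].size() && lists[i][pos[i]] == smallest) ++pos[i];
            }
        }
    }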
We also have fast queries on columnar data: you can use our vectorized execution engine and columnar format to do fast scans. And for anything search-based, you can use indexes on every field that's present in the predicate of your query. All right, do people have any other questions?

I'll applaud Ben on behalf of everyone else. I'm gonna open it up to the floor: if anybody has a question for Ben, just unmute yourself and go for it. Otherwise I'll use the time. Here you go. So my first question is, how large are the blocks in the column store? That's configurable; I think our default today is 4096. But configurable by you or by the user? By us, by us. Got it, okay. And under what conditions do you guys change it? Very rarely; 4096 works well on pretty much all workloads I've seen. Going beyond that, you've already amortized away the cost of dynamic dispatch. Yeah, I haven't needed to tune it in practice; 4096 is a good value. Got it, okay.

And then my only other question is more of a philosophical one: there are basically two ways to speed up analytical queries. There's the approach you guys are taking, where you just index everything and anything, and then there's the other approach with materialized views, where you try to pre-join things ahead of time. And so maybe the Rockset customers are self-selecting in that they're doing mostly queries that don't do joins, but indexing everything may not actually help. I mean, it helps with joins because you don't have to build the hash table. Right, but I guess what I'm getting at is, under what conditions would the index-everything approach not work as well as materialized views that could potentially pre-join things? Yeah, so indexes only help for joins if the join is very selective, basically if you're only going to need to retrieve a very small fraction of the data on one side of the join. And by small, I mean less than 1%, less than a tenth of a percent. Because as soon as you're using indexes, you're doing random access, and random access, either in memory or on disk, is gonna be quite slow. Even really fast SSDs today, like what you can get on NVMe SSD EC2 instances, you're talking like 12,000 IOPS per core, somewhere in that domain, which is just nowhere near as many tuples as you can scan sequentially. So yeah, if you're doing a join where you're only selecting one in a million rows from one side, using an index join, which we support, will be really, really fast. If you're getting 10%, you're probably gonna want a hash join, which we also support. Does that answer the question?

And I'm thinking about the numbers. You're saying, anything less than 1%, an index join works; anything greater than that, you want to use a hash join? Yeah, the threshold is gonna be somewhere between a percent and a tenth of a percent. Indexing only wins when you're very narrowly selective. Yeah, and there's this trade-off between random access versus all of that stuff, because the hash join is essentially building its own index, right? Building an ephemeral data structure to do the join. But you're saying that if the join predicate is based on the index, then you're golden, right? Well, the index access will be sequential, but looking up the other fields of the document from the row store will be a random access. Again, you still have to do that for the hash join. Anyway, I guess you put the whole tuple in the hash table and you get that. Okay.
Yeah, it's an in-memory random access. Got it, got it, okay, okay. And even then, I think you can pipeline that shuffle operation in ways, in memory, that would be harder to do with an index join. Do you guys support distributed execution, like one query touching multiple compute nodes? Oh yeah, absolutely. Every query can fan out: joins are distributed, aggregations are distributed, scans and indexed operations are distributed. Okay, so if it is the case, if you guys believe, that anything less than 1% selectivity on a join is better served by a hash join, wouldn't that argue that potentially materialized views would be better, right, and the need to support that? And I guess that leads back to the original premise of my question: are people choosing Rockset because they don't have heavy joins in their workloads, so they're going and grabbing individual documents? Is that sort of the real-time analytics workload you guys are trying to support? I definitely do see customers using Rockset with very join-heavy workloads; we perform well. I'm trying to think: we don't have materialized views of joins, we have rollups, which are materialized aggregations, and sometimes those can yield similar advantages. Got it, okay, that makes sense. You guys are also running, to your point, with everything hot in the cache, sitting on attached storage, so again, you're not pulling from S3 for these things. All right, cool. Sorry, audience, any other questions?