Thanks for coming to another quarantine database talk today. We have Dhruba Borthakur from Rockset. Dhruba is the CTO and co-founder of Rockset and somebody I've known for a while now; we invited him over the summer to give a talk, just for fun. Prior to starting Rockset he was on the database engineering team at Facebook, where he helped work on RocksDB. But I guess his claim to fame, or what he would say when he tried to meet people at bars before COVID, is that he was a founding engineer of HDFS. Whether or not you think that's a good thing, you can take that up with him. Okay, so with that, we're super happy to have Dhruba come give a talk with us again. The way we'll do this is that if you have any questions, unmute yourself and interrupt at any time, but be sure to say who you are and where you're coming from, and then ask your question. Okay, Dhruba, the floor is yours; go for it.

So, yeah, thanks for inviting me to present here. I'm going to talk about real-time indexing and fast queries on large data sets. Just to give a short introduction about myself: like Andy said, I worked a lot on the Hadoop file system in the very early days at Yahoo, and I'm also the founding engineer of RocksDB; I was the first engineer on the RocksDB project, building the storage engine at Facebook. Right now I'm at Rockset, building a distributed system for data processing, which is what I'm going to discuss with you folks today. Actually, I was in Pittsburgh a long, long time back, because I was working on the Andrew File System at Transarc. This was in the 1990s; it was a spin-off from CMU and I was there for three years. It was good fun building a lot of AFS code, which is open source now. One of the co-founders of Transarc, his son is watching today, right now. Okay, cool.

So, an overview of the talk. Again, please feel free to interrupt me whenever you want and I can help answer your questions. I'll talk a little bit about where Rockset plays and Rockset's strengths, then talk about the Rockset architecture, and then touch upon three important things that are the unique parts of that architecture: how we do schema discovery in SQL; converged indexing, which is what powers our backend; and cloud scaling architectures, and what Rockset does to make this very cloud-native and cloud-friendly.

The first point, or something people ask me, is: what is unique about Rockset? What is it trying to do differently, and what are the use cases it is built for? I've seen the Hadoop ecosystem grow from day one, where there was a lot of batch processing and mostly systems optimized for efficiency. Again, like Andy says, not code that is super rock solid, but it was the first system where you were able to store petabytes of data, which is why Hadoop became so popular; it's not because it was one of the best pieces of software we have written. Same thing, essentially, with Spark, which is processing optimized for throughput; even Kafka, I think, I'd put in this same bucket.
Rockset, on the other hand, is mostly focused on analytical applications, which basically means it is optimized for three things at the same time: you can store large data sets; you can query them and expect your query latencies to be milliseconds; and you get low data latency, which is basically, from the moment your data is produced, how quickly can we make queries on that data? Is it a few seconds, or is it many minutes or hours? So it's optimized from three different angles, and I'm going to touch on how the architecture is built to solve these three cases at the same time. Any questions so far?

Okay. Traditional analytics has mostly been about warehouses, very standard reporting, and things where you already know what you're looking for. Here we're trying to do analytics on the fly. Rockset is not a key-value store; it's an SQL system, which means it does aggregations, joins, sorts, and order-bys, and we also want the queries to be fast, which means queries can't take many minutes or tens of minutes of latency. So we try to spin up more hardware and reduce the latency of your queries. The third thing is that it's a high-QPS system, which means it could be your user-facing analytical database, for example, where thousands of queries are coming in at the same time and being served. This is another difference from traditional workloads: with a warehouse or something similar, you might only run a few concurrent queries per second. So that's the positioning. I wanted to share this because it plays a lot into how the technical pieces are built from these requirements.

Low data latency is the fourth one: in many systems, data needs to go through a lot of ETL processes, joining and cleaning, before it actually gets loaded into a queryable system. With Rockset we try to make it so that you can avoid a lot of those pipelines and ETL processes before you can query the data.

The focus, again, is analytical applications, which is very different from a warehouse reporting kind of workload. Analytical applications are things like this: say you have a fleet management company and your trucks are coming into the loading zone; somebody needs to look at a lot of data to figure out what needs to be loaded onto each truck. Or say you're an online gaming system, lots of players are playing games, and you want to show leaderboards to those gamers. Leaderboards are complex analytical SQL queries that need to look at your most recent data. Those are two examples from completely different industries. So, again, Rockset is essentially a real-time indexing database on massive data sets, used for building real-time applications on live data, which is what I mean by not-stale data: data that's just coming in now, while avoiding ETL pipelines. That's the one-line summary of what the Rockset database is designed for.

Now let us dive into how the design solves the use case I mentioned. It's an analytical database, which means there are no transactions; let me be very clear about that. Rockset doesn't do any transactions.
You cannot do an ACID transaction or a read-modify-write. What it can do is take data coming in from streams, data lakes, or databases (your transactional database, MongoDB, or whatever else is your system of record): you want to get data from all these places and build your analytics engine. So what kind of database would you use to build an analytics engine that can take data from all these places and then serve your queries?

The top-level architecture is what we call the Aggregator-Leaf-Tailer architecture. The tailers are the processes that tail data from these data streams and deposit it onto a set of leaf nodes. The leaf nodes actually own this data, crunch it, and make it ready so that they can serve queries. Then there's a two-level aggregator tier, which serves all the SQL queries coming from applications, and sometimes live dashboards, but mostly applications in our case.

So what is unique about this? The reason it's very different from, say, a traditional lambda architecture or a Kappa architecture, which are quite popular nowadays, is that it follows the CQRS pattern: the writes are separated from the reads. Because we want data latency to be low, we have to handle bursty traffic, which means data is coming in at varying write rates, and we cannot allow that to impact query latencies, because it's a user-facing database and you are expecting, let's say, every query to finish in 500 milliseconds. That's why we follow the CQRS pattern: the left side of this vertical line is where the writes happen, and on the right side all the queries happen.

This architecture actually takes its inspiration from the Facebook News Feed. If you use the Facebook News Feed, that is an analytical app, not a transactional app; it needs to look at a lot of data, do relevance ranking, and then finally show you the feed. It uses an aggregator-leaf-tailer architecture for the same reason: query latencies have to be kept separate from data latencies, and you want to optimize both.

The tailers translate data into our internal format, so if there is a high volume of writes, there are more tailers in the system. If the amount of data you need to store grows, you need more leaf nodes. If there are more queries, you need more aggregators. It's a completely disaggregated architecture, which is why we can run it efficiently on the cloud, where each of these three tiers can scale up and down based on usage. It's not a tightly coupled system; it's a very loosely coupled system, and we can scale each of these tiers independently of one another. Any questions so far on the high level? I'll go deeper into this, but any questions on this picture?

I mean, this is essentially what Scuba does at Facebook, and other systems there do the same thing; this is pretty common, correct?

Yes, exactly.
So Scuba does this, the Facebook News Feed does this, and spam detection systems do it, because it is very important to detect spam as soon as it is produced; you can't have a data latency of more than, say, five seconds. Plenty of systems, ad placement systems for example, use this. It's quite popular for a lot of web-scale companies; I believe LinkedIn also uses the same architecture for the LinkedIn feed. So that's the high-level architecture.

Now, the benefits. We want queries to be fast, and more than fast, we want the latency to be consistent. These queries are also complex in nature; this is not a key-value store. My claim is that key-value stores became very popular in the last ten or twelve years (and I wrote a lot of HBase code and worked on other systems as well) because people actually wanted a very fast, consistent query system. If you run SQL systems on those workloads you get widely varying latencies, just because the complexity of the language is so high. For us, we want to run genuinely complex queries on the fly and give them consistent latencies. We also want it to be cost-effective, so that people can actually run it on large-scale data systems. And, like I said, we want to separate the reads and the writes so that bursty traffic does not impact your queries. I've seen very similar systems from up close at Facebook, but in open source or commercial systems I haven't seen many where you can separate writes from queries on a single database. It's not storage-versus-compute separation; it's write-compute versus read-compute separation, which is what these kinds of real-time databases need, in my mind.

So how do we do this? The key design principles I'm going to talk about are, first, something called converged indexing, and I'll explain what that is; then smart schemas, which is basically how the SQL engine works for us; and then the key architectural reasons why this implementation scales to high data rates.

So what is converged indexing? Rockset is a NoSQL database in the sense that you can dump JSON, CSV, XML, or any semi-structured data into it, and Rockset builds indexes on each of the fields in your data. No setup needed, no configuration needed; by default we have made it possible to index everything in your data. What does the indexing mean? It means we build a row-based index, just like, say, Postgres or MongoDB or a traditional relational table: given the primary key, I can find all the fields inside a document and their values. I also build a column store, just like a warehouse, which helps with aggregate queries. And then I also build an inverted index, just like Elasticsearch; inverted indexes are really powerful for needle-in-a-haystack queries, queries which are very highly selective. All of these index types have existed for a long time.
It's not like Rockset invented any of this; what Rockset is trying to do is make it really cheap to build all these indexes in a single system, on all the fields in your data, without having to configure anything. So there is no need to configure and maintain indexes, and no slow queries because of missing indexes. I'm going to tell you why this is possible now versus why nobody had done it before.

Maybe you'll get to this later, but an update comes in through the tailer and you want to store it. You store it once, but then you also index it three times. Do you maintain any consistency across these indexes? Can a record show up in the row index before the inverted one, since the inverted-index updates are expensive?

Great question. We don't have ACID transactions, but we do have atomic updates, which means that when you update a document, all the indexes are updated together. Either you will see the row-based index, the column store, and the inverted index updated at the same time, or you won't see any of them. So it's an atomic write, and I'll explain how we do that; but yes, all the indexes show up at the same time.

So how does converged indexing leverage the ALT architecture I mentioned earlier? Scaling up tailers is a relatively easy task because they are stateless, and scaling up compute is relatively easy compared to scaling up stateful systems, as all of us know. It's tough to build a database that is very cloud-friendly and responds automatically to load. What happens is that the tailers are the ones actually extracting all the fields inside your semi-structured data, the indexes are stored on the leaves, and the aggregators know which indexes to use to make your queries fast. Let me give you some examples.

Take, for example, two documents coming in, doc 0 and doc 1, each with only one field. Rockset doesn't store data just in a row-based format or a column-based format; it actually stores it in three formats. Basically it shreds the document into key-value pairs and uses open-source RocksDB to store them. If you look at the right side of the slide, that is the data representation. The keys that start with "R" are the row keys: given a document id and a field, you can find all its values, scan through them, and quickly recreate the document you stored. The "C" keys are the column-store keys: all the data for a particular column is stored together, so you can do vectorization and other things while scanning through all the values of a column. Again, this is standard database technology, nothing new invented there, but the interesting part is that we build the column store on top of a key-value store, and I'll show you how we leverage it when a query comes in. Similarly, all the keys that start with "S" are the inverted index, like what Elasticsearch does. In this example, for the key S.name.dhruba, everything is in the key and nothing is in the value, because it's an inverted index. So now, take for example somebody looking to find all records where name equals "dhruba".
I'm going to search for one single key in the database, S.name.dhruba, find the doc id, and then I know which document it is. This is how Elasticsearch makes its queries fast, right? If you're running queries that are very highly selective, maybe one or two lookups into your database can give you the one or two records you're looking for.

Now let me do a slightly more complex example. Here there are two documents with more than one field. The first document has an array: "interests" is an array. It's JSON, and we support arrays and nested objects, because our sweet spot is people with highly nested documents, and we want to index each and every field of the array and of the nested object. There is a storage cost associated with this, and I'll explain why we can still do it economically for our users, but the point of this picture is that arrays are first-class citizens here, and similarly nested objects are first-class citizens. We still build the columnar index, the inverted index, and the row index for all the fields of all the docs in our system, by default.

Let me also show you the columnar index here. The first document has name equal to "igor" and the second document has name equal to "dhruba". If you look at the columnar index, which is the first table on the right side, you can scan through all the names stored inside the "name" column just by iterating next, next, next over that key space. So it's very easy for us to do aggregations: max, min, standard deviation, or whatever variance you want to calculate across all the values of a single column.

Now I'll give an example of two different queries. The query on the left is looking for a keyword inside some logs, with some filter clauses and a WHERE clause. Based on our statistics we know that this is a highly selective query: very few records will match these constants. Users will sometimes have, say, twenty filters in complex queries they run in production, and using all those filters we can very quickly use the inverted index and give you the results. On the right side are examples with GROUP BY and ORDER BY, where you want count, average, min, max; most of the time we default to the columnar store and scan, and we try to make it as fast as if you were running it on, say, Redshift or some other warehouse. We'd be at least as fast as those, if not better.

Hold up, let me make sure I understand this. Going back to your example: you decomposed the documents into the three index types, and the indexes are just stored in the same, well, I guess RocksDB doesn't have tables, right? It's the same table space; you're intermixing the inverted stuff, the column store, and the row index all within a single table space of RocksDB?

That's a great question. Yes.
Let me go back here. Right, so all the keys which start with "R" actually represent the rows, all the keys that start with "C" represent the column store, and all the keys that start with "S" are the inverted indexes. RocksDB has column families, but we don't use them, because it gives us better performance if we do this ourselves in one key space.

So literally, in order to do a lookup on the row index, I've got to find all the keys that start with the letter R?

Exactly. Not all the keys, just the one you're looking for, but yes.

So you prepend the R automatically; it doesn't show up in your SQL query, of course, I'm talking about the internal implementation. This seems like, is RocksDB really the right thing to use for this? It seems like you're leaving a lot of efficiency on the table, because RocksDB supports block-level compression, but you can't get the column-store compression benefits, and you're certainly storing the letter R, which has to be at least a byte, over and over again.

That's a great question, because in RocksDB there is already something called delta encoding of keys, so in practice the overhead is something like one byte per 4 KB block; it's very small in real life. RocksDB also has things like column families, but we don't use them because column families have some other limitations in general, so we put everything inside one table space. It's a general-purpose key-value store, yes, I agree. I'll show you how we reduce the overhead of some of the things you mentioned; the R, C, or S prefix is not the real overhead, but there are other overheads we have to manage.

I guess it's not so much the overheads. To get the MVP up and running, RocksDB is an excellent choice; it's the go-to storage engine, and lots of people use it. I feel it's not so much the overhead of contorting RocksDB to make it do what you want; it's that if you built an engine specifically designed for a column-store index or an inverted index, you certainly could do much better. But I understand that's more engineering work.

Yes, and I think what has happened is that if you design a column-store database, you are essentially optimizing for the size of the index, because at the end of the day, if you're using a column store you're doing a lot of aggregations, which is why your column store is good. So most databases optimize for how to reduce the size of each column on disk and in memory, so that CPU caches don't miss and you can do fast vectorization; everything falls into place if you can reduce the size. Great for a column store. But what do you do for an inverted index? Does Elasticsearch or Lucene do that? No, not in the same way; they still do encoding, but it's a different kind of encoding used for the inverted-index part of the database. You see what I'm saying: it's not just column-store encoding for everything. You use different kinds of encoding for different pieces of this puzzle, which is what I'm going to explain in a little more detail now.
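A minimal sketch of what that single shared key space can look like, using the open-source RocksDB C++ API, may help here. The R/C/S prefixes, the dot-separated key layout, and the field names are illustrative assumptions, not Rockset's actual internal encoding:

```cpp
// Illustrative sketch only: three index entries per field, all in one RocksDB key space.
// Invented key layout: "R" = row store, "C" = column store, "S" = inverted index.
#include <iostream>
#include <memory>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* raw = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::Status st = rocksdb::DB::Open(opts, "/tmp/converged_demo", &raw);
  if (!st.ok()) { std::cerr << st.ToString() << "\n"; return 1; }
  std::unique_ptr<rocksdb::DB> db(raw);

  // Two documents, one field each: doc 0 {"name":"igor"}, doc 1 {"name":"dhruba"}.
  db->Put(rocksdb::WriteOptions(), "R.0.name", "igor");      // row store: (doc, field) -> value
  db->Put(rocksdb::WriteOptions(), "R.1.name", "dhruba");
  db->Put(rocksdb::WriteOptions(), "C.name.0", "igor");      // column store: (field, doc) -> value
  db->Put(rocksdb::WriteOptions(), "C.name.1", "dhruba");
  db->Put(rocksdb::WriteOptions(), "S.name.igor.0", "");     // inverted index: (field, value) -> doc
  db->Put(rocksdb::WriteOptions(), "S.name.dhruba.1", "");

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));

  // Highly selective predicate (WHERE name = 'dhruba'): one seek into the inverted index.
  for (it->Seek("S.name.dhruba."); it->Valid() && it->key().starts_with("S.name.dhruba.");
       it->Next()) {
    std::cout << "hit: " << it->key().ToString() << "\n";    // the doc id is the key suffix
  }

  // Aggregation (e.g., COUNT(name)): sequential scan over the column-store prefix.
  int count = 0;
  for (it->Seek("C.name."); it->Valid() && it->key().starts_with("C.name."); it->Next()) ++count;
  std::cout << "values in column 'name': " << count << "\n";
  return 0;
}
```

The selective filter becomes one seek on the inverted-index prefix, while the aggregation becomes a sequential scan over the column-store prefix; that is the distinction the two query examples above rely on.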
So it's not just column store everything It's different kinds of encoding you use for different pieces of this puzzle Which is what I'm going to explain a little bit in more detail now Okay, so what? So if I go back to this example, yeah So what are the challenges with this conversion actually one of them is obviously the disk size, right? The second one Essentially is that when a write happens We don't want we want consistency of writes, which means that we want atomic writes Which means that if we write one Document we make sure that all the indexes are updated atomically and how do you do it in a distributed fashion because Second indexes might be on a different machine in traditional systems, right? If you update a record on one machine, but their secondary index Field resides on a different machine now you might need to do a little bit more consensus between those so that they become atomically visible Similarly, if you're doing one field and that if you're updating one record and that record has say 500 fields Now if you want to build 500 indices, your write rate might be very high So you have to build a system so that you can support that kind of index Is my thing making sense Basically indexing is a problem when you have say a thousand fields in a in a in a record And you say hey, I'm going to create an index on every thousand field My databases have this command called create index Rockset has nothing rocks that doesn't have a create index It says I can make this command obsolete and databases don't need this create index field anymore or come on anymore How do you do it? So in traditional databases again Shorting, let's say this is a document coming in Again, I took a simple example in this case But if we want to build all the different indices together Let's let's say a columnar index an inverted index and a record store index and they reside on Three different systems or three different machines. Then I need to do a kind of a More like a consensus protocol so that I can make sure that these three updates on three different machines Are visible at the same time What is what rock set does is that it doesn't use term sharding So rocks that uses something called doc sharding, which means that the entire doc Is stored on a machine. So all the indices for the doc are actually stored on the machine So this is very similar to how search systems work Search systems like if you look at google search or like even facebook search that we build They're all built where entire documents and the indices reside on one machine And What do we do is that when when writes come in they actually get Piked through a distributed log and the distributed log is kind of Sharded among all the machines based on some keys or basically let's say the doc idea of the of the document And all the indices of the document are local to one machine So the secondary indices don't don't need pack source or raft or anything else It's all local to one machine. 
So if we have to build a thousand indexes for one document that's coming in, all thousand index entries will be on one machine. On the other hand, the disadvantage is that when a query comes in, it needs to fan out to all the machines in your database and gather results. This is basically the difference from the traditional systems I have built before (HBase is one system I'm pretty hands-on with, and it's similar for Cassandra and others), where most databases, in my mind, are optimized for throughput and efficiency, not for latency, while search systems, like Google search or Facebook search, are optimized for latency. Here the focus is: how can I use a search architecture to build a database that is focused on reducing latency? Which is why, when a query comes in, it needs to fan out; let's say there are 100 machines in the cluster, it fans out to all 100 machines.

There are advantages and disadvantages. One disadvantage is that if one machine is slow, you can have trouble keeping your tail latencies good, because you have to hit all 100 machines in the cluster. The advantage is that for complex SQL queries you get to use the CPU of all 100 machines to run that query in parallel on small slices of your data. This is why the latency is really small with doc sharding compared to traditional database sharding. Another open-source system that does doc sharding is Elasticsearch, which is why its query latency is very low. But none of the traditional open-source databases do doc sharding at all; they all do term sharding and optimize for throughput and efficiency, not latency. So this is one thing Rockset does: everything is doc sharded.

Another simple way to explain doc sharding: if you have two different fields in your record, a term-partitioned system would probably put them in different places so that it can run efficient queries on the fields you're interested in, whereas in Rockset a document goes, at write time, to one machine along with all of its indexes, and queries then need to hit all the machines in your cluster to gather results. So those are the defining architectural differences between Rockset and many other databases. Any questions so far?

Okay, the second challenge is: how can I build indexes on a record that has a thousand fields? If you're using a B-tree-based system, that's quite complicated, and this is probably known to all of us here in the database community: we don't want to update a thousand B-tree leaf pages when one update happens, otherwise there would be a thousand writes to the storage system. What Rockset does, by using the RocksDB LSM (which is what we get for free), is that when data comes in, we write it to the memory buffer of the log-structured merge tree, and even though the record has a thousand fields, there aren't a thousand writes to the storage system; there is one single write, because RocksDB does a good job of turning all those random writes into sequential writes on storage.
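To make that single-write point concrete, here is a hedged sketch (again with invented key formats, not Rockset's real ones) of how all the row, column, and inverted-index entries for one document can go into a single RocksDB WriteBatch: the batch is applied atomically and lands in the write-ahead log and memtable as one sequential write, no matter how many fields the document has.

```cpp
// Sketch: one document with many fields becomes one atomic RocksDB write, not N random writes.
// Key formats are invented for illustration; error handling is omitted for brevity.
#include <map>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <string>

// Hypothetical helper: emit all converged-index entries for one flat document.
void IndexDocument(rocksdb::DB* db, const std::string& doc_id,
                   const std::map<std::string, std::string>& fields) {
  rocksdb::WriteBatch batch;
  for (const auto& [field, value] : fields) {
    batch.Put("R." + doc_id + "." + field, value);              // row store
    batch.Put("C." + field + "." + doc_id, value);              // column store
    batch.Put("S." + field + "." + value + "." + doc_id, "");   // inverted index
  }
  // A single Write(): the whole batch goes into the WAL and memtable together, so either
  // every index sees the document or none of them do, and the I/O stays sequential.
  db->Write(rocksdb::WriteOptions(), &batch);
}

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB::Open(opts, "/tmp/batch_demo", &db);
  IndexDocument(db, "42", {{"name", "dhruba"}, {"city", "san mateo"}, {"interests", "databases"}});
  delete db;
  return 0;
}
```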
When the memory buffer gets full, it gets written to an SST file on storage, and then there is background compaction. So just by leveraging the RocksDB LSM, we can handle sparse fields, and we can handle thousands of fields in a record and continue to index them without the random-write problem that other systems might have. Any questions so far on this? Okay.

So that is converged indexing, and that's how we maintain the storage system. Now the question is how to make SQL queries on it, because what Rockset has is JSON data on one side, SQL queries on the other side, and there's nothing in the middle that a user has to do. What happens is that we automatically generate a schema based on the exact fields that are present at the time of ingestion. I'll show you a pictorial description of what that means, but I'll give you a second to read this line. What it means is that it's not schema-free or schemaless; rather, the schema is deduced when you make the query. We look at all the data in your system at that moment and say: this is what your schema is, your table has 500 columns, and these are their types.

How do we do this? Semi-structured, NoSQL data is coming in. Let's say there are these two records coming in at the top: two JSON documents, one has age 31, the other has age as a string or some other type, because in the NoSQL world this is very common. Some sources produce data with no fixed schema; things vary, and people make mistakes in generating this data. So when the data comes in, we index all these records, all the fields in these records, and the type of each field: we index the type too. What makes Rockset different from other systems is that typically, in most relational databases, the type is associated with the column: say the column name is "age" and the type is an integer, and that is how you create a SQL schema. For us, the type is associated with every value inside the column, not with the entire column, and we store that efficiently. In this example you can see that 50 percent of the ages are strings and 50 percent are integers, and it will tell you this when you do a DESCRIBE table, or whatever your equivalent is, so you actually know what the schema of your database is at this time. Am I making sense? So it's not a schema-free or schemaless system; it is a schema system. If you're not doing any writes, this is the schema you are going to see, because your data is fixed; but if you add more data over time, your schema might change, because you'd have more fields in your table. So this is what I mean.

So what do I get when I do SELECT * in this example? Because the first document has "city" and the second one doesn't. Should I get back "city", and then it's null for document two?

Good question. So our result set is obviously a set of JSON documents that you get back, right?
But if you're looking at a particular document and you have a WHERE clause, let's say SELECT * WHERE age > 30, then our system automatically knows we're looking at integers, because you're doing a typed comparison with SQL types; that is specified by the SQL semantics. So what will happen is that it will ignore records where there is a type mismatch. If you also want to query on the type, you can use the TYPEOF function in SQL, and then write a WHERE clause saying: find all records where the type of age is string, and so on.

So basically my question is, for the projection list, SELECT *: do I get back "city", but for the second document "city" is set to null?

Yeah, exactly. SQL NULL is what you'll get if you're doing a SQL query. If you're using the JSON API, you'll get a JSON null, because we'll assume it's simply not there.

What percentage of your current customers use JSON versus SQL?

Most people use SQL right now, but we also have a feature called query lambdas, which basically lets people put a SQL statement behind a REST API, because a lot of their developers don't know SQL. The SQL developer creates a query lambda and says, okay, this is how I expose it to my developers, and those folks just call the endpoint. That's query lambdas, and it's quite a popular feature for Rockset; I can walk through an example. But yes, there's always this complexity where it's very easy to confuse SQL NULL with JSON null, because they're completely different, so we have a good description of who should use what.

I have a small question, if you don't mind. My name is Lin, I'm a PhD student here working on databases. For this table, this summarization you are showing in the bottom-right corner: is this an actual table you are storing? Are you materializing this thing?

This is materialized on demand. What we do is make a query to all the leaf nodes; let's say there are 100 nodes in your cluster, it will make a query to all the leaf nodes and assemble the schema, just like any other query.

Okay, so this is actually a materialized view you are maintaining?

It's not a materialized view, but it is a view. Let me put it this way: there are counters, type counters, that are maintained on every leaf node. When you do a DESCRIBE table, it makes a query to all the leaf nodes, gathers the counters from those nodes, and then shows you the sorted summary of them.

I see, makes sense. Got it, thank you.

Okay, so that's there. This is what I meant by saying we do schema binding at query time. The SQL API you get is ANSI SQL; we have joins, aggregations, sorts, and everything else, and we also have all the SQL types. For example, for datetimes we support the eight different SQL date/time varieties that are out there, most of the SQL language, and you can make pure SQL queries on those. That sounds terrible, I know; I wish the SQL language were a little bit different, but that's what we have now. And you'd be surprised, people actually use all these variants. So that's one part.
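Before moving on to the challenges, a toy sketch may make the "type counters on every leaf node" answer above concrete: each leaf counts (field, type) pairs as it ingests values, and DESCRIBE just merges those counters across leaves. The names and structure below are invented for illustration, not Rockset's actual implementation.

```cpp
// Sketch: each leaf keeps counters of (field, type) pairs seen at ingest time;
// DESCRIBE merges the counters from every leaf and reports the observed mix of types.
#include <iostream>
#include <map>
#include <string>
#include <vector>

// field name -> type name -> number of values seen with that type
using TypeCounts = std::map<std::string, std::map<std::string, long>>;

// Hypothetical per-leaf hook, called while indexing each value.
void Observe(TypeCounts& leaf, const std::string& field, const std::string& type) {
  ++leaf[field][type];
}

// DESCRIBE: fan out to all leaves, sum their counters, print the deduced schema.
void Describe(const std::vector<TypeCounts>& leaves) {
  TypeCounts merged;
  for (const auto& leaf : leaves)
    for (const auto& [field, types] : leaf)
      for (const auto& [type, n] : types) merged[field][type] += n;
  for (const auto& [field, types] : merged) {
    std::cout << field << ":";
    for (const auto& [type, n] : types) std::cout << " " << type << "(" << n << ")";
    std::cout << "\n";
  }
}

int main() {
  std::vector<TypeCounts> leaves(2);
  Observe(leaves[0], "age", "int");      // {"age": 31}
  Observe(leaves[1], "age", "string");   // {"age": "31"}  mixed types are allowed
  Observe(leaves[1], "city", "string");  // a field that only some documents have
  Describe(leaves);                      // age: int(1) string(1)   city: string(1)
}
```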
The second part is the challenges with smart schemas. People usually complain, at least database people do, and they say: this is going to eat a lot of CPU, because you have to do lots of indexing, and it will also take a lot more disk space to store types, because you associate a type with every value inside the column and not just with the column. These are the standard questions I get, which is why I put them in here.

For the first one, look at this example. The top row is a relational database: the schema is stored separately from the data. Let's say "city" is a string; that fact is recorded in one place, and all your data lives in tightly formatted columns or files. The second row is JSON, where the type comes with every field: if you have an int, it literally says age: 31, and by looking at the JSON you know it is an int, which means there is effectively a schema attached to every value in your JSON. The purple line below each of these shows the amount of storage actually used. For the relational case, the purple line is very small; if you store raw JSON, your purple line is very big, because you're basically storing a schema for every field of every document. What Rockset does is simple things like field interning: if your data has consistent types, it stores the type somewhere else, just like a relational database, so the purple line ends up somewhere in between a tightly packed relational table and a very loose JSON storage system where the schema travels with every field. All I'm trying to say is that if you store data of consistent types, the overhead comes close to that of relational tables. Only when you store lots of mixed types, say the same field has 50 percent integers and 50 percent strings, or 25 percent strings, 25 percent integers, and 25 percent objects, all intertwined, does the overhead become somewhat higher for Rockset than for a traditional database. But that's the cost of doing business with JSON and semi-structured data. That's the storage side.

Now, as far as CPU is concerned: on the left I gave some relative numbers. With a strict schema, a query uses some baseline of CPU; running the same query on completely raw JSON needs probably double that, at least in some of our measurements. With our smart schemas we use slightly more than traditional relational tables, where the schema is factored out entirely, but we come close. What we do, shown on the right side, is type hoisting. Relational tables mostly store similar-size values in a fixed order so they don't have to store the type at all, whereas in a schemaless encoding the types and values are interleaved. In Rockset, if all the values in a stretch are of the same type, we hoist the type to the front, so there isn't much overhead when most of the values share a type. All I'm trying to say is that efficient engineering can reduce the cost of these semi-structured data formats, especially when most of the data is of the same type anyway and only, say, one percent of the data is something else.
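A rough sketch of that type-hoisting idea: when every value in a block of a column shares one type, the tag is written once in a small header instead of once per value. The encoding below is made up purely to show where the savings come from.

```cpp
// Sketch: per-value type tags versus one hoisted tag when a block of a column is homogeneous.
#include <iostream>
#include <string>
#include <vector>

struct Value {
  char type_tag;       // e.g. 'i' = int, 's' = string
  std::string bytes;   // payload encoding is out of scope here
};

std::string EncodeBlock(const std::vector<Value>& block) {
  bool homogeneous = true;
  for (const auto& v : block) homogeneous = homogeneous && (v.type_tag == block.front().type_tag);
  std::string out;
  if (homogeneous) {
    out += 'H';                        // header: the tag is stored once for the whole block
    out += block.front().type_tag;
    for (const auto& v : block) out += v.bytes;
  } else {
    out += 'M';                        // mixed block: the tag travels with every value
    for (const auto& v : block) { out += v.type_tag; out += v.bytes; }
  }
  return out;
}

int main() {
  std::vector<Value> all_ints(1000, {'i', "31"});           // the common, homogeneous case
  std::vector<Value> mixed = {{'i', "31"}, {'s', "31"}};     // the rarer, intertwined case
  std::cout << "homogeneous block: " << EncodeBlock(all_ints).size() << " bytes\n";
  std::cout << "mixed block:       " << EncodeBlock(mixed).size() << " bytes\n";
}
```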
The last thing I wanted to cover is a little bit about cloud economics and how we efficiently leverage cloud storage. The key observation is that on the cloud, renting one CPU for 100 minutes costs the same as renting 100 CPUs for one minute. This is one of the reasons everything Rockset does, all the things I explained, is usable right now: with the cloud we don't provision for peak capacity, we provision for current demand, and we have a disaggregated architecture. All these reasons are why indexing everything is feasible now versus ten years back; ten years back, to do this kind of indexing you'd have had to buy a lot of hardware to provision for peak capacity. I'll tell you how we scale up and scale down as the data changes.

Our vision is that if your query is slow, it's because the software is not good enough: if I can spin up 100 CPUs, I would rather do that and make your query complete immediately, rather than completing it in 100 minutes, because the cost of the hardware is the same. The challenge is all in the software, and that makes us really excited, because our team is good at software rather than at building new hardware platforms and setting up racks.

So how does Rockset actually leverage the cloud architecture? One thing I mentioned earlier is the CQRS pattern: when data comes in, the tailers are the ones extracting the data and turning it into serialized feeds. If there is a large volume of data, we scale up the tailers, which is pure CPU and therefore easy to scale; no provisioning needed, it's driven by autoscaling policies built on AWS autoscaling and Kubernetes autoscaling. Similarly for the leaf nodes: when there is more data to be stored, we scale up the leaf nodes quickly so they can replicate and copy data, which is a little more challenging, and I'm going to explain how we do that. And then the aggregators: again, you don't provision for peak capacity; when queries come in, we spin up more aggregators whenever more CPU is needed to hold your query latency.

I'm going to focus on scaling the leaves, because that's the stateful part, and it's usually harder than scaling the stateless parts up and down. What do we do? We use S3, obviously. Our claim is that shared storage is back in action, and shared storage now means cloud storage. By the way, David DeWitt was my professor at Wisconsin, so I used to have very exciting discussions with him about this at the time, but this is much newer; he has also claimed that the end of shared-nothing is here.

So how does Rockset leverage the cloud? Here is the same architecture picture in a slightly different form, just to show where things are. The green boxes are the leaves, which is where the data is, and they use something called RocksDB-Cloud, which is open source. RocksDB-Cloud is basically a layer on top of RocksDB.
Every time new SST files get produced, RocksDB-Cloud pushes those SST files to cloud storage; that's essentially the extension RocksDB-Cloud adds on top of RocksDB, and it's another piece that Rockset has open-sourced. So when data comes in, it comes in from the tailers, gets distributed to the leaves, gets indexed, gets compacted by RocksDB, and then gets pushed to S3, or the equivalent, GCS.

What we have done is separate durability from performance. Durability comes from S3: we never have data loss, or rather our data-loss probability is as low as whatever S3 provides, which is some ungodly number of nines. So durability is essentially guaranteed, because we store the data in S3. Then how do you get performance? We use something called zero-copy clones in RocksDB-Cloud. Let's say there are three replicas serving the same data, the load is getting high, and we need to create new replicas. We don't do peer-to-peer replication; we create a new RocksDB-Cloud instance, which has a feature called zero-copy clone. It takes the SST files from an existing leaf shard, starts serving from them, starts catching up on the new data and updates the tailers are generating, and then becomes part of query processing, so queries from the aggregators start flowing to the new leaf process. All I'm trying to say is that in this shared-storage architecture there is no peer-to-peer replication or copying: we use cloud storage to do the replication, and we have separated durability from performance. That's another reason we can be far more efficient than, say, Elasticsearch or Cassandra deployments where you might run three replicas just for durability even though nobody is querying the data; if nobody is querying the data, you can afford to run with one replica, and a replica is essentially the SSD, RAM, and CPU, which is where the cost of your database usually is. That's one part, and then we also do something called remote compaction, which I'll come back to.

Is RocksDB-Cloud a hacked-up version of RocksDB, or is it a standalone library that speaks to RocksDB?

RocksDB-Cloud is also a library. RocksDB already has pluggable ways to extend everything; that's the whole point of RocksDB. RocksDB has something called an Env, an environment, and RocksDB-Cloud provides a cloud Env for running RocksDB, which lets you do all of this automatically. The API is exactly like RocksDB, so your application doesn't have to change; the only added feature is that even if your machine dies, you can get all the data back, because the data is in S3.

And for the RocksDB that Rockset runs: are you maintaining your own fork, or is it RocksDB off the shelf?

We actually run RocksDB-Cloud, which is the source code we have open-sourced, and it is in lockstep with RocksDB, because we work closely with the RocksDB team. RocksDB has pluggable APIs, and RocksDB-Cloud implements those APIs to run well on S3, GCS, or any other cloud storage system. Did I answer your question?
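The zero-copy clone mechanism can be pictured with a small conceptual sketch. None of this is the real RocksDB-Cloud API; the types and names are invented, and it only illustrates the idea that replicas share immutable SST objects in cloud storage rather than copying bytes peer to peer.

```cpp
// Conceptual sketch only (not the RocksDB-Cloud API): replicas share immutable SST objects
// in cloud storage, so a "zero-copy clone" is just a new reference to the same objects.
#include <iostream>
#include <map>
#include <string>
#include <vector>

using ObjectStore = std::map<std::string, std::string>;  // stand-in for S3: object name -> bytes

struct Manifest {
  std::vector<std::string> live_ssts;  // which SST objects currently make up the database
};

struct LeafReplica {
  const ObjectStore* shared;           // points at shared storage; no bytes are copied
  Manifest manifest;
  const std::string& Read(const std::string& sst) const { return shared->at(sst); }
};

// Scale-out: the new replica adopts the existing manifest, opens the same cloud objects,
// and from here on only has to tail new updates from the distributed log (not shown).
LeafReplica ZeroCopyClone(const ObjectStore& store, const LeafReplica& existing) {
  return LeafReplica{&store, existing.manifest};
}

int main() {
  ObjectStore s3 = {{"000007.sst", "...sorted, compacted key-value data..."}};
  LeafReplica leaf1{&s3, Manifest{{"000007.sst"}}};
  LeafReplica leaf2 = ZeroCopyClone(s3, leaf1);        // no peer-to-peer replication
  std::cout << leaf2.Read("000007.sst") << "\n";
}
```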
Yeah, but RocksDB-Cloud is a wrapper around RocksDB. My question is about core RocksDB, the underlying storage engine: are you running the open-source version, or does Rockset have a hacked-up version or fork that you maintain separately?

No, it's the open-source RocksDB, but we have RocksDB-Cloud, which wraps RocksDB and gives us more power.

Right, I understand. Okay, good.

Do we have about four or five minutes left? Sure. Any other questions?

Hi, this is Steven, with a question: since S3 is eventually consistent, in order to do strongly consistent reads, where are you storing the S3 keys in your system?

Good question. Let's say, in this picture, the rightmost leaf has written some S3 files. Now, when you need to create a replica, we have to read those same S3 files into the new replica. RocksDB has something called a manifest inside the database, so the replica reads the manifest and finds which S3 files are part of the database, and if the replica doesn't find one of those S3 objects yet, it knows it's coming, so it retries and fetches it. Basically, the RocksDB metadata tells us which S3 keys to look at, and if an object isn't visible yet, we wait a second or two and retry to make sure we get it.

We also have something in RocksDB-Cloud called remote compaction. This picture isn't very polished because this just shipped two or three weeks ago. Compaction is a big problem for LSMs in general, in the sense that it has to be tuned, and compaction typically runs on the node that is running RocksDB itself. What we have done is implement remote compaction, again to separate compute from storage: the server that is writing new SST files, when it wants to compact them, makes an RPC to a compaction tier and says, compact these three S3 files for me and give me the result, because everything is on shared storage anyway. We are banking heavily on shared-storage architectures in the cloud, and saying we can dissociate compute from storage by doing things like remote compaction. We have a blog post that explains this better than the pictures I have here. So that's another way we keep write compute, read compute, and storage completely independent of one another.

So, to recap before I take more questions: there is no need to manage indexes with converged indexing, no CREATE INDEX command, and no need to define schemas; you can do a DESCRIBE table and it will show you the schema, just like a normal SQL system, and you don't have to provision servers, because the system keeps adding more pods as and when you deposit more data.

To summarize from an engineering perspective, what is different in Rockset? The philosophy is that our database can run on large data sets by doing partition-and-index rather than partition-and-scan. Most traditional big data systems, including Hadoop MapReduce and everything that followed, are all about partitioning and scanning.
That's how most big data systems, and even the bigger warehouses, work. For us, we are claiming that we can actually partition and index, and this is one of the big bets Rockset is taking: that partition-and-index gives you better query latencies on large data sets. That's the first point. The second point is that we want to separate write compute from query compute. Not many databases do this very well; prove me wrong, maybe there are newer ones people are building that I don't know about. But how can you make sure write compute is kept apart from query compute, so that they don't interfere with one another? And the point of doing those first two things is to get low data latency, low query latency, and high QPS from these systems. Those are basically the engineering and architectural differences compared to the previous generation of data systems that I have played around with, but I'd love to hear your opinions, questions, or anything else.

Okay, awesome. Again, because we're not in a physical space, I will clap on behalf of everyone else. We have time for a few questions; if anybody wants to fire one off, go ahead.

Hi, this is Panos Chrysanthis, I'm a professor at Pitt, and I have two questions. The first one is related to the last part you discussed. Basically, your model is shifting the trade-off from one part of the pipeline, ingesting versus querying, to the other part, and I felt that maybe the bottleneck is actually this atomic write. Is that true, and is that something you try to overcome in some respect?

That's a good question. Are you talking about this picture in general?

Yeah, the one where you basically try to write all of a document's entries to the same site, to avoid two-phase commit. That seems to be the first big bottleneck you have to deal with in this shift.

Great question. The way I look at it is this. Take a traditional database; I was very closely associated with HBase, so I can speak to that. In most systems, when a query comes in, the focus of the database is: how can I use my hardware efficiently to give you query results? Say two queries come in; the system will time-share and try to use all the hardware to give both queries good latencies. For Rockset, the focus is that it's more like a search database, which means the first priority is always latency; efficiency comes second, and throughput third. That's why I call it a search-style database. So when a query comes in, it has to make RPCs to, let's say, 100 machines, because the index is spread over 100 machines, and there is a CPU cost to making 100 parallel RPCs to 100 different machines. Traditional databases try to optimize this away by saying: I'm going to put all my data on two machines, because I know how to partition the data.
Then I make just those two RPCs and get my results back. Whereas for Rockset, it's possible that we actually spend more CPU as part of query processing, because the query has to fan out to 100 machines if there are 100 machines in the cluster. But the advantage is that very complex SQL logic is now spread out among those 100 machines, all working in parallel. You see what I'm saying?

Yeah, but then you have to have a proper query optimizer, right?

I agree, I agree. But this is basically the trade-off between search-style databases and term-sharded databases: search databases are always optimized to give you the lowest latency, and if there's a complex query, our claim is that there's no other way but to spread the load among all the machines in your cluster. How else can I get 64 cores on each of those 100 machines running small parts of the SQL query in parallel? That's the only way to use the 6,400 CPUs I might have in my cluster to optimize the latency of this query. It's not about optimizing the total throughput of the system; if you want to optimize throughput, you might partition your data so that if there are 40 queries, each query hits only two nodes at a time and uses less CPU. Am I answering your question?

Yeah, partially. At an architecture level I see that there's a trade-off; I was wondering whether you have measured the trade-off somehow, whether one approach is actually better than the other. That was more my question, actually.

No, that's a great question. What we have measured is that when there's a complex SQL query, we'd like to use as many CPUs in parallel as possible on the data set. That's basically the theory there, and there's no other way: if I do term partitioning and all the data a query needs is on one machine, then that machine becomes the bottleneck and the other machines sit idle, not helping with the query. But yes, that's a good question.

All right, so we have one last question, from Ken. Go for it.

Hey, so given that the SSTs are...

Sorry, where are you coming from?

I'm from California, unaffiliated.

Perfect, go for it.

Given that the SSTs are written to S3 but not the write-ahead log, does that mean the tailers are stateful, or do they rely on upstream durability for log replay?

Excellent question. This all depends on the upstream's ability to replay data. If I go back to this picture, for example: the tailers are getting data from a source, and there are two cases. If the sources are, say, a data lake or a Kafka stream or something else, then obviously you can go back to the data source if you're not able to replay something into the database.
There's also a write API, so there is a way to write data to Rockset without tailing it from a particular source: say you have an application that is doing writes to its own database and also writes to Rockset, without a tailer, as a plain write path. What happens in that case is that Rockset uses a distributed log internally to make the write durable before it hits S3. I skipped that part just to reduce complexity. But yes, there are two cases: if you already have the data in sources like streams and lakes, then you don't need the durability inside Rockset; if you're using the write API, Rockset uses a distributed log and replicates your last minute or few minutes of log before it actually hits the S3 storage system. Does that answer your question, Steve?

Yeah, great, thanks.

Okay, awesome. So we're out of time. Again, thanks Dhruba; I think you're spending the rest of the afternoon with us.