Carnegie Mellon vaccination database talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steve Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

All right guys, thank you for coming. This is another session of the CMU, or the Carnegie Mellon, vaccination database series. We are happy today to have Gian Merlino. He is the CTO and co-founder of Imply, which is the main startup backing Apache Druid. He's been at Yahoo, he was at Metamarkets before Imply, and he has a computer science degree from Caltech, which is a small program, so there's not many of you guys out there. As always, if you have any questions for Gian as he's talking, please unmute yourself and say who you are and where you're coming from. We want this to be a conversation, so feel free to interrupt at any time. Okay, Gian, the floor is yours, go for it. Thank you for being here.

All right, great. Yeah, definitely, thanks for having me. So I'll be talking about Apache Druid: its storage and query engine, and a little bit of how the system all comes together. Let's see, who am I? I'm Gian Merlino, thanks for the intro. I'm a committer and PMC chair at Apache Druid and also co-founder at Imply. If any of you are looking for jobs after you're done learning about databases, or if anyone works on databases professionally, we are hiring, come talk to me.

All right, what are we talking about today? First off, I'll do a brief introduction to Apache Druid, what it is, for those of you that are not familiar with it. Then I'll talk about the core concept of Druid, which is the segment. It's the core concept, the core data format, that everything revolves around. I'll talk about how data servers work and how a cluster comes together (Druid is a distributed system), and then zoom back in from the 10,000-foot view of a cluster to the smaller view of the segment file format and how an individual query is executed. So we'll talk about all levels of the system. It's a complex distributed system with a lot of different zoom levels, and I think it's kind of interesting to take a little bit of a look at each one of them.

Okay, so first off, what is Druid? Druid is a powerful query engine. We like to think of Druid as this rocket ship that goes really fast, but that's when we are really just full of ourselves and have our heads in the clouds. What Druid really is, is a fancy calculator. It's a system for adding things up really, really quickly: counting things, taking mins, maxes, averages. It's a fancy calculator, but people want a fancy calculator on huge amounts of data, so being able to do that really efficiently is good.

Druid powers OLAP apps. This is the kind of app that you want a big fancy calculator for. OLAP is online analytical processing, just a term for doing aggregations, filtering, that kind of thing: mostly very read-heavy workloads that hit the entire data set or large slices of it. You don't tend to be hitting one row at a time; you tend to be hitting large chunks of the data set. And on the write side, it tends to be more insert-based than update-based. So lots and lots of inserts, and lots of reads over big chunks of the data set. That tends to be useful for interactive analytic applications where you want to do visualizations, like the one you see here. That's one that we make at Imply, called Pivot.
Druid also speaks SQL, which is a standard language for doing this kind of stuff, and people build their own apps on top of it. Druid's been around for a while and enjoys nice use out in the wild. Druid's used at Netflix for analyzing user experience, used at Twitter's MoPub for analyzing data related to online advertising and mobile publishing, used at Salesforce for understanding what their users are doing, and used at Airbnb for analytics on the mobile apps. So it's used by all kinds of people that have these really big-scale problems delivering analytical applications.

Here are some numbers from Imply customers. At Imply, we are one of the big sponsors of Druid; we have a lot of engineers working on it full-time. We're certainly not the only company behind it, but we're one of the big ones. Some numbers from Imply customers: petabytes of data under management, thousands of daily active users, millions of rows per second ingested, and very fast query times. All this with 90th percentile query times under a second, and even 99th percentile query times under two seconds, over the last month or so, as of when I pulled these this morning. So we're running it at really high scale and really high speed, and that's what Druid is designed for. And I don't wanna...

All right, so for the queries at the tail, what causes them to be slower? Is it just that they read a lot of data, or is it that they do crazy joins? What is it?

Yeah, go ahead. So it's a long tail of a variety of stuff. Even the fast ones generally do read a lot of data. The ones on the slower end, it tends to be a mix of doing more complicated stuff than is typical. A typical query, people really do want pretty simple stuff: over an incredibly large amount of data, I want a relatively simple statistic. Sometimes people wanna know something fancier that takes more time to compute on a per-row basis, and that can take longer. And sometimes people want to query more data than typical. Maybe typically you'll be looking at a year of data, but every once in a while you wanna do a query over ten years of data.

Yeah, thank you.

Okay, so I didn't wanna talk too much about this stuff 'cause this is an internals talk. I just wanted to motivate y'all for why you would wanna listen to the next 45 minutes. So let's get into the details inside Apache Druid. I hope that some of you are familiar with this video, Powers of Ten, that came out in the 70s. I think I first saw it in my childhood at the science museum, and it captivated my imagination. It's a pretty cool video, about seven minutes long; if you haven't seen it, check it out on YouTube. It's a video that zooms all the way out from this gentleman at a picnic to the bounds of the known universe, and then zooms all the way in to a proton in a molecule inside one of his cells. And I wanna do a similar thing with Druid. I wanna start from sort of a mid-level zoom, then zoom all the way out, and then zoom all the way in. So let's go on that journey.

We'll start off with the segment. The Druid segment is, like I said, the core concept that everything revolves around. It's got about a million rows in it, so 10 to the sixth rows. This is sort of the mid-level zoom for Druid. It's a heavily optimized storage format. It's a columnar format, like every other analytical database out there. With Druid, you're gonna get the greatest hits of columnar databases.
New segments are created continuously as data is ingested, roughly one for every one to ten million rows of new data coming in, and in a big deployment there are millions or even tens of millions of these segments. They're immutable once created, but they can be dropped, replaced, or combined, and we call combining them compaction. If you're dropping them, that's like deleting data, and if you're replacing them, well, we would do that if you issued an update command. They're generally one to five million rows each.

They have a life cycle. This is a little depiction of the life cycle of a single segment. In Druid we have two node types that interact with these segments directly: one is called the indexer and one's called the historical. The indexer creates them. It reads data, streaming data or batch data coming in, and it generates segments. It pushes them to what we call deep storage. Deep storage is going to be something like S3 or Google Cloud Storage or HDFS, some kind of distributed storage.

This talk isn't really about the separation of storage and compute, but suffice to say that sort of ten years ago we had Hadoop coming out saying that we get all our speed from co-locating the data and the compute, and now we have the newfangled cloud data warehouses saying that we get all of our versatility from separating storage and compute. And these are both true: it's faster to combine them and it's more versatile to separate them. Druid occupies, I think, a really interesting point on that spectrum, where we actually get a little bit of the benefits of both sides. But this talk isn't about that; I just wanted to mention it.

Anyway, we push things to deep storage, and then other servers pull from deep storage. All data we've ever ingested is gonna be on deep storage. It's gonna be pulled and loaded locally by the historical servers, and they do this as a preload before a query actually happens. This is what I mean by the interesting space. When you have systems that completely separate storage and compute, things are generally pulled from storage on demand during a query. They might be cached, but they're generally pulled on demand. When you have systems that fully co-locate storage and compute, you tend to have data coming into a server and then staying on that server for the long term. In Druid's hybrid model, there's no strong affinity of data to a specific server. These segment files can float from server to server. When they're ingested, they generally end up on a different server than the one they came in on. They can be repartitioned, shuffled, all these kinds of things that you would expect from a system where storage and compute are separated, but for the one difference that we do pull things from deep storage onto the data servers before they're queried. So it's a prefetch rather than a cache. And that's for performance reasons.

So I think, Hamid, you have a question. Unmute yourself. He's in the chat, he's from IEM. I can unmute you.

Hi, Andy, thank you.

Hey, what's up, man? Go for it.

Yeah, so one question is that when you read the data from local, either memory or NVMe, what is the bandwidth of your storage reads? Is it like two gigabytes per second, or is it ten gigabytes per second?

Are you done? Go ahead, sorry, go.
Go ahead, what were you saying, Andy?

Hamid, are you talking about the Imply-hosted version of Druid? Because it obviously depends on the hardware.

Yeah, I mean, say NVMe, and suppose you have a reasonable amount of CPUs, like an extra-large instance from AWS. So I want to see what you are...

I actually don't know. I tend to think of it more in terms of rows per second than in terms of bytes per second, and those are the numbers I have off the top of my head. In terms of rows per second: for a simple query, you might get 100 million rows per second per processor, and for a more complex one, you might get one to ten million rows per second per processor.

Yeah, okay. So let me ask differently: what is the bottleneck? Is it the CPU or is it the storage? Not the backend storage, but local NVMe or even memory.

Yeah, good question. It depends on the ratios in your storage hierarchy. Typically, people with what we call hot data will have a ratio of memory to disk that is very high and a ratio of CPU to memory that is relatively low. In that case, people tend to be constrained by CPU or memory bandwidth. Which one they're constrained by depends on how good locality is in the data: if locality is bad, they'll be constrained by memory bandwidth; if it's good, they'll be constrained by CPU.

Okay, so you're saying that if the data is in memory, you can actually keep up with the memory speed, which is a very high speed.

Yeah, and we'll talk about that in a little bit when we cover how the compression works, but we prioritize compression and encodings that can be decoded very rapidly. The idea is to really optimize everything for the case where locality is good and where you're CPU-constrained. So we prioritize things that let the CPU fly as fast as it can, and we do tend to be CPU-constrained.

Okay, so my other question, which you're probably gonna answer at some point, is to compare your offering with the Druid at Cloudera.

Hamid, let's do that one at the end, okay? Not now.

Not now, yes, okay.

All right, cool. Gian, go for it.

Sure, so let's see. Okay, so what I was saying is that there's this interesting spot on the storage-compute combination or disaggregation spectrum. One of the things that happens here is what we call balancing. I mentioned that there is an affinity of segments to servers, but the affinity is not as strong as in a system where storage and compute are completely linked, and it's not as loose as one where they're completely separate. In a completely separate system, data would get pulled on demand. That doesn't happen in Druid. In Druid, we have a process called balancing to move data from one server to another as things make sense. In that balancing process, step five in the segment life cycle is that another historical, a different one, loads the segment, and then the original one will potentially drop it. Balancing happens as new data is loaded, in an effort to keep load relatively consistent across the data servers.

It is just a cache though, right? Because the storage of record is the deep storage, right? You're balancing by deciding to drop the cached copy here and pull a cached copy up over there.

Yeah, yeah. In the code, we actually call it a cache, and it is a cache.
But I find that sometimes, when I call it a cache while talking to people, people think it's a cache that's populated on demand. It's a cache that's populated ahead of time instead of on demand.

Sure. Okay. I mean, I wouldn't call it a hybrid of shared-nothing and shared-everything, right? It's shared disk, right? It's shared disk with a smarter cache.

Yeah, that's sort of how I think about it. And the caches are completely local. If you think of deep storage as a shared disk, that disk is never involved in the query path, which is the main difference between what Druid does and what something like Presto does: they might do some caching, but they would also involve the storage layer in the query path, which Druid doesn't.

Got it, okay.

Okay, so that's the 10 to the sixth level, one million rows. Let's go up one zoom level. Let's go to 10 to the... oh my God, I got the wrong thing here. Okay, I'll fix it in post. This should be 10 to the 10, not 10 to the 13. Inside a data server: 10 to the 10 rows, not 10 to the 13. That would be a really awesome data server. So, 10 to the 10 rows on a data server. Let's talk about the goals of one data server. The goals revolve mostly around how we want to deal with CPU and memory, two sets of goals. For CPU, one goal is that we want to be able to harness all available CPUs for a single query, which means everything we do has to be multi-threaded. Another goal is that we want to balance CPU usage across concurrent queries in some kind of "optimal" way, in scare quotes, because who knows what optimal means; we'll define what we think optimal means in a little bit. On the memory side, we want to be able to take advantage of high-memory hardware. If you do have enough memory to hold your entire data set, we want to take advantage of that in an optimal way. We also want to be able to fall back to disk when memory is insufficient.

For the kinds of applications people build on Druid, the most important part is generally how it performs on hot data, which is generally 50 to 80% in memory. So we spend a lot of time optimizing for that case, where most of your data is in memory, maybe not all of it, but most of it. And then there tends to be a fraction of the use case that is on colder data. Maybe the hot data is the past year, like I was saying, and the colder data is the nine years before that. Or it could be on a table-by-table basis: certain tables are hot and certain tables are cold. However it is, the typical application built on Druid is mostly hot and some cold, and we want to be able to handle both of those use cases in one system.

All right, now the legend is back to the correct 10 to the 10 rows. Let's look inside a data server schematically. There are these sort of three layers in terms of how we think about what's going on inside a data server. On the bottom layer, we have all the segments on that data server, maxing out at a couple of hundred thousand or so, memory-mapped from disk. Like a lot of other systems invented around the time Druid was, and like modern systems today, we rely heavily on memory mapping. It works really well, and it works really well with SSDs, which are pretty popular these days. So like a lot of modern databases, especially a lot of modern analytic databases, Druid's meant to run on SSD-backed systems.
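As an aside, here is a minimal Java sketch of this kind of memory mapping: map a segment file read-only and let the OS decide which pages stay resident. It's a hypothetical illustration, not Druid's actual code; the SegmentMapper class and the one-file-per-segment setup are assumptions for the example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentMapper
{
  public static MappedByteBuffer map(Path segmentFile) throws IOException
  {
    try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
      // Map the whole file read-only. Nothing is read from disk yet; the OS
      // pages data in lazily as query threads touch regions of the buffer,
      // and evicts pages under memory pressure. No cache code needed here.
      // (The mapping stays valid after the channel is closed.)
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }
  }
}
```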
And so we have all the segments lying on disk. We memory-map all of them, and we let the OS figure out what's in memory and what's not. We don't actually have code in Druid to figure that out; we allow the OS to figure it out. Above that layer, we have the processing thread pool; there's one thread per CPU. And then we have the processing memory pool; there's a certain amount of memory we allocate per CPU as well, generally on the order of half a gigabyte or so. And then at the top layer, we have some per-query resources. So the bottom layer is all memory-mapped files, the middle layer is resources that are per CPU, and the top layer is resources that are per query. Let's talk a little bit about how these resources get managed.

Quickly, for the memory-mapped files, are you guys doing madvise, or giving hints to the OS about what to page in and evict, things like that? Or are you just letting the OS blindly do whatever it wants?

We're doing the second one. We might start doing madvise at some point. I think some folks have done some experiments in open source Druid, but none of them made it into mainline.

Okay, all right. And your thread pool, is that all bespoke code, or are you relying on something like Intel Threading Building Blocks? There's a couple of scheduler libraries out there. Do you use any of those?

Yeah, so Druid's written in Java. We're using mostly the standard thread pool in java.util.concurrent, with one change that we added for prioritization purposes, which I'll talk about in a couple of slides: we changed how it prioritizes tasks that are queued up to be processed. The rest of it works the same way as the basic Java one.

Okay, thank you.

So in terms of the resources of the server, a lot of what we've done to manage these resources is based on prioritization. The reason that's important is because the base design of the system, if you look at the processing thread pool, the memory pool, and the memory-mapped disk, already achieves the first goal I mentioned: it enables a single query to use every CPU if it wants to. There are many more segments than CPUs; the server might have a hundred CPUs and a hundred thousand segments. So if you split up the segments across the CPUs, you get good multithreading right there. The issue in this design comes in when you have concurrent queries and you want to balance resources between them in some fair or optimal way. I think about it more in terms of optimal than fair, because you don't really want to be fair, necessarily, in a system like this; you want to do something that the user is going to expect. To that effect, we've built a prioritization system inside Druid. The idea is that most people have a variety of workloads they want to put on a system like this, and some of those workloads they expect to be fast and interactive, and some they don't. What we want to make sure of with the prioritization system is that the workloads people expect to be interactive remain interactive, even when the system is under high load.
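A minimal sketch of that kind of prioritized pool follows: the standard java.util.concurrent executor, but backed by a priority queue so that queued high-priority work runs ahead of queued low-priority work. Druid's actual modification differs in its details; the class names and priority values here are illustrative.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PrioritizedExecutor
{
  // One thread per CPU, as in the processing pool described above.
  static final int NUM_THREADS = Runtime.getRuntime().availableProcessors();

  static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
      NUM_THREADS, NUM_THREADS, 0L, TimeUnit.MILLISECONDS,
      new PriorityBlockingQueue<>()  // orders queued tasks by priority
  );

  static class PrioritizedTask implements Runnable, Comparable<PrioritizedTask>
  {
    final int priority;       // higher value = more important
    final Runnable delegate;  // the per-segment work to run

    PrioritizedTask(int priority, Runnable delegate)
    {
      this.priority = priority;
      this.delegate = delegate;
    }

    @Override
    public int compareTo(PrioritizedTask other)
    {
      // Descending: higher priority sorts first in the queue.
      return Integer.compare(other.priority, priority);
    }

    @Override
    public void run()
    {
      delegate.run();
    }
  }

  public static void main(String[] args)
  {
    // Queued low-priority background work only runs in the gaps between
    // queued high-priority interactive work.
    POOL.execute(new PrioritizedTask(1, () -> System.out.println("background scan")));
    POOL.execute(new PrioritizedTask(10, () -> System.out.println("interactive query")));
    POOL.shutdown();
  }
}
```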
Of course, that's not possible in the general sense without auto-scaling, which is something that doesn't exist in Apache Druid. People have built auto-scaling-type stuff on top of Druid, but I won't be talking about that today. The base open source system does not automatically scale, so the ability to meet these expectations is mostly built around prioritization.

So, prioritization. Here we have three queries: two of them are outlined in orange and one's outlined in blue. The orange ones are high priority and the blue one is low priority. The way the lower levels work is that if we have a bunch of queries in the system, some at higher priority and some at lower priority, the higher-priority ones get priority over everything at the lower levels. They get priority over all the CPUs and all of the memory pool. They could, if they wanted, take all the resources away from the lower-priority one, which is generally what people expect to happen. Generally, people expect lower-priority work to exist in the gaps between high-priority work. In practice, this takes advantage of the property that high-priority work tends to be pretty spiky: it tends to be human-driven, by people clicking around in an analytical application. It doesn't tend to be space-filling; it doesn't tend to fill all the available computation time the system has. So when there is high-priority work happening, in that instant, we do it. And when there's no high-priority work happening, we fill the space with whatever lower-priority work may be around. This extends down to the lower levels, because the segments are all loaded and basically accessed as needed, and as the processing threads access areas of a segment, the OS will page them in and keep them paged in. This also tends to be helpful because the higher-priority work and the lower-priority work don't always want to access the same segments; they tend to want to access different tables or different time ranges, that kind of thing.

Okay, so that's one data server. Let's go up one more level and talk about how a cluster comes together. Now we're finally really at 10 to the 13. This is 10 trillion rows. This is about as big as one cluster is going to be able to get with the design that we have in Apache Druid today; above this number of rows, you might want to make two clusters. Anyway, there's a query server called the broker. Like I said, Druid speaks SQL. The broker is actually the piece that speaks SQL, and it translates the SQL query into what we call a native Druid query that every other server in the cluster can understand. So there's this one server type that speaks SQL, and everybody else understands a much lower-level native language that's more physical-operator-oriented and less logical than SQL is. We call this process query fan-out, because a query starts in the broker, fans out to every data server that has data relevant to that query, and then fans back into the broker.
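A hypothetical sketch of that fan-out pattern follows. The DataServerClient interface and the query and result types here are invented for illustration; they are not Druid's real APIs.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class BrokerFanOut
{
  // Stand-in for the broker's connection to one data server: it runs the
  // native (non-SQL) query against that server's local segments.
  interface DataServerClient
  {
    CompletableFuture<PartialResult> run(NativeQuery query);
  }

  record NativeQuery(String table) {}
  record PartialResult(long sum) {}

  public static PartialResult fanOut(NativeQuery query, List<DataServerClient> servers)
  {
    // Fan out: one async call per data server holding relevant segments.
    List<CompletableFuture<PartialResult>> futures =
        servers.stream().map(s -> s.run(query)).collect(Collectors.toList());

    // Fan back in: wait for every partial result and combine them.
    long total = futures.stream()
        .map(CompletableFuture::join)
        .mapToLong(PartialResult::sum)
        .sum();
    return new PartialResult(total);
  }
}
```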
Okay, so now that's as far up as we go, and now I want to go down. Let's go back to the segments, and then we'll start zooming into the segment and talking about how individual segments are structured and how queries execute on them. So, back to segments, back to these one-million-row data files. This is an example segment. This one has eight rows in it, not one million, but that's only because I can't put a million rows on a slide, so just pretend. So there are eight rows in the segment, and the segment has five columns: time, artist, city, price, and count. This is, I guess, a pre-COVID data set of ticket sales. There are three different artists in three different cities, and the data set is how many tickets were sold for each artist in each city, and what the total price was.

Okay, so I wanna look at a few columns, column by column. Like I said, Druid is a columnar database, and like all columnar databases, we store each column separately. Each one has its own kind of encoding and its own kind of compression, and in terms of locality, each column is stored together in the same place on disk, so it can all be loaded into memory together and all be transferred to the CPU together.

These are all the timestamps that we've stored. They're all the same in this example, likely because when we ingested the data, we chose to truncate timestamps to the hour or to the day, something like that. So they're all the same here. These are probably going to be stored with table encoding, which is one of the ways we can encode integers; we do it for integer columns that are low cardinality. So it'll likely be only one bit per row.

The next column is artist, and the artist column is a string. It has three sections. First, it's got a data section. There are eight values here, in the same order as every other column's eight values; every column has eight values, row zero through row seven. In the artist column, the first three rows are Justin, the next two are Kesha, and the next three are Miley. Really very dated musical references, but what can you do? Second, there's a dictionary section. All of our string columns are dictionary-encoded, which is useful, and I'll talk a bit later about why it's especially useful. And third, we have an index for each of these columns, and the indexes are bitmaps. We use either CONCISE or Roaring for the bitmaps; generally these days we prefer Roaring, which is the default. One of the nice things about these is that they can be stored compressed, and they can be combined, intersected, and unioned in their compressed form, which is really cool.

And finally we have all the other columns. Like I said, every one is gonna have eight values, all stored in the same order. So if you wanna retrieve row three, you would read the third value in every one of these columns. Obviously that's not a very efficient way to retrieve a single row, which is why these kinds of layouts are used for databases where people wanna do aggregates over big chunks of data, as opposed to retrieving single data points.
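To make the three sections concrete, here is a small sketch of dictionary-encoding and bitmap-indexing a string column like this artist column. It uses the real org.roaringbitmap library, but the surrounding structure is simplified for illustration rather than taken from Druid.

```java
import java.util.List;
import java.util.TreeMap;
import org.roaringbitmap.RoaringBitmap;

public class StringColumnSketch
{
  public static void main(String[] args)
  {
    List<String> rows = List.of(
        "Justin", "Justin", "Justin", "Kesha", "Kesha", "Miley", "Miley", "Miley");

    // Dictionary section: sorted unique values mapped to integer codes.
    // Keeping it sorted is what enables binary search at query time.
    TreeMap<String, Integer> dictionary = new TreeMap<>();
    for (String v : rows) {
      dictionary.putIfAbsent(v, 0);
    }
    int code = 0;
    for (String v : dictionary.keySet()) {
      dictionary.put(v, code++);
    }

    // Data section: one dictionary code per row, in row order.
    int[] data = rows.stream().mapToInt(dictionary::get).toArray();

    // Index section: one bitmap per dictionary code, with a set bit for
    // each row containing that value. Roaring bitmaps stay compressed and
    // can be intersected and unioned without full decompression.
    RoaringBitmap[] index = new RoaringBitmap[dictionary.size()];
    for (int i = 0; i < index.length; i++) {
      index[i] = new RoaringBitmap();
    }
    for (int row = 0; row < data.length; row++) {
      index[data[row]].add(row);
    }

    System.out.println(index[dictionary.get("Justin")]);  // {0,1,2}
  }
}
```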
So when data shows up in the indexer and you're building this, you have to wait until the segment is full, then run your pass on it and decide how to build the dictionary, right? Because you can't know all the possible keys you'll see ahead of time. Before you freeze the data in the segment in compressed form, is it accessible through the execution engine for queries?

It is accessible. At that point it's in a row store, which is in memory and doesn't use this format at all. It's a pretty common technique in columnar databases to have a row store that flushes to a column store.

Yeah. Thank you, Pat.

Question: if you have the bitmap indexes, why do you need the data section? Can you reconstruct the data from the index?

You can, but storing the same data in two different ways is useful because we might wanna do different things with it. I'll go through an example query where hopefully this gets illuminated, but in general, if you're gonna be filtering on something, it's faster to use the index, and if you're gonna be grouping on something, it's faster to scan the data section.

Okay, so that's a really good segue into this next slide: let's do an example query. Let's go through how Druid would execute this query on this one segment. In this SQL query, we're gonna group by city and add up all the prices, for artist equals Justin. We just wanna see, for Justin, what are all the cities that people bought tickets in, and what was the total price of those tickets.

The first thing we're gonna do is resolve the WHERE clause, and we do that by looking at the dictionary first. We wanna figure out: what is the dictionary code for Justin? We store the dictionary sorted. J, K, M: these are in lexicographic order, so we can do a binary search through the dictionary to find any particular value. It doesn't really matter much here, since there are only three values and a scan would be pretty quick as well, but in situations where there are 10,000, 100,000, a million values in a segment, the binary search really helps us out. So we binary-search through the dictionary, we see that Justin is zero, and then we head on over to the index and get the zeroth entry. The index has the same cardinality as the dictionary, and the values are in the same order, so the zeroth thing in the index is the bitmap for Justin. You can see the first three bit positions are ones and the rest are zeros, signifying that this column has Justin in the first three rows and not in the others.

Next, we head on over to the city and price columns to do the GROUP BY. We're gonna read them both at the same time, and here we use the data section instead of the index, because we have the row numbers we wanna read. We know we wanna read rows zero, one, and two, and the fastest way to read rows zero, one, and two is to look at the data section. So we go to the data section: we see that the first city is two and the price is 1800, then we see one and 2912, and then two and 1953, and we add all these up into a scratch space. That's scratch space we've allocated out of the processing memory pool from a few slides ago. There are a few different ways we might do it: we might use a hash table, we might use an array, we might use a heap, depending on whether we're pushing down a limit and whether we know the cardinality to be large or small. In this case, because the cardinality is small (there are only three values of city), we're gonna use an array. So this is actually just gonna be a long array, a 64-bit integer array with three elements: zero, one, and two. Zero is gonna have nothing in it, one will have 2912, and two will have 3753.

Then, after reading through both columns, we go to the dictionary and replace all of the dictionary codes with the names. And we do this on a per-segment basis, because every segment has a different dictionary.
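Here is the per-segment flow above condensed into a runnable sketch: dictionary binary search, bitmap lookup, array-based aggregation, and the code-to-name replacement at the end. The data is the slide's toy example; the slide never names city code zero, so "Chicago" is a placeholder, and the structure is simplified and hypothetical rather than Druid's actual code.

```java
import java.util.Arrays;

public class SegmentQuerySketch
{
  public static void main(String[] args)
  {
    // Per-segment dictionaries (sorted, so binary search works).
    String[] artistDictionary = {"Justin", "Kesha", "Miley"};
    String[] cityDictionary = {"Chicago", "LA", "SF"};  // code 0 is a placeholder

    // Data sections: one dictionary code per row, in row order. The prices
    // for rows 3..7 are made up; they are never read under this filter.
    int[] cityData = {2, 1, 2, 0, 1, 0, 2, 1};
    long[] priceData = {1800, 2912, 1953, 500, 600, 700, 800, 900};

    // 1. WHERE artist = 'Justin': binary search the sorted dictionary,
    // then one random access into the index yields the bitmap for code 0,
    // whose set bits are rows {0, 1, 2}. The artist data section is never read.
    int justinCode = Arrays.binarySearch(artistDictionary, "Justin");
    assert justinCode == 0;
    int[] matchingRows = {0, 1, 2};

    // 2. GROUP BY city, SUM(price): cardinality is small, so the scratch
    // space is a plain long array keyed by city code, not a hash table.
    long[] sums = new long[cityDictionary.length];
    for (int row : matchingRows) {
      sums[cityData[row]] += priceData[row];
    }

    // 3. Replace codes with names once per unique code, not once per row,
    // skipping codes that were never touched.
    for (int code = 0; code < sums.length; code++) {
      if (sums[code] != 0) {
        System.out.println(cityDictionary[code] + " = " + sums[code]);
      }
    }
    // Prints: LA = 2912, then SF = 3753.
  }
}
```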
And so in order to merge results from different segments, at this point, after we've aggregated the data for one segment, we have to replace the dictionary codes with the names. So now we have LA and SF. We ditch the null; we ditch zero, and we don't look it up in the dictionary because it was not found. And then we have the final result for this segment, which we will then go on to combine with the results from other segments and return to the broker.

And this, I think, illustrates the power of this particular format. The awesome thing here is that in order to do the filter, we never had to read the data section of the artist column. We only read the dictionary, did a binary search through it, and did a single random access into the index; we didn't even do a search in the index. Then, to do the aggregation, we knew which rows we had to read (the first three), and we did it using an array instead of a hash table, so a much smaller per-row cost. And after the aggregation was done, we replaced the IDs with names from the dictionary, and we only had to do that one time per unique ID, not once per row. In this example that doesn't matter much, because there are only three rows and two IDs, but a more realistic scenario would have millions of rows and maybe hundreds of IDs. So there's a big boost there.

Okay, so that's what I wanted to say about that. Now I want to move on to one more zoom level down: blocks within segments. That was 10 to the sixth, one segment. Now we're at 10 to the third, 1,000 rows. Each of these segments is split up into blocks of not exactly a thousand rows, but of that order of magnitude. I'll talk a little bit about why we do that and why it matters.

So let's think about one column within that segment. This is that artist column. In this artist column, we had a data section that had eight rows in it. That's not really enough for talking about this concept, so let's pretend it had 16; I made up a new one. And let's pretend that each of these blocks is four rows instead of a thousand. So here's a segment that has four blocks, and each block is four rows. In reality there would be many more blocks, and each block would be about a thousand rows, but bear with me. The first three rows are Justin, then there are some Keshas, then some Mileys, then some more stuff, then another Kesha, another Justin, another Miley, another Kesha, and so on.

Data layout is really powerful, and it helps to think about these blocks when thinking about data layout. Each of these blocks is individually compressed; typically we use LZ4 for this. When one of these blocks is decompressed, we must decompress the entire block, because LZ4 is not an algorithm that enables random access into a compressed payload. So what we do is read a block containing about a thousand rows and decompress that block, and then we're able to operate on any of those thousand rows relatively cheaply, because the block is already decompressed. The idea here is, let's say we're filtering on artist. Go back a slide. Let's say we're filtering on artist equals Kesha. If we're doing that, then we have to read three of these blocks.
We have to read three blocks because the first three blocks have ones in them and the last one doesn't, so we don't have to read it. But we have to read a block if it has even one row with a particular value in it. So we have to read these three; three blocks need to be decompressed. We have to pay the cost to transfer those blocks from disk to memory and from memory to CPU, and to run LZ4 on them, and only then can we decode the one row we wanna read. That's a lot of cost to pay to read just one row.

We can address this by sorting, by sorting on artist. Of course, we can't sort on every column at the same time; that doesn't really make sense. So typically people choose to sort on a column that they're commonly gonna be filtering on or commonly grouping on. Let's say that in this data set we're Ticketmaster and we're almost always filtering on artist. Then we might want to sort based on artist, and that way, when we're filtering on any artist, we never have to read more than two blocks, because all the Justins are clustered, all the Keshas are clustered, and all the Mileys are clustered. Clustering things this way minimizes the number of blocks to read, gives us better data locality, and speeds things up quite a bit.

I have a real-world example of where that actually helped a lot. I mentioned Druid is written in Java. This is what we call a flame graph, created with a tool called SJK; I think it stands for Swiss Java Knife. Anyway, it's a cool tool. This is a profile of a query that was running on a time-series data set with a bunch of sensors and a bunch of facilities. You can see that 90-plus percent of the time is being spent in LZ4 decompression; the tops of all these stacks in the flame graph are LZ4 decompress. So we're spending a ton of time decompressing things and very little time doing anything else. This is a pretty common thing to see if you have a data set with poor data locality and you're filtering into a small slice of it, which was the case here. We have a data set of sensor data from a bunch of different industrial facilities, and the application wanted to filter down to specific sensors on specific machines in specific facilities. So it was grabbing a point out of this block, a point out of that block, a point out of another block, and you get a flame graph like this.

When we changed it to cluster the data by machine, so that all the sensors for the same machine and the same facility are together in the same segments and the same blocks, the overall CPU time for the query went down by 8x. So the query got 8x faster, or equivalently, it could run at the same speed on an eight-times-smaller hardware footprint. And you can see that the LZ4 decompression part of the flame graph is, eyeballing it, maybe 15 to 20 percent, much smaller than 90-plus percent. This is a much healthier-looking flame graph, where we're not spending all of our time decompressing blocks just to retrieve one or two points out of them.
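A small sketch of that block-level scheme, using the lz4-java library (net.jpountz.lz4). The block framing here is simplified for illustration; the point is that LZ4 gives no random access, so touching any row means decompressing its whole block first.

```java
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

public class BlockCompressionSketch
{
  public static void main(String[] args)
  {
    LZ4Factory factory = LZ4Factory.fastestInstance();
    LZ4Compressor compressor = factory.fastCompressor();
    LZ4FastDecompressor decompressor = factory.fastDecompressor();

    // Pretend this is one block of a column (about a thousand rows).
    byte[] block = "Justin,Justin,Justin,Kesha,...".getBytes(StandardCharsets.UTF_8);
    byte[] compressed = compressor.compress(block);

    // To read even one row in this block, the entire block must be
    // decompressed; the decompressor needs the original block length.
    byte[] restored = decompressor.decompress(compressed, block.length);
    System.out.println(new String(restored, StandardCharsets.UTF_8));
  }
}
```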
So let's see. Finally, I wanted to talk about one single row, going down to 10 to the zero rows: one row. At the one-row level we're doing type-specific compression. For strings, that's the dictionary encoding and indexing we just covered. For integers, we're doing potentially table encoding, potentially delta encoding. Delta encoding means encoding differences from a base value; table encoding means there's a small number of unique integers and we give each one a dictionary code, so we can do dictionary encoding for integers as well as strings. And then there's data-dependent packing. What I mean by that is we pack values into a certain number of bits based on the range of data encountered in that segment. So if, in a given segment, we encounter data that spans a range that can be packed into six bits, then we'll store six bits per row. For indexes, we're using Roaring bitmaps, like I mentioned, so there's type-specific compression being used there too. And for layout, each value of a row is spread across all the relevant columns. We sort of explode each row, which increases the cost to retrieve a single row by quite a bit, but that's okay, because we don't really need to retrieve single rows; that's not what we're trying to optimize for. We're trying to optimize for scanning large chunks of data, and this exploded storage format is really useful for that, because each column can be loaded separately into memory, and rows that are likely to be queried together can be clustered together within those columns, leading to really good flow of data from disk to memory to CPU.

All right, what is table encoding again?

Table encoding is the same thing as dictionary encoding, but in Druid parlance, when we do it with strings we call it dictionary encoding, and with integers we call it table encoding.

Okay, but it literally is: I see the same value over and over again, so I store a mapping from a code to what the actual value was.

Yeah. It's most useful for integer columns where the integers are IDs of some kind. Maybe you have a customer ID column and there are a hundred customers, and maybe the numbers are really wacky; maybe you gave them all random numbers. We'll crunch them down to zero to 100.

Now I understand, okay. And the data-dependent packing, that's just bit packing.

Yeah, exactly. Bit packing on the scope of one column: whatever the spread of data is for one column, that determines the number of bits for that column.
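Here is a sketch of that data-dependent bit packing: values whose range fits in six bits get stored at six bits per row. This is an illustration of the idea, not Druid's actual packing code.

```java
public class BitPackingSketch
{
  // Pack each value into `bits` bits, packed densely into a long[].
  public static long[] pack(int[] values, int bits)
  {
    long[] packed = new long[(values.length * bits + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      int bitPos = i * bits;
      packed[bitPos / 64] |= ((long) values[i]) << (bitPos % 64);
      if (bitPos % 64 + bits > 64) {  // value straddles a word boundary
        packed[bitPos / 64 + 1] |= ((long) values[i]) >>> (64 - bitPos % 64);
      }
    }
    return packed;
  }

  // Random access: read the `row`-th packed value back out.
  public static int get(long[] packed, int bits, int row)
  {
    int bitPos = row * bits;
    long value = packed[bitPos / 64] >>> (bitPos % 64);
    if (bitPos % 64 + bits > 64) {  // stitch the straddled high bits back on
      value |= packed[bitPos / 64 + 1] << (64 - bitPos % 64);
    }
    return (int) (value & ((1L << bits) - 1));
  }

  public static void main(String[] args)
  {
    int[] values = {5, 42, 17, 63, 0, 33};  // max is 63, which fits in 6 bits
    long[] packed = pack(values, 6);        // 36 bits total: one long, not six ints
    System.out.println(get(packed, 6, 3));  // 63
  }
}
```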
Okay, so that's what I prepared. If you wanna stay in touch, follow us on Twitter, check out druid.apache.org, the project website, and again, if you're looking for a job, imply.io/careers. Thank you. And I think we have some time for questions.

Yes, awesome. I always have questions, but let's open it up to the audience first. I know Hamid has a question, but does somebody else in the chat wanna unmute themselves? I can unmute you. Eyal, go for it.

Hey, Gian. I just wanted to see if you can talk a little bit about how the columnar architecture represents itself in the data storage. Where exactly is the separation between two different columns? Is it at the file level, is it at the block level, et cetera?

Yeah, so good question. Think about each segment as being its own miniature database, which I think is a good way to think about it. It's good to think about it that way because if you did only have a few million rows to query, then you would only need one segment; one segment would be the entire database. So if you wanna think about it that way, it's all stored in a single file. And in that single file, every column is laid out end to end, all stretched out in a line. The first column might occupy bytes zero through five megabytes, then five megabytes through eight megabytes might be the next column, then eight through twelve the third column, and so on and so forth.

Question? Would you ever really... oh, sorry. Mogadjee, go for it.

Hi, Gian. My question was: I know you're talking about dictionary encoding for the string columns, right? I know it makes sense when the cardinality is very low. What if the cardinality of a particular column is very high? Even in that case, are you guys gonna do the same dictionary encoding, and what are the cost implications in that case?

Yeah, that's a good question. We are gonna do dictionary encoding in that case as well. And I guess the follow-up question is: is that a good idea? Well, what are the alternatives? I guess the main alternative is to store the high-cardinality string inline in the segment. If you think about the way the dictionary works, the dictionary code is sort of like a pointer: code zero is a pointer to a location in the dictionary that has that string, one is a pointer to a different location, and two is a pointer to yet another location. Instead of storing pointers, you could store the values inline. Storing them inline has the advantage of better locality than storing them as pointers, so it might be a good idea, and it might be an especially good idea if the values are always a certain length. Maybe if your high-cardinality strings are UUIDs... well, that's a bad example because those are really integers, but if there's something else that's a fixed length, it might make even more sense to store them inline. It might be a good future improvement to Apache Druid. In conclusion, I think what we do now works, but the main downside is that you're not gonna get a ton of benefit from the dictionary at high cardinality, and you might get better locality if you stored the values a different way.

Okay, we'll do Ben next and then Aaron after this.

If you have a really large data set with a bunch of segments and you wanna do a query that only touches a handful of rows, do you get issues with having to fan out and do a random lookup on the index of every single segment?

So let's see, there are, I guess, two scenarios there. One scenario is where you are filtering on something that matches the way you've partitioned or clustered the data. I talked about clustering being useful within a segment, but clustering is also useful across segments, because data can be clustered into segments.
Thinking back to this artist example, this ticket-sales example: if you partition your data across segments based on artist, and then you're filtering on artist, not only will the data be clustered within each segment, it'll also be clustered across segments. That means that of the million segments you have, maybe only a thousand of them have data for a certain artist, and the system does know that and can prune the list of segments to scan accordingly. So that's a scenario where you're not gonna get very much overhead at all. The other scenario is if you're filtering on something that you have not clustered by. In that case, we do have to look at every segment, and there's some more overhead, which is why clustering is so important. And I mean, it still is gonna work fine if you're filtering by something you're not clustered by; that's actually the scenario we designed for. The intent of Druid is that you should be able to filter on anything you want. That's why we have these bitmap indexes and everything. But there will be somewhat more overhead if it's something you're not clustering by.

Okay, awesome. Aaron, I've asked him to unmute. There you go.

There you go, I'm unmuted. Yeah, so I pretty much wrote it in the question there, but: say you have a larger cluster and you have that moment of, you know what, we really should have done this differently from the get-go in terms of very commonly used columns. Is there anything more efficient than going back and, in chunks, doing re-indexing jobs with the hash-based partitioning?

Nope, that's the best thing I know of too. If you have an absolute ton of data loaded and you decide you want to partition it differently, then you are gonna need to go through and reprocess everything. One of the nice things about Druid, and this goes back to that space we occupy on the storage-compute disaggregation spectrum, is that you can scale up the cluster temporarily to do a re-indexing like that really rapidly, and then scale it back down, and that works fine.

Oh, are you talking about in terms of the indexers?

Yeah, you can add a ton of them temporarily. If you're in a public cloud environment where it's easy to do that, you can add a ton of those temporarily, do the job, and then turn them all off, and it'll be okay, because the stuff is gonna be in deep storage and will get loaded by other servers.

Gotcha, yeah, thanks.

All right, next question is Bali.

Hi, are there any plans to support high-precision arithmetic for aggregations, on averages?

Let's see, high-precision arithmetic. You mean like SQL-style fixed decimals?

Yeah, fixed decimals, where the precision is of utmost importance, like you cannot be off by a digit or so.

Yeah, it's hard to talk about plans in an open source project, because that's not really how Apache collaboration works, but there are certainly people talking about it. If you go on GitHub, apache/druid, and search for decimal types, you'll see some active conversations. So it's something that's certainly being talked about.

Okay, great, I'll follow up there. Thank you.

Okay, "Api" has their hand up. If you're a real person, ask your question.
Yeah, so maybe your last answer kind of touched on this too, but I'm just wondering: are you guys thinking about the Apache Arrow project at all, and how some of its aims might play with what y'all are trying to do at Druid, and future directions of things?

Funny you should ask. Yes, we are thinking about it and talking about it. Apache Arrow is very interesting. It's similar to the Druid segment format in some ways and different in other ways. I think the Druid segment format is a better fit for what we use it for, which is long-term storage. The Arrow folks have a page on their site about Arrow versus Parquet and the differences between them, and a lot of that applies to Arrow versus the Druid segment format as well: Arrow is designed for 100% in-memory use, while Parquet is designed for long-term archival storage on disk. The Druid segment format is kind of a blend between them; it's at home on disk, but also at home in memory. Where Arrow could be especially interesting, and what we're actually thinking about potentially incorporating Arrow into Druid for, is transferring data between servers. For something that you're gonna read a ton of times and have on disk and memory-map, Arrow's maybe not the best choice. But for data that's generated once, transferred to a server, and read once, it seems to be a really good choice, with less overhead on the write side than the Druid segment format has. With the Druid segment format, because of all the compression we're doing, reading is quite fast but writing is fairly slow; we push a lot of work to the write side. With transient data that goes between servers, you don't wanna do that. You're willing to sacrifice some footprint for really high speed on the write and read sides, and Arrow's good at that. So it might make sense to blend them together. I don't know if that answers your question, but...

It does, thank you.

Okay, good. So the next question is Steve Moy, from the Steve Moy Foundation for Keeping It Real.

Hi, since this is implemented in Java, I'm very interested in whether there's been any experimentation running on the Amazon Graviton 2 instances, which have excellent memory bandwidth. So I'm kinda interested in that.

I don't know of anyone that's done that, but it would be cool. If you do it, give us a holler.

Okay, so Hamid, do you wanna ask the questions you had in the beginning?

Okay, so: any comparison with the Druid at Cloudera, and also a comparison with ClickHouse?

Yeah, so the Druid at Cloudera is relatively similar to Apache Druid. I'm not sure if you're asking about the difference between Imply and Cloudera or between Apache and Cloudera; I guess I could represent either one. But none of the three are incredibly far from the others; the differences are, at this point, relatively modest. The Imply ones tend to be a bit more up to date than the Cloudera ones. I think the Cloudera one is somewhat older, but other than being older, I don't think it has really diverged terribly much. And then ClickHouse. ClickHouse is also cool; it's a columnar data system that's pretty neat. As it relates to how individual queries get processed and the data formats, Druid and ClickHouse are actually pretty similar. As it relates to the distributed system design, they're a little more different.
ClickHouse, to me, has more of a flavor of strong affinity between data and server, and Druid, to me, has a weaker affinity, and that comes out in various ways as you think about how to operate and scale a system. I feel like the weaker affinity in Druid makes it simpler to think about how to make things elastic; you sort of just add more processes or kill processes, that kind of stuff. On the flip side, I think the stronger affinity in the ClickHouse world makes it easier to think about how to run a very small setup that might be only one or two servers. So yeah, that's what I would say there.

Okay, thank you.

But a very interesting system.

For everyone, I posted in the chat: we had the ClickHouse guys come give a talk in May last year for the quarantine talks, so if you're interested in learning more about ClickHouse, you can check that out. So I'll finish off; I guess I'll ask my last question. You didn't really get into a discussion of how the query engine actually reads this data, especially when you're doing long scans. Are you guys doing any codegen or vectorized execution techniques, or is the Java code just doing the standard interpretation that Postgres does?

Yeah, so we're doing vectorized queries. I didn't mention it much because it seems like table stakes these days for an analytics database, but we are doing it. We are not doing codegen today, unless you count what the JVM itself does; I guess you could think of HotSpot as a sort of codegen. But we haven't implemented anything on our side explicitly in the codegen realm.

Yeah, okay. Hamid is asking if you guys are doing joins, but I think you already said in the beginning that you do support joins.

Yeah, we do support joins. We don't support every shape of join you can think of. The heritage of Druid is to be on this fast analytics side, powering interactive apps, that kind of stuff, and historically, fact-to-fact joins have not been super important there. So we don't support a shuffle join or a sort-merge join; we support hash joins and broadcast joins. I think at some point we will add support for all that kind of stuff, because I feel like the natural evolution of every database these days is to support everything as you get more and more users. But it's also important, on your journey toward supporting everything, to remember what you actually want to be really good at, and what we want to be good at is this interactive app stuff.

And the broker, that's all custom? That's just Druid code, you're not relying on Calcite or something like that, right? You guys wrote your own query optimizer?

Yeah, good question. The fan-out stuff is 100% Druid. The SQL stuff uses Calcite for the optimizer core and for the parser. And then on the Druid side, the code we have that integrates with Calcite, basically teaching Calcite how to generate native Druid queries, is pretty big. Calcite's really nice in that it's got a very, very extensible optimizer. It's a great project, and we're making a lot of use of its features in order to have its optimizer generate query plans that make sense for Druid.

Yeah. A bunch of people only use Calcite for the parsing; you're using it for both the parsing and the planning.

Yeah. Everything other than the fan-out piece.

Okay, awesome.
All right guys. So with that, we are done. Again, Gian, thank you for doing this. Appreciate you being here.