The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome, everyone, to another talk in the Vaccination Database seminar series. We're excited today to have Steven Phillips. He is a manager at Dremio, and he's going to talk to us about query optimization, acceleration, and execution in the Dremio database management system. As always, if you have any questions for Steven as he gives his talk, feel free to interrupt him at any time: unmute yourself, say who you are, and ask your question. Feel free to do this at any point, so that he's not talking to himself for an hour on Zoom. So with that, Steven, thank you so much for being here. Go for it.

Great. Thank you, Andy. I'm really excited to have this opportunity. Like I said, I'm here to talk about Dremio. As you can see from this first slide, we call ourselves the SQL Lakehouse platform, and I'm going to be talking about query optimization, acceleration, and execution.

So who are we? This is a diagram; it's a little bit of a marketing diagram, but I'll show it here. In this pyramid we have the base layer, the cloud storage. Dremio sits in between the cloud storage and the data scientists and BI users. What we offer is a fundamentally different way to deliver analytics. Like I said, Dremio sits between data lake storage and end-user tools, and our product consists of two key areas.

The first component is a scalable, high-performance query engine that allows you to query data in place, as it resides in data lake storage. This is one way we're different from a cloud data warehouse or a database. And it's not just for exploring data: Dremio was built from the ground up to deliver high-performing BI dashboards and interactive analytics on data lake storage, and we do this by baking in a collection of query acceleration technologies.

The second key component is our semantic layer, which is the experience we provide for both technical and non-technical users. For technical users, our semantic layer enables teams to assign consistent meaning to data that can be leveraged by any downstream application, instead of needing to create individual definitions and calculated fields in isolated BI tools. It lets data analysts and data scientists discover, curate, analyze, and share data sets in a self-service manner. It's a great experience for data analysts because now they can run high-performing dashboards on their data immediately, without needing to wait for it to land in a data warehouse, and they can easily integrate as much data as they need into their dashboards without calling upon their data engineering counterparts. One of the neat things for analysts is that all of our query acceleration technologies work transparently, so they can query data in its logical form without having to connect to specific tables or materializations. It's also a great experience for data engineers, because now that line-of-business users can analyze data on their own terms, engineers can focus on more strategic projects instead of maintaining data pipelines.
Okay, so that's the high-level view, a little bit of a marketing take on it, but I wanted to get that out of the way to explain what Dremio is from a very high level. If there are any questions we can jump in, but if not, we can get into the more technical side of the talk.

Dremio is built on open source technologies. Here are a few of the open source frameworks that we're using. The first one, and the one I'm most interested in, is Apache Calcite, which is a query optimizer framework. Another one equally important to Dremio is Apache Arrow, which is our in-memory columnar storage format. In fact, Dremio was instrumental in launching the Apache Arrow project, and although a lot of other groups outside of Dremio have contributed vast amounts to it, we were there at the very beginning and continue to work with that community. Within Apache Arrow specifically, one thing we've added is what we call Gandiva, which is an LLVM-based analytical expression compiler that works natively against the Arrow memory format. A more recent component we've added is Apache Arrow Flight, a framework for fast data transport. Apache Parquet is another one we use; it's our default columnar storage format for storing data, and I think, to some extent, Apache Parquet has already become the default for columnar storage in the big data world. And finally, Apache Iceberg, a table format for large datasets. This is another one we're using extensively, and we're also involved in the community to try to promote it. You might ask what the difference is between a storage format and a table format; Vivek will get into more details about that later, so I'll leave that to him.

Just one quick slide about who we are. My name is Steven Phillips. I'm currently the manager of the query optimizer team. A little bit of my background: I'm actually a physics PhD dropout. I'm also a committer and a PMC member on Apache Drill and Apache Arrow. I've been at Dremio since basically the beginning; I'm not a founder, but I joined pretty much immediately after the company was started. I've only been a manager for the last year and a half or so; before that, I was an individual contributor and tech lead. And I have not always been working on the query optimizer; that's been the past couple of years, but before that I worked on execution: runtime operators, various storage connectors, and so on.

I also want to introduce Vivek. I've worked with Vivek for about 10 years now, both at Dremio and at our previous company, MapR Technologies. As you can see, Vivek has a PhD in computer science from Georgia Tech, and he's a really great distributed systems engineer.

Oh, I just wanted to mention one thing, Andy: I've actually watched a few of your class lectures, and I recall at one point you mentioned something about people with physics backgrounds being great at optimizers. I'm not using myself as a data point in proof of that, and I don't know if this proves anything, but I thought it was an interesting point. So that's why I included the "physics PhD dropout."

What kind of physics? Well, like I said, I was a dropout, and my research was in experimental nuclear physics.
I like high-energy physics. It would be more like mid-energy physics; it's not quite at that level. Okay. Okay, awesome. Keep going. Yeah. Okay.

Alright, so first off, let's talk about query planning. This is my area of expertise. Like I said, we use Calcite. Calcite, and this is taken from their GitHub, provides an industry-standard SQL parser and validator, a customizable optimizer with pluggable rules and cost functions, logical and physical algebraic operators, and various transformation algorithms from SQL to algebra.

We use Calcite extensively throughout our optimizer. However, it's not as simple as saying that we just use Calcite and it optimizes. Calcite is a framework that provides a lot of the building blocks for an optimizer, and figuring out how it actually gets used is basically our job on the optimizer team at Dremio, along with enhancing Calcite and fixing bugs in Calcite itself.

So where is it actually used? Here are the basic steps in the query optimizer where Calcite is used. First off is parsing: parsing the SQL means converting from a SQL string to a SQL parse tree. The next step is validation. Validation is the process of verifying not only that the syntax is valid, but that the query actually makes sense: does it reference columns that exist, do the arguments match the function signatures, do the types match, and so on. The next step is converting to what you might call the abstract syntax tree, or the logical plan. And finally there's optimization. This is actually multiple phases of optimization in the Dremio optimizer, and the goal is to go from this initial logical plan to an optimal physical plan, or what we think is an optimal physical plan.

In Calcite there are two types of planners, and each phase of planning uses one of these two types: the heuristic planner and the Volcano planner. The heuristic planner you can also think of as a rule-based planner. Basically, it takes a set of rules and applies them over and over again until they no longer apply; each time, it transforms the tree and throws away the old one, and once no more rules apply, that's the end of that phase of optimization. The Volcano planner, on the other hand, is a cost-based optimizer and uses dynamic programming. As rules are being applied, it keeps track of all equivalent alternatives, and after it has exhaustively explored the space of possible plans, it chooses the best-cost plan.

Like I said before, Dremio has multiple phases of planning, and different phases will use either the heuristic or the Volcano planner. Some examples of phases that use heuristic planning include what we call the reduce-expressions phase, which does a lot of constant reduction. We have phases focused on filter pushdown, including what we call transitive predicate inference across joins. And then we also do join optimization in a heuristic phase. However, that's a little bit misleading: it's really more of a hybrid between the heuristic and Volcano planners, because cost does come into play as well. I won't go into the details about that right now.

The Calcite library provides a couple of key abstractions for the optimizer. One is the RelNode.
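To make the heuristic-planner idea above a little more concrete, here is a minimal sketch in plain Java. It is a toy, not Calcite's actual HepPlanner/RelOptRule API: a hypothetical Rule either matches a node and returns a rewritten subtree or declines, and the planner keeps applying the rule set over the tree until a full pass changes nothing.

```java
import java.util.List;
import java.util.Optional;

// Toy plan nodes: e.g. Scan("t"), Filter(cond, input), Project(cols, input).
interface PlanNode {}
record Scan(String table) implements PlanNode {}
record Filter(String condition, PlanNode input) implements PlanNode {}
record Project(List<String> columns, PlanNode input) implements PlanNode {}

// A rule is a locally valid transformation; it returns empty when it does not match.
interface Rule {
    Optional<PlanNode> apply(PlanNode node);
}

// Push a Filter below a Project (safe here because this toy Project only drops columns).
class FilterPastProjectRule implements Rule {
    public Optional<PlanNode> apply(PlanNode node) {
        if (node instanceof Filter(String cond, Project(List<String> cols, PlanNode input))) {
            return Optional.of(new Project(cols, new Filter(cond, input)));
        }
        return Optional.empty();
    }
}

final class HeuristicPlanner {
    // Apply rules bottom-up until a full pass makes no change (a fixpoint).
    static PlanNode optimize(PlanNode root, List<Rule> rules) {
        PlanNode current = root;
        boolean changed = true;
        while (changed) {
            changed = false;
            PlanNode next = rewrite(current, rules);
            if (!next.equals(current)) { current = next; changed = true; }
        }
        return current;
    }

    private static PlanNode rewrite(PlanNode node, List<Rule> rules) {
        // Rewrite children first, then try each rule on this node.
        PlanNode withChildren = switch (node) {
            case Filter(String c, PlanNode in) -> new Filter(c, rewrite(in, rules));
            case Project(List<String> cols, PlanNode in) -> new Project(cols, rewrite(in, rules));
            default -> node;
        };
        for (Rule rule : rules) {
            Optional<PlanNode> result = rule.apply(withChildren);
            if (result.isPresent()) return result.get();
        }
        return withChildren;
    }

    public static void main(String[] args) {
        PlanNode plan = new Filter("x > 5", new Project(List.of("x"), new Scan("t")));
        // The filter ends up below the project after one pass; a second pass changes nothing.
        System.out.println(optimize(plan, List.of(new FilterPastProjectRule())));
    }
}
```

The Volcano planner differs in that, instead of throwing away the old tree on every match, it would keep every equivalent alternative and pick the cheapest one at the end, as described above.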
The RelNode is the basic node of the logical plan. It expresses a relation; examples would be a filter, a project, a join, and so on. These RelNodes can represent both logical and physical concepts. For example, a logical aggregate would be an aggregate node that is implementation agnostic, whereas a hash aggregate is a specific physical implementation of aggregate. And then RelOptRule is the base class for rules, which both types of planners understand. A rule represents a logically valid transformation of part of the plan. A few examples of what a rule can be for: rewriting the plan in a way that is possibly more efficient, for example pushing a filter below a join; converting one type of RelNode to another, logical to physical, such as a logical join to a hash join. We also use rules for logical rewrites; a simple example would be that the average function can get converted into some combination of a sum and a count. So those are some of the uses of rules.

Alright, so moving on, the next topic I want to talk about is statistics. Here's the problem: the Volcano planner, and also the join optimizer, need accurate estimates if they're going to produce optimal plans. They rely quite a bit on estimates of things like selectivity and row count, and it turns out that bad guesses can result in very bad plans. One example would be a massively expanding join followed by a filtering join. In case that's not clear, what that means is that sometimes a join, especially when there are lots of duplicate values on the join key on both sides, can produce a very large output from a relatively small input. A filtering join would be one where, after the join, not many rows remain. Ideally, we'd want to plan the joins so that the filtering join comes first and the expanding join follows second. This is something the optimizer tries to do, and the way we can improve it is with statistics.

Here are some of the statistics that we collect. Right now this is a manual task; it's not mentioned here, but there's an ANALYZE TABLE command. The statistics we collect right now are the following. First, approximate count distinct. This is done using a HyperLogLog sketch, and it provides better estimates for things like filter selectivity, aggregate row count, and join output size. We also collect another sketch called a T-Digest. This essentially stands in for a histogram: T-Digest is a sketch that lets you ask for some percentile, say the 15th percentile or the 99th percentile, and it returns approximately the value corresponding to that percentile. And finally, we have an Items Sketch, which is used to solve the heavy-hitters problem. That's useful for identifying skew, and it can also improve filter selectivity estimation in some cases.

Yes, Jim Apple, Facebook here. How does HyperLogLog solve the filter selectivity problem?

Well, it doesn't completely solve it. I was just saying it helps compared to nothing, if we don't know anything about the cardinality of the field that we're filtering on.
We really just have no way of knowing what the selectivity will be if we apply that filter, but we can at least make a better educated guess. If all we have is approximate count distinct, we can say: if we know that the row count is X and the distinct count is Y, then a selectivity of roughly one over Y, one over the distinct count, is a better estimate than what we had before. But I agree with you that it's not perfect; it isn't taking into account the skew and so on. These other sketches also help with that. Does that make sense?

Yep.

Yeah, that's a very good point. It doesn't solve the problem, I don't want to say that, but it's better than nothing. The basic idea is that an equality filter on a low-cardinality field, we can guess, will have a higher selectivity than one on a high-cardinality field. But yes, T-Digest, at least for numeric range queries, will be much more useful in a lot of cases. And then the Items Sketch: if we know that we're filtering to select a high-incidence value, then we can take that into account as well.

Thanks. Good question.

One thing to note, and maybe you're going to get to this later, is that this is the perennial problem: the quality of your stats affects the decisions your optimizer makes. Maybe you'll get to this when you talk about query execution, but is there anything you do in the query executor that can be adaptive to wrong statistics, or do you bake the query plan once and just run it, no matter what?

The short answer is right, we bake the query plan once. We do have some features currently in progress, which will hopefully be coming out within the next couple of months, that relate to adaptive execution, specifically around aggregations. But the way Dremio executes is what you might call very optimistic execution: essentially, we plan everything and schedule everything upfront, so it doesn't lend itself very well to adapting mid-query. So the general approach we're going to take in that area is to adapt by starting over: run the query again, learn from the first run, and then run a better plan. Hopefully this helps a little bit; sorry if that wasn't clear. So yeah, like I said, there are some adaptive execution enhancements being made, but from my standpoint, the approach that I think works best for us is to learn between runs of the query. Maybe the first time you run it, it won't be a great plan, but the second time, it will be improved because we've learned some things about the query. Does that make sense?

Yep. And then Hamid has a question.

Yeah, so regarding the statistics, my understanding is that you're running SQL queries to collect the statistics.
So when you want to get the column cardinality, you end up with a large number of aggregate functions with DISTINCT, like COUNT DISTINCT, which SQL engines are not really good at, because they focus on a precise answer instead of a statistical answer. So do you have special SQL for your statistics collection?

I mean, it's not really special SQL; we have function implementations for all of these. One thing to note about these three sketches is that they're all mergeable. What that means is that you can collect partial sketches on portions of the data and then merge the results into one final sketch, or data structure. So this allows for both parallel execution and incremental updates of these stats. So yes, we have SQL functions, special aggregate functions, for computing HyperLogLog, T-Digest, and the Items Sketch. Does that answer your question?

Yeah. Okay, so are these functions also available to, say, data scientists, or are they internal only?

That's an interesting question. Technically they are available; we don't block anyone from using them, but they're not exactly supported functions. They're kind of meant to be internal. If we started getting customers who wanted to use them, that's something where we would publish more documentation and best practices and so on. So right now they're meant to be internal only, but technically no one's blocked from using them.

Okay. Good. Thank you.

Okay. Alright. Thanks. Great questions.

The last question is: how often do you run ANALYZE? Because in the lakehouse you have many different writers, so you don't have an exact view of the churn rate of the data.

Right now it's basically admin driven; they'll run it as they need. But this is one of the areas we're looking to make more automatic. At a high level at Dremio, we don't like to have admins turning a lot of knobs and doing that sort of thing. Right now I look at stats more as a tool for our customers to use when they see that they have some queries performing badly; it's something they can use to improve things. I would not recommend that they go and run ANALYZE TABLE on every table in their system every day, blindly; there's a lot of cost and expense associated with that. The idea is that, hopefully before too long, this becomes more of an automated thing that the customer doesn't even have to think about, but we're not there yet. And I know, Andy, you probably have a lot to say about this sort of thing, with the work you're doing on OtterTune and so on; I'm sure this is something that interests you.

Yeah, that's another conversation. Keep going.

Okay, so anyway, next we have materializations, which we call reflections. It's a marketing term, but like I said, reflections are very similar to materialized views. Basically, we can accelerate queries by pre-computing expensive data cleansing operations, pre-computing aggregates, and denormalizing joins.
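Before moving on to reflections, here is a rough sketch, in plain Java, of the mergeable-sketch idea from the statistics discussion above. A HashSet stands in for the HyperLogLog sketch (an HLL merge likewise just combines two partial summaries), and the equality-filter selectivity guess of roughly one over the distinct count is the kind of estimate Steven described. Names are hypothetical, not Dremio's implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Stand-in for a mergeable distinct-count sketch. Dremio uses HyperLogLog;
// the key property illustrated here is that partial sketches built in
// parallel (or incrementally) can be merged into one final result.
final class DistinctCountSketch {
    private final Set<Object> values = new HashSet<>();

    void add(Object value) { values.add(value); }

    // Merging partials lets each scan thread or incremental update build its own sketch.
    void merge(DistinctCountSketch other) { values.addAll(other.values); }

    long estimate() { return values.size(); }
}

final class SelectivityEstimator {
    // Equality filter on a column: with no other information, guess 1/NDV,
    // i.e. every distinct value is equally likely (skew is ignored here).
    static double equalitySelectivity(long ndv) {
        return ndv <= 0 ? 1.0 : Math.min(1.0, 1.0 / ndv);
    }

    public static void main(String[] args) {
        // Two "parallel" partial sketches built over different splits of a column.
        DistinctCountSketch partA = new DistinctCountSketch();
        DistinctCountSketch partB = new DistinctCountSketch();
        List.of("CA", "NY", "TX").forEach(partA::add);
        List.of("TX", "WA").forEach(partB::add);

        partA.merge(partB);                      // 4 distinct values overall
        long ndv = partA.estimate();
        double sel = equalitySelectivity(ndv);   // 0.25 for an equality filter on this column
        System.out.printf("ndv=%d selectivity=%.2f%n", ndv, sel);
    }
}
```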
We can create materializations both on views and on base tables. A couple of examples of why it's beneficial to create a reflection on a base table: it can be used to rewrite the data from an inefficient format like CSV or JSON into a much more efficient columnar format, namely Parquet. It also allows specifying partition and sort columns, which enables partition pruning and page skipping. The optimizer then determines which reflections satisfy the query and chooses the best reflection among all the possible choices.

I just want to go through a few examples of the types of situations where reflections can be used. One would be if we have, say, a filter on top of an aggregate-project-scan query. You can see here that the user query is grouping by column A and computing a sum on column C. Now, the materialization was created with a similar aggregate, but it groups by both A and B. Our optimizer can figure out that this materialization satisfies the query, but it needs to rewrite it with a roll-up aggregation to group by just A. And so we are able to rewrite the user's query into a new query that uses the reflection.

Another one is join-aggregate transposition. You can see here that the original user query is doing an aggregate on top of a join of tables T1 and T2. We have a materialization that is only an aggregate on table T1. We're able to figure out that we can push the aggregate below the join, match the reflection, and then do a roll-up aggregation on top of the join. This is a very common use case with a star-schema kind of setup, where you have a fact table joined with various dimension tables: you can create your aggregation reflections on just the fact table and not worry about denormalizing the joins and aggregates and so on.

Yeah, this assumes that the join is lossless, because if the query's join is lossy, then some tuples will be left out and your aggregate value is going to be different from the materialized view. So it effectively requires referential integrity to do that. We do exactly that optimization in DB2, but it has to be lossless, because joins can eliminate rows, and then the aggregates will be different if the join is lossy.

Well, I guess maybe we can take this offline or something; I don't want to spend too much time on it.

Yeah, that's all right. Just go ahead.

Yeah, that's a good question. So this slide right here is a simple case where the materializations are partitioned by different columns and we're filtering on column A. You see that we have two reflections: one is partitioned by A, the other by B. The optimizer will choose the correct one and then prune partitions based on column A.

And finally, we have a feature which we call Starflake. This is actually more related to what you were just asking about a second ago, about lossless joins. The issue that comes up is: let's say we create a materialized view on top of some three-way join. Maybe we have a fact table joined with two different dimension tables, and we materialize that.
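As a rough illustration of the roll-up matching described above, here is a hypothetical sketch in Java (not Dremio's actual matcher): a reflection grouped by {a, b} can serve a query grouped by {a} as long as the query's grouping keys are a subset of the reflection's and every aggregate can be re-aggregated from the reflection's partial results.

```java
import java.util.Set;

// Which aggregate calls can be re-aggregated ("rolled up") from a coarser
// pre-aggregation? SUM/MIN/MAX roll up directly; COUNT rolls up as a SUM of
// the partial counts; AVG would first be rewritten as SUM/COUNT (as mentioned
// earlier in the talk). COUNT(DISTINCT x) does NOT roll up from partials.
enum AggKind { SUM, MIN, MAX, COUNT, COUNT_DISTINCT }

record AggCall(AggKind kind, String column) {}

final class ReflectionMatcher {
    // A reflection grouped by reflectionGroupBy can answer a query grouped by
    // queryGroupBy if the query's keys are a subset of the reflection's keys,
    // every aggregated column was materialized as a measure, and every
    // aggregate rolls up.
    static boolean canRollUp(Set<String> queryGroupBy,
                             Set<AggCall> queryAggs,
                             Set<String> reflectionGroupBy,
                             Set<String> reflectionMeasures) {
        if (!reflectionGroupBy.containsAll(queryGroupBy)) return false;
        for (AggCall agg : queryAggs) {
            if (agg.kind() == AggKind.COUNT_DISTINCT) return false;
            if (!reflectionMeasures.contains(agg.column())) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Query:      SELECT a, SUM(c) FROM t GROUP BY a
        // Reflection: SELECT a, b, SUM(c) FROM t GROUP BY a, b   -> usable via roll-up
        boolean ok = canRollUp(
            Set.of("a"), Set.of(new AggCall(AggKind.SUM, "c")),
            Set.of("a", "b"), Set.of("c"));
        System.out.println(ok);  // true: group the reflection by a and SUM again
    }
}
```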
But then we have some query that's only looking at the fact table joined with one of the dimension tables. Generally speaking, we can't use the materialization, because we don't know for sure whether the join is lossless, which I think is what you were talking about. The join that we're not including in the query could have removed rows, or it could have added rows, so without doing anything special, we can't use this materialization. However, if these joins are what you might call extension joins, I've also heard the term cardinality preserving, then it's okay. So what we actually do is, when we create the materialization, we check whether each join was cardinality preserving, and if it was, we keep track of that. Then, automatically, when we see this situation, we take that into account and can use the materialization even though the query does not include all of the joins. So it's automatic; the user generally shouldn't have to think about this or do anything for it to work, is what I mean by that.

All right, so that's all the slides I have. Are there any questions before I hand it over to Vivek?

Yeah, hello everyone, this is Vivek. I'm going to go over some of the challenges, specifically in query execution. Steven, can you go to the next slide, please?

All right, so I think the first set of challenges has to do with really large datasets. We have some customers who have millions of files in S3, and that constitutes a table for them. So one of the challenges Dremio has is that prior to running a query, we need to collect the metadata just to identify all the files that belong to the table and organize them so we can quickly run a query against it. The way Dremio was doing this earlier was that we would store this metadata in an internal database, and the entire metadata was organized by partitions. By partitions, I mean that most of these tables and datasets are partitioned by, say, year and month: for one year and month there's a bunch of files related to it, and for another year and month there's another bunch of files. So there are two columns, year and month, which are partition columns, and the metadata Dremio was storing was organized in that fashion: a tuple of year and month followed by the whole set of files that belong to it. During planning, the planner was essentially reading this metadata and applying the WHERE clauses on these partition columns to prune all the files that don't need to be there. Once that metadata was read, the planner had a good idea of which files actually need to be processed, and that's how it created the plan.

One of the drawbacks was that the planner is doing too much work: it's not only computing a lot of stuff, it's also reading a lot of data from its internal database, and with datasets of millions of files that tends to be a problem. It's basically single threaded, bottlenecked on a single node, and we're not able to start reading the data files until the planner is done with all of this, creates a distributed query plan, and ships it off to all the workers. The second problem, kind of related, is in join queries: can we do better at limiting the number of files that we need to read?
So I want to start with these problems, and if there's time, we can go on to some of the other problems we've been working on. Next slide.

So recently we switched the way we collect and store metadata: as opposed to storing it in our own custom internal table, we started using Apache Iceberg where we can. A couple of points about Apache Iceberg: it's an open source project that also offers ACID transactions on tables in the data lake. The Iceberg table format has metadata files which identify all the files that belong to the dataset, but in addition, it annotates each and every file with the partition it belongs to, and for each file it has upper and lower bounds for each of the columns in the file. These can be used to prune as many files as possible.

With the new query planning and execution flow, the planner very quickly decides that it's going to use the Iceberg table metadata for the execution, and the plan it generates now has two different scan operators: one is a metadata scan, and the metadata scan feeds data into the data scan. The metadata scan is essentially reading the Iceberg manifests, and that can happen in parallel; it's not restricted to a single node. Depending on the query, we add filter operators on top of the metadata scan to take advantage of all the partition column information and the upper and lower bounds, to eliminate as many files as possible. That filtered set of data files goes to the data scan operator.

With this mechanism, we've been able to get parallelism in all the metadata scan operations, and the partition pruning that the planner used to do on a single node, we've managed to parallelize. We also managed to start the metadata scan and the data scan concurrently: for large tables, if your filter is very selective, the data scan can proceed alongside the metadata scan and finish soon after it does. The other big benefit for us is that it reduces the memory requirement on the planner and improves reliability.

So how does the data skipping work? Are you using Iceberg for that?

Yeah, we use Iceberg for that. The metadata scan operator is essentially reading the manifest files, and the manifest file has, for each data file, the partition column values and the upper and lower bounds for the non-partition columns. So we apply the WHERE clauses on the data that comes out of the metadata scan.

In terms of the reliability improvements, is this because you're no longer quite as dependent on the metadata sitting on a single machine, a single point of failure, as you used to be?

It's more about the amount of memory that is required. With such large datasets, millions of files, especially when the file names are long, like 200-byte file names on S3 and such, it takes huge amounts of memory during planning if you're loading all the metadata into memory. So it keeps you away from reliability issues like OOMs. Yes. Next slide, please.

So this is what we call runtime filtering, and you've probably seen something similar elsewhere. I have a SQL query on the right. I went to Georgia Tech, so this is a query that's trying to find the sum of sales that happened at stores around Georgia Tech, within that area. It's doing a join of the store table and the sales table, the sales table being the larger one. And in this case, the set of matching store IDs is not really known at planning time.
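Before continuing with the runtime-filtering example, here is a minimal sketch of the manifest-based pruning just described: each data file entry carries its partition values and per-column lower/upper bounds, and files that cannot contain matching rows are dropped before the data scan ever sees them. This is an illustration of the idea in plain Java, not Dremio's or Iceberg's actual API.

```java
import java.util.List;
import java.util.Map;

// What the metadata scan yields per data file: partition values plus
// per-column min/max bounds, as recorded in the manifest entries.
record DataFile(String path,
                Map<String, String> partitionValues,
                Map<String, Long> lowerBounds,
                Map<String, Long> upperBounds) {}

final class FilePruner {
    // Keep a file only if (a) its partition value matches the partition
    // predicate and (b) the [lower, upper] range of the filtered column can
    // contain the predicate value. Everything else is skipped before the
    // data scan ever reads it.
    static List<DataFile> prune(List<DataFile> files,
                                String partitionCol, String partitionValue,
                                String boundedCol, long equalsValue) {
        return files.stream()
            .filter(f -> partitionValue.equals(f.partitionValues().get(partitionCol)))
            .filter(f -> {
                Long lo = f.lowerBounds().get(boundedCol);
                Long hi = f.upperBounds().get(boundedCol);
                // Missing bounds: be conservative and keep the file.
                return lo == null || hi == null || (lo <= equalsValue && equalsValue <= hi);
            })
            .toList();
    }

    public static void main(String[] args) {
        List<DataFile> files = List.of(
            new DataFile("s3://bucket/sales/month=2021-01/f1.parquet",
                Map.of("month", "2021-01"), Map.of("store_id", 1L), Map.of("store_id", 50L)),
            new DataFile("s3://bucket/sales/month=2021-02/f2.parquet",
                Map.of("month", "2021-02"), Map.of("store_id", 1L), Map.of("store_id", 50L)),
            new DataFile("s3://bucket/sales/month=2021-01/f3.parquet",
                Map.of("month", "2021-01"), Map.of("store_id", 900L), Map.of("store_id", 950L)));

        // WHERE month = '2021-01' AND store_id = 42  ->  only f1 survives
        prune(files, "month", "2021-01", "store_id", 42L)
            .forEach(f -> System.out.println(f.path()));
    }
}
```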
We know the geographical area, but we don't know the store IDs, so it's really hard to figure out how to reduce the number of files that we need to read on the probe side. The store IDs could be a handful, like five or ten, because it's a very narrow area. So, the solution in this case, next slide, please.

During query execution, what we do is this: once we're done reading the build side table completely, the hash table data structure that is created on the join side tells us exactly the store IDs that we're interested in, and they're a much smaller subset. So we take those and send them as a runtime filter to the scan at that point. The scan again has all those statistics that we talked about, along with the partition column information, and those are used to eliminate a large percentage of the files that would otherwise need to be read. This is something we implemented recently, and we've seen quite a bit of performance improvement; I'll talk about the numbers later. Next slide. Yes.

There are two more challenges I want to talk about. One is that we need to evaluate complex expressions in both the filter (the WHERE clause) and the project operators. In traditional databases there are indexes, which narrow down the number of rows you have to apply these expressions to, but in data lakes there are really no indexes at this point in time. So we end up having to evaluate these expressions over a large number of rows, and we need to do it fast. A couple of things that help us here are Arrow, which, like Steven said, is a columnar in-memory layout, and code generation, which we'll talk about in the next few slides.

And then there's this whole notion that if you're running queries against datasets in S3 or Azure, how are you going to deal with the network latency? The traditional prefetching techniques, where you overlap I/O with computation, don't work very well with columnar file formats, because as you're reading, say, four different columns, those four columns are not contiguous in the Parquet file. The traditional assumption that the file is going to be sequentially accessed doesn't hold; the Parquet file is going to be randomly accessed. So we needed to come up with a different solution, and caching is the obvious one; and designing a cache for huge amounts of data, 100 terabytes worth, has some interesting challenges. Next slide, please.

So, Arrow. This is a very brief introduction; I believe Wes is going to talk about Arrow in a couple of weeks in this series, so I'm not going to spend much time on it, except to say it's a columnar format. Here is a table with three columns: name, age, and phones. Let's talk about the age column first. Age is an integer, and Arrow stores integers in four bytes, so if you want to store a whole bunch of rows, you allocate that much memory and lay the values out sequentially in memory. Name is a string, a varchar column, and it is laid out in memory using two blocks of memory.
One is the values buffer, where the individual values like Joe and Jack are laid out sequentially; there are no column separators, no null terminators or anything like that. But then you have another buffer, the offsets, which tells you both the length and the offset into where the values are. For example, to read a particular row of the name column, I go into the offset vector and find that that row's value starts at, say, offset three and has a length of four, because the next one starts at seven and seven minus three is four, so I read the four characters there and that's my name value. That's how the variable-length columns are encoded. And similarly the lists, the arrays, like the phones column: that's encoded using three blocks of memory: the values, very similar to the name values buffer; the string offsets, which are offsets into the values; and then another offsets vector for the lists.

Next slide, please. Yeah, so, as you can see, as opposed to a row-wise layout, it's a columnar layout. It's randomly accessible, which is a very good thing for us, and it can take advantage of modern CPU architectures through pipelining and SIMD instructions. For example, one of the things we're able to do is, if the host machine has large, wide registers, say 128-bit wide registers, we can load four integers into a register and then do SIMD operations on two registers: if we want to add two integer columns, we can add them in one instruction instead of doing it four times. So that helps us with vectorization and all of that. And also, because it's columnar and we deal with columns, if I want to send only the name and age columns over the wire, and not the phones column, I can use scatter-gather I/O to take the data and write it to the TCP buffer without having to do another memory copy. All of this helps improve performance.

Next slide, please. So this is another thing we had to do to get performance when evaluating large expressions. Let me introduce LLVM. LLVM is a collection of compiler tools that helps you generate code for the host architecture on the target machine at runtime. You can write code in different languages and get it to compile. It supports a wide range of compiler optimizations, and you can actually choose which optimizations you want to apply, so you can get very efficient code. And it supports both static and runtime code generation; I'll talk a little bit about that. This is what we use to dynamically generate code at query execution time, based on the expression that comes in as part of the filter operator; we use LLVM to generate code for the host machine that runs very fast.

Next slide, please. So effectively what happens is, think of your abstract syntax tree for an expression: the internal nodes of the tree are functions, which could be an extract-year function or a regex function. What we do with Gandiva is implement all these functions in C++, and LLVM allows you to take that C++ code, the first box on the top, which is the static code generation: at compile time, it takes all those functions you've written in C++ and compiles them into an IR format for later use.
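As a toy illustration of the varchar layout described above (plain Java, not the actual Arrow Java library API): one contiguous values buffer with no separators, plus an offsets buffer, so reading row i is just slicing between offsets[i] and offsets[i+1].

```java
import java.nio.charset.StandardCharsets;

// Toy Arrow-style varchar vector: one contiguous values buffer with no
// separators, and an offsets buffer with N+1 entries. The length of row i is
// offsets[i+1] - offsets[i], and its bytes start at offsets[i].
final class VarCharVector {
    private final byte[] values;
    private final int[] offsets;   // offsets.length == rowCount + 1

    VarCharVector(String[] rows) {
        int[] offs = new int[rows.length + 1];
        int total = 0;
        for (int i = 0; i < rows.length; i++) {
            total += rows[i].getBytes(StandardCharsets.UTF_8).length;
            offs[i + 1] = total;
        }
        byte[] buf = new byte[total];
        for (int i = 0; i < rows.length; i++) {
            byte[] b = rows[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, buf, offs[i], b.length);
        }
        this.values = buf;
        this.offsets = offs;
    }

    // Random access: no scanning for separators, just two offset lookups.
    String get(int row) {
        int start = offsets[row];
        int len = offsets[row + 1] - start;
        return new String(values, start, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VarCharVector names = new VarCharVector(new String[] {"Joe", "Jack", "Amy"});
        System.out.println(names.get(1));  // "Jack", read from offsets [3, 7)
    }
}
```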
At that point, all these SQL functions have been converted into IR format. Then the runtime code generation handles things like conditional blocks, boolean AND/OR operators, or CASE statements, and that is generated at runtime with all the optimizations one would expect. The combination of these two is merged into a single LLVM module, which is compiled for the target machine, and that's how we get the final machine code for execution. This has given us huge performance improvements, up to 90x, especially when the expressions are large.

Out of curiosity, instead of taking the C++ code of the built-in functions, converting it to IR, and then inlining it into the query plan that you're generating dynamically, have you measured the performance difference of having the dynamic codegen for the query plan invoke the C functions that are already pre-compiled, versus inlining them? I mean, the question is, how much benefit are you seeing by merging them together and then doing the cross-optimization?

We haven't actually done what you said. One of the things for us is that we do not know which machines customers will choose to run Dremio on, so it becomes very difficult for us to, I guess, do what you said.

But you have a Dremio binary running, right? So the Dremio binary would have to have its pre-compiled SQL built-ins.

Yeah, all of these are statically linked into the Dremio binary; there's no dynamic dependency. We don't assume that there are, for example, Boost libraries on the host machine; everything Dremio uses is in the binary.

The question is this: you have this registry of all the IR for the built-ins, right? And then at runtime you generate the IR for the query plan, you inject the IR of the built-ins, you take that giant module IR, pass it to LLVM, and LLVM compiles it all at once. An alternative is that you generate IR that then makes function calls into the pre-compiled SQL built-ins. I was just curious whether you've compared them. It sounds like no.

No, no, we haven't done that.

Okay, because some LLVM engines actually do what I described: pre-compile the built-ins and then call them at runtime. You guys might be the only ones that actually do this IR-registry thing. I'm just curious how much of a difference it makes. I'm not surprised that you're getting 90x over interpreted execution; I'm just curious how much of that comes from the merging.

Sure, yeah. Steven, can you go to the next slide, please? Yeah.

So this is the caching problem I was talking about. Like I said, our datasets are in the cloud and there's a large latency, and there are a couple of challenges. One is that the files that are going to be processed by a query depend on the query, especially with partition pruning. The second challenge is that the columns that are accessed also depend on the query, so it's not like we want to cache the entire Parquet file or anything like that. And one of the things we wanted was zero administration, and that ruled out memcached or Redis or a standard solution like that; we had to build our own homegrown caching equivalent.
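Going back to the code-generation discussion for a moment: Gandiva lowers the expression tree to LLVM IR and compiles it once per query, so the per-row loop runs native code with no interpretation. As a very rough stand-in for that idea, here is a toy Java version that "compiles" a hypothetical expression tree into composed lambdas instead of IR; the structure and names are invented for illustration, not Dremio's or Gandiva's API.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

// Toy expression tree over a single long column, e.g. (col + 10) * 2.
sealed interface Expr permits Col, Lit, Add, Mul {}
record Col() implements Expr {}
record Lit(long value) implements Expr {}
record Add(Expr left, Expr right) implements Expr {}
record Mul(Expr left, Expr right) implements Expr {}

final class ExprCompiler {
    // "Compile" once per query: collapse the tree into one composed function so
    // the per-row loop below never walks the tree. (Gandiva instead lowers the
    // tree to LLVM IR and compiles it to machine code for the host CPU.)
    static LongUnaryOperator compile(Expr e) {
        return switch (e) {
            case Col c -> {
                yield v -> v;
            }
            case Lit l -> {
                yield v -> l.value();
            }
            case Add a -> {
                LongUnaryOperator lhs = compile(a.left());
                LongUnaryOperator rhs = compile(a.right());
                yield v -> lhs.applyAsLong(v) + rhs.applyAsLong(v);
            }
            case Mul m -> {
                LongUnaryOperator lhs = compile(m.left());
                LongUnaryOperator rhs = compile(m.right());
                yield v -> lhs.applyAsLong(v) * rhs.applyAsLong(v);
            }
        };
    }

    public static void main(String[] args) {
        Expr expr = new Mul(new Add(new Col(), new Lit(10)), new Lit(2)); // (col + 10) * 2
        LongUnaryOperator compiled = compile(expr);

        long[] column = {1, 2, 3, 4};
        long[] out = new long[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = compiled.applyAsLong(column[i]);  // tight loop, no tree interpretation
        }
        System.out.println(Arrays.toString(out)); // [22, 24, 26, 28]
    }
}
```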
So we decided to cache files locally on NVMe or SSD and manage that whole thing without additional infrastructure that the customer has to install and manage. One of the challenges we had was getting effective utilization of the cache: we really don't want the same file to be cached on each and every node in the execution engine, because that would not be utilizing the combined cache space across all the nodes. For a given file, we ideally want to cache it on a single executor node, or maybe a couple of executor nodes, not everywhere. And we use consistent hashing for that. The way it works is that the row count estimates and the costs in the plan determine the parallelism of the scan operator, and the parallelism defines how many nodes need to be picked. Once the nodes are picked, we use consistent hashing to take all the files we want to scan and map them onto those nodes. That's how we ensure that the same file, if it needs to be scanned multiple times, goes to the same node again and again as much as possible, and that lets us use the cache very effectively.

And because this is a 100-terabyte cache, our initial eviction algorithm was, hey, let's implement LRU, right? But the amount of memory that the LRU linked list, with all the front and back pointers, requires to hold everything in memory for a cache of this size turned out to be prohibitive. We don't have that kind of memory, and we don't want to spare that kind of memory just to implement the eviction algorithm. So we ended up with a two-phase eviction algorithm. The algorithm first identifies tables that are least recently accessed; we maintain all kinds of statistics about how often our datasets are accessed, which files are accessed within them, and which portions of those files are accessed. Once the tables are identified, then within those tables we identify the file chunks that are cold. And we do all of this without using too much memory on the execution nodes. Next slide, please.

Yeah, so what is the chunk size you have here?

We use a one-megabyte chunk size: large enough to absorb the latency hit, but still not too small. Yeah, okay.

We're about three minutes over. How much more time do you need? How many more slides?

This is my last slide, I think.

Okay, keep going.

Yeah, so runtime filtering gave us about an 8x performance improvement. Maybe I should just leave it at this: different techniques give us different improvements on different types of queries, and that's how we're able to do interactive queries on these datasets.

I just want to say one thing about the runtime filtering 8x. I'm not really sure exactly what that means, because it's a little bit arbitrary; we could design a use case that gives much bigger gains than that. In fact, I think for some of our customers, runtime filtering is the difference between the query completing in a few seconds versus never completing, because it's kind of a bad query that they're running. Their query has things like cross joins, a really nasty query, and without runtime filtering it would take forever. So they can actually see a much bigger than 8x improvement.

Yeah, so I'm done.
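A minimal sketch of the consistent hashing Vivek describes for the cache (hypothetical names, not Dremio's implementation): each executor is placed at several points on a hash ring, and a file chunk maps to the first executor at or after the chunk's hash, so the same chunk keeps landing on the same node, and most assignments survive when executors are added or removed.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Consistent hashing: nodes are placed on a ring at several "virtual" points;
// a key (here, a cached file chunk) maps to the first node at or after its
// hash. Adding or removing a node only remaps the chunks nearest to it.
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private static final int VIRTUAL_POINTS = 64;

    ConsistentHashRing(List<String> nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_POINTS; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    String nodeFor(String chunkKey) {
        long h = hash(chunkKey);
        SortedMap<Long, String> tail = ring.tailMap(h);
        // Wrap around the ring if we ran past the last point.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes());
        return crc.getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(List.of("exec-1", "exec-2", "exec-3"));
        // The same chunk always maps to the same executor, so its cached copy is reused.
        System.out.println(ring.nodeFor("s3://bucket/sales/f1.parquet#chunk-17"));
        System.out.println(ring.nodeFor("s3://bucket/sales/f1.parquet#chunk-17"));
    }
}
```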
Alright, let's do Q&A, but since we're out of time, I'll just put up the obligatory we're-hiring slide. Go to www.dremio.com/careers; we're hiring across all teams. Hopefully, if any of what we talked about interests you, check us out and see if you can join us. I'll stop sharing for now.

Alright, I will clap on behalf of the audience. We have time for one, maybe two questions from the audience, if you want to go for it.

So how do you handle correlated subqueries? This is the N-squared problem, because for every tuple you potentially have to run the subquery and send data across all the nodes, so it's N squared.

Yeah, well, this is actually functionality that's handled by Calcite, and we decorrelate. The majority of correlated subqueries can be rewritten as a join; that's the quick answer. There may be some category of correlated queries that cannot be decorrelated that way, but they seem to be very rare. I don't think we've had it come up that a customer was not able to have their query decorrelated. Does that make sense?

Yeah, yeah, makes sense. I wrote a paper on that one, in 1995; it was all implemented in DB2 by 1998.

Yeah, there's a class of correlated subqueries, where the correlated subquery itself has a TOP-N type clause, that we don't handle. That's one example I know we don't handle. There is a way to decorrelate that using window functions, but we don't have that yet. To be honest, I don't think it's ever come up, or maybe it's come up once from a customer. In fact, I don't even know if correlated subqueries are that common at all.

Yeah, because most vendors cannot handle them, whereas the traditional systems, like Oracle and DB2, do a lot of it. That's why we did the decorrelation and conversion to window functions; it's been tried in DB2 and it works.

He's an IBM Research Fellow, so he's hiding behind no camera, but he knows the database business. Okay, great. Excellent question. The correlated subquery is a very challenging topic.

Alright, let me ask one question and we'll call it quits. What are the key features that distinguish Dremio from maybe the other two main competitors in the lakehouse space, Databricks and Presto or Trino? In your opinion, what makes Dremio better?

That's a good question. Well, one area that we're trying to differentiate on is performance, but that's a tough call. I mean, we obviously have our benchmarks to show, hey, we outperform Trino, and I think that's important, but I don't like to stake everything on that, because the performance war goes back and forth; of course we want the performance to be there to help our customers. So I think it's probably the things that we're focusing on: the semantic layer, and the ease of doing reflections, the reflection feature, making it easy for users to create materialized views and so on. That's one of the areas we're trying to differentiate on.

Reflections, is that a Dremio concept and term?

Yes, sorry, maybe I was breaking up when I was talking about it earlier. It's a Dremio term that we use for materialized views, basically.

But within a lakehouse or data lake? Yeah, I like it.

I think that term comes from one aspect of Dremio: early on, along with the semantic layer, we really emphasized the virtualization aspect.
We have terms like virtual datasets and physical datasets, and "reflections" kind of reflects that point of view. We're focusing more on the data lake and the lakehouse now, so the metaphors have changed a little bit, but we're keeping some of the old terminology, is how I'd put it.

To add to what Steven said, I think it's the materialized views, the reflections, but also the matching: the automatic matching algorithm that takes a user query and automatically figures out which reflection to use to accelerate it.

Steven, Vivek, thank you so much for being here. I realize also that Vivek is in India, where it's 4 or 5 am right now, so we appreciate you. You might be the winner for staying up the latest to give a talk, so we appreciate you spending time with us.