The Carnegie Mellon Quarantine Database Tech Talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you.

Guys, welcome. Thanks for coming. Today we have a guest speaker, Jackie Yan from Snowflake. Snowflake had its IPO last week, so we're glad he could find time away from his high-flying lifestyle to spend with us talking about databases. Jackie has been with Snowflake since 2015. Prior to that, he was a senior member of the technical staff at Oracle, and he did his undergrad and master's degrees at Duke University in computer science and mathematics. So again, we're super happy to have him here today. We also want to thank the Stephen Moy Foundation for Keeping It Real for sponsoring us. And we'll do like we do every week: if you have any questions, please unmute your mic, state who you are and where you're coming from, and ask Jackie a question. Please interrupt at any time — we want this to be interactive. Okay, go for it.

All right, I'll get started. Before I do, a disclaimer: some of these slides are from a previous presentation where I had talking points, and some I made recently without talking points, so I might be smooth on some slides and less smooth on others. Just bear with me.

I'll start with the agenda. First I'll give a brief overview of Snowflake's architecture. Then I'll give an overview of Snowflake's query optimizer, and we'll take a deeper look at some particular query optimizations that I find interesting. Finally, I'll touch on the philosophy of data-driven development that we follow at Snowflake.

So let's start with the Snowflake overview. This picture shows traditional OLAP databases, which fall into either the shared-disk camp or the shared-nothing camp. Both of these architectures have a lot of merit, but they also suffer from a number of bottlenecks, such as contention for resources, not enough processing power, and not enough elasticity and scalability. Snowflake provides a solution to these problems in the cloud environment by breaking the system into three distinct layers — storage, compute, and cloud services — where each layer scales independently. Our architecture takes advantage of cloud-native characteristics like scalability and elasticity and avoids the resource contention problems most often seen in traditional on-premises databases. We call this the multi-cluster shared data architecture, and it is also cloud agnostic, which enables Snowflake to run on all three major cloud providers.

At the core, the storage layer you see here is a single place that centralizes all the data, and Snowflake provides a large suite of tools for seamless ingestion and transformation of that data. It provides native support for both structured and semi-structured data and leverages the elasticity, scale, and cost of blob storage. Snowflake separates storage from compute, so customers can run multiple workloads across multiple teams on the same set of data without resource contention. More importantly, the multi-cluster compute layer allows workloads to scale up and scale out as needed, so workloads use precisely the compute they need. And at the very outermost layer, you see the Snowflake services layer, which is really the brain of the architecture.
As a workload hits Snowflake, this layer determines the unique requirements for getting the processing done in the most performant and cost-effective manner. The services layer takes care of query optimization, security, transaction processing, and metadata management automatically, with near-zero administration required. And finally, since customer data is potentially everywhere, Snowflake is as well. Being cloud agnostic means a customer can distribute their data across cloud regions, or even across providers, while keeping the same Snowflake experience and without having to deal with additional complexities such as managing connections and replication.

That's awesome marketing speak, but can you say roughly what percentage of the customers are AWS versus GCP versus Azure? Just out of curiosity.

I would say over 95% of customers are on either Azure or AWS. GCP is a pretty small percentage, both because Google BigQuery is a very good technical product and because Google Cloud is a relative newcomer in the cloud market. At the moment AWS still dominates the share — maybe 60 to 70%; I don't have exact numbers, but that's roughly my sense. But we see Azure growing very fast, especially outside the US, in Europe. In Europe the mind share is pretty much split between Azure and AWS, and in some cases Azure actually has a much larger market share than AWS there. In the US, AWS pretty much dominates, although companies that compete with Amazon usually prefer Microsoft — there are some very large companies that prefer Microsoft for exactly that reason. Yeah.

Okay, so, scale. Currently Snowflake has deployments in over 20 regions across four continents, on all three cloud providers, and we have hundreds of petabytes of data under management — this might be a little outdated, so it's probably more these days — with single tables whose compressed size is in the petabyte range and trillions of rows in a single table. We manage a fleet of tens of thousands of servers in each deployment, and the infrastructure is highly automated, elastic, and self-healing.

Now, before I go into the details of the optimizer, I want to mention a key architectural component of Snowflake: micro partitions. Micro partitions are partitions that are relatively small in size, and they correspond to immutable files stored on the cloud providers' blob stores. Each immutable file essentially follows a PAX-style hybrid columnar format. At the bottom right you can see there's a header with pointers to regions of compressed column data: the values for each column are grouped together, and each column is compressed with a different compression scheme depending on the properties of the data. Each micro partition is the unit of DML, of transactions, and of the change tracking mechanism inside Snowflake. This is very important for many reasons — it really simplifies transaction management in Snowflake a lot, and it also makes it much easier to collect optimizer stats, for example, as part of DML. And very large tables can contain many, many millions of micro partitions in Snowflake.
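To make that description concrete, here is a minimal sketch, in Python, of what a micro partition and its per-column metadata might look like; the field names and structure are illustrative assumptions for this writeup, not Snowflake's actual internals.

```python
# Illustrative sketch (not Snowflake's real code): a micro partition as an
# immutable, PAX-style file plus per-column metadata collected when it is written.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass(frozen=True)                      # immutable, like the underlying file
class ColumnStats:
    min_value: Any
    max_value: Any
    null_count: int
    distinct_count: int                      # an approximate NDV is also plausible here

@dataclass(frozen=True)
class MicroPartition:
    file_id: str                             # blob-store object holding the columnar file
    row_count: int
    compressed_bytes: int
    column_offsets: Dict[str, int] = field(default_factory=dict)   # header: column -> byte offset
    column_stats: Dict[str, ColumnStats] = field(default_factory=dict)

# A table is then just a (possibly very large) collection of such partitions,
# and DML produces new partitions instead of rewriting existing ones.
table: List[MicroPartition] = []
```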
And this kind of flexibility makes it very easy to apply fine-grained pruning on the data in a table, and it really makes many of our optimizations work very well. We also provide a way to rearrange the underlying layout of the micro partitions of a table: the automatic clustering service, where people can specify clustering keys for a table to rearrange its micro partitions and improve optimizations such as pruning. So now let's go into query optimization.

Jackie, we have a question. — Yeah. — Hi, I was wondering: from my understanding, you create columns from semi-structured data. Is there a limit on how many columns you'll create, and what happens to the extra columns if you have, say, a 10,000-column record in JSON? Do you stop creating new columns at some point, or how does that work?

Yes, there's a limit to the number of columns that we can essentially — sorry. — Oh, there's a limit? Keep going. — There's a limit to the number of columns we can, what we call, columnarize. This is an opportunistic optimization: our query engine deals very well with cases where a portion of the micro partitions have a particular path columnarized while a significant percentage of other micro partitions may not, and we handle that transparently. For the paths we do columnarize, there are two main advantages. One is that we have metadata for those columnarized columns, so we can apply the same optimizations we apply to regular columns to these sub-columns — pruning, for example. People might write a predicate on a particular sub-path in the JSON, and we would actually be able to prune and apply all the regular query optimization techniques on that column; we treat it effectively as a regular column. For execution it's the same thing — implicitly it behaves just like a regular column. The paths that are not columnarized we essentially treat as columns with unknown metadata. In many cases we cannot apply optimizations on those columns, but that doesn't impact anything with regard to correctness — everything still works — it's just that the non-columnarized columns are less performant. And we're looking into ways to make this more intelligent, where instead of making local decisions we could look at workload information and other signals to improve the efficiency of the process. — Yeah, thank you.

Hey Jackie, I have a question, Subtik here. A small question: this immutable data storage, this design — is it because the storage layer is an object store? Is that why you did it? Or would you build on an immutable storage layer anyway, even if it weren't just object stores underneath?

For the basic version of the Snowflake table, it's all on that immutable storage. And yes, the original design was based on that limitation, but we then found many, many good properties in it. For example, with immutable storage comes the use of micro partitions as the unit of DML and transactions.
If it weren't immutable, then many, many optimizations would not be purely metadata operations — we would actually have to look at the data. So this has actually been a blessing for us in many senses, even though the original reason we started looking into it was the limitation imposed by blob storage. — Thanks.

Okay, so let me go into an overview of query optimization. This is a very simple picture, and pretty much everybody knows the life of a query, so I'll just go over it briefly — it's by no means a comprehensive picture. At the top you see the cloud services layer, which receives the query. Queries are sent over through clients — ODBC, JDBC, the web UI, Python, any number of clients; I think we support over 10 or 15 clients these days, and we have hundreds of partner connectors. Once a query is sent to Snowflake, we do a very lightweight compilation that does the result cache lookup: we check whether the result of this particular query is already there, and if it is, we directly return it. If it didn't hit the result cache, we go on to actually compile the query — that's step three, where we go through the planner and optimizer to process the query. Here we list a few of the steps.

How aggressive is your result cache? Does it have to be an exact match? Are you playing games with the semantics of the query?

Yeah, that's a good question. We are not as intelligent as we'd like, but we do certain optimizations. For example, if the version of the table has changed — say there's been a change in the underlying table — we can still check what we call the scan set, which is the set of micro partitions left to scan after pruning, and if that's the same, we can still use the result cache. So we do certain checks before deciding whether a result is reusable. There are a bunch of things like that; there are also things like reusing the result of a subquery when it can be reused. But there are also things we don't do — so there are optimizations we do, and things we would like to do better.

Cool. — I forget the exact numbers, but we have pretty extensive monitoring of this usage, and it's actually a pretty widely used feature. It's a bit surprising, but especially in the OLAP world, many tools just issue the same queries over and over again, and they really do hit the result cache. Yeah.

Yeah, I have a quick question on the result cache too. I'm curious — I assume, because it's all analytic data, you have this constant flow of new data coming in. So how do you deal with new data flowing in all the time while trying to cache results?

Right. So there's no proactive process happening: results are cached only when a previous query actually generated that result. If there's no previous query, we don't do anything — it's a lazy process.
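As a rough illustration only — the actual keying, which the speaker describes next, is a signature over the scan set and other inputs — a lazy result cache along these lines might look like the following; all names here are hypothetical.

```python
# Hypothetical sketch of a lazy, signature-keyed result cache. The real key
# derivation is richer; this only illustrates the "no previous query, no
# cached result" behavior described above.
import hashlib
from typing import Dict, List, Optional

_result_cache: Dict[str, bytes] = {}

def cache_key(query_text: str, scan_set: List[str]) -> str:
    # Signature over the query plus the micro partitions it would scan.
    h = hashlib.sha256(query_text.encode())
    for partition_id in sorted(scan_set):
        h.update(partition_id.encode())
    return h.hexdigest()

def lookup(query_text: str, scan_set: List[str]) -> Optional[bytes]:
    return _result_cache.get(cache_key(query_text, scan_set))

def store(query_text: str, scan_set: List[str], result: bytes) -> None:
    # Populated only after a query actually ran -- nothing is computed proactively.
    _result_cache[cache_key(query_text, scan_set)] = result
```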
And if the data that comes in can be pruned out — like I mentioned earlier, out of the set of partitions we figure out we need to scan — then we can still match against the previous result. Essentially we compute a signature over the inputs, which includes the list of partitions to scan for that query and a set of other things, and as long as that signature matches, we can look up the cached result. So it's a pretty simple mechanism; we don't do anything proactive, like proactively computing the result of a query. Yeah.

Hi, this is Steven Mayer, I have a question. For a lot of the BI tools that sit on an OLAP solution like Snowflake — let's say some entries have just been added — is Snowflake able to use the earlier partial result and simply add the delta from the rows appended to the table?

Right, so we don't do that today. It's one direction we're looking at, but we don't do it today; the result cache is related to that, but it doesn't do it. There are other features, like materialized views, which are similar to what you're describing: people can get these kinds of results by materializing something, and materialized views have an automatic refresh mechanism that keeps their data up to date. In that sense it's similar to what you describe. But in the general sense, our materialized views don't cover arbitrary queries — they only allow a restricted shape of query, one that can be incrementally refreshed. If it's something we don't know how to refresh incrementally and efficiently, we currently don't do it. — Thank you.

Okay, cool. So the third step is the planner and optimizer, and after we've generated the plan, we go to the virtual warehouse to process the query — here's a little more detail on that. It scans only the needed data, from the local SSD cache or from blob storage, processes it, and returns data to the cloud services layer. Finally, on the result side, there's a return process that uses pagination for large results, for example, and it also writes the result back into blob storage to be returned to the client. So that's the overall query processing flow.

Now let's talk a little about the optimizer, and first about its philosophy. Our query optimizer is cost-based, with many, many non-cost-based optimizations. The main reason is that we want a robust optimizer, and we favor plan stability over a super clever, maximally optimized query plan — I'd say that's the main reason we decided to do it this way. Our focus is on optimizations for analytical queries. These include the common constructs as well as other analytics constructs like window functions, recursive CTEs, pivot and unpivot, and so on, which are pretty common in analytical queries. Our cost model has an emphasis on common analytics data models, such as star joins, but it is also general rather than tied to one data model: it makes certain assumptions about typical OLAP data models, but those are not overriding — there's no strict requirement there.
And as mentioned, we focus on the optimizer's robustness and plan stability rather than on generating the absolute most optimized plan. The other assumption is that we leave as much as possible to execution: we assume the optimizer can make mistakes, both due to its own limitations and due to limitations in its inputs — cardinality estimation, for example — and we want execution to be able to correct mistakes made at optimization time. The way we do that is to adapt as much as possible at execution time. The final point is that we want to leave as little as possible to the user. In many traditional databases there are thousands, if not tens of thousands, of parameters the DBA or the user can tune. For Snowflake, even today, I believe we expose something like 20 parameters to users across the whole system, and most of them are not for tuning the query engine itself — they're more for tuning certain behaviors around transactions and things like that. This is really our philosophy: produce a good plan that runs well out of the box, with very little tuning the user needs to perform. These are the principles we try to practice in our implementation.

Here's a bit more detail on our stats collection process, and it's really not very complicated. The basic idea is that we automatically collect stats, or metadata, at multiple levels as part of DML, as part of the transaction. For example, there are stats we collect at the table and micro partition level: things like the number of rows, the size in bytes, and compression information — the in-memory footprint of the data can vary drastically from the compressed size, so depending on the use case we can make use of that information. We also have column-level stats; these are very common zone-map-style stats like min/max, null counts, and distinct counts. And, like I mentioned earlier, we have sub-column-level stats, which are essentially the same as column stats but for the common paths in semi-structured data that we decided to columnarize when the micro partitions were created. Because micro partitions are created as part of DML, in a transactional way, the stats are always up to date and always accurate for each micro partition. These stats are used as input to the optimizer's cost model, and also as input to certain other compile-time optimizations we do, like pruning and constant folding. Finally, the column and table metadata are cached in the cloud services layer, so we have very low-latency access to these stats as part of the compilation process.

Okay, can I ask a quick question about that? This is Becca from Cockroach Labs. I was just wondering, how do you use the stats as input to constant folding?

That's a good question. It depends on the type of constant folding. There are things like predicate-based constant folding, where if we see from the metadata that nothing in the table can match a particular constant predicate, then we can constant fold that predicate to false.
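A minimal sketch of that first kind of folding, assuming only table-level min/max metadata is available; this is illustrative pseudocode rather than Snowflake's implementation.

```python
# Assumed sketch: if table-level min/max metadata proves an equality predicate
# can never be true, the whole predicate folds to FALSE at compile time.
def fold_equality_to_false(table_min, table_max, constant) -> bool:
    """Return True if `col = constant` can be folded to FALSE for the table."""
    return constant < table_min or constant > table_max

# Example: if a column ranges over [10, 500] across the whole table,
# a predicate like col = 1000 folds to FALSE, which in turn can trigger
# further simplification (empty joins, pruned subtrees, and so on).
assert fold_equality_to_false(10, 500, 1000) is True
assert fold_equality_to_false(10, 500, 250) is False
```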
There are also cases where the partitions all have the same min and max value for a column, in which case we can constant fold that column into a constant, which can potentially trigger further rounds of optimization. So it's really making use of the min and max properties, as well as the nullness properties: sometimes if you have an IS NULL or IS NOT NULL predicate, or an outer join generates certain implied predicates, then we can make use of the known nullness properties of those columns to do certain optimizations, and so on and so forth.

Awesome. So it sounds like you're relying on the fact that these are really up-to-date stats and not stale? — Yes, they're always exact stats. — So that means you're computing them on ingestion, and you have to? — Yeah. I would say they're more metadata than stats — they're actual metadata. But there's always a trade-off, because it also limits the type of stats we can collect: we can't collect anything that's super expensive to compute as part of DML. So, for example, we're missing certain types of very expensive stats that can usually only be maintained in the background, things like that. But this is a very simple model that goes a surprisingly long way. When I first looked at the stats we have, I also thought lots of stats were missing — the traditional optimizer stats from the literature. Many of those we don't have, but what we do have already goes a very, very long way, and by and large we're able to generate reasonable plans with these stats. Yeah.

Ryan again. Do you maintain these at the micro-partition level and the table level at all times? Or do you have to scan all the micro-partition metadata before you can plan for the table?

We maintain them at the micro-partition level, at the table level, and at intermediate levels as well — it's multi-level metadata, more than just the two levels. That's really an optimization for the caching layer, because, as I mentioned earlier, there are tables with millions of micro partitions, and caching all of that metadata in memory is pretty expensive, so we've done a lot of optimization in that area. When you have that many micro partitions, the metadata management problem itself becomes a big data problem rather than a small data problem. That's what we find interesting: we have tables in the petabyte range with millions of micro partitions, so we really have to think hard about these problems.

Now to this picture, which is a very simplified view of compilation. The query text goes into the parser, which produces what we call the query block internal representation. The query block representation then goes through semantic analysis, which does name resolution, type checking, and all that. Then it goes through a round of logical rewrites. After that, we go through micro-partition pruning — technically it's part of the logical rewrites, but we call it out separately here because it's especially important for query performance. We do this through the pruner component, which is also shared later in the plan rewrite phase.
After micro-partition pruning, we go through initial plan generation, which produces what we call the query plan IR — our query plan representation. With this representation we enter the plan rewrite phase, which is rule-based; there are a large number of rewrite rules we apply here before we go into the cost-based optimization phase. Today, cost-based optimization is mostly focused on join ordering — there are other, smaller things like bloom filter optimizations, but by and large it's join ordering. After that we generate the physical plan, and once it's generated, we hand the physical plan off to be executed at runtime.

I want to briefly touch on terminology. The physical plan we generate in the end is a DAG — Snowflake has a push-based execution model — consisting of operators connected by links. Operators are the traditional notion of database query operators, which process sets of rows: table scans, filters, join filters, joins, aggregates, and so on. That part is simple. The links are also fairly clear in what they represent: they essentially encapsulate the data exchange operators, and they encapsulate adaptive mechanisms at runtime — for example, they can handle parallel redistribution of data at runtime if necessary. Links don't necessarily have to do an exchange; they can be local synchronous links, asynchronous links, and so on. There are many different types of links that can be generated depending on the plan, and the links themselves can adapt at runtime based on what we observe about the volume of the data, the skew of the data, and so on. So that's the overall terminology of the query plan.

And this is a simple example of the query profile a customer sees for a Snowflake query. This is a TPC-DS query, with table scans as the leaf nodes. On top of the table scans you can see two filters, six and nine, which sit on top of table scans five and seven, and there's also the join filter, eight, which is pushed over from the other side of the join onto the table scan — we'll talk more about that later. Then there are the join operators connecting them together, aggregation and window functions on top, and finally the result operator. Result generation is sometimes, surprisingly, a bottleneck in query processing — especially if a query produces a lot of rows, since results are not compressed as well as the underlying data, so there's a lot of overhead in translating and transporting the result to the client. That's an interesting problem in itself. On the right side there's a statistics panel that shows the stats for this query. And this slide shows some of the debugging tools around the query plan, which I won't go into.

Okay, hello, I have a question. This is Dalon from Penn State University. All of these nodes are computational nodes, and Snowflake is cloud-based, so all of them are separate executors in this case. Do you think about anything like operator fusion, that kind of stuff?
Second, do you do anything about data locality — trying to make sure that compute happens close to the data, so you don't incur too many remote IOs?

Right. So the first question is about fusion of operators, is that it? — Yeah. You show this as a full plan, but a lot of these operators could actually be fused together using code generation techniques. Do you use that right now? — So we don't use code generation much in our query execution engine. We use it only in the places where we think it benefits the most, because our query engine is really a vectorized engine, and in the majority of cases that works pretty well. The obvious place where it doesn't work well is when you have a super expensive expression — that's an obvious case where we use code generation. There are also other cases, like transporting serialized formats of what we call row sets over the network in the exchange operators: packing and unpacking the row sets is a pretty expensive operation that isn't easily vectorizable. In most other cases our existing vectorized engine works pretty well, and we don't really see a big need for code generation. There is another kind of fusion, I guess, that's pretty common for this kind of engine, which is the concept of a pipeline. It's not shown here, but imagine the links between these operators: on the right-deep side, everything can be viewed as one pipeline, so all the links there would be local synchronous links, while the build sides obviously need to be built separately, and so on. So that's that aspect.

For the second question — bringing compute closer to the data — what we do is try to affinitize data access within the compute cluster. We have local SSDs on each of the execution nodes in a virtual warehouse, and usually there are multiple queries running on the warehouse that access the same tables. So we do something relatively standard, which is consistent hashing across the nodes, so a table's data essentially gets affinitized to nodes through the consistent hashing scheme, and we're able to achieve pretty good cache hit rates overall in a virtual warehouse in most cases.

One follow-up, which is very interesting: I know you also allow users to have their own storage — for example, they have their own S3 buckets and host all their data there. By doing that, you essentially end up with a gigantic cache inside your own infrastructure, trying to pull as much of the data as possible inside your own Snowflake cluster. But there's still a link, in the sense that there's another storage service you're talking to. So, number one, there's a cache coherence problem if they're actually modifying that data with something else while they're also using Snowflake.
And also, how do you try to minimize that type of IO going to the other storage service as much as possible, since it can be very expensive?

So I guess what you're referring to is what we call external tables, where we allow customers to put data in a data-lake-like storage bucket — is that the feature you mean? — Yeah, I think so. One thing I don't quite understand is whether you always allow users to do that, or whether most of the time you're hosting all of the data for your users. — Yeah, so by default Snowflake hosts the data. The functionality you're referring to really applies to cases where people want to use Snowflake in a data lake setting and, for various reasons, do not want to load the data natively into Snowflake. For those cases we built the functionality to scan directly over external data in blob storage. The mechanisms are very similar, although the performance will not be as good as with native Snowflake storage, because we have no control over, for example, the size of the files the customer puts there — they could be super small or super large, which has adverse IO implications, and so on. But by and large the caching mechanisms are similar; it just doesn't work as well as over the native Snowflake storage format. That said, Parquet is also pretty good — Parquet has a lot of very good ideas for speeding things up, especially for a columnar, vectorized execution engine. It's a pretty good file format. — Yeah, thank you so much.

Okay, cool. So now let's take a deeper dive into some of the optimizations. Some of them are pretty straightforward, but bear with me. The first is table scans and pruning. We all know that full table scans are very expensive for large tables, especially when tables are in the petabyte range, so we go out of our way to avoid scanning full tables. There are two mechanisms. One is that we only scan the columns that are accessed, which is facilitated by the PAX-style format we use. The other is pruning, which uses column metadata to avoid scanning data that would be filtered out. Exactly how much data can be pruned depends on the filters in the query as well as on the clustering of the data in the table — that's why we have the automatic clustering service, to reorganize the table along the dimensions the queries actually filter on. And sometimes there's a natural clustering order we can take advantage of.

Here's an example of bad pruning. This is the store_sales table in TPC-DS, and there's a predicate on the list_price column. In the table scan, shown in the middle at the bottom, you can see the partition total is about 86,000, and we end up scanning roughly 85,000 of them. That amounts to 1.26 terabytes of data, and it produces 28.4 billion rows.
As you can see, the filter itself is actually quite selective — about a 100x reduction, from 28.4 billion rows down to a few hundred million. The reason the pruning is not effective is that list_price is a column the table is not clustered on: it's a pretty well-distributed column across all the partitions, so we can't really prune based on this particular predicate. A common pattern is to add another predicate, on the sold date. The table itself may be clustered by a date column — maybe not this exact date column, but some other date column it has a strong correlation to — and then we're actually able to prune out a lot of partitions in the table scan. As you can see in the middle here, now we're scanning about 10,000 of the 86,000 partitions, only about 160 gigabytes, and we're producing 3.4 billion rows. The filter selectivity stays the same — still about 100x — but now the input to the filter is much smaller, and now result generation, like I mentioned earlier, actually takes up a relatively large fraction of the time. So this is an example of good pruning. And this is the same query, but clicking on the table scan, you can see we're scanning 23 — actually 24 — columns, and which table is being scanned. So that's a very simple example of pruning and how it works, and it's an optimization we apply at compilation time.

Here's a deeper look at pruning. On the left is an example table with four columns; the ones we care about are the ID column and the date column. We have IDs from 1 to 5 and dates from November 2nd to November 5th, and the table has four micro partitions. The data is roughly clustered by date: micro partition one only has data from November 2nd, micro partition two has the 2nd and the 3rd, and so on. The ID column is relatively random, because on any given day data can arrive for any ID in the table. Now say we have a simple query like this: select name and country from T where ID equals 1 and date is greater than or equal to November 4th. We can prune with this. Looking at the ID column, we can prune out partitions one and three, because the minimum ID value in partition one is 2, and the minimum in partition three is also 2, so we know for sure those two partitions cannot contain ID equal to 1. We can also prune on the date column: for partition two, the maximum date is November 3rd, which is less than November 4th, so we can prune out that partition as well. As a result, we only need to scan one micro partition. So this is a very simple case of pruning.
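Here is a small sketch of the simple case just described, assuming each partition carries min/max stats per column; the types and helper names are made up for illustration, not taken from Snowflake.

```python
# Editor's sketch of min/max-based partition pruning for
# WHERE id = 1 AND date >= 2020-11-04 (year chosen arbitrarily).
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, List

@dataclass
class PartitionMeta:
    stats: Dict[str, tuple]          # column -> (min, max)

def may_contain_eq(meta: PartitionMeta, col: str, value: Any) -> bool:
    lo, hi = meta.stats[col]
    return lo <= value <= hi

def may_contain_ge(meta: PartitionMeta, col: str, value: Any) -> bool:
    _, hi = meta.stats[col]
    return hi >= value

def prune(partitions: List[PartitionMeta]) -> List[PartitionMeta]:
    # Keep a partition only if *both* conjuncts could possibly be satisfied
    # by its min/max ranges; otherwise it can be skipped entirely.
    return [p for p in partitions
            if may_contain_eq(p, "id", 1)
            and may_contain_ge(p, "date", date(2020, 11, 4))]

parts = [
    PartitionMeta({"id": (2, 5), "date": (date(2020, 11, 2), date(2020, 11, 2))}),  # pruned: min id is 2
    PartitionMeta({"id": (1, 4), "date": (date(2020, 11, 2), date(2020, 11, 3))}),  # pruned: max date < Nov 4
    PartitionMeta({"id": (2, 5), "date": (date(2020, 11, 3), date(2020, 11, 4))}),  # pruned: min id is 2
    PartitionMeta({"id": (1, 5), "date": (date(2020, 11, 4), date(2020, 11, 5))}),  # kept
]
assert len(prune(parts)) == 1
```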
But as you can imagine, pruning can get a lot more complex depending on the expressions involved — here are some slightly more complex examples, and in production we prune on far more complex expressions than these. Say I have a created_on column, which is a timestamp, and the predicate truncates the timestamp to the year and compares it with 2020. This is a somewhat complicated function: we need to understand the semantics of the function to be able to prune on it, so it's not as simple as a plain range comparison against the min and max. Another example is this: I have two columns, salary and annual_bonus, and I'm evaluating a predicate that says the sum of these two columns is greater than some constant. In that case we need to take into account the min and max of the salary column as well as the min and max of the annual_bonus column — four input values — and compare against the constant. One thing I would emphasize is that the logic for pruning is different from the logic for filtering, because filtering applies to individual rows while pruning applies to a range of data. We need to look at the min and max as well as the nullness properties, because certain expressions are not null-rejecting, and predicates have particular behaviors with regard to nulls. The expressions involved can get very complicated, so this is non-trivial logic in many cases.

Sorry, can I interrupt with a question? My name is Sande, I work for Elemental. So this means you need to have basically a full expression evaluator wherever you're doing your pruning, is that right? — Yes. We have a specialized implementation of the pruning logic. — Got it.

Hi, Ryan again. Do you ever cache the result of evaluating one of these expressions on a per-micro-partition basis, so that you could decide to prune at the local executor level rather than at the compiler level? If that date_trunc expression is known never to be true for a given micro partition, do you cache that?

We don't cache it today. We've experimented with it before, and it's something we'd like to dig deeper into — it's something we've thought about and discussed, and we had some prototypes, but we never implemented it in production. — Thanks.

Jackie, a quick question. Based on the results of the pruning — based on the query profile — do you ever repartition the tables? And if so, how often does that happen? Is it automatic repartitioning on some other column value, like the example we saw — is it possible to do that?

Yeah, so repartitioning is done through the clustering process, and today it is user-defined: the user needs to define a clustering key, which is a declaration of intent to cluster the table in a certain way, and clustering keys can be arbitrary expressions — something like DATE_TRUNC of year on a column can be used as a clustering key — and we would essentially reorganize the table by that expression.
But the user has to declare that. We expose all of this pruning information to the user in the query profile, so the user can take a look at it and decide whether a particular column is a useful clustering dimension for their workload, and then specify it. We've also realized that we could do this automatically, but that's not something we have today. — Oh, thanks.

And I have a question on the complex expressions, the DATE_TRUNC one. I've seen other engines try to do a logical rewrite to turn the function over the column values into a range query. What's the philosophy of your optimizer in terms of how much time you spend trying to unfold the input expression into something that can take advantage of your metadata?

Yeah, we do that as well. We have an unwrapping optimization — for cast functions, for example — where we unwrap the expression into potentially two range predicates, because we have a highly optimized implementation for simple range predicates that can be evaluated very quickly. We have an internal set of heuristics that decides whether it makes sense to prune on a particular column, and when it does, we apply these optimizations to improve the evaluation speed of the pruning process. So we do that as well. — Thank you.

Yeah, okay, cool. The next point I was going to mention people have already asked about: the query profile shows the number of micro partitions before and after pruning, so people can look at it and make decisions. And there are other optimizations that can enable pruning, which should be pretty familiar — quite standard predicate optimizations: predicate pushdown in its various forms, predicate move-around optimizations, transitive predicate generation from join predicates, predicate factorization, derived predicate generation, things like that. Anything that can open up new opportunities for pruning as part of the overall optimization pass. That's why, in the diagram earlier, we showed the pruner component being applied multiple times during the course of query optimization: whenever a new predicate, a new pruning opportunity, presents itself, we iterate until we reach the final scan set after applying these predicates.

So now let's go to joins. For joins, the particular optimization I'll talk about is the join filter. Again, this is a TPC-DS query where we're joining the store_sales table and the date dimension table on the date column, and there's a predicate on the date column. This is a hash join. The notation is a little reversed compared to some other systems: we have the build side on the left and the probe side on the right, and here the store_sales table is on the right side because it's the much larger table. What we see here is the join filter optimization, which appears as filter five.
The join filter is essentially an optimization that decouples the join into two steps: conceptually there's a join step and a filtering step. What the join filter does is take the filtering step of the join and push it down into the table scan — into the probe side — as far as possible; there could be other joins below, and the join filter keeps getting pushed down depending on whether it's legal to do so, subject to other limitations and so on. So the main idea of the join filter is to let the filtering part of the join happen earlier, decoupled from the actual join, and to avoid subsequent processing — hash table probes and other expensive operations that happen as part of the join itself.

We have two techniques for filtering rows. One is bloom filtering. This is a pretty standard technique: we build bloom filters — basically a probabilistic data structure — that filter out most of the rows that would be eliminated by the join; a small portion of rows may still flow up to the join and get filtered out there eventually. Our bloom filters are not super large in size, but we're still able to achieve a very good filtering ratio in most cases — though fundamentally it depends on the properties of the data, and sometimes it's just not possible to filter out anything; that can still happen. The other technique is range pruning, what we call a range bloom vector. The idea is that we maintain a summary of the ranges of values that will match the join, and we push that data structure into the table scan on the probe side, where we can use it to prune out micro partitions from the table scan before we even perform the join. So those are the two optimizations we do as part of join filtering. This example shows it: in the middle you can see pruning actually happening in the table scan — from the original 86,000 partitions we're only scanning about 70,000 here, thanks to the range bloom vector pushed down to the table scan. And the join filter — the bloom filter, essentially — is quite selective in this case: we have about 23 billion input rows and output about 5 billion rows. By and large, join filtering is a very important optimization that we apply. Yes.

Okay. Next I'll talk about metadata optimizations. This shouldn't be very surprising: because our metadata — our optimizer stats — are exact, we can use them to directly return results for some queries. The simple query on the right has a bunch of aggregations, and we can evaluate the aggregations directly from the metadata. As you can see, the query plan is very simple: it's just a row generator with a result return. That's what the plan looks like in that case. And, like I mentioned, Snowflake always maintains accurate stats, which is what enables this.
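A rough sketch of how such a metadata-only query might be answered, assuming exact per-partition stats as described above; the structure and names are illustrative, not Snowflake's code.

```python
# Assumed illustration: answering SELECT COUNT(*), MIN(x), MAX(x) FROM t
# directly from per-partition metadata, without scanning any data.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PartitionMeta:
    row_count: int
    min_x: float
    max_x: float
    null_count_x: int

def answer_from_metadata(parts: List[PartitionMeta]) -> Tuple[int, float, float]:
    count = sum(p.row_count for p in parts)
    # MIN/MAX over partitions that hold at least one non-null value;
    # this is only correct because the stats are exact, never approximate.
    min_x = min(p.min_x for p in parts if p.row_count > p.null_count_x)
    max_x = max(p.max_x for p in parts if p.row_count > p.null_count_x)
    return count, min_x, max_x
```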
We can rely on stats to optimize various facets of the query: predicates, aggregations, and subqueries as well. And finally, result caching, which we discussed earlier — in that case the query plan just looks like this: there's a query result reuse operator that simply returns the result. Yeah.

Okay. Now I want to talk about maybe two more optimizations which, in my mind, reflect the philosophy of our optimizer implementation. One of them is aggregation placement. Aggregation placement is also called group-by placement, or group-by pushdown in some cases. Semantically, a group by can either be pushed below a join or pulled up above the join in the algebraic query plan tree, and the right — the optimal — place to evaluate the aggregation really depends on the properties of the join as well as the properties of the group by. If a join is explosive, it might make sense to evaluate the group by before applying the join, if the group by is reductive, for example.

The way we do it, there are two goals. One goal is to make it broadly applicable: if the aggregation can be pushed below multiple joins, we place adaptive aggregation operators at all the places along the join tree where it's legal to do so, and they are only activated if it makes sense to do so at runtime. So the decision making is purely a runtime decision; at compile time we simply place these adaptive aggregations everywhere it's legal. There are other optimization opportunities this opens up — for example, when the aggregation is pushed into the table scan, it allows us to evaluate over encoded data, such as dictionary-encoded data, to speed up execution, and it opens up opportunities like materialized view rewrites and others. But I think even more important is the adaptive philosophy: we want the pushed-down aggregations to be fully adaptive during query execution. The aggregation checks a pipeline-global cost model to avoid performance regressions, and the moment it detects it's likely to cause a regression, it turns itself off — each instance can independently switch to pass-through. The cost model is pipeline-global, but the decision is thread-local, because we want to avoid the communication overhead of coordinating this decision across threads. That also turns out to be a good thing: with skew, some threads may actually benefit from evaluating the aggregation while other threads may not, which is another good reason to make this a thread-local but pipeline-global decision.

Here's a simplified example. On the left we have the original query, which has the join with the aggregation on top, and the join produces 100K rows in the default case, with the probe side producing 10K rows.
Let's say the pushed-down aggregation is not activated, and as a simple cost measure assume the cost is just the sum of all the rows produced — so that's 110K here. Compare that with the case where the aggregation is activated, and we observe that the aggregation achieves a 10x reduction — it keeps only 10% of the rows. Now only 1K rows come out of the aggregation, and even though the join is explosive, it only produces 10K rows, so the overall cost here is 10K plus 1K plus 10K, which is 21K. In this case it obviously makes sense to leave the aggregation enabled, because it reduces the overall cost of evaluating the plan. Compare that to the second case, which is exactly the same plan, but the characteristics of the join and the aggregation are slightly different. Say the join is a primary key–foreign key join with no explosion — it takes in 10K rows and produces 10K rows — and the aggregation is not very effective, reducing the rows by only 30%. Now the cost of the original plan is still 20K, but the cost on the right side becomes 24K, because we also have to evaluate the join over the 7K aggregated rows, and that's more expensive. So the cost model tells us it's better to leave this particular aggregation disabled. And this is a runtime decision: it's still a cost model, granted, but it's a cost model we apply at query runtime, not during query optimization.

So what does that look like in the actual implementation, in the code? Is it a bunch of if-then-else statements — "if my plan looks like this" — or is it a general-purpose pattern-matching framework that gets triggered when you see query plans that look like this? How is it actually implemented?

Yeah, so the plan is the same in both of these cases — it would essentially be this plan — and it's only this particular operator that turns itself on and off. We try to generate relatively simple plans, because complex plans add a lot of overhead for the optimizer. Especially if there's an explosion in the number of plan operators early in the query optimization phase, it prevents us from applying the more complex optimizations, because there's a kind of fixed compilation budget, a wall, at which we have to stop the optimization process — and the more complex the plan, the fewer optimization passes can be applied.

So the operator is always there, and it basically just passes tuples through if it decides not to do the optimization? — Yeah. In general, any operator that contains a buffering phase is a good place to do sampling. Aggregation has buffering operations — it builds hash tables — and since we need to buffer anyway, we might as well do some sampling to inform subsequent decisions. And of course, it has to wait a little bit to observe the join actually producing some data, to see that, and so on. Yeah.
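A toy version of that runtime decision, using the same row-count cost measure as the example above; the function and parameter names are hypothetical and the real mechanism is, as the speaker says, more involved.

```python
# Editor's sketch: the pushed-down aggregation keeps itself enabled only if
# the observed reduction ratio pays for itself given the join's fanout.
def keep_pushdown_aggregation(input_rows: int,
                              observed_reduction: float,   # e.g. 0.1 -> 10x smaller
                              join_fanout: float) -> bool:
    # Cost without the extra aggregation: probe rows + rows the join produces.
    cost_off = input_rows + input_rows * join_fanout
    # Cost with it: probe rows + aggregated rows + rows the join then produces.
    agg_rows = input_rows * observed_reduction
    cost_on = input_rows + agg_rows + agg_rows * join_fanout
    return cost_on < cost_off

# First example from the talk: 10K probe rows, 10x reduction, explosive (10x) join.
assert keep_pushdown_aggregation(10_000, 0.1, 10.0) is True     # 21K < 110K
# Second example: PK-FK join (fanout 1), only a 30% reduction.
assert keep_pushdown_aggregation(10_000, 0.7, 1.0) is False     # 24K > 20K
```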
The real thing is a little more complex than that, but that's the idea. And this is actually a good segue into the next optimization. It's a small one, but I want to talk about it because it's also somewhat different from what other databases do: optimizing disjunctive joins. The example is here. The most common join is an inner join with a set of conjunctive equality predicates — left.a equals right.a, and so on. But there are also joins where the predicate is a disjunction like this one, and you can imagine that if the conjunctive predicate weren't there — say the first equality conjunct is removed — then the whole join becomes a purely disjunctive join, and with a hash join implementation it degenerates by default into a Cartesian product. The way I believe other databases implement this in the query optimizer is to generate a union-shaped plan that evaluates each disjunctive branch separately, and depending on the different physical access paths, that opens up other optimizations. We actually want to avoid that — partly because, as I mentioned, we want the query plan itself to stay simple and relatively lightweight, and partly because we want to reuse existing optimizations as much as possible. We realized that the existing join filtering optimization can be extended fairly easily to the disjunctive case, and that's essentially what we went for. For joins with disjunctive equality predicates, we perform the join using hash tables built on the join keys of each of the disjuncts, with a process later on that reconciles the results — so essentially we avoid the Cartesian product. Similarly, we also generate bloom filters and join range filters for the disjuncts, and these get pushed down to the probe side, with a little additional logic to produce the disjunction of those results, which is actually pretty easy to do. We also leverage certain predicate transformations to produce these kinds of predicates, which enables us to build those hash tables and bloom filters and so on. I wanted to touch on this briefly because it shows that when we weigh an optimization that transforms the query plan against doing the optimization at query execution time, we generally prefer the latter: we want to keep the plan itself as robust and as simple as possible so the optimizer can make good decisions.

One thing I would really like to emphasize is diagnosability. For diagnosability we have re-execution of customer queries, which covers SELECTs as well as DML and DDL that embed SELECT statements, such as CREATE TABLE AS SELECT. This lets us debug with the customer query itself: for example, we can slightly tweak a customer query and run it again. Obviously everything is obfuscated and we are not able to see any of the customer data, but we can see the performance of the query, and we can try tweaking the query optimization based on that.
We also have a pretty powerful internal, web-UI-based debugging and visualization tool. And I'd like to emphasize that, for a production-grade database optimizer, debuggability and diagnosability are almost the most important aspects of the tooling and the implementation, because many times it's very, very hard to roll out a particular optimization — it's almost impossible not to cause a regression somewhere with any significant query optimization change. In those cases, the ability to diagnose and investigate the issue efficiently is critical. This tool is available for all production customer queries. It does a bunch of things, but for the query optimizer part it generates the stepwise plans, capturing the shape and expressions of the plan at each step of the optimization process. By capturing this, a developer can compare the plans and figure out at which step things start to go bad, at which step we made the wrong decision. It's also used to figure out whether an optimization should have happened, or could have happened, at a particular point. So it's a super powerful tool. We also surface runtime plan information like cardinality estimates, large join build sides, exploding joins, skew, and so on. It's also integrated with optimizer hints — I find this pretty cool: a developer can actually play with the plan, for example drag and drop plan nodes to get the join order they'd like to test, and we generate the optimizer hints that can be used to re-run the query and test that hypothesis. All of this can be applied to production customer queries. Of course, it has zero impact on the customer — this is purely for internal debugging and understanding, and we have no way to look at any customer data. Another piece is the reproducer, a common tool in many other production database systems as well, which allows us to reproduce customer issues locally — a very cool tool to have. And again, I would emphasize that the biggest challenge with almost any production optimizer is rolling out a new change to the optimizer, because it's almost bound to cause regressions and issues with existing customer queries. That's often unavoidable and needs to be carefully studied and mitigated.

So I don't know, Andy, if we have enough time — I'm running a little bit over. — How many more slides do you have left? — Yeah, I think I'll just stop here, because I have quite a few slides left. — I mean, are the upcoming slides the keys to the castle, telling us all the magic you're doing, or... — No, I can quickly go over them in maybe one or two minutes. — I hate to cut it off — this is gold — but I do have to get back to my wife and the baby. I have a family now. Okay, so I will thank Jackie on behalf of everyone. We have time for one question, if anybody wants to unmute themselves — if they have a burning desire to ask him something about optimization.

So I have one question. Hi, I'm Elena. Nice to see you again, Jackie. — Yeah, thank you. — Elena is a student of mine at CMU, for other people. Keep going. — Yeah, so I want to know whether the system explicitly calls the pruner to conduct pruning after every query transformation, like after applying a rewrite rule.
It doesn't always do that, but it will if the optimizer recognizes that a new predicate has become available — the predicate is the key here. If a query optimization produces a new predicate that we haven't already tried to prune the table on, then we'll try it. — Got it.

All right, so my question will be: what does the join ordering look like? Is it just standard dynamic programming, like some System R thing? — Yeah, it's a standard Cascades-style optimizer. — Oh, it's Cascades. How do you guys handle cost model estimation? Is it Cascades over the physical plan, or are you doing the logical plan transformations as well? — It's on the logical join tree, I would say. — How do you do cost estimation for the logical plans when you don't have the physical plan yet? Do you do any optimizations there, or do you always have to go to the physical plan to get a cost? — So I have to admit our current algorithm is pretty naive: we're not looking at any of the physical access paths and so on, so our logical cost model is pretty straightforward. — So hire Elena to do that, right? — Exactly. Yeah, we're just looking at cardinalities and things like that. — Okay, awesome. All right guys, thank you for coming. Again, I apologize for cutting it off — I have a young child waiting.