The Carnegie Mellon Quarantine Database talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. Let's get started. It's still a pandemic, so let's talk about databases.

Today we have Maryann and Cheng from Databricks here to talk about the Spark SQL Catalyst optimizer, which as far as I know is based on Cascades, correct? Or no? Kind of. Okay, perfect, we'll explain what that means. Cheng will be the first speaker. He is an engineering manager at Databricks, a member of the Apache Spark PMC, and a committer for Apache Parquet. He has an eight-month-old daughter who may interrupt us during the talk, but that's okay. Maryann is a staff software engineer at Databricks. Prior to joining Databricks, she worked at Intel, and she is also a committer for Apache Calcite and Apache Phoenix. Okay, with that, we'll get started. If you have any questions for Cheng and Maryann, please unmute yourself, interrupt them, say who you are and where you're coming from, and ask your question. And again, we'd also like to thank the Stephen Moy Foundation for Keeping It Real for sponsoring us today. All right, Cheng, the floor is yours. Go for it.

Thanks, Andy. Hi, everyone. It's so nice to have the opportunity to introduce the Spark SQL Catalyst optimizer here. First of all, let me introduce ourselves. I have my colleague Maryann Xue here with me, as Andy already introduced; she's a staff engineer at Databricks. And I'm an engineering manager at Databricks, a PMC member of Apache Spark, and a committer for Apache Parquet.

Okay. Before jumping into Catalyst, let's talk a little bit about the unified analytics story in Apache Spark. Unlike the other database systems discussed in this Quarantine Database Tech Talk series, Apache Spark itself was not originally invented as a SQL database system. Instead, Apache Spark was the first unified analytics and data processing engine. It provides a rich set of APIs and multiple language bindings to express different types of workloads within a single framework. Workloads ranging from traditional ETL to streaming, machine learning, graph processing, and so on are all nicely covered. This new paradigm brings lower operational cost and better performance: we only need to run a single framework, and data exchange between different systems can be minimized. At its core is the concept of resilient distributed datasets, or RDDs, which provide the primitives to express MapReduce-style parallel and distributed computation. Later, in the second generation of the Spark API, we introduced a relational, structured layer and made it the new foundation of the rest of the ecosystem. This structured layer is Spark SQL, a set of structured APIs that enable users to query data stored anywhere, in any format, at scale. The secret behind "anywhere, in any format" is the data source API. Users can connect to all kinds of data sources by building plugins implementing the Spark SQL data source API, and the data sources do not necessarily need to be other databases or data stores. For example, at Databricks we built data sources for Jira, Jenkins, and PagerDuty by invoking their REST APIs, so that we can join them together and analyze Jenkins test failures and flaky tests, which is a pretty interesting use case. So why do we want APIs to be structured?
Why are they preferred over the original RDD API? Here are three implementations of the same query, using the RDD API, the DataFrame API, and SQL. We're simply trying to group by department and compute the average age of all the employees. As you can see, the RDD API does not really understand the structure of the data being processed: users have to describe the data structures and manipulate them explicitly with user-defined functions. This not only makes the implementation verbose and error-prone, but also makes it hard to optimize, because to the framework itself, to Spark, user-defined functions are simply opaque black boxes. On the other hand, the DataFrame and SQL implementations are much more concise, because the structure of the data is discovered or specified up front, and the framework can leverage this information to compile the declarative programs more efficiently.

Can you say roughly what percentage of Databricks customers uses those three choices? Is it primarily executing SQL queries, or RDDs, or DataFrames? What's the most common?

Yeah, that's a good question. From our observation, SQL and the Python API are the most popular. The customers still using the RDD API nowadays are mostly doing so due to legacy workloads. So I think it's relatively safe to say that almost no new customers are writing new RDD programs nowadays.

What was that you said? What kind of workloads? SQL and Python, but you said something like Alexa workflows? Is that what you said?

Legacy, yeah. Maybe they developed some RDD stuff years ago.

Okay, awesome. Thank you.

As a summary, structure limits what we can express, so these APIs impose certain restrictions. But in practice, the vast majority of computations can still be accommodated, and limiting the space of what can be expressed enables more declarative representations and also enables more optimizations. To take advantage of these optimization opportunities, you basically need an optimizer that can automatically detect the most efficient plans for executing the operations specified in the user workloads. That brings us to the main topic here: Catalyst, the Spark SQL query optimizer. I will first give you a 10,000-foot overview of this optimizer, and then we will focus on one specific aspect, adaptive query execution, which Maryann will present. We will also see why adaptive query execution is so important in the context of Apache Spark.

So here is what Catalyst looks like and where it sits in the Apache Spark project. Because Apache Spark provides different APIs and different language bindings, on the left side you can express your workloads in different languages, in Java, Scala, or Python, and through different APIs: the DataFrame API, the typed Dataset API, or just raw SQL. No matter how you express your workload, it is parsed or constructed into a query plan. Once you have the query plan, Catalyst is there to help you optimize it into a physical query plan and then compile it into RDDs, which are the native, or assembly, language of Apache Spark. Below is the zoomed-in image of the Catalyst optimizer; you can see several very standard components and phases there.
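For reference, the three implementations he describes look roughly like this. This is a sketch, not the slide's exact code; the `employees` dataset with `dept` and `age` columns is a hypothetical stand-in:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("three-apis").master("local[*]").getOrCreate()
import spark.implicits._

case class Employee(dept: String, age: Long)
val employees = Seq(Employee("eng", 30), Employee("eng", 40), Employee("sales", 25)).toDS()

// 1. RDD API: the structure and the aggregation logic live in opaque
//    user-defined functions that Spark cannot inspect or optimize.
val rddResult = employees.rdd
  .map(e => (e.dept, (e.age, 1L)))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }

// 2. DataFrame API: declarative, so Catalyst sees the full structure.
val dfResult = employees.groupBy($"dept").agg(avg($"age"))

// 3. SQL: compiles to the same logical plan as the DataFrame version.
employees.createOrReplaceTempView("employees")
val sqlResult = spark.sql("SELECT dept, AVG(age) FROM employees GROUP BY dept")
```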
There's an analysis phase that takes unresolved logical plans, picks out the unresolved names, and tries to resolve them using information from the catalog, binding them to physical entities like tables and columns. Let's look at the phases more closely. The analysis phase uses the catalog to resolve names into physical entities, transforming unresolved logical plans into resolved logical plans; by resolved, we mean that all the unknown names are bound to physical entities like databases, tables, columns, and so on. The logical optimization phase applies relational algebra transformations to rewrite resolved logical plans into equivalent but faster, or cheaper, versions; cost-based join reordering also happens in this phase. The third phase is physical planning, where we translate optimized logical plans into executable physical plans. Catalyst theoretically supports generating multiple physical plans and then leveraging a cost model to choose one, but right now we actually generate only one physical plan in Apache Spark. We'll see part of the reason why later in the talk.

Let me make sure I understand. The optimized logical plan, and then the physical planning part: are you saying this happens before you touch the cost model, or are you using the cost model to select the best plan?

Yeah, this image might not be the best way to illustrate it. Actually, in the optimization phase we already have statistics collected, and we already leverage the cost model for certain operations like join reordering. And in the physical planning phase we also leverage the cost model for things like join selection.

Okay. And then, maybe I'm missing it, you said you produce one physical plan instead of multiple physical plans. I'm missing what that point was.

Yeah, so at the API level, Catalyst was designed to be able to produce multiple physical plans, and then, as in other database systems, you might want to choose the plan with the least cost. However, right now Catalyst does not really implement this approach; instead, we just generate one physical plan and always stick with that.

So that means you're not doing a standard search; you just have a bunch of rules that do the transformations?

Right, the latter. We're just using a bunch of strategies. Everything in the Spark optimizer and physical planner right now finds local optima instead of a global optimum.

So there's no consideration between different rules in terms of cost. Got it. Okay, that's helpful. Thank you.

Yeah. Now let's zoom out a little bit. Within Catalyst, there are two key concepts. The first is transformations, which transform query plans into a different type of query plan, or into the same type but a different shape of query tree. The second concept is trees, which serve as the abstraction of user programs, no matter whether those programs are written in SQL, in DataFrames, or in whatever language binding you are using. And here you can see that before we turn the optimized physical plan into RDDs, there is a code generation phase.
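Before getting to code generation, a quick aside: you can see all of these phases on any query with Dataset.explain(true), which prints the parsed, analyzed, and optimized logical plans plus the physical plan. A sketch, assuming the t1 and t2 tables from the upcoming example exist:

```scala
// Prints: Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and Physical Plan.
spark.sql(
  """SELECT t1.value, COUNT(*)
    |FROM t1 JOIN t2 ON t1.id = t2.id
    |WHERE t2.id > 50000
    |GROUP BY t1.value""".stripMargin
).explain(true)
```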
By code generation, we mean that Apache Spark actually translates the physical query plan into highly specialized Java code, and then runs that Java code over RDDs so that we can maximize performance.

First of all, let's look at trees, the abstraction of user programs. Here is a very simple aggregation plus a join, a simple SQL query. First, there are expressions. Expressions in Apache Spark are represented as trees. An expression represents a new value computed based on input values, and an attribute is a column of a dataset or a column generated by a specific data operation. For example, here the value column is an attribute from relation t1, and the column v is a generated column produced by the selection, or rather the projection, operator in this query.

The second kind of tree is query plans. You can see that the SELECT, JOIN, and WHERE clauses are translated correspondingly into aggregate, project, filter, and join operators, which is quite standard. Query plan trees can be further categorized. First, there are logical query plans, which describe computation on datasets without defining how to conduct the computation. Logical plans have a few fields. The first is the output, which is the list of attributes generated by the plan node; if you think of a plan node as a function, this is basically its return value. The second is a set of constraints, which are invariants about the rows generated by a plan node; for example, t2.id > 50000 is a constraint on the rows emitted from the filter node. The last is statistics: the size of the plan node's output in rows or bytes, plus per-column statistics where available, like min and max values, number of distinct values, null counts, and so on. The second kind of query plan tree is physical plans. A physical plan describes computation on datasets with a specific definition of how to conduct the computation, which means physical plans are executable. So instead of just saying aggregate, project, or join, here we say hash aggregate or sort-merge join, and a table scan is a Parquet scan or a JSON scan: we specify how to execute these relational operators.

The second key concept in Catalyst is transformations, which are the building blocks of the optimizer rules and query planning rules. A transformation changes the shape and/or the type of a tree. Three kinds of rules are built out of them: analysis rules and optimization rules, which both translate logical plans to logical plans; strategies, which translate logical plans to physical plans and are where physical planning happens; and physical plan preparation rules, which are physical-plan-to-physical-plan rules that we will elaborate on later.

Here is an example. Transformations are implemented using Scala partial functions and pattern matching. This is a very simple example of constant folding. On the left there is the expression 1 + 2 + t1.value, which is parsed into a tree like this. Because 1 and 2 can be computed at compile time, we transform it into 3 + t1.value, on the right.
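Sketched in code, the rule he's about to describe looks roughly like this. It follows the example in the original Catalyst paper, though the exact class signatures vary across Spark versions:

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}
import org.apache.spark.sql.types.IntegerType

// transform applies the partial function wherever the pattern matches, so
// Add(Literal(1), Literal(2)) inside 1 + 2 + t1.value becomes Literal(3).
def foldConstants(expression: Expression): Expression = expression transform {
  case Add(Literal(c1: Int, IntegerType), Literal(c2: Int, IntegerType)) =>
    Literal(c1 + c2)
}
```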
So instead of evaluating 1 + 2 for every single row, we now evaluate it just once, during query compilation, which hopefully brings better performance. Implementing this is actually quite concise in Catalyst, thanks to the functional features of the Scala language. You can see there is an expression variable of type Expression, and you can just transform it using a partial function, which is the body enclosed by this pair of braces. The part highlighted in red specifies a very concrete pattern which says: if you have an Add operator containing two integer literals, then you can simply return a single literal containing the sum of the left-hand side and the right-hand side.

So this pattern matching is a built-in Scala thing? This is not something you had to build yourself, like a rule engine?

No, no. This is a built-in feature of the Scala language, which is part of the reason the Catalyst optimizer itself is so concise.

Yeah. Using these building blocks, we can implement different rules and also combine multiple rules. For example, here is predicate pushdown. The original SQL statement we just showed is translated into this query plan; the topmost aggregate is omitted because it's not important here. You can see that this filter can actually be pushed below the join operator, because the predicate only touches t2, so we can transform the plan into the one on the right-hand side. This way we reduce the number of rows emitted from the right side of the join. The second example is column pruning. The left-hand side of this join only needs t1.id and t1.value, and the right-hand side only needs t2.id, so we add two extra projections to shrink the size of the rows emitted from both sides. Naturally, you can compose these two rules: starting from the original plan and applying both, you get the plan on the right-hand side.

The facility we use to combine different rules is the rule executor concept. A rule executor transforms a tree into another tree of the same type by repeatedly applying multiple rules defined in rule batches. An optimizer itself can consist of multiple batches, and a single batch can contain multiple rules. A batch can be executed with one of two strategies: either a fixed-point strategy, which applies all the rules in the batch over and over again until the shape of the query plan stops changing, or a once strategy, which applies all the rules just once and then finishes the batch. Every single rule here is implemented on top of transformations. So this is basically how those building blocks are organized together.

Does the programmer of these transformations have to specify what can be combined together? I don't fully understand. I see that there are rules you want to combine, but who decides what can be combined? Maybe that's what I'm asking.

Yeah, that's a good question. Right now these are hard-coded in Apache Spark, although we could make this configurable so that advanced users can choose which optimizer rules they want, or make it easier to implement your own rules and plug them in.
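For what it's worth, Spark (since 2.2) does expose a public hook for exactly this. A minimal sketch, where MyNoOpRule is a hypothetical placeholder rule:

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A placeholder rule that leaves the plan unchanged.
object MyNoOpRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Registered by setting spark.sql.extensions=com.example.MyExtensions.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  def apply(e: SparkSessionExtensions): Unit =
    e.injectOptimizerRule(_ => MyNoOpRule)
}
```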
In fact, Apache Spark already provides those kinds of extension points. And if we wanted to go one step further, you could perhaps also specify dependencies between different rules, enabling more sophisticated optimizations. Today, though, this is basically hard-coded inside Spark.

So you mentioned that you do, or you want to, expose an API that allows people to modify these dependencies, to modify these rules. But that's not something the average user could even begin to do; this is not something you expose to your run-of-the-mill Databricks customer.

Exactly.

Are you aware of anybody making major modifications to the query optimizer?

Yeah, actually, yes. People who write advanced data sources sometimes want to leverage these internal extension points. For example, the TiDB database from PingCAP: they have a component called TiSpark, where they basically do not reuse the TiDB SQL layer but instead build a Spark layer on top of their key-value store. The TiSpark extension provides extra optimization rules and physical plans in order to read data from TiDB efficiently.

That makes sense. I think Splice Machine is doing the same kind of thing. This is helpful, thank you.

May I ask a question? Yeah, sure. Hi, this is Lin, I'm a PhD student here. You mentioned these two kinds of strategies, fixed point and once, which is kind of interesting. I'm wondering, how do you decide what should be fixed point and what should be once? Is there some sort of pattern or principle you follow?

Yeah. Let me think how to explain this; let's take constant propagation as an example. Say your query contains multiple places where constant folding can happen. You can implement the rule in a sophisticated way, where a single pass folds all the constants; or you can implement it in a very simple manner that only folds one pair of constants at a time and gets applied repeatedly, so that it eventually finds all the opportunities and folds all the constants in the end. At this level it's basically an implementation detail: if you choose the first manner, you can run the rule once, and the optimizer itself is more efficient at runtime; if you choose the second manner, you need a fixed point. There are also certain rules we apply at the end of the analyzer, which we call checkers: we want to make sure the analysis phase detected all the errors and that all the unknown entities have been resolved. Those kinds of rules we only want to apply once. So these are some very rough principles. Maryann, do you have anything to add?

Well, basically, I think fixed point is usually for a batch of a series of rules, where triggering one rule can change the plan in a way that gives rules earlier in the batch more opportunity to apply again. So even if you've written a rule like constant propagation in a way that it doesn't need to be triggered more than once, some other rule may later change the plan in a way that creates more constant propagation or constant folding opportunities. The rules in a batch are related.
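The batch structure under discussion, sketched against Catalyst's internal RuleExecutor API (internal, so details may differ by version); the two rules here are hypothetical no-op placeholders:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

object MiniOptimizer extends RuleExecutor[LogicalPlan] {
  object FoldConstants extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan // placeholder body
  }
  object CheckResolved extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan // placeholder body
  }

  // FixedPoint(100): reapply the batch until the plan stops changing,
  // up to 100 iterations; Once: a single pass, e.g. for final checkers.
  val batches = Seq(
    Batch("Optimization", FixedPoint(100), FoldConstants),
    Batch("Checks", Once, CheckResolved)
  )
}
```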
And then, in my opinion, we don't necessarily need this once policy; it's just a shortcut for fixed point. Overall, every rule should be written in an idempotent way, so that even under a fixed-point policy it reaches a stable state after being applied once.

Yeah. Thanks a lot for the explanation, that's very helpful. Thanks.

Hi, I have a quick question. Sure. You said all the rules are hard-coded by developers. Is there a way to test whether a new rule breaks some previous rules, like a regression framework or something? Do you do that, or is it mostly manual code review?

Yeah, at Databricks we actually have various testing harnesses, like a random query generator and longevity tests and those kinds of things, to help us catch these issues. But if you are, for example, the author of an advanced data source and you want to insert your own optimizer rule, then basically you are on your own. Also, with fixed point, sometimes a rule is buggy in a way that the query plan never converges: it might grow the plan indefinitely, or it might jump between two shapes of the plan repeatedly. We apply a maximum iteration limit there; the default value, I believe, is currently 100. If a rule batch has been applied a hundred times and still hasn't reached a fixed point, we end it, and at that point we issue a warning log or something like that, basically indicating that something is wrong.

I see, that's interesting to know. Thanks. You mentioned you run longevity tests. What kind of tests are those? What do you mean by that?

Oh, actually I shouldn't say longevity tests, because longevity tests are mostly stress tests for the runtime system rather than the query optimizer; they're just long-running tests. But in Spark we now also have plan stability tests for the TPC-DS queries. If you change a rule or add a new rule that ends up producing different query plans, that will be detected. It doesn't necessarily mean you've done something wrong, but at least important plan changes get caught. It's basically a mechanism for you to review the impact of your new rules; the changes are not necessarily bad.

Okay, so the next step is physical planning. In Catalyst, physical planning consists of two phases. In the first phase, we translate optimized logical plans into executable physical plans using so-called strategies. Strategies are basically Scala partial functions that turn logical plans into physical plans. In the second phase, we use a rule executor to make sure the plan is ready for execution. Remember that rule executors convert a plan tree into a different shape of the same type of plan tree; in this case, we use a rule executor to turn physical plans into a different shape of physical plans.
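Before that second phase, here is the first phase sketched in code: the Project/Filter strategy he describes shortly, simplified from Spark's internal BasicOperators strategy (internal API, so treat it as illustrative rather than exact):

```scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}

object SimpleOperators extends Strategy {
  // For plan nodes this strategy knows, return candidate physical plans;
  // for anything else, return Nil so other strategies get a chance.
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case Project(projectList, child) =>
      // planLater defers planning of the child subtree to other strategies.
      ProjectExec(projectList, planLater(child)) :: Nil
    case Filter(condition, child) =>
      FilterExec(condition, planLater(child)) :: Nil
    case _ => Nil
  }
}
```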
As for the second phase, this physical-plan rule executor is used to prepare things like scalar subqueries, to ensure requirements on input rows, such as sorting and partitioning properties, and also to apply certain physical optimizations, like removing unnecessary sort operators. We will see some examples shortly.

For physical planning, an optimized logical plan is translated into an executable physical plan by applying a set of strategies. Here is an example strategy. You can see that the input plan node is of type LogicalPlan and the return type is a sequence of SparkPlan. From this function signature, you can see that Catalyst was originally designed with the vision that the query planner should be able to generate a series of physical plans, after which we would search the plan space and find the optimal one. But as we explained, today we only generate one physical plan. Here are two of the simplest plan nodes in Catalyst, project and filter; there is nothing tricky here, just a logical project node turning into a physical project operator. (I think there's some noise. Okay.) And this planLater function is basically a trigger for other strategies, so we can ignore it here: it tells Catalyst to plan this child logical plan tree using the other appropriate strategies later.

And here is an example of what we mean by ensuring requirements. Let's look at this sort-merge join. For sort-merge joins, there is a natural requirement that both sides of the join be sorted by the corresponding columns. In this case, the join condition is t1.id = t2.id, so the t1 rows should be sorted by t1.id and the t2 rows by t2.id. To meet this requirement, we insert two sort operators here; you can see this is a physical-to-physical plan tree transformation. A physical plan optimization can also jump in here: what if t1 is already sorted by t1.id? In that case, the sort on that side is redundant and no longer needed, and a physical optimization rule can safely remove it.

So, in a nutshell: Apache Spark emerged as a unified analytics and data processing engine, and later developed a relational, structured layer, Spark SQL. We consolidated the rest of the ecosystem on top of it, so that all kinds of workloads can benefit from a unified optimizer and runtime. With the help of the Catalyst optimizer and the data source API, Spark SQL allows users to work with data in any format, stored anywhere, at scale. However, while being powerful, this approach imposes certain unique challenges. First, many data sources cannot provide sufficient or accurate statistics to facilitate query planning. For example, in the use case I mentioned previously, where we wrote data sources for Jenkins and PagerDuty, it is basically impossible to extract accurate statistics from those sources. Second, the storage layer is out of Spark's control, so previously collected statistics can easily go out of sync: anyone can overwrite part of a dataset offline, and it's simply out of your control. These factors imply that maintaining an efficient and accurate cost model is more challenging, and they can lead to suboptimal runtime performance.
There are basically two approaches we are trying at Databricks to address this. The first is Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It contains scalable, self-managed table metadata, so that statistics are automatically collected and stored in a self-managed manner and better optimization opportunities can be leveraged. If you want to learn more, check out delta.io, but this is not the topic we're going to discuss further today. The second approach, which Maryann will introduce now, is adaptive query execution: re-optimizing and adjusting query plans on the fly based on runtime statistics, which are always accurate. We collect these always-accurate statistics in the middle of query execution and then re-optimize and adjust the query plans. This is a new feature available in Spark 3. So next I'll hand it over to Maryann for adaptive query execution.

I have a bunch of questions about what you were doing, so if Maryann is going to talk about it, this is perfect. Go for it.

I'll stop sharing. Thank you. So thanks, Cheng, for going over the Catalyst part. The last thing he mentioned just now was the challenges we face doing cost-based optimization, especially in terms of the up-to-dateness and completeness of statistics. To address this, we've been doing adaptive query execution. The term has been around in the database world for a while now, but in Spark specifically it means we re-optimize in the middle of query execution, powered by runtime statistics. One of the challenges was, of course, stale and missing statistics that can lead to inaccurate estimates. Stats collection can also be expensive, especially if you want more advanced stats like column histograms, which give you more accurate estimates but are, on the other hand, more costly to collect; sometimes customers decide it's probably not worth it for the benefit it might bring. And there are places where estimation is not even possible: user-defined functions, for example, are a complete black box, and there's no way to look inside one and get a selectivity estimate. Some people would argue that hints work for some queries, but one big problem with hints is that they are not automatic at all, so they involve a lot of manual tuning. They also won't work well for datasets that change a lot over time: for example, you might think your dataset is small enough for a broadcast join, but one day that's no longer the case, and you get out-of-memory errors and things like that.

So to address these issues, AQE collects runtime stats and re-optimizes on top of the existing query plan based on those stats. We still do the static planning optimizations as before, but we adjust the plan on the fly as more runtime stats come in. Before we go into the implementation details, there's one important idea in Spark I'd like to go over, so that we understand how and when adaptive query execution is applied.
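(For reference, in Spark 3.0 the whole feature sits behind one configuration flag, off by default; a one-line sketch:)

```scala
// Turn on adaptive query execution for the current session.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```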
Spark is a distributed computation system, and Spark SQL can be looked at as a distributed database. We try to parallelize computation as much as possible, but at some point we have to move data around the cluster. These operations are called exchanges. An exchange is either a shuffle, where data moves from every node to every node, all over the place, or a broadcast, where a whole dataset is sent to each node in the cluster. Each of these exchange points is actually a dividing point for something we call query stages. Inside each query stage, everything executes in parallel across the nodes, and at the boundary of each query stage, data is moved around the cluster. After the shuffle or broadcast exchange, another stage starts, where computation is again conducted in parallel.

The boundaries of these query stages are the optimal points for applying adaptive execution. First, a stage boundary is a breakpoint for operator pipelines: there is no way to pipeline operations across different query stages. Second, these points are the best time to get accurate runtime statistics, because the way Spark implements these exchanges is to materialize the intermediate result, writing it to disk at the end of each query stage, in order to perform the shuffle or broadcast. When you write the data down, you get the exact size of your intermediate dataset.

And here is the overall workflow of how this works together with the idea of query stages. First, you start all the leaf stages, meaning the stages that don't depend on the results of other stages; you kick those off. Then, whenever a stage completes, you have more stats, and that is when you can go over the plan, apply some optimizations, and see whether your query plan should be adjusted given the new stats. A finished stage also means that stages depending on it may now be ready to run, so you check whether any stages have their dependencies cleared and can be kicked off, and you start those dependent stages. You repeat this process, waiting for stages to complete and new stats to come in, re-optimizing, and starting the rest of the stages as you go; at the very end you reach the top of your plan and the whole process is done. Do we have any questions up to this point?

So it sounds like you're doing the old Graefe-style parametric optimization, right? There's no switch operator, you're not generating multiple plans at the same time; you're just going back to the optimizer.

Yeah, exactly.
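Before the Q&A continues, here is the workflow she just walked through, sketched as a loop. Every name here is a hypothetical stand-in, not Spark's actual internal API:

```scala
import scala.collection.mutable

trait Stage { def dependencies: Seq[Stage]; def finished: Boolean }

def runAdaptively(
    stages: () => Seq[Stage],   // stages of the current (re-optimized) plan
    submit: Stage => Unit,      // kick off a stage asynchronously
    awaitOne: () => Unit,       // block until some running stage finishes
    reoptimize: () => Unit      // re-plan the rest using the new runtime stats
): Unit = {
  val started = mutable.Set.empty[Stage]
  def startReady(): Unit =
    stages().filter(s => !started(s) && s.dependencies.forall(_.finished))
      .foreach { s => submit(s); started += s }
  startReady()                  // kick off the leaf stages first
  while (stages().exists(!_.finished)) {
    awaitOne()                  // a stage completed: exact sizes are now known
    reoptimize()                // adjust the not-yet-started part of the plan
    startReady()                // launch newly unblocked stages
  }
}
```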
So we're going back to the optimizer; we actually reuse the optimizer and the physical planner to do this. One benefit we get from this is that the decisions are unified, whatever the input. Also, we're not running the whole optimizer, because that's costly; the Spark optimizer with all its rules present is expensive, so we just run the subset of optimizer rules that can definitely benefit from the kinds of runtime stats we collect. For example, since we are not collecting cardinality stats, we are not going to rerun the join reordering rule; but rules that we can see a possible benefit from, we do rerun.

I have a quick question. I'm wondering what kind of stats you collect. Do they have any relationship with the stats already available before query execution, or do you collect new things from scratch? Are those stats independent?

Well, actually, no. The types of stats we have right now in Spark include, for example, row count and byte count, the data size; then column stats, for example NDVs, the number of distinct values of certain columns, and min and max values per column; we even have column histograms, though I don't think they're actively used in Spark at this moment. But for adaptive execution right now, only the very basic stats, data size in bytes and row count, are collected, for the reason that collecting column stats at runtime would be too costly.

I see. Thanks.

Yeah. Though it's not implemented, in theory you could definitely enable that if your specific case would benefit from join reordering at runtime; you could decide to collect some column stats to be able to do that.

Sure. Thanks.

So I'll go ahead with the three features we have already implemented. I think we have one more question. Elena, you have a question?

Yeah, hi, I'm Elena. After a shuffle, I think the data distribution among different machines can be quite different. Do you take that into account in your optimization?

Here's the thing: you can't really change the data distribution, because it's required by the next stage. The data distribution out of a shuffle is whatever the next stage requires; you can't really change that. But we do collect things like per-partition data sizes, which can be helpful for some physical planning decisions. Does that answer your question?

Yeah, thanks.

All right. Hi, can you hear me? Okay. I do have some questions. Based on the three things you listed, I can imagine how the first two work: after the previous stages, maybe the partitioning we already have is compatible with what the next stage expects, or maybe based on the new statistics we can switch to broadcast joins instead of sort-merge joins, which can speed up execution. But for the third one, could you give an example? How do you optimize skew joins based on actual statistics?

Okay.
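An aside to make the stats distinction above concrete: the pre-collected, static stats come from ANALYZE TABLE, while AQE's runtime sizes need no such step. A sketch against the t1 example table:

```scala
// Table-level stats: row count and size in bytes.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
// Column-level stats: NDV, min/max, null counts, etc.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")
```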
This slide is just an agenda for what we're going to talk about next. I'll take these items one by one and talk through how they work with simple examples; by the time we get to the end of my talk, we'll probably have a pretty good idea of how skew join works too. The three major optimizations in adaptive query execution in Spark 3 are: dynamically coalescing shuffle partitions, which means combining shuffle partitions; dynamically switching join strategies, which basically means going from sort-merge join to broadcast hash join; and skew join optimization, where you detect skew at runtime and try to solve the skew problem for your join.

The first one: we just talked about shuffles, and an important factor in shuffle performance is the shuffle partition number. What is that? On the mapper side, you have to decide how many buckets you're going to shuffle your data into, and each bucket becomes a parallel task in the next query stage. For example, the default number is 200; if your shuffle partition number is 200, the number of tasks in your next query stage is going to be 200. If you're dealing with 10 gigabytes of data, each of your tasks operates on roughly 10 gigabytes divided by 200, about 50 megabytes. As the data size grows, if the shuffle partition number doesn't change, each task deals with a larger and larger dataset. That's why this number is crucial to query performance.

Say the data size is big while your shuffle partition number is small. What happens, for example, is that your next query stage does a sort, which is the most common case, and if each task is dealing with a large piece of data, the sort operation keeps spilling to disk. In extreme circumstances the task never finishes; it just keeps running, because it keeps spilling and keeps doing garbage collection. On the other side, say your dataset is small and your partition number is high, so each task deals with just a few kilobytes of data. That's not efficient either: you have a large number of tasks that the Spark scheduler has to schedule across the cluster; small per-task datasets mean a lot of inefficient I/O; and each task has some small overhead at the beginning to set up its execution environment. And in Spark right now, once your query starts, the shuffle partition number you've decided on never changes throughout the whole query, and most likely your application will use a single shuffle partition number across different queries.
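The knobs involved, using the Spark 3.0 configuration names (the coalescing behavior itself is described next; the values here are illustrative, not recommendations):

```scala
// The static shuffle partition number she refers to (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "200")
// AQE's coalescing: start deliberately over-partitioned...
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
// ...then pack small partitions toward this advisory per-task target size.
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
```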
It's very common to see the data size expand or shrink within a single query, across different stages. For example, if you're doing a join, your data size will most likely grow; if you're doing an aggregate, it's the other way around, it shrinks. Having one universal partition number doesn't work in this case: you don't want to set it too high, you don't want to set it too low, no good either way. The solution here, and it's not a perfect solution, I have to say, is that we start with a pretty high partition number, but at the end of each stage, after getting the overall data size for the next stage (because the output of your current shuffle stage is the input of your next query stage), you have the exact size of your dataset and can decide what number of tasks is appropriate for the next stage. Once you have a rough target size for your parallel tasks, for example the default setting of 64 megabytes, you pack the small partitions (because you tend to be over-partitioned at the beginning) into bigger groups and put each group into a single task, to avoid tasks dealing with too-small datasets. But it has to start with a sufficiently large initial number first; that's why it's not a perfect solution. Still, our customers often over-partition anyway, because under-partitioning, where the partition number is too small and each task too large, is more of a problem than the other way around. And if they're already over-partitioning for performance reasons, this partition coalescing improves performance further, because it handles the stages where data shrinks, and we can pick an optimal number of tasks.

Do we have more questions here?

How many more slides do you have? We're running out of time.

Oh, I'm sorry. Let me go over this quickly; let's just skip this one and see the next optimization in AQE. The next one is dynamic join strategy switching. Again, this is about the wrong estimates you can get at static planning time. If your estimate is off, most likely it's an overestimate, because we try to be very conservative when we estimate, so a lot of joins get planned as sort-merge joins, which are less efficient than broadcast hash joins. When AQE sees at runtime that one side is actually small enough to be broadcast, it changes the plan to do a broadcast hash join. I'll skip these slides, because someone is very interested in how skew join works. Skew join is, I think, kind of the magic part of AQE. As I said, we can actually get the per-partition sizes at the end of each shuffle.
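The skew handling she describes next is likewise driven by a few Spark 3.0 configs: roughly, a partition is treated as skewed when it is both several times larger than the median partition and above an absolute byte threshold (the values below are illustrative):

```scala
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is a skew candidate if it is this many times the median size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
// ...and also larger than this absolute threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```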
So you can easily detect that one partition, or just a few partitions, are going to be super large. Without skew join optimization, you probably end up in a situation where some tasks are dealing with a huge amount of data while the rest of the tasks finish very quickly. What is worse, the skewed partitions cause other serious problems, like spilling: a sort-merge join has a sort operation, and it just keeps spilling to disk. And if a single partition happens to be skewed on both sides, it's even worse: when you join them, the data explodes for that specific task. It's easy for AQE to detect such extremely large partitions, and the solution in that case is to split the large partition into several sub-partitions. Here you can see that partition A0, on key a0, is significantly larger than the rest of the partitions from table A. After we detect that, we split this big partition into smaller splits, three splits here, and each split is joined against the entire corresponding partition from the other side. And, though it doesn't show in this slide, just to go a little further: what happens if the other side is also skewed on this partition key? We eventually end up with a Cartesian product of all the splits: if this side is split into three sub-partitions and the other side into three as well, we end up with nine tasks, three by three, to handle the skew join. In a lot of cases we've seen, in our benchmarks and in customer cases, this works very effectively.

The last page is about the TPC-DS benchmark: 32 queries got more than a 1.1x (over 10%) speedup, and query 77 got over an 8x speedup. This only includes the effect of the first two optimizations in adaptive execution, because the TPC-DS benchmark doesn't really generate data skew. And that's the end of my part today.

Okay, awesome. We're a little over time. Is there any one quick question? Mingjia, do you have a question?

Hi, yeah, thank you for the explanation. My question: for the third optimization, to get this applied in Spark 3.0, do we simply enable the config, or do we have to pre-collect some statistics on the tables the queries run on?

Sorry, I can't hear you very clearly. Can you take the question?

I think I got the question. Basically: do we need to pre-collect any statistics in order to use this?

No, we don't. I mean, I think it works; here's the thing.
We did a benchmark on both: the baseline is a dataset with pre-collected stats, and the other experiment is on a dataset with no stats at all. You can see it's more effective on the latter, because the baseline is lower, but you should expect it to work pretty well without pre-collected stats.

All right, one quick question. Is it quick?

Yeah, it's a small question. I've seen some talks, by RelationalAI and others, where they were using something called worst-case optimal join algorithms in order to optimize these skew join queries. I don't have much background on these, but I just wanted to ask if you've worked on them, or have any thoughts about them.

I don't know the specific thing you're talking about, but for skew join, I think it's just best suited to AQE. At Databricks we had a proprietary feature doing static skew join optimization, and for that you basically have to know a lot of information: which tables and which columns are most likely to have skew. Whereas this is so natural to do in adaptive execution, because at the end of each shuffle you just get the data size of each partition, and you know exactly how skewed it is, which partitions have the skew, and how many splits you have to break it into.

Yes, I can answer this question. For worst-case optimal joins, I think the only systems I know of that support this are LogicBlox and RelationalAI, the new guys, who are the ex-LogicBlox guys. And then I think the Germans have it, either in HyPer... sorry, take that back, I think it was in Umbra, the new one, they put it in there.

Okay, I have to go get my baby. I appreciate you guys being here and doing this. This is awesome.