The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and Postgres configuration at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Alright, welcome everyone to another talk in the Vaccination Database Tech Talk series. We're super excited today to have Venky Raghavan. He is a senior manager of product at Greenplum, and we've been working with him for several years now. Venky got his PhD at WPI back in New England, which is where I met him several years ago. We especially appreciate him being here because today is actually a holiday for VMware (it's not one for the university), so thank you for coming in on your day off to talk with us. As always, if you have any questions for Venky as he gives the talk, please unmute yourself, say who you are, and ask your question at any time. We want this to feel like a conversation and not just him talking into the empty space of Zoom for an hour. So, Venky, thank you so much again for being here on your day off.

Thank you for having me here. Hello everybody. My name is Venky and I want to talk to you about query optimization. You all focus very much on query execution and storage in in-memory databases, so this may be a little bit away from your core area of interest, but hopefully some of you are interested in query optimization. I want to share the past with you: it has been more than ten years since I started at Greenplum in 2010. When we started, we set out to build this query optimizer, and we thought we could ship an enterprise-level query optimizer that can handle millions of queries on terabytes of data, and it took four or five years of hard work to get it shipped. So I want to show you, from an engineering perspective as well as a technology perspective, how we achieved it and what intricacies you have to know when you are building such an optimizer.

I'm going to go through some of the code. I'm not going to do a code review, but I'm going to point to certain parts so that you can understand what I'm talking about. Many people will come to you and say, hey, we are modular, we are extensible, and so on, but once you look at the code it feels like, where the hell do I start? So I want to show you where it's modular, how it's modular, and how we built it. There will be some back and forth between the code and the slides.

So, who am I? As Andy said, I did my PhD at WPI, and before I graduated I got tired of being poor and tired of asking my husband to take me on vacation. So I said, you know what, I'm going to get a job at Greenplum. My former manager, Mike Waas, who is now the founder of Datometry, likes to hire PhD students, because the difference between how much you get paid as a graduate student and how much you get paid at your first job is so high that you sometimes accept a low offer. So, watch out. Anyway, I started off as an engineer, then moved on in 2016 to become a product manager, and I currently handle query optimization and federated query processing. I have one of my former colleagues, Francesco, on the call. Hi, Francesco. For federated query processing we build connectors: a Hadoop connector, a Hive connector, a JDBC connector, an S3 connector, or any cloud storage.
I also recently took on machine learning and graph processing, so I manage these three products and I have a great team working with me.

The first couple of slides are an introduction to Greenplum. I wanted to send out a poll beforehand to see how many of you know parallel databases and such, but last week was rough. So, a short introduction to Greenplum. Yes, Andy, you had something? Just a note that not everyone here is one of my students; we have folks from IBM here, for example. So it's not all my students. All right, thanks for that.

Greenplum is a shared-nothing architecture. We recently renamed the master node to coordinator, and we still have to clean up our documentation to make that consistent. You have a coordinator node connected over a high-speed Ethernet switch to multiple segment hosts, and each segment host has multiple segments; a typical architecture has four segments per segment host. You take your data and distribute it across these segments, with no duplication except for replicated tables, but let's keep those aside. You distribute the data across the segments, and then you can do joins or selects locally rather than shuffling tuples across. So now you can have petabytes of data stored on these different segments. We have Greenplum instances with thousands and thousands of nodes; Samsung has one, for example, and a couple of other customers do too, so they have that many data processing units.

The way you distribute the data is with a distribution policy: either random distribution or hash distribution. One thing I want you to note: you can think of each of these segments as nothing but a Postgres instance. That is the execution side, and then you have the data distributed across them and you query it.

A quick word on random distribution: as the name suggests, you have these orders tables (I have color-coded the order and line item order keys), and you randomly spray the rows from your coordinator node across all of the different segments. So you may have two line items from the same order land on different segments. For example, here you have order one appearing on two different segments. That means if you want to do a join you need to reshuffle things. Instead, you can use a hash distribution, where all of the rows for order one land on a single segment. The beauty of hash distribution is that you can now have co-location of joins. For example, if you are joining line items with orders and the join condition is on the order key, then each segment can process its share of the orders/line item join locally and ship the results to the coordinator node, which presents them to the client.

Has anybody asked for range partitioning? You don't support range distribution; has anybody asked for it? So, there are two kinds of data placement. One is this distribution, and then there is partitioning, where within orders you can partition based on, say, dates. Within a partition you can have multi-level partitioning: the first level could be year, the second level could be state, and the third level could be Timbuktu or whatever, say the color of the product or something.
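To make the distribution story concrete, here is a minimal, self-contained C++ sketch. It is illustrative only (these are not Greenplum's actual hashing or routing functions): it shows why hash-distributing both tables on the join key gives you a co-located join, while random distribution forces a redistribution.

```cpp
// Minimal sketch (not Greenplum's actual code) of why hash distribution
// enables co-located joins: rows with equal distribution-key values always
// hash to the same segment, so a join on that key needs no redistribution.
#include <cstdint>
#include <functional>
#include <iostream>

constexpr int kNumSegments = 4;

// Hash distribution: the target segment is a pure function of the key.
int SegmentForKey(int64_t distKey) {
    return static_cast<int>(std::hash<int64_t>{}(distKey) % kNumSegments);
}

// Random distribution: any segment may receive the row (round-robin here).
int SegmentRandom(int64_t rowCounter) {
    return static_cast<int>(rowCounter % kNumSegments);
}

int main() {
    // orders distributed by order_key and lineitem distributed by order_key:
    // matching keys land on the same segment, so the join is co-located.
    for (int64_t orderKey : {1, 2, 3, 42}) {
        std::cout << "order_key " << orderKey
                  << " -> orders on segment "  << SegmentForKey(orderKey)
                  << ", lineitem on segment "  << SegmentForKey(orderKey)
                  << " (co-located join)\n";
    }
    // With random distribution the planner cannot assume co-location and
    // must add a redistribute or broadcast motion before the join.
    std::cout << "randomly distributed row #7 -> segment " << SegmentRandom(7) << "\n";
    return 0;
}
```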
So you can have multi-level partitioning; that is what we call the vertical partitioning, while the distribution across segments is the horizontal partitioning. We currently don't support leaf-partition-to-leaf-partition joins, but that is one of the active research items we are working on.

Let me just scribble here. Say you have a sales table and you divide it by year, so you have 2022 here, and then you further divide it by, say, state, so you have Massachusetts and so on. This is the distribution and this is the partitioning. And after partitioning by state you could potentially say, hey, I want to go by county. So the 2021 sales are all here, but within the 2021 sales there is a state and a county, and now you can still join per county: what are the sales per county? Say, for example, if you wanted to find out whether dead people voted in the last election by county, you could do it this way. Just kidding. Okay, let me clear the annotation. Did I answer the question?

But you can't say "partitioned by 2020 to 2022"? You can't do that? Currently you don't do that; did anybody ask for it? No. What they were trying to do is partitioning within a single segment, and that is what the vertical partitioning helps with. What they wanted is, given a particular year... sorry, say the first level was state, everything within the state, and then the second level is year, they want to have range joins. For that you have partitions. But the problem is we don't do leaf-to-leaf joins right now, if that makes sense.

The next section: that was the quick introduction to Greenplum. Disclaimer: I took a couple of these slides with citation, as they requested; the first couple of slides are Dr. Christen's, and I'm giving the citation up there.

So let me give you the life journey of a query. You have this movies schema and you want to join movies with actors. Your parser takes this query and generates a parse tree: you have a projection, a selection, and a Cartesian product. Then you do some rewrites and split the two predicates, the join condition on title and the selection condition. Then you take the join condition and turn the Cartesian product into an inner join. From there you say, oh, I can do further optimization: I can push the selection down into movies, and then I can decide whether to use this index or that index, and whether to do a nested loop join, hash join, merge join, whatever join options are available. That is the basic life cycle of query optimization. This rundown I've just given, how we implemented it and how the Cascades optimization framework helps us, is what I'm going to pitch to you for the rest of the talk.

We are based on Postgres, and Postgres has a very established planner with twenty years of research behind it. So why did we decide to build a new optimizer? I want to pitch that in the next two or three slides. Basically, most optimizers, at least when we started, split the whole problem into two things: join ordering, which was the bread and butter for everybody, and then the rest of it.
And on join ordering there is so much research, so much research that has been done: how to enumerate the plans, what kind of orderings to consider, how big the join graphs can get, how to exploit correlated statistics. Beautiful work, and Sybase keeps coming out with nice research every year or two on how to handle large join graphs. That's great. But if you look at one of the TPC-H queries with six tables (this is from a paper written by Mike Waas and César Galindo-Legaria in SIGMOD 2000), they showed that in the parallel database world the search space that join ordering covers is only about 100,000 plans. There are billions of other plan alternatives, which interplay with aggregates, CTEs, correlated queries, Apply operators, all of those things, and that space is still not touched. How does the rest of that green space interact with the join ordering in the black square? That is what the Postgres planner lacks. They are making a lot of inroads every year, definitely. But when we started the project at Greenplum, we said: in this whole space there is no interplay between join ordering and all of the rest.

Now you'll ask: what is this "rest of the things"? The rest is basically: if you have partitions, how do you do static and dynamic partition elimination? How do you handle nested queries? How do you co-locate your distributed queries? How do you handle aggregates: if you want a local aggregate, an intermediate aggregate, or a final aggregate, can I push aggregates below joins and get co-location of the execution? How do I do window functions? How do I handle the CTEs from TPC-DS in an efficient way? How do I pull predicates up and push them down so that the plans are better? What if the data is replicated, and what if the data is distributed? All of this is what Greenplum cares about. I need to worry not just about the join order but about the interplay between all the pieces of complex queries. One thing I want you to know, if you have not worked on an enterprise product: the queries I worked with during my PhD, TPC-DS and such, look like child's play compared to customer BI-tool-generated queries. Some of them are twenty or thirty pages of auto-generated SQL, and you need to figure out all of those parameters and components and generate a good plan for that.

All right, so we wanted to build an optimizer, and when we looked around (again, this was a while back, when I was graduating), consider cars. I don't know whether you all know this, but if you look at an Infiniti, I think it has a Mercedes engine, and most Hyundai or Volkswagen models share the same engine with minor tweaks. So you have many models that share the same engine. But when we started, everyone was writing their own storage layer, their own execution layer, their own optimization layer. Basically what we saw was one model, one engine. Right now many of the enterprise operators have successfully separated storage and compute. But what about query optimization? We can have a separate Spark execution engine, we can separate execution from the system. What about optimization? We need to separate that too.
What Orca wanted to prove is: if you can separate storage and compute, and you can separate the execution engine from the rest of the system, you should also be able to separate query optimization out. That means anyone who wants to build on top of it, with the same API, should not need six years to build it. It's not trivial, but maybe it takes about a year and a half. That was the pitch.

And here are the goals we wanted to achieve. Greenplum runs millions of queries on tens to hundreds of terabytes of data, and it is distributed, so performance is very critical. If you get a wrong plan, it's not a matter of a query going from 10 seconds to 100 seconds; it will never finish. Morgan Stanley, for example, is our reference customer. Several of their reporting queries cannot run with the Postgres planner, queries where they have to go to the SEC and say, hey, this looks like a problematic stock trade, so I need to report it. Those queries will not work with Postgres because the planner focuses primarily on joins and not on the rest of the problem.

The second thing: when I came to Greenplum, it was on Postgres 8. Now we are on Postgres 9.5, and GP7, which is the master branch and not yet released, is on Postgres 12. That means the planner, the executor, and the storage are changing underneath us. Every time we bump the Postgres version for the rest of the ecosystem, Orca, the optimizer, needs to keep up with it without major surgery. That means our API has to be clearly defined and distinct enough from the system; the only thing I should need to change is the driver that connects to the database. I have to make sure it is really extensible and really modular, because if Postgres comes up with, say, a BRIN index or some other index type, I have to support that. How do I do the cost model? How do I do cardinality estimation for spatial joins, which Postgres supports? How do I build all of those things while the database underneath is changing? That was one of the goals.

So Postgres, as I said, was not designed for distributed data, and the code complexity of the planner (I'm talking about the Postgres planner) is quite high. It's okay for quick fixes, singular features that you want to do, but if you want a whole swath of features that build on top of each other, it's much harder. Adding new features is challenging, and that's why we felt it was time to build a new query optimizer designed for big data. We didn't reinvent the wheel: we used the Cascades optimization framework from Goetz Graefe and others. It is a beautiful framework. We said, let's learn from twenty or thirty years of query optimization knowledge and build something we can improve quickly, that enables future development, and that can connect to multiple platforms. On connecting to multiple platforms we kind of failed, but I will tell you why. Before that, yes? Good question. In terms of multiple platforms, I mean, the other platform was HAWQ; you'll come back to that later.
But Postgres now, as you said, they add new features and you have to support that in your cost model and incorporate it. I guess the question is: there are a bunch of knobs you can tune in Postgres to make the optimizer do certain things. Do you ignore all of that and basically say, I have my own cost model, my own understanding of what Postgres is actually going to do, and therefore I can do a better job than their planner?

Correct. The first avatar was: I don't care what Postgres is doing. I'm not going to take on all of the bad habits Postgres has; I'm going to figure it out on my own. I only use what your executor supports. Say Postgres didn't support merge join; then I'm not going to generate a merge join. I need to know how the Postgres internals work: how the partitionings are represented in the catalog, how the data types and casting are represented, how collations are represented if you have to handle collation. But I can generate a plan, and I can have a cost model that is totally orthogonal to Postgres. The problem now is that, once Orca was established, people want a convergence, at least a one-to-one mapping: hey, these are the Postgres knobs, are there some you can migrate over? And we are a little bit muddying the water there. For example, the page cost is a very commonly used knob that people use to hint the planner into picking index scans instead of sequential scans. There are certain knobs in Postgres that are really important for customers, the BI customers, and those we want to migrate. But we are trying to keep the balance of what the right thing to do is and at what level, because we just don't want a one-to-one mapping for everything.

So here is what happens in the modular structure. This whole system is, say, Greenplum. The parser generates a parse tree, and then we convert it into what is called DXL, the Data Exchange Language; that is how Orca communicates with the world. We have a small driver, a translator, sitting inside the host system that can take Orca's language and convert it to what Postgres understands. While we are optimizing, there is a lot of metadata we need to know, so we go to the catalog to find it and bring it back. With all of this metadata we can generate a plan, which can then be executed. Basically, Orca is just communicating with the host system, whatever the host system is. Currently the easiest host system is any Postgres-based host system; you could pretty much lift and shift things with some amount of work. We actually had a project to do this with Postgres itself, but Orca's optimization cost is a bit higher because we search a larger plan space, and for short-running queries, which is what OLTP queries are, the cost is sometimes too much for the customer to bear. We have something called layered optimization for that, but currently our primary focus is Greenplum; the Postgres work is something we keep in the back corner.

So let me focus on the metadata provider.
Basically, the idea is that Orca communicates with the rest of the world and says: hey, I have an int8 and an int4 and I want to compare them; you tell me what the comparison operator is. And the host system comes back and says, here it is. Hey, what is the hashing operation that your system uses for this data type? You tell me; I don't want to know. Hey, I have this constant expression, one plus one. If it is not an integer, your executor engine could produce a different value for it, say for two dates, so you tell me how to compute it. So when I have a constant expression, I hand it over and say: host system, you tell me what the value is, and I'll turn it back into a constant. Any kind of metadata, such as what is your partitioning scheme, what is your distribution policy, what does your table look like, what are the indexes, what are the constraints: for all of that information we go to the host system. We have something called a metadata provider API that can connect to it.

What this gives us is a file-based provider: I can pull the whole relevant slice of the database into a file, and anything I need I can get from that file-based provider. Why is this useful? Let me jump to the code and show you. This is a simple test, and this file pretends to be the database. It says: here is the SQL (that part is just a comment), here are the hints I'm applying, and here is the metadata. Here are the statistics for some of the columns. This is the entire statistics bucket that you have in pg_statistic, but represented in our format. Further down it may have, for example, the cast functions between int types, which casts are coercible between these two data types. This particular example is mostly statistics; I should have pulled a better one. But you could have data, columns, types, partitions, any of those things. The file will provide whatever the optimizer needs, whether it is operators, statistics, schema, indexes, or constraints.

You said that, and it wasn't clear: if you have a one plus one in an expression, like "where two equals one plus one", is Orca going to evaluate the one plus one, or do you go ask the database server? If it is an int we handle it; if it is not an int, we do what is called constant expression evaluation. We say, this is a constant, but host system, you tell me what the value is, because I don't want to be in the business of guessing what the host system would execute it as, if that makes sense. Yes, I understand. I remember the MySQL folks talked about how their query optimizer would actually run SELECT queries in the middle of planning a given query. That is a little bit different. What you're describing is, say, a subquery that could actually be evaluated to get a constant. You could have A equal to (select max from foo), right?
You could just take "select max from foo", go execute it, come back and tell me the value, so that instead of a subquery I have a constant. Yeah. That is a little bit different; that is called an init plan in Postgres and Greenplum. So that is an orthogonal question. Okay. All right. Thanks.

Okay, I'm going to skip this. One of the visions we had for Orca was to have multiple databases using one very powerful optimizer engine: you could have a dedicated optimizer service and just send all of your optimization jobs to it, because optimization can take a lot more memory and you don't want your coordinator node to be tied up if you are searching a large plan space. But that is maybe a pipe dream. I don't know.

One of the things we wanted to make sure of, since we keep adopting new Postgres versions, is that if anything is affected, the system stays very modular and extensible. For example, the memo need not change; the memo is how we represent the plans, the search space. And for operators, if you need new logical or physical operators, you just create those operators. The analogy I use is a go-kart race: in that big gasoline-filled room you have new karts, new people joining the race in between, right? Query optimization is that go-kart race, and the track is the query engine, the xforms, all of those things. If you want to add a new operator, you just put it on the track so it joins the race, rather than rebuilding the whole thing. I'm going to show you how we do this, but that was the goal.

So what does that mean? We divide the Orca architecture the following way. There is the optimization engine. If you have a new operator, say Jennifer Widom or someone comes up with a new join operator, all you have to do is define that operator, whether it is logical or physical, and how you go from one to the other, and Orca takes care of the rest, provided it meets some criteria that I will show you. Then there is property enforcement: say I want something to be ordered a certain way; how do I enforce the property of ordering? Another piece is the cost model. We had multiple cost models; we built three. When I joined, Mike always joked that every cost model is some ghetto-ass laptop or desktop under someone's desk where somebody ran a linear regression, came up with some constants, and that became the cost model. And that is actually sometimes how cost models are done. So we came up with multiple cost models across different hardware, on the cloud and on standalone systems, and the user can pick which cost model they want, and it does not affect the rest of the query optimization process. The job scheduler, the memo, all of those pieces don't need to care what operators exist, what cost model you are running, or what transformations you are doing. The engine is just a machine that says: this is what I need, and this is how I'm going to make the cookies.
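As a rough illustration of the metadata provider idea described above, here is a small sketch. The class and method names are hypothetical (this is not Orca's actual provider API), but it shows the shape of the design: the optimizer only ever asks questions through an interface, and a file-backed test provider can stand in for the real catalog, exactly the way the test files shown in the talk pretend to be the database.

```cpp
// Minimal sketch of the metadata-provider idea (hypothetical names, not
// Orca's actual classes): the optimizer asks questions through an interface,
// and the host system -- or a file-based test provider -- answers them.
#include <map>
#include <optional>
#include <string>

struct ColumnStats { double ndv; double nullFrac; };  // simplified histogram stand-in

class MetadataProvider {
public:
    virtual ~MetadataProvider() = default;
    // "What hash operator does your system use for this type?"
    virtual std::string HashOpForType(const std::string& typeName) const = 0;
    // "What are the statistics for this column?"
    virtual std::optional<ColumnStats> StatsFor(const std::string& table,
                                                const std::string& column) const = 0;
    // "You evaluate this constant expression; I don't want to guess."
    virtual std::string EvalConstExpr(const std::string& expr) const = 0;
};

// A provider that "pretends to be the database", like the test files shown.
class FileMetadataProvider : public MetadataProvider {
public:
    std::string HashOpForType(const std::string& typeName) const override {
        return "hash_" + typeName;                       // canned answer from the dump
    }
    std::optional<ColumnStats> StatsFor(const std::string& table,
                                        const std::string& column) const override {
        auto it = stats_.find(table + "." + column);
        return it == stats_.end() ? std::nullopt
                                  : std::optional<ColumnStats>(it->second);
    }
    std::string EvalConstExpr(const std::string&) const override {
        return "2";                                      // e.g. "1 + 1" pre-evaluated
    }
    std::map<std::string, ColumnStats> stats_{{"orders.order_key", {150000.0, 0.0}}};
};
```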
Let me dig into the different steps of query processing in Orca and how we do them. The first step is the preprocessing step: basically, anything you can decide to do without any costing. It's like saying you should walk 10,000 steps a day; it's a no-brainer, you should do it for your health, you should eat less fatty food, without New York City having to tell you. These are purely logical things you should always do, so we do them right off the bat. The next step is exploration, which means logical-to-logical transformations: you have a logical plan and you convert it into an equivalent logical plan. Then you do statistics derivation, then implementation, from logical to physical. Then, since we are an MPP system, we handle all of the MPP-specific issues we face. And then we do costing and plan generation.

What goes into preprocessing? For example, say you have a derived table: select star from (select star from foo order by A). There is a derived table with an ORDER BY inside it. The SQL standard says that the order inside a derived table is not guaranteed outside, so you can remove that ORDER BY, as long as there is no LIMIT; if it has a LIMIT, that is a different story. Why would you sort a billion rows for an ordering that is not even preserved outside? Another one: collapsing of joins and collapsing of unions. If I collapse joins, I can look at the whole join graph at once. Another: what if there are predicates on a particular table? Say I know that for table foo all of the values are less than 10, and I'm joining with table bar, which has values from one to a million. I can take the predicate from foo, all values less than 10, and since I'm joining on that same column, I can infer a new predicate on the other side, so I don't have to hash all of bar if I'm using a hash join. There are plenty of simple, effective rules like this that you will always apply irrespective of cost or skew or data distribution, so we apply them up front to make sure things are sanitized before we enter query optimization proper.

You're doing this on a logical query plan, or how? Yes, on the logical query plan. Another example: if you know one subtree is always going to be false, because constant expression evaluation tells you that a predicate like "one is not equal to one" is false, then the whole subtree is false and you can remove it. Any transformation you can do without any cost, skew, or distribution considerations, just do it beforehand.

Then you start the exploration. The query I'm going to use is T1 join T2 on A = B, ordered by A. The first table, T1, is distributed by T1.a, and the second one is distributed randomly.
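Before following the T1/T2 example into the memo, here is a tiny sketch of the kind of cost-independent preprocessing rewrites just described. The node representation is hypothetical (this is not Orca's preprocessor); it only shows that these rules are plain tree rewrites that need no statistics or costing.

```cpp
// Tiny sketch (hypothetical node types, not Orca's preprocessor) of two
// cost-independent preprocessing rewrites: dropping an ORDER BY inside a
// derived table that has no LIMIT, and pruning a subtree whose predicate
// folds to constant false.
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string kind;                       // "OrderBy", "Select", "Get", "FalseFilter", ...
    std::vector<std::shared_ptr<Node>> children;
    bool hasLimit = false;                  // only meaningful for "OrderBy"
};

std::shared_ptr<Node> Preprocess(std::shared_ptr<Node> node) {
    if (!node) return nullptr;
    for (auto& child : node->children) child = Preprocess(child);

    // Rule 1: ORDER BY in a derived table without LIMIT guarantees nothing
    // to the outer query, so drop the sort and keep its input.
    if (node->kind == "OrderBy" && !node->hasLimit && !node->children.empty())
        return node->children[0];

    // Rule 2: a filter that constant-folds to false makes the whole subtree
    // empty, so replace it with an empty result.
    if (node->kind == "FalseFilter") {
        auto empty = std::make_shared<Node>();
        empty->kind = "EmptyResult";
        return empty;
    }
    return node;
}
```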
That is, T2 is distributed randomly, I'm joining them on A and B, and then ordering on A. So I have an inner join with Get T2 and Get T1 underneath. What I do is push this into the memo. You can think of the memo like an Excel sheet where each row represents a group of equivalent things. So I put the plan into a table where the first row says: I am a join, my first child is group one, my second child is group two; Get T2 goes into group two and says, I am T2; and Get T1 goes into group one. Basically you shred the plan into a tabular format. Why am I doing this? Because then all I have to focus on is a single level. Every time I want to optimize a join, I just look at that single level; I don't have to look at the whole tree to figure out how to do it. It's top-down query optimization: you look at each level and then you trigger the rules there.

Now you'll ask, how do you do this? That's what I'm getting into. In exploration I say, hey, I want to explore this join, so I apply associativity and commutativity: I flip the join order. Instead of [group one, group two] I produce [group two, group one], and it stays in the same group, group zero; I'm just adding the flipped join order as another alternative. What does this mean? It covers both join orders: an inner join where T1 is the outer side and T2 is the inner side, or T2 is the outer side and T1 is the inner side.

Or say you had an aggregate, and you say: I want to split it into a local aggregation that I can do on each segment, then redistribute, and then do a global aggregation. What do I do? Within that group (say it was group four that held the aggregate), I look at the aggregate operator node and split it. The global aggregation stays in the same group, and the local aggregation gets inserted as a new group as its child. Is this clear? Any questions? Okay.

Now, whenever I'm doing any form of exploration, I always want statistics along with it; steps one and two are always interleaved. How do I know what I need? The way the Cascades framework does it, there is something called required properties and something called derived properties. Required properties are what I, as a parent node, need my children to provide. Here I'm only showing physical and logical operators, but you could have scalar operators too; they also live in their own groups. Say, for example, I am a function: I have to tell my parent that I am a set-returning function, meaning what comes out of me is not a single scalar row but a set, because that may matter for how the data is going to be processed. If I know for a fact that I am possibly a null-returning function, maybe I can figure out some optimization trick that uses that knowledge. What if I am a volatile function? Say to_date: if I'm doing an insert and I call to_date, that to_date has to be executed only once, so I need to know where this function needs to be executed.
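A rough sketch of the memo idea being described may help; the types here are hypothetical (not Orca's actual CMemo or CGroup). Each group is a list of interchangeable expressions whose children are group IDs, and an exploration rule like join commutativity simply adds one more expression to the same group rather than copying the whole tree.

```cpp
// Rough sketch (hypothetical types, not Orca's CMemo/CGroup) of the memo:
// each group holds equivalent expressions whose children are group IDs, and
// join commutativity just adds another expression to the same group.
#include <string>
#include <utility>
#include <vector>

struct GroupExpr {
    std::string op;              // "InnerJoin", "Get(T1)", "Get(T2)", ...
    std::vector<int> childGroups;
};

struct Memo {
    std::vector<std::vector<GroupExpr>> groups;   // groups[g] = alternatives for group g

    int AddGroup(GroupExpr e) {
        groups.push_back({std::move(e)});
        return static_cast<int>(groups.size()) - 1;
    }
    void AddAlternative(int group, GroupExpr e) { groups[group].push_back(std::move(e)); }
};

int main() {
    Memo memo;
    int g1 = memo.AddGroup({"Get(T1)", {}});          // scan of T1 (the talk's group one)
    int g2 = memo.AddGroup({"Get(T2)", {}});          // scan of T2 (the talk's group two)
    int g0 = memo.AddGroup({"InnerJoin", {g1, g2}});  // the join (the talk's group zero)

    // Join commutativity: the flipped order is just another expression inside
    // the join's group -- no new tree is built, only this one level changes.
    memo.AddAlternative(g0, {"InnerJoin", {g2, g1}});
    return 0;
}
```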
So for all of these properties, each scalar operator will say: I derive that I am a volatile function, or I am a set-returning function. Then my parent asks: are you a set-returning function? Oh, you are; then I have to handle you this way. Are you a volatile function? Oh, you are; then I can do something else. The parent does not need to know the intricacies of what the child is doing. Statistics work the same way: I am an inner join joining on A and B, which means that to estimate the join cardinality I need the histogram of A and the histogram of B. So what do I do? I require my children's groups to give me statistics on A and B. The group says, I don't have the statistics of A, so I ask my metadata accessor to go to the Postgres catalog, get the statistics, and represent them to me. And I cache it; within a session we don't assume the statistics change, that is our visibility rule. The same happens for the statistics of B. Now the parent says: you derived the histograms of A and B, so I can use them to predict the cardinality of the inner join. Those are the two examples; given the time, I won't go deeper.

So now I have statistics while I'm doing query optimization, and I always know whether I should apply this rule or that rule. But before I go on, let me show you the code a little. This is the n-ary join. The logical n-ary join says: these are the transformations I can apply. For example, I can go from the n-ary join to a dynamic-programming join order, I can do a greedy join order, I can do a minimal-cardinality join order, or I can do a DP version two, which avoids the cliff that the plain dynamic-programming enumeration hits. A slightly simpler example is the inner join: I can implement an inner join as an index apply, a bitmap-index nested loop join, a hash join, or a nested loop join. Each operator registers itself with the transforms that can be applied to it, and each transform states its preconditions: I need a tree of this shape, I need these scalar properties. So from the engine's perspective, all it has to say is: I have a row in the memo, I take the operator, I ask, what are the logical transforms I can apply to you? It gets the list back and just applies those transformations. The same with cardinality: oh, you are a join operator, I need cardinality, can you go get the cardinality? The join operator itself knows how to generate statistics for itself, whom to call, what transforms apply, and whether it is a good or a bad candidate for cardinality estimation. For example, if you have a large nested tree, as you go up the join tree your cardinality estimates get really skewed, because you have attribute-level correlation.
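Here is a minimal sketch of the "operators register their own transforms" pattern just shown in the code walkthrough. The class and transform names are made up for illustration (this is not Orca's CXform machinery): the point is that the engine never hard-codes which rules exist, it just asks each operator for its candidates and fires them.

```cpp
// Sketch of operators registering their own transforms (hypothetical classes,
// not Orca's actual API): the engine asks each operator for its candidate
// rules and for whether it is a trustworthy cardinality-estimation anchor.
#include <iostream>
#include <string>
#include <vector>

class Operator {
public:
    virtual ~Operator() = default;
    virtual std::string Name() const = 0;
    // Which transforms may be applied to this operator.
    virtual std::vector<std::string> CandidateXforms() const = 0;
    // Is this node a good anchor for cardinality estimation? Deep join trees
    // accumulate correlation error, so they answer "no".
    virtual bool GoodCardinalityCandidate(int treeDepth) const { return treeDepth <= 1; }
};

class LogicalInnerJoin : public Operator {
public:
    std::string Name() const override { return "LogicalInnerJoin"; }
    std::vector<std::string> CandidateXforms() const override {
        return {"Join2IndexApply", "Join2BitmapIndexNLJoin",
                "Join2HashJoin", "Join2NestedLoopJoin"};
    }
};

// The engine side: generic, knows nothing about joins specifically.
void Explore(const Operator& op, int treeDepth) {
    for (const auto& xform : op.CandidateXforms())
        std::cout << "firing " << xform << " on " << op.Name() << "\n";
    std::cout << (op.GoodCardinalityCandidate(treeDepth) ? "use" : "skip")
              << " this node for cardinality estimation\n";
}

int main() {
    LogicalInnerJoin join;
    Explore(join, /*treeDepth=*/3);
    return 0;
}
```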
So if you have the option of estimating on a join of two base tables, you have a higher probability of getting it right than at a join operator higher up, where skew and correlation have bubbled the error up so high that it is really useless as a candidate for cardinality estimation. Andy, you had a question. That's not an enforcement rule, it's more like a preference, a priority, right? Each operator says whether it is a good or a bad candidate based on its subtree. Okay. Right, because the parent knows how deep the subtree is. So given a memo group, if you compute a cardinality estimate at one point in the optimization and later create another tree that has a higher probability of being correct, you ignore the earlier cardinality and use the new subtree's estimate. So it's a binary thing, like good or bad? We wanted to make it an actual probability, zero to one, but that was over-engineering, so it is a three-way flag: good, bad, or don't even think about it. Got it, thanks.

Then you go into the physical implementation. We also have merge join, but I didn't use it in this example; here you get a hash join and a nested loop join. As I showed you, each join operator tells you which implementation transforms can be used, and the transform itself checks whether its preconditions are met for the subtree. For example, you should not be physically implementing a subtree, say the root, when parts of it, its children, are not yet implemented, because that doesn't make sense. That's a simple example.

Then we come to the last piece. We have all the logical alternatives, we did the cardinality estimation along the way, we did all of the logical-to-logical transformations and then the logical-to-physical transformations. But none of that yet asks: am I co-located? Do I need to shuffle things? Do I need an ordering? Do I need to materialize something to save on computation cost? All of that goes into the distribution and property enforcement framework. For example, the coordinator node says: the customer ultimately wants the result on the coordinator node, which is a singleton, and they also want it ordered on t1.a. That is my requirement, and I'm looking at just these two properties: distribution and order. The inner hash join then says, okay, I'm going to hash-distribute the first side. The join is in one memo group and its first child is group one. Group one says: I have a physical implementation for you, and I derive that my distribution is hashed on t1.a. The enforcement framework in the engine looks at it and says: the parent requested hashed on t1.a, you are distributed on t1.a, it's a perfect match, so I don't have to do any enforcement.
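A small sketch of that enforcement check, under assumed names (this is not Orca's enforcer API): compare the requested distribution property with the derived one, and only add a motion when they disagree. The mismatch branch it handles is exactly what the next part of the example walks through, and the replicated-table case shows why a new distribution kind only means extending the matching logic.

```cpp
// Sketch (hypothetical types, not Orca's enforcer classes) of distribution
// property enforcement: compare requested vs. derived specs and, only when
// they don't match, wrap the child in a Motion. A replicated table satisfies
// any hashed requirement.
#include <iostream>
#include <string>

struct DistributionSpec {
    enum Kind { Hashed, Random, Replicated, Singleton } kind;
    std::string hashColumn;   // meaningful only when kind == Hashed
};

bool Satisfies(const DistributionSpec& derived, const DistributionSpec& required) {
    if (derived.kind == DistributionSpec::Replicated) return true;   // every segment has all rows
    if (required.kind == DistributionSpec::Hashed)
        return derived.kind == DistributionSpec::Hashed &&
               derived.hashColumn == required.hashColumn;
    return derived.kind == required.kind;
}

std::string Enforce(const std::string& childPlan,
                    const DistributionSpec& derived,
                    const DistributionSpec& required) {
    if (Satisfies(derived, required)) return childPlan;               // no motion needed
    if (required.kind == DistributionSpec::Hashed)
        return "RedistributeMotion(" + required.hashColumn + ", " + childPlan + ")";
    return "GatherMotion(" + childPlan + ")";
}

int main() {
    DistributionSpec wantT1{DistributionSpec::Hashed, "t1.a"};
    DistributionSpec haveT1{DistributionSpec::Hashed, "t1.a"};       // already matches
    DistributionSpec wantT2{DistributionSpec::Hashed, "t2.b"};
    DistributionSpec haveT2{DistributionSpec::Random, ""};            // randomly distributed

    std::cout << Enforce("Scan(t1)", haveT1, wantT1) << "\n";  // Scan(t1), no motion
    std::cout << Enforce("Scan(t2)", haveT2, wantT2) << "\n";  // RedistributeMotion(t2.b, Scan(t2))
    return 0;
}
```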
But the second side: I want to co-locate both t1 and t2, so I say the inner side has to be hashed on t2.b. Now group two says: hey, I am randomly distributed. So there has to be a judge who comes in and says: the requested property and the derived property do not match; what do I do? I add a redistribution on t2.b, so that the requested property and the derived property now match. That is how it works in the memo. And if you have a replicated table, for example, I introduce a new scan for the replicated table, and if t2 were replicated it would say: hey, I am replicated, which means I satisfy any hashed requirement. With the enforcement framework, you don't have to rewrite your whole optimization engine to make a new kind of storage or distribution policy work. So, Andy, to your earlier question: if I say, hey, I'm range distributed, if you want to come up with a new distribution policy, all you have to change is how each operator derives its distribution property and how the enforcement is done. Nothing else.

The last step: okay, we figured out the hashing and everything is co-located, but we still haven't satisfied the ORDER BY. The ORDER BY now comes in and says: I need to enforce that the coordinator node gets everything in order of t1.a. Now I have two alternatives. One, I sort after the hash join on the segments and then gather-merge into the coordinator node. Alternatively, I gather everything first and then sort on top of that on the coordinator node. How do you make that decision? If the number of tuples is small, you might as well sort on the coordinator node; if it is too large, do it at the segment level.

So, to wrap this up: the SQL comes in, it is rewritten into a parse tree, we run the preprocessing step, we do logical exploration, and then optimization, which is physical implementation as well as MPP-ization, and then we generate what is called a planned statement, a plan that Postgres knows how to execute. The parse-tree side and the Postgres planned-statement side are the pieces that anyone who wants to use Orca needs to write; the rest Orca handles. That is the overview of how Orca's end-to-end optimization works.

And what are we looking at next? Okay, Venky, you finished all of this; what do you want to do further? Currently the join ordering focuses predominantly on how we do co-location, but it does not consider how co-location interplays with partition elimination. You could have a plan that is perfectly co-located but does not do enough partition elimination. How do we accomplish both? The second one: with Postgres and others we have spatial predicates; how do you do cardinality estimation for spatial predicates? Then, nested IN queries: these BI tools are crazy, and crazily complex, sometimes unnecessarily, and we get ten levels of nested IN predicates. How do we unnest those and optimize them? Currently Orca generates so many alternatives that, even though the plan it produces is correct and fast, the time it takes to generate it is too long. It takes a couple of seconds, which is too expensive a hit. Another one is index plan generation.
As more and more index types come into Postgres, how do we support all of them, and how do we do better cardinality estimation, especially with correlated statistics? Then we have something called the global deadlock detector: because we are a distributed system and we have updates, you can have parallel updates, and you want to take only the locks you actually need. So how do you generate a plan that takes fewer locks? The last two we may not get to this year, but I really would like to: parameterized queries and multi-phase optimization. Every engine has queries that are built in where you just send parameters; currently Orca looks at them as a black box, which means it is not taking full advantage of them, so we want to do a better job there. What is multi-phase optimization? It is: Venky, you gave me hundreds of transformations, but you are letting the engine figure out the order in which they fire. Can I have phase one apply only these transforms? Postgres doesn't need all of the distribution-related transforms, so cancel those, don't even look at them. So we have a way to order the optimization in stages, and at each stage we can ask: what is your cost so far, can I bail out early rather than going all the way to the end? We have a prototype for this multi-phase query optimization, but we have not productionized it.

So, a priority ordering for firing the rules, right? Correct, in the sense that you want to run the rules end-to-end in levels. Say, for short OLTP queries: just run five rules, and if you can get a plan, great; otherwise just ask the Postgres planner to pick it, and Orca doesn't even try. Or, for example, one customer says, for me this rule is the most important, so fire it first. Orca already has a mechanism where you set the levels and we fire the rules accordingly.

Perfect. I'm jumping ahead to your top-ten list: prepared statements. Basically, if I call prepare, the query shows up and you see the question marks, the placeholder variables, and you don't do anything? It's like, these are all just 40%, and we pretend everything is great? Or currently you don't even... Actually, currently Orca doesn't even support it; it falls back. Falls back to Postgres? Correct. We love the Postgres planner because it is our insurance policy. Anytime we don't get it, if we smell that we are going wrong, we just say, hey, you have an insurance policy, take the insurance policy. What percentage of queries does that? Currently it is about 70/30: Orca 70%, Postgres 30%. For instance, all of the catalog queries we don't even touch, because we don't need to. If you have more OLTP queries, then it will be much more Postgres; you have to look at your workload and see. I mean, a simple hack for prepared statements would be: when you call prepare, you run it once or you fall back, but then when someone calls execute on that prepared statement, you can take the variables being passed in and use them to plan. Exactly. It's not perfect, but you could do that as a starting point. Yeah.

So the code is right there. It is all open source; the Orca code is back in the main Greenplum repo. And a list of publications is here. Please do check us out; we have quite a few publications, with more to come soon.
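The multi-phase idea mentioned above could look roughly like the following. This is a hypothetical illustration, not the actual Orca prototype: rule sets fire in stages, the best cost found so far is checked after each stage, and the optimizer can bail out early or hand the query to the fallback planner instead of always exhausting the full search space.

```cpp
// Rough sketch (hypothetical, not the actual Orca prototype) of multi-phase
// optimization: fire rule sets in stages, check the best cost after each
// stage, and bail out early -- or fall back to the legacy planner.
#include <algorithm>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct Stage {
    std::string name;
    std::vector<std::string> rules;        // which transforms this stage may fire
    double costBudget;                     // stop searching once a plan beats this
};

// Placeholder for "fire these rules and return the best plan cost found".
double RunStage(const Stage& s) {
    std::cout << "stage " << s.name << ": firing " << s.rules.size() << " rules\n";
    return 90.0;                           // pretend we found a plan of cost 90
}

std::optional<double> Optimize(const std::vector<Stage>& stages) {
    double best = 1e18;
    for (const auto& s : stages) {
        best = std::min(best, RunStage(s));
        if (best <= s.costBudget) {        // good enough, bail out early
            std::cout << "bailing out after " << s.name << " with cost " << best << "\n";
            return best;
        }
    }
    // No plan found within budget: signal the caller to use the fallback planner.
    return best < 1e18 ? std::optional<double>(best) : std::nullopt;
}

int main() {
    std::vector<Stage> stages = {
        {"cheap OLTP rules", {"Select2Filter", "Join2NLJoin"}, 100.0},
        {"full exploration", {"Join2HashJoin", "SplitAgg"}, 0.0},
    };
    if (!Optimize(stages)) std::cout << "fall back to the Postgres planner\n";
    return 0;
}
```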
So that was a short, or not so short, description of how you build an optimizer from the ground up with Cascades. Questions? Thank you. No problem. I have a ton of questions, but I don't want to be selfish; I'll open it up to anybody else first. I mean, go for it. Okay.

So my question is this: if you look at the modern query engines, even Delta, for example, more and more decisions are being pushed to runtime. When you do a hash join, for some blocks you run out of memory, which is very bad, so for those blocks they switch to a merge join because it uses a lot less memory. A bunch of such things are happening. Also, during runtime they collect statistics for that particular query and can change the distribution, change the plan in the middle of it, which some engines are actually doing today. More and more of this matters because, particularly when you go to these data lake engines, they don't have a lot of stats; it is a data lake and the data varies a lot, and you can't just go collect statistics on everything. So more and more is being pushed to runtime. Is Orca really prepared to do such things?

We haven't looked at the data lake aspect of it, and it would potentially be challenging. The way Orca was architected, you take the query, compile it, and it's a one-shot thing. You would have to be able to say, hey, here's an updated view of what I'm seeing while running the query, and then re-apply your rules, and I think that would be a major rewrite. Right. Or you could have subtrees where you are fine, and then you pause execution and say, hey, redo this subtree. Say, for example, you have a data lake where you're pushing sub-parts of the query into different Greenplum instances: a federated lake of multiple Greenplums, and the query comes into one Greenplum instance. You could potentially divide it into subtrees where a particular subtree can be paused and redone. But I don't know; we haven't looked at it from that perspective, and it is going to take a lot of work.

We had another question. Yes, I've got another one. How aggressive are you in your use of materialized views? Not that much right now. We have a beta version of materialized views in Greenplum as a whole, but we don't fully support materialized views in GP7, only in GP6. You have to manually refresh them? Yes, it is not automatic. Do you do the rewrite part or not? No, we don't right now. Okay. And what about correlated subqueries, which are the bane of existence in MPP systems because of the n-squared problem: for every node it wants to do the correlated work across all the nodes, so it becomes n-squared? We do have a whole set of slides for that. We do support a lot of correlated execution, and we try to decorrelate most, or at least a good chunk, of it. How about this: I will send you the paper on what we have for it. We do some of it. Yeah, go ahead. You may have seen what we did in DB2; it was one of the best papers, I believe, at Data Engineering in 1995 or '96. We did the magic decorrelation because we had magic sets inside DB2, so we used that. It is very complex; most people who do it end up with something buggy that doesn't work. Mostly, yes. Okay.
Well, I can show you what we have. I can send you both the kinds of correlated queries we support and the code to look at, so I'll send you both and connect you. Okay. Thank you. I got your name. So, yeah. All right.

So my question is, my first question is more on the engineering side. It's meant to be modular, meaning you can drop in different plugins, and you have these providers to get information. But you talked about how this is all about Greenplum, while the original goal was to support Greenplum and HAWQ. So I guess, if I had a brand new database, roughly how much time would it take to base my optimizer on Orca? Weeks, months, years? If you are a Postgres-based database, I would say less than six months. Basically, if you don't have to rewrite how the catalog and the representations work, it's about six months; if not, it's much harder. That is partly because Orca started off with a lot of lofty goals, but then we had to get to GA, so we went very Postgres-heavy rather than staying agnostic. So if you are a non-Postgres system and you want to adopt it, it is going to take about a year. Okay. And when you say Postgres-heavy, is that because it's calling information-schema-style queries to go get things? Basically, the vast majority of the time you are going to spend is translating the logical nodes in Postgres to the logical nodes in Orca, all of the metadata and constraints, and also generating the execution plan. If you are a Postgres-based execution engine, you don't have to do all of that; you can pretty much lift and shift most of it and only work on your own areas of need. So if, say, you had an in-memory database where the execution engine is an extension of Postgres and the parser and the metadata are all Postgres, it would take six months, though you may need some hand-holding, and we are more than happy to hand-hold.

And is there anybody other than VMware Greenplum running off Orca? Like Calcite, at least, there have been a bunch of databases that use that in different forms. But as far as I can tell, Orca is only being used for Greenplum. Right, or Greenplum extensions, Greenplum derivatives. Got it. Okay.

I guess my last, sort of open-ended question: since the cost model in the architecture is pluggable, in theory someone could come along and implement a cost model without having to rewrite all the other rules and such. Have you played around with any of the learned cost models that have been showing up in a bunch of research papers, or not? We have not, but we would love to. Okay, we can give the students something to do. There are a lot of papers in this space, and, you know, there's the Bao thing from last year; they hooked it into Postgres, and I think also into DB2, but I don't think anybody's running it in production. I was just curious.
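To close the loop on the pluggable cost model discussion, here is a minimal sketch of what such an interface could look like. The names are hypothetical rather than Orca's actual API, but it shows why, in principle, a learned model could be dropped in without touching the operators, the transforms, or the memo.

```cpp
// Minimal sketch of a pluggable cost model (hypothetical names, not Orca's
// actual API): the engine only calls Cost(), so a calibrated linear model
// and a learned model can be swapped without changing anything else.
#include <iostream>
#include <memory>
#include <string>

struct CostInput {
    std::string op;      // "HashJoin", "SeqScan", ...
    double rows;         // estimated input cardinality
    double width;        // average tuple width in bytes
};

class CostModel {
public:
    virtual ~CostModel() = default;
    virtual double Cost(const CostInput& in) const = 0;
};

// The classic approach: constants fitted by regression on some machine.
class CalibratedCostModel : public CostModel {
public:
    double Cost(const CostInput& in) const override {
        double cpuPerRow = (in.op == "HashJoin") ? 2.5 : 1.0;    // fitted constants
        return in.rows * cpuPerRow + in.rows * in.width * 0.01;  // cpu + data movement
    }
};

// A stand-in for a learned model: same interface, different internals.
class LearnedCostModel : public CostModel {
public:
    double Cost(const CostInput& in) const override {
        return RunInference(in);                                 // e.g. a small regressor
    }
private:
    double RunInference(const CostInput& in) const { return in.rows * 1.7; } // placeholder
};

int main() {
    std::unique_ptr<CostModel> model = std::make_unique<CalibratedCostModel>();
    std::cout << model->Cost({"HashJoin", 1e6, 64}) << "\n";
    model = std::make_unique<LearnedCostModel>();                // swap with no other changes
    std::cout << model->Cost({"HashJoin", 1e6, 64}) << "\n";
    return 0;
}
```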