The Carnegie Mellon Vaccination Database Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Okay, let's get started. It's another vaccination talk. Today we're here with Karol Sobczak. He is a co-founder and software engineer at Starburst, one of the main companies backing the Trino project. Prior to that, he worked at Teradata, because he was part of the team at Hadapt, the company founded by Daniel Abadi that was bought by Teradata, and then he started working on Presto, which later became Trino, at Teradata. So, if you have any questions for Karol as he gives his talk, please unmute yourself, say who you are, and ask your question. Feel free to do this at any time so he's not talking by himself. And we have to say that Karol is in Poland right now; it's 10:30 at night, so we appreciate him being here with us, and we also appreciate him getting through this while sick. And I would say he is vaccinated.

Okay, so my name is Karol Sobczak. Yes, I was one of the initial employees of Starburst, and I will talk about the Starburst Trino query optimizer and the engine: how it evolved over time, the reasons Trino exists, and so on. First I will go over the company overview, then the Trino overview, and then the evolution of the optimizer and the milestones in how the optimizer has evolved over the last years. I will also talk about other key features that we have, like dynamic filtering and advanced pushdown, and importantly, I will also talk about our current initiatives and future developments.

So, the history of Trino actually starts in 2013, and even slightly before. In 2013, Facebook open sourced Presto. Presto was created as a replacement for Hive at Facebook, because Hive was too slow.
Hive's syntax was also not standard SQL. So there was a project initiated at Facebook, which was then open sourced. Then, in 2015, Teradata joined the party and started cooperating on Presto, and I was part of that team. By 2017 there were already 180-plus releases, 50-plus contributors, and 5,000 commits. At the end of 2017, Starburst was founded by former employees of Teradata and Facebook. And then we had a couple of funding rounds: a Series A in 2019, a Series B in the summer of 2020, and also a Series C, at which point we actually became a unicorn. 2021 was also the year we renamed and rebranded from PrestoSQL to Trino.

So, here are some key people who were part of the founding team. The three gentlemen at the bottom, Martin Traverso, Dain Sundstrom, and David Phillips, are the original creators of Presto. So on board we have all the major contributors, and we also have the original creators. It's a very good team.

Okay, so why was Trino actually created? I said that Hive was too slow, and that's true, but there is also another reason. If you look at companies today, a lot of them, in order to get insights from their data, have to create expensive ETL pipelines. For example, you have to have a team of data engineers and data scientists that create pipelines for you. That takes time. You get a lot of data duplication, so it's harder to know what the golden source of your data is, the golden copy. This is complex, so you can imagine that it might take some companies a long time to get even basic insights. So, this is not perfect, and hence Trino was created. You can treat Trino as your single point of access to data: instead of creating ETL pipelines, you can set up your data sources within Trino and query data where it lives. Trino is a federation engine.
Also importantly, you can have a lot of data sources, but Trino is also a very capable distributed system which supports high concurrency at massive scale. We also have very good support for lakehouses. So you can join, let's say, some remote data or relational data from relational data sources with your newest lakehouse data.

We put a lot of effort within the project into having good APIs for connectors. A connector is a component that essentially represents a data source. There are a lot of APIs that a connector can implement; some of them are obligatory. What happens is that on the coordinator, there is a metadata API where the connector can tell the engine, the planner, what the tables are, what the columns are, what the column types are. You can also do access control this way. There is also, for example, a data location API where you tell the engine where the data actually lives. And then on the workers, there is a stream API which is responsible for reading the data from the data source. So connectors and the API are first-class citizens, even for the distributed connectors like the lakehouse connectors, for example Hive tables or AWS Glue tables; that's also a connector.

And here are examples of the connectors that we have. Some of them are open source; some of them are Starburst connectors with proprietary features like improved performance or improved security. You can see that we can connect to almost anything that makes sense to connect to.

But because of this federation approach, there are challenges. Within a single engine, we have to deal with relational versus non-relational data sources. With this comes the complexity of managing different data types. You can imagine that all of the sources have data types that are kind of the same, but kind of different; one example would be the decimal types.
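The coordinator/worker split of the connector APIs described above (metadata on the coordinator, a data location API that hands out splits, and a worker-side read path) can be sketched roughly as follows. This is a minimal illustrative sketch in Python, not the real Trino SPI; all class and method names here are invented.

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    name: str
    type: str

class Connector:
    """Invented mini-SPI: metadata and splits on the coordinator, reads on workers."""
    def list_tables(self): raise NotImplementedError
    def get_columns(self, table): raise NotImplementedError   # metadata API
    def get_splits(self, table): raise NotImplementedError    # data location API
    def read_split(self, split): raise NotImplementedError    # worker-side read

class InMemoryConnector(Connector):
    def __init__(self, tables):
        self._tables = tables  # table name -> list of row dicts

    def list_tables(self):
        return sorted(self._tables)

    def get_columns(self, table):
        first_row = self._tables[table][0]
        return [ColumnMetadata(name, type(value).__name__)
                for name, value in first_row.items()]

    def get_splits(self, table):
        # one split per pair of rows; a real connector hands out file or
        # partition handles that workers can read independently
        rows = self._tables[table]
        return [rows[i:i + 2] for i in range(0, len(rows), 2)]

    def read_split(self, split):
        return list(split)

conn = InMemoryConnector({"t": [{"x": 1}, {"x": 2}, {"x": 3}]})
```

A real connector returns opaque split handles (files, partitions, shard locations) rather than row chunks, and the workers invoke the read path on those handles in parallel.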
We have certain semantics within Trino, but some databases have different semantics, and we have to adjust for that. Another complexity comes from the fact that the storage is externally managed. So we have to carefully craft the API for connectors to tell us what the storage is, what the properties of the storage are, like partitioning or bucketing and so on. There is no such thing as built-in storage within the engine; all of it is connected.

Another difficulty comes from statistics. As you probably know, statistics are very important for optimal performance of the engine. Some connectors provide better statistics, some provide worse statistics, and we have to deal with that. We have to be able to make decisions even when the statistics are not that great.

There is a problem with predicate pushdown. When you push a predicate down to a relational data source, for example Oracle, it might have a different collation than Trino, so we have to adjust for that. A similar problem comes from predicate extraction when the database describes the table; that could also be a collation problem, and so on. We have separate operator pushdown, which has to be modeled in a generic way across connectors. We also have advanced dynamic filtering, which I will get to later. An important thing is that we can do a kind of join pushdown between connectors. This is pretty cool, because you can have a join between different data sources, let's say Hive tables joined with Oracle tables, and you can do a simulated pushdown of the join via a dynamic filter. I will get to that in later slides. So, let's now go over the...

Quick question. For those joins across different disparate sources, they all have to come back up to the Trino execution engine? You can't tell, like, whatever this data source has, send it directly to this other data source and have it do the join there?
Yeah, I mean, the join is still being executed in Trino, right? But with dynamic filtering, we can skip a lot of data from the source. There will be a slide where I'll show how much you gain by doing that. So we don't have to fetch everything from the source connector; we can fetch only the rows that are actually joined later on. So this is a huge improvement.

Okay. So, let me go over the major breakthroughs that happened with the engine over time. This is how the optimizer looked in the early days. Every optimization we had was based on a transformation. A transformation is a visitor where you visit all of the nodes: to apply some optimization, you have to traverse the plan to the place where the optimization happens, then you rewrite the plan and all the nodes above the node that changed. As you can imagine, this is expensive to evaluate. Another problem is that there are some optimizations you have to apply multiple times, or even an undefined number of times, and because the transformations are expensive, that's not feasible. This is one of the limitations. Another problem is that it's hard to add new operators, because if you add a new plan node, you have to rewrite all of the optimization rules; you have to add support for it to the visitor in all of the optimizations. This is really painful.

But the transformation-based model has some advantages. It's easy to compute properties, because it's a simple recursion down the plan. For example, if you want to compute partitioning properties, you implement a simple recursive visitor, go down, and evaluate the property of a given plan node. That's simple.

So, this is an example of a case where you would like to apply the same optimization multiple times.
And it's kind of nondeterministic how many times you would like to apply the optimization. You have a simple join between two tables, table A and table B, and on table A there is a predicate: the connector can tell you that values for A.x will be greater than 100. What you can do is apply predicate pushdown: you can push this predicate to the other side of the join, so now you can say B.y is greater than 100. That's good. Then you can actually push this predicate down to the connector itself. Now, what can happen is that the connector that provides table B can tell you that the predicate is actually narrower. This might happen, for example, when you have a partitioned table with a lot of partitions. What we usually do is avoid enumerating all of the partitions at planning time when there is no predicate, because that's an expensive thing to do, but once we get some predicate, we can actually enumerate the partitions. So it might happen that the predicate for B.y is actually narrower, say B.y greater than 200, and you can apply predicate pushdown again: you can push A.x greater than 200 down to the left side of the join. So there might be multiple steps like that, and it's hard to define how many times the transformation should be executed.

Another neat thing, this was around 2016, was the equality inference. An example: again you have a join, and you have a rather complex filter on top, A.x applied to F, applied to G. The join condition itself can be complex; it can be a function on the left and a column reference on the right. With predicate pushdown, you can push the original filter down to the left side of the join. However, you can also push a modified filter down to the right side of the join, because you can derive more equalities.
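The back-and-forth pushdown in the table A/table B example above, where a bound crosses the join, the connector narrows it, and it crosses back, is naturally expressed as iterating to a fixpoint. Here is a toy sketch; the `narrow` callback stands in for a hypothetical connector that prunes partitions and reports a tighter lower bound, and all names and values are invented:

```python
def narrow(lower_bound):
    # stand-in for a connector serving table B: given "b.y > lower_bound",
    # it prunes partitions and reports the tightest bound it will ever produce
    return max(lower_bound, 200)

def push_to_fixpoint(a_lower, b_lower):
    # a.x = b.y, so a lower bound on one side transfers to the other;
    # repeat until neither side's bound improves
    while True:
        new_b = narrow(max(b_lower, a_lower))
        new_a = max(a_lower, new_b)
        if (new_a, new_b) == (a_lower, b_lower):
            return a_lower, b_lower
        a_lower, b_lower = new_a, new_b
```

Starting from `a.x > 100` and no bound on `b.y`, `push_to_fixpoint(100, float("-inf"))` settles at 200 on both sides, exactly the multi-step narrowing the transformation model struggles to schedule.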
And you can say that B.y applied to G is actually the same as A.x applied to F applied to G. So we generate this cloud of equalities and try to derive, let's say, transitive predicates this way.

What also happened around that time, mainly driven by Teradata, was adding support for more types. We added support for character types, long decimals, and parametric types. And very importantly, we added support for correlated subqueries. The goal was to support the full TPC-DS and TPC-H query suites: we wanted to be able to run them without any modifications, and that was achieved. So we really had support for the full query suites, as compared to the competition. And we still make progress: we added initial support for correlated subqueries then, and we continue adding support for trickier kinds of correlated subqueries. Other neat things have been added in recent years, so this is still moving forward.

Okay, at some point, in 2016, we switched to a rule-based optimizer. Instead of transformations, you now define a rule with a pattern, and the rule is applied to a plan node that matches the pattern. The pattern can be quite complex, as you see here. This has the advantage that you can now execute rules as many times as needed, because they are quick to execute: you only change a subset of the plan, you don't have to modify the entire plan, you only apply the rule locally. And it makes it easier to add new plan nodes, because the rules don't have to implement the visitor.

But there are also disadvantages. The disadvantage is that it's harder to evaluate properties. Let's say you want to check what the partitioning is for a given plan node. With transformations, you could do the simple recursion; with the rule-based approach, you have to be more clever about this. There is a concept of traits and propagation rules.
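The rule-plus-pattern mechanism just described can be sketched as small local rewrites applied repeatedly until none fires. This is an invented miniature, with plans as nested tuples, not Trino's actual plan IR or rule API:

```python
def merge_filters(node):
    # pattern: Filter directly above Filter -> one Filter with ANDed predicates
    if node[0] == "filter" and node[2][0] == "filter":
        inner = node[2]
        return ("filter", node[1] + " AND " + inner[1], inner[2])
    return None  # pattern did not match

def apply_rules(node, rules):
    # recurse into the child first, then fire matching rules locally
    # until no rule changes this node (a greedy, bottom-up pass)
    if len(node) == 3:
        node = (node[0], node[1], apply_rules(node[2], rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(node)
            if rewritten is not None:
                node, changed = rewritten, True
    return node

plan = ("filter", "a > 1", ("filter", "b < 2", ("scan", "t")))
optimized = apply_rules(plan, [merge_filters])
```

Note the rule only ever touches the node it matched, which is exactly why rules are cheap enough to run as many times as needed, and why a rule never has to know about plan node types it doesn't match.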
We still don't have the concept of traits internally within Trino. This concept will be required once we move away from the greedy optimizer. So that's another thing: right now, our optimizer is a greedy one. Once you apply a rule, that's the way you go; we don't keep the old version of the plan in memory. The issue is that we might end up with a plan that is suboptimal globally, even though it's optimal locally.

And this is an example. On the left and on the right you have two different join orders. Locally, the left join order might be better. But if there is some kind of group-by or aggregation on top, it might actually be better to use the right version of the plan, because there is a different partitioning at the output of the top join: there, the data is partitioned by a.x, so if you have an aggregation on a.x, you don't have to reshuffle the data again. This is why we will probably move to a non-greedy optimizer in the future.

Okay. Then in 2017, a cost-based optimizer was added; that was also an initiative of our Teradata office. Our join reordering algorithm works on statistics and costs. We switch automatically between repartitioned versus broadcast joins, we also swap join sides, and we have full join reordering support, so we can generate bushy trees. We work on statistics: mainly number of rows, number of distinct values, nulls fraction, min/max, and average data size. From the statistics, we compute the cost in terms of CPU, memory, and network. And we compare plans and choose the plan with the lower cost, with some weights applied to CPU, memory, and network. So, this is an example of how the filter estimation works.

How do you pick the weights that you apply to CPU, memory, and network? Is it an arbitrary choice?
Well, it's not really arbitrary. I don't recall the exact values, but we put the emphasis on the CPU. If I recall correctly, I would have to check, but Presto query performance is mainly limited by CPU, for TPC-DS anyway. By putting weight on CPU, we choose plans where we have to do less computation, and that reduces wall time. So yeah, I would have to check exactly, but I think we have the biggest weight on the CPU.

I guess the question, maybe Tim's asking this, is: what percentage of the queries are using the different connectors going at these other databases, versus the Trino-native storage? Could you elaborate a bit more? So, like, you have a connector for SingleStore, and so, how often are people trying to federate queries across SingleStore and then, you know, S3 buckets or something like that? And maybe Tim's not asking this or I'm asking this, but SingleStore is almost like a black box for you guys. How would you account for the CPU of whatever SingleStore is going to do? Or are these weights just applied to the Trino-based operations?

So, it's more about the cost of the join, really. We now have support for operator pushdown, but that decision is not expressed in the cost model; Trino does not compare the plan with the operator pushed down to, for example, the relational database against the plan without it. We assume that by pushing things like aggregation down to a relational data source, we will reduce the overall cost of the plan, because you reduce the amount of data that is being transferred over the network. What happens in practice is that once we start extracting data from a relational database, for example Oracle, in a distributed way, those databases tend to have a high cost.
Getting the data out of them is actually quite expensive. So if you can reduce the amount of data that is being produced by a relational connector, that's good. For the data lake connectors, like Hive tables, it's more about join order: there you scan the same tables either way, so that cost does not change; the cost changes once you start reordering the joins. Right. So it's more about that.

Okay, I'm going to repeat what you're saying: if it's a connector, you don't account for its CPU cost, just how much data transfer is coming out of it. Yeah, that's based on empirical experience, I would say, yes.

Maybe what Tim is asking is: okay, for the Trino operators, you have these weights, and it sounds like they're set by hand. You're not, like DB2 does, running micro-benchmarks to figure out what the CPU cost is; you're not doing any of that. That was quite a long time ago, but I recall that we ran benchmarks and tried different weights, and the weighting where you put the most on CPU was the best one for, let's say, TPC-H and TPC-DS. But to be clear, you ship Trino with that value hard-coded; it's a knob you can tune, but the system doesn't tune itself. Yeah. So in DB2, you turn the system on, and the IBM people here can correct me, it runs micro-benchmarks on your CPU, almost like BogoMips, to figure out what the weights should be for the cost model.

So, there is an idea. It's not in progress yet, and actually I will talk about this in the future initiatives. You can improve statistics, for example, by collecting data from past queries and comparing the estimated statistics for the plan against the actual statistics of the query, right?
So now you can have a better model where you apply some techniques, I would guess linear regression or things like that, where you adjust these weights so that they match reality for the particular customer. The question is how much you gain from that in practice versus the heuristics. But yeah, that's an idea, and it's actually quite an interesting project to work on. Statistics are just an approximation of reality; to get the actual stats, you have to run the query, right? So hence you will need these weights. However, you can change these weights according to the customer's workloads, for example, but for this you need the history of the queries, and you need this algorithm. So this is really something we've been thinking of. Does that answer your question?

That answers my question, yeah. And if you do what you're describing, Karol, it also sounds something like DB2's LEO, the learning optimizer stuff from 20 years ago. Okay. Yeah, a lot of the techniques here come from the past. I'm not saying that's a bad idea, don't do it because IBM did; I just think what you're describing is pretty similar to what IBM had done.

Yeah. And the difficulty here is that we are operating in a federated environment where we miss some data from the connectors. So in reality, sometimes a simpler approach works better than the more sophisticated one, or at least as well, because a statistical model is inherently inaccurate; it assumes some data distribution and so on. Sometimes simple techniques work well enough. Okay.
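The weighted cost comparison discussed above might look like this sketch. The weight values and the plan cost vectors are invented stand-ins, not the constants Trino actually ships with:

```python
WEIGHTS = {"cpu": 75.0, "memory": 10.0, "network": 15.0}  # invented weights

def total_cost(cost, weights=WEIGHTS):
    # scalarize a (cpu, memory, network) cost vector with fixed weights
    return sum(weights[k] * cost[k] for k in weights)

def cheapest(plans, weights=WEIGHTS):
    # pick the (name, cost-vector) alternative with the lowest weighted cost
    return min(plans, key=lambda p: total_cost(p[1], weights))

plans = [
    ("repartitioned join", {"cpu": 10.0, "memory": 2.0, "network": 8.0}),
    ("broadcast join",     {"cpu": 9.0,  "memory": 6.0, "network": 2.0}),
]
```

The feedback idea from the discussion would amount to fitting `WEIGHTS` (or the per-operator cost estimates) against observed query runtimes for a customer's workload, rather than shipping one hand-tuned set.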
So let me continue. This filter estimation, for example: if you apply a filter like item > 0, what you keep in the stats is, okay, there might still be 61 million rows, but we now know that item cannot be null. Similarly, if you have an equality filter on item, you might estimate that there are only 3,000 rows and one distinct value; with an IN clause of two values, that could be 6,000 rows and two distinct values.

One of the optimizations we do is choosing the join distribution type. On the left, you can see that both tables are reshuffled, but one table is really small. Our CBO will choose a broadcast join, and then the left table doesn't have to be reshuffled anymore. This is a typical situation for dimension tables, for example. You save a lot this way: a lot of CPU, a lot of network. Similarly, you want the small table to be the right table, because the right table is kept in memory; it's the index for the join. So you want the small table to be the one kept in memory.

We do full join reordering. What usually happens is that the biggest table, if its join does not filter the data, will end up at the top, because a join that does not filter only increases the volume of data, so it's better to put it after all of the filtering. Here the lineitem table is on the left of the topmost join, so it's not the one kept in memory. But if you apply a filter to the lineitem table, it might produce only 3,000 rows; then it becomes the bottom-most right table, with the customer table at the top. So that's how it works. We also support full bushy trees. Some engines only do left-deep or right-deep join orders; we support bushy trees if that's better in terms of cost.

CBO was a great breakthrough: we increased performance by an order of magnitude.
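The join-distribution and build-side choices just described can be sketched as a simple decision on estimated row counts. The broadcast threshold here is an invented number, not Trino's default:

```python
BROADCAST_ROW_LIMIT = 100_000  # invented threshold, not Trino's default

def plan_join(left_rows, right_rows, limit=BROADCAST_ROW_LIMIT):
    # the build (right) side is held in memory as the join index,
    # so keep the smaller input there
    swapped = right_rows > left_rows
    if swapped:
        left_rows, right_rows = right_rows, left_rows
    # a small build side is cheap to replicate to every node: broadcast it;
    # otherwise repartition both sides on the join key
    distribution = "broadcast" if right_rows <= limit else "partitioned"
    return {"distribution": distribution, "sides_swapped": swapped,
            "build_rows": right_rows}
```

For the dimension-table case from the slide (a huge fact table joined with a tiny dimension table), this picks a broadcast join and the big table avoids the reshuffle entirely.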
Yeah, that was the first great improvement. But like I mentioned before, accuracy decreases with plan depth: the more complex the plan, the more operators you have to compute stats for, and the error increases. That's one problem. Another is that some connectors don't provide an accurate number of distinct values, which is essential for computing filtering ratios; sometimes we only know table sizes. So we apply different techniques to deal with that. We normalize statistics: if, for example, the stats say there are more distinct values than rows, we cap the number of distinct values. We assume predicates are correlated, mostly for joins. We assume distinct values match; I will get to that on the next slide. We can also fall back to table-size statistics for join distribution and side selection only.

So, we assume distinct values match. When you write a query where you join on two columns, like A.x = B.y, we assume that the distinct values from the table with the smaller number of distinct values will each have a corresponding match on the other side of the join. This is based on the observation that when a user joins tables, they really mean to get some data back, so we assume that there are matches rather than no matches at all. This is the only reasonable thing to do.

We also assume that predicates are correlated for joins. When you have a join with two equi-clauses, this is an example from one of the TPC-DS queries, we assume they are correlated. We choose one of the clauses as the driving clause and compute the filtering ratio for that clause, and for the other, auxiliary clauses we apply a constant filtering coefficient. This gives better results in practice.
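Those two assumptions, matching distinct values and a driving clause with a constant coefficient for the auxiliary clauses, could be sketched like this. The coefficient value is invented, not the one Trino uses:

```python
def join_output_rows(left_rows, right_rows, ndv_pairs, aux_coeff=0.9):
    """Estimate equijoin cardinality under two assumptions from the talk.

    ndv_pairs: list of (ndv_left, ndv_right), one per equality clause.
    Assumption 1 (NDV match): the side with fewer distinct values always
    finds partners, giving per-clause selectivity 1 / max(ndv_l, ndv_r).
    Assumption 2 (correlation): only the most selective ("driving") clause
    gets its full selectivity; each auxiliary clause contributes only a
    constant coefficient aux_coeff (an invented value here).
    """
    selectivities = [1.0 / max(ndv_l, ndv_r) for ndv_l, ndv_r in ndv_pairs]
    driving = min(selectivities)
    estimate = left_rows * right_rows * driving
    for _ in selectivities[1:]:
        estimate *= aux_coeff
    return estimate
```

Treating the clauses as independent would instead multiply all the selectivities together, which badly underestimates output rows exactly in the correlated TPC-DS-style cases the talk mentions.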
This results in better plans, and it is again an example of a simple yet very efficient optimization that works well in practice without being too sophisticated. There are plenty of such things. Another important optimization: when you don't know the detailed statistics of a table, it might be that one connector does not return a table size at all, but the other connector tells you that its table is 100 kilobytes. In this example, there is a join with table B, which has a size of 100 kilobytes; the size of the other table is unknown. Since table B is really small for a typical Trino cluster, we choose it to be the build side of the join, on the right side. This is often a big win, because the join can be replicated, and we can also derive dynamic filters, runtime predicates, if there are not too many distinct values. This helps in practice: some customers don't collect statistics, or the statistics are not updated. Some customers might not run analyze frequently; there might be new partitions with new data that have no statistics yet. We have to deal with these situations.

Okay, so another big breakthrough was adding dynamic filtering. A dynamic filter is a predicate that we derive at runtime from the right side of the join, the build side, the index. That predicate is then pushed to the left side of the join, either to the streaming source or to the split enumeration of the connector. Dynamic filtering is an umbrella keyword; there are a lot of techniques behind it: there is local dynamic filtering, there is dynamic partition pruning, and there are awaitable dynamic filters. One important thing is that this project was driven by the community.
There were, I think, three companies involved, so this was a great example of community-driven work with cooperation between different parties. What we added first was support for local dynamic filtering. Once you have a join, and the join operator is running on the same node as the table scan, then once the join gets the index, the data from the right side, you can extract a predicate from that build side and push it to the table scan on the left. All of this happens within a single node. And this helps with stripe and row-group pruning in ORC and Parquet files: Parquet and ORC files have internal indexes, and you can apply predicates to them to skip data within the files. There were some improvements from this; it was not that dramatic, but some queries improved.

Then we added support for dynamic partition pruning. Similarly, you collect a predicate at runtime from the join build side; the predicate is then sent to the coordinator, and the coordinator propagates the filter to the left table scan, but gives the filter to the split enumerator. The split enumerator is the API for listing, let's say, the files of a table. So if you have a connector with partitioned data, it can take the predicate and completely skip the partitions, completely skip the files from partitions that won't be read. And this is a huge win; it was a big improvement compared to what was before. I would say that in total it was comparable to maybe the CBO; a lot of queries improved a lot.

This is like a zone map, where there's some metadata about the partition that says, I know there's nothing in here that I need? Yes; if you have a lakehouse connector, for example, because dynamic partition pruning works mostly for the lakehouse connectors, they know the structure of the tables: they know the partitions, and they know the values of the partitions.
Once they get the dynamic filter, they can apply it to the partition listing, so they can skip the files completely; you don't even start processing the files on the nodes.

One of the ways dynamic filters were improved over time is that we reduced the latency of how quickly dynamic filters are propagated to the coordinator. We also added a way to wait for dynamic filters: awaitable dynamic filters. Now a connector can wait on a future until the filter is ready. This is important for relational connectors. If you have the Oracle connector, for example, you only have one chance to apply the filter, because the way we read data from Oracle or other relational connectors is that we submit a query to the source database and then receive the data, right? We can do this only once. So the connector can now wait for the filter for some predefined amount of time, and once the filter is ready, it can improve the query that is being submitted to the source database. And this is an example of how the Oracle connector improved. That was a huge improvement for the federated case; it's like night and day for some queries. For some customers, I think this took some queries from hours down to minutes, or maybe less. So this was an enabler for the federated case.

But does that sometimes make the queries that you push down just enormous, like IN clauses that are hundreds of thousands of elements wide? No, no. We have a limit, a very low limit actually, on how many individual values you can push. I ran benchmarks on that: at some point, if you push too many values, you can get regressions in the cases where the dynamic filter doesn't end up being useful.
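That value cap, together with the fallback to a min/max range for oversized filters, might be sketched as follows. The cap here is an invented number, not the actual tuned threshold:

```python
MAX_VALUES = 1000  # invented cap on distinct values in a pushed-down filter

def build_dynamic_filter(build_side_values, max_values=MAX_VALUES):
    distinct = set(build_side_values)
    if len(distinct) <= max_values:
        return ("in", sorted(distinct))
    # too wide to ship as an IN list: simplify to a min/max range, which
    # still prunes well for e.g. date-partitioned tables with incremental keys
    return ("between", min(distinct), max(distinct))

def matches(filt, value):
    # apply a built filter on the probe side (or push it into a source query)
    if filt[0] == "in":
        return value in filt[1]
    _, lo, hi = filt
    return lo <= value <= hi
```

The IN form becomes a predicate in the query submitted to the source database; the range form is what keeps the pushed-down query a normal size when the build side is large.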
So when I was tuning this, I set the threshold for waiting for the dynamic filter, and the threshold for the size of the dynamic filter, so that there is no regression while you still get huge gains. Usually it's quite low. But the thing is, in the typical case you have data partitioned by dates; I guess that's the most common case. And what happens is that once we detect the dynamic filter is too big, we actually simplify it: we extract the min-max range. And since dates are incremental, they are kind of nice for that, so you can still get a huge boost while the pushed-down query stays a normal size. So you don't get a regression, but you can get huge gains. There are some ideas for how to potentially improve this for large dynamic filters; you could create a temporary table, but for now what we do here is enough. Thank you.

So another big improvement was advanced predicate pushdown. Before 2019, we only had separate predicate pushdown and column pruning. Now we have support for operator pushdown and projection pushdown, including dereference projections; I will get to that in a moment. We have support for limit pushdown, aggregation pushdown, join pushdown, column pushdown, and the list is growing. Aggregation pushdown is probably one of the more important pushdowns for the relational connectors.

Here is an example of dereference pushdown. A typical case for lakehouse storage is that customers keep data in structural types. The structural data is stored in Parquet or ORC files, typically Parquet. Parquet stores structural types, let's say a row type or a map type, as vectorized data: for keys and values, or for each row field. So within Parquet files, there is only vectorized data. What we do now is, if you write SELECT col.field0 FROM table, we will push that projection down into the connector.
So the connector can now skip loading the entire struct and read only that particular field. Here is the example: the projection decodes only field zero from the struct type. Before pushdown, there is a table scan that reads the entire struct column; after projection pushdown, the table scan produces only field zero, so we don't read all of the struct's blocks, let's say.

This is an example of the API for projection pushdown. What is important is that we carefully designed the API so that we don't expose too many details about the optimizer to the connectors. We have an abstraction called the connector expression. Right now it's either a constant, a field dereference, or a variable, where a variable is essentially a column reference. Doing it this way makes it easier for us to change how the optimizer works and how we model the plan internally, which is a good thing: we don't have to deal with backward compatibility too much when we change the optimizer.

Okay, so what are we working on right now, and what will we be working on in the future? There are a couple of interesting projects here. For example, there is an idea for caching, and we have actually started working on that. Trino now has an API for materialized views, and at Starburst we have an implementation of materialized views. There is also an idea to cache the intermediate results of plans. If you have a query, like the join on the left here, which could also be just part of a bigger plan, you can compute the signature of the plan section that starts with that join and compare the signature against, let's say, cached data. If you detect that the signature matches, you can replace that entire plan section with the already pre-calculated data.
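Trino's actual connector expression API is Java; the following is a rough Python sketch of the three expression shapes just mentioned (constant, variable, field dereference) and of how a dereference projection narrows what a connector has to read. All names here are simplified stand-ins, not the real SPI types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constant:
    value: object

@dataclass(frozen=True)
class Variable:          # essentially a column reference
    name: str

@dataclass(frozen=True)
class FieldDereference:  # e.g. col.field0
    base: Variable
    field_index: int

def pushed_down_read(expr):
    # Translate a projection into the narrowed thing the connector must
    # read: a dereference means decoding a single field of the struct
    # column instead of the whole struct.
    if isinstance(expr, FieldDereference):
        return (expr.base.name, expr.field_index)
    if isinstance(expr, Variable):
        return (expr.name, None)  # read the whole column
    raise ValueError("expression cannot be pushed down")

print(pushed_down_read(FieldDereference(Variable("col"), 0)))  # ('col', 0)
```

Keeping the connector-facing vocabulary this small is what gives the engine freedom to change its internal plan representation without breaking connectors.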
So in this case, let's say the signature matched plan signature A, which is a cached table A, and that could be the join of table A and table B. However, you can see that there is no filter in the cached plan, while there is a filter in the query on the left. So what you do is replace that entire plan section with a table scan on A and the filter on top. This is a neat trick: you can do caching between queries. If you are a data engineer, you can create materialized views from your federated data, or maybe from expensive lakehouse queries, and immediately take advantage of the performance boost this way, because you have transparently cached your data for your use cases.

Your materialized views, though, they can't be automatically maintained, right? Somebody has to say, go regenerate.

They are maintained automatically right now: you can define a schedule, and you can define incremental or non-incremental materialized views. We are also thinking about full CDC, so a change-data-capture API, even in Trino.

But if someone inserts into Postgres, and that's a connector, and there's no trigger set up to tell Trino this thing got modified, you'd have to refresh manually?

Oh yeah, I mean, that's one way to do it. But there is also, I would say, maybe a simpler solution: since you are refreshing, you are keeping some state for your materialized view, you know what state it is in and what data is inside, so you can do a union between your source and your materialized view. Then the only thing you read from the source is the hottest, newest data, until you merge it with the already cached data. This is another thing we have on the table; we will see where this goes. We prioritize this work based on customer feedback. We don't do that yet, so with materialized views, depending on the schedule, there might be some lag right now.

Another project we started is
the project to improve fault tolerance in Trino for longer workloads, longer queries. As part of that project, fault tolerance will be implemented by adding temporary storage: you don't stream data directly between operators and stages, you have temporary storage where you keep the data in case something fails. And because you keep the data in storage, you can now apply multi-stage planning: you know what the data shape is, you know the statistics, so you can, for example, change the join type or flip the join sides. You can do this planning incrementally. This is also a neat technique.

We also have plans for improved statistics. There might be cases where some outlier values are very frequent, so you might want to treat them differently, and you might want to detect correlation between columns. Another example is partitioned data in lakehouse connectors: typically the Hive metastore keeps statistics per partition, so it's hard to compute the number of distinct values for the entire table. You might want to apply some extra techniques to be able to compute the number of distinct values for the entire table even though you only have a sample of partitions. That could be the HyperLogLog algorithm, for example. And yes, it's not mentioned here, but...

Question: this is Rebecca Taft from Cockroach Labs. I was just wondering if you could talk a little bit about how you plan to determine correlation between columns. I think I remember you said earlier in the talk that you currently assume that columns are correlated. Does that mean you basically take the lowest selectivity predicate of several predicates and just use that as the selectivity of the full predicate? How are you doing that today, and how are you planning to change it?

So right now we take the driving clause and the auxiliary clauses, and
we try different driving clauses and take the plan that produces the least number of rows. As you said, for the correlations, histograms are on the table, but there is also a way to compute things like the Pearson correlation between columns, potentially. This is still in planning, so we will see. But histograms are expensive, essentially: if you want to do statistics computations on histograms, where you now have one histogram, adding ten histograms with ten values each multiplies your computations by ten, which might affect planning times.

You mentioned driving columns and auxiliary columns. Can you say again what that means?

So for the join, we have a join statistics calculator, and the question is, what is the output row count of the join? If you have more than one equi-condition, you pick one and compute the filtering factor for that equi-condition, because you know the number of distinct values for the left and right side of that condition. Then there are the other equi-conditions, the auxiliary ones. For those we don't compute a filtering factor based on the number of distinct values anymore; we apply a constant coefficient, which is 0.9 today, with the assumption that if a user writes a join with multiple equi-clauses, those clauses are usually correlated. That's the idea.

Thanks.

Okay, one more thing is that we are slowly thinking about how to move to an exhaustive optimizer. This is tricky, but at some point it will be required. If you add an exhaustive optimizer, you have to have good statistics, maybe not always, but it becomes even more important, and it may cost more time to choose the better plan because you evaluate many more combinations. So we will think about this. Thank you for listening. If you have any questions, let me know; I will unmute everyone else. So we
have a few minutes, so if you have any questions, please unmute yourself and fire away. Any issues? Let's go for it.

Yeah, so this is Samit. I had a question regarding, do you use any kind of semi-join? You went to the left side and pushed the predicate to the right side, but those could actually be Bloom filters as well.

Yeah, so we could convert these filters into Bloom filters, yes. Again, in the federated case it's a little bit trickier, because different lakehouses, for example, have different formats for Bloom filters, so when you collect these Bloom filters at runtime in the engine, you have to choose which hashing functions you want, for example. So that door is still open. Regarding semi-joins: we do have a semi-join implementation, yes, but where it's a filtering semi-join, we now convert the semi-join into an inner join, because a semi-join that doesn't really have to be a semi-join can then participate in join reordering. The other thing is that we can derive dynamic filters, and we actually derive dynamic filters for semi-joins too, but mainly for the sake of the CBO and join reordering, we try to avoid semi-joins.

Yeah, I got it. I mean, the advantage of Bloom filters is that the amount of data they pass is very, very small, and in fact you can make them symmetric. It's called a zigzag join: you go from the left, form the Bloom filter, go to the right, apply it, form another Bloom filter, bring it back to the left, and then finish. Now most of the rows that you're retrieving will be in the output, because you get symmetry on both sides: the left teaches the right and the right teaches the left. That's a very small amount of data.

Yeah, yeah, that's true. Like I said, dynamic filtering is a never-ending story: you complete something, and there are
two more doors open, and you can improve here and there, so it's not finished yet, I would say.

Okay, thank you. Anybody else? So I guess, Carol, maybe I missed this: the join reordering, is that pretty much the only place you're doing cost-based optimization? And it sounds like you're doing a bottom-up approach instead of a top-down approach, is that correct?

We actually do, yeah, it's a bottom-up approach, I think. Actually, right, so what happens is that we have a list of source tables, and the algorithm works like this: it splits the set into subsets and then applies join reordering to the subsets. So it's actually, I think, a top-down approach, yeah, a top-down approach but with memoization.

Okay, I've got one more question if you don't mind. Have you done any comparison with Dremio or, you know, the modern Redshift?

Well, I don't know about Redshift, but we did a comparison with Dremio, yes.

Hamid, that's a loaded question. Is it faster? What do you want him to say?

I think we are faster. We pushed for performance recently, in recent quarters, and we had some good gains, so right now I think we are faster. Also, Dremio has the problem, if I recall, and correct me if I'm wrong, that I don't think it runs all of the queries from the TPC-H and TPC-DS datasets, so we cannot fully compare the products yet.

So the Dremio talk is November 22nd; you can come and complain to them.

Yes, I know. So I have two questions. You guys are uniquely positioned in that you have data about the accuracy of the statistics of all these different data sources that you have connectors for. I understand that somewhere like an S3 bucket the data files are static, but I'm thinking about when you're connecting to MySQL or Oracle or Postgres or SingleStore, all these different data systems out there that maintain their own statistics, and you're basically trying to extract them and use them in your own
query optimizer, right? So I guess the first question is, do you control when you call ANALYZE on those other systems?

No, we don't do that.

And then, that means that across all the different queries running on the Starburst cloud platform, not if someone's on-prem, since you don't have access to that, but on the cloud platform people are probably connecting to everything, so you could potentially keep track over time of which of these different database systems are usually wrong about their statistics.

Yeah, I mean, I think the direction this will evolve in is that you would set up your data sources, but transparently, behind the scenes, you as a user would be able to say, okay, I want to cache this data, or maybe we cache it for you automatically. Because the problem with relational data sources is that they are often slow, and sometimes people don't want to put load on the source database, because they have other production workloads there. So if they want to offload from such databases, they usually want to move to Trino, but in such a way that they can reuse the existing queries. They want transparency, and here comes the technique of materialized views, or caching of results, and so on.

So people are using the federated aspect of Trino to get additional usage out of their existing databases on the path towards migrating off of them?

One typical example is that companies want to move off some of the classical databases, these big analytical appliances, but they have production workloads there. So they start by plugging in Trino first, and then they move workload to Trino. But as they move more and more workload to Trino, they put more pressure on
the source, so we want to enable them to move even more workload to Trino. I think they start simple by just connecting their source, and then they apply more techniques. This is mostly, I guess, for on-prem customers; in the cloud, what will happen is that we will manage this automatically for the customer.

Just to make a comment: I think you guys have very interesting data there, and as a researcher, if there's something public you can share about that, again not for any individual customer but across your entire customer base, like which database has the best stats. There's one paper that shows who has the best stats for a single-node workload, but across all your different customers, I think that would actually be super interesting.

Okay, yeah. The thing is, we probably could do that, actually, but there's always the problem of collecting this information: does the customer allow us to collect it, right? So on-prem it's hard; in the cloud it would be easier. Yes, in the cloud, not on-prem.

Okay, all right, cool.
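To make the join-statistics heuristic from the Q&A concrete, here is a sketch under one reading of it: the driving equi-clause contributes a filtering factor of 1 / max(NDV_left, NDV_right), each auxiliary clause contributes the constant 0.9 coefficient mentioned in the talk, and the calculator keeps whichever choice of driving clause yields the fewest output rows. The function name and exact formula are illustrative assumptions, not Trino's actual implementation:

```python
UNKNOWN_FILTER_COEFFICIENT = 0.9  # per auxiliary equi-clause, per the talk

def join_output_rows(left_rows, right_rows, clause_ndvs):
    # clause_ndvs: one (ndv_left, ndv_right) pair per equi-condition.
    # Try each clause as the driving one; the driving clause filters by
    # 1 / max(ndv_left, ndv_right), and every remaining (auxiliary)
    # clause is damped by the constant coefficient, on the assumption
    # that multiple equi-clauses in one join are usually correlated.
    estimates = []
    for ndv_left, ndv_right in clause_ndvs:
        rows = left_rows * right_rows / max(ndv_left, ndv_right)
        rows *= UNKNOWN_FILTER_COEFFICIENT ** (len(clause_ndvs) - 1)
        estimates.append(rows)
    return min(estimates)

# Two equi-conditions with NDV pairs (1000, 500) and (100, 100):
print(join_output_rows(10_000, 20_000, [(1000, 500), (100, 100)]))  # 180000.0
```

Picking the clause with the highest NDV as the driving one (here the 1000-value clause) gives the smallest, and under the correlation assumption the most plausible, cardinality estimate.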