The Carnegie Mellon Vaccination Database Tech Talks are made possible by OtterTune. Learn how to automatically optimize your MySQL and PostgreSQL configurations at ottertune.com. And by the Steven Moy Foundation for Keeping It Real; find out how best to keep it real at stevenmoyfoundation.org.

Welcome! So we're here for another talk in our Vaccination Database Tech Talks series. Today we're excited to have Hossein and Alex from the Google BigQuery team. Hossein has been a staff engineer at Google, leading the overall BigQuery project. He has a PhD from UIUC, though not in databases, in IoT devices. And Alex has been a technical leader of the Google BigQuery project. He spent 13 years working on SQL Server at Microsoft before joining Google. So with that, take it away, guys; we're happy to have you here. And as always, for the audience, if you have any questions for Alex or Hossein as they're giving the talk, please unmute yourself, say who you are and where you're coming from, and feel free to interrupt at any time. Let this be a conversation and not just them talking to the blackboard that is there. Okay, with that, guys, go for it. Thank you for being here.

Thanks, Andy, for the introduction. So hi everyone, I'm Hossein, and I'm joined by Alex to talk about query processing in Google BigQuery. We are very excited to share some of the details of how BigQuery does query execution and query processing.

To start, let me give you a little bit of an overview of what Google BigQuery is. It is a serverless, highly scalable, and cost-effective cloud data warehouse. There are several growing cloud data warehouses, but a few things are really distinct about BigQuery, really powerful characteristics. One is that it's a fully managed, serverless, and clusterless service. Users don't need to set up or provision servers or clusters; they can just come and use the service. If they want, they can reserve resources, and we'll talk a little bit about that, but everything scales up and down depending on the workload and storage demands. It's a 24/7 service with four nines of uptime availability. Users rely on predictable performance from the system, because the queries running in it are used in really critical reporting or business intelligence applications. In terms of scale, it's a petabyte-scale storage and query system; we often have customers that store data at that scale and query data at that scale. Everything is encrypted, everything is durably stored. It provides streaming ingestion and real-time analytics for a lot of real-time use cases.

The main interaction, obviously, since this is a SQL system, is going to be through SQL, but we tried really hard to make sure this is an easy-to-use SQL interface that doesn't require any hints or fine-tuning; everything is taken care of by BigQuery. And we have also embedded a lot of features into this SQL interface. For example, we have machine learning primitives within SQL that are really powerful and really useful for analysts doing various jobs that are not traditionally part of their data warehousing and data analytics SQL workflow.
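To make that interaction model concrete, here is a minimal sketch using the google-cloud-bigquery Python client: there is no cluster to provision, you just submit SQL, and the machine learning primitives (BigQuery ML) are themselves SQL statements. The project, dataset, and column names are hypothetical; this is an illustration, not anything from the talk itself.

```python
# A minimal sketch of the serverless interaction model: no servers or
# clusters to set up, you just submit SQL. All names below are made up.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

# An ordinary analytics query; BigQuery chooses the plan and the parallelism.
job = client.query("""
    SELECT state, COUNT(*) AS n
    FROM `my_project.my_dataset.customers`
    GROUP BY state
    ORDER BY n DESC
    LIMIT 10
""")
for row in job.result():
    print(row.state, row.n)

# One of the ML primitives embedded in SQL (BigQuery ML): training a model
# is itself just a SQL statement.
client.query("""
    CREATE OR REPLACE MODEL `my_project.my_dataset.demand_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['demand']) AS
    SELECT price, promo, day_of_week, demand
    FROM `my_project.my_dataset.training_data`
""").result()
```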
So how do we do that? This is a high-level architecture of what BigQuery looks like. It's a decoupled storage and compute system. We have distributed, replicated storage that spans multiple machines and storage devices in a Google data center. And on the right-hand side, we have a highly available compute deployment. The compute nodes can talk to the storage nodes through the petabit network infrastructure that's available in Google data centers. In between, we have a distributed in-memory shuffle tier that's heavily used for communication between the compute nodes. We'll talk a lot about that; it's a key part of query execution in BigQuery.

To interact with the system, users can run queries using SQL; it's SQL:2011-compliant. There are REST APIs, a web UI, and command-line interfaces, and there are client libraries in various languages that are used to programmatically access the BigQuery service, plus ODBC and JDBC drivers and various connectors as well. To get data in and out, there are streaming ingestion and reading APIs to read and write data at very high throughput and scale, and there are also bulk load and export functionalities on the storage side.

If I want to go over the high-level key architecture design principles of BigQuery, there's a combination of three things that makes BigQuery really successful at delivering the functionality I mentioned in the beginning. One is storage and compute separation. We know from various industry-leading systems that this helps a lot with getting to larger scale, petabyte scale of compute and storage. It makes systems highly available, because faults can be tolerated at different layers independently. And it also helps a lot with serverlessness and multi-tenancy, because resources can be isolated and scaled independently. So this is a key point in enabling that functionality. This is combined with really good use of co-location and caching across the storage and network systems, which gives us high performance at low cost even though storage and compute are separated. And the third thing is a really integrated hardware and software stack: BigQuery can take advantage of Google's hardware and data center technology to get really great performance in terms of storage, network, and data transfer.

So with that high-level overview, we're going to do a deeper dive into query execution, and I'm going to hand it over to Alex to talk about query execution basics.

Hi, everybody. So yes, I'm going to talk about how BigQuery optimizes queries, some details about how BigQuery runs them, and how we distribute resources between running queries. First of all, a little bit of terminology, which could be unusual: in BigQuery, a project or reservation denotes a pool of resources available to a workload. And in the cloud, of course, we have various ways to quickly shrink or grow that resource pool; for example, BigQuery can do that with one-minute granularity. For simplicity in this discussion, we're going to assume that we have a fixed pool of resources available, which is quite a common configuration for many users.

Now, about BigQuery's query processing goals. Of course, we have a goal to run queries quickly and efficiently. But besides that, we also have a goal of achieving more predictable query performance whenever possible. As Hossein mentioned, BigQuery has an SLA of four nines. To put that into perspective, the goal is essentially to have just one minute a week of downtime, averaged. Such a high SLA is definitely suitable for critical workloads. But then, if you're running a critical workload, it's also important to finish queries on time, by their deadline. So again, we also need to invest in performance that is as predictable as possible, and this affected some of the algorithms and designs.
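As a quick sanity check of that one-minute-a-week figure (my arithmetic, not the speakers'): four nines means 99.99% availability, and 0.01% of a week is almost exactly one minute.

```python
# Four nines of availability works out to about one minute of downtime per week.
availability = 0.9999
minutes_per_week = 7 * 24 * 60                      # 10,080 minutes
downtime = (1 - availability) * minutes_per_week
print(f"{downtime:.2f} minutes/week")               # ~1.01
```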
So let's go back to the topic of how queries run. A query is executed using multiple stages. A typical example of stages: we scan a table, we do some filtering on top of that, and then we do a partial aggregation. Each stage consists of tasks. A task is a unit of work to be done; tasks are assigned to workers, and workers process them. A stage is done when all of its tasks are done. Then another stage can become unblocked, and we can start executing it. Multiple stages can be running at the same time. A typical example: if a stage needs to filter aggregation results, that stage would need to wait until the aggregation is done, and only then could it start to execute. The number of stages in a query is not fixed and can sometimes change depending on data distribution, but the overall rule of thumb is that each aggregation probably requires one stage, each join adds one or two stages, and so on. So on this slide, we see an example of a simple query which requires two stages: in the first stage we do the table scan, filtering, and local aggregation, and the second stage computes the final count. Next slide, please.

Okay, here I'm going to mention some interesting facts about how BigQuery runs queries. BigQuery tries to keep data in memory during query processing if possible. That is, if you are running a medium-sized query in an idle project, meaning you have some idle set of resources, and we consider a query medium-sized if it processes, for example, one terabyte of data, then quite likely the intermediate data will never be stored on disk, not even on SSD, until we finally persist the query results. So basically, BigQuery tries to do in-memory processing whenever possible. We will go into more detail about dynamic scheduling later, but as I mentioned, a query is executed using multiple stages, which have multiple tasks. The number of workers that process those tasks is very flexible and can change in a matter of seconds, anywhere from one worker to many thousands of workers.

About failure recovery: stage tasks can be restarted or executed speculatively. This helps maintain high performance even with the rare hardware problems you get in a highly distributed environment. For example, if one of the computers becomes less responsive, we do not have to wait a long time to decide whether we have a real problem or not. We can instead simply speculatively start the task execution in a new location, even after a slight delay. This way, hardware problems won't affect query performance. So, any questions?

Yeah, so for the sake of argument, can you say what percentage of the tasks in the entire BigQuery fleet are executed speculatively and turn out to be unnecessary? Like, there's a hiccup in the network or something and you think a node is down, but it's really okay. So you start speculatively executing the work, but it turns out you wasted that, because the node came back right away. You know what I mean?

I don't think it's very large. It's mostly extremely large queries that require, you know, some duplicate executions of some of the tasks. But in the grand scheme of things, it's not a very big fraction. Okay, any additional questions so far?
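A toy sketch of the stage/task model and the speculative re-execution described above; this is not BigQuery's scheduler, just an illustration of the idea, with the filter, timeout, and shard layout made up. Stage 1 tasks scan and partially aggregate shards of a table, stage 2 combines the partial counts, and a task that is slow to finish gets a speculative duplicate, with whichever attempt completes first winning.

```python
# Toy illustration of stages, tasks, and speculative task execution.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def stage1_task(rows):
    """Scan + filter + partial aggregation over one shard of the table."""
    return sum(1 for r in rows if r["amount"] > 100)   # hypothetical filter

def stage2_task(partial_counts):
    """Final aggregation over all stage-1 outputs (the second stage)."""
    return sum(partial_counts)

def run_with_speculation(pool, fn, arg, timeout_s=5.0):
    """If an attempt hasn't finished after timeout_s, launch a duplicate
    in a new location and take whichever attempt completes first."""
    first = pool.submit(fn, arg)
    done, _ = wait([first], timeout=timeout_s, return_when=FIRST_COMPLETED)
    if done:
        return first.result()
    backup = pool.submit(fn, arg)                      # speculative copy
    done, _ = wait([first, backup], return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

shards = [[{"amount": a} for a in range(200)] for _ in range(8)]
with ThreadPoolExecutor(max_workers=16) as pool:
    partials = [run_with_speculation(pool, stage1_task, s) for s in shards]
print(stage2_task(partials))                           # final COUNT
```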
Again: queries consist of stages, stages consist of tasks, and tasks are executed by workers.

You're talking about the scheduling stuff later on, right?

Yes, yes, yes.

I'll hold my questions for later then. Okay.

Next slide, please. This slide illustrates the various components that are involved in query processing. It also shows how stages interact with each other. Each stage reads data from either a distributed file system or from a distributed component called shuffle, and we will talk about shuffle in more detail. All work for a given query is coordinated by a query coordinator, and there is also a scheduler component that helps assign tasks to workers. On this slide, you see stage one reading data from the distributed file system, processing it, and putting intermediate results into shuffle. The next stage, stage two, will read data from shuffle, process it, again put partial intermediate results into shuffle, and so on. In summary, each stage usually reads data from the distributed file system or from shuffle, and then usually puts its results back into shuffle.

So it looks like this shuffle component is very important. What exactly is it? Next slide, please. Shuffle is essentially very fast, very highly available storage combined with a data repartitioning engine. Shuffle not only stores data, it can also, for example, hash partition it. And as we know, hash partitioning is a very common operation in distributed query processing. If we want to compute an aggregation, we usually hash partition data on the aggregation keys. If we want to perform a join between two large tables, we usually need to hash partition the inputs by the join keys. The shuffle component does exactly that: besides storing intermediate results between stages, it also partitions data. Shuffle generally tries to keep all data in memory, but of course, if the intermediate data becomes very large, it can write those intermediate results to the distributed file system. All of this is quite fast, and in many cases it's actually hard to tell whether the data stayed in memory or some of it ended up on disk. So, any questions about shuffle?

I believe there's a question in the chat. Someone asked: what happens when a server dies? Do you need to restart the query, or does the query continue, like in Hadoop MapReduce? Alex can elaborate on that one, but basically we use shuffle as an intermediate persistence layer, like MapReduce or Hadoop, to recover work that needs to be restarted. And it's at the task level, so the whole stage doesn't need to be restarted.

Yeah, for the most part you would never notice anything. There could be quite significant problems with the hardware and you wouldn't even notice; the query would not even slow down, let alone be canceled. Okay, next slide, please.
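Here is a toy model of the shuffle primitive just described: an in-memory store that hash partitions rows by key, so that worker i of the consuming stage reads exactly partition i. The real shuffle tier adds replication, dynamic repartitioning (covered later), and spilling to the distributed file system; none of that is sketched here.

```python
# Toy in-memory shuffle: stores rows and hash partitions them by key.
from collections import defaultdict

class Shuffle:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)      # partition id -> rows

    def write(self, key, row):
        # Hash partitioning: equal keys always land in the same partition,
        # which is exactly what distributed GROUP BY and hash joins need.
        self.partitions[hash(key) % self.num_partitions].append(row)

    def read(self, partition_id):
        return self.partitions[partition_id]

shuffle = Shuffle(num_partitions=4)
for state, count in [("CA", 10), ("NY", 7), ("CA", 3)]:
    shuffle.write(state, (state, count))
print(shuffle.read(hash("CA") % 4))   # both CA rows, ready for one worker
```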
So if people have questions, type them in the chat, feel free to unmute yourself, or just ask.

Okay, I just sometimes don't notice them.

You're fine, I'm telling the audience. So, Swar, are you able to unmute yourself?

Yeah, hi. So, are there any parameters for handling the shuffle partitions, or does BigQuery do that behind the scenes based on something? Because we have traditionally seen in distributed systems that, based on the data load, or too much skewness in the data, or doing broadcasting, we have to handle this partitioning parameter case by case. Does BigQuery do that by intelligently looking at the data, or?

Yeah, so we actually have many slides discussing exactly how we partition data, so we'll go into those details.

Okay.

All right, Javi, you have a question?

Yeah, can you hear me? Yes. So in your serverless environment, you do parallelism across servers, but within a given server, do you have special parallelism for multi-core, like if you've got 32 cores, do you have anything special for that?

Yes. It depends on what kind of workload you're running, BI or data warehousing, but we definitely use more than one thread to process tasks.

Yeah, because with multi-core you also need to shuffle data across cores, so do you use the multi-node shuffle, or is this something special?

Again, it's slightly different depending on whether we're using the business intelligence engine or the regular one, but there are still other things that you can process in parallel depending on the query shape and stage shape.

Okay, so let's go to the next topic. In BigQuery, one can have different-sized reservations or projects; again, this is a set of resources. A slot represents some fixed set of resources available for query processing in the reservation or project. These resources could be CPU, worker RAM, shuffle RAM, networking, and so on. If you compare a 500-slot reservation with a 2,000-slot reservation, the second one is going to have four times more resources dedicated to query processing. So we usually do not measure resource pool size in number of computers; we measure it using a virtual description of the resources committed to that reservation. The more slots we have, the faster queries will run. A typical, most common configuration for a reservation is 2,000 slots, let's say, and it maps to some number of computers; we just don't really surface that number.

When queries run, sometimes some of the query stages do not need many workers, simply because there is little data to process. The system is able to very quickly take workers away from queries that do not need them. The system is also able to very quickly redistribute workers between queries in flight to make sure each query gets its fair share. For example, if one query finishes, then immediately the rest of the queries can start using the additional workers, and they run faster. This can happen as soon as the next second. Or if a new query is submitted, then a small fraction of workers is taken away from each already-executing query and assigned to the new query. Again, every query gets its own fair share of workers.

This fluid resource distribution is important for more predictable query execution inside a busy system. If you think in terms of how many resources each query gets, it's clear that the dedicated resources correlate with the average reservation load, and not with how busy the reservation happened to be at the moment the query arrived. Imagine a typical scenario where a few independent pipelines are performing calculations by running series of queries. The pipelines are independent, so there is no coordination between the moments when queries are submitted. If we were, for example, to queue queries, or to lock in the resources allocated to a query at the time the query arrived, then system performance would be much less predictable, because performance would frequently depend on luck: whether a particular pipeline's query arrived in front of or behind another large query from another pipeline.
And then it would get sometimes more dedicated resources, sometimes less. Because of this flexible resource distribution, that matters much less here. Essentially, we get more predictable performance because each query gets a fraction of the resources averaged over the time the query was running. So, any questions about this?

Yeah, is there any mechanism for a user to override the default fair distribution so that they can prioritize queries?

Hossein, do you want to answer that one?

So, there are some mechanisms in terms of the reservation controls BigQuery provides today for how resources are allocated between multiple reservations. So there is some of that, but there are tons of workload management features that we've been working on, to, maybe not do exactly what you mentioned, but to have some kind of control over how things are prioritized over each other. So that's something we've been looking into as well. Thanks.

Yes, essentially there are quite a few options to adjust this algorithm that I described, but the key idea is that you have quite a lot of flexibility to transfer resources from one query to another very quickly. Next slide, please.

May I ask a question? Sorry.

Yes.

So I'm not sure whether I missed this earlier, but could you elaborate a little bit on what you mean by this fair share? Is it that you estimate the required resources for each query and then distribute your slots accordingly? Could you elaborate a little bit on this concept of fairness and how it's decided?

Okay. So if you have five queries running simultaneously, we will try to make sure that every query gets the same CPU time per unit of time. It depends on your reservation size, but we'll essentially try to make sure that approximately the same resources are used to execute each of the five queries.

Regardless of how big those queries may be?

Yes, regardless of how big those queries might be.

Okay. Sure. Thank you.

If I can add to that: you asked about estimating the number of slots, the resource usage, beforehand. We have algorithms that monitor the query execution, figure out how the resource usage has been evolving over time, and adjust based on that. So it's a lot of dynamic allocation of resources over time, and that's why we don't need to estimate it from the beginning.

Thank you. Thanks for the explanation.
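A toy sketch of that fair-share idea, under the talk's own simplification of a fixed pool: whenever a query arrives or finishes, the slots are immediately re-divided evenly among the queries in flight, so what a query receives tracks the average load over its lifetime rather than the instant it arrived. The even split is the simple version described here; the real system has many more adjustments.

```python
# Toy fair-share slot allocation over a fixed reservation.
def fair_shares(total_slots, running_queries):
    """Divide the slot pool (roughly) evenly among in-flight queries."""
    n = len(running_queries)
    if n == 0:
        return {}
    base, extra = divmod(total_slots, n)
    return {q: base + (1 if i < extra else 0)
            for i, q in enumerate(running_queries)}

pool = 2000                                   # e.g. a 2,000-slot reservation
print(fair_shares(pool, ["q1", "q2"]))        # {'q1': 1000, 'q2': 1000}
print(fair_shares(pool, ["q1", "q2", "q3"]))  # q3 arrives: ~667 slots each
print(fair_shares(pool, ["q1"]))              # others finish: q1 gets all 2000
```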
Next slide, please. Okay, so let's talk a little bit about the BigQuery optimizer. The BigQuery optimizer was implemented basically from scratch as part of Dremel's development at Google. It uses both cost-based and rule-based optimizations to produce an optimal query plan.

Do we have any special goals for the BigQuery optimizer besides performance, efficiency, and query predictability? One of the additional goals was an increased focus on handling nested data, like structs and arrays. It's just quite convenient to have very native support for arrays and nested structures.

So what are examples of the cost-based and rule-based optimizations that BigQuery does? A good example of a cost-based optimization is materialized view choice. In BigQuery, one can define materialized views that, for example, hold pre-aggregated data. A materialized view can have clustering columns defined, so the data in the materialized view will essentially be sorted by those columns. The query optimizer then tries to find the most optimal materialized view to use. It needs to take a look at the data filtering involved and at how the data was clustered, so it can find the materialized view with the estimated minimal number of rows to scan. And this is exactly what it does.

What about rule-based optimizations? Of course, as everybody knows by now, the simplest rule-based optimization is usually filter pushdown: we always try to execute filters as early as possible. Another good example of a rule-based optimization is partial aggregation. BigQuery always prefers to do a partial aggregation before the final aggregation, because it helps reduce possible data skew by the time you need to hash partition the data by the aggregation keys. This way, query processing time is less sensitive to data skew and more predictable. And again, data skew is frequently very difficult to predict up front. Next slide, please.
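To make the materialized-view example concrete, here is a hedged sketch in BigQuery SQL, issued through the Python client. The view pre-aggregates and declares a clustering column; a later query over the base table never names the view, and the cost-based optimizer may choose to answer it from the view, scanning far fewer rows. All project, dataset, and column names are invented for illustration.

```python
# Sketch: a clustered, pre-aggregated materialized view (names are made up).
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
    CREATE MATERIALIZED VIEW `my_project.my_dataset.sales_by_store`
    CLUSTER BY store_id AS
    SELECT store_id, DATE(ts) AS day, SUM(amount) AS total
    FROM `my_project.my_dataset.sales`
    GROUP BY store_id, day
""").result()

# This query targets the base table; the optimizer is free to rewrite it to
# read sales_by_store instead, pruning by the clustered store_id column.
job = client.query("""
    SELECT store_id, SUM(amount) AS total
    FROM `my_project.my_dataset.sales`
    WHERE store_id = 42
    GROUP BY store_id
""")
print(list(job.result()))
```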
So, for the optimizer, can you say anything about whether it's the classic Cascades-style top-down, or System R-style bottom-up?

No, it's not based on Cascades. In data analytics workloads there are some specifics that sometimes let you do things in, I guess, a simpler way. But we have many other interesting things happening with query optimization, so let's maybe go to the slide.

Can I ask one question? Yes, sure. This is Rebecca Taft from Cockroach Labs, and I was just wondering, you mentioned that you're doing some interesting things with support for arrays and other types of structures. Are you going to talk more about that?

So, I don't know if you saw the papers about Dremel originally; quite frequently people like to denormalize data at extremely large data sizes. If you think of something like a petabyte in size, frequently at that scale it's even hard to normalize data, or you may not have enough time to do that. So for that particular scenario, extremely large data scale, it's sometimes useful to have good native support for nested, denormalized data, just because an ETL process could become extremely expensive at those data scales.

So when you say nested support, are these things like inverted indexes and things like that?

No, I meant that you can have columns that have repeated data inside, like arrays, and each array element in turn can have structures inside with multiple fields, and those fields again can have arrays. So basically, deeply nested data which is stored in a single table.

Got it, thanks.

Okay, so let's go beyond rule-based and cost-based optimizations. BigQuery also has optimizations that it performs in the middle of query processing. This is because queries are frequently auto-generated by various tools and are quite complex, and it's difficult to reliably estimate data distribution after, let's say, ten joins. With up-front cardinality estimates, we have errors that grow exponentially with query complexity. In addition, those errors can be quite sensitive to parameter values, data size changes, or the various statistics that are collected. For example, BigQuery has two different algorithms for running the analytic function ROW_NUMBER with an OVER clause that contains an ORDER BY clause. One works well with small partitions and the other works well with large partitions. If we were to use estimates to make the algorithm choice up front, not only could the choice be suboptimal, but because the estimation error keeps changing, we might end up switching back and forth from one query plan to another, resulting in less predictable query performance overall.

So instead, what we do is observe and monitor data sizes in real time as we process the data, and then we can change the query plan as needed based on these observations. By now I've mentioned a few times that we try to make query performance more predictable whenever we can. I already mentioned partial aggregation, which helps make query processing less sensitive to data skew. Another example is broadcast joins: those have less penalty in the case of skewed joins, and BigQuery prefers to run broadcast joins wherever it can. We will go into even more detail on the next slide. Next slide, please.

So what are examples of query plan adjustments that BigQuery can do in the middle of a running query? First of all, the parallelism level. BigQuery can change query parallelism depending on data size. Say we initially estimate that we need to aggregate one terabyte of data, but then it appears that the amount of data is going to be larger, more like ten terabytes. Intuitively, if additional workers are available, we should use them, because the amount of work went up, and this is exactly what happens in BigQuery.

Another example of a dynamic change to the query plan is that we can switch between a broadcast join and a shuffle join. BigQuery runs two kinds of joins. A shuffle join is when we hash partition both join sides and each worker then gets its own partition to process. A broadcast join is when one of the join sides is small, so we can broadcast it to every worker. I'm pretty sure you're familiar with both join flavors. Broadcast joins have some nice properties: they are less sensitive to join skew and usually faster, especially when the broadcasted table is very small. So ideally we should use a broadcast join whenever we can, and this is exactly what happens in BigQuery: we measure the observed data sizes in the middle of the query we're processing, and we can actually change the join flavor on the fly. And it doesn't depend at all on query complexity; it works with extremely complex queries.

And yet another example of a query plan change we can do is the ROW_NUMBER analytic function I mentioned: we have two different algorithms to process it when it requires sorting, and we measure the data sizes on the fly and then decide which algorithm to activate, and for which partitions. Any questions related to these dynamic query plan adjustments? Hossein will go into much more detail in the next slides about how exactly these adjustments happen.
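A toy sketch of the first adjustment mentioned above, dynamic parallelism: the coordinator watches how much data a stage actually produces and re-derives the worker count mid-query when the observation diverges from the initial estimate. The bytes-per-task figure and the cap are invented; the point is only that the worker count is a function of observed, not estimated, data size.

```python
# Toy dynamic parallelism: workers scale with *observed* data size.
def target_workers(observed_bytes, bytes_per_task, max_workers):
    """One task per chunk of observed data, capped by the reservation."""
    tasks = max(1, observed_bytes // bytes_per_task)
    return min(tasks, max_workers)

TB = 1 << 40
# Planned for ~1 TB of input ...
print(target_workers(1 * TB, bytes_per_task=TB // 1000, max_workers=2000))
# -> 1000
# ... but mid-query we observe ~10 TB, so the stage scales up on the fly.
print(target_workers(10 * TB, bytes_per_task=TB // 1000, max_workers=2000))
# -> 2000 (capped by the available resources)
```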
Yeah, so this is Hamid. Do you also change the join method during runtime? You know, if you use a hash join, sometimes it runs out of memory, so for those blocks you may want to change to a sort-based join, because you have much more control over the memory.

This is something people have actually pressed us on. We don't use sort-based joins, but we definitely pay attention to data sizes, and again, you don't need to worry about which join flavor we will run. You will never run out of memory or anything; the query will always complete successfully. And again, we choose the join by looking at what data sizes and data distribution we get. In many cases we try to use a broadcast join if we can, but if we cannot, we'll run a shuffle join.

Yeah, that much I understand. My point was that for some blocks you would be using a hash join and for some blocks a sort-merge join.

Sort merge, no, we're not using sort-merge joins.

So how do you avoid the issues due to hash overflow?

Broadcast joins usually don't need any special handling; shuffle joins usually do.

I think what you're saying is it's like the classic textbook recursive partitioning, right? If you overflow the bucket, you just partition it again.

Yes. Hossein will go into some detail on how we deal with large hash buckets; we have dedicated slides about it. Okay, so now Hossein will continue from here.

All right, thanks, Alex. So, with that high-level overview of dynamic query optimizations, I'm going to go through some examples of the basic operations we do here, and I hope I'll be able to answer some of those questions along the way.

Let's start with an aggregation example. We have a query that does a GROUP BY operation, and the dynamic partitioning algorithm is really aimed at being able to execute this distributed aggregation in a scalable way without having any statistics or knowledge about the input to the operation. Say we start with some workers running the first stage of this query, doing a partial group-by locally over the input table. Then they start to shuffle: basically, we want to partition, via dynamic partitioning, based on the group-by key, and then do the group-by locally. That's how we get the distributed group-by running. But this shuffle needs to have some notion of dynamic partitioning in it, because we want to optimize the level of parallelism and the number of nodes participating on the consuming side of the shuffle when we compute the group-by output; I will talk about that more on the next slide. After that shuffling with dynamic partitioning is done, the data is distributed based on the group-by key across a set of workers that can now do the final aggregation for each of the key groups. They then perform the sort and limit operations on top of that as well, and finally the results are combined in one worker with a final sort and limit to get, say, the top 100 results back to the customer.

So let me go through that shuffle-with-dynamic-partitioning operation, and then I'll pause in case you have questions about this particular algorithm. Dynamic partitioning is really a core, critical part of BigQuery's dynamic query execution. Its goal is to dynamically load-balance and adjust parallelism every time we have a shuffle operation, assuming nothing is known about the input to these operations, because it's hard to estimate filter effectiveness or cardinalities or things like that. With this powerful primitive we can always adapt the execution.

Let's go through an example to see how it works. We have two workers producing some data. They start to write data into two partitions; these are hash-based buckets, basically. And, going back to that skew example, a lot of the time one of these partitions will get much more data than the other. In this case, partition number two gets a lot of data. While the query, while the stage, is running, the query coordinator detects this skew across the partitions and, on the fly, directs the workers to change the way they partition the data and to use two new partitions, partitions three and four, to further divide the data. The workers stop writing to partition two and start writing into partitions three and four. As the stage progresses, more data goes into those partitions, and eventually the data that was previously written into partition two gets repartitioned into the new partitions, three and four. So at the end of the query execution, sorry, I shouldn't say query execution, the stage execution, we have the data in those three partitions: one, three, and four.
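Here is a toy model of that sequence, with the per-partition threshold and the routing scheme invented for illustration; the real coordinator's policy is more sophisticated. Writers fill hash partitions; when one partition exceeds the threshold, it is retired, two fresh partitions take its place, and the rows already written to it are repartitioned by a finer-grained hash, so a given key still ends up in exactly one final partition.

```python
# Toy shuffle with dynamic repartitioning of a skewed partition.
from collections import defaultdict

LIMIT = 4  # hypothetical per-partition row threshold for this toy

class DynamicShuffle:
    def __init__(self, initial_buckets=2):
        self.partitions = defaultdict(list)
        # each hash bucket starts out routed to a single live partition
        self.routing = {b: [b] for b in range(initial_buckets)}
        self.next_id = initial_buckets

    def _pick(self, bucket, key):
        live = self.routing[bucket]
        # a secondary hash chooses among the partitions a bucket was split into
        return live[0] if len(live) == 1 else live[hash((key, "fine")) % len(live)]

    def write(self, key, row):
        bucket = hash(key) % len(self.routing)
        pid = self._pick(bucket, key)
        self.partitions[pid].append((key, row))
        if len(self.partitions[pid]) > LIMIT and len(self.routing[bucket]) == 1:
            self._split(bucket)

    def _split(self, bucket):
        # Coordinator reaction to skew: retire the hot partition, route new
        # writes to two fresh partitions, and repartition the rows already
        # written, so no key ever appears in more than one final partition.
        old_pid = self.routing[bucket][0]
        self.routing[bucket] = [self.next_id, self.next_id + 1]
        self.next_id += 2
        for key, row in self.partitions.pop(old_pid):
            self.partitions[self._pick(bucket, key)].append((key, row))

shuffle = DynamicShuffle()
for i in range(40):
    shuffle.write(f"user{i % 10}", i)
print({pid: len(rows) for pid, rows in shuffle.partitions.items()})
```

(In this toy, each bucket splits at most once; the real system keeps adjusting as the stage runs.)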
Any questions so far?

Yeah, so what do you do with a distinct aggregate? Do you still do the same thing? Because you have to remember what you have already counted; you shouldn't count it again. Like COUNT(DISTINCT).

So, any hash-based aggregate operation that runs in a distributed way can be built on top of this operation. What we want at this stage is to create a partitioning that is load-balanced, roughly equal across these partitions. Whatever we do on top of that, an aggregate, whether it's a COUNT(DISTINCT) or something like that, can be done independently, because these partitions don't overlap. The key here is that the data in partition two gets repartitioned into three and four, so the same data doesn't appear twice when we consume it at the next stage or next level. There are no duplicate keys; the keys get repartitioned into those two partitions.

Hi, Hossein, I have one question. If clustering is enabled on this table, along with this partitioning, how do the workers handle that? Say, in your earlier query, I wanted to filter based on state, like California or Georgia, in the WHERE clause. I think BigQuery recommends using clustering when the number of distinct values is pretty high, like state or gender or whatever, right? How does it balance this partitioning with clustering?

That's a great question. When you have clustering, and your aggregation or filter is on the clustered column, what you see is that the input workers processing it do the partial aggregation and produce a tiny amount of data that actually needs to be shuffled through the dynamic partitioning. And that's where dynamic partitioning works pretty well: if the data is already partitioned properly for that type of distributed operation, the query produces very little data, and all of it sometimes just gets aggregated in one worker and sent back. So a lot of work is saved through that.

Okay. All right.
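A two-line illustration of the disjointness point made above, that per-partition distinct aggregates can simply be combined because hash partitioning never puts the same key in two partitions (toy data, my example rather than the speakers'):

```python
# Disjoint hash partitions make distributed COUNT(DISTINCT) safe to add up.
rows = ["CA", "NY", "CA", "TX", "NY", "CA"]
parts = {0: set(), 1: set()}
for state in rows:
    parts[hash(state) % 2].add(state)        # equal keys -> same partition
print(sum(len(s) for s in parts.values()))   # 3 == COUNT(DISTINCT state)
```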
So, I'm going to move on to the join example now. We talked about broadcast and shuffle joins, and one of the key parts of dynamic join execution is deciding between the two join methods. For the broadcast join, there's a simple diagram, a simple query, showing that if one side of the join is much smaller than the other, a worker, in this case a single worker, reads it and uses the same shuffle primitive to transfer the data to the workers that are scanning the probe side of the join; they build a hash table from the broadcast side, and then the results are finally aggregated.

Similarly, a shuffle join uses the shuffle primitive on both sides: we co-partition the data on the same partitioning keys, and then workers read the left-side and right-side partitions aligned together, compute the join, in this case a hash join, aggregate the results at the end, and send them back. So these are the two different execution graphs, for the broadcast versus the shuffle join. But what happens in reality is that we don't know the sizes of the two tables beforehand, so we need to dynamically go from one to the other. I'm going to walk through an example to show how that works.

Before getting into that: the dynamic join algorithm needs to make multiple decisions, and they are very important for join performance. One of the advantages of the broadcast join, as Alex mentioned, is that with data skew there is usually no performance penalty, and it also transfers much less data in most cases compared to a shuffle join. So it's very important to find the right join algorithm. We also need to decide on the number of partitions and workers for the join, and the dynamic partitioning algorithm I mentioned earlier is very useful here. The dynamic join also needs to coordinate across multiple joins, and sometimes swap the join sides, depending on the sizes and shapes of the joins. And a pattern that's very common in data warehouses is the star-schema join, where a fact table is joined with a lot of dimension tables. Our optimizer is able to detect such joins and then do constraint propagation: it computes constraints from the dimensions on one side and propagates those predicates to the fact table, so the fact table is heavily reduced before the join happens. Whether it's a broadcast or a shuffle join, that helps by reducing the amount of data participating in the join phase.

So with that, let me go through an example of choosing the join algorithm with the dynamic join method. In this case, we have two tables being joined on a key, and the dynamic query execution starts by running the stages that scan both sides of the join and shuffle them. At this point, the query coordinator doesn't know how big a join this is, or how big each side of the join is. At some point, it detects that one side, T1, is small. This means that instead of shuffling both sides and running a shuffle join, we can actually run this query as a broadcast join. So it cancels the other side, the T2 side, so that it can scan T2, the probe side, and broadcast T1 to it. The same data that was being shuffled is now read as a broadcast by the workers, which do a hash-based join from the broadcast side against the probe side, compute the result, and send the result back.

In the shuffle join case, the same thing might happen, but T1 and T2 might both turn out to be too large. The query coordinator detects that case and, in the middle of execution, decides that this join cannot be executed as a broadcast join. So it continues to run it as a partitioned join, and it schedules the next stage, which does the shuffle hash join.
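A toy decision rule in the spirit of that example (the threshold is made up; the talk does not give one): both sides begin shuffling, and as soon as one side is observed to be small, the other side's shuffle is cancelled in favor of a broadcast; if both sides grow past the threshold, the plan stays a shuffle join.

```python
# Toy version of the dynamic broadcast-vs-shuffle join decision.
BROADCAST_LIMIT = 100 << 20   # hypothetical 100 MB broadcast threshold

def choose_join(observed_t1_bytes, observed_t2_bytes):
    small = min(observed_t1_bytes, observed_t2_bytes)
    if small <= BROADCAST_LIMIT:
        build = "T1" if observed_t1_bytes <= observed_t2_bytes else "T2"
        return f"broadcast join: cancel the other shuffle, broadcast {build}"
    return "shuffle join: hash partition both sides on the join key"

print(choose_join(12 << 20, 80 << 30))   # T1 ~12 MB  -> broadcast T1
print(choose_join(5 << 30, 80 << 30))    # both large -> shuffle hash join
```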
Any questions? Okay, so I'm going to go through a more complex example of dynamic query execution, to give you an idea of how, for more complex query plans, we can use these dynamic query execution primitives to predictably run queries without statistics about the inputs. This example is about computing the ROW_NUMBER analytic function. In this case, we potentially have a PARTITION BY clause, and then we are computing row numbers within those partitions. It's an expensive operation, because we're required to sort the data for those partitions, and because of that it's important to get the right level of parallelism. However, it's a little bit challenging, because when we partition the data, some of the partitions may get a lot more data than the others, and we want to be able to scale to that and handle that skew in the partition sizes efficiently.

So what BigQuery does is start to scan the data and perform the hash-based partitioning into the initially created buckets, the PARTITION BY buckets used in the query execution. Some of these partitions will be small, and for those cases the execution is simple: the data is sorted locally within the workers reading each partition, and then the row numbers are calculated. But for the partitions that are larger, we have a secondary stage within the query execution. As the query is running, on top of the buckets that were created for the hash partitioning, one of those partitions is split into multiple new partitions based on the ranges of the values being sorted. Each of the range-partitioned buckets is then sorted separately. This gives us a global order, because these are range partitions, global in the sense of that particular partition, and the row numbers can then be calculated in parallel across those workers. Finally, the results for both the small partitions and the large partitions are combined and returned to the user. Any questions on this?

So, does that mean that one partition that you need to put the row numbers on, like 0, 1, 2, 3, 4, 5 up to the max, is really split across multiple workers? Then how do you compute the row number?

So, this is kind of a distributed sort problem, plus computing the row number. If you know, for example, how many rows you have in each of these buckets, and the buckets do not overlap in terms of the ranges they cover, you can create a global row ID; but you need to know how many rows there are. The actual partitioning is even more complicated than this: it's not that you just distribute the data and immediately start running the row number calculation on top of it. You have to do a bit of communication and detection of how the rows are distributed across these buckets. But at the end, you have these ranges, and each of them has a known number of rows, so you can assign a row number to all of the data distributed there.
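A small worked version of that answer, with invented values and ranges: because the range buckets do not overlap, each bucket can be sorted independently, and prefix sums of the bucket row counts turn local ranks into a single global ROW_NUMBER numbering.

```python
# Toy two-level ROW_NUMBER over one large partition, split by value ranges.
from itertools import accumulate

big_partition = [42, 7, 99, 13, 58, 71, 3, 88]   # values to ORDER BY
ranges = [(0, 50), (50, 100)]                    # chosen from observed data

buckets = [sorted(v for v in big_partition if lo <= v < hi)
           for lo, hi in ranges]
# Prefix sums of bucket sizes give each bucket's global starting offset.
offsets = [0] + list(accumulate(len(b) for b in buckets))[:-1]

row_numbers = {}
for bucket, offset in zip(buckets, offsets):
    for local_rank, value in enumerate(bucket, start=1):
        row_numbers[value] = offset + local_rank  # global ROW_NUMBER()

print(row_numbers)
# {3: 1, 7: 2, 13: 3, 42: 4, 58: 5, 71: 6, 88: 7, 99: 8}
```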
So, this is going to be our last slide. This is a summary of what BigQuery does in terms of query optimization. It uses a mix of cost-based, rule-based, and dynamic optimizations, and we talked about why we really need a mix of all of these to get predictable performance for a very versatile workload. There are really two keys to using that dynamic optimization approach. One is BigQuery's ability to collect real-time statistics and update all of these operations in flight, mid-execution; that is the key to doing the dynamic partitioning, or choosing the join method, or doing the row number calculation, for example. And the second ability of BigQuery that enables dynamic query execution is the fact that we can re-execute small parts of the query, because shuffle creates a snapshot, a checkpoint basically, that we can go back to and re-execute parts, even small tasks, within the query. That makes modifying the query plan really easy. The dynamic optimizations we just talked about, combined with the adaptive resource allocation Alex described, give us that performance predictability, because now we can really, depending on the shape of the query and the shape of the data, provide the same throughput and performance. And it's a much more robust way of handling arbitrarily complex queries and datasets, without relying too much on things that might break easily, like statistics. So this concludes our talk. Thanks for listening, and if you have any questions, Alex and I would be happy to answer them.

Okay, awesome. Thank you. I will applaud on behalf of everyone else. So we have a few minutes for questions. If you have any questions for Alex or the team, please unmute yourself and fire away.

Okay, this is Hamid again. One thing here: you talked about your caching a little bit. How do you handle a brownout, that is, you lose the cache and now you have to go to the object store to get the data, and it takes a while to do it. Is your cache really persistent within the cluster?

Let me try to answer that. In the execution model we presented, we didn't cover a lot of caching; most of the time the data can be read directly from distributed storage, and this relies on how Google built its distributed storage system and network infrastructure. So most of the time, without any caching, the performance of the storage system is pretty good, and customers don't even see the difference. Caching is used in certain cases; for example, BigQuery supports result caching for certain workloads, and some other in-memory caching for data, but those are really specific cases, not the general use case.

Yeah, but for modern servers, Google even publishes numbers, actually. If you use a local NVMe, the difference between that and an NVMe on the other side of the network is like an order of magnitude. So it does make a big difference, going to your cache versus the object store.

You're right, but it depends on the type of workload you're talking about. Especially when you're talking about large-scale analytical workloads, you have a lot of downstream operations such as shuffle and repartitioning, and those are orders of magnitude more expensive than the cost of a network round trip to a remote NVMe or remote disk. So in most practical use cases, what matters in query execution is the query plan and the amount of data that's transferred over the network, not necessarily the first round trip to storage. But you have a good point; there are obviously a lot of those cases, and there are various techniques that are used in certain places within BigQuery as well.

So, this is Lin, and I have a slightly higher-level question. I think it's great to have the flexibility to adjust things dynamically based on data sizes and so on.
I'm just curious whether you have considered remembering some of the choices you made on a particular table, for a particular query or subquery, so that the next time you see something similar you don't need to do the adjustment dynamically again. Have you considered that, or what are your thoughts on it?

Alex, do you want to take that?

It's definitely an interesting idea, but for the sake of this discussion we're probably not going to cover that one. I can definitely see where you're coming from; it could be an interesting discussion in the future.

Okay, sure, thank you.

All right, a question from Ryan.

Hi. I noticed that throughout this presentation you mostly focused on a fixed number of slots. Are you investigating allowing projects to use more slots dynamically? I don't really know what a slot means precisely, so maybe that doesn't make sense as a question, but I'll just let you run with it.

Let me try to answer that. A slot, when we talk about it, is really, you can think of it as a unit of CPU, a virtual CPU unit, really. And to clarify, queries can dynamically use however many slots they need, up to a certain limit, depending on the resource provisioning or resource commitment for the customer project, for example if there's a reservation. So nothing is really fixed here; things change very dynamically in terms of how many slots are being used by a query. If your question is more about provisioning for a project or a reservation, when, for example, a customer comes in and purchases a certain number of slots and pays for them and wants to dynamically adjust that, that's a different thing; we have multiple ways to dynamically adjust the size of a reservation. But again, at the query level things are very dynamic, and the number of slots being used changes very rapidly during execution.

Thanks.

Okay, anybody else? Then I will ask my question. Well, I posed one question earlier, but you sent it to me as a direct message instead of to everyone else. Oh, I see, go for it. Yeah, so, regarding your shuffle: you mentioned that you persist the shuffle so that if there is a node failure you can pick it up from there and continue, like what MapReduce does in this situation, but that is very expensive. Do you persist it in your local cache, or do you send it all the way to the object store?

As Alex mentioned, we have an in-memory shuffle layer that can persist it in memory, but with certain durability guarantees. There is obviously always some risk that you lose data that's in memory, even if you replicate it and do things like that. But the thing here is that if you reduce the chance of that, then most of the time you don't have to redo the work. That's a good question, though; I think this is really the key here, being able to do in-memory shuffle; otherwise the cost of query execution would be very, very high.

Okay. So my first question is: the Dremel paper was 2010, so that's 11 years ago. Given that BigQuery is now a product and people outside of Google are able to use it, are there any assumptions made in that paper, in terms of the system design, that are no longer correct, or that have been invalidated, or that you've changed over time?

I think that's a very good question. Recently, as you may know, Dremel won the Test of Time Award at VLDB after ten years, and there's a short paper that we published that answers exactly that
question: what are the things we assumed at the time, and which of them actually no longer hold. I think it's very interesting to go through that. Yes, there were certain assumptions made about how we handle the data, for example, and how we handle execution, and a lot of those changed, but some of it is still very valid. I think Dremel and BigQuery pioneered that area and continue to do so. So I'm going to refer you to that paper, and I can send you the link.

Sorry, that was a softball; I was teeing you up. But actually, another question I have is this: as you said, Dremel was very influential, and there are a bunch of open source clones, Drill and, I guess, Impala. Are there any parts of those systems, any features that those open source systems built, where you looked at them from BigQuery and thought, that's a really good idea, and ended up implementing them as well? And do you still see BigQuery as the state of the art, farther ahead than any of the open source systems?

I'll start on this one; Alex, please feel free to add your thoughts as well. There have definitely been a lot of very interesting developments in the open source community. I don't know that BigQuery can claim it always stays ahead in terms of being the state of the art. I think some of the innovations in the storage data layout formats that happened for columnar processing were very interesting. Things evolved significantly; again, if we go back ten years, there have been multiple iterations of how we store data in BigQuery, and some of that was actually influenced by how things evolved in open source. And then we obviously added things that we felt were important in terms of query processing and performance, storage performance. Alex, do you have any thoughts or examples you want to add?

No, I think you captured it perfectly. Yeah, BigQuery essentially changed quite significantly in the last ten years. All the dynamic execution, the query plan adjustments, most of that was not part of the original Dremel, but we definitely watched the progress being made elsewhere. And it looks like we still have quite a lot of opportunities in analytical data processing to make things better, because that's where a lot of money goes, and there's always something else to do.

All right, guys. Hossein, Alex, we'll clap on behalf of everyone else. Thank you so much for spending the afternoon with us.