So this is actually the last class with real technical content, because Wednesday is the Thanksgiving holiday, then we have the guest lecture next Monday, and then the final review. So this is the last topic we're going to discuss in this course; hopefully you've been enjoying it so far. There's quite a bit of administrative stuff for today's class. First, the last homework will be released today and is due on December 2nd. Probably more importantly, Project 4, the last project, is due on Sunday, December 5th. I think the TAs have posted the recitation time on Piazza; I believe that's tomorrow. Also very importantly, next Monday we'll have a guest lecture from Google BigQuery. For that lecture we are going to require attendance, because Google is a sponsor of this class and has supported much of the course development, so we need everyone to participate in the guest lecture. I'll post more details about how we count attendance, but that's next week. Then the final exam will be on Friday, December 10th, at 8:30 in the morning. We didn't get assigned this room again, so we'll be in Doherty Hall 2010, which is a different room, so please keep that in mind. Similar to the midterm, we'll be using something like a Scantron: essentially we'll scan all your exams into Gradescope and grade them with the assistance of its auto-grading tools. So you should bring a pencil and an eraser to make sure the only marks on your sheet are the answers you actually want to select. If you don't have a pencil and you mark an answer, then later try to cancel it and select another one, it's very easy for that question to get graded incorrectly. So please remember to bring a pencil and an eraser for the exam, and again, it's in a different classroom. For the upcoming talks: right after this class we have Dremio coming to talk about their query optimization techniques. And then on Tuesday (for some reason the display isn't rendering correctly on the screen), the same person who gives Monday's guest lecture will give a PDL talk about more research-oriented aspects of Google BigQuery. If you're interested, you can attend, but we don't require attendance for that one, even though it's the same speaker. Now, a little bit of recap about last class before we jump into today's content. Last class we talked about how to handle a distributed transactional database and deal with distributed transactions. In particular, we talked about different commit protocols: the more commonly used two-phase commit, as well as Paxos or Raft, which you can consider when you expect more failures in your cluster. We also talked about different design decisions for replicating your data, synchronous versus asynchronous, and so on, and especially about the concepts of strong consistency versus eventual consistency.
For strong consistency, you have to make sure that all your replicas have received and applied the change to a particular record before you go back to the client and say the commit succeeded, so that any other user reading from any replica will see the same value. For eventual consistency, you only need to make sure the change exists on the primary (master) copy of the data; then you can tell the client the commit succeeded, and the change will eventually propagate to the other replicas. During that window, other users may read a stale, inconsistent version of the data, which is why it's called eventual consistency. We also put these concepts into the context of the CAP theorem: of the three properties, consistency, availability, and partition tolerance, there's no way to satisfy all three at the same time in a distributed system; at most you can satisfy two of them. Lastly, we mentioned federated databases: if your organization already has different applications, built by different teams at different times on top of different databases, you can use a middleware-style technique to glue those different database systems together as if they were a single distributed database, so you get easier access to and processing of your data. That's all for last class. Before I jump into today's lecture, I'm not sure whether you've seen this picture from Andy's earlier lectures, but the typical setup for a modern data application is this two-sided architecture. You have many OLTP databases, like the ones I discussed last class, to handle modifications to the data. For example, if you're Amazon, you have phone apps where users order products, sellers list their products, and other applications on laptops or websites, all generating data at high speed through short transactional accesses and modifications. Then, when you want to perform analytics on that data, to figure out whether there are interesting trends or knowledge you can extract to help your business, you typically first put it all in a single location: a unified place that stores all the data, so you can analyze it, join different data sources together, and extract knowledge from it. In the middle there's usually some software to extract, transform, and load the data out of those OLTP databases and into this central location, the OLAP database, for analytical purposes. That's the typical data pipeline you'll see in modern applications. One question is why there are separate OLTP and OLAP sides. As you'll see in today's lecture, the two sides have different properties and different optimization goals.
For example, the OLTP databases, like what we discussed last class, focus on making sure distributed transactions are executed correctly and efficiently, so you can modify data and commit records quickly. On the other side, the OLAP database, which is short for online analytical processing, mostly runs read-only queries that scan through lots and lots of data and perform sometimes complicated operations to extract knowledge and intelligence from it. That's a very different type of query, so you potentially need a different type of system with different optimizations, which is why these two types of systems coexist in a data application. These OLAP systems are sometimes also called decision support systems or data warehouses, and nowadays you also hear the name data lake; someone may argue a data lake contains more data and more data formats, but essentially they're very similar things. They all try to help the organization analyze a large amount of data and generate the kind of intelligence that helps the organization make decisions and run its business. Before we talk about implementation details, there are generally two models for organizing the data in this type of online analytical processing system. Even though this class focuses on database implementation, join algorithms, and the trade-offs between different design decisions, in order to understand some of the design decisions made for these analytical processing systems, we also want to talk a little bit about how you organize the data in the analytical system after you extract it from those OLTP databases. There are different choices there, and they impact how the system is organized. The two common paradigms are, first, the star schema, and second, the snowflake schema. In the star schema, the way you organize the data coming from all those different sources into the central location is that you typically have a single fact table and then lots of dimension tables that are only one level out from the fact table. One quick thing to note here: in theory, you could take whatever schema your online transaction processing database uses and copy it directly onto the OLAP database. That's possible, but in practice it's often beneficial to reorganize your data a little bit, to reshape it into a unified form so the system can query it efficiently. Just a quick comment. Back to the star schema: again, usually one single fact table and lots of dimension tables, but we only allow those dimension tables to be one layer out from the fact table, like a single sun in the center with lots of stars around it.
The fact table usually records the events generated in your system. Again, think of an example where you're running the sales database for Amazon. In that case, the fact table would contain all the sales: one row for every single purchase made on your website. This sales fact table would only have very basic information plus foreign key references to the dimension tables outside. For example, the sales table here would have foreign keys to the product, the location, and the customer, as well as basic information like the price and the quantity. Only in the dimension tables do you have the detailed information; for example, for the product involved in each sale, you may have the category of the product, the product name, the product code, and so on. One benefit of the star schema approach is that the joins are relatively simple. There's usually just one central join predicate connecting sales against product, or sales against location, so the joins are typically just one level deep. But because of this restriction, there can be redundancy in the dimension tables. For example, looking at the product table, suppose there are only five or six categories in the entire system, each with a category name, a category description, and so on. If you only allow one layer of dimension tables, then every row in the product dimension table has to copy that category information, even though at the end of the day you may only have five or six categories. Does that make sense? Okay. The alternative is called the snowflake schema. Again, you have a central fact table recording all the events generated in your system, but you no longer limit how many levels of dimension tables you have, so you can normalize the data as much as you want. In this case, for example, in the product table you may recognize that you have thousands or tens of thousands of products but only five or six categories, so you can normalize that data, extract the category information into another table, and reduce the redundancy. Of course you'll store less data this way, but the potential issue is that you may need to perform joins across more tables, so the join queries may be more complex. By the way, the snowflake schema is where the name of the famous database system Snowflake comes from; that's the system that went to IPO recently, I think last year. And obviously the Snowflake system supports this snowflake schema and has lots of optimizations for it. Below is a small sketch of what the two layouts might look like for this example.
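To make the contrast concrete, here is a minimal sketch of the two layouts in SQL. All table and column names (sales_fact, product_dim, category_dim, and so on) are hypothetical, just following the Amazon sales example from the lecture:

```sql
-- Star schema: one fact table, one-level dimension tables;
-- the category info is duplicated in every product row (denormalized).
CREATE TABLE product_dim (
    product_id    INT PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(50),   -- repeated for every product
    category_desc VARCHAR(200)   -- repeated for every product
);

CREATE TABLE sales_fact (
    sale_id     INT PRIMARY KEY,
    product_id  INT REFERENCES product_dim(product_id),
    customer_id INT,             -- would reference a customer_dim table
    location_id INT,             -- would reference a location_dim table
    price       DECIMAL(10, 2),
    quantity    INT
);

-- Snowflake schema: normalize the category out of the product dimension,
-- so each category is stored exactly once.
CREATE TABLE category_dim (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50),
    category_desc VARCHAR(200)
);

CREATE TABLE product_dim_normalized (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_id  INT REFERENCES category_dim(category_id)
);
```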
To reiterate the differences between the star schema and the snowflake schema: in the snowflake schema you normalize the data to reduce redundancy as much as possible, but the trade-off is that you potentially have to perform joins across many more tables, so the queries may be more complex. On the contrary, with the star schema the joins are simpler, but because you denormalize the data, you not only potentially have data redundancy, it can also be difficult to maintain integrity. Going back to the earlier example with the product dimension table: if you have lots and lots of copies of the category information in your system, you have to ensure that every category with the same name always has the same description. With the snowflake schema, where you normalize the data, that property is easy to guarantee because each category has only one entry in the category lookup table; but with the star schema you have to enforce it yourself, and it's easy to end up with inconsistent data. Make sense? Any questions? Okay. Now let's talk about the specific design issues for a distributed analytical database. I want to motivate the discussion with a simple example. Say you have a distributed database with four partitions of the data, and a simple join query: just join two tables and, in this case, retrieve all the matching rows from those two tables. Assume the query lands on the first partition, but it needs data from all the other partitions. The most naive approach, since the query needs all the data from both tables in order to join them, is to copy all the data to the first partition and perform the join there. But obviously that's pretty costly: you have to move lots of data around, and depending on whether there's enough memory on that partition, you may also need to perform an external (out-of-core) join, which can be very expensive. It would be correct, but the question is whether we can do better than this naive approach. Specifically, we'll cover four issues in today's lecture. First, what are the execution models for running queries in a distributed analytical database? Second, how does query planning change in this scenario? One important thing to consider for a distributed analytical query is the cost of the network: sending data over the network can be very expensive, so you need to incorporate that into your cost model. Third, we'll talk about a few algorithms for performing joins in a distributed environment. And lastly, not specifically related to analytical processing, we'll conclude with a few examples of modern cloud systems to finish the course.
First, the two fundamentally different models for executing a query, especially an analytical query, in a distributed system are: either you push the query to the data, or you pull the data to the query. In the first model, pushing the query to the data, you send either the entire query or a portion of it to the location where the data lives, and you try to perform as much processing as possible there, filtering, early processing, and so on, before sending the rest of the data to some centralized location to aggregate the results. The other approach is to pull the data to the query: if the query lands on a particular machine but the data isn't there, you fetch the data to that machine and perform the computation at the location that received the query. I should say that whether you use a shared-nothing or shared-disk architecture is independent of which model you use. It's just easier for me to illustrate pushing the query to the data with a shared-nothing example and the other model with a different architecture, but the execution model is independent of the architecture. So here, assume a shared-nothing architecture. The first node has the records of relation R with IDs 1 to 100, and the second node (the second partition) has the records of both R and S with IDs 101 to 200. For this simple example, SELECT * over the join of these two tables, assume the query lands on the first partition. The first partition only has data for IDs 1 to 100, so it sends a portion of the query, the part of the join covering IDs 101 to 200, to the second node. The second node receives this partial query, performs its part of the join, and sends the results back to the first node, which aggregates everything and returns the result to the client. Make sense? The fragment sent to the second node might look something like the sketch below.
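As a rough illustration, assuming R and S are range-partitioned on id (the table and column names are just placeholders from the example above):

```sql
-- Original query received by node 1:
SELECT * FROM R JOIN S ON R.id = S.id;

-- Query fragment that node 1 pushes to node 2, which owns IDs 101-200;
-- node 2 joins its local data and ships only the (hopefully smaller) result back:
SELECT * FROM R JOIN S ON R.id = S.id
WHERE R.id BETWEEN 101 AND 200;
```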
In the second example, pulling the data to the query, assume we have a shared-disk architecture, so naturally the compute nodes don't hold any data themselves; the data all lives in the centralized storage. Again the same query, the join on R and S. The first compute node has the logical partition responsible for computing the results for IDs 1 to 100, and the second node has the logical partition for IDs 101 to 200, but neither node has any data. So in the pull-data-to-the-query model, these two compute nodes send requests to the centralized shared storage, the shared disk sends the data to the compute nodes, they do the computation, and finally, say, the second node sends its result back to the first node, which aggregates the results and sends them back to the client. The main idea is that the compute node pulls the data out of the shared centralized location and performs the computation wherever the query was assigned. Makes sense? Okay. One thing to note here, and I have a related slide later, is that this can be wasteful. The query here is simple, but say there's a filter on it with some WHERE clause. It can be wasteful to pull all of the data out of the shared storage to the compute node if a WHERE clause would immediately filter most of it out based on some attribute values. So some cloud systems and some modern shared-disk systems support filtering functionality at the centralized shared storage layer, and Amazon S3 is a famous example. Many of the distributed databases Amazon offers in their cloud are no longer a clean pull-data-to-the-query or push-query-to-the-data design; they do a little bit of both. It's a shared-disk architecture, but the storage layer supports some simple functionality: for example, if you have a simple WHERE clause filtering on, I don't know, an age or some value, the storage layer can perform that simple filtering itself and then send only the remaining data to the compute nodes for the rest of the computation. So in that case it's not that clean anymore; it does a little bit of both and takes the benefit of both worlds, if you will. A rough sketch of the idea follows.
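Here is a hedged sketch of that hybrid idea. The age column is hypothetical, and this is only meant to show which part of the work could move to the storage layer, not the exact API of a service like S3 Select:

```sql
-- Full query the compute node has to answer:
SELECT R.id, R.val
FROM R JOIN S ON R.id = S.id
WHERE S.age > 30;

-- The simple scan-and-filter portion that a storage-side filtering service
-- could evaluate before shipping data, so only S tuples with age > 30
-- ever cross the network to the compute nodes:
SELECT id, age FROM S WHERE age > 30;
```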
Next, one thing to note is that while a distributed analytical database is performing these operations, no matter whether it's pulling data or pushing the query, it usually keeps a local copy of the data it receives from other nodes, whether from a shared disk or from other partitions, in order to perform the computation. That data usually goes into the buffer pool, and if the buffer pool or memory isn't big enough, it gets spilled to disk. But typically the system doesn't persist that data: those local copies or caches are specific to this computation. You may have a cache eviction policy that retains the cache a little longer, but at the end of the day, those local machines don't really need to keep the data on disk. So one question is: what happens if the query execution fails in the middle? We talked about this a little when discussing distributed transactional databases. There we had lots of mechanisms to make sure that if a node fails, the data is still correct: two-phase commit, making sure a majority of nodes commit, how to abort transactions, and so on. We discussed lots of details to make distributed transactions correct and efficient. But distributed analytical processing is a little different, because most analytical queries, as I mentioned last class and earlier today, are read-only. Most of the time they're just scanning through lots and lots of data and maybe performing some complicated operations to extract knowledge, not updating or inserting new data. So for most analytical queries we don't actually need to worry about correctness when a query fails. The issue that does matter is that analytical queries are typically long-running, because you may have to scan billions of records to extract knowledge. By long-running I mean that in many practical cases these queries can take hours. So if an analytical query fails in a distributed environment, even though you don't need to worry about correctness, you lose a lot of work; it seems wasteful and inefficient for a distributed analytical query to fail and throw away all that work. What do people do in practice? In practice, people usually do not try to save intermediate results and recover the query from the failure, even though the query may take a long time. If one node of the distributed analytical query fails and the query is read-only, people just abort it and execute the query again.
This is mostly for simplicity and for the efficiency of query execution, because if you want to materialize intermediate results and recover the query from a failure, that can be costly: you'd have to write them out to disk. Also, like we discussed last class, if the particular node that failed happens to be the leader or primary node responsible for aggregating the results, you need to do leader election to find a new primary node to take over that responsibility. All of those mechanisms are potentially costly, so in practice people usually don't try to log intermediate results and recover from them. That said, there are definitely some systems that try to do this, but usually only in a simple, lightweight variant: rather than materializing all the intermediate results, they log some metadata about the query's progress, so that if a node fails and you later spin up a new node to replace it, you only need to recompute the work done by the failed node and can keep the computation done by the other nodes. It's something lighter-weight that people sometimes do, but it's not that common. To quickly illustrate one potential approach: again take a similar example with a shared-disk architecture and two nodes. The query gets sent to the first node, and as in the earlier example, part of the query is sent to the second node to execute; both of them read data from (or push parts of the query down to) the shared disk, which doesn't really matter for this discussion. In a model where you want to log some intermediate results, after the second node finishes the computation of its portion of the join, it can write that partial join result back to the shared-disk storage. Say the second node later fails. The system can then realize that the computation for that node's portion of the join has already finished, and instead of recomputing it, the first node can go directly to the shared storage layer, fetch that result back, do the aggregation, and send it to the client. It's something that some systems do; I think Amazon Redshift supports a lightweight version of it, but usually people don't really go that route. Alright? Okay, any questions before we get to query planning? Okay, so: query planning. In a distributed system, which for the purpose of this course means a distributed analytical processing system, most of the query planning techniques we talked about in the single-node setting still apply. For example, predicate pushdown, early projection, and optimization of the join ordering mostly remain the same.
The biggest difference is the cost of sending data over the network. Essentially, in query optimization for a distributed analytical query, the optimizer needs to consider the location of the data and the network cost of sending data around. That's the main difference. As for the mechanism for sending different portions of the query, once you've optimized the query and have a query plan, the approaches for shipping the query fragments to different machines fall into two categories. In the first category, after you optimize a distributed query, you send physical operators, rather than SQL queries, to the different nodes in your cluster. Think of these as the plan nodes from the third project you did on query execution: the physical operators are essentially the plan nodes in the plan tree, and you send different portions of the plan tree to different machines, each responsible for executing over different data. In the second approach, after you finish query planning and generate a physical query plan, you regenerate SQL queries from that already-optimized plan, but you generate SQL queries specifically targeted at the different partitions of the data. The benefit is that you can then do a second round of query optimization on each partition, specifically optimizing that portion of the query for the data on that node. This second approach sounds interesting, but it's obviously more complicated, and it's not used by many systems. As far as I know, there are only two systems that use it: the first is SingleStore (previously named MemSQL), and the other is Vertica. Let me quickly illustrate the second approach, sending SQL queries. Again, assume you have this simple join query and three different partitions holding data with different ID ranges. With the second approach, you first optimize the query and get a query plan, but then you rewrite it, regenerating SQL queries from that plan targeted at the different partitions. In this case, the three queries you could generate keep the join but append a WHERE clause restricting each one to a particular partition's range of values, and you send each query to its partition, where it can be re-optimized, roughly as sketched below.
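A rough sketch of what the regenerated per-partition queries could look like (the ID ranges and partition assignments here are hypothetical, just following the example):

```sql
-- Original query, optimized once at the coordinator:
SELECT * FROM R JOIN S ON R.id = S.id;

-- Regenerated SQL, one query per partition, each with a WHERE clause
-- appended for that partition's key range:
SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 1   AND 100;  -- sent to partition 1
SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 101 AND 200;  -- sent to partition 2
SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 201 AND 300;  -- sent to partition 3
```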
The reason some systems do it this way is that when you optimize a query, you rely heavily on statistics and metadata about the distribution of the data, and those statistics may be outdated: the node doing the planning doesn't necessarily have the most up-to-date information about data it doesn't own. So the first time you optimize the query, you may not produce the best plan for execution on each node, given that the data isn't local to you and your statistics may be stale. When you regenerate those queries and send them to the different machines, each machine gets the opportunity to re-optimize its query using the information directly available on its local disk and the most up-to-date statistics. For example, the data may be organized differently on different machines: on one machine the data may be sorted and on another it may not be. The machine where the data is sorted might perform a sort-merge join, which could be more efficient, while another machine might just perform a hash join. This way, each machine can optimize its own portion. It's pretty interesting, but a little complicated, so in most cases, most of the systems I know of just use the simpler approach: optimize the query in a central location and send the different subplans, the subtrees of operators, to the different machines for execution. Any questions? Cool. Of course, after the execution finishes, the results still need to be sent back to a central location to be aggregated and returned to the client. Okay. So next, how do we actually perform a distributed join? The most naive approach, like I mentioned in the earlier example, is to gather all the data from the different partitions onto a single machine and perform the join in one location. The disadvantage is that it's very costly to send that much data over the network, and if the machine doesn't have enough memory, you have to spill data to disk. What we really want in a distributed analytical system is to execute different portions of the query on different partitions, as close as possible to where the data is located. As I hinted in the earlier examples, one important property we need to ensure when executing different portions of a join query on different partitions is that tuples from the two relations with the same join key end up on the same machine. Otherwise, if you're joining two relations R and S on the attribute id, and relation R has a tuple with id = 5 on one machine while relation S has a tuple with id = 5 on a different machine, then if the machines don't exchange that data and each executes its portion independently, you miss the match between those two tuples and the query result is wrong. We need to make sure that never happens: tuples with the same join key must always be brought together when we perform a distributed join.
That's the property we need to ensure in a join algorithm for a distributed analytical processing engine. Depending on where the data is located and where you try to execute the query, there are different scenarios involving different amounts of data movement in a distributed join, and that's what I'll walk through next. The scenarios range from the best case to the worst case, depending on how much data has to be sent between machines. One of the better scenarios is when one table is partitioned across the machines, but the other table is replicated on all of them. That's actually not uncommon, especially with the star schema we discussed earlier: if a dimension table is small, for example a table storing location information like the postal codes of different cities, it isn't big and doesn't get updated very often. In many practical deployments, people just replicate that postal-code table on all the machines. In that case, no matter how the other table is partitioned, you can always perform the join directly on each node, because the smaller table is replicated everywhere. For example, take a similar simple join query, but assume relation R is partitioned on id across machines and relation S is small and replicated on all machines. Say the left machine has R's tuples with IDs 1 to 100 and the right machine has IDs 101 to 200. Because S is replicated, every machine can find the matching tuples locally, whatever IDs its R tuples have. So you can perform the join in parallel on the two machines, then aggregate the join results on one of them and send the result back to the client. You don't need any data movement; at least you don't need to move the original tuple values of either relation. A small sketch of this case follows.
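For instance, a minimal sketch of this replicated-table case, using the same hypothetical R and S from the example:

```sql
-- S (say, a small postal-code dimension table) is replicated on every node,
-- while R is range-partitioned on id. Each node runs the same join over its
-- local slice of R (node 1 sees ids 1-100, node 2 sees ids 101-200):
SELECT R.id, R.val, S.val
FROM R JOIN S ON R.id = S.id;

-- The coordinator then just concatenates (UNION ALL) the per-node results;
-- no tuples of R or S ever have to move between nodes.
```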
The next scenario, also one of the better cases, is when both tables are pretty big, so you don't have the luxury of replicating either of them on all machines, but both relations happen to be partitioned on the same key, which is also the join key. In that case, the data residing on a particular node contains all the tuples from both relations that fall into the same range of the join key, so it's also very straightforward: you can perform the join locally on each machine without moving the original tuple values, and then aggregate the results. In this simple example, assuming the left machine has all the values from both R and S in the range 1 to 100, and the right machine has 101 to 200, you just perform the join separately, aggregate the results on one machine, and send them back to the client. This may seem a little idealized, but it's actually not that uncommon in practice either, because in many applications you roughly know what types of queries users will issue; you can look at the query history and design the partitioning scheme of your distributed database to benefit the common queries in the workload. If you notice there are many join queries on the same key, you can partition on that key. Makes sense? Okay. The next scenario is not so ideal; not the worst case yet, but not so ideal. Assume there's no table small enough to replicate across all the machines, and you're not lucky enough to have both tables partitioned the same way either. One table is partitioned on the join key, in this example relation R, partitioned on id, but the other relation, S, is unfortunately partitioned on a different attribute, for example value. R is partitioned the way we want, but for S, even though you know it's partitioned on the value attribute, with values 1 to 50 on the left and the rest on the right, you don't know what the IDs of its tuples are. An ID could be 1, could be 100, could be 200; you just don't know. So you can't naively perform the join independently on the two machines. What you have to do is copy the data of the relation that is not partitioned on the join key, S, to each node: copy S's tuples from the left machine to the right to join against IDs 101 to 200, and similarly copy S's tuples from the right machine to the left, because you don't know which IDs of S live on which machine. So you end up with a copy of S on both machines, join it against the local partition of R on each machine, and finally send the results back to the client.
Yes, please? Oh, that's a good question. In this example, it's because S is partitioned on something unrelated to the join key that you have to copy its data in both directions. But you're right that in some cases, if you recognize that one relation is particularly small, even if one relation is partitioned on the join key and the other is not, you can choose to copy the small one instead; for example, you could copy R here. The example I showed is specifically the case where one table is partitioned on an unrelated key. By the way, I forgot to mention that this third scenario is called broadcasting, or a broadcast join; that's the standard term for this type of distributed join, because it broadcasts one relation to all the other partitions. Then the last scenario, the worst case, is when neither relation is partitioned on the join key. In that case you have to do a reshuffle, or in the more common terminology, a shuffle of the data, a shuffle join; essentially you have to rearrange, or repartition, the data across the machines. In this example, the two relations are joined on id, but neither of them is partitioned on id. So you have to reshuffle both relations: for relation R, you move the tuples with IDs 101 to 200 to one machine and 1 to 100 to the other, and you do the same thing for relation S, moving all its records with IDs 101 to 200 to one machine and the rest to the other. Then you're back in the second scenario, where the data is partitioned on the join key, so you can perform the join on each partition and finally aggregate the results. Makes sense? Yes? Oh yeah, in this case what moves would just be the entries with the relevant IDs, sorry, yes; I didn't really show that here. The sketch below illustrates the repartitioning idea.
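A hedged illustration of the shuffle-join idea with two nodes, using MOD on the join key as a stand-in for a hash function; the actual hashing and data-movement machinery is internal to the system, and this only shows the logical steps:

```sql
-- Step 1: each node tags its local tuples with a destination node,
-- e.g., by hashing the join key (MOD is a stand-in hash here):
SELECT id, val, MOD(id, 2) AS target_node FROM R;
SELECT id, val, MOD(id, 2) AS target_node FROM S;

-- Step 2: tuples are shipped to their target node, so every id now has
-- all of its R tuples and all of its S tuples on the same machine.

-- Step 3: each node runs the original join on its repartitioned slice:
SELECT R.id, R.val, S.val
FROM R JOIN S ON R.id = S.id;

-- Step 4: the coordinator unions the per-node results and returns them to the client.
```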
One thing I want to note is that there's a special case of these join queries called a semi-join. This refers to a query where you're performing a join, but after the join you only need data from one side: you don't need any attribute values from the tuples of the other relation. This isn't part of the SQL standard; it's just a category of query with a potential optimization. For this type of query, there's an opportunity to avoid sending all the attributes of one relation when you have to shuffle the data, which reduces the network overhead, because the query only needs data from one relation. Here's a simple example: you join tables R and S, but at the end of the day, after the join, you only need some attributes from relation R; for simplicity, just the ID from R. Again assume, for simplicity, that R and S are on different machines. To perform this join you have to move data around; say you need to move tuples from S over to where R is. Instead of moving the entire tuples, you only need to move the join-key values (the IDs) of the tuples in relation S to the other machine. Similarly, if you have to reshuffle data from the left machine to the right machine, you only need to move the IDs of those records, because the join itself doesn't need any computation over the other attribute values of S. After you send over the IDs, you can directly check whether there's a match: if there's a match, you keep the ID; if not, you throw it away. And after you finish the join, you only need to send the IDs of the matching tuples back to the node on the left, where you can perform whatever remaining operations, projections or calculations, based on the values in relation R. Again, this works because the query doesn't need any actual values from the other table. That's the type of query called a semi-join, and systems can do the corresponding optimization for it. Some systems support explicit semi-join semantics, but since it isn't in the SQL standard, many systems don't; in that case, you can use an EXISTS clause to rewrite the query and get the same semi-join semantics. For example, the query above could be rewritten as: select all the IDs from R where there exists a tuple in S with a matching ID. That captures the semantics of the semi-join. Any questions on this? Okay. Yeah, in this case you just directly send the IDs over, and after you finish the semi-join you send the matching IDs back. The EXISTS form is sketched below.
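A small sketch of that rewrite, using the same hypothetical R and S; some systems also accept an explicit semi-join operator, but EXISTS is the portable form mentioned here:

```sql
-- Semi-join intent: return ids from R that have at least one match in S,
-- without needing any columns from S in the result.
SELECT R.id
FROM R
WHERE EXISTS (SELECT 1 FROM S WHERE S.id = R.id);
```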
So, the last topic. We've finished the high-level material; there are many more detailed optimizations in distributed analytical databases, but for the purpose of this class we focus on the high-level ideas and intuitions. Before I give you some examples of modern cloud databases, any remaining questions about distributed analytical systems? No? Okay, so the last topic: some examples of cloud systems to finish off this course. Like I mentioned, many modern distributed database systems are moving into the cloud to give users more flexibility and convenience in deploying their data applications. Many of these systems, as I mentioned before, generally use a shared-disk architecture, but as in the push/pull example earlier, some of them blur the line between shared disk and shared nothing. Even with a shared-disk architecture, you may still push parts of the query down to the shared-disk layer to perform simple filtering and so on, trying to be as efficient as possible. That's possible especially because many of these cloud database vendors control the whole stack: Amazon, for example, builds its distributed databases, like Aurora and Redshift, but also controls its storage service, S3, so it's convenient for them to extend S3, the shared disk, with additional functionality that works in collaboration with the distributed database layer to maximize performance. Generally, there are two types of cloud database systems you may have heard of; I don't know how familiar you are with these terms. The first type is called managed database systems. By the way, these are not strict textbook definitions; they're a general categorization, and which category a particular system belongs to is sometimes not perfectly clear. Managed database systems typically refer to database systems that were not specifically modified for the cloud environment: they keep their original design, whether single-node or distributed, built for on-premise deployment, and you just run them in the cloud. Instead of managing your own data center and keeping your own machines running, you let the cloud provider manage the machines in the cluster for you, but you're still running the same software, the same database, as you would in an on-premise setting. These are called managed database services, and they're offered by most cloud vendors. Amazon's, for example, is called Amazon RDS, the Amazon Relational Database Service, I think that's the full name, and it offers managed versions of Postgres, MySQL, and so on. Same software; the cloud vendor just runs the cluster for you. The second type is usually referred to as cloud-native database management systems. These are database systems designed from the ground up specifically for the cloud environment. They typically use a shared-disk architecture, because the cloud conveniently provides flexible shared-disk services such as Amazon S3, and they have optimizations designed specifically for this environment. These are the cloud-native database systems, and there are many examples: Snowflake is a very famous one.
Other examples are Google BigQuery, Amazon Redshift, and Microsoft SQL Azure. The latter two are interesting because they actually started as managed database services in the cloud. When Amazon first built Redshift, they used existing technology built for on-premise deployment; in the Redshift case, I think they bought an existing distributed analytical database company and put its system on top of the Amazon cloud. And in the SQL Azure example, Microsoft put their original SQL Server under the management of the Azure cloud. So initially both were managed database services. But because they run in the cloud, the vendor controls the software and upgrades are largely invisible to the users, so they kept updating, changing, and optimizing these systems, and eventually they became different enough from the original versions that nowadays we call them cloud-native database systems, even though they didn't start out that way. Another type of cloud database system, not to confuse you further, that you often see is called a serverless database system. Serverless doesn't literally mean there's no server; you obviously need a server to run any database system. It usually refers to a pricing model where you don't have a designated compute node for the database system. Instead, typically on top of a shared-disk architecture, the storage is always there, but compute nodes are only spun up when you execute queries against the system. As a user, you're only charged for the computational resources during the time you're actually executing queries on the compute nodes. If the system is idle and you aren't executing anything, you're only charged for storage, which is potentially much cheaper than CPU and memory. So you can save money when you don't need the system. It's called serverless because it's as if there's no server allocated specifically for you; you only get a server when you need to execute queries. In this example, say I have an application server and a compute node on the side. Right now I want to run some queries, so I interact with the system and get billed for the computational resources while I'm executing queries on this serverless database. Then, say, I go to sleep, I take a rest, and for a while I don't need to execute any queries against the database. With a designated node for your system, you'd still be charged even though the system is idle and you're not executing anything.
That always-on billing is what you get with the earlier managed database services, or more generally with the traditional provisioned pricing model, even for a cloud-native system. In the serverless world, things are different. Again, it is typically a shared-disk architecture; doing this with shared nothing would be very difficult, almost impossible. When you have queries to execute, just as in the earlier example, they go through the compute node, which fetches data from the shared disk and sends results back to the client. But now, say you go to sleep and take a rest: the system can choose to write all the contents of your buffer pool back to the shared storage and then simply shut the compute node down. You no longer need to reserve that node, so you are no longer charged for its CPU and memory, which are much more expensive than the storage; the storage is much cheaper. Later, when you come back and want to execute more queries, the system can read the buffer pool and page table back from the shared storage and repopulate them. So you do not start from a cold cache; you get back whatever was cached in your buffer pool the last time you logged off, and the system gives you the illusion that it has been running the whole time, even though no compute server was dedicated to it. That is what is called serverless: it is potentially much more flexible, and you can save money if your system has a lot of idle time or downtime. Many cloud systems support this, for example Amazon Aurora Serverless, Fauna, Microsoft SQL Azure Serverless, and Google BigQuery. The BigQuery folks are coming next Monday; I do not know whether they will talk about the serverless aspect, but many systems now support this newer model, and it is becoming more popular these days. All right, cool.
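To illustrate that pause-and-resume mechanism, here is a minimal sketch assuming a shared-disk design. The class and method names are hypothetical rather than any particular vendor's API, the buffer pool is reduced to a plain dictionary, and the shared storage is assumed to expose simple put/get operations.

```python
# Sketch of how a serverless engine might pause and resume compute on a
# shared-disk architecture. All names here are hypothetical, not a real API.

import pickle


class ServerlessComputeNode:
    def __init__(self, shared_storage):
        self.shared_storage = shared_storage   # assumed to offer put(key, bytes) / get(key)
        self.buffer_pool = {}                  # page_id -> cached page bytes (simplified)

    def pause(self):
        """Flush the warm cache to shared storage, then release the compute node."""
        self.shared_storage.put("buffer_pool_snapshot",
                                pickle.dumps(self.buffer_pool))
        self.buffer_pool.clear()               # node can now be shut down; compute billing stops

    def resume(self):
        """Spin compute back up and repopulate the cache so it is not cold."""
        snapshot = self.shared_storage.get("buffer_pool_snapshot")  # None if nothing saved
        if snapshot is not None:
            self.buffer_pool = pickle.loads(snapshot)
```

The user-visible effect is the illusion described above: queries see a warm cache when they return, even though no compute node existed in between.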
The next thing I want to mention is a more general discussion, without going too deep into technical details. Another interesting thing people do these days, rather than building every component of a giant system themselves, the catalog, the optimizer, the cluster management layer, is to leverage the many existing tools and services in the cloud to build what you might call a disaggregated distributed database system. This approach is especially common in the cloud. For the catalog, there are existing catalog management services in the different clouds: Google has its own data catalog service in Google Cloud, and Amazon has the AWS Glue Data Catalog. So instead of writing a customized in-house catalog service, just as an Amazon-based distributed database can use S3 as its shared-disk layer, it can also use the Glue Data Catalog for its metadata. Similarly, for node management, the machinery around your commit protocols, replication protocols, and so on, you can use Kubernetes or Apache YARN; there are many existing services there as well. And for query optimization, there are even open-source query optimizers you can use as a service, such as Greenplum's Orca or Apache Calcite.

There are essentially two benefits to building on these services. The first is that it is simpler: it is easier to build a database system, and especially a distributed database system, by leveraging all these existing cloud services. The second is that your system may be able to share components with other systems. Take the Amazon S3 example again: if you build your distributed database with S3 as the shared disk, then not only is the distributed system simpler to write, but the data stored on S3 is also exposed to other services. It is not only your distributed database that can use the data on S3; if some other analytical tool, or even a machine learning application, needs to read the data, it does not have to go through your database engine and can potentially read it from the shared storage directly. So this is another strategy for building a system. Of course, with these general-purpose services you may not be able to optimize the system as aggressively as you would with fully customized software, so that is the disadvantage, but it is one approach you can take.
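To make this concrete, here is a minimal sketch of what leaning on an external catalog service can look like, using the AWS Glue Data Catalog through boto3. The database and table names are hypothetical, and a real engine would of course do much more with the metadata than print it.

```python
# Minimal sketch: use a shared cloud catalog (AWS Glue Data Catalog) instead of
# an in-house catalog. The "sales_db" / "orders" names are hypothetical.

import boto3

glue = boto3.client("glue")

# Ask the shared catalog where the table lives and what its schema is.
table = glue.get_table(DatabaseName="sales_db", Name="orders")["Table"]

location = table["StorageDescriptor"]["Location"]   # e.g. an s3:// path
columns = [(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]]

print(f"data lives at: {location}")
for name, col_type in columns:
    print(f"  {name}: {col_type}")

# Any engine that talks to the same catalog -- your own query engine, Spark,
# an ML pipeline -- sees the same table definition pointing at the same data.
```

That last point is the component-sharing benefit: the catalog entry and the underlying S3 data are visible to every service in the account, not just to your database.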
Lastly, similar to these universal or general-purpose services for building a distributed database, there are also many universal data formats you can choose from nowadays, to share data and let the system support more applications. Traditionally, just as a database system has its own customized catalog, query optimizer, and so on, it usually has its own customized data format as well. For example, in your first project in BusTub you built a customized page layout that is specific to BusTub; if you then want to move data between different databases, or export it to some machine learning application, you first have to transform it into a CSV file, a JSON file, and so on, and that transformation can be costly. So one thing some people do nowadays is, instead of a highly customized internal format, adopt an open-source data format, of which there are many popular choices, as the data format of the system. It then becomes much easier to move data between databases and to support other applications on that data. Adopting such a format does not necessarily mean storing your data exactly as the open specification prescribes; there are different ways to go about it. One choice is to store the data directly in one of these open formats. The other choice is to keep your own internal format, but provide a mechanism to efficiently translate it into an open format, which again makes it easier to move data between systems or support other applications.

There are many such open data formats nowadays. Probably the most popular is Apache Parquet, which came out of a collaboration between Twitter and Cloudera. The ORC format, from Apache Hive, is quite popular as well. There are similar formats from other places: CarbonData from Huawei, and Iceberg from Netflix. The HDF5 format is also interesting; it is not that commonly used in database systems, but it is widely used in the high-performance computing world as a multi-dimensional array format, typically for supercomputer simulations of physical problems. And lastly, the Arrow format has been getting more attention these days. It is an in-memory execution format that came from the original author of the pandas library, which is used very widely in the data science world, and it is now the open Apache Arrow format; many systems support it, again to make data movement much easier.

One last thing to mention: besides storing data directly in one of these formats, or providing an internal transformation mechanism, another choice is to tweak one of these formats a little. One of the systems we built at CMU, the NoisePage system, is based largely on the Arrow format but differs slightly; it is a tweak of Arrow that supports many of the relational database functionalities we want. Because our format stays very close to Arrow, besides supporting the regular functionality we can also transform our data to Arrow very easily, which makes data movement very efficient.
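As a small illustration of why these formats make data movement easy, here is a sketch using the pyarrow library to build an Arrow table in memory, write it out as Parquet, read it back, and hand it to pandas without any custom export step. The column names are made up.

```python
# Minimal sketch: open columnar formats (Arrow in memory, Parquet on disk)
# let different tools exchange data without a custom export path.

import pyarrow as pa
import pyarrow.parquet as pq

# Build an in-memory Arrow table (columnar layout).
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [9.99, 24.50, 3.25],
})

# Persist it as Parquet, an open on-disk columnar format.
pq.write_table(table, "orders.parquet")

# Any other engine or tool that speaks Parquet can read the same file back.
roundtrip = pq.read_table("orders.parquet")

# Converting between Arrow and pandas needs no database-specific export step.
df = roundtrip.to_pandas()
print(df)
```

A system whose internal layout is Arrow-like, as described above, can produce tables like this almost for free, which is exactly the data-movement benefit being claimed.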
So that is all the technical content I have for today. To step back: I gave you many examples of cloud analytical database vendors, and the way to think about them is this. When you are just starting an application or a small company, you may not need these heavyweight, high-end distributed analytical systems. But as your company or organization grows in size and you accumulate more and more data, you start to need analytics over large amounts of data, trying to extract intelligence and make informed decisions based on that data, which can potentially generate a lot of business value. It is usually only once your data is large enough that you resort to these high-end analytical processing engines, and typically you use them for business-critical decisions that carry significant potential financial benefit. So people pay a lot of money for these systems, store a lot of data in them, and organizations invest heavily in them. All right, that is all I have for today. Next class we will have the guest speaker from Google talking about BigQuery, which is exactly one of these cloud distributed database systems. And again, attendance is required for the guest lecture; I will post more details on Piazza. Thank you.