This is sort of the last lecture we have on distributed databases, but on Monday next week VoltDB is coming, and as I said last class, VoltDB is a distributed in-memory database, so a lot of the things we talked about in the last two lectures will come up in the VoltDB discussion of their architecture. Today is all about analytical systems, and VoltDB is not meant to be doing analytics, so some of the ideas we'll talk about today aren't applicable to VoltDB. It can still be used for some analytical operations, but not to the level that we're talking about here. So again, the two talks we have: as I said, the VoltDB guest lecture is in Monday's class, but if you want a more in-depth discussion about research at VoltDB, then come to the talk at 4:30. If anybody is still looking for an internship, I'll send an email out if you want to meet with them while they're here and see what you would actually be working on at VoltDB. I've had many students go there for internships and they say they enjoy it; just be mindful that it's in Boston, not San Francisco, so if all your friends are going to San Francisco and you go to VoltDB, you won't be going to San Francisco, okay? And then the Swarm64 talk is tomorrow. As I said, this is a startup out of Germany that built a query accelerator for analytical queries that runs on FPGAs. Instead of running everything on the CPU, they can push things down to the FPGA and do faster computation than you could otherwise do. So that talk is tomorrow in Gates, and we'll repeat that one, okay?

So last class we talked about distributed OLTP systems. We talked about consensus protocols, replication, consistency issues, and federated databases. I want to go over the consistency issues, the CAP theorem, in a little bit more detail, because I felt that was a bit rushed, and everyone's blank faces gave me the impression that it wasn't connecting what I was actually talking about and why the CAP theorem actually mattered. So I'm going to go over this real briefly again; stop me if you have questions, and then we'll jump into the material about analytical systems, okay?

So the CAP theorem, as I said, was a proposal from the professor Eric Brewer that basically said if you have a distributed system that you want to run transactions on, you can only have two out of three properties: consistency, availability under failures, and partition tolerance. As I said, you can't have all three; the middle ground is impossible, and it's been proven that it's impossible. The systems that are out there today from the NoSQL guys usually fall into the AP realm, so they want to be available and partition tolerant, whereas the transactional relational database management systems are usually CA, meaning if there's a disconnect between the nodes they'll stop running, and that's the k-safety stuff that we talked about before. This is to avoid a problem called split brain, where both sides think they're the current master, you start making updates, and you can't reconcile them later on. So here are three example scenarios to show you what I mean for each of these properties.
The first one was consistency. As I said, here we have a replicated database; we have two objects, A and B, that are on both nodes, and all the writes end up going to the master. So we want to set A here: we go ahead and apply our change, then we replicate it to our other machine here, and now if anybody comes along and wants to read this object A after we've told the outside world that we've committed this transaction successfully, they will get the correct value. So that's the property the database system can provide for you: it's always going to have strong consistency across all nodes for all data objects that are replicated, and if I read something that I know is from a committed transaction, I'll see the latest version, I'll see the correct version.

The availability property says that if a node goes down, then no matter what, the system will always still be available; we can always access the database. So in this case here, the database again is replicated on two nodes and we have a master. The replica goes down, and no matter what, I can still go read that data from the master, and if the other node wants to read this data it can get to the master and read it, even though in this scenario it's not local to it. So again, no matter what failures we have, the database is always available to us.

And then the last property was partition tolerance, and this says that if the network goes down, so that these two nodes can't communicate with each other, then we can still allow the system to make updates and make changes and we're always going to be in a consistent state. That's difficult to do. So now this guy here can't communicate with the other node, so he thinks the master went down; it's going to run Paxos or a leader election by itself, and then it's going to elect itself as the new master. So now when the two applications, the two instances of the application running on different nodes, start communicating with the database, they're both communicating with two separate nodes that both think they're the master, and they both make changes in different ways, and this is fine because each one thinks it's the master, so it says, I'm the master, I can apply this change for you. They send back acknowledgments saying the change was applied. At some later point the network comes back, and now we have to reconcile the difference between the two masters, because this one says A equals two and this one says A equals three, but that can't possibly be if we were actually truly consistent. So now you see why you can't have all three: I can't be network partition tolerant and be consistent and still available, because I would have this issue here.

So the NoSQL systems, in many cases, choose to be available and network partition tolerant, and they have mechanisms to allow them to reconcile differences between the nodes if you have a disconnect like this. Whereas the traditional transactional OLTP database systems want to always be consistent. So what they would do in this case here, if the network goes down and my k-safety factor is two, meaning I have to have two replicas for any object in my database in order for the system to stay online: at this point, both of these guys say, well, I can't communicate with the other node, therefore I assume it's down, and my k-safety factor is two but now I'm only at k equals one, because there's only one copy of the object. So the system will halt; it won't accept any new transactions, won't accept any changes, until the network comes back online and the nodes can hook back up to each other.
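To make that k-safety behavior concrete, here is a minimal sketch I'm adding in Python (not from VoltDB or any real system; the class names and cluster layout are made up for illustration) of a node that refuses writes once the number of reachable copies of a partition drops below k:

```python
# Hypothetical sketch of a k-safety check (CP behavior): halt instead of risking split brain.
class Partition:
    def __init__(self, pid, replica_nodes):
        self.pid = pid
        self.replica_nodes = replica_nodes   # nodes holding a copy of this partition

class Cluster:
    def __init__(self, k, partitions):
        self.k = k                  # k-safety factor: copies required to stay online
        self.partitions = partitions
        self.reachable = set()      # nodes we can currently talk to

    def heartbeat(self, node_id, alive):
        # Called by the failure detector when a node becomes (un)reachable.
        if alive:
            self.reachable.add(node_id)
        else:
            self.reachable.discard(node_id)

    def can_accept_writes(self, pid):
        # If fewer than k copies of the partition are reachable, reject new
        # transactions rather than letting two sides diverge.
        p = self.partitions[pid]
        live_copies = sum(1 for n in p.replica_nodes if n in self.reachable)
        return live_copies >= self.k

# Example: two nodes each hold a copy of partition 0, k = 2.
cluster = Cluster(k=2, partitions={0: Partition(0, ["node1", "node2"])})
cluster.heartbeat("node1", alive=True)
cluster.heartbeat("node2", alive=True)
print(cluster.can_accept_writes(0))      # True: both copies reachable
cluster.heartbeat("node2", alive=False)  # network partition / node failure
print(cluster.can_accept_writes(0))      # False: the system halts instead of diverging
```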
So is this part clear? That's the key difference between what the NoSQL guys are doing and what the traditional transactional OLTP systems are doing, okay? And again, as I said, anybody who says they defeated the CAP theorem doesn't know what they're talking about, okay?

So just to remind ourselves what we're talking about with OLAP systems: the last two lectures were about transactional systems, and in transactional workloads we're making really small changes to subsets of the database. The queries are very small, usually index lookups, and we're actually modifying the database. The analytical workloads that we're going to look at today are primarily read-only, and you're going to be reading large segments of the database, like complete table scans, to compute some complex join or aggregate across all this data. The example I always like to give is that in an OLTP setting, in an OLTP application, I can do things like an Amazon storefront: I can add things to my cart, make payments on my account. I can only do that to my account, though; I can't touch yours. So the queries that get invoked for my transactions only touch my data. But in analytical workloads, I want to do things like compute the average price of items purchased within some kind of time range; now I'm looking at all possible accounts, all possible customers.

So the way OLAP systems are usually set up, the traditional way, is a bifurcated environment like this. On your front end you're going to have your OLTP applications, or sorry, OLTP databases, and this is where you're getting all the new updates from the application, the transactions that are modifying the database; this is where you're ingesting new data from the outside world. And then you're going to have this giant back-end data warehouse, where you take all the data from the front-end guys and copy it over, and the idea here is that traditionally you don't want to run analytics on your front-end databases, because that's going to slow you down while you're running transactions, which is what you don't want to do. So there's usually a process called ETL, extract-transform-load, where you stream out the changes you get from the OLTP databases, do some kind of transformation on them, and then load them into your back-end data warehouse, and now you do all your analytical queries on this back-end system here.
So this transformation process is usually things like entity resolution or data cleaning. A classic example I always like to give is Zynga — I guess they're still around — the Farmville guys. They would buy all these little game startups, and each of these game startups they sort of left alone; whatever database they were running, they would just continue to run for their game. So those would be all the front-end OLTP databases. But then if Zynga wants to do analytics across all their customers and all their assets and all their games, they have to put everything into a back-end data warehouse, and they have to make sure there's a uniform schema, meaning if the column in one application is first name and in another it's F name, we need to recognize they're the same first-name concept and resolve them to the same schema. That's what you do in this transform process. Another thing, too, is say I have an account in one game where my first name is Andy and another account where my name is Andrew: it's the same person, right? You don't want two copies of a customer when they're actually the same thing. So that's what you do for data cleaning or entity resolution in the transformation process. This is the very common setup: you get everything into this giant data warehouse and you do all your analytics on it.

These back-end data warehouses are sometimes called decision support systems, and this is where your organization is going to look at all the data you've collected from your OLTP side and try to infer or extrapolate new information from it in order to make better decisions about the organization. The classic example I always like to use comes from Mike Stonebraker, where the canonical decision support system would be Walmart's back-end data warehouse, where they keep track of every single item anyone has ever bought at Walmart at every store. So say there's a hurricane coming: what Walmart will do is go look and say, all right, I know a hurricane is going to hit two weeks from now at this location in the US; what items were the most bought the one week before the hurricane and the one week after the hurricane? Then they make sure they go make purchases and route those products to those stores so that they're available before and after the hurricane. So that's an example of what a decision support system does, and again, to do this you're doing large analytical queries over large table segments, okay?

So in this environment there is often a distinction made between what is called a star schema and a snowflake schema. Again, if you go out in the real world and come across a back-end data warehouse, you're going to come across these terms; that's sort of why I'm going over this. Star schemas look like this, where you have a central fact table and then you have these spokes emanating out of it, like a star, that are called the dimension tables. The fact table, in the case of Walmart, would record every time an item was bought: you add another entry in your fact table whenever an event occurred in your organization or your application. So again, every time someone scans something at the checkout line, you add a new entry into the fact table, so this thing is always growing. The fact table, though, will have foreign key references to these dimension tables.
So the dimension tables are where you store information about the products that you sold, or the events that are occurring in your fact table. In this case here we have a foreign key reference to the product table — again, the fact table is every item anyone has ever bought — and then in our product dimension table we have information about the product name, the product description, and whatever category it belongs to. So in a star schema you have one fact table and then you have one level of dimension tables emanating out from it.

Contrast this with a snowflake schema, where you're allowed to have multiple dimension tables, or multiple levels, going beyond the fact table. So in this case here, we still have the fact table in the middle for all the sales items, but now in the product dimension table I've normalized out the category information about each product into its own separate lookup table. You can have these lookup tables going out beyond the dimension tables. So can anyone guess why one of these approaches would be better than another? It's sort of the obvious thing that we're going to be talking about for most of today. Yes?

So he said that if you want information about the category for a given product, you wouldn't have to go look it up through the product table — but what is that really called? A join, right? If I want a given product's category information, I don't have to join it with the category lookup table. So the main advantage you're going to get from a star schema — and if you read most of the books on how to design data warehouses at a logical level, they recommend doing star schemas — is that you're only doing two-way joins at most; you don't have to go several levels deep into your schema. And again, this looks a lot different from what we talked about for OLTP applications. In OLTP applications it was like a tree structure: I have a warehouse, a warehouse has districts, a district has customers, customers have orders, orders have order items. It's a hierarchy emanating out, so it's quite different. So they recommend star schemas because you have fewer joins and the joins are less complex; some database systems actually won't even let you do snowflake schemas, they restrict you to star schemas. It's just something to be mindful of.

So the two issues in deciding whether you want to use a star schema or a snowflake schema are these. In the first case, if you do a star schema, then you're essentially denormalizing tables: you're denormalizing the lookup tables into the dimension tables, and now all the same issues we talked about earlier — having redundant copies of the same pieces of data and having to keep them all in sync — arise here, because you're denormalizing the category into the product table, and now for each category name you're replicating it multiple times. The snowflake schema seems more natural to us as humans, but the issue is that now, because you have these branches from the fact table going out several levels deep, you're going to have more joins, and therefore the query is going to be more complex, and that becomes harder to execute and harder to do query planning on.
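To make the star-versus-snowflake difference concrete, here's a small sketch I'm adding using Python's built-in sqlite3 (the table and column names are made up, not from the lecture slides). The star version denormalizes the category name into the product dimension, so the question "total sales per category" is a two-way join; the snowflake version normalizes category into its own lookup table, so the same question needs a three-way join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Star schema: fact table + one level of dimension tables.
-- The category name is denormalized (repeated) inside the product dimension.
CREATE TABLE sales_fact   (sale_id INTEGER, product_id INTEGER, price REAL);
CREATE TABLE product_dim  (product_id INTEGER, name TEXT, category_name TEXT);

-- Snowflake schema: category is normalized out into its own lookup table,
-- so the product dimension only stores a foreign key to it.
CREATE TABLE product_dim2 (product_id INTEGER, name TEXT, category_id INTEGER);
CREATE TABLE category_dim (category_id INTEGER, category_name TEXT);
""")

# Star schema: total sales per category is a two-way join.
star_query = """
SELECT p.category_name, SUM(f.price)
  FROM sales_fact AS f JOIN product_dim AS p USING (product_id)
 GROUP BY p.category_name;
"""

# Snowflake schema: the same question needs a three-way join.
snowflake_query = """
SELECT c.category_name, SUM(f.price)
  FROM sales_fact AS f
  JOIN product_dim2 AS p USING (product_id)
  JOIN category_dim AS c USING (category_id)
 GROUP BY c.category_name;
"""

for q in (star_query, snowflake_query):
    print(cur.execute(q).fetchall())   # both print [] here (no rows loaded); the point is the join depth
```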
So for most of the things we'll be talking about today, it doesn't matter whether you're actually doing a snowflake schema or a star schema; just be mindful that when you go out into the real world you'll come across these things, and there are different trade-offs for each of them, okay?

All right, so what problem are we now trying to solve in a distributed OLAP system? The problem we have is that we want to run a join query; that's the most common thing, we're going to spend most of our time doing joins, and again that's why the star schema might be better, because you're going to minimize the number of joins you have to do and that will make the queries go faster. So we have some application that wants to issue a query here, just join R and S on some foreign key relationship. Just like before, in a transactional workload we would always have one node designated as the base partition or the home partition; we're going to have the same thing here. Our query is going to show up at one node, and it's going to be in charge of communicating and coordinating with the other nodes to get the query executed, and then we'll return the result back to the application.

So what's the dumbest thing we could do to execute this query, say this query here needs to touch data at all four partitions — what's the easiest, most naive thing to do? Yes, exactly: you just take the three other partitions you need and copy them to the home partition, then do the join and return the result back. Why is that stupid? Yes — all right, she said because I said it was stupid, which is correct. This sort of defeats the purpose of having a distributed database, right? The whole point of a distributed database was that we want to store really large databases and we want to get parallelism. So all we've done here is pay a bunch of money to ship the data over to this one node and then do the join; we could have just left the data there in the first place. In this case we're not getting any advantage from the additional computational resources, because these other guys are again just shipping data, and now we're limited to how fast that one node can go. So we need to look at other ways to solve this problem, where we actually distribute out portions of the query to the different resources in our distributed cluster and then combine those results and put them back together, to make the application think it ran on a single machine even though it didn't; then we get all the benefits of parallelism that we talked about before.

All right, so in today's class we're going to first talk about different execution models you can have for distributed analytical queries — what we'll talk about is actually germane to OLTP queries as well, but it's more of an issue in OLAP queries because we have to touch a lot of data, we're accessing a lot of data. Then we'll talk a little bit about some of the issues you have to deal with when doing query planning in this environment, and how you actually send out the portions of the query to the different machines. Then we'll spend most of our time talking about distributed join algorithms and the different scenarios you can have, and we'll finish off with a quick overview of what cloud systems look like today, okay?
All right, so the main dichotomy we have in execution models for distributed databases is this notion of push versus pull, and the way to think about this is in terms of where the query will execute relative to the data it needs to access.

The first approach is called pushing the query to the data. The idea here is that we have some query, which is small — it's a single query, so it can't be that big — and we want to transfer the execution of that query to whatever site or location the data it needs to process is located on. Think of it like a shared-nothing system: a shared-nothing system, we said, has its own CPU, its own memory, its own disk, so rather than having that shared-nothing node send its data to another node to process it for the query, I can just send the query to the data; it can process it locally and then send back just the result. The idea is that sending the query, and sending the result back, is cheaper than actually sending over the data you would otherwise have to ship to do the query.

Now contrast this with the pull-based model. This is where the query is going to run at some location, and I figure out all the data it needs to process, and then I start pulling that data, or streaming it, into wherever my query is running. Now, you may be thinking that this is clearly just the difference between shared disk and shared nothing; that's not always the case. In a shared-nothing system you could still do both of these approaches, and the same thing with shared disk, although there is always the copy out of the disk into one of the compute nodes.

So let's look at a high level at what this looks like. This is a shared-nothing system, partitioned on id, and our query shows up and wants to do a join. In a push-based system, rather than taking the data we need from this node down here and pulling it up to this node, we instead send a portion of the query — the portion of the execution that we know will touch the data residing on this node — down to it; it then does the computation and just sends back the result of that computation. So again, the amount of data we have to transfer in this environment is going to be much less if we're moving the query and the result around, rather than taking all of this data and shoving it up there.

In a pull-based example — so this would be a shared-disk model — we're still partitioned, but these are logical partitions; the data all resides on the shared disk. The partition markers here are just saying that any query that needs to access data within this range of this partition will be taken care of by this node. So the query shows up at the home partition, and just like before, we could send the query fragment that needs to touch the data this other node is responsible for down to it. That part is like pushing the query to the data, but because we're shared disk, this node doesn't actually have any data; it's out on the shared disk. So now we have to go out and make a request for the pages we need from the shared disk, pull them into our compute node, crunch on them, and then this node sends the result back to the home node. So this is where, as I said last class, the terminology is a bit blurry: for shared nothing it's quite obvious whether you're doing push versus pull; in a shared-disk environment, technically this is a push, but back here, since we had to get the data from the shared disk, it's a pull.
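Just to put rough numbers on why pushing the query to the data usually wins, here's a back-of-the-envelope sketch I'm adding — the sizes are made-up assumptions, not figures from the lecture:

```python
# Made-up numbers: a 100 GB partition of S lives on a remote node,
# the query / plan fragment is a few KB, and the fragment's join result is 50 MB.
GB, MB, KB = 10**9, 10**6, 10**3

partition_size = 100 * GB   # data the remote node holds
query_size     = 4 * KB     # size of the query fragment we ship down
result_size    = 50 * MB    # size of the fragment's output we ship back

pull_bytes = partition_size               # pull: drag the whole partition across the network
push_bytes = query_size + result_size     # push: ship the fragment down, ship the result back

print(f"pull moves {pull_bytes / GB:.1f} GB, push moves {push_bytes / MB:.1f} MB")
# pull moves 100.0 GB, push moves 50.0 MB -> roughly 2000x less data on the wire
```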
Is that clear? Yes? So his question is, wouldn't this always be a pull in a shared-disk architecture? What I'm saying is, back here I could have had this node pull all the data it needs to compute this query to just this node here, right? But yeah, so he says — because this node is responsible for ids 1 to 100 — yes, that's the example I'm giving: we push the query fragment that touches that data down here, so that part is the push; when you go get the data from the shared disk, that's the pull, okay? And in my example at the very beginning, that was a pull: I got the query to run here and I pulled all the data from all these other guys into this one node. That could have been shared disk, or it could have been shared nothing, and I could still have done it. So the point I'm trying to make is that shared nothing doesn't always mean push, and shared disk doesn't always mean pull — technically you have to pull the data from the shared disk, but you may not be pulling data between nodes.

So another aspect to talk a little bit about is what the fault tolerance model is for these OLAP systems as you run these queries. Has anybody here actually run MapReduce jobs before, like in Hadoop? Very few, yeah — I mean, it's an older technology, people don't — do they even teach it to you anymore? It's fine. So what will happen is, since we're running OLAP queries, they're read-only, so we don't really care so much if we crash while the query is running, because we're not going to put the database in an inconsistent state; we weren't modifying anything. Everything's read-only, so it doesn't break if we crash. But it also means that if we have a query that's going to take hours or days — in older systems, in the old days, queries could take days — and one node crashes halfway through that query, then all the intermediate results we've computed are gone and we have to come back and do it all over again.

So in the mid-2000s, when Hadoop, and MapReduce at Google, were coming into vogue, they decided that since they were going to run on really cheap machines that could go down at any point during the execution of a query, they were essentially going to take snapshots of the query at almost every single step along the way. That way, if one node crashes during the computation, another node could pick up where it left off without having to restart the entire thing. So traditional analytical systems would not be fault tolerant within a single query: if your query is going to take two days and it crashes on the first day, you have to come back and start all over again. In the Hadoop world, they actually take snapshots at almost every single step in the query plan, and that way if you crash you don't have to start all over again. Now, that was a design choice they made because they're running on cheaper hardware, but they pay the penalty of having to write things out every single time. In newer analytical systems, the ones designed explicitly for OLAP, this can be something you tell the system to do: some systems let you take checkpoints as you execute the query so that you can recover if you crash later on.
Right, again, we don't care about correctness or consistency issues in terms of putting the database in a weird state, because we're not modifying anything, but we may have to redo work if we come back later on. Yeah, in the old days — I remember hearing stories, I had a friend that worked at Vertica. Vertica was a column-store system that got bought by HP, and then they got bought by a holding company, and they actually have an office here in Pittsburgh, believe it or not. He told me that he went to set up Vertica for, I think, an Australian telecom company, and they had queries running on some older system that would take days to run — like, one query — and when they plugged in Vertica it would go to minutes, right, because it was a column store, it did compression, it did all this other amazing modern analytical stuff. So now it's not so much an issue, but in the old days it really was; and I think this is a tunable option in some systems, okay?

All right, so now we need to talk a little bit about how we actually do query planning. There's nothing really new to teach you about how to do query planning in a distributed environment, because all the same optimizations we talked about before are still applicable and germane here: you still want to do predicate pushdown, you still want to do early projections, you still want to try to remove as many tuples as possible that you know are not going to be needed for your final output, as early as possible in the query plan. The only difference is that you now need to be mindful of where the data is actually located and how it may be partitioned in your cluster. We'll see some examples in a second when we talk about join algorithms, but if I know that the data I need to join is not partitioned on the join keys, then I have to do a bunch of extra work and start moving things around. So essentially what happens is that a distributed query planner also starts taking into consideration the network I/O cost of different operations. When we talked about query planning before, we said disk I/O was the main bottleneck, the main issue, so we always had that be our cost model; but now, in a distributed environment, we also want to consider the network I/O, because that's going to be even more expensive.

So what will happen is you'll generate a query plan that is annotated with information about where different physical operators in the query plan need to execute, and then you can break it up into query plan fragments that you ship off to the different nodes, which then do their computation on that portion of the query plan and send it back to the base partition. There's additional metadata that says, compute this join and then send the output to this node here — it may not always go back to the base partition; you may want to do a sideways pass to another node that may need your data later on — but all that information you figure out in the query plan ahead of time.

So now, what you actually send to the different nodes that are going to execute your query for you can differ. It could be a physical query plan — the relational operators you've generated on the base node — and you send that out to every node, which just takes it and executes it as it was sent to them, meaning you're doing all your query planning at the home partition or at some central location.
Another approach is to actually send specialized SQL statements that target each partition, which can then take them, parse them, and do query optimization locally as if it was a local query. As far as I know, most systems do the first approach: they send the actual physical plan — here's the join operator, here's the sequential scan or index scan I want you to do — and then the output of the query is just a message to send that data to some particular node in the cluster. The only system I know of that actually does the second approach is MemSQL, and they sort of extend SQL, their dialect of SQL, with some additional information about where the data is going to come from and where it needs to go after you compute the query. The idea here, which I actually sort of agree with in some ways, is that rather than having a centralized query planner make decisions about how to execute every query plan fragment, they just have the local node do that for them. The idea is that the local node has the best information about what the data actually looks like, and you don't have to worry about keeping some global thing in sync.

So just to visualize what I actually mean: say again this is our join query we want to send out, and we have three partitions. We could just turn the SQL into a physical relational plan and then send out the query plan fragments of that relational plan to every single node, which then just takes it and executes it as if it was generated locally. But what MemSQL does is rewrite the SQL query to target the data that's at each partition: we add a WHERE clause to say, only touch the data that this partition knows about. Each node then gets that query, runs it through its SQL parser and query optimizer, and generates a physical plan for itself, just on the data that it knows about. So one way to think about this: if we send SQL statements, we're running the query optimizer in parallel on three different nodes; if we send query plan fragments, the actual physical plan, then we run the query optimizer once in some central location — we hope that our catalogs and statistics are in sync so we make good decisions — and we just send those plan fragments down. Then you get back the results, union them together, and that's the final result you send to the application. Again, as far as I know, MemSQL is the only one that does this; I think it's an interesting idea, and I don't know if anybody else does it, okay?
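Here's a little sketch I'm adding of what that second approach looks like in spirit: the coordinator rewrites the original SQL into one statement per partition by tacking on a WHERE predicate for that partition's key range, and each node then parses and optimizes its own statement locally. The table names, ranges, and rewrite format are all made up for illustration; this is not MemSQL's actual syntax.

```python
# Hypothetical partition map: which node owns which range of R.id.
partition_ranges = {
    "node1": (1, 100),
    "node2": (101, 200),
    "node3": (201, 300),
}

original_sql = "SELECT * FROM R JOIN S ON R.id = S.id"

def rewrite_for_partition(sql, lo, hi):
    # Append a range predicate so the statement only touches that node's data.
    return f"{sql} WHERE R.id BETWEEN {lo} AND {hi}"

# The coordinator ships one specialized statement to each node; each node runs
# its own parser/optimizer on it as if it were a local query, then sends results back.
fragments = {node: rewrite_for_partition(original_sql, lo, hi)
             for node, (lo, hi) in partition_ranges.items()}

for node, sql in fragments.items():
    print(node, "->", sql)
```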
So now, as I said, the efficiency of the distributed query plan is going to depend a lot on what the actual tables look like, and as I said, joins are the thing we're going to spend most of our time doing, so this is the one we really need to be mindful about: what the data looks like and where it's located, in order to make the best decisions about how to compute the join. The easiest approach, as we saw in the beginning, is that we just take every table we want to use in our join, put it onto a single node, and compute the join locally; but obviously we lose all the parallelism benefits we get from having additional resources. It may also be the case that the data doesn't fit — the two tables we want to join don't even fit on a single node — and then we're sending a lot of data around and the join just ends up failing.

So the way we're going to do this is, again, we look at how the tables are partitioned, or whether they're replicated, and figure out how we can move data between nodes while moving the least amount of data possible, so that each node has the data it needs to compute the join without communicating with any other node during the join; then each node computes its result and sends it back to a centralized location where we union the results together. Once you get all the data onto a node to compute the join, you can just use all the same join algorithms we spent a lot of time talking about before: nested loop join, sort-merge join, hash join. There's no magical distributed hash join algorithm that's completely different from what we talked about before — you build a hash table, then you probe it. So once we get the data we want into the right location, we use the same join algorithms we talked about before.

So there are four scenarios we have to take care of in a distributed environment. The first one is where one of the tables we want to join is replicated at every single node. Think of this like when we talked about the star schema and the snowflake schema: the fact table is going to be huge — it's going to have billions of entries, every item anybody has ever bought at Walmart since they started — so that table is going to be partitioned and stored separately on different nodes. But things like the dimension tables are going to be much smaller; they're going to be read-only or read-mostly, and therefore we can replicate them on every single node, so that when we compute the join we don't have to go looking around to find where the data we need for the other table is — it's right there for us. A classic example of a dimension table would be the zip code table for addresses in the US: there are only something like 35,000 zip codes in the United States, and the post office updates them four times a year, so that's a small table we can replicate at every single node. So all we need to do here — one table is replicated and the other is partitioned — is do whatever local join algorithm we want at each partition, then ship the result from each node to a central location, union the results together, and that's the final answer. Pretty straightforward.

The next scenario — and we're going in order from best-case scenario to worst-case scenario — another good setup is when the two tables you want to join are both partitioned on the join key. So we know that for any tuple from the outer table, the matching tuple in the inner table will always be at my partition. Again, both of these are partitioned on the id column, and the ranges of those values match exactly at each partition: this one is 1 to 100, this one is 101 to 200. So for any tuple with a given id in R, I know the matching tuple in S has to be at my partition; it won't be at another location. So again, all I need to do is my local join as if I was a single-node system, get the output at every partition, send that to a centralized location, combine it together, and we get the result. Also pretty straightforward. All right.
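Here's a tiny sketch I'm adding of that second scenario: both tables are partitioned on the join key with matching ranges, so each "node" just runs an ordinary local hash join on its own slice and the coordinator unions the results. It's plain Python with made-up data, just to show that nothing distributed-specific happens once the data is co-located:

```python
from collections import defaultdict

def local_hash_join(r_rows, s_rows):
    # Plain single-node hash join on the first attribute (the id column):
    # build a hash table on one input, probe it with the other.
    ht = defaultdict(list)
    for rid, r_val in r_rows:
        ht[rid].append(r_val)
    return [(sid, r_val, s_val) for sid, s_val in s_rows for r_val in ht.get(sid, [])]

# Both R and S are partitioned on id with the same ranges: ids 1-100 on node 1,
# ids 101-200 on node 2, so every match is guaranteed to be local.
node1 = {"R": [(3, "a"), (42, "b")], "S": [(3, "x"), (99, "y")]}
node2 = {"R": [(150, "c")],          "S": [(150, "z"), (180, "w")]}

partials = [local_hash_join(n["R"], n["S"]) for n in (node1, node2)]
result = [row for part in partials for row in part]   # coordinator unions the partial results
print(result)   # [(3, 'a', 'x'), (150, 'c', 'z')]
```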
So the next setup is where the two tables are partitioned, but one of the tables is not partitioned on our join key. So again, the R table is partitioned on id, and that's what we're doing the lookup on in the join, but the S table is partitioned on value. So can we do a local join? No, right, because for some id, say id 99, I have no idea whether the matching tuple is going to be over here or over there. The worst thing I could possibly do is, as I'm looping through every single tuple in the outer table, send a broadcast message to every single other partition saying, hey, do you have this id for me? That would be super slow, because now every other node is doing the same thing. So what we do here is, if we recognize that one of the tables we want to join is not partitioned on our join key, but it's actually small enough, then we just do a broadcast, where every single partition sends the contents of its piece of that table to every other partition, so that every partition now has a complete copy of the table S. So in this case, I take my contents of S and send them to this partition, and on the other side I take my contents of S and send them to this partition, and now I'm back where I was in the first scenario, where one of the tables is replicated at every single partition. At this point, after I do this data movement, I can compute the join locally, then ship my result to the central location and combine. Sometimes you'll see this called a broadcast join, or sometimes a broadcast hash join or a broadcast sort-merge join; the idea is the same — you always have this first broadcast step where every node broadcasts to every other node, here's the data for one of the tables you want to join on. And just like before, when we would pick the smaller table versus the larger table to be the inner versus the outer table for our join, here we always want to pick the smaller table, in terms of the amount of data we're sending, to be the one we broadcast to everyone else.

All right, the last scenario is the worst-case scenario, and this is where both tables are not partitioned at all on our join key. In this case, the R table is partitioned on name and the S table is partitioned on value, so again we can't compute any local join here, because we don't know whether there's a tuple in the inner table that matches a tuple in the outer table with the same id at any other node; we'd have to send messages to everyone else to figure out whether they have a match for us. So here what we do is called a shuffle, or a reshuffle. The way to think about this is that we're essentially repartitioning our tables on the fly, but only for this query, meaning after the query is done we throw away the data we've shuffled around. So what we're going to do is have this node — sorry, this node here — say, all right, I know I want to send the part of my partition's data where id is 101 to 200 to this node here, so it just sends that minimal amount of data. And likewise it generates a new partition here which says, all right, for 101 to 200, my contents of R go up into there. Then you do the same thing on the other side for 1 to 100, and then you do the same thing for S on both sides. So now, at this point, I'm back in scenario two, where both tables are partitioned on my join attribute, so I can compute my local join, get my result, send that to a central location, and combine it together.

So again, in the literature this is sometimes called a shuffle join or a shuffle hash join; it just means you have this extra shuffle step where you're moving data around to put it into the right partitioning scheme. At this point the original data doesn't go away — the tables partitioned on name and value here don't disappear even though I've copied them — so when this query is done, my repartitioned data gets thrown away and the space is reclaimed.
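And here's a sketch I'm adding of that worst case: neither table is partitioned on the join key, so each node first reshuffles its rows to whichever node owns that id range, and only then runs the same kind of local hash join as before. The ranges and data are made up; in a real system these buckets are network sends inside the query plan, not Python lists:

```python
from collections import defaultdict

def owner(join_key, ranges):
    # Decide which node should own this key after the reshuffle.
    for node, (lo, hi) in ranges.items():
        if lo <= join_key <= hi:
            return node
    raise KeyError(join_key)

def shuffle(rows, ranges):
    # Repartition rows by join key; each bucket would be sent over the network
    # to its target node and kept only for the lifetime of the query.
    buckets = defaultdict(list)
    for row in rows:
        buckets[owner(row[0], ranges)].append(row)
    return buckets

def local_hash_join(r_rows, s_rows):
    ht = defaultdict(list)
    for rid, r_val in r_rows:
        ht[rid].append(r_val)
    return [(sid, r_val, s_val) for sid, s_val in s_rows for r_val in ht.get(sid, [])]

ranges = {"node1": (1, 100), "node2": (101, 200)}

# R is partitioned on name and S on value, so their rows are scattered arbitrarily
# with respect to id; shuffle both on id, and then every match is local.
r_rows = [(3, "a"), (150, "c"), (42, "b")]
s_rows = [(150, "z"), (3, "x"), (99, "y")]

r_shuffled, s_shuffled = shuffle(r_rows, ranges), shuffle(s_rows, ranges)
result = []
for node in ranges:
    result += local_hash_join(r_shuffled[node], s_shuffled[node])
print(result)   # [(3, 'a', 'x'), (150, 'c', 'z')]
```

The broadcast join from the previous scenario is the same skeleton, except that instead of range-partitioning the smaller table, every node just sends its whole slice of it to everyone.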
So ideally, as we talked about before, we want to pick good partitioning attributes so that we minimize the amount of data movement we have to do. In this scenario, if all my queries are doing joins on id, then I'd be stupid to partition on anything else but id. But you can't always predict these kinds of things; especially in analytical workloads, people do all sorts of weird kinds of joins that you may not expect, so there may not be one best partition key that makes every join local and avoids any reshuffling or broadcasting.

Yes? So your question is, when you reshuffle and now you have these partitions on id — sorry, I should be clear, his question is: the query shows up, so how do the nodes actually start moving data around like this? This is a step in the query plan. For this query, it's going to have a step that says, okay, shuffle: take the contents of R that you have here, look for the ones that match 101 to 200, and send them to this node here. It's a step in the query plan that you have to do before you can move up to the next operator in the query plan, where you end up computing the join. So this is done on a per-query basis.

Where do you put — what, sorry? So his question is, where does this data physically live? It goes in the buffer pool, just like everything else, because it's data that might exceed the amount of memory we have, so we need to be able to swap it out to disk. The intermediate results of the query are essentially what this is: at every step along the way, this data is just there, we bring it into our buffer pool, and then we start operating on it. All these extra copies we're making, plus the output of our joins, are intermediate results that also get put out into the buffer pool, because if I run out of memory I want to be able to swap them out to disk.

What if one node crashes? So that was the point I was trying to make about the fault tolerance stuff. If you have replicas, then you could recognize that, all right, this node went down, but I have another copy of R and S partitioned on name and value here, so I'll just go back and repeat this step of copying the data over from the other partitions; my replica node gets elected as the master and it can restart the query. If you don't have a replica available to you, or if you're below your k-safety threshold, then the query would just crash and the system would halt.
But this is the point I was trying to make: in a Hadoop-style or MapReduce-style system, they do checkpoints at every single step along the way, and they're also a shared-disk system, so in that environment they're actually writing these things out to the shared disk. So if this node goes down, this other guy can come back and say, well, I know about these other partitions this guy wrote out, so let me do the work it was supposed to do, pick up where it left off in the query, and keep on running. But writing all this out to the shared disk at every step along the way is expensive, so if you want to get good performance you try not to do that.

So I want to now introduce a new relational operator that we haven't talked about before, called a semi-join. In all the examples I showed you, I'm just showing that there's some arrow of data going from one partition to another; the question is, what are you actually sending? As defined in the original relational algebra, from a mathematical standpoint, you'd be sending the entire tuple, so if I have a hundred columns in my table and I need to do a reshuffle, then I'm going to copy a hundred columns over to the other partition. But we know, based on what the query is trying to do, that's probably not necessary. So what a semi-join is: it's like a natural join, except that you just send the columns that are needed to compute the join, not everything else. So again, say I have R and S; R has three columns and S has three columns as well. If I want to compute a natural join, I would have to copy over a_id, b_id, and xxx, and the other columns as well; but under a semi-join I only copy over a_id and b_id from these two tables, and I ignore the other columns because they're not needed to compute the join. This is essentially what distributed systems are going to do to compute joins: you look at what data you actually need to compute the join, and then you only ship that data over.

Now, there's another design decision you have to make as well. You could say, all right, I know I only need one column to compute the join, but I need another column as part of the output. So you can decide whether to ship the two columns over together to compute the join, or, in the semi-join, just ship the one column you need to compute the join, and then, for any tuples that pass that join operator, go back and get the rest of the data — but that's extra network traffic, extra communication. So this is where the query optimizer can decide: well, I think my join operator is going to have very low selectivity, so let me not pass along that second column, because I'll just go back and get it later and I won't have copied that much data; or, I'm going to have very high selectivity — a lot of tuples are going to pass my join — so let me push along the data I actually need ahead of time, so that I don't have to go back and get a lot more later. So a semi-join just means that the data you pass along is only what's necessary to compute the join and nothing else, and this is a common optimization that distributed systems use. Somebody was complaining that semi-joins have nothing to do with distributed systems, and I disagree; this is what every system does, okay?

Yes? We'll have two rows — are you saying compression as in physical compression or logical compression? The statement is that in SQL, since it's bag algebra, you can have duplicates, so you would have duplicates; I think in the original relational model paper it's based on sets, so you wouldn't. But think of it in a distributed system: say I need to get R over to S, I'm not going to send over xxx, I just send over the thing I need to compute the join, okay? Any other questions? Okay.
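As a concrete (made-up) illustration of the semi-join idea, here's a small sketch I'm adding: only the join column gets shipped, the join is computed, and then the remaining columns are fetched just for the tuples that actually matched — which is exactly the selectivity trade-off described above. Column names and data are invented for the example.

```python
# Full R tuples living on node 1: (a_id, xxx, big_payload)
r_full = [
    (1, "xxx-1", "...lots of other columns..."),
    (2, "xxx-2", "...lots of other columns..."),
    (3, "xxx-3", "...lots of other columns..."),
]
# S tuples living on node 2: (b_id, value)
s_rows = [(2, "foo"), (3, "bar"), (9, "baz")]

# Step 1: node 1 ships only the join column of R (the semi-join projection),
# not the whole tuples.
r_ids_shipped = {a_id for a_id, _, _ in r_full}

# Step 2: node 2 computes the join using just those ids.
matches = [(b_id, val) for b_id, val in s_rows if b_id in r_ids_shipped]

# Step 3: go back and fetch the remaining R columns only for the ids that matched.
# (Whether to make this second round trip, or ship everything up front, is the
# selectivity decision the query optimizer has to make.)
needed = {b_id for b_id, _ in matches}
r_rest = [row for row in r_full if row[0] in needed]

print(matches)   # [(2, 'foo'), (3, 'bar')]
print(r_rest)    # the two matching R tuples, fetched after the join
```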
Okay, so now I'll talk a little bit — and this will be mostly just me giving you my opinion — about cloud systems. I said last class, or two classes ago I think, that shared-nothing systems are what most people think of when they think of distributed databases, and traditionally most distributed databases were shared nothing, but there's been a large trend now moving towards shared-disk systems, and I actually think this is going to be the predominant architecture for distributed systems going forward. This is primarily due to the proliferation of cloud systems, or cloud vendors. So now I can get on Amazon, I can get my EC2 nodes, and I can get S3 or EBS; I can scale up the storage independently from the compute layer, and I can run HDFS or whatever distributed file system I want on the storage layer and have separate compute nodes for executing queries.

So what vendors will sell you now is called database-as-a-service, DBaaS. The way to think about this is that it's not like I'm getting a VM on EC2 and installing MySQL myself; I go give Amazon my credit card, and they give me back a JDBC connection URL, and that's the only way I can communicate with the database, through that connection. Underneath the covers they're managing MySQL, Postgres, whatever it is they're selling me, for me. I don't worry about setting up the hardware, I don't worry too much about setting up the configuration of the system — Amazon or Microsoft or whoever does all that for me. So a lot of organizations are moving towards analytical systems in the cloud, because again, you can get a large amount of storage and a large amount of computational power very easily. In the case of OLTP systems, unless your application is also running in the cloud with your database, you typically don't run your OLTP database in the cloud. It doesn't make sense to have a data center here in Pittsburgh, run your application here in Pittsburgh, and then have your database running in Amazon in Virginia; the latency of that would be quite long, so you wouldn't want to do this — you have to run the application very close to the database server for transactions. For analytics it doesn't matter, because that query is going to be very long and run for a long time anyway.

So — I can't talk about all of this on camera, but there is now a movement, which I think is very fascinating, where the line of demarcation between a shared-disk architecture and a shared-nothing architecture is starting to blur. You're not going to have a clean separation of, here's my shared disk, and here's my computation that reads data from that disk. What you're seeing from certain vendors now is that they're actually pushing logic from the database system down into the storage layer. Actually, this part is public, so: Amazon does this. On EBS they have a little shim layer in front of it, for Aurora, that knows about transactions.
So rather than having to wait for all the computation to be done up in the compute layer, there's some notion of transactions, consistency, and replication done at the storage layer, and that allows them to get much better performance. They can also go the other direction: they can put logic about the data into the networking layer. So now it's this weird thing where it's not cleanly, here's your shared disk and here's your compute; there are little bits of the database system at all the different layers of the architecture, and you can really only do that if you're a cloud vendor, because you control the entire stack. I think that's really fascinating.

So there are two types of cloud databases you can have. As I said, the easiest way to have a cloud database, if you want to call it that, is to just go set up EC2, install MySQL on there, and run and maintain it yourself. If you want to go with database-as-a-service, the two approaches are either a managed database system or a cloud-native database system. A managed database system just means that instead of you running MySQL on EC2 yourself, they run it for you: they set up everything and just give you that connection URL to talk to it. Most vendors have something that looks like this; if you go look at a database company, a database startup, and they say they have a cloud service or cloud database system available, it's them running their regular software for you, automatically, inside of a cloud VM. Now contrast this with what I'll call cloud-native database systems. These are systems designed from the ground up to explicitly run in a cloud environment, with elastic scalability in the compute layer and elastic scalability in the storage layer. Probably the most famous one of these is Snowflake: Snowflake is a distributed analytical data warehouse system that was designed from the onset to run in a cloud environment, and you can't download the software and run it locally; you can only run it as a service through them. Google BigQuery is another one, or BigTable is another one. Redshift — Amazon bought ParAccel, and they run it for you as a managed environment; they've made a lot of changes now to make it more cloud native. And Microsoft has modified SQL Server to make it more cloud aware as well. These systems are designed to be fully aware that they're running in a cloud-based environment, and they make certain design decisions to account for that, to overcome some of the deficiencies and take advantage of some of the benefits you get in the cloud.

So the other big trend that's come along in the last couple of years for these cloud systems is universal file formats. Traditionally, when a database management system writes data out to disk, it's writing it out in a proprietary format that only that database management system knows how to interpret and read. You guys see this in your projects: when you write out a page of the buffer pool, or you write a log message, it's specific to our little test database system. Another database system could come along and write support for reading that, but that's not something that most people are going to do.
So now there's been this movement towards having these binary formats that are sort of universal, that you can store in a cloud storage system like S3 or EBS or HDFS, and then have different types of database systems be able to access them and read them. This is more common in analytical systems; there's no universal transactional data format. Traditionally, the way to get data out of one database system and into another is to dump it out as a text file, like CSV, JSON, or XML, and then have some other piece of code that knows how to convert that into insert statements in the format the other database needs. So now there are these standardized formats that are meant to be a universal language, so that you can communicate between these different machines or different database systems more easily.

These are probably the four most famous ones. The oldest one is actually HDF5, the hierarchical data format; this came out in the 1990s, and it's a multi-dimensional data file format for scientific workloads, so you don't really see it a lot in organizations — you see it mostly in big research labs or scientific labs. The ones you may have heard of are Parquet and ORC: Parquet came out of Cloudera and Twitter, and ORC came out of the Apache Hive project. The way to think about these is that they're binary file formats that are compressed column stores — think of dictionary compression on columns, or run-length encoding, or other techniques to compress the columns — and they have a bunch of libraries written in different languages that know how to read and write data in these files, or generate these files. So now there are a bunch of systems that know how to natively read these file formats: you can take Presto, for example, or Hive, and write SQL queries that access tables whose data is stored in these file formats, and you just go out to S3 or EBS or wherever they're stored, suck them into your buffer pool, and natively process them.

The newer one that I'm actually very interested in for research purposes is Apache Arrow. This is like Parquet or ORC in that it's a compressed column store, but it's designed for in-memory computation, like for an in-memory database. The kind of cool way to think about this is: say I have a database where I'm generating a lot of data from the OLTP side of things, but now I want to run a data science script, like in pandas or scikit-learn, on the data I've collected. If I don't use one of these universal file formats, I have to dump it out of the database, convert it, and then put it into a form that my Python script can read. With Arrow, the idea is that rather than writing it out to disk in one of these universal formats that then gets read back in by my Python script, what if I just generated the universal format in memory and gave you a pointer to that location in memory, and you can just crunch on it natively? So Arrow is the one I'm most excited about, and I think this is where everyone is sort of going. This is another aspect of these cloud systems: instead of everyone being in their own silo, everyone now wants to be able to use these different interchange formats, these universal file formats, so that data generated by one type of system can be used in another one, and that's only really come about in the last couple of years because of cloud systems.
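To give a feel for what using one of these universal formats looks like in practice, here's a small sketch I'm adding with the pyarrow library (you'd need pyarrow, plus pandas for the last step, installed; the file name and data are made up): one system writes a Parquet file, and a completely different tool — here just pandas — reads it back without any CSV dumping or custom conversion code.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# "System A" builds an Arrow table in memory (a columnar layout)
# and writes it out as a Parquet file that any Parquet-aware engine can read.
table = pa.table({
    "id":    [1, 2, 3],
    "name":  ["andy", "andrew", "ann"],
    "price": [9.99, 19.99, 4.99],
})
pq.write_table(table, "sales.parquet")

# "System B" (here just pandas, but it could be Presto, Hive, Spark, etc.)
# reads the same file directly -- no text dump, no per-system conversion code.
df = pq.read_table("sales.parquet").to_pandas()
print(df)
```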
Yes? So the question is, what is the incentive for a proprietary DBMS, a commercial DBMS, to actually support one of these file formats? Well, if you want to play nice with everyone else, you have to do this. There's no one system that's a juggernaut anymore; it's not like Oracle can bend the market to their will. There are enough open-source solutions, and the database system is no longer the center of the universe: you're collecting data from streaming systems, you're collecting data from machine learning programs — you have data coming at you from all different locations — and if the database system you're trying to sell can't read these file formats, then you're at a disadvantage, because no one's going to use you. I have a bunch of files generated by this program I want to use, and you can't read them, so why would I want to use you? So that's another big aspect. I think the big trend in the last ten years — MapReduce, or Hadoop, was sort of the first wave of this, and Spark is now pushing it even further — is that the traditional relational database system is not the center of the universe, but it's still an important component in a complex application stack. They're not going away; they're just being used alongside other pieces of software. Okay, so any questions about cloud systems?

All right, so we're done; this is it for the content of the course. The main takeaway from all this is that these systems are hard — distributed databases are just hard. If you don't remember anything else, remember that. If anybody tells you, yeah, let's roll our own distributed database, and it's like you and your friend doing it on the side, rethink that choice, okay? I'm not saying it can't be done, and I'm all for new databases — if you graduate from CMU and you make a new distributed database, or a new database in general, email me, because I'll add it to the list. But again, just because we're distributed, as I said, doesn't mean things are magically easier or going to go faster; there's a whole bunch of other issues we have to deal with, and the main issue is now the network. Whereas before, we assumed network communication was non-existent when we talked about a single-node system, in a distributed environment we have to account for it, and that means in an OLTP system we have to account for losing messages or messages showing up in weird orders, and in an OLAP system we have to account for having to move data around between nodes, okay?

All right, so as I said, next class is the VoltDB speaker. If you haven't voted for the systems potpourri, I'll send a reminder — please do that — and I'll also send a reminder about how to turn in the extra credit for the checkpoint next Wednesday, okay?