 Let's get started. There's lots of empty seats today. This is good. I succeeded in scaring a bunch of you away or boring you to tears or something, but the rest of you, we can enjoy our time together. This is lovely. And for those of you who are watching this on the Internet, hello. Okay. So today, what I thought I'd do, and this is kind of in the spirit of a quick start to the class, was to get you just enough information both on how a database kind of could work and how you might write some simple queries so that you're armed and dangerous to deal with at least one table at a time. So the whole spirit of the class, just by the way, the way I like to structure it, is to front-load material so that you kind of get your homeworks early, kind of get going early. And by the time we're midway through the semester towards the end of the semester, this class is easing off in terms of workload. And you can focus on some of your other classes. So it gives a little bit of load balancing on the homework, hopefully. And also, when I took databases lo these many years ago, they spent a lot of time on conceptual stuff. And it was a long time before we understood how anything worked or got our hands on anything. So I try in this class to get your hands dirty right away. That was basically homework one. And then this lecture will kind of complete a short, skinny slice through data management so that you can at least walk up to a database and be dangerous. And also, you could build some simple algorithms to do stuff like joins and hashing and sorting and things like that. So today we'll knock off single table queries in SQL. And I'll talk to you a little bit about the query processing architecture that's typical in building data pipelines. And then next time we'll start talking about joins. 
And after that, you kind of have query processing under your belt, which is one of the first things you want to think about or learn when you're dealing with large amounts of data. All right. So jumping right in, the goal today: basic SQL, which is sort of the standard query language that everybody continues to use. And then query executor architectures. So how do you build a query executor? Just a comment before we dive into SQL. How many of you guys know some SQL? Great. And how many of you ran into SQL because you had to for a job? And how many of you learned SQL for some other reason? Okay. Mostly for practical reasons. It's a pretty little language. It's a domain specific language for data. It's kind of reasonably well suited to its task. It's pretty old, and there's things that are kind of crufty about it. But for the most part, it's pretty good. How many of you know another query language, database query language? Okay. Shout out the names of the query languages you know. Loud. Pig. Anybody else? There were some other hands up. Just shout out the name of the query language you used. HBase. Okay. Has its own query language, I think. Yeah. Anybody use the MongoDB query language? Couple hands. Okay. Anybody use JSONiq? It's a query language for JSON. Anybody use XQuery? Okay. So over the years, as different data formats and different data models have come out, people have tried to say SQL is old. Let's not use it. Let's come up with a new query language. And the new query language invariably looks like a somewhat cleaned up, and usually made more complicated, version of SQL. And then people end up devolving into saying, you know what? Let's just build SQL again. So all the Hadoop vendors are building SQL, and a lot of the NoSQL backends are building SQL, and a lot of people have SQL layers over JSON. So even as we evolve with technology, SQL's sort of an old friend of a language. 
You may not love it, but you're certainly going to want to know it, and it's handy. And it's more or less fine for a lot of things. It definitely has its quirks, and there's things you won't like about it. And for that, IBM apologizes, but hey, they invented it in 1975. Right. Okay. Query executor architecture will be the second topic. Great. All right. So let's jump into a little bit of relational dogma. I don't particularly like dogma too much, but here's the way most relational databases have worked and continue to work, although there's some variation on this. Typically in a relational database, when you define a table, you give it a schema, which is a description of the contents of the table in terms of the name of the table, the number of columns, and the names and types of the columns. And that schema is typically fixed. We assume it's not going to change very often. You can change it over time. There's a command called alter table in SQL. You can add a column or drop a column or do things like that. But generally speaking, you can assume that the structure of the table is kind of fixed. And that's nice, just the same way that in a language like Java, you declare your data types, and they have a particular structure. Unlike, say, something like Python where, you know, types can be looser. Right. So this is much more of a structured, well-typed language. Right. So we're going to define the types of our tables up front. We call them schemas. So you have attribute names, and then each one of the columns or fields of the table has a type, and that data type in traditional SQL is an atomic data type. So it can't be an object or some kind of composite type. It's always like int or float or text, or one of these very basic atomic types. Now, most modern SQL databases will allow you to define more interesting types. You can have a column of type JSON, or a column of type, you know, set of something, or a column of type array of something. 
Those are all extensions to basic SQL that modern systems support. But sort of textbook database, and that's all we'll do for now, assumes that the column types are atomic types, like text, or integer, or float. Okay. So the schema of the table is fixed. That's the description of the table. It's the metadata, the data about the data. Right. But the instance of the table can change. An instance is a particular population of records in a given table. So it's like the state of the table. What data is in the table right now defines the instance. And the instance of a relational table is a multi-set of rows, sometimes called tuples, pronounced however you like. So those are all synonyms: row, tuple. And it's a multi-set, which means you can have duplicate rows. So you can see here that Bob Snob with GPA 3.3 in the CS department appears twice. And as far as SQL semantics are concerned, those are two independent rows. They don't have sort of unique identity; we just have to remember that there's two of them. Okay. They're indistinguishable. There's really no difference between them. But there are two of them. And when we count things up, for example, we'll count that as two. So it's a multi-set, not just a set of these rows, right? And every row has to fit the schema. Any questions so far? It's pretty simple. But you can imagine many things you might want to change. Yeah. Is there an order? There is, by default, an order to the columns. And it's the order you type in when you do the create table command, which we'll probably see later on. That order doesn't get used for much. You can, and it's usually not recommended, reference fields by position. You can, I think, say $1 or $2 for the first column or the second column. Definitely not encouraged. And the order can be changed with an alter table command. So it's also not a great thing to count on staying the same. 
But the names, people would typically not change the names of the columns. Why would we want multi-set instead of set? It's an excellent question. Well, does anybody want to try to give it an answer? I think it's debatable, frankly. But does anyone want to vote for why multi-sets are good? This is sort of a dorky example. Any of you have a data structure maybe where you feel like, if you put it in a table, it would need a multi-set representation? Or can any of you imagine an application where you'd have two copies of the same row? I mean, for example, you might want to represent, this is maybe a silly example, but you might want to represent people visiting your store. And maybe in one table you have customer ID, product ID. Just which customers bought which products. And I want to make sure that the count of rows in that table represents every time a customer bought a product. It turns out I buy toilet paper from Costco every whatever six weeks, right? So it should have me in there every six weeks. It doesn't have the timestamp in that table. So they're not distinguishable. We could still count up that table and get the right numbers. Maybe that's a bad idea, though. Maybe it's a bad design. Maybe I should have the timestamp there, right? Be able to distinguish them. So people debate about this. And don't forget, there's a perfectly reasonable set-oriented way to represent Bob Snob here. We could have said that there's one more column in this table. Students, I'll try to write real big. So what do we have? We have a name, GPA, department. And why don't we just have another column called count? And we can just keep track at application level. If we want to put a second row with the same values in, it would just bump the count. It would represent the same information, but every row would now be distinct. We'd have Bob Snob. We'd have a GPA of 3.3. He'd be in the CS department, but he'd be there twice. Why not? I don't know. This is just the way it is. 
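The multi-set point above can be sketched in code. This is a minimal sketch using Python's built-in sqlite3 (not the Postgres setup from the lecture, and the rows are invented, not the lecture's real table): the duplicate "Bob Snob" row is stored twice and counted twice, and the set-oriented alternative with an explicit count column is shown alongside.

```python
import sqlite3

# Fixed schema, declared up front; the instance (the rows) can change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL, dept TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Bob Snob", 3.3, "CS"),
        ("Bob Snob", 3.3, "CS"),   # indistinguishable duplicate: still two rows
        ("Mary Contrary", 3.9, "Business"),
    ],
)

# Multi-set semantics: COUNT sees both copies of the duplicate row.
(total,) = conn.execute("SELECT COUNT(*) FROM students").fetchone()
print(total)  # 3

# The set-oriented alternative discussed above: one row per distinct
# value, plus a count column kept at the application level.
for row in conn.execute(
    "SELECT name, gpa, dept, COUNT(*) FROM students GROUP BY name, gpa, dept"
):
    print(row)
```

SQLite and Postgres agree on multi-set semantics here, so the same behavior shows up in the class VM.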
So what can I tell you? It's multi-set semantics. It's convenient for applications that are doing insertion logic. They don't have to think about counting. They just can pour stuff into the database. And it is annoying for people writing queries, especially queries where you're counting stuff up or computing averages or other what are called aggregate functions. Because you have to reason about duplicates. Do you want the duplicates when you're counting? Do you want to remove them before you're counting? SQL provides syntax to do all those things, but now you have to worry about it. In fact, in our algorithms as well, we're going to have to define things like what is the union of two multi-sets? What is the difference between two multi-sets? What is the join between two tables that have duplicates in them? We're going to always have to think about duplicate semantics for everything we do with these things. It makes the set logic just a little bit more interesting. It's not that big a deal. Anyway, great conversation. Thanks for bringing that up. But it is the way it's going to be, multi-sets. Here's your basic structure of an SQL query. Anything in brackets is optional. So, select column expression list from table is the minimalist SQL query. But you can also add things like select distinct, which is, well, we'll go through these one by one. That's the skeleton. And what I'm going to do over the next set of slides is walk you through all the clauses and give you examples. And this is pretty straightforward stuff. And I would encourage you, and we will along the way, I would encourage you to also get in and play with this stuff in the database, which you have in your VM and I'll show you how to fire it up. So the simplest version, which is what's now not grayed out, is very straightforward. If you say select columns from table, maybe where something, you'll produce all the tables, all the tuples in that table that satisfy that predicate. 
And for each one of those tuples that satisfies the predicate, we will output one tuple with the expressions in the select list. An expression can be just a column reference, like I want to know the GPA column of the table. Or it can be an arithmetic expression over column references. I want the GPA times two. Or you can also do just plain old arithmetic. I want to have 17 minus four in every row for some reason. So let's look at some examples. Here's one. Select s.name, s.gpa from students S. Note the use of the variable capital S so that we don't have to type students over and over. We can assign it a variable. It's a little more convenient. Actually, in this particular query, you don't have to say the table name at all. Well, here, let me do the following. Let's read it out the way it is. Where s., this is a total typo. It got pasted in twice. I'm sorry. I think when I was editing it last night, I messed it up. This should just be like that. And somehow it got pasted in again. There we go. This is finding the computer science students' names and GPAs, right? Get the names and GPAs from the students table where the department equals CS. That's all this was supposed to be. I don't know how that other syntax got in there. And so just for fun, let's try this in a little database I set up. You guys see that? I can make it bigger. I could also make it a different color. Can you read that in the back? It's good? It's good. Alright, so really quickly, let me do a little cut and paste. Select s.name. We're in display mode. Gotta get out of display mode. This part's easy. Actually, you know what, I have these all in a buffer. Let's do this. I have them all in Sublime. I also have some other stuff in Sublime. Here we go. There's that query we just looked at. s.name, s.gpa from students S where s.dept equals CS. Let's paste that into our SQL prompt here. And we get that out. You get the names and the GPAs for the students in the CS department. 
And just to confirm, they got pretty good grades except for Bill. He dropped out. I think he actually got pretty good grades. Does anybody know who Bill Gates wrote his one academic research paper with? His advisor, Christos Papadimitriou. That's right. And so you can compute your Gates number, which is how many hops away you are as a co-author from Bill Gates. I'm happy to say I'm a two. Thanks to Christos. It's cool. Select star from students. Star is just a macro for listing out all the column names of all the tables in the query. So select star from students will show you everything that's in the table. So that's the full data set, and you can confirm, I think, if we go back to the previous query, that we got all the CS majors out and printed their names and GPAs. Okay? Great. Alright, so that's fine. Let's go back to our examples here. So that's the most basic query you can have, pretty much: select star from table. One step better is select particular columns from table where some condition holds. It's called a predicate. Okay. We can also remove duplicates by saying select distinct. Okay? And if you say select distinct, all that happens is we run the very same query, but right at the end we remove duplicates. And remember, you have to do it at the end because there can be no duplicates on the input but duplicates on the output. So let's see this. Select distinct s.name, s.gpa from students for the CS department. So let's pull that one out of the buffer here and paste it into our SQL prompt. Okay? So that's fine. That's the distinct names and GPAs. That doesn't look any different, right? But what if we say select distinct s.gpa from students s? What's that going to produce? How many rows? Well, actually, where? Sorry, we better do that: s.dept equals CS. Just one and four, right? Make sense? So the duplicates are done at the output, because when you project, which is to say just take out some of the columns, you may get duplicates introduced by that projection operator. Okay? 
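The projection-introduces-duplicates behavior above can be sketched like this. A minimal sketch with invented rows (sqlite3 standing in for the lecture's Postgres; the semantics of SELECT, WHERE, and DISTINCT match here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL, dept TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ada", 4.0, "CS"), ("Bob", 4.0, "CS"), ("Cat", 1.0, "CS"), ("Dee", 3.5, "EE")],
)

# Full name/gpa rows for CS: three rows, no duplicates among them.
cs = conn.execute("SELECT name, gpa FROM students WHERE dept = 'CS'").fetchall()
print(len(cs))  # 3

# Projecting down to gpa alone introduces a duplicate (two 4.0 rows)...
gpas = conn.execute("SELECT gpa FROM students WHERE dept = 'CS'").fetchall()
print(sorted(gpas))  # [(1.0,), (4.0,), (4.0,)]

# ...and SELECT DISTINCT removes the duplicates right at the output.
distinct = conn.execute(
    "SELECT DISTINCT gpa FROM students WHERE dept = 'CS'"
).fetchall()
print(sorted(distinct))  # [(1.0,), (4.0,)]
```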
So here we just projected to the GPA column. We had lots of duplicates. We deleted them on the way out. So select distinct makes sure that the output is distinct. Okay? We can also order the outputs any way you want. Okay? And the order is done lexicographically, which means in your order by clause here at the end you can do a whole list of expressions. Note that a2 comes from the select list way up there: s.age times 2 as a2. So we have an expression in the output that we gave a name, and we're using it in the order by clause. So you can have whatever arithmetic or stuff you want in the order by clause. It could be in the output like it is here, or not. You could just put arithmetic in the order by clause if you're so inclined. Okay? But the lexicographic thing means we're going to sort by GPA, and if there's ties on GPA, we're going to sort by name. And amongst the ties on GPA and name, we will sort by this expression a2. All right? So this lexicographic ordering means you sort by the attributes in order and then you break ties by looking at more attributes going right. Does that make sense? That's what the dictionary does with letters, right? The a's come before the b's, and within the a's, the second letter determines what comes before what, and on ties for that, the third letter determines what comes next. It's like that, but with fields instead of letters. Right? So it's nothing surprising. All right? And notice this as clause for naming output columns, which you can reuse in the order by. So let's just have a look at that real quick. Make sure that everything's working the way we expect. Hello, thank you. There's that query. All right? And there it is run, and you can see it's ordered by name. There's no ties on name, so that's not very interesting. Well, let's just, if we take out the name part, then we'll get some ties. Where are we here? Oops. Just think. Oh man. What happened? 
Order by... oh, we got name in the ordering list, but there's no name in the output. So I think I lied, actually. We'll try that in a second. We'll see if I lied about something. That should work. Right? We got ties on four, and then they're ordered by a2. Right? Now, I made an assertion before that you could put any expression you want in this thing at the end. So let's just try that. I may be wrong. Let's say instead of ordering by a2 we just order by square root of s.gpa, which is an absurd thing to order by, but it's not in the output, right? But are we allowed to order by that? And the answer is no. All right, so I'm sorry, I used to be able to do that in Postgres, but obviously it was illegal and somebody turned it off. So your order by expressions have to be in your output. I am wrong. Which is why we use that as a2 business. Okay? By the way, there's lots of details, like in any programming language, in SQL. This is not going to be the kind of class where we're going to quiz you on your SQL syntax and make sure it's letter perfect on a handwritten exam or anything like that. It's a programming language. We expect you to do live coding and consult the internet and all those good things. So don't get hung up. Like, what I just did, I should get no points off as a professor for not getting that right, because I tried it and now I know. That's how you learn to code. So it's fine. The same thing should hopefully apply on your homework and exams. On your exam we're not going to take off points for little syntax things and minor details. If you didn't understand something conceptual, like how to use distinct properly to get the right answer, that would be different. Alright. You can make your order by descending or ascending, basically column by column, saying whether you want it to be ordered up or down. Descending is downwards, ascending is upwards. Alright. And that probably isn't an interesting example to run, but it is legal. 
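The lexicographic ORDER BY with an aliased expression can be sketched like this. Invented rows again, and sqlite3 rather than Postgres (note SQLite is more permissive than the Postgres behavior seen in the demo, but ordering by a select-list alias like a2 works in both):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL, age INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Bea", 4.0, 20), ("Abe", 4.0, 22), ("Cal", 3.0, 21)],
)

# Sort by gpa (descending), break ties by name (ascending), then by the
# aliased expression a2, dictionary-style: left attribute first.
rows = conn.execute(
    "SELECT name, gpa, age * 2 AS a2 FROM students "
    "ORDER BY gpa DESC, name ASC, a2"
).fetchall()
print(rows)
# Abe comes before Bea: they tie on gpa, so name breaks the tie.
```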
Okay, and then here's where life gets a little more interesting. This is computing aggregates. So in this case I want to know the average GPA of students in the CS department. And I will tell you that in Postgres the word average is actually not a supported keyword. It's AVG, and I think that varies from SQL to SQL, but that should be legal. So we'll try that query. We want the average GPA from students in the CS department. So let's first of all go back and look at the students in the CS department again. And you notice a variation I'm doing. I didn't do that capital S business at all. I just did dept. Right. I said select star from students where dept equals CS, and it's implicit in a single table query that there's only one table you could possibly be talking about. So when you type in column names, it will assume they're from that table, and that's legal. Okay. So what I was doing with the capital S before was unnecessarily verbose for this particular single table case. So here's our students in the CS department. Their average GPA, I don't know, you can figure it out. There's four fours and two ones. But we can also ask our friend the database to figure it out for us. We'll say select average of GPA from students where department equals CS. Oh, it's three. Okay. That's really true. 16 plus 2 divided by 6 is in fact 3. That sounds right. Okay, good. So that's fine. Pretty simple. Let's make it more interesting. Before producing the output, what happened here? Right, we took all the input rows and we computed a summary of some arithmetic expression, average of s.gpa, where the input to that arithmetic expression was a set of numeric values. Right, and this is going to produce one row of output, in this case with just the one column of the average. 
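The aggregate query above can be sketched directly, reproducing the lecture's arithmetic with invented rows (four 4.0s and two 1.0s in CS, so the average is 3.0); sqlite3 stands in for Postgres, and AVG works the same way in both:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (gpa REAL, dept TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(4.0, "CS")] * 4 + [(1.0, "CS")] * 2 + [(3.9, "Business")],
)

# AVG boils the whole set of matching gpa values down to one output row.
(avg_gpa,) = conn.execute(
    "SELECT AVG(gpa) FROM students WHERE dept = 'CS'"
).fetchone()
print(avg_gpa)  # 3.0, i.e. (16 + 2) / 6
```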
There are other aggregates besides average: there's sum and count and max and min and standard deviation and median. Actually, there isn't median; some systems support median. There's a few other simple statistical aggregations that you can compose together to do more interesting things. You can use distinct inside the agg function: you can say select count of distinct name from students. Okay, and so it'll remove the duplicates before it does the counting. If you don't say distinct on the inside, it won't remove the duplicates. So just to be clear, let's go back here. Let's look at our students. I think Bill Gates is only in there once, but he's the wealthiest guy in the world, right? So he should be in there twice. Cool, so we'll say insert into students values Bill Gates. This time let's guess that his GPA was 4.0, though he must be older than 60. He's in the world domination department. And you notice the single quotes, and you notice the syntax for insert into is pretty straightforward; I'm not going to teach you everything. So he's in there twice now. He's in there three times now. Awesome. If we say select count star from students where name equals Bill Gates, how many should we get back? Three. Now, if we say count of name from students where name equals Bill Gates, how many should we get? Three. What if we say select count of distinct name from students where name equals Bill Gates? Now we get one. Okay, so that's what the distinct clause does. And you notice that count is actually a special aggregate. You can put star in it because, unless you're doing distinct, it doesn't matter what column you're counting; they all have the same number of rows. Most aggregates like max and sum and average, you need to tell them what column you're maxing or summing or averaging. Count just counts. But if you're doing distinct, you need a column name. Okay. And notice, if we did this distinct on the outside, what does this do, anybody, the second query? What's the difference with the second query? When does the distinct get invoked? On the whole output. So on the whole output set we then remove duplicates, but the output set of this query is only one row, so it has no duplicates. So select distinct won't do anything here, right? So just to be pedantic about this, let's type it in, and you'll see that this is still three, because there's one row, it contains the number three, and it has no duplicates. Life is good, right? So the select distinct is done right at the output; you've got to put the distinct inside the aggregate. Question: they have to all be identical to be duplicates. So let's just hammer that home. Let's look back at our schema for students. Oops, that went by fast. So let's look at, say, GPA and age: select GPA comma age from students. Okay, there's a couple of dupes in there. Four comma 35 is in there twice, I see. So there's 11 rows there, and if we say select distinct GPA comma age from students, there's only nine rows. Yeah. So there were two duplicate rows before, but they have to be duplicates on all the fields in the output. Okay, so those are simple aggregates. Now let's make this more interesting. I want to know, not just for the CS department but for every department, the average GPA. Rather than having to run a query for the CS department and a query for the business department and a query for the English department separately, I can run one query that says: select the average GPA and department name from students group by department. So it breaks it into departments and then it runs that aggregate query per department, essentially. Just to make this painfully clear, let's run it on our database. Where's our group by query? Here it is, right. And Bill Gates even made it in there with the world domination department, and the average GPA there is four. Now notice that the department field was in the select list, like the average GPA comma department. Okay, what if we said the following: let's also ask for the student's name. Right? So I want to not only 
know the average, I also want to know the name of the student. That's a syntax error. You can read the error: it says it wants s.name to appear in the group by clause or in an aggregate function. Why is that? That's right. So GPA, we're boiling it down to a single number per department; there's going to be a row per department, right? The department name is per department, so that's good. But the name of the individual isn't per department, right? What's the average GPA and name of students in computer science at Berkeley? Well, there's lots of you computer science students at Berkeley; whose name would we pick, right? Doesn't make sense. We need to construct a single row. Now remember that standard SQL doesn't have a notion of structured types inside fields, so it's not like I can take the set of all your names and shove that into the name field, right? I would have to somehow pick one string to put there. So basically it's illegal. The only legal queries are: either it's in the group by clause, and then it can be in the select list, or it's nested inside an aggregate function and it's getting boiled down to a single value per group. Make sense? I'm tempted to do something weird and just see if it works. I think Postgres has a crazy aggregate to accumulate a bunch of... let's see if this works; it might not. My theory is that this is going to create an array of all the names in every department and put that in the cell, but I'm not sure if this is a default function in Postgres. And it's not, so forget that. Sorry, I tried. Remember what I said, though: this query is illegal because name is not in the group by clause and it's not nested inside an aggregate. Age is in the group by clause, sorry, department is in the group by clause, so it's legal. GPA is nested inside an aggregate, so it's legal. And that's exactly what we did at the bottom of the slide here. Okay. So you know the general idea: you partition the table into groups; each group has the same group by column values. You can group by a list of columns. So we could also say group by s.department, s.name, and then you'd get all the Bill Gateses in the CS department grouped together, and the other people in the CS department would get different groups, right? So for example, let's just do that. We're going to select the count, let's say, of department comma name from students, group by department comma name. So this is legal now, right? Because everything that's in that select list is in the group by list, so that's legal. What's the output look like? Well, it's department-name pairs, because we asked to group by distinct department-name pairs, right, with the group by clause. The more things in the group by clause, actually, the more rows you get, because you're slicing things finer. You're saying I want more detail per group; now it's a department and a name. And you'll notice that there are two CS Bill Gates tuples, right? There's two Bill Gates in the CS department, so the count there was two. Make sense? Does anybody know who all the people on the list are, just out of curiosity? Raise your hand if you recognize all those names and you could describe who they are. Most of them, I'd say all but two of them, are fabulously wealthy, and the other two are important data people, let's say. So if you don't know who they are, you can Google them. And some of them are Berkeley alums, but I'm not sure how many; at least one. Now, one more clause on group by queries that you can add is the having clause. It only makes sense if there's a group by query. It's a way of filtering output rows after group by that don't satisfy a condition. So remember, the where clause filtered rows, but it filters them before the group by. The having clause is going to filter groups after the group by. So this particular query: I want the average GPA and department from students, grouped by department, but I only want to see departments that have five or more people in them. Okay, that's what that's saying. You can't ask how many people are in the department before you group; you 
can only ask after you group, right? So let's just play with this query a little bit in the terminal. Here it is. I think, because I didn't want to type in lots of data, I can change this to three instead of five, but here's the query we looked at. All right, the CS department. And just for fun, let's make sure we also take in the, sorry, the count, so we see what the count is. Right, there's six tuples, there's six records in the CS department, and so it makes it to the output, right? And the average for the CS department happens to be three. Okay. If we got rid of the having clause, then we would see all the departments, including the ones with smaller counts, right? Okay. Now let's change the query in a different way. Here's the having clause, but let's put in a where clause instead, or in addition. Where, and you'll notice new lines don't mean anything in SQL, s.name equals Bill Gates. Okay, so the where is being applied before the aggregation; the having is being applied after the aggregation. How many rows in the output, can anybody guess? I heard two. I heard one. Zero. Why zero? Sorry, count of three is not satisfied by any department once we filter down to only the people named Bill Gates, right? So the input to the aggregation is just students where name equals Bill Gates, which is the output of this: select star from students where name equals Bill Gates. That's the input to the aggregation. All right, there's two in the CS department, one in the world domination department. Then we compute the counts of that table and the averages of that table, and then we check the having clause on the output of that, and no department has three rows in it, right? So where is applied before the aggregation; having is applied after. In this case the where leaves us with only two CS tuples and one world domination tuple, and none of the remaining groups have three things. But if we change this to greater than or equal to two, then we're good, because there's two Bill Gates in the CS department. Yes. So the question was, for those of you who didn't hear: where is before aggregation, true; having is after aggregation, true. Is that because of those keywords, or is it because of where they appear syntactically in the query? Let's say it this way: it is because of the syntax. So by definition, where is before aggregation, having is after. Also, syntactically, we don't allow you to say having until we've already seen a group by. So having has to come after group by; it makes it easier for the parser. Okay, it's part of the standard, and we can try it, but I'm pretty sure it won't work. Let's give it a whirl: having count star. Yeah, it can't parse that; it needs having to be after group by. Where also probably has to be before group by; you can try that test at home. Okay, and the syntax in the slides is correct: you can have a where predicate and you can have a having clause, but the having has to come after the group by. Having can only be used in aggregate queries; it doesn't mean anything without group by. Actually, let's try this. I don't know if you need a group by; maybe you can just have an aggregate without a group by. So select count star from students. That's a legal aggregation query, one group by default, all the students. Can you do a having on it, having count star? You can. Okay, so the having only requires you to have an aggregate. Without a group by, there's only one group, and the having is applied to that one group. So that is legal. Okay, that's fun. Questions up to here? Other questions? All right, so putting it all together, here's a query that uses everything we've learned so far. It's going to find, per department, the average GPA and the count for females, grouped by department. I think we'd better lower the count; I didn't type in that many students. Order by department. Okay, so you can always order the output of any query, including an aggregation query. I know this won't work unless we say AVG, because I tried it. And I have all sorts of gunk here; sorry, this is the only one I didn't clean up last night, apparently. 
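The where-before-grouping, having-after-grouping split that this demo keeps circling can be sketched with invented rows (sqlite3 standing in for Postgres; both apply WHERE to rows before GROUP BY and HAVING to groups after aggregation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, gpa REAL, dept TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Bill Gates", 4.0, "CS"),
        ("Bill Gates", 4.0, "CS"),
        ("Bill Gates", 4.0, "World Domination"),
        ("Ada", 4.0, "CS"),
        ("Cal", 1.0, "CS"),
    ],
)

# HAVING filters whole groups after aggregation: only departments with
# at least 3 rows survive (here just CS, with 4 rows).
big = conn.execute(
    "SELECT dept, AVG(gpa), COUNT(*) FROM students "
    "GROUP BY dept HAVING COUNT(*) >= 3"
).fetchall()
print(big)  # [('CS', 3.25, 4)]

# WHERE shrinks the input before grouping: filtered to the Bill Gates
# rows, no department has 3 rows left, so the same HAVING keeps nothing.
none = conn.execute(
    "SELECT dept, AVG(gpa), COUNT(*) FROM students "
    "WHERE name = 'Bill Gates' "
    "GROUP BY dept HAVING COUNT(*) >= 3"
).fetchall()
print(none)  # []
```

Relaxing the threshold to `COUNT(*) >= 2` in the second query would bring the CS group back, matching the live-demo fix.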
Let's make that greater than or equal to two so we get some output. Good. It's got a SELECT, it's got an AVG, it's got a COUNT, a FROM, a WHERE... there's a syntax error there, you can see; we'll have to fix that. We're at s.department... what happened here? Sorry, my cut and paste died. I think that'll fix it. There's some messy text in here, sorry, one more minute. GROUP BY... WHERE s.gender = 'F'... sure looks okay to me. Copy, paste. I think it's actually a screen wrap thing. No. All right, well, I'm not going to type this one in by hand; there will be homework for you, and you can try it at home. But you know what it's doing, right? Let's talk it through. It's going to take the students table and filter down to only the females, so only the female rows from the students table remain. It's then going to group them up by department, and per department it's going to compute the average GPA for the females and the number of females. Then it's going to throw away any departments that have fewer than six females. The output of that, which is a bunch of departments with six or more females plus the average and the count, it's going to sort by department alphabetically and display. So: the WHERE clause, then GROUP BY and aggregates, then HAVING, then ORDER BY. That's the evaluation order for that query, and understanding that order is part of your understanding of SQL.

So, I don't know if you can read this, maybe you can't, but you can try this in your class VM. You just vagrant up, then say su space minus space postgres and you'll be in the postgres account. Then you can create a database by saying createdb testdb, and psql is the command prompt for SQL that comes with
Postgres. Then you'll be in Postgres, and you can do a CREATE TABLE statement and an INSERT (I showed you how to insert the data in), and then you're off running queries. I actually strongly encourage you to do this at home. Any questions?

All right, so that's pretty easy stuff; that's just basic SQL. We're going to shift gears now to software architectures. I figured I should spend a few minutes teaching you some SQL, but now we're going back to designing software architectures, in this case for database management system query processing. Any system that's doing query processing, and you can think of a relational database, or things like Hadoop or Spark, anything that's pushing pipelines of data through code, is going to look more or less like what I'm about to describe, with some variations here and there. This is basically how you stream data through logic to generate more data, and pretty much all systems use some variant of what we're about to learn.

Before we dive into that, though, let me give you an overarching block diagram of what a relational database looks like, so you can see the context we're living in for the next chunk of the lecture. A traditional relational database contains the following components. There's a database at the bottom that stores the data, and SQL queries go in the top. In between is the system architecture and all the software. When a query comes in, it goes into something called the query optimizer, which figures out the best way to execute that query. We'll learn in this class what design decisions the optimizer has to make: what choices it can make to optimize the code, and how it makes its decisions. Once the query optimizer is done, it's going to generate what's
called a query plan, which is the way we're going to execute the query. You can think of that as kind of like bytecode for the query, and then query execution is the actual execution of that quote-unquote bytecode. What is that bytecode? It's the relational operators: the bulk data operations we're going to be learning about in this class. You've learned two already, sorting and hashing; those are examples of relational operators that might appear in a query plan. We're going to string bunches of those together and flow data through a pipeline of them, as we'll see in the next few slides. What we're going to focus on today, for the next few minutes, is this execution layer, without the optimizer, plus the relational operators, which you've started learning already and will learn more about on Thursday.

The relational operators themselves might need to access disk, to spill things off to disk or to read things from disk, so they're going to have ways of going out to files. Some of those files may have special data structures on them, which we'll call access methods: the code that allows you to access the data efficiently. Underneath those file access calls, and here it's probably worth another picture, is something called the buffer manager, which is a kind of cache in memory. So imagine this whiteboard marker is your disk; it might be an SSD, it might be a magnetic disk, but we'll just draw it like a can. Here is your code, so to speak, that's requesting data: read next block. There's some code in your system that wants to do that. It could go directly to the disk drive, but instead it's going to go through an API, and that API is going to check in a cache of disk blocks that have been kept in memory.
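That cache-in-front-of-the-disk idea can be sketched in a few lines. This is a toy with made-up names (real buffer managers also track pinning, dirty bits, and fancier replacement policies), just to show the hit/miss/evict cycle:

```python
from collections import OrderedDict

# Toy buffer manager: a fixed number of in-memory frames caching
# fixed-size disk blocks, with LRU replacement.
class BufferManager:
    def __init__(self, disk, num_frames):
        self.disk = disk                  # dict block_id -> bytes, stands in for the drive
        self.num_frames = num_frames
        self.frames = OrderedDict()       # block_id -> block, kept in LRU order
        self.io_count = 0                 # how many real "disk reads" we did

    def read_block(self, block_id):
        if block_id in self.frames:       # cache hit: no disk I/O
            self.frames.move_to_end(block_id)
            return self.frames[block_id]
        self.io_count += 1                # miss: fetch from "disk"
        if len(self.frames) >= self.num_frames:
            self.frames.popitem(last=False)   # evict least-recently-used frame
        block = self.disk[block_id]
        self.frames[block_id] = block
        return block

disk = {i: b"block-%d" % i for i in range(10)}
bm = BufferManager(disk, num_frames=2)
bm.read_block(0); bm.read_block(1)   # two misses fill both frames
bm.read_block(0)                     # hit, no I/O
bm.read_block(2)                     # miss, evicts block 1
print(bm.io_count)                   # 3
```

The point of the structure is exactly what the lecture says: the code above the API never talks to the drive directly, and the replacement decision (which block gets kicked out) lives entirely inside the buffer manager.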
Each one of these little rectangles here is the size of a disk block. Remember, a disk block is the unit you can ask for from the disk; the disk delivers things in blocks of, say, 64 kilobytes. So each one of these rectangles is a 64 kilobyte slot in memory to hold a disk block, and this thing is called the buffer manager. The memory slots are called buffers, and the buffer manager decides, at any given time, which disk blocks to keep here, and which disk block to kick out when you read a brand new one and need to make room. So the buffer manager is a cache of disk blocks that sits between the access methods for the files and the actual disk itself.

The last piece of software we're missing is the disk space manager, which determines how to lay out the files on the disk drive itself. It has to do things like reorganize when there's free space, and maybe it also takes care of certain failure modes on the disk. We'll talk about file layouts for databases as part of the class as well.

So this rectangle is a full stack for a single user database: it goes from raw queries to raw disk I/Os, via query parsing, optimization and execution, these operators, and all the file and disk access, and then brings the data back up. The piece that's missing from this picture is what's alluded to over on the right: if you have multiple users using the database concurrently, we have to worry that they don't mess each other up. That concurrency control logic affects all three of the bottom layers; the files and access methods, the buffer management, and the disk space management are all interleaved with concurrency control. So we'll have to learn about that at some point. And then logging and recovery is also a piece of this puzzle. As you're doing I/Os to the database, deleting things and
adding things, we also need to make sure that if, kaboom, that disk drive explodes, there's a way to recover the data. That's going to be done through a log file, and it's going to be on another device. So the logging and recovery component of the system is another piece that interacts, unfortunately, with all three of those things on the bottom. The picture you have over on the left is good enough to understand a single user, isolated execution of a database; if you want concurrent access, you've got to do the stuff on the right. Your traditional relational database has all of that.

When you look at something like Hadoop, which many of you used in 61B, it pretty much just has the stuff in the rectangle, plus parallelism, which we'll talk about as we go (parallelism is actually easy). It doesn't have any of the concurrency control and logging and recovery stuff that's traditional in a relational database. I'll get to you in a sec. So think of that kind of stack as sort of Hadoop, if you will. Funny enough, NoSQL databases like Cassandra and MongoDB are kind of a horizontal slice from the file and access method layer down: they typically don't include much of a query language, or the relational operators and execution, but they do do some concurrency control and recovery and some of this disk management. So Hadoop kind of went one way, and the NoSQL things went the other way, and they sliced the relational architecture apart, mostly for simplicity, and actually in order to scale up. We'll talk more about Hadoop and NoSQL and all that later, but since some of you are familiar with them, I wanted to make sure you get a sense of how the systems you're interested in fit into this traditional relational database picture. Then there was a hand? Awesome question. The question was: for things like the buffer manager and the disk space manager,
doesn't the operating system do that? The answer, traditionally, has been yes, but poorly, and therefore the database system usually had to implement it itself, in a way that circumvented the operating system. By today there's a little more cooperation, but still, because there are multiple OS vendors and multiple database vendors, the database system tends to do its own thing. Some of it has to do with the concurrency control and recovery stuff that relational databases like, which the operating system typically doesn't have hooks for; we'll talk about that at great length when we get to concurrency and recovery, and why the operating system doesn't do the right job for us. There's also something to do with the replacement policy in the buffer manager; sometimes databases want to do that themselves as well. So typically what the database does is circumvent the operating system one way or another, get access to its own memory, and manage it itself. Similarly with disk space: allocate one big giant file that's the size of the entire disk, so the operating system doesn't know any better, and the database works within that. Very typical. And this goes back to, like, Sun and Oracle not agreeing with each other in the 80s, or, you know, the researchers in Soda Hall... it wasn't Soda Hall, it was Evans Hall: the database people and the operating systems people, the BSD group and the Ingres group, didn't talk to each other enough. It goes back to those days, to the 70s. Could you have built it all together? Yes. The other thing I'll point out, while we're on this happy side comment, is that Microsoft tried to integrate the two at one point. There was something that was going to be called WinFS, a file system for Windows that was supposed to be a relational database, and Bill G was all into this, and it apparently almost destroyed a Windows release, because they just couldn't get it to ship and be fast at everything
that a file system needs to be fast at. So Microsoft made a valiant attempt to bring these worlds together in the early 2000s and failed, and depending on who you ask, they failed for political reasons or for technical reasons; you can get many versions of that story. Anyway, long answer to a short question. Any other questions on this high level view of database system architecture?

Okay. So today, as I said, we're focusing on a very small slice, which is the execution of queries. We actually just talked through this, but let's go through the top pieces, really just the top box of that big rectangle. The query optimizer translates SQL to a special internal language. I called it bytecode, but that's a Java analogy; query plans is what we call them in database land. And the query executor is, if you like, an interpreter for query plans. Now, there's a long tradition of asking: why would you interpret this stuff, why not compile it to machine code? You can do that too, and essentially what you'll compile is the kinds of programs I'm about to show you how to interpret, and that's fine. The compilation versus interpretation thing is a detail; it's just as useful to understand how the interpreter works. So don't sweat the compilation thing: think of the query executor as an interpreter for this query plan language.

Query plans themselves you can think of almost as a graphical language: blobs with data flow arrows between them. A plan shows how data is going to flow through chunks of code. Each blob implements what we'll call a relational operator, and the edges in these graphs represent a flow of tuples with particular columns. So every edge in one of these diagrams is going to have a
particular structure, a particular schema, associated with the tuples that flow along that edge; every tuple has the same schema along any given edge in the diagram. For single table queries the diagrams are just going to be straight line graphs, so these are not going to be very interesting. In general they could be trees, or DAGs, or even cyclic; you can have cyclic data flow queries, but that's fancy and we won't do it in this class. For single table queries it's super easy: just a straight line.

So here's a picture for this query: SELECT DISTINCT name, gpa FROM students. Give it to the optimizer and it's going to spit out a data flow diagram that looks like this. I don't know if you can read the letters in the back in this color; okay, great. The bottom operator is a file scan. It's going to scan the students table, and as it scans, it's going to start passing records up. For whatever reason, it's customary to draw these things so that the data flows from the bottom of the page to the top; they're sort of upside down if you think of them as trees, but that's how it goes. So data is going to start flowing out of this file scan, and the only columns we need to pull out are name and GPA, because nothing later in the query needs anything else; we'll throw away everything but name and GPA coming out of the file scan. We're going to pass all those rows into a sort operator. The sort operator is the thing we learned last time, right? It's the algorithm that takes a whole set of tuples, streams them into an input buffer, starts sorting blocks of them, and puts them off into side files. The left-hand side of that sorting picture is now this flow of data from the file scan, so you just wire that sorting picture into the file scan's output. And then
the output of the sort, which again will just be names and GPAs, is passed into a distinct operator. This operator's job: if it sees two tuples in a row that are exactly the same, it throws one of them out. That's all it does, and it also produces name, GPA pairs. Actually, there's a Unix command that does the same thing as distinct; it's called uniq, spelled ending in q. So if you ever want to do this in Unix, you do it with pipes: this plan is roughly equivalent to something like cat file | sort | uniq, in lowercase of course. For these single table queries, you can think of these edges as Unix pipes, and these operators as kind of like those Unix filters. Make sense? This wouldn't literally work, because you'd have to say sort on what fields, and uniq on what, but it's close. So that's the kind of query processing we'll see in single table queries; it's very simple.

But this is just a picture. How do you actually make it work? What's the code you generate? Let's talk about that. In essence, the diagram is a description of what we want; now we need to interpret it. The way we'll interpret it is that every one of those operators will be implemented in our system as a subclass of a class called iterator. This is sort of a C++ slash Java-ish object-oriented pseudocode, but bear with me. Iterator has three methods: init, next, and close. Look at their signatures. Init doesn't return anything; it just initializes the operator, as in, we're going to do a query now, get ready. Next returns a single tuple: every time you say next to an iterator, it should give you back a tuple. And close doesn't return anything either; it just says we're done, you can tear down your state. And it's
going to have an array of inputs, which lets us represent not only a straight line picture like the one on the previous slide, but something more DAG shaped or tree shaped, where an iterator has multiple inputs. So there's an array of inputs to the iterator, sort of left to right: what are all my input flows? Some subclasses of iterator will admit only one input, some will admit two, and in principle some could admit many; it depends how you implement your iterators. Typically in this class we'll see iterators that take one or two inputs. And anything else you want to do in your iterator, you can obviously subclass it and add additional state.

So the edges in the graph are actually implicit. The diagram is the picture of the data flow, but the actual data structures are three iterator objects, where this one holds references to the others in its inputs array. That's what the pointers look like: this object points to its input objects, but data is going to flow the other way. So parents know about their children, if you like; well, it depends how you look at it, I guess, since the tree's upside down. Everybody good, or did I succeed in confusing you? Iterators know about their inputs, is the point, and those references represent the data flow edges.

This is all encapsulated in the iterator class, which means that if I'm, say, the sort iterator, I do not know what the type of the iterator feeding me is. For example, on the previous slide, sort was being fed by a file scan, so the tuples were coming in from a file scan; but if that thing weren't a file scan, if it were a gigantic query over 17 tables, that would be fine too. Sort doesn't care. It just knows
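Here's a minimal Python sketch of that iterator interface, wired up as the straight-line plan from the slides (file scan into sort into distinct). The init/next/close convention follows the slide; everything else (class names, the toy data) is made up for illustration.

```python
# Minimal iterator model: next() returns one tuple per call, or None at end.
class Iterator:
    def __init__(self, inputs):
        self.inputs = inputs        # array of child iterators (data flow edges)
    def init(self): pass
    def next(self): raise NotImplementedError
    def close(self): pass

class FileScan(Iterator):
    def __init__(self, rows):
        super().__init__([])        # a scan has no inputs
        self.rows, self.pos = rows, 0
    def next(self):
        if self.pos == len(self.rows):
            return None
        row = self.rows[self.pos]; self.pos += 1
        return row

class Sort(Iterator):
    def init(self):
        # Blocking operator: drain the child completely during init.
        child = self.inputs[0]; child.init()
        rows = []
        while (r := child.next()) is not None:
            rows.append(r)
        self.sorted, self.pos = sorted(rows), 0
    def next(self):
        if self.pos == len(self.sorted):
            return None
        row = self.sorted[self.pos]; self.pos += 1
        return row

class Distinct(Iterator):
    # Remembers exactly one tuple; only correct if duplicates arrive
    # adjacently, which is why it sits on top of a Sort.
    def init(self):
        self.inputs[0].init(); self.last = object()
    def next(self):
        while (r := self.inputs[0].next()) is not None:
            if r != self.last:
                self.last = r
                return r
        return None

plan = Distinct([Sort([FileScan([("bo", 3.1), ("al", 3.9), ("bo", 3.1)])])])
plan.init()
out = []
while (r := plan.next()) is not None:
    out.append(r)
print(out)   # [('al', 3.9), ('bo', 3.1)]
```

Notice the encapsulation: Distinct only holds a reference to its input and calls next() on it; it would work identically if that input were a 17-table subplan instead of Sort(FileScan(...)).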
it's getting a stream of name, GPA pairs. So there's nice encapsulation amongst our iterators: we don't need to know much from one iterator to the next. There's one exception on the slide, one thing we do need to know. Does anybody catch me lying a little bit? Stop the professor: what did I say that's wrong? There is a lack of encapsulation on this slide; one of these operators needs to know something about its input in order to do its job. You only get three choices, and file scan doesn't have an input, so you've only got two: it's either sort or distinct. Which one needs to know something about its child?

It's distinct. Remember what I said: if it sees two things next to each other that are the same, it ignores the second one. So it remembers exactly one thing, and when it gets a new thing, if it's the same, it throws the new thing away; if the new thing is different, it remembers the new thing and forgets the old one. So if I'm distincting a set and you give me a three, I'll remember the three; give me another three, I throw it away; give me a six, I throw away the three and now I remember six; et cetera. This only works if the things that need to be duplicate eliminated come in batches together, all the threes together. If you get a three, then a six, then another three, I will have forgotten about my three, and distinct won't do its job. So you need a grouped input for distinct to work correctly. That's the reason we're sorting: we sort so that distinct can work. So there's actually sometimes a requirement, a little property of the class, like is-sorted or is-hashed, that lets us know that any duplicates will appear together. That's pretty much the only property we're going to need for these guys: a property saying the output of this thing is sorted, or
is hashed. By the way, as we write these things, different iterators will have all kinds of internal state. For example, in sorting, if you do tournament sort, the internal state will include two tournament trees, and that will be internal to the iterator: in-memory data structures the iterator keeps to itself, encapsulated. If you implement hashing, in the second phase, after you partition and bring a partition back, you'll have a hash table in memory, and that will be internal state to the hash iterator, also encapsulated. Different iterators add their own state when they subclass this guy.

Here's an example: sort. The sort iterator has one input; it only allows one. It's going to have a little bit of state: the number of runs (how many runs did I generate on disk), an array of disk block locations, which are where I put those runs, and it's going to keep track of the next record ID, the next RID it's going to deal with in each run. Here's how the three methods work, in pseudocode. The init method for sort is going to read the entire input relation, every single row, and generate the sorted runs on disk. So when the top of the query plan says init to the sort, the sort is not just going to do one little thing and return. It's going to initialize its child, which might be a file scan; the file scan comes back and says I'm all initialized; and then the sort is going to say next, next, next, next, until it's gotten all the data out of its child and put it into sorted runs on disk. Only then will it return from the init and tell the top of the plan, okay, I'm initialized. Then, when the top of the plan says next, it's going to merge and start producing outputs. So
this particular implementation of init for sort is going to allocate the runs array, fill it with disk pointers for all the places it put the runs, get the number of runs set correctly, and set up the next record ID array, which starts out null. The next record IDs are pointers into each of the runs, so it can do merging. Think about merging: zippering two sorted lists together. You need to know where you are in each list as you march down them, and that's what those next record IDs are. So when somebody says get next, we look at all the pointers into the different runs, ask which head record is the smallest, send that one to the output, and advance its pointer. And when you have nothing else to return, you return a special end of file (or, since they're not really files, call it end of fun, or whatever) up to the guy above you: you said next; I've got nothing. Then the guy above you eventually says close, and you deallocate all the data structures you allocated in RAM, delete all the spilled partitions you had on disk, and generally clean up. That's a simple pseudocode implementation of sorting. Hash is pretty comparable, so you can do it as an exercise: pseudocode out the init for hash, and the get next for hash.

Note that the next calls and the init calls are all synchronous: the iterator above you invokes init and is blocked until the init comes back. In particular, and this is kind of important, think about it like this. We have two iterators; this guy says next. All
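That init/next/close pseudocode for sort can be sketched concretely. In this toy, "disk" is just a Python list of sorted runs, and a heap plays the role of the next-record-ID pointers into each run; it's a simplification of the real spilling code, not a faithful implementation.

```python
import heapq

# Sketch of the sort iterator: init() drains the child and spills
# fixed-size sorted runs; next() merges the runs, always emitting the
# smallest head record.
class ExternalSort:
    def __init__(self, child, run_size):
        self.child, self.run_size = child, run_size

    def init(self):
        self.runs = []                      # stand-ins for sorted runs on disk
        buf = []
        for rec in self.child:              # "next, next, next" on the child
            buf.append(rec)
            if len(buf) == self.run_size:   # memory full: spill a sorted run
                self.runs.append(sorted(buf)); buf = []
        if buf:
            self.runs.append(sorted(buf))
        # One cursor per run: the "next record ID" into each run.
        self.heap = [(run[0], i, 0) for i, run in enumerate(self.runs)]
        heapq.heapify(self.heap)

    def next(self):
        if not self.heap:
            return None                     # end of file ("end of fun")
        rec, run_id, pos = heapq.heappop(self.heap)
        if pos + 1 < len(self.runs[run_id]):
            heapq.heappush(self.heap, (self.runs[run_id][pos + 1], run_id, pos + 1))
        return rec

s = ExternalSort(iter([5, 1, 4, 2, 3, 0]), run_size=2)
s.init()                                    # runs: [1,5], [2,4], [0,3]
out = []
while (r := s.next()) is not None:
    out.append(r)
print(out)   # [0, 1, 2, 3, 4, 5]
```

The close() step would delete the spilled runs and free the heap; here garbage collection stands in for that cleanup.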
right, that's a traditional function call, which means this iterator's state is now on the stack and the next function of the child is now running. The child will do everything it needs to do to generate a tuple, the tuple will be returned on the stack to the caller, and then the caller has control again; it can say get next again, or return its own output to its parent, or whatever. So one interesting thing about the iterator model is that control flow, which is to say which iterator is running when, is directly coupled to the data flow: data is passed back on the stack. If this guy's in control, it means it's in the middle of generating a tuple, and when it's done, it's got that tuple ready and returns control to its caller, its parent. Actually, the arrows go both ways: the calls go down, and the data flows back up.

Why is this important? Suppose the control flow and the data flow were separate. Example: you have a file scan, like in the picture, and you have a sort, and I'm going to give them their own control flows: a thread for this guy and a thread for that guy. They run separately, and the data flows between them. What's the concern if they each have their own thread of control and get to run at whatever rate they want? Anything could go wrong. End conditions might be a problem: this guy finishes long before that guy, and maybe he has no way to tell him he's done. But you could come up with a protocol; we use this EOF on the data flow, and we could have an EOF here too, even with separate threads, so that part's not so bad. But you're on the track of something we should be concerned about. Yeah: they are asynchronous, and we have no guarantees about who executes when. This guy could be running ten times as fast as that guy, or ten times as slow, because we're not
controlling scheduling anymore. What happens if the file scan runs ten times faster than the sort? My buffer could overflow. Actually, I don't have a buffer in this picture: this guy is vomiting out tuples along this arrow, and I don't know what this arrow is, but it had better have some spare capacity to hold all those tuples while the other guy sorts, real slow. So whenever you separate these things, you have to worry about buffering: the producer has to put things in memory so that the consumer can get them at its leisure. It's a form of rendezvous, like we talked about last time; the handoff is made possible by having a little extra space. And what if the file scan is really slow and the sorting guy is really fast? The sorting guy is saying, I'd love to have a next tuple. What do you do about that? Has anybody written code with two threads that have to communicate? It's a producer-consumer problem. What happens if the consumer is way faster than the producer? Starving; interesting word. This guy would like to have data, and he's starving for it. So what does he do? There are two choices. Either he does what's called polling, where he keeps checking: got anything, got anything, got anything, which just soaks up CPU. Or you use the operating system, or something like it, to have him go to sleep and be signaled by the producer when data arrives. But then you need an external scheduler, from the operating system, to synchronize this guy and wake him up when there's something to do, which introduces overheads of various kinds.

So the iterator model is actually kind of elegant, because we don't have any of these problems. This arrow only has to hold a single tuple. And where is this buffer in the iterator model? Where does the single tuple go in
memory? It's on the stack: it's being passed back in the return of the function call. The operating system is really good at that, and so is the chip; they optimize the crap out of function calls. That's your basic computer function call thing, so that's pretty good.

So this is all pretty good; what might be a problem with this iterator model? Why might you want multiple threads going? Can anybody think of a scenario where it would be really lame to do this single threaded iterator thing? Multiple queries? Multiple queries on different tables that have nothing to do with each other... that's okay, but it's multiple something; you're on the right track. Multiple threads asking the same table could be a problem, so that might not be a good thing. I'm looking for an example where multiple threads would be a good thing, because this model sacrifices everything but a single thread of control. So when might a single thread be bad? What's the basic reason you might want more than one thread? Suppose you have multiple expensive operators: this is slow, and this is slow. What might you do to speed that up? Giving them each a thread doesn't by itself make them faster. What might? Think outside the box a little; think hardware. Throw hardware at the problem. Yeah: multiple processors. I'll put this on processor one and that on processor two. I can't do that if it's a single thread. How many cores does your laptop have? Four, two, eight? Seems like if you want a high performance database engine, you don't want it all single threaded on one core; that seems like a bummer. So, hardware: multi-core, or even better, when you have lots of processors, lots of boxes, like a Hadoop kind of context, we're going to
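For contrast with the function-call model, here's a sketch of the decoupled, two-thread version: a bounded queue plays the buffer on the arrow, blocking put/get is the sleep-and-signal option (rather than polling), and a sentinel plays the role of EOF. All names and data are made up for illustration.

```python
import threading
import queue

EOF = object()                          # sentinel marking end of the stream
buf = queue.Queue(maxsize=4)            # the "spare capacity" on the arrow

def file_scan(rows):
    # Producer thread: put() blocks when the buffer is full,
    # so a fast scan can't overflow a slow sort.
    for r in rows:
        buf.put(r)
    buf.put(EOF)

def sort_consumer(out):
    # Consumer thread: get() sleeps until data arrives,
    # instead of polling "got anything? got anything?"
    rows = []
    while (r := buf.get()) is not EOF:
        rows.append(r)
    out.extend(sorted(rows))

result = []
p = threading.Thread(target=file_scan, args=([3, 1, 2],))
c = threading.Thread(target=sort_consumer, args=(result,))
p.start(); c.start(); p.join(); c.join()
print(result)   # [1, 2, 3]
```

The machinery the lecture warns about is all visible here: the extra buffer, the EOF protocol, and the reliance on the OS scheduler to park and wake the threads, none of which the plain function-call iterator model needs.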
be dealing, for sure, with multiple threads somehow. So today's treatment is a little bit simplistic; it's kind of from the era of a single core, where this makes the most sense. Now, your operating system and a lot of other things on your laptop aren't really taking advantage of multiple cores very well. Why is that okay for now? Why is it okay that, say, Microsoft Word is single threaded and uses one core (which is probably false, actually, but let's assume it's true)? Yeah: Microsoft Word doesn't use a lot of compute. And what do you do with the other cores on your laptop? You listen to music, play video games, you have some process reorganizing your disk; there's all sorts of junk going on. So you can say everything will be single threaded, and we'll soak the processors with background tasks. A database server usually says: well, we have more than one user; it's a server, so every user gets a single thread and it'll be fine. And that is okay if you're running lots of small queries, and kind of bad if you're running one really big query.

So there are a bunch of trade-offs here. When you're thinking about building the next infrastructure for your new employer after college, you want to keep in mind what your workload looks like, what hardware trends look like, and whether this is still a good idea. The single threaded, match-your-control-flow-to-the-data-flow model might be a great idea: single thread per query, run bazillions of them at once on really beefy hardware. That might be awesome; a lot of in-memory databases do exactly that. Lots of small transactions, all in memory, everything goes lickety-split, and you don't want to pay for process and task switches and buffering. If you're building a Hadoop-like system that's going to take a pass over a petabyte once a day, you want lots of processors moving on that thing, lots of data flowing through each one of them, and all the
CPUs pegged which by the way Hadoop has terrible CPU utilization but in theory that were well implemented it would use up lots of CPUs um so in that case you might want to think about revisiting some of the stuff I'm talking about here but for now we'll just focus on single threaded uh iterator model all right and the fun thing about this just so you know is that the code you actually write in it next and close if I told you that this was instead of being function calls it was going to be at the end of this thing instead of saying return to next you push the tuple into a ascend buffer to go off the network somewhere else the code would be pretty much the same okay so the basic idea here of encapsulating these things into these operators the operators pass data between them and the operators have methods to produce a tuple and handle a tuple doesn't change so much when you change this control flow model all right but if all that seems like I don't really know what you're talking about that's fine just stick with this for now okay stick with the iterator model and function calls that'll be enough for this class all right hopefully that was interesting to some folks and if it confused you what's in the slides is good tried and true and it's enough for this class all right let's talk about group by a little bit how do we implement group by so we didn't quite do this we came real close in class with sort but we didn't quite do it so the group by query is going to kind of look like this there's going to be a sort operator and then there's going to be an aggregate or group by if you call it that operator and here's how it's going to work the sort iterator is going to make sure that all the tuples that come out of it are ordered by your sorting clause okay which is going to be the group by columns so the query optimizer is going to take your query it's going to generate a sort and it's going to be sorted by the group by columns all right and then the aggregate operator 
what it's going to do is this. Suppose you said GROUP BY department. It's going to get a flow of tuples from the sort: CS, CS, CS, CS, and then data science. And the minute it gets a data science tuple, it knows it's done with CS. The aggregate iterator keeps running information for the aggregate functions in the SELECT list. Let's do a specific query to make this concrete, and let's not do AVG first; COUNT is a tiny bit simpler: SELECT COUNT(*) FROM students GROUP BY department. The sort is going to sort the tuples by department; let's say the lowest alphabetical department is agriculture. What we need for COUNT is the count so far of agriculture tuples. It starts at zero. We get an agriculture tuple: hey, we're doing agriculture, one. Throw away that tuple, get next, another agriculture tuple: two. Throw it away, get next: three agriculture tuples. Then you get a computer science tuple and you go, ah ah ah, three agriculture tuples! and you output three. Then: one computer science tuple, ah ah ah. Get next, another one, and you count up the computer science tuples and output that. Then you get a data science tuple, you output the final answer for computer science, and so on. So you need this running state, the count so far. For SUM, that works too: you could have a guy like the Count, except he'd be the Summer or something, and he'd keep adding things up (GPAs don't add up to anything very meaningful, but whatever), and when he was done with a group he'd do his evil laugh and output the sum.

AVERAGE is sort of annoying, because you need both the count and the sum to do the average. So even though the query says average of GPA, the running state of the aggregate is the count and the sum, and when you transition from one group to the next, you divide the sum by the count and that's the average. The running state may be more than one thing; for average it's two things, and that's fine. As soon as the aggregate iterator sees a tuple from a new group, it produces output for the old group based on the final function of the aggregate, resets its running info, and updates it with the new tuple. Make sense? Pretty simple stuff.

You can even imagine an API for defining aggregate functions, so you can make up your own. It would have an initialization function that says "get started," an iteration function that says "you're seeing a new tuple from the same group," and a final function that says "here's what to do at the end of a group." In the case of average, the final function divides the sum by the count; in the case of count, you just output the count. Many database systems have an API for that, and Hadoop has an API for that: it's called reduce. So it's pretty straightforward to think about extensible aggregation, with any aggregate function you like.

Hash group-by is a little bit different, but really not much different at all. The hash iterator is going to output its tuples to the aggregate. The only difference, from the aggregate's perspective, is that it's not getting its input in alphabetical order: it might get zoology before it gets computer science. But the count really doesn't care. One zoology tuple; oh, a computer science tuple, one computer science; and so on. So group-by with hashing works just fine. In fact, we're basically done: hashing is almost exactly identical to
the previous slide. In fact, let's do a little visual diff. Oh, I didn't line them up, that's so sad. They're almost identical, and I forgot to line them up; the diagrams line up but the text got reformatted by PowerPoint. So sad. But almost everything is the same, except the input isn't ordered, it's hashed.

But we can actually do better with hash. We're going to define a single operator called the hash-agg. It does the work of hash and aggregate all at once, and it gets a win by doing this, which is kind of cool. Here's what we do: while we're running our hashing algorithm, we aggregate as we hash. First we partition, just like we always did in hashing: take that coarse-grained hash function, generate output partitions, and spill them to disk. But when we read things back in from these hash partitions, rather than populating a hash table with all the tuples, we aggregate them as they come. What do we build in memory for a partition? Just (hash key, aggregate state). So we have an in-memory hash table, which is basically a map from group-by column values to aggregate state; we use a hash function because it's our hash table. You read a tuple from the partition and figure out what group-by value it has. Say it's computer science: there should be an entry in the hash table for computer science, so mark that down as one. The next tuple you get may not be computer science, because we're just reading a partition off the disk; it might be zoology. That's cool: that goes somewhere else in the hash table, and that's one. Then you get another computer science tuple, and that's two. You keep going, reading that partition off the disk, sticking entries in memory, and bumping the aggregate. And this is all happening during the second phase of hashing. Previously we did all the hashing, generated the output, and passed it over to aggregation; now we're doing the aggregation right there.

So my question to you is: why is this better? Can anybody see why this is better than doing hash followed by, for each hash group, passing the tuples to aggregate? And I'll tell you, this could be way better. I'll give you a clue. Suppose what you're trying to count is the number of males versus the number of females, and suppose I tell you there are seven billion people in the world. So you have seven billion rows, but all you really want to know is the count of males and the count of females. How big is this in-memory hash table going to be? It's going to have two entries in it. And how big are these spill partitions going to be? Assuming there are B-1 buffers, each one is going to be the size of the input file divided by B-1, so each one could be gigantic, way bigger than memory. If we do normal hashing, we have to build a hash table big enough to hold all the tuples, which won't fit in memory, so we have to do recursive partitioning: partition it, then partition again, until it's memory-sized. With group-by, if it's just male and female, it doesn't matter how big the input is. Actually, in that case you could do this all in one pass: you just read the data, build an in-memory hash table with entries for male and female, and you're done. So that's pretty cool. I didn't really cover this, but you can even push the logic earlier: if the total number of distinct values fits in memory all at once, you can do your aggregation on the first pass, and it's just in-memory hashing. If it's a bit bigger, you need to partition, and those partitions can be pretty big too.

The key here is that group-by memory usage is proportional to the number of distinct groups, whereas plain old hash memory usage is proportional to the number of rows, or tuples, and those can be very different, like when the column you're grouping on is gender. So it can be a very big win to do hashing and aggregation together. That's the main point on this slide, and I think we're out of time.

With that: homework one is due tonight, as you probably know. Homework two will probably be handed out on Thursday. It's a partner homework, so you need to get your partners all locked down.
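To make the sort-based group-by from this lecture concrete, here is a small Python sketch (my own illustration, not code from the lecture; all class and function names are made up): an aggregate iterator that consumes tuples already sorted by the grouping key, keeps running state through an init / iterate / final interface, and emits one output row per group.

```python
# Streaming GROUP BY over input already sorted by the grouping key.
# Aggregates are defined by three functions, mirroring the
# init / iterate / final API described in the lecture.

class Count:
    def init(self):
        return 0                      # running state: count so far
    def iterate(self, state, value):
        return state + 1              # count ignores the value itself
    def final(self, state):
        return state

class Avg:
    def init(self):
        return (0, 0.0)               # running state is TWO things: (count, sum)
    def iterate(self, state, value):
        n, s = state
        return (n + 1, s + value)
    def final(self, state):
        n, s = state
        return s / n                  # divide sum by count when the group closes

def sorted_group_by(rows, key, agg, value=lambda r: r):
    """rows must already be sorted by `key` (that's the sort operator's job)."""
    current, state = None, None
    for row in rows:
        k = key(row)
        if k != current:
            if current is not None:
                yield (current, agg.final(state))   # close out the old group
            current, state = k, agg.init()          # reset for the new group
        state = agg.iterate(state, value(row))
    if current is not None:
        yield (current, agg.final(state))           # don't forget the last group

students = [("agriculture", 3.1), ("agriculture", 3.5), ("agriculture", 3.9),
            ("cs", 4.0), ("cs", 3.0), ("datasci", 3.7)]

counts = list(sorted_group_by(students, key=lambda r: r[0], agg=Count()))
# [('agriculture', 3), ('cs', 2), ('datasci', 1)]
avgs = list(sorted_group_by(students, key=lambda r: r[0], agg=Avg(),
                            value=lambda r: r[1]))
```

Note how AVERAGE's running state really is two things, a count and a sum, and the final function only does the division when the group closes.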
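The hash-agg win, memory proportional to the number of distinct groups rather than the number of rows, can be sketched the same way (again my own illustrative Python, not lecture code, with made-up names). Phase one partitions the input with a coarse hash function; phase two reads each partition back and bumps per-group aggregate state in a small in-memory table, so counting males versus females needs only two entries however many rows flow through.

```python
# Hash-agg in two phases, as in the lecture: (1) partition the input with a
# coarse hash function (in-memory lists stand in for spilled disk partitions),
# then (2) read each partition back, mapping group key -> running aggregate
# state and bumping the state per tuple instead of storing the tuples.

def hash_agg_count(rows, key, num_partitions=3):
    # Phase 1: partition. In a real engine these would be spilled to disk.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)

    # Phase 2: aggregate while reading each partition back. The in-memory
    # table holds one entry per distinct group, not one per input row.
    result = {}
    for part in partitions:
        table = {}                          # group key -> count so far
        for row in part:
            k = key(row)
            table[k] = table.get(k, 0) + 1  # bump the aggregate in place
        result.update(table)                # a group lives in exactly one partition
    return result

people = ["f", "m", "f", "f", "m", "f", "m"]
gender_counts = hash_agg_count(people, key=lambda r: r)
# gender_counts == {"f": 4, "m": 3}: two entries, however many input rows
```

In a real engine each spilled partition can be up to the input size divided by B-1 and far bigger than memory, while the per-partition table stays tiny; that gap is exactly the win described above.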