All right, so today is going to be a new lecture that we haven't done in previous semesters. There was only one paper listed, because there's really only one paper that does the awesome thing we're going to talk about. The general idea is the notion of taking logic that we normally have running in our application server and embedding it inside the database system, and we do this for performance reasons. So we'll start with the background on why you'd want to do this. The paper lays out the reasons, and they're sort of obvious, but it's worth understanding what's going on. Then we'll talk about the Froid technique, which is the main thing we want to focus on, because in my opinion it's probably one of the best papers written in the last two or three years in databases. The impact is significant: you see the numbers, and they report something like a 500x improvement. You get that for free, without having to buy new hardware and without having to rewrite your application. To me, it's very rare that you see a 500x improvement out of the box like that, and that's why I'm super excited about this paper. Then, if we have time remaining, I'll finish up with some general tips for working on your project this semester and beyond this class, in your career. Some of these tips come from other people's opinions and some are based on my own experiences about how to get started working on large software projects. Again, that's part of what we're trying to do in this class: teach you not only the database things, but also the practical software engineering things.

Up until now, everything we've talked about in this class has assumed that all the logic about what the application wants to do with data lives outside the database system: the database runs queries to answer questions, but what the application actually does with those answers, and how it applies them, all runs on the client side. Whether it's PHP code, Java code, or Ruby code, we don't care; that's running over there, and the database only interacts with the client through SQL. If you're using the conventional APIs we talked about, JDBC and ODBC, you're basically just asking the database system to give you data and it gives back a response, or asking it to store some data and it tells you whether that succeeded. So it looks like this: on the application server you have, say, some transaction, and it's a combination of SQL intermixed with program logic. The program logic is the same imperative code you'd find in a UDF: if branches, while loops, for loops, maybe calls to other things in the world. When the transaction starts, we invoke a SQL statement and send it over to the database server, which parses it, plans it, optimizes it, and does the query execution. It gives back the result, and then the database system stalls waiting for the next query request, because now the application server is executing some additional program logic. So in between all this back and forth, the database system has to wait to find out what you're going to do next. We won't be holding latches, because you don't hold those for long periods.
But we could be holding locks on tuples and records and objects in our database system, and we can only free those when we actually commit. So the idea of embedding database logic is that we take some of the code we had in the application server and run it inside the database system instead. We can do this through stored procedures, which I'll show on the next slide, or through user-defined functions, or UDFs, which is what the paper you read is about. The idea is to avoid these back-and-forth round trips between the application server and the database server: ideally, we want to run our logic close to where the data is actually stored. So looking at this example, we can take all of this and embed it inside the database system as a stored procedure. Now the only thing in our application code is the invocation call; in most systems' SQL you call EXEC or EXECUTE. We invoke the function, pass in some arguments, and the whole thing executes directly inside the database system. So that's a stored procedure. A stored procedure can be run as a standalone function; a user-defined function has to be invoked as part of a SQL statement.

So let's go into detail now on what's actually going on in a user-defined function. A UDF is something the application developer writes, and the idea is to extend the capabilities or functionality of the database system to include additional logic that we write ourselves. It's just a basic function: you take in some input arguments, which are going to be scalars, and you do some kind of computation. That computation can include the standard imperative things like for loops, if clauses, and while loops, as well as invocations of other SQL statements and other functions — all the things you could normally do in your application program, you can now do inside a UDF. Then it returns a result, which can either be another scalar or a set of tuples, i.e., a table. The reason we do this is that there are certain things that are difficult to express in SQL that we may want to run inside our database system, and user-defined functions are a way to do that.

So let's look at an example: a user-defined function called customer_level. Everything in this example is based on T-SQL, which is the variant of the UDF language that SQL Server supports. It actually came from Sybase: SQL Server is a fork of Sybase. Around 1992 or '93, Microsoft licensed the source code from Sybase, ported it to Windows NT, and rebranded it as SQL Server. Sybase sort of stagnated after that and eventually got bought by SAP, whereas SQL Server is state of the art now — they're adding new stuff all the time, as you can see from this paper. The SQL standard for this is SQL/PSM, but nobody implements that directly. In Postgres it's PL/pgSQL, in Oracle it's PL/SQL. It all roughly looks the same, but the exact syntax is slightly different.
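Since the function is on a slide, here's a rough reconstruction of it in T-SQL. The orders table and column names are my guesses based on TPC-H, so treat this as a sketch of the shape of the UDF rather than the paper's exact code:

    CREATE FUNCTION dbo.customer_level(@ckey INT)
    RETURNS CHAR(10) AS
    BEGIN
        -- variables must all be declared up front, Pascal/Ada style
        DECLARE @total FLOAT;
        DECLARE @level CHAR(10);
        -- one query per invocation: total amount this customer purchased
        SELECT @total = SUM(o_totalprice) FROM orders WHERE o_custkey = @ckey;
        IF @total > 1000000
            SET @level = 'PLATINUM';
        ELSE
            SET @level = 'REGULAR';
        RETURN @level;
    END;

    -- invoked once per row from the projection list:
    SELECT c_name, dbo.customer_level(c_custkey) FROM customer;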
So we have a function here called customer_level. It takes in a customer key as an integer and returns a 10-character char. We always have to declare our variables ahead of time — I don't want to teach you the UDF language here, I'm just showing you what it does. This is like Pascal or Ada, if you're familiar with those older languages: you declare your variables at the very top; you can't declare them further down. So I declare two variables. Then I do a SELECT statement that computes the total amount of items this customer has purchased, and based on that total, I tell you whether they're at the platinum customer level — and therefore get preferential treatment — or they're a regular customer. Then I return the string I just computed. So if I want to get the customer level for every single customer, I can include my UDF in the output, or projection, list of my SELECT statement. As it iterates over every customer, it passes in the customer key for that row and invokes this function, which in turn invokes more queries — these up here — and returns the result. You could implement all of this in pure SQL, but putting it in a UDF makes it easier to program, because now it's a single function call.

I've already said why you'd want to do this, but let's go into more detail. First, you get nice modularity and code reuse: not every single SQL query that needs to know the customer level has to reimplement that logic of computing the total and having the if clause; you just write the function once and everybody invokes it. It also reduces the number of network round trips. That last example could be rewritten as a single query, but in other cases, if you have more complex logic, you may have to go back and forth with the application unless you can embed it in a UDF. And from an engineering standpoint, it's often easier to write and read a UDF than some convoluted SQL, because most of your programmers will know how to write imperative code in JavaScript, Python, or C++, while a lot of them won't know how to write the equivalent complex SQL. My example here was pretty easy to read, though there are more arcane aspects of this language, especially if you don't already know Pascal or Ada, that may trip people up.

All right, so those are examples of why you'd want to do this; there are clear benefits. But what are the problems? Where do things go bad? What did the Froid paper talk about? What's that? Efficiency is probably the biggest one, right? But why are UDFs inefficient? [Student: if you look at the previous query, when you run it across all the customers, it runs an individual SQL query for each customer, when you could have done it with a single GROUP BY and join.] Right, perfect. He said that in this function we have a SELECT statement that computes the total sum of items purchased for a single customer, which means that every time I invoke the UDF, I'm invoking that query over and over again. And setting up a query is not expensive, but it's not cheap either.
You set up a cursor, you allocate some memory, and then you have to do the lookup on the customer key. Instead, I could do one sequential scan, compute the totals for everyone, and then do my join against the table of totals I've already computed — whereas this thing invokes that inner query every single time, for every single customer.

So the first issue — that's on the next slide, about execution costs — but even before you start executing, the query optimizer doesn't know anything about what's inside your UDF. It's not declarative, it's imperative: you're telling the database system "here's the logic I want you to execute," rather than SQL's "here's the answer I want you to compute for me." The query optimizer can't reason about what your program is going to do, because it doesn't know which branch of your if clause you'll go down; it doesn't know what data it's looking at until you actually invoke it. Going back here: I don't know whether this customer will come out platinum or regular until I get the answer of this query, and I can't get that answer until I actually compute it. So the optimizer can't optimize anything down here, because it doesn't know which path execution will take — it basically treats the UDF as a black box. I think the paper mentions that the optimizer just assumes the cost of the UDF is almost zero, but in many cases that's not true.

The other issue is cross-statement optimization: you're essentially executing the steps one after another, and the system can't reason about the queries inside the UDF, and therefore can't optimize across multiple queries within the same UDF. You're in the same execution context — the UDF executes as part of the same query — and if you have a bunch of queries that all operate on the same tuple, then rather than doing a scan to find that tuple for every single query you execute, you could combine them, jump to the tuple once, and do whatever you want to compute. That would be way faster, but the system can't do it, because it doesn't know what queries you're actually going to execute. Another example: you can construct queries from strings — append a bunch of strings together to make a query and then invoke it. That case you can't optimize at all, because you just don't know what the query will be until you run it.

The next issue is that for complex UDFs you can't parallelize: you have to invoke the function and run it as a single thread. In the case of SQL Server — and it's probably true for the other commercial systems, I just don't know — if your query has a UDF, they execute the whole query single-threaded; they don't let it execute in parallel. Even though it might be possible to run the UDF for every customer in parallel, because there could be weird correlations between the different UDF invocations, they have to execute in a single thread.
The last issue — we sort of said this already — is the lack of cross-statement optimizations. Now, in the case of SQL Server, they did try to speed these things up by compiling the UDF to machine code instead of interpreting it, and invoking that. That gets you a little improvement, but it doesn't solve the other problems: the query optimizer still doesn't know what these things cost, and it still can't optimize across multiple queries within the same UDF. Just because you compile to machine code doesn't mean the plan gets any faster.

Postgres actually does the worst thing, in my opinion — it makes sense from an engineering standpoint, because it was easy to implement, but from a performance standpoint it's bad. Going back here: this thing is a query; this line just takes the sum and stores it in the total variable. So we execute that as a query, fine. All of these other statements you could just interpret: if total is greater than one million, go down this branch, otherwise go down that one. But the way Postgres does it, it converts all of these into SELECT statements. This one becomes a table-less SELECT — is total greater than one million, return true or false — which gets sent to the query optimizer, turned into a query plan, invoked in the engine, and then you get back the result. So every single one of these lines is another SELECT statement in Postgres. Again, it makes sense from an engineering standpoint: rather than having your own separate interpreter for these operations, you let the database engine do it for you, because you already have a good engine sitting there. But that means multiple round trips from your UDF into the optimizer and back. And if the query you're invoking then invokes another UDF, you get all these recursive calls.

So how bad is it? This is an example from the guy who wrote the Froid paper — he sent me his slides. This is TPC-H query 12, modified to have a user-defined function that just looks up the customer's name from the customer key. All the function does is take the customer key and do a lookup in the customer table, and as far as I know every customer in TPC-H has a name, so the predicate always evaluates to true. But for every single tuple, you invoke this function, which invokes that other query. This doesn't happen in real TPC-H; he just added it. The original query, running at scale factor one — a one-gigabyte database — takes 0.8 seconds. If you add this UDF doing that really simple lookup, you're up to 13 hours, because for every single tuple it sees, it does that extra lookup. I'm actually surprised it took that long, but this shows you how bad this actually is. We'll see in a second that Froid can inline this function into the outer query and get it down to 0.9 seconds. That's why I'm saying this is such a big deal: if your application had this stupid UDF in it, for whatever reason, the performance would be terrible. If you removed it, you'd get back down to the 0.8-second performance; but Froid can keep your application exactly as it was and almost match what you'd have without the UDF.
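I don't have his exact slide, but the modification is something like the following — a hypothetical reconstruction, with made-up function and predicate details, just to show the shape of the problem:

    CREATE FUNCTION dbo.get_customer_name(@ckey INT)
    RETURNS VARCHAR(25) AS
    BEGIN
        DECLARE @name VARCHAR(25);
        SELECT @name = c_name FROM customer WHERE c_custkey = @ckey;
        RETURN @name;
    END;

    -- a predicate like this added to Q12 is always true (every customer
    -- has a name), but it forces a function call plus an extra lookup
    -- query for every single tuple the outer query examines:
    SELECT ... FROM orders, lineitem
    WHERE ...
      AND dbo.get_customer_name(o_custkey) IS NOT NULL;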
And that's why I think this is amazing. All right, so let's talk about the history of UDFs in SQL Server and how they ended up at Froid. UDFs were first added by Microsoft to SQL Server in 2001. Then in 2008, people started to realize that UDFs had problems and would wreck performance, and there's a blog article from a well-known SQL Server DBA where he said these UDFs should be "considered evil personified." So everyone in industry knew these things were problematic, but Microsoft didn't really say anything until 2010, when they acknowledged it: there's a blog article — I think they reference it in the paper — from Microsoft that says, oh yes, UDFs are bad; the guy even repeats the "evil personified" line. The problem is that when you go to the URL now, it looks like they took it down, and there's no copy on archive.org, so I can't find the blog article anymore. When I post the slides, you can click the link that this image points to, and it'll take you to that page. I don't know whether Microsoft marketing realized, "we've fixed our UDFs now, let's go remove all this old stuff," but it would have been nice to still be able to see it, and Google doesn't have a cache, which is disappointing.

So by 2010 people knew UDFs were problematic, but everyone kept using them, because they're a nice programming abstraction for developers. Like I said, instead of cobbling together some funky SQL statement to do exactly what you want, you can write a little T-SQL to do more complex things and just have your query invoke it. Now, the author of the Froid paper is Karthik Ramachandra. He was a PhD student at IIT Bombay, and he did early work on how you can decorrelate UDFs and do some early inlining — some of that is a precursor to what Froid does. He published a really good early paper in this area; then he finished his thesis and joined Microsoft at the Gray Systems Lab in Madison, where he started the Froid project in 2015. It was an early initial prototype at first, and then it snowballed: it got bigger and bigger and more people got involved. And it got so good that by 2018 they had added Froid to SQL Server 2019. I don't actually know whether 2019 has been released yet, but there's an early preview version available that actually has this in it, and you can go look at the documentation on how to use it — in SQL Server and, I believe, Azure SQL Database. So it's pretty crazy to think that this guy went from graduating in 2014, to starting the project in 2015, to having it hardened and shipped within a couple of years — so good that they put in the energy to get it into production. The turnaround time from research to running in production was really short here, and that's amazing. Database systems are usually very conservative: you don't want to put out a new feature and have half your customers get really good performance while the other half regress. And there are some regressions.
We'll see some results at the end, but to go from here to here in such a short amount of time is amazing. So kudos to him. And obviously Froid is the internal research name of the project; if you go back to the product documentation, Froid isn't mentioned anywhere. It's like Hekaton: Hekaton was the internal name, all the researchers know the project as Hekaton, but that's not what it's called in the product.

Okay, so what is Froid actually doing? Froid is a technique for automatically converting UDFs into relational algebra expressions that can then be embedded inside the query that calls the UDF. So no longer do we treat the UDF as a black box — this opaque function call that produces some output. We can now reason about what the UDF is going to do and optimize for it, because we embed it as a first-class relational expression in our query plan. And the great thing is that you don't have to change anything in your application code: whatever UDF you were throwing at SQL Server 2016, when you throw it at the 2019 version, it gets automatically sped up — assuming it's the right kind of UDF.

I really like the paper too, because it's very easy to read. They don't go into the nitty-gritty details of the programming-language side, which I don't care about — I assume someone could figure it out, but for our purposes it doesn't matter. They talk about how they decided to put the Froid transformation in the query-rewriting, or binder, phase, before you get to the cost-based query optimizer. The idea is that these are all static transformation rules; there are simple heuristics to decide whether it's okay to transform something or not, like how deep you want to go in the nesting. But you can do all of this without involving the cost-based query optimizer, and therefore without modifying that part of SQL Server. You have this shim layer between the binder and the query optimizer where you do the rewriting and inline everything, and then when you hand the result to the query optimizer, it just looks like any other query plan, and the optimizer can do all the optimizations it normally does. There's one team in this class doing a similar thing — rewriting expressions between the binder and the query optimizer, where you don't need a cost model.

We'll talk more about these transformation rules for subqueries when we cover query optimization later this semester — I'll talk briefly about it on the next slide. But the bottom line is that the commercial systems have really, really good techniques for unnesting subqueries to make them more efficient. The worst-case scenario for a subquery is essentially what UDFs were doing: for every single tuple, you call the subquery. But in some cases you can rewrite it into a join, and we know how to optimize joins efficiently, so then we can run that.
So that's essentially the goal here: convert everything into subqueries using the APPLY operator, or lateral join, and hand that off to the query optimizer, which can then rewrite it into a join query. Froid just does the transformation; it doesn't worry about any of the higher-level query optimization techniques the optimizer already has.

All right, so: subqueries. Again, we'll cover subqueries in more detail when we talk about query optimization later this semester. The basic idea is that a subquery is almost like a function, the same way a UDF is, and to avoid calling it for every single tuple, there are two techniques to unnest it. The first is to rewrite and decorrelate it into a single-level query with joins. The other approach is to decompose it: pull the inner query out, execute it before the main query, store its result in a temporary table, then do a join between the temporary table and the original query, and throw the temporary table away when you're done. I won't talk about that one, but I'll cover the first one very briefly.

Say we have a simple example: we want to get all the sailors who reserved a boat on a particular day. Think of this as a sample application that keeps track of sailors, what boats they rent, and on what day. The nested query says: for every sailor in the sailors table in the outer query, return true if there exists a reservation for that sailor on this particular day. The only thing we care about in this inner query is whether we have a match for whatever tuple we're looking at in the outer query. So think of a for loop iterating over every sailor in the outer table and running this inner query — of course, that would be slow. Instead, we can rewrite this part, because it's essentially a join: does something in one table equal something in another table? And we can easily rewrite it like this. This is what the query optimizer does after Froid runs; this is why we want to get things into nested-query form.

Now, I should have shown an example of this — I ran out of time, so I apologize — but the paper talks about the APPLY operator. As far as I know, in other systems these are called lateral joins; in SQL Server it's the APPLY operator. I should have checked what the SQL standard calls it — in Postgres or Oracle you need the LATERAL keyword. The basic idea: in my example, the inner query was in the WHERE clause, but if you want to put it in the FROM clause, you need this lateral join or APPLY operator, which allows the inner query to reference tuples from the outer query. In my example, because I was in the WHERE clause, I know what tuple I'm looking at — I'm essentially doing a for loop over the sailors table — so I can reference it inside the subquery. But if I put it in the FROM clause, I'm not iterating over the outer table; I'm looking at the two tables as wholes, so normally I can't correlate one tuple here with one tuple there. Here's a sketch of both, below.
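A sketch of both forms, with made-up table and column names in the style of the classic sailors/reserves schema (the date is just an example value):

    -- correlated version: conceptually, the inner query runs once per sailor
    SELECT S.sname
    FROM sailors S
    WHERE EXISTS (SELECT 1 FROM reserves R
                  WHERE R.sid = S.sid AND R.day = '2019-04-02');

    -- decorrelated version: a single-level join the optimizer knows how
    -- to execute efficiently (DISTINCT guards against duplicate matches)
    SELECT DISTINCT S.sname
    FROM sailors S JOIN reserves R ON R.sid = S.sid
    WHERE R.day = '2019-04-02';

    -- CROSS APPLY / LATERAL: lets a subquery in the FROM clause reference
    -- the current tuple of the outer table, which is otherwise illegal
    SELECT S.sname, X.cnt
    FROM sailors S
    CROSS APPLY (SELECT COUNT(*) AS cnt
                 FROM reserves R WHERE R.sid = S.sid) AS X;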
So that's essentially what the lateral join, or the APPLY operator, does: you essentially have a for loop over one table, and the inner query can match against the tuple from the outer loop — the same thing we did before in the WHERE clause in my last example. Does that make sense? I'm seeing blank faces. All right. The details don't matter; just think of two nested for loops, where the inner loop is allowed to reference the tuple from the outer loop. You normally would not be able to do that without the LATERAL or APPLY keyword.

Okay, so how does Froid work? There are five steps. First, we transform the imperative statements, putting them into SQL form — essentially turning the T-SQL into SELECT statements. Then we break the UDF into regions, so we can reason about each region individually. Then we merge the expressions for these regions into a single SQL statement. Then we inline that merged expression into our outer, calling query. And then we're done, and we run it through the query optimizer. I'm going to walk through an example and show what's going on at each step. Again, I don't care about the programming-language semantics of this, and we're going to operate on SQL text because that's easier for humans to read as we go through the examples. In the actual implementation, they're not doing transformations on SQL statements; they're doing transformations on the low-level relational operators, or relational expressions.

All right, first step: transform the statements. We have these UDFs in T-SQL — it could just as well be PL/pgSQL or PL/SQL, it doesn't matter; they do it for T-SQL. These imperative statements are not valid SQL, so we need to convert them into valid SQL statements that we can then combine when we do our transformation. First case: SET the variable level equal to PLATINUM. They just convert that into a SELECT statement that assigns the value PLATINUM to an attribute called level — almost a one-to-one correspondence: for every variable referenced here, generate an output attribute in the SELECT statement over there. Next, we take the output of this summation and assign it to the total variable. Essentially the same thing: we compute the inner query, and the outer query is a table-less SELECT — there's no FROM clause — that assigns the output of that computation to the total attribute. Last one: if total is greater than a million, set the level. We can rewrite this as a CASE statement: CASE WHEN total is greater than one million, THEN output PLATINUM, otherwise NULL. I'm showing NULL here because nothing else comes after this. And again, this is a table-less SELECT: we do the computation without any table, and we assume the total value comes from somewhere else — from up here, not from a tuple we're examining, because there is no table. So this converts all the T-SQL statements into SELECT statements. Here's roughly what those three rewrites look like, below.
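Roughly, using the customer_level example from before — my paraphrase of the rewrite rules, not the paper's exact output:

    -- SET @level = 'PLATINUM';                     becomes a table-less SELECT:
    SELECT 'PLATINUM' AS level;

    -- SELECT @total = SUM(o_totalprice)
    -- FROM orders WHERE o_custkey = @ckey;         becomes:
    SELECT (SELECT SUM(o_totalprice)
            FROM orders WHERE o_custkey = @ckey) AS total;

    -- IF @total > 1000000 SET @level = 'PLATINUM'; becomes a CASE:
    SELECT CASE WHEN total > 1000000
                THEN 'PLATINUM' ELSE NULL END AS level;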
Now we break the UDF up into regions and generate the corresponding SQL for each region. In the first region, we declare the two variables, total and level, and assign the summation into total. So we create a new derived table — call it ER1 — and it's the same as before: we take the output of the sum and assign it to the total attribute, and since at this point nothing has been assigned to level, we make level NULL. The second region is what we showed on the last slide. Since we know we assign level inside this branch of the if statement, our CASE expression pulls total from up above — ER1.total comes from the first region. So we check whether total is greater than a million; if so, level is PLATINUM. Otherwise, we set it to ER1.level, meaning: unless we go down this if branch, level stays whatever it was up here — and in the first region's SELECT we set it to NULL. Same thing with the last region, the else branch: for simplicity, this is shown as, if ER1.total is less than or equal to one million — the else condition — then set level to REGULAR; otherwise, it stays whatever it already is. By reasoning about the total logic of the UDF, we know that if we don't go down the if branch, we always go down here. Now, whether they actually rewrite it exactly like this, or whether it could just be a SELECT that immediately assigns REGULAR, I don't know; the example wasn't clear.

The last region is the RETURN of the value of level. We'll see later how this is used when we combine all of these and merge them into a single expression. But the paper talks about how you could have multiple RETURN clauses that return values at different points in the UDF, and you need to account for all of them. In our case it's simple, because we know exactly which path we take and always hit this one return, but you could have had a return up here, and then they'd have to reason about always returning the right value. Essentially, they create an implicit variable called returnValue, and assign it the same way they assign the other variables in the SELECT statements.

So now we have one SELECT statement for each of our regions, and we want to merge them together. This is where the lateral join operator — CROSS APPLY in SQL Server — comes in. In this case, we cross-apply, or lateral join, ER1 with ER2, and because we're in the FROM clause, the second query would not normally be able to reference anything from the first query on the other side of the join; but because of the CROSS APPLY or lateral join, it can. That's why the second query can see the value of an attribute that was defined or set up above. Yes? [Student: in the last block, shouldn't it be taking it from ER2.level?] The question is whether the last block should take it from ER2.level. Yes, you're right — that's a typo.
So he's saying this should be ER2, because you need to get level from whatever the previous region set, not from the one before that. And actually, that's a typo too — that comparison should also reference ER2. [Student: it looks like each region should depend on the region immediately before it for any variable it has not set itself.] Yes — his statement is that all the variables you read in one region should come from the region that came immediately before, because normally these statements would execute sequentially. So you're right: this should depend on this, and this on that. That should be ER2. Actually, one thing I don't really understand — I emailed him, though he didn't respond — is why, in this case, the two conditions have to be inverses of each other. I guess it's because you need to know whether level actually got set or not. But in theory, if you didn't have an if/else, why not embed all of this inside the else region and have one region cover both? I don't know why he did it that way.

Okay. So again, we use the CROSS APPLY to let us execute these regions one after another while each one can see the data that came from the previous one. At the very top, region four was the RETURN statement — that's where we actually return things, because the only thing we care about at the end is the output of the final query: whatever ER3.level is, that's our return value up there. All the inner ones are region one, region two, region three.

All right. Now that we have this giant SQL statement, we just inline it into our original query. The original query was: for every single customer, compute their customer level. So we replace the UDF call with the query I showed on the last slide, treating it as a nested query, or subquery — just plop it in there. You can see all the different regions inside it. Then we throw this at the optimizer; at this point it's just a more complex SQL statement — there's nothing fancy, no notion of a UDF anymore. We shove it at the query optimizer and let it do its thing. As I said, the query optimizer has a combination of cost-based search plus a bunch of rewrite rules that let it reason about what's in the SELECT statement and do more complicated things. So it can take this complicated giant statement with — what, one, two, three, four, five, six SELECT statements — and rewrite it into a statement with just one nested SELECT. These two are equivalent and produce the exact same answer, but this one is obviously way more efficient to execute: now there's just this little inner query computing the total amount of items bought for every single customer, grouping by the customer key; that gets materialized as a temporary table; and then I just do my join against the customer table based on that. That's way more efficient than before, where we invoked the summation for every single customer. We compute it once for all customers, efficiently, with a sequential scan the way we talked about before, materialize the result, do the join — which we'll talk about next class — and we're done.
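Putting it together — again a sketch with my own table names, and with the ER2 references fixed the way we just discussed. First the fully inlined form, then the kind of query the optimizer can rewrite it into:

    -- UDF fully inlined into the calling query via a chain of CROSS APPLYs:
    SELECT C.c_name, F.return_val
    FROM customer C
    CROSS APPLY (
      SELECT ER3.level AS return_val
      FROM (SELECT (SELECT SUM(o_totalprice) FROM orders
                    WHERE o_custkey = C.c_custkey) AS total,
                   CAST(NULL AS CHAR(10))          AS level) AS ER1
      CROSS APPLY (SELECT CASE WHEN ER1.total > 1000000
                               THEN 'PLATINUM' ELSE ER1.level END AS level) AS ER2
      CROSS APPLY (SELECT CASE WHEN ER1.total <= 1000000
                               THEN 'REGULAR' ELSE ER2.level END AS level) AS ER3
    ) AS F;

    -- what the optimizer can turn it into: compute all the totals once with
    -- a scan plus GROUP BY, then join, instead of one lookup per customer
    SELECT C.c_name,
           CASE WHEN T.total > 1000000 THEN 'PLATINUM' ELSE 'REGULAR' END
    FROM customer C
    LEFT JOIN (SELECT o_custkey, SUM(o_totalprice) AS total
               FROM orders GROUP BY o_custkey) AS T
      ON T.o_custkey = C.c_custkey;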
I don't remember them having numbers for this particular example, but in the paper they report up to 500x or 800x improvements for some UDFs. It should be obvious why: there are no function calls anymore; we can run this in parallel very efficiently; and we can reason about what the UDF is actually doing, because it's plain SQL now. So the cost model in our query optimizer can pick the best plan based on the data layout, the data location — all the things we want our database system to do for SQL statements, we can now do, because the logic is no longer locked up in a UDF. No one's as excited about this as I am? Or is this "who cares"? I'll mention that we tried to optimize UDFs here at CMU: in the same way we generate machine code through LLVM for SQL, we did the same thing for UDFs — take whatever the UDF was and inline its machine code directly into the machine code of the query. But we couldn't do any of the query optimization stuff they can do; we just removed the overhead of interpretation and function calls. Whereas this thing — again, this is a big deal.

Okay. So beyond the five steps we talked about, you also get some other cool optimizations, because everything is inlined as a query. You get some of the classic optimizations that a compiler for an imperative programming language would give you, except the SQL Server query optimizer gives them to you — again, without changing anything in Froid or in your application. Say we have a really simple UDF: given an integer value, if the value is greater than 1000, set the result to "high," otherwise "low"; then append the string " value" to that result and return it. And say we have a really simple query: SELECT getVal(5000). We already have a cached query plan for the UDF that we can inline into this query, and then we throw it at the query optimizer and rewriter to let it optimize further. So first, we take this and convert it into the Froid form, which is just a SELECT statement doing the lateral join between the regions.

The first optimization we get is dynamic slicing. This basically says: since I know what value is being passed in as x to this function, I can chop off — slice out — the portions of the program that I know I'm never going to reach. Since I know the value is 5000, it will always be considered a "high" value, so I don't need the else clause, and I don't even need the if clause: I can always set val to "high." Same thing in the Froid program: since I know the value is 5000, I don't need the CASE statement anymore; I just say SELECT 'high' AS val, hard-coding exactly what it is. Here's a sketch of that example, below.
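A sketch of that example — my reconstruction of the UDF from the slide, so the exact names are guesses:

    CREATE FUNCTION dbo.getVal(@x INT)
    RETURNS VARCHAR(20) AS
    BEGIN
        DECLARE @val VARCHAR(10);
        IF @x > 1000 SET @val = 'high';
        ELSE SET @val = 'low';
        RETURN @val + ' value';
    END;

    -- the calling query passes a constant:
    SELECT dbo.getVal(5000);

    -- dynamic slicing: @x is known to be 5000, so the else branch (and the
    -- whole CASE in the Froid translation) is dead; that region reduces to:
    SELECT 'high' AS val;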
The next optimization you get is constant propagation and folding. Since I know that whatever val is will get concatenated onto the string here, rather than storing that value in a separate variable and then concatenating, I can just store the final string directly: "high value." I don't have to worry about the string concatenation at all; I take the exact string I need and propagate it. Same thing in the SQL: since I know I'm going to take SELECT 'high' AS val and append " value" to it to produce "high value," I don't need the APPLY, I don't need the join — I just select "high value" as the return result. The last step is removing dead code. I don't need to declare the val variable, because I'm always going to produce the string "high value," so the function just returns "high value." Same thing in SQL: I don't need the nested query that returns "high value" and then selects the return value out of it; I just select "high value."

So again, this is awesome — this is the shit. The database system will do this for you automatically. This is the query-rewriting stuff: we don't need any cost model to figure this out, because we know we can always do it, so we just do it. [Student: it's never going to be bad?] Yeah, as far as I know, it's never going to be bad, so you always do it. Rather than having to reason about the UDF code to figure out how to transform it this way, we get it for free in our database system. That's why I think this is amazing.

So what can they support? As of SQL Server 2019, they support DECLARE, SET, SELECT, IF/ELSE, RETURN, EXISTS, NOT EXISTS, and a bunch of other operators. They talk about UDFs calling other UDFs, and how far you want things to recurse — I forget whether they said they have a hard-coded threshold for deciding when to cut off. There are aspects they can't support, like dynamic queries: if you're generating SELECT statements by concatenating strings, they can't handle that, because you don't know what the string will be until you actually run it. And it works for all SQL data types. They support T-SQL, but you can imagine extending this to other programming languages; a lot of systems now let you write UDFs in JavaScript, Java, and other things. Converting those into SQL semantics is slightly trickier, but it's mostly engineering, not a fundamental scientific problem.

So how many people can this actually help? They took the top 100 SQL databases on Azure, looked at their UDFs, and tried to figure out what percentage of the UDFs people are running could actually go through Froid and get an inlined version generated. They found that overall, across that fleet, about 60 percent of the UDFs were inlineable. That's pretty significant: 60 percent of the UDFs people are running in Microsoft Azure are simple enough that Froid just works. This is why we stopped the compilation work for PL/pgSQL in our own system — it'd be hard to beat something like this, and the simple cases are already taken care of. Then they evaluated a bunch of workloads — three simple workloads — and again: 84 percent, 91 percent, 95 percent.
A large percentage of the UDFs that people actually run in real systems can be optimized by Froid. That's pretty awesome. Then they did an evaluation of what the actual benefit is. These results are sorted by the size of the benefit, on a log scale, and they show that for these two particular workloads there are some UDFs where you get an 800x improvement over the original version. When do you ever see that in computer science? Almost never. The architecture guys are lucky if they squeak out a 1 percent improvement, and that's a big deal. To get 800x without changing anything in the application, without buying new hardware, purely through software — that's amazing; you never see that. Now, there are some examples at the tail end where you actually get worse performance. I forget the reason they gave, so it's not to say this thing is magical and solves all the world's problems. But in my opinion, the benefit you get at this end of the distribution versus that end is totally worth it. Some people got screwed, but who cares — look at how much better you are on average overall. The other approach we talked about was compiling the UDF into machine code. That's nice to do, and it's orthogonal in some ways to what Froid is doing: you could still do the inlining and then compile the inlined version. But compilation helps a little, not at the scale we saw on the last slide. Froid is the right way to do this, in my opinion. Microsoft has a patent on it, so maybe no one else will be doing this — though I'm not a lawyer, so I don't want to speak to that. To me this is quite significant, and I'd be interested to see how other systems approach the same problem.

All right, that was a bit rushed, but any questions about Froid? Again, this doesn't really fit into the narrative of what we're covering — next class we're talking about hash joins — but in my opinion, it's worth knowing about.

All right, so in our remaining time, I want to talk about life lessons for working on large codebases. I don't know why I put him on the slide; I needed to put something there. [Exchange with students about the picture.] I guess this is a large project, and a country is also a large project — but that one's not exactly collaborative, right? We need to work together as a team. There we go. That's what it is. For the people.

Just as a disclaimer: I don't want you to take away the impression that what I'm saying here is exactly what you must do, and that if you don't do it, you're failing, or you won't be successful, or you'll have trouble working on systems. These are things I've figured out in my own time on this earth about how to approach large source code bases. This comes from working on what's now two and a half databases — the current one is the half at this point; it'll be three soon — and also from work I did before my PhD.
I worked on Condor, which is a major distributed batch processing system. I've also done legal work, where they give you a bunch of source code and you have to read it and figure out what's going on — and you can't talk to the developers; you have to do everything on your own. So these are things I've learned that help me work on large codebases and understand what's going on, and I want to impart them to you, so that going forward in your projects, when something's not making sense, you can try some of these tricks. Now, you're fortunate that all the people who have been working on our system since we started rewriting it are still here at CMU — Wan is still here, Tin is still here, Mat is still here. But years from now you may go off to a company, or next year's class will be looking at your code, and you'll be gone, and they can't call you up and ask. So I think it's very important to be able to work independently. It's very unlikely that at any point in your life you'll be working on a database system, or any system, from scratch — no one's going to hire you to sit down on day one and write the main function of something brand new. And even if you do, chances are you'll be using third-party libraries and other packages, because you don't want to write command-line flag parsing yourself; there are packages that do that for you. And then you have to go understand what those things are doing. So these are general rules that I think help you navigate complex software.

The other thing I'll say: when I talked to my friends at database companies and told them I'm teaching a databases class, I asked what they wanted me to cover. B+ trees? Locking techniques? What do you care about? And they all pretty much said the same thing, without me telling them what the others had said: they want to hire students who can work on a large codebase independently. They want to hire someone who can go through the bootstrapping or onboarding phase and then start making contributions to the source code without having to bother everyone else on the team. So again, that's why I think these things are really important.

So the dumbest thing you can do — the biggest waste of your time — is what I'll call passive reading. This is where you just read the source code for the sake of reading it, to try to figure out what's going on. A lot of times students email me saying they want to work on our database system at CMU as an intern, or get started on a project, and they tell me, "I've been reading the source code for the last two weeks." And I'm like, why? You're not going to get anything out of it. Unless you're reading with a specific purpose, you're just going "this calls this, this calls that" — you won't absorb it, because you're not trying to solve any particular problem; you're just trying to build a mental model of what's going on in a vacuum. So this is the worst thing you could possibly do; I do not recommend it.
I think it's really important to get your hands dirty — or your feet wet, whatever metaphor you want to use — and jump into the source code and do something, even something small, to get started, because then you go through the whole process of how to build it, how to test it, and so on. The easiest way to do this is to write test cases. Now, you may think that's boring — who wants to write test cases? But you're going to go out into the real world and write test cases. Even if there's a QA team that does a lot of testing for you, they'll want you to do some initial testing before you hand things off. No code you write is going to be the holy gospel; it will have bugs. So you're going to have to write test cases anyway. And one way to start understanding the source code is: if there's a part of the system you want to understand, write a test case that tests your assumptions about what that piece of code does, and see whether that matches what the test actually does. What's nice about writing test cases — unless you're doing it wrong — is that they should be independent of the actual implementation of the thing you're testing, so you can't break anything. If your test case fails, who cares? Breaking your test case doesn't affect the rest of the system. Now, if you go add special test-only functions into the system to pull data out, that's a software engineering mistake; you shouldn't do that. The existing APIs should give you enough data or information about what a component is doing to write the interesting tests you want. And these can be either unit tests or higher-level tests like SQL regression tests. You'll make a lot of friends right away if you write tests: when you start out at a company and say, "I want to help you guys write some tests," they're going to love you. They may then try to push more test-writing onto you, which may or may not be what you want, but it's a good way to get started on something right away.

The next thing you can do is refactoring. The idea is that you find a piece of code you want to start working on — and of course it won't be up to the standard you would write — so you do some refactoring to improve it, make it a little cleaner. That can just be adding comments explaining what the complex operations are doing, cleaning up messy code, factoring out redundant code, maybe breaking things into separate functions. The point is that you're not changing the logic of the application or the system; you're just making it a little easier to maintain and understand for other people. You don't want to break anything at this point, because you don't really understand what's going on yet — there may be implicit assumptions about the system that you're not seeing in the part of the code you want to modify. Adding comments is a nice way to put your stamp down and explain what's going on without changing what calls what.

The other important thing is to actually build it. That sounds trivial, right?
Of course you want to build the software. But it's more than just running make: actually go through the entire process of getting it deployed on whatever build infrastructure they have, sending a pull request, going through the whole testing process. That gets you to understand the assumptions people are making about how the software is going to be run: what data is given as input, what environments this thing is expected to run in. There might be an internal document — we sort of have our own here, and other companies are the same — that explains exactly what the testing infrastructure looks like, but it's not real to you until you actually go and do it and understand what's going on. And if that documentation doesn't exist — it's just like comments — if there are no comments or documentation in the source code explaining what some complex piece of the system is doing, and you write it, people will love you. Same thing for the build process: if the doc doesn't exist, take a first stab at writing it. It'll also help you understand it better, because you have to explain it to other people.

All right, any questions about this? Again, I think just sitting down and reading software is not the way to go. Now, our current system is small enough that it doesn't have all these weird dependencies on a bunch of pre-existing source code. Many of you are working on projects using code that you have to write, that doesn't exist yet, but you're using pieces of the system that maybe were not designed the way you expect, or the way you're actually going to use them — I think all of you are probably in that boat. So in this case, that's why we ask you to write the design documents: by writing down exactly how you think you're going to modify the system, and what files you think you'll need to add and modify, you get to the level where you can understand the big picture of the system. You're never going to be able to understand all the individual components of a database system, or any other system. But you want to understand at a high level how the thing is interconnected, so you know the right place to add this piece and the right place to modify for that.

To give an example — I'm not trying to disparage the team from last year — we had another team do query rewriting, the same thing one team is doing this semester, sort of like the rewriting stuff we talked about today. They were doing rewriting in the SQL parser, in the query optimizer, and in an extra little shim layer they wrote themselves — three different places. And that's really confusing: your SQL query shows up, the query plan looks different from what you sent in, so you figure it must be doing query rewriting and go looking for where that actually happens — and you have to look in three different places to find it. That's an example where maybe they didn't have a full understanding of what the right place was to do query rewriting, so they did it wherever seemed feasible and easy to do.
From an engineering standpoint that was easier for them to write, but from a maintainability standpoint it's difficult for other people to handle. So, as I'm saying: tread lightly when making big changes, and make sure you have a good understanding of what's actually going on in the system. Writing test cases, writing some documentation, and actually going through the build process to see how all the different pieces fit together is a good first step, okay?

All right, so what happens next: next week we'll have the separate status meetings. I've already met with some of you; some of you I'm meeting this week, and some we scheduled for next week, but I'll send an email to everyone who hasn't met with me yet. The idea is that next week we'll meet one-on-one with your team for 10 or 15 minutes. You tell me what you're doing, what problems you're facing, and the current status of the project, and then we'll figure out what you need to cover when we have the status update presentations on April 8th. The other thing that happens on April 8th: you'll send a pull request on GitHub to our repository for the other team you've been assigned to review. I'll send instructions on how the review process works, but the idea is that you're paired with another team: you submit your pull request to them, they submit theirs to you, you both review and provide feedback on the code the other team has written, and then a week later you take the changes and suggestions and apply them to your source code — where appropriate; if they suggest something that doesn't make sense, you can just ignore it. Okay? So again, I'll send email about the status meetings next week, and you should be planning on the status presentations two weeks from now. Same as before, everyone gets five minutes, and we'll go in reverse order from last time — whoever went last goes first. That'll be two weeks from now. All right, next class we're doing hash joins: we'll discuss hash tables, hash functions, and then how to actually do a parallel hash join. Okay. All right, guys — see you next class.