 Yo, hey, yo, hey, yo, yo. Pack the chrome styles, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stones. I put heads to bed, lick shots, and rap in the bed, with the church a few times. Now I'm not trying to get me hit. What's that? I mean, for dental surgery, sure, yes. I like to think I'm known for other things. OK, all right. So today, guys, we're going to talk about server-side or UDS, which is a trend function, server-side logic execution. So in the intro class, we do cover this topic again. We revived it this past semester. But now, for this lecture here, we're going to see about why, in particular, user-defined functions are problematic, and some automated ways to actually improve this. So until now, for the entire semester, we've assumed that all the logic that we would have for application, it was typically called the business logic, or anything trying to explain what the application wants to do, that obviously exists in the application code, like the Python, Ruby, JavaScript, whatever. And so the way the application is going to interact with the database system is through what is called a conversational API. Next class, we will discuss this in more detail, what ODBC, GDBC looks like. But the basic idea is that you run some application code. At some point, it needs to get data from the database. So it runs a query. The database is in the process of that query. You give back results, and you use some processing on it, and maybe go back to the database server and get some more data. So it basically looks like this. You have some application code, some execution. At some point, you hit a SQL command. That's got to go over the network to the database server, which then has to do the parsing, the planning, the optimizing, and then actually execute the query, all the stuff that we've talked about so far. So while this occurs, this is typically going to be a blocking call. The newer version of the Python Postgres library, PsycogG, that has support for asynchronous query commands. So you can asynchronously call this and get the result back later as a future. Most applications aren't written like that. Most of them, this is a blocking call. So then now you get back the result from the database server. The database server is sitting idle, at least for this connection. If it's supporting multiple connections at the same time, then it could do other work. For our work for OLAP queries, this is not a big deal if it's idle because we're not going to hold any locks because we're not doing transactions. But then the application takes the output of this query, does some kind of logic on it. We're not going to find what that is at this point, but it's something that they didn't express in SQL. Then it's going to execute more queries, go back, get more data, do some other updates, whatever you want, do some more processing, and then commit and you're done. So this is how most applications are written today. So the idea of embedding the application logic inside the database system is that we want to move the things that we were writing procedural code in the application to put it now inside the database server. And the benefits of this are obviously fewer, fewer round trips between the application and the server because now it's one SQL command or the middle of a request over the network to go execute some query and then that program logic is now embedded and running inside the database server itself. We're not really going to discuss about listen to notify, but in some cases you can get notifications immediately like when it changes occurs through triggers. If we're not worried about transactions, but if you were holding blocks, there's less network round trips, so the time it takes before transaction can commit is reduced. But the other two benefits are going to be, the major benefits are going to be the application programmers don't have to re-implement the same functionality over and over again for different variations of the application. Typically you would have like there's the web interface and then there's the mobile phone interface. Those are re-talking to the application server, not the database directly. Some cases some applications do talk to the database directly and in that case, instead of having the application have to re-implement between the mobile phone and the web server version, some logic, you just can write these things as is UDS or stored procedures embed that in the database server, then all the different versions of the application can reuse this. And of course obviously we want to extend the functionality of the database. So it's going beyond what the built-in capabilities are. So this is actually the motivation of Postgres, right? When Stormbreaker was building Ingress, he always talks about how there was some customer that when they're trying to send Ingress to the banks, all the banks computed interest in using Julian dates, instead of the Gregorian calendar, right? So to notice support now doing interest for the banks, they had to go implement a new type and a new operator's new functions to operate on the date format that the banks wanted. So with Postgres, the idea was like, you may get extensible from the very beginning so that you don't have to rewrite the server or you rely on the application programmer, the person using the database to extend the functionality very easily without having to recompile the whole system. All right, so this UDS stuff really comes back to, Sybase was the early one in the 1980s, but this all came out in the 1980s because people wanted to extend the functionality of the system without having to recompile everything. So the four or five major types of embedded database logic are gonna be user-defined functions, UDS, which we'll focus on mostly in this class. Store procedures are basically UDS, except you can invoke them without having to embed them inside of a SQL query. Triggers get fired off when there's some event occurs, like a table gets updated, you fire trigger, and you're basically invoking a function. And then UDTs and UDAs, which I talked a little bit about before when we talked about project two, project three, these exist but they're not as common as the other ones. And so this survey here comes from the Freud guys in a follow-up paper from the paper you guys read where they looked at all the different object types that exist in real-world customers databases in Azure SQL. So their cloud version of SQL server. And what they find is that almost 70% of the different kind of embedded logic is gonna be store procedures. And UDTs are only 24%. So for this class, again, we're gonna focus on UDS, but some of the logic that we'll talk about today in Freud could also be applied for store procedures. Like there's multiple SQL queries and you combine these things together, right? So another side comment I'll say as well too, in the SQL server world, there's a strict economy or difference between UDS and store procedures. Store procedures in SQL server, they're the only function type that can update the database, like call updates, insert update deletes. In UDS, you can't, in SQL server, you can't call insert update, they have to be read only. In PLBG SQL and Postgres, they don't have that limitation. So you do all sorts of crazy things like your UDF that you call an aware clause for a select statement can then do updates of the database. But we'll ignore that today, okay? All right, so today's class we're gonna focus on UDS because that's what the Freud paper is about. But again, the same logic can be applied to store procedures. So at a high level, it looks like this. So we're gonna take this portion of our application code and now embed that as some function, I'm showing a pseudo code here. That gets now registered and stored inside the database system. So now I can rewrite my application to call the SQL statement that will then invoke whatever that function is that I had before, right? So in this example here, it's still two network round trips from these two queries. But you can imagine in other cases where I'm gonna have more complex things that execute more queries and everything's embedded inside the database server that I don't have this issue. So today's class, we're gonna start talking about background about UDS. And then we'll focus on, then we'll discuss again the Freud technique. And then we'll talk about an alternative Freud from a different set of Germans where you instead of converting from relational algebra expressions and injecting that or inlining that directly in the database server, they're gonna convert UDS into other SQL statements using CTEs. And then I've asked Sam to provide some additional commentary on his opinion about the papers because he's thought about this and either is, I guess the expectations or the claims in the paper don't actually match up with what happens in the real world when he and Kai were running benchmarks, okay? Okay, so I think I've already said this, UDS. The idea here is that the, sorry, what's wrong? What was wrong? I just thought it's not a rant anymore. Yeah, it's an erudite commentary. Yeah, it's more refined, okay. So UDF is gonna be a function that's gonna be written out again by the application developer that is gonna extend the functionality of the data system beyond what it's built in operations can support. All right? And so the computation that UDF is gonna perform could be anything you would sort of expect in a sort of standard imperative language, going beyond what SQL can actually do. So for loops, while loops, if clauses, it can invoke other SQL statements, it can invoke other functions, right? You knew things that would be quite difficult to do just purely in SQL. So the function is basically, all these functions will take in some input arguments, they'll be scalers or an array of scalers. You can't take in table or row sets. And then you do some kind of computation on them and then you return result and the result could either be a, again, to more scalers or additional tables and row sets, right? And you can embed these sort of anywhere inside of a SQL query. So the easiest type of UDF you can have are called SQL functions, where it literally is just the body of the UDF itself is just a bunch of SQL statements. So in this case here, we have defining the function, this is the name, this is our input arguments, then we have our return arguments, and then we define what the UDF type is, in this case here it's a language of SQL, and then the function body is the SQL query here. And I can have multiple queries and then the output of the return result of this function is whatever the return result of the last query that I execute, right? So all this up here, like the returns and the language, we'll see this in a second, but this is the UDF language in the SQL standard, is this thing called SQL PSM, which is derived or inspired by ADA, who here has ever heard of ADA, the programming language? Yeah, it's small, which is an extension of Pascal from the 70s, right? The guy that defined this for the SQL standard really loved Pascal or ADA, and that's why this all looks like this, right? So now I can invoke the UDF either in the select output here, I could put it, in this case here, because I'm returning a row set, I can put it in the select output, I can put it in the from clause, I can have it in the where clause, using again set operators, all right? You can also do other interesting things like provide annotations like this case here, in the SQL standard you can define this as being atomic, and then tells the database system that you should try to track the dependencies between what this thing is, what table this is calling and who's calling this, so there are some hints you can provide to the database server to tell it what this thing is actually gonna do, so that way if one UDF calls another UDF or this UDF calls a table, then it gets dropped, you would then throw an error to say this thing can't be dropped, right? So there are annotations you can provide about UDFs to give more hints to the database server, what the UDF is actually doing. Again, if it's a SQL UDF you can parse this, you can parse this in a part, you would know exactly what tables it's touching. The challenge is then when you start using a sort of a non-declared or non-SQL programming language to define your UDS, right, an external programming language. So again, the SQL standard since 1996 defines the UDF language in SQL as being this thing called SQL PSM, persistent stored modules, I think it's called, right? And so with the challenge though in SQL, in all cases databases, there's a SQL standard, nobody actually follows it exactly, right? Because Oracle and DB2, they define their own query language. Oracle had PL SQL, DB2 had SQL PL, but now they also support PL SQL. So SQL PSM looks a lot like Oracles, but not exactly, there are some minor differences, right? So again, SQL standards, like all the different vendors try to get together and say, hey, I have this new feature, I want it in the SQL standard, then somebody says, oh, my other feature looks very similar to yours, but mine's different, I want mine in there. Then you end up with like the lowest common denominator, and it's not exactly the same as anyone's system actually supports. So no one truly supports the SQL standard. And so the UDF language is a good example of this problem. So you can get, I think, for, I know from Postgres there's an extension to make it support SQL PSM, and the DB2, you can get pretty close to SQL PSM as well. Oracle, again, has their own thing. The one that's gonna be quite different, and what the Freud paper's gonna be focused on is the thing called transact SQL, or T-SQL. So T-SQL, transact SQL comes from CyBase. CyBase was, again, one of the first systems that supported UDS in the, actually it might be the first, it did support UDS in the world. In the 1980s, and again, before the SQL standard was a thing, they defined their own language called transact SQL. SQL Server used the transact SQL because it's a forker CyBase. Again, Microsoft bought a license to rewrite the source code for CyBase, to port it to Windows NT in the early 1990s, and then since then they've completely rewritten most of, if not all, of SQL Server, but it still maintains the compatibility with transact SQL. CyBase still exists, they were bought by SDP, it still makes them a lot of money, but no startup would say, hey, I'm gonna use CyBase, right? So these are the ones that we'll be focused on today, how to take these sort of procedural imperative languages to convert them into relational algebra and SQL. There's other sort of more common programming languages like Python and Java and C, if you're crazy, that you can write UDS in, and the different systems will support them in different ways. Obviously if you write your UDS in C, and you link them as a shared object into the database server, you can see everything in the address space of the database system, therefore you can do some stupid things that cause it to crash. Postgres will let you do that. Other systems like in Oracle will compile your CUDF into their own dialect of C, and then they'll run it either, I forget if it always runs in the case, or sometimes, but they'll run it as a separate process in a sandbox that'll go over like an IPC to send commands back for to the database server to prevent you from doing something you shouldn't be doing, because it's obviously a huge security hole if you can link in any shared object. That's written in C. All right, so let's look at an example of this. So this is gonna be coming now from the Freud guys, and so this is gonna be for SQL server, and so this UDF here will be in transact SQL. The easy way to tell whether you're looking at transact SQL versus PL SQL is if you see these at signs where they declare for variables. PL SQL or PLPG SQL doesn't use this. All right, so for this UDF, we're gonna get all the custom IDs from the order stable. We'll get all the custom IDs and compute their customer service level based on the amount of money you have spent as derived in the order stable here. Right, so the invoking query just as for every single customer, then you call this customer level UDF where you pass from the customer key, and you see here that now we define, we have a function, here's the body of it. Again, it's a remit of Pascal and Ada. You have to define all your variables at the beginning of the function. Who here has actually ever written anything in Pascal? That's what they told me when I was in high school. It was a long time ago. I don't know if, whatever, it's fine. But the, all right, so the Thomas Norman loves Pascal. That's what, he confessed that to me last week, or two weeks ago. Anyway, that was very surprising. And there are some database systems written in Pascal. So you clarify your variables ahead of time, then you do this summation query on the customer key, store the output of the summation into this total variable, and then if the value of this total variable is greater than a million dollars, then you set them as platinum, otherwise you set them as regular and you return that. Right? Pretty simple, right? So as I've already said before, the great thing about UDS is that they're gonna encourage, you'll get modularity, you'll get code reuse, because now, whatever that logic I have in my UDF, any query, assuming they have the right permissions to access it, can reuse it, without having to re-implement it themselves. And we're gonna get net around trips for complex operations. And of course, obviously some logic will be easier to express in a procedural language like PL SQL, then writing it as pure SQL, as we'll see in some examples in a second. That's debatable, right? Whether that's easier to read or easier to write for ML and data analysis tasks, maybe. But usually those things don't, you wouldn't express like a machine learning algorithm in PL SQL, you make a call to an outside function to do it for you, like something like PyTorch. All right, so these things we've already said before, this is kind of obvious, but let's talk about when things go bad, or why the UDS can be bad. So the number one problem is gonna be, is that the database system doesn't know what's in the UDF. It's gonna treat it as a black box because it's a procedural language and it doesn't know what it's gonna do till it actually runs it. Now my example before, it was pretty easy, right? You do summation, then there's an if clause. But in some UDS, they can do crazy things like declare a string and then start appending to that string to construct a SQL query, and then invoke whatever the SQL query that's defined in that string. So the database system has, it's gonna have no idea what it's actually gonna do at runtime because, again, it can't, it's not a declarative language. What the query's gonna do or what the UDS is gonna do is gonna depend on what row is looking at, what the data looks like and so forth, right? So the database system is not gonna be able to understand what's the cost of executing that function. And then in some cases where the, if the UDF is being used in a where clause as part of a predicate, it's not gonna do the selectivity of that predicate. So a really stupid example would be where value equals my UDF and I pass in some constant one, two, three, right? So what is the selectivity of this? Nobody knows, right? Because it depends on what this function's actually doing, right? So in most cases, the optimizer is gives up, or it doesn't give up, but it'll set it to some constant value that's really high. I say this is bad, but I have to put some value in. So it won't pick affinity, but it'll pick something that's arbitrarily high. I think there's a pound to find somewhere in Postgres that I forget what the actual constant is. So there's a pound to find, they say, if I see UDF, here's the selectivity of it. The additional challenges is that it's gonna be difficult for us to parallelize UDS because we could have correlated queries being executed inside of it, right? So because of this, so again, since we don't know what's going on inside of it, the database system is gonna be very conservative and it's gonna decide to execute the within a single query, each UDF one at a time, right? You can't run it in parallel with the ways we've been talking about before. You could run it potentially on, if there's a hint to say you know that the UDF is read only, or won't have any side effects, then you could have different threads run it in parallel on different portions of the table, like using the morsels approach that we talked about before. But if that's not the case, then you've got to really run it sequentially because you don't know whether the output of the some change that the UDF's gonna make that could affect the result of the next UDF invocation, right? Again, this is also the reason why the SQL server guys don't want you to put any updates into UDS because you don't want these weird side effects where depending on what order you execute the UDS, you make a different results for your query, right? You don't want that to happen. All right, it keeps going, it gets worse. Okay, so although the optimizer one's pretty bad. So the other challenge is gonna be now is that you can't, you're not gonna be able to execute the UDS in sort of using batches and vectorizing a processing model that we talked about before, because again, you don't know, you don't know what's going on inside the UDF and you don't know how it's gonna relate to sort of one invocation to the next. So Microsoft has coined this term called Rbar, row by agonizing row, so you can't do the vectorized approach that we talked about where they get a batch of tuples and maybe applied to the functional at once. In their world and their implementation, they're gonna execute the UDF for each row one at a time, right? And that's gonna be slow. So because now we're executing each UDF one by one, now it's gonna be challenging for us to do sort of cross statement optimizations within even the UDF because again, it can't reason about, I see this query and then this query and I could maybe execute them together as a single join or merge that with what's going on the calling query. Again, it just treats it as a black box and it can't do any additional optimizations. So we'll say this in a second, one of the optimizations that SQL Server did try before they got to the Freud stuff is that, well, they say, okay, well, the function's expensive to invoke because there's a sort of context switch for every single row, they'll compile it down to machine code using the query, similar to the query compilation techniques that we talked about before and that just makes the interpretation of the procedures of the commands inside the UDF faster but he still doesn't solve all the other problems where again, it's treated like a black box to the database system or to the query optimizer. All right, so here's another UDF, here's a query that's gonna invoke a UDF and this again from the Freud paper, this is a derived example from based on a TPCH query 12 where they've added this little piece here where they're gonna go do a lookup to see whether the customer key is gonna be null. And in this case here, the answer is it will never be null but again, because the UDF is a black box, doesn't know what this thing's ever gonna return so therefore it's gonna have to do this check every single time. So all the UDF is actually doing is just looking up given for given customer key doesn't look up in the customer table and just returns back the customer name. So again, the name of the customer table in TPCH is not null but because again, the optimizer doesn't know that this thing is just doing a lookup on something that can never be null, it has to execute every single time for every single row. So for the original query without this additional check it's gonna run in 800 milliseconds. But if I rewrite it at this UDF for a pretty simple check, it then gets converted to 13 hours for this same query on the same data. I think this is like TPCH scale factor one, so it's a one gigabyte database, all right? Again, and that's not saying this is a, this is not a knock on a SQL server because I think SQL server is a great system for what it is, for what it's trying to be, but this is just the nature of UDS of how bad things can get. Also too, we'll say, we'll cover this, I think starting next week, the SQL server query optimizer is probably the best one in the world. Can't do everything, the Germans have a pretty good one too, but in terms of like all the things that can do is probably the best. Leads for a commercial system. And if they're choking on this getting 13 hours, the process is gonna be a train wreck, right? So a preview I would say for, well, we're talking about in Freud, if you run the same operation with Freud and then inline this directly into the SQL query, they can get it to 900 milliseconds, right? So it's slower than the original query without positional check because you're adding additional join, but it's not the 13 hours blowout that you would get if you actually have the UDF. So, all right, I'm sort of jumping ahead to what Freud can do, but there's other things you could do to try to speed up UDF. We've already talked about the first one, you can do compilation, you can take the UDF, the procedural code, do a translation, convert it into either C or whatever intermediate representation you would have for in your system, and then compile that into machine code. So Oracle does this, I think Hyper could maybe do this as well, right? If you're doing a holistic query compilation that we talked about with Hyper, like converting the actual query plan into machine code, then you can just inline the UDF now to be directly, the code directly in type, like the where clause piece that you're compiling. But again, that doesn't solve the optimizer problem. The optimizer doesn't know what the selectivity of the UDF is gonna be. You can do parallelization that we talked about before, that you rely on the user to write annotations in the UDF and say, is it read only? What parts can be parallelized and so forth? MemSQL had this something like this, they call it MPL, the MemSQL programming language. I think I might have got renamed when they renamed the single store, but they had a parallel UDF language that was to have the basic idea here. And then the thing we're gonna focus on in today's class is the inlining. So how to take a UDF that's written in a sort of procedural code, convert that into a declarative form, and then inject that into the query that that's calling it. Either as another embedded SQL queries, inline SQL queries or nested SQL queries, or nested relational algebra expressions. Okay, any questions so far? All right, so let's talk about what led to SQL server building out Freud, right? So in 2001, Microsoft added support for and transact SQL for a scalar UDS, right? Seems like a logical thing, you know, database system what we wanna do because you wanna have people starting, you know, writing more complex queries and embedding the logic into your system. So this is around 10 years after the system was originally launched, the first version of SQL server. Then it actually came out in like 92, 93. All right, so then, but 2008, people started realizing that UDS are evil and there was a famous blogger on a book called UDS or Evil. They basically goes to all the things that we talked about before, right? In this example here, like if you add a UDF, they go from two seconds to, what is that, 38,000 milliseconds, right? Just by adding UDF, right? So it sort of became well known that UDS were evil amongst the developers. Two years later, Microsoft acknowledged that the UDS were evil. I'd even wrote a blog article that talks about how UDS are evil, right? Overall, evil personified, they wrote here, right? So this is a updated version of the blog article from 2016, but the original version came out in 2010. So then, in 2014, the guy that's gonna vet in Freud, Karthik Ramachandra, he finishes PhD in IIT Bombay on doing decorrelations of UDS. IIT Bombay is probably, is the best database school in all of India. So he got his PhD out of there. But then he joins the Jim Gray Systems Lab which is Madison, Wisconsin, named after Jim Gray, founded the Freud Project in 2015 after graduating. And then in three years later, they were able to get SQL Server to ship or started Freud to ship in SQL Server 2019. Right, and in documentation, they talk about doing this inlining. They're not gonna call it Freud because that's the research project. The average programmer is not gonna know what the hell that is, or even know what a Sigma paper is. But the implementation or the technique that's described in the Sigma paper that you guys read about is in what's shipped in SQL Server in 2019. So it just gives you sense. Reason why I like this paper, and I've said this before, like why I think SQL Server is a state-of-the-art system, this is very impressive to get something that was a research project that you were building. And then three years later, get it actually shipped in production code. Because database systems and database companies are typically, at least for the enterprise ones, are typically very conservative and how quickly they put out new features. Right, because the last thing you would want is put out a feature that causes people to lose a bunch of data or things to crash or things to get much slower because that affects your reputation. And enterprise customers are very, very conservative when it comes to their database systems. That's the reason why Oracle and Sybase and DV2 and SQL Server still make a lot of money because there's a lot of major companies that are still relying on these systems to be rock solid. So again, to get from a research prototype into production in a short amount of time like this is impressive. All right, so what is Freud? So Freud is going to convert imperative UDFs that are into relational algebra expressions that it can then embed or inline to the query plan that calls that UDF. And we can do this without having the application developer change anything in their UDF code. Again, the great thing about SQL is that if the hardware changes or the data system changes, the optimizer gets better. From the same SQL query that your application has been using for years can all of a sudden take advantage of new features that the data system provides if you upgrade without having to go back and change any application code. That was the original idea that Ted Cod came up with in the 1970s. So we need the same thing here for our UDS without requiring the developer to go back and change their UDS to add annotations or do anything differently. We can use Freud to inline them and get speed ups on the system performance or the application for running these queries. All right. So the basic idea where it's going to work is we're going to do the conversion of UDS during the rewrite phase of the query. So that means we don't have to change the query optimizer to be aware of that it's operating on UDS or doing some transformations on UDF derived code. All the conversion is done before you get to the cost-based optimizer. And the advantage of this one is mostly from a software engineering perspective is the query optimizer is the hardest part of the system to build. So if I can just do simple rewriting rules like static rules without a cost model to decide how to do the inlining, then if I could then convert the query plan with the inline UDF into the form that the cost-based optimizer expects and to shove it to the cost-based optimizer and let it do all the tricks and optimizations that it normally does for regular queries. All right, so that's the big advantage you're going to get with this is that you don't have to do anything special in this piece because you just do it in the rewriters phase. Postgres has a rewriter. They convert views into SQL just using static rules. So same thing here. We take UDF, convert it into a relational algebra, inject it into the query plan then hand it off to the optimizer to do whatever we want to do, right? The way to think about it at high level is we're basically going to convert UDS into subqueries and then the optimizer knows how to deal with subqueries. Cab you out some examples. We'll get to you in a second. And then since it knows how to optimize those, it doesn't know, doesn't care that it's optimizing query that came from a converted UDF, just sees a SQL query and just does its thing on it, right? All right, so we didn't really talk about subqueries too much in this class. We talked about a little bit in the intro class but I just want to remind everyone what's going on here. So conceptually, a UDF is basically like a subquery. A subquery is like a UDF because unless you do a conversion to either inline it or de-correlate the subquery, it's going to be treated as if it was a function call with the whole separate execution context to invoke that nested query, get some result and then return execution to the call and query of the outer query, right? So my SQL famously for the longest time until like three or four years ago could not do any de-correlation of nested queries. If you had a nested query in a where clause for every single tuple in the outer query, it would re-execute that inner query, which is like the dumbest thing you can do. And it's basically the example I was showing before with that UDF going from 800 milliseconds to 13 hours is for every single row on the outer table it's calling UDF one by one. So the same thing for the nested query if you don't do it right. So the two approaches to handle this, a nested query is either de-correlate them or flatten them like by rewriting them into a join or removing the outer query entirely, the inner query entirely, or the alternative is you extract out the nested query, run it once as like a subplan in Postgres they call it or like a query before you run the first query in the outer query, materialize it's results to a temp table, then now do a join on the outer query referencing that temp table. And then when the query is done you throw away the temp table. So let's see a really simple example like this. Say we have a table, a bunch of orders. So we want to get the first user that has made at least two purchases. So I have my outer query here. Just do it, go to the user ID, order them by the user ID in the sending order but only on one. And then the inner query here is gonna do a lookup where you reference the user ID of the outer table, do a join with it on the inner table but then you do the aggregation on the user ID and then you have a having call to throw about to only get the result that has more than two users. So if your optimizer is good you could obviously rewrite it like this. You remove the nested query, extract out the inner query, so remove the outer query, extract out the inner query or bring in pieces of the outer query into the new query here. All right, so now you get rid of the join. So you have a, well you do a sequential scan and then as soon as you find the first aggregation as soon as you find the first element that has more than two orders then you can drop out and you're done. Post-test can't do this. I think Oracle can, I didn't try SQL server. I also didn't try DuckDB. All right, so another way to do this is type of sub-query is through what are called lateral joins. The idea here, and this is what the apply operator is gonna be across apply that was in the SQL server, in the Freud paper. SQL, in the SQL standard I think it's apply in Postgres and in most of the systems like Oracle, I think DuckDB, they use the lateral join keyword but it's basically the same thing. So the idea here is that we're gonna allow interqueries can, or allow queries to access or reference the attributes with other tables or sort of other queries that are at sort of the same nesting level as it. So going back to my example here, this interquery here can reference the outer table, right? But the outer table can't reference anything on the inner table. With a lateral join if they're at the same nesting level then you can reference each other, right? So this will allow you to have a bunch of sub-queries in the from clause that can reference each other without having to keep embedding down nest and nest, keep nesting them within each other, right? So the way it's gonna work is that is you're gonna iterate through every table, each row in the table, and then if there's an interquery you gotta reference each other row and then you combine them together as if it was a join just like a regular query. So look at an example like this. So this is the same query we had before where we're gonna retrieve the first user that has made at least two purchases, but now we're gonna also include the timestamps of the first and the next order that the customer actually made. So it's basically the same as the nest query I showed you before, but we're actually now also getting timestamps. So you see this part in here, this is the nested query, but it's at the same nesting level as this query up here because I'm referencing this interjoin with a lateral keyword here, right? So you can see inside of now this query I can reference the user ID of this interquery of this query up here and then the first order can be referenced in this one here. You wouldn't be able to do that if these are nested just two regular queries without the lateral keyword. So this is clear. This is the building block that Ford's gonna use to make this inlining stuff work. We probably should add this to be an assignment in for 4.45, like for the first homework, right? All right, so let's go through now, okay, disable how Ford is gonna work. So there's five steps. So the first step, you're gonna take all the T SQL statements that they have written in the UDF and you're gonna convert them into SQL statements. Now for examples here, I'm gonna show the conversion to be done in SQL statements because it's easier to show and understand on the slides. But in the actual implementation, they're doing, they're generating logical plans or relational algebra expressions, right? They're not dealing from SQL to SQL. We'll see the app fell approach in a second, which is SQL to SQL, in this case here, or generating SQL from the UDF statements. In this case here, for simplicity I'm showing SQL. So we take the statements, transform it into SQL commands, and then we're gonna divide the UDF into regions that's gonna allow us to reason about what the dependencies and the contents of the different regions are. And then we'll combine those regions together into a single SQL statement. And then we're gonna take all the different regions, combine it, attach them together with lateral joins that inline the UDF expression into the query, then take this updated query with all these now embedded SQL commands with lateral joins, shove that through the query optimizer and let it do whatever optimizations that it wants to do, right? All right, so let's walk through an example. All right, so again, this is derived from the Floyd paper. So it's that example where I showed before, like you check to see whether the amount of orders that someone has purchased is above million dollars and they get a platinum status. Otherwise they're regular status. So the first step is what we wanna get, convert the T-SQL or the PLPG SQL statements into commands into SQL statements. So in the case of the first one here, we set the level variable to platinum. That's the same as just doing a select with a constant string platinum and then assigning it to the, in our projection output to be level. In the case of assigning the sum variable, or sorry, the total variable to be the summation of this aggregate query here, that's the same thing as doing a nested query, compute the aggregation, then taking whatever the output is, the scalar output of the aggregation and assigning it to the total attribute in our projection output. For the if clause here, we can't directly express if clauses in SQL, he's in the SQL standard. I think my SQL has something, but we can use case statements, right? Case when total is greater than a million, then assign the output of the scalar output is platinum. Otherwise it's null, meaning unknown. And again, we signed that to be the level actually in our output list. So my example here, I'm showing a one-to-one mapping, like for one command in PLBG SQL or in the UDF, that converts to one SQL statement. And so for all the examples, this is the case, this is true, doesn't have to be. You could have one command be multiple SQL statements or you could have multiple commands could be combined to a single SQL statement. But for simplicity, we'll always do one-to-one. All right, so now the next step is, we want to take our original UDF and then break it up into regions. Again, the original paper, they were constructed like algebra expressions for the different regions. For our purposes here, we're going to put this into SQL. So this first region here, you see that we declare two variables. We have total and then level, right? Total float level of the bar chart. Then we're going to take, run the SQL query here, we compute the summation and assign it to total. So in the, for this particular region here, we're going to now define in our output list, two values, we're going to have the level, we're going to set that to be null because it's undefined, we didn't assign it to anything here. And then we know that for the total attribute, that's going to be whatever the output of this summation is, this aggregation query here, right? So that's going to be the, again, the total variable here. But now we're going to assign the tuple that comes out of this first region here to a temporary table called ER1, right? So this region, when we execute this code, right? The output will be put into whatever this table is here. Now in the second region, we're going to do our if clause, a total greater than a million and assign it as a platinum. So in here, in our case statement here, just like before, if total now is greater than a million, where the level, actually, sorry, this would be a line here, total is being defined by this variable here. Otherwise, we're going to reference, then we'd assign the level to be platinum. Otherwise, we're going to set it to be whatever the value is that was defined from the previous table, which is null because we haven't done anything yet. So again, if you go back over here, at this point in the code, we haven't gotten down to the else clause. So unless total is greater than a million dollars, then it's going to be null as because we didn't define it up here, right? Then the third region here, again, it's basically the instead of having this else, you have to do the inverse of this or the negation of this. So it's if the total is less than equal to a million, then you assign it to regular. Then otherwise, the level is just whatever came out of this level in the second table, right? And then the last region here, we'll cover the next slide. It's just the output of whatever this guy's level is, right? So you kind of see where we're going with this. That would like for each region, we're gonna use some kind of computation in purely SQL. And then the variables that are getting passed along from one region to the next are assigned to these temp tables. And then the next region can then reference that temp table, which is why we need the lateral join because we want to be able to have one table reference the values that are in another table at the same nesting level. All right, so now the next step is we wanna merge these expressions into one giant SQL statement, right? And again, I'm showing cross apply, same thing as the lateral join, right? And it's just as I said, now we have our three temp tables, ER1, ER1, ER2, ER3, each of these sort of blocks within the cross apply can reference the previous one as needed. But then at the very top, this is where the return clause is where we're just taking whatever the output of this query and we're just returning whatever the ER3 level result is. Right? Yeah, so each region's gonna depend on the computation performed in the previous region. And the cross apply allows it to execute these in sort of sequential order. So now at this point here, this giant SQL statement here is equivalent to the UDF. It'll produce the same result. And we can verify this. We can prove that this is the case. So now what we need to do is take that giant SQL statement and then basically inline are injected into the original calling query, right? So assume that this thing here is called this customer level UDF of the customer key. Well, that's basically invoking can be written into this query like this, right? So we do get the customer key, then we have the inline computation from our converted UDF from the customer table. So for every single record in the customer table, we're gonna invoke this query mess here with all the same regions that we have before, right? Yeah, this is why, like this is very impressive. They can take arbitrary UDF and convert it into SQL statements. You can't do it for all possible UDFs. We'll see what cases they can't. Like they don't handle, you can't have while loops in UDFs. At least in Freud. The outfell approach we'll be able to handle that. So then now we take the SQL query, we throw it to the query optimizer, and then it's gonna end up rewriting it to something like this. Where it broke out all the, rewrote all the lateral joins into a simple left outer join. And then the case clause is now in the, is in the projection output list of the SQL query, right? So we have an implicit join grouping made explicit using the left outer join. And if it was able to rewrite that. All now the operations that were within the UDF that were treated as the black box before. Now it's just SQL, the query optimizer knows that it can reason about the selectivity and the cost of these things. You can do whatever optimizations that it wants to do. There's no function calls or context switch overhead of like for every single row, evokes some function set up like stack and all that stuff, all that's gone, right? And we don't have to change anything in our query optimizer to do this. Cause now it's just, you know, query optimizer is not gonna know that, you know, well this is here, we was given this. It doesn't know that this was written by Freud. It just applies all the same rules that do transformations to get it down to a more compact form and more efficient form like any other query. All right, any questions so far? All right, so for me, I like Freud because also I cannot, this is probably naïve to me. I can actually understand it. The outfell approach, we'll talk about in a second. It gets into the PL world, which I don't understand. But we'll take it as it is, it comes along. Okay, so let's look at some additional optimizations you can get. If again, if you can inline your UDS into SQL, there's a bunch of cool things that the query optimizer is gonna be able to do that starts to look a lot like what a regular, or a traditional compiler like GCC or Clang can do for your arbitrary code. So say I have a really simple UDF where for giving some constant value as an integer, it returns back a string that says whether it's a high value or low value based on whether the value is greater than 1,000. All right, so you just have a SQL function or SQL query that evokes the function like this. So if you could inline this with Freud, you would convert this UDF into an inline query like this with the lateral join. So now in the next phase, because again, this is being invoked in the query when you pass in a constant value, the optimizer can say, okay, well, I know what the value of the input parameter are being passed in, the X value is. So it can do dynamic slicing and throw away this portion of the else clause or the case statement because it knows that X is gonna be greater than 1,000. It knows it's never gonna actually look at this. So it can entirely throw away the case statement and just only return high, the string high. Right, that's equivalent to the same thing as a traditional optimizer could do. Say, well, I know this is a dead branch. Let me just go ahead and I'm never gonna evoke it. Let me just throw it away. They need us to do constant propagation and folding, further recognizing that the value of X has already been known. So you don't need to do this outer apply or you just combine these two into a single select statement and just return the candidate string high value. Right, same thing up above, right? You don't need to declare the, wait, you don't need to do the catnation of creating the string value, right? Actually, this line should be removed too. Sorry, right, you just return high value because you know it's exactly that it's always gonna be the same. And then lastly, you do the decalumination. You have things you don't need, right? And then you return it through a single select statement. Right, again, we can do this because if we get with Freud, we convert the UDF into our traditional algebra form, then the optimizer would do this again as if it was any other query. All right, this only works again if it's going back to the beginning here. I can do all these things because I'm passing in a constant, right? All right, so in the 2019 version of Freud, they can support these commands in T-SQL. Again, the key thing that they're missing is you can't do loops, you can't do exceptions, you can't do cursors. And we'll see in a second, but they claim that over about 60% of the UDS that they see in the real world can be covered by the syntax here, right? And it can handle all possible data types. It can have UDS called all the UDS. There's a parameter you can set to say how deep you wanna unnest. Like if I have a UDF called like seven levels deep, I don't wanna inline all the way. I forget what the default is in SQL Server. But, and they also can't support dynamic SQL queries. So if I create a SQL string, start concatenating to it and then invoke it. All right, so in the paper, they talked about when they looked at real workloads and they were trying to figure out, again, what level of compatibility they would have or what sort of, how many UDS would they actually support with just the constructs that I showed in the last slide. And for three real world workloads, they see that they get about over 80% for them, compatibility with them. I think, again, across the entire fleet of all the SQL Server databases that they had access to, I think it's about 60,000. I'm sorry, 60% of the UDS could be inline with Freud. Again, this study was done in 2019. I don't think in the newer versions of any updates to Freud still can't handle loops and things like that. So it's roughly still 60%. So they have this graph in the paper, which is quite impressive, where they show for these two different workloads what the percentage of speed up they're getting for these queries, the improvement factor they're getting for these queries, when they're able to do the inlining. And so the way to do that is the x-axis where both these graphs are unique invocations of different UDS when you enable Freud. And in some cases, you're getting almost up to 1000x improvement. Because, again, you're able to do this inlining and not have to invoke the UDF one for one row after another. So I always just say every semester when I teach this topic, this is amazing. Like, how often are there updates to your database server when you get 1000x improvement for queries? It doesn't happen that often. Unless you're doing something really stupid, like picking the worst join order and then the update fixes the join order. But even then, it's very unlikely you're gonna do 1000x. So again, one, this is the beauty of SQL, that you don't have to change any application. You just say, I want Freud or I want this inlining feature and you've got this huge improvement. And then Sam will show some examples in a second that people are tweeting out how much they love this feature, how much betterment they got. And then Sam will pop that balloon and tell who he really thinks. Okay. All right, so any questions about Freud? Again, I think this is very impressive. This is, Freud is obviously, the technique is obviously patented by Microsoft. So there isn't another system that's gonna do anything like this or in exactly the same way for a long time. All right, till the patent expires. Yes? So you can patent techniques, right? This question is, your question, whether you can patent techniques like this? Like patenting software. It's a software patent, absolutely, yes. But aren't they like... Microsoft has in-house lawyers that are very expensive. Do you want to roll the dice? I don't know if I said that. If you're Oracle, would you roll the dice? No. Right? Could Postgres supplement something like this? I don't know how much that's been litigated, right? Cause there isn't a corporation for Postgres, right? So, but there is a nonprofit, you could see that. But no, again, there's... Well, again, next slide will be the next technique. I don't know the way to do this, get to some benefit. And I think that would not encroach on the patent. So could we write a research paper that extends Freud and do something better? Yes. Could we make a product and sell it as a competitor to Microsoft? Probably not. Again, Microsoft would destroy it. All the major corporations have in-house lawyers that all they do is this kind of stuff. We can take legal stuff online. But yeah, software patents are a thing. The problem is also too, people get software patents and then the company folds and patent trolls by them and then they sue you in Texas. So there is no company, there's just a whole company that owns the patent and they sue big corporations and this kind of stuff. We'll take that offline. Okay. So, AppFell is a technique from another set of Germans that are also very good. But it's going to be just like Freud where they want to convert UDS into something that they can inline to the SQL query. But instead of using relation algebra expressions, they're going to convert it into CTEs, common table expressions. And what's fascinating about this is that they're going to support some of the control flow constructs that loops, I don't think they can do exceptions, but looping they can do in a way Freud cannot. And instead of actually being directly embedded inside the database server, the way Freud is, they can actually be implemented as a separate middleware that sits above the database server. You put your UDS in there, you put your SQL queries in there and then it'll spit out SQL queries that you can then invoke on the online database system. So they're trying to be database agnostic. Although I think they only support these Postgres. They have an online demo, I was down, I couldn't get to it, but if I can find it, I'll post it on Piazza, what it looks like. So this technique is going to get more PLE than Freud is. So I'm not going to be able to understand everything that it looks like you're doing, but the base basically is going to work is that you're going to turn your UDF code into these intermediate representations like the SSA or SSSA and the administrative normal form and then you'll be able to convert it into an administrative normal form and convert it back into tail recursive functions and then from there you can then spit out recursive CTEs. So let's walk through this and then stop me if it gets confusing or if someone knows what this is going on beyond me because they're taking compilers of PL, let me know. All right, so here we have a UDF that's going to repeat the power function. All right, so the part of the UDF that we care about the most is this while loop here, right, where we're going to give in, for some iterator we find is I and for the number of times we want to take the exponent or multiply by itself, you loop through that, update our value and then return it, right? So again, this is PLBG SQL, it looks a lot like T-SQL except there's no at signs for variables, right? So they're going to convert this into the static single assignment that's going to be based entirely off of go-tos, right? So the control flow is going to express through go-to statements. So I define my variables at the top, then I have my clause for the while loop, if my iterator is less than n, loop back or loop down here, do my computation, otherwise go to the exit. Then I do my computation and I go back up to the top, right? So far so good. Then we need to convert this into the administrative neural form which contains mutually recursive functions that are only in the tail close. So now again, it's just like before but we would have in these different blocks, we have a jump to this function here to do the while loop, this then jumps to the function that does the body and then this has a recursive jump into the, back to the while loop again, right? And otherwise again, if we complete our iteration, then this function pops out, we pop up at the stack and we return result to the power function at the top. Then now we need to convert this from a mutual recursion into a direct tail recursion where the only place where that we're allowed to call another function is always at the last command at the end of the function, right? So we start a function here, we then can loop through this run calls down here which then we'll call back itself. Otherwise, if we don't, if we finish our looping then we pop off the stack and then we're done here, right? So the reason why we want to put it in this form is because now when we call or convert this to SQL, the output of the function is to be over whatever the last computation actually, the last thing that gets computed in the function, right? The output of the SQL statement should be the last thing in the function. So if we need to call another thing here, we want our output to be whatever this thing returns, right? So now we can convert the, this tail recursive thing into CTEs. So, yeah, so bear with me. Okay, so here we have now the single CTE that's recursive. So you have the union all operator where the CTE is allowed to call itself. So the name of the CTE is run. This thing here is a string we're passing in to say, should I call or not? And then inside of this, we're basically do another nest query, do you need a unit at all where we're, as if it was iterated to produce our output, right? So the first part is here. And these inner parts are combined together into the single SQL statement here. We're done, right? This is now valid SQL that we can run. There's some stuff that's code-gen or programmatic stuff like this constant string call that we're passing around. That's an artifact of how this actually works. But this point here, this is valid SQL that's representative of what the original UDF was doing. So to this point, there hasn't been an evaluation of the CTE app file approach against Freud. We were trying to do that here at CMU in the old system noise page, but our optimizer couldn't handle the lateral joins the way we needed to. And so in the paper from the Apple guys, they have a comparison where they compare in Postgres the PLC goal approach versus the CTE approach as you scale up the number of iterations in that power function. Unfortunately, in all of the outfile papers, they all have these derived UDFs. There was a benchmark that the Freud guys put out that came out after this paper came out that has real workloads. But this is sort of obvious here as you would expect. There is overhead of all these different CTEs, but it's still going to be cheaper than having to do the context switch into the PLC code, okay? So any questions about Freud or outfile? All right, so Sam's going to give a guess in sub lecture. Here's the clicker, I'll put this in my pocket. You put it in your pocket, can you touch it? Yeah, can you touch it? All right, you don't really want to touch students, but. Okay, there's your clicker, cool, cool. All right, go for it. All right, awesome. Okay, so Freud, what happened next? What happened after the paper came out? So as Andy mentioned, it was released in SQL server. It was actually shipped as a feature in SQL server. Then after that, there was huge performance wins in the wild. So from Carthix, Twitter, we see a bunch of posts like this. 100 fold improvement in UDF performance. Order of magnitude, dramatic performance gains due to Freud. The CPU time is three times lower. The query is more than 20 times faster, right? Here it goes from four minutes, 25 seconds to nine seconds, right? Huge performance wins from Freud in the wild. And of course, high praise from Andy, right? So when he was asked, what are the papers in the last decade that are your favorite? In no particular order, of course, Freud was listed first, right? And Andy says this, I've set it before, but Carthix UDF inlining is one of the most important query optimization techniques for databases developed in the last decade. I dedicated an entire class on Freud, my advanced degree course in 2020. Which is this class, yeah, okay. So since then, right, to motivate more research in this area, there was this paper that was released, the prof bench paper, by Carthix, right? And the reason is that if there's no benchmark, it's really hard for researchers to do more work on UDFs, right? So that's this paper, procedural extensions of SQL, understanding their usage in the wild. So what came out of this paper, yeah, the Microsoft team published a paper, analysis of real world UDFs, TVFs, triggers stored procedures. So it's the first time that researchers, the public, had like a real insight into what do these procedural extensions look like in the wild, right? And then with this was this benchmark that was released, and it's on GitHub, the SQL prof bench, right? So finally, we can benchmark and see what are the performance of these different UDF techniques, right? And the authors of the paper argue that prof bench faithfully represents real world workloads. So from the prof bench, I would classify two sort of different types of UDS. The first sort of style of UDF is something like this. So the UDF that takes no parameters, right? And it's just invoked like this, right? So when you have something like this, where the UDF is invoked only once, the wins, the performance wins that you can get from inlining are pretty minimal, right? Because you're not doing any sort of like joints between like an outer table and a table inside the UDF, right? But if you have something like this, which is like the example that Andy was showing earlier, you know, you're computing the customer level for every customer, and inside this UDF, you're doing a select query where you're sort of matching on that customer key. You're doing this implicit join between the customer table and the orders table, which is kind of like almost like a nested loop join, right? So inlining that as a sub query, if the optimizer can see, yeah, I can just turn this into like a hash join or something like that. It's gonna be like orders of magnitude better, right? That's like where the real wins come from Freud, okay? So naturally, what we did when we saw that this benchmark released was, okay, Freud is already in SQL Server 2019, okay? So why don't we just test the PROC bench on SQL Server 2019? And we found some surprising results. So what's surprising is that SQL Server's optimizer could only decarrelate two out of 13 of the UDFs in the PROC bench that are cold with parameters, right? And you can imagine that if the optimizer can't decarrelate the join in the, the sub query in the UDF, you're basically doing a nested loop join, and then the performance is gonna be super, super bad, right? And then what we did is we manually sort of Freuded these UDFs and gave them to Umbra. And Umbra was able to get 13 out of 13. He was able to decarrelate every single subquery. So why? So the way that SQL Server does subqueries is based up this paper from 2001, orthogonal optimization of subqueries and aggregation. And the idea is that they have these rewrite rules, and then they just blindly execute these rewrite rules where they can. But some of these rewrites may potentially involve duplicating the sub expressions in multiple places in the query plan tree. So maybe you can decarrelate, but replacing it in multiple, but putting the sub expression in multiple places ends up being more expensive. So what they do is they leave this kind of decarrelation up to the query optimizer to decide based off cost, should I decarrelate, should I not decarrelate, right? On the other hand, the way that the Germans do it is based off this paper, unnesting arbitrary queries where they can systematically eliminate and decarrelate any subquery. So the way that they do that is they introduce a dependent join operator, and then they use a query plan DAG instead of a query plan tree. And this allows them to reuse sub expressions without having to actually compute them multiple times. And as a result, they can systematically decarrelate any subquery, no matter how convoluted. This is why in practice, it's always getting decarrelated. The 13 out of 13 UDFs get decarrelated. But SQL server can only handle two, right? So what are the implications here when it comes to UDF inlining? Well, to start with, UDF inlining as a technique is amazing. There's huge wins practically, right? Orders of magnitude performance improvement, right? But to get the real benefits of UDF inlining, again, assuming that these UDFs in the PROC band do represent real world UDFs, you really do have to have a German-style query optimizer that can just crunch through and decarrelate all these subqueries. Otherwise, you're left doing this correlated evaluation and then you're just screwed, right? So that's why for our research, what we're doing is we're extending ductDB, which has a German-style query optimizer, so they can have support for UDFs. And we can do all our UDF research there. So the takeaway is Freud is amazing, right? Andy will agree, right? But the emphasis should be like, you need a really good query optimizer to get the most out of Freud. Yeah? Finish the class. Finish the class, all right. Okay, so anyways, this is huge, right? You rarely get 500X speed up without either switching to a new DBMS or rewriting your application, right? It says here, another optimization approach is to compile the UDF into machine code. I think that was covered already, right? But it doesn't solve the optimizer's problem. You really need something like inlining to be able to, like, realize that you're doing a subquery and, like, get rid of that join, right? All right, so in the next class, we're going to be talking about database networking protocols. Any questions? That's my favorite all-time job. Ice cube with the G to the E to the T. Now here comes Duke. I play the game where there's no roots. Homies on the cuss, so y'all gonna focus on drink proof. Put the bus a cap on the ice, bro. Bushwick on the go with a flow to the ice. Here I come. Willie D, that's me. Rollin' with fifth one. By the 12-pack case, I'm a four. Six-pack, 48, gets the real promise. I drink proof, but yo, I drink it by the 12. They say bill makes you fat. But saying eyes is straight. So it really don't matter.