So for today's class, we're going to talk about server-side logic execution. Again, I'm at my house here with the terrier who, for some reason, smells like onions. I'm not sure what that's about. So for today's lecture, this is a topic that I find very interesting. And it's not something I would say is critical for building the sort of modern, high-performance database system that we've been talking about the entire semester. This is about making it so that people can extend your database system and have it integrate with more different types of applications. So again, this is not something we're doing because we absolutely need the best performance, although the way we're going to execute our UDFs is going to help performance. This is more about the extensibility of the database system, but we want to do it in a really smart way. So for today's lecture, we're going to first start with the background: what I mean by server-side logic execution. And then we'll focus on how to optimize the performance of user-defined functions. One way is the inlining approach used by Microsoft in the Froid paper that you were assigned to read. The other approach is a newer method out of Germany for converting UDFs into common table expressions. Both of these are about making UDFs run faster. So as I said, server-side logic execution, I think, is an interesting topic. I used to teach it in the introduction class. The last time I taught it, I think, was maybe 2018, and unfortunately last semester we had to cut it out. So this is not something that you may be familiar with, even if you've been here at CMU. So I'll spend a little time discussing what it is we're actually trying to do in our database system. 
So the key observation you have to make about what server-side logic execution is going to do for us is that up until now, we've assumed there's been this clean division, or dichotomy, in our application stack between the logic that's in the application server and the logic that we execute in the database system. And the database system would only see queries that were being sent over from what I'll call a conversational API, like JDBC or ODBC, like what we talked about early in the semester. The downside with this approach is that it's a chatty API: I have to send a query, get a result, process it, and then send another request back. So we saw a diagram sort of like this before when we talked about JDBC or ODBC. On the application side, this is where our program lives, whether it's written in Node.js or PHP or a Django/Python application. Here's all the business logic of our program. And so when we start a transaction, we call BEGIN, and then we execute a SQL statement, which goes over the wire, again, through ODBC or JDBC. And that runs through the full stack that we've talked about: the SQL parser, the planner, the optimizer, and eventually down to the query executor. Now, you can use prepared statements and maybe skip the first two or three steps, but in the end, you're still going to have to execute the query. So the database system executes the query and sends the result back, and then the application server can proceed with the program. The SQL call is a blocking call on the application side. So now, while the application goes off and executes more program logic to process whatever results it got from the SQL query, our database system is idle for this particular connection. Now, again, it's a multi-threaded system, or a multi-process system, so we could be processing queries from other transactions or other connections at the same time. 
But for the context handle we're going to maintain for this particular connection, it's essentially idle. So that means that if we're running this in a transaction, and we're using something like two-phase locking as a concurrency control protocol, whatever locks we acquired for this SQL statement are now just being held on the server side, because the server is waiting for the next command to be told what's going to happen with this particular transaction. And this network round trip is not cheap. Best case scenario, we're running on the same machine, but that's not very common. Usually you have a separate server for your application and a separate server for the database system, so you can scale them independently. And maybe the application server is one rack unit below the database server in the same rack, but that's pretty uncommon too. So it could take a couple of milliseconds to send these messages back and forth. And who knows how long this program logic is going to take to execute. So after we complete the program logic here, maybe we do the same thing: we execute another SQL query, send it over, get a result back, do some more processing, and then we wait. Then eventually our application says, okay, we're done, and sends a commit message. And then we can release whatever locks or whatever memory we've allocated for this transaction. Whatever state we had to maintain for this open connection and its active transaction, when we get the commit, we can finally start to clean it up. So the things we're going to talk about today are essentially how to take that application logic on this side and move it inside the database system, to avoid these round trips that make transactions, or whatever queries we're executing, take longer than they need to. 
We try to combine everything into a single invocation so that we can send just one request to the database server and not go back and forth with the application. We're also going to embed logic to allow us to extend the functionality of the database system. So in some cases, we'll be able to do certain types of manipulation or filtering, or other steps we would do when we process queries, in a way that the original developers of the database system may not have anticipated or may not have offered to support. So the benefit we'll get from this is that because we can make fewer round trips, we'll improve the efficiency of our system, because now transactions are going to execute in less time. We're also going to get some reuse. Say I have some functionality inside my database system that my web application uses, and now it needs to be ported over to a mobile phone application. Instead of rewriting the PHP code from my web application in Swift, or whatever I'm using on my cell phone, the same logic that was executed from the website can be executed from the mobile phone, because it's embedded inside the database system. Again, it depends on your application stack setup; I'm just assuming there's shared logic running in both of them. So what does that look like? One simple example that I'm showing here is that we could take this entire transaction logic and somehow send it over to the database system, have it get installed, and then the ability to call it gets exposed through some kind of function handle, say we call it 'proc', and it takes an input argument or whatever you want. 
So now if I want to invoke this transaction in my application, I don't do that back and forth that I had before. Instead I just use this invocation, this command called CALL. So I call the function, pass in whatever argument I want, and send that one request over like an RPC; it executes, commits, and then I get back the result. So the benefit is that it's one round trip to execute all this complex logic, where before it was multiple round trips. Now in this case I'm invoking it with CALL; this might be in the SQL standard, and other database systems might support EXEC or EXECUTE, but the basic idea is the same. So this is what is called a stored procedure, where the idea is that we can take a chunk of logic and wrap it into a procedure call that we can then invoke independently of a query. There are other types of embedded logic too. The ones we focus on today are user-defined functions, where the UDF is attached to a query. You can't invoke a UDF independently of a query; it has to be part of a SELECT, INSERT, UPDATE, or DELETE, whereas stored procedures can be invoked separately. Triggers are a way to have functions get invoked when a certain event occurs. If I update a tuple in a table, I can have that fire off a trigger, either before or after the update, which then invokes some function to do something. So let's say that every time somebody updated a table, I wanted to insert an audit record into another table. I could put a trigger on that table that says: any time it's updated, record what that change was in this other table. User-defined types are a way to extend the internal type system of the database system, allowing you to support more complex object types that go beyond the base scalar values or primitive values that the database system supports. 
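As a hedged sketch of the stored procedure and trigger ideas just described (all table, procedure, and trigger names here are made up for illustration, not taken from the slides), in something like T-SQL this might look as follows:

```sql
-- A stored procedure: the whole multi-statement transaction runs
-- server-side, so the application makes one round trip instead of many.
CREATE PROCEDURE dbo.transfer (@acct INT, @amount INT)
AS
BEGIN
  BEGIN TRANSACTION;
    UPDATE accounts SET balance = balance + @amount WHERE id = @acct;
    INSERT INTO xfer_log (acct, delta) VALUES (@acct, @amount);
  COMMIT TRANSACTION;
END;
-- Application side, one request:
--   EXEC dbo.transfer @acct = 1, @amount = 99;

-- A trigger: every update to accounts records an audit row in
-- another table, as in the audit example described above.
CREATE TRIGGER dbo.audit_accounts ON accounts AFTER UPDATE
AS
  INSERT INTO accounts_audit (acct, changed_at)
  SELECT id, GETDATE() FROM inserted;
```

The point is that the application's only job is now to issue the single EXEC; all the statement-to-statement control flow happens inside the database system.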
User-defined aggregates are a way to have more complex aggregation functions beyond MIN, MAX, COUNT, SUM, things like that. So I can't prove this, but based on my intuition from what I've seen out in real programs and from talking to people, the prevalence of these different types of embedded logic in applications goes roughly from most common at the top of this list to least common at the bottom. User-defined functions are very, very common; you see these all over the place. User-defined aggregates are a little less common. Triggers and stored procedures are probably interchangeable. But again, the way you implement a trigger is through a user-defined function. The other thing, too: I think in the Microsoft paper, they talk about how user-defined functions cannot update tuples or tables. They are read-only, whereas with stored procedures you can update things. I think that's specific to SQL Server; I don't think the SQL standard specifies it one way or the other. So you could have user-defined functions that actually do update tables. Okay, so what we're going to focus on today is user-defined functions, because this is the most common one, and it can be used in combination with at least stored procedures and triggers. UDTs and UDAs are sort of a separate beast. Okay, so a user-defined function (UDF) is a function written by the application developer that allows them to extend the system's functionality beyond what its built-in operations provide. The SQL standard defines some basic SQL functions: string length, string upper, casting functions, things like that. But to go beyond what the SQL standard says, or even beyond the proprietary functions that the database system supports, you can write user-defined functions that you invoke just as if they were built-in functions. And the setup is that you're always going to be given scalar input arguments. 
So, single primitive values, or even arrays of scalars, but not table sets or row sets. Then you perform some kind of computation on them. This computation uses the standard imperative language constructs that you're familiar with: for loops, while loops, if clauses, things like that. It can also include invocations of other SQL statements or other UDFs within this UDF. And then you return a result that's either going to be more scalar values, or a relation, a set of rows, right? So that's the basic idea of what we're trying to do here. So let's look at an example. All the examples in this first part, for Froid, are going to be written in a programming language called T-SQL, or Transact-SQL. This is the UDF language that Microsoft SQL Server supports, and Froid was about improving UDFs in SQL Server; that's why we're using it. T-SQL comes from Sybase, because SQL Server originally started off as a licensed fork of Sybase in the 1990s; they ported it to Windows NT. Since then, the forks have diverged significantly. Sybase got bought by SAP, and it's more or less in maintenance mode at this point, whereas SQL Server is still very state-of-the-art and still being actively developed. Another way to think about this: there are a lot of major corporations and major applications running on Sybase, but no startup with a bunch of 20-year-old kids is going to say, I'm going to build my startup on Sybase. That probably doesn't happen anymore. Whereas SQL Server is actually still a good choice for many applications. Okay, I'll also say that the SQL standard defines a language called SQL/PSM. As far as I know, nobody actually supports it to the exact specification in the SQL standard. Even the things that look close to SQL/PSM still deviate from it. So Oracle has PL/SQL, and Postgres has PL/pgSQL, which is a variant of the Oracle one. 
PL/SQL looks close enough to SQL/PSM, but it's not exactly the same. At a high level, it all looks like Pascal, it all looks sort of like Ada. So, other than the more complex things, going from Oracle to Postgres isn't a major change. Transact-SQL is a little bit different, but at a high level it still looks the same. All right, so here we have a function we're going to define that takes customer IDs from the orders table and computes what customer service level they're supposed to have, based on the amount of money they have spent across the total number of orders they've submitted. So our function takes in an integer that's the customer key, and it returns a 10-character string. In the beginning, we have to declare all our variables. Again, this is a remnant of Pascal and Ada, which is what SQL/PSM is based on: in Pascal, you define all your variables at the beginning. Then we have a SELECT statement that looks just like a regular SELECT statement, except we've added this little '@total =' where we're taking the output of this summation and storing it in the variable that we just declared up above. Then we have some IF clauses that say: if the total is greater than a million, then they're at the platinum level; otherwise they're at the regular level. And then we just return the level value, which is a 10-character string. The way we would invoke this in a query would be with a scan like this on the customer table, where for each customer key we invoke this function to get the computed service level. So as I said before, there are some important advantages of UDFs that explain why they're actually very, very common. 
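Reconstructed from the description above (the variable and column names are my guesses at what's on the slide, loosely following TPC-H naming), the function looks something like:

```sql
CREATE FUNCTION dbo.cust_level (@ckey INT)
RETURNS CHAR(10) AS
BEGIN
  -- Declare all variables up front, Pascal-style.
  DECLARE @total FLOAT;
  DECLARE @level CHAR(10);

  -- A regular SELECT, except the aggregate's output is captured
  -- into the @total variable declared above.
  SELECT @total = SUM(o_totalprice)
    FROM orders WHERE o_custkey = @ckey;

  -- Branch on the total to pick the service level.
  IF @total > 1000000
    SET @level = 'Platinum';
  ELSE
    SET @level = 'Regular';

  RETURN @level;
END;

-- Invoked per tuple during a scan of the customer table:
--   SELECT c_custkey, dbo.cust_level(c_custkey) FROM customer;
```

Note how the body mixes SQL (the SELECT with the aggregate) and imperative constructs (the IF/ELSE), which is exactly what makes it opaque to the optimizer.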
The first one is that they support modularity and reuse. If I have complex logic that I need to reuse across multiple applications on the same database, I don't want to have to re-implement that logic over and over again. I can instead embed it inside the database system and have all instances of the application, no matter how many times I refactor or change them, rely on that same centralized copy of the logic. We talked before about having fewer network round trips. That makes queries run faster, because the database system doesn't have to go back over the network to find out what to do next; all the logic runs inside the database system. And this one is a bit subjective, but there have been arguments that for certain application domains, the logic you need to write is easier to express in UDFs than in SQL. Certainly for some things, yes, this totally makes sense. For other things, with CTEs, SQL has gotten a little better, but there are certainly some things you just don't want to do in SQL, and UDFs are a better way to do them. Some ML and data analysis tasks are sometimes better expressed in something that looks like a UDF. All right, so this sounds amazing, right? This sounds like UDFs are a big win, and we totally want to use them. What's the problem? Why are we actually spending time today talking about how to make these run faster? Well, the first issue is that when our query optimizer encounters a UDF, it doesn't know anything about it and essentially has to treat it as a black box, because it doesn't know what its functionality is going to be. This is because UDFs are written in an imperative language like PL/SQL or SQL PL, whereas SQL is a declarative language, which means the optimizer can reason about what it is the query actually wants to do. 
And it also knows something about the data, although as we saw, the estimations are usually wildly inaccurate, but it has at least some sense of what's happening. So when the optimizer encounters a UDF, it basically has to say, I don't know what this is going to do. And so if you now have a UDF being used in a WHERE clause, where some column is compared against another column after it's been run through a UDF, you have no idea what transformation that UDF is going to perform. Therefore you don't know what the cardinality or the selectivity of that predicate is going to be. So you basically throw your hands up and guess. Oftentimes you just assume the worst case scenario: you say it's the worst selectivity you could ever have, like everything is going to match, 100% selective; or at the other end you could assume 0% selectivity, or just take the average. You could also do a bit of the adaptive query optimization stuff we talked about, where you run it for a little while, see what happens, and make a decision. But as I said, until very recently, most systems didn't actually support this, at least among commercial systems. And none of the open source systems support this, but they do support UDFs. All right, so the next one is a bit more nuanced. The next problem is that because the UDF is imperative code, we essentially have to execute it sequentially, line by line. And that means we're not going to be able to parallelize any of the operations inside it, because the output of one line could be used as the input for the next line; we can't execute a line until we've executed what came before. So that lends itself to serial execution. 
Now, that doesn't mean that if we're doing a scan in parallel across multiple threads, each of those threads couldn't invoke the UDF. But we're not getting any of the benefits of recognizing that maybe there's a bunch of SQL queries inside the UDF that we could combine together, plan at the same time, or optimize at the same time. Instead of invoking all the individual queries one by one, we could combine them into one giant query with joins and make that run fast. But we don't know anything about what's going on inside, so we can't do the kinds of optimizations we normally could if we had regular queries. The other really tricky thing is that some UDFs can create dynamic SQL statements. Essentially, you can declare a string variable and incrementally construct the string; you can concatenate strings together, like 'SELECT * FROM' plus a table name, and so forth. Therefore, you have no idea what the query is going to be until you actually execute that code. And once you have the string, you invoke it as a SQL statement, and it runs as a query. So you have no idea how to even prepare for this, because you don't know what the query is going to be ahead of time. All right, some more problems: for complex UDFs, meaning ones with SELECT statements or WHERE clauses inside them, the database system has to execute the UDF on each row or each tuple, one by one. Because again, I don't know what the UDF is going to do, I don't know whether the logic inside of it could change from one tuple to the next. I don't know whether the UDF is going to access attribute foo or bar in my tuple. Therefore I don't know how to combine or vectorize multiple invocations of the UDF within a batch, right? You throw your hands up and basically have to do it one by one. 
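As a sketch of the dynamic SQL problem mentioned above (a hypothetical procedure body; note that SQL Server actually disallows dynamic SQL inside functions, so in practice this would live in a stored procedure):

```sql
-- Build the query text at runtime by concatenating strings.
-- The system has no idea what this query will be until the
-- code executes, so it cannot plan or optimize it ahead of time.
DECLARE @tbl NVARCHAR(128) = N'orders';
DECLARE @q   NVARCHAR(400);
SET @q = N'SELECT COUNT(*) FROM ' + @tbl
       + N' WHERE o_totalprice > 100.0';
EXEC sp_executesql @q;
```

Since @tbl could just as easily come from an input parameter or an earlier query result, even the referenced table is unknown at compile time.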
So in Microsoft's world, they coined the term 'row by agonizing row' (RBAR) to represent how bad things are when you execute UDFs this way. The last problem is related to all of this: because you're executing the statements in the UDF one by one, it's hard to do cross-statement optimizations for invocations within the same UDF. Okay, so how do we solve this? Well, one approach that people have tried, that Microsoft actually tried, is to convert the UDFs into machine code, the same way we talked about when we did codegen for queries. We could do the same thing for our UDFs. But while this may speed up the execution of the instructions in the UDF, it still doesn't solve our parallelization, cost estimation, or cross-statement optimization problems. It just makes the crappy UDF run faster. At the opposite end of the spectrum, probably the slowest type of UDF invocation, not the worst, but probably the slowest, is what Postgres does, which actually converts each line into a separate query itself and invokes that. I mean, it's clever from an engineering standpoint, but from a performance standpoint, I think it's pretty abysmal. All right, so let's see how bad things can actually get. This is query 12 from TPC-H, using scale factor one. And they've modified the query to introduce this artificial predicate here that invokes a UDF like the one we saw before. We're just going to get the customer name for each customer we're looking at, and check whether it's not NULL. This UDF, which just looks like this, does nothing but take in the customer key and do a single SELECT statement, a lookup, to get the customer name. And so this predicate is essentially useless, because in TPC-H, the customer name column is never going to be NULL, right? 
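A sketch of what that UDF and the artificial predicate might look like (names are my approximation of the paper's example, not copied from the slide):

```sql
-- The UDF: a single-row lookup of the customer's name by key.
CREATE FUNCTION dbo.GetCustomerName (@ckey INT)
RETURNS CHAR(25) AS
BEGIN
  DECLARE @name CHAR(25);
  SELECT @name = c_name FROM customer WHERE c_custkey = @ckey;
  RETURN @name;
END;

-- The artificial predicate added to TPC-H Q12's WHERE clause.
-- c_name is never NULL in TPC-H data, so this is always true,
-- but the optimizer treats the UDF as a black box and has to
-- invoke it for every qualifying row:
--   ... AND dbo.GetCustomerName(o_custkey) IS NOT NULL
```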
But the optimizer doesn't know what this is doing, because it's just a black box. Now you could say, all right, well, the system could try to interpret this query and inline it, but it doesn't do that, because not all UDFs are going to be this simple. So the optimizer doesn't know that the customer name will never evaluate to NULL, and therefore that this predicate is always true. This is just wasted work, right? We're invoking this function to check something that will never be false. So if you execute just the original query in SQL Server, without the UDF, it runs in 800 milliseconds. But if you add this simple UDF, the same query now takes 13 hours. Because for every single tuple, it's invoking this function over and over again. Now to me, this seems super surprising: if my scale factor is one, how many customers or line items am I actually going to have? How bad can it actually be? But for whatever reason, SQL Server chokes on this, and it goes to 13 hours. So what we will see today is a technique called Froid for inlining this function inside the query to avoid all the bad things we've talked about. And the spoiler is: if you take this same function, with the same code, and run it with Froid, you can get it down to 900 milliseconds. So it's 100 milliseconds slower than the original query, but it's certainly not 13 hours, orders of magnitude slower. All right, so let's look at the interesting history of how we got to Froid in SQL Server. In 2001, Microsoft added support for scalar UDFs, again, taking in scalar values and producing scalar results. Around 2008, people started to realize that, oh, these UDFs are actually kind of bad and they're making things really, really slow. 
They're great for programmability and extending the capabilities of the database system, but they're crushing performance. And so there's this blog article from a non-Microsoft employee where he basically says that these UDFs are evil; they're making everything really slow. Then in 2010, Microsoft publicly acknowledges that UDFs are evil. They put out a blog article that says, oh yeah, UDFs are evil, and here's why. And they actively encourage their developers using SQL Server not to use UDFs, and they explain why. This is also when they introduced native compilation of UDFs to machine code. But even that doesn't solve the problem of invoking the UDF one by one for each row. So around 2014, there was research being done at IIT Bombay, which is probably the best database school in all of India, on UDF decorrelation. This was early work before Froid that showed you can actually do inlining of UDFs. This work was being done by Karthik Ramachandra, who was a PhD student at IIT Bombay. Then Karthik graduates and joins Microsoft's Gray Systems Lab in Madison, Wisconsin, to start exploring how to go further with this as part of Froid. By 2018, Microsoft announced that Froid was becoming generally available in SQL Server 2019. And last year, Microsoft released the version of SQL Server that actually includes Froid. If you look in the documentation, it shows how to do the inlining of UDFs using the Froid technique. Now, they don't call it Froid, because that's a research name, not a product name; no one would understand it. So it's just called UDF inlining. It's sort of like Hekaton: Microsoft doesn't call the in-memory engine for SQL Server 'Hekaton' in the product either. The researchers know what Hekaton is, and the researchers know what Froid is. 
But if you search for Froid in their documentation, it's not going to show up; it's called inlining. So, and I say this every year, or at least the second year I've taught this: it's very impressive to go from starting a project around 2015 to getting it into production in a major, major piece of software that makes Microsoft billions of dollars. It will have a lot of impact, because a lot of people are going to get exposed to this, since UDFs are used all over the place. That is a short amount of time to pull something like this off; it's just very impressive. And it shows you how significant the gains from Froid are, because Microsoft put in a big engineering effort to get this into production, into the real version of SQL Server. This is rare, because most of the time database companies, at least the major ones, Oracle, IBM, and Microsoft, are quite conservative about pushing out new features quickly, because people rely on these things. And if your product, especially a database system, is seen as unreliable or kind of flaky, or you're seen as not taking the engineering process seriously, that can seriously hurt sales and your reputation. So to get this into production in such a short amount of time is a big deal. All right, so what is Froid? As I said, Froid is a technique for converting imperative UDF code into relational algebra expressions that we can then embed inside the query plan of the query that was invoking the UDF. Right? And the reason we want to do this is because it enables the query optimizer and the rest of the system to reason about what the UDF is doing and optimize it just like a regular query. You can think of it as converting imperative code into declarative relational algebra expressions. All right? And the advantage you get from this is that, from the application developer's perspective, you don't have to do anything. 
You just take the UDFs you already have, flip the switch that says 'I want to use Froid', and it does the inlining for you. You don't even need to touch your application. Again, this is super rare. Think about it: any time there's a new magic SQL function that can make your application go faster, oftentimes they'll keep the old one in place, and then you have to go change your application to use the new one, or change how you're sending the queries, or something like that. You don't have to do any of that with Froid. It just works, right? That's very cool. So what's going to happen is that a query shows up that invokes a UDF, and Froid does the conversion in the binder/rewriter phase, before you get to the cost-based query optimizer. So the rules we're going to apply to do this inlining don't require a cost model. They're static transformation rules that say: here's how to manipulate the UDF to convert it into something that I can inline. The only thing you need to consider is a parameter or flag that specifies how deep you want to go when inlining nested UDF calls. You can have UDFs calling UDFs calling UDFs, and if you try to inline all of that, it becomes more difficult for the optimizer to plan and optimize, because now you have this giant, giant query, whereas before it was just a bunch of UDF calls. So as far as I know, there's only one parameter, which says how deep you want to go. Essentially, what Froid is going to do is try to do this inlining by creating subqueries. And the reason we want to do this is because all the major database systems, especially SQL Server, have very sophisticated transformations for unnesting subqueries and converting them into either joins or temp table invocations, right? 
So we're going to convert all our UDFs into subqueries and then let the optimizer rewrite those subqueries into a more optimized form. Now, we didn't get to talk about subqueries too much this semester, but I really want to talk about them, and about lateral joins with the apply operator, because these are the setup for understanding what we're going to do. A subquery is basically when you have a SELECT query inside of another query, and the idea is that the database system can treat the nested query almost like a function that can return a single value or a set of values. So conceptually, the subquery is almost the same thing as a UDF: it's something invoked inside a query that gives you back values. As I said, the optimizer handles this with one of two approaches. Either it rewrites the queries to de-correlate and flatten them so that they become joins, or in some cases a single-table query. Or it breaks out the nested query, runs it separately first, and stores its result in a temporary table. Whether that's attached to the same query plan or broken out into a separate query plan depends on the implementation, but the idea is that we run the inner query first, materialize it into a temporary table, and then the outer query runs and joins against that temp table. The temp table is thrown away when the query finishes. All right, so let's look at a simple example of a rewrite. Say I have a query here where I want to get the names of the sailors that made a reservation for a boat on a particular day. I have a WHERE clause saying there must exist a tuple in the reserves table where the sailor ID on the inner table matches the sailor ID from the outer table. So when a database system sees this, and it can do rewriting, it can rewrite it as a simple join like this, right? 
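Both ideas, the subquery-to-join rewrite and the lateral join with the apply operator, can be sketched concretely. This uses the classic sailors/reserves schema the example references; the column names and the date are my assumptions:

```sql
-- Correlated subquery: for each sailor, check whether a matching
-- reservation exists on a given day.
SELECT S.sname
  FROM sailors S
 WHERE EXISTS (SELECT 1 FROM reserves R
                WHERE R.sid = S.sid AND R.day = '2018-10-15');

-- The optimizer can de-correlate and flatten this into a join
-- (DISTINCT preserves the semantics when a sailor has several
-- reservations on that day):
SELECT DISTINCT S.sname
  FROM sailors S JOIN reserves R ON R.sid = S.sid
 WHERE R.day = '2018-10-15';

-- A lateral join lets a derived table in the FROM clause reference
-- an earlier table at the same nesting level. SQL Server spells this
-- CROSS APPLY; PostgreSQL and Oracle spell it LATERAL:
SELECT S.sid, X.cnt
  FROM sailors S
 CROSS APPLY (SELECT COUNT(*) AS cnt
                FROM reserves R
               WHERE R.sid = S.sid) AS X;
```

Without the APPLY/LATERAL keyword, the derived table could not mention S.sid, because ordinary FROM-clause entries are evaluated independently of each other.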
Because the alternative is that I run the inner query for every single tuple in the sailors table, and that's gonna be super slow. So I can just rewrite it to a join and then do a hash join or whatever join I wanna do, right? Again, we're not gonna cover this in depth; there's a group in the course doing this for their final project, and as they discussed in their proposal and status update, there are different types of subqueries they're going to try to support in our system. Another thing we haven't talked about is lateral joins. The way to think about a lateral join: with nested queries, the inner query can reference the outer query, right? So inside my inner query here, I can reference the sailor ID from the outer table. But suppose the two queries are not nested but sit side by side, like a bunch of select statements in my FROM clause. A lateral join allows me to reference attributes from tuples produced by queries executed at the same nesting level, right? Without the LATERAL keyword, each subquery in the FROM clause executes independently, and they can't refer to each other's attributes. With it, we're allowed to reference those attributes, right? This'll make more sense when we see examples in a second, but just know that the Froid paper uses what's called the apply operator to do these lateral joins. That's SQL Server's construct; other systems like PostgreSQL and Oracle explicitly call this a lateral join. Okay, so let's go through Froid now. Froid has five steps. In the first step, we transform the T-SQL statements in our UDF into SQL queries.
Then we convert the UDF into regions that allow us to reason about the SQL queries within each region. Then we combine all the expressions from each region, the individual SQL statements within a region, into a single SQL statement. Then we inline that combined giant query into the query that invoked our UDF. And then we just run it through the query optimizer to do the optimization and the unnesting of subqueries as needed, right? Now, for this Froid example I'm gonna illustrate the steps using SQL queries, but the real implementation of Froid in SQL Server doesn't operate on SQL queries; it operates on relational algebra operators. I'm using SQL because it's easier to read this way. The second inlining technique I'll show you actually does operate on SQL queries, converting them into common table expressions. Its intermediate forms aren't exactly SQL queries, but what it inlines in the end is SQL, whereas what Froid inlines is relational algebra expressions. All right, so our first step. Assume we have the example from before: given a customer key, the UDF tells you what service level you're at. In the first step, we break up all the imperative statements in our UDF: setting the level variable, the select statement, the if clause. We can convert all of these into SQL statements. We have to do this because these are T-SQL commands; this is not SQL, not relational algebra. So we convert them to SQL statements so that we can then combine them together. It's almost always gonna be a one-to-one mapping: for one statement in the UDF, we produce one query. It's not always the case, but for our example it will be.
So setting the variable level to 'Platinum' is the same thing as a select that takes the constant string 'Platinum' and projects it as the attribute level in the output. That's a tableless query, which is valid SQL. Then the next one, computing the sum, is just the same query as before, computing the sum but projecting it out as the total attribute. The last one, if total is greater than one million, is just a CASE statement: nested if clauses that produce outputs. Since we don't have an ELSE branch here, we don't know what else to do, so we end up producing a NULL. All right, now that we know how to do this conversion, the next step is to break up our UDF into regions and construct a relational algebra expression or SQL query for each one. In the first region here, we declare some variables and compute the sum of the total price for the customer matching the given customer key. For each region we create a synthetic table where the values we'd store in variables become projected output attributes. Right, so in this case we initialize the level attribute to nothing, so we set it to NULL, and in here we compute that sum we had before and assign it to the total attribute. And we store this as the synthetic temporary table for this portion of the query, E_R1. Now for the next one, our second region covers this portion of the if clause, and again it's the same CASE statement we saw on the previous slide, right? It stores its result in the level attribute, and we store it as the E_R2 table. The third region is just the other part of the if statement, and it's the opposite condition of the second.
So if the total is less than or equal to one million, we set the level to 'Regular'. But notice, for both of these regions: in this case here, if the total generated up in E_R1 is greater than a million, set the level to 'Platinum'; otherwise, use whatever level value was assigned above, right? So I'm referencing E_R1 down inside here because I'm going top to bottom: if I didn't set level here, I wanna make sure I'm just propagating what level was before. I don't know whether it was set above me; in our example it's set to NULL, but it could have been set by some other computation. So I'm saying I don't know what it is, I'm not modifying it, just pass down whatever it was in the region above me. The other thing I'll point out, although it's cut off on the slide, is that I could have used a single region for the whole if/else, because the ELSE branch of the CASE statement up here could have just produced 'Regular' directly, and then I wouldn't need the separate total-less-than-or-equal-to-a-million region. For this example, you could do that. In other examples, you cannot, because you could have return statements inside the branches. This one is easy because inside the if clause I only set one variable, but I can do anything I want inside an if clause: if I fall into it, I can call another function, I can do a return statement, I can run a query. If that's the case, it'd be hard to fold it all into a single CASE with an ELSE. So for simplicity, we break it up into two separate regions. All right, the last region is the return statement. This one is easy to handle; we'll see it on the next slide.
It just produces whatever the output of level is when I combine the queries for the different regions together. For this example it's super easy, because inside my function there's only one return statement. The paper talks about how some UDFs can have return statements in all different parts of the program. At the end of the day, you need a single return at the bottom. The way they handle that is to create a synthetic variable called returnValue; instead of calling return at each site, you set returnValue wherever the return was invoked and then return that variable once at the bottom. All right, so now we have the regions we defined and the queries we defined for each region of our UDF. The next step is to merge these into one select statement. That's what this is here, and this is where the lateral join, the cross apply, comes into play, right? The first thing to point out: here's our return value from before, E_R3.level, right? That's the last thing produced at the bottom of the UDF, and we want whatever level was set to as the output. Now you can see why we need these cross applies: in my FROM clause I have three subqueries that aren't nested inside each other, right? They're executed as siblings, for lack of a better term. So in order for one subquery to reference attributes from the subquery above it, I've gotta use CROSS APPLY, the lateral join, so that I can reference E_R1 here. Without it, I can't do that; I'd get an undefined attribute error, right? This also shows you that the computation of each region depends on the previous one.
As I go down, I'm modifying level, and whatever it gets set to at the bottom is what I produce as my output. All right, so this is it. This is the SQL query we would generate for the original UDF written in T-SQL that we saw at the beginning, right? It will produce a result that is equivalent for any given tuple. So now we need to inline this, put it inside our original query. If we go back, the original query invoked the customer-level function for each customer key in the select output clause. So all we need to do is replace the customer-level call with the combined apply-operator query from the last slide. And here again are all the same regions we had before. Okay, so now we've inlined this giant SQL statement into our query, and we can run it through the optimizer and let it do its thing to produce a more efficient query plan by rewriting and unnesting our subqueries. What the SQL Server optimizer will generate is an equivalent plan like this, where it no longer does all those sub-selects. It just does a left outer join against a subquery where I compute the total amount of orders placed by each customer, doing a group-by on the customer key. Then I join against that, and the CASE statement moves up into the select output list to look at the result of the aggregation and produce the right level, right? So what went from a black-box invocation of this customer-level computation got completely rewritten into a simple aggregation followed by a left outer join. This is amazing, right? Think about this: they can do this for almost any UDF.
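The shape of that final plan, aggregate, left outer join, CASE in the output list, can be sketched in SQLite from Python. The customer/orders schema, column names, and sample rows here are assumptions for illustration, not the paper's exact TPC-H schema:

```python
import sqlite3

# Sketch of the post-Froid plan: the customer_level UDF becomes a
# LEFT OUTER JOIN against a per-customer aggregate plus a CASE expression.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE orders   (o_custkey INTEGER, o_totalprice REAL);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders   VALUES (1, 600000), (1, 500000), (2, 42);
""")

rows = con.execute("""
    SELECT c.c_name,
           CASE WHEN t.total > 1000000 THEN 'Platinum'
                ELSE 'Regular' END AS level
    FROM customer c
    LEFT OUTER JOIN (SELECT o_custkey, SUM(o_totalprice) AS total
                     FROM orders GROUP BY o_custkey) t
      ON t.o_custkey = c.c_custkey
    ORDER BY c.c_custkey
""").fetchall()

print(rows)  # [('Alice', 'Platinum'), ('Bob', 'Regular')]
```

The left outer join matters: a customer with no orders still appears, with a NULL total falling into the ELSE branch, which mirrors the UDF's default path.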
It's not like I had to pay someone a lot of money to sit down, look at my UDF, and figure out how to rewrite it into a query. Every invocation can be different: sometimes it'll be in the FROM clause, sometimes in the WHERE clause. But rather than rewriting every single SQL query that uses the UDF, by doing this conversion into one giant inlined SQL statement, I produce a more efficient query plan. Again: no change to the application code, no consultants, no changes to the query optimizer. The query optimizer could already handle all of this; there's nothing Froid-specific in the final query. It's all vanilla SQL Server syntax. So this is amazing. Another benefit is that the UDF is no longer a black box to the optimizer from a cost-model perspective. We can search for the most efficient physical plan for this SQL statement, or even the logical plan. We can also now execute it in parallel: I can have a parallel thread compute the group-by, and parallel threads produce the join output. With the opaque UDF, you couldn't do that. So again, I'm super excited about this because I think it's a huge win. Okay, now let's talk about additional optimizations the query optimizer can do because we've inlined our SQL statement. Let's look at a really simple UDF called getVal, where you input some integer value and it returns a string: if the value is greater than 1,000 it returns 'high', if the value is less than or equal to 1,000 it returns 'low', and then it concatenates the word 'value' onto that string. So you get 'high value' or 'low value' based on the input.
All right, say we have an invocation where we know the input value, so we can substitute 5,000 for x wherever needed. What we'd get with Froid is a select statement that looks like this, right? A lateral join where I take the output of the CASE statement depending on my input, producing 'high value' or 'low value'. Now let's walk through the optimizations the rewriter and optimizer will do to cut down the complexity of this program. It's gonna look a lot like what a traditional compiler does for languages like C or C++: what that compiler optimizer can do, the query optimizer can do here. The first is dynamic slicing: we recognize that we know the exact input value, so we can throw out the 'low' path in the program, because we know the constant for x means we'll never execute it. We just remove it entirely. Then we do constant propagation and folding: recognizing that we have a select query that just produces the output 'high', and another select query that takes that output and passes it through, we can collapse them, and likewise append the two constant strings together. Rather than invoking all these subqueries with the lateral join, I combine them into a single constant that I produce, right? And the equivalent happens in the UDF form up here. The last step is dead-code elimination: recognizing that I don't need to set the value to 'high', assign it to a variable, and then produce the output, I can just return the value directly. The same thing happens in the UDF as in the relational algebra.
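Those three passes can be illustrated with a toy Python version of getVal. This is not SQL Server's actual rewriter, just a hedged sketch of what each pass leaves behind:

```python
# Toy illustration of the three compiler-style rewrites Froid enables
# once the UDF is inlined: dynamic slicing, constant folding, and
# dead-code elimination. The function mirrors the lecture's getVal UDF.

def get_val(x: int) -> str:
    # original UDF logic: branch on the input, then concatenate
    label = 'high' if x > 1000 else 'low'
    return label + ' value'

# Pass 1, dynamic slicing: the call site passes the constant 5000,
# so x > 1000 is known true and the 'low' branch is dropped,
# leaving the expression 'high' + ' value'.
# Pass 2, constant folding: concatenating two constants collapses
# into the single constant below.
folded = 'high value'
# Pass 3, dead-code elimination: no variable assignment or subquery
# remains; the whole lateral-join chain is replaced by the literal.

assert get_val(5000) == folded
print(folded)  # high value
```

The query optimizer does the analogous thing on the inlined subqueries: slice away the unreachable CASE arm, fold the constant projections, and delete the now-useless lateral joins.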
So Froid is essentially achieving the same thing by using the query optimizer to do these rewriting steps. We're getting all the same benefits you'd get from a traditional compiler optimizer like GCC or Clang. But they didn't have to write any of that: because they got the UDF into relational algebra form, they got it all for free from Microsoft's, you know, very expensive and very good query optimizer. So what can Froid support? As of SQL Server 2019 and the paper you read, Froid supports most of the constructs in T-SQL: DECLARE and SET, SELECT statements, IF/ELSE, RETURN, EXISTS and other set operators. They can handle recursive or nested calls to other UDFs. They can handle all the possible SQL types. As of this paper and the production release in SQL Server 2019, they don't support loops, no FOR loops or WHILE loops; they don't support dynamic queries, because you don't know what the query's gonna be before it runs; and they don't support exceptions, which are surprisingly common in UDFs. So you might say, well, there are some limitations, but how prevalent are these things in real-world applications? In the paper, they looked at the top 100 SQL Azure databases running in their cloud and examined all the scalar UDFs people were using. There were about 85,000 of them, and of those, they found about 60% could be inlined with Froid's technique. Then in the paper they evaluate three different customer workloads, showing what percentage of the UDFs in each are Froid-compatible, right? So yes, there are UDFs out there using FOR loops and dynamic SQL that Froid can't support, but the percentage it covers is quite significant. And the speed-up for some of these customers is very impressive.
There are some cases where Froid actually makes things worse. I forget why the paper said that was the case, whether the optimization took longer or it picked a bad query plan; that I don't remember. But for the rest: the x-axis here is each unique invocation of a particular UDF, sorted by the speed-up it gets. For the best cases, they're getting up to an 800x improvement in performance for UDFs without changing any application code. This is unheard of. Other than buying new hardware, and unless you're running on computers from 1990 and you buy brand-new machines, where is it ever the case that you can get a huge speed-up like this without modifying your application code at all? And I think this is telling about why Microsoft got this into production so quickly: the performance benefit is so massive. The codegen stuff we talked about, sure, if you go from an interpreted engine to a compiled engine, you get a huge speed-up; that would be an example of something similar, where you run the same SQL queries and get that kind of improvement. But the vectorization stuff we talked about, like in the Columbia paper, for individual algorithms they got a big speed-up, but only when everything fit in CPU cache. When things exceed the cache and spill to DRAM, maybe at best you see a 15 to 20% improvement. This is 800x. That's massive. Again, this is why I teach this paper, because I think it's a huge deal. Okay, so as you can imagine, Microsoft has patents on this.
So that limits things. I'm not a patent lawyer, but I imagine using the Froid technique in other database systems could cause some legal problems. The good news is there are other ways to do this inlining without going through the same transformation steps that Froid does. This comes from a paper published this year, in January at CIDR, on how to convert UDFs into common table expressions. Rather than rewriting the UDFs into relational algebra expressions that get inlined into the query plan, we're instead gonna rewrite them into SQL queries using recursive CTEs, recursive common table expressions. Besides avoiding Froid's exact approach, this also gets you support for iteration, loops, and other control-flow constructs that were not possible in that version of Froid. Now, I will say there is a newer version of Froid, Froid 2 if you wanna call it that, being published this year at SIGMOD. I was given an early preview of the paper a few days ago, so I don't have time to discuss it here. In that version, they do support loops. But up until now, the approach I'm showing you here is the only one that can do this. The other interesting thing about this approach, from another set of Germans, is that the rewriting technique they propose can be implemented as a middleware layer that doesn't need to be tightly coupled with the actual database system implementation. In the paper they do this on Postgres, but you could do it for Oracle or any database system that supports UDFs and recursive common table expressions. All right, so let's see how they do it and how it differs from Froid.
It also has five steps, where the last step is again running the result through the query optimizer. But they take a different approach; I can't comment on whether it's more principled from a programming-languages perspective, that's not my area. It's just a different path to a similar end result: being able to inline these UDFs. We'll go through each step one by one, but basically we start with PL/SQL or PL/pgSQL UDFs that can have arbitrary iterative control flow, going beyond what Froid can do. The first step is to convert them into SSA form, which turns the program into one that uses goto-based control flow. Then we convert the SSA form into administrative normal form (ANF), which gives us mutually tail-recursive functions. Then we convert that recursion into direct tail recursion, which we can rewrite into a CTE using WITH RECURSIVE. And now, as with Froid, we have one giant SQL statement that we shove off to our optimizer. All right, for this one we're gonna show an example in PL/pgSQL. It's different from T-SQL, but as you can see, it looks roughly the same, right? The difference is that you declare your variables separately before the BEGIN, whereas in T-SQL you declare them after the BEGIN. Minor differences like that, but at a high level it's the same thing. This is the power function: we're given a base value and an exponent, and we raise the base to the nth power. So for a given x, take it to the nth power. The way we do that is a while loop that runs for n iterations, multiplying the running value by the base each time, and then we return it.
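Before the SSA and ANF rewrites, it helps to pin down the UDF's semantics. This is a plain Python transliteration of the lecture's PL/pgSQL power function (variable names are my choice, not the paper's):

```python
# Python transliteration of the power UDF: a while loop that
# multiplies a running product p by the base x, n times.
def power(x: int, n: int) -> int:
    p = 1   # running product (DECLARE p := 1)
    i = 0   # loop counter    (DECLARE i := 0)
    while i < n:
        p = p * x
        i = i + 1
    return p

print(power(2, 10))  # 1024
```

Every rewrite that follows (goto form, tail recursion, recursive CTE) must compute exactly this function.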
When we convert it into SSA form, the control flow of the function is now expressed with goto statements. The other aspect of SSA is that each variable within the program can only be assigned once. I'm not a compilers person, but my understanding is that this is what compilers do internally when they lower programming languages. So in this case, it's the same power function: I assign my variables and then use gotos to jump to the exit to return values, or to loop back through the body of the loop. All right, now that we have our UDF in SSA form, we want to convert it to A-normal form, and this gives us mutually recursive calls where the recursive call to the other function is always at the tail, the bottom, of the function block. So we call power, set up some variables, then call our while-loop function, and it spins through. And this here is our exit clause, right? If we've iterated enough that i is greater than or equal to n, the number of times we wanna multiply, we break out; this is essentially the return statement. Otherwise we jump into the body, do the computation, and loop back, right? Now, from the A-normal form, we can convert it into direct recursion. I'm simplifying the step here, but I'm combining the while block and the body block into a single run block that does the same computation as before. Now there's a single recursive function that only uses tail-recursive calls, right? There are no recursive calls anywhere else in the function. And with this, I can take the imperative code and convert it into a giant SQL statement with CTEs.
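The mutually tail-recursive shape can be sketched in Python. This is a structural sketch of the ANF idea, not the paper's exact intermediate form; Python doesn't actually eliminate tail calls, which is why the recursion limit is raised:

```python
import sys

# Python lacks tail-call elimination, so raise the limit for deep loops.
sys.setrecursionlimit(10000)

def power(x: int, n: int) -> int:
    # entry block: initialize the loop state, then tail-call the loop
    return while_loop(x, n, p=1, i=0)

def while_loop(x: int, n: int, p: int, i: int) -> int:
    if i >= n:
        # exit clause: this plays the role of the return statement
        return p
    # body + back-edge: the recursive call is in tail position,
    # which is exactly what the CTE rewrite relies on
    return while_loop(x, n, p * x, i + 1)

print(power(3, 4))  # 81
```

Because every recursive call is the last thing its function does, each "call" is really just a state update, which is what lets the next step express the loop as one self-referencing query.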
This is a bit gnarly to look at, but briefly, here's how to think about it. The first part of the function does initialization: setting up the i value to track the number of loop iterations and the p value to track the running product as I raise the base to the power. That's the select statement here, which just sets up some variables as the initialization clause of the CTE. Then the UNION ALL allows me to invoke the same query over and over, giving us the recursive calls we need. All of this here is combined into the SQL statement that does the logic we need, right? If I've gone past my number of iterations, this predicate evaluates to false and I break out. And this thing down here, which is sort of confusing, this "call?" is the name of an attribute we're defining; the name literally contains a question mark, and we put double quotation marks around it so it's treated as valid SQL syntax. We use that attribute to decide, true or false, whether to keep calling ourselves, and when it evaluates to false, the program breaks out, unwinds from all the recursive calls, and produces the output of the select statement. So again, this is wildly different from what Froid is doing, but it conceptually ends up with the same result: you have the ability to inline an arbitrary UDF, actually with loops, where the loops come from recursion, and to invoke it in a query. At this point, it's unclear whether this produces more optimal plans than Froid; so far, that evaluation has not been done.
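The end result of the rewrite can be sketched as a recursive CTE. The paper targets Postgres and its generated SQL is more involved (the "call?" bookkeeping attribute, the ANF plumbing); this is a simplified hand-written equivalent using the portable WITH RECURSIVE subset, run here via SQLite:

```python
import sqlite3

# Hand-simplified recursive-CTE version of the power UDF:
# the initialization SELECT seeds the loop state (i = 0, p = 1),
# the UNION ALL branch is the loop body, and its WHERE clause
# is the exit condition that stops the recursion.
con = sqlite3.connect(":memory:")
x, n = 2, 10
(result,) = con.execute("""
    WITH RECURSIVE run(i, p) AS (
        SELECT 0, 1                       -- init: i = 0, p = 1
        UNION ALL
        SELECT i + 1, p * :x FROM run     -- body: one multiply per step
        WHERE i < :n                      -- exit: stop once i reaches n
    )
    SELECT p FROM run WHERE i = :n        -- the 'return' of the UDF
""", {"x": x, "n": n}).fetchone()

print(result)  # 1024
```

The whole while loop now lives inside a single SQL statement, so the engine parses, plans, and executes it once instead of once per UDF line.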
This approach also does not support dynamically constructed SQL queries, and it doesn't support exceptions. But this is something we're actually talking about now with Karthik and Torsten, the other friendly German who wrote this work: we're looking at comparing all these approaches together, along with compilation of UDFs. Like, for the UDFs you can't inline, what happens if you compile them with LLVM? This is ongoing research we're exploring here at Carnegie Mellon. All right, one graph to show the benefit you can actually get. This is running that same power function on Postgres 11, and the x-axis is scaling the number of iterations, how many times you keep multiplying the base value. The black line is the original UDF and the red line is the CTE version, and you're getting roughly a 40 to 50% performance benefit over the UDF call. For this particular example, there are no issues with the optimizer choosing bad plans due to not knowing the cardinality of UDFs; this is just showing the pure performance benefit of using recursive CTEs versus the UDF. And where the benefit comes from is that in Postgres, as I said, the individual commands within a UDF get converted into statements that are sent through the parser, the binder, the optimizer, and the query executor for every single line. Essentially, every single line of the UDF invokes a query. So this shows that avoiding all that overhead by running it as a recursive CTE is a big win. Okay, so as I said, these results are huge. In the case of Froid, the performance gains are quite significant. The CTE approach is still very new, so it's unclear how much benefit you're going to get from it, but the early results are very promising.
And as I said, show me another example where, unless somebody was doing something stupid like Bitcoin mining in the middle of your query engine while you run queries, you don't have to change anything in your application and you get a 500x improvement. This is quite significant. As I said, another approach we're exploring is how to compile the UDF into machine code the same way you would a query. That doesn't solve the optimizer's cost-model problem; you still need the CTE or the Froid approach to get that benefit. But for some computations that you can't inline, compilation may actually be the way to go. I think the real solution ends up being a hybrid approach, where some portion of the UDF is inlined and another portion is codegen'd, and how you balance that and where you make the cut points is the interesting direction I want to explore. Okay, all right guys. Next week will be the last lecture of the semester, and it'll be on a recurring theme in databases: what happens when new hardware comes out, and how can your database system take advantage of it? Right now machine learning is probably the hot thing, so there's a lot of hardware tailored for machine learning, but traditionally, when Intel or the disk manufacturers or whoever put out new hardware, one of the obvious targets is database systems, because no one ever complains when you make your database system go faster. So next class we'll see some of the newer hardware that's coming out that looks a lot different from what we've seen in the past, and how it affects the way we design and architect a database system. Okay, all right guys, wash your hands, take care. See ya.