Carnegie Mellon University's Advanced Database Systems course is filmed in front of a live studio audience. Okay, so today's class is going to be a little bit different than what we've talked about throughout the semester, because so far we've mostly been discussing the internals of a database system at the lowest levels and how to make queries run faster. Today's going to be a different topic where we're further up the stack in the system, doing a bunch of tricks before we even get to the query optimizer when the SQL query shows up, to make the query run faster given the architecture we designed below. And this will be the theme from this point in the semester: for this lecture and our next lecture, it's about getting things in and out of the database quickly. Then next week we'll spend a lot more time on the query optimizer. And I know I need to finally post the papers we'll be reading next week, but I'll take care of that today or tomorrow. So just a reminder: before the break, we spent two lectures discussing join algorithms. We discussed how to do parallel hash joins, because, as I said, every database system needs to be able to do hash joins, right? If you're relational and you want to use SQL, you need joins, and hash joins are almost always going to be the fastest. Then we spent a whole lecture talking about worst-case optimal joins. Although very, very few systems do this now, it's something they're all going to need to support within the next decade as people start doing more graph-like things on their databases. All right, so today's lecture, again, we're focusing on how to embed more complicated things inside of our database system to execute queries. We'll loosely categorize these as embedded database logic.
So we've made the assumption that the scenario we're supporting in our conceptual database system is that there is an application or some tool the user is using. They're interacting with the application; they're either typing in SQL queries or using a dashboard. The application sends over SQL queries that we compute in their entirety and then send back the result. So the scope of what those queries can execute or operate on, the computation they can perform on our data, is limited to whatever the database system itself actually supports. In some cases, very common especially in the Python/Pandas world, you'll see people just do SELECT * queries to get all the data out of the database system, bring it into a Jupyter Notebook or Pandas or whatever, do some additional computation on it, and then push the result back to the database server. If we can avoid that, and in some cases we can, some cases we cannot, then obviously the database system will have a complete view of what you're trying to do with your query or your data, and it can optimize accordingly, assuming you have some of the techniques we'll talk about today. And it's always in a better position to operate on the data where it resides, rather than always having to bring it out to an external application. So this is what I mean by embedding the database logic. The benefits are kind of obvious, right? Fewer network round trips, as I said: if I can do one query that has all the computation I need for whatever result I'm looking for in that single request, that's fantastic, rather than having to go back and forth. Now, if I'm incorporating changes, this may not matter so much in the lakehouse world we've been talking about.
But rather than me having a sort of stale snapshot of my data that I'm processing locally, if I push all my computation to the database server, then as new data arrives, I'll see those updates immediately. And we're not going to talk about transactions, but think of it in a transactional world: if I call BEGIN, run a query, get back the result, and do some processing on the application side, the database server is holding locks while I'm doing that computation. So if I can push all my computation to the database server, I don't have to go through those round trips. This one is debatable: whether you allow your developers to avoid re-implementing functionality by using embedded database logic. I say it's debatable because oftentimes in large corporations or enterprises, the people that write the application software aren't the same people managing the database servers. The application developers might be on one engineering cycle, but the database developers are usually very conservative. You may say, here's my new user-defined function (UDF) or stored procedure, but the DBA says, well, I've got to vet this; it's going to take a couple of weeks before this actually happens. So you end up having developers re-implementing the same thing, just in different code bases. We certainly saw that in the case of the Velox paper, right? They talked about how there are 11 implementations of substring across all of Facebook. And then this last one, again, encompasses all of this: we're going to be able to extend the functionality of the database system to go beyond its built-in capabilities. This is one of the original motivations for user-defined types and user-defined functions that Stonebraker likes to talk about: when they built Ingres and started trying to sell it to a bunch of banks, all the banks were computing interest for accounts on the Julian calendar.
Whereas the rest of the world runs on the Gregorian calendar. Ingres at the time didn't have a Julian date type, so that meant the developers had to go modify the database system to add this new date type. But if you allow users to define their own types, their own functions, and other things, then developers can extend the system without having to recompile the binary, right? So there are different categories of embedded database logic. The most common two are going to be user-defined functions and stored procedures. They're conceptually the same thing: a function of some kind of procedural code that you can run in your database server. The difference is that a stored procedure you can invoke outside of a SQL query. I can call EXECUTE and then the name of the procedure, and it'll just run like an RPC call. Whereas a UDF has to be embedded inside of a SELECT statement or a SQL query. In some systems like SQL Server, they make the distinction that user-defined functions cannot update tables; you can't call INSERT, UPDATE, or DELETE in a UDF. Postgres doesn't make that distinction. Whereas a stored procedure in SQL Server is where you can call update queries. Right, so again, a bunch of these you should be mostly familiar with, but the one we're going to care about today is user-defined functions. And this survey comes from the follow-up to the Froid paper you guys read, where they actually did a survey of real customer databases in Azure and counted what real UDFs and stored procedures look like. That's where they came up with a pie chart like this. All right, so user-defined functions, and this should be review for everyone here. A user-defined function is a function written by the application developer that allows us to extend the functionality of the database system beyond its built-in operations, its built-in functions.
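As a concrete sketch of what registering a custom scalar function looks like, here's a minimal example using Python's sqlite3 module as a stand-in database engine; the table name and function name are made up for illustration.

```python
import sqlite3

# Stand-in database with a tiny illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (x INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?)", [(1,), (2,), (3,)])

# A custom scalar function the engine does not ship with.  To the
# planner, its body is an opaque black box.
def my_double(x):
    return x * 2

conn.create_function("my_double", 1, my_double)

rows = conn.execute("SELECT my_double(x) FROM foo ORDER BY x").fetchall()
print(rows)  # [(2,), (4,), (6,)]
```

The engine happily calls the function per row, but it has no idea what the function computes, which is exactly the optimization problem discussed later in the lecture.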
So the SQL standard specifies there's a substring function, and every system that supports the SQL standard is going to have its own implementation of it. But if I have some weird, wonky substring variant that I want to use for whatever reason, it's not realistic for me to assume that my database server is going to have that. I can write it as a user-defined function with the exact capabilities that I want, and then I can port my application technically anywhere. A lot of times when you see people that have migration services, like, I'm running on Oracle and I want to switch to Postgres, or I'm running Teradata and I want to switch to Postgres, they'll take whatever custom functions you're using from the proprietary database server and re-implement them as user-defined functions to ensure compatibility. So the function is pretty straightforward. You take some input arguments, always as scalars, you perform some kind of computation, and then you return the result either as a scalar or a table. For our purposes here, we're going to assume that the UDFs are pure functions; in other words, they're not going to call outside things. In some database servers, you can actually make RPC calls to remote services. To keep things simple today, we're going to assume that everything runs inside of the function itself and doesn't escape, although one function can call other functions. So again, conceptually, it looks like this. This is our application. It wants to execute some SQL that has some kind of programming logic: conditional clauses, calling whatever libraries it wants, executing more SQL, then some more programming logic, and so forth. So what would happen is, if we can take maybe these two portions here and embed them as functions inside the database server, then we can rewrite our application to just invoke the queries and the functions like this.
And then now there isn't this back and forth of pulling a bunch of data, processing it, passing it on to the next query, and so forth; I can keep everything on the server side. Again, obviously, for some things, like if you're calling machine learning libraries like PyTorch, it doesn't quite make sense to express everything as a UDF in the native language of the database server. There are tools, there are extensions to Postgres and other systems, where you can make calls into PyTorch; they basically have UDF wrappers for that. Like I said, we're going to ignore that for today. So today we're talking about the background and the challenges of UDFs. Then we'll talk about three techniques to optimize them. The first one is going to be the inlining approach from Microsoft that you guys read about. Then there's follow-up work from another set of Germans to convert UDFs into common table expressions, or CTEs, with lateral joins. And then we'll finish off with batching and some numbers about which systems can support these various techniques. All right, so I've already said this: a user-defined function basically takes some input, does some processing, and computes the output. But there are broadly two categories of UDFs that we're going to care about. The first would be SQL functions, where the inside of the function is literally just queries, one after another, separated by semicolons. The output of the function, when you invoke it, will be whatever the output of the last query is. So for input arguments, we take integers; there's this RETURNS clause that defines what you can return. In this case, we're going to return tuples that have the same schema as the table foo. And then we have our computation, the function body, down here. So for this example, I can invoke it either in a query without a FROM clause or embedded inside the FROM clause itself.
Or in some cases, I can put it in the WHERE clause. You can put these function calls anywhere. So this is not that interesting from our perspective today, because we can more or less treat this as a macro. In this case, when we call get_foo inside this query, the database will literally take all the SQL queries inside the function body and just inject them inside of this thing as a nested query. At that point, the optimizer knows what it's operating on, because it's dealing with SQL queries, and it can do whatever it wants. Now you see why SQL Server doesn't allow you to do update queries: because if I have update queries inside of this thing, that can certainly change the order in which I execute things. And if it's a SELECT query with updates inside of it, then things get weird. The types of UDFs that we're going to care about today are the ones written in a procedural programming language. So the SQL standard specifies something called SQL/PSM, Persistent Stored Modules, and that goes back to the '90s. Yes, question? The question is, will the database system enforce strict ordering when you have update queries inside a SQL function? I think yes; they'll literally just blindly copy it in. I actually don't know what Postgres does when you have an update in there. They'll just blindly copy it in; it's almost like a view. In that case, there are rewrite rules in Postgres, and they literally drop it in. But if you have a big query, I don't know what they do. But yeah, you want to keep the order correct. All right, again, so the SQL standard specifies this thing called SQL/PSM. And as in all cases with SQL, there's a standard, but nobody exactly follows it. Everyone does something slightly different. But at a high level, they're all going to look the same.
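The macro-style expansion of SQL-bodied functions described above can be sketched directly. SQLite has no SQL-bodied CREATE FUNCTION, so this example expands a hypothetical get_foo(x) function by hand and splices its body into the caller as a nested query; all names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", [(1, "a"), (2, "b")])

# Pretend function body:  get_foo(x) := SELECT * FROM foo WHERE id = x
GET_FOO_BODY = "SELECT * FROM foo WHERE id = {x}"

# "Inlining" the macro: the body is injected as a nested query, so the
# optimizer now sees plain SQL instead of an opaque function call.
inlined = "SELECT val FROM (" + GET_FOO_BODY.format(x=1) + ")"
result = conn.execute(inlined).fetchall()
print(result)  # [('a',)]
```

Once the body is spliced in, the optimizer is free to flatten or reorder it like any other nested query.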
The built-in or standard programming language for UDFs is going to look very similar to Ada, because the story goes that the guy who invented UDFs and stored procedures was really into Ada. If you've never heard of Ada, it's like a modern variant of Pascal; it's an older language from the '70s or so. That's why you have the declared variables in the beginning, but it's all very archaic. So the SQL standard specifies SQL/PSM. Oracle's got their own PL/SQL. Postgres has their own dialect of PL/SQL called PL/pgSQL; it has some Postgres idioms in there. DB2 had their own UDF language, but now I think you can install SQL PL modules that look like the Oracle one. And the one we're going to spend most of our time talking about today is this thing called Transact-SQL, originally from Sybase. Again, it's going to look a lot like PL/SQL or the SQL/PSM in the SQL standard. There are those at-signs they use everywhere to declare variables, whereas PL/SQL doesn't have that. So again, for more historical background: Sybase came first. Sybase was in the 1980s, and I think they were one of the first systems that supported UDFs. Ingres had UDTs in the '70s, but Sybase had UDFs. Microsoft bought a license to the source code of Sybase in the early '90s to port it to Windows NT, to compete against IBM. And since then, it's been a hard fork of the source code; SQL Server has basically been rewritten. Sybase is still around, and they're still making a lot of money. No new startup is going to say, hey, I'm going to use Sybase, but a lot of the banks still do. But for historical reasons, because Sybase had Transact-SQL, that's why SQL Server has Transact-SQL. There are other programming languages you can get. Like in Postgres, you can get Tcl, you can get Python, you can get Perl; you can get UDFs in almost any arbitrary language.
If you're crazy, you can write UDFs in C, which is a bad idea if you're operating the database system, because now you're linking in a shared object in C that can touch anything in your address space. For security reasons, it's a nightmare, and obviously for stability reasons, it's a nightmare. So in some cases, like in Oracle, for example, you can write UDFs in C, but then they transpile them to Pro*C, which is their dialect, and run it as a separate process. So if it crashes, it takes down the UDF, not the whole system. All right, so let's look at an example of what a UDF looks like, written in, in this case, Transact-SQL. So this is a really simple UDF where we're going to take a customer ID, a customer key, and count the orders they've made over their lifetime as a customer. Then, depending on whether they've spent a certain amount of money, they'll get a platinum level or they're a regular customer. In this case, we're invoking the UDF inside the projection output of the SELECT statement. So you can sort of think of this as a for loop iterating over every single customer, and then invoking this customer-level function by passing in that customer key. And again, because it's based on Ada or Pascal, we declare our variables in the beginning, and the at-signs tell us that it's Transact-SQL. All right, so a lot of this I've already said: UDFs are great because they allow us to break up complex logic in our application into separate functions, and potentially allow different parts of the application code to reuse those capabilities. In some scenarios, too, you see applications written in different languages, like there's the mobile app and then there's the web server app.
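A runnable sketch of that customer-level UDF, again using sqlite3 as a stand-in. Note that the real T-SQL version runs its own aggregate query inside the function body; here the aggregate is passed in as an argument so the example stays self-contained. The table name, column names, and the one-million threshold are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ckey INTEGER, totalprice REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 600000.0), (1, 500000.0), (2, 100.0)])

def customer_level(total):
    # IF @total > 1000000 THEN 'Platinum' ELSE 'Regular'
    return "Platinum" if total > 1_000_000 else "Regular"

conn.create_function("customer_level", 1, customer_level)

# Invoked in the projection output, once per customer group.
rows = conn.execute(
    "SELECT ckey, customer_level(SUM(totalprice)) "
    "FROM orders GROUP BY ckey ORDER BY ckey").fetchall()
print(rows)  # [(1, 'Platinum'), (2, 'Regular')]
```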
In that case, they're usually always talking to a standard application server, but in some cases you can go directly to the database server. And now, instead of having to re-implement logic in the different programming languages, if they're all UDFs, then you can just reuse that. We already talked about reducing network round trips. And then for some things, UDFs can be very helpful because it's easier to write some complex logic in a UDF versus SQL, right? Data analysis stuff is very common here. All right, so this all sounds great. Why aren't UDFs more common, then? Well, the number one problem we're going to face is that if the UDF is written in a procedural language like PL/SQL or PL/pgSQL, the query optimizer doesn't know what's inside of that function. Again, SQL is declarative, so the SQL query itself specifies, here's the answer I want, and the system's optimizer can reason about the expressions and the operators within that query plan to make estimations of the selectivities of the various computational steps in the plan. But now, if I have this function I'm calling in some language that isn't SQL, what is the cost of things? So if I have my WHERE clause, say, WHERE val = my_udf(123), and this my_udf is written in C, or even PL/SQL, do I know what the selectivity is, what percentage of tuples are going to match this? You don't, because you don't know what's inside of the function, right? So that's the number one problem we're going to face. The other challenge is that it's going to be hard for us to parallelize our UDFs and take advantage of the vectorized query processing model, or even run the query across multiple threads. Because again, we don't know what's inside of the function, right?
It may just be that we've been doing an implicit nested-loop join, because the outer query is invoking the function once per tuple, and inside of that function I'm doing another lookup on another table, which is basically doing a join. And because there's a separation between the SQL side and the UDF side, the optimizer can't have a holistic view of the entire query and do all the optimizations we know how to do, like switching to hash joins and so forth. Things get really nasty, though fortunately this isn't that common: some UDFs will actually construct a query string inside the UDF, like incrementally building up a SELECT statement and then invoking it. You're allowed to do that in PL/SQL, meaning I declare a string with SELECT, and with a bunch of conditionals I'm appending literal SQL to it, and then I execute it. In that case, you have no idea what you could possibly be doing, because you don't know what the SQL query is going to be until you actually run the function. So you're screwed. And for this one here, no one's going to solve this. Again, we did a survey where we scraped GitHub and tried to see how common this was: it's less than 5%. It's not that common. At least, again, that's a static evaluation of just looking at the UDFs. We don't have a number for how often they're invoked, but we don't think they're very common. So, related to what I was saying before about the parallelization stuff: if you can't figure out what's inside the UDF, and now you're just looping over the outer table of the calling SQL query, then for every single tuple in the outer query you're invoking the function; you're literally calling the UDF one at a time for every single record. In the Microsoft world, they call this row by agonizing row, RBAR. And as I said, inside a UDF you can be invoking other queries that you can't see until you actually run it.
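The row-by-agonizing-row behavior is easy to observe: the engine invokes a scalar UDF once per row, which this sketch counts with a call counter (names and row counts are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

calls = {"n": 0}

def noop(x):
    calls["n"] += 1  # count each per-row invocation
    return x

conn.create_function("noop", 1, noop)
conn.execute("SELECT noop(x) FROM t").fetchall()
print(calls["n"])  # 1000: one call per record, no batching
```

A thousand rows means a thousand function invocations, each with its own call overhead, which is exactly what batching and inlining try to eliminate.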
Then the optimizer has no way to say, oh, these are just a join, let me combine these together. Or, I'm executing the same query over and over again, let me cache it and reuse it. So it's been well known for a while that UDFs are bad; they're going to make your queries run slower. There's this sort of semi-famous blog article from 2006 where they're very blunt and say T-SQL, that's Transact-SQL, scalar functions are evil in SQL Server. And they give a bunch of examples that cite a bunch of the problems I just talked about. So here's one query that takes 2,600 milliseconds, so that's 2.6 seconds. But if you add the UDF, it goes to 38 seconds, just by adding a UDF. So for the developers and DBAs of SQL Server and other systems, and this is not just a SQL Server problem, every system has this problem, this has been well known for a while. And then Microsoft actually came out and said it themselves in 2008, a few years after that first one. This is an updated article where they introduced a new way to do compiled UDFs. But in it, they basically use the same term: oh yeah, RBAR, row by agonizing row, that's going to make your queries go slow. Scalar UDFs are, I think they said, evil incarnate, evil personified. They're very blunt. So Microsoft has been trying to solve this problem for a while. Again, what I'm telling you is not any big secret. People have known this for a long time. And UDFs are so important and make developers' lives so much easier that we want to figure out a way to optimize them. So here's another example from Microsoft. This is from the Froid paper. It's TPC-H query 12, and they took the WHERE clause that's just checking to see if the customer key is null, and they made a UDF that does a lookup on the customer table to see whether it returns back a valid customer key.
So this is a contrived example, because you're taking the original TPC-H query that didn't have this UDF and adding this one piece here. Without this UDF, the query takes 0.8 seconds, so 800 milliseconds. But if you add in just this UDF, which is really not doing that much, it goes to 13 hours. Because again, the database server doesn't know that for this I'm invoking this function and just checking to see whether the output is null. Well, it's the customer key. Am I doing a lookup on the customer table here? No, it's the customer key from the orders table; it's not going to be null. But because it doesn't know what the computation is inside of this thing, the optimizer just throws up its hands and says, OK, well, I'm going to execute this for every single row. And the overhead of that is pretty significant. So we'll see in a second that Froid is going to be able to take this function, inline it back into the query, and get this query back down to 900 milliseconds. So not exactly what it was before you added this piece here, but certainly not the 13 hours they were getting before. Yes? The question is whether this is strictly a SQL UDF. This example here, I don't think so, because I'm declaring variables and I have a RETURN clause, so this is not considered a SQL UDF. In a SQL UDF, you don't have variables or returns; it's literally just the SQL queries by themselves. So in that case, yes: if I got rid of the DECLARE, got rid of the RETURN, and got rid of this assignment to the variable, if it was just this query, then that would get inlined, and the optimizer could figure that out. OK, so how can we optimize this? Well, there are four basic approaches. Compilation we've talked about before, right? Instead of interpreting the UDF as we normally would, we could compile it into native code, and that'll run faster.
That doesn't solve the optimization problem, because although the function, now compiled, is much faster, it's still going to be a black box to the query optimizer. And as I said, Oracle does this, and SQL Server does this now; they've done it since 2016. Another approach is to extend the programming language for the UDF to introduce pragmas or directives or other hints that tell the database server what portions of the query could be optimized. So SingleStore has their own variant of PL/SQL; it came out when they were called MemSQL, so it was called MPL, the MemSQL Programming Language. They have a parallel version where you can write UDFs with additional hints to the optimizer to figure out how to parallelize stuff. But again, as far as I know, for that programming language the optimizer still sees the UDF as a black box. Inlining is the first one I'm going to talk about today: how to convert the UDF into some kind of declarative form that we can natively embed into our query plan, as if it were just a bunch of SQL queries, and then let the optimizer optimize accordingly. And then the last one actually predates inlining, but it was rediscovered by us and the Germans a few years ago: you basically take the UDF and convert it into a bunch of SQL queries that run in batches, multiple tuples at a time. And then you don't have the cost of invoking the function one call at a time; now you're operating on things all together. Again, I'll show examples of that as we go along. So in today's class, we're going to focus on these two, because, again, this is quite different from everything we've talked about so far. So, UDF inlining. The idea here is from Froid, and we'll use the name Froid because that's what it's called in the paper. Obviously, when you pay for SQL Server and download it, it's not called Froid.
If you look for Froid in the documentation, you're not going to see it; I think they call it UDF inlining. But again, the research name of the project was Froid. So the idea is that we're going to take our UDFs and convert them into relational algebra expressions that we can then inline into the SQL queries themselves. And we're going to do this before we get to the actual cost-based search for joins and the other parts of the query optimizer. The key thing is that these are static transformation rules: we can do this conversion, this transformation of the UDF into relational algebra, without needing a cost model, because we'll just let the query optimizer handle it as if it were any other query. So as I said, we're going to do this in the rewrite phase, before we get to the cost-based optimizer. Because our cost-based optimizer, in theory, and we'll cover this later, should be able to handle these subqueries effectively. SQL Server is not going to be able to do that. The Germans can do this. DuckDB can do this, because we did it for them. We'll cover that in a second. Let's talk real quickly about these subqueries. Again, I'm not going to say exactly how to do it the German way. But this is going to be the challenge. This is what the inlining approaches are going to leverage, because they're going to assume that the optimizer will be able to take care of these subqueries. And then we're going to introduce lateral joins, because that's how we're going to chain these things together to ensure that things execute in the order they need to execute in the UDF. But again, things will fall apart if it gets too complicated. So again, subqueries. The basic idea, and this is a refresher from the intro class, is that we just have a nested query: a SELECT query with another SELECT query inside of it. It can be anywhere.
It can be in the projection output, it can be in the FROM clause, it can be in the WHERE clause, it can be in HAVING or GROUP BY, anywhere you want. And so the two ways to handle them are: rewrite the query to de-correlate and flatten them into joins, and this is the best-case scenario, this is what you always want to do, but you can't always do it. Or you just pull out the nested query, run it once, put its results in a temp table, and then join that temp table against the calling query. Again, some systems do this. If in my WHERE clause I have something that wants an aggregate, like the max value of a column, I can run that once, materialize it as a temp table, and then just join against it later on. And you have to do this when you can't support DAGs in your query plan. Again, we'll cover all of this more thoroughly next week when we talk about query optimization from the Germans. All right. So rewriting, as I said before: we take this query, and inside of our WHERE clause we have a nested query. We can pull this out and basically do a join. And we see that we're doing a join on the orders table with the users table. And in this case, we would recognize the relationship between the orders table and the users table and realize, oh, we don't even need to access the users table, because everything we need is in the orders table itself. So this is the best-case scenario: we went from a nested query, where we'd invoke that nested query for every single record in the orders table, to removing the access to the users table entirely. And that'll be a big win. All right. So the other thing we're going to rely on, in addition to the nested queries, is lateral joins. I think we covered that in the intro class as well.
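The de-correlation rewrite above can be sketched concretely: both formulations return the same rows, but the join form is the one the planner can reorder and optimize. The schema here is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users  (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 3)])

# Nested form: conceptually re-evaluated per orders row.
nested = conn.execute(
    "SELECT id FROM orders WHERE user_id IN (SELECT id FROM users)"
).fetchall()

# Flattened form: a plain join the planner can optimize freely.
flattened = conn.execute(
    "SELECT DISTINCT o.id FROM orders o JOIN users u ON o.user_id = u.id"
).fetchall()

print(nested == flattened)  # True: both yield the same rows
```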
I think the first homework required it. The idea of a lateral join is that it allows a subquery in our FROM clause to reference values or attributes in other nested queries at the same nesting level. You can't normally do this with joins, right? Typically, if you have a subquery joined with another subquery, those two subqueries can't peek into each other and see what they actually have. A lateral join allows you to do that. And this is how we're going to guarantee that we execute the queries in the order they're specified in the UDF. So think of it like a bunch of for loops: for each clause in a lateral join, I'm iterating over every single tuple, and if necessary, I can then do lookups on the previous table in the join. So let's look at an example like this. Here I'm specifying that I have an inner join with a LATERAL, and inside this nested query here, you can see that I'm allowed to reference the SELECT query up here; I can reference the order's user ID and other things up here. So this reference here, o1.user_id, is this one up here, and this first_order is this one up there. The query optimizer, or really the binder, needs to figure out, OK, I'm referencing these things here, and a lateral join allows me to peek up at the one above me and see what it has. This example is a bit abstract; when we walk through the UDF, I think it'll make more sense. All right, let's go through the five steps of Froid. The very first thing we need to do is take our UDF and transform the T-SQL statements, or PL/SQL, whatever it's written in, into SQL queries. This works for everything that's in the SQL standard, with some exceptions: you can't use exceptions, you can't use certain other constructs. In the case of Froid, you can't have while loops or other conditional loops.
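SQLite lacks the LATERAL keyword, so this sketch emulates the key idea, an inner query referencing an attribute from the outer level, with a correlated subquery instead; the tables and the first_order name are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users  (id INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(10, 1), (11, 1), (12, 2)])

# The inner query "peeks up" at u.id from the enclosing query, the same
# way a LATERAL subquery can reference a sibling at its nesting level.
rows = conn.execute("""
    SELECT u.id,
           (SELECT MIN(o.id) FROM orders o WHERE o.user_id = u.id)
           AS first_order
    FROM users u ORDER BY u.id""").fetchall()
print(rows)  # [(1, 10), (2, 12)]
```

In Postgres or SQL Server, the same per-tuple lookup would be written as a `CROSS/INNER JOIN LATERAL` subquery in the FROM clause.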
But IF clauses, things like that, you can convert into corresponding SQL statements. Then we're going to break our UDF up into regions. Then we reason about their contents and understand the dependencies between those regions, because those dependencies are then going to get expressed through these lateral joins. Then we merge expressions, trying to combine the multiple expressions within one region, and link the regions together with lateral joins. And then we take the SQL query that we've generated through all the lateral joins and embed it into the calling query, the thing that was invoking the UDF. We're doing this at runtime. And then we just run this through our query optimizer. Now, in all my examples here, I'm going to show the conversions from the UDF statements into SQL. As I said, Froid actually operates on relational algebra; the Apfel approach we'll see afterwards does everything at the SQL level. So this is that example we had at the beginning: given some customer ID, we look up how much money they've spent with us, and then decide what customer status we're going to give them. Again, the first step is just transforming the contents of the UDF, literally the lines of code with semicolons, into corresponding SQL queries. So in the first case here, we're declaring a variable called level and setting its value to 'regular'. Well, that's just a SELECT query without a FROM clause, where we project the constant string 'regular' and assign it to an attribute called level. Nothing special there. In the next case, we're computing an aggregation on the orders table and assigning it to the total variable.
That's the same thing as taking the query, nesting it inside a SELECT, and renaming the output to total. That assigns it to the variable total. And the last one here: SQL itself does not have IF clauses, it has CASE WHEN. I think MySQL might have IF statements, but it's CASE WHEN in the SQL standard. So I convert this IF clause into CASE WHEN total is greater than a million THEN 'platinum', otherwise NULL, and then we assign that to the level variable. Yes? The question is, do we then lateral join all of these? Yes, but we're not there yet. Two more steps. Right? So this part seems pretty simple. I can conceptually see how I can map things: a variable name is just an attribute name in the projection output of my SELECT statement. In this example, it's basically a one-to-one mapping between a statement in the UDF and a SQL query. It doesn't have to be that way: multiple statements could get combined into a single SQL query, or one UDF statement could be split across multiple SQL queries. For our purposes here, to keep it simple, we're assuming one-to-one. So the next thing is we want to take this UDF and break it up into regions. And then for each region, we do the transformation I just showed, converting the statements inside that portion of the UDF into corresponding SQL queries. So in this case here, I declare two variables, total and level, and then I have this nested query here, where I assign the output of the aggregation to the total variable. That's the same thing as a SQL query where first I assign the level variable to NULL, and then I have this nested query where I compute the aggregation and assign it to total.
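The three statement-to-query translations can be tried directly. This is my sqlite3 approximation of the idea, with made-up table and column names, not the exact SQL from the slides.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (cust_id INTEGER, price INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 600_000), (1, 500_000), (2, 100)])

# DECLARE @level = 'regular'  ->  a SELECT without a FROM clause
level = con.execute("SELECT 'regular' AS level").fetchone()[0]

# SET @total = (aggregate)    ->  nested aggregate, output renamed
total = con.execute(
    "SELECT (SELECT SUM(price) FROM orders WHERE cust_id = ?) AS total",
    (1,),
).fetchone()[0]

# IF @total > 1000000 ...     ->  CASE WHEN, with NULL as the else
new_level = con.execute(
    "SELECT CASE WHEN ? > 1000000 THEN 'platinum' ELSE NULL END AS level",
    (total,),
).fetchone()[0]

print(level, total, new_level)  # regular 1100000 platinum
```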
And then for this region, I assign the output of this nested portion of the query to a temp table called ER1. It's an ephemeral temp table: in theory it resides entirely in memory, it's not persisted in the catalog, and it disappears once the query is over. It's like a table alias in a query. We do the same thing with the next region: convert it into the CASE WHEN statement, and then assign the output to the temp table ER2. But notice now that I have this variable total that I'm referencing from the region above me. That's where the lateral join is going to help us, because I basically have two separate nested queries here, but one of them needs to reference the other one; it has a dependency going up. So lateral joins are how we're going to connect them together. Same thing for level here. And then I have this next piece, the ELSE clause, and it's just the inverse: if the total is less than or equal to a million, then my status is 'regular'. Same thing: total can reference the one up there, and this level here is actually referencing what was computed before us in that one. Are we done at this point? What's the last one? RETURN, exactly. Yes. So how do we handle that? It's just another nested query. So we can take all these regions, convert them into SQL statements, and put them all together into one giant SQL query. And yes, I know I just said lateral joins; in SQL Server it's not LATERAL JOIN, it's APPLY, or CROSS APPLY. SQL Server uses CROSS APPLY; Postgres, SQLite, and Oracle all use LATERAL JOIN. They're basically the same thing. So as she said, the last step was the RETURN clause. But again, that's just a return.
Sorry, that's just the output of the SELECT statement up above that wraps all of this together. And in this case here, at the very top, I'm referencing ER3, which is generated down here by this nested query, and they're linked together through the lateral joins. Yes. The question is, why is the ELSE block a different region? Because, I think, they're trying to be pedantic in this example, showing you the different regions. But also, going back to the original UDF, you could have arbitrary things inside of that ELSE that you couldn't express through the CASE WHEN exactly. In this case here, I think they're being overly verbose, because they're going to say later on that once it gets into this form, the query optimizer can figure out, oh, this is referencing this, and here's this case statement and here's this case statement: they're disjoint regions, so it can just merge them together, because the optimizer already knows how to do that for WHERE clauses anyway. Other questions? So I had you guys read this paper. We'll see the Apfel paper in a second. This one I feel like I can understand, right? Because it's transforming SQL; that's all fine and dandy. The Apfel one is way more complicated because it basically converts to an IR like the ones used in compilers, and that PL stuff is not my area. OK, so now we want to actually inline the expression. So this is the original calling query that invoked this UDF. When this thing shows up, the customer-level invocation basically just gets replaced with this entire block here, which is the converted form of the UDF. Yes?
The question is: going back to the previous example, suppose we remove the ELSE from the IF condition, and instead always set level to 'regular' afterward, so that no matter what happened before, the variable gets overwritten. You would probably never write that, but people do stupid things. How would that work? Because there's an ER2.level and an ER3.level, and you wouldn't know which one to reference. Well, I would take SET level = 'regular' and generate SELECT 'regular' AS level for my third region, as ER3. The CASE WHEN goes away; the third region is literally just selecting the constant 'regular' as level. So no matter what happens in the earlier regions, it just gets overwritten. Then, going forward, when I put it all together and do my SELECT ER3.level, that's just whatever came out of that one. Now, is the query optimizer smart enough to figure out that whatever happens up here gets overwritten by this? Who knows. We'd have to open up SQL Server and see what happens.
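SQLite has no LATERAL or CROSS APPLY, but you can get a feel for the "regions linked by their dependencies" shape with chained CTEs, where ER2 reads ER1's output. This roughly mirrors, without reproducing, the structure Froid generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (cust_id INTEGER, price INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 600_000), (1, 500_000)])

# er1 computes the variables; er2 references er1 (the "dependency going
# up" that lateral joins express); the outer SELECT plays the RETURN.
row = con.execute("""
    WITH er1 AS (
        SELECT NULL AS level,
               (SELECT SUM(price) FROM orders WHERE cust_id = 1) AS total
    ),
    er2 AS (
        SELECT CASE WHEN total > 1000000 THEN 'platinum'
                    ELSE 'regular' END AS level
        FROM er1
    )
    SELECT level FROM er2
""").fetchone()

print(row)  # ('platinum',)
```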
So now if you throw this at the query optimizer in SQL Server, what you end up with is that all of these CROSS APPLYs get torn out and simplified into just a left outer join against the orders table. You're looping over every single customer record, and then you do a left outer join to compute the total amount of items they bought. If they haven't bought anything, you get NULL; that's fine. Otherwise you compute what the true output is. Pretty cool. Again, I'll say this next week: for most things, SQL Server has the best query optimizer in the world. Not for everything, right? For joins and nested queries, the Umbra and HyPer ones are better, and DuckDB is getting better. I'll give a preview of what we'll talk about next week at the end of today's class. So in this case here, what do we have? There was an implicit join in our UDF, and because we converted it into this CROSS APPLY, this lateral join contraption, the query optimizer was able to figure out that it actually is a join against the orders and customer tables. And in particular, it's a left outer join, because I may not always have an order record for a customer. And so now I can just do a join as I normally would, the hash join we know how to do really quickly. All the operations that were previously a black box inside of our UDF are now embedded as SQL, and the query optimizer can use all the statistics and other information it has to reason about the cardinality estimates for our query. It's parallelizable now, because we know there are no weird dependencies between invoking this query for one record versus the next, so all my threads can be running in parallel at the same time. And there's no function call overhead of setting up the call stack to go into some function for every single record.
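The end state the optimizer reaches, a plain left outer join, can be sketched like this (my toy schema again): customers with no orders get NULL from the join, which COALESCE turns into zero before the CASE test.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders    (cust_id INTEGER, price INTEGER);
    INSERT INTO customers VALUES (1, 'a'), (2, 'b');
    INSERT INTO orders    VALUES (1, 2000000);
""")

# The whole UDF collapses into one join + aggregate + CASE.
rows = con.execute("""
    SELECT c.name,
           CASE WHEN COALESCE(SUM(o.price), 0) > 1000000
                THEN 'platinum' ELSE 'regular' END AS level
    FROM customers AS c
    LEFT OUTER JOIN orders AS o ON o.cust_id = c.id
    GROUP BY c.id, c.name
""").fetchall()

print(rows)  # [('a', 'platinum'), ('b', 'regular')]
```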
And furthermore, they claim in the Froid paper that one of the big advantages of their approach is that it didn't require any engineering changes to the query optimizer itself. That's a whole complicated piece of machinery inside the database server, and if you can avoid modifying it, and therefore know there won't be any regressions for anybody else, that's fantastic. So if your query optimizer is very sophisticated, as SQL Server's is, you actually get a lot of the same optimizations you would get from a traditional optimizing compiler, like Clang or GCC, now inside your UDF. So let's say I have a really simple UDF: given some integer, I return back the string 'high value' or 'low value'. I would invoke it as, you know, SELECT getVal(5000) AS a, passing in some constant. So if I Froid this thing, I'm going to get, at least in the first version, something like this: we have the CASE WHEN statement, if the value is greater than 1,000 set it to 'high', otherwise 'low', then we do an OUTER APPLY, and we return back the string 'high value' or 'low value'. Well, a traditional compiler optimizer would recognize that because I'm invoking this with a constant value of, what was it, 5,000, it can do dynamic slicing and identify that I'm never going to go down the ELSE branch for 'low value', so it removes that dead code entirely. In the case of the SQL query, it's the same thing: I remove my CASE WHEN and just spit out the constant value 'high'.
Then, furthermore, I can do constant propagation and folding. Again, a traditional compiler optimizer would recognize that I don't need to concatenate 'high' and 'value' as separate steps; I can put them together at the very beginning by propagating the constant up. Same thing in our SQL query: the query optimizer can figure out, oh, this is just taking 'high' and appending it to the string 'value', so why do that as an OUTER APPLY with separate SQL queries? Let me just do it in one statement. Even further, you can do more dead code elimination: I don't need to declare an outer query return value, or set up the variable and then return it. It's just SELECT 'high value'. So again, you get all the same benefits as a traditional optimizer, but the query optimizer is doing this because it already can do it for queries today. SQL Server can; not everyone can. I don't think Postgres can do this. Yes? The question is, if this works so well, could we just stop teaching people SQL and use UDFs for everything? Again, it doesn't support everything. As of 2019, this is what you can do: DECLAREs and SETs, SELECT queries, IF/ELSE and ELSE IF, RETURN clauses in multiple locations of the function, and all the basic relational operators: EXISTS, NOT EXISTS, IS NULL, IN, ANY, and so forth. They don't support exceptions, they don't support dynamic SQL queries, and they don't support updates. Again, in SQL Server that's not a big deal, but in Postgres and other database systems you could do those things in a UDF. So to your original question, why do we need SQL instead of just using UDFs? You still need both, right?
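The getVal(5000) rewrite chain is easy to check by hand: every step is semantically the same query. A sketch with my own simplified SQL, not the slide's exact text:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Original inlined form: CASE WHEN over a constant, then concatenation.
q_original = """
    SELECT CASE WHEN 5000 > 1000 THEN 'high' ELSE 'low' END || ' value'
"""
# After dead-code elimination (the ELSE branch is unreachable):
q_dce = "SELECT 'high' || ' value'"
# After constant propagation + folding:
q_folded = "SELECT 'high value'"

results = {con.execute(q).fetchone()[0] for q in (q_original, q_dce, q_folded)}
assert results == {"high value"}   # all three rewrite steps agree
print(results)
```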
There are certain things you wouldn't do in SQL alone; your UDFs are going to be making SQL query calls inside them on the database server anyway, right? So SQL doesn't go away. So these are the results they shared in the paper. They had a bunch of different workloads from real customers; I think they got permission to extract the UDFs and sample the data, and they showed what benefit they got for a bunch of UDFs by inlining them with Froid. For the first workload, there were 90 UDFs, and 82 of them could be inlined with Froid. The second one had 170 UDFs, and 150 were compatible. And you can sort of see the long tail here. For the first customer, only one UDF actually had a regression, and I think that's because SQL Server is going to choke on handling a large number of lateral joins. Same thing for the other one here. But this is pretty significant, right? Some customers are getting almost 1000x speedups. That's huge, that's insane. That almost never happens in databases unless you rewrite your application or switch vendors to go from, say, a row store to a column store. Without having to make any changes to the UDF itself, just with Froid and inlining, the performance win is significant. And the inventor of Froid is Karthik Ramachandra. He was a PhD student at IIT Bombay, which is the best database school in India, and then he was at the Gray Systems Lab in Madison, Wisconsin with Jignesh, and then he worked at Microsoft on this. The paper came out, I think, in 2017 or 2018, and they got it shipped in SQL Server 2019, in like three years, which is insane. So there are a bunch of his tweets showing people talking about the benefit they're getting with Froid: 20x faster, you know, significantly more.
Or in this case, the query went from four minutes to nine seconds just by turning on the flag that says use Froid. Again, that's a huge win without having to make any changes to your application. I think in the paper they talk about the overall compatibility for UDFs across the top 100 Azure SQL databases; I think it was about 60%. So 60% of the UDFs could be inlined with Froid. All right, so this is one approach: take the UDF, convert it into relational algebra, and inline it, and I showed you how to do that through SQL. Yes, question. The question is, what is it about scalar UDFs, what additional challenges prevent us from doing this beyond scalar UDFs? Scalar UDFs are a construct where the function returns a single value. If you return multiple values, it's called a table-valued function. So why can't you do this for everything? I don't remember off the top of my head. Most functions out there are scalar UDFs anyway. Yes. The question is, are there restrictions on UDFs to make sure they actually terminate? That's usually a property of the execution engine itself, like how long a query can run: you set your timeout to say the query can run for one minute. The database system doesn't care whether you're spending all your time in a UDF or not. So yes, could you write an infinite loop in a UDF? Yes. The question is, is that why they can't convert loops? I don't think it's a timeout limitation. There's a section on limitations in the paper, and recursion is in there of course; the real limitation is in how far the translation itself can go, not the timeout.
There's a follow-up paper we're not going to cover called Aggify, where they show how to handle loops in UDFs that do aggregations. For that one, you rewrite that portion of the UDF as a user-defined aggregate and invoke that. That's another kind of translation; we don't want to focus on it, and as far as I know it never made it into production. Exceptions are the other weird one, because an exception is basically a goto: you're jumping to another part of the program, and they don't support that. All right, so now I want to talk about how to do UDF-to-SQL conversion using the Apfel approach, from another set of Germans, at Tübingen. And then I want to finish up talking about batching, which is an alternative approach to inlining. For this Apfel approach, what they're going to do is take your UDFs and convert them into common table expressions, basically SQL statements. And this is going to let them handle the looping that Froid can't do, plus additional constructs that Froid can't handle. And instead of embedding this inside the database server, they actually wrote it as separate middleware, a standalone compiler. I can give a quick demo. So if you go to this website, on one side you have the UDF. Can you make it full screen? What's that? Yes, they always call it LATERAL, but they're going to need recursive CTEs for this as well. So this is the original UDF, and this is what it spits out, and you see a lot of lateral joins and nested queries. I can do something really stupid, declare a variable X, and it spits it out, and you see now it's including the X variable and doing some of the same things, setting up the variables in the same way we saw before. If I change this to 99, you get the same thing. This is very similar to what we saw with Froid.
And if I actually run this, so this is Postgres, I've already installed the function. Here's the really simple function, and I execute it. If I run it now, the original UDF takes a couple of milliseconds, while their giant lateral join version takes noticeably longer; no, that half second was how long it took to install it. So in this case, the UDF call in Postgres is actually faster than using the Apfel version. Yeah, I got it. This is why I always use my laptop when I give demos in class; I want to keep it quick. And in this case, there's the cost of just invoking that function: you see that Postgres at the optimizer level can't expand the SQL query because it's embedded inside the UDF. The link's in the slides if you guys want to play with it. All right, so this is going to get a little PL-y and compiler-y, so bear with me. And again, I'm not an expert in this area; I know enough about how it maps to the SQL stuff, but beyond that I can't go too deep. So the idea is that we take our UDF and convert it into a form called SSA, static single assignment form. This lets us convert the arbitrary code in our UDF into a form that uses gotos between labeled blocks. Then we take this SSA form and convert it into administrative normal form (ANF), which uses mutually recursive functions to simplify the blocks, the regions themselves. Then we convert the ANF, which uses mutual recursion, into direct recursion, and that gets converted into SQL using WITH RECURSIVE CTEs. And that produces our SQL query, which we run through our query optimizer. So buckle up; we'll go through the examples.
All right, so let's take a really simple function, this pow function: give it an x, give it an n. The part we care about is this loop, where we iterate over i and multiply x by itself, to some power. In the first step, we convert this to SSA form. It looks something like this: we define these blocks with labels, and we use gotos to jump around where we need to. You all look happy about that, right? Seems like undergrad again. This is what they do to you. So in SSA form, as you may remember, each variable is defined only once. You're all nodding yes. This is what a traditional compiler does on the inside. So now we take our SSA form and convert it into administrative normal form, which gives us tail recursion: at the last statement of each of these little functions we can call another function, we're allowed to call ourselves recursively and call other functions, and those functions can then call us back. So you can have cycles in this. In this case, for our while loop, we loop through, do our computation, and recursively call ourselves, and once we've finished the last iteration, we break out, and whatever the result of the last iteration was is the return value. Yes? The question is, with this tail recursion, for the calls that go to the other labels, are you still allowed to do computation before the calls? Yes, and that's what this is, down here. So again, this is mutual recursion, meaning one function can call another function, and that function can call you back.
We want to convert this to direct recursion, which is just another transformation, so that we only have recursive calls in the tail, as the last thing we do within the function, and the recursion only goes in one direction: run can call run, pow can call run, but run cannot call pow. And the reason we do this is because we care about getting the final output of this tail-recursive call stack, and that then produces the output of the SELECT statement within our nested queries. The outermost query produces the output, and that's the innermost recursive call. All right, so then we take this thing and convert it into SQL. Again, I'm not going to go through all the details, but basically: here's where we set up all the variables, that's just this nested query inside of this; then we have our if and else; and then these recursive calls here correspond to this SQL statement like this. And then compiler magic happens, and this works. All right, so does it make a difference? Well, for this one, they didn't have Froid, so they can't compare whether their approach is better than what Microsoft was doing. They just compared how much faster their approach using all these nested CTEs is versus letting Postgres call the UDF as it normally does. And in my demo before, I showed that for a really simple function it's actually faster to just call the UDF. Sure, for really simple things this doesn't make sense, but if you're going to invoke the UDF over a very large table, then you start to see the divergence between the different approaches.
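A minimal sketch of the final stage, a loop becoming a WITH RECURSIVE CTE: this is my simplified version of the pow example, not Apfel's actual emitted SQL. The loop state (i, acc) becomes one CTE row per iteration, and the "return" is the row where the loop condition fails.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# pow(x=3, n=4) as a recursive CTE: each UNION ALL step is one loop
# iteration; the final SELECT picks the last iteration's accumulator.
(result,) = con.execute("""
    WITH RECURSIVE run(i, acc) AS (
        SELECT 0, 1                        -- loop entry: i = 0, acc = 1
        UNION ALL
        SELECT i + 1, acc * 3              -- loop body: acc *= x
        FROM run WHERE i < 4               -- while i < n
    )
    SELECT acc FROM run ORDER BY i DESC LIMIT 1
""").fetchone()

print(result)  # 81, i.e. 3**4
```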
I would also say that the reason the performance gap is not more significant is that the Postgres query optimizer is not as sophisticated as Microsoft's, so it's not going to be able to do all the optimizations I was showing before: breaking things down, removing dead code, and so on. All right, so the last approach I'm going to show you is batching. Some background here: this actually came out of a 15-721 project that Sam and another student were working on last year, where a CMU student sort of independently developed this technique. Then we found a master's thesis from the Germans that did Apfel in which they had invented the batching technique, and then we found another paper from the Froid guys' PhD advisor, from 2008, who actually invented it before anyone else. But that 2008 version required changes to the query optimizer itself to make it all work; in the version developed here, and by the Tübingen Germans, the Apfel Germans, you don't have to make any changes to the optimizer. Right, so the idea here is that we translate the UDF into a series of UPDATE statements. We're not connecting them together with lateral joins; it's literally one query invoked after another. And what they do is perform some amount of computation in the SET clauses to set values in a state table, which maintains the equivalent of the variables in the UDF itself. So when we invoke the UDF, we first create this temp table and instantiate an attribute for each variable in the UDF; then we have a series of UPDATEs that modify these variables corresponding to the computation that would be in the UDF. So we're doing that same translation we saw with Froid, converting the UDF's procedural statements into corresponding SQL queries.
This is going to be useful for any database system, as we'll see in a second, that is not able to do the decorrelation stuff that Froid and Apfel rely on, since it avoids those lateral joins and deeply nested queries. So this is a UDF from ProcBench, which we'll talk about in a second. This is from a follow-up paper that the Froid guys put out: a benchmark based on all the UDFs they were seeing from real customers, a sort of synthetic version of what real UDFs look like. The gist of this example is that you're doing a lookup on an item ID to figure out which manufacturer has sold the most of it. So you have this select query here; there are three nested select queries here in these different blocks. And then say this is the calling query that's going to invoke it. Inside of this there's some additional computation, and then for every single record within this query, because you're trying to get the first 25,000 most-bought items, you invoke the UDF up above. I'm not going to go through all of this in detail, but think of it as: the combination of the UDF plus the calling SQL query gets converted into a sequence of SQL queries like this. You could treat this as a SQL function, even though it has updates in it, but just think of it like a macro: this thing gets embedded when you call the outer query. So in the first step here at the top, here's that temp table we're creating, and inside of it you see that we're declaring attributes in our temp table that correspond to all the variables we defined: we defined a man variable, count1, count2, and all of those get defined in the CREATE TABLE itself.
But then we also have this special return boolean that tells us whether this record in the temp table, corresponding to a tuple that was passed in, should get returned or not. So you can think of it as: every single row in the temp table corresponds to a tuple that would get passed into the UDF. If I have a thousand web sale items, a thousand item IDs, then I have a thousand rows in my temp table, and I'm basically updating this giant state table as I go along. And now all my computations, which were done one record at a time in my original UDF, can run across all of the tuples being passed into the UDF at the same time, each independently updating its row in the state table. And the way I invoke this: generate_series basically generates a list of numbers from one (or zero) up to whatever, and I'm just doing a lateral join against that. So I'm seeding the computation to generate the result I'm looking for and produce the output I need, and that's equivalent to invoking the original UDF. Yes. The question is, you probably can't do this for every single UDF, since you're parallelizing stuff? Stephen says it's not going to work for every UDF. I think it does, actually, because even exceptions you could handle through the state table. I think it's more generalizable than Froid and Apfel. You could even potentially handle dynamic queries, because the string you're concatenating into the SQL query, you can put that in the state table. It's a little weird, but you could do it. Okay, so I think I've already mentioned this. These are slides from Sam from last year.
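The batching shape, one state-table row per input tuple, updated by set-oriented UPDATEs, can be sketched like this. This is my toy version of the idea, not the actual ProcBench query.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (cust_id INTEGER, price INTEGER);
    INSERT INTO orders VALUES (1, 2000000), (2, 50);

    -- state table: one row per tuple the UDF would be called on,
    -- one column per UDF variable
    CREATE TEMP TABLE state (cust_id INTEGER, total INTEGER, level TEXT);
    INSERT INTO state SELECT DISTINCT cust_id, NULL, 'regular' FROM orders;

    -- each UDF statement becomes one UPDATE over ALL rows at once
    UPDATE state SET total =
        (SELECT SUM(price) FROM orders WHERE orders.cust_id = state.cust_id);
    UPDATE state SET level = 'platinum' WHERE total > 1000000;
""")

final = con.execute(
    "SELECT cust_id, level FROM state ORDER BY cust_id").fetchall()
print(final)  # [(1, 'platinum'), (2, 'regular')]
```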
So the Microsoft guys wrote the Froid paper, then they wrote the follow-up Aggify paper, and then they put together this benchmark called SQL ProcBench that they argued was a faithful representation of what real UDFs actually look like, because prior to this there wasn't anything. And you can classify the UDFs in ProcBench into two categories. The first are UDFs without any input parameters, right? So, select max return reason web, right? Nothing gets passed into it, and I'm just invoking it once. And in this case there isn't actually any advantage to inlining or batching, because the UDF is invoked only once; there's really nothing special about it. The ones that matter the most are the ones like what we saw before, where you're passing in some kind of input value that you're iterating over in the calling query on the outside, right? So when Sam did his analysis of ProcBench, despite Microsoft inventing the inlining technique with Froid, they could only inline a small portion of the queries in their own benchmark, because a bunch of them Froid just couldn't reason about and wasn't able to handle. Or, in some cases where it did inline, it didn't actually get a performance benefit because it wasn't able to do that decorrelation of the subqueries. Yes?
[Student:] This question is for Sam, actually. Was it because the if conditions were too complex, or was there stuff in the SQL that Froid just didn't support?
[Sam:] I guess Andy can explain this too, but essentially the problem is that when you take your UDF and inline it, you get a very complicated subquery with a bunch of nested joins. And then when you put that into the SQL Server optimizer, it's unable to decorrelate that subquery, so you get very bad performance.
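Here's a toy demonstration of why the parameterized, per-row UDFs are the interesting case (hypothetical table and UDF, using Python's `sqlite3` to stand in for a real engine). The scalar UDF fires a hidden lookup query for every outer row, which the optimizer treats as a black box, while the set-based equivalent computes the same answer in a single pass:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (item INTEGER, qty INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 2), (1, 3), (2, 4), (3, 1), (3, 6)])

calls = {"n": 0}

def total_qty(item):
    # Scalar UDF with an input parameter: the engine calls it once per
    # outer row, a hidden "N+1 queries" pattern it cannot optimize.
    calls["n"] += 1
    return con.execute("SELECT SUM(qty) FROM orders WHERE item = ?",
                       (item,)).fetchone()[0]

con.create_function("total_qty", 1, total_qty)

# Per-row invocation through the UDF.
per_row = con.execute(
    "SELECT DISTINCT item, total_qty(item) FROM orders ORDER BY item"
).fetchall()
print(per_row, "UDF calls:", calls["n"])

# The inlined, set-based equivalent: one aggregation, zero UDF calls.
inlined = con.execute(
    "SELECT item, SUM(qty) FROM orders GROUP BY item ORDER BY item"
).fetchall()
assert per_row == inlined
```

A parameterless UDF, by contrast, would be invoked once no matter how many rows the calling query touches, so inlining or batching it buys essentially nothing.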
[Sam:] Whereas if you use the German way of decorrelating subqueries, you can decorrelate any subquery, and then you get really good performance with inlining.
[Andy:] Yes, and we'll cover this in more detail next week as well. All right, so this is the table we had in the paper that we just put out two months ago, where we compared against SQL Server, Oracle, DuckDB, and Postgres. So Postgres, again, just can't handle any of these things, right? Because its optimizer is not as sophisticated as the commercial ones. Oracle we'll just ignore. But in the case of DuckDB, well, DuckDB could handle everything. How is that the case? Because last year Sam's 15-721 project, with two other master's students, added support for flattening nested lateral joins so that DuckDB can support the inlining and batching stuff that we've been talking about. And furthermore, they submitted the PR to DuckDB and it actually got merged. So when you download DuckDB, you're getting Sam's and the other 721 students' code to handle the inlining stuff, right? Well, what about Microsoft? So for inlining, you can see that for these select UDFs in ProcBench (again, the Microsoft benchmark based on their own customers' UDFs) they can only handle two of them. So what's going on? The issue is that, as Sam already said, their approach to decorrelating subqueries is not as sophisticated as this German approach we keep alluding to, right? Instead, they basically have these handwritten rules that let them do the rewriting inside the query optimizer for different use cases, but not all of them. And that paper came out in 2001, before Froid and these computer-generated monstrosity queries and all these lateral joins existed. So I guess, since these things didn't exist at the time, they didn't try to cover them in their rule base, and they haven't updated it since.
So in this case here, SQL Server simply can't handle the kind of monstrosity queries that something like Froid or Apfel is gonna generate. And so the German approach, which we'll see next week, can handle any possible decorrelation of subqueries. For one, their query plans can be DAGs instead of trees, which allows them to reuse the computation for one nested query in another part of the query. And they introduce additional constructs like a dependent join to keep track of the dependencies between two different FROM clauses, and so forth. And therefore you can take any possible subquery combination you can think of, throw it at the query optimizer, and it'll be able to handle it. DuckDB can do this because they're based off of what Umbra and Hyper did, and then Sam and his team came along and extended it for the lateral join stuff that we needed. So that's the preview for next week, right? We're gonna talk about how this works in Hyper. So, going back to this table here, the main takeaway is that inlining is fantastic, but only if your database system's query optimizer can handle the large queries these things are gonna spit out, and few can. Okay, so, again, I said this before, but this is a big deal: if your query optimizer can handle the kind of things that Apfel or Froid is gonna generate, then you get a huge win for speeding up UDFs without making any changes to your application code. And that rarely happens in the world of CS, or, you know, of software, so that's a big deal. But again, you need to be able to do the German-style, the Umbra-style, decorrelation. We also talked a little bit about compiling the UDF to machine code. That helps with some of the performance slowdown, but since the UDF is still gonna look like a black box to the optimizer, you're not gonna get the best performance. All right, any questions? Yes?
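For a flavor of what decorrelation does in the simplest case (this is just the textbook special case, not the general dependent-join algorithm from the German papers; table and column names are invented), here's a sketch with `sqlite3`. The correlated scalar subquery, which a naive engine re-evaluates once per outer tuple, is rewritten as a join against an aggregate computed once:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (id INTEGER, dept TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [(1, 'x', 100), (2, 'x', 200), (3, 'y', 50), (4, 'y', 80)])

# Correlated form: the subquery depends on the outer row (e.dept).
# This is the shape that inlined UDFs tend to produce, and that a weak
# optimizer evaluates once per tuple.
correlated = con.execute("""
    SELECT e.id FROM emp e
    WHERE e.sal > (SELECT AVG(sal) FROM emp i WHERE i.dept = e.dept)
    ORDER BY e.id
""").fetchall()

# Decorrelated form: compute every per-dept average once up front,
# then join. Same answer, evaluated a single time.
decorrelated = con.execute("""
    SELECT e.id
    FROM emp e
    JOIN (SELECT dept, AVG(sal) AS avg_sal FROM emp GROUP BY dept) d
      ON d.dept = e.dept
    WHERE e.sal > d.avg_sal
    ORDER BY e.id
""").fetchall()

assert correlated == decorrelated
print(correlated)  # [(2,), (4,)]
```

The general technique (dependent joins, DAG-shaped plans) handles arbitrary correlation, not just this aggregate-in-a-WHERE pattern; that's what next week's lecture covers.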
So this is still gonna be some hassle.
[Student:] Can I ask a question about the code that we looked at for the conversion?
[Andy:] Sure, yes. This one? This?
[Student:] No, actually the one that we looked at for the Froid example. I was thinking this might actually not be possible if, instead of that ELSE clause right there, the third one, we replace the ELSE with a simple IF condition that just says: if total is less than a million. In that case, if you look at the block over there, it says er2.level, but it won't be able to just use er2.level. It will need er1 or er2, depending on whether that condition is true. So it won't easily translate from the left to the right.
[Andy:] So the statement is, you now have another IF clause inside of this?
[Student:] Not necessarily nested. Just replace the ELSE with an IF.
[Andy:] Oh, so this IF ends, and then you have another IF.
[Student:] Yes. And in that case, if total is less than a million, set it to regular. And then, implicitly, otherwise the value is er1's level, which is null, because the level variable is initialized as null at the top. So you'd just need to return null. But the arrow there is er2.level, right? My problem is that er2.level is not the logic that the UDF is trying to represent here: the value changed in er2, and we want er2's level if one condition is true, or er1's level if a different condition holds.
[Andy:] I understand your point. You can imagine that before it does this, it's combining them into an if-then-else. Sam, go for it.
[Sam:] Yes, so basically the idea is you'll have one table which says: okay, if this condition is true, then it's the new value; otherwise it's the level from the previous table. And you can keep doing this. Each step is just an expression that refers to the columns in the previous table, so it's basically doing the update. So then you'd use another CASE, like in the slide.
[Andy:] Okay, we are over time, so let me jump ahead. All right, so next class we're going to talk about database networking protocols. And the reason why I'm doing this after UDFs is because the UDF idea is: I have my application logic, let me try to embed that in the database server. The Arrow guys and the DuckDB guys are saying, no, no, no, it's actually the opposite: embed your database system inside of your application, and then use something like Arrow to get the data in and out very quickly. Okay? So the paper you'll read is from the DuckDB guys, from before DuckDB even existed, I think, but it's basically showing you how terrible existing networking protocols are for database systems, and how for OLAP queries they're not ideal. And then we'll see how Arrow fixes this, and then another project out of Simon Fraser University. Okay? All right guys, see ya.