 I was thinking about you last night when I was making these slides, because I ran your buddy Freefingers. Oh, yeah. No. That guy owes me money. He owes you money? So quickly for the stuff on the docket. This is the same as I showed last class on homework 5 this Sunday. Then the following week, project 4 we do. Next week the Tuesday lecture will be virtual over Zoom. That's our friends from Snowflake that come and give a talk. And then the Thursday lecture will be back in class here in person, and that'll be, you know, Q&A session, AS&E, whatever you want about databases. There's no limits. And then the final exam will be Friday, December 16th. We have a room, I think it's Gates, the main lecture hall, 4401. And I'll post the study guide with the practice exam next week. And then actually on the Thursday class next week, we'll spend 5, 10 minutes at the very beginning just to go over a quick final review, okay? So there's a lot here, but any questions about anything that you guys are, that's coming up? Okay, cool. All right, so this lecture is, as I said at the NLS class, this is a, this is my attempt to expose you to a different way of thinking about how to build applications and try to use the database system more than just like, you know, just like a dumb data store. And we can talk about how now we can actually embed some of the logic that we would have in our application. So written in PHP, Python, JavaScript, whatever you want, and how we can start putting that into the database system. So to help motivate why we'd want to possibly do this, we first need to understand, well, how is it that applications are using the database system now? What's the most common approach? And the standard API, the standard method to interact with the database system is use what I'll call a conversational API, where the application connects to the database server, sends a query, waits for that result, gets it back, and then the, and then they're waiting sort of, each side is waiting for each other to do the next thing, right? So the application starts, connects to the database server, so the database server is waiting for the, for the query to show up. Then the query shows up, and then the database server sends it back, and now, you know, the application does something with it, right? So every single database system will implement a term wire protocol. We don't need to discuss that in this class here. That'll be in the advanced class. And then the client-side APIs would use to communicate with the database servers, JDBC or DPC, or the standard approaches. So just to show again what those looks like. Say this is your application server. It's running somewhere, you know, it's running in the cloud, dedicated server on the same machine, doesn't matter. But when the application starts, it begins a transaction, it executes some queries, that goes over the wire, the database server then has to take that SQL query, parse it, plan it, optimize it, then actually run it, right? All the things that we've talked about this semester so far. And then it sends back the result. And then now, assuming this is done in the context of a transaction, now the database server has to sit and wait for the next command, right? Because the application server is going to take the result that it got back for that SQL query and now do something with it. I can have if statements to say, here's the next query I really want to want, maybe wants to abort the transaction. The database doesn't know, doesn't care, but it's sitting and waiting. Then we get the next query, again, goes over the network, we have to do the same thing, parse it, plan it, optimize it, run it, send the result back, and then we wait. And eventually it does some more processing, we're still waiting. And finally we get down here to commit and we tell it to go to commit, right? So there's a lot of round trips here. I'm not saying whether it's on the same box or the network or the same rack, it doesn't matter, but still it has to communicate to say, hey, do these things and then each side is waiting for the other one to respond. So the goal for us today is to figure out what logic we would have over here, can we actually push into the database server? Because it's sort of related to that, like pushing the query to the data versus pulling the data to the query. We want to be able to push as much logic we can directly down to the database system and have that be able to take actions or do certain things without having to go back to the application and ask for the next command or what to do. So we're going to get several potential benefits by pushing this logic into the server. And I'm putting the word potential in here because there will be some cases where we'll see some downsides where if we do use certain things like a UDF, then the query can actually get way worse. We'll understand why. So I've already said this, we'll have fewer network round trips because we just sort of try to push as much in a single query, a single request to the database server and make it at to as much work as possible for that request. We'll have more immediate notification of changes because the database server will be there when the change occurs and we can define some action or something for it to do when it sees a change. We talked about how the database can spend less time waiting for transactions, waiting for the next query. This is important again on two ways locking. I'm holding those locks potentially when I send back the result and waiting for the next query. So if I can reduce the network round trips and reduce the time to get the next query or see the commit command, then transactions will run more quickly. And then the last one is more fuzzy, but it has to do with sort of software engineering idea where if we can have some core logic in our application now embed inside of the database system and have the database system maintain this for us, then it doesn't matter if we end up rewriting the application or make different versions of it, we don't have to reimplement that logic over and over again for different applications. The database server system can just handle all for us. So like say if I had a desktop application for a desktop version of my application that I want to want to make a mobile phone version. Maybe the desktop one is written in C sharp whatever people are in Java, but then the phone one's written in Swift or whatever Android is Java I guess. Like if I have to have the same there's some important logic that those two different applications have to both implement. If I can put it into the database server and have it just be that the single implementation, then I don't worry about having bugs when I try to port it to different languages or different environments, right? So this is going to be a again a high level overview of a bunch of different methods that people use to embed database logic. UDFs or user design functions are going to be the most common one that you're going to see in the real world. Store procedures are a little much less common. Actually UDFs plus triggers are probably those two are going to be the most common. Change notifications, you define types, and then views. And then we'll do some demos today. Actually all the demos are going to be in Postgres because it's going to support all of these and I know the syntax much better than all the elements. It's much more friendly, okay? All right so user design functions, UDFs. The idea here is that we want to be able to extend the database system's functionality to go beyond the functions or the capabilities that it has out of the box. So I think I might have said this in the beginning of the semester that when we talk about string functions, date functions, there's a SQL standard that says here's a bunch of built-in functions that a database system that supports SQL has to have. But obviously if your application has something that's very specific that you need done, there might not be a built-in function that does that. So you can build, you can write your own functions and embed them inside of the database system and then they'll be treated as if it was like string length or concat or like a built-in function and you call them directly in your SQL statements. So it's a function, right? It's pretty basic, right? You take in some input arguments, typically scalers, you do some kind of computation, and then you're going to return back a result. And that result can either be more scalers or the tables. So the computational part can have all the sort of imperative constructs that you'd expect in other programming languages, which are not really easily or not really defined in SQL. You could have if-causes, for-loops, throws exceptions, make calls to other functions, and so forth. And so we want to do this because again, there'll be some logic that's going to be difficult for us to express cleanly in SQL, whereas writing into an imperative language will be much easier. So when we define our UDF, again, it's basically got the return types, and again it's going to be returning a single value or a single result table or set of rows, and then the definition of the function itself, the thing that actually does the computation, could either be a SQL function, meaning it's just a bunch of SQL queries without any deep logic, or it could be an external programming language, additional programming language that isn't SQL. So we're going to go through the SQL functions first because that's probably the most common, and this is one that's actually the most supported in systems, and then we'll look at the additional programming languages. So a SQL-based UDF is just a list of queries that the data system is going to execute in the order that they're specified, and then whatever the last query in the UDF, whatever it is, that's just the return value. So we have this create function now, command in SQL, so we call create function, we specify that this is the name of the function, get foo, we'll define some input arguments, we have a return type, in this case here we're going to say returns foo because the last query is going to be returning from foo, or the last query will be a select from foo, and then we have the actual function body itself. So when we invoke this function, we pass in some parameter as int, and we can reference it with dollar sign one. So in this case here, this function can be invoked either in the output list of a select query, or in a from clause, because it returns back a row or table for this type of query, or UDF, you can't reference it in the where clause. Pretty basic, right? Yes, the question is, what is the dollar to dollar sign? This is just saying that this is the body of the function, and in this case here, I think for Postgres, it treats this basically as a string. I think it does some parsing, because it knows that it's SQL, so it assumes that everything comes after it is going to be SQL statements, and it'll actually parse it. We'll see in other UDF languages, it's just a benign string that doesn't do any parsing until you actually run it. So again, pretty straightforward. Let's do some live demos. That's always fun. All right, so first we're going to do is going to define our table foo. I'm going to move this over here, right? And then we're going to put some data here, right? Click select, star from foo, right? Nothing special there, and then now we're going to define our function, awesome, because I'm missing the dollar sign. Sorry, that's a bummer if I broke this already. Sorry, there we go, all right. So now I can call select, as I showed before, I can call it select get from foo, and just the query in the front, move this down, right? In this case here, it's returning back me a tuple as a single column with the tuple in it. Then if I put it in the from clause, because it knows it's returning a table, right? Then you actually get the columns, right? So far so good. So pretty easy, right? Nothing fancy. So the SQL standard also provides an atomic keyword, which may seem kind of confusing, like, oh, does this mean this is a transaction? Now, so what atomic provides means that the database system is going to track dependencies between the SQL queries that are defined here in the UDS, similar to like when you have a foreign key, so that if I try to drop a function that references another function, sorry, if I drop a function that's being referenced by another function, right, it would know that it can't do that and would prevent you from deleting this, right? So this is a way for the database system to provide some additional functionality about, you know, making better for developers that like you're not the same protection you would have for referential integrity in tables, we can also have it in our functions. Let's quick demo that one. Okay, so what we're going to do is we're going to create three functions, ABC, dropping them just to make sure they exist. So all they're going to do, right, all, you know, here's A at the top, all takes in as an integer, and it's going to add, multiply by three, B is going to take in another integer, and call A and multiply that by two, and then C is just going to call B, right? So there's a three-way dependency, or so the dependency between C depends on B and B depends on A. So again, it's just nothing fancy, I just call it select C1, and I get back a result, right? So now what I'm going to do, I'm going to drop A and C, but not B, right? And so when I created these functions, I didn't use that as how the keyword, so it doesn't, this is not trying to figure out, okay, what depends on what for these different functions. But now if I go back and try to add C, which is going to call B, right? I can do that. Even though B calls A, but it doesn't, since it didn't track dependencies, it doesn't know that at this point here, it doesn't know that A is missing or doesn't care, right? And I'm allowed to create the function. But if I call it again, right, it'll crap out because when C calls B, B then tries to call A, but, and it doesn't exist, right? So let's drop all the functions again, and then we're going to add them all back, but this time we're going to add in, we're going to use the atomic keyword, right? So again, I, so I define a function C, tells the language of SQL, we get atomic, and then there's the body there. So now when I go to try to drop A, it's going to throw an error because it knows that something depends on it, right? It knows that B is calling it. So it doesn't let you do that. So to handle that, just like again, from foreign keys, we can add this cascade modifier to our drop calls so that we're going to drop A and then it's going to cascade the drop to all, to any function that references it, to avoid a dangling pointer. Yes. The question is, why isn't the feature the default? Good question. Uh, my postcard is just out of this like two or three years ago, so it's not, I think it has to do, they're trying to do lazy parsing and, like, of the functions, that you just define it as a, as a string, put it into the catalog, and then only at runtime when you go to actually invoke it, then you start interpreting it. I think that would be, that's the reason. Okay. Nothing fancy here so far, right? Right. So the SQL, you know, the SQL functions, they are, you can almost think of like a macro, right? That like, like, you could copy and paste a bunch of SQL queries and put it into a single function. But again, you're still limited to the type of logic you can represent and define in the SQL function. It's limited to whatever SQL supports. And so some database systems will also support writing UDFs in additional languages that go beyond SQL and can be procedural or imperative. So the SQL standard defines that the, that the, a UDF language, the UDF language in the SQL standard is called SQL PSM. It's like SQL persistent stored module. It's, it's some 1990s enterprise technology, has always had, always has crappy names. So they, the SQL standard defines the programming language called SQL PSM. Nobody actually implements that, right? The, everybody has their own. Oracle's PL SQL probably looks the closest to SQL PSM. Postgres has PL PG SQL, which is their variant of Oracle's PL SQL. My SQL, sorry, my SQL doesn't support any of these. You have to write your things in C or SQL functions. But then SQL Server and Sybase support this other UDF language called Transac SQL. Add a cure acid. Who here has written T SQL before? No one. Okay. SQL Server is a good system. SQL Server is a solid daily system. The history again of SQL Server is that Microsoft didn't make it from scratch. Microsoft bought a license to, to Sybase in like early 1990s to port it to Windows NT. And then they had the, in that contract, they were allowed to do something with, like muck around with the source code. And they ended up doing a hard fork in the mid 90s and hired the guy that built the volcano, you know, did that volcano model stuff we talked about before. They hired him and a bunch of other smart people. A lot of them at a University of Wisconsin, because that was a, that's a database powerhouse school. And they rewrote SQL Server entirely. And now Sybase got bought by SAP. It's more or less a maintenance mode, whereas SQL Server is, is a state-of-the-art system. So they, they, they were together and then they diverged and SQL Server, I think has surpassed them in capabilities. But anyway, Sybase had their own thing called T SQL back in the 1980s before the SQL standard came out. It looks similar a lot, but you'll see some dialect differences. But that's why, historically, they support their own thing, Transac SQL. The, you'll see this in a second. The, the language is going to look a lot like ADA. Has anybody's heard of that? Nobody? He's waving his hand. You're European, so maybe, I don't know. It looks like Pascal if you ever heard of Pascal, right? Because the guy that was on the, that defined the standard, he was really big into ADA. So that's, that's why it looks like this. So other systems are going to support additional programming languages that are more, I don't say more traditional, more common. Like in Postgres, you can have, you can write UDFs on Python and, and, and Pearl, maybe Java, definitely in C. And some of these languages will be safe enough that you could run them directly inside the database system. Actually, Amazon put out something yesterday. I haven't looked at it. They had their own, they had their trusted language extensions for Postgres that's like Python and Pearl. So it's a way to write Python, UDFs and other functionality to extend Postgres, but then still be able to run it in, in Amazon's hosted environment. Right? Because the issue is going to be, if we write my, if we write my UDF in C, what can you do in C? What's that? Oh, I mean, the answer is everything, or anything, right? Like, if you can write it, if you can write a UDF in C, you can, now I could do whatever you want, write to any arbitrary location in memory. So in some systems that, that, like in Oracle, you can write your UDF in C, but they'll run you in a sandbox and a separate like process that's isolated from the main database server. And you have to use like a, you know, IPC command to send data back and forth between one process to the other. Because what they don't want is, they don't want your stupid C program to like start trashing the memory heap and crashing the database server and corrupting data. So they'll run you in a sandbox. Postgres does not. Postgres, if you write a extension, or you write a UDF in a, in C, compile the shared object, you can link it in and then do whatever you want, right? So anyways, there's other programming languages other than the ones up here. We're going to focus on ones up here, but like, you know, they have to some, there'll be some standard API that the data system provides for you for the different bindings with different languages, right? All right. So let's look at a PLPG SQL example in, in Postgres. So now we see here that we have, defining that same function we have before, get foo. But now in our return, we're going to return a set of foo. And then the language specified here is now a PLPG SQL. And to his question again earlier, the double dollar sign just means like, here's the body of the function. And you can, you can treat it as a, as a giant string. Yes. With the return set of foo, just means that you're getting, you're getting tuples of foo. So we know that the, that the return type is going to be the same schema. It's going to be a set of tuples that have the same schema as the foo table. So in this case here now, instead of just having implicitly the last query of the function being the return value, I have to now specify that here's the return clause. And it's going to be a query that, that, and I, this is, you've got the query here. One nasty thing you can do in PL, PL SQL, PLPG SQL is you can actually concatenate strings together and like define a SQL statement that way, like as like a, as like a SQL string doter and then invoke the query. Right. Now you see why it is kind of hard where you can't actually maybe parse the, the UDF when it's instantiated, like you code with a SQL statement per SQL function, because you actually have to execute the code to actually maybe figure out what the query is, what query is actually going to execute. And again, there could be if clauses, could be conditionals that were this, the query that it's going to get generated might be different versus, you know, one pass versus another, because it depends on the state of the data system. All right. So let's look at a more complicated example. So the next one is now just going to do a summation. So first thing to point out again, because we're, we're, this is inspired by ADA. It's the 1980s, you have to define your variables before your function, right? So you declare at the top here, you tell them what they are. So the foo record is going to be a record. And then out will just be an integer. And then now in my, my body, now I have a for loop where I'm going to loop over every record from foo that comes out of this select statement here. And all I'm doing is just taking the, the counter and adding whatever the, whatever this integer value to, to it. And I return that. So let's see how it works. So, right? Just create the function like before. So now I can do things like select star from some foo and I'll pass in one, right? I get some value like that, right? So when I call this function now, the function gets some foc and then it, inside that function, it then evokes another query. So it's almost like a nested query in some ways, right? Or again, I can put it out. This one here, this returns a, going back to my definition. So I'm defining that the return type is an integer. I can now start using this in the where clause because it returns a scalar. Even though, so this, what the computation actually doesn't matter, but I can start doing where ID less than some foo one, like that, right? I can start doing, putting it in here, some foo ID, right? Yes. The question is, what would be the difference between this and a nested SQL query? From a, so from a, there's two ways to think about it. From the database system perspective and then from the application developer perspective, right? From the database systems perspective, it's actually worse because the, the, the UDF is a black box, right? If I call explain on this, all you see is that it did a sequential scan foo, right? And then inside this filter, there's some foo clause in the invocation here. So it didn't, the optimizer didn't try to figure out, okay, what's actually some foo doing? Oh, there's another, there's another query inside of that. Maybe I can lift that out and use that as a nested query. Then run my, all my optimizer tricks that we talked about before. So this is a common problem in UDFs, which we'll get in a few more slides is that although my example here, the logic is pretty simple. As I was saying before, you don't actually know what it's going to do until you actually run it. And, and then the execution path of the UDF is going to depend on both the input parameters and the state of the database because I can make arbitrary queries to the state of the database. So from the application developer perspective, it's potentially better because again, there might be some logic that is traditionally more easier to implement using UDFs. All right, so this is actually summarized as much as I've already said. So what are the benefits of UDFs? So it's going to, it's going to allow developers to have, be able to actually reuse, again, complex logic across different implementations of the application. So again, if I have a mobile phone app and then the web app, I don't have to even let maybe real complex things in, in the different languages, I can have one, you know, one, express it once inside the database server and reuse it everywhere. For complex operations, it's also going to reduce the number of round trips because I don't have to go back to the application. I forgot the next thing to do. I can push things to the database server. And then the last one is already said. Some things are easier to express and read it in UDFs. But now, what are the bad side? All right. So the first one I already, I've already said. The query optim, most query optimizers, the exception of one, I'll say that in a second, most, almost every single query optimizer is going to treat the UDF as a black box. Doesn't know what's inside of it, doesn't know what the, any queries is going to evoke, right. And therefore, it's not going to be able to make estimates about the, the selectivity of the cardinality of expressions in our where clauses or our join clauses, because it's a black box. All right. I already said that was a big deal for the query optimizer. You'd want to know, you know, if something equals something or something less than something, you want to know what the selectivity is going to be, because that's going to help define how many tuples are going to be pushing up in different parts of the query plan. So then I can figure out what the join order, the most optimal join order is. So because I want to filter out things sooner rather than later. But again, if it's a black box, if it's some imperative language, right, if it's C, I mean all bets are off, who knows, right. But even if it's PLC or PLPG SQL, the system's just not going to know. The other challenge is going to be that when things are correlated inside of the UDF, that you're not going to be able to easily paralyze them. There are some systems, the names, which ones exceed me now, but postcards got better at this, but some systems used to, that anytime you had a query that had a UDF in it, because you don't know what's inside the UDF, you would default to run it as a single thread, because you maybe don't know the ordering of, like there could be any updates inside of it, you can put anything. Now there are some, in case of postcards, you can pass in hints, you can say I know this UDF is paralyzable or I know this UDF is immutable. SQL server will restrict you that only UDFs have to be immutable, strong procedures, which are another type of functions, they can update things. Postcards let you do anything. And then the other challenge is that if you have, as I was saying, some UDS where they're concatenating a bunch of strings together and then invoke a query, again, you don't know what that's going to be before you start running it. So it's hard to be able to document that. The other one is a bit more nuanced is to say that when you have UDFs and a select ware clauses, that you're not going to be able to take advantage of all the tricks we can do in our database server or database system during query execution to like batch things together to speed things up. Are we talking about vectorized execution? Sorry, yes. Okay. Like in some systems, when you see a UDF, since you don't know what it's going to do, you got to invoke it for each row by row by itself. And what, because the UDF itself is going to be defined for the most part to operate on a single row. But maybe what you really want to do, a better way to do this is take a batch of tuples or all the tuples that I'm passing off from one operator to the next and pass that in to the, is a single invocation into the UDF. Have it do all, you know, whatever it needs to do on the batch of tuples and then complete. Because that's going to be way more efficient than, you know, calling individual selects or updates one row by at a time. Okay. So this question is, if it's, if it's, okay, this question is, if it's written in C and you, does it use a div, does it use, what memory is it using? Well, I mean, if you call malloc inside the C function, it's whatever the OS is going to give you, right? But like for any of the results, actually even during query execution, it'll go to the database server. So we do all the standard buffer pool stuff we've done before. But like, when I get back my result, it's going to be, I have to copy it into my process. So I'm going to have to malloc for that. Yeah. The fact that C doesn't matter. Same thing with PLBG SQL, I don't think like when, when, you know, Postgres is running this, if you have a variable, I, it's just sitting in work, work memory. It's not actually backed by the buffer pool. Right. So again, the, again, because we're executing the, everything in the, in the UF1 by 1, there may be actually situations where in my UDF, I call two queries, one after another. And I could potentially kind of run them together, like combine them together automatically, so that it becomes one query. But again, I can't do that because again, I don't know what's, what's on the inside. So I would say there's been, there's been attempts, or there are, there's been some extensions to UDS, maybe like in, in single store, I mean, SQL, they had their own version of a UDF language called MPL. I don't know what it's called now, but like it had constructs to like, do you define parallel functions in your UDS? SQL server has tried compiling the UDS into like the native language, like jitting it and seeing that speeds things up. It does make the UDS run faster, but it doesn't solve that black box problem that I talked about before. Yes. Okay. So his question is, he's correct. Well, no, hold on. If there's, if there's SQL functions, then you could potentially do some pre-optimization on those, and like store, as per pair of statements. But if it's, if it's the imperative language like PLBG, why, like, why not just take those out and, and I don't, what's that? Yeah, yeah, you, I mean, but, but it's no different than a CTE. So here, make sure you do this right. So his question is again, if it's a SQL function, which might, I think my get foo is, right? Actually, let me start it over. Make sure. So if it's a SQL function, why not, why not to drop it in like a macro and let it optimize it as before? Or sorry, like, have it all be in line and optimize. So let's see. So if I call explain, select get foo one, that doesn't tell me anything. Yeah. So it's not, so in this case here, looks like Postgres is not being smart enough to know that get foo is, you know, is, is, is only one query up here. You can't do that. Get foo is the only one query. So in theory, like I should be able to see, I would expect it to, when I call explain like that, to see this, right? If it was actually planning right there, you would see the index scan and explain how, but in this case here, it's a function scan called get foo. So I think it's going to, it's going to parse it in front of it. It's going to optimize it at runtime. There's no obvious reason why it doesn't do that. Well, again, if I have multiple SQL functions in my, in my, sorry, SQL queries in my SQL function, then you actually can't just plop it down in line at one, right? Like I could do this. Let's see if we can break, put this in a loop. So let's do insert into question is the language used for UDS more than just, than just SQL. So I mean, if you define it as, if you define it as, as a SQL function, it has to just be SQL statements one after another. There's no, you can't have like if clauses. If it's PLBG SQL, then you can put in if clauses. All right. So in this case here, I'm going to put now, I'm going to call insert and then I'm going to call, then do the select. And again, the, the output should be the, whatever the last query is. So now if I run it, let's see, I want to run this one, right? I get, I get the one out, I get to get the output I expected to get before. But now if I look at my food table, I should see, I see the insert called, right? So in that case, you, it's not, it's not like a macro to plop it in. It's one query. Yes. Otherwise, no. So this is also running in the context of a transaction, right? So, so all it's basically, again, if you think of like, you sort of pause the query, the outer query that invokes the UDF, put that context aside, then evoke whatever queries you see inside the UDF and then any of those queries fail, then the whole transaction fails. So in this case here, this should fail now because the ID field is the primary key, right? And I try to insert my UDF duplicate key. And then now, now my transactions roll back, right? This question is, if I put explain before, before the UDF, what will happen? Inside the UDF. Yeah, I understand. Yeah, let's see what happens. So let's get rid of the, let's get rid of the insert. But actually, that right there, that tells you that I try to drop the function, but I was still in an aborted transaction that shows you that Postgres treats catalog updates like adding, creating function, dropping function. They, you can do it in the context. It, those are transactional. So it didn't let me do that, which is cool. You want, you want that. All right. So now I'm doing what he suggested before where I want to put explain inside of it and see what happens. Didn't like it. Okay. So let's try, let's try this with the, so I'm going to put the explain first and then run the query. Let's see what happens. I think, I think the explain is going to get ignored. Yeah. Yeah. So you don't want to have it inside, which again, we'll get to this in a second, how to debug these as pain in the ass. All right. So I'm going to show one example here of, of, of how bad actually things can get with UDS to treat it like a black box. So this is from a paper we'll cover in the advanced class, where they took TPCH was the standard analytical OLAP benchmark that we use for David's systems. And they, it's a three-way joint between orders, line item and customers. Thing is like, it's a, you're tracking what items people are buying and line item is, is the fact table we talked about before. So they did the join as a UDF between, for the customer name, for the customer and all you're doing, looking up, make sure it does the customer actually exist. Right. And so the scale factor of this is, one, in TPCH parlance, that means the data is one gigabyte in size. Right. It's not that big. So the UDF looks like this, all you're doing is you're taking in the customer key as, as an, as an integer, and then you're going to invoke it and then return the name. As long as that name's not null, then this returns true. So if you write it just pure SQL, the original query, right, you can complete in less than a second. If you just change to just this one and close, now be a UDF, it takes 13 hours. Right. Well, 13 hours, 30 minutes, almost, almost 14 hours. Right. Because it's, again, the data says no, I think it doesn't know anything about the, the customer name. Sorry, it doesn't know about this UDF and doesn't know that it can do the join in parallel. So for every single row in the line item, it invokes this function over and over and over again, over again. Right. Whereas if I was basically inlining this select query as part of the join here, then I wouldn't have that context which is going, evoking this, this UDF and evoking this query every single time over and over again. So it's so bad, Microsoft actually were, you know, they acknowledge that their, their functions are terrible and bad. And it's not, again, not specific to Microsoft. It's just the nature of UDS because they are opaque. You don't know what's going on. The optimizer is, you know, makes bad choices. Right. And so there's a lot of blog articles that talk about how these things are evil. Don't use them. And so what we'll cover in the advanced class is the thing that sort of what he was alluding to is Microsoft has a technique where you can, you parse this, you can actually convert it to relational algebra. And then you inline it back into the query plan of the outer query. And then you run all your optimizer tricks as before. Right. It's called Freud. Well, the research paper is called Freud. It's very impressive. They have patents. Nobody else does this. But it's, to me, it's one of the best ideas in databases in the last 10 years. Okay. So similar to UDFs are called store procedures. Again, these are going to be less common. But the, the difference is going to be all the UDFs we showed before, they have to be attached to some query. Right. You have to invoke them a part of a select statement or update statement or other some kind of SQL query. Right. But with the UDF, it's almost like an RPC construct where I have this function. I just call it without having to invoke it inside of a query. Right. So some systems actually Postgres used to do this. They didn't have a distinction between UDFs and procedures. Newer versions do. We had to declare something either a procedure or a function. Right. And then I said in, in case of my SQL server, if it's, if you declare it as a procedure, you can update the database or update tables, you can do anything. If it's a function, then it has to be read-only. So the basic idea is here. So here's that application we had before. It's about SQL queries intermixed with program logic. We just take this thing, embed it inside the database server. And then now our application only calls, only invokes the call command and invokes the, invokes the storm procedure gets result. No network round trips. Yes. This question is, functions are typically written in SQL, those, the SQL commands. So I mean, invoked in SQL or written in SQL. No, I, I mean, I showed a PLBG SQL example, the, the summation. That was arbitrary logic. Yeah. These will actually, I think you could probably write these as just, with the SQL commands, SQL queries. I guess you're good. I don't know. So, but no, I mean, like in, in the UDS, you could be a PLBG SQL. It could be, could be C, could be Python, could be any, any of the languages. All right. I hate showing, showing code in class, but it's unavoidable here. I apologize. All right. So this is a more complex example of doing a, a procedure now. And this is the example I showed at the very beginning when we talked about transactions. Taking money out of somebody's account or checking these, whether somebody has enough money, if they do, then take money out of their account and put it into another account. Right. So you see here that we're doing a select now. And we have this into command here are these, these modifiers, where that's going to take, tell them to take the output of the select statement and then store them in these variables we defined up above. Then we have an if clause to see whether they have this person has enough money. If they don't, then we raise an exception. This is basically, you know, a way to throw an error and say that this thing can't complete. And then afterwards we do, we do the, the updates on the accounts balances and we commit. All right. And again, if I had to do this with, with sort of client side code, it would be one round trip to do the select, then the application code checks this balance. Then it's another, another round trip to do the update to take them, take the money out, then another round trip to take the money, put the money back in. But I can do this again as a single procedure call. All right. So let's quick example. So I'm going to create our accounts table, right? I'm going to put, I'll put your money in and, and three fingers. And then I'll create my procedure, right? Just as before. So now I can call, well, all right. Okay. How much is the OU? Yeah. TAs don't get paid that much. DJs don't get paid that much. All right. So I, now I, what's that? So now I can call transfer. So I'm going to take money out of Mooshu's, put it into three fingers and we'll send him $50. Right. And then now I can do a select, start from accounts. And we see we transfer the money. Right. And of course I had that, I had the conditional logic that says if he doesn't have enough money, so we try to just send a thousand dollars. It throws an exception. Mooshu does not, does not have enough money. Right. Okay. Right. There's just the location of the call. So I've already said a lot of this before, but the UDF is meant to be some subset of a read-only computation within a query. I think PostgreSQL restricts you to be read-only now. They didn't used to be. And then where it's a store procedure is meant to be some more, more complete transaction. Like when we talked about before, like the idea of a transaction was this high-level operation that the application wants to perform, like take money out of account, put it in another, multi-step. That's what a store procedure is good for. Yes. Yes. Yes. So the statement is, if we can use a procedure in UDF to put complex logic into a Davy server, why even bother write application code? Who here has written PL SQL before? Nobody. Right. We'll get to the end. I mean, this is actually a good question. Why don't people use this more? I mean, one of the obvious things I have to say out of the box is because that code sucks, right? It's in the 1980s style. Like all the sort of the nice things you want to have in a modern programming language, these things will have. Right. That's some other reasons what we can get to the end. Yes. Yeah. So it's meant to be read-only. SQL server enforces it. Postgres does not. Yeah. Another question over here? Okay. Yes. Sorry. Sure. Okay. So let's hold off on this. I want to get through all this stuff. I've already said some of the reasons why people don't do this, but there's other things we can get to. Yeah. All right. So now one thing we can use UDF force that is super important and actually is very common is to do triggers. So trigger is a way to tell it isn't to do some action when some event occurs in the cyber database system, specifically for data on tables. So if you want to use a trigger, you're going to have to define what type of the event that you wanted to call is the trigger to get fired. You have to specify what the scope of the event is, and we'll describe that in a second. And then we define when the trigger is getting invoked relative to the event. So an easy way to think about this is say I want to have a trigger that anytime somebody deletes a tuple, I want to fire a UDF from fire trigger. But do I want that UDF to be fired before the thing is the tuple gets deleted or after? And should it get fired for all each individual tuple or for the batch of the tuples? So let's do a really simple example here. So we have this table foo has ID and some kind of value. And so what I'm going to want to do is I want to maintain a separate table that's going to keep track of all the values that got overwritten in my main table foo. I think of like I'm using this for auditing. I keep track of like if someone updates a record since we're overwriting the original value, I want to keep track automatically what the old value was. So I'll make this audit table that's just going to have a some kind of ID field that'll have a foreign key reference to this ID here. And I'll keep track of the old value that got overwritten and then the timestamp of when that thing got changed. And then I'll define a UDF that will get invoked anytime the the foo table gets changed. And so now we can see here that we're passing in we have these implicit variables called new and old. And that'll be the new value that's going to get updated that's getting installed and then the old value of the tuple. So it's a reference to the tuple of the previous and the new. And then I'll again just take that and insert it into my audit table and then return the new one. So now to to set up this trigger I'm going to create called the create trigger command to install it. And I'm going to specify that I want it to I want to get fire this trigger before the update occurs for each row that's getting updated. And then I want you to invoke my log foo updates, my UDF that they find above. So this is actually a very, very common scenario application scenario that like you do automatic auditing. So first thing we'll do we're going to create our foo audit table. And then we're going to create our UDF. And again the main thing here is that there's this new and this old you see I'm not they're not getting passed in it's implicitly available for me in the function. Right because the return type is trigger. So the system knows that like okay you expect to look for new and old. The question is new and old part of the SQL standard as far as I know yes. And then I'm going to find my trigger. And so now if I look at my table foo right so let's do update foo set vow equals y y y where id equals four. Right so now that I go back to my foo table I see id four has been set to y y y but I look in my audit table now right you can see that it automatically created an entry for me with the timestamp and the old value. And then the reference to the thing the the primary key in the table got modified. Yes. The question is and again this question is the question is how how does the trigger interact with the the transaction maybe the the outer query they call it the trigger defined. It's all together right it's all like it's equivalent as if I was actually the first query then executed mainly the second query right just because it's a trigger it doesn't mean you lose it's done outside of the consequence of the transaction. So I can see that here so I can do begin. So we'll start a transaction and so the question is like good thing what what can I violate here. Yeah yeah so I'll update this now to like that. So again I look into foo I see that's been updated audit that's been updated but now I roll back and my audit entry is gone right. What how can a data system roll back in the trigger? It is a user-defined function that's embedded inside the database system that you attach to an event. How would they know what is modified? Who's who's they? The trigger is what and such as the yyyy into two audits to roll. Yes. But how do you know exactly what to like roll back? The question is how does the data system know what to roll back from a user-defined function. Because user-defined function if I go back to my like the definition of it it's making a SQL command it's calling insert right. So it knows it knows what you did because it had to do it right. So like in order to do what the thing it needed to do it needed to know what it did. So again so then it's just me it's the point I'm trying to make is like the application it's like it's all embedded in the server side where I don't have to have something else polling all right sorry I don't have to write my application code oh I know I'm updating this table so let me also insert into this audit table. The database system does everything for you. Yes. So your statement is like when I when I updated the table the the query plan has basic checks that say okay do I have any at in my query plan when I'm about to update something are there any triggers attached to this table for the update and the rollback is you know the right set of the transaction you know what you know what it inserted right and then when when the transaction aborts you say okay well it doesn't actually know or care that that it inserted something a foo audit because of the trigger it just knows I put something to foo audit that shouldn't be there anymore. This question is can you write UDS that have side effects like sending email yes should you no uh we talked as before like the like you can do it but the if the transaction rolls back after the UDF after the email has been sent out you can try to be like an old person using Outlook and try to retract the email but like nobody does that right like it's not going to work it's something once something leaves the purview of the confines of the data system we can't nothing we can do yes uh okay next slide so what are the events we can attach ourselves the triggers to so insert update delete truncate which is just dropping a table like a delete without a wear clause on crates altars and drops the the scope could be what are we looking for for the event right so the triggers I'm showing are showing now are table triggers like when something happens this table you know do something right but you could have it for the entire database you could have one for the system itself and then when the trigger gets fired it could be either before the query executes after the executes before each row gets both after each row gets modified and then there are some systems I forget who does this where it's almost like a rewrite rule where like you try to invoke a query that has a trigger attach and the data system will replace that entire query with whatever the trigger trigger function is yes so the question is can you get stuck in an infinite loop yes let's do it um I will say that the the I think actually I know that like a udf can call another udf and I know that you can have uh basically uh there's like a counter say how many how many layers are you down into it and then they'll kill you yeah so let's do this let's let's attach a function or like let's attach a trigger to foo audit that see what are we going to do let's attach a trigger to foo audit that then inserts into or updates into foo which then would call that calls a trigger into foo audit so let's do this let's call this uh bad idea uh so we don't need the old and new we're just always going to insert into you know that's been update right so update foo set vowel equals vowel handy where id equals one actually we do we can do we do old id that should work return new oh shit yep okay so now I have a trigger I have to attach it now so now create trigger trait trigger trigger bad on update or sorry before after insert on foo audit for each row execute procedure and I said it's about bad idea all right so now this should put us in an infinite loop make sure I'm not interested I'm not let's do it yeah okay so what happened that didn't get fired all right so I bet it's new what's that so we got to drop the trigger what up how do you draw a trigger then I guess like we do uh before all right let a rip what's that yeah yeah the answer I think yes I know you can I don't I don't know what's going on here did it actually no because you're because it's you're inserting into foo audit not updating so I want to trigger on foo when I insert into the foo audit yeah I don't know it's through it all right you can do it I don't know how to do it I'm trying to do it live okay we still a lot we want to cover too okay all right so the next thing we can do is do change notifications and this so the trigger we showed about four that's still all inside the database system right and so I can you know with the trigger you know if I had a write without the trigger I had the right application code that would pull the database server maybe that table over and over again see whether there are ever a change and then put something like audit log or write my application code to put things in the audit log but it's still all done on on any any with the trigger the notification of what happened and what change is still embedded inside the database server so in the sql standard you can have these these some these change notification uh constructs or commands that allow the database system to tell the outside world that something has happened inside the database system and again to his point before like can I write a udf that because sends an email yes don't like uh you could do you could still get something similar in the construct of sql and so can you get this like a pub sub system right I can I can subscribe uh to a channel that says give me updates when they occur right and then in the sql standard it'd be listen and notify so this is all pure at this point like well listen listen and notify is in the sql standard what I'm showing here is is purely postgres only right because we're using postgres specific functions so this is going to be a function now call a notify for updates that's what we're going to treat as a trigger but inside my my trigger function body i'm going to take the row that just got updated there's a there's a postgres function called row to json it basically takes the row turns it into a json document store that in a string or a json string and then i'm going to call this thing called pg notify that's going to invoke the uh convert the json object into a json text or serialized to text and then call uh on foo update which would be the name of the channel we're going to attach to and then i'd find a trigger that says anytime i insert into the audit table uh for each row i'm going to invoke that uh invoke that function invoke that procedure all right so let's try this live so again first thing i'm going to do is create my function just look before then i'm going to attach a trigger uh and then on the terminal the bottom here which i think make sure i'm still connected the terminal the bottom it's going to call it's going to listen on foo update and then my terminal at the top is going to invoke uh you know query updates the table and then the one on the bottom if i call now select one again now i get a get a notification of here's the changes that occurred way more than i thought you were going to get actually why oh so sorry because the the sequel query is uh where id doesn't equal one i missed the not right so this is updating all of them so at the bottom you're getting we've got a bunch of these these notifications and then you see the payload is the json uh this is the json string of the tuple right so the reason why i had to call select one is because it's asynchronous the the pcgo terminal here it's not pulling the davis server over and over again and say hey do you have any updates for me there are um there are client-side api's and libraries that that will do that for you so that you can get the notification later on again there's just a way to to push updates outside the davis server okay so the next thing to talk about is how can we start representing more complex data types in the davis server again because we want to avoid having to have application code that maybe knows how to pick a part of json document right or do some additional has some additional logic for a to parse a blob that was during the davis server right so there may be oftentimes in applications there's a complex data type that you want to represent atomically or as you know it's not a scale but yeah atomically inside the davis server uh so that all your different imitations of the application that can connect to it and reuse that type without having again the custom code to handle it right so let's say that i want to keep track of coordinates of a coordinate type x coordinate y coordinate and then some kind of string label here so i could just split it up split it apart uh and store it as a you know as a as a three separate attributes right this would be the most common thing to do um i could then also could also do what i call application civilization where i just stored either json directly inside the database server and i had to you know have to have write sequel code and application code to pick that apart uh worst case scenario i could just use a client-size civilization library like protocol buffers thrift uh to make make me turn this thing into a blob and store the blob inside the database server that's a terrible idea don't do that um some systems like support user defined types and actually this is what the big idea postgres one of the big ideas that postgres created in the 1980s uh postgres you know was defined was built to be this extensible database server so that you could add these user defined types for all these different types that may be out there that that the application that the data server wasn't going to support by default the sequel standard didn't specify by default right so again postgres defined this in the 80s but now it's in the sequel standard and it's this idea of these what we call object relational databases object relational model just think of like the relational model that that's extensible right and this was a uh this was a way to adopt a lot of the ideas that came out of an object oriented programming in the late 80s early 90s there's a whole slow of these databases called object oriented databases they're all dead or almost dead json xml is basically the same idea but the idea is that instead of having this district scalar atomic types in in the relational model you could extend them and have more complex things so it sort of looks like this right so i can call create type and i can specify uh the type to be a composite of of the built-in types this is how to do in postgres uh oracle we have to find as an object um if you need something to be more complex than this though uh you can write udts in actually c or its sequel server dot net and then you embed that inside the database server and then now the data system basically it has a bunch of libraries and routines to deal with integers and strings and so forth now has it for for your type which is again a really powerful idea quick example right so i can create my type my coordinates well i did it before as soon as that and then i have this object tracker table so i'll define my table to have an integer field and a location insert some basic data and then now i can do select star from object tracker right and i get back the i get back that that that composite type as a um you know as a tuple right again think of it like a macro uh but the issue is that the database server doesn't know anything about your type so if i call select some on that location what should the result be yeah an error so if you want to support you know additional uh you know just the additional operators you want to do on a on a on a on a on these composite types you'd have to define that again the you would write in c or c plus plus the the sum implementation of the the plus operators the equals operators for all these things for composite types equals pretty easy but if you want to add two coordinates together what does that mean the data system doesn't know what you're trying to do right question is can you index let's find out the answer i think yes why wouldn't you well for this type for a composite i'm assuming that it knows how to do equals right so create index this question is under the hood is this json no i think this would be sort natively it's like the native printer type this question is you know we find that out let's see let's see you can index it um so can i do this select star from object tracker where location equals and then i'm going to do something kind of ghetto see if this works see what it'll last me i don't think it's gonna let me cast it as a coordinate it did damn okay let's guess uh it's quick i doubt it uh can i index oh so over here so with with two tuples i think it's unlikely but yeah it's because it sits in a single page so so actually the question is can i do so i can do this right i can just select but can i do this no because it doesn't because i didn't find the plus operator for uh for for the coordinates type okay good good good all right the last thing i want to show you uh before you got in the real world another thing you actually will see is also views so view you think of a view as like a virtual table that is going to be defined based on the output of some select statement and you can then query them and access them as if they were a real table and in select calls but underneath the covers that you said was going to know oh you're accessing this view uh let me translate that view into the actual sequel statement that that you need right and the advantage we're going to get for this is that for really complex queries we can just define it as a view and then get that reuse where everybody everybody doesn't have to re re re re-implement the same logic uh on on a table they can just query the view right but we'll say it's not the data is not going to run magically faster because you have a view defined because it's again it's literally almost like a macro going to just drop it in uh and over top of the the declaration of the the invocation of the view um another common use for this is that you want to be able to hide a subset of the attributes from certain users it's another way to think of like again implement security right instead of having uh calm level access you can do sort of roll access based on views or tables so if you want some group of people to only be able to see some columns then you define a view on the underlying table and make sure you don't include those right the the craziest database scheme I've ever seen were some SAP uh and they had thousands of views because their application is from is that's what 50 40 years old and it's the views on top of views on top of views on top of views right it's they make a lot of money but um but the views are very common too so so here we go back to that that student table we have at the very beginning so I can create a view where I only want to see the cs students so when I define my view I have my I define to give it a name and that's going to be treated as if it's the table name and then I have my as calls and the from clear the query comes after that so in this case here I'm filtering I want to say only show me the student ID and the name of the login and I only want to see cs cs students right so this is my original table when I define this view I run a query on the view I would only see me and the RZA right and this is you know and say that again I want to be able to give this access to this view just to like you know the administrator someone doesn't need to see the grades right I can exclude it in the output here and all this is really happening uh it's just a rewrite it sees cs students and then it takes that that that select query up here and plops it in here as a nested query then you can run it through the optimizer and it does the the unnesting and turning it to a join as needed all right so then I can do again more complex things like this uh you can put aggregates you can do whatever you want um the difference between a view and select into is that a view is only going to be materialized whenever the query whenever query accesses the view and it's always going to get the latest data right ignoring transactions but it's always every time you you call from a select statement it then executes the view query so you always get the latest information yes isn't it basically like a macro renaissance where is yes I'll call these virtual views because again it's computed on demand select into we essentially do the same thing but this would be a static snapshot of the of the of the table at the time you call select into right and if I update that the underlying table that the selected into reference it doesn't automatically get propagated to the uh to the to the temp table right the SQL standard does allow you to modify a view on some rare cases I don't think Postgres allows this um but and you obviously can only do this if there's no aggregation and it's a single table because you need to rewire it and go back to what the original tuple was so this sounds nice but now what if the view query is super expensive maintain right it's one petabyte of data I don't want to have to call every time I invoke the query on the view I don't want to call you know the underlying query on one petabyte of data so there's another category of views called materialized views and the idea here is that the the the output is is of the view is computed and materialized and stored as if it was the temp table and I can potentially have the do some automatically maintain the view uh as the underlying base table changes right so Postgres doesn't do this Postgres calls materialized view it's basically sending a select into except I have they have a refresh command that calls it to to propagate again with the latest results high end commercial systems SQL server in particular is very very good at this where they can have a materialized view where they can basically track the changes to the underlying table and do incremental updates to the to the materialized view so I want to maintain a tally of like the top scores of an online game it knows that okay here this this this player got this score I only update that one portion of the view this is not easy the basic reversing a query back to the change that that matters to it there's some systems like materialized that are designed entirely for maintaining these materialized views but again SQL server is is probably the best one at this and even they don't even they're not perfect there are some queries where it says I cannot figure out how to do incremental updates and it has to do a complete refresh okay so I'm rushing through this but you'll see views in the real world okay so I was going to spend time at the end discussing okay what why do you guys think nobody does these things so I probably say the the the most common thing is going to be views then it's going to be probably udfs and then implicitly then through triggers uh you need to find types are going to be rare that the change notification is rare and short procedures are rare so we already discussed a bunch of these things and you know at first that I that you know PLBG SQL PL SQL is a an archaic language nobody's ever here's ever ever seen it before like it's not something that people like pick up and learn right um the other challenge is give that it's it's super hard to debug right what did I show here in the terminal I was just like trial and error just trying to get this thing to the thing to compile did the bugging tools are just not almost non-existent SQL servers little SQL servers better or goes a little better but like Postgres doesn't have anything it's trial and error the other thing is that if you have a database administrator they are very conservative they don't like a lot of changes to your database so if you're trying to do like an agile software development and we have you know two-week sprints and every two weeks you have new store procedures or new udfs you want the dba to install they're not going to want to do that because they want to vet everything before you make any changes make sure there's no you know performance degradations or cause any problems so if it takes your dba a month to approve whatever udf you want to install you're never going to get anything done it's as better as doing the application code now you potentially also maintain different versions of udfs because you maybe do uh rolling updates and so some database some servers some applications are using the new version some applications are using the old version but now you need to maintain different versions of the udfs because maybe calling different things right and then also now i would have in terms of deployment things have gotten a little better with like terraform and other other ways to do like uh cd and ci but you would now maybe your applications has has it's you know it's version history over here and then there's this other udf version history over there i'm going to make sure these are installed they're always in sync it's it's nontrivial and in many cases for small shops it's just faster and easier from a development side just to uh to keep everything on the application code the the trend we see oftentimes in the real world uh is that you start out doing all application logic and then as things get slow and you need things run faster then you start pushing things incrementally into the data server right you should figure out what's the slowest thing what what what could i speed up and push that in so i guarantee for if anybody's here can start a brand new startup you're not going to build any of these things a udfs trigger you probably would use uh but then if you go to a more uh i don't say older but like a more mature application code you you'll probably see a lot of these things and if it's oracle good luck so okay okay any questions quick quick is it quick the ways between this and elab system you can do all this in any elab system or to be system they're orthogonal okay all right hit it