Today's class, we'll be talking about embedded database logic. The basic idea is that we're going to take logic that we normally have in our application and implement it inside of the database system instead. This is certainly different from the things we've talked about so far. First, some updates on timing that I wanted to show last class: for Project 3, Project 4, and Homework 4, I pushed the deadlines back by one week, so you'll have a little more time for Projects 3 and 4. The homework will come out a little later, because we won't cover that material until next week, so I'll give you guys more time. Yes, that should say November at the top; Project 3 was not due five days ago, sorry. All right, and as I mentioned before, we have a bunch of other database talks coming up. We have one from BlazingDB tomorrow, which is kind of cool: two brothers down in Peru started building a GPU database, won some kind of startup contest in Colorado, and then moved to Texas and built a company there. One of the brothers, the CTO, is coming tomorrow to give the talk. And then Brytlyt is another GPU database, out of London, and they'll be coming here in two weeks to talk about their system, okay? All right, so what this lecture is really about is looking at how we can take things that we normally would have to implement in our application and see how we can support them inside of our database system. We won't talk too much about how these things are implemented internally inside the database system; the goal is to expose you to a bunch of the more complicated things you can do inside of a database system beyond plain SQL. Up to now, we've been assuming that any request we get from the application always comes through a SQL interface, and it's always been done through what is called a conversational API: the application makes a request to the database system, saying hey, execute my SQL query, then the application blocks and waits until the database server executes that query and sends back the result, and only then does the application proceed. The protocols used to support this kind of application configuration are JDBC, ODBC, or whatever proprietary wire protocol the system supports natively. So at a high level it looks like this. We have our application server running on another machine; this is your PHP code, your Python, your Node.js, whatever the hot application framework is. You invoke some high-level function as part of your application, and that's going to be a combination of program logic and SQL statements. Say you want to add something to your shopping cart on a website: you invoke that function, and maybe the first thing it does is send a SQL request to the database server to get some data, say some information about my user ID. And when this request arrives at the database server, all the stages we've talked about so far in this class get invoked one after another.
The query runs through the SQL parser to produce an abstract syntax tree, then the rewriting and planning phases, then the optimizer, and then the system actually executes the query. When that's done, the result gets shipped back to the application. And during this whole time, the application was just sitting there waiting for the database server to come back with that response. Now, while the application moves on to execute more program logic, the database system is essentially idle for this connection, waiting for the next SQL statement to show up. This is part of the reason why we want to support parallel execution: if we could only have one connection active at a time in our system, we would be blocked, waiting on this connection until it finishes before we could execute the next connection's request. After this, the application gets to another SQL statement, there's the same back and forth, the server waits, and at some point we're finally done and the operation can finish. (The PowerPoint display is flashing at me and doing something weird, so let me restart it. All right, that fixed it, cool, perfect, sorry. The recording's fine too, good.) At the end, we send a commit message. We don't have to say what that actually means just yet; we'll cover that next week. And then we know we're done. So this back and forth between the application server and the database server can take a long time, because we're doing network round trips, and depending on where the application server is relative to the database server, each one might be slow. If we're on the same machine, which is how a lot of database deployments are set up, then who cares: we're just sending internal messages over TCP/IP on the same box, and the OS can do that reasonably fast. But if we're on another machine, either in the same rack, the same data center, or even worse, in a totally different state in the country, then these round trips are expensive and they're going to slow things down. So what we're going to try to do is take some of the logic that we would normally execute on the application server side and embed it inside of our database system. One reason we want to do this is, as I said, efficiency: we don't have to send requests back and forth, and we don't need to get the data out of the database system, bring it over to the application server, make a copy of it, and then crunch on it; we can do some of the computation with the data where it resides, inside the database system. From a software engineering standpoint, this also has a benefit: if we have some complex operation embedded inside the database system and exposed through something like a function call or an RPC request, then I can rewrite my application in Python, or as a phone app, or in JavaScript, whatever way I want, and I don't have to rewrite that complex logic. It's inside the database system; I just invoke it from my application. So it makes it really easy to reuse complex operations across different implementations of your application.
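Before we go on, here is a minimal sketch of what that conversational pattern looks like as the raw statements an application might send, where each statement is a separate network round trip. The shopping-cart schema and values here are hypothetical, just to make the cost concrete:

```sql
BEGIN;                                                       -- round trip 1
SELECT cart_id FROM carts WHERE user_id = 123;               -- round trip 2: app blocks, then runs its logic
INSERT INTO cart_items (cart_id, item_id) VALUES (1, 456);   -- round trip 3
COMMIT;                                                      -- round trip 4
```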
So we're going to go through a bunch of different ways to do this: user-defined functions, stored procedures, triggers, change notifications, user-defined types, and views. All of these are things that appear in the SQL standard, but not all database systems actually support them. For all of the demos we're going to use Postgres, because Postgres supports pretty much everything. The commercial systems support a lot of these too, and we'll see that in some cases the semantics are slightly different. But MySQL, as far as I know, at least as of 5.7 (I haven't looked at 8, the latest one), doesn't support a lot of these things, okay? So the most common type of embedded logic you'll see in applications is the user-defined function (UDF). The idea is basically, as I said, there's some program logic that we would normally have in our application; we can now write that as a function and then invoke that function inside of a SQL statement. It executes inside the system itself and does whatever operation we would normally do in our application. It works just as you would expect a function in any other programming language: you specify what the input arguments are, you perform some computation, and you return some values, and these values can be either scalars or tables of rows. So the return type can be either a scalar or a table function. The computation itself can be one of two things. It can be just a list of SQL statements that get invoked when you call the function, where the output of the last query is the output of your function. Or you can write your computation in an external programming language, something that's not SQL, and have the database system invoke that; you just have to make sure you return the proper type and the information the function is declared to return. So let's look at a really simple example of a SQL function: a list of SQL statements inside the function body, where whatever the final query produces is what gets returned. We have a simple table here with two attributes, id and val. The first thing we see when we invoke CREATE FUNCTION, which is how we do this in SQL, is the name of our function, in this case get_foo, and its input parameters, here a single integer scalar value. Next we specify the return type: this one says we return rows from the table foo. The return result will be whatever the SQL statement below in the body produces. So this is saying: we have a function called get_foo, it returns zero or more rows from table foo, and here's the SQL statement that computes them. We declared one integer input argument, and we can reference it in the SQL statements in the body with a dollar sign: the first argument is $1, the second would be $2, and so forth. The other thing to point out, and I actually don't know whether this is in the SQL standard, but it's how Postgres does it, is that the double dollar signs specify the begin and end of the actual body of the function.
So the way to read this is: this is the begin, this is the end, and anything in between is the actual body, all right? So let's do a demo. This is Postgres at the top, and I'll kill the bottom terminal in a second. The first thing I'm going to do is define my table and put some dummy values into it. (Hold on, I've got to drop the old function first, sorry, I should have set this up before. There's another function here too. All right, good, now we've started.) So we create a table foo and put three tuples into it. Now we create that SQL function I just showed before, where all it does is invoke this one SQL statement: it takes the input argument, references it in the body, and does a lookup on the id field of the table. And you can invoke this in two ways. You can invoke it in a SELECT statement's output list, and you get back a single row with a single attribute that's the paired combination of the two attributes of the foo table. Or, if we want the original table schema back, we can put it in the FROM clause, and now we get the same schema that we would get from the SELECT statement inside the function. What's happening here is that we've registered this get_foo function with the database system. The parser sees the token, does a lookup into the catalog, finds there's an entry for this particular function, and gets its handle. So when it executes the query, it invokes that function, which materializes the tuples produced by the SQL statement inside the function, treats that as a temporary table, and then runs whatever the rest of the query is on it. We can run EXPLAIN on this and see what the query plan actually is, and this is important: the only thing it says is "Function Scan on get_foo". That's because Postgres doesn't know anything about what's inside the function. It doesn't know there's a SELECT statement in there; it doesn't know what it's actually going to return, other than tuples from the table foo. It just knows it's going to invoke this thing, and whatever it generates is the thing that gets processed. Is this clear? Okay, the other type of function we can have is one written in an external programming language. The reason we want this is that, although SQL is Turing complete, there are more complex things we may want to do that are easier to write in a general-purpose programming language like Java or Python than they would be in SQL. The SQL standard specifies a language for writing your UDFs called SQL/PSM, which stands for Persistent Stored Modules. The thing about this, though, is that nobody actually implements the standard; instead, everyone has their own dialect of it. So Oracle and DB2 have something called PL/SQL. Postgres has something called PL/pgSQL that looks a lot like PL/SQL, and PL/SQL looks fairly similar to SQL/PSM, but it's not exactly the same. SQL Server and Sybase have their own thing called Transact-SQL, because this was all invented in the 1980s, and again, it sort of looks like the other ones, but it's not exactly the same.
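Before moving on, here is roughly what that first demo looks like end to end. The exact values are hypothetical; the shape matches what was described above:

```sql
CREATE TABLE foo (
    id  INT PRIMARY KEY,
    val VARCHAR(16)
);
INSERT INTO foo VALUES (1, 'abc'), (2, 'def'), (3, 'ghi');

-- A plain SQL function: the body is a list of SQL statements, and the
-- output of the last one is the function's output. $1 is the first argument.
CREATE FUNCTION get_foo(int) RETURNS SETOF foo AS $$
    SELECT * FROM foo WHERE id = $1;
$$ LANGUAGE SQL;

SELECT get_foo(1);                 -- one row, one composite attribute
SELECT * FROM get_foo(1);          -- same schema as foo itself
EXPLAIN SELECT * FROM get_foo(1);  -- the plan shows only "Function Scan on get_foo"
```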
Most of you have probably never heard of SQL/PSM. It's based heavily on the language Ada, because the guy who wrote the standard was really in love with Ada back in the 1980s. If you know Pascal, it sort of looks like that. So a lot of it is going to look like programming languages from the 1980s, without all the modern stuff you'd expect in Python and Ruby and other things, but it is what it is, it's fine. If you don't want to write your UDFs in PL/SQL or SQL/PSM, some systems will let you use other, more common programming languages like Python, Ruby, Java, or C. But some systems will run your user-defined functions in a sandbox to prevent you from screwing up the database system. Oracle will let you write a user-defined function in C. But what can you do in C? You can access anything in the address space of your process. So if that function gets invoked inside the same process as the database system, you can just read any location in memory, start writing garbage, and possibly cause the system to crash and corrupt your data. So they don't let you do that: if you want to run a user-defined function in a non-safe language, meaning anything that isn't PL/SQL, they run it sandboxed as a separate process and communicate with your function over something like a pipe, an IPC message, again to prevent you from destroying the database system. I think Postgres lets you link in C objects or C libraries and do whatever you want, but then it's up to you to make sure you write safe code. All right, so here's an example of what this looks like; this is in Postgres, using PL/pgSQL. This is the same function I showed you before, where we just had a single SELECT statement and it returned whatever the last query produced. Now we have a BEGIN and END clause inside the function body delimited by the double dollar signs, and then a RETURN QUERY keyword that says: the thing you're going to return is the output of this query that you invoke here. So it's a bit more syntax to do basically the same thing as the plain SQL function, but it allows us to do more complicated things, because now we have all the constructs you'd expect in your common programming languages: for loops, if clauses, while loops, and things like that. So in this next function, we're passed an integer argument i, and we do a lookup where, for every single tuple whose id is greater than the value passed in, we loop over every one of those records and compute the sum of all their ids. Pretty straightforward. And again, going back to Postgres, it works the same way. First I drop the SQL version of the function from before, then I create the one using PL/pgSQL. And then it's the same as before: I can invoke it in the FROM clause, in the WHERE clause, or in the output list of a SELECT statement; it all works the same. And we can do the same thing for the sum function.
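Here are sketches of the two PL/pgSQL functions just described. The names are assumed (the lecture dropped the old SQL version first; I use a different name here to keep the sketch self-contained):

```sql
-- Same behavior as the plain SQL get_foo, now in PL/pgSQL with RETURN QUERY.
CREATE FUNCTION get_foo2(i int) RETURNS SETOF foo AS $$
BEGIN
    RETURN QUERY SELECT * FROM foo WHERE id = i;
END;
$$ LANGUAGE plpgsql;

-- Loop over every tuple with id greater than the input and sum their ids.
CREATE FUNCTION sum_foo(i int) RETURNS bigint AS $$
DECLARE
    total bigint := 0;
    rec   foo%ROWTYPE;
BEGIN
    FOR rec IN SELECT * FROM foo WHERE id > i LOOP
        total := total + rec.id;
    END LOOP;
    RETURN total;
END;
$$ LANGUAGE plpgsql;
```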
We compute it like that, and now we can invoke it from the terminal. The query invokes the user-defined function; the UDF goes and invokes its own query, gets back a bunch of tuples, iterates over them in the for loop, computes the sum, and then spits that back as the final output. Pretty straightforward, right? Now, user-defined functions can have side effects. Some systems won't let user-defined functions make calls that update tables, but Postgres actually lets you do this. So I can make a new function, a PL/pgSQL function, that invokes a query that does an UPDATE on the foo table, setting the value for whatever id I pass in, and then returns that as its output. If I go to my original table, SELECT * FROM foo, I see three values: abc, def, and ghi. But if I invoke this UDF and go back to the original table, I see I've modified the first tuple. I know SQL Server doesn't let you have updates in user-defined functions; Postgres lets you do this, and I think Oracle lets you do this as well. Okay, so very much related to user-defined functions is another type of function called the stored procedure. The way to think of a stored procedure is that it's meant to be a self-contained function that doesn't have to be invoked inside of a SELECT statement. In all of my earlier examples, I couldn't just call get_foo from the terminal; I always had to put it inside a SELECT statement. The idea of a stored procedure is that I don't have to do this with a SELECT statement: I just use EXEC, or the CALL operator, and invoke it directly. Postgres doesn't actually make much of a distinction, at least as of Postgres 11, where they just added stored procedures: UDFs and stored procedures look essentially the same. In some systems like SQL Server, UDFs can't update tables and can't run their own transactions, whereas stored procedures can. But at a high level, they're essentially the same. So if we go back to the example from the beginning, the idea of a stored procedure is that we take all of that logic, the SQL statements intermixed with the program logic (where one line of that program logic could itself be a user-defined function), encapsulate it inside a stored procedure, and embed that inside the database system. On the application side, I can now invoke it without running a SELECT statement: I just use the CALL keyword with the name of the procedure and the arguments I want. In terms of how this is implemented: the same way the system hands off the invocation of a query to one worker and lets it process it from the beginning, it does the same thing for a stored procedure or UDF. Whichever worker is handling the request is the one that invokes the stored procedure.
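Here is a minimal sketch of both ideas: a UDF with side effects, per the demo above, and a Postgres 11 stored procedure invoked directly with CALL. The function and procedure names are hypothetical:

```sql
-- A UDF that mutates a table (allowed in Postgres, disallowed in SQL Server):
CREATE FUNCTION update_foo(i int) RETURNS varchar AS $$
BEGIN
    UPDATE foo SET val = 'updated!' WHERE id = i;
    RETURN 'updated!';
END;
$$ LANGUAGE plpgsql;

SELECT update_foo(1);   -- invoking it from a query changes the table

-- A stored procedure does not need a SELECT statement at all:
CREATE PROCEDURE touch_foo(i int) AS $$
BEGIN
    UPDATE foo SET val = 'touched!' WHERE id = i;
END;
$$ LANGUAGE plpgsql;

CALL touch_foo(2);
```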
Now, some systems, I know MemSQL supports this, support parallel stored procedures: the system can recognize that you're doing a for loop, maybe over a table, and instead of having just one thread do all that computation, it can farm it out to different nodes or different threads and run your loop iterations in parallel. And that's kind of nice, because you don't have to use MPI or other parallel programming constructs to do this; the database system can do it directly from something that looks a lot like PL/pgSQL. And again, it's just one round trip to invoke the whole thing. So to recap the distinction: UDFs are sort of meant to be read-only computation within a single query, although not all systems actually restrict you to that, while a stored procedure is meant to be a larger amount of computation you want to do. Adding something to somebody's shopping cart is something you would implement in a stored procedure. This is purely semantics between the two of them: some systems make the distinction, other systems do not. All right, any questions about UDFs or stored procedures? Who here has ever written a UDF? One, good, okay. How was it? She said pretty fun, okay. They can be hard to debug. I think there are now better tools for this, but a few years ago it was trial and error, because if you think about it, you can't really connect to the database server with GDB and step through your function. Nowadays this has gotten much better; the commercial systems especially have better tools for these things, okay? So the next kind of embedded logic is triggers. The idea of a trigger is that it provides an instruction to the database system telling it to invoke a UDF whenever some event or some change occurs in our database. So think about how you would implement something like this: every single time a tuple gets modified in a table, I want to keep a log record of it, in a log table that I can view through SQL. We'll talk about how logging works internally in a few weeks, but think about it at the application level. How would you actually implement this? Naively, you might just write application code that polls the table every 10 seconds or so and checks whether there were any changes. That's inefficient, because again, every poll is a SELECT statement going from the application server to the database system and doing computation, and you may have to read a bunch of unnecessary stuff just to find the things that actually changed. Triggers are a way to avoid this: we push all the logic for getting notified about changes in our database down so it runs directly inside the database system itself. To define a trigger, we have to specify what type of event should cause the trigger to be fired, what the scope of the event is (I'll explain what that means in a second), and, when the trigger is defined on a particular event, when it should be invoked relative to the event. If I insert a tuple into a table, should I fire my trigger before the tuple is inserted or after? So let's look at an example, starting with the foo table we had before, with its two columns.
What we want is that any time the application modifies the val attribute of any tuple in this table, we write out an entry to a new table called foo_audit that records the primary key of the record being modified, the old value that was overwritten, and a timestamp of when that occurred. Basic information; this is a pretty common setup, people want to audit what changes occur. So we create a trigger function, and in Postgres we can write this in PL/pgSQL. We define its return type as trigger; this tells the database system that this function can be invoked as a trigger and will have the correct semantics for its return values. Then internally, inside the body of our function, we have these two implicit variables, NEW and OLD. If we're doing an update, OLD holds the original values of the tuple, and NEW holds what they're being overwritten with. These are internal variables that get set up for you inside a trigger function to give you access to the old and new values. So here, we check: if NEW.val, the val attribute of the new tuple, does not equal what the old tuple's val attribute was, then we fall into the body of the if clause and invoke an INSERT query that adds our entry to the foo_audit table we defined above, including the timestamp of when this occurred. Pretty straightforward. Then we declare a trigger called foo_updates, and here's where we specify the scope and when the trigger should be fired. You can read it exactly as it's stated in English: create a trigger called foo_updates that gets invoked before an update occurs on the foo table, fire this trigger for every single row being updated in that table, and have it invoke our function log_foo_update. One part is kind of confusing: I said there are user-defined functions and there are stored procedures, but at least in Postgres, the syntax here says EXECUTE PROCEDURE, not EXECUTE FUNCTION, even though up above we defined it as a function. I think that's just something from the SQL standard. So let's see a demo of this. Again, we have our foo table. We create our foo_audit table: the primary key is just a serial attribute for this table, there's a foreign key reference to the foo id, the original value that was in the tuple before it was modified, and the timestamp of when the audit record was created. Now I define the same function I showed you before: take NEW.val, and if it doesn't equal OLD.val for the tuple being modified, insert an entry into the foo_audit table. And then I create my trigger. So now, if I go back to my original table and do: UPDATE foo SET val equal to, maybe we'll just append 'Andy' to it, WHERE id equals one, and then go look at my table, I did the modification; and if I look in foo_audit, I have an entry. The trigger got fired before the tuple got updated.
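Putting the pieces of this demo together, here is roughly what the whole setup looks like. Column names and sizes are assumed; the structure follows the description above:

```sql
CREATE TABLE foo_audit (
    id       SERIAL PRIMARY KEY,       -- serial primary key for the audit table
    foo_id   INT REFERENCES foo (id),  -- foreign key back to the modified tuple
    orig_val VARCHAR(16),              -- the value being overwritten
    cdate    TIMESTAMP                 -- when the audit record was created
);

CREATE FUNCTION log_foo_update() RETURNS trigger AS $$
BEGIN
    IF NEW.val != OLD.val THEN
        INSERT INTO foo_audit (foo_id, orig_val, cdate)
            VALUES (OLD.id, OLD.val, NOW());
    END IF;
    RETURN NEW;  -- for a BEFORE row trigger, this is the tuple that gets written
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_updates BEFORE UPDATE ON foo
    FOR EACH ROW EXECUTE PROCEDURE log_foo_update();
```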
And it recorded the original value, which was xxx, along with the id of the tuple that I modified, which in this case was one. So think about this: I invoked an update query, which fired the function inside the database system. I didn't have to go anywhere in the outside world to make this happen; the application server didn't see any of this. And it created a new entry in another table for me. So you can see how you can build on these basic concepts as building blocks and do more complicated things. Again, the great thing about embedding the logic inside the database system itself is that no matter who connects and does an update on this table, whether it's from the terminal, from PHP code or Python code, whatever, that trigger is always going to get fired. So I don't have to re-implement the logic that adds something to the audit log in every single implementation of my application, or build some middleware to trap all the update commands and convert them into extra insert queries. So again, the three things you need to define a trigger. First, the event type: we can do this on any modification, INSERT, UPDATE, DELETE, TRUNCATE, CREATE TABLE, ALTER TABLE, ALTER SYSTEM, DROP TABLE, all the standard commands. As far as I know, and I should double-check this, I don't think you can define triggers on SELECTs. I can imagine cases where that would actually be useful, again for security auditing: if people are reading data, you might want to know whether they read it or not. I don't know whether you can do that with a trigger. Second, the scope: where you're looking for changes to occur. This could be on a table, on the entire database, on a view (which we'll define later on), or on the system itself; if you change the system configuration, you can have that fire a trigger. Third, the timing: the trigger can fire either before or after the statement that causes it. In my example here, I fired the trigger before I updated the tuple; I could also do it after. And I can do it for every single row I'm going to modify: invoke the trigger, modify the row, then for the next row invoke the trigger again, and modify the next tuple, and so on, whereas in the statement-level case the trigger fires once, before anything is actually updated. There's a sketch of these variants below. And the last one, which not all systems support: you can basically have the trigger say, whatever query was going to cause this trigger to get fired, don't execute it at all; execute my trigger instead. You override the operation, override the query, and have it do whatever it is you want instead. We'll talk about transactional semantics next class, but the way to think about it is that if the trigger fails, if there's an error in my trigger, then the update won't actually occur, because the trigger is tied to it. It's part of the same command, the same query, that caused the trigger to get fired, so if the trigger fails, the query fails with it. We can try this out in Postgres right now. So let's drop our trigger, which we called, what was it, foo_updates. Maybe it won't let me drop it... there we go, okay.
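As a side note, here is a minimal sketch of the scope and timing options just listed, in Postgres syntax. The trigger and function names are hypothetical, and the function is a no-op just to show the clauses:

```sql
CREATE FUNCTION noop_trig() RETURNS trigger AS $$
BEGIN
    RETURN NEW;  -- in a row-level BEFORE trigger, returning NULL would suppress the operation
END;
$$ LANGUAGE plpgsql;

-- Fire once per modified row, before each row is written:
CREATE TRIGGER per_row BEFORE UPDATE ON foo
    FOR EACH ROW EXECUTE PROCEDURE noop_trig();

-- Fire once per statement, after the whole UPDATE finishes
-- (the return value is ignored for statement-level and AFTER triggers):
CREATE TRIGGER per_stmt AFTER UPDATE ON foo
    FOR EACH STATEMENT EXECUTE PROCEDURE noop_trig();
```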
So we dropped the trigger, and now we'll rewrite the function to insert into a bad table, a table that doesn't exist: we change the INSERT here to insert into foo_audit_xxx. And this is a good example of what Postgres is actually doing. Postgres doesn't do any compilation or static type checking on these UDFs; it interprets them, and the way it interprets them is pretty wild. Pretty much every expression, every instruction inside the UDF, gets converted into a SELECT statement. Take the if clause here: IF NEW.val is not equal to OLD.val. The way Postgres actually evaluates this is to rewrite it as SELECT NEW.val != OLD.val, and whether that SELECT statement returns true or false is how it knows whether to go down into the conditional branch or not. They do this because, rather than building a completely separate interpreter that knows how to evaluate these expressions and predicates, they already have an awesome one implemented in the database system itself, so they just reuse that to execute these expressions. So now, again, we define our function, which is going to try to do an insert on a table that doesn't exist, foo_audit_xxx, and then we define our trigger. The system doesn't know there's an error in the code, so it's allowed to do all this. Now we try to invoke our query, the same one appending 'Andy' to the value. (That one didn't work, let me try another, sorry.) Right, so here we see that I tried to invoke the query, that fired the trigger, and the trigger said: this table doesn't actually exist. So it failed, and it even gives you the line in the function where it failed. And now if I go and look at my table, we see that the update failed too: it didn't append 'Andy' to the first tuple. Before the query's changes could take effect, it invoked the trigger, the trigger failed, so it rolls back everything, okay? Okay, so triggers are nice, but the information about the change is still only inside the database system. So we sort of have the same problem we had before: how do I know when something actually changed? I would still have to write code to poll the audit table every so often to see whether something got modified. But we can do better. We can use what are called change notifications to alert somebody that a change has occurred. Think of it like a pub/sub notification, if you know Kafka or one of these message processing systems. I can declare an event channel and say: if you care about getting updates, listen here. Then I can tie that into maybe a trigger or other things, and any time there's a notification, whoever's listening on that channel is told about it. So the SQL standard specifies that you have LISTEN and NOTIFY. All the major systems including Postgres support this; MySQL and SQLite do not, as far as I know. And in the case of Postgres, the notification payload has to be just a text string; in the commercial systems, I think you can send along more complex types and information. So let's look at an example. I'm going to create a trigger function here, and what it's going to do is, any time we modify the audit table, take the tuple that got modified.
That is, the new version of the tuple. I'm going to make that the payload and send it out to anybody listening on my event channel. What we're doing here is very Postgres-specific: we take the tuple NEW, and Postgres has a function called row_to_json, which I don't think is part of the SQL standard, that takes the tuple and converts it into a JSON document. We store that in a notification variable. Then we have another Postgres-specific function called pg_notify that sends a notification on the foo_update channel, the event channel that somebody's listening on, and passes along the notification we converted to JSON. But because in Postgres all notification payloads have to be text, we convert the JSON into text; the double colon does casting in Postgres, and again, I don't think that's part of the SQL standard. The other thing I want to point out, and another reason I'm not such a fan of PL/SQL, is that it's like Pascal in that I have to declare all my variables before the function body. It's not like Python or C where you can declare variables anywhere you want; you have to pre-declare them at the top, and only then can you use them down below. So up here I declare that notification is of type json, and later on I cast it to text when I put it out on the channel. Then I register this function to be invoked whenever there's an insert into the audit table we defined before: create a trigger called foo_notify that, after an insert on the foo_audit table, for each row, invokes my procedure notify_foo_update, which then runs this body here. So let's give this a try. I've got two terminals now. In the bottom terminal, I listen on the foo_update channel: LISTEN foo_update. (The channel wasn't declared anywhere beforehand; it exists once I've used it, yes.) Now in the top terminal, I create the function I showed before; again, all it does is take whatever tuple got inserted, convert it to JSON, and invoke pg_notify. Note that inside the trigger function, I call PERFORM pg_notify rather than SELECT, since we're discarding the result. Then I create my trigger: any time there's an insert into foo_audit, which happens whenever we update the foo table, fire the notification. (Actually, we have to go fix the UDF first, because otherwise it's going to fail on that nonexistent table; remove this, so we're good there, all right?) So now we'll do an update on the second tuple. We go look in the foo table, and you can see it got updated at the bottom here: the second tuple got modified, that's good. We look in our audit table: we have an entry for it, here's the old value, and here's the date when it occurred. The problem is, I didn't get a notification down below in the listening terminal. And I'm going to take a guess why. What if I do this? Now I've got the update: an asynchronous notification on foo_update with the payload, and here's the JSON.
Here's the tuple that got inserted into foo_audit, converted into JSON. So take a guess why this happened. Why did I only get this when I ran SELECT 1? The answer is sort of right there in the output: it's asynchronous. The terminal is just sitting there; it's not actively communicating with the database system. Only when I run something, say SELECT 1, does it send a message over to the database system. The system processes that query, which is just returning the value one, but then it says: oh, by the way, you were listening on this channel foo_update, and I have a notification for you; here it is. The database system internally keeps track of which client connections are waiting on which events, and it knows whether you've received the events you're supposed to receive. So this doesn't solve our problem entirely, because now we still have to write something that calls SELECT 1 over and over again. Actually, a lot of application servers do exactly this. If you've ever used Ruby on Rails, and I think Django might do this too, when you look at the query log you see SELECT 1 show up all the time, because they're trying to keep the connection alive and check for notifications like this. So some application frameworks will do this for you. If you really want to be able to push events to the outside world, you could write your user-defined function in C and have it call whatever outside API you want; you could call out to Kafka and send data there. I think Kafka also has connectors for doing LISTEN/NOTIFY in SQL like this as well. So this approach gets used in a bunch of different ways. Yes, in the back. The question is: what is the performance overhead of using triggers? In what way? Be more specific. Okay, so the question is: what would be the performance overhead of using triggers to maintain the audit table? Well, in my particular example here, what was the trigger doing? Checking whether two values didn't match, then doing an insert. That's the same cost as if I wrote those two queries from the terminal and shoved them at the database server together. This is sort of a toy example, but the trigger is going to be invoked by whatever thread is executing the query, so the overhead of what the trigger is actually doing depends on what you wrote it to do. If I want to compute the millionth digit of pi in my trigger, I can do that, but my query is going to block until that computation is done. The commercial systems have mechanisms and monitoring tools to make sure that no one thread and no one user-defined function eats up all the computation, and they can go ahead and kill it if it spins forever. In very complex applications, well, there's a major Internet services company that everyone has heard of, and they've told me they have user-defined functions that invoke other user-defined functions, going something like seven levels deep. And the user-defined functions can construct SQL queries based on their inputs: in the same way that I can concatenate strings together in Python and then invoke the result as a query, I can do the same thing in PL/SQL, concatenating strings to generate a SQL statement that then gets invoked as if it had been written as a constant string. So the overhead depends on what you do with it.
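For reference, here is roughly what the notification plumbing from the demo above looks like, pulled together in one place. This is Postgres-specific; the names follow the lecture:

```sql
CREATE FUNCTION notify_foo_update() RETURNS trigger AS $$
DECLARE
    notification json;                 -- variables must be declared up front, Pascal-style
BEGIN
    notification := row_to_json(NEW);  -- convert the inserted tuple to a JSON document
    -- Payloads must be text in Postgres, hence the ::text cast:
    PERFORM pg_notify('foo_update', notification::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER foo_notify AFTER INSERT ON foo_audit
    FOR EACH ROW EXECUTE PROCEDURE notify_foo_update();

-- In another session:
--   LISTEN foo_update;
--   SELECT 1;   -- any interaction flushes pending notifications to this client
```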
Most user-defined functions out in the wild, though I can't prove this, are actually pretty simple. There's a great paper, which we won't cover in this class but will cover in the advanced class, from Microsoft SQL Server, where they can take user-defined functions, and for about 70% of all the UDFs they see on SQL Azure, their cloud system, rewrite them as relational algebra and inline them directly inside the query. We didn't really talk about this too much, but the database system's optimizer treats these user-defined functions as black boxes. It doesn't know anything about what the function is actually doing; as I showed, the system doesn't even fully interpret it ahead of time, since I had a bad query in there and it wasn't found until the function was actually invoked. So the optimizer just sees: hey, I've got this function, and I know nothing about what it does. I could have a predicate inside a user-defined function, something greater than something, that's doing filtering for me, but the optimizer doesn't know anything about its selectivity, and therefore doesn't know how to estimate the cost of invoking it or how to use it in query planning. So user-defined functions can be very tricky to get to perform well: not so much the computation they're doing, but how they're used in the rest of the query. The Microsoft paper we'll cover next semester shows how you can extract the logic of a user-defined function and embed it directly inside the query itself, almost like a nested query; then you can apply all the selectivity and cardinality estimation we talked about before to that user-defined function, in a way you otherwise couldn't. That was a long answer to your question, but basically: it depends. It depends on how expensive your function is, and how it relates to the rest of the query, which affects what the optimizer chooses for a query plan. All right, so the next thing we want to talk about is the type system. Every single database system supports the basic types of the SQL standard: ints, floats, varchars, dates, timestamps, booleans in some form. And they support basic arithmetic and manipulation operations on them, like adding two integers together. But what happens if we have a type that's more complex than these primitives? How would we store something in a database system that doesn't match up exactly with what the SQL standard specifies? To give a really simple example, let's say I want to store a coordinate type: an x coordinate, a y coordinate, and then a string label, some tag that people put on a map or something like that. One way we could store this in a database system is to do attribute splitting. Since this thing has three fields, x, y, and label, we just create a table that has those attributes stored separately as primitive types, and any time we need to operate on them, we know which columns to go get. It's then up to our application server, our application code, to recognize that inside my application I may have a complex type, this triplet, but any time I want to read or write it to the database, I have to split it up across these three attributes. There's a sketch of this below.
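A minimal sketch of attribute splitting for the coordinate example; the schema here is hypothetical:

```sql
-- Each field of the coordinate becomes its own primitive-typed column.
CREATE TABLE locations (
    x     INT,
    y     INT,
    label VARCHAR(32)
);
-- The application has to split the coordinate object into (x, y, label) on
-- writes and reassemble it on reads; the DBMS knows nothing about the type.
```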
Another approach is to do serialization. Pick your favorite programming language: Java, Python, Ruby, whatever. They all have serialization functions that take types or objects and serialize them into a binary format that can be written out to disk. Or if you're using an RPC package like Google's Protocol Buffers or Facebook's Thrift, they have the same kind of thing: you can define these complex types and write them out in a binary format. So we just store those as blobs inside our database system in a single attribute. Alternatively, we can serialize the object as JSON and store it that way. So in Postgres, I can create a table called locations and define a coord attribute of type JSONB, which is Postgres's binary JSON type, and then any time I want to store the location of something, I serialize the JSON document and write it out almost like a text field. So what are the downsides of these two approaches? What's one really obvious one? In my application code, I have to know how to split these things apart and put them back together. In the same way we talked about with user-defined functions and triggers, if I have a mobile app and a server app that want to talk to the same database, I have to reproduce the logic for splitting the type and reassembling it in both versions of the code. In the case of the blob approach at the bottom, it's actually really bad, because if I just take the object and serialize it in Python or Java, then the only way I can access a sub-element of the type is to pull back the whole blob and deserialize it. I can't say: find me all the objects where x equals 10. And if I want to compute aggregations over these, I have to do it all on the application server side, because I can't deserialize inside the database system. JSON is a little bit better, because in some systems like Postgres you can access the individual elements of the object or the document and you can build indexes on them. But this is still not great, because again, the application code has to know how to handle these things. The way we can overcome this is with what are called user-defined types. I can specify to the database system: here's a custom type that I want to use, and here's everything about how you can interact with it, access it, and manipulate it. Think of it like operator overloading: the plus and minus operators, all the things you can do with primitive types, your user-defined type can support as well. In the SQL standard, I think they call these structured user-defined types, or structured types. User-defined types are one of the major inventions that came out of the Postgres project in the 1980s, and they're now used in pretty much every single database system. Here's the original story. One of my advisors when I was in grad school was Mike Stonebraker. Stonebraker was the inventor of Postgres; before Postgres, there was this thing called Ingres at Berkeley. Postgres is "post-Ingres", the thing that came after Ingres. So they spun off Ingres as a startup in the late 1970s, early 1980s, and Mike always tells the story of going on sales calls to banks in New York City, these financial firms, trying to get them to use Ingres.
And the banks would always say to them: well, we need to store bank information, we need to store a bunch of records with dates. So Mike would say: yeah, Ingres supports dates, no problem, we can do whatever you want to do, we have the date type. The problem was that the banks at that time, in the 1980s, were computing with a banking calendar that has 30 days in every single month, whereas Ingres used the standard calendar with its mix of days per month. So when these banks used Ingres and computed interest, the date arithmetic didn't match up with how their calendar worked, and they would end up losing money. He realized that the only way they could support these banks' databases was to go into the Ingres code and modify it to have a special type for those calendar dates. So when Postgres came along, they realized that rather than everyone having to know how to modify the database's source code, they'd expose a user-defined type API and make it much "easier", easier in quotes, for people to extend the database system to support new types. And as I said, this was later added to the SQL standard in 1999. Any time you see a system called an object-relational database, this is really the key part of what they mean: you can extend the basic type system with things that essentially look like objects. Now, this is probably the one aspect of database systems that varies the most, even though there is a standard. It's even worse than SQL, even worse than user-defined functions, in terms of how you actually define user-defined types across systems. Oracle is probably the best, in my opinion, though I can't prove this: in PL/SQL they support creating types with something that looks almost like a CREATE TABLE statement. To create the coordinates type I showed before, I'd say CREATE TYPE ... AS OBJECT and define inside it the same things I would in a CREATE TABLE statement, and then I can use this type inside a CREATE TABLE statement to declare columns of the coordinates type (there's a sketch of this just below). DB2 supports types built from the built-in types, and it looks a lot like this. SQL Server and Postgres require you to write shared modules, shared objects that get linked into the database system at runtime, and you have to implement the API they define in order to have your type interact with the database system. And this is where things go off the rails in terms of portability, because in SQL Server you're going to write this in .NET, and in Postgres you're going to write this in C, and the APIs they expose are completely different. So this is a very, very powerful idea, but it's not very portable. All right, the last thing we'll talk about is views. The way to think about a view is that it's a virtual table defined by a SELECT statement. It looks to the application as if it's just another table, and you can run queries on it just as if it were a table. But underneath the covers, what's going to happen is the system rewrites your query: instead of running it against the view, which doesn't actually exist since it's just a virtual thing, it rewrites your query in terms of the view's definition query.
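Backing up to the Oracle type syntax mentioned a moment ago, here is a rough sketch of what it looks like. This is hedged: the type and table names are hypothetical, and I haven't checked this against any particular Oracle release:

```sql
-- Define the complex type almost like a CREATE TABLE statement:
CREATE TYPE coordinates AS OBJECT (
    x     INT,
    y     INT,
    label VARCHAR(32)
);

-- Then use it as a column type in a regular table:
CREATE TABLE pins (
    id  INT,
    loc coordinates
);
```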
The reason you want views is that they let you take a really complex query and encapsulate it into a single view, so that everyone doesn't have to copy the same complex query over and over again. They just run whatever query they want directly against the view, and it does the complex computation for them. It's also often used as a way to hide certain data, certain attributes or columns, from some users. Let's say I have a base table with ten attributes. If you're the chief executive of the company, you can see all ten attributes; but if you're a lowly person in the mail room, you should only be allowed to see two of them. So we can define a view to prevent you from seeing everything. A simple example: go back to that student table we talked about at the beginning of the semester. I want to create a view of just the computer science students that only has their student id, their name, and their login information. There's also a GPA field, but I want to hide that in my view. So I call CREATE VIEW, say the name of my view is cs_student, and then after the AS clause I have a SELECT statement: this is the underlying query that's going to back this particular view. So now, if I have my original table with these four tuples, I can run a query on my view cs_student, and it produces output where I only have the students whose login includes @cs, and I only include their id, name, and login information; no age or GPA. What the database system does, at the rewriter level (this can be done with static rules, it doesn't have to go through the optimizer), is recognize that cs_student is a view, and then substitute in whatever was inside the view's SELECT statement. Pretty straightforward. So again, views are virtual. That means the contents of this query are only generated when you ask for them inside a query, and when that query is done, those contents aren't reusable for the next query. So every single time I run SELECT * FROM cs_student, it fires that query up above all over again. And we can do more complicated things. Say we want to get the average GPA of all the students in CS: I can do an aggregation inside the view, and any time I do a lookup on cs_student, it fires this query and actually computes the aggregation. Now, some systems will let you modify views, like running UPDATE on cs_student with a WHERE clause, as long as the view isn't doing aggregation. It's sort of obvious why: if I do an aggregation, I'm coalescing multiple tuples down to a single tuple, so there's no way to reverse an average GPA back to the original values. Whereas in my other example, the system is just taking this query and rewriting it, so it knows how to map the WHERE clause lookup back to the original tuples the view is based on. So not all systems will let you do this; some will and some won't. Yes, sorry, a question. His question, which is a very good one, actually perfect: if the underlying base table is updated, is the view updated? So again, the view doesn't really exist.
There aren't any tuples materialized. The view is only invoked when you run a query that references it; then it goes and executes the view's query to generate the results for that one query. So if I update the underlying table and then run my query that uses the view, I will see those updates. Again, the results are only materialized when they're actually needed by a query. Now, this is different from SELECT INTO. SELECT INTO creates a static table in the database that's a snapshot of what the tuples were at the moment you ran the statement, but it becomes disconnected from the base table: if I call SELECT INTO, write the results out to that table, and the underlying table then gets updated, those updates don't propagate to the copy. The kind of view that does exactly what he's asking about, where the view is pre-computed and any update to the underlying table is propagated to the view, is called a materialized view. The way to think about it is that I create a materialized view over the same query as before, and it gets stored internally as if it were a regular table. But any time I update the student table, the system knows that this view is derived from the student table. It's almost like a trigger: updating the student table fires something that updates the view, so it's always kept in sync. Now, the tricky part is how you actually propagate those updates. What's the easiest, dumbest thing to do? How do I update a materialized view? Yes? Recompute the whole query. Recompute the whole query, right: any time someone updates student, fire the definition query off again and recompute the whole thing. That's fine if I have 10 tuples; if I have a billion tuples, that's maybe not what I want to do. The commercial systems, and I'm not saying they're good at this, but they're better than the open source ones; Postgres and MySQL don't even support incremental maintenance. There's been quite a bit of research in the last 20 or 30 years on methods that do delta propagation of changes from the underlying tables to the materialized view. So in this case, we could be smart and recognize that we only need to recompute this thing when the WHERE clause evaluates to true for the changed tuple. Or be even smarter and recognize: well, I know what the average GPA was before, and I know which tuple got modified, so I know how to adjust the average GPA without looking at everyone else all over again. Internally you can maintain the sum and the count for the students that match, decrement the old value that got modified, add the new one, and recompute the average. These are incremental updates to a materialized view, and that's ideally what you'd want to do, because then you don't have to scan the whole original table again; but not all systems support it. As far as I know, at least in Postgres 9 and 10, they don't support incremental updates to materialized views (I don't know about 11 and 12), and MySQL and SQLite don't support materialized views at all.
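Here is a sketch of the view examples above, in Postgres syntax. The student schema and the @cs login convention follow the lecture's earlier example; the column names are assumed:

```sql
-- A regular view: virtual, re-evaluated on every query that references it.
CREATE VIEW cs_student AS
    SELECT sid, name, login FROM student WHERE login LIKE '%@cs';

-- A materialized view stores its result internally like a real table:
CREATE MATERIALIZED VIEW cs_gpa AS
    SELECT AVG(gpa) AS avg_gpa FROM student WHERE login LIKE '%@cs';

-- Postgres does not maintain it incrementally; you refresh it manually,
-- which recomputes the whole definition query:
REFRESH MATERIALIZED VIEW cs_gpa;
```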
You can fake this with triggers, but you're gonna end up having to implement a lot of the logic to do the incremental updates yourself, okay? I'll show a rough sketch of what that could look like in a moment. Yes? So the question is: if I have a materialized view and I have a query that does an update on the table, will my query get blocked until the materialized view is updated, or will the update be deferred until later? So I think by default you get blocked. I think there are ways to define deferred updates, right? So if you're okay with your materialized view being slightly out of sync, then I think there's a way to mark it as deferred. I know you can do this with triggers: you can say, invoke this trigger after the query is actually done, not during it. And that just says, all right, I'll propagate the change later. So I think you can do both. I don't actually know what the commercial systems do, and by default I think it's always kept in sync.

Yes? His question is: are materialized views actually, really materialized, or are they virtual? So the most naive implementation would be that they are fully materialized, right? Whatever this query produces is always going to be stored internally. But underneath the covers, you can imagine a scenario where you're a bit smarter about how you do things: maybe you only maintain the data in chunks, so that you know how to do incremental updates to those chunks, and then anytime you actually query the view itself, you put those chunks together and produce the full answer. There's a bunch of different ways to do it. I don't know what the actual commercial systems do. In addition to the query optimizer being a very closely held secret in these systems, how they do updates for materialized views is another one. Microsoft did a lot of great research on this in the 1990s, but I don't actually know what they do today. We'll cover materialized views a bit more in the advanced class next semester.
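Here's that rough sketch of faking incremental maintenance with triggers, using the sum-and-count trick from before. Everything in it is hypothetical: it's Postgres-style plpgsql, it reuses the assumed student table from earlier, and names like cs_gpa_summary are made up. A real version would also need UPDATE and DELETE triggers.

    -- A one-row summary table standing in for the materialized view.
    CREATE TABLE cs_gpa_summary (
        gpa_sum   FLOAT  NOT NULL,
        gpa_count BIGINT NOT NULL
    );

    -- Seed it from the existing data so the running totals start out correct.
    INSERT INTO cs_gpa_summary
        SELECT COALESCE(SUM(gpa), 0), COUNT(*)
          FROM student
         WHERE login LIKE '%@cs';

    -- On every insert of a CS student, fold the new GPA into the running
    -- sum and count instead of rescanning the whole table.
    CREATE FUNCTION cs_gpa_on_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.login LIKE '%@cs' THEN
            UPDATE cs_gpa_summary
               SET gpa_sum   = gpa_sum + NEW.gpa,
                   gpa_count = gpa_count + 1;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER cs_gpa_trigger
        AFTER INSERT ON student
        FOR EACH ROW
        EXECUTE FUNCTION cs_gpa_on_insert();  -- EXECUTE PROCEDURE on Postgres 10 and older

    -- The "view" is then just a constant-time lookup:
    SELECT gpa_sum / NULLIF(gpa_count, 0) AS avg_gpa FROM cs_gpa_summary;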
We're out of time, so here's the main takeaway I want you to get from all this: there are a bunch of different ways we can take logic that would normally live in our application and embed it inside of the database system. We do this for efficiency reasons. We also do this because we can reuse that logic across other implementations of our application, and not worry about people having typos or implementing things incorrectly in ways that mess up the database. The issue, of course, is that all these techniques are not magical. They don't come for free. A lot of them are not portable. And a lot of times it's hard to make constant changes inside the database system, because DBAs want to be very conservative. Meaning, if your application is getting updated every two weeks and you need to update your user-defined functions every two weeks, your DBA is not going to like you, because every single time you give them a new change, a new function, they want to vet it to make sure you're not doing something stupid. The database system is a core, critical part of your application stack. It's probably running on very expensive hardware. And you don't want it to get overloaded because you have some mistake in your function.

And the other tricky thing, from a software engineering standpoint: again, if you're doing rolling updates every two weeks and you have different user-defined functions, you now need to do version tracking of your functions to make sure that the application on each server is getting the right user-defined function. Think of it this way: Facebook does rolling updates every three or four weeks, whatever it is. That means some application servers would have the new version and some would have the old version. They may be talking to the same database, and therefore you want to make sure each one gets the right user-defined function for the version of the application it's running. The database system is not going to maintain this for you; you have to manage this yourself in your application. Again, I'm not saying Facebook actually uses user-defined functions, and I'm not guaranteeing that they do, but think of a scenario like that.

So at this point in the course, starting on Monday, we're going to cover my most favorite thing about databases: transactions and concurrency control. It's hard, you will cry at night trying to understand it, but that's okay. I think this is one of those really interesting things, and it's what really differentiates database systems in terms of the benefits they provide versus rolling your own and writing things out to files on disk yourself. This is hard to get correct, and we'll spend the next four weeks discussing it. All right guys, have a good weekend.