Today is about how to move some logic that would normally be in your application into the database system, and we'll do this for performance and other efficiency reasons. I was going back and forth on whether to call this extended database logic or server-side database logic, and I finally ended up with embedded; I think it'll be clear what I mean as we go along. So real quick, some administrative stuff: project two is due this Wednesday at midnight. Everyone should have gotten started on this by now. The midterm exams are all graded; all the grades are up on Gradescope — sorry, on Canvas — and they've been pushed to S3, which I think goes out at 4pm today. We're not handing back the exams, but if you want to look at yours anytime during the semester, come to my office hours, bring your CMU ID, and you can sit and look at it with the solutions and figure out where you got things wrong. And if there are any issues — you want something regraded — just take a photo of the page and send an email to us on Canvas and we will look at it. So at the halfway point of the semester, we're skewing very high on people doing very well in the course. This doesn't surprise me, because most of the grades at this point have been the homeworks and only the first project, right? There are still three other projects due after the midterm, and there will actually be fewer homeworks. So at this point the course is switching to more development, and you should plan accordingly. The other thing I want to announce is that the information about the extra credit assignment goes up today. The extra credit assignment is for you to pick some database management system that you want to play with and get one of our benchmarks running in our open-source benchmarking framework called OLTPBench. Get that benchmark to run on your system, get it to run as fast as possible — we'll set up a leaderboard and see how fast you are compared to other systems — and then push your updates into our benchmarking framework. By getting something to run, I really just mean making sure that the loader and the workload generator work (it's written in Java), plus any SQL changes you would need. Everything in the current benchmark framework targets MySQL, and so there are these dialect files where you may have to modify things slightly to make them work on the system you're targeting. So when I say pick any database management system you want, there are actually three constraints: one, the system must support transactions or analytical queries; two, it has to support SQL; and three, it has to support JDBC, because the benchmarking framework uses that API. The link there will take you to the GitHub repository for the benchmarking framework. I'm announcing this right now for you guys who are actually in class today, since it'll be first come, first served for picking a database system. So if you go to that URL now and you want to pick your favorite database management system while we're going along, do that now, before we announce it to everyone else in the class who may be watching the video later on. You'll get first dibs. The way we're going to do this is that you're organized into groups of three: if three people want to work on the same database management system, you guys can work together and get it all running.
I have talked to some friends in industry at some of these database startups, and if you want help getting this thing to run as fast as possible, I can help you get in touch with them. We'll run everything on Amazon, and everyone will get $100 of credit per person — not per group — to set up their system and do this. So again, the sign-up sheet is at that URL; it'll take you to a Google spreadsheet, and you just add yourself to the system that you want. There's also a link to another Google spreadsheet that I've been maintaining for a while now of every single database management system that I know about — there are 400 or 500 or so on it. Not all of them are going to support JDBC and SQL, and not all of them are something you would actually want to play with, but if you want help picking one, you can look at that list. The other thing I'll say is that what you end up turning in is a pull request on GitHub that says: here's the dialect file and the patches you need to make this thing actually work. Then you'll also provide some information about how you configured the system to run the benchmark. And then we'll have a leaderboard where you can post, for each benchmark, how fast you can actually run it. The idea is that everyone's going to target the same EC2 instance type and run this benchmark, and then we can have a contest to see whose setup runs the fastest. The other thing I'll say is that there are already a bunch of database systems that our framework supports, so if you go to that link, you'll see on the spreadsheet the ones you're not allowed to pick. Like I said, we already support MySQL, Postgres, SQLite, and a bunch of embedded Java databases that we use for testing, which you're not going to want to run this on anyway. Any questions? Does everyone here have that URL so you can sign up first? Again, I'll announce this late tonight, so everyone who's not in class will have to take your table scraps. The other thing I want to announce is that we have a bunch of database talks coming up this week that might be interesting to you. On Tuesday — tomorrow night at 6 p.m. in Gates — one of the lead architects of Oracle's in-memory database engine will be talking about their system. I think there's dinner there, so that's free food. Then on Wednesday, I have a friend from a venture capital firm out of Boston that does a lot of early investment in technology startups; he'll be talking about what it takes to do a startup in the database field, or networking, or things like that. And then, as part of our time-series database lectures on Thursday, we will have one of the lead engineers from KDB come give a talk about their system. Who here has heard of KDB? One — have you worked on Wall Street? Or did you hear about it because of me? Right, you worked at Goldman Sachs. So KDB is one of the original time-series databases, out of the 1990s. It is used all over Wall Street; it is not used, as far as I can tell, that often outside of Wall Street. So we're very curious to see what it is they're doing that makes the Wall Street guys really like this particular system, as opposed to using InfluxDB or all these other ones that are out there. They've been around since the '90s.
A lot of the other time-series databases where we've had people come give talks are maybe only five years old, so this one is very interesting. All right, any questions? All of these links and the information about them are on the database group website. Okay, yes, question. His question is: for the benchmarking framework for the extra credit, can you pick a NoSQL database? First of all, they have to support JDBC — that's the first thing. If you don't have JDBC, then things are not going to work. In terms of whether they support transactions or not, I'll qualify this: if you're doing analytics, then you don't need transactions at all, right? So that's okay. So a NoSQL system that can do analytics, like Spark SQL for example, is fine. His question is: is it possible to do this on RocksDB? I'm going to take that one, since I know the answer. Does RocksDB support JDBC? No, it doesn't. RocksDB is an embedded database, right? You run it inside of another database system. You can actually use MyRocks, which is MySQL using RocksDB instead of InnoDB; that satisfies those constraints. So yes, you could use MyRocks, but RocksDB by itself isn't going to work. MyRocks is Facebook's modified MySQL that supports the RocksDB engine: InnoDB is the default engine, and RocksDB is the replacement for it — it uses an LSM tree. In the same way, in Mongo you can take out WiredTiger and put in RocksDB; you can use different storage engines. Any other questions about the extra credit? So, up until now in the class, we've assumed that all the logic for an application will be written in the application itself, and the application just treats the database management system as a black box somewhere else in our system that we can only communicate with using SQL queries. Now, SQL can be very expressive, and you can do very complex things, as we saw with CTEs. But the business logic you would normally have when you write certain programs — if statements, for loops, and other things you may want to do — typically runs on the application side of things. And so the way to think about how the application interacts with the database management system is through what's called a conversational API: the application asks the system to read and write data, then stops and waits until it gets a response; then it takes the results and does some processing based on the application logic running inside of it; then it maybe goes back to the database server and gets more information. The protocols you would use for this sort of conversational API would be something like JDBC or ODBC, or the various wire protocols that the different database management systems support. So here's a visual way to represent this. Say that here in this gray box we have our application server, and over there we have our database server. Say you want to execute some high-level function in our application — like, I want to add something to my shopping cart, or I want to update the email address for my account. What happens is the application initiates this routine that maybe goes out and executes a SQL statement, and that runs on the database server.
And then the application server blocks until it gets back the response. Now we go through all the steps that we've been talking about in this class: we have to parse the SQL, run it through the planner, the rewriter, the optimizer, and then actually execute it. The application server is blocked while all this is going on. Then it gets back a result, and now the database server waits for the next request from this application's connection before it starts processing the next query. As we talked about before, we can have parallel systems, so we can run other queries for other application connections — but for our particular connection, it's blocked. Then we have some program logic; this could be if clauses or for loops looking at the data we got back from the database server. Then we come along, execute another SQL statement, and do the same thing. Back and forth — the database server waits until, again, it gets the next query. At some point we'll get down here and finally commit: whatever it is we were trying to do, the changes that we made, we want to now safely store them in our database system. I'm avoiding the term transaction here, because we'll cover that on Wednesday, but this is essentially the way to think about it: this is some transaction where we want all these operations to happen atomically, and at the end we tell the database server to go ahead and commit, or save, our changes. So this is the standard approach, and it's pretty much how almost every single application is written — anytime you download any open-source PHP application, it's going to be using this sort of conversational API. What we're going to talk about today, though, is that it may be possible to move some of that application logic off the application server and embed it inside the database management system itself. Now we can avoid having to go back and forth between the application server and the database server every single time — get some data, process it, get some more data, process it, then maybe update things. Instead, what we want is to maybe just send a single request that says: run my queries, plus run all the program logic I normally have in my application, inside the database server, and just send me back the final result. Think of it like an RPC request. And the reasons we want to do this are sort of obvious. One, we get better efficiency because there are fewer network round trips. In my diagram here, I didn't say where exactly that database server is. It could be a RocksDB-style system that's embedded inside our application process, but typically it's going to be a separate, standalone database system process — like MySQL, Postgres, or Oracle — and now you may be going over a network connection to send these requests. That machine may be in the same rack, hopefully maybe in the same data center, but it could also be in another geographical location. So these network round trips are really expensive, and we get better efficiency by reducing the number of round trips and shortening the time it takes to actually perform whatever high-level operation we wanted to do.
Like, it now takes less time to add something to somebody's shopping cart if all the logic is running in the database server. The other advantage is that we get better reuse of our complex business logic or application logic across multiple applications. For example, if I can embed inside the database server the program logic and the SQL queries to add something to the shopping cart, then my application written in PHP can reuse that code, and my application written in Python, or whatever else, can reuse that same code. Whereas if I don't embed it inside the database server, then every single time I have a new version of that application, I have to go reimplement that same application logic in the different versions of it. This will have some downsides, which we'll talk about as we go along. This sounds amazing — this sounds like exactly what you would want to do. But in practice, a lot of the things I'll talk about today are not used, because they're tricky to write: the code for your application is now separated into different parts, and from a software engineering standpoint that can be bad. But it can actually make a big difference, and we'll see examples as we go along. So there are a bunch of different ways to do this, and we're going to go through each of them. The last one, views, doesn't technically fit in this category, I would say, but I think it's useful to understand how a view works — and in some ways you can actually implement materialized views using some of the techniques we're talking about here, like triggers. So, a quick show of hands: who here has written a user-defined function or stored procedure in SQL? One. OK. Perfect. That's fine. And how was it? Painful? It's painful. Right. OK. All right. So a user-defined function, or UDF — it's never good to define a thing using the same words that appear in the thing you're defining — but a user-defined function is a function written by the application developer that we can embed inside our database management system to extend its functionality beyond the standard built-in functions we normally deal with. Remember, before I talked about all these string functions that are in the SQL standard. But say we want some function that isn't in the standard. With a user-defined function, we can define it, embed it inside our database system, and then invoke it just as if it were a regular built-in function that was part of SQL. So at a high level, what happens is we take in a bunch of input arguments — these would be scalar values: integers, floats, text strings — we perform some computation (I'll be vague about what that is, but I'll show some examples), and then we produce a result. The result can be either more scalar values or actually tables themselves. So the way you define a UDF is in terms of what the return type is going to be and what the computation will be. As we said, you can have scalar functions that return a single value, or you can return rows of a result table. But the real difference is what the computation is going to be, and there are essentially two choices for this.
The easiest way to do a UDF is what's called a SQL function, where you just have a bunch of SQL statements listed in the body of the function that get executed in serial order when you invoke the function, and the function returns whatever the result of the last query is. This example here is from Postgres — and this will be another dirty secret as we go along: when we start looking at PL/SQL or SQL/PSM, this is where the standards start to get really bad and all these different systems do different things. But for the most part, this one should be pretty portable, because it's just a SQL function. So we see here at the top we have our input arguments: we're defining a function, we're going to call it get_foo, and it takes in an integer. And then the return declaration says we're going to return the table foo — I've defined the table foo up there, so we're saying the result we return will be an instance of a table with the same schema as foo. And then in the body of our function, we just have a single SQL statement, where we select star from foo where foo.id equals some input argument. So in the function arguments I say I want an integer, and then $1 means take the first argument and put it there. The other thing to point out is these double dollar signs that delimit the body of the function. So we can do a demo of this. And actually, I figured out how to type faster in Postgres, by logging in from my terminal over here. All right, so we have a simple table that has three rows. The function we want to define is get_foo — let me move this down. This says create or replace the function, and it's the same thing I showed before. So now there are two ways we can use this. We can just call get_foo like this, and we get back the entry we find as a single scalar value — here it's printed as a composite type. But we can also use it in our FROM clause, and now we get back a proper table, because the function returns tuples that follow the schema of the table foo. When we do this, we can do all the things we can normally do on a regular table, because Postgres — the database server — is going to materialize this as an intermediate result, as a table, and then we can do whatever else we want on top of that. So we can say where value equals 'abc', and that works; and value doesn't equal 'abc', and that produces nothing. When you look at the actual query plan, you see that it does a function scan on get_foo: it's going to invoke get_foo, and then it applies the filter that we wanted. So that is a SQL function. The alternative is to use an external programming language, and as I said, this is where the standard starts to vary wildly. The basic way to think about it: a SQL function is just a list of SQL statements that execute one after another, and whatever the last SQL statement spits out is the return value. But if you want to do more complex things — conditionals, for loops, and so forth — then you write the function in one of these other languages. In the SQL standard, they define a language called SQL/PSM. No one's probably ever heard of this.
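To make that concrete, here's a minimal sketch of what the demo roughly looks like, reconstructed from the description above. The column layout of foo is an assumption; the function name and the $1 argument style are from the lecture.

    -- Assumed demo table (it has three rows in the lecture).
    CREATE TABLE foo (id INT PRIMARY KEY, value TEXT);

    -- A SQL function: the body is a list of SQL statements executed
    -- in order, and the result of the last one is what gets returned.
    CREATE OR REPLACE FUNCTION get_foo(int)
    RETURNS SETOF foo AS $$
        SELECT * FROM foo WHERE foo.id = $1;
    $$ LANGUAGE SQL;

    -- Invoke it as a scalar expression (prints as a composite value)...
    SELECT get_foo(1);
    -- ...or in the FROM clause, where it materializes as a table:
    SELECT * FROM get_foo(1) WHERE value = 'abc';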
And to the point he was making about how painful it was: SQL/PSM was designed by somebody who really liked the language Ada. Does anyone know the language Ada? Nobody? All right — it's from the 1980s; it sort of looks like Pascal. So no one follows this standard, and everyone has their own thing. Oracle and DB2 follow something called PL/SQL, which sort of looks like SQL/PSM, but not exactly. Postgres has its own dialect of this called PL/pgSQL. SQL Server was originally derived from Sybase in the 1990s, and Sybase had this thing called Transact-SQL, so SQL Server's UDF language looks like that. Some systems also support the ability to write your UDFs in more commonly used programming languages, like C, C++, C#, or Java. What happens is the code gets compiled and linked into the server, and then you can invoke it through SQL statements just like I showed before. Can anyone take a guess why using these other programming languages might be a good idea or a bad idea? Say you write a UDF in C, you write a SQL statement that invokes that function, and then the database server invokes your C function. What's the problem with this? It could kill everything, right? Some systems like Postgres will let you do this anyway, and other systems like Oracle will put your UDF in a sandbox: they'll fork a process and invoke your function there, because if you trash the memory or the address space of that process in your C function, it doesn't affect the database server — the function crashes, and who cares? I'll show you what PL/pgSQL looks like on the next slide. These higher-level languages are sort of declarative, and the system can check that they're actually safe and have no side effects on other things in memory. But to get the best performance, you probably want to use C or C++ — with the danger that you might take down the whole thing because you read memory you shouldn't have. And of course there's the cost of having to fork a process and send things over a pipe to use a sandboxed function. So here's what PL/pgSQL looks like. Here's the same function I showed before, get_foo, but now instead of a SQL function, it's PL/pgSQL. You see that we have our BEGIN, just as before, and then we have this RETURN QUERY keyword that just returns the result of whatever query follows as our output. I'll show that running in a second. And as I said, now we can do more complicated things. Here's a PL/pgSQL function to compute the sum of all the IDs — and now you can start to see why I was complaining about how ugly PL/pgSQL is. Any time we have a variable, we have to declare it ahead of time, before the function body starts; we can't declare variables anywhere we want. Then we have a for loop that iterates over every single tuple in foo where the ID is greater than some input argument, adds the IDs together into our output variable, and then returns the output. So let's do a demo of this again. First, this is the PL/pgSQL version of get_foo — oh, nothing changed; it just returned the existing function. So: drop function get_foo. It's important to note that you always have to pass the input argument types when you do this.
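For reference, here's a sketch of what these two PL/pgSQL functions plausibly look like, reconstructed from the slides as described; the exact variable names are assumptions.

    -- PL/pgSQL version of get_foo. Note that DROP FUNCTION needs the
    -- argument types, as mentioned above.
    DROP FUNCTION IF EXISTS get_foo(int);
    CREATE FUNCTION get_foo(int)
    RETURNS SETOF foo AS $$
    BEGIN
        RETURN QUERY SELECT * FROM foo WHERE foo.id = $1;
    END;
    $$ LANGUAGE plpgsql;

    -- sum_foo: sums the ids of all matching tuples. All variables
    -- must be declared up front in the DECLARE block.
    CREATE OR REPLACE FUNCTION sum_foo(int)
    RETURNS int AS $$
    DECLARE
        rec    record;    -- holds each row as we iterate
        output int := 0;  -- running total
    BEGIN
        FOR rec IN SELECT * FROM foo WHERE id > $1 LOOP
            output := output + rec.id;
        END LOOP;
        RETURN output;
    END;
    $$ LANGUAGE plpgsql;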
So now we can run that same function that we had before, and it works exactly the same way: I can do that and get all my output, and I can also invoke it directly and get the result back as a composite tuple, right? So that's not interesting. Now we'll add the sum function. We declare our variables up here: we have to declare a variable for the record that we're going to get as we iterate over all the elements in our select statement, and then we declare our output element to be an integer so that we can do the summation, right? And now we can do select star from sum_foo, pass in some input argument — one — and it produces the output. Yes? Is this compiled or interpreted? The question is: is this compiled or interpreted? In Postgres, this is actually interpreted. That's a very good question. The way Postgres implements this is that for every single line in our function, it converts it to a select statement. So if you have, say, this one here — output plus rec.id — it will run that as something like "SELECT output + rec.id". For each line, it invokes that in the database system. And if you think about it, this makes sense, because they already have this powerful runtime engine to execute all these queries, and they're just building on top of it. So in Postgres, they interpret this by converting everything to select statements. The commercial systems will actually compile this into a runtime binary, and that runs much, much faster. In our own system, we have a student working right now on taking PL/pgSQL and compiling it into LLVM IR, so we can compile and run it much faster. MySQL doesn't really support UDFs the way Postgres does — I don't know exactly what they do — and I don't know if SQLite does either; they have sort of a subset of this for triggers, which we'll see later on. Any questions about this? All the things you'd expect to want in a programming language — for loops, while loops, conditionals, and other things — you can do inside of this. It's just this wonky syntax, because it's based on Ada. OK, so something very related to a UDF is called a stored procedure. This gets a bit nuanced, because in Postgres, for example, they don't distinguish between UDFs and stored procedures, but in other database systems they do. The way I would define a stored procedure is that it's a self-contained function that performs some complex logic inside the database management system. What I mean by self-contained is that the UDFs I showed before have to be used in a SQL statement, whereas a stored procedure you can just invoke like an RPC call — it doesn't have to be inside a SQL statement at all; it runs independently. And that's sort of confusing in some ways, because a UDF can have SQL statements inside of it, plus conditionals and all the other things you'd expect, and a stored procedure is essentially the same thing. It's just that the way you invoke it is through an EXEC command instead of running it inside a SQL statement. Is that sort of clear?
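As a rough illustration of the difference (this is not from the lecture demo): in systems that make the distinction, a stored procedure is defined on its own and invoked directly. Using Postgres 11+ syntax and hypothetical cart and items tables, it might look like the sketch below; in SQL Server the invocation would be EXEC instead of CALL.

    CREATE PROCEDURE add_to_cart(c_id INT, i_id INT)
    LANGUAGE plpgsql AS $$
    BEGIN
        -- Multiple statements with side effects, run as one unit
        -- inside the server.
        INSERT INTO cart (customer_id, item_id) VALUES (c_id, i_id);
        UPDATE items SET stock = stock - 1 WHERE id = i_id;
    END;
    $$;

    -- Invoked on its own, not piggybacked on a SELECT:
    CALL add_to_cart(1234, 9876);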
The other definition people talk about is that a UDF is typically meant to be read-only: you're not supposed to update the database or have side effects when you invoke the function, whereas in a stored procedure that's sort of expected. And this will come up a lot. I'll talk a little bit about it when we cover transactions, and we'll talk about it more in the spring, because one of the ways you can get really good performance in a transactional, OLTP database system is to convert all your transactions to stored procedures. The reason you'd want to do this is that if you start making changes inside of a transaction and modifying the database, you may end up taking locks on tuples. So if there's a long round trip between the application server and the database server, you're holding those locks while you wait to go back and forth, and that blocks other transactions that may be trying to run at the same time. Whereas if everything's a stored procedure, you just throw the EXEC command to the server, and it executes really fast without going back to the application server and asking for more input. To visualize this, go back to the same example I had before, but now we take all that logic and embed it entirely inside the database server, and on the application side we just make a call to that procedure and pass in some input arguments. This is essentially the same thing the UDF was doing, except that in my stored procedure I can update the database, and I can invoke it without having to piggyback it on a select statement. So now, when I want to run this transaction — which I'll define on Wednesday — it's just one round trip: invoke my procedure, here are the input arguments, and it blocks until I get back the result. And that can make a big difference in performance. So again, the difference is that the UDF is meant to be a sort of read-only computation that you do within a query, whereas the stored procedure is meant to be a single atomic unit of execution. And again, I'll define what a transaction means on Wednesday, and hopefully this will make more sense. All right, any questions about UDFs? Any questions about stored procedures? OK. UDFs are actually very common, and things can get pretty nasty, because a UDF can call another UDF — I think Postgres limits you to six or seven levels deep before it kills you — and you can also have UDFs that dynamically create SQL statements that then call other UDFs. In very complex applications — Salesforce, for example — UDFs get used all over the place, because what they want to do in their application is not easily done in SQL, so they add UDFs to extend the functionality. Yes, question? The question is: can you do recursion in a UDF? Yes, you can. You want to see it? You can do Fibonacci sequences and everything. There's actually a tutorial online that shows you how to do recursion in PL/pgSQL — and it shows how slow recursion is, because of all that interpretation overhead. So another kind of construct we can have in our database system to push logic to the server is called a trigger. The way to think about a trigger is that it's a way to tell the database system to invoke some UDF when some event occurs inside the database server.
Say if I update a tuple, I then want to update another table to record that this thing got modified, or I want to check whether it actually has the correct value or not. So when you define a trigger, you define three basic things: what type of event will cause it to fire, the scope of the event, and when the trigger fires relative to that event. To understand this better, I'll go through each of these three categories and show the different things you can set. So the first thing is the event type: a modification in the database system based on any of these statements. Insert, update, delete, truncate — which is just deleting without a where clause; it wipes the entire table — or when you create a table, alter the table, or drop the table. Basically, you define a trigger to say: when I see this type of statement on my table, do something. Related to that is the scope: you can say, when I do an insert into a table, when I do an insert into the database, when I do an insert into a view. And in a system like Oracle, for example, you can also define system triggers that fire some UDF when administrative things happen — when a user gets added, when a user gets dropped, when somebody's password changes, you can have it do something. And then the tricky part is the timing of the trigger. The easiest way to think about this is that you can have the trigger fire once for the entire statement that caused it, or once for each individual row affected by that statement. So the first two options are before or after the statement executes — the statement being the event type: I can fire a trigger after I insert into a table, or before I insert into the table. The next two are to do it for every single row or tuple affected by the statement that caused the trigger: the system iterates over them and invokes the function for each of them individually. So with "before each row", if I update 10 tuples, the trigger invokes my UDF for each one of those 10 tuples, and it passes in some context information to say: here's the tuple that was actually modified — and then you can do whatever you want with it. And the last option is to have the trigger be a replacement for the statement itself. Say I want to do an insert into a table: I can have my trigger fire so that when I try to insert into this table, it invokes my UDF instead of the actual insert — it supplants it. Again, the easiest way to understand this is to walk through an example. For this, we have the same table foo that I've been showing, but now we add a new table called foo_audit. What we want is that any time somebody modifies the foo table, we put an entry into the foo_audit table recording the old value of the entry that got modified and the timestamp when it got modified. So any time there's an update on foo, we invoke a trigger that then inserts an entry into the audit table. And here's where things get ugly: now I have my giant PL/pgSQL trigger for this. The first thing you have to do is define a function that will get invoked for that trigger, as sketched below.
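Reconstructed from the description, the audit trigger roughly looks like this. The function and table names follow the lecture; the exact column layout of foo_audit is an assumption.

    -- Audit table: old value plus the timestamp of the change.
    CREATE TABLE foo_audit (
        id         INT,
        old_value  TEXT,
        changed_at TIMESTAMP DEFAULT NOW()
    );

    CREATE OR REPLACE FUNCTION log_foo_updates()
    RETURNS trigger AS $$
    BEGIN
        -- OLD and NEW are the tuple versions before/after the change.
        IF OLD.value IS DISTINCT FROM NEW.value THEN
            INSERT INTO foo_audit (id, old_value)
                 VALUES (OLD.id, OLD.value);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    -- Fire before each row of any UPDATE on foo:
    CREATE TRIGGER foo_updates
        BEFORE UPDATE ON foo
        FOR EACH ROW EXECUTE PROCEDURE log_foo_updates();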
And then you see in the return type here, I'm saying that this is meant to be used as a trigger. Inside of it, I have these NEW and OLD context handles — virtual versions of the tuple that I can interact with in my trigger: OLD is the version before the change, NEW is the version after, and they'll be set accordingly. And inside of that, I insert into my audit table and then return the new tuple. So this is a simple function that just checks whether the old value differs from the new value, and if so, inserts a new entry into my audit table. Then I define my trigger, and we say: before we update on foo, for each row, execute my log_foo_updates function. And inside of that, NEW and OLD will correspond to the row that got modified. So of course, let's go do a demo. One sec, sorry, I've got to log back in. OK. First thing we'll do is create our audit table. And in our foo table, again, we have three entries. Now we'll create our function — same thing I showed you on the slides — and then we'll create our trigger. Again, it says: before you update the foo table, invoke our function and pass in the new tuple. So now if I look at my audit table, I see I have nothing. Now I'll do an update on foo — update foo, setting the value to the value with "andy" appended, where ID equals 1. I look in my foo table and I see my entry: I modified it. But now I look at my audit table, and I see that the trigger fired off and did an insert, recording the original value from before I made my change. So you can use triggers to do all sorts of things. You can have them do validation to make sure that the value you're trying to insert or update is actually correct, and return an error, which then kills the query that caused the trigger. So you can use triggers to do more complex constraint checking than you normally could with the CHECK clauses when you create a table, or with global assertions. You can have them update other tables, update other tuples, and do pretty much anything. Question: are there any guarantees for how these triggers will be ordered? This gets into transactions, so I won't go too deep into it. The database management system guarantees that when you do your modification, the trigger that gets invoked will be passed the correct old and new versions. Even if someone else comes along and tries to update the same tuple that you did, their trigger will see the proper versions. And in our case here, where we're inserting new entries, the order those things get inserted in doesn't matter. There's actually no requirement that if this thread updates the table, and then a second thread updates the table after that, the first thread's trigger has to fire before the second's — the system can interleave them in any way, and it's still considered correct. What you could do, say, is have our trigger function make sure that there's no other entry in our foo_audit table for this particular foo ID.
If that check gets violated, then we throw an error, and it rolls back the change that was made to the original table — it recursively goes back and says, you can't do that. How it actually does that depends on the concurrency control, which we'll talk about on Wednesday. So, any questions about triggers? Actually, I'll show you other cool things you can do with triggers in the next demo. Again, you can do this for insert, update, delete — you can't do this for selects, although sometimes triggers on selects would actually be useful. And the way to think about this is: consider how you would have to implement this in your application if you didn't have triggers. If you had to do this for real without a trigger, you'd have to write your code to have a thread just keep polling the table over and over again to check whether there was any change, and then invoke the trigger logic manually. Or you'd have to write your code so that any time the application server tries to execute an insert, update, or delete, there's a little middleware in front of it to trap that and fire off the trigger for you manually. But now we don't need to do this — all that logic can be put inside the database server. Essentially, the database server knows that for this table there are these triggers defined, so any time a thread modifies the table, it knows to fire them. It doesn't need to check over and over again — should I do something? should I do something? — it knows to do it at the moment you make a change. There's a whole bunch of other tricks you can do, like making triggers deferrable: maybe you don't have them invoked immediately after you make a change, and you have the system schedule them later on. That kind of stuff depends a lot on the implementation of the database server. So now let's look at something else we can do with triggers. Related to the example I gave before, where I want to get notified when a change is made to my database: triggers only solve that problem halfway, because a trigger that gets invoked on a modification can only affect other things inside the database. If I want a notification that my table was modified, I still have to write some thread that goes and polls the table to see whether my audit log has been updated. Now, you could write a UDF in C, and in that case you can call out to whatever REST server or whatever you want on your own. But as I said, you may not want to do that, because it has to run in a sandbox to make sure everything's safe. So to get notifications out of the database server saying that there's been a change to your database, you can use what are called the LISTEN and NOTIFY commands in SQL. A database change notification is like a trigger, except that instead of the trigger function only updating something inside the database itself, we can have it notify some external thing that may be listening on a channel to say that a change has occurred. If you're familiar with Kafka or these pub/sub systems, it's essentially the same thing: you could have the database system notify Kafka to say, here's the change that was made to my database, and pass along whatever data you want describing the change. The way you do this is through LISTEN and NOTIFY. I know some of the major commercial vendors support this, and Postgres definitely supports this.
I don't think MySQL supports this — it's a bit esoteric. The way this works is that we implement a UDF that gets invoked as a trigger, which can then call a special function inside Postgres and say: here's the notification I want you to send out on this channel; and then we can have some other thread poll or check for that notification using LISTEN. So this is my UDF that does this. The only thing different from what I showed before is this middle part, where we actually do the notification. First, I declare a notification variable as a JSON object inside Postgres — Postgres supports native JSON data types. Then there's a special function in Postgres called row_to_json: I take my NEW row, the one that just got modified, and convert it to a JSON document. Then I use this PERFORM keyword to invoke the special Postgres notification function called pg_notify. I pass along the channel I want to send on — in this case I'm calling it foo_update — and the data I want to send, and Postgres knows how to route that to anybody subscribed on that channel. So let's do a demo of this. You'll see here I have two terminals; I'll move the other one up. In Postgres you can say, I want to listen on some channel, some event — I'll call this one AndyEvents. And then up here I can do a NOTIFY on my channel AndyEvents, and I pass along the data I want to send. Why didn't the bottom one get an update? Because this is asynchronous, right? The way the Postgres terminal works is that it's not constantly going back and asking, is there new data on my connection? It's only when we go to the server to actually execute something that we get the notification. So if I do select 1, it actually runs the query, gets back one — and then here you see it tells me that on the AndyEvents channel I got the payload "hello world". Pretty basic, pretty simple. The other thing I'll point out — in the case of Postgres at least, I don't know about the other systems — is that the payload, the thing you send over the channel, always has to be a TEXT or VARCHAR. So now we'll add my notification function — same thing I showed before — and then we add it as a trigger. Here you see the channel is called foo_update. So down here, I'll listen on foo_update. Now I create my trigger, and I'll go do: update foo, setting the value to the value with something appended, where ID equals 2. Awesome. There you go. So now if I go back down here and do select 1 — in theory... the whole connection died. Awesome. Well, that killed that demo. Sorry. Wait, how can it be dead? Oh, Windows. That's all right. All right, there you go: you see I got my payload, and you see it took the tuple that I updated, packed it up into a JSON object, and sent it over as the payload. So you can do really complex things with this.
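Putting the pieces of that demo together, the notification function sketched from the description looks roughly like this; the trigger and function names are assumptions, while the channel name foo_update is from the lecture.

    CREATE OR REPLACE FUNCTION notify_foo_update()
    RETURNS trigger AS $$
    DECLARE
        notification json;
    BEGIN
        -- Convert the modified row into a JSON document...
        notification := row_to_json(NEW);
        -- ...and publish it on the 'foo_update' channel. The payload
        -- must be TEXT/VARCHAR, so cast the JSON.
        PERFORM pg_notify('foo_update', notification::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER foo_notify
        AFTER UPDATE ON foo
        FOR EACH ROW EXECUTE PROCEDURE notify_foo_update();

    -- In the listening session:
    LISTEN foo_update;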
You can use this as a building block in a complex data application, where all the updates from the front end of your application go through Postgres, and then maybe you want to stream those updates to a back-end system, like Spark or whatever analytical server you want to use, with Kafka in between. You can have Kafka listen on these channels, get these JSON documents as updates, and pass them along to other parts of the system. So, again, you no longer have to write a separate component in your application that polls the database server to see if there's been any change. OK, any questions about UDFs, stored procedures, triggers, or the pub/sub LISTEN/NOTIFY stuff? As I said, I'm very interested in all of these things; if you're going to take the class in the spring, this might be something you do for your class project. Yes? The question is: what stops someone from being stupid and writing a trigger, a UDF, a stored procedure — something that computes pi to the billionth digit — so that any time you update a tuple, it tries to invoke that and kills the system? This is what people pay DBAs for. I'll talk a little bit about this at the end of the class, but part of the reason things like stored procedures and UDFs — the sort of things we're talking about here — are not widely used, even though the benefits are obvious, is a software engineering one: now you have a bunch of logic in your application and a bunch of logic in the database server, and you need to make sure those things stay in sync. That means that if you're in a really agile environment where you're always putting out new versions of your application, every single time you change your application, if you have to update your stored procedures, you have to go talk to the DBA and be like, look, here's my new code, please run it. And DBAs are very conservative. Their job is to make sure the system is always up and running, because if it's down, that's really bad — you're losing money. So they don't like changes. They will always vet whatever UDF or stored procedure you give them, and you have to justify why you need to update it or change it. So it's a completely manual process: a human has to verify, at least in the case of Postgres, that the code is actually not going to cause problems. I don't know this for sure, but I can imagine that in the commercial systems there are monitoring tools that let you identify, here's the SQL query that's taking all the time, or here's the UDF that's taking all the time. But even then, it's still a manual process where a human has to make a value judgment about whether that UDF has to go or has to be rewritten. [Student comment about select statements.] So your statement is: because it's always a select statement, you could put a cost model on it. Yeah — another big aspect of UDFs that causes problems is that, from the optimizer's standpoint, a UDF is a black box. It doesn't know what exactly the UDF is going to do, so when it does its cost estimations, it sort of throws up its hands and says: all right, I've got to call this function, I don't know what it's going to do, I'm just going to call it. The cost models for UDFs are almost non-existent, because, again, it could be C code — and how do you get a cost on that?
The best you can do is have monitoring tools that say: I invoked this UDF, it took this many microseconds or milliseconds, or it read this much data — and maybe a human uses that to make a value judgment. And actually, this is related to my slide here. UDFs are a way for us to extend the functionality of our database server without having to actually open up the source code and modify it. It makes these things more extensible: you don't have to modify the Postgres code to add a new function — you can build it on top of the API they expose. We have the same issue when it comes to data types. All of the different database management systems support the basic standard types that are in the SQL standard — ints, floats, varchars, timestamps, things like that — and they also natively support all the basic operations you'd want to do on these types. I can take two integers and add them together; Postgres knows how to handle that in the query plan and actually do it. But now the problem is: what if we have data types that are more complex and don't map exactly to the types the system provides? Or maybe it is the type we want, but the arithmetic operations on it don't match up — I'll explain what I mean in a second. One way to solve this problem is to take whatever your object is in your application code, serialize it out to whatever format is used by your programming language or application environment, and just store that as a blob — a binary object — inside the database server. You can use Java serialization; Python has pickle. If you're familiar with protobufs from Google or Thrift from Facebook, it's the same kind of thing: you take a struct of these objects and write them out as byte strings. You could also just store it as JSON or XML documents. But these have problems. Why would it be a bad idea to just use Python pickle or Java serialization to store some complex data type in your database server? Exactly right. Any time you want to operate on a portion of the thing you stored — maybe read one field or modify one field — you have to take the whole thing out and deserialize it in application code, because the database server doesn't know how to do this; then do whatever you want to do, and then store it back into the server. And related to the statement I made before about UDFs being black boxes: the optimizer doesn't know anything about what these data types are storing, because it just sees a bunch of byte strings. It can't estimate the selectivity of predicates; it can't really do any query optimization. It's going to say: the thing I'm looking for is somewhere in here, I'll just do a sequential scan to find it, because I don't know what it is you actually want. And related to that, I can't do the basic computations we talked about before that you can do with the built-in types: I can't compute aggregations on these things, I can't invoke regular string manipulation functions, because, again, the database server just sees a binary blob. So, just as we could extend the functions in our database server, we can extend the data types as well. These are called user-defined types, or UDTs.
In the SQL standard, they're called structured user-defined types, or structured types. The basic way to think about this is that, again, instead of having to go open up the Postgres code or the MySQL code or the SQLite code and modify it to support our new data type, we can write a data type using the database system's API, put that library into the server, and have the database system use it as if it were one of the built-in native types in the SQL standard. This idea, as far as I know, was first introduced in Postgres in the 1980s, and the story about how it came about is actually really fascinating. One of my advisors when I was in grad school was Mike Stonebraker. Mike wrote a system called Ingres in the 1970s, and Postgres is the system they built after that — it's post-Ingres. Mike won the Turing Award a few years ago for databases. But he has this great story: in the early 1980s, when he was going on sales calls for Ingres, he would go to banks and ask, what kind of data do you want to store? And they'd say, we need to store dates. And Mike was like, great — Ingres supports dates; you can do whatever you want with them. But the issue was that, at the time, the banks were storing dates using the Julian calendar rather than the Gregorian calendar. When they computed interest, the Julian calendar has 30-day months, while the Gregorian calendar has different numbers of days per month. So when they computed interest using the database server with Gregorian dates, they came out with the wrong answers and were losing money. So what they ended up doing in Postgres was, rather than having to support whatever crazy date format or crazy date arithmetic anyone might want, they just exposed this UDT API that allows anybody to store whatever data type they want in the server, and the system automatically knows how to do less-than, greater-than, and all the other operations you want, as if it were a native type. This got added to the SQL standard in 1999, when there was a big push for what are called object-relational databases — think of this as extending relational databases to support complex objects, like JSON but before JSON. And so pretty much every single database system now supports UDTs. Again, this is another great example where there is a SQL standard that says how you define these things, but no one actually follows it. I can't show you a demo of this, because the way you do it in Postgres is by writing C code, compiling it, and linking it into your server. Oracle comes the closest to something usable — it actually looks really good, though I haven't actually used it — because you can write your types in PL/SQL. Think about this: you define your type, you say what its composition is, and then you can say, here's the add function, subtract function, multiplication, division, less-than, greater-than — all the way to taking two of these types and comparing them. You can also take a native type like an integer — say you're storing integers in some funky way — and define conversion methods to convert between the native type and your type. A sketch of what that looks like is below.
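I haven't verified this against the lecture slides, but the Oracle style of type definition looks roughly like the following sketch, using a hypothetical money type; the member function is where you supply the type's arithmetic behavior.

    CREATE TYPE money_t AS OBJECT (
        amount   NUMBER,
        currency VARCHAR2(3),
        -- The member function defines the type's addition:
        MEMBER FUNCTION plus(other money_t) RETURN money_t
    );
    /
    CREATE TYPE BODY money_t AS
        MEMBER FUNCTION plus(other money_t) RETURN money_t IS
        BEGIN
            RETURN money_t(SELF.amount + other.amount, SELF.currency);
        END;
    END;
    /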
In DB2, I think they also support complex type definitions, but one of the things you can do there is essentially make aliases of types: you can say, I want an audio-file type, and it's a BLOB, but the BLOB can't be more than two megs — so you can just reuse the base type that way. And again, for SQL Server and Postgres, you have to define UDTs using these external languages that get linked in. So UDTs are actually very powerful. This is probably one of the hardest things to write in a database server, because you have to support whatever API the server wants, and it's different across all the systems — this is actually even worse than UDFs. PL/pgSQL and PL/SQL and SQL/PSM are close enough that you can convert between them; in the case of user-defined types, they're completely all over the map. So again, this is another thing that I'm extremely interested in building in our own database system, so if you're interested in this kind of stuff, you should take the spring class, and this could be one of your projects. All right, question or no? OK. So to finish up: as I said in the beginning, views don't really fit in the category of embedded database logic that we've been talking about in this class, but I think it's important to understand what's going on, and I don't want to spend a whole class on views, so I'm throwing them in at the tail end of this lecture. A view in a database system is essentially a virtual table defined by a select statement. To you as the application programmer, a view looks like a regular table: it has columns, it has attributes, it has data. But underneath the covers, the way the database system implements it is that any time you do a lookup on that view, it converts it to the underlying select statement, goes and gets the data you want, and passes that along to whatever operation you're doing on the view. You'd do this for a couple of reasons — and this is actually very common; views are way more common than UDFs or UDTs or the other stuff (triggers are probably very common too). This occurs all the time in very large enterprise applications because, one, it lets you hide data from users who shouldn't have access to certain columns. But it can also be used to support backwards compatibility with older versions of an application as you upgrade the database schema: you can create views on the new schema that follow the old schema, and the application server doesn't know it's dealing with a view — it just knows it's dealing with a table. The key thing to point out, though, is that although a view can simplify a complex query, it's not going to make it run faster, because underneath the covers the database server is just going to run the original query. So let's look at a really simple example. Say we have the students table for our university, and we want to create a view that gives us just the students whose login ends with "@cs", and the only things we want to show are the student ID, the name, and the login. Our original table looks like this — we have five columns, and we have Justin Bieber and Kanye West and Tern Loewek and me all mixed in there, even though they're not all CS students. But when I create the view, I end up with only the columns and the data that I want.
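A sketch of that view, assuming a students table with columns along the lines of (sid, name, login, gpa, ...):

    CREATE VIEW cs_students AS
        SELECT sid, name, login
          FROM students
         WHERE login LIKE '%@cs';

    -- Looks and queries like a regular table; the underlying SELECT
    -- runs each time you use it:
    SELECT * FROM cs_students;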
And now, if I do a SELECT on the CS students view, just as I would on a regular table, it will only process those results. Here's another example. Let's say I want to get the average GPA of all the students that are CS students. I can create a view like this, where the aggregation lives inside the view, and the thing that gets produced is the output of that aggregation. Now, this looks a lot like SELECT ... INTO, where you can take the output of a SELECT query and put it into a temporary table or a regular table. But the difference is that a view doesn't materialize any results until you actually invoke a query on the view, whereas with SELECT ... INTO you run the query once and whatever the output is gets stored in that table. And if I go modify the original base table, those updates don't get propagated to the table I did the SELECT ... INTO on, because it only captures the data at the moment I wrote it. So what happens in the case of the view is that the optimizer tries to rewrite things. It says: you're really querying this view, the view is really on this base table, and I know how the view is defined, so I can rewrite your query to run on the base table combined with whatever WHERE clause the view has, and invoke it directly there. So in this example, if I write my query with a WHERE clause that says WHERE login = 'pavlo@cs', the system knows it can use that predicate instead of the view's LIKE predicate and find the data I'm looking for much more quickly. Now, the SQL standard says you're allowed to modify a view only if it has the following properties. It can only have a single base table, because if the view does a join, the system doesn't know how to unwrap that and get back to the original data when you make your modification. And the view cannot have any GROUP BYs, aggregations, DISTINCT clauses, UNIONs, and things like that, because again, the system doesn't know how to take the condensed or collapsed version of the data that the view computed and map it back to the original rows. Yes? The question is, what is the lifecycle of the view? What do you mean? OK, so the question is, if you create the view, does it stick around? Yes. It's like a table, a virtual table. It's always there. The difference is that any time you do a SELECT on the view, that's when it invokes the underlying query. But it sits there forever. So the question is, what if we close the client? Yeah, a view is like calling CREATE TABLE: it gets put in the catalog. It's not tied to the session. It lives forever. I don't know, we could try this in Postgres; I don't know if you can create temporary views. Maybe not. All right, so, yeah, question? Right, so his question is, this seems like a UDF. And actually, we can do an example of this. Again, what I'm showing you is a bunch of different ways to have complex logic executing in your database server, and some of it seems redundant. All right, so let's kill this one. OK, so I have SELECT * FROM foo. Now let's create a view, something like this: SELECT * FROM foo WHERE id = 3. So now if I look at my catalog, that answers his question.
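Roughly, the demo is doing something like this. I couldn't make out the actual view name in the recording, so foo_view here is invented:

```sql
-- Reconstruction of the demo session; the view name is an assumption.
CREATE VIEW foo_view AS
    SELECT * FROM foo WHERE id = 3;

SELECT * FROM foo_view;   -- re-runs the underlying query on foo each time

-- In psql, \d lists the relations in the catalog;
-- foo_view shows up there with type "view".
```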
So right there in my catalog, in my list of relations (I'm highlighting it and you can't see it, sorry), right there is the view I just defined, and it's listed as being a view. So it works just like a regular table: if I do a SELECT * on the view, I get back my tuples. But as we saw before, the SELECT * going through the UDF I had earlier produces the same thing. Now, the difference is that the database server doesn't know anything about my UDF, right? It just knows there's a function, and it has to run it to figure out what the hell it actually produces. If I go back to my view, the server knows it's dealing with a view that has a base table. Because SQL is declarative, it knows how to unravel all that and say, well, this is the real query I want to execute. Correct. So his statement is, I said before that in the case of Postgres, it actually interprets the function, converts it to SELECT statements, and then runs them, right? So your question is probably, why can't the optimizer do that for the UDF? Because the optimizer is not going to run your query. It's not going to run your UDF to figure out what it's actually going to do. That would be really slow. This is a really simple query, but if you have to consider, say, a million different join orderings and other options, it would be really slow to go invoke that function every single time, right? And the other thing, as I showed before, is that you can invoke a UDF with arguments inside a query. You can't do that with a view, because a view is essentially a table, right? So views are just another way to push more complex logic inside the database, but at the end of the day a view is essentially just a SELECT statement underneath. Yes? The question is, does using a view mean you have to take a lock on the entire table? We'll talk about locks next week, but the answer is no, because again, the database server is just going to rewrite this into the same query I would have if I went to the base table directly, right? You can see it here; let me highlight that, sorry. It actually generates exactly the same query plan. It even has the same estimates; it knows it's dealing with one row. So even though the query is on a view, it's still going to be rewritten to be a query on the base table, and it goes through all the normal concurrency control stuff that it would do otherwise. Good question. OK, so just to finish up really quickly, I'll blow through this. The view I showed you before is a virtual table whose query gets run every single time you do any lookup on the view itself. For some queries, this is fine. For more complex things, this may actually be really bad, because if the view is computing something very complex, then every single time you do a lookup on the view, it has to rerun that complex query. If I have a billion rows and I want to compute an aggregate over them, then every single time I do that lookup, it has to look at a billion rows. So an alternative to a view is what's called a materialized view. This is where the database server maintains the result of the view automatically and internally, and as you make changes to the base table, it incrementally updates the materialized view to reflect those changes without having to rerun the entire query.
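Just to show the shape of the syntax, here's a minimal Postgres sketch (the view name and columns are invented). One caveat: recent Postgres versions do accept this syntax, but the refresh is manual and recomputes the full query, which is the naive approach I'll mention in a second, not true incremental maintenance:

```sql
-- Materialized view: the result is stored, so lookups are fast.
CREATE MATERIALIZED VIEW avg_cs_gpa AS
    SELECT AVG(gpa) AS avg_gpa
      FROM student
     WHERE login LIKE '%@cs';

-- In Postgres this re-runs the entire underlying query from scratch;
-- it does not apply a small delta for each base-table change.
REFRESH MATERIALIZED VIEW avg_cs_gpa;
```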
So for example, if I insert new students here and I have a billion students, rather than having to recompute the entire query every single time, it just applies a small delta change. And that way, any time I do my lookup on the view, it's really fast, because I've already materialized it. So materialized views are awesome. It's really tricky to get them to work efficiently, though, because it's not obvious how to incrementally update every kind of query; there was a lot of work done on this topic in the late 1990s. But even now, the open source systems don't really support this incremental maintenance. In Postgres, you can sort of fake it with triggers and UDFs; there's a sketch of what that might look like at the end of this section. I don't think MySQL supports this at all. At the end of the day, the simplest thing to do is, every time you update the base table, just rerun the materialized view's query, but that may be bad. So I'm not going to talk about materialized views in this class, other than to say they exist. How to do them efficiently is something we'll cover in the advanced class in the spring. And I realize I'm punting a lot of stuff to the spring, but the point here is just to make you aware that these things exist. All right, so we're running out of time, so I'll finish up. Basically, moving application logic inside the database server is almost always a good idea, because you get better efficiency and fewer network round trips, and the logic becomes reusable across applications without having to reimplement it over and over again. But as I alluded to throughout the lecture, this stuff is not very portable; different database systems support different dialects of these languages. PL/SQL is awkward to write; if you know Pascal or Ada, maybe it's OK, but it's never fun. And, related to the question from before, DBAs don't like you making modifications over and over again to stored procedures and UDFs and other things, because that can cause the system's performance to become unstable. So you have this disconnect between the development cycle of the application and the development cycle of the database; those two can get out of sync, and that causes problems. Another big thing that I didn't really mention: in a really large application, think Facebook scale, the way they do upgrades is in a rolling fashion. They update the application code in one region first, see how that works, and then expand it out to all their other servers. But now the problem is, if the application relies on some UDF implementation in your database server, and that UDF implementation changes from one version to the next, you need to make sure that whichever version of the application is doing the invoking, it finds the right version of that UDF. And as far as I can see, in Postgres and other systems there's no nice way to do this; you have to put a version number in the name of the function you're invoking, which is awkward. So all these things are awesome and allow you to do things that you couldn't normally do in a database server, but they're not as widely used as people might think. UDFs and triggers are probably the two that are OK; the others, especially UDTs, are pretty rare. All right, any questions?
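Going back to the point about faking incremental maintenance in Postgres with triggers and UDFs, here's a hypothetical sketch for the average-GPA example. All the names are invented, and it only handles inserts; updates and deletes would need their own triggers:

```sql
-- Hypothetical summary table: keep a running count and sum so the average
-- can be derived without rescanning the base table.
CREATE TABLE cs_gpa_summary (n BIGINT, total NUMERIC);

INSERT INTO cs_gpa_summary
SELECT COUNT(*), COALESCE(SUM(gpa), 0)
  FROM student
 WHERE login LIKE '%@cs';

-- Trigger function: apply a small delta for each new CS student.
CREATE FUNCTION cs_gpa_maintain() RETURNS trigger AS $$
BEGIN
    IF NEW.login LIKE '%@cs' THEN
        UPDATE cs_gpa_summary
           SET n = n + 1,
               total = total + COALESCE(NEW.gpa, 0);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires Postgres 11+; older versions use EXECUTE PROCEDURE.
CREATE TRIGGER cs_gpa_on_insert
AFTER INSERT ON student
FOR EACH ROW EXECUTE FUNCTION cs_gpa_maintain();

-- The "materialized" average is then a constant-time lookup:
SELECT total / NULLIF(n, 0) AS avg_gpa FROM cs_gpa_summary;
```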
All right, so as I said multiple times during this lecture, on Wednesday we're starting what I consider to be my favorite topic: transactions. It's sort of complex to think about, because it involves a different definition of correctness than you may be familiar with from standard programming languages. So we'll start off talking about the theory behind transactions and concurrency control on Wednesday, and then we'll spend the next three lectures after that talking about different ways to actually implement these things. OK? Any questions? All right, guys, again, I'm holding office hours after this. If you want to see your exam, you can come to that. Thank you.