All right, we are starting the last day of lecture. All right, yeah, thanks. So, yeah, today is the last day. We'll still start with a little bit of administrative stuff. First, the last homework, homework five, is due tomorrow. And the final project, project four, is due on Sunday, the same as the other projects. We'll still be holding the additional office hour on Saturday. Again, please try to do as much as you can in these few days, but if you have some last few issues that you cannot resolve, you can check out the office hour on Saturday. Even though we allow you to use grace days into finals week, you'll typically want to focus on your exams that week.

Then again, the final exam will be on Friday, December 10th, at 8:30 a.m. And, importantly, it's in a different place: Doherty Hall 2210. So don't habitually come to this room in the morning and then figure out it's not the correct place. We'll be using a Scantron-style exam paper, similar to before, so please bring an eraser and a pencil so that if you want to change an answer you can easily do that. Otherwise, if you mark the exam paper all over the place, it's not easy for us to grade, and to grade correctly. So please make sure you bring those.

A little bit more on the final exam. As we clarified on Piazza, the exam will focus on the material after the midterm; the questions will evaluate your understanding of the topics after the midterm. That said, in order to understand what a question is asking, you may need an understanding of earlier lecture material as well. For example, if we ask you a question about logging and recovery, you need to know what a buffer pool is, what it does, and what it means to flush a dirty page to disk. Those are all concepts discussed earlier in the semester, as well as in our first project. We will not test how to flush a buffer page on the final exam, but you need to understand that to be able to answer the questions related to logging and recovery, as an example. Again, it will be open book, open notes, and open calculator. Essentially, all paper-based material is allowed, but for electronic devices, only calculators are allowed. And lastly, we will post an announcement on Piazza with a practice exam, so check that out, okay?

A little more administrative stuff: you probably already received an email about the course evaluation. It covers both me and Andrew as the instructors, as well as the TAs. We would appreciate any feedback, and please give it honestly; don't worry about offending us or anything. The feedback could cover the homeworks, the projects, the course materials, as well as the lectures. Any feedback would be appreciated, all right? So, any questions about the administrative stuff before we do our final review? Okay, cool. If there are no questions, we'll just do a fairly quick final review today.
So today's lecture will be fairly short; we can probably finish in like 20 minutes. It's just a quick review of what we covered in the second half of the semester and what the final exam will focus on, essentially. First of all, as I mentioned, the exam will not focus on the earlier material, and we will not directly test your understanding of it, but you still need to understand those topics to be able to understand what the questions are asking, especially for the topics related to logging. In the earlier part of the semester we covered basic SQL queries, the buffer pool, hash tables, B+ trees, and the different storage models, row format and column format, as well as how to execute queries in parallel. For example, just for hash tables, we are not going to ask you what the extendible hashing or linear hashing algorithm is or how to implement it. But you need to understand the purpose of a hash table, the functionality it provides (it can be used in joins and as an index), and its role in the context of a larger database system. So you still need to remember those high-level concepts.

Okay, now for the detailed material we covered in the second half of the semester, after the midterm. First, we spent two lectures on query optimization: how to come up with an efficient, or at least semi-optimal, query plan that can execute a query in a potentially efficient way. We talked about a few elements of query optimization. To start, there are heuristic rules: you can rewrite or tweak the plan structure using them. For example, we talked about predicate pushdown. Essentially, you can split the predicates in the WHERE clause of a query and push each predicate as close as possible to where you scan the table, that is, down to the lower levels of the query plan tree, to filter out as many tuples as early as possible. Second, projection pushdown is a similar concept: if the query has a projection, meaning there are columns the query does not need in the result, you can apply the projection as early, and as low in the plan, as possible, to reduce the number of columns you carry through the plan. Third, we also talked about different techniques to optimize nested subqueries: you can either rewrite the nested query into a join, or you can decompose the nested query, essentially lifting the subquery out into a separate query and writing its result to a temporary table, so that you don't have to re-execute that particular subquery over and over again. These are a few basic heuristic rules you can apply when optimizing a query.
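To make predicate pushdown concrete, here is a minimal sketch in Python. Everything here is made up for illustration (the plan-node classes, the crude `"A.x > 10"`-style predicate strings, and the `push_down` helper); it is not how a real optimizer represents plans:

```python
# Minimal, illustrative query-plan nodes (not from any real system).
from dataclasses import dataclass

@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object
    condition: str

@dataclass
class Filter:
    child: object
    predicate: str                  # e.g. "A.x > 10"

def tables_in(node):
    """Return the set of table names a subtree produces."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Join):
        return tables_in(node.left) | tables_in(node.right)
    return tables_in(node.child)    # Filter

def push_down(node):
    """Push a Filter below a Join when it references only one side."""
    if isinstance(node, Filter) and isinstance(node.child, Join):
        join = node.child
        table = node.predicate.split(".")[0]   # crude: "A.x > 10" -> "A"
        if table in tables_in(join.left):
            return Join(push_down(Filter(join.left, node.predicate)),
                        push_down(join.right), join.condition)
        if table in tables_in(join.right):
            return Join(push_down(join.left),
                        push_down(Filter(join.right, node.predicate)),
                        join.condition)
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right), node.condition)
    if isinstance(node, Filter):
        return Filter(push_down(node.child), node.predicate)
    return node

# SELECT * FROM A JOIN B ON A.id = B.id WHERE A.x > 10
plan = Filter(Join(Scan("A"), Scan("B"), "A.id = B.id"), "A.x > 10")
print(push_down(plan))   # the filter now sits directly above Scan("A")
```

The same idea extends to projection pushdown: prune columns at the lowest node that still produces everything the operators above it need.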
Then, to find an even more efficient query plan, we talked about the different statistics that help us search for a better plan: how to use simple methods, along with basic mathematics and statistics, to estimate cardinalities, whether we have one predicate or multiple predicates, whether it's a conjunction or a disjunction, et cetera, and how to handle those cases in cardinality estimation. For example, under the usual independence assumption, the selectivity of a conjunction is the product sel(p1) * sel(p2), and the selectivity of a disjunction is sel(p1) + sel(p2) - sel(p1) * sel(p2). We also focused a bit on histograms, a very common approach in databases that helps us estimate cardinalities when optimizing queries. And lastly, we talked about the cost-based search framework, essentially dynamic programming: using a memoized search process to figure out the best order in which to join the different tables and arrange the query plan structure. Those are the concepts we covered in query optimization, all right?

Next, transactions. We actually spent quite some time on transactions in this course, four lectures, so it is a very important set of concepts in this class. First, we talked about the very basic property of ACID, arguably the fundamental guarantee a database system provides to programmers to make their lives easier: they can develop their applications and focus on their business logic. ACID stands for atomicity, consistency, isolation, and durability. You have to remember what they are, what they are trying to do, and how they help programmers access and manage data. Beyond that, starting from a theoretical level, we talked about the different serializability concepts. These include conflict serializability, which is the more commonly used notion, as well as view serializability. In practice, view serializability is rarely used, but you still need to understand the concept, how to check those forms of serializability, and, in general, how to ensure them. In the following lecture, we also talked about the concept of recoverable schedules; that's also an important class of schedules that a system looks at when trying to guarantee transaction correctness. And we talked about the different isolation levels and the different anomaly patterns. With different isolation levels, the system can achieve different efficiency in guaranteeing the consistency and isolation properties, but at the same time, different isolation levels also allow different anomalies. So it depends on what the application requires, compared to what efficiency the system can provide in implementing its concurrency control protocol.

Next, we spent one lecture on the pessimistic concurrency control method, essentially based on locking. We focused on the algorithm called two-phase locking, the most commonly used pessimistic concurrency control protocol, and we talked about the distinction between rigorous two-phase locking, in other words strong strict two-phase locking, versus the original, non-rigorous, basic version of two-phase locking: the differences between them and the different properties they provide.
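As a tiny illustration of the two phases, here is a hypothetical Python sketch (the `TwoPhaseLocking` class is an invented toy, not a real lock manager; real systems also track lock modes, wait queues, and deadlocks):

```python
class TwoPhaseLocking:
    """Toy 2PL bookkeeping for one transaction (illustration only).
    Basic 2PL: once any lock is released, no new lock may be acquired.
    Rigorous (strong strict) 2PL: release nothing until commit/abort."""
    def __init__(self, rigorous=True):
        self.rigorous = rigorous
        self.held = set()
        self.shrinking = False      # have we released anything yet?

    def lock(self, obj):
        if self.shrinking:
            raise RuntimeError("2PL violation: lock after first unlock")
        self.held.add(obj)

    def unlock(self, obj):
        if self.rigorous:
            raise RuntimeError("rigorous 2PL: only release at commit")
        self.held.discard(obj)
        self.shrinking = True       # the growing phase is over

    def commit(self):
        self.held.clear()           # release everything at the very end

txn = TwoPhaseLocking(rigorous=False)
txn.lock("A"); txn.unlock("A")      # enters the shrinking phase
try:
    txn.lock("B")                   # basic 2PL forbids this
except RuntimeError as e:
    print(e)
```

The difference shows up at unlock time: basic 2PL lets a transaction release locks early (and then acquire nothing more), while rigorous 2PL holds everything until commit, which avoids cascading aborts.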
And second, we also talked about different methods to deal with deadlocks. Essentially, there's deadlock detection and deadlock prevention, and within prevention there are different design decisions you can make, either wound-wait or wait-die, with different trade-offs. These are the things we discussed there. And lastly, we talked about the concept of multiple granularity locking, essentially intention locks. If a query is trying to scan the entire table, it would be very costly and rather a waste of effort to lock every single tuple in the table; the table could have a billion rows. So in this case, you can take a lock at a higher level of the database hierarchy. For example, you can take a read lock on the entire table, and then you can read everything in the table with only one lock. But of course, with higher-level locking you restrict the parallelism and concurrency allowed in the system, so there's a potential efficiency cost, and you have to balance those two scenarios. All right?

Next, we spent one or two lectures on the more optimistic approaches to implementing a concurrency control protocol. We first talked about the basic timestamp ordering concurrency control mechanism. We also talked about an optimization called the Thomas Write Rule, which allows more schedules of transactions and thus potentially more parallelism: when an older transaction tries to write an object that a newer transaction has already written, you can simply skip the older write instead of aborting, because under timestamp ordering that write would never have been read anyway. Then we spent a lecture on the protocol called optimistic concurrency control (OCC). It's one specific protocol, but it also belongs to the general category of optimistic concurrency control, so the naming is a little confusing, and it helps to be able to distinguish those concepts. In this specific optimistic concurrency control protocol, this specific implementation, there are three phases: a read phase, a validation phase, and a write phase. For the validation phase, we again talked about different design decisions, essentially forward validation versus backward validation, different choices you can make when implementing the protocol.

Lastly, we talked about multi-version concurrency control (MVCC). Again, MVCC is not a specific protocol; it has a similar name to the other protocols, but multi-version concurrency control is in fact an optimization that you can apply on top of timestamp ordering, optimistic concurrency control, or even two-phase locking. It's an optimization, not a specific concurrency control protocol, and it's actually widely used in most systems, especially modern systems. There are different design decisions there as well, different policies. First, version storage: whether you store the entire tuple or only a delta version of the tuple, and where you store the different version records. Then the ordering: how you order the version chain of different versions of a tuple, either oldest-to-newest or newest-to-oldest, et cetera.
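Here is a minimal, hypothetical sketch of a newest-to-oldest version chain (the `Version` record and the visibility rule are simplified illustrations; real systems also track end timestamps and transaction status): a reader walks the chain until it finds the newest version created at or before its read timestamp.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    value: object
    begin_ts: int                     # timestamp of the writer that created it
    next: Optional["Version"] = None  # older version (newest-to-oldest chain)

def read(head: Optional[Version], read_ts: int):
    """Walk the chain; return the newest version visible at read_ts."""
    v = head
    while v is not None:
        if v.begin_ts <= read_ts:     # created at or before our snapshot
            return v.value
        v = v.next
    return None                       # tuple did not exist at read_ts

def write(head: Optional[Version], value, write_ts: int) -> Version:
    """Prepend a new version; old readers keep seeing older versions."""
    return Version(value, write_ts, head)

# Writers at timestamps 10 and 20; a reader at ts=15 still sees the old value.
head = write(None, "v1", 10)
head = write(head, "v2", 20)
print(read(head, 15))   # -> "v1"
print(read(head, 25))   # -> "v2"
```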
And we also talked about the different garbage collection mechanisms, either transaction-level or tuple-level, et cetera, all right? So that's that. Any questions so far on query optimization and concurrency control? Okay.

Then we spent a few lectures on logging and recovery. This is also a very, very important mechanism in a database system: it guarantees especially the durability aspect, and sometimes the consistency aspect, and it helps with atomicity as well. So it's a very important part of a database system. To begin with, we talked about the very important definitions of the steal and force policies. Those two choices of buffer pool policy actually determine many of the later options for implementing crash recovery, because depending on whether you allow steal or no-steal, force or no-force, that determines what the later algorithms can and cannot do. It's a very basic, fundamental definition. Essentially, steal means that if a dirty page contains changes from an uncommitted transaction, the system still allows that dirty page to be flushed to disk. On the other hand, if you do not allow a dirty page containing modifications from an uncommitted transaction to be written to disk, that's called no-steal. Meanwhile, force means that when a transaction commits, the database system must force the buffer pool to flush all the dirty pages containing that transaction's changes to disk before the transaction can commit; no-force just means otherwise.

Next we talked about the technique of write-ahead logging. Write-ahead logging is an example of steal plus no-force: with write-ahead logging you can adopt a steal, no-force buffer pool policy. Essentially, when a transaction makes changes in the database, the system records the changes in a log record and flushes that log record to disk before the modification itself has to reach disk. Your buffer pool can then be very flexible if you have write-ahead logging.
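Here is a minimal, hypothetical sketch of that ordering in Python (the JSON log format and the `WriteAheadLog` class are made-up illustrations; a real log uses a compact binary format and careful fsync batching):

```python
import json, os

class WriteAheadLog:
    """Minimal append-only log; illustration only."""
    def __init__(self, path="wal.log"):
        self.f = open(path, "a")
        self.next_lsn = 0

    def append(self, record: dict) -> int:
        """Append a log record and return its log sequence number (LSN)."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.f.write(json.dumps({"lsn": lsn, **record}) + "\n")
        return lsn

    def flush(self):
        """Force the log to stable storage."""
        self.f.flush()
        os.fsync(self.f.fileno())

wal = WriteAheadLog()
pages = {}   # in-memory buffer pool stand-in

# A transaction updates a page: log the change (before/after images),
# then apply it in memory. The dirty page itself is NOT flushed (no-force),
# and it MAY be flushed later, even before commit (steal), because the log
# has enough information to undo or redo the change after a crash.
wal.append({"txn": 1, "page": 7, "before": "old", "after": "new"})
pages[7] = "new"

# At commit: append a commit record and flush the log. Once the flush
# returns, the commit is durable even though no data page was written.
wal.append({"txn": 1, "commit": True})
wal.flush()
print("txn 1 committed durably")
```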
Then we talked about the different logging schemes. With logical logging you essentially just log the query itself, but it comes with the cost that when you try to recover from a crash, you have to re-execute every single query, which can be very costly. We also talked about physical logging and physiological logging; those two are similar in spirit: you log the changes to the pages touched by the query instead of the query itself. We talked about the different checkpoint mechanisms: the naive checkpoint, where you have to flush all the log records as well as all the dirty pages in the buffer pool, and, very importantly, the fuzzy checkpoint, a very important optimization that makes checkpointing much more efficient, essentially avoiding the need to flush the dirty pages in the buffer pool and avoiding the need to block other transactions. One note on that: in lecture we discussed a variant of the fuzzy checkpoint algorithm that is somewhat inconsistent with the standard algorithm, and we posted a correction on Piazza. If you haven't looked at it, you should check it out, to make sure you are referring to the correct, quote-unquote, variant of the algorithm in your homework and on the final exam, okay?

And then, finally, crash recovery itself. The most important concept here is the log sequence number (LSN). Many components of the system maintain their own notion, their own version, of the log sequence number, and those coordinate with each other. That way the system knows when it would be safe to write a dirty page to disk and how far the log must be flushed first; where to start reading after a crash, from the master record, in other words the checkpoint record; and which log records to analyze, then redo and undo, to do the correct thing. All of those things need the coordination of log sequence numbers, so that the different components of the database system know each other's status and can bring the database back into a correct state after a crash. So there are many very important concepts and different sub-variants of the log sequence number maintained in different parts of the system. We also talked a little about the compensation log record (CLR). It sometimes gets overlooked, but it's an important mechanism for dealing with the scenario where you crash during recovery, a recursive crash, so to speak. The compensation log record helps you in that situation, all right?

And then, next, we had a few lectures about distributed database systems. For distributed databases, again, we pretty much only focused on the high-level concepts. We didn't go into much detail about, say, how to implement the Paxos or Raft algorithms; we didn't have time for that, for one. And secondly, in this class we focused more on the design trade-offs you need to consider when you actually build a distributed database, because in actuality, if you are really going to build one, you are unlikely to implement a Raft algorithm yourself; you would probably just use an existing implementation. But you do need to understand the trade-offs between the different choices; that's very important for solving the practical problems you may encounter later. I'll have a few more comments on that in my final slides. In the distributed database setting, we first talked about the different system architectures: there's a trade-off between shared-nothing and shared-disk. We mentioned shared-memory, but that's rarely used; I don't know of any system that uses a shared-memory architecture. We also talked about shared-everything, but that would essentially be a single-node database system. And we talked about the different replication schemes: whether you use synchronous or asynchronous replication, whether you have eventual consistency, and whether you use a primary-replica or a primary-primary replication scheme.
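To make the synchronous-versus-asynchronous trade-off concrete, here is a small, hypothetical Python sketch (the `Primary` and `Replica` classes and the in-process "network" are invented for illustration; real systems replicate log records over a network):

```python
class Replica:
    def __init__(self):
        self.data = {}
    def apply(self, key, value):
        self.data[key] = value
        return True                      # ack

class Primary:
    def __init__(self, replicas, synchronous=True):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []                # writes not yet replicated (async)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: wait for every replica to ack before returning.
            # Safer (an acknowledged write survives losing the primary),
            # but every commit pays the latency of the slowest replica.
            for r in self.replicas:
                assert r.apply(key, value)
        else:
            # Asynchronous: acknowledge immediately, ship the write later.
            # Lower latency, but a crash here can lose the update, and
            # replicas are only eventually consistent.
            self.pending.append((key, value))
        return "committed"

    def ship_pending(self):
        for key, value in self.pending:
            for r in self.replicas:
                r.apply(key, value)
        self.pending.clear()

replicas = [Replica(), Replica()]
p = Primary(replicas, synchronous=False)
p.write("x", 1)
print(replicas[0].data)   # {} : replicas lag until ship_pending() runs
p.ship_pending()
print(replicas[0].data)   # {'x': 1}
```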
So those are all the different trade-offs and design decisions that you have to understand, right? And then, next, we talked about the different partitioning methods: you can do range partitioning or hash partitioning, and we also talked about consistent hashing, which is very, very useful when you have to deal with adding or removing nodes in your distributed database system. And lastly, we talked about a few different commit protocols, with a bit of focus on the two-phase commit protocol, because it's the most commonly used, essentially: how do we implement two-phase commit to make sure the system can coordinate a correct result for a distributed transaction?

All right. So, any questions about the topics from the second half of the semester, what will be on the final exam, and the logistics, before I present a few final comments on building database systems in the real world? Any questions? Yes, please.

[Student question about buffer pool policies.] Right, yeah. The question is essentially: we talked about steal and no-force; would any system use the other combinations of buffer management policies? As far as I know, most systems actually use steal plus no-force; that's the majority, because most systems use logging, especially write-ahead logging and ARIES, to implement their crash recovery algorithm, and that's considered more efficient. It's difficult for me to give you a percentage, but most systems use steal and no-force. Certain systems use other combinations; for example, shadow-paging systems, or systems built on a copy-on-write technique. So occasionally some systems use those combinations, but even that is really rare. For the other combinations, I don't really know of any. Yeah. Potentially other questions? No? Okay, sounds good.
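Before the final comments, one concrete illustration of two-phase commit: a minimal, hypothetical Python sketch of the coordinator's logic (the `Participant` class and the method names are invented; a real implementation force-logs each step durably and handles timeouts and coordinator failure):

```python
class Participant:
    """Stand-in for a remote node; a real one would be an RPC endpoint."""
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
        self.state = "INIT"

    def prepare(self) -> bool:
        """Phase 1: vote. A real participant force-logs PREPARE here."""
        self.state = "PREPARED" if self.will_commit else "ABORTED"
        return self.will_commit

    def finish(self, commit: bool):
        """Phase 2: apply the coordinator's decision."""
        self.state = "COMMITTED" if commit else "ABORTED"

def two_phase_commit(participants) -> bool:
    # Phase 1 (prepare/voting): every participant must vote yes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # A real coordinator force-logs the decision before telling anyone,
    # so it can recover the outcome after a crash.
    # Phase 2 (commit/abort): broadcast the decision to everyone.
    for p in participants:
        p.finish(decision)
    return decision

nodes = [Participant("A"), Participant("B"), Participant("C", will_commit=False)]
print("committed?", two_phase_commit(nodes))   # False: C voted no
print([(p.name, p.state) for p in nodes])      # everyone aborted
```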
Lastly, just a few final comments on implementing or building a database system yourself. I would say these final comments come from my own experience: in my research I helped build a database system from scratch, and a little also comes from my work experience. But I don't think these lessons are restricted to database systems; if you want to build a distributed file system, for example, you are likely to encounter similar questions, so the same experience might also be useful. The most important principle, if you will, when you're trying to build a complex or large system like a database system, is that you first have to be very clear about what your goal is, what your constraints are, and what resources you have. What scenario do you want to optimize for: OLTP or OLAP? Do you want to optimize for reads or for writes? What workload do you have? You have to be very, very clear about what you want to achieve, instead of trying to, quote-unquote, throw all the complicated optimizations at the system from the beginning. Whether you know your goal or not very much determines how efficient, how flexible, and how maintainable your system architecture is. You also have to know your constraints, the things you cannot do: whether there will be lots of failures in the system or not, what limitations you have. And, very importantly, what resources you have. When you begin your work, whether in research or at a company, you are probably just by yourself at first; later on you may have a few people. But whether you're by yourself or managing a team of five to ten people, you need to know how much work you or your team can do, and how that matches your goal and your limitations.

And not only do you have to know all those things; the general lesson I have is that you have to constantly remind yourself of them. When you are implementing things, focusing on a specific algorithm, or trying to debug a particular thing in your system, it's very easy to lose yourself in a particular bug or a particular optimization and forget the overall goal you want to achieve. When I was building a database system, I would actually remind myself of my goal, my constraints, and my resources multiple times a day, not just every day but multiple times a day, so that I didn't get lost. Another way to see this: my advisor, who was the lead developer of Postgres as well as a Turing Award winner a few years ago, likes to use the phrase "hippo in the tent" to emphasize that when you're building a real system you need to know what the most important part of the system is: what is the most important goal you want to achieve, and what is, for example, the most inefficient part, the most significant bottleneck, in your system. Focus on that, on the most important goal and the most important bottleneck or inefficient component, and optimize that part, instead of trying to optimize everything together. So this is what I think is the most important principle, if you will.

The second is a little related to that: essentially, try to avoid premature optimization, and try to avoid engineering for nonexistent requirements. As an engineer, it's very tempting to say, hey, I know an algorithm that can make this component faster. But how much faster would it be? Would you spend a month implementing an algorithm that makes this one component only 5 or 10 percent faster? And even if this one particular component is 5 percent faster, how much does it contribute to the overall system?
Try to avoid premature optimization that potentially costs a lot of effort and makes the system much more complex, when at the end of the day the benefit may not be that significant for your overall goal. Usually, if there are two similar solutions, and one might be a little more efficient than the other, in general it's better to prefer the simpler one. Similarly, it's tempting to say, hey, I've already implemented features 1, 2, and 3; what if I implement features 5, 6, and 7, which would enable lots of different functionality, this and that? It's a little tempting to do that, but in general, if you don't actually have a requirement for a specific functionality or feature, it's probably better to avoid pre-engineering for nonexistent requirements, because in many cases that makes the system more complicated and harder to maintain and extend.

That relates to the last point I want to get across. When you're trying to build a real system, I said to prefer simple solutions and to know your goal, but that doesn't mean you should cut corners or take shortcuts. It's always a bad idea to cut corners: to do poor error handling, for example, or to connect different components of the system together in a very hacky way instead of thinking through what the interfaces should be and how the components collaborate. In general, cutting corners in systems engineering always bites you in the end, so try to build things rigorously; that is generally a benefit. In the meantime, similar to the earlier points I tried to make, there is no single correct solution in systems; in general it's always about different trade-offs and different considerations: what's your goal, what are your constraints. You have to always think about your solution in the context of a specific environment and weigh the pros and cons. And in terms of avoiding cutting corners, you also don't want to spend a lot of effort building things that would not be that beneficial, so you have to balance the amount of engineering effort you spend against how extensible the system will be. This also connects to what I said earlier: you need to evaluate how many resources you have, how much engineering you can afford, and, given that, how much extensibility you can support. And when there is a conflict between a solid engineering solution and a hacky solution that is potentially more extensible or has more features, it's probably a better idea to prefer the simple solution, but with a well-thought-out design and a solid interface. These are the lessons that I think will be useful in your work or your research in the future, whether you go into industry or pursue a research career, especially if you're just starting out.

So, any questions before we conclude this course for the semester? All right, cool. That's all we have for this semester. Thanks a lot to everyone this semester, and good luck on your final homeworks, exams, and projects, as well as any other exams you may have in finals week. All right, thanks, everyone. Thank you. Thank you.