All right, there we go. Thanks for waiting. So like I said last class, for today's lecture we're not covering any new material or a new topic. We're going to spend some time talking about project number three, because next class, after break, you guys are going to have to propose the topic you're going to work on in your group. And then I want to spend some time talking about the extra credit. There's nothing on the website for either of these beyond what's been there before, but I'll release that later in the week with a full explanation of everything that I talk about here. So obviously, we're just going to talk about project three topics and extra credit. Let's jump right into this.

So the idea for project number three is that every group is going to be required to implement some substantial component, feature, technology, or functionality in the Peloton database system. The idea is that everyone here should be trying to incorporate ideas and topics that we talk about in the class, not just what we've covered so far, but also, looking ahead, some of the topics that are coming down the pipeline for us. And if there's anything you want to do as part of your own research that you want to tie into our database system, by all means, you can do that as well. The only restrictions I have are that it has to be something that everyone in the group contributes to equally, and that the project topic has to be unique from every other project topic in the class. We don't want two groups implementing the exact same thing, because that's not that interesting.

So project three: we talked about this at the beginning, but I want to go into more detail now that we're further along in the course. It's really comprised of five components, and we'll go through them one by one.

The proposal, which is what we'll do on the first Monday after spring break, is just going to be a five-minute presentation to the class about the high-level topic that you plan to pursue. You want to talk a little bit about what you think is actually going to have to change in the database system in order for you to implement your project. I don't want you to say, "oh, here's this magical thing I want to do," without really thinking it through, looking at the code, and deciding whether it's feasible between now and the end of the semester. You also want to talk about how you plan to test your implementation, to check whether it's correct and performs well, and you want to specify what workloads you're going to use to evaluate your project. This is why I've been asking you to write down, when you review the papers, what workloads all these other people have used in their database systems and research projects, because that'll tell you what you should be using for yours. And I'm more than happy to guide you in the right direction.

We're not going to film the project proposals, so don't feel like you're going to be embarrassed to say something and have it end up on video. It's just going to be us. I may end up being a dick and asking really hard questions, but it's not going to be on video, so no one will remember after the class is over. So feel free to really discuss and think through what you actually want to say.
Then later in the semester, and I forget the date, but it's marked on the website, you'll give a status update to the class as well: here's what we've done so far, here's how things are coming along, and whether you've decided to pivot or change your project topic in any way. This is really just a checkpoint to make sure that everyone's making the same amount of progress across the board. We don't want one group super far ahead and one group super far behind. Joy and I plan to meet with you regularly as needed to make sure that you're making progress, but this is an update to the entire class.

Now, when we get to the end of the semester and you actually want to start turning in your code, there are three additional steps. The first is that we're going to require every group to pair up with another group to do a code review. We're going to set this up through GitHub so you can leverage GitHub's issue-tracking features to review another group's code and provide feedback on what they should or should not be doing to update their code or improve their implementation. The idea here is to get you comfortable and familiar with modern coding practices: when you go work at a major company, you'll have to do code reviews there. I'll spend some time in a lecture before we get to that point on what I expect of all of you and on the general approach you should take when doing code reviews. The grading for this will be based on participation. If you just wing it and say "you used tabs instead of spaces," and that's all you say in your code review, that's going to be insufficient. It's really meant to teach you how to do code reviews and also to improve your own code through feedback from your peers, okay?

Then we'll do a final presentation. They haven't released the final exam schedule yet, or at least I haven't seen it; I don't know if you guys have. Whenever our final exam date is, that's when we'll have the presentations. We'll have food, pizza or whatever, and every group will spend ten minutes discussing the final outcome of their implementation. You'll want to do some kind of performance benchmark or evaluation of your work: "we implemented this optimization; here's the throughput we had before, and here's the throughput afterwards." You want to show that your thing actually works and makes the system better in the way you were seeking to. If you want to do a live demo of your project for the class, that would be really awesome too, but it's not required.

Once we get through the code review and the final presentation, there's the final code drop. We'll schedule this around whenever I have to turn in final grades for the semester, but the basic idea is that your project is not considered done and complete until we have a pull request from you on GitHub that merges cleanly into the master branch.
This pull request has to include not just the implementation, but also the test cases, as well as documentation about what your feature or project is, how it can be used, the trade-offs you made in your implementation, and the future work you'd pursue if you had more time on your project. Again, these are all the things we expect from you in order to say that your project is done.

Now, I realize that when you do a pull request and want to merge into master, if somebody else gets a bunch of changes merged in first, your pull request could possibly no longer merge cleanly, because somebody else got in before you. So to make this fair, we'll randomly select the merge order for every group, so you only have to make sure your change merges against the master that came before you. What we don't want is everyone merging against the current master and then, as we start merging all the pull requests in, finding that we can't cleanly merge another group's. So we'll do this in random order, and I will be lenient with the last group if they have problems. The first group, the one that gets the first slot, has the easiest job; they just merge against master, so I'll be a bit stricter with them. But if you're at the end, again, we can help out. Any questions about this? And again, I'll write out exactly what I mean by all these steps and put a post on the website later this week.

So now I want to spend some time talking about some potential topics you could pursue for project three. This is by no means an exhaustive list; these are just some ideas that Joy and I thought through that we think might be interesting for you to look at and try out. Again, if there's something you're doing in your own research that you want to apply to our system, you can. So, Joy is really interested in, like, pornography, so if you want to do something with pornography in our database system like Joy wants to, then by all means, go ahead and do that. Again, this is not an exhaustive list; if there's something else you want to pursue, talk to me and we can see whether it's right for project three. I'm going to go through each of these one by one, say a little bit about what it is, and then what I think you'd actually have to do in the system to complete the project or make it work. And for some of them I'll give a stretch goal or bonus: if you have enough time, here's something else you could do that would be kind of cool. Stop me as we go along if you have any questions.

All right, so the first one is, I think, the most interesting and probably the most important. Since our current database system is built on top of Postgres and uses a lot of Postgres internals, it still relies on the Postgres query planner and optimizer, which is inherently disk-based. The query planner has this catalog of statistics, and it uses that to determine what the best query plan is for a particular query. And the cost model it uses internally to decide whether one plan is better than another is based on the number of disk blocks it has to read, which obviously doesn't make sense for an in-memory database, because there's no disk I/O.
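To make the cost-model problem concrete, here's a minimal sketch, with entirely made-up structure and constant names (this is not Peloton's or Postgres's actual code), of how a disk-oriented cost function differs from what an in-memory planner would need. The point is just that the dominant term in a Postgres-style model is page I/O, which is zero for us:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical table statistics; illustrative only, not Peloton's catalog.
struct TableStats {
  uint64_t num_tuples;
  uint64_t num_pages;   // only meaningful in a disk-based system
  double selectivity;   // estimated fraction of tuples a predicate keeps
};

// Disk-oriented model (roughly the Postgres style): the dominant term is
// the number of blocks fetched from disk; per-tuple CPU work is noise.
double DiskSeqScanCost(const TableStats &s) {
  const double kPageIOCost = 1.0;     // cost per page read
  const double kTupleCpuCost = 0.01;  // CPU cost per tuple
  return s.num_pages * kPageIOCost + s.num_tuples * kTupleCpuCost;
}

// In-memory model sketch: there is no I/O term at all, so the model has to
// weigh CPU work (predicate evaluation, materialization) instead.
double MemSeqScanCost(const TableStats &s) {
  const double kTupleCpuCost = 1.0;  // per-tuple predicate evaluation
  const double kOutputCost = 0.5;    // per emitted tuple
  return s.num_tuples * kTupleCpuCost +
         s.num_tuples * s.selectivity * kOutputCost;
}

int main() {
  TableStats stats{1000000, 12500, 0.1};
  std::printf("disk model: %.0f  in-memory model: %.0f\n",
              DiskSeqScanCost(stats), MemSeqScanCost(stats));
  return 0;
}
```

A real optimizer would have one such formula per operator and feed them into its plan enumeration; the sketch only shows why the weights have to change once the I/O term disappears.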
The other issue we have is that Postgres plans and Peloton plans are not exactly compatible. What happens is the Postgres query planner generates a query plan tree, and then we have a transform function that converts it into what our plan should look like. That's not a huge overhead, but it's not trivial either. So for this project, I think a really cool topic would be to implement a new in-memory query planner and optimizer that emits Peloton plans directly, so we can get rid of the Postgres stuff entirely. What I think would be kind of cool about this is that there are a lot of newer features in C++11, like lambda functions, that this query optimizer could use, and it would be a more flexible implementation than the standard heuristic approach that older systems like Postgres use. There are also other modern optimization algorithms you can consider. Postgres actually has two query optimizers: it has a simplistic rule-based one, but if the query gets too complicated, it switches to a genetic algorithm. So that's something we could consider for this project as well. The MemSQL guys actually have a C++11 query optimizer based on lambda functions that they claim lets them do some really cool things that the rule-based stuff doesn't allow.

Now, obviously, writing a query optimizer from scratch is a lot of work. For this, you wouldn't necessarily have to support all the different types of operators and query plans that Postgres does, especially since we don't support all of them in our system yet. But if your optimizer implemented the bare minimum amount of SQL that we need for our system, and you wrote it in such a way that it could be easily extended later on, that would be really helpful, because then next year or two years later, when we have another class, the next students can build off of what you've done and improve it even further. Any questions about this? Okay.

The next project topic is to add vectorization support, vectorized execution, inside the system. We talked about this a little bit with the sort-merge join, using SIMD instructions to speed up the sorting or hashing. The idea here is to implement the same kind of SIMD stuff directly in the query executors for the database system. We'll talk more about vectorized execution after spring break, but the basic idea is that you'd use SIMD to speed up scanning, sorting, and hashing, the same way you saw with the sort-merge stuff. A key part of this, and we'll see this in some of the other projects too, is that you should not just blow away the non-vectorized code in the system. What you want is a little flag, either at compile time or in a configuration file, that says, yes, use vectorized execution. That way you can test whether your new, improved implementation of these operators outperforms the existing code. If you just blow away the old stuff, you'd have to compare against the old master branch, which may not have the fixes you put in place in other parts of the system along the way.
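To make the keep-both-paths point concrete, here's a hedged sketch of a scan predicate with a scalar baseline and an AVX2 version selected by a flag. The flag, function names, and choice of intrinsics are illustrative, not Peloton's actual code, and it assumes a GCC/Clang-style compiler with -mavx2:

```cpp
#include <immintrin.h>  // AVX2 intrinsics; compile with -mavx2
#include <cstdint>
#include <vector>

// Illustrative knob only: in the real project this might be a compile-time
// #define or a setting read from a configuration file.
static bool g_use_vectorized_scan = true;

// Scalar baseline: keep this around so the two paths can be compared.
void ScanScalar(const int32_t *vals, size_t n, int32_t lo,
                std::vector<uint32_t> *out) {
  for (size_t i = 0; i < n; i++)
    if (vals[i] > lo) out->push_back(static_cast<uint32_t>(i));
}

// Vectorized path: compare 8 int32 keys per instruction, then decode the
// resulting bitmask into matching tuple offsets.
void ScanSIMD(const int32_t *vals, size_t n, int32_t lo,
              std::vector<uint32_t> *out) {
  const __m256i vlo = _mm256_set1_epi32(lo);
  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(vals + i));
    __m256i gt = _mm256_cmpgt_epi32(v, vlo);
    int mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt));
    while (mask != 0) {
      int bit = __builtin_ctz(mask);  // GCC/Clang builtin
      out->push_back(static_cast<uint32_t>(i + bit));
      mask &= mask - 1;  // clear the lowest set bit
    }
  }
  for (; i < n; i++)  // scalar tail for the leftover tuples
    if (vals[i] > lo) out->push_back(static_cast<uint32_t>(i));
}

void Scan(const int32_t *vals, size_t n, int32_t lo,
          std::vector<uint32_t> *out) {
  if (g_use_vectorized_scan)
    ScanSIMD(vals, n, lo, out);
  else
    ScanScalar(vals, n, lo, out);
}
```

The important part is the dispatch at the bottom: both implementations stay in the tree, so you can flip the flag and benchmark one against the other on the same build.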
So I think it's really important to keep both of these paths so that you can test and evaluate them.

The next project is to add better support for logging, checkpointing, and recovery. Currently we have a simplified write-ahead logging scheme that does physical logging of all the changes that transactions make, but it is by no means as sophisticated as the SiloR stuff or some of the other papers we talked about. And we can't reload the database after a restart either; basically, we can log at runtime, and that's about it. So for this project, you'd optimize the current runtime logging scheme, doing some of the stuff that SiloR does; you'd implement checkpoints, and it's up to you to decide whether you do fuzzy checkpoints or consistent checkpoints; and then you'd implement recovery from the checkpoint and from the log. I think it would be interesting to do the completely parallelized version that SiloR does, rather than the single-threaded version.

The next project is materialized views and triggers. If you don't know what a materialized view is: it's like a regular view, a sort of virtual table, except that the underlying backing storage for the view is updated any time the base table is updated. With a regular view, if you declare the view and run a query against it, the system basically re-runs the view's query every single time you access it. A materialized view, though, is maintained over time in the background, so when you query against it, the system doesn't have to run the full query; the output is already there. The way you implement materialized views is essentially with triggers that let you say: whenever there's an update to the underlying backing table for the view, update my materialized view. This can be the naive approach of just re-executing the entire query again, or you can do an incremental update, which we'll talk about later, using deltas and things like that, so that you only touch the small part of the underlying table that changed whenever it's updated, without having to scan the entire thing.

To implement this in Peloton: we don't have any notion of views, regular or materialized, or even triggers at this point, but we can leverage a lot of what's already in Postgres's catalog to specify what the triggers or the views should be. So there's a lot of Postgres stuff we can use; it's a matter of taking that information out of the Postgres catalog, populating your own implementation of it at runtime, and then inserting hooks at the right points in table updates to fire off the triggers. Now, I'm not actually sure how hard this is. Doing incremental updates on materialized views, as we'll see later, is non-trivial, but you might be able to take some shortcuts to make it go faster, so this may turn out to be really easy. So a bonus idea that I think would be really cool: once you have triggers, you can implement a Pub/Sub interface that lets the database system stream updates out to a message broker or message queue like Kafka.
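Before we get to the Kafka part, here's a minimal sketch of the trigger-plus-materialized-view idea. All of the structures and names here are hypothetical (Peloton has none of this yet); the core is just a list of callbacks that fire after each insert, with the view maintained either by a full refresh or by applying the delta:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Tuple {
  std::string key;
  int64_t value;
};

using Trigger = std::function<void(const Tuple &)>;

struct Table {
  std::vector<Tuple> rows;
  std::vector<Trigger> on_insert;  // AFTER INSERT triggers

  void Insert(const Tuple &t) {
    rows.push_back(t);
    for (auto &trig : on_insert) trig(t);  // fire the triggers
  }
};

// A materialized view holding SUM(value) GROUP BY key.
struct SumView {
  std::unordered_map<std::string, int64_t> sums;

  // Naive maintenance: re-run the whole aggregation on every update.
  void FullRefresh(const Table &t) {
    sums.clear();
    for (const auto &r : t.rows) sums[r.key] += r.value;
  }
  // Incremental maintenance: apply just the delta for the new tuple.
  void ApplyDelta(const Tuple &t) { sums[t.key] += t.value; }
};

int main() {
  Table orders;
  SumView view;
  // Register a trigger that keeps the view fresh incrementally.
  orders.on_insert.push_back([&](const Tuple &t) { view.ApplyDelta(t); });
  orders.Insert({"alice", 10});
  orders.Insert({"alice", 5});  // view.sums["alice"] is now 15
  return 0;
}
```

And that same after-insert hook is exactly where a Pub/Sub publish would go.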
So picture something like this: you have your front-end OLTP database system running Peloton, you insert new tuples into it, and there's a Pub/Sub interface, fired off by a trigger, that publishes updates downstream to something like Kafka. That's a very common architecture in very large-scale applications. So I'd consider this a bonus thing that I think would be pretty cool. And again, it would get you familiar not only with the Postgres catalogs and with implementing triggers, but also with another system that's widely used in industry.

The next project is concurrent schema changes. It's very important for your database system to be able to do ALTER TABLE: add and drop columns, add tables, change the layout of tables, change the ordering of columns, things like that. And we obviously want to apply these changes while we continue to execute transactions and queries. So for this project, you'd implement concurrent schema changes with minimal overhead, minimal impact on the runtime execution of queries. Right now, I don't think we can do any ALTER TABLE in our system at all, so this is basically implementing everything from scratch. But what I think is cool about doing this in a main-memory database system with our tile-based architecture is that you can do lazy updates, lazy propagation of schema changes, in a way that isn't really possible in other systems. When you call ALTER TABLE in something like Postgres or MySQL, it scans through every single tuple in the table and applies the change. But in our case, say we add a column: we don't necessarily have to go and modify the physical storage of the tuples in their tiles. We can just set a little flag that says, yes, you added this column, but this tile doesn't have it, so if anybody tries to access it, use the default value. Or say you drop a column: rather than going through and applying the change right away, you do the same thing and mark in the header of the tile group, or of the tile, that this column is deleted, so nobody can access it. Then over time, as you continue to access data or reorganize the tiles, you apply the changes. This would be a really cool project, and you'd likely get a significant speedup over what existing systems can do today. And obviously you'd need to implement your own internal catalog, reusing much of the current Postgres one, to keep track of the different schema versions that are out there.

All right, the next project is to implement constraints. So constraints are things like... yeah, sorry? The question is: do the benchmarks we've seen alter tables? No. That's actually a good question. We looked at TPC-C, we looked at TPC-H; do any of these have ALTER TABLE? The answer is no. That would actually be an interesting benchmark to build; you could almost get a publication out of that one, "the ALTER TABLE benchmark." That'd be kind of cool. It's not hard to think of simple experiments to test this, though, and I'll talk a little bit about how we can do SQL-level testing in a second. All right, so the next project is to implement constraints. Constraints are things like NOT NULL, or checking whether something is greater than zero or less than zero.
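Before digging into constraints, here's a rough sketch of the lazy schema-change idea I just described, again with made-up names: ADD COLUMN becomes a pure metadata operation, and the read path fills in the default for any tile that predates the column:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical structures for illustration only.
struct ColumnDef {
  std::string name;
  int32_t default_value;
};

// Per-tile metadata: records which schema version the tile's physical
// layout was built with. ALTER TABLE never rewrites the tile itself.
struct Tile {
  uint32_t schema_version;
  std::vector<std::vector<int32_t>> cols;  // column-major storage
};

struct Table {
  uint32_t current_version = 0;
  std::vector<ColumnDef> schema;  // schema at current_version
  std::vector<Tile> tiles;

  // Lazy ADD COLUMN: an O(1) metadata change; no tile is touched.
  void AddColumn(const ColumnDef &c) {
    schema.push_back(c);
    current_version++;
  }

  // Read path: if the tile predates the column, synthesize the default
  // value instead of touching (nonexistent) physical storage.
  int32_t Read(const Tile &tile, size_t col, size_t row) const {
    if (col >= tile.cols.size()) return schema[col].default_value;
    return tile.cols[col][row];
  }

  // A background reorganizer would eventually rewrite old tiles into the
  // new layout and bump their schema_version.
};
```

Dropping a column works the same way in reverse: mark it dead in the metadata, hide it from readers, and reclaim the storage whenever the tile is next reorganized.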
Constraints matter because they're how the database enforces not only value integrity but also referential integrity across tuples. As far as I know, we don't have any support for constraints at this point; we might have simple things like NOT NULL, but more complicated expressions, like those in a CHECK clause, are something you'd implement here. You can start off with the simple constraints, like checking whether a value is greater than zero, leveraging the existing expression system that you already used for the join and the Bw-tree. Then the final step would be to implement foreign-key constraint checking, which is a little more complicated because you have to look up in an index to see whether the foreign key you're trying to reference actually exists or not, and you'd implement all the cascading deletes as well. So this is a lot of work in the internals, but it would be very cool because it touches a lot of different parts of the runtime system.

All right, next is to implement user-defined functions. UDFs are a way to declare a function that takes a tuple in and returns a value, so you can apply some kind of complex logic to a particular data item that you wouldn't normally be able to express with the standard SQL functions. For this, you'd implement, basically from scratch, the ability to take something like PL/pgSQL from a CREATE FUNCTION statement, load it into the Postgres catalog, and then use it inside the Peloton expression subsystem. So your WHERE clause could use a UDF, your join clause could use a UDF, and a UDF could appear in the SELECT clause as well; you want to be able to use UDFs in all the different parts of the system. I think that if you get this into the expression subsystem in the correct way, it automatically becomes usable everywhere else, because we reuse the same subsystem everywhere.

All right, the next project is to enhance indexes. It's still up in the air what the best index for our system is. We talked about skip lists, we talked about Bw-trees, we talked about ART. We don't know which one we want to use yet, but the Bw-tree project is a good start for this one. You've already implemented probably the second most complicated index, the Bw-tree (the fully concurrent Bw-tree is more complicated), so the idea now is to implement the other ones, do an evaluation and benchmarking, and try to understand which one actually performs the best. For this, we may have to modify the current index API in Peloton to be able to incorporate all of these. The goal would be that at the end of the semester, whichever one we think is best across all the different workloads is the one we plop in there, so whatever index you build is the one we actually use. Another cool thing: if this turns out to be really simple, say you skip the B+tree and just do the other two, a bonus project or stretch goal would be to implement an inverted index that lets you do full-text queries.
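On the foreign-key part, here's what the check might look like, heavily simplified. The std::set stands in for the parent table's primary-key index (in Peloton it would be a real index probe, for example into the Bw-tree), and all the names are invented:

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>
#include <vector>

struct ParentTable {
  std::set<int64_t> pk_index;  // stand-in for the primary-key index
};

struct ChildRow {
  int64_t id;
  int64_t parent_fk;
};

struct ChildTable {
  const ParentTable *parent;
  std::vector<ChildRow> rows;

  // FK check at insert time: the referenced key must exist in the
  // parent's index, otherwise the insert is rejected.
  void Insert(const ChildRow &r) {
    if (parent->pk_index.count(r.parent_fk) == 0)
      throw std::runtime_error("foreign key violation");
    rows.push_back(r);
  }
};

// ON DELETE CASCADE sketch: deleting a parent key removes every child row
// that references it. A real system would probe an index on the FK column
// rather than scanning the child table.
void CascadeDelete(ParentTable &p, ChildTable &c, int64_t key) {
  p.pk_index.erase(key);
  for (auto it = c.rows.begin(); it != c.rows.end();) {
    if (it->parent_fk == key)
      it = c.rows.erase(it);
    else
      ++it;
  }
}
```

The transactional part is what makes this hard in practice: the existence check and the insert have to be atomic with respect to a concurrent delete of the parent row.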
Back to the inverted-index idea: we're not really going to talk about inverted indexes in this class, but the basic idea is that to speed up a LIKE predicate or other string-matching expression, you build an inverted index that maps words, individual strings within a text field, to tuples, and then those lookups become really fast. But how you maintain this while executing transactions at the same time is non-trivial.

All right, so next class, after you guys do the proposals, we'll start the lecture on database compression. The basic idea is that we allow the database system to compress parts of the database in memory, which reduces the storage overhead for that data as well as the amount of data we have to process while executing queries. We already saw this when we talked about the bitmap indexes in SQL Server, which use value encoding, and then we talked about delta encoding, dictionary encoding, and naive block compression, where you just take a block of memory and run gzip or Snappy on it. So the idea is to implement all these different schemes in the system, evaluate them on different workloads, and see how well they perform. For this, you'd probably need to implement new query operators that can access compressed memory directly. With naive block compression you can just decompress the block and feed it into the existing operators, but for the other schemes, you want the operators to work directly on the encoded data without decompressing it, to reduce the amount of data you pass from one operator to the next. This is the late materialization stuff we talked about. A bonus project or stretch goal would be to implement logic in the database system, based either on heuristics or on some kind of machine learning algorithm, that figures out the ideal compression scheme for different parts of the database based on how we observe queries accessing them.

The next project is to add support for multi-threaded, intra-query parallelism. Currently, since we're still leveraging bits and pieces of Postgres, and Postgres only uses one thread per query, we have that same restriction: a single worker thread operates on a transaction and its queries as it goes along. The goal here is intra-query parallelism, sort of the same thing we saw in HyPer with the morsel-driven architecture: for a single query, we break it up into sub-plans, run them on different threads at the same time, and then combine the results at the end to produce a single answer. For this, you'd implement exchange operators to modify the query plans; that's the standard approach everybody takes. A bonus project on top of this would be to implement the morsel-style NUMA-aware partitioning and data placement they used, where you explicitly put blocks of data on one socket versus another and then schedule threads to operate only on local data. For that, we'd need to add information to our internal catalog saying that this block of this table is located on this socket, so we know which threads can go access it.

The last project is to integrate memcache. If you don't know what memcache is, it's probably the most common caching system used in large-scale applications.
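Before we get into memcache, one concrete sketch for the compression project, since operating directly on encoded data is the whole point. This is a toy dictionary-encoded string column, with invented names: an equality predicate becomes one dictionary lookup plus a scan over fixed-width 4-byte codes, and the strings are never materialized. (With an order-preserving dictionary, range predicates could run on the codes too; this toy one is not order-preserving.)

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DictColumn {
  std::unordered_map<std::string, uint32_t> dict;  // value -> code
  std::vector<std::string> values;                 // code -> value
  std::vector<uint32_t> codes;                     // the encoded column

  void Append(const std::string &v) {
    auto it = dict.find(v);
    if (it == dict.end()) {
      it = dict.emplace(v, static_cast<uint32_t>(values.size())).first;
      values.push_back(v);
    }
    codes.push_back(it->second);
  }

  // Equality predicate evaluated on the encoded data: translate the
  // constant once, then compare 4-byte codes instead of strings.
  std::vector<uint32_t> SelectEq(const std::string &v) const {
    std::vector<uint32_t> out;
    auto it = dict.find(v);
    if (it == dict.end()) return out;  // the value never appears
    const uint32_t code = it->second;
    for (uint32_t i = 0; i < codes.size(); i++)
      if (codes[i] == code) out.push_back(i);
    return out;
  }
};
```

Note how well this composes with the vectorization project: the inner loop over codes is exactly the kind of tight fixed-width scan that SIMD is good at.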
You can think of memcache as an in-memory key-value store where you just call get and set on objects. The idea is that instead of running a query against your database, you first look in memcache to see whether the object you want, based on its primary key, is already there. The problem with memcache, though, is that when an application modifies an object, it has to go invalidate the cached copy in memcache to make sure nobody reads stale data. So a lot of the time, what people actually want is a memcache-like API directly on the database system, which cleanly avoids having to maintain the cache in a separate memcache deployment. This is actually something the hedge funds want to use, and a lot of people do this with MySQL as well, because MySQL supports the memcache protocol directly in InnoDB. So the idea is that instead of running SELECT * FROM table WHERE primary_key = something, you just call get with the key, and internally the system goes and fetches the actual object as if it were executing a query. For this, you'd implement the memcache protocol in a network-handler thread inside our database system, and then, based on whether the request is a get, a put, or a delete, you'd invoke a prepared-statement query that performs the corresponding operation. So from the outside it looks like memcache, but internally you're just executing these predefined one-query transactions. Now, it's not clear to me how easy this is actually going to be; it's not clear how easy it is to implement the memcache protocol in our system. From what I've read, it's fairly simple, but you know, the devil's in the details. So if this turns out to be super easy, what would also be useful and interesting is to rewrite basically all of the client connection-handling code in our system, because we're still relying on the Postgres stuff. We could come up with a more thread-optimized way of accepting query requests from clients and handing them off to the worker threads. So if memcache turns out to be super easy, that's the next thing to look at. Okay, any questions about these projects?

All right, so a big part of implementing project number three is testing. You've already done unit tests for the Bw-tree and the hash join. We're also going to provide you with a SQL-based regression testing suite that lets you check whether your project breaks any high-level functionality or features in the database system. The idea is that it's a Python testing framework with all these different permutations and variations of SQL statements, transactions, and ALTER TABLE commands, so that as you implement your thing, if you want to ask, did I break COUNT queries?, you can run the suite and it will check these things for you. This is not meant to be exhaustive, and the unit test cases aren't meant to be exhaustive either, so every group is going to be required to extend either the regression test suite or the unit tests to check that your project actually works. We won't accept any project that doesn't thoroughly test its implementation. We may or may not enforce code coverage checks, which I'll talk about later, but the idea is that you want test cases that touch all the different parts of your code.
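As an example of what extending the tests means, here's a hedged, googletest-style unit test in the spirit of the existing ones, written against the hypothetical DictColumn from the compression sketch earlier (the header name is made up):

```cpp
#include "gtest/gtest.h"

#include "dict_column.h"  // hypothetical header for your new component

TEST(DictColumnTest, EqualityPredicateOnEncodedData) {
  DictColumn col;
  col.Append("abc");
  col.Append("xyz");
  col.Append("abc");

  // The predicate should find both occurrences of "abc"...
  auto matches = col.SelectEq("abc");
  ASSERT_EQ(2u, matches.size());
  EXPECT_EQ(0u, matches[0]);
  EXPECT_EQ(2u, matches[1]);

  // ...and return nothing for a value that was never inserted.
  EXPECT_TRUE(col.SelectEq("missing").empty());
}
```

The point is that the test exercises both the hit and the miss path of your new code, not just the happy case.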
On code coverage: since there's so much existing Postgres code, we might not enforce it, but we'll set something up so everyone can see whether their test cases cover their own code. One thing I didn't write down here: we also have a build-and-test server here at CMU that every group will have access to. The idea is that you tell me the main repository your group is using for your implementation, and we'll set it up so that any time somebody pushes code to GitHub, our system pulls down the changes, compiles them, runs all the tests, and sends you a notification when someone breaks the build. This is called continuous integration, continuous building. We use something called Jenkins, and we'll set that up for everyone here, so that as you push changes you can make sure you're not breaking things. Okay?

For computing resources: the hardware that MemSQL graciously donated to us has arrived; it showed up last week. It's in my office, still in the boxes, but we're working on getting it set up. We have three machines. Each machine has a dual-socket Xeon with six cores and twelve threads per socket, so each box has twelve real cores, twenty-four threads with hyperthreading. Each box will also have 16 gigabytes of DDR4, and we're working on increasing that. The goal for next week, over spring break, is to set these machines up in the database group's lab so that when we come back, we can give everyone access to them. We want to use what's called Ubuntu MAAS, Metal as a Service: rather than getting a VM, you provision the actual machine and have exclusive access to it while you run your experiments and your testing, and when you're done, you release it and the next group can come along and use it. If anybody has experience setting up Ubuntu MAAS, please email us, because this is the first time we're doing it and we want to make sure we do it right.

The other thing: for the actual benchmarking and evaluation, we don't want anyone to have to rewrite TPC-C or TPC-H. The good news is that we already have a testing framework with all the benchmarks you'd want for your project. It's called OLTP-Bench. It's something I wrote with other people when I was in grad school: a Java-based framework with fifteen benchmarks already built in, ready to go. They're mostly OLTP benchmarks: TPC-C, TATP, YCSB, Wikipedia, Twitter, Epinions, and a bunch of others. We have only two OLAP ones: the CH-benCHmark, which is from the HyPer guys and combines TPC-C with a simplified version of the TPC-H queries, and then TPC-H itself, which is the one you guys have been seeing. We don't support TPC-DS at this point. But I don't want anybody to have to write TPC-C themselves. I wrote it five times when I was in grad school; Joy's already written it twice, and he's only been in grad school for three years. So no one should write TPC-C themselves; definitely use our existing implementation, and it'll save you a ton of time. I'll also say that we've only really tested TPC-C and YCSB for the most part; we're working on testing the other benchmarks. And for TPC-H, there are some SQL operators we currently don't support, things like LIKE and some simple math functions, but we're working on adding them.
So at this point we can't run all of it, but the goal is that by the end of the semester, when you actually want to start benchmarking your project, it'll be available for you. This is the link; for more information, it'll take you to the GitHub repository. There are enough people here at CMU with experience with OLTP-Bench that, even though we're throwing another piece of code at you, there are people you can talk to, myself included, who can help you get it running. And we already have scripts that can set things up and run it against Peloton for you. Okay?

All right, so again, a reminder for project number three: every group has to do a five-minute proposal on March 6th when we come back after spring break... both myself and Joy will be around; we have a deadline on March 7th, but after that we should be available. Wait, that's next week, isn't it? That's not right. So what should it be, the 13th? The slide's still not right either. What is today's date? Let's get this right. Today's the 2nd, so it's going to be the 14th. Yes, thank you. Let's just fix that now. So everyone comes back, and we do the proposals then. And if you need to talk to me during this time, just email me or come to my regular office hours on Monday and Wednesday, okay? Any questions about project three? Yes? The question is how your group claims its topic. What I'll do is send out a spreadsheet with all the group numbers, and you just put your topic in there; whoever gets there first has it. And don't be stupid and go try to delete another group's entry: they'll still have their claim, because I can see the version history. Yes? The suggestion is that each group could give, say, three choices, the three projects they'd want to work on, and we figure it out from there. Sure, we can do something like that. This isn't really a gold rush; we're all friends here, so there's no need for any of that, and we'll take it case by case. If everyone is dying to do the query optimizer or whatever, then we'll sort something out. But I think there are enough topics that we should be spread out, and there are two or more things up here that I know people are already talking about doing. Any other questions?

All right, so again: project three is, I think, 45% of the final grade, so this is the main thing we'll be focusing on, and we'll do some more lectures as we go through the semester about how you navigate and work in a large code base that contains a bunch of stuff you didn't write, where the people who wrote it aren't around anymore.

Okay, so to finish up, let's talk about extra credit. I am offering 10% extra credit to every student in the class who writes an encyclopedia-style article about a database management system. It doesn't have to be a commercial system, and it doesn't have to be my system; it can be any system you want, academic or commercial, a system that's still around or one from the past. The idea is that the article should discuss all the major topics we cover in this class. What kind of concurrency control scheme does it use? What kind of logging scheme does it use? Does it do query compilation? Does it do vectorization?
What kind of join algorithms does it support, and how does it support them? Now, rather than having all of this be free-form text... you might say, why don't we just do this on Wikipedia? Well, if you look at the Wikipedia articles for different database systems, they all contain different types of information, and they don't use the same terminology. So the idea is that the encyclopedia we're trying to create here at CMU has a standard taxonomy that we can then use to make comparisons across these different database systems and ask questions like: is MVCC really the most common approach used in modern systems, or has it always been that way? So we'll have a taxonomy that you use for your article, with predefined options: for concurrency control, does it use two-phase locking, MVCC, OCC? So you don't have to write all that out; you just select what's there. Then you'll need to provide a summary paragraph, with citations, that elaborates on what's going on for that particular feature. So if the system uses two-phase locking, you'd say in the summary paragraph that it's strict two-phase locking, and whether it uses deadlock detection or deadlock prevention. The idea is that there will be things you can click to specify the options, and then additional space for you to clarify what's going on.

The website we're building to host this encyclopedia is at dbdb.io. It's currently locked down because it's not available to the public yet, but I'll post the username and password on Piazza so everyone can take a look. The current version of the website is incomplete; it's not what you'll end up using, and we're still working on developing it, but I wanted to announce this now to get people thinking about what they want to do, and it'll be made available in a week or so. Just like with the project three topics, we'll have a sign-up sheet for you to specify which database management system you want, and to be fair, it's first-come, first-served.

Now, you may think it's better to pick something like MySQL or MongoDB or one of the more popular systems out there today, because there's going to be a lot of information available; but that also means I expect a really well-documented, comprehensive article about that system. Whereas if you choose an obscure system, maybe one from the 1980s that has only a few publications and saw little commercial success, you're obviously going to have less information to cite and use in writing your article, but it's going to be more work than the MySQL case, because you're going to have to struggle to find that information. So it's a trade-off between the two, and at this point I don't know which one's easier.

To give you an idea of what dbdb.io is going to look like: there's a form where you select the different options and fill in the paragraphs, there's a place to add your citations at the bottom, and there's some generic information, like when the system was started, when it was discontinued, who owns it, what languages it's written in, things like that. And then there's a search feature over all the feature categories you specify here.
You can select which of these to search on, look at time series, and things like that. This is obviously not something you write; it's generated from the information you provide using the standard taxonomy. So, my current list of database systems: I've been keeping track of this list for the last couple of years, because I knew I eventually wanted to do this project, and I currently have 456 different database systems. So there should be no problem finding one that you think is interesting and want to write about. And again, these include both academic ones and commercial ones. Here, in no particular order, is a smattering of different database systems that I have logos for. You can pick any one of these, okay? I'm not endorsing any one system over another; I think they all do different things that are all very interesting.

Okay, I warned you guys about this at the beginning of the semester, and I'll say it again: the article you provide has to be your own writing, with your own images and diagrams. That means you can't just copy and paste whatever you find on the web and put it in the article. I don't care that they wrote it better than you could ever write it; that's not what we do here at CMU, and it's considered plagiarism. Everything has to be in your own words, and you have to provide proper attribution and citations for anything you reference in your article. If I find that you're cheating or stealing content from other people, then not only will I not give you the extra credit, I'll fail you in the course. Because what we want to do is put this website out as a public service; we'll license everything as Creative Commons so people can use it, and it's not going to help us if the stuff we put out is stolen.

Okay, so... damn, my dates are way wrong. Oh, there we go, that's right, March 14th. All right: project two, again, is due next Wednesday at midnight. I gave you guys a week's extension, and again, we're going to be a bit stricter and harsher about how we test and check whether your source code is good or not. And then the project three proposals will be due on Monday, March 14th, after you get back from spring break. Of course, I'll be available during my regular office hours, or by appointment, to discuss what you want to do for project three. I'll send email out later today with the sign-up for the extra credit and the sign-up for the project three topics. Any questions?

Actually, there is one point I wanted to bring up. Someone asked about using a library for project number two and the Bw-tree. For that, I don't think we want you using outside stuff. But for project number three, by all means, you're allowed to use anything you find that you want to use for your implementation. I would limit you to Apache-, MIT-, or BSD-licensed code. And if you need to use a library or a package, make sure it's part of the standard Debian or Ubuntu repositories. If it's an obscure library that you have to download and install specially, then we should talk about whether that's the right thing to do or not. But again: don't take any GPL code, and don't take any other code whose license isn't Apache-compatible. And again, I'll write up what I mean by this on Piazza and on the website. Okay? We're done. See you guys. Enjoy your spring break. Nobody get in trouble. Nobody get arrested.
Nobody get herpes. It's not good for anybody.