All right guys, welcome to 15-721, Database Systems, Database Systems Internals, Advanced Database Systems. It goes by many names; I don't know what it says on the register. So I'm passing around two sheets here. The first is for the people that are actually enrolled in the class, and the second is for the people on the wait list. Right now we're maxed at 30 people, so if someone dies or gets arrested and a slot frees up, we'll go to the wait list. So just find your name and put your initials, and please only put your initials. I don't care that your friend is at rehab and can't make it today. Just put yours down. Okay? All right. So welcome. It's a new semester. This is the first time this course has been taught at CMU in 10 years, so all the material is new, and it's going to be a lot of work to provide you guys with up-to-date stuff. So normally when I start a new class, I like to give a quick pop quiz to see what everybody knows and to help motivate why you should take this course. All right? Oh wait, my clicker doesn't work. It would help if I plugged it in, right? Does that work? Okay. All right. So here's the pop quiz, two questions. Does anybody know who this person is? Anybody? Take a guess. It's not my dad. Okay. Anybody know who this is? Who's this? Mark Zuckerberg. Right. Mark Zuckerberg, the founder of Facebook. This person is Jonathan Abrams. He founded a startup, a website before Facebook, called Friendster. Who here has heard of Friendster? Very few, right? When I was an undergrad, before Facebook, before Myspace, Friendster was the big social networking website. It was Facebook before Facebook. So why do we know the Zuck and not this guy? Why did Facebook succeed while Friendster died? Friendster was there first. Well, when Friendster got started, they started having a lot of users and their database was struggling. Right?
So every time you would load the page (we're talking like 2001, 2002), the database would be super slow because there was too much data and it was getting hammered. One of the things that Zuckerberg did correctly was he hired some really smart people to help him optimize their database system so that they didn't meet the same fate that Friendster did. Right? Now Friendster, I think, still exists in name only, as like a Malaysian gaming app. But part of the reason Facebook was able to succeed where Friendster failed is because they were able to deal with the database problems that they were having. They had some really smart engineers scale the thing out and apply some of the optimizations that we're going to talk about in this class. So this is why you want to take 15-721. Databases are still a hot field, and database developers, people that can actually work on the insides of database systems, are in huge demand. And there are a lot of unsolved problems that are very interesting, that we're going to talk about in this class, and that are still definitely relevant today. One of the things I've noticed in my time in computer science is that if you're good enough to hack on the internals of a database system, that holds a lot of weight in the industry. Like, if you're a JavaScript developer, they're not going to have you go write some complex systems code. But if you can work on the internals of a database system, and you can deal with writing correct, high-performance code, then you can apply those techniques to almost any other database problem or any other computer science problem. So being able to say, yes, I know how to hack on the inside of a database system, is like knowing how to hack inside the Linux kernel, right?
It's a very challenging skill to acquire, and it's definitely in demand. So, all right, for today's class we're going to split things into two parts. First I'm going to talk about what this course is going to be about and what the expectations of you are as a student. Then, in the remaining time, I'll do a quick crash course on the history of database systems as I see it. So we'll spend most of the time at the beginning talking about the course, what's expected, and everything, and then we'll end with something as a prelude for when we start the real content on Wednesday. Okay, and for the people that got in late, there's a sign-up sheet going around: one sheet is for people that are enrolled, and one sheet is for the people on the wait list. So just initial next to your name so that I know you're here. I'm not going to be taking attendance throughout the semester, just for the first and second classes, to figure out who is actually here. Okay, so the core objective of this course is to learn about the modern practices of database systems: how do you actually build a database system from scratch? What are all the things that you need to know about and care about when you do this? We want you, the student, to be proficient by the end of the semester at writing correct code and high-performance code. Those things go hand in hand when you're doing database system work, because you don't want it to be the case that someone writes data one week, comes back three weeks later, and the data is different. That's bad; we don't want that to happen. So you're going to be working on a large code base for all the projects, which I'll talk about in a second. And we really care about you writing good code. We care about you documenting your code and writing tests, right?
Database systems are probably the most well-tested software that exists in the world, beyond operating systems and probably, like, the space shuttle. We're going to spend some time discussing how to do code reviews. So when you do the final project, you're going to be paired up with another group, and you'll look at their code and make suggestions on what they need to fix and whether things look all right or not. We'll apply a lot of the standard practices that are used in industry today. About this bullet here on working on a large code base: as I've been planning this course over the last year, I've talked to a lot of different companies and asked, well, if you have students coming out of a database internals course from CMU, what skills do you want them to have at the end of the semester? Do you care about them knowing how to write B+ trees? Do you care about them writing concurrency control? They all basically said the same thing: we care about the students we hire having the ability to work on a large code base that they didn't write, where the dude that wrote it has been gone for the last 10 years, and yet they can come in, figure out what's going on, and make a significant contribution. So that's another theme in this class: we want you to learn how to adapt to a really large code base that you didn't write and we didn't write, make significant changes, and do the thing that you want to do. And this last one here: most of you are either American citizens or planning to stay here in the US, so I'm gonna spend some time throughout the semester teaching you certain skills that'll be useful when you're out in the real world or on the streets. How to stay out of trouble, how to deal with the police, things like that. So again, we have extra time in the lectures.
I'll spend a little bit of that time talking about these kinds of things that I think will be relevant once you leave the confines of CMU. And hopefully we're gonna get a guest lecturer, a local defense attorney, to come and talk to us about some other things you should be wary of, okay? All right, so if you haven't figured this out, the overall topic of the course is database system internals, okay? Specifically, we're gonna focus on single-node, in-memory database systems, and I'll discuss what an in-memory database is on Wednesday. But if you're expecting this course to be about distributed systems or distributed databases, that's not what we're doing here at all. We're gonna ignore distributed systems entirely. And the reason is that, yes, eventually, if your database gets too big, you may have to go distributed, but you have to be able to deal with all the single-node problems that we're gonna talk about in this course before you can get to the distributed problems. So for pretty much every single topic in this class, we're gonna ignore the multi-node or distributed case, and really just deal with how you make a single-node, shared-memory system scale up and perform better. We're also gonna be focusing on state-of-the-art implementations of the techniques you need in a database system. That means we're not gonna go back and read the papers from IBM from the 1970s and 1980s about the classic implementations of the algorithms and methods and data structures. We're really gonna be focusing on the modern incarnations, the modern variants of those techniques, all right? So in general, the layout of the course is roughly as follows. Starting next week, we'll talk about concurrency control, because that's a big part of having multiple transactions run at the same time and run correctly.
Then we'll spend about a week and a half talking about indexing methods and data structures. We'll talk about storage models, compression, modern join algorithms for multi-core and many-core environments, logging and recovery methods, query optimization, query compilation, and new storage hardware. So basically, you can combine all these things into a single system, and this will build your database system, right? We're ignoring SQL parsing and simple things like that, but all these techniques are what you need. If you say, I'm gonna build a new single-node database system that can do transactions and do analytical queries, you need all this stuff, and that's why we're going from the ground up and building all these things out, okay? Now, I'm making a huge assumption in this course that all of you have the appropriate database background. That means you have taken either 15-415 or 15-615 here at CMU, or some other introductory database course when you were an undergrad or somewhere else. Because I'm gonna be talking about all the classical algorithms that you would learn in that intro course, but then the modern variants of them. So I'm not gonna sit here and say, here's what a hash join is, here's what a sort-merge join is. I'm gonna say, here's how you do it when you have 20 cores running at the same time or everything's in main memory, okay? I'm also gonna assume that you have a general knowledge of the architecture of a database system, query plans and things like that. So specifically, the things that we're not gonna cover are SQL, serializability theory, relational algebra, and the basic algorithms and data structures. If you don't know these things, you should spend some time, maybe read a textbook, or talk to me and figure out whether this is actually the right course for you, okay?
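As a quick refresher on the kind of background being assumed here, a classic single-threaded hash join (the textbook version, not the multi-core variants the course covers) might look like this sketch in C++11. All the struct and function names below are made up for illustration:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal single-threaded hash join sketch: joins two tables of
// (key, payload) tuples on `key`.
struct Tuple { int key; std::string payload; };

std::vector<std::pair<Tuple, Tuple>>
HashJoin(const std::vector<Tuple>& build_side,
         const std::vector<Tuple>& probe_side) {
  // Build phase: hash the (usually smaller) build side on the join key.
  std::unordered_multimap<int, Tuple> table;
  for (const auto& t : build_side) table.emplace(t.key, t);

  // Probe phase: look up each probe tuple and emit matching pairs.
  std::vector<std::pair<Tuple, Tuple>> result;
  for (const auto& s : probe_side) {
    auto range = table.equal_range(s.key);
    for (auto it = range.first; it != range.second; ++it)
      result.emplace_back(it->second, s);
  }
  return result;
}
```

The intro-course version stops here; the course material is about what changes when the build and probe phases are partitioned across many cores over in-memory data.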
Cuz I'm gonna assume you know what serializability means, and we're gonna jump right into it when we start talking about concurrency control or lower isolation levels. Another big thing in this class: all the projects are gonna be in C++11, all right? We're working on a new database system here at CMU that's based on Postgres, but we've updated it so that it's now C++11. You'll do all your projects based on that, and we'll provide you with scaffolding and instructions for how to do the first two projects, but the last project is definitely gonna be open-ended: you'll pick what you wanna do, and you'll be off and running in parts of the code that we haven't even looked at yet. So if you're not comfortable with debugging a multi-threaded program, using GDB or other tools, then again, I encourage you to spend some time figuring that out before things get started. Maybe start the project early, look at some tutorials, and try to understand these things, cuz I'm not gonna teach you how to use GDB in this class; I'll assume you already know it, okay? So now the course logistics. Let's talk about what the course is gonna be about, what the assignments are gonna be, and how you're gonna be graded. All the course policies, the schedule, the reading list, and the syllabus are available on the course website, so you can check out that link. I think it's 15721.courses.cs.cmu.edu; it's on my homepage. But everything I'll be talking about is here today. And I wanna stress this, I'm gonna stress this multiple times: if you're not familiar with CMU's academic policy about cheating, about plagiarism, about stealing, you need to go read it, okay? The systems faculty at CMU are actually cracking down this semester. We're not gonna be dealing with people plagiarizing and cheating anymore. We're definitely gonna report you to Warner Hall.
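If you haven't written multi-threaded C++11 before, here's the flavor of code you'd be debugging. This is a hypothetical sketch, not course code: several threads incrementing a shared counter, with a std::mutex making the increments correct. Remove the lock and you get the classic lost-update data race, exactly the kind of bug you'd chase in GDB with commands like `info threads` and `thread apply all bt`.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example (not from the course codebase): four threads each
// increment a shared counter 100000 times. The std::mutex serializes the
// increments; without it, concurrent read-modify-writes are lost and the
// final count comes up short nondeterministically.
int RunCounter() {
  int counter = 0;
  std::mutex latch;
  auto worker = [&]() {
    for (int i = 0; i < 100000; i++) {
      std::lock_guard<std::mutex> guard(latch);  // acquire/release per increment
      counter++;
    }
  };
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; i++) threads.emplace_back(worker);
  for (auto& t : threads) t.join();
  return counter;
}
```

Compile with `-g -pthread` so GDB has symbols and thread support; tools like ThreadSanitizer (`-fsanitize=thread`) will also flag the race if you delete the lock_guard line.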
So if you're in a situation where you don't know whether you're doing the right thing, please come ask me. I'd rather have you come and talk to me and say, hey, I found this code on the internet, I think I wanna use it in my project, is that okay? Let's talk about it first, rather than me finding out later on that you stole code that you shouldn't have. Because I have a lot of free time. I don't have any kids. Warner Hall is a two-minute walk. I will go over there. I will report you and wreck you, okay? I don't care what you did in undergrad or what you think is okay; plagiarism is not cool, and we're definitely not going to tolerate it. You'll see this multiple times. Okay, so my office hours will be on Monday and Wednesday, one hour directly after class. But if you need to meet with me at another time, just come let me know. When you come talk to me, we can talk about implementing the projects. We can talk about the papers. We can talk about databases in general. And one thing that I've noticed in my previous classes that people end up liking to do, which I don't know why, is they often come to me and talk about their relationship problems. At first I was like, all right, I'm not going to do that. But it's happened so many times that I'm actually going to be open to this. But I have no official training or qualifications to help you with your girlfriend or boyfriend problems, so take that with a grain of salt. But please, and this is true: if your girlfriend or boyfriend breaks up with you, please cry at home before you come to my office and cry, because that's not good for anybody. That was weird, okay. All right, so we have two TAs for the class. Unfortunately, they're not here right now at CMU, or actually even in the state of Pennsylvania, for legal reasons. The head TA for the course is going to be Joy Arulraj. He's the one that's helping us build the database system.
He's helping run all the projects, and he'll probably be your main contact when you have questions about the code. Right now, he's back in India because he had to turn himself in for, like, a 30-day bid as an accessory to armed robbery from when he was an undergrad. So he'll be back in a month or so, but you can't contact him now. Mu Li is in a sort of weird situation too; again, more relationship problems. Someone accused him of fathering a child, and I advised him to leave Pennsylvania. So he's currently in San Francisco, and he's unlikely to come back to Pittsburgh. He's going to be helping out with all the background stuff, the testing scripts, the utility things. But again, the main thing is that if you have questions about specific parts of the code when you work on your projects, Joy's going to be your main contact. Mu is not going to be able to help you, okay? All right, so now let's talk about what all the assignments are going to be. The first thing is that for every class, there's going to be one mandatory reading. If you go look at the course schedule on the website, you'll see that it's marked with a little yellow star saying that it's the mandatory or primary reading for that day. Before every class, starting this Wednesday, you have to turn in a one-paragraph synopsis of the reading assignment. This is not me trying to check whether you truly understood the paper, or whether you're making things up or whatever. It's really just a forcing function for you guys to read the paper before you show up in class, so that you're not sitting there learning things for the first time. This is a graduate-level course, so it's supposed to be a dialogue between us to discuss all the different aspects of these papers.
So the synopsis really just needs to be, again, really simple: one paragraph, with two sentences that say what the main idea of the paper is, a sentence that says what system they tested or modified to evaluate their methods, and then a sentence that says here are the workloads or benchmarks that they used. It's just to give you exposure to the research papers in the database systems field, all right? There's an online Google form; if you follow this link here, that's where you can go and submit it, and it'll be timestamped to say when you actually submitted it. You're allowed to skip four assignments throughout the entire semester. I think in total there are like 24 papers that are mandatory, so you only have to read 20 of them. Also on the schedule, you'll see the one marked with a star that says it's the primary reading, and then there'll be some additional ones that are optional. Those you obviously don't have to read. I'll cover them in the lectures because I think they provide additional information or another perspective on the mandatory reading. But they may also be useful when you go to do your final project: if you want to learn more about something we covered in class, the optional readings will be there for you as well. Okay, and everything is on the website; you just click it and download it. Okay, so again, not to beat a dead horse: please do not plagiarize your synopses. Do not find somebody else who wrote a paragraph about what these papers are, because other classes at other schools read the same papers, and they may have synopses online as well. Please do not copy what they wrote and try to pass it off as your own. And please do not copy what's in the paper and claim, well, this person is smarter than I am, so it would dishonor them for me to rewrite it in my own words. Do not do that; that's plagiarism too, okay?
And again, if you're not sure about what's the right thing, please read this or contact me, okay? So now the programming projects. As I said, all the programming projects are gonna be based on Peloton, which is the new database system we're building here at CMU. I'll talk more about what makes the system unique and interesting next class before we get started. But the main idea is that it's an in-memory database system that we've been developing for the last year or so, and it's based on Postgres. This is sort of the standard thing that everyone does when they build a new database system, and you'll see this throughout the course: they start with Postgres, all right? So we'll provide more details next class on how to get started with the first project. Again, it's gonna be a large code base that's gonna be pretty complex. But the nice thing about it is that we support as much of SQL-92 as possible. So you can start your database system up, open up a terminal, and start writing SELECT queries against it, and see that your thing actually works. This is a real database system; this is not a toy. And I think this will be something that's really worthwhile for you guys to work on. Okay, so for the programming projects, you're gonna wanna do all the development on your local machine. Unfortunately, Peloton only compiles and runs on Linux. We spent about a month last semester trying to make it work on OS X, but we just couldn't get it to compile, okay? So all the development has to be done on Linux. If someone wants to be super, super motivated and try to get it to work on OS X, let me know, and we can try to make it happen. But we spent a while on it; we couldn't get it to compile, we couldn't get it to link. To get around this, we're gonna provide you with a Vagrant configuration file, so you'll get an Ubuntu virtual machine that you can run on OS X or Windows or whatever.
And you can do all your development inside of that, okay? Then, when it comes time to do the final project, or if you wanna run additional benchmarks to see how fast your implementation is for some of these projects, we'll have a cluster set up later this month. You'll be able to log in to machines that probably have more power and more RAM than what's on your laptops, so you can try larger data sets and more complex things. We're super happy that MemSQL, which is a leading in-memory database startup out of San Francisco, donated all the machines we're gonna have in the cluster for you guys to use. So we're super happy and super grateful for that. Okay, so the first project will be done individually, and the second project will be done in a group. For these first two projects, everyone's gonna implement the same thing. We'll tell you to implement a hash join for the first project, and then a concurrent B+ tree for the second project. That way we can actually check whether everybody has done the correct thing, and we can benchmark you and see who has the fastest code. We'll provide you with all the test cases and instructions on what you need to do. The second project is gonna be done in groups of three; we have 30 people enrolled in the class, so we'll do 10 groups of three. And again, since everyone's working on the same thing, I can't stress this enough: you cannot take code from other people in the class, and you cannot take some random thing you find on the internet and try to pass it off as your own for these first two projects. That's considered plagiarism, that's considered stealing, and I will report you.
I'm in an argument right now with Warner Hall because I'm trying to get them to let me list all the students that plagiarized in my class on my website, forever, and they won't let me do that. So I'm dead serious: do not plagiarize, okay? Project three is the open-ended one. This is the final project for the course. You'll be in a group of three people, and you'll pick some project or idea that you want to implement in the database system, based on the things that we've talked about. The basic criteria for what I'll accept as an appropriate project are that it has to be relevant to the things we talked about, and it has to require a significant programming effort from all the people in the group. You can't just have one dude write all the code while the other two people do nothing, okay? And I don't want two groups picking the same topic. So if one group says, we want to implement a skip list, then no other group can pick the same thing, even if they think they can do a better job. We want a wide scope of different topics. You don't have to pick a topic until after spring break, which I think is around the first week of March, and we'll provide you with some sample topics or things that we think would be interesting if you can't think of anything on your own. I know there are a lot of PhD students in this course as well, so I highly encourage you to pick something that's actually relevant to your research. Not everyone here is doing a PhD in databases, but if there's something in your own research that you think you can apply to this course project, by all means do that. I think that makes for some really interesting work. All right, so for project three, there are gonna be five deliverables throughout the semester.
These are sort of like checkpoints along the way to make sure things are working out okay. The first thing is that after spring break, you'll have to do a five-minute proposal in front of the class and say: this is our project, this is what we're gonna do, and this is how we think we're gonna implement it. This is basically just for you to figure out exactly what you're gonna be doing, and for me to approve it and say, yes, that's a good idea. Then, about a month later, we'll have a checkpoint or status update for the class. Again, this will be another five-minute presentation where you basically say, here's what we've done so far, here are some problems we ran into, and here's how we think we're gonna be able to finish. Now, when we get to the end of the semester, there are gonna be three things that happen. The first is that we're gonna pair off groups, assign them to each other, and have them do code reviews. This is a standard practice in industry now, where you have people look at other people's code and provide them feedback, suggestions, or criticisms about what they did wrong, what they could do better, or what doesn't make sense. I want you guys to get a feel for what this is like in this course. So you're gonna be looking at other people's code and providing them with useful feedback to help them improve it, and I'll be grading you based on your participation in this process. If you just say, yeah, this looks good, whatever, that's not helpful for anybody. We'll have this all hooked up on GitHub to make it real easy for you guys to communicate and improve the quality of your code. Then, in the period for the final exam (I don't know what date this course will be assigned yet), we'll have like a three-hour session where everyone will come and present the final status of their project. So this will be a lot more time.
And so you wanna include any kind of performance benchmarks or measurements that you did for your project. If you have a cool demo, that would be kind of nice as well; I'll cater this. That's sort of the end goal: present this thing and share it with the class. But the project is not considered done until you're able to provide us with a patch or a pull request on GitHub with your final project, right? Now, we're not gonna merge everybody's code in, because if you write something that runs on some esoteric hardware, we're probably not gonna want that in our system. But we wanna be able to merge it cleanly, as if we did want it. That means you're gonna have to have fixed all the code review comments that came up in the feedback process. You have to document everything, provide instructions on what your thing actually does, and have it be a real deliverable that you're gonna provide us, one that we'd want to merge into the master branch of our system. This is not writing code for a one-off project just to get a grade, where you're done and no one ever sees it again. We want you to write high-quality code, because this is what would be expected if you were actually working at a startup or a database company, trying to build a real database system, okay? I'll talk later in the semester about what's expected for the final code drop, okay? Now, we're also gonna be having a final exam, and instead of having this on the assigned final exam date, you know, whenever the reading period is, this will be in the last class, on Wednesday, April 27th. The class is only about an hour long, so it'll be an hour-long exam.
But it'll be a long-form, written, short-answer exam that asks you to synthesize the topics we talked about in this class and show that you understand the trade-offs and implications of the different techniques we covered. So it won't be true/false, it won't be multiple choice; it'll be short-answer questions, okay? Now, the last piece is that we're actually gonna be offering extra credit for the course as well. As part of the Carnegie Mellon Database Group, we're working on a new online encyclopedia of database management systems. You can kinda think of this as the database of databases, right? It's sort of a Wikipedia-style thing where we want anybody to be able to come and edit it, but it's more structured than what Wikipedia supports. So we're gonna be offering extra credit to anybody that wants to write an article about some database management system. It could be an academic one that we discussed in class, it could be a commercial one, it could be a historical one that doesn't exist anymore. The idea is that we wanna build out this encyclopedia of all the different database management systems that have existed throughout history. There are about 300 to 400 or so that I've counted, but there's no really good single source on what they did, why they were important, and how they implemented certain things. So the idea is that you pick some system, you write the material on what it did, you add it to our website, and you're cited as an author for it. This is not writing a Wikipedia-style essay; it's more semi-structured. For things like concurrency control, there'll be a list of options you can select from, so that we have the same taxonomy across all systems. If you ever look at Wikipedia and see how they describe things,
one article for one system will say it's 2PL one way, and another will say it's two-phase locking some other way, right? We're trying to have a uniform, standard nomenclature across all these systems to make them easier to compare. Then we can say things like, show me all the multi-version concurrency control systems from the 1980s. Currently you can't do that with Wikipedia. So again, this is optional; this is extra credit; you're not required to do this. I'll talk more about it later in the semester as we get closer to actually launching the site. And again, please do not plagiarize these articles, because your name's gonna go on them. If someone sees that you took images or text from what they wrote and tried to pass it off as your own, they're gonna come back to you and complain, or they're gonna complain to me, and then I'm gonna complain to you. So again, please do not plagiarize. I don't care that the person wrote it better than you; you have to write everything in your own words, okay? I can't stress this enough. Okay, so here's the breakdown for the grade: the reading reviews are 10%, the three projects are 70%, and the final exam is 20%. As you can see, I mostly care about the programming stuff. The final exam is just to make sure that you understood what we talked about in class, but I really want you to come out of this able to write code in a complex database system, and that's why we're placing the emphasis on the programming assignments when it comes to the final grade. Then we'll tack on up to 10% extra for anybody that does the extra credit, okay? So the course mailing list will be through Piazza; here's the link, it's also on the website, so please sign up if you haven't already. We're gonna use this for all the technical questions about the projects, right?
So don't email Joy or me directly, because if you have a question like "it's not compiling correctly, I'm getting this weird error, I don't understand" — we want that all to be captured on Piazza, so if another student has the same kind of problem, they'll be able to go online and figure it out and maybe not email us at all. But for any other non-project questions about logistics of the course or other things like that, please email me and not the TAs. Okay, so any questions about the game plan or roadmap for the course for the entire semester? Any questions about the expectations, any questions about the assignments? Blank stares, okay. All right, so for the rest of the class we're gonna do an abridged history of databases. This material is derived from two papers that are listed on the website. The first is an article written by Mike Stonebraker called "What Goes Around Comes Around," and he wrote it as an introduction to what is called the Red Book in the database community. Basically it's a compendium, a collection of important database papers. His introduction talks about the history of databases up to around 2006. And then the second paper is called "What's Really New with NewSQL," and this is the article that I wrote with Matt Aslett, who is an industry analyst out of London. It basically picks up where Stonebraker left off in 2006 and talks about these NewSQL systems that have come around in the last five years or so. Okay, so let's start from the beginning. All right, so one of the aspects that I think is important to understand about databases is that a lot of the issues that they had in the original systems from the 1970s are still really relevant today, right? Just the scale of things, the scale of the problems, has changed, right?
So dealing with disk I/O, dealing with lock contention and things like that — those are all the same problems they had back then, but now we're doing it with a hundred different cores running on the same machine, dealing with much larger data sets because we have the internet, and with a lot more concurrent users, right? So all the same things that they worried about back then we still have to worry about today. And actually what's really striking is when you go read about the first relational database system that IBM built, System R, in the 1970s: they explored a bunch of ideas that they thought were interesting but ended up abandoning because they couldn't make them work with the existing hardware. But now there are companies that are basically adopting those same ideas, and now it works and makes sense because the hardware is able to support it, right? So one example of this is they had this idea in the 1970s of taking data from related tables and combining them together and storing them on the same page, right? You have a customer table, you have an order table, you take all the customer's orders and you pack them onto the same page as where the customer is, right? And then it didn't work because it was too difficult to manage, too difficult to implement. But now Google does this in Spanner with column families or protocol buffers. They pack things together that are related, so now it's only one fetch to get the things you need, right? So a lot of these same things that they worried about then we still worry about today. Another interesting thing is that I feel that the debate that's going on, if you want to call it that, between the SQL versus NoSQL crowds is reminiscent of — basically the same as — the debate that existed in the 1970s between the relational model people and the network model people, the CODASYL people. And I'll talk about what CODASYL is in a second, right?
But basically the NoSQL guys are saying, well, you don't need transactions, you don't need joins, you don't need the relational data model, right? You write everything as a key-value store or as JSON, and that's better, right? It's the same thing that they said back then, and those guys lost — none of you have ever heard of CODASYL, right? And now in 2016 all the NoSQL guys except for MongoDB are adding back in all the stuff that they said they didn't need. They're adding back SQL, right? They're trying to add back transactions. So again, a lot of the ideas that we'll be talking about are not necessarily new; they're just new adaptations of them to work on modern hardware. Okay, let's talk about the very beginning. So the very first database system that was ever developed was this thing called IBM IMS, the Information Management System, right? So IBM developed this for the federal government to keep track of all the parts and suppliers for the Apollo moon missions, so they were essentially working for NASA, right? And IMS still exists today. I think it's actually one of the number one best sellers for IBM, or it makes the most money for them, right? Because they have all these banks that picked it up in the 70s and still have to maintain these systems over the years. I think for a lot of ATMs, the backend of the system actually still runs on IMS, right? And so IMS is based on three key ideas. The first is that you have a hierarchical data model, and I'll show what that means in a second. The second is that you have programmer-defined storage layouts for the actual collections of data. What that means is, say I have two collections of data — think of them as tables — you would say I want this table to be represented as a B+tree and this table to be represented as a hash table.
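To make that concrete, here is a toy sketch in Python — nothing from IMS itself, with made-up data and names — of how the choice of physical structure leaks into what queries a collection can support: a dict plays the hash table, and a sorted key list stands in for a B+tree.

```python
import bisect

# Made-up "collection" keyed by part number.
records = {5: "battery", 17: "cable", 9: "bolt"}

# Hash-table layout: great for point lookups, but no key ordering.
hash_table = dict(records)

# B+tree-ish layout: keys kept sorted, so range scans are cheap.
btree_keys = sorted(records)

def point_lookup(key):
    return hash_table.get(key)            # O(1) equality lookup

def range_scan(lo, hi):
    # Only possible because the keys are ordered; with just the
    # hash table you'd have to scan every bucket to answer this.
    i = bisect.bisect_left(btree_keys, lo)
    j = bisect.bisect_right(btree_keys, hi)
    return [(k, records[k]) for k in btree_keys[i:j]]
```

The point is exactly what comes up next in the lecture: pick the hash layout and the system simply cannot expose a range-scan API on that collection.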
And then once you made that choice, IMS would expose a certain API to you that you then used to make queries against those collections of data. So what that meant is if you picked, say, a hash table, you couldn't run range queries on it, right? The other weird thing about IMS is that it had this tuple-at-a-time interface where you had to write your own procedural code to traverse the data structure and access tuples one at a time. Whereas the relational model is based on a bag algebra, a set algebra, so you can deal with large numbers of tuples really simply with easy queries. So here's what the hierarchical data model looks like. We have a simple schema here, a parts-and-suppliers database. So we have an upper collection here of suppliers — these are the businesses that sell a certain item — and then you have another collection of parts, which are the items that they sell. And the way you would instantiate a database in a hierarchical data model based on this schema is that you would have the first collection here with all the supplier information, but then each of them is gonna have another data structure for all the products or parts that they sell. So if you wanna run a query that says find me all the suppliers that sell batteries, you would have to write a bunch of for loops to traverse the suppliers, and for each one, look at all their parts one by one by one. So there are a couple of problems with this, obviously. One, that's a pain in the ass to write — you have to write really complex queries. But then we also have duplicate data. We're gonna store that this guy sells large batteries and this guy sells large batteries multiple times, once for each of those suppliers, because there's no way to maintain relationships between these things. So one problem is you're gonna have duplicate data.
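The traversal just described can be sketched in Python. All the data and field names here are made up for illustration — IMS itself exposed a very different API — but the shape of the code is the point: hand-written nested loops instead of a query.

```python
# Hierarchical layout: each supplier record *owns* its child part
# records, so a shared part gets duplicated under every supplier.
suppliers = [
    {"sno": 1, "sname": "Acme", "parts": [
        {"pno": 10, "pname": "battery", "size": "large"},
        {"pno": 11, "pname": "cable",   "size": "small"},
    ]},
    {"sno": 2, "sname": "Ajax", "parts": [
        {"pno": 10, "pname": "battery", "size": "large"},  # duplicated
    ]},
]

def suppliers_selling(part_name):
    """'Find all suppliers that sell batteries', tuple at a time."""
    found = []
    for s in suppliers:            # walk every parent record...
        for p in s["parts"]:       # ...and every child under it
            if p["pname"] == part_name:
                found.append(s["sname"])
                break
    return found
```

Notice the duplicate "battery" record stored under both suppliers — there is no shared parts table for the children to reference.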
And the other is there's no independence between the physical structure of the database and the logical schema. Remember I said you have to specify whether a collection is gonna be a B+tree or a hash table. So that means if you pick a hash table and you realize, oh, I actually wanna run range queries on these things, I don't wanna use a hash table, I wanna use a B+tree — you have to dump the database out and then load it back in, right? And that sucks, that's a big waste of time and it's a lot of overhead. So this is around 1968, when IMS came out. And like I said, it's still being used today in a lot of different enterprise settings. So around the same time, after IMS, there was a movement called CODASYL, where a consortium of people that were working on COBOL programs got together and basically came up with a standard API, a standard interface, for how COBOL programs would access a shared database. And the basic idea of how CODASYL works is that it's gonna have that same tuple-at-a-time query interface that the hierarchical model had in IMS, but it's gonna use what's called a network data model. And if you don't know this person here, this is Charlie Bachman. He was sort of the lead proponent of the CODASYL group, and he won the Turing Award for databases, I think in 1973. I think he's still alive, but nobody uses CODASYL anymore. So here's basically how it works: you have sort of a graph structure with the different collections of data. So now you say here are the suppliers, here are the parts they're gonna have, and then you have this cross-reference relationship that combines the supplier side with the part side and says here are the things that the supplier is actually supplying.
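A rough Python sketch of that layout (record contents invented for illustration): suppliers and parts live in separate collections, and a separate link set is what ties them together.

```python
# Two independent record collections.
suppliers = {1: "Acme", 2: "Ajax"}
parts = {10: "battery", 11: "cable"}

# The cross-reference "supplies" set: one link record per
# (supplier, part) pair.  If this structure gets corrupted, the two
# collections above have no way to find each other again.
supplies = [(1, 10), (1, 11), (2, 10)]

def suppliers_of(part_name):
    # Scan the parts collection for matching part ids...
    pids = {pid for pid, name in parts.items() if name == part_name}
    # ...then walk the link set and follow each link back to its
    # supplier record.
    return sorted({suppliers[sid] for sid, pid in supplies if pid in pids})
```

Unlike the hierarchical layout, "battery" is stored once — but the price is that extra link structure on the side, which the application has to traverse by hand.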
And the weird thing about it is that you have these additional sets that represent the relationships between — I shouldn't use the word relationship — the existence of different elements of this set with this set, right? So you have to instantiate these additional data structures here to say that there's a linking between them. So now again, if I wanna run this query that says find me all the suppliers that supply large batteries, I have to loop through one by one in these nested for loops: look at all the suppliers, look at all the elements in their supply set, then I come down here and get all of these, and then in a separate traversal I go to the parts and walk down to get them, and I mash these things together. You're writing, essentially, nested for loops to do this. You're not writing a declarative language like SQL to perform this query. You have to write the exact steps of how you wanna traverse this network in order to get to the answer that you want. All right, that didn't show up. So the first problem is that, again, you're writing complex queries to do really simple operations. The second problem is that if these supplies or supplied-by sets get corrupted, then both of these collections get corrupted, because you have no way to link them back and forth. So again there's no independence at all between the logical schema and the physical representation. So around the same time that CODASYL was coming out, there was this guy called Ted Codd at IBM Research in San Jose. He was a mathematician, not a computer scientist. And basically he was walking around IBM and he saw all these programmers spending their time rewriting these IMS or CODASYL programs over and over again, every time the database changed.
So what happens is, because we're writing all these for loops to tell the database system how we wanna walk through things, if the layout of this thing changed, I had to go back and change all my code to be able to walk through it correctly. And that's really expensive to do. It's a huge waste of time. So what he came up with was this abstraction where the logical structure of the database and the physical structure were completely decoupled. And rather than having these nested structures, these hierarchical structures, these networks and things like that, you would have these simple abstract objects called relations that would represent the data for a single entity in your application. And then you can maintain the relationships between them directly within the data, and you don't have to have these extra sets on the side. The other key thing that Codd came up with was, again, rather than you writing your own for loops to loop through and find the thing that you want, you write in a high-level declarative language. You tell the database system what answer you want it to compute, and let the database system figure out the best way to do it. It's sort of like the same idea in compilers, right? Nobody writes assembly anymore unless you really have to. You write C and say this is essentially what I wanna happen, and then the compiler figures out the best way to optimize your code and compute the answer you're looking for. So again, the other key thing is that how the data was actually stored on disk or in memory was completely independent of the logical schema. So no longer do you have to specify I want this table to be a hash table and I want this one to be a B+tree. You just say I want these tables with these attributes, and then the database system makes the best decision about how it should store them.
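For contrast with the hand-written traversals above, here's the same parts-and-suppliers query written declaratively. SQLite stands in for any relational system, and the schema and rows are invented for illustration — the point is that the join strategy is the database's problem, not yours.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE part     (pno INTEGER PRIMARY KEY, pname TEXT);
    CREATE TABLE supplies (sno INTEGER, pno INTEGER);
    INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Ajax');
    INSERT INTO part     VALUES (10, 'battery'), (11, 'cable');
    INSERT INTO supplies VALUES (1, 10), (1, 11), (2, 10);
""")

# No hand-written traversal: we declare *what* we want, and the
# system decides how to walk the data (scan, index, join order...).
rows = db.execute("""
    SELECT s.sname
      FROM supplier s
      JOIN supplies sp ON sp.sno = s.sno
      JOIN part p      ON p.pno  = sp.pno
     WHERE p.pname = 'battery'
     ORDER BY s.sname
""").fetchall()
```

If SQLite later changed how these tables are physically stored, this query would not change at all — that's the physical/logical independence being described.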
And now, because you're dealing with it at a logical level in a declarative language, if the underlying storage layout of the data changes, you don't have to go rewrite your queries on top, right? Everything just works. So I don't have to explain what the relational model looks like, but basically now we have that same parts-and-suppliers database, we have relationships between all these things, and we don't have to worry about traversing them or about set inclusions and things like that. We just deal with these abstract objects in SQL, which is really nice. So in 1970 Ted Codd came out with his paper and said here's how you really want to build databases: we want to use the relational model, we want to have this abstraction, we want to write our queries in high-level languages, right? But IBM was still making a killing with IMS, and the CODASYL guys were off doing their thing. So it was basically two or three groups in California that got together, took Ted Codd's paper, and actually tried to make it work — tried to build a database system that implemented the relational model. The two most famous ones are System R, out of IBM Research in San Jose, and then Ingres, which was a project out of UC Berkeley led by Mike Stonebraker, who won the Turing Award this year and who was one of my advisors when I was in grad school. They were basically building their own database systems based on the relational model. Around the same time, Larry Ellison founded the Oracle Corporation, and he would actually call the IBM guys on the phone all the time and ask them what their database system did. They would tell him, and he would go implement the same thing, right? Again, here's another motivation for why you should take this course. Has anybody heard of Jim Gray? Jim Gray was a famous database guy at IBM who helped build System R. He won the Turing Award in the 90s.
He was lost at sea in 2007. People loved this guy so much that when his boat went missing off the coast of San Francisco, they moved satellites around above to try to find him, right? So he won the Turing Award. Mike Stonebraker won the Turing Award. Larry Ellison is the fifth richest man in the world, right? This is why you should take database systems, right? Okay, so Ingres — they commercialized it. IBM never commercialized System R, and Oracle went off and did their thing and became a real company. So then in the early 1980s, what happened was that although IBM never put out System R, because that would cannibalize their sales of IMS, they eventually saw the light and saw that the relational model was gonna win, so they ended up building a new database system with bits and pieces of System R, called DB2. This got released in 1983, and when DB2 came out it was announced that it was gonna use SQL as the declarative language, because in the 1970s it was SQL versus this other language called QUEL. SQL came from IBM, QUEL came from the Berkeley guys, and if you look online they're roughly equivalent — they more or less do the same thing. Stonebraker says QUEL is easier to read. I disagree, but whatever. So when IBM, who were sort of the king of enterprise software at the time, came out and said our new database system uses the relational model and uses SQL, SQL essentially became the de facto standard. And Oracle, since they were copying what IBM was doing all along, already supported SQL, so they were at the right place at the right time with a relational database that supported the standard IBM was pushing. Around the same time, by '85, '86, a bunch of other enterprise relational database systems came out on the market. Informix and Sybase are probably the two most notable ones, and all of these have connections with Mike Stonebraker.
Ingres sort of floundered around '85, '86, and then Stonebraker went back to Berkeley and built his next project, called Postgres, which is what everyone knows about today. It's called Postgres because it's the thing that came after Ingres. And Oracle got to sort of continue their dominance in the relational database world. So it's funny, because IBM's DB2 and all these systems are obviously still available, but Sybase got bought by SAP and they're not really as big as they used to be, and then Informix got bought by IBM. I think it's still there, but again, if you were building a new startup you wouldn't pick those systems, you would use something more modern. All right, so then in the late 1980s there was a new movement that came along, sort of like the NoSQL guys of today. They basically said, well, the relational model is not what you wanna use when you're writing object-oriented applications or using an object-oriented programming language like C++, because inherently in C++ you're working with objects, and when you wanna store them in a database system you have to do this work to convert them from an object into relations. This is very cumbersome and very slow, and what you're better off doing is having a tight coupling between the database system and your programming language or programming environment, so that you can write objects natively and store them natively in the database system, and you don't have to worry about splitting them up and joining them back together across multiple relations. So there were a bunch of these systems that came out in the 1980s. You've probably never heard of them, because they don't exist anymore, or they exist but are barely used at all. But a lot of the techniques that they came up with at the time are actually used in some of the NoSQL systems and relational database systems today. So basically you can think of the object model as the same as a document database, a JSON database, or an XML database.
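To make the object-versus-relational mismatch concrete — the class, names, and data here are all invented for illustration — compare storing an object that has a nested list relationally versus as a single document-style record:

```python
# Relational layout: the nested list is pulled out into a second
# table, linked back to the student by a foreign key (the id).
student_rows = [(1001, "Alice", "alice@example.com")]
phone_rows = [(1001, "444-4444"), (1001, "555-5555")]

def load_student_relational(sid):
    sid_, name, email = next(r for r in student_rows if r[0] == sid)
    phones = [p for s, p in phone_rows if s == sid]   # the "join"
    return {"id": sid_, "name": name, "email": email, "phones": phones}

# Object/document layout: the whole object stored as one record, so
# a single fetch brings back the phone numbers too -- no join needed.
student_docs = {
    1001: {"id": 1001, "name": "Alice", "email": "alice@example.com",
           "phones": ["444-4444", "555-5555"]},
}

def load_student_document(sid):
    return student_docs[sid]
```

Both paths rebuild the same object; the object/document layout just does it in one fetch, which is exactly the pitch these systems were making.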
So to give you an example of how the object-oriented model works: say this is your application code, and you have a class that defines a student. A student has an ID, a name, an email, and then a list of phone numbers. If you had to implement this in the relational model, you would have a student table, and then a foreign key reference from a student phone table — it has the student ID that maps each phone number back to the student that has it. And now when you actually instantiate the database, you would have a layout like this: you would have one student record, and then you would have all the phone records that correspond to these fields here. So now if you wanted to instantiate this object in your application code, you would have to do a join between these two tables to get all the phone numbers for that particular student, or you'd have to do two separate queries, which is slow because that's two round trips back and forth between the application and the database system. So the object-oriented guys say this is terrible, you don't want to do joins. What you really want to do is, when you instantiate an object, store it directly in the database system like this. So now when you say I want to go get a single student, you fetch that one record and you get all the phone numbers as well. But this has the same problems that we saw in the hierarchical and network models, where you're going to have to write complex queries to do anything beyond just getting one object. If you want to say find me all the students that live in this location and have these types of phone numbers, you're going to have to write procedural code again — nested for loops — to iterate through all the different tuples, all the different objects, and find the answer you're looking for. And that means if the schema of your object changes, all your code is going to break. The big thing, I think, that was the big downfall for these guys is that there was no standard API for how to do this, right? The different programming languages and the different object-oriented databases all had their own APIs, their own ways of doing reads and updates and things like that, right? And so if you wrote your application for one database system, you could not easily port it to another. Now you may say, oh, SQL is the same thing, because there are all these different dialects of SQL and you can't take your Oracle application and make it work immediately on PostgreSQL or MySQL. That's true, but it was even worse with these guys, right? And now you sort of see again why the NoSQL guys are slowly adding back SQL, right? Cassandra is a big NoSQL system; they have this dialect of SQL that they call CQL, but it more or less is SQL, right? And Hive and the other database systems built on top of the Hadoop infrastructure are all adding support for SQL. All right, so that got us through the 1980s. The 1990s were actually pretty boring in the world of database systems, right? No workload really changed — it was before the internet got really big. The hardware got better, the disks got better, but there was no major change in how people used database systems in applications, right? So the only things I would consider notable that happened were these. Microsoft made a deal with Sybase to port their code to make it run on Windows NT, and then they essentially just changed the name, and that became what's known today as SQL Server. And it's funny, because SQL Server is actually — what I consider is they're adding a lot of new ideas that I think are pretty cool, whereas the original Sybase is somewhat stagnant.
Some Swedish guys took an earlier system called mSQL, which was a precursor to MySQL — they sort of threw all the code away and started over, and this ended up becoming what's known today as MySQL. And then for Postgres, the team went and formed a company called Illustra that eventually got bought by Informix and sort of fizzled out. But then some grad students at Berkeley took the original Postgres code that had been developed in the university and added support for SQL — because before that it had supported QUEL, since that's what Stonebraker really liked — made it open source, made it available, and then people actually started using it. And now Postgres is, I think, one of the best open source database systems out there today. But again, nothing really major beyond these three things happened during this time. So then we get to the 2000s and we hit the internet boom, right, the dot-com days. So a couple of things happened. The first is that the number of concurrent users trying to access the database grew significantly, right, and the size of the data sets grew significantly as well. It used to be that maybe only a handful of major enterprises had what would be considered a large database; now you could do a startup and immediately have a large database overnight.
So at the time, in the early 2000s, you had all the major database companies that we've talked about so far: you had Informix, you had the Oracles, the Sybases, the IBMs, the Teradatas, things like that. But these systems were closed source, very expensive, and very heavyweight, so they weren't quite ready to support the large number of users you would have in an internet application. And the open source systems like Postgres and MySQL in the early 2000s were still pretty young and still not, I would say, battle-ready for really hardcore applications, right? So what some people did to make this work was write their own middleware to scale out a bunch of single-node databases and have them appear as a single logical database. eBay did this, Facebook obviously has the largest MySQL cluster that's doing this, Google did this. So they took these simplistic databases and built support for more complex operations, or for scaling out, on top of them, right? Again, the major players were simply too expensive to do this with. The other big thing that happened in the 2000s was the advent of special-purpose data warehouses, right? And these are databases that are not meant to be updated through transactions; these are databases where you store all the data that you've collected from your front end in this back end, and do all your analytics and analysis on that, so that you don't slow down the front-end system that's serving customers. And the big thing about these systems, the major thing that came out of them, is that they were designed from the ground up to support OLAP queries efficiently, right? Rather than being a general-purpose system, which is sort of what Oracle or DB2 tries to be, their internals are written such that they can run analytical queries more efficiently. So Netezza is probably a bit earlier than what I'd consider the group for these guys, but this was an FPGA
based system that IBM bought. I'd say of all of these, the only one I think was truly novel and really groundbreaking was Vertica. But all these guys have been bought: Netezza got bought by IBM, ParAccel got bought by Actian (which is the current version of Ingres), and Vertica got bought by HP. And one of the big ways that these systems were able to do much better than the incumbent systems is that they supported the decomposition storage model, DSM. So essentially they were column-store systems. Rather than storing data in a row-oriented format, as you would learn about in an intro class, they break up the attributes and store each column contiguously in memory or on disk, so that when you run an analytical query that only has to access a small number of those attributes, you're not going and reading a bunch of rows or a bunch of attributes that you don't care about. And the performance difference you can get in a column store versus a row store for analytical queries in a data warehouse is massive, enormous. I had a friend who was a sales engineer at Vertica, and he went down to Australia to visit a telecom, and they had some query at the end of the month that would process bills and things like that. It would literally take days — one SQL query would take days. They loaded everything up in Vertica and it got cut down to minutes. This is huge; this was a big game changer for the database world when these specialized data warehouses came out. So then we get into the late 2000s, and this is when the NoSQL guys sort of show up. They had all the same problems that the early dot-com companies had, where the existing databases were not able to scale out and scale up and support the large number of concurrent users that they needed to have. So they took a different route: they actually just wrote their own database systems, and they made some sacrifices in order to speed up development and make these things easier to build, by giving up
on transactions and joins and other things like that — and then they later said, oh, this is a feature, this is what you want. So the key thing about the NoSQL systems is that most of them are non-relational. They're not going to use the relational model; they're going to store data as documents, as key-values, as column families, or things like that. They're going to store things without support for full ACID transactions, and in some cases they're going to use custom APIs instead of SQL. But the nice thing about the NoSQL guys is that most of them are open source, and a lot of those that started off as closed source eventually became open source, whereas the data warehouses, except for one of them, are all still closed source today. They would argue that part of the reason they didn't support the relational model, or part of the reason they didn't support transactions, is that they weren't really worried about correctness in their operations; they were worried about high performance, availability, and scalability. It's great if your website can execute transactions correctly, but that doesn't help anybody if you can't run anything at all — if you have the Friendster problem, where every page load takes a second because you're hitting the database super hard. So they gave up all those things in order to be able to scale up and support the large number of users that they had. All right, so now we get to the 2010s, and this is the rise of the NewSQL systems. And I have to fully admit that I was one of the first proponents of this movement, if you want to call it that. The system I helped build when I was in grad school was called H-Store, which was later commercialized as VoltDB. If you read the Wikipedia article on NewSQL, you'll see that H-Store is listed as one of the first NewSQL systems because it came out in 2008. I wrote that, so I can say
whatever I want. So for the NewSQL systems, the basic idea is that they want to have the same scalability that you can get from a NoSQL system, but without having to give up the relational model and without having to give up transactions, right? And the way they're going to do this, for the most part, is that they're going to be mostly in-memory and they're going to be distributed. They'll scale out, or shard, your database across multiple nodes so that the aggregate amount of RAM across all those nodes is enough to store your entire database. Now, if you don't have to worry about disk anymore, you can forego a lot of the legacy components that you would have in a traditional relational database system, because those are going to slow you down and prevent you from scaling up. So again, in the NewSQL world they're all going to be relational, they're all going to use SQL, and for the most part — actually I'd say for all of them — they're going to be distributed. The downside is that most of these are going to be closed source. I think VoltDB, H-Store, and GemFire are probably the only open source ones. It is what it is, but that's their prerogative. All right, so what's next? What's coming after NewSQL? The next major trend, the next major category of systems being developed in the upcoming years, are what I call hybrid systems, or HTAP systems: hybrid transaction-analytical processing systems. And the big idea of these systems is that, again, they want to be like a NewSQL system in that they want to support fast transactions that are ACID, correct, and high performance, but they also want to do some of the analytics that you could do in a data warehouse directly on the data as it arrives inside the database system, right? So normally what happens in a typical setup now is you have sort of a NewSQL or other front-end database system that's executing all your transactions, and then over time you stream updates to your back-end data warehouse and do all your
analytical queries on that back-end machine, and then send updates back to the front. You see this a lot in internet gaming. Say you have a farming game application: people are clicking on things and changing stuff on their farm, and all of those are transactions that get recorded in the system. Then those updates get streamed to a back-end data warehouse, where they run some kind of analysis or machine learning models to figure out some way to make you buy virtual goods more easily, and then they send those results to the front end. So there's this round-trip time from when the data shows up, to when it gets sent to the back-end warehouse, to when it's processed and moved back to the front.

In an HTAP system, the idea is that as the data comes in, you can run the analytical queries you'd normally run in the back-end data warehouse directly when it arrives. So you reduce the latency of coming up with actionable information directly in the front end, which helps you sell stuff more quickly, or whatever it is you're trying to do. This is, I think, the major trend coming out in database systems right now. There aren't that many systems that can actually support this kind of workload: MemSQL is probably the major one, along with SAP HANA, and again we'll read papers from the HANA guys and the HyPer guys as well. HyPer is an academic system out of Germany. JustOneDB is a closed source system out of, I think, London; there's not much information about that one. But again, the basic idea is that you're doing transactions and analytics on the same machine. And this is not saying that the data warehouse is going to go away, because you're still going to use it as sort of a single source where you store all the information from your front-end silos. But some of the analytics that you would normally have to run on the back end, you can now directly run on the front
end with these systems. Most of them are distributed, most of them are shared-nothing, most of them are relational and support SQL, and most of them, as of 2016, are closed source.

Okay, so at the end of every lecture I'm going to try to have some parting thoughts, sort of the nuggets of wisdom for you guys to walk away and think about. I'd say the main thing you should understand about the history of databases over the last 30 or 40 years is that the good ideas don't come from a single source. They're not just coming from academia, and they're not just coming from industry; they're coming from everywhere. The cool thing about database systems is that there are all these different organizations and groups of people working on building new systems and coming up with new ideas. You have academic researchers like us here at CMU doing stuff, but you also have major companies like IBM, Oracle, and Microsoft doing stuff, and then you have a bunch of little startups with their own little systems as well. So all the new ideas out there are coming from all these different sources. Typically, a lot of them come from academia, but since most academics don't build a full system, they just have these little prototypes and write the paper, and then some other company will pick the ideas up and use them. But I think, you know, we're trying to change that here with our own system.

IBM was what I consider to be the vanguard, the leading database system company, of the 1970s and 1980s; a lot of the early ideas come from what those guys came up with. They had one of the first distributed database systems, and they had one of the first in-memory database systems. I don't think it's that way anymore. I think the real trendsetter now is Google. Google does not release very many papers, but when they do, they usually have a big impact on this research area. They don't open
source anything either, for the most part, so that's kind of disappointing as well, but it is what it is.

Oracle is always, in my opinion, about five years behind what we would consider the state of the art. Now, I don't mean this in a derogatory sense. They have the largest customer base in the world, and they have a very complex database system that has been doing quite well for them for the last 30 years or so. When you have all those legacy customers, and some academic writes a paper or some startup comes up with an idea, you don't want people shooting from the hip, just having one guy in his basement write that new piece and add it to this large, complex system. They have a very rigorous testing process, and you have to go through a lot of steps to prove that any new thing you want to add to the Oracle system is safe and won't degrade everybody else's performance. So again, they get a lot of ideas from other places, and then usually four or five years later those show up in the real product.

Microsoft has actually been doing some really cool stuff in SQL Server, and we will actually read a lot of their papers in this class. Some of the work from the database research group at Microsoft Research that actually gets added to the SQL Server product is really, really cool, really interesting.

So the other major takeaway I have for you guys is that the relational model has essentially won the day for operational databases. Eventually all the NoSQL guys will add SQL back, and if you were going to build a new application today, you'd probably want to use SQL; you probably would not want to use some proprietary API that may change over time. In the case of data warehouses and analytical systems, we're not quite there yet, but essentially SQL is going to win that over too. Hadoop came out, people realized that writing MapReduce programs was a bad idea, so they built SQL layers on top of MapReduce. And then
Spark has come out, and now they're showing how most of their customers are writing their queries against Spark SQL rather than using the Spark API. So eventually SQL becomes the de facto standard there as well. It's sort of like, in the beginning people didn't like it, then people liked it, then people didn't like it, and now they're going to like it again. And I think, for the rest of my life, it's probably going to be the case that SQL is the standard API that everyone uses for databases.

Okay, so that's it for today. For next class, we're going to start with more introductory information about in-memory databases: what they are, and what changes in the architecture of a database system when you assume everything is in main memory versus assuming things could be on disk. Then we'll introduce what you're going to have to do for Project One; we'll discuss the system, the expectations, and how to get started on it. And as a reminder, the review for the first mandatory reading is due at 12 p.m. on Wednesday. It's an easy reading; it's actually a paper from 1992, so you might normally think, well, that's over 20 years ago, how is that still relevant? But what you'll see is that a lot of what it discusses, here's what a main-memory database is all about, here are the things you have to worry about, is still relevant to the things we're talking about in this course today. All right, any questions? All right, if you haven't already, there's a sheet going around, I don't know where it is, a wait list and an enrollment list; put your initials next to your name so I know you're here, and we'll do it again on Wednesday. All right.
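Two of the ideas from this lecture, sharding a database across the memory of multiple nodes (the NewSQL approach) and running analytical queries directly on the live transactional data instead of shipping it to a back-end warehouse (the HTAP approach), are easy to see in a few lines of code. Below is a minimal toy sketch in Python; the node count, key format, and farming-game purchase data are all made-up assumptions for illustration, not taken from any real system.

```python
# Toy sketch (illustrative only): a sharded in-memory store in the NewSQL
# style, plus an HTAP-style analytical query that runs directly on the live
# transactional data instead of a separate back-end data warehouse.

NUM_NODES = 4
# Each dict stands in for the RAM of one node; together they hold the
# whole "database", which is the NewSQL scale-out idea.
nodes = [dict() for _ in range(NUM_NODES)]

def node_for(key: str) -> int:
    """Hash partitioning: every key lives on exactly one node."""
    return hash(key) % NUM_NODES

def put(key: str, value: dict) -> None:
    """OLTP path: a small transactional write routed to one shard."""
    nodes[node_for(key)][key] = value

def get(key: str) -> dict:
    """OLTP path: a point lookup routed to the shard that owns the key."""
    return nodes[node_for(key)].get(key)

def total_revenue_by_item() -> dict:
    """OLAP path: an aggregate over all shards on the live data,
    with no round trip through a separate warehouse."""
    totals = {}
    for node in nodes:
        for row in node.values():
            totals[row["item"]] = totals.get(row["item"], 0) + row["cents"]
    return totals

# Front-end transactions, e.g. purchases in a farming game (prices in cents).
put("purchase:1", {"item": "tractor", "cents": 499})
put("purchase:2", {"item": "tractor", "cents": 499})
put("purchase:3", {"item": "seeds", "cents": 99})

# A point lookup and an analytical aggregate hit the same live data.
assert get("purchase:3") == {"item": "seeds", "cents": 99}
assert total_revenue_by_item() == {"tractor": 998, "seeds": 99}
```

A real NewSQL or HTAP engine adds replication, concurrency control, and a SQL layer on top of this routing idea; the sketch only shows the two properties the lecture describes, that the shards' combined memory holds the entire database, and that the analytical scan sees every transactional write immediately.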