Welcome to 15-721 at Carnegie Mellon University. I'm the instructor for this course, Andy Pavlo. As I posted on Piazza, I can't be in Pittsburgh right now. I'm actually in Amsterdam with my friend 187 Owen. The problem is that 187 booked us a hotel, but it's actually not the kind of hotel we'd want to stay at. So right now we're just kind of hanging out here in the red light district, trying to figure out what we need to do next, but Owen lost all his money and I don't have my credit card with me right now, so we're kind of homeless. So I figured now would be a good time to sit down and cover the first lecture for everyone. Before we get started, I just want to give a shout out to everyone who's been down with us since the very beginning: JL in Seattle, MC in California, and EZE in Brooklyn. These people have been staying true to us since the very beginning. I also want to acknowledge that this semester we are being sponsored by two groups. The first is the Amazon database group. Amazon is actually, surprisingly, one of the largest database vendors. Everything they do is on the cloud, and a lot of the topics that we'll be talking about this semester are relevant to the kind of systems that they are building, and so they are helping out with course development. We're also being funded by a non-profit called the Steve Moy Foundation for Keeping It Real, and again they are helping out with paying for TAs and lecture notes and other things. So we appreciate both of them for helping us out, and Amazon will be coming to give a guest lecture later in the semester. So today's agenda: the things we want to discuss are, first, an overview of what the course and the semester will look like if you're a student at Carnegie Mellon, and then I just want to do a fun lecture about the history of databases, to provide some context for the kind of modern systems we're talking about today and what things people have tried in the past.
So if you're not enrolled in the course and you want to skip ahead in this video, you should skip ahead to the history of databases. Everything else is just for people that are here at CMU. So the first thing I want to discuss is: why should you take 15-721? Why should you take this course? I say this every semester, and every semester it continues to be true: right now, database developers are in huge demand, not just in Silicon Valley, not just in the United States, but all over the world. And it's because there are a lot of unsolved problems in databases, data management, and processing, and these companies can't hire people fast enough. So if you take this course and you do well in it, you'll be immediately employable, and that's a good thing, all right? The other thing I'll say, too, is that even if you don't end up spending your career building database systems, the things that you'll learn throughout the semester are going to be applicable to almost every other aspect of computer science or information technology. What you'll see is that people who are good enough to write code for database management systems, or operating systems, or embedded systems, can get a job in pretty much any other application domain. If you're a database system programmer and you want to write JavaScript code, people will hire you. But if you're a JavaScript programmer and you want to write a database system, that may be a harder sell, right? So the kind of systems background you'll get out of this course goes beyond just database systems, but we're going to discuss these things in the context of databases, because that's all I really care about.
I can give you the brief alumni roll of the students who have taken this class and where they've gone. This is just a subset of the students that have taken 721, then hung out with us and worked on our systems further at CMU, and everyone's gone off either to grad school or is now building database systems, either at companies that have built databases or at startups that are specifically just building database systems. So I'm pretty happy about that, and those companies can't hire our students fast enough. All right, so the goal of this course is to give you the background to understand the best practices in building database systems. And as part of that, you'll become better at doing low-level systems programming. The goal is that by the end of the semester, you'll learn what it means to write correct and performant systems code. In the context of databases, we need both, right? You don't want to lose data, so you want your system to be correct. And obviously, you want to process data and queries as fast as possible, so you want high performance. Now, sitting in my ivory tower in academia, I like to say that correctness matters first and performance comes second, and that's how I'll teach things. In practice, in the real world, that's not always followed. Look at examples like MongoDB or MySQL, wildly successful database systems that are used all over the world, and they started off going for performance first and correctness second. The other thing you'll learn in this course is proper documentation and testing for your database system. You'll also learn how to do code reviews and how to work in a larger code base. These are things that go beyond the university, things that will help you out throughout your career no matter where you go.
When I ask the major tech firms what they're interested in, what they want my students taking this course to learn (B+ trees? locking code?), they always tell me they want students who are able to work independently on a large code base. That's not something I take lightly. It's something that doesn't come easy for a lot of people, especially me, and it's something you'll get exposure to during the semester. All right. So the core umbrella topic for this semester will be single-node in-memory database systems. That means we're going to ignore, for the most part, writing data out to disk the way a disk-oriented database system does, and I'll explain what I mean by that later on. And we're also going to ignore the problems and challenges that come up when you have a distributed database system. That's not to say distributed databases are not important. It's just that for this course, we're going to focus on getting the single-node system working as fast and as correctly as possible, and we won't worry about going distributed. In my opinion, it's better to have a database system that works well on a single node and try to scale that vertically before you go to a distributed environment and try to scale horizontally, because that brings in a whole bunch of other challenges. This means that we're not going to cover things like consensus protocols or fault tolerance at a wide scale. We care about those things, just not here. This course is also not going to be about classical database management systems. That means we're not going to cover the basics the way we did in the introduction class, things like two-phase locking or B+ trees. I'm going to assume you know those things. Instead, what we're going to cover are state-of-the-art implementations and other topics that modern systems are using today.
As I said, this goes beyond what we cover in the intro class. The kinds of things that we'll talk about during the semester are concurrency control, indexing data structures, storage models, database compression, how to execute joins in parallel, vectorized execution models, networking protocols, logging and recovery, and then we'll spend a lot of time at the end of the semester talking about query compilation and query optimization. All right, these are the more modern things that you don't really get the chance to cover in the introduction class. So as I said already, I'm going to assume that if you're taking this course, you have already taken another database course, whether it's 15-445/645 at Carnegie Mellon or some kind of intro to databases at your undergraduate institution. And the reason I'm going to assume this is because we're going to discuss modern variations of the classical algorithms that are designed for what today's hardware looks like. I'm not going to teach you how to do a hash join; I'm going to teach you how to do a vectorized parallel hash join that can run on multiple cores with SIMD instructions. You need to understand the basic way a hash join works in order to understand the modern implementation of it. So the core background that I'm assuming you already have would obviously be SQL, because we're going to focus on relational databases; serializability theory and concurrency control theory; relational algebra; and then the basic algorithms and data structures for classic databases: again, B+ trees, ARIES, two-phase locking, and so forth. So the course policies and the schedule are all available on the course webpage. There's a syllabus, there's a schedule, and I'll cover what the schedule means in a few more slides.
But for any questions you have about how the course will be held and how it'll be graded, you can always refer to the course webpage. In terms of academic honesty, I encourage you to go look at CMU's policy on their webpage about what plagiarism and cheating mean. If you're doing something like copying a piece of code and you're not sure whether that's okay, I encourage you to please ask me. I'd rather you ask me, and you and I discuss it and see whether what you're doing is okay, than have me catch you cheating and have to go report you to Warner Hall, right? So, you know, let's not plagiarize. Let's not be stupid. If you have questions, just ask me. All right, so I'm holding office hours on Mondays and Wednesdays from 1:30 to 2:30, immediately after this class, in my office in Gates. If that time is not good for you, then please email me and we can arrange another time. So if you come to office hours, what can we talk about? Well, a lot of times students want to come talk about implementing their projects, or they've read a paper and aren't sure whether they fully understood it. But if there are other sort of life things you want to talk about, I'm happy to do that as well. Like how to get a job as a database engineer or database developer. If you want an introduction to a database company in a certain state or region, I can definitely do that for you. And then, because of my background, I think I'm qualified to talk about how to handle police, you know, if you're in a situation where, because you do databases, you end up in a bad place and you need some help getting out of it. So by all means, come talk to me. I'm happy to do so. We have one TA this semester. He's trilla. Matt Butrovich is a second-year, or maybe one-and-a-half-year (it's not clear what he actually is) PhD student with me in the CS department.
He's actually a student who took 721, crushed it, and I encouraged him to hang out and be a PhD student, and now he's here. He is currently the lead architect and developer helping build our database system at CMU, and this is what you'll be using for all the projects. Before actually coming to CMU, he was from L.A., so he was like an ex-gang member, you know, so he's ridiculous. But actually, right now, since I've got him on the right path as a student, I'm training him as a licensed boxing manager. So he's going to be a cage fighter. He's currently undefeated and, you know, through our training, we hope that it will continue to be that way. So for any development questions, I encourage you to go ask Matt, because he will know more about the internals of our database system than I do. All right. So the breakdown of what's expected of you this semester is the following: there are reading assignments, programming projects, a final exam, and extra credit. One change we made this year from previous years is that I dropped the midterm exam and brought back the second programming project, because I think students got more out of it last year than the midterm exam. I'll explain what that is as we go along. And there are two additional reading assignments compared to previous years. So if you go to the course schedule on the webpage, you will see that for every single day there's a broad topic, what we will discuss in that lecture, and then there's a list of readings, and one of the readings will have a yellow star next to it. That's the mandatory reading for that class, the primary paper that we will cover in the lecture. So before each class, you're required to submit a review or synopsis of this paper that discusses the overview and the main takeaway of what the proposed system actually does.
You have to submit this through an online form, just a Google form. In addition, the synopsis should also include what system they used for the evaluation, whether and how they modified it, and then what workloads they used to evaluate their proposed method. This last one is actually really important, because when you do the final project and you want to measure whether your system actually improved or got faster, or whatever you're modifying, you want to go see what workloads people are using in these other papers and apply the same experiments to your own implementation. So the reviews or synopses are due at 11:59 a.m., right before class. That's strict, that's harsh, I know. The reviews are due right before class and there are no late days, but you're allowed to skip four readings throughout the entire semester, and then your final grade will be based on the ones that you do submit. So again, please don't plagiarize these writeups. I know other people follow along in the course on YouTube, and I've looked at previous years, and sometimes you can find the synopses on GitHub. Don't plagiarize. Don't copy from each other or from anything you find online. If we catch you, then we have to report you. So again, don't be stupid. Now the bulk of the grade, no surprise, is basically the programming projects. So we are building a new database management system here at Carnegie Mellon. It doesn't have a public name yet. The repository is currently named after my dog, which is just Terrier. That's not going to be the name of the system, because that's not a good name for a database system. We will announce the name and publish the website once we get the first version out. The things you care about are that it's going to be a modern code base written in C++11 or C++17, and it's multi-threaded.
We use LLVM to do query compilation; again, I'll explain what that is throughout the semester. Everything's open source. It's designed to be Postgres compatible, so it'll be easier for you to actually interact with this thing using the terminal. All the programming projects will be based on this, and we're going to use GitHub to manage everything. Matt will be running a recitation early next week to give you an overview of how the system source code is laid out and to give some pointers on how to get started on the first project. We'll announce where that's going to be on Piazza later this week. So all the development you're going to do is on your local machine. The database system currently builds on Linux and OS X. I think somebody has tried to build it on Windows with the Ubuntu packages; I haven't tried this, so I don't know whether it works. But we'll also provide you with a Vagrant VM file that you can just download if you want to run this in a Linux VM on your Windows machine. So you do all the development on your local machine, but for the first and second projects, and potentially for the third project, you'll want to do all your benchmarking and performance tests using an Amazon EC2 machine, because you can get a machine that has more cores than your local laptop does. We'll provide everyone with coupons or credit codes you can use on Amazon to get, you know, a couple bucks for these machines, so you're not paying out of pocket. We'll send that information out later this semester. For the first two projects, we're going to provide you with the test cases and scripts, and give you clear instructions on what you need to modify in the system for your project. We're also going to teach you how to profile the system using perf and Callgrind, because this is presumably not something that most students have experience with. So the first project will be completed individually.
The second project will be done in groups of three. Currently the class has roughly 36 people, so that leaves about 12 or 13 groups of three. I like to keep it at three people because that's not too few and not too many. There will be some exceptions that we'll take on a case-by-case basis; if we have extra people, we could do a four-person or two-person group. But again, most everyone should be in a three-person group. So it's okay for you to start thinking right now about what group you want to join. The third project will be the group project, and this will be something that you get to choose to build in our system. It has to be something that's relevant to the materials or topics we discuss in the class. It has to require a significant programming effort from everyone on your team, and it has to be unique (you can't have two groups working on the same project), and obviously it has to be approved by me. So don't freak out about this now. You don't have to pick a project until we come back after spring break, and I'm also going to provide some sample project topics that you can choose from to build in our system. So again, for the first and second projects, please don't plagiarize. Please don't copy from each other. If you're borrowing a library for project two, that's probably not necessary. For project three, again, come talk to me. But again, don't be stupid. Don't copy, and nobody will have any problems. All right, so the final exam will be a take-home exam with long-form questions based on the mandatory readings that we discussed in class. I'll give it out in class on April 22, which is also the same day that the guest speaker will come and give a lecture. Then it'll be due either the last day of class a week later, or when we do the final project presentations; I haven't decided yet. All right, and the extra credit for this semester is the same as what we did in the intro class.
So we are writing an encyclopedia of database systems at Carnegie Mellon. I think the number is now something like 683 database systems that we know about. You can get extra credit if you write a Wikipedia-style article about one particular database system. There are a ton of them in there; pick one that you find interesting and write about how it's actually implemented. And you've got to provide citations and attributions for where you found this information. It's not enough to just say, this is what it does because I watched a video. You've got to actually have citations, because we're trying to make this thing scientific. This is entirely optional. You don't have to do it, but if you want to, the option is there. So again, for the extra credit articles (this happens every year and it drives me crazy), do not plagiarize. Even though it's extra credit, you can still get in trouble. We still report people to Warner Hall who plagiarize on the extra credit, right? The reason I'm saying this over and over again in these videos is so that if you do plagiarize, I can go to Warner Hall, show them this video, and say, look, here's where I told the class not to plagiarize, and they plagiarized anyway. So don't do that. The breakdown for the grades this year is that the reading reviews are worth 15 percent. Project one is worth 10 percent; again, that's mostly designed to get you familiar with working in the code base. Project two is 20 percent. Project three is 45 percent. The final exam is 10 percent. And then you can earn an extra 10 percent from the extra credit. All the discussion throughout the course will be done on Piazza, and that's the link there. Everyone should have already been signed up before the semester started. If that hasn't happened, please email me and we can fix this.
If you have a technical question about the projects, like "I tried running this in a VM and it didn't work," "I tried compiling it and it didn't work," or "what does this piece of the code do?", please don't email me or Matt directly. Please post on Piazza, because it's a group discussion, and if you have a question, other people probably have similar questions, so Piazza is the right place for that. For anything that's not about the projects or assignments, please email me directly. Like if you're sick or your dog died, email me, okay? All right, so that's it for the background on what the course will be about, the logistics. Now I want to do a fun mini-lecture on the history of databases. This lecture is actually based on two papers. One was written by Mike Stonebraker, called "What Goes Around Comes Around," and that was published in 2006. It's a retrospective of the history of databases up to that point. The second paper is one that I wrote with an industry analyst out of London, called "What's Really New with NewSQL," and that picks up where Mike left off in 2006 and talks about the rise of the NewSQL systems and how they fit into the history of databases. So again, databases are a super old topic that is surprisingly still hot today. I love databases. So it's good to understand how we got to the point where we are now. The main takeaway from those two papers is that a lot of the issues that the early database systems were facing, even all the way back in the 1960s and 1970s, are actually still relevant today. How do you run transactions correctly? How do you maintain indexes? All these things still matter. What's different is that the hardware landscape is different.
The one thing in particular that's really fascinating about all this, in terms of how history repeats itself, is that it's been a decade now, but this debate of SQL versus NoSQL, and which one was better for different workloads, was basically the same debate that they had in the 1970s about the relational model versus CODASYL, the network data model. And if you've never heard of CODASYL, it's no surprise, because it lost, right? The relational model won; that's why this whole course is on relational databases. The relational data model has proven to be resilient, robust, and useful for almost every possible database workload. The only exception I would say is potentially machine learning, because those are arrays, and although you can model them in a relational database, it's not the most optimal way. So many of the ideas that we're talking about throughout this entire semester are not new. Again, we're just discussing the modern implementations or incarnations of them. A lot of the time you'll hear me say, when we talk about a particular topic, oh, actually IBM did the same thing in the 1970s with System R, or Ingres did the same thing at Berkeley in the 1970s. A lot of these techniques are not new. It's just that some of them are now back in fashion because the hardware can actually support them. I think that's super interesting, and it's part of the reason why databases are still relevant, still a hot topic, and still in demand today. All right, so let's go back to the very beginning. One of the very first database systems, developed in the 1960s, was called IDS, the Integrated Data Store. This was developed internally at GE, General Electric, for some customer; I think it was a timber company out of Seattle or somewhere like that. So they built this database system for this one customer, and it helped them get a handle on their large organization.
Now, you may be thinking, well, Andy, I've never heard of GE selling computers or selling databases. How can GE be responsible for building the first database system? What happened to them? Well, it turns out that GE at the time had a business philosophy where they said that if they couldn't be the number one company in a particular sector of the marketplace, then they didn't want to be in that market at all. So they said, well, we're not number one in computing, let's just sell it all off; we don't want to be involved in that. They sold their entire computing division to Honeywell in 1969, and as part of that, they also sold IDS. I actually don't know whether IDS is still around today. I mean, it potentially is, but this is usually deemed one of the first database systems. The two key things about IDS that are relevant to us are that it used what is called a network data model, which I'll explain in a few slides, and that when you execute queries, you're basically writing a bunch of for loops that operate on a single tuple at a time. We'll contrast that with the relational model, which can operate on bags or sets. So the guy at GE that helped build IDS was an early computer scientist named Charlie Bachman, or Charles Bachman, who's actually still alive today. He ended up leaving Honeywell in the early 1970s and went off to another computer company called Cullinane Database Systems, and he helped build a new network data model system called IDMS, the Integrated Data Management System, and that's actually still around today. So Bachman and a bunch of other people that were involved in COBOL programming got together in the late 1960s and proposed a standard called CODASYL for how programs written in COBOL should access a database. Think of it this way: they were trying to define a standard API that every database system accessed from COBOL could support.
So again, based on his experience at Honeywell, CODASYL was based on the network data model, and therefore it operated on one tuple at a time during query execution. Bachman was an early database pioneer, and he won the Turing Award for this work in the early 1970s; he was one of the first people to win the Turing Award for databases. So here's what the network data model looks like. For this sample application, say we're a large engineering company and we're trying to model how to keep track of all the different parts we have to buy from different suppliers to build some large rocket, which is actually a true example we'll see in a few more slides. So in my database, I want to keep track of suppliers, parts, and which supplier supplies each part. The way the network data model is set up is that you have these collections of data, like supplier, part, and supply, but then you also have to define membership sets that say whether a record from one set is owned by another set. To show an actual instance of what this would look like: say these are my collections of records, supplier, part, and supply. I would now have these membership sets, called supplies and supplied-by, that hold pointers so that, for a given membership between a parent and a child, here are all the pointers to the parent records and here are all the pointers to the child records. So now let's say I want to do a query: find me all the parts that are supplied by a particular supplier. I'd have to write a bunch of nested for loops that find the supplier that I want, iterate over the supplies set into the supply records, then iterate over the supplied-by set until I reach my part records and can find the thing that I'm looking for. So the first obvious thing to point out here is that the network data model requires you to write complex queries.
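To make the tuple-at-a-time style concrete, here's a toy sketch in Python of what that kind of query looks like. This is not actual IDS/CODASYL code, and all the record contents and names (suppliers, parts, quantities) are made up for illustration; the point is that the programmer hand-writes the pointer traversal, one record at a time.

```python
# Hypothetical record collections; names and values are illustrative only.
parts = {1: {"name": "large battery"}, 2: {"name": "o-ring"}}
suppliers = {100: {"name": "Acme"}, 200: {"name": "Globex"}}

# SUPPLY records, each holding pointers back into its two owner sets:
# "supplies" (owned by a supplier) and "supplied-by" (owned by a part).
supply = [
    {"supplier": 100, "part": 1, "qty": 10},
    {"supplier": 100, "part": 2, "qty": 5},
    {"supplier": 200, "part": 1, "qty": 3},
]

def parts_for_supplier(supplier_id):
    """Find all parts supplied by one supplier, one tuple at a time."""
    results = []
    for record in supply:                      # walk the "supplies" set
        if record["supplier"] == supplier_id:  # match the parent pointer
            part = parts[record["part"]]       # follow the "supplied-by" pointer
            results.append(part["name"])
    return results

print(parts_for_supplier(100))  # ['large battery', 'o-ring']
```

Notice that the "query" is just a loop hard-coded against the pointer layout; if the layout changes, this code has to be rewritten.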
You're essentially writing low-level for loops to traverse this network to find the data that you're looking for. The other issue is more of an implementation aspect, but these databases turned out to be easily corruptible, because back in the 1960s and 1970s disks were expensive, they weren't as reliable as they are today, and people weren't maintaining multiple copies of data. So if one of these membership sets got corrupted, you essentially lost the whole database, because now you had no way to know how things were linked together. So the network data model and IDS came first, but another database system that was built around the same time, in the late 1960s, was this thing called IMS, the Information Management System, at IBM. IMS is super well known; it's widely used still even today. If you've ever used an ATM, chances are you've interacted with an application that talks to an IMS database. It's still around. IMS was actually a database that IBM built to keep track of the purchase orders, parts, and supplier information for the Apollo moon mission. But unlike CODASYL, it didn't use a network data model; it used what is called a hierarchical data model. The other interesting aspect of IMS was that it had programmer-defined physical storage formats. So when you declared that you have a collection of data, like a table, you also had to define the underlying data structure you would use to physically store that data on disk, whether you want a hash table or some kind of order-preserving tree. And, like the network data model, you would write these for loops to traverse the hierarchy and do queries that operate on a single tuple at a time. So again, back to our simple supplier information example: now we just have two collections, supplier and part, but in the instance we actually see a problem with a lot of redundant information.
We'd have this supplier collection with our records, and then each of these records would have a pointer to some other data structure that holds all of its part information. The obvious problem here is that we're duplicating data, because if multiple suppliers provide the same part, I have to have different instances of that record, in this case "large batteries," over and over again for every single supplier that sells that part. And that means that if the name of the part ever changes, I have to write extra code to go find all instances of the large batteries record and make sure that their names all get changed together at the same time and stay in sync. The other big problem, in addition to the tuple-at-a-time queries, is that there was no independence between the physical data structure of the database and the logical abstraction that programmers interacted with. So for example, if I declared that my table was going to use a hash table as its underlying physical data structure, but then I later realized, oh, I actually want to execute range queries, well, now I've got to dump the table out and load it back in as a B+ tree. And then the API that I'm exposed to for that table changes, and I have to rewrite all my application code, right? So that's wasteful; it's very painful. So in the late 1960s and early 1970s, there was this mathematician named Ted Codd who was working at IBM Research in New York, and he saw all these programmers at IBM rewriting their IMS and CODASYL programs every single time the database schema or layout changed. And he saw this as being wasteful, because every single time the schema changed, you had to go rewrite your application code. Back then, computers were way more expensive than humans, but he was prescient in this goal of removing this burden, because now humans are way more expensive than machines; machines are cheap, right?
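The duplication problem described above can be sketched in a few lines. Again, this is a hypothetical toy, not real IMS code, and the supplier and part names are invented; it just shows that in a hierarchy, each supplier owns its own copies of shared part records, so a rename must touch every copy.

```python
# Hierarchical (IMS-style) layout: each supplier owns its own part records,
# so a part sold by multiple suppliers is stored redundantly.
# Hypothetical data, illustrative only.
db = {
    "Acme":   [{"part_no": 1, "name": "large battery"},
               {"part_no": 2, "name": "o-ring"}],
    "Globex": [{"part_no": 1, "name": "large battery"}],  # duplicate copy
}

def rename_part(part_no, new_name):
    """Renaming one part means finding and updating every copy by hand."""
    for part_records in db.values():
        for part in part_records:
            if part["part_no"] == part_no:
                part["name"] = new_name

rename_part(1, "battery pack")
# Every supplier's copy must change in lockstep, or the data drifts out of sync.
```

If the application forgets even one copy, the database silently becomes inconsistent, which is exactly the maintenance burden Codd was reacting to.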
So back then this maybe wasn't that big of an issue, since it wasn't hard to find programmers, but certainly now it'd be a big cost problem. And so what he came up with in the relational model was a high-level abstraction of databases to avoid all this maintenance burden on humans. The relational model has three key ideas that are widely used in databases today. The first is that instead of storing the database as this hierarchy or this network with these graph structures, we instead store the database in simple data structures, as relations or tables. And then we write our programs to access the data through some high-level language that can operate on these relations. Now, at the time, in the first paper, he didn't actually propose a language. The first paper he wrote was in 1969, but the more common one that everyone cites is "A Relational Model of Data for Large Shared Data Banks." In these seminal papers, he didn't actually define what this high-level language would be. Of course, now we know this is SQL, but at the time SQL didn't exist, right? So the papers strictly describe the relational model from a mathematical perspective. QUEL was actually the first query language that people came up with, for Ingres, that followed the relational model. IBM later came out with SEQUEL, which is the "sequel" to QUEL; it's a play on words. Ted Codd actually proposed his own language called ALPHA in a paper, I think in the early 1970s, but no one ever actually used it. The other key thing that he came up with that goes beyond what IMS was doing was that the physical storage representation of every relation was left up to the implementation of the database management system. What that means is that I just declare that I want a table, a relation that has these attributes; I don't say how I actually want it stored physically.
And now the database system can decide on its own how it actually wants to store things, right? If it wants to use a hash table, or it wants to use a B+tree, it can do whatever it wants. And no matter how the data changes physically underneath the covers, in theory, if I'm doing this correctly, I shouldn't have to change any of the queries in my application, because they're only operating on the database at a logical level. These ideas seem obvious today, but back then this was actually kind of controversial, right? The idea of having a database system support a high-level language and generate query plans as efficient as what humans could write by hand was considered a far-fetched idea. Back then, you also gotta understand, people didn't think a high-level language like C, with a compiler for C, could ever produce machine code as efficient as someone writing assembly. And of course, nowadays very few people actually still write assembly; we write in even higher-level languages, right? Again, in our modern world, these don't seem like controversial ideas, but back then they were. So just to go back to our simple example from before: in the relational model, now we have three relations, supplier, part, and supply, and these have foreign key relationships between each other. So now, in my instance of this database, I can just do joins on either the supplier number or the part number to find the parent record for my supply record. So to find all the parts supplied by a given supplier, I can just do a join between these three tables and find exactly what I want. And this also means that in a high-level language, I'm operating on the database through sets, right? I just say, "this is what I want you to find for me," and I don't specify how to iterate through tuples one by one to find what I'm looking for.
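Here's a rough sketch of that three-table schema and join, using SQLite from Python. The exact column names and data are made up for illustration; the point is that the query states *what* we want and the DBMS decides *how* to find it:

```python
import sqlite3

# A sketch of the supplier / part / supply schema from the lecture.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, sname TEXT);
    CREATE TABLE part     (pno INTEGER PRIMARY KEY, pname TEXT);
    CREATE TABLE supply   (sno INTEGER REFERENCES supplier,
                           pno INTEGER REFERENCES part);
    INSERT INTO supplier VALUES (1001, 'Acme'), (1002, 'Globex');
    INSERT INTO part     VALUES (999, 'Large Batteries');
    INSERT INTO supply   VALUES (1001, 999), (1002, 999);
""")

# Declarative, set-oriented query: no application-side for loops, and no
# assumptions about the physical layout (hash table, B+tree, ...).
rows = conn.execute("""
    SELECT s.sname, p.pname
      FROM supplier AS s
      JOIN supply   AS u ON u.sno = s.sno
      JOIN part     AS p ON p.pno = u.pno
     WHERE s.sno = 1001
""").fetchall()
```

Because "Large Batteries" now lives in exactly one `part` tuple, renaming it is a single `UPDATE`, in contrast to the hierarchical layout where every duplicated copy had to be found and fixed by hand.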
Again, this is a very powerful construct that is the basis of what we'll talk about for the entire semester. All right, so when the relational model came out in the 1970s, as I said, it was controversial. It wasn't immediately clear that this was the right way to build database systems. There were a couple of other relational databases being built at the time, but the three that most people talk about today are System R at IBM, Ingres at Berkeley, and Oracle by Oracle. System R was at IBM Research in San Jose; it wasn't led by Jim Gray, but he was probably one of the most famous people that came out of that group. Ingres was developed by one of my advisors, Mike Stonebraker. If you've never heard of Ingres before, you may have heard of Postgres, right? So Stonebraker built Ingres, and then he built Postgres after Ingres; Postgres stands for post-Ingres, right? Again, we'll talk about both of these systems throughout the semester. And then Oracle is a commercial system built by Larry Ellison, and now he's one of the richest people in the world. So this listing is amazing, right? You look at Jim Gray, you look at Mike Stonebraker: both of those guys won the Turing Award for databases. Larry Ellison, as I said, is like the seventh or eighth richest man in the world. This is all because of databases. Like, this is crazy, this is awesome, right? This is why I love databases: not only can you do research that has an impact, you can make some decent money. You can buy your own Hawaiian island. It's awesome. All right, so there was a debate going on between the relational model and CODASYL in the 1970s, but eventually the relational model won out, and it was clear that this was the right way to build database systems. So Ingres and Oracle were commercialized, but IBM actually never released System R as a commercial product.
The first relational database system they did release was this thing called DB2, which is still around today; it came out in 1983. There are some bits and pieces of System R that made it into the first implementation of DB2, but I don't know how much of that code still exists today. And because System R was using SQL, IBM used SQL in DB2, and because IBM was the computer juggernaut at the time, that essentially became the standard. They had to change the name, though: it used to be spelled out S-E-Q-U-E-L, but it got shortened to become SQL, and I think that's because somebody sued them over a naming issue. So SQL became the dominant query language used for databases. Ingres was supporting QUEL at the time, and they eventually had to support SQL. But Oracle supported SQL from the very beginning because they copied what IBM did, and when IBM came out and said, "we have a SQL database," Oracle was like, "oh look, we already have one too," and then they essentially won the marketplace. The other thing that happened in the 1980s was that, in addition to IBM putting out their first relational database, there were a lot of other startups that came out at the time that also built relational databases for the enterprise or commercial market: Informix, Sybase, Tandem NonStop SQL, Teradata (which was actually late 1970s), and InterBase is another famous one. All of these database systems are still around today, not in the exact form that they existed in back in the 1980s, but these are still billion-dollar products that make a lot of money for these companies. The other cool thing is that Stonebraker, after running the Ingres company for a while, went back to Berkeley and started a new database system built on the lessons he learned from running Ingres, and he put that into a new system called Postgres.
And obviously not much of the same code is still there, but the Postgres you're using today is a direct descendant of the Postgres code that Mike developed with his students at Berkeley in the 1980s. All right, so now we get into the late 1980s, and now we start to see people fighting the same battles that had been fought in the 1970s. In the 1980s, object-oriented programming languages became more prevalent, and people recognized that, well, if I'm writing all my application code in an object-oriented programming language like Smalltalk or C++, but I need to store things in my database, and my database stores things as relations, then I have to take my objects, which may have nested arrays and other objects inside of them, break them up, put them into relations as single records or tuples, and put them into my database. And people observed that this was a potential bottleneck. This is called the object-relational impedance mismatch. So a bunch of people said, well, rather than taking my objects in my programming language and splitting them up into tuples for a relational database, what if I could just have a database system that could store objects directly? Then I don't have to do that translation, and serialize or deserialize and break things up and put them back together any time I need to access them. So a couple of database companies came out with what are called object-oriented databases; Versant and ObjectStore are probably the two most famous ones. They still exist today, but they're essentially in maintenance mode. This category of databases never really took off, because there was no standard way to program these systems.
And because you had to use a proprietary programming language, that made you tightly coupled with the database system, whereas SQL is an open standard that anyone can implement; these other databases were proprietary. MarkLogic is an XML database that came out in the late 1990s, and it was similar to these things; it was sort of an XML database. So although these systems aren't really around today and people aren't widely using them, a lot of the technologies they developed to store data in these object-oriented databases actually exist today in the commercial relational databases, and even the open source ones, right? Any time you store data as a JSON field or an XML field, you're essentially doing the same thing that the object-oriented database guys did back in the 1980s. So again, as Mike said, what goes around comes around. So let's look at a quick example here. Say I have an application where I wanna store student information. Every student has an ID, a name, an email address, and then an array of phone numbers. If I wanna store that in a relational database, since a student can have multiple phone numbers, I need to have a foreign key relationship between a student_phone relation and a student table. And now, when I wanna do a lookup to get all the information I need for a given student, I either have to do two queries on these two relations or do a join between them to get the data that I need. Again, this is what's called the object-relational impedance mismatch. So what an object-oriented database would say is, well, instead of storing this as two separate relations, what if I just stored all the information for a single student as a JSON field, and inside that I can have my nested array. So now, to go get all the information to instantiate that object in my application code, it's one fetch from the database. This seems nice, right?
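A minimal sketch of the two approaches side by side, using SQLite for the relational layout and a JSON string stored in one column to stand in for a document store. The schema and data are invented for illustration:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Relational layout: phone numbers normalized into their own table.
    CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE student_phone (sid INTEGER REFERENCES student, phone TEXT);
    -- Document layout: the whole student object stored as one JSON blob.
    CREATE TABLE student_doc (sid INTEGER PRIMARY KEY, doc TEXT);
""")
conn.execute("INSERT INTO student VALUES (1, 'Alice', 'alice@example.com')")
conn.executemany("INSERT INTO student_phone VALUES (?, ?)",
                 [(1, "555-1111"), (1, "555-2222")])
conn.execute("INSERT INTO student_doc VALUES (1, ?)",
             (json.dumps({"id": 1, "name": "Alice",
                          "email": "alice@example.com",
                          "phones": ["555-1111", "555-2222"]}),))

# Relational: a join (or two queries) reassembles the object.
phones_rel = [r[0] for r in conn.execute(
    "SELECT phone FROM student_phone WHERE sid = 1 ORDER BY phone")]

# Document: one fetch returns the fully nested object, ready to use.
doc = json.loads(conn.execute(
    "SELECT doc FROM student_doc WHERE sid = 1").fetchone()[0])
```

The single-fetch convenience is real, but as the lecture notes next, aggregations and joins across many such documents become the hard part.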
But this is gonna cause problems, because for a simple example like "go get one student," that's easy to do. But if I wanna start doing aggregations across multiple students and start joining things together, not only do I have to traverse the hierarchy of every single record in my query, but now I need to make sure I can join across these multiple records. So the queries to do complex things end up being more difficult than what you would have written using SQL. The other issue is that, as I said before, there's no standard API. Even today for JSON databases, there's no standard query language that people use to interact with these document and object-oriented databases. MongoDB and Redis are sort of becoming the de facto standards, but there are enough document databases and object databases out there that do something completely different. Whereas with SQL, again, the dialects are slightly different, but they're still similar enough that you can understand them pretty easily. All right, so now we get to the 1990s. I typically call this the boring days of databases. It's not to say that people weren't putting out new database systems; it's just that there wasn't any major advancement or major change in how people were designing database systems, or in the applications people wanted to run. Yes, the internet was coming around, but in the early days there weren't that many people on the internet, and the complexity of the applications that people were building and exposing through the web was not as sophisticated as it is now. So existing databases were sufficient. The four major events I always like to mention during this period: first, Microsoft bought a license to the Sybase source code, forked it, and ported it to Windows NT, and that became SQL Server.
SQL Server today is still a state-of-the-art database system, and I don't think that much of the Sybase code is still there, but that was their starting point. Sybase is sort of in maintenance mode now, though it still makes a lot of money, whereas SQL Server I consider to be a state-of-the-art system. The next major event was that MySQL came about in the 1990s as a replacement for mSQL, and it's widely used today. Postgres added support for SQL; it was based on the academic code that was developed at Berkeley in the 1980s. Postgres actually got forked, and there was a commercial version called Illustra that supported SQL and was eventually bought by Informix and sort of died off. But two grad students at Berkeley in the 1990s took the academic version of Postgres that only supported QUEL and added support for SQL, and that's why the official name of Postgres is PostgreSQL: the SQL was added after the original Postgres code was written. And then SQLite was started by one dude down in North Carolina, and it's the most widely used embedded database, and maybe the most widely deployed database in general, because it's used everywhere. So now the big change that happened in the 2000s was the internet. The number of users online grew significantly, and the size of the databases and the number of concurrent operations that these systems needed to support were way larger than what existed before. The problem was that all of the major enterprise database systems, the Oracles, the Sybases, the DB2s, were all very heavyweight, and they were expensive. And the open source systems like MySQL and Postgres at the time were missing important features that you would want in a database system; you couldn't quite use them in production safely. For example, MySQL didn't support transactions until InnoDB came along in the early 2000s.
So what ended up happening was that, to support these larger workloads, a bunch of companies like Facebook, Google, eBay, and Amazon ended up writing their own custom middleware that would sit in front of multiple single-node database instances to allow them to shard out and scale. And that would allow them to support a larger number of concurrent users and ingest data more quickly. But now, if you have a lot more data and you wanna do analytics on it, you need a specialized database system to actually support those kinds of workloads. So this was the rise of data warehouses. These were special-purpose database systems built in the 2000s that were designed for these much larger datasets. That's not to say that the existing database systems at the time couldn't do analytics; they were jacks of all trades, not designed specifically for it, whereas these newer systems were designed just to do analytics efficiently. The six most famous ones at the time were Netezza, ParAccel, MonetDB, Greenplum, DATAllegro, and Vertica. Of these, Netezza, ParAccel, and Greenplum were forks of Postgres: they took the Postgres source code and hacked it up to make it support analytics more efficiently. The other thing about these systems that was different from the existing ones is that they were primarily distributed and shared-nothing, because you wanna be able to scale out to do large-scale analytics. And unfortunately, most of them were also closed source. But the key thing about them that makes them different from what was done before is that these were all column-store databases. Column stores have existed since the 1970s, but it wasn't until this time, with these special-purpose data warehouses, that people recognized that the decomposition storage model of a column store was superior for doing high-performance analytics. So now we get into the late 2000s and the NoSQL movement.
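A toy Python sketch of the decomposition (column-store) idea, using a made-up three-column table. Real systems layer compression and vectorized execution on top of this layout, but the core contrast is just where the values live:

```python
# The same tiny table, stored two ways. Values are invented.
rows = [
    (1, "2008-01-01", 10.0),
    (2, "2008-01-02", 20.0),
    (3, "2008-01-03", 30.0),
]

# Row store (n-ary storage): an aggregate over one attribute still has
# to step over every field of every tuple.
total_row_store = sum(price for (_id, _date, price) in rows)

# Column store (decomposition storage model): each attribute lives in
# its own contiguous array, so an aggregate scans only the one column
# it needs -- better cache/IO behavior, and each column compresses well.
columns = {
    "id":    [r[0] for r in rows],
    "date":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}
total_col_store = sum(columns["price"])
```

Both layouts compute the same answer; the column layout just reads a fraction of the bytes for analytic queries that touch few attributes, which is why the data warehouses above all adopted it.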
So how this all started, my take on it, is that companies like Google recognized that for the kinds of applications they wanted to support on the web, they cared more about making sure the database system was always available and could scale to support a large number of users. So they ended up forgoing traditional things that a database system provides, like transactions and joins and SQL support, in exchange for other data models and proprietary languages or access APIs. A bunch of these systems were essentially clones of what Amazon or Google had done, but other ones came out of somebody building a data system at a company and recognizing that it solved a particular problem that was useful for other people. Of these, the most famous ones are probably MongoDB, Cassandra, and DynamoDB. Most of these are open source, except for DynamoDB, Oracle NoSQL, and Bigtable. But again, the main thing is that they were called NoSQL because they didn't support SQL at the time, although a lot of them do now, because they were focusing on high availability more than correctness. As a response to the NoSQL movement, there was another database movement that I was involved in called NewSQL. This is where we were trying to build databases for transactional workloads that could have the same performance and scalability as a NoSQL system, but without having to give up transactions, the relational model, or SQL, because these are useful things to have in a database. Unfortunately, at least when the NewSQL systems originally started, most of them were closed source. The kind of cool thing now is that in recent years, with CockroachDB, TiDB, and YugabyteDB, some of the newer NewSQL systems are actually open source. So that's pretty exciting.
Spanner at Google is probably the most famous one out of all of these, and they do transactions using special atomic clocks (TrueTime), which we're not going to discuss this semester. All right, so the 2000s and 2010s were the rise of these specialized database systems. Instead of building a single general-purpose database system that tries to do everything, you build ones that can do transactions efficiently, or ones that can do analytics efficiently, but not the other. But then in the later 2010s came these HTAP systems, or hybrid systems, trying to get the best of both worlds: trying to be a NewSQL system where you could do transactions very quickly, but also support some of the analytics that you would otherwise only be able to run on a data warehouse. And the idea here was that, instead of having to maintain two databases, one for transactions and one for analytics, you could push some of those analytical operations directly into the front-end system and extract new knowledge from the data as it arrives in the database system. Of these, HANA, Splice Machine, and MemSQL are probably the most famous ones. HyPer originally started as an academic system, but it got bought by Tableau a few years ago and is now a commercial system that comes with Tableau's software. HyPer is actually an amazing system; we're going to discuss it multiple times throughout the entire semester. And later this week, I'm actually going down to the HyPer headquarters to try to go through, you know, a German séance, get some German influence in my life to help me build a better database system. So look forward to that in the next class. Some of these are open source, some of these are closed source, but all of these support the relational model and SQL, because that's the main thing people want, especially for analytics.
The other big thing that happened in the 2010s was the advent of cloud database systems. These are database systems designed specifically to run in a cloud environment. How these first started was that the major cloud vendors would sell you a database-as-a-service that was just an existing database system like Oracle or MySQL or Postgres running inside a container or VM for you. But since then, there have been newer database systems designed specifically for running in a cloud environment, or shared-disk environment, which I'll show in the next slide. These systems are not necessarily built from the ground up, but the system assumes it's running in a cloud environment and therefore takes into account the latency of accessing disks that may not be local to the machine where the compute is actually running. Of these, Snowflake, Redshift and Aurora from Amazon, Spanner and Bigtable from Google, and Cosmos DB are probably the most famous ones. Xeround was a cloud version of MySQL, based out of Israel, that went under; they failed a few years ago. FaunaDB is a serverless OLTP system, and SlicingDice is a serverless cloud database service based out of South America. There are a bunch of other ones; these are just the main ones I'm showing. As part of the cloud database trend, there's this new architecture that assumes a shared-disk environment. That means that instead of writing a custom storage manager in your database system that accesses a local disk, you use the existing interfaces or APIs that the cloud vendor or a distributed file system provides you, and use that as the underlying data store. So think of it as, instead of writing to a proprietary local file managed on the local disk, I write my data out to HDFS or Amazon's S3 or EBS.
In these environments the storage layer is usually append-only, and so these systems typically favor a log-structured approach, because you just need to keep appending to the log. This is what most people think about when they talk about building a data lake: it's usually something like this, where you have a bunch of files on a distributed file system or distributed object store, you point the database system at it, suck the data in, and run your queries on it. So again, although we're not gonna explicitly talk about shared-disk systems or distributed databases, in these systems you still have to bring the data from the shared disk into a compute node, and at that point it's in memory, and you're gonna do all the things that we're gonna talk about throughout the entire semester, right? So even though we're not explicitly covering distributed databases, the techniques we'll talk about here are still used in them. The other cool thing about shared-disk systems is that because we've now separated the compute layer from the storage layer, I can scale them out independently, because the execution layer is essentially stateless; the authoritative state of the database is always on the shared disk. So if I can't process queries fast enough, I can just keep adding new compute nodes and start running more queries in parallel. Likewise, if I need more storage capacity, I can just add new disks at the shared-disk layer. All right, so I briefly wanna talk about graph databases, although it's starting to rain here. Every year people ask me why I don't cover graph databases in the class. And the answer is that, from the low-level systems perspective that I care about, I have not seen a compelling argument for why you'd wanna use a graph database. The main advantage that they have over existing systems is that they provide you a graph-centric query API.
What that means is, if I wanna do a graph traversal, instead of going back and forth between the application and the database as I walk the graph, I have a single command that does the traversal, and it all runs on the server side. But there's nothing specific about that that you couldn't do in a relational database that's storing a graph. And in fact, there's research from a few years ago, around 2015, that shows it's actually not clear that using a specialized graph-centric database system does any better than an existing relational database system with a graph API on top. So for that reason, we're just not gonna talk about graph databases this semester. The other major trend this decade, in the 2010s, is the advent of time series databases. Again, these are more specialized database systems that are specifically designed to store time series or event data. Think of a monitoring service where you're collecting the CPU utilization from every machine in a data center once a second. All these events get sucked into a time series database, and it can do analysis on them to find whether there are any issues. The reason I bring up time series databases is that they're a good example of how, if you make assumptions about what the data looks like and the workload pattern you're trying to execute, then you can apply additional optimizations at the storage layer and the execution layer. You could make the same argument about graph databases; it's just that that's at the execution layer, or maybe the planning level, while at the low-level storage layer the relational model can still do the job. So time series databases are still relational; it's just that they know what the queries and access patterns are gonna look like, and so they can optimize for them.
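Going back to the graph point above, here's a sketch of a graph traversal done entirely inside a plain relational database: a recursive SQL query over an edge table, run server-side in SQLite. The graph itself is invented for illustration:

```python
import sqlite3

# A graph stored relationally: one row per directed edge.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge (src TEXT, dst TEXT);
    INSERT INTO edge VALUES ('a','b'), ('b','c'), ('c','d'), ('b','d');
""")

# One recursive query finds everything reachable from 'a' -- no
# per-hop round trips between the application and the database.
reachable = sorted(r[0] for r in conn.execute("""
    WITH RECURSIVE walk(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edge AS e JOIN walk AS w ON e.src = w.node
    )
    SELECT node FROM walk WHERE node <> 'a'
"""))
```

The `UNION` (rather than `UNION ALL`) deduplicates visited nodes, so the traversal terminates even when multiple paths reach the same vertex, which is the kind of server-side traversal a graph-centric API would give you.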
Of these, the one that's probably the most interesting to us this semester is ClickHouse, out of Russia. I'm gonna spend a little time talking about how they do things, because it's actually a super state-of-the-art system that does a lot of the things we're gonna talk about this entire semester. And it's open source, so that's very cool. All right, so again, the major trend, I would say, of the last decade is the rise of these specialized systems. No longer is it just Oracle or SQL Server or DB2 trying to do everything for everyone. There are enough database systems out there that are better for different classes of workloads. Again, a lot of these are still relational; it's just that how they actually implement things can be tailored towards a particular application environment or domain. There are a bunch of embedded databases that are interesting that we're not gonna have time to talk about. Multi-model databases are systems that try to do graphs and documents and key-value stores all within a single database. Blockchain databases are a thing, apparently. Again, I don't find them very interesting or compelling; maybe I just don't understand them enough to say whether there's a good reason for them. So we're not gonna talk about them. And there are also hardware-accelerated databases, which we'll talk a little bit about at the end of the semester: databases not just running on the CPU, but on FPGAs, GPUs, or new non-volatile memory. So the main takeaway you should get from all of this is that there are a ton of database systems out there that all do things differently. And sometimes they make good decisions, and sometimes they don't.
And so the goal of this semester is to understand the trade-offs of all the different design points of building a database system, so that when you go out into the real world and someone comes along and says, "hey, I have this great new database that you should give me money for or start using," you can make an educated evaluation about whether that's actually a good idea or not. All right, so here's what I think's gonna happen in the future. I think that the databases that start off very specialized will, over time, as they get more users and see adoption, expand the scope of the types of problems they're able to support. So maybe the time series databases don't do transactions very well, but eventually they'll add transactions; or maybe a document database doesn't do analytics very well, but eventually it'll do analytics, whatever. Some databases people will just stop using, and they'll go into maintenance mode and then die off. But the ones that do thrive, I think, will just get better over time. It is also my opinion that the relational model and declarative query languages are a boon for doing high-quality data engineering in an organization. Instead of just having a bunch of Python scripts that are hacking up the data, if you have a way to programmatically define what the data looks like and operate on it all in a structured manner, then that makes it easier for other people to reuse what you're doing. It may not be the fastest way for you personally, in terms of actually writing the software, but for an organization that has a lot of people touching data, I think this is the better approach. That's just my opinion. All right, so next class, we will do the introduction to in-memory databases. I'll contrast them with disk-oriented databases and try to understand why they're different and how they're different.
And that'll be the underlying assumption we make throughout the entire semester. And for Wednesday's class, you also have to submit your first reading review. So go check the course schedule, see what paper you have to read, and then submit that on the link below, okay? So it's like, I don't know, 10:30 at night. It's getting kind of smoky here. Let me go figure out where 187 Owen is and where we're staying, and then hopefully I can record the next class when I'm down in Germany at the HyPer headquarters. All right guys, take it easy. See you. Bank it in the side pocket. What is this? Some old Porsche. Ayo, ayo. Took a sip and had to spit because I ain't quit that some old cheap ice cube.