Yo, hey, yo, hey, yo, yo. Pack the crumbs down, fly like Mrs. Jones. Lyrical mathematics will have the devil's smoke and stone. I put heads to bed, make shots, and wrap myself in. So we're going to start with a quick shout-out to everyone who's been down with us since the beginning. I've got two people in New York City, EZ Evan, and then Brooklyn DBA, who's the finest. J.O. in Seattle, and Ip of the Greek. These are people that have been helping us out and staying true to us since the beginning. And of course, I want to thank Carnegie Mellon for not firing me for another year. That's always important. The core sponsor for this semester is going to be Amazon. Amazon is the largest database vendor in the world in terms of market cap and how much money they're making. They make all the money on the store, make a lot of money on AWS, but they make a ton of money on databases: Redshift, Aurora, RDS, DynamoDB, they have a ton of stuff. Usually Oracle was number one, and Microsoft was actually number one for a while too. Whatever, they're making a ton of money. They're helping us with the course logistics and sponsoring us, so we're really appreciative of that. And then at the end of the semester, somebody from the Redshift team will come and give a tech talk about the system that they're building. We will discuss Redshift throughout the semester. A lot of the techniques that will be described during the semester, they're actually implementing in their system. All right, so I quickly want to go over the course logistics and what's expected for the semester, but I've left most of the time for the history of databases, because I find that part more interesting. And as I said in the email I sent out over the weekend, it'll be based on the two readings.
Again, it's not meant to go deep, deep into the internals of database systems, just to give you perspective on what the landscape looks like and why we're spending all this time talking about relational databases and not other things. So a lot of you guys are here for obvious reasons. You want to take this course, but just as a final pitch, I'll say: the things that we talk about this semester for database systems, and particularly for analytical database systems, these are hard problems. Not everyone can do it. And companies pay a lot of money for students coming out of this class, and for people from other places as well who have experience working on database systems. So if you're just a random JavaScript programmer, they're not gonna have you touch the database internals. The same way, they're not gonna have you touch the kernel in an operating system. They don't want people off the street. They want people who understand fundamentally how these systems are working. These are the things that we're gonna cover. This is not a full list, but these are just some of the former students that have worked with us and taken this class, and places that they've gone. These are the ones I had photos for. There's way more that I'm missing, but these are the ones I could quickly find. And yes, three of them are with me at my startup, OtterTune. Some of the best ones as well. All right, so unlike in previous years, where we tried to cover maybe in-memory database systems and spent a little time discussing transactional systems, this semester we're only gonna focus on analytical database systems. Just because this is sort of what the hot thing is right now, and there's a lot of money sloshing around, a lot of systems being built. And we want to understand what the state of the art is, and how we got to the point where we are today.
So the goal for you guys is that not only do you become aware of what these systems actually are and what their key features are, but you also come to understand the trade-offs between one system's design or implementation versus another. You'll hopefully also become proficient in writing high-quality systems code, and doing documentation and testing and doing code reviews. These are soft skills that are gonna be important when you go out into the real world and do systems development, and it's not something where there's a class that says, here's how to write documentation, here's how to write test code. It's just things you sort of have to pick up as you go along. So the projects will be designed such that you'll get exposure to these sorts of best practices for working on database systems. And this course is only gonna cover state-of-the-art topics. So I'm assuming everyone has taken a database class, either last semester (15-445/645) or in your undergrad. We're not gonna go over the basics of what a join is, right? We assume you know how to do a hash join; we'll talk about how to do it in parallel and how to make it run fast. So the topics that we're gonna cover: we'll first start off talking about storage models and compression, and how you actually represent data in files on disk. Then we'll talk about how to do query execution using modern techniques like vectorization and compilation. Then we'll talk about modern join algorithms, and networking protocols for getting data in and out of the database. And then we'll spend a little time talking about how to do query optimization, going much deeper than we were able to cover in the intro class. And if you look at the schedule, the last four or five lectures are actually targeting specific database systems. Like, there's a whole lecture on Snowflake, a whole lecture on Databricks, a whole lecture on BigQuery, or Dremel.
And the idea is that we wanna take all the things that we discussed throughout the semester, understand the basics of them, and then look at a real system that implements all these things and try to understand: why are they doing this a certain way? What are the benefits? What are the disadvantages of the approaches that they're taking? So again, I'm assuming you've taken an intro database class, so I'm not gonna cover the basics of SQL and other things. The website is up to date. I haven't updated the homepage because I tried to use DALL-E to make, like, a Last Supper image with me as the person in the middle. It didn't work out very well. I gotta work on it. Anyway, the website is up to date, so the syllabus and the schedule are all there. For the most part, it's stable. So if you want to understand what's expected of you in the course, please go look at that. And of course, I'll say this a couple of times in the beginning, right? For academic honesty: again, this is the advanced-level class, and I assume everyone here is smart. Therefore, you don't need to cheat. If you do cheat, we will take you over to Warner Hall and deal with that, okay? And when in doubt, if you're not sure about doing something, like, hey, I have this little stupid piece of code that somebody wrote on GitHub, can I use it in my project? If you're not sure, then just ask myself or the TA, okay? All right, my office hours will be Mondays and Wednesdays, the hour before class, upstairs on the ninth floor, and we can talk about anything you want: any of the projects, the papers, how to get a database job. I have a bunch of database shirts in my office that companies send me. Please come and get some, because I'm getting to, like, hoarder level, because they sent me a bunch during the pandemic. I have a bunch of boxes I gotta start getting rid of, so come talk to me. All right, we have one TA, Wan, and he's awesome as f***.
So he's my third-year PhD student, this is true. He is a former paralegal for, like, a sketchy law firm. He is a certified chicken farmer, and he is the number one ranked PhD student at Carnegie Mellon, for databases, to be very clear. So again, he'll help you with the projects, he'll be a part of the conversations as we go along through the semester, so by all means leverage him if you have questions. So the expectations are three things. There's reading assignments, projects, and the final exam. Every class will have one assigned reading, except for today's class, and they'll be indicated by the icon here. I'll post on Piazza what's actually expected in these reading reviews. But the main idea I want you to get out of this is: read the paper, understand what they're doing, and understand how all the things we've talked about fit together in a higher-level system. Then I'm also curious to understand: what are the workloads that they evaluate the implementation on? For analytical databases, it's most likely going to be TPC-H and TPC-DS over and over again, but it's good just to know what these things are, so that when we get to project one, project two, or actually project three, you'll have an idea of what workloads you can use based on the papers you've read. And there's a Google form here that is live, and there'll just be a dropdown and you fill that out. Yes? It should be three, you can skip three readings. Sorry, I'll fix that. Thank you. Again, all these reviews have to be your own writing. Some of these papers are pretty state of the art, like they came out last week, so there's probably not going to be a review on the internet you can use. You could try ChatGPT, see what happens. All right, so project one, we'll post next week, but you'll basically be writing a foreign data wrapper for Postgres to process a columnar data format like a Parquet file.
The idea here is just to expose you to how to vectorize execution in the context of Postgres. We don't worry about doing SQL parsing, we don't worry about doing query planning; let Postgres handle all that. And the idea here would be that you'd be able to see the trade-offs between your sort of custom engine on the columnar data versus Postgres' row-based iterator model. So again, if what I'm saying doesn't make sense, then that's not good, because we covered these things in the intro class, so make sure you at least go back and look at them and understand what I'm talking about. Again, we'll post this next week. Project two is gonna be writing an article for the encyclopedia, so we have an encyclopedia of databases. It's every single database that I'm aware of, and the idea would be you go look at the documentation, go read about, maybe even run these systems, go talk to the developers, understand how they're implementing the different parts of the system that we care about, and you'd be writing an article for this, okay? Of course, don't plagiarize for this, and avoid marketing language. Sometimes I give the companies access to the website and they'll say, like, you know, my database is the fastest one ever, right? I'm like, we have to remove all that. Only say things that you can back up with specific claims, right? We care about the internals, not how people feel about the database, if that makes sense. All right, project three will be a group project, and the idea here would be some larger topic that you're interested in that's based on the things we're discussing throughout the semester. It doesn't have to be in Postgres, but if you want it to be, it can be. And the idea here is that you just wanna build something that exhibits some understanding or mastery of the materials we discussed throughout this semester, okay? Again, I'll post some topics as we get closer. You won't have to decide until later, like March, what exactly you wanna do.
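To make the project-one trade-off concrete, here's a minimal sketch, in plain Python rather than anything from the actual project code, contrasting a tuple-at-a-time iterator (like Postgres' row-based model) with vectorized execution over a columnar layout. All names and data here are hypothetical.

```python
# Hypothetical illustration of tuple-at-a-time vs. vectorized execution.
ROWS = [(i, i * 2.0) for i in range(10_000)]  # row layout: (id, price) tuples

def iterator_total():
    # Tuple-at-a-time: the engine pays per-row processing overhead,
    # touching one full tuple per step.
    total = 0.0
    for _, price in ROWS:
        if price > 100.0:
            total += price
    return total

PRICES = [price for _, price in ROWS]  # columnar layout: one array per column

def vectorized_total(batch_size=1024):
    # Vectorized: operate on a batch of a single column at a time,
    # amortizing interpretation overhead across the whole batch
    # (real engines would also use SIMD and cache-friendly scans here).
    total = 0.0
    for start in range(0, len(PRICES), batch_size):
        batch = PRICES[start:start + batch_size]
        total += sum(p for p in batch if p > 100.0)
    return total
```

Both compute the same answer; the point is how many rows each "call" into the engine processes, which is exactly the difference you should observe between your columnar engine and Postgres' iterator.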
I have some ideas of things that I wanna do that could potentially turn into a short paper or, like, a capstone project. Again, we can discuss these as we go along. Again, don't plagiarize, and all the code you submit to us has to be yours, unless we discussed it and it's attributed. Same thing for the encyclopedia: please don't just copy from random things on the internet. One year somebody copied from Wikipedia, and they had, like, the text and then the square bracket and then the number for the citation. So we had to go deal with that, okay? Again, the reason why I keep showing this is because this is on video now, so if you guys do plagiarize and I gotta turn you in, I show them the video link on YouTube, like, okay, here's me discussing this, and you're screwed. All right, there'll be a final exam, but it'll be take-home, long-form questions. I did try ChatGPT and it was not able to answer them. So who knows how much better that'll get. I wonder whether it's a limitation of ChatGPT or whether my questions are just terrible and it can't parse them. Yes? Yeah, it said, like, "multi-version concurrency control is important because it does multi-versioning." Like, thanks. That's what it told me. Okay, all right, the grade breakdown is like this. Again, this is on the syllabus. No big surprises here. And for the course mailing list, everything will be on Piazza. If you have any technical questions about project one, and potentially project three as we go along, please post on Piazza. Don't email Wan or myself directly. And for anything else, like, you know, health issues or whatever you have going on, please do email us directly, okay? All right, any questions about this? All right, let's get to the good stuff: databases. Yeah, one year somebody was complaining on YouTube, like, you spent 25 minutes discussing the logistics of the course. Get to the good stuff. So here we are.
All right, so as I said, this is sort of my abridged history of the last 50, 60 years of databases. We're not gonna go too deep into any one topic, just give you an overview of the lay of the land. And I think what's actually really interesting about this is that databases are a hot area right now, in both research and in industry. And yet they're, what, 60 years old? The first one's gonna be from the 1960s, right? It just shows you how important they are, and how hard the problem is, because it's clearly not a solved problem. And so this lecture is sort of derived from two papers. The first one is called What Goes Around Comes Around, written by Mike Stonebraker and Joe Hellerstein in 2006. And it's basically Mike's assessment of the database industry and how he was right for the last 40 years. Mike won the Turing Award in 2014, so I would agree with him. And then the other paper, the draft I sent you, Mike and I wrote last year, where we basically looked at where that paper left off and said, what happened in the next 16, 17 years? And this was actually triggered by me, because there was some post on Hacker News where somebody was like, I don't know why people keep using relational databases, graph databases are the way to go. And I was like, all right. So I emailed Mike and was like, let's write the new one. So the major takeaway I want you guys to get from this is that, again, even though databases are really old, the concept is really old, they're highly relevant today, because at the end of the day, every application, what are they essentially doing? They're exposing some user interface for either a human or another machine or something to interact with a database. And so what's gonna be interesting is that a lot of the things we'll cover this semester are not new ideas. They're not new concepts.
I said this in the intro class, and I'll keep saying it over and over again throughout the semester: IBM did a lot of this stuff in the 1970s. Just saying, you know, obviously the hardware was much different, the landscape was much different, but a lot of the things we'll talk about will be just modern incarnations of what IBM invented 50 years ago. We're gonna spend a whole lecture on query compilation, right? How to take a SQL query, turn it into a query plan, and then convert that query plan into machine code. IBM did that in the 1970s with System R. They did it in assembly; now we'll use LLVM. But again, the techniques are not new. The other big thing that's gonna happen, and you'll see this the rest of your life, is that every 10 years, somebody's gonna come along and say, hey, SQL is stupid, the relational model, that's slow, right? We can do it better. Here's my new database system that doesn't use the relational model or SQL, right? And everyone's gonna get excited: oh, this is the future, SQL's old, SQL's busted, we don't wanna use that. And then, lo and behold, people realize, oh yeah, the relational model was a good idea, or SQL is a good idea, and either that new invention fails, or whatever good ideas that new concept or new database system has always get adopted by the SQL standard and by the relational model. Then that thing goes away, all right? I've seen this in my own life: NoSQL was the hot thing. People said, oh, SQL's slow, SQL's stupid, the relational model's stupid, you wanna do documents, you need JSON. And everybody built all these JSON databases, right? And now we're at the point where, okay, maybe that's a bad idea for everything, JSON's good for some things, so the relational model now supports JSON. And all the NoSQL systems that said SQL is a bad idea, every one of them, except for Redis, supports SQL, right? Mongo added SQL support last year, right?
So you're gonna see this as a theme as we go along: at least once the relational model gets invented, every 10 years people will say it's stupid, and then it turns out it wasn't. Mark my words, you'll see this the rest of your life, I guarantee it. All right, so let's start at the very, very, very, very beginning: the 1960s. So to the best of my knowledge, at least what is considered conventional wisdom, the very first database system, or sort of general-purpose database system, was this thing called the Integrated Data Store, or IDS. And by general purpose, what I mean is that it was a database system designed to support arbitrary data sets or databases, right? Obviously, you could write a little Python application yourself that reads and writes files for exactly one database; that's not what I mean by this. Those things sort of existed, but this was one that was built specifically so that you could then reuse the system for another application domain or another customer. So GE built this for, it was like some timber company out of Seattle that had a huge, huge inventory tracking problem that they needed to deal with, and so they built this thing called IDS to handle that. And then they ended up spinning it out of that custom solution for the timber company and tried to sell it as a standalone product. GE was trying to be a software vendor. Is GE the hot thing in computers right now? No, right? A huge mistake they made is that they had this company policy at the time where GE said, if we're not number one in some industry, we don't wanna be in it at all. They were, like, the number three computer seller. That wasn't good enough for them. So they sold off their computing division to Honeywell, right? So then Honeywell owned IDS, and they were sort of selling it for a while.
So there are two key things about IDS that are gonna be bad ideas that then get fixed in the relational model. The first is this network data model, and I'll explain what that is in a second. And the other one is gonna be this notion of tuple-at-a-time queries, meaning, like, I'm gonna write basically for loops in my program to iterate over one tuple at a time and do something. Right, whereas as we know, in SQL we operate over bags or sets, right? We declare the thing we actually wanna do, and that can then apply to multiple tuples. And that's gonna be way more efficient. So IDS didn't really take off as far as I know, but what actually came out of the project was this thing called CODASYL. I should do a survey before the class starts: who's ever heard of CODASYL? Who, not from the previous class? One of you, okay, yeah. So CODASYL was gonna be the hot thing in databases. This was the thing that everyone was gonna build their databases on, and obviously we don't. So there was this guy, Charles Bachman, who worked on IDS, and he saw the need for having a standard way for COBOL programmers to interact with a database system. And so they proposed this data model and this query API called CODASYL that incorporated the network data model and the tuple-at-a-time query model from the previous slide, from IDS. And they said this is gonna be the standard going forward. So then Bachman left, and he worked at this company called Cullinane, which has been bought and sold over many, many years. And he helped build a new version of a network data model database based on CODASYL, called IDMS. And this thing is actually still around today. Obviously, if you're a brand-new startup, you wouldn't use this, right? This is for, like, legacy applications. So the network data model is gonna look like this. Say I have this example where, say, I'm NASA. This is actually a real example.
I'm NASA and I'm building the rocket to go to the moon, right, the Apollo moon mission. It's a huge engineering project. I need to keep track of all the parts that are going into my rocket, and what manufacturer or what company can provide them for me, right? So they would have this database where you have suppliers and parts, and then you keep track of what supplier can provide certain parts, at what price, in what sizes, and so forth. So under the network data model, you would have the high-level entities: you'd have a supplier, right, the name of the company that can provide a part or parts, and you have the part that you need. And then you have this other set that says, for this supplier, they will supply this part at this price. So you would have these high-level entities, but then you would also implicitly have these sets here, what I'm showing in italics, right? You would have the supplies and the supplied-by sets. So you would say, I have a supplier, and there'll be a supply that's in the set of supplies. So now if I want to find all the parts that a supplier supplies, I would have to do a bunch of nested for loops: loop over every supplier here, loop over all their supply sets, get the supply tuples, then reverse and go back up the other way. You basically write a bunch of these nested for loops, and you're doing this over one tuple at a time. Another problem is gonna be, well, it would actually look like this if you have an instance of it, right? You have a supplier, a supply, and some parts, and then you would have these auxiliary cross-reference tables where you keep pointers to the actual objects. And these red lines I'm drawing here, these are actually pointers, like physical pointers: here's the location of this record on disk or in memory, right?
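The tuple-at-a-time, pointer-chasing navigation just described can be sketched roughly like this. This is a hypothetical illustration in Python, not anything from IDS or IDMS; all class and field names are made up.

```python
# Hypothetical sketch of navigation in a network (CODASYL-style) model:
# records reference each other by direct pointers, and the application
# walks them one record at a time with nested loops.

class Part:
    def __init__(self, name):
        self.name = name

class Supply:
    def __init__(self, part, price):
        self.part = part       # physical pointer to a Part record
        self.price = price

class Supplier:
    def __init__(self, name):
        self.name = name
        self.supplies = []     # the "supplies" set: pointers to Supply records

battery = Part("battery")
acme = Supplier("Acme")
acme.supplies.append(Supply(battery, 9.99))

def parts_supplied_by(supplier):
    # "Find all parts this supplier supplies": chase pointers one tuple
    # at a time. If any pointer gets corrupted, the traversal is broken
    # and there is no other way to recover the relationship.
    found = []
    for supply in supplier.supplies:
        found.append(supply.part.name)
    return found
```

Every query is a hand-written loop like `parts_supplied_by`, which is exactly the complexity and fragility the relational model was designed to eliminate.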
So again, now when you're looping through, trying to find all the suppliers that supply a certain part, you'd have to look at the first one, then do a lookup to find the record in this entity, I don't wanna call it a table because it's not, and then you would follow the pointer to jump here, and then potentially you'd have a pointer to go back in the other direction, right? So the queries are very complex, right? They're all written in COBOL, and it was less efficient because, again, you're operating over a single tuple at a time, and it was easily corruptible. The hardware was crappy back then, so if this supplies or supplied-by collection, these pointers, gets corrupted, your whole database is hosed, because now you have no way to traverse and find things and understand the relationships between these objects. All right, so the next big system in the 1960s was this thing called IMS. And this actually was built for the Apollo moon mission. So IBM was responsible for building out a database to keep track of all the parts NASA was buying to build the rockets. And again, just like with GE and IDS, they built a sort of custom database system for the project, then realized it was useful for other customers, and they spun it out as a separate product. And this thing still exists today, and IBM makes a ton of money on it. Like, if you've ever used an ATM machine or anything at a bank, a lot of them are still using IMS, because they set it up in the 70s, and if it's not broken, you know, don't fix it.
So IMS is going to use what's called a hierarchical data model, and just like the network data model, it's going to use tuple-at-a-time queries. Another big thing is that it's going to support programmer-defined physical data structures, meaning, like, if I have a collection of data, like a table, they wouldn't call it a table, but it's a table, I would actually tell the database system I want to store it as a hash table, or I want to store it as a B+tree. And then based on what I told it I wanted to store it as, you then got a different API to allow you to traverse the data. Because you can't do range queries on hash tables, but you can on B+trees. So going back to our example here, we have a supplier now, and I'm going to edit it, because I need to plug my clicker in. That's pathetic. All right, so I'd have a supplier and now they have parts. All right, that looks a little bit better, right? Because now I don't have to have all this extra stuff. But I do have an implicit relationship, sorry, explicit relationship, between the supplier and the part, right? A part can only be supplied by one particular supplier. So now if I go to my instance, right, say this first vendor, they supply batteries, and I have to have a whole record for that. And then I would have to have another record for the second vendor, who's supplying maybe the same batteries but at a different price. So let's say now I change the name of the part from batteries to a brand-name battery, whatever, I want to change this field here. I gotta go through and change every single record instance where the batteries exist, right? So you're basically repeating information, because you couldn't have one part be supplied by multiple suppliers in this model.
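The duplication problem just described can be sketched like this. Again, this is a hypothetical Python illustration of the hierarchy, not IMS itself; the record layout and names are made up.

```python
# Hypothetical sketch of the hierarchical model's update anomaly: each
# supplier owns its own part records, so the same logical part is duplicated
# under every supplier that sells it.

database = [
    {"supplier": "Acme",  "parts": [{"name": "battery", "price": 9.99}]},
    {"supplier": "Bolts", "parts": [{"name": "battery", "price": 8.49}]},
]

def rename_part(old, new):
    # Renaming one logical part means finding and changing every duplicated
    # copy in the hierarchy; miss one and the database is inconsistent.
    changed = 0
    for supplier in database:
        for part in supplier["parts"]:
            if part["name"] == old:
                part["name"] = new
                changed += 1
    return changed

changed = rename_part("battery", "AA battery")  # touches two copies, not one
```

In a relational schema there would be a single part record referenced by both suppliers, so the rename is one update.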
The other problem you're going to have, as I said, since the system exposes the ability to say how you actually want to store the data, hash table versus B+tree or whatever, is that if I change my mind and say, okay, I stored it as a hash table, but I want to do range queries, so let me switch it to a B+tree, you had no way of easily changing that. You had to actually dump the data out, then load it back in under the new data structure. Then you actually had to go update your application code, because now it exposed the B+tree API instead of the hash table API, right? So there was no independence between the physical layer and the logical layer of what the data actually was. All right, so this is sort of the, yes? The system didn't support it, yeah. With IMS now, in like 2023, I think they have a SQL veneer on top of it, right? And I think it still has the hierarchical model underneath. There were attempts to convert it to a relational model in the 80s that didn't pan out. But I mean, this was the 1960s, this is just the way it was, right? So this is the motivation for the relational model. So in the late 1960s, there was this guy, Ted Codd, who had just finished his PhD in math and was working at IBM Research, and he saw all of these IMS programmers spending a lot of time rewriting their application code over and over again because of this tight coupling between the physical layer and the logical layer. And he realized, which is actually quite prescient, that this is not scalable: humans at some point are going to be way more expensive than computers. At the time, computers were super expensive and humans were cheap. It's the opposite now, right? I can get an Amazon instance for, like, a fraction of a penny an hour, but a programmer's gonna cost me 200K. What? Well, now I'm curious. Are you laughing at the 200K? For a database programmer, that's a bit low, yeah. Okay, all right.
Okay, anyway, database programmers aren't cheap. But at the time, it was flipped. So he saw all these people wasting their time rewriting the application over and over again, and he saw the inefficiency of having this sort of tuple-at-a-time programming API. So the relational model has three key parts to it that, again, serve as the foundation for all modern relational database systems today. The first is that we're gonna store the database in simple data structures. So instead of this graph in the CODASYL network model, or this hierarchy under IMS, we're gonna store tables, relations, just these simple heap things. And if there are references to other tables, we just store that as another attribute in the relation. No need to have explicit pointers to anything. The second is that programmers will be able to access the database through a high-level language. So instead of writing these nested for loops, they'll be able to, well, he hadn't invented SQL yet, there wasn't a programming language like SQL at the time, but he had this idea: okay, there's a way to abstract away the actual physical structure of the database and to say, this is the answer I want, and then the database system can figure it out for you. And of course this also means that, because I have this abstraction between the physical layer and the logical layer, the strategy for storing the data physically on disk or in memory can be left entirely up to the implementation. Because based on what the queries want to do, it can decide, here's the best way to store your data. So the first paper on the relational model came out in 1969. That's the very, very first one. The one everyone usually cites as the de facto relational model paper came out in the CACM in 1970. But this was the very first one. So we go back to our example from before, of suppliers and supplies and parts.
It looks like this. It's essentially what the network model was, but now I don't have these explicit membership sets. And now when I store it in tables in a relational database, I have these foreign key references that are just attributes in the object, and the database system can understand that, you know, this supplier number corresponds to some supplier number in this other table here. So we'll get to SQL in a second, but this was a radical idea. SQL was a radical idea. Now we take it for granted because it's so prevalent. But back then, the criticism of the relational model was that there's no way, you know, a piece of software is going to write queries as efficiently as what a human can write. And this seems sort of strange, but it's sort of the same argument that people made back in the day that there's no way a compiler could generate program code that is more efficient than what a human can write. Sure, that's potentially true for, like, highly skilled embedded systems programmers, but nobody writes in assembly today, right? Everybody writes in higher-level languages, even higher up, like in Python, right? But back then, this was around when the C compiler came out, in the early 1970s. This was an insane idea. And of course, it turned out it was correct. Now, it depends on the implementation of the query optimizer, how good it is, and we'll see some papers that show how things can go wrong. And that's certainly a hard problem, but in general, a database system is going to generate a more efficient query plan than an average programmer could actually write. And it also now exposes the database system to people that maybe aren't programmers, like business analysts or people doing accounting and reporting, right? They're not hardcore programmers. They can write SQL, but they're not gonna write those nested for loop programs. So, Ted Codd put out the paper in 1970. Again, he was a mathematician.
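The suppliers/parts example can be written in modern relational terms like this. SQL itself came later than Codd's paper, so this is an anachronistic sketch; the schema and names are illustrative, using Python's built-in sqlite3 module.

```python
# The suppliers/parts/supply example as relations: foreign keys are plain
# attributes, not physical pointers, and the join is declarative, so the
# DBMS, not the programmer, picks the access path.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (sno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part     (pno INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE supply   (sno INTEGER REFERENCES supplier(sno),
                           pno INTEGER REFERENCES part(pno),
                           price REAL);
    INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Bolts Inc');
    INSERT INTO part     VALUES (100, 'battery');
    INSERT INTO supply   VALUES (1, 100, 9.99), (2, 100, 8.49);
""")

# "Find all suppliers of batteries, cheapest first" -- one declarative query
# instead of nested tuple-at-a-time loops over pointer sets.
rows = conn.execute("""
    SELECT s.name, y.price
    FROM supplier AS s
    JOIN supply   AS y ON s.sno = y.sno
    JOIN part     AS p ON y.pno = p.pno
    WHERE p.name = 'battery'
    ORDER BY y.price
""").fetchall()
```

Note also that one part row serves both suppliers, which is exactly the duplication problem the hierarchical model could not avoid.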
This paper here is actually not that hard to read, but when you read the criticisms of it, or contemporary conversations about it, people say, oh, it was inscrutable, it was heavy on math. It's not — but again, maybe for the 1970s, I don't know, maybe it was. He didn't propose a programming language in that paper. He later did, around 1974-75, this thing called Alpha, which didn't go anywhere. But it was all math at the time, and he didn't actually build a system to prove that his idea could work. So what happened was, a bunch of people saw his paper, said, hey, I think this is a good idea, and actually started building experimental systems to test it out. The very first system that I'm aware of that did this was this thing called the Peterlee Relational Test Vehicle, which sounds like a 1970s druggy band. These were people in the UK who basically read his paper and said, oh, I think this is a good idea, and built early prototypes — this was before SQL. They talked about how they could store massive data sets of, like, a thousand tuples. That was mind-blowing for them back then. It's usually the one people forget about, but as far as I know, that's the very first one. There are two other projects that came one or two years after: System R at IBM and then Ingres at Berkeley. These are considered the first two relational database systems that tried to build something real based on Ted Codd's work. I think Mimer SQL came out of Sweden maybe one or two years later, and it's actually a project that still exists today. And of course there's Oracle with Larry Ellison — we know about that, and we'll cover it throughout the semester. It was all happening in Silicon Valley. Larry Ellison basically copied what IBM did. He would literally call them on the phone and ask, hey, how does this work?
And they would tell him, because their researchers didn't know any better, and he went and copied it. And then Ingres eventually got commercialized out of the university in the late 70s, because people actually started really using it, because they understood the significance of the relational model. But, again, in the 1970s it was not clear. It could have been CODASYL. It could have been the relational model. Eventually, in the 1980s, the relational model won. So three things happened. First, IBM never commercialized System R. They could have, and they could have been the dominant player in the database marketplace, but they dropped the ball because they were making so much money on IMS. Why kill the golden goose with this new database system that may not work when you're making so much money on IMS? But eventually they saw the light, and they put out their first relational database system, called SQL/DS, in 1981. It had remnants of System R, but a lot of it was, I think, written from scratch. SQL/DS is still around — they renamed it; it's like DB2 for VSE, some mainframe system they wrote SQL/DS for. There are like five versions of DB2; it's hard to keep track of them. But what we consider DB2 today first came out in 1983. When it came out in '83, it was the shot across the bow in the database industry that said, okay, now IBM's serious about the relational model — this is a real idea. Ingres and Oracle were already in the marketplace, but it basically confirmed that the relational model was the way forward, and SQL became the de facto standard when DB2 came out in '83. And Oracle was at the right place at the right time, because IBM was the juggernaut in the computing industry. So when they said, hey, this is the way it's going to be, this is the language we're going to use,
everyone said, okay, yeah, if IBM says so, that's what we're going to do. And Oracle was right there saying, hey, we already support SQL, and off they went. Ingres had its own programming language called QUEL, which Stonebraker still claims is better than SQL. They eventually supported SQL, but by the time they added it, it was too late. So SQL was originally spelled S-E-Q-U-E-L, because it was supposed to be the sequel to QUEL — a play on words. Then they got sued for trademark infringement, so they renamed it to SQL. And then there was a standards body to figure out, okay, what should be the programming language we use for relational databases? And supposedly they were going to use QUEL instead of SQL, but Stonebraker didn't like standards bodies and decided not to submit any paperwork for QUEL. There's this paragraph from the unauthorized Larry Ellison biography — I don't recall what it's called, but it came out in the late 90s — where they basically talk about how they thought QUEL was better than SQL, but Mike hated standards bodies, so he didn't submit anything. Anyway, that's why we ended up with SQL instead of QUEL. On Hacker News you occasionally see people say, hey, I've invented a new version of SQL, a better version of SQL. And a lot of the things they end up fixing in SQL — like putting the FROM clause before the SELECT clause — QUEL already did that in the 70s. But we ended up with SQL. Sorry. Right, so Oracle basically wins the crown during the 1980s. There's a bunch of other startups that come along that do relational databases: Sybase, Informix, Interbase; Teradata was a data warehouse; Tandem got bought by Compaq. Basically the only ones still thriving today, I would say, are Oracle and DB2 — well, thriving is not the right word. All of these, like Sybase, still make a ton of money. But again, if you're a new startup, a new company, you wouldn't use them.
A lot of these systems are just in maintenance mode now, right? But Oracle is being actively developed, and same with DB2. Teradata is getting crushed by Snowflake. So Stonebraker — he commercializes Ingres, goes back to Berkeley, and starts a new database system called Postgres. If you've ever wondered why it's called Postgres: it's Post-Ingres. It's the system he built after Ingres. And instead of calling it a relational database system, he called it an object-relational database system, because object-oriented programming was the hot thing in the 1980s. And this is why Postgres was designed from the very beginning to be very extensible. You can have user-defined types, user-defined functions, and so forth, because they wanted to borrow some of the ideas of object-oriented databases, which will be on the next slide, and be able to extend Postgres very easily. So even today, technically, Postgres is an object-relational database system, but people mostly ignore the object part. All right. So where are we so far? 1970s: CODASYL is there, the COBOL way to program databases is there. The relational model comes out. People say, hey, this is a bad idea, CODASYL is the right way to do it. Eventually the relational model wins. And as all these relational databases come out in the early 1980s, about ten years later we end up with these object-oriented databases. This is that pattern I mentioned, where people come along every ten years and say, I have a better idea. So in the 1980s, people recognized that if application developers are going to use an object-oriented programming language — C++ was the hot thing in the late 1980s — there's this impedance mismatch, where the way the database system represents data as relations does not map cleanly into how object-oriented programming represents data.
So you'd have to write these SQL queries that would basically convert rows into objects with nested hierarchies. And so a bunch of companies said, well, this is kind of stupid. What if we just stored the objects directly in the database as objects? So there were a couple of systems: Versant, ObjectStore, O2. I think MarkLogic actually just got bought a few weeks ago. But these were the late 90s, and a bunch of these systems don't exist anymore — again, they're in maintenance mode. What killed them was that there was no standard query language you could use for object-oriented databases. Eventually they proposed OQL, the object query language, but nobody supported it, and it was too late. So basically the problem was that because you had this tight coupling between the database system and the programming language, it made your application less portable. It wasn't easy for you to switch to another object-oriented database system, because you were writing to their proprietary API — similar to how IMS exposed a proprietary API to its internal data structures. Now, SQL is supposed to be a standard, of course. For basic queries, yes, you can easily switch from one database to another. But every vendor has their own proprietary extensions. So even though there is a standard, it isn't a universal standard that everyone follows — you can make the same argument about SQL today. ORMs hide a lot of this. All right, so here's what it looks like. Say we have some application code and we want to store student information. A student has an ID, a name, an email address, and then a potential list of phone numbers. The way you would represent this in a pure relational model, meaning one with only scalar values, is you'd have a student table and then a student phone table.
So now, in my programming language, if I want to instantiate this object, I have to do a lookup where I either first query the student and then do a second query to go get all the phone numbers, or do a join and then make sure I throw away the redundant student information when I get the result back. So again, the object-oriented database people would say, this is stupid. Just store the nested objects together in a single record. Now it's only one fetch to the database, one read call, to bring this data in. The problem, though, is that for a simple example like this, where there's a clean one-to-many correspondence between a student and their phone numbers, sure, this is probably fine. In fact, in modern relational database systems, you could store the phone numbers as an array of strings — most systems will let you do that. The trouble is when you go back to that parts-and-suppliers example: if you start embedding or denormalizing all the parts that a supplier supplies into the supplier record, then I have the problem where if I need to update a field, I have duplicate information, and my application code needs to make sure it's all in sync — that I don't update some of the records but not all of them. So doing complex joins and maintaining data integrity becomes problematic with this approach. That said, there are some cases where you do want to store JSON, and again, Postgres and other database systems have a JSON type, along with methods to store JSON data efficiently in a binary form instead of just text. MongoDB does this with their own BSON data type; Postgres has something similar. So again, a lot of the ideas from these specialized or non-relational systems have now found their way into relational systems.
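Here's a small sketch of that impedance mismatch, again using Python's sqlite3. The schema and names are made up for illustration: rebuilding the nested student object from normalized tables means a join, then collapsing the student columns that repeat on every result row.

```python
# Normalized schema: scalar values only, phone numbers in their own table.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE student_phone (
        student_id INTEGER REFERENCES student(id),
        phone TEXT
    );
""")
db.execute("INSERT INTO student VALUES (1, 'Alice', 'alice@example.edu')")
db.executemany("INSERT INTO student_phone VALUES (1, ?)",
               [("555-0001",), ("555-0002",)])

# One join; the student's id/name/email repeat on every row, so the
# application has to throw away the redundant copies while it rebuilds
# the nested object.
rows = db.execute("""
    SELECT s.id, s.name, s.email, p.phone
      FROM student AS s JOIN student_phone AS p ON p.student_id = s.id
     ORDER BY p.phone
""").fetchall()
student = {"id": rows[0][0], "name": rows[0][1], "email": rows[0][2],
           "phones": [r[3] for r in rows]}
print(student["phones"])  # ['555-0001', '555-0002']
```

The object-database pitch was exactly this glue code going away: the record on disk already looks like the nested object.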
Then the 1990s — I'll call these the boring days. There wasn't any sort of radical change to the hardware landscape or the workload landscape, nothing like the advent of the cloud, or the microcomputers of the 1980s. Things were just going along. Hardware is getting better, data sets are getting bigger, but there wasn't dramatic change. I think the four major events would be these. Microsoft licensed the Sybase source code, forked it, and created SQL Server. SQL Server supports T-SQL, which is their variant of SQL that comes from Sybase — Sybase invented T-SQL. At this point, SQL Server is state-of-the-art and Sybase is in maintenance mode, and I don't know how much of the original Sybase code is still in SQL Server these days; they did major rewrites in 1998 and in 2006. MySQL started — there's a guy in Finland who started writing his own database system to replace mSQL, and he called it MySQL. My is the name of his daughter. He also did MariaDB — that's his other daughter. And he has a son named Max, and there's MaxDB. He names all his databases after his kids. Postgres, again, originally started as an academic project. Stonebraker loved QUEL, so in the 1980s, when they first started writing it, it used QUEL. But then in '95, '96, two grad students took the original academic source code and converted it to actually support SQL, and that's why it's called PostgreSQL. Anyone know what programming language Postgres was originally written in, in the 1980s? Take a guess. Lisp. It was the 80s, right? Cocaine or whatever. Then they realized that was a bad idea, so they had a compiler convert the Lisp into C and compiled that. Then that was a bad idea too, so they rewrote everything in C. And then SQLite started in the early 2000s.
It's one dude down in North Carolina who invented it, Richard Hipp. And he's still the main programmer on it today. The one thing that did change in the 90s — I would say not a dramatic change — is that people started to realize: okay, I don't just want to use my database system for transactions and ingesting new data. I also want to start doing analytics on it, start extrapolating new information, making business intelligence decisions. It has a bunch of different names. And all these systems at the time were row stores. As we know, running analytical queries on a row store is highly inefficient. So there was this optimization technique called data cubes — think of it like a materialized view — where you pre-compute these multi-dimensional arrays of different GROUP BYs and aggregations and so forth, and you run your analytics on those. Nobody really uses data cubes today, because column stores are so much faster, but this is how people got by in the 1990s. The big game changer, though, was in the 2000s, when the internet comes along. Prior to this, when you think about it, who had big databases? The big banks, Walmart, right? Only the Fortune 500 companies had big database problems. But when the internet comes along, it doesn't take that much for a small number of people to put something on the internet and have a lot of people start using the application, the website, and suddenly you're generating a lot of traffic, a lot of data, a lot of users. So this was a big change in how people approached databases in the 2000s. But at the time, all the commercial enterprise databases — Oracle, DB2, Sybase — were very heavyweight and very expensive. And the open source databases that we think about today, like PostgreSQL and MySQL, were pretty primitive back then. MySQL didn't support transactions well until InnoDB came along in 2003, 2004.
So what people ended up doing was writing their own custom middleware to route queries to these single-node database instances. They were treating MySQL like a dumb key-value store and putting something in front of it to route queries to different shards. The idea is still widely used today, but back then people were rolling their own. Another thing that happened was, again, as I was saying, it doesn't take much to start collecting a lot of data, and more people started wanting to analyze that data. And they realized that using a general-purpose database system — a row store — to try to do both transactions and analytics was a bad idea. People started building these custom analytical database systems, which again will be the key idea we're focusing on this semester. A lot of these were distributed and shared-nothing. All of them were relational and SQL. Most of them were forks of Postgres. And they would store the database as column stores. It seems like an obvious idea now, because the wins you get from it are so significant, but back then this was unheard of. Well, not unheard of, because the idea is from the 70s, but explicitly storing data as columns — that was novel. That was a game changer. The main systems from this time are listed here. Netezza was a fork of Postgres, but they put an FPGA down in the storage layer to make filters run faster. ParAccel was a fork of Postgres. This is actually what Redshift is: Amazon bought a license to ParAccel, didn't really make any changes, just slapped it up and called it Redshift. What's that? Yeah, this is not a secret — this is public. And so they just threw it up, and it made so much money that they went, oh, now we've got to make this real. And it's been rewritten many times over since.
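To make the row-store-versus-column-store point concrete, here's a toy sketch with plain Python lists (no real storage engine, just the layout idea): an analytical query that touches one attribute only has to scan that one column in a column store, instead of dragging every attribute of every tuple through memory.

```python
# Row store: one complete record after another.
rows = [
    (1, "Acme", 100),
    (2, "Ajax", 75),
    (3, "Apex", 50),
]

# Column store: one array per attribute; values at the same index
# belong to the same logical row.
ids   = [1, 2, 3]
names = ["Acme", "Ajax", "Apex"]
qtys  = [100, 75, 50]

# SELECT SUM(qty): the row store touches every attribute of every tuple...
total_row_store = sum(r[2] for r in rows)
# ...while the column store reads just the qty array, sequentially.
total_col_store = sum(qtys)

assert total_row_store == total_col_store == 225
```

On disk the difference is much bigger than in this toy: the column layout means sequential I/O over only the columns a query needs, plus much better compression, which is where those analytical wins come from.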
Vertica — that's a company started by my advisors, Stonebraker and Stan Zdonik. That's a fork of Postgres. DATAllegro was a sharded version of Ingres. And Greenplum is a fork of Postgres. MonetDB was actually the only one here written from scratch, and that came out of CWI. You've heard of DuckDB? It's the same research group that made MonetDB. DuckDB was originally called MonetDBLite — it was a fork of MonetDB meant to run embedded in R programs. Then they rewrote it, and it ended up being DuckDB. I may have to bleep this; I'm not sure if this is public. Microsoft bought DATAllegro. IBM bought Netezza. But ParAccel never got bought — they ended up getting licensed, and I think the company's dead. Vertica got acquired by HP. MonetDB never got bought. And then Greenplum got bought by... EMC. It's hard to keep track of this. They got bought, and then it got divested off of EMC. EMC had a database piece, VMware had a database piece; they took them out and formed a new company called Pivotal. It was Greenplum plus SQLFire or GemFire. So EMC bought Greenplum, VMware bought SQLFire, they took them out, made Pivotal, and then VMware bought Pivotal. So it's... yeah. Very cool. So DATAllegro got bought by Microsoft. It was a hacked-up, sharded version of Ingres, and I think they paid a s***-ton of money for it. And after they bought it — after they wrote the check — they had their technical people actually look at it, and they said, this is all crap, we can't use any of this, and they threw it all away. They ended up writing the SQL Server Parallel Data Warehouse from scratch instead of using any of that garbage. And there's another one I'm missing: Aster Data. They were bought by Teradata, I think. Right. Okay. All right.
While all this work was happening on these parallel column-store data warehouses, there was this other big trend: the MapReduce systems. Again — it had been about ten years since the object-oriented databases turned out to be a bad idea, so we're due. This thing comes along from Google. They built a custom execution engine with this MapReduce programming model to help them build the index for their web crawl, and they ended up using it for a bunch of other data processing and analytical tasks. So Google put out the paper saying, hey, this is what we're using. This probably still happens now, but Google was really the super hot thing in the 2000s. Anything they did, any paper they put out, people would go re-implement themselves, because they thought, oh, Google's making a ton of money because they have all these custom systems — let's go build our own custom systems too. Like HBase is a clone of Bigtable. Cassandra is a clone of Bigtable and Dynamo. And HDFS — Hadoop was the clone of MapReduce. So Google puts out the MapReduce paper, Yahoo sees it and says, that's a good idea, we can use it too, and they wrote their own open-source version called Hadoop. The basic idea was that you'd write these user-defined functions — a map phase, a shuffle phase, and a reduce phase, which we'll cover actually next class. So instead of using SQL, you'd write these custom functions, submit them to the MapReduce or Hadoop framework, and it would run them for you. But it takes us back to the 1970s, where the programmer had to define what the data model actually was for the data they were processing. There was no SQL on these systems at the time. You had to literally write: I'm going to parse this CSV file, and I expect I'm going to have these columns — and you would write those explicit expectations in the application code, in these functions.
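Here's what that programming model looks like in miniature — a word count, the canonical MapReduce example, with the three phases run in-process. This is only a sketch of the model, not of Hadoop's actual APIs; the names `map_fn` and `reduce_fn` are illustrative.

```python
# Minimal in-process MapReduce sketch: the programmer writes the map and
# reduce UDFs (including how to parse the raw input); the "framework"
# handles grouping keys between them.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # The programmer, not the system, decides how to parse the input.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    yield (key, sum(values))

def run_mapreduce(lines, map_fn, reduce_fn):
    pairs = [kv for line in lines for kv in map_fn(line)]   # map phase
    pairs.sort(key=itemgetter(0))                           # shuffle: group by key
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):    # reduce phase
        out.extend(reduce_fn(key, (v for _, v in group)))
    return out

print(run_mapreduce(["a b a", "b a"], map_fn, reduce_fn))
# [('a', 3), ('b', 2)]
```

Notice there's no schema anywhere except inside the UDFs — exactly the 1970s problem the relational model had solved.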
So this was the hot thing in the late 2000s — everyone said Hadoop is the way to do this. And then Stonebraker and this other guy who invented a lot of the first parallel databases put out an article that said this is a bad idea. Then I wrote a paper with them and showed that it was. And people eventually realized, oh yeah, it turns out the old guys are right — this is a bad idea. So then they tried to put SQL on top of MapReduce. Facebook invented Hive, and there was a thing called MapR-DB out of MapR, which I think is dead now. But it turned out that was super inefficient and super slow, because the way Hadoop was implemented — I don't know exactly how MapR's framework worked — they were storing these checkpoints at every single stage of the query, which was super inefficient. And so all this got thrown away. I mean, Hive still exists, but people realized this was actually a bad idea, and that you want a parallel data warehouse — you want the thing we were showing before. But again, it took ten years for people to realize it was a bad idea. Also related — I'm going to say bad ideas here too — there was this NoSQL movement, again, I think, brought upon by Google in the Bigtable paper. They basically said, hey, the relational model is bad, SQL is too slow for modern web applications, we don't need transactions, we don't need joins, we want to build these systems from scratch. Because you have to understand: if you're building a website, you want it up 24/7, and if you're on a database system that supports transactions but maybe doesn't have backups or replicas for high availability, your whole website can go down, and that's bad — you lose money. So the NoSQL guys basically said, okay, we're going to let the system maybe have corrupt data, or not strictly follow transactional semantics, in exchange for always being up, always being online.
So there's a bunch of systems that got built under this model. And sometimes that's okay. DynamoDB was built by Amazon for the shopping cart. If I put something in my shopping cart and it maybe disappears, is it the end of the world? Probably not. If I put something in my bank account and it disappears, that's a big problem. So all these systems follow looser semantics than what the traditional relational, transactional, strongly consistent database systems would follow, to varying degrees of success. And as I said before, basically everybody supports SQL now except for Redis and, I think, RavenDB. They all have their own version of SQL — it's basically SQL without wanting to say it. Riak is dead, so I don't think they have it. And Oracle NoSQL, I think, is just Berkeley DB underneath the covers. At the same time — or soon after the NoSQL databases got popular — there was another movement called NewSQL. This one I was involved in. The idea was that you would have the same high scalability and performance of a NoSQL database system, but without giving up transactions. Obviously, you can't go faster than the speed of light — if you have machines in different areas of the world, you can't make that go faster — but you can at least make sure that things are correct. So all the systems that came out under the NewSQL movement — with maybe the exception of VoltDB; MemSQL got renamed to SingleStore, and they wouldn't necessarily call themselves a NewSQL system now... Well, Spanner didn't fail, Spanner's still here. Spanner's legit. But for the most part, there's a bunch of these you've never heard of before — TransLattice, certainly nobody's heard of, GenieDB — they all pretty much failed. FoundationDB had a SQL layer that was kind of crap. Apple bought them, threw that away, and has since open-sourced it. But it's not used as a relational database system with SQL now.
What did happen, though, is that while a bunch of those systems didn't pan out, newer transactional database systems are actually getting traction. And these fall under the umbrella of "distributed SQL." Instead of calling it NewSQL, which is sort of vague, you say, oh, it's a distributed SQL system. So TiDB is out of China. CockroachDB has probably raised the most money of all of these. Yugabyte is another startup that's based on Postgres. And Comdb2 was built by, I think, Bloomberg — I have never heard of anybody using Comdb2 outside of Bloomberg, but it is open source. The main takeaway here is that the earlier NewSQL systems were going against the conventional wisdom of the time — the NoSQL systems were hot, and people thought they didn't want the relational model, SQL, or transactions. By the time people realized, oh yeah, I do want SQL, I do want transactions, those systems had failed or didn't pan out, and these newer systems were at the right place at the right time, building off what the early systems had done. All right, finishing up. So we have cloud systems. Again, we'll cover this a lot throughout the semester. Basically, the hardware landscape has changed. People are no longer running on-prem; you're now running in the cloud, right? And that means your resources can be elastic. You don't have to go through this long provisioning cycle of, hey, I want to buy these machines, and procurement, and then it takes a long time to actually get them. With a credit card, you can spin up a new instance very, very quickly. So initially, there were a bunch of these database-as-a-service products or offerings that would just take off-the-shelf MySQL and run it in a VM for you — this was before containers — and they'd charge you for that. But MySQL wasn't really aware that it was running in the cloud. It was just running on some VM.
Since then, there are now systems designed from scratch explicitly for running in the cloud, and we call these cloud-native. Snowflake is probably the most famous of all of these. They explicitly decided from the beginning not to run on-prem — they were only going to run in the cloud — and therefore they could make certain design choices, which we'll cover next week, that take advantage of that. One of the big things that came out of the MapReduce world, plus now the cloud, are these shared-disk systems. Prior to this, the conventional wisdom of how you would build a distributed database system was shared-nothing. But now, with the cloud, where Amazon or whoever is taking care of the storage layer for you, you maybe don't want to build a shared-nothing system. You want to use a shared-disk system and let the cloud vendor handle the storage for you. So there's a bunch of systems now built on this shared-disk approach. This is how everyone's building modern data systems today. When people talk about, oh, I have a data lake — we'll see this when we talk about Databricks — they're basically talking about storing on S3 with a shared-disk model. Every year, I get a complaint about this. So we're at the phase now where — again, the NoSQL movement died out once people realized SQL is the way to go — graph database systems are still coming along. I wouldn't necessarily call these NoSQL systems, but they've definitely become more prominent in the last five or six years. The idea is that instead of storing your database as relations, you store it natively as a graph structure — vertices, edges, and so forth. RDF triple stores, property graphs — it has a bunch of different names. But it's essentially the same thing they were doing back in the 70s with the CODASYL network model.
So the big claim is that because you're storing your database natively as a graph, and because you're exposing a native graph API, you can be much better than a relational database system. We've heard that before. That was the argument for the object-oriented databases; the JSON databases — all these people make the same argument. And sure, there are some times when you do want to traverse your database in a native graph way. But the SQL standard is actually adding support for graph queries this year, and it's based on Cypher, which was invented by Neo4j. So they lose that advantage. Then what about the argument that if I'm storing things natively as a graph, isn't that going to be better than a relational database? Well, no, because the paper that came out last week shows that if you take DuckDB and incorporate some techniques that explicitly help graph workloads, like multi-way joins, which we will cover later in the semester, you can outperform Neo4j by 10x. Neo4j has raised hundreds of millions of dollars. DuckDB is a small team of people in the Netherlands. And they beat them by 10x. And it's a real system, not just a toy. So for graph databases — and we've already seen this — the relational model is absorbing ideas from them, and I don't see these systems replacing relational databases in my lifetime. I should have put a screenshot: I made a public bet in 2020 that if, by 2030, the graph database market exceeds the relational database market, I will change my official CMU directory photo to one of me in a shirt that says "I love graph databases," and I'll use that until I die or get fired from CMU. I don't see it happening. All right, so, quickly finishing up: time series databases. These are newer databases designed to store telemetry or metrics you're collecting from other services, other devices, and so forth. And they're relational.
There's this notion of explicit time and ordering in the data you're generating, and therefore you can design the system to efficiently take advantage of the domain you're working in. You wouldn't want to use these for storing arbitrary data. But if you have this notion of ticks or events showing up with some notion of time, and you want to do range queries based on those times, you can design a system to be more efficient at this. Probably the three main ones would be: TimescaleDB, which is built as extensions on top of Postgres — which is super cool, because you get regular tables in Postgres plus the time series ones. InfluxDB is written from scratch; I think they're on their third rewrite, but they're targeting time series workloads. And ClickHouse is out of Russia. When I first learned about ClickHouse and read the website and all the things they support, there are a lot of techniques we discuss in this class. It seemed unreal — super state-of-the-art, and the performance numbers look amazing. My impression, though, is that it's not easy to get up and running; there's still a lot of manual work you have to do. But I think this one's going to be a big player. And then Prometheus is another big effort in the space. All right, last one: blockchain databases. So, yes, Bitcoin — if you bought Bitcoin in 2010, great, congrats. But people have made claims that blockchain databases, under Web3 or whatever you want to call it, are a radically different way of building modern applications. The old way of provisioning servers and relational databases and SQL and whatever — all that's stupid; you want to build on top of a blockchain database, and it'll solve all the world's problems. At the end of the day, what is a blockchain? It's just a log — they would call it a ledger. It's like a write-ahead log, or a Paxos log, a state machine log.
Here's all the events or things that are happening to the state of the database. And then they have these incremental checksums, where the checksum of a new entry in the log depends on the previous entry in the log. That way, if you fudge anything at a previous entry, the checksums don't match and you know you don't have the true data. The technique was invented with Merkle trees. And then, since you're assuming you're running in a decentralized, distributed environment where you don't trust the people that are reading and writing to the database, you have to use some Byzantine fault tolerant, or BFT, protocol to come to consensus on what the next entry in the log should be. I mean, there are a lot of cool ideas put together in an interesting way, but is this a game changer? No. I've yet to see a use case that anybody has proposed where a blockchain would solve a problem that you could not solve with a traditional SQL database like Postgres — unless you have some external issue you have to resolve through, like, legal matters, right? So I think this is all garbage. These systems here are explicitly doing blockchain under a decentralized model. This is the logo for QLDB. Amazon has the worst logos, because unless you already know what this is — what is it, right? So QLDB is their Quantum Ledger Database. It's not a blockchain database in the sense of being decentralized. Amazon is the trusted authority. You authenticate with Amazon. You don't have to do BFT. I think they just do two-phase commit. But you still get that verifiable ledger with the checksums. As far as I know, this has not gone anywhere. They make way more money reselling MySQL and Postgres as RDS or Aurora. Again, when I first saw these databases, I thought, yeah, maybe there's something here. Now I'm convinced this is all crap. And I'll also say, there's no inherent data model to a blockchain database. It's just entries in the log, just bytes. 
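The incremental checksum idea is simple enough to sketch in a few lines. Here's a minimal illustration (not any particular system's format): each entry's checksum covers its payload plus the previous entry's checksum, so altering any earlier entry breaks every checksum after it:

```python
import hashlib

GENESIS = "0" * 64  # placeholder checksum before the first entry

def chain(payloads):
    """Build a hash-chained log: each checksum covers the payload
    plus the previous checksum, making earlier entries tamper-evident."""
    log, prev = [], GENESIS
    for payload in payloads:
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        log.append((payload, digest))
        prev = digest
    return log

def verify(log):
    """Recompute every checksum; any altered earlier entry fails."""
    prev = GENESIS
    for payload, digest in log:
        if hashlib.sha256((prev + payload).encode()).hexdigest() != digest:
            return False
        prev = digest
    return True

log = chain(["deposit 10", "withdraw 3", "deposit 5"])
print(verify(log))                   # True
log[0] = ("deposit 999", log[0][1])  # fudge an earlier entry
print(verify(log))                   # False: the chain no longer matches
```

That's the entire "ledger" part; everything else in a blockchain system — BFT consensus, incentives, smart contracts — is about deciding who gets to append the next entry when the participants don't trust each other.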
There's some engine above it that has to interpret what those bytes are. So it could be a key-value store, could be a relational database. It doesn't matter. All right, so — that should say 2020, sorry, typo on the slide — there's a bunch of other systems, categories, things we could talk about. Embedded databases like SQLite and RocksDB and DuckDB. Multi-model databases like ArangoDB, where you try to support graphs and relations and documents all in a single system. In the paper you guys read, we talked about hardware acceleration. Basically, it's like the search for El Dorado, the golden city in South America. There's been a search for some kind of hardware acceleration for databases for the last 40, 50 years. It never pans out. People keep trying, but it never works. Well, it's not that it doesn't work — it just never gets adoption in the market, because commodity hardware is always going to win. I don't think FPGAs or GPUs are a big game changer in the space. RISC-V potentially has some interesting stuff, but where you're going to see hardware accelerators for databases is only from the cloud vendors like Google, Amazon, and Microsoft, because at their scale they can justify paying $50 million in development to build new custom hardware — they're going to make so much money and be more efficient for their millions of customers. I think it'd be very hard for an independent hardware vendor to break into that market, because not only do they have to design the chip or whatever the hardware accelerator is, they've then got to go convince some other software company to put it in their database system, and it never happens. We didn't really talk about array, matrix, and vector database systems. Vector databases are the new buzzword now because of machine learning. So there's a bunch of these where you do nearest neighbor search on vectors in the database. 
These have been around for a while, but the vector ones are new. I would say this is the only — as far as I know, for now — the only type of data where you wouldn't actually want to use a relational database system. You want to use a specialized system explicitly designed for vectors, because when you think about it, what do you need? If it's a multi-dimensional array, you've got to traverse it maybe row-wise and column-wise and in different dimensions, and storing that in a table with indexed columns is a bad idea. So I think we'll see whether the market is big enough to justify needing a specialized system. At this point — as I say in the paper — does Amazon or Microsoft or Google offer a vector database as a service? No. They could build one. They have unlimited money, but they don't see there being a large market yet. So I think it's still too early, but this could be the next big thing. And of course, there's a ton of these logos. It's hard to keep track. All right, so what's going to happen in the future? I think right now we're in the golden era of databases, meaning there are so many different choices: open source, commercial, cloud systems, on-prem. SQL is considered the way to go forward right now. Again, that'll change in 10 years, but then we'll be back again. An example would be all the NoSQL systems from 10 years ago — except for Redis, they have either died or they now support SQL, or something that looks like the relational model. So it's like 1 plus 1 equals 2. That basic arithmetic from thousands of years ago stands the test of time because it's the right way to do it. That's how I sort of see the relational model. Now, is SQL the best query language you could have for a relational database system? No. There are obviously lots of problems with it. But at this point, there's so much buy-in, there's so much existing tooling and utilities that assume a SQL database system, it'd be very hard for anybody to change that. 
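To see what the vector workload actually is, here's a brute-force sketch (toy data, purely illustrative) of k-nearest-neighbor search over embedding vectors — the core operation a vector database speeds up with specialized approximate indexes instead of this linear scan:

```python
import math

# Toy embedding vectors; a real system would hold millions of them.
vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

def nearest(query, k=2):
    """Brute-force k-nearest-neighbor by Euclidean distance.
    Vector databases replace this O(n) scan with approximate
    nearest-neighbor indexes built for high-dimensional data."""
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, v)))
    return sorted(vectors, key=lambda name: dist(vectors[name]))[:k]

print(nearest([1.0, 0.05, 0.0]))  # ['doc_a', 'doc_b']
```

A B-tree index on individual columns doesn't help here — similarity is a function of all dimensions at once — which is the argument for specialized systems for this data type.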
There's so much fracturing in SQL itself. How do you come along and say, I have a new query language, and then get everyone to start using it? Yeah, you can build one system that does it, but then nobody else is going to use it. So I don't know where SQL is going to go in the future, other than borrowing ideas from other non-SQL systems, but I see the relational model standing the test of time, in my opinion. There's a reason why Ted Codd won the Turing Award for it. Charles Bachman won the Turing Award for CODASYL first, I guess, but it doesn't matter. All right. So, next class. I've got to go deal with some court s**t in Seattle, so next week will not be in person. I'll post the Monday lecture and Wednesday lecture on YouTube. But we'll kick off talking about modern OLAP database systems. The paper we'll read is about Snowflake — it's from the Snowflake people. And yes, there are some details about Snowflake, but what I really want you to understand is the big idea of running on a shared-disk architecture in the cloud. And then make sure you submit your first reading review before class at two o'clock. Yes? Monday, Monday, right? What did I say? Did I say Tuesday? Make sure you submit your first reading review Monday. Okay? Okay, awesome guys. Thank you. See ya.