 So my name is Andy. I'm the instructor for this course, 15, 445, 15645 Database Systems at Carnegie Mellon University. Unfortunately, I can't be on campus right now. One of the TAs, MF, is having some problems and we're trying to lay low. He's gonna turn himself in to do a bid, probably over one to three. So he probably won't be able to see this semester. So we're just kind of like trying to not cause any problems. And that's why we're sort of hiding out and why I'm not on campus right now. So I'm also here in the back with the one of the TAs, Jorah Rouraj, one of my PG students at CMU. So again, we're just, unfortunately, we can't make it, but we're here to record the video for you guys and get this semester started. So welcome. It's a new semester. It's a new class. So in today's lecture, I'm gonna first talk about the weightless problem, the weightless situation, because that's probably what a lot of you are concerned about. And then we'll do an overview of what the course will be about and logistic of what's expected of you as the student. And then we'll finish off talking about the relational model. So I'm sort of back on information to prepare us for future lectures. And then we'll also talk about homework number one, if we'd be going out today. So for the weightless, currently we have about 45 to 50 people enrolled in the course I checked earlier today. But the weightless is about 130 people. So our max capacity is 90. So unfortunately, we're not going to be able to take everyone. So what we decided to do this year to make it fair to everyone is that we will enroll people off the weighting list in the order that you complete the first homework assignment. So it's the order that you get, you finish the first assignment and you get 100% on all the questions. Get everything correct. Then in that order, we will decide how we pick people off the weighting list. So again, this is just to make it fair for everyone regardless of what college or program or year you're in. We think this is the best way to have people that really want to take the course show that they're actually interested. So I'll talk about what that first assignment is after today's lecture. So this is a new class. It has not been taught at CMU before. And this class is a bit more systems focused. It's supposed to be on the design implementation of database management systems. And actually to be more specific, I would say this course is about the design and implementation of disk oriented relational database management systems. And I'll explain what that is as we go on later on. But I will say that this course may not be what you're actually looking for. So this course is not going to cover how to actually use a database that when you build your application, it's not going to discuss how to actually administer a database, right? Again, it's about the internals of the database management system. So if you rather take a course like that, then I suggest you look at CMU 95703, which is a part of the information systems department in Heinz College. That is again about that sort of higher level. How do you approach your database as a user? But again, we're in this course, it's really about how you actually implement one, both correctly and efficiently. So again, just as to be mindful of this, make sure you have the right pre-rex, it's 15213 or 15513. And make sure that you're actually comfortable writing C++ code in a complex system and debugging it. So for this course, for those sort of high level topics that you would approach at sort of a user level or application level, we'll cover that in the first one or two weeks of the course. So we'll start off talking about relational databases, a relational model, relational calculus, relational algebra. I'll talk a little bit about, you know, advanced SQL queries, but that's pretty much going to be it for the rest of the semester. So we'll be spending the rest of our time talking about actually what goes on inside the database system. So we'll first talk about how the storage manager, how do you read and write pages from disk? How do you store them in a buffer pool in memory? Then we'll talk about how to actually execute queries or the relational operators in your query plan to produce answers, right? Then we'll talk about how you implement a concurrent control protocol or a concurrent control scheme, where you allow transactions or multiple queries that run simultaneously at the same time and still produce the same answer and still not lose any data even if you lose power. And then we'll talk about, you know, extending that, we'll talk about logging and recovery and actually what the mechanism you use to make sure that you can always recover after a crash. And then after that, I'll start talking about sort of alternative database system architectures beyond what we're focusing on in most of the course. So we'll first start off talking about distributed and parallel database system architectures. And then we'll finish off with a section that I like to call the potpourri section where I will cover a bunch of different, you know, open source, commercial, state-of-the-art database management systems. And I'll talk about how they work on the inside. I'll talk about what, you know, what are they good or not good for different types of applications. We'll also have a guest speaker come along and talk about their database management system and just contrast that with the things that we'll cover in the rest of the semester. So in terms of logistics for the course, the course policies and the schedule for the readings and the homework assignments and the projects are all available on the course website, which you must have found if you found this video. So if you want sort of up-to-date information about what's going on in the course and what's expected of you as the student, you should check that out. I would also ask you to look at CMU's academic honesty policy page to make sure that you follow the correct guidelines and don't end up cheating or plagiarizing. So I would say I take this actually very seriously. You know, this may not be, you know, if you're coming from an undergrad program in another school that was inside the U.S., you may not understand the rules that we expect you as a student. So I would say if you don't know, please ask me to make sure that you're in compliance with everything. Because if you do end up cheating or you do end up plagiarizing and I catch you or the TAs catch you, we will have to report you to Warner Hall and then it's a big deal because you have, there's an academic trial and you may be kicked out. So please, please, please, if you don't know whether what you're doing is the right thing to do, please ask me. Because I rather, you know, we have a discussion about what the right thing is for your particular situation rather than me find that you cheated or find that you plagiarized and then I have to report you. So this means that we will check your homework to see whether the same from other people or also run all the code you submit for the programming assignments through the Stanford Moss program and which will check your code against everybody else. And if you catch two people with the same code, then we'll have to bring you in and ask questions. So please don't be stupid. The course website will have the basic assignments and everything, but for the day-to-day announcements and a day-to-day discussion about these assignments will be done on CMU's new Canvas website, which is currently only available to the people that are enrolled in the class, but hopefully we can get everyone that's on the waiting list that wants to be enrolled. You can go on Canvas and see things and that way you can participate as you need. If you want to audit the course, again, let me know and we can add you to Canvas and you can follow along that way. So there is an assigned textbook for this class. We'll be using database system concepts, the sixth edition from Avi Silverchats, Hank North and Sudarshan. I would say that it's hard finding a good textbook for database systems that's out there. There's a couple alternatives, but they're a bit dated. This one is actually somewhat up to date. So in my opinion, this is actually the better one to use. And so for every lecture, there'll be assigned readings to cover all the things that we'll be discussing. But if you want to go into more detail about the things that we cover, then you should look at the textbook. I would say also that there are certain topics that I think are important for understanding in modern database systems that are not covered in this particular textbook. And so in those cases, I will cover them in the lectures and we'll provide lecture notes with links to other readings and other things if you want to go check that out and learn more about it. So in the course, you're greatly comprised of the following five items. So the first will be a homework assignment, so it'll be 15%. Projects will be comprised of 45% of your grade. And for those of you that are undergrads that need to take course that certify the system's requirement, because the grade is more than 40% due to programming assignments, this will count for that. Then there'll be a midterm exam and final exam that are about 20%. And I'll talk about an extra credit assignment that's optional that will allow you to get an extra 10% if you want to do that. So for the homework assignments, there'll be six assignments that we give out throughout the semester. The first one that we're giving out today will be a programming assignment when you program queries in SQL. But all of the rest of them will be pencil paper assignments where you sort of ask questions without doing any programming that look a lot like the questions that are in the textbook. So I would say again, all these homework should be done individually, you should not do this in a group, you're not going to be submitting multiple submissions together. This is meant to be done on your own. For the programming projects, this is the aspect that I'm actually pretty excited about for this course, because this is much different than what other universities are doing and what we used to do in the past here. So for all the programming projects in this course, we'll be based on the SQLite database management system. So if you're not familiar with SQLite, it's the most widely deployed database system in the world. It's in basically everything. It's in every cell phone, it's in every application, every car runs SQLite, every AOLL disk that was mailed out in the 2000s contains SQLite. So it's the most widely deployed database management system. And so what we'll be doing in these programming assignments is that you're going to build your own storage manager from scratch that's going to be embedded inside of SQLite. So SQLite has an, they call the virtual table API that allows you to write your own storage manager and integrate it with the rest of the system. So you don't have to worry about parsing SQL, you don't have to worry about doing any query plans, we're just going to deal with it at the storage management level. So the thing I was stressed is that this is going to be a lot of programming. So you want to make sure that you're comfortable with not only writing but also debugging C++ code. So because we're not going to teach you how to do this, we're assuming that you pick this up in the pre-rex. And in particular, we're going to be using C++11, which is not too different than C++99, but there are some minor differences. So again, you want to make sure that if you're comfortable with doing systems programming at this level, then this is the right course for you. Otherwise, you should maybe consider something else. So if you want to sort of jump ahead and learn a little bit more about how SQLite works, there's actually a video linked here from 2015 when one of the programmers or the co-founder of SQLite came to see you and gave a really great talk that talks about the internals. So that link is on YouTube and go check it out if you want to learn more. So now for both the homework assignments and the programming projects, we're going to allow you to have four slip days throughout the entire semester. And you can use them for either homework or the project, it doesn't matter. And so when you submit your assignment that's late, you have to mark down how many days you're late and how many late days you have left. And then once you run out of late days, we will penalize your submission and deduct 25% of points for every 24 hours that it's late. These are sort of no excuse late days, like you don't have to tell us why you're telling us you're submitting it late. If there is a medical emergency or something else, you have to contact us and let us know ahead of time so that way we can make sure we take care of it. But you have to tell us, right, be upfront about it. And again, I'm very serious about this plagiarism. Please make sure that the homeworks and projects that you submit is your own work. That means that you're not allowed to copy source code from other people, not allowed to copy that to take in the class, or if you find source code that does basically exactly what we're asking you to do, although I think it's unlikely, you just can't take it and copy right in, right? So plagiarism will not be tolerated. So this means that you should not post your project source code on a public GitHub repository because if someone else finds your assignment that you're posting on GitHub, and they copy it and submit it, and we see that the two codes look, the two pieces of code look exactly the same, then you're both going to get in trouble, right? We're not going to say, oh, well, you know, he copied from you or anything like that. If we catch, if we catch their plagiarism, both people will be sort of brought in. So again, if you're not sure what the right thing to do is, please look at CMU's policy on academic integrity and honesty. In the end, if you're not sure whether you're doing it, if you're still really not sure, please come send me a email. We can talk about it. All right, we'll have two exams in the class. The midterm exam will be in class on October 18, which I think is a Wednesday. And there'll be a final exam class at the end of the semester during the final exam period. And since I don't know when the date is, you should not book your plane tickets to go to Acapulco or anywhere else that's warm before we know when that final exam is. I would not allow people to take it early. The final exam day is when everyone has to take it. Of course, you have a medical emergency that's different in contact me, but we will not make any exceptions to do it before or after. So both exams will be closed book, but you'll be allowed to have a one sheet of handwritten notes on equations and other things you want to use for the exam. And it's handwritten, meaning you can't just take the slides and cut and paste them into word or shrink them down to a smaller size. It has to be handwritten. All right, so what we're doing this year that's different than previous years is that we're offering extra credit. So what we're going to do is you can get in a group of three people and you'll pick some database management system that you're interested in learning about and using. And you'll be responsible for getting that database system to run different benchmarks that we have implemented in an open source benchmarking framework that we've been working on. So the idea here is you pick your favorite database system you want to learn about. As long as it supports your JDBC and SQL, then you go make sure the SQL queries work on the database and you can actually get benchmark numbers. And then we'll ask you to write some scripts that either deployed in Docker or some other container to make sure that's reproducible. So we'll give you credits on EC2 to make sure that you can do this. And if you have, if there's multiple people in the group, then each person is responsible for making sure that one benchmark runs. So we'll make sure that we provide information about this late in the semester, but just be mindful that this is something that you can do. And maybe hopefully for some of some of these benchmarks, we can have sort of a leaderboard or bake off to see who can get their database system run the fastest. So I will say that in our benchmarking framework, it already supports Postgres and MySQL and a couple of other sort of, you know, well-known database management systems. So this would be trying to pick more esoteric things that we will cover in the class. And as a last final point for this course, it's been sponsored by NuoDB and they're helping us with the course development and other things. So I'm really grateful for them. They'll be the guest speaker at the end of the class. And I'm happy to talk more about what they do as we go along. Okay. So this class is about databases and particular relational databases, which I'll explain in a second, but it's really about storing and managing and safely supporting databases. So the first thing we need to do is define what a database is. So the definition that I like to use is a database is some organized collection of data that's related together that is designed to model some aspect of the real world. So I mean, like say you're running a bank or you're running an online website or you're running an airline, right? The database contains data that models how that your particular business or organization or whatever it is that you're trying to model actually works. So I'm biased, but I think databases are actually extremely important because they end up being the core component of almost any computer application you can think of. When you think of any website or any like mobile phone app or any cloud service, right? All these things at the lowest level when they actually need to store data is being stored in a database. So that's why it's worth studying. So now to give an example of what a database is, to help you understand what's actually involved in this, I want to use a toy example where we're going to create a database that models a digital music store. So think of like Spotify or the iTunes store. And what we're going to do is we're going to talk about what are the different things we want to store, what data we want to store for those things and how they're going to be related together. So if you think about a music store at a very simple, you know, a simple level is it sort of has three entities that we want to store three things. We want to store about artists, the people that put out music. And then we want to store the albums that those artists release. And then those albums, they contain tracks. So this is essentially the things we need to model in our database. So the typical way you do modeling in a database for a database is called entity relationship diagrams or ER diagrams. And the way to think about this is that we're going to write down in English what the different models or entities in our database should have, and then we'll then model how they are related together. So we start off the beginning, right? We say artists can have names years and years of a started in the country of origin. So this is the information or the data that we need to store for any artists that we want to have in our database. So we would define a new entity, represented by the square here in our model. And then we would have these spokes that come out with these old rules that specify the attributes or values that they can have, right? Name year and country. So we can do the same thing for albums, right? Albums have a name and they have a release year. So they would have a spoke sort of like this. Now again, same thing for tracks, tracks have a name and they have a number. So now we want to model what the relationships are between these entities. So in the first case, we know that an album can have one or more artists. So we would represent it by this diamond here that represents the relationship. So we say an artist has an album or an album has an artist. And we can also mark in this, this relation, this model, the cardinality. In this case, we're here, we're saying it's end to end, because an album can have one or more artists and an artist can have one more albums. So now we can do the same thing about albums and tracks. So we have the same relationship. We say that an album has a track, but now the cardinality is one to end. So one album can have many tracks, but one track can only have one album. So this is essentially what people do when they want to model a database for a new application. They will come up with a diagram like this. And then we'll see in the next class how we can then convert this to the database commands to actually store data in our system. So now we're going to... What's up? I gotta go. I gotta take it off something. You gotta do what? I gotta go. I gotta take it off something. Oh, here. You're going to go take this. All right? Bye, man. Yeah. You good? Good. All right, be safe. Yeah. All right. So to understand how we actually want to store this in our database system, I want to first propose a sort of straw man system using just CSV files, a comma separate of value files. And you'll get an idea of how we can take these empty yards and actually store them in data we want to store in files. And from there, you'll see why we actually want to build a real database payment system. I was rough down here. All right. So what we're going to do is we're going to store... For every entity that was in the model I just showed you, we're going to store its data in a separate file, right? So it's got a CSV file like this. So here we have our schema. We have an artist that has a name, has a year at the country. And we can see in these two files that we have the... Each of those we store to strings that are separated by the commas. In the case of the strings, we enclose them in quotes. And then for the case of year, it's an integer we don't use quotes for those. And the same thing for album. So now anytime our application wants to actually extract data, read data or update these data in these files, it's going to have to open the file up, parse every single line and find where the record is that you're looking for. So let's say that we want to write a query that gets the year that IceCube went solo. So you would write a program that sort of looks something like this, right? Some pseudocode, where you take the file and for every line of that file, your first parse it, which is essentially finding where the splits are on the commas. And then you'll look to see whether the first field in the record equals the string IceCube. And if so, that's the record we're looking for, and we can print out the second offset record, the second value, which is the year, right? So this is essentially what you would do if you wanted to store everything in a system that just had CSV files. So what are the problems with this, right? So there's obviously some performance issues, but there's other things I want you to consider to see why a database management system is a better idea. So the first thing we have to do is with how we may ensure the integrity of the data we're storing, right? So how are we going to make sure that if we have a multiple entries for artists and for each album, or if an artist has multiple albums, how do we make sure that that's always going to be correct? Right? If I go back to my example from here, the Wu Tang Clan put out two albums, right? Their album and then the St. Ice Mixtape. How do I make sure that the string Wu Tang Clan is always going to be the same? Like what happens if I update one record but I don't update another? Then what happens if someone comes along and writes bad code that ends up overwriting the album year with an invalid string, right? The album year has to be, you know, a number, but what if someone stores the string XYZ? How are we going to make sure that that doesn't happen? Or if we start parsing the data file and we come across XYZ, what should we do? Then the next thing that they deal with is how do we actually store in these files if there's multiple artists on a single album? Do we have multiple lines for each entry? Or do we store, you know, an array for each album that has multiple artists? But then how do we make sure that's a CS comma separated, right? So that becomes problematic. Then there's implementation issues that become tricky. So the first obvious one is how do we make sure, how do we find a particular record when we want to do a lookup? So an example that I showed, we were running a for loop and we were parsing every single line until we found the entry that said ice cube. But that means that could be obviously bad because what if the entry that we need is at the bottom of the file and we have to parse every single one. Now, if we only have three artists, that's not a big deal. But if you have a billion artists, then that becomes really expensive. Then you then you have another problem is what happens if we come along and we want to write a new application that uses that same database? That means we basically have to re-implement all the code to parse those files and make sure that's always correct so that we don't corrupt things. But now let's say that the code I showed was in Python, but now we want to write something else in C++. Now we need to make sure that we do an exact copy of that parsing code to make sure that we always end up with the correct ordering of our operations when we read and write lines. But then what happens if now we need to change the schema? I want to add a new column. I got to make sure I change all of the different implementations of the code that parses that database file to make sure that they're correct. So that becomes a big a big software engineering problem. And then lastly, what happens if we have two different threads in our application that are both writing to the same file at the same time? How do you make sure that they don't interfere with each other and we don't end up with corrupted data? So you may say you do simple things like lock the file whenever you do a write, but then that means you can only have one writer at a time and that's maybe a bad idea. Or then you can say, well, maybe what I'll do is I'll split the file up into different sub files. So that way I can take a lock for one file and another thread can take a lock for another file. But now you need to maintain some kind of directory to keep track of where the file it is that you're looking for, if you look for a particular record, right? So that becomes a very problematic. The last issue is having to make sure that all the changes we make to our database are durable, meaning if we are updating a record and we pull the plug on our machine, how do we make sure we come back and our database isn't corrupted? How do we make sure that if we update two records, and we craft before we can update update the first one, then we crash before we update the second one, how do we make sure that we come back and we can roll back the change the first one because that shouldn't have occurred because we didn't finish our all operations. Then what happens if we want to replicate our database across multiple machines to ensure high availability? This is very common in online websites that always need to be up. But now you can say, well, maybe I'll just use a distributed file system, and that way it can be used across different machines. But then now you have the problem we had before where you have two threads or two programs running on different nodes, trying to update the same same files. How do we make sure that that runs quickly and correctly? So the answer to all of these problems is a database management system. So the way I think about a database system, it's the God to the applications, right? It's the omnibid meeting that can make sure your data is not only correct, but also safe. Make sure that your programs can use it without having to make major changes to their code every single time you change what the schema looks like. So the applications don't need to know anything about how the data is actually being stored in the database management system in terms of the physical level. It doesn't know whether it's in memory, it doesn't know whether it's on disk, it doesn't need to know whether it's across multiple machines. If you're using a certain database system that provides the proper extraction level, all of this is hidden from the application, and it allows you as an application programmer to just focus on working, writing your application and not worry about how to store data. So a database management system is often confused with a database, right? When people say they run, you know, the MySQL database, they really means they run the MySQL database management system. So a database management system or DDUS is software that's going to allow an application to store and analyze information that's that's for inside of a particular database. So these things are very important, and that's sort of why we have a course like this. So I'll say that we're going to focus on what I'll call general purpose database management systems that are designed to allow an application programmer to define, create, write queries and update and administer multiple or any kind of database they want. So by general purpose, I mean that the software is explicitly designed to be to manage a database. And it's not embedded inside the application. Oftentimes, you'll see sort of older applications where people end up building sort of the straw man example I showed before with CSV files, where they end up managing the database themselves, or something that looks like a database. So when I say data management system, we're gonna be focusing on ones that are sort of standalone systems that are designed to be to be nothing but a database management system. So these are like the oracles, the Postgreses, the MySQL, SQLite, and all the other things that we'll talk about in this course. So now maybe thinking, well, why, why should I take even take a course about how to build a database management system? And again, I'm biased because databases are my background, the second most important thing in my life. But I'll say that no matter what field you're in, whether it's in computer science or not, you're going to come across databases. And it's very important for you to understand how a database management system works, or in order for you to understand how to use it correctly. So database management systems are used in almost every application you can think of every website, every software system, it's always going to be there. And they're important because the speed in which you access data or can ingest data or really determines what you can or capable of doing in your application. So another way to think about why database systems are important when you think about it is think about the other types of software that are out there, right, other types of application classes, right, database management systems are just considered another application class that runs on top of an operating system. Think about how important they are that actually CMU and SCS would actually dedicate multiple courses to how they're actually implemented, right. There's no course on how to build a map application or things like that. There's courses on how to build a database management system. I think they're that important because they're going to be, they're used in almost everything. So now we can talk a little bit about what kind of things you can do with a database management system at a high level. And so what I like to do is like to categorize different types of database systems based on first what kind of workload they can support, right. So what kind of applications could you run on top of a database system and use it. So the first is probably that's most common is what it's called online transaction processing or all the TP systems. And these are think of these as sort of the front end database systems that are used to ingest new data from the outside world. They're the thing that's interacting with customers. They're the thing interacting with other computers or other systems. And they're taking updates or requests from the outside world and applying to their database. So again, think of like on Amazon storefront, you add things to your cart, you make orders, those are all considered only to be operations. And these are typically you want these done very quickly because you want to be able to process as many customer orders as you can all at once or taking, you know, numbers, you know, stockpile class or purchases all at once. So typically, these operations are very fast. And they only read or write a small amount of data each time. It may be an aggregate, a lot of data, but each individual operation will only maybe access, you know, a small number of tuples in the entire database. So this sort of listing here, it just is sort of a small survey of some of the more major well known, all of the database management systems. So you see things you hear to expect with the oracles, IBM's DB twos, my SQL, Postgres. But then there's a whole bunch of these other systems, some of them are no SQL, some of them are new SQL, some of them are just, you know, regular database systems, that all sort of fall in this category. These are probably the most common database systems because one, they're, they're the somewhat they're easier to build than what I'll show next. But what I'll say is that not all of them are the same and not all of them do sort of things in the same way that we'll talk about in this class. And surprising actually, what's actually kind of cool about this is a lot of them actually, these, these are open source. The next sort of category systems are sort of called online analytical processing database systems or OLAP systems. And these are database systems that are designed to execute complex queries that read a lot of data all at once in, from the, from the database to compute some kind of additional information. So think of this as, this is after you have your ultimate database, collect all this new information from the outside world. And then you want to extrapolate new knowledge from that data to find new things like find me the, what items were most bought in Pittsburgh three days before it snows, things like that. And so these are typically read only, because again, you're trying to compute aggregations and other things like that. So here's a listing of some of the more well known open source or not open source, some of the more well known OLAP database management systems. Few of these are actually open source. So Presto, Stinger and Hive, open source of rest and high rise, the rest of them are all commercial ones. And I would say that typically is not there's not as many as these types of systems, because this is typically harder to actually build. Because you, you know, if you have a lot of data to run sort of complex and OLAP queries on them, is not is not trivial. So now the last category of systems is a newer definition called htap. And this is sort of a combination of the first two. The idea here is that you want to run your transaction processing workload and your analytical workload all inside the single database instance. What happens is you have your OLAP database, take all your updates, and then you stream them out to a OLAP system, sometimes called a data warehouse, and you run all your analytics there. With these newer htap systems, these are designed to be able to take in, you know, run the same transactions at the same speed as you would in an OLAP system, and then do some of the analytics that you normally would only do on a dedicated OLAP system directly on the data as it comes in. So we'll talk a little bit, we won't talk about these systems so much in this class. This will be more the focus on the advanced class in the spring semester 15-7-21. We'll talk a little bit how these things could actually work as we go along. All right, so now in addition to what type of workload or application you would have run on a database management system, we can also talk about what actually the data model is. So the data model is how the database is going to represent data or items for entities that stores inside of it. So the most common data model is the relational data model. This is not the first data model that came around, but since the 1970s, it's probably the most prevalent one. And most of the data systems you could think about are using the relational data model. MySQL, PostgreSQL, Oracle, SQL Server, DB2, these major systems all use a relational data model. The next category of data models are sort of loosely called the no SQL systems. So this would be key values, key value stores, graph databases, document databases, and column families. So these are different than the relational model because they're representing data in a different way. So in a graph database, you represent entities as the nodes and edges. Rather than a relational model, you distort things as records or tuples. So I would say that the no SQL term is a bit blurry. There's a whole bunch of other things that no SQL systems do in terms of how they execute transactions, if at all. And whether they even support SQL or not, you know, it says no SQL. And we won't really discuss them too much. But maybe at the end we will. The next class of data model is for arrays and matrices. And this is more common in machine learning or AI workloads. So you represent entities in terms of multi-mental arrays. And you think if you ever use TensorFlow or Torch, the input you have to give to the machine learning models that you end up doing are all going to be arrays. And then last two are the hierarchical data model and the network data model. I will say that these are considered obsolete. And they're actually pretty rare at this point. So the one of the first database systems that was developed by IBM in the 1960s for the Apollo Moon mission, IMS, uses the hierarchical data model. It's still actually one of the number one best sellers for IBM. But no one would actually choose to use IMS in a current system, right? If you're building the application, you wouldn't use this. The network data model was used in the early 1970s. And there's still maybe a couple systems out there that still use it. But again, it's pretty rare. So unless you get hired to go work on, you know, IMS or code of sale database, you're almost never going to see these. So for this class, we're going to focus on the relational data model. Because again, this is the most widely why they use in different database systems. But it is also in my opinion, the most flexible. And what I mean by that is it actually can represent all of the other data models that I'm showing here, right? So, you know, you can graph database, for example, you could represent nodes and edges in relations just fine. And the only difference that these other data models have is actually in the implementation of the data system for things like the query API. So for example, if you have a graph database using the graph model, they will provide you a query API to do, to do, you know, no traversal to walk the graph. There's no reason you couldn't do that actually in a relational data model with stored procedures and other things embedded in the system. So the relational data model can actually be used to model all of these things. Now, I will say with exception to the only exception to this would be the arrays and matrices. Again, you can model arrays and matrices in the relational model. It's actually really inefficient. You don't want to do this. But for everything else, it's just fine. And that's why again, that's why we're going to focus on this in the course. All right, so now we need to discuss what the relational model is. So the relational model is a, it's a mathematical concept that was developed by a researcher at Ted Codd and IBM in the early 1970s. And it's a way to represent data as unordered sets, where the relation contains a relationship of attributes that represent entities that we showed before in our ER model. Right. So in this example here, we have the artist relation again, it has a name, a year in a country. So the relationship really has to do with the relationship between name, year and country for a single entity, right? The Wu-Tang Clan was founded in 1992 in the USA. That's, that's the relationship. And so we'll say that this instance of this relationship is called a tuple. So a tuple will be a sequence of values that are stored in our relation. So the key thing to be mindful about here is that a relation is a mathematical concept and a tuple is a mathematical concept. And an instance of them that we will have in our database system is typically represented as a table or a tuple will be a record or a row. So the key thing to point out though, is that relation will be unordered and it has to be a set. And that'll be important when we talk about the relational algebra and calculus in the next class. But again, the important thing is that there's no ordering to the extra tuples themselves. So the relational model can provide us with additional integrity constraints that allow us to ensure that our data is always correct across multiple relations or even within a single relation. So the first constraint will be primary keys. And then we'll talk about foreign keys. So a primary key is a way the relation will uniquely identify a single tuple. Because we're a relation as a set, that means we can't have any duplicates. So we need a way to determine whether something is a duplicate of something else. So in our example before, where we have the artist, the problem here is that we actually don't have a good primary key candidate because there's nothing that prevents a two artists from having the same name. Sometimes you see this like there's one band in the US, one band in another country, or they're from different decades, and they end up using the same name. So that can be our primary key for our tuples in this relation, because that's not going to allow us to uniquely identify them. So for this, what we can do is actually inject a synthetic ID called ID, which is just a unique number. And we can use that as a way to uniquely identify tuples in our relation. So now we can say that there could be a routine claim with ID 123 and a routine claim with another ID, and we know that they're actually distinct. So this is actually an example of what we're called usually auto incremented auto generated keys, like to sort of generate a single unique integer value, you can use that as the primary key. So we'll discuss more about how this is implemented late in the semester. But I'll say that this is sort of a standard thing that most database systems support. So in the SQL standard, it's called sequence. And in my SQL, it's called auto increment. One of the things we'll see throughout this course often is that my SQL will have the same the same features as other database systems. But for whatever reason, the SQL syntax or the naming for what they call things is always sort of different, they always have their own. So now we can also have foreign keys. And this is allows to model relationships between different relations. So remember before that in our ER diagram, we had that that that relationship to say an artist can have multiple albums and an album can have multiple artists. So how do we actually want to store this? So if we look here in the the album relation, we see that for the first album enter the Wu Tang, it only has a single artist. So we can distort just that number there without any problems. But in the case of the St. I's mix here, we see that the we don't know how we actually want to store this, right? Because the the values for attributes in a relation has to be scalar, meaning they can't be erased, they can't be objects, they can't be lists, they have to be sort of single scalar values. Now, that's the original definition of relations from the 1970s. We'll see in later database systems, actually, you can have a raise, but for our purposes here, we'll just we'll just deal with this. So since we can't store multiple attributes, we actually have to now store what I'll call sort of a cross reference table that allows us to model the relationship between artists and albums. So we'll create a new relation called artist album, that only has two attributes. It has an artist ID and an album ID. And now we can then remove the artist relation from the from the from the album table. And the way this sort of works is now the album ID points to the album, from the artist album table points to the artist ID, the album ID in the album relation. And the artist ID here points to that album artist ID there. And so what the relational model will do will ensure that any artist ID that exists in the artist album table has to exist in the artist table. And likewise, any album ID in the artist album table has to exist in the album table. Now, I will say that not all database systems enforce this, but typically if you follow the SQL standard, you're supposed to do this. So this prevents you from doing things like inserting an entry into the artist album table with an invalid artist ID, or maybe inserting the string XYZ. And because what will happen is the database system will look and say, well, I don't have an artist ID with that number, and throw an error in front of you from doing that. This is how we're going to ensure that our database is is correct and they maintain the integrity of our of our data. So now the question is, we know how to model our database, we know how to store the inner and relational database and relational model, take our ER diagram and map it into that. Now the question is how do we actually write queries on this? So the relation again, the relational model is a abstract mathematical concept. So there's no standard programming language you could use to write queries on your database. And so typically, though, I will say SQL is the de facto standard that everyone uses, and we'll cover that in later classes. But the example that I have before here with parsing the records and and and finding the thing that we're looking for, this is still actually would be valid on a relational database. If you wanted to do it this way. But typically, if we're gonna, if you want to use SQL, you would look sort of something like this, right, you have a select query, where you go grab the from the artist table, you find the entry where the name equals ice cube in return of the year. So again, we'll cover the SQL a little bit more further in subsequent classes. But I'll just say that the big thing to point out here is that the example at the top is a imperative programming language, meaning you have to tell the data is exactly how to compute the answer that you're looking for. Whereas on the bottom, it's a declarative programming language, where you only tell that the database system, what the answer you want, and it's up for the data system to figure out how to compute it. So the relational model does not specify exactly what programming language you want to use, but it does specify an algebra and a calculus that you then use to execute queries or write queries against relations. So in the next class, we will cover this in more detail. And on top of that, you could build SQL, you could build another programming language, it doesn't actually matter SQL is just one implementation of a query language that uses the relational algebra. All right, so to conclude, I will say that databases are ubiquitous, databases are everywhere. And I think they're an important class of applications that are worth discussing, which is why I think you should take this course. I would say that relational databases are the most common data model that's used, because in my opinion, it's the most flexible. Now, the relational data model does have some problems, as we'll see as we go along, such when we talk about normalization. But in my opinion, the modern implementations of that allows you to overcome it in some ways. Okay, so to finish off, as promised, the first homework assignment is going out today. And what it will be is we're asking you to write a bunch of SQL queries on some some court judicial court data we're going to provide you. So I'm originally from Maryland. Joy is originally from from NYC in Brooklyn. But we decided to pick Maryland, because I have some friends back there, and was actually from from there from DC, that has some trouble with the law. So he's actually in the database. So we're going to give you some data we've collected from the Maryland judicial system. And we're going to ask you to write different queries about extrapolating new information from this. So everything will be done in SQL light will provide you a SQL database, you can just load in SQL light on your laptop or range machine. And you can do the assignment that way. So I will say that in the upcoming class when we cover SQL, I will not be teaching basic SQL. There's enough tutorials online, or you can read the textbook on how to do this, I rather focus, you know, spend our time focusing on how to do the sort of advanced SQL operations. So some of the things will be in the homework assignment will be pretty basic. And you can learn how to do that from the textbook. And I'll teach you how to do the more complicated things later. So this we do on Wednesday, September 13 at midnight. The link is there to go skip the full documentation of what's expected you for the first homework assignment, as well as how to download the the SQL light database to use for testing, as well as how to submit it on auto lab. So if you have any any questions about this assignment, please use canvas to post post them. So that way, if someone has the same question, they can also get the same help. And then if you have any questions about the course in general, send me an email and I will respond. So I'm very excited for this new new this new class. I'm excited about the project I'm excited about the homeworks. And hopefully you guys will enjoy it and hopefully get a lot out of it. So with that, this is the end of the first lecture, and I'll be posting the second lecture in a couple of days. Thanks