 So, by this point, we've actually now covered all the core concepts of querying data, but what about actually adding data into the database or updating data, manipulating data? Well, when it comes to adding a row for a table, this is called an insert operation. And unsurprisingly, if you want to insert a new row, you have to provide a value for each column. So, here in our cats table, we're inserting a new row and we're providing a name value for mittens, a lives value two, and a hair length short. And now we have a cat mittens with two lives and short hair. When it comes time to remove rows from a table to delete them, what we do is we specify a predicate, a predicate used to filter rows to select rows and then delete them. So, here in our planets table, if we delete rows by specifying a predicate of distance from sun greater than 100 million miles, that's effectively specifying Uranus, Neptune, Jupiter, Saturn, and Mars, and then deleting them. If we want to change data in existing rows, this is called an update operation. And first, we specify which rows we want to update with a predicate. And of course, we specify which column or columns we wish to update with a new value and what those new values should be. So here we're updating simply the column lives with a value of five in all rows where the name value is princess. So whatever number of lives these two rows with the name value princess had before, now both of those rows have the value of five in the lives column. In our database schemas, when we sort out what tables we need and what columns they should have, we generally designate for each table a primary key. The primary key of a table is the column or set of columns, the values of which are used to uniquely identify each row. In this example table here, it makes most sense to designate the make and color columns together as being the primary key because presumably you wouldn't have two rows where the make and the color are the same. It wouldn't make sense, for example, to have two rows where the make in both is Toyota and the color in both is blue because there's only one quantity of cars for each combination of make and color. In real world practice, however, it's by far most common to give each table its own ID column which has an arbitrary integer in it, a unique integer, and it's these ID columns which are most commonly used for primary keys. In this example here, you'll notice there's really no relationship between the ID value and the content of that row. It's really just totally arbitrary what the ID value is. The only important thing is that it stays constant for that row. To facilitate this use of integer IDs as primary keys, most databases have a feature whereby you could declare a column, especially as a primary key column, and the database will actually auto-generate an ID number for that row. So you, when you insert a new row into the table, you don't have to specify an ID. In fact, you shouldn't. The database itself will actually automatically create an ID number for that row for you. This makes a lot of sense because it's then the responsibility of the database to ensure that ID is actually unique, that it's not already being used by any other row in that table. What's called a foreign key is simply a reference to a primary key of another table. So here, for example, we have a table on the left of colors and a table on the right of people. And the color table has an ID column designated as that table's primary key. Again, the values in these ID columns, when we add a new color, they're generally just auto-generated by the database itself. For simplicity here, I've simply have chosen values starting from one down to five, just because it keeps the numbers simple. In a real database, when you insert rows, they might start numbered from one. They might start numbered from zero. They might start numbered from any random number, actually. So the IDs that get auto-generated aren't necessarily small numbers starting from one. But in any case, looking over at the table of people, we have the name of each person and then we have their favorite color. And the favorite color column is a foreign key because it's simply a recorded value that references the primary ID of another table, the color table. So according to the data in this table, the person named Alan, their favorite color is orange. And same for Chelsea, their favorite color is also orange. Sam, though, his favorite color is the one with ID 2, green. Now, the thing to note here is that primary keys and foreign keys are really what form the relationships between tables. In fact, the most common use of joins is to take two tables where one has a foreign key pointing to the primary key of the other. And once we take the cross-joint of those two tables, we then filter for the rows where the foreign key value of the one matches the primary key value of the other. And then we'd end up with this table, which is what we'd want when we wanted to query and find out, hey, that favorite color of Alan's, what's the actual name of that color? Because if we just query the person table itself, that only gives us the IDs of the colors. We want the actual name of the color, which is contained in the color table. Now, of course, the objection here is, well, why did we put the names of the colors in a separate table to begin with? Why not just stuff them into the person table such that the person table has a column that simply reads color? That'd be far simpler. Well, in a trivial example like this, doing that would have probably made more sense. But as you'll see when we get into more complicated schemas, more complicated database designs, such thinking tends to lead you astray. In fact, as we'll discuss in a moment, there are formal rules that have been devised for how exactly you should split your data up into multiple tables in your relational database. Before getting into those more formal rules, though, there are first two simple guidelines which are very important to keep in mind. And those two guidelines concern how to best represent relationships which are called one-to-many relationships and what are called many-to-many relationships. First a one-to-many relationship is like the one we just saw. And for any one color, there are many people who might have that as their favorite color. So one color, many people. And the way this gets expressed is that we have the two separate tables, one for the colors, one for the people, and the people table has a foreign key pointing to the primary key of the colors. Or for another example, one person might make multiple orders. So if you have a table of people and you have a table of orders, the orders table has a foreign key referencing the primary key of the person's table. Now the best approach for many-to-many relationships is less obvious because it involves the introduction of a third table. So here, for example, we have authors and we have books. And each author might have multiple books, but each book likewise might have multiple authors. The only way to properly express this is by introducing a third table which matches up authors with books. And this third table you will note actually has two foreign keys. One for author IDs and one for book IDs. And having more than one foreign key is actually perfectly fine. Foreign keys are not like primary keys. A table always has just one primary key, but it can have as many foreign keys as are necessary. Another kind of many-to-many relationship is when you have one kind of entity and you want to have multiple relationships between any pairs of those entities. Here for example, we have a table of people and we want to express the friendship relationships between those people. We can't do that in the person table itself because each person has potentially multiple friends. So again to express this many-to-many relationship, we have to introduce another table just for that purpose. Now I should say it's actually kind of hard to identify the scenario as a many-to-many relationship scenario. It's a bit deceptive because you think, well, one person has multiple friends. That's one person, many friends, right? Well, the people that you are friends with themselves might likewise have multiple friends. So it's actually many-to-many relationship, not a one-to-many relationship. It's pretty confusing in that regard. But any case, looking at our other table of relationships, the other thing that makes this odd is that friendship we usually think of as a two-way relationship, it goes both ways. So when we say that person one has friend with ID two, well then we also say that person two is friends with person one. If I'm friends with you, presumably you're friends with me, presumably. Now it's actually debatable whether you should have to express these relationships twice. You could establish a convention, a rule for your database whereby the friendship is only expressed in one direction, maybe you'd have a rule deciding which way you would express that relationship, that two-way relationship. It's actually a bit of a dilemma. The downside of expressing the relationships both ways is that you have, for one relationship, two rows. And so if you want to update, modify, or remove some existing relationship, you'd have to make sure to do so in pairs. You'd have to make sure to remove two rows at a time, the two corresponding rows. On the other hand, if you chose to express these two-way relationships just by using one row such that say you'd have person one with friend two but you'd omit the row person two friend one, you could choose to do that. It would likely though complicate any query you wanted to do, like you wanted to find out everyone who's friends with this certain person, you'd have to do a lot more work to figure that out if you only expressed the relationships in one direction. The takeaway from this is that the relational model is sometimes kind of an awkward fit for expressing certain things. In time though, with some practice, you'll find that you'll adjust and you'll be able to fit most everything pretty easily into the relational model and you'll find tricks to work around the other cases. The formal rules I mentioned earlier that we used to devise the structure of our database, the schema for our database, these are called the rules of normalization. The primary point of the normalization process is to minimize any redundancies. That is to ensure that when we insert new data into the database, we're only inserting it into one place, not multiple places. Not only do such redundancies lead to a waste of storage space and to increased complexity in your queries. They also tend to lead to consistency problems because if you record same data in multiple locations, that means when you want to update that data, delete that data, you have to ensure to do it in all of those places all at once every time. If you're not careful or if something goes wrong, you end up with data which has been half updated, so you have conflicting values in your database, which is generally not a good thing. The whole notion of normalization was actually originally devised by Edgar Codd himself, the person who invented relational databases, and in fact normalization was sort of an integral concept to the whole thing. Very shortly after he introduced the relational model itself, Codd also introduced the first three of what are called the normal forms. There's the first normal form, the second normal form, third normal form, fourth normal form, fifth normal form, and sixth normal form. Originally, Codd himself only introduced the first three. The fourth and fifth were introduced by other people later in the 70s, and then the rule of the sixth normal form was actually devised and introduced actually quite recently in about 2003, I believe. And the general gist of these normal forms is that they are progressive. So when you normalize your database, if you attain the level of the first normal form, you then from there can proceed to the second normal form, the third normal form, the fourth normal form, and so forth. So if your database conforms to say the third normal form then implicitly it also conforms to the second normal form and the first normal form. Now there's actually kind of a reason why the first three normal forms were introduced very early on and then the fourth and fifth came later and then the sixth way later. And that is that in practice, fourth normal form, fifth normal form, and sixth normal form are just not nearly as critical as normalizing your database up to the third normal form. Third normal form is actually what most people tend to aim for when they normalize their database. Most people don't fret too much about going beyond that. And actually it works out that in the large majority of cases, if you've normalized your database to the third normal form, it works out in most cases that the database most likely also conforms to fourth and fifth normal form without even trying to be clear. It is possible to conform to third normal form but not fourth or fifth normal form. It's just not likely, it doesn't come up in most situations really. So with that all in mind, we're only going to discuss the first three normal forms because they, as I said, are really the ones really relevant to most people. Before we get to those, Codd himself actually summed up the purpose of normalization pretty succinctly. First off, the point is to free the collection of relations from undesirable insertion, update and deletion dependencies. Again, relation is just Codd's term for what we usually call tables. And what he means by dependency here is a scenario where when you modify or insert or delete something from one table, you need to, for the sake of consistency, perform some other action on another table. Those are dependencies between tables. And while it's true that in some scenarios, such relationships really are unavoidable, they're really necessary, one point of normalization is to try and minimize those dependencies, because, well, the fewer the better. Codd also says that the point of normalization is to reduce the need for restructuring the collection of relations as new types of data are introduced and thus increase the lifespan of application programs. What he's saying here is that ideally when we add new tables to a database, it shouldn't require having to update existing tables or modify their structure. That'd be quite bad and intrusive, right? It's really important to have the flexibility to add new tables as easily as possible. In bullet point three here, he says that the point of normalization is to make the relational model more informative to users. And what I think he means thereby informative is that if somewhere to look at the schema of your database, look at the design of the tables, it should be quite evident looking at any one table what its purpose is. What you tend to get in an improperly designed database schema is a bunch of tables that have sort of nebulous roles. You're not really clear on what purpose those tables serve or how they relate to the other tables. Lastly, Codd says that normalization is also for the purpose of making the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by. That's a bit tricky to parse, but what he means first by the relations being neutral to the query statistics, he means that the database shouldn't make it unduly hard to write certain queries because of how the database is structured. And not only should those queries be harder for a human author to write, they shouldn't be especially more onerous for the database itself to execute. But of course, some queries you write are going to be naturally more complex than others, just because maybe they involve many more tables. And they're also potentially going to be more resource incentive. They're going to take longer for the database to process simply because they involve more data and more work for the database to do. However, Codd is basically saying here that normalization will prevent scenarios where cases which should be simple and which should run fast aren't going to be artificially hard to write or artificially overly taxing on the database. That's basically what he's saying. So now let's actually look at the first of these normal forms. The first normal form is the easiest to state, certainly. It's usually just expressed as no repeating groups. Tables should not have repeating groups. The only difficult part about this first normal form is an understanding, well, what the hell is a repeating group? And there's a little bit of a disagreement among competing interpretations of this first normal form. But at least what people agree on is that it means that there should be no ordering among the rows or columns in your table. And in practice what that means is that when you insert data into your database row by row, you the user of the database understand that the database is not maintaining that order. And so when you do a query later on, you're not going to get back the data in the same order. That's in practice really what that means. No repeating groups means that you also shouldn't have duplicate rows. And in practice really what that means is every table should have a primary key. As long as you have a primary key for every table, you're not going to have duplicate rows. No repeating groups is also interpreted as meaning that for any column, that column only really expresses one attribute. You shouldn't try and surreptitiously stuff multiple attributes into one. You don't want multiple values per column. And similarly, for every value in a cell, that is every intersection of every row and every column, each cell of data, it should never have more than one value in any cell, which is kind of like saying the same thing that every column should just express one attribute. So the first normal form is really not hard to understand, even if there are maybe some competing interpretations for more subtle cases. But in general, what I interpreted as meaning is just make sure your tables are really tables. No ordering of the rows, no ordering of the columns, one value per cell. And that's pretty much all it is. The second normal form is maybe a bit trickier. It's certainly more of a mouthful to say. No non-prime attribute is dependent on any proper subset of any candidate key. And to unpack that, we first have to define some of those terms. First off, what's called a super key is just any combination of attributes of columns. It's any set of columns that could be used to uniquely identify each row. So in this example table here, there are a couple super keys. The in combination of skill and employee, that's a super key. The skill and current work location would be a super key. Or the combination of all three together would be another super key. What are called candidate keys, however, are a subset of the super keys. The candidate keys are the super keys which don't have any so-called extraneous information in them. It's a minimal super key is one way to think of it. Another way to put it is that a candidate key would be a valid selection as the primary key for the table. And in this example here, there's really only one candidate key. And that is the combination of employee and skill. So once again, candidate key simply means a set of keys which could validly be selected as the primary key for that table. Now, what a proper subset is, if you forget, it's a subset which is not equal to the thing it is a subset of. So again, our candidate key here is employee and skill together. The two proper subsets of that are the employee column by itself and the skill column by itself. The combination of employee and skill together, that is a subset of itself, of employee and skill, but it's not a proper subset. So that's all what proper subset means. And finally, the prime attribute is a column which is part of a candidate key. So a non-prime attribute here is current work location because it doesn't belong to any candidate key. So to ask if this table is conformant to the second normal form, we ask is the current work location column dependent upon either the skill column or the employee column. And the answer is that the current work location is dependent upon employee because presumably it's something that corresponds to the employee. One employee has one current work location, right? So this is no good. We have a non-prime attribute which is dependent upon a proper subset of a candidate key. The most sensible way of correcting this example is to split it into two tables, one for employees and a second table just expressing the relationship between employees and their skills. This is actually an example of a many-to-many relationship. For each employee, there are potentially multiple skills which that employee has and for each skill, there are potentially multiple employees which have that skill. So if you just went by that guideline, you would very quickly see there's a problem with the original table because one single table cannot express a many-to-many relationship. You have to bring in another table for that purpose. And actually, if we recognize that this is a many-to-many relationship, we should see here that there's still really a problem and that is that the skills themselves should have their own separate table. And then this table expressing the relationships between the two would be between the primary keys for the employee and the primary keys for the skills. Well, the arrangement here might seem OK. It's going to become problematic as soon as we decide that skills themselves are something we want to provide more information about. As soon as we decide that a skill is something which is its own entity worthy of having its own attributes, that's when we'll want skills themselves to be in their own table. In a table in third normal form, every non-prime attribute is directly dependent upon every super key. And remember, super key means any set of columns which together uniquely identifies a row. And non-prime attributes are those attributes, those columns which aren't part of any candidate key in the table. So the question is, what does dependent mean here? And actually, there's a more popular, easier to remember formulation for the third normal form that makes this pretty clear. And the way this goes is every non-prime attribute provides a fact about the key, the whole key, and nothing but the key. And looking at an example here which violates this rule, we have this table of tournaments and their winners, which right off the bat, just saying that tournaments and winners should raise alarm bells because tables should be about one sort of thing, not multiple things. But anyway, here the primary key of the table is the combination of tournament and year, assuming of course that one individual tournament is only held once a year. And so what we need to ask about the columns winner and winner date of birth here, the non-prime attributes is, are they providing a fact about the key, the whole key, and nothing but the key? And in the case of the winner column, that one does check out, but winner date of birth, again, right there in the name of the column that's telling you that something's up here because it's winner date of birth, not just date of birth, it's the date of birth of the winner, not of the tournament. So it's not a fact about the key and nothing but the key, it's really a fact about the winner. So the dependency here between winner date of birth and tournament and year, it's not a direct dependency here, it's a transit of dependency where winner date of birth is really dependent upon winner, and winner is what's dependent upon tournament and year. So the solution here, as basically always, is to take this table and just split it into more than one. And so more sensibly, we'd have a table for tournaments and a table for winners. And the winner table has a ID primary key, and the tournament table simply has a foreign key referencing the primary key of the winner table. So that's an example of third normal form. And again, understand that if you're in third normal form, then by definition, you're also in second and first normal forms. And if you want to sum up all these rules into just one common sense slogan, that is to always ask yourself about any table, does this column really belong to this table? And if you just think in those terms, you'll usually catch most of your problems. Just going by that guideline will usually get you most of the way there to third normal form. Finally, one last thing to understand about normalization is that more normalized isn't always better. It turns out there are actually cases where one would actually want to deliberately denormalize their database, that is to take their schema, which is properly normalized, and to actually make it less normalized, effectively to add more redundancy into your tables and your schema. The reason you might do this is for the sake of performance. Having redundant data sometimes can allow for more efficient queries, depending upon what exactly your use cases are. So in practice, we really have this trade-off. If you want to have a very clean database with everything properly organized, then you maximize normalization. If you realize that, wait a minute, there are these few particular cases where we really need to have this data also in this column because trying to join it together with all these other tables and get our data that way, that's just too efficient for something that we're doing maybe many times a minute or something. And so in those cases, you actually deliberately denormalize your database and add in some redundant data. So denormalization is a tool, basically, of optimization. And database optimization is a whole subject unto itself that we're just not going to get into here.